Class GraphGroup

Inheritance Relationships

Derived Types

Class Documentation

class GraphGroup

Base class for managing the training process across one GPU, multiple GPUs, or multiple machines each with multiple GPUs.

Subclassed by marian::AsyncGraphGroup, marian::SingletonGraph, marian::SyncGraphGroup
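
A typical training driver constructs one of the concrete subclasses and then drives it through load(), repeated update() calls, save(), and finalize(). The following standalone sketch mirrors that call sequence; GraphGroupLike, SyncLike, and Batch are stand-ins for illustration, not Marian's actual types:

    // Standalone sketch (not Marian code): mirrors the GraphGroup call
    // sequence -- load, repeated update, save, finalize.
    #include <iostream>
    #include <memory>
    #include <vector>

    struct Batch {};  // stand-in for marian::data::Batch

    class GraphGroupLike {
    public:
      virtual ~GraphGroupLike() = default;
      void load()  { std::cout << "restoring checkpoint (if any)\n"; }
      void save(bool isFinal = false) {
        std::cout << (isFinal ? "final save\n" : "periodic save\n");
      }
      virtual void update(const Batch& batch) = 0;  // one training step
      void finalize() { std::cout << "shutting down communicators\n"; }
    };

    class SyncLike : public GraphGroupLike {  // stand-in for SyncGraphGroup
    public:
      void update(const Batch&) override {
        std::cout << "forward/backward + synchronized optimizer step\n";
      }
    };

    int main() {
      std::unique_ptr<GraphGroupLike> trainer = std::make_unique<SyncLike>();
      trainer->load();                 // resume from a checkpoint if present
      std::vector<Batch> batches(3);
      for (const auto& b : batches)
        trainer->update(b);            // update() is the core training step
      trainer->save(/*isFinal=*/true);
      trainer->finalize();
    }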

Public Functions

GraphGroup(Ptr<Options> options, Ptr<IMPIWrapper> mpi)
GraphGroup(Ptr<Options> options)
void initGraphsAndOpts()
virtual ~GraphGroup()
virtual void update(Ptr<data::Batch> batch) = 0
void increaseCostScaleFactor()
void decreaseCostScaleFactor()
void load()
void save(bool isFinal = false)
void swapWithSmoothed()
bool isMainProcess() const
void barrier() const
void validate()
void finalize()
virtual void setScheduler(Ptr<Scheduler> scheduler) = 0
float checkNanOrNorm(size_t i, size_t begin, size_t end)
float executeAndCollectNorm(const std::function<float(size_t, size_t, size_t)> &task)
float computeNormalizationFactor(float gNorm, size_t updateTrgWords)

This function computes a normalization factor that is applied to the gradient before an update.

Depending on various settings, this will return a normalizer that can perform a combination of the following (a sketch combining them follows the list):

  • apply a cost scaling factor if cost scaling is enabled

  • normalize the gradient by the number of words in a batch if requested (turning ce-sum into ce-mean). @TODO: once fp16 stability issues are proven not to be caused by this, remove.

  • re-scale the gradient based on a dynamic running average of gradient norms
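
The sketch below combines those three effects into a single divisor for the gradient. NormState and the exact update rule are illustrative assumptions, not Marian's actual implementation of computeNormalizationFactor():

    // Illustrative sketch only; names and the precise policy are made up.
    #include <cstddef>

    struct NormState {
      bool   costScaling       = false;
      float  costScalingFactor = 1.f;    // gradients were multiplied by this
      bool   normalizeByWords  = false;  // turn ce-sum into ce-mean
      bool   dynamicScaling    = false;
      float  dynamicFactor     = 2.f;    // clip at factor * running average
      float  runningAvgNorm    = 0.f;    // running average of gradient norms
      float  decay             = 0.999f;
    };

    // Returns the divisor applied to the raw gradient before the update.
    float normalizationFactor(NormState& s, float gNorm, std::size_t trgWords) {
      float normalizer = 1.f;
      if (s.costScaling)
        normalizer *= s.costScalingFactor;   // undo the loss scaling
      if (s.normalizeByWords)
        normalizer *= (float)trgWords;       // ce-sum -> ce-mean
      if (s.dynamicScaling) {
        float unscaled = gNorm / normalizer; // norm of the effective gradient
        if (s.runningAvgNorm > 0.f &&
            unscaled > s.dynamicFactor * s.runningAvgNorm)
          normalizer *= unscaled / (s.dynamicFactor * s.runningAvgNorm);
        s.runningAvgNorm =
            s.decay * s.runningAvgNorm + (1.f - s.decay) * unscaled;
      }
      return normalizer;
    }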

Ptr<data::BatchStats> collectStats(Ptr<ExpressionGraph> graph, Ptr<models::ICriterionFunction> model, const std::vector<Ptr<Vocab>> &vocabs, double multiplier = 1.)

Determine the maximal batch size that can fit into the given workspace so that reallocation does not happen.

The batch size is instead adjusted based on the statistics collected here. Activated with --mini-batch-fit. In a multi-GPU scenario, the first GPU is used to determine the size; the actual allowed size is then obtained by multiplying it by the number of devices, which is passed in as the ‘multiplier’.
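
A hypothetical sketch of the --mini-batch-fit idea: binary-search the largest batch size per sentence length that fits on the first GPU, then scale by the device count. fitsInWorkspace() stands in for actually building the graph and measuring memory; none of these names are Marian's:

    // Hypothetical sketch of --mini-batch-fit, not Marian's collectStats().
    #include <cstddef>
    #include <cstdio>
    #include <map>

    bool fitsInWorkspace(std::size_t sentenceLength, std::size_t batchSize) {
      // stand-in: pretend memory use is length * batchSize, budget is 4096
      return sentenceLength * batchSize <= 4096;
    }

    std::map<std::size_t, std::size_t>
    collectStatsSketch(std::size_t maxLength, double multiplier) {
      std::map<std::size_t, std::size_t> maxBatch;  // length -> max batch size
      for (std::size_t len = 1; len <= maxLength; ++len) {
        // binary search for the largest fitting batch size on the first GPU
        std::size_t lo = 1, hi = 65536;
        while (lo < hi) {
          std::size_t mid = (lo + hi + 1) / 2;
          if (fitsInWorkspace(len, mid)) lo = mid; else hi = mid - 1;
        }
        // scale by the number of devices, passed in as 'multiplier'
        maxBatch[len] = (std::size_t)(lo * multiplier);
      }
      return maxBatch;
    }

    int main() {
      auto stats = collectStatsSketch(/*maxLength=*/64, /*multiplier=*/4.0);
      std::printf("len 32 -> max batch %zu\n", stats[32]);
    }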

virtual Ptr<data::BatchStats> collectStats(const std::vector<Ptr<Vocab>> &vocabs) = 0
void setTypicalTrgBatchWords(size_t typicalTrgBatchWords)
double getTypicalTrgBatchWords()
void updateAverageTrgBatchWords(size_t trgBatchWords)

Protected Functions

size_t numberOfInputFiles()

Protected Attributes

Ptr<Options> options_
Ptr<ICommunicator> comm_
Ptr<IMPIWrapper> mpi_
std::vector<DeviceId> devices_
ShardingMode shardingMode_ = {ShardingMode::global}
std::vector<Ptr<ExpressionGraph>> graphs_
std::vector<Ptr<models::ICriterionFunction>> models_
std::vector<Ptr<OptimizerBase>> optimizerShards_
Ptr<Scheduler> scheduler_
bool finalized_ = {false}
double typicalTrgBatchWords_ = {0}
bool mbRoundUp_ = {true}
bool costScaling_ = {false}
float costScalingFactor_ = {1.f}
size_t costScalingFreq_ = {2000}
float costScalingMultiplier_ = {2.f}
float costScalingFactorMinimum_ = {1.f}
size_t noNanSeen_ = {0}
size_t nanSeen_ = {0}
bool checkGradientNan_ = {false}
bool dynamicGradientScaling_ = {false}
float dynamicGradientScalingFactor_ = {2.f}
bool dynamicGradientScalingUseLogs_ = {false}
size_t dynamicGradientScalingFadeout_ = {0ul}
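
The costScaling* and *NanSeen_ attributes suggest the standard dynamic loss-scaling scheme used for fp16 training: grow the factor after costScalingFreq_ consecutive clean updates, and shrink it (down to costScalingFactorMinimum_) when a NaN/Inf is seen, cf. increaseCostScaleFactor() and decreaseCostScaleFactor() above. The sketch below is a hedged reading of that scheme, not Marian's exact policy:

    // Hedged sketch of dynamic cost scaling; the exact policy may differ.
    #include <cstddef>
    #include <cstdio>

    struct CostScaler {
      float       factor     = 1.f;   // costScalingFactor_
      float       multiplier = 2.f;   // costScalingMultiplier_
      float       minimum    = 1.f;   // costScalingFactorMinimum_
      std::size_t freq       = 2000;  // costScalingFreq_
      std::size_t noNanSeen  = 0;     // updates since the last NaN/Inf
      std::size_t nanSeen    = 0;

      void onUpdate(bool sawNan) {
        if (sawNan) {
          ++nanSeen;
          noNanSeen = 0;
          factor /= multiplier;             // back off; this update is skipped
          if (factor < minimum) factor = minimum;
        } else {
          ++noNanSeen;
          if (noNanSeen % freq == 0)        // stable for 'freq' updates:
            factor *= multiplier;           // grow the scale factor
        }
      }
    };

    int main() {
      CostScaler s;
      for (std::size_t step = 0; step <= 4000; ++step)
        s.onUpdate(/*sawNan=*/step == 1000);  // simulate one overflow
      std::printf("final scale factor: %g\n", s.factor);
    }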