Marian: Fast Neural Machine Translation in C++
Version: v1.12.0 65bf82f 2023-02-21 09:56:29 -0800
Usage: ./marian [OPTIONS]
-h,--help Print this help message and exit
--version Print the version number and exit
--authors Print list of authors and exit
--cite Print citation and exit
--build-info TEXT Print CMake build options and exit. Set to 'all' to print
advanced options
-c,--config VECTOR ... Configuration file(s). If multiple, later overrides earlier
-w,--workspace INT=2048 Preallocate arg MB of workspace. Negative `--workspace -N`
value allocates workspace as total available GPU memory
minus N megabytes.
--log TEXT Log training process information to file given by arg
--log-level TEXT=info Set verbosity level of logging: trace, debug, info, warn,
err(or), critical, off
--log-time-zone TEXT Set time zone for the date shown in logging messages
--quiet Suppress all logging to stderr. Logging to files still works
--quiet-translation Suppress logging for translation
--seed UINT Seed for all random number generators. 0 means initialize
randomly
--check-nan Check for NaNs or Infs in forward and backward pass. Will
abort when found. This is a diagnostic option that will
slow down computation significantly
--interpolate-env-vars Allow the use of environment variables in paths, of the form
${VAR_NAME}
--relative-paths All paths are relative to the config file location
--dump-config TEXT Dump current (modified) configuration to stdout and exit.
Possible values: full, minimal, expand
--sigterm TEXT=save-and-exit What to do with SIGTERM: save-and-exit or exit-immediately.
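
For example, a minimal training invocation combining these general options
might look as follows (a sketch; config and log file names are illustrative):

  ./marian -c config.yml -w 8000 --log train.log --log-level info --seed 1111
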
-m,--model TEXT=model.npz Path prefix for model to be saved/resumed. Supported file
extensions: .npz, .bin
--pretrained-model TEXT Path prefix for pre-trained model to initialize model weights
--ignore-model-config Ignore the model configuration saved in npz file
--type TEXT=amun Model type: amun, nematus, s2s, multi-s2s, transformer
--dim-vocabs VECTOR=0,0 ... Maximum items in vocabulary ordered by rank, 0 uses all
items in the provided/created vocabulary file
--dim-emb INT=512 Size of embedding vector
--factors-dim-emb INT Embedding dimension of the factors. Only used if concat is
selected as factors combining form
--factors-combine TEXT=sum How to combine the factors and lemma embeddings. Options
available: sum, concat
--lemma-dependency TEXT Lemma dependency method to use when predicting target
factors. Options: soft-transformer-layer,
hard-transformer-layer, lemma-dependent-bias, re-embedding
--lemma-dim-emb INT=0 Re-embedding dimension of lemma in factors
--dim-rnn INT=1024 Size of rnn hidden state
--enc-type TEXT=bidirectional Type of encoder RNN: bidirectional, bi-unidirectional,
alternating (s2s)
--enc-cell TEXT=gru Type of RNN cell: gru, lstm, tanh (s2s)
--enc-cell-depth INT=1 Number of transitional cells in encoder layers (s2s)
--enc-depth INT=1 Number of encoder layers (s2s)
--dec-cell TEXT=gru Type of RNN cell: gru, lstm, tanh (s2s)
--dec-cell-base-depth INT=2 Number of transitional cells in first decoder layer (s2s)
--dec-cell-high-depth INT=1 Number of transitional cells in next decoder layers (s2s)
--dec-depth INT=1 Number of decoder layers (s2s)
--skip Use skip connections (s2s)
--layer-normalization Enable layer normalization
--right-left Train right-to-left model
--input-types VECTOR ... Provide type of input data if different than 'sequence'.
Possible values: sequence, class, alignment, weight. You
need to provide one type per input file (if --train-sets)
or per TSV field (if --tsv).
--best-deep Use Edinburgh deep RNN configuration (s2s)
--tied-embeddings Tie target embeddings and output embeddings in output layer
--tied-embeddings-src Tie source and target embeddings
--tied-embeddings-all Tie all embedding layers and output layer
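
For instance, a sketch of a run with a single shared vocabulary and all
embedding matrices tied (vocabulary path illustrative; tying source and
target embeddings assumes a joint vocabulary):

  ./marian --type transformer -v vocab.yml vocab.yml --tied-embeddings-all
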
--output-omit-bias Do not use a bias vector in decoder output layer
--transformer-heads INT=8 Number of heads in multi-head attention (transformer)
--transformer-no-projection Omit linear projection after multi-head attention
(transformer)
--transformer-rnn-projection Add linear projection after rnn layer (transformer)
--transformer-pool Pool encoder states instead of using cross attention
(selects first encoder state, best used with special token)
--transformer-dim-ffn INT=2048 Size of position-wise feed-forward network (transformer)
--transformer-decoder-dim-ffn INT=0 Size of position-wise feed-forward network in decoder
(transformer). Uses --transformer-dim-ffn if 0.
--transformer-ffn-depth INT=2 Depth of filters (transformer)
--transformer-decoder-ffn-depth INT=0 Depth of filters in decoder (transformer). Uses
--transformer-ffn-depth if 0
--transformer-ffn-activation TEXT=swish
Activation between filters: swish or relu (transformer)
--transformer-dim-aan INT=2048 Size of position-wise feed-forward network in AAN
(transformer)
--transformer-aan-depth INT=2 Depth of filter for AAN (transformer)
--transformer-aan-activation TEXT=swish
Activation between filters in AAN: swish or relu (transformer)
--transformer-aan-nogate Omit gate in AAN (transformer)
--transformer-decoder-autoreg TEXT=self-attention
Type of autoregressive layer in transformer decoder:
self-attention, average-attention (transformer)
--transformer-tied-layers VECTOR ... List of tied decoder layers (transformer)
--transformer-guided-alignment-layer TEXT=last
'last' or the number of the layer to use for guided
alignment training in transformer
--transformer-preprocess TEXT Operation before each transformer layer: d = dropout, a =
add, n = normalize
--transformer-postprocess-emb TEXT=d Operation after transformer embedding layer: d = dropout, a
= add, n = normalize
--transformer-postprocess TEXT=dan Operation after each transformer layer: d = dropout, a =
add, n = normalize
--transformer-postprocess-top TEXT Final operation after a full transformer stack: d = dropout,
a = add, n = normalize. The optional skip connection with
'a' bypasses the entire stack.
--transformer-train-position-embeddings
Train positional embeddings instead of using static
sinusoidal embeddings
--transformer-depth-scaling Scale down weight initialization in transformer layers by 1
/ sqrt(depth)
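
As an illustration, a transformer-base-like setup could be assembled by hand
from the options above (the --task presets listed further below set similar
values; the numbers here follow common transformer recipes, not verified
defaults):

  ./marian --type transformer --enc-depth 6 --dec-depth 6 \
      --transformer-heads 8 --dim-emb 512 --transformer-dim-ffn 2048 \
      --transformer-dropout 0.1 --transformer-postprocess dan
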
--bert-mask-symbol TEXT=[MASK] Masking symbol for BERT masked-LM training
--bert-sep-symbol TEXT=[SEP] Sentence separator symbol for BERT next sentence prediction
training
--bert-class-symbol TEXT=[CLS] Class symbol for BERT classifier training
--bert-masking-fraction FLOAT=0.15 Fraction of masked out tokens during training
--bert-train-type-embeddings=true Train BERT type embeddings, set to false to use static
sinusoidal embeddings
--bert-type-vocab-size INT=2 Size of BERT type vocab (sentence A and B)
--dropout-rnn FLOAT Scaling dropout along rnn layers and time (0 = no dropout)
--dropout-src FLOAT Dropout source words (0 = no dropout)
--dropout-trg FLOAT Dropout target words (0 = no dropout)
--transformer-dropout FLOAT Dropout between transformer layers (0 = no dropout)
--transformer-dropout-attention FLOAT Dropout for transformer attention (0 = no dropout)
--transformer-dropout-ffn FLOAT Dropout for transformer filter (0 = no dropout)
--cost-type TEXT=ce-sum Optimization criterion: ce-mean, ce-mean-words, ce-sum,
perplexity
--multi-loss-type TEXT=sum How to accumulate multi-objective losses: sum, scaled, mean
--unlikelihood-loss Use word-level weights as indicators for sequence-level
unlikelihood training
--overwrite Do not create model checkpoints, only overwrite main model
file with last checkpoint. Reduces disk usage
--no-reload Do not load existing model specified in --model arg
-t,--train-sets VECTOR ... Paths to training corpora: source target
-v,--vocabs VECTOR ... Paths to vocabulary files have to correspond to
--train-sets. If this parameter is not supplied we look for
vocabulary files source.{yml,json} and target.{yml,json}.
If these files do not exist they are created
--sentencepiece-alphas VECTOR ... Sampling factors for SentencePiece vocabulary; i-th factor
corresponds to i-th vocabulary
--sentencepiece-options TEXT Pass-through command-line options to SentencePiece trainer
--sentencepiece-max-lines UINT=2000000
Maximum lines to train SentencePiece vocabulary, selected
with sampling from all data. When set to 0 all lines are
going to be used.
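
Assuming Marian was built with SentencePiece support, vocabularies can be
created on the fly during the first training run; a sketch (the .spm suffix
is assumed to select SentencePiece vocabularies, file names illustrative):

  ./marian -t corpus.src corpus.trg -v vocab.spm vocab.spm \
      --dim-vocabs 32000 32000 --sentencepiece-max-lines 10000000
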
-e,--after-epochs UINT Finish after this many epochs, 0 is infinity (deprecated,
'--after-epochs N' corresponds to '--after Ne')
--after-batches UINT Finish after this many batch updates, 0 is infinity
(deprecated, '--after-batches N' corresponds to '--after
Nu')
-a,--after TEXT=0e Finish after this many chosen training units, 0 is infinity
(e.g. 100e = 100 epochs, 10Gt = 10 billion target labels,
100Ku = 100,000 updates)
--disp-freq TEXT=1000u Display information every arg updates (append 't' for every
arg target labels)
--disp-first UINT Display information for the first arg updates
--disp-label-counts=true Display label counts when logging loss progress
--save-freq TEXT=10000u Save model file every arg updates (append 't' for every arg
target labels)
--logical-epoch VECTOR=1e,0 ... Redefine logical epoch counter as multiple of data epochs
(e.g. 1e), updates (e.g. 100Ku) or labels (e.g. 1Gt).
Second parameter defines width of fractional display, 0 by
default.
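
The unit suffixes (e = epochs, u = updates, t = target labels) compose across
these scheduling options; e.g. (values illustrative):

  ./marian --after 10e --disp-freq 500u --save-freq 5000u --logical-epoch 1Gt
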
--max-length UINT=50 Maximum length of a sentence in a training sentence pair
--max-length-crop Crop a sentence to max-length instead of omitting it if
longer than max-length
--tsv Tab-separated input
--tsv-fields UINT Number of fields in the TSV input. By default, it is guessed
based on the model type
--shuffle TEXT=data How to shuffle input data (data: shuffles data and sorted
batches; batches: data is read in order into batches, but
batches are shuffled; none: no shuffling). Use with
'--maxi-batch-sort none' in order to achieve exact reading
order
--no-shuffle Shortcut for backwards compatibility, equivalent to --shuffle
none (deprecated)
--no-restore-corpus Skip restoring corpus state after training is restarted
-T,--tempdir TEXT=/tmp Directory for temporary (shuffled) files and database
--sqlite TEXT Use disk-based sqlite3 database for training corpus storage;
default is a temporary database, providing a path creates
persistent storage
--sqlite-drop Drop existing tables in sqlite3 database
-d,--devices VECTOR=0 ... Specifies GPU ID(s) to use for training. Defaults to
0..num-devices-1
--num-devices UINT Number of GPUs to use for this process. Defaults to
length(devices) or 1
--no-nccl Disable inter-GPU communication via NCCL
--sharding TEXT=global When using NCCL and MPI for multi-process training use
'global' (default, less memory usage) or 'local' (more
memory usage but faster) sharding
--sync-freq TEXT=200u When sharding is local sync all shards across processes once
every n steps (possible units u=updates, t=target labels,
e=epochs)
--cpu-threads UINT=0 Use CPU-based computation with this many independent
threads, 0 means GPU-based computation
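
A sketch of a synchronous run on four GPUs with local sharding (assumes an
NCCL-enabled build; --sync-sgd is described further below):

  ./marian -d 0 1 2 3 --sync-sgd --sharding local --sync-freq 200u
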
--mini-batch INT=64 Size of mini-batch used during update
--mini-batch-words INT Set mini-batch size based on words instead of sentences
--mini-batch-fit Determine mini-batch size automatically based on
sentence-length to fit reserved memory
--mini-batch-fit-step UINT=10 Step size for mini-batch-fit statistics
--gradient-checkpointing Enable gradient-checkpointing to minimize memory usage
--maxi-batch INT=100 Number of batches to preload for length-based sorting
--maxi-batch-sort TEXT=trg Sorting strategy for maxi-batch: none, src, trg (not
available for decoder)
--shuffle-in-ram Keep shuffled corpus in RAM, do not write to temp file
--data-threads UINT=8 Number of concurrent threads to use during data reading and
processing
--all-caps-every UINT When forming minibatches, preprocess every Nth line on the
fly to all-caps. Assumes UTF-8
--english-title-case-every UINT When forming minibatches, preprocess every Nth line on the
fly to title-case. Assumes English (ASCII only)
--mini-batch-words-ref UINT If given, the following hyperparameters are adjusted as if
we had this mini-batch size: --learn-rate,
--optimizer-params, --exponential-smoothing,
--mini-batch-warmup
--mini-batch-warmup TEXT=0 Linear ramp-up of MB size, up to this #updates (append 't'
for up to this #target labels). Auto-adjusted to
--mini-batch-words-ref if given
--mini-batch-track-lr Dynamically track mini-batch size inverse to actual learning
rate (not considering lr-warmup)
--mini-batch-round-up=true Round up batch size to next power of 2 for more efficient
training, but this can make batch size less stable. Disable
with --mini-batch-round-up=false
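
Automatic batch fitting is typically paired with an explicit workspace size,
since batch statistics are computed against the reserved memory; e.g. (values
illustrative):

  ./marian -w 9000 --mini-batch-fit --maxi-batch 1000 --maxi-batch-sort trg
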
-o,--optimizer TEXT=adam Optimization algorithm: sgd, adagrad, adam
--optimizer-params VECTOR ... Parameters for optimization algorithm, e.g. betas for Adam.
Auto-adjusted to --mini-batch-words-ref if given
--optimizer-delay FLOAT=1 SGD update delay (#batches between updates). 1 = no delay.
Can be fractional, e.g. 0.1 to use only 10% of each batch
--sync-sgd Use synchronous SGD instead of asynchronous for multi-gpu
training
-l,--learn-rate FLOAT=0.0001 Learning rate. Auto-adjusted to --mini-batch-words-ref if
given
--lr-report Report learning rate for each update
--lr-decay FLOAT Per-update decay factor for learning rate: lr <- lr * arg (0
to disable)
--lr-decay-strategy TEXT=epoch+stalled
Strategy for learning rate decaying: epoch, batches,
stalled, epoch+batches, epoch+stalled
--lr-decay-start VECTOR=10,1 ... The first number of (epoch, batches, stalled) validations to
start learning rate decaying (tuple)
--lr-decay-freq UINT=50000 Learning rate decaying frequency for batches, requires
--lr-decay-strategy to be batches
--lr-decay-reset-optimizer Reset running statistics of optimizer whenever learning rate
decays
--lr-decay-repeat-warmup Repeat learning rate warmup when learning rate is decayed
--lr-decay-inv-sqrt VECTOR=0 ... Decrease learning rate at arg / sqrt(no. batches) starting
at arg (append 't' or 'e' for sqrt(target labels or
epochs)). Add second argument to define the starting point
(default: same as first value)
--lr-warmup TEXT=0 Increase learning rate linearly for arg first batches
(append 't' for arg first target labels)
--lr-warmup-start-rate FLOAT Start value for learning rate warmup
--lr-warmup-cycle Apply cyclic warmup
--lr-warmup-at-reload Repeat warmup after interrupted training
--label-smoothing FLOAT Epsilon for label smoothing (0 to disable)
--factor-weight FLOAT=1 Weight for loss function for factors (factored vocab only)
(1 to disable)
--clip-norm FLOAT=1 Clip gradient norm to arg (0 to disable)
--exponential-smoothing FLOAT=0 Maintain smoothed version of parameters for validation and
saving with smoothing factor. 0 to disable. Auto-adjusted
to --mini-batch-words-ref if given.
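
A common inverse-square-root learning rate schedule with linear warmup,
similar to standard transformer recipes, can be expressed with these options
(values illustrative, not tuned recommendations):

  ./marian --learn-rate 0.0003 --lr-warmup 16000 --lr-decay-inv-sqrt 16000 \
      --label-smoothing 0.1 --exponential-smoothing 0.0001 --clip-norm 0
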
--guided-alignment TEXT=none Path to a file with word alignments. Use guided alignment to
guide attention or 'none'. If --tsv it specifies the index
of a TSV field that contains the alignments (0-based)
--guided-alignment-cost TEXT=ce Cost type for guided alignment: ce (cross-entropy), mse
(mean square error), mult (multiplication)
--guided-alignment-weight FLOAT=0.1 Weight for guided alignment cost
--data-weighting TEXT Path to a file with sentence or word weights. If --tsv it
specifies the index of a TSV field that contains the
weights (0-based)
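
For example, alignments can come from a separate file or, with --tsv, from a
TSV field given by its 0-based index (file name illustrative):

  ./marian --guided-alignment corpus.aln --guided-alignment-weight 0.1
  ./marian --tsv --guided-alignment 2
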
--data-weighting-type TEXT=sentence Processing level for data weighting: sentence, word
--embedding-vectors VECTOR ... Paths to files with custom source and target embedding vectors
--embedding-normalization Normalize values from custom embedding vectors to [-1, 1]
--embedding-fix-src Fix source embeddings. Affects all encoders
--embedding-fix-trg Fix target embeddings. Affects all decoders
--fp16 Shortcut for mixed precision training with float16 and
cost-scaling, corresponds to: --precision float16 float32
--cost-scaling 8.f 10000 1.f 8.f
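
That is, the following two invocations are equivalent:

  ./marian --fp16
  ./marian --precision float16 float32 --cost-scaling 8.f 10000 1.f 8.f
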
--precision VECTOR=float32,float32 ...
Mixed precision training for forward/backward pass and
optimization. Defines types for: forward/backward pass,
optimization.
--cost-scaling VECTOR ... Dynamic cost scaling for mixed precision training: scaling
factor, frequency, multiplier, minimum factor
--gradient-norm-average-window UINT=100
Window size over which the exponential average of the
gradient norm is recorded (for logging and scaling). After
this many updates about 90% of the mass of the exponential
average comes from these updates
--dynamic-gradient-scaling VECTOR ... Re-scale gradient to have average gradient norm if (log)
gradient norm diverges from average by arg1 sigmas. If arg2
= "log" the statistics are recorded for the log of the
gradient norm else use plain norm
--check-gradient-nan Skip parameter update in case of NaNs in gradient
--normalize-gradient Normalize gradient by multiplying with no. devices / total
labels (not recommended and to be removed in the future)
--train-embedder-rank VECTOR ... Override model configuration and train an embedding
similarity ranker with the model encoder, parameters encode
margin and an optional normalization factor
--quantize-bits UINT=0 Number of bits to compress model to. Set to 0 to disable
--quantize-optimization-steps UINT=0 Adjust quantization scaling factor for N steps
--quantize-log-based Use log-based quantization
--quantize-biases Apply quantization to biases
--ulr Enable ULR (Universal Language Representation)
--ulr-query-vectors TEXT Path to file with universal source embeddings from
projection into universal space
--ulr-keys-vectors TEXT Path to file with universal source embeddings of target
keys from projection into universal space
--ulr-trainable-transformation Make Query Transformation Matrix A trainable
--ulr-dim-emb INT ULR monolingual embeddings dimension
--ulr-dropout FLOAT=0 ULR dropout on embeddings attentions. Default is no dropout
--ulr-softmax-temperature FLOAT=1 ULR softmax temperature to control randomness of
predictions. Default is 1.0: no temperature
--task VECTOR ... Use predefined set of options. Possible values:
transformer-base, transformer-big,
transformer-base-prenorm, transformer-big-prenorm
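
Explicitly given options are expected to override preset values; this can be
verified with --dump-config, e.g.:

  ./marian --task transformer-big --transformer-dropout 0.1 --dump-config expand
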
--valid-sets VECTOR ... Paths to validation corpora: source target
--valid-freq TEXT=10000u Validate model every arg updates (append 't' for every arg
target labels)
--valid-metrics VECTOR=cross-entropy ...
Metric to use during validation: cross-entropy,
ce-mean-words, perplexity, valid-script, translation, bleu,
bleu-detok (deprecated, same as bleu), bleu-segmented,
chrf. Multiple metrics can be specified
--valid-reset-stalled Reset stalled validation metrics when the training is
restarted
--valid-reset-all Reset all validation metrics when the training is restarted
--early-stopping UINT=10 Stop if the first validation metric does not improve for arg
consecutive validation steps
--early-stopping-on TEXT=first Decide if early stopping should take into account first,
all, or any validation metrics. Possible values: first,
all, any
-b,--beam-size UINT=12 Beam size used during search with validating translator
-n,--normalize FLOAT=0 Divide translation score by pow(translation length, arg)
--max-length-factor FLOAT=3 Maximum target length as source length times factor
--word-penalty FLOAT Subtract (arg * translation length) from translation score
--allow-unk Allow unknown words to appear in output
--n-best Generate n-best list
--word-scores Print word-level scores. One score per subword unit, not
normalized even if --normalize
--valid-mini-batch INT=32 Size of mini-batch used during validation
--valid-max-length UINT=1000 Maximum length of a sentence in a validating sentence pair.
Sentences longer than valid-max-length are cropped to
valid-max-length
--valid-script-path TEXT Path to external validation script. It should print a single
score to stdout. If the option is used with validating
translation, the output translation file will be passed as
a first argument
--valid-script-args VECTOR ... Additional args passed to --valid-script-path. These are
inserted between the script path and the output
translation-file path
--valid-translation-output TEXT (Template for) path to store the translation. E.g.,
validation-output-after-{U}-updates-{T}-tokens.txt.
Template parameters: {E} for epoch; {B} for No. of batches
within epoch; {U} for total No. of updates; {T} for total
No. of tokens seen.
--keep-best Keep best model for each validation metric
--valid-log TEXT Log validation scores to file given by arg
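
A typical validation setup combining several of these options might look like
this (paths and values illustrative):

  ./marian --valid-sets dev.src dev.trg --valid-metrics cross-entropy bleu \
      --valid-freq 5000u --early-stopping 10 --keep-best \
      --valid-translation-output dev-out-{U}.txt --valid-log valid.log
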