Introduction
In this tutorial we will learn how to do efficient neural machine translation
with the Marian toolkit, optimizing speed, accuracy and resource usage for
training and decoding of NMT models.
No background knowledge about Marian is required, but if you are completely new
to neural machine translation and Marian, you may take a look at the
introductory tutorial to Marian first. We assume that the
reader is familiar with basic Linux commands.
The tutorial requires Marian v1.7.12+, which is currently only
available in the marian-dev
repository. However, most parts are also applicable for older versions of
Marian.
Installation
Marian can be compiled on machines with NVIDIA GPU devices and CUDA 8.0+ or on
CPU-only machines. For this tutorial we need Marian compiled with support for
CPU and SentencePiece. This requires installing Intel
MKL and Protocol Buffers first.
To download the repository and compile Marian, run the following commands:
git clone https://github.com/marian-nmt/marian-dev
cd marian-dev
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DUSE_SENTENCEPIECE=on -DCOMPILE_CPU=on
make -j
cd ../..
In this tutorial we assume that Marian is compiled in your home directory.
If everything worked correctly, you can display the list of options with:
~/marian-dev/build/marian --help |& less
Models and data
First we need to download models and data (611MB) that we will use for the tutorial:
wget http://data.statmt.org/romang/marian-examples/marian-tutorial-mtm19.tgz
tar zxvf marian-tutorial-mtm19.tgz
cd marian-tutorial-mtm19
This will give you the following files:
marian-tutorial-mtm19
├── 1_decoding
│   ├── data
│   │   ├── newstest2013.de
│   │   ├── newstest2013.en
│   │   ├── newstest2014.de
│   │   └── newstest2014.en
│   ├── model.npz
│   ├── run-me.sh
│   └── vocab.spm
├── 2_training
│   ├── config.yml
│   ├── download-data.sh
│   ├── run-me.sh
│   └── train.log
├── 3_student
│   ├── data
│   │   ├── newstest2014.bpe.en
│   │   └── newstest2014.de
│   ├── lex.s2t
│   ├── model.student.base
│   │   └── model.npz
│   ├── model.student.small
│   │   └── model.npz
│   ├── model.student.small.aan.nogate.noffn
│   │   └── model.npz
│   ├── run-me.sh
│   └── vocab.ende.yml
├── detruecase.perl
└── multi-bleu.perl
1. Decoding
For this part of the tutorial we will use an RNN model trained on the WMT’14
English-German corpus provided by the Stanford NLP Group, pre-processed with
a true-caser and SentencePiece-based subword segmentation. The model is in the
marian-tutorial-mtm19/1_decoding folder.
Models trained with Marian can be decoded using the marian-decoder
command.
The basic usage requires specifying paths to the model and vocabularies:
~/marian-dev/build/marian-decoder -m model.npz -v vocab.spm vocab.spm < data/newstest2014.en > output.de
where newstest2014.en is an input file with pre-processed source sentences and
output.de is an output file with translations.
Now we want to calculate the speed and quality of the translation. Marian
already prints the total time required to translate the input file (exclusive
of loading times of models and vocabularies) at the end of the decoding, for
example:
[2019-08-25 13:17:33] Total time: 119.24629s wall
To estimate the quality, we need to run a BLEU scorer:
cat output.de | ../multi-bleu.perl data/newstest2014.de
This should display:
BLEU = 23.90, 56.0/29.6/17.7/11.1 (BP=1.000, ratio=1.003, hyp_len=59469, ref_len=59297)
Batched decoding
Batched decoding parallelizes translation of multiple sentences. It generates
translation for whole mini-batches and significantly increases translation
speed, roughly by a factor of 10 or more.
In Marian there are a couple of options that are important for batched
translation:
- --mini-batch enables translation with a mini-batch size of N, i.e.
  translating N sentences at once.
- --mini-batch-words allows specifying the size of a mini-batch in terms of a
  number of words rather than a number of sentences.
- --maxi-batch preloads M mini-batches, i.e. M x N sentences.
- --maxi-batch-sort sorts them according to source sentence length; this
  allows for better packing of the batches.
An important option is also --workspace or -w, which sets the working memory.
The default working memory is 512 MB and Marian will increase it to match the
requirements during translation, but pre-allocating memory usually makes it a
bit faster.
Last but not least, you can also parallelize translation by running it on
multiple GPUs using the --devices
option.
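As an illustration, the options above might be combined as in the sketch
below; the specific values and the output file name are example choices for
this tutorial's data, not settings prescribed by the tutorial itself:
~/marian-dev/build/marian-decoder -m model.npz -v vocab.spm vocab.spm \
    -w 4000 --mini-batch 64 --maxi-batch 1000 --devices 0 \
    < data/newstest2014.en > output.batched.de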
Task 1
Translate newstest2014.en using different settings for batched
translation, compare decoding times, and complete the table below.
Which parameter improves the translation speed the most? Are you getting
exactly the same outputs from line-by-line and batched translation?
System                          | Speed (sec.)
--------------------------------|-------------
Line-by-line translation        |
Adding mini-batch of 32         |
Switching to mini-batch of 64   |
Increasing workspace to 4GB     |
Adding maxi-batch of 1000       |
Adding mini-batch-words of 2000 |
Optimizing parameters
There are a few options that impact the translation speed and quality, and beam
size is one of the most important. It determines the number of translation
hypotheses considered at each step of decoding. Exploring too few hypotheses
may lead to a globally suboptimal translation, while using too large a beam
size may increase resource usage and computation time without improving the
translation quality that much.
Options that can be easily tuned include:
- --beam-size determines the number of translation hypotheses explored at each
  step of the beam search algorithm. Common beam sizes are 8 to 12, depending
  on the model.
- --max-length-factor sets the maximum target length as the source length
  times the factor. The default value is 3.
- --normalize divides the translation score by the translation length to the
  power of \(\alpha\).
Task 2
Explore different configurations of the beam size and maximum length factor
and try to improve the translation time, keeping BLEU at a good level. Next,
grid-search the value of the length normalization parameter on newstest2013
and check if the improvement in BLEU transfers to newstest2014.
What are reasonable values for the beam size? What is the danger of using too
small a value for the maximum length factor?
This can be done with a simple bash script:
for i in 1 2 4 8 12 24; do
~/marian-dev/build/marian-decoder -b $i [OTHER OPTIONS] < data/newstest2013.en > output.b$i.de
done
Reasonable values to consider for --max-length-factor are in the range of 1
to 3. The normalization parameter \(\alpha\) can be grid-searched for values
around 1.0, usually in the range of 0.2 to 2.
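For instance, the grid search over the normalization parameter on newstest2013
could be scripted similarly to the loop above; the output file naming here is
just illustrative:
for a in 0.2 0.6 1.0 1.4 2.0; do
    ~/marian-dev/build/marian-decoder -n $a [OTHER OPTIONS] < data/newstest2013.en > output.n$a.de
    cat output.n$a.de | ../multi-bleu.perl data/newstest2013.de
done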
Back-translation
Generation of back-translations often requires translation of a large amount
of monolingual data (sometimes even hundreds of millions of sentences), so
optimizing this process can save you quite a lot of time.
- If your dataset is large, consider splitting it into smaller chunks (see
  GNU split) and translate the chunks sequentially, as in the sketch at the
  end of this section. This will not make your translation faster, but more
  stable.
- Use batched translation and adjust your decoding parameters to speed up the
  translation.
- Because monolingual data, especially if it comes from web crawling, can
  contain very long sentences, use a smaller --max-length and turn on
  --max-length-crop.
- Use sampled back-translations with --output-sampling.
To test sampled translations, run several times:
head data/newstest2014.en | ~/marian-dev/build/marian-decoder -m model.npz -v vocab.spm vocab.spm \
--output-sampling
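As suggested above, a large monolingual corpus can be split into chunks that
are translated one after another. A minimal sketch, assuming a hypothetical
input file mono.en and chunks of 1M lines (both are assumptions, not files
from this tutorial):
split -d -l 1000000 mono.en chunk.   # produces chunk.00, chunk.01, ...
for f in chunk.??; do
    ~/marian-dev/build/marian-decoder -m model.npz -v vocab.spm vocab.spm \
        --mini-batch 64 --maxi-batch 100 \
        --max-length 100 --max-length-crop --output-sampling \
        < $f > $f.de
done
cat chunk.??.de > mono.bt.de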
2. Training
Efficient training of an NMT model might mean minimizing the training time or
improving the convergence. It also means choosing a model architecture that is
best suited to your needs.
Model architecture
Transformer models often offer the best quality, but are also more difficult
to train than RNNs, and need more careful setting of training parameters.
A good starting point for finding a good set of training hyper-parameters is
our repository with Marian
examples.
Since version 1.7.12 Marian offers the --task option, which provides
pre-defined settings for transformer-base and transformer-big:
~/marian-dev/build/marian --task transformer-base --dump-config expand
You should keep in mind that these settings are just a starting point and
should be adjusted to your scenario. For example, a low-resource scenario will
probably require stronger regularization, etc.
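One possible workflow (the config file name below is just an example) is to
dump the expanded settings to a file, adjust the hyper-parameters there, and
then train from the edited file:
# write the expanded transformer-base settings to a YAML file
~/marian-dev/build/marian --task transformer-base --dump-config expand > my-config.yml
# after editing my-config.yml (data paths, regularization, etc.), train with it
~/marian-dev/build/marian -c my-config.yml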
GPU memory
Training of some model architectures like the Transformer may benefit from
large mini-batches.
--mini-batch-fit overrides the specified mini-batch size and automatically
chooses the largest mini-batch for a given sentence length that fits into the
specified memory. When --mini-batch-fit is set, memory requirements are
guaranteed to fit into the specified workspace. Choosing too small a workspace
will result in small mini-batches, which can prohibit learning.
In a multi-GPU setting, synchronous training (--sync-sgd) can also serve as a
method to increase the effective size of mini-batches.
The option --workspace sets the size of the memory available for the forward
and backward steps of the training procedure. This does not include the model
and optimizer parameters, which are allocated outside the workspace. Hence you
cannot allocate all GPU memory to the workspace.
It is a bit tricky to determine the effective size of a mini-batch with
dynamically sized mini-batches. An alternative is to operate in terms of the
total workspace size that is used to generate mini-batches for one update,
which can be calculated with the following formula:
\[\text{Workspace (MB)} \times \text{Number of GPUs} \times \text{Delayed updates}\]
Cumulative or delayed gradient updates increase the effective batch size,
i.e. the amount of data used at each step of training. They can be enabled
with --optimizer-delay N.
Task 3
Set up training of a transformer-base model with a total workspace of
ca. 48 GB per update, and calculate the average size of mini-batches from the
training logs.
Download the training data:
cd ../2_training
bash download-data.sh
Start with the following command and set the --workspace, --devices and
--optimizer-delay options:
~/marian-dev/build/marian --task transformer-base -c config.yml
The optimal values for these options will differ depending on the number of
GPUs and available memory per single GPU.
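As an illustration of the workspace formula above (the number of GPUs here is
an assumption; adjust it to your machine): with 4 GPUs, a 12000 MB workspace
and --optimizer-delay 1, the total workspace per update is
12000 MB × 4 × 1, i.e. roughly 48 GB:
~/marian-dev/build/marian --task transformer-base -c config.yml \
    --workspace 12000 --devices 0 1 2 3 --optimizer-delay 1 --mini-batch-fit
With fewer GPUs, the same total can be reached by increasing
--optimizer-delay accordingly, e.g. 2 GPUs with --optimizer-delay 2.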
3. Teacher-student
Knowledge distillation approaches, like the teacher-student
method, are used for training a
smaller student model to perform better by learning from a larger teacher
model.
In NMT this might mean training a strong (and slow) ensemble of Transformer
models as a teacher, then translating the entire source side of the training
data with the teacher, and finally training a small (and fast) model on the
original source and the translated target data. The student model usually has
worse BLEU, but is much faster. A useful side effect of teacher-student
learning is improved performance of the student with smaller beam sizes.
More details can be found in our paper on cost-effective and high-quality NMT
with Marian.
CPU decoding
The teacher-student method enables efficient translation on CPU. A couple of
things can speed it up more:
- Batched translation as presented in the first part of the tutorial.
- A lexical shortlist, which restricts the output vocabulary to a small subset
  of translation candidates for each word.
- Auto-tuning of the matrix product implementation with --optimize.
The --cpu-threads option turns on decoding on the CPU.
Task 4
Experiment with different settings for decoding on a single CPU. Check
different student models, batched translation, lexical shortlists and
auto-tuning.
How does a student model perform with small beam sizes? What is the speed
improvement from a lexical shortlist on CPU and GPU? What are the architectures
of the student models and how large are they? What is the issue with decoding
on multiple CPU threads with model files in the .npz format?
The models can be found in the ./3_student
folder. They have been trained on
true-cased and BPE-segmented data, so require pre- and post-processing:
cd ../3_student
cat data/newstest2014.bpe.en \
| ~/marian-dev/build/marian-decoder -m model.student.small/model.npz -v vocab.ende.yml vocab.ende.yml \
-d 0 --cpu-threads 1 \
--mini-batch 64 --maxi-batch 100 \
-b 1 --max-length-factor 1.5 -n 0.6 \
| perl -pe 's/@@ //g' \
| ../detruecase.perl \
| ../multi-bleu.perl data/newstest2014.de
The shortlist file is 3_student/data/lex.s2t. Model-specific parameters can be
read from a model.npz file using the script
marian-dev/scripts/contrib/model_info.py.
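To experiment with the lexical shortlist and auto-tuning, the corresponding
flags can be added to the command above; the two numbers after the shortlist
path (how many frequent words and how many translation candidates per word to
keep) are example values, not tuned recommendations from this tutorial:
cat data/newstest2014.bpe.en \
    | ~/marian-dev/build/marian-decoder -m model.student.small/model.npz -v vocab.ende.yml vocab.ende.yml \
    --cpu-threads 1 --mini-batch 64 --maxi-batch 100 \
    -b 1 --max-length-factor 1.5 -n 0.6 \
    --shortlist data/lex.s2t 100 100 --optimize \
    | perl -pe 's/@@ //g' \
    | ../detruecase.perl \
    | ../multi-bleu.perl data/newstest2014.de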
That’s all for now. Hope you have found something useful here!