In the upcoming WMT2017 paper from Edinburgh, Deep Architectures for Neural Machine Translation, the authors compare multiple known and novel deep architectures. Marian implements most of them, as well as the tricks used to achieve faster convergence and smaller models.
Layer normalization is a technique that greatly improves convergence of RNNs (according to our own experiments, by a factor of 3-5) and also results in better quality at test time. It is comparable to feature normalization, where features are normalized by subtracting their mean and dividing by their standard deviation. Instead of applying this only to the input features, layer normalization applies it to the activations of the neural network. See here for more details.
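The normalization step described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Marian's implementation: the function name `layer_norm`, the scalar `gain` and `bias`, and the `eps` constant are assumptions made for this example (in practice the gain and bias are learned vectors).

```python
import math

def layer_norm(activations, gain=1.0, bias=0.0, eps=1e-5):
    """Normalize one layer's activations to zero mean and unit variance.

    gain, bias and eps are illustrative assumptions, not Marian parameters.
    """
    n = len(activations)
    mean = sum(activations) / n
    variance = sum((a - mean) ** 2 for a in activations) / n
    std = math.sqrt(variance + eps)  # eps guards against division by zero
    return [gain * (a - mean) / std + bias for a in activations]

# Hypothetical activations of a single layer for a single input
hidden = [2.0, 4.0, 6.0, 8.0]
normalized = layer_norm(hidden)
```

After normalization the values are centred on zero with a variance close to one, regardless of the scale of the original activations.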
Layer normalization is turned on by:
--layer-normalization
or in the config file with:
layer-normalization: true
As the discussed models can become quite large, tying embedding matrices can help to reduce model size and the memory footprint during training. As Press & Wolf show, tying the target embeddings and the last layer of the output does not decrease quality and saves a significant number of parameters. Activated by:
--tied-embeddings
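To give a rough sense of the saving, the sketch below counts the parameters of the target embedding matrix and the output layer with and without tying. The target vocabulary size of 50000 is taken from the --dim-vocabs setting in the training command further down; the embedding dimension of 512 and the helper name `output_side_parameters` are assumptions for this example, and bias terms are ignored.

```python
def output_side_parameters(vocab_size, dim_emb, tied):
    """Count parameters in the target embeddings plus the output layer.

    Simplified illustration: ignores biases and assumes both matrices
    have shape vocab_size x dim_emb.
    """
    embedding = vocab_size * dim_emb
    output_layer = vocab_size * dim_emb
    # With tying, both roles share a single matrix.
    return embedding if tied else embedding + output_layer

target_vocab = 50000  # second value of --dim-vocabs in the training command
dim_emb = 512         # assumed embedding size, not stated in this section

untied = output_side_parameters(target_vocab, dim_emb, tied=False)
tied = output_side_parameters(target_vocab, dim_emb, tied=True)
saved = untied - tied
print(f"untied: {untied:,}  tied: {tied:,}  saved: {saved:,}")
```

Under these assumptions, tying removes one full vocabulary-sized matrix from the model.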
The authors investigate two types of depth for the decoder: multi-layer RNNs, and complex RNN cells that can themselves consist of cell-like feed-forward layers. In the decoder, the first RNN layer (the conditional cell that contains the attention mechanism) and all other layers have separate depth settings.
We set the number of decoder layers with
--dec-depth 4
and the number of feed-forward layers within the cell of the first and the following layers with
--dec-cell-base-depth 4 --dec-cell-high-depth 2
Alternatively, this can again be set in the config file:
dec-depth: 4
dec-cell-base-depth: 4
dec-cell-high-depth: 2
As with decoders, encoders can have multiple layers of complex RNN cells, set with:
--enc-depth 4 --enc-cell-depth 2
or in the config file:
enc-depth: 4
enc-cell-depth: 2
Furthermore, the authors of the Edinburgh paper take a look at different encoder types:
With --enc-depth 1, all three encoder types are reduced to the same single layer.
Residual connections allow a network to skip over layers by calculating y = f(x) + x, where f is the function represented by a layer. It is generally believed that residual connections improve learning in deeper networks.
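The computation y = f(x) + x can be sketched as follows. The `layer` function here is a hypothetical stand-in for an RNN layer (a simple element-wise tanh), and `stack` is an illustrative helper; neither reflects how Marian implements its layers.

```python
import math

def layer(x):
    # Hypothetical stand-in for f: any function that maps a vector
    # to a vector of the same size.
    return [math.tanh(0.5 * v) for v in x]

def residual(f, x):
    """Compute y = f(x) + x element-wise."""
    return [fx_i + x_i for fx_i, x_i in zip(f(x), x)]

def stack(x, depth, use_skip):
    """Apply `depth` layers, optionally with residual connections."""
    for _ in range(depth):
        x = residual(layer, x) if use_skip else layer(x)
    return x

inputs = [1.0, -2.0, 0.5]
with_skip = stack(inputs, depth=4, use_skip=True)
without_skip = stack(inputs, depth=4, use_skip=False)
```

Because the input is added back after every layer, it reaches deeper layers unchanged alongside the transformed signal, which is the usual intuition for why such connections help deep stacks train.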
We activate residual connections between RNN layers on the command line with
--skip
or in the config file
skip: true
During our own experiments, we found that LSTM cells work better for deeper models (GRU cells seem to work better for the default shallow model). The LSTM cell can be chosen separately for the encoder and decoder with the following switches
--enc-cell lstm --dec-cell lstm
or with the yaml config file entries
enc-cell: lstm
dec-cell: lstm
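The examples in this section follow a regular pattern: a command-line switch such as --enc-cell lstm corresponds to a config entry with the same name minus the leading dashes. The helper below, `flags_to_config`, is a hypothetical illustration of that correspondence and is not part of Marian. It assumes that switches without a value become `true` and that multi-valued switches become a list; the file written by --dump-config may be formatted differently and also contains default values.

```python
def flags_to_config(args):
    """Turn a list of command-line tokens into YAML-style config lines.

    Illustrative only: assumes every option starts with '--' and that
    all following tokens up to the next option are its values.
    """
    config = {}
    key = None
    for token in args:
        if token.startswith("--"):
            key = token[2:]
            config[key] = []
        elif key is not None:
            config[key].append(token)

    lines = []
    for name, values in config.items():
        if not values:
            lines.append(f"{name}: true")        # boolean switch
        elif len(values) == 1:
            lines.append(f"{name}: {values[0]}")
        else:
            lines.append(f"{name}: [{', '.join(values)}]")
    return "\n".join(lines)

print(flags_to_config(["--enc-cell", "lstm", "--dec-cell", "lstm"]))
```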
This model will of course use a lot more memory on the GPU than the shallow models trained earlier. We increase the workspace memory to 6000 MB and create the config file as follows:
mkdir -p model.deep
./marian-dev/build/marian \
    --type s2s \
    --train-set data/corpus.clean.bpe.ro data/corpus.clean.bpe.en \
    --valid-set data/newsdev2016.bpe.ro data/newsdev2016.bpe.en \
    --vocabs data/vocab.ro.yml data/vocab.en.yml \
    --model model.deep/model.npz \
    --enc-depth 4 --enc-type alternating --enc-cell lstm --enc-cell-depth 2 \
    --dec-depth 4 --dec-cell lstm --dec-cell-base-depth 4 --dec-cell-high-depth 2 \
    --tied-embeddings --layer-normalization --skip \
    --dim-vocabs 66000 50000 \
    --mini-batch-fit --workspace 6500 \
    --dropout-rnn 0.2 --dropout-src 0.1 --exponential-smoothing \
    --early-stopping 5 --disp-freq 1000 \
    --log model.deep/train.log --valid-log model.deep/valid.log \
    --dump-config > model.deep/config.yml
As before, we can just use the config file to start our training process:
./marian-dev/build/marian -c model.deep/config.yml
Translation is done in the same way as for the shallow models. The model file contains information on its own architecture, so the parameters can be omitted on the command line.
Various models with multiple encoders and different attention mechanisms are discussed in our paper An Exploration of Neural Sequence-to-Sequence Architectures for Automatic Post-Editing available here.
A first version (not quite done yet) of a companion page with config files for Marian and data to train multi-encoder models is available here.
Back to Part 1: First steps with Marian
Continue with Part 3: A coding tutorial