Code Organisation¶
The purpose of this document is to outline the organisational structure of the Marian codebase. Each section of this document approaches an architectural component and highlights a subset of directories that are relevant to it.
Operating Modes¶
marian/src
├── command
├── rescorer
├── training
└── translator
The Marian toolkit provides several commands, covering different modes of operation. These are:
marian
marian-decoder
marian-server
marian-scorer
marian-vocab
marian-conv
Each of these has a corresponding file in the command directory.
The main marian command is capable of running all other modes (except server); see marian-main.cpp for the implementation. By default, it operates in train mode, which corresponds to marian-train.cpp. Other modes may be accessed by calling marian <X> instead of marian-<X>.
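The relationship between marian <X> and marian-<X> can be pictured as a dispatch on the first command-line argument. The following is a minimal sketch of that idea, not the code in marian-main.cpp: the function selectMode is hypothetical, and the mode names are assumed to match the <X> part of the commands listed above.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: picks the operating mode from the arguments that follow
// the program name. When the first argument is not a known mode, it falls back
// to "train", mirroring the default behaviour described above.
std::string selectMode(const std::vector<std::string>& args) {
  // Assumed mode names, derived from the marian-<X> commands (server excluded).
  static const std::vector<std::string> modes
      = {"train", "decoder", "scorer", "vocab", "conv"};
  if(!args.empty()) {
    for(const auto& mode : modes)
      if(args.front() == mode)
        return mode;
  }
  return "train";
}
```

With this arrangement, an invocation whose first argument is an ordinary option rather than a mode name is still treated as a training run.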
Training is covered by the main marian command, with relevant implementation details kept inside the training subdirectory. Translation is facilitated by code in the translator subdirectory and is handled by the marian-decoder command, as well as by marian-server, which provides a web-socket service. marian-scorer is the tool used to re-score parallel inputs or n-best lists, and uses code in the rescorer subdirectory.
The remaining commands, marian-vocab and marian-conv, provide useful auxiliary functions. marian-vocab is a tool to create a vocabulary file from a given text corpus. This uses components described in the Data section of this document.
marian-conv exists to convert Marian model files in the .npz and .bin formats, as well as to convert lexical shortlists to binary shortlists. It is also possible to use this command to emit an ONNX-compliant model representation. In addition to components defined in the Data section, this also makes use of Model-specific components.
Finally, the implementation of the command-line interface for these commands is described in the Utility section.
Data¶
marian/src
└── data
Data refers to the handling and representation of the text input to Marian. This consists of source code for the representation of the corpus, vocabulary and batches.
Internally, tokens are represented as indices, or Words; some indices are reserved for special tokens, such as EOS and UNK. Vocabulary implementations are responsible for encoding and decoding sentences to and from the internal representation, whether that be a SentencePiece, Factors or Plain Text/YAML defined vocabulary file.
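The encode/decode responsibility of a vocabulary can be illustrated with a toy example. This is a sketch under stated assumptions rather than Marian's implementation: ToyVocab, WordIndex, the reserved index values and the token strings "</s>" and "<unk>" are all chosen here for illustration, and tokenisation is plain whitespace splitting.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

using WordIndex = std::size_t;   // stand-in for Marian's Word type (assumption)
constexpr WordIndex kEosId = 0;  // reserved index: end of sentence
constexpr WordIndex kUnkId = 1;  // reserved index: unknown token

// Toy whitespace-tokenised vocabulary, for illustration only.
class ToyVocab {
public:
  explicit ToyVocab(const std::vector<std::string>& tokens)
      : idToToken_{"</s>", "<unk>"} {
    for(const auto& token : tokens) {
      if(tokenToId_.count(token) == 0) {
        tokenToId_[token] = idToToken_.size();
        idToToken_.push_back(token);
      }
    }
  }

  // Sentence -> indices; unseen tokens map to kUnkId and kEosId is appended.
  std::vector<WordIndex> encode(const std::string& sentence) const {
    std::vector<WordIndex> ids;
    std::istringstream stream(sentence);
    std::string token;
    while(stream >> token) {
      auto it = tokenToId_.find(token);
      ids.push_back(it == tokenToId_.end() ? kUnkId : it->second);
    }
    ids.push_back(kEosId);
    return ids;
  }

  // Indices -> sentence; the EOS marker is dropped from the output.
  std::string decode(const std::vector<WordIndex>& ids) const {
    std::string sentence;
    for(WordIndex id : ids) {
      if(id == kEosId)
        continue;
      if(!sentence.empty())
        sentence += ' ';
      sentence += id < idToToken_.size() ? idToToken_[id] : idToToken_[kUnkId];
    }
    return sentence;
  }

private:
  std::vector<std::string> idToToken_;
  std::unordered_map<std::string, WordIndex> tokenToId_;
};
```

Encoding a sentence that contains a token outside the vocabulary and decoding the result yields the same sentence with that token replaced by the unknown marker.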
This directory is also responsible for generating batches from a corpus and performing any shuffling of the corpus or batches, as requested. Furthermore, when a shortlist is used, its behaviour is also defined here.
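As a rough picture of batch generation with optional shuffling, consider the sketch below. It is deliberately simplified and is not how Marian builds batches: makeBatches and ToyBatch are hypothetical names, sentences are plain strings, and batches are formed by a fixed sentence count.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <string>
#include <vector>

using ToyBatch = std::vector<std::string>;  // a batch of sentences (assumption)

// Optionally shuffles a copy of the corpus, then cuts it into consecutive
// batches of at most batchSize sentences each.
std::vector<ToyBatch> makeBatches(std::vector<std::string> corpus,
                                  std::size_t batchSize,
                                  bool shuffle,
                                  unsigned seed) {
  std::vector<ToyBatch> batches;
  if(batchSize == 0)
    return batches;  // nothing sensible to do with empty batches

  if(shuffle) {
    std::mt19937 rng(seed);  // seeded so that runs are reproducible
    std::shuffle(corpus.begin(), corpus.end(), rng);
  }

  for(std::size_t i = 0; i < corpus.size(); i += batchSize) {
    std::size_t end = std::min(i + batchSize, corpus.size());
    batches.emplace_back(corpus.begin() + i, corpus.begin() + end);
  }
  return batches;
}
```

The last batch may be smaller than batchSize when the corpus size is not a multiple of it.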
Once the batches are generated they are passed as input to the expression graph.
Expression Graph¶
marian/src
├── functional
├── graph
├── optimizers
└── tensors
Marian implements a reverse-mode auto-differentiation computation graph. The relevant components reside in these subdirectories. The graph subdirectory concerns the structure of the graph and its nodes (operators, parameters and constants), as well as how to traverse it, both forwards and backwards. Moreover, it defines the APIs for operations that the graph is able to perform.
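To make the idea of reverse-mode auto-differentiation concrete, here is a minimal scalar sketch. It is not Marian's graph API: ToyGraph, its Node layout and the leaf, add, mul and backprop functions are hypothetical and exist only to show a forward pass that records nodes followed by a backward pass that walks them in reverse.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Minimal tape-based graph over scalars. Nodes are stored in creation order,
// which is a valid topological order, so the backward pass can simply walk
// the tape in reverse.
struct ToyGraph {
  struct Node {
    double value;
    double grad;
    std::function<void(ToyGraph&)> backwardFn;  // empty for leaf nodes
  };
  std::vector<Node> nodes;

  // Leaf node, e.g. a parameter or a constant.
  std::size_t leaf(double value) {
    nodes.push_back(Node{value, 0.0, nullptr});
    return nodes.size() - 1;
  }

  // Operator node: out = a + b, with d(out)/da = d(out)/db = 1.
  std::size_t add(std::size_t a, std::size_t b) {
    nodes.push_back(Node{nodes[a].value + nodes[b].value, 0.0, nullptr});
    std::size_t out = nodes.size() - 1;
    nodes[out].backwardFn = [a, b, out](ToyGraph& g) {
      g.nodes[a].grad += g.nodes[out].grad;
      g.nodes[b].grad += g.nodes[out].grad;
    };
    return out;
  }

  // Operator node: out = a * b, with d(out)/da = b and d(out)/db = a.
  std::size_t mul(std::size_t a, std::size_t b) {
    nodes.push_back(Node{nodes[a].value * nodes[b].value, 0.0, nullptr});
    std::size_t out = nodes.size() - 1;
    nodes[out].backwardFn = [a, b, out](ToyGraph& g) {
      g.nodes[a].grad += g.nodes[b].value * g.nodes[out].grad;
      g.nodes[b].grad += g.nodes[a].value * g.nodes[out].grad;
    };
    return out;
  }

  // Backward traversal: seed the root with gradient 1 and propagate.
  void backprop(std::size_t root) {
    for(auto& node : nodes)
      node.grad = 0.0;
    nodes[root].grad = 1.0;
    for(std::size_t i = root + 1; i-- > 0;)
      if(nodes[i].backwardFn)
        nodes[i].backwardFn(*this);
  }
};
```

For z = x * y + x, calling backprop on z accumulates y + 1 into the gradient of x and x into the gradient of y, since x contributes to z along two paths.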
The tensors and functional subdirectories contain the implementation of operations for the graph.
One component of the functional subdirectory describes how functions operate on the underlying data types. This is a combination of standard operations on fundamental types, and SIMD intrinsics on extended types where available. The functional namespace also provides useful abstractions that enable generic formulas to be written. It defines variable-like objects _1, _2, such that _1 * cos(_2) represents the product of the argument at index 1 with the cosine of the argument at index 2.
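The placeholder mechanism can be approximated with a few expression templates. The sketch below only mirrors the idea and is not the code in the functional subdirectory: the toy namespace and the Var, Mult and Cos types are hypothetical, and arguments are plain doubles rather than tensor elements.

```cpp
#include <cmath>
#include <cstddef>
#include <tuple>

namespace toy {

// Var<N> selects the N-th argument (1-based) when the expression is applied.
template <std::size_t N>
struct Var {
  template <typename... Args>
  double operator()(Args... args) const {
    return std::get<N - 1>(std::make_tuple(static_cast<double>(args)...));
  }
};

// Product of two sub-expressions, evaluated lazily on application.
template <typename L, typename R>
struct Mult {
  L lhs;
  R rhs;
  template <typename... Args>
  double operator()(Args... args) const {
    return lhs(args...) * rhs(args...);
  }
};

// Cosine of a sub-expression, evaluated lazily on application.
template <typename X>
struct Cos {
  X x;
  template <typename... Args>
  double operator()(Args... args) const {
    return std::cos(x(args...));
  }
};

// Building an expression only records its structure; nothing is computed yet.
template <typename L, typename R>
Mult<L, R> operator*(L lhs, R rhs) {
  return {lhs, rhs};
}

template <typename X>
Cos<X> cos(X x) {
  return {x};
}

const Var<1> _1{};
const Var<2> _2{};

}  // namespace toy
```

With these definitions, (toy::_1 * toy::cos(toy::_2))(2.0, 0.0) multiplies the first argument by the cosine of the second. The same expression object can be applied to any number of argument pairs, which is what makes the formula generic.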
The tensors subdirectory contains the definition of a tensor object. In Marian, a tensor is a piece of memory that is ascribed a shape and type and is associated with a backend (the compute device).
This directory also contains the implementations of tensor operations on CPU and GPU, as well as universal functions that dispatch the call to the relevant device.
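The description of a tensor and of device dispatch can be sketched as follows. All names here (ToyTensor, ToyDevice, ToyType, scale) are hypothetical and the GPU path is left out; the sketch only shows memory tagged with a shape, a type and a backend, plus a universal function that forwards to a device-specific implementation.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

enum class ToyDevice { CPU, GPU };
enum class ToyType { Float32 };

// A tensor as described above: memory plus shape, type and backend.
struct ToyTensor {
  std::vector<float> memory;       // underlying storage
  std::vector<std::size_t> shape;  // e.g. {rows, columns}
  ToyType type;
  ToyDevice device;
};

namespace cpu {
// CPU implementation of an element-wise scaling operation.
void scale(ToyTensor& tensor, float factor) {
  for(float& value : tensor.memory)
    value *= factor;
}
}  // namespace cpu

// Universal function: dispatches to the implementation for the tensor's device.
void scale(ToyTensor& tensor, float factor) {
  switch(tensor.device) {
    case ToyDevice::CPU:
      cpu::scale(tensor, factor);
      return;
    case ToyDevice::GPU:
      throw std::runtime_error("GPU backend is not part of this sketch");
  }
}
```

Callers use only the universal scale function, so code written against it does not need to know which device holds the data.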
More specific documentation is available that describes the graph, and how its operators are implemented.
Model¶
marian/src
├── models
├── layers
└── rnn
The subdirectories above constitute the components of a Model. There are two main types of model:
IModel, which maps inputs to predictions
ICriterionFunction, which maps (inputs, references) to losses
These interfaces are sometimes combined. As an example, Trainer, an implementation of the ICriterionFunction interface used in training, contains an IModel member from which it then computes the loss.
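The way a criterion function can wrap a model is sketched below. The interfaces are heavily simplified stand-ins, not Marian's signatures: ToyIModel, ToyICriterionFunction, DoublingModel and ToyTrainer are hypothetical, inputs are vectors of doubles, and mean squared error is used as an arbitrary loss.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Simplified stand-in for IModel: maps inputs to predictions.
struct ToyIModel {
  virtual ~ToyIModel() = default;
  virtual std::vector<double> predict(const std::vector<double>& inputs) const = 0;
};

// Simplified stand-in for ICriterionFunction: maps (inputs, references) to a loss.
struct ToyICriterionFunction {
  virtual ~ToyICriterionFunction() = default;
  virtual double loss(const std::vector<double>& inputs,
                      const std::vector<double>& references) const = 0;
};

// A trivial model that doubles each input.
struct DoublingModel : ToyIModel {
  std::vector<double> predict(const std::vector<double>& inputs) const override {
    std::vector<double> out;
    for(double x : inputs)
      out.push_back(2.0 * x);
    return out;
  }
};

// Mirrors the arrangement described above: a criterion function that holds a
// model, obtains its predictions and turns them into a loss.
class ToyTrainer : public ToyICriterionFunction {
public:
  explicit ToyTrainer(std::shared_ptr<ToyIModel> model) : model_(std::move(model)) {}

  // Mean squared error; assumes references has one entry per input.
  double loss(const std::vector<double>& inputs,
              const std::vector<double>& references) const override {
    std::vector<double> predictions = model_->predict(inputs);
    if(predictions.empty())
      return 0.0;
    double sum = 0.0;
    for(std::size_t i = 0; i < predictions.size(); ++i) {
      double diff = predictions[i] - references[i];
      sum += diff * diff;
    }
    return sum / static_cast<double>(predictions.size());
  }

private:
  std::shared_ptr<ToyIModel> model_;
};
```

Keeping the model behind its own interface means the same model object can be reused for inference without the loss computation.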
An important specialisation of IModel is IEncoderDecoder, which specifies the interface for the EncoderDecoder class. EncoderDecoder consists of a set of encoder and decoder objects, which implement the interfaces of EncoderBase and DecoderBase, respectively. This composite object defines the behaviour of general encoder-decoder models. For instance, the s2s models implement an EncoderS2S and a DecoderS2S, while transformer models implement an EncoderTransformer and a DecoderTransformer. Both use cases are encapsulated in the EncoderDecoder framework. A new encoder-decoder model need only implement its encoder and decoder classes. The EncoderDecoder models are constructed using a factory pattern in src/models/model_factory.cpp.
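The composite-plus-factory arrangement can be sketched in a few lines. This is an illustration only and does not reproduce model_factory.cpp: every Toy-prefixed type and the createToyModel function are hypothetical, and the type strings "s2s" and "transformer" are taken from the model names mentioned above.

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-ins for the encoder and decoder base interfaces.
struct ToyEncoderBase {
  virtual ~ToyEncoderBase() = default;
  virtual std::string name() const = 0;
};
struct ToyDecoderBase {
  virtual ~ToyDecoderBase() = default;
  virtual std::string name() const = 0;
};

struct ToyEncoderS2S : ToyEncoderBase {
  std::string name() const override { return "EncoderS2S"; }
};
struct ToyDecoderS2S : ToyDecoderBase {
  std::string name() const override { return "DecoderS2S"; }
};
struct ToyEncoderTransformer : ToyEncoderBase {
  std::string name() const override { return "EncoderTransformer"; }
};
struct ToyDecoderTransformer : ToyDecoderBase {
  std::string name() const override { return "DecoderTransformer"; }
};

// Composite object holding a set of encoders and decoders.
struct ToyEncoderDecoder {
  std::vector<std::unique_ptr<ToyEncoderBase>> encoders;
  std::vector<std::unique_ptr<ToyDecoderBase>> decoders;
};

// Factory: the model type selects which encoder and decoder classes are used.
ToyEncoderDecoder createToyModel(const std::string& type) {
  ToyEncoderDecoder model;
  if(type == "s2s") {
    model.encoders.push_back(std::unique_ptr<ToyEncoderBase>(new ToyEncoderS2S()));
    model.decoders.push_back(std::unique_ptr<ToyDecoderBase>(new ToyDecoderS2S()));
  } else if(type == "transformer") {
    model.encoders.push_back(std::unique_ptr<ToyEncoderBase>(new ToyEncoderTransformer()));
    model.decoders.push_back(std::unique_ptr<ToyDecoderBase>(new ToyDecoderTransformer()));
  } else {
    throw std::invalid_argument("unknown model type: " + type);
  }
  return model;
}
```

Supporting another architecture in this sketch would mean adding one encoder class, one decoder class and one branch in the factory, while the composite type stays unchanged.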
The export of an ONNX-compliant model is handled by code here.
marian/src
└── onnx
Utility¶
marian/src
└── common
The common subdirectory contains many useful helper functions and classes.
Most of these fall under one of the following categories:
Command-line interface definition and Options object
Definitions, macros and typedefs
Filesystem and IO helpers
Logging
Memory management
Signal handling
Text manipulation
Type-based dispatching and properties
Beyond these areas, this folder also contains metadata, such as the program version, list of contributors, and the build flags used to compile it.
External Libraries¶
marian/src
└── 3rd_party
Many of the external libraries that Marian depends on are contained in 3rd_party.
These libraries are either copied into place here and version-controlled via the marian repository, or are included here as a submodule. Of these submodules, many have been forked and are maintained under the marian-nmt organisation.
Tests and Examples¶
marian/src
├── examples
└── tests
There are basic tests and examples contained in marian/src.
The unit tests cover basic graph functionality, checks on the output of operators, and the implementation of RNN attention, as well as IO of binary files and manipulation of the options structure.
The examples in this subdirectory demonstrate Marian’s functionality using common datasets: Iris and MNIST. The Iris example builds a simple dense feedforward network to perform a classification task. Over 200 epochs, it trains the network on the targets using mean cross-entropy. It then reports the accuracy of the model on the test set. The MNIST example showcases more advanced features of Marian. It offers a choice of models (FFNN, LeNet), can leverage multi-device environments and uses a validator during training. This example more closely replicates the workflow of a typical Marian model, with batching of data and a model implemented in terms of Marian’s model interfaces.
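The mean cross-entropy objective mentioned for the Iris example can be written out explicitly. The function below is a standalone sketch, not code from the example: meanCrossEntropy is a hypothetical name, and it assumes the network outputs have already been normalised into per-class probabilities.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Mean cross-entropy over a batch: the average of -log p(correct class).
// probabilities[i] holds the predicted class distribution for example i and
// labels[i] is the index of its correct class.
double meanCrossEntropy(const std::vector<std::vector<double>>& probabilities,
                        const std::vector<std::size_t>& labels) {
  if(probabilities.empty())
    return 0.0;
  double total = 0.0;
  for(std::size_t i = 0; i < probabilities.size(); ++i)
    total -= std::log(probabilities[i][labels[i]]);
  return total / static_cast<double>(probabilities.size());
}
```

The loss is zero only when every example assigns probability 1 to its correct class, and it grows as the probability of the correct class shrinks.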
marian
├── examples
└── regression-tests
Further tests and examples are contained in the root of the marian source code. The examples here are end-to-end tutorials on how to use Marian. These range from covering the basics of training a Marian model, to replicating the types of models presented at the Conference on Machine Translation (WMT).
Similarly, the tests in regression-tests are more numerous and detailed. They cover more than 250 areas of the code. While the unit tests described above check basic consistency of certain functions, the regression tests offer end-to-end verification of the functionality of Marian.