The purpose of this document is to outline the organisational structure of the Marian codebase. Each section covers one architectural component and highlights the subset of directories relevant to it.
marian/src
├── command
├── rescorer
├── training
└── translator
The Marian toolkit provides several commands, covering different modes of operation. These are:

- `marian`
- `marian-decoder`
- `marian-server`
- `marian-scorer`
- `marian-vocab`
- `marian-conv`

Each of these has a corresponding file in the `command` subdirectory. The `marian` command is capable of running all other modes (except server); see `marian-main.cpp` for the implementation. By default, it operates in `train` mode and corresponds to `marian-train.cpp`. Other modes may be accessed by calling `marian <X>` instead of `marian-<X>`.
Training is covered by the main `marian` command, with the relevant implementation details kept inside the `training` subdirectory. Translation is facilitated by code in the `translator` subdirectory and is handled by the `marian-decoder` command, as well as `marian-server`, which provides a web-socket service. `marian-scorer` is the tool used to re-score parallel inputs or n-best lists, and uses code in the `rescorer` subdirectory.
The remaining commands, `marian-vocab` and `marian-conv`, provide useful auxiliary functions. `marian-vocab` is a tool that creates a vocabulary file from a given text corpus; it uses components described in the Data section of this document. `marian-conv` converts Marian model files to the binary `.bin` format, and lexical shortlists to binary shortlists. It is also possible to use this command to emit an ONNX-compliant model representation. In addition to components defined in the Data section, it makes use of Model-specific components.
Finally, the implementation of the command-line interface for these commands is described in the Utility section.
marian/src
└── data
Data refers to the handling and representation of the text input to Marian. This consists of source code for the representation of the corpus, vocabulary and batches.
Internally, tokens are represented as indices, or `Words`; some indices are reserved for special tokens, such as `UNK`. Vocabulary implementations are responsible for encoding and decoding sentences to and from this internal representation, whether the vocabulary is defined by a SentencePiece model, Factors, or a plain-text/YAML vocabulary file.
This directory is also responsible for generating batches from a corpus and for performing any requested shuffling of the corpus or batches. Furthermore, when a shortlist is used, its behaviour is also defined here.
Once the batches are generated they are passed as input to the expression graph.
marian/src
├── functional
├── graph
├── optimizers
└── tensors
Marian implements a reverse-mode auto-differentiation computation graph; the relevant components reside in these subdirectories. The `graph` subdirectory concerns the structure of the graph and its nodes: operators, parameters and constants, as well as how to traverse it, both forwards and backwards. Moreover, it defines the APIs for the operations that the graph is able to perform. The `tensors` and `functional` subdirectories contain the implementations of operations for the graph.
One component of the `functional` subdirectory describes how functions operate on the underlying data types. This is a combination of standard operations on fundamental types and SIMD intrinsics on extended types, where available. The `functional` namespace also provides useful abstractions that enable generic formulas to be written. It defines variable-like objects `_1`, `_2`, such that `_1 * cos(_2)` represents the product of the argument at index 1 with the cosine of the argument at index 2.
The `tensors` subdirectory contains the definition of a tensor object. In Marian, a tensor is a piece of memory that is ascribed a shape and a type, and is associated with a backend (the compute device). This directory also contains the implementations of tensor operations on CPU and GPU, as well as universal functions that dispatch a call to the relevant device.
marian/src
├── models
├── layers
└── rnn
The subdirectories above constitute the components of a Model. There are two main types of model:

- `IModel`, which maps inputs to predictions
- `ICriterionFunction`, which maps (inputs, references) to losses

These interfaces are sometimes combined. As an example, `Trainer`, an implementation of the `ICriterionFunction` interface used in training, contains an `IModel` member from which it computes the loss.
An important specialisation of `IModel` is `IEncoderDecoder`; this specifies the interface for the `EncoderDecoder` class. An `EncoderDecoder` consists of a set of Encoder and Decoder objects, which implement the interfaces of `EncoderBase` and `DecoderBase`, respectively. This composite object defines the behaviour of general encoder-decoder models. For instance, the `s2s` models implement an `EncoderS2S` and a `DecoderS2S`, while `transformer` models implement an `EncoderTransformer` and a `DecoderTransformer`. These two use cases are both encapsulated in the `EncoderDecoder` framework, so new encoder-decoder models need only implement their encoder and decoder classes. `EncoderDecoder` models are constructed using a factory pattern.
marian/src
└── onnx

The export of an ONNX-compliant model is handled by code in the `onnx` subdirectory.
marian/src
└── common
The `common` subdirectory contains many useful helper functions and classes. The majority of these fall under one of the following categories:

- Command-line interface definitions and the `Options` object
- Definitions, macros and typedefs
- Filesystem and IO helpers
- Type-based dispatching and properties
Beyond these areas, this folder also contains metadata, such as the program version, list of contributors, and the build flags used to compile it.
marian/src
└── 3rd_party
Many of the external libraries that Marian depends on are contained in the `3rd_party` subdirectory. These libraries are either copied into place here and version-controlled via the Marian repository, or included as submodules. Many of these submodules have been forked and are maintained under the marian-nmt organisation.
Tests and Examples
marian/src
├── examples
└── tests
There are basic tests and examples contained in these subdirectories. The unit tests cover basic graph functionality, checks on the output of operators, and the implementation of RNN attention, as well as IO of binary files and manipulation of the options structure.
The examples in this subdirectory demonstrate Marian's functionality using common datasets: Iris and MNIST. The Iris example builds a simple dense feed-forward network to perform a classification task. Over 200 epochs, it trains the network on the target labels using mean cross-entropy, and then reports the accuracy of the model on the test set. The MNIST example showcases more advanced features of Marian: it offers a choice of models (FFNN, LeNet), can leverage multi-device environments, and uses a validator during training. This example more closely replicates the workflow of a typical Marian model, with batching of data and a model implemented in terms of Marian's model interfaces.
marian
├── examples
└── regression-tests
Further tests and examples are contained in the root of the marian source code. The examples here are end-to-end tutorials on how to use Marian. These range from covering the basics of training a Marian model, to replicating the types of models presented at the Conference on Machine Translation (WMT).
Similarly, the tests in `regression-tests` are more numerous and detailed, covering some 250+ areas of the code. While the unit tests described above check the basic consistency of certain functions, the regression tests offer end-to-end verification of Marian's functionality.