Class CorpusBase

Inheritance Relationships

Base Types

Derived Types

Class Documentation

class CorpusBase : public marian::data::DatasetBase<SentenceTuple, CorpusIterator, CorpusBatch>, public marian::data::RNGEngine

Subclassed by marian::data::Corpus, marian::data::CorpusNBest, marian::data::CorpusSQLite

Public Types

typedef SentenceTuple Sample

Public Functions

CorpusBase(Ptr<Options> options, bool translate = false, size_t seed = Config::seed)
CorpusBase(const std::vector<std::string> &paths, const std::vector<Ptr<Vocab>> &vocabs, Ptr<Options> options, size_t seed = Config::seed)
virtual ~CorpusBase()
virtual std::vector<Ptr<Vocab>> &getVocabs() = 0

Protected Functions

void initEOS(bool training)

Determine whether an EOS symbol should be added to each input stream.

void addWordsToSentenceTuple(const std::string &line, size_t batchIndex, SentenceTupleImpl &tup) const

Helper function that converts a line of text into words using the i-th vocabulary, where i is batchIndex, and adds them to the sentence tuple.
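As a rough illustration of the idea, the sketch below encodes a whitespace-tokenized line with the vocabulary selected by batchIndex and optionally appends an EOS id. ToyVocab, ToyTuple, addWordsToToyTuple, and the id values are hypothetical stand-ins invented for this example; they are not Marian's Vocab or SentenceTupleImpl, and the real tokenization may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a vocabulary: token -> id, with reserved ids.
struct ToyVocab {
  std::unordered_map<std::string, int> ids;
  int unkId = 1;  // id used for out-of-vocabulary tokens (assumed)
  int eosId = 0;  // id used for the end-of-sentence symbol (assumed)

  int lookup(const std::string& token) const {
    auto it = ids.find(token);
    return it == ids.end() ? unkId : it->second;
  }
};

// Hypothetical stand-in for a sentence tuple: one id sequence per stream.
using ToyTuple = std::vector<std::vector<int>>;

// Encode `line` with the vocabulary at `batchIndex` and append it to `tup`.
void addWordsToToyTuple(const std::string& line, std::size_t batchIndex,
                        const std::vector<ToyVocab>& vocabs,
                        const std::vector<bool>& addEOS, ToyTuple& tup) {
  assert(batchIndex < vocabs.size() && batchIndex < addEOS.size());
  const ToyVocab& vocab = vocabs[batchIndex];
  std::vector<int> words;
  std::istringstream in(line);
  std::string token;
  while (in >> token)  // split on whitespace
    words.push_back(vocab.lookup(token));
  if (addEOS[batchIndex])  // EOS is decided per input stream
    words.push_back(vocab.eosId);
  tup.push_back(words);
}
```

Passing the stream index explicitly keeps the helper stateless with respect to which vocabulary applies, which mirrors the batchIndex parameter in the signature above.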

void addAlignmentToSentenceTuple(const std::string &line, SentenceTupleImpl &tup) const

Helper function parsing a line with word alignments and adding them to the sentence tuple.
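A minimal sketch of parsing such a line follows. It assumes alignments are written as space-separated "source-target" index pairs (for example "0-0 1-2 2-1"); that format, and the names AlignPoint and parseAlignmentLine, are assumptions of this example rather than something this page specifies.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// One alignment point: (source word index, target word index).
using AlignPoint = std::pair<std::size_t, std::size_t>;

// Parse a line such as "0-0 1-2 2-1" into alignment points.
// The "src-tgt" text format is an assumption of this sketch.
std::vector<AlignPoint> parseAlignmentLine(const std::string& line) {
  std::vector<AlignPoint> points;
  std::istringstream in(line);
  std::string item;
  while (in >> item) {
    const std::size_t dash = item.find('-');
    // Reject items without a dash or with an empty side.
    if (dash == std::string::npos || dash == 0 || dash + 1 == item.size())
      throw std::invalid_argument("malformed alignment point: " + item);
    points.emplace_back(std::stoul(item.substr(0, dash)),
                        std::stoul(item.substr(dash + 1)));
  }
  return points;
}
```

An empty line yields an empty vector, so sentences without alignment points need no special handling by the caller.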

void addWeightsToSentenceTuple(const std::string &line, SentenceTupleImpl &tup) const

Helper function parsing a line of weights and adding them to the sentence tuple.
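For illustration, a weights line can be read as whitespace-separated numbers. The sketch below is hypothetical (parseWeightsLine is not a Marian function) and does not assume whether a line holds one sentence-level weight or one weight per word; it simply returns whatever numbers it finds.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Parse whitespace-separated numbers, e.g. "0.5 1 2.0", into floats.
// Reading stops at the first token that is not a number.
std::vector<float> parseWeightsLine(const std::string& line) {
  std::vector<float> weights;
  std::istringstream in(line);
  float w;
  while (in >> w)
    weights.push_back(w);
  return weights;
}
```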

void addAlignmentsToBatch(Ptr<CorpusBatch> batch, const std::vector<Sample> &batchVector)
void addWeightsToBatch(Ptr<CorpusBatch> batch, const std::vector<Sample> &batchVector)

Protected Attributes

std::vector<UPtr<std::istream>> files_
std::vector<Ptr<Vocab>> vocabs_
std::vector<bool> addEOS_

Determines whether an EOS symbol should be added.

By default this is true for every sequence, but it should be false for, e.g., classifier labels. It is set per input stream, hence a vector.
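The per-stream default can be pictured with the following sketch. The stream-type labels "sequence" and "class" and the function makeEosFlags are hypothetical, chosen only to illustrate one flag per input stream; this page does not say how the flags are actually derived.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Build one EOS flag per input stream from hypothetical stream-type labels.
// Every stream defaults to true; only those labelled "class" are switched off.
std::vector<bool> makeEosFlags(const std::vector<std::string>& streamTypes) {
  std::vector<bool> addEOS(streamTypes.size(), true);
  for (std::size_t i = 0; i < streamTypes.size(); ++i)
    if (streamTypes[i] == "class")
      addEOS[i] = false;
  return addEOS;
}
```

With two text streams followed by a label stream, the result would be true, true, false, so only the label stream skips the EOS symbol.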

size_t pos_ = {0}
size_t maxLength_ = {0}
bool maxLengthCrop_ = {false}
bool rightLeft_ = {false}
bool tsv_ = {false}
size_t tsvNumInputFields_ = {0}
int weightFileIdx_ = {-1}

Index of the file with weights in paths_ and files_; -1 means no weights file provided.

int alignFileIdx_ = {-1}

Index of the file with alignments in paths_ and files_; -1 means no alignment file provided.

Protected Static Functions

size_t getNumberOfTSVInputFields(Ptr<Options> options)

Determine the number of fields from the TSV input that are associated with vocabs, i.e., excluding fields that contain alignments or weights.
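The counting rule can be sketched as below. Unlike the real function, which takes Ptr<Options>, this hypothetical version takes the total field count and the alignment and weights indices directly, using -1 for "not provided" in the same way as alignFileIdx_ and weightFileIdx_. The names splitTsv and countVocabFields are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Split one TSV line into its tab-separated fields.
std::vector<std::string> splitTsv(const std::string& line) {
  std::vector<std::string> fields;
  std::istringstream in(line);
  std::string field;
  while (std::getline(in, field, '\t'))
    fields.push_back(field);
  return fields;
}

// Count the fields that carry text for a vocabulary. An index of -1 means
// the corresponding alignment or weights field is absent.
std::size_t countVocabFields(std::size_t totalFields, int alignIdx, int weightIdx) {
  const std::size_t extras = (alignIdx >= 0 ? 1u : 0u) + (weightIdx >= 0 ? 1u : 0u);
  assert(extras <= totalFields);
  return totalFields - extras;
}
```

For a line with a source field, a target field, and an alignment field at index 2, countVocabFields(3, 2, -1) leaves two vocabulary-backed fields.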