Class CorpusBatch

Inheritance Relationships

Base Type
  • public marian::data::Batch

Derived Type
  • public marian::data::BertBatch

Class Documentation

class CorpusBatch : public marian::data::Batch

Batch of source(s) and target sentences with additional information, such as guided alignments and sentence or word-level weighting.

Subclassed by marian::data::BertBatch

Public Functions

CorpusBatch(const std::vector<Ptr<SubBatch>> &subBatches)
Ptr<SubBatch> operator[](size_t i) const

Access i-th subbatch storing a source or target sentence.

The order of subbatches is: 1st source sentence, 2nd source sentence, …, target sentence.

Return

Pointer to the requested element.

Parameters
  • i: position of the element to return

Ptr<SubBatch> front()

Access the first subbatch, i.e. the source sentence.

Ptr<SubBatch> back()

Access the last subbatch, i.e. the target sentence.
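The subbatch ordering and the front()/back() accessors can be sketched with stand-in types. This is a minimal illustration, not Marian's implementation: MockSubBatch, MockCorpusBatch and makeExampleBatch are hypothetical names, and std::shared_ptr is used here in place of Ptr.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for marian::data::SubBatch; it only carries a label.
struct MockSubBatch {
  std::string label;
};

// Hypothetical stand-in for CorpusBatch, reduced to subbatch access.
class MockCorpusBatch {
public:
  explicit MockCorpusBatch(std::vector<std::shared_ptr<MockSubBatch>> subBatches)
      : subBatches_(std::move(subBatches)) {
    assert(!subBatches_.empty());  // front() and back() need at least one subbatch
  }

  // Order: 1st source, 2nd source, ..., target.
  std::shared_ptr<MockSubBatch> operator[](std::size_t i) const { return subBatches_[i]; }
  std::shared_ptr<MockSubBatch> front() const { return subBatches_.front(); }
  std::shared_ptr<MockSubBatch> back() const { return subBatches_.back(); }
  std::size_t sets() const { return subBatches_.size(); }

private:
  std::vector<std::shared_ptr<MockSubBatch>> subBatches_;
};

// Builds a batch with two source streams and one target stream.
inline MockCorpusBatch makeExampleBatch() {
  std::vector<std::shared_ptr<MockSubBatch>> subs;
  subs.push_back(std::make_shared<MockSubBatch>(MockSubBatch{"source-1"}));
  subs.push_back(std::make_shared<MockSubBatch>(MockSubBatch{"source-2"}));
  subs.push_back(std::make_shared<MockSubBatch>(MockSubBatch{"target"}));
  return MockCorpusBatch(std::move(subs));
}
```

With this layout, front() and operator[](0) refer to the first source stream, and back() always refers to the target stream regardless of how many sources there are.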

size_t size() const

The number of sentences in the batch.

size_t words(int which = 0) const

The total number of words in the batch (not counting masked-out words).

Pass which=0 for source words and -1 for target words.
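How the which argument might select a subbatch can be sketched as follows. This assumes that negative values index from the end of the subbatch list (so -1 selects the target) and that masked-out positions carry a mask value of 0; MaskedSubBatch and countWords are hypothetical names, not part of Marian.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a subbatch: only the padding mask is kept.
// mask[k] is 1.0f for a real word and 0.0f for a masked-out (padding) position.
struct MaskedSubBatch {
  std::vector<float> mask;

  // Counts the positions that are not masked out.
  std::size_t batchWords() const {
    std::size_t n = 0;
    for (float m : mask)
      if (m != 0.0f)
        ++n;
    return n;
  }
};

// Sketch of words(which). Assumption: a negative index counts from the end,
// so which = -1 selects the last (target) subbatch.
inline std::size_t countWords(const std::vector<MaskedSubBatch>& subBatches, int which = 0) {
  assert(!subBatches.empty());
  const int n = static_cast<int>(subBatches.size());
  const int idx = which >= 0 ? which : which + n;
  assert(idx >= 0 && idx < n);
  return subBatches[static_cast<std::size_t>(idx)].batchWords();
}
```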

size_t width() const

The width of the source mini-batch (presumably the maximum sentence length, with shorter sentences padded to it).

size_t sizeTrg() const

The number of sentences in the target subbatch.

size_t wordsTrg() const

The total number of target words in the batch (not counting masked-out words).

size_t widthTrg() const

The target width (=max length) of the mini-batch.
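The relation between batch size, width and word count can be illustrated with a small padding sketch. It assumes the width equals the longest sentence length and that the mask is stored as a flat, time-major array (index t * size + s); both the layout and the names PaddedBatch and padLengths are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical padded stream: size sentences, each padded to width positions.
struct PaddedBatch {
  std::size_t size = 0;     // number of sentences
  std::size_t width = 0;    // length of the longest sentence
  std::vector<float> mask;  // size * width entries: 1 = word, 0 = padding
};

// Builds the padding mask for sentences of the given lengths.
inline PaddedBatch padLengths(const std::vector<std::size_t>& lengths) {
  assert(!lengths.empty());
  PaddedBatch b;
  b.size = lengths.size();
  b.width = *std::max_element(lengths.begin(), lengths.end());
  b.mask.assign(b.size * b.width, 0.0f);
  for (std::size_t s = 0; s < b.size; ++s)
    for (std::size_t t = 0; t < lengths[s]; ++t)
      b.mask[t * b.size + s] = 1.0f;  // assumed time-major layout
  return b;
}
```

For sentence lengths 3, 5 and 2, the sketch gives a size of 3 and a width of 5, so 15 positions in total, of which only 10 are real words; the remaining 5 are the masked-out positions that the word counts above exclude.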

size_t sets() const

The number of source and target subbatches.

std::vector<Ptr<Batch>> split(size_t n, size_t sizeLimit)

Splits the batch into batches of equal size (except possibly the last).

Return

Vector of pointers to the new sub-batches (or nullptr entries where the sub-batches have run out).

See

marian::data::SubBatch::split(size_t n)

Parameters
  • n: number of sub-batches to split into

  • sizeLimit: Clip batch content to the first sizeLimit sentences in the batch
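The interaction of n and sizeLimit can be sketched by computing which sentence ranges each piece would cover. This assumes ceiling-division chunking after clipping to sizeLimit; the actual implementation may differ, and splitRanges is a hypothetical helper, not a Marian function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Returns half-open [begin, end) sentence ranges, one per resulting piece.
// Fewer than n ranges are returned when the sentences run out early.
inline std::vector<std::pair<std::size_t, std::size_t>>
splitRanges(std::size_t batchSize, std::size_t n, std::size_t sizeLimit) {
  assert(n > 0);
  const std::size_t size = std::min(batchSize, sizeLimit);  // keep the first sizeLimit sentences
  const std::size_t chunk = (size + n - 1) / n;             // ceil(size / n)
  std::vector<std::pair<std::size_t, std::size_t>> ranges;
  for (std::size_t pos = 0; pos < size; pos += chunk)
    ranges.emplace_back(pos, std::min(pos + chunk, size));
  return ranges;
}
```

Under these assumptions, splitting 10 sentences into 3 pieces gives ranges of 4, 4 and 2 sentences. Splitting 10 sentences into 4 pieces with a sizeLimit of 6 gives only three pieces of 2 sentences each, which corresponds to the case where the returned vector would hold a nullptr for the missing fourth piece.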

const std::vector<WordAlignment> &getGuidedAlignment() const
void setGuidedAlignment(std::vector<WordAlignment> &&aln)
std::vector<float> &getDataWeights()
void setDataWeights(const std::vector<float> &weights)
void debug(bool printIndices = false)

Prints the batch in a readable form on stderr for debugging.

Public Static Functions

static Ptr<CorpusBatch> fakeBatch(const std::vector<size_t> &lengths, const std::vector<Ptr<Vocab>> &vocabs, size_t batchSize, Ptr<Options> options)

Creates a batch filled with fake data.

Used to determine the size of the batch object. With guided-alignments and multiple encoders, those multiple source streams are expected to have the same lengths.

Return

Fake batch of the same size as the real batch.

Parameters
  • lengths: List of subbatch sizes.

  • batchSize: Number of sentences in the batch.

  • options: Options with “guided-alignment” and “data-weighting”.

Protected Attributes

std::vector<Ptr<SubBatch>> subBatches_
std::vector<WordAlignment> guidedAlignment_
std::vector<float> dataWeights_