Configuration

We provide an example pipeline which performs data preparation, feature extraction, model training, tagging of the test data, and evaluation of the results. Users do not have to use this pipeline or the configuration file format described below; the functions and classes provided in Marmot can be used in any other way. However, the tool has a set of utilities designed to organise the experiment and a config file format that helps to use them efficiently.

The pipeline is managed with the config file. It has blocks corresponding to each of the experiment stages. By editing their content the user defines which features should be extracted from the data and which training algorithms should be used.

Pipeline

The pipeline consists of the following stages:

Parsing

Alternatively, the data files can be parsed directly into context objects, if no additional representations are needed, or if all the additional representations are already provided in plain-text sentence-aligned files.

Config file format

YAML file format

The config file is written in [YAML format](http://www.yaml.org/). A YAML document can be parsed by the Python yaml module into a dictionary object. This dictionary can be nested, i.e. values themselves can be dictionaries or lists of dictionaries.
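For instance, a small config fragment in the style used below can be loaded into a nested dictionary with the yaml module (PyYAML, assumed to be installed; the module path and file names are placeholders):

```python
import yaml  # PyYAML, assumed to be installed

text = """
training:
  - module: some.module.path
    args:
      - file1
      - file2
"""

# safe_load turns the YAML document into nested dicts and lists
config = yaml.safe_load(text)
print(config['training'][0]['module'])  # 'some.module.path'
print(config['training'][0]['args'])    # ['file1', 'file2']
```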

The simplest example of a YAML file entry is a key and a value separated by a colon:

key: value

The value can be a list of strings; in this case every item should start with a dash:

key:
  - value1
  - value2
  - value3

Analogously, a value can be a dictionary:

key:
  inner_key1: value1
  inner_key2: value2
  inner_key3: value3

For a more detailed description of the YAML format see the [YAML guide](http://www.yaml.org/spec/1.2/spec.html).

Declaration of a module

One of the core features of the config format is the ability to specify objects in a form that allows them to be loaded and called directly during script execution. That is, if a new representation generator is created, no changes to the pipeline code are required to use the new object; it only has to be specified in the configuration file.

A new object can be included in the pipeline by declaring the module that contains it.

Declaring a module means that the script will load a Python class and create an instance of this class with the provided parameters. The keyword module gives the path to the Python class. It can belong to any Python library and does not need to be in the Marmot directory. The keyword args is the list of arguments needed to initialise this class.

  - module: marmot.representations.pos_representation_generator.POSRepresentationGenerator
    args:
      - tiny_test/tree-tagger
      - tiny_test/english-utf8.par
      - 'source'
      - tiny_test/tmp_dir

The example above will create an instance of POSRepresentationGenerator, which is located in /MARMOT_HOME/marmot/representations/pos_representation_generator.py. This declaration is equivalent to the following Python code:

from marmot.representations.pos_representation_generator import POSRepresentationGenerator
r = POSRepresentationGenerator('tiny_test/tree-tagger', 'tiny_test/english-utf8.par', 'source', 'tiny_test/tmp_dir')
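The general mechanism behind such declarations can be sketched in a few lines of Python; the helper name build_object below is hypothetical and stands in for Marmot's actual loader, and a standard-library class is used for the demonstration instead of a Marmot one:

```python
import importlib

def build_object(declaration):
    """Instantiate the class named in a 'module' declaration.

    `declaration` is a dict with a 'module' path and an optional
    'args' list, mirroring the config entries shown above.
    """
    module_path, class_name = declaration['module'].rsplit('.', 1)
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(*declaration.get('args', []))

# Demonstration with a standard-library class:
counter = build_object({'module': 'collections.Counter', 'args': ['banana']})
print(counter['a'])  # 3
```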

The args format does not support keyword arguments, so all arguments should be declared in the order they appear in the __init__ function of the class. However, trailing arguments with default values can be omitted. For example, in order to create a LogisticRegression classifier object with the penalty 'l1', you need to write the following in the config:

- module: sklearn.linear_model.LogisticRegression
  args:
    - 'l1'

Compare this notation with the full list of parameters:

sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None)
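This positional mapping can be illustrated with Python's inspect module: positional arguments fill parameters in declaration order, and omitted parameters keep their defaults. The function below is a trimmed stand-in for the real sklearn signature:

```python
import inspect

# Trimmed stand-in for the LogisticRegression signature shown above.
def make_classifier(penalty='l2', dual=False, tol=0.0001, C=1.0):
    return {'penalty': penalty, 'dual': dual, 'tol': tol, 'C': C}

# The config entry `args: ['l1']` corresponds to one positional argument:
bound = inspect.signature(make_classifier).bind('l1')
bound.apply_defaults()

print(bound.arguments['penalty'])  # 'l1' -- first parameter, filled positionally
print(bound.arguments['C'])        # 1.0  -- omitted, keeps its default
```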

If the initialisation has no parameters, no args keyword is needed:

module: sklearn.linear_model.LogisticRegression

Declaration of a function

An argument of a module can be the output of a function. A function is declared analogously to a class. If the value of a variable needs to be computed by a function, this variable should be declared in the config with two fields: func and args. func is the path to the function (it can be any Python function), and args is the list of arguments for this function (it can be omitted if the function is called without arguments).

interesting_tokens:
    func: marmot.preprocessing.parsers.extract_important_tokens_wmt
    args:
      - word_level_quality_estimation/data/en_es/EN_ES.tgt_ann.test
      - 1

This declaration will invoke the execution of the following code:

from marmot.preprocessing.parsers import extract_important_tokens_wmt
interesting_tokens = extract_important_tokens_wmt('word_level_quality_estimation/data/en_es/EN_ES.tgt_ann.test', 1)

An argument of a function or a class can itself be a function. In that case the type of the argument should be declared as function_output:

module: marmot.util.corpus_context_creator.CorpusContextCreator
args:
    - type: function_output
      func: marmot.preprocessing.parsers.parse_wmt14_data
      args:
        - word_level_quality_estimation/data/en_es/EN_ES.tgt_ann.train
        - word_level_quality_estimation/data/en_es/EN_ES.source.train
        - type: function_output
          func: marmot.preprocessing.parsers.extract_important_tokens
          args:
            - word_level_quality_estimation/data/en_es/EN_ES.tgt_ann.test
            - 1

In this example the class CorpusContextCreator is created with one argument, which is the output of the function parse_wmt14_data. This function is called with three arguments, the last of which is the output of the function extract_important_tokens. As this example shows, the declaration of a function can be recursive, i.e. its arguments can themselves be outputs of functions. In this case the parser will go through the whole function tree, calling the functions and returning their values where appropriate.
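The recursive resolution described above can be sketched as follows. The helper resolve_value is hypothetical, not Marmot's actual implementation, and standard-library functions stand in for the Marmot parsers:

```python
import importlib

def _load(path):
    # Import a dotted path like 'os.path.join' and return the attribute.
    module_path, name = path.rsplit('.', 1)
    return getattr(importlib.import_module(module_path), name)

def resolve_value(value):
    """Recursively resolve a config value.

    Dicts tagged with type: function_output are replaced by the result
    of calling the named function on their (recursively resolved) args.
    """
    if isinstance(value, dict) and value.get('type') == 'function_output':
        func = _load(value['func'])
        args = [resolve_value(a) for a in value.get('args', [])]
        return func(*args)
    return value

# Nested declaration using standard-library functions for illustration:
decl = {'type': 'function_output',
        'func': 'os.path.join',
        'args': ['data',
                 {'type': 'function_output',
                  'func': 'os.path.basename',
                  'args': ['corpus/train.txt']}]}
print(resolve_value(decl))  # 'data/train.txt' on POSIX systems
```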

Expressing the pipeline in the config

The config file has several blocks that correspond to experiment stages:

Datasets

This block lists the datasets to be used in the experiment and the objects needed to load them. The block has to contain sublists, each of which stands for a specific part of the data (training, test, development, etc.). The "training" and "test" sections are compulsory in the example experiment run provided. Each dataset needs the declaration of a module:

training:
    - module: marmot.representations.wmt_representation_generator.WMTRepresentationGenerator
      args:
        - tiny_test/EN_ES.tgt_ann.train
        - tiny_test/EN_ES.source.train
        - tiny_test/tmp_dir
        - False

Several modules can be declared in each section. There can be any number of sections.

If any additional representations of the data are needed, the data needs to be parsed with subclasses of RepresentationGenerator, which return a dictionary of internal representations. If the data can be parsed directly into context objects, subclasses of the Parser class can be used.

If there is no representation generator for the data format you want to use, you should create a new one.

Representation Generators

In this block the additional representation generators are declared. The declaration uses the same module notation.

While the generators listed in the datasets block should take data files as inputs and return the internal representation of the data, the additional generators take the internal representations and extend them.

These representation generators are applied to all datasets declared in the datasets block.

All generators declared in this section operate on the data generated earlier by the generators from the datasets block. So all additional generators have prerequisites: they need particular data fields to generate a representation (e.g. AlignmentDataGenerator needs both source and target to exist). If a needed data field is missing, the generator will throw an error.

Feature extractors

The feature extractors are declared using the same module notation. Every extractor takes a context object (an object that contains a training instance and all information about it) and extracts one or more features from it. Like representation generators, they look for particular data fields in the context object and throw an error if they are not able to find them. Some data fields can be generated on the fly: e.g. if there is no part-of-speech tagging for a given example, but the POSFeatureExtractor is given a POS-tagger and parameters for it, it will generate the tagging and extract the POS features. However, this is usually inefficient. First of all, the features are extracted for each word, whereas the additional representations are usually generated for the whole sentence. Therefore, if a sentence contains 20 words and the POSFeatureExtractor is declared with no POS representation generated in advance, the sentence will be tagged 20 times. Secondly, when calling external tools, much time is spent on the call itself, so calling Tree-Tagger once for 100 sentences (as a representation generator does) is much faster than calling it 100 times for one sentence (which a feature extractor will do).

So acquiring all the needed representations in advance (by declaring the relevant representation generators) saves a lot of time.

List of Variables

There is a set of variables that should be declared in the config file: