Feature Extractors

Feature extractors take a context object as input and output a list of feature values. For NLP tasks, feature values are typically scalars or strings. If your feature values are not scalars, you will most likely need to binarize them (e.g. with one-hot encoding) to convert them to numbers before learning models with them.
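
For example, string-valued features can be turned into numeric indicator features by one-hot encoding. The sketch below uses scikit-learn's DictVectorizer; the feature names and values are made up for illustration, and this is not necessarily how Marmot handles binarization internally:

from sklearn.feature_extraction import DictVectorizer
# Each instance maps feature names to values; string values are expanded
# into one-hot indicator columns, numeric values are kept as-is.
instances = [
    {'token_length': 5, 'aligned_source_word': 'Haus'},
    {'token_length': 2, 'aligned_source_word': 'der'},
]
vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(instances)   # numeric matrix usable by any learner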

The features that are extracted for a particular experiment should be configured in the Marmot experiment config.

Interface

All feature extractors have the same two methods, with identical parameters, so they can be called uniformly over a list of feature extractors without any special handling:

features = []
for f_e in feature_extractors:
    # each extractor contributes its own feature values for the same context object
    features.extend(f_e.get_features(context_obj))
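
As an illustration, a custom extractor that fits into the loop above might look like the sketch below. The class name and the context field it reads are made up, and only the get_features() method used above is shown:

class TokenLengthFeatureExtractor(object):
    # Illustrative extractor: returns a single feature, the length of the target token.
    def get_features(self, context_obj):
        # 'token' is an assumed field name of the context object
        return [len(context_obj['token'])]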

Available feature extractors

Alignment features extracted (a sketch of their computation follows the list):

* source word aligned to the target token. If the target token is aligned to multiple words, they are concatenated.
* the left context of the source word aligned to the target token. In case of multiple alignments the context of the leftmost aligned source word is returned.
* the right context of the source word aligned to the target token. In case of multiple alignments the context of the rightmost aligned source word is returned.
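
A sketch of how these alignment features might be computed, assuming the context object carries the source sentence and a list of source indices aligned to the target token (all field names and the concatenation separator are illustrative):

def alignment_features(context_obj):
    source = context_obj['source']               # source sentence as a list of tokens (assumed field)
    aligned = sorted(context_obj['alignments'])  # source indices aligned to the target token (assumed field)
    if not aligned:
        return ['__unaligned__', '__unaligned__', '__unaligned__']
    aligned_word = '_'.join(source[i] for i in aligned)   # concatenate multiple aligned words
    left_idx = aligned[0] - 1                    # left context of the leftmost aligned word
    right_idx = aligned[-1] + 1                  # right context of the rightmost aligned word
    left = source[left_idx] if left_idx >= 0 else '<s>'
    right = source[right_idx] if right_idx < len(source) else '</s>'
    return [aligned_word, left, right]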

Binary token features extracted (see the sketch after the list):

* token is a stopword (1 if it is a stopword, 0 if it is not)
* token is punctuation
* token is a proper noun
* token is a number
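
A sketch of these binary checks for a single token. The stopword list, the punctuation test, and the proper-noun test via a Penn Treebank POS tag are illustrative choices, not necessarily what Marmot uses:

import string
STOPWORDS = {'the', 'a', 'an', 'of', 'and'}      # illustrative stopword list

def binary_token_features(token, pos_tag):
    # pos_tag is assumed to be supplied alongside the token
    is_stopword = int(token.lower() in STOPWORDS)
    is_punctuation = int(all(ch in string.punctuation for ch in token))
    is_proper_noun = int(pos_tag in ('NNP', 'NNPS'))
    is_number = int(token.replace(',', '').replace('.', '', 1).isdigit())
    return [is_stopword, is_punctuation, is_proper_noun, is_number]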

Pseudo-reference feature extracted (see the sketch below):

* binary feature indicating whether the token occurs or does not occur in the pseudo-reference
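
A minimal sketch, assuming the context object stores the target token and the pseudo-reference as a list of tokens under illustrative field names:

def pseudo_reference_feature(context_obj):
    # 1 if the target token occurs anywhere in the pseudo-reference, 0 otherwise
    return [int(context_obj['token'] in set(context_obj['pseudo_ref']))]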

Language model features extracted (a sketch of the context-match features follows the list):

* length of the longest sequence of the token's left context (i.e. the longest n-gram ending with the token) that occurs in the LM.
* length of the longest sequence of the token's right context (i.e. the longest n-gram starting with the token) that occurs in the LM.
* backoff behaviour features as described in [Raybaud et al. 2011](http://perso.crans.org/raybaud/cm-springer-utf8.pdf):
    * for the n-gram w<sub>i-2</sub> w<sub>i-1</sub> w<sub>i</sub>
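
A sketch of how the longest left/right context match could be computed, assuming the language model is represented as a set of n-gram tuples and the sentence and token index are given; the representation and argument names are illustrative, and Marmot's actual LM interface may differ:

def longest_lm_match(tokens, index, lm_ngrams, max_order=3, side='left'):
    # lm_ngrams: set of n-gram tuples known to the language model (assumed representation)
    # Grow the sequence ending at the token (side='left') or starting at it (side='right')
    # and return the length of the longest sequence found in the LM.
    best = 0
    for n in range(1, max_order + 1):
        if side == 'left':
            start, end = index - n + 1, index + 1
        else:
            start, end = index, index + n
        if start < 0 or end > len(tokens):
            break
        ngram = tuple(tokens[start:end])
        if ngram in lm_ngrams:
            best = n
        else:
            break
    return best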