Maximum entropy language models: SRILM extension

This extension adds support for training and applying maximum entropy (MaxEnt) language models to the SRILM toolkit. Currently, only N-gram features are supported.

As of SRILM 1.7.1, the extension is included in the main SRILM distribution – no patching is necessary. The documentation below covers enabling, compiling and using the extension as shipped with the main distribution. Note to users of the previous patch: some command line options have changed!

The extension uses a hierarchical parameter estimation procedure (Wu and Khudanpur 2002) to keep training time and memory consumption feasible for moderately large training data (hundreds of millions of words). Experiments indicate that models trained with our implementation perform as well as or better than N-gram models built with interpolated Kneser-Ney discounting.

Please cite the following paper when you use this extension in your research:

 
@inproceedings{srilm-me,
  author    = {Alum\"{a}e, Tanel  and  Kurimo, Mikko},
  title     = {Efficient Estimation of Maximum Entropy Language Models with {N}-gram features: an {SRILM} extension},
  booktitle = {Proceedings of Interspeech 2010},
  month     = {September},
  year      = {2010},
  address   = {Chiba, Japan}
}

News

  • 2013-03-08: documented how to enable the extension that is now part of SRILM
  • 2013-02-06: added support for converting MaxEnt models to ARPA format (thanks to Josef Novak for pointing out the algorithm by Jun Wu). Also updated the patch for SRILM 1.7.0
  • 2012-04-12: added support for lattice-tool; also fixed a memory deallocation bug (thanks to Lane Schwartz) that didn't actually harm the main functionality
  • 2012-03-20: updated documentation concerning model adaptation
  • 2011-12-28: updated patch for SRILM 1.6.0
  • 2010-10-21: added a note on fixing compilation issues
  • 2010-10-20: added a note on compiling under an x86_64 system
  • 2010-10-04: the patch didn't work, fixed it now (thanks to Anoop Deoras)

Installation

  • The extension uses libLBFGS for parameter estimation. Download and install it (make, make install) before proceeding with the following steps; a condensed sketch of the whole build procedure is given after this list. If you plan to train MaxEnt models on a large corpus, you can optionally build libLBFGS with floats instead of doubles, which saves a lot of RAM. To do this, before compiling libLBFGS, open the file include/lbfgs.h (in the libLBFGS directory) and change the following line
#define LBFGS_FLOAT    64

to

#define LBFGS_FLOAT    32
  • Download SRILM (1.7.1 or newer) and unpack.
  • Change into the SRILM main directory (the one with README, RELEASE)
  • In the file common/Makefile.machine.<arch> (e.g. common/Makefile.machine.i686-m64 if you use 64-bit Linux), add the following flag:
HAVE_LIBLBFGS = 1
  • Continue compiling SRILM as usual. Note that when compiling under an x86-64 system (also known as amd64), SRILM tends to produce 32-bit binaries by default and cannot link against a 64-bit libLBFGS. To fix this, set the following in the main SRILM Makefile:
MACHINE_TYPE := i686-m64
  • If you installed libLBFGS under /usr/local, SRILM should find the libLBFGS include and library files automatically. However, if you don't have root privileges and installed libLBFGS under your home directory (e.g. by using ./configure --prefix=$HOME), you may have to modify the SRILM makefiles so that SRILM can find libLBFGS. For example, if you are compiling under i686-m64, modify common/Makefile.machine.i686-m64 and change the following lines:
# Other useful include directories.
ADDITIONAL_INCLUDES = -I$(HOME)/include

# Other useful linking flags.
ADDITIONAL_LDFLAGS = -L$(HOME)/lib
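
For reference, here is a condensed sketch of the whole build procedure under the assumptions above (64-bit Linux, g++, libLBFGS installed under $HOME; the libLBFGS version number and archive names are illustrative):

# Build libLBFGS with single-precision floats (optional; saves RAM on large corpora).
tar xzf liblbfgs-1.10.tar.gz
cd liblbfgs-1.10
# Adjust the pattern if the spacing in your lbfgs.h differs.
sed -i 's/LBFGS_FLOAT    64/LBFGS_FLOAT    32/' include/lbfgs.h
./configure --prefix=$HOME    # install under $HOME if you lack root privileges
make && make install
cd ..

# Build SRILM with the MaxEnt extension enabled.
# The SRILM tarball unpacks into the current directory, so create one first.
mkdir srilm && cd srilm
tar xzf ../srilm-1.7.1.tar.gz
echo 'HAVE_LIBLBFGS = 1' >> common/Makefile.machine.i686-m64
# If libLBFGS is under $HOME, also set ADDITIONAL_INCLUDES/ADDITIONAL_LDFLAGS as shown above.
make SRILM=$PWD MACHINE_TYPE=i686-m64 World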

Usage

Training MaxEnt models

To train MaxEnt models, add the -maxent option to ngram-count (note that this is slightly different from the way the patch used to work!), e.g.:

ngram-count -text train.txt -vocab train.vocab -maxent -lm train.hme.gz -order 3 -debug 2

The MaxEnt model will be saved in a dedicated format to the file train.hme.gz ('hme' stands for hierarchical maximum entropy). All N-grams up to the specified order that occur in the training data are used as features. Currently, it is not possible to apply a feature cutoff.

Applying MaxEnt models

Use ngram with -maxent option, e.g.:

ngram -maxent -lm train.hme.gz -ppl test.txt

You can also apply interpolated MaxEnt models: the -mix-maxent option specifies that the additional LMs are MaxEnt models. E.g., you can mix two MaxEnt models:

ngram -maxent -lm train.hme.gz -mix-maxent -mix-lm train2.hme.gz -ppl test.txt -bayes 0

Or, you can mix a MaxEnt model with an ARPA model – simply omit the -mix-maxent option:

ngram -maxent -lm train.hme.gz -mix-lm train.arpa -ppl test.txt -bayes 0

Be sure to add the -bayes 0 switch (as in the above examples), which triggers dynamic interpolation of language models (MaxEnt models cannot be interpolated statically).
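
The weight of the main model in the interpolation can be set with ngram's standard -lambda option (default 0.5); the remaining weight goes to the -mix-lm model. A minimal sketch, assuming a weight of 0.7 for the MaxEnt model:

ngram -maxent -lm train.hme.gz -mix-lm train.arpa -lambda 0.7 -bayes 0 -ppl test.txt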

Rescoring lattices

You can use lattice-tool to rescore lattices with MaxEnt models. Just add the -maxent switch, as when using ngram. The -mix-maxent switch should also work. Note that lattices rescored with MaxEnt models tend to be much bigger than lattices created with N-gram models, and you may even run out of memory during rescoring. One way to overcome this problem is to prune the lattices before rescoring; this can be done in the same invocation of lattice-tool as the rescoring. For example:

lattice-tool \
  -in-lattice in/test.lat \
  -read-htk \
  -maxent \
  -lm test.3g.hme.gz \
  -out-lattice out/test.lat \
  -write-htk \
  -posterior-prune 0.001 

Adapting

MaxEnt models are well suited for adaptation. Suppose you have a large corpus of background data and a small corpus of adaptation data (a very common practical situation). You can then first train a model on the background data and adapt it to the in-domain data. However, this is a bit tricky: since it is currently not possible to add new features during adaptation, the features that occur only in the in-domain data must already be present in the background model. To achieve this, include the adaptation data with weight 0 when training the background model (see the note below), using the -text-has-weights option of ngram-count. I do it as follows, but there are probably more elegant ways to do this:

( cat background.txt | perl -npe '$_="1 " . $_;'; \
  cat adaptation.txt | perl -npe '$_="0 " . $_;';) | \
  ngram-count -text-has-weights -text - -vocab train.vocab -maxent -lm background.hme.gz -debug 2

Now, adapt the background model with the adaptation data, using the -init-lm option (NB! it used to be called -maxent-prior in the patch):

ngram-count -text adaptation.txt -vocab train.vocab -maxent -lm adapted.hme.gz -init-lm background.hme.gz -debug 2


Update: newer experiments have shown that including the in-domain data with zero weight (as opposed to full weight) when building the background model can cause numerical problems when adapting the model. Furthermore, including the in-domain data with full weight when building the background model often produces slightly better adapted models. My updated advice for building the background model is therefore to use a simple approach like this:

cat background.txt adaptation.txt | \
ngram-count -text - -vocab train.vocab -maxent -lm background.hme.gz -debug 2

The adaptation step remains the same.
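
To check that adaptation actually helped, you can compare the perplexities of the background and the adapted model on a held-out in-domain set (indomain-heldout.txt is a hypothetical file name):

ngram -maxent -lm background.hme.gz -ppl indomain-heldout.txt
ngram -maxent -lm adapted.hme.gz -ppl indomain-heldout.txt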

Tuning

MaxEnt models are estimated using the OWL-QN algorithm (Andrew and Gao, 2007) with L1 and L2 regularisation. Two options are available for tuning MaxEnt model training:

  • -maxent-alpha: the constant for L1 regularisation (default: 0.5)
  • -maxent-sigma2: the constant for L2 regularisation (default: 6 for estimation, 0.5 for adaptation)

The default values alpha=0.5 and \sigma^2=6 were empirically found by Chen (2009) to be close to optimal across a wide range of settings, but of course you can try other values. We have found that smaller values of \sigma^2 work better for adaptation, hence the adaptation default of 0.5.
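
In other words, the training objective is (roughly) the penalised log-likelihood L(\lambda) - alpha * \sum_i |\lambda_i| - \sum_i \lambda_i^2 / (2 \sigma^2), following Chen (2009); larger alpha and smaller \sigma^2 both mean stronger regularisation (the exact scaling conventions in the implementation may differ). To override the defaults, pass the options to ngram-count; the values below are purely illustrative:

ngram-count -text train.txt -vocab train.vocab -maxent -maxent-alpha 0.7 -maxent-sigma2 4 -lm train.hme.gz -order 3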

Training on many CPU cores

The MaxEnt training routine is partially parallelized using OpenMP. To enable it, add -fopenmp to the compiler options, e.g. in common/Makefile.machine.i686-m64 (this is for g++; other compilers may need a slightly different option):

GCC_FLAGS = -march=athlon64 -m64 -Wall -Wno-unused-variable -Wno-uninitialized -fopenmp

The parallelization is not highly efficient: we get about a 2x speedup when training on 4 cores versus 1 core.
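
Once compiled with OpenMP support, the number of threads can be controlled with the standard OMP_NUM_THREADS environment variable, e.g.:

OMP_NUM_THREADS=4 ngram-count -text train.txt -vocab train.vocab -maxent -lm train.hme.gz -order 3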

Converting MaxEnt model to ARPA format

MaxEnt models can be converted directly to ARPA format without any loss in accuracy (see Jun Wu's thesis, section 7.1.2, "Mapping ME N-gram Model Parameters to ARPA Back-off Model Parameters").
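
The gist of the mapping (a rough sketch; see the thesis for the details): a MaxEnt N-gram model has the form P(w|h) = exp(\sum_i \lambda_i f_i(h,w)) / Z(h), where the f_i are N-gram indicator features and Z(h) normalises over the vocabulary. Because this is a proper conditional distribution for every context h, the ARPA log-probabilities of the observed N-grams can be obtained by evaluating the model directly, and the back-off weights are then chosen so that each context's distribution sums to one.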

Use the option -maxent-convert-to-arpa of ngram-count to do this:

ngram-count -text train.txt -vocab train.vocab -maxent -lm train.arpa -order 3 -debug 2 -maxent-convert-to-arpa

The resulting model is a regular ARPA language model and can thus be used by any tool that supports the ARPA format.
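
For example, the converted model can be evaluated with plain ngram, without the -maxent switch:

ngram -lm train.arpa -order 3 -ppl test.txt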

Support

The extension was written by Tanel Alumäe. E-mail: tanel.alumae@phon.ioc.ee

References

J. Wu and S. Khudanpur, "Building a topic-dependent maximum entropy model for
  very large corpora," in Proceedings of ICASSP, Orlando, Florida, USA, 2002.

G. Andrew and J. Gao, "Scalable training of L1-regularized log-linear
  models," in Proceedings of the 24th International Conference on Machine
  Learning, Corvallis, Oregon, USA, 2007, pp. 33–40.

S. F. Chen, "Performance prediction for exponential language models," in
  Proceedings of HLT-NAACL, Boulder, Colorado, USA, 2009, pp. 450–458.

License

This extension is licensed under the terms of the two-clause BSD license (note that SRILM itself has a different license):

Copyright (c) 2009-2010  Tanel Alumäe
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
    * Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.