Language Support

Currently rasa NLU is tested and readily available for the following languages:

backend supported languages
spacy-sklearn english (en), german (de), spanish (es), french (fr)
MITIE english (en)

These languages can be set as part of the Configuration.

Adding a new language

We want to make the process of adding new lanugages as simple as possible to increase the number of supported languages. Nevertheless, to use a language you either need a trained word representation or you need to train that presentation on your own using a large corpus of text data in that language.

These are the steps necessary to add a new language:

spacy-sklearn

Spacy already provides a really good documentation page about Adding languages. This will help you train a tokenizer and vocabulary for a new language in spacy.

As described in the documentation, you need to register your language using set_lang_class() which will allow rasa NLU to load and use your new languge by passing in your language identifier as the language Configuration option.

MITIE

  1. Get a ~clean language corpus (a wikipedia dump works) as a set of text files
  2. Build and run MITIE wordrep tool on your corpus. This can take several hours/days depending on your dataset and your workstation. You’ll need something like 128GB of RAM for wordrep to run - yes that’s alot: try to extend your swap.
  3. Set the path of your new total_word_feature_extractor.dat as value of the mitie_file parameter in config_mitie.json