

The explosion of Artificial Intelligence over the last year has generated increasing interest in Natural Language Understanding (NLU) technologies for building systems that interact with end customers in their own language in a truly "natural" way. In recent months, many user interfaces have turned into conversational chatbots that can be controlled through natural language. However, these interfaces have been developed successfully only for a small set of languages, mainly English. Even if getting these devices to detect languages other than English has proven to be a challenging task in itself, the next frontier will be interfaces capable of understanding and managing multiple languages, or even of dealing with a real mixture of different languages at the same time. This is where language identification comes to the front line.

Language identification is commonly formulated as the task of selecting the most likely language from a closed set of known languages for which training data exists, under the assumption that every document is written in exactly one of them. At Bitext, we want to remove this monolingual assumption and address language identification in documents that may contain text from more than one language of the candidate set. In fact, at this very moment we are devising a method that concurrently detects whether a document is multilingual and estimates the proportion of the document written in each language, for a set of more than 50 languages and language variants.
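To make this multilingual formulation concrete, here is a minimal, hypothetical sketch. It is not Bitext's actual method: it simply classifies fixed-size windows of a document independently with any monolingual identifier and reports each language's share of windows. The `classify_window` parameter is an assumed stand-in for such a classifier, not a real library call.

```python
from collections import Counter
from typing import Callable

def language_proportions(
    text: str,
    classify_window: Callable[[str], str],  # assumed monolingual classifier
    size: int = 200,
) -> dict[str, float]:
    """Estimate the share of `text` written in each language by
    classifying non-overlapping fixed-size windows independently."""
    windows = [text[i:i + size] for i in range(0, len(text), size)]
    votes = Counter(classify_window(w) for w in windows if w.strip())
    total = sum(votes.values()) or 1  # avoid division by zero on empty input
    return {lang: count / total for lang, count in votes.items()}
```

Under this scheme, a document could then be flagged as multilingual whenever more than one language's share exceeds some noise threshold.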

Of course, the first clue in language identification is the script. The script is the set of characters used to write a given language, and it is determined by the writing system that language is written in.
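Since the script immediately narrows down the candidate languages, a first pass can simply tally the scripts present in a text. Below is a minimal sketch that uses the first word of each character's Unicode name (e.g. "CYRILLIC", "LATIN") as a proxy for its script; this is a simplification, since some Unicode names do not start with a script name.

```python
import unicodedata
from collections import Counter

def script_histogram(text: str) -> Counter:
    """Count alphabetic characters per script, approximating the script
    by the first word of each character's Unicode name."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            try:
                script = unicodedata.name(ch).split()[0]
            except ValueError:  # character has no name in the database
                continue
            counts[script] += 1
    return counts

print(script_histogram("Београд Beograd"))
# e.g. Counter({'CYRILLIC': 7, 'LATIN': 7})
```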

A non-negligible number of languages have their own script that identifies them unambiguously (Armenian, Georgian, Gujarati, Telugu…), while other scripts are shared by many different languages. The Latin script is used for many European languages, but several Asian languages use it as well, such as Indonesian, Malaysian, Vietnamese, and Tagalog. Similarly, the Cyrillic script is used by some Eastern European languages but also by Asian languages like Kazakh or Kyrgyz.

(Image: Not a translation: this sign shows the city of Belgrade's Serbian name in both Cyrillic and Latin.)

Then again, there are languages that can be written with more than one script. This is the case of Serbian, whose official script is Cyrillic even though a great number of speakers use the Latin alphabet, or of Malaysian, which is most widely written in Latin although a script derived from Arabic can also be used.

Language identification without training

Traditional approaches to language detection involve statistical techniques that rely on the availability of a training set covering the set of languages to be identified.
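As a concrete illustration of such statistical techniques, here is a compact sketch in the spirit of the classic rank-ordered character n-gram approach with an "out-of-place" distance (Cavnar & Trenkle, 1994). The tiny training strings are stand-ins; a real system would build profiles from sizeable per-language corpora.

```python
from collections import Counter

def ngram_profile(text: str, n: int = 3, top: int = 300) -> list[str]:
    """Return the `top` most frequent character n-grams, rank-ordered."""
    padded = f" {text.lower()} "
    grams = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(doc_profile: list[str], lang_profile: list[str]) -> int:
    """Sum of rank differences between profiles; n-grams absent from
    the language profile receive the maximum penalty."""
    ranks = {g: r for r, g in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(
        abs(r - ranks[g]) if g in ranks else penalty
        for r, g in enumerate(doc_profile)
    )

def identify(doc: str, training: dict[str, str]) -> str:
    """Pick the training language whose profile is closest to the document."""
    profiles = {lang: ngram_profile(text) for lang, text in training.items()}
    doc_profile = ngram_profile(doc)
    return min(profiles, key=lambda lang: out_of_place(doc_profile, profiles[lang]))

# Toy corpora for illustration only.
training = {
    "en": "the quick brown fox jumps over the lazy dog and runs away",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y huye",
}
print(identify("the dog runs over the lazy fox", training))  # -> en
```

The obvious limitation, and the motivation for training-free approaches, is that such a classifier can only ever answer with a language it has seen training data for.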
