Miloš Jakubíček is the Chief Executive Officer (CEO) of Lexical Computing, a research company working in the area of language technologies, primarily at the intersection of corpus linguistics, computational linguistics and computer lexicography. By profession, he is an NLP researcher and software engineer. His research focuses mainly on two areas: effective processing of very large text corpora and parsing of morphologically rich languages. Since 2008, Miloš has been involved in the development of Lexical Computing’s flagship product, the Sketch Engine corpus management suite. Since 2011, he has been Director of the Czech branch of Lexical Computing, where he leads the local Sketch Engine development team, and in 2014 he became CEO of Lexical Computing. Miloš is also a fellow of the NLP Centre at Masaryk University, where his interests lie mainly in morphosyntactic analysis and its practical applications.
How to find multi-word expressions in corpora
In this talk, I will present automatic methods for finding some types of multi-word expressions in corpora. I will introduce a very simple typology of multi-word expressions based on standard properties such as fixedness and discontinuity, and show how these properties determine which methods are suitable for automatically identifying each type of multi-word expression in corpora. Special focus will be put on lexicographic applications. Here, unlike with single-word units, mere frequency is not sufficient for generating multi-word headword candidates, and, again unlike with single-word units, there are no widely recognized strategies for identifying these candidates automatically.
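The point that raw frequency alone is a poor guide to multi-word candidates can be illustrated with a minimal sketch. It ranks adjacent word pairs in a tiny invented corpus first by frequency and then by logDice, one common association score. The corpus, the function name bigram_scores and the restriction to contiguous two-word sequences are assumptions made for illustration only; this is not necessarily the method presented in the talk.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Return {bigram: (frequency, logDice)} for adjacent token pairs."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), f_xy in bigrams.items():
        # logDice = 14 + log2(2 * f(x, y) / (f(x) + f(y)))
        log_dice = 14 + math.log2(2 * f_xy / (unigrams[w1] + unigrams[w2]))
        scores[(w1, w2)] = (f_xy, log_dice)
    return scores


# Toy corpus, invented for illustration (hypothetical data).
tokens = ("the end of the day the top of the hill of the town "
          "the red tape and the red tape").split()

scores = bigram_scores(tokens)

# Ranking by raw frequency favours a grammatical sequence ...
top_by_frequency = max(scores, key=lambda bg: scores[bg][0])
# ... while ranking by association strength favours the fixed expression.
top_by_logdice = max(scores, key=lambda bg: scores[bg][1])

print("most frequent:", top_by_frequency)
print("highest logDice:", top_by_logdice)
```

In this toy data the most frequent pair is "of the", while the highest-scoring pair by logDice is "red tape", whose two words occur only together. Discontinuous or variable expressions would need a different candidate extraction step than the adjacent pairs used here.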