Computational linguistics

The computational linguistics group conducts research with applications in language processing. On the one hand, we pursue short-term activities in information retrieval and information extraction; on the other hand, we pursue activities driven by a longer-term strategy: the construction and exploitation of language resources (lexicons and grammars). Some members of the team divide their time between these two types of activity, or have moved from one to the other in recent years, creating the links and synergy between them that are the hallmark of the team.

Our most applied activities are in information retrieval and information extraction using language resources. They help fund the team through subsidized projects such as DoXa and through contracts with SMEs in sectors such as publishing and internet monitoring. Some of our experiments on spelling correction with resources and rules are likely to lead soon to the creation of a start-up in Louvain-la-Neuve (Belgium). For information retrieval and extraction, we sometimes use lexical statistics. However, unlike mainstream language processing, which relies almost exclusively on statistical techniques, we make intensive use of language resources (lexicons and grammars).

Thus, our partners appreciate our services for:

> complementarity with statistical methods in terms of performance,

> adaptability of systems when unsatisfactory behavior is identified, because of the maintainability of the language resources.

These qualities result from intensive, fundamental work carried out over several decades by this group, its partners, and others who preceded them. We continue this work today, with the ambition of contributing to the implementation of a long-term strategy.

We produce tools for language processing based on language resources and rules. Most of these tools are available in two open-source platforms, Unitex and Outilex, which were created, and are maintained and extended, under the supervision of the laboratory. The first version of Unitex was developed in 2002; Outilex was developed between 2002 and 2006.

The main operations that have been investigated and implemented are the following:

> morphological analysis of Korean,

> preparation of textual corpora (collection, transcription, anonymisation),

> alignment of bitexts with manual correction and projection of a concordance of one of the texts onto the other,

> management of a library of local grammars,

> chunking, taking account of multi-word expressions,

> extraction of collocations,

> lexicalization of syntactico-semantic grammars for syntactic parsing, and in particular processing of lexicon-grammar tables.

The remainder of the group's activities is devoted to language resources:

> creation of new lexicons for language processing, with morphosyntactic (Korean, Romanian) or syntactico-semantic information (French, Korean, Modern Greek, Italian, Romanian),

> extension and documentation of existing lexicons (French),

> standardization of formal models of lexicons for ISO,

> annotation of textual corpora,

> evaluation of grammars by comparison with annotated corpora,

> creation of lexicons of multiword expressions, either multilingual or taking into account several varieties of French (Belgium/France/Quebec/Switzerland).

Part of our work on resources involves purely linguistic reflection on phenomena that linguists have to deal with: for example, lexical frozenness, plurilingualism, or second language teaching.

The links between the research topics above are at the heart of the team's strategy. The tools exploit the resources, but they also make it possible to prepare corpora that are used to evaluate the resources. The resources and tools are used to build end applications, whose performance in turn provides feedback on the resources.