CorpusWiki
tagging the languages of the world


CorpusWiki is an online platform for the collaborative creation, annotation, and exploitation of freely available textual corpora for any language in the world. The main features of the system are highlighted below; for more information, consult the documentation section.

Collaborative Corpora

CorpusWiki allows you to work on collaboratively created annotated corpora for all languages in the world. You can study any of the languages already in the system, or help create a corpus for a new language. Contribute to make your language available for linguistic research!

search for a language   •   read more

Collecting Texts

For many languages without a long written tradition or with very few speakers, creating a corpus of a reasonable size is not a trivial matter. That is why CorpusWiki attempts to make it easy for any speaker to contribute any text they might have.

read more

Searching Corpora

All corpora in CorpusWiki can be searched using the powerful Corpus Query Processor (CQP). For annotated texts, all annotation features can be used in complex queries.

browse available languages   •   read more
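
As a rough illustration of what a search over annotation features can look like, the sketch below assembles a query in CQP's token-expression syntax, where each token is written as [attribute="regex"] and conditions are combined with &. The attribute names pos and number are assumptions made for this example; the attributes actually available depend on how each corpus is annotated.

```python
def token(**features):
    """Build one CQP token expression, e.g. [pos="N.*" & number="plural"]."""
    if not features:
        return "[]"  # an empty expression matches any token
    parts = [f'{name}="{value}"' for name, value in features.items()]
    return "[" + " & ".join(parts) + "]"


def sequence(*tokens):
    """Join token expressions into a query for consecutive tokens."""
    return " ".join(tokens) + ";"


# Hypothetical attributes: an adjective directly followed by a plural noun.
query = sequence(token(pos="ADJ"), token(pos="N.*", number="plural"))
print(query)
```

The resulting string can be pasted into any CQP search box; only the attribute names and values need to be adapted to the corpus at hand.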

Detecting Languages

CorpusWiki can identify the language of a text for about 800 languages, and automatically starts recognizing each language for which there is a corpus in the system.

detect your language   •   read more
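
This page does not say which method CorpusWiki uses for language detection. One common approach, sketched below purely as an illustration, compares the character trigrams of a text with a trigram profile built from sample text in each language. The function names and the two tiny training texts are invented for the example.

```python
import math
from collections import Counter


def trigrams(text):
    """Count the character trigrams of a text, padded with spaces."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def similarity(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def detect(text, profiles):
    """Return the language whose profile is most similar to the text."""
    sample = trigrams(text)
    return max(profiles, key=lambda lang: similarity(sample, profiles[lang]))


# Toy training data; a real profile would be built from a whole corpus.
profiles = {
    "English": trigrams(
        "the quick brown fox jumps over the lazy dog "
        "and the cat sleeps in the house"
    ),
    "Dutch": trigrams(
        "de snelle bruine vos springt over de luie hond "
        "en de kat slaapt in het huis"
    ),
}

print(detect("the dog sleeps in the house", profiles))
print(detect("de kat slaapt in het huis", profiles))
```

In a scheme like this, adding a language amounts to adding one more profile, which is consistent with a new language becoming detectable as soon as a corpus exists for it.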

Annotating Corpora

CorpusWiki features an easy, graphical interface for providing each word in each text of the corpus with morphosyntactic features, as well as labels for meaning and pronunciation. CorpusWiki will train an internal part-of-speech tagger to automatically assign the most likely features to each word in a new text.

browse available languages   •   read more
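
To give an idea of what "assigning the most likely features" can mean, here is a minimal baseline sketch: it remembers the most frequent tag seen for each word in the annotated texts and falls back to the overall most frequent tag for unknown words. This is not CorpusWiki's internal tagger, whose design is not described on this page; the tag names and the toy corpus are assumptions for the example.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Learn the most frequent tag per word from (word, tag) sentences."""
    per_word = defaultdict(Counter)
    overall = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            per_word[word.lower()][tag] += 1
            overall[tag] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    default = overall.most_common(1)[0][0]  # fallback for unseen words
    return lexicon, default


def tag(words, model):
    """Tag each word with its most frequent tag, or the default tag."""
    lexicon, default = model
    return [(w, lexicon.get(w.lower(), default)) for w in words]


# Toy annotated corpus with hypothetical tag names.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sees", "VERB"), ("birds", "NOUN")],
]

model = train(corpus)
tagged = tag(["The", "cat", "barks", "loudly"], model)
print(tagged)
```

A baseline like this improves as more texts are annotated by hand, since every corrected word adds evidence for the next text.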

Corpus-Driven Dictionaries

The translation glosses in the CorpusWiki corpora are used to automatically generate a bilingual dictionary. Furthermore, it is possible to create a corpus-driven monolingual dictionary for each corpus. All dictionaries can be consulted online, and monolingual dictionaries can be downloaded in a number of standard dictionary formats such as XDXF, LIFT, and Shoebox.

browse available dictionaries   •   read more
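
The idea of deriving a bilingual dictionary from translation glosses can be sketched as follows. The token layout (form, lemma, gloss) and the sample data are assumptions made for this illustration and do not reflect CorpusWiki's actual storage format.

```python
from collections import Counter, defaultdict


def build_dictionary(tokens):
    """Group glosses by lemma, most frequent gloss first."""
    glosses = defaultdict(Counter)
    for token in tokens:
        gloss = token.get("gloss")
        if gloss:  # skip tokens that have not been glossed
            glosses[token["lemma"]][gloss] += 1
    return {
        lemma: [gloss for gloss, _ in counts.most_common()]
        for lemma, counts in sorted(glosses.items())
    }


# Hypothetical annotated tokens from a small Portuguese corpus.
tokens = [
    {"form": "casas", "lemma": "casa", "gloss": "house"},
    {"form": "casa", "lemma": "casa", "gloss": "house"},
    {"form": "casa", "lemma": "casa", "gloss": "home"},
    {"form": "de", "lemma": "de", "gloss": None},
    {"form": "livros", "lemma": "livro", "gloss": "book"},
]

dictionary = build_dictionary(tokens)
for lemma, translations in dictionary.items():
    print(lemma, "->", ", ".join(translations))
```

Ordering glosses by frequency means the most common translation in the corpus is listed first for each entry.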


CorpusWiki is an initiative by Maarten Janssen of the IULA institute of the Universitat Pompeu Fabra in Barcelona, Spain.