Over the years we have developed a diverse pool of corpora, including large web-derived corpora and small but highly specialised ones.
Some of our corpora are publicly available, while access to others is restricted to our staff and students. Click one of the links above to access our corpora.
Our freely available resources include:
acWaC (academic Web-as-Corpus), a pool of corpus resources to study institutional-academic language
acWaC-EU, a corpus of web pages in English crawled from the websites of European universities. You can find further details about this corpus on the acWaC project website.
acWaC-IT, a corpus of web pages in Italian crawled from Italian university websites and built with the same pipeline as acWaC-EU
WaCky (Web-As-Corpus Kool Yinitiative), a collection of large corpora built by automatically downloading texts from the web. We have made available corpora in English, French, German and Italian. To learn more about how these corpora were created, go to the WaCky website.
La Repubblica, a corpus of Italian newspaper texts published between 1985 and 2000 (approximately 380M tokens).
EPIC, the European Parliament Interpreting Corpus
Bulletin, the Bulletin Corpus (German)
Our corpora are available through the NoSketch Engine online platform. The NoSketch Engine is an open-source corpus management tool that provides a powerful and user-friendly interface to perform corpus searches, generate word and keyword lists, retrieve collocations based on several statistical measures, and much more.
We assume that textual data available through this platform are treated under the fair use doctrine.