EPTIC

European Parliament Translation and Interpreting Corpus

Documentation

The European Parliament Translation and Interpreting Corpus (EPTIC) is developed at the University of Bologna together with several other universities, each responsible for a different language component. At present EPTIC comprises texts in English, Italian and French; other language components, including Finnish, Polish and Slovene, are under way.

The corpus is currently available through the NoSketch Engine platform (NoSkE; Rychlý 2007).

EPTIC is an intermodal, parallel corpus with a complex structure. The data are drawn from the official website of the European Parliament, which provides videos and verbatim reports of the plenary sessions, the interpretations of the speeches, and translations of the verbatim reports (the latter only for plenaries held up to mid-2011). Subcorpora are aligned with each other at sentence level, and transcripts of speeches and interpretations are time-aligned with the corresponding videos.

Each language combination component in EPTIC includes the following subcorpora:

sources – spoken: orthographic transcripts of the original speeches;
sources – written: official verbatim reports of the source speeches;
targets – interpreted: transcripts of the interpretations;
targets – translated: translations of the verbatim reports.

EPTIC is also a multilingual corpus, which includes several language combinations and translation/interpreting directions. All language combinations feature English as one of the languages involved. The subcorpora completed so far are: English>French, French>English, English>Italian, Italian>English, Polish>English, English>Slovenian, English>German, English>Finnish.

Sizes of individual language components

The current sizes of the individual language components of EPTIC are shown in the following table (token counts obtained via the Corpus info function of NoSkE).

Language	Sources: spoken	Sources: written	Targets: interpretations	Targets: translations
English	43,138	41,047	55,139	58,651
French	35,648	34,063	31,935	35,566
Italian	21,208	20,646	27,329	31,816
Finnish	TBA	TBA	11,624	12,045
German	TBA	TBA	18,258	19,822
Slovenian	TBA	TBA	19,717	22,476
Polish	9,458	9,193	TBA	TBA
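
As a rough illustration of how these figures can be processed, the following Python sketch parses the token counts from the table above and sums each column, skipping the cells marked TBA. The names SIZES_TSV, parse_sizes and column_total are hypothetical and used only for this example; the numbers are copied from the table and are not read from the corpus itself.

```python
# Token counts copied from the table above; "TBA" marks figures not yet available.
SIZES_TSV = """\
English\t43,138\t41,047\t55,139\t58,651
French\t35,648\t34,063\t31,935\t35,566
Italian\t21,208\t20,646\t27,329\t31,816
Finnish\tTBA\tTBA\t11,624\t12,045
German\tTBA\tTBA\t18,258\t19,822
Slovenian\tTBA\tTBA\t19,717\t22,476
Polish\t9,458\t9,193\tTBA\tTBA
"""

# Hypothetical short labels for the four subcorpus columns.
COLUMNS = ("spoken", "written", "interpreted", "translated")


def parse_sizes(tsv):
    """Return {language: {column: token count or None}} from tab-separated rows."""
    sizes = {}
    for line in tsv.strip().splitlines():
        language, *cells = line.split("\t")
        sizes[language] = {
            column: None if cell == "TBA" else int(cell.replace(",", ""))
            for column, cell in zip(COLUMNS, cells)
        }
    return sizes


def column_total(sizes, column):
    """Sum one column over all languages, ignoring cells still marked TBA."""
    return sum(row[column] for row in sizes.values() if row[column] is not None)


sizes = parse_sizes(SIZES_TSV)
for column in COLUMNS:
    print(column, column_total(sizes, column))
```

Because some cells are still TBA, the totals printed here cover only the components for which counts have been published.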

Characteristics of EPTIC texts

EPTIC consists mostly of short speeches of 100 to 300 words, but it also contains medium and long speeches (the latter exceeding 1000 words). Speeches are delivered at the European Parliament plenary sessions by different speakers, mostly MEPs, but also commissioners and guests. They cover a range of topics (see the table below). The composition of the individual subcorpora by debate topic can be viewed by clicking on the Corpus info link within the NoSkE interface: in the Structures and attributes section, simply click on Topic.

Topics covered in EPTIC
Agriculture and Fisheries
Economics and Finance
Employment
Environment
Health
Justice
Politics
Procedure and Formalities
Society and Culture
Science and Technology
Transport

Filters (metadata) available in EPTIC

When searching the corpus, queries can be filtered using contextual information (metadata), which makes it possible to narrow a query down to, e.g., speeches delivered by a particular speaker or speeches of a particular length. The filtering options are described in the following table.

As the corpus is aligned at sentence level, all four aligned subcorpora can be searched in parallel. In addition, the corresponding excerpt of the video of the source speech or of the interpretation can be displayed.

Filter	Description
text.id	refers to the ID of the text
text.date	date on which the speech was delivered at the EP
text.length	length of the speech in general (short, medium, long)
text.lengthw	exact text length in words
text.duration	duration of the speech (short, medium, long)
text.durations	duration of the speech in seconds
text.speed	refers to the pace of delivery of the speech (low, medium, fast)
text.speedwm	speed of delivery expressed in words per minute
text.delivery	read vs. impromptu
text.topic	the general topic of the speech
text.topicspec	title of the debate
text.type	source-spoken / source-written / target-interpreted / target-translated
text.wordcount	length of the speech in words
speaker.name	name of the speaker
speaker.gender	gender of the speaker
speaker.country	country the speaker represents
speaker.politfunc	the speaker’s political function at the EP
speaker.politgroup	the speaker’s political group
st.language	source text language
st.length	source text length
st.lengthw	source text length in words
st.duration	source text duration in general (short, medium, long)
st.durations	source text duration in seconds
st.speed	pace of the original speaker in general (low, medium, fast)
st.speedwm	pace of the original speaker in words per minute
st.delivery	mode of delivery of the source speech (read vs. impromptu)
interpreter.id	unique identifier of the interpreter
interpreter.gender	gender of the interpreter
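
Besides the filter form in the interface, NoSketch Engine queries written in CQL can restrict matches to a structure with a given attribute value by using a within clause. The Python sketch below builds such a query string from one of the filter names listed above. It rests on several assumptions: that each dotted filter name maps to a corpus structure and attribute (e.g. text.topic to the attribute topic of the structure text), that the corpus has a word attribute, and that the attribute value is spelled in the corpus exactly as in this documentation. The function name build_cql is hypothetical.

```python
def build_cql(token_query: str, filter_name: str, value: str) -> str:
    """Combine a CQL token query with a single structure-attribute restriction.

    filter_name is a dotted name such as "text.topic"; it is split into a
    structure ("text") and an attribute ("topic"). This mapping is an
    assumption about how the EPTIC filters are encoded in the corpus.
    """
    structure, _, attribute = filter_name.partition(".")
    if not structure or not attribute:
        raise ValueError(f"expected 'structure.attribute', got {filter_name!r}")
    # Escape backslashes and double quotes so the value stays inside the quotes.
    # Note that CQL interprets attribute values as regular expressions.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{token_query} within <{structure} {attribute}="{escaped}" />'


# Example: the word "climate" in speeches whose topic is Environment.
query = build_cql('[word="climate"]', "text.topic", "Environment")
print(query)
```

The resulting string can be pasted into the CQL query field; whether a particular value matches depends on how it is actually stored in the corpus, which can be checked under Corpus info.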