The European Parliament Translation and Interpreting Corpus (EPTIC) is developed at the University of Bologna and several other universities, which are responsible for different language components. At present EPTIC comprises texts in English, Italian and French; further language components, including Finnish, Polish and Slovene, are under development.
The corpus is currently made available through the NoSketch Engine platform (NoSkE; Rychlý 2007).
EPTIC is an intermodal and parallel corpus of a complex structure. Data included in the corpus is derived from the official website of the European Parliament, which provides videos and verbatim reports of the plenary sessions together with the interpretations of the speeches, as well as their translations (the latter only for plenaries which took place until mid-2011). Subcorpora are aligned to each other at sentence level, and transcripts of speeches and interpretations are time-aligned with the corresponding videos.
Each language combination component in EPTIC includes the following subcorpora: spoken source texts, written source texts, interpretations and translations.
EPTIC is also a multilingual corpus, which includes several language combinations and translation/interpreting directions. All language combinations feature English as one of the languages involved. The subcorpora completed so far are: English>French, French>English, English>Italian, Italian>English, Polish>English.
The current sizes of the individual language components of EPTIC are shown in the tables below (token counts obtained via the Corpus info function of NoSkE).
Language | Sources: spoken | Sources: written | Targets: interpretations | Targets: translations |
---|---|---|---|---|
English | 24,136 | 22,782 | 53,615 | 58,561 |
French | 27,713 | 26,674 | 23,185 | 25,855 |
Italian | 20,016 | 19,591 | 20,352 | 23,234 |
Polish | 11,011 | 10,616 | TBA | TBA |
The English subset of EPTIC has a slightly different structure from the others. It comprises source texts originally delivered by native and non-native speakers, as well as translations and interpretations from various languages. Since the European Parliament website provides little information about the translations into English, it is difficult to establish whether they were produced by native English speakers. The interpretations into English from various languages are carried out by either native or non-native English speakers.
English | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
Spoken sources: native | Spoken sources: non-native | Written sources: native | Written sources: non-native | Interpretations: from French | Interpretations: from Italian | Interpretations: from Polish | Translations: from French | Translations: from Italian | Translations: from Polish | |
into A | into B | |||||||||
10,840 | 13,296 | 10,312 | 12,470 | 23,670 | 18,390 | 11,555 | 25,046 | 20,335 | 13,180 | NA |
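The figures in this breakdown add up to the English row of the overview table. The short Python sketch below makes that arithmetic explicit; the dictionary keys and the function name `check_totals` are illustrative labels chosen here, not identifiers used in the corpus, and the numbers are copied from the two tables.

```python
# Token counts for the English row of the overview table.
overview_english = {
    "spoken_sources": 24136,
    "written_sources": 22782,
    "interpretations": 53615,
    "translations": 58561,
}

# Token counts from the detailed English breakdown.
# Key names are illustrative labels, not corpus identifiers.
breakdown_english = {
    "spoken_sources": {"native": 10840, "non_native": 13296},
    "written_sources": {"native": 10312, "non_native": 12470},
    "interpretations": {
        "from_french": 23670,
        "from_italian": 18390,
        "from_polish": 11555,
    },
    "translations": {
        "from_french": 25046,
        "from_italian": 20335,
        "from_polish": 13180,
    },
}


def check_totals(overview, breakdown):
    """Compare each overview total with the sum of its breakdown parts.

    Returns a dict mapping subcorpus name to a tuple
    (sum_of_parts, overview_total, totals_match).
    """
    report = {}
    for name, total in overview.items():
        parts_sum = sum(breakdown[name].values())
        report[name] = (parts_sum, total, parts_sum == total)
    return report


if __name__ == "__main__":
    for name, (parts_sum, total, ok) in check_totals(
        overview_english, breakdown_english
    ).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: parts={parts_sum:,} total={total:,} {status}")
```

All four subcorpora match, so the breakdown can be read as a partition of the English totals by speaker nativeness (sources) and by source language (targets).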
EPTIC consists mostly of short speeches ranging from 100 to 300 words, but there are also medium and long speeches (the latter exceeding 1000 words). Speeches are delivered at the European Parliament plenary sessions by different speakers, mostly MEPs, but also commissioners and guests. They cover a range of topics (see the table below). The composition of the individual subcorpora with respect to the topics covered in the debates can be viewed via the Corpus info link within the NoSkE interface: in the Structures and attributes section, simply click on Topic.
Topics covered in EPTIC |
---|
Agriculture and Fisheries |
Economics and Finance |
Employment |
Environment |
Health |
Justice |
Politics |
Procedure and Formalities |
Society and Culture |
Science and Technology |
Transport |
When searching the corpus, queries can be filtered using contextual information, which makes it possible to narrow a query down to, e.g., speeches delivered by a particular speaker or speeches of a particular length. The filtering options are described in the table below.
As the corpus is aligned at sentence level, it is possible to search all 4 aligned subcorpora in a parallel search. In addition, the corresponding excerpt of a video of the source speech or the interpretation can be displayed.
Filter | Description |
---|---|
text.id | refers to the ID of the text |
text.date | date on which the speech was delivered at the EP |
text.length | length of the speech in general (short, medium, long) |
text.lengthw | exact text length in words |
text.duration | duration of the speech (short, medium, long) |
text.durations | duration of the speech in seconds |
text.speed | refers to the pace of delivery of the speech (low, medium, fast) |
text.speedwm | speed of delivery expressed in words per minute |
text.delivery | read vs. impromptu |
text.topic | the general topic of the speech |
text.topicspec | title of the debate |
text.type | source-spoken/ source-written/ target-interpreted/ target-translated |
text.wordcount | length of the speech in words |
speaker.name | name of the speaker |
speaker.gender | gender of the speaker |
speaker.country | country the speaker represents |
speaker.native | whether the speaker is using their native language or a foreign language |
speaker.politfunc | the speaker’s political function at the EP |
speaker.politgroup | the speaker’s political group |
st.language | source text language |
st.length | source text length |
st.lengthw | source text length in words |
st.duration | source text duration in general (short, medium, long) |
st.durations | source text duration in seconds |
st.speed | pace of the original speaker in general (low, medium, fast) |
st.speedwm | pace of the original speaker in words per minute |
st.delivery | mode of delivery of the source speech (read vs. impromptu) |
interpreter.gender | gender of the interpreter |
interpreter.native | whether the interpretation is delivered into the interpreter's native language or a foreign language |
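
As an illustration of how these filters might be combined with a search, the Python sketch below assembles a CQL (Corpus Query Language) string with one `within <structure attribute="value" />` restriction per filter. It rests on two assumptions that should be checked against the NoSkE documentation and the corpus itself: that each filter name of the form `structure.attribute` is exposed as a structure attribute that CQL can query, and that several `within` clauses can be chained. The helper `build_cql` is hypothetical and not part of NoSkE; the filter values used in the example (`Health`, `impromptu`) are taken from the tables above.

```python
def build_cql(token_query: str, filters: dict[str, str]) -> str:
    """Append one `within <structure attribute="value" />` clause per filter.

    `filters` maps names from the filter table (e.g. "text.topic") to the
    value to match. CQL treats attribute values as regular expressions,
    so only backslashes and double quotes are escaped here.
    """
    clauses = []
    for name, value in filters.items():
        structure, dot, attribute = name.partition(".")
        if not (structure and dot and attribute):
            raise ValueError(f"expected 'structure.attribute', got {name!r}")
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'within <{structure} {attribute}="{escaped}" />')
    return " ".join([token_query, *clauses])


if __name__ == "__main__":
    # Hypothetical search: a lemma in impromptu speeches on the topic Health.
    query = build_cql(
        '[lemma="vaccine"]',
        {"text.topic": "Health", "text.delivery": "impromptu"},
    )
    print(query)
```

The resulting string could then be pasted into the CQL field of the concordance form; in the interface itself the same restrictions are normally set through the filter options rather than typed by hand.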