
About PaEnS


The English/Spanish parallel corpus PaEnS is part of a major ongoing project, PaCorES (Parallel Corpora Spanish), which aims to compile a series of bilingual parallel corpora with Spanish as the central language. So far, the project comprises a German/Spanish corpus (www.corpuspages.eu) and the present one.

PaEnS is a bilingual parallel corpus composed of two major parts: the core corpus and the supplements.

The core corpus comprises original texts in English and Spanish and their respective translations. It includes works of fiction (novels and short stories, making up around 80%) as well as non-fiction, especially psychology, essays and popular science texts. The selected works are represented not by their full texts but by samples, which allows for a broader cross-section of texts. Breaks in the text (original and translation) are marked.

This part of PaEnS (see the statistics below) contains more than 16 million tokens and 515,490 bisegments, i.e. pairs of aligned text chunks (sentences or subsentential segments).
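As a rough illustration of what a bisegment is, the following Python sketch models one aligned pair and a simple lookup over a list of pairs. The class name `Bisegment`, the function `find_bisegments` and the example sentences are invented for this illustration; they do not reflect the actual storage format or query interface of PaEnS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bisegment:
    """One aligned pair: a segment of the original and its translation."""
    source: str
    target: str
    source_lang: str  # "en" or "es"
    target_lang: str  # "es" or "en"


def find_bisegments(bisegments, word):
    """Return the bisegments whose source side contains `word`
    as a whitespace-separated token (case-insensitive)."""
    needle = word.lower()
    return [b for b in bisegments if needle in b.source.lower().split()]


# Invented example sentences, not taken from the corpus.
examples = [
    Bisegment("She opened the door .", "Abrió la puerta .", "en", "es"),
    Bisegment("La casa estaba vacía .", "The house was empty .", "es", "en"),
]

hits = find_bisegments(examples, "door")
for b in hits:
    print(b.source, "|", b.target)
```

Keeping the direction (original versus translation) in each pair matters because the core corpus contains both English and Spanish originals.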

To guarantee overall quality, the texts have been manually verified at several levels. The automatic alignment of the bisegments, performed with LF Aligner, has been manually reviewed. The English texts have been lemmatized and POS-tagged with TreeTagger, and the Spanish texts with FreeLing. After a manual check for systematic errors, the tags of both taggers were mapped to the Universal POS tags, which mark the core part-of-speech categories. More fine-grained categories are expected to be offered in the future.
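The idea of mapping two tagger-specific tagsets to one shared inventory can be sketched in Python as follows. This is only a partial, illustrative mapping and not the table used by the PaEnS team: it assumes Penn-Treebank-style tags for the English TreeTagger output and EAGLES-style tags for the Spanish FreeLing output, where the category is encoded in the leading characters of the tag. The function names are hypothetical.

```python
# Illustrative, partial mappings only; NOT the mapping used by PaEnS.
TREETAGGER_EN_TO_UPOS = {
    "NN": "NOUN", "NNS": "NOUN",
    "NP": "PROPN", "NPS": "PROPN",
    "VV": "VERB", "VVD": "VERB",
    "JJ": "ADJ", "RB": "ADV",
    "DT": "DET", "IN": "ADP", "CC": "CCONJ",
}

# EAGLES-style tags carry the category in their first one or two characters.
EAGLES_PREFIX_TO_UPOS = {
    "NC": "NOUN", "NP": "PROPN",
    "VM": "VERB", "VA": "AUX",
    "CC": "CCONJ", "CS": "SCONJ",
    "A": "ADJ", "R": "ADV", "D": "DET", "P": "PRON",
    "S": "ADP", "Z": "NUM", "F": "PUNCT", "I": "INTJ",
}


def en_to_upos(tag):
    """Look up an English tag; unknown tags fall back to 'X'."""
    return TREETAGGER_EN_TO_UPOS.get(tag, "X")


def es_to_upos(tag):
    """Try the two-character prefix first, then the one-character prefix."""
    for length in (2, 1):
        upos = EAGLES_PREFIX_TO_UPOS.get(tag[:length])
        if upos is not None:
            return upos
    return "X"


print(en_to_upos("NNS"), es_to_upos("NCMS000"))
```

Trying the longer prefix first keeps, for example, proper nouns (`NP…`) apart from pronouns (`P…`).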

For each occurrence, the original source is provided, including the author, title, year of first publication and, where applicable, the edition used and the part or chapter of the work to which the occurrence belongs. The complete bibliographic data of the works included in PaEnS can be found here.
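The source information attached to each occurrence could be represented along these lines in Python. The record type `SourceRecord`, its field names and the placeholder values are assumptions made for illustration; the edition and the part or chapter are optional because they are given only where applicable.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceRecord:
    """Hypothetical record for the source of one occurrence."""
    author: str
    title: str
    first_published: int
    edition: Optional[str] = None  # only if applicable
    part: Optional[str] = None     # part or chapter, only if applicable


def format_source(rec):
    """Join the available fields into a single reference string."""
    fields = [rec.author, rec.title, str(rec.first_published)]
    if rec.edition:
        fields.append(rec.edition)
    if rec.part:
        fields.append(rec.part)
    return ", ".join(fields)


# Placeholder values, not a work from the corpus.
rec = SourceRecord("Example Author", "Example Title", 1999, part="Chapter 3")
print(format_source(rec))
```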

The supplements contain more than 110 million words in total. Unless otherwise specified, they have not undergone any manual review. So far, the supplements include:

  1. TED-Talks, a corpus of the English originals and Spanish translations of the transcripts of 4,043 TED Talks from 2006 to 2020. The alignment of these segments has been manually reviewed.
  2. Europarl v7, a corpus of the proceedings (verbatim reports) of the European Parliament from 1996 to 2011.
  3. Global Voices, a corpus of texts written by an international, multilingual, primarily volunteer community of writers, translators, academics and human rights activists. A group of Lingua volunteers makes the stories available in dozens of languages.

In the near future, new collections of bilingual texts of diverse origin are expected to be added.

We aim to build a multifunctional and representative language resource for the English/Spanish language pair that meets the differing needs of its users and can be exploited for multiple purposes, such as general research in contrastive linguistics, linguistic typology, translation studies and bilingual lexicography, as well as supplying training data for machine translation systems. PaEnS has also proven to be a useful and widely used resource for translators and for intermediate and advanced learners of English or Spanish as a foreign language, who can obtain a multitude of human-made translation suggestions presented within examples of real language use.

For more detailed information about PaEnS, see the publications webpage.

Despite our best efforts, some mistakes have undoubtedly slipped through. If you come across any, please let us know by clicking here.

Notice:

If you use PaEnS in your work, please acknowledge it and let us know at corpuspaens@usc.es. In this way you contribute to the sustainability of the project.

PaEnS statistics (as of 2021/12)

Core Corpus

LANGUAGE             TOKENS      WORDS       RATIO  BISEGMENTS  WORKS
English Original     4,446,050   4,436,921   79.79  279,624     37
Spanish Translation  4,521,373   4,512,774   48.28
Spanish Original     3,755,859   3,754,583   41.66  235,866     38
English Translation  3,948,366   3,949,193   83.80
Total                16,671,648  16,577,618  63.38  515,490     75

Supplements: Europarl v7

LANGUAGE  TOKENS      WORDS       RATIO  BISEGMENTS
English   42,178,712  36,485,783  21.11  1,550,421
Spanish   44,128,158  44,128,158  20.61
Total     86,306,870  74,887,296  20.86  1,550,421

Supplements: TED-Talks

LANGUAGE  TOKENS      WORDS       RATIO  BISEGMENTS
English   8,676,842   7,043,470   11.95  430,667
Spanish   8,338,726   6,816,425   10.75
Total     17,015,568  13,859,895  11.35  430,667

Supplements: Global Voices

LANGUAGE  TOKENS      WORDS       RATIO  BISEGMENTS
English   15,285,853  12,724,972  12.21  680,530
Spanish   16,361,642  13,826,084  14.15
Total     31,647,495  26,551,056  13.18  680,530
                                                              
PaEnS version 2.0
Last updated: 31.05.2022
Creative Commons License
University of Santiago de Compostela
This project is funded by the State Research Agency (AEI) of the Spanish Ministry of Science, Innovation and Universities (FFI2017-85938-R) and by the Department of Economy and Industry of the Galician Government (2017-PG023).