About PaEnS

The English/Spanish parallel Corpus, PaEnS, is part of an ongoing major project, PaCorES (www.pacores.eu), Parallel Corpora Spanish, which aims to collect a series of bilingual parallel corpora with Spanish as the central language. So far, the project includes three other corpora at different stages of completion, all of them freely available online: Corpus PaGes German < > Spanish, Corpus PaChes Chinese < > Spanish y Corpus PaFres French < > Spanish. So far, the project includes other three corpora: German/Spanish (www.corpuspages.eu), Chinese/Spanish (www.corpuspaches.eu) and French/Spanish (www.corpuspafres.eu).

PaEnS is a bilingual parallel corpus composed of two major parts: the core corpus and the supplements.

The core corpus is comprised currently of 222 original texts in English or in Spanish along with their respective translations. It includes works of fiction —novels and short stories, making up around 80% — as well as non-fiction (20.24%) — especially psychology, essays and popular science texts. The selected works are represented not by the full texts, but rather by samples, allowing for a better cross-section of the texts. Breaks in the text (original and translation) are marked.

This part of PaEnS (s. tables below) contains nearly 53 million tokens and over than 1.5 million bisegments, i.e. pairs of aligned text chunks (sentences or subsentential units/segments).

To guarantee overall quality, the texts have been manually verified at different levels. The automatic alignment of the bisegments, performed by LF-Aligner, some with youalign or Gargantua, has been manually reviewed. Since 2024, we have primarily used Bertalign for alignment, which, thanks to its transformer-based technology, achieves higher accuracy and consistency.

For POS tagging, we have chosen the tools that currently offer the highest accuracy according to the state of the art — Stanza for English and Freeling for Spanish. After lemmatization and tagging with these tools, a semimanual check was conducted to identify and correct systematic errors, and the resulting tags were then mapped to the Universal POS tagset , which captures the main part-of-speech categories. Future releases aim to include more fine-grained categories.

For each occurrence, the original source is provided, which includes information on the author, title, year of the first publication, and — if applicable —the edition used and the part or chapter within the work to which the specific occurrence belongs. The complete bibliographic data of the works included in PaEnS can be found here.

The supplements contain a total of more than 110 million words. If not otherwise specified, they are not undertaken any manual review. The supplements include so far:

Ted-Talks, a corpus that collects the English originals and Spanish translations of the transcriptions of 4043 Ted-Talks from 2006 to 2020. The alignment of these segments has been manually reviewed.
Europarl v7, a corpus that collects the proceedings (Verbatim reports) of the European Parliament from 1996 to 2011.
Global-Voices a corpus of texts written by an international, multilingual, primarily volunteer community of writers, translators, academics, and human rights activists. A group of Lingua volunteers make the stories available in dozens of languages.
OpenSubtitles v2018, a large collection of translated movie subtitles.

In the near future, new collections of bilingual texts of diverse origin are expected to be added.

We aim at building a multifunctional and representative language resource for the language pair English / Spanish that is able to meet differentiated need of users and that can be exploited for multiple purposes such as general research in contrastive linguistics, linguistic typology, translation studies and bilingual lexicography, as well as the supply of training data to machine translation systems. PaEnS has also proven to be a very useful and widely used resource by translators and learners of English or Spanish as Foreign Languages at intermediate and advanced levels to obtain a multitude of translation suggestions made by humans and presented within examples of real language use.

Despite our best efforts, some mistakes have undoubtedly slipped through. If you come across any, please let us know by by clicking here.

Notice:

If you use PaEnS in your work, please cite the article below and let us know: corpuspaens@usc.es. This way you contribute to the sustainability of the project.

Doval, Irene (2023): The English–Spanish parallel corpus PaEnS. Current trends on digital technologies and gaming for language teaching and linguistics, eds. I. Santos Díaz et al. Berlin: Peter Lang. pp.145-164.

Statistics PaEnS

PaEnS: Core Corpus

LANGUAGE	TOKENS	WORDS	MSTTRATIO*	BISEGMENTS	WORKS
English Original	13.123.320	11.364.819	0,535	806.226	120
Spanish Translation	13.746.700	12.046.344	0,529	806.226	120
Spanish Original	12.864.001	11.292.490	0,541	703.343	140
English Translation	13.176.524	11.541.833	0,527	703.343	140
Total	52.910.545	46.245.486	0,531	1.509.569	260

Supplements 1: Europarl v7

LANGUAGE	TOKENS	WORDS	MSTTRATIO*	BISEGMENTS
English	39.481.818	35.918.308	0,485	1.536.548
Spanish	41.476.923	37.600.223	0,465	1.536.548
Total	80.958.741	73.518.531	0,475	1.536.548

Supplements 2: TED-Talks

LANGUAGE	TOKENS	WORDS	MSTTRATIO*	BISEGMENTS
English	8.676.842	7.043.470	0,476	431.095
Spanish	8.338.726	6.816.425	0,506	431.095
Total	17.015.568	13.859.895	0,491	431.095

Supplements 3: Global Voices

LANGUAGE	TOKENS	WORDS	MSTTRATIO*	BISEGMENTS
English	15.285.853	12.724.972	0,558	680.530
Spanish	16.361.642	13.826.084	0,528	680.530
Total	31.647.495	26.551.056	0.543	680.530

Supplements 4: OpenSubtitles v2018

LANGUAGE	TOKENS	WORDS	MSTTRATIO*	BISEGMENTS
English	69.377.387	54.446.668	0,516	7.745.559
Spanish	62.207.848	49.007.151	0,570	7.745.559
Total	131.585.235	103.453.819	0.543	7.745.559

*MSTTR is the average TTR (Type/Token Ratio) for each non-overlapping segment of equal size (in this case 1000 tokens).
(Updated: 19/11/2024, Release v2.0)

Publisher Privacy & Terms of use

University of Santiago de Compostela

This project is funded by the State Research Agency (AEI) of Spanish Ministry of Science, Innovation and University (PID2021-125313OB-I00).