Bilingual + monolingual test corpora (EN, PT) from the DGT data

This MQXLZ test data set...

... has 141,560 translation units (TUs). It was made from a memoQ view by these steps:

1. Create a project with the desired source and target languages, chosen from the EU languages available in the DGT archive.
2. Import the TMX files from the unpacked folder structure of a downloaded DGT ZIP archive. This may take a while.
3. In the list of translation documents in your test project, sort the imported TMX files by size.
4. Select and delete all TMX files with a size of zero. A zero size means that the language pair was not available for that document.
5. Create a view of all the remaining TMX files. This simply "glues" them together.
6. Export the view as an MQXLZ file.
7. (Optional) After importing the MQXLZ file into an appropriate project, you can send the content to a LiveDocs corpus or to a TM if that is useful to you.
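
The zero-size check in the steps above can also be approximated outside memoQ, before import. The following Python sketch is only an illustration under stated assumptions, not part of the memoQ workflow: it assumes each TMX file marks the language of a `<tuv>` element with either an `xml:lang` or a `lang` attribute, and that codes look like `EN-GB` or `PT-PT`. The function names are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Expanded name that ElementTree uses for the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tuv_language(tuv):
    """Return the upper-cased primary language code of a <tuv>, or ''."""
    # Assumption: the language is stored in xml:lang or in a plain lang attribute.
    code = tuv.get(XML_LANG) or tuv.get("lang") or ""
    return code.split("-")[0].upper()


def count_pair_units(tmx_source, source_lang="EN", target_lang="PT"):
    """Count <tu> elements with non-empty text in both languages.

    tmx_source may be a file path or a binary file-like object.
    """
    count = 0
    for _, elem in ET.iterparse(tmx_source, events=("end",)):
        if elem.tag != "tu":
            continue
        langs = set()
        for tuv in elem.iter("tuv"):
            seg = tuv.find("seg")
            if seg is not None and "".join(seg.itertext()).strip():
                langs.add(tuv_language(tuv))
        if source_lang in langs and target_lang in langs:
            count += 1
        elem.clear()  # free memory; DGT files can be large
    return count
```

A file for which `count_pair_units(path)` returns 0 would presumably be one of the files that shows up with zero size in the memoQ document list, although memoQ's own import logic may differ in detail.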

Download

Monolingual TXT files from the same DGT sample

These files may be useful for many things, such as segmentation testing. They were produced by exporting the target text as plain text, using the corresponding function under Documents > Export... in the working grid.
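
As one possible illustration of segmentation testing with such a TXT file, the Python sketch below applies a deliberately naive sentence-break rule to each line and reports the lines that rule would split further. The rule, the function names, and the idea of checking the file line by line are assumptions made for this example. They do not describe memoQ's segmentation rules or the exact layout of the exported files.

```python
import re

# Naive rule (an assumption for this sketch): break after ., ! or ? when
# followed by whitespace and an uppercase letter, including accented capitals.
_BREAK = re.compile(r"(?<=[.!?])\s+(?=[A-ZÀ-ÖØ-Ý])")


def naive_split(text):
    """Split text into sentence-like pieces using the naive rule."""
    return [part for part in _BREAK.split(text.strip()) if part]


def lines_split_further(lines):
    """Yield (line_number, pieces) for each line the naive rule would split."""
    for number, line in enumerate(lines, start=1):
        pieces = naive_split(line)
        if len(pieces) > 1:
            yield number, pieces
```

To try it on one of the downloaded files, open the file with the appropriate encoding and pass the file object to `lines_split_further`. Lines that come back are candidates for a closer look at how a given segmentation rule set behaves.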

Download

All DGT testing sample PT.zip

Download

All DGT testing sample EN.zip

Additional mono- and bilingual test corpora

... are available here.

memoQuickies Resource Camp