Bilingual + monolingual test corpora (EN, PT) from the DGT data
This MQXLZ test data set...
... has 141,560 translation units (TUs). It was created from a memoQ view with the following steps:
- Create a project with the desired source and target language combination from the EU languages available in the DGT archive.
- Import the TMX files from the unpacked folder structure of a downloaded DGT ZIP archive. This may take a while.
- Sort the imported TMX files by size in the list of translation documents in your test project.
- Select and delete all TMX files with a size of zero. A zero size means that the language pair was not available for those documents.
- Create a view that "glues" all the remaining TMX files together.
- Export the view as an MQXLZ file.
- (optional) After importing your MQXLZ file to an appropriate project, you can send the content to a LiveDocs corpus or to a TM if such actions are useful to you.
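If you would rather find the documents that lack your language pair before importing anything, a script can do the check on the unpacked folder. The following is only a minimal sketch and not part of the memoQ workflow above: the function names `count_pair_tus` and `split_by_pair` are hypothetical, and it assumes that each `tuv` element carries its language code in a `lang` or `xml:lang` attribute (for example `EN-GB` or `PT-PT`).

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Fully qualified name of the xml:lang attribute as ElementTree reports it.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def count_pair_tus(tmx_path, src="en", tgt="pt"):
    """Count the TUs in one TMX file that contain both languages."""
    count = 0
    for _, elem in ET.iterparse(str(tmx_path), events=("end",)):
        if elem.tag != "tu":
            continue
        langs = set()
        for tuv in elem.iter("tuv"):
            # Assumption: the language code is in xml:lang or lang.
            code = tuv.get(XML_LANG) or tuv.get("lang") or ""
            langs.add(code.lower().split("-")[0])
        if src in langs and tgt in langs:
            count += 1
        elem.clear()  # release the finished TU to keep memory use low
    return count


def split_by_pair(folder, src="en", tgt="pt"):
    """Return (usable, empty) lists of TMX paths found under folder."""
    usable, empty = [], []
    for path in sorted(Path(folder).rglob("*.tmx")):
        if count_pair_tus(path, src, tgt) > 0:
            usable.append(path)
        else:
            empty.append(path)
    return usable, empty
```

Calling `split_by_pair("path/to/unpacked/archive")` returns the files worth importing and the ones that would show up with zero size in the project. Summing `count_pair_tus` over the usable files gives a rough TU count to compare with the figure memoQ reports.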
Monolingual TXT files from the same DGT sample
These files may be useful for many things, such as segmentation testing. They were produced by exporting the target text as plain text, using the corresponding function under Documents > Export... in the working grid.
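As one possible starting point for segmentation testing, the sketch below flags lines of a TXT file that a deliberately naive rule would split further. It rests on several assumptions: that the exported file holds one segment per line, that it is encoded as UTF-8 (pass another `encoding` if not), and that the regular expression is a stand-in baseline rather than memoQ's own segmentation rules. The names `naive_split` and `lines_with_extra_splits` are hypothetical.

```python
import re

# Naive baseline: break after ., ! or ? when whitespace and an
# uppercase letter (including accented capitals) follow.
SPLIT_RE = re.compile(r"(?<=[.!?])\s+(?=[A-ZÀ-ÖØ-Ý])")


def naive_split(line):
    """Split one line of text into sentence-like parts."""
    return [part for part in SPLIT_RE.split(line.strip()) if part]


def lines_with_extra_splits(txt_path, encoding="utf-8"):
    """Return (line number, parts) for lines the naive rule would split."""
    hits = []
    with open(txt_path, encoding=encoding) as handle:
        for number, line in enumerate(handle, start=1):
            parts = naive_split(line)
            if len(parts) > 1:
                hits.append((number, parts))
    return hits
```

The flagged lines are the places where the naive rule and the exported segmentation disagree, so they are a convenient sample to inspect by hand when you tune segmentation rules.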
Additional mono- and bilingual test corpora
... are available here.