About this data for testing

Naming convention for files

Generally, the format for naming test data files in this section is as follows:

<somewhat arbitrary numerical code>_<domain/description>_<language code>.<file extension>

so something like

10001_dates_EN-US.txt, 10001_dates_UA.docx, 10001_dates_FR-CA.rtf, etc.

All file names in the course follow this convention and use English. Giving a document similar names across its different language versions will make tasks such as alignment to produce bilingual data easier, which may sometimes be helpful for learning purposes.
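For anyone who wants to sort or pair these files with a script, the convention above can be sketched in Python roughly as follows. This is only an illustration under assumptions, not something the course materials require: the names `ParsedName`, `NAME_PATTERN`, `parse_file_name` and `group_by_document` are hypothetical, and the pattern assumes that the numerical code consists only of digits and that the language code contains no underscore or period.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Optional, Tuple


class ParsedName(NamedTuple):
    code: str         # the somewhat arbitrary numerical code, e.g. "10001"
    description: str  # domain/description, e.g. "dates"
    language: str     # language code, e.g. "EN-US"
    extension: str    # file extension without the dot, e.g. "txt"


# Assumed pattern: <digits>_<description>_<language code>.<extension>
# The description may itself contain underscores; the language code may not.
NAME_PATTERN = re.compile(
    r"^(?P<code>\d+)_(?P<description>.+)_(?P<language>[^_.]+)\.(?P<extension>[^.]+)$"
)


def parse_file_name(name: str) -> Optional[ParsedName]:
    """Split a file name into its parts, or return None if it does not fit."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    return ParsedName(**match.groupdict())


def group_by_document(names: Iterable[str]) -> Dict[Tuple[str, str], Dict[str, str]]:
    """Group file names by (code, description), keyed by language code.

    Names that do not follow the convention are skipped silently.
    """
    groups: Dict[Tuple[str, str], Dict[str, str]] = defaultdict(dict)
    for name in names:
        parsed = parse_file_name(name)
        if parsed is not None:
            groups[(parsed.code, parsed.description)][parsed.language] = name
    return dict(groups)
```

Calling `group_by_document` on the three example names above would put them in a single group under the key `("10001", "dates")`, with one entry per language code. That grouping is the kind of starting point an alignment exercise needs.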

File formats

Most of the data will be provided as plain text (usually UTF-8, although the encoding may differ where a language requires it), RTF, Microsoft Office formats (Word, Excel and PowerPoint formats of various historical kinds), HTML, XML, PDF and some common formats used by open-source alternatives to Microsoft Office.
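Because the plain text files are usually, but not always, UTF-8, a script that reads them may want to try more than one encoding. The following Python sketch shows one cautious way to do that. The function name `read_text_with_fallback` and the fallback order in `CANDIDATE_ENCODINGS` are assumptions made for illustration. The course does not state which encodings other than UTF-8 are used, so adjust the list to the languages you work with.

```python
from pathlib import Path
from typing import Sequence, Tuple

# Assumed fallback order, for illustration only. "utf-8-sig" reads ordinary
# UTF-8 and also strips a byte order mark if one is present.
CANDIDATE_ENCODINGS = ("utf-8-sig", "cp1252")


def read_text_with_fallback(
    path: str, encodings: Sequence[str] = CANDIDATE_ENCODINGS
) -> Tuple[str, str]:
    """Return (text, encoding) using the first encoding that decodes cleanly.

    A clean decode does not prove the guess is right; it only means that the
    bytes are valid in that encoding. Check the result by eye.
    """
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError(f"None of {list(encodings)} could decode {path}")
```

Returning the encoding together with the text makes it easy to note which files are not UTF-8 before loading them into another tool.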

In this course, I will make no attempt to cover every possible file type. Instead, I will keep the file types consistent for a given data set across its language instances. There may be some exceptions to this as required by a given language or to facilitate some learning exercises.


The greatest emphasis will be on providing test data in the language of this course (English). There are, of course, many Englishes, and my native one is an English of the southern California portion of the United States of America. In some cases, examples may be provided of an English from the United Kingdom or even some imperfect variant or unusual dialect if it makes sense to do so for some exercise.

Everyone in this course is a translator, and all of you are reading this in English, which suggests that it is a good common language to use for demonstration purposes, as I tend to do in my public lectures these days. Many people find it distracting to see examples in languages with which they are unfamiliar, and I do try to avoid this where possible. Of course, this is often impossible, and Pig Latin cannot serve all purposes as a demo target language.

My other languages of usual communication are German, Spanish and Portuguese, in that order. I can be reasonably sure of providing good source texts in those languages. For others I must depend on friends and colleagues or even our good friend Mr. Gargle. My apologies in advance for any inconvenient errors due to my ignorance of a language and the possible use of inappropriate means of generating text in it. All corrections and voluntary submissions of new language examples for sharing freely with others will be received gladly.

What about the test data in other sections of this course?

Sometimes that will be the same data with a different name or file format, and its quality may be less consistent. It can still be used, of course, and sometimes data found in this section may be included in ZIP files or other parts of a lesson elsewhere. Deal with it. I'll do my best to provide good data for testing, but in some cases this is no small task and a lack of skill, imagination or something else may result in less than perfect information being shared. All data and resources in this course are provided without warranty, in good faith, to help you learn.
