Date rules: the devils are in the details

With auto-translation rules, memoQ has a significant advantage over other translation environments.

I covered some of these advantages in an old blog post describing a solution for using a single ruleset to check the correct reformatting of dates from four source languages into the desired target language. Years ago, I also published a crazy demonstration of date "mapping" in which long dates in 17 source languages were translated to Swahili. But...

... it's easy to go astray if the context of use and language-specific details are not well understood.

Take French, for example. This course has some long date rules which use French as a source language. Without the advice of a competent French linguist, I would have messed those up for dates on the first day of every month, because only the first day takes an ordinal suffix on the number ("1er"); every other day of the month is written as a bare number.

1er avril 2000 

15 mai 2023

5 juin 2023

etc.
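For readers who think in code, here is a rough Python sketch of that French wrinkle. It is not memoQ rule syntax and not taken from the course rulesets; the month table and the names FRENCH_MONTHS, FRENCH_LONG_DATE and parse_french_long_date are made up for illustration. The only point is that the day part has to accept "1er" as well as a bare number.

```python
import re

# Illustrative month table; an assumption, not copied from the course rulesets.
FRENCH_MONTHS = {
    "janvier": 1, "février": 2, "mars": 3, "avril": 4, "mai": 5, "juin": 6,
    "juillet": 7, "août": 8, "septembre": 9, "octobre": 10,
    "novembre": 11, "décembre": 12,
}

# The day is either "1er" (first of the month) or a bare one- or two-digit number.
FRENCH_LONG_DATE = re.compile(
    r"\b(?P<day>1er|\d{1,2})\s+(?P<month>"
    + "|".join(FRENCH_MONTHS)
    + r")\s+(?P<year>\d{4})\b",
    re.IGNORECASE,
)


def parse_french_long_date(text):
    """Return (day, month, year) for the first French long date found, else None."""
    match = FRENCH_LONG_DATE.search(text)
    if match is None:
        return None
    day_text = match.group("day").lower()
    if day_text.endswith("er"):
        day_text = day_text[:-2]  # drop the ordinal suffix used only on day 1
    month = FRENCH_MONTHS[match.group("month").lower()]
    return int(day_text), month, int(match.group("year"))


print(parse_french_long_date("1er avril 2000"))  # (1, 4, 2000)
print(parse_french_long_date("15 mai 2023"))     # (15, 5, 2023)
```

A pattern written with only a bare number for the day would silently miss every first-of-the-month date, which is exactly the mistake described above.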

In some rulesets, such as the current versions for Ukrainian target texts, I've skipped the month+year structure (with no day). From reading annual reports from Ukrainian companies, I have the impression that I need to know more about the case governance of certain prepositions and do some special coding to handle it. I need to talk to some Ukrainians about that.

If you translate "German" to some sort of English, you might think that the months of Januar, Februar, März, April, Mai, Juni, Juli, August, September, Oktober, November and Dezember have you covered, right? Well... not if you encounter Austrian texts with "Jänner" (take a wild guess at the month). Or a text written by someone who decides that "Juno", the form of the German word for June often used in conversation to avoid confusion with July, is something that should be written. Your rules should anticipate those situations where possible.
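One way to picture this is a lookup list in which several spellings point to the same month. The Python sketch below is only an illustration of that idea, with the hypothetical names GERMAN_MONTHS and german_month_number; it contains the standard month names plus the two variants mentioned above and makes no claim to be complete.

```python
# Standard German month names plus the two variants mentioned in the text.
# Illustrative only; a real list would grow with the texts you actually see.
GERMAN_MONTHS = {
    "januar": 1, "jänner": 1,   # "Jänner" is the Austrian form
    "februar": 2, "märz": 3, "april": 4, "mai": 5,
    "juni": 6, "juno": 6,       # "Juno" is the spoken form, sometimes written
    "juli": 7, "august": 8, "september": 9, "oktober": 10,
    "november": 11, "dezember": 12,
}


def german_month_number(name):
    """Return the month number for a German month name, or None if unknown."""
    return GERMAN_MONTHS.get(name.strip().lower())


print(german_month_number("Jänner"))  # 1
print(german_month_number("Juno"))    # 6
```

Returning None for an unknown name, rather than guessing, makes it obvious when a new variant has turned up and the list needs another entry.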

Someone who translates only British English to another language might think that only one format need be used for the source input. Uh, no. I've lost track of how many times I have had to extend a "good" working ruleset to cover variations like:

12 August 2023

the 12th of August, 2023

this 12th day of August in the year of Our Lord 2023

-- not to forget the text written by some immigrant who studied in the US --

August 12th, 2023

and so on. You may think your input formats are simple until you actually start to create translation style and QA rules to handle them. It's an education.
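To give a feel for how quickly the patterns grow, here is a hedged Python sketch that covers just the four variations listed above. The names DAY_FIRST, MONTH_FIRST and parse_english_long_date are invented for this example, the patterns are my assumption of one workable approach, and nothing here is memoQ rule syntax.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
ORDINAL = r"(?:st|nd|rd|th)?"  # optional ordinal suffix on the day

# "12 August 2023", "the 12th of August, 2023",
# "this 12th day of August in the year of Our Lord 2023"
DAY_FIRST = re.compile(
    r"\b(?:(?:the|this)\s+)?(?P<day>\d{1,2})" + ORDINAL
    + r"(?:\s+day)?(?:\s+of)?\s+(?P<month>" + MONTHS + r")"
    + r"(?:\s+in\s+the\s+year\s+of\s+Our\s+Lord)?,?\s+(?P<year>\d{4})\b",
    re.IGNORECASE,
)

# "August 12th, 2023"
MONTH_FIRST = re.compile(
    r"\b(?P<month>" + MONTHS + r")\s+(?P<day>\d{1,2})" + ORDINAL
    + r",?\s+(?P<year>\d{4})\b",
    re.IGNORECASE,
)


def parse_english_long_date(text):
    """Return (day, month name, year) for the first date found, else None."""
    for pattern in (DAY_FIRST, MONTH_FIRST):
        match = pattern.search(text)
        if match:
            return (int(match.group("day")),
                    match.group("month").capitalize(),
                    int(match.group("year")))
    return None


print(parse_english_long_date("the 12th of August, 2023"))  # (12, 'August', 2023)
print(parse_english_long_date("August 12th, 2023"))         # (12, 'August', 2023)
```

Every optional group in DAY_FIRST is there because some real text needed it, and the next document will probably need another one.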

And did I forget to talk about typos and the plain perversion of inconsistency? Missing spaces? Extra spaces? A comma typed where a period (point, full stop) was intended? Some of the rules provided in this course account for such things, others do not, and most of those differences are a matter of use histories often extending beyond a decade. Your rules will evolve as they are used, and that's normal. A good thing. Prepare for it by anticipating these factors as best you can and including that information in internal documentation (XML comments written in a code editor, for example) and the instructions given to "experts" who might help with the coding.
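As a small illustration of what "accounting for such things" can mean, the Python sketch below accepts a day.month.year short date even when spaces are missing or added and a comma has been typed for a period. The short date format chosen, the pattern and the names SLOPPY_SHORT_DATE and parse_sloppy_short_date are assumptions for this example only.

```python
import re

# Day, month and year separated by a period OR a comma, with optional
# stray spaces on either side of each separator.
SLOPPY_SHORT_DATE = re.compile(
    r"\b(?P<day>\d{1,2})\s*[.,]\s*(?P<month>\d{1,2})\s*[.,]\s*(?P<year>\d{4})\b"
)


def parse_sloppy_short_date(text):
    """Return (day, month, year) for the first tolerant match, else None."""
    match = SLOPPY_SHORT_DATE.search(text)
    if match is None:
        return None
    return (int(match.group("day")),
            int(match.group("month")),
            int(match.group("year")))


print(parse_sloppy_short_date("12.08.2023"))    # (12, 8, 2023)
print(parse_sloppy_short_date("12 ,08. 2023"))  # (12, 8, 2023)
```

The trade-off is that a more forgiving pattern can also catch things that are not dates at all, such as a list of numbers, so how tolerant to be depends on the texts you actually see.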

You are the expert on what might be expected in the texts you work on, not some whiz kid regex genius who has never translated a single challenging document in your specialty before.

And those crazy four source language date rules???

You want a look at them? Sure. Here are some notes from the day I wrote them quickly to avert a client disaster for one of my LSPs. These rules are not the final version; later, I streamlined the way the lists worked and split the languages into different rulesets with one or two source languages each to improve performance.

This ruleset maps a number of source text date formats to the desired English target format specified by the client: DDmonth-abbreviationYYYY, for example 03Mar2021. It also handles day+month and month+year occurrences.

Source formats are:

  • International date YYYY MM DD with separators of various kinds, including spaces
  • German, Italian, Portuguese and Spanish long, abbreviated and short date formats with any capitalization or period punctuation present or missing
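To make the target format concrete, here is a minimal Python sketch of the first case only: an international YYYY MM DD date with assorted separators rewritten as DDmonth-abbreviationYYYY. It is not the ruleset itself and not memoQ syntax; the names MONTH_ABBR, INTL_DATE and to_client_format, and the exact set of separators, are assumptions for illustration.

```python
import re

MONTH_ABBR = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# Year, month, day separated by a hyphen, period, slash or space,
# with optional extra spaces around the separator.
INTL_DATE = re.compile(
    r"\b(?P<year>\d{4})\s*[-./ ]\s*(?P<month>\d{1,2})\s*[-./ ]\s*(?P<day>\d{1,2})\b"
)


def _reformat(match):
    day = int(match.group("day"))
    month = int(match.group("month"))
    if not (1 <= month <= 12 and 1 <= day <= 31):
        return match.group(0)  # not a plausible date: leave the text alone
    return f"{day:02d}{MONTH_ABBR[month - 1]}{match.group('year')}"


def to_client_format(text):
    """Rewrite international dates in text as DDMonYYYY, e.g. 03Mar2021."""
    return INTL_DATE.sub(_reformat, text)


print(to_client_format("2021-03-03"))       # 03Mar2021
print(to_client_format("Due 2021/12/31."))  # Due 31Dec2021.
```

The long, abbreviated and short formats in the four source languages need month-name lists on top of this, which is where most of the real work in the ruleset went.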

The current version, 3.7, does not fill in the year for day+month occurrences. If this is needed, let me know.

Download them here, go forth and sin no more on dates....
