Identifying possible segmentation issues retroactively

When considering whether to "upgrade" from the memoQ default or other segmentation rules, it may be helpful to look back at cases where you, another translator or a reviewer changed the segmentation, perhaps because of problems caused by abbreviations or other "unusual" text structures.

How is this possible? Well, as of version 10.3, memoQ does not keep track of manually joined segments. So what other options are there?

Using our heads is not a bad option. Consider what general statements you can make that may apply to cases where segments were joined on the source text.

Well, sometimes segments were joined simply because a translator decided to translate two or more sentences into one or perhaps two segments, rearranging the order of the content compared to the source text. Those cases are of no interest to us for segmentation.

Typically, in the languages I work with, the places where segments were joined to repair faulty segmentation will have a period (".") followed by a space and then a capital letter, number or symbol. So simply screening for that structure may show you places where segmentation has been fixed or where segmentation needs testing with a given ruleset. Regex is the key here.

That text structure might be defined in regex as

\.\s(\p{S}|\p{N}|\p{Lu})

That is: a period, followed by a space, followed by a symbol, a number or an uppercase letter.
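For anyone who wants to try the same screening outside memoQ, here is a minimal sketch in Python. Python's standard re module does not support \p{...} classes, so this sketch matches the period and whitespace with a plain regex and then checks the Unicode category of the following character with unicodedata. The function names are my own illustrations, not anything from memoQ.

```python
import re
import unicodedata

# A period followed by one whitespace character; the lookahead captures
# the next character without consuming it.
CANDIDATE = re.compile(r"\.\s(?=(.))")


def is_boundary_char(ch):
    """True for symbols (S*), numbers (N*) and uppercase letters (Lu)."""
    category = unicodedata.category(ch)
    return category == "Lu" or category[0] in ("S", "N")


def find_candidates(text):
    """Return the offsets of periods that look like possible segment breaks."""
    return [
        match.start()
        for match in CANDIDATE.finditer(text)
        if is_boundary_char(match.group(1))
    ]


if __name__ == "__main__":
    sample = "See fig. 3 for details. Then continue, e.g. with this."
    for offset in find_candidates(sample):
        # Show a little context around each hit.
        print(offset, repr(sample[max(0, offset - 10):offset + 5]))
```

The period before "with" is not reported, because a lowercase letter follows it. That is the same behavior the regex above is meant to give in a filter.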

This is not specific to any language or alphabet.

To test this out, use that regex to filter a large document in a project or even a TMX file imported as a "translation document" after exporting from a translation memory.

Interesting sentences found can be copied from the working grid and then pasted into a text file in order to compile sentences to be tested on a segmentation ruleset.
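If you prefer to compile the test sentences with a script instead of copying them from the grid, the following Python sketch shows one possible approach for an exported TMX file. It assumes a plain TMX layout (tu, tuv with an xml:lang attribute, seg) and uses hypothetical function names and file names. Text inside inline tag elements is simply concatenated, so heavily tagged segments may need cleanup.

```python
import re
import unicodedata
import xml.etree.ElementTree as ET

# xml:lang is stored under the XML namespace by ElementTree.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
CANDIDATE = re.compile(r"\.\s(?=(.))")


def looks_joined(text):
    """True if a period + whitespace is followed by a symbol, number or uppercase letter."""
    for match in CANDIDATE.finditer(text):
        category = unicodedata.category(match.group(1))
        if category == "Lu" or category[0] in ("S", "N"):
            return True
    return False


def collect_test_sentences(tmx_source, lang_prefix):
    """Collect unique segments in the given language that match the pattern.

    tmx_source may be a file path or a binary file object.
    """
    found = []
    for tuv in ET.parse(tmx_source).iter("tuv"):
        lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
        if not lang.lower().startswith(lang_prefix.lower()):
            continue
        seg = tuv.find("seg")
        if seg is None:
            continue
        text = "".join(seg.itertext()).strip()
        if looks_joined(text) and text not in found:
            found.append(text)
    return found


def save_sentences(sentences, path):
    """Write one sentence per line to a UTF-8 text file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(sentences) + "\n")
```

A call such as save_sentences(collect_test_sentences("export.tmx", "en"), "segmentation_tests.txt") would then produce a text file that can be imported with the ruleset you want to test. Both file names here are placeholders.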

Here is an example of regex used to find TM segments spoiled by the infamous bug for known abbreviation exceptions following a left parenthetical structure:

Some useful regexes related to segmentation issues are included in a small Regex Assistant library below.

Here is a Regex Assistant library...

... with some regexes that may prove useful when working with segmentation issues. It includes the regexes above and the one from my comment below.

If you come up with your own useful regexes related to segmentation, remember to save them in your Regex Assistant library!

Note: this export of segmentation-related regexes was updated on October 13, 2023.
