The Future of Translation Memory (TM)

Avan June 23, 2022

Several voices have been talking about the demise of TM recently, most notably Renato Beninatto, who has made it a theme of several of his talks at industry conferences in true agent provocateur spirit. More recently, Jaap van der Meer apparently said the same thing (dead in 5 years, no less) at the final LISA standards summit event. (My attempt to link to the Twitter trail failed, since @LISA_org is no more.) This prompted comments by Peter Reynolds and some commentary by Jost Zetzsche (published in the Translation Journal) questioning these death announcements and providing a different perspective.

Since there have been several references to the value of TM to statistical MT, or SMT (systems which, by the way, are pretty much all hybrid nowadays, as they try to incorporate linguistic ideas in addition to just data), I thought I would jump in with my two cents as well and share my opinion.

So what is translation memory technology? At its most basic level it is a text matching technology whose primary objective is to save the professional translator from having to re-translate the same material over and over again. The basic technology has evolved from segment matching to sub-segment matching, or something called corpus-based TM. (Is there a difference?) In its current form it is still a pretty basic database technology applied to looking up strings of words. Many of the products in the market focus a lot on format preservation and on this horrible concept called fuzzy matching (a somewhat arbitrary quantification, I think), which unfortunately has become the basis for determining translation payment. This payment scheme based on match rates is, I think, at the heart of the marginalization of professional translation work, but I digress.
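To make the "text matching" idea concrete, here is a minimal sketch of a fuzzy lookup against a tiny in-memory TM. Everything in it is an assumption made for illustration: the function names `fuzzy_score` and `lookup`, the 0.7 threshold, and the use of Python's standard `difflib.SequenceMatcher` on lowercased word tokens as the similarity measure. Commercial TM tools each compute their match percentages in their own way, so this is not how any particular product works.

```python
from difflib import SequenceMatcher


def fuzzy_score(a: str, b: str) -> float:
    """Word-level similarity in [0, 1].

    Hypothetical stand-in for a tool's fuzzy-match percentage:
    the SequenceMatcher ratio over lowercased, whitespace-split tokens.
    """
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def lookup(tm, source, threshold=0.7):
    """Return (score, stored_source, stored_target) for the best match.

    `tm` is a list of (source, target) segment pairs. Returns None when
    no stored segment reaches the (assumed) threshold.
    """
    best = None
    for stored_source, stored_target in tm:
        score = fuzzy_score(source, stored_source)
        if score >= threshold and (best is None or score > best[0]):
            best = (score, stored_source, stored_target)
    return best
```

With a stored segment "Click the Save button.", the new sentence "Click the Cancel button." shares three of four word tokens and scores 0.75 under this measure. A different tokenizer or similarity function would give a different number for the same pair, which is one way to see why a match percentage is a shaky basis for payment.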

It makes great sense to me that any translator working on a translation project should be able to refer easily to their own previous work, and possibly even to all other translation work in the domain of interest, to expedite their work. There are some, if not many, translators who are also ambivalent about TM technology, e.g. this link and this one. My sense is that the quality of the “text matching technology” in the current products is still very primitive, but the basic concept could be poised for a significant leap forward, becoming more flexible, accurate and linguistically informed, as it already is in other parts of the text-oriented tools world, e.g. search, natural language processing (NLP) and text analytics, where the stakes are higher than just making translation work easier (or finding a rationale to pay translators less). Thus, I would agree that the days are numbered for the old “klunker-type TM” technology, but I also think that new replacements will probably solve this problem in much more elegant and useful ways.

The old klunker-type TM technology has an unhealthy obsession with project and format-related metadata, and I think that as this technology evolves, linguistics will become more important. We are already seeing early examples of this next generation at Linguee. In a sense, I think we may see the kind of evolution that we saw in word-processing technology, from something used only by geeks and secretaries to something any office worker or executive could use and operate with ease. The ability to quickly access the reference use of phrases, related terms and context as needed is valuable, and I expect we will move forward in delivering useful, use-in-context material to translators who use such productivity tools.

It is clear that SMT-based approaches do get better with more TM data, and to some extent (up to 8 words) they will even reproduce what they have seen in the same manner that TM does. But we have also seen that there are limits to the benefit of ever-growing volumes of data, and that it actually matters more to have the “right” data in the cleanest possible form to get the best results. For many short sentences, SMT already performs as a TM retrieval technology, and we can expect that this capability will become more visible and that more controls may become available to improve concordance and look-ups. We should also expect that the growing use of data-driven MT approaches will create more translation memory after post-editing, so TM is hardly going to disappear, but hopefully it will get less messy. In SMT we are already developing tools to transform existing TM for normalization and standardization reasons, to make it work better for specific purposes, especially when using pooled TM data. I also think it likely that many translation projects will start with pre-translation from a combined (TM+MT) process and, hopefully, a better, more equitable payment methodology.
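As a rough illustration of what "normalization" of pooled TM can mean in practice, here is a small sketch. It is not a description of any actual SMT vendor tooling. The function names `normalize_segment` and `clean_tm` are hypothetical, and the specific cleaning rules (Unicode NFC normalization, whitespace collapsing, dropping empty segments and case-insensitive duplicates) are assumptions chosen only to show the general shape of such a step.

```python
import re
import unicodedata


def normalize_segment(text: str) -> str:
    """Apply two assumed cleanup rules to one segment."""
    # Compose characters so visually identical strings compare equal.
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


def clean_tm(pairs):
    """Normalize (source, target) pairs; drop empties and duplicates.

    Duplicates are detected case-insensitively; the first occurrence
    is kept, with its original casing.
    """
    seen = set()
    cleaned = []
    for source, target in pairs:
        src = normalize_segment(source)
        tgt = normalize_segment(target)
        if not src or not tgt:
            continue  # a pair with an empty side carries no usable alignment
        key = (src.casefold(), tgt.casefold())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append((src, tgt))
    return cleaned
```

Real pooled TM needs far more than this (tag handling, terminology harmonization, misalignment detection), but even these few rules show how a noisy pool shrinks to fewer, more consistent pairs.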

The value of TM created from the historical development of user documentation is likely to change. This user documentation TM, which makes up much of what is in the TDA repository, is seen as valuable by many today, but as we move increasingly to making dynamic content more multilingual, I think its relative value will decline. I also expect that the most valuable TM will be that which is related to customer conversations. In addition, community-based collaboration will play an ever-increasing role in building leverageable linguistic assets, and we are already seeing evidence of MT and collaboration software infrastructure working together on very large translation initiatives. It is reasonable to expect that the tools will get better at making collaboration smoother and more efficient.

I found the following video fascinating: it shows just how complex language acquisition is, has delightful baby sounds in it, and also shows what state-of-the-art database technology looks like in the context of the massive data we are able to collect and analyze today, all in a single package. If you imagine what could be possible when this kind of database technology is applied to our growing language assets and TM, I think you will agree that we are going to see some major advances in the not-so-distant future, since this technology is already touching how we analyze large clusters of words and language today. It is just a matter of time before it starts impacting all kinds of text repositories in the translation industry, enabling us to extract new value from them. What is emerging from these amazing new data analysis tools is the ability to see social structures and dynamics that were previously unseen, and to make our work more and more relevant and valuable. This may even solve the fundamental problem of helping us all understand what matters most to the customers who consume the output of professional translation.