Clarifying Copyright on Translation Data
- Created on 16 January 2013
A world without language barriers
This could be a reality within 10 to 15 years. The technology is there. What it will take is a very large-scale coordinated effort between governments, businesses and academia worldwide. We call it the Human Language Project. The goal is to reach sufficient adequacy and fluency in fully automatic translation so that most of the world’s citizens can speak and write their own language and be understood by everyone else.
In two articles TAUS wishes to focus on the pre-conditions for making this happen. This is the first article: a call on legislators and policy makers to consider how to address copyright law for the new legal entity of translation data. The second article will focus on defining and framing this mega-project.
Let’s make translation easier
The French president François Hollande demanded that the Dutch return all the borrowed words from the French language as of the year 2015. The Dutch people will then no longer be allowed to use many words – such as ‘dossier’, ‘portefeuille’, and ‘ordinair’ – that have become commonly used words in the Dutch language.
This news item was published on November 15 by the Dutch newspaper de Volkskrant, under the parody section of course. Who would seriously think that national governments can claim ownership over the words that people use? After all, we all copy each other’s words in order to be better understood and to communicate well. We have done so ever since we were babies and heard our mothers speak our first words to us.
And yet when reviewing the intellectual property rights on terminology and translations, it seems that we are entering a minefield. Publishers, authors and translators have the right to own words and users need to be aware of that. And the relevant law is different in different parts of the world.
We believe it is time for legislators in Europe, the United States, Canada and all other leading nations to clarify the current copyright law on a new technology phenomenon, namely translation data. During the last decade, innovation and creativity in technology, business processes and collective intelligence have made a remarkable impact on the global translation industry.
These forces have been generating resources and processes that have revolutionized the way companies and governments can communicate with users and citizens. This has resulted in the creation of a certain type of data that we feel should have an independent existence under the law, or at least be given special status as a technology-induced phenomenon.
There are parallels. A similar cry has gone up from communities involved in leveraging knowledge from large text corpora in the fields of life sciences, law and similar ‘big data’ research areas. In the UK, for example, there are efforts underway to release such data from the constraints of legal protection and allow non-commercial innovative text mining initiatives to thrive by changing the copyright conditions on academic journal content.
We would argue that translation memory usage could be conceived as a special case of mining for data, even though its ultimate usage will probably be commercial. Making exceptions under the copyright law for specific types of data – “translation words” – by a broader cohort of legislators might offer a better path through the legal maze.
We now realize that further progress will not simply depend on better technical fixes, but also on solving the daily conflicts between the principles of intellectual property law, corporate policies, business practices, and what we can call pragmatic use of data. These copyright issues are obviously not the only stumbling block to progress in general, but they do raise important issues over the long term, and relate to fields of technology-driven innovation that are similarly focused on leveraging intelligence from forms of data.
Here is a view of how we can collectively improve the legislative framework for all stakeholders in tomorrow’s translation industry.
Copyright law has had a difficult time in the digital age, now that file copying has become instantaneous and ubiquitous. We all agree that intellectual property rules are vital for protecting business value. Yet the practice of translation and localization (which creates products that we own but which somehow escape our total control as content) raises new and unexpected concerns for many digital stakeholders. Let’s take a closer look at the steps in the translation process and how these relate to copyright.
First there is the original work, the document written in the source language. IP rights to this document belong to the author or to the company that employs the author or that has purchased the services of the author as a subcontractor. Then there is the translation of the full source document.
The IP rights can belong to the translator or to the translation company, but are generally transferred to the publisher who hires the translator or the translation company to provide the translation service. It should be noted here that this copyright protection on translations is absolutely automatic under the law: it is not necessary to register the work or go through any red tape to transform a translation file into a copyrighted translation file. But translated works can be registered in an official IP repository to ensure stronger protection where necessary.
During the translation process a translation memory (TM) will be created – i.e. a list of all the sentences in source and target languages in a particular database format. Generally speaking, neither the original source document nor the translated document can be recreated automatically from this digital translation memory file. IP rights to the individual sentences – source and target – still belong to the author, translator or the company that employed them or paid for their services.
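As a rough illustration of why a translation memory is a different object from the documents it was built from, here is a minimal Python sketch. All names in it are hypothetical, and real TM tools store far richer metadata; the sketch simply assumes that aligned segment pairs are kept as a deduplicated, sorted collection. Under that assumption, document order and repetition are lost, so the original text cannot be rebuilt from the memory alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class TranslationUnit:
    """One aligned segment pair; a hypothetical, minimal TM entry."""
    source: str
    target: str


def build_tm(source_segments, target_segments):
    """Build a toy translation memory from two aligned segment lists.

    Duplicates are dropped and entries are sorted, so the order of the
    original document is not preserved.
    """
    if len(source_segments) != len(target_segments):
        raise ValueError("source and target must have the same number of segments")
    units = {TranslationUnit(s, t) for s, t in zip(source_segments, target_segments)}
    return sorted(units)


# A three-sentence document in which one sentence occurs twice.
source_doc = ["Click Save.", "Please enter your password.", "Click Save."]
target_doc = [
    "Cliquez sur Enregistrer.",
    "Veuillez saisir votre mot de passe.",
    "Cliquez sur Enregistrer.",
]

tm = build_tm(source_doc, target_doc)
for unit in tm:
    print(unit.source, "->", unit.target)
```

The memory holds two units for a three-sentence document, which is the sense in which it is "made out of bits of other works" rather than being a copy of either document.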
The critical issue here is that TM technology creates a new database with its own format and attributes, and this forms a completely new work with its own IP rights, made out of bits of other works. Once again, the IP rights to the translation memory belong either to the translator or translation company, or they may be transferred to the publisher who paid for the translation service if this is clearly stated in the service agreement.
That said, in Europe there is a so-called sui generis right on databases that may need to be reviewed. The 1996 Database Directive grants this right, which is separate from copyright, to the maker of a database that contains non-original data but which nevertheless required a “substantial investment” to obtain, verify or present its contents. The Directive also provides copyright protection for a database whose selection or arrangement is original; in that case the structure and not the content is protected, though the content may obviously be copyright protected under some other head in the law.
The lack of full harmonization across EU countries on copyright in databases, and on what counts as an “original” work under the law, is another knotty issue facing the quintessentially cross-border practice of translation and its automated processes.
Policies, practices and pragmatists
The confusion about who owns what at which moment in the workflow can lead to a conflict between policies and practices. This opens up a source of profit for pragmatists and innovators. An ‘educated’ customer (i.e. one who is fully aware of the benefits of TM) will deem it good business practice if the translator re-uses the translation memory created during a previous job for a new job.
It will mean that the terminology will be more consistent and the price will be lower. If this educated customer provides the tools platform and access to the previously created translation memories, this process could become a welcome industry practice.
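The consistency and cost effect of re-using a translation memory can be sketched as an exact-match lookup. This is a simplified illustration only: the function and variable names are hypothetical, the memory is reduced to a plain dictionary, and the fuzzy matching that real tools offer is left out.

```python
def pretranslate(segments, tm):
    """Split a new job into segments already in the TM and segments still to translate.

    `tm` is a plain dict mapping source segments to approved translations
    (an assumed, simplified stand-in for a real translation memory).
    """
    reused = {}
    to_translate = []
    for segment in segments:
        if segment in tm:
            reused[segment] = tm[segment]   # same wording as in the previous job
        else:
            to_translate.append(segment)    # only these need new work
    return reused, to_translate


previous_job_tm = {
    "Please enter your password.": "Veuillez saisir votre mot de passe.",
    "Click Save.": "Cliquez sur Enregistrer.",
}
new_job = ["Please enter your password.", "Click Cancel.", "Click Save."]

reused, to_translate = pretranslate(new_job, previous_job_tm)
leverage = len(reused) / len(new_job)
print(f"{len(reused)} of {len(new_job)} segments reused ({leverage:.0%})")
```

Segments found in the memory come back with exactly the wording approved earlier, which is where the consistency comes from, and only the remaining segments have to be paid for as new translation.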
However, the vast majority of translation buyers rely on their translation vendors to manage the translation memories and tool platforms, even though they may insist in their service agreements that the IP rights to these translation memories belong to them. But what if translation buyers use the services of multiple vendors? What if they change vendors and the new vendor subcontracts the translation jobs to the same translators who did the original translations? In all these cases it will be hard to follow the best practice suggested above.
Even if the informed translation buyer legally owns the translation memories, the translation memory as a file (to which the IP rights attach) may no longer exist in its original form. It has probably been mixed and matched with new translation memories, creating a new database with its own new IP owner. The upshot of all this is that the educated translation buyers are completely lost; they probably wish they knew as little about the issue as the vast majority of organizations that buy translation services.
In recent years pragmatists and innovators have entered the market to make the most of this confusion around IP rights for translation memories. Innovators often help clarify issues, prompt welcome changes in inadequate legislation and pave the way for growth. In the translation industry, innovators offer online services and tool platforms that make translation faster and easier for end users.
These users may not realize that they are allowing these new innovative providers to use their translations – not to recreate the original work, but to carry out research on translation technology, and generate derivative work. For these innovators, translation memories have become ‘data’ that help improve automatic translation engines.
A 21st century view on translation
Today’s practices, policies and principles of IP legislation all stem from a last-century definition of translation, whereby translation memories were merely intended to help the translator do a better job a little faster, somewhat cheaper and more consistently than before.
Today the focus is shifting from translation memories on hard disks to massive amounts of translation data in the cloud, in the form of parallel text corpora. These translation data may be accumulated from translation memories, or from online translation service platforms or harvested (‘crawled’ and aligned) from localized versions of web sites and other sources.
In addition, these data are often usefully annotated with attributes for domain, content type and source. In this way translation data are becoming the key to quantum leaps forward in translation efficiency. Google has demonstrated this already in the past five years by training new machine translation engines for 4032 different language pairs by using data, nothing but translation data.
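The figure of 4032 is what one gets when every language is paired with every other language in both directions. The language count itself is not stated above; 64 is an assumption used here because it reproduces the figure. The arithmetic can be checked with a short sketch (the function name is hypothetical):

```python
from itertools import permutations


def directed_pairs(n_languages: int) -> int:
    """Number of ordered (source, target) pairs among n distinct languages."""
    return n_languages * (n_languages - 1)


# 64 is an assumed language count, chosen because it reproduces the figure in the text.
pairs_for_64 = directed_pairs(64)
print(pairs_for_64)

# Cross-check the formula by enumerating ordered pairs for a small case.
langs = ["nl", "fr", "en", "ja"]
enumerated = len(list(permutations(langs, 2)))
print(enumerated == directed_pairs(len(langs)))
```

The count grows roughly with the square of the number of languages, which is why a purely data-driven approach to this many pairs depends on very large volumes of translation data.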
In their article “The Unreasonable Effectiveness of Data” (published in March-April 2009 in IEEE Intelligent Systems), Google scientists Alon Halevy, Peter Norvig and Fernando Pereira make the case for anyone who wants to train machines to translate to “go out and gather some data, and see what it can do.” Since 2009 many language service providers, large and small organizations and new-generation MT developers have followed this advice and started training MT engines with whatever data they could lay their hands on.
Summary of key issues
What we are suggesting here is that complex multi-layer IP rights, conflicts between policies and practices, and somewhat old-fashioned definitions of translation memory and data tend to make it harder to build a prosperous, innovative and fast-growing global translation industry that helps the world to communicate better.
Obviously organizations need to legally protect their ownership rights over their content and its translations. But we feel that a clearer distinction between the ‘words’ of their content as data, and the structure of their content as published documents could be instrumental in opening up language data as a collective good.
The ideal situation
What the industry needs
In a November 2007 article entitled You Must Remember This: The Copyright Conundrum of “Translation Memory” Databases, Francie Gow studied the precarious position under copyright law of translators who use translation memory tools. Copyright law does not allow translators and translation companies to use the translation memories they have built for one customer to help them on projects for another customer. Legally speaking they are required to destroy their translation memories after projects are completed.
That seems to be an extremely counterproductive ruling. Customers choose to work with professional translators because they are experienced and skilled. Like lawyers and management consultants, they are expected to keep libraries of past work to build on in the future, says Francie Gow, who investigates whether a case can be made for fair dealing under Canadian and US law when translators keep copies of translation memories for the leveraging of new translation jobs.
Her conclusions are that in some cases the use of translation memories may be permitted under the fair use/fair dealing exception to copyright legislation. But this too is a minefield. Translators and translation companies who do not want to upset their customers naturally stay on the safe side of the law and accept the consequences.
The translation industry today is under tremendous pressure to keep up with market demand. Volumes keep growing, turnaround times are shrinking and the cost per word has to come down. There are not enough professional translators in the world to meet this demand. By introducing more openness and greater ‘shareability’ of data under specific conditions into the current copyright law, the translation industry would be able to innovate and automate processes more efficiently, and in everyone’s interest.
If translation as data can be used freely to develop derivative work, improve quality and drive research into new technologies, then the industry – and that means all stakeholders: buyers, vendors and technology suppliers – would be able to flourish as never before. We could expect a broader range of services, much larger capacity, and new opportunities for technology innovation.
There will naturally be a concern to protect the underlying intent of copyright law – i.e. to protect originality and creative work. But in fairness, the IP criteria of originality and creativity should only be applied to a fraction of the words that are actually published these days. How many different translations would a user in any given country in the world like to see of the sentence: “Please enter your password”?
Companies, governments and NGOs along with citizens, end-users, taxpayers and patients everywhere would benefit enormously if ninety percent of the words written and translated could be leveraged as data to drive better, quicker communication.
What the world needs now
If language were no longer a barrier for a Japanese student reading a French newspaper and a German consumer placing an order on a Greek web shop, the world would look very different. We would have a much better understanding across cultures, which might in turn diminish some of the risks of international conflict and political disintegration. Global business would grow exponentially. Breaking the language barrier would help streamline the globalization of business and politics.
Erik Ketzan in his article Rebuilding Babel: Copyright and the Future of Machine Translation Online (2006), says: “Technology may have put us on the moon, but machine translation has the potential to take us farther, across the gulf of comprehension that lies between people from different places.”
The ideals of a prosperous global translation industry and a world of better communications will benefit enormously from greater clarity about the translation-specific nature of copyright law.
New principles for a revision of copyright law
We estimate that 75% of texts written and translated in the world these days are published online. This makes it very hard to protect the usual intellectual property rights of all these texts. Innovators are active every day, using automatic spiders to crawl the web and align translations. In North America this use of translation data may be allowed under the exceptions of fair use and fair dealing.
Europe tends to interpret copyright law on a much stricter basis. Yet the innovators are ubiquitous, and they range from very large global IT companies in the USA to small start-ups in any part of the world. It is hard to call them ‘thieves’ and impossible to prosecute them. This creates unfair competition both inside and outside the translation industry. One way to clarify the law would be to create a more open, sharing translation environment, in which every stakeholder is free and able to use data to optimize their translation process.
Obviously the act of modernizing copyright law to reflect this reality among the leading nations of the world will not alone pave the way to a wonderful new world of global translation. There are many other issues at stake. But we would like to propose six simple principles that focus on the specifics of the translation issue:
- A clear distinction must be made between the way Intellectual Property (IP) rights are treated for the text to be translated (the Source), the translation (the Target) and Translation Data as a new legal entity.
- Translation Data are defined as a database containing terms, phrases and segments of text, aligned between two or more languages. Translation Data in most cases contain phrases and segments from many Sources and Targets. If the database allows users to reconstruct the Source or Target, as referred to in the first principle above, this will be considered an infringement of the IP rights assigned to the Source and Target.
- IP rights to the Source and Target may be held exclusively by the author, the translator or the company that is publishing the Source and Target.
- Translators and translation companies should be allowed to store, share and aggregate Translation Data for the purposes of developing derivative work, leveraging and reusing translations, research, and improving their services.
- The translator or the company that aggregates the Translation Data holds IP rights to the Translation Data in the form in which the data are stored and used in the database.
- Owners of Source and Target should know that they can legally protect their documents from copying when they publish on the web.
We hope that these six simple principles help to start the discussion towards a more open, data-sharing vision in the translation industry. We welcome your comments.
P.S. December 2012. Clearly this proposal to make translation data a shareable public good meshes with calls for a legislative rethink about the digital economy in Europe. On Dec 5, the European Commission agreed to modernize copyright to ensure it is “fit for purpose” and “allow new business models to emerge.”
This initiative will address six issues: cross-border portability of content; user-generated content; data and text-mining; private copy levies; access to audiovisual works; and cultural heritage. Data and text mining are highly relevant here, and innovative reinvention in such areas might eventually lead to legislative reform. We shall naturally follow this work with close interest.
Authors: Jaap van der Meer and Andrew Joscelyne
References and further reading
- The Unreasonable Effectiveness of Data, Alon Halevy, Peter Norvig, and Fernando Pereira (Google)
- You Must Remember This: The Copyright Conundrum of "Translation Memory" Databases, Francie Gow
- Rebuilding Babel: Copyright and the Future of Machine Translation Online, Erik Ketzan
- The Human Language Project: Inventing the Future of Translation Data, article on TAUS web site.
- Choose Your Own Translation Future, article on TAUS web site.