Advanced leveraging (AL): the technique of gaining greater benefits from existing TMs by exploiting parallel data at the sub-sentential level (e.g. short phrases). See the TAUS Report 'How to increase your leveraging' for a comprehensive overview of AL.
Application Programming Interface (API): a set of software components and protocols providing an interface for software programs to communicate with each other without human intervention. In other words, an API is a description of the way one service communicates with other services. See the TAUS Translation API, a simple, open API specification for anyone wanting to adopt best practices for a translation services API in order to interact with counterparts directly from their own application or content management system. The API helps ensure interoperability for the most common tasks.
BCP 47: a normative IETF Best Current Practice document that compiles recommendations on how to create a unique language tag from codes defined in several other normative sources.
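As an illustration, a BCP 47 tag is assembled by joining subtags with hyphens; the sketch below is illustrative only (the subtag values come from ISO 639, ISO 15924 and ISO 3166-1, and the helper name is invented):

```python
# Sketch: composing a BCP 47 language tag from its subtags.
# Subtag sources: language (ISO 639), script (ISO 15924), region (ISO 3166-1).
def make_tag(language, script=None, region=None):
    """Join subtags with hyphens, omitting any that are absent."""
    return "-".join(s for s in (language, script, region) if s)

print(make_tag("zh", "Hant", "TW"))  # Chinese, Traditional script, Taiwan
print(make_tag("sr", "Latn"))        # Serbian written in Latin script
print(make_tag("en", region="US"))   # English as used in the United States
```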
Bitext: text in two natural languages organized as an ordered succession of aligned source and target pairs.
BLEU score (BLEU): Bi-Lingual Evaluation Understudy, an algorithm for evaluating MT output against a reference human translation. Best used to evaluate improvements of an MT system over several cycles of training. BLEU is not a useful metric for MT end users trying to evaluate quality.
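The core of BLEU can be sketched as a geometric mean of clipped n-gram precisions multiplied by a brevity penalty. The single-reference version below is a simplified illustration and omits the smoothing used in practice:

```python
from collections import Counter
import math

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i+n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i+n]) for i in range(len(reference) - n + 1))
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:          # unsmoothed: any zero precision -> 0
        return 0.0
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

hyp = "the cat sat on the mat".split()
ref = "the cat sat on the mat".split()
print(bleu(hyp, ref))  # 1.0 for an exact match
```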
Common Locale Data Repository (CLDR): a project of the Unicode Consortium to provide locale data in XML format for use in computer applications.
Confidence estimation: a general machine learning approach covering the methods used to characterize the behavior of NLP systems and derive confidence measures. In machine translation it provides an estimation of the probability of an output being correct, given a generated translation. In contrast with traditional methods of MT evaluation, reference translations are not required for this process.
Customization: adapting and tuning an MT engine to a specific enterprise or usage objective, using relevant data (often the customer's own), appropriate terminology, and specific rules optimized for a given customer or occasion of use. Often associated with RBMT applications.
Defense Advanced Research Projects Agency (DARPA): an agency of the United States Department of Defense which is responsible for the development of new technologies for use by the military. In recent years, it has regularly launched R&D projects in the field of text and speech-to-speech translation. The TIDES program (Translingual Information Detection, Extraction and Summarization) has been working on MT for intelligence for the past decade.
Data cleaning: Removing unwanted tags and other items in parallel corpora (TMs) or terminology lists to improve quality for SMT processing.
Decoder: an SMT algorithm that searches the space of possible target-language sentences for the one with the highest probability as a translation of a given source sentence.
Domain, in-domain: a domain is an acknowledged universe of discourse, associated with an industry, company or product, exhibiting specific terminology and other linguistic features. In-domain terminology or language data is terminology or data belonging to that industry, company or product.
Dynamic Quality Framework (DQF): a framework for selecting best-fit translation quality evaluation models, a knowledge base documenting industry best practices for applying evaluation models, and shared tools to enable industry benchmarking. See the TAUS Dynamic Quality Framework for more information.
Engine: an individual instance of an MT system. A system could have several engines, e.g. covering different language pairs, or dedicated to specific domains.
European Telecommunications Standards Institute (ETSI): a non-profit standardization organization within the telecommunications industry, specifically for equipment makers and network operators.
Example Based MT (EBMT): knowledge is acquired from a bilingual text using basic statistics (similar to learning by analogy). In many ways, it is an early form of SMT.
Fair, Reasonable And Non-Discriminatory terms (FRAND): a licensing obligation that is often required by standard-setting organizations for members that participate in the standard-setting process. (See also: RAND)
General Text Matcher (GTM): a software package that measures the similarity between texts by matching the components of, for example, a text and its translation. GTM can be used to help evaluate MT, by checking whether all elements in the source are represented in the target.
Hybrid MT (HMT): an MT system that combines both rule based and statistical processes. Also, more generally describes any MT system that uses TMs and other data sources in the workflow.
International Components for Unicode (ICU): an IBM-driven open source project, consisting of two subprojects, ICU4C and ICU4J, that provide libraries for C/C++ and Java respectively. ICU is the common denominator for both the Android and iOS operating systems.
Internet Engineering Task Force (IETF): a non-membership standardization body. The IETF creates Internet-related technical standards, protocols, processes and non-normative informational content as RFCs. The IETF is institutionally and financially backed by the Internet Society.
Interoperability: this occurs when two or more systems exchange and process information without human intervention. See TAUS web section on Interoperability for more information.
Internationalization Tag Set (ITS): a mechanism to provide XML content with metadata that facilitates localization or cultural adaptation. ITS defines seven data categories, primarily designed for the internationalization of XML content, as abstract data categories; ITS can also be implemented in non-XML environments.
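For illustration, a hypothetical fragment using ITS local markup (the Translate data category) might look like this; the document structure and content are invented:

```xml
<!-- Illustrative fragment (ITS local markup): the its:translate
     attribute tells localization tools to leave the key name untranslated. -->
<article xmlns:its="http://www.w3.org/2005/11/its" its:version="1.0">
  <p>Press <code its:translate="no">Ctrl+S</code> to save your work.</p>
</article>
```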
Language/Localization Business Innovation (LBI): the drive to improve and extend translation automation processes in the localization and language industry as a whole.
Language pair: any two languages used in a translation context (source and target).
Language service provider (LSP): a company or organization, also known as a translation agency, that recruits and manages in-house translators and freelancers to provide translation/language services to the industry or community.
Latent Dirichlet Allocation (LDA): a topic model that analyses a given corpus or dataset and discovers the latent topics that combine to make up its texts or documents.
Latent Semantic Analysis (LSA): a mathematical method for computer modelling and simulation of the meaning of words and passages by analysis of representative corpora of natural text. LSA closely approximates many aspects of human language learning and understanding. It supports a variety of applications in information retrieval, educational technology and other pattern recognition problems where complex wholes can be treated as additive functions of component parts. One application of LSA is finding closely related words.
METEOR: a software program that automatically evaluates the output of machine translation engines by comparing them to one or more reference translations. While METEOR is an improvement on BLEU, it is limited to use for English, French, German and Spanish only.
Monolingual data: language resources in a single (usually the target) language, whose style and terms can feed into the MT output.
National Institute of Standards and Technology (NIST): a US federal agency. Its Open Machine Translation (OpenMT) program runs cycles of MT evaluation on various language pairs in which engines can compete.
Normalization: cleaning TMs so they are better able to train an SMT workflow. Includes checking and removing unnecessary inline tags, irrelevant bits of data, mistranslations of homonyms, acronyms spelled out in target versions, one-into-two sentence mismatches, punctuation inconsistencies and upper/lower case mismatches.
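A few typical normalization steps can be sketched as simple text transformations; the rules below are illustrative, not exhaustive:

```python
import re

def normalize(segment):
    """Sketch of typical TM clean-up steps before SMT training:
    tag stripping, punctuation and whitespace normalization."""
    segment = re.sub(r"<[^>]+>", "", segment)        # drop inline tags
    segment = re.sub(r"[“”]", '"', segment)          # normalize curly quotes
    segment = re.sub(r"\s+", " ", segment).strip()   # collapse whitespace
    return segment

print(normalize("Click  <b>Save</b> to store “all”  changes."))
# Click Save to store "all" changes.
```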
Organization for Advancement of Structured Information Standards (OASIS): formerly called SGML Open, OASIS concentrates on the development of XML-based specifications that target interoperability and automation in specific areas, such as authoring, business transactions, web services, and localization.
Open Architecture for XML Authoring and Localization (OAXAL): an OASIS standards-based initiative which encourages the development of an Open Standards approach to XML Authoring and Localization.
Okapi Framework: a set of localization engineering transformation tools, an Open Source Reference Implementation of Localization Open Standards, such as XLIFF, TMX, TBX, SRX, ITS and the OAXAL reference architecture model.
Part-of-speech (POS) tagger: a software tool or library that assigns part of speech labels to each token from the input text (Ex. “he goes home” - > “he/PN goes/V home/NN”).
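A toy dictionary-lookup tagger reproducing the entry's example (real taggers disambiguate tags statistically or neurally from context; the lexicon below is invented):

```python
# Minimal lookup-based POS tagger; unknown tokens get the tag UNK.
LEXICON = {"he": "PN", "goes": "V", "home": "NN"}

def tag(text):
    """Attach a part-of-speech label to each whitespace-separated token."""
    return " ".join(f"{tok}/{LEXICON.get(tok, 'UNK')}" for tok in text.split())

print(tag("he goes home"))  # he/PN goes/V home/NN
```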
Phrase table: in SMT, a large set of n-gram (word or phrase) pairs over the source and target languages, together with their translation probabilities. Can grow to millions of items for a given translation job.
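A phrase table can be pictured as a mapping from source phrases to scored target candidates; the entries and probabilities below are invented for illustration:

```python
# Toy phrase table: source n-grams mapped to (target phrase, probability)
# pairs. Real tables hold millions of such entries.
phrase_table = {
    "the house": [("la maison", 0.72), ("la casa", 0.05)],
    "house": [("maison", 0.61), ("foyer", 0.12)],
}

def candidates(source_phrase):
    """Return target candidates for a source phrase, most probable first."""
    return sorted(phrase_table.get(source_phrase, []), key=lambda p: -p[1])

print(candidates("the house")[0])  # ('la maison', 0.72)
```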
Postediting, Posteditor (PE): rapidly repairing MT output to align it with the end user’s expected quality levels. Usually carried out by translators specially trained to make rapid decisions on repair strategy per segment.
Preprocessing: a variety of operations on the source text to optimize it for MT. Usually involves ‘cleaning’ formatting errors and running regular expression checks to make the source text as machine-translatable as possible.
Reasonable And Non-Discriminatory (RAND): IPR mode that allows owners to charge for use of Essential Patents provided that the charge is reasonable and non-discriminatory. (See also: FRAND)
Rule Based Machine Translation (RBMT): an MT engine built on algorithms that analyze the syntax of the source and uses rules to transfer the meaning to the target language by building a sentence. Contrast this with the processes of data searching and selecting on the basis of probabilities in SMT.
Royalty Free (RF): IPR mode that mandates and guarantees Royalty Free use of Essential Patents in order to implement a standard.
SAE J2450: a formal translation quality metric defined by the automotive industry that focuses on the following criteria for evaluation: Incorrect Term, Syntactic Error, Omission, Word Structure or Agreement, Misspelling, Punctuation, Miscellaneous Error.
Source Text (ST): the document in the source language.
Statistical Machine Translation (SMT): an MT system that uses algorithms to establish probabilities between segments in a source and target language document to propose translation candidates. Also known as ‘data-driven’ MT, to contrast the approach with an RBMT system.
Statistical Postediting (SPE): an automated process whereby postedited output can be re-used as training data for an SMT system to improve quality in the next cycle and reduce the subsequent postediting load. Can also be used in a hybrid workflow.
Segmentation Rules eXchange (SRX): an XML-based standard that provides a way to describe how to segment text for translation and other language-related processes. SRX is meant to improve the TMX standard so that translation memory (TM) data exchanged between applications can be used more efficiently. Having this in place can increase the leverage that can be achieved when deploying TM data. SRX was maintained by the Localization Industry Standards Association (LISA) until 2011, when LISA went bankrupt.
Subjective Sentence Error Rate (SSER): A metric used to evaluate MT quality whereby a human examiner gives a subjective judgment to each translated sentence using an error scale from 0.0 to 1.0. A score of 0.0 means the translation is semantically and syntactically correct, 0.5 means it is semantically correct and syntactically wrong and 1.0 means it is semantically wrong.
TAUS Labs: the area of TAUS focusing on shared research and development to advance the TAUS mission on behalf of members. Focus areas include translation quality evaluation, interoperability and machine translation. Visit www.tauslabs.com to find out more.
TAUS Search: an online string search tool enabling anyone to search the TAUS Data cloud for parallel sets of strings. See TAUS Search
TAUS Tracker: a series of detailed language and translation technology directories found at www.taustracker.com.
TermBase eXchange (TBX): an ISO-approved open, XML-based standard used for exchanging structured terminological data including detailed lexical information. The framework for TBX is provided by the following ISO standards: ISO 12620, ISO 12200 and ISO 16642.
Text segmenter: a software tool or library that separates written text into meaningful units (tokens, sentences, paragraphs or documents). In many Asian languages, unlike western ones, words are not separated with spaces in a phrase.
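A naive sentence segmenter for space-delimited languages can be sketched with a regular expression; this illustration ignores abbreviations, and languages without word spaces need dictionary- or model-based segmentation instead:

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by
    whitespace. Real segmenters also handle abbreviations, ellipses
    and language-specific punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(split_sentences("MT is fast. Is it accurate? It depends!"))
# ['MT is fast.', 'Is it accurate?', 'It depends!']
```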
Training data: the set of sentences selected during the process of setting up a SMT workflow used to train/customize the engine to the domain/languages in question.
Translation Error Rate (Plus) (TER(p)): an automatic metric for measuring the number of edit operations needed to transform MT output into a human-translated reference. Used to assess the postediting load.
TERp is a TER extension that automatically generates paraphrases and synonyms, stems words, and provides other powerful improvements.
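Plain TER (without the phrase-shift operation of the full metric, and without TERp's paraphrasing) can be sketched as word-level edit distance normalized by reference length:

```python
def ter(hypothesis, reference):
    """Simplified TER: word-level edit distance (insert/delete/substitute;
    full TER also counts phrase shifts) divided by reference length."""
    h, r = hypothesis.split(), reference.split()
    # Standard Levenshtein dynamic-programming table over words.
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(h)][len(r)] / len(r)

print(ter("the cat sat", "the cat sat on the mat"))  # 0.5 (3 edits / 6 words)
```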
Translation Memory Exchange (TMX): a vendor-neutral open XML standard to simplify the conversion of TMs between formats. The translation community has adopted TMX as the best way of importing and exporting translation memories. The current version, 1.4b, allows for the recreation of the original source and target documents from the TMX data.
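A minimal, illustrative TMX 1.4 document with a single English-French translation unit (the tool names and content below are placeholders):

```xml
<!-- Illustrative TMX 1.4 document with one translation unit -->
<tmx version="1.4">
  <header creationtool="example-tool" creationtoolversion="1.0"
          segtype="sentence" o-tmf="plain" adminlang="en"
          srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Press the red button.</seg></tuv>
      <tuv xml:lang="fr"><seg>Appuyez sur le bouton rouge.</seg></tuv>
    </tu>
  </body>
</tmx>
```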
UAX #9: this annex describes specifications for the positioning of characters in text containing characters flowing from right to left, such as Arabic or Hebrew. (See Unicode Standard Annex #9)
UAX #29: this annex describes guidelines for determining default segmentation boundaries between certain significant text elements: grapheme clusters (“user-perceived characters”), words, and sentences. (See Unicode Standard Annex #29)
Unicode Localization Interoperability Technical Committee (ULI): a committee that works to ensure interoperable data interchange of critical localization-related assets, including Translation Memory, Segmentation rules and Translation source strings and their translations. (See ULI Project)
Unicode standard: a character coding system designed to support the worldwide interchange, processing, and display of the written texts of the diverse languages and technical disciplines of the modern world. In addition, it supports classical and historical texts of many written languages. (See Unicode Standard)
UTR #20: Unicode in XML and other Markup Languages. Because the main target of Unicode is plain text, it contains many control, formatting, or otherwise stateful characters; this report gives guidance on handling such characters in markup contexts.
UTS #18: Unicode Regular Expressions gives general guidelines on how regular expression engines can comply with the Unicode standard. Three levels are specified: a basic level representing the minimum feasible for implementers, an extended level that is more end-user friendly, and a finest level that is language-specific.
UTS #35: Unicode Locale Data Markup Language (LDML) specifies an XML vocabulary for encoding locale-specific generic data categories (dates, amounts, decimals, units of measure, currency symbols, etc.). Its main purpose is to enable the creation and maintenance of CLDR, but it is also used directly in programming frameworks such as .NET.
XLIFF: a powerful, expressive XML vocabulary that facilitates end-to-end localization process automation. Apart from core structural elements, it has inline markup and segmentation encoding mechanisms, and allows for generic file skeleton inclusion or referencing, fuzzy matching and glossary elements, and project, tool, and status metadata. XLIFF forms part of the OAXAL reference architecture; its current specification is v1.2, and the XLIFF technical committee is currently working towards v2.0.
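A minimal, illustrative XLIFF 1.2 file with one translation unit (the file name, languages and content are placeholders):

```xml
<!-- Illustrative XLIFF 1.2 file with a single translation unit -->
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="ui.txt" source-language="en" target-language="de"
        datatype="plaintext">
    <body>
      <trans-unit id="1">
        <source>Save changes?</source>
        <target state="translated">Änderungen speichern?</target>
      </trans-unit>
    </body>
  </file>
</xliff>
```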