Harvesting Polysemous Terms from e-commerce Data to Enhance QA
Polysemous words can be difficult to translate and can degrade the quality of Machine Translation (MT) output. Degraded MT quality in turn has a direct impact on post-editing and on human-assisted machine translation, because the presence of these terms increases the risk of errors. We argue that these words can be used both to improve and to measure the quality of translations. We present three methods for finding them in e-commerce data, based on Named Entity Recognition, Part-of-Speech, and Search Queries.
In this paper, we discuss different methods that use meta information and richer context accompanying the source language input to improve machine translation quality. We focus on the category of the input text as meta information, but the proposed methods can be extended to any textual or non-textual meta information that is available for the input text or can be predicted automatically from the text content. The main novelty of this work is the use of state-of-the-art neural network methods to tackle this problem within a statistical machine translation (SMT) framework. We observe translation quality improvements of up to 3% in terms of BLEU score in some text categories.
We present three approaches to generate titles for browse pages in five different languages, namely English, German, French, Italian and Spanish. These browse pages are structured search pages in an e-commerce domain. We first present a rule-based approach to generate these browse page titles. In addition, we also present a hybrid approach which uses a phrase-based statistical machine translation engine on top of the rule-based system to assemble the best title. For the two languages English and German, we have access to a large amount of rule-based generated and human-curated titles. For these languages, we present an automatic post-editing approach which learns how to post-edit the rule-based titles into curated titles.
The use of post-editing of machine translation output is increasing throughout the language technology community. In this work, we investigate whether the MT system influences the human translator, thereby introducing "bias" and potentially leading to errors in the post-editing. We analyze how often a translator accepts an incorrect suggestion from the MT system and identify different types of bias errors. We carry out a quantitative analysis on translations of e-commerce data from English into Portuguese, consisting of 713 segments with about 15k words. We observed a higher-than-expected number of bias errors, about 18 per 1,000 words. The most frequent types of bias error were ambiguous modifiers, terminology errors, polysemy, and omissions. The goal of this work is to provide quantitative data about bias errors in post-editing that help demonstrate the existence of this bias. We also explore some ideas on how to automate the detection of these error patterns and so facilitate the quality assurance of post-editing.
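The rate of bias errors per 1,000 words is a simple normalization of an error count by corpus size. A minimal sketch of that arithmetic follows; the function name and the toy counts are hypothetical and are not the study's raw figures.

```python
def errors_per_thousand_words(error_count: int, word_count: int) -> float:
    """Normalize an error count to a rate per 1,000 words."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return 1000.0 * error_count / word_count


# Toy numbers chosen only to illustrate the normalization (assumption).
rate = errors_per_thousand_words(error_count=9, word_count=500)
print(f"{rate:.1f} bias errors per 1,000 words")
```

Normalizing by word count rather than by segment count makes rates comparable across data sets whose segments differ in length.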
In this paper, we introduce a hybrid search for attention-based neural machine translation (NMT). A target phrase learned with statistical MT models extends a hypothesis in the NMT beam search when the attention of the NMT model focuses on the source words translated by this phrase. Phrases added in this way are scored with the NMT model, but also with SMT features including phrase-level translation probabilities and a target language model. Experimental results on German→English news domain and English→Russian e-commerce domain translation tasks show that using phrase-based models in NMT search improves MT quality by up to 2.3% BLEU absolute compared to a strong NMT baseline.
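The two ideas in this abstract, gating a phrase on attention and scoring it with both NMT and SMT features, can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the attention threshold, the log-linear form of the combination, the feature weights, and all function names are hypothetical, since the abstract does not specify them.

```python
import math


def attention_covers_phrase(attention, source_span, threshold=0.5):
    """Return True if enough attention mass falls on the phrase's source words.

    attention:   attention weights over source positions for the current step
    source_span: (start, end) indices of the source words the phrase translates
    threshold:   hypothetical cut-off on the summed attention mass (assumption)
    """
    start, end = source_span
    return sum(attention[start:end]) >= threshold


def phrase_extension_score(nmt_logprob, phrase_logprob, lm_logprob,
                           weights=(1.0, 0.3, 0.3)):
    """Combine NMT and SMT log-scores as a weighted sum.

    The log-linear form and the weights are assumptions for illustration.
    """
    w_nmt, w_tm, w_lm = weights
    return w_nmt * nmt_logprob + w_tm * phrase_logprob + w_lm * lm_logprob


# Toy example: attention concentrated on source positions 1 and 2.
attention = [0.1, 0.6, 0.2, 0.1]
if attention_covers_phrase(attention, source_span=(1, 3)):
    score = phrase_extension_score(nmt_logprob=-2.0,
                                   phrase_logprob=-1.0,
                                   lm_logprob=-3.0)
    print(f"phrase extension score: {score:.2f}")
```

In a real beam search the candidate phrase would compete with ordinary word-level extensions of the same hypothesis, so the scores would have to be comparable across both kinds of extension.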
In this paper, we propose an effective way of biasing the attention mechanism of a sequence-to-sequence neural machine translation (NMT) model towards the well-studied statistical word alignment models. We show that our novel guided alignment training approach improves translation quality on real-life e-commerce texts consisting of product titles and descriptions, overcoming the problems posed by many unknown words and a large type/token ratio. We also show that meta-data associated with input texts, such as topic or category information, can significantly improve translation quality when used as an additional signal to the decoder part of the network. With both novel features, the BLEU score of the NMT system on a product title set improves from 18.6% to 21.3%. Even larger MT quality gains are obtained through domain adaptation of a general domain NMT system to e-commerce data. The developed NMT system also performs well on the IWSLT speech translation task, where an ensemble of four variant systems outperforms the phrase-based baseline by 2.1% BLEU absolute.
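One way to bias attention towards a statistical word alignment is to add a penalty to the training loss that grows when the attention weights diverge from the alignment. The sketch below uses a cross-entropy penalty and a fixed interpolation weight; both are assumptions for illustration, as the abstract does not state the exact loss, and the function names are hypothetical.

```python
import math


def guided_alignment_loss(attention, alignment, eps=1e-9):
    """Cross-entropy between reference alignments and attention weights.

    attention: one row per target word, each row a distribution over source words
    alignment: same shape, rows built from a statistical word alignment
               and normalized to sum to 1 (assumption)
    """
    if not attention:
        raise ValueError("attention must contain at least one row")
    total = 0.0
    for att_row, ali_row in zip(attention, alignment):
        for a, g in zip(att_row, ali_row):
            if g > 0.0:
                total -= g * math.log(a + eps)
    # Average over target positions.
    return total / len(attention)


def combined_loss(translation_loss, attention, alignment, weight=0.5):
    """Add the alignment penalty to the usual translation loss.

    weight is a hypothetical interpolation factor (assumption).
    """
    return translation_loss + weight * guided_alignment_loss(attention, alignment)


# Toy example: two target words, two source words, monotone alignment.
alignment = [[1.0, 0.0], [0.0, 1.0]]
sharp = [[0.9, 0.1], [0.1, 0.9]]      # attention close to the alignment
diffuse = [[0.5, 0.5], [0.5, 0.5]]    # attention ignoring the alignment
print(guided_alignment_loss(sharp, alignment) <
      guided_alignment_loss(diffuse, alignment))
```

The penalty is only needed during training; at translation time the model uses its learned attention and no external alignment is required.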