On April 24, we made a significant change to how we handle strings before sending them from Transifex to machine translation (MT) services, in particular Google Translate and Microsoft Translator Text, so that we make better use of their machine learning (ML) capabilities. This post explains the change and the reasoning behind it.
Up to now, we had been sending strings as plain text, i.e. format=text for Google and textType=plain for Microsoft. Going forward, we are sending strings as html for both services.
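To make the switch concrete, here is a sketch of the request parameters involved. The endpoints and parameter names follow the public APIs (Google Cloud Translation v2 and Microsoft Translator Text v3); the helper functions themselves are illustrative, not Transifex's actual code.

```python
def build_google_request(text, target_lang):
    """Request body for Google Cloud Translation v2 (illustrative)."""
    return {
        "q": text,
        "target": target_lang,
        "format": "html",  # previously "text"
    }

def build_microsoft_request(text, target_lang):
    """Query params and body for Microsoft Translator Text v3 (illustrative)."""
    params = {
        "api-version": "3.0",
        "to": target_lang,
        "textType": "html",  # previously "plain"
    }
    body = [{"Text": text}]
    return params, body

google = build_google_request("<b>Hello</b>", "fr")
ms_params, ms_body = build_microsoft_request("<b>Hello</b>", "fr")
print(google["format"], ms_params["textType"])  # html html
```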
This is significant because it may change the translations you get back from the service, compared to before.
Why make this change
The previous implementation had issues with HTML in source strings. Sometimes the translations we got back from the MT service contained “garbage” like 0x5a4d6e521, which was not part of the source string. This happened only with certain strings and certain target languages, and was completely unpredictable.
After making the change mentioned above, i.e. sending the strings as html, we no longer get garbage in the translated strings in any of the cases we tested. The reason is that in html mode, the MT service handles the HTML parts of the strings itself, instead of Transifex taking care of them. The service is no longer “confused” as it was before, and thus does not add garbage to the translation it sends back.
As a result of the change we have introduced, the translations the MT services return may differ from before, under certain circumstances. This section contains examples of the change in MT behaviour.
The previous implementation sometimes returned strings that were missing certain HTML tags, contained garbage characters, or were incomplete. The new implementation fixes all three problems.
Source string:
<h1>Journal Entries- Reference Guide</h1> <div>This is an example of a problematic string with HTML.</div> <h5>A heading</h5> <ol> <li>A list item</li>

Previous translation:
0x6b327526cba Guide de référence 0x7d4a2 A05bbbbbbbbbbbbbbbb0b

New translation:
<h1>Écritures de journal - Guide de référence</h1> <div> Voici un exemple de chaîne problématique avec HTML. </div><h5> Un en-tête </h5><ol><li> Un élément de la liste </li>
The new implementation returns certain HTML entities in escaped form, where the previous implementation returned them unescaped.

Source string:
that's life this is a quote (")

Previous translation:
c'est la vie c'est une citation (")

New translation:
c'est la vie c'est une citation (&quot;)
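If a downstream pipeline expects unescaped text, entities like &quot; can be normalized with Python's standard html module. The sample string below is illustrative:

```python
import html

# The new implementation may return entities such as &quot; in escaped
# form; html.unescape() converts them back to literal characters.
escaped = "c'est la vie c'est une citation (&quot;)"
print(html.unescape(escaped))  # c'est la vie c'est une citation (")
```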
The new implementation may remove spaces and new lines in XML strings.
Source string:
<![CDATA[ <p> A line <br/> Another line </p> ]]>

Previous translation:
<![CDATA[ <p> Une ligne <br/> Une autre ligne </p> ]]>

New translation:
<![CDATA[ <p>Une ligne<br/>Une autre ligne</p> ]]>
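If you need to treat the old and new output as equivalent, a whitespace-insensitive comparison can help. The helper below (an illustrative sketch, not Transifex code) collapses whitespace around tag boundaries before comparing:

```python
import re

def squeeze(s):
    # Collapse whitespace adjacent to HTML/XML tag boundaries.
    return re.sub(r"\s*(<[^>]+>)\s*", r"\1", s)

old = "<p> Une ligne <br/> Une autre ligne </p>"
new = "<p>Une ligne<br/>Une autre ligne</p>"
print(squeeze(old) == squeeze(new))  # True
```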
When sending source strings to an MT service, there may be parts of the string that we don’t want translated. Examples include custom variables and HTML nodes; we want these preserved in the translation exactly as they appear in the source string.
In order to achieve this, Transifex modifies the source strings before sending them to the MT service, replacing these parts with “protection” placeholders, and then swaps the original values back in after getting the translation from the MT service.
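The round-trip might be sketched as follows. The token format (__TX_0__ and so on) and the protected patterns are hypothetical; the actual placeholders Transifex uses are not shown in this post.

```python
import re

def protect(source):
    """Replace HTML tags and {variable}-style parts with hypothetical tokens."""
    mapping = {}
    def repl(match):
        token = "__TX_%d__" % len(mapping)
        mapping[token] = match.group(0)
        return token
    protected = re.sub(r"<[^>]+>|\{\w+\}", repl, source)
    return protected, mapping

def restore(translation, mapping):
    """Swap the original values back in after translation."""
    for token, original in mapping.items():
        translation = translation.replace(token, original)
    return translation

protected, mapping = protect("<b>Hello</b> {name}")
# protected == "__TX_0__Hello__TX_1__ __TX_2__"
translated = protected.replace("Hello", "Bonjour")  # stand-in for the MT call
print(restore(translated, mapping))  # <b>Bonjour</b> {name}
```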
The problem is that the MT service sometimes gets confused by these placeholders and injects garbage chunks at random places in the translation string. These chunks look a lot like the placeholders we feed the MT service, but they are not identical, which means the MT service is creating them and adding them to the string, not Transifex.
This is because MT services are built on machine learning algorithms: they essentially generate the best translation they can for the string they receive, and placeholder-like tokens can be generated just as real words are.
We have tried numerous placeholder formats, but none has worked perfectly: with any format, there are particular combinations of strings and target languages that cause the translations to contain garbage characters.
This is a difficult problem and there is no perfect solution. In order to be able to protect certain parts of the source strings, such as custom variables and HTML nodes, we need to modify the source strings before sending them to the MT services.
Doing that means that the services may sometimes respond with translations that are not of the highest quality. Adjusting our algorithm in order to fix one case may introduce issues in another case.
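One mitigation is to validate the MT output before restoring placeholders: check that the translation contains exactly the tokens that were sent, no more and no fewer. This is an illustrative sketch (with a hypothetical __TX_n__ token format), not Transifex's actual safeguard:

```python
import re

def validate_tokens(translation, expected_tokens):
    """True if the translation contains exactly the expected tokens."""
    found = re.findall(r"__TX_\d+__", translation)
    return sorted(found) == sorted(expected_tokens)

tokens = ["__TX_0__", "__TX_1__"]
print(validate_tokens("__TX_0__Bonjour__TX_1__", tokens))         # True
print(validate_tokens("__TX_0__Bonjour 0x5a4d __TX_9__", tokens)) # False
```

A failed check could trigger a retry or a fallback to plain-text mode instead of shipping a garbled translation.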
As we realize how important machine translation is becoming in modern localization workflows, we plan on researching and improving the behaviour on our side.
Any comments or suggestions are more than welcome!
Thank you for understanding.