Important changes in Machine Translation results

On April 24, we made a significant change to how we handle strings before sending them from Transifex to machine translation (MT) services, in particular Google Translate and Microsoft Translator Text, so that they use more of their machine learning (ML) capabilities. This post explains the change and the reasoning behind it.

The change

Until now, we sent strings as plain text, i.e. format=text for Google and textType=plain for Microsoft. Going forward, we send strings as HTML for both services, i.e. format=html for Google and textType=html for Microsoft.
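The difference between the two modes comes down to one request parameter per service. The sketch below only builds the request data and sends nothing. It assumes the public Google Cloud Translation v2 and Microsoft Translator Text v3 REST endpoints; the function names are hypothetical, authentication is omitted, and this is not necessarily how Transifex builds its requests.

```python
from urllib.parse import urlencode

# Public REST endpoints (assumed: Google v2 and Microsoft v3).
GOOGLE_URL = "https://translation.googleapis.com/language/translate/v2"
MICROSOFT_URL = "https://api.cognitive.microsofttranslator.com/translate"


def google_request(text, target, html_mode=True):
    """Return (url, form_fields) for a Google Translate call.

    Hypothetical helper; the API key is intentionally left out.
    """
    fields = {
        "q": text,
        "target": target,
        "format": "html" if html_mode else "text",
    }
    return GOOGLE_URL, fields


def microsoft_request(text, target, html_mode=True):
    """Return (url_with_query, json_body) for a Microsoft Translator call.

    Hypothetical helper; the subscription key header is intentionally left out.
    """
    query = urlencode({
        "api-version": "3.0",
        "to": target,
        "textType": "html" if html_mode else "plain",
    })
    return f"{MICROSOFT_URL}?{query}", [{"Text": text}]
```

Calling either helper with html_mode=False reproduces the previous plain-text behaviour, which makes it easy to compare the two modes on the same string.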

This is significant because it may change the translations you get back from the service, compared to before.

Why make this change

The previous implementation had issues with HTML in source strings. Sometimes the translations we got back from the MT service contained “garbage” like 0x5a4d6e521, which was not part of the source string. This happened only on certain strings and on certain target languages, and was completely unpredictable.

After making the change described above, i.e. sending the strings as HTML, we no longer get garbage in the translated strings in any of the cases we tested.

Short explanation

With the new HTML mode, we let the MT service handle the HTML parts of the strings instead of handling them on the Transifex side. The MT service is no longer “confused” by our modifications, so it no longer adds garbage to the translation it sends back.

Results

As a result of this change, the translations the MT services return may differ from before under certain circumstances. This section contains examples of how MT behaviour has changed.

Benefits

The previous implementation sometimes produced translations that were missing certain HTML tags, contained garbage characters, or were incomplete. The new implementation fixes all three problems.

Source strings:

<h1>Journal Entries- Reference Guide</h1>
<div>This is an example of a problematic string with HTML.</div> <h5>A heading</h5> <ol> <li>A list item</li>

Previous translations:

0x6b327526cba Guide de référence 0x7d4a2
A05bbbbbbbbbbbbbbbb0b

New translations:

<h1>Écritures de journal - Guide de référence</h1>
<div> Voici un exemple de chaîne problématique avec HTML. </div><h5> Un en-tête </h5><ol><li> Un élément de la liste </li>

Drawbacks

HTML entities

The new implementation returns certain HTML entities in an escaped form, where the previous implementation returned them in an unescaped form.

Source strings:

that's life
this is a quote (")

Previous translation:

c'est la vie
c'est une citation (")

New translation:

c&#39;est la vie
c&#39;est une citation (&quot;)
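If your content is plain text rather than HTML, one possible workaround is to unescape the entities after receiving the translation. This is a minimal sketch using Python's standard html module; the function name is hypothetical, and it should not be applied to strings that are real HTML, where an escaped entity such as &lt; may be intentional.

```python
import html


def unescape_translation(text: str) -> str:
    """Convert HTML entities such as &#39; and &quot; back to characters.

    Hypothetical post-processing step; only safe for plain-text content.
    """
    return html.unescape(text)


print(unescape_translation("c&#39;est une citation (&quot;)"))
# prints: c'est une citation (")
```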

XML

The new implementation may remove spaces and new lines in XML strings.

Source string:

<![CDATA[
    <p>
        A line
        <br/>
        Another line
    </p>
]]>

Previous translation:

<![CDATA[
    <p>
        Une ligne
        <br/>
        Une autre ligne
    </p>
]]>

New translation:

<![CDATA[
    <p>Une ligne<br/>Une autre ligne</p> ]]>

Technical details

When sending source strings to an MT service, there might be parts of the string that we don’t want to be translated, for example custom variables or HTML nodes. We want these to be preserved in the translation exactly as they were in the source string.

To achieve this, Transifex modifies the source strings before sending them to the MT service, replacing these parts with “protection” placeholders. After getting the translation back from the MT service, it replaces the placeholders with the original values. These placeholders look like this: 0x5a4d6e521.
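The protect-and-restore idea can be sketched as follows. This is an illustration only, not the actual Transifex implementation: the regular expression (HTML tags and {curly} variables), the sequential placeholder numbering, and the function names are all assumptions made for the example.

```python
import re

# Assumption: the parts to protect are HTML tags and {curly} variables.
PROTECTED = re.compile(r"<[^>]+>|\{[^{}]+\}")


def protect(source):
    """Replace protected parts with hex-like placeholders.

    Returns the masked text and a placeholder -> original mapping.
    """
    mapping = {}

    def swap(match):
        # Hypothetical format: "0x" followed by nine hex digits.
        token = f"0x{len(mapping) + 1:09x}"
        mapping[token] = match.group(0)
        return token

    return PROTECTED.sub(swap, source), mapping


def restore(translation, mapping):
    """Put the original values back in place of the placeholders."""
    for token, original in mapping.items():
        translation = translation.replace(token, original)
    return translation
```

The round trip only works if the MT service returns every placeholder unchanged; if it alters or invents one, restore() leaves the unknown chunk in the translation.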

The problem is that the MT service sometimes gets confused by these placeholders and injects garbage chunks in random places in the translation string. These chunks look a lot like the placeholders we feed the MT service but are not identical, which means that the MT service, not Transifex, is creating them and adding them to the string.
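One way to spot such chunks is to look for hex-like tokens in the translation that were never sent. The sketch below is a hypothetical check, not part of Transifex; it assumes placeholders always start with 0x followed by hex digits.

```python
import re

# Assumption: placeholders and garbage chunks both look like 0x<hex digits>.
HEX_CHUNK = re.compile(r"0x[0-9a-fA-F]+")


def stray_chunks(translation, known_placeholders):
    """Return hex-like chunks in the translation that we never sent."""
    return [
        chunk
        for chunk in HEX_CHUNK.findall(translation)
        if chunk not in known_placeholders
    ]


sent = {"0x5a4d6e521"}
print(stray_chunks("0x6b327526cba Guide de référence 0x7d4a2", sent))
# prints: ['0x6b327526cba', '0x7d4a2']
```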

This happens because MT services rely on machine learning algorithms, and thus try to find (essentially generate) the best translation for the string they have received.

We tried numerous placeholder formats, but none worked perfectly. With any placeholder format, there were particular combinations of strings and target languages that caused the translations to contain garbage characters.

Conclusion

This is a difficult problem and there is no perfect solution. In order to be able to protect certain parts of the source strings, such as custom variables and HTML nodes, we need to modify the source strings before sending them to the MT services.

Doing that means that the services may sometimes respond with translations that are not of the highest quality. Adjusting our algorithm in order to fix one case may introduce issues in another case.

As we realize how important machine translation is becoming in modern localization workflows, we plan on researching and improving the behaviour on our side.

Any comments or suggestions are more than welcome!

Thank you for understanding.

Hi,

I just found this and have to say that, while you got rid of a few special characters in the returned translations,
you broke 100% of the Markdown translations.

MT is useless now, as it removes all line feeds, which renders the functionality defective.

I would suggest to have a switch on

a) project basis
b) file types basis
c) organization basis

or all of those

to switch this behavior

Best regards,
Christoph

Hi @christoph.cemper,

Thank you so much for taking the time to share your comments and suggestions.

This is exactly why we created this post. Our goal is to collect and act upon Transifex users’ feedback, since this is definitely the best way to improve Transifex.

Our team will evaluate this further and proceed accordingly.

We will keep you posted!

@christoph.cemper I am glad to inform you that, from now on, an option is available in both the project and organization settings pages that lets you select the mode of the MT service you use (HTML or plain text). This is how these options appear in the Transifex web interface:

[Screenshot: MT mode options (HTML or plain text) in the settings page]

I hope you find this useful :)