Glossary false warning

I’m having a glossary match problem.
Because Finnish is a fusional/agglutinative language, glossary matching doesn’t work too well. For instance, the dictionary word “location” is “sijainti” in Finnish, but “your location” is “sijaintisi”, and I get the warning “Glossary translation for term ‘location’ missing from translation”. Adding “your location” as a term doesn’t help, as each term is apparently looked up separately, even when a longer multi-word term that contains it is already present.
In our project, the translation is apparently accepted despite the warning, but how could this be handled in general (by allowing alternate translations?).
Thanks for a very functional site!

The glossary warning might not be relevant in some cases, and for some languages in particular, where the same word can take a different inflected form depending on its context in the phrase. For those cases, the best approach is to just ignore the warning. If you want, you can use the “comments” tool to explain to the organization administrators that the warning message is not relevant.
Most administrators are aware of those limitations and leave the translation check set to a warning. If you get an “Error” instead of a “Warning”, you can contact the project maintainer, raise the problem, and ask them to set the translation check back to a warning so you can proceed with the translation.

We’re experiencing the same with Hebrew, but in my case I have over 60 other languages to maintain so I can’t just turn the warning off when it’s still relevant to most of them.

Can I turn this option off per language?

Hi @yaron!
Thanks for mentioning that it has also been an issue in your organization.
Since all translation checks are set at the file level, the best solution here is to leave it “on”, but as a “Warning” instead of an “Error”.
A warning is just a way to ask a translator to review the string and make sure it makes sense not to follow the Glossary in that case. It does not prevent them from saving translations.

Does it make sense?
Let me know :slight_smile:
Flavia

LOL nope :slight_smile:

Although not critical, this is definitely not a warning. There are 2 “proper” ways to approach this:

  1. Develop a conjugation system for each and every language (Costly solution, I can’t recommend this).
  2. Allow disabling the warnings or errors per language, or at least allow the language maintainer to turn them off for the team as an override to the global checks.

I’m pretty sure there are other solutions and I’m willing to help if you don’t find these sufficient.
This is a well-known problem with several languages I’m aware of, such as Hebrew, Russian, Arabic, German and Amharic. For these I can testify from personal knowledge, but there are possibly many others. I can also open a bug in gettext to make sure it’s handled correctly by the source project.

Hi Yaron,
We appreciate your feedback, and I understand that’s an area that might need some more flexibility due to differences in language rules. I will share it with the product team for future improvements!
Thanks again for sharing further input about that here!

Best,
Flavia

Hi, thanks, I just thought about another option: allowing the translator to add conjugations for a term. That way the developers will only need to come up with a way to expand the current key:value structure into an extended solution with relations between terms.

This way you can have a conjugation contributed by the community and maybe even use ML to learn from it and suggest conjugations that do not exist yet.

The specific problem with Hebrew is that the definite article is attached to the word, so for “dog” we would have “dog” as the source and “the dog” as a conjugation, which is not the best way to run this thing.

The other possibility would be to define that any noun in Hebrew can have the definite article, and several other prefixes, attached to it. This way it would work more like a script, rather than having each and every trivial conjugation added to the glossary.

Not sure about the exact way to implement but it’s definitely a very smart addition to the glossary.
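
A rough sketch of what such a script-like rule could look like, purely as an illustration. The prefix list and the function names are my own assumptions (the definite article plus a handful of common one-letter prefixes), not anything Transifex provides, and a rule this simple will overmatch words that merely start with one of those letters:

```python
import re

# Assumption: common one-letter Hebrew prefixes
# (the, and, in, as, to, from, that). Not an exhaustive list.
HEBREW_PREFIXES = "הובכלמש"


def hebrew_term_pattern(term):
    """Match the glossary term with up to two prefix letters attached."""
    return re.compile(
        r"(?<!\w)[" + HEBREW_PREFIXES + r"]{0,2}" + re.escape(term) + r"(?!\w)"
    )


def term_present(term, text):
    """Return True if the term occurs in text, bare or with a prefix."""
    return hebrew_term_pattern(term).search(text) is not None


# "dog" is found inside "the dog" even though the article is attached.
print(term_present("כלב", "הכלב"))
```

With a rule like this, only the bare noun has to be in the glossary.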

Any chance there’s some progress in this direction?

The exact way to implement it is this:

https://bazaar.launchpad.net/~widelands-dev/widelands/trunk/view/head:/utils/glossary_checks.py

This will get rid of about 95% of false positives in my locale.

All we would need in the glossary to avoid ugly hacks is 1 textarea where translators can add a list of inflected forms, 1 term per line. And if the locale has a good hunspell stemmer, that won’t even be needed.

@gunchleoc @yaron
Thank you guys for your input, we appreciate it and have recorded all comments here!
Unfortunately, we were not able to get to Glossary enhancements and warnings yet. We will make sure to update you guys when we get the chance to work on it!
You can also follow the things we’ve been working on here: https://docs.transifex.com/whats-new/all

Please keep sending us your feedback and comments. We are tracking all of it and taking it into consideration when discussing product needs :slight_smile:
Kind regards,

Can you elaborate a bit about the logic? We can sure try and get at least some of the logic into discussion and see if we can build some sort of solution step by step.

  1. Expand English words with some grammar rules. This will overgenerate a bit, e.g. goose -> *gooses, but that doesn’t matter, because *gooses won’t match later anyway. This would actually also be useful for the Glossary tab, because people are adding plural forms as separate glossary entries just so that they get displayed. This will of course give us more false positive hits at this stage, but it will also catch those occurrences of glossary terms that are currently being missed.
  2. Filter superwords. E.g. in a game that I translate, we have an “Arena” and a “Battle Arena”. “Arena” is a substring of “Battle Arena” in the source text, but that is not necessarily true for the target text. So if we find “Battle Arena”, remove it from the source string before checking for “Arena”.
  3. Use hunspell stemmer to identify matching target words. This works well for languages that have a rule-based hunspell dictionary rather than a flat list. I got rid of almost 50% false hits for Polish with this.
  4. Use target alternatives defined in the glossary to identify matching target words. Since Transifex doesn’t have a field for this, I used the | character as a separator in the “Note” field to hack this. So, for “Artifact”, I have a translation “ball-ealain” and a Note “bhall-ealain|buill-ealain|bhuill-ealain”
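
Steps 1, 2 and 4 can be sketched roughly as follows. This is a simplified illustration, not the linked glossary_checks.py: the glossary layout (a dict with "translation" and "note" keys) and all function names are assumptions made for the example, the English expansion rules are deliberately crude, and step 3 (the hunspell stemmer) is left out because it depends on an external dictionary.

```python
import re


def expand_source_forms(term):
    """Step 1: overgenerate English forms (wrong ones simply never match)."""
    forms = {term, term + "s", term + "es"}
    if term.endswith("y"):
        forms.add(term[:-1] + "ies")
    return forms


def contains_word(text, word):
    """Case-insensitive whole-word search."""
    pattern = r"(?<!\w)" + re.escape(word) + r"(?!\w)"
    return re.search(pattern, text, re.IGNORECASE) is not None


def remove_superterms(source, term, all_terms):
    """Step 2: blank out longer glossary terms that contain this term."""
    for other in all_terms:
        if other != term and term.lower() in other.lower():
            source = re.sub(re.escape(other), " ", source, flags=re.IGNORECASE)
    return source


def target_forms(entry):
    """Step 4: the translation plus '|'-separated alternatives from the note."""
    alternatives = [f for f in entry.get("note", "").split("|") if f]
    return [entry["translation"]] + alternatives


def missing_terms(source, target, glossary):
    """Return glossary terms used in source with no accepted form in target."""
    missing = []
    for term, entry in glossary.items():
        cleaned = remove_superterms(source, term, glossary.keys())
        if not any(contains_word(cleaned, f) for f in expand_source_forms(term)):
            continue  # term does not occur in the source string
        if not any(contains_word(target, f) for f in target_forms(entry)):
            missing.append(term)
    return missing


glossary = {
    "Artifact": {
        "translation": "ball-ealain",
        "note": "bhall-ealain|buill-ealain|bhuill-ealain",
    },
}
# The inflected form is accepted because it is listed in the note.
print(missing_terms("Find the artifact", "bhall-ealain", glossary))  # []
```

Without the note, the same call would report “Artifact” as missing, which is exactly the false positive discussed in this thread.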

Unfortunately, I can’t attach a file here, otherwise I could give you full statistics. Here they are for the gd locale which I translate myself, after I fixed all the actual translation errors:

  • Translated entries: 3,785 (that’s segments, not words - word count for the project is ~40k)
  • Glossary entries: 346
  • Hits after 1: 3,050
  • Hits after 2: 1,183
  • Hits after 3: 1,178 (our hunspell is pretty much a flat list, but there are a few rules)
  • Hits after 4: 145

And for Polish, where the translation probably contained some actual errors:

  • Translated entries: 3,785
  • Glossary entries: 331
  • Hits after 1: 4,049
  • Hits after 2: 2,023
  • Hits after 3: 1,070
  • Hits after 4: N/A

For the 30 languages I had in my list, I got rid of 20% to 60% of the false positives with steps 1-3, depending on locale. After step 4, the score for my locale (gd) was a whopping 95.25%.
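
One reading that reproduces the 95.25% figure is to measure the drop in hits relative to the count after step 1. The helper below is only an illustration of that arithmetic using the gd numbers above; the function name is made up for the example.

```python
def reduction_percent(hits_before, hits_after):
    """Percentage of hits eliminated between two stages."""
    return round(100 * (hits_before - hits_after) / hits_before, 2)


gd_hits_after_step_1 = 3050
gd_hits_after_step_4 = 145
print(reduction_percent(gd_hits_after_step_1, gd_hits_after_step_4))  # 95.25
```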