Testing AI: a deep insight

The journey of AI testing continues. In this chapter, I go one step further with the quality features that were highlighted in detail in the previous chapter. Beyond deriving specific test cases, I also share my experience of how I implemented and executed them in OpenText ALM Octane. The results of the test executions point to a clear winner. So who wins the race: Google Translate, DeepL or Libre Translate?
Focal points for testing: Test criteria
Before test cases can be implemented, it must first be determined what is actually to be tested. With a traditional aid such as a dictionary, the expectation is simply that the translation is easy to understand and precise. But if I imagine having to translate an email from a foreign language into German, that quickly becomes a major challenge: even with a suitable dictionary, I would have to look up the correct translation word by word, and differences in grammar and sentence structure between the languages add further complexity.
A good AI translator should be able to understand the context of a sentence and translate it accordingly. The accuracy and completeness of the translation are the be-all and end-all here. In addition to translating technical terms correctly, it is particularly important that the translation is embedded in a grammatically meaningful and comprehensible context. The original meaning and nuances of the text should be retained wherever possible.
The diversity of languages is also crucial. Being able to translate each of the approximately 7,000 languages spoken around the world is a major challenge, even for a machine system. Nevertheless, AI translators should support a large number of languages and recognize them automatically in order to overcome language barriers.
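How such automatic language recognition could be spot-checked is shown in the following minimal sketch, which assumes the Python package langdetect and a few hand-picked sample sentences; it is an illustration on my part, not part of the actual test setup in ALM Octane.

```python
# Minimal sketch: spot-checking automatic language recognition with the
# langdetect package (my own assumption for illustration; the translators
# under test use their own, internal detection).
from langdetect import detect

samples = {
    "de": "Das ist ein einfacher Beispielsatz.",
    "en": "This is a simple example sentence.",
    "fr": "Ceci est une phrase d'exemple toute simple.",
}

for expected, sentence in samples.items():
    detected = detect(sentence)
    status = "passed" if detected == expected else "failed"
    print(f"expected {expected}, detected {detected}: {status}")
```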
In addition to recognizing languages, AI translators should also recognize ambiguities and handle them appropriately. The German proverb “Ich verstehe nur Bahnhof”, rendered word for word as “I only understand train station” instead of the idiomatic “It’s all Greek to me”, is a prime example of what happens when proverbs are translated literally.
Test case creation and risks
When creating the test cases, I focused on the top 5 most spoken languages in the world, as well as German. The reason for this selection is the broad coverage of languages spoken worldwide. Since I unfortunately speak only three of these six languages myself (German, English and French), I needed a different approach than quickly learning three new languages. Using a translator such as Google Translate to create reference texts for other systems would certainly not be a good idea, especially if the translator itself is ultimately still under scrutiny. So after some research, I came across a huge test set from Microsoft that was published at the end of November 2022 and contains test data for 128 different languages, including all the languages I looked at.
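To give an idea of how such a test set can be prepared for the test cases, here is a minimal sketch that assumes the data is distributed as parallel plain-text files with one sentence per line and one file per language; the file names below are placeholders, not the actual names in Microsoft's release.

```python
# Minimal sketch: pairing up a parallel test set, assuming plain-text
# files with one sentence per line and one file per language.
# The file names are placeholders.
from pathlib import Path


def load_parallel(src_file: str, ref_file: str) -> list[tuple[str, str]]:
    src_lines = Path(src_file).read_text(encoding="utf-8").splitlines()
    ref_lines = Path(ref_file).read_text(encoding="utf-8").splitlines()
    assert len(src_lines) == len(ref_lines), "files must be aligned line by line"
    return list(zip(src_lines, ref_lines))


pairs = load_parallel("testset.deu.txt", "testset.eng.txt")
print(f"{len(pairs)} aligned sentence pairs loaded")
```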
Even if this test data has proven to be extremely useful, it should be used with caution. Especially with black-box systems such as DeepL, it is not clear which data was actually used to train the AI. It is quite possible that precisely this publicly available information was used for that purpose. A subsequent test would then lead to false conclusions about the actual accuracy: the systems would already know the data, and no statement could be made about how they would perform in new, unknown situations.
As mentioned in the first chapter, sufficiently large amounts of data lend themselves to the use of machine metrics for evaluation. Two researchers from Jordan have done just that by comparing the English-to-Arabic translations from Libre Translate with those of other tools. The paper shows that Libre Translate performs satisfactorily in direct comparison with other machine translators and is on a par with Google Translate. But can this be true when my first impressions of Libre Translate were so disappointing?
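For readers who want to reproduce such a comparison, the following minimal sketch shows how a corpus-level BLEU score could be computed with the Python package sacrebleu; the system output and reference sentences are placeholders, and BLEU is only one of several possible metrics.

```python
# Minimal sketch: computing a corpus-level BLEU score with sacrebleu,
# assuming the system output and the reference translations are already
# available as parallel lists of sentences (placeholders here).
import sacrebleu

system_output = [
    "The patient suffers from renal insufficiency.",
    "The weather is nice today.",
]
references = [
    "The patient is suffering from renal insufficiency.",
    "The weather is fine today.",
]

bleu = sacrebleu.corpus_bleu(system_output, [references])
print(f"BLEU: {bleu.score:.1f}")
```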
Another study showed that journalistic writing posed the greatest challenge for the AI translators DeepL and Google Translate. The texts contained stylistic devices and other features that made accurate and fluent translation difficult. For this reason, and the aforementioned fact that Microsoft’s data is publicly available, I also included various online newspaper articles from the New York Times and other recent publications in the test cases.
I have translated the English texts from the online newspapers myself to the best of my ability and included them in my test cases. Other cases, such as the correct translation of technical terms (for example “renal insufficiency” or “oesophagus”) or the correct use of gender-specific words such as job titles, are also part of my tests. Where AI translators should score particularly well is in context-dependent translations. I took a closer look at this feature using German proverbs and ambiguous terms. Here, the AI systems should be able to do more than translate word for word. Nevertheless, checking the system output is not trivial: I have to analyze the context beforehand and then recognize it in the translation.
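One way to make such a context check at least partially automatable is sketched below; the proverb, the accepted idiomatic renderings and the fallback to manual review are my own assumptions, since in practice a German-speaking tester defines what counts as acceptable.

```python
# Minimal sketch: checking that a proverb is translated idiomatically
# rather than word for word. The accepted renderings are assumptions
# for illustration; a German-speaking tester would define them.
def check_proverb(translation: str) -> str:
    accepted = {
        "it's all greek to me",
        "it is all greek to me",
        "i don't understand a thing",
    }
    literal = "i only understand train station"

    normalized = translation.strip().lower().rstrip(".")
    if normalized in accepted:
        return "passed"
    if normalized == literal:
        return "failed (literal word-for-word translation)"
    return "needs manual review"


print(check_proverb("It's all Greek to me."))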
Realization of test cases in ALM Octane
ALM Octane from OpenText is like the Swiss army knife for quality assurance in software development. In the case of the AI translators, I mainly used the tool to document test cases and test executions so as not to lose track and ensure that everything runs smoothly. I was also able to use it to present the test results very easily.
In the example above, I have created a test case to check the translation of an AI system for completeness. The tester first calls up the system to be tested, selects the language settings according to the specification and copies the text to be translated into the system’s input fields. The tester then checks step by step whether the translation returned by the system accurately reflects the actual content. If this is the case for each individual step, the test is passed, otherwise it is marked as “failed”. For this test case, it is important to note that the tester must be a person who speaks German, as the tester is given a certain amount of leeway in interpreting the translation.
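As a complement to this manual test case, a simple automated smoke test is conceivable. The following sketch assumes a locally running Libre Translate instance at http://localhost:5000 and uses its public /translate endpoint; the source text, the expected keywords and the completeness heuristic are placeholders for illustration.

```python
# Minimal sketch: an automated counterpart to the manual completeness
# check, assuming a locally running Libre Translate instance at
# http://localhost:5000. Texts and keywords are placeholders.
import requests

SOURCE_TEXT = "Der Zug nach Berlin ist heute leider ausgefallen."
EXPECTED_KEYWORDS = ["train", "berlin", "cancel"]

response = requests.post(
    "http://localhost:5000/translate",
    json={"q": SOURCE_TEXT, "source": "de", "target": "en", "format": "text"},
    timeout=10,
)
response.raise_for_status()
translation = response.json()["translatedText"].lower()

# Completeness heuristic: every key piece of content must appear.
missing = [word for word in EXPECTED_KEYWORDS if word not in translation]
print("passed" if not missing else f"failed, missing: {missing}")
```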
Execution of the test cases
In addition to the Libre Translate and DeepL systems already presented in detail, I also took a closer look at Google Translate and Reverso. The comparison with traditional dictionary translators, represented by LEO Dict and dict.cc, was also important.
In software testing, the terms “passed” and “failed” are fundamental to tracking the quality of the tested software. If even a single step during test execution does not produce the expected result, the entire test is considered failed. Otherwise, the test is marked as “passed”.
At this point, I have made a further distinction. Tests of systems that do not support AI-specific features were marked as “skipped”. This applies in particular to test cases in which the systems are supposed to provide context-dependent translations. When evaluating and comparing the systems in ALM Octane, this makes it easier to differentiate which systems fulfill AI-specific functional requirements. Such requirements include, for example, feedback loops in the form of ratings of the translations, which use the output of an AI system and the corresponding user actions to retrain and improve the models over time.
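To illustrate how this three-way distinction between passed, failed and skipped could also be mirrored in an automated test framework, here is a minimal sketch using pytest; the capability flags per system and the comparison stub are assumptions on my part, since in my project this bookkeeping was done manually in ALM Octane.

```python
# Minimal sketch: mirroring the passed/failed/skipped distinction in
# pytest. The capability flags and the comparison stub are assumptions
# for illustration only.
import pytest

SYSTEMS = {
    "DeepL": {"context_aware": True},
    "Google Translate": {"context_aware": True},
    "LEO Dict": {"context_aware": False},
}


def translate_and_compare(system_name: str) -> bool:
    # Placeholder for the real check: call the system under test and
    # compare its output against the reference translation.
    return True


@pytest.mark.parametrize("name,caps", list(SYSTEMS.items()))
def test_context_dependent_translation(name, caps):
    if not caps["context_aware"]:
        pytest.skip(f"{name} does not offer context-dependent translation")
    assert translate_and_compare(name), f"{name} missed the context"
```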
Evaluation of the results
If you look at the dashboard figures, DeepL clearly stands out. My initial suspicion that Libre Translate would not perform as well as the other systems has been confirmed: Libre Translate brings up the rear. Nevertheless, it should be noted that I limited myself to a small test set when designing the test cases; most of them were translations between German and English.
For a closer look at translation quality, machine metrics with a sufficiently large test set should be considered. In addition, my tests mainly look at functional features. Testing of non-functional features should also be considered for a more detailed evaluation. These include, for example, the user-friendliness or performance of the systems. Ultimately, none of the systems tested in my test environment passed all the tests.
A brief outlook
In the next chapter, I will focus on a risk analysis in relation to AI translators. In doing so, I uncover possible risks posed by AI translators and also explain which solutions could play a role.
About us
We are a powerhouse of IT specialists and support customers with digitalization. Our experts optimize modern workplace, DevOps, security, big data management and cloud solutions as well as end user support. We focus on long-term collaboration and promote the personal development of our employees. Together, we are building a future-proof powerhouse and supporting customers on their path to successful digitalization.


