Testing AI: focus on quality features


5 February 2024 | 6 min

In the previous chapter, I described my first steps in the field of AI testing, from image classification to my own AI-assisted translator. Now let’s dive deeper into testing AI systems and look at the quality characteristics that indicate how suitable such systems are as translators.

On the trail of the ISTQB quality features

The ISTQB “Certified Tester Foundation Level” has established itself as an international standard and provides a solid basis for testing software. The specialist module “Certified Tester AI Testing” highlights the quality characteristics that are particularly important when evaluating AI systems. These characteristics help to assess the performance, reliability and efficiency of AI systems.

  • Flexibility and adaptability: The system should be able to adapt easily to different situations and environments, including new and unforeseen environments.
  • Autonomy in AI systems: A system should be able to work without human intervention over a longer period of time. Its creator must define for how long and under what conditions.
  • Evolution: A system should be able to improve itself when its environment changes. This is particularly important for self-learning AI systems.
  • Bias: The results of AI systems can deviate from what is considered fair, for example with regard to gender or income. Such deviations must be detected and kept under control.
  • Ethics in AI systems: AI systems should follow clear rules that ensure they serve people, respect democratic values and are transparent.
  • Side effects and reward hacking: If requirements are overlooked during development, undesirable effects can result. An example is a translator that translates legal terms incorrectly, leading to legal misunderstandings. Reward hacking occurs when a system reaches its specified goal in a clever but unintended way that contradicts the developers’ original intentions.
  • Transparency, interpretability and explainability: This is about how easily users can understand how the system works and why it produces certain results.
  • Functional safety and AI: It must be ensured that AI systems perform their tasks reliably and without unexpected errors, especially in safety-critical applications.
The ISTQB syllabus for “AI Testing” contains an entire chapter on quality characteristics.

A comparison of quality features: A look at Libre Translate and DeepL

Once I had familiarized myself with the AI-specific features, my aim was to apply them to the translators “Libre Translate” and “DeepL” and compare the two systems with each other. Both AI translators have already been introduced in the previous chapter.

When comparing Libre Translate and DeepL, flexibility and adaptability are decisive factors. Libre Translate is flexible in the area of text-based translation and speech recognition. It is based on an OpenNMT model that has been trained with Argos Translate, and additional languages can be trained with tools such as Locomotive. DeepL, in contrast, handles different types of text flexibly and offers advanced features for stylistic text revision (DeepL Write). I cannot assess how well DeepL adapts to a new context, because it is a black box: it is not clear exactly how the system works.

Both systems can process texts autonomously and take the context into account, although DeepL handles context better. Both translators also offer APIs for integration into other systems. Nevertheless, incorrect translations remain possible, especially for specialized terms.
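To illustrate the API integration, here is a minimal sketch of how a request to a self-hosted Libre Translate instance could be assembled in Python. It assumes the `/translate` endpoint with the JSON fields `q`, `source`, `target` and `format`, and a `translatedText` field in the response, as described in the LibreTranslate API documentation. The base URL `http://localhost:5000` and the helper names are illustrative assumptions, not part of my test setup.

```python
import json
import urllib.request

# Assumed address of a self-hosted instance; adjust to your own setup.
BASE_URL = "http://localhost:5000"


def build_translate_request(text, source, target, base_url=BASE_URL):
    """Build (but do not send) a POST request for the /translate endpoint."""
    payload = {"q": text, "source": source, "target": target, "format": "text"}
    return urllib.request.Request(
        url=f"{base_url}/translate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_translation(response_body):
    """Extract the translated text from a JSON response body (bytes or str)."""
    return json.loads(response_body)["translatedText"]


def translate(text, source, target, base_url=BASE_URL):
    """Send the request and return the translation (needs a running instance)."""
    request = build_translate_request(text, source, target, base_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_translation(response.read())
```

Separating request building and response parsing from the actual network call keeps both parts testable without a running server. Public instances may additionally require an API key, which this sketch leaves out.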

The two systems differ in terms of evolution. Libre Translate is not a self-learning system and requires training for new languages and contexts. DeepL, in contrast, can improve its translation capabilities over time, although it too depends on training.

In terms of bias, both systems show a so-called gender bias, with DeepL performing better in the test. Details will follow in the next article. Gender bias in translations often shows up in gender-specific terms and phrases that can reinforce traditional role stereotypes. I also translated various swear words from German into English and vice versa and found no systematic restrictions on the translation of particular terms.
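One way such a gender bias could be made measurable is sketched below, under strong simplifications. The idea: translate English sentences with gender-neutral occupation nouns into German and count whether the masculine or the feminine form appears in the output. The word list `OCCUPATION_FORMS` and both function names are hypothetical, and exact word matching ignores inflected forms, so this is a starting point rather than the method used in my test.

```python
import re
from collections import Counter

# Hypothetical mini word list: English occupation -> (masculine, feminine) German form.
# A real test would need a much larger, curated list.
OCCUPATION_FORMS = {
    "doctor": ("Arzt", "Ärztin"),
    "teacher": ("Lehrer", "Lehrerin"),
    "engineer": ("Ingenieur", "Ingenieurin"),
}


def classify_gender(occupation, german_translation):
    """Return 'masculine', 'feminine' or 'unknown' for one translated sentence."""
    masculine, feminine = OCCUPATION_FORMS[occupation]
    # Compare whole words so that 'Lehrer' does not match inside 'Lehrerin'.
    words = re.findall(r"\w+", german_translation)
    if feminine in words:
        return "feminine"
    if masculine in words:
        return "masculine"
    return "unknown"


def gender_distribution(translations):
    """Count the classifications for (occupation, german_translation) pairs."""
    return Counter(classify_gender(occ, text) for occ, text in translations)
```

Fed with the actual output of a translator for a sufficiently large set of sentences, a strongly one-sided distribution would be an indication of gender bias that deserves a closer manual look.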

Clicking on individual words in DeepL brings up alternative word suggestions, in this case even gender-specific ones.

From an ethical point of view, Libre Translate promotes sustainable development and helps to overcome language barriers. The system is transparent and open source. DeepL has similar ethical principles, although as a black box system it is less transparent.

Both systems show side effects such as gender bias, although DeepL allows alternative translations and fine adjustments to the output texts. Neither system appears to be susceptible to reward hacking. One conceivable risk remains, however: a system might not reflect the full content, that is, not provide a 1:1 translation, while still conveying the essential facts.

In terms of transparency, interpretability and explainability, Libre Translate offers the opportunity to take a closer look at the system and understand it thanks to its open source nature. DeepL, as a closed source system, instead strives for transparency through API documentation and blog articles that provide insights into how the system works.

Functional safety is largely ensured with both systems. They are robust and encrypt data during transmission (HTTPS). However, Libre Translate does not always deliver reliable translations, as you could already read in the first article: “summer trip in truthahn”. Libre Translate would therefore not be a suitable choice for safety-critical applications.

In general, it is important to consider all of the quality features mentioned above when evaluating AI systems. Every user has different preferences and sets different priorities for these features.

While the AI translator Libre Translate scores points in the comparison with its transparent, open-source structure, DeepL offers advanced functions and self-learning capabilities, albeit with less transparency. At first glance, both systems fulfill the ISTQB-AI quality criteria in equal measure. But how can it be verified more precisely that the systems do not deliver faulty translations?

Key aspects of the quality inspection

When testing software, it is important to define clear test objectives in order to systematically search for possible errors in the software. This also applies to systems with AI. Specifically for the AI translators, I have defined the following test objectives for further evaluation:

  • Completeness check: Ensure that the translation covers the entire content of the original text without omitting important information.
  • Comprehensibility check: Ensure that the translation is easy to understand so that users can easily grasp the content in the target language.
  • Accuracy check: Ensure that the AI translation is precise and accurate to avoid misunderstandings or misinterpretations.
  • Handling technical terms: Ensure that the system recognizes technical terms correctly and uses them appropriately in the translation.
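
Two of these test objectives, the completeness check and the handling of technical terms, lend themselves to simple automated pre-checks. The following is a minimal sketch in Python, not the implementation used in this series. The function names and the glossary format are assumptions, and the sentence splitter is deliberately rough (it stumbles over abbreviations and German ordinal dates such as “31. Dezember”). Comprehensibility and accuracy still require human judgment or reference translations.

```python
import re


def split_sentences(text):
    """Very rough sentence splitter: split after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def completeness_issues(source, translation):
    """Return a list of hints that content may have been dropped."""
    issues = []
    n_source = len(split_sentences(source))
    n_target = len(split_sentences(translation))
    if n_source != n_target:
        issues.append(f"sentence count differs: {n_source} vs {n_target}")
    # Numbers should normally survive a translation unchanged.
    missing = set(re.findall(r"\d+", source)) - set(re.findall(r"\d+", translation))
    if missing:
        issues.append(f"numbers missing in translation: {sorted(missing)}")
    return issues


def missing_terms(source, translation, glossary):
    """Return expected target terms that are absent from the translation.

    glossary maps a source-language term to the expected target-language term.
    Only terms that actually occur in the source text are checked.
    """
    source_lower, target_lower = source.lower(), translation.lower()
    return [
        expected
        for term, expected in glossary.items()
        if term.lower() in source_lower and expected.lower() not in target_lower
    ]
```

An empty result from either function does not prove that a translation is correct; it only means that these coarse heuristics found nothing suspicious. A non-empty result marks a translation for manual review.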

Outlook: Test cases in ALM Octane

The test objectives presented provide a solid framework for further evaluation of AI translators. In the next chapter, I will give a detailed insight into how I developed concrete test cases and how I implemented and executed them in Open Text ALM Octane. That chapter will show how the “ISTQB AI” concepts can be applied in practice.

