Testing artificial intelligence

18 January 2024 | 9 min

Artificial intelligence (AI) is on everyone’s lips these days and is frequently associated with working faster and more efficiently. Nevertheless, systems such as ChatGPT regularly draw criticism: answers are incorrect, questions go unanswered because of restrictive filters, or facts are simply invented – the AI creates its very own reality.

Ensuring quality and functionality plays an important role in the software industry. Errors should be found and corrected quickly, preferably before the product reaches the end customer. To achieve this, software has to be put through its paces. But does this also work for AI? Can AI systems be tested, and if so, how? In this blog series, I explore these questions and report on my findings on the topic of “Testing AI”.

All beginnings are … amazingly intuitive?

Getting started in the world of AI was not exactly easy given the variety of systems and areas of focus. Until now, I had not had much contact with the topic; my experience was limited to generating images with “Midjourney”, simple chatbot interactions with “ChatGPT” and translating texts with “Google Translate”. So where is the best place to start, and where do you draw the line for yourself? There is a real danger of quickly getting lost in topics that sound fascinating but take a lot of time to get into.

The first step was to build up a basic understanding of how AI systems work and to deepen this knowledge step by step. The first port of call was the “Teachable Machine”, which lets you create your own AI for image classification. This worked surprisingly easily and very intuitively, even without in-depth mathematical knowledge. All you have to do is hold an object in front of the camera of your smartphone or laptop and press a button on the web interface to capture different images of it. With a separate training set each for a pen, a coffee cup and a pair of headphones, the AI was then able to distinguish between the objects in real time as I held them up to the camera.

The aim of machine learning is to recognize meaningful correlations in input data and derive rules from them. Using these rules, a system can then recognize trends, classify data or make predictions when it is fed new, unknown data. In the Teachable Machine example, an upside-down coffee cup is still classified correctly even though it no longer matches the data the system was originally trained on.

Classification in machine learning (Graphic: accompio PrimeTec GmbH)
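To make this concrete: a toy sketch of classification, deliberately not the Teachable Machine’s actual code, here with scikit-learn and entirely made-up feature values. The classifier derives rules from labeled examples and then classifies a sample it has never seen before.

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up feature vectors: [length in cm, width in cm] per object.
X_train = [
    [14.0, 1.0], [13.5, 0.9],  # elongated and thin -> "pen"
    [9.0, 8.5], [8.5, 8.0],    # roughly as wide as tall -> "cup"
]
y_train = ["pen", "pen", "cup", "cup"]

# Learn rules from the labeled input data ...
model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# ... and classify new, previously unseen data.
print(model.predict([[13.0, 1.1]]))  # -> ['pen']
```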

After the first experiments with the image classification AI had already delivered promising and, above all, very clear results, the next goal on the agenda was to implement a similar system on my own machine. Fortunately, the developers of the Teachable Machine had published an older version of the system as open source on GitHub. Even though this is the archived original version and its functions are kept quite simple, a look at the program code was quite revealing. The images are trained locally on the machine using TensorFlow; no communication with other services takes place.
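The principle of purely local training can be sketched in Python with TensorFlow. This is not the repository’s actual code, and the folder layout (one subfolder per class) is an assumption for illustration only.

```python
import tensorflow as tf

# Assumed layout: one subfolder per class,
# e.g. data/pen, data/cup, data/headphones.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data", image_size=(224, 224), batch_size=8)

# A deliberately small network; everything runs on the local machine,
# no external services are involved.
model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(3, activation="softmax"),  # pen, cup, headphones
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=5)
```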

The first version of the “Teachable Machine” reliably recognizes the blue pen, which is marked as class “purple”.

Image classification – what needs to be considered?

The system is up and running and delivers initial results. In testing, the AI even reports a confidence of up to 95% that the recognized object is the one shown. In other cases, however, objects are misclassified: a coffee cup suddenly becomes a ballpoint pen, and sometimes the AI can no longer clearly distinguish between the objects at all. How can this happen? While creating the training data, several important factors emerged:

  • The background in the picture should remain as uniform as possible. A pen lying on the desk and visible in the picture is not a problem, provided it is also visible in the same place in all other images and later in the live webcam feed.
  • The object is more likely to be recognized if it is photographed from different perspectives. For example, it helps to rotate the coffee cup so that, in addition to the front and back, the bottom of the cup, the inside and several side and rotated views are also captured. The more training data, the better the result (see the augmentation sketch after this list).
  • Another important factor was interference: if body parts such as arms, the face or the upper body were visible in the image during recording, classification worked noticeably less reliably.
  • The shape and color of the objects should also be taken into account. While the system can distinguish a green cup very well from a blue pen, it recognizes a green ballpoint pen less reliably than a blue one – presumably because its color matches the cup’s.
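One way to obtain the additional perspectives mentioned above without photographing every angle by hand is data augmentation. A hypothetical TensorFlow sketch, not part of the Teachable Machine itself:

```python
import tensorflow as tf

# Random rotations (up to ~90 degrees), horizontal flips and zooms
# produce additional views of the same objects.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomRotation(0.25),
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomZoom(0.2),
])

images = tf.random.uniform((8, 224, 224, 3))  # stand-in for real photos
augmented = augment(images, training=True)    # a freshly varied batch
```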

AI in self-study

After gaining my first impressions by interacting with AI, I wanted to delve deeper into the topic. How exactly does AI work from a technical perspective? How are AIs created? What do you have to consider? And above all: How can AI be tested?

The c’t special issue “KI-Praxis” (2023/24) was a very helpful and important companion on the way to testing AI. It allowed me to better understand the fundamentals of AI and to link them directly to my practical experience with the image classification AI. But what happens next?

The ISTQB “Certified Tester Foundation Level” is an international standard and a recognized foundation in the field of software testing. While the terms and concepts it contains apply to software projects in general, the ISTQB goes one step further: the “Certified Tester AI Testing” curriculum aims to broaden testers’ horizons with regard to artificial intelligence, especially the testing of AI systems – which is exactly what I needed. The course content gave me a basic understanding of the test methods, quality characteristics and risk aspects that are particularly important when testing AI systems.

The next big goal was set: to develop an acceptance test for an AI system with a focus on AI-specific features. The plan was to apply what I had learned and compare several AI systems with pre-trained models. But this is where the first problem arose. Accepting the system is only possible if a pre-trained model is available as a fixed baseline. The “Teachable Machine” setup, however, offered no way to save and reload recorded images and trained models. This meant the system could not be reset to a defined starting state from which future tests could be repeated under identical conditions. Nor could I find a similarly functioning AI system that would have allowed a meaningful comparison. A change of scenery was therefore necessary.
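For illustration only, a minimal sketch of the missing capability, assuming a recent TensorFlow/Keras version and with a hypothetical model and file name: save a trained model once and reload it so that every test run starts from an identical state.

```python
import tensorflow as tf

# Hypothetical stand-in for a trained image classifier.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(224, 224, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Freeze the state once ...
model.save("baseline.keras")

# ... and restore exactly this state before every test run,
# so tests can be repeated under identical conditions.
baseline = tf.keras.models.load_model("baseline.keras")
```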

The path to your own AI translator

With the rapid advances in technology, machine translation has reached an important milestone. Ten years ago, I was rather skeptical about machine translators. The translation quality of Google Translate was not convincing and the possible uses of the system were limited to a simple web interface. Today, some of the translations are said to be so good that they are indistinguishable from a human translation. But is that really true?

For further analysis, I initially focused on two translation systems: DeepL and LibreTranslate. Both are machine translation systems at their core. While LibreTranslate is an open-source system based on OpenNMT, DeepL is closed source – the company reveals little about how the system works behind the scenes. Reason enough to take a closer look at both systems and put them through their paces.

The translation quality of LibreTranslate is funny, but disappointing.

After quickly setting up my own instance of LibreTranslate, the question of how best to test the system arose just as quickly. One intuitive approach would be to translate a given text into another language and then translate the result back into the source language – a round-trip translation, so to speak. That way, you could measure how much the original text and the twice-translated text differ, couldn’t you? Unfortunately not. Instead of evaluating a single system, you would be evaluating two systems that work independently of each other: the translation from language A to language B and the back-translation from language B to language A. This approach is therefore unsuitable for evaluating translation quality.
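To make the caveat concrete, here is what such a round-trip against a local LibreTranslate instance might look like. The endpoint and fields follow LibreTranslate’s public HTTP API, but the address, language pair and sample sentence are assumptions; check them against your own installation.

```python
import requests

def translate(text: str, source: str, target: str) -> str:
    # Default address of a locally running LibreTranslate instance.
    resp = requests.post(
        "http://localhost:5000/translate",
        json={"q": text, "source": source, "target": target, "format": "text"},
    )
    resp.raise_for_status()
    return resp.json()["translatedText"]

original = "The coffee cup is on the desk."
german = translate(original, "en", "de")
back = translate(german, "de", "en")
print(german, "->", back)
# Comparing `original` with `back` judges two models at once
# (en->de and de->en) -- exactly the caveat described above.
```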

Another option would be to have human translators – experts with several years of experience in this field – assess the quality of the translation. The experts could then examine the texts for various characteristics:

  • How good is the quality of the translation?
  • How much of the meaning of the original text is retained in the translation?
  • How comprehensible is the translation?
  • Are there any missing words or incorrect word sequences in the translation?

However, the main problem with these questions is that the answers are partly a matter of subjective judgment – standardized evaluation criteria would first have to be developed together with the experts. Even then, the assessment usually remains subjective and is a costly affair.

A less costly and more reproducible method is the use of machine-evaluated metrics. One of the first such metrics was the “Bilingual Evaluation Understudy”, or BLEU for short. BLEU measures the overlap between a machine translation and one or more human reference translations, provided a sufficiently large set of human-generated reference data is available. The closer the machine translation comes to the professional human translation, the higher the quality score.
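A small sketch of how such a score can be computed, assuming the Python package sacrebleu; the sentences are made up, and real evaluations use large test sets rather than a single pair.

```python
import sacrebleu

# Machine outputs and the matching human reference translations.
hypotheses = ["The cat sits on the mat."]
references = [["The cat is sitting on the mat."]]  # one list per reference set

score = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {score.score:.1f}")  # higher = closer to the human reference
```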


A brief outlook

After this introduction to the world of AI testing, the following chapters will take a closer look at the systems presented, particularly with regard to the AI-specific ISTQB quality characteristics. In addition to the metrics mentioned above, I will examine further test methods and quality features. A comprehensive risk analysis will also cover potential risks and problems of the systems, along with possible solutions from a testing perspective – always with the aim of designing a general acceptance test for AI-supported translators.

About us

We are a powerhouse of IT specialists and support customers with digitalization. Our experts optimize modern workplace, DevOps, security, big data management and cloud solutions as well as end user support. We focus on long-term collaboration and promote the personal development of our employees. Together, we are building a future-proof powerhouse and supporting customers on their path to successful digitalization.

Contact

Do you have a request? Please contact us!

As your companion and powerhouse in the IT sector, we offer flexible and high-performance solutions.