
The Key to AI Model Performance


Data quality plays a crucial role in the performance of AI models, particularly in areas like Computer Vision, where accurate data annotation is the foundation for training models. Ensuring high-quality data is essential. Even small mistakes can significantly affect AI systems, especially in critical applications like autonomous vehicles, medical imaging, and facial recognition. 

So, how is data quality determined, and what are the most important criteria? Let’s explore Keymakr’s insights.

What Is Data Quality, and What Are Its Main Criteria for Computer Vision?

When we ask, “What is quality data?” there’s no straightforward answer. It is not a single metric you can measure with one simple number. We define data quality based on specific criteria for the application we’re targeting.

In NLP, the quality criteria for text data include syntactic and semantic correctness.

In Computer Vision, data annotation often means marking objects in images. So, metrics like marked objects that do not exist, objects that were missed, and bounding box accuracy are essential.

Let’s take a closer look at them:

False Positives: These occur when the model predicts an object that isn’t present. In autonomous driving, a false positive might detect a vehicle in an empty lane. This could cause the car to make unnecessary corrections.

False Negatives: These occur when the model fails to detect an existing object. For example, a self-driving car could miss a pedestrian or another vehicle. This could cause serious accidents.

Geometry Correctness: In image annotation, this is the accuracy of object boundary markings, like 2D bounding boxes. A boundary around a vehicle that is too big or too small can confuse the AI model about the object’s true location.

Attribute Errors: Some applications need annotations beyond simple object detection. For instance, an object may be partially hidden. How well the annotation captures that can greatly affect the model’s performance.
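To make the first three criteria concrete, here is a minimal Python sketch that compares one annotator's bounding boxes against a trusted reference set. It is an illustration under stated assumptions, not Keymakr's actual tooling: the function names, the (x_min, y_min, x_max, y_max) box format, the greedy matching, and the 0.5 overlap threshold are all hypothetical choices. Geometry correctness is approximated with intersection over union (IoU), a common overlap measure for boxes.

```python
from typing import Dict, List, Optional, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes (0.0 to 1.0)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def compare_annotations(
    annotated: List[Box],
    reference: List[Box],
    iou_threshold: float = 0.5,  # hypothetical cut-off for "same object"
) -> Dict[str, Optional[float]]:
    """Greedily match annotated boxes to reference boxes.

    An annotated box with no sufficiently overlapping reference box counts
    as a false positive; a reference box left unmatched counts as a false
    negative. The mean IoU of matched pairs reflects geometry correctness.
    """
    unmatched = list(range(len(reference)))
    matched_ious: List[float] = []
    false_positives = 0

    for box in annotated:
        best_idx, best_iou = None, 0.0
        for idx in unmatched:
            score = iou(box, reference[idx])
            if score > best_iou:
                best_idx, best_iou = idx, score
        if best_idx is not None and best_iou >= iou_threshold:
            unmatched.remove(best_idx)
            matched_ious.append(best_iou)
        else:
            false_positives += 1

    mean_iou = sum(matched_ious) / len(matched_ious) if matched_ious else None
    return {
        "false_positives": false_positives,
        "false_negatives": len(unmatched),
        "mean_iou": mean_iou,
    }
```

Calling `compare_annotations(annotator_boxes, reference_boxes)` on a single image returns the three counts in one dictionary. Attribute errors (such as a missing "occluded" flag) are not covered here; they would need a per-attribute comparison on the matched pairs.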

Why is Data Quality Important?

Even small errors can have major real-world effects in AI applications such as autonomous driving, medical diagnostics, and facial recognition. High-quality annotations must accurately mark objects, people, and other elements in the data. This is key to building reliable AI systems.

Autonomous Vehicles and Object Detection

Companies such as Tesla and Waymo collect vast amounts of data from their test vehicles, which are then meticulously annotated. This includes labeling objects like cars, pedestrians, traffic signs, and road markings. The accuracy of these annotations directly impacts the model’s ability to recognize and react to real-world scenarios. 

For instance, Tesla faced significant challenges with data quality related to its Autopilot system. Early versions of the system struggled to identify objects accurately, particularly in low-light conditions or when objects were partially occluded. These errors led to several high-profile accidents. To improve accuracy, Tesla expanded its team of human annotators, developed advanced annotation tools, and used machine learning to flag potential errors. This multi-pronged approach helped streamline the process and ensure continuous improvement in data quality.

Source: CNN.

Medical Imaging

AI models trained on medical images (e.g., X-rays, MRIs) need precise annotations. They must detect abnormalities like tumors and fractures. High-quality data helps models diagnose conditions better. It lowers false positive and negative rates. 

In 2020, Aidoc received FDA clearance for its AI software that analyzes CT scans to detect acute abnormalities. However, the company faced scrutiny regarding the accuracy and reliability of its algorithms. Some reports suggested that the AI’s performance could vary depending on the quality of the input data and the types of CT scans being analyzed. The algorithm struggled with images from varied demographics and equipment, which could cause false positives or missed diagnoses.

Aidoc’s experiences underscore the importance of ensuring that AI models are trained on diverse, high-quality datasets that reflect real-world conditions. Continuous validation and improvement of AI algorithms in various clinical settings are essential.

Source: Aidoc.

Ensuring Data Quality

With high-quality data being so important, how can businesses ensure their datasets meet the required standards? A few practices help:

  • Label Guides: These documents are essential in defining what annotations should look like. For instance, a guide for traffic signs or vehicles could be as extensive as 100 pages. It must describe every object to annotate and how to handle specific situations (e.g., occluded or distant vehicles).
  • Quality Assurance Process: Once the dataset is annotated, a quality assurance (QA) process must be implemented. This often involves human experts (human-in-the-loop) who manually review a subset of the annotated data to ensure the annotations align with the label guide. Keymakr uses this practice very often, making human validation the final step of annotation.
  • Automation and Auditing: To supplement human annotators, companies can turn to automated QA tools. These tools use AI to cross-check annotations for common errors like missed objects or inaccurate bounding boxes.
  • Compliance Standards: A new ISO standard, ISO 5259, is expected to be released either this year or next. It focuses on the quality of data for machine learning and provides many metrics to help distinguish good data from bad. Businesses must make their annotation processes auditable to comply with this standard.
  • Specialized Tools: Keymakr’s partner, QualityMatch, employs a distinctive approach to quality assurance that combines AI and statistical methods. The platform simplifies complex annotation tasks into easy questions. By breaking tasks down into small, manageable pieces, QualityMatch creates a scalable way to ensure accuracy. Its system uses statistical models to predict when annotators will disagree, which lets teams focus on the most critical tasks. This way, errors in complex datasets are minimized, and models can be trained on reliable, high-quality data.
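
Two of the practices above, reviewing a subset of annotated data and focusing on items where annotators disagree, can be sketched in a few lines of Python. This is a simplified illustration only: the function names, the 10% review fraction, and the 0.8 agreement threshold are hypothetical, and the simple majority-share measure stands in for the statistical models that platforms like QualityMatch use, which are not described here.

```python
import random
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def sample_for_review(
    item_ids: Sequence[Hashable],
    fraction: float = 0.1,  # hypothetical share of items sent to human QA
    seed: int = 0,
) -> List[Hashable]:
    """Draw a reproducible random subset of annotated items for manual review."""
    if not item_ids:
        return []
    k = min(len(item_ids), max(1, round(len(item_ids) * fraction)))
    return random.Random(seed).sample(list(item_ids), k)


def agreement(answers: Sequence[str]) -> float:
    """Share of annotators who gave the most common answer (0.0 to 1.0)."""
    if not answers:
        raise ValueError("at least one answer is required")
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)


def flag_disagreements(
    answers_by_item: Dict[str, Sequence[str]],
    min_agreement: float = 0.8,  # hypothetical threshold
) -> List[str]:
    """Return the items whose annotators agree less than the threshold."""
    return [
        item
        for item, answers in answers_by_item.items()
        if agreement(answers) < min_agreement
    ]
```

Here each item carries the answers several annotators gave to one simple question, such as "Is the vehicle occluded?". Items returned by `flag_disagreements` are the ones worth routing to an expert reviewer first.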

To learn more about data quality and how to ensure it, listen to the podcast featuring Keymakr’s VP of Partnerships, Maria Greicer.
