Training Data

Teach AI with specially labeled data.

Your custom AI solution will require training data. Training data is used to "teach" AI models how to understand the world. A good set of training data will result in a model that will make accurate predictions on data in real-world scenarios. There are two basic considerations to keep in mind when building a training dataset: quantity and quality.

Data Quantity

When it comes to training AI models, more training data is generally better. More data gives your model more examples to learn from and helps improve its accuracy.

How many inputs does my model need?

This is one of the most common questions when building a new model. Unfortunately, there are no hard and fast rules about the number of inputs your particular use case will require. As a general rule, if you are training a custom model on top of a Clarifai Model, you will need much less training data (typically tens to tens of thousands of inputs) than if you are building a "deep trained model" (typically thousands to millions of inputs).

Bias

Bias occurs when the scope of your training data is too narrow. If you have only ever seen green apples, you will assume that all apples are green and think that red apples are another kind of fruit. A model behaves the same way: if the training data contains only a small number of examples, the model treats those examples as the whole truth. Small datasets make for a smaller worldview.
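One rough way to spot this kind of narrowness before training is to count how many inputs carry each concept. The sketch below is illustrative only: the function names and the `min_share` threshold are assumptions made for this example, not part of any Clarifai API, and the threshold is not a recommended value.

```python
from collections import Counter


def concept_counts(labels):
    """Count how many training inputs carry each concept label."""
    return Counter(labels)


def underrepresented(labels, min_share=0.2):
    """Return concepts whose share of the dataset falls below min_share.

    min_share is an arbitrary illustrative threshold (an assumption).
    """
    counts = concept_counts(labels)
    total = sum(counts.values())
    if total == 0:
        return []
    return sorted(c for c, n in counts.items() if n / total < min_share)


# Hypothetical dataset: mostly green apples, very few red ones.
labels = ["green apple"] * 45 + ["red apple"] * 5

print(concept_counts(labels))
print(underrepresented(labels))
```

A concept flagged here is a candidate for collecting more examples. A balanced count alone does not guarantee that the examples cover the variety the model will see in real use.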

Data Quality

Models that perform well tend to be trained on data that is unique and photographed in a consistent way.

For best results, train your model with data that:

  • Adheres to concept descriptions laid out in a taxonomy

  • Represents the reality of the use-case

  • Has visually noticeable qualities: something that is not too subtle for humans to pick up on and that can be picked up through the noise of a photo.

Models tend to perform poorly when:

  • The training data has inconsistent compositions

  • The photos require outside context (relationships to people in portraits, etc.)

  • The subject matter is subtle. Keep in mind that the model has no concept of language, so in essence, “what you see is what you get”.

  • The training set is cast too wide. If you train a concept on too many different kinds of images, and they are all visually different, the training set will become noisy. This makes it difficult for the model to find the visually distinct qualities to learn from, resulting in high levels of "variance".

Semantic Clarity (The importance of "of" vs "in")

When labeling an image, avoid labeling what is "in" the image; you will get better results if you label what the photo is "of". Where there are multiple objects in a scene, use a detector model and label the detected regions separately.
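The distinction can be sketched as a small labeling routine. This is a minimal illustration under stated assumptions: `RegionLabel`, `ImageAnnotation`, and `annotate` are hypothetical names invented for this example, and `detections` stands in for the output of whatever detector model you use, given as (concept, bounding box) pairs. None of it is a Clarifai API.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Bounding box as (left, top, right, bottom), each a fraction of image size.
Box = Tuple[float, float, float, float]


@dataclass
class RegionLabel:
    concept: str
    box: Box


@dataclass
class ImageAnnotation:
    image_id: str
    subject: Optional[str] = None  # what the photo is "of"
    regions: List[RegionLabel] = field(default_factory=list)


def annotate(image_id, detections):
    """Label the whole image when it has a single subject; otherwise label
    each detected region separately instead of listing everything "in" it."""
    if len(detections) == 1:
        concept, _box = detections[0]
        return ImageAnnotation(image_id=image_id, subject=concept)
    return ImageAnnotation(
        image_id=image_id,
        regions=[RegionLabel(concept, box) for concept, box in detections],
    )


# Hypothetical detector output for two images.
single = annotate("img1", [("dog", (0.1, 0.1, 0.9, 0.9))])
multi = annotate("img2", [("dog", (0.0, 0.2, 0.5, 0.9)),
                          ("ball", (0.6, 0.7, 0.8, 0.9))])
print(single)
print(multi)
```

With this shape, a photo of one dog gets the image-level label "dog", while a scene containing a dog and a ball gets two region labels and no image-level label.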
