Guidelines and good practices


To build an Intelligence in the best possible way, that is, with a dataset that produces accurate predictions, we should follow a few good practices when creating the training examples.

In this article, we will learn about these good practices.

Guidelines

When training, the following guidelines should be followed:

  • Quantity
  • Balancing
  • Specificity
  • Variety

Each one of these topics is explained below.

Quantity

Most NLP models rely on the number of training examples to improve their prediction rate per intent. So, to achieve high accuracy, we need to balance the relationship between the number of sentences and the number of intents in your dataset.

Below are some quality classifications of a dataset according to the number of phrases trained per intent, assuming a dataset with 5 or fewer intents.

  • Minimum: 10 sentences per intention;
  • Good: 25 sentences per intention;
  • Optimal: 40 sentences per intention.

Some factors can influence these recommended numbers, such as the total number of intents in the Intelligence (which can affect the number of false positives). The more intents the dataset has, the more sentences per intent are required.

The chosen algorithm also affects this number. An algorithm based on BERT, for example, makes use of a pre-trained model and therefore usually needs fewer sentences to achieve a good result.
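As a rough illustration, the sketch below (plain Python, with made-up example data) counts the training sentences per intent and compares each count against the thresholds above; the threshold values assume a dataset with 5 or fewer intents, as described.

```python
from collections import Counter

# Each training example is a (sentence, intent) pair -- illustrative data only.
dataset = [
    ("I would like to buy a sandwich", "food"),
    ("I want to buy a juice", "drinks"),
    # ... the rest of your training sentences
]

def quantity_report(examples, minimum=10, good=25, optimal=40):
    """Classify each intent against the thresholds listed above."""
    counts = Counter(intent for _, intent in examples)
    report = {}
    for intent, count in counts.items():
        if count >= optimal:
            level = "optimal"
        elif count >= good:
            level = "good"
        elif count >= minimum:
            level = "minimum"
        else:
            level = "below minimum"
        report[intent] = (count, level)
    return report

print(quantity_report(dataset))
```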

Balancing

Using a balanced number of sentences across all of your Intelligence's intentions decreases the chance of a bias towards a specific intention.

For example, if the Intelligence has an intention X with 200 sentences and another intention with 50, the probability that the algorithm classifies an entry as intent X may be higher simply because X has more examples (considering that the entry is a new sentence never seen during training).

So, a good practice is to have an approximately equal number of sentences across your intentions, whenever possible.
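A minimal sketch of how such an imbalance could be spotted, assuming hypothetical per-intent counts (the 50% tolerance is an arbitrary choice made only for this illustration):

```python
# Hypothetical sentence counts per intent, mirroring the example above.
counts = {"X": 200, "Y": 50}

average = sum(counts.values()) / len(counts)
for intent, count in counts.items():
    # Flag intents that deviate from the average by more than 50%
    # (an arbitrary tolerance chosen only for this illustration).
    if count > 1.5 * average or count < 0.5 * average:
        print(f"Intent '{intent}' looks unbalanced: {count} sentences vs. average {average:.0f}")
```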

Specificity

To decrease the number of false positives in the dataset and increase its accuracy, we recommend that the vocabulary of the sentences created during training respect the specificity rule.

This rule defines that words specific to an intention must appear only in the sentences of that intention, while words that should not be interpreted as belonging to any intention must be distributed among all intentions, so that the algorithm does not associate those words with any specific topic.

For example, if I have an Intelligence that identifies orders at a diner, with the intentions "food" and "drinks", I need to associate words related to each of the intents, such as "sandwich" for the first and "juice" for the second.

We would thus create training sentences such as "I would like to buy a sandwich" for the "food" intention and "I want to buy a juice" for "drinks".

Note that the specific words "sandwich" and "juice" are each associated with a single intent, while the words "would like", "want", "to", "buy" and "a" are distributed between the two intents, so that if I type just "I would like to buy", the Intelligence will not associate it with either intent, as the prediction would have very low confidence.
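To make the diner example concrete, here is a small sketch (illustrative sentences only) in which the specific words appear in a single intent while the generic purchase vocabulary is shared by both; the shared words are exactly the ones that carry no intent-specific signal:

```python
# Illustrative training sentences: "sandwich" and "juice" appear in only
# one intent each, while "I would like / want to buy a" appears in both.
training_examples = {
    "food": [
        "I would like to buy a sandwich",
        "I want to buy a sandwich",
    ],
    "drinks": [
        "I would like to buy a juice",
        "I want to buy a juice",
    ],
}

# Words present in every intent carry no intent-specific signal, which is
# why "I would like to buy" alone should come back with low confidence.
shared = set.intersection(
    *(set(" ".join(sentences).lower().split()) for sentences in training_examples.values())
)
print("Shared (non-specific) words:", shared)
```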

Variety

Sentence structure is also an important factor in interpreting the user's input. For example, if the phrase "I would like to eat a pizza" is trained on the "food" intent, the algorithm would also classify the phrase "I would love to eat a pizza" as that intent, given that the sentence structure is similar (provided a good number of phrases with this structure have been trained).

This means that the more varied the example sentences are, both in structure and in vocabulary, the more likely it is that the Intelligence will correctly recognize new phrasings related to that intention.
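As an illustration, the sentences below (made up for this example) all belong to the same hypothetical "food" intent but vary in structure and vocabulary; the small check at the end is just a rough proxy for how varied the wording is:

```python
# Illustrative only: the same "food" intent expressed with varied
# structures and vocabulary, which helps the model generalize.
food_examples = [
    "I would like to eat a pizza",
    "I would love to eat a pizza",
    "Can I order a pizza, please?",
    "One pizza to go",
    "Do you have any pizza left?",
]

# A rough proxy for variety: distinct words vs. total words in the intent.
words = [word.lower().strip("?,.") for sentence in food_examples for word in sentence.split()]
print(f"{len(set(words))} distinct words out of {len(words)} total")
```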
