  • Data Scientists
  • 88,000 augmentations
  • 6M automated augmentations
  • Industry: Oil & Gas
  • Location: Munich, Germany
  • Company size: 10,000+
We were looking for a supplier with NER expertise that could provide us with high-quality training data to feed into our existing ML service we were using to build a chatbot. We were very happy with the throughput and speed we received from super.AI.
Head of AI Solutions

Starting point: building a chatbot solution

The customer was building a chatbot for its internal employee knowledge base, using an existing ML service to add natural-language understanding to its bots. They were looking for a reliable partner to help them classify the available information and format it into training data for that ML service.

They wanted the chatbot to recognize intent and named entities in user queries. The use case was an assistant (voice-controlled in the final product, like Alexa) that lets users query the customer's internal database of projects, e.g. “What are all the currently active projects in North America?” or “Which of these deals are owned by group xyz?” They had already tried Microsoft LUIS, but its entity recognition system could not handle the customer's highly specific internal entities (e.g. “CI owner” rather than a generic type like “Person”).

Enhancing the chatbot solution with super.AI

The customer provided us with a list of entities, utterances, and ground truth. super.AI then classified the utterances and returned output in a format compatible with the customer's ML service (a sketch of a labeled utterance follows the list below). We approached the task in several steps:

  • Initial labeling 
  • Variable feedback labeling
  • Text augmentation 
  • Query augmentation with synonyms
  • Data augmentation
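
For illustration, one labeled utterance in the output might look roughly like the following. This is a minimal sketch: the field names and the ci_owner / region entity types are hypothetical stand-ins, not the customer's actual schema or the exact format expected by their ML service.

```python
# A minimal sketch of one labeled training utterance (hypothetical field names
# and entity types). Start/end are character offsets, end exclusive.
labeled_utterance = {
    "text": "Which active projects in North America are owned by group xyz?",
    "intent": "FilterProjects",
    "entities": [
        {"entity": "project_status", "value": "active",        "start": 6,  "end": 12},
        {"entity": "region",         "value": "North America", "start": 25, "end": 38},
        {"entity": "ci_owner",       "value": "group xyz",     "start": 52, "end": 61},
    ],
}
```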

The starting point of the POC was a snippet of the internal database with 50+ columns and 20+ rows. Each row represented an individual project and each column a property we were interested in.

To generate input for our data program, we then created queries that would target either one or several individual columns for one specific row of the table (“What is the number of users for project abc?“), or queries that filter the rows by column conditions (“How many projects have more than 10 users?“). These queries were originally formulated as abstract SQL queries and then rewritten in natural language in a few (around five) different ways.
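
As a rough sketch of that step, the two query shapes can be generated programmatically from the table snippet. The column names and the tiny stand-in table below are hypothetical, not the customer's actual 50+ column database.

```python
import pandas as pd

# Hypothetical stand-in for the 50+ column / 20+ row database snippet.
projects = pd.DataFrame(
    [
        {"project": "abc", "users": 12, "region": "North America", "status": "active"},
        {"project": "def", "users": 7,  "region": "Europe",        "status": "closed"},
    ]
)

# Queries that target one column of one specific row.
lookup_queries = [
    f"What is the number of users for project {row.project}?"
    for row in projects.itertuples()
]

# Queries that filter the rows by a column condition.
threshold = 10
filter_query = f"How many projects have more than {threshold} users?"
matching_rows = projects[projects["users"] > threshold]

print(lookup_queries)
print(filter_query, "->", len(matching_rows), "project(s)")
```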

The data program then performed text augmentation on these inputs: given a query, it produced many similar queries with the same intent but different wording. The data program used three strategies (and combinations of them) to achieve this:

  • Swapping single words with synonyms from a thesaurus (sketched in code after this list)
  • Back translation
  • Using our crowd heroes (one hero had the creative task of coming up with the augmentations, then another hero verified them against the original input for intent similarity)
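
As a minimal sketch of the first strategy, the snippet below swaps single words for thesaurus synonyms, using WordNet via NLTK as the thesaurus; the actual data program's thesaurus, filtering rules, and intent verification are not shown here. Back translation works similarly in spirit: translate the query into a pivot language and back, keeping the result if the intent survives the round trip.

```python
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)

def synonym_swaps(query):
    """Produce variants of a query by replacing one word at a time with a
    thesaurus synonym (a sketch; no intent verification is done here)."""
    words = query.split()
    variants = set()
    for i, word in enumerate(words):
        for synset in wordnet.synsets(word.lower()):
            for lemma in synset.lemmas():
                synonym = lemma.name().replace("_", " ")
                if synonym.lower() != word.lower():
                    variants.add(" ".join(words[:i] + [synonym] + words[i + 1:]))
    return variants

print(synonym_swaps("What is the number of users for project abc?"))
```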

Results

Based on the original list of entities, we generated 88,000 augmentations and, by combining them automatically, roughly 6M automated augmentations: sufficient training data for the customer's ML service to build its first chatbot iteration for internal testers. They were extremely happy both with the quality of the data and with the turnaround times.
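
The jump from 88,000 individual augmentations to millions comes from combining independent variations automatically. A minimal sketch of the idea, with hypothetical question forms and project names:

```python
from itertools import product

# Hypothetical independent variation axes; combining them multiplies the
# number of distinct training utterances.
question_forms = [
    "What is the number of users for project {p}?",
    "How many users does project {p} have?",
    "Tell me the user count for project {p}.",
]
project_names = ["abc", "def", "ghi"]

combined = [form.format(p=p) for form, p in product(question_forms, project_names)]
print(len(combined))  # 3 question forms x 3 projects = 9 utterances
```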
