Named entity recognition (NER)—sometimes referred to as entity chunking, extraction, or identification—is the task of identifying and categorizing key information (entities) in text. An entity can be any word or series of words that consistently refers to the same thing. Every detected entity is classified into a predetermined category. For example, an NER machine learning (ML) model might detect the word “super.AI” in a text and classify it as a “Company”.
NER is a form of natural language processing (NLP), a subfield of artificial intelligence. NLP is concerned with computers processing and analyzing natural language, that is, any language that has developed naturally rather than being designed artificially, as programming languages are.
This post explores the basics of how NER works, along with some high-level use cases and how you can apply it in your business or project.
At the heart of any NER model is a two-step process: detecting a named entity, then categorizing it.
Beneath this lie a couple of things.
Step one involves detecting a word or string of words that forms an entity. Each word represents a token: “The Great Lakes” is a string of three tokens that represents one entity. Inside-outside-beginning (IOB) tagging is a common way of indicating where entities begin and end. We’ll explore this further in a future blog post.
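To make the tagging idea concrete, here is a minimal Python sketch of inside-outside-beginning tagging applied to “The Great Lakes”. The function names (`iob_tags`, `decode_iob`), the `LOCATION` label, and the rest of the example sentence are our own assumptions for illustration; real libraries use their own tokenizers and tag sets.

```python
from typing import List, Tuple


def iob_tags(tokens: List[str], entities: List[Tuple[int, int, str]]) -> List[str]:
    """Tag tokens with IOB labels.

    `entities` holds (start, end, label) token indices, with `end` exclusive.
    """
    tags = ["O"] * len(tokens)           # "O" marks tokens outside any entity
    for start, end, label in entities:
        tags[start] = f"B-{label}"       # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"       # remaining tokens inside the entity
    return tags


def decode_iob(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Recover (entity_text, label) pairs from a sequence of IOB tags."""
    entities = []
    current, label = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:
            # "O" tags (and stray "I-" tags with no matching "B-") close the entity
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities


tokens = ["The", "Great", "Lakes", "hold", "fresh", "water"]
tags = iob_tags(tokens, [(0, 3, "LOCATION")])
print(tags)                      # ['B-LOCATION', 'I-LOCATION', 'I-LOCATION', 'O', 'O', 'O']
print(decode_iob(tokens, tags))  # [('The Great Lakes', 'LOCATION')]
```

Three tokens carry three tags, but decoding them yields a single entity, which is the point of marking where an entity begins and where it continues.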
The second step requires the creation of entity categories. Here are some common entity categories:
These are just a few examples. You can create your own entity categories to suit your task and define granular rules for which entities belong to which categories, whether to resolve ambiguity or to fit a task-specific ontology.
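As a rough illustration of what a custom category scheme with a rule for ambiguity could look like, here is a small Python sketch. Everything in it is hypothetical: the `SCHEMA` and `PRIORITY` names, the two categories, and the ambiguous term “Amazon” are ours, chosen only to show one possible tie-breaking rule.

```python
from typing import Optional

# Hypothetical schema: categories and member terms are illustrative only.
SCHEMA = {
    "COMPANY": {"super.AI", "Amazon"},
    "LOCATION": {"The Great Lakes", "Amazon"},
}

# Hypothetical rule for ambiguous terms: earlier categories win.
# A business-news task might put COMPANY first; a geography task might not.
PRIORITY = ["COMPANY", "LOCATION"]


def categorize(entity: str) -> Optional[str]:
    """Return the highest-priority category containing `entity`, if any."""
    for category in PRIORITY:
        if entity in SCHEMA[category]:
            return category
    return None


print(categorize("Amazon"))           # COMPANY (ambiguous, resolved by PRIORITY)
print(categorize("The Great Lakes"))  # LOCATION
print(categorize("Victorian"))        # None (not in the schema)
```

In practice such rules usually live in the labeling guidelines given to annotators rather than in code, but writing them down this explicitly helps keep the training data consistent.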
To learn what is and is not a relevant entity and how to categorize each one, a model requires training data. The more relevant that training data is to the task, the more accurate the model will be at completing it. Train your model on Victorian gothic literature, and it will probably struggle to navigate Twitter.
Once you have defined your entities and your categories, you can use these to label data and create a training dataset (our named entity recognition data program can do this for you automatically). You then use this training dataset to train an algorithm to label your text predictively.
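The sketch below shows the train-then-predict loop in its simplest possible form: a most-frequent-tag baseline in plain Python. This is our own toy stand-in, not how production NER models work (those use the surrounding context of each token), and the tiny dataset, the `train` and `predict` names, and the tag labels are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, IOB tag) pairs


def train(dataset: List[Sentence]) -> Dict[str, str]:
    """Learn, for each lowercased token, the tag it was labeled with most often."""
    counts = defaultdict(Counter)
    for sentence in dataset:
        for token, tag in sentence:
            counts[token.lower()][tag] += 1
    return {token: tag_counts.most_common(1)[0][0]
            for token, tag_counts in counts.items()}


def predict(model: Dict[str, str], tokens: List[str]) -> List[str]:
    """Label tokens with the learned tag, falling back to "O" for unseen tokens."""
    return [model.get(token.lower(), "O") for token in tokens]


# A toy labeled dataset (illustrative only).
training_data = [
    [("super.AI", "B-COMPANY"), ("labels", "O"), ("text", "O")],
    [("The", "B-LOCATION"), ("Great", "I-LOCATION"), ("Lakes", "I-LOCATION"),
     ("are", "O"), ("large", "O")],
]

model = train(training_data)
print(predict(model, ["super.AI", "maps", "the", "Great", "Lakes"]))
# ['B-COMPANY', 'O', 'B-LOCATION', 'I-LOCATION', 'I-LOCATION']
```

The baseline also shows why the relevance and size of the training data matter: having only ever seen “The” at the start of a location, this model would tag every “the” as `B-LOCATION`.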
NER is suited to any situation in which a high-level overview of a large quantity of text is helpful. With NER, you can, at a glance, understand the subject or theme of a body of text and quickly group texts based on their relevancy or similarity.
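One simple way to group texts by similarity once entities have been extracted is to compare the sets of entities each text contains. The Python sketch below uses Jaccard similarity for this; the document names, the entity sets, and the choice of Jaccard are our assumptions, and other similarity measures would work just as well.

```python
from typing import Dict, Set, Tuple

Entity = Tuple[str, str]  # (entity text, category)


def jaccard(a: Set[Entity], b: Set[Entity]) -> float:
    """Share of entities two texts have in common (0.0 = none, 1.0 = identical)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar(name: str, docs: Dict[str, Set[Entity]]) -> str:
    """Return the name of the other document whose entities overlap most with `name`."""
    others = [other for other in docs if other != name]
    return max(others, key=lambda other: jaccard(docs[name], docs[other]))


# Entities assumed to have been extracted already by an NER model (illustrative).
docs = {
    "doc1": {("super.AI", "COMPANY"), ("The Great Lakes", "LOCATION")},
    "doc2": {("The Great Lakes", "LOCATION")},
    "doc3": {("Twitter", "COMPANY")},
}

print(jaccard(docs["doc1"], docs["doc2"]))  # 0.5
print(jaccard(docs["doc1"], docs["doc3"]))  # 0.0
print(most_similar("doc1", docs))           # doc2
```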
Some notable use cases include:
If you think that your business or project could benefit from NER, it’s pretty easy to get started. There are a number of excellent open-source libraries that can get you going, including NLTK, spaCy, and Stanford NER. Each has its own pros and cons, which we’ll be exploring in more detail soon.
But before you begin using one of these libraries to build a model, you will need to produce a relevant labeled dataset to train the model on. That’s where Canotic can help. With our named entity recognition data program, you provide us with your raw text and your desired entities and categories. We’ll label the text you send and return a high-quality training dataset that you can use to train and tailor your NER model.
If you’re interested in learning more or have a specialized use case, reach out to us. You can also stay tuned to our blog, where we’ll be running a series of posts covering different aspects of NLP over the coming months.