
In a previous HumanGeo blog post, Denny Decastro and Kyle von Bredow described how to train a classifier to isolate mentions of specific kinds of people, places and things in free-text documents, a task known as Named Entity Recognition (NER). In general, tools such as Stanford CoreNLP do a very good job of this for formal, well-edited text such as newspaper articles. However, a lot of the data we need to process at Maxar comes from social media, in particular Twitter. Tweets are full of informal language, misspellings, abbreviations, hashtags, @mentions, URLs, and unreliable capitalization and punctuation. Users can also talk about anything and everything on Twitter, so entities that were rarely or never mentioned before can suddenly become popular. All of these factors present huge challenges for general-purpose NER systems that were not designed for this type of text.

Fortunately, there is a good deal of academic research on how to make NER better for Twitter data. In fact, every year since 2015 the Workshop on Noisy User-Generated Text (W-NUT) has run a shared task for Twitter NER. A shared task is a competition in which all participants submit a program for a specific task, and the entries are scored and ranked on a common metric. So we already know which of the systems that participated is the best, but we don’t know how systems that didn’t compete would fare. And even the best system is of no use to us if we can’t get our hands on it. Unfortunately, none of the popular off-the-shelf NER tools participated in this shared task, and I have only been able to find one entry, the seventh-place finisher from 2016, currently available on the internet.

With this in mind, I decided to use the test data from the 2016 shared task to evaluate systems you can actually download and start using today, to see how well they perform on tweets. The general-purpose NER systems I selected are Stanford CoreNLP, spaCy, NLTK, MITIE and Polyglot. The two Twitter-specific systems I selected are OSU Twitter NLP Tools and TwitterNER (the seventh-place entry from 2016). Each system uses a slightly different set of entity types, so I mapped the types in each system’s output to just PERSON, LOCATION and ORGANIZATION, which are common to all of them, and ignored any types that didn’t map to these three.

The Stanford CoreNLP NER tool can be run with several options that could potentially improve accuracy on tweets. In particular, there is a part-of-speech (POS) tagger optimized for tweets; since part of speech is one of the features used for NER, improving the POS tagger should also improve NER accuracy. Additionally, there are two options for dealing with text that has inconsistent capitalization. Inconsistent capitalization is a big problem for NER systems because, at least in well-edited text, capitalization is one of the strongest clues that a word is part of a proper noun and therefore likely to be a named entity; systems trained only on well-edited text tend to rely on it too heavily when it is unreliable. The first option is to pre-process the text with a true-caser, which attempts to automatically restore the correct capitalization of the text. The second option is to use models that ignore case altogether.

Here are the precision, recall and F1 scores for these systems, sorted by F1 score from highest to lowest:

System Name | Precision | Recall | F1 Score
Stanford CoreNLP | 0.526600541 | 0.453416149 | 0.487275761
Stanford CoreNLP (with Twitter POS tagger) | 0.526600541 | 0.453416149 | 0.487275761
TwitterNER | 0.661496966 | 0.380822981 | 0.483370288
OSU NLP | 0.524096386 | 0.405279503 | 0.45709282
Stanford CoreNLP (with caseless models) | 0.547077922 | 0.392468944 | 0.457052441
Stanford CoreNLP (with truecasing) | 0.413084823 | 0.421583851 | 0.417291066
MITIE | 0.322916667 | 0.457298137 | 0.378534704
spaCy | 0.278140062 | 0.380822981 | 0.321481239
Polyglot | 0.273080661 | 0.327251553 | 0.297722055
NLTK | 0.149006623 | 0.331909938 | 0.205677171

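As a concrete illustration of the setup described above, here is a minimal sketch of the type mapping and scoring, using spaCy as the example system. The label-to-type mapping and the simplified exact-match scoring are my own illustration of the approach, not the actual evaluation code (real NER scoring matches annotated spans by position rather than comparing (text, type) pairs).

```python
# A minimal sketch of the entity-type mapping and scoring, using spaCy as
# the example system. The mapping choices below are illustrative; each
# system in the comparison needs its own mapping table.
import spacy

# spaCy's English models use OntoNotes-style labels (PERSON, ORG, GPE, LOC,
# FAC, DATE, ...); keep only the labels that map onto the three shared types.
LABEL_MAP = {
    "PERSON": "PERSON",
    "ORG": "ORGANIZATION",
    "GPE": "LOCATION",
    "LOC": "LOCATION",
    "FAC": "LOCATION",
}

nlp = spacy.load("en_core_web_sm")

def extract_entities(text):
    """Return (entity text, mapped type) pairs, ignoring all other labels."""
    return [(ent.text, LABEL_MAP[ent.label_])
            for ent in nlp(text).ents
            if ent.label_ in LABEL_MAP]

def score(gold, predicted):
    """Precision, recall and F1 over two collections of entities.

    Comparing (text, type) pairs is a simplification; the real evaluation
    matches annotated spans by position within each tweet.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```
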
Precision measures the fraction of the entities the system produced that were correct, whereas recall measures the fraction of the correct entities that the system was able to find. The F1 score is the harmonic mean of the two (F1 = 2 × precision × recall / (precision + recall)). Which of these numbers is most important to you will depend on how you plan to use NER. For example, if the output of the NER system is always reviewed by a human, you might prefer a high-recall/low-precision system over a low-recall/high-precision one: the reviewers can always toss out bad entities the system outputs, but if the system doesn’t report an entity at all, they will never see it. On the other hand, if something important happens automatically to every entity the system outputs, you might prefer the low-recall/high-precision system, so that the entities it does output are as likely as possible to be correct. All other things being equal, if you just want one number to look at, use the F1 score.

Out of the box, Stanford CoreNLP is the winner as measured by F1 score, though TwitterNER has much higher precision. It is interesting to note that none of the alternative configurations for Stanford CoreNLP resulted in any improvement. The improved POS tagger didn’t change the results at all for any of the entity types I examined (though it did change the results for some other entity types), indicating that POS tagging plays a relatively minor role. True-casing and caseless models made things even worse. My guess is that the true-caser creates more capitalization errors than it fixes, and the drop from the caseless models suggests that capitalization information, as unreliable as it is in tweets, is still useful overall. Given that they were designed explicitly for Twitter, it is somewhat surprising that TwitterNER and OSU Twitter NLP Tools did not get the highest F1 scores. But they were trained on a fairly small amount of data compared to the general-purpose systems, even if that data was a better match for this task.

One improvement that can easily be made to all the systems is to exclude detected entities that are @mentions (a sketch of this filtering follows the next table). @mentions do refer to accounts, which correspond to either a person or an organization, so it would be natural to categorize them as entities. However, they are not marked as entities in the test data, since they are easy to identify with nearly 100 percent accuracy using a regular expression, and account profile information is likely to be a better source than the tweet text for distinguishing between people and organizations. Here are the results for all systems with @mention entities excluded:

System Name | Precision | Recall | F1 Score
Stanford CoreNLP | 0.526838069 | 0.453416149 | 0.487377425
Stanford CoreNLP (with Twitter POS tagger) | 0.526838069 | 0.453416149 | 0.487377425
TwitterNER | 0.661496966 | 0.380822981 | 0.483370288
OSU NLP | 0.524096386 | 0.405279503 | 0.45709282
Stanford CoreNLP (with caseless models) | 0.547077922 | 0.392468944 | 0.457052441
Stanford CoreNLP (with truecasing) | 0.413084823 | 0.421583851 | 0.417291066
MITIE | 0.340364057 | 0.457298137 | 0.390260063
spaCy | 0.28426543 | 0.380822981 | 0.325535092
Polyglot | 0.273080661 | 0.327251553 | 0.297722055
NLTK | 0.149006623 | 0.331909938 | 0.205677171

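For reference, the @mention filtering can be done with a simple regular expression over the detected entities. Here is a minimal sketch, using the (text, type) convention from the earlier snippet; the pattern is an approximation of Twitter’s username rules:

```python
import re

# Twitter handles are (approximately) 1-15 characters of letters, digits
# and underscores following the "@".
MENTION_RE = re.compile(r"^@\w{1,15}$")

def drop_mentions(entities):
    """Remove any detected entity that is just an @mention."""
    return [(text, etype) for text, etype in entities
            if not MENTION_RE.match(text)]

# An NER system might tag "@nasa" as an ORGANIZATION, but the test data
# does not count @mentions as entities, so we drop it.
print(drop_mentions([("@nasa", "ORGANIZATION"), ("Houston", "LOCATION")]))
# [('Houston', 'LOCATION')]
```
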
The improvement is only significant for MITIE and spaCy, but, as expected, no scores went down, so it’s still worth doing.

Since TwitterNER can be easily retrained, let’s see if we can make it better. The W-NUT Twitter NER shared task includes a set of training data that all participants are required to use; using any additional training data is considered cheating. From a research perspective this is a really good idea, because it means the winner won because it had the best algorithm, not just because it used the most training data. But if you want the best system, you want to throw as much training data at it as you can. Fortunately, there are at least three more sets of tweets annotated for named entities available on the internet:

  • the Finin dataset, whose annotations were crowdsourced
  • the Hege dataset
  • the W-NUT 2017 shared task data

One challenge with using data from other sources is that the formatting is not always consistent, so you have to be careful. I cleaned up the following issues in this data (a sketch of these cleanups follows the list):

  • The W-NUT 2017 data incorrectly splits hashtags and @mentions into two tokens (e.g., “@” and “username” rather than “@username”). I rejoined them.
  • All three of these sources annotate @mentions as Person entities. I removed the Person entity annotations for all @mentions.
  • All URLs and numbers are replaced with “URL” and “NUMBER,” respectively. This reduces data sparsity without sacrificing much information, since for NER it usually doesn’t matter exactly what the URL or number is, and there are effectively infinitely many of them. However, TwitterNER has specialized features for numbers and URLs that expect numbers to look like numbers and URLs to look like URLs, so I replaced every “NUMBER” token with “1” and every “URL” token with “http://url.com.”
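
Here is a minimal sketch of these cleanups, assuming the source files have been read into per-tweet lists of (token, tag) pairs in CoNLL/BIO style. The exact tag strings (“B-person,” “O,” and so on) vary across the three sources, so the ones used here are illustrative:

```python
# Token-level cleanups for the extra training data, operating on one tweet
# at a time as a list of (token, tag) pairs. Tag names are illustrative.

MENTION_PREFIXES = ("@", "#")

def rejoin_split_mentions(pairs):
    """Merge a lone '@' or '#' token with the token that follows it."""
    out, i = [], 0
    while i < len(pairs):
        token, tag = pairs[i]
        if token in MENTION_PREFIXES and i + 1 < len(pairs):
            token += pairs[i + 1][0]  # e.g. "@" + "username" -> "@username"
            i += 1
        out.append((token, tag))
        i += 1
    return out

def clean(pairs):
    cleaned = []
    for token, tag in rejoin_split_mentions(pairs):
        if token.startswith("@") and "person" in tag.lower():
            tag = "O"                 # @mentions are not entities in the test data
        if token == "NUMBER":
            token = "1"               # make numbers look like numbers again
        elif token == "URL":
            token = "http://url.com"  # ...and URLs look like URLs
        cleaned.append((token, tag))
    return cleaned

print(clean([("@", "B-person"), ("username", "I-person"),
             ("visited", "O"), ("URL", "O")]))
# [('@username', 'O'), ('visited', 'O'), ('http://url.com', 'O')]
```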

Here are the results when you train TwitterNER with this data in addition to the shared task training data:

System Name | Precision | Recall | F1 Score
TwitterNER (with Hege training data) | 0.657213317 | 0.413819876 | 0.507860886
TwitterNER (with W-NUT 2017 training data) | 0.675307842 | 0.404503106 | 0.505948046
TwitterNER (with Finin training data) | 0.598086124 | 0.388198758 | 0.470809793

After adding either the Hege or the W-NUT 2017 data, TwitterNER now has the highest F1 score of all the systems. Adding the Finin data, however, actually decreases the F1 score, likely because the Finin annotations are of lower quality: they were crowdsourced, rather than produced by a small number of well-trained annotators as the other datasets were. If we combine just the W-NUT 2017 and Hege data, we get a small but measurable additional improvement:

System Name | Precision | Recall | F1 Score
TwitterNER (with W-NUT 2017 and Hege training data) | 0.652276759 | 0.42818323 | 0.51699086

So for most use cases, TwitterNER with this extra training data is the best NER system to use for Twitter, since its F1 score and precision are the highest. In particular, if you need a high-precision system, it is significantly better than any of the other options. However, its recall is still a bit lower than Stanford CoreNLP’s, so if recall is especially important to you, you might want to stick with CoreNLP. The source code for this evaluation is available here. Like a lot of academic software, TwitterNER takes quite a bit of time and expertise to get up and running, so I created an easier-to-use version, bundled with the best model (trained with the added W-NUT 2017 and Hege data), here.
