Debunking NLP: Named Entities
September 7, 2017
This blog post is part 3 of our Debunking Natural Language Processing (NLP) series. Throughout this series, Ellen will highlight several features that help Clarabridge users go beyond simple topic analysis. This series will show you how new types of analysis aren’t so farfetched after all!
When we think of grammar, some of us are still haunted by nightmarish flashbacks of diagramming sentences in middle school. A mess of lines and dashes, the diagrams were intended to help us understand how words were related to each other. As a self-proclaimed language nerd even as a teen, I loved diagramming sentences (thanks, Mr. Dopko!), but in reality the diagrams carried very little value when it came to communicating effectively with my peers. Beyond parts of speech, noun clauses, subjects and predicates lies a world of complex meaning embedded in words and phrases. Putting words together in the proper order is a skill we master as toddlers, but truly understanding what words mean requires a lifetime of focus. The same pattern is true of a Natural Language Processing (NLP) engine. The rules of English (or any language) are relatively straightforward to teach a machine. The subtleties of meaning, on the other hand, are fraught with challenges.
Take, for example, this sentence:
“There’s a red and green striped jaguar in my garage.”
How would you interpret it? This sentence may refer to a uniquely colored feline or a one-of-a-kind automobile. Both of those interpretations are valid but each would result in a tremendously different reaction from an audience. Let’s look at a slightly different version of this sentence:
“There’s a red and green striped Jaguar in my garage.”
This small orthographic change (capitalizing the J in “jaguar”) pushes the interpretation towards recognizing it as a brand name instead of a mammal.
Named Entity Recognition, or the ability to deduce whether a word, or sequence of words, is a proper noun, is a well-studied challenge in the language community. There are both linguistic and statistical approaches to tackling the problem. For example, we can teach the NLP engine to look for small clues in capitalization, the presence of suffixes (e.g., LLC, Inc.), changes in punctuation, and part-of-speech sequences. Machine learning systems can be trained to spot certain patterns as well. These strategies perform decently on well-formatted text. However, when these essential clues are missing (as we often see in social media data), it becomes nearly impossible to disambiguate meaning accurately.
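To make the linguistic clues concrete, here is a minimal Python sketch of the capitalization and suffix checks described above. It is an illustration only, not how Clarabridge's engine works: the function name find_candidate_entities and the tiny COMPANY_SUFFIXES list are hypothetical, and the function assumes it receives one sentence at a time.

```python
import re

# Hypothetical suffix list for illustration; a real lexicon would be far larger.
COMPANY_SUFFIXES = {"LLC", "Inc", "Corp", "Ltd"}


def find_candidate_entities(sentence):
    """Return (text, label) pairs for runs of capitalized words.

    The first word of the sentence is skipped because its capital letter
    is required by grammar and tells us nothing about proper nouns.
    """
    # Split into word tokens, keeping straight and curly apostrophes.
    tokens = re.findall(r"[A-Za-z][A-Za-z'’]*", sentence)

    runs = []      # completed runs of consecutive capitalized tokens
    current = []   # the run currently being built
    for index, token in enumerate(tokens):
        if index > 0 and token[0].isupper():
            current.append(token)
        else:
            if current:
                runs.append(current)
            current = []
    if current:
        runs.append(current)

    results = []
    for run in runs:
        # A trailing corporate suffix is a strong hint that this is a company.
        label = "company" if run[-1] in COMPANY_SUFFIXES else "unknown"
        results.append((" ".join(run), label))
    return results


print(find_candidate_entities("There’s a red and green striped Jaguar in my garage."))
# [('Jaguar', 'unknown')]
print(find_candidate_entities("There’s a red and green striped jaguar in my garage."))
# []
```

Notice that the lowercase version of the sentence returns nothing at all, which is exactly the social media problem: once the capital letter disappears, a rule like this has no clue left to work with. The sketch is also noisy in the other direction (it would flag the pronoun "I", for instance), which is why real systems layer part-of-speech and statistical evidence on top of cues like these.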
Unfortunately, spotting these terms within a sentence is only half the battle. There are many kinds of Named Entities. Within your dataset, you might see products, brands, companies, people, and locations. As you can imagine, being able to analyze any of these groups of terms would provide distinct value to customer experience analysis that goes beyond just analyzing the topics in your data. You probably already know the top ten topics in your survey data by heart. But could you tell your stakeholder the top ten products mentioned? Top ten employees? Celebrities? Cities? Brands? Looking at your data through the dimension of named entities opens up a whole new world of analysis in CX data.
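Once entities have been spotted and sorted into groups, the "top ten" questions above become a simple counting exercise. The sketch below assumes the hard part is already done and that mentions arrive as (name, entity_type) pairs; the function name top_entities_by_type and the sample mentions are made up for illustration.

```python
from collections import Counter, defaultdict


def top_entities_by_type(mentions, n=10):
    """Group (name, entity_type) pairs by type and return the n most frequent names in each."""
    counts = defaultdict(Counter)
    for name, entity_type in mentions:
        counts[entity_type][name] += 1
    return {entity_type: counter.most_common(n) for entity_type, counter in counts.items()}


# Hypothetical, hand-labeled mentions standing in for the output of an NER step.
sample_mentions = [
    ("eBay", "company"),
    ("PayPal", "company"),
    ("eBay", "company"),
    ("Miami", "location"),
]

print(top_entities_by_type(sample_mentions))
# {'company': [('eBay', 2), ('PayPal', 1)], 'location': [('Miami', 1)]}
```

The counting is trivial; the value comes entirely from having reliable entity types to group by in the first place.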
What I particularly love about using named entities to drive analysis is that it starts to expose unexpected truths. So many of the entities mentioned have nothing to do with your brand and would not have been noticed if you were only looking for your direct brands or competitors. For example, when analyzing data about the vintage soda Surge, we found a huge surge (pun intended) in mentions of eBay as Surge enthusiasts were trying to buy and sell bottles in live auctions. We worked with a major US airline that received significant backlash for switching the brand of coffee served on board. Another online retailer faced negativity when their online payment system was suddenly incompatible with PayPal. Our use case book is chock-full of these kinds of stories in which one company’s issues were actually based on customers’ biases toward (or preferences for) other brands or individuals that were completely unrelated to their own. If these companies had only looked at their known topics and their own list of products, they would have been completely blind to these insights from tangentially related organizations. In my C3 presentation in Miami this year, I described this idea as a network of brands and people related to your own. You stand to gain and to lose the most by those at the periphery of your network – not the ones closest to you.
While topic analysis is commonplace, and I’d dare to say required to be competitive in the customer experience analysis marketplace, the ability to analyze named entities is actually rare. In order to offer this kind of semantic analysis, a tool must have a very mature NLP foundation that serves up enough information at the part of speech and grammar levels to give a Named Entity Recognition module a fighting chance. Even rarer is the tool that can add on top of that the ability to sort named entities into different groups and disambiguate names from cities from brands.
Over the next two blog posts, I am going to continue the discussion of Named Entity Recognition by looking more closely at Clarabridge’s World Awareness functionality, which allows users to automatically identify products, brands, companies, people and more. Stay tuned!
To read the previous blog posts in this series, please visit:
Ellen Falci is Clarabridge’s NLP/Enrichment Product Manager. Follow Ellen on Twitter at @ellenfalci.