Debunking Natural Language Processing: Detecting Spam
February 26, 2019
For this week’s topic of spam, I want to remind you of this Monty Python sketch from 1970. It starts with customers trying to order breakfast in a diner. The waitress details what’s on the menu:
“Well, there’s egg and bacon,
Egg, sausage and bacon,
Egg and spam,
Egg, bacon and spam,
Egg, bacon, sausage and spam,
Spam, bacon, sausage and spam,
Spam, egg, spam, spam, bacon and spam,
Spam, sausage, spam, spam, spam, bacon, spam, tomato and spam,
Spam, spam, spam, egg and spam,
Spam, spam, spam, spam, spam, spam, baked beans, spam, spam, spam and spam.”
Ah, Monty Python at its finest. And, recently, an apt description of my email inbox and my voicemail. Spam, Hawaii’s favorite processed meat and Western culture’s favorite protein to mock, has an impressive résumé. Hormel offers 15 flavors of Spam and has sold over 8 billion cans across 44 countries since its introduction in 1937. But Spam’s influence isn’t just culinary; it has also had an understated influence on our technological world. Inspired by the Monty Python sketch, certain abusive users of Bulletin Board Systems and Multi-User Dungeons would repeat “spam” a massive number of times to scroll other users’ messages off the screen. Soon, the term became a moniker for the unwanted junk on the Internet that buries the content we actually want to see.
Spam content can be a real burden to those trying to find meaning in their customer feedback data. It can increase the effort needed for analysis and can cause misinterpretation of customer needs. In most cases, customer experience analysts want to analyze the customer’s organic voice, not the inorganic voice of automated bots or fake customers. Looking at the volume and sentiment of mentions of specific products or topics without regard to whether the content is spam can be very misleading.
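To make that last point concrete, here is a minimal sketch of how a few spam documents can distort both volume and sentiment for a product. The records, the sentiment scores and the `is_spam` flags are all invented for illustration; in practice they would come from whatever sentiment and spam detection tooling is in use.

```python
from statistics import mean

# Hypothetical feedback records: (text, sentiment score in [-1, 1], is_spam flag).
# All values are made up for this example.
feedback = [
    ("Love the new blender, works great", 0.8, False),
    ("Blender lid cracked after a week", -0.6, False),
    ("WIN a FREE blender!!! Click here", 0.9, True),
    ("Best blender deals, 90% off today only", 0.9, True),
    ("Blender is louder than I expected", -0.3, False),
]

def summarize(records, keyword="blender"):
    """Return (mention count, mean sentiment) for records mentioning the keyword."""
    mentions = [r for r in records if keyword in r[0].lower()]
    return len(mentions), round(mean(r[1] for r in mentions), 2)

# Same metric, computed with and without the spam documents.
all_volume, all_sentiment = summarize(feedback)
organic_volume, organic_sentiment = summarize([r for r in feedback if not r[2]])

print(f"All documents:  {all_volume} mentions, mean sentiment {all_sentiment}")
print(f"Organic only:   {organic_volume} mentions, mean sentiment {organic_sentiment}")
```

In this toy data, the spam inflates the mention count and flips the average sentiment from slightly negative to positive, which is exactly the kind of misreading described above.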
Not all spam, though, is created equal. By classifying the types of inorganic messages that appear frequently in customer feedback data, we can gain a better picture of how the market views a brand or product. The types of spam present in each dataset may vary. On social media, auto-generated messages such as news headlines, reviews and articles abound. In email sources, job requests or solicitations for corporate sponsorship may get in the way. Analyzing the language used in the inorganic messages has its own utility. When these spam documents mention a brand or product, they may reveal how customers and potential customers use, advertise and perceive that brand or product.
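As a rough illustration of tallying inorganic messages by type, the sketch below tags each message with the first type whose keyword pattern matches. The type names, patterns and messages are hypothetical, and keyword rules are a deliberately crude stand-in: they only show the idea of per-type counts, not how a trained classifier would decide.

```python
import re
from collections import Counter

# Hypothetical keyword patterns, invented for illustration only.
TYPE_PATTERNS = {
    "job_request": re.compile(r"\b(resume|hiring|job opening|apply)\b", re.I),
    "sponsorship": re.compile(r"\b(sponsor|sponsorship|donation)\b", re.I),
    "news_headline": re.compile(r"\b(breaking|reports|announces)\b", re.I),
}

def tag_type(text):
    """Return the first matching inorganic type, or a fallback label."""
    for label, pattern in TYPE_PATTERNS.items():
        if pattern.search(text):
            return label
    return "organic_or_unknown"

# Made-up messages mentioning a fictional brand.
messages = [
    "Acme announces record quarterly earnings",
    "Would Acme sponsor our charity 5k?",
    "Attached is my resume for the analyst role",
    "My Acme toaster stopped working after a month",
]

# Count how many messages fall into each type.
counts = Counter(tag_type(m) for m in messages)
for label, n in counts.items():
    print(label, n)
```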
A tool that blindly looks at words and phrases is limiting itself to parsing discrete words; a tool focused on understanding will tease out the organic versus inorganic messages and classify them into their associated types. By exposing these types to end users, such a tool empowers analysts to isolate the true feedback and determine meaningful insights. The Clarabridge proprietary Content Type Detection feature uses a machine learning algorithm to identify and tag spam content and then classify it into subtypes: advertisement, coupon, link or undefined. Users can choose either to purge spam documents on ingestion or to retain them for analysis. Analysts can also customize this feature by training the algorithm on their own data and with custom subtypes. With Clarabridge Content Type Detection, users can go beyond basic topic and sentiment analysis. Considering the integrity of a message adds a different dimension and a unique analytical lens for many stakeholders, one that would be missed or misinterpreted if all documents were viewed the same way or purely by their independent words. After all, “egg and spam” is not the same as “spam, spam, spam, egg and spam.” Bon appétit.