Clarabridge Blog

 

Dec 7

Written by: Justin Langseth
12/7/2009 3:11 PM

Trade-offs are a part of life. I, for instance, used to only eat half of the Oreos® in the cookie packs in our break room. I’d rather do that than put extra time on a treadmill. In classification, there is an inherent trade-off between Recall and Precision, and I’d like to provide some background on those terms and on why the trade-off occurs.

Recall and Precision are two of the key metrics in the text mining industry. Let me define each of them:

According to Wikipedia (http://en.wikipedia.org/wiki/Precision_and_recall), “Precision can be seen as a measure of exactness or fidelity, whereas Recall is a measure of completeness… Precision is defined as the number of relevant documents retrieved by a search divided by the total number of documents retrieved by that search, and Recall is defined as the number of relevant documents retrieved by a search divided by the total number of existing relevant documents (which should have been retrieved).”

When you are categorizing your customer feedback, you want to make sure that as many example expressions of a category are captured in that category as possible (high Recall), but you also want to ensure the things placed in that category are there correctly (high Precision). 
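To make the two definitions concrete, here is a minimal Python sketch. The function name `precision_recall` and the sample comment IDs are hypothetical, chosen only for illustration; the sketch assumes a category can be treated as a set of item IDs.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one category.

    retrieved: items the classifier placed in the category
    relevant:  items that truly belong in the category
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Guard against division by zero when either set is empty.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: comment IDs tagged "billing" vs. those truly about billing.
tagged = {1, 2, 3, 4}
truly_billing = {2, 3, 4, 5, 6, 7}
p, r = precision_recall(tagged, truly_billing)
print(f"precision={p:.2f} recall={r:.2f}")  # prints: precision=0.75 recall=0.50
```

Three of the four tagged comments are correct (Precision of 75%), but only three of the six billing comments were found (Recall of 50%).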

Which of the two metrics is more important depends on the downstream use case. Most users, when asked, will say that both are important, which is indeed often the case. Say your goal is to determine the root cause of a particular high-volume issue, where the true root cause can be teased out of potentially only a fraction of the occurrences. In that case, Precision may be more important than Recall, as you can get the answer without considering all the occurrences.

This is similar to using sampling in manual categorization or sentiment analysis. However, it could also be said that the root cause could probably be determined even if the data is not 100% precise, as the signal would still be visible through some amount of noise. And if you only considered 1% of the actual occurrences of the problem (low Recall), you might not be able to determine what the problem actually is, or you might not have known there was a problem at all.

 

 

In the example above, you see some healthy foods and a circle indicating the “Fruit” category. On the left, the category has high (but not perfect) Precision of 90% as well as high (but again not perfect) Recall of 90%. On the right, the category has 100% Precision because everything in the circle is a fruit. However, its Recall is very low, only 20%, because most of the fruits didn’t get put into the category.
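The percentages in the fruit example can be reproduced with a few lines of Python. The counts below are assumptions inferred from the percentages and from the "9 fruits versus 2" comparison in this post: 10 fruits in total, a left circle holding 10 items of which 9 are fruits, and a right circle holding only 2 items, both fruits. The helper name `fruit_metrics` is hypothetical.

```python
# Assumed count, inferred from the percentages in the text.
TOTAL_FRUITS = 10


def fruit_metrics(fruits_in_circle, items_in_circle, total_fruits=TOTAL_FRUITS):
    """Return (precision, recall) for the "Fruit" circle."""
    precision = fruits_in_circle / items_in_circle  # how pure the circle is
    recall = fruits_in_circle / total_fruits        # how many fruits it caught
    return precision, recall


left = fruit_metrics(fruits_in_circle=9, items_in_circle=10)
right = fruit_metrics(fruits_in_circle=2, items_in_circle=2)

print(f"left:  precision={left[0]:.0%} recall={left[1]:.0%}")
print(f"right: precision={right[0]:.0%} recall={right[1]:.0%}")
```

Under those assumed counts, the left circle comes out at 90% for both metrics, while the right circle reaches 100% Precision at only 20% Recall.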

When discussing the accuracy of categorization, sentiment, or any other aspect of text mining, you need to measure both Precision and Recall. The word “accuracy” is often misused to refer only to Precision. If the pictures above represent your breakfast plate, though, in most cases you’d prefer to get 9 fruits on your plate instead of only 2. That assumes you are not allergic to vegetables. If you were, you’d prefer your plate to look like the right side: you would get less total food, but it would be 100% safe for you. Again, this is a trade-off that depends on how hungry you are and what you’re allergic to, and it will vary by person.

Back in the text mining world, if your goal is to get a reasonable and accurate count of how many customers are affected by a certain issue with respect to a particular store or product, then higher Recall is important, as you want to ensure as many occurrences of that problem are detected as possible. Of course, you also want reasonable Precision, as you don’t want to count things as problems that aren’t.
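To see why both metrics affect a count, note that the number of detected items times Precision gives the true positives, and dividing the true positives by Recall scales that up to all real occurrences. The sketch below is only an illustration of that arithmetic: the function name `estimate_true_count` and the numbers are hypothetical, and it assumes Precision and Recall have already been measured on a representative sample.

```python
def estimate_true_count(detected, precision, recall):
    """Estimate how many items truly belong to a category.

    detected * precision -> true positives among the detected items
    true positives / recall -> estimate of all real occurrences
    """
    if not 0 < recall <= 1:
        raise ValueError("recall must be in (0, 1]")
    if not 0 <= precision <= 1:
        raise ValueError("precision must be in [0, 1]")
    return detected * precision / recall


# Hypothetical numbers: 500 comments flagged for an issue,
# with Precision 0.9 and Recall 0.5 measured on a sample.
estimate = estimate_true_count(500, precision=0.9, recall=0.5)
print(f"estimated real occurrences: {estimate:.0f}")
```

With low Recall, the raw count of 500 would badly understate the issue; with low Precision, it would overstate it.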

So we believe that both high Precision and high Recall are potentially important to any particular use case, and you really can’t say one is inherently more important than the other. That is why you need to focus on both. If someone tells you that Precision is all that matters and not to worry about Recall, or just doesn’t mention Recall and focuses only on Precision, I would find that assertion questionable. In addition, anyone who talks about “accuracy” without separately discussing and acknowledging both Precision and Recall may be trying to hide a flaw in their process or technology that makes only one side of the equation possible to achieve.

 

