Not All Insights Are Equal: Speech-to-Text vs. Phonetics-Based Spotting

By: Shorit Ghosh

September 22, 2021


When it comes to speech analytics, it is important to understand the method behind the transcription process to ensure you are extracting the best insights possible. There are two main methods used to extract insights from calls coming through the contact center: phonetics-based and speech-to-text.

Phonetic transcription uses the International Phonetic Alphabet (IPA) to transcribe spoken words. The IPA is clunky and indecipherable for the uninitiated. Most legacy speech analytics tools rely on phonetic transcriptions to capture how the words sound in order to match those sounds with an index file of target words. These tools look at speech and text data differently and fail to provide a consistent and accurate view of the true voice of the customer.
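To make the sound-matching idea concrete, here is a minimal sketch of how a target word can be spotted in a stream of phonemes. It is an illustration only, not how any particular vendor's engine works: the tiny lexicon, the ARPAbet-style symbols (used here as a plain-ASCII stand-in for the IPA), and the function names `to_phonemes` and `spot` are all assumptions made for this example.

```python
# Toy pronunciation lexicon using ARPAbet-style phoneme symbols.
# The entries are illustrative assumptions, not a real vendor index file.
LEXICON = {
    "my": ["M", "AY"],
    "ability": ["AH", "B", "IH", "L", "AH", "T", "IY"],
    "to": ["T", "UW"],
    "pay": ["P", "EY"],
    "bill": ["B", "IH", "L"],
}


def to_phonemes(words):
    """Flatten a word sequence into one phoneme stream with no word boundaries."""
    stream = []
    for word in words:
        stream.extend(LEXICON[word])
    return stream


def spot(stream, target):
    """Return every start offset where the target's phonemes occur in the stream."""
    pattern = LEXICON[target]
    n = len(pattern)
    return [i for i in range(len(stream) - n + 1) if stream[i:i + n] == pattern]


utterance = ["my", "ability", "to", "pay"]
stream = to_phonemes(utterance)

# Sound-based search: the phonemes of "bill" also occur inside "ability".
phonetic_hits = spot(stream, "bill")

# Word-based search on the same utterance: "bill" was never said.
word_hits = [i for i, word in enumerate(utterance) if word == "bill"]

print("phonetic hits:", phonetic_hits)
print("word hits:", word_hits)
```

In this toy case the sound-based search reports a hit for "bill" even though the caller never said it, while the word-based search reports none. That is one simple way a match on sounds alone can produce a false positive.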

In addition, looking for meaning in a call based on the pattern of sounds is expensive and unsatisfactory: expensive because audio files are large and thus costly to store, and unsatisfactory due to lower accuracy and a higher rate of false positives.

The audio file size also impacts speed due to indexing inefficiency, which can become even more expensive depending on the transcription vendor being used.
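For a rough sense of scale, the sketch below compares the size of one hour of retained audio with the size of a text transcript of the same hour. Every number in it is an assumption chosen for illustration (uncompressed 8 kHz, 16-bit mono audio; about 150 spoken words per minute; about 6 bytes per word). Real recordings are often compressed, which would shrink the audio figure.

```python
# Assumed recording format: 8 kHz, 16-bit, mono, uncompressed PCM.
SAMPLE_RATE_HZ = 8_000
BYTES_PER_SAMPLE = 2

# Assumed speech and text characteristics.
WORDS_PER_MINUTE = 150
BYTES_PER_WORD = 6  # roughly five letters plus a space

SECONDS_PER_HOUR = 3600
MINUTES_PER_HOUR = 60


def audio_bytes_per_hour():
    """Bytes needed to keep one hour of audio under the assumed format."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * SECONDS_PER_HOUR


def transcript_bytes_per_hour():
    """Bytes needed to keep a plain-text transcript of one hour of speech."""
    return WORDS_PER_MINUTE * MINUTES_PER_HOUR * BYTES_PER_WORD


audio = audio_bytes_per_hour()
text = transcript_bytes_per_hour()

print(f"audio: {audio / 1e6:.1f} MB per hour")
print(f"text:  {text / 1e3:.1f} KB per hour")
print(f"ratio: about {audio / text:.0f}x")
```

Under these assumptions an hour of audio takes roughly a thousand times more space than its transcript. The exact ratio will vary with the codec and the speaking rate, so treat it as an order-of-magnitude estimate rather than a measurement.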

Speech-to-Text As A Better Option

Another option for businesses is to use a speech-to-text approach to transcription. Speech-to-text avoids the complexity of phonetic transcription by capturing words based upon the sounds the algorithm “hears.”

Take a look at the table below to uncover some of the advantages and disadvantages of each transcription approach:

Phonetic Approach vs. Speech-To-Text Approach

Speed and Velocity

Phonetic approach: Typically faster at processing audio into an engine. However, the search process itself is much slower, since phonemes cannot be indexed as efficiently as whole words. Speed to insight is also slower because of false positives.

Speech-to-text approach: Transcribes faster and at a higher quality. This approach can transcribe up to 1,000 audio-hours of speech in just 1 hour. Faster, better output means more reliable analytics and insights.

Precision and Accuracy

Phonetic approach: Has lower precision. False positives have to be manually filtered out for accurate topic detection, adding days to analysis requests.

Speech-to-text approach: The search process is faster and more accurate. False positives can be surfaced much faster and removed. It also translates audio recordings into searchable text.

Identifying Emerging Trends

Phonetic approach: Can be used to search for “known” terms but will not enable automatic discovery of emerging trends (you have to know what to search for).

Speech-to-text approach: No pre-knowledge of what you are searching for is required. Themes will automatically emerge.

Context and Depth

Phonetic approach: Phonetic queries are only able to spot high-level keywords and miss out on understanding the intent and meaning of what is being said. This results in an inability to map to sophisticated topic models that support drill-downs for deeper insights.

Speech-to-text approach: An interaction is not always about a single topic. This approach analyzes every phrase of the transcribed text to identify all topics and subtopics within a single interaction. This granularity supports root cause analysis and drives actionability.

Complexity and Flexibility

Phonetic approach: It takes months to create a handful of phonetic queries. These follow a complex format and cannot be modified by a business user. This means users cannot easily change their queries to spot emerging trends or respond to changing business needs.

Speech-to-text approach: Our models can be easily customized using a business user-friendly interface. This makes it much easier to customize the topics for changing business needs.

Storage Footprint and Cost

Phonetic approach: Phonetic queries need high-quality audio to be retained for analysis. Vendors charge a hefty amount to retain audio beyond 60 days. Historical analysis becomes expensive and trend analysis is effectively non-existent.

Speech-to-text approach: The total cost of ownership is much lower, because there is no limit to the amount of text data that can be stored. This also makes it easy to analyze trends over time and perform historical analysis.

Operational Burden

Phonetic approach: Phonetic queries cannot be reused for text data from emails, chats, social media and other sources. Separate queries need to be created for all the other sources of feedback. These queries then run on separate databases.

Speech-to-text approach: Can leverage the same taxonomy/framework across multiple text-based sources including surveys, chats, emails, direct messages, and complaints.
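The last row of the comparison, reusing one taxonomy across every text-based source, can be sketched in a few lines. This is a deliberately simple keyword matcher, not the topic models described in this article. The `TAXONOMY` contents, the `tag_topics` function, and the sample interactions are all hypothetical and exist only to show the idea that a call transcript, a chat, and an email can pass through the same rules.

```python
import re

# Hypothetical topic taxonomy: topic name -> trigger phrases (lowercase).
TAXONOMY = {
    "billing": ["bill", "charge", "refund"],
    "cancellation": ["cancel", "close my account"],
    "login": ["password", "log in", "locked out"],
}


def tag_topics(text):
    """Return the sorted topics whose trigger phrases appear as whole words."""
    lowered = text.lower()
    found = set()
    for topic, phrases in TAXONOMY.items():
        for phrase in phrases:
            # \b keeps "bill" from matching inside longer words.
            if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
                found.add(topic)
                break
    return sorted(found)


# Made-up interactions from three different channels.
interactions = [
    ("call transcript", "There is a wrong charge on my bill and I want to cancel."),
    ("chat", "I'm locked out and need a new password"),
    ("email", "Please refund the duplicate payment."),
]

# The same taxonomy is applied to every source, with no per-channel queries.
results = {source: tag_topics(text) for source, text in interactions}

for source, topics in results.items():
    print(f"{source}: {topics}")
```

Because the transcript is ordinary text, nothing in the matcher needs to know which channel an interaction came from. Note that a single interaction can also carry more than one topic, as the call transcript does here.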

What Our Customers Say About the Clarabridge Transcription Difference:

“Clarabridge transcripts are easily legible, and topic recall was 20% higher on average when we compared them with other speech analytics tool.” VP of Contact Center, Major Healthcare Enterprise

“Clarabridge helped us get a unified view of both our contact centers and CX data. We had no idea we could easily spot the topics that are making customer switch channels from website to contact centers, racking up costs. We were able to quickly introduce more self-service options because of this, saving us millions.” VP of Customer Insights, Large Retail Company

Context Matters

Transcribing an entire call into text with a speech-to-text approach produces higher accuracy at a faster speed. While old-school transcription simply matches phonemes – the indivisible units of sound in a language – to dictionary words, best-in-class transcription engines understand the grammatical context of words and phrases. Those leading speech analytics solutions also apply rules and machine learning-enhanced Natural Language Understanding to map voice of the customer data to industry-tuned topic models and derive additional insights around sentiment, emotion, effort, and intent.

While the phonetic approach can identify key words and phrases, the speech-to-text approach translates audio recordings into text that is actually searchable. Precision also depends on the approach taken. The phonetic approach has lower precision, and false positives must be manually filtered out for accurate topic detection, adding days to analysis requests. The speech-to-text approach has a much higher precision rate, because each transcribed word is highly likely to exist in the dictionary, and that shortens the path from insight to action.
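Precision here means the share of flagged calls that were really about the topic: true positives divided by all flagged calls. The short sketch below shows the arithmetic. The counts are invented purely to illustrate the calculation and are not measurements of any product; `precision` is a helper name chosen for this example.

```python
def precision(true_positives, false_positives):
    """Fraction of flagged calls that were actually about the topic."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("no calls were flagged")
    return true_positives / flagged


# Hypothetical review of 200 flagged calls per approach (not measured data).
phonetic = precision(true_positives=120, false_positives=80)
speech_to_text = precision(true_positives=180, false_positives=20)

print(f"phonetic: {phonetic:.0%}")
print(f"speech-to-text: {speech_to_text:.0%}")
```

With these made-up counts, an analyst reviewing the first result set would have to discard 80 calls by hand, against 20 in the second. That manual filtering is where the extra days of analysis time come from.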


About the Author:
Shorit Ghosh is the Vice President of North America Services at Clarabridge. Shorit manages a team of consulting managers, business consultants and technical architects to help his customers improve their own customer experience, increase revenue, and reduce cost and churn.


Clarabridge vs. Legacy Speech Analytics

Digital transformation has enabled modern contact centers to engage with customers across multiple channels that span beyond calls. With interactions happening over emails, chats, private messages, and social media, today’s omnichannel contact centers can now address more data than ever before.

We compared Clarabridge with Legacy Speech Analytics tools based on 7 criteria that cover omnichannel analytics, speed to insight, scalability, transcription quality, and more.
