Why Having Realistic Speech Generation Matters in Customer Service

Post by:

Tejas Shahasane

September 16, 2024

Picture this: You are in a hurry to report a lost credit card. With your heart racing, you call customer support. Instead of a friendly voice that would calm you down, you are met with a robotic voice that is completely unaware of your anxiety. It gives policy numbers when you are shouting, ‘Oh, just help me!’

This is a familiar situation wherever customer service is automated. But it doesn’t have to be this way. Customer service remains one of the most important aspects of any business because it is the point of interaction that defines the customer's overall experience. Today, most of that communication happens through a screen or a phone menu: instead of having an actual conversation, customers interact with a chatbot or navigate an automated call menu. That makes the tools we use to communicate with customers more valuable than ever.

Companies that automate front-line customer support usually rely on two technologies: ASR (automatic speech recognition) and TTS (text-to-speech). Integrating ASR and TTS is what makes it possible for a business to set up a customer support hotline: a number you can call, describe your issue to out loud, and hear a spoken answer from in real time.
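To make the ASR-plus-TTS loop concrete, here is a minimal sketch of one turn of such a hotline. Everything in it is an assumption made for illustration: the `VoiceBot` class, the three stage types, and the stub functions are hypothetical placeholders, not a real vendor API, and the "audio" is just text encoded as bytes so the example runs without any speech models.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures (assumptions, not a real API).
Recognizer = Callable[[bytes], str]    # ASR: caller audio -> text
Responder = Callable[[str], str]       # support logic: text -> reply text
Synthesizer = Callable[[str], bytes]   # TTS: reply text -> audio


@dataclass
class VoiceBot:
    recognize: Recognizer
    respond: Responder
    synthesize: Synthesizer

    def handle_turn(self, caller_audio: bytes) -> bytes:
        """Run one conversational turn: listen, decide, speak."""
        transcript = self.recognize(caller_audio)
        reply_text = self.respond(transcript)
        return self.synthesize(reply_text)


# Stub stages so the sketch runs end to end.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the audio is already text


def fake_responder(text: str) -> str:
    lowered = text.lower()
    if "lost" in lowered and "card" in lowered:
        return "I have blocked your card. A replacement is on its way."
    return "Could you tell me a little more about the issue?"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")  # pretend the text is audio


bot = VoiceBot(fake_asr, fake_responder, fake_tts)
reply_audio = bot.handle_turn(b"I lost my credit card")
print(reply_audio.decode("utf-8"))
```

In a real deployment each stub would be swapped for an actual ASR model, a dialogue system, and a TTS model. The point of the sketch is only that the TTS stage is the last thing the caller hears, which is why its voice quality shapes the whole experience.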

However, not all TTS systems are the same. The distinction between a voice that is monotone and robotic, and a voice that is conversational can greatly affect the level of customer satisfaction and the degree to which customers remain loyal to a brand.

The Audiobook Trap

Most current TTS systems, especially for Indic languages, are trained on audiobooks. On the surface, this choice makes a lot of sense. Audiobooks contain a large amount of clear, intelligible, and appropriately paced speech. They offer well-organized content that is easy to obtain, which makes them suitable for training TTS systems. But why exactly are TTS systems, especially those intended for Indic languages, trained on audiobooks in the first place?

The answer lies in the nature of audiobook data itself:

Availability and Quality of Data: Audiobook data is available in large quantities and is easy to obtain. Audiobooks are usually recorded by professional readers in studio conditions, so the sound quality is consistently good and largely free of background noise and other interference.
Structured and Well-Paced Content: Audiobook text is meant to be read aloud. It therefore comes with natural pauses, clear punctuation, and a steady pace, all of which matter when training a TTS system to produce natural and fluent speech.
Standardized Language Use: Audiobooks use formal language and largely avoid local idioms, slang, and regional accents. This standardization is useful when developing a reference model that must be easily comprehensible to a large number of people.
Resource Efficiency: Training a TTS system is computationally intensive. Audiobooks, with their long uninterrupted blocks of speech, make more efficient use of these resources.

Although audiobook data gave early TTS systems a good starting point, it also has drawbacks. Audiobook speech is formal, rehearsed, and not flexible enough to meet the conversational needs of customer service interactions.

The Real Conversation Gap

Here's the reality: audiobook data has proved useful as a starting point, but it falls short when it comes to mimicking natural, real-time human conversation. Conversational data is harder to acquire and analyze, but it reflects how people actually communicate far better than other forms of data. It captures the pauses, changes in tone, and colloquialisms that make conversations engaging and realistic.

That's why at Gan.AI, we've developed myna-mini, an AI text-to-speech API trained on conversational data. This model represents a significant leap forward in our ability to generate speech that feels genuinely human, which makes it well suited to customer service scenarios where empathy and responsiveness are key.

The Limitations of Audiobook Data in Customer Service

The reliance on audiobook data presents several significant limitations, particularly in the context of customer service:

Lack of Spontaneity: Audiobook data is scripted and rehearsed, which makes it very polished but not at all flexible. As a result, audiobook-trained TTS systems struggle to adapt to the unpredictability of real-life conversations.
Impersonal and Formal Tone: Audiobooks are meant to be well articulated and easy to understand, but this clarity comes at the cost of warmth: the listener rarely feels compassion from the narrator. In customer service, a tone that is too formal or even robotic leaves customers feeling like they are speaking to a computer rather than a representative.
Inability to Handle Colloquialisms and Slang: Audiobooks stick to formal language and avoid informal speech, idioms, slang, and regional accents. Customer service interactions, however, are full of colloquial language, idioms, and other linguistic quirks.
Challenges with Emotional Context: Current audiobook-trained TTS systems are not optimized to capture the emotional side of a conversation. They can deliver correct information to the customer, but they cannot vary their pitch or speed according to the mood of the customer.

The Role of Code-Mixing in Indian Conversations

Code-mixing is not an idiosyncrasy; it is an essential aspect of interaction in India. In every kind of conversation, from financial transactions to customer service to daily life, it is common to find speakers switching between languages to convey certain ideas. For instance, even when the rest of the conversation is in Hindi, Tamil, or another regional language, one is likely to hear English terms such as account, payment, or refund.

This linguistic blending poses a problem for TTS systems, because it is difficult to handle two languages correctly within a single sentence. A system that does not accommodate code-mixing is not only out of touch with the real world; it simply does not work. For example, if a customer is used to saying “Mujhe refund chahiye” (“I want a refund”) but the TTS system does not handle or reproduce the English word “refund” properly, the customer is left confused and frustrated.
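As a rough illustration of why code-mixed text is awkward to process, the sketch below splits a sentence into runs of Devanagari and Latin script. It is a toy heuristic, not a description of how myna-mini or any production TTS system works, and the function names and the `hi`/`en` labels are made up for this example. It also shows a limitation: romanized Hindi such as “Mujhe refund chahiye” would be tagged entirely as Latin script, so a real system needs proper language identification rather than a script check.

```python
import re
from itertools import groupby

# Unicode Devanagari block; Latin letters are treated as English here,
# which is a simplifying assumption.
DEVANAGARI = re.compile(r"[\u0900-\u097F]")
LATIN = re.compile(r"[A-Za-z]")


def script_of(token: str) -> str:
    """Label a whitespace-separated token by the script it contains."""
    if DEVANAGARI.search(token):
        return "hi"
    if LATIN.search(token):
        return "en"
    return "other"


def split_by_script(text: str) -> list[tuple[str, str]]:
    """Group consecutive tokens that share a script into one span."""
    tokens = text.split()
    return [
        (script, " ".join(group))
        for script, group in groupby(tokens, key=script_of)
    ]


for script, span in split_by_script("मुझे refund चाहिए"):
    print(script, span)
```

Each span could then, in principle, be routed to pronunciation rules for its language. Doing that smoothly inside one sentence, with consistent voice and prosody, is the hard part that the script check does not address.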

This is where myna-mini truly shines. By training on conversational data that includes a variety of code-mixed examples, myna-mini is adept at handling the fluidity of language that is so characteristic of Indian conversations. It ensures that interactions feel natural and intuitive, regardless of the linguistic blend in play.

Gan.AI: Pioneering Conversational TTS

We are aware of the shortcomings of TTS systems trained on audiobooks, and we are determined not to be limited by them. In our approach, we emphasize the richness of conversational data: the raw, unstructured data that reflects how people actually talk. This data is more difficult to gather and analyze, yet it is far more useful for building TTS systems that can deliver natural-sounding conversations.

Our TTS model, myna-mini, was developed to support the complexity of natural language dialogues. It is designed to work with different dialects, tones, pacing, and, most importantly, code-mixing.

Real-World Impact of Natural Speech

The implications of natural-sounding, realistic speech technology extend far beyond just improving customer service calls. Let's explore some potential applications:

Education

Imagine a virtual tutor explaining complex concepts in a mix of English and the student's native language:

"So, beta, quadratic equation को solve करने के लिए, pehle हमें standard form में लाना होगा: ax² + bx + c = 0. Samajh में आया?"

(Translation: "So, child, to solve a quadratic equation, we first need to bring it to the standard form: ax² + bx + c = 0. Did you understand?")

This natural language mixing can make learning more accessible and relatable for students across India.

Healthcare

In rural areas with limited access to healthcare professionals, natural-sounding speech systems could provide crucial medical information in local dialects:

"आपको diabetes है, इसलिए रोज़ाना blood sugar check करना बहुत ज़रूरी है। समझे ना?"

(Translation: "You have diabetes, so it's very important to check your blood sugar daily. Understood?")

This approach ensures that vital health information is communicated clearly and in a familiar style.

Financial Inclusion

For those new to banking, a conversational system could explain complex financial concepts using relatable examples:

"FD यानी Fixed Deposit ऐसा होता है जैसे आप अपने पैसे को एक safe में lock कर दें। जब तक वो lock है, उस पर interest मिलता रहेगा।"

(Translation: "An FD, or Fixed Deposit, is like locking your money in a safe. As long as it's locked, you'll keep earning interest on it.")

This kind of explanation, blending familiar concepts with financial terms, can help demystify banking for millions.

The Broader Implications of Realistic Speech Generation in Customer Service

When it comes to customer service, delivering a human-like experience is key—and that’s why realistic speech generation is so important. Whether it's a hospital, a school, or a bank, customers don’t just want information; they want to feel understood and connected. A robotic voice just doesn’t cut it anymore.

Imagine a healthcare provider using an automated system to remind patients about appointments or explain treatment plans. If the voice sounds cold and mechanical, patients might tune out or feel disconnected. But with realistic TTS, the voice sounds warm, empathetic, and natural—like a real person talking. This builds trust, increases comfort, and ensures patients fully understand their care. And when patients feel well-informed and supported, satisfaction goes up—so does loyalty.

The same goes for education. Students hearing realistic voices while learning, especially in online environments, stay engaged. A lively, natural voice explaining complex topics can make all the difference in comprehension and retention. This keeps students motivated and boosts the reputation of the educational institution—ultimately improving outcomes.

In banking, trust is everything. If a customer is navigating financial information through an automated system, a realistic voice makes the interaction feel professional yet personal. It reassures customers, leading to better engagement and a stronger relationship with the brand.

In short, realistic speech generation makes customer service more human. It’s not just about the message—it’s about how that message is delivered. And when it’s delivered in a natural, relatable way, the impact is powerful: better customer satisfaction, stronger trust, and ultimately, a healthier bottom line.

Wrapping up

In customer service, how you say something is just as important as what you say. Realistic, conversational speech generation is a necessity, especially in fields that deal with people and their grievances. It's about crafting interactions that feel genuine, that build trust, and that leave customers feeling valued and understood.

Gan.AI's dedication to developing TTS systems that are responsive, empathetic, and finely tuned to the unique linguistic environment of India is setting a new standard in the industry. With innovations like myna-mini, we are bridging the gap between machine-generated speech and human conversation, helping businesses not just meet but exceed their customers' expectations.