Speech-based technology companies have existed since the early 2000s, but with the proliferation of personal assistants such as Alexa, Siri, Google Home and Cortana, we’re now entering what’s being called a ‘voice-first’ era. Advances in deep learning-based AI techniques are yielding higher rates of accuracy, with tech giants leading the way. IBM and Microsoft have claimed error rates below 6% this year.
Researchers reckon that AI will be able to do the job of a telephone banking agent in the next 5-10 years. We reached out to India’s speech-based AI companies to see whether they are experiencing similar tailwinds, which business models have found product-market fit, and what uniquely Indian challenges they face.
The Mixed Language Problem
India has 122 major languages, 30 of which were spoken by more than a million native speakers, according to 2001 census data. While diversity can be seen as a strength in other contexts, it doesn’t help when training speech engines.
Machine learning-based speech recognition companies thrive in countries that have one or two languages, says C Mohan Ram, MD at LatticeBridge Infotech, a Chennai-based speech AI company, operational since 2002. One of the early pioneers in this space, it counts Indian Railways among its clients, besides banks and telecom companies.
Apart from 11 Indian languages, the company supports Arabic, which has helped it bag customers such as Etisalat, Etihad Airways, and Du.
However, mixed-language inputs are still an unsolved problem. When faced with a dialect that is based on multiple languages, these natural language systems with automatic learning abilities start guessing wrongly.
“How do you classify Hyderabadi? It is neither Hindi, nor Telugu, nor Urdu. It’s a mixture of all the three,” Ram says. Hyderabadi is a common reference to Dakhini or Dakkhani, an Urdu dialect with roots dating back seven centuries in the Deccan Plateau.
This multilingual input problem hampers these systems in use cases like IVR (Interactive Voice Response). It is one of the reasons why several of LatticeBridge’s deployments rely less on the natural language processing mode and more on the directed dialog mode, where the caller is led through a series of Yes/No type questions, he says.
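To make the directed dialog idea concrete, here is a minimal sketch in Python. It is not LatticeBridge’s implementation: the prompts, node names and accepted words are all hypothetical, and a real IVR would take its replies from a speech recogniser rather than from a list of strings. The point it illustrates is that each step only has to tell a handful of words apart, which is far easier than interpreting free-form, mixed-language speech.

```python
# Hypothetical directed-dialog tree for an enquiry line.
# Prompts, node names and vocabularies are illustrative assumptions only.
DIALOG = {
    "start": {
        "prompt": "Do you want to check a ticket status? Say yes or no.",
        "yes": "ticket_status",
        "no": "timetable",
    },
    "timetable": {
        "prompt": "Do you want train timings? Say yes or no.",
        "yes": "train_timings",
        "no": "agent",
    },
}

# A tiny closed vocabulary per answer (English plus Hindi, as an example).
YES_WORDS = {"yes", "haan"}
NO_WORDS = {"no", "nahin"}


def run_dialog(replies, tree=DIALOG, start="start"):
    """Walk the tree using one recognised reply per question.

    Returns the name of the node the caller ends up at. Anything
    outside the closed vocabulary, or a missing reply, is handed
    off to a human agent in this simplified sketch.
    """
    node = start
    replies = iter(replies)
    while node in tree:
        reply = next(replies, None)
        if reply is None:
            return "agent"
        word = reply.strip().lower()
        if word in YES_WORDS:
            node = tree[node]["yes"]
        elif word in NO_WORDS:
            node = tree[node]["no"]
        else:
            return "agent"
    return node


if __name__ == "__main__":
    print(run_dialog(["no", "haan"]))
    print(run_dialog(["something else entirely"]))
```

A production system would normally re-prompt before giving up; the immediate hand-off to an agent here just keeps the sketch short.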
“Even in our case where we’ve successfully deployed and shown very good live scenarios handling millions of calls, we’ve had to remove it (natural language processing) because of mixed languages. If you remove AI or learning, then the system doesn’t deliver on an optimal percentage of automation, and then the customer gets disappointed,” says Ram. LatticeBridge’s deep learning speech engines have “failed miserably” in Kannada especially, with 77% accuracy, he says, while in Hindi and Tamil, it is 98%.
Speech-based AI is context-sensitive, and handling that context remains a challenge, says Raja Manohar, founder and CEO of Hexolabs, an IIT Kanpur-based IT service company whose offerings include solutions that use intelligent speech recognition. “Most languages have several dialects that would have slightly different semantics. Take US English and Indian English for example. During my school days, we used to call eraser as rubber, ruler as scale. I am sure it would be called differently in other geographies. It would be a challenge to comprehend when we don’t know the origin, training of the end user,” he says. “We have a long way to reach 5-6% error rate in deployable mission-critical solutions.”
Not all forms of speech-based AI are the same, and they cannot be compared by the same yardstick, says Ram. He classifies speech AI systems into four categories:
- embedded systems such as those in a car, with a limited vocabulary of commands,
- desktop systems, which have a slightly larger vocabulary,
- dictaphone-type software where you have to train the system with your voice, and
- networked speech applications, which are designed to support millions of users.
“We (LatticeBridge) are positioned in the network speech application space and we are not doing the other three,” he says. “In this space, there are no limitations in terms of processing power, which means that Gaussian mixture models and Hidden Markov Models can be loaded, and intelligent deep learning algorithms can be linked there.”
Core speech-based AI research is expensive and time-consuming, and is best left to tech giants (such as Amazon, Microsoft, IBM, and Google) with deep pockets and access to big data, says Manohar. On the flip side, Indian firms could be ripe acquisition targets. “Indian speech AI research companies have a great opportunity to collaborate or get acquired by these companies as they have a niche expertise on the context-based speech solutions for this geography.”
Speech Analytics, Voice Biometrics
To understand some of the business models being unlocked by speech-based AI companies, we reached out to Uniphore Software Systems, an IIT Madras-incubated company. Founded in 2008, Uniphore raised a $6.8M Series B funding round in August this year. The company offers virtual assistants (akeira), voice biometrics (amVoice), and speech analytics (auMina) services in a SaaS-based model, and serves over 70 enterprise customers across the globe.
India and South East Asia are key markets for Uniphore, says Umesh Sachdev, co-founder and CEO, in an email interaction. “The US is just starting to respond to our products. We have recently set up an R&D facility in the US, considering North America as a potential market for us,” he says.
Sachdev cited the example of a non-banking financial company in India, which uses auMina, Uniphore’s speech analytics solution, to help improve credit collections. “The insights provided by our solution, such as Intent to Pay, Promise to Pay (P2P), collectors efficiency etc, has resulted in faster realisation of debt collections for the company,” Sachdev says.
As a use case for voice biometrics, he cited a multinational outsourcing company in the UK that adopted Uniphore’s voice biometrics solution after a series of frauds in its contact centre division. “This fraudulent activity not only resulted in monetary losses to the client as well as the service provider but also exposed them to legal scrutiny owing to the compromise of customer information,” he says. Using Uniphore’s auMina technology, the company required agents to log in to the ERP using voice biometrics, which took care of the problem. “In case of automation, you will be connected within seconds of calling the contact centres. Voice biometrics can authenticate a user, instead of going through routine questions on ‘date of birth’ to ‘mother’s maiden name’,” he explains.
Ram says that voice biometrics is a key focus area for LatticeBridge as well, and he set aside security concerns about AI that can mimic voices by offering a technical explanation. “The high frequencies of machine-synthesized output is very different from high frequencies from original speech. We have some technologies to beat that. Of course, somebody will learn this and try to build a high-frequency generator, like a human being. We have to keep the thief-and-police game on,” he says.
Better accuracy = Impending disruptions
As speech engines improve on their error rates, Mohan Ram sees the medical transcription industry being disrupted first. “The medical dictionary can be loaded to improve the recognition rate. It doesn’t matter which language it is – when you say neurosis, it is neurosis – no matter what the language. That will help in very quick adoption. A majority of the BPO, KPO guys will lose business because of this automation,” he says. According to a 2014 report, the medical transcription industry had an estimated market size of over $40 billion in 2012 and was expected to grow to reach $60 billion by 2019.
Rajiv Poddar, founder of Scribie, a US and Bengaluru-based transcription service, is sceptical of single-digit error rates such as those claimed by IBM and Microsoft. “Our niche is 99% accuracy. We cannot achieve that accuracy right now with AI. If you download and read that paper, it is actually for just one dataset, which is called Switchboard. It does not apply to all kinds of audio. If you feed it data, like this phone call, it won’t work. The accuracy will be something like 70-80%,” he says.
The company has built its own ASR (automatic speech recognition) engine, which is used as an assistive technology and produces transcripts with around 75% accuracy, he says. He was sceptical of robo-transcription services like Trint, which claim 95-99% accuracy.
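For readers wondering what these percentages measure: speech recognition error rates are usually reported as word error rate (WER), the number of word substitutions, deletions and insertions needed to turn the machine transcript into a human reference transcript, divided by the number of words in the reference. The sketch below is an illustrative Python implementation, not the scoring code used by any company mentioned here; the function name and the sample sentences are made up. "Accuracy" is often quoted loosely as one minus WER, which is a simplification because WER can exceed 100% when the machine inserts many extra words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up example: one wrong word and one dropped word
    # against a six-word reference.
    reference = "please check my account balance today"
    hypothesis = "please check my count balance"
    print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
```

Because the score depends entirely on the audio and the reference transcripts used, a figure measured on one benchmark says little about performance on a noisy phone call, which is the point Poddar makes above.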
In my personal experience, transcription AI services can be exceptionally poor at transcribing phone calls. I fed my call with Poddar into two web-based transcription services mentioned here, Trint and Spext. The former didn’t have any specific support for Indian accents, while the latter did. In any case, both were disastrously bad, with accuracy rates in the single digits. It didn’t help that the audio file was from a phone call.
That said, another robo-transcription service, Swiftscribe.ai, was able to save me a few hours of work when used on a recording of a seasoned public speaker like DJ Patil.
Ashutosh Trivedi, co-founder of Spext, an AI transcription startup based out of Bengaluru and San Francisco, insists that 99% accuracy is unlikely in the next five years. The startup provides automatic translation of an audio file in 50+ languages in a consumer SaaS product and uses IBM and Google’s voice AI engines for higher accuracy.
“Real-world accuracy is somewhere between 85 to 98%. Current technology is not going to take us to 99% in the next five years,” says Trivedi. “Even Geoff Hinton said recently – we have to start from scratch again. That’s the theory behind Spext; otherwise, we’d be irrelevant. It’s better to focus on increasing the productivity of human beings, and give them more power, and take some of the drudgeries out of their work.”
Updated at 9.45am on October 10, 2017 to make clear Hyderabadi is a common reference to Dakhini, the Urdu dialect commonly used in Hyderabad.
Disclosure: FactorDaily is owned by SourceCode Media, which counts Accel Partners, Blume Ventures and Vijay Shekhar Sharma among its investors. Accel Partners is an early investor in Flipkart. Vijay Shekhar Sharma is the founder of Paytm. None of FactorDaily’s investors have any influence on its reporting about India’s technology and startup ecosystem.