Voice is an inflection point in the man-machine relationship. But it’s not perfect yet

Tyagarajan S February 7, 2017 10 min


The potential technology disruptions headed our way are exciting as hell, but they could also get quite confusing.

Welcome to edition one of “Factor Future” (#FactorFuture) —  our exploratory journey into the future-verse. Every week, we’ll telescope onto an interesting subject that is likely to shape our future, and put it under the scanner, sometimes with a cynical frown. Through this we hope to excite, educate and start a discussion.

This week’s topic: Voice platforms, or as I call it, ‘madly talking to things’.

Shouting at stuff

By 2018, nearly a third of our interactions with technology will border on lunacy  —  we’ll be shouting out, whispering to and cajoling machines to do things for us. Christened varyingly as the next operating system, platform, interface and sometimes just with the generic ‘big thing’ tag by the tech-media, conversing-with-things will likely become ubiquitous and more useful than smartphones in the near future.

The reasons are powerful:
— Voice makes access to computing power more ubiquitous — conversations can flow from our homes to our cars to our gyms.
— Voice democratises computing access — you don’t need to spell or type or even read to interact with a computer.
— Voice is our easiest gateway to natural language, which is itself a powerful, efficient way to convey our thoughts.
— Voice is much faster than thumbing into little screens — at least until we have Elon Musk’s handy neural lace to plug us into machines.

Everyone believes that voice platforms are going to be big but no one knows how big. There is a flurry of activity in the market around voice. But has voice actually crossed the tipping point to become the next great computing interface ready to revolutionize our world?

A is for Alexa

Which platform will win the war to capture our voices?

Alexa, the newly crowned empress of voice assistants, is now coming to conquer the world. This friendly personality, squatting inside Amazon Echo’s functional cylinder, is leading an aggressive charge into the lives of American citizens from their car (Alexa, ask my FordMobile to lock my car!) to kitchen (LG, Whirlpool and others are adding support to their new appliances) to the living room.

There is a boss-battle of sorts happening for our voice-share between Alexa (Amazon Echo), Google (Google Home), Siri (Apple iPhone and HomeKit) and Cortana (Microsoft’s software and enterprise apps). Facebook’s intelligent voice-based assistant, M, is being tested and still not really competing in the real-world.

And Alexa is winning it.

By some estimates, Amazon has sold anywhere between 11 and 18 million Amazon Echos to date (it is already in about 6% of American homes). Google Home is languishing at about a million units. Apple’s snoozing — perfecting within its closed ecosystem. And Microsoft — well, they have the enterprise to save them.


But the many, glorious victory pronouncements for Alexa are premature. Google and Apple have the advantage of an OS in your pocket  —  they can integrate better with your daily life. Google also has access to an incredible amount of public and personal data. All of them are working hard on their AI but Google has made visibly spectacular progress.

Meanwhile, the field is getting crowded. Recently, Lenovo unveiled an “Alexa clone” that sounds even better, and Nvidia put out a little button of a thing, called Spot, that plugs directly into the outlet and waits for your voice. Then there’s LingLong DingDong — the square-bottomed, circular-topped Alexa rival that speaks Mandarin and Cantonese and aspires to hit the motherlode in the next big goldrush: the $22.8 billion Chinese smarthome market.

DingDong has a square bottom and a round top — a design meant to represent the idea that heaven is round and earth is square

Plenty of bets are being placed on the immediate cash pile from selling voice-platform hardware and software.

“You have 5000 results, should I read them out one by one?”

Speech computing isn’t radically new. There have been attempts to get machines to listen and respond to us since Audrey, back in 1952. Audrey could recognize ten spoken digits (only in male voices, and with high sensitivity to how they were said).

Audrey was developed by Bell Laboratories in the early 1950s. It was a pretty basic system that could only recognize spoken digits.

IBM experimented with machines that did arithmetic based on voice commands in the 1960s, a talking typewriter in the ‘80s and a talking web browser in the ‘90s.

But it’s only in the last few years that voice platforms have exploded onto the consumer scene.

The sudden exuberance can perhaps be attributed to rapid advances in speech recognition (we have upwards of 95% accuracy), speech synthesis, deep learning and cloud computing, along with several decades’ worth of data coming online. According to Andrew Ng, Chief Scientist at Baidu, 99% speech recognition accuracy could be the game changer. While it will be harder to achieve, it’s already within our sights.

Baidu’s (“Google of China”) Deep Speech 2 has 96% accuracy and recognizes spoken words better than most humans.
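A note on what those accuracy figures mean: speech recognition quality is usually reported as word error rate (WER) — the word-level edit distance between what was said and what was transcribed, divided by the length of the reference. A minimal sketch of the computation (the sample utterances are invented for illustration):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words,
    computed here with a standard word-level edit-distance table."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One wrong word out of five: WER of 0.2, i.e. "80% accuracy" on this utterance
print(word_error_rate("turn on the kitchen lights", "turn on the kitten lights"))
```

A 95%-accurate recognizer, in these terms, still garbles roughly one word in twenty — which is why that last few percent matters so much for conversation.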

But the design challenges are bigger. In design and UX terms, today’s voice platforms are little more than fancy toys.

For all the hoopla about voice being a natural interface, talking to robots is plain weird for those of us trained to interact with machines in formal, syntactic ways (our Google queries are simple commands, not conversations). Merely adding a voice doesn’t make it easy to slip into smalltalk with algorithms. Over time, we’ll be trained (much like these machines) in how to get chatty with the voice in the air.

Yet, it’ll be a while before we can do that in public  —  70% of voice interactions happen inside homes or cars today. This is the insight that’s propelled Alexa to success even though Apple had Siri on the iPhone back in 2011.

Alexa’s popularity can be partly explained by the fact that Amazon cracked the way consumers want to use voice — in the privacy of their homes and cars — and brought the right design for it.

It is also impossible to have a meaningful back-and-forth conversation with most intelligent voice assistants today. And they aren’t truly invisible: they have to be summoned by keywords (or names) before they respond. While having a name and an identity can endear these tools to us initially, it gets jarring and unintuitive fast — calling out ‘Alexa’ a hundred times a day is exhausting.
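The summoning mechanic is simple to sketch. A real device runs a dedicated low-power model on raw audio, but the gating logic amounts to this: anything not prefixed by the wake word is discarded, which is exactly why the name has to be repeated for every single request. A toy, text-only illustration (the wake word and phrases are hypothetical):

```python
WAKE_WORD = "alexa"  # hypothetical; real wake-word detection runs on audio, not text

def extract_command(transcript, wake_word=WAKE_WORD):
    """Return the command portion if the utterance starts with the wake word,
    otherwise None -- the assistant pretends it heard nothing."""
    words = transcript.lower().split()
    if words and words[0] == wake_word:
        return " ".join(words[1:])
    return None

print(extract_command("Alexa play some jazz"))  # "play some jazz"
print(extract_command("play some jazz"))        # None -- ignored without the name
```

Follow-up questions fail for the same reason: once the device stops listening, every new utterance needs the prefix again.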

Compared to a graphical user interface, where visual cues create more certainty and offer complex navigational routes, voice interfaces get tricky beyond simple, repetitive tasks. Users can articulate what they want in infinite ways (and what the platform says can also be interpreted differently by different users), causing ‘communication problems’. Moreover, navigating complex choices and information is hard in a one-dimensional medium like voice — we’d have to resort to nightmarish variants of the IVR tree hierarchy on steroids.

Today, applications on voice platforms don’t necessarily offer a great user experience, and only 3% of users return in the second week. To take the platform beyond its novelty factor, machine learning and AI have a huge role to play. There cannot be a UX for voice without a significant parallel growth in the intelligence of our virtual pets.

One-night stands with Android

Most killer AI minds spoke in dulcet tones to the humans they were busy destroying.

It is not a coincidence that movies featuring AI often have them speaking to us (often with ominous calm as they wreak havoc).

Right after visual cues (if it looks human, it’s probably human), voice is the biggest signal of personhood to us — recent research revealed that merely adding voice to a computer-generated script increases the likelihood of mistaking the text’s creator for a human.

It isn’t surprising that a species wired to anthropomorphize (‘loyal’ cars and ‘stupid’ bed edges) will get attached to its speaking machines despite their frustrating ineptitude. Many people are already comfortable with the idea of falling in love with these friendly voices, and some already do.

Imagine, then, a world where these voices get increasingly intelligent and sophisticated. The film Her’s portrayal of Theodore’s increasingly intimate relationship with Samantha, his virtual assistant, seems predestined as we move towards human-robot marriages becoming the norm.

Having a voice speak to us can be therapeutic and may even alter our perceptions of ourselves and the world. Alexa already lets you whine about your day and receive canned words of support and encouragement. For an increasingly self-involved, socially disconnected world, merely having a semi-intelligent voice speak to you could be enriching. No wonder, then, that Google is hiring writers from Pixar and The Onion to make its Home assistant sound smarter and wittier.

But the danger here is that we may be reinforcing stereotypes by giving these voices identities that cater to our existing biases.

It’s obvious that most voice assistants have female identities, and this reinforces sexist stereotypes, especially because we get to bark orders at submissive female-voiced assistants that flirt with us. And lonely men are increasingly using them as vehicles of sexual communication. As we progress into a world where machines assume human-like identities, we need to think deeper about these issues.

Big Brother gets ears

How do you feel about having a microphone listening in on everything you say at home, in your car, at the office and even while you are out jogging? In four years, 3.5 billion computing devices will come with microphones, and fewer than 5% will have keyboards.

Someone (or something) is constantly listening to your voice and the voices around you. Voice will also be increasingly recorded by default, to be searched, culled and perused later. Can law enforcement demand the recordings of these “always-on” devices? This isn’t a theoretical question — in a recent murder investigation in the US, police served Amazon a warrant for recordings because an Echo was at the scene of the crime.

In an increasingly connected set of platforms and devices, it also becomes difficult to track all the little breadcrumbs we leave about ourselves. An innocuous conversation with Google Home can be matched with social media behavior and recent emails to build some seriously detailed profiles — which could potentially be stolen or hacked.

Things could get more sinister. We’re already close to giving machines the ability to generate their own voices, thanks to Google’s WaveNet. These voices sound natural and can be produced on the fly (unlike the earlier approach of stitching together human-recorded speech). As machine intelligence improves, we will soon be unable to tell whether the voice on the other side is human. And that’s a scary thought even without the Singularity: hackers could hijack systems and generate fake conversations to phish or subvert, taking social engineering to whole new levels.

———————

Voice looks inevitable. Sometime in the future, children will be mystified to hear that there was a time when our cars and toasters didn’t talk to us. It will open up the next frontier in our relationship with machines.

But it’s a long way from the nascent first wave of voice applications to voice taking over our lives. Our hyper-optimism, driven by the fact that we’ve solved some major technical challenges around voice recognition, voice synthesis and AI, is out of sync with the big design and interaction gaps that remain. To solve them, voice needs to be redesigned from the ground up, rather than relying on the digital computing frameworks we’ve built on all these years.

Until then, we’ll be keeping tabs on it for you.


Disclosure: FactorDaily is owned by SourceCode Media, which counts Accel Partners, Blume Ventures and Vijay Shekhar Sharma among its investors. Accel Partners is an early investor in Flipkart. Vijay Shekhar Sharma is the founder of Paytm. None of FactorDaily’s investors have any influence on its reporting about India’s technology and startup ecosystem.