In November 2014, Amazon released the first smart speaker with a built-in digital assistant named Alexa. For those not familiar with it, this is simply a cloud-connected smart speaker to which a user can issue voice commands to do simple things like play music, get weather, traffic and news updates, and control other smart devices in the home. All this is done by initiating a conversation with the smart speaker by uttering a ‘wake’ word before the command. Say ‘Alexa, how is the weather today?’ and Alexa responds with the weather update. Here the word ‘Alexa’ is the wake word: the moment the speaker hears it, it actively listens for the next words and decodes what you are saying using cloud-based speech recognition systems. As of mid-2019, Amazon estimates that 30% of American and European homes have a smart speaker, up from 22% a year earlier. That is about 100 million Alexa devices in the market.
Not to be left behind, Google released its own virtual assistant, Google Assistant, in 2016. It was initially available on select smart speakers, but in 2019 Google made it available on over 1 billion Android phones around the world (talk about scale!). Other virtual assistant flavors include Apple’s Siri, available on all iPhones, Microsoft’s Cortana in Windows 10, and Samsung’s Bixby.
With these assistants available in smart speakers and phones, a user can interact with a computing device such as a phone or personal computer to access information and carry out tasks that would traditionally have required an input device such as a touch screen, mouse or keyboard. For example, instead of unlocking my phone and opening Google Maps to check traffic conditions to, say, Galleria Mall, all I need to do now is say to my phone ‘Hey Google, traffic to Galleria Mall?’ and the assistant answers back with something like ‘There is moderate traffic to Galleria Mall; from where you are, it should take you 7 minutes to get there’. I can also initiate a phone call by simply saying ‘Hey Google, call Thomas Sankara’ and the assistant will search for his number in my phone book and place the call without me touching the phone. On the appliances and electronics side, I no longer need to look for the TV remote to change channels; I can instead simply say ‘TV, change channel to BBC News’ and it’s done. This is a big improvement in many ways because:
- It’s much faster and involves fewer steps to get the same results, if not better ones
- It is more natural and intuitive than current interfaces, which often require some training, skill or even literacy to use
- I can do all this while my hands and eyes are occupied with something else. For example, if I’m driving, I can still use maps and make calls without looking at or touching the phone. As another example, I could ask the TV to change channels while I’m busy preparing a sandwich.
Other than accessing information from the Internet as in the examples above, voice-based assistants can also be used to control smart devices and appliances (which explains Samsung’s foray with Bixby), schedule or cancel meetings, and open apps on the phone, all by using voice commands.
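One way to picture what happens under the hood is a simple pipeline: detect the wake word, transcribe the rest of the utterance, then map the text to an intent and its parameters (often called slots). The sketch below is purely illustrative; the wake words, intent patterns and function names are my own assumptions, not any vendor’s actual API, and real assistants do this with machine-learned models rather than regular expressions.

```python
import re

# Hypothetical intent patterns: each maps a regex over the transcribed
# text (after the wake word) to a named intent with slot values.
INTENTS = [
    (re.compile(r"traffic to (?P<place>.+)"), "get_traffic"),
    (re.compile(r"call (?P<contact>.+)"), "make_call"),
    (re.compile(r"change channel to (?P<channel>.+)"), "change_channel"),
]

# Illustrative wake words; real devices detect these in audio on-device.
WAKE_WORDS = ("hey google", "alexa", "tv")

def parse_command(utterance: str):
    """Strip a leading wake word, then match the command to an intent."""
    text = utterance.strip().rstrip("?").lower()
    for wake in WAKE_WORDS:
        if text.startswith(wake):
            text = text[len(wake):].lstrip(" ,")
            break
    for pattern, intent in INTENTS:
        match = pattern.match(text)
        if match:
            return intent, match.groupdict()
    return "unknown", {}

print(parse_command("Hey Google, traffic to Galleria Mall?"))
# ('get_traffic', {'place': 'galleria mall'})
```

The returned intent and slots would then be handed to a backend service (a traffic API, the phone dialer, the TV) to actually fulfill the request.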
Why is this a big deal?
With the recent advances in artificial intelligence and machine learning, speech recognition systems have become pretty accurate at deciphering the words in human speech. Speech is a highly variable input, since everyone has a unique voice and accent and surrounding noise varies, so it was initially difficult to get computing systems to understand human speech. With the AI and machine learning advances of the last five years, however, this is now possible. Google Assistant and Alexa can now decipher the English speech and accent of Lemaiyan from Narok or Billy Ray Cyrus from Texas with near-equal accuracy.
The biggest leverage that voice has is that the AI systems powering these digital assistants are now being trained in various languages and dialects. As of mid-2019, Amazon’s Alexa supports seven languages overall: English, French, German, Italian, Japanese, Portuguese (Brazilian) and Spanish. Google Assistant, on the other hand, currently supports sixty languages, including Swahili, Telugu, Gujarati, Zulu, Mandarin and many more.
With more languages being added all the time, voice-based interaction with the Internet through mobile phones and smart speakers means that people who were previously locked out of the benefits of the Internet because they could not read and write will suddenly be able to access the limitless opportunities of being connected, in the comfort of their local language. It will soon be possible for everyone in the world to search for information on the Internet and to interact with mobile phone or computer apps, home appliances and electronics by simply speaking in their local language. This will be the most significant step in bridging the digital divide since the liberalization of telecommunications in the 1990s, and it can be leveraged to create a more equal society. The multiplier effect is mind-boggling if you think about it. A farmer in Eldoret will be able to seek markets for his produce, or even operate a herbicide-spraying drone, by issuing voice commands in his local language. A mother in rural Sri Lanka will be able to seek nutritional information for her child by speaking her local language to her phone’s digital assistant, and set reminders for hospital visits or school meetings, without needing to read or write English. A non-Greek speaker will also be able to participate seamlessly in conversations taking place in Greek by using the assistant to translate the conversation back and forth.
The popularity of voice-based interaction is also growing, with the touch screen slowly taking a backseat as the main user interface to the treasure trove that is the Internet and modern appliances and electronics. The stats below, sampled from developed countries, support the claim that the voice-based user interface to technology and its services is on a hockey-stick adoption trajectory (source):
- 40% of adults use voice search on a daily basis (Forbes)
- 52% of people use voice search while driving (Social Media Today)
- 65% of consumers ages 25-49 years old talk to their voice-enabled devices daily (PwC)
- On average, more men than women use voice search at least once per month (Social Media Today)
- A study conducted by Uberall found that 21% of respondents were using voice search on a weekly basis (Search Engine Watch)
- Close to 50% of people are now researching products using voice search (Social Media Today)
- The number of voice searches increased 35x from 2008 to 2016 (Kleiner Perkins)
- A HubSpot survey found that 74% of respondents had used voice search within the last month (HubSpot)
- Mobile voice search on Google is now translated in over 60 languages (Wikipedia)
With the main mode of interaction with the online world becoming voice-based, voice-based services will also be on the rise. Organizations are today deploying chatbots and voicebots to answer customer queries and to take and fulfill orders. For example, in the USA it is now possible to order pizza from Pizza Hut by simply saying ‘Alexa, order Pizza Hut’, and it will present the menu options. If you instead say ‘Alexa, reorder Pizza Hut’, it proceeds to re-order whatever you ordered last time. This improves the efficiency of service delivery, as these bots are available 24/7 at nearly zero marginal cost per additional customer, unlike hiring humans to do the work. These systems are also well versed in the specific details and operations of the company and know where each bit of information sits in the organization. A chatbot does not need to put the customer on hold to confirm something with the sales or finance department; it has access to all this information and can serve the customer in real time.
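The order/reorder distinction above boils down to the bot keeping a record of each customer’s last order. Here is a minimal sketch of that idea, assuming a hypothetical in-memory store; this is not Pizza Hut’s actual Alexa skill, and the class and method names are my own.

```python
# Minimal order/reorder voicebot sketch. A real deployment would sit
# behind a speech platform and a payments/fulfillment backend.
class OrderBot:
    def __init__(self, menu):
        self.menu = list(menu)   # available items
        self.last_order = {}     # customer_id -> last ordered item

    def handle(self, customer_id, command):
        """Route a transcribed command to the right response."""
        command = command.lower()
        if command.startswith("reorder"):
            item = self.last_order.get(customer_id)
            if item is None:
                return "No previous order found. " + self._menu_prompt()
            return f"Re-ordering your last order: {item}."
        if command.startswith("order"):
            return self._menu_prompt()
        return "Sorry, I didn't understand that."

    def place(self, customer_id, item):
        # Record the order so a later 'reorder' can repeat it.
        self.last_order[customer_id] = item
        return f"Ordered {item}."

    def _menu_prompt(self):
        return "Here are the options: " + ", ".join(self.menu) + "."

bot = OrderBot(["Margherita", "Pepperoni", "Hawaiian"])
print(bot.handle("alice", "order pizza hut"))    # lists the menu
print(bot.place("alice", "Pepperoni"))
print(bot.handle("alice", "reorder pizza hut"))  # repeats the last order
```

Note that ‘reorder’ is checked before ‘order’, since every ‘reorder’ command also starts with the letters of ‘order’; the whole 24/7, zero-marginal-cost argument rests on this logic being simple state lookups rather than human labor.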
Social media will also move from the current text- and multimedia-based platforms such as Facebook to voice-based personas or avatars. Instead of curating an abstract Facebook wall with posts and status updates, people will curate voice avatars that are continuously trained to learn information about us and can even speak on our behalf (in our exact voice, no less). For example, a person can train their avatar to respond to questions on social media on their behalf. If my avatar has access to my calendar and I have allowed it to tell people (or specific people) about my schedule and itinerary for the day, then another avatar or user can ask it where I am or what I will be doing at 3PM today and get an answer. My avatar can also represent me in online meetings, take note of what was discussed and what my takeaways or action points are, and share this with me at the end of the day.

The line between social media and real life will also blur, as this avatar can take on responsibilities in real life. For example, instead of the HR manager sending a mail inviting staff to a physical meeting to brief them on the new staff medical cover, the manager can invite all staff avatars to the meeting and leave me to do more productive things during the meeting time, a win-win for everyone. The avatars, being AI-based systems, will also be more efficient than a human at recalling and analyzing information, and can be used to carry out repetitive tasks or work on my behalf while I get paid. The avatar’s efficiency and closeness to my offline behavior and character will be a function of how much information I allow it to learn about me. The more I let it learn (how I speak, my moods, my social life, my work life, my plans for the day and so on), the closer it will come to resembling me as I am in real life.
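The schedule-sharing scenario above has two moving parts: a calendar the avatar can read, and a permission list controlling who may ask. A toy sketch, with entirely hypothetical names and no real social platform behind it, might look like this:

```python
from datetime import time

# Hypothetical avatar that answers schedule questions on its owner's
# behalf, but only for askers the owner has explicitly permitted.
class Avatar:
    def __init__(self, owner, permitted_askers):
        self.owner = owner
        self.permitted = set(permitted_askers)
        self.calendar = {}  # time of day -> activity

    def add_event(self, at: time, activity: str):
        self.calendar[at] = activity

    def whereabouts(self, asker: str, at: time) -> str:
        """Answer 'what is the owner doing at <time>?' for a given asker."""
        if asker not in self.permitted:
            return f"{self.owner} has not shared their schedule with you."
        activity = self.calendar.get(at)
        if activity is None:
            return f"{self.owner} has nothing scheduled at {at:%H:%M}."
        return f"At {at:%H:%M}, {self.owner} will be {activity}."

ava = Avatar("Amina", permitted_askers=["Joseph"])
ava.add_event(time(15, 0), "in a budget review meeting")
print(ava.whereabouts("Joseph", time(15, 0)))
# At 15:00, Amina will be in a budget review meeting.
```

The real systems imagined here would add voice synthesis and learned behavior on top, but the privacy boundary, an explicit allow-list the owner controls, is the part worth getting right first.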
Mix this with all the information on the Internet and you have yourself a virtual worker who can work on my behalf and interact with others online while I sleep or go fishing in Murang’a. This is the idea behind Microsoft Cortana: a digital assistant for the workplace that learns about you and assists you in your office work, scheduling and reminding you of meetings, looking for information in the company ERP systems, responding to emails, reading reports and taking action, and so on.
Despite all these possibilities, privacy and security stand at the forefront as the major roadblock to the adoption of voice-based user interfaces. For example, is your smart speaker, or the Google Assistant on your phone, constantly listening to conversations beyond the wake word? Can hackers eavesdrop on your intimate or personal one-on-one talk with others in the room?
The truth is there will be no escaping voice adoption, as it presents the most natural way for most humans to use and control technology, and it allows technology to talk back to us with feedback or results in a way we understand. In the coming hyper-connected world of IoT devices, current user interfaces such as touch screens will not let us interact with technology efficiently. There is therefore a need for the developers of these systems to put in place measures that build trust and instill confidence that the systems are not being abused or used to intrude into our private spaces, thoughts and speech.
The other fear is cybersecurity. There was a story last year where hackers used AI speech-generation systems to imitate the voice of a company CEO on the phone and stole a large amount of money (read about it here, or a local version of the same here). This represents a new class of threat that voice-based systems bring to cyberspace, and it needs to be dealt with in the design and implementation of these systems.
Finally, web-based systems and apps are these days designed with a ‘mobile first’ philosophy. This is about to change to ‘voice first’. Watch this space.