The incantatory web

I am an obsessive, but also forgetful. This leads to the phenomenon where I suddenly remember something that used to occupy my thoughts completely ten years ago, but has since disappeared. At the moment of remembrance I can often be found shouting out the name of that thing, amazed by both its familiarity and its strangeness. Maybe it was an album I used to play repeatedly, or a writer I used to think was the only poet worth considering.

And there hanging in the air is the sound of a word or phrase – a conjuring to re-ignite the brain.

And now conjured, the search begins. I can stick the name of that album into Google, or more commonly these days into iTunes to see if I already have it, or, if not, whether Apple Music can bring it to me in a flash. And yes, there it is. It turns out it was there all the time. But until I had called its name it had rested invisibly in the woods, hidden by the multitude, obscured by my present obsessions.

The other day I could hear one of my sons shouting short phrases in the other room. It seemed to be the same phrase over and over, and it was on the edge of annoying, so I thought I’d better investigate. I found my youngest son holding his iPad about 20cm from his mouth and shouting something about Minecraft into the YouTube Kids app. His hope seemed to be like that of an Englishman abroad: that the louder one shouts, the more likely one is to be understood. In this case, after about six loud incantations, the app sprang into action, inadvertently training my son that shouting is the future.

Is shouting the future?

I was reading the latest episode in the adventures of Elon Musk. He believes his Neuralink is the future: a device that connects the brain, and therefore the mind, to the machine the user wishes to control. But is the mind where it is at? Is that where our intent can be clearly identified?

Have you noticed your Google searches becoming more complex, more wordy, more like big long questions, often phrased just as someone might say them out loud? I have. I have inched toward a use of the web that is very much like speaking, and it works. But my son has decided he wants to skip that pesky step of thinking in words and then typing them into a box (a rather small box!) – he is of the view that he should think about what he wants, say it loudly and have his machines do his bidding. What initially looked plain wrong to me has got me thinking about how we connect ourselves to our machines in the endless journey of discovery that characterises our current lives.

Our voices are such a natural and flexible expressive tool, and sometimes it is only when we have heard ourselves out loud that we know what we mean.

Right now we are sunning ourselves on the high mountain peaks of the hype curve of Machine Learning and AI – marvelling at how they will change our lives and economies beyond recognition – robotising and automating all around us. But something that is often buried below the sparkly layer of speculation about grand change is the potential for us humans to better connect to one another’s experiences through our own vernacular, to tweak our digital engines of discovery to make use of our own organs of intent – our voices.

Clearly this idea is motivating the tech giants (and the cartoon inventors) who are now committing to create a set of devices ready to listen to us and to try and make sense of our wishes. Wishes for what? There seem to be three categories of wishes that can be granted by affordable technology right now.

Three wishes

The first wish is to simply control our machines – much like the “Flashmatic” TV remote control developed in 1955, we want to be able to use simple voiced commands issued at a distance to make our machines do things (this is also true for our dogs). Spotify are wrestling with a workable version of voice control on Android right now that enables drivers to shout “next track” when Phil Collins’ Sussudio comes on. We see this technology being embedded into TVs, games consoles and, increasingly, the software tools that pervade every kind of mobile device. This has opened a whole new area of interface design that attempts to seamlessly blend multiple modes of interaction together in a way that doesn’t feel more complex for users.
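To make that first wish a little more concrete, here is a minimal sketch of a voice-command loop in Python. It assumes the third-party SpeechRecognition package (plus microphone support via PyAudio), and the command phrases and playback actions are invented stubs for illustration, not Spotify’s or anyone else’s real API.

```python
# A minimal sketch of the "first wish": map a few spoken commands to actions.
# Assumes the third-party SpeechRecognition package (pip install SpeechRecognition)
# and PyAudio for microphone access. The actions themselves are hypothetical stubs.
import speech_recognition as sr

COMMANDS = {
    "next track": lambda: print("Skipping to the next track..."),
    "pause": lambda: print("Pausing playback..."),
    "volume up": lambda: print("Turning it up..."),
}

def listen_for_command() -> None:
    recogniser = sr.Recognizer()
    with sr.Microphone() as source:
        recogniser.adjust_for_ambient_noise(source)  # so shouting is not required
        audio = recogniser.listen(source)
    try:
        heard = recogniser.recognize_google(audio).lower()
    except sr.UnknownValueError:
        print("Sorry, I didn't catch that.")
        return
    for phrase, action in COMMANDS.items():
        if phrase in heard:
            action()
            return
    print(f"No command matched: '{heard}'")

if __name__ == "__main__":
    listen_for_command()
```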

The second wish is to find what we want by speaking its name – Google has found that our attitude to voice search is changing fast. They say: “research into consumer perceptions of voice and text search has shown that while 57 percent of consumers still find text search highly functional, a growing number — 45 percent, in fact — feel that voice is the future. A massive 83 percent of the consumers we surveyed believe voice makes it even easier to find what they want from brands, while even more of them (89 percent) feel voice makes search faster.” We have already adopted a very different attitude from a few years ago, when voice search made its initial steps into the mainstream. At first we were not impressed, and to be fair the technology more often than not disappointed the user and made the whole process of finding things harder, not easier. In fact Google found that early users of that less satisfactory form of voice search have maintained their circumspection and are currently much less ready to give it another go – possibly the downside of the digital maxim that you should get your product out there as quickly as you can and improve it in the wild (as per Zuckerberg’s much-cited ‘move fast and break things’).

The key technologies that are improving the experience of voice search right now are Natural Language Processing and the use of all kinds of inference data to better understand the context of the user’s question and therefore deliver a more relevant and useful result for that exact moment.
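As a toy illustration of that idea, the sketch below reshapes a loosely worded spoken question using whatever context signals the machine happens to hold. The signal names, weighting and query structure are all invented for the purpose of illustration; no real search engine’s pipeline is being described here.

```python
# A toy sketch of context-aware voice search: the transcript alone is loose and
# ambiguous, so inference signals (location, time, recent interests) are folded
# into the request. The field names here are illustrative, not a real search API.
from dataclasses import dataclass

@dataclass
class Context:
    location: str          # e.g. inferred from the device
    local_hour: int        # 24-hour clock at the user's location
    recent_topics: list    # things the user has shown interest in lately

def build_search_request(transcript: str, ctx: Context) -> dict:
    """Combine the raw spoken question with context to make it answerable."""
    request = {"query": transcript, "filters": {}}
    # Loose, conversational questions lean heavily on context for relevance.
    if "where should i start" in transcript.lower():
        request["filters"]["near"] = ctx.location
        request["filters"]["open_now"] = 9 <= ctx.local_hour <= 21
        request["boost"] = ctx.recent_topics  # nudge results toward known interests
    return request

print(build_search_request(
    "I am getting into art, where should I start?",
    Context(location="London", local_hour=14, recent_topics=["street art"]),
))
```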

All of this presents some interesting challenges for content discovery online. As more searches become spoken, and shaped by the structures of spoken language, content creators will need to think harder about how to use written language in and around their content.

I was talking to Trish Thomas, Head of Digital Engagement at the Southbank Centre, the other day about the challenge of introducing audiences to art, music and culture that is new to them. It really is OK right now if you can call to mind the name of an artist or a work – the web has truly got you covered. But what if you asked the web a broader, looser question – “I am getting into art, where should I start?” Thinking about that combination of understanding the words and the context, wouldn’t it be startling and exciting if a 16-year-old, bored and trapped in a suburb of London, asked Google that question and was told about a couple of exhibitions they could visit in the centre of London right now? Perhaps it would spark a journey of discovery that started with a spoken question, progressed on a cheap bus ride and culminated in a moment of feeling independent, amazed and inspired at the Hayward.

The third wish is to get things done by having a conversation – this is a field that is truly fascinating to watch. First came Siri, then Alexa and Cortana. Siri is still the most used virtual assistant, but Alexa is hot on its heels: a Verto Analytics survey saw Alexa’s monthly active users jump 325 percent – from 0.8 million to 2.6 million – over the last year.

This is interesting because it connects to a behavioural shift – the study found that phone-based personal assistants, such as Siri and Samsung’s S-Voice, are declining in popularity, while those associated with voice computing in the home, like Alexa and Google Home’s mobile app, are growing. It seems people may feel more comfortable attempting conversations with their machines at home.

From my experience of these interfaces, they feel nothing at all like having a conversation yet – more like acting in a play where I’ve half-forgotten my lines. The technology that underpins a good experience goes way beyond Natural Language Processing and gets right to the heart of the matter: working out the meaning of our intent, and then constructing new meaning in a rapid back-and-forth exchange. For many this is one of the big hopes for Artificial Intelligence and its poster child, Machine Learning. The theory is that the content and data of the web can be used to train our machines to better and more quickly infer our likely meaning. Coupled with smoother and more reliable Natural Language Processing, this field of inference can help the machine not only know what we just said, but also make a good guess about what we might expect as an appropriate response.
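Stripped right down, that back-and-forth might look something like the sketch below: guess the intent behind the words, then choose a response shaped by that guess. Real assistants replace the crude keyword overlap used here with trained language models, and the intents and replies are invented purely for illustration.

```python
# A deliberately naive sketch of the conversational loop: infer a likely intent
# from the words, then pick an appropriate response. The keyword overlap below
# stands in for the trained models a real assistant would use.
INTENTS = {
    "set_reminder": {"remind", "reminder", "remember"},
    "play_music": {"play", "song", "track", "album"},
    "find_events": {"exhibition", "exhibitions", "gig", "visit"},
}

RESPONSES = {
    "set_reminder": "Okay, when should I remind you?",
    "play_music": "Sure, what would you like to hear?",
    "find_events": "Here are a few things on near you today.",
    None: "Sorry, I'm not sure what you meant. Could you say it another way?",
}

def infer_intent(utterance: str):
    """Return the intent whose keywords overlap most with the utterance, if any."""
    words = set(utterance.lower().split())
    scores = {intent: len(words & keywords) for intent, keywords in INTENTS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def respond(utterance: str) -> str:
    return RESPONSES[infer_intent(utterance)]

print(respond("Hey Siri, remind me on Tuesday to buy a new phone"))
print(respond("Play Summertime"))
```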

Of course there is something of a space race to develop these technologies, pitting start-ups and giants like IBM against each other. Facebook have just added Natural Language Processing to their Messenger Platform, enabling developers and brands to more easily build voice-based interactions into their digital experiences – something that is sure to drive even more mainstream adoption of this talking-to-machines behaviour.

But I also spotted the other day that Mozilla continues to develop its own public-spirited approach to innovation, and they too are interested in the voice. They have a project called Common Voice that aims to gather over 10,000 hours of user-generated spoken word on which to train machines. You can join in and help right now, in the hope that this new open-source database will help create an ever more diverse set of ways that machines can truly listen to us.

Seeking a personal connection

On a recent visit my dad wanted to make a confession. “Siri and I are constant companions,” he told us. Siri lives on his iPad, and his iPad is rarely more than a few feet from him at any moment – listening out for his command. “Hey Siri, play Summertime.” At the moment he mostly seems to be using Siri as a way to remember what he needs to do. As soon as a thought occurs to him about some task, he asks Siri to set a reminder: “Hey Siri, remind me on Tuesday to buy a new phone.”

Much of this post was written in Google Docs using the Voice Typing functionality – except for the bits I wrote on the train as I am not yet ready to be shouting at my computer in public.
