The history of conversational interfaces started a long time ago, and they are slowly but steadily reaching homes all around the world. Amazon had sold 100 million Alexa devices by January 2019, and the numbers only rose through the year. For iPhone users this kind of interaction was not new: iOS has had Siri since 2011, and many of us had interacted with conversational interfaces even before that. Remember Clippy, a feature of Microsoft Office back in the 90s? But did you know that the history of conversational interfaces begins in the Middle Ages?
For conversational interfaces to exist, technologies like natural language understanding and voice generation had to go through a long journey to reach the quality and speed that make seamless interaction possible.
The Year 1000
Pope Sylvester II, around the year 1000, claimed he had built a machine that could answer questions with yes or no. He is a fascinating character. Originally known as Gerbert of Aurillac, he was the first French pope and is described in Wikipedia as a humanist before humanism. He is credited with introducing the decimal numeral system, using Hindu-Arabic numerals, to Europe. Legends said that the power and intelligence of this pope, and particularly his capacity to create his talking machine, came from an alliance with the devil himself.
Ramon Llull, a reference point for conversational interfaces and artificial intelligence
Ramon Llull, who lived in the 1200s, dreamed of a machine that could answer questions about the Universe. He believed that this machine could even prove the existence of God. This is one of the first depictions in literature of what we today call Artificial Intelligence: a person could input a series of rules, and the machine would perform logical operations to answer questions.
Conversational interfaces in the XII and XIII Century: Talking heads
Robert Grosseteste was another religious man who made advances in logic. He even tried his hand at proving the Holy Trinity. In the XII Century, he built a talking head made of bronze. In the XIII Century, Roger Bacon and Albert the Great tried building machines that resembled heads and could speak. Albert the Great even claimed his creation could predict the future! According to legend, Thomas Aquinas destroyed the machine to protect his friend. People considered these creations wizardry, not technology.
The enchanted head of The Quixote
Don Quixote, a book first published by Miguel de Cervantes in 1605, featured one of these talking heads. Chapter 62 of the second part of The Quixote tells of an enchanted head that can answer questions whispered into its ear. This artifact relied on a person who, sitting in another room, could hear the questions and answer them. In Cervantes’ book, Quixote is tricked by this fake machine.
Conversational interfaces in the XVII Century: Musurgia Universalis, Descartes
The history of conversational interfaces is full of interesting characters. In 1650 Athanasius Kircher wrote Musurgia Universalis, a book that explained how to create a human-resembling head that was able not only to produce sounds, but also to move its eyes and mouth. A few years later, in 1662, a posthumously published work by Descartes discussed machines that could move on their own. He wrote “I suppose the human body to be just a statue or a machine made of earth… we see clocks, artificial fountains, mills, and other similar machines which, even though they’re only made by men, have the power to move of their own accord in many ways”. This inspired Athanasius Kircher to return to his studies, and in 1673 he wrote his book Phonurgia Nova.
Conversational interfaces in the XVII and XVIII Centuries: Queen Christina’s polyglot talking head
In the XVII Century Queen Christina of Sweden commissioned Valentín Merbitz to build a talking head for her. According to the stories, this head could speak five languages and relied on ventriloquism to work. More talking heads were created in the XVIII Century. In those days they also stopped being considered a dark art and began to be appreciated as scientific inventions. This change in perception is important for the history of conversational interfaces.
Conversational interfaces in the 1700s: The work of a lifetime
Abbé Mical created two talking heads that could hold a conversation, shown at the Royal Academy of Sciences in Paris. This work took him 30 years and ruined him financially. Lavoisier and Laplace, two of the most important scientists of the time, wrote a description of the machine. Their description also covers the internal mechanism of the two-headed machine.
Wolfgang von Kempelen built, in this same century, a machine that could be adjusted to reproduce different accents. It did not have a human appearance, but it was accompanied by a book that explained the mechanism in detail.
Austrian inventor Joseph Faber created a talking machine called Euphonia in the nineteenth century. By manipulating a keyboard attached to a system of air bags and strings, a person could generate sounds that resembled a human voice. This machine was usually exhibited together with a female mask and a hanging dress.
Fast forward to the last century: the Voder was the first electronic voice synthesiser. It was presented by Bell Telephone Laboratories in 1939. This machine could combine sounds to pronounce words, and combine words to form sentences. The Voder could vary its intonation and had different voices, which could be changed with small adjustments to the mechanism. It was a manually operated system. Playing it required training, as the operator used ten fingers, two foot pedals, a knee lever and an arm switch to generate sounds. The sound quality was better than that of most voice synthesisers until the late 1990s.
The imitation game
In the 50s a few milestones for Artificial Intelligence and Conversational Interfaces take place: in 1950 Turing proposes his test to determine machine intelligence; in 1956 McCarthy organizes the Dartmouth Conference, the first professional meeting for the field and the event where the term Artificial Intelligence is coined; and in 1957 Noam Chomsky introduces his generative grammar theory, key for the study of Natural Language Processing.
In the 1960s and early 1970s the world sees a rebirth of Artificial Intelligence, and the first text-based chatbots, Eliza (1966) and Parry (1972), are created. They have little intelligence and rely on their personalities (a psychotherapist and a patient with schizophrenia, respectively) to give credibility to their interactions.
In 1970 Winograd publishes his project SHRDLU, a natural language processor that can operate with objects in a restricted world of pyramids and prisms displayed through a graphic interface.
Other attempts to make conversational interfaces that engage users follow, like Jabberwacky (1988), ALICE (1995) or the well-known agent for Microsoft Office, Clippy.
Big tech bets: Watson, Siri, Cortana, Alexa, Bixby, Google Assistant, M…
Since IBM’s Watson presentation in 2006, the big companies have presented different projects showing that conversational interfaces are on everyone’s priority list. These projects are not just agents, but full platforms on which developers can create conversational experiences based on the advances these companies have commoditised.
In Apple’s words: “iPhone 4S also introduces Siri, an intelligent assistant that helps you get things done just by asking. Siri understands context allowing you to speak naturally when you ask it questions, for example, if you ask “Will I need an umbrella this weekend?” it understands you are looking for a weather forecast. Siri is also smart about using the personal information you allow it to access, for example, if you tell Siri “Remind me to call Mom when I get home” it can find “Mom” in your address book, or ask Siri “What’s the traffic like around here?” and it can figure out where “here” is based on your current location. Siri helps you make calls, send text messages or email, schedule meetings and reminders, make notes, search the Internet, find local businesses, get directions and more. You can also get answers, find facts and even perform complex calculations just by asking.”
Below is the first demo of Siri, at the Apple October 2011 event:
These technology companies made large investments to improve speech recognition and generation, natural language understanding, smart devices, UX patterns and machine learning models for language, and processed millions of characters of documents and conversations to get us where we are today. Engineers, linguists and product people worked together to make this possible. We are standing on the shoulders of giants.
With the creation of these platforms, product managers and developers followed the “build it and they will come” motto and started creating conversational products. Alexa skills, Facebook Messenger bots, Telegram bots and other conversational experiences were used as a way to interact with customers in different industries and sectors.
These chatbots and voice skills can automate customer support, check in flights or distribute curated news as a replacement for email newsletters. They are generally limited in functionality and integrations, and their natural language understanding is far from the conversational intelligences we have seen in movies.
Getting smarter: Mitsuku
Conversational interfaces today are far from the expectations our peers from the Middle Ages had, and far from the expectations we have after watching science fiction movies that feature talking robots. Apart from the quality of the synthesized voice, conversations are not as natural as we would like them to be. Mitsuku, a chatbot created by Steve Worswick, is so far the best conversational interface when it comes to fluidity and understanding. She is a five-time winner of the Loebner Prize Turing Test, an annual competition that challenges judges to tell humans and chatbots apart based on conversations and that gives a prize to the most human-like bot.
The new Euphonia
With the spread of conversational interfaces comes a necessary concern about accessibility. Euphonia is also the name of a Google project that aims to extend speech research to serve people with conditions that affect speech, such as multiple sclerosis. You can read technical details about this experiment in this post published on the Google AI blog, or learn about it by watching the video below.
Conversational interfaces are better understood by looking at the advances that made them possible. I can talk about this topic at your conference; my lines are open at email@example.com