Natural Language Understanding has been a hot topic for some years. With products like Apple’s Siri and Amazon’s Echo, more and more people interact with technology by giving it voice commands. To a certain extent these technologies have been quite successful, in that they allow a wide range of tasks to be accomplished by speaking to a device. But they also have their shortcomings, and examples of comical misinterpretations abound on the Internet.
So how do technologies like Siri actually work? Since they are not open technologies, I can only speculate, but past projects that I have worked on allow me to make an educated guess. At some point I worked at Avatar Reality on the Blue Mars project to build a framework for autonomous agents. That’s a fancy way of saying avatars that are controlled by a computer program rather than a human player. Programming autonomous agents can be roughly split into two parts. One part is about interacting with the virtual world: path-finding, tracking other players and events in the game. The other part is interacting with the other players. And while there were ‘physical’ interactions built into the game, like shaking hands, giving a hug or dancing, the main way characters interact with each other is through language, which is obviously where natural language understanding comes in. A player says something and the autonomous agent generates a response, either by performing some sort of action or by saying something relevant to what was said. In the case of a textual (or ‘verbal’) response, the sentence gets mapped to a set of patterns or templates using something similar to regular expressions. When a match is found, a corresponding reply is emitted. Ideally this response contains some of the words or the subject of the conversation to make it relevant.
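To make that concrete, here is a minimal sketch of such a pattern-to-template responder in Python. The patterns and replies are invented for illustration (a real system would have thousands of them), but the mechanism is the same: match, capture, splice the captured words into a canned reply so the answer stays superficially relevant.

```python
import random
import re

# Illustrative (hypothetical) pattern -> response templates.
# Captured groups are spliced into the reply so the answer
# echoes back some of what the player said.
PATTERNS = [
    (re.compile(r"\bmy name is (\w+)", re.I),
     ["Nice to meet you, {0}!", "Hello {0}, welcome!"]),
    (re.compile(r"\bi (?:like|love) (.+)", re.I),
     ["What do you like most about {0}?", "I enjoy {0} too."]),
    (re.compile(r"\bwhere is (.+?)\??$", re.I),
     ["I think {0} is just past the plaza.", "Try the map for {0}."]),
]

# Generic replies for when nothing matches.
FALLBACKS = ["Interesting, tell me more.", "Why do you say that?"]

def respond(utterance: str) -> str:
    """Map a player's utterance to a templated reply."""
    for pattern, replies in PATTERNS:
        match = pattern.search(utterance)
        if match:
            return random.choice(replies).format(*match.groups())
    return random.choice(FALLBACKS)

print(respond("My name is Alice"))        # e.g. "Nice to meet you, Alice!"
print(respond("I love dancing"))          # e.g. "I enjoy dancing too."
print(respond("What should we do now?"))  # falls back to a canned reply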
Generally, the initial reaction of human players to these non-player characters (NPCs) is amusement, but people quickly tire of interacting with them. The most common complaints are that the conversation is repetitive, that the answers are irrelevant to what was said, or that the NPC abruptly changes the subject all the time. It’s not for nothing that nobody has come even close to passing the Turing test. Many tricks have been attempted to improve the responses. Most of these focus on capturing essential elements of the conversation: what the subject and object of the previous sentence were, what the general topic of the conversation is, and so on. The thinking is that the context of the conversation is important, which it obviously is. But is it actually possible to extract enough information from a short conversation to provide this context? I’m going to argue here that it is in fact not possible, and that this is a big reason why natural language understanding is so hard.
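As an illustration of that kind of context extraction, here is a small sketch using the off-the-shelf spaCy dependency parser to pull out the grammatical subject and object of a sentence. This is not what Blue Mars used, just my assumption of how such a trick looks today; note how shallow these ‘essential elements’ really are.

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def conversation_context(sentence):
    """Extract the grammatical subject and object of a sentence,
    the kind of 'essential elements' a chatbot might track as context."""
    doc = nlp(sentence)
    context = {"subject": None, "object": None}
    for token in doc:
        # First nominal subject / (in)direct object the parser finds.
        if token.dep_ in ("nsubj", "nsubjpass") and context["subject"] is None:
            context["subject"] = token.text
        elif token.dep_ in ("dobj", "dative") and context["object"] is None:
            context["object"] = token.text
    return context

print(conversation_context("Bob told John he didn't pass the test."))
# roughly: {'subject': 'Bob', 'object': 'John'}
```

Note that the parser can tell you *who said it* and *who was told*, but nothing about who the pronoun refers to, which is exactly the problem in the next example.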
Consider the following sentence: “Bob told John he didn’t pass the test.” So who didn’t pass the test? It’s actually ambiguous. But conversations are full of sentences like these, and usually we deal with them anyway. So how would an AI disambiguate a sentence like this? Of course it could look further back in the conversation and see if there’s any hint of who took a test recently. This is the examining of context that was referred to earlier. But let’s change the sentence in a few subtle ways and see what happens. “The teacher told John he didn’t pass the test.” Grammatically there’s not much difference between this sentence and the earlier one; they are both equally ambiguous. And yet any human would assume that in the latter sentence, it was John who didn’t pass the test. Next: “Bob told dad he didn’t pass the test.” This time, we’d assume it was Bob who didn’t pass the test. So suddenly the roles of the two people in the sentence flipped, even though grammatically nothing changed. From these examples it’s very clear that context alone is unlikely to disambiguate these sentences. What is embedded in them is intricate knowledge about the world, the people in it, the roles these people play and the relationships between them. Everybody knows that teachers have students who take tests. Kids or young people are more likely to be students than older people, so a dad is less likely to be a student than his children. And so on, and so on.
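Just to show how much hand-coded world knowledge even this one sentence pattern would need, here is a deliberately naive sketch. The roles and prior probabilities are entirely made up for illustration; the point is that the disambiguation lives in that little table of world knowledge, not in the grammar.

```python
# Toy illustration, not a real coreference resolver: score each
# candidate antecedent of "he" by how plausible that person is
# as a test-taker, using hand-coded world knowledge about roles.
TEST_TAKER_PRIOR = {
    "student": 0.9,   # students routinely take tests
    "teacher": 0.05,  # teachers give tests, rarely take them
    "child":   0.8,   # kids are likely to be students
    "parent":  0.2,   # a dad is less likely to be the test-taker
}

def resolve_test_taker(candidates):
    """candidates maps each mentioned person to their assumed role.
    Returns the person most plausibly meant by 'he didn't pass'."""
    return max(candidates, key=lambda name: TEST_TAKER_PRIOR[candidates[name]])

# "The teacher told John he didn't pass the test."
print(resolve_test_taker({"the teacher": "teacher", "John": "student"}))  # John

# "Bob told dad he didn't pass the test."
print(resolve_test_taker({"Bob": "child", "dad": "parent"}))  # Bob
```

And of course this toy table is exactly what doesn’t scale: every new role, relationship and situation needs more entries, and as the next paragraph shows, a single follow-up sentence can overturn all of them.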
But these things are not set in stone. What happens if a response comes back like “Don’t worry dad, you’ll pass next time!”? A couple of interesting things, actually. First of all, a quick mental readjustment happens to make both sentences make sense: the roles of the people in the sentences change based on the new input, in both sentences at once. Secondly, it might have caused you to chuckle. That is because the first sentence raised a certain expectation with the audience, and that expectation was then proven false. The result is a surprise for the listener, but one that is still not totally illogical or impossible. And that, I believe, is the essence of humour: a joke raises a strong expectation with the audience and then violates that expectation in a way that still resonates.
So I’d like to finish by claiming that a good sign of passing the Turing test would be an AI that can take or make a joke.