
Research & Ideas


How to create better chatbot conversations

Stanford’s “Chirpy Cardinal” team effectively combined scripted and neurally generated bot responses for richer interactions.


Some of the difficulty of building a social chatbot stems from the vast range of skills that go into making meaningful human conversation. | Adobe Stock/terovesalainen

Voice-controlled digital assistants like Siri and Alexa have become more and more adept at responding appropriately to user requests — but enabling users to hold a real conversation with such bots remains a huge challenge.

Anyone who’s played around with these AIs knows there’s a world of difference between, say, getting the bot to play a requested song from your music library and engaging the bot in a satisfying exchange of opinions about that song.

That’s just one example of the difficulty of conversational AI, a field that has drawn AI researchers’ attention in recent years through Amazon’s Alexa Prize Challenge, which since 2017 has pitted university teams against one another to build the most engaging social chatbot. Under the supervision of faculty advisor Christopher Manning, who also directs the Stanford Artificial Intelligence Lab and serves as an associate director of the Stanford Institute for Human-centered AI, Stanford’s team placed second in the most recent competition and recently made its chatbot’s code available to the public.

Some of the difficulty of building a social chatbot stems from the vast range of real-world knowledge and skills that go into making a conversation between two people rewarding, explains Amelia Hardy, a member of Stanford’s Alexa Prize team, whose chatbot is named “Chirpy Cardinal.” The rewards of a satisfying person-to-person conversation, she points out, stem from being able to exchange ideas about a variety of subjects and from feeling truly understood and empathized with. But machines just don’t have that kind of real-world understanding, whether about specific conversational topics or about the emotional attunement a good conversation calls for. “There’s no intuition of what it means to empathize, or what baseball is,” says Hardy, a master’s student in computer science. Nor do machines have a great capacity to pick up on conversational cues, such as subtle signs of waning interest in the current topic — even though this capacity is, of course, essential to responding appropriately.

Trying to address these conversational challenges through machine learning poses an additional problem. There’s been a lot of recent work in using large neural networks to try to generate a bot’s responses word-by-word based on the conversation up to that point. These neurally generated responses can sound quite responsive to the actual flow of conversation and can use a rich variety of words. But there’s a drawback. “If you’re doing anything neurally generated, it’s just really hard to get [the chatbot] to do things that are consistent and predictable, and so the bot might go totally off the rails,” Hardy says. Yet trying to keep the bot on track through templated responses — the kind based on canned sentences with placeholders — tends to make the bot sound rigid and unnatural.
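The templated approach described above can be sketched in a few lines. This is a generic illustration of canned sentences with placeholders, not Chirpy’s actual code; the template wording and function names are assumptions.

```python
# Minimal sketch of templated responses: canned sentences with
# placeholders. Templates here are illustrative, not Chirpy's own.
TEMPLATES = [
    "I love {entity} too! What do you like most about it?",
    "Oh, {entity}? I've heard a lot about that. What got you into it?",
]

def templated_response(entity: str, turn: int) -> str:
    """Fill a canned sentence with the current conversational entity.

    Predictable and safe, but every user hears the same phrasing --
    the rigidity the article describes.
    """
    template = TEMPLATES[turn % len(TEMPLATES)]
    return template.format(entity=entity)
```

The trade-off is visible immediately: the output is always grammatical and on-topic, but after a turn or two the repetition becomes obvious, which is exactly why purely templated bots sound unnatural.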

This problem of attaining high degrees of both predictability and flexibility in a chatbot remains largely unsolved, but the team figured out several rules of thumb — heuristics — for effectively combining the use of both scripted and neurally generated responses. For example, because neural generation is far better at continuing a rich conversation than starting one, the team found that it’s a good idea to start chats with a scripted question before switching to neurally generated responses. And because bots using neural generation tend to drift off-topic over time, the team decided not to let the neurally generated dialogue continue for more than a few turns.
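The hand-off heuristics described above can be sketched as a small dispatcher: open with a scripted prompt, then let neural generation run for only a few turns before re-anchoring on a script. The function signature and the turn cap are assumptions for illustration, not the team’s actual implementation.

```python
# Hedged sketch of the scripted/neural hand-off heuristics.
# MAX_NEURAL_TURNS is an assumed value, not Chirpy's real setting.
MAX_NEURAL_TURNS = 3  # cap before the bot risks drifting off-topic

def choose_response(turn: int, neural_turns_used: int,
                    scripted: str, neural: str) -> tuple[str, int]:
    """Return the next bot utterance and the updated neural-turn count."""
    if turn == 0:
        # Neural generation is weak at *starting* a conversation,
        # so the opening turn is always a scripted question.
        return scripted, neural_turns_used
    if neural_turns_used < MAX_NEURAL_TURNS:
        # Mid-conversation, neural responses are richer and more fluid.
        return neural, neural_turns_used + 1
    # After a few neural turns, fall back to a script to keep the
    # dialogue from going "off the rails".
    return scripted, neural_turns_used
```

A caller would thread `neural_turns_used` through the conversation state, resetting it whenever a scripted response re-anchors the topic.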

Human/AI Connection

Testing Chirpy on the hundreds of thousands of participating Alexa users, the team also made several interesting discoveries about how users view and treat the conversational bot.

“One of the most surprising things was how we would have people who wanted to chat for a long time,” says Ashwin Paranjape, who co-led the Stanford team along with fellow PhD student Abi See. Although the average conversation with Chirpy lasted only about two minutes, a full 10 percent of users who interacted with Chirpy chatted for more than 10 minutes, with some chatting for 15 minutes or longer. What’s more, even though Chirpy was far from a flawless conversationalist, many users were willing to tolerate the bot’s oddities, and even asked it personal questions, like “What did you do over the weekend?” All this suggests to Paranjape and Hardy that many people have a real interest in having a social conversation with a bot, and in connecting with it in a relationship-building way.

Paranjape was also surprised to see a certain amount of suspension of disbelief from users. The team had started with the assumption that the chatbot couldn’t talk about having had “embodied experiences” such as eating — yet when Chirpy said that it had recently had a pizza or a hot dog, users tended to play along. “They would ask, ‘Where did you have it?’ or ‘How did it taste?’”


A sample of a Chirpy conversation.

The Stanford team tested Chirpy on hundreds of thousands of users.

Preventing Abuse, Building Conversational Flow

To accelerate development of conversational AI, the Stanford team recently open-sourced Chirpy — hoping, among other advances, that someone will extend state-of-the-art “entity linking” methods to work with conversations and replace the cruder, heuristic-based method Chirpy currently uses to connect ambiguous words to their intended meanings. A conversationally savvy bot needs to know, for example, when it hears the sounds “cold play” in the context of music that the user is actually referring to the band Coldplay.

Two of the team’s discoveries during the challenge have already led them down new avenues of research. One is the observation that a significant portion of users speak abusively to the bot — a pattern that is troubling given that unchecked abuse toward a bot might legitimize similar behavior toward humans in service roles, particularly women and minorities. So some of the Stanford teammates have been testing ways for the bot to stop abusive speech, such as steering the conversation to a new topic and calling the user by name.

Another intriguing research direction came from noticing that the bot tended to railroad users down a conversational path, such as forcing users to talk about movies simply because that’s a topic the bot was good at talking about. And some users explicitly chided Chirpy for asking too many questions. These observations led to a new research question: How can social bots encourage users to take more conversational initiative, instead of having the bot commandeer the conversation like an interrogator? “That’s a question we never would have stumbled across had we not seen it happen in real life,” says Paranjape.