This blog post is about the NAACL 2019 paper What makes a good conversation? How controllable attributes affect human judgments by Abigail See, Stephen Roller, Douwe Kiela and Jason Weston. [paper] [code/demo] [slides]


The Natural Language Generation task spectrum

When I think about Natural Language Generation (NLG) tasks, I imagine them on the following spectrum:¹

[Figure: The spectrum of NLG tasks, from less open-ended tasks like Machine Translation (left) to more open-ended tasks like Story Generation and Chitchat Dialogue (right).]

On the left are tasks like Machine Translation (MT), which are less open-ended (i.e. there is a relatively narrow range of correct outputs given the input). Given the close correspondence between input and output, these tasks can be accomplished mostly (but not entirely) by decisions at the word/phrase level. On the right are tasks like Story Generation and Chitchat Dialogue, which are more open-ended (i.e. there is a huge range of appropriate outputs given the input). For these tasks, the ability to make high-level decisions (e.g. ‘what should happen next in the story?’ or ‘should we change the subject of discussion?’) is central to the task.

While neural Language Model (LM) based approaches have so far been successful for tasks on the left, they have well-documented difficulties with tasks on the right, such as repetitious and generic output (under certain decoding algorithms). More broadly, neural LMs seem to struggle to learn to make the necessary high-level decisions.

For this reason, control – that is, the ability to specify desired attributes of the text at test time – is an attractive idea for open-ended neural NLG. For example, if we can control the repetitiveness or genericness of the text, we can fix the aforementioned errors. Furthermore, if we can control certain high-level attributes of the text (e.g. whether to change the subject, or whether to ask a question), then perhaps we can make some high-level decisions for the neural LM.

It is well-established that all NLG evaluation is difficult – for example, the MT and summarization communities continue to use the BLEU and ROUGE automatic metrics despite their well-documented problems. But for open-ended NLG, evaluation is even more difficult. In the absence of a useful automatic metric to capture overall quality, we rely on human evaluation. But even that is complex – when evaluating dialogue, should we evaluate single turns or multiple turns? Should evaluators take part in conversations interactively or not? What questions should be asked, and how should they be phrased?

Three research questions

In this work, we use chitchat dialogue as a setting to better understand the issues raised above. In particular, we control multiple attributes of generated text and human-evaluate multiple aspects of conversational quality, in order to answer three main research questions:

Research Question 1: How effectively can we control the attributes?
Quick answer: Pretty well! But some control methods only work for some attributes.

Research Question 2: How do the controllable attributes affect conversational quality aspects?
Quick answer: Strongly – we get improvements by controlling repetition, question-asking, and specificity vs genericness.

Research Question 3: Can we use control to make a better chatbot overall?
Quick answer: Yes! Though the answer can depend on the definition of ‘better overall’.

The PersonaChat task

We use PersonaChat, a chitchat dataset containing conversations between two participants, each of whom has a ‘persona’. Our task is to build a chatbot that can converse with a human in this setting.

[Figure: In the PersonaChat task, both participants are supplied with a persona and instructed to get to know each other.]

The PersonaChat task was the focus of the NeurIPS 2018 ConvAI2 Competition. Most successful teams built neural sequence generation systems (see the competition report). In particular the winning team, Lost in Conversation, used a finetuned version of OpenAI’s GPT language model, which is pretrained on a very large amount of text (985 million words).

We use a simple baseline – a standard LSTM-based sequence-to-sequence architecture with attention. On each turn, the bot’s persona is concatenated with the dialogue history to form the input sequence, and the output is generated using beam search. We pretrain this model on 2.5 million Twitter message/response pairs, then finetune it on PersonaChat.
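For concreteness, here is a minimal sketch (not the authors' ParlAI code) of how the encoder input might be assembled on each turn. The delimiter tokens `__EOP__` and `__EOU__` are hypothetical placeholders introduced for illustration, not taken from the paper.

```python
# A minimal sketch of building the seq2seq input on each turn:
# the bot's persona sentences are concatenated with the dialogue history.
# __EOU__ / __EOP__ are hypothetical delimiter tokens, not the paper's actual format.

def build_input_sequence(persona_sentences, dialogue_history):
    """persona_sentences: list of str; dialogue_history: list of str, oldest utterance first."""
    persona_part = " ".join(persona_sentences)
    history_part = " __EOU__ ".join(dialogue_history)   # __EOU__ marks utterance boundaries
    return persona_part + " __EOP__ " + history_part    # __EOP__ separates persona from history

persona = ["i love to ski .", "i have two dogs ."]
history = ["hi , how are you ?", "i am great , just got back from the slopes ."]
print(build_input_sequence(persona, history))
```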

Four controllable attributes of text

[Figure: We control four attributes of the output text.]

Neural LMs often produce repetitive, generic or irrelevant text, especially when decoding using beam search. Motivated by this, we control the repetitiveness, specificity and response-relatedness of the output text. These attributes are defined simply: repetitiveness as n-gram overlap, specificity as word rareness, and response-relatedness as the embedding similarity of the bot’s response to the human’s last utterance.

Lastly, we also control the rate at which the bot asks questions (here we regard an utterance as containing a question if and only if it contains ‘?’). Question-asking is an essential component of chitchat, but one that must be balanced carefully. By controlling question-asking, we can find and understand the right balance.
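To make these definitions concrete, here is a simplified sketch of how the four attribute annotations might be computed. The paper's exact measures are more careful than this (e.g. a particular rareness statistic for specificity and a particular sentence-embedding scheme for response-relatedness); the helper names here are illustrative.

```python
# Simplified sketch of the four attribute definitions described above.
import numpy as np

def ngram_overlap(response, context, n=2):
    """Repetition: fraction of the response's n-grams that already appear in the context."""
    def ngrams(tokens, n):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    resp, ctx = ngrams(response.split(), n), ngrams(context.split(), n)
    return len(resp & ctx) / max(len(resp), 1)

def specificity(response, word_rareness):
    """Specificity: mean rareness of the words in the response.
    word_rareness: dict mapping word -> rareness score (e.g. some inverse-frequency measure)."""
    scores = [word_rareness.get(w, 0.0) for w in response.split()]
    return sum(scores) / max(len(scores), 1)

def response_relatedness(response, last_utterance, embed):
    """Response-relatedness: cosine similarity between sentence embeddings.
    embed: any function mapping a sentence to a vector (e.g. averaged word vectors)."""
    a, b = embed(response), embed(last_utterance)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def asks_question(response):
    """Question-asking: an utterance counts as a question iff it contains '?'."""
    return "?" in response
```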

Aspects of conversational quality

In our evaluations, we ask Amazon Mechanical Turk crowdworkers (‘Turkers’) to chat with our bots for six turns, and then to rate several different aspects of the conversation (most on a scale from 1 to 4).

[Figure: We collect human evaluations for six lower-level aspects of conversational quality.]

Some of the aspects – such as avoiding repetition, making sense, and fluency – are designed to capture certain basic error classes (like repeating, saying nonsensical things, or disjointed language). The others – interestingness, listening, and inquisitiveness – encompass other important elements of conversation, each of which must be balanced.

[Figure: We also collect human evaluations for two definitions of overall quality: humanness and engagingness.]

Lastly, we ask the Turker to rate the bot with respect to two different notions of overall quality. To measure humanness, we ask the Turker whether they think they spoke to a bot or a human (i.e. a Turing test question). To measure engagingness, we ask the Turker how much they enjoyed the conversation.

Many dialogue studies use either engagingness or humanness as a single stand-alone quality metric. In particular, in the ConvAI2 competition, only engagingness was used for human evaluation. Given that we use the exact same wording of the engagingness question, our evaluation is a superset of ConvAI2’s.

Control methods

In the recent neural sequence generation literature, there are many proposed methods to generate text with some desired attribute. However, due to NLG evaluation difficulties, it’s not clear which of these control methods is most effective. Furthermore, many of these methods are attribute-specific – meaning they’re designed to control just one particular attribute of the text (e.g. specificity), rather than any general attribute.

In this work, we use two simple existing methods for general-purpose control, and use them to control all four text attributes. Aside from helping us to build a better chatbot, this also allows us to better understand the relative effectiveness of the control methods themselves.

Control method 1: Conditional Training (CT)

A standard sequence-to-sequence model learns $p(y|x)$, the conditional probability of the output text $y$ given the input text $x$.

A Conditional Training model learns $p(y|x, z)$, the conditional probability of the output text $y$ given the input text $x$ and a control variable $z$, which specifies the desired output attribute. For example, to control specificity, we might set $z$ to HIGH or LOW to get a very specific or a very generic response to What’s your favorite hobby?

[Figure: Controlling specificity with Conditional Training.]

The CT model is trained to predict $y$ given $x$ and $z$ (where $z$ is provided via automatic annotation). Then at test time, $z$ can be chosen by us.

Several researchers have proposed versions of this method (Kikuchi et al 2016, Peng et al 2018, Fan et al 2018), using various methods to incorporate $z$ into the model. We represent $z$ with a learned embedding, and find that concatenating $z$ to each decoder input is most effective. We can even concatenate multiple control embeddings and learn $p(y|x, z_1, \ldots, z_n)$ if we wish to simultaneously control several attributes.
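As a concrete illustration, here is a minimal PyTorch-style sketch of this concatenation. It is not the authors’ ParlAI implementation; the module and parameter names are illustrative, and the encoder/attention are omitted.

```python
# Minimal sketch of Conditional Training: a learned embedding for the control
# variable z is concatenated to every decoder input token embedding.
import torch
import torch.nn as nn

class CTDecoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim, num_z_values):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.z_emb = nn.Embedding(num_z_values, emb_dim)   # one embedding per control bucket (e.g. LOW/HIGH)
        self.rnn = nn.LSTM(2 * emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_tokens, z, state=None):
        # prev_tokens: (batch, seq_len) token ids; z: (batch,) control-bucket ids
        w = self.word_emb(prev_tokens)                      # (batch, seq_len, emb)
        c = self.z_emb(z).unsqueeze(1).expand_as(w)         # broadcast z to every timestep
        h, state = self.rnn(torch.cat([w, c], dim=-1), state)
        return self.out(h), state                           # logits over the vocabulary
```

At training time $z$ is read off the automatically annotated target response; at test time we simply pass in the bucket we want.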

Control method 2: Weighted Decoding (WD)

Weighted Decoding (Ghazvininejad et al 2017, Baheti et al 2018) is a simple technique, applied during decoding, to increase or decrease the probability of words with certain features.

For example, to control specificity with Weighted Decoding, we use the rareness of a word as a feature. On each step of the decoder, we update the probability of each word in the vocabulary, in proportion to its rareness. The size of the update is controlled by a weight parameter, which we choose – allowing us to encourage more specific or more generic output. In the example below, we increase the probability of rarer words, thus choosing I like watching sunrises rather than I like watching movies.

[Figure: Controlling specificity with Weighted Decoding.]

This method requires no special training and can be applied to modify any decoding algorithm (beam search, greedy search, top-k sampling, etc). Weighted Decoding can be used to control multiple attributes at once, and it can be applied alongside Conditional Training.
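The following is a minimal sketch of the idea for the specificity example above, assuming a single scalar feature (word rareness) added to the log-probabilities at each decoding step; the exact weighting scheme used in the paper may differ.

```python
# Sketch of Weighted Decoding for specificity: at every decoding step, each
# candidate word's log-probability is shifted by weight * feature(word).
# A positive weight pushes the decoder towards rarer (more specific) words;
# a negative weight pushes it towards more generic ones; weight = 0 recovers
# ordinary decoding.
import torch

def reweight_logprobs(log_probs, rareness, weight):
    """log_probs: (vocab_size,) log-probabilities from the decoder at this step.
    rareness: (vocab_size,) precomputed rareness feature for every vocabulary word.
    weight: scalar chosen by us."""
    return log_probs + weight * rareness

# Usage inside a greedy-decoding loop (decoder_step is a hypothetical function):
# log_probs, state = decoder_step(prev_token, state)
# next_token = torch.argmax(reweight_logprobs(log_probs, rareness, weight=2.0))
```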

Research Question 1: How effectively can we control the attributes?

We find that Weighted Decoding is effective to control attributes that can be easily defined at the word-level, like repetition, specificity, and response-relatedness (shown below). However, the method yields degenerate output when the feature weight is too high – for example, devolving into a long list of related words (drinks, espresso, latte, tea).

[Figure: Controlling response-relatedness using Weighted Decoding (WD). By increasing response-relatedness, we obtain a more on-topic response (I do, usually at starbucks).]

Because Weighted Decoding controls attributes using word-level features, it cannot control attributes such as question-asking, which are more naturally defined at the sentence-level.

We find that Conditional Training is effective to control simple attributes of the output text, such as specificity and question-asking. In particular, it usually produces output that is well-formed and has the desired attribute – this makes it less risky than Weighted Decoding (see below for example).

[Figure: Controlling specificity using Weighted Decoding (WD) and Conditional Training (CT). By increasing specificity, we obtain more interesting, personalized responses.]

However, we find Conditional Training is less effective at learning to control relationships between the input and output, such as response-relatedness. In addition, Conditional Training can’t control attributes without sufficient training data – meaning it is ineffective to control repetition, because our training data does not contain the kind of severely repetitive output we wish to prevent.

Overall, though the control methods didn’t work for every attribute, we find that each of our four attributes can be satisfactorily controlled by at least one of the two methods.

Research Question 2: How do the controllable attributes affect conversational quality aspects?

We find that reducing repetition gives large boosts to all human evaluation scores. This is not surprising, as our beam search baseline model repeats itself a lot (especially across utterances), creating a very frustrating user experience. However, this does demonstrate the importance of multi-turn evaluation (vs single response generation), which is needed to detect across-utterance repetition.

By increasing specificity to around human levels, we obtain improvements to interestingness, listening and engagingness. However, finding the right balance is difficult – increasing specificity too much leads to lower making sense and fluency scores.

We also find that by increasing question-asking rate to 65.7%, we achieve better inquisitiveness, interestingness and engagingness. Interestingly, this rate is higher than both the baseline (50%) and humans (28.8%) – implying that, in chitchat settings such as these, more question-asking is often received well.

Lastly, we were unable to obtain an improvement in any of our evaluation categories by controlling response-relatedness. Though we hoped that increasing response-relatedness would create a chatbot that appears more attentive, friendly and interested in the user, Turkers did not rate the ‘more responsive’ bots well. In particular, these bots received lower scores for fluency and making sense, and consequently lower overall scores for humanness and engagingness too. As with specificity, attempting higher response-relatedness is a risky strategy, as it increases the chance of the bot saying something that sounds unnatural or nonsensical.

Research Question 3: Can we use control to make a better chatbot overall?

The first answer is yes! By controlling repetition, specificity and question-asking, we achieve near-human engagingness (i.e. enjoyability) ratings.

[Figure: Engagingness (i.e. enjoyability) ratings for humans and selected models.]

In particular, our raw engagingness score matches that of the ConvAI2 competition winner’s GPT-based model.² This is especially notable because our model is much smaller (a 2-layer LSTM-based model vs 12-layer Transformer-based model), and is trained on 12 times less data.

However, on the humanness (i.e. Turing test) metric, all our models are nowhere near human-level!

[Figure: Humanness (i.e. Turing test) ratings for humans and selected models.]

We’ve observed that our bots are (almost) as engaging as humans, but they’re clearly non-human. What does this mean?

Firstly, our results demonstrate that engagingness is not the same as humanness. While both metrics are frequently used alone for evaluation, our results show the importance of measuring more than one.

Secondly, we suspect that on this task, the human ‘engagingness’ performance may be artificially low. This is because Turkers chatting for money, using artificial personas, are less engaging conversationalists than people who are genuinely chatting for fun. This may explain why the human-level engagingness scores are easy to match.

Conclusions

  • Control is a good idea for your neural sequence generation dialogue system. Using simple control, we matched the performance of a GPT-based contest winner. We expect these techniques would yield even better results when applied to a highly pretrained language model like GPT.
  • We investigated two general-purpose control methods which have complementary strengths and weaknesses. If you want to control a fairly simple attribute of the output text, and you have sufficient training examples of the attribute, then Conditional Training is probably a good idea. If you don’t have the training data, or the attribute is harder to learn, then Weighted Decoding may be more effective – though you need to be careful as the method can produce degenerate output.
  • Multi-turn phenomena (such as repetition across utterances, and question-asking frequency) are important to conversations – so we need multi-turn eval to detect them.
  • Engagingness is not the same as humanness, so think carefully about which to use as an overall quality metric.
  • Paid Turkers are not very engaging conversationalists, and perhaps aren’t even good judges of whether a conversation is engaging. Though it raises other evaluation challenges, humans chatting for fun may be a better source of genuine judgments.
  • Whether you’re a human or a bot: Don’t repeat yourself. Don’t be boring. Ask more questions.

Outlook

This project involved a lot of manual tuning of control parameters, as we attempted to find the best combination of settings for the four attributes. This was a long and laborious process, requiring not only many expensive hours of Turker evaluation time, but also many hours of our own evaluation time as we chatted to the bots.

I’m reminded of QWOP – a simple game in which you press four buttons (Q, W, O and P) to control the individual muscles in a runner’s legs. Though the aim of the game is to run as far as possible, the entertainment comes from the absurd difficulty of the task.

[Figure: QWOP is a game in which you attempt to run by pressing four buttons that each control a different part of the runner's legs.]

Manually controlling four low-level text attributes is not the most principled, nor the most scalable way to build a good conversational dialogue system – just as manually controlling the four parts of the runner’s legs is not the most principled way to run a marathon. However, for the neural sequence generation systems we are using today, this kind of control can be useful and effective – getting us a little further down the track, if not all the way to the finish line.


For further details on this work, check out the paper.

If you’d like to chat to the bots yourself, follow the instructions here – it only takes a few minutes to set up!


Footnotes

  1. Sasha Rush showed a similar diagram during his talk at the NeuralGen 2019 workshop. See the “Open Questions” slide here.

  2. Though we used the exact same wording as ConvAI2 for our Engagingness question, the comparison of raw scores should be considered as a rough indication of a similar overall quality, not an exact comparison.