
ConvAI2: Overview of the competition

There are currently few datasets appropriate for training and evaluating models for non-goal-oriented dialogue systems (chatbots). Equally problematic, there is no standard procedure for evaluating such models beyond the classic Turing test.

The aim of our competition is therefore to establish a concrete scenario for testing chatbots that aim to engage humans, and to become a standard evaluation tool that makes such systems directly comparable.

This is the second Conversational Intelligence (ConvAI) Challenge. The previous one was conducted under the scope of the NIPS 2017 Competitions track. This year we aim to improve over last year by:

- providing a dataset from the beginning, Persona-Chat;
- making the conversations more engaging for humans;
- using a simpler evaluation process (automatic evaluation, followed by human evaluation).

Prize

The winning entry in the human evaluations will receive $20,000 in Mechanical Turk funding, in order to encourage further data collection for dialogue research. The winner of the automatic metrics track will receive $5,000 in AWS compute.

News

Dataset

Dialogues collected during the DeepHack.Chat hackathon and the ‘wild’ evaluation round are available online.

Human Evaluation Leaderboard

| Rank | Creator | Rating | Persona detect |
|------|---------|--------|----------------|
| 1 | 🍐 🍌 Lost in Conversation [code] | 3.11 🍌 | 0.9 |
| 2 | 🍐 🍎 🍎 🍎 🤗 (Hugging Face) | 2.68 | 0.98 |
| 3 | 🍐 Little Baby (AI小奶娃) | 2.44 | 0.79 |
| 4 | 🍐 Mohd Shadab Alam | 2.33 | 0.93 |
| 5 | 🍐 Happy Minions | 1.92 | 0.46 |
| 6 | 🍐 ADAPT Centre | 1.6 | 0.93 |
| - | KV Profile Memory (ParlAI team) | 2.44 | 0.76 |
| - | Human (MTurk) | 3.48 | 0.96 |

Automatic Evaluation Leaderboard (hidden test set)

| Rank | Creator | PPL | Hits@1 | F1 |
|------|---------|-----|--------|----|
| 1 | 🍐 🤗 (Hugging Face) | 16.28 🍎 | 80.7 🍎 | 19.5 🍎 |
| 2 | 🍐 ADAPT Centre | 31.4 | - | 18.39 |
| 3 | 🍐 Happy Minions | 29.01 | - | 16.01 |
| 4 | 🍐 High Five | - | 65.9 | - |
| 5 | 🍐 Mohd Shadab Alam | 29.94 | 13.8 | 16.91 |
| 6 | 🍐 Lost in Conversation | - | 17.1 | 17.77 |
| 7 | 🍐 Little Baby (AI小奶娃) | - | 64.8 | - |
| 8 | Sweet Fish | - | 45.7 | - |
| 9 | 1st-contact | 31.98 | 13.2 | 16.42 |
| 10 | NEUROBOTICS | 35.47 | - | 16.68 |
| 11 | Cats’team | - | 35.9 | - |
| 12 | Sonic | 33.46 | - | 16.67 |
| 13 | Pinta | 32.49 | - | 16.39 |
| 14 | Khai Mai Alt | - | 34.6 | 13.03 |
| 15 | loopAI | - | 25.6 | - |
| 16 | Salty Fish | 34.32 | - | - |
| 17 | Team Pat | - | - | 16.11 |
| 18 | Tensorborne | 38.24 | 12.0 | 15.94 |
| 19 | Team Dialog 6 | 40.35 | 10.9 | 7.27 |
| 20 | Roboy | - | - | 15.83 |
| 21 | IamNotAdele | 66.47 | - | 13.09 |
| 22 | flooders | - | - | 15.47 |
| 23 | Clova Xiaodong Gu | - | - | 14.37 |
| - | Seq2Seq + Attention (ParlAI team) | 29.8 | 12.6 | 16.18 |
| - | Language Model (ParlAI team) | 46.0 | - | 15.02 |
| - | KV Profile Memory (ParlAI team) | - | 55.2 | 11.9 |

🍎 denotes the best performing model for each metric on the hidden test set. The rank is determined by sorting by the minimum rank of a team’s score across the three metrics, with ties broken by the second (and then third) smallest rank. The 🍐 teams have made it to the next round (the top 3 in each metric). For these teams: please make sure your GitHub repo contains an interactive script (see here for an example) that allows your model to run in interactive mode. (To be clear, this must be the _same model_ that we evaluated for the leaderboard above.)
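
To make the ranking rule concrete, here is a small Python sketch of one possible reading of it (team names and scores are placeholders, not leaderboard values): each team is ranked per metric, missing metrics count as worst, and teams are ordered by the sorted tuple of their metric ranks (best rank first, ties broken by the second and then third smallest).

```python
import math

# Placeholder scores; lower is better for PPL, higher is better for Hits@1 and F1.
scores = {
    'Team A': {'ppl': 16.3, 'hits@1': 80.7, 'f1': 19.5},
    'Team B': {'ppl': 31.4, 'f1': 18.4},
    'Team C': {'hits@1': 65.9},
}
lower_is_better = {'ppl': True, 'hits@1': False, 'f1': False}

def metric_ranks(metric):
    # Rank only the teams that reported this metric (1 = best).
    entries = sorted(
        ((team, s[metric]) for team, s in scores.items() if metric in s),
        key=lambda ts: ts[1],
        reverse=not lower_is_better[metric],
    )
    return {team: rank for rank, (team, _) in enumerate(entries, start=1)}

ranks = {metric: metric_ranks(metric) for metric in lower_is_better}

def sort_key(team):
    # Sorted tuple of the team's ranks: minimum rank first, then second, then third.
    return sorted(ranks[metric].get(team, math.inf) for metric in lower_is_better)

leaderboard = sorted(scores, key=sort_key)
print(leaderboard)  # ['Team A', 'Team B', 'Team C']
```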

Models by the ParlAI team are baselines, not entries in the competition; code is included for those models.

Note that the scripts that you can run locally will give metrics on the validation set, not the hidden test set which is reported here (for that, you need to submit your code, see below).

Validation Leaderboard

We also provide an additional validation set leaderboard here.

PersonaChat ConvAI2 Dataset

The Persona-Chat training set consists of conversations between crowdworkers who were randomly paired and asked to act the part of a provided persona (randomly assigned, and created by another set of crowdworkers). The paired workers were asked to chat naturally and to get to know each other during the conversation. This produces interesting and engaging conversations that learning agents can try to mimic.

The Persona-Chat task aims to model normal conversation when two interlocutors meet for the first time and get to know each other. Their aim is to be engaging, to learn about the other’s interests, to discuss their own interests, and to find common ground. The task is technically challenging as it involves both asking and answering questions, and maintaining a persistent persona, which is provided.

Conversing with current chit-chat models for even a short amount of time quickly exposes their weaknesses. Common issues with chit-chat models include:

- the lack of a consistent personality, as they are typically trained over many dialogs, each with different speakers;
- the lack of an explicit long-term memory, as they are typically trained to produce an utterance given only the recent dialogue history;
- a tendency to produce non-specific answers like “I don’t know”.

This competition aims to find models that address those specific issues. The baseline systems we have already run indicate that there is hope we can make steps in that direction.

The dataset consists of 164,356 utterances in 10,981 dialogs, some of which are set aside for validation. The speaker pairs each have assigned profiles coming from a set of 1155 possible personas, each consisting of at least 4 profile sentences, with 200 never-before-seen personas set aside for validation. To prevent models from exploiting trivial word overlap, we crowdsourced additional rewritten sets of the same personas, with related sentences that are rephrases, generalizations or specializations, rendering the task much more challenging.

More details can be found in the paper describing the dataset.

The competition dataset is available in our open source system ParlAI, more specifically here. That is, install ParlAI and then do:

python examples/display_data.py --task convai2 --datatype train

to look at the data.
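
If you prefer to browse the data programmatically rather than via the command above, the following sketch mirrors what display_data.py does (module paths follow the ParlAI version in use at the time and may differ in later releases):

```python
from parlai.core.params import ParlaiParser
from parlai.core.worlds import create_task
from parlai.agents.repeat_label.repeat_label import RepeatLabelAgent

# Build options exactly as the command-line example above does.
parser = ParlaiParser()
opt = parser.parse_args(['--task', 'convai2', '--datatype', 'train'])

# RepeatLabelAgent simply echoes the label, which is enough for browsing data.
agent = RepeatLabelAgent(opt)
world = create_task(opt, agent)

for _ in range(5):  # show the first few examples
    world.parley()
    print(world.display())
```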

Source code for the baseline methods is also provided in ParlAI here, including the training loop and evaluation code for the automatic evaluation metrics. Baseline results are provided in the paper, although the dataset is now larger. (We have therefore run new baselines, which appear on the leaderboard above.)

Because the original test set has been released, we have crowdsourced further data for a hidden test set, unseen by the competitors, which is used for the automatic evaluation.

Evaluation

Competitors’ models will be compared in three ways: (i) automatic evaluation metrics computed on a new test set hidden from the competitors, (ii) evaluation by paid workers on Amazon Mechanical Turk, and (iii) ‘wild’ live evaluation by volunteers having conversations with the bots.

The winning dialogue systems will be chosen based on these scores.

Metrics

There are three types of metrics we will evaluate:

- Automated metrics: perplexity, F1 and hits@k, computed on the hidden test set.
- Amazon Mechanical Turk: paid workers chat with the models and rate the conversations (see the Rating and Persona detect columns in the human evaluation leaderboard above).
- ‘Wild’ live chat: volunteers converse with the models and score them under the same protocol.

Protocol

We will run live volunteer chat throughout the competition so that competitors can try out their bots talking to humans and collect live data, if they so wish (they are also free to use only the fixed train/test format at this stage).

The automated metrics will be used to obtain a shortlist of the best performing systems, likely the top 3 scoring systems for each of the three metrics (perplexity, F1 and hits@k). If those three leaderboards feature the same models at the top, we will take systems further down the leaderboards, up to a maximum of 10. These systems will be evaluated in the final live experiments on Mechanical Turk and via volunteers, using the same scoring protocols already described.

During NeurIPS the ‘wild’ live conversation can continue, and the best performing systems will be showcased and conversed with.

We will declare winners in both the automated metrics track and the live evaluation track (the latter is considered the grand prize, being more important). The live evaluation score will be a weighted average of the Mechanical Turk and ‘wild’ (volunteer) scores. Finally, the solutions and any data collected will be made open source to the community.

Rules

Model Submission

To submit an entry, create a private repo with your model that works with our evaluation code, and share it with the following GitHub accounts: emilydinan, klshuster, jaseweston, JackUrb, varvara-l, madrugado.

See this directory for example baseline submissions.

You are free to use any framework (e.g. PyTorch, TensorFlow, C++, ...) as long as you can wrap your model with ParlAI for the evaluation. If you use PyTorch, your models should work with PyTorch 0.4. The top-level README should tell us your team name, your model name, and where the eval_ppl.py, eval_hits.py, etc. files are so that we can run them. Those scripts should give the numbers on the validation set; please also include those numbers in the README so we can check that we get the same results. We will then run the automatic evaluations on the hidden test set and update the leaderboard. You can submit at most once per month. We will use the same submitted code for the top performing models to compute human evaluations once the submission system is locked on September 30th.
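
For orientation, a ParlAI wrapper is essentially an Agent subclass that implements observe() and act(). The sketch below is illustrative only (the class name and canned reply are hypothetical placeholders); a real submission must load your trained model and also work with the eval_ppl.py / eval_hits.py entry points mentioned above.

```python
from parlai.core.agents import Agent


class MySubmissionAgent(Agent):
    """Illustrative wrapper; replace the canned reply with real model inference."""

    def __init__(self, opt, shared=None):
        super().__init__(opt, shared)
        self.id = 'MySubmissionAgent'
        # TODO: load your trained model checkpoint here (e.g. a PyTorch 0.4 model).

    def observe(self, observation):
        # The observation contains the persona lines and the partner's last utterance.
        self.observation = observation
        return observation

    def act(self):
        # TODO: condition on self.observation and decode a real response.
        return {'id': self.id, 'text': 'hello ! how are you today ?'}
```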

Schedule

Up until September 30th competitors will be able to submit models (source code) to be evaluated on the hidden test set using automated metrics (which we will run on our servers).

The ability to submit a model for evaluation by the automatic metrics, with results displayed on the leaderboard, will be available by April 6th. The current leaderboards will be visible to all competitors.

‘Wild’ live evaluation can also be performed at this time to obtain evaluation metrics and data, although those metrics will not be used for the final judgement of the systems; they are intended mainly for tuning systems, if competitors so wish.

On September 30th the source code submission system will be locked, and the best performing systems will be evaluated over the following month using Mechanical Turk and the ‘wild’ live evaluation.

Winners will be announced at NeurIPS 2018.

FAQ

  1. Why does the eval_ppl.py script report a different perplexity than my model training/testing logs? The eval_ppl script will give different performance than the perplexity reported by our baseline model training scripts. For instance, the perplexity reported by the seq2seq model includes “easier” tokens such as the “end” token that the model appends to each target. The separate eval_ppl script does a more careful job of evaluating (so results are comparable across models) and does not include these extra special tokens. (See the worked example after this FAQ.)

  2. Which personas (self:original, self:revised, etc.) will my model be evaluated on? All submissions for the leaderboard will be evaluated using the “self:original” personas.

  3. How do I view the data? The competition dataset is available in our open source system ParlAI, more specifically here. That is, install ParlAI and then run python examples/display_data.py --task convai2 --datatype train to look at the data.

  4. What is the best way to store my model file for submission to the competition? We recommend storing your model files and any other large files you are using with Git-LFS. Please see more information here.

  5. Can I submit different models for different metrics? No. This is supposed to measure the performance of a single model in varying ways.
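
To illustrate FAQ item 1, here is a toy calculation with made-up token losses showing how a nearly free trailing “end” token pulls the reported perplexity down; this is why training logs and eval_ppl.py can disagree:

```python
import math

# Hypothetical per-token negative log-likelihoods for one target sentence.
content_nll = [3.2, 2.9, 3.5, 3.1]   # ordinary words the model must predict
end_nll = [0.1]                      # trailing "end" token, almost free to predict

ppl_with_end = math.exp(sum(content_nll + end_nll) / (len(content_nll) + 1))
ppl_without_end = math.exp(sum(content_nll) / len(content_nll))

print(f'perplexity including the end token: {ppl_with_end:.1f}')   # ~12.9
print(f'perplexity excluding the end token: {ppl_without_end:.1f}')  # ~23.9
```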

DeepHack.Chat Hackathon

This is not a compulsory part of the competition, but you may also be interested in the following hackathon:

If you submit your solution before the 15th of June, you can participate in the qualification round of the DeepHack.Chat hackathon, which will take place in Moscow on July 2-8. We will select 10 to 12 teams whose systems score best on the automatic metrics and invite them to participate in the final round of the hackathon. At the hackathon, teams will further improve their systems and attend lectures from top researchers in the field. Participants will also take part in the live human evaluation of the other teams’ dialogue systems. The winning team will receive a travel grant to the ConvAI finals at NeurIPS.

If you want to participate in the hackathon, please include a file “TEAM” in your repository when you submit your models. The file should contain the names and emails of the members of your team. A team should consist of no more than 5 people.

Organizing team

The organizing team comes from multiple groups: Moscow Institute of Physics and Technology, Facebook AI Research, University of Montreal, McGill University, and Carnegie Mellon University.

The team consists of: Mikhail Burtsev, Varvara Logacheva, Valentin Malykh, Ryan Lowe, Iulian Serban, Shrimai Prabhumoye, Emily Dinan, Douwe Kiela, Alexander Miller, Kurt Shuster, Arthur Szlam, Jack Urbanek and Jason Weston.

Advisory board: Yoshua Bengio, Alan W. Black, Joelle Pineau, Alexander Rudnicky, Jason Williams.

Partners

Platinum Partner

Gold Partner

ConvAI 2017

The webpage of the 1st NIPS Conversational Intelligence Challenge is available at convai.io/2017.