The Conversational Intelligence Challenge 2 (ConvAI2)

NIPS 2018 Competition

ConvAI2: Overview of the competition

There are currently few datasets appropriate for training and evaluating non-goal-oriented dialogue systems (chatbots); equally problematic, there is no standard procedure for evaluating such models beyond the classic Turing test.

The aim of our competition is therefore to establish a concrete scenario for testing chatbots that aim to engage humans, and to become a standard evaluation tool that makes such systems directly comparable.

This is the second Conversational Intelligence (ConvAI) Challenge; the previous one was conducted as part of the NIPS 2017 Competitions track. This year we aim to improve over last year's challenge.

Prize

The winning entry will receive $20,000 in Mechanical Turk funding, intended to encourage further data collection for dialogue research.

Current Leaderboard

Model                 Creator       PPL      Hits@1   F1
Seq2Seq + Attention   ParlAI team   32.09🍎  16.6     16.18🍎
Language Model        ParlAI team   46.0     -        15.02
KV Profile Memory     ParlAI team   -        56.8🍎   11.9

🍎 denotes the current best performing model for each metric on the hidden test set.

Models by the ParlAI team are baselines, not entries into the competition; code is included for those models.

Note that the scripts you can run locally report metrics on the validation set, not the hidden test set shown here (for that, you need to submit your code; see below). We also provide a validation set leaderboard here for your convenience.

PersonaChat ConvAI2 Dataset

The Persona-Chat training set consists of conversations between crowdworkers who were randomly paired and asked to act the part of a given persona (randomly assigned, and created by another set of crowdworkers). The paired workers were asked to chat naturally and to get to know each other during the conversation. This produces interesting and engaging conversations that learning agents can try to mimic.

The Persona-Chat task aims to model normal conversation when two interlocutors first meet and get to know each other. Their aim is to be engaging, to learn about the other's interests, to discuss their own interests, and to find common ground. The task is technically challenging as it involves both asking and answering questions, and maintaining a persistent persona, which is provided.

Conversing with current chit-chat models for even a short amount of time quickly exposes their weaknesses. Common issues include (i) the lack of a consistent personality, as models are typically trained over many dialogs, each with different speakers; (ii) the lack of an explicit long-term memory, as they are typically trained to produce an utterance given only the recent dialogue history; and (iii) a tendency to produce non-specific answers such as "I don't know".

This competition aims to find models that address those specific issues. The baseline systems we have already run indicate that there is hope we can make steps in that direction.

The dataset consists of 164,356 utterances across 10,981 dialogs, some of which are set aside for validation. Each speaker in a pair is assigned a profile drawn from a set of 1,155 possible personas, each consisting of at least 5 profile sentences; 200 never-before-seen personas are set aside for validation. To avoid models that simply exploit trivial word overlap, we crowdsourced additional rewritten versions of the same personas, with related sentences that are rephrases, generalizations or specializations, rendering the task much more challenging. Evaluation will take place on both types of persona.
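As a purely illustrative sketch (not the exact release format; the field names, persona sentences and turns below are made up), a single training episode can be thought of as each speaker's persona sentences paired with the alternating dialogue turns:

# Hypothetical, simplified view of one Persona-Chat episode; the real
# release stores this in ParlAI's text format, so treat the field names
# and example sentences here as made up for illustration only.
episode = {
    "your_persona": [          # ~5 short sentences describing the speaker
        "i like to ski.",
        "i have two dogs.",
        "my favorite food is pizza.",
        "i work as a teacher.",
        "i recently moved to a new city.",
    ],
    "dialogue": [              # alternating turns between the two workers
        ("speaker_1", "hi, how are you doing today?"),
        ("speaker_2", "pretty good, just got back from walking my two dogs."),
        ("speaker_1", "nice! i just got home from teaching, ready for pizza night."),
    ],
}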

More details can be found in the paper describing the dataset.

The competition dataset is available in our open source system ParlAI, more specifically here. That is, install ParlAI and then do:

python examples/display_data.py --task convai2 --datatype train

to look at the data.
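For programmatic access, the same data can be iterated from Python. The following is a minimal, unofficial sketch assuming the ParlAI core API used by examples/display_data.py (ParlaiParser, RepeatLabelAgent, create_task); module paths and defaults may differ across ParlAI versions, so the command above remains the canonical reference.

# Minimal sketch: iterate over the convai2 training data with ParlAI.
# Assumes the API used by examples/display_data.py; paths may vary by version.
from parlai.core.params import ParlaiParser
from parlai.core.worlds import create_task
from parlai.agents.repeat_label.repeat_label import RepeatLabelAgent

parser = ParlaiParser()
opt = parser.parse_args(['--task', 'convai2', '--datatype', 'train'])

agent = RepeatLabelAgent(opt)    # dummy agent that simply repeats the labels
world = create_task(opt, agent)  # pairs the convai2 teacher with the agent

for _ in range(5):               # print a few exchanges
    world.parley()
    print(world.display())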

Source code for baseline methods for the competition is also provided in ParlAI here, including the training loop and evaluation code for the automatic evaluation metrics. Baseline results are reported in the paper, although the dataset is now larger; we have therefore run new baselines, which appear on the leaderboard above.

Because the original test set was publicly released, we have crowdsourced further data for a hidden test set, unseen by the competitors, for automatic evaluation.

Evaluation

Competitors' models will be compared in three ways: (i) automated evaluation metrics on a new test set hidden from the competitors; (ii) evaluation on Amazon Mechanical Turk; and (iii) 'wild' live evaluation by volunteers having conversations with the bots.

The winning dialogue systems will be chosen based on these scores.

Metrics

There are three types of metrics we will evaluate: automated metrics (perplexity, F1 and hits@k, as reported on the leaderboard above; the word-overlap metrics are sketched below); human evaluation scores collected on Amazon Mechanical Turk; and scores from the 'wild' live chats with volunteers.
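To make the word-overlap metrics concrete, here is a rough, unofficial sketch of how unigram F1 and hits@k are commonly computed. ParlAI's own implementation additionally normalizes text (e.g. lower-casing and stripping punctuation), so treat this only as an illustration.

from collections import Counter

def unigram_f1(prediction: str, reference: str) -> float:
    """Word-overlap F1 between a model reply and the gold reply.
    Illustrative only: the official metric also normalizes the text."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def hits_at_k(ranked_candidates: list, gold: str, k: int = 1) -> float:
    """1.0 if the gold response appears in the model's top-k ranked
    candidates (ConvAI2 ranks 20 candidates), else 0.0; averaged over examples."""
    return float(gold in ranked_candidates[:k])

print(unigram_f1("i love to ski in winter", "i love skiing"))     # ≈ 0.44
print(hits_at_k(["i love skiing", "hi there"], "i love skiing"))  # 1.0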

Protocol

We will run live volunteer chat throughout the competition so that competitors can try out their bots against humans and collect live data if they wish (they are also free to use only the fixed train/test format at this stage).

The automated metrics will be used to obtain a shortlist of the best-performing systems, likely the top 3 scoring systems for each of the three metrics (perplexity, F1 and hits@k). If those three leaderboards feature the same models at the top, we will take systems further down the leaderboards, up to a maximum of 10. These systems will be evaluated in the final live experiments on Mechanical Turk and by volunteers, using the same scoring protocols described above.
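Read as a procedure, the shortlisting rule above amounts to taking the top entries from each automated-metric leaderboard and walking further down the boards when they overlap. The sketch below is only one possible reading of that rule, with hypothetical model names; it is not the organizers' actual tooling.

# Illustration of the shortlisting rule: take the top-3 systems from each
# automated-metric leaderboard, merge without duplicates, and keep walking
# down the boards when they overlap, up to a cap of 10 shortlisted systems.
# All model names are hypothetical.
def shortlist(leaderboards: dict, per_metric: int = 3, cap: int = 10) -> list:
    chosen, depth = [], per_metric
    while len(chosen) < cap:
        before = len(chosen)
        for ranking in leaderboards.values():        # each ranking is best-first
            for model in ranking[:depth]:
                if model not in chosen and len(chosen) < cap:
                    chosen.append(model)
        if len(chosen) == before:                    # no new models anywhere
            break
        depth += 1                                   # look further down the boards
    return chosen

boards = {
    "perplexity": ["modelA", "modelB", "modelC", "modelD"],
    "hits@1":     ["modelB", "modelA", "modelE", "modelF"],
    "f1":         ["modelA", "modelC", "modelB", "modelG"],
}
print(shortlist(boards))
# -> ['modelA', 'modelB', 'modelC', 'modelE', 'modelD', 'modelF', 'modelG']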

During NIPS the 'wild' live conversation can continue, and the best-performing systems will be showcased and conversed with.

We will declare winners both in the automated metrics tracks and in the live evaluations (the latter being considered the grand prize, as it is more important). The live evaluation score will be the average of the Turk and wild (volunteer) scores. Finally, the solutions and any data collected will be open-sourced to the community.

Rules

Model Submission

To submit an entry, create a private repo with your model that works with our evaluation code, and share it with the following github accounts: edinan@fb.com, jaseweston@gmail.com, jju@fb.com, kshuster@fb.com.

See this directory for example baseline submissions.

We will then run the automatic evaluations and update the leaderboard. You can submit at most once per month. Once the submission system is locked on September 30th, the same submitted code for the top-performing models will be used to compute the human evaluations.

Schedule

Up until September 30th competitors will be able to submit models (source code) to be evaluated on the hidden test set using automated metrics (which we will run on our servers).

The ability to submit a model for evaluation by automatic metrics, with results displayed on the leaderboard, will be available by April 6th. The current leaderboards will be visible to all competitors.

'Wild' live evaluation can also be performed at this time to obtain evaluation metrics and data; those metrics will not be used for the final judgement of the systems, but rather for tuning systems if the competitors so wish.

On September 30th the source code submission system will be locked, and the best-performing systems will be evaluated over the following month using Mechanical Turk and the 'wild' live evaluation.

Winners will be announced at NIPS 2018.

Organizing team

The organizing team comes from multiple groups: the Moscow Institute of Physics and Technology, Facebook AI Research, the University of Montreal, McGill University and Carnegie Mellon University.

The Team consists of: Mikhail Burtsev, Varvara Logacheva, Valentin Malykh, Ryan Lowe, Iulian Serban, Shrimai Prabhumoye, Emily Dinan, Douwe Kiela, Alexander Miller, Kurt Shuster, Arthur Szlam, Jack Urbanek and Jason Weston.

Advisory board: Yoshua Bengio, Alan W. Black, Joelle Pineau, Alexander Rudnicky, Jason Williams.

ConvAI 2017

The webpage of the 1st NIPS Conversational Intelligence Challenge is available at convai.io/2017.