There are currently few datasets appropriate for training and evaluating systems that are neither domain-specific goal-oriented dialogue systems nor chit-chat bots.
The aim of our competition is therefore to establish a concrete scenario for testing conversational systems that aim to satisfy users' information needs in a conversational manner.
The goal of the current challenge is to explore the situation where users ask ambiguous questions and the system provides clarifications. Unlike traditional search systems, where users can browse answers from multiple sources, conversational systems are limited to one answer per user request. A better understanding is therefore needed of when and how the system should clarify the user's need or help them refine it. A detailed description of the challenge can be found here.
Stay tuned for the prize announcements.
**Validation Leaderboard.** We will also provide an additional leaderboard for the validation set.
The challenge will be run in two stages. The first stage is essentially a classification and ranking problem: for a given user query, a participating system must decide whether a clarification is needed and, if so, provide the best clarifying question.
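To make the expected behavior concrete, here is a minimal sketch of a Stage 1 system. The function names, the toy scoring heuristics, and the threshold are illustrative assumptions only, not part of the official task definition.

```python
from typing import List, Optional

def ambiguity_score(query: str) -> float:
    """Toy classifier: treat very short queries as ambiguous.
    Replace with a trained model."""
    return 1.0 if len(query.split()) < 3 else 0.0

def relevance_score(query: str, question: str) -> float:
    """Toy ranker: token overlap between query and clarifying question.
    Replace with a trained model."""
    q, c = set(query.lower().split()), set(question.lower().split())
    return len(q & c) / (len(c) or 1)

def stage1_system(query: str,
                  candidates: List[str],
                  threshold: float = 0.5) -> Optional[str]:
    """Return the best clarifying question, or None if the query
    needs no clarification -- the two decisions Stage 1 asks for."""
    if ambiguity_score(query) < threshold:
        return None  # the query is clear enough to answer directly
    return max(candidates, key=lambda q: relevance_score(query, q))

# Example: a short, ambiguous query triggers a clarifying question.
print(stage1_system("dinosaur", ["Which dinosaur era are you interested in?",
                                 "Do you want dinosaur pictures?"]))
```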
The competition is run in two stages, each with its own dataset.
In Stage 1, we provide the participants with an extended version of the Qulac dataset. The extension includes:
| Statistic | Value |
| --- | --- |
| # facets | ~ 1,300 |
| # total questions | 3,929 |
| # single-turn conversations | ~ 16K |
| # multi-turn conversations | ~ 2 million |
| # documents | ~ 2.6 million |
TODO: finish description
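Assuming the data is distributed as tab-separated files, as in the ClariQ repository linked below, loading and inspecting it takes a few lines of pandas. The file name here is an assumption; adjust it to the released package.

```python
import pandas as pd

# "train.tsv" is an assumed file name; adjust to the released package.
train = pd.read_csv("train.tsv", sep="\t")

# Inspect the provided fields, e.g. the initial request, the
# clarification-need label, and the clarifying questions.
print(train.columns.tolist())
print(train.head())
```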
In the second stage, the participating systems are put in front of human users and rated on their overall performance. At each dialogue step, a system should either give a factual answer to the user's query or ask for a clarification. Participants will therefore need to strike a balance between asking too many questions and providing irrelevant answers.
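One simple way to strike this balance is a confidence-thresholded policy with a budget on clarification turns. The sketch below is purely illustrative: the threshold, the budget, and the `best_answer`/`best_question` placeholders are assumptions, not part of the challenge.

```python
def best_answer(state):
    """Placeholder: return a canned answer with a mock confidence.
    Replace with your retriever or generator."""
    return "Here is what I found ...", 0.4 if state["clarifications_asked"] == 0 else 0.9

def best_question(state):
    """Placeholder: return a canned clarifying question.
    Replace with your question selector."""
    return "Could you tell me more about what you are looking for?"

def next_turn(state, max_clarifications=2, confidence_threshold=0.7):
    """Ask a clarifying question only while answer confidence is low and
    the question budget is not spent; otherwise commit to an answer."""
    answer, confidence = best_answer(state)
    if confidence >= confidence_threshold:
        return ("answer", answer)
    if state["clarifications_asked"] < max_clarifications:
        state["clarifications_asked"] += 1
        return ("clarify", best_question(state))
    return ("answer", answer)  # budget spent: answer with what we have

# Example: the first turn clarifies, the second answers.
state = {"clarifications_asked": 0}
print(next_turn(state))  # ('clarify', ...)
print(next_turn(state))  # ('answer', ...)
```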
Note that the setup of this stage is quite different from Stage 1: participating systems will likely need to operate as generative models rather than retrieval models. One option is to cast the problem as generative from the beginning and to solve the retrieval part of Stage 1, e.g., by ranking the offered candidates by their likelihood.
Alternatively, one may solve Stage 2 by retrieving a list of candidate answers (e.g., by invoking the Wikipedia API or the Chat Noir API described above) and ranking them as in Stage 1.
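As a sketch of the likelihood-ranking idea, the snippet below scores candidate answers with an off-the-shelf GPT-2 from HuggingFace Transformers. The model choice is just a stand-in for whatever generative model a participant actually trains.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def avg_log_likelihood(context: str, candidate: str) -> float:
    """Average per-token log-likelihood of `candidate` given `context`."""
    context_ids = tokenizer.encode(context)
    candidate_ids = tokenizer.encode(" " + candidate)
    input_ids = torch.tensor([context_ids + candidate_ids])
    with torch.no_grad():
        logits = model(input_ids).logits          # [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    for i, tok in enumerate(candidate_ids):
        pos = len(context_ids) + i - 1            # logit at `pos` predicts the token at `pos + 1`
        total += log_probs[0, pos, tok].item()
    return total / len(candidate_ids)

def rank_candidates(context, candidates):
    """Order retrieved candidates by generative likelihood, best first."""
    return sorted(candidates,
                  key=lambda c: avg_log_likelihood(context, c),
                  reverse=True)
```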
Competitors' models will be compared in three ways, and the winning dialogue systems will be chosen based on these scores. The three types of metrics we will evaluate are:
Automated metrics - automatic evaluation scripts (e.g., eval_ppl.py and eval_hits.py, as described in the submission instructions below) will be run on the hidden test set.
Crowd workers - paid crowd workers will chat with the models following provided instructions, and these conversations will be used to rate the systems' performance.
'Wild' Live Chat with Volunteers - finally, we will solicit volunteers to chat with the models in a similar way to the crowdsourcing setup. As volunteers, unlike crowd workers, are not paid and will likely not follow the instructions as closely, the distribution will likely be different, hence serving as a test of the robustness of the models. This setup will be hosted through the Messenger and Telegram APIs; a minimal hook-up sketch follows below.
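As an illustration of how a model could be exposed through one of these channels, here is a minimal Telegram bot sketch using the python-telegram-bot library (v13-style API). The token and the `my_model_reply` function are placeholders for your own credentials and dialogue system.

```python
from telegram.ext import Updater, MessageHandler, Filters

def my_model_reply(text: str) -> str:
    """Placeholder: call your dialogue system here."""
    return "Could you clarify what you mean by '%s'?" % text

def on_message(update, context):
    # Each incoming user message is passed to the model; its answer
    # (or clarifying question) is sent back as the bot's reply.
    update.message.reply_text(my_model_reply(update.message.text))

updater = Updater(token="YOUR_BOT_TOKEN", use_context=True)  # placeholder token
updater.dispatcher.add_handler(
    MessageHandler(Filters.text & ~Filters.command, on_message))
updater.start_polling()
updater.idle()
```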
See https://github.com/aliannejadi/ClariQ for example baseline submissions.
You are free to use any framework (e.g., PyTorch, TensorFlow, C++, ...) as long as you can wrap your model for the evaluation. The top-level README should tell us your team name, your model name, and where the eval_ppl.py, eval_hits.py, etc. files are so that we can run them. Those scripts should produce the numbers on the validation set. Please also include those numbers in the README so that we can check that we get the same results. We will then run the automatic evaluations on the hidden test set and update the leaderboard. You can submit at most once per month. Once the submission system is locked on DATE, we will use the same submitted code of the top-performing models for the human evaluations.
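For concreteness, a top-level README along the following lines would contain everything we need; the exact layout and paths are only a suggestion.

```
Team name:  <your team name>
Model name: <your model name>

Evaluation scripts (paths within this repository):
  eval_ppl.py  -> ./eval/eval_ppl.py
  eval_hits.py -> ./eval/eval_hits.py

Validation-set results (our run should reproduce these):
  ppl:    <number>
  hits@1: <number>
```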
Up until DATE, competitors will be able to submit models (source code) to be evaluated on the hidden test set using automated metrics (which we will run on our servers).
The ability to submit a model for evaluation by automatic metrics, with results displayed on the leaderboard, will be available by DATE. The current leaderboards will be visible to all competitors.
'Wild' live evaluation can also be performed at this time to obtain evaluation metrics and data; those metrics will not be used for the final judgement of the systems, but rather for tuning them, if competitors so wish.
On DATE, the source-code submission system will be locked, and the best-performing systems will be evaluated over the following month using crowd workers and the 'wild' live evaluation.
Winners will be announced at EMNLP 2020.
The challenge is run together with the Search-oriented Conversational AI (SCAI) workshop at EMNLP 2020 and builds upon the two previously run ConvAI challenges: ConvAI @ NIPS 2017 (The Conversational Intelligence Challenge) and ConvAI2 @ NeurIPS 2018 (The Persona-Chat Challenge).