ClariQ Challenge (ConvAI3)

SCAI workshop @ EMNLP 2020

SCAI Challenge / ConvAI3: Overview of the competition

There are currently few datasets suitable for training and evaluating conversational systems that are neither domain-specific goal-oriented dialogue systems nor chit-chat bots.

The aim of our competition is therefore to establish a concrete scenario for testing conversational systems that aim to satisfy users' information needs in a conversational manner.

The goal of the current challenge is to explore the situation in which users ask ambiguous questions and the system provides clarifications. Unlike traditional search systems, where users can browse answers from multiple sources, conversational systems are limited to one answer per user request. Therefore, a better understanding is needed of when and how the system should clarify the user's need or help them refine it. A detailed description of the challenge can be found here.

Registration form

Please register here.

Github repo

https://github.com/aliannejadi/ClariQ

Prize

Stay tuned for the prize announcements.

News

Stage 1: Automatic Evaluation Leaderboard (held-out test set)

Validation Leaderboard: we will also provide an additional leaderboard for the validation set.

Stage 2: Human Evaluation Leaderboard

TBD

Challenge Design

The challenge will be run in two stages:

  1. Stage 1: static dataset. Participants are given a static dataset, similar to Qulac, which they are free to use for training and development. At the end of the first stage, their systems are ranked on a held-out dataset of the same nature.
  2. Stage 2: human-in-the-loop. The top-N systems from Stage 1 are exposed to real users, and their responses, both answers and clarifying questions, are rated by those users.

The first stage is essentially a classification and ranking problem: for a given user query, a participating system must decide whether a clarification is needed and, if so, provide the best clarifying question.
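To make that interface concrete, below is a minimal, unofficial sketch in Python. It assumes a plain list of candidate clarifying questions (a question bank), uses TF-IDF similarity purely as a placeholder scorer, and the threshold-based "clarification needed" decision is likewise a stand-in for a model learned from the training data.

```python
# Minimal, unofficial sketch of the Stage 1 task: decide whether a query needs
# clarification and rank candidate clarifying questions. TF-IDF similarity and
# the fixed threshold below are placeholders, not the challenge baseline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def rank_clarifying_questions(query, question_bank, need_threshold=0.2):
    """Return (clarification_needed, ranked_questions) for one user query."""
    vectorizer = TfidfVectorizer()
    # Fit on the question bank plus the query so they share one vector space.
    matrix = vectorizer.fit_transform(list(question_bank) + [query])
    question_vecs, query_vec = matrix[:-1], matrix[-1]
    scores = cosine_similarity(query_vec, question_vecs).ravel()
    ranked = sorted(zip(question_bank, scores), key=lambda x: x[1], reverse=True)
    # Placeholder decision rule: a real system would learn this decision from
    # the training data rather than use a fixed similarity threshold.
    clarification_needed = ranked[0][1] >= need_threshold
    return clarification_needed, ranked


if __name__ == "__main__":
    bank = [
        "are you interested in a specific type of dinosaur",
        "do you want to know when dinosaurs went extinct",
        "would you like nutrition facts for this product",
    ]
    needed, ranked = rank_clarifying_questions("tell me about dinosaurs", bank)
    print(needed, [q for q, _ in ranked[:2]])
```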

Datasets

The competition is run in two stages, each with a unique dataset.

Stage 1

In Stage 1, we provide participants with an extended version of the Qulac dataset. The extension includes:

Dataset statistics:

| Feature | Value |
| --- | --- |
| # topics | 298 |
| # facets | ~1,300 |
| # total questions | 3,929 |
| # single-turn conversations | ~16K |
| # multi-turn conversations | ~2 million |
| # documents | ~2.6 million |

TODO: finish description
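For orientation, here is a hedged sketch of loading and inspecting the Stage 1 data with pandas. The file name and column names used below (train.tsv, initial_request, question) are assumptions about the release format; the GitHub repo is the authoritative reference for the actual schema.

```python
# Hedged sketch of inspecting the Stage 1 data. The file name and columns
# (train.tsv, initial_request, question) are assumptions -- consult the
# GitHub repo for the actual schema.
import pandas as pd

train = pd.read_csv("data/train.tsv", sep="\t")
print(train.shape)
print(train.columns.tolist())

# How many distinct clarifying questions does each ambiguous request have?
per_request = train.groupby("initial_request")["question"].nunique()
print(per_request.describe())
```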

Stage 2

At this stage the participating systems are put in front of human users and rated on their overall performance. At each dialogue step, a system should either give a factual answer to the user's query or ask for a clarification.

Participants will therefore need to strike a balance between asking too many questions and providing irrelevant answers.
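As an illustration of that trade-off, the toy policy below answers once it is confident enough and otherwise asks a clarifying question, but caps the number of questions per dialogue. All names and thresholds here are illustrative, not part of the challenge API.

```python
# Toy Stage 2 turn policy: answer when confident, otherwise clarify, but never
# ask more than max_questions clarifying questions. Purely illustrative.
from dataclasses import dataclass


@dataclass
class TurnDecision:
    action: str  # "answer" or "clarify"
    text: str


def decide_turn(best_answer, answer_confidence, best_question,
                questions_asked, max_questions=2, confidence_threshold=0.6):
    if answer_confidence >= confidence_threshold or questions_asked >= max_questions:
        return TurnDecision("answer", best_answer)
    return TurnDecision("clarify", best_question)


print(decide_turn("Dinosaurs went extinct about 66 million years ago.", 0.45,
                  "Do you mean their extinction?", questions_asked=0))
```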

Note that the setup of this stage is quite different from Stage 1. Participating systems would likely need to operate as generative models rather than retrieval models. One option is to cast the problem as generative from the beginning and solve the retrieval part of Stage 1, e.g., by ranking the offered candidates by their likelihood.

Alternatively, one may solve Stage 2 by retrieving a list of candidate answers (e.g., via the Wikipedia API or the Chat Noir API described above) and ranking them as in Stage 1.
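One way to implement the likelihood-based ranking mentioned above is to score each retrieved candidate with a pretrained language model. The sketch below uses GPT-2 via the Hugging Face transformers library; the "User:/System:" prompt format is an assumption, not a prescribed input.

```python
# Sketch: rank retrieved candidates (answers or clarifying questions) by their
# conditional likelihood under GPT-2. The prompt format is an assumption,
# not part of the challenge.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()


def candidate_nll(query, candidate):
    """Per-token negative log-likelihood of the candidate given the query."""
    prompt = f"User: {query}\nSystem:"
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + " " + candidate, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, :prompt_len] = -100  # exclude prompt tokens from the loss
    with torch.no_grad():
        loss = model(full_ids, labels=labels).loss
    return loss.item()


query = "tell me about dinosaurs"
candidates = [
    "Are you interested in a specific type of dinosaur?",
    "Would you like nutrition facts for this product?",
]
ranked = sorted(candidates, key=lambda c: candidate_nll(query, c))
print(ranked[0])  # lower NLL = more plausible response to the query
```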

Evaluation

Competitors’ models will then be compared in three ways:

The winning dialogue systems will be chosen based on these scores.

Metrics

There are three types of metrics we will evaluate:

Protocol

TBD

Rules

Model Submission

To submit an entry, create a private repo containing your model that works with our evaluation code, and share it with the following GitHub accounts: varepsilon, aliannejadi.

See https://github.com/aliannejadi/ClariQ for example baseline submissions.

You are free to use any framework (e.g., PyTorch, TensorFlow, C++) as long as you can wrap your model for evaluation. The top-level README should state your team name, model name, and where the eval_ppl.py, eval_hits.py, etc. files are so we can run them; those should reproduce the numbers on the validation set. Please also include those numbers in the README so we can check that we get the same results. We will then run the automatic evaluations on the hidden test set and update the leaderboard. You can submit at most once per month. Once the submission system is locked on DATE, we will use the same submitted code of the top-performing models for the human evaluations.

Schedule

Until DATE, competitors will be able to submit models (source code) to be evaluated on the hidden test set using automated metrics (which we will run on our servers).

The ability to submit a model for evaluation by automatic metrics, with results displayed on the leaderboard, will be available by DATE. The current leaderboards will be visible to all competitors.

'Wild' live evaluation can also be performed during this period to obtain evaluation metrics and data. Those metrics will not be used for the final judgement of the systems, but can help competitors tune their systems if they wish.

On DATE the source code submission system will be locked, and the best-performing systems will be evaluated over the following month using crowd workers and the 'wild' live evaluation.

Winners will be announced at EMNLP 2020.

Organizing team

Previous competitions

The challenge is happening together with the Search-oriented Conversational AI workshop at EMNLP 2020 and builds upon the two previously run ConvAI challenges: ConvAI @ NIPS 2017 (The Conversational Intelligence Challenge) and ConvAI2 @ NeurIPS 2018 (The Persona-Chat Challenge).