ConvAI3: Clarifying Questions for Open-Domain Dialogue Systems (ClariQ)

SCAI workshop @ EMNLP 2020

NEW! Challenge winners announced!

ClariQ Overview

The ClariQ challenge (pronounced "Claire-ee-que") is organized as part of the Conversational AI challenge series (ConvAI3) at the Search-oriented Conversational AI (SCAI) workshop at EMNLP 2020. The main aim of a conversational system is to return an appropriate answer in response to user requests. However, some user requests are ambiguous. In IR settings such situations are handled mainly by diversifying the search result page; this is much more challenging in a dialogue setting. Hence, we aim to study the following situation for dialogue settings:

The main research questions we aim to answer as part of the challenge are the following:

The detailed description of the challenge can be found in the following document.

How to participate?

News

Awards

Google is sponsoring the competition award with GCP Credits, which we plan to award as follows:

IMPORTANT! Please note that GCP Credits can only be granted where available and are subject to the Terms and Conditions and product availability. If the competition winner is from a region where the program is not launched, we will, unfortunately, not be able to issue the prize.

Challenge Design

The challenge will be run in two stages:

Stage 1: initial dataset (sponsored by Microsoft Research)

In Stage 1, we provide participants with datasets that include:

To answer RQ1: Given a user request, return a score [1-4] indicating the necessity of asking clarifying questions.

To answer RQ2: Given a user request that needs clarification, return the most suitable clarifying question. Here participants can either (1) select the clarifying question from the provided question bank (all clarifying questions we collected), aiming to maximize precision, or (2) choose not to ask any question (by selecting Q00001 from the question bank).

The dataset is stored in the repository https://github.com/aliannejadi/ClariQ, together with evaluation scripts and baselines.
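For a concrete starting point, the sketch below shows how a simple retrieval-style approach to RQ2 might look: BM25 over the question bank. The TSV file names and column names are assumed to follow the ClariQ repo layout, and the third-party rank_bm25 package is an assumption of this illustration rather than part of the official baselines.

```python
# Minimal sketch of a BM25 question ranker for RQ2 (illustrative only).
# Assumes question_bank.tsv and dev.tsv from the ClariQ repo and the
# third-party rank_bm25 package (pip install rank_bm25 pandas).
import pandas as pd
from rank_bm25 import BM25Okapi

question_bank = pd.read_csv("question_bank.tsv", sep="\t")  # columns assumed: question_id, question
topics = pd.read_csv("dev.tsv", sep="\t")                   # columns assumed: topic_id, initial_request, ...

question_ids = question_bank["question_id"].tolist()
questions = question_bank["question"].fillna("").tolist()
bm25 = BM25Okapi([q.lower().split() for q in questions])

def rank_questions(request, top_k=30):
    """Return the top_k (question_id, score) pairs for a user request."""
    scores = bm25.get_scores(request.lower().split())
    ranked = sorted(zip(question_ids, scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]

# Example: rank clarifying questions for the first dev topic.
first_topic = topics.drop_duplicates("topic_id").iloc[0]
print(rank_questions(first_topic["initial_request"])[:3])
```

A real submission would also need a predictor for RQ1 and would write its rankings in the run-file format described in the Submission section below.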

Stage 2: human-in-the-loop (sponsored by Amazon Science)

The top-4 systems from Stage 1 are put in front of real users, who rate their responses (both answers and clarifying questions) and their overall performance. At each dialogue step, a system should either give a factual answer to the user's query or ask for clarification. Therefore, the participants would need to:

The participants would need to strike a balance between asking too many questions and providing irrelevant answers.

Note that the setup of this stage is quite different from Stage 1. Participating systems would likely need to operate as generative models rather than retrieval models. One option would be to cast the problem as generative from the beginning and solve the retrieval part of Stage 1, e.g., by ranking the offered candidates by their likelihood.
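As a rough, unofficial sketch of the likelihood-ranking idea, one could score each candidate question under a pretrained generative language model conditioned on the conversation so far; GPT-2 and the Hugging Face transformers API are used here only as stand-ins for whatever generative model a participant actually builds, and the example context and candidates are made up.

```python
# Illustrative sketch: rank candidate clarifying questions by their likelihood
# under a pretrained LM, conditioned on the conversation so far.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def candidate_log_likelihood(context, candidate):
    """Average log-probability of the candidate tokens given the context."""
    ctx_ids = tokenizer.encode(context, return_tensors="pt")
    cand_ids = tokenizer.encode(" " + candidate, return_tensors="pt")
    input_ids = torch.cat([ctx_ids, cand_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : ctx_ids.size(1)] = -100  # score only the candidate tokens
    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss  # mean NLL over candidate tokens
    return -loss.item()

context = "User: Tell me about defender."
candidates = [
    "Are you interested in the video game Defender?",
    "Do you mean the Land Rover Defender?",
]
best = max(candidates, key=lambda c: candidate_log_likelihood(context, c))
print(best)
```

The same scoring function could be reused in Stage 1 by ranking the provided question bank by likelihood instead of retrieval score.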

Alternatively, one may solve Stage 2 by retrieving a list of candidate answers (e.g., by invoking Wikipedia API or the Chat Noir API that we describe above) and ranking them as in Stage 1.

For further information about the Stage 2 design and auxiliary files, please visit the ClariQ repo.

Timeline

Winners will be announced at SCAI@EMNLP2020, which will take place on November 19-20 (exact details TBD).

Evaluation

Participants' models will be compared in two ways after the two stages:

The winners will be chosen based on these scores.

Metrics

There are three types of metrics we will evaluate:

Rules

Submission

Stage 1

Please send two files per run to clariq@convai.io, indicating your team's name as well as your run ID. Each team is allowed to send a maximum of one run per week. You'll also need to share your GitHub repository with us. The run files should be formatted as described below.

Run file format

Each run consists of two separate files:

Below we explain how each file should be formatted.

Question ranking

This file should contain a ranked list of questions per topic. Any number of questions may be listed per topic, but only the top 30 are evaluated. We follow the traditional TREC run format. Each line of the file should be formatted as follows:

<topic_id> 0 <question_id> <ranking> <relevance_score> <run_id>

Each line represents a relevance prediction. <relevance_score> is the relevance score that a model predicts for a given <topic_id> and <question_id>. <run_id> is a string indicating the ID of the submitted run. <ranking> denotes the ranking of the <question_id> for <topic_id>. Practically, the ranking is computed by sorting the questions for each topic by their relevance scores. Here are some example lines:

170 0 Q00380 1 6.53252 sample_run
170 0 Q02669 2 6.42323 sample_run
170 0 Q03333 3 6.34980 sample_run
171 0 Q03775 1 4.32344 sample_run
171 0 Q00934 2 3.98838 sample_run
171 0 Q01138 3 2.34534 sample_run

This run file will be used to evaluate both question relevance and document relevance. Sample runs can be found in the ./sample_runs/ directory. Sample Google Colab notebooks are also available; please check the ClariQ repo for more information.
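As a minimal illustration, a run file in this format could be written as follows (the scores, output file name, and run ID are placeholders):

```python
# Sketch: write a TREC-style question-ranking run file from predicted scores.
# `scores` maps topic_id -> {question_id: relevance_score}; values are placeholders.
scores = {
    170: {"Q00380": 6.53252, "Q02669": 6.42323, "Q03333": 6.34980},
    171: {"Q03775": 4.32344, "Q00934": 3.98838, "Q01138": 2.34534},
}
run_id = "sample_run"

with open("sample_run.txt", "w") as out:
    for topic_id, question_scores in sorted(scores.items()):
        # Ranking is obtained by sorting questions by their relevance scores.
        ranked = sorted(question_scores.items(), key=lambda item: item[1], reverse=True)
        for rank, (question_id, score) in enumerate(ranked[:30], start=1):
            out.write(f"{topic_id} 0 {question_id} {rank} {score} {run_id}\n")
```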

Clarification need

This file should contain the predicted clarification_need labels, so the format is simply the topic_id followed by the predicted label. Sample lines can be found below:

171 1
170 3
182 4

More information and example run files can be found at https://github.com/aliannejadi/ClariQ.
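A similarly minimal sketch for the clarification-need file, again with placeholder labels and a placeholder file name:

```python
# Sketch: write the clarification-need prediction file (topic_id and predicted label 1-4).
predictions = {171: 1, 170: 3, 182: 4}  # placeholder predictions

with open("sample_run.cn.txt", "w") as out:
    for topic_id, label in sorted(predictions.items()):
        out.write(f"{topic_id} {label}\n")
```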

Stage 2

To submit an entry, create a private repo containing your model that works with our evaluation code, and share it with the following GitHub accounts: aliannejadi, varepsilon, julianakiseleva.

See https://github.com/aliannejadi/ClariQ for example baseline submissions.

You are free to use any system (e.g., PyTorch, TensorFlow, C++, ...) as long as you can wrap your model for the evaluation. The top-level README should tell us your team name, your model name, and where the eval_ppl.py, eval_hits.py, etc. files are so we can run them. Those should give the numbers on the validation set. Please also include those numbers in the README so we can check that we get the same results. We will then run the automatic evaluations on the hidden test set and update the leaderboard. You can submit at most once per week.

We will use the same submitted code for the top-performing models to compute human evaluations once the submission system is locked on September 17, 2020 (extended from September 9, 2020). The deadline for submitting systems for Stage 2 is November 1, 2020.

The submitted systems must produce the output in a reasonable time window (1-2 hours) on our system with the following configuration:

Note: All submitted systems must be accompanied by a ready-to-use Docker container.

Automatic Evaluation Leaderboard (hidden test set)

Document Relevance

Dev

Rank Creator Model Name MRR P@1 nDCG@3 nDCG@5
- ClariQ Oracle BestQuestion 0.4882 0.4187 0.3337 0.3064
1 Karl Roberta 0.3640 0.2813 0.2002 0.1954
2 NTES_ALONG Reranker-v4 0.3761 0.3000 0.2113 0.1955
3 Soda BERT+BM25-v2 0.3180 0.2437 0.1625 0.1550
4 NTES_ALONG Reranker-v2 0.3573 0.2625 0.2112 0.1982
5 ClariQ BM25 0.3096 0.2313 0.1608 0.1530
6 Soda BERT+BM25 0.3096 0.2313 0.1608 0.1530
7 ClariQ NoQuestion 0.3000 0.2063 0.1475 0.1530
8 Pinta BERT-v3 0.3596 0.2750 0.1879 0.1882
9 NTES_ALONG BM25+Roberta 0.3606 0.2813 0.1942 0.1891
10 CogIR BERTlets-fusion-topics-passages 0.3103 0.2125 0.1747 0.1701
11 Pinta BERT 0.3297 0.2250 0.1792 0.1701
12 CogIR BERTlets-fusion-topics-div-passages-v2 0.3315 0.2500 0.1763 0.1660
13 CogIR BERTlets-fusion-topics-div-passages 0.3236 0.2250 0.1739 0.1653
14 Pinta BERT-v2 0.3158 0.2313 0.1669 0.1600
15 Karl Roberta-v2 0.3811 0.2938 0.2193 0.2093
16 NTES_ALONG Reranker-v3 0.3520 0.2687 0.2033 0.1925
17 NTES_ALONG Recall+Rerank 0.3627 0.2750 0.2047 0.1935
18 Soda BERT-based-v2 0.3306 0.2437 0.1699 0.1702
19 Soda BERT-based 0.3497 0.2625 0.1849 0.1762
20 Algis USE-QA 0.3517 0.2563 0.1943 0.1815
21 Pinta Triplet 0.3573 0.2688 0.1988 0.1920
22 TAL ML Roberta+++ 0.3619 0.2750 0.2060 0.1979
23 NTES_ALONG Recall+Rescore 0.3722 0.2813 0.2185 0.2047
24 NTES_ALONG BM25_plus+Roberta 0.3587 0.2813 0.1952 0.1869
25 Algis BART-based 0.3628 0.2687 0.2003 0.1914
26 TAL ML Roberta++ 0.3583 0.2687 0.1977 0.1931
27 ClariQ BERT-ranker 0.3453 0.2563 0.1824 0.1744
28 ClariQ BERT-reranker 0.3453 0.2563 0.1824 0.1744
- ClariQ Oracle WorstQuestion 0.0841 0.0125 0.0252 0.0313

Test

Rank Creator Model Name MRR P@1 nDCG@3 nDCG@5
- ClariQ Oracle BestQuestion 0.4881 0.4275 0.2107 0.1759
1 Karl Roberta 0.3190 0.2342 0.1265 0.1130
2 NTES_ALONG Reranker-v4 0.3140 0.2379 0.1229 0.1097
3 Soda BERT+BM25-v2 0.3216 0.2453 0.1196 0.1097
4 NTES_ALONG Reranker-v2 0.3034 0.2119 0.1171 0.1033
5 ClariQ BM25 0.3134 0.2193 0.1151 0.1061
6 Soda BERT+BM25 0.3134 0.2193 0.1151 0.1061
7 ClariQ NoQuestion 0.3223 0.2268 0.1134 0.1059
8 Pinta BERT-v3 0.3044 0.2119 0.1131 0.1021
9 NTES_ALONG BM25+Roberta 0.3045 0.2156 0.1108 0.1025
10 CogIR BERTlets-fusion-topics-passages 0.3025 0.2193 0.1078 0.0983
11 Pinta BERT 0.2934 0.2045 0.1078 0.0969
12 CogIR BERTlets-fusion-topics-div-passages-v2 0.2885 0.1859 0.1072 0.1010
13 CogIR BERTlets-fusion-topics-div-passages 0.2908 0.1970 0.1055 0.0990
14 Pinta BERT-v2 0.2815 0.1933 0.1043 0.0934
15 Karl Roberta-v2 0.2890 0.1933 0.1035 0.0941
16 NTES_ALONG Reranker-v3 0.3006 0.2230 0.1031 0.0970
17 NTES_ALONG Recall+Rerank 0.2948 0.1933 0.1029 0.0919
18 Soda BERT-based-v2 0.2803 0.1896 0.1021 0.0981
19 Soda BERT-based 0.2600 0.1784 0.0983 0.0915
20 Algis USE-QA 0.2782 0.1822 0.0978 0.1003
21 Pinta Triplet 0.2672 0.1747 0.0968 0.0906
22 TAL ML Roberta+++ 0.2835 0.2007 0.0965 0.0915
23 NTES_ALONG Recall+Rescore 0.2799 0.1970 0.0955 0.0856
24 NTES_ALONG BM25_plus+Roberta 0.2720 0.1822 0.0930 0.0870
25 Algis BART-based 0.2622 0.1710 0.0923 0.0848
26 TAL ML Roberta++ 0.2602 0.1747 0.0922 0.0833
27 ClariQ BERT-ranker 0.2562 0.1784 0.0896 0.0821
28 ClariQ BERT-reranker 0.2553 0.1784 0.0892 0.0818
- ClariQ Oracle WorstQuestion 0.0541 0.0000 0.0097 0.0154

Question Relevance

Dev

Rank Creator Model Name Recall@5 Recall@10 Recall@20 Recall@30
1 NTES_ALONG Reranker-v4 0.3604 0.6749 0.8478 0.8761
2 NTES_ALONG Reranker-v3 0.3648 0.6753 0.8510 0.8744
3 NTES_ALONG Reranker-v2 0.3648 0.6738 0.8417 0.8633
4 CogIR BERTlets-fusion-topics-div-passages-v2 0.3542 0.6424 0.7653 0.7997
5 TAL ML Roberta++ 0.3649 0.6694 0.8265 0.8587
6 Karl Roberta-v2 0.3611 0.6539 0.7993 0.8384
7 CogIR BERTlets-fusion-topics-passages 0.3555 0.6429 0.7640 0.7854
8 Karl Roberta 0.3618 0.6631 0.8128 0.8434
9 Soda BERT+Bm25 0.3454 0.6166 0.7354 0.7621
10 Soda BERT+Bm25-v2 0.3398 0.6166 0.7525 0.7792
11 Soda BERT-based 0.3523 0.6247 0.7354 0.7636
12 NTES_ALONG Recall+Rerank 0.3674 0.6678 0.7869 0.8085
13 Soda BERT-based-v2 0.3544 0.6287 0.7544 0.8177
14 Pinta BERT 0.3492 0.6196 0.7337 0.7632
15 CogIR BERTlets-fusion-topics-div-passages 0.3528 0.6393 0.7506 0.7890
16 NTES_ALONG BM25_plus+Roberta 0.3637 0.6409 0.7484 0.7793
17 Pinta BERT-v3 0.3583 0.6358 0.7616 0.7852
18 NTES_ALONG Recall+Rescore 0.3648 0.6553 0.8230 0.8367
19 ClariQ BERT-ranker 0.3494 0.6134 0.7248 0.7542
20 Pinta BERT-v2 0.3528 0.6393 0.7506 0.7890
21 Algis BART-based 0.3333 0.5910 0.6689 0.6926
22 Algis USE-QA 0.3469 0.6112 0.7052 0.7228
23 NTES_ALONG BM25+Roberta 0.3629 0.6389 0.7285 0.7657
24 TAL ML Roberta+++ 0.3508 0.6215 0.7672 0.7978
25 ClariQ BERT-reranker 0.3475 0.6122 0.6913 0.6913
26 ClariQ BM25 0.3245 0.5638 0.6675 0.6913
27 Pinta Triplet 0.3471 0.5871 0.6653 0.6846

Test

Rank Creator Model Name Recall@5 Recall@10 Recall@20 Recall@30
1 NTES_ALONG Reranker-v4 0.3404 0.6329 0.8335 0.8744
2 NTES_ALONG Reranker-v3 0.3414 0.6351 0.8316 0.8721
3 NTES_ALONG Reranker-v2 0.3382 0.6242 0.8177 0.8685
4 CogIR BERTlets-fusion-topics-div-passages-v2 0.3384 0.6314 0.8073 0.8573
5 TAL ML Roberta++ 0.3395 0.6251 0.8176 0.8568
6 Karl Roberta-v2 0.3355 0.6237 0.7990 0.8492
7 CogIR BERTlets-fusion-topics-passages 0.3314 0.6149 0.8074 0.8448
8 Karl Roberta 0.3406 0.6255 0.8006 0.8436
9 Soda BERT+BM25 0.3272 0.6061 0.8013 0.8433
10 Soda BERT+BM25-v2 0.3013 0.5866 0.8006 0.8433
11 Soda BERT-based 0.3338 0.6099 0.8023 0.8432
12 NTES_ALONG Recall+Rerank 0.3435 0.6296 0.7959 0.8424
13 Soda BERT-based-v2 0.3067 0.5893 0.7991 0.8415
14 Pinta BERT 0.3438 0.6228 0.7987 0.8409
15 CogIR BERTlets-fusion-topics-div-passages 0.3333 0.6225 0.8039 0.8392
16 NTES_ALONG BM25_plus+Roberta 0.3361 0.6219 0.7960 0.8360
17 Pinta BERT-v3 0.3291 0.6066 0.7902 0.8345
18 NTES_ALONG Recall+Rescore 0.3432 0.6229 0.7857 0.8211
19 ClariQ BERT-ranker 0.3440 0.6242 0.7849 0.8190
20 Pinta BERT-v2 0.3271 0.6012 0.7884 0.8188
21 Algis BART-based 0.3408 0.6156 0.7721 0.8081
22 Algis USE-QA 0.3454 0.6071 0.7688 0.8013
23 NTES_ALONG BM25+Roberta 0.3329 0.6027 0.7650 0.8004
24 TAL ML Roberta+++ 0.3212 0.5786 0.7204 0.7739
25 ClariQ BERT-reranker 0.3444 0.6062 0.7585 0.7682
26 ClariQ BM25 0.3170 0.5705 0.7292 0.7682
27 Pinta Triplet 0.3330 0.5809 0.7289 0.7589

Clarification Need Prediction

Rank Creator Model Name Dev-P Dev-R Dev-F1 Test-P Test-R Test-F1
1 TAL ML Roberta+++ 0.6039 0.5600 0.5551 0.5981 0.6557 0.6070
2 CactusJam Roberta+Stats 0.6200 0.5800 0.5717 0.5963 0.5902 0.5416
3 TAL ML Roberta++ 0.5807 0.5400 0.5375 0.5290 0.5574 0.5253
4 Algis Roberta+CatBoost 0.1402 0.2800 0.1854 0.5171 0.5246 0.5138
5 NTES_ALONG cneed_add_prior_v2 0.6200 0.6000 0.5984 0.5007 0.5082 0.5018
6 NTES_ALONG cneed_merge 0.5830 0.5200 0.5192 0.4847 0.5082 0.4960
7 NTES_ALONG cneed_dist 0.5452 0.5200 0.5177 0.4852 0.4918 0.4868
8 Karl Roberta-v2 0.4609 0.4600 0.4356 0.4465 0.5410 0.4871
9 NTES_ALONG Roberta+prior 0.4554 0.4600 0.4567 0.4926 0.4754 0.4799
10 Algis BartBoost 0.7008 0.7000 0.6976 0.4813 0.4754 0.4756
11 Soda BERT-based-v2 0.5218 0.4800 0.4253 0.3931 0.4918 0.4350
12 Soda BERT-based 0.5224 0.5400 0.5180 0.3901 0.4754 0.4282
13 Soda BERT+BM25 0.5386 0.5600 0.5360 0.3930 0.4754 0.4273
14 Soda BERT+BM25-v2 0.5992 0.5800 0.5793 0.4304 0.4262 0.4207
15 Pinta BERT-v3 0.4083 0.5000 0.4248 0.3789 0.4590 0.4147
16 Pinta Triplet 0.4161 0.4800 0.4178 0.3665 0.4590 0.4074
17 Pinta BERT 0.5204 0.5000 0.5014 0.3929 0.4098 0.4004
18 NTES_ALONG Roberta 0.3967 0.5200 0.4305 0.3546 0.4754 0.3995
19 Pinta XGB 0.5027 0.4800 0.4841 0.3776 0.3770 0.3771

Stage 2: Multi-turn Conversations Evaluation

Document Relevance

Test

Rank Creator Model Name MRR P@1 nDCG@3 nDCG@5
1 NTES_ALONG ClariQ_Select_System 0.1798 0.1161 0.0553 0.0536
2 TAL ML MCAN 0.1669 0.1067 0.0522 0.0494

Challenge Winners

Organizing team

Previous ConvAI competitions

Sponsors

Sponsorship of data collection:

Stage 1 of the challenge is sponsored by:

Stage 2 of the challenge is sponsored by:

The participants' prize is sponsored by: