
 SemEval 2020 Task 4 - Commonsense Validation and Explanation (ComVE)

Introduction
Welcome to the Commonsense Validation and Explanation (ComVE) Challenge!
 
The task is to directly test whether a system can differentiate natural language statements that make sense from those that do not. We designed three subtasks. The first subtask is to choose, from two natural language statements with similar wordings, which one makes sense and which one does not; the second is to find, among three options, the key reason why a given statement does not make sense; the third asks the system to generate that reason, and we use BLEU to evaluate the generated reasons.
 
Formally, each instance in our dataset is composed of eight sentences: {s1, s2, o1, o2, o3, r1, r2, r3}. s1 and s2 are two similar statements that share the same syntactic structure and differ by only a few words, but only one of them makes sense while the other does not. They are used in our first subtask, called Validation, which requires the model to identify which one makes sense. For the against-common-sense statement (s1 or s2), we provide three optional sentences o1, o2 and o3 that explain why the statement does not make sense. Our second subtask, named Explanation (Multi-Choice), requires the single correct reason to be identified among the two other confusing options. For the same against-common-sense statement, our third subtask, named Explanation (Generation), asks the participants to generate the reason why it does not make sense. The three referential reasons r1, r2 and r3 are used for evaluating subtask 3.
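To make the instance layout concrete, here is a minimal sketch of how one instance could be represented in Python. The ComVEInstance class, its field names, and the label fields are illustrative assumptions for this page only, not the official data format.

from dataclasses import dataclass
from typing import List

@dataclass
class ComVEInstance:
    s1: str                 # first statement
    s2: str                 # similarly worded second statement; one of s1/s2 is nonsensical
    options: List[str]      # o1, o2, o3: candidate reasons for Explanation (Multi-Choice)
    references: List[str]   # r1, r2, r3: referential reasons for Explanation (Generation)
    nonsensical: str        # which statement is against common sense ("s1" or "s2")
    correct_option: int     # index (0-2) of the correct reason among the options

# The running example from this page, encoded in the sketch above.
example = ComVEInstance(
    s1="He put a turkey into the fridge.",
    s2="He put an elephant into the fridge.",
    options=[
        "An elephant is much bigger than a fridge.",
        "Elephants are usually gray while fridges are usually white.",
        "An elephant cannot eat a fridge.",
    ],
    references=[
        "An elephant is much bigger than a fridge.",
        "A fridge is much smaller than an elephant.",
        "Most of the fridges aren't large enough to contain an elephant.",
    ],
    nonsensical="s2",
    correct_option=0,
)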
 
Example:
 
Task A: Validation
 
Task: Which statement of the two is against common sense?
Statement1: He put a turkey into the fridge.
Statement2: He put an elephant into the fridge.
Task B: Explanation (Multi-Choice)
 
Task: Select the most appropriate reason why this statement is against common sense.
Statement: He put an elephant into the fridge.
A: An elephant is much bigger than a fridge.
B: Elephants are usually gray while fridges are usually white.
C: An elephant cannot eat a fridge.
Task C: Explanation (Generation)
 
Task: Generate the reason why this statement is against common sense; we will use BLEU to evaluate it (a scoring sketch follows the referential reasons below).
Statement: He put an elephant into the fridge.
Referential Reasons:
1. An elephant is much bigger than a fridge.
2. A fridge is much smaller than an elephant.
3. Most of the fridges aren’t large enough to contain an elephant.
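As a rough illustration of BLEU-based scoring for Explanation (Generation), the following is a minimal sketch using NLTK's sentence-level BLEU with the three referential reasons above. The tokenization, smoothing method, and the hypothetical system output are assumptions; the official evaluation script may differ.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "An elephant is much bigger than a fridge.",
    "A fridge is much smaller than an elephant.",
    "Most of the fridges aren't large enough to contain an elephant.",
]
generated = "An elephant is too big to fit into a fridge."  # hypothetical system output

# BLEU compares the generated reason against all three referential reasons;
# smoothing avoids zero scores when higher-order n-grams do not overlap.
score = sentence_bleu(
    [ref.lower().split() for ref in references],   # tokenized references
    generated.lower().split(),                      # tokenized hypothesis
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU: {score:.3f}")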
Paper
 
For more detailed information, please refer to this link.
 
Please contact the task organisers or post on the competition forum if you have any further queries.
 
You can use the following citations if you use our dataset or refer to our work:
 
Cunxiang Wang, Shuailong Liang, Yili Jin, Yilong Wang, Xiaodan Zhu, and Yue Zhang. 2020. SemEval-2020 Task 4: Commonsense Validation and Explanation. In Proceedings of The 14th International Workshop on Semantic Evaluation. Association for Computational Linguistics.
 
@inproceedings{wang-etal-2020-semeval,
    title = "{S}em{E}val-2020 Task 4: Commonsense Validation and Explanation",
    author = "Wang, Cunxiang  and
      Liang, Shuailong  and
      Jin, Yili  and
      Wang, Yilong  and
      Zhu, Xiaodan  and
      Zhang, Yue",
    booktitle = "Proceedings of The 14th International Workshop on Semantic Evaluation",
    year = "2020",
    publisher = "Association for Computational Linguistics",
}
 
@inproceedings{wang-etal-2019-make,
   title = "Does it Make Sense? And Why? A Pilot Study for Sense Making and Explanation",
   author = "Wang, Cunxiang  and
     Liang, Shuailong  and
     Zhang, Yue  and
     Li, Xiaonan  and
     Gao, Tian",
   booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
   month = jul,
   year = "2019",
   address = "Florence, Italy",
   publisher = "Association for Computational Linguistics",
   url = "https://www.aclweb.org/anthology/P19-1393",
   pages = "4020--4026",
   abstract = "Introducing common sense to natural language understanding systems has received increasing research attention. It remains a fundamental question on how to evaluate whether a system has the sense-making capability. Existing benchmarks measure common sense knowledge indirectly or without reasoning. In this paper, we release a benchmark to directly test whether a system can differentiate natural language statements that make sense from those that do not make sense. In addition, a system is asked to identify the most crucial reason why a statement does not make sense. We evaluate models trained over large-scale language modeling tasks as well as human performance, showing that there are different challenges for system sense-making.",