CEBaB: Estimating the Causal Effects of Real-World Concepts on NLP Model Behavior
Dataset and code: https://github.com/CEBaBing/CEBaB
Abraham, Eldar David; Karel D’Oosterlinck; Amir Feder; Yair Gat; Atticus Geiger; Christopher Potts; Roi Reichart; and Zhengxuan Wu. 2022. CEBaB: Estimating the causal effects of real-world concepts on NLP model behavior. Ms., Stanford University, Technion – Israel Institute of Technology, and Ghent University.
CEBaB was created primarily to facilitate the evaluation of explanation methods for NLP models.
The dataset was created by Eldar David Abraham, Karel D’Oosterlinck, Amir Feder, Yair Gat, Atticus Geiger, Christopher Potts, Roi Reichart, and Zhengxuan Wu. It was not created on behalf of any other entity.
The dataset creation was funded by Meta AI.
The instances represent short restaurant reviews with aspect-level sentiment labels and text-level star ratings. Instances also include metadata related to the restaurant, the review, and the annotation process.
The dataset has 15,089 instances.
The dataset begins with a sample of actual reviews from OpenTable, all written in 2010. This is a tiny sample of all OpenTable reviews written in that time period. Our dataset creation process involved crowdsourcing edits to these reviews and new sentiment labels for them. All such crowdwork is included.
Each instance is a JSON dictionary with a large number of fields. The precise structure is documented as part of the official dataset distribution.
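For readers who want to inspect the fields directly, here is a minimal sketch that loads the dataset with the Hugging Face `datasets` library and prints one instance. The Hub identifier `CEBaB/CEBaB` and the split handling are assumptions; consult the official distribution for the exact identifier and structure.

```python
# Minimal sketch: inspect the fields of one CEBaB instance.
# The Hub identifier "CEBaB/CEBaB" is an assumption; consult the official
# distribution for the exact identifier, configuration, and split names.
from datasets import load_dataset

cebab = load_dataset("CEBaB/CEBaB")   # returns a DatasetDict of splits
split_name = next(iter(cebab))        # take whichever split comes back first
example = cebab[split_name][0]

# Each instance is a flat JSON dictionary; print its fields and values.
for field, value in example.items():
    print(f"{field}: {value}")
```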
There are a number of labels associated with each instance. We refer to the primary dataset documentation for details.
We have not deliberately excluded any information.
Yes, examples come in groups: an original review and various edits of it, each targeting a different aspect of the original. These relationships are encoded in the instance ids.
Yes, the dataset is released with an inclusive train set, an exclusive train set (a proper subset of the inclusive one), a dev set, and a test set.
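Below is a minimal sketch of how the splits and the original/edit grouping might be used together. The split name `train_inclusive` and the linking field `original_id` are assumptions; the real names are documented with the dataset.

```python
# Minimal sketch: load the splits and group each original review with its edits.
# The split name "train_inclusive" and the linking field "original_id" are
# assumptions; the instance ids encode the original/edit relationships, and the
# official documentation gives the exact field and split names.
from collections import defaultdict
from datasets import load_dataset

cebab = load_dataset("CEBaB/CEBaB")   # Hub identifier assumed, as above
print(list(cebab.keys()))             # inspect the actual split names

groups = defaultdict(list)
for row in cebab["train_inclusive"]:        # split name assumed
    groups[row["original_id"]].append(row)  # linking field assumed

# Each value now holds one original review plus the edits derived from it.
some_group = next(iter(groups.values()))
print(len(some_group), "instances derived from the same original review")
```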
We are not aware of any errors, noise, or redundancies, but this is a naturalistic dataset, so it is safe to assume that such things exist.
The dataset is self-contained.
The dataset consists of public OpenTable reviews and edits of those reviews that were done by crowdworkers. In light of this, we are reasonably confident that it does not contain confidential data.
We are not aware of any such instances in the dataset, but we have not comprehensively audited it with these considerations in mind.
Yes, it is a dataset of restaurant reviews.
No information about individuals is included in the metadata.
We think that direct identification is not possible. However, since this is a dataset of naturalistic texts, it may be possible to identify individuals via the content of the original restaurant reviews.
We are not aware of any such sensitive data.
The original restaurant reviews and associated metadata were downloaded from OpenTable.com in 2010 by Christopher Potts. The edits and associated sentiment labels were created in early 2022 in a crowdsourcing effort on Mechanical Turk that was administered by Potts.
The dataset was collected on the Mechanical Turk platform using HTML templates that are included in the dataset distribution.
The original reviews were sampled from the larger set that Potts downloaded in 2010 by a random process focused on U.S. restaurants.
The dataset was crowdsourced. Workers were paid US$0.25 per example in the editing phase and US$0.35 per batch of 10 examples in the labeling phases.
The crowdsourcing effort was conducted January 31, 2022, to February 24, 2022.
The dataset collection process was covered by a Stanford University IRB Protocol (PI Potts). Information about this protocol is available upon request.
All instances were collected via Amazon’s Mechanical Turk platform.
All the individuals involved were crowdworkers who opted in to the tasks.
All the individuals involved were crowdworkers who opted in to the tasks.
All the instances in our dataset use only anonymized identifiers of individuals, so we do not have a mechanism for allowing people to withdraw their work.
No.
No, no preprocessing was done beyond the sampling described above and the formatting required to put examples into our JSON format.
The raw data have been retained for now.
We are not releasing this preprocessing code, but we are open to sharing it with researchers upon request.
As of this writing, the dataset has been used only for the experiments in the paper that introduced it.
Yes.
The dataset’s most obvious applications are text-level and aspect-level sentiment analysis and the assessment of causal explanation methods.
Yes, the dataset is designed primarily to answer specific scientific questions about sentiment analysis and causal explanation methods. As such, it should be regarded as highly limited when it comes to real-world tasks involving sentiment analysis or any other kind of textual analysis. It was not created with such applications in mind; no effort was made, for example, to ensure coverage across restaurants, cuisines, regions of the U.S., or any other category that might impact a real-world sentiment analysis system in significant ways.
The only uses we endorse are (1) text-level and aspect-level sentiment analysis experiments aimed at providing scientific insights into NLP modeling techniques, and (2) assessment of causal explanation methods.
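To make the second endorsed use concrete, here is a minimal sketch of the kind of analysis the original/edit pairs support: averaging the change in a model’s prediction when a single aspect is edited. `model_predict` is a hypothetical scoring function, and the field names `is_original`, `edit_goal`, and `description` are assumptions standing in for the fields documented with the dataset.

```python
# Minimal sketch: estimate the average effect of editing one aspect on a
# model's output, using groups of originals and their edits (see the grouping
# sketch above). `model_predict` is hypothetical, and the field names
# "is_original", "edit_goal", and "description" are assumptions.
import numpy as np

def estimate_aspect_effect(groups, model_predict, aspect="service"):
    """Average change in model output when the given aspect is edited."""
    deltas = []
    for group in groups.values():
        originals = [r for r in group if r["is_original"]]
        edits = [r for r in group
                 if not r["is_original"] and r["edit_goal"] == aspect]
        if not originals or not edits:
            continue
        base = model_predict(originals[0]["description"])
        for edit in edits:
            deltas.append(model_predict(edit["description"]) - base)
    return float(np.mean(deltas)) if deltas else 0.0
```

With `groups` built as in the earlier sketch and any text-to-score model, this yields a rough average effect estimate against which explanation methods could be compared.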
The dataset is distributed publicly.
The dataset is distributed via the current repository and on the Hugging Face website.
It is presently available.
The dataset is released under a Creative Commons Attribution 4.0 International License.
No, not that we are aware.
No, not that we are aware.
The dataset creators are supporting and maintaining the dataset. It is hosted on GitHub and on the Hugging Face website.
The dataset owners can be contacted at the email addresses included with the paper, or via the dataset’s GitHub repository.
Not as of this writing, but we will create one at the dataset’s GitHub site as necessary.
Yes.
There are no applicable limits of this kind.
Yes, they will be available in the dataset’s GitHub repository.
Yes, we are open to collaboration of this kind.