CEBaB: Estimating the Causal Effects of Real-World Concepts on NLP Model Behavior
View the Dataset and Code on GitHub CEBaBing/CEBaB
✅ English-language benchmark to evaluate causal explanation methods.
✅ Human-validated Aspect-based Sentiment Analysis (ABSA) benchmark.
Eldar David Abraham, Karel D’Oosterlink, Amir Feder, Yair Gat, Atticus Geiger, Christopher Potts, Roi Reichart, Zhengxuan Wu. 2022. CEBaB: Estimating the causal effects of real-world concepts on NLP model behavior. Ms., Stanford University, Technion – Israel Institute of Technology, and Ghent University.
@unpublished{abraham-etal-2022-cebab,
    title={{CEBaB}: Estimating the Causal Effects of Real-World Concepts on {NLP} Model Behavior},
    author={Abraham, Eldar David and D'Oosterlinck, Karel and Feder, Amir and Gat, Yair Ori and Geiger, Atticus and Potts, Christopher and Reichart, Roi and Wu, Zhengxuan},
    note={arXiv:2205.14140},
    url={https://arxiv.org/abs/2205.14140},
    year={2022}}
Dataset files can be downloaded from CEBaB-v1.1.zip. Our v1.1 differs from v1.0 only in that v1.1 has proper unique ids our examples and corrects a bug that led to some non-unique ids in the previous version. There are no changes to other critical fields.
Note that we recommend you use HuggingFace Datasets library to use our dataset. See below for a 1-linear data loading.
The dataset consists of train_exclusive/train_inclusive/dev/test splits:
train_exclusive.jsontrain_inclusive.jsontrain_observational.jsondev.jsontest.jsonThe Datasheet for our dataset:
CEBaB is mainly maintained using the HuggingFace Datasets library:
"""
Make sure you install the Datasets library using:
pip install datasets
"""
from datasets import load_dataset
CEBaB = load_dataset("CEBaB/CEBaB")
This function can be used to load any subset of the raw *.json files:
import json
def load_split(splitname):
    with open(splitname) as f:
        data = json.load(f)
    return data
{    
    'id': str in format dddddd_dddddd as the concatenation of original_id and edit_id,
    'original_id': str in format dddddd,
    'edit_id': str in format dddddd,
    'is_original': bool,
    'edit_goal': str (one of "Negative", "Positive", "unknown") or None if is_original,
    'edit_type': str (one of "noise", "service", "ambiance", "food"),
    'edit_worker': str or None if is_original,
    'description': str,
    'review_majority': str (one of "1", "2", "3", "4", "5", "no majority"),
    'review_label_distribution': dict (str to int),
    'review_workers': dict (str to str),
    'food_aspect_majority': str (one of "Negative", "Positive", "unknown", "no majority"),
    'ambiance_aspect_majority': str (one of "Negative", "Positive", "unknown", "no majority"),
    'service_aspect_majority': str (one of "Negative", "Positive", "unknown", "no majority"),
    'noise_aspect_majority': str (one of "Negative", "Positive", "unknown", "no majority"),
    'food_aspect_label_distribution': dict (str to int),
    'ambiance_aspect_label_distribution': dict (str to int),
    'service_aspect_label_distribution': dict (str to int),
    'noise_aspect_label_distribution': dict (str to int),
    'food_aspect_validation_workers': dict (str to str),
    'ambiance_aspect_validation_workers': dict (str to str),
    'service_aspect_validation_workers': dict (str to str),
    'noise_aspect_validation_workers': dict (str to str),
    'opentable_metadata': {
        "restaurant_id": int,
        "restaurant_name": str,
        "cuisine": str,
        "price_tier": str,
        "dining_style": str,
        "dress_code": str,
        "parking": str,
        "region": str,
        "rating_ambiance": int,
        "rating_food": int,
        "rating_noise": int,
        "rating_service": int,
        "rating_overall": int
    }
}
Details:
'id': The unique identifier this example (an combination of two ids listed below).'original_id': The unique identifier of the original sentence for an edited example.'edit_id': The unique identifier of the edited sentence.'is_original': Indicate whether this sentence is an edit or not.'edit_goal': The goal label for the editing aspect if it an edited example, else None.'edit_type': The aspect to modify or to label with sentiment if it an edited example, else None.'edit_worker': Anonymized MTurk id of the worker who wrote 'description'. These are from the same family of ids as used in 'aspect_validation_workers'.'description': The example text.'review_majority': The review-level label for the editing aspect chosen by at least three of the five workers if there is one, else no majority.'review_label_distribution': Review-level rating distribution from the MTurk validation task.'review_workers': Individual response for review-level rating from annotators. The keys are lists of anonymized MTurk ids, which are used consistently throughout the dataset.'*_aspect_majority': The aspect-level label for the editing aspect chosen by at least three of the five workers if there is one, else no majority.'*_aspect_label_distribution': Aspect-level rating distribution from the MTurk validation task.'*_aspect_label_workers': Individual response for review-level rating from annotators. The keys are lists of anonymized MTurk ids, which are used consistently throughout the dataset.'opentable_metadata': Metadata for the review.Here is one example,
{
    "id": "000000_000000",
    "original_id": "000000",
    "edit_id": "000000",
    "is_original": true,
    "edit_goal": null,
    "edit_type": null,
    "edit_worker": null,
    "description": "Overbooked and didnot honor reservation time,put on wait list with walk INS",
    "review_majority": "1",
    "review_label_distribution": {
        "1": 4,
        "2": 1
    },
    "review_workers": {
        "w244": "1",
        "w120": "2",
        "w197": "1",
        "w7": "1",
        "w132": "1"
    },
    "food_aspect_majority": "",
    "ambiance_aspect_majority": "",
    "service_aspect_majority": "Negative",
    "noise_aspect_majority": "unknown",
    "food_aspect_label_distribution": "",
    "ambiance_aspect_label_distribution": "",
    "service_aspect_label_distribution": {
        "Negative": 5
    },
    "noise_aspect_label_distribution": {
        "unknown": 4,
        "Negative": 1
    },
    "food_aspect_validation_workers": "",
    "ambiance_aspect_validation_workers": "",
    "service_aspect_validation_workers": {
        "w148": "Negative",
        "w120": "Negative",
        "w83": "Negative",
        "w35": "Negative",
        "w70": "Negative"
    },
    "noise_aspect_validation_workers": {
        "w27": "unknown",
        "w23": "unknown",
        "w81": "Negative",
        "w103": "unknown",
        "w9": "unknown"
    },
    "opentable_metadata": {
        "restaurant_id": 6513,
        "restaurant_name": "Molino's Ristorante",
        "cuisine": "italian",
        "price_tier": "low",
        "dining_style": "Casual Elegant",
        "dress_code": "Smart Casual",
        "parking": "Private Lot",
        "region": "south",
        "rating_ambiance": 1,
        "rating_food": 3,
        "rating_noise": 2,
        "rating_service": 2,
        "rating_overall": 2
    }
}
We host our analyses code at our code folder.
This section contains the leaderboard for some of the best scores obtained on CEBaB as a five-class sentiment classification task. To add scores please consider a pull request.
| Model Architecture | Metric | Approx | S-Learner | INLP | 
|---|---|---|---|---|
| BERT | L2 | 0.81 (± 0.01) | 0.74 (± 0.02) | 0.80 (± 0.02) | 
| BERT | COS | 0.61 (± 0.01) | 0.63 (± 0.01) | 0.59 (± 0.03) | 
| BERT | NormDiff | 0.44 (± 0.01) | 0.54 (± 0.02) | 0.73 (± 0.02) | 
| RoBERTa | L2 | 0.83 (± 0.01) | 0.78 (± 0.01) | 0.84 (± 0.01) | 
| RoBERTa | COS | 0.60 (± 0.01) | 0.64 (± 0.01) | 0.58 (± 0.01) | 
| RoBERTa | NormDiff | 0.45 (± 0.00) | 0.59 (± 0.01) | 0.81 (± 0.01) | 
| GPT-2 | L2 | 0.72 (± 0.02) | 0.60 (± 0.02) | 0.72 (± 0.01) | 
| GPT-2 | COS | 0.59 (± 0.01) | 0.59 (± 0.01) | 1.00 (± 0.00) | 
| GPT-2 | NormDiff | 0.41 (± 0.01) | 0.40 (± 0.01) | 0.58 (± 0.03) | 
| LSTM | L2 | 0.86 (± 0.01) | 0.73 (± 0.01) | 0.79 (± 0.01) | 
| LSTM | COS | 0.64 (± 0.01) | 0.64 (± 0.01) | 0.74 (± 0.02) | 
| LSTM | NormDiff | 0.50 (± 0.01) | 0.53 (± 0.01) | 0.60 (± 0.01) | 
CeBaB has a Creative Commons Attribution 4.0 International License.