Skip to content

Latest commit

 

History

History
115 lines (88 loc) · 7.54 KB

README.md

File metadata and controls

115 lines (88 loc) · 7.54 KB

WikiQA-Salience: A dataset for evaluating entity salience prediction on extremely short documents

Introduction

WikiQA-Salience is a dataset for evaluating entity salience prediction on extremely short question-answer pair passages.

We leveraged the WikiQA dataset as a starting point to create a new entity salience dataset from publicly available data. The WikiQA corpus is an answer sentence selection (AS2) dataset where the questions are derived from query logs of the Bing search engine, and the answer candidates are extracted from Wikipedia. The examples are Q/A pairs in natural language with full-sentence (non-factoid) answers, which resemble the type of responses provided by conversational assistants.

Our dataset augments a subset of the Q/A pairs in WikiQA with named entities extracted from the question and answer text and linked to Wikidata and ground truth labels of the salience of each entity to the Q\A pair.

Construction

We first selected the Q/A pairs in the WikiQA corpus where the answer is labeled as correctly answering the question (positive pairs). Then we applied the ReFinED named entity resolution model to the combined question-answer text to extract named entities and augmented each entity with the name, description and aliases of the entity from WikiData. Since WikiData descriptions are typically extremely brief, we further augment the entities in the dataset with more detailed information from Wikipedia pages (wherever these are available) including the Wikipedia summary (i.e., the first section of the page) and the first 100 noun-phrases from the article. Ground truth labels were generated by crowd workers on the Amazon Mechanical Turk platform who rated the relevance of each entity to the Q/A pair it was extracted from on a three level scale (“High”, “Moderate”, “Low”).

The finished dataset consists of 687 annotated Q/A pairs with the linked entity data from ReFinED, entity details from WikiData, and the (5-pass) crowd worker salience ratings. The 687 Q/A pairs contain 2113 entities (unique at the Q/A pair level), and the mean length of the question-answer text is just 190.6 characters and 32.9 words. The distribution of the entity salience labels is significantly skewed towards salient entities with 1089 rated “High”, 535 rated “Moderate” and 489 rated “Low”. Additional details on the construction of the dataset can be found in our paper (TBD).

Statistics

mean std min max
characters 190.6 68.4 49 589
words 32.9 11.2 8 89

Table 1: Q/A pair context size

number of entities count
2 281
3 188
4 125
5 71
6 22

Table 2: Distribution of number of entities

median rating count
High 1089
Moderate 535
Low 489
Total 2113

Table 3: Distribution of ground truth labels

Contents

The dataset (wikiqa_salience.jsonl), in the JSON lines format, contains the following fields:

  • QuestionID: QuestionID from the original WikiQA dataset
  • SentenceID: SentenceID from the original WikiQA dataset
  • entities: a list of entity objects

Each entity object contains the following fields:

  • text: the mention text
  • category: the coarse mention type (from ReFinED)
  • predicted-entity-types: the predicted entity types
  • wikidata-entity-id: the WikiData entity ID
  • el-score: the ReFinED entity linking model confidence score
  • start-char: the start character of the mention text within the passage
  • end-char: the end character of the mention text within the passage
  • backend: the name of the entity linking model (i.e. "refined")
  • wikidata-entity-name: the canonical name of the entity in WikiData
  • wikidata-entity-description: a short textual description of the entity from WikiData
  • wikidata-entity-aliases: a list of aliases for the entity from WikiData
  • gt-rating-mean: the mean normalized numeric rating in the range [0, 1]
  • gt-rating-std: the standard deviation of the normalized numeric ratings.
  • gt-rating-median: the median normalized numeric rating in the range [0, 1]
  • gt-ratings-raw: a list of strings containing the ratings from each pass of annotation from the set {"High", "Moderate", "Low"}.
  • sum-first-section: Wikipedia page summary from the first section of the page
  • sum-noun-phrase-spacy: the first 100 noun-phrases from the article
  • sum-keywords-spacy: first 100 key phrases using Spacy
  • sum-keywords-rake: first 100 key phrases using Rake

Joining with the original WIKIQA dataset

The WikiQA-Salience dataset does not contain the question and answer pairs from the original WikiQA dataset only the extracted entities, enrichments from Wikidata/Wikipedia, and the ground truth labels. The datasets can be easily merge as described below, where WIKIQA_PATH point to the WikiQA.tsv file from the WikiQA corpus and the ENTITY_SALIENCE_DATA_PATH points to the wikiqa_salience.jsonl file from this dataset.

import pandas as pd
wikiqa_df = pd.read_csv(WIKIQA_PATH, sep="\t")
entities_df = pd.read_json(ENTITY_SALIENCE_DATA_PATH, lines=True)
joined_df = entities_df.join(wikiqa_df.set_index(["QuestionID", "SentenceID"]), on=["QuestionID", "SentenceID"], how="inner")

Security

See CONTRIBUTING for more information.

Cite

Please cite our paper if you use this dataset for your own research:

@inproceedings{bullough-etal-2024-predicting,
    title = "Predicting Entity Salience in Extremely Short Documents",
    author = "Bullough, Benjamin  and
      Lundberg, Harrison  and
      Hu, Chen  and
      Xiao, Weihang",
    editor = "Dernoncourt, Franck  and
      Preo{\c{t}}iuc-Pietro, Daniel  and
      Shimorina, Anastasia",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track",
    month = nov,
    year = "2024",
    address = "Miami, Florida, US",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-industry.5",
    pages = "50--64",
    abstract = "A frequent challenge in applications that use entities extracted from text documents is selecting the most salient entities when only a small number can be used by the application (e.g., displayed to a user). Solving this challenge is particularly difficult in the setting of extremely short documents, such as the response from a digital assistant, where traditional signals of salience such as position and frequency are less likely to be useful. In this paper, we propose a lightweight and data-efficient approach for entity salience detection on short text documents. Our experiments show that our approach achieves competitive performance with respect to complex state-of-the-art models, such as GPT-4, at a significant advantage in latency and cost. In limited data settings, we show that a semi-supervised fine-tuning process can improve performance further. Furthermore, we introduce a novel human-labeled dataset for evaluating entity salience on short question-answer pair documents.",
}

License

This project is licensed under the CC-BY-SA 4.0 License.