This repository stores all the intermediate and final data artifacts of LexC-Gen for both NusaX and SIB-200 tasks.
For researchers or practitioners who want to use the LexC-Gen generated data directly, we refer you to our datasets hosted on HuggingFace.
The datasets on HuggingFace have the structure `{id, text, label}`. For instance, for NusaX sentiment analysis, an example is:

```python
{'id': '1',
 'text': "Anchorwoman : Hai , pubuet n't reuhung atra aneuk kumuen meulawan buli aneuk miet , ikat atra getnyan fingers ngeun saboh boh manok ngeun jangka gobnyan ho saboh pillar .",
 'label': 1}
```
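As a minimal sketch, loading one of these datasets with the `datasets` library looks like the following. The repository name below is a placeholder, not an actual dataset identifier; substitute the dataset name from our HuggingFace page.

```python
# Minimal sketch of loading a LexC-Gen dataset from the HuggingFace Hub.
# NOTE: "your-org/lexcgen-nusax-ace" is a placeholder; use the actual
# dataset repository name listed on our HuggingFace page.
from datasets import load_dataset

ds = load_dataset("your-org/lexcgen-nusax-ace", split="train")
print(ds[0])  # -> {'id': '1', 'text': '...', 'label': 1}
```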
The intermediate data artifacts for both NusaX and SIB-200 tasks include:
- raw generated English text data after step (2) (`.txt` format in `{task}-lexcgen-raw-data/`)
- raw texts converted to CSV (`.csv` format in `{task}-lexcgen-processed-data/`)
- filtered data after input-label consistency filtering, i.e., after step (3) (`filtered-*.csv`)
- English data tokenized with Stanza after filtering (`tokenized_filtered-*.csv`)
- data translated to the respective low-resource languages using the GATITOS bilingual lexicon, i.e., after step (4) (`translated-*.csv`); see the sketch after this list
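To illustrate how the final translation step works conceptually, here is a minimal word-for-word substitution sketch. It is not our exact pipeline: the lexicon entries, file names, and the `text` column name below are assumptions for illustration.

```python
# Minimal sketch of step (4): word-for-word translation of tokenized
# English text with a bilingual lexicon such as GATITOS.
# The lexicon entries, file names, and "text" column are assumptions.
import pandas as pd

def translate_tokens(tokens, lexicon):
    """Replace each English token with its lexicon entry when one exists;
    tokens without an entry are kept as-is, yielding code-mixed output."""
    return [lexicon.get(tok.lower(), tok) for tok in tokens]

# Hypothetical English -> Acehnese entries.
lexicon = {"chicken": "manok", "one": "saboh"}

df = pd.read_csv("tokenized_filtered-example.csv")  # assumed file name
df["text"] = df["text"].apply(
    lambda t: " ".join(translate_tokens(t.split(), lexicon))
)
df.to_csv("translated-example.csv", index=False)
```

Tokens without a lexicon entry are intentionally left in English, which is why translated examples (like the NusaX sample above) are code-mixed.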
The file names follow the format `{model_name}-{task_type}-en-{lang}-ctg-total{size}`. Here are the field descriptions:

- `model_name`: LLM used to generate the lexicon-conditioned data
- `task_type`: `sa` for sentiment analysis and `tm` for topic classification (`tm` because we originally called it topic modeling)
- `lang`: low-resource language code
- `size`: 1K, 10K, or 100K, which refers to the size of the LexC-Gen generated data before filtering
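For convenience, a small parser for this naming convention might look like the sketch below. It assumes the `-sa-en-` / `-tm-en-` segment never occurs inside a model name, and the example file name is hypothetical; adjust the `lang` pattern if your language codes include script suffixes.

```python
# Minimal sketch: parse {model_name}-{task_type}-en-{lang}-ctg-total{size}.
# Assumes "-sa-en-" / "-tm-en-" never appears inside the model name itself.
import re

PATTERN = re.compile(
    r"^(?P<model_name>.+)"
    r"-(?P<task_type>sa|tm)"
    r"-en-(?P<lang>[a-z]+)"
    r"-ctg-total(?P<size>\d+K)$"
)

name = "gpt-4-sa-en-ace-ctg-total100K"  # hypothetical file name
m = PATTERN.match(name)
if m:
    print(m.groupdict())
    # {'model_name': 'gpt-4', 'task_type': 'sa', 'lang': 'ace', 'size': '100K'}
```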
If you use this data, please cite:

```bibtex
@inproceedings{yong2024lexcgen,
    title = {LexC-Gen: Generating Data for Extremely Low-Resource Languages with Large Language Models and Bilingual Lexicons},
    author = {Zheng-Xin Yong and Cristina Menghini and Stephen H. Bach},
    booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2024},
    year = {2024}
}
```