This repository stores all the intermediate and final data artifacts of LexC-Gen for both NusaX and SIB-200 tasks.
For researchers or practitioners who want to use the LexC-Gen generated data directly, we refer you to our datasets hosted on HuggingFace.
The datasets on HuggingFace have the structure `{id, text, label}`. For instance, for NusaX sentiment analysis, an example is:

```python
{'id': '1',
 'text': "Anchorwoman : Hai , pubuet n't reuhung atra aneuk kumuen meulawan buli aneuk miet , ikat atra getnyan fingers ngeun saboh boh manok ngeun jangka gobnyan ho saboh pillar .",
 'label': 1}
```
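As a minimal sketch, loading one of these datasets with the `datasets` library looks like the following. The repository name below is a placeholder, not an actual dataset identifier; substitute the dataset name from our HuggingFace page.

```python
# Minimal sketch of loading a LexC-Gen dataset from the HuggingFace Hub.
# NOTE: "your-org/lexcgen-nusax-ace" is a placeholder; use the actual
# dataset repository name listed on our HuggingFace page.
from datasets import load_dataset

ds = load_dataset("your-org/lexcgen-nusax-ace", split="train")
print(ds[0])  # -> {'id': '1', 'text': '...', 'label': 1}
```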
The intermediate data artifacts for both NusaX and SIB-200 tasks include:
- raw generated English text data after step (2) (`.txt` format in `{task}-lexcgen-raw-data/`)
- raw texts converted to CSV (`.csv` format in `{task}-lexcgen-processed-data/`)
- filtered data after input-label consistency filtering, i.e., after step (3) (`filtered-*.csv`)
- English data tokenized with Stanza after filtering (`tokenized_filtered-*.csv`)
- data translated to the respective low-resource languages using the GATITOS bilingual lexicon, i.e., after step (4) (`translated-*.csv`); see the sketch after this list
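To illustrate how the final translation step works conceptually, here is a minimal word-for-word substitution sketch. It is not our exact pipeline: the lexicon entries, file names, and the `text` column name below are assumptions for illustration.

```python
# Minimal sketch of step (4): word-for-word translation of tokenized
# English text with a bilingual lexicon such as GATITOS.
# The lexicon entries, file names, and "text" column are assumptions.
import pandas as pd

def translate_tokens(tokens, lexicon):
    """Replace each English token with its lexicon entry when one exists;
    tokens without an entry are kept as-is, yielding code-mixed output."""
    return [lexicon.get(tok.lower(), tok) for tok in tokens]

# Hypothetical English -> Acehnese entries.
lexicon = {"chicken": "manok", "one": "saboh"}

df = pd.read_csv("tokenized_filtered-example.csv")  # assumed file name
df["text"] = df["text"].apply(
    lambda t: " ".join(translate_tokens(t.split(), lexicon))
)
df.to_csv("translated-example.csv", index=False)
```

Tokens without a lexicon entry are intentionally left in English, which is why translated examples (like the NusaX sample above) are code-mixed.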
The file names follow the format `{model_name}-{task_type}-en-{lang}-ctg-total{size}`. Here are the field descriptions:

- `model_name`: LLM used to generate the lexicon-conditioned data
- `task_type`: `sa` for sentiment analysis and `tm` for topic classification (`tm` because we originally called it topic modeling)
- `lang`: low-resource language code
- `size`: 1K, 10K, or 100K, which refers to the size of the LexC-Gen generated data before filtering
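For convenience, a small parser for this naming convention might look like the sketch below. It assumes the `-sa-en-` / `-tm-en-` segment never occurs inside a model name, and the example file name is hypothetical; adjust the `lang` pattern if your language codes include script suffixes.

```python
# Minimal sketch: parse {model_name}-{task_type}-en-{lang}-ctg-total{size}.
# Assumes "-sa-en-" / "-tm-en-" never appears inside the model name itself.
import re

PATTERN = re.compile(
    r"^(?P<model_name>.+)"
    r"-(?P<task_type>sa|tm)"
    r"-en-(?P<lang>[a-z]+)"
    r"-ctg-total(?P<size>\d+K)$"
)

name = "gpt-4-sa-en-ace-ctg-total100K"  # hypothetical file name
m = PATTERN.match(name)
if m:
    print(m.groupdict())
    # {'model_name': 'gpt-4', 'task_type': 'sa', 'lang': 'ace', 'size': '100K'}
```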
If you use this data, please cite:

```bibtex
@inproceedings{yong2024lexcgen,
    title = {LexC-Gen: Generating Data for Extremely Low-Resource Languages with Large Language Models and Bilingual Lexicons},
    author = {Zheng-Xin Yong and Cristina Menghini and Stephen H. Bach},
    booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2024},
    year = {2024}
}
```