This repository walks you through the process of building a complete sentiment analysis model that predicts the polarity of a given review (whether the expressed opinion is positive or negative). The model is trained on the popular IMDb movie reviews dataset.
- The first notebook covers loading the raw dataset, feature extraction and analysis, text preprocessing, and preparation of the train/validation/test sets.
- The second tutorial contains instructions on how to set up the vocabulary object, which will be responsible for the following tasks:
- Creating the dataset's vocabulary.
- Filtering the dataset by rare-word occurrence and sentence length.
- Mapping words to their numerical representation (word2index) and reverse (index2word).
- Enabling the use of pre-trained word vectors.
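The vocabulary tasks listed above can be sketched roughly as follows. This is a minimal illustration, not the class from the notebook: the name `Vocab`, the `<pad>`/`<unk>` tokens, and the `min_count` threshold are all assumptions made for the example.

```python
from collections import Counter


class Vocab:
    """Toy vocabulary: word <-> index mappings with rare-word filtering."""

    PAD, UNK = "<pad>", "<unk>"  # assumed special tokens

    def __init__(self, sentences, min_count=2):
        # Count every token across the tokenized dataset.
        counts = Counter(tok for sent in sentences for tok in sent)
        # Special tokens take indices 0 and 1; words rarer than min_count are dropped.
        self.index2word = [self.PAD, self.UNK] + sorted(
            w for w, c in counts.items() if c >= min_count
        )
        self.word2index = {w: i for i, w in enumerate(self.index2word)}

    def encode(self, sentence):
        # Unknown or filtered-out words map to the <unk> index.
        unk = self.word2index[self.UNK]
        return [self.word2index.get(tok, unk) for tok in sentence]

    def decode(self, indices):
        return [self.index2word[i] for i in indices]


reviews = [["great", "movie"], ["boring", "movie"], ["great", "cast"]]
vocab = Vocab(reviews, min_count=2)
encoded = vocab.encode(["great", "boring", "movie"])
print(encoded)                 # "boring" occurs once, so it becomes <unk>
print(vocab.decode(encoded))
```

Reserving index 0 for padding is a common convention because it lets an embedding layer treat that index specially, but any fixed index would work.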
Furthermore, we will build the BatchIterator class, which can be used for:
- Sorting dataset examples.
- Generating batches.
- Sequence padding.
- Enabling a BatchIterator instance to iterate through all batches.
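To make the four responsibilities above concrete, here is a small pure-Python sketch. It only borrows the class name from the text; the constructor arguments (`batch_size`, `pad_index`) and the returned tuple layout are assumptions, and the notebook's implementation may differ.

```python
class BatchIterator:
    """Yields padded batches of index sequences, longest sequences first."""

    def __init__(self, sequences, labels, batch_size=2, pad_index=0):
        # Sort examples by length so each batch needs little padding.
        order = sorted(range(len(sequences)),
                       key=lambda i: len(sequences[i]), reverse=True)
        self.sequences = [sequences[i] for i in order]
        self.labels = [labels[i] for i in order]
        self.batch_size = batch_size
        self.pad_index = pad_index

    def __len__(self):
        # Number of batches (ceiling division).
        return -(-len(self.sequences) // self.batch_size)

    def __iter__(self):
        for start in range(0, len(self.sequences), self.batch_size):
            seqs = self.sequences[start:start + self.batch_size]
            max_len = len(seqs[0])  # the first sequence in a batch is the longest
            padded = [s + [self.pad_index] * (max_len - len(s)) for s in seqs]
            lengths = [len(s) for s in seqs]
            yield padded, lengths, self.labels[start:start + self.batch_size]


iterator = BatchIterator([[5, 6], [7, 8, 9, 4], [3]], labels=[1, 0, 1])
batches = list(iterator)
for padded, lengths, labels in batches:
    print(padded, lengths, labels)
```

Returning the true lengths alongside the padded sequences keeps the option open to mask or pack the padding later in the model.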
- In the third notebook, the bidirectional Gated Recurrent Unit (GRU) model will be built. In our neural network we will implement and use the following architectures and techniques: bidirectional GRU, stacked (multi-layer) GRU, dropout/spatial dropout, max-pooling, and average-pooling. The hyperparameter tuning process will also be presented. After choosing a suitable set of hyperparameters, we will train the model and estimate its generalization error.
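The pieces named above could be combined along the following lines in PyTorch. This is a hedged sketch rather than the notebook's model: the class name `BiGRU`, all layer sizes, and the choice to concatenate the max- and average-pooled outputs are assumptions, and padded positions are not masked before pooling.

```python
import torch
import torch.nn as nn


class BiGRU(nn.Module):
    """Stacked bidirectional GRU with spatial dropout and max/avg pooling."""

    def __init__(self, vocab_size, emb_dim=32, hidden=16, layers=2,
                 dropout=0.3, n_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # Dropout2d zeroes whole channels; applied to embeddings it drops
        # entire embedding dimensions across all time steps (spatial dropout).
        self.spatial_dropout = nn.Dropout2d(dropout)
        self.gru = nn.GRU(emb_dim, hidden, num_layers=layers, batch_first=True,
                          bidirectional=True, dropout=dropout)
        self.dropout = nn.Dropout(dropout)
        # Max-pooled and avg-pooled outputs, each of size 2 * hidden.
        self.fc = nn.Linear(4 * hidden, n_classes)

    def forward(self, x):
        emb = self.embedding(x)                        # (batch, seq, emb)
        emb = emb.permute(0, 2, 1).unsqueeze(-1)       # (batch, emb, seq, 1)
        emb = self.spatial_dropout(emb).squeeze(-1).permute(0, 2, 1)
        out, _ = self.gru(emb)                         # (batch, seq, 2 * hidden)
        max_pool = out.max(dim=1).values               # (batch, 2 * hidden)
        avg_pool = out.mean(dim=1)                     # (batch, 2 * hidden)
        features = torch.cat([max_pool, avg_pool], dim=1)
        return self.fc(self.dropout(features))         # (batch, n_classes)


model = BiGRU(vocab_size=50).eval()
logits = model(torch.randint(1, 50, (4, 12)))  # 4 fake reviews, 12 tokens each
print(logits.shape)
```

Pooling over all time steps, instead of using only the last hidden state, lets every position in the review contribute to the prediction.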
- BiGRU with additional features
In this notebook, we will implement the bidirectional Gated Recurrent Unit model that uses features extracted in the first tutorial.
- This notebook covers the implementation of the bidirectional Gated Recurrent Unit model, which uses pre-trained GloVe word embeddings together with additional features.
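As an illustration of how pre-trained vectors can be aligned with a vocabulary, the sketch below reads the plain-text GloVe format (one word per line followed by its space-separated vector values). The function name `load_glove`, the random initialization range, and the toy file are assumptions; `word2index` is expected to map words to the contiguous indices 0..n-1.

```python
import os
import random
import tempfile


def load_glove(path, word2index, dim, seed=0):
    """Build an embedding matrix (list of rows) aligned with word2index."""
    rng = random.Random(seed)
    # Start from small random vectors so words missing from GloVe still get a row.
    matrix = [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in word2index]
    found = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, values = parts[0], parts[1:]
            if word in word2index and len(values) == dim:
                matrix[word2index[word]] = [float(v) for v in values]
                found += 1
    return matrix, found


# Tiny stand-in for a real GloVe file, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_glove.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("movie 0.1 0.2 0.3\ngreat 0.4 0.5 0.6\n")
    matrix, found = load_glove(path, {"<pad>": 0, "great": 1, "plot": 2}, dim=3)

print(found, matrix[1])
```

In PyTorch, a matrix like this can be converted with `torch.tensor(matrix)` and passed to `nn.Embedding.from_pretrained`.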
- In this notebook, we will build a Convolutional Neural Network model for text classification.
- Transformer model for classification
Implementation of the Self-Attention Transformer model for the classification task.
The dataset is available at the following link: http://ai.stanford.edu/~amaas/data/sentiment/
Unpack the downloaded tar.gz file using:
tar -xzf aclImdb_v1.tar.gz
Rearrange the data to the following structure:
dataset
├── test
│   ├── positive
│   └── negative
└── train
    ├── positive
    └── negative
- Create a virtual environment (conda, virtualenv, etc.).
conda create -n <env_name> python=3.7
- Activate your environment.
conda activate <env_name>
- Create a new kernel.
pip install ipykernel
python -m ipykernel install --user --name <env_name>
- Go to the directory:
.local/share/jupyter/kernels/<env_name>
and ensure that the kernel.json file contains the path to your environment's Python interpreter (it can be checked with the which python command).
{
  "argv": [
    "/home/user/anaconda3/envs/<env_name>/bin/python",
    "-m",
    "ipykernel_launcher",
    "-f",
    "{connection_file}"
  ],
  "display_name": "<env_name>",
  "language": "python"
}
- Install requirements.
pip install -r requirements.txt
- Restart your environment.
conda deactivate
conda activate <env_name>
Inside your virtual environment, launch Jupyter Notebook and open a notebook file (with the .ipynb extension), then change the kernel to the one created in the preceding steps (<env_name>). You are now ready to follow the tutorial.
Model | Test accuracy | Validation accuracy | Training accuracy |
---|---|---|---|
BiGRU | 0.880 | 0.878 | 0.908 |
BiGRU with extra features | 0.882 | 0.881 | 0.898 |
BiGRU with GloVe vectors | 0.862 | 0.862 | 0.842 |
TextCNN | 0.859 | 0.847 | 0.833 |
Transformer | 0.883 | 0.880 | 0.912 |