General Overview

Machine Learning Overview

  • Neural Networks.
  • Explainability.
  • Word Embedding.
  • K-means Clustering.
  • BERT Contextual Embeddings.

Libraries Overview

  • Scikit-learn
  • TensorFlow 2.0+
  • PyTorch
  • Lime
  • Natural Language Toolkit (NLTK)

Dataset used:

  • IMDB movie reviews sentiment dataset:
    • This is a dataset for binary sentiment classification containing a set of 25,000 highly polar movie reviews for training, and 25,000 for testing.
    • For the clustering part of the tutorial we will combine the train and test data for a total of 50,000 movie reviews and their negative/positive labels (a loading sketch follows this list).
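
A minimal sketch of how the data might be loaded with tensorflow-datasets (installed in the setup step below); the "imdb_reviews" name and splits follow the tfds catalog, and the pooling step mirrors the combined 50,000-review corpus described above.

import tensorflow_datasets as tfds

# 25,000 training and 25,000 test reviews, each labeled 0 (negative) or 1 (positive).
(train_ds, test_ds), info = tfds.load(
    "imdb_reviews",
    split=["train", "test"],
    as_supervised=True,  # yield (text, label) pairs
    with_info=True,
)

# For the clustering part we pool both splits into one 50,000-review corpus.
texts, labels = [], []
for text, label in tfds.as_numpy(train_ds.concatenate(test_ds)):
    texts.append(text.decode("utf-8"))
    labels.append(int(label))

print(len(texts))  # 50000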

Content:

  • Train custom word embeddings using a small neural network.
  • Explain model predictions with Lime (a short sketch follows this list).
  • Use a state-of-the-art language model like BERT to encode text data into fixed-length vector features.
  • Perform K-means clustering on all 50,000 movie reviews.
  • Find the best k in K-means.
  • Explore different values of k in K-means.
  • Observe the overlap between true labels and predicted clusters.
  • Fine-grained sentiment analysis with clustering.
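
Since Lime has no dedicated slide later on, here is a hedged sketch of the explanation step. `predict_proba` is a hypothetical wrapper around whatever classifier you trained: it must accept a list of raw strings and return an (n, 2) array of class probabilities, which is the contract LimeTextExplainer expects.

from lime.lime_text import LimeTextExplainer

explainer = LimeTextExplainer(class_names=["negative", "positive"])

# predict_proba: hypothetical callable, list of strings -> (n, 2) probabilities.
explanation = explainer.explain_instance(texts[0], predict_proba, num_features=10)

print(explanation.as_list())  # (word, weight) pairs showing each word's influence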

NLP - Fine-grained Sentiment Analysis

  • Sentiment classifiers are most often used for binary classification (positive or negative sentiment only).
  • Fine-grained sentiment classification is a significantly more challenging task!
  • Typical breakdown of fine-grained sentiment: very negative, negative, neutral, positive, very positive (a five-class scale).

NLP - Fine-grained Sentiment Analysis

  • Binary class labels may be sufficient for studying large-scale positive/negative sentiment trends in text data such as Tweets, product reviews or customer feedback, but they do have their limitations.

  • When performing information extraction with comparative expressions, for example:
  • “This OnePlus model X is so much better than Samsung model X.”
  • Fine-grained analysis can provide more precise results.

  • “The location was truly disgusting … but the people there were glorious.”
  • Dual-polarity sentences can confuse binary sentiment classifiers.

Word Embedding

  • Vector representations of a particular word.
  • Use a Neural Network to create the vector representations.
  • Sentence Embedding: vector representations of a sentence (average of word embeddings).
  • Train a neural network to create our own embeddings (see the sketch after this list).
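
A minimal sketch, assuming TensorFlow 2.x Keras: a small sentiment network whose Embedding layer learns word vectors as a by-product of training. VOCAB_SIZE, EMBED_DIM, and the layer sizes are illustrative choices, and `texts`/`labels` come from the dataset-loading sketch above.

import numpy as np
import tensorflow as tf

VOCAB_SIZE = 10000  # assumption: keep the 10k most frequent words
EMBED_DIM = 64      # assumption: size of each word vector

# Turn raw review strings into sequences of integer word ids.
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE, output_sequence_length=256)
vectorizer.adapt(texts)

model = tf.keras.Sequential([
    vectorizer,
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    tf.keras.layers.GlobalAveragePooling1D(),  # sentence vector = mean of its word vectors
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # positive/negative
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(np.array(texts), np.array(labels), epochs=5, validation_split=0.2)

# The learned word embeddings are the Embedding layer's weight matrix.
word_vectors = model.layers[1].get_weights()[0]  # shape: (VOCAB_SIZE, EMBED_DIM)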

Word Embedding + K-means Clustering

  • Use our pre-trained neural network model to get vector representations for each review.
  • Use K-means to cluster all 50,000 movie reviews (a sketch follows this list).
  • Fine-grained sentiment analysis.
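
A hedged sketch of this step, reusing `model` and `texts` from the sketches above: mean-pooled review vectors feed scikit-learn's KMeans, and KneeLocator from the kneed package (installed in the setup step) picks the elbow k from the inertia curve.

import numpy as np
import tensorflow as tf
from sklearn.cluster import KMeans
from kneed import KneeLocator

# Sub-model that stops after pooling: one fixed-length vector per review.
encoder = tf.keras.Sequential(model.layers[:3])
review_vectors = encoder.predict(np.array(texts))  # shape: (50000, EMBED_DIM)

# Elbow method: fit K-means for a range of k, then let kneed find the knee.
ks = list(range(2, 11))
inertias = [KMeans(n_clusters=k, random_state=0).fit(review_vectors).inertia_
            for k in ks]
best_k = KneeLocator(ks, inertias, curve="convex", direction="decreasing").elbow
print("best k:", best_k)

clusters = KMeans(n_clusters=best_k, random_state=0).fit_predict(review_vectors)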

BERT Word Embeddings

  • BERT (Bidirectional Encoder Representations from Transformers), released in late 2018.
  • With regular word embeddings, each word has a fixed representation.
  • BERT produces word representations that are dynamically informed by the words around them.
  • Example: “The man was accused of robbing a bank.” “The man went fishing by the bank of the river.”
  • For our experiments we will use the pre-trained BERT model bert-base-nli-stsb-mean-tokens (a sketch follows this list).
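
A minimal sketch using the sentence-transformers package installed below; the model name matches the slide, and the two "bank" sentences are the example above. The cosine-similarity check is an illustrative way to see that context changes the representation.

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

bert = SentenceTransformer("bert-base-nli-stsb-mean-tokens")

sentences = ["The man was accused of robbing a bank.",
             "The man went fishing by the bank of the river."]
vectors = bert.encode(sentences)  # one 768-dimensional vector per sentence

# The sentences share the word "bank" but mean different things, so their
# vectors end up far apart: BERT encodes each word in the context of its neighbors.
print(cosine_similarity([vectors[0]], [vectors[1]]))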

BERT Sentence Embedding + K-means Clustering

  • Use the pre-trained BERT model to get a 768-dimensional vector representation for each review.
  • Use K-means to cluster all 50,000 movie reviews (a sketch follows this list).
  • Fine-grained sentiment analysis.
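
A hedged sketch of the clustering step, reusing `bert`, `texts`, and `labels` from the sketches above; the adjusted Rand index is one standard way to measure the overlap between clusters and the true labels.

from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

bert_vectors = bert.encode(texts, show_progress_bar=True)  # shape: (50000, 768)

clusters = KMeans(n_clusters=2, random_state=0).fit_predict(bert_vectors)

# How much do two clusters overlap with the true negative/positive labels?
print("adjusted Rand index:", adjusted_rand_score(labels, clusters))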

Setup Python Environment

Open a terminal in your JupyterLab session and type:

module load jupyter
pip install git+https://github.com/arvkevi/kneed
pip install nltk
pip install tensorflow-datasets
pip install lime
pip install git+https://github.com/UKPLab/sentence-transformers

Now restart the kernel.
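After restarting, a quick sanity check that all the packages installed above are visible to the new kernel:

import nltk, lime, kneed, tensorflow_datasets, sentence_transformers
import sklearn, tensorflow
print("TensorFlow", tensorflow.__version__)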

Let’s start!

Final Notes

  • Got to use JupyterLab on Talon (I hope).
  • Improved your Machine Learning skills.
  • Learned how to find sentiment in text data.
  • Questions?
  • You’re welcome!
