General Overview

Machine Learning Overview

  • Neural Networks.
  • Explainability.
  • Word Embedding.
  • K-means Clustering.
  • BERT Contextual Embeddings.

Libraries Overview

  • Scikit-learn
  • TensorFlow 2.0+
  • PyTorch
  • Lime
  • Natural Language Toolkit (NLTK)

Dataset used:

  • IMDB movie reviews sentiment dataset:
    • This is a dataset for binary sentiment classification containing a set of 25,000 highly polar movie reviews for training, and 25,000 for testing.
    • For the clustering part of the tutorial we will combine the train and test data for a total of 50,000 movie reviews and their negative/positive labels (a loading sketch follows this list).
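
A minimal sketch of how the data might be loaded with tensorflow-datasets (installed in the setup step below); the "imdb_reviews" name and splits follow the tfds catalog, and the pooling step mirrors the combined 50,000-review corpus described above.

import tensorflow_datasets as tfds

# 25,000 training and 25,000 test reviews, each labeled 0 (negative) or 1 (positive).
(train_ds, test_ds), info = tfds.load(
    "imdb_reviews",
    split=["train", "test"],
    as_supervised=True,  # yield (text, label) pairs
    with_info=True,
)

# For the clustering part we pool both splits into one 50,000-review corpus.
texts, labels = [], []
for text, label in tfds.as_numpy(train_ds.concatenate(test_ds)):
    texts.append(text.decode("utf-8"))
    labels.append(int(label))

print(len(texts))  # 50000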

Content:

  • Train custom word embeddings using a small neural network.
  • Explain model predictions with Lime (a short sketch follows this list).
  • Use a state-of-the-art language model like BERT to encode text data into fixed-length vector features.
  • Perform K-means clustering on all 50,000 movie reviews.
  • Find the best k in K-means.
  • Explore different values of k in K-means.
  • Observe the overlap between true labels and predicted clusters.
  • Fine-grained sentiment analysis with clustering.
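
Since Lime has no dedicated slide later on, here is a hedged sketch of the explanation step. `predict_proba` is a hypothetical wrapper around whatever classifier you trained: it must accept a list of raw strings and return an (n, 2) array of class probabilities, which is the contract LimeTextExplainer expects.

from lime.lime_text import LimeTextExplainer

explainer = LimeTextExplainer(class_names=["negative", "positive"])

# predict_proba: hypothetical callable, list of strings -> (n, 2) probabilities.
explanation = explainer.explain_instance(texts[0], predict_proba, num_features=10)

print(explanation.as_list())  # (word, weight) pairs showing each word's influence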

NLP - Fine-grained Sentiment Analysis

  • Sentiment classifiers are most often used for binary classification (positive or negative sentiment only).
  • Fine-grained sentiment classification is a significantly more challenging task!
  • Typical breakdown of fine-grained sentiment: very negative, negative, neutral, positive, very positive (a five-class scale).

NLP - Fine-grained Sentiment Analysis

  • Binary class labels may be sufficient for studying large-scale positive/negative sentiment trends in text data such as Tweets, product reviews or customer feedback, but they do have their limitations.

  • When performing information extraction with comparative expressions, for example:
  • “This OnePlus model X is so much better than Samsung model X.”
  • Fine-grained analysis can provide more precise results.

  • “The location was truly disgusting … but the people there were glorious.”
  • Dual-polarity sentences can confuse binary sentiment classifiers.

Word Embedding

  • Vector representations of a particular word.
  • Use a Neural Network to create the vector representations.
  • Sentence Embedding: vector representations of a sentence (average of word embeddings).
  • Train a neural network to create our own embeddings (see the sketch after this list).
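
A minimal sketch, assuming TensorFlow 2.x Keras: a small sentiment network whose Embedding layer learns word vectors as a by-product of training. VOCAB_SIZE, EMBED_DIM, and the layer sizes are illustrative choices, and `texts`/`labels` come from the dataset-loading sketch above.

import numpy as np
import tensorflow as tf

VOCAB_SIZE = 10000  # assumption: keep the 10k most frequent words
EMBED_DIM = 64      # assumption: size of each word vector

# Turn raw review strings into sequences of integer word ids.
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE, output_sequence_length=256)
vectorizer.adapt(texts)

model = tf.keras.Sequential([
    vectorizer,
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    tf.keras.layers.GlobalAveragePooling1D(),  # sentence vector = mean of its word vectors
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # positive/negative
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(np.array(texts), np.array(labels), epochs=5, validation_split=0.2)

# The learned word embeddings are the Embedding layer's weight matrix.
word_vectors = model.layers[1].get_weights()[0]  # shape: (VOCAB_SIZE, EMBED_DIM)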

Word Embedding + K-means Clustering

  • Use our pre-trained neural network model to get vector representations for each review.
  • Use K-means to cluster all 50,000 movie reviews (a sketch follows this list).
  • Fine-grained sentiment analysis.
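
A hedged sketch of this step, reusing `model` and `texts` from the sketches above: mean-pooled review vectors feed scikit-learn's KMeans, and KneeLocator from the kneed package (installed in the setup step) picks the elbow k from the inertia curve.

import numpy as np
import tensorflow as tf
from sklearn.cluster import KMeans
from kneed import KneeLocator

# Sub-model that stops after pooling: one fixed-length vector per review.
encoder = tf.keras.Sequential(model.layers[:3])
review_vectors = encoder.predict(np.array(texts))  # shape: (50000, EMBED_DIM)

# Elbow method: fit K-means for a range of k, then let kneed find the knee.
ks = list(range(2, 11))
inertias = [KMeans(n_clusters=k, random_state=0).fit(review_vectors).inertia_
            for k in ks]
best_k = KneeLocator(ks, inertias, curve="convex", direction="decreasing").elbow
print("best k:", best_k)

clusters = KMeans(n_clusters=best_k, random_state=0).fit_predict(review_vectors)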

BERT Word Embeddings

  • BERT (Bidirectional Encoder Representations from Transformers), released in late 2018.
  • With regular word embeddings, each word has a fixed representation.
  • BERT produces word representations that are dynamically informed by the words around them.
  • Example: “The man was accused of robbing a bank.” “The man went fishing by the bank of the river.”
  • For our experiments we will use the pre-trained BERT model bert-base-nli-stsb-mean-tokens (a sketch follows this list).
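
A minimal sketch using the sentence-transformers package installed below; the model name matches the slide, and the two "bank" sentences are the example above. The cosine-similarity check is an illustrative way to see that context changes the representation.

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

bert = SentenceTransformer("bert-base-nli-stsb-mean-tokens")

sentences = ["The man was accused of robbing a bank.",
             "The man went fishing by the bank of the river."]
vectors = bert.encode(sentences)  # one 768-dimensional vector per sentence

# The sentences share the word "bank" but mean different things, so their
# vectors end up far apart: BERT encodes each word in the context of its neighbors.
print(cosine_similarity([vectors[0]], [vectors[1]]))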

BERT Sentence Embedding + K-means Clustering

  • Use the pre-trained BERT model to get a 768-dimensional vector representation for each review.
  • Use K-means to cluster all 50,000 movie reviews (a sketch follows this list).
  • Fine-grained sentiment analysis.
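
A hedged sketch of the clustering step, reusing `bert`, `texts`, and `labels` from the sketches above; the adjusted Rand index is one standard way to measure the overlap between clusters and the true labels.

from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

bert_vectors = bert.encode(texts, show_progress_bar=True)  # shape: (50000, 768)

clusters = KMeans(n_clusters=2, random_state=0).fit_predict(bert_vectors)

# How much do two clusters overlap with the true negative/positive labels?
print("adjusted Rand index:", adjusted_rand_score(labels, clusters))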

Setup Python Environment

Open a terminal in your JupyterLab session and type:

module load jupyter
pip install git+https://github.com/arvkevi/kneed
pip install nltk
pip install tensorflow-datasets
pip install lime
pip install git+https://github.com/UKPLab/sentence-transformers

Now restart the kernel.
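After restarting, a quick sanity check that all the packages installed above are visible to the new kernel:

import nltk, lime, kneed, tensorflow_datasets, sentence_transformers
import sklearn, tensorflow
print("TensorFlow", tensorflow.__version__)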

Let’s start!

Final Notes

  • Got to use JupyterLab on Talon (I hope).
  • Improved your Machine Learning skills.
  • Learned how to find sentiment in text data.
  • Questions?
  • You’re welcome!
