Sentiment Analysis for Amazon Reviews Data
In the e-commerce era, customers express their opinions on e-commerce platforms by writing reviews for the products they bought. For the platforms, the brands that sell on them, and the customers themselves, it is especially important to know customers' attitudes toward a product.
How can we leverage data science to analyze customers' attitudes toward a product? The answer is sentiment analysis based on review data.

I chose this topic because of my personal experience buying furniture on Amazon. When I decided to buy a computer desk, it took me a long time to browse the reviews of different products and compare their advantages and disadvantages. Drawing on what I learnt in the Marketing Analytics course, I wondered whether I could scrape the review data and analyze it with some models. So I imagined I was a data analyst at an e-commerce company, and set myself several tasks to solve the problem of analyzing customers' attitudes toward a product:
- Scrape review text and review scores from Amazon.com (Selenium and BeautifulSoup in Python)
- Transform the text data into numerical data
- Apply Machine Learning models to predict review scores based on review text and select the best model (scikit-learn in Python)
Beyond the pre-defined problem, this solution can also be applied to similar problems in the e-commerce industry, for example:

1. Before launching a new product, a seller can pre-release it to a small group of customers and gather feedback. Sentiment analysis can then be applied to analyze the customers' feedback on the product.
2. People express their opinions about products on social media platforms such as Twitter. Companies can collect these opinions to gauge users' attitudes.
Scraping data from Amazon!

There are tons of products and product pages on Amazon. Web scraping lets us collect data from these pages automatically instead of copying and pasting it manually.

Step 1: Import the packages needed
import os
import re
import time
import datetime
import random
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
Step 2: Set up the web driver using Selenium

I decided to scrape only the computer desk review data, so I searched for "computer desk" on Amazon and used the link to the results page, which lists dozens of products per page.
# Open the Amazon search results page for "computer desk"
driver = webdriver.Firefox()
link_to_scrape_product_links = 'https://www.amazon.com/s?k=computer+desk&ref=nb_sb_noss_1'
driver.get(link_to_scrape_product_links)
time.sleep(6)  # give the page time to load
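The search results page is now open, so I can collect the product links before visiting each product page. A minimal sketch of that step, where the 'a-link-normal s-no-outline' class is my assumption about Amazon's current markup and may need adjusting:
# Parse the search results page and collect the product links
search_bs = BeautifulSoup(driver.page_source, 'html.parser')
all_product_links = []
for a_tag in search_bs.find_all('a', {'class': 'a-link-normal s-no-outline'}):
    href = a_tag.attrs.get('href', '')
    if href.startswith('/'):  # relative links point to products in the results
        all_product_links.append('https://www.amazon.com' + href)
all_product_links = list(dict.fromkeys(all_product_links))  # drop duplicates, keep order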

Then I need to get the HTML for each product page and parse it with BeautifulSoup.
# Visit each of the top 10 product links collected above
for i in range(10):
    one_product_link = all_product_links[i]
    driver.get(one_product_link)
    time.sleep(8)  # see Trick 2 below: shorter sleeps miss content
    raw_data_bs = BeautifulSoup(driver.page_source, 'html.parser')
For each product, we then need to open its reviews page; the link can be extracted from the parsed product page.
reviews_page_link = 'https://www.amazon.com' + raw_data_bs.find_all('a', {'data-hook':'see-all-reviews-link-foot'})[0].attrs['href']
driver.get(reviews_page_link)
time.sleep(5)
Step 3: Scrape the total number of reviews, review scores and review text
The total number of reviews is shown on the reviews page. Knowing it tells me how many reviews I need to scrape for each product. I scraped the HTML, used a regular expression to locate the number, and saved it as a number.
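A minimal sketch of that extraction; the data-hook value and the text pattern are assumptions about Amazon's markup and may need adjusting:
# Element holding text like "1,234 global ratings | 987 global reviews"
info = raw_data_bs.find('div', {'data-hook': 'cr-filter-info-review-rating-count'})
match = re.search(r'([\d,]+)\s+global review', info.get_text()) if info else None
total_reviews = float(match.group(1).replace(',', '')) if match else 0.0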

Trick 1: Reviews containing images or videos store their text in a different format
To handle the different formats, I found that reviews with a video contain so-called 'children' under the bs.find('span', {'data-hook':'review-body'}) tag. This can be handled with an if-else statement that checks whether there is a video in the review.
# See if there is a video in the review
children = []
if bs.find('span', {'data-hook':'review-body'}).find('input', {'type':'hidden'}):
    # Video reviews: the text is one of the tag's children
    for child in bs.find('span', {'data-hook':'review-body'}).children:
        children.append(child)
    one_review_text = children[4].get_text()
    one_review['review_text'] = " ".join(one_review_text.split())
else:
    try:
        one_review_text = bs.find('span', {'data-hook':'review-body'}).get_text()
    except AttributeError:  # review body missing
        one_review_text = ''
    one_review['review_text'] = one_review_text.strip()
Trick 2: Sleep time needs to be above 8s when opening a product page and above 10s when opening a reviews page
At first I set time.sleep(5), and the maximum number of reviews I could get was 170 each time. After switching to time.sleep(8), I could get all the reviews for each product. When opening a new reviews page, I need time.sleep(random.randint(10,13)); otherwise I only get the reviews on the first page.
Trick 3: Each product only shows its top reviews (about 1,000 reviews per product)

Although some top products have more than 10,000 reviews, Amazon only shows the top reviews that contain review text. So I need to click the 'Next page' button until the final page to get all the review data.
# Keep clicking 'Next page' until the final page is reached
while True:
    try:
        WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//li[@class='a-last']")))
        driver.find_element(By.XPATH, "//a[contains(text(), 'Next page')]").click()
        time.sleep(random.randint(10, 13))
    except Exception:
        break  # no 'Next page' button left: final page reached
Finally, I got all the review data for the top 10 products: 10,768 rows in total, with each row holding a review's text and its star score.

Cleaning Review Data to Prepare for Machine Learning
Now we have 10,768 rows of raw customer review text and review scores. We need to split the data into three groups (training, validation, and testing) in order to select the best model.
Step 1: Transforming text data into numbers
Before we can train Machine Learning models, we must transform the review text into numbers: a bag-of-words representation. We use CountVectorizer to convert each text into a vector of token counts.

First, CountVectorizer tokenizes the text, meaning it splits the sentences up into individual words and makes every letter lowercase.
from sklearn.feature_extraction.text import CountVectorizer
import seaborn as sns

corpus = df['review_text'].to_list()
vectorizer_count = CountVectorizer(lowercase=True, ngram_range=(1, 1), max_df=0.96, min_df=0.001)
X = vectorizer_count.fit_transform(df['review_text'].values.astype('U'))
features_frequency = pd.DataFrame({'feature': vectorizer_count.get_feature_names_out(),  # get_feature_names() in older scikit-learn
                                   'feature_frequency': X.toarray().sum(axis=0)})
X.shape
# Plot the 100 most frequent tokens
sns.barplot(x="feature", y="feature_frequency",
            data=features_frequency.sort_values(by='feature_frequency', ascending=False).head(100))
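To see what the vectorizer actually produces, here is a toy illustration (the two sentences are invented for the example):
toy = ["Great desk, easy to assemble", "The desk arrived broken"]
v = CountVectorizer(lowercase=True)
toy_X = v.fit_transform(toy)
print(v.get_feature_names_out())  # the lowercased tokens found in the two sentences
print(toy_X.toarray())            # one row per sentence, one count per token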
Step 2: Using TVT to split our data
# Perform a TVT (train-validation-test) split: roughly 80% / 10% / 10%
df['ML_group'] = np.random.randint(10, size=df.shape[0])
df['ML_group'] = (df['ML_group']<=7)*0 + (df['ML_group']==8)*1 + (df['ML_group']==9)*2
X_train = X[np.where(df['ML_group']==0)[0], :]
X_valid = X[np.where(df['ML_group']==1)[0], :]
X_test  = X[np.where(df['ML_group']==2)[0], :]
y_train = df.loc[df['ML_group']==0, 'review_stars'].to_numpy()
y_valid = df.loc[df['ML_group']==1, 'review_stars'].to_numpy()
y_test  = df.loc[df['ML_group']==2, 'review_stars'].to_numpy()
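A quick sanity check that the split came out at roughly 80/10/10:
print(X_train.shape[0], X_valid.shape[0], X_test.shape[0])
# the three groups together should cover every row
print(X_train.shape[0] + X_valid.shape[0] + X_test.shape[0] == df.shape[0])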
Predicting through various Machine Learning models
After getting the data and sampling the training, validation, and testing sets, I am ready to train Machine Learning models. There are many possible classification techniques, or classifiers, for predicting the score of each review. For the customer review data we scraped, it is difficult to know which Machine Learning model will classify better before actually running the models on our data.

Therefore, I decided to select and train several Machine Learning models (or classifiers): a logistic regression classifier, a Lasso classifier, and a K-nearest neighbors classifier.

To see how the Machine Learning models perform, we draw confusion matrices for the training data, validation data, and test data.
Logistic regression classifier:
from sklearn import linear_model
from sklearn.metrics import confusion_matrix

def logistic_reg_classifier_mult_labels(X_train, y_train, X_valid, y_valid, X_test, y_test):
    # One-vs-rest: fit one binary logistic regression per star rating
    categories = pd.DataFrame(np.sort(np.unique(y_train))).reset_index()
    categories.columns = ['index', 'label']
    train_list = []
    valid_list = []
    test_list = []
    for cat in categories['label'].to_list():
        y_train_c = 1*(y_train==cat)
        clf = linear_model.LogisticRegression(tol=0.0001,
                                              max_iter=10000,
                                              random_state=None).fit(X_train, y_train_c)
        train_list.append(clf.predict_proba(X_train)[:,1])
        valid_list.append(clf.predict_proba(X_valid)[:,1])
        test_list.append(clf.predict_proba(X_test)[:,1])
    # Topic probability matrix: one column of probabilities per category
    train = pd.DataFrame(train_list).transpose()
    valid = pd.DataFrame(valid_list).transpose()
    test = pd.DataFrame(test_list).transpose()
    # Choose the predicted category for y: the column with the highest
    # probability, mapped back from column position to the actual star label
    train['label_hat'] = train.idxmax(axis=1).map(categories['label'])
    valid['label_hat'] = valid.idxmax(axis=1).map(categories['label'])
    test['label_hat'] = test.idxmax(axis=1).map(categories['label'])
    conf_matrix_train = confusion_matrix(y_train, train['label_hat'].to_numpy())
    conf_matrix_valid = confusion_matrix(y_valid, valid['label_hat'].to_numpy())
    conf_matrix_test = confusion_matrix(y_test, test['label_hat'].to_numpy())
    return (conf_matrix_train, conf_matrix_valid, conf_matrix_test)
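A minimal usage sketch of the function above on our split data:
conf_matrix_train, conf_matrix_valid, conf_matrix_test = logistic_reg_classifier_mult_labels(
    X_train, y_train, X_valid, y_valid, X_test, y_test)
print(conf_matrix_train)
print(conf_matrix_test)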
We can see the confusion matrices for the training data and testing data. The precision accuracy is very low even for the training data: only 5.15% for training and 8.92% for testing. Interestingly, most of the predictions lie on the other diagonal, which shows the predictions are not random.


KNN Classifier
from sklearn.neighbors import KNeighborsClassifier

conf_matrices = []
precision_accuracy = []
for k in range(1, 20):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    y_hat_valid = clf.predict(X_valid)
    conf_matrices.append(confusion_matrix(y_valid, y_hat_valid))
    precision_accuracy.append((y_hat_valid == y_valid).mean())  # validation accuracy for each k

However, in practice the criterion for selecting the best model is not the training error but the test error. We use the TVT (train-validation-test) split to carve a smaller training set and a validation set out of the data. Training my models on the training data and evaluating them on the validation data helps me find the model with the least test error.

K-fold cross-validation is a resampling procedure that can be used to evaluate machine learning models on a limited data sample.
Given the review data we have, we can use 10-fold cross-validation: it randomly splits the data into 10 folds, then trains each Machine Learning model on 9 folds and evaluates it on the remaining fold, repeating this 10 times.
Here, we can use Scikit-Learn's cross_val_score function to do the 10-fold cross-validation.
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

scores_means = []
for k in range(2, 20):
    clf = KNeighborsClassifier(n_neighbors=k)
    # 10-fold cross-validation; cross_val_score refits the classifier internally
    scores = cross_val_score(clf, X_valid, y_valid, cv=10, scoring='accuracy')
    print(scores.mean())
    scores_means.append(scores.mean())
# Plot mean CV accuracy against k
plt.xlim([0, 20])
plt.ylim([0, 1])
plt.plot(list(range(2, 20)), scores_means)

Most Precise Model: KNN classifier with k = 2
Precision Accuracy: 71.24%
Corresponding Confusion Matrix:
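As a sketch, the matrix and accuracy for k = 2 can be reproduced as follows (the 71.24% figure comes from my original run, not from this snippet):
best_clf = KNeighborsClassifier(n_neighbors=2).fit(X_train, y_train)
y_hat_test = best_clf.predict(X_test)
print(confusion_matrix(y_test, y_hat_test))
print((y_hat_test == y_test).mean())  # precision accuracy on the test set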
My model can predict the review sentiment with 71.24% precision. It makes sense that my model can predict the sentiment of the review text, because I trained it with data from the same set. A precision accuracy of over 70% shows that KNN is a good classifier for review text classification.
