Navigating the Amazon Jungle: Product Research and Analysis through Python and R
In the vast and ever-changing landscape of online marketplaces, understanding product performance is paramount. This project therefore set out to address the challenge of monitoring and managing product performance in the online environment.
Steps and Methods:
1. Data collection using web scraping methods in Python
2. Exploratory statistical analysis
3. Data exploration with text mining techniques in R
Data Extracted from Amazon
Python was used to extract data from Amazon because no third-party software was allowed within the organization where this project took place. There are dedicated tools for this purpose (e.g., Octoparse, ScrapeStorm, or Mozenda), which I had used before in other contexts; they are an easy and effective way to extract reviews online, but the limitations imposed by their built-in standards must be considered.
The Python script used for the extraction process is presented below:
import requests
from bs4 import BeautifulSoup
import pandas as pd

reviewlist = []

def get_soup(url):
    # Render the page through a local Splash instance before parsing it with BeautifulSoup
    r = requests.get('http://localhost:8050/render.html', params={'url': url, 'wait': 2})
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

def get_reviews(soup):
    reviews = soup.find_all('div', {'data-hook': 'review'})
    for item in reviews:
        try:
            review = {
                'product': soup.title.text.replace('Amazon.co.uk:Customer reviews:', '').strip(),
                'title': item.find('a', {'data-hook': 'review-title'}).text.strip(),
                'rating': float(item.find('i', {'data-hook': 'review-star-rating'}).text.replace('out of 5 stars', '').strip()),
                'body': item.find('span', {'data-hook': 'review-body'}).text.strip(),
            }
            reviewlist.append(review)
        except AttributeError:
            # Skip reviews that are missing one of the expected fields
            continue

for x in range(1, 999):
    soup = get_soup(f'https://www.amazon.co.uk/product-reviews/B07WD58H6R/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber={x}')
    print(f'Getting page: {x}')
    get_reviews(soup)
    print(len(reviewlist))
    # Stop when the "Next" button is disabled, i.e. the last review page has been reached
    if not soup.find('li', {'class': 'a-disabled a-last'}):
        pass
    else:
        break

df = pd.DataFrame(reviewlist)
df.to_excel('sony-headphones.xlsx', index=False)
print('Fin.')
This code follows the methods used by John Watson Rooney, who publishes Python tutorials aimed at users who want to scrape product reviews online.
Exploratory Statistics — Sales Performance
Several findings related to sales performance, trends, patterns, and fluctuations in the sales data were presented to analysts, using charts and graphs to illustrate developments over time. Generally speaking, a correlation between sales and star-rating performance was observed. However, since sales performance is outside the scope of this article, no output is presented here in order to keep it confidential.
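For illustration only, the sketch below shows how such a correlation could be checked in R. The data frame sales_data and its columns units_sold and avg_star_rating are hypothetical placeholders, not the confidential figures.
# Hypothetical weekly sales volume and average star rating (illustrative values only)
sales_data <- data.frame(
  week            = 1:12,
  units_sold      = c(120, 135, 150, 160, 158, 170, 182, 175, 190, 205, 210, 220),
  avg_star_rating = c(3.9, 4.0, 4.1, 4.1, 4.0, 4.2, 4.3, 4.2, 4.3, 4.4, 4.4, 4.5)
)
# Pearson correlation between sales and star-rating performance
cor.test(sales_data$units_sold, sales_data$avg_star_rating, method = "pearson")
# Quick visual check of the relationship
plot(sales_data$avg_star_rating, sales_data$units_sold,
     xlab = "Average star rating", ylab = "Units sold",
     main = "Sales vs. star rating (illustrative data)")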
Exploratory Statistics — Customer’s Feedback
Simple exploratory statistics were used to uncover the number of reviews and the rating performance over the period under analysis.
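As a minimal sketch of this step, the reviews exported by the Python script can be summarised in R. It assumes the sony-headphones.xlsx file produced above; the review_date column used for the time breakdown is a hypothetical addition, since the scraper shown here does not extract dates.
library(readxl)
library(dplyr)
# Load the reviews exported by the Python scraper
reviews_df <- read_excel("sony-headphones.xlsx")
# Overall review volume and rating performance
nrow(reviews_df)            # total number of reviews
summary(reviews_df$rating)  # minimum, median, mean and maximum rating
table(reviews_df$rating)    # distribution of star ratings
# Rating performance over time (assumes a hypothetical review_date column)
# reviews_df %>%
#   mutate(month = format(as.Date(review_date), "%Y-%m")) %>%
#   group_by(month) %>%
#   summarise(n_reviews = n(), avg_rating = mean(rating))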
Text Mining — Customer’s Feedback
The text mining techniques employed were varied, among them sentiment analysis, cluster dendrograms, word co-occurrence, relationships between words (positive and negative), and topic modeling. See Text Mining with R: A Tidy Approach for more information. The code scripts applied are shown below:
# Code for word cloud, term frequency and the ten NRC sentiment categories
library(tm)
# Build corpus, option 1: from a data frame column of reviews
corpus <- iconv(TR1500TOR_5_T$Reviews, to = "UTF-8")
corpus <- Corpus(VectorSource(corpus))
# Build corpus, option 2: from a plain-text file of reviews
setwd("~/Desktop")
reviews <- readLines("xxxxx.txt")
corpus <- Corpus(VectorSource(reviews))
inspect(corpus[1:5])
# Clean text
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
cleanset <- tm_map(corpus, removeWords, stopwords('english'))
cleanset <- tm_map(cleanset, stripWhitespace)
inspect(cleanset[1:5])
# Optional: remove links/URLs
#removeURL <- function(x) gsub('http[[:alnum:]]*', '', x)
#cleanset <- tm_map(cleanset, content_transformer(removeURL))
# Optional: remove or merge specific custom words (illustrative examples)
#cleanset <- tm_map(cleanset, removeWords, c('aapl', 'apple'))
#cleanset <- tm_map(cleanset, gsub,
#pattern = 'stocks',
#replacement = 'stock')
# Term document matrix
tdm <- TermDocumentMatrix(cleanset)
tdm
tdm <- as.matrix(tdm)
tdm[1:10, 1:20]
# Bar plot
w <- rowSums(tdm)
w <- subset(w, w>=3)
barplot(w,
las = 2,
col = rainbow(50))
# Write table (export table)
write.table(w, file = "xxxxx.csv", sep = ",")
# Word cloud
library(wordcloud)
w <- sort(rowSums(tdm), decreasing = TRUE)
set.seed(222)
wordcloud(words = names(w),
freq = w,
max.words = 25,
random.order = F,
min.freq = 5,
colors = brewer.pal(8, 'Dark2'),
scale = c(5, 0.3),
rot.per = 0.7)
library(wordcloud2)
w <- data.frame(names(w), w)
colnames(w) <- c('word', 'freq')
wordcloud2(w,
size = 0.7,
shape = 'triangle',
rotateRatio = 0.5,
minSize = 1)
# Optional: draw the word cloud in the shape of a chosen word
letterCloud(w,
word = "apple",
size = 1)
# Sentiment analysis
library(syuzhet)
library(lubridate)
library(ggplot2)
library(scales)
library(reshape2)
library(dplyr)
# Read reviews from the data frame column
senti <- iconv(TR5000R2427EBALLR$Reviews, to = 'UTF-8')
# Obtain sentiment scores
s <- get_nrc_sentiment(senti)
head(s)
senti[4]
get_nrc_sentiment('delay')
# Bar plot
barplot(colSums(s),
las = 2,
col = rainbow(10),
ylab = 'Count',
main = 'Sentiment Scores for reviews')
write.table(s, file = "xxx.csv", sep = ",")
#Code for Topic Modeling
# Libraries required
require(topicmodels)
require(pdftools)
require(tm)
require(tidytext)
require(ggplot2)
require(dplyr)
#Load all PDF files (create an R project and place the PDF documents in its folder)
All_Files<- list.files(pattern = "pdf$")
All_opinions <- lapply(All_Files, pdf_text)
document<-Corpus(VectorSource(All_opinions))
#Clean data
document<-tm_map(document, content_transformer(tolower))
document<-tm_map(document, removeNumbers)
document<-tm_map(document, removeWords, stopwords("english"))
document<-tm_map(document, removePunctuation, preserve_intra_word_dashes = TRUE)
document<-tm_map(document, stripWhitespace)
document<-tm_map(document, removeWords, c('device', 'water'))
#Create document-term matrix
DTM<-DocumentTermMatrix(document)
#Create LDA model with k = 12 topics
Model_lda <- LDA(DTM, k = 12, control = list(seed = 1234))
Model_lda
#Show the probability of a word being associated to a topic
beta_topics <- tidy(Model_lda, matrix = "beta")#create the beta model
beta_topics#shows all the information in beta_topics
#Grouping the terms by topic
beta_top_terms <- beta_topics %>%
group_by(topic) %>%
slice_max(beta, n = 10) %>%
ungroup() %>%
arrange(topic, -beta)
#Display the grouped terms on the charts
beta_top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()
#Inspect the term counts for a single document (here document 3)
tidy(DTM) %>%
filter(document == 3) %>%
arrange(desc(count))
#Examining per document per topic probability
gamma_documents <- tidy(Model_lda, matrix = "gamma")
gamma_documents
#Create a data frame with the gamma results
doc_gamma.df <- data.frame(gamma_documents)
doc_gamma.df$chapter <- rep(1:dim(DTM)[1], 12) #one document index per topic (must match k = 12)
#plot gamma results
ggplot(data = doc_gamma.df, aes(x = chapter, y = gamma,
group = factor(topic), color = factor(topic)))+
geom_line()+facet_wrap(~factor(topic), ncol = 1)
# Code Cluster Dendrogram
# Read text file
library(tm)
setwd("~/Desktop")
reviews <- readLines("Word11.txt")
# Build corpus
corpus <- Corpus(VectorSource(reviews))
inspect(corpus[1:5])
# Clean text
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
cleanset <- tm_map(corpus, removeWords, stopwords('english'))
cleanset <- tm_map(cleanset, stripWhitespace)
inspect(cleanset[1:5])
# Remove custom, uninformative words
cleanset <- tm_map(cleanset, removeWords, c('everything', 'device', 'water', 'heater', 'can', 'boiler', 'even', 'use', 'also', 'now', 'good'))
# Create Term Document Matrix
tdm <- TermDocumentMatrix(cleanset,
control = list(wordLengths = c(1, Inf)))
t <- removeSparseTerms(tdm, sparse=0.88)
m <- as.matrix(t)
# Plot frequent terms
freq <- rowSums(m)
freq <- subset(freq, freq>=80)
barplot(freq, las=2, col = rainbow(25))
# Hierarchical word clustering using dendrogram
distance <- dist(scale(m))
print(distance, digits = 2)
hc <- hclust(distance, method = "ward.D")
plot(hc, hang=-1)
rect.hclust(hc, k=5)
# Nonhierarchical k-means clustering of words/reviews
m1 <- t(m)
set.seed(222)
k <- 12
kc <- kmeans(m1, k)
#Code for co-occurrence network analysis
require(pdftools)
require(tm)
require(tidytext)
require(dplyr)
require(igraph)
require(tidyr)
require(ggraph)
#load all files
All_Files<- list.files(pattern = "pdf$")
All_opinions <- lapply(All_Files, pdf_text)
#Create corpus
document<-Corpus(VectorSource(All_opinions))
#Clean text
document<-tm_map(document, content_transformer(tolower))
document<-tm_map(document, removeNumbers)
document<-tm_map(document, removeWords, stopwords("english"))
document<-tm_map(document, removePunctuation, preserve_intra_word_dashes = TRUE)
document<-tm_map(document, stripWhitespace)
document<-tm_map(document, removeWords, c('device', 'water'))
PDFDataframe<-data.frame(text = sapply(document, as.character),
stringsAsFactors = FALSE)
#Create Bigrams
New_bigrams <- PDFDataframe%>%
unnest_tokens(bigram, text, token = "ngrams", n = 2)
New_bigrams
#Count Bigrams frequency
New_bigrams %>%
count(bigram, sort = TRUE)
#Separate Bigrams and Remove Stop Words
bigrams_separated <- New_bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
#new bigram counts
bigram_counts <- bigrams_filtered %>%
count(word1, word2, sort = TRUE)
bigram_counts
#Filtering for Specific words
bigrams_filtered %>%
filter(word1 == "bosch") %>%
count(word2, sort = TRUE)
#create bigram graph
#use words with a count larger than 5 to find relationship
bigram_graph <- bigram_counts %>%
filter(n > 5) %>%
graph_from_data_frame()
#create the graph based on relationships
set.seed(2017)
ggraph(bigram_graph, layout = "fr") +
geom_edge_link() +
geom_node_point() +
geom_node_text(aes(label = name), vjust = 1, hjust = 1)
# Code for positive and negative word analysis
require(dplyr)
require(tidytext)
require(readr)
require(ggplot2)
library(tm)
require(pdftools)
require(igraph)
require(ggraph)
#Tokenise the cleaned corpus (built in the co-occurrence script above) into a tidy one-word-per-row data frame for the tidytext steps below
review_words <- data.frame(text = sapply(document, as.character),
stringsAsFactors = FALSE) %>%
unnest_tokens(word, text)
#count the most common words in the reviews
review_words %>%
count(word, sort = TRUE)
#create chart with top words
review_words %>%
count(word, sort = TRUE) %>%
filter(n > 100) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
xlab(NULL) +
coord_flip()
#Sentiment analysis starts here
bingwords<-get_sentiments("bing")
#number of positive and negative words in bing
bingwords %>%
group_by(sentiment) %>%
summarise(noOfWords = n())
# Using bing words to explore the document
bing_word_counts <- review_words %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
#wordcloud of most popular words
require(wordcloud)
review_words %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word,n,max.words = 100))
#We can view this visually to assess the top words for each sentiment
bing_word_counts %>%
group_by(sentiment) %>%
top_n(10) %>%
ggplot(aes(reorder(word, n), n, fill = sentiment)) +
geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment", x = NULL) +
coord_flip()
This project embarked on a journey through the customer reviews available on Amazon, employing tools and methodologies to uncover the patterns hidden within the data. The results showcased the power of Python and RStudio in decoding the language of the market. The scripts, methods, and visualizations used are key to understanding customer behavior, sales dynamics, and customer satisfaction.
In conclusion, our journey through the Amazon jungle reinforced a fundamental truth: data is the compass that guides successful ventures in the digital world. By embracing these innovative tools, we can navigate the complexities of online markets, adapting our strategies and products to align with the constantly changing needs and desires of consumers.
As we conclude this journey, we carry with us not just the insights of this project, but also the knowledge that, in the digital environment, the deepest discoveries are often hidden in the patterns and words of consumers.
Happy navigating, fellow explorers. 🌿🚀