Clustering using K-Means

I am a graduate student studying Business Analytics – Data Science at the University of Texas at Dallas, and what you might call a Machine Learning enthusiast. For as long as I can remember, I have been interested in computers and technology.

One of my fondest memories is getting to use my uncle’s new first-generation iPhone as a twelve-year-old. From then on, a career in technology was written in stone for me. Naturally, Machine Learning and Artificial Intelligence are of great interest to me. The field fascinates me with the many uses for its techniques: the advances made in self-driving cars, the many helpful functions that virtual personal assistants such as Siri and Alexa serve in our daily lives, and the way large business problems can be solved much more efficiently through the “art” of prediction. After graduating, I hope to participate in this new and exciting field, and I am optimistic about the many positive changes it will bring.

At the time of this post, I am in the second semester of my four-semester program.

I spent the summer working at RandomTrees Inc., a company that provides AI enterprise solutions to its clients, and the knowledge I gained will be invaluable to my future career in Data Science and Machine Learning.

Before the summer, I had very limited knowledge of Python and its many packages for Data Science and visualization. During my summer internship, I learned advanced concepts in Python, NumPy, pandas, scikit-learn, and Matplotlib. I now have a much better grasp of the Python language and of how it can be used in the Data Science field to uncover useful insights in data.

For my assignment, I was tasked with finding valuable business information and insights about the customers of a retail grocery store chain, specifically at their UK locations. This included metrics such as the number of times a specific customer shops there, how many items they purchase, and which specific items they purchase together. My data source was a CSV spreadsheet that contained information about the customers, such as their store-assigned Customer ID, Quantity of items purchased, Unit Price of each item, Item ID, Stock ID, and the Total Amount the customers spent at the store. Each item purchased was a separate entry, so I used count and groupby functions to derive additional metrics, such as how many times each customer purchased or how often they visited the store. The data was collected by aggregating the register data at the time of purchase for each customer over two years (2010-11).
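As a rough illustration of the groupby and count step described above, here is a minimal pandas sketch. The rows are made up, and the column names (including `InvoiceNo`, which is assumed here as a way to tell separate visits apart) are placeholders rather than the real CSV header.

```python
import pandas as pd

# Hypothetical sample rows: one row per item purchased.
# Column names are assumptions, not the actual CSV header.
transactions = pd.DataFrame({
    "CustomerID": [101, 101, 101, 102, 102, 103],
    "InvoiceNo": ["A1", "A1", "A2", "B1", "B1", "C1"],
    "StockCode": ["S1", "S2", "S1", "S3", "S1", "S2"],
    "Quantity": [2, 1, 3, 5, 1, 4],
    "UnitPrice": [1.50, 4.00, 1.50, 0.80, 1.50, 4.00],
})
transactions["TotalAmount"] = transactions["Quantity"] * transactions["UnitPrice"]

# Collapse the item-level rows into one row per customer.
per_customer = transactions.groupby("CustomerID").agg(
    line_items=("StockCode", "count"),    # number of item rows purchased
    visits=("InvoiceNo", "nunique"),      # distinct invoices ~ store visits
    total_spent=("TotalAmount", "sum"),   # money spent across all rows
)
print(per_customer)
```

Counting distinct invoices rather than rows is what separates "how many items" from "how many visits" when each purchased item is its own entry.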

To analyze the data, I used Principal Component Analysis to reduce the dimensionality of the data, and then K-means Clustering to run the analysis. I began by cleansing the data of duplicate and null values. I then selected only the United Kingdom data as my scope for analysis. I found that 3 principal components were a safe and statistically conservative value to use. Furthermore, after using an Elbow Curve, I found that the optimum number of clusters to use for the k value in the k-means algorithm was 4.
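The steps above (cleansing, filtering to the United Kingdom, PCA with 3 components, an elbow check, then k-means with k = 4) can be sketched with scikit-learn roughly as follows. This is only an illustration on synthetic data: the feature names are hypothetical, and standardizing the features before PCA is an assumption that is common practice but not something stated in the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic per-customer features; names and values are placeholders.
n = 200
customers = pd.DataFrame({
    "Country": rng.choice(["United Kingdom", "France"], size=n, p=[0.8, 0.2]),
    "visits": rng.integers(1, 30, size=n).astype(float),
    "items_bought": rng.integers(1, 500, size=n).astype(float),
    "avg_unit_price": rng.uniform(0.5, 10.0, size=n),
    "total_spent": rng.uniform(5.0, 2000.0, size=n),
})
customers.loc[0, "total_spent"] = np.nan  # plant one null value
customers = pd.concat([customers, customers.iloc[[1]]], ignore_index=True)  # plant one duplicate

# 1. Cleanse duplicates and nulls, then keep only UK rows.
clean = customers.drop_duplicates().dropna()
uk = clean[clean["Country"] == "United Kingdom"]

# 2. Standardize so no single feature dominates (an assumption, see above).
scaled = StandardScaler().fit_transform(uk.drop(columns="Country"))

# 3. Reduce to 3 principal components.
pca = PCA(n_components=3)
components = pca.fit_transform(scaled)
print("Explained variance ratio:", pca.explained_variance_ratio_)

# 4. Elbow curve data: within-cluster sum of squares (inertia) for each k.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(components)
    inertias[k] = model.inertia_

# 5. Fit the final model with k = 4, the value the elbow curve suggested in the text.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(components)
print(pd.Series(labels).value_counts().sort_index())
```

Plotting `inertias` against k (for example with Matplotlib) gives the elbow curve; the "elbow" is the point after which adding more clusters reduces inertia only slightly.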

Through my clustering analysis, I found that many of the stores’ repeat customers purchase the same items together throughout the year. This can be attributed to regular grocery shopping habits. After examining the number of times an item was purchased, I also found that there seemed to be a fairly even distribution of the items purchased. This information could be beneficial to the store’s inventory reorder protocol, as it appears that they may need to keep fairly equal amounts of most of their major products in stock at all times.

By measuring recency, frequency, and monetary value, I came across some interesting findings. As anyone would expect, customers who shopped at the stores frequently had also shopped fairly recently, and they contributed moderate to high monetary value. The more peculiar finding was that some customers who don’t shop at the store frequently also had a high monetary value. This tells us that the store is able to provide value to new or infrequent shoppers, and that it should also continue to focus its efforts on keeping its repeat business.
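For readers unfamiliar with recency, frequency, and monetary (RFM) measurements, the sketch below shows one common way to compute them with pandas. The purchase records, the `InvoiceDate` and `InvoiceNo` columns, and the rule used to flag "infrequent but high value" customers are all hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical purchase records: one row per store visit.
purchases = pd.DataFrame({
    "CustomerID": [101, 101, 101, 102, 103, 103],
    "InvoiceNo": ["A1", "A2", "A3", "B1", "C1", "C2"],
    "InvoiceDate": pd.to_datetime([
        "2011-11-01", "2011-11-20", "2011-12-05",
        "2011-03-15",
        "2011-10-10", "2011-12-01",
    ]),
    "TotalAmount": [20.0, 35.0, 15.0, 400.0, 12.0, 18.0],
})

# Measure recency relative to the day after the last recorded purchase.
snapshot = purchases["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm = purchases.groupby("CustomerID").agg(
    recency_days=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    frequency=("InvoiceNo", "nunique"),
    monetary=("TotalAmount", "sum"),
)
print(rfm)

# Illustrative rule: a single visit, but spending above the median customer.
infrequent_high_value = rfm[
    (rfm["frequency"] <= 1) & (rfm["monetary"] > rfm["monetary"].median())
]
print(infrequent_high_value)
```

A table like `rfm` can also serve as the input features for the PCA and k-means steps, which is one way segments such as "infrequent but high value" can surface as their own cluster.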

Working at RandomTrees was a unique experience because, instead of the routine and basic work that many interns at large corporations are given, I was given the opportunity to participate in real work. This job gave me a great learning opportunity to work alongside experienced Data Scientists, Data Analysts, and Engineers. It was an eye-opening experience to see their work and gain some foresight into my future career. Being in this environment allowed me to learn best practices in the field of Data Science and Machine Learning, and it was truly a great inspiration for me to continue learning about this quickly advancing field.

Article written by

Rohith Selvarajan