I am a senior-year student studying Computer Science through the TAMS program. I am interested in data science, machine learning, and statistics. I am passionate about expanding my knowledge base and keeping up to date with trends and breakthroughs in the industry.
Mining For Gold
Data is all around us, holding great promise to those who can capture and use it effectively. This is the role of data scientists and ML experts. By using the data analysis process, I aimed to do the same and see if I could solve a real-world problem.
I then needed to collect my data. Luckily, stock statistics are public knowledge and not hard to procure. The dataset was presented to me by Ms. Palisetti. It went back about two years and was already structured into a .csv file. In other business cases, we may have needed to scrape the web or access databases, often ending up with a messier, more unstructured dataset.
It was finally time to explore the data. I chose to use Python and its relevant libraries numpy, pandas, and scikit-learn to try to make some headway.
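As a rough sketch of that first look (the file name and column names here are assumptions, since the exact .csv layout isn't shown), loading and inspecting the data with pandas might look like this:

```python
import pandas as pd

# Load the historical stock data (file name is an assumption)
df = pd.read_csv("stock_data.csv")

# First look at the structure: columns, dtypes, and missing values
print(df.head())
print(df.info())
print(df.describe())
```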
Though the data was largely structured, there remained some work to be done before we could apply our model. This step, data cleansing, is hugely important for what’s to follow. To start, our dataset looked like the following:
We can see certain rows have missing values. The solution was to perform mean-value imputation: we simply substituted the mean volume into the fields without data. After removing the 'K's and '%'s, we proceeded to the final issue. One of our columns was the date, a categorical variable. To convert it into values suitable for our model, we split the date into separate day, month, and year variables.
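A minimal sketch of this cleansing step, assuming the column names used above (the actual names in the .csv may differ):

```python
import pandas as pd

df = pd.read_csv("stock_data.csv")

# Strip the 'K' suffix from volume and the '%' sign from the change column,
# then convert both to numeric values (column names are assumptions)
df["Volume"] = df["Volume"].str.replace("K", "", regex=False).astype(float)
df["Change"] = df["Change"].str.replace("%", "", regex=False).astype(float)

# Mean-value imputation: fill the missing volume fields with the column mean
df["Volume"] = df["Volume"].fillna(df["Volume"].mean())

# Split the date into separate day, month, and year features
df["Date"] = pd.to_datetime(df["Date"])
df["Day"] = df["Date"].dt.day
df["Month"] = df["Date"].dt.month
df["Year"] = df["Date"].dt.year
df = df.drop(columns=["Date"])
```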
At last, we were done cleansing the data. However, there remained some steps to improve our results. Firstly, we applied a log transform to the data. This useful tool helps curb the effect of outliers that may skew the overall patterns of the data.
We also removed the "open", "high", and "low" columns, as they were too multicollinear with the price: they added little beyond what was already expressed in the price column. I made this decision after calculating the variance inflation factor for each column, a statistic that takes high values when a column is highly correlated with the other columns.
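A rough sketch of these two improvements, continuing from the cleaned frame above (variance_inflation_factor comes from statsmodels; the feature names are still assumptions):

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Log transform the volume target to curb the effect of extreme outliers
df["Volume"] = np.log1p(df["Volume"])

# Variance inflation factor for each candidate feature; large values mean
# a column is largely explained by the other columns
features = df[["Price", "Open", "High", "Low", "Change", "Year", "Month", "Day"]]
for i, col in enumerate(features.columns):
    print(col, variance_inflation_factor(features.values, i))

# Drop the redundant, highly collinear columns flagged above
df = df.drop(columns=["Open", "High", "Low"])
```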
Now that our data was ready to feed into a model, we had to decide which model to use. It would need to be a regression model, as volume is a continuous variable. To make predictions, I chose two models: linear regression and random forest.
I then split the data into training and testing sets. The idea was to have each model learn and make adjustments during a 'training phase'. I would then evaluate it by having it predict values and comparing those predictions with the testing set. I chose to have the oldest 80% be training data; the rest was for testing.
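Since this is time-ordered data, the split keeps the rows in chronological order rather than shuffling them. A minimal sketch, assuming the frame is sorted from oldest to newest:

```python
# Oldest 80% for training, most recent 20% for testing
split_idx = int(len(df) * 0.8)
train, test = df.iloc[:split_idx], df.iloc[split_idx:]

X_train, y_train = train.drop(columns=["Volume"]), train["Volume"]
X_test, y_test = test.drop(columns=["Volume"]), test["Volume"]
```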
The first model I chose was multiple linear regression. I chose this model for its ease of implementation. In multiple linear regression, we seek the coefficients such that the sum of the squared distances from each point to our fitted line is minimized. Algebraically, the line takes the form of
Volume = c0 + c1·Price + c2·Change + c3·Year + c4·Month + c5·Day
We ran the model, and the results were as follows:
Understanding these metrics would help me understand how my model performed. For instance, the coefficients state that increasing the price by one, while keeping the other variables constant, would change the volume by -0.06. The MSE, RMSE, and MAE are generated from the distances from each prediction to the actual value. Lastly, the r^2 value is the percent of variance explained by the model. Here it is actually quite poor, being close to zero.
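For reference, this is roughly how the fit and those metrics might be produced with scikit-learn (a sketch, reusing the training and testing splits from above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Fit ordinary least squares on the training split
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
preds = lin_reg.predict(X_test)

# Coefficients c1..c5 from the equation above
print(dict(zip(X_train.columns, lin_reg.coef_)))

# Error metrics on the held-out test set
mse = mean_squared_error(y_test, preds)
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))
print("MAE :", mean_absolute_error(y_test, preds))
print("R^2 :", r2_score(y_test, preds))
```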
I then tried random forest. In this algorithm, multiple decision trees are combined. A single decision tree is a model of possible conditions on the variables and their consequences for the target variable. Random forest is a so-called 'bagging' (ensemble) algorithm, as it combines the power of many weak-learning decision trees.
I chose to use 1,000 trees in my forest. To measure its performance, I again had it predict values for the test set. The results were as follows:
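A sketch of that fit with scikit-learn, again reusing the splits from above (the random_state is an arbitrary choice for reproducibility):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Random forest with 1,000 trees, as described above
forest = RandomForestRegressor(n_estimators=1000, random_state=42)
forest.fit(X_train, y_train)
rf_preds = forest.predict(X_test)

mse = mean_squared_error(y_test, rf_preds)
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))
print("MAE :", mean_absolute_error(y_test, rf_preds))
print("R^2 :", r2_score(y_test, rf_preds))
```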
The errors seemed to have decreased, and the predictions took a more organic shape. We can conclude that linear regression may have been too simple to capture the nuances of our data. Random forest, on the other hand, was more robust. That means we could use it to predict future volumes with some degree of accuracy, and we would definitely have an edge over an uninformed investor.
Both models could be further improved. For the linear regression, I could use robust regression to better account for outliers. For my random forest, I could undergo hyper-parameter tuning, as sketched below.
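One possible way to tune the forest would be a grid search over a few key parameters with a time-aware cross-validation split; the parameter values below are illustrative assumptions, not results from this project:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Illustrative grid of hyper-parameters to try
param_grid = {
    "n_estimators": [200, 500, 1000],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 2, 5],
}

# TimeSeriesSplit keeps the chronological ordering during cross-validation
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=TimeSeriesSplit(n_splits=5),
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)
print(search.best_params_)
```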
Still, no model can be perfect. The world is full of noise that doesn’t fit neatly with an overall trend. Data scientists nevertheless seek to mine for that gold in the ore. As I continue to develop my skills, I’ll only become better at this process. Hopefully, I’ll be extracting life-changing insights from data soon!