US Companies Stock Market Prediction - Post Recession

Citadel Chicago Datathon - Summer 2017

Akash, Aman, Hang, Nima - Team 17

1. Problem Statement

Predicting stock price trends by interpreting seemingly chaotic market data has long attracted both investors and researchers. Among the many methods employed to model this dependence, machine learning techniques are by far the most popular, owing to their ability to identify stock trends from the massive amounts of data that capture the underlying price dynamics. Trading firms continually iterate their models as market and economic conditions change; as a result, there are no universal equity models or evaluation standards.

In this datathon, with the wide range of datasets available to us, we decided to model the financial trends of a company by finding correlations with its recruitment needs and with the economic trends of the state in which it is based. Hence, the question we tackled in this datathon is: can we predict a company's stock prices on the basis of its job posting activity and the economic conditions of its home state?

1.1. Significance

Stock trading is one of the main investment activities in the business market. To maximize returns, investors have developed several stock analysis algorithms that help them forecast the movement of stock prices. Whether stocks will rise or fall over a certain period of time is invaluable information to investors, and predicting the direction of stock prices is particularly important for value investing.

We believe that answering the question we have formulated addresses a major piece of the stock market puzzle and deepens our understanding of how a state's economic trends affect the stock pricing of companies based in that state. Prediction of stock prices by modeling their correlations with economic conditions and job posting activity can therefore serve as a strong predictive tool for investors.

1.2. Datasets

To analyze the relationship between stock prices, job openings, and economic parameters, we use the following datasets:

jobs: Job openings data (title, company, location, category, dates, etc.) for over 400 companies, from August 2007 to December 2015.
companies: Important details (name, scrape dates, location, tickers, sectors, etc.) on various companies.
econ_state: Economic data (GDP, per capita income, unemployment rate, etc.) for all 50 states and the District of Columbia, from 1980 to 2016.
geographic: Latitude and longitude data organized alphabetically by city and state.
financial: Time series of financial data (close price, ex-dividends, split adjustments, adjusted close price) for over 3,000 stocks on U.S. exchanges, from 2007 to 2016.

Moreover, we derived several additional sub-datasets from the data above.

2. Non-Technical Executive Summary

In this section we focus on exploratory data analysis (EDA) to assess the importance of the features; the feature engineering in later sections is based on these results.

2.1. Interactive Geographical Distribution of Job Openings and Stock Prices

Here we focus on the top 5 industries by frequency of job openings: (1) General Management and Business, (2) Accounting and Finance, (3) Restaurants and Food Services, (4) Technology, (5) Retail. In the following bar plot, the x-axis is the unique category ID for industries and the y-axis is the number of job posts:

In [1]:
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import tensorflow as tf
import pandas as pd
import seaborn as sns

# Load the job postings dataset
df_jobs = pd.read_csv('/mnt/c/Users/nsalehi/Desktop/Datathon/jobs.csv')
In [2]:
plt.figure(figsize=(12, 6))

# Count job postings per category ID
df_jobs_cat = df_jobs['category_id'].dropna().to_frame()
df_jobs_cat.columns = ['category_id']
df_job_count = df_jobs_cat.groupby('category_id').size()

# Bar plot: number of postings per category ID
bar_width = 0.4
plt.bar(np.arange(0, len(df_job_count)) - 0.5 * bar_width, df_job_count,
        alpha=0.5, width=bar_width, color='b', label='Number of jobs in Cat ID')
Out[2]:
<Container object of 143 artists>

We normalized the number of job posts for these five top industries and plotted them over the U.S. map based on their zip codes.

In [3]:
# Load the ZIP code to latitude/longitude mapping
df_zip = pd.read_csv('/mnt/c/Users/nsalehi/Desktop/Datathon/zipcodes.csv')
In [4]:
# Keep only postings from the five most frequent categories
df_top5 = df_jobs[df_jobs['category_id'].isin([46, 1, 122, 141, 127])]
In [5]:
# Count postings per (zip, category) pair and attach latitude/longitude
df_top5_gr = df_top5[['zip', 'category_id']].groupby(['zip', 'category_id']).size()
df_top5_gr = df_top5_gr.reset_index()
df_top5_map = pd.merge(df_top5_gr, df_zip, how='inner', left_on='zip', right_on='ZIP')
df_top5_map.columns = ['zip', 'category_id', 'count', 'ZIP', 'LAT', 'LNG']
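
The map image shown below was produced separately; a minimal sketch of the normalization and plotting step, using the df_top5_map frame built above, might look like the following (the marker scaling and color assignment are illustrative, not the exact values used for the map image):

In [ ]:
# Sketch (not the exact plotting code used for the map image): normalize the
# per-location counts within each category and scatter them by longitude/latitude,
# with marker size proportional to the normalized number of postings.
df_top5_map['count_norm'] = (
    df_top5_map.groupby('category_id')['count'].transform(lambda c: c / c.max())
)

plt.figure(figsize=(12, 6))
for cat_id, grp in df_top5_map.groupby('category_id'):
    plt.scatter(grp['LNG'], grp['LAT'],
                s=200 * grp['count_norm'],   # radius ~ normalized postings
                alpha=0.4, label=str(int(cat_id)))
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.legend(title='category_id')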
In [9]:
from IPython.display import Image
Image("/mnt/c/Users/nsalehi/Desktop/Datathon/4444.png")
Out[9]:

The colors in the map represent the industries as follows:

Blue: Retail
Red: Technology
Green: Restaurants and Food Services
Purple: Accounting and Finance
Yellow: General Management and Business

Moreover, the radius of each circle represents the normalized number of postings. As the map shows, job openings in the tech industry are concentrated mostly on the west and east coasts, while accounting and finance jobs are more broadly distributed across the Midwest. The remaining industries have fewer openings in comparison. We can conclude that job openings are highly location specific.

2.2. Relationship between Job Openings and Overall Return

As we can see here, there appears to be a strong correlation between job postings and overall return across time periods.

In [8]:
from IPython.display import Image
Image("/mnt/c/Users/nsalehi/Desktop/Datathon/image_1.png")
Out[8]:
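
A rough version of this check could be reproduced along the following lines; the file name, ticker symbol, and column names (post_date, ticker, date, adj_close) are assumptions for illustration, not the exact fields of the provided datasets:

In [ ]:
# Sketch: correlate monthly job-posting counts with monthly stock returns for a
# single example ticker. File and column names here are assumed for illustration.
df_fin = pd.read_csv('financial.csv', parse_dates=['date'])

monthly_posts = (df_jobs.assign(post_date=pd.to_datetime(df_jobs['post_date']))
                        .set_index('post_date')
                        .sort_index()
                        .resample('M').size())

stock = df_fin[df_fin['ticker'] == 'AAPL'].set_index('date').sort_index()
monthly_return = stock['adj_close'].resample('M').last().pct_change()

aligned = pd.concat([monthly_posts, monthly_return], axis=1,
                    keys=['posts', 'return']).dropna()
print(aligned['posts'].corr(aligned['return']))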

2.3. Key Findings

Based on initial analysis of our chosen datasets, we established that state-wise job posting data can provide invaluable insight into the dynamics of a company's stock pricing. Economic parameters such as GDP, per capita income, and the unemployment rate also appear closely related to stock pricing trends.

These findings can be logically explained as follows. One factor strongly correlated with stock prices is inflation. Although inflation manifests itself in a variety of ways, its most common indicators include GDP, per capita income, and the employment rate. Intuitively, trends in these economic parameters should therefore show up in the trends we witness in stock prices. In addition, a company's job posting data indicate its financial well-being and are expected to have some correlation with the company's stock pricing.

Training our model on companies' job activity data and on the economic trends of states, we are able to predict the stock price trends of our validation dataset with reasonable accuracy. This suggests that the economic conditions of the state in which the major part of a company's operations takes place play a major role in the fluctuations of that company's stock price. Hence, keeping track of a company's job postings and the state's economic growth can help drive decision-making for investors.

3. Technical Executive Summary

3.1. Linear Model

Our data are time-ordered, and several algorithms exist for modeling time-series data. We first fit a multivariate linear regression model; linear regression is computationally efficient and also gives fairly good predictive accuracy. We then turned to a ridge regression model. Both of these linear models gave us good results, as reported in the results below.
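
A minimal sketch of these two baselines, assuming the engineered job-posting and state-economy features have already been assembled into a feature matrix X with target y (stock price or return):

In [ ]:
# Sketch of the two linear baselines; X and y are assumed to be the assembled
# feature matrix (job-posting + state-economy features) and the target.
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

for name, model in [('Multiple Linear Regression', LinearRegression()),
                    ('Ridge Regression', Ridge(alpha=1.0))]:
    model.fit(X_train, y_train)
    rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
    print(name, 'validation RMSE:', rmse)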

3.2 Ensemble Method

3.2.1 Method Introduction

We tried single models such as multiple linear regression and ridge regression and achieved fairly good results. Beyond single models, we decided to turn to ensemble methods to improve prediction performance, using linear models as weak learners. Unlike Random Forests, boosting can incorporate not just decision trees but also linear models, so boosting methods are appropriate here. Specifically, we chose XGBoost to build our model for this prediction problem rather than another popular boosting method taught in class, AdaBoost, for three reasons: first, XGBoost is more flexible, with more customizable parameters; second, XGBoost is much faster than most implementations of AdaBoost; third, gradient boosting works with generic loss functions, whereas AdaBoost is derived mainly for classification with an exponential loss.

3.2.2 Model Selection

3.2.2.1 Model Parameters

In order to build the XGBoost model, we need to select several parameters, listed in the following table:

max_depth: 3
subsample: 0.7
colsample_bytree: 0.7
num_rounds: 500
objective: reg:linear
eval_metric: RMSE

First, for the objective and eval_metric parameters: since we decided to ensemble linear models for this prediction problem, we set the objective to reg:linear, and, because gradient boosting works with generic loss functions, we also used our customized R value function as an evaluation metric. Second, max_depth, subsample and colsample_bytree are used to control overfitting in two ways: by directly limiting model complexity and by adding randomness to make training robust to noise. max_depth is the maximum depth of a tree; increasing it makes the model more complex and more prone to overfitting, so we set it to 3. subsample is the subsample ratio of the training instances (i.e. bootstrapping); setting it to 0.7 means XGBoost randomly samples 70% of the data instances to grow each tree, which helps prevent overfitting. colsample_bytree is the subsample ratio of columns used when constructing each tree, similar in spirit to Random Forests. Third, num_rounds is the number of boosting iterations. To train the model without overfitting or underfitting, we need to find the best number of iterations; we therefore use validation to track the error after each iteration and choose the round with the lowest validation error.
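
As a concrete reference, the table above translates into roughly the following XGBoost configuration (a sketch; the customized R value metric mentioned above would be supplied separately as a custom evaluation function):

In [ ]:
# Sketch: the parameter table above expressed as an XGBoost configuration dict.
import xgboost as xgb

params = {
    'max_depth': 3,             # limits tree depth / model complexity
    'subsample': 0.7,           # sample 70% of rows per boosting round
    'colsample_bytree': 0.7,    # sample 70% of columns per tree
    'objective': 'reg:linear',  # linear regression objective
    'eval_metric': 'rmse',      # RMSE reported during training
}
num_rounds = 500                # upper bound; the final round count is chosen by validation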

3.2.2.2 Validation

We have nearly 36,000 training examples for this problem, so K-fold cross-validation is unnecessary here. Following advice from practitioners in the financial industry, we decided to conduct an 80/20 validation to trade off bias and variance: we split the original dataset by randomly subsampling 80 percent of the data into the training set and the remaining 20% into the validation set. We plotted the training RMSE (root mean squared error, blue line) and the validation RMSE against the boosting iteration round in Image 5. As we can see, the training error keeps decreasing as the iterations proceed, but the validation error stops decreasing at about the 77th iteration. Thus, we set num_rounds to 77.
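
A sketch of this 80/20 validation and the choice of boosting round, reusing the params dict from the previous sketch (X and y are again the assembled features and target):

In [ ]:
# Sketch of the 80/20 split and the selection of the boosting round, reusing
# `params` and `num_rounds` from the previous sketch.
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)

evals_result = {}
booster = xgb.train(params, dtrain, num_boost_round=num_rounds,
                    evals=[(dtrain, 'train'), (dval, 'validation')],
                    evals_result=evals_result, verbose_eval=50)

# Pick the round where the validation RMSE bottoms out (about 77 in our run).
best_round = int(np.argmin(evals_result['validation']['rmse'])) + 1
print('best boosting round:', best_round)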

In [12]:
from IPython.display import Image
Image("/mnt/c/Users/nsalehi/Desktop/Datathon/2222.png")
Out[12]:

We computed the performance by RMSE for the 80/20 validation. We find that the performance is no better than that of the single models tried in the previous section.

3.3 Test Result

Finally, we trained the model with the parameters chosen from the 80/20 validation and tested the final model on the test set. The performance, measured by RMSE and R value on the training, validation, and test sets, is reported below as our final result.
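
A sketch of this final evaluation, interpreting the R value as the Pearson correlation between predicted and actual values (an assumption), reusing the split above together with a held-out test set X_test, y_test:

In [ ]:
# Sketch: retrain at the chosen number of rounds and report RMSE and R value
# (taken here to be the Pearson correlation between predictions and targets).
from scipy.stats import pearsonr

final_model = xgb.train(params, dtrain, num_boost_round=77)

for name, dmat, y_true in [('train', dtrain, y_train),
                           ('validation', dval, y_val),
                           ('test', xgb.DMatrix(X_test, label=y_test), y_test)]:
    pred = final_model.predict(dmat)
    rmse = float(np.sqrt(np.mean((pred - np.asarray(y_true)) ** 2)))
    r, _ = pearsonr(np.asarray(y_true), pred)
    print(name, 'RMSE:', rmse, 'R value:', r)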

In [3]:
from IPython.display import Image
Image("/mnt/c/Users/nsalehi/Desktop/Datathon/1111.png")
Out[3]:

3.4 Conclusion

From the final results of the ensemble method tried in the previous section, we find that it does not give better performance than the single models we tried, specifically multiple linear regression. However, the ensemble method greatly reduces the variance of the model. On the other hand, compared to a single model, the ensemble method loses the interpretability of the model due to the ensembling process.
