R program for Finance

[Introduction to Association Rules]

Sometimes an anecdote helps you understand a new concept, and this one is a real story. About 15 years ago, a salesperson at Walmart was trying to boost sales in his store. His idea was simple: bundle products together and discount the bundle. (This has since become common practice in marketing.) For example, he bundled bread with jam so that customers could easily find them together. Because the bundle was discounted, customers were also more willing to buy both, so revenue could be expected to rise.
Bread and jam is a classic pairing, though, so he decided to analyze all of the sales records in the hope of finding new opportunities. He found something interesting: many customers who bought diapers also purchased beer.

At first glance, the two products are totally unrelated, so he dug deeper. He realized that raising kids is arduous (that hasn't changed at all), so parents were impulsively buying beer to relieve their stress. He bundled diapers and beer together, and sales skyrocketed. This remains the classic example of association rules in data mining. (Thank you, Professor Sun at the University of Notre Dame! He gave this example in his Business Intelligence class.)

[About data]
Now, let's suppose that you own Sephora, the largest cosmetics chain in the United States (and probably in the world). You sell 14 products in your store. Just like the Walmart salesperson, you hope to boost your sales with the same technique. How do we go about doing this?

Your products: Brushes, Mascara, Eye shadow, Bronzer, Lip liner, Nail Polish, Lipstick, ...
(To be honest, as a male, I have no idea what these products are)

Sales data usually take the following form: a transaction number and the items bought in that transaction. When you extract the data from a database (MS-SQL, Oracle, or whatever), it typically looks like this. The first column is the transaction number and the second column is the item. According to these data, customer 1 purchased Blush, Bronzer, Brushes, Concealer, Eyeliner, Lip liner, Mascara, and Nail Polish at once. (I am not sure whether women actually buy cosmetics in bulk.)

However, to be used in R with the code in this post, the data should take a different form. It has no transaction number. Instead, each row lists all the items that a customer purchased in a single transaction, separated by commas, and the transactions are stacked vertically, one per row. I provide the data in this form in the source code.

I'll briefly touch on how to change the form of the data later.
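As a rough preview of that conversion, here is a minimal sketch in base R. The data frame `sales` and its column names are made up for illustration; they are not the post's data set. It groups the items by transaction number and writes one comma-separated line per transaction, which is the layout that `read.transactions(..., sep=",")` expects.

```r
# Hypothetical long-format sales data: one row per (transaction, item) pair.
sales <- data.frame(
  transaction = c(1, 1, 1, 2, 2, 3),
  item = c("Blush", "Bronzer", "Mascara", "Lipstick", "Mascara", "Brushes"),
  stringsAsFactors = FALSE
)

# Group the items by transaction number, then join each group with commas.
items_by_transaction <- split(sales$item, sales$transaction)
basket_lines <- vapply(items_by_transaction, paste, character(1), collapse = ",")

# Each element is now one transaction on one line.
print(unname(basket_lines))

# Write the lines to a temporary file in the one-transaction-per-row layout.
basket_file <- tempfile(fileext = ".csv")
writeLines(basket_lines, basket_file)
```

As far as I know, arules can also read the two-column layout directly through the `format = "single"` option of `read.transactions`, so treat the manual conversion above as one option rather than the only way.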

[Terms that you should know]
You need to understand several key concepts regarding association rules.

1. A=>B

We call "A" the LHS (left-hand side) and "B" the RHS (right-hand side).
Let's assume that A is diapers and B is beer. The rule means that when a customer buys diapers, they are likely to buy beer too.

2. Support

Let me get back to the Walmart story. In this case, support is the proportion of all sales transactions in which a customer buys diapers and beer together.

3. Confidence

Suppose that a customer picks up diapers. How likely is he/she to buy beer as well? The answer is the "confidence": the proportion of diaper transactions that also contain beer. The maximum value of confidence is 1.

4. Lift

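Lift compares our rule with a naive model in which the two purchases are unrelated: how much more likely is a customer to buy both than we would expect if diapers and beer were bought independently? It equals the confidence divided by the overall proportion of transactions that contain beer. A lift of 1 means that our customers are exactly as likely to buy diapers and beer together as they would be if the two purchases were independent. Generally, to be meaningful in marketing, lift has to be greater than 1.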
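To make the three measures concrete, here is a small sketch in base R that computes them by hand for the rule {diaper} => {beer}. The five transactions are invented for illustration, and the helper `has_item` is my own name, not part of any package.

```r
# Hypothetical toy transactions (not the post's data set).
transactions <- list(
  c("diaper", "beer", "bread"),
  c("diaper", "beer"),
  c("diaper", "jam"),
  c("beer", "bread"),
  c("bread", "jam")
)

# TRUE/FALSE per transaction: does the basket contain the item?
has_item <- function(item) {
  vapply(transactions, function(basket) item %in% basket, logical(1))
}

has_a <- has_item("diaper")  # LHS
has_b <- has_item("beer")    # RHS

support    <- mean(has_a & has_b)       # share of baskets with both A and B
confidence <- support / mean(has_a)     # share of A baskets that also contain B
lift       <- confidence / mean(has_b)  # confidence relative to the base rate of B

print(c(support = support, confidence = confidence, lift = lift))
```

Here 2 of 5 baskets contain both items (support 0.4), 2 of the 3 diaper baskets contain beer (confidence about 0.67), and beer appears in 3 of 5 baskets overall, so the lift is about 1.11.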

[Code]
Unlike the theory, the code is simple. The "arules" package lets you do this in just four lines (plus loading the library). That's all.

#Association Rule
library(arules)
myurl <- "https://docs.google.com/spreadsheets/d/18KBtFWkMq1Q9mOSVo9Q55GJ9IeC3NRYRn7yV5Id3z6A/pub?gid=0&single=true&output=csv"
data.raw <- read.transactions(url(myurl), sep=",") #Please use read.transactions! It's not read.csv!
rules<-apriori(data.raw)
inspect(rules)

[Interpretation]
> inspect(rules)
lhs rhs support confidence lift
1 {Brushes} => {Nail Polish} 0.1556949 1.0000000 3.417857
2 {Mascara} => {Eye shadow} 0.3354232 0.8991597 2.258519
3 {Eye shadow} => {Mascara} 0.3354232 0.8425197 2.258519
4 {Bronzer,Brushes} => {Nail Polish} 0.1013584 1.0000000 3.417857
5 {Bronzer,Lip liner} => {Concealer} 0.1076280 0.8046875 1.742276
...

Well, this looks good. However, as I said, the higher the lift, the more meaningful the rule is in a marketing sense. Let's sort the rules from high lift to low lift, which lets us identify the strongest associations.

> rules.sorted <- sort(rules, by="lift")
> inspect(rules.sorted)
lhs rhs support confidence lift
1 {Brushes} => {Nail Polish} 0.1556949 1.0000000 3.417857
4 {Bronzer,Brushes} => {Nail Polish} 0.1013584 1.0000000 3.417857
26 {Blush,Concealer,Eye shadow} => {Mascara} 0.1243469 0.9596774 2.572581
18 {Blush,Eye shadow} => {Mascara} 0.1765935 0.9285714 2.489196
13 {Eye shadow,Nail Polish} => {Mascara} 0.1243469 0.9083969 2.435115
23 {Concealer,Eye shadow} => {Mascara} 0.1870428 0.8905473 2.387265

Let's highlight the first row. Support is 0.1557, meaning that about 15.6% of all transactions contain both Brushes and Nail Polish. Confidence is 100%, meaning that every brush buyer also purchased nail polish (it's huge!). Lift is 3.42, meaning that our customers are 3.42 times more likely to buy brushes and nail polish together than we would expect if the two purchases were independent!
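The three numbers in that row are tied together by the definitions of confidence and lift, so we can back out how common each product is on its own. This is only a back-of-the-envelope sketch; the variable names are mine, and the inputs are copied from the first row of the sorted output.

```r
# Values copied from the first row of the sorted output.
support_both <- 0.1556949  # share of transactions with Brushes and Nail Polish
confidence   <- 1.0000000  # share of Brushes transactions that contain Nail Polish
lift         <- 3.417857

# confidence = support_both / support_lhs, so:
support_lhs <- support_both / confidence  # share of transactions with Brushes

# lift = confidence / support_rhs, so:
support_rhs <- confidence / lift          # share of transactions with Nail Polish

print(round(c(brushes = support_lhs, nail_polish = support_rhs), 4))
```

In other words, roughly 29% of all baskets contain nail polish, while every basket with brushes does, which is where the lift of about 3.42 comes from.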

In the next section, we are going to prune the result.

[Previous Post]
Single regression on Exxon's stock

[Introduction to Multi-regression]

Let's recall our last job. We ran a single regression of Exxon Mobil's stock on the WTI crude oil spot price. The result was fantastic: the model accounted for 25% of the variation in the stock's movement. Put another way, the R-squared was 0.25. The question is, are you happy with a result that explains only 25% of the variation?

For investors, it could be risky to know only one variable that affects the price of a stock they hold. Is there any way to account for the variation in the stock price better?

This is where multi-regression comes in.

y = a + (a1)x1 + (a2)x2 + (a3)x3 + ... + ERROR

In the previous post, we simplified the model to y = ax + b. It was a great starting point. However, the real world is not that simple; the stock price is affected by many variables.

I'll talk about this problem later, but the assumption is that the variables x1, x2, and x3 are independent of one another. To follow this, you first need to understand what it means for two random variables to be independent of each other.
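Before turning to real data, it may help to see the model form on simulated numbers. In this sketch the coefficients (0.5, 2, -1, 0.3) are invented, and x1, x2, and x3 are generated independently, so `lm` should recover values close to the ones we put in.

```r
# Simulated data with made-up coefficients, purely to illustrate the model form.
set.seed(1)
n  <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 0.5 + 2 * x1 - 1 * x2 + 0.3 * x3 + rnorm(n, sd = 0.1)  # last term is ERROR

# Fit y = a + (a1)x1 + (a2)x2 + (a3)x3 + ERROR
fit <- lm(y ~ x1 + x2 + x3)
estimates <- coef(fit)
print(round(estimates, 2))
```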

[Independent]
<Theory>
When we say that two random variables are independent of each other, it means that knowing one tells you nothing about the other; in particular, they are not correlated. Think about the relationship between temperature and your electricity bill. As the temperature goes up, you use the air conditioner more, which increases your electricity bill. In this case, they are "positively correlated." Now, think about the relationship between temperature and the time you spend on your laptop. For some people the two might be related, but I have to use my laptop for more than 5 hours for work regardless of the temperature. So we can say that the temperature and my computer usage are independent of each other. We can represent this in a mathematical way.

P(AB) = P(A)P(B)
Correlation(A, B) = 0

In order to understand the concept of correlation you can think about the gears rotating together.

<In practice>
Of course, it's practically impossible to find two random variables that are totally independent. In macroeconomics especially, it is nearly impossible because all variables are connected. So we have to decide where to draw the line. For example, we can treat two variables as independent if their correlation is less than 0.2.
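The temperature example can be simulated to see both ideas in numbers. Everything in this sketch is invented (the means, the slope of 3, the noise levels); it only illustrates that a dependent pair shows a clearly nonzero correlation, while an independent pair shows a correlation near 0 and a joint probability close to the product of the individual probabilities.

```r
set.seed(123)
n <- 10000
temperature  <- rnorm(n, mean = 25, sd = 5)
bill         <- 50 + 3 * temperature + rnorm(n, sd = 10)  # depends on temperature
laptop_hours <- rnorm(n, mean = 6, sd = 1)                # generated independently

cor_bill   <- cor(temperature, bill)          # clearly positive
cor_laptop <- cor(temperature, laptop_hours)  # close to 0

# Check P(AB) = P(A)P(B) for the independent pair.
hot       <- temperature > 25
long_day  <- laptop_hours > 6
p_joint   <- mean(hot & long_day)
p_product <- mean(hot) * mean(long_day)

print(round(c(cor_bill = cor_bill, cor_laptop = cor_laptop,
              p_joint = p_joint, p_product = p_product), 3))
```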

[What can have an impact on Exxon's equity value?]
We assumed that the WTI crude oil price has an impact, which turned out to be true. The natural gas price can have one too: Exxon sells not only oil but also natural gas. Most importantly, as a member of the S&P 500, Exxon is also heavily affected by general market conditions. In particular, the S&P 500 index reflects a collection of macro variables, such as interest rates. In this analysis, I'll add these two variables.

[Where can I get the data]
To make things easy, I uploaded the WTI oil price and natural gas price data to my Google Drive. How do we use them? You can watch a 3-minute video to learn how to retrieve data from Google Drive into R.

How to retrieve the data from Google Drive to R.

[Before getting into the code]
(1) I used the tseries library, just as in the previous post. Please install it if you don't have it.
(2) I used Vanguard's S&P 500 index fund in lieu of the S&P 500 itself because it's more accessible.
(3) If you can't understand the code, please go over the previous post again. I deliberately avoided any difficult code.
(4) As I mentioned earlier, you can use the WTI and natural gas data without changing the URLs. You don't need to download anything from the internet yourself. Just use the code.

[Code]
#Gas: https://docs.google.com/spreadsheets/d/1-mMwoHJNrv9_St2x9xMlcNjs-NGKfJ32KOKRHixMZMk/pub?gid=0&single=true&output=csv
#WTI: https://docs.google.com/spreadsheets/d/19kE1nLp5u4Zf2UiZA4-GW0gYhdMvWU60L2M-SIbYqX0/pub?gid=0&single=true&output=csv

#Multi Regression and Correlation

library(tseries)
library(zoo)

#Exxon Mobil's equity value in 2015
xom <- get.hist.quote("XOM", #Ticker symbol
  start="2015-01-01", #Start date YYYY-MM-DD
  end="2015-12-31" #End date YYYY-MM-DD
)
#S&P 500. I used the Vanguard index fund instead of the S&P 500 index itself.
snp <- get.hist.quote("VFINX", #Ticker symbol
  start="2015-01-01", #Start date YYYY-MM-DD
  end="2015-12-31" #End date YYYY-MM-DD
)

#I am going to use close value only
xom_zoo <- xom$Close
snp_zoo <- snp$Close

#Please check the post that mentions how to get the data from Google Sheet.
wti <- read.csv("https://docs.google.com/spreadsheets/d/19kE1nLp5u4Zf2UiZA4-GW0gYhdMvWU60L2M-SIbYqX0/pub?gid=0&single=true&output=csv")

#read.csv may read the column as a factor (categorical), so convert it to character first.
wti$VALUE <- as.character(wti$VALUE)
#The data also contain the placeholder value "." instead of a price on some rows. You can see it in Excel. Drop those rows with the command below.
wti <- wti[wti$VALUE!=".", #Keep only rows whose value is not "."
  1:2 #Keep the first and second columns.
]
#Finally, convert the character values to numeric.
wti$VALUE <- as.numeric(wti$VALUE)
wti_zoo <- read.zoo(wti, format="%m/%d/%Y")

gas <- read.csv(url("https://docs.google.com/spreadsheets/d/1-mMwoHJNrv9_St2x9xMlcNjs-NGKfJ32KOKRHixMZMk/pub?gid=0&single=true&output=csv"))
gas$Price <- as.character(gas$Price)
gas$Price <- as.numeric(gas$Price)
gas_zoo <- read.zoo(gas, format="%m/%d/%Y")

#Combine Two Time Series
two <- cbind(wti_zoo, snp_zoo)
three <- cbind(gas_zoo, two)
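#Remove rows with NA, as the series do not all cover the same trading days.
three.f <- na.omit(three)

#What we need is the return, not the price level.
#Note: for zoo series, lag(x) with the default k=1 refers to the NEXT observation,
#so each line below computes (x[t] - x[t+1])/x[t]. Every series is transformed the same way.
xom_rate <- (xom_zoo - lag(xom_zoo))/xom_zoo
wti_rate <- (three.f$wti_zoo - lag(three.f$wti_zoo))/three.f$wti_zoo
snp_rate <- (three.f$snp_zoo - lag(three.f$snp_zoo))/three.f$snp_zoo
gas_rate <- (three.f$gas_zoo - lag(three.f$gas_zoo))/three.f$gas_zoo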
regression_result <- lm(xom_rate~wti_rate+snp_rate+gas_rate)

print(summary(regression_result))

cols<-rainbow(4)

#Draw Time Series Graph
#lty: Line Style
plot(xom_rate, col=cols[1], lty=1, main="Return of XOM, WTI, S&P500, Gas", xlab="2015", ylab="Return")
lines(wti_rate, col=cols[2], lty=2)
lines(snp_rate, col=cols[3], lty=2)
lines(gas_rate, col=cols[4], lty=2)
legend('bottomleft', #Place the legend in the bottom left
  c("XOM", "WTI", "S&P", "Gas"),
  lty=c(1,2,2,2), #Line styles, matching the plot() and lines() calls above
  col=cols, #color
  horiz=TRUE, #Horizontal alignment
  bty='n', #No box around the legend
  cex=0.65 #Scale factor for the legend text; 1.0 is too big
)

#Scatter plot for multi regression
#pch: Dot Style
plot(x=xom_rate, y=gas_rate, col=cols[1], pch=17, main="Scatter plot on return", xlab="XOM", ylab="Gas, S&P, WTI")
points(x=xom_rate, y=snp_rate, col=cols[2], pch=18)
points(x=xom_rate, y=wti_rate, col=cols[3], pch=19)
legend('bottomleft', #Place the legend in the bottom left
  c("GAS", "S&P", "WTI"),
  pch=c(17, 18, 19), #As this is a scatter plot, use the point symbols in the legend
  col=cols[1:3], #color
  horiz=TRUE, #Horizontal alignment
  bty='n', #No box around the legend
  cex=0.65 #Scale factor for the legend text; 1.0 is too big
)
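If the return step in the code above is unclear, it can be checked on a tiny example. This sketch uses plain base R vectors rather than zoo series, and the three prices are made up. It computes the conventional simple return, the change from the previous day divided by the previous day's price.

```r
# Hypothetical closing prices on three consecutive trading days.
prices <- c(100, 102, 99.96)

# Simple return: (today's price - yesterday's price) / yesterday's price.
returns <- diff(prices) / head(prices, -1)
print(returns)
```

The price rises 2% on the second day and falls 2% on the third, so the two returns are 0.02 and -0.02.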

[How to interpret the result]

<Table>
Call:
lm(formula = xom_rate ~ wti_rate + snp_rate + gas_rate)

Residuals:
Min 1Q Median 3Q Max
-0.030185 -0.004732 0.000406 0.005333 0.039771

Coefficients:
Estimate Std. Error t-value Pr(>|t|)
(Intercept) 0.0004744 0.0005458 0.869 0.3856
wti_rate 0.1578732 0.0193612 8.154 1.78e-14 ***
snp_rate 0.9308219 0.0577671 16.113 < 2e-16 ***
gas_rate -0.0310367 0.0150969 -2.056 0.0409 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.008641 on 247 degrees of freedom
Multiple R-squared: 0.6416, Adjusted R-squared: 0.6372
F-statistic: 147.4 on 3 and 247 DF, p-value: < 2.2e-16

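Take a look at the last lines of the output first (the green highlights). We need to look at the adjusted R-squared because we used multiple variables: this model explains 63.72% of the variation (awesome!). Next, take a look at the overall p-value. If none of the variables had any relationship with Exxon's return, the chance of seeing an F-statistic this large would be less than 2.2*10^-16, meaning that the model as a whole is statistically significant.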
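The adjusted R-squared can be reproduced from the other numbers in the table using the standard adjustment formula. The sketch below takes the number of observations as 251, which I inferred from the 247 residual degrees of freedom plus the 4 estimated coefficients.

```r
r_squared <- 0.6416  # Multiple R-squared from the output
n <- 251             # observations: 247 residual degrees of freedom + 4 coefficients
p <- 3               # explanatory variables: WTI, S&P, gas

# The adjustment penalizes each extra explanatory variable.
adjusted_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
print(round(adjusted_r_squared, 4))
```

The result rounds to 0.6372, matching the table. With only three variables and 251 observations the penalty is small, which is why the two R-squared values are so close.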

Take a look at the p-value for each variable (the yellow highlights). The p-values of all three variables are less than 0.05, so these variables are statistically significant too. (Only the intercept is not.)

Take a look at the Estimate column (the blue highlights). Now we have the equation.

y(Exxon's stock return) = 0.0004744 + (0.1578732)(WTI return) + (0.9308219)(S&P return) - (0.0310367)(Natural gas return)
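To see what the equation says in practice, we can plug in a hypothetical day on which WTI, the S&P 500 fund, and natural gas each return 1%. The inputs are invented; only the coefficients come from the table.

```r
# Coefficients copied from the regression output.
intercept <- 0.0004744
b_wti     <- 0.1578732
b_snp     <- 0.9308219
b_gas     <- -0.0310367

# Hypothetical returns for one day (1% each).
wti_return <- 0.01
snp_return <- 0.01
gas_return <- 0.01

predicted_xom <- intercept + b_wti * wti_return + b_snp * snp_return + b_gas * gas_return
print(round(predicted_xom, 6))
```

The model predicts a return of about 1.1% for Exxon on such a day, almost all of it coming from the S&P term.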

Wait. Something may not make sense to you. Is the natural gas price really negatively related to Exxon's stock return? I put forward several hypotheses to account for this unexpected outcome.
(1) With the advent of fracking technology, natural gas is no longer mainly a product of Exxon. Rather, it has become a raw material.
(2) The natural gas price responds to the market later than WTI does, as investors focus more on WTI than on natural gas.
(3) Exxon doesn't rely much on natural gas. It may have hedged somehow as the oil price hit a historic low.

I'll let you decide which theory is the most plausible.

[Time Series Graph]
We can see graphically how strongly the series correlate with each other. I drew all the time series first. You can see that their general ups and downs are consistent with one another.

[Scatter Plot]

You can also do the same thing with scatter plot, which allows you to see the correlation between XOM returns and other variables visually.

[Correlation in the real world]
The "cor" command lets you compute the correlation between two variables. As I said, it is nearly impossible for macroeconomic variables to be independent.

> cor(wti_rate, snp_rate)
[1] 0.2796217
>

As the WTI return and the S&P 500 return are not independent of each other, our model could be undermined. However, a correlation of 0.28 is weak, not enough to claim that the two are strongly correlated. Again, it's about where we draw the line. As a data scientist, you have your own standard for drawing it. From my standpoint, a correlation of 0.28 is okay.

I'll leave the remaining correlations (S&P ~ gas, wti ~ gas) to you as an assignment.
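As a hint for the assignment, `cor` also accepts a whole table of columns and returns every pairwise correlation at once. The sketch below uses simulated returns with invented volatilities, not the post's data, so the numbers it prints mean nothing by themselves; with the real series you would pass the three return columns instead.

```r
# Simulated daily returns (made-up numbers) standing in for the real series.
set.seed(42)
n <- 250
market <- rnorm(n, sd = 0.01)  # made-up common market factor
returns <- data.frame(
  wti = 0.5 * market + rnorm(n, sd = 0.02),
  snp = market,
  gas = rnorm(n, sd = 0.03)
)

# One call gives the full correlation matrix.
cor_matrix <- cor(returns)
print(round(cor_matrix, 2))
```

Each off-diagonal cell is the correlation for one pair of variables, so the matrix shows wti ~ snp, wti ~ gas, and snp ~ gas in a single table.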

[Additional Tip]
If you think that the blue dots totally eclipse the other dots in the scatter plot, you can make the colors transparent. It's not that difficult.

#Make the color transparent
t_col <- function(color, percent = 50, name = NULL) {
  # color = color name
  # percent = % transparency
  # name = an optional name for the color
  ## Get RGB values for named color
  rgb.val <- col2rgb(color)
  ## Make new color using input color as base and alpha set by transparency
  t.col <- rgb(rgb.val[1], rgb.val[2], rgb.val[3],
    maxColorValue = 255,
    alpha = (100-percent)*255/100,
    names = name)
  ## Return the new color
  return(t.col)
}

plot(x=xom_rate, y=gas_rate, col=t_col(cols[1]), pch=17, main="Scatter plot on return", xlab="XOM", ylab="Gas, S&P, WTI")
points(x=xom_rate, y=snp_rate, col=t_col(cols[2]), pch=18)
points(x=xom_rate, y=wti_rate, col=t_col(cols[3]), pch=19)
legend('bottomleft',
  c("GAS", "S&P", "WTI"),
  pch=c(17, 18, 19), #As this is a scatter plot, use the point symbols in the legend
  col=cols[1:3], #color
  horiz=TRUE, #Horizontal alignment
  bty='n', #No box around the legend
  cex=0.65 #Scale factor for the legend text; 1.0 is too big
)
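If you would rather not define a helper function, base R's `adjustcolor` from grDevices can do a similar job. This is an alternative sketch, not the approach used above; `alpha.f = 0.5` is meant to correspond roughly to the 50% transparency default of `t_col`.

```r
cols <- rainbow(4)

# alpha.f scales the opacity: 0.5 keeps the hue and makes the color half transparent.
transparent_cols <- adjustcolor(cols[1:3], alpha.f = 0.5)
print(transparent_cols)
```

The result is a vector of color strings with an alpha channel, which can be passed to the `col` argument of `plot` and `points` in the same way as the output of `t_col`.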


Friday, May 20, 2016

[Machine Learning] Association Rules - Marketing in R (1)

Monday, April 25, 2016

Multi Regression in R - Exxon stock return ~ (WTI/Gas/S&P)

Saturday, April 23, 2016

How to read Googlesheet (or Google Drive) on R within three steps