[Introduction of Association Rules]
Sometimes, the anecdotal story helps you understand the new concept. But, this story is real. About 15 years ago, in Walmart, a sales guy made efforts to boost sales in his store. His idea was simple. He bundled the products together and applied some discounts to the bundled products. (Now, it became common practices in marketing) For example, this guy bundled bread with jam, so that customers easily found them together. Moreover, customers could afford to buy them together as the bundled product was discounted. In this way, we can expect an increase in the revenue.
As bread and jam was so classical, so that he was determined to analyze all sales records in a hope of seizing new opportunities. He found interesting. Many many customers who bought diapers also purchased beers.
Seemingly, those are totally unrelated. He decided to dig deeper. He realized that it was arduous to raise kids (It doesn't change at all in nowadays) So, the parents impulsively decided to purchase beer to relieve their stress. He bundled diapers and beers together. The sales skyrocketed. Still, this remains the perfect example of Association Rules in data mining. (Thank you professor Sun in University of Notre Dame! He gave this example in Business Intelligence class)
[About data]
Now, let's suppose that you own Sephora, the largest cosmetic chain in United States (And probably in the world) You are selling 14 products in your store. Just like Walmart sales guy, you hope to boost your sales with the same technique. How do we go about doing this?
Your products: Brushes, Mascara, Eye shadow, Bronzer, Lip liner, Nail Polish, Lipstick, ...
(To be honest, as a male, I have no idea what these products are)
Usually, sales data take on this form. It has a transaction number and corresponding items that our customers buy. Usually, when you extract the data from database(MS-SQL, Oracle whatever), it is supposed to be like this. First column is a transaction number, and second column is the item. So according to these data, our customer 1 purchased Blush, Bronzer, Brushes, Concealer, Eyeliner, Lip liner, Mascara, and Nail Polish at once. (I am not sure females purchased cosmetics in bulk actually)
However, in order to be used in R, it should take on this form. It doesn't have any transaction number. You need to vertically arrange items that our customer purchased in a single transaction. I am going to offer you this data in the source code.
I'll briefly touch on how to change the form of the data later.
[Terms that you should know]
You need to understand several key concepts regarding association rules.
1. A=>B
We call "A" as "LHS(Left-hand side)," and "B" as "RHS(Right-hand side)"
Let's assume that A is diaper and B is beer. It means when a customer buys diaper, she would buy beer too.
2. Support
Let me get back to Walmart's story. In this case, support means the probability of the customer buying diaper and beer together among all sales transactions.
3. Confidence
Suppose that if a customer pick up diaper. How he/she is likely to buy beer? The answer is "confidence" The maximum value of confidence has to be 1.
4. Lift
Lift is a true comparison between naive model and our model, meaning that how more likely a customer buy both, compared to buy separately? Lift 1 means, our customers are as likely to buy both diaper and beer together as buy them separately. Generally, in order to be meaningful in marketing, lift has to be greater than 1.
[Codes]
Unlike our theory, the code is simple. "arules" package allows you to do this really simply. just 4 lines. That's all.
#Association Rule
library(arules)
myurl <- "https://docs.google.com/spreadsheets/d/18KBtFWkMq1Q9mOSVo9Q55GJ9IeC3NRYRn7yV5Id3z6A/pub?gid=0&single=true&output=csv"
data.raw <- read.transactions(url(myurl), sep=",") #Please use read.transactions! It's not read.csv!
rules<-apriori(data.raw)
inspect(rules)
[Interpretation]
> inspect(rules)
lhs rhs support confidence lift 1 {Brushes} => {Nail Polish} 0.1556949 1.0000000 3.4178572 {Mascara} => {Eye shadow} 0.3354232 0.8991597 2.2585193 {Eye shadow} => {Mascara} 0.3354232 0.8425197 2.2585194 {Bronzer,Brushes} => {Nail Polish} 0.1013584 1.0000000 3.4178575 {Bronzer,Lip liner} => {Concealer} 0.1076280 0.8046875 1.742276
...
Well, this looks good. However, like I said, the higher lift is, the more it is meaningful in marketing sense. Let's sort it from high lift to low lift, which allows us to identify strong correlation.
> rules.sorted <- sort(rules, by="lift")
> inspect(rules.sorted)
lhs rhs support confidence lift
1 {Brushes} => {Nail Polish} 0.1556949 1.0000000 3.417857
4 {Bronzer,Brushes} => {Nail Polish} 0.1013584 1.0000000 3.417857
26 {Blush,Concealer,Eye shadow} => {Mascara} 0.1243469 0.9596774 2.572581
18 {Blush,Eye shadow} => {Mascara} 0.1765935 0.9285714 2.489196
13 {Eye shadow,Nail Polish} => {Mascara} 0.1243469 0.9083969 2.435115
23 {Concealer,Eye shadow} => {Mascara} 0.1870428 0.8905473 2.387265
Let's highlight the first row. Support is 0.1556, meaning that customers buy Brushes and Nail Polishes altogether by 15.56% among all transactions. Confidence is 100%, meaning that all brush buyers purchase nail polish (It's huge!). Lift is 3.41, meaning that our customers are 3.41 times more likely to buy brushes and nail polish altogether than buy them separately!
In next section, we are going to prune the result.
Google Ad sense
Friday, May 20, 2016
Monday, April 25, 2016
Multi Regression in R - Exxon stock return ~ (WTI/Gas/S&P)
[Previous Post]
Single regression on Exxon's stock
[Introduction of Multi-regression]
Let's recall our last job. We conducted the single regression on Exxon Mobil's stock along with WTI crude oil spot price. The result was fantastic, which accounts for 25% of the variation of stock movement. Put it in other way, R-square. The problem is "are you happy with the result that explains only 25% of the variation?"
For investors, it could be risky that they know only one variable to affect the stock price that they hold. Is there any way to account for the variation of the stock price better?
Then, multi-regression comes out.
y = a + (a1)x1 + (a2)x2 + (a3)x3 + ... + ERROR
In the previous post, we simplify the model as y= ax + b. It was a great starting point. However, our real world is not that simple. It is affected by many variables.
I'll talk about this problem later, but the assumption is the variable x1, x2, and x3 are independent of one another. In order to understand this, you need to understand that the two random variables are independent of each other.
[Independent]
<Theory>
When we talk that the random variables are independent of each other, it means, it's not correlated. Think about the relationship between temperature and electricity bill. As the temperature goes up, you would use more air conditioner, increasing your electricity bill. In this case, they are "positively correlated." Now, think about the relationship between temperature and the time you spend with your laptop. For some people it might be related, but for me, regardless of the temperature, I should use laptop more than 5 hours for my work. Now, we can say that the temperature and the usage of computer are independent of each other. We can represent this in a mathematical way.
P(AB) = P(A)P(B)
Correlation(A, B) = 0
In order to understand the concept of correlation you can think about the gears rotating together.
<In real>
Sure, it's impossible to meet "totally independent two random variables." Especially, in the macro economic, it is nearly impossible because all variables are connected. In this case, we have to decide where do we draw the line. We can assume that if the correlation is less than 0.2, we can consider it independent.
[What can have an impact on Exxon's equity value?]
We assumed that the WTI crude oil price can do, which turned out to be true. Natural gas price can do that. Not only does Exxon sell oil, but it also sells natural gas. Most importantly, as a member of S&P 500, it also heavily affected by the general market condition. Especially, the index of S&P 500 is a collection of macro variables, like interest rate. In this analysis, I'll add two variables more.
[Where can I can the data]
I really want to be nice, so that I uploaded WTI oil price and natural gas price on my Google Drive. How do we use that? You can watch 3 min video to learn how to retrieve the data from Google Drive to R.
How to retrieve the data from Google Drive to R.
[Before getting into the code]
(1) I used tseries library just like the previous post. Please, install this if you don't have.
(2) I used the vanguard's S&P500 index fund in lieu of S&P itself because it's more accessible.
(3) If you can't understand the code, please go over the previous post again. I deliberately rule out any type of difficult code.
(4) As I mentioned earlier, you can use WTI and Natural gas data without changing URL. Don't download from the internet. Just use the code.
[Code]
#Gas: https://docs.google.com/spreadsheets/d/1-mMwoHJNrv9_St2x9xMlcNjs-NGKfJ32KOKRHixMZMk/pub?gid=0&single=true&output=csv
#WTI: https://docs.google.com/spreadsheets/d/19kE1nLp5u4Zf2UiZA4-GW0gYhdMvWU60L2M-SIbYqX0/pub?gid=0&single=true&output=csv
#Multi Regression and Correlation
library(tseries)
library(zoo)
#Exxon Mobil's equity value in 2015
xom <- get.hist.quote("XOM", #Tick mark
start="2015-01-01", #Start date YYYY-MM-DD
end="2015-12-31" #End date YYYY-MM-DD
)
#S&P 500. I used Vanguard Index Fund instead of directly using S&P.
snp <- get.hist.quote("VFINX", #Tick mark
start="2015-01-01", #Start date YYYY-MM-DD
end="2015-12-31" #End date YYYY-MM-DD
)
#I am going to use close value only
xom_zoo <- xom$Close
snp_zoo <- snp$Close
#Please check the post that mentions how to get the data from Google Sheet.
wti <- read.csv("https://docs.google.com/spreadsheets/d/19kE1nLp5u4Zf2UiZA4-GW0gYhdMvWU60L2M-SIbYqX0/pub?gid=0&single=true&output=csv")
#When it reads file first, it has categorical format we need to convert it.
wti$VALUE <- as.character(wti$VALUE)
#It also has garbage value "." in data. You can see in Excel. We can clean this with below command
wti <- wti[wti$VALUE!=".", #Get rid of any value that contains "."
1:2 #I need first column, and second column as well.
]
#Finally I want to convert character to numeric value.
wti$VALUE <- as.numeric(wti$VALUE)
wti_zoo <- read.zoo(wti, format="%m/%d/%Y")
gas <- read.csv(url("https://docs.google.com/spreadsheets/d/1-mMwoHJNrv9_St2x9xMlcNjs-NGKfJ32KOKRHixMZMk/pub?gid=0&single=true&output=csv"))
gas$Price <- as.character(gas$Price)
gas$Price <- as.numeric(gas$Price)
gas_zoo <- read.zoo(gas, format="%m/%d/%Y")
#Combine Two Time Series
two <- cbind(wti_zoo, snp_zoo)
three <- cbind(gas_zoo, two)
#Remove NA as S&P 500 has more trade days than normal stock.
three.f <- na.omit(three)
#What we need is return.
xom_rate <- (xom_zoo - lag(xom_zoo))/xom_zoo
wti_rate <- (three.f$wti_zoo - lag(three.f$wti_zoo))/three.f$wti_zoo
snp_rate <- (three.f$snp_zoo - lag(three.f$snp_zoo))/three.f$snp_zoo
gas_rate <- (three.f$gas_zoo - lag(three.f$gas_zoo))/three.f$gas_zoo
regression_result <- lm(xom_rate~wti_rate+snp_rate+gas_rate)
print(summary(regression_result))
cols<-rainbow(4)
#Draw Time Series Graph
#lty: Line Style
plot(xom_rate, col=cols[1], lty=1, main="Return of XOM, WTI, S&P500, Gas", xlab="2015", ylab="Return")
lines(wti_rate, col=cols[2], lty=2)
lines(snp_rate, col=cols[3], lty=2)
lines(gas_rate, col=cols[4], lty=2)
legend('bottomleft', #It's located in bottom left
c("XOM", "WTI", "S&P", "Gas"),
lty=c(1,4), #As this is a line graph, we are going to use line as a symbol
col=cols, #color
horiz=TRUE, #Horizontal alignment
bty='n', #No background
cex=0.65 #If it is 1.0 it's too big. Basically it is a scale factor
)
#Scatter plot for multi regression
#pch: Dot Style
plot(x=xom_rate, y=gas_rate, col=cols[1], pch=17, main="Scatter plot on return", xlab="XOM", ylab="Gas, S&P, WTI")
points(x=xom_rate, y=snp_rate, col=cols[2], pch=18)
points(x=xom_rate, y=wti_rate, col=cols[3], pch=19)
legend('bottomleft',
#X, #Y position of the legend. The higher Y value, the higher position in the monitor.
c("GAS", "S&P", "WTI"),
pch=c(17, 18, 19), #As this is a line graph, we are going to use line as a symbol
col=cols, #color
horiz=TRUE, #Horizontal alignment
bty='n', #No background
cex=0.65 #If it is 1.0 it's too big. Basically it is a scale factor
)
[How to interpret the result]
<Table>
Call:
lm(formula = xom_rate ~ wti_rate + snp_rate + gas_rate)
Residuals:
Min 1Q Median 3Q Max
-0.030185 -0.004732 0.000406 0.005333 0.039771
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
(Intercept) 0.0004744 0.0005458 0.869 0.3856
wti_rate 0.1578732 0.0193612 8.154 1.78e-14 ***
snp_rate 0.9308219 0.0577671 16.113 < 2e-16 ***
gas_rate -0.0310367 0.0150969 -2.056 0.0409 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.008641 on 247 degrees of freedom
Multiple R-squared: 0.6416, Adjusted R-squared: 0.6372
F-statistic: 147.4 on 3 and 247 DF, p-value: < 2.2e-16
Take a look at green highlights first. We need to look at Adjusted R-square as we used multiple variable. 63.72% variation can be explained by this model (Awesome!) Take a look at overall p-value. The likelihood that this model is nothing is less than 2.2*10^-16, meaning that this model is reliable.
Take a look at p-values for each variables (Yellow highlights) All p-values are less than 0.05, so that these variables are statistically meaningful too.
Take a look at blue highlights. Now we find the equation.
y(Exxon's stock return) = 0.0004744 + (0.1578732)(WTI return) + (0.9308219)(S&P return) - (0.0310367)(Natural gas return)
Wait. Something doesn't make sense to you. Is natural gas price is negatively correlated with the stock return on Exxon? I put forward several hypotheses to account for this unexpected outcome.
(1) As a result of the advent of fracking technology, natural gas is no longer the production of Exxon. Rather it becomes a raw material.
(2) Natural gas price responds the market later than WTI as investors more focus on WTI than natural gas.
(3) Exxon doesn't rely on Natural gas much. They hedged somehow as the oil price hit the lowest in historic level.
I'll let you decide on which theory is the most plausible.
[Time Series Graph]
We can see how much correlate with each other graphically. I drew all time series graphs first. You can see general tendency of up and down are consistent with one another.
[Correlation in real world]
"cor" command allows you to identify the correlation between two variables. Like I said, it is nearly impossible to be independent for macro economic variables.
> cor(wti_rate, snp_rate)
[1] 0.2796217
>
As WTI return and S&P 500 return are not independent of each other, our model could be undermined. However, 27% correlation is not enough to claim that there is a significant correlation between them. Again, it's about where do we draw the line. As a data scientist, you have your own standard to draw the line. From my standpoint, it's okay to have 0.27 correlation.
I'll give you identifying remaining correlation (S&P ~ gas, wti ~ gas) as assignments.
[Additional Tip]
If you think that, in the scatter plot, the blue dots totally eclipse the other dots, you can make the color transparent. It's not that difficult.
#Make the color transparent
t_col <- function(color, percent = 50, name = NULL) {
# color = color name
# percent = % transparency
# name = an optional name for the color
## Get RGB values for named color
rgb.val <- col2rgb(color)
## Make new color using input color as base and alpha set by transparency
t.col <- rgb(rgb.val[1], rgb.val[2], rgb.val[3],
max = 255,
alpha = (100-percent)*255/100,
names = name)
## Save the color
return(t.col)
}
Single regression on Exxon's stock
[Introduction of Multi-regression]
Let's recall our last job. We conducted the single regression on Exxon Mobil's stock along with WTI crude oil spot price. The result was fantastic, which accounts for 25% of the variation of stock movement. Put it in other way, R-square. The problem is "are you happy with the result that explains only 25% of the variation?"
For investors, it could be risky that they know only one variable to affect the stock price that they hold. Is there any way to account for the variation of the stock price better?
Then, multi-regression comes out.
y = a + (a1)x1 + (a2)x2 + (a3)x3 + ... + ERROR
In the previous post, we simplify the model as y= ax + b. It was a great starting point. However, our real world is not that simple. It is affected by many variables.
I'll talk about this problem later, but the assumption is the variable x1, x2, and x3 are independent of one another. In order to understand this, you need to understand that the two random variables are independent of each other.
[Independent]
<Theory>
When we talk that the random variables are independent of each other, it means, it's not correlated. Think about the relationship between temperature and electricity bill. As the temperature goes up, you would use more air conditioner, increasing your electricity bill. In this case, they are "positively correlated." Now, think about the relationship between temperature and the time you spend with your laptop. For some people it might be related, but for me, regardless of the temperature, I should use laptop more than 5 hours for my work. Now, we can say that the temperature and the usage of computer are independent of each other. We can represent this in a mathematical way.
P(AB) = P(A)P(B)
Correlation(A, B) = 0
In order to understand the concept of correlation you can think about the gears rotating together.
<In real>
Sure, it's impossible to meet "totally independent two random variables." Especially, in the macro economic, it is nearly impossible because all variables are connected. In this case, we have to decide where do we draw the line. We can assume that if the correlation is less than 0.2, we can consider it independent.
[What can have an impact on Exxon's equity value?]
We assumed that the WTI crude oil price can do, which turned out to be true. Natural gas price can do that. Not only does Exxon sell oil, but it also sells natural gas. Most importantly, as a member of S&P 500, it also heavily affected by the general market condition. Especially, the index of S&P 500 is a collection of macro variables, like interest rate. In this analysis, I'll add two variables more.
[Where can I can the data]
I really want to be nice, so that I uploaded WTI oil price and natural gas price on my Google Drive. How do we use that? You can watch 3 min video to learn how to retrieve the data from Google Drive to R.
How to retrieve the data from Google Drive to R.
[Before getting into the code]
(1) I used tseries library just like the previous post. Please, install this if you don't have.
(2) I used the vanguard's S&P500 index fund in lieu of S&P itself because it's more accessible.
(3) If you can't understand the code, please go over the previous post again. I deliberately rule out any type of difficult code.
(4) As I mentioned earlier, you can use WTI and Natural gas data without changing URL. Don't download from the internet. Just use the code.
[Code]
#Gas: https://docs.google.com/spreadsheets/d/1-mMwoHJNrv9_St2x9xMlcNjs-NGKfJ32KOKRHixMZMk/pub?gid=0&single=true&output=csv
#WTI: https://docs.google.com/spreadsheets/d/19kE1nLp5u4Zf2UiZA4-GW0gYhdMvWU60L2M-SIbYqX0/pub?gid=0&single=true&output=csv
#Multi Regression and Correlation
library(tseries)
library(zoo)
#Exxon Mobil's equity value in 2015
xom <- get.hist.quote("XOM", #Tick mark
start="2015-01-01", #Start date YYYY-MM-DD
end="2015-12-31" #End date YYYY-MM-DD
)
#S&P 500. I used Vanguard Index Fund instead of directly using S&P.
snp <- get.hist.quote("VFINX", #Tick mark
start="2015-01-01", #Start date YYYY-MM-DD
end="2015-12-31" #End date YYYY-MM-DD
)
#I am going to use close value only
xom_zoo <- xom$Close
snp_zoo <- snp$Close
#Please check the post that mentions how to get the data from Google Sheet.
wti <- read.csv("https://docs.google.com/spreadsheets/d/19kE1nLp5u4Zf2UiZA4-GW0gYhdMvWU60L2M-SIbYqX0/pub?gid=0&single=true&output=csv")
#When it reads file first, it has categorical format we need to convert it.
wti$VALUE <- as.character(wti$VALUE)
#It also has garbage value "." in data. You can see in Excel. We can clean this with below command
wti <- wti[wti$VALUE!=".", #Get rid of any value that contains "."
1:2 #I need first column, and second column as well.
]
#Finally I want to convert character to numeric value.
wti$VALUE <- as.numeric(wti$VALUE)
wti_zoo <- read.zoo(wti, format="%m/%d/%Y")
gas <- read.csv(url("https://docs.google.com/spreadsheets/d/1-mMwoHJNrv9_St2x9xMlcNjs-NGKfJ32KOKRHixMZMk/pub?gid=0&single=true&output=csv"))
gas$Price <- as.character(gas$Price)
gas$Price <- as.numeric(gas$Price)
gas_zoo <- read.zoo(gas, format="%m/%d/%Y")
#Combine Two Time Series
two <- cbind(wti_zoo, snp_zoo)
three <- cbind(gas_zoo, two)
#Remove NA as S&P 500 has more trade days than normal stock.
three.f <- na.omit(three)
#What we need is return.
xom_rate <- (xom_zoo - lag(xom_zoo))/xom_zoo
wti_rate <- (three.f$wti_zoo - lag(three.f$wti_zoo))/three.f$wti_zoo
snp_rate <- (three.f$snp_zoo - lag(three.f$snp_zoo))/three.f$snp_zoo
gas_rate <- (three.f$gas_zoo - lag(three.f$gas_zoo))/three.f$gas_zoo
regression_result <- lm(xom_rate~wti_rate+snp_rate+gas_rate)
print(summary(regression_result))
cols<-rainbow(4)
#Draw Time Series Graph
#lty: Line Style
plot(xom_rate, col=cols[1], lty=1, main="Return of XOM, WTI, S&P500, Gas", xlab="2015", ylab="Return")
lines(wti_rate, col=cols[2], lty=2)
lines(snp_rate, col=cols[3], lty=2)
lines(gas_rate, col=cols[4], lty=2)
legend('bottomleft', #It's located in bottom left
c("XOM", "WTI", "S&P", "Gas"),
lty=c(1,4), #As this is a line graph, we are going to use line as a symbol
col=cols, #color
horiz=TRUE, #Horizontal alignment
bty='n', #No background
cex=0.65 #If it is 1.0 it's too big. Basically it is a scale factor
)
#pch: Dot Style
plot(x=xom_rate, y=gas_rate, col=cols[1], pch=17, main="Scatter plot on return", xlab="XOM", ylab="Gas, S&P, WTI")
points(x=xom_rate, y=snp_rate, col=cols[2], pch=18)
points(x=xom_rate, y=wti_rate, col=cols[3], pch=19)
legend('bottomleft',
#X, #Y position of the legend. The higher Y value, the higher position in the monitor.
c("GAS", "S&P", "WTI"),
pch=c(17, 18, 19), #As this is a line graph, we are going to use line as a symbol
col=cols, #color
horiz=TRUE, #Horizontal alignment
bty='n', #No background
cex=0.65 #If it is 1.0 it's too big. Basically it is a scale factor
)
<Table>
Call:
lm(formula = xom_rate ~ wti_rate + snp_rate + gas_rate)
Residuals:
Min 1Q Median 3Q Max
-0.030185 -0.004732 0.000406 0.005333 0.039771
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
(Intercept) 0.0004744 0.0005458 0.869 0.3856
wti_rate 0.1578732 0.0193612 8.154 1.78e-14 ***
snp_rate 0.9308219 0.0577671 16.113 < 2e-16 ***
gas_rate -0.0310367 0.0150969 -2.056 0.0409 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.008641 on 247 degrees of freedom
Multiple R-squared: 0.6416, Adjusted R-squared: 0.6372
F-statistic: 147.4 on 3 and 247 DF, p-value: < 2.2e-16
Take a look at p-values for each variables (Yellow highlights) All p-values are less than 0.05, so that these variables are statistically meaningful too.
Take a look at blue highlights. Now we find the equation.
y(Exxon's stock return) = 0.0004744 + (0.1578732)(WTI return) + (0.9308219)(S&P return) - (0.0310367)(Natural gas return)
Wait. Something doesn't make sense to you. Is natural gas price is negatively correlated with the stock return on Exxon? I put forward several hypotheses to account for this unexpected outcome.
(1) As a result of the advent of fracking technology, natural gas is no longer the production of Exxon. Rather it becomes a raw material.
(2) Natural gas price responds the market later than WTI as investors more focus on WTI than natural gas.
(3) Exxon doesn't rely on Natural gas much. They hedged somehow as the oil price hit the lowest in historic level.
I'll let you decide on which theory is the most plausible.
[Time Series Graph]
We can see how much correlate with each other graphically. I drew all time series graphs first. You can see general tendency of up and down are consistent with one another.
[Scatter Plot]
You can also do the same thing with scatter plot, which allows you to see the correlation between XOM returns and other variables visually.[Correlation in real world]
"cor" command allows you to identify the correlation between two variables. Like I said, it is nearly impossible to be independent for macro economic variables.
> cor(wti_rate, snp_rate)
[1] 0.2796217
>
As WTI return and S&P 500 return are not independent of each other, our model could be undermined. However, 27% correlation is not enough to claim that there is a significant correlation between them. Again, it's about where do we draw the line. As a data scientist, you have your own standard to draw the line. From my standpoint, it's okay to have 0.27 correlation.
I'll give you identifying remaining correlation (S&P ~ gas, wti ~ gas) as assignments.
[Additional Tip]
If you think that, in the scatter plot, the blue dots totally eclipse the other dots, you can make the color transparent. It's not that difficult.
#Make the color transparent
t_col <- function(color, percent = 50, name = NULL) {
# color = color name
# percent = % transparency
# name = an optional name for the color
## Get RGB values for named color
rgb.val <- col2rgb(color)
## Make new color using input color as base and alpha set by transparency
t.col <- rgb(rgb.val[1], rgb.val[2], rgb.val[3],
max = 255,
alpha = (100-percent)*255/100,
names = name)
## Save the color
return(t.col)
}
plot(x=xom_rate, y=gas_rate, col=t_col(cols[1]), pch=17, main="Scatter plot on return", xlab="XOM", ylab="Gas, S&P, WTI")
points(x=xom_rate, y=snp_rate, col=t_col(cols[2]), pch=18)
points(x=xom_rate, y=wti_rate, col=t_col(cols[3]), pch=19)
legend('bottomleft',
c("GAS", "S&P", "WTI"),
pch=c(17, 18, 19), #As this is a line graph, we are going to use line as a symbol
col=cols, #color
horiz=TRUE, #Horizontal alignment
bty='n', #No background
cex=0.65 #If it is 1.0 it's too big. Basically it is a scale factor
)
Labels:
Multi Regression,
R,
R program,
R programming
Location:
Chicago, IL, USA
Saturday, April 23, 2016
Monday, April 18, 2016
Z-score, NPV, IRR, Bond, annuity, YTM calculation on your smartphone without installing app
Hello all, I am pleased to announce the upgrade of my calculator apps.
It is able to caculate
- Z-score
- Bond pricing
- Annuity
- NPV
- IRR
- YTM
- Black Scholes
- and other finance related formula in your smartphone without installing any mobile app.
- As it is a just HTML page, you can use this calculator on virtually any platform - iOS, OSX, Android, Window, Window RT - It doesn't matter.
The URL address: www.mobilecalculator.us
[Open Calculator]
Here's how to use the calculator
It is able to caculate
- Z-score
- Bond pricing
- Annuity
- NPV
- IRR
- YTM
- Black Scholes
- and other finance related formula in your smartphone without installing any mobile app.
- As it is a just HTML page, you can use this calculator on virtually any platform - iOS, OSX, Android, Window, Window RT - It doesn't matter.
The URL address: www.mobilecalculator.us
[Open Calculator]
Here's how to use the calculator
Sunday, April 17, 2016
Why the programmer decided to become MBA?
[My Background]
I was a programmer for a long time. I became a programmer from 2004. I was in charge of managing server and developing web applications. My main programming skills were JAVA and Oracle, although I learned python separately. I made more than 20 small and medium applications, and I participated in more than 10 big projects. When I talk about "Big project," it means more than 100 people were involved in the project. I have spent most of my career in the data centers. The servers and switches were my friends. I ate there and slept there. As more people thought of me a good engineer, the more career doors were open to me.
I moved my company several times. When I moved my company, I god paid more (or had more my personal time). But, one thing was clear - I moved away from programming toward finance. In 2012, I became a manager of IT governance department, which was the pinnacle of my career. Then, the problems arose. As I mentioned earlier, I was good at programming and managing server. However, my boss demanded me to have a conversation with corporate finance people. My first mission was to build the cloud center. The most important problem was "money." Let me put it other way - "Budget."
Again, I spent most of time in the data center. I was not aware of NPV or time value of money. I totally didn't understand why I needed to convince these finance guys. They used different language. I urged them that the cloud technology was a big deal and the new technology, but they weren't convinced. They needed me to translate my language into their language. I had been a good engineer, but that didn't make me a good IT governance manager.
That was not the only problem. The corporate finance guys came to visit my office often, demanding the budget reduction on the ground that the company lost its profit. Now, I had to negotiate with not only these finance guys but also the data center guys to allocate the reduced the budget effectively. That was not an easy process. I spent most of my time hearing screaming or slamming the desk.
[Why MBA?]
These experiences led me to MBA program. I wanted to be realistic. I can't program forever although I love coding a lot. Someday I have to be a manager, leading other people in my department and get them to be focused on the project. Some programmers can be lucky. Their employer might not demand them more than just programming, but mine was not. As my position gets higher, my employer shall demand me more in exchange for getting paid more.
Still, my employer wants me to have my computer skills, which is necessary to understand the basis of IT industry. I decided to become MBA in 2012. I started to study GMAT and TOEFL. That was really challenging to me who never experienced English before. I woke up 6:00 AM and started to study TOEFL. I went to my workplace. From time to time, I memorized vocabulary. I listened to NPR news (5 min summary) every two hours. That helps me a lot.
In 2014, I finally got my GMAT and TOEFL score. I applied for University of Notre Dame and it granted me substantial scholarship. I studied accounting, finance, and problem solving. Now, I realized my position in the company, understanding how company works. I am confident whenever I have conversation with corporate finance guys. I don't shy away from the negotiations any more. This program also gives me the new opportunity. I got an full-time offer from Deloitte Chicago as a cyber risk consultant. Hard work finally pays off.
When the programmers ask me about MBA career, I answer them that "It is worth!" How many programmers or IT guys actually work at IT company, like Google? Not many. I think most IT guys work at IT department in manufacturing industry, finance industry or other industry. I am not saying that the codes or the algorithms are not important. But, in order to make yourself suitable for working in non-IT industry, you need to understand business first. As your position gets higher, your employer would demand bigger role to you, more than programming or managing servers. They might want you to train new employees. They might want you to build a new data center. They might want you to explore new possibility of new technology area. The bottom line is, whatever you do, you should understand the business language. That's how this job works in this particular year.
[More opportunity for MBA programmer]
Now, things are more complicated. Data Scientist job is on the rise. I can tell you that, if you have a programming background, your MBA program makes you perfect for the data scientist as you learn SQL, R, and finance, accounting, and business statistics. You can boost your payment level more. More than your colleagues or peers in your class.
Data science job is to wrestle ambiguity basically. Your employer might ask you to increase our revenue with the data. In that case, it's blue sky. We don't even know where do we start. However, with the business knowledge, we can break the problem into finance, marketing, and operation. We can collect different data accordingly, putting forward the business hypothesis. Not only do you need to understand R or SQL, but you also need to understand business in general, adding a lot of values to your employer, which, in turn, makes you more valuable.
[Conclusion]
I want to conclude this post with a caveat. If you decide to become MBA, the rough road will be ahead of you. The road will test you a lot. Just don't give up! When you finish your journey, your efforts finally pay off!
Saturday, April 16, 2016
How to calculate Black-Scholes formula in R
Well, seeing is believing. Let's watch Youtube video to master what we've learn through R tutorial.
As I mentioned in the video, please use "pnorm" function to calculate the normal distribution. "dnorm" function calculates just density function suitable for drawing actual normal distribution curve.
[Code]
#Black Scholes Model
s<-100 #Stock Price
x<-105 #Exercise Price
t<-1 #Expiration Date
sigma<-0.2 #Volatility
rfree<-0.03 #Risk Free Rate
d1<-(log(s/x) + rfree * t)/(sigma*sqrt(t)) + sigma*sqrt(t)/2
d2<-d1 - sigma*sqrt(t)
c<-s*pnorm(d1)-exp(-rfree*t)*x*pnorm(d2)
As I mentioned in the video, please use "pnorm" function to calculate the normal distribution. "dnorm" function calculates just density function suitable for drawing actual normal distribution curve.
[Code]
#Black Scholes Model
s<-100 #Stock Price
x<-105 #Exercise Price
t<-1 #Expiration Date
sigma<-0.2 #Volatility
rfree<-0.03 #Risk Free Rate
d1<-(log(s/x) + rfree * t)/(sigma*sqrt(t)) + sigma*sqrt(t)/2
d2<-d1 - sigma*sqrt(t)
c<-s*pnorm(d1)-exp(-rfree*t)*x*pnorm(d2)
Wednesday, April 13, 2016
Tuesday, April 12, 2016
[Machine Learning] Naive Bayes in R (1)
[Previous Posts]
Introduction to Machine Learning
K-nearest Neighbor Analysis (1)
Classification Tree (CART) (1)
[About Bayes Theorem]
Let's assume that there is a basket that has 10 marbles.
We know it has blue marbles or orange marbles for sure, but we don't know how many of them are blue or orange.
You are asked to guess the color of the marble when you pick up the marble from the basket.
Your best guess is either choose blue or orange by flipping coin. As you might know it's not scientific, but this is the best way to guess.
Here is the different situation. Now, you know that there are 9 blue marbles and only 1 orange marble.
In this case, no one would dare to guess that the marble has orange color. We all know the Bayes theorem intuitively. Prof. Woonkyung Kim in Korea university called it as "the evolution of the probability"
Let's mathematically organize the situation. When we don't have the information, the probability of picking up a blue marble is
P(Blue) = 0.5. Just same as flipping coin and it turns out to be head or toss.
However, if we have the information,
P(Blue | Information) = 0.9.
Bayes theorem is in your smartphone also. (By the way, your cell phone is a collection of the probability theorems) The digital converts all the information into either 1 or 0. (Encoding the information). And prepared to send the digital signal. One way to make a difference between 1 or 0 in the signal is to manipulate the power of signal. We can give 0 less power and 1 more power. The power diagram is like below.
However, as soon as this signal goes through the air, this signal meets white noise, which could be generated by your conversation, light, or interruption of other electro-magnetic wave. When this signal gets to your cell phone, it ends up with the probability distribution like below.
When we get the signal if it has a blue line power, we naturally assume that that signal means 0. When we get the signal if it has a red line power, we naturally assume that that signal means 1. We are going to apply same theory to allow the machine to learn the data. (By the way, there's slight possibility of the error. Shannon came up with the way to correct this error. He also contributed to the birth of Classification Tree model)
[Validating the distribution function]
Let's figure out that our test data(iris) has the normal distribution. You can use the histogram that you learn in stock market class.
[Review histogram]
We are going to use sepal length and sepal width only.
As this is not our scope (machine learning), I put the code into the different post.
[Getting histograms on sepal length and sepal width (Multiple histograms)]
As you can see, like 0 or 1 case, it has each sub species has a normal distribution in terms of sepal length and width. We can conclude that we can apply Naive bayes to our data set. Please, keep in mind that it cannot be used in a categorical value like Male or Female. Only, CART has a capability of handling categorical values.
[Before Codes]
In this case, you need to install e1071 package.
Install.packages("e1071")
[Codes]
library("e1071")
library("gmodels")
normalize <- function(x) {
#In this case, like KNN case, it would be better to normalize the data.
mean_x <- mean(x)
stdev_x <- sd(x)*sqrt((length(x)-1)/(length(x)))
num <- x - mean_x
denom <- stdev_x
return (num/denom)
}
iris <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"), header = FALSE) #There are great sample data offered by UCI. Let's use this!
names(iris) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")
#Normalize data. Again, it is necessary as we deal with the distribution function.
#If it is not normalized, the distribution function could be distorted.
iris_norm <- as.data.frame(lapply(iris[1:4], normalize))
set.seed(1234)
ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.7, 0.3))
iris.training <- iris[ind==1, 1:4]
iris.trainingLabels <- iris[ind==1, 5]
iris.test <- iris[ind==2, 1:4]
iris.testLabels <- iris[ind==2, 5]
# Making normal distribution functions.
# For naive bayes, it is more important to choose the right training data set than in other models.
fit<-naiveBayes(iris.training, iris.trainingLabels)
# split up the data into two sets as we did always.
# Training: Help your computer build the right model
# Test: Validate your data set whether or not it works well.
iris_pred <- predict(fit, iris.test)
#Confusion Matrix => Assessing the machine learning model. I'll explain deeper at another post.
CrossTable(x = iris.testLabels, y = iris_pred, prop.chisq=FALSE)
Again, what we need to pay attention to is diagonal direction. So, the accuracy of this model is 36(=10+11+16)/38 = 94.73%. It's not that bad.
Finding a right model is very difficult. It varies from the data to the data. That's why datascientists exist. You have to have good eye to choose the right model to the right data that you have. If you need, you need to formulate the cost-effective data-gathering strategy. That's why the data-scientists get paid a lot in recent years.
Introduction to Machine Learning
K-nearest Neighbor Analysis (1)
Classification Tree (CART) (1)
[About Bayes Theorem]
Let's assume that there is a basket that has 10 marbles.
We know it has blue marbles or orange marbles for sure, but we don't know how many of them are blue or orange.
You are asked to guess the color of the marble when you pick up the marble from the basket.
Your best guess is either choose blue or orange by flipping coin. As you might know it's not scientific, but this is the best way to guess.
Here is the different situation. Now, you know that there are 9 blue marbles and only 1 orange marble.
In this case, no one would dare to guess that the marble has orange color. We all know the Bayes theorem intuitively. Prof. Woonkyung Kim in Korea university called it as "the evolution of the probability"
Let's mathematically organize the situation. When we don't have the information, the probability of picking up a blue marble is
P(Blue) = 0.5. Just same as flipping coin and it turns out to be head or toss.
However, if we have the information,
P(Blue | Information) = 0.9.
Bayes theorem is in your smartphone also. (By the way, your cell phone is a collection of the probability theorems) The digital converts all the information into either 1 or 0. (Encoding the information). And prepared to send the digital signal. One way to make a difference between 1 or 0 in the signal is to manipulate the power of signal. We can give 0 less power and 1 more power. The power diagram is like below.
However, as soon as this signal goes through the air, this signal meets white noise, which could be generated by your conversation, light, or interruption of other electro-magnetic wave. When this signal gets to your cell phone, it ends up with the probability distribution like below.
When we get the signal if it has a blue line power, we naturally assume that that signal means 0. When we get the signal if it has a red line power, we naturally assume that that signal means 1. We are going to apply same theory to allow the machine to learn the data. (By the way, there's slight possibility of the error. Shannon came up with the way to correct this error. He also contributed to the birth of Classification Tree model)
[Validating the distribution function]
Let's figure out that our test data(iris) has the normal distribution. You can use the histogram that you learn in stock market class.
[Review histogram]
We are going to use sepal length and sepal width only.
As this is not our scope (machine learning), I put the code into the different post.
[Getting histograms on sepal length and sepal width (Multiple histograms)]
As you can see, like 0 or 1 case, it has each sub species has a normal distribution in terms of sepal length and width. We can conclude that we can apply Naive bayes to our data set. Please, keep in mind that it cannot be used in a categorical value like Male or Female. Only, CART has a capability of handling categorical values.
[Before Codes]
In this case, you need to install e1071 package.
Install.packages("e1071")
[Codes]
library("e1071")
library("gmodels")
normalize <- function(x) {
#In this case, like KNN case, it would be better to normalize the data.
mean_x <- mean(x)
stdev_x <- sd(x)*sqrt((length(x)-1)/(length(x)))
num <- x - mean_x
denom <- stdev_x
return (num/denom)
}
iris <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"), header = FALSE) #There are great sample data offered by UCI. Let's use this!
names(iris) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")
#Normalize data. Again, it is necessary as we deal with the distribution function.
#If it is not normalized, the distribution function could be distorted.
iris_norm <- as.data.frame(lapply(iris[1:4], normalize))
set.seed(1234)
ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.7, 0.3))
iris.training <- iris[ind==1, 1:4]
iris.trainingLabels <- iris[ind==1, 5]
iris.test <- iris[ind==2, 1:4]
iris.testLabels <- iris[ind==2, 5]
# Making normal distribution functions.
# For naive bayes, it is more important to choose the right training data set than in other models.
fit<-naiveBayes(iris.training, iris.trainingLabels)
# split up the data into two sets as we did always.
# Training: Help your computer build the right model
# Test: Validate your data set whether or not it works well.
iris_pred <- predict(fit, iris.test)
#Confusion Matrix => Assessing the machine learning model. I'll explain deeper at another post.
CrossTable(x = iris.testLabels, y = iris_pred, prop.chisq=FALSE)
[Result & Interpretation]
Finding a right model is very difficult. It varies from the data to the data. That's why datascientists exist. You have to have good eye to choose the right model to the right data that you have. If you need, you need to formulate the cost-effective data-gathering strategy. That's why the data-scientists get paid a lot in recent years.
Monday, April 11, 2016
Multiple Histograms in R
We talked about the drawing histogram in R.
[Drawing histogram in R]
What if there are two variables that need to be drawn? We can solve this issue just by adding "add=TRUE" option. Please, keep in mind that TRUE FALSE value has to be large capital in R.
[Codes]
iris <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"), header = FALSE) #There are great sample data offered by UCI. Let's use this!
#This function allows to make your bar color transparent!
t_col <- function(color, percent = 50, name = NULL) {
# color = color name
# percent = % transparency
# name = an optional name for the color
## Get RGB values for named color
rgb.val <- col2rgb(color)
## Make new color using input color as base and alpha set by transparency
t.col <- rgb(rgb.val[1], rgb.val[2], rgb.val[3],
max = 255,
alpha = (100-percent)*255/100,
names = name)
## Save the color
return(t.col)
}
#Unfortunately, this
names(iris) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")
#Iris-setosa, Iris-virginica, Iris-versicolor
setosa <- iris[iris$Species=="Iris-setosa",]
virginica <- iris[iris$Species=="Iris-virginica",]
versicolor <- iris[iris$Species=="Iris-versicolor",]
mycolor<-rainbow(3)
hist(setosa$Sepal.Width, col=t_col(mycolor[1]), xlim=c(1.7, 5))
hist(virginica$Sepal.Width, col=t_col(mycolor[2]), add=TRUE)
hist(versicolor$Sepal.Width, col=t_col(mycolor[3]), add=TRUE)
[Result]
[Are we done?]
As you can see above, the graph that we've drawn is not beautiful. The size of the bin that each histogram has is not consistent. The term "Frequency" means that these data are not normalized in terms of the data size. Here's the way to get around issue. We are going to use "freq" option and "breaks" option.
"freq" option gets the data normalized, making sure that it looks to have the same number of data points. Above case, as the setosa have more data points than others, it dominates the graph in terms of the height. From the data visualization perspective, that's not good. We can do better.
"breaks" option ensures that each histogram has the same bin size. Above graph again, the red bar is wider than others. We are going to fix this issue with the break keyword. Again, seeing is believing.
hist(setosa$Sepal.Length, col=t_col(mycolor[1]), xlim=c(4.0, 8.5), ylim=c(0, 1.05), breaks=seq(4,8,by=0.25),freq=FALSE)
hist(virginica$Sepal.Length, col=t_col(mycolor[2]), add=TRUE, breaks=seq(4,8,by=0.25),freq=FALSE)
hist(versicolor$Sepal.Length, col=t_col(mycolor[3]), add=TRUE, breaks=seq(4,8,by=0.25),freq=FALSE)
Now, we have the graph that has the same bar width and the same height. That looks much better.
[Drawing histogram in R]
What if there are two variables that need to be drawn? We can solve this issue just by adding "add=TRUE" option. Please, keep in mind that TRUE FALSE value has to be large capital in R.
[Codes]
iris <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"), header = FALSE) #There are great sample data offered by UCI. Let's use this!
#This function allows to make your bar color transparent!
t_col <- function(color, percent = 50, name = NULL) {
# color = color name
# percent = % transparency
# name = an optional name for the color
## Get RGB values for named color
rgb.val <- col2rgb(color)
## Make new color using input color as base and alpha set by transparency
t.col <- rgb(rgb.val[1], rgb.val[2], rgb.val[3],
max = 255,
alpha = (100-percent)*255/100,
names = name)
## Save the color
return(t.col)
}
#Unfortunately, this
names(iris) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")
#Iris-setosa, Iris-virginica, Iris-versicolor
setosa <- iris[iris$Species=="Iris-setosa",]
virginica <- iris[iris$Species=="Iris-virginica",]
versicolor <- iris[iris$Species=="Iris-versicolor",]
mycolor<-rainbow(3)
hist(setosa$Sepal.Width, col=t_col(mycolor[1]), xlim=c(1.7, 5))
hist(virginica$Sepal.Width, col=t_col(mycolor[2]), add=TRUE)
hist(versicolor$Sepal.Width, col=t_col(mycolor[3]), add=TRUE)
[Result]
[Are we done?]
As you can see above, the graph that we've drawn is not beautiful. The size of the bin that each histogram has is not consistent. The term "Frequency" means that these data are not normalized in terms of the data size. Here's the way to get around issue. We are going to use "freq" option and "breaks" option.
"freq" option gets the data normalized, making sure that it looks to have the same number of data points. Above case, as the setosa have more data points than others, it dominates the graph in terms of the height. From the data visualization perspective, that's not good. We can do better.
"breaks" option ensures that each histogram has the same bin size. Above graph again, the red bar is wider than others. We are going to fix this issue with the break keyword. Again, seeing is believing.
hist(setosa$Sepal.Length, col=t_col(mycolor[1]), xlim=c(4.0, 8.5), ylim=c(0, 1.05), breaks=seq(4,8,by=0.25),freq=FALSE)
hist(virginica$Sepal.Length, col=t_col(mycolor[2]), add=TRUE, breaks=seq(4,8,by=0.25),freq=FALSE)
hist(versicolor$Sepal.Length, col=t_col(mycolor[3]), add=TRUE, breaks=seq(4,8,by=0.25),freq=FALSE)
Now, we have the graph that has the same bar width and the same height. That looks much better.
Labels:
Histogram,
Multiple Histogram,
R,
R program,
R programming
Location:
미국 일리노이 시카고
Saturday, April 9, 2016
Multi-purpose calculator for mobile web (Annuity) with source code
Hello,
I made breakthrough yesterday. My productivity exploded.
I made Excel-like calculator, allows you to calculate discount rate really easily.
If you want to calculate annuity for three years, just type like
1/1.03 + 1/1.03^2 + 1/1.03^3
I have been in search for this type of calculator, but I couldn't
I really take care of user experience. If any kind of problems that you have, please, let me know.
I made this while I am studying CFA and AICAP. I hope it would help you a lot study finance.
It also has a function that stores the variable that you want. You can repeat the variable. Press "V" button when you want to repeat the variable. It really helps when you calculate NPV.
[Open the Calculator]
Source code is available in my github website.
[Source Code]
[Bug fixed-04/10/2016]
- It doesn't recognize "e^(-3)". I fixed it with simple string manipulation
- I don't have a great idea how to calculate "ln(2)" in the stack calculator. Please, someone gives me a great idea.
I made breakthrough yesterday. My productivity exploded.
I made Excel-like calculator, allows you to calculate discount rate really easily.
If you want to calculate annuity for three years, just type like
1/1.03 + 1/1.03^2 + 1/1.03^3
I have been in search for this type of calculator, but I couldn't
I really take care of user experience. If any kind of problems that you have, please, let me know.
I made this while I am studying CFA and AICAP. I hope it would help you a lot study finance.
It also has a function that stores the variable that you want. You can repeat the variable. Press "V" button when you want to repeat the variable. It really helps when you calculate NPV.
[Open the Calculator]
Source code is available in my github website.
[Source Code]
[Bug fixed-04/10/2016]
- It doesn't recognize "e^(-3)". I fixed it with simple string manipulation
- I don't have a great idea how to calculate "ln(2)" in the stack calculator. Please, someone gives me a great idea.
Labels:
Annuity,
Calculator,
Finance,
Interest Rate,
Mobile,
NPV,
source,
source code,
store variable
Location:
미국 일리노이 시카고
Friday, April 8, 2016
Finance Calculator & Z-score Calculator on the web
Hi All finance friends!
What bothered me during my MBA course was two. One was the financial calculator. I didn't bring my calculator all the time. It hassled me to calculate YTM stuff with just normal calculator. It bothers me that I had to look into the Z-score table all the time to calculate the risk.
Here, I made financial calculator (web, mobile supported) and z-score calculator. I am going to add function one by one. Please, use this a lot and share it with your friends!
You can use my mobile applications on your smartphone!
[Open Finance Calculator]
[Open Z-score Calculator]
What bothered me during my MBA course was two. One was the financial calculator. I didn't bring my calculator all the time. It hassled me to calculate YTM stuff with just normal calculator. It bothers me that I had to look into the Z-score table all the time to calculate the risk.
Here, I made financial calculator (web, mobile supported) and z-score calculator. I am going to add function one by one. Please, use this a lot and share it with your friends!
You can use my mobile applications on your smartphone!
[Open Finance Calculator]
[Open Z-score Calculator]
Thursday, April 7, 2016
Introduction to Machine Learning / Data Mining
[Machine Learning? Data Mining?]
Well, there is a little bit difference between machine learning and data mining although I don't see any difference between them.
See the debate on the difference between machine learning and data mining.
At the end, it is about training the machine to recognize the data, and the predict the future (or unknown variables) with the training. I'll use both terms interchangeably. Please, feel free to challenge me if I am wrong.
[How it works?]
Well, seeing is believing. I have been in search for the better explanation. But, professor Keating in University Notre Dame has a really great explanation for that. You'll see just two pictures with the painters' name. Next, I am going to give you just pictures, and give you the question "who is the painter?" I swear you can answer the question 100% correctly.
<Claude Monet>
<Van Gogh>
Now, who painted these pictures?
Well, there is a little bit difference between machine learning and data mining although I don't see any difference between them.
See the debate on the difference between machine learning and data mining.
At the end, it is about training the machine to recognize the data, and the predict the future (or unknown variables) with the training. I'll use both terms interchangeably. Please, feel free to challenge me if I am wrong.
Well, seeing is believing. I have been in search for the better explanation. But, professor Keating in University Notre Dame has a really great explanation for that. You'll see just two pictures with the painters' name. Next, I am going to give you just pictures, and give you the question "who is the painter?" I swear you can answer the question 100% correctly.
<Claude Monet>
<Van Gogh>
Now, who painted these pictures?
<1>
<2>
<3>
<4>
<5>
<Answer>
1 - Gogh
2 - Monet
3 - Monet
4 - Gogh
5 - Gogh
<How your brain worked?>
As soon as you saw those pictures, in your mind, you already have a formula, which allows you to make a difference between two painters. (By the way, both painters are well known to have stark contrast in their painting styles to each other.)
Features | Gogh | Monet |
Color | Use 4-5 colors | Use more than 10 colors |
Style | Masculine | Feminine |
Stroke | Rough | Smooth |
Viewer's Perception | Powerful | Detailed |
Although there are some pictures which exactly doesn't fall into those two categories, we can get a broad sense of which picture is painted by whom.
Machine learning does the same thing. It learns the data given by the user. We call it as a "training set" Then, it applies the formula that was built when the machine analyzed the training set to the data set that we want to forecast. We call it as a "test set." The prediction can be wrong, but generally as we provide the machine with the more qualified test data, we can get the better prediction.
[Where can we apply it?]
<Sales>
You are the sales person of the insurance company. Just you've got the list of potential customers. It has the information of their income, age, place, and jobs. If you are a good sales person, you would have a gut feeling to single out which customer is willing to sign up the new insurance plan. However, with the machine learning, you don't need any gut feeling. If you have the past transaction records, it tells you which customer is the most likely to sign up the new insurance plan.
<Card company>
Suppose that you are in charge of issuing cards. You don't want to issue cards to those who are highly likely not to pay the card bill on time. In this case, you can figure out who is likely to default based upon age, income, job, and savings. Actually, credit card companies adopt this techniques long time ago. If you get "you are rejected to your request on issuing card" message, you would probably not pass this test.
I want to lead this conversation into real application of data mining.
Wednesday, April 6, 2016
How to draw CAPM for my portfolio and getting equity beta in R
[Previous Relevant Posts]
[What is CAPM?]
According to the investopedia (http://www.investopedia.com/terms/c/capm.asp),
Single regression with R to identify relationship between WTI and stock price of Exxon
Getting stock volatility in R & Getting Histogram of returns
[What is CAPM?]
According to the investopedia (http://www.investopedia.com/terms/c/capm.asp),
The capital asset pricing model (CAPM) is a model that describes the relationship between risk and expected return and that is used in pricing of risky securities.
(Generally, you can understand that there is a linear relationship between risk (stock volatility) and the stock return.)
The general idea behind CAPM is that investors need to be compensated in two ways: time value of money and risk. The time value of money is represented by the risk-free (rf) rate in the formula and compensates the investors for placing money in any investment over a period of time. The other half of the formula represents risk and calculates the amount of compensation the investor needs for taking on additional risk. This is calculated by taking a risk measure (beta) that compares the returns of the asset to the market over a period of time and to the market premium (Rm-Rf).
[How do we approach]
Generally, rf is consistent with the T-bill rate. During 2015, it was almost 0.02 (2%) as Fed kept interest rate low to boost the economy. CAPM can be represented in portfolio. Now, I am going to choose the TOP 20 NYSE technology stocks during Mar 2016. This is done by single regression as we used in previous post.
Although we should be careful to use single regression in this situation because of the presence of "Alpha"
http://themarketmogul.com/wp-content/uploads/2015/05/Screen-Shot-2015-05-19-at-09.04.48.png |
This alpha is a interception of regression analysis and is not identical to rf. With this portfolio, you can expect the interception amount the of the return when your portfolio has zero volatility. Generally, the high alpha is regarded as a good portfolio. We will deal with that.
[Strategy]
We are going to go though general data scientist strategy.
(1) Data Gathering: We are going to gather the market data from the API.
(2) Data Manipulation: We choose only Top 20 firms in terms of market cap.
(3) Data Visulaization: Draw the risk-return graph
(4) Data Interpretation: See this graph is consistent with CAPM theory.
[Codes]
#I found this code is really versatile. Feel free to use this code in analyzing equity market!
#Getting TOP 100 stocks in NYSE volitility and return
library(TTR) #To get tickers
library(plyr) #For sorting
library(tseries) #For volatility / return
library(stringr) #String manipulation
library(calibrate) #To represent stock name on scatter plot
#NASDAQ, NYSE
market <- "NYSE"
#Technology, Finance, Energy, Consumer Services, Transportation, Capital Goods, Health Care, Basic Industries
sector <- "Technology"
getcapm <- function(stock) {
#Getting data from server
data <- get.hist.quote(stock, #Tick mark
start="2016-03-01", #Start date YYYY-MM-DD
end="2016-03-31" #End date YYYY-MM-DD
)
#We only take into account "Closing price", the price when the market closes
yesterdayprice <- data$Close
#This is a unique feature of R better than Excel
#I need to calculate everyday return
#The stock return is defined as (today price - yesterday price)/today price
todayprice <- lag(yesterdayprice)
#ret <- log(lag(price)) - log(price)
rets <- (todayprice - yesterdayprice)/todayprice
#Annualized and percentage
vol <- sd(rets) * sqrt(length(todayprice))
#Getting Geometric Mean.
#You might be tempted to use just mean(). Don't do that in stock market.
geometric_mean_return_prep <- rets + 1
geometric_mean_return_prep <- data.frame(Date=time(geometric_mean_return_prep), geometric_mean_return_prep, check.names=FALSE, row.names=NULL)
geometric_mean_return = 1
for(i in 1:length(geometric_mean_return_prep)) {
geometric_mean_return = geometric_mean_return * geometric_mean_return_prep[i,2]
}
geometric_mean_return <- geometric_mean_return^(1/length(geometric_mean_return_prep))
geometric_mean_return <- geometric_mean_return -1
information <- c(geometric_mean_return, vol) #It's a trick to return multiple values in one return.
return(information)
}
convert_marketcap <- function(str) {
str <- gsub("\\$", "", str) #Get rid of "$" first
#The reason why I use \\ is that $ has a special meaning in regular expression
#Regular expression is not the topic. #I'll deal with later
multiplier <- str_sub(str,-1,-1) #Million? Billion?
pure_number <- as.numeric(gsub("(B|M)", "", str)) #Get rid of M or B. Turn it into number
if(multiplier == "B") {
#Billion
adjustment <- 1000000000
} else if(multiplier == "M") {
#Million
adjustment <- 1000000
} else {
#Don't adjust it.
adjustment <- 1
}
return (pure_number * adjustment)
}
original <- stockSymbols()
#Getting NASDAQ
listings <- original[original$Exchange==market,]
#As these data include "NA," we need to clean them up for further data manipulation.
#If you don't clean up NA, you would encounter error while manipulating
listings <- listings[!is.na(listings$MarketCap),]
listings <- listings[!is.na(listings$Sector),]
#I want to focus on the specific sector
listings <- listings[listings$Sector==sector,]
#Market cap is string right now. We need to convert this to number
listings$MarketCap <- sapply(listings$MarketCap, convert_marketcap)
#Sort the list descending order of market capital
listings <- arrange(listings, desc(listings$MarketCap))
capm <- data.frame(ticker="", volatility=1:20, geometric_return=1:20)
capm$ticker <- listings$Symbol[1:20]
for(i in 1:20) {
information_on_stock <- getcapm(capm$ticker[i])
capm$geometric_return[i] <- information_on_stock[1]
capm$volatility[i] <- information_on_stock[2]
}
main_name <- paste(market, " / ")
main_name <- paste(main_name, sector)
main_name <- paste(main_name, " in Mar 2015")
capm_regression<-lm(capm$geometric_return ~ capm$volatility)
plot(x=capm$volatility,y=capm$geometric_return,pch=19, main = main_name, xlab="Stock Volatility", ylab="Stock Return")
#I want to know which stock is outlier.
textxy(capm$volatility, capm$geometric_return, capm$ticker)
abline(capm_regression, col="red") # regression line (y~x)
print(summary(capm_regression))
[Outcome Interpretation]
As the risk free rate is 0.002, we can just neglect this rate. (Right now, it makes sense to assume that there is no risk free rate. We are living in the historically low risk-free rate world.)
So, thus remaining things are two - Alpha and Beta.
(1) Data Gathering: We are going to gather the market data from the API.
(2) Data Manipulation: We choose only Top 20 firms in terms of market cap.
(3) Data Visulaization: Draw the risk-return graph
(4) Data Interpretation: See this graph is consistent with CAPM theory.
[Codes]
#I found this code is really versatile. Feel free to use this code in analyzing equity market!
#Getting TOP 100 stocks in NYSE volitility and return
library(TTR) #To get tickers
library(plyr) #For sorting
library(tseries) #For volatility / return
library(stringr) #String manipulation
library(calibrate) #To represent stock name on scatter plot
#NASDAQ, NYSE
market <- "NYSE"
#Technology, Finance, Energy, Consumer Services, Transportation, Capital Goods, Health Care, Basic Industries
sector <- "Technology"
getcapm <- function(stock) {
#Getting data from server
data <- get.hist.quote(stock, #Tick mark
start="2016-03-01", #Start date YYYY-MM-DD
end="2016-03-31" #End date YYYY-MM-DD
)
#We only take into account "Closing price", the price when the market closes
yesterdayprice <- data$Close
#This is a unique feature of R better than Excel
#I need to calculate everyday return
#The stock return is defined as (today price - yesterday price)/today price
todayprice <- lag(yesterdayprice)
#ret <- log(lag(price)) - log(price)
rets <- (todayprice - yesterdayprice)/todayprice
#Annualized and percentage
vol <- sd(rets) * sqrt(length(todayprice))
#Getting Geometric Mean.
#You might be tempted to use just mean(). Don't do that in stock market.
geometric_mean_return_prep <- rets + 1
geometric_mean_return_prep <- data.frame(Date=time(geometric_mean_return_prep), geometric_mean_return_prep, check.names=FALSE, row.names=NULL)
geometric_mean_return = 1
for(i in 1:length(geometric_mean_return_prep)) {
geometric_mean_return = geometric_mean_return * geometric_mean_return_prep[i,2]
}
geometric_mean_return <- geometric_mean_return^(1/length(geometric_mean_return_prep))
geometric_mean_return <- geometric_mean_return -1
information <- c(geometric_mean_return, vol) #It's a trick to return multiple values in one return.
return(information)
}
convert_marketcap <- function(str) {
str <- gsub("\\$", "", str) #Get rid of "$" first
#The reason why I use \\ is that $ has a special meaning in regular expression
#Regular expression is not the topic. #I'll deal with later
multiplier <- str_sub(str,-1,-1) #Million? Billion?
pure_number <- as.numeric(gsub("(B|M)", "", str)) #Get rid of M or B. Turn it into number
if(multiplier == "B") {
#Billion
adjustment <- 1000000000
} else if(multiplier == "M") {
#Million
adjustment <- 1000000
} else {
#Don't adjust it.
adjustment <- 1
}
return (pure_number * adjustment)
}
original <- stockSymbols()
#Getting NASDAQ
listings <- original[original$Exchange==market,]
#As these data include "NA," we need to clean them up for further data manipulation.
#If you don't clean up NA, you would encounter error while manipulating
listings <- listings[!is.na(listings$MarketCap),]
listings <- listings[!is.na(listings$Sector),]
#I want to focus on the specific sector
listings <- listings[listings$Sector==sector,]
#Market cap is string right now. We need to convert this to number
listings$MarketCap <- sapply(listings$MarketCap, convert_marketcap)
#Sort the list descending order of market capital
listings <- arrange(listings, desc(listings$MarketCap))
capm <- data.frame(ticker="", volatility=1:20, geometric_return=1:20)
capm$ticker <- listings$Symbol[1:20]
for(i in 1:20) {
information_on_stock <- getcapm(capm$ticker[i])
capm$geometric_return[i] <- information_on_stock[1]
capm$volatility[i] <- information_on_stock[2]
}
main_name <- paste(market, " / ")
main_name <- paste(main_name, sector)
main_name <- paste(main_name, " in Mar 2015")
capm_regression<-lm(capm$geometric_return ~ capm$volatility)
plot(x=capm$volatility,y=capm$geometric_return,pch=19, main = main_name, xlab="Stock Volatility", ylab="Stock Return")
#I want to know which stock is outlier.
textxy(capm$volatility, capm$geometric_return, capm$ticker)
abline(capm_regression, col="red") # regression line (y~x)
print(summary(capm_regression))
[Outcome Interpretation]
As the risk free rate is 0.002, we can just neglect this rate. (Right now, it makes sense to assume that there is no risk free rate. We are living in the historically low risk-free rate world.)
So, thus remaining things are two - Alpha and Beta.
Call:
lm(formula = capm$geometric_return ~ capm$volatility)
Residuals:
Min 1Q Median 3Q Max
-0.017195 -0.004096 -0.002011 0.004146 0.018340
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.002651 0.004949 -0.536 0.599
capm$volatility 0.087089 0.072577 1.200 0.246
Residual standard error: 0.009088 on 18 degrees of freedom
Multiple R-squared: 0.07407, Adjusted R-squared: 0.02263
F-statistic: 1.44 on 1 and 18 DF, p-value: 0.2457
<Interpretation>
It is difficult to say that this regression is meaningful statistically as the p-value is higher than 0.05 (0.599 & 0.246 respectively) In this case, we call that "the idiosyncratic risk is much higher than the market risk." Idiosyncratic risk is the risk that the company only has. It could be management issue. It could be their products level. But, for now, let's pretend that this graph is statistically meaningful.
Alpha (Intercept) is -0.2651%, meaning that this portfolio is not good portfolio. Again, Alpha is the expected return when the volatility is 0. If there's no volatility, still we could lose the money with this portfolio.
Beta is 0.087, meaning that we can expect 8.7% additional return, when we bear the 1.0 volatility. When we bear 0.5 volatility it would become 4.35%. (I don't want to say that this is a good portfolio)
In March 2016, Infosys (INFY) had a better performance than LinkedIn(LINKD). It was less volatile, but generated the better return. If you had invested in Infosys in march 2016, it would have been a much better decision than buying LinkedIn stock.
Please keep in mind that these are all historical data. It doesn't tell you the future performance, but we can forecast the future performance by looking at the past data.
[Assignment]
Please, find the high beta portfolio in Mar 2016. You can do that with my codes. Just tweak it. I'll post the answer later on.
Sunday, April 3, 2016
R programming tutorials (R program 101)
For those who visit this site first, I would encourage you to look at my posts in this order. This site is for MBA or business students who are eager to learn R.
R tutorial (1) - Data types, iterations
R tutorial (2) - Logical control, File Read/Write, String manipulation
R tutorial (3) - Data frame manipulation (merge, delete row/column), Basic statistics (Mean, Median, ...), Basic graph
I want to minimize the introduction as I believe that the solving actual problems helps you more learn new computer languages.
[More than tutorial? Basic, Solver & Graph]
How to calculate Black-Scholes formula in R (Simple! with basic functions. It's a video!)
NPV (Net Present Value)
IRR (Internal Rate of Return) - How to replicate Excel Solver function in a single variable
Bond Duration - how to manipulate the data frame in depth
Annuity (Mortgage) - Learning about drawing graph in R
Getting data from the internet (Libor)
How to read Google Sheet on R - share your data with your co-worker.
[Statistics]
Histogram on Stock volatility - Histogram on single variables
Use Multiple Histogram - Histogram on multiple variables
Single Regression (What is the relationship between WTI Crude and Exxon equity)
Stock risk-returns in 2015
Muti Regression (The relationship between Exxon Mobil and WTI/Gas/S&P)
[Advanced]
Draw CAPM in R
[Machine Learning / Data Mining]
Introduction to Machine Learning
K-nearest Analysis
Classification Tree
Naive Bayes
Association Rules (1) - The relationship between selling beer and selling diapers
[Data Science Framework]
R tutorial (1) - Data types, iterations
R tutorial (2) - Logical control, File Read/Write, String manipulation
R tutorial (3) - Data frame manipulation (merge, delete row/column), Basic statistics (Mean, Median, ...), Basic graph
I want to minimize the introduction as I believe that the solving actual problems helps you more learn new computer languages.
[More than tutorial? Basic, Solver & Graph]
How to calculate Black-Scholes formula in R (Simple! with basic functions. It's a video!)
NPV (Net Present Value)
IRR (Internal Rate of Return) - How to replicate Excel Solver function in a single variable
Bond Duration - how to manipulate the data frame in depth
Annuity (Mortgage) - Learning about drawing graph in R
Getting data from the internet (Libor)
How to read Google Sheet on R - share your data with your co-worker.
[Statistics]
Histogram on Stock volatility - Histogram on single variables
Use Multiple Histogram - Histogram on multiple variables
Single Regression (What is the relationship between WTI Crude and Exxon equity)
Stock risk-returns in 2015
Muti Regression (The relationship between Exxon Mobil and WTI/Gas/S&P)
[Advanced]
Draw CAPM in R
[Machine Learning / Data Mining]
Introduction to Machine Learning
K-nearest Analysis
Classification Tree
Naive Bayes
Association Rules (1) - The relationship between selling beer and selling diapers
[Data Science Framework]
Friday, April 1, 2016
[R] Risk - return relationship of all stocks in 2015 with source code
[Previous Posting]
Getting stock volatility in R & Getting Histogram of returns
[Risk return relationship in 2015]
According to the CAPM theory, there is a linear relationship between the risk inherent in the stock and the return. It makes sense. It's consistent with the Golden Rule "Low risk low return, High risk high return"
However, that was not the case in 2015. I don't know why. Maybe I have to ask it to finance professors to figure out what was going on and what made them being deviate from the theory.
[Before getting into code]
I had to use a lot of packages. Thus, those who do not install following packages, please install them first. Just like other packages, it's free. I used string manipulation a little bit. It could be harder for those who are not familiar with regular expression. I'll post it later. Please bear with me.
TTR, plyr, tseries, stringr, calibrate
You can install with "install.packages()" command in your R.
[R Code]
#Getting TOP 100 stocks in NYSE volatility and return
library(TTR) #To get tickers
library(plyr) #For sorting
library(tseries) #For volatility / return
library(stringr) #String manipulation
library(calibrate) #To represent stock name on scatter plot
#NASDAQ, NYSE
market <- "NYSE"
#Technology, Finance, Energy, Consumer Services, Transportation, Capital Goods, Health Care, Basic Industries
sector <- "Finance"
getcapm <- function(stock) {
#Getting data from server
data <- get.hist.quote(stock, #Tick mark
start="2015-01-01", #Start date YYYY-MM-DD
end="2015-12-31" #End date YYYY-MM-DD
)
#We only take into account "Closing price", the price when the market closes
yesterdayprice <- data$Close
#This is a unique feature of R better than Excel
#I need to calculate everyday return
#The stock return is defined as (today price - yesterday price)/today price
todayprice <- lag(yesterdayprice)
#ret <- log(lag(price)) - log(price)
rets <- (todayprice - yesterdayprice)/todayprice
#Annualized and percentage
vol <- sd(rets) * sqrt(length(todayprice))
geometric_mean_return_prep <- rets + 1
geometric_mean_return_prep <- data.frame(Date=time(geometric_mean_return_prep), geometric_mean_return_prep, check.names=FALSE, row.names=NULL)
#geometric_mean_return <- exp(mean(log(geometric_mean_return_prep)))
geometric_mean_return = 1
for(i in 1:length(geometric_mean_return_prep)) {
geometric_mean_return = geometric_mean_return * geometric_mean_return_prep[i,2]
}
geometric_mean_return <- geometric_mean_return^(1/length(geometric_mean_return_prep))
geometric_mean_return <- geometric_mean_return -1
information <- c(geometric_mean_return, vol)
return(information)
}
convert_marketcap <- function(str) {
str <- gsub("\\$", "", str) #Get rid of "$" first
#The reason why I use \\ is that $ has a special meaning in regular expression
#Regular expression is not the topic. #I'll deal with later
multiplier <- str_sub(str,-1,-1) #Million? Billion?
pure_number <- as.numeric(gsub("(B|M)", "", str)) #Get rid of M or B. Turn it into number
if(multiplier == "B") {
#Billion
adjustment <- 1000000000
} else if(multiplier == "M") {
#Million
adjustment <- 1000000
} else {
#Don't adjust it.
adjustment <- 1
}
return (pure_number * adjustment)
}
original <- stockSymbols()
#Getting NASDAQ
listings <- original[original$Exchange==market,]
#As these data include "NA," we need to clean them up for further data manipulation.
#If you don't clean up NA, you would encounter error while manipulating
listings <- listings[!is.na(listings$MarketCap),]
listings <- listings[!is.na(listings$Sector),]
#I want to focus on the specific sector
listings <- listings[listings$Sector==sector,]
#Market cap is string right now. We need to convert this to number
listings$MarketCap <- sapply(listings$MarketCap, convert_marketcap)
#Sort the list descending order of market capital
listings <- arrange(listings, desc(listings$MarketCap))
capm <- data.frame(ticker="", volatility=1:20, geometric_return=1:20)
capm$ticker <- listings$Symbol[1:20]
for(i in 1:20) {
information_on_stock <- getcapm(capm$ticker[i])
capm$geometric_return[i] <- information_on_stock[1]
capm$volatility[i] <- information_on_stock[2]
}
main_name <- paste(market, " / ")
main_name <- paste(main_name, sector)
main_name <- paste(main_name, " in 2015")
plot(x=capm$volatility,y=capm$geometric_return,pch=19, main = main_name, xlab="Stock Volatility", ylab="Stock Return")
#I want to know which stock is outlier.
textxy(capm$volatility, capm$geometric_return, capm$ticker)
(2) NASDAQ / Health care
Gilead (Ticker: GILD) did a good job, but it's lower than 2%. Other than Gilead, if you had invested in health care sector, it would have been a disaster.
(3) NYSE / Finance
It was the weirdest risk-return graph that I've ever seen. I can see some outlier right over there (China Life Insurance Company, LFC) as the china market got volatile last year. On average, they didn't do well.
(4) NYSE / Technology
Only Palo Alto Networks (PANW) had a positive return. SAP didn't have a good time in 2015.
Getting stock volatility in R & Getting Histogram of returns
[Risk return relationship in 2015]
According to the CAPM theory, there is a linear relationship between the risk inherent in the stock and the return. It makes sense. It's consistent with the Golden Rule "Low risk low return, High risk high return"
However, that was not the case in 2015. I don't know why. Maybe I have to ask it to finance professors to figure out what was going on and what made them being deviate from the theory.
[Before getting into code]
I had to use a lot of packages. Thus, those who do not install following packages, please install them first. Just like other packages, it's free. I used string manipulation a little bit. It could be harder for those who are not familiar with regular expression. I'll post it later. Please bear with me.
TTR, plyr, tseries, stringr, calibrate
You can install with "install.packages()" command in your R.
[R Code]
#Getting TOP 100 stocks in NYSE volatility and return
library(TTR) #To get tickers
library(plyr) #For sorting
library(tseries) #For volatility / return
library(stringr) #String manipulation
library(calibrate) #To represent stock name on scatter plot
#NASDAQ, NYSE
market <- "NYSE"
#Technology, Finance, Energy, Consumer Services, Transportation, Capital Goods, Health Care, Basic Industries
sector <- "Finance"
getcapm <- function(stock) {
#Getting data from server
data <- get.hist.quote(stock, #Tick mark
start="2015-01-01", #Start date YYYY-MM-DD
end="2015-12-31" #End date YYYY-MM-DD
)
#We only take into account "Closing price", the price when the market closes
yesterdayprice <- data$Close
#This is a unique feature of R better than Excel
#I need to calculate everyday return
#The stock return is defined as (today price - yesterday price)/today price
todayprice <- lag(yesterdayprice)
#ret <- log(lag(price)) - log(price)
rets <- (todayprice - yesterdayprice)/todayprice
#Annualized and percentage
vol <- sd(rets) * sqrt(length(todayprice))
geometric_mean_return_prep <- rets + 1
geometric_mean_return_prep <- data.frame(Date=time(geometric_mean_return_prep), geometric_mean_return_prep, check.names=FALSE, row.names=NULL)
#geometric_mean_return <- exp(mean(log(geometric_mean_return_prep)))
geometric_mean_return = 1
for(i in 1:length(geometric_mean_return_prep)) {
geometric_mean_return = geometric_mean_return * geometric_mean_return_prep[i,2]
}
geometric_mean_return <- geometric_mean_return^(1/length(geometric_mean_return_prep))
geometric_mean_return <- geometric_mean_return -1
information <- c(geometric_mean_return, vol)
return(information)
}
convert_marketcap <- function(str) {
str <- gsub("\\$", "", str) #Get rid of "$" first
#The reason why I use \\ is that $ has a special meaning in regular expression
#Regular expression is not the topic. #I'll deal with later
multiplier <- str_sub(str,-1,-1) #Million? Billion?
pure_number <- as.numeric(gsub("(B|M)", "", str)) #Get rid of M or B. Turn it into number
if(multiplier == "B") {
#Billion
adjustment <- 1000000000
} else if(multiplier == "M") {
#Million
adjustment <- 1000000
} else {
#Don't adjust it.
adjustment <- 1
}
return (pure_number * adjustment)
}
original <- stockSymbols()
#Getting NASDAQ
listings <- original[original$Exchange==market,]
#As these data include "NA," we need to clean them up for further data manipulation.
#If you don't clean up NA, you would encounter error while manipulating
listings <- listings[!is.na(listings$MarketCap),]
listings <- listings[!is.na(listings$Sector),]
#I want to focus on the specific sector
listings <- listings[listings$Sector==sector,]
#Market cap is string right now. We need to convert this to number
listings$MarketCap <- sapply(listings$MarketCap, convert_marketcap)
#Sort the list descending order of market capital
listings <- arrange(listings, desc(listings$MarketCap))
capm <- data.frame(ticker="", volatility=1:20, geometric_return=1:20)
capm$ticker <- listings$Symbol[1:20]
for(i in 1:20) {
information_on_stock <- getcapm(capm$ticker[i])
capm$geometric_return[i] <- information_on_stock[1]
capm$volatility[i] <- information_on_stock[2]
}
main_name <- paste(market, " / ")
main_name <- paste(main_name, sector)
main_name <- paste(main_name, " in 2015")
plot(x=capm$volatility,y=capm$geometric_return,pch=19, main = main_name, xlab="Stock Volatility", ylab="Stock Return")
#I want to know which stock is outlier.
textxy(capm$volatility, capm$geometric_return, capm$ticker)
[Outcome]
(1) NASDAQ / Technology
Google, Apple didn't do well in 2015, and so did other firms, like Qualcomm. If I had drawn CAPM, it would have spit out negative beta.
Google, Apple didn't do well in 2015, and so did other firms, like Qualcomm. If I had drawn CAPM, it would have spit out negative beta.
(2) NASDAQ / Health care
Gilead (Ticker: GILD) did a good job, but it's lower than 2%. Other than Gilead, if you had invested in health care sector, it would have been a disaster.
(3) NYSE / Finance
It was the weirdest risk-return graph that I've ever seen. I can see some outlier right over there (China Life Insurance Company, LFC) as the china market got volatile last year. On average, they didn't do well.
(4) NYSE / Technology
Only Palo Alto Networks (PANW) had a positive return. SAP didn't have a good time in 2015.
Subscribe to:
Posts (Atom)