All source code in this blog is also available in Gregory Choi's GitHub source code repository.


  

Friday, March 25, 2016

[Machine Learning] K-Nearest Neighbor Analysis in R (1) (Updated)


[Machine Learning / Data Mining post series]




Introduction to Machine Learning
K-nearest Analysis
Classification Tree
Naive Bayes

[About K-nearest Neighbors]

The K-nearest neighbors model is similar to how we recognize objects intuitively. When we see a fruit with a round shape and red color, we think of it as "an apple." We don't really know it is an apple until it goes through a DNA test, but we can intuitively recognize the object by its characteristics. (It could be a fake apple.)

The K-nearest neighbors algorithm is often the first machine learning algorithm people encounter, and there are already many great resources on the internet that explain it. So, in this post, I will only briefly explain the underlying algorithm.

DataCamp sheds light on this algorithm from a different angle.

It uses geometric (Euclidean) distance to classify observations. Let's assume we have two kinds of data points, '+' and '-'. Each one has two numeric characteristics, x and y. Figuratively, let's say the data illustrates characteristics of males and females: x could correspond to height, and y to weight. Although there is a gray area, we can assume that males are generally taller and heavier than females.

<Figure: '+' and '-' data points plotted by x and y, with one unknown "dot" whose three nearest neighbors are all '+'>
As you can see above, we have an unknown record, the "dot." We want to know whether it is '+' or '-'. It has the same two characteristics, x and y. By measuring geometric distance, we can predict that it is '+', as it is close to the group of '+'s. In this case we use k=3, meaning that the dot is classified based on the three closest data points. For the sake of simplicity, I use k=3 in the code below as well.
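To make this concrete, here is a minimal hand-rolled sketch of the idea (the numbers are made up for illustration; the real analysis below uses the knn() function from the "class" package instead):

#Toy data: six training points with x (height, cm) and y (weight, kg)
train_x <- c(180, 175, 183, 160, 158, 165)
train_y <- c(80, 77, 85, 52, 50, 55)
labels  <- c("+", "+", "+", "-", "-", "-")

#The unknown record, the "dot"
dot <- c(178, 79)

#Geometric (Euclidean) distance from the dot to every training point
dists <- sqrt((train_x - dot[1])^2 + (train_y - dot[2])^2)

#k = 3: the three closest points vote on the label
k <- 3
nearest <- order(dists)[1:k]
names(which.max(table(labels[nearest]))) #"+" - all three neighbors are '+'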

[Training data set vs Test data set]
We are going to split the data set into two parts. The first is the "training data set," which allows the computer to learn what the model looks like. The second is the "test data set," which validates the model and measures how well it is built. There is no golden rule for the most effective proportion - 30:70, 40:60, or 50:50. However, keep in mind that a larger training data set doesn't ensure a more accurate model. I use 70:30 in this post. The data set could also be sorted in a certain way, in which case the computer might spit out a biased result. To prevent this, I am going to choose the training data set randomly with the "sample" command.
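To see what the "sample" command does, here is a tiny sketch (the vector shown in the comment is just an illustration, not the exact output):

set.seed(1234)
#Draw 10 values of 1 or 2, where 1 (training) has probability 0.7
#and 2 (test) has probability 0.3
sample(2, 10, replace=TRUE, prob=c(0.7, 0.3))
#A vector like 1 1 2 1 1 1 2 1 1 1 - roughly 70% ones, 30% twos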

[Code]

#K-nearest neighbor analysis
library("class")   #Provides knn(); if you don't have it, please install it
library("gmodels") #For the confusion matrix (CrossTable)

normalize <- function(x) {
  #If we don't normalize the data, features on larger scales produce far
  #longer distances than others and dominate the model.
  mean_x <- mean(x)
  stdev_x <- sd(x)*sqrt((length(x)-1)/(length(x))) #population standard deviation

  num <- x - mean_x
  denom <- stdev_x
  return (num/denom)
}

iris <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"), header = FALSE) #UCI offers great sample data sets. Let's use this one!

#Unfortunately, the columns don't have names. We should add them.
names(iris) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")

#"lapply" is a great way to apply a function to every column of a data frame.
#Here we normalize the four numeric columns.
iris_norm <- as.data.frame(lapply(iris[1:4], normalize))

#Now, we are going to split the data into two sets - training and test.
#Training allows the computer to learn the pattern of the data.
#Test allows us to validate how accurate our model is.

set.seed(1234)
#"sample" shuffles which row becomes training and which becomes test.
#It's similar to shuffling cards.
ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.7, 0.3))
#What we know = training data set
iris.training <- iris_norm[ind==1, 1:4]
iris.trainLabels <- iris[ind==1, 5]
#What we want to predict = test data set
iris.test <- iris_norm[ind==2, 1:4]
iris.testLabels <- iris[ind==2, 5]

#K-nearest analysis
#k=3: use the nearest 3 points to classify each unknown record
iris_pred <- knn(train = iris.training, test = iris.test, cl = iris.trainLabels, k=3)

#Print out the confusion matrix
CrossTable(x = iris.testLabels, y = iris_pred, prop.chisq=FALSE)
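As a quick sanity check, you can confirm that the normalization behaved as expected; every column of iris_norm should have mean 0 and a standard deviation close to 1:

#Each normalized column should have mean ~0
round(colMeans(iris_norm), 10)
#And standard deviation ~1 (slightly above 1, because normalize() divides
#by the population standard deviation while sd() computes the sample version)
sapply(iris_norm, sd)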

[Explanation of sample data]

Well, as a non-expert on flowers, I find it difficult to tell them apart. Iris sub-species look really similar to one another. One way to distinguish them is to use the length and width of the petal and sepal.

<Iris-Setosa>

[Photo of an Iris setosa flower]
<Iris-versicolor>

[Photo of an Iris versicolor flower]
Take a look at the two pictures above. They look very similar, and it is nearly impossible to figure out which one is versicolor and which one is setosa without a DNA test. However, there is always a rule of thumb. Some biologists came up with the brilliant idea of using the size of the petal and sepal.
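You can see this rule of thumb directly in the data loaded above; the average measurements differ sharply by species (note how small setosa's petals are):

#Average sepal/petal measurements per species
aggregate(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~ Species,
          data = iris, FUN = mean)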
<Diagram: the sepal and petal of an iris flower>
<Output>

                  | iris_pred
  iris.testLabels | Iris-setosa | Iris-versicolor | Iris-virginica | Row Total
  ----------------|-------------|-----------------|----------------|----------
  Iris-setosa     |          10 |               0 |              0 |        10
  Iris-versicolor |           0 |              12 |              0 |        12
  Iris-virginica  |           0 |               2 |             14 |        16
  ----------------|-------------|-----------------|----------------|----------
  Column Total    |          10 |              14 |             14 |        38

(Counts only; CrossTable also prints row, column, and table proportions under each cell.)
<How to interpret this table>

This table is called a "confusion matrix," and it is used to assess the accuracy of our model. The actual values of the test records are in the rows; the columns denote the predicted values. In the first row, our predictions exactly match the actual data. That's good. However, in the third row, our model misclassifies two of the actual Iris-virginica records as Iris-versicolor. Keep in mind that there is no 100% accurate model. Just as humans make mistakes, machines make mistakes in machine learning.

The accuracy is defined as (the number of records on the diagonal) / (the total number of records).

Take a look at the diagonal: 10 + 12 + 14 = 36.
Take a look at the total number of records: 38.

Our model has an accuracy of 36/38 ≈ 94.74%.
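If you'd rather let R do this arithmetic, a short sketch using the objects created above:

#Accuracy = records on the diagonal / total records
conf <- table(iris.testLabels, iris_pred)
sum(diag(conf)) / sum(conf) #about 0.9474 for this seed and split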
---
Next time, I'll apply this to a financial problem.

[Next Article for machine learning/data mining]

[Machine Learning] Classification Tree (CART)
[Machine Learning] Naive Bayes

