All source code in this blog is also available in Gregory Choi's GitHub source code repository.


  

Tuesday, April 12, 2016

[Machine Learning] Naive Bayes in R (1)

[Previous Posts]
Introduction to Machine Learning
K-nearest Neighbor Analysis (1)
Classification Tree (CART) (1)

[About Bayes Theorem]

Let's assume that there is a basket holding 10 marbles.
We know for sure that it contains only blue and orange marbles, but we don't know how many of each.

[Image: a basket of 10 marbles whose colors are hidden]

You are asked to guess the color of a marble as you pick it from the basket.
Your best guess is to choose blue or orange by flipping a coin. As you might expect, that is not very scientific, but without any further information it is the best way to guess.

Here is a different situation. Now you know that there are 9 blue marbles and only 1 orange marble.

[Image: a basket with 9 blue marbles and 1 orange marble]

In this case, no one would dare to guess that the marble is orange. We all know Bayes' theorem intuitively. Prof. Woonkyung Kim at Korea University called it "the evolution of the probability".

Let's organize the situation mathematically. When we don't have the information, the probability of picking a blue marble is

P(Blue) = 0.5, just the same as flipping a coin and getting heads or tails.

However, if we have the information,

P(Blue | Information) = 0.9.
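
To make the update explicit, here is a minimal R sketch of my own (not from the original post). Suppose there are two equally likely baskets: basket A holds 9 blue and 1 orange marble, basket B holds 1 blue and 9 orange. If we draw a blue marble, Bayes' theorem tells us how likely it is that we drew from basket A.

# Bayes' theorem: P(H | D) = P(D | H) * P(H) / P(D)
# Hypothetical two-basket example (an illustration, not part of the original post).
prior_A <- 0.5      # P(basket A): 9 blue, 1 orange
prior_B <- 0.5      # P(basket B): 1 blue, 9 orange
like_A  <- 9/10     # P(blue | basket A)
like_B  <- 1/10     # P(blue | basket B)

evidence    <- like_A * prior_A + like_B * prior_B  # P(blue)
posterior_A <- like_A * prior_A / evidence          # P(basket A | blue)
posterior_A                                         # 0.9

The evidence (drawing a blue marble) moves the probability from the 0.5 prior up to 0.9, which is the kind of "evolution of the probability" described above.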

Bayes' theorem is in your smartphone as well. (By the way, your cell phone is a collection of probability theorems.) Digital communication encodes all information as either 1 or 0 (encoding the information) and then sends it as a signal. One way to distinguish 1 from 0 in the signal is to manipulate the signal power: we can give 0 less power and 1 more power. The power diagram looks like the one below.

[Image: transmitted power levels for bit 0 (low power) and bit 1 (high power)]

However, as soon as this signal travels through the air, it meets white noise, which could be generated by your conversation, by light, or by interference from other electromagnetic waves. By the time the signal reaches your cell phone, the received power follows a probability distribution like the one below.

[Image: received-power distributions for bit 0 (blue curve) and bit 1 (red curve)]

When the received power falls under the blue curve, we naturally assume the signal means 0; when it falls under the red curve, we naturally assume it means 1. We are going to apply the same theory to let the machine learn from the data. (By the way, there is a slight possibility of error. Shannon came up with a way to correct this error, and he also contributed to the birth of the classification tree model.)
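
As a small illustration (my own sketch, with assumed power levels and noise), the same decision rule can be written in R: each bit value gives the received power a normal distribution, and we pick whichever bit makes the observed power more likely. This is the same reasoning naive Bayes will use below.

# Hypothetical signal model (assumed numbers): bit 0 is sent at power 1,
# bit 1 at power 4, and the white noise is Gaussian with standard deviation 1.
decide_bit <- function(received_power) {
  density_0 <- dnorm(received_power, mean = 1, sd = 1)  # P(power | bit = 0), blue curve
  density_1 <- dnorm(received_power, mean = 4, sd = 1)  # P(power | bit = 1), red curve
  # With equally likely bits, choosing the larger density is the Bayes-optimal decision.
  if (density_1 > density_0) 1 else 0
}

decide_bit(1.8)  # falls under the "0" curve, so 0
decide_bit(3.5)  # falls under the "1" curve, so 1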

[Validating the distribution function]

Let's check whether our test data (iris) follows a normal distribution. You can use the histogram technique that you learned in the stock market posts.
[Review histogram]

We are going to use sepal length and sepal width only.
Since plotting is not our scope (machine learning), I put the code in a different post.
[Getting histograms on sepal length and sepal width (Multiple histograms)]
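
If you just want a quick look without leaving this post, here is a minimal sketch of my own using base R graphics (the linked post has the full version). It assumes the iris data frame with the column names used in the code later in this post.

# Rough histograms of sepal length, one panel per species (base R graphics).
species_list <- unique(as.character(iris$Species))
par(mfrow = c(length(species_list), 1))
for (sp in species_list) {
  hist(iris$Sepal.Length[iris$Species == sp],
       main = sp, xlab = "Sepal.Length")
}
par(mfrow = c(1, 1))  # reset the plotting layout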

[Image: histograms of sepal length and sepal width for each iris sub-species]

As you can see, just like the 0-or-1 case, each sub-species is roughly normally distributed in terms of sepal length and width, so we can apply naive Bayes to this data set. Keep in mind that this Gaussian approach only applies to numeric features; a categorical value like Male or Female has to be handled differently (for example with frequency tables, or with a model such as CART that handles categorical values natively).
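
Besides eyeballing the histograms, a quick numerical check is possible. The following is a small addition of my own using the Shapiro-Wilk test on each species separately; it assumes the iris data frame as loaded in the code below.

# Shapiro-Wilk normality test per species: large p-values are consistent with
# the normality assumption behind Gaussian naive Bayes.
tapply(iris$Sepal.Length, iris$Species, function(x) shapiro.test(x)$p.value)
tapply(iris$Sepal.Width,  iris$Species, function(x) shapiro.test(x)$p.value)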

[Before Codes]
In this case, you need to install the e1071 package (and gmodels, which is used for the confusion matrix below).
install.packages("e1071")
install.packages("gmodels")

[Codes]
library("e1071")
library("gmodels")

normalize <- function(x) {
    # As in the KNN case, it is better to put every feature on the same scale.
    mean_x  <- mean(x)
    stdev_x <- sd(x) * sqrt((length(x) - 1) / length(x))  # convert sample sd to population sd

    num   <- x - mean_x
    denom <- stdev_x
    return (num / denom)
}

iris <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"), header = FALSE) # UCI offers great sample data sets. Let's use this one!

names(iris) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")
# Normalize the data so that every feature is on the same scale.
# (For Gaussian naive Bayes this does not change the predictions, but it keeps the
# fitted class means and standard deviations comparable.)
iris_norm <- as.data.frame(lapply(iris[1:4], normalize))

# Split the data into two sets, as we always do.
# Training: helps your computer build the right model.
# Test: validates whether the model works well on data it has not seen.
set.seed(1234)
ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.7, 0.3))
iris.training <- iris_norm[ind==1, ]
iris.trainingLabels <- iris[ind==1, 5]

iris.test <- iris_norm[ind==2, ]
iris.testLabels <- iris[ind==2, 5]

# Fit the model: for each species, naiveBayes estimates a normal distribution per feature.
# For naive Bayes, choosing the right training data set matters even more than in other models.
fit <- naiveBayes(iris.training, iris.trainingLabels)

# Predict the species of the test set with the fitted model.
iris_pred <- predict(fit, iris.test)

# Confusion matrix => assessing the machine learning model. I'll explain it in more depth in another post.
CrossTable(x = iris.testLabels, y = iris_pred, prop.chisq=FALSE)


[Result & Interpretation]

Again, what we need to pay attention to is the diagonal. The accuracy of this model is the sum of the diagonal entries divided by the number of test cases: 36/38 ≈ 94.7%. That's not bad at all.
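
If you prefer the accuracy as a single number rather than reading it off the table, a one-liner like the following (my addition, using the objects created above) does the job.

# Fraction of test rows where the predicted species matches the true label.
mean(as.character(iris_pred) == as.character(iris.testLabels))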

Finding the right model is very difficult, and it varies from data set to data set. That's why data scientists exist: you need a good eye to match the right model to the data you have and, if necessary, to formulate a cost-effective data-gathering strategy. That's also why data scientists have been paid so well in recent years.
