Naive Bayes Classifier

The optimal classifier.

Naive Bayes is a Generative probabilistic classifier. This is in contrast to a Discriminative probabilistic classifier, where the prediction $p(label|features)$ is computed directly; linear and logistic regression are examples of this type of classifier.

In a Generative model classifier, $p(label|features)$ is computed, up to a normalising constant, via

$p(features|label) \times p(label)$

Here $p(features|label)$ is called the Likelihood, and $p(label)$ is called the Prior.
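Written out in full, this is just Bayes' rule; the denominator $p(features)$ is a normalising constant that is the same for every label:

$p(label|features) = \frac{p(features|label) \times p(label)}{p(features)}$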

For example, Socrative is a voting app that can be used from a smartphone or by other means (e.g. a web browser). We are given that people who own a smartphone use Socrative with probability 0.7, and people who don't own a smartphone use Socrative with probability 0.5. We want to classify a person into one of two classes (Phone, NoPhone) based on this single feature, 'uses Socrative':

$p(Soc|Phone) = 0.7$

$p(Soc|NoPhone) = 0.5$

Now if we see that someone uses Socrative, we should conclude that this person is more likely to have a phone, because $p(Soc|Phone) > p(Soc|NoPhone)$. This is called the Maximum Likelihood (ML) Hypothesis: we select the label (or hypothesis) under which the observed evidence (feature) is most probable.
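As a minimal sketch in Python (the dictionary below just holds the two labels and the likelihood values given above):

```python
# Likelihood of the feature 'uses Socrative' under each label (values from above)
likelihood = {"Phone": 0.7, "NoPhone": 0.5}

# Maximum Likelihood (ML) hypothesis: pick the label under which
# the observed evidence is most probable
ml_label = max(likelihood, key=likelihood.get)
print(ml_label)  # Phone
```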

If we were given additional information such as $p(Phone) = 0.3$ (this is called a Prior), then the decision making changes to:

$p(Phone|Soc) = p(Phone) \times p(Soc|Phone) / p(Soc)$

$p(NoPhone|Soc) = p(NoPhone) \times p(Soc|NoPhone)/ p(Soc)$

We use Bayes rule to compute the probabilities of each outcome for the feature we observe.

Substituting,

$p(Phone|Soc) = 0.3 \times 0.7 / p(Soc) = 0.21 / p(Soc)$

$p(NoPhone|Soc) = (1-0.3) \times 0.5 / p(Soc) = 0.35 / p(Soc)$

On this occasion, we select the label that maximises the posterior probability given the features. Since $p(Phone|Soc) < p(NoPhone|Soc)$, it is more likely that this person doesn't have a phone. This is called the Maximum a Posteriori (MAP) Hypothesis.
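The same decision with the prior folded in, as a sketch; since $p(Soc)$ is a common denominator for both labels, it can be dropped when we only want the argmax:

```python
prior = {"Phone": 0.3, "NoPhone": 0.7}       # p(label)
likelihood = {"Phone": 0.7, "NoPhone": 0.5}  # p(Soc | label)

# Unnormalised posteriors: p(label) * p(Soc | label)
joint = {label: prior[label] * likelihood[label] for label in prior}

# Maximum a Posteriori (MAP) hypothesis: the common denominator p(Soc)
# does not change which label wins, so compare the joints directly
map_label = max(joint, key=joint.get)
print(joint)      # {'Phone': 0.21, 'NoPhone': 0.35} (up to floating point)
print(map_label)  # NoPhone
```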

Probability estimation explosion

Generalising this to $d$ binary features and $k$ classes, we need to estimate

$p(label_k, feature_1, feature_2, \ldots, feature_d) = p(feature_1, feature_2, \ldots, feature_d | label_k) \times p(label_k)$

For the $k$ labels we have to estimate $(k-1)$ prior probabilities (the last one follows because they add up to 1).

For each label, we have to estimate $2^d - 1$ likelihood probabilities for the $d$ binary features (hence the 2; minus 1 because they too add up to 1).

So in total $(k-1) + k \times (2^d - 1)$ probabilities.

For example, for 20 binary features and 2 classes the number of probability estimates would be $(2-1) + 2 \times (2^{20} - 1) = 1 + 2^{21} - 2 = 2^{21} - 1$.
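A quick way to check this count (a small sketch, with a hypothetical helper `n_estimates`):

```python
def n_estimates(k: int, d: int) -> int:
    """Probabilities to estimate for k labels and d binary features, when the
    full joint likelihood p(feature_1, ..., feature_d | label) is modelled."""
    return (k - 1) + k * (2 ** d - 1)

print(n_estimates(2, 20))  # 2097151
print(2 ** 21 - 1)         # 2097151, matches the closed form above
```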

Aside 1

The $p(Soc)$ term above can be expanded as shown below.

Using the Sum Rule

$p(x) = \sum_y p(x,y)$

and the Product Rule $p(x,y) = p(x)p(y|x) = p(y)p(x|y)$,

$p(Soc)$ can be written as:

$\qquad= \sum_{labels} p(Soc,label)$

$\qquad= \sum_{labels} p(label) \cdot p(Soc|label)$

$\qquad= p(Phone) \cdot p(Soc|Phone) + p(NoPhone) \cdot p(Soc|NoPhone)$
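Plugging in the numbers from the example above (a sketch; with the denominator known, the posteriors can be fully normalised):

```python
prior = {"Phone": 0.3, "NoPhone": 0.7}       # p(label)
likelihood = {"Phone": 0.7, "NoPhone": 0.5}  # p(Soc | label)

# Sum rule + product rule: p(Soc) = sum over labels of p(label) * p(Soc | label)
p_soc = sum(prior[label] * likelihood[label] for label in prior)
print(p_soc)  # 0.56 (up to floating point)

# Normalised posteriors
posterior = {label: prior[label] * likelihood[label] / p_soc for label in prior}
print(posterior)  # {'Phone': 0.375, 'NoPhone': 0.625} (up to floating point)
```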

Aside 2

Independent Events are events that have no influence on each other, for example the outcomes of two coin tosses: the first outcome does not influence the second.

Mutually Exclusive events are events that have no common outcomes, for example an event (A) where a 6-sided die rolls an even number and an event (B) where it rolls an odd number. These two events are not independent, because the occurrence of event A prevents event B from happening.
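A small enumeration illustrating both notions, assuming a fair coin and a fair die (a sketch):

```python
from itertools import product

# Independence: two fair coin tosses, 4 equally likely outcomes
coins = list(product("HT", repeat=2))
p_first_h = sum(1 for a, b in coins if a == "H") / len(coins)
p_second_h = sum(1 for a, b in coins if b == "H") / len(coins)
p_both_h = sum(1 for a, b in coins if a == "H" and b == "H") / len(coins)
print(p_both_h == p_first_h * p_second_h)  # True -> the tosses are independent

# Mutual exclusivity: even vs odd on one fair 6-sided die roll
even = {2, 4, 6}
odd = {1, 3, 5}
print(even & odd)  # set() -> no common outcomes, mutually exclusive
p_even, p_odd, p_both = len(even) / 6, len(odd) / 6, len(even & odd) / 6
print(p_both == p_even * p_odd)  # False -> the events are not independent
```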