Naive Bayes Classifier

The optimal classifier.

Naive Bayes is a Generative probabilistic classifier. This is in contrast to a Discriminative probabilistic classifier, where the prediction $p(label|features)$ is computed directly; linear and logistic regression are examples of this type of classifier.

In a Generative model classifier, $p(label|features)$ is computed, up to a normalising constant, via

$p(features|label) \times p(label)$

Here $p(features|label)$ is called the Likelihood, and $p(label)$ is called the Prior.
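Written out in full, this is just Bayes' rule; the denominator $p(features)$ is a normalising constant that is the same for every label:

$p(label|features) = \frac{p(features|label) \times p(label)}{p(features)}$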

For example, Socrative is a voting app that can be used from a smartphone or by other means (e.g. a web browser). We are given that people who own a smartphone use Socrative with probability 0.7, and people who don't own a smartphone use Socrative with probability 0.5. We want to classify a person into one of two classes (Phone, NoPhone) based on this single feature, 'uses Socrative':

$p(Soc|Phone) = 0.7$

$p(Soc|NoPhone) = 0.5$

Now if we see that someone uses Socrative, we should conclude that this person is more likely to have a phone, because $p(Soc|Phone) > p(Soc|NoPhone)$. This is called the Maximum Likelihood (ML) Hypothesis: we select the label (or hypothesis) under which the observed evidence (feature) is most probable.
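As a minimal sketch in Python (the dictionary below just holds the two labels and the likelihood values given above):

```python
# Likelihood of the feature 'uses Socrative' under each label (values from above)
likelihood = {"Phone": 0.7, "NoPhone": 0.5}

# Maximum Likelihood (ML) hypothesis: pick the label under which
# the observed evidence is most probable
ml_label = max(likelihood, key=likelihood.get)
print(ml_label)  # Phone
```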

If we were given additional information such as $p(Phone) = 0.3$ (this is called a Prior), then the decision making changes to:

$p(Phone|Soc) = p(Phone) \times p(Soc|Phone) / p(Soc)$

$p(NoPhone|Soc) = p(NoPhone) \times p(Soc|NoPhone)/ p(Soc)$

We use Bayes rule to compute the probabilities of each outcome for the feature we observe.

Substituting,

$p(Phone|Soc) = 0.3 \times 0.7 / p(Soc) = 0.21 / p(Soc)$

$p(NoPhone|Soc) = (1-0.3) \times 0.5 / p(Soc) = 0.35 / p(Soc)$

On this occasion, we select the label that maximises the posterior probability given the features. Since $p(Phone|Soc) < p(NoPhone|Soc)$, it is more likely that this person doesn't have a phone. This is called the Maximum a Posteriori (MAP) Hypothesis.
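The same decision with the prior folded in, as a sketch; since $p(Soc)$ is a common denominator for both labels, it can be dropped when we only want the argmax:

```python
prior = {"Phone": 0.3, "NoPhone": 0.7}       # p(label)
likelihood = {"Phone": 0.7, "NoPhone": 0.5}  # p(Soc | label)

# Unnormalised posteriors: p(label) * p(Soc | label)
joint = {label: prior[label] * likelihood[label] for label in prior}

# Maximum a Posteriori (MAP) hypothesis: the common denominator p(Soc)
# does not change which label wins, so compare the joints directly
map_label = max(joint, key=joint.get)
print(joint)      # {'Phone': 0.21, 'NoPhone': 0.35} (up to floating point)
print(map_label)  # NoPhone
```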

Probability estimation explosion

Generalising this to $d$ binary features and $k$ classes, we need to estimate

$p(label_k, feature_1, feature_2, \ldots, feature_d) = p(feature_1, feature_2, \ldots, feature_d | label_k) \times p(label_k)$

For the $k$ labels we have to estimate $(k-1)$ prior probabilities (the last one follows because they add up to 1).

For each label, we have to estimate $2^d - 1$ likelihood probabilities for the $d$ binary features (hence the 2; minus 1 because they too add up to 1).

So in total $(k-1) + k \times (2^d - 1)$ probabilities.

For example, for 20 binary features and 2 classes the number of probability estimates would be $(2-1) + 2 \times (2^{20} - 1) = 1 + 2^{21} - 2 = 2^{21} - 1$.
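A quick way to check this count (a small sketch, with a hypothetical helper `n_estimates`):

```python
def n_estimates(k: int, d: int) -> int:
    """Probabilities to estimate for k labels and d binary features, when the
    full joint likelihood p(feature_1, ..., feature_d | label) is modelled."""
    return (k - 1) + k * (2 ** d - 1)

print(n_estimates(2, 20))  # 2097151
print(2 ** 21 - 1)         # 2097151, matches the closed form above
```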

Aside 1

The $p(Soc)$ term above can be expanded as shown below.

Using the Sum Rule

$p(x) = \sum_y p(x,y)$

and the Product Rule $p(x,y) = p(x)p(y|x) = p(y)p(x|y)$,

$p(Soc)$ can be written as:

$\qquad= \sum_{labels} p(Soc,label)$

$\qquad= \sum_{labels} p(label) \cdot p(Soc|label)$

$\qquad= p(Phone) \cdot p(Soc|Phone) + p(NoPhone) \cdot p(Soc|NoPhone)$
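Plugging in the numbers from the example above (a sketch; with the denominator known, the posteriors can be fully normalised):

```python
prior = {"Phone": 0.3, "NoPhone": 0.7}       # p(label)
likelihood = {"Phone": 0.7, "NoPhone": 0.5}  # p(Soc | label)

# Sum rule + product rule: p(Soc) = sum over labels of p(label) * p(Soc | label)
p_soc = sum(prior[label] * likelihood[label] for label in prior)
print(p_soc)  # 0.56 (up to floating point)

# Normalised posteriors
posterior = {label: prior[label] * likelihood[label] / p_soc for label in prior}
print(posterior)  # {'Phone': 0.375, 'NoPhone': 0.625} (up to floating point)
```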

Aside 2

Independent Events are events that have no influence on each other, for example the outcomes of two coin tosses: the first outcome does not influence the second.

Mutually Exclusive events are events that have no common outcomes, for example an event (A) where a 6-sided die rolls an even number and an event (B) where it rolls an odd number. These two events are not independent, because the occurrence of event A prevents event B from happening.
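A small enumeration illustrating both notions, assuming a fair coin and a fair die (a sketch):

```python
from itertools import product

# Independence: two fair coin tosses, 4 equally likely outcomes
coins = list(product("HT", repeat=2))
p_first_h = sum(1 for a, b in coins if a == "H") / len(coins)
p_second_h = sum(1 for a, b in coins if b == "H") / len(coins)
p_both_h = sum(1 for a, b in coins if a == "H" and b == "H") / len(coins)
print(p_both_h == p_first_h * p_second_h)  # True -> the tosses are independent

# Mutual exclusivity: even vs odd on one fair 6-sided die roll
even = {2, 4, 6}
odd = {1, 3, 5}
print(even & odd)  # set() -> no common outcomes, mutually exclusive
p_even, p_odd, p_both = len(even) / 6, len(odd) / 6, len(even & odd) / 6
print(p_both == p_even * p_odd)  # False -> the events are not independent
```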