Classification with Multinomial Naive Bayes

Category: Natural Language Processing

Classification

Take an input $x$ and a fixed set of output classes $Y = \{y_1, y_2, \ldots, y_M\}$ and return a predicted class $y \in Y$.

Text classification often writes $c$ (for class) instead of $y$ for the output, and $d$ (for document) instead of $x$ for the input.

In the supervised situation we have a training set of $N$ documents, each hand-labeled with a class: $\{(d_1, c_1), \ldots, (d_N, c_N)\}$. Our goal is to learn a classifier that maps a new document $d$ to its correct class $c \in C$, where $C$ is some set of useful document classes (for example, positive and negative movie reviews, or spam and not-spam email).

Multinomial Naive Bayes

Represent the text as a bag of words. For a document $d$, out of all classes $c \in C$, the classifier outputs the class $\hat{c}$ that maximizes $P(c|d)$.

$$ \begin{aligned} \hat{c} &= \arg \max_{c \in C} P(c|d) \\ &= \arg \max_{c \in C} \frac{P(d|c)P(c)}{P(d)} \end{aligned} $$

Where $P(d | c)$ is the likelihood and $P(c)$ is the prior. Since $P(d)$ is the same for every candidate class, it can be dropped from the argmax. With features $f_1, f_2, \ldots, f_n$ representing the document:

$$ \hat{c} = \arg \max_{c \in C} P(f_1, f_2, \ldots, f_n | c)P(c) $$

Naive Bayes Assumption

Assume the features are conditionally independent given the class:

$$ P(f_1, f_2, \ldots, f_n | c) = \prod_{i=1}^n P(f_i | c) $$

Taking the words $w_i$ of the document as the features, the classifier is:

$$ \begin{aligned} c_{\text{NB}} &= \arg \max_{c \in C} P(c) \prod_{i=1}^n P(w_i | c) \\ &= \arg \max_{c \in C} \left[ \log P(c) + \sum_{i=1}^n \log P(w_i | c) \right] \end{aligned} $$

The second line follows because $\log$ is monotonically increasing:

$$ A > B \iff \log A > \log B $$

so taking logs does not change which class wins the argmax, and it avoids floating-point underflow from multiplying many small probabilities.
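As a minimal sketch of the log-space decision rule (the two-class parameters below are made up for illustration, not part of the notes):

```python
import math

# Made-up log-space parameters for a toy two-class sentiment model.
logprior = {"pos": math.log(0.5), "neg": math.log(0.5)}
loglikelihood = {
    ("great", "pos"): math.log(0.05), ("great", "neg"): math.log(0.01),
    ("boring", "pos"): math.log(0.01), ("boring", "neg"): math.log(0.06),
}

def score(doc, c):
    # log P(c) + sum_i log P(w_i | c)
    return logprior[c] + sum(loglikelihood[(w, c)] for w in doc)

doc = ["great", "great", "boring"]
c_nb = max(logprior, key=lambda c: score(doc, c))
print(c_nb)  # "pos": the two "great"s outweigh the one "boring"
```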

Training

Estimate $\hat{P}(c)$, the prior probability of each class, as the fraction of training documents in class $c$:

$$ \hat{P}(c) = \frac{N_c}{N_{doc}} $$

Estimate the likelihood of word $w_i$ given class $c$ from word counts over the concatenation of all documents in class $c$:

$$ \hat{P}(w_i | c) = \frac{count(w_i, c)}{\sum_{w \in V} count(w, c)} $$

Or with Laplace (add-one) smoothing, so a word unseen in class $c$ does not zero out the whole product:

$$ \hat{P}(w_i | c) = \frac{count(w_i, c) + 1}{\sum_{w \in V} count(w, c) + |V|} $$
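For concreteness, a made-up worked example (the counts are illustrative, not from the notes): suppose "great" appears 3 times in class $c$, class $c$ contains 20 tokens total, and $|V| = 10$. Then:

$$ \hat{P}(\text{great} \mid c) = \frac{3 + 1}{20 + 10} = \frac{4}{30} \approx 0.13, \qquad \hat{P}(\text{unseen} \mid c) = \frac{0 + 1}{20 + 10} = \frac{1}{30} $$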

function TRAIN_NAIVE_BAYES(D, C) returns log P(c), log P(w|c), V
  Ndoc = number of documents in D
  V = vocabulary of D
  for each class c ∈ C do
    Nc = number of documents from D in class c
    logprior[c] = log(Nc / Ndoc)
    bigdoc[c] = concat(d ∈ D where d.class = c)
    for each word w ∈ V do
      count[w, c] = number of times w appears in bigdoc[c]
      loglikelihood[w, c] = log [(count[w, c] + 1) / (sum_{w' ∈ V} count[w', c] + |V|)]

  return logprior, loglikelihood, V
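A minimal Python rendering of the training pseudocode above (a sketch, assuming each training example is a pre-tokenized `(tokens, class)` pair; the function and variable names are mine, not from the notes):

```python
import math
from collections import Counter

def train_naive_bayes(documents, classes):
    """Sketch of TRAIN_NAIVE_BAYES: documents is a list of (tokens, class) pairs."""
    n_doc = len(documents)
    vocab = {w for tokens, _ in documents for w in tokens}
    logprior, loglikelihood = {}, {}
    for c in classes:
        docs_c = [tokens for tokens, label in documents if label == c]
        logprior[c] = math.log(len(docs_c) / n_doc)
        # Concatenate all documents of class c and count word occurrences.
        counts = Counter(w for tokens in docs_c for w in tokens)
        total = sum(counts.values())
        for w in vocab:
            # Add-one (Laplace) smoothed log likelihood.
            loglikelihood[(w, c)] = math.log(
                (counts[w] + 1) / (total + len(vocab))
            )
    return logprior, loglikelihood, vocab
```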

function TEST_NAIVE_BAYES(testdoc, logprior, loglikelihood, C, V) returns best_c
  for each class c ∈ C do
    sum[c] = logprior[c]
    for each position i in testdoc do
      word = testdoc[i]
      if word ∈ V then
        sum[c] = sum[c] + loglikelihood[word, c]

  return argmax_c sum[c]
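And the corresponding test function, with a toy usage example that reuses `train_naive_bayes` from the sketch above (the training documents are made up):

```python
def test_naive_bayes(testdoc, logprior, loglikelihood, classes, vocab):
    """Score each class as log P(c) + sum of log P(w|c); skip unknown words."""
    def score(c):
        return logprior[c] + sum(
            loglikelihood[(w, c)] for w in testdoc if w in vocab
        )
    return max(classes, key=score)

# Toy usage with made-up training data.
train = [
    (["great", "fun", "great"], "pos"),
    (["boring", "dull"], "neg"),
]
logprior, loglikelihood, vocab = train_naive_bayes(train, ["pos", "neg"])
print(test_naive_bayes(["great", "boring"], logprior, loglikelihood,
                       ["pos", "neg"], vocab))  # -> "pos"
```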

Evaluation

Evaluate with a confusion matrix, and report precision, recall, and F1 score.
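For a given class treated as the positive class, with true positives $TP$, false positives $FP$, and false negatives $FN$ read off the confusion matrix:

$$ \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $$

With more than two classes, these per-class scores are combined by macro- or micro-averaging.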