Overview

In this laboratory, we build a generative classifier for Dante’s Divina Commedia. The text is divided into three sections (Cantiche): Inferno, Purgatorio, and Paradiso.

Each tercet (three-line stanza) is assigned to one of these three classes. We model the text using the multinomial word model. Two equivalent views of the model are considered:

  1. Token-level formulation - each word is considered as an independent categorical event.
  2. Bag-of-words formulation - the document is represented in terms of counts (frequencies) for each word in the dictionary.
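The two views can be illustrated on a single tercet. The sketch below is only an illustration: the lowercase, whitespace-split tokenization and the variable names `tercet`, `tokens`, and `counts` are assumptions, not the preprocessing used in the laboratory.

```python
from collections import Counter

# Assumed example document: the opening tercet of the Inferno, lowercased
# and with punctuation removed. Real preprocessing may differ.
tercet = (
    "nel mezzo del cammin di nostra vita "
    "mi ritrovai per una selva oscura "
    "ché la diritta via era smarrita"
)

# Token-level view: an ordered sequence of words X_1, ..., X_N.
tokens = tercet.split()

# Bag-of-words view: word -> number of occurrences; word order is discarded.
counts = Counter(tokens)

# Both views describe the same N words in total.
print(len(tokens), sum(counts.values()))
```

The count representation loses the word order, which the multinomial word model does not use anyway.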

The likelihoods of the two formulations differ only by a multiplicative constant, the multinomial coefficient, which depends on the word counts but not on the model parameters. The maximum likelihood (ML) estimates and any likelihood-based inference (including likelihood ratios and posterior probabilities) are therefore the same under both formulations.
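A small numerical check can make this concrete. The sketch below uses a hypothetical three-word dictionary and two hypothetical classes with made-up word probabilities `pi_a` and `pi_b`; the helper names are illustrative. It compares the log-likelihood ratio between the two classes computed with and without the multinomial coefficient.

```python
import math

def log_multinomial_coeff(counts):
    # log( N! / (N_1! * ... * N_M!) ), computed through log-gamma.
    n = sum(counts)
    return math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)

def log_likelihood(counts, probs, with_coeff):
    # Sum of N_j * log(pi_j); words with zero count contribute nothing.
    ll = sum(c * math.log(p) for c, p in zip(counts, probs) if c > 0)
    if with_coeff:
        ll += log_multinomial_coeff(counts)
    return ll

# Assumed toy data: word counts of one document and two class models.
counts = [2, 0, 1]
pi_a = [0.5, 0.3, 0.2]
pi_b = [0.2, 0.3, 0.5]

# The coefficient depends only on the counts, so it cancels in the ratio.
llr_token = log_likelihood(counts, pi_a, False) - log_likelihood(counts, pi_b, False)
llr_bow = log_likelihood(counts, pi_a, True) - log_likelihood(counts, pi_b, True)

print(math.isclose(llr_token, llr_bow))
```

The same cancellation happens when normalizing class posteriors, since the coefficient multiplies every class likelihood by the same factor.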


Model and Likelihood Functions

Token-Level Model

Let a document (a collection of words in a tercet) be represented as an ordered sequence of tokens $X_1,...,X_N$, where each $X_i$ takes a value from the dictionary $\mathcal{D}$ (of size $M$).

The likelihood for the document is given by:

$$ \mathcal{L}_X(\Pi)=\prod_{i=1}^N\pi_{x_i}=\prod_{j=1}^M\pi_j^{N_j} $$

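where $\pi_j$ is the probability of word $j$, $\Pi=(\pi_1,\ldots,\pi_M)$ collects these probabilities, and $N_j$ is the number of times word $j$ appears in the document.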

Taking the logarithm, the log-likelihood becomes:

$$ \ell_X(\Pi)=\sum_{j=1}^MN_j\log\pi_j $$
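As a minimal sketch, the log-likelihood can be computed either by summing over token positions or by summing over word counts; both give the same value. The dictionary, the probabilities in `pi`, and the document in `tokens` are made-up toy values, not estimates from the Commedia.

```python
import math
from collections import Counter

# Assumed toy dictionary with word probabilities pi_j (they sum to 1).
pi = {"selva": 0.5, "oscura": 0.3, "via": 0.2}

# Assumed toy document as an ordered token sequence x_1, ..., x_N.
tokens = ["selva", "via", "selva", "oscura"]

# Token form: sum of log(pi_{x_i}) over the N token positions.
loglik_tokens = sum(math.log(pi[w]) for w in tokens)

# Count form: sum of N_j * log(pi_j) over the words that occur.
counts = Counter(tokens)
loglik_counts = sum(n * math.log(pi[w]) for w, n in counts.items())

print(math.isclose(loglik_tokens, loglik_counts))
```

Working in the log domain also avoids numerical underflow, which the raw product of many small probabilities would cause on longer documents.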

Bag-of-Words Model

Alternatively, consider the word occurrence counts: