In this laboratory, we build a generative classifier for Dante’s Divina Commedia. The text is divided into three sections (Cantiche): Inferno, Purgatorio, and Paradiso.
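Each tercet (three-line stanza) is assigned to one of these three classes. We model the text using the multinomial word model. Two equivalent views of the model are considered: the document as an ordered sequence of word tokens, and the document as a vector of word occurrence counts.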
Since the likelihoods in the two formulations differ only by a factor that does not depend on the model parameters (the multinomial coefficient), the maximum likelihood (ML) estimates and any likelihood-based inference (including likelihood ratios and posterior probabilities) are the same under both.
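The cancellation can be checked numerically. The sketch below is only an illustration under stated assumptions: the function names, the three-word dictionary, and the two parameter vectors `pi_a` and `pi_b` are hypothetical and are not taken from the laboratory material. It compares the log-likelihood ratio between two classes with and without the multinomial coefficient.

```python
import math

def log_multinomial_coefficient(counts):
    # log( N! / (N_1! ... N_M!) ), computed with lgamma for numerical stability
    n = sum(counts)
    return math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)

def sequence_log_likelihood(counts, pi):
    # Token-sequence view: sum_j N_j * log(pi_j); words with zero count contribute nothing
    return sum(c * math.log(p) for c, p in zip(counts, pi) if c > 0)

def count_log_likelihood(counts, pi):
    # Count view: same sum plus the parameter-independent multinomial coefficient
    return log_multinomial_coefficient(counts) + sequence_log_likelihood(counts, pi)

# Hypothetical example: one document, two candidate classes
counts = [2, 1, 1]
pi_a = [0.5, 0.25, 0.25]
pi_b = [0.2, 0.4, 0.4]

llr_seq = sequence_log_likelihood(counts, pi_a) - sequence_log_likelihood(counts, pi_b)
llr_cnt = count_log_likelihood(counts, pi_a) - count_log_likelihood(counts, pi_b)
print(llr_seq, llr_cnt)  # the two ratios agree up to floating-point error
```

The coefficient depends only on the counts of the document, so it is added to the log-likelihood of every class and drops out of any difference between classes.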
Let a document (a collection of words in a tercet) be represented as an ordered sequence of tokens $X_1,...,X_N$, where each $X_i$ takes a value from the dictionary $\mathcal{D}$ (of size $M$).
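Assuming the tokens are independent and identically distributed, with $\pi_j$ the probability of word $j$ and $\Pi=(\pi_1,\ldots,\pi_M)$, the likelihood for the document is given by:
$$ \mathcal{L}_X(\Pi)=\prod_{i=1}^N\pi_{x_i}=\prod_{j=1}^M\pi_j^{N_j} $$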
where $N_j$ is the number of times word $j$ appears in the document.
Taking the logarithm, the log-likelihood becomes:
$$ \ell_X(\Pi)=\sum_{j=1}^MN_j\log\pi_j $$
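As a minimal sketch of the two expressions above, the following Python snippet evaluates the log-likelihood once token by token and once from the counts $N_j$. The helper names, the toy dictionary, and the probabilities are hypothetical and chosen only for illustration; they are not estimates from the Commedia.

```python
import math
from collections import Counter

def log_likelihood_tokens(tokens, log_pi):
    # Sum of log pi_{x_i} over the tokens of the document
    return sum(log_pi[w] for w in tokens)

def log_likelihood_counts(tokens, log_pi):
    # Sum of N_j * log pi_j over the distinct words of the document
    counts = Counter(tokens)
    return sum(n * log_pi[w] for w, n in counts.items())

# Hypothetical word probabilities over a three-word dictionary
pi = {"nel": 0.5, "mezzo": 0.25, "cammin": 0.25}
log_pi = {w: math.log(p) for w, p in pi.items()}

# Hypothetical document, already tokenized
tercet = ["nel", "mezzo", "nel", "cammin"]

ll_tokens = log_likelihood_tokens(tercet, log_pi)
ll_counts = log_likelihood_counts(tercet, log_pi)
print(ll_tokens, ll_counts)  # both equal log(0.5^2 * 0.25 * 0.25)
```

Grouping equal tokens turns the sum over $N$ positions into a sum over dictionary entries, which is why only the counts are needed to evaluate the log-likelihood.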
Alternatively, consider the word occurrence counts: