### API Proposal to add Gibbs sampling to LDA
This enhancement was recently brought up in a meeting between a few of sklearn's core maintainers and students in the MLH Fellowship. @parthivc and I have expressed interest, and @thomasjpfan recommended some starting points for proposing API changes.
### Parameter changes
From what I can tell, replacing Variational Bayes (VB) with a collapsed Gibbs sampler (CGS) shouldn't require any additional parameters. <sup>[[1, page 3]](http://mlg.eng.cam.ac.uk/teaching/4f13/1112/lect10.pdf)</sup>
I've compared the current parameters in [`sklearn.decomposition.LatentDirichletAllocation`](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) with the parameters used by @hannawallach in her excellent [Python 2.7 implementation](https://github.com/hannawallach/python-lda/blob/master/src/lda.py). (I believe this is the link @amueller referred to earlier.) I couldn't find any new parameters that need to be added, and many existing parameters can be reused directly; a rough sketch of the core sampling step, written with the reused names, follows the table.
<details>
<summary>Table hidden, click to show</summary>
**Note:** Certain parameters are used by sklearn but are absent from @hannawallach's implementation. I've grouped these into 3 categories:
1. Parameters that are specific to Variational Bayes, whether online or batch.
2. Parameters that are related to sklearn’s use of perplexity for checking LDA model convergence.
3. Parameters related to sklearn core functionality (e.g. `partial_fit`, `random_state`, parallel processing, output verbosity).
I believe parameters of the 1st type are unnecessary for Gibbs sampling. Parameters of the 2nd and 3rd type, though, could be reused.
| Name (sklearn’s LDA w/VB) | Name (@hannawallach’s LDA w/Gibbs) | Description |
| ------------------------------------------------------------- | --------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| `n_components` (default: `n_components=10`) | `T` (default: `T=100`) | Number of topics. |
| `doc_topic_prior` (default: `1/n_components`) | `alpha` (default: `alpha=0.1`) | Prior of the document topic distribution theta (θ). In [[3]](https://papers.nips.cc/paper/3902-online-learning-for-latent-dirichlet-allocation.pdf), this is called alpha (α). |
| `topic_word_prior` (default: `1/n_components`) | `beta` (default: `beta=0.01`) | Prior of the topic word distribution beta (β). In [[3]](https://papers.nips.cc/paper/3902-online-learning-for-latent-dirichlet-allocation.pdf), this is called eta (η). |
| `learning_method` | N/A <sup>[1]</sup> | Method used to update `components_` (`'batch'` or `'online'`). |
| `learning_decay` | N/A <sup>[1]</sup> | A parameter that controls learning rate in the online learning method. In the literature, this is called kappa (κ). |
| `learning_offset` | N/A <sup>[1]</sup> | A (positive) parameter that downweights early iterations in online learning. In the literature, this is called tau_0 (τ<sub>0</sub>). |
| `max_iter` (default: `max_iter=10`) | `S` (default: `S=1000`) | “The maximum number of iterations.” (sklearn) // “The number of Gibbs sampling iterations.” (@hannawallach) |
| `batch_size` | N/A <sup>[1]</sup> | Number of documents to use in each EM iteration. Only used in online learning. |
| `evaluate_every` | N/A <sup>[2]</sup> | How often to evaluate perplexity. |
| `total_samples` | N/A <sup>[3]</sup> | Total number of documents. Only used in the partial_fit method. |
| `perp_tol` | N/A <sup>[2]</sup> | Perplexity tolerance in batch learning. Only used when evaluate_every is greater than 0. |
| `mean_change_tol` | N/A <sup>[1]</sup> | Stopping tolerance for updating document topic distribution in E-step. |
| `max_doc_update_iter` | N/A <sup>[1]</sup> | Max number of iterations for updating document topic distribution in the E-step. |
| `n_jobs` | N/A <sup>[3]</sup> | The number of jobs to use in the E-step. |
| `verbose` | N/A <sup>[3]</sup> | Verbosity level. |
| `random_state` | N/A <sup>[3]</sup> | Pass an int for reproducible results across multiple function calls. |
| `X` | `corpus` | Document word matrix. |
| `weight` | N/A <sup>[1]</sup> | Weight, rho (ρ). Used internally, and is a function of `learning_offset` and `learning_decay`. |
</details>
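To make the parameter reuse concrete, here is a rough NumPy-level sketch (not a proposed implementation) of the per-token CGS update, written in terms of the reused names `doc_topic_prior`, `topic_word_prior`, and `n_components`. The count arrays and the flat token loop are my own assumptions about how the internal bookkeeping might be organized:

```python
def gibbs_sweep(token_doc, token_word, assignments,
                doc_topic, topic_word, topic_counts,
                doc_topic_prior, topic_word_prior, rng):
    """One collapsed Gibbs sweep: resample the topic of every token.

    token_doc[i] / token_word[i] give the document and word id of token i;
    doc_topic, topic_word, topic_counts are the usual CGS count matrices;
    rng is e.g. np.random.default_rng(random_state).
    """
    n_components, n_features = topic_word.shape
    for i in range(len(assignments)):
        d, w, z = token_doc[i], token_word[i], assignments[i]
        # Remove this token's current assignment from the counts.
        doc_topic[d, z] -= 1
        topic_word[z, w] -= 1
        topic_counts[z] -= 1
        # Full conditional: p(z=k | rest) ∝ (n_dk + α) (n_kw + β) / (n_k + Vβ)
        p = ((doc_topic[d] + doc_topic_prior)
             * (topic_word[:, w] + topic_word_prior)
             / (topic_counts + n_features * topic_word_prior))
        z = rng.choice(n_components, p=p / p.sum())
        # Record the new assignment and restore the counts.
        assignments[i] = z
        doc_topic[d, z] += 1
        topic_word[z, w] += 1
        topic_counts[z] += 1
```

Nothing here needs a parameter beyond the priors, `n_components`, `max_iter` (number of sweeps), and `random_state`, which is why the table above doesn't introduce anything new.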
### R's implementation
I've also looked at R's implementation ([`lda.collapsed.gibbs.sampler`](https://www.rdocumentation.org/packages/lda/versions/1.4.2/topics/lda.collapsed.gibbs.sampler)) to see whether it exposes any additional parameters. A couple seem useful to add, though neither is strictly necessary; a sketch of the log-likelihood computation follows the table.
<details>
<summary>Table hidden, click to show</summary>
| Name | Description |
| ---- | ----------- |
| `burnin` | A scalar integer indicating the number of Gibbs sweeps to consider as burn-in (i.e., throw away) |
| `compute.log.likelihood` | A scalar logical which when TRUE will cause the sampler to compute the log likelihood of the words (to within a constant factor) after each sweep over the variables. The log likelihood for each iteration is stored in the log.likelihood field of the result. This is useful for assessing convergence, but slows things down a tiny bit. |
</details>
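If something like `compute.log.likelihood` were adopted (perhaps folded into the existing `evaluate_every` machinery), the quantity tracked after each sweep would presumably be the collapsed joint log-likelihood, which falls out of the same count matrices the sampler already maintains. A hedged sketch, with names of my own choosing:

```python
from scipy.special import gammaln


def collapsed_log_likelihood(doc_topic, topic_word,
                             doc_topic_prior, topic_word_prior):
    """log p(w, z | alpha, beta), up to an additive constant.

    Uses the same doc-topic and topic-word count matrices as the sweep
    above; names and factoring are assumptions, not sklearn internals.
    """
    n_components, n_features = topic_word.shape
    # Topic-word part: sum_k [ sum_w log Γ(n_kw + β) - log Γ(n_k + Vβ) ]
    ll = gammaln(topic_word + topic_word_prior).sum()
    ll -= gammaln(topic_word.sum(axis=1) + n_features * topic_word_prior).sum()
    # Document-topic part: sum_d [ sum_k log Γ(n_dk + α) - log Γ(n_d + Kα) ]
    ll += gammaln(doc_topic + doc_topic_prior).sum()
    ll -= gammaln(doc_topic.sum(axis=1) + n_components * doc_topic_prior).sum()
    return ll
```

A `burnin` parameter would only change which sweeps contribute to the final estimates, so it wouldn't affect the sampling code itself.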
### Other details
There are a few API details I'm unclear on:
* Whether to keep online/batch VB as an alternative. _(A parameter `method` could be added with options `{'bayes', 'gibbs'}` for backwards compatibility; see the sketch after this list. Parameters specific to Variational Bayes could then be left as optional keyword arguments, with their descriptions amended to mention VB explicitly.)_
* Default values for `n_components`, `max_iter`, and the priors. _(There are some differences between sklearn's LDA w/VB and @hannawallach's LDA w/CGS, as listed in the first table.)_
* The discussion above mentions Cython, but @thomasjpfan mentioned the possibility of a pure Python/NumPy implementation.
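For the first bullet, the backwards-compatible option might look something like this. To be clear, `method='gibbs'` is purely hypothetical (it is exactly what this proposal is asking about) and does not exist in scikit-learn today:

```python
from sklearn.decomposition import LatentDirichletAllocation

# Existing behaviour stays the default.
lda_vb = LatentDirichletAllocation(n_components=10, learning_method='online')

# Hypothetical: `method='gibbs'` is the proposed addition; the name,
# spelling, and defaults are all open questions.
lda_cgs = LatentDirichletAllocation(n_components=10, max_iter=1000,
                                    method='gibbs', random_state=0)
```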
I'd love to hear thoughts on this before proceeding. :)