### API Proposal to add Gibbs sampling to LDA
This enhancement was recently brought up in a meeting between a few of sklearn's core maintainers and students in the MLH Fellowship. @parthivc and I have expressed interest, and @thomasjpfan recommended some starting points for proposing API changes.
### Parameter changes
From what I can tell, replacing Variational Bayes (VB) with a collapsed Gibbs sampler (CGS) shouldn't require any additional parameters. <sup>[[1, page 3]](http://mlg.eng.cam.ac.uk/teaching/4f13/1112/lect10.pdf)</sup>
I've compared the current parameters in [`sklearn.decomposition.LatentDirichletAllocation`](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) with the parameters used by @hannawallach in her excellent [Python 2.7 implementation](https://github.com/hannawallach/python-lda/blob/master/src/lda.py). (I believe this is the link @amueller referred to earlier.) I couldn't find any new parameters that need to be added, and many existing parameters can be reused directly; a rough sketch of the core sampling step, written with the reused names, follows the table.
<details>
<summary>Table hidden, click to show</summary>
**Note:** Certain parameters are used by sklearn but are absent from @hannawallach's implementation. I've grouped these into 3 categories:
1. Parameters that are specific to Variational Bayes, whether online or batch.
2. Parameters that are related to sklearn’s use of perplexity for checking LDA model convergence.
3. Parameters related to sklearn core functionality (e.g. `partial_fit`, `random_state`, parallel processing, output verbosity).
I believe parameters of the 1st type are unnecessary for Gibbs sampling. Parameters of the 2nd and 3rd type, though, could be reused.
| Name (sklearn’s LDA w/VB) | Name (@hannawallach’s LDA w/Gibbs) | Description |
| ------------------------------------------------------------- | --------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| `n_components` (default: `n_components=10`) | `T` (default: `T=100`) | Number of topics. |
| `doc_topic_prior` (default: `1/n_components`) | `alpha` (default: `alpha=0.1`) | Prior of the document topic distribution theta (θ). In [[3]](https://papers.nips.cc/paper/3902-online-learning-for-latent-dirichlet-allocation.pdf), this is called alpha (α). |
| `topic_word_prior` (default: `1/n_components`) | `beta` (default: `beta=0.01`) | Prior of the topic word distribution beta (β). In [[3]](https://papers.nips.cc/paper/3902-online-learning-for-latent-dirichlet-allocation.pdf), this is called eta (η). |
| `learning_method` | N/A <sup>[1]</sup> | Method used to update `components_` (`'batch'` or `'online'`). |
| `learning_decay` | N/A <sup>[1]</sup> | A parameter that controls learning rate in the online learning method. In the literature, this is called kappa (κ). |
| `learning_offset` | N/A <sup>[1]</sup> | A (positive) parameter that downweights early iterations in online learning. In the literature, this is called tau_0 (τ<sub>0</sub>). |
| `max_iter` (default: `max_iter=10`) | `S` (default: `S=1000`) | “The maximum number of iterations.” (sklearn) // “The number of Gibbs sampling iterations.” (@hannawallach) |
| `batch_size` | N/A <sup>[1]</sup> | Number of documents to use in each EM iteration. Only used in online learning. |
| `evaluate_every` | N/A <sup>[2]</sup> | How often to evaluate perplexity. |
| `total_samples` | N/A <sup>[3]</sup> | Total number of documents. Only used in the partial_fit method. |
| `perp_tol` | N/A <sup>[2]</sup> | Perplexity tolerance in batch learning. Only used when evaluate_every is greater than 0. |
| `mean_change_tol` | N/A <sup>[1]</sup> | Stopping tolerance for updating document topic distribution in E-step. |
| `max_doc_update_iter` | N/A <sup>[1]</sup> | Max number of iterations for updating document topic distribution in the E-step. |
| `n_jobs` | N/A <sup>[3]</sup> | The number of jobs to use in the E-step. |
| `verbose` | N/A <sup>[3]</sup> | Verbosity level. |
| `random_state` | N/A <sup>[3]</sup> | Pass an int for reproducible results across multiple function calls. |
| `X` | `corpus` | Document word matrix. |
| `weight` | N/A <sup>[1]</sup> | Weight, rho (ρ). Used internally, and is a function of `learning_offset` and `learning_decay`. |
</details>
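To make the parameter reuse concrete, here is a rough NumPy-level sketch (not a proposed implementation) of the per-token CGS update, written in terms of the reused names `doc_topic_prior`, `topic_word_prior`, and `n_components`. The count arrays and the flat token loop are my own assumptions about how the internal bookkeeping might be organized:

```python
def gibbs_sweep(token_doc, token_word, assignments,
                doc_topic, topic_word, topic_counts,
                doc_topic_prior, topic_word_prior, rng):
    """One collapsed Gibbs sweep: resample the topic of every token.

    token_doc[i] / token_word[i] give the document and word id of token i;
    doc_topic, topic_word, topic_counts are the usual CGS count matrices;
    rng is e.g. np.random.default_rng(random_state).
    """
    n_components, n_features = topic_word.shape
    for i in range(len(assignments)):
        d, w, z = token_doc[i], token_word[i], assignments[i]
        # Remove this token's current assignment from the counts.
        doc_topic[d, z] -= 1
        topic_word[z, w] -= 1
        topic_counts[z] -= 1
        # Full conditional: p(z=k | rest) ∝ (n_dk + α) (n_kw + β) / (n_k + Vβ)
        p = ((doc_topic[d] + doc_topic_prior)
             * (topic_word[:, w] + topic_word_prior)
             / (topic_counts + n_features * topic_word_prior))
        z = rng.choice(n_components, p=p / p.sum())
        # Record the new assignment and restore the counts.
        assignments[i] = z
        doc_topic[d, z] += 1
        topic_word[z, w] += 1
        topic_counts[z] += 1
```

Nothing here needs a parameter beyond the priors, `n_components`, `max_iter` (number of sweeps), and `random_state`, which is why the table above doesn't introduce anything new.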
### R's implementation
I've also looked at R's implementation ([`lda.collapsed.gibbs.sampler`](https://www.rdocumentation.org/packages/lda/versions/1.4.2/topics/lda.collapsed.gibbs.sampler)) to see whether it exposes any additional parameters. A couple seem useful to add, though neither is strictly necessary; a sketch of the log-likelihood computation follows the table.
<details>
<summary>Table hidden, click to show</summary>
| Name | Description |
| ---- | ----------- |
| `burnin` | A scalar integer indicating the number of Gibbs sweeps to consider as burn-in (i.e., throw away) |
| `compute.log.likelihood` | A scalar logical which when TRUE will cause the sampler to compute the log likelihood of the words (to within a constant factor) after each sweep over the variables. The log likelihood for each iteration is stored in the log.likelihood field of the result. This is useful for assessing convergence, but slows things down a tiny bit. |
</details>
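If something like `compute.log.likelihood` were adopted (perhaps folded into the existing `evaluate_every` machinery), the quantity tracked after each sweep would presumably be the collapsed joint log-likelihood, which falls out of the same count matrices the sampler already maintains. A hedged sketch, with names of my own choosing:

```python
from scipy.special import gammaln


def collapsed_log_likelihood(doc_topic, topic_word,
                             doc_topic_prior, topic_word_prior):
    """log p(w, z | alpha, beta), up to an additive constant.

    Uses the same doc-topic and topic-word count matrices as the sweep
    above; names and factoring are assumptions, not sklearn internals.
    """
    n_components, n_features = topic_word.shape
    # Topic-word part: sum_k [ sum_w log Γ(n_kw + β) - log Γ(n_k + Vβ) ]
    ll = gammaln(topic_word + topic_word_prior).sum()
    ll -= gammaln(topic_word.sum(axis=1) + n_features * topic_word_prior).sum()
    # Document-topic part: sum_d [ sum_k log Γ(n_dk + α) - log Γ(n_d + Kα) ]
    ll += gammaln(doc_topic + doc_topic_prior).sum()
    ll -= gammaln(doc_topic.sum(axis=1) + n_components * doc_topic_prior).sum()
    return ll
```

A `burnin` parameter would only change which sweeps contribute to the final estimates, so it wouldn't affect the sampling code itself.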
### Other details
There are a few API details I'm unclear on:
* Whether to keep online/batch VB as an alternative. _(A parameter `method` could be added with options `{'bayes', 'gibbs'}` for backwards compatibility; see the sketch after this list. Parameters specific to Variational Bayes could then be left as optional keyword arguments, with their descriptions amended to mention VB explicitly.)_
* Default values for `n_components`, `max_iter`, and the priors. _(There are some differences between sklearn's LDA w/VB and @hannawallach's LDA w/CGS, as listed in the first table.)_
* The discussion above mentions Cython, but @thomasjpfan mentioned the possibility of a pure Python/NumPy implementation.
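For the first bullet, the backwards-compatible option might look something like this. To be clear, `method='gibbs'` is purely hypothetical (it is exactly what this proposal is asking about) and does not exist in scikit-learn today:

```python
from sklearn.decomposition import LatentDirichletAllocation

# Existing behaviour stays the default.
lda_vb = LatentDirichletAllocation(n_components=10, learning_method='online')

# Hypothetical: `method='gibbs'` is the proposed addition; the name,
# spelling, and defaults are all open questions.
lda_cgs = LatentDirichletAllocation(n_components=10, max_iter=1000,
                                    method='gibbs', random_state=0)
```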
I'd love to hear thoughts on this before proceeding. :)