
:::info
_Tasks:_

- [x] [`calculate-n-grams`: **dn**] Calculate n-grams:
    - [x] Given a string of text, calculate the "n-gram representation":
        ```typescript
        function nGramRepresentation(s: string, n: number): Array<string>
        ```
    - [x] Given a string of text, calculate the "n-gram counting (sparse) vector":
        ```typescript
        function nGramCountingVector(s: string, n: number): Map<string, number>
        ```
- [x] [`filters`: **Joran**] Preprocess/normalize a given string of text. We need different filters. We want at least the following (investigate and maybe implement more):
    ```typescript
    function lowerCase(text: string): string
    function normalizeSpaces(text: string): string
    function removeCommonWords(text: string): string
    function removePunctuation(text: string): string
    ```
- [x] [`source-dictionary`: **Lucca**] Dictionary preprocessing (English or Dutch or the hacker jargon file or the Webster 1913 file?):
    - [x] Find a dictionary (check licenses) and download the most useful file format (the "initial dict file"). Make a baby version of this file for testing (the real dictionary file might be quite large, hundreds of MB?). Add these files to a folder called `assets` in the project root. Add a `README.md` documenting the origin of the file and its license.
    - [x] Make a TypeScript script to process the "initial dict file" into a UTF-8 text file, the "processed dict file", with the following format:
        ```
        word_1
        description_1_1

        word_1
        description_1_2

        word_2
        description_2_1

        word_3
        description_3_1

        ...
        ```
        I.e., definitions are separated by empty lines, and the word being defined is on the first line of each "paragraph". Lines are separated UNIX style by `'\n'`. The script should be run with `npm run preprocess-dict` (and `npm run preprocess-baby-dict`) and place its output into `assets`. There should be a function
        ```typescript
        async function preprocessDict(input: Readable, output: Writable)
        ```
        We add the "processed dict file" to git (also the baby version). The baby version is compared automatically in the unit tests.
- [x] [`dictionary`: **as**] Functions to work with our "processed dict files":
    - [x] Load a "processed dict file" and make it available as an `AsyncGenerator<[string, string]>` (see [MDN AsyncGenerator](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/AsyncGenerator)):
        ```typescript
        async function* loadDictionary(
          input: Readable
        ): AsyncGenerator<[string, string]>
        ```
    - [x] Run a number of filters, each taking `string` to `string`, on the description fields (one after another):
        ```typescript
        async function* filterDescriptions(
          db: AsyncIterable<[string, string]>,
          filters: ((description: string) => string) | Array<(description: string) => string>
        ): AsyncGenerator<[string, string]>
        ```
    - [x] Run a filter that maps our `[word, description]` to an arbitrary output type `T` (e.g., from `[string, string]` to `[string, Map<string, number>]` with `nGramCountingVector`):
        ```typescript
        async function* filter<T>(
          db: AsyncIterable<[string, string]>,
          filter: (word: string, description: string) => T
        ): AsyncGenerator<T>
        ```
- [ ] [`dictionary-lookup`: **Arne**] Write a command-line tool to look up a word (just the word, no n-gram stuff) in a "processed dict file". This script should be in `src/scripts/` and called `dictionary-lookup.ts`.
- [ ] [`dictionary-lookup-output-ngram`: **Joran**] Addition to the previous item: calculate some n-gram output and print it to the screen when given the command-line option `--output-n-gram` (possibly with some parameters, like which n-gram(s), or which filters...).
- [x] [`find-similar`: **Joran**] Find the top-$k$ matches of an n-gram counting vector in a "database" of n-gram counting vectors, sorted by the cosine similarity measure.
    - [x] Calculate the cosine similarity between two sparse n-gram counting vectors:
        ```typescript
        function cosineDistance(
          a: Map<string, number>,
          b: Map<string, number>
        ): number
        ```
    - [x] Scan a "database" to find the best $k$ approximate matches based on the cosine distance:
        ```typescript
        async function findSimilar(
          search: Map<string, number>,
          db: AsyncIterable<[string, Map<string, number>]>,
          k: number
        ): Promise<[[string, Map<string, number>], number][]>
        ```
        Note: instead of `AsyncIterable<[string, Map<string, number>]>` we could first write this for `Array<[string, Map<string, number>]>` or for `Map<string, Map<string, number>>` (but the latter does not allow a word to have multiple descriptions). In fact we just want to be able to iterate over the "database" by doing `for await (const [word, nGramVec] of db)`... Note that the output contains at most $k$ items.
- [x] [`normalize-unicode`: **Arne**] Normalize Unicode text (see e.g. [stackoverflow](https://stackoverflow.com/questions/286921/efficiently-replace-all-accented-characters-in-a-string/23767389#23767389)):
    ```typescript
    function normalizeUnicode(text: string): string
    ```
- [x] [`calculate-word-n-grams`: **Daan**] Rework `nGramRepresentation` and `nGramCountingVector` such that we have `s: string | Array<string>` (to be able to use the same code for n-grams on words).
- [x] [`filters-streamify`: **Daan**] Rework the filters such that instead of the `string` argument we can pass in an `Iterable` or `AsyncIterable`. We then return a `Generator` on which we can do `for await (const [...] of gen) { ... }`; this should be compatible with streams.

_Unit tests_: [Unit tests addendum](/90VwugITTeO4svnZ-p2LXg) (Only go there after you are ready to code...)
:::

Branch name: feature/n-gram/taakspecifieren

regexr.com
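
To make the `calculate-n-grams` and `find-similar` tasks above concrete, here is a minimal sketch of how the pieces could fit together. It is not the project's actual implementation. Assumptions: n-grams are taken over characters with no padding at the string boundaries; the similarity helper is called `cosineSimilarity` here (a hypothetical name, chosen because the task list does not state whether `cosineDistance` returns the similarity or one minus it); and `findSimilar` collects all scores and sorts them, which is simple but keeps the whole "database" in memory.

```typescript
// Character n-grams: every window of n consecutive characters (no padding; an assumption).
function nGramRepresentation(s: string, n: number): Array<string> {
  if (!Number.isInteger(n) || n < 1) {
    throw new RangeError("n must be a positive integer");
  }
  const grams: string[] = [];
  for (let i = 0; i + n <= s.length; i++) {
    grams.push(s.slice(i, i + n));
  }
  return grams;
}

// Sparse counting vector: n-gram -> number of occurrences.
function nGramCountingVector(s: string, n: number): Map<string, number> {
  const counts = new Map<string, number>();
  for (const gram of nGramRepresentation(s, n)) {
    counts.set(gram, (counts.get(gram) ?? 0) + 1);
  }
  return counts;
}

// Euclidean norm of a sparse vector.
function norm(v: Map<string, number>): number {
  let sumOfSquares = 0;
  v.forEach((count) => {
    sumOfSquares += count * count;
  });
  return Math.sqrt(sumOfSquares);
}

// Hypothetical helper: cosine similarity, 1 for the same direction,
// 0 when no n-gram is shared. Returns 0 if either vector is empty.
function cosineSimilarity(
  a: Map<string, number>,
  b: Map<string, number>
): number {
  let dot = 0;
  a.forEach((count, gram) => {
    dot += count * (b.get(gram) ?? 0);
  });
  const denominator = norm(a) * norm(b);
  return denominator === 0 ? 0 : dot / denominator;
}

// Score every entry, sort by descending similarity, keep at most k.
async function findSimilar(
  search: Map<string, number>,
  db: AsyncIterable<[string, Map<string, number>]>,
  k: number
): Promise<[[string, Map<string, number>], number][]> {
  const scored: [[string, Map<string, number>], number][] = [];
  for await (const entry of db) {
    scored.push([entry, cosineSimilarity(search, entry[1])]);
  }
  scored.sort((x, y) => y[1] - x[1]);
  return scored.slice(0, Math.max(0, k));
}
```

For a large dictionary, a bounded structure that keeps only the current best $k$ entries would avoid holding every score in memory; the full sort is used here only to keep the sketch short.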
