:::info
_Tasks:_
- [x] [`calculate-n-grams`: **dn**] Calculate n-grams:
- [x] Given a string of text calculate the "n-gram representation"
```typescript
function nGramRepresentation(s: string, n: number): Array<string>
```
- [x] Given a string of text calculate the "n-gram counting (sparse) vector"
```typescript
function nGramCountingVector(s: string, n: number): Map<string, number>
```
- [x] [`filters`: **Joran**] Preprocess/normalize a given string of text. We need different filters. We want at least the following (investigate and maybe implement more):
```typescript
function lowerCase(text: string): string
function normalizeSpaces(text: string): string
function removeCommonWords(text: string): string
function removePunctuation(text: string): string
```
- [x] [`source-dictionary`: **Lucca**] Dictionary preprocessing (English or Dutch or the hacker jargon file or the Webster 1913 file?):
- [x] Find a dictionary (check licenses) and download the most useful file format (the "initial dict file"). Make a baby version of this file for testing (the real dictionary file might be quite large, possibly hundreds of MB). Add these files to a folder called `assets` in the project root. Add a `README.md` documenting the origin of the file and its license.
- [x] Make a TypeScript script to process the "initial dict file" into a utf-8 text file, the "processed dict file", with the following format:
```
word_1
description_1_1

word_1
description_1_2

word_2
description_2_1

word_3
description_3_1

...
```
I.e., we have empty lines between definitions, and the word being defined is on the first line of each "paragraph". Lines are separated UNIX-style by '\n'. The script should be run with `npm run preprocess-dict` (and `npm run preprocess-baby-dict`) and should place its output in `assets`. There should be a function
```typescript
async function preprocessDict(input: Readable, output: Writable)
```
We add the "processed dict file" to git (also the baby version). The baby version is compared automatically in the unit tests.
- [x] [`dictionary`: **as**] Functions to work with our "processed dict files":
- [x] Load a "processed dict file" and make it available as an `AsyncGenerator<[string, string]>` (see [MDN AsyncGenerator](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/AsyncGenerator)):
```typescript
async function* loadDictionary(
input: Readable
): AsyncGenerator<[string, string]>
```
- [x] Run a number of filters, taking `string` to `string`, on the description fields (one after another):
```typescript
async function* filterDescriptions(
db: AsyncIterable<[string, string]>,
filters: ((description: string) => string) | Array<(description: string) => string>
): AsyncGenerator<[string, string]>
```
- [x] Run a filter that maps each `[word, description]` pair to an arbitrary output type `T` (e.g., mapping from `[string, string]` to `[string, Map<string, number>]` with `nGramCountingVector`):
```typescript
async function* filter<T>(
db: AsyncIterable<[string, string]>,
filter: (word: string, description: string) => T
): AsyncGenerator<T>
```
- [ ] [`dictionary-lookup`: **Arne**] Write a command-line tool to look up a word (just the word, no n-gram stuff) in a "processed dict file".
This script should be in `src/scripts/` and called `dictionary-lookup.ts`.
- [ ] [`dictionary-lookup-output-ngram`: **Joran**] Extension of the previous item: calculate some n-gram output and print it to the screen when given the command-line option `--output-n-gram` (possibly with some parameters, such as which n-gram(s) or which filters...).
- [x] [`find-similar`: **Joran**] Find the top-$k$ best matches for an n-gram counting vector in a "database" of n-gram counting vectors, ranked by cosine similarity.
- [x] Calculate the cosine similarity between two sparse n-gram counting vectors:
```typescript
function cosineDistance(
a: Map<string, number>,
b: Map<string, number>
): number
```
- [x] Scan a "database" to find the best $k$ approximate matches based on the cosine distance:
```typescript
async function findSimilar(
search: Map<string, number>,
db: AsyncIterable<[string, Map<string, number>]>,
k: number
): Promise<[[string, Map<string, number>], number][]>
```
Note: instead of `AsyncIterable<[string, Map<string, number>]>` we could first write `Array<[string, Map<string, number>]>` as well as `Map<string, Map<string, number>>` (but the latter actually disallows a word to have multiple descriptions). We in fact just want to be able to iterate over the "database" by doing `for await (const [word, nGramVec] of db)`... Note that the output contains at most $k$ items.
- [x] [`normalize-unicode`: **Arne**] Normalize Unicode text (see e.g. [Stack Overflow](https://stackoverflow.com/questions/286921/efficiently-replace-all-accented-characters-in-a-string/23767389#23767389))
```typescript
function normalizeUnicode(text: string): string
```
- [x] [`calculate-word-n-grams`: **Daan**] Rework `nGramRepresentation` and `nGramCountingVector` such that we have `s: string | Array<string>` (to be able to use the same code for n-grams on words).
- [x] [`filters-streamify`: **Daan**] Rework the filters such that instead of the `string` argument we can pass in `Iterable` or `AsyncIterable`. We then return a `Generator` on which we can do `for await (const [...] of gen) { ... }`; this should be compatible with streams.
_Unit tests_: [Unit tests addendum](/90VwugITTeO4svnZ-p2LXg) (Only go there after you are ready to code...)
:::
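To make the n-gram and similarity tasks above concrete, here is a minimal sketch of how the pieces could fit together. The signatures of `nGramRepresentation`, `nGramCountingVector` and `findSimilar` follow the task list; the function bodies are assumptions, not the project's actual implementation. The helper is called `cosineSimilarity` here (a hypothetical name; the task list calls it `cosineDistance`) because it returns a value where higher means more similar. Returning an empty array for `n < 1` and a similarity of 0 for empty vectors are also assumptions.

```typescript
// Sketch only: bodies are illustrative assumptions, signatures follow the task list.

// Split a string into overlapping character n-grams, e.g. "abc" with n = 2 gives ["ab", "bc"].
function nGramRepresentation(s: string, n: number): Array<string> {
  const grams: string[] = [];
  if (n < 1) return grams; // assumption: invalid n yields no n-grams
  for (let i = 0; i + n <= s.length; i++) {
    grams.push(s.slice(i, i + n));
  }
  return grams;
}

// Count how often each n-gram occurs; absent n-grams are simply not stored (sparse vector).
function nGramCountingVector(s: string, n: number): Map<string, number> {
  const counts = new Map<string, number>();
  for (const gram of nGramRepresentation(s, n)) {
    counts.set(gram, (counts.get(gram) ?? 0) + 1);
  }
  return counts;
}

// Cosine similarity of two sparse vectors: dot(a, b) / (|a| * |b|).
// Hypothetical name; the task list uses `cosineDistance` for this signature.
function cosineSimilarity(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((count, gram) => {
    normA += count * count;
    dot += count * (b.get(gram) ?? 0);
  });
  b.forEach((count) => {
    normB += count * count;
  });
  if (normA === 0 || normB === 0) return 0; // assumption: empty vectors match nothing
  return dot / Math.sqrt(normA * normB);
}

// Linear scan over the "database": score every entry, sort by descending similarity, keep at most k.
async function findSimilar(
  search: Map<string, number>,
  db: AsyncIterable<[string, Map<string, number>]>,
  k: number
): Promise<[[string, Map<string, number>], number][]> {
  const scored: [[string, Map<string, number>], number][] = [];
  for await (const entry of db) {
    scored.push([entry, cosineSimilarity(search, entry[1])]);
  }
  scored.sort((x, y) => y[1] - x[1]);
  return scored.slice(0, Math.max(0, k));
}
```

This version keeps every scored entry in memory before sorting, which is fine for the baby dictionary; for the full dictionary a bounded top-$k$ structure would likely be a better fit.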
BRANCH NAME: feature/n-gram/taakspecifieren
regexr.com
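
Most of the string filters from the task list can be expressed as small regular-expression replacements. The following is a hedged sketch, not the project's actual code: the function names come from the task list, but the bodies, the ASCII-only punctuation class, the tiny `COMMON_WORDS` list and the `applyFilters` helper are all assumptions made for illustration.

```typescript
// Sketch only: filter names follow the task list, bodies are illustrative assumptions.

function lowerCase(text: string): string {
  return text.toLowerCase();
}

// Collapse every run of whitespace into a single space and trim both ends.
function normalizeSpaces(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

// Assumption: only ASCII punctuation and symbols are removed (the ranges !-/ :-@ [-` {-~).
function removePunctuation(text: string): string {
  return text.replace(/[!-\/:-@\[-`{-~]/g, "");
}

// Decompose accented characters (NFD) and drop the combining diacritical marks.
function normalizeUnicode(text: string): string {
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

// Hypothetical stop-word list; a real list would be longer and language specific.
const COMMON_WORDS = new Set(["the", "a", "an", "of", "and", "to"]);

// Drop common words; surviving words are re-joined with single spaces.
function removeCommonWords(text: string): string {
  return text
    .split(/\s+/)
    .filter((word) => word !== "" && !COMMON_WORDS.has(word.toLowerCase()))
    .join(" ");
}

// Hypothetical helper: apply the filters one after another, left to right.
function applyFilters(text: string, filters: Array<(t: string) => string>): string {
  return filters.reduce((acc, f) => f(acc), text);
}

const cleaned = applyFilters("  Caf\u00e9,   the CAT! ", [
  normalizeUnicode,
  lowerCase,
  removePunctuation,
  removeCommonWords,
  normalizeSpaces,
]);
console.log(cleaned); // prints "cafe cat"
```

The order of the filters matters: removing punctuation before removing common words ensures that a token such as `the,` is still recognised as a common word.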