# FNet parameters and performance

## training speed (ingestion)

A conservative estimate (on small corpora) is about 1MB/s (0.25 MT/s).
- will slow down once the FNet is larger (`n_units`),
- higher throughput should be achievable with large fork-join training using a massive merger approach.

## RAM usage / model size

- RAM usage should be carefully considered,
- the size of the saved FNet is also meaningful,
- both are mostly dominated by the `n_units` used by the FNet.

The following FNet storage sizes have been obtained in training for `max_regions=4, edge_threshold=12` (and unconstrained `max_units`/`max_edges`):

|set|n_epochs|size (MB)|
|--|--|--|
|vsmall (1MB)|3|7|
|vsmall (1MB)|6|13|
|small (10MB)|3|82|
|small (10MB)|6|130|

(The size scales almost linearly with `n_epochs`, as long as the net still learns something -- see the saturation effects below.)

For comparison, running with `n_regions=12, threshold=8` leads to:

![image](https://hackmd.io/_uploads/HkQvORxoR.png)

- RAM usage measured by tracing "peak memory" in Python,
- the duration of each epoch of training grows with `n_units`/`n_edges`, likely proportionally to the RAM actually used.

## max_regions

Generally controls the "height" of the FNet (also the attention span?).

Notes:
- fewer regions -> less memory consumed, fewer units and edges,
- faster WER->0,
- the duration of epochs seems proportional to memory consumption,
- proportion of edges to units:
  - for small `max_regions=3`, `~3` edges per unit,
  - for `max_regions=12`, `~1.5` edges per unit.

Figures: FNets for `max_regions=5` vs `=10` (thr=3; vvsmall (200kB) set used)

![](https://hackmd.io/_uploads/rkaBxCejA.png)
Fig. Training with `max_regions=5`. Beyond `epoch~=12` we observe a "saturation effect", where WER=0 and the number of edges and units stops growing. The exact point at which growth stops depends on `max_regions` and `edge_threshold` (with unconstrained units/edges), and occurs (though not abruptly) around `epoch = max_regions * edge_threshold`.

![image](https://hackmd.io/_uploads/S1lFVb5jC.png)
Fig. (subtopic: saturation effect; here with `max_regions=5` and `edge_threshold=8` -> saturation observed around epoch~=`35`)

![](https://hackmd.io/_uploads/BkalgRloR.png)
Fig. Training with `max_regions=10`

## max_units

The FNet parameter `max_units` and the actual number of units in the FNet (`n_units`) directly impact:
- the maximum achievable WER on a given corpus,
  - `max_units` of ~2x the n_tokens of the corpus usually allows <15% WER (after many epochs of training), though this is not what FNets should aim for,
- `n_units` determines the ingestion speed (training rate),
- `n_units` also mostly dominates the size of the model (RAM, also on save); roughly 100k units cost about 5MB (~check).

The figure below shows the typical convergence of WER for different `max_units` on a vsmall dataset (1MB = 250kT) with unconstrained `n_edges` and `n_regions=6`, `edge_threshold=6`:

![image](https://hackmd.io/_uploads/Bk11N_Ns0.png)
*Fig. WER (on the train set) as a function of `max_units` for `edge_threshold=6`; the extra (red) series called "240_T3" shows `edge_threshold=3`, where WER converges faster, but its final limit is nonetheless determined by `max_units`.*
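As a rough sanity check, the numbers quoted above (~1MB/s ingestion, ~5MB of model per 100k units, saturation around `epoch = max_regions * edge_threshold`) can be turned into simple back-of-envelope estimates. The sketch below only illustrates those heuristics; the helper names are made up for this note and are not part of the FNet code:

```python
# Back-of-envelope estimates based on the rough figures quoted in this section.
# Illustrative only (hypothetical helper names, not FNet library code).

def epoch_duration_s(corpus_mb: float, ingestion_mb_per_s: float = 1.0) -> float:
    """Wall-clock time of one epoch at the conservative ~1MB/s ingestion rate
    (the real rate drops as n_units grows)."""
    return corpus_mb / ingestion_mb_per_s

def model_size_mb(n_units: int, mb_per_100k_units: float = 5.0) -> float:
    """Saved-model size, assuming roughly 5MB per 100k units (still to be verified)."""
    return n_units / 100_000 * mb_per_100k_units

def saturation_epoch(max_regions: int, edge_threshold: int) -> int:
    """Epoch around which unit/edge growth stops (WER ~= 0); the transition is
    gradual, so this is only an order-of-magnitude guide."""
    return max_regions * edge_threshold

print(epoch_duration_s(corpus_mb=10))     # ~10s per epoch on the small (10MB) set
print(model_size_mb(n_units=2_000_000))   # ~100MB
print(saturation_epoch(5, 8))             # 40 (observed ~35 in the run above)
```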
## edge threshold

With a larger threshold:
- slower convergence of WER towards 0%,
- slower growth of `n_units` and `n_edges` (until the corpus is fully learned).

Plots: effects of the edge threshold parameter on FNet structure and WER (`n_regions=6`, vsmall set (1MB))

![image](https://hackmd.io/_uploads/BJuNaCxoC.png)
*threshold=3 ↑*

![image](https://hackmd.io/_uploads/HyCzCAejR.png)
*threshold=6 ↑*

![image](https://hackmd.io/_uploads/HJT11yWoC.png)
*threshold=12 ↑*

## Garbage collector options

The current FNet setup/algorithms give us the following garbage collection options:

```python
class GCOptions:
    max_edges: int
    goal_num_units: float
    goal_num_edges: float
```

While the `max_edges` parameter has been discussed earlier, the effect of the `goal_num_*` parameters requires clarification; a rough code sketch of both rules is given at the end of this section.

### `goal_num_edges`

In general, GC triggers when the number of edges (created in training) becomes close to `max_edges`, and deletes "weak edges" as long as there are more than `goal_num_edges * max_edges` of them. Note, however, that there is a lower bound: at least one edge must exist for each node, so with a large number of nodes the number of edges will grow irrespective of the `max_edges` parameter, as in the training below:

![image](https://hackmd.io/_uploads/H1dkCyqs0.png)
Fig. Operation of the GC limiting the number of edges, with `max_edges=100k` and `goal_num_edges=0.6`. The `n_edges` is indeed kept below 100k until `n_units` exceeds it and pulls `n_edges` up with it. Note that WER is still reduced in training even though the GC operates on edges aggressively.

![image](https://hackmd.io/_uploads/rkvkXe9o0.png)
Fig. Analogue of the previous plot, but now with `goal_num_edges=0.99`. Note how this time the GC events cut far fewer edges (flatter curve until epoch ~13). The overall `n_units` is also a bit larger, and WER converges to lower values in this case.

### `goal_num_units`

GC events now attempt to reduce `n_units` to `max_units * goal_num_units` (where `max_units` is a property of the FNet, not of the GC, at least in the current implementation). With `max_units=100k` we are led to:

![image](https://hackmd.io/_uploads/rk1Fbgcs0.png)
Fig. FNet training limited by `n_units` with `goal_num_units=0.6` (and unconstrained `n_edges`). In this case, changing to `goal_num_units=0.95` leads to an almost identical plot.
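To make the two `goal_num_*` rules concrete, here is a minimal sketch of the GC logic as described in this section. Everything about the `net` object (`n_edges`, `n_units`, `remove_weakest_edge()`, `remove_weakest_unit()`) and the exact trigger conditions are assumptions made for illustration; only the pruning targets and the one-edge-per-node lower bound come from the text above:

```python
# Minimal sketch of the GC rules described above -- not the actual FNet implementation.
from dataclasses import dataclass

@dataclass
class GCOptions:  # same fields as the GCOptions shown earlier
    max_edges: int
    goal_num_units: float
    goal_num_edges: float

def run_gc(net, opts: GCOptions, max_units: int) -> None:
    # Edge GC: triggers when n_edges gets close to max_edges (here simply >=),
    # then deletes "weak" edges down to goal_num_edges * max_edges ...
    if net.n_edges >= opts.max_edges:
        target_edges = int(opts.goal_num_edges * opts.max_edges)
        # ... but at least one edge must remain per node, so with many nodes
        # n_edges keeps growing regardless of max_edges (see the plots above).
        target_edges = max(target_edges, net.n_units)
        while net.n_edges > target_edges:
            net.remove_weakest_edge()

    # Unit GC: reduces n_units towards max_units * goal_num_units
    # (max_units is a property of the FNet itself, not of the GC options).
    if net.n_units >= max_units:
        target_units = int(opts.goal_num_units * max_units)
        while net.n_units > target_units:
            net.remove_weakest_unit()
```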