# FNet parameters and performance
## training speed (ingestion)
A conservative estimate (on small corpora) is about 1 MB/s (0.25 MT/s, i.e. ~4 bytes per token).
- ingestion will slow down as the FNet grows larger (with `n_units`),
- higher rates should be achievable with fork-join training, i.e. training FNets on corpus shards in parallel and merging them (the "massive merger" approach; see the sketch below).
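A minimal sketch of the fork-join idea. Everything here assumes a hypothetical FNet API (`FNet(...)`, `net.train(text)`, `merge_fnets(nets)`); the names are illustrative only, not the actual interface:

```python
from multiprocessing import Pool

def train_shard(path):
    """Fork step: train an independent FNet on one shard of the corpus."""
    net = FNet(max_regions=4, edge_threshold=12)  # hypothetical constructor
    with open(path) as f:
        net.train(f.read())                       # hypothetical training call
    return net

if __name__ == "__main__":
    shards = ["corpus.part0.txt", "corpus.part1.txt", "corpus.part2.txt"]
    with Pool(processes=len(shards)) as pool:
        partial_nets = pool.map(train_shard, shards)  # fork: parallel training
    merged = merge_fnets(partial_nets)                # join: "massive merger"
```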
## RAM usage / model size
- RAM usage should be carefully monitored,
- the size of the saved FNet also matters,
- both are mostly dominated by the FNet's `n_units`.
The following FNet storage sizes were obtained in training with `max_regions=4, edge_threshold=12` (and unconstrained `max_units`/`max_edges`):
|set|n_epochs|size (MB)|
|--|--|--|
|vsmall(1MB)|3|7|
|vsmall(1MB)|6|13|
|small(10MB)|3|82|
|small(10MB)|6|130|
(size scales almost linearly with `n_epochs`, as long as the net is still learning something -- see the saturation effects below)
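As a quick check of that linear scaling, one can fit a line through the table's data points (numbers copied from the table above; purely illustrative arithmetic):

```python
import numpy as np

# (n_epochs, size_MB) pairs from the table above.
datasets = {
    "vsmall": np.array([(3, 7), (6, 13)]),
    "small": np.array([(3, 82), (6, 130)]),
}
for name, data in datasets.items():
    slope, intercept = np.polyfit(data[:, 0], data[:, 1], 1)
    print(f"{name}: ~{slope:.0f} MB per extra epoch (offset {intercept:.0f} MB)")
# vsmall: ~2 MB per extra epoch (offset 1 MB)
# small: ~16 MB per extra epoch (offset 34 MB)
```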
For comparison, running with `n_regions=12, threshold=8` leads to:

Notes:
- peak memory can be traced in Python (e.g. with `tracemalloc`; see the sketch below),
- the duration of each training epoch grows with `n_units`/`n_edges`, likely in proportion to the RAM actually used.
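A minimal sketch of peak-memory tracing with Python's standard-library `tracemalloc`; the allocation in the middle is a stand-in for the actual training step:

```python
import tracemalloc

tracemalloc.start()

# Stand-in for one FNet training epoch; replace with the real training call.
data = [list(range(1000)) for _ in range(1000)]

current, peak = tracemalloc.get_traced_memory()  # both values in bytes
print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
tracemalloc.stop()
```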
## max_regions
Generally controls the "height" of the FNet (and possibly its attention span?).
Notes:
- fewer regions -> less memory consumed, fewer units and edges,
- faster convergence of WER towards 0,
- epoch duration seems roughly proportional to memory consumption,
- ratio of edges to units:
  - for small `max_regions=3`, ~3 edges per unit,
  - for `max_regions=12`, ~1.5 edges per unit.
Figures: FNets for `max_regions=5` vs `max_regions=10` (`threshold=3`; vvsmall (200 kB) set used)

Fig. Training with `max_regions=5`. Beyond `epoch~=12` we observe a "saturation effect": WER reaches 0, and the numbers of edges and units stop growing. The exact point where growth stops depends on `max_regions` and `edge_threshold` (with unconstrained units/edges), and occurs (though not abruptly) around `epoch = max_regions * edge_threshold`.

Fig. (subtopic: saturation effect; here with `max_regions=5` and `edge_threshold=8`, so the rule above predicts saturation around epoch `5 * 8 = 40`; it is observed around epoch ~=`35`)

Fig. Training with `max_regions=10`
## max_units
The FNet parameter `max_units` and the actual number of units in the FNet (`n_units`) directly impact:
- the lowest achievable WER on a given corpus,
- `max_units` of roughly 2x the corpus's token count usually allows <15% WER (after many epochs of training), though this is not what FNets should aim for,
- `n_units` determines the ingestion speed (training rate),
- `n_units` also mostly dominates the size of the model (in RAM and on save); roughly 100k units cost about 5 MB (to check; a rough estimator is sketched below).
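A back-of-the-envelope size estimator based on the ~5 MB per 100k units figure above (that constant comes from this document's rough measurement, not from any FNet API):

```python
BYTES_PER_UNIT = 5_000_000 / 100_000  # ~50 bytes/unit, from the estimate above

def estimated_model_size_mb(n_units: int) -> float:
    """Rough FNet model size (in RAM / on save), dominated by n_units."""
    return n_units * BYTES_PER_UNIT / 1e6

print(estimated_model_size_mb(240_000))  # ~12 MB for a 240k-unit net
```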
The figures below show the typical convergence of WER for different `max_units` on a vsmall dataset (1 MB = 250 kT), with unconstrained `n_edges` and `n_regions=6`, `edge_threshold=6`:

Fig. WER (on the train set) as a function of `max_units` for `edge_threshold=6`; the extra (red) series labelled "240_T3" uses `edge_threshold=3`, where WER converges faster, but its final floor is nonetheless determined by `max_units`.
## edge threshold
With a larger threshold:
- WER -> 0% convergence is slower,
- `n_units` and `n_edges` grow more slowly (until the corpus is fully learned).
Plots: effects of the edge threshold parameter on FNet structure and WER (`n_regions=6`, vsmall set (1 MB))

*threshold=3*

*threshold=6*

*threshold=12*
## Garbage collector options
The current FNet setup/algorithms give us the following garbage collection options:
```python
from dataclasses import dataclass

@dataclass
class GCOptions:
    max_edges: int         # GC triggers when n_edges approaches this limit
    goal_num_units: float  # fraction of the FNet's max_units to prune down to
    goal_num_edges: float  # fraction of max_edges to prune down to
```
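For example, the settings from the first GC figure below could be expressed as (the `goal_num_units` value here is illustrative):

```python
opts = GCOptions(
    max_edges=100_000,   # as in the first figure below
    goal_num_edges=0.6,  # as in the first figure below
    goal_num_units=0.6,  # illustrative; see the `goal_num_units` subsection
)
```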
While the `max_edges` parameter has been discussed earlier, the effect of the `goal_num_*` parameters requires clarification.
### `goal_num_edges`
In general, GC triggers when the number of edges (created in training) approaches `max_edges`, and deletes "weak" edges as long as more than `goal_num_edges * max_edges` of them remain. Note, however, that there is a lower bound: at least one edge must exist for each unit, so with a large number of units the number of edges will grow irrespective of the `max_edges` parameter, as in the training shown below.
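A minimal sketch of this pruning rule; `net.edges`, `edge.weight`, `edge.is_last_for_some_unit`, and `net.remove_edge` are hypothetical names, not the documented API:

```python
def gc_edges(net, max_edges: int, goal_num_edges: float) -> None:
    """Prune "weak" edges until at most goal_num_edges * max_edges remain."""
    target = int(goal_num_edges * max_edges)
    # Weakest edges go first; stop once the target count is reached.
    for edge in sorted(net.edges, key=lambda e: e.weight):
        if net.n_edges <= target:
            break
        if edge.is_last_for_some_unit():  # hypothetical guard: keep >=1 edge per unit
            continue
        net.remove_edge(edge)
```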

Fig. Operation of GC - limiting the number of edges, with `max_edges=100k` and `goal_num_edges=0.6`. The `n_edges` is indeed kept below 100k until `n_units` exceeds it and pulls `n_edges` up with it. Note that WER still decreases during training even though GC prunes edges aggressively.

Fig. Analogue of the previous plot, but with `goal_num_edges=0.99`. Note how this time the GC events cut far fewer edges (the curve is flatter until epoch ~13). The overall `n_units` is also a bit larger, and WER converges to lower values in this case.
### `goal_num_units`
GC events now attempt to reduce `n_units` to `max_units * goal_num_units` (where `max_units` is a property of the FNet, not of the GC, at least in the current implementation). With `max_units=100k` this leads to the training shown below.
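The unit target can be sketched analogously (same caveat as above: all accessor names are hypothetical):

```python
def gc_units(net, goal_num_units: float) -> None:
    """Prune "weak" units until at most max_units * goal_num_units remain."""
    target = int(net.max_units * goal_num_units)  # max_units lives on the FNet
    for unit in sorted(net.units, key=lambda u: u.weight):  # hypothetical weight
        if net.n_units <= target:
            break
        net.remove_unit(unit)  # hypothetical; presumably also drops its edges
```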

Fig. FNet training limited by `n_units`, with `goal_num_units=0.6` (and unconstrained `n_edges`). In this case, changing to `goal_num_units=0.95` leads to an almost identical plot.