# Table value proposal
Author: [Devyn Cairns](https://github.com/devyn)
Tables in Nushell are currently only ever organized as lists of records. This is very flexible and fits a JSON-compatible data model, but has disadvantages when a table is known to have more regular structure:
- Column keys have to be duplicated for each record.
  - This is an inefficient use of memory.
- Column lookup involves a search of each row for the correct key, as there is no guarantee that they are in the same position or even that they exist on each row.
  - This is unnecessarily slow.
- It is not possible to know which columns a table stream contains without collecting the entire stream.
  - This is a pain point for `to csv` and similar tabular formats, and the reason the `--columns` argument exists.
My suggestion (and what I've heard from others) is that we continue to be able to treat `list<record>` as a table in most cases, but that we also add a new `table` type and a `TableStream` that address all of these problems and more.
## `Value::Table`
Fundamentally this would be structured as a list of column names and a list of row data.
There are two options for how to structure the row data:
- **Row-first**, i.e. indexed by row, and then column.
- **Column-first**, i.e. indexed by column, and then row.
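For concreteness, here is a minimal sketch of the two layouts. The struct and field names are hypothetical, and `Value` stands in for Nushell's existing value type:

```rust
/// Row-first: a list of rows, each row holding one cell per column.
struct RowFirstTable {
    columns: Vec<String>,   // column names, in order
    rows: Vec<Vec<Value>>,  // indexed as rows[row][col]
}

/// Column-first: a list of columns, each holding one cell per row.
struct ColumnFirstTable {
    columns: Vec<String>,   // column names, in order
    data: Vec<Vec<Value>>,  // indexed as data[col][row]
}
```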
The row-first approach is the most similar to what we already have, and has these advantages:
- Finding all of the values for one row only requires a single lookup.
- Pushing a row only requires one `Vec` to grow.
- Deleting a row only requires one memory move.

Most (but not all) databases work this way, and it typically yields high transactional performance.
The column-first approach is the one more commonly seen in analytical software (including Polars, at least by default) and analytical databases, and has many advantages of its own that I think would benefit Nushell:
- Values are located near each other when doing an aggregation.
  - Processors prefer sequential access to memory. Cache and TLB are both used more efficiently when more data fits in them, and processor prefetch is more effective too.
- SIMD is possible because values are always a constant offset apart.
- The number of rows usually far outweighs the number of columns, so even a full table scan performs fewer non-sequential accesses.
  - Let $N$ be the number of rows, $M$ the number of columns, and $C$ the amortized speedup of sequential accesses over random ones.
  - For row-first order, a full table scan costs $N$ potentially random accesses in the worst case, while for column-first order it costs $M$ potentially random accesses.
    - Row-first cost: $N + \frac{N \times (M - 1)}{C}$
    - Column-first cost: $M + \frac{(N - 1) \times M}{C}$
  - For example, with $N = 10000$, $M = 10$, and $C = 10$, this model gives $10000 + 9000 = 19000$ for row-first versus $10 + 9999 = 10009$ for column-first.
  - Since we expect $N > M$, it's more beneficial to put the optimization on that side.
- There are fewer, larger `Vec`s overall to manage, so memory allocation is cheaper.
- It is possible to select a subset of columns without visiting each row, whereas selecting a subset of rows only requires visiting each column (and there are usually far fewer columns than rows); see the sketch after this list.
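To illustrate that last point, here is a hypothetical helper over the `ColumnFirstTable` sketch above (not an existing Nushell API): selecting columns resolves a handful of indices and clones only the chosen columns, with no per-row key lookup at all.

```rust
/// Select a subset of columns by name, in the requested order.
/// Cost scales with the number of columns, not a per-row search.
fn select_columns(table: &ColumnFirstTable, names: &[&str]) -> ColumnFirstTable {
    // Resolve each requested name to its column index, skipping misses.
    let indices: Vec<usize> = names
        .iter()
        .filter_map(|name| table.columns.iter().position(|c| c == name))
        .collect();
    ColumnFirstTable {
        columns: indices.iter().map(|&i| table.columns[i].clone()).collect(),
        data: indices.iter().map(|&i| table.data[i].clone()).collect(),
    }
}
```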
Benchmark evidence (using the nightly `test` bencher), with an efficient implementation of a full-table per-column sum for each layout:
```rust
#![feature(test)]
extern crate test;
use std::hint::black_box;
use test::Bencher;
#[bench]
fn row_major(b: &mut Bencher) {
    // 10,000 rows of 10 columns each (row-first layout).
    let rows: Vec<Vec<i32>> = (0..10000).map(|_| (0..10).collect()).collect();
    b.iter(|| {
        // Per-column running sums; every row touches all of them.
        let mut col_sums = vec![0; 10];
        for row in &rows {
            for (sum, val) in col_sums.iter_mut().zip(row) {
                *sum += *val;
            }
        }
        black_box(col_sums)
    });
}
#[bench]
fn col_major(b: &mut Bencher) {
    // 10 columns of 10,000 values each (column-first layout).
    let cols: Vec<Vec<i32>> = (0..10).map(|_| (0..10000).collect()).collect();
    b.iter(|| {
        // Each column is summed in one sequential pass.
        black_box(
            cols.iter()
                .map(|vals| vals.iter().copied().sum::<i32>())
                .collect::<Vec<i32>>(),
        )
    });
}
```
```
running 2 tests
test col_major ... bench: 3,618.77 ns/iter (+/- 99.25)
test row_major ... bench: 33,060.19 ns/iter (+/- 1,308.65)
```
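To reproduce: `#![feature(test)]` requires a nightly toolchain, so the benchmark runs with `cargo +nightly bench`. Absolute timings will vary by machine, but the roughly 9× gap between the two layouts is the point.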
I suggest we go with column-first order. I think most of our operations are likely to benefit from it.
## `TableStream`
To enable table operations to assume the shape of a table in a stream, we need to add a stream type for tables as well.
Naturally, it wouldn't make sense for this to be in column-first order, since very few potential stream sources work that way. So I suggest that we keep row-first order here, but optimize it similarly to how `ByteStream` works: the producing function appends its data to each column directly through a list of mutable references, rather than allocating and then releasing an intermediate row-first `Vec`. If we do this, there should be no significant added cost from the mismatch between the row-first order of the stream and the column-first order of the value.
Pseudocode:
```rust
trait TableStream {
    // Constant and available from the beginning.
    fn columns(&self) -> &[String];
    fn size_hint(&self) -> (usize, Option<usize>);

    // Appends directly to each column's `Vec`, pushing the same
    // number of values to every column (one per produced row).
    // Returns `false` once the stream is exhausted.
    fn next(&mut self, out: &mut [Vec<Value>]) -> bool;
}
```
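As a usage sketch, a consumer could drain such a stream into the column-first `ColumnFirstTable` layout from earlier (again, all names here are hypothetical, not existing Nushell APIs):

```rust
/// Collect a `TableStream` into column-first storage.
fn collect_table(mut stream: impl TableStream) -> ColumnFirstTable {
    let columns: Vec<String> = stream.columns().to_vec();
    // One Vec per column, reserved using the stream's lower size bound.
    let (lower, _) = stream.size_hint();
    let mut data: Vec<Vec<Value>> = columns
        .iter()
        .map(|_| Vec::with_capacity(lower))
        .collect();
    // Each call appends directly to the column Vecs; no intermediate
    // row-first allocation is needed.
    while stream.next(&mut data) {}
    ColumnFirstTable { columns, data }
}
```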