# Lutra - a proposal for a data tool
Lutra is a tool for executing relational data queries.
It operates on a directory of PRQL source files and .parquet input data. It
extracts queries, executes them using DuckDB and writes the results to local disk.
It is highly interactive, asking for additional information about what the user
is trying to accomplish.
It is designed as a CLI-library duo, so it can be used from the command line as
well as integrated into other data tools.
It uses the directory as the source of truth for project state and implements
appropriate locking to allow many concurrent invocations on the same directory.
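To make the concurrency point concrete, here is one possible shape of that locking. This is only a sketch, assuming a coarse whole-project lock file; the `.lutra.lock` name, the retry policy and the Rust implementation language are illustrative rather than part of the proposal:
```
use std::{fs, io, path::{Path, PathBuf}, thread, time::Duration};

/// A coarse, whole-project lock: a `.lutra.lock` file created atomically.
/// An advisory OS lock (flock/LockFileEx) would be a more crash-safe variant.
struct ProjectLock {
    path: PathBuf,
}

impl ProjectLock {
    fn acquire(project_dir: &Path) -> io::Result<ProjectLock> {
        let path = project_dir.join(".lutra.lock");
        loop {
            // `create_new` fails if the file already exists, so creation is
            // atomic and only one invocation can hold the lock at a time.
            match fs::OpenOptions::new().write(true).create_new(true).open(&path) {
                Ok(_) => return Ok(ProjectLock { path }),
                Err(e) if e.kind() == io::ErrorKind::AlreadyExists => {
                    // Another invocation is running; wait and retry.
                    thread::sleep(Duration::from_millis(200));
                }
                Err(e) => return Err(e),
            }
        }
    }
}

impl Drop for ProjectLock {
    fn drop(&mut self) {
        // Best effort: release the lock when this invocation finishes.
        let _ = fs::remove_file(&self.path);
    }
}
```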
This is a typical project structure:
```
- lutra.toml            # project config file
- _project.prql         # main PRQL source file
- utils.prql            # a PRQL source with utils
- sales/                # a directory module
  - orders.prql         # a sub module
  - order_dump.parquet  # data input file
```
## Building
PRQL source files contain queries that either load data or transform that data.
Any relational query that is not used in any other query is exported into a
.parquet file.
```
# orders.prql

# loading data
let all_orders = (
    from read_parquet('./sales/order_dump.parquet')
    filter important_column == null
)

let closed = (
    from all_orders
    filter status == 'closed'
)

all_orders
```
Now we run:
```
$ lutra run
Building project... done.
Results:
- sales/orders.main.parquet
- sales/orders.closed.parquet
```
And we get the following files:
```
- ...
- sales/                    # (already existed)
  - orders.prql             # (already existed)
  - order_dump.parquet      # (already existed)
  - orders.main.parquet     # result of `sales.orders.main`
  - orders.closed.parquet   # result of `sales.orders.closed`
```
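Under the hood, each exported relation boils down to a single DuckDB statement. As a rough sketch, assuming the `duckdb` Rust crate and hand-writing the SQL that the `closed` pipeline above would approximately compile to, `lutra run` could produce `sales/orders.closed.parquet` like this:
```
use duckdb::Connection;

fn main() -> duckdb::Result<()> {
    let conn = Connection::open_in_memory()?;

    // Roughly the SQL that `sales.orders.closed` compiles to, wrapped in a
    // COPY so DuckDB writes the result straight to a parquet file.
    let sql = r#"
        COPY (
            WITH all_orders AS (
                SELECT *
                FROM read_parquet('./sales/order_dump.parquet')
                WHERE important_column IS NULL
            )
            SELECT * FROM all_orders WHERE status = 'closed'
        ) TO 'sales/orders.closed.parquet' (FORMAT PARQUET);
    "#;

    conn.execute_batch(sql)?;
    Ok(())
}
```
The real tool would get the SQL from the PRQL compiler rather than a hard-coded string; the point is only that DuckDB can read the parquet input and write the parquet output in one statement.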
## Auto-importing
When we add:
```
- ...
- sales/               # (already existed)
  - orders.prql        # (already existed)
  - ...
  - articles.parquet   # new data input
```
... and run:
```
$ lutra run
> Found unimported data in `sales/articles.parquet`. Do you want to import it?
[y] Yes, use default name `sales.articles.main`
[e] Yes, but edit the name.
[n] No, ignore it.
Creating file `sales/articles.prql` with contents:
from read_parquet('sales/articles.parquet')
Building project... done.
Results:
- sales/orders.main.parquet
- sales/orders.closed.parquet
- sales/articles.main.parquet
```
Lutra is reading and writing .parquet files all over your project. To know which
files are inputs, which are outputs and which were recently added, it needs to
keep track of all `read_parquet` calls and all output files.
This means that we need to build a DAG of all relations. Since we are traversing
the trees anyway, we might as well build the DAG and infer inputs and outputs
from it.
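A minimal sketch of that bookkeeping, assuming each PRQL file has already been parsed into a list of relations together with the `read_parquet` paths and relation references found in them (the names and types here are hypothetical):
```
use std::collections::HashSet;

/// One declared relation, as extracted from a parsed PRQL file.
struct Relation {
    name: String,            // e.g. "sales.orders.closed"
    reads: Vec<String>,      // paths passed to read_parquet
    depends_on: Vec<String>, // other relations referenced by this pipeline
}

/// Inputs are all files referenced by read_parquet; outputs are relations
/// that no other relation depends on (the ones exported to .parquet).
fn classify(relations: &[Relation]) -> (HashSet<String>, Vec<String>) {
    let mut inputs = HashSet::new();
    let mut referenced = HashSet::new();

    for rel in relations {
        inputs.extend(rel.reads.iter().cloned());
        referenced.extend(rel.depends_on.iter().cloned());
    }

    let outputs = relations
        .iter()
        .filter(|r| !referenced.contains(&r.name))
        .map(|r| r.name.clone())
        .collect();

    (inputs, outputs)
}
```
Any .parquet file found on disk that lands in neither set is exactly what the auto-importing prompt above treats as unimported data.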
## Importing other file formats
When CSV, JSON, XLSX, ODS, HTML or SQL files are detected to have been added to
the project, Lutra will try to convert them into parquet. Parquet is preferred
over the other formats because it is very close to relational data.
Each of the other formats may contain wildly varied data, which needs to be
converted and regularized into tabular data:
- CSV contains only string data. Any numbers, booleans or temporal values have to
  be parsed from strings.
- JSON can contain tabular data in several different structures: an array of row
  objects, an array of row arrays, or an array of column arrays, to name just a
  few (see the sketch after this list).
- XLSX and ODS can contain multiple sheets in one file, along with non-tabular
structures.
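To illustrate the JSON case mentioned above, regularizing the first of those shapes (an array of row objects) into named columns is a small but real piece of conversion logic. A hypothetical sketch using `serde_json`; the function name and intermediate representation are illustrative:
```
use serde_json::Value;
use std::collections::BTreeMap;

/// Regularize an "array of row objects" into named columns.
/// Keys missing from a row become `Value::Null`, so ragged rows still line up.
fn rows_to_columns(json: &Value) -> Option<BTreeMap<String, Vec<Value>>> {
    let rows = json.as_array()?;

    // First pass: collect every column name that appears in any row.
    let mut columns: BTreeMap<String, Vec<Value>> = BTreeMap::new();
    for row in rows {
        for key in row.as_object()?.keys() {
            columns.entry(key.clone()).or_default();
        }
    }

    // Second pass: fill each column, padding absent keys with null.
    for row in rows {
        let obj = row.as_object()?;
        for (name, col) in columns.iter_mut() {
            col.push(obj.get(name).cloned().unwrap_or(Value::Null));
        }
    }

    Some(columns)
}
```
The other shapes need similar but slightly different normalization before the data can be written out as parquet columns.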
The workflow would look like this:
```
$ lutra run
> Found unimported data in `sales/articles.csv`. Do you want to import it?
[y] Yes, use default name `sales.articles.main`
[e] Yes, but edit the name.
[n] No, ignore it.
> CSV files need to be converted to Parquet. Does this layout look correct?
article_id | article_name | category | out_of_stock | price
int32      | text         | text     | bool         | float64
-----------+--------------+----------+--------------+--------
42         | Pilot M2 1mm | pen      | false        |  7.50
55         | Fountain G33 | pen      | true         | 10.99
... 411 more rows (413 total) ...
[y] Yes, import as-is
[e] Edit column types
[n] No, abort.
Successfully imported.
> How do you want to clean up the original CSV file?
[d] Delete it.
[a] Move it into `archive/` directory.
[i] Keep it where it is and ignore it in future runs.
```
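The conversion behind the `[y]` answer does not need much custom code either: DuckDB's CSV reader can sniff the column types shown in the preview (and the preview itself could be driven by a `DESCRIBE` over the same query). A hypothetical sketch, again assuming the `duckdb` Rust crate and illustrative paths:
```
use duckdb::Connection;

/// Convert a CSV file to parquet, letting DuckDB sniff the column types.
fn csv_to_parquet(conn: &Connection, csv: &str, parquet: &str) -> duckdb::Result<()> {
    let sql = format!(
        "COPY (SELECT * FROM read_csv_auto('{csv}')) TO '{parquet}' (FORMAT PARQUET);"
    );
    conn.execute_batch(&sql)
}

fn main() -> duckdb::Result<()> {
    let conn = Connection::open_in_memory()?;
    csv_to_parquet(&conn, "sales/articles.csv", "sales/articles.parquet")?;
    Ok(())
}
```
If the user picks `[e]` to edit column types, the `read_csv_auto` call can be replaced with an explicit `read_csv` column specification before the copy runs.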