# OpenRefine history improvements
This document presents proposed improvements to the way we represent sequences of operations in OpenRefine. We first motivate the changes by describing the concrete user workflows we want to improve. We then sketch a new architecture designed with all these use cases in mind. Finally, we propose visual representations for operations, which could be introduced in the tool.
## Use cases
This section describes issues with the current architecture, and what the ideal behaviour of the tool should be. We only sketch out what the user should experience, not how it should concretely be implemented in the tool.
### Applying a JSON workflow fails silently when column names mismatch
This is [issue #5540](https://github.com/OpenRefine/OpenRefine/issues/5540).
#### Problem
Say you have a list of 4 operations forming a workflow that you want to reuse:
* capitalize the `country_code` column
* reconcile the `city` column using `country_code` as a reconciliation property
* add columns from reconciled values, fetching the `mayor` and the `population` columns
* remove all rows where `population` is less than 1000 inhabitants
Once you have done this workflow on a project, you can extract it as a JSON blob in the Undo/Redo tab, and reapply it on a new project.
The workflow JSON looks like this:
```json
[
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "country_code",
    "expression": "value.toUppercase()",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column country_code using expression value.toUppercase()"
  },
  {
    "op": "core/recon",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "city",
    "config": {
      "mode": "standard-service",
      "service": "https://tools.wmflabs.org/openrefine-wikidata/en/api",
      "identifierSpace": "http://www.wikidata.org/entity/",
      "schemaSpace": "http://www.wikidata.org/prop/direct/",
      "type": {
        "id": "Q486972",
        "name": "human settlement"
      },
      "autoMatch": true,
      "columnDetails": [
        {
          "column": "country_code",
          "propertyName": "SPARQL: P17/P297",
          "propertyID": "P17/P297"
        }
      ],
      "limit": 0
    },
    "description": "Reconcile cells in column city to type Q486972"
  },
  {
    "op": "core/extend-reconciled-data",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "baseColumnName": "city",
    "endpoint": "https://tools.wmflabs.org/openrefine-wikidata/en/api",
    "identifierSpace": "http://www.wikidata.org/entity/",
    "schemaSpace": "http://www.wikidata.org/prop/direct/",
    "extension": {
      "properties": [
        {
          "id": "P1082",
          "name": "population"
        },
        {
          "id": "P6",
          "name": "head of government"
        }
      ]
    },
    "columnInsertIndex": 1,
    "description": "Extend data at index 1 based on column city"
  },
  {
    "op": "core/row-removal",
    "engineConfig": {
      "facets": [
        {
          "type": "range",
          "name": "population",
          "expression": "value",
          "columnName": "population",
          "from": 0,
          "to": 1000,
          "selectNumeric": true,
          "selectNonNumeric": true,
          "selectBlank": false,
          "selectError": true
        }
      ],
      "mode": "row-based"
    },
    "description": "Remove rows"
  }
]
```
If the new project does not use the same column names, or if some columns are missing, the mismatch is not reported to the user in a clear way.
Instead, the tool tries to apply all 4 operations in sequence, without stopping when columns are missing.
For instance, say we try to apply the workflow to a project with a column `country_code`, but where the `city` column has been renamed to `administrative_area`.
This is what currently happens:
* the first operation, capitalizing the column `country_code`, will run successfully;
* the reconciliation operation is skipped as there is no such column to reconcile;
* the fetching operation is skipped, since `city` still does not exist;
* the `population` column does not exist, so the facet used to select the rows to delete is in an error state; all rows are then selected by default, and every row in the new project is removed.
This is obviously extremely bad in terms of user experience. No data is actually lost since we can still roll back to a previous stage, but it is far from ideal.
Very elegantly, OpenRefine dumps some stack traces in the console as well:
```
java.lang.Exception: No column named city
    at com.google.refine.operations.recon.ReconOperation$ReconProcess.populateEntries(ReconOperation.java:205)
    at com.google.refine.operations.recon.ReconOperation$ReconProcess.run(ReconOperation.java:239)
    at java.lang.Thread.run(Thread.java:748)
Exception in thread "Thread-9" java.lang.NullPointerException
    at com.google.refine.operations.recon.ReconOperation$ReconProcess.run(ReconOperation.java:247)
    at java.lang.Thread.run(Thread.java:748)
java.lang.Exception: No column named city
    at com.google.refine.operations.recon.ExtendDataOperation$ExtendDataProcess.populateRowsWithMatches(ExtendDataOperation.java:161)
    at com.google.refine.operations.recon.ExtendDataOperation$ExtendDataProcess.run(ExtendDataOperation.java:244)
    at java.lang.Thread.run(Thread.java:748)
Exception in thread "Thread-10" java.lang.NullPointerException
    at com.google.refine.model.changes.DataExtensionChange.apply(DataExtensionChange.java:153)
    at com.google.refine.history.HistoryEntry.apply(HistoryEntry.java:148)
    at com.google.refine.history.History.addEntry(History.java:135)
    at com.google.refine.operations.recon.ExtendDataOperation$ExtendDataProcess.run(ExtendDataOperation.java:296)
    at java.lang.Thread.run(Thread.java:748)
```
#### Proposed solution
Given a workflow, the tool needs to be able to determine whether it applies to the current project, by inferring from the workflow what the initial shape of the data should look like.
As a user, I expect that:
* the tool detects that this workflow depends on columns `city` and `country_code`
* I get the opportunity to map which columns in my new project correspond to these old columns (without actually renaming my columns to the old names)
* the tool translates the workflow to work on these new column names (`administrative_area` and `country_code`) automatically and applies the workflow.
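Applied to the workflow JSON above, the translation step could look as follows. This is only a sketch: it rewrites the column-bearing fields that appear in the example (`columnName`, `baseColumnName`, facet `columnName` entries and reconciliation `columnDetails`); a real implementation would need each operation type to declare which of its fields reference columns.

```python
import copy

def remap_columns(workflow, mapping):
    """Return a copy of a workflow (a list of operation dicts, as in
    the JSON above) with column references rewritten according to
    `mapping` (old name -> new name).

    Sketch only: it covers the column-bearing fields of the example
    workflow, not every OpenRefine operation type."""
    result = copy.deepcopy(workflow)
    for op in result:
        # Direct column references on the operation itself.
        for key in ("columnName", "baseColumnName"):
            if op.get(key) in mapping:
                op[key] = mapping[op[key]]
        # Columns referenced by facets in the engine configuration.
        for facet in op.get("engineConfig", {}).get("facets", []):
            if facet.get("columnName") in mapping:
                facet["columnName"] = mapping[facet["columnName"]]
        # Columns used as reconciliation properties.
        for detail in op.get("config", {}).get("columnDetails", []):
            if detail.get("column") in mapping:
                detail["column"] = mapping[detail["column"]]
    return result
```

With the renamed project from the example, `remap_columns(workflow, {"city": "administrative_area"})` would rewrite the reconciliation and data-extension steps while leaving `country_code` untouched.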
### Undoing an operation requires discarding all operations done after it
Mentioned in [#183](https://github.com/OpenRefine/OpenRefine/issues/183) and a [mailing list post](https://groups.google.com/forum/#!msg/openrefine/wsT0xj8KCZI/_HvP6zyeAAAJ;context-place=forum/openrefine).
#### Problem
Consider the following workflow:
* capitalize the `country_code` column
* capitalize the `locale` column
* reconcile column `city` using `country_code` as a reconciliation property
* add columns from reconciled values, fetching the `population` and the `mayor` columns
* remove all rows where `population` is less than 1000 inhabitants
After carrying out these operations, you realize that you should not have capitalized the `locale` column, since locales normally contain both lowercase and uppercase letters (such as `en_GB`). So you would like to undo this operation.
At the moment, if you want to use the undo/redo feature of OpenRefine, you will also need to undo all operations done after it, even if these operations did not actually rely on the `locale` column in any way. In particular, you will be forced to discard the reconciliation and data extension steps, which are long-running operations costing resources (time, bandwidth…).
#### Proposed solution
The exact user experience for undoing such operations is still to be decided. One way is to let users swap two consecutive operations which operate on different columns:
* In the history tab, dragging the "capitalize the `locale` column" down swaps it with the reconciliation operation;
* I can keep dragging this operation down until it reaches the end of the history
* Then I can simply undo this last operation.
The ability to swap operations depends on their nature and whether they work on distinct columns: see the next section for that.
One could also consider adding a way to speed this process up:
* right click on the operation I want to undo, a menu appears, with "undo" as one of the possible actions;
* I click this "Undo" action, and if this operation can be swapped with all following operations, then the tool does the workflow above for me.
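The "bubble down, then undo" shortcut described above could be automated along these lines, assuming a hypothetical `can_swap` predicate that tells whether two consecutive operations may be exchanged without changing the result:

```python
def undo_buried(history, index, can_swap):
    """Undo history[index] by bubbling it to the end of the history
    through successive swaps of consecutive operations, then dropping
    it. `can_swap(a, b)` is a hypothetical predicate telling whether
    two consecutive operations may be exchanged without changing the
    result. Returns the rewritten history, or None if some later
    operation depends on the one being undone."""
    ops = list(history)
    for i in range(index, len(ops) - 1):
        if not can_swap(ops[i], ops[i + 1]):
            return None  # a later operation depends on this one
        ops[i], ops[i + 1] = ops[i + 1], ops[i]
    return ops[:-1]  # the operation is now last: undoing it = dropping it
```

If any swap along the way is impossible, the whole undo is refused, which matches the behaviour sketched above.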
### Facets are not updated when columns are renamed
Issue [#133](https://github.com/OpenRefine/OpenRefine/issues/133).
#### Problem
If I create a facet on column `foo` and then rename the column to `bar`, the facet will show an error as it is referring to a column that no longer exists.
#### Proposed solution
When columns are renamed, facets should be updated to refer to the new column names.
When columns are deleted, facets referring to them should be deleted as well.
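Assuming facets carry a `columnName` field as in the workflow JSON above, both rules are straightforward to sketch:

```python
def update_facets_on_rename(facets, old, new):
    """Rewrite facet column references after a column rename."""
    for facet in facets:
        if facet.get("columnName") == old:
            facet["columnName"] = new
    return facets

def drop_facets_on_delete(facets, deleted):
    """Drop the facets that refer to a deleted column."""
    return [f for f in facets if f.get("columnName") != deleted]
```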
### Facets are recomputed even when the columns they depend on have not been touched
#### Problem
Say I have a facet applied on column `foo` and I do mass edits on column `bar`. Every time I make an edit, the grid is refreshed, but so are the facets, even though the facet data cannot possibly change, since the changes I am making are not covered by the facet.
This is not a big problem if the dataset is small, as computing facets is not expensive, but for larger datasets it can be frustrating.
#### Proposed solution
Only refresh facets if one of them depends on a column that has been changed by the operation just applied.
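A minimal sketch of that check, where `None` stands for an operation whose changed columns could not be determined (in which case we must refresh anyway):

```python
def facets_need_refresh(facet_columns, changed_columns):
    """Return True when the facets must be recomputed after an
    operation. `facet_columns` are the columns the open facets depend
    on; `changed_columns` are the columns the operation touched, or
    None when they could not be determined (an opaque change), in
    which case we conservatively refresh."""
    if changed_columns is None:
        return True
    return not set(facet_columns).isdisjoint(changed_columns)
```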
### It is hard to visualize what a workflow does
#### Problem
Consider the example workflow JSON given earlier:
* it is hard to understand concisely what the workflow does and which columns it relies on;
* it is hard to edit the workflow to adapt it to other situations;
* even the representation in the UI (as a list of descriptions) is not very easy to work with as the history grows and becomes more complex.
#### Proposed solution
We could introduce a graph-based representation of workflows laying out the column dependencies of each operation (see proposal below).
### Two independent long-running operations should be able to run in parallel
#### Problem
Many workflows rely on reconciling multiple columns, fetching URLs or reconciled data from other columns. These operations take time.
We currently have a queue of pending operations: if you start a reconciliation on column `foo`, and then start another reconciliation on column `bar`, the latter will be put in the queue until the first one finishes.
#### Proposed solution
If they are independent from each other, two operations should be able to run concurrently. The first one to finish would make it into the history first, and then be followed by the second. The actual order would not matter to the user since they would be able to reorder them in the history by dragging one towards the other.
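As an illustration (not the actual OpenRefine process queue), operations already known to be independent could be submitted to a thread pool instead of a serial queue, with each operation appended to the history as it completes:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_concurrently(operations):
    """Run operations that are known to be independent in parallel,
    appending each to the history in completion order. The order does
    not matter to the user, since independent operations can later be
    reordered in the history."""
    history = []
    with ThreadPoolExecutor() as pool:
        futures = {pool.submit(op["run"]): op["name"] for op in operations}
        for done in as_completed(futures):
            done.result()  # propagate any failure
            history.append(futures[done])
    return history

# Two stand-ins for long-running reconciliations of distinct columns.
operations = [
    {"name": "reconcile foo", "run": lambda: time.sleep(0.05)},
    {"name": "reconcile bar", "run": lambda: time.sleep(0.01)},
]
```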
## Architecture
In this section we outline the basic changes we want to make to operations and metadata for changes to enable the workflows proposed above.
### Extraction of column dependencies from expressions
All the solutions proposed above require two abilities:
* to analyze which columns a given expression depends on;
* to update an expression after a column rename.
For instance:
* when run on column `foo`, the expression `value + "/" + cells["bar"].value` depends on columns `foo` and `bar`. If `bar` is renamed to `my_column`, the expression becomes `value + "/" + cells["my_column"].value`;
* the expression `"hello"` does not depend on any column;
* the expression `filter(row.columnNames, cn, isNonBlank(cells[cn].value))` depends on all columns currently present.
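As a toy illustration of these three cases (this is not the real GREL analyzer, which would work on the parsed expression rather than on regexes), one could imagine:

```python
import re

# Matches cells[...] accesses with a string-literal column name.
CELLS_LITERAL = re.compile(r'cells\[\s*"([^"]+)"\s*\]')

def column_dependencies(expression, base_column):
    """Return the set of columns a GREL-like expression reads, or
    None when it may depend on any column. Toy version for
    illustration only."""
    if "row.columnNames" in expression:
        return None  # the expression iterates over all columns
    literals = CELLS_LITERAL.findall(expression)
    if expression.count("cells[") != len(literals):
        return None  # a non-literal cells[...] index: give up
    deps = set(literals)
    if re.search(r"\bvalue\b", expression):
        deps.add(base_column)  # `value` reads the base column
    return deps
```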
This syntactic analysis of expressions can never be complete: there are cases where we cannot easily determine the set of columns an expression depends on. In such cases, we can give up and assume that the expression depends on all columns. Any facet using such an expression will not be updated as columns get renamed, any operation using this expression will not be reorderable with other operations, and so on.
Concretely this means that [the `Evaluable` interface is extended with new methods](https://github.com/OpenRefine/OpenRefine/blob/d8a1680b4feb164ff03b2c94cf187555ebc81efc/or-model/src/main/java/org/openrefine/expr/Evaluable.java#L61):
* to compute the column dependencies (done);
* to compute a new Evaluable after renames (to do).
This is initially implemented for GREL only (see the corresponding [tests](https://github.com/OpenRefine/OpenRefine/blob/d8a1680b4feb164ff03b2c94cf187555ebc81efc/or-grel/src/test/java/org/openrefine/grel/GrelTests.java#L205)). Support for other expression languages would be nice to have but should not be necessary to enable dependency extraction in most places (facets and operations use GREL predominantly).
### Operation types
We break down OpenRefine operations into five categories:
* the operations which **transform a single column**, possibly using information drawn from other columns;
* the operations which **add one or more new columns**, possibly using other columns as dependencies;
* those which **reorder, delete and/or rename columns**;
* those which **remove, reorder or duplicate rows**;
* those which are **opaque**: these cannot be reordered with any other operation, for instance because we fail to isolate the columns they depend on, or they do not work in a row-wise fashion.
This classification actually applies to the changes generated by the operations (not the operations themselves). In OpenRefine, a Change is a step in the history, obtained by configuring and applying an operation on the project.
We categorize the operations here, as they are what users are acquainted with - they all induce a particular change.
The Change generated by any operation run in records mode will be considered opaque, because it introduces latent dependencies on other columns and is not row-wise.
Depending on their type, changes expose metadata indicating which columns they depend on, which columns they transform/add, reorder, rename, and so on.
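One possible shape for this metadata, with hypothetical field names (this is not OpenRefine's actual API):

```python
from typing import Optional
from dataclasses import dataclass, field

@dataclass
class ChangeMetadata:
    """Hypothetical metadata a change could expose. A column set of
    None means "unknown / all columns"; the field names here are
    illustrative only."""
    dependencies: Optional[set] = field(default_factory=set)  # columns read
    modified: Optional[set] = field(default_factory=set)      # columns written or added
    renames: dict = field(default_factory=dict)               # old name -> new name
    row_wise: bool = True                                     # does it preserve rows?

# An opaque change: nothing can be assumed about it.
OPAQUE = ChangeMetadata(dependencies=None, modified=None, row_wise=False)
```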
#### Transformations
* TextTransformOperation
* MassEditOperation
* RowFlagOperation
* RowStarOperation
* ReconMatchBestCandidatesOperation
* ReconMatchSpecificTopicOperation
* ReconJudgeSimilarCellsOperation
* ReconOperation
* ReconClearSimilarCellsOperation
* ReconMarkNewTopicsOperation
* ReconUseValuesAsIdentifiersOperation
* ReconCopyAcrossColumnsOperation
* ReconDiscardJudgmentsOperation
#### Column additions
* ColumnAdditionByFetchingURLsOperation
* ColumnSplitOperation
* ColumnAdditionOperation
* ExtendDataOperation
#### Reorder, delete and/or rename
* ColumnMoveOperation
* ColumnRenameOperation
* ColumnRemovalOperation
* ColumnReorderOperation
#### Row removal or reordering
* RowRemovalOperation
* RowReorderOperation
#### Opaque
* TransposeRowsIntoColumnsOperation
* KeyValueColumnizeOperation
* FillDownOperation
* MultiValuedCellJoinOperation
* MultiValuedCellSplitOperation
* BlankDownOperation
* TransposeColumnsIntoRowsOperation
* DenormalizeOperation
* UnknownOperation
### Visualization
We propose to visualize the dependencies of each change using a graph-based representation. This is similar to many other ETL tools which represent the information flow as a directed acyclic graph of dependencies between transformation steps:
* [Dataiku](https://www.dataiku.com/product/)
* [Trifacta](https://www.trifacta.com/) with its [flow view](https://www.trifacta.com/blog/new-features-wrangler-flow-view-natural-language-column-menus/)
* [LinkedPipes ETL](https://etl.linkedpipes.com/), where graphs of transformations are represented as RDF graphs
* [YesWorkflow](https://github.com/yesworkflow-org/yw-prototypes). The [OR2YW](https://github.com/LanLi2017/OR2YWTool) tool maps OpenRefine JSON histories to YesWorkflow
Many other tools use similar representations beyond the ETL field.
Our proposed representation differs from the ones used in these tools because it relies intrinsically on the tabular nature of the transformed data:
* **edges** are **columns**;
* **nodes** are **changes**.
Between each pair of consecutive changes, we represent the columns present at this stage by wires crossing the horizontal line between the changes. The changes themselves are represented by nodes, which are connected from above to the lines corresponding to the columns that they depend on, and connected from below to those corresponding to the columns they change or produce. Here is a possible representation for our first example workflow:
![graphical representation of a sample workflow](https://svgshare.com/i/JYr.svg)
Operation nodes could come with metadata (colour, style, hover text) to help understand what type of operation is being done. Edges could also be hovered to read the column name. They could be coloured by datatype, reconciliation status, or other column metadata.
We detail below the representation of each type of operation to explain the diagram above.
#### Transformations
![representation of a transformation operation](https://svgshare.com/i/JXM.svg)
A transformation operates on a single column, but can use other columns to influence the new values it writes in the column. These additional columns are drawn as splines connecting to the corresponding wires.
Any facets applied during the transformation also count as column dependencies. Column dependencies coming from facets could potentially be drawn in a different line style to emphasize the difference.
#### Column additions
These operations create one or more new columns, relying on values from other columns. Again, any facets applied during the operation also count as column dependencies.
![adding two new columns depending on two existing columns](https://svgshare.com/i/JY0.svg)
#### Reorder, delete and/or rename
These operations just rearrange the columns without changing the data they contain. We can use special representations to make this clear. Again, the specifics of these representations are not meant to be definitive - rendering style can be tweaked easily.
Reordering two adjacent columns:
![swapping two columns](https://svgshare.com/i/JX7.svg)
Deleting a column:
![deleting a column](https://svgshare.com/i/JYs.svg)
Reordering and deleting a bunch of columns:
![reordering and deleting a bunch of columns](https://svgshare.com/i/JYg.svg)
Renaming a column:
![renaming a column](https://svgshare.com/i/JX8.svg)
#### Row removal / reordering
These operations are not row-wise in the sense that they do not apply to the list of rows via a simple `map` operation, but they still preserve rows themselves, so we can avoid treating them as opaque. We just need to represent which columns are used to determine how rows should be reordered, or which rows should be dropped.
A representation along these lines could work:
![](https://svgshare.com/i/JYN.svg)
The two nodes signify that the two corresponding columns are used to sort/drop rows.
#### Opaque
Opaque operations can do arbitrary changes on the grid. They are represented as unanalyzed blocks, which cannot be swapped with any other kind of change.
![](https://svgshare.com/i/JWb.svg)
### History rewriting
History rewriting is the process of rearranging changes in the history.
The ability to reorder two independent consecutive operations is necessary to undo operations buried in the history.
This section explains what sort of independence between two operations will be captured by the model. There are many cases where two operations can genuinely be swapped but the model cannot detect it. For instance, applying the transform `value.toUppercase()` and then applying the transform `value.trim()` on the same column gives the same result as applying the operations in the reverse order, but the tool will not be able to detect that and let the user swap the two changes, because doing so requires knowledge about the interaction between two particular GREL functions.
#### Rewriting cases
Generally speaking, changes will commute when the columns modified or produced by one are not column dependencies of the other, and conversely.
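This condition could be expressed as a predicate over per-change column sets (the key names are hypothetical; `None` encodes "unknown / all columns", i.e. an opaque change):

```python
def can_commute(a, b):
    """Decide whether two consecutive changes may be swapped. `a` and
    `b` are dicts with hypothetical keys `deps` (columns read) and
    `modified` (columns written); None means "unknown / all columns",
    i.e. an opaque change, which never commutes."""
    cols = (a["deps"], a["modified"], b["deps"], b["modified"])
    if any(c is None for c in cols):
        return False  # opaque changes cannot be reordered
    return (a["modified"].isdisjoint(b["deps"])
            and b["modified"].isdisjoint(a["deps"])
            # two writes to the same column do not commute either
            and a["modified"].isdisjoint(b["modified"]))
```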
For instance, consider this sequence of two transformations:
![two commuting transformations](https://svgshare.com/i/JWc.svg)
We have added colours to the nodes to distinguish the changes from each other. The conditions are met to swap these two changes:
* The red (first) transform modifies the third column, which is not a dependency of the green (second) transform;
* Conversely, the green transform modifies the fourth column, which is not a dependency of the red transform.
The workflow is therefore equivalent to the one obtained by swapping the two operations:
![commuted transformations](https://svgshare.com/i/JX9.svg)
Now consider these two transformations instead:
![non-commuting transformations](https://svgshare.com/i/JYt.svg)
In this scenario, we cannot commute the red and green operations, since the first operation depends on the fifth column, which the green operation modifies. So running the green operation before the red one would potentially give different results.
Other rewriting cases can be worked out by hand, by considering all possible pairs of change types.
#### User experience
If and how such graphical representations of workflows should be integrated in the UI is to be decided.
Initially, history rewriting could be limited to moves which do not change the overall result (reordering two independent operations).
Of course, users could also be interested in reordering operations which do not commute, because they actually want to change the results of their workflow. This could also be added down the line, but we would need to make sure users clearly identify that they are (potentially) changing the behaviour of their workflow.
Enabling history rewrites which change the behaviour would require re-running long-running operations such as reconciliation or URL fetching on the fly.
In addition to reordering / rearranging changes, one could also give users the opportunity to change the parameters used to generate the change (expression used to derive a column, reconciliation settings, and so on).