# OpenRefine history improvements
This document presents proposed improvements to the way we represent sequences of operations in OpenRefine. We first motivate the changes by describing the concrete user workflows we want to improve. We then sketch a new architecture designed with all these use cases in mind. Finally, we propose visual representations for operations, which could be introduced in the tool.
## Use cases
This section describes issues with the current architecture, and what the ideal behaviour of the tool should be. We only sketch out what the user should experience, not how it should concretely be implemented in the tool.
### Applying a JSON workflow fails silently when column names mismatch
This is [issue #5540](https://github.com/OpenRefine/OpenRefine/issues/5540).
#### Problem
Say you have a list of 4 operations forming a workflow that you want to reuse:
* capitalize the `country_code` column
* reconcile the `city` column using `country_code` as a reconciliation property
* add columns from reconciled values, fetching the `mayor` and the `population` columns
* remove all rows where `population` is less than 1000 inhabitants
Once you have done this workflow on a project, you can extract it as a JSON blob in the Undo/Redo tab, and reapply it on a new project.
The workflow JSON looks like this:
```json
[
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "country_code",
    "expression": "value.toUppercase()",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column country_code using expression value.toUppercase()"
  },
  {
    "op": "core/recon",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "city",
    "config": {
      "mode": "standard-service",
      "service": "https://tools.wmflabs.org/openrefine-wikidata/en/api",
      "identifierSpace": "http://www.wikidata.org/entity/",
      "schemaSpace": "http://www.wikidata.org/prop/direct/",
      "type": {
        "id": "Q486972",
        "name": "human settlement"
      },
      "autoMatch": true,
      "columnDetails": [
        {
          "column": "country_code",
          "propertyName": "SPARQL: P17/P297",
          "propertyID": "P17/P297"
        }
      ],
      "limit": 0
    },
    "description": "Reconcile cells in column city to type Q486972"
  },
  {
    "op": "core/extend-reconciled-data",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "baseColumnName": "city",
    "endpoint": "https://tools.wmflabs.org/openrefine-wikidata/en/api",
    "identifierSpace": "http://www.wikidata.org/entity/",
    "schemaSpace": "http://www.wikidata.org/prop/direct/",
    "extension": {
      "properties": [
        {
          "id": "P1082",
          "name": "population"
        },
        {
          "id": "P6",
          "name": "head of government"
        }
      ]
    },
    "columnInsertIndex": 1,
    "description": "Extend data at index 1 based on column city"
  },
  {
    "op": "core/row-removal",
    "engineConfig": {
      "facets": [
        {
          "type": "range",
          "name": "population",
          "expression": "value",
          "columnName": "population",
          "from": 0,
          "to": 1000,
          "selectNumeric": true,
          "selectNonNumeric": true,
          "selectBlank": false,
          "selectError": true
        }
      ],
      "mode": "row-based"
    },
    "description": "Remove rows"
  }
]
```
If the new project does not use the same column names, or if some columns are missing, the mismatch is not reported to the user in a clear way.
Instead, the tool tries to apply all 4 operations in sequence, without stopping when columns are missing.
For instance, say we try to apply the workflow to a project with a column `country_code`, but where the `city` column has been renamed to `administrative_area`.
This is what currently happens:
* the first operation, capitalizing the column `country_code`, will run successfully;
* the reconciliation operation is skipped as there is no such column to reconcile;
* the fetching operation is skipped, since `city` still does not exist;
* the `population` column does not exist, so the facet used to select the rows to delete is in an error state; all rows are then selected by default, and every row in the new project is removed.
This is obviously extremely bad in terms of user experience. No data is actually lost since we can still roll back to a previous stage, but it is far from ideal.
Very elegantly, OpenRefine dumps some stack traces in the console as well:
```
java.lang.Exception: No column named city
    at com.google.refine.operations.recon.ReconOperation$ReconProcess.populateEntries(ReconOperation.java:205)
    at com.google.refine.operations.recon.ReconOperation$ReconProcess.run(ReconOperation.java:239)
    at java.lang.Thread.run(Thread.java:748)
Exception in thread "Thread-9" java.lang.NullPointerException
    at com.google.refine.operations.recon.ReconOperation$ReconProcess.run(ReconOperation.java:247)
    at java.lang.Thread.run(Thread.java:748)
java.lang.Exception: No column named city
    at com.google.refine.operations.recon.ExtendDataOperation$ExtendDataProcess.populateRowsWithMatches(ExtendDataOperation.java:161)
    at com.google.refine.operations.recon.ExtendDataOperation$ExtendDataProcess.run(ExtendDataOperation.java:244)
    at java.lang.Thread.run(Thread.java:748)
Exception in thread "Thread-10" java.lang.NullPointerException
    at com.google.refine.model.changes.DataExtensionChange.apply(DataExtensionChange.java:153)
    at com.google.refine.history.HistoryEntry.apply(HistoryEntry.java:148)
    at com.google.refine.history.History.addEntry(History.java:135)
    at com.google.refine.operations.recon.ExtendDataOperation$ExtendDataProcess.run(ExtendDataOperation.java:296)
    at java.lang.Thread.run(Thread.java:748)
```
#### Proposed solution
Given a workflow, the tool needs to be able to determine whether it applies to the current project, by inferring from the workflow what the initial shape of the data should look like.
As a user, I expect that:
* the tool detects that this workflow depends on columns `city` and `country_code`
* I get the opportunity to map which columns in my new project correspond to these old columns (without actually renaming my columns to the old names)
* the tool translates the workflow to work on these new column names (`administrative_area` and `country_code`) automatically and applies the workflow.
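Applied to the workflow JSON above, the translation step could look as follows. This is only a sketch: it rewrites the column-bearing fields that appear in the example (`columnName`, `baseColumnName`, facet `columnName` entries and reconciliation `columnDetails`); a real implementation would need each operation type to declare which of its fields reference columns.

```python
import copy

def remap_columns(workflow, mapping):
    """Return a copy of a workflow (a list of operation dicts, as in
    the JSON above) with column references rewritten according to
    `mapping` (old name -> new name).

    Sketch only: it covers the column-bearing fields of the example
    workflow, not every OpenRefine operation type."""
    result = copy.deepcopy(workflow)
    for op in result:
        # Direct column references on the operation itself.
        for key in ("columnName", "baseColumnName"):
            if op.get(key) in mapping:
                op[key] = mapping[op[key]]
        # Columns referenced by facets in the engine configuration.
        for facet in op.get("engineConfig", {}).get("facets", []):
            if facet.get("columnName") in mapping:
                facet["columnName"] = mapping[facet["columnName"]]
        # Columns used as reconciliation properties.
        for detail in op.get("config", {}).get("columnDetails", []):
            if detail.get("column") in mapping:
                detail["column"] = mapping[detail["column"]]
    return result
```

With the renamed project from the example, `remap_columns(workflow, {"city": "administrative_area"})` would rewrite the reconciliation and data-extension steps while leaving `country_code` untouched.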
### Undoing an operation requires discarding all operations done after it
Mentioned in [#183](https://github.com/OpenRefine/OpenRefine/issues/183) and a [mailing list post](https://groups.google.com/forum/#!msg/openrefine/wsT0xj8KCZI/_HvP6zyeAAAJ;context-place=forum/openrefine).
#### Problem
Consider the following workflow:
* capitalize the `country_code` column
* capitalize the `locale` column
* reconcile column `city` using `country_code` as a reconciliation property
* add columns from reconciled values, fetching the `population` and the `mayor` columns
* remove all rows where `population` is less than 1000 inhabitants
After carrying out these operations, you realize that you should not have capitalized the `locale` column, since locales normally contain both lowercase and uppercase letters (such as `en_GB`). So you would like to undo this operation.
At the moment, if you want to use the undo/redo feature of OpenRefine, you will also need to undo all operations done after it, even if these operations did not actually rely on the `locale` column in any way. In particular, you will be forced to discard the reconciliation and data extension steps, which are long-running operations costing resources (time, bandwidth…).
#### Proposed solution
The exact user experience for undoing such operations is still to be decided. One way is to let users swap two consecutive operations which operate on different columns:
* In the history tab, dragging the "capitalize the `locale` column" down swaps it with the reconciliation operation;
* I can keep dragging this operation down until it reaches the end of the history
* Then I can simply undo this last operation.
The ability to swap operations depends on their nature and whether they work on distinct columns: see the next section for that.
One could also consider adding a way to speed this process up:
* right click on the operation I want to undo, a menu appears, with "undo" as one of the possible actions;
* I click this "Undo" action, and if this operation can be swapped with all following operations, then the tool does the workflow above for me.
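The "bubble down, then undo" shortcut described above could be automated along these lines, assuming a hypothetical `can_swap` predicate that tells whether two consecutive operations may be exchanged without changing the result:

```python
def undo_buried(history, index, can_swap):
    """Undo history[index] by bubbling it to the end of the history
    through successive swaps of consecutive operations, then dropping
    it. `can_swap(a, b)` is a hypothetical predicate telling whether
    two consecutive operations may be exchanged without changing the
    result. Returns the rewritten history, or None if some later
    operation depends on the one being undone."""
    ops = list(history)
    for i in range(index, len(ops) - 1):
        if not can_swap(ops[i], ops[i + 1]):
            return None  # a later operation depends on this one
        ops[i], ops[i + 1] = ops[i + 1], ops[i]
    return ops[:-1]  # the operation is now last: undoing it = dropping it
```

If any swap along the way is impossible, the whole undo is refused, which matches the behaviour sketched above.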
### Facets are not updated when columns are renamed
Issue [#133](https://github.com/OpenRefine/OpenRefine/issues/133).
#### Problem
If I create a facet on column `foo` and then rename the column to `bar`, the facet will show an error as it is referring to a column that no longer exists.
#### Proposed solution
When columns are renamed, facets should be updated to refer to the new column names.
When columns are deleted, facets referring to them should be deleted as well.
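Assuming facets carry a `columnName` field as in the workflow JSON above, both rules are straightforward to sketch:

```python
def update_facets_on_rename(facets, old, new):
    """Rewrite facet column references after a column rename."""
    for facet in facets:
        if facet.get("columnName") == old:
            facet["columnName"] = new
    return facets

def drop_facets_on_delete(facets, deleted):
    """Drop the facets that refer to a deleted column."""
    return [f for f in facets if f.get("columnName") != deleted]
```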
### Facets are recomputed even when the columns they depend on have not been touched
#### Problem
Say I have a facet applied on column `foo` and I do mass edits on column `bar`. Every time I make an edit, the grid is refreshed, but so are the facets, even though the facet data cannot possibly change, since the changes I am making are not covered by the facet.
This is not a big problem if the dataset is small, as computing facets is not expensive, but for larger datasets it can be frustrating.
#### Proposed solution
Only refresh facets if one of them depends on a column that has been changed by the operation just applied.
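A minimal sketch of that check, where `None` stands for an operation whose changed columns could not be determined (in which case we must refresh anyway):

```python
def facets_need_refresh(facet_columns, changed_columns):
    """Return True when the facets must be recomputed after an
    operation. `facet_columns` are the columns the open facets depend
    on; `changed_columns` are the columns the operation touched, or
    None when they could not be determined (an opaque change), in
    which case we conservatively refresh."""
    if changed_columns is None:
        return True
    return not set(facet_columns).isdisjoint(changed_columns)
```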
### It is hard to visualize what a workflow does
#### Problem
Consider the example workflow JSON given earlier:
* it is hard to understand concisely what the workflow does and which columns it relies on;
* it is hard to edit the workflow to adapt it to other situations;
* even the representation in the UI (as a list of descriptions) is not very easy to work with as the history grows and becomes more complex.
#### Proposed solution
We could introduce a graph-based representation of workflows laying out the column dependencies of each operation (see proposal below).
### Two independent long-running operations should be able to run in parallel
#### Problem
Many workflows rely on reconciling multiple columns, fetching URLs or reconciled data from other columns. These operations take time.
We currently have a queue of pending operations: if you start a reconciliation on column `foo`, and then start another reconciliation on column `bar`, the latter will be put in the queue until the first one finishes.
#### Proposed solution
If they are independent from each other, two operations should be able to run concurrently. The first one to finish would make it into the history first, and then be followed by the second. The actual order would not matter to the user since they would be able to reorder them in the history by dragging one towards the other.
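As an illustration (not the actual OpenRefine process queue), operations already known to be independent could be submitted to a thread pool instead of a serial queue, with each operation appended to the history as it completes:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_concurrently(operations):
    """Run operations that are known to be independent in parallel,
    appending each to the history in completion order. The order does
    not matter to the user, since independent operations can later be
    reordered in the history."""
    history = []
    with ThreadPoolExecutor() as pool:
        futures = {pool.submit(op["run"]): op["name"] for op in operations}
        for done in as_completed(futures):
            done.result()  # propagate any failure
            history.append(futures[done])
    return history

# Two stand-ins for long-running reconciliations of distinct columns.
operations = [
    {"name": "reconcile foo", "run": lambda: time.sleep(0.05)},
    {"name": "reconcile bar", "run": lambda: time.sleep(0.01)},
]
```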
## Architecture
In this section we outline the basic changes we want to make to operations and metadata for changes to enable the workflows proposed above.
### Extraction of column dependencies from expressions
All the solutions proposed above require two abilities:
* to analyze which columns a given expression depends on;
* to update an expression after a column rename.
For instance:
* when run on column `foo`, the expression `value + "/" + cells["bar"].value` depends on columns `foo` and `bar`. If `bar` is renamed to `my_column`, the expression becomes `value + "/" + cells["my_column"].value`;
* the expression `"hello"` does not depend on any column;
* the expression `filter(row.columnNames, cn, isNonBlank(cells[cn].value))` depends on all columns currently present.
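As a toy illustration of these three cases (this is not the real GREL analyzer, which would work on the parsed expression rather than on regexes), one could imagine:

```python
import re

# Matches cells[...] accesses with a string-literal column name.
CELLS_LITERAL = re.compile(r'cells\[\s*"([^"]+)"\s*\]')

def column_dependencies(expression, base_column):
    """Return the set of columns a GREL-like expression reads, or
    None when it may depend on any column. Toy version for
    illustration only."""
    if "row.columnNames" in expression:
        return None  # the expression iterates over all columns
    literals = CELLS_LITERAL.findall(expression)
    if expression.count("cells[") != len(literals):
        return None  # a non-literal cells[...] index: give up
    deps = set(literals)
    if re.search(r"\bvalue\b", expression):
        deps.add(base_column)  # `value` reads the base column
    return deps
```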
This syntactic analysis of expressions can never be complete: there are cases where we cannot easily determine the set of columns an expression depends on. In such cases, we can give up and assume that the expression depends on all columns. Any facet using such an expression will not be updated as columns get renamed, any operation using this expression will not be reorderable with other operations, and so on.
Concretely this means that [the `Evaluable` interface is extended with new methods](https://github.com/OpenRefine/OpenRefine/blob/d8a1680b4feb164ff03b2c94cf187555ebc81efc/or-model/src/main/java/org/openrefine/expr/Evaluable.java#L61):
* to compute the column dependencies (done);
* to compute a new Evaluable after renames (to do).
This is initially implemented for GREL only (see the corresponding [tests](https://github.com/OpenRefine/OpenRefine/blob/d8a1680b4feb164ff03b2c94cf187555ebc81efc/or-grel/src/test/java/org/openrefine/grel/GrelTests.java#L205)). Support for other expression languages would be nice to have but should not be necessary to enable dependency extraction in most places (facets and operations use GREL predominantly).
### Operation types
We break down OpenRefine operations into five categories:
* the operations which **transform a single column**, possibly using information drawn from other columns;
* the operations which **add one or more new columns**, possibly using other columns as dependencies;
* those which **reorder, delete and/or rename columns**;
* those which **remove, reorder or duplicate rows**;
* those which are **opaque**: these cannot be reordered with any other operation, for instance because we fail to isolate the columns they depend on, or they do not work in a row-wise fashion.
This classification actually applies to the changes generated by the operations (not the operations themselves). In OpenRefine, a Change is a step in the history, obtained by configuring and applying an operation on the project.
We categorize the operations here, as they are what users are acquainted with - they all induce a particular change.
The Change generated by any operation run in records mode will be considered opaque, because it introduces latent dependencies on other columns and is not row-wise.
Depending on their type, changes expose metadata indicating which columns they depend on, which columns they transform/add, reorder, rename, and so on.
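One possible shape for this metadata, with hypothetical field names (this is not OpenRefine's actual API):

```python
from typing import Optional
from dataclasses import dataclass, field

@dataclass
class ChangeMetadata:
    """Hypothetical metadata a change could expose. A column set of
    None means "unknown / all columns"; the field names here are
    illustrative only."""
    dependencies: Optional[set] = field(default_factory=set)  # columns read
    modified: Optional[set] = field(default_factory=set)      # columns written or added
    renames: dict = field(default_factory=dict)               # old name -> new name
    row_wise: bool = True                                     # does it preserve rows?

# An opaque change: nothing can be assumed about it.
OPAQUE = ChangeMetadata(dependencies=None, modified=None, row_wise=False)
```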
#### Transformations
* TextTransformOperation
* MassEditOperation
* RowFlagOperation
* RowStarOperation
* ReconMatchBestCandidatesOperation
* ReconMatchSpecificTopicOperation
* ReconJudgeSimilarCellsOperation
* ReconOperation
* ReconClearSimilarCellsOperation
* ReconMarkNewTopicsOperation
* ReconUseValuesAsIdentifiersOperation
* ReconCopyAcrossColumnsOperation
* ReconDiscardJudgmentsOperation
#### Column additions
* ColumnAdditionByFetchingURLsOperation
* ColumnSplitOperation
* ColumnAdditionOperation
* ExtendDataOperation
#### Reorder, delete and/or rename
* ColumnMoveOperation
* ColumnRenameOperation
* ColumnRemovalOperation
* ColumnReorderOperation
#### Row removal or reordering
* RowRemovalOperation
* RowReorderOperation
#### Opaque
* TransposeRowsIntoColumnsOperation
* KeyValueColumnizeOperation
* FillDownOperation
* MultiValuedCellJoinOperation
* MultiValuedCellSplitOperation
* BlankDownOperation
* TransposeColumnsIntoRowsOperation
* DenormalizeOperation
* UnknownOperation
### Visualization
We propose to visualize the dependencies of each change using a graph-based representation. This is similar to many other ETL tools which represent the information flow as a directed acyclic graph of dependencies between transformation steps:
* [Dataiku](https://www.dataiku.com/product/)
* [Trifacta](https://www.trifacta.com/) with its [flow view](https://www.trifacta.com/blog/new-features-wrangler-flow-view-natural-language-column-menus/)
* [LinkedPipes ETL](https://etl.linkedpipes.com/), where graphs of transformations are represented as RDF graphs
* [YesWorkflow](https://github.com/yesworkflow-org/yw-prototypes). The [OR2YW](https://github.com/LanLi2017/OR2YWTool) tool maps OpenRefine JSON histories to YesWorkflow
Many other tools use similar representations beyond the ETL field.
Our proposed representation differs from the ones used in these tools because it relies intrinsically on the tabular nature of the transformed data:
* **edges** are **columns**;
* **nodes** are **changes**.
Between each pair of consecutive changes, we represent the columns present at this stage by wires crossing the horizontal line between the changes. The changes themselves are represented by nodes, which are connected from above to the lines corresponding to the columns that they depend on, and connected from below to those corresponding to the columns they change or produce. Here is a possible representation for our first example workflow:
![graphical representation of a sample workflow](https://svgshare.com/i/JYr.svg)
Operation nodes could come with metadata (colour, style, hover text) to help understand what type of operation is being done. Edges could also be hovered to read the column name. They could be coloured by datatype, reconciliation status, or other column metadata.
We detail below the representation of each type of operation to explain the diagram above.
#### Transformations
![representation of a transformation operation](https://svgshare.com/i/JXM.svg)
A transformation operates on a single column, but can use other columns to influence the new values it writes in the column. These additional columns are drawn as splines connecting to the corresponding wires.
Any facets applied during the transformation also count as column dependencies. Column dependencies coming from facets could potentially be drawn in a different line style to emphasize the difference.
#### Column additions
These operations create one or more new columns, relying on values from other columns. Again, any facets applied during the operation also count as column dependencies.
![adding two new columns depending on two existing columns](https://svgshare.com/i/JY0.svg)
#### Reorder, delete and/or rename
These operations just rearrange the columns without changing the data they contain. We can use special representations to make this clear. Again, the specifics of these representations are not meant to be definitive - rendering style can be tweaked easily.
Reordering two adjacent columns:
![swapping two columns](https://svgshare.com/i/JX7.svg)
Deleting a column:
![deleting a column](https://svgshare.com/i/JYs.svg)
Reordering and deleting a bunch of columns:
![reordering and deleting a bunch of columns](https://svgshare.com/i/JYg.svg)
Renaming a column:
![renaming a column](https://svgshare.com/i/JX8.svg)
#### Row removal / reordering
These operations are not row-wise in the sense that they do not apply to the list of rows via a simple `map` operation, but they still preserve rows themselves, so we can avoid treating them as opaque. We just need to represent which columns are used to determine how rows should be reordered, or which rows should be dropped.
A representation along these lines could work:
![](https://svgshare.com/i/JYN.svg)
The two nodes signify that the two corresponding columns are used to sort/drop rows.
#### Opaque
Opaque operations can do arbitrary changes on the grid. They are represented as unanalyzed blocks, which cannot be swapped with any other kind of change.
![](https://svgshare.com/i/JWb.svg)
### History rewriting
History rewriting is the process of rearranging changes in the history.
The ability to reorder two independent consecutive operations is necessary to undo operations buried in the history.
This section explains what sort of independence between two operations will be captured by the model. There are many cases where two operations can genuinely be swapped but the model cannot detect it. For instance, applying the transform `value.toUppercase()` and then applying the transform `value.trim()` on the same column gives the same result as applying the operations in the reverse order, but the tool will not be able to detect that and let the user swap the two changes, because doing so requires knowledge about the interaction between two particular GREL functions.
#### Rewriting cases
Generally speaking, changes will commute when the columns modified or produced by one are not column dependencies of the other, and conversely.
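This condition could be expressed as a predicate over per-change column sets (the key names are hypothetical; `None` encodes "unknown / all columns", i.e. an opaque change):

```python
def can_commute(a, b):
    """Decide whether two consecutive changes may be swapped. `a` and
    `b` are dicts with hypothetical keys `deps` (columns read) and
    `modified` (columns written); None means "unknown / all columns",
    i.e. an opaque change, which never commutes."""
    cols = (a["deps"], a["modified"], b["deps"], b["modified"])
    if any(c is None for c in cols):
        return False  # opaque changes cannot be reordered
    return (a["modified"].isdisjoint(b["deps"])
            and b["modified"].isdisjoint(a["deps"])
            # two writes to the same column do not commute either
            and a["modified"].isdisjoint(b["modified"]))
```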
For instance, consider this sequence of two transformations:
![two commuting transformations](https://svgshare.com/i/JWc.svg)
We have added colours to the nodes to distinguish the changes from each other. The conditions are met to swap these two changes:
* The red (first) transform modifies the third column, which is not a dependency of the green (second) transform;
* Conversely, the green transform modifies the fourth column, which is not a dependency of the red transform.
The workflow is therefore equivalent to the one obtained by swapping the two operations:
![commuted transformations](https://svgshare.com/i/JX9.svg)
Now consider these two transformations instead:
![non-commuting transformations](https://svgshare.com/i/JYt.svg)
In this scenario, we cannot commute the red and green operations, since the first operation depends on the fifth column, which the green operation modifies. So running the green operation before the red one would potentially give different results.
Other rewriting cases can be worked out by hand, by considering all possible pairs of change types.
#### User experience
If and how such graphical representations of workflows should be integrated in the UI is to be decided.
Initially, history rewriting could be limited to moves which do not change the overall result (reordering two independent operations).
Of course, users could also be interested in reordering operations which do not commute, because they actually want to change the results of their workflow. This could also be added down the line, but we would need to make sure users clearly identify that they are (potentially) changing the behaviour of their workflow.
Enabling history rewrites which change the behaviour would require re-running long-running operations such as reconciliation or URL fetching on the fly.
In addition to reordering / rearranging changes, one could also give users the opportunity to change the parameters used to generate the change (expression used to derive a column, reconciliation settings, and so on).