Smart Data Specification

# Smart Data Specification
<br>

# Nautirust

---

## Context

Provenance of data.

![p-plan](https://www.avercruysse.be/p-plan.svg)

---

## p-plan

![p-plan](https://www.avercruysse.be/p-plan.svg)

```
ex:Square a ex:Shape;
  a p-plan:Entity.

ex:Star1 a p-plan:Activity;
  rdfs:comment "Transforms shape";
  p-plan:used ex:Square.

ex:Circle a ex:Shape;
  a p-plan:Entity;
  p-plan:wasGeneratedBy ex:Star1.
```

---

![streams](https://www.avercruysse.be/streams.svg)

---

![sds](https://www.avercruysse.be/sds.svg)

```
# sds:Stream is a subclass of p-plan:Entity
ex:SquareStream a sds:Stream;
  a p-plan:Entity.

ex:Star1 a p-plan:Activity;
  rdfs:comment "Transforms shape";
  p-plan:used ex:SquareStream.

ex:CircleStream a sds:Stream;
  a p-plan:Entity;
  p-plan:wasGeneratedBy ex:Star1.
```

---

## SDS

Metadata ontology for data on a stream
<br>

Including:
- provenance
- kind of data on stream

---

### Structure

What data is on this stream?

---

`sds:carries` property

```ttl
ex:CsvStream a sds:Stream;
  sds:carries sds:Record.

ex:CsvToMember a p-plan:Activity;
  p-plan:used ex:CsvStream.

ex:MemberStream a sds:Stream;
  p-plan:wasGeneratedBy ex:CsvToMember;
  sds:carries sds:Member;
  sds:shape <person.sh>.
```

---

Data on a stream has a link to the originating stream.

CSV row

```turtle
[] a sds:Record;
  sds:payload "42,43"^^csvw:Row;
  sds:stream ex:CsvStream.
```

TREE Member

```rdf
[] a sds:Member;
  sds:payload ex:member1;
  sds:stream ex:MemberStream.

ex:member1 a ex:Person;
  foaf:name "Arthur".
```

---

## SDS also includes

- `sds:dataset`: metadata such as the licence for the data on the stream.
- `sds:bucket`: the stream can be split up into buckets or partitions.

---

# Nautirust

### An orchestrator for workflows

---

#### Data processing

![dsp workflow](https://i.imgur.com/ULsSu3k.png)

---

#### Complex setups of processing units

`Source -> [ MapperA, MapperB ] -> Aggregator ...`

---

### Reality of setups

- Bunch of bash scripts (yuck!)
- Different commands for different runs
- Coordination between components

---

### Case study (RMLStreamer benchmark)

![rmlstreamer-benchmark](https://i.imgur.com/sQJyzoi.png)

---

### Case study (RMLStreamer benchmark)

- Nested bash scripts
- Difficult to switch out components
- Engine-dependent CLI args
- Local and global level mixed
  - Local level: application
  - Global level: workflow pipeline

---

## Nautirust to the rescue!

---

## What is Nautirust?

- An orchestrator for data processing workflows

---

## Why Nautirust?

- Language independent
- Data provenance
- Reproducible workflow
- Separation of focus

---

### Language independent

- We support **ANY** language you love!
- Mix and match languages

---

### Data provenance

![pipeline](https://www.avercruysse.be/pipeline.svg)

Create a stream of processes that transform data and metadata.

---

### Reproducible workflow

A Nautirust pipeline configuration starts the same pipeline every time.

---

### Separation of focus

- Local level
  - Application execution
- Global level
  - Workflow pipeline
  - Source -> Mapper -> Aggregator -> ...

---

## Components of Nautirust

* Runner
* Step
* Channel

---

### Runner

Executor of your step, based on the config provided by Nautirust

* Config file
* Injector required (e.g. RML config injector)
* Available channels

---

### Step

- Your application step (mapping, LDES server)
- Configure a step with parameters.
- Nautirust requests these parameters from the user.

---

### Channels

Some way to transfer data from step A to step B.

For example: Kafka, TCP, files ...

---

## Configuration Example

[Nautirust](https://github.com/ajuvercr/nautirust)

[Nautirust-config](https://github.com/ajuvercr/nautirust-configs)
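
---

### Sketch: steps and channels

The split between the local level (what a step does) and the global level (how steps are wired together) can be illustrated with a small in-memory model of the `Source -> [ MapperA, MapperB ] -> Aggregator` setup. This is only a sketch under stated assumptions: `Channel`, `Step`, `run_pipeline` and `build_demo` are hypothetical names invented for this slide and are not Nautirust's API or configuration format. The channels are plain lists standing in for Kafka, TCP or files.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Channel:
    """Hypothetical in-memory stand-in for a channel (Kafka, TCP, files ...)."""
    name: str
    items: List[str] = field(default_factory=list)


@dataclass
class Step:
    """Hypothetical step: local-level logic plus the channels it is wired to."""
    name: str
    process: Callable[[List[str]], List[str]]  # local level: application logic
    inputs: List[Channel]
    output: Channel


def run_pipeline(steps: List[Step]) -> None:
    """Global level: run each step in declaration order.

    Reading does not consume items, so several steps can share one input.
    """
    for step in steps:
        received = [item for ch in step.inputs for item in ch.items]
        step.output.items.extend(step.process(received))


def build_demo() -> Tuple[List[Step], Channel]:
    """Wire up Source -> [MapperA, MapperB] -> Aggregator."""
    raw, upper, length, result = (
        Channel("raw"), Channel("upper"), Channel("length"), Channel("result")
    )
    steps = [
        Step("Source", lambda _: ["circle", "square"], [], raw),
        Step("MapperA", lambda xs: [x.upper() for x in xs], [raw], upper),
        Step("MapperB", lambda xs: [str(len(x)) for x in xs], [raw], length),
        Step("Aggregator", lambda xs: [",".join(xs)], [upper, length], result),
    ]
    return steps, result


steps, result = build_demo()
run_pipeline(steps)
print(result.items)  # ['CIRCLE,SQUARE,6,6']
```

Swapping a mapper only changes one `Step` entry. The wiring in `build_demo` stays the same, which is the separation of focus the slides argue for.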