# Harvesting from S3 Buckets
Co-authored by: Sebastien Lavoie <sebastien.lavoie@datopian.com>
## Done today
* Addressed the expectation of getting tests to pass with just `pip install`.
* Created another bucket in AWS, one that hasn't been deleted in CKAN.
* Documented what needs to be done before running the tests.
* Parameterized `S3_BUCKET` (loaded from an env var; see the sketch after this list).
* Why? Because not everyone has access to the same bucket, thus they need to read from a different one.
* Documented working values for each of the environment variables.
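A minimal sketch of that env-var parameterization (the variable name `S3_BUCKET` is from the notes above; the default shown is just the test bucket used elsewhere in this doc):

```python
import os

# Read the bucket name from the environment instead of hardcoding it,
# so each person can point the tests at a bucket they actually have access to.
S3_BUCKET = os.environ.get("S3_BUCKET", "datopian-new-test-account-poc")
```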
---
Action Irio
Email to client:
Hi,
Here is a simple Python script that harvests from S3 buckets into CKAN. Because we wanted to get you something quickly, we have not yet had time to integrate it into the CKAN harvester structure - that is easy to do and we could help with it in the next few days.
Best,
XXX
:::warning
Irio > Rufus:
Done. There's a draft of the email at the top of the README.
:::
Great!
Action
* Write test to generate the data package from the s3 bucket
* Bonus: write failing test for syncing to ckan
* Review README.
* Rename script to `process.py`.
* Ensure all commands are up to date, necessary, and sufficient to run the script.
Look at nhs-tools; it does both of the things you want ...
---
```python
# Stub plus an (intentionally failing) test for generating a data package
# from the S3 bucket; see the boto3 sketch below for one way to implement it.
def s3_to_datapackage():
    return ''


def test_s3_to_datapackage():
    exp = {
        "name": "abc",
        "resources": [
            {
                "name": "chamber_of_deputies_presences.csv",
                "path": "https://datopian-new-test-account-poc.s3.amazonaws.com/chamber_of_deputies_presences.csv"
            },
            {
                "name": "territories.csv",
                "path": "https://datopian-new-test-account-poc.s3.amazonaws.com/territories.csv"
            }
        ]
    }
    assert s3_to_datapackage() == exp
```
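One way `s3_to_datapackage()` could be implemented - a rough sketch assuming boto3 and the virtual-hosted-style URLs used in the expected output above (pagination is ignored for brevity):

```python
import boto3


def s3_to_datapackage(bucket_name="datopian-new-test-account-poc"):
    # List every object in the bucket and turn each one into a data resource
    # whose path is the object's public S3 URL.
    s3 = boto3.client("s3")
    contents = s3.list_objects_v2(Bucket=bucket_name).get("Contents", [])
    resources = [
        {
            "name": obj["Key"],
            "path": "https://{}.s3.amazonaws.com/{}".format(bucket_name, obj["Key"]),
        }
        for obj in contents
    ]
    return {"name": "abc", "resources": resources}
```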
# Use dotenv to load local AWS credentials - see nhs-tools (sketched below)
```sh
$ pip install -r requirements.txt
$ BUCKET=datopian-new-test-account-poc ORGID=8895b53a-2c98-4be9-8f9d-898a4b5f1a8d APIKEY=8f6d4285-a783-41e9-be5a-c56eb6535225 CKAN_INSTANCE=https://demo.ckan.org/ python main.py
```
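A rough sketch of the dotenv approach noted above, assuming python-dotenv and the variable names from the run command (a `.env` file in the project root would hold the same values):

```python
import os

from dotenv import load_dotenv  # python-dotenv, as used in nhs-tools

# Pull BUCKET, ORGID, APIKEY and CKAN_INSTANCE from a local .env file
# rather than passing them inline on the command line.
load_dotenv()

BUCKET = os.environ["BUCKET"]
ORGID = os.environ["ORGID"]
APIKEY = os.environ["APIKEY"]
CKAN_INSTANCE = os.environ.get("CKAN_INSTANCE", "https://demo.ckan.org/")
```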
```python
def test_get_data_package():
    bucket = xxx
    # outdp is a dict
    outdp = get_data_package(bucket)
```
## TODO
**TOP Priority:** Get the proof of concept shipped by tomorrow morning.
* Waiting for an answer from Paul about the time of the demo.
* https://gitlab.com/datopian/tech/lead/-/issues/38#note_333731182
* [x] Understand what has been accomplished.
* Basic template in the repository.
* A few failing tests.
* Function that doesn't really work for pulling filenames from a bucket.
* Connection with an AWS account (may not be working).
* Experimenting with a Python package called `harvesters`.
* https://github.com/datopian/ckan-ng-harvester-core
* [x] Understand the blockers.
* [ ] Set a plan of work/change the existing plan of work.
* [x] All 3 - Quick cleanup on the project and push it.
* [x] All 3 - Write function to list files from a bucket.
* [x] (Maybe) Get an example `datapackage.json` to be able to parallelize the next two tasks.
* [x] Generate a `datapackage.json` out of these files.
* [x] Publish a `datapackage.json` in the demo CKAN instance.
* https://demo.ckan.org/organization/nirabsebas
* [ ] If not possible, get a local CKAN instance, with API key, running.
* [x] Document the project so that the client can use it.
* [x] Communicate the delivery to Paul, so he can tell the client.
---
## Paul's comment on Initial Implementation
* There is code related to uploading files to S3. The request is to read from S3, not write to it. Code for writing to S3 is confusing for the person we are providing this for, who wants an example of harvesting data into CKAN.
* There is code called at the module level from here onwards, which makes this code unportable (ideally, we can ship this to the client as a CKAN Harvester). In any event, this is not very idiomatic (code to be called for a script should generally be guarded behind `if __name__ == "__main__"`; see the sketch after this list).
* Because we have not done any mapping of metadata (from datapackage.json to a CKAN dataset, or from info we can derive from S3 to use as metadata in CKAN about the files), the script is primarily a demo of writing data to S3, and only lines 94-105 are about "harvesting" into CKAN. However, without the metadata mapping, it is just an example of a POST request for a single file into a hardcoded ORG and hardcoded dataset, which is unfortunately not what the client is looking for here.
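A minimal sketch of the restructuring suggested in the second point (the function name is illustrative):

```python
def run_harvest():
    # Build the datapackage from the bucket and sync it to CKAN.
    ...


if __name__ == "__main__":
    # Only runs when executed as a script, not when imported as a module,
    # which keeps the code reusable from a future CKAN harvester.
    run_harvest()
```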
## Plan of work to address the comments
* [x] Clean the code
* [x] Remove the upload code and helper methods for upload, refactor code to work without the upload function
* [x] Refactor the code into separate functions to make it more portable
* [x] Map data from the datapackage (see the sketch after this list)
* [x] iterate over each resource in datapackage to get metadata for CKAN
* [x] Use the metadata to upload the resources correctly
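A rough sketch of that mapping, assuming datapackage-py and ckanapi (the function and parameter names are illustrative, not the shipped code):

```python
from ckanapi import RemoteCKAN
from datapackage import Package


def sync_to_ckan(datapackage_path, ckan_url, apikey, org_id):
    # Create a CKAN dataset from the datapackage metadata, then register each
    # resource by its S3 URL - the data itself stays in S3, only metadata moves.
    package = Package(datapackage_path)
    ckan = RemoteCKAN(ckan_url, apikey=apikey)
    dataset = ckan.action.package_create(
        name=package.descriptor["name"],
        owner_org=org_id,
    )
    for resource in package.resources:
        ckan.action.resource_create(
            package_id=dataset["id"],
            name=resource.descriptor["name"],
            url=resource.descriptor["path"],
        )
    return dataset
```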
::: warning
Irio's comments:
* "fix its dependency"
* What dependency?
* What will be fixed? ~Nirab: Made it clearer
* I would prioritize the metadata over cleaning the code, as Paul explicitly pointed out that it is important for the client. :+1:
:::
https://gitlab.com/datopian/tech/lead/-/issues/38
> Here's the initial use case for the harvester.
> Files loaded into multiple S3 buckets.
> Provide harvester with AWS keys and bucket names.
> Harvester catalogs data in CKAN by doing either or both:
> * reading a JSON file containing metadata information for data files (preferable if we have to choose one harvester)
> * reading metadata elements directly off of the files in the buckets.
> Cataloged data should continue to be stored in S3, not CKAN, but should display all metadata information provided, and the S3 location of the files.
> Acceptance Criteria: bucket == a dataset, files in the bucket are the resources of the dataset
Solution outline: write a simple python script (runnable from the command line) that:
* [ ] Given a private bucket name and AWS key, generates a data package with data resources, one (data resource) for each file in the bucket.
    * **Yes**, as credentials are read from `~/.aws/credentials` if they are available, but this would require more testing (see the sketch below).
* [x] Given those data resources / packages, it syncs them to a CKAN instance (note that the next-gen harvester should have this; please check if https://github.com/datopian/ckan-ng-harvester-core can be helpful)
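A minimal sketch of how the AWS key could be passed in explicitly while still falling back to `~/.aws/credentials` (parameter names are illustrative):

```python
import boto3


def s3_client(aws_key_id=None, aws_secret=None):
    # With explicit credentials, use them; otherwise boto3 falls back to its
    # default chain (~/.aws/credentials, environment variables, etc.).
    if aws_key_id and aws_secret:
        return boto3.client(
            "s3",
            aws_access_key_id=aws_key_id,
            aws_secret_access_key=aws_secret,
        )
    return boto3.client("s3")
```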
Bonus:
* [x] Save the datapackage.json in the bucket.
* [ ] Generate a table schema for each resource that is tabular (see the sketch after this list)
* Frictionless Data should have documentation on that.
* [ ] given a config file of buckets and aws keys generate multiple data packages one for each bucket
* [ ] if a datapackage.json exists in the root of the bucket use that ...
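One possible way to infer a table schema for a tabular resource, assuming tableschema-py from the Frictionless Data tooling (the helper name is illustrative):

```python
from tableschema import Table


def infer_schema(resource_url):
    # Sample the rows of a CSV (local path or URL) to guess field names and
    # types, then return the schema descriptor to embed in datapackage.json.
    table = Table(resource_url)
    table.infer()
    return table.schema.descriptor
```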
## Plan of work
* [x] Create AWS account, if you don't have one already
* [x] Set up a test S3 bucket and add some sample files ...
* [x] Write a (failing) test
* [ ] Implement (use dataflows ...)
* https://github.com/frictionlessdata/dataflows-aws may be useful (not sure)
* This didn't seem necessary.
* [ ] See the test pass
* More tests need to be added, and we need to make sure they pass.
## Quick references
* AWS python package to pull data from the buckets
* https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-example-creating-buckets.html
* How to generate a Data Package
* https://frictionlessdata.io/tooling/data-package-tools/
* CKAN api to sync data
* https://github.com/ckan/ckanapi
https://gitlab.com/datopian/experiments/harvert-from-s3-buckets
For examples of tests:
https://github.com/datopian/data-subscriptions
https://docs.ckan.org/en/latest/api/index.html#module-ckan.logic.action.create
---
### Discuss later
- 12 factors, not sharing secrets
### Code snippets
```python
# Reference: creating a resource on an existing CKAN dataset with ckanapi.
from ckanapi import RemoteCKAN

ua = 'ckanapiexample/1.0 (+http://example.com/my/website)'
mysecret = "8f6d4285-a783-41e9-be5a-c56eb6535225"
mysite = RemoteCKAN('http://demo.ckan.org', apikey=mysecret, user_agent=ua)
mysite.action.resource_create(package_id='my-dataset-with-files', url='dummy-value', upload=open('emojis.csv', 'rb'))
```
```python
import boto3

# Note: get_bucket_location returns {'LocationConstraint': None} for buckets
# in us-east-1, which is where the "s3-None" hosts in the errors below come from.
bucket_location = boto3.client('s3').get_bucket_location(Bucket=s3_bucket_name)
object_url = "https://s3-{0}.amazonaws.com/{1}/{2}".format(
    bucket_location['LocationConstraint'],
    s3_bucket_name,
    key_name)
```
---
### Links - Research
NHS Tools - (Similar Project)
https://gitlab.com/datopian/experiments/nhs-tools
http://specs.frictionlessdata.io/data-resource/
https://stackoverflow.com/a/56090535/8787680
```
[datapackage.exceptions.ValidationError('Descriptor validation error: {\'path\': \'https://s3-None.amazonaws.com/datopian-new-test-account-poc/chamber_of_deputies_presences.csv\', \'profile\': \'tabular-data-resource\'} is not valid under any of the given schemas at "resources/0" in descriptor and at "properties/resources/items/oneOf" in profile'),
datapackage.exceptions.ValidationError('Descriptor validation error: {\'path\': \'https://s3-None.amazonaws.com/datopian-new-test-account-poc/territories.csv\', \'profile\': \'tabular-data-resource\'} is not valid under any of the given schemas at "resources/1" in descriptor and at "properties/resources/items/oneOf" in profile')]
```
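The `s3-None` host in these paths comes from `LocationConstraint` being `None` for buckets in us-east-1 (see the snippet above). One possible fix, as a sketch with an illustrative helper name:

```python
import boto3


def object_url(bucket_name, key_name):
    # get_bucket_location reports None for us-east-1, so fall back explicitly.
    location = boto3.client("s3").get_bucket_location(Bucket=bucket_name)
    region = location["LocationConstraint"] or "us-east-1"
    return "https://{0}.s3.{1}.amazonaws.com/{2}".format(bucket_name, region, key_name)
```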