# Getting Started with Cryo Data Analysis in AWS SageMaker
[AWS SageMaker](https://aws.amazon.com/sagemaker/) is a managed ML platform that provides a Jupyter notebook interface for building data pipelines and training models. [Cryo](https://github.com/paradigmxyz/cryo) is a tool for extracting data from the Ethereum blockchain. This post shows how to set up notebooks in SageMaker to analyze data extracted by Cryo.
A forked version of the Paradigm Data Portal and the notebooks from this post are available here: https://github.com/ipatka/paradigm-data-portal/tree/s3/notebooks

The [Paradigm Data Portal](https://github.com/paradigmxyz/paradigm-data-portal) hosts a public dataset of Ethereum data extracted by Cryo. The portal can be installed as a CLI tool or imported as a Python package. While you can extract this data yourself with Cryo, downloading the datasets from PDP is the quickest way to get started.
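If you just want to browse the public datasets, the upstream package can be installed directly from a notebook cell. A minimal sketch (the PyPI package name here is an assumption, so check the PDP README; the `pdp` module name matches what we import later in this post):
```python
# Install the upstream Paradigm Data Portal package
# (package name assumed from the PDP repo; see its README for install instructions)
!pip install paradigm-data-portal
import pdp  # importable as `pdp`, as used later in this post
```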
## Setting up SageMaker
If you do not have a SageMaker Studio environment set up already, follow this tutorial to create one: [Amazon SageMaker Studio](https://aws.amazon.com/tutorials/machine-learning-tutorial-build-model-locally/).
## Working with S3
SageMaker Studio comes with a default S3 bucket that you can use to store data. You can store the Parquet files from Cryo in S3 and then load them into your notebook. Using this fork of the [Paradigm Data Portal](https://github.com/ipatka/paradigm-data-portal/tree/s3), you can download the data from PDP and upload it to your S3 bucket.
To install this version of PDP, run `pip install .` in the root directory of the repository on the `s3` branch.
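From a SageMaker notebook, one way to do this is a cell of shell commands (a minimal sketch; clone wherever you prefer):
```python
# Run shell commands from a notebook cell with the "!" prefix
!git clone -b s3 https://github.com/ipatka/paradigm-data-portal.git
%cd paradigm-data-portal
!pip install .
```
With the fork installed, the following cell downloads the datasets and uploads them to your default SageMaker bucket: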
```python
import boto3
import sagemaker
import pdp
# Set SageMaker and S3 client variables
sess = sagemaker.Session()
region = sess.boto_region_name
s3_client = boto3.client("s3", region_name=region)
sagemaker_role = sagemaker.get_execution_role()
# Set the S3 bucket to write the datasets to
write_bucket = sess.default_bucket()
# Download each PDP dataset and upload it to the bucket
datasets = ["ethereum_contracts", "ethereum_native_transfers", "ethereum_slots"]
for dataset in datasets:
    pdp.download_dataset_to_s3(dataset=dataset, s3_client=s3_client, s3_bucket=write_bucket)
```
After running this script, you should have all three PDP datasets in your S3 bucket.
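To sanity-check the upload, you can list a few of the objects that landed in the bucket. A quick sketch using the boto3 client created above, assuming each dataset is written under a prefix named after the dataset (which is how the paths are built later in this post):
```python
# List the first few uploaded objects for one dataset
response = s3_client.list_objects_v2(Bucket=write_bucket, Prefix="ethereum_contracts/", MaxKeys=5)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```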

*S3 buckets with data from PDP*

*Ethereum Contracts dataset*
## Working with Data from S3
The normal flow for loading local Parquet files for data analysis looks like this:
```python
import os
import polars as pl

# Count contract deployments across all local Parquet files
data_path = '~/pdp/ethereum_contracts/ethereum_contracts__v1_0_0__*.parquet'
data_path = os.path.expanduser(data_path)
result = pl.scan_parquet(data_path).select(pl.count()).collect()
n_deployments = result.item()
```
However, when loading data from S3 we have to make some slight modifications to the path. We can use the `s3fs` library, a filesystem interface to S3 that lets you read objects in a bucket much as you would files in a local directory, so the loading code stays nearly identical to the local version.
First, set up the SageMaker session and identify the bucket that holds the data:
```python
import boto3
import sagemaker
# Set SageMaker and S3 client variables
sess = sagemaker.Session()
region = sess.boto_region_name
s3_client = boto3.client("s3", region_name=region)
sagemaker_role = sagemaker.get_execution_role()
# Set the S3 bucket to read the datasets from
data_bucket = sess.default_bucket()
```
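Before scanning the full dataset, it can help to confirm the file interface works by opening a single object directly. Here is a quick sketch; the file name below is only an example, so substitute any key returned by `fs.ls()`:
```python
import polars as pl
import s3fs

fs = s3fs.S3FileSystem()
# Open one object as if it were a local file
# (the exact file name below is hypothetical; pick one listed in your bucket)
example_key = f"{data_bucket}/ethereum_contracts/ethereum_contracts__v1_0_0__00000000_to_00999999.parquet"
with fs.open(example_key, "rb") as f:
    sample = pl.read_parquet(f)
print(sample.shape)
```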
Then, lazily load the data into a Polars dataframe using `scan_pyarrow_dataset`:
*NOTE: Choose the XLarge environment in SageMaker to analyze the whole dataset at once.*
```python
import polars as pl
import pyarrow.dataset as ds
import s3fs

fs = s3fs.S3FileSystem()

dataset_name = "ethereum_contracts"
data_path = f"{data_bucket}/{dataset_name}"
# Get a list of all files in the directory
files = fs.ls(data_path)
# Filter the list to include only Parquet files
parquet_files = [f for f in files if f.endswith('.parquet')]
# Build a PyArrow dataset over the Parquet files and scan it lazily with Polars
dataset = ds.dataset(parquet_files, format='parquet', filesystem=fs)
df = pl.scan_pyarrow_dataset(dataset)
```
Finally, adapt the queries from the original sample notebook to work with the Polars dataframe:
```python
# Count contract deployments across the whole dataset
result = df.select(pl.col('create_index').count()).collect()
n_deployments = result.item()
n_deployments
```
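The same lazy frame supports richer queries. For example, assuming the contracts dataset includes a `deployer` column (as in Cryo's contracts schema), you can count distinct deployers without pulling the whole dataset into memory:
```python
# Count distinct contract deployers (assumes the dataset has a `deployer` column)
n_deployers = df.select(pl.col('deployer').n_unique()).collect().item()
n_deployers
```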

## Conclusion
That's it! We now have a SageMaker Studio environment set up to analyze data extracted by Cryo. In future posts, we will explore how to use SageMaker to train models on this data.
### Notes
*Edited on Sep 30, 2023 with input from @banteg (thanks!)*