# Getting Started with Cryo Data Analysis in AWS SageMaker
[AWS SageMaker](https://aws.amazon.com/sagemaker/) is a managed ML platform that provides a Jupyter notebook interface for building data pipelines and training models. [Cryo](https://github.com/paradigmxyz/cryo) is a tool for extracting data from the Ethereum blockchain. This post shows how to set up notebooks in SageMaker to analyze data extracted by Cryo.
A forked version of the Paradigm Data Portal and the notebooks from this post are available here: https://github.com/ipatka/paradigm-data-portal/tree/s3/notebooks

The [Paradigm Data Portal](https://github.com/paradigmxyz/paradigm-data-portal) hosts a public dataset of Ethereum data extracted by Cryo. The portal can be installed as a CLI tool or imported as a Python package. While you can extract this data yourself with Cryo, downloading the datasets from PDP is the quickest way to get started.
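If you just want to browse the public datasets, the upstream package can be installed directly from a notebook cell. A minimal sketch (the PyPI package name here is an assumption, so check the PDP README; the `pdp` module name matches what we import later in this post):
```python
# Install the upstream Paradigm Data Portal package
# (package name assumed from the PDP repo; see its README for install instructions)
!pip install paradigm-data-portal
import pdp  # importable as `pdp`, as used later in this post
```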
## Setting up SageMaker
If you do not have a SageMaker Studio environment set up already, follow this tutorial to create one: [Amazon SageMaker Studio](https://aws.amazon.com/tutorials/machine-learning-tutorial-build-model-locally/).
## Working with S3
SageMaker Studio comes with a default S3 bucket that you can use to store data. You can store the Parquet files from Cryo in S3 and then load them into your notebook. Using this fork of the [Paradigm Data Portal](https://github.com/ipatka/paradigm-data-portal/tree/s3), you can download the data from PDP and upload it to your S3 bucket.
To install this version of PDP, run `pip install .` in the root directory of the repository on the `s3` branch.
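From a SageMaker notebook, one way to do this is a cell of shell commands (a minimal sketch; clone wherever you prefer):
```python
# Run shell commands from a notebook cell with the "!" prefix
!git clone -b s3 https://github.com/ipatka/paradigm-data-portal.git
%cd paradigm-data-portal
!pip install .
```
With the fork installed, the following cell downloads the datasets and uploads them to your default SageMaker bucket: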
```python
import boto3
import sagemaker
import pdp
# Set SageMaker and S3 client variables
sess = sagemaker.Session()
region = sess.boto_region_name
s3_client = boto3.client("s3", region_name=region)
sagemaker_role = sagemaker.get_execution_role()
# Set the S3 bucket to write the datasets to
write_bucket = sess.default_bucket()
# Download each PDP dataset and upload it to the bucket
datasets = ["ethereum_contracts", "ethereum_native_transfers", "ethereum_slots"]
for dataset in datasets:
    pdp.download_dataset_to_s3(dataset=dataset, s3_client=s3_client, s3_bucket=write_bucket)
```
After running this script, you should have all three PDP datasets in your S3 bucket.
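To sanity-check the upload, you can list a few of the objects that landed in the bucket. A quick sketch using the boto3 client created above, assuming each dataset is written under a prefix named after the dataset (which is how the paths are built later in this post):
```python
# List the first few uploaded objects for one dataset
response = s3_client.list_objects_v2(Bucket=write_bucket, Prefix="ethereum_contracts/", MaxKeys=5)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```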

*S3 buckets with data from PDP*

*Ethereum Contracts dataset*
## Working with Data from S3
The normal flow for loading local Parquet files for data analysis looks like this:
```python
import os
import polars as pl

# Count contract deployments across all local Parquet files
data_path = '~/pdp/ethereum_contracts/ethereum_contracts__v1_0_0__*.parquet'
data_path = os.path.expanduser(data_path)
result = pl.scan_parquet(data_path).select(pl.count()).collect()
n_deployments = result.item()
```
However, when loading data from S3 we have to make some slight modifications to the path. We can use the `s3fs` library, a filesystem interface to S3 that lets you read objects in a bucket much as you would files in a local directory, so the loading code stays nearly identical to the local version.
First, set up the SageMaker session and identify the bucket that holds the data:
```python
import boto3
import sagemaker
# Set SageMaker and S3 client variables
sess = sagemaker.Session()
region = sess.boto_region_name
s3_client = boto3.client("s3", region_name=region)
sagemaker_role = sagemaker.get_execution_role()
# Set the S3 bucket to read the datasets from
data_bucket = sess.default_bucket()
```
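Before scanning the full dataset, it can help to confirm the file interface works by opening a single object directly. Here is a quick sketch; the file name below is only an example, so substitute any key returned by `fs.ls()`:
```python
import polars as pl
import s3fs

fs = s3fs.S3FileSystem()
# Open one object as if it were a local file
# (the exact file name below is hypothetical; pick one listed in your bucket)
example_key = f"{data_bucket}/ethereum_contracts/ethereum_contracts__v1_0_0__00000000_to_00999999.parquet"
with fs.open(example_key, "rb") as f:
    sample = pl.read_parquet(f)
print(sample.shape)
```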
Then, lazily load the data into a Polars dataframe using `scan_pyarrow_dataset`:
*NOTE: Choose the XLarge environment in SageMaker to analyze the whole dataset at once.*
```python
import polars as pl
import pyarrow.dataset as ds
import s3fs

fs = s3fs.S3FileSystem()

dataset_name = "ethereum_contracts"
data_path = f"{data_bucket}/{dataset_name}"
# Get a list of all files in the directory
files = fs.ls(data_path)
# Filter the list to include only Parquet files
parquet_files = [f for f in files if f.endswith('.parquet')]
# Build a PyArrow dataset over the Parquet files and scan it lazily with Polars
dataset = ds.dataset(parquet_files, format='parquet', filesystem=fs)
df = pl.scan_pyarrow_dataset(dataset)
```
Finally, adapt the queries from the original sample notebook to work with the Polars dataframe:
```python
# Count contract deployments across the whole dataset
result = df.select(pl.col('create_index').count()).collect()
n_deployments = result.item()
n_deployments
```
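The same lazy frame supports richer queries. For example, assuming the contracts dataset includes a `deployer` column (as in Cryo's contracts schema), you can count distinct deployers without pulling the whole dataset into memory:
```python
# Count distinct contract deployers (assumes the dataset has a `deployer` column)
n_deployers = df.select(pl.col('deployer').n_unique()).collect().item()
n_deployers
```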

## Conclusion
That's it! We now have a SageMaker Studio environment set up to analyze data extracted by Cryo. In future posts, we will explore how to use SageMaker to train models on this data.
### Notes
*Edited on Sep 30, 2023 with input from @banteg (thanks!)*