# Making the River, with Galaxykate
## So you want to do a project
* Find something to work with:
  * what do you have "an unfair advantage" in? Or an interest in?
  * what's the neatest toy available?
* List what you have that can help you work with that
  * datasets, people, libraries, past code, etc
## Inventory
When starting a project, check your inventory first! What do you have?
The Art Institute of Chicago put all their art online. Let's make something with it.
https://www.thisiscolossal.com/2018/10/art-institute-of-chicago-image-collection/
### Web basics
```
# "> image.png" is the redirection operator: add it after a command to save the output to a file
curl -s 'https://icanhazdadjoke.com/' -H 'Accept: text/plain'
curl ifconfig.me
# Image curls (-I fetches only the response headers)
curl -I "https://www.artic.edu/iiif/2/ea8c5d62-6ce8-88e8-feb1-e0053cf534c5/full/1686,/0/default.jpg"
curl -I "https://64.media.tumblr.com/97769a7bee6f93e1ead537ebba8a16d6/50b1a9ee2dee08f3-12/s2048x3072/d9125681061cb1ed20b3bf96cad89ec57ac8ea74.jpg"
curl -I "https://d2w9rnfcy7mm78.cloudfront.net/12305376/original_7c01d19129eb0b26432a9443db766499.jpg?1624122193?bc=0"
curl -I "https://i.imgur.com/Q6Lzz0n.jpeg"
```
### Hosting
We can serve locally with Flask
ChatGPT writes a good Flask implementation
Not sure where to host it later??? Glitch?
### River/Max Bittker
[Max Bittker's River](https://river.maxbittker.com/):
It's our model of what we can do.
*Pros:* Cool, good vibes. Art is hotlinked to `are.na` so he doesn't have to host his own
*Cons:* Questionable art sourcing. The hotlinked art is served without an `access-control-allow-origin` header, so we can embed it in HTML but can't use it in VR or analyze the pixels
He gave us his embedding code! We can email him for help. He also teaches classes like this at NYU. Academic friends!
### AIC API
[Art Institute of Chicago API](https://www.artic.edu/open-access/public-api)
([docs](https://api.artic.edu/docs/))
`https://api.artic.edu/api/v1/artworks?fields=id,title,artist_display,date_display,main_reference_number,is_public_domain`
Can get artwork data by ID
Can get artwork image by ID
https://www.artic.edu/iiif/2/{identifier}/full/843,/0/default.jpg
Documentation out of date:
https://www.artic.edu/iiif/2/1adf2696-8489-499b-cad2-821d7fde4b33/full/843,/0/default.jpg does not work
https://www.artic.edu/iiif/2/2d484387-2509-5e8e-2c43-22f9981972eb/full/843,/0/default.jpg
The image IDs have been updated, I guess!
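
A small helper makes the URL pattern above concrete. This is only a sketch: `iiif_image_url`, `IIIF_BASE`, and the `width` parameter are names made up here, and it assumes you already have a current image identifier (the artwork data appears to carry it in an `image_id` field, but check the API docs since they have drifted before).

```python
IIIF_BASE = "https://www.artic.edu/iiif/2"


def iiif_image_url(image_id: str, width: int = 843) -> str:
    """Build an image URL following the pattern in the notes above.

    In the path, "full" asks for the whole image, "843," scales it to that
    width (the height follows), and "0" means no rotation.
    """
    return f"{IIIF_BASE}/{image_id}/full/{width},/0/default.jpg"


working_url = iiif_image_url("2d484387-2509-5e8e-2c43-22f9981972eb")
large_url = iiif_image_url("ea8c5d62-6ce8-88e8-feb1-e0053cf534c5", width=1686)
print(working_url)
print(large_url)
```

The two calls rebuild URLs that already appear in these notes, so they are an easy sanity check.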
License:
```
For information about copyright, please see the Image Licensing and Terms pages on our website. We defer to those resources in all legal respects.
From a developer's perspective, we recommend only using images from artworks that are tagged as public domain. When querying for artworks, you can filter by public domain status like so:
https://api.artic.edu/api/v1/artworks/search?query[term][is_public_domain]=true&limit=0
To get the full list of artworks that are in the public domain, you will need to download our data dumps and perform the filtering locally.
```
### AIC Scrape
[Cleaned up scrape by u/streamfrag](https://www.reddit.com/r/DataHoarder/comments/d0wuae/50k_images_from_the_art_institute_of_chicago/) from original [scrape](https://old.reddit.com/r/DataHoarder/comments/a8xqwj/50k_images_from_the_art_institute_of_chicago_for/)
Has:
* Images
* Metadata for each image, separate files
* `metadata.txt`: all metadata, 21 MB

50,000 images, 78 GB
From `metadata-format.txt`
```
art id, subpic, filename, size, md5, width, height, art url, artist, title, subpic title, origin, date, medium, tags
tab delimited
some fields can be empty
entries with no file size have been deleted from the web site
```
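
A rough sketch of reading that format with Python's `csv` module. The column names are the ones listed above with spaces turned into underscores. Everything else is an assumption: the function name `parse_metadata`, the `deleted_from_site` flag, and especially `tag_separator`, since the format notes don't say how tags are separated inside the last field. The sample row is made up for illustration.

```python
import csv
import io

# Column names from metadata-format.txt, spaces replaced by underscores.
FIELDS = [
    "art_id", "subpic", "filename", "size", "md5", "width", "height",
    "art_url", "artist", "title", "subpic_title", "origin", "date",
    "medium", "tags",
]


def parse_metadata(lines, tag_separator=","):
    """Yield one dict per tab-delimited row. tag_separator is a guess."""
    # QUOTE_NONE keeps quotation marks inside titles as ordinary characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Pad short rows so trailing empty fields still get a key.
        row = row + [""] * (len(FIELDS) - len(row))
        record = dict(zip(FIELDS, row))
        # Per the format notes, no file size means deleted from the web site.
        record["deleted_from_site"] = record["size"] == ""
        record["tags"] = [
            t.strip().lower()
            for t in record["tags"].split(tag_separator)
            if t.strip()
        ]
        yield record


# Made-up sample row; with the real file you would pass
# open("metadata.txt", encoding="utf-8", newline="") instead.
sample = (
    "123\t0\t123-0.jpg\t2048\tabc123\t800\t600\thttps://example.org/art/123"
    "\tSome Artist\tSome Title\t\tFrance\t1900\tOil on canvas\tPainting, Landscape\n"
)
rows = list(parse_metadata(io.StringIO(sample)))
print(rows[0]["title"], rows[0]["tags"])
```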
### ML Stuff
https://github.com/mlfoundations/open_clip
I've done this processing for all 50,000 images, and now have the embeddings (124 MB)
```
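# From Max
# Assumes `model` and `preprocess` come from open_clip.create_model_and_transforms(...),
# `directory` is the image folder path (with trailing slash), and `bad_images` is a list.
from PIL import Image

def load_img_to_embedding(filename):
    path = directory + filename
    try:
        image = Image.open(path)
        # image.show()
        # Downsize n crop
        image = preprocess(image).unsqueeze(0)
        # ** MAGIC HAPPENS ** get the embedding
        image_features = model.encode_image(image)
        # print(type(image_features))
        return image_features
    except Exception:
        # Remember files that failed to load or encode; the function returns None for them
        bad_images.append(filename)
        # return "X"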
```
### Tools I'm making
* **AIC Class:** Keeps all the art together, can run functions on the art
  * *methods:*
    * `get_arts_by_tag`
* **Art Class:** Data about one artwork
  * *attributes:*
    * `tags`: a list of strings (lowercase)
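
One way these two classes could look, as a minimal sketch. Only `get_arts_by_tag` and the lowercase `tags` list come from the notes above; the `art_id` and `title` attributes, the `arts` list, and the constructor shapes are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Art:
    """Data about one artwork."""
    art_id: str
    title: str = ""
    tags: list = field(default_factory=list)

    def __post_init__(self):
        # Store tags lowercase so lookups don't depend on capitalization.
        self.tags = [t.lower() for t in self.tags]


class AIC:
    """Keeps all the art together and runs functions over it."""

    def __init__(self, arts=()):
        self.arts = list(arts)

    def get_arts_by_tag(self, tag):
        """Return every Art whose tags include `tag` (case-insensitive)."""
        tag = tag.lower()
        return [art for art in self.arts if tag in art.tags]


collection = AIC([
    Art("1", "First", ["Cats", "Painting"]),
    Art("2", "Second", ["dogs"]),
    Art("3", "Third", ["painting"]),
])
paintings = collection.get_arts_by_tag("Painting")
print([art.art_id for art in paintings])
```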
# To run your own Max Bittker River
* Get a set of images
  * Ideally, get a source
    * where the images are served via URL online
      * so you don't have to serve your own data-heavy images
    * that has `access-control-allow-origin`
      * so you can read the image across origins
* Get the CLIP embeddings
* Make a `get_closest(image_id)` function, probably with RPForest
* Make a Flask RESTful API
* Make a webpage that shows the resulting images somehow
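
To make the `get_closest` step concrete, here is a brute-force stand-in. It is not RPForest: it scans every embedding and ranks by cosine similarity, which is exact but slow in pure Python for a large collection, so an approximate index would replace the scan later. The extra `embeddings` and `n` parameters, the plain-list vectors, and the toy data are all assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def get_closest(image_id, embeddings, n=5):
    """Return the ids of the n images most similar to image_id.

    `embeddings` maps image id -> vector (a list of floats here).
    """
    query = embeddings[image_id]
    scored = [
        (cosine_similarity(query, vec), other_id)
        for other_id, vec in embeddings.items()
        if other_id != image_id  # don't return the query image itself
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [other_id for _, other_id in scored[:n]]


# Toy 2-D "embeddings"; real CLIP vectors have hundreds of dimensions.
toy = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
closest = get_closest("a", toy, n=2)
print(closest)
```

A Flask route would then only need to call `get_closest` and return the ids (or image URLs) as JSON.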
Check whether the image host sends the right headers:
```
Check Are.na:
curl -I "https://d2w9rnfcy7mm78.cloudfront.net/12305376/original_7c01d19129eb0b26432a9443db766499.jpg?1624122193?bc=0"
HTTP/2 200
content-type: image/jpeg
content-length: 119191
date: Tue, 24 Oct 2023 15:22:19 GMT
last-modified: Sat, 19 Jun 2021 17:03:14 GMT
etag: "bb859ea3fe9e06adda8c18e5d0e970e7"
cache-control: max-age=31536000
x-amz-version-id: bEN0gppg_9oSR_d6nU9VdqZ4P8Y1uVaf
accept-ranges: bytes
server: AmazonS3
x-cache: Miss from cloudfront
via: 1.1 98e2eb12ca62ecc662bc928ec41abedc.cloudfront.net (CloudFront)
x-amz-cf-pop: ORD52-C2
x-amz-cf-id: 4L4xF6M7HmMdvcaktNFa67BEpn4YficUXxFlIGXHFRHKh_eJwrsDaA==
```
```
Check Art Institute of Chicago:
curl -I "https://www.artic.edu/iiif/2/ea8c5d62-6ce8-88e8-feb1-e0053cf534c5/full/1686,/0/default.jpg"
HTTP/2 200
content-type: image/jpeg
date: Fri, 20 Oct 2023 03:19:17 GMT
server: Jetty(9.4.24.v20191120)
x-powered-by: Cantaloupe/4.1.7
access-control-allow-origin: *
cache-control: max-age=2592000, public, no-transform
link: <http://iiif.io/api/image/2/level2.json>;rel="profile"
vary: Accept-Encoding
x-cache: Hit from cloudfront
via: 1.1 d8f323d5df48e82196070cbc9534df98.cloudfront.net (CloudFront)
x-amz-cf-pop: ORD56-P4
x-amz-cf-id: qhfoogyeMLg7-6EH39-Nkf-NAYEPW2alXRBR3pZo7iO9zMrhxrN_dw==
age: 389185
```
You want to see `access-control-allow-origin: *`. Hosts that send it: **Imgur**, **tumblr**, **Facebook**, **NYTimes**, **Art Institute**, **Smithsonian**, and many museums. Hosts that don't: Mastodon, Instagram, Twitter, Reddit, The Block Museum, XKCD.
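
The same check can be scripted. This sketch parses header text like the `curl -I` output above and looks for the wildcard header; the two sample strings are trimmed copies of those transcripts. The function names are made up here, and `fetch_headers` is an untested convenience that sends a HEAD request with the standard library (some hosts may answer Python differently than they answer curl).

```python
from urllib.request import Request, urlopen


def parse_header_block(text):
    """Turn `curl -I` output into a dict keyed by lowercase header name."""
    headers = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:  # the status line ("HTTP/2 200") has no colon and is skipped
            headers[name.strip().lower()] = value.strip()
    return headers


def allows_any_origin(headers):
    """True if the response lets any origin read the image data."""
    return headers.get("access-control-allow-origin") == "*"


def fetch_headers(url):
    """Fetch live headers with a HEAD request (needs network access)."""
    with urlopen(Request(url, method="HEAD")) as response:
        return {k.lower(): v for k, v in response.headers.items()}


arena_output = """HTTP/2 200
content-type: image/jpeg
cache-control: max-age=31536000
server: AmazonS3
"""
artic_output = """HTTP/2 200
content-type: image/jpeg
access-control-allow-origin: *
cache-control: max-age=2592000, public, no-transform
"""
arena_ok = allows_any_origin(parse_header_block(arena_output))
artic_ok = allows_any_origin(parse_header_block(artic_output))
print("Are.na:", arena_ok)
print("Art Institute:", artic_ok)
```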