
# Making the River, with Galaxykate

## So you want to do a project

* Find something to work with:
  * what do you have "an unfair advantage" in? Or an interest in?
  * what's the neatest toy available?
* List what you have that can help you work with that
  * datasets, people, libraries, past code, etc

## Inventory

When starting a project, check your inventory first! What do you have?

The Art Institute of Chicago put up all their art, let's make something with it.
https://www.thisiscolossal.com/2018/10/art-institute-of-chicago-image-collection/

### Web basics

```
# ` > image.png` is the redirection operator (writes a command's output to a file)

curl -s 'https://icanhazdadjoke.com/' -H 'Accept: text/plain'
curl ifconfig.me

# Image curls (-I fetches only the response headers)
curl -I "https://www.artic.edu/iiif/2/ea8c5d62-6ce8-88e8-feb1-e0053cf534c5/full/1686,/0/default.jpg"
curl -I "https://64.media.tumblr.com/97769a7bee6f93e1ead537ebba8a16d6/50b1a9ee2dee08f3-12/s2048x3072/d9125681061cb1ed20b3bf96cad89ec57ac8ea74.jpg"
curl -I "https://d2w9rnfcy7mm78.cloudfront.net/12305376/original_7c01d19129eb0b26432a9443db766499.jpg?1624122193?bc=0"
curl -I "https://i.imgur.com/Q6Lzz0n.jpeg"
```

### Hosting

We can serve locally with Flask. ChatGPT does a good Flask implementation.

Not sure where to host it later??? Glitch?

### River/Max Bittker

[Max Bittker's River](https://river.maxbittker.com/): It's our model of what we can do.

*Pros:* Cool, good vibes. Art is hotlinked from `are.na` so he doesn't have to host his own.

*Cons:* Questionable art sourcing. Hotlinked art has the wrong headers, so we can embed it in HTML, but not use it in VR or do analysis of the pixels.

He gave us his embedding code! We can email him for help. He also teaches classes like this at NYU. Academic friends!
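
The "wrong headers" problem can be checked from Python as well as with `curl -I`. This is a minimal sketch using only the standard library. The helper names `allows_cross_origin` and `fetch_headers` are made up for these notes, and the check only looks at the `access-control-allow-origin` header, which is the one these notes care about rather than everything a browser checks.

```python
from urllib.request import Request, urlopen


def allows_cross_origin(headers):
    """Return True if the headers allow reads from any origin.

    `headers` is any mapping of header name -> value. Header names are
    case-insensitive, so normalize them before looking one up.
    """
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    return normalized.get("access-control-allow-origin") == "*"


def fetch_headers(url, timeout=10):
    """Send a HEAD request (like `curl -I`) and return the response headers as a dict."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return dict(response.headers.items())
```

To test a host, call `allows_cross_origin(fetch_headers(url))` with one of the image URLs from the curl examples. Some hosts may reject requests that lack a browser-like `User-Agent`, in which case `urlopen` raises an `HTTPError` and `curl -I` is the quicker check.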
### AIC API

[Art Institute of Chicago API](https://www.artic.edu/open-access/public-api) ([docs](https://api.artic.edu/docs/))

`https://api.artic.edu/api/v1/artworks?fields=id,title,artist_display,date_display,main_reference_number,is_public_domain`

Can get artwork data by ID

Can get artwork image by ID:
`https://www.artic.edu/iiif/2/{identifier}/full/843,/0/default.jpg`

Documentation out of date. The example from the docs does not work:
https://www.artic.edu/iiif/2/1adf2696-8489-499b-cad2-821d7fde4b33/full/843,/0/default.jpg

This one does:
https://www.artic.edu/iiif/2/2d484387-2509-5e8e-2c43-22f9981972eb/full/843,/0/default.jpg

Image IDs have updated I guess!

License:

```
For information about copyright, please see the Image Licensing and Terms pages on our website. We defer to those resources in all legal respects.

From a developer's perspective, we recommend only using images from artworks that are tagged as public domain. When querying for artworks, you can filter by public domain status like so:

https://api.artic.edu/api/v1/artworks/search?query[term][is_public_domain]=true&limit=0

To get the full list of artworks that are in the public domain, you will need to download our data dumps and perform the filtering locally.
```

### AIC Scrape

[Cleaned up scrape by u/streamfrag](https://www.reddit.com/r/DataHoarder/comments/d0wuae/50k_images_from_the_art_institute_of_chicago/) from original [scrape](https://old.reddit.com/r/DataHoarder/comments/a8xqwj/50k_images_from_the_art_institute_of_chicago_for/)

Has:

* Images
* Metadata for each image, separate files
* `metadata.txt`: all metadata, 21 MB

50,000 images, 78 GB

From `metadata-format.txt`:

```
art id, subpic, filename, size, md5, width, height, art url, artist, title, subpic title, origin, date, medium, tags

tab delimited
some fields can be empty
entries with no file size have been deleted from the web site
```

### ML Stuff

https://github.com/mlfoundations/open_clip

I've done this processing for all 50,000 images, and now have the embeddings (124 MB)

```
# From Max
# Assumes `model` and `preprocess` come from open_clip, and that `directory`
# (the image folder path) and `bad_images` (a list) are defined elsewhere.
from PIL import Image

def load_img_to_embedding(filename):
    path = directory + filename
    try:
        image = Image.open(path)
        # image.show()
        # Downsize n crop
        image = preprocess(image).unsqueeze(0)
        # ** MAGIC HAPPENS ** get the embedding
        image_features = model.encode_image(image)
        # print(type(image_features))
        return image_features
    except Exception:
        # Remember files that failed to load or encode; the caller gets None
        bad_images.append(filename)
        # return "X"
        return None
```

### Tools I'm making

* **AIC Class:** Keeps all the art together, can do fxns on art
  * *methods:*
    * get_arts_by_tag
* **Art Class:** Data about one art
  * *attributes:*
    * tags: a list of strings (lowercase)

# To run your own Max Bittker River

* Get a set of images
  * Ideally, get a source
    * where the images are served via URL online
      * so you don't have to serve your own data-heavy images
    * that has `access-control-allow-origin`
      * so you can read the image across origins
* Get the CLIP embeddings
* Make a get_closest(image_id) function, probably with RPForest
* Make a Flask Restful API
* Make a webpage that shows the resulting images somehow

Check whether the image host sends the right headers:

```
Check Are.na:
curl -I "https://d2w9rnfcy7mm78.cloudfront.net/12305376/original_7c01d19129eb0b26432a9443db766499.jpg?1624122193?bc=0"
HTTP/2 200
content-type: image/jpeg
content-length: 119191
date: Tue, 24 Oct 2023 15:22:19 GMT
last-modified: Sat, 19 Jun 2021 17:03:14 GMT
etag: "bb859ea3fe9e06adda8c18e5d0e970e7"
cache-control: max-age=31536000
x-amz-version-id: bEN0gppg_9oSR_d6nU9VdqZ4P8Y1uVaf
accept-ranges: bytes
server: AmazonS3
x-cache: Miss from cloudfront
via: 1.1 98e2eb12ca62ecc662bc928ec41abedc.cloudfront.net (CloudFront)
x-amz-cf-pop: ORD52-C2
x-amz-cf-id: 4L4xF6M7HmMdvcaktNFa67BEpn4YficUXxFlIGXHFRHKh_eJwrsDaA==
```

```
Check Art Institute of Chicago:
curl -I "https://www.artic.edu/iiif/2/ea8c5d62-6ce8-88e8-feb1-e0053cf534c5/full/1686,/0/default.jpg"
HTTP/2 200
content-type: image/jpeg
date: Fri, 20 Oct 2023 03:19:17 GMT
server: Jetty(9.4.24.v20191120)
x-powered-by: Cantaloupe/4.1.7
access-control-allow-origin: *
cache-control: max-age=2592000, public, no-transform
link: <http://iiif.io/api/image/2/level2.json>;rel="profile"
vary: Accept-Encoding
x-cache: Hit from cloudfront
via: 1.1 d8f323d5df48e82196070cbc9534df98.cloudfront.net (CloudFront)
x-amz-cf-pop: ORD56-P4
x-amz-cf-id: qhfoogyeMLg7-6EH39-Nkf-NAYEPW2alXRBR3pZo7iO9zMrhxrN_dw==
age: 389185
```

You want to see `access-control-allow-origin: *`. The Art Institute response has it; the Are.na response does not.

Yes: **Imgur**, **tumblr**, **Facebook**, **NYTimes**, **Art Institute**, **Smithsonian**, many museums

No: Mastodon, Instagram, Twitter, Reddit, The Block Museum, XKCD
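
The `get_closest(image_id)` step from the list above could start out as a brute-force search before reaching for RPForest. The sketch below is not RPForest and not Max's code: it is plain cosine similarity over every embedding, using only the standard library. It assumes each CLIP embedding has already been converted to a plain list of floats and stored in a dict keyed by image ID; the names `normalize` and `build_index` are made up for illustration.

```python
import heapq
import math


def normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    length = math.sqrt(sum(x * x for x in vector))
    if length == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / length for x in vector]


def build_index(embeddings):
    """Pre-normalize every embedding. `embeddings` maps image_id -> list of floats."""
    return {image_id: normalize(vector) for image_id, vector in embeddings.items()}


def get_closest(image_id, index, count=5):
    """Return the `count` image IDs most similar to `image_id`, best match first."""
    query = index[image_id]
    # Score every other image by cosine similarity with the query image
    scores = (
        (sum(a * b for a, b in zip(query, vector)), other_id)
        for other_id, vector in index.items()
        if other_id != image_id
    )
    return [other_id for _, other_id in heapq.nlargest(count, scores)]
```

This compares the query against every image on each call, so it will be slow in pure Python for all 50,000 embeddings. It is mainly useful for trying the idea on a small subset and as a reference to compare against once an approximate index such as RPForest is in place.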