A Confluence of Art, Machine Learning, and Collaborative Development: An In-depth Exploration

Recently, I took a class on ML competitions. In the final week, we had a hackathon that served as a platform for a remarkable intersection of art, machine learning, and collaborative teamwork. Our project centered on creating a system that could enhance art appreciation in museum and gallery settings.

Introducing the "Art Guide" Project

The core objective of our project was to build a system that could provide auditory descriptions of artworks. This system, named "Art Guide", lets users capture images of art pieces with their devices, triggering an audio response that describes the artwork. This innovation has the potential to enrich cultural experiences in spaces where traditional audio guides may be absent.

To delve deeper into the technical aspects of our project, you can find the full codebase on our GitHub Repository.

Data Collection

Our journey began with data collection. Our data collection team scraped around 200,000 images from WikiArt, along with corresponding descriptions and pertinent metadata such as artist details and creation dates. While the WikiArt data formed the bulk of our collection, we also looked for additional information on Wikipedia; unfortunately, only a small fraction of our data found a match there.

Consequently, our data repository consisted of the images alongside an associated metadata table. To manage this data, we opted for simplicity and used CSV files and Pandas dataframes instead of a traditional SQL database.
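As a rough illustration, loading and querying the metadata with Pandas could look like the sketch below; the file name and column names are hypothetical, not our exact schema.

```python
import pandas as pd

# Load the scraped metadata table; the file name and columns here are illustrative.
metadata = pd.read_csv("data/metadata.csv")

def get_artwork_info(image_id: str) -> dict:
    """Return everything we know about an artwork, keyed by its image identifier."""
    rows = metadata.loc[metadata["image_id"] == image_id]
    return rows.iloc[0].to_dict() if not rows.empty else {}
```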

Technical Workflow of the Bot

The operational workflow of the bot can be dissected into several distinct stages:

  • Preprocessing the given image
  • Reverse image search
  • Description generation
  • Text-to-speech conversion

Below, I describe each part in detail.

Upon receiving an image, the bot initiates a reverse image search operation to locate a similar image within our database. To optimize this process, we employ vectorization techniques to convert images into numerical representations.

To vectorize the images, we used transfer learning: we took a pretrained model and kept the feature-generation layers up to the latent vector. Since we wanted something fast, we used a ResNet18 pretrained on ImageNet.
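A minimal sketch of this idea with torchvision is shown below; the exact preprocessing and layer cut-off in our code may differ.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Take an ImageNet-pretrained ResNet18 and drop the final classification layer,
# keeping everything up to the 512-dimensional latent vector.
resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1])
feature_extractor.eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def vectorize(path: str) -> torch.Tensor:
    """Turn an image file into a 512-dimensional embedding vector."""
    image = Image.open(path).convert("RGB")
    with torch.no_grad():
        features = feature_extractor(preprocess(image).unsqueeze(0))
    return features.flatten()  # shape: (512,)
```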

These vectors are then indexed with Spotify's Annoy algorithm, which facilitates efficient nearest-neighbor searches. We chose this algorithm specifically because it is production-ready, supports on-disk indexing (instead of keeping the index in RAM), and lets us use plenty of different metrics (such as cosine similarity, Euclidean distance, etc.).

Here again, we opted against using one of the pre-established, well-known vector databases.
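A sketch of how such an index could be built and saved to disk follows; the number of trees and the file name are illustrative.

```python
from annoy import AnnoyIndex

EMBEDDING_DIM = 512  # size of the ResNet18 latent vector

# "angular" is Annoy's cosine-based metric; "euclidean" and others are also available.
index = AnnoyIndex(EMBEDDING_DIM, "angular")

for item_id, vector in enumerate(embeddings):  # embeddings: one 512-d vector per artwork
    index.add_item(item_id, vector)

index.build(50)             # more trees -> better accuracy, larger index
index.save("artworks.ann")  # the index lives in a file, not in RAM
```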

Preprocessing the Given Image

Given the potential noise and irrelevant elements in user-captured images, we preprocess each image before the search. We segment the image and, by means of a few heuristics (such as the area of each connected component and the distance between its centroid and the center of the image), we identify the most relevant connected component and crop it out of the image.
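A rough sketch of this kind of heuristic with OpenCV is below; the thresholding step and the scoring weights are illustrative rather than our exact pipeline.

```python
import cv2
import numpy as np

def crop_main_region(image: np.ndarray) -> np.ndarray:
    """Keep the connected component that is large and close to the image center."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    n_labels, _, stats, centroids = cv2.connectedComponentsWithStats(mask)
    img_center = np.array([image.shape[1] / 2, image.shape[0] / 2])

    best_label, best_score = None, -np.inf
    for label in range(1, n_labels):  # label 0 is the background
        area = stats[label, cv2.CC_STAT_AREA]
        dist = np.linalg.norm(centroids[label] - img_center)
        score = area - 5.0 * dist  # illustrative weighting of the two heuristics
        if score > best_score:
            best_label, best_score = label, score

    if best_label is None:
        return image  # nothing to crop; fall back to the original photo

    x, y = stats[best_label, cv2.CC_STAT_LEFT], stats[best_label, cv2.CC_STAT_TOP]
    w, h = stats[best_label, cv2.CC_STAT_WIDTH], stats[best_label, cv2.CC_STAT_HEIGHT]
    return image[y:y + h, x:x + w]
```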

One more thing to consider was perspective correction. The user could take the photo from any angle, so identifying the perspective and correcting it could enhance accuracy. It was not implemented, though, due to time constraints.

What if we don't have the image in the database?

We used cosine similarity to find the nearest neighbor to the image embedding vector. Since this metric is constrained to the range [-1, 1], we selected a threshold value so that anything below the threshold means we don't have the image in our database.
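Assuming the Annoy index sketched above (with the "angular" metric), the check could look like this; the threshold value itself is illustrative.

```python
SIMILARITY_THRESHOLD = 0.8  # illustrative value

def find_match(query_vector):
    """Return the id of the best match, or None if the artwork is not in the database."""
    ids, distances = index.get_nns_by_vector(query_vector, 1, include_distances=True)
    # Annoy's angular distance is sqrt(2 * (1 - cos)), so recover the cosine similarity:
    similarity = 1 - distances[0] ** 2 / 2
    return ids[0] if similarity >= SIMILARITY_THRESHOLD else None
```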

Description Generation

Once an image match is identified, metadata associated with the image is used as a basis for generating textual descriptions. We contemplated different strategies, including manual text construction and automated summarization using machine learning models. Ultimately, we favored manual construction for its practicality, considering the majority of our descriptions were concise.
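As an illustration, a simple template over the metadata fields could look like the following; the field names are hypothetical.

```python
def build_description(meta: dict) -> str:
    """Assemble a short description from whatever metadata fields are available."""
    parts = [f'"{meta["title"]}"']
    if meta.get("artist"):
        parts.append(f'by {meta["artist"]}')
    if meta.get("year"):
        parts.append(f'created in {meta["year"]}')
    description = " ".join(parts) + "."
    if meta.get("summary"):
        description += " " + meta["summary"]
    return description
```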

Text-to-Speech Conversion

The text descriptions are then transformed into audio using the Google Text-to-Speech library. While alternative non-AI audio synthesizers were evaluated, they were dismissed due to their poor audio quality.
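With gTTS, the conversion boils down to a few lines; the output file name here is arbitrary.

```python
from gtts import gTTS

def synthesize(description: str, out_path: str = "description.mp3") -> str:
    """Generate speech for the description and save it as an MP3 the bot can send."""
    gTTS(text=description, lang="en").save(out_path)
    return out_path
```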

Deployment and Maintenance

Ensuring the longevity and maintainability of the project's different components was vital. To address this, we incorporated Data Version Control (DVC) to manage various stages of data and embeddings. Our deployment strategy involved renting a server equipped with adequate storage and memory, and we employed Docker images to encapsulate and deploy the bot.

Collaborative Team Dynamics

Our team of 13 members embraced a collaborative approach with distinct roles. We split into sub-teams, one for each area: reverse image search, data collection, text-to-speech integration, image processing, description generation, and bot development. This last team was in charge of:

  • Building an API, for which we used FastAPI (a minimal sketch follows this list).
  • Maintaining code quality and reviewing the PRs. We added CI/CD to further improve this process.
  • Applying hot fixes when needed.
  • Deploying the bot.
  • Actively coordinating with all the teams and supporting them when needed.
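As a minimal sketch, such an endpoint could look like the following. The route name is arbitrary, and the helper functions are the illustrative pieces sketched earlier in this post (with vectorize_image and get_artwork_info_by_id assumed as in-memory variants of those helpers), not necessarily our exact code.

```python
import cv2
import numpy as np
from fastapi import FastAPI, File, UploadFile
from fastapi.responses import FileResponse

app = FastAPI()

@app.post("/describe")
async def describe_artwork(file: UploadFile = File(...)):
    # Decode the uploaded photo into an OpenCV image.
    data = np.frombuffer(await file.read(), dtype=np.uint8)
    image = cv2.imdecode(data, cv2.IMREAD_COLOR)

    # Run the pipeline: crop, embed, search, describe, synthesize.
    cropped = crop_main_region(image)
    match_id = find_match(vectorize_image(cropped))  # assumed in-memory variant of vectorize()
    if match_id is None:
        return {"detail": "Artwork not found in our database."}

    meta = get_artwork_info_by_id(match_id)          # hypothetical lookup by index id
    audio_path = synthesize(build_description(meta))
    return FileResponse(audio_path, media_type="audio/mpeg")
```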

In Retrospect

In summary, our hackathon brought together art, machine learning, and collaborative teamwork. The "Art Guide" project demonstrates the potential to augment art experiences through technology.

Our technical journey encompassed data collection, reverse image search, segmentation, description generation, and text-to-speech conversion. In general, we solved most tasks by using pretrained models and existing open-source algorithms. We can definitely improve the quality of the bot by labelling data and fine-tuning some of the models, as well as by experimenting with other options for each subtask.

The collaborative spirit of our team was pivotal, encapsulating a unified approach across various disciplines and technical stages.

Acknowledgements

I want to thank our teacher, Alexander Guschin, for bringing this interesting task to us and for supporting all of us.