# Project report week 3
Team: Book friending bot
Members: Alla Chepurova, Tihonov Nikita, Smirnov Nikita
## News
At this stage we got a new team member to help us deploy our app. Welcome, Nikita Smirnov!
## Current progress
### Nikita T.
I valiantly proceeded with dataset collection, but it kept failing again and again. We felt like fools, having spent so much time on something that did not work at all, without even being able to find out why.
But finally I found the reason: Litres (and other ebook shops) do not have an open API; instead, you have to be their partner to get access. This is not mentioned anywhere in the documentation. I wrote an email to Litres and got the following response:

Then I wrote another email:

And there has been no response at all so far.
### Alla
- As the idea of parsing Litres (as well as other apps) failed, I spent a huge amount of time searching for a dataset that fits our needs. Surprisingly, it was not as easy as expected, since we needed books with a specific set of features (all the genres, a plot summary, etc.). But finally I managed to find an appropriate dataset, unfortunately in English. It consists of book pages scraped from Wikipedia. The fields are the following:
| book_name | author | genres | plot_summary |
| -------- | -------- | --------| -------- |
You can find the dataset converted into JSON format, along with the conversion notebook, [here](https://drive.google.com/drive/folders/1tNL4IEQV1Lov_zEJe80jWCwai5T-jfN0?usp=sharing). An example record is sketched below.
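For illustration, a converted record might look roughly like this (all field values here are invented, not taken from the dataset):

```python
# A hypothetical record from the converted JSON (values invented for illustration)
book = {
    "book_name": "Dune",
    "author": "Frank Herbert",
    "genres": ["Science Fiction", "Adventure"],
    "plot_summary": "Set in the distant future, the novel follows Paul Atreides...",
}
```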
- I researched how we can perform user ranking. The first idea was to vectorize books by all the features they have. After that, we could compute the mean book vector for each user and retrieve the most similar users with an Annoy tree or even simple KNN methods (see the sketch below). But for this we would need embeddings that share the same semantic space for both words and paragraphs (genres and plot summaries), so that everything can be placed in one space. Unfortunately, SOTA BERT-like embeddings are not appropriate for this task, but doc2vec actually is able to align word and text spaces into one. While intending to stick with this solution, I realized an important thing: this approach cannot take into account the authors a specific user prefers, yet it is important for some readers to find friends who like similar authors.
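A minimal sketch of that first approach, assuming each book already has a vector (all data here is hypothetical; the `annoy` package provides the index):

```python
import numpy as np
from annoy import AnnoyIndex

DIM = 64  # dimensionality of the book vectors (an assumption for the sketch)

# Hypothetical reading histories: user -> vectors of the books they liked
user_books = {
    "alice": [np.random.rand(DIM) for _ in range(5)],
    "bob":   [np.random.rand(DIM) for _ in range(3)],
    "carol": [np.random.rand(DIM) for _ in range(7)],
}

# Represent each user by the mean vector of their books
user_ids = list(user_books)
user_vectors = [np.mean(vectors, axis=0) for vectors in user_books.values()]

# Build an Annoy index over user vectors for fast nearest-neighbour search
index = AnnoyIndex(DIM, "angular")  # angular distance ~ cosine similarity
for i, vector in enumerate(user_vectors):
    index.add_item(i, vector)
index.build(10)  # 10 trees

# Retrieve the users most similar to "alice" (the query itself is excluded)
query = user_ids.index("alice")
neighbours = index.get_nns_by_item(query, 3)
print([user_ids[i] for i in neighbours if i != query])
```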
- And after that I came up with the following ranking scheme:

*Weighted IoU is intersection over union computed on multisets (sets with repeating elements).
Then we combine these three resulting scores into one using the formula:
$\text{ranking\_score} = w_d \cdot \text{score}_d + w_g \cdot \text{score}_g + w_a \cdot \text{score}_a$
And here comes the main trick: the weights of the three scores, $w_d, w_g, w_a$, are adaptive. This means that at the profile registration stage a user can specify how important each of the three book aspects (plot, genres, author) is to them. Importantly, $w_d + w_g + w_a = 1$. We can present this to the user as a scale, like in popular games where you cannot max out all abilities at once (smartness, luck, health, etc.). A sketch of the whole combination is given below.
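Here is a small sketch of how this combination might look in code. The function and variable names are mine, not from our codebase, and using cosine similarity of mean plot vectors for the description score is my assumption about the scheme above; the weighted IoU is implemented with multisets via `collections.Counter`:

```python
from collections import Counter
import numpy as np

def weighted_iou(a, b):
    """Intersection over union on multisets (sequences with repeats)."""
    ca, cb = Counter(a), Counter(b)
    intersection = sum((ca & cb).values())
    union = sum((ca | cb).values())
    return intersection / union if union else 0.0

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def ranking_score(user_a, user_b, w_d, w_g, w_a):
    """Combine description, genre and author scores with adaptive weights.

    The weights come from the user's profile settings and must satisfy
    w_d + w_g + w_a = 1.
    """
    assert abs(w_d + w_g + w_a - 1.0) < 1e-6
    score_d = cosine(user_a["mean_plot_vector"], user_b["mean_plot_vector"])
    score_g = weighted_iou(user_a["genres"], user_b["genres"])
    score_a = weighted_iou(user_a["authors"], user_b["authors"])
    return w_d * score_d + w_g * score_g + w_a * score_a
```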

- The next challenge for me was to find a way to embed the books' plot summaries. After conducting some research, I found what seemed to be the most appropriate way of doing this: using [Longformer](https://huggingface.co/transformers/model_doc/longformer.html). I even wrote some scripts for converting the data JSONs into JSONs containing embeddings. But afterwards I found out that [Longformer's embeddings are not applicable](https://github.com/allenai/longformer/issues/43) to cosine-similarity comparisons, due to the way the model was trained.
- My next options were SBERT and doc2vec embeddings. I decided to stick with the first one. I converted the data to vectorized form using SBERT embeddings in [this notebook](https://); a sketch of this step is shown below. The embedded books in JSON format are [here](https://drive.google.com/file/d/16vVviLPgRPDQXAG21f0XLbTQX2M24_zw/view?usp=sharing).
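The embedding step looks roughly like this (a sketch: the checkpoint name and file names are examples, not necessarily the ones used in the notebook):

```python
import json
from sentence_transformers import SentenceTransformer

# Example checkpoint; the actual model used in the notebook may differ
model = SentenceTransformer("all-MiniLM-L6-v2")

with open("books.json") as f:  # hypothetical input file name
    books = json.load(f)

summaries = [book["plot_summary"] for book in books]
embeddings = model.encode(summaries, show_progress_bar=True)

for book, embedding in zip(books, embeddings):
    book["embedding"] = embedding.tolist()  # make the vector JSON-serializable

with open("books_embedded.json", "w") as f:
    json.dump(books, f)
```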
Some interesting examples of clustered embeddings (you can see more in the [notebook](https://colab.research.google.com/drive/1KPZlJPu8meh3BRDXK1FpRg0AIX-w2yEX?usp=sharing) with interactive graphs):
1. Even in a subset of only 100 books some patterns can be seen: the lower books are speculative fiction (the Dune trilogy, The Lord of the Rings, Greek heroic epics like the Odyssey, and Othello).

Interestingly, the book "Farmer Giles of Ham" is also by Tolkien:


In the upper corner there is also speculative fiction, but with more sci-fi themes:

The algorithm also managed to put all the James Bond books into one separate cluster:


A lot of Star Wars books are clustered together, not far from the classic sci-fi books:

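For reference, a 2-D view like the ones in the notebook can be produced along these lines (a sketch with stand-in data; whether the notebook uses t-SNE or another projection is my assumption):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stand-ins for the real data: SBERT embeddings and book titles
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(100, 384))  # 384 = MiniLM embedding size (assumption)
names = [f"book_{i}" for i in range(100)]

# Project the high-dimensional embeddings down to 2-D for plotting
coords = TSNE(n_components=2, random_state=42).fit_transform(embeddings)

plt.figure(figsize=(10, 8))
plt.scatter(coords[:, 0], coords[:, 1], s=10)
for (x, y), name in zip(coords, names):
    plt.annotate(name, (x, y), fontsize=6)
plt.title("Book plot-summary embeddings, 2-D projection")
plt.show()
```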
### Nikita S.
- I have initialized a cluster in MongoDB Atlas and created all the necessary collections for our app.
- All book embeddings were serialized, put into a JSON file along with the books' descriptions, and uploaded to our cluster (a sketch of the upload is shown below).
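A rough sketch of the upload step with `pymongo` (the connection string, database, and collection names are placeholders, not our real configuration):

```python
import json
from pymongo import MongoClient

# Placeholder connection string; the real one points to our Atlas cluster
client = MongoClient("mongodb+srv://<user>:<password>@cluster0.example.mongodb.net")
db = client["book_friending_bot"]  # hypothetical database name
books = db["books"]                # hypothetical collection name

# Load the serialized books (descriptions + embeddings) and upload them
with open("books_embedded.json") as f:
    documents = json.load(f)

books.insert_many(documents)
print(f"Uploaded {books.count_documents({})} books")
```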