###### tags: `research`
# THE REPORT
## My conclusion and humble idea about the process
This is my idea of how I think the process should work:
1. Put the data into the model and get it back with the extra informative columns.
2. Transform each row into an object (a review object, for example), collect the objects into an array, and send it to the frontend through the API endpoints (when requested by the frontend, of course).
3. In the frontend, using React, display the different tables and data on the webpage.
4. There will be a button that, when clicked, allows the user to export the report to a PDF.
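Steps 1-2 could be sketched roughly like this (the column names and review fields below are made-up placeholders, not the real model output):

```python
import json

import pandas as pd

# Pretend this is the model output with the extra informative columns
df = pd.DataFrame({
    "text": ["Great battery life", "The screen is too dim"],
    "aspect": ["battery", "screen"],        # column added by the model
    "sentiment": ["positive", "negative"],  # column added by the model
})

# Step 2: turn each row into a review object and collect them in an array
reviews = df.to_dict(orient="records")

# The JSON payload an API endpoint would send to the frontend on request
payload = json.dumps(reviews)
```

The resulting `payload` string is what an API endpoint (in Flask, FastAPI, or whatever we pick) would return to the React frontend.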
After some investigation, I have found two libraries that allow us to do this (they are very similar; maybe we could even mix them together):
- react-pdf (short tutorial: https://blog.logrocket.com/generating-pdfs-react/) - I don't know how easy it would be to add charts, since it doesn't provide a chart component
- jspdf (short tutorial: https://www.freecodecamp.org/news/how-to-create-pdf-reports-in-react/) - charts are easy, but duplicating pages may be hard; to investigate further
After looking at the other options for generating the report in the backend, I personally believe this one to be the best, as it lets us build both the website and the report at the same time (and reuse components!).
Also, this is a very common problem, so there is plenty of documentation and plenty of React examples about it.
## My conclusion and humble idea about the design
Intro page:
> *(mockup image missing)*
Operating systems, device and browser page (to duplicate):
> *(mockup image missing)*
Page used per category:
> *(mockup image missing)*
# Other solutions (backend-wise)
Some of the possible Python libraries that allow making reports out of data are: ReportLab, PyPDF2, pdfdocument, FPDF, and stitch.
Some tutorials and extra information:
- Tutorial using Python and pandas: https://plotly.com/python/v3/html-reports/
- About the libraries, and specifically FPDF: https://stackoverflow.com/questions/51864730/python-what-is-the-process-to-create-pdf-reports-with-charts-from-a-db
- About stitch https://pystitch.github.io/
## In more detail, option 1: FPDF (library)
Code example:
```python
import pandas as pd
from fpdf import FPDF
from pylab import title, xlabel, ylabel, xticks, bar, legend, axis, savefig

# Create the dataframe
df = pd.DataFrame()
df['Question'] = ["Q1", "Q2", "Q3", "Q4"]
df['Charles'] = [3, 4, 5, 3]
df['Mike'] = [3, 3, 4, 4]

# Set the titles
title("Professor Criss's Ratings by Users")
xlabel('Question Number')
ylabel('Score')

# Set the bar positions
c = [2.0, 4.0, 6.0, 8.0]
m = [x - 0.5 for x in c]

# Build the chart - note how we have to size everything by hand
xticks(c, df['Question'])
bar(m, df['Mike'], width=0.5, color="#91eb87", label="Mike")
bar(c, df['Charles'], width=0.5, color="#eb879c", label="Charles")
legend()
axis([0, 10, 0, 8])
savefig('barchart.png')

# Start creating the pdf
pdf = FPDF()
pdf.add_page()
pdf.set_xy(0, 0)  # note how we have to position everything cell by cell
pdf.set_font('arial', 'B', 12)
pdf.cell(60)
pdf.cell(75, 10, "A Tabular and Graphical Report of Professor Criss's Ratings by Users Charles and Mike", 0, 2, 'C')
pdf.cell(90, 10, " ", 0, 2, 'C')
pdf.cell(-40)
pdf.cell(50, 10, 'Question', 1, 0, 'C')
pdf.cell(40, 10, 'Charles', 1, 0, 'C')
pdf.cell(40, 10, 'Mike', 1, 2, 'C')
pdf.cell(-90)
pdf.set_font('arial', '', 12)
for i in range(len(df)):
    pdf.cell(50, 10, '%s' % (df['Question'].iloc[i]), 1, 0, 'C')
    pdf.cell(40, 10, '%s' % (str(df.Mike.iloc[i])), 1, 0, 'C')
    pdf.cell(40, 10, '%s' % (str(df.Charles.iloc[i])), 1, 2, 'C')
    pdf.cell(-90)
pdf.cell(90, 10, " ", 0, 2, 'C')
pdf.cell(-30)
pdf.image('barchart.png', x=None, y=None, w=0, h=0, type='', link='')
pdf.output('test.pdf', 'F')
```
The output looks like this:

Pros:
- It's Python and can run directly after using the model.
- Does exactly what we want
- Easy 'language' to learn
- I'm finding lots of documentation
Cons:
- We would have to lay everything out manually, so it would take a lot of time
- It also seems hard to adapt this to CSVs of different sizes. For example, imagine we want to display a varying number of comments: since we have to position every cell by hand, comments of varying length might break the whole layout -> to investigate further; it might not even be a problem if we cap the number of comments displayed in the PDF and show all of them on the website.
## In more detail, option 2: Stitch (markup language)
Code example:
> Stitch is simple and great
(##) Useful markup language
You can use markdown syntax, such as **bold**, _italic_, ~~Strikethrough~~
(##) Display dataframes
Direct output from Python will be rendered nicely.
```{python, echo=False}
import pandas as pd
df = pd.DataFrame()
df['Question'] = ["Q1", "Q2", "Q3", "Q4"]
df['Charles'] = [3, 4, 5, 3]
df['Mike'] = [3, 3, 4, 4]
df = df.set_index('Question')
df.style
df
```
(##) Display graphics
Direct matplotlib output, without rendering to file.
```{python, echo=False}
#%matplotlib inline
df.plot.bar(title="Professor Criss's Ratings by Users")
None
```
(##) Symbolic expressions
You may also want to work with sympy:
```{python, echo=False}
import sympy
sympy.init_printing()
x = sympy.Symbol('x')
sympy.integrate(sympy.sqrt(1/sympy.sin(x**2)))
```
Here is how it looks:

Pros:
- WAY better looking and simpler to use; it will definitely take less time
- It's a markup language, but it lets us embed Python code for plots directly inside, which makes our life easier
Cons:
- We need to learn a 'new' language, and I don't know how limited it is.
# Research
## Aspect-Based Sentiment Analysis
### Definition:
Aspect-based sentiment analysis (ABSA) is a text analysis technique that categorizes data by aspect and identifies the sentiment attributed to each one. Aspect-based sentiment analysis can be used to analyze customer feedback by associating specific sentiments with different aspects of a product or service.
Here’s a breakdown of what aspect-based sentiment analysis can extract:
- Sentiments: positive or negative opinions about a particular aspect
- Aspects: the category, feature, or topic that is being talked about
**Source:** https://monkeylearn.com/blog/aspect-based-sentiment-analysis/
<u>This is exactly what we are trying to do in our project.</u>
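As a toy illustration of the aspect/sentiment pairs described above (the keyword lists are hardcoded and purely illustrative; real ABSA uses trained models, but the output shape is the same idea):

```python
# Invented keyword lists for the illustration
ASPECT_KEYWORDS = {
    "battery": ["battery", "charge"],
    "screen": ["screen", "display"],
}
POSITIVE = {"great", "good", "amazing"}
NEGATIVE = {"bad", "dim", "terrible"}

def analyze(review: str) -> dict:
    """Return the (aspect, sentiment) pair for one review."""
    words = review.lower().split()
    # Aspect: first category whose keywords appear in the review
    aspect = next((a for a, kws in ASPECT_KEYWORDS.items()
                   if any(k in words for k in kws)), "other")
    # Sentiment: crude lexicon lookup
    if any(w in POSITIVE for w in words):
        sentiment = "positive"
    elif any(w in NEGATIVE for w in words):
        sentiment = "negative"
    else:
        sentiment = "neutral"
    return {"aspect": aspect, "sentiment": sentiment}
```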
### Research:
I have found a video that gives a great plan for how we could tackle this problem. It uses multiple machine learning models, on top of several data mining algorithms, to achieve the best possible result.
**Video:** https://www.youtube.com/watch?v=vM5ykPFBWp0&ab_channel=BoardInfinity
#### In summary:
This is the overall model that was used to classify unlabeled reviews into "aspects" (categories) and give a sentiment analysis on them:

The preprocessing included these steps:

Then they get the topics from the unlabeled data through "Topic Modelling" using LDA which stands for "Latent Dirichlet Allocation" (and not Linear Discriminant Analysis):

This process requires manually creating the topics, and allocating certain key words to indicate it is part of this topic:

In order to assign labels to the data we use Embedding, Cosine Similarity, and Word2Vec. Here we can see a visual representation of Embeddings:

The visualization of Cosine Similarity:

To my understanding, we train a neural network with Word2Vec using all the manually assigned keywords for a specific topic, and represent the result as a vector. Then we take a review and put it through a similar process. Using cosine similarity, we measure which topic it is most closely related to, and use that to label which topic the review belongs to.
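The cosine-similarity step described above, sketched with toy vectors (the real topic and review vectors would come from the trained Word2Vec model, not be hardcoded):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(theta) = (a . b) / (|a| * |b|); 1 = same direction, 0 = unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "topic" vectors, e.g. the average of each topic's keyword embeddings
topic_vectors = {
    "battery": np.array([0.9, 0.1, 0.0]),
    "screen": np.array([0.1, 0.9, 0.2]),
}

# Toy review vector, e.g. the average of the review's word embeddings
review_vector = np.array([0.8, 0.2, 0.1])

# Label the review with the topic whose vector is most similar
best_topic = max(topic_vectors,
                 key=lambda t: cosine_similarity(review_vector, topic_vectors[t]))
```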

The end result will look something like this:

If we can successfully achieve this, then it is just a matter of running sentiment analysis on the data, which should be possible through the existing pipelines.