
Messenger Bot: Example Bot with Web Scraping

tags: MessengerBot

To extend the Echo Bot, I made a RateMyProf bot using the same tools and skills, plus a little bit of web scraping.


The Process

Goal: Input a professor's name and output a link to their RateMyProfessors page.

Since I want to send a link back to the user, I'll probably need:

  • send_buttons()
  • URL button
  • Scrape the ratemyprofessors.com website to grab the specific information I need

Web Scraping - BeautifulSoup

Beautiful Soup is a library that makes it easy to get information from web pages.

BeautifulSoup Documentation

Install dependencies (in the terminal):

pip3 install --user beautifulsoup4 requests

Steps of the web scraping process (a minimal end-to-end sketch follows this list):

  1. Make a request to a URL with the requests module.
  2. Retrieve the HTML content of the response as text.
  3. Examine the HTML structure to identify the particular element that holds the data you want. To do this, right-click the web page in the browser and select Inspect to view the structure.
  4. Use BeautifulSoup to find that element in the response and extract its text.
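
Here's a minimal sketch of those four steps against a placeholder page. The URL and the h1 tag are stand-ins for illustration only; the RateMyProfessors-specific selectors come later.

import requests
from bs4 import BeautifulSoup

# 1. Make a request (placeholder URL)
page = requests.get("https://example.com")

# 2. The HTML content is available as text
html = page.text

# 3 & 4. Parse the HTML and extract the text of a particular element
soup = BeautifulSoup(html, "html.parser")
heading = soup.find("h1")  # the element you identified via Inspect
if heading is not None:
    print(heading.get_text())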

Import dependencies

import requests
from bs4 import BeautifulSoup

RateMyProf Website

I'll use Prof. Schurgers as an example: http://www.ratemyprofessors.com/ShowRatings.jsp?tid=605064

Notice that every professor's page URL is distinguished by an id number (tid) appended to the end of the URL. However, we don't know this number ahead of time! Instead, we want to search by name.

How can we do that? The search bar! 🔍


Search Query

Searching for "curt schurgers" in ratemyprofessors.com's search bar returns this URL:
http://www.ratemyprofessors.com/search.jsp?query=curt+schurgers

Notice that the search query is structured as follows: http://www.ratemyprofessors.com/search.jsp?query=[YOUR SEARCH HERE]

So, the first step is for my bot to be able to return the search query for an input name.

From the Echo Bot, we know we can get the user's input using messaging_event['message']['text']

Now, we only need to process that string and append it to the end of the search URL.

# Split the input string into a list of lowercase words
words = messaging_event['message']['text'].lower().split()

# Join the words back into a single string, separated by +
prof_search_query = "+".join(words)
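
Joining on "+" works for plain names, but a name with accents or punctuation wouldn't be encoded correctly. As an optional, more robust variant (my addition, not part of the original bot), the standard library's urllib.parse.quote_plus builds the query string safely:

from urllib.parse import quote_plus

# quote_plus replaces spaces with + and percent-encodes any special characters
raw_name = messaging_event['message']['text'].lower()
prof_search_query = quote_plus(raw_name)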

Making the Request

# Base url for Rate My Professor search
rmp_url = "http://www.ratemyprofessors.com/search.jsp?query="

# Make the request to get the page
page = requests.get(rmp_url + prof_search_query)

# Use BeautifulSoup to get the HTML content
soup = BeautifulSoup(page.content, 'html.parser')
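
requests won't complain on its own if the search page comes back with an error status, so BeautifulSoup would happily parse an error page. A slightly more defensive version of the request (my own hedge, not in the original code) looks like this:

# Fail loudly on HTTP errors (4xx/5xx) and avoid hanging forever on a slow response
page = requests.get(rmp_url + prof_search_query, timeout=10)
page.raise_for_status()
soup = BeautifulSoup(page.content, 'html.parser')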

Parsing the HTML Content

# Get the href suffix of the first professor
prof_link = soup.find('li', {'class': 'listing PROFESSOR'}).findChild('a', href=True, recursive=False).get('href')

# Append suffix to get full url
full_rmp_link = "https://www.ratemyprofessors.com" + prof_link

# Define the WEB_URL button
buttons = [
    ActionButton(ButtonType.WEB_URL, "Rate My Professor", url=full_rmp_link)
]

# Send the link as a button
messager.send_buttons(sender_id, "Here's the RMP Link: ", buttons)
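
One caveat: soup.find() returns None when no professor matches the search, so the chained .findChild(...) call above would throw an AttributeError on a misspelled name. A guarded sketch (my addition; the fallback reply is left to whatever plain-text send helper your bot already has) might look like:

# Guard against an empty search result before following the link
listing = soup.find('li', {'class': 'listing PROFESSOR'})
if listing is not None:
    prof_link = listing.findChild('a', href=True, recursive=False).get('href')
    full_rmp_link = "https://www.ratemyprofessors.com" + prof_link
    buttons = [ActionButton(ButtonType.WEB_URL, "Rate My Professor", url=full_rmp_link)]
    messager.send_buttons(sender_id, "Here's the RMP Link: ", buttons)
# else: no match -- reply with a plain text message using your bot's existing send method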

Full Code:

# Split the input string into a list of lowercase words
words = messaging_event['message']['text'].lower().split()

# Join the words back into a single string, separated by +
prof_search_query = "+".join(words)

# Base url for Rate My Professor search
rmp_url = "https://www.ratemyprofessors.com/search.jsp?query="

# Make the request to get the page
page = requests.get(rmp_url + prof_search_query)

# Use BeautifulSoup to parse the HTML content
soup = BeautifulSoup(page.content, 'html.parser')

# Get the href suffix of the first professor
prof_link = soup.find('li', {'class': 'listing PROFESSOR'}).findChild('a', href=True, recursive=False).get('href')

# Append suffix to get full url
full_rmp_link = "https://www.ratemyprofessors.com" + prof_link

# Define the WEB_URL button
buttons = [ActionButton(ButtonType.WEB_URL, "Rate My Professor", url=full_rmp_link)]

# Send the link as a button
messager.send_buttons(sender_id, "Here's the RMP Link: ", buttons)

Warning ⚠: Some websites can't be scraped with BeautifulSoup alone, because they generate their content dynamically with JavaScript.

For these, you'd need to use another tool like Selenium to do the scraping. Check out this tutorial.
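
For a sense of what that looks like, here's a rough Selenium sketch (assuming a browser driver such as chromedriver is installed; the URL is a placeholder and this isn't used in this bot). Selenium renders the page in a real browser, and BeautifulSoup then parses the rendered HTML:

from selenium import webdriver
from bs4 import BeautifulSoup

# Launch a browser and load the page so its JavaScript can run
driver = webdriver.Chrome()
driver.get("https://example.com/dynamic-page")  # placeholder URL

# page_source holds the rendered HTML, which BeautifulSoup can parse as usual
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()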

Another Warning ⚠: Unfortunately, Selenium is not compatible with Glitch.com, so you might want to consider alternative options (see below).

Alternative options:

  • Using an API (next week's tutorial 😎)
  • (Sometimes) Manually parsing the raw response text (e.g. page.text) returned by requests