Design a Twitter crawler

This document is used to record the development process.

1: Use cases and constraints

Gather requirements and scope the problem.

Use cases

We'll scope the problem to handle the following use cases:

  • Monitor:
    • Monitor the activity (tweets, retweets, deletions) of multiple Twitter accounts.
  • Crawler:
    • Given a user account, collect its first 5 tweets (or retweets) along with their reply and like counts.
  • User app:
    • Add/delete the user accounts the system monitors.
    • Access/download all data that has been collected.
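The crawler use case above can be sketched in Python with a Tweepy-style client. This is a minimal sketch, not the final implementation: `TweetRecord` and `crawl_first_tweets` are hypothetical names, and the `api` argument is assumed to expose a `user_timeline(screen_name=..., count=...)` call (as Tweepy's v1.1 API does); here the returned statuses are modeled as plain dicts so the logic is easy to test.

```python
from dataclasses import dataclass


@dataclass
class TweetRecord:
    """Hypothetical record for one crawled tweet."""
    tweet_id: int
    text: str
    reply_count: int
    like_count: int


def crawl_first_tweets(api, screen_name, limit=5):
    """Collect the first `limit` tweets (or retweets) of an account.

    `api` is any object with a Tweepy-style user_timeline method; each
    status is assumed to be a dict-like object for this sketch.
    """
    records = []
    for status in api.user_timeline(screen_name=screen_name, count=limit):
        records.append(TweetRecord(
            tweet_id=status["id"],
            text=status["text"],
            reply_count=status.get("reply_count", 0),
            # Twitter's API calls likes "favorites"
            like_count=status.get("favorite_count", 0),
        ))
    return records[:limit]
```

A real deployment would pass an authenticated `tweepy.API` instance; the function itself only depends on the timeline call's shape, which keeps it testable offline.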

Constraints and assumptions

State assumptions

  • Support only anonymous users
  • The web crawler must not get stuck in an infinite loop
    • We get stuck in an infinite loop if the link graph contains a cycle and we never mark nodes as visited
  • 500 users to monitor
  • Poll each monitored user every 15 min
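The infinite-loop constraint above has a standard remedy: track visited nodes during traversal. A minimal sketch, assuming accounts form a graph (e.g. linked via retweets); `crawl_accounts` and the `neighbors` mapping are illustrative names, not part of the original design:

```python
from collections import deque


def crawl_accounts(start, neighbors):
    """Breadth-first traversal over accounts that cannot loop forever.

    `neighbors` maps an account to the accounts it links to. The visited
    set guarantees each account is processed at most once, so cycles in
    the graph cannot trap the crawler.
    """
    visited = set()
    queue = deque([start])
    order = []
    while queue:
        account = queue.popleft()
        if account in visited:
            continue  # already crawled; skip to break cycles
        visited.add(account)
        order.append(account)
        queue.extend(n for n in neighbors.get(account, []) if n not in visited)
    return order
```

In production the visited set would live in persistent storage (e.g. MongoDB) rather than memory, but the invariant is the same: never re-enqueue a node that has already been crawled.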

Calculate usage

  • ~2 MB of stored content per monitored account per day
    • An account posts about 1 tweet per day, but we fetch its first 5 tweets on each pass
    • 7 KB per tweet
    • 100 replies per tweet
    • 4 KB per reply
    • 5 tweets * (7 KB + 100 replies * 4 KB) ≈ 2 MB
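The arithmetic above can be checked directly. The extrapolation to the whole fleet of 500 monitored accounts is an extension of the original estimate, included here only to show the daily total stays around 1 GB, well inside the 40+ GB disk budget in section 4:

```python
# Figures from the usage estimate above.
TWEETS_PER_PASS = 5
TWEET_KB = 7
REPLIES_PER_TWEET = 100
REPLY_KB = 4
MONITORED_USERS = 500  # from the constraints section

per_account_kb = TWEETS_PER_PASS * (TWEET_KB + REPLIES_PER_TWEET * REPLY_KB)
per_account_mb = per_account_kb / 1024           # ≈ 2 MB per account per day

# Fleet-wide extrapolation (an assumption, not in the original doc):
fleet_gb_per_day = per_account_mb * MONITORED_USERS / 1024  # ≈ 1 GB per day
```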

2: Create a high level design

Outline a high level design with all important components.

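One plausible wiring of the components named in the use cases: the user app maintains the account list, the monitor decides which accounts are due for a crawl every 15 minutes, crawler workers fetch the tweets, and results land in MongoDB/MySQL. A minimal sketch of the scheduling piece (the `Monitor` class and its method names are hypothetical):

```python
import time


class Monitor:
    """Tracks when each account was last crawled and reports which are due.

    Each account is polled at most once per `interval_s` seconds
    (15 minutes by default, matching the constraints section).
    """

    def __init__(self, accounts, interval_s=15 * 60):
        self.interval_s = interval_s
        # 0.0 means "never crawled", so every account is due initially.
        self.last_run = {a: 0.0 for a in accounts}

    def due(self, now=None):
        """Return the accounts whose interval has elapsed."""
        now = time.time() if now is None else now
        return [a for a, t in self.last_run.items() if now - t >= self.interval_s]

    def mark(self, account, now=None):
        """Record that `account` was just crawled."""
        self.last_run[account] = time.time() if now is None else now
```

A driver loop would repeatedly call `due()`, hand each returned account to a crawler worker, then `mark()` it; taking `now` as a parameter keeps the scheduling logic deterministic and testable.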

3: Development environment

  • Python 3.7
  • MongoDB
  • MySQL
  • JavaScript
  • Tweepy, twarc
  • Ubuntu 14.04

4: Hardware requirements

  • 2 CPUs
  • 8 GB RAM
  • 40+ GB hard-drive space

5: Next step

The next step is to develop each core component.