
Senior Data Scientist / Data Engineer Assessment
Common Crawl Data Analysis

Please sign up to be eligible for up to HK$6,000 completion bonus.
This assessment awards an Advanced level certificate.

You are required to use the Common Crawl January 2017 dataset for this test, which is publicly available on Amazon Web Services. It contains 3.14 billion web pages and about 250 TiB of uncompressed content, so this challenge requires good engineering skills.
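To get a feel for the scale involved, the two figures above can be turned into a rough sizing estimate. The sketch below is only a back-of-the-envelope calculation: the page count and total size come from the text, while the worker count and per-worker throughput are hypothetical values chosen for illustration.

```python
# Back-of-the-envelope sizing from the figures quoted in the brief.
PAGES = 3.14e9                # web pages in the crawl (from the text)
TOTAL_BYTES = 250 * 1024**4   # about 250 TiB of uncompressed content (from the text)

# ASSUMPTIONS (hypothetical, not from the brief): cluster size and throughput.
WORKERS = 100
WORKER_MIB_PER_SEC = 50


def avg_page_kib(total_bytes: float = TOTAL_BYTES, pages: float = PAGES) -> float:
    """Average uncompressed size of one page, in KiB."""
    return total_bytes / pages / 1024


def full_scan_hours(
    total_bytes: float = TOTAL_BYTES,
    workers: int = WORKERS,
    mib_per_sec: float = WORKER_MIB_PER_SEC,
) -> float:
    """Hours for one pass over the data, assuming perfectly parallel reads."""
    bytes_per_sec = workers * mib_per_sec * 1024**2
    return total_bytes / bytes_per_sec / 3600


if __name__ == "__main__":
    print(f"average page size: {avg_page_kib():.1f} KiB")
    print(f"one full scan:     {full_scan_hours():.1f} hours")
```

Under these assumed numbers a single full pass takes many hours, which suggests filtering early (for example, by URL or domain) before doing any expensive per-page work.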

Your goal is to analyze popular job boards worldwide and answer the following questions:

  • How do you identify a job board website or domain?
  • What are the 100 most popular job boards? How do you define "most popular"?
  • If a candidate has the following skills: "Hadoop", "Apache Spark" and "TensorFlow", what job board would you recommend to them? Based on what criteria?
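One possible starting point for the first question is a simple heuristic: score each page on URL keywords, typical job-posting phrases, and the presence of schema.org `JobPosting` structured data, then flag domains where most pages score highly. The sketch below is an illustration only. The keyword lists, the thresholds, and the function names `page_score` and `job_board_domains` are assumptions, not a validated classifier.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# ASSUMPTION: illustrative keyword lists; a real system would tune or learn these.
URL_HINTS = re.compile(r"job|career|vacanc|recruit|hiring", re.IGNORECASE)
TEXT_HINTS = re.compile(
    r"\b(apply now|job description|full[- ]time|salary)\b", re.IGNORECASE
)
# schema.org defines a JobPosting type; match microdata or JSON-LD usage.
SCHEMA_HINT = re.compile(
    r'schema\.org/JobPosting|"@type"\s*:\s*"JobPosting"', re.IGNORECASE
)


def page_score(url: str, html: str) -> int:
    """Heuristic score: higher means the page looks more like a job posting."""
    score = 0
    if URL_HINTS.search(url):
        score += 1
    if TEXT_HINTS.search(html):
        score += 1
    if SCHEMA_HINT.search(html):
        score += 2  # structured data is the strongest signal used here
    return score


def job_board_domains(pages, min_pages=2, min_ratio=0.5):
    """Return domains where enough pages look like job postings.

    pages: iterable of (url, html) pairs.
    """
    total = defaultdict(int)
    hits = defaultdict(int)
    for url, html in pages:
        domain = urlparse(url).netloc.lower()
        if not domain:
            continue
        total[domain] += 1
        if page_score(url, html) >= 2:
            hits[domain] += 1
    return {
        d for d in total
        if hits[d] >= min_pages and hits[d] / total[d] >= min_ratio
    }


if __name__ == "__main__":
    sample = [
        ("http://jobs.example.com/view/1",
         '<div itemtype="http://schema.org/JobPosting">Apply now</div>'),
        ("http://jobs.example.com/view/2",
         "<p>Job description: full-time, salary negotiable</p>"),
        ("http://news.example.org/a", "<p>Weather today</p>"),
    ]
    print(job_board_domains(sample))
```

The ratio threshold is meant to separate dedicated job boards from company sites that merely host a careers page. Whether that distinction matters is part of the design decision you are asked to document.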

Guidelines

  • You are expected to deliver source code and project build instructions.
  • Your documentation should explain your thoughts and system design.
  • You may use any programming language and any database.
    • We recommend looking into Spark and Amazon EMR.
  • Results should be presented in PowerPoint or PDF format.

Deliverables

1. Top 100 Job Boards Worldwide

You are required to define a criterion for popularity and display the top 100 job boards according to it. Explain your choice of criterion.
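As one example of a popularity criterion, job boards could be ranked by the number of distinct external domains that link to them. This is only one candidate; page counts or posting counts are equally defensible. The sketch below assumes link edges have already been extracted as (source domain, target domain) pairs, and the function name `rank_by_referring_domains` is hypothetical.

```python
from collections import defaultdict


def rank_by_referring_domains(links, candidates, top_n=100):
    """Rank candidate job boards by distinct external domains linking to them.

    links: iterable of (source_domain, target_domain) pairs.
    candidates: set of domains already identified as job boards.
    Returns a list of (domain, referring_domain_count), best first.
    """
    referrers = defaultdict(set)
    for src, dst in links:
        # Ignore self-links and links to non-candidates.
        if dst in candidates and src != dst:
            referrers[dst].add(src)
    # Sort by count descending, then by name for a stable order on ties.
    ranked = sorted(referrers.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(domain, len(sources)) for domain, sources in ranked[:top_n]]


if __name__ == "__main__":
    edges = [
        ("a.com", "jobs1.com"),
        ("b.com", "jobs1.com"),
        ("a.com", "jobs1.com"),      # duplicate edge, counted once
        ("a.com", "jobs2.com"),
        ("jobs2.com", "jobs2.com"),  # self-link, ignored
    ]
    print(rank_by_referring_domains(edges, {"jobs1.com", "jobs2.com"}))
```

Counting distinct referrers rather than raw links makes the measure less sensitive to a single site linking many times, which is one reason to prefer it over a plain link count.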

2. Job Recommendation Engine

Implement a job recommendation engine. Given a set of skills, it should recommend related jobs and job boards. For example, if a candidate applies for a Data Science position, the system might recommend similar positions on other job boards, other companies within the same board, or related roles, such as business intelligence analyst.

Each recommendation should be made in a reasonable time (< 5 minutes); the model training phase, however, may take much longer. In addition to the code, include several ideas for future improvements.
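A minimal baseline for such an engine, assuming job postings have already been extracted as (board, text) pairs, is to weight each requested skill by how rare it is across postings and sum the weights per board. The sketch below is a simple IDF-weighted keyword match, not a trained model. The function name `recommend_boards` and the sample data are hypothetical.

```python
import math
from collections import Counter


def recommend_boards(skills, postings, top_n=3):
    """Rank job boards by IDF-weighted skill matches across their postings.

    skills: list of skill strings, e.g. ["Hadoop", "Apache Spark"].
    postings: iterable of (board, posting_text) pairs.
    Returns a list of (board, score), best first; boards with no match are omitted.
    """
    postings = [(board, text.lower()) for board, text in postings]
    skills = [s.lower() for s in skills]
    n = len(postings)

    # Document frequency: in how many postings does each skill appear?
    df = Counter(s for _, text in postings for s in skills if s in text)
    # Smoothed IDF, so rarer skills contribute more to the score.
    idf = {s: math.log((1 + n) / (1 + df[s])) + 1 for s in skills}

    scores = Counter()
    for board, text in postings:
        scores[board] += sum(idf[s] for s in skills if s in text)

    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(b, round(sc, 3)) for b, sc in ranked[:top_n] if sc > 0]


if __name__ == "__main__":
    sample = [
        ("boardA", "Data engineer with Hadoop and Apache Spark"),
        ("boardA", "ML engineer: TensorFlow, Apache Spark"),
        ("boardB", "Frontend developer, React"),
        ("boardB", "Data analyst, Hadoop"),
    ]
    print(recommend_boards(["Hadoop", "Apache Spark", "Tensorflow"], sample))
```

Scoring is a single linear pass over the postings, so with a precomputed index it should fit comfortably within the time limit. Substring matching is crude (it would match "spark" inside unrelated words), which makes tokenization and skill normalization natural candidates for the future improvements list.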

3. Presentation

As noted above, results should be presented in PowerPoint or PDF format.

Submission

  1. You can find our grading guidelines at https://t1.gl/review.
  2. Submit your assessment at https://t1.gl/submit-assessment.

Copyright © 2016-2020 Terminal 1 Limited.