# Ideas for Data Science Challenges
We need two challenges at each of the beginner, intermediate, and advanced levels.
# Beginner Challenges
## Beginner Challenge 1
For this challenge we're going to be making use of the popular UCI Machine Learning Repository. In particular, you'll be modeling a person's income as a function of several attributes collected during the 1994 Census. You can access the dataset [here](https://archive.ics.uci.edu/ml/datasets/Adult).
### Instructions
Your task is to predict whether an individual makes more or less than $50,000 per year as a function of the supplied attributes. You may utilize any Python packages you wish in order to complete this task. The deliverable should be a GitHub repository containing a single Dockerfile and a README describing how to build, run, and view your analysis. You may include the dataset within the GitHub repository. **Timebox your analysis to 4 to 6 hours.**
This classification problem must be tackled with two different classes of models (for example, a linear model and a tree-based ensemble). Your analysis should be presented within a Jupyter Notebook and touch upon the following topics: data ingestion, preprocessing and handling of missing data, modeling techniques and hyperparameter selection, model validation, and which model you have chosen and why.
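As a rough starting point, the two model classes can share one preprocessing pipeline so the comparison is apples-to-apples. The sketch below assumes the Adult data has already been read into a pandas DataFrame with an `income` target column; the specific models and hyperparameters are illustrative, not prescribed.

```python
# Sketch: comparing two classes of models on an Adult-style income task.
# Assumes `df` holds the Adult data with an `income` target column.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def compare_models(df: pd.DataFrame, target: str = "income") -> dict:
    """Cross-validate one linear and one tree-based classifier."""
    X, y = df.drop(columns=[target]), df[target]
    cat = X.select_dtypes(include="object").columns.tolist()
    num = X.select_dtypes(exclude="object").columns.tolist()
    prep = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat),  # categoricals
        ("num", StandardScaler(), num),                        # numerics
    ])
    scores = {}
    for name, model in {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    }.items():
        pipe = Pipeline([("prep", prep), ("clf", model)])
        scores[name] = cross_val_score(pipe, X, y, cv=3).mean()
    return scores
```

Wrapping preprocessing in the pipeline also keeps the cross-validation honest: encoders and scalers are fit only on each training fold, never on held-out data.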
## Beginner Challenge 2
For this challenge you'll be performing dimensionality reduction on an IoT network dataset. This task falls under the realm of unsupervised learning as we are not attempting to predict an output as a function of input. Rather, we are interested in understanding the structure of the inputs themselves so that we may reduce the number of columns in our dataset. The dataset can be found [here](https://archive.ics.uci.edu/ml/machine-learning-databases/00442/Philips_B120N10_Baby_Monitor/). __For this analysis we're going to focus on the `benign_traffic.csv` file.__
### Instructions
Your task is to perform dimensionality reduction on this dataset using two separate techniques. Explain your process and describe how each approach works. A graph should be created for each method illustrating the amount of _explained variance_ as a function of the selected variables/features/components. See [this Google search](https://www.google.com/search?q=explained+variance+pca) for further clarification.
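For one of the two techniques, PCA is a natural candidate. A minimal sketch of the explained-variance computation, assuming the CSV has already been loaded into a numeric feature matrix `X` (the function name is illustrative):

```python
# Sketch: cumulative explained variance from PCA. Assumes `X` is the
# numeric feature matrix loaded from benign_traffic.csv; plot the
# returned curve against component count to produce the required graph.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def explained_variance_curve(X: np.ndarray) -> np.ndarray:
    """Cumulative explained-variance ratio for 1..k components."""
    Xs = StandardScaler().fit_transform(X)  # PCA is scale-sensitive
    return np.cumsum(PCA().fit(Xs).explained_variance_ratio_)
```

An analogous curve for a second method (e.g., truncated SVD) covers the two-technique requirement; in both plots the x-axis is the number of retained components and the y-axis the cumulative explained variance.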
The deliverable should be a GitHub repository containing a single Dockerfile and a README describing how to build, run, and view your analysis. __Do not__ include the dataset within the GitHub repository. **Timebox your analysis to 4 to 6 hours.**
# Intermediate Challenges
## Intermediate Challenge 1
Investors and traders are always looking for an edge within the financial markets. A commonly used signal within the stock market is insider trading activity (the legal kind). The SEC requires that executives and other high-level individuals at publicly traded companies disclose their trades in public filings shortly after they occur, which discourages trading on information not yet released to the public. Nonetheless, it's often considered a _bullish_ signal when insiders buy additional stock, and a _bearish_ signal when insiders sell stock. The goal for this challenge is to leverage insider trading data and historical stock prices to see whether insider trading is actually an indicator of future price movements.
### Instructions
Data for this challenge will be consumed through a free financial API published by [Financial Modeling Prep](https://site.financialmodelingprep.com/developer). Create a free account to generate an API token so that you can pull the data. We'll be working with the [Historical Prices Endpoint](https://site.financialmodelingprep.com/developer/docs#Stock-Historical-Price) and the [Stock Insider Trading Endpoint](https://site.financialmodelingprep.com/developer/docs#Stock-Insider-Trading).
Your task is to consume historical data for 100 companies [within the S&P 500](https://site.financialmodelingprep.com/developer/docs#List-of-S&P-500-companies) and perform an analysis to determine whether insider trading is an indicator of future price movement. In other words, do stocks typically increase in value following insider purchasing, and/or do they typically decrease in value following insider selling?
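One simple framing: compare average forward returns following insider purchases against the unconditional average across all days. The sketch below assumes prices and insider trade dates have already been pulled from the two endpoints and parsed; the exact JSON field names should be taken from the Financial Modeling Prep docs, so the input structures here are assumptions.

```python
# Sketch: forward returns following insider purchases. Assumes `prices`
# is a Series of daily closes (ascending DatetimeIndex) and
# `trade_dates` is a list of insider purchase dates, both already
# parsed from the API responses.
import pandas as pd


def forward_returns(prices: pd.Series, trade_dates: list,
                    horizon: int = 20) -> pd.Series:
    """Percent change `horizon` trading days after each trade date."""
    prices = prices.sort_index()
    fwd = prices.shift(-horizon) / prices - 1.0  # forward return per day
    # align each trade date to the nearest following trading day
    idx = prices.index.searchsorted(pd.to_datetime(trade_dates))
    idx = idx[idx < len(prices)]
    return fwd.iloc[idx].dropna()
```

Comparing `forward_returns(close, purchase_dates).mean()` against the mean forward return over all days (and the analogous figure for sales) gives a first-pass answer to the bullish/bearish question, which you can then aggregate across the 100 tickers.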
The analysis and all dependencies should be packaged within a single Docker container with a corresponding README file describing how to build and run your analysis. Your analysis should be summarized within a Jupyter Notebook. **This task should be timeboxed to 4 to 6 hours.** Once complete, share your GitHub repository and we will build your container locally.
## Intermediate Challenge 2
Sporting events at stadiums cause increased traffic on nearby highways. To mitigate this, commuters can be advised to take a different route during these events.
If the highway patrol system is not connected to the sports complex, how can the sporting event times be determined? It turns out the highway patrol system monitors traffic at highway on-ramps. A sample of data from an on-ramp near the LA Dodgers stadium is hosted on the UCI ML dataset site [here](https://archive.ics.uci.edu/ml/datasets/Dodgers+Loop+Sensor).
> These loop sensor measurements were obtained from the Freeway Performance Measurement System (PeMS), "[Web Link](http://pems.dot.ca.gov/)"
### Instructions
Your task is to consume the full timeseries dataset and build a predictive model to determine the stop time of each game. Note that for some games traffic will begin leaving early, particularly in the case of a blowout (when one team leads the other by a wide margin).
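A reasonable first step is to flag surges in the loop-sensor counts relative to the typical level for that time of day; the start of a post-game surge approximates the game's stop time. The sketch below uses a robust median/MAD baseline; the threshold and the surge-to-stop-time mapping are assumptions to be tuned against the known game dates.

```python
# Sketch: flag counts far above the typical level for that time of day.
# Median/MAD baseline and the z threshold are illustrative choices.
import pandas as pd


def detect_surges(counts: pd.Series, z_thresh: float = 5.0) -> pd.Series:
    """Boolean mask of robust time-of-day outliers.

    `counts` is indexed by a fixed-frequency DatetimeIndex.
    """
    tod = counts.index.time                       # group key: time of day
    med = counts.groupby(tod).transform("median")
    mad = (counts - med).abs().groupby(tod).transform("median")
    z = (counts - med) / (1.4826 * mad).clip(lower=1e-9)  # robust z-score
    return z > z_thresh
```

From the mask, contiguous runs of flagged intervals on game days can be collapsed into candidate departure surges, whose start times become the model's stop-time estimates.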
The analysis and all dependencies should be packaged within a single Docker container with a corresponding README file describing how to build and run your analysis. Your analysis should be summarized within a Jupyter Notebook. **This task should be timeboxed to 4 to 6 hours.** Once complete, share your GitHub repository and we will build your container locally.
# Advanced Challenges
## Advanced Challenge 1
Raft currently works with the Consumer Financial Protection Bureau (CFPB) to build and manage infrastructure and code related to the collection of mortgage lending data. This data collection is mandated by the Federal Government through the Home Mortgage Disclosure Act (HMDA). This data is available for download and analysis at [consumerfinance.gov](https://www.consumerfinance.gov/data-research/hmda/).
### Instructions
Imagine you have been hired as a data scientist for the Consumer Financial Protection Bureau. On your first day you are asked to _explore the data, and produce an analysis._ This analysis can take any form you wish.
The analysis and all dependencies should be packaged within a single Docker container with a corresponding README file describing how to build and run your analysis. You may present your analysis in any way you like. **This task should be timeboxed to 4 to 6 hours.** Once complete, share your GitHub repository and we will build your container locally. Do not store the data within the GitHub repository.
## Advanced Challenge 2
Space Situational Awareness (SSA) is an increasingly important field, as more satellites are launched annually. To ensure that collisions are avoided, data are collected and analyzed. The [space-track website](https://www.space-track.org) has [TLE data](https://en.wikipedia.org/wiki/Two-line_element_set) for objects in orbit.
### Instructions
Obtain an account on the `space-track` website. Use the `tle_latest` API to obtain data, and build a model to predict object collisions.
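Once the TLEs have been propagated to Cartesian positions on a shared time grid (e.g., with the `sgp4` Python package, which implements the standard SGP4 propagator), a minimal building block is screening object pairs for close approaches. The propagation step is omitted below, and the 5 km threshold is an illustrative assumption.

```python
# Sketch: a crude conjunction screen over pre-propagated positions.
# r1, r2 are (N, 3) arrays of positions in km sampled at the same epochs,
# e.g. produced by propagating each object's TLE with the sgp4 package.
import numpy as np


def closest_approach(r1: np.ndarray, r2: np.ndarray):
    """Return (index, distance_km) of minimum separation."""
    d = np.linalg.norm(r1 - r2, axis=1)  # separation at each epoch
    i = int(np.argmin(d))
    return i, float(d[i])


def flag_conjunctions(r1: np.ndarray, r2: np.ndarray,
                      threshold_km: float = 5.0) -> bool:
    """Flag a potential conjunction when separation dips below threshold."""
    return closest_approach(r1, r2)[1] < threshold_km
```

Sampled minima like this only bound the true closest approach, so flagged pairs would typically be refined on a finer time grid (or with a root-finder on the range-rate) before being reported.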
The analysis and all dependencies should be packaged within a single Docker container with a corresponding README file describing how to build and run your analysis. You may present your analysis in any way you like. **This task should be timeboxed to 4 to 6 hours.** Once complete, share your GitHub repository and we will build your container locally. Do not store the data within the GitHub repository.