## Day 1:

Prompt:

I want to create a cool data science project. My skills are in using python and pandas and seaborn. I'd like my project to really stand out and be impressive - I'm going to be presenting it to a panel of judges in a "shark tank" style format. The first step is discovering interesting data to analyze. Act as my data science mentor, and ask a few questions about my personal interests, and in turn suggest 5 great project ideas. Please respond in the following format:

```
Project Title: <title here>

Links to datasets to download:
<url of first publicly available dataset here>
<url of second publicly available dataset here>

Hypotheses: <three interesting hypotheses to explore>

Synthesis with other datasets: <one interesting way to merge this project with another project on the list, to synthesize data>
```

Here’s what I got as a response:

```markdown
That's great! Based on your interests in online edtech, entrepreneurship, Roman history, chess, and your background in the automotive industry, here are five project ideas that combine these elements:

**Project Title: Analysis of Coding Education Trends in the SEA Region**

Links to datasets to download:
- [Coding bootcamp enrollment data](https://www.kaggle.com/rtatman/coding-bootcamp-enrollment-stats)
- [Education indicators dataset](https://databank.worldbank.org/reports.aspx?source=education-statistics-~-all-indicators#)

Hypotheses:
1. The demand for coding education in the SEA region has been steadily increasing over the past decade.
2. There are significant variations in coding education enrollment and completion rates across different countries in the SEA region.
3. The presence of coding bootcamps has a positive impact on the growth of the local tech startup ecosystem.

Synthesis with other datasets: You can merge this project with the next one on the list to explore the relationship between coding education and the success of tech startups in the region.

**Project Title: Analysis of Tech Startup Landscape in the SEA Region**

Links to datasets to download:
- [Startup funding dataset](https://www.kaggle.com/rtatman/venture-capital-investments/downloads/venture-capital-investments.zip)
- [Tech company acquisitions dataset](https://www.kaggle.com/vipashakaul/startup-acquisitions)

Hypotheses:
1. The number of tech startups in the SEA region has been growing rapidly in recent years.
2. There is a positive correlation between funding received by startups and their success in terms of acquisitions or IPOs.
3. Certain industries, such as e-commerce or fintech, dominate the startup landscape in the region.

Synthesis with other datasets: Combine this project with the previous one to examine the correlation between coding education and the growth of successful tech startups.

**Project Title: Analysis of Chess Tournament Performance**

Links to datasets to download:
- [Chess game dataset](https://www.kaggle.com/datasnaek/chess)
- [Chess player ratings dataset](https://www.fide.com/fide/handbook.html?id=207&view=article)

Hypotheses:
1. The opening moves played in chess games have a significant impact on the game's outcome.
2. Chess players' ratings are positively correlated with their tournament performance.
3. Male and female chess players exhibit differences in playing styles and overall performance.

Synthesis with other datasets: You can merge this project with a historical dataset to explore the evolution of chess strategies and playing styles throughout history.

**Project Title: Analysis of Ancient Roman Historical Events**

Links to datasets to download:
- [Ancient Roman historical events dataset](https://www.kaggle.com/titovic/romans)

Hypotheses:
1. The rise and fall of the Roman Empire can be traced to specific historical events and factors.
2. The Roman military's success in conquest was influenced by factors such as leadership, strategy, and logistics.
3. There were distinct periods of cultural and technological advancements during the Roman Empire.

Synthesis with other datasets: Combine this project with a geographical dataset to visualize the spread of Roman influence and analyze its impact on different regions.

**Project Title: Analysis of Automotive Market Trends**

Links to datasets to download:
- [Automotive sales dataset](https://www.kaggle.com/toramky/automobile-dataset)
- [Fuel economy dataset](https://www.fueleconomy.gov/feg/download.shtml)

Hypotheses:
1. There is a correlation between a car's price and its fuel efficiency.
2. Certain car manufacturers dominate the automotive market in terms of sales and market share.
3. The popularity of electric vehicles (EVs) has been increasing over time.

Synthesis with other datasets: Merge this project with a dataset on government policies or incentives related to electric vehicles to analyze their impact on the adoption of EVs.

I hope these project ideas inspire you and provide a good starting point for your data science project. Feel free to choose the one that resonates with you the most or combine elements from different ideas to create a unique project that aligns with your interests and goals. Good luck with your presentation!
```

My follow-up:

I actually really like the first one, that one seems particularly relevant to my needs, and interesting in a professional setting. Can you help give me a few more hypotheses that I can explore, as well as some sample code to generate a few visualizations? What kind of predictive models using scikit-learn would I be able to build?

````markdown
Absolutely! I'm glad you found the first project idea appealing. Here are a few additional hypotheses you can explore for the "Analysis of Coding Education Trends in the SEA Region" project:

4. The availability of coding bootcamps in a country positively impacts the number of tech job openings.
5. Certain programming languages are more popular than others in the SEA region, and their popularity has shifted over time.
6. There is a gender disparity in coding education enrollment and completion rates in the SEA region.

To generate visualizations for your project, you can use Python libraries such as Matplotlib and Seaborn. Here's some sample code to create a few visualizations based on the hypotheses mentioned above:

**Visualization 1: Coding Bootcamps vs. Tech Job Openings**

This visualization explores the relationship between the presence of coding bootcamps and the number of tech job openings in different countries.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the coding bootcamp enrollment data
bootcamp_data = pd.read_csv('coding_bootcamp_data.csv')

# Group the data by country and calculate the total tech job openings
job_openings_by_country = bootcamp_data.groupby('Country')['Tech Job Openings'].sum()

# Plot the data
job_openings_by_country.plot(kind='bar', figsize=(10, 6))
plt.xlabel('Country')
plt.ylabel('Tech Job Openings')
plt.title('Tech Job Openings by Country')
plt.xticks(rotation=45)
plt.show()
```

**Visualization 2: Popular Programming Languages**

This visualization shows the popularity of different programming languages in the SEA region over time.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the coding bootcamp enrollment data
bootcamp_data = pd.read_csv('coding_bootcamp_data.csv')

# Count the occurrences of each programming language
language_counts = bootcamp_data['Programming Language'].value_counts()

# Plot the data
sns.barplot(x=language_counts.index, y=language_counts.values)
plt.xlabel('Programming Language')
plt.ylabel('Count')
plt.title('Popularity of Programming Languages')
plt.xticks(rotation=45)
plt.show()
```

Regarding predictive models using scikit-learn, there are several possibilities for this project. Here are a few examples:

1. **Predicting Coding Bootcamp Enrollment**: You can build a classification model to predict whether a person is likely to enroll in a coding bootcamp based on factors such as their educational background, work experience, and coding skills.
2. **Predicting Tech Job Openings**: Using regression techniques, you can create a model to predict the number of tech job openings in a specific country based on factors like the country's GDP, population, education indicators, and coding education metrics.
3. **Predicting Coding Bootcamp Completion**: You can build a binary classification model to predict whether a student will successfully complete a coding bootcamp based on various features such as demographics, enrollment details, and performance in pre-bootcamp assessments.

These are just a few examples of the predictive models you can build using scikit-learn. The choice of the model will depend on the specific question you want to answer and the available data. Remember to preprocess and prepare your data appropriately before training the models.

I hope this helps you get started with your project! Feel free to ask if you have any further questions.
````

What are the interesting questions I can ask? Can you help me frame the narrative of my final presentation?

```markdown
Certainly! Here are some interesting questions you can ask and incorporate into the narrative of your final presentation for the "Analysis of Coding Education Trends in the SEA Region" project:

1. **Question 1: How has coding education enrollment evolved over time in the SEA region?**
   - Explore the trends in coding education enrollment and identify any significant changes or patterns.
   - Highlight the countries with the highest and lowest enrollment rates and discuss potential factors contributing to these variations.

2. **Question 2: What are the most popular coding languages in the SEA region? How has their popularity changed?**
   - Analyze the data to identify the programming languages most commonly learned in the SEA region.
   - Examine how the popularity of different languages has shifted over time and discuss possible reasons for these changes.

3. **Question 3: What is the impact of coding bootcamps on the tech job market in the SEA region?**
   - Investigate the relationship between the presence of coding bootcamps and the number of tech job openings.
   - Discuss the potential benefits and challenges associated with coding bootcamps and their role in nurturing the tech talent pipeline.

4. **Question 4: Is there a gender disparity in coding education in the SEA region?**
   - Analyze enrollment and completion rates by gender to identify any gender disparities in coding education.
   - Discuss the potential factors contributing to these disparities and highlight the importance of promoting diversity and inclusion in the tech industry.

5. **Question 5: How can we leverage coding education trends to support entrepreneurship in the SEA region?**
   - Explore the intersection of coding education and entrepreneurship, discussing how coding skills can empower individuals to pursue entrepreneurial ventures.
   - Highlight success stories of tech startups emerging from the coding education ecosystem and discuss ways to further nurture and support these initiatives.

When framing the narrative of your final presentation, consider the following structure:

1. **Introduction**: Provide an overview of the project, your motivations, and its relevance to the edtech and entrepreneurship landscape in the SEA region.
2. **Data Collection and Preparation**: Briefly explain the datasets used, the steps taken for data cleaning and preprocessing, and any challenges encountered.
3. **Exploratory Data Analysis**: Present your findings from the exploratory analysis, highlighting interesting trends, patterns, and insights related to coding education in the SEA region.
4. **Hypothesis Testing**: Discuss the hypotheses you formulated and present the results of your analysis, supporting or refuting each hypothesis with evidence.
5. **Visualizations**: Showcase the visualizations you created to effectively communicate your findings. Explain the insights gained from each visualization and how they contribute to the overall narrative.
6. **Predictive Models (Optional)**: If you developed any predictive models, briefly explain the models used, their performance, and the insights they provide for future predictions or decision-making.
7. **Conclusion and Implications**: Summarize the key findings of your project, emphasizing the implications for edtech, entrepreneurship, and the broader tech ecosystem in the SEA region. Discuss potential actions and policies that can be derived from your analysis to support coding education and its impact on the region.
8. **Q&A and Discussion**: Engage the panel of judges and audience in a Q&A session to further explore the insights and implications of your analysis and address any additional questions they may have.

By following this structure and incorporating the interesting questions into your presentation, you can create a compelling narrative that highlights the significance of coding education trends in the SEA region and the potential opportunities they present for edtech and entrepreneurship. Best of luck with your final presentation!
```

Actually, it looks like we should use this dataset, https://www.kaggle.com/datasets/fccuser/2021-new-coder-survey, because GPT hallucinated: https://www.kaggle.com/rtatman/coding-bootcamp-enrollment-stats doesn’t actually exist. Jesus.

```
Actually, I have a dataset, it's the Freecodecamp new coder survey. Here are the columns - can you help me generate some interesting hypotheses to explore?

Timestamp,What is your biggest reason for learning to code?,What methods have you used to learn about coding?,How much time have you spent learning to code?,What is your current job title?,What is your current salary?,What is your gender?,What is your age?,What is your race/ethnicity?,What is your highest level of education?,What is your current country of residence?,What is your current state or province of residence?,What is your current city or town of residence?,What is your current employer?,What is your current job role?,What is your current industry?,What is your current company size?,What is your current company type?,What is your current employment status?,What is your current income?,What is your current source of income?,What is your current coding experience?,What is your current programming language of choice?,What is your current coding learning platform of choice?,What is your current coding learning resource of choice?,What are your current career goals?,What are your current coding challenges?,What are your current coding resources?,What are your current coding tips?,What are your current coding advice?
```

(I used Bard.)

```markdown
Certainly! Based on the columns in the Freecodecamp new coder survey dataset, here are some interesting hypotheses that you can explore:

Hypothesis 1: The main reasons for learning to code vary based on gender.
Investigate whether there are gender differences in the motivations for learning to code. Analyze the distribution of responses to the "What is your biggest reason for learning to code?" question by gender.

Hypothesis 2: Different learning methods are associated with varying levels of coding experience.
Explore the relationship between the learning methods respondents have used and their reported coding experience. Determine if certain learning methods are more effective for gaining coding experience.

Hypothesis 3: Coding experience and current salary are positively correlated.
Investigate whether there is a relationship between the reported coding experience and current salary of respondents. Analyze the distribution of salaries across different levels of coding experience.

Hypothesis 4: Coding learning platform preference is influenced by current job role.
Explore whether there are differences in the preferred coding learning platforms based on respondents' current job roles. Analyze the distribution of responses to the "What is your current coding learning platform of choice?" question by job role.

Hypothesis 5: The level of coding experience impacts career goals.
Investigate whether respondents with different levels of coding experience have varying career goals. Analyze the distribution of career goals based on coding experience levels.

These hypotheses provide a starting point for exploring the dataset and gaining insights into various aspects of new coders' experiences. You can further refine and expand upon these hypotheses based on your specific research interests and goals. Happy exploring and analyzing the Freecodecamp new coder survey data!
```

There’s still quite a ways to go, but I think we have some ideas.

Now, we begin with the end in mind. What are some big questions we would want to answer, that a shark tank judge would care about? The idea is to deliver real business value - because your skills are useful.

The question is basically like this:

- What country should CoderSchool expand to next, and how should we approach the market?
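
A first pass at that expansion question could be to rank countries by how many survey respondents they have, and by how many of those respondents are learning for a given reason. Here is a minimal sketch of that idea. It assumes the column names from the list I pasted into the prompt really match the CSV (I have not checked them against the file), and it uses a tiny made-up frame instead of the real survey, so the countries, answers, and the `rank_countries` helper are all hypothetical.

```python
import pandas as pd

# Column names copied from the column list in the prompt above.
# Assumption: they match the real CSV headers; this has not been verified.
COUNTRY = "What is your current country of residence?"
REASON = "What is your biggest reason for learning to code?"

# Made-up rows standing in for the survey, for illustration only.
# With the real file, load it with pd.read_csv(...) instead.
survey = pd.DataFrame({
    COUNTRY: ["Vietnam", "Vietnam", "Indonesia", "Vietnam", "Philippines", "Indonesia"],
    REASON: ["Career change", "Hobby", "Career change",
             "Career change", "Hobby", "Start a business"],
})


def rank_countries(df, reason="Career change"):
    """Count respondents per country and the share whose main reason equals `reason`."""
    grouped = df.groupby(COUNTRY)[REASON]
    summary = pd.DataFrame({
        "respondents": grouped.size(),
        "share_with_reason": grouped.apply(lambda s: (s == reason).mean()),
    })
    # Largest respondent pools first.
    return summary.sort_values("respondents", ascending=False)


ranking = rank_countries(survey)
print(ranking)
```

On the real data, the `respondents` column would probably make a reasonable first seaborn bar chart, and swapping the `reason` argument gives a rough read on how to approach each market.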