---
tags: FTMLE-Philipines-2020
---
# Week 4
## Monday 8.6.2020
**INFERENTIAL STATISTICS**
- Some discrete and continuous statistical distributions: the uniform, binomial, Poisson, and normal distributions, ...
- The t-test test statistic.
- The z-test test statistic.
(I will add more detail on this part!)
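In the meantime, a minimal sketch of both test statistics using ```scipy.stats``` (the sample values and the "known" population standard deviation below are invented purely for illustration):
```python
import numpy as np
from scipy import stats

# Invented sample data, purely for illustration
sample = np.array([5.1, 4.9, 5.4, 5.0, 5.3, 4.8, 5.2])

# One-sample t-test: is the sample mean different from mu_0 = 5.0?
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)

# z-test statistic by hand: z = (x_bar - mu_0) / (sigma / sqrt(n)),
# assuming the population standard deviation sigma is known
sigma = 0.2
z_stat = (sample.mean() - 5.0) / (sigma / np.sqrt(len(sample)))

print(t_stat, p_value, z_stat)
```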
## Tuesday 9.6.2020
<div style="text-align: justify">
Some remarks on today's exercises:
- We use ```~``` here, the logical NOT operator, so ```~df.duplicated()``` means "not duplicated". The code below shows the unique records in our DataFrame.
```python=
# Check for duplicated data
# Your code here
df[~df.duplicated()]
```
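Equivalently, the built-in ```df.drop_duplicates()``` gives the same result and reads more directly.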
- ```str.replace``` and ```replace``` are different. ```replace``` matches and replaces whole cell values and cannot replace a substring inside a value; to replace substrings within each row, we need ```str.replace```. It's generally better to manipulate strings with the ```.str``` string methods than with general-purpose methods.
Example:
```python=
# Replace misspelled countries
# Niacragua -> Nicaragua, Eucador -> Ecuador
# Your code here
df['Country'] = df['Country'].str.replace('Niacragua','Nicaragua')
df['Country'] = df['Country'].str.replace('Eucador', 'Ecuador')
```
Note that ```str.replace``` takes no ```inplace=True``` keyword, so we assign the result back to the column.
- We use the string method ```Series.str.get_dummies``` (related to the top-level function ```pandas.get_dummies```).
```python=
# There are lots of blended chocolate bars i.e. those with multiple countries in BeanOrigin
# What can we do to make col BeanOrigin more meaningful?
# Your code here
df['BeanOrigin'].str.get_dummies(sep=',')
df1 = pd.concat([df, df['BeanOrigin'].str.get_dummies(sep=',')], axis=1)
```
Using ```get_dummies``` with ```sep=','``` means we split each string at the comma ```,```. The second line concatenates the dummy columns with the original DataFrame along the column axis (```axis=1```), storing the result as ```df1```.
- For adding new rows to the DataFrame, we can build a dictionary and use the ```append``` method, as sketched below.
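A minimal sketch (the column names are assumptions based on this chocolate-bar dataset; note that ```DataFrame.append``` was deprecated in pandas 1.4 and removed in 2.0, where ```pd.concat``` is the replacement):
```python
# A new row as a dict; these column names are assumptions for illustration
new_row = {'Company': 'Marou', 'BeanOrigin': 'Vietnam', 'Rating': 4.0}

# pandas < 2.0
df = df.append(new_row, ignore_index=True)

# pandas >= 2.0: wrap the dict in a one-row DataFrame and concatenate
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
```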
- **Binning**: we use ```pd.cut``` (cutting into equal-width intervals) or ```pd.qcut``` (cutting into quantile-based groups of roughly equal size, e.g. quartiles).
```python=
# Divide the Chocolate Bar into 4 groups based on Rating: Terrible, Medium, Good, Excellent
# Each group should have an approximately equal number of chocolate bar
# Your code here
df['RatingBin'] = pd.cut(df['Rating'], bins = 4, labels = ['Terrible', 'Medium', 'Good', 'Excellent'])
```
or, since each group should contain roughly the same number of bars, quantile-based binning:
```python=
df['RatingBin'] = pd.qcut(df['Rating'], q = 4, labels = ['Terrible', 'Medium', 'Good', 'Excellent'])
```
Then we check the group sizes with ```df['RatingBin'].value_counts()```.
- ***There are lots of blended chocolate bars i.e. those with multiple countries in BeanOrigin. What can we do to make the column BeanOrigin more meaningful?***
For example: what is the growth of chocolate bars from Vietnam over the years, by average rating?
```python=
df1[df1['Vietnam'] == 1].groupby('ReviewDate').sum()
```
Let's analyze this code (this is called **unpivot** or **melting**)
```python=
# `pivot` here refers to the dummy table built earlier, presumably
# pivot = df['BeanOrigin'].str.get_dummies(sep=',')
unpivot = pd.melt(df1, id_vars=df.columns, value_vars=pivot.columns, var_name='Origin', value_name='Count')
unpivot[(unpivot['BeanOrigin'] == 'South America,Africa') & (unpivot['Count'] > 0)]
```
Compare this with the following
```python=
df1[df1['BeanOrigin'] == 'South America,Africa'][[*df.columns, 'South America', 'Africa']]
```
After finishing the **unpivot**, we keep only the rows whose ```Count``` column equals 1. Next, we ```groupby(['ReviewDate', 'Origin'])```:
```python
unpivot[unpivot['Count'] == 1].groupby(['ReviewDate','Origin']).sum()
```
How to get the top 5?
```python=
# [::-1] reverses the key list, i.e. group by 'Origin' first, then 'ReviewDate'
grouped = unpivot[unpivot['Count'] == 1].groupby(['ReviewDate', 'Origin'][::-1]).sum()
```
Run ```grouped.index``` to see the indices. Then
```python=
# sort the origins first, then take the first five
sorted({x[0] for x in grouped.index})[:5]
```
Take the top 5 countries
```python
# pivot.columns[1:6] picks five origin names from the dummy table
pivot.columns[1:6]
top5 = grouped.loc[pivot.columns[1:6]]
top5['AvgRating'] = top5['Rating'] / top5['Count']  # summed Rating / number of bars
```
**Try to find a way to sort within the MultiIndex ```grouped```**
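One possible approach (a sketch, not necessarily the intended solution): sort inside each first-level index group with a ```groupby``` over that level:
```python
# Within each Origin, order the ReviewDate rows by total Rating (descending)
grouped.groupby(level=0, group_keys=False).apply(
    lambda g: g.sort_values('Rating', ascending=False)
)
```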
- Match the geographic data with the data we obtained above. First, we need to check whether the country names in the two datasets match.
- Notice the difference between a **right** merge and a **left** merge; plot both to see it. (Hint: some countries are missing when we use **left**.) A sketch follows below.
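In this sketch, ```geo``` stands for the geographic DataFrame and the ```'Country'``` join key is an assumption:
```python
left = df.merge(geo, on='Country', how='left')    # keeps every row of df
right = df.merge(geo, on='Country', how='right')  # keeps every row of geo
# Countries present only in geo survive the right merge but not the left one
```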
</div>
## Wednesday 10.6.2020
<div style="text-align: justify">
Morning session: Working on the "Kickstarter project" data analysis.
- Check if the "ID" column is unique
```python
# Check if ID column is unique
# Your code here
data['ID'].nunique() == len(data)
```
- Drop unnecessary columns; which ones depends on what you want to do with your data.
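For example (the column names here are hypothetical):
```python
# Drop columns that are not needed for this analysis
data = data.drop(columns=['Currency', 'Pledged'])  # hypothetical names
```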
- To check for missing data, we use
```python
# Check for missing data
# Your code here
data.isnull().sum()
```
- Use ```pandas.to_datetime(df[column])``` to convert a whole column to the "datetime" type (timestamps).
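For example, assuming the launch dates live in a ```Launched``` column:
```python
# Parse the launch dates into pandas timestamps
data['Launched'] = pd.to_datetime(data['Launched'])
```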
### Data analysis: Telling a story!
- **INTRODUCTION**
Introduce the data with some general facts. Show the top 10 in different metrics. What can you infer from the charts? We should also combine different charts with different metrics to uncover insights; relying on a single chart could lead to a false conclusion. (See the "Documentary" category in the Kickstarter project analysis when comparing the three overall top-10 charts.)
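A sketch of one such top-10 chart, assuming a ```Category``` column:
```python
# Top 10 categories by number of projects
data['Category'].value_counts().head(10).plot(kind='barh')
```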
- Now we look into the success rate of projects:
The pie chart: more than 50% of the projects failed (the majority)!
The bar chart: we can compare the number of successful and failed projects in each category.
- If we have time-series data, we should analyze the growth over the years to see the trend/pattern in the data. Moreover, we can use that trend to make predictions about the future.
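A sketch of the yearly trend, assuming ```Launched``` has already been converted to datetime:
```python
# Number of projects launched per year
data.groupby(data['Launched'].dt.year).size().plot()
```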
- **WHAT FACTORS NEED TO BE CONSIDERED ...?** (Example in the Kickstarter project)
- Success Rate vs Month
  - This can suggest a good time to launch a project in a specific category. For example, one should not launch a "tabletop game" project in December! (See the sketch after this list.)
- Success Rate vs Duration
  - The longer the project runs, the higher the chance that it will fail! We can conclude this from the trend in the chart.
- Success Rate vs Goal
  - The higher the funding goal, the more likely the project is to fail. For your Kickstarter project, you should keep it short and keep the goal small.
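A sketch of the month and duration analyses; the ```State```, ```Launched```, and ```Deadline``` column names (and the 'successful' label) are assumptions about this dataset:
```python
success = (data['State'] == 'successful')

# Success rate by launch month
success.groupby(data['Launched'].dt.month).mean()

# Success rate by campaign duration in days
duration = (data['Deadline'] - data['Launched']).dt.days
success.groupby(duration).mean()
```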
</div>