## A simple Spark program for you to try

1. Edge nodes / S3 bucket - daily files arrive here
2. File format - XYZ_20221021.zip
3. Inside the zip - an employee.csv file with columns: first name, last name, id, salary, dept_id

### Transformations needed

- Create a new name column: name = first name + last name
- Calculate a salary ranking for each employee within dept_id and add it as a new salary ranking column, e.g. whether an employee's salary within the IT or Admin dept is the highest, 2nd highest, etc.
- Ingest the final data into a Hive table / HDFS or an S3 bucket, partitioned by the date taken from the file name itself ("20221021")

### Put your code here

```python
import pandas as pd

path = '../employee.csv'
df = pd.read_csv(path)

# name = first name + last name (axis=1 applies the lambda row-wise)
df['name'] = df.apply(lambda x: x['first_name'] + ' ' + x['last_name'], axis=1)

# Salary ranking within each dept_id (1 = highest salary; ties share a rank)
df['salary_ranking'] = (
    df.groupby('dept_id')['salary']
      .rank(method='dense', ascending=False)
      .astype(int)
)
```
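The snippet above is pandas, although the exercise asks for Spark. Below is a minimal PySpark sketch of the same pipeline, assuming the zip has already been extracted so that employee.csv is directly readable; the input path, output location, and app name are illustrative placeholders, and the partition date is hard-coded from the example file name rather than parsed at runtime.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = (
    SparkSession.builder
    .appName("employee_ingest")  # placeholder app name
    .enableHiveSupport()
    .getOrCreate()
)

# Date taken from the incoming file name, e.g. XYZ_20221021.zip -> "20221021";
# a real job would parse this from the actual file name instead.
file_date = "20221021"

# Placeholder input path; Spark reads the extracted CSV, not the zip itself
df = spark.read.csv("/landing/employee.csv", header=True, inferSchema=True)

# name = first name + last name
df = df.withColumn("name", F.concat_ws(" ", "first_name", "last_name"))

# Salary ranking within each dept_id (1 = highest salary; ties share a rank)
w = Window.partitionBy("dept_id").orderBy(F.col("salary").desc())
df = df.withColumn("salary_ranking", F.dense_rank().over(w))

# Partition column derived from the file-name date
df = df.withColumn("load_date", F.lit(file_date))

# Write partitioned output; swap the path for saveAsTable(...) to land in Hive
df.write.mode("overwrite").partitionBy("load_date").parquet("s3a://my-bucket/employee/")
```

Using `dense_rank` over a window partitioned by dept_id mirrors the pandas `rank(method='dense')` call above, so both versions agree on how tied salaries are ranked.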