---
title: '2020 Shopee Code League --- Competition Code'
disqus: hackmd
---
Shopee Code League 2020
===
## Table of Contents
[TOC]
### Open Group
### User story
---
> I choose a lazy person to do a hard job. Because a lazy person will find an easy way to do it. [name=Bill Gates]
### *TrueNote.py*
---
```mermaid
classDiagram
TrueNotepy --* DexLu
TrueNotepy --* VincentLiu
TrueNotepy --* Wenchun
TrueNotepy --* Gueiya
class TrueNotepy{
    Team member
}
class DexLu{
    Data Scientist
}
class Gueiya{
    Process Engineer
}
class VincentLiu{
    Data Scientist
}
class Wenchun{
    Process Engineer
}
```
### Competition Schedule
---
```mermaid
gantt
title Shopee Code League 2020 competition timeline
dateFormat YYYY-MM-DD
axisFormat %m/%d
section Order Brushing
Data Analytics :a1, 2020-06-08, 6d
1st Competition :crit, after a1, 1d
section Product Detection
Image Processing :b1, 2020-06-15, 6d
2nd Competition :crit, doc1, after b1, 13d
section Short Algorithm Contest
Short Programming Contest :c1, 2020-06-22, 6d
3rd Competition :crit, after c1, 1d
section Title Translation
Data Science :d1, 2020-06-30, 6d
4th Competition :crit, after d1, 2020-08-01
section Logistics
Data Analytics :e1, 2020-07-07, 6d
5th Competition :crit, after e1, 1d
```
> Read more about mermaid here: http://mermaid-js.github.io/mermaid/
### Competition 1 -- Order Brushing
#### Solution 1 (flawed: bins event_time into fixed clock hours instead of sliding one-hour windows)
Input data
```python=
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
```
```python=
import pandas as pd
# Read the data
df = pd.read_csv('/kaggle/input/orderbrushing/order_brush_order.csv')
df.head()
```
```python=
df.describe()
```
Converting event_time strings to datetimes
```python=
# Convert string to date time type Python
df["event_time"] = pd.to_datetime(df['event_time'])
```
```python=
#Get all orders with group by userid and shopid
df = df.set_index(pd.DatetimeIndex(df['event_time'])).drop('event_time', axis=1).sort_index()
orders = df.groupby(['shopid', 'userid', pd.Grouper(freq='H', label='left', base=0)]).count()
```
```python=
orders
```
Keep (shopid, userid, hour) groups with at least 3 orders
```python=
brush_order = orders[orders.orderid >=3]
brush_order
```
```python=
listuserid = []
brush_order.reset_index().groupby('shopid')['userid'].apply(lambda x: listuserid.append(x.values))
```
```python=
#Check list userid
listuserid
```
```python=
#Drop duplicate shopid
brush_order.reset_index().drop_duplicates(subset = ["shopid"])
```
```python=
#Concat userid with '&'
def concat_userid(data):
    result = '&'.join(str(x) for x in data)
    return result

bulk_userid = []
for i in listuserid:
    bulk_userid.append(concat_userid(i))
```
```python=
bulk_userid
```
```python=
#DF order brushing
df_brush = pd.DataFrame({"shopid": brush_order.reset_index()['shopid'].unique(), "userid": bulk_userid})
df_brush.head()
```
```python=
#DF no order brushing
df0 = pd.DataFrame({'shopid': df['shopid'].unique(), 'userid': 0})
```
```python=
# Export result as csv
res_df = pd.concat([df0[~df0.shopid.isin(df_brush.shopid)], df_brush])
res_df.to_csv("submission.csv", index=False)
```
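The fixed hourly bins are where this solution falls short: `pd.Grouper(freq='H')` cuts at clock-hour boundaries, so a burst of orders that straddles a boundary is split across two bins and can slip under the ≥ 3 threshold. A hypothetical mini-dataset (invented shop/user ids) makes this visible:

```python=
import pandas as pd

# Hypothetical mini-dataset (invented ids): one user places three orders
# within three minutes, straddling the 11:00 clock-hour boundary.
toy = pd.DataFrame({
    "orderid": [1, 2, 3],
    "shopid": [10, 10, 10],
    "userid": [7, 7, 7],
    "event_time": pd.to_datetime([
        "2019-12-27 10:59:00",
        "2019-12-27 11:00:00",
        "2019-12-27 11:01:00",
    ]),
})
binned = (toy.set_index("event_time")
             .groupby(["shopid", "userid", pd.Grouper(freq="H")])
             .count())
# The burst is split 1 + 2 across two bins, so the >= 3 filter misses it
# even though all three orders sit inside one sliding 60-minute window.
print(int(binned["orderid"].max()))  # -> 2
```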
#### Solution 2
Input data
```python=
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
```
```python=
order_data = pd.read_csv('../input/students-order-brushing-1/order_brush_order.csv')
order_data.head()
```
Formatting event_time from string into datetime
```python=
order_data['event_time'] = pd.to_datetime(order_data.event_time)
order_data.dtypes
```
Scanning sliding one-hour windows per shop
```python=
def get_suspicious_buyer(df):
    df.sort_values(by='event_time', inplace=True)
    n = len(df.index)
    is_suspicious = [False for _ in range(n)]
    for i in range(n):
        maxJ = -1
        userid_set = set()
        for j in range(i, n):
            delta_second = (df['event_time'].iloc[j] - df['event_time'].iloc[i]).total_seconds()
            if delta_second > 3600:
                break
            userid_set.add(df['userid'].iloc[j])
            if j-i+1 >= len(userid_set) * 3:
                maxJ = j
        for j in range(i, maxJ+1):
            is_suspicious[j] = True
    brush_df = df.loc[is_suspicious]
    user_count = brush_df.groupby('userid').orderid.count()
    most_suspicious_users = list(user_count[user_count == user_count.max()].index)
    most_suspicious_users.sort()
    res = '&'.join([str(x) for x in most_suspicious_users])
    if res == '':
        res = '0'
    return res
```
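To see the sliding one-hour window and the concentration rate (orders ÷ distinct buyers ≥ 3) in action, here is a self-contained re-run of the same logic on a hypothetical mini-dataset (invented ids):

```python=
import pandas as pd

# Hypothetical mini-dataset (invented ids) for one shop: user 7 places
# three orders between 10:59 and 11:01, user 9 orders hours later.
orders = pd.DataFrame({
    "orderid": [1, 2, 3, 4],
    "userid": [7, 7, 7, 9],
    "event_time": pd.to_datetime([
        "2019-12-27 10:59:00",
        "2019-12-27 11:00:00",
        "2019-12-27 11:01:00",
        "2019-12-27 15:00:00",
    ]),
}).sort_values("event_time").reset_index(drop=True)

suspicious = set()
for i in range(len(orders)):
    users = set()
    for j in range(i, len(orders)):
        # stop once the window starting at order i exceeds one hour
        if (orders.event_time[j] - orders.event_time[i]).total_seconds() > 3600:
            break
        users.add(orders.userid[j])
        # concentration rate = orders / distinct buyers in the window
        if (j - i + 1) >= 3 * len(users):
            suspicious.update(range(i, j + 1))

print(sorted(orders.loc[sorted(suspicious), "userid"].unique().tolist()))  # -> [7]
```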
```python=
shop_groups = order_data.groupby('shopid')
suspicious_users = []
for shop_id, df in shop_groups:
    suspicious_users.append(get_suspicious_buyer(df))
```
```python=
shop_ids = []
for shop_id, df in shop_groups:
    shop_ids.append(shop_id)
output = pd.DataFrame({'shopid': shop_ids,
                       'userid': suspicious_users})
output.to_csv('submission.csv', index=False)
```
#### Code from Dex Lu
```python=
import pandas as pd

df1 = pd.read_csv("/Users/dexterlu/Downloads/SideProject-Shopee/order_brush_order.csv")  # was 'f1 =', a typo
df1['orderid'] = df1['orderid'].astype("str")
df1['shopid'] = df1['shopid'].astype("str")
df1['userid'] = df1['userid'].astype("str")
df1['shop+user'] = df1['shopid'] + '_' + df1['userid']
df1['event_time'] = pd.to_datetime(df1['event_time'])
# Flag (shop, user) pairs with at least 3 orders
s = (df1['shop+user'].value_counts() >= 3).rename('is_brush')
df2 = s.reset_index()
df2.columns = ['shop+user', 'is_brush']
dfm = df1.merge(df2, how='left', on='shop+user')
df = dfm[dfm['is_brush']]  # boolean mask, not the string 'True'
df = df.sort_values(by=['event_time'], ascending=True)  # sort_index(by=...) is deprecated
```
#### Code from Vincent Liu
```python=
import pandas as pd

df = pd.read_csv('order_brush_order.csv')  # path assumed; df was not defined in the original snippet
df['shopid'] = df['shopid'].apply(str)
df['userid'] = df['userid'].apply(str)
# join with a separator so e.g. shop 12 / user 34 != shop 1 / user 234
df['all'] = df['shopid'] + '_' + df['userid']
df2 = df.groupby('all').size().reset_index().rename(columns={0: 'records'})
df2 = df2[df2['records'] >= 3]
# sort by pair, then by time within each pair
# (the original chained single-column sorts only kept the last ordering)
df3 = df2.merge(df, on='all', how='left').sort_values(['all', 'event_time'])
df3['event_time'] = pd.to_datetime(df3['event_time'])
df3['time'] = pd.to_datetime('2019-12-27 00:00:00', format='%Y-%m-%d %H:%M:%S')
df3['timediff'] = (df3['event_time'] - df3['time']).dt.total_seconds()
df4 = df3.sort_values('all').reset_index(drop=True)
df4['all'].unique()
```
#### Code from Wenhuan Chang
```python=
import pandas as pd

input_path = '../input/order-brushing-shopee/order_brush_order.csv'  # renamed: 'input' shadows the builtin
df = pd.read_csv(input_path)
# Build a shop_user key in its own column (the original overwrote orderid)
df['shop_user'] = df.shopid.astype(str).str.cat(df.userid.astype(str), sep='_')
df.groupby('shop_user').count()
df['shop_user'].value_counts()
df['freq'] = df.groupby('shop_user')['shop_user'].transform('count')
df = df[df['freq'] >= 3]
```
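The `groupby(...).transform('count')` trick above attaches each group's size to every row, so a plain boolean mask can keep the ≥ 3 pairs; a toy check:

```python=
import pandas as pd

# Toy data: key 'a' appears 3 times, 'b' once
toy = pd.DataFrame({"key": ["a", "a", "a", "b"], "order": [1, 2, 3, 4]})
toy["freq"] = toy.groupby("key")["key"].transform("count")
kept = toy[toy["freq"] >= 3]
print(kept["key"].unique().tolist())  # -> ['a']
```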
#### Code from Gueiya
```python=
import pandas as pd

data_path = 'data/'
df = pd.read_csv(data_path + 'order_brush_order.csv')
df["count"] = 1
df.head()

# Convert event_time to datetimes (vectorized; time.strptime cannot
# parse a whole Series at once, which is what the original attempted)
df['event_time'] = pd.to_datetime(df['event_time'])
# Unix timestamp and an hourly label, per row
df['timestamp'] = df['event_time'].astype('int64') // 10**9
df['hour'] = df['event_time'].dt.strftime('%Y/%m/%d %H')
```
### Competition 2 -- Product Detection
---
#### Code from Gueiya (Colab)
```python=
!pip install plotly
import plotly.express as px
```
```python=
from google.colab import files
uploaded = files.upload()
```
```python=
# Read the .jpg filenames from the train subfolders 00-41, in order
import glob
import pandas as pd

frames = []
for i in range(42):
    folder = "%02d" % i  # subfolder names are zero-padded: 00 ... 41
    catfs = glob.glob("/content/drive/My Drive/shopee/competition/shopee-product-detection-dataset/train/train/" + folder + "/*.jpg")
    frames.append(pd.DataFrame({
        "path": catfs,
        "number": [i] * len(catfs),
    }))
# the original overwrote the table on every iteration; accumulate instead
df = pd.concat(frames, ignore_index=True)
df["filename"] = df["path"].str.split("/").str[-1]
df.shape
```
```python=
# Prepare the test table; update it once the train data has finished uploading
import glob
import pandas as pd

catfs = glob.glob("/content/drive/My Drive/shopee/competition/shopee-product-detection-dataset/test/test/*.jpg")
catt = [43] * len(catfs)  # placeholder category, to be overwritten later
dft = pd.DataFrame({
    "path": catfs,
    "number": catt
})
ddft = dft['path'].str.split('/').str[-1]
df_test = pd.DataFrame({
    'filename': ddft,
    'category': catt
})
df_test.head()
```
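Both cells rely on the same path-to-filename split via the pandas `.str` accessor; on a couple of made-up paths it behaves like this:

```python=
import pandas as pd

# Made-up paths, mimicking the Drive layout above
paths = pd.Series(["/content/drive/test/0001.jpg", "/content/drive/test/0002.jpg"])
filenames = paths.str.split("/").str[-1]
print(filenames.tolist())  # -> ['0001.jpg', '0002.jpg']
```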
```python=
from keras.applications.vgg16 import VGG16
vgg = VGG16(include_top=False, input_shape=(224, 224, 3))
vgg.summary()
from keras.models import Model
from keras.layers import Dense, Dropout, Flatten
from keras.layers import BatchNormalization
# trainable must be set before compile
for l in vgg.layers:
    l.trainable = False
x = BatchNormalization()(vgg.output)
# MLP head
x = Flatten()(x)
x = Dense(512, activation="relu")(x)
x = Dropout(0.25)(x)
out = Dense(42, activation="softmax")(x)  # 42 product categories; Dense(2) was a leftover from a binary demo
model = Model(inputs=vgg.input, outputs=out)
model.summary()
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])
```
```python=
# https://github.com/keras-team/keras-applications/blob/master/keras_applications/imagenet_utils.py
import numpy as np
from keras.preprocessing.image import load_img
from keras.applications.vgg16 import preprocess_input
p = df.iloc[0]["path"]
img = load_img(p, target_size=(224, 224)).convert("RGB")
img_np = np.array(img)
img_pre = preprocess_input(img_np)
img_pre
```
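For reference, VGG16's `preprocess_input` (in its default 'caffe' mode) flips RGB to BGR and subtracts the ImageNet per-channel means. A NumPy sketch of that transform, as an illustration only (not the Keras source):

```python=
import numpy as np

def vgg_preprocess(img_rgb):
    """Sketch of 'caffe'-style preprocessing: RGB -> BGR, then
    subtract the ImageNet per-channel means (in BGR order)."""
    img = img_rgb[..., ::-1].astype("float64")
    return img - np.array([103.939, 116.779, 123.68])

black = np.zeros((1, 1, 3))
print(vgg_preprocess(black).round(3).tolist())  # -> [[[-103.939, -116.779, -123.68]]]
```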
```python=
# Try out numpy's randint
ori = np.random.randint(0, 10, 5)
new = list(map(lambda x: x**2, ori))
print(ori)
print(new)

def preprocess(path):
    img = load_img(path, target_size=(224, 224)).convert("RGB")
    img_np = np.array(img)
    img_pre = preprocess_input(img_np)
    return img_pre

def get_images(paths, targets, batch=20):
    idx = np.random.randint(0, len(paths), batch)
    ps = paths[idx]
    xs = np.array(list(map(preprocess, ps)))
    ys = targets[idx]
    return (ps, xs, ys)
```
```python=
from sklearn.model_selection import train_test_split
x = np.array(df["path"])
y = np.array(df["number"])
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)
ps, xs, ys = get_images(x_train, y_train)
print(ys)
for i in range(10):
    print("-" * 15, "Times:", i, "-" * 15)
    _, xs, ys = get_images(x_train, y_train)
    train_loss = model.train_on_batch(xs, ys)
    print("[Train]:", train_loss)
    _, xs, ys = get_images(x_test, y_test)
    val_loss = model.test_on_batch(xs, ys)
    print("[Validate]:", val_loss)
_, xs, ys = get_images(x_test, y_test, 100)
model.evaluate(xs, ys)
```
```python=
# Inspect the predicted probabilities: hard 0.0 / 1.0 values are a red flag
model.predict(xs)
```
```python=
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
ps, xs, ys = get_images(x_test, y_test, 100)
accu = model.evaluate(xs, ys)[1]
print("Accuracy:", accu * 100, "%")
# With the functional Model (unlike Sequential) there is no
# predict_classes; use predict + argmax instead
pre = model.predict(xs).argmax(axis=1)
x_test = np.array(list(map(lambda p: np.array(load_img(p)), ps)))
idx = np.nonzero(pre != ys)[0][:200]
pre_false_img = x_test[idx]
pre_false_label = ys[idx]
pre_false_pre = pre[idx]
width = 10
height = len(idx) // width + 1
plt.figure(figsize=(14, 42))
z = zip(pre_false_img, pre_false_label, pre_false_pre)
for i, (img, label, p) in enumerate(z):
    # show the true and predicted category ids; the original
    # 'trans = ["Cat", "Dog"]' lookup was a leftover from a binary demo
    # and would fail for the 42 product categories
    plt.subplot(height, width, i+1)
    plt.title("[O]:{}\n[P]:{}".format(label, p))
    plt.axis("off")
    plt.imshow(img)
```
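The `predict` + `argmax` replacement for `predict_classes` is just a row-wise argmax over the probability matrix, which NumPy shows directly:

```python=
import numpy as np

# Toy probability matrix: two samples, three classes
probs = np.array([[0.1, 0.7, 0.2],
                  [0.8, 0.1, 0.1]])
pred = probs.argmax(axis=1)  # class id with the highest probability per row
print(pred.tolist())  # -> [1, 0]
```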
### Competition 3 -- Short Programming Contest 1
---
Shopee released the problem statements & datasets
<iframe frameborder="0" scrolling="no"
width="640" height="480"
src="https://drive.google.com/file/d/1eckQ8lCQZQ-g5vnrLPxmkCyFKwcmTkWb/preview">
</iframe>
#### Lucky Winner (Max. score: 20)
[google docs](https://docs.google.com/document/d/1UfgUQ2TmRUvxEMM3W4nqHrMdXFNcMnOG9O9taVlnj4s/edit)
code from Gueiya
```python=
import pandas as pd
import numpy as np
# First build a table of random numbers
N = list(range(1000))
tb = pd.DataFrame(np.random.rand(1000, 1), index=N, columns=['K'])
# The absolute value of the input is the table index; print the first
# three digits of tb['K'] at that position
print(str(tb['K'][int(abs(int(input())))]).strip('0.')[:3])
```
#### Sequences (Max. score: 30)
[google docs](https://docs.google.com/document/d/11sTwvCzlkH1lyACpuHagqv8Y1sqnAQapZYx65tokzt0/edit)
#### Search Engine (Max. score: 20)
[google docs](https://docs.google.com/document/d/1s1l_y-PgrMbeJiiMaHHxvc89klJet-d25y2fQAWVBvU/edit)
#### Item Stock (Max. score: 10)
[google docs](https://docs.google.com/document/d/1wjthvmYk1P_0lYsY_wMfDuapZfWqO1032f4mkiI0jVY/edit)
#### Judging Servers (Max. score: 20)
[google docs](https://docs.google.com/document/d/1BZV3ojZnduusLsIF6lfYHNPWxqPBkGpSFSUvRClq5IA/edit)
## Workshop
---
### Python Intermediate by The Brainery Code on 15 June 2020
:::info
**Tips shared:** importing data and modifying rows with `.loc[]`
:::
- 2 methods of importing data
```python=
# import file
import pandas as pd
# if your file is someway away from your py file
pokemin_df = pd.read_csv("C:\\Users\\Queiy\\Documents\\shopee\\Pokemon_Gen_1-8.csv")
# if the file is at the same forder as your py file
pokemin_df = pd.read_csv("Pokemon_Gen_1-8.csv")
print(pokemin_df)
```
- adding/modifying rows with .loc
```python=
import pandas as pd
cars = {'Brand': ['Honda', 'Toyota', 'Ford'], 'Price': [22000, 25000, 27000]}
cars_df = pd.DataFrame(cars, columns=['Brand', 'Price'], index=['Car 1', 'Car 2', 'Car 3'])
year = [2010, 2011, 2008]
cars_df['Year'] = year
cars_df.insert(1, 'Model', ['Civic', 'Prius', 'Focus'], True)
# adding a row / modifying a row
cars_df.loc['Car 4'] = ['Hyundai', 'Avante', 20000, 2010]
cars_df.loc['Car 3'] = ['Suzuki', 'Swift', 26000, 2013]
cars_df['Discount'] = 0.1 * cars_df['Price']
cars_df['Discount Price'] = cars_df['Price'] - cars_df['Discount']  # price after the 10% discount
print(cars_df)
```
- selecting rows with .loc[:]
```python=
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pokemon_df = pd.read_csv("Pokemon_Gen_1-8.csv")
# eg. 1
# [[...]] --> select several columns as a DataFrame
# print(pokemon_df[['#', 'Name', 'Generation']])
# eg. 2
# [20:28] --> slice rows of the table
# pokemon_generation = pokemon_df.loc[20:28, ['#', 'Name', 'Generation']]
# print(pokemon_generation)
# eg. 3
pokemon_12 = pokemon_df.loc[12:]
print(pokemon_12)
```
### Intro to Common Algorithms on 16 June 2020
<iframe src="https://www.slideshare.net/slideshow/embed_code/key/yZMW3SEzrqshJP" width="427" height="356" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;" allowfullscreen> </iframe> <div style="margin-bottom:5px"> <strong> <a href="https://www.slideshare.net/secret/yZMW3SEzrqshJP" title="Shopee algo slides" target="_blank">Shopee algo slides</a> </strong> from <strong><a href="https://www.slideshare.net/GUEIYAJHANG" target="_blank">GUEI-YA JHANG</a></strong> </div>
###### tags: `.loc[]` `data exploration analysis` `algorithm`
### Sorting Algorithms on 28 June 2020
<iframe frameborder="0" scrolling="no"
width="640" height="480"
src="https://drive.google.com/file/d/1Z1NhTjOSmNP_7NN8y5rGFn_iXqc_x1FU/preview">
</iframe>
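As a companion to the slides (their exact contents are assumed), a minimal merge sort in Python captures the divide-and-conquer pattern such workshops typically cover:

```python=
# A minimal merge sort: split, sort each half recursively, then merge
def merge_sort(a):
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    return out + left[i:] + right[j:]

print(merge_sort([5, 2, 9, 1, 5]))  # -> [1, 2, 5, 5, 9]
```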
### Python Advanced on 1 July 2020
[google drive](https://drive.google.com/drive/folders/1zlpDlIXAz-UnIPwxua7hAfnBusDjIWZE)
### Java 8 on 6 July 2020
:::info
**New features in Java 8**
:::
[github](https://github.com/yahyabaassou/Java8Workshop/tree/master/src/test/java/com/yahyabaassou/java8/exercises/exercise1)
[slides](https://drive.google.com/file/d/1r-Bn2SDkmwXQ5Zd_u6VyxHwJU5SJD5fC/view)
### Data Science Models on 8 July 2020
[google drive](https://drive.google.com/drive/folders/1ozzAQWn5U7XW2rIkPB963azowVZvzT3a)