---
title: '2020 Shopee Code League --- Competition Code'
disqus: hackmd
---
Shopee Code League 2020
===
## Table of Contents
[TOC]
### Open Group
### User story
---
> I choose a lazy person to do a hard job. Because a lazy person will find an easy way to do it. [name=Bill Gates]
### *TrueNote.py*
---
```mermaid
classDiagram
TrueNotepy --* DexLu
TrueNotepy --* VincentLiu
TrueNotepy --* Wenchun
TrueNotepy --* Gueiya
class TrueNotepy{
    Team member
}
class DexLu{
    Data Scientist
}
class Gueiya{
    Process Engineer
}
class VincentLiu{
    Data Scientist
}
class Wenchun{
    Process Engineer
}
```
### Competition Schedule
---
```mermaid
gantt
title Shopee Code League 2020 competition timeline
dateFormat YYYY-MM-DD
axisFormat %m/%d
section Order Brushing
Data Analytics :a1, 2020-06-08, 6d
1st Competition :crit, after a1, 1d
section Product Detection
Image Processing :b1, 2020-06-15, 6d
2nd Competition :crit, doc1, after b1, 13d
section Short Algorithm Contest
Short Programming Contest :c1, 2020-06-22, 6d
3rd Competition :crit, after c1, 1d
section Title Translation
Data Science :d1, 2020-06-30, 6d
4th Competition :crit, after d1, 2020-08-01
section Logistics
Data Analytics :e1, 2020-07-07, 6d
5th Competition :crit, after e1, 1d
```
> Read more about mermaid here: http://mermaid-js.github.io/mermaid/
### Competition 1 -- Order Brushing
#### Solution 1 (flawed: bins event_time into fixed clock hours instead of sliding one-hour windows)
Input data
```python=
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
```
```python=
import pandas as pd
# Read the data
df = pd.read_csv('/kaggle/input/orderbrushing/order_brush_order.csv')
df.head()
```
```python=
df.describe()
```
Converting event_time strings to datetimes
```python=
# Convert string to date time type Python
df["event_time"] = pd.to_datetime(df['event_time'])
```
```python=
#Get all orders with group by userid and shopid
df = df.set_index(pd.DatetimeIndex(df['event_time'])).drop('event_time', axis=1).sort_index()
orders = df.groupby(['shopid', 'userid', pd.Grouper(freq='H', label='left', base=0)]).count()
```
```python=
orders
```
Keep (shopid, userid, hour) groups with at least 3 orders
```python=
brush_order = orders[orders.orderid >=3]
brush_order
```
```python=
listuserid = []
brush_order.reset_index().groupby('shopid')['userid'].apply(lambda x: listuserid.append(x.values))
```
```python=
#Check list userid
listuserid
```
```python=
#Drop duplicate shopid
brush_order.reset_index().drop_duplicates(subset = ["shopid"])
```
```python=
#Concat userid with '&'
def concat_userid(data):
    result = '&'.join(str(x) for x in data)
    return result

bulk_userid = []
for i in listuserid:
    bulk_userid.append(concat_userid(i))
```
```python=
bulk_userid
```
```python=
#DF order brushing
df_brush = pd.DataFrame({"shopid": brush_order.reset_index()['shopid'].unique(), "userid": bulk_userid})
df_brush.head()
```
```python=
#DF no order brushing
df0 = pd.DataFrame({'shopid': df['shopid'].unique(), 'userid': 0})
```
```python=
# Export result as csv
res_df = pd.concat([df0[~df0.shopid.isin(df_brush.shopid)], df_brush])
res_df.to_csv("submission.csv", index=False)
```
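The fixed hourly bins are where this solution falls short: `pd.Grouper(freq='H')` cuts at clock-hour boundaries, so a burst of orders that straddles a boundary is split across two bins and can slip under the ≥ 3 threshold. A hypothetical mini-dataset (invented shop/user ids) makes this visible:

```python=
import pandas as pd

# Hypothetical mini-dataset (invented ids): one user places three orders
# within three minutes, straddling the 11:00 clock-hour boundary.
toy = pd.DataFrame({
    "orderid": [1, 2, 3],
    "shopid": [10, 10, 10],
    "userid": [7, 7, 7],
    "event_time": pd.to_datetime([
        "2019-12-27 10:59:00",
        "2019-12-27 11:00:00",
        "2019-12-27 11:01:00",
    ]),
})
binned = (toy.set_index("event_time")
             .groupby(["shopid", "userid", pd.Grouper(freq="H")])
             .count())
# The burst is split 1 + 2 across two bins, so the >= 3 filter misses it
# even though all three orders sit inside one sliding 60-minute window.
print(int(binned["orderid"].max()))  # -> 2
```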
#### Solution 2
Input data
```python=
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
```
```python=
order_data = pd.read_csv('../input/students-order-brushing-1/order_brush_order.csv')
order_data.head()
```
Formatting event_time from string into datetime
```python=
order_data['event_time'] = pd.to_datetime(order_data.event_time)
order_data.dtypes
```
Scanning sliding one-hour windows per shop
```python=
def get_suspicious_buyer(df):
    df.sort_values(by='event_time', inplace=True)
    n = len(df.index)
    is_suspicious = [False for _ in range(n)]
    for i in range(n):
        maxJ = -1
        userid_set = set()
        for j in range(i, n):
            delta_second = (df['event_time'].iloc[j] - df['event_time'].iloc[i]).total_seconds()
            if delta_second > 3600:
                break
            userid_set.add(df['userid'].iloc[j])
            if j-i+1 >= len(userid_set) * 3:
                maxJ = j
        for j in range(i, maxJ+1):
            is_suspicious[j] = True
    brush_df = df.loc[is_suspicious]
    user_count = brush_df.groupby('userid').orderid.count()
    most_suspicious_users = list(user_count[user_count == user_count.max()].index)
    most_suspicious_users.sort()
    res = '&'.join([str(x) for x in most_suspicious_users])
    if res == '':
        res = '0'
    return res
```
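To see the sliding one-hour window and the concentration rate (orders ÷ distinct buyers ≥ 3) in action, here is a self-contained re-run of the same logic on a hypothetical mini-dataset (invented ids):

```python=
import pandas as pd

# Hypothetical mini-dataset (invented ids) for one shop: user 7 places
# three orders between 10:59 and 11:01, user 9 orders hours later.
orders = pd.DataFrame({
    "orderid": [1, 2, 3, 4],
    "userid": [7, 7, 7, 9],
    "event_time": pd.to_datetime([
        "2019-12-27 10:59:00",
        "2019-12-27 11:00:00",
        "2019-12-27 11:01:00",
        "2019-12-27 15:00:00",
    ]),
}).sort_values("event_time").reset_index(drop=True)

suspicious = set()
for i in range(len(orders)):
    users = set()
    for j in range(i, len(orders)):
        # stop once the window starting at order i exceeds one hour
        if (orders.event_time[j] - orders.event_time[i]).total_seconds() > 3600:
            break
        users.add(orders.userid[j])
        # concentration rate = orders / distinct buyers in the window
        if (j - i + 1) >= 3 * len(users):
            suspicious.update(range(i, j + 1))

print(sorted(orders.loc[sorted(suspicious), "userid"].unique().tolist()))  # -> [7]
```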
```python=
shop_groups = order_data.groupby('shopid')
suspicious_users = []
for shop_id, df in shop_groups:
    suspicious_users.append(get_suspicious_buyer(df))
```
```python=
shop_ids = []
for shop_id, df in shop_groups:
    shop_ids.append(shop_id)
output = pd.DataFrame({'shopid': shop_ids,
                       'userid': suspicious_users})
output.to_csv('submission.csv', index=False)
```
#### Code from Dex Lu
```python=
import pandas as pd

df1 = pd.read_csv("/Users/dexterlu/Downloads/SideProject-Shopee/order_brush_order.csv")  # was 'f1 =', a typo
df1['orderid'] = df1['orderid'].astype("str")
df1['shopid'] = df1['shopid'].astype("str")
df1['userid'] = df1['userid'].astype("str")
df1['shop+user'] = df1['shopid'] + '_' + df1['userid']
df1['event_time'] = pd.to_datetime(df1['event_time'])
# Flag (shop, user) pairs with at least 3 orders
s = (df1['shop+user'].value_counts() >= 3).rename('is_brush')
df2 = s.reset_index()
df2.columns = ['shop+user', 'is_brush']
dfm = df1.merge(df2, how='left', on='shop+user')
df = dfm[dfm['is_brush']]  # boolean mask, not the string 'True'
df = df.sort_values(by=['event_time'], ascending=True)  # sort_index(by=...) is deprecated
```
#### Code from Vincent Liu
```python=
import pandas as pd

df = pd.read_csv('order_brush_order.csv')  # path assumed; df was not defined in the original snippet
df['shopid'] = df['shopid'].apply(str)
df['userid'] = df['userid'].apply(str)
# join with a separator so e.g. shop 12 / user 34 != shop 1 / user 234
df['all'] = df['shopid'] + '_' + df['userid']
df2 = df.groupby('all').size().reset_index().rename(columns={0: 'records'})
df2 = df2[df2['records'] >= 3]
# sort by pair, then by time within each pair
# (the original chained single-column sorts only kept the last ordering)
df3 = df2.merge(df, on='all', how='left').sort_values(['all', 'event_time'])
df3['event_time'] = pd.to_datetime(df3['event_time'])
df3['time'] = pd.to_datetime('2019-12-27 00:00:00', format='%Y-%m-%d %H:%M:%S')
df3['timediff'] = (df3['event_time'] - df3['time']).dt.total_seconds()
df4 = df3.sort_values('all').reset_index(drop=True)
df4['all'].unique()
```
#### Code from Wenhuan Chang
```python=
import pandas as pd

input_path = '../input/order-brushing-shopee/order_brush_order.csv'  # renamed: 'input' shadows the builtin
df = pd.read_csv(input_path)
# Build a shop_user key in its own column (the original overwrote orderid)
df['shop_user'] = df.shopid.astype(str).str.cat(df.userid.astype(str), sep='_')
df.groupby('shop_user').count()
df['shop_user'].value_counts()
df['freq'] = df.groupby('shop_user')['shop_user'].transform('count')
df = df[df['freq'] >= 3]
```
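The `groupby(...).transform('count')` trick above attaches each group's size to every row, so a plain boolean mask can keep the ≥ 3 pairs; a toy check:

```python=
import pandas as pd

# Toy data: key 'a' appears 3 times, 'b' once
toy = pd.DataFrame({"key": ["a", "a", "a", "b"], "order": [1, 2, 3, 4]})
toy["freq"] = toy.groupby("key")["key"].transform("count")
kept = toy[toy["freq"] >= 3]
print(kept["key"].unique().tolist())  # -> ['a']
```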
#### Code from Gueiya
```python=
import pandas as pd

data_path = 'data/'
df = pd.read_csv(data_path + 'order_brush_order.csv')
df["count"] = 1
df.head()

# Convert event_time to datetimes (vectorized; time.strptime cannot
# parse a whole Series at once, which is what the original attempted)
df['event_time'] = pd.to_datetime(df['event_time'])
# Unix timestamp and an hourly label, per row
df['timestamp'] = df['event_time'].astype('int64') // 10**9
df['hour'] = df['event_time'].dt.strftime('%Y/%m/%d %H')
```
### Competition 2 -- Product Detection
---
#### Code from Gueiya (Colab)
```python=
!pip install plotly
import plotly.express as px
```
```python=
from google.colab import files
uploaded = files.upload()
```
```python=
# Read the .jpg filenames from the train subfolders 00-41, in order
import glob
import pandas as pd

frames = []
for i in range(42):
    folder = "%02d" % i  # subfolder names are zero-padded: 00 ... 41
    catfs = glob.glob("/content/drive/My Drive/shopee/competition/shopee-product-detection-dataset/train/train/" + folder + "/*.jpg")
    frames.append(pd.DataFrame({
        "path": catfs,
        "number": [i] * len(catfs),
    }))
# the original overwrote the table on every iteration; accumulate instead
df = pd.concat(frames, ignore_index=True)
df["filename"] = df["path"].str.split("/").str[-1]
df.shape
```
```python=
# Prepare the test table; update it once the train data has finished uploading
import glob
import pandas as pd

catfs = glob.glob("/content/drive/My Drive/shopee/competition/shopee-product-detection-dataset/test/test/*.jpg")
catt = [43] * len(catfs)  # placeholder category, to be overwritten later
dft = pd.DataFrame({
    "path": catfs,
    "number": catt
})
ddft = dft['path'].str.split('/').str[-1]
df_test = pd.DataFrame({
    'filename': ddft,
    'category': catt
})
df_test.head()
```
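Both cells rely on the same path-to-filename split via the pandas `.str` accessor; on a couple of made-up paths it behaves like this:

```python=
import pandas as pd

# Made-up paths, mimicking the Drive layout above
paths = pd.Series(["/content/drive/test/0001.jpg", "/content/drive/test/0002.jpg"])
filenames = paths.str.split("/").str[-1]
print(filenames.tolist())  # -> ['0001.jpg', '0002.jpg']
```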
```python=
from keras.applications.vgg16 import VGG16
vgg = VGG16(include_top=False, input_shape=(224, 224, 3))
vgg.summary()
from keras.models import Model
from keras.layers import Dense, Dropout, Flatten
from keras.layers import BatchNormalization
# trainable must be set before compile
for l in vgg.layers:
    l.trainable = False
x = BatchNormalization()(vgg.output)
# MLP head
x = Flatten()(x)
x = Dense(512, activation="relu")(x)
x = Dropout(0.25)(x)
out = Dense(42, activation="softmax")(x)  # 42 product categories; Dense(2) was a leftover from a binary demo
model = Model(inputs=vgg.input, outputs=out)
model.summary()
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])
```
```python=
# https://github.com/keras-team/keras-applications/blob/master/keras_applications/imagenet_utils.py
import numpy as np
from keras.preprocessing.image import load_img
from keras.applications.vgg16 import preprocess_input
p = df.iloc[0]["path"]
img = load_img(p, target_size=(224, 224)).convert("RGB")
img_np = np.array(img)
img_pre = preprocess_input(img_np)
img_pre
```
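For reference, VGG16's `preprocess_input` (in its default 'caffe' mode) flips RGB to BGR and subtracts the ImageNet per-channel means. A NumPy sketch of that transform, as an illustration only (not the Keras source):

```python=
import numpy as np

def vgg_preprocess(img_rgb):
    """Sketch of 'caffe'-style preprocessing: RGB -> BGR, then
    subtract the ImageNet per-channel means (in BGR order)."""
    img = img_rgb[..., ::-1].astype("float64")
    return img - np.array([103.939, 116.779, 123.68])

black = np.zeros((1, 1, 3))
print(vgg_preprocess(black).round(3).tolist())  # -> [[[-103.939, -116.779, -123.68]]]
```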
```python=
# Try out numpy's randint
ori = np.random.randint(0, 10, 5)
new = list(map(lambda x: x**2, ori))
print(ori)
print(new)

def preprocess(path):
    img = load_img(path, target_size=(224, 224)).convert("RGB")
    img_np = np.array(img)
    img_pre = preprocess_input(img_np)
    return img_pre

def get_images(paths, targets, batch=20):
    idx = np.random.randint(0, len(paths), batch)
    ps = paths[idx]
    xs = np.array(list(map(preprocess, ps)))
    ys = targets[idx]
    return (ps, xs, ys)
```
```python=
from sklearn.model_selection import train_test_split
x = np.array(df["path"])
y = np.array(df["number"])
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)
ps, xs, ys = get_images(x_train, y_train)
print(ys)
for i in range(10):
    print("-" * 15, "Times:", i, "-" * 15)
    _, xs, ys = get_images(x_train, y_train)
    train_loss = model.train_on_batch(xs, ys)
    print("[Train]:", train_loss)
    _, xs, ys = get_images(x_test, y_test)
    val_loss = model.test_on_batch(xs, ys)
    print("[Validate]:", val_loss)
_, xs, ys = get_images(x_test, y_test, 100)
model.evaluate(xs, ys)
```
```python=
# Inspect the predicted probabilities: hard 0.0 / 1.0 values are a red flag
model.predict(xs)
```
```python=
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
ps, xs, ys = get_images(x_test, y_test, 100)
accu = model.evaluate(xs, ys)[1]
print("Accuracy:", accu * 100, "%")
# With the functional Model (unlike Sequential) there is no
# predict_classes; use predict + argmax instead
pre = model.predict(xs).argmax(axis=1)
x_test = np.array(list(map(lambda p: np.array(load_img(p)), ps)))
idx = np.nonzero(pre != ys)[0][:200]
pre_false_img = x_test[idx]
pre_false_label = ys[idx]
pre_false_pre = pre[idx]
width = 10
height = len(idx) // width + 1
plt.figure(figsize=(14, 42))
z = zip(pre_false_img, pre_false_label, pre_false_pre)
for i, (img, label, p) in enumerate(z):
    # show the true and predicted category ids; the original
    # 'trans = ["Cat", "Dog"]' lookup was a leftover from a binary demo
    # and would fail for the 42 product categories
    plt.subplot(height, width, i+1)
    plt.title("[O]:{}\n[P]:{}".format(label, p))
    plt.axis("off")
    plt.imshow(img)
```
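The `predict` + `argmax` replacement for `predict_classes` is just a row-wise argmax over the probability matrix, which NumPy shows directly:

```python=
import numpy as np

# Toy probability matrix: two samples, three classes
probs = np.array([[0.1, 0.7, 0.2],
                  [0.8, 0.1, 0.1]])
pred = probs.argmax(axis=1)  # class id with the highest probability per row
print(pred.tolist())  # -> [1, 0]
```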
### Competition 3 -- Short Programming Contest 1
---
Shopee released the problem statements & datasets
<iframe frameborder="0" scrolling="no"
width="640" height="480"
src="https://drive.google.com/file/d/1eckQ8lCQZQ-g5vnrLPxmkCyFKwcmTkWb/preview">
</iframe>
#### Lucky Winner (Max. score: 20)
[google docs](https://docs.google.com/document/d/1UfgUQ2TmRUvxEMM3W4nqHrMdXFNcMnOG9O9taVlnj4s/edit)
code from Gueiya
```python=
import pandas as pd
import numpy as np
# First build a table of random numbers
N = list(range(1000))
tb = pd.DataFrame(np.random.rand(1000, 1), index=N, columns=['K'])
# The absolute value of the input is the table index; print the first
# three digits of tb['K'] at that position
print(str(tb['K'][int(abs(int(input())))]).strip('0.')[:3])
```
#### Sequences (Max. score: 30)
[google docs](https://docs.google.com/document/d/11sTwvCzlkH1lyACpuHagqv8Y1sqnAQapZYx65tokzt0/edit)
#### Search Engine (Max. score: 20)
[google docs](https://docs.google.com/document/d/1s1l_y-PgrMbeJiiMaHHxvc89klJet-d25y2fQAWVBvU/edit)
#### Item Stock (Max. score: 10)
[google docs](https://docs.google.com/document/d/1wjthvmYk1P_0lYsY_wMfDuapZfWqO1032f4mkiI0jVY/edit)
#### Judging Servers (Max. score: 20)
[google docs](https://docs.google.com/document/d/1BZV3ojZnduusLsIF6lfYHNPWxqPBkGpSFSUvRClq5IA/edit)
## Workshop
---
### Python Intermediate by The Brainery Code on 15 June 2020
:::info
**Tips shared:** importing data and modifying rows with `.loc[]`
:::
- 2 methods of importing data
```python=
# import file
import pandas as pd
# if your file is someway away from your py file
pokemin_df = pd.read_csv("C:\\Users\\Queiy\\Documents\\shopee\\Pokemon_Gen_1-8.csv")
# if the file is at the same forder as your py file
pokemin_df = pd.read_csv("Pokemon_Gen_1-8.csv")
print(pokemin_df)
```
- adding/modifying rows with .loc
```python=
import pandas as pd
cars = {'Brand': ['Honda', 'Toyota', 'Ford'], 'Price': [22000, 25000, 27000]}
cars_df = pd.DataFrame(cars, columns=['Brand', 'Price'], index=['Car 1', 'Car 2', 'Car 3'])
year = [2010, 2011, 2008]
cars_df['Year'] = year
cars_df.insert(1, 'Model', ['Civic', 'Prius', 'Focus'], True)
# adding a row / modifying a row
cars_df.loc['Car 4'] = ['Hyundai', 'Avante', 20000, 2010]
cars_df.loc['Car 3'] = ['Suzuki', 'Swift', 26000, 2013]
cars_df['Discount'] = 0.1 * cars_df['Price']
cars_df['Discount Price'] = cars_df['Price'] - cars_df['Discount']  # price after the 10% discount
print(cars_df)
```
- selecting rows with .loc[:]
```python=
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pokemon_df = pd.read_csv("Pokemon_Gen_1-8.csv")
# eg. 1
# [[...]] --> select several columns as a DataFrame
# print(pokemon_df[['#', 'Name', 'Generation']])
# eg. 2
# [20:28] --> slice rows of the table
# pokemon_generation = pokemon_df.loc[20:28, ['#', 'Name', 'Generation']]
# print(pokemon_generation)
# eg. 3
pokemon_12 = pokemon_df.loc[12:]
print(pokemon_12)
```
### Intro to Common Algorithms on 16 June 2020
<iframe src="https://www.slideshare.net/slideshow/embed_code/key/yZMW3SEzrqshJP" width="427" height="356" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;" allowfullscreen> </iframe> <div style="margin-bottom:5px"> <strong> <a href="https://www.slideshare.net/secret/yZMW3SEzrqshJP" title="Shopee algo slides" target="_blank">Shopee algo slides</a> </strong> from <strong><a href="https://www.slideshare.net/GUEIYAJHANG" target="_blank">GUEI-YA JHANG</a></strong> </div>
###### tags: `.loc[]` `data exploration analysis` `algorithm`
### Sorting Algorithms on 28 June 2020
<iframe frameborder="0" scrolling="no"
width="640" height="480"
src="https://drive.google.com/file/d/1Z1NhTjOSmNP_7NN8y5rGFn_iXqc_x1FU/preview">
</iframe>
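As a companion to the slides (their exact contents are assumed), a minimal merge sort in Python captures the divide-and-conquer pattern such workshops typically cover:

```python=
# A minimal merge sort: split, sort each half recursively, then merge
def merge_sort(a):
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    return out + left[i:] + right[j:]

print(merge_sort([5, 2, 9, 1, 5]))  # -> [1, 2, 5, 5, 9]
```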
### Python Advanced on 1 July 2020
[google drive](https://drive.google.com/drive/folders/1zlpDlIXAz-UnIPwxua7hAfnBusDjIWZE)
### Java 8 on 6 July 2020
:::info
**New features in Java 8**
:::
[github](https://github.com/yahyabaassou/Java8Workshop/tree/master/src/test/java/com/yahyabaassou/java8/exercises/exercise1)
[slides](https://drive.google.com/file/d/1r-Bn2SDkmwXQ5Zd_u6VyxHwJU5SJD5fC/view)
### Data Science Models on 8 July 2020
[google drive](https://drive.google.com/drive/folders/1ozzAQWn5U7XW2rIkPB963azowVZvzT3a)