Japanese Equity Data - Feature Engineering

Team members:
林謙穎 07155156

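Objective

We use the predictions of a machine learning model to identify assets that are likely to rise or fall, so that we can enter market-neutral long and short positions accordingly. The approach resembles the initial linear-regression trading strategy used in "Linear Models - From Risk Factors to Return Forecasts" and "Workflow - From Model to Strategy Backtesting".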

Requirements

Environment packages

Figure: environment packages
Python version

3.8
Required data (.csv)

DATA

Packages used











import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from pathlib import Path
import numpy as np
import pandas as pd
import talib
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
idx = pd.IndexSlice

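Getting the data (Stooq Japanese Equity data 2014-2019)

We design a strategy for Japanese stocks using data provided by Stooq. Stooq is a Polish data provider that currently offers interesting datasets across a range of asset classes, markets, and frequencies.
Although there is little transparency about the source and quality of the data, it has the strong advantage of currently being free. In other words, we can experiment with stock, bond, commodity, and FX data at daily, hourly, and 5-minute frequencies, but should treat the results with caution.
We use price data for roughly 3,000 Japanese stocks over the 2014-2019 period.
The last 2 years serve as the out-of-sample test period, and the earlier years serve as the cross-validation sample for model selection.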






DATA_DIR = Path('..', 'data')
prices = (pd.read_hdf(DATA_DIR / 'assets.h5', 'stooq/jp/tse/stocks/prices')
          .loc[idx[:, '2014': '2019'], :]
          .loc[lambda df: ~df.index.duplicated(), :])
prices.info(show_counts=True)
before = len(prices.index.unique('ticker').unique())
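The split between the cross-validation sample and the out-of-sample test period described above can be sketched on a toy panel. This is only an illustration: the tickers, the prices, and the exact cutoff date of 2018-01-01 are assumptions, not values taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy (ticker, date) panel; tickers and prices are made up for illustration.
dates = pd.date_range('2014-01-01', '2019-12-31', freq='B', name='date')
index = pd.MultiIndex.from_product([['AAA', 'BBB'], dates], names=['ticker', 'date'])
toy = pd.DataFrame({'close': np.arange(len(index), dtype=float)}, index=index)

# Assumed cutoff: 2018-2019 are held out, 2014-2017 are used for cross-validation.
cutoff = pd.Timestamp('2018-01-01')
date_level = toy.index.get_level_values('date')
cv_sample = toy[date_level < cutoff]
test_sample = toy[date_level >= cutoff]

print(len(cv_sample), len(test_sample))
```

Splitting on the date level rather than on row position keeps every ticker's history on the same side of the cutoff.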

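Remove missing values

We forward-fill gaps of up to five consecutive observations and drop every ticker that still has missing values afterwards, then keep only the 1,000 most-traded stocks.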










prices = (prices.unstack('ticker')
        .sort_index()
        .ffill(limit=5)
        .dropna(axis=1)
        .stack('ticker')
        .swaplevel())
prices.info(show_counts=True)

after = len(prices.index.unique('ticker').unique())
print(f'Before: {before:,.0f} after: {after:,.0f}')
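To see why `ffill(limit=5)` followed by `dropna(axis=1)` removes exactly the tickers with long gaps, here is a small sketch on a wide frame with two hypothetical tickers, one with a two-day gap and one with a six-day gap.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2019-01-01', periods=10, freq='B', name='date')
wide = pd.DataFrame({
    # two consecutive missing prices: fully repaired by the forward fill
    'SHORT_GAP': [1.0, np.nan, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    # six consecutive missing prices: one NaN survives the limited fill
    'LONG_GAP': [1.0] + [np.nan] * 6 + [8.0, 9.0, 10.0],
}, index=dates)

filled = wide.ffill(limit=5)    # fill at most 5 consecutive gaps per column
kept = filled.dropna(axis=1)    # any remaining NaN drops the whole column

print(kept.columns.tolist())
```

A ticker whose history starts with missing values is also dropped, because a forward fill has nothing to carry forward at the start of a column.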

Keep the most-traded symbols




dv = prices.close.mul(prices.volume)
keep = dv.groupby('ticker').median().nlargest(1000).index.tolist()
prices = prices.loc[idx[keep, :], :]
prices.info(show_counts=True)
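The ranking above uses the median of close times volume per ticker. A minimal sketch with three made-up tickers shows one plausible reason for preferring the median: a single day of extreme volume does not lift a thinly traded stock into the selection.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['AAA', 'BBB', 'CCC'], pd.date_range('2019-01-01', periods=3, freq='B')],
    names=['ticker', 'date'])
toy = pd.DataFrame({
    'close':  [10, 10, 10, 100, 100, 100, 5, 5, 5],
    'volume': [100, 200, 300, 50, 60, 70, 10, 20, 3000],  # CCC spikes once
}, index=index)

dollar_volume = toy.close.mul(toy.volume)
median_dv = dollar_volume.groupby(level='ticker').median()
top2 = median_dv.nlargest(2).index.tolist()

print(median_dv.to_dict(), top2)
```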

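Feature engineering

Feature engineering extracts as much useful information as possible from the raw data for use by algorithms and models.
It covers how features are used, how they are obtained, how they are processed, and how they are monitored.

Compute period returns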








intervals = [1, 5, 10, 21, 63]
returns = []
by_ticker = prices.groupby(level='ticker').close
for t in intervals:
    returns.append(by_ticker.pct_change(t).to_frame(f'ret_{t}'))
returns = pd.concat(returns, axis=1)
returns.info(show_counts=True)
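Grouping by ticker before calling `pct_change(t)` matters because the rows of different tickers are stacked in one index. The following sketch with two invented tickers shows that the first `t` observations of each ticker are left missing instead of being compared with the previous ticker's prices.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['AAA', 'BBB'], pd.date_range('2019-01-01', periods=4, freq='B')],
    names=['ticker', 'date'])
close = pd.Series([100.0, 110.0, 121.0, 133.1, 50.0, 45.0, 40.5, 36.45],
                  index=index, name='close')

by_ticker = close.groupby(level='ticker')
ret_1 = by_ticker.pct_change(1)   # 1-day return within each ticker
ret_2 = by_ticker.pct_change(2)   # 2-day return within each ticker

print(pd.concat([ret_1.rename('ret_1'), ret_2.rename('ret_2')], axis=1))
```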

Figure

Remove outliers
















max_ret_by_sym = returns.groupby(level='ticker').max()
percentiles = [0.001, .005, .01, .025, .05, .1]
percentiles += [1-p for p in percentiles]
max_ret_by_sym.describe(percentiles=sorted(percentiles)[6:])

quantiles = max_ret_by_sym.quantile(.95)
to_drop = []
for ret, q in quantiles.items():
    to_drop.extend(max_ret_by_sym[max_ret_by_sym[ret]>q].index.tolist()) 

to_drop = pd.Series(to_drop).value_counts()
to_drop = to_drop[to_drop > 1].index.tolist()
len(to_drop)

prices = prices.drop(to_drop, level='ticker')
prices.info(show_counts=True)
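The outlier rule above flags, for each return horizon, the tickers whose maximum return exceeds a high quantile, and drops only those flagged for more than one horizon. A toy sketch of the same logic follows; the numbers are invented, and a 70% quantile replaces the 95% one so that six tickers are enough to show the effect.

```python
import pandas as pd

# Hypothetical per-ticker maximum returns for two horizons.
max_ret = pd.DataFrame({
    'ret_1': [0.05, 0.06, 0.07, 0.08, 2.50, 0.09],
    'ret_5': [0.10, 0.12, 0.11, 3.00, 4.00, 0.13],
}, index=pd.Index(['A', 'B', 'C', 'D', 'E', 'F'], name='ticker'))

cutoffs = max_ret.quantile(.7)   # one cutoff per horizon
flagged = []
for col, q in cutoffs.items():
    flagged.extend(max_ret.index[max_ret[col] > q].tolist())

counts = pd.Series(flagged).value_counts()
to_drop = counts[counts > 1].index.tolist()   # extreme on more than one horizon

print(counts.to_dict(), to_drop)
```

Ticker D is extreme on a single horizon only and is kept, while E exceeds the cutoff on both horizons and is dropped.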

Table

Compute relative return percentiles








returns = []
by_sym = prices.groupby(level='ticker').close
for t in intervals:
    ret = by_sym.pct_change(t)
    rel_perc = (ret.groupby(level='date', group_keys=False)
             .apply(lambda x: pd.qcut(x, q=20, labels=False, duplicates='drop')))
    returns.extend([ret.to_frame(f'ret_{t}'), rel_perc.to_frame(f'ret_rel_perc_{t}')])
returns = pd.concat(returns, axis=1)
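The relative percentile feature ranks each stock against all other stocks on the same day. As a sketch of what `pd.qcut` does for one date, the example below bins 20 hypothetical tickers into 5 buckets; 5 is used only to keep the output readable, whereas the code above uses `q=20`.

```python
import numpy as np
import pandas as pd

# One trading day, 20 hypothetical tickers with distinct returns.
rets = pd.Series(np.linspace(-0.10, 0.09, 20),
                 index=[f'T{i:02d}' for i in range(20)], name='ret_1')

# labels=False returns the integer bucket: 0 for the weakest returns of the day.
bucket = pd.qcut(rets, q=5, labels=False, duplicates='drop')

print(bucket.value_counts().sort_index().tolist())
```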

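Technical indicators

From the 158 technical indicators in TA-Lib, we pick a few of particular interest to analyze.

Percentage Price Oscillator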

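Percentage Price Oscillator (PPO): a normalized version of the Moving Average Convergence/Divergence (MACD) indicator. It measures the difference between the 12-day and 26-day exponential moving averages relative to the 26-day average, to capture differences in momentum across assets. The 12- and 26-day periods are the conventional choice and match the fastperiod and slowperiod defaults of talib.PPO used below.

Calculation:
PPO Line: {(12-day EMA - 26-day EMA)/26-day EMA} x 100

Signal Line: 9-day EMA of PPO

PPO Histogram: PPO - Signal Line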


ppo = prices.groupby(level='ticker', group_keys=False).close.apply(talib.PPO).to_frame('PPO')
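The PPO line, signal line, and histogram can also be written out with pandas alone. This is a minimal sketch and not TA-Lib's implementation: the helper name `ppo_frame`, the 12/26/9 spans, and the use of `ewm(span=..., adjust=False)` for the EMA are assumptions, so the numbers may differ from `talib.PPO`.

```python
import numpy as np
import pandas as pd

def ppo_frame(close: pd.Series, fast: int = 12, slow: int = 26, signal: int = 9) -> pd.DataFrame:
    """PPO line, signal line and histogram from EMAs (hypothetical helper)."""
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    line = (ema_fast - ema_slow) / ema_slow * 100
    signal_line = line.ewm(span=signal, adjust=False).mean()
    return pd.DataFrame({'ppo': line, 'signal': signal_line, 'hist': line - signal_line})

# Hypothetical price path: steady 1% daily growth.
rising = pd.Series(100 * 1.01 ** np.arange(60))
out = ppo_frame(rising)
flat = ppo_frame(pd.Series([50.0] * 40))

print(out.tail(3))
```

Because the difference is divided by the slow average, a stock priced at 100 and one priced at 10,000 with the same percentage moves get the same PPO, which is what makes the indicator comparable across tickers.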

Normalized Average True Range

Normalized Average True Range (NATR): measures price volatility in a way that can be compared across assets.

Calculation:
NATR = ATR(n) / Close * 100

Where: ATR(n) = Average True Range over 'n' periods.


natr = prices.groupby(level='ticker', group_keys=False).apply(lambda x: talib.NATR(x.high, x.low, x.close)).to_frame('NATR')
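To make the normalization concrete, the sketch below computes a NATR-like value with pandas only and applies it to the same price path at two price levels. It assumes a simple rolling mean for the ATR and a hypothetical helper name `natr_simple`; TA-Lib smooths the true range differently, so this illustrates the formula rather than reproducing `talib.NATR`.

```python
import pandas as pd

def natr_simple(high: pd.Series, low: pd.Series, close: pd.Series, n: int = 14) -> pd.Series:
    """ATR(n) / close * 100, with ATR as a plain rolling mean (assumption)."""
    prev_close = close.shift(1)
    # True range: the largest of the day's range and the two gaps to the prior close.
    true_range = pd.concat(
        [high - low, (high - prev_close).abs(), (low - prev_close).abs()], axis=1
    ).max(axis=1)
    atr = true_range.rolling(n).mean()
    return atr / close * 100

close = pd.Series([100.0, 101.0, 99.0, 102.0, 100.0, 103.0])
cheap = natr_simple(close + 1, close - 1, close, n=3)
# Same path at ten times the price level.
pricey = natr_simple((close + 1) * 10, (close - 1) * 10, close * 10, n=3)

print(pd.concat([cheap.rename('cheap'), pricey.rename('pricey')], axis=1))
```

Both columns agree, whereas the raw ATR of the second series would be ten times larger.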

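Relative Strength Index

Relative Strength Index (RSI): another popular momentum indicator.
It measures the speed and magnitude of price movements. Using the RSI level to time trades rests on the principle that a prolonged rise tends to be followed by a fall, and a prolonged fall by a rise.

Calculation:
RSI = 100 - 100 / (1 + RS)

RS = Average Gain of n days UP / Average Loss of n days DOWN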


rsi = prices.groupby(level='ticker', group_keys=False).close.apply(talib.RSI).to_frame('RSI')
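A short worked example helps to check the RSI formula by hand. The sketch below uses simple rolling averages of gains and losses and a hypothetical helper `rsi_simple`; TA-Lib applies Wilder's smoothing instead, so the values will not match `talib.RSI` exactly.

```python
import pandas as pd

def rsi_simple(close: pd.Series, n: int = 14) -> pd.Series:
    """RSI = 100 - 100 / (1 + RS) with simple averages (assumption)."""
    delta = close.diff()
    avg_gain = delta.clip(lower=0).rolling(n).mean()
    avg_loss = delta.clip(upper=0).abs().rolling(n).mean()
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)

# Alternating +2 / -1 moves: average gain 1.0, average loss 0.5, so RS = 2.
zigzag = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 13.0, 15.0, 14.0])
rsi_zigzag = rsi_simple(zigzag, n=4)

# A series that only rises has no losses, so RS is infinite and RSI is 100.
rsi_rising = rsi_simple(pd.Series(range(1, 10), dtype=float), n=4)

print(rsi_zigzag.iloc[-1], rsi_rising.iloc[-1])
```

With RS = 2 the formula gives 100 - 100 / 3, roughly 66.7, which sits just below the usual overbought threshold of 70.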

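Bollinger Bands

Bollinger Bands: a moving average bracketed by an upper and a lower band, each a multiple of the moving standard deviation away from it. The position of the price relative to the bands is used to identify mean-reversion opportunities.

Calculation:
%B = (current price - lower band) / (upper band - lower band)

Rationale: %B describes where the price sits relative to the upper and lower bands. Six basic relationships can be quantified.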





def get_bollinger(x):
    u, m, l = talib.BBANDS(x)
    return pd.DataFrame({'u': u, 'm': m, 'l': l})
    
bbands = prices.groupby(level='ticker', group_keys=False).close.apply(get_bollinger)
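The %B formula can be sketched with pandas rolling statistics. The helper `percent_b`, the 5-period window, the 2 standard deviations, and the population standard deviation (`ddof=0`) are assumptions chosen for illustration. The six relationships commonly listed are: %B above 1 (price above the upper band), equal to 1 (at the upper band), above 0.5 (above the middle band), below 0.5 (below the middle band), equal to 0 (at the lower band), and below 0 (below the lower band).

```python
import pandas as pd

def percent_b(close: pd.Series, n: int = 5, k: float = 2.0) -> pd.Series:
    """%B = (price - lower) / (upper - lower) from rolling mean and std (sketch)."""
    mid = close.rolling(n).mean()
    std = close.rolling(n).std(ddof=0)
    upper, lower = mid + k * std, mid - k * std
    return (close - lower) / (upper - lower)

# Last price equals the 5-day average, so it sits on the middle band.
at_middle = percent_b(pd.Series([8.0, 12.0, 9.0, 11.0, 10.0]))

# A jump after four flat days lands on the upper band for n=5, k=2.
at_upper = percent_b(pd.Series([10.0, 10.0, 10.0, 10.0, 12.0]))

print(at_middle.iloc[-1], at_upper.iloc[-1])
```

The section itself does not store %B. It keeps the two ratios close / lower band and upper band / close, which carry similar information about where the price sits relative to the bands.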

Combine features







data = pd.concat([prices, returns, ppo, natr, rsi, bbands], axis=1)

data['bbl'] = data.close.div(data.l)
data['bbu'] = data.u.div(data.close)
data = data.drop(['u', 'm', 'l'], axis=1)

data.bbu.corr(data.bbl, method='spearman')

Plot indicators for a randomly sampled ticker










indicators = ['close', 'bbl', 'bbu', 'PPO', 'NATR', 'RSI']
ticker = np.random.choice(data.index.get_level_values('ticker'))
(data.loc[idx[ticker, :], indicators].reset_index('ticker', drop=True)
 .plot(lw=1, subplots=True, figsize=(16, 10), title=indicators, layout=(3, 2), legend=False))
plt.suptitle(ticker, fontsize=14)
sns.despine()
plt.tight_layout()
plt.subplots_adjust(top=.95)

data = data.drop(prices.columns, axis=1)

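Figure

CLOSE: the closing price rises steadily.
BBL: swings more widely around mid-2016 and is fairly stable otherwise.
PPO: oscillates within a fairly stable range.
NATR: volatility is high in a few intervals and low elsewhere.
RSI: the index is generally considered most representative with n=14, the period recommended by J. Welles Wilder, who developed the indicator. He pointed out that an RSI rising to 70 indicates that the security is overbought, and investors should consider selling it. Conversely, an RSI falling to 30 indicates that the security is oversold.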

Create time period indicators




dates = data.index.get_level_values('date')
data['weekday'] = dates.weekday
data['month'] = dates.month
data['year'] = dates.year

Compute forward returns










outcomes = []
by_ticker = data.groupby('ticker')
for t in intervals:
    k = f'fwd_ret_{t:02}'
    outcomes.append(k)
    data[k] = by_ticker[f'ret_{t}'].shift(-t)

data.info(show_counts=True)

data.to_hdf('data.h5', 'stooq/japan/equities')
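The forward returns are the period returns shifted back in time within each ticker: the value stored on a given date is the return that will be realized over the following t days. A toy sketch with two invented tickers and t=1 makes the alignment visible.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['AAA', 'BBB'], pd.date_range('2019-01-01', periods=4, freq='B')],
    names=['ticker', 'date'])
close = pd.Series([100.0, 110.0, 99.0, 108.9, 50.0, 50.0, 60.0, 60.0], index=index)

toy = pd.DataFrame({'ret_1': close.groupby(level='ticker').pct_change(1)})
# shift(-1) inside each ticker: tomorrow's 1-day return becomes today's target.
toy['fwd_ret_01'] = toy.groupby(level='ticker')['ret_1'].shift(-1)

print(toy)
```

The last row of each ticker has no forward return, and grouping by ticker prevents the first return of BBB from leaking into the last row of AAA. These missing targets have to be dropped before fitting a model.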

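Conclusion

print(data)

Figure

data.iloc[?]

?: any row position from 1 to 1383904 can be used here to pull out the corresponding result (note that iloc positions are zero-based),
which makes the returns for each row easy to read off.
Figure
Figure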
