Feature Extraction and Automatic Labeling for Financial Data

2018-07-02 14:45

http://blog.sina.cn/dpool/blog/u/1323171481


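In traditional machine learning, one of the most important tasks is "feature engineering": extracting features from the raw data. In image recognition, for example, images are typically converted to grayscale or binarized first.
Financial time series lend themselves to the same kind of feature extraction: returns, volatility, technical indicators such as moving averages and momentum, and also fundamental features such as PE, PB, and ROA. These features correspond to the "factors" of traditional financial analysis, so we can start from traditional alpha factors and use machine learning models to look for relationships among them.
So, on top of the raw data, we implement automated feature extraction and data labeling.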
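
To make the feature types mentioned above concrete, here is a minimal sketch that computes a one-day return, rolling volatility, a moving average, and momentum from a series of closing prices. The function name basic_features, the column names, and the toy prices are assumptions made for illustration; they are not part of the original project.

```python
import pandas as pd


def basic_features(close: pd.Series, window: int = 5) -> pd.DataFrame:
    """Compute a few illustrative price-based features (hypothetical helper)."""
    out = pd.DataFrame(index=close.index)
    # One-day simple return: today's close relative to yesterday's close.
    daily_ret = close / close.shift(1) - 1
    out['return_1d'] = daily_ret
    # Rolling volatility: standard deviation of daily returns over the window.
    out['volatility'] = daily_ret.rolling(window).std()
    # Simple moving average of the close over the window.
    out['ma'] = close.rolling(window).mean()
    # Momentum: return over the last `window` periods.
    out['momentum'] = close / close.shift(window) - 1
    return out


# Toy closing prices, invented for the example.
close = pd.Series([10.0, 10.5, 10.2, 10.8, 11.0, 10.9, 11.4])
feats = basic_features(close, window=3)
print(feats)
```

The first rows of each rolling column are NaN because there is not yet enough history to fill the window.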

Automatic feature extraction for financial data

Import the required packages and modules

from datetime import datetime

import numpy as np  # used by auto_labeler further down
import pandas as pd

from engine.common.mongo_utils import mongo

Query the data from MongoDB and preprocess it

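def feature_extractor(instrument, features, start_date='', end_date='', benchmark='000300_index'):
    # Fetch the daily quotes for one instrument inside the (exclusive) date range.
    items = mongo.query_docs('astock_daily_quotes',
                             {'code': instrument,
                              'date': {'$gt': start_date, '$lt': end_date}})
    df = pd.DataFrame(list(items))
    df = df[['open', 'high', 'low', 'close', 'date', 'code']]
    # Index by date and sort chronologically so that shift() looks back in time.
    df.index = df['date']
    df.sort_index(inplace=True)
    # Append one column per requested feature.
    for feature in features:
        df = parse_feature(df, feature)
    return df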

Parse the requested features

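def parse_feature(df, feature):
    features_support = ['return']
    # A feature is written as '<name>_<param>', e.g. 'return_4'.
    if '_' in feature:
        feature_name = feature[:feature.index('_')]
        param = int(feature[feature.index('_') + 1:])
    else:
        feature_name = feature
        param = 0
    print(feature_name, param)
    # Unsupported features are skipped and df is returned unchanged.
    if feature_name not in features_support:
        return df
    if feature_name == 'return':
        # return_N is the return over N+1 trading days (return_0 is the one-day return).
        df[feature] = df['close'] / df['close'].shift(param + 1) - 1
    return df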

Let's read the data for Kweichow Moutai (600519) from 2017-01-01 to 2017-01-31 and extract two features: the one-day return and the five-day return.

features = ['return_0', 'return_4']
start = datetime(2017, 1, 1)
end = datetime(2017, 1, 31)
print(feature_extractor('600519', features, start_date=start, end_date=end))

As the output below shows, we not only read the basic daily OHLC bars but also computed return_0 and return_4 automatically.

              open    high     low   close    code  return_0  return_4
date
2017-01-03  334.28  337.00  332.81  334.56  600519       NaN       NaN
2017-01-04  334.62  352.17  334.60  351.91  600519  0.051859       NaN
2017-01-05  350.00  351.45  345.44  346.74  600519 -0.014691       NaN
2017-01-06  346.64  359.78  346.10  350.76  600519  0.011594       NaN
2017-01-09  347.80  352.88  346.54  348.51  600519 -0.006415       NaN
2017-01-10  348.45  352.00  346.60  349.00  600519  0.001406  0.043161
2017-01-11  348.00  348.00  343.50  345.45  600519 -0.010172 -0.018357
2017-01-12  346.55  347.40  344.51  347.05  600519  0.004632  0.000894
2017-01-13  346.98  347.39  343.88  344.87  600519 -0.006282 -0.016792
2017-01-16  344.13  344.80  338.80  341.47  600519 -0.009859 -0.020200
2017-01-17  342.60  351.50  342.00  349.13  600519  0.022432  0.000372
2017-01-18  348.88  356.77  347.21  355.08  600519  0.017042  0.027877
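
As a sanity check on the naming convention, the sketch below recomputes return_0 and return_4 by hand from the first six closing prices in the output. It uses only the printed closes and the same shift formula as parse_feature; the variable names are chosen for the example.

```python
import pandas as pd

# Closing prices for 2017-01-03 through 2017-01-10, copied from the output.
close = pd.Series(
    [334.56, 351.91, 346.74, 350.76, 348.51, 349.00],
    index=pd.to_datetime([
        '2017-01-03', '2017-01-04', '2017-01-05',
        '2017-01-06', '2017-01-09', '2017-01-10',
    ]),
)

# Same formula as parse_feature: 'return_<param>' shifts by param + 1 rows.
return_0 = close / close.shift(0 + 1) - 1
return_4 = close / close.shift(4 + 1) - 1

print(round(return_0.loc['2017-01-04'], 6))  # one-day return on 2017-01-04
print(round(return_4.loc['2017-01-10'], 6))  # five-day return on 2017-01-10
```

The value for 2017-01-10 is 349.00 / 334.56 - 1, which is why return_4 is NaN for the first five rows: five earlier closes are needed before the ratio is defined.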

Automatic labeling of financial data

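Data labeling is the most important data-preparation task in supervised learning. Supervised learning essentially learns the statistical relationship between sample features and labels. The saying "garbage in, garbage out" underlines how much the quality of the labels matters.
Modern deep learning relies on large data sets and the computing power of GPUs, and labeling that data is very expensive: many companies, for example in autonomous driving, hire people abroad to do the labeling.
Financial time series are comparatively easy to label, and to a certain extent the labeling can be automated. From a backtesting point of view, every trade has already happened and is on record. Standing at a point in the past, we know the "future" path over the next few days or months, along with the corresponding returns, volatility, and so on, and these can serve as labels for the samples.
As described earlier, feature extraction "looks back" at history: the return over the last 5 days, for instance, becomes a feature of the current day. Labeling looks forward: given today's features, what is the return over, say, the next 5 days?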
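
The difference between looking back and looking forward comes down to the sign of the shift. The following minimal sketch uses invented prices and a horizon of n = 2 rows; it is an illustration, not the article's data.

```python
import pandas as pd

# Toy closing prices, invented for the example.
close = pd.Series([100.0, 102.0, 101.0, 105.0, 104.0, 108.0])
n = 2

# Feature: return over the PAST n rows (positive shift looks back).
feature = close / close.shift(n) - 1
# Label: return over the NEXT n rows (negative shift looks ahead).
label = close.shift(-n) / close - 1

frame = pd.DataFrame({'close': close, 'feature': feature, 'label': label})
print(frame)
```

The feature is undefined for the first n rows and the label for the last n rows. The label at row t equals the feature at row t + n, so a label must never be used as a model input at row t: that would leak future information into the features.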

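The target is the return over the next 5 days.
We rank the future 5-day returns using the 0.2, 0.4, 0.6, and 0.8 quantiles of the Series, which split the whole series into five parts labeled 0 to 4.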
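
For comparison, pandas offers pd.qcut, which performs the same kind of quantile binning in one call. The sketch below is an alternative to the explicit quantile comparisons, not the article's own method, and the return values are invented for the example.

```python
import pandas as pd

# Toy future returns; the trailing NaNs mimic the last rows of a sample,
# where the future return is not yet known.
future_ret = pd.Series(
    [0.043, -0.018, 0.001, -0.017, -0.020, 0.0004,
     0.028, 0.022, 0.029, 0.0286, None, None],
    dtype=float,
)

# q=5 cuts at the 0.2, 0.4, 0.6, 0.8 quantiles; labels=False returns
# the bin number 0..4 instead of an interval. NaN inputs stay NaN.
labels = pd.qcut(future_ret, q=5, labels=False)

print(pd.DataFrame({'future_ret': future_ret, 'label': labels}))
```

Each class receives roughly the same number of samples, which keeps the classification target balanced. Note that the quantiles are computed over the whole sample, so the class boundaries depend on the date range being labeled.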

# Automatically label the data (np is numpy: import numpy as np)
def auto_labeler(df, label, hold_days):
    # Only 'return' labels are supported; other label types leave df unchanged.
    if label != 'return':
        return df
    label_name = 'label_return_' + str(hold_days)
    # Future return over the next hold_days rows (negative shift looks ahead).
    df[label_name] = df['close'].shift(-hold_days) / df['close'] - 1
    rank20 = df[label_name].quantile(0.2)
    rank40 = df[label_name].quantile(0.4)
    rank60 = df[label_name].quantile(0.6)
    rank80 = df[label_name].quantile(0.8)
    # The comparison on the next line was lost in the source; '<= rank20' is a reconstruction.
    df['label'] = np.where(df[label_name] <= rank20, 0, None)
    df['label'] = np.where(df[label_name] > rank20, 1, df['label'])
    df['label'] = np.where(df[label_name] > rank40, 2, df['label'])
    df['label'] = np.where(df[label_name] > rank60, 3, df['label'])
    df['label'] = np.where(df[label_name] > rank80, 4, df['label'])
    return df

To use it, first extract the basic features, then label the data on top of them.

start = datetime(2017, 1, 1)
end = datetime(2017, 1, 31)
df = feature_extractor('600519', features, start_date=start, end_date=end)
df = auto_labeler(df, 'return', 5)
print(df.head(10))

The result is as follows:

              open    high     low   close    code  return_0  return_4  \
date
2017-01-03  334.28  337.00  332.81  334.56  600519       NaN       NaN
2017-01-04  334.62  352.17  334.60  351.91  600519  0.051859       NaN
2017-01-05  350.00  351.45  345.44  346.74  600519 -0.014691       NaN
2017-01-06  346.64  359.78  346.10  350.76  600519  0.011594       NaN
2017-01-09  347.80  352.88  346.54  348.51  600519 -0.006415       NaN
2017-01-10  348.45  352.00  346.60  349.00  600519  0.001406  0.043161
2017-01-11  348.00  348.00  343.50  345.45  600519 -0.010172 -0.018357
2017-01-12  346.55  347.40  344.51  347.05  600519  0.004632  0.000894
2017-01-13  346.98  347.39  343.88  344.87  600519 -0.006282 -0.016792
2017-01-16  344.13  344.80  338.80  341.47  600519 -0.009859 -0.020200

            label_return_5 label
date
2017-01-03        0.043161     4
2017-01-04       -0.018357     1
2017-01-05        0.000894     2
2017-01-06       -0.016792     1
2017-01-09       -0.020200     0
2017-01-10        0.000372     2
2017-01-11        0.027877     3
2017-01-12        0.022101     3
2017-01-13        0.029344     4
2017-01-16        0.028553     4
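
Before a labeled table like this can be fed to a model, the rows with missing features or labels usually have to be dropped. The sketch below shows one way to do that, assuming the two return columns are the model inputs; the values are copied from the output above, while the names feature_cols, X, and y are chosen for the example.

```python
import pandas as pd

# Feature and label columns copied from the first ten rows of the output.
df = pd.DataFrame({
    'return_0': [None, 0.051859, -0.014691, 0.011594, -0.006415,
                 0.001406, -0.010172, 0.004632, -0.006282, -0.009859],
    'return_4': [None, None, None, None, None,
                 0.043161, -0.018357, 0.000894, -0.016792, -0.020200],
    'label': [4, 1, 2, 1, 0, 2, 3, 3, 4, 4],
})

feature_cols = ['return_0', 'return_4']
# Drop rows where any feature or the label is missing.
clean = df.dropna(subset=feature_cols + ['label'])
X = clean[feature_cols].astype(float)
y = clean['label'].astype(int)

print(X.shape, y.tolist())
```

Only the rows from 2017-01-10 onward survive here, because return_4 needs five earlier closes. In a full sample, the last hold_days rows would also be dropped, since their future return and therefore their label is unknown.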

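About the author: 魏佳斌 (Wei Jiabin) is an internet product and technology director with an MBA from the Guanghua School of Management at Peking University, a Chartered Financial Analyst (CFA), and a veteran product manager and programmer. He favors Python and follows internet trends, artificial intelligence, and AI-driven quantitative finance closely, with the aim of using the most advanced cognitive technologies to understand this complex world. AI quant open-source project:
https://github.com/ailabx/ailabx
Follow AI量化实验室 (AI Quant Lab, ailabx) for the latest AI quant techniques and news.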
