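Choose a model that suits the characteristics of the data:
- Understand the data thoroughly.
- Understand the strengths and weaknesses of each model or family of algorithms.
- Shortlist m reasonable candidates, test and validate them experimentally, then pick the single best one.

Factors to weigh: data characteristics, size, and order of magnitude; model complexity, amount of computation, computational cost, performance, efficiency, and accuracy.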
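The "shortlist m candidates, then pick one" step can be sketched as a small comparison loop. This is a minimal sketch, not the procedure used later in this post: the synthetic data, the candidate list, and the helper name `mean_cv_mse` are all assumptions made for illustration, and 5-fold cross-validation is swapped in for the single train/test split used below.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the housing data used in this post).
X, y = make_regression(n_samples=300, n_features=13, n_informative=6,
                       noise=10.0, random_state=0)

# The m candidate models to compare.
candidates = {
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "knn": KNeighborsRegressor(n_neighbors=5),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "svr": SVR(),
}


def mean_cv_mse(model, X, y, folds=5):
    """Hypothetical helper: mean squared error averaged over `folds` CV folds."""
    scores = cross_val_score(model, X, y, cv=folds,
                             scoring="neg_mean_squared_error")
    return -scores.mean()  # scikit-learn reports the negated MSE


results = {name: mean_cv_mse(model, X, y) for name, model in candidates.items()}
best_name = min(results, key=results.get)

for name, mse in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name:15s} {mse:10.2f}")
print("best:", best_name)
```

Which candidate wins depends on the data, so the ranking printed here says nothing about the housing data.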
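Boston housing price wrapper
from tpf.datasets import load_boston
X_train, y_train, X_test, y_test = load_boston(split=True, test_size=0.15, reload=False)
X_train.shape, y_train.shape, X_test.shape, y_test.shape
((430, 13), (430,), (76, 13), (76,))
Drop features with importance below 0.01 (here, keep the six most important)
from sklearn.tree import DecisionTreeRegressor
import numpy as np
dtr = DecisionTreeRegressor()
dtr.fit(X_train, y_train)
feature = dtr.feature_importances_
a = np.argsort(feature)[::-1]  # feature indices, most important first
# Keep the six most important columns; the targets are left unchanged.
# The original X_train[a][:6] indexed rows, not columns, and kept only six samples.
X_train = X_train[:, a[:6]]
X_test = X_test[:, a[:6]]
The original official method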
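# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older version.
from sklearn.datasets import load_boston
# Load the dataset; an algorithm that has never been tested on data is worth little
X, y = load_boston(return_X_y=True)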
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor()
dtr.fit(X_train,y_train)
y_pred_dtr = dtr.predict(X_test)
((y_pred_dtr - y_test)**2).mean()
19.292105263157893
The result is not identical from one training run to the next; it varies. Running the same code again:
from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor()
dtr.fit(X_train,y_train)
y_pred_dtr = dtr.predict(X_test)
((y_pred_dtr - y_test)**2).mean()
17.45394736842105
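The run-to-run variation comes from the tree itself: scikit-learn shuffles the candidate features at each split, so ties between equally good splits can be broken differently. Fixing `random_state` makes a run repeatable. A minimal sketch on synthetic data (the data and the helper name `tree_mse` are assumptions for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the housing data).
X, y = make_regression(n_samples=300, n_features=13, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)


def tree_mse(seed):
    """Fit a tree with the given seed and return its test MSE."""
    model = DecisionTreeRegressor(random_state=seed)
    model.fit(X_train, y_train)
    return float(((model.predict(X_test) - y_test) ** 2).mean())


# Same seed, same data: the two values are identical.
mse_a = tree_mse(0)
mse_b = tree_mse(0)
print(mse_a, mse_b)
```

A fixed seed makes results comparable between experiments, but it does not make the model any better; it only removes one source of noise from the comparison.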
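Drop the columns with importance below 0.01 and retrain
feature = dtr.feature_importances_
import numpy as np
a = np.argsort(feature)[::-1]  # feature indices, most important first
a
array([ 5, 12, 7, 0, 4, 6, 10, 9, 11, 2, 1, 8, 3])
# Keep the six most important columns; the targets stay as they are.
X_train = X_train[:, a[:6]]
X_test = X_test[:, a[:6]]
dtr = DecisionTreeRegressor()
dtr.fit(X_train, y_train)
y_pred_dtr = dtr.predict(X_test)
((y_pred_dtr - y_test)**2).mean()
14.103333333333325
Note: this figure was recorded with the original indexing, X_train[a][:6] (applied to y_train, X_test, and y_test as well), which selects rows rather than columns and leaves six samples in each array. It is an MSE over six test points, so it is not comparable with the full-data values above. The same holds for the later results obtained after this step.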
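Selecting features means selecting columns, and the difference from row indexing is easy to see from the array shapes. The sketch below uses synthetic data (an assumption) and shows two options: keeping the top k columns, and keeping every column whose importance reaches a threshold such as 0.01.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the housing data).
X, y = make_regression(n_samples=200, n_features=13, n_informative=4,
                       noise=5.0, random_state=0)

tree = DecisionTreeRegressor(random_state=0).fit(X, y)
importances = tree.feature_importances_
order = np.argsort(importances)[::-1]  # feature indices, most important first

# Option 1: keep the k most important columns.
k = 6
X_top_k = X[:, order[:k]]

# Option 2: keep every column whose importance is at least the threshold.
threshold = 0.01
keep = importances >= threshold
X_thresholded = X[:, keep]

# For contrast: X[order] picks ROWS by index, so this keeps six samples
# with all 13 features, which is not feature selection.
X_rows = X[order][:k]

print(X_top_k.shape)        # all samples, k columns
print(X_thresholded.shape)  # all samples, as many columns as pass the threshold
print(X_rows.shape)         # k samples, all columns
```

With column selection the target vector needs no change, because no samples are removed.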
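Using the trainer
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor()
from tpf import MlTrain
# Refit for up to `epoch` rounds and save the model when the loss improves.
# save_path is the file the model is written to (its value is not shown here).
MlTrain.train(X_train, y_train, X_test, y_test,
              model, save_path=save_path, epoch=1000)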
2.994999999999996
from tpf import pkl_load,pkl_save
model,loss=pkl_load(file_path=save_path)
y_pred_dtr = model.predict(X_test)
((y_pred_dtr - y_test)**2).mean()
2.994999999999996
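The idea behind such a trainer can be sketched with scikit-learn and `pickle` alone. This is not the `tpf` implementation, whose internals are not shown here: the function name `train_keep_best`, its parameters, and the synthetic data are hypothetical. The best loss starts at infinity, so the first round is always saved and every later round is compared against the best so far.

```python
import math
import os
import pickle
import tempfile

from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def train_keep_best(model, X_train, y_train, X_val, y_val, save_path, rounds=20):
    """Hypothetical helper: refit `model` several times and pickle the best fit."""
    best_loss = math.inf  # infinity guarantees the first round is saved
    for _ in range(rounds):
        candidate = clone(model).fit(X_train, y_train)  # fresh, unfitted copy
        loss = mean_squared_error(y_val, candidate.predict(X_val))
        if loss < best_loss:  # compare against the best so far
            best_loss = loss
            with open(save_path, "wb") as f:
                pickle.dump((candidate, loss), f)
    return best_loss


# Synthetic stand-in data (an assumption; not the housing data).
X, y = make_regression(n_samples=300, n_features=13, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.33, random_state=42)

save_path = os.path.join(tempfile.mkdtemp(), "best_tree.pkl")
best = train_keep_best(DecisionTreeRegressor(), X_train, y_train,
                       X_val, y_val, save_path)

# Reload the saved model and confirm it reproduces the recorded loss.
with open(save_path, "rb") as f:
    loaded_model, loaded_loss = pickle.load(f)
reloaded = mean_squared_error(y_val, loaded_model.predict(X_val))
print(best, loaded_loss, reloaded)
```

Because the best round is chosen on the same held-out set that reports the loss, the saved loss is an optimistic estimate; a separate test set would give a fairer number.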
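from sklearn.neighbors import KNeighborsRegressor
knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train, y_train)
y_pred_knn = knn_reg.predict(X_test)
((y_pred_knn - y_test)**2).mean()
21.96151578947368
With the features selected by the decision tree, KNN shows no improvement:
21.747733333333333
Note: the second figure was recorded after the X_train[a][:6] step, which keeps six rows rather than six columns, so the two values were not measured on the same test set.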
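from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor()
rfr.fit(X_train, y_train)
y_pred_frf = rfr.predict(X_test)
((y_pred_frf - y_test)**2).mean()
12.495555855263165
With the features selected by the decision tree, the random forest's error drops noticeably
from sklearn.tree import DecisionTreeRegressor
import numpy as np
dtr = DecisionTreeRegressor()
dtr.fit(X_train, y_train)
feature = dtr.feature_importances_
a = np.argsort(feature)[::-1]  # feature indices, most important first
# Keep the six most important columns; the targets stay as they are.
X_train = X_train[:, a[:6]]
X_test = X_test[:, a[:6]]
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor()
rfr.fit(X_train, y_train)
y_pred_frf = rfr.predict(X_test)
((y_pred_frf - y_test)**2).mean()
7.368401833333395
Note: this figure was recorded with the original indexing, X_train[a][:6], which keeps six rows rather than six columns. It is an MSE over six test points, so the drop from 12.50 to 7.37 is not a like-for-like comparison.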
Using the trainer
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
from tpf import MlTrain
MlTrain.train(X_train, y_train, X_test, y_test,
model,
save_path="/media/xt/tpf/tpf/aitpf/source/models/fangjia_RandomForestRegressor.pkl",
epoch=3000,
loss_break=0.1)
loss_start: 5.198032166666702
5.0362105000000055
Using the trainer
from sklearn.tree import DecisionTreeRegressor
import numpy as np
dtr = DecisionTreeRegressor()
dtr.fit(X_train, y_train)
feature = dtr.feature_importances_
a = np.argsort(feature)[::-1]  # feature indices, most important first
# Keep the six most important columns; the targets stay as they are.
# The original X_train[a][:6] indexed rows, not columns, and kept only six samples.
X_train = X_train[:, a[:6]]
X_test = X_test[:, a[:6]]
from sklearn.svm import SVR
model = SVR()
from tpf import MlTrain
MlTrain.train(X_train, y_train, X_test, y_test,
model,
save_path="/media/xt/tpf/tpf/aitpf/source/models/fangjia_SVR.pkl",
epoch=3000,
loss_break=0.1)
loss_start: 9.564113280675924
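SVM characteristics
SVM is really fast: 3000 rounds finished in a few seconds, about five.
- It made me think the code was broken.
- How could 3000 rounds go by so quickly? The earlier algorithms all took a while.
- I went through the code several times looking for a bug.
SVM is extremely stable
- Every run gives the same result.
- All 3000 rounds returned 9.564113280675924.
- Other algorithms tend to fluctuate somewhat.
- This means SVM has no use for the trainer: with the same data the predictions are identical, so the result barely changes.
The math is very hard
- The theory behind SVM is probably the most complex and difficult among the common machine learning algorithms.
SVM is only extremely stable; that does not mean the result on the same dataset is 100% unchanged.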
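The stability described above can be checked directly by refitting on the same split several times. This is a minimal sketch on synthetic data (an assumption), with a hypothetical helper `mse_over_refits`; scikit-learn's `SVR` takes no random seed, whereas an unseeded random forest draws new bootstrap samples on every fit.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in data (an assumption; not the housing data).
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)


def mse_over_refits(make_model, rounds=5):
    """Fit a fresh model `rounds` times and collect the test MSE of each fit."""
    losses = []
    for _ in range(rounds):
        model = make_model().fit(X_train, y_train)
        losses.append(float(((model.predict(X_test) - y_test) ** 2).mean()))
    return losses


svr_losses = mse_over_refits(SVR)
forest_losses = mse_over_refits(lambda: RandomForestRegressor(n_estimators=20))

print("SVR:   ", svr_losses)     # the same value repeated
print("forest:", forest_losses)  # typically a different value on each fit
```

If an SVR result does change between sessions, the first thing to check is whether the data passed to it changed, for example through a different split or a different set of selected features.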
from tpf import MlTrain
MlTrain.train(X_train, y_train, X_test, y_test,
model,
save_path="fangjia_SVR.pkl",
epoch=300000,
loss_break=0.1)
loss_start: 6.955530122835717
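There was also a result a little over 5, but a flaw in the code meant it was never saved. The original code took the first training run as the baseline, compared every later run against it, and saved only when the result was better. I did not expect SVM's first run to be its best, with nothing changing afterwards, so when the 5-point-something result appeared it was not saved. It is like a game that is over as soon as it starts: the first move settles it, and the opening is the finale...
This different result only appeared when I tried again the next day. A likely cause is that the data fed to SVR had changed rather than SVR itself: the feature-selection step depends on a freshly fitted decision tree, whose importance ranking can differ from run to run.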