Machine Learning in Practice 3: Predicting Car Fuel Efficiency (MPG)

Published: 2021-05-11 00:18:00

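A Simple Linear Regression Experiment for Predicting Car Fuel Efficiency (MPG)

Introduction

We use a simple linear regression model to explore how to predict a car's fuel efficiency, measured in MPG. The experiment is meant to build a solid understanding of the basics of linear regression. For more background, you can visit my Machine Learning Journey.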

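Loading and Exploring the Data

First, we read a file with nine columns, one of which is the fuel efficiency, mpg. The columns are:

mpg: fuel efficiency (miles per gallon)

cylinders: number of cylinders

displacement: engine displacement

horsepower: engine horsepower

weight: vehicle weight

acceleration: acceleration

model year: model year

origin: region of origin

car name: name of the car model

Read the data with pandas (numpy and matplotlib are imported here for the later steps):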

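import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
path = '../data_files/3.MPG/auto-mpg.data'
columns = ["mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "model year", "origin", "car name"]
# Whitespace-separated file without a header row; the UCI copy marks missing horsepower values with '?'
cars = pd.read_csv(path, sep=r'\s+', names=columns, na_values='?')
# Drop the rows with missing horsepower so the column stays numeric
cars = cars.dropna(subset=['horsepower']).reset_index(drop=True)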

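Data Analysis and Visualization

Next, we draw scatter plots of each factor against mpg to look for possible linear relationships.

cylinders, displacement, weight, and acceleration each show a visible, roughly linear relationship with mpg, and weight shows the strongest one.

We therefore start by building a linear regression model on weight.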
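
As a numeric companion to the scatter plots, the strength of each linear relationship can be summarized with a Pearson correlation coefficient. The sketch below is only an illustration: `mpg_correlations` is a hypothetical helper, not part of the original experiment, and the small `sample` table is made up rather than taken from auto-mpg.data.

```python
import pandas as pd

def mpg_correlations(cars: pd.DataFrame, target: str = "mpg") -> pd.Series:
    """Pearson correlation of every numeric column with the target, strongest first."""
    numeric = cars.select_dtypes(include="number")
    corr = numeric.drop(columns=[target]).corrwith(numeric[target])
    # Sort by absolute strength; the sign still shows the direction of the relationship.
    return corr.reindex(corr.abs().sort_values(ascending=False).index)

# Tiny made-up sample, for illustration only (not rows from auto-mpg.data).
sample = pd.DataFrame({
    "mpg": [30.0, 25.0, 20.0, 15.0, 10.0],
    "weight": [2000, 2500, 3000, 3500, 4000],
    "acceleration": [18.0, 17.5, 15.0, 16.0, 12.0],
})
ranking = mpg_correlations(sample)
print(ranking)
```

On the real data, calling `mpg_correlations(cars)` would rank the numeric columns the same way. A coefficient near -1 or 1 suggests a strong linear trend, but it does not replace looking at the plots.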

Splitting the Data

Split the data into a training set and a test set, holding out 20% of the rows for testing:

from sklearn.model_selection import train_test_split
X = cars[['weight']]
Y = cars['mpg']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

Single-Variable Linear Regression Model

Building the Model

Train scikit-learn's linear regression model:

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, Y_train)
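
To make the fitting step less of a black box: with a single feature, ordinary least squares has a simple closed form, where the slope is the covariance of x and y divided by the variance of x. The sketch below uses a hypothetical helper `fit_line` and made-up points that lie exactly on a line; it is an illustration of the formula, not the code scikit-learn runs internally.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares slope and intercept for y ≈ slope * x + intercept."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    # slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
    slope = np.dot(x_centered, y - y.mean()) / np.dot(x_centered, x_centered)
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = y.mean() - slope * x.mean()
    return slope, intercept

# Made-up points lying exactly on y = 50 - 0.01 * x (illustration only).
demo_weight = np.array([2000.0, 2500.0, 3000.0, 3500.0])
demo_mpg = 50.0 - 0.01 * demo_weight
slope, intercept = fit_line(demo_weight, demo_mpg)
print(slope, intercept)
```

Because `LinearRegression` also fits by ordinary least squares, `lr.coef_[0]` and `lr.intercept_` should agree with `fit_line(X_train['weight'], Y_train)` up to floating-point error.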

Validating the Results

Predict on the training and test sets and plot the results:

# Training set: actual values (red) against the fitted line (green)
plt.scatter(X_train, Y_train, color='red', alpha=0.3)
plt.scatter(X_train, lr.predict(X_train), color='green', alpha=0.3)
plt.xlabel('weight')
plt.ylabel('mpg')
plt.title('train data')
plt.show()
# Test set: actual values (blue) against predictions for the same test rows (green)
plt.scatter(X_test, Y_test, color='blue', alpha=0.3)
plt.scatter(X_test, lr.predict(X_test), color='green', alpha=0.3)
plt.xlabel('weight')
plt.ylabel('mpg')
plt.title('test data')
plt.show()

Print the model score (R², the coefficient of determination) on the held-out test set:

print('score = {}'.format(lr.score(X_test, Y_test)))
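
For reference, the score returned by `LinearRegression.score` is the coefficient of determination R², defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch of that definition follows; `r2_by_hand` and the three-point example are hypothetical and only illustrate the formula.

```python
import numpy as np

def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

actual = np.array([10.0, 20.0, 30.0])
perfect = r2_by_hand(actual, actual)               # predictions equal the truth
baseline = r2_by_hand(actual, np.full(3, 20.0))    # always predict the mean
print(perfect, baseline)
```

A score of 1 means perfect predictions, 0 means the model does no better than predicting the mean, and the score can go negative on held-out data when the model does worse than that.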

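Multivariable Linear Regression Model

Because mpg also varies with horsepower and displacement, not only with weight, we fit a more complete multivariable linear model on all three features.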

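mul = ['weight', 'horsepower', 'displacement']
# horsepower may have been read as text (the UCI file marks missing values with '?'); coerce it and drop those rows
cars['horsepower'] = pd.to_numeric(cars['horsepower'], errors='coerce')
cars = cars.dropna(subset=mul).copy()
mul_lr = LinearRegression()
mul_lr.fit(cars[mul], cars['mpg'])
cars['mpg_prediction'] = mul_lr.predict(cars[mul])
print('score = {}'.format(mul_lr.score(cars[mul], cars['mpg'])))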

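The multivariable model scores higher than the single-variable model, which suggests the extra features help. Note that here the model is fit and evaluated on the full dataset, so its score is an in-sample figure.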
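
A fairer way to compare feature sets is to score each one on the same held-out rows. The sketch below is one possible approach, not the original experiment's method: `holdout_r2` is a hypothetical helper, and the data is synthetic stand-in data generated on the spot, not auto-mpg.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def holdout_r2(X, y, test_size=0.2, random_state=0):
    """Fit on the training split and return R^2 on the held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    return LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)

# Synthetic stand-in data (NOT auto-mpg): the target depends on two features.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 2))
target = 3.0 * features[:, 0] - 2.0 * features[:, 1] + rng.normal(scale=0.1, size=200)

score_one = holdout_r2(features[:, [0]], target)   # first feature only
score_two = holdout_r2(features, target)           # both features
print(score_one, score_two)
```

With the real data, and once horsepower is numeric, the same comparison would be `holdout_r2(cars[['weight']], cars['mpg'])` against `holdout_r2(cars[mul], cars['mpg'])`. Using the same `random_state` keeps the train and test rows identical across both calls.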

Model Evaluation

Compute the root mean squared error (RMSE):

from sklearn.metrics import mean_squared_error as mse
y_pred = cars['mpg_prediction']
# mean_squared_error expects (y_true, y_pred)
rmse = np.sqrt(mse(cars['mpg'], y_pred))
print('rmse = %f' % rmse)
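
RMSE is the square root of the mean squared difference between actual and predicted values, so it is expressed in the same unit as mpg. A minimal sketch of the formula, with a hypothetical helper `rmse_by_hand` and two made-up values:

```python
import numpy as np

def rmse_by_hand(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))

# Made-up values: the errors are -3 and 4, so the mean squared error is (9 + 16) / 2 = 12.5.
demo_rmse = rmse_by_hand([20.0, 30.0], [23.0, 26.0])
print(demo_rmse)
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.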

Visualizing the Results

Plot the multivariable model's predictions (red) against the actual values (blue) for each feature:

fig = plt.figure(figsize=(8, 8))
ax1 = fig.add_subplot(3, 1, 1)
ax2 = fig.add_subplot(3, 1, 2)
ax3 = fig.add_subplot(3, 1, 3)
# One panel per feature: actual mpg in blue, predicted mpg in red
ax1.scatter(cars['weight'], cars['mpg'], c='blue', alpha=0.3)
ax1.scatter(cars['weight'], cars['mpg_prediction'], c='red', alpha=0.3)
ax1.set_title('weight')
# Convert horsepower to float in case it was read as text
hp = cars['horsepower'].astype(float)
ax2.scatter(hp, cars['mpg'], c='blue', alpha=0.3)
ax2.scatter(hp, cars['mpg_prediction'], c='red', alpha=0.3)
ax2.set_title('horsepower')
ax3.scatter(cars['displacement'], cars['mpg'], c='blue', alpha=0.3)
ax3.scatter(cars['displacement'], cars['mpg_prediction'], c='red', alpha=0.3)
ax3.set_title('displacement')
# Keep the panel titles from overlapping the axes above them
plt.tight_layout()
plt.show()

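Summary

In this experiment we built single-variable and multivariable linear regression models and found that the multivariable model fits the data better. Linear models also have the advantage of being easy to interpret.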
