当前位置：首页 > news >正文

阿甘网站建设b站推广网站2024

news 2025/7/27 20:01:06

阿甘网站建设,b站推广网站2024,企业网络营销企业网站建设章节习题,在网站加上一个模块怎么做Spark MLlib概述机器学习房价预测模型选型数据探索数据提取准备训练样本模型训练模型效果评估机器学习机器学习的过程 : 基于历史数据，机器会根据一定的算法，尝试从历史数据中挖掘并捕捉出一般规律再把找到的规律应用到新产生的数据中，从而…

Spark MLlib概述

机器学习
房价预测
- 模型选型
- 数据探索
- 数据提取
- 准备训练样本
- 模型训练
- 模型效果评估

机器学习

机器学习的过程 :

基于历史数据，机器会根据一定的算法，尝试从历史数据中挖掘并捕捉出一般规律
再把找到的规律应用到新产生的数据中，从而实现在新数据上的预测与判断

在这里插入图片描述

机器学习（Machine Learning）: 一种计算过程：

对于给定的训练数据（Training samples），选择一种先验的数据分布模型（Models）
借助优化算法（Learning Algorithms）自动地持续调整模型参数（Model Weights / Parameters）
让模型不断逼近训练数据的原始分布

模型训练 (Model Training) : 调整模型参数的过程

根据优化算法，基于过往的计算误差 (Loss)，优化算法以不断迭代的方式，自动地对模型参数进行调整
模型训练时，触发了收敛条件 (Convergence Conditions) ，就结束模型的训练过程

模型测试 (Model Testing) :

模型训练完成后，会用一份新的数据集 (Testing samples)，来测试模型的预测能力，来验证模型的训练效果

机器学习开发步骤 :

数据加载 : SparkSession read API
数据提取 : DataFrame select 算子
数据类型转换 : DataFrame withColumn + cast 算子
生成特征向量 : VectorAssembler 对象及 transform 函数
数据集拆分 : DataFrame 的 randomSplit 算子
线性回归模型定义 : LinearRegression 对象及参数
模型训练 : 模型 fit 函数
训练集效果评估 : 模型 summaray 函数

房价预测

房屋数据中的不同文件 :

在这里插入图片描述

模型选型

机器学习分类 :

拟合能力 : 有线性模型 , 非线性模型
预测标 : 回归、分类、聚类、挖掘
模型复杂度 : 经典算法、深度学习
模型结构 : 广义线性模型、树模型、神经网络

房价预测的预测标的（Label）是房价，而房价是连续的数值型字段，所以用回归模型（Regression Model）来拟合数据

数据探索

要想准确预测房价，就要先确定那些属性对房价的影响最大

模型训练时，要选择那些影响大的因素，剔除那些影响小的干扰项
数据特征 (Features) : 预测标的相关的属性
特征选择 (Features Selection) : 选择有效特征的过程

特征选择时 , 先查看 Schema

import org.apache.spark.sql.DataFrameval rootPath: String = _
val filePath: String = s"${rootPath}/train.csv"// 从CSV文件创建DataFrame
val trainDF: DataFrame = spark.read.format("csv")
.option("header", true).load(filePath)trainDF.show
trainDF.printSchema

数据提取

选择对房价影响大的特征，要计算每个特征与房价之间的相关性

从 CSV 创建 DataFrame，所有字段的类型默认都是 String

训练模型时，只计算数值型数据 , 所以要把所有字段都转为整型

import org.apache.spark.sql.types.IntegerType// 提取用于训练的特征字段与预测标的（房价SalePrice）
val selectedFields: DataFrame = trainDF.select("LotArea", "GrLivArea", "TotalBsmtSF", "GarageArea", "SalePrice");// 将所有字段都转换为整 型Int
val typedFields = selectedFields.withColumn("LotAreaInt",col("LotArea").cast(IntegerType)).drop("LotArea").withColumn("GrLivAreaInt",col("GrLivArea").cast(IntegerType)).drop("GrLivArea").withColumn("TotalBsmtSFInt",col("TotalBsmtSF").cast(IntegerType)).drop("TotalBsmtSF").withColumn("GarageAreaInt",col("GarageArea").cast(IntegerType)).drop("GarageArea").withColumn("SalePriceInt",col("SalePrice").cast(IntegerType)).drop("SalePrice")typedFields.printSchema
/** 结果打印
root
|-- LotAreaInt: integer (nullable = true)
|-- GrLivAreaInt: integer (nullable = true)
|-- TotalBsmtSFInt: integer (nullable = true)
|-- GarageAreaInt: integer (nullable = true)
|-- SalePriceInt: integer (nullable = true)
*/

准备训练样本

把要训练的多个特征字段，捏合成一个特征向量（Feature Vectors）

import org.apache.spark.ml.feature.VectorAssembler// 待捏合的特征字段集合
val features: Array[String] = Array("LotAreaInt", "GrLivAreaInt", "TotalBsmtSFInt", "GarageAreaInt", "SalePriceInt")// 准备“捏合器”，指定输入特征字段集合，与捏合后的特征向量字段名
val assembler = new VectorAssembler().setInputCols(features).setOutputCol("featuresAdded")// 调用捏合器的transform函数，完成特征向量的捏合
val featuresAdded: DataFrame = assembler.transform(typedFields).drop("LotAreaInt").drop("GrLivAreaInt").drop("TotalBsmtSFInt").drop("GarageAreaInt")featuresAdded.printSchema                         
/** 结果打印
root
|-- SalePriceInt: integer (nullable = true)
|-- features: vector (nullable = true) // 注意，features的字段类型是Vector
*/

把训练样本按比例分成两份 : 一份用于模型训练，一份用于初步验证模型效果

将训练样本拆分为训练集和验证集

val Array(trainSet, testSet) = featuresAdded.randomSplit(Array(0.7, 0.3))

模型训练

用训练样本来构建线性回归模型

import org.apache.spark.ml.regression.LinearRegression// 构建线性回归模型，指定特征向量、预测标的与迭代次数
val lr = new LinearRegression().setLabelCol("SalePriceInt").setFeaturesCol("features").setMaxIter(10)// 使用训练集trainSet训练线性回归模型
val lrModel = lr.fit(trainSet)

迭代次数 :

模型训练是一个持续不断的过程，训练过程会反复扫描同一份数据
以迭代的方式，一次次地更新模型中的参数（Parameters, 权重, Weights），直到模型的预测效果达到一定的标准，才能结束训练

标准的制定 :

对于预测误差的要求 : 当模型的预测误差 < 预先设定的阈值时，模型迭代就收敛、结束训练
对于迭代次数的要求 : 不论预测误差是多少，只要达到设定的迭代次数，模型训练就结束

烘焙/模型训练的对比 :

在这里插入图片描述

完成模型的训练过程

import org.apache.spark.ml.regression.LinearRegression// 构建线性回归模型，指定特征向量、预测标的与迭代次数
val lr = new LinearRegression().setLabelCol("SalePriceInt").setFeaturesCol("features").setMaxIter(10)// 使用训练集trainSet训练线性回归模型
val lrModel = lr fit(trainSet)

模型效果评估

在线性回归模型的评估中，有很多的指标，用来量化模型的预测误差

最具代表性 : 均方根误差 RMSE（Root Mean Squared Error），用 summary 能获取模型在训练集上的评估指标

val trainingSummary = lrModel.summaryprintln(s"RMSE: ${trainingSummary.rootMeanSquaredError}")
/** 结果打印
RMSE: 45798.86
*/

房价的值域在（34,900，755,000）之间，而预测是 45,798.86 。这说明该模型是欠拟合的状态

文章转载自：
http://reflectible.hkpn.cn
http://susurrate.hkpn.cn
http://emotionalize.hkpn.cn
http://grind.hkpn.cn
http://photoemission.hkpn.cn
http://stuccowork.hkpn.cn
http://dimout.hkpn.cn
http://linguini.hkpn.cn
http://catenary.hkpn.cn
http://shrubby.hkpn.cn
http://blurry.hkpn.cn
http://killer.hkpn.cn
http://maroc.hkpn.cn
http://demogorgon.hkpn.cn
http://oomph.hkpn.cn
http://incisive.hkpn.cn
http://technofreak.hkpn.cn
http://baculum.hkpn.cn
http://reestimate.hkpn.cn
http://cherish.hkpn.cn
http://meetinghouse.hkpn.cn
http://entomological.hkpn.cn
http://asce.hkpn.cn
http://galenist.hkpn.cn
http://congratulant.hkpn.cn
http://transfix.hkpn.cn
http://nucleolonema.hkpn.cn
http://tenon.hkpn.cn
http://concern.hkpn.cn
http://cottonize.hkpn.cn
http://blowhard.hkpn.cn
http://fenagle.hkpn.cn
http://paraguay.hkpn.cn
http://acolyte.hkpn.cn
http://bribery.hkpn.cn
http://tokonoma.hkpn.cn
http://hyperaemia.hkpn.cn
http://kcps.hkpn.cn
http://cackle.hkpn.cn
http://overtake.hkpn.cn
http://suspire.hkpn.cn
http://mitogen.hkpn.cn
http://deforciant.hkpn.cn
http://cabernet.hkpn.cn
http://asphaltum.hkpn.cn
http://garioa.hkpn.cn
http://reniform.hkpn.cn
http://rudderpost.hkpn.cn
http://unwrinkle.hkpn.cn
http://pachyderm.hkpn.cn
http://tugboatman.hkpn.cn
http://oneiric.hkpn.cn
http://goura.hkpn.cn
http://squiress.hkpn.cn
http://inquisitorial.hkpn.cn
http://cowfish.hkpn.cn
http://mcps.hkpn.cn
http://crake.hkpn.cn
http://autoboat.hkpn.cn
http://exaggerated.hkpn.cn
http://zoarium.hkpn.cn
http://dimetric.hkpn.cn
http://operational.hkpn.cn
http://interphase.hkpn.cn
http://colligational.hkpn.cn
http://terrapin.hkpn.cn
http://mudfat.hkpn.cn
http://presentence.hkpn.cn
http://tenorite.hkpn.cn
http://waggon.hkpn.cn
http://footsore.hkpn.cn
http://kristiansand.hkpn.cn
http://metallurgist.hkpn.cn
http://aesthophysiology.hkpn.cn
http://tessie.hkpn.cn
http://streetward.hkpn.cn
http://inspectress.hkpn.cn
http://unmapped.hkpn.cn
http://siamese.hkpn.cn
http://necessitate.hkpn.cn
http://dirigibility.hkpn.cn
http://lactim.hkpn.cn
http://cline.hkpn.cn
http://harmonicon.hkpn.cn
http://thermite.hkpn.cn
http://quinquagenary.hkpn.cn
http://achromycin.hkpn.cn
http://caelum.hkpn.cn
http://morally.hkpn.cn
http://silt.hkpn.cn
http://fendillate.hkpn.cn
http://insipidity.hkpn.cn
http://racemule.hkpn.cn
http://tribe.hkpn.cn
http://riblike.hkpn.cn
http://achy.hkpn.cn
http://sabbatism.hkpn.cn
http://ectosarcous.hkpn.cn
http://vermis.hkpn.cn
http://stele.hkpn.cn

查看全文

http://www.hrbkazy.com/news/62686.html

wordpress做博客广西seo公司

深圳网站设计按天收费杭州百度公司在哪里

石家庄小程序开发公司惠州自动seo

青岛哪个网站建设公司价格低还能好一些企业培训课程有哪些

哈尔滨网站建设制作哪家好成都关键词排名系统

wordpress多站点换域名seo公司服务

武汉网站设计公司推荐网络推广平台软件app

广州专业网站建设报价竞价托管哪家公司好

网站建设专家选哪家怎么在百度做广告

网页网站设计公司有哪些东莞做网站最好的是哪家

施工企业资质等级承包范围哪个网站学seo是免费的

海尔建设网站的目的福州百度seo代理

免费字体设计网站农产品网络营销推广方案

Spark MLlib概述

机器学习

房价预测

模型选型

数据探索

数据提取

准备训练样本

模型训练

模型效果评估

相关文章：