Linear Regression - Platform for AI (PAI) - Alibaba Cloud Help Center

Linear regression is a model that analyzes the linear relationship between a dependent variable and multiple independent variables.

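Component configuration

You can configure the Linear Regression component parameters in either of the following ways.

Method 1: Visual configuration

Configure the component parameters on the Designer workflow page.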

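Tab	Parameter	Description
Fields Setting	Feature Columns	The feature columns in the input data source that are used for training.
	Label Column	Supports the DOUBLE and BIGINT types.
	Sparse Format	Sparse data is represented in key-value (KV) format.
	KV Pair Delimiter	KV pairs are separated by a comma (,) by default.
	Key-Value Delimiter	A key and its value are separated by a colon (:) by default.
Parameters Setting	Maximum Iterations	The maximum number of iterations that the algorithm performs.
	Minimum Likelihood Error	The algorithm stops if the difference in log likelihood between two iterations is less than this value.
	Regularization Type	Supports L1, L2, and None.
	Regularization Coefficient	This parameter has no effect if the regularization type is None.
	Generate Model Evaluation Table	The metrics include R-Squared, Adjusted R-Squared, AIC, degrees of freedom, standard deviation of the residuals, and deviance.
	Regression Coefficient Evaluation	The metrics include the T value, P value, and confidence interval [2.5%, 97.5%]. This parameter takes effect only when the Generate Model Evaluation Table check box is selected.
Tuning	Number of Cores	Allocated automatically by the system by default.
	Memory Size per Core	Allocated automatically by the system by default.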

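Method 2: PAI command

Configure the component parameters by using a PAI command. You can call PAI commands from the SQL Script component. For more information, see SQL Script.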

PAI -name linearregression
    -project algo_public
    -DinputTableName=lm_test_input
    -DfeatureColNames=x
    -DlabelColName=y
    -DmodelName=lm_test_input_model_out;

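Parameter	Required	Description	Default value
inputTableName	Yes	The name of the input table.	None
modelName	Yes	The name of the output model.	None
outputTableName	No	The name of the output model evaluation table. This parameter is required if enableFitGoodness is true.	None
labelColName	Yes	The dependent variable. Supports the DOUBLE and BIGINT types. Only one column can be selected as the dependent variable.	None
featureColNames	Yes	The independent variables. If the input data is in dense format, the DOUBLE and BIGINT types are supported. If the input data is in sparse format, the STRING type is supported.	None
inputTablePartitions	No	The partitions of the input table.	None
enableSparse	No	Specifies whether the input data is in sparse format. Valid values: {true,false}.	false
itemDelimiter	No	The delimiter between KV pairs. This parameter takes effect only if enableSparse is true.	Comma (,)
kvDelimiter	No	The delimiter between a key and its value. This parameter takes effect only if enableSparse is true.	Colon (:)
maxIter	No	The maximum number of iterations that the algorithm performs.	100
epsilon	No	The minimum likelihood error. The algorithm stops if the difference in log likelihood between two iterations is less than this value.	0.000001
regularizedType	No	The regularization type. Valid values: {l1,l2,None}.	None
regularizedLevel	No	The regularization coefficient. This parameter has no effect if regularizedType is None.	1
enableFitGoodness	No	Specifies whether to generate the model evaluation table. The metrics include R-Squared, Adjusted R-Squared, AIC, degrees of freedom, standard deviation of the residuals, and deviance. Valid values: {true,false}.	false
enableCoefficientEstimate	No	Specifies whether to evaluate the regression coefficients. The metrics include the T value, P value, and confidence interval [2.5%, 97.5%]. This parameter takes effect only if enableFitGoodness is true. Valid values: {true,false}.	false
lifecycle	No	The lifecycle of the model evaluation output table.	-1
coreNum	No	The number of cores used for computation.	Allocated automatically by the system
memSizePerCore	No	The memory size of each core. Valid values: 1024 MB to 20*1024 MB.	Allocated automatically by the system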
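
To make the roles of itemDelimiter and kvDelimiter concrete, the following Python snippet is a minimal sketch of how one row of KV-format sparse data could be split into keys and values. The function name parse_kv and its arguments are hypothetical and are not part of PAI; the component's actual parsing logic is not described in this topic.

```python
def parse_kv(row, item_delimiter=",", kv_delimiter=":"):
    """Split one KV-format sparse row into a {key: value} dict.

    item_delimiter plays the role of itemDelimiter (between KV pairs);
    kv_delimiter plays the role of kvDelimiter (between key and value).
    """
    features = {}
    for item in row.split(item_delimiter):
        item = item.strip()
        if not item:
            continue  # tolerate empty fragments such as a trailing delimiter
        key, value = item.split(kv_delimiter, 1)
        features[key.strip()] = float(value)
    return features


# Default delimiters: comma between pairs, colon between key and value.
print(parse_kv("0:1.84,1:1"))  # {'0': 1.84, '1': 1.0}
```

If the pairs in a column are separated by a space instead of a comma, the same sketch can be called as parse_kv("0:6.68 1:2", item_delimiter=" ").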

Example

Use SQL statements to generate test data.

 drop table if exists lm_test_input;
  create table lm_test_input as
  select
    *
  from
  (
    select 10 as y, 1.84 as x1, 1 as x2, '0:1.84 1:1' as sparsecol1 from dual
      union all
    select 20 as y, 2.13 as x1, 0 as x2, '0:2.13' as sparsecol1 from dual
      union all
    select 30 as y, 3.89 as x1, 0 as x2, '0:3.89' as sparsecol1 from dual
      union all
    select 40 as y, 4.19 as x1, 0 as x2, '0:4.19' as sparsecol1 from dual
      union all
    select 50 as y, 5.76 as x1, 0 as x2, '0:5.76' as sparsecol1 from dual
      union all
    select 60 as y, 6.68 as x1, 2 as x2, '0:6.68 1:2' as sparsecol1 from dual
      union all
    select 70 as y, 7.58 as x1, 0 as x2, '0:7.58' as sparsecol1 from dual
      union all
    select 80 as y, 8.01 as x1, 0 as x2, '0:8.01' as sparsecol1 from dual
      union all
    select 90 as y, 9.02 as x1, 3 as x2, '0:9.02 1:3' as sparsecol1 from dual
      union all
    select 100 as y, 10.56 as x1, 0 as x2, '0:10.56' as sparsecol1 from dual
  ) tmp;

Use a PAI command to submit the Linear Regression component parameters.

PAI -name linearregression
    -project algo_public
    -DinputTableName=lm_test_input
    -DlabelColName=y
    -DfeatureColNames=x1,x2
    -DmodelName=lm_test_input_model_out
    -DoutputTableName=lm_test_input_conf_out
    -DenableCoefficientEstimate=true
    -DenableFitGoodness=true
    -Dlifecycle=1;

Use a PAI command to submit the Prediction component parameters.

pai -name prediction
    -project algo_public
    -DmodelName=lm_test_input_model_out
    -DinputTableName=lm_test_input
    -DoutputTableName=lm_test_input_predict_out
    -DappendColNames=y;

View the output model evaluation table lm_test_input_conf_out.

+------------+------------+------------+------------+--------------------+------------+
| colname    | value      | tscore     | pvalue     | confidenceinterval | p          |
+------------+------------+------------+------------+--------------------+------------+
| Intercept  | -6.42378496687763 | -2.2725755951390028 | 0.06       | {"2.5%": -11.964027, "97.5%": -0.883543} | coefficient |
| x1         | 10.260063429838898 | 23.270944360826963 | 0.0        | {"2.5%": 9.395908, "97.5%": 11.124219} | coefficient |
| x2         | 0.35374498323846265 | 0.2949247320997519 | 0.81       | {"2.5%": -1.997160, "97.5%": 2.704650} | coefficient |
| rsquared   | 0.9879675667384592 | NULL       | NULL       | NULL               | goodness   |
| adjusted_rsquared | 0.9845297286637332 | NULL       | NULL       | NULL               | goodness   |
| aic        | 59.331109494251805 | NULL       | NULL       | NULL               | goodness   |
| degree_of_freedom | 7.0        | NULL       | NULL       | NULL               | goodness   |
| standardErr_residual | 3.765777749448906 | NULL       | NULL       | NULL               | goodness   |
| deviance   | 99.26757440771128 | NULL       | NULL       | NULL               | goodness   |
+------------+------------+------------+------------+--------------------+------------+
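
The confidenceinterval column in the table above can be cross-checked from the value and tscore columns. The Python sketch below assumes that the standard error of a coefficient equals value / tscore and that the interval is value ± 1.96 × standard error (a normal approximation). Both assumptions are inferred from the numbers in this example and are not stated in the component documentation.

```python
# (value, tscore) pairs copied from lm_test_input_conf_out.
coefficients = {
    "Intercept": (-6.42378496687763, -2.2725755951390028),
    "x1": (10.260063429838898, 23.270944360826963),
    "x2": (0.35374498323846265, 0.2949247320997519),
}

# Assumed multiplier: it reproduces the intervals shown in the table.
Z_975 = 1.96


def confidence_interval(value, tscore, z=Z_975):
    """Return (lower, upper) bounds of the [2.5%, 97.5%] interval."""
    std_err = abs(value / tscore)  # assumed: tscore = value / standard error
    return value - z * std_err, value + z * std_err


for name, (value, tscore) in coefficients.items():
    low, high = confidence_interval(value, tscore)
    print(f"{name}: [{low:.6f}, {high:.6f}]")
```

Under these assumptions, the interval for x2 spans zero, which agrees with its large P value: the example data gives little evidence that x2 contributes to y.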

View the prediction result table lm_test_input_predict_out.

+------------+-------------------+------------------+-------------------+
| y          | prediction_result | prediction_score | prediction_detail |
+------------+-------------------+------------------+-------------------+
| 10         | NULL              | 12.808476727264404 | {"y": 12.8084767272644} |
| 20         | NULL              | 15.43015013867922 | {"y": 15.43015013867922} |
| 30         | NULL              | 33.48786177519568 | {"y": 33.48786177519568} |
| 40         | NULL              | 36.565880804147355 | {"y": 36.56588080414735} |
| 50         | NULL              | 52.674180388994415 | {"y": 52.67418038899442} |
| 60         | NULL              | 62.82092871092313 | {"y": 62.82092871092313} |
| 70         | NULL              | 71.34749583130122 | {"y": 71.34749583130122} |
| 80         | NULL              | 75.75932310613193 | {"y": 75.75932310613193} |
| 90         | NULL              | 87.1832221199846 | {"y": 87.18322211998461} |
| 100        | NULL              | 101.92248485222113 | {"y": 101.9224848522211} |
+------------+-------------------+------------------+-------------------+
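
The goodness rows of lm_test_input_conf_out can be recomputed from the y and prediction_score columns of this table. The Python sketch below uses the standard definitions of R-Squared, Adjusted R-Squared, and the residual standard error. The AIC line assumes a Gaussian log likelihood and counts the three coefficients plus the noise variance as parameters; this convention is an assumption chosen because it matches the value in this example, not a documented formula.

```python
import math

# y and prediction_score copied from lm_test_input_predict_out.
y = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
pred = [
    12.808476727264404, 15.43015013867922, 33.48786177519568,
    36.565880804147355, 52.674180388994415, 62.82092871092313,
    71.34749583130122, 75.75932310613193, 87.1832221199846,
    101.92248485222113,
]

n = len(y)  # number of samples
p = 2       # number of features (x1, x2)

# Residual sum of squares, reported as "deviance".
rss = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))

# Total sum of squares around the mean of y.
mean_y = sum(y) / n
tss = sum((yi - mean_y) ** 2 for yi in y)

rsquared = 1 - rss / tss
degree_of_freedom = n - p - 1  # samples minus coefficients (incl. intercept)
adjusted_rsquared = 1 - (1 - rsquared) * (n - 1) / degree_of_freedom
std_err_residual = math.sqrt(rss / degree_of_freedom)

# Assumed AIC convention: -2 * Gaussian log likelihood + 2 * k,
# where k = p + 2 (coefficients, intercept, and noise variance).
k = p + 2
aic = n * (math.log(2 * math.pi * rss / n) + 1) + 2 * k

print("deviance", rss)
print("rsquared", rsquared)
print("adjusted_rsquared", adjusted_rsquared)
print("degree_of_freedom", degree_of_freedom)
print("standardErr_residual", std_err_residual)
print("aic", aic)
```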