Feature Encoding

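Feature encoding uses a GBDT model to encode nonlinear features into linear features.

Overview

Feature encoding is a strategy that mines new features by using decision trees and ensemble algorithms. The new features are the one-hot encoding of the leaf nodes of decision trees that are built from one or more original features.

For example, the following figure shows three trees with a total of 12 leaf nodes. The leaf nodes are numbered in tree order as features 0 to 11: the first tree occupies features 0 to 3, the second tree occupies features 4 to 7, and the third tree occupies features 8 to 11. This encoding converts the nonlinear features learned by the GBDT model into linear features.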

[Figure: three decision trees with 12 leaf nodes in total, numbered 0 to 11]
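
The numbering scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the component's implementation: the function names `leaf_offsets` and `encode_sample`, and the assumption that each tree reports the 0-based position of the leaf that a sample lands in, are hypothetical.

```python
from itertools import accumulate


def leaf_offsets(leaves_per_tree):
    """Return the first global feature index used by each tree."""
    # Tree k starts where the leaves of trees 0..k-1 end.
    return [0] + list(accumulate(leaves_per_tree))[:-1]


def encode_sample(leaf_positions, leaves_per_tree):
    """Map one leaf position per tree to a one-hot vector over all leaves."""
    offsets = leaf_offsets(leaves_per_tree)
    vector = [0] * sum(leaves_per_tree)
    for offset, position, n_leaves in zip(offsets, leaf_positions, leaves_per_tree):
        if not 0 <= position < n_leaves:
            raise ValueError("leaf position out of range")
        vector[offset + position] = 1
    return vector


# Three trees with four leaves each, as in the example above.
leaves = [4, 4, 4]
print(leaf_offsets(leaves))  # [0, 4, 8]

# Hypothetical sample that lands in leaf 1, leaf 0, and leaf 3 of the three trees.
print(encode_sample([1, 0, 3], leaves))
```

Each tree contributes exactly one active position in this sketch, so a linear model can assign a separate weight to every leaf.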

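Component configuration

You can configure the parameters of the Feature Encoding component in either of the following ways.

Method 1: Configure in the GUI

Configure the component parameters on the Designer workflow page.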

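Tab | Parameter | Description
--- | --- | ---
Fields Setting | Feature Columns | The feature columns in the input table that are used for training.
Fields Setting | Label Column | Required. The label column.
Fields Setting | Additional Output Columns | Optional. The original feature columns to retain in the output result table.
Parameters Setting | Cores | The number of cores used for computation. The value must be a positive integer.
Parameters Setting | Memory Size per Core | The memory size of each core. The value must be a positive integer.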

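Method 2: Use PAI commands

Configure the component parameters by using a PAI command. You can use the SQL Script component to run PAI commands. For more information, see SQL Script.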

PAI -name fe_encode_runner -project algo_public
    -DinputTable="tdl_pai_bank_test1"
    -DencodeModel="xlab_m_GBDT_LR_1_19064"
    -DselectedCols="pdays,previous,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,age,campaign"
    -DlabelCol="y"
    -DoutputTable="pai_temp_2159_19061_1"
    -DcoreNum=10
    -DmemSizePerCore=1024;

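Parameter | Required | Description | Default
--- | --- | --- | ---
inputTable |  | The name of the input table. | 
inputTablePartitions |  | The partitions of the input table that participate in training, in the format partition_name=value. For multi-level partitions, use the format name1=value1/name2=value2. To specify multiple partitions, separate them with commas (,). | All partitions of the input table
encodeModel |  | The GBDT binary classification model that is used as the input for encoding. | 
outputTable |  | The name of the result table that stores the encoded output. | 
selectedCols |  | The GBDT features that participate in encoding. These are usually the training features of the GBDT component. | 
labelCol |  | The label column. | 
lifecycle |  | The lifecycle of the result table. | 7
coreNum |  | The total number of instances. The value is of the BIGINT type. | -1. The number of instances is calculated from the volume of input data.
memSizePerCore |  | The memory size. | -1. The memory size is calculated from the volume of input data.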
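
The parameters in the table map directly onto `-D` flags. As a rough sketch, a small Python helper can assemble the command text before you paste it into an SQL Script component. The helper names `build_pai_command` and `format_partitions`, and the partition keys `dt` and `region`, are hypothetical and only illustrate the formats described above. The helper builds a string and does not submit anything to PAI.

```python
def format_partitions(partitions):
    """Join partition specs: '/' between levels, ',' between partitions."""
    return ",".join(
        "/".join(f"{name}={value}" for name, value in spec.items())
        for spec in partitions
    )


def build_pai_command(name, project, params):
    """Render a PAI command string from a parameter dictionary."""
    lines = [f"PAI -name {name} -project {project}"]
    for key, value in params.items():
        if isinstance(value, str):
            lines.append(f'    -D{key}="{value}"')  # quote string values
        else:
            lines.append(f"    -D{key}={value}")    # numbers stay unquoted
    # The statement terminator goes after the last parameter only.
    return "\n".join(lines) + ";"


command = build_pai_command(
    "fe_encode_runner",
    "algo_public",
    {
        "inputTable": "tdl_pai_bank_test1",
        "encodeModel": "xlab_m_GBDT_LR_1_19064",
        "selectedCols": "pdays,previous,emp_var_rate,cons_price_idx,"
                        "cons_conf_idx,euribor3m,nr_employed,age,campaign",
        "labelCol": "y",
        "outputTable": "pai_temp_2159_19061_1",
        "coreNum": 10,
        "memSizePerCore": 1024,
    },
)
print(command)

# Hypothetical two-level partitions, shown only to illustrate the format.
print(format_partitions([
    {"dt": "20240101", "region": "cn"},
    {"dt": "20240102", "region": "cn"},
]))
```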

Example

  1. Use SQL statements to generate training data.

    CREATE TABLE IF NOT EXISTS tdl_pai_bank_test1
    (
        age            BIGINT COMMENT '',
        campaign       BIGINT COMMENT '',
        pdays          BIGINT COMMENT '',
        previous       BIGINT COMMENT '',
        emp_var_rate   DOUBLE COMMENT '',
        cons_price_idx DOUBLE COMMENT '',
        cons_conf_idx  DOUBLE COMMENT '',
        euribor3m      DOUBLE COMMENT '',
        nr_employed    DOUBLE COMMENT '',
        y              BIGINT COMMENT ''
    )
    LIFECYCLE 7;
    insert overwrite table tdl_pai_bank_test1
    select * from
    (select 53 as age,1 as campaign,999 as pdays,0 as previous,-0.1 as emp_var_rate,
           93.2 as cons_price_idx,-42.0 as cons_conf_idx, 4.021 as euribor3m,5195.8 as nr_employed,0 as y
    union all
    select 28 as age,3 as campaign,6 as pdays,2 as previous,-1.7 as emp_var_rate,
           94.055 as cons_price_idx,-39.8 as cons_conf_idx, 0.729 as euribor3m,4991.6 as nr_employed,1 as y
    union all
    select 39 as age,2 as campaign,999 as pdays,0 as previous,-1.8 as emp_var_rate,
           93.075 as cons_price_idx,-47.1 as cons_conf_idx, 1.405 as euribor3m,5099.8 as nr_employed,0 as y
    union all
    select 55 as age,1 as campaign,3 as pdays,1 as previous,-2.9 as emp_var_rate,
           92.201 as cons_price_idx,-31.4 as cons_conf_idx, 0.869 as euribor3m,5076.2 as nr_employed,1 as y
    union all
    select 30 as age,8 as campaign,999 as pdays,0 as previous,1.4 as emp_var_rate,
           93.918 as cons_price_idx,-42.7 as cons_conf_idx, 4.961 as euribor3m,5228.2 as nr_employed,0 as y
    union all
    select 37 as age,1 as campaign,999 as pdays,0 as previous,-1.8 as emp_var_rate,
           92.893 as cons_price_idx,-46.2 as cons_conf_idx, 1.327 as euribor3m,5099.1 as nr_employed,0 as y
    union all
    select 39 as age,1 as campaign,999 as pdays,0 as previous,-1.8 as emp_var_rate,
           92.893 as cons_price_idx,-46.2 as cons_conf_idx, 1.313 as euribor3m,5099.1 as nr_employed,0 as y
    union all
    select 36 as age,1 as campaign,3 as pdays,1 as previous,-2.9 as emp_var_rate,
           92.963 as cons_price_idx,-40.8 as cons_conf_idx, 1.266 as euribor3m,5076.2 as nr_employed,1 as y
    union all
    select 27 as age,2 as campaign,999 as pdays,1 as previous,-1.8 as emp_var_rate,
           93.075 as cons_price_idx,-47.1 as cons_conf_idx, 1.41 as euribor3m,5099.1 as nr_employed,0 as y
    ) a;
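  2. Build an experiment. This component is usually used together with the GBDT Binary Classification component. For more information, see Algorithm Modeling.

    Set the parameters of the GBDT Binary Classification component: set the number of trees to 5 and the maximum tree depth to 3, select y as the label column, and select all other fields as feature columns.

  3. Run the experiment and view the encoded result.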

    kv | y
    --- | ---
    2:1,5:1,8:1,12:1,15:1,18:1,28:1,34:1,41:1,50:1,53:1,63:1,72:1 | 0.0
    2:1,5:1,6:1,12:1,15:1,16:1,28:1,34:1,41:1,50:1,51:1,63:1,72:1 | 0.0
    2:1,3:1,12:1,13:1,28:1,34:1,36:1,39:1,55:1,61:1 | 1.0
    2:1,3:1,12:1,13:1,20:1,21:1,22:1,42:1,43:1,46:1,63:1,64:1,67:1,68:1 | 0.0
    0:1,10:1,28:1,29:1,32:1,36:1,37:1,55:1,56:1,59:1 | 1.0

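    The output can be fed directly into the Logistic Regression for Binary Classification or Multiclass Classification component. The result is usually better than using LR or GBDT alone, and the model is less prone to overfitting.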
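
If you need to inspect the kv column outside Designer, the sparse `index:value` strings can be unpacked with a few lines of Python. This is a minimal sketch: the function names `parse_kv` and `to_dense` are hypothetical, and the vector width of 73 is only an assumption taken from the largest index (72) that appears in the sample output. The actual width depends on the model.

```python
def parse_kv(kv):
    """Parse a sparse 'index:value' string into a dict keyed by feature index."""
    features = {}
    for item in kv.split(","):
        index, value = item.split(":")
        features[int(index)] = float(value)
    return features


def to_dense(features, size):
    """Expand a sparse feature dict into a dense list of the given width."""
    dense = [0.0] * size
    for index, value in features.items():
        dense[index] = value
    return dense


# Third row of the sample output above.
row = "2:1,3:1,12:1,13:1,28:1,34:1,36:1,39:1,55:1,61:1"
features = parse_kv(row)
print(sorted(features))

# Assumed width: largest index in the sample output plus one.
dense = to_dense(features, 73)
print(len(dense), sum(dense))
```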
