Machine Learning - Applications

Variable Types

Nominal
Binary
Ordinal
Count
Time
Interval

Nominal

Ordinal

Use Category Encoders to improve model performance when you have nominal or ordinal data that may provide value.

Encoder methods

Classic Encoders
Contrast Encoders
Bayesian Encoders

Classic Encoders

Ordinal
OneHot
Binary
BaseN
Hashing

Binary Encoder

Converts the result of Ordinal encoding into binary numbers
Splits each binary digit into its own column
Suited to high-cardinality Ordinal data
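The two steps above can be sketched in plain Python. This is a minimal illustration, not the Category Encoders implementation: the function name `binary_encode` is hypothetical, and it assumes ordinal codes start at 1 and are assigned in order of first appearance.

```python
def binary_encode(values):
    """Ordinal-encode the values, then split each code into binary digits."""
    # Step 1: ordinal encoding (assumption: 1-based, order of first appearance).
    codes = {}
    for v in values:
        if v not in codes:
            codes[v] = len(codes) + 1

    # Number of columns needed to represent the largest code in binary.
    width = max(codes.values()).bit_length()

    # Step 2: one column per binary digit, most significant digit first.
    rows = []
    for v in values:
        code = codes[v]
        rows.append([(code >> shift) & 1 for shift in range(width - 1, -1, -1)])
    return rows


colors = ["red", "green", "blue", "green", "yellow", "purple"]
encoded = binary_encode(colors)
for color, bits in zip(colors, encoded):
    print(color, bits)
```

Five distinct categories fit in three columns here, whereas OneHot would need five, which is why this encoding scales better as the number of categories grows.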

Contrast Encoders

Helmert
Sum
Backward Difference
Polynomial

Helmert Encoding

The mean of the dependent variable for a level is compared to the mean of the dependent variable over all previous levels.

Sum Encoding

Compares the mean of the dependent variable for a given level to the overall mean of the dependent variable over all the levels.

Backward Difference

The mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level.
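The comparisons described for Helmert, Sum, and Backward Difference encoding can be computed directly on per-level means, as in the sketch below. It does not build the contrast matrices an encoder library would produce. The function names and toy data are hypothetical, and the "overall mean" is assumed to be the mean of the level means.

```python
from statistics import mean


def level_means(levels, y):
    """Mean of the dependent variable for each level."""
    groups = {}
    for lvl, val in zip(levels, y):
        groups.setdefault(lvl, []).append(val)
    return {lvl: mean(vals) for lvl, vals in groups.items()}


def contrast_comparisons(ordered_levels, means):
    m = [means[lvl] for lvl in ordered_levels]
    grand = mean(m)  # assumption: overall mean = mean of the level means
    # Sum: each level versus the overall mean.
    sum_cmp = {lvl: m[i] - grand for i, lvl in enumerate(ordered_levels)}
    # Backward Difference: each level versus the prior level.
    backward = {ordered_levels[i]: m[i] - m[i - 1] for i in range(1, len(m))}
    # Helmert (as described above): each level versus the mean of all previous levels.
    helmert = {ordered_levels[i]: m[i] - mean(m[:i]) for i in range(1, len(m))}
    return sum_cmp, backward, helmert


levels = ["low", "low", "mid", "mid", "high", "high"]
y = [1.0, 3.0, 4.0, 6.0, 9.0, 11.0]
means = level_means(levels, y)
sum_cmp, backward, helmert = contrast_comparisons(["low", "mid", "high"], means)
print(sum_cmp, backward, helmert)
```

The first level has no prior level, so Backward Difference and Helmert each yield one comparison fewer than the number of levels.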

Polynomial

Orthogonal polynomial contrasts. The coefficients taken on by polynomial coding for k=4 levels are the linear, quadratic, and cubic trends in the categorical variable.
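For k=4 equally spaced levels, the standard unnormalized coefficients are shown in the sketch below, together with a check that they sum to zero and are mutually orthogonal. The names `POLY_K4`, `normalize`, and `dot` are illustrative only. The unit-length scaling is an assumption about how a library would present them.

```python
from math import sqrt

# Unnormalized orthogonal polynomial coefficients for 4 equally spaced levels.
POLY_K4 = {
    "linear": [-3, -1, 1, 3],
    "quadratic": [1, -1, -1, 1],
    "cubic": [-1, 3, -3, 1],
}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a contrast vector to unit length."""
    norm = sqrt(dot(v, v))
    return [x / norm for x in v]


contrasts = {name: normalize(v) for name, v in POLY_K4.items()}

for name, coeffs in contrasts.items():
    print(name, [round(c, 4) for c in coeffs])
```

Each row of coefficients becomes one encoded column, so a 4-level variable yields three columns capturing its linear, quadratic, and cubic trends.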

Bayesian Encoders

Target
LeaveOneOut
WeightOfEvidence

Target Encoding

Leave One Out Encoding

$$s = (s.sum() - s)/(len(s) - 1)$$
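The expression above appears to be applied per category, with s holding the target values of the rows that share one category: each row receives the mean of the other rows' targets. A minimal plain-Python sketch of that reading follows. The function name is hypothetical, and falling back to the global mean for categories with a single row is an assumption, since the formula would otherwise divide by zero.

```python
from collections import defaultdict


def leave_one_out_encode(categories, targets):
    """Encode each row as the mean target of the other rows in its category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1

    overall = sum(targets) / len(targets)
    encoded = []
    for c, y in zip(categories, targets):
        if counts[c] > 1:
            # (s.sum() - s) / (len(s) - 1), evaluated for this row.
            encoded.append((sums[c] - y) / (counts[c] - 1))
        else:
            # Assumption: singleton categories fall back to the global mean.
            encoded.append(overall)
    return encoded


cats = ["a", "a", "a", "b", "b", "c"]
ys = [1, 0, 1, 1, 0, 1]
loo = leave_one_out_encode(cats, ys)
print(loo)
```

Excluding the row's own target is what distinguishes this from plain Target encoding: the encoded value never contains the label it will be used to predict.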

Weight Of Evidence

Evolved from logistic regression.
Used as a benchmark to screen variables.
Applied in credit risk modeling, customer attrition models, and campaign response models.

From the credit scoring perspective:
"Bad Customers" refers to the customers who defaulted on a loan.
"Good Customers" refers to the customers who paid back their loan.

Distribution of Goods - % of Good Customers in a particular group
Distribution of Bads - % of Bad Customers in a particular group

WOE = ln(% of non-events / % of events)

Steps of Calculating WOE

For a continuous variable, split the data into 10 parts (or fewer, depending on the distribution).
Calculate the number of events and non-events in each group (bin).
Calculate the % of events and % of non-events in each group.
Calculate WOE by taking the natural log of the ratio of % of non-events to % of events.

Information Value (IV)

IV = ∑ (% of non-events - % of events) * WOE
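The WOE steps and the IV formula can be sketched as follows. The function name `woe_iv` and the bin counts are made up for illustration. The sketch assumes the data is already binned and that every bin contains at least one event and one non-event, since a zero count would break the logarithm.

```python
from math import log


def woe_iv(bins):
    """bins maps a bin label to (number of events, number of non-events)."""
    total_events = sum(events for events, _ in bins.values())
    total_non_events = sum(non_events for _, non_events in bins.values())

    woe = {}
    iv = 0.0
    for label, (events, non_events) in bins.items():
        # Share of all events / non-events that fall into this bin.
        pct_events = events / total_events
        pct_non_events = non_events / total_non_events
        # WOE = ln(% of non-events / % of events)
        woe[label] = log(pct_non_events / pct_events)
        # IV accumulates (% of non-events - % of events) * WOE over the bins.
        iv += (pct_non_events - pct_events) * woe[label]
    return woe, iv


# Hypothetical counts: events are defaults ("bad"), non-events are repayments ("good").
bins = {"low": (10, 40), "mid": (20, 40), "high": (20, 20)}
woe, iv = woe_iv(bins)
print(woe, iv)
```

A positive WOE marks a bin holding a larger share of non-events than of events, a negative WOE the reverse, and a bin with equal shares contributes nothing to IV.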

Quick Summary

For nominal columns:
OneHot, Hashing, LeaveOneOut, and Target encoding. Avoid OneHot for high-cardinality columns and decision tree-based algorithms.
For ordinal columns:
Ordinal (Integer), Binary, OneHot, LeaveOneOut, and Target
For regression tasks:
Target and LeaveOneOut probably won’t work well

Category Encoding Introduction