Introduction to Imbalanced Learning

  • Introduction to the basic problem
  • Common methods

Introduction to the basic problem

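  • Imbalanced datasets are fairly common in real-world problems.
  • In most cases, the minority positive samples are the ones we care about most.
  • Because of the imbalance, model selection, model evaluation, and the interpretation of model metrics all need to be adjusted.
  • Sample imbalance brings three main problems: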
    1. The machine problem
    2. The intrinsic problem
    3. The human problem

Think about the problem



  • Class C0:C1 = 9:1
  • C0 : Gaussian distribution of mean 0 and variance 4
  • C1 : Gaussian distribution of mean 2 and variance 1

$$P(C_0|x)>P(C_1|x)$$
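The inequality can be checked numerically for this setup. The following is a minimal sketch using only the Python standard library; the helper names (`gaussian_pdf`, `posterior_c1`) and the scan grid are illustrative assumptions, not part of the original slides. It applies Bayes' rule with priors 0.9 and 0.1 and looks for the largest value that $P(C_1|x)$ reaches.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)


# Setup from the slide: priors 9:1, C0 ~ N(0, 4), C1 ~ N(2, 1).
PRIOR_C0, PRIOR_C1 = 0.9, 0.1


def posterior_c1(x):
    """P(C1 | x) by Bayes' rule."""
    joint_c0 = gaussian_pdf(x, 0.0, 4.0) * PRIOR_C0
    joint_c1 = gaussian_pdf(x, 2.0, 1.0) * PRIOR_C1
    return joint_c1 / (joint_c0 + joint_c1)


# Scan a grid (an arbitrary choice for illustration: -10..10 in steps of 0.01).
grid = [i / 100 for i in range(-1000, 1001)]
best_x = max(grid, key=posterior_c1)
max_posterior = posterior_c1(best_x)

print(f"max P(C1|x) = {max_posterior:.3f} at x = {best_x:.2f}")
```

On this grid the posterior of the minority class peaks at roughly 0.30 near x ≈ 2.67, so it never exceeds 0.5. A classifier that minimizes the error rate would therefore predict C0 everywhere, even with perfect knowledge of both distributions.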

About the separability



  • Class C0:C1 = 9:1
  • C0 : Gaussian distribution of mean 0 and variance 4
  • C1 : Gaussian distribution of mean 10 and variance 1

Theoretical minimal error probability

$P(\text{wrong}|x)=\min(P(C_0|x),P(C_1|x))=\frac{\min(P(x|C_0)P(C_0),P(x|C_1)P(C_1))}{P(x)}$

$P(\text{wrong})=\int_\mathcal{R} P(\text{wrong}|x)P(x)\,dx=\int_\mathcal{R}\min(P(x|C_0)P(C_0),P(x|C_1)P(C_1))\,dx$
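The integral has no convenient closed form here, but it is easy to approximate. The sketch below evaluates it with a midpoint Riemann sum for the two setups described above (C1 mean 2 and C1 mean 10). The function name `bayes_error`, the integration range, and the step count are assumptions chosen for illustration.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def bayes_error(mean_c1, prior_c0=0.9, lo=-30.0, hi=40.0, steps=70_000):
    """Approximate the integral of min(P(x|C0)P(C0), P(x|C1)P(C1)) over [lo, hi].

    C0 ~ N(0, 4) and C1 ~ N(mean_c1, 1), as in the slides. The range and
    step count are illustrative choices wide enough to cover both densities.
    """
    prior_c1 = 1.0 - prior_c0
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx  # midpoint of the i-th sub-interval
        joint_c0 = gaussian_pdf(x, 0.0, 4.0) * prior_c0
        joint_c1 = gaussian_pdf(x, mean_c1, 1.0) * prior_c1
        total += min(joint_c0, joint_c1) * dx
    return total


error_overlapping = bayes_error(mean_c1=2.0)   # C1 ~ N(2, 1)
error_separated = bayes_error(mean_c1=10.0)    # C1 ~ N(10, 1)

print(f"overlapping classes: {error_overlapping:.4f}")
print(f"separated classes:   {error_separated:.4f}")
```

With the same 9:1 imbalance, the overlapping case gives a minimal error close to 0.1 (every C1 sample is misclassified), while the well-separated case gives an error far below 0.01. This suggests that the difficulty comes from the overlap between the classes rather than from the imbalance alone.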

Reworking On Data



  • Modifying the dataset with resampling-like methods changes the reality it represents.
  • It therefore has to be done with care.
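One way to see what "changing the reality" means is to look at the posterior before and after balancing. The sketch below rests on an idealized assumption: resampling to a 1:1 ratio acts like replacing the priors 0.9/0.1 with 0.5/0.5 while leaving the class-conditional densities untouched. It reuses the earlier example (C0 ~ N(0, 4), C1 ~ N(2, 1)); the function names and the query point x = 2.5 are arbitrary illustrative choices.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def posterior_c1(x, prior_c1):
    """P(C1 | x) for C0 ~ N(0, 4), C1 ~ N(2, 1) under a given C1 prior."""
    joint_c0 = gaussian_pdf(x, 0.0, 4.0) * (1.0 - prior_c1)
    joint_c1 = gaussian_pdf(x, 2.0, 1.0) * prior_c1
    return joint_c1 / (joint_c0 + joint_c1)


x = 2.5  # arbitrary query point near the C1 mean

# Original data: C0:C1 = 9:1.
original = posterior_c1(x, prior_c1=0.1)

# Assumption: after resampling to 1:1, the model effectively sees prior 0.5.
balanced = posterior_c1(x, prior_c1=0.5)

print(f"P(C1|x={x}) with 9:1 priors: {original:.3f}")
print(f"P(C1|x={x}) with 1:1 priors: {balanced:.3f}")
```

Under these assumptions the same point moves from below 0.5 to above 0.5, so a model trained on the balanced data would label it C1. Its predicted probabilities describe the resampled population rather than the original one, which is why they need to be interpreted, or corrected, with care.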
Machine Learning Applications and practices