
sklearn - Scaler

by wycho 2021. 6. 23.

- StandardScaler

: Rescales each feature to a standard normal scale, N(0,1) (zero mean, unit variance).

: Centers the mean at 0. Because it relies on the mean and variance, it is strongly affected by outliers rather than removing them.

: Applied to sparse data, it destroys the sparseness structure (centering fills in the zeros).
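A minimal sketch of StandardScaler's behavior (the example matrix is my own; per feature, it computes z = (x - mean) / std):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # z = (x - mean) / std, column-wise

# After scaling, each column has mean ~0 and std ~1
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```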

 

- MinMaxScaler, MaxAbsScaler

: Suitable for data whose standard deviation is very small; for sparse data, MaxAbsScaler is the appropriate choice because it scales by the maximum absolute value without shifting the data.

$$ x^\prime=\frac{x-x_{min}}{x_{max}-x_{min}} $$
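A short sketch contrasting the two (example values are my own): MinMaxScaler applies the formula above per feature, mapping into [0, 1], while MaxAbsScaler divides each feature by its maximum absolute value, mapping into [-1, 1] without centering.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler

X = np.array([[-1.0,  2.0],
              [ 0.0,  6.0],
              [ 1.0, 10.0]])

mm = MinMaxScaler().fit_transform(X)  # (x - x_min) / (x_max - x_min), per column
ma = MaxAbsScaler().fit_transform(X)  # x / max(|x|), per column; zeros stay zero

print(mm)  # each column mapped to [0, 1]
print(ma)  # each column mapped to [-1, 1]
```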

 

- Normalizer

: Normalizes each sample (row), not each feature.

: Scales each sample so that its Euclidean norm equals 1.

$$ \lVert x \rVert_2 = \sqrt{\sum_i x_i^2} = 1 $$

: When feature scales differ, this prevents features with large values from receiving disproportionately large weights during training.

In other words, it is used when the gap between minimum and maximum values is large.

: Since each sample vector has unit norm, every component ends up in [-1, 1], and the direction (relative proportions) of the sample is preserved.
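A minimal sketch of per-sample normalization (example rows are my own; the (3, 4) row has norm 5, so it becomes (0.6, 0.8)):

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0],
              [1.0, 0.0]])

# norm='l2' divides each row by its Euclidean norm
X_norm = Normalizer(norm='l2').fit_transform(X)
print(X_norm)  # [[0.6, 0.8], [1.0, 0.0]]
```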

 

 

- RobustScaler

: When outliers are numerous or extreme enough to strongly affect the mean and variance, scales using the median and the interquartile range (IQR) instead.

 

Usage

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)

 

fit(X[, y])            Compute the median and quantiles to be used for scaling.
fit_transform(X[, y])  Fit to data, then transform it.
get_params([deep])     Get parameters for this estimator.
inverse_transform(X)   Scale back the data to the original representation
set_params(**params)   Set the parameters of this estimator.
transform(X)           Center and scale the data.
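A small sketch of the methods above in action (example data with an outlier is my own): the median maps to 0, the IQR sets the scale, and inverse_transform recovers the original values.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier

rs = RobustScaler()  # subtracts the median, divides by the IQR (25th-75th percentile by default)
X_scaled = rs.fit_transform(X)

# median = 3, IQR = 4 - 2 = 2, so the median maps to 0
print(X_scaled.ravel())

X_back = rs.inverse_transform(X_scaled)  # recovers the original values
```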

 

- QuantileTransformer, Non-linear Transformation

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html

: This method transforms the features to follow a uniform or a normal distribution. Therefore, for a given feature, this transformation tends to spread out the most frequent values. It also reduces the impact of (marginal) outliers: this is therefore a robust preprocessing scheme.
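A minimal sketch of the uniform mapping on skewed data (sample data and parameter choices are my own):

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.RandomState(0)
X = rng.lognormal(size=(1000, 1))  # right-skewed data with large outliers

qt = QuantileTransformer(output_distribution='uniform',
                         n_quantiles=100, random_state=0)
X_u = qt.fit_transform(X)  # ranks mapped into [0, 1]; outliers are clipped to the range

print(X_u.min(), X_u.max())
```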

 

- PowerTransformer, Non-linear Transformation

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html

  method={‘yeo-johnson’, ‘box-cox’}

  : ‘yeo-johnson’ works with positive and negative values
  : ‘box-cox’ only works with strictly positive values

: Apply a power transform featurewise to make data more Gaussian-like.

: The Box-Cox transformation is useful for removing spurious interactions and identifying the factors that are truly significant.
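A minimal sketch contrasting the two methods (sample data is my own): both fit a power transform per feature, and with the default standardize=True the output is also centered to zero mean and unit variance.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
X = rng.lognormal(size=(500, 1))  # strictly positive, right-skewed

pt_yj = PowerTransformer(method='yeo-johnson')  # accepts positive and negative values
pt_bc = PowerTransformer(method='box-cox')      # requires strictly positive values

X_yj = pt_yj.fit_transform(X)
X_bc = pt_bc.fit_transform(X)

# standardize=True (default) gives ~zero mean, unit variance output
print(X_yj.mean(), X_bc.mean())
```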

 

 

 

Reference

- https://scikit-learn.org/stable/modules/preprocessing.html

- https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py

- https://towardsdatascience.com/box-cox-transformation-explained-51d745e34203

- https://www.isixsigma.com/tools-templates/normality/dealing-non-normal-data-strategies-and-tools/

 

 

 

 
