- StandardScaler
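: Standardizes each feature to zero mean and unit variance, z = (x - mean) / std. It only shifts and rescales the data, so the shape of the distribution is unchanged; it does not turn the data into a normal distribution N(0,1).
: Centers the mean at 0. It does not remove outliers. The mean and standard deviation are themselves sensitive to outliers, so extreme values can distort the scaling.
: Centering sparse data destroys its sparsity structure. Passing with_mean=False scales without centering and keeps the zeros.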
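A minimal sketch of the points above, using made-up toy data (the arrays are assumptions for illustration only). It checks that the scaled features have mean 0 and standard deviation 1, that the outlier is still there after scaling, and that `with_mean=False` keeps a sparse matrix sparse.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Toy data (assumed): two features on very different scales.
# The last row holds an outlier in the second feature.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 1000.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately 0 for each feature
print(X_scaled.std(axis=0))   # approximately 1 for each feature
print(X_scaled[-1])           # the outlier is rescaled, not removed

# Sparse input: centering would fill in the zeros, so scale without centering.
X_sparse = sparse.csr_matrix([[0.0, 2.0],
                              [0.0, 0.0],
                              [4.0, 0.0]])
sparse_scaler = StandardScaler(with_mean=False)
X_sparse_scaled = sparse_scaler.fit_transform(X_sparse)
print(X_sparse_scaled.nnz == X_sparse.nnz)  # number of stored non-zeros is unchanged
```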
- MinMaxScaler, MaxAbsScaler
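: Robust to features with very small standard deviations. MaxAbsScaler divides by the maximum absolute value without shifting, so zero entries stay zero, which suits sparse data.
: MinMaxScaler maps each feature to the range [0, 1]:
$$ x^\prime=\frac{x-x_{min}}{x_{max}-x_{min}} $$
: MaxAbsScaler maps each feature to the range [-1, 1]:
$$ x^\prime=\frac{x}{\max|x|} $$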
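A short sketch comparing the two scalers on a made-up array (the values are an assumption for illustration). It reproduces the min-max formula by hand and shows that MaxAbsScaler leaves zeros at zero.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler

# Toy data (assumed): first feature is symmetric around 0, second is non-negative.
X = np.array([[-2.0, 0.0],
              [0.0, 5.0],
              [2.0, 10.0]])

# MinMaxScaler: (x - min) / (max - min), computed per feature (column).
X_minmax = MinMaxScaler().fit_transform(X)
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# MaxAbsScaler: x / max(|x|), computed per feature, no shift.
X_maxabs = MaxAbsScaler().fit_transform(X)

print(X_minmax)  # each column spans 0 to 1
print(X_maxabs)  # zeros in X remain zeros
```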
- Normalizer
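: Normalizes each sample (row), not each feature (column).
: With the default norm='l2', each sample is rescaled so that its Euclidean norm is 1.
$$ \lVert x \rVert_2 = \sqrt{\sum_i x_i^2} = 1 $$
: Because it works per sample, it does not correct for features being on different scales. When scales differ across features, training tends to give larger weight to the features with larger values; preventing that, for example when the gap between minimum and maximum is large, is the job of the feature-wise scalers.
: Scaling values into the range 0 to 1 while keeping the shape of the distribution describes MinMaxScaler, not Normalizer.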
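A minimal sketch of per-sample normalization, with made-up rows chosen so the result is easy to check by hand. Rows that point in the same direction end up identical, which shows that Normalizer keeps direction and discards magnitude.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Toy data (assumed): rows 0 and 2 point in the same direction with different lengths.
X = np.array([[3.0, 4.0],
              [1.0, 0.0],
              [6.0, 8.0]])

# norm="l2" is the default; each row is divided by its own Euclidean norm.
X_norm = Normalizer(norm="l2").fit_transform(X)
row_norms = np.linalg.norm(X_norm, axis=1)

print(X_norm)     # rows 0 and 2 both become [0.6, 0.8]
print(row_norms)  # every row now has norm 1
```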
- RobustScaler
: When there are many outliers, or values large enough to strongly affect the mean and variance, scales the data using the median and the interquartile range (IQR) instead.
Usage
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()  # centers on the median, scales by the IQR
X_train_scaled = scaler.fit_transform(X_train)  # X_train: array of shape (n_samples, n_features)
| Method | Description |
|---|---|
| fit(X[, y]) | Compute the median and quantiles to be used for scaling. |
| fit_transform(X[, y]) | Fit to data, then transform it. |
| get_params([deep]) | Get parameters for this estimator. |
| inverse_transform(X) | Scale back the data to the original representation. |
| set_params(**params) | Set the parameters of this estimator. |
| transform(X) | Center and scale the data. |
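A runnable sketch of the usage and methods above. `X_train` and `X_test` are made-up arrays (an assumption for illustration), with one large outlier in the training data. The scaler is fitted on the training data only and then reused on the test data.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy data (assumed): a single feature with one large outlier.
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
X_test = np.array([[2.5], [50.0]])

scaler = RobustScaler()                          # default quantile_range is (25.0, 75.0)
X_train_scaled = scaler.fit_transform(X_train)   # learn median and IQR, then scale
X_test_scaled = scaler.transform(X_test)         # reuse the training statistics
X_train_restored = scaler.inverse_transform(X_train_scaled)

print(scaler.center_)  # median of the training data
print(scaler.scale_)   # interquartile range of the training data
print(X_train_scaled.ravel())
```

The outlier barely moves the median and IQR, so the four ordinary points land in a narrow range around 0 while the outlier stays far away from them.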
- QuantileTransformer, Non-linear Transformation
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html
: This method transforms the features to follow a uniform or a normal distribution. Therefore, for a given feature, this transformation tends to spread out the most frequent values. It also reduces the impact of (marginal) outliers: this is therefore a robust preprocessing scheme.
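A small sketch of both output distributions on synthetic skewed data. The data, the injected outlier, and the choice of `n_quantiles=100` are assumptions for illustration, not recommended settings.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Synthetic right-skewed data (assumed) with one extreme outlier.
rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))
X[0, 0] = 1e6

# Map to a uniform distribution on [0, 1].
qt_uniform = QuantileTransformer(n_quantiles=100,
                                 output_distribution="uniform",
                                 random_state=0)
X_uniform = qt_uniform.fit_transform(X)

# Map to a standard normal distribution.
qt_normal = QuantileTransformer(n_quantiles=100,
                                output_distribution="normal",
                                random_state=0)
X_normal = qt_normal.fit_transform(X)

print(X_uniform.min(), X_uniform.max())  # bounded by 0 and 1
print(X_uniform[0, 0])                   # the outlier maps to the top of the range
print(np.median(X_normal))               # close to 0
```

Because the mapping depends on ranks rather than magnitudes, the outlier at 1e6 ends up at the edge of the output range instead of stretching it.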
- PowerTransformer, Non-linear Transformation
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html
method={'yeo-johnson', 'box-cox'}
: 'yeo-johnson' works with positive and negative values
: 'box-cox' only works with strictly positive values
: Apply a power transform featurewise to make data more Gaussian-like.
: By bringing skewed data closer to a normal distribution, a Box-Cox transformation can help remove spurious interactions and make it easier to identify the factors that are really significant.
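A minimal sketch of the two methods on synthetic data (the lognormal sample and the shift by 1.0 are assumptions for illustration). It also shows that 'box-cox' rejects data containing non-positive values.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X_pos = rng.lognormal(size=(500, 1))  # strictly positive, right-skewed (assumed data)
X_mixed = X_pos - 1.0                 # shifted so that some values are negative

# Box-Cox: strictly positive input only. Output is standardized by default.
pt_bc = PowerTransformer(method="box-cox")
X_bc = pt_bc.fit_transform(X_pos)

# Yeo-Johnson: accepts positive and negative values.
pt_yj = PowerTransformer(method="yeo-johnson")
X_yj = pt_yj.fit_transform(X_mixed)

# Box-Cox on data with negative values raises a ValueError.
try:
    PowerTransformer(method="box-cox").fit(X_mixed)
    box_cox_failed = False
except ValueError:
    box_cox_failed = True

print(pt_bc.lambdas_)   # fitted power parameter, one per feature
print(box_cox_failed)   # True
```

For lognormal data the fitted Box-Cox parameter should come out near 0, which corresponds to a log transform.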
Reference
- https://scikit-learn.org/stable/modules/preprocessing.html
- https://towardsdatascience.com/box-cox-transformation-explained-51d745e34203
- https://www.isixsigma.com/tools-templates/normality/dealing-non-normal-data-strategies-and-tools/