Confidence Interval

Most of the data we work with is sample data. A sample is usually too small to show the global features of the population it came from. Many distributions and statistical concepts were developed so that those global features can be inferred from a sample.

One important measure, the confidence interval, tells us how well sample data can describe the population. The procedure is as follows.

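We cannot know the population mean, so we compute the means of many samples instead. With enough sample means, they form a distribution of their own, the sampling distribution of the mean, which is centered on the population mean. The assumption here is the central limit theorem: as the sample size grows, the distribution of the sample mean approaches a normal distribution, whatever the shape of the population (provided its variance is finite). From the sampling distribution we can then compute a standard deviation, which measures how far a sample mean typically lies from the 'mean of the means'.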
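The idea can be checked with a small simulation. The sketch below is only an illustration under assumed settings: the exponential population, the sample size `n = 50`, and the 2000 repetitions are all made up for the example and do not come from the references.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: exponential with mean 1.0 and SD 1.0 (clearly non-normal).
def draw_sample(n):
    return [random.expovariate(1.0) for _ in range(n)]

n = 50              # size of each sample (assumed)
num_samples = 2000  # number of repeated samples (assumed)

# One mean per sample: together they form the sampling distribution of the mean.
sample_means = [statistics.fmean(draw_sample(n)) for _ in range(num_samples)]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)

print(f"mean of sample means: {mean_of_means:.3f} (population mean is 1.0)")
print(f"SD of sample means:   {sd_of_means:.3f} (sigma / sqrt(n) = {1.0 / n ** 0.5:.3f})")
```

Even though the population is skewed, the sample means cluster around the population mean, and their spread is close to the population standard deviation divided by the square root of the sample size.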

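Dividing a sample's standard deviation by the square root of the sample size gives the SEM, the standard error of the mean, which estimates the standard deviation of the sampling distribution. To estimate the population mean with 95% confidence, we take the interval that extends 1.96 × SEM on either side of the sample mean. This interval is the 95% confidence interval.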
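As a minimal sketch of this calculation, the function below computes the SEM and the interval for one sample. The function name `mean_confidence_interval` and the data values are hypothetical, chosen only for illustration; with a sample this small, 1.96 is only an approximation.

```python
import math
import statistics

def mean_confidence_interval(data, z=1.96):
    """Return (mean, sem, lower, upper) using the normal approximation."""
    n = len(data)
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)   # sample SD (n - 1 in the denominator)
    sem = sd / math.sqrt(n)       # standard error of the mean
    return mean, sem, mean - z * sem, mean + z * sem

# Made-up measurements, not taken from any real data set.
data = [4.8, 5.1, 5.0, 4.7, 5.3, 5.2, 4.9, 5.0, 5.1, 4.9]

mean, sem, lower, upper = mean_confidence_interval(data)
print(f"mean = {mean:.3f}, SEM = {sem:.4f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```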

A confidence interval can be considered in two cases: 1) the population standard deviation is known, and 2) nothing is known about the population.

In case 1), suppose we randomly draw 100 samples and compute a confidence interval from each one. The 95% chosen above means that about 95 of those 100 intervals will contain the population mean.
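This reading can be tried out in a simulation where the population is known by construction. The sketch below assumes a hypothetical normal population (`MU = 100`, `SIGMA = 15`) and a sample size of 30; none of these numbers come from the references.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

MU, SIGMA = 100.0, 15.0  # hypothetical population parameters
n = 30                   # size of each sample (assumed)
num_intervals = 100
z = statistics.NormalDist().inv_cdf(0.975)  # about 1.96

sem = SIGMA / math.sqrt(n)  # sigma is known, so the SEM is exact

covered = 0
for _ in range(num_intervals):
    sample = [random.gauss(MU, SIGMA) for _ in range(n)]
    m = statistics.fmean(sample)
    # Count the interval if it contains the true population mean.
    if m - z * sem <= MU <= m + z * sem:
        covered += 1

print(f"{covered} of {num_intervals} intervals contain the population mean")
```

The count varies from run to run around 95; it is the long-run proportion, not any single batch of 100, that equals 95%.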

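In case 2), the standard deviation has to be estimated from the sample itself, so the interval is built from the sample's own SEM. The reading is approximately the same: if we draw 100 samples and compute an interval from each, about 95 of those intervals are expected to contain the population mean. It does not mean that 95 of 100 new sample means will fall inside one particular interval.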
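When the population standard deviation is unknown, a common approach (not described in this post) is to replace 1.96 with a critical value from Student's t distribution. The sketch below compares the two multipliers under assumed settings: a hypothetical normal population, samples of size 10, and the tabulated two-sided 95% t value for 9 degrees of freedom, 2.262.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable

MU, SIGMA = 50.0, 8.0  # hypothetical population; SIGMA is not used in the intervals
n = 10                 # small sample size (assumed)
Z_CRIT = 1.96          # normal critical value
T_CRIT = 2.262         # t critical value for n - 1 = 9 degrees of freedom (from a t table)
trials = 5000

z_hits = t_hits = 0
for _ in range(trials):
    sample = [random.gauss(MU, SIGMA) for _ in range(n)]
    m = statistics.fmean(sample)
    sem = statistics.stdev(sample) / math.sqrt(n)  # SEM estimated from the sample
    if abs(m - MU) <= Z_CRIT * sem:
        z_hits += 1
    if abs(m - MU) <= T_CRIT * sem:
        t_hits += 1

z_coverage = z_hits / trials
t_coverage = t_hits / trials
print(f"coverage with 1.96:  {z_coverage:.3f}")
print(f"coverage with 2.262: {t_coverage:.3f}")
```

With small samples the 1.96 multiplier tends to cover the population mean less often than 95% of the time, while the t-based multiplier stays close to 95%. For large samples the two multipliers are nearly identical.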

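The confidence interval depends on the SEM, and the SEM depends on the sample size. As the sample size grows, the SEM shrinks and the confidence interval narrows, so the population mean can be estimated more precisely. That much is expected; the practical use of a confidence interval is the following.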
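The effect of the sample size can be seen with simple arithmetic. The sketch below holds the standard deviation fixed at an assumed value of 10 and uses a hypothetical helper, `ci_half_width`, to print the half-width of the 95% interval for several sample sizes.

```python
import math

def ci_half_width(sd, n, z=1.96):
    """Half-width of the interval: z times the SEM."""
    return z * sd / math.sqrt(n)

sd = 10.0  # assumed standard deviation, held fixed for illustration
for n in (25, 100, 400, 1600):
    print(f"n = {n:5d}  SEM = {sd / math.sqrt(n):.3f}  half-width = {ci_half_width(sd, n):.3f}")
```

Because the SEM scales with one over the square root of the sample size, the sample has to be four times larger to halve the width of the interval.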

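Suppose we compare two samples. If the confidence intervals computed for each barely overlap, it is hard to argue that both groups were drawn from the same population. Conversely, if they overlap heavily, the data are consistent with both groups having been drawn from the same population. This can be used to check the homogeneity of sampled groups or to compare groups.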
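A rough version of this comparison can be sketched as follows. The helper names `interval` and `overlap` and all three data sets are hypothetical. Overlap is treated here as a quick visual check rather than a formal test: overlapping intervals do not by themselves show that two groups share a population.

```python
import math
import statistics

def interval(data, z=1.96):
    """95% interval for the mean, normal approximation: (lower, upper)."""
    m = statistics.fmean(data)
    sem = statistics.stdev(data) / math.sqrt(len(data))
    return m - z * sem, m + z * sem

def overlap(a, b):
    """Length of the overlap between two intervals; 0.0 if they are disjoint."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

# Made-up groups for illustration only.
group_a = [5.1, 4.9, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9]
group_b = [6.0, 6.2, 5.9, 6.1, 6.3, 5.8, 6.0, 6.1]
group_c = [5.0, 5.2, 4.9, 5.1, 5.0, 5.3, 4.9, 5.0]

ci_a, ci_b, ci_c = interval(group_a), interval(group_b), interval(group_c)
print(f"A vs B overlap: {overlap(ci_a, ci_b):.3f}")
print(f"A vs C overlap: {overlap(ci_a, ci_c):.3f}")
```

In this made-up data, the intervals for groups A and B are disjoint, while those for A and C share a sizeable stretch.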

Statistics is ultimately about the population, and keeping this in mind makes problems easier to understand.

Reference

- http://onlinestatbook.com/2/estimation/difference_means.html

- Standard deviation and standard error of the mean. https://ekja.org/upload/pdf/kjae-68-220_ko.pdf

- Inference by eye: confidence intervals and how to read pictures of data. https://doi.org/10.1037/0003-066x.60.2.170
