본문 바로가기
R

[R] predict

by wycho 2021. 10. 19.

https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html

 

데이터를 통해 GLM으로 모델링한 model이 있다고 해보자. 함수 predict 는 모델로부터 값을 예측해주는 기능을 한다.

 

predict(model, newdata, type=c("link", "response", "terms"), se.fit=FALSE, dispersion=NULL, term=NULL, na.action=na.pass,..)

 

- Parameter

  * model : glm fittef model from data.

  * newdata : 예측하고자 하는 값이 있는 데이터.

  * type : { link : log-odds, response : expected value, terms : constant when the given value is 0 }.

  * se.fit : standard error.

  * dispersion : standard error 를 구할때 사용하는 dispersion. NULL 일 경우 summary(model)$dispersion 사용.

  * term : type="terms" 로 했을 경우, 꼭 들어가야 하는 term. NULL 일 경우 모든 term이 들어감.

  * na.action : 데이터의 NA 값을 어떻게 처리할 것인지 판단하는 값.

 

- Example : https://stats.idre.ucla.edu/r/dae/negative-binomial-regression/

m1 <- glm.nb(daysabs ~ math + prog, data = dat) # 2 independent variable, math and prog

newdata1 <- data.frame(math = mean(dat$math), prog = factor(1:3, levels = 1:3, 
    labels = levels(dat$prog)))

newdata1$phat <- predict(m1, newdata1, type = "response")

##    math       prog   phat
## 1 48.27    General 10.237
## 2 48.27   Academic  6.588
## 3 48.27 Vocational  2.850

newdata2 <- data.frame(
  math = rep(seq(from = min(dat$math), to = max(dat$math), length.out = 100), 3),
  prog = factor(rep(1:3, each = 100), levels = 1:3, labels =
  levels(dat$prog)))
  
newdata2 <- cbind(newdata2, predict(m1, newdata2, type = "link", se.fit=TRUE))
newdata2 <- within(newdata2, {
  DaysAbsent <- exp(fit)
  LL <- exp(fit - 1.96 * se.fit)
  UL <- exp(fit + 1.96 * se.fit)
})

##          math     prog      fit     se.fit residual.scale        UL       LL DaysAbsent
##   1  1.000000  General 2.609272 0.19606060              1 19.956427 9.253424  13.589161
##   2  1.989899  General 2.603340 0.19469699              1 19.785436 9.223309  13.508782
##   3  2.979798  General 2.597408 0.19335557              1 19.616763 9.192893  13.428879
##   4  3.969697  General 2.591475 0.19203679              1 19.450392 9.162169  13.349449
...
## 122 21.787879 Academic 2.043931 0.09627755              1  9.324403 6.393149   7.720900
## 123 22.777778 Academic 2.037998 0.09499133              1  9.245912 6.371376   7.675232
## 124 23.767677 Academic 2.032066 0.09375306              1  9.168943 6.349081   7.629833
## 125 24.757576 Academic 2.026134 0.09256468              1  9.093504 6.326245   7.584704
더보기
summary(m1)

## 
## Call:
## glm.nb(formula = daysabs ~ math + prog, data = dat, init.theta = 1.032713156, 
##     link = log)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.155  -1.019  -0.369   0.229   2.527  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     2.61527    0.19746   13.24  < 2e-16 ***
## math           -0.00599    0.00251   -2.39    0.017 *  
## progAcademic   -0.44076    0.18261   -2.41    0.016 *  
## progVocational -1.27865    0.20072   -6.37  1.9e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(1.033) family taken to be 1)
## 
##     Null deviance: 427.54  on 313  degrees of freedom
## Residual deviance: 358.52  on 310  degrees of freedom
## AIC: 1741
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  1.033 
##           Std. Err.:  0.106 
## 
##  2 x log-likelihood:  -1731.258

* dispersion parameter = ( sum of squared Pearson Residuals ) / ( degrees of freedom )

 

 

 

Reference

- https://feliperego.github.io/blog/2015/10/23/Interpreting-Model-Output-In-R

'R' 카테고리의 다른 글

[R] TCGAbiobanks  (0) 2022.03.10
[R] table information  (0) 2022.01.06
[R] Enhanced Volcano plot  (0) 2021.10.12
[R] boxplot options  (0) 2021.10.06
[R] install packages  (0) 2021.10.05

댓글