
[R] prediction from a rank-deficient fit may be misleading

멋쟁이천재사자 2022. 10. 23. 10:59

1. When it occurs

 

The warning appeared when I built a logistic regression model with caret's train function (method="glm") and then tried to make predictions with the predict function.


prediction from a rank-deficient fit may be misleading

library(caret)
library(dplyr)
library(recipes)
library(randomForest)
set.seed(2022)

titanic <- read.csv("dataset/titanic.csv")

titanic %>% 
  mutate(cabin=ifelse(cabin =="",NA,cabin)) %>% 
  mutate(embarked=ifelse(embarked =="",NA,embarked)) %>% 
  mutate(survived=factor(survived,
                         levels =c(0,1),
                         labels = c("Died", "Survived"))) %>% 
  mutate(sex=factor(sex,
                    levels =c("female","male"),
                    labels = c("Female", "Male"))) %>% 
  mutate(embarked=factor(embarked,
                         levels =c("C","Q","S"),
                         labels = c("Cherbourg", "Queenstown", "Southampton"))) %>% 
  mutate(age_1=floor(age/10)) %>% 
  mutate(age_1 = ifelse(age_1 > 7,7,age_1)) %>% 
  mutate(embarked=ifelse(is.na(embarked),"Southampton",embarked),
         cabin=ifelse(is.na(cabin),"C23 C25 C27",cabin),
         fare=ifelse(is.na(fare),median(fare,na.rm = T),fare),
         age=ifelse(is.na(age),median(age,na.rm = T),age),
         age_1=ifelse(is.na(age_1),median(age_1,na.rm = T),age_1)
  ) %>% 
  data.frame() -> titanic.clean

train <- createDataPartition(y=titanic.clean$survived, p=0.7,list = F)
NROW(titanic.clean);NROW(train)
titanic.clean.train <- titanic.clean[train,]
titanic.clean.test <- titanic.clean[-train,]

train_rec <-
  recipe(survived ~ pclass+sex+sibsp+parch+fare+embarked, data = titanic.clean.train) %>%
  step_dummy(sex,embarked) %>%
  prep()

titanic.clean.train.j <- juice(train_rec)
titanic.clean.test.b <- bake(train_rec, new_data = titanic.clean.test)

titanic.glm <- train(form=survived ~ ., 
                     data=titanic.clean.train.j, 
                     method="glm",  
                     trControl=trainControl(method="cv",number=10), 
                     tuneLength=10)  
#prediction from a rank-deficient fit may be misleading
pred.glm <- predict(titanic.glm, titanic.clean.test.b)

 

 

 

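2. Cause

In multiple regression, this warning has two possible causes.
Cause 1: Two predictor variables are perfectly correlated (more generally, one predictor is an exact linear combination of the others).
Cause 2: The model has more parameters than there are observations.

Source: https://www.statology.org/prediction-from-rank-deficient-fit-may-be-misleading/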


#create data frame
df <- data.frame(x1=c(1, 2, 3, 4),
                 x2=c(2, 4, 6, 8),
                 y=c(6, 10, 19, 26))

#fit multiple linear regression model
model <- lm(y~x1+x2, data=df)

#use model to make predictions
predict(model, df)
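
One way to confirm that a fit like this is rank-deficient is to inspect the fitted object itself. The sketch below refits the same model and looks for coefficients that could not be estimated; the name rank_deficient is introduced here only for illustration.

```r
# Same data as in the Cause 1 example: x2 is exactly 2 * x1
df <- data.frame(x1 = c(1, 2, 3, 4),
                 x2 = c(2, 4, 6, 8),
                 y  = c(6, 10, 19, 26))
model <- lm(y ~ x1 + x2, data = df)

# lm() cannot estimate a separate coefficient for x2, so it reports NA
coefs <- coef(model)
print(coefs)

# The fit is rank-deficient when its rank is below the number of coefficients
rank_deficient <- model$rank < length(coefs)
print(rank_deficient)

# alias() lists the terms that are linear combinations of the others
print(alias(model)$Complete)
```

An NA coefficient is usually the quickest sign of which column is redundant.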

 

 

#create data frame
df <- data.frame(x1=c(1, 2, 3, 4),
                 x2=c(3, 3, 8, 12),
                 x3=c(4, 6, 3, 11),
                 y=c(6, 10, 19, 26))

#fit multiple linear regression model
model <- lm(y~x1*x2*x3, data=df)

#use model to make predictions
predict(model, df)
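
For Cause 2, it can help to count the parameters explicitly, since the formula x1*x2*x3 expands into more terms than it appears to. The sketch below is a minimal check using model.matrix; n_obs, n_params and n_unestimated are illustrative names.

```r
# Same data as in the Cause 2 example
df <- data.frame(x1 = c(1, 2, 3, 4),
                 x2 = c(3, 3, 8, 12),
                 x3 = c(4, 6, 3, 11),
                 y  = c(6, 10, 19, 26))

# x1*x2*x3 expands to the intercept, three main effects,
# three two-way interactions and one three-way interaction
X <- model.matrix(y ~ x1 * x2 * x3, data = df)
n_obs    <- nrow(X)
n_params <- ncol(X)
cat("observations:", n_obs, " parameters:", n_params, "\n")

# With fewer observations than parameters, some coefficients stay NA
model <- lm(y ~ x1 * x2 * x3, data = df)
n_unestimated <- sum(is.na(coef(model)))
print(n_unestimated)
```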

 

 

 

3. Resolution

 

In other words, the problem is not in how predict or train is used but in the dataset titanic.clean.train.j itself.

 

The training set has 917 observations, which is plenty, so Cause 2 does not apply.

 

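Checking the correlations also turned up no pair of variables with a correlation of 1 or -1.
Calling cor(titanic.clean.train.j) directly fails with the error below because the data frame contains a non-numeric column, so I converted it with model.matrix before testing.
Error in cor(titanic.clean.test.b) : 'x' must be numeric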
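
The check described above can be sketched as follows. The data frame toy is a hypothetical stand-in for titanic.clean.train.j with made-up values, and the rank comparison at the end is an addition that the original check did not include.

```r
# Toy stand-in for titanic.clean.train.j (hypothetical values)
toy <- data.frame(
  survived = factor(c("no", "yes", "no", "yes", "no", "yes")),
  fare     = c(7.25, 71.3, 8.05, 53.1, 8.46, 51.9),
  pclass   = c(3, 1, 3, 1, 3, 1),
  sex_male = c(1, 0, 0, 0, 1, 1)
)

# model.matrix() builds the numeric design matrix that glm() would use
X <- model.matrix(survived ~ ., data = toy)

# Pairwise correlations, excluding the constant intercept column
print(round(cor(X[, -1]), 2))

# Pairwise correlation misses dependencies among three or more columns,
# so also compare the matrix rank with the number of columns
full_rank <- qr(X)$rank == ncol(X)
print(full_rank)
```

A column that is constant (for example, a dummy variable that is all zero) gives NA correlations rather than 1 or -1, so a correlation table alone can look clean even when the design matrix is not full rank. caret also provides findLinearCombos() for locating such columns.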

 

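The warning went away after I rewrote the logic that replaces missing values in the columns from embarked through age with the median or the mode.

I am still investigating which difference between the two versions made the warning disappear.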


titanic %>% 
  mutate(cabin=ifelse(cabin =="",NA,cabin)) %>% 
  mutate(embarked=ifelse(embarked =="",NA,embarked)) %>% 
  mutate(survived=factor(survived,
                         levels =c(0,1),
                         labels = c("Died", "Survived"))) %>% 
  mutate(sex=factor(sex,
                    levels =c("female","male"),
                    labels = c("Female", "Male"))) %>% 
  mutate(embarked=factor(embarked,
                         levels =c("C","Q","S"),
                         labels = c("Cherbourg", "Queenstown", "Southampton"))) %>% 
  mutate_if(is.character,factor) %>% 
  na.roughfix() %>% 
  mutate(name=as.character(name),
         ticket=as.character(ticket),
         cabin=as.character(cabin)
  ) %>% 
  mutate(age_1=floor(age/10)) %>% 
  mutate(age_1 = ifelse(age_1 > 7,7,age_1)) %>% 
  data.frame() -> titanic.clean
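
One difference between the two versions can be checked in isolation: how the missing embarked values are filled. The first version calls ifelse() on a column that is already a factor, while na.roughfix() imputes the factor directly. The sketch below uses a hypothetical four-value vector. Whether this is what actually triggered the warning is an assumption, not something confirmed in this post.

```r
# Hypothetical miniature of the embarked column, with one missing value
embarked <- factor(c("C", NA, "S", "Q"),
                   levels = c("C", "Q", "S"),
                   labels = c("Cherbourg", "Queenstown", "Southampton"))

# ifelse() drops the factor attributes, so the non-missing values come
# back as the underlying integer codes (as strings here)
filled_ifelse <- ifelse(is.na(embarked), "Southampton", embarked)
print(filled_ifelse)

# Assigning into the factor keeps the original three levels
filled_factor <- embarked
filled_factor[is.na(filled_factor)] <- "Southampton"
print(filled_factor)
```

If the first version behaves the same way on the full data, embarked ends up holding the codes "1", "2" and "3" plus the fill-in label as a rare fourth category. A dummy column for such a category could be all zero within a cross-validation fold, which would make that fold's fit rank-deficient and may explain the warning.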