[Kaggle] House Price Prediction EDA #3 (Handling Missing Values and Outliers)


1. Missing data: theory and visual exploration

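Types of missing data
1. Missing at Random (MAR)
  - Whether a value is missing depends on other, observed variables, but not on the missing value itself.
  - e.g. In a survey, the missing answers turn out to be confined to a few specific questions, and the cause is that some respondents did not notice that page 3 had a reverse side and left it blank. Such a pattern can be identified as MAR.
2. Missing Completely at Random (MCAR)
  - Missingness occurs at random across the whole dataset, regardless of which variable it is or what value it holds.
  - This kind of missing data does not bias the analysis, although it still reduces the number of usable observations.
3. Not Missing at Random (NMAR, also written MNAR)
  - The reason a value is missing is related to the missing value itself.
  - e.g. Some respondents deliberately avoid a survey question for religious or political reasons, or a mistake is made during data entry.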
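
To make the three mechanisms concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the synthetic `age` and `income` columns, the missing probabilities, and the helper `observed_mean` are not part of the Kaggle data. It removes `income` values under each mechanism and compares the mean of what remains with the true mean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic data (hypothetical): income rises with age
age = rng.uniform(20, 70, n)
income = 1000 + 50 * age + rng.normal(0, 300, n)
df = pd.DataFrame({"age": age, "income": income})

# MCAR: every income value has the same 30% chance of being missing
mcar_mask = rng.random(n) < 0.3
# MAR: missingness depends on the observed variable age, not on income itself
mar_mask = rng.random(n) < np.where(df["age"] > 50, 0.6, 0.1)
# NMAR: missingness depends on the income value that goes missing
nmar_mask = rng.random(n) < np.where(df["income"] > df["income"].median(), 0.6, 0.1)


def observed_mean(mask):
    """Mean of income after the masked entries are set to NaN."""
    return df["income"].mask(mask).mean()


full_mean = df["income"].mean()
mcar_mean = observed_mean(mcar_mask)
mar_mean = observed_mean(mar_mask)
nmar_mean = observed_mean(nmar_mask)

print(f"true mean : {full_mean:.0f}")
print(f"MCAR mean : {mcar_mean:.0f}")
print(f"MAR mean  : {mar_mean:.0f}")
print(f"NMAR mean : {nmar_mean:.0f}")
```

In this setup the MCAR mean stays close to the true mean, while the MAR and NMAR means come out lower. Under MAR the gap can in principle be explained by the observed variable (`age`), whereas under NMAR the observed data alone cannot reveal it.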

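# Visualizing missing data
# `train` is the training DataFrame from the earlier posts in this series
import pandas as pd

missing_total = train.isnull().sum().sort_values(ascending=False)
# isnull().count() is the number of rows, so this is the fraction (0 to 1) of missing values per column
missing_rate = (train.isnull().sum() / train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([missing_total, missing_rate], axis=1, keys=['Total', 'Percent'])
print(missing_data.head(10))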
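
To check what this table contains, the same computation can be run on a tiny frame where the counts are easy to verify by hand. This is only a sketch: the column names are borrowed from the house price data, but the values are made up, and `isnull().mean()` is used as a shorter way to get the same fraction.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

total = toy.isnull().sum()   # number of missing values per column
rate = toy.isnull().mean()   # fraction of missing values per column
summary = pd.concat([total, rate], axis=1, keys=["Total", "Percent"])
summary = summary.sort_values("Total", ascending=False)
print(summary)
```

Sorting the combined table once, instead of sorting each Series separately, keeps the two columns aligned by construction.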


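# Bar chart of the missing-value percentage per feature
import matplotlib.pyplot as plt
import seaborn as sns

f, ax = plt.subplots(figsize=(15, 6))
# 'Percent' holds a fraction, so scale it by 100 to match the axis label
sns.barplot(x=missing_data.index, y=missing_data['Percent'] * 100)
plt.xticks(rotation=90)  # numeric angle, set after plotting so it applies to the feature labels
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)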
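
The chart above draws one bar for every column, including those with no missing values at all. A possible variant, sketched here with made-up percentages and plain matplotlib, keeps only the features that actually have missing values before plotting. The Series `missing_pct` is a hypothetical stand-in for the 'Percent' column scaled by 100.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up percentages for illustration only
missing_pct = pd.Series({"PoolQC": 99.5, "Alley": 93.8, "LotArea": 0.0, "Street": 0.0})

# Keep only features with at least one missing value, largest first
with_missing = missing_pct[missing_pct > 0].sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(with_missing.index, with_missing.values)
ax.tick_params(axis="x", labelrotation=90)
ax.set_xlabel("Features")
ax.set_ylabel("Percent of missing values")
ax.set_title("Percent missing data by feature")
fig.tight_layout()
```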

2. Outliers: detection and visual exploration

Outlier: an observation that lies unusually far from the rest of the data, for example far from the mean.
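
The two rules used in this section's code can be written as follows. The notation is an assumption made here for readability: $\bar{x}$ and $s$ are the sample mean and standard deviation, $Q_1$ and $Q_3$ are the 25th and 75th percentiles, and $n$ and $k$ correspond to the `nstd` and `k` parameters of the functions.

```latex
% Standard deviation rule: x is flagged when
|x - \bar{x}| > n \cdot s, \qquad n = 3 \text{ by default}

% IQR (box plot) rule: x is flagged when
x < Q_1 - k \cdot \mathrm{IQR}
\quad \text{or} \quad
x > Q_3 + k \cdot \mathrm{IQR},
\qquad \mathrm{IQR} = Q_3 - Q_1, \quad k = 1.5 \text{ by default}
```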

# Outlier detection with the standard deviation and with the box plot (IQR) rule
import numpy as np


def out_std(series, nstd=3.0, return_thresholds=False):
    # flag values more than nstd standard deviations away from the mean
    data_mean, data_std = series.mean(), series.std()
    cut_off = data_std * nstd
    lower, upper = data_mean - cut_off, data_mean + cut_off
    if return_thresholds:
        return lower, upper
    # one boolean per value: True means outlier
    return [bool(x < lower or x > upper) for x in series]


def out_iqr(series, k=1.5, return_thresholds=False):
    # calculate the interquartile range
    q25, q75 = np.percentile(series, 25), np.percentile(series, 75)
    iqr = q75 - q25
    # calculate the outlier cutoff
    cut_off = iqr * k
    lower, upper = q25 - cut_off, q75 + cut_off
    if return_thresholds:
        return lower, upper
    # identify outliers: one boolean per value, True means outlier
    return [bool(x < lower or x > upper) for x in series]
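
The two rules do not always agree. The sketch below uses vectorized variants with hypothetical names (`std_mask`, `iqr_mask`) and a made-up series to show one case: in a small sample, a single extreme value inflates the mean and the standard deviation enough that the 3-standard-deviation rule no longer flags it, while the IQR rule still does.

```python
import pandas as pd


def std_mask(series, nstd=3.0):
    """Boolean Series: True where a value is more than nstd std devs from the mean."""
    lower = series.mean() - nstd * series.std()
    upper = series.mean() + nstd * series.std()
    return (series < lower) | (series > upper)


def iqr_mask(series, k=1.5):
    """Boolean Series: True where a value lies beyond k * IQR from the quartiles."""
    q25, q75 = series.quantile(0.25), series.quantile(0.75)
    cut_off = k * (q75 - q25)
    return (series < q25 - cut_off) | (series > q75 + cut_off)


# Made-up values: six similar observations and one extreme value
values = pd.Series([10, 12, 11, 13, 12, 11, 100])

print("flagged by std rule:", int(std_mask(values).sum()))
print("flagged by IQR rule:", int(iqr_mask(values).sum()))
```

Returning a boolean Series rather than a list keeps the index, so the mask can be passed directly to `series[mask]` or `df[mask]`.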


# Flag outliers in LotArea with out_std()
lotArea_outlier = out_std(train['LotArea'], nstd=3)
print(lotArea_outlier[:10])


# Look at the flagged values (out_std() flags both the low and the high side)
data = train.copy()
data.loc[lotArea_outlier, 'LotArea']


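# Histogram with a dashed line at each flagged value
plt.figure(figsize=(8, 6))
sns.histplot(data['LotArea'], kde=False)  # distplot is deprecated; histplot needs seaborn 0.11 or later
plt.vlines(data.loc[lotArea_outlier, 'LotArea'], ymin=0, ymax=100, linestyles='dashed')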


# Same plot using the IQR rule
lotArea_outlier_iqr = out_iqr(train['LotArea'])
data = train.copy()
data.loc[lotArea_outlier_iqr, 'LotArea']

plt.figure(figsize=(8, 6))
sns.histplot(data['LotArea'], kde=False)  # distplot is deprecated; histplot needs seaborn 0.11 or later
plt.vlines(data.loc[lotArea_outlier_iqr, 'LotArea'], ymin=0, ymax=100, linestyles='dashed')
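
Drawing one dashed line per flagged value can clutter the plot when many values are flagged. An alternative, sketched here under assumptions, is to draw only the two cut-off lines, which is what the `return_thresholds=True` option of the functions above would provide. The data below is a synthetic, right-skewed stand-in and not the real LotArea column.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
lot_area = rng.lognormal(mean=9.0, sigma=0.5, size=1000)  # synthetic stand-in for LotArea

# IQR thresholds, computed the same way as in the IQR rule
q25, q75 = np.percentile(lot_area, [25, 75])
cut_off = 1.5 * (q75 - q25)
lower, upper = q25 - cut_off, q75 + cut_off

fig, ax = plt.subplots(figsize=(8, 6))
ax.hist(lot_area, bins=50)
for threshold in (lower, upper):
    ax.axvline(threshold, linestyle="dashed", color="black")  # one line per cut-off

n_flagged = int(((lot_area < lower) | (lot_area > upper)).sum())
print("values outside the thresholds:", n_flagged)
```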


# Remove outliers
data = train.copy()
len(data)

# lotArea_outlier_iqr is a list of booleans where True marks an outlier, so keep only the rows marked False
data['outlier_LotArea'] = lotArea_outlier_iqr
data2 = data[~data['outlier_LotArea']]
len(data2)
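
The removal step can be checked on a small frame where the result is obvious. This is a sketch with made-up values; it filters with a boolean mask directly instead of adding a helper column, and the names `toy`, `is_outlier` and `cleaned` are hypothetical.

```python
import pandas as pd

# Made-up values: row 7 holds the only extreme LotArea
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6, 7],
    "LotArea": [10, 12, 11, 13, 12, 11, 100],
})

# IQR rule on the LotArea column
q25, q75 = toy["LotArea"].quantile(0.25), toy["LotArea"].quantile(0.75)
cut_off = 1.5 * (q75 - q25)
is_outlier = (toy["LotArea"] < q25 - cut_off) | (toy["LotArea"] > q75 + cut_off)

# Keep the rows that are NOT outliers; the original frame is left untouched
cleaned = toy[~is_outlier].reset_index(drop=True)
print(len(toy), len(cleaned))
```

Comparing the two lengths, as the original code does with `len(data)` and `len(data2)`, is a quick way to see how many rows the rule removes.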
