[Kaggle] Company Rating Prediction #02 (EDA)

01. Heatmap

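Check the correlations among the explanatory variables
- The closer the coefficient is to 1, the stronger the positive correlation
- The closer the coefficient is to -1, the stronger the negative correlation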
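
As a quick illustration of how these coefficients behave, here is a minimal sketch on toy data. The column names `x`, `up`, and `down` are made up for this example and are not part of the Kaggle dataset. `DataFrame.corr()` computes the Pearson coefficient by default.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, not from the Kaggle dataset)
x = np.arange(10, dtype=float)
toy = pd.DataFrame({
    "x": x,
    "up": 2 * x + 1,     # rises linearly with x
    "down": -3 * x + 5,  # falls linearly with x
})

# Pearson correlation matrix (the default method of DataFrame.corr)
corr = toy.corr()
print(corr.round(2))
```

A perfectly linear increasing relationship gives a coefficient of 1 and a perfectly linear decreasing one gives -1. Real survey columns fall somewhere in between.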

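import matplotlib.pyplot as plt
import seaborn as sns

# Correlation heatmap of rating and the candidate explanatory variables
# (df is the preprocessed DataFrame from the previous post)
cols = ['rating', 'ceo_approval', 'employees', 'revenue',
        'Management', 'Compensation/Benefits',
        'Job Security/Advancement', 'Culture', 'Work/Life Balance']
plt.figure()
sns.heatmap(df[cols].corr(), xticklabels=True, yticklabels=True,
            vmin=-1.0, vmax=1.0, cmap='coolwarm')
plt.show()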

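employees (number of employees) and revenue (annual revenue) show low correlation in the heatmap.

Additionally check how employees and revenue are distributed across ratings
- For employees, companies with an employees value of 1 have the highest rating
- For revenue, above a certain level the rating rises as revenue grows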
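
The same comparison can be made numerically. The sketch below uses a small made-up frame in place of `df` and reports the mean rating together with the group size, since a group with very few rows can look extreme by chance. The values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for df; the real frame comes from the preprocessing post
toy = pd.DataFrame({
    "employees": [1, 1, 2, 2, 2, 3],
    "rating": [4.5, 4.1, 3.6, 3.8, 3.7, 3.9],
})

# Mean rating and number of rows per employees group, highest mean first
summary = (
    toy.groupby("employees")["rating"]
       .agg(["mean", "count"])
       .sort_values("mean", ascending=False)
)
print(summary)
```

Replacing `"employees"` with `"revenue"` gives the equivalent table for revenue.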

# Review Ratings by Scale of Employees
# catplot is figure-level and creates its own figure, so plt.figure() is not needed
g = sns.catplot(x="employees", y="rating", data=df, kind="bar", palette="muted")
g.set(title='Review Ratings by Scale of Employees')
g.despine(left=True)
g.set_ylabels("Average rating")
plt.show()


# Review Ratings by Company Revenue
# catplot is figure-level and creates its own figure, so plt.figure() is not needed
g = sns.catplot(x="revenue", y="rating", data=df, kind="bar", palette="muted")
g.set(title='Review Ratings by Company Revenue')
g.despine(left=True)
g.set_ylabels("Average rating")
plt.show()

Check the correlations among the rating sub-scores (rating by category)
- The Compensation/Benefits factor has the lowest correlation

# Heatmap of the five review factors
factor_cols = ['rating', 'Management', 'Compensation/Benefits',
               'Job Security/Advancement', 'Culture', 'Work/Life Balance']
plt.figure()
sns.heatmap(df[factor_cols].corr(), xticklabels=True, yticklabels=True,
            vmin=0.5, vmax=1.0, cmap='Blues')
plt.show()
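
To read off which factor is least correlated with the rating without relying on colour shades, the `rating` column of the correlation matrix can be sorted. This is a minimal sketch on synthetic data: the noise levels are arbitrary assumptions chosen only so that one column is clearly weaker, and only two of the five factors are simulated.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df (assumed values, not the Kaggle data)
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "rating": base,
    "Culture": base + rng.normal(scale=0.3, size=200),                # little noise
    "Compensation/Benefits": base + rng.normal(scale=1.0, size=200),  # more noise
})

# Correlation of each factor with rating, weakest first
corr_with_rating = toy.corr()["rating"].drop("rating").sort_values()
weakest = corr_with_rating.idxmin()
print(corr_with_rating)
print("Weakest factor:", weakest)
```

Passing `annot=True` to `sns.heatmap` is another option if the numbers should appear on the plot itself.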

Check the correlations among the happiness sub-scores (happiness by category)
- The pay factor has the lowest correlation

# Heatmap of the happiness sub-factors
# (full_dict and df_happiness are defined in the preprocessing step)
key_list = ['rating', *full_dict.keys()]
plt.figure()
sns.heatmap(df_happiness[key_list].corr(), xticklabels=True, yticklabels=True,
            vmin=0.5, vmax=1.0, cmap='Greens')
plt.show()

02. Bar Chart

Break down ratings by industry, keeping only industries with at least 20 entries

# Number of rated rows and mean rating per industry
count_by_industry = df.groupby('industry')['rating'].count().sort_values(ascending=False)
rating_by_industry = df.groupby('industry')['rating'].mean().sort_values(ascending=False)

# Keep only industries with at least 20 entries (drop those with fewer than 20)
small_industries = count_by_industry[count_by_industry < 20].index
rating_by_industry = rating_by_industry.drop(labels=small_industries)
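
An alternative way to express the same filter is to compute the mean and the count in one `agg` call and select on the count. This is a sketch on a made-up frame: the industry labels `A`, `B`, `C` and the threshold `min_count = 2` are placeholders (the post uses 20).

```python
import pandas as pd

# Hypothetical stand-in for df
toy = pd.DataFrame({
    "industry": ["A", "A", "A", "B", "C", "C"],
    "rating": [4.0, 3.0, 5.0, 4.8, 3.0, 3.4],
})
min_count = 2  # placeholder threshold; the post uses 20

# Mean rating and number of non-null ratings per industry
stats = toy.groupby("industry")["rating"].agg(["mean", "count"])

# Keep industries that meet the threshold, highest mean first
kept = stats.loc[stats["count"] >= min_count, "mean"].sort_values(ascending=False)
print(kept)
```

Here industry `B` has the highest mean but only one row, so it is excluded, which is the effect the threshold is meant to have.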

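Visualize the 10 highest-rated industries
- Education, public enterprises & public institutions, and computers & electronics have the highest ratings, in that order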

# Select the 10 highest-rated industries
top_10_industries = rating_by_industry.head(10)

# Bar chart
plt.figure(figsize=(12, 8))
top_10_industries.plot(kind='bar', color='skyblue')
plt.title('Review Ratings by Industries (Top 10)')
plt.xlabel('Industry')
plt.ylabel('Average Rating')
plt.xticks(rotation=90)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

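Visualize the 10 lowest-rated industries
- Logistics & transportation, telecommunications, and food service have the lowest ratings, in that order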

# Select the 10 lowest-rated industries
bottom_10_industries = rating_by_industry.tail(10)

# Bar chart
plt.figure(figsize=(12, 8))
bottom_10_industries.plot(kind='bar', color='salmon')
plt.title('Review Ratings by Industries (Bottom 10)')
plt.xlabel('Industry')
plt.ylabel('Average Rating')
plt.xticks(rotation=90)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

03. Scatter

Scatter plot of rating against ceo_approval (CEO approval, %)
- The two are positively correlated

# Review Ratings by CEO approval
plt.figure()
plt.scatter(df.ceo_approval, df.rating, c=df.rating, cmap='viridis')
plt.title("Review Ratings by CEO approval")
plt.xlabel('CEO approval')
plt.ylabel('Rating')
plt.show()
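
To put a number on the trend seen in the scatter plot, the Pearson coefficient and a least-squares line can be computed. This is a minimal sketch on made-up values. With the real `df`, rows with missing `ceo_approval` or `rating` would need to be dropped first, because `np.polyfit` does not handle NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df (values are illustrative only)
toy = pd.DataFrame({
    "ceo_approval": [55, 60, 70, 80, 90, 95],
    "rating": [3.0, 3.2, 3.5, 3.9, 4.2, 4.4],
})

# Pearson correlation between the two columns
r = toy["ceo_approval"].corr(toy["rating"])

# Degree-1 least-squares fit; coefficients are returned highest degree first
slope, intercept = np.polyfit(toy["ceo_approval"], toy["rating"], deg=1)
print(f"Pearson r = {r:.3f}, slope = {slope:.4f}, intercept = {intercept:.3f}")
```

A positive `r` and a positive `slope` both correspond to the upward trend in the plot.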

📌 Reference.

https://www.kaggle.com/code/yaelman/company-review-rating-factors


🗂️ Dataset.
https://www.kaggle.com/datasets/vaghefi/company-reviews
