
[Kaggle] Amazon Review Analysis #02 (EDA, Sentiment Analysis)

by jijizy 2024. 6. 23.
🗂️ Dataset.
https://www.kaggle.com/datasets/tarkkaanko/amazon

 

 

 

1. Visualization

 

  • Visualize the review ratings
  • Use the constraints color list to tell the pie chart slices apart
    • The 5.0 rating has the largest share, at 79.8%
import matplotlib.pyplot as plt

# Check the review ratings
# Colors for the pie chart slices
constraints = ['#4682B4', '#FF6347', '#32CD32', '#FFD700', '#8A2BE2']

def categorical_variable_summary(df, column_name):
    """Plot the value counts of a column as a bar chart and a pie chart."""
    counts = df[column_name].value_counts()
    plt.figure(figsize=(10, 5))

    # Countplot
    plt.subplot(1, 2, 1)
    counts.plot(kind='bar', color='skyblue')
    plt.title('Countplot')

    # Percentages
    plt.subplot(1, 2, 2)
    counts.plot(kind='pie', autopct='%1.1f%%', startangle=90, colors=constraints)
    plt.title('Percentages')

    plt.tight_layout()
    plt.show()


# Visualize the review ratings (df is the reviews DataFrame, loaded beforehand)
categorical_variable_summary(df, 'overall')
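
The share quoted above for the 5.0 rating can also be read off as a number instead of from the pie chart labels. A minimal sketch, assuming pandas is available; toy_df and its ratings are made-up illustration data, not the Kaggle dataset:

```python
import pandas as pd

# Hypothetical toy data standing in for the 'overall' rating column
toy_df = pd.DataFrame({"overall": [5.0, 5.0, 5.0, 4.0, 1.0]})

# normalize=True returns fractions instead of counts; scale to percentages
rating_share = toy_df["overall"].value_counts(normalize=True).mul(100).round(1)
print(rating_share)
```

Running the same two lines on df["overall"] would give the percentages that the pie chart prints on each slice.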

 

 

 

 

 

 

 

2. Sentiment Analysis

 

  • Check the review text
    • Replace non-alphabetic characters with spaces and convert to lowercase
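import re

# Extract the review text, replace non-alphabetic characters with spaces, and convert to lowercase
# Note: str(x) turns a missing value (NaN) into the literal text 'nan'
rt = lambda x: re.sub("[^a-zA-Z]", ' ', str(x))
df["reviewText"] = df["reviewText"].map(rt)
df["reviewText"] = df["reviewText"].str.lower()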
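
To see what this cleaning step does to a single review, the same rule can be tried on one string. A small sketch using only the standard library; clean_review and the sample text are hypothetical names and data chosen for illustration:

```python
import re

def clean_review(text):
    """Replace every non-letter with a space, then lowercase (same rule as above)."""
    return re.sub("[^a-zA-Z]", " ", str(text)).lower()

sample = "Works great!! 5/5, would buy again."  # hypothetical review text
print(clean_review(sample))

# A missing value is converted by str() first, so it becomes the text 'nan'
print(clean_review(float("nan")))
```

Digits and punctuation are replaced one-for-one, so the cleaned string keeps its original length and contains runs of spaces. Missing reviews turning into the word "nan" may be worth handling separately, for example by dropping them before the analysis.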

 

  • Sentiment analysis
    • Positive is the most common label, at 81.3%
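import pandas as pd
from textblob import TextBlob
# Assumption: SentimentIntensityAnalyzer comes from the vaderSentiment package
# (nltk.sentiment.vader provides a class with the same name)
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Sentiment analysis
# polarity: how positive or negative the text is
# subjectivity: how subjective the text is
df[['polarity', 'subjectivity']] = df['reviewText'].apply(lambda Text: pd.Series(TextBlob(Text).sentiment))

analyzer = SentimentIntensityAnalyzer()

# Label each review by comparing the negative and positive proportions
for index, row in df['reviewText'].items():
    score = analyzer.polarity_scores(row)

    neg = score['neg']
    pos = score['pos']
    if neg > pos:
        df.loc[index, 'sentiment'] = "Negative"
    elif pos > neg:
        df.loc[index, 'sentiment'] = "Positive"
    else:
        # Ties are labeled in lowercase; the same spelling is used when filtering later
        df.loc[index, 'sentiment'] = "neutral"


# Visualize the sentiment analysis results
categorical_variable_summary(df,'sentiment')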
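
The labeling rule in the loop compares only the neg and pos proportions returned by the analyzer. The sketch below isolates that rule as a plain function so it can be checked without running the analyzer; label_sentiment and the score dictionaries are hypothetical illustration values, not real output for any review:

```python
def label_sentiment(score):
    """Map a dict with 'neg' and 'pos' proportions to a label (same rule as the loop above)."""
    if score["neg"] > score["pos"]:
        return "Negative"
    if score["pos"] > score["neg"]:
        return "Positive"
    return "neutral"  # ties, including text with no sentiment-bearing words

# Hypothetical score dictionaries shaped like polarity_scores() output
examples = [
    {"neg": 0.0, "neu": 0.6, "pos": 0.4, "compound": 0.7},
    {"neg": 0.5, "neu": 0.5, "pos": 0.0, "compound": -0.6},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
]
labels = [label_sentiment(s) for s in examples]
print(labels)
```

With a DataFrame, a function like this could be applied in one step, for example df['reviewText'].apply(lambda t: label_sentiment(analyzer.polarity_scores(t))), which avoids assigning through df.loc row by row.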

 

 

 

 

 

 

3. Word Cloud

 

  • Generate a word cloud for each label from the sentiment analysis results
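import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Word cloud for each sentiment label
def plot_wordcloud(sentiment, df):
    # Join all reviews with the given sentiment label into one string
    reviews = df[df['sentiment'] == sentiment]['reviewText'].str.cat(sep=' ')

    # stopwords=None falls back to the wordcloud package's built-in stopword list
    wordcloud = WordCloud(width = 800, height = 800,
                background_color ='white',
                stopwords = None,
                min_font_size = 10).generate(reviews)

    plt.figure(figsize = (8, 8), facecolor = None)
    plt.imshow(wordcloud)
    plt.axis("off")
    # Set the title before tight_layout so it is included in the layout
    plt.title(f'{sentiment.capitalize()} Reviews WordCloud')
    plt.tight_layout(pad = 0)
    plt.show()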

 

# Visualize the positive word cloud
plot_wordcloud('Positive', df)

Positive reviews

# Visualize the negative word cloud
plot_wordcloud('Negative', df)

Negative reviews

# Visualize the neutral word cloud
plot_wordcloud('neutral', df)

Neutral reviews
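
A word cloud sizes each word by how often it occurs, so a plain frequency count gives the same information in table form. A minimal sketch using only the standard library; top_words, the sample reviews, and the one-word stopword set are hypothetical stand-ins (the wordcloud package applies its own, larger stopword list and also counts two-word collocations by default):

```python
from collections import Counter

def top_words(reviews, stopwords=frozenset(), n=3):
    """Return the n most frequent words across the given review strings."""
    words = " ".join(reviews).split()
    return Counter(w for w in words if w not in stopwords).most_common(n)

# Hypothetical cleaned, lowercased reviews for one sentiment label
positive_reviews = ["great card works great", "fast card great price"]
print(top_words(positive_reviews, stopwords={"works"}))
```

Applied to df[df['sentiment'] == 'Positive']['reviewText'], a count like this would make it easier to compare the dominant words across the three labels than reading them off the images.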

 

 

 

 

 


 

 

📌 Reference.

 

https://www.kaggle.com/code/tarkkaanko/amazon-review-sentiment-analysis

 

👀 Amazon - Review Sentiment Analysis 🐳
