🗂️ Dataset.
https://www.kaggle.com/datasets/tarkkaanko/amazon
0. Introduction
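- This post applies text-mining techniques to customer reviews in order to identify which service factors leave customers satisfied and which leave them dissatisfied
1. Loading the data
- Load the 'amazon reviews' dataset provided on Kaggle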
# library setting
!pip install chart_studio
!pip install TextBlob
!pip install plotly
!pip install WordCloud
!pip install cufflinks
# SentimentIntensityAnalyzer is a class (imported below from vaderSentiment), not a pip package, so it needs no install line
!pip install vaderSentiment
!pip install pyLDAvis gensim
!pip install sumy
import pandas as pd
import numpy as np
import nltk
import re
import seaborn as sns
import matplotlib.pyplot as plt
import cufflinks as cf
import plotly.graph_objs as go
import plotly.express as px
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
import gensim
import networkx as nx
%matplotlib inline
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode, iplot
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from nltk.corpus import stopwords
# nltk.sentiment.vader exposes a class with the same name; a later import would shadow it, so only the vaderSentiment one is imported
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
from wordcloud import WordCloud
from gensim import corpora
from gensim.corpora import Dictionary
from gensim.utils import simple_preprocess
from gensim.models.ldamodel import LdaModel
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer
from itertools import combinations
from collections import Counter
init_notebook_mode(connected = True)
cf.go_offline();
- Keep the original data in df_ and work on a copy
df_ = pd.read_csv("/content/amazon_reviews.csv")
df = df_.copy()
df = df.sort_values("wilson_lower_bound", ascending=False)
df.drop('Unnamed: 0', inplace = True, axis = 1)
df.head()
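The load step above can be tried without the Kaggle file. The following is a minimal sketch on an in-memory CSV; the rows and the names `csv_text`, `raw`, and `work` are made up for illustration and are not taken from the dataset. It shows where the 'Unnamed: 0' column comes from (an index that was saved without a header) and what the sort and drop leave behind.

```python
import io

import pandas as pd

# Hypothetical stand-in for amazon_reviews.csv (made-up rows, not real data).
# The empty first header cell mimics a CSV that was saved with its index.
csv_text = """,reviewText,overall,wilson_lower_bound
0,Works great,5.0,0.10
1,Stopped working,1.0,0.85
2,Okay for the price,3.0,0.40
"""

raw = pd.read_csv(io.StringIO(csv_text))  # the unnamed column is read as 'Unnamed: 0'
work = raw.copy()                         # keep the original untouched

# Same two steps as in the post: sort by the Wilson score, drop the old index.
work = work.sort_values("wilson_lower_bound", ascending=False)
work = work.drop(columns="Unnamed: 0", errors="ignore")  # no error if the column is absent

print(work)
```

Passing `errors="ignore"` is a design choice for this sketch: the cell can then be re-run after the column has already been removed.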
2. Preprocessing
- Check missing values and unique values
- Select numeric columns only (for EDA)
# Missing-value check
def missing_values_analysis(df):
    # Columns that contain at least one missing value
    na_columns_ = [col for col in df.columns if df[col].isnull().sum() > 0]
    n_miss = df[na_columns_].isnull().sum().sort_values(ascending=True)
    # Share of missing rows per column, in percent
    ratio_ = (df[na_columns_].isnull().sum() / df.shape[0] * 100).sort_values(ascending=True)
    missing_df = pd.concat([n_miss, np.round(ratio_, 2)], axis=1, keys=['Total Missing Values', 'Ratio'])
    missing_df = pd.DataFrame(missing_df)
    return missing_df
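To see what the missing-value summary looks like, here is a small sketch on a toy DataFrame. The helper `missing_summary` and the toy rows are hypothetical; the helper is a compact variant of the function above and is assumed to produce the same two columns.

```python
import pandas as pd


def missing_summary(frame):
    """Count and percentage of missing values, for columns that have any."""
    counts = frame.isnull().sum()
    counts = counts[counts > 0].sort_values()      # keep only columns with gaps
    ratio = (counts / len(frame) * 100).round(2)   # percent of rows missing
    return pd.concat([counts, ratio], axis=1, keys=["Total Missing Values", "Ratio"])


# Made-up rows: one missing reviewer name, two missing review texts.
toy = pd.DataFrame({
    "reviewerName": ["A", None, "C", "D"],
    "reviewText": ["good", "bad", None, None],
    "overall": [5, 1, 3, 4],
})

summary = missing_summary(toy)
print(summary)
```

Columns without missing values (here `overall`) do not appear in the result at all, so an empty table means the data has no gaps.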
# DataFrame overview
def check_dataframe(df, head=5, tail=5):
    print(" SHAPE ".center(82,'~'))
    print('Rows: {}'.format(df.shape[0]))
    print('Columns: {}'.format(df.shape[1]))
    print(" TYPES ".center(82,'~'))
    print(df.dtypes)
    print("".center(82,'~'))
    print(missing_values_analysis(df))
    print(' DUPLICATED VALUES '.center(83,'~'))
    print(df.duplicated().sum())
    print(" QUANTILES ".center(82,'~'))
    # Select numeric columns only
    numeric_df = df.select_dtypes(include=[np.number])
    print(numeric_df.quantile([0, 0.05, 0.50, 0.95, 0.99, 1]).T)
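The quantile step filters to numeric columns first because quantiles are only meaningful for numbers, and the review text would otherwise get in the way. A minimal sketch of that step on made-up values (the column contents and the name `toy` are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Made-up rows mixing one text column with two numeric columns.
toy = pd.DataFrame({
    "reviewText": ["a", "b", "c", "d", "e"],
    "overall": [1, 2, 3, 4, 5],
    "helpful_yes": [0, 0, 1, 2, 10],
})

# Keep only numeric dtypes, then compute quantiles per column.
numeric = toy.select_dtypes(include=[np.number])
q = numeric.quantile([0, 0.5, 1]).T  # transpose: one row per column, one column per quantile

print(q)
```

After the transpose each row is a variable, which makes it easy to spot a column whose maximum sits far above its median, such as `helpful_yes` here.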
# Unique-value check
def check_class(dataframe):
    nunique_df = pd.DataFrame({'Variable': dataframe.columns,
                               'Classes': [dataframe[i].nunique()
                                           for i in dataframe.columns]})
    nunique_df = nunique_df.sort_values('Classes', ascending=False)
    nunique_df = nunique_df.reset_index(drop = True)
    return nunique_df

check_class(df)
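As a rough illustration of what the unique-value table contains, the sketch below runs the same idea on a toy DataFrame. The helper name `count_unique` and the toy columns are hypothetical and exist only for this example.

```python
import pandas as pd


def count_unique(frame):
    """Number of distinct non-null values per column, largest first."""
    out = pd.DataFrame({
        "Variable": frame.columns,
        "Classes": [frame[col].nunique() for col in frame.columns],
    })
    return out.sort_values("Classes", ascending=False).reset_index(drop=True)


# Made-up rows: an ID-like column, a rating column, and a constant column.
toy = pd.DataFrame({
    "overall": [5, 5, 1, 3],
    "reviewerName": ["A", "B", "C", "D"],
    "flag": [1, 1, 1, 1],
})

classes = count_unique(toy)
print(classes)
```

A column whose count equals the number of rows behaves like an identifier, while a count of 1 marks a constant column that carries no information for the analysis.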
📌 Reference.
https://www.kaggle.com/code/tarkkaanko/amazon-review-sentiment-analysis
👀 Amazon - Review Sentiment Analysis 🐳