Session 10 — Sentiment Analysis: TF–IDF vs Word2Vec

Goal: build an end-to-end sentiment analysis system and compare a sparse representation (TF–IDF) against dense representations (Word2Vec) using mean pooling & TF–IDF weighted pooling.

Learning Outcomes: (1) Understand the difference between sparse and dense text features; (2) Train a small Word2Vec model & derive document vectors; (3) Train classifiers (LogReg/KNN) and evaluate them; (4) Perform error analysis & recommend improvements.

1) Core Concepts

  • TF–IDF: sparse, interpretable features; strong for short n‑grams.
  • Word2Vec: distributional learning (CBOW/Skip‑gram) that produces dense word vectors capturing semantic similarity.
  • Document pooling: a simple mean or a TF–IDF weighted mean that assembles a document vector from word vectors (formalized below).
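
Concretely, matching the doc_vector_tfidf implementation in part C below, the TF–IDF weighted document vector is

  v_d = \frac{\sum_{w \in d} \mathrm{idf}(w)\, v_w}{\sum_{w \in d} \mathrm{idf}(w)}

where v_w is the Word2Vec vector of word w and idf(w) its inverse document frequency; plain mean pooling is the special case where every weight equals 1.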

2) Google Colab Practice — TF–IDF & Word2Vec Pipeline

Use the labels from Session 7/6, or create weak labels and then correct them quickly by hand. The dataset should have at least 30–50 examples.

A. Setup & Data

!pip -q install pandas numpy scikit-learn gensim matplotlib

import numpy as np, pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, roc_auc_score, average_precision_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt

# Load labeled data
try:
    df = pd.read_csv('logreg_dataset_sessi7.csv')
    assert {'text','y'}.issubset(df.columns)
except Exception:
    # fallback: build from Session 3 variants + weak labels
    base = pd.read_csv('corpus_sessi3_variants.csv')['v2_stop_stemID'].dropna().astype(str).tolist()
    POS = {"bagus","mantap","menyenangkan","cepat","baik","excellent","impressive","friendly","tajam","bersih"}
    NEG = {"buruk","lambat","telat","downtime","lemah","weak","late","dented","smelled","failed","delay"}
    def weak_label(t):
        w=set(t.split()); p=len(w & POS); n=len(w & NEG)
        if p>n: return 1
        if n>p: return 0
        return None
    df = pd.DataFrame({'text':base, 'y':[weak_label(t) for t in base]}).dropna()

X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['y'].astype(int), test_size=0.25, stratify=df['y'].astype(int), random_state=42)
print('Dataset:', df.shape)
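
Because class_weight='balanced' is used for every classifier below, it is worth a quick look at the label distribution first; a minimal check (weak labels in particular tend to be skewed):

# Inspect the class balance of the labels
print(df['y'].value_counts(normalize=True).round(3))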

B. Baseline TF–IDF → LogReg

vec = TfidfVectorizer(ngram_range=(1,2), min_df=2, max_df=0.95, sublinear_tf=True, norm='l2')
Xtr = vec.fit_transform(X_train)
Xte = vec.transform(X_test)

lr = LogisticRegression(max_iter=200, class_weight='balanced')
lr.fit(Xtr, y_train)
print('TF–IDF + LogReg Test Report')
print(classification_report(y_test, lr.predict(Xte), digits=3))
print('AP=', round(average_precision_score(y_test, lr.predict_proba(Xte)[:,1]),3),
      'ROC-AUC=', round(roc_auc_score(y_test, lr.predict_proba(Xte)[:,1]),3))
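
The case-study table in part 3 calls TF–IDF + LogReg "interpretable (top coefficients)"; a minimal sketch of what that means, using vec and lr from above:

# Show the n-grams with the largest positive/negative LogReg coefficients
feats = vec.get_feature_names_out()
coefs = lr.coef_[0]
order = np.argsort(coefs)
print('Top positive n-grams:', [(feats[i], round(coefs[i], 3)) for i in order[-10:][::-1]])
print('Top negative n-grams:', [(feats[i], round(coefs[i], 3)) for i in order[:10]])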

C. Training Word2Vec & Deriving Document Vectors

from gensim.models import Word2Vec

# Simple whitespace tokenization (the data was already preprocessed in Session 3)
train_tokens = [t.split() for t in X_train]

w2v = Word2Vec(
    sentences=train_tokens,
    vector_size=200,
    window=5,
    min_count=1,
    sg=1,           # skip-gram (usually better on small datasets)
    negative=10,
    workers=2,
    epochs=20,
    seed=42
)
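
Before pooling, a quick sanity check that the embeddings learned something; most_similar is standard gensim API, but the probe word below is an assumption and must occur in your corpus:

# Nearest neighbors of a frequent word should look semantically related
probe = 'bagus'  # assumed to be in the training vocabulary; substitute any frequent word
if probe in w2v.wv:
    print(w2v.wv.most_similar(probe, topn=5))
else:
    print(probe, 'not in vocabulary; try another frequent word')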

# Pooling setup: build an idf map from a unigram TF–IDF fit,
# used for tfidf-weighted pooling below
# (numpy and TfidfVectorizer are already imported in part A)
vec_idf = TfidfVectorizer(ngram_range=(1,1), min_df=1)
vec_idf.fit(X_train)
idf_map = dict(zip(vec_idf.get_feature_names_out(), vec_idf.idf_))

def doc_vector_mean(tokens):
    vs=[w2v.wv[w] for w in tokens if w in w2v.wv]
    return np.mean(vs, axis=0) if len(vs) else np.zeros(w2v.vector_size)

def doc_vector_tfidf(tokens):
    ws=[]; ws_weight=[]
    for w in tokens:
        if w in w2v.wv and w in idf_map:
            ws.append(w2v.wv[w])
            ws_weight.append(idf_map[w])
    if not ws:
        return np.zeros(w2v.vector_size)
    ws = np.array(ws); ws_weight = np.array(ws_weight)
    return (ws * ws_weight[:,None]).sum(axis=0) / (ws_weight.sum()+1e-12)

Xtr_mean = np.vstack([doc_vector_mean(t) for t in train_tokens])
Xte_mean = np.vstack([doc_vector_mean(t.split()) for t in X_test])
Xtr_tfidf= np.vstack([doc_vector_tfidf(t) for t in train_tokens])
Xte_tfidf= np.vstack([doc_vector_tfidf(t.split()) for t in X_test])
print('Shapes:', Xtr_mean.shape, Xtr_tfidf.shape)
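
Both pooling functions fall back to a zero vector when no token is in the Word2Vec vocabulary, so low vocabulary coverage silently degrades the dense models; a small diagnostic:

# Fraction of test-set tokens covered by the Word2Vec vocabulary
test_tokens = [t.split() for t in X_test]
n_total = sum(len(t) for t in test_tokens)
n_known = sum(sum(w in w2v.wv for w in t) for t in test_tokens)
print('Test-token coverage:', round(n_known / max(n_total, 1), 3))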

D. Classification on Dense Vectors (Word2Vec)

# LogReg on the dense document vectors
lr_mean  = LogisticRegression(max_iter=500, class_weight='balanced').fit(Xtr_mean, y_train)
lr_wtd   = LogisticRegression(max_iter=500, class_weight='balanced').fit(Xtr_tfidf, y_train)

print('\nWord2Vec mean-pool Test Report')
print(classification_report(y_test, lr_mean.predict(Xte_mean), digits=3))
print('AP=', round(average_precision_score(y_test, lr_mean.predict_proba(Xte_mean)[:,1]),3),
      'ROC-AUC=', round(roc_auc_score(y_test, lr_mean.predict_proba(Xte_mean)[:,1]),3))

print('\nWord2Vec tfidf-weighted Test Report')
print(classification_report(y_test, lr_wtd.predict(Xte_tfidf), digits=3))
print('AP=', round(average_precision_score(y_test, lr_wtd.predict_proba(Xte_tfidf)[:,1]),3),
      'ROC-AUC=', round(roc_auc_score(y_test, lr_wtd.predict_proba(Xte_tfidf)[:,1]),3))

E. Results Comparison & Quick Visualization

rows = [
  ['TF–IDF + LogReg', average_precision_score(y_test, lr.predict_proba(Xte)[:,1]), roc_auc_score(y_test, lr.predict_proba(Xte)[:,1])],
  ['W2V mean + LogReg', average_precision_score(y_test, lr_mean.predict_proba(Xte_mean)[:,1]), roc_auc_score(y_test, lr_mean.predict_proba(Xte_mean)[:,1])],
  ['W2V tfidf + LogReg', average_precision_score(y_test, lr_wtd.predict_proba(Xte_tfidf)[:,1]), roc_auc_score(y_test, lr_wtd.predict_proba(Xte_tfidf)[:,1])]
]
res = pd.DataFrame(rows, columns=['Model','AP','ROC-AUC'])
print(res)

plt.figure(figsize=(7,4))
plt.bar(res['Model'], res['AP'])
plt.title('Average Precision (AP) Comparison')
plt.xticks(rotation=10); plt.tight_layout(); plt.show()

F. (Optional) KNN on Dense Vectors

knn = KNeighborsClassifier(n_neighbors=5, metric='cosine')
knn.fit(Xtr_tfidf, y_train)
print('\nKNN(cosine) on W2V tfidf-weighted — Test Report')
print(classification_report(y_test, knn.predict(Xte_tfidf), digits=3))
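
StratifiedKFold and cross_val_score are imported in part A but unused above; on 30–50 examples a single split is noisy, so a cross-validated F1 comparison (assuming each class has at least 5 training examples) is a more reliable sketch:

# 5-fold CV comparison of LogReg vs KNN on the tfidf-weighted dense vectors
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, clf in [('LogReg', LogisticRegression(max_iter=500, class_weight='balanced')),
                  ('KNN(cosine)', KNeighborsClassifier(n_neighbors=5, metric='cosine'))]:
    scores = cross_val_score(clf, Xtr_tfidf, y_train, cv=cv, scoring='f1')
    print(name, 'F1:', round(scores.mean(), 3), '+/-', round(scores.std(), 3))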

G. Save Artifacts

import joblib
joblib.dump(vec, 'tfidf_vec_sessi10.joblib')
joblib.dump(lr,  'tfidf_logreg_sessi10.joblib')
w2v.save('word2vec_sessi10.model')  # gensim's native save format
joblib.dump(lr_wtd, 'w2v_tfidf_logreg_sessi10.joblib')
print('Artifacts saved: tfidf_vec_sessi10.joblib, tfidf_logreg_sessi10.joblib, word2vec_sessi10.model, w2v_tfidf_logreg_sessi10.joblib')
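
A minimal reload-and-score sketch for the saved TF–IDF pipeline (file names as above; the sample text is hypothetical and assumed to be preprocessed the same way as in Session 3):

# Reload the saved artifacts and score a new, already-preprocessed text
vec2 = joblib.load('tfidf_vec_sessi10.joblib')
lr2 = joblib.load('tfidf_logreg_sessi10.joblib')
sample = ['pengiriman cepat barang bagus']  # hypothetical preprocessed input
print('P(positive) =', round(lr2.predict_proba(vec2.transform(sample))[0, 1], 3))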

3) Case Studies & Analysis

Case | Approach | Notes
--- | --- | ---
E‑commerce reviews | TF–IDF(1,2) + LogReg | Stable on small data, interpretable (top coefficients)
Social media opinions | W2V tfidf‑weighted + LogReg | More robust to synonymy/spelling variation
Mixed language/typos | Add char n‑grams (Session 4) or fastText | Subwords help (beyond the scope of this session)

4) Mini Tasks (Graded)

  1. Train Word2Vec (200D) on your corpus; compare mean pooling vs TF–IDF weighted pooling against the TF–IDF baseline using **AP** & **ROC‑AUC**.
  2. Analyze the 5 largest errors of your best model: show the text, prediction, probability, and suspected cause (a starter sketch follows below).
  3. (Optional) Try **KNN(cosine)** on the dense vectors and compare its F1 against LogReg.
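
As a starting point for task 2, a minimal sketch that ranks the TF–IDF baseline's test errors by how confidently wrong the model was (variable names as in part B; substitute your best model):

# Rank misclassified test texts by the confidence of the wrong prediction
proba = lr.predict_proba(Xte)[:, 1]
pred = (proba >= 0.5).astype(int)
err = pd.DataFrame({'text': X_test.values, 'y': y_test.values,
                    'pred': pred, 'p_pos': proba.round(3)})
err = err[err['y'] != err['pred']].copy()
err['gap'] = (err['p_pos'] - err['y']).abs()
print(err.sort_values('gap', ascending=False).head(5))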