Session 3 — Stopwords, Stemming, and Lemmatization
Goal: understand the trade-offs among stopword removal, stemming (morphological root), and lemmatization (dictionary form), and measure their impact on the representation (|V|, sparsity) and on early-task performance (similarity & weakly labeled sentiment).
Learning Outcomes: (1) Explain the differences between stopwords, stemming, and lemmatization; (2) Apply Sastrawi (ID) & spaCy (EN); (3) Evaluate the impact on TF–IDF & cosine similarity; (4) Prepare a classification-ready corpus for sessions 4–7.
1) Definitions & Intuition
- Stopword: a very common word ("dan", "yang", "the") that is usually uninformative for topic tasks but can matter for sentiment ("tidak").
- Stemming: strips affixes to obtain the stem ("berlari" → "lari"); fast but can be aggressive.
- Lemmatization: maps a word to its dictionary lemma via morphological analysis (EN: spaCy). More expensive, but tends to be more accurate.
Principle: pick the technique to match the task. For topics, stopword removal + stemming often helps; for sentiment, be careful not to drop negations ("tidak"). The sketch below illustrates all three operations.
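As a quick illustration, a minimal sketch (assuming Sastrawi and spaCy's en_core_web_sm are installed, as in the practice section below; the stopword list here is a toy example):
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
import spacy

stemmer = StemmerFactory().create_stemmer()   # Indonesian stemmer
nlp = spacy.load('en_core_web_sm')            # English pipeline with lemmatizer

toy_stop = {"ini", "dan"}                     # toy stopword list, for illustration only
tokens = "produk ini bagus dan cepat".split()
print([t for t in tokens if t not in toy_stop])      # stopword removal keeps content words
print(stemmer.stem("berlari"))                       # stemming: 'berlari' -> 'lari'
print(" ".join(t.lemma_ for t in nlp("mice ran")))   # lemmatization: 'mouse run'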
2) Hands-on in Google Colab — Stopword/Stemming/Lemma Pipeline
Use the corpus from session 2 (corpus_sessi2_normalized.csv) or fall back to the list below (≥30 documents).
A. Setup
!pip -q install pandas numpy scikit-learn nltk Sastrawi spacy emoji Unidecode
# English language model for lemmatization
!python -m spacy download en_core_web_sm > /dev/null
import re, math
import numpy as np, pandas as pd
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from numpy.linalg import norm
import nltk
nltk.download('punkt', quiet=True)
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
import spacy
nlp_en = spacy.load('en_core_web_sm')
try:
    df = pd.read_csv('corpus_sessi2_normalized.csv')
    corpus = df['text'].dropna().astype(str).tolist()
    print('Loaded corpus_sessi2_normalized.csv:', len(corpus), 'documents')
except Exception:
    print('CSV not found, using the default corpus.')
    corpus = [
        "pengiriman cepat kualitas barang baik sesuai",
        "battery life is impressive however the screen glare is noticeable outdoors",
        "layanan pelanggan responsif tetapi proses retur memakan waktu",
        "the movie had stunning visuals but a weak storyline",
        "harga terjangkau performa mantap untuk kebutuhan harian",
        "network latency spikes during peak hours affected our dashboard",
        "rasa kopi kuat aroma enak kemasan rapi",
        "app update fixed the crash but introduced login delays",
        "fitur kamera malam bagus namun stabilisasi video kurang",
        "documentation is thorough examples help onboarding",
        "pengalaman belanja menyenangkan voucher potongan harga membantu",
        "server downtime impacted api integrations",
        "tekstur kue lembut rasa tidak terlalu manis",
        "delivery was late packaging dented item worked fine",
        "antarmuka aplikasi intuitif navigasi mudah dipahami pemula",
        "customer support promised a callback but never followed up",
        "kualitas kain adem ukuran sesuai chart jahitan rapi",
        "dataset imbalance requires stratified sampling for evaluation",
        "suara speaker jernih bass cukup volume maksimal tidak pecah",
        "checkout process failed at payment gateway intermittently",
        "konektivitas bluetooth stabil jarak efektif sekitar meter",
        "the hotel staff were friendly but the room smelled of smoke",
        "keyboard travel nyaman backlight konsisten keycaps anti slip",
        "performa menurun setelah update cache clear membantu sementara",
        "shipping fee too high for small accessories",
        "resep mudah diikuti waktu memasak sesuai estimasi",
        "price to performance is excellent for students",
        "layar amoled tajam warna hidup tingkat kecerahan tinggi",
        "refund diproses cepat setelah komplain diajukan",
        "weather data ingestion requires timezone normalization"
    ]
print('Total documents:', len(corpus))
B. Stopwords (ID & EN)
# Built-in EN stopwords from scikit-learn + ID stopwords from Sastrawi
from sklearn.feature_extraction import text as sktext
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory
# Keep important negations out of BOTH lists (sklearn's EN list contains "no"/"not"/"never")
SAFE_NEG = {"tidak", "bukan", "no", "not", "never"}
stop_en = {w for w in sktext.ENGLISH_STOP_WORDS if w not in SAFE_NEG}
stop_id = {w for w in StopWordRemoverFactory().get_stop_words() if w not in SAFE_NEG}
print(sorted(stop_id)[:15], '... (ID)')
print(sorted(stop_en)[:15], '... (EN)')
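A quick sanity check that the negation guard works (a minimal sketch; "ini" and "dan" are expected to be on Sastrawi's default list):
sample = "produk ini tidak bagus dan tidak sesuai deskripsi"
kept = [w for w in sample.split() if w not in stop_id and w not in stop_en]
print(kept)  # 'tidak' survives thanks to SAFE_NEG; 'ini' and 'dan' should be filtered out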
C. Stemming (Sastrawi) & Lemmatization (spaCy EN)
factory = StemmerFactory()
stemmer = factory.create_stemmer()

def stem_id(text):
    # Stem each whitespace token with Sastrawi's Indonesian stemmer
    return ' '.join(stemmer.stem(t) for t in text.split())

def lemma_en(text):
    # Replace each token with its dictionary lemma from spaCy
    doc = nlp_en(text)
    return ' '.join(tok.lemma_ for tok in doc)
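A quick smoke test of both helpers (the expected outputs are approximate and model-dependent):
print(stem_id("pengiriman berjalan cepat"))     # roughly: 'kirim jalan cepat'
print(lemma_en("the batteries were draining"))  # roughly: 'the battery be drain'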
# Pipeline policies
# v0: no stopword removal, no stemming, no lemmatization
# v1: EN+ID stopword removal
# v2: v1 + ID stemming (Sastrawi)
# v3: v1 + EN lemmatization (spaCy)
def pipe_variant(texts):
    v0 = list(texts)
    v1, v2, v3 = [], [], []
    for t in texts:
        toks = t.split()
        toks_v1 = [w for w in toks if w not in stop_id and w not in stop_en]
        v1t = ' '.join(toks_v1)
        v1.append(v1t)
        # ID stemming (Sastrawi runs over every token; EN tokens mostly pass through unchanged)
        v2.append(stem_id(v1t))
        # EN lemmatization (run on v1 so EN stopwords are already removed)
        v3.append(lemma_en(v1t))
    return v0, v1, v2, v3
v0, v1, v2, v3 = pipe_variant(corpus)
pd.DataFrame({
    'original': corpus[:6],
    'v0_raw': v0[:6],
    'v1_stop': v1[:6],
    'v2_stop_stemID': v2[:6],
    'v3_stop_lemmaEN': v3[:6]
})
D. TF–IDF & Sparsity
def tfidf_stats(texts, ngram=(1, 1)):
    vec = TfidfVectorizer(ngram_range=ngram, min_df=1)
    X = vec.fit_transform(texts)
    vocab = vec.get_feature_names_out()
    density = X.nnz / (X.shape[0] * X.shape[1])
    return X, vocab, density

for name, variant in [('v0_raw', v0), ('v1_stop', v1), ('v2_stop_stemID', v2), ('v3_stop_lemmaEN', v3)]:
    X, vocab, dens = tfidf_stats(variant)
    print(name, 'shape=', X.shape, 'vocab=', len(vocab), 'sparsity=', 1 - dens)
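To see what the numbers mean, a short sketch comparing the v1 and v2 vocabularies shows which surface forms stemming collapses:
_, vocab_v1, _ = tfidf_stats(v1)
_, vocab_v2, _ = tfidf_stats(v2)
# Forms present before stemming but merged/changed by it
merged = sorted(set(vocab_v1) - set(vocab_v2))
print('|V| v1 =', len(vocab_v1), '-> |V| v2 =', len(vocab_v2))
print('Examples of collapsed forms:', merged[:15])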
E. Impact on Similarity (Cosine)
def cosine(a, b):
    # Reference cosine for two sparse row vectors (the loop below computes it in bulk)
    return float(a.dot(b.T).toarray()[0, 0] / (norm(a.toarray()) * norm(b.toarray()) + 1e-12))
# Example query. It must go through the SAME pipeline as each variant,
# otherwise stems/lemmas in v2/v3 will never match the raw query tokens.
query = "pengiriman cepat kualitas bagus"
q_stop = ' '.join(w for w in query.split() if w not in stop_id and w not in stop_en)
variant_query = {'v0_raw': query, 'v1_stop': q_stop,
                 'v2_stop_stemID': stem_id(q_stop), 'v3_stop_lemmaEN': lemma_en(q_stop)}
for name, variant in [('v0_raw', v0), ('v1_stop', v1), ('v2_stop_stemID', v2), ('v3_stop_lemmaEN', v3)]:
    vec = TfidfVectorizer()
    X = vec.fit_transform(variant)
    q = vec.transform([variant_query[name]])
    sims = (X @ q.T).toarray().ravel() / (norm(X.toarray(), axis=1) * norm(q.toarray()) + 1e-12)
    top = np.argsort(sims)[::-1][:5]
    print('\n==', name, '==')
    for r in top:
        print(f'sim={sims[r]:.3f}', variant[r][:90])
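For the retrieval@5 metric used in the mini assignment, a minimal sketch (the hypothetical helper retrieval_at_5 and the relevant-document indices are your own judgments, not part of any library):
def retrieval_at_5(sims, relevant_idx):
    # Fraction of the top-5 ranked documents that you judged relevant to the query
    top5 = set(np.argsort(sims)[::-1][:5])
    return len(top5 & set(relevant_idx)) / 5.0

# Example: suppose you judged docs 0 and 4 relevant to the last query above
print('retrieval@5 =', retrieval_at_5(sims, {0, 4}))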
F. Weak Labels (Distant Supervision) for Sentiment
# Weak labels from a small ID+EN lexicon
POS = {"bagus","mantap","menyenangkan","cepat","baik","excellent","impressive","friendly","clean","tajam"}
NEG = {"buruk","lambat","telat","downtime","lemah","weak","late","dented","smelled","failed"}

def weak_label(text):
    toks = set(text.split())
    pos = len(toks & POS)
    neg = len(toks & NEG)
    if pos > neg:
        return 1
    if neg > pos:
        return 0
    return None  # ambiguous
labels = [weak_label(t) for t in v1]  # use v1 (stopwords removed)
X, vocab, _ = tfidf_stats(v1)
# Keep only the documents that received a weak label
idx = [i for i, l in enumerate(labels) if l is not None]
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
if len(idx) >= 10:
    y = np.array([labels[i] for i in idx])
    Xsub = X[idx]
    clf = LogisticRegression(max_iter=200)
    scores = cross_val_score(clf, Xsub, y, cv=5, scoring='f1')
    print('F1 (weak labels, v1 stopword):', scores.mean().round(3), '+/-', scores.std().round(3))
else:
    print('Too few weakly labeled documents; expand the corpus or the lexicon.')
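When repeating this on v2 for the assignment, note that the lexicon entries themselves must be stemmed, otherwise entries like "menyenangkan" will never match their stemmed forms in v2; a sketch:
# Stem the lexicon so its entries match v2's stemmed tokens
POS_v2 = {stem_id(w) for w in POS}
NEG_v2 = {stem_id(w) for w in NEG}

def weak_label_v2(text):
    toks = set(text.split())
    pos, neg = len(toks & POS_v2), len(toks & NEG_v2)
    if pos > neg:
        return 1
    if neg > pos:
        return 0
    return None

labels_v2 = [weak_label_v2(t) for t in v2]
print('Weakly labeled docs on v2:', sum(l is not None for l in labels_v2))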
G. Export the Processed Corpus
pd.DataFrame({
    'v0_raw': v0,
    'v1_stop': v1,
    'v2_stop_stemID': v2,
    'v3_stop_lemmaEN': v3
}).to_csv('corpus_sessi3_variants.csv', index=False)
print('Saved: corpus_sessi3_variants.csv')
3) Case Studies & Analysis
| Case | Expectation | Notes |
|---|---|---|
| News topic classification | Stopword removal + stemming reduce \|V\| and stabilize features | Watch out for named entities; do not remove them |
| Review sentiment | Do not remove negations ("tidak", "not"); stemming is safe | Emoji can be kept as features |
| Search similarity | Stemming improves recall for affixed queries | EN lemmatization preserves word-form accuracy |
4) Mini Assignment (Graded)
- Compare the four variants (v0–v3) on these metrics: |V|, sparsity, and retrieval@5 accuracy for 5 queries of your own design.
- Test the weak labels on variants v1 and v2; report F1 (5-fold CV).
- Write a short summary of when to choose stopword removal alone vs +stemming vs +lemmatization.