Session 6 — KNN for Text Classification
Objective: understand and apply K‑Nearest Neighbors (KNN) for text classification in TF–IDF / embedding space, choose k, compare distance metrics (cosine vs euclidean), and evaluate with classification metrics.
Learning Outcomes: (1) Explain the KNN principle & the effect of k; (2) Apply KNN to text with an appropriate metric; (3) Use cross-validation to select hyperparameters; (4) Read a confusion matrix & metrics (precision/recall/F1).
1) Core Concepts
- KNN: the class is decided by the majority label among the nearest neighbors.
- Text vector space: use L2‑normalized TF–IDF so that euclidean distance behaves like cosine distance.
- Choosing k: small → sensitive to noise; large → high bias. Pick k via cross-validation.
- Neighbor weighting: weights='uniform' vs 'distance'.
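The second bullet can be checked numerically: for L2-normalized vectors u and v, ||u − v||² = 2(1 − cos(u, v)), so euclidean distance and cosine distance rank neighbors identically. A minimal sketch with toy vectors (not actual TF–IDF output):

```python
import numpy as np

# Toy "document" vectors, L2-normalized like the TF-IDF rows in the pipeline
u = np.array([0.2, 0.5, 0.8]); u = u / np.linalg.norm(u)
v = np.array([0.6, 0.1, 0.4]); v = v / np.linalg.norm(v)

cos_sim = float(u @ v)                 # cosine similarity
eucl_sq = float(np.sum((u - v) ** 2))  # squared euclidean distance

# For unit vectors: ||u - v||^2 = 2 * (1 - cos_sim), so sorting
# neighbors by euclidean distance equals sorting by cosine distance
print(round(eucl_sq, 6), round(2 * (1 - cos_sim), 6))
```

This is why euclidean on normalized TF–IDF is an acceptable fallback when cosine is not available as a metric.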
2) Google Colab Practice — Text KNN
We use a simple labeled dataset (sentiment/complaints) derived from the Sessions 2–3 corpus and corrected manually. Aim for at least 30–50 examples so that basic evaluation is meaningful.
A. Setup & Labeled Data
!pip -q install pandas numpy scikit-learn matplotlib
import numpy as np, pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
# ----- Load the processed corpus -----
try:
    dfv = pd.read_csv('corpus_sessi3_variants.csv')
    texts = dfv['v2_stop_stemID'].fillna('').astype(str).tolist()
    print('Loaded v2_stop_stemID from corpus_sessi3_variants.csv:', len(texts))
except (FileNotFoundError, KeyError):  # file or column missing → fall back to the Session 2 output
    dfv = pd.read_csv('corpus_sessi2_normalized.csv')
    texts = dfv['text'].fillna('').astype(str).tolist()
    print('Loaded from corpus_sessi2_normalized.csv:', len(texts))
# ----- Create weak labels (positive=1 / negative=0) + manual correction -----
POS = {"bagus","mantap","menyenangkan","cepat","baik","excellent","impressive","friendly","tajam","bersih"}
NEG = {"buruk","lambat","telat","downtime","lemah","weak","late","dented","smelled","failed","delay"}
def weak_label(t):
    w = set(t.split())
    p = len(w & POS); n = len(w & NEG)
    if p > n: return 1
    if n > p: return 0
    return None  # ambiguous: no clear sentiment signal
labels = [weak_label(t) for t in texts]
# Build a DataFrame, drop unlabeled (None) rows, then top up to at least 30-50 examples
df = pd.DataFrame({'text':texts, 'y':labels}).dropna()
if len(df) < 30:
    # Too few labeled rows: add synthetic examples
    extra_pos = ["pengiriman cepat kualitas bagus", "ui intuitif pengalaman menyenangkan", "respon cs baik"]
    extra_neg = ["login delay berkepanjangan", "refund telat proses lambat", "server downtime berulang"]
    df = pd.concat([df, pd.DataFrame({'text':extra_pos, 'y':1}), pd.DataFrame({'text':extra_neg, 'y':0})], ignore_index=True)
print('Dataset labeled (weak+manual):', len(df))
print(df.sample(min(5, len(df))))
B. Data Split & TF–IDF → KNN Pipeline
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['y'].astype(int), test_size=0.25, random_state=42, stratify=df['y']
)
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1,2), min_df=1, max_df=0.95, sublinear_tf=True, norm='l2')),
    ('knn', KNeighborsClassifier())
])
param_grid = {
    'knn__n_neighbors': [1,3,5,7,9],
    'knn__weights': ['uniform','distance'],
    # use cosine if your scikit-learn version supports it; fallback: euclidean on L2-normalized TF-IDF
    'knn__metric': ['cosine','euclidean']
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring='f1', n_jobs=-1)
search.fit(X_train, y_train)
print('Best params:', search.best_params_)
print('Best CV F1 :', round(search.best_score_,3))
C. Evaluation on the Test Set
best = search.best_estimator_
y_pred = best.predict(X_test)
print(classification_report(y_test, y_pred, digits=3))
# Confusion matrix (confusion_matrix and matplotlib were already imported in section A)
from sklearn.metrics import ConfusionMatrixDisplay
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap='Blues'); plt.title('Confusion Matrix — KNN'); plt.tight_layout(); plt.show()
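To connect the matrix to outcome (4), the report's precision/recall/F1 can be recomputed by hand from the four cells. A self-contained check on toy labels (assumed for illustration, not the session's actual predictions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_hat  = np.array([1, 0, 1, 0, 1, 1, 0, 1])

# For binary labels, ravel() yields the cells in (tn, fp, fn, tp) order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
precision = tp / (tp + fp)   # of predicted positives, how many are right
recall = tp / (tp + fn)      # of true positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

assert np.isclose(precision, precision_score(y_true, y_hat))
assert np.isclose(recall, recall_score(y_true, y_hat))
assert np.isclose(f1, f1_score(y_true, y_hat))
print(tp, fp, fn, tn, round(precision, 3), round(recall, 3), round(f1, 3))
```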
D. Analysis: Effect of k, Metric, & N‑grams
# Inspect k and metric variations from the CV results
results = pd.DataFrame(search.cv_results_)
cols = ['param_knn__n_neighbors','param_knn__weights','param_knn__metric','mean_test_score']
print(results[cols].sort_values('mean_test_score', ascending=False).head(10))
# Optional: compare ngram_range
from sklearn.model_selection import cross_val_score
for ngr in [(1,1),(1,2)]:
    pipe_tmp = Pipeline([
        ('tfidf', TfidfVectorizer(ngram_range=ngr, min_df=1, max_df=0.95, sublinear_tf=True, norm='l2')),
        ('knn', KNeighborsClassifier(n_neighbors=search.best_params_['knn__n_neighbors'],
                                     weights=search.best_params_['knn__weights'],
                                     metric=search.best_params_['knn__metric']))
    ])
    score = cross_val_score(pipe_tmp, df['text'], df['y'].astype(int), cv=cv, scoring='f1').mean()
    print(f'ngram={ngr} → CV F1={score:.3f}')
E. (Optional) Speed Up with LSA (SVD)
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
# Pipeline: TF–IDF → SVD(100) → normalization → KNN
# Note: n_components must be smaller than the TF-IDF vocabulary size; lower it for very small corpora
svd_knn = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1,2), min_df=1, max_df=0.95, sublinear_tf=True, norm='l2')),
    ('svd', TruncatedSVD(n_components=100, random_state=42)),
    ('norm', Normalizer(copy=False)),
    ('knn', KNeighborsClassifier(n_neighbors=5, weights='distance', metric='cosine'))
])
f1_svd = cross_val_score(svd_knn, df['text'], df['y'].astype(int), cv=cv, scoring='f1').mean()
print('CV F1 with LSA (100D):', round(f1_svd,3))
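How much LSA actually buys in speed can be measured directly. A self-contained sketch on a synthetic toy corpus (the phrases and the 10-component SVD are assumptions for the toy vocabulary, not the session data) that times fit and predict with and without the SVD step:

```python
import time
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in corpus: 100 positive + 100 negative duplicated phrases
pos = ["fast delivery great quality", "friendly support good response"] * 50
neg = ["long login delay", "repeated server downtime slow refund"] * 50
texts = pos + neg
y = [1] * len(pos) + [0] * len(neg)

def timed(pipe, name):
    # Fit and predict on the whole toy corpus, reporting wall-clock time
    t0 = time.perf_counter(); pipe.fit(texts, y); t_fit = time.perf_counter() - t0
    t0 = time.perf_counter(); pipe.predict(texts); t_pred = time.perf_counter() - t0
    print(f'{name}: fit={t_fit:.3f}s predict={t_pred:.3f}s')

plain = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1,2))),
    ('knn', KNeighborsClassifier(n_neighbors=5, metric='cosine')),
])
lsa = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1,2))),
    ('svd', TruncatedSVD(n_components=10, random_state=42)),  # 10D: must stay below the toy vocab size
    ('norm', Normalizer(copy=False)),
    ('knn', KNeighborsClassifier(n_neighbors=5, metric='cosine')),
])
timed(plain, 'TF-IDF → KNN')
timed(lsa, 'TF-IDF → SVD → KNN')
```

On a corpus this small the SVD overhead can dominate; the dense low-dimensional representation pays off as the vocabulary and document count grow.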
F. Save Artifacts
import joblib
joblib.dump(search.best_estimator_, 'knn_text_clf_sessi6.joblib')
print('Saved: knn_text_clf_sessi6.joblib')
3) Case Studies
| Case | Goal | Approach | Notes |
|---|---|---|---|
| App review moderation | Quick positive/negative labels to prioritize responses | TF–IDF (1,2) + KNN(metric=cosine) | Use weights='distance' on noisy data |
| Helpdesk ticket classification | Route tickets to an initial category | TF–IDF + KNN(small k) | Add char n‑grams for typos |
| Performance complaint detection | Identify performance-related complaints | TF–IDF + SVD + KNN | Reduce dimensionality for speed |
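The helpdesk row's note on typos can be implemented with analyzer='char_wb', which builds character n-grams inside word boundaries so a misspelling still shares substrings with the correct word. A minimal sketch on invented examples:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# char_wb extracts character n-grams within word boundaries, so "lamabt"
# still shares 3-grams (e.g. "lam") with the correctly spelled "lambat"
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 5))),
    ('knn', KNeighborsClassifier(n_neighbors=1, metric='cosine')),
])
X = ["proses refund lambat", "server downtime berulang", "respon cs cepat dan baik"]
y = ["refund", "server", "praise"]
pipe.fit(X, y)
print(pipe.predict(["refund lamabt sekali"]))  # query contains a typo of "lambat"
```

Character n-grams inflate the feature space, so combine them with word n-grams only when typos are actually frequent in the data.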
4) Mini Assignment (Graded)
- Run GridSearchCV over k∈{1,3,5,7,9}, weights∈{uniform,distance}, metric∈{cosine,euclidean} (CV=5). Report the best combination and justify it.
- Compare performance with and without LSA(100D). Record the change in fit/predict time and in F1.
- Analyze 5 misclassified predictions: show the 3 nearest neighbors and explain why KNN got them wrong.
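For the last item, the fitted pipeline's vectorizer and KNN step can be used separately to list the training neighbors behind a prediction. A self-contained sketch on toy data; in the notebook, substitute the `best` estimator from section C and your own train split:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Toy training data (stand-in for the session's X_train / y_train)
X_train = ["pengiriman cepat bagus", "respon cs baik", "login delay lama",
           "server downtime berulang", "refund telat", "ui menyenangkan"]
y_train = np.array([1, 1, 0, 0, 0, 1])

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('knn', KNeighborsClassifier(n_neighbors=3, metric='cosine')),
]).fit(X_train, y_train)

X_test = ["proses refund lambat dan delay"]
y_pred = pipe.predict(X_test)

# Vectorize the test text with the *fitted* vectorizer, then ask the KNN
# step for its 3 nearest training neighbors and their distances
Xt = pipe.named_steps['tfidf'].transform(X_test)
dist, idx = pipe.named_steps['knn'].kneighbors(Xt, n_neighbors=3)
for d, i in zip(dist[0], idx[0]):
    print(f"dist={d:.3f} label={y_train[i]} text={X_train[i]!r}")
```

Seeing which training texts sit closest usually explains the error: shared but uninformative tokens, a mislabeled neighbor, or a vocabulary gap after stemming.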