Session 7 — Regression for Classification: Logistic & Linear (One-vs-Rest)
Goal: build a strong baseline for text classification with Logistic Regression (plus a linear-model preview via SGDClassifier / LinearSVC), apply L1/L2 regularization, tune the decision threshold, and perform probability calibration.
Learning Outcomes: (1) Formulate the logit & logistic loss; (2) Apply regularization and choose C/alpha; (3) Tune the threshold based on metrics/costs; (4) Read feature coefficients as feature importance; (5) Calibrate probabilities (Platt/Isotonic).
1) Core Concepts
- Logistic Regression: \(p(y=1|x)=\sigma(w^Tx+b)\), with \(\sigma(z)=1/(1+e^{-z})\).
- Regularization: L2 (ridge) stabilizes coefficients; L1 (lasso) encourages sparsity (feature selection).
- Decision threshold: defaults to 0.5; it can be shifted to maximize F1 or to minimize the cost of errors.
- Calibration: Platt scaling (logistic) and isotonic regression improve the reliability of predicted probabilities.
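The logit and sigmoid above can be checked numerically. A minimal sketch; the weights, bias, and feature vector are illustrative, not taken from the dataset used below:

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

# illustrative weights, bias, and one feature vector
w = np.array([1.5, -2.0, 0.5])
b = 0.1
x = np.array([1.0, 0.0, 2.0])

z = w @ x + b      # logit: w^T x + b = 2.6
p = sigmoid(z)     # p(y=1|x) ≈ 0.931
# logistic (log) loss when the true label is y = 1: -log p
loss = -np.log(p)
print(round(float(p), 3), round(float(loss), 3))
```

A confident correct prediction gives a probability near 1 and a loss near 0; as p falls toward 0 the loss grows without bound, which is what pushes the weights during training.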
2) Google Colab Practice — A Strong LogReg Baseline
Use the labeled corpus from Session 6 (or create labels manually). The pipeline below includes full tuning and evaluation.
A. Setup & Data
!pip -q install pandas numpy scikit-learn matplotlib
import numpy as np, pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, precision_recall_curve, roc_curve, average_precision_score
import matplotlib.pyplot as plt
# Load labeled data from the previous session (or label manually)
try:
    df = pd.read_csv('knn_dataset_sessi6.csv')  # if you saved it in Session 6
    print('Loaded knn_dataset_sessi6.csv:', df.shape)
except FileNotFoundError:
    # fallback: build from weak labels as in Session 6
    try:
        base = pd.read_csv('corpus_sessi3_variants.csv')['v2_stop_stemID'].dropna().astype(str).tolist()
    except FileNotFoundError:
        base = pd.read_csv('corpus_sessi2_normalized.csv')['text'].dropna().astype(str).tolist()
    POS = {"bagus","mantap","menyenangkan","cepat","baik","excellent","impressive","friendly","tajam","bersih"}
    NEG = {"buruk","lambat","telat","downtime","lemah","weak","late","dented","smelled","failed","delay"}
    def weak_label(t):
        w = set(t.split())
        p = len(w & POS)
        n = len(w & NEG)
        if p > n: return 1
        if n > p: return 0
        return None
    df = pd.DataFrame({'text': base, 'y': [weak_label(t) for t in base]}).dropna()
# Add a few manual examples if the dataset is small
if len(df) < 40:
    extra = [
        ('pengiriman cepat kualitas bagus', 1), ('ui intuitif menyenangkan', 1),
        ('login delay panjang', 0), ('refund telat proses lambat', 0)
    ]
    df = pd.concat([df, pd.DataFrame(extra, columns=['text', 'y'])], ignore_index=True)
print('Dataset:', df.shape)
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['y'].astype(int), test_size=0.25, stratify=df['y'], random_state=42)
B. GridSearch LogReg (L1 vs L2) + TF–IDF
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1,2), min_df=2, max_df=0.95, sublinear_tf=True, norm='l2')),
    ('clf', LogisticRegression(max_iter=200, class_weight='balanced', solver='liblinear'))
])
param_grid = {
    'clf__penalty': ['l1','l2'],
    'clf__C': [0.25, 0.5, 1.0, 2.0, 4.0]
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring='f1', n_jobs=-1)
search.fit(X_train, y_train)
print('Best params:', search.best_params_)
print('Best CV F1 :', round(search.best_score_,3))
C. Evaluation, Feature Coefficients, & Interpretation
best = search.best_estimator_
proba = best.predict_proba(X_test)[:,1]
y_pred = (proba >= 0.5).astype(int)
print(classification_report(y_test, y_pred, digits=3))
# Feature coefficients (most positive- and most negative-leaning features)
vec = best.named_steps['tfidf']
clf = best.named_steps['clf']
feat = vec.get_feature_names_out()
coef = clf.coef_.ravel()
ix_pos = coef.argsort()[::-1][:20]
ix_neg = coef.argsort()[:20]
print('\nTop features (+):')
print([ (feat[i], round(coef[i],3)) for i in ix_pos ])
print('\nTop features (-):')
print([ (feat[i], round(coef[i],3)) for i in ix_neg ])
D. ROC/PR Curves & Decision Threshold
# ROC-AUC & PR-AUC
roc = roc_auc_score(y_test, proba)
ap = average_precision_score(y_test, proba)
print('ROC-AUC=', round(roc,3), ' AP=', round(ap,3))
fpr, tpr, thr = roc_curve(y_test, proba)
prec, rec, thr2 = precision_recall_curve(y_test, proba)
plt.figure(figsize=(5,4)); plt.plot(fpr, tpr); plt.title('ROC Curve'); plt.xlabel('FPR'); plt.ylabel('TPR'); plt.tight_layout(); plt.show()
plt.figure(figsize=(5,4)); plt.plot(rec, prec); plt.title('PR Curve'); plt.xlabel('Recall'); plt.ylabel('Precision'); plt.tight_layout(); plt.show()
# F1-based threshold tuning (thr2 has one fewer entry than prec/rec, so drop the final point)
f1_scores = [(2*p*r/(p+r+1e-12), t) for p, r, t in zip(prec[:-1], rec[:-1], thr2)]
thr_star = max(f1_scores)[1]
print('Best threshold (F1) ≈', round(float(thr_star), 3))
y_opt = (proba >= thr_star).astype(int)
print('Report @threshold*:')
from sklearn.metrics import f1_score
print('F1*=', round(f1_score(y_test, y_opt), 3))
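Beyond F1, the threshold can be chosen to minimize an explicit misclassification cost when false positives and false negatives are not equally harmful. A minimal sketch; the probabilities, labels, and costs are illustrative stand-ins for `proba`, `y_test`, and real business costs:

```python
import numpy as np

# illustrative scores and labels (stand-ins for `proba` and `y_test` above)
proba = np.array([0.95, 0.80, 0.60, 0.55, 0.30, 0.20, 0.10])
y_true = np.array([1,    1,    0,    1,    0,    0,    0])

C_FP, C_FN = 1.0, 2.0  # assumed costs: a missed positive is twice as bad

def expected_cost(thr):
    # total cost of the confusion at a given decision threshold
    pred = (proba >= thr).astype(int)
    fp = int(((pred == 1) & (y_true == 0)).sum())
    fn = int(((pred == 0) & (y_true == 1)).sum())
    return C_FP * fp + C_FN * fn

# evaluate every distinct score as a candidate threshold, keep the cheapest
best_thr = min(np.unique(proba), key=expected_cost)
print('best threshold:', float(best_thr), '| cost:', expected_cost(best_thr))
```

Only the observed scores need to be tried as candidates, because the confusion matrix can only change at those values.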
E. Probability Calibration (Platt & Isotonic)
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
base_lr = Pipeline([  # renamed from `base` to avoid shadowing the fallback text list above
    ('tfidf', TfidfVectorizer(ngram_range=(1,2), min_df=2, max_df=0.95, sublinear_tf=True, norm='l2')),
    ('clf', LogisticRegression(max_iter=200, class_weight='balanced', solver='liblinear',  # liblinear supports both l1 and l2
                               penalty=search.best_params_['clf__penalty'], C=search.best_params_['clf__C']))
])
cal_platt = CalibratedClassifierCV(base_lr, method='sigmoid', cv=5)
cal_iso = CalibratedClassifierCV(base_lr, method='isotonic', cv=5)
cal_platt.fit(X_train, y_train)
cal_iso.fit(X_train, y_train)
pp = cal_platt.predict_proba(X_test)[:,1]
pi = cal_iso.predict_proba(X_test)[:,1]
# Calibration curves
for name, p in [('Uncal', proba), ('Platt', pp), ('Isotonic', pi)]:
    frac_pos, mean_pred = calibration_curve(y_test, p, n_bins=10, strategy='quantile')
    plt.plot(mean_pred, frac_pos, label=name)
plt.plot([0,1],[0,1],'--', lw=1)
plt.legend(); plt.title('Calibration Curves'); plt.xlabel('Mean predicted'); plt.ylabel('Fraction positive'); plt.tight_layout(); plt.show()
F. Alternative: Linear Classifier via SGDClassifier (logistic loss)
sgd = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1,2), min_df=2, max_df=0.95, sublinear_tf=True, norm='l2')),
    ('clf', SGDClassifier(loss='log_loss', penalty='l2', alpha=1e-4, class_weight='balanced', max_iter=1000))
])
from sklearn.model_selection import cross_val_score
score = cross_val_score(sgd, df['text'], df['y'].astype(int), cv=5, scoring='f1').mean()
print('SGD log-loss (approx. logistic) CV F1=', round(score,3))
G. Save Artifacts
import joblib
joblib.dump(search.best_estimator_, 'logreg_text_clf_sessi7.joblib')
print('Saved: logreg_text_clf_sessi7.joblib')
3) Case Studies & Analysis
| Case | Goal | Approach | Notes |
|---|---|---|---|
| Sentiment analysis | Fast, accurate baseline | TF–IDF(1,2) + LogReg (L2) | Calibrate for reliable probabilities |
| Ticket classification | First multi-class pass | One-vs-Rest LogReg | Use L1 for feature selection |
| Content moderation | High vs low priority | Cost-based threshold tuning | Prefer PR-AUC under class imbalance |
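The ticket-classification case relies on One-vs-Rest logistic regression, which fits one binary LogReg per class and predicts the class with the highest score. A minimal sketch on invented ticket texts; the texts, labels, and class names are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# illustrative support tickets for three classes
texts = [
    'invoice charged twice', 'refund for double billing', 'billing amount wrong',
    'cannot login to account', 'password reset login failed', 'login page error',
    'package shipping delayed', 'shipping address wrong', 'late shipping delivery',
]
labels = ['billing'] * 3 + ['login'] * 3 + ['shipping'] * 3

# One-vs-Rest: one binary logistic regression per class, highest score wins
ovr = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', OneVsRestClassifier(LogisticRegression(max_iter=200))),
])
ovr.fit(texts, labels)
print(ovr.predict(['login reset not working']))  # expected: ['login']
```

LogisticRegression also handles multi-class natively, but the explicit OvR wrapper keeps one coefficient vector per class, which makes the per-class feature inspection from section C carry over directly.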
4) Mini Assignment (Graded)
- GridSearch: penalty∈{L1,L2}, C∈{0.25,0.5,1,2,4}; report F1 (CV=5) and select the best model.
- Tune the threshold to maximize F1 on a validation set; compare against the default 0.5.
- Plot calibration curves (uncalibrated vs Platt vs isotonic) and explain which works best and why.