Probabilistic reasoning accounts for uncertainty. With Bayes' theorem, we update a belief (the prior) after seeing evidence into a posterior. Naive Bayes is a simple yet powerful classifier that assumes features are conditionally independent given the class.
Goals: understand Bayes' theorem, build a small Naive Bayes classifier, and analyze the impact of the prior and the likelihood.
Bayes' Theorem
Bayes' theorem states how we update the probability of a hypothesis H after seeing evidence E: P(H|E) = P(E|H)·P(H)/P(E). Prior = initial belief; likelihood = how probable the evidence is if H is true; evidence = the normalizer that makes the total probability sum to 1.
Bayes' theorem: P(H|E) = [ P(E|H) * P(H) ] / P(E). Example (disease test): P(positive|sick) = 0.99, P(positive|healthy) = 0.05, P(sick) = 0.01. Then P(positive) = 0.99*0.01 + 0.05*0.99 = 0.0594, so P(sick|positive) = (0.99*0.01)/0.0594 ≈ 0.1667 (≈16.7%). Intuition: even with a highly accurate test, if the disease is rare (small prior), the probability of actually being sick after a positive result can still be low.
Naive Bayes
Assuming features are conditionally independent given the class, the joint likelihood factors into a product of per-feature terms. For text, this matches the multinomial model, where word frequencies are strong signals.
- Strengths: fast, stable on small data, a strong baseline.
- Limitations: the independence assumption is often violated; the classifier remains effective as long as the violation is not extreme.
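Written out, the factorization above and the Laplace-smoothed estimate it relies on (the same quantities computed in the from-scratch code below) are:

```latex
% Conditional independence given class c factors the joint likelihood:
P(c \mid w_1,\dots,w_n) \;\propto\; P(c)\prod_{i=1}^{n} P(w_i \mid c)

% In log space (numerically stable):
\log P(c \mid w_1,\dots,w_n) = \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) + \mathrm{const}

% Laplace-smoothed per-word estimate, with vocabulary V:
\hat{P}(w \mid c) = \frac{\mathrm{count}(w,c) + 1}{\sum_{w'} \mathrm{count}(w',c) + |V|}
```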
Case Studies
- Spam filtering: keywords such as "gratis" (free) and "diskon" (discount) raise the likelihood of the spam class.
- Simple diagnosis: symptoms become features; Naive Bayes estimates the most likely disease.
# ====== Bayes' Theorem Demo (Numeric & Visual) ======
import numpy as np
import matplotlib.pyplot as plt

# Disease-test scenario parameters
P_sick = 0.01
P_pos_given_sick = 0.99
P_pos_given_healthy = 0.05
P_healthy = 1 - P_sick

P_pos = P_pos_given_sick*P_sick + P_pos_given_healthy*P_healthy
P_sick_given_pos = (P_pos_given_sick*P_sick) / P_pos
print(f"P(positive) = {P_pos:.4f}")
print(f"P(sick | positive) = {P_sick_given_pos:.4f}")

# Visualize the components that make up P(positive)
parts = [P_pos_given_sick*P_sick, P_pos_given_healthy*P_healthy]
labels = ["True Positive mass", "False Positive mass"]
plt.figure(figsize=(5,3))
plt.bar(labels, parts)
plt.title("Decomposition of P(positive)")
plt.ylabel("Probability")
plt.show()
# Prior sensitivity: vary P(sick)
priors = np.linspace(0.001, 0.2, 50)
posteriors = []
for p in priors:
    Ppos = P_pos_given_sick*p + P_pos_given_healthy*(1-p)
    posteriors.append((P_pos_given_sick*p)/Ppos)
plt.figure(figsize=(5,3))
plt.plot(priors, posteriors)
plt.xlabel("Prior P(sick)")
plt.ylabel("Posterior P(sick | positive)")
plt.title("Impact of the Prior on the Posterior")
plt.grid(True, alpha=0.3)
plt.show()
Explore: change P_sick and compare how the posterior shifts.
# ====== Naive Bayes from Scratch (Mini Text Classification) ======
# Small dataset: sentences with labels {spam, ham}
import re, math, collections
train = [
    ("diskon besar besaran beli sekarang", "spam"),
    ("promo gratis ongkir hari ini", "spam"),
    ("rapat dosen jam sepuluh", "ham"),
    ("tugas praktikum dikumpulkan besok", "ham"),
]
# Simple tokenization
TOKEN = re.compile(r"[a-zA-Z0-9]+")
def tokenize(s):
    return [w.lower() for w in TOKEN.findall(s)]
# Estimate prior & likelihood with Laplace smoothing
class NaiveBayes:
    def __init__(self):
        self.class_counts = collections.Counter()
        self.word_counts = {"spam": collections.Counter(), "ham": collections.Counter()}
        self.vocab = set()
        self.total_docs = 0

    def fit(self, data):
        for text, label in data:
            self.total_docs += 1
            self.class_counts[label] += 1
            for w in tokenize(text):
                self.vocab.add(w)
                self.word_counts[label][w] += 1
        self.vocab = sorted(self.vocab)

    def predict_proba(self, text):
        words = tokenize(text)
        V = len(self.vocab)
        scores = {}
        for c in self.class_counts:
            # log prior
            lp = math.log(self.class_counts[c]/self.total_docs)
            # log likelihood (multinomial NB)
            total_wc = sum(self.word_counts[c].values())
            ll = 0.0
            for w in words:
                ll += math.log((self.word_counts[c][w]+1)/(total_wc+V))
            scores[c] = lp + ll
        # normalize via log-sum-exp
        m = max(scores.values())
        exps = {c: math.exp(scores[c]-m) for c in scores}
        Z = sum(exps.values())
        return {c: exps[c]/Z for c in scores}

    def predict(self, text):
        proba = self.predict_proba(text)
        return max(proba, key=proba.get), proba
nb = NaiveBayes()
nb.fit(train)
for s in [
    "gratis ongkir sekarang",
    "rapat praktikum besok",
    "diskon ongkir tugas",
]:
    y, p = nb.predict(s)
    print(f"'{s}' => {y}, P={p}")
Challenge: add stopword filtering or a simple TF-IDF weighting.
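As a starting point for the challenge, stopword filtering only needs a small change to the tokenizer. A minimal sketch; the stopword set here is an illustrative assumption, not a standard Indonesian stopword list:

```python
import re

# Hypothetical stopword set for illustration only; a real list would be larger
STOPWORDS = {"ini", "hari", "jam", "yang", "dan", "di"}

TOKEN = re.compile(r"[a-zA-Z0-9]+")

def tokenize_filtered(s):
    # Tokenize as before, then drop stopwords after lowercasing
    return [w.lower() for w in TOKEN.findall(s) if w.lower() not in STOPWORDS]

print(tokenize_filtered("promo gratis ongkir hari ini"))  # → ['promo', 'gratis', 'ongkir']
```

Plugging this in place of `tokenize` changes both training counts and prediction; compare the resulting posteriors against the unfiltered version.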
# ====== Naive Bayes with scikit-learn (Text) ======
# If the module is missing, run: !pip install scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
train_text = [
    "diskon besar besaran beli sekarang",
    "promo gratis ongkir hari ini",
    "rapat dosen jam sepuluh",
    "tugas praktikum dikumpulkan besok",
]
train_y = ["spam", "spam", "ham", "ham"]
clf = Pipeline([
    ("vec", CountVectorizer()),
    ("nb", MultinomialNB()),
])
clf.fit(train_text, train_y)
for s in ["gratis ongkir sekarang", "rapat praktikum besok"]:
    print(s, "=>", clf.predict([s])[0], clf.predict_proba([s]))
If the module is missing, run !pip install scikit-learn.
# ====== (Optional) Small Bayesian Network with pgmpy ======
# Install once: !pip install pgmpy
try:
    # Newer pgmpy versions renamed BayesianModel to BayesianNetwork
    try:
        from pgmpy.models import BayesianNetwork as BayesianModel
    except ImportError:
        from pgmpy.models import BayesianModel
    from pgmpy.factors.discrete import TabularCPD
    from pgmpy.inference import VariableElimination
    # Structure: Flu -> Demam (fever), Flu -> Batuk (cough)
    model = BayesianModel([('Flu', 'Demam'), ('Flu', 'Batuk')])
    cpd_flu = TabularCPD('Flu', 2, [[0.95], [0.05]])  # 0: no, 1: yes
    cpd_demam = TabularCPD('Demam', 2,
                           [[0.9, 0.2],   # P(Demam=0|Flu)
                            [0.1, 0.8]],  # P(Demam=1|Flu)
                           evidence=['Flu'], evidence_card=[2])
    cpd_batuk = TabularCPD('Batuk', 2,
                           [[0.85, 0.3],
                            [0.15, 0.7]],
                           evidence=['Flu'], evidence_card=[2])
    model.add_cpds(cpd_flu, cpd_demam, cpd_batuk)
    model.check_model()
    infer = VariableElimination(model)
    q = infer.query(variables=['Flu'], evidence={'Demam': 1, 'Batuk': 1})
    print(q)
except Exception as e:
    print("If the module is missing, run: !pip install pgmpy\nError:", e)
Requires installing pgmpy. Demonstrates inference of P(Flu | Demam ∧ Batuk) on a small network.
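Because the network is tiny, the same query can be checked by hand with enumeration: multiply the CPD entries along each value of Flu, then normalize. A stdlib-only sketch using the CPD numbers above:

```python
# Verify P(Flu=1 | Demam=1, Batuk=1) by enumeration, using the CPDs above
P_flu = {0: 0.95, 1: 0.05}
P_demam1_given_flu = {0: 0.1, 1: 0.8}   # P(Demam=1 | Flu)
P_batuk1_given_flu = {0: 0.15, 1: 0.7}  # P(Batuk=1 | Flu)

# Unnormalized joint P(Flu=f, Demam=1, Batuk=1) for each value of Flu
joint = {f: P_flu[f] * P_demam1_given_flu[f] * P_batuk1_given_flu[f] for f in (0, 1)}
Z = sum(joint.values())  # this is P(Demam=1, Batuk=1)
posterior = joint[1] / Z
print(f"P(Flu=1 | Demam=1, Batuk=1) = {posterior:.4f}")  # ≈ 0.6627
```

The result should match the pgmpy query: observing both symptoms lifts P(Flu) from the 0.05 prior to about 0.66.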
# ====== Quiz 7 (self-check) ======
qs = [
    ("Bayes' theorem states...",
     {"a": "P(H|E)=P(H)+P(E)", "b": "P(H|E)=[P(E|H)P(H)]/P(E)",
      "c": "P(E|H)=P(H|E)", "d": "P(H|E)=P(E)"}, "b"),
    ("The Naive Bayes assumption is...",
     {"a": "Features are conditionally independent given the class",
      "b": "All features are identical",
      "c": "The distribution must be Gaussian",
      "d": "There is no prior"}, "a"),
    ("With binary features and equal weights, Naive Bayes approximates...",
     {"a": "k-NN", "b": "Logistic Regression",
      "c": "Decision Tree", "d": "Linear SVM"}, "b"),
]
print('Answer key:')
for i, (_, __, ans) in enumerate(qs, 1):
    print(f'Q{i}: {ans}')
Coding Assignment 4: build a Naive Bayes classifier for a small dataset (text or symptoms are both fine). Report (≤ 1 page): the feature scheme, accuracy on a train-test split, example predictions, and a brief analysis of the impact of the prior and the likelihood.
- Russell & Norvig (2020), Artificial Intelligence: A Modern Approach, chapter on probabilistic reasoning.
- Bishop (2006), Pattern Recognition and Machine Learning, chapter on generative classifiers.