Naive Bayes Questions and Discovered Answers

Setup

import numpy as np

from backend import *

X, y = make_synthetic_data(n=500, w=50, c=2,
                           avg_doc_length=50,
                           class_sep=0.001,
                           random_state=123)
synth_acc = synthetic_model(X, y, random_state=123)
print("Accuracy:", synth_acc)
Accuracy: 0.87
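
backend itself is not shown here. As a point of reference, a minimal stand-in with the same interface might look like the sketch below; it assumes X holds per-document word counts and that synthetic_model is just a train/test split around a multinomial Naive Bayes classifier, which is an assumption rather than the actual implementation.

# Assumed stand-in for backend.synthetic_model (not the real implementation).
# Treats X as an (n_docs, vocab_size) matrix of word counts and y as labels.
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

def synthetic_model_sketch(X, y, random_state=123, test_size=0.2):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y)
    clf = MultinomialNB()          # Laplace smoothing (alpha=1.0) by default
    clf.fit(X_tr, y_tr)
    return clf.score(X_te, y_te)   # accuracy on the held-out split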

Questions

How do sample size, vocabulary size, and document size affect the model?

# Vary sample size (n)
ns = [50, 500, 5000, 50000]

for num in ns:
    X, y = make_synthetic_data(n=num, w=50, c=2,
                                avg_doc_length=50,
                                class_sep=0.001,
                                random_state=123)
    synth_acc = synthetic_model(X, y, random_state=123)
    print("Accuracy:", synth_acc)
Accuracy: 0.9
Accuracy: 0.87
Accuracy: 0.855
Accuracy: 0.8623

Holding the other parameters fixed, increasing the sample size does not improve accuracy here: with class_sep this small the two classes overlap almost completely, so extra documents mostly repeat the same weak evidence. Not all data is “good” data: the data must provide meaningful evidence.
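
One way to quantify that weak evidence is to compare the two class-conditional word distributions directly. A minimal sketch, assuming each row of X is a vector of per-document word counts (the helper and variable names below are new, not from backend):

# Sketch: how far apart are the two class word distributions at the baseline settings?
X_b, y_b = make_synthetic_data(n=500, w=50, c=2,
                               avg_doc_length=50,
                               class_sep=0.001,
                               random_state=123)

def class_word_dist(X, y, label):
    counts = X[y == label].sum(axis=0).astype(float)
    return counts / counts.sum()          # empirical word distribution for one class

p0 = class_word_dist(X_b, y_b, 0)
p1 = class_word_dist(X_b, y_b, 1)
tv = 0.5 * np.abs(p0 - p1).sum()          # total variation distance, in [0, 1]
print("TV distance between class word distributions:", round(tv, 4))

A value near 0 means the classes are nearly indistinguishable at the word level, which caps the achievable accuracy no matter how many documents are added.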

# Vary vocabulary size (w)
ws = [50, 500, 5000, 50000]

for num in ws:
    X, y = make_synthetic_data(n=500, w=num, c=2,
                                avg_doc_length=50,
                                class_sep=0.001,
                                random_state=123)
    synth_acc = synthetic_model(X, y, random_state=123)
    print("Accuracy:", synth_acc)
Accuracy: 0.87
Accuracy: 1.0
Accuracy: 1.0
Accuracy: 1.0

As vocabulary size grows (with document length fixed at 50), the data becomes sparse: many words occur in documents of only one class, so they act as near one-to-one identifiers for that class and accuracy saturates at 1.0.
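
A quick check of that claim is to count how many vocabulary words ever appear in only one of the two classes. A sketch under the same assumption that X holds word counts, regenerating a w=5000 dataset so the cell is self-contained:

# Sketch: with a large vocabulary, how many words occur in exactly one class?
X_w, y_w = make_synthetic_data(n=500, w=5000, c=2,
                               avg_doc_length=50,
                               class_sep=0.001,
                               random_state=123)

seen_in_0 = X_w[y_w == 0].sum(axis=0) > 0   # word appears in some class-0 document
seen_in_1 = X_w[y_w == 1].sum(axis=0) > 0   # word appears in some class-1 document
exclusive = np.logical_xor(seen_in_0, seen_in_1).sum()
print("Words appearing in exactly one class:", int(exclusive))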

# Vary document size (avg_doc_length)
ds = [50, 500, 5000, 50000]

for num in ds:
    X, y = make_synthetic_data(n=500, w=50, c=2,
                                avg_doc_length=num,
                                class_sep=0.001,
                                random_state=123)
    synth_acc = synthetic_model(X, y, random_state=123)
    print("Accuracy:", synth_acc)
Accuracy: 0.87
Accuracy: 1.0
Accuracy: 1.0
Accuracy: 1.0

Increasing document length does not add new words (the vocabulary is still only 50); it adds counts. Every extra token contributes another small piece of evidence, and the accumulated evidence eventually makes classification trivial.
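
Concretely, the Naive Bayes log-likelihood ratio for a document is a sum of one term per token, so for a class-1 document its expectation is the document length times KL(p1 || p0). The sketch below illustrates that linear scaling with two made-up, nearly identical word distributions (p0 and p1 are illustrative, not the ones used by make_synthetic_data):

# Sketch: the expected Naive Bayes log-odds margin grows linearly with document length.
rng = np.random.default_rng(0)
p0 = rng.dirichlet(np.ones(50))                  # illustrative class-0 word distribution
p1 = np.clip(p0 + 0.001 * (rng.random(50) - 0.5), 1e-9, None)
p1 /= p1.sum()                                   # nearly identical class-1 distribution

expected_per_token = np.sum(p1 * np.log(p1 / p0))    # KL(p1 || p0)
for length in [50, 500, 5000, 50000]:
    print(length, round(length * expected_per_token, 4))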

How does increasing the number of classes affect accuracy?

Intuition: more classes usually require more data.

# Vary class count
cs = [2, 3, 4]

for num in cs:
    X, y = make_synthetic_data(n=500, w=50, c=num,
                                avg_doc_length=50,
                                class_sep=0.001,
                                random_state=123)
    synth_acc = synthetic_model(X, y, random_state=123)
    print("Accuracy:", synth_acc)
Accuracy: 0.87
Accuracy: 0.65
Accuracy: 0.56

Accuracy drops as the number of classes increases: with n fixed at 500, each class gets fewer training examples, and the chance baseline itself falls as 1/c.
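
A quick comparison against that chance baseline, using the accuracies printed above:

# Compare the observed accuracies (copied from the output above) to the 1/c baseline.
observed = {2: 0.87, 3: 0.65, 4: 0.56}
for c, acc in observed.items():
    print(f"c={c}: accuracy={acc:.2f}, chance={1 / c:.2f}, lift={acc - 1 / c:.2f}")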

# Compare many classes vs many samples
cs = [2, 3, 4, 5]
ns = [500, 50000, 500000, 500000]

for c, n in zip(cs, ns):
    X, y = make_synthetic_data(n=n, w=50, c=c,
                                avg_doc_length=50,
                                class_sep=0.001,
                                random_state=123)
    synth_acc = synthetic_model(X, y, random_state=123)
    print("Accuracy:", synth_acc)
Accuracy: 0.87
Accuracy: 0.7342
Accuracy: 0.6835
Accuracy: 0.64954

Even increasing the sample size by two to three orders of magnitude does not recover the two-class accuracy: the amount of data required grows very quickly as the class count grows.
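
One rough way to probe that growth is to scan sample sizes for each class count until some target accuracy is reached. A sketch using the same backend helpers (the 0.75 target is arbitrary, and the larger settings are slow to run):

# Sketch: smallest tested n that reaches a target accuracy, for each class count.
target = 0.75                                  # arbitrary threshold, for illustration only
for c in [2, 3, 4, 5]:
    for n in [500, 5000, 50000, 500000]:
        X_s, y_s = make_synthetic_data(n=n, w=50, c=c,
                                       avg_doc_length=50,
                                       class_sep=0.001,
                                       random_state=123)
        if synthetic_model(X_s, y_s, random_state=123) >= target:
            print(f"c={c}: reached {target} at n={n}")
            break
    else:
        print(f"c={c}: did not reach {target} with n up to 500000")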

How does spam that pads itself with “non-spam gibberish” affect predictions?

Intuition: it shouldn't affect the predictions much.

# Add the word counts of a random class-0 document to each class-1 document
X, y = make_synthetic_data(
    n=500, w=50, c=2,
    avg_doc_length=50,
    class_sep=0.001,
    random_state=123
)

data = np.concatenate((X, y.reshape(-1, 1)), axis=1)

data_0 = data[data[:, 50] == 0]
data_1 = data[data[:, 50] == 1]

for idx in range(len(data_1)):
    rand_row = data_0[np.random.randint(0, len(data_0))]
    data_1[idx, :50] = data_1[idx, :50] + rand_row[:50] 
    # Leave data_1[idx, 50] unchanged (the label stays 1)

data = np.concatenate((data_0, data_1), axis=0)

X_new = data[:, :50]   # all rows, first 50 columns
y_new = data[:, 50]    # all rows, last column

print(synthetic_model(X, y, random_state=123))
print(synthetic_model(X_new, y_new, random_state=123))
0.87
0.66

With heavily overlapping classes (class_sep=0.001), mixing in class-0 text reduces accuracy significantly, from 0.87 to 0.66.
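
The drop makes sense: each class-1 document (about 50 tokens) absorbs a full class-0 document of roughly the same length, so the class-1 word distribution moves about halfway toward class 0's. A quick check of that shift, reusing X, y, X_new, and y_new from the cell above and again assuming the rows of X are word counts (word_dist is a new helper, not from backend):

# Sketch: how much did the class-1 word distribution move toward class 0?
def word_dist(X, y, label):
    counts = X[y == label].sum(axis=0).astype(float)
    return counts / counts.sum()

p0_orig = word_dist(X, y, 0)
p1_orig = word_dist(X, y, 1)
p1_mixed = word_dist(X_new, y_new, 1)

print("TV(class 1, class 0) before mixing:", round(0.5 * np.abs(p1_orig - p0_orig).sum(), 4))
print("TV(class 1, class 0) after mixing: ", round(0.5 * np.abs(p1_mixed - p0_orig).sum(), 4))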

# Repeat with better class separation
X, y = make_synthetic_data(
    n=500, w=50, c=2,
    avg_doc_length=50,
    class_sep=0.01,
    random_state=123
)

data = np.concatenate((X, y.reshape(-1, 1)), axis=1)

data_0 = data[data[:, 50] == 0]
data_1 = data[data[:, 50] == 1]

for idx in range(len(data_1)):
    rand_row = data_0[np.random.randint(0, len(data_0))]
    data_1[idx, :50] = data_1[idx, :50] + rand_row[:50] 
    # Leave data_1[idx, 50] unchanged (the label stays 1)

data = np.concatenate((data_0, data_1), axis=0)

X_new = data[:, :50]   # all rows, first 50 columns
y_new = data[:, 50]    # all rows, last column

print(synthetic_model(X, y, random_state=123))
print(synthetic_model(X_new, y_new, random_state=123))
1.0
0.94

With better class separation (class_sep=0.01), the degradation is much smaller: accuracy falls only from 1.0 to 0.94, because the genuinely class-1-indicative words still dominate the evidence.

How does a large difference in class proportions affect the model?

# Drop 90% of class 1 examples
X, y = make_synthetic_data(
    n=1000, w=50, c=2,
    avg_doc_length=50,
    class_sep=0.001,
    random_state=123
)

# Probability of dropping a class-1 example (here 0.9, i.e. drop about 90%)
p_drop = 0.9

# random draw: True = drop, False = keep
drop_mask = np.random.rand(len(y)) < p_drop

# keep = either class 0 OR (class 1 & not dropped)
keep = (y == 0) | ((y == 1) & (~drop_mask))

X_new = X[keep]
y_new = y[keep]

print("Original proportion of class 1:", np.mean(y))
print("New proportion of class 1:", round(np.mean(y_new), 3))
print()
print(synthetic_model(X, y, random_state=123))
print(round(synthetic_model(X_new, y_new, random_state=123), 3))
print()
print(synthetic_confusion(X_new, y_new))
Original proportion of class 1: 0.488
New proportion of class 1: 0.092

0.83
0.929

[[102   2]
 [  7   2]]

Accuracy increases simply because nearly all examples now belong to class 0, so predicting the majority class is almost always right. The confusion matrix shows the model still predicts the minority class occasionally, but it recovers only 2 of the 9 minority examples in the test split.
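
With this much imbalance, accuracy is a poor summary on its own. Reading the confusion matrix above with rows as true labels and columns as predicted labels (an assumption about synthetic_confusion's layout), the per-class metrics look quite different:

# Per-class metrics from the confusion matrix printed above.
# Assumes rows are true labels and columns are predicted labels.
cm = np.array([[102, 2],
               [  7, 2]])
recall_0 = cm[0, 0] / cm[0].sum()          # 102 / 104
recall_1 = cm[1, 1] / cm[1].sum()          # 2 / 9
precision_1 = cm[1, 1] / cm[:, 1].sum()    # 2 / 4
balanced_acc = (recall_0 + recall_1) / 2
print("Recall (class 0):   ", round(recall_0, 3))
print("Recall (class 1):   ", round(recall_1, 3))
print("Precision (class 1):", round(precision_1, 3))
print("Balanced accuracy:  ", round(balanced_acc, 3))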