ML


Chapter 1: The Machine Learning Landscape

1. Why ML & When to Use It

Traditional programming requires hand-coding explicit rules; an ML system learns those rules from data.

2. Classification of ML Systems

Géron categorizes systems across three distinct axes. A single system can be a combination of these (e.g., a supervised, online, model-based system).

Axis A: Human Supervision

How much supervision does the system get during training? The book distinguishes Supervised (labeled data), Unsupervised (unlabeled data), Semi-supervised (partially labeled data), and Reinforcement Learning (an agent learns from rewards and penalties).

[Image: supervised vs. unsupervised machine learning]

Axis B: Batch vs. Online Learning

How does the system handle incoming data?

| Feature | Batch Learning (Offline) | Online Learning (Incremental) |
|---|---|---|
| Training | Trains on all available data at once. | Trains incrementally on mini-batches. |
| Resources | High compute/time; requires replacing the old model. | Fast, cheap, immediate. |
| Use Case | Static environments, plenty of compute time. | Rapidly changing data, limited compute, out-of-core learning (datasets too large for main memory). |
| Key Metric | N/A | Learning rate: high rate = adapts fast but forgets old data; low rate = high inertia, less sensitive to noise. |
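Online learning maps to scikit-learn's `partial_fit` API. A minimal sketch on synthetic streaming data (the learning rate and batch size here are made-up illustrative values):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(42)

# Simulated data stream: y = 4 + 3x plus noise, arriving in mini-batches.
model = SGDRegressor(learning_rate="constant", eta0=0.01, random_state=42)

for _ in range(100):                 # 100 mini-batches of 32 instances each
    X = rng.uniform(0, 2, size=(32, 1))
    y = 4 + 3 * X.ravel() + rng.normal(0, 0.1, size=32)
    model.partial_fit(X, y)          # incremental update, no full retraining

print(model.intercept_, model.coef_)  # should approach 4 and 3
```

Each `partial_fit` call updates the weights using only the current mini-batch, which is what makes out-of-core learning possible.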

Axis C: Instance-based vs. Model-based Learning

How does the system generalize to new, unseen data?

| Type | How it works | Example |
|---|---|---|
| Instance-based | Memorizes the training data, then uses a similarity measure to compare new instances to the memorized ones. | k-Nearest Neighbors (k-NN) |
| Model-based | Detects patterns to build a mathematical model, then uses that model to make predictions. | Linear Regression |

The Model-Based Workflow:

  1. Define a model.
  2. Define a Utility Function (measures how good the model is) or a Cost Function (measures how bad it is).
  3. Train (optimize the parameters to minimize the cost).
  4. Predict.
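The model-based workflow can be sketched in a few lines of scikit-learn (the data points are made-up toy numbers following roughly y = 2x + 1):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data, purely illustrative
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.1, 6.9, 9.0])

model = LinearRegression()     # step 1: define the model
model.fit(X, y)                # step 3: training minimizes the MSE cost (step 2)
print(model.predict([[5.0]]))  # step 4: predict for a new instance (close to 11)
```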


3. The Main Challenges of Machine Learning

Errors stem from either Bad Data or a Bad Algorithm.

Problem A: Bad Data

  1. Insufficient Quantity: ML algorithms need thousands to millions of examples. (Recall the “Unreasonable Effectiveness of Data” concept—sometimes a dumb algorithm with massive data beats a smart algorithm with little data).
  2. Nonrepresentative Data: If the training data isn’t representative of the new cases the model must generalize to, it won’t generalize well.
    • Sampling Noise: Nonrepresentative data due to a sample being too small.
    • Sampling Bias: The sample is large enough, but the collection method was flawed (e.g., Landon vs. Roosevelt poll).
  3. Poor-Quality Data: Riddled with errors, outliers, and noise. Fix: Clean it up (discard outliers, fill missing features).
  4. Irrelevant Features: Garbage in, garbage out.
    • Feature Engineering: The most critical step. Involves Feature Selection (choosing the most useful features) and Feature Extraction (combining existing features to produce a more useful one, e.g., using dimensionality reduction).

Problem B: Bad Algorithms

[Image of overfitting and underfitting in machine learning]

  1. Overfitting: The model performs perfectly on training data but bombs on new data. It has memorized the noise.
    • Cause: Model is too complex relative to the amount/noisiness of the training data.
    • Solutions: Simplify the model (fewer parameters), gather more data, clean the data, or Constrain the model (Regularization).
      • Note: The amount of regularization applied is controlled by a Hyperparameter (a parameter of the learning algorithm, not the model itself, set prior to training).
  2. Underfitting: The model is too simple to learn the underlying structure of the data.
    • Cause: Using a linear model for highly non-linear data.
    • Solutions: Select a more powerful model, feed better features (feature engineering), reduce regularization constraints.
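The regularization hyperparameter mentioned above can be seen directly in scikit-learn's Ridge, where it is called `alpha`. A sketch on synthetic data (made-up true weights) showing that larger `alpha` shrinks the learned weights:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 3.0]) + rng.normal(0, 0.5, size=50)

# alpha is a hyperparameter: it belongs to the learning algorithm,
# not the model, and is fixed before training starts.
norms = {}
for alpha in (0.01, 1.0, 100.0):
    model = Ridge(alpha=alpha).fit(X, y)
    norms[alpha] = np.linalg.norm(model.coef_)
    print(alpha, norms[alpha])   # weight norm shrinks as alpha grows
```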

4. Testing, Validating, and Data Mismatch

You shouldn’t train a model and blindly deploy it. You need a rigorous testing framework.


Chapter 3: Classification

Source: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (A. Géron)

1. The Starting Point: MNIST & Binary Classification


2. Performance Measures (⚠️ High Exam Probability)

Evaluating a classifier is significantly trickier than evaluating a regressor. This section is the core of the chapter.

A. Measuring Accuracy Using Cross-Validation

B. The Confusion Matrix

The ultimate tool for classification performance. It counts how many times instances of class A are classified as class B.

C. Precision, Recall, and the F_1 Score

Instead of looking at the whole matrix, we extract concise metrics.

  1. Precision (Accuracy of the positive predictions): Out of all the times the model yelled “It’s a 5!”, how often was it right? Precision = TP / (TP + FP)
  2. Recall (Sensitivity / True Positive Rate): Out of all the actual 5s in the dataset, how many did the model find? Recall = TP / (TP + FN)
  3. F1 Score: The harmonic mean of precision and recall. It heavily punishes extreme values: you only get a high F1 score if both precision and recall are high. F1 = 2 / (1/Precision + 1/Recall) = 2 · (Precision · Recall) / (Precision + Recall)
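These three metrics are one-liners in scikit-learn. With made-up labels giving TP = 3, FP = 1, FN = 1:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(precision_score(y_true, y_pred))  # TP/(TP+FP) = 3/4 = 0.75
print(recall_score(y_true, y_pred))     # TP/(TP+FN) = 3/4 = 0.75
print(f1_score(y_true, y_pred))         # harmonic mean = 0.75
```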

D. The Precision/Recall Trade-off

Intuition Check: You cannot have perfect precision and perfect recall. It is a slider.
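The slider can be inspected with `precision_recall_curve`, which evaluates every possible decision threshold. A sketch on hypothetical decision scores (all numbers made up):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical classifier decision scores for 8 instances
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 2.5, 3.0])

precisions, recalls, thresholds = precision_recall_curve(y_true, scores)
# Sliding the threshold up the score axis trades recall for precision.
for p, r, t in zip(precisions, recalls, thresholds):
    print(f"threshold={t:+.1f}  precision={p:.2f}  recall={r:.2f}")
```

Picking a threshold is then a business decision: raise it for high-precision use cases, lower it when missing positives is costly.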

E. The ROC Curve (Receiver Operating Characteristic)

Plots the True Positive Rate (Recall) against the False Positive Rate (FPR).

| Metric | When to use it? (Classic Exam Question) |
|---|---|
| PR Curve | When the positive class is rare, or when you care more about false positives than false negatives. |
| ROC Curve | Otherwise: when the classes are roughly balanced and both error types matter about equally. |
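Computing the ROC curve and its summary metric, the Area Under the Curve (AUC), is a sketch away (same hypothetical scores as before; AUC = 1.0 means a perfect ranking, 0.5 means random):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 2.5, 3.0])

fpr, tpr, thresholds = roc_curve(y_true, scores)  # points of the ROC curve
auc = roc_auc_score(y_true, scores)
print(auc)
```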

3. Beyond Binary: Multiclass Classification

Algorithms like Random Forest and Naive Bayes can handle multiple classes natively. Others (like SVM or Linear classifiers) are strictly binary. To do multiclass with binary algorithms, we use two strategies:

| Strategy | How it works | Pros / Cons |
|---|---|---|
| OvR (One-versus-Rest) | Train N binary classifiers, one per class (10 for MNIST: a 0-detector, a 1-detector, ...); pick the class whose classifier outputs the highest decision score. | Pro: the default for most algorithms; only N classifiers. Con: each classifier is trained on the whole dataset. |
| OvO (One-versus-One) | Train one classifier for every pair of classes (0s vs. 1s, 0s vs. 2s, ...): N(N-1)/2 classifiers (45 for MNIST). | Pro: each classifier trains only on the subset of data for its two classes; best for algorithms that scale poorly with training-set size (like SVMs). |
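Both strategies have explicit wrappers in scikit-learn. A sketch using `load_digits` as a small stand-in for MNIST and `LinearSVC` as the binary learner:

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)  # 10 digit classes
X = X / 16.0                         # scale pixel values to help convergence

ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)
print(len(ovo.estimators_))          # N*(N-1)/2 = 45 binary classifiers

ovr = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)
print(len(ovr.estimators_))          # N = 10 binary classifiers
```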

4. Error Analysis

Once you have a model, you analyze its mistakes to improve it.

  1. Normalize the Confusion Matrix: Divide each value by the number of images in the corresponding class. This prevents classes with lots of data from looking artificially bad.
  2. Visualize: Fill the diagonal with zeros so that only the errors stand out.
  3. Act: If the model confuses 3s and 5s, you might preprocess the images to center them better, or add synthetic data of shifted 3s and 5s.
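Steps 1 and 2 are pure NumPy. A sketch on a hypothetical 3-class confusion matrix (all counts made up):

```python
import numpy as np

# Toy confusion matrix: rows = actual class, columns = predicted class
cm = np.array([[90,  5,  5],
               [10, 80, 10],
               [ 2,  3, 15]])

row_sums = cm.sum(axis=1, keepdims=True)
norm_cm = cm / row_sums        # each row now sums to 1 (error *rates*)
np.fill_diagonal(norm_cm, 0)   # zero the diagonal so only errors stand out
print(norm_cm)
```

Note how class 2, with only 20 instances, would have looked deceptively clean in raw counts but shows a 10% error rate into class 0 after normalization.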

5. Advanced Output Types

Don’t confuse these two—professors love testing the distinction.


Chapter 4: Training Models

💡 The Core Theme: You are opening the “black box” of Machine Learning. Instead of just calling .fit(), you are learning how the algorithms actually find the best parameters.

1. Linear Regression

Goal: Predict a continuous value by fitting a straight line (or hyperplane) to your data.
Model Prediction: ŷ = θ₀ + θ₁x₁ + ... + θₙxₙ (vectorized: ŷ = θᵀx)
Cost Function: Mean Squared Error (MSE). We want to find the θ (weights) that minimize it.

The Normal Equation

Instead of iterating, we use a math formula to jump straight to the exact minimum.
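The closed-form solution is θ̂ = (XᵀX)⁻¹ Xᵀ y. A NumPy sketch on synthetic data generated from y = 4 + 3x plus noise:

```python
import numpy as np

rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=100)   # true parameters: 4 and 3

X_b = np.c_[np.ones((100, 1)), X]            # add the bias column x0 = 1
# Normal Equation: theta = (X^T X)^{-1} X^T y
theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print(theta)                                 # close to [4, 3]
```

In practice `np.linalg.lstsq` (or scikit-learn's `LinearRegression`) is preferred over explicitly inverting XᵀX, which can be numerically unstable.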


2. Gradient Descent (GD)

💡 Intuition: Imagine you are blindfolded at the top of a mountain. To get to the bottom, you feel the slope of the ground with your feet and take a step in the steepest downward direction.

The 3 Flavors of Gradient Descent

| Feature | Batch GD | Stochastic GD (SGD) | Mini-batch GD |
|---|---|---|---|
| Data per step | The entire dataset. | One single random instance. | A small random subset (e.g., 32). |
| Speed per step | Very slow. | Very fast. | Fast (can exploit GPU/vectorized hardware). |
| Path to minimum | Smooth and direct. | Erratic, bounces around. | Less erratic than SGD. |
| Escapes local minima? | No. | Yes, thanks to its random bouncing. | Yes, better than Batch. |
| Final accuracy | Converges to the minimum (given a suitable learning rate). | Never settles; needs a learning schedule. | Close to the minimum. |
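Batch Gradient Descent is short enough to write by hand. A sketch on the same kind of synthetic data as the Normal Equation (learning rate and iteration count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=100)
X_b = np.c_[np.ones((100, 1)), X]   # add bias column

eta = 0.1                           # learning rate
theta = rng.normal(size=2)          # random initialization
m = len(X_b)

for _ in range(1000):
    # MSE gradient computed over the ENTIRE dataset, hence "batch"
    gradients = (2 / m) * X_b.T @ (X_b @ theta - y)
    theta -= eta * gradients        # step in the steepest downward direction

print(theta)                        # converges to the Normal Equation solution
```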

3. Polynomial Regression

What if your data isn’t a straight line? You can still use Linear Regression!
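The trick is to add powers of each feature as new features, then fit a plain linear model on the expanded set. A sketch on synthetic quadratic data (true curve y = 0.5x² + x + 2):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X = 6 * rng.random((100, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(0, 0.3, size=100)

# Add x^2 as a second feature, then run ordinary Linear Regression on [x, x^2]
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
model = LinearRegression().fit(X_poly, y)
print(model.intercept_, model.coef_)   # close to 2 and [1, 0.5]
```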


4. Learning Curves

Plots of the model’s performance on the training set and validation set as a function of the training set size.
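scikit-learn can generate these plots' raw data directly via `learning_curve`. A sketch on synthetic linear data (the cross-validation setup and size grid are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=100)

sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5),
    scoring="neg_mean_squared_error")

print(sizes)  # the increasing training-set sizes behind each curve point
# Plotting -train_scores.mean(axis=1) and -val_scores.mean(axis=1) against
# sizes gives the classic learning-curve picture.
```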


5. Regularized Linear Models

💡 Intuition: “Keep it simple, stupid.” Regularization forces the learning algorithm to not only fit the data but also keep the model weights (θ) as small as possible. This prevents overfitting.

(Note: We generally do not regularize the bias term θ₀.)

Ridge Regression (L2 Regularization)

Lasso Regression (L1 Regularization)

Elastic Net

Early Stopping
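A minimal early-stopping sketch: train an `SGDRegressor` one epoch at a time (via `warm_start=True`), track the validation error, and keep a snapshot of the best model seen so far. The data, learning rate, and epoch count here are illustrative choices:

```python
from copy import deepcopy
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = 2 * rng.random((200, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=200)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# warm_start=True makes each .fit() continue where it left off,
# so one call corresponds to one extra epoch of training.
sgd = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                   learning_rate="constant", eta0=0.001, random_state=42)

best_val_error, best_model = float("inf"), None
for epoch in range(200):
    sgd.fit(X_train, y_train)
    val_error = mean_squared_error(y_val, sgd.predict(X_val))
    if val_error < best_val_error:
        best_val_error = val_error
        best_model = deepcopy(sgd)   # snapshot of the best model so far

print(best_val_error)
```

Rolling back to `best_model` is what stops training “at just the right time”, before the validation error starts climbing back up.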


6. Logistic Regression

[Image of Logistic Regression Sigmoid curve]

Despite the name, this is a classification algorithm.


7. Softmax Regression (Multinomial Logistic Regression)

Logistic Regression is strictly binary (Yes/No). What if you have multiple classes (e.g., Red, Blue, Green)?
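In scikit-learn, Softmax Regression is just `LogisticRegression` applied to more than two classes: with the default "lbfgs" solver it uses the softmax (multinomial) formulation automatically. A sketch on the 3-class iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)   # 3 flower classes

# Multiclass target + default lbfgs solver -> softmax regression under the hood
softmax = LogisticRegression(max_iter=1000).fit(X, y)

proba = softmax.predict_proba(X[:1])
print(proba)   # one probability per class, and they sum to 1
```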


Final Review Step: Before the exam, make sure you can visualize the shape of the cost function (MSE vs Log Loss) and the difference between L1 (diamond shape) and L2 (circular shape) regularization penalties!