Classification? More like Logistic Regression
Logistic regression is a machine learning algorithm used to predict the probability that an instance belongs to a particular class. It is typically used for binary classification problems, where the output is either 0 or 1. However, it can be generalized to support multiple classes using softmax regression.
Logistic regression is similar to linear regression in that it computes a weighted sum of the input features (plus a bias term). However, instead of outputting the result directly, logistic regression applies the sigmoid function, also called the logistic function, to the result. Mathematically, the logistic regression model's estimated probability can be represented as:

$$\hat{p} = h_{\boldsymbol{\theta}}(\mathbf{x}) = \sigma\left(\boldsymbol{\theta}^{T} \mathbf{x}\right)$$

The sigmoid function is an S-shaped function that outputs a number between 0 and 1:

$$\sigma(t) = \frac{1}{1 + e^{-t}}$$

This output represents the estimated probability that the instance belongs to the positive class.
How Does Logistic Regression Make Predictions?
If the estimated probability is greater than a given threshold (typically 50%), then the model predicts that the instance belongs to the positive class (labeled "1"). Otherwise, it predicts that it belongs to the negative class (labeled "0").
Example: Logistic regression model prediction using a 50% threshold probability:

$$\hat{y} = \begin{cases} 0 & \text{if } \hat{p} < 0.5 \\ 1 & \text{if } \hat{p} \geq 0.5 \end{cases}$$

where:
- $\hat{y}$ is the predicted class
- $\hat{p}$ is the estimated probability that the instance belongs to the positive class.
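To make this concrete, here is a minimal NumPy sketch of the prediction step. The weight vector `theta` and the instance `x` are made-up values for illustration only:

```python
import numpy as np

def sigmoid(t):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

def predict(theta, x, threshold=0.5):
    """Predict the class (0 or 1) for a single instance x."""
    p_hat = sigmoid(theta @ x)   # estimated probability of the positive class
    return int(p_hat >= threshold), p_hat

# Made-up parameters: bias term first, then one weight per feature.
theta = np.array([-1.0, 2.0, 0.5])
x = np.array([1.0, 0.8, 0.3])    # x[0] = 1 accounts for the bias term

y_hat, p_hat = predict(theta, x)
print(f"p_hat = {p_hat:.3f} -> class {y_hat}")
```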
How is a Logistic Regression Model Trained?
The objective of training a logistic regression model is to set the parameter vector $\boldsymbol{\theta}$ so that the model estimates high probabilities for positive instances ($y = 1$) and low probabilities for negative instances ($y = 0$). This is achieved by minimizing a cost function, typically the log loss.
The log loss function penalizes the model when it estimates a low probability for the target class. For a single training instance, the cost function is:

$$c(\boldsymbol{\theta}) = \begin{cases} -\log(\hat{p}) & \text{if } y = 1 \\ -\log(1 - \hat{p}) & \text{if } y = 0 \end{cases}$$
The cost function for the entire training set is simply the average cost over all $m$ training instances:

$$J(\boldsymbol{\theta}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(\hat{p}^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - \hat{p}^{(i)}\right) \right]$$
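A minimal sketch of training with batch gradient descent, using the standard gradient of the log loss, $\frac{1}{m}\mathbf{X}^{T}(\hat{\mathbf{p}} - \mathbf{y})$. The dataset below is synthetic, and the learning rate and iteration count are arbitrary choices:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_loss(y, p_hat, eps=1e-12):
    """Average log loss over the training set; eps guards against log(0)."""
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

def train(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent on the log loss. X must include a bias column."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p_hat = sigmoid(X @ theta)
        gradient = X.T @ (p_hat - y) / len(y)   # gradient of the log loss
        theta -= lr * gradient
    return theta

# Tiny synthetic dataset: a bias column of ones plus one feature.
rng = np.random.default_rng(42)
x1 = rng.normal(size=100)
X = np.c_[np.ones(100), x1]
y = (x1 > 0).astype(float)          # positive class when x1 > 0

theta = train(X, y)
print("theta:", theta, "final loss:", log_loss(y, sigmoid(X @ theta)))
```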
What is Softmax Regression?
Softmax regression, also known as multinomial logistic regression, is a generalization of logistic regression used for multiclass classification problems where the classes are mutually exclusive. This means that each instance can only belong to one class.
For example, you could use softmax regression to classify iris flowers into different species, but not to recognize multiple people in one picture.
How Softmax Regression Works:

1. Calculating scores: Softmax regression calculates a score, $s_k(\mathbf{x})$, for each class $k$ given an instance $\mathbf{x}$. This score represents how likely the instance is to belong to that class. The score is calculated using an equation similar to linear regression, where each class has its own parameter vector, $\boldsymbol{\theta}^{(k)}$:

$$s_k(\mathbf{x}) = \left(\boldsymbol{\theta}^{(k)}\right)^{T} \mathbf{x}$$

These parameter vectors are stored as rows in a parameter matrix, $\boldsymbol{\Theta}$.

2. Estimating probabilities: The scores for each class are then converted into probabilities using the softmax function. The softmax function, also called the normalized exponential, calculates the exponential of each score and then divides by the sum of all the exponentials:

$$\hat{p}_k = \sigma(\mathbf{s}(\mathbf{x}))_k = \frac{\exp\left(s_k(\mathbf{x})\right)}{\sum_{j=1}^{K} \exp\left(s_j(\mathbf{x})\right)}$$

where:
- $K$ is the number of classes.
- $\mathbf{s}(\mathbf{x})$ is a vector containing the scores of each class for the instance $\mathbf{x}$.
- $\hat{p}_k$ is the estimated probability that the instance $\mathbf{x}$ belongs to class $k$, given the scores of each class for that instance.

3. Making predictions: The softmax regression classifier predicts the class with the highest estimated probability, which corresponds to the class with the highest score (all three steps are sketched in code after this list):

$$\hat{y} = \underset{k}{\operatorname{argmax}}\ \sigma(\mathbf{s}(\mathbf{x}))_k = \underset{k}{\operatorname{argmax}}\ s_k(\mathbf{x})$$
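As promised above, here is a sketch of all three steps with a made-up parameter matrix `Theta` (one row per class) and a single made-up instance:

```python
import numpy as np

def softmax(scores):
    """Normalized exponential; shifting by max(scores) avoids overflow."""
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

# Made-up parameter matrix for K = 3 classes; each row is theta^(k).
Theta = np.array([[ 0.5,  1.0, -0.5],
                  [-0.2,  0.3,  0.8],
                  [ 0.1, -0.7,  0.4]])
x = np.array([1.0, 2.0, 0.5])      # x[0] = 1 accounts for the bias term

scores = Theta @ x                  # step 1: one score per class, s_k(x)
probas = softmax(scores)            # step 2: probabilities via softmax
y_hat = np.argmax(probas)           # step 3: class with the highest probability

print("scores:", scores, "probas:", probas, "predicted class:", y_hat)
```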
Training Softmax Regression:

- The goal: Training a softmax regression model means adjusting the parameter matrix $\boldsymbol{\Theta}$ so that the model assigns a high probability to the correct class and low probabilities to the other classes.
- Cost function: This is achieved by minimizing a cost function called the cross-entropy. Cross-entropy measures how well the estimated class probabilities match the true class probabilities. The cross-entropy cost function is given by:

$$J(\boldsymbol{\Theta}) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} y_k^{(i)} \log\left(\hat{p}_k^{(i)}\right)$$

where:
- $y_k^{(i)}$ is the target probability that the $i$-th instance belongs to class $k$. This is typically 1 if the instance belongs to class $k$ and 0 otherwise.
- Minimizing cross-entropy: Gradient descent can be used to find the parameter values that minimize the cross-entropy cost function (a numerical sketch follows below).
The cross-entropy will be low when the predicted probabilities are close to the target probabilities. Conversely, the cross-entropy will be high if the predicted probabilities are far from the target probabilities.
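A minimal sketch of the cross-entropy cost and batch gradient descent for the softmax model, assuming one-hot targets `Y` and the standard gradient $\frac{1}{m}(\hat{\mathbf{P}} - \mathbf{Y})^{T}\mathbf{X}$; the dataset and hyperparameters here are made up:

```python
import numpy as np

def softmax(scores):
    exps = np.exp(scores - scores.max(axis=1, keepdims=True))
    return exps / exps.sum(axis=1, keepdims=True)

def cross_entropy(Y, P_hat, eps=1e-12):
    """Average cross-entropy between one-hot targets Y and probabilities P_hat."""
    return -np.mean(np.sum(Y * np.log(P_hat + eps), axis=1))

# Made-up batch: m = 4 instances, 3 features (bias column first), K = 3 classes.
X = np.array([[1.0, 0.2, 0.7],
              [1.0, 1.5, 0.1],
              [1.0, 0.4, 1.2],
              [1.0, 0.9, 0.9]])
Y = np.eye(3)[[0, 1, 2, 1]]             # one-hot encoded targets

Theta = np.zeros((3, X.shape[1]))       # one parameter row per class
lr = 0.5
for _ in range(500):                    # batch gradient descent
    P_hat = softmax(X @ Theta.T)
    grad = (P_hat - Y).T @ X / len(X)   # gradient of the cross-entropy
    Theta -= lr * grad

print("final cross-entropy:", cross_entropy(Y, softmax(X @ Theta.T)))
```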
Example
Let's imagine a problem with three classes (A, B, and C) and a single training instance.
- Scenario 1: Correct Prediction with High Confidence
- True class: Class A (represented as the one-hot vector [1, 0, 0])
- Predicted probabilities: [0.9, 0.05, 0.05]
- The cross-entropy would be relatively low because the model correctly predicted a high probability for class A.
- Scenario 2: Correct Prediction with Low Confidence
- True class: Class A (represented as the one-hot vector [1, 0, 0])
- Predicted probabilities: [0.4, 0.3, 0.3]
- The cross-entropy would be higher than in Scenario 1 because, although the model predicts the correct class, it does so with lower confidence (a lower probability for class A).
- Scenario 3: Incorrect Prediction
- True class: Class A (represented as the one-hot vector [1, 0, 0])
- Predicted probabilities: [0.1, 0.7, 0.2]
- The cross-entropy would be the highest in this scenario because the model predicted the wrong class (assigning the highest probability to class B).
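Since the target is the one-hot vector [1, 0, 0], the cross-entropy for a single instance reduces to $-\log(\hat{p}_A)$. A quick numerical check of the three scenarios:

```python
import numpy as np

y = np.array([1.0, 0.0, 0.0])          # one-hot target: class A

scenarios = {
    "1: correct, high confidence": np.array([0.9, 0.05, 0.05]),
    "2: correct, low confidence":  np.array([0.4, 0.3, 0.3]),
    "3: incorrect":                np.array([0.1, 0.7, 0.2]),
}

for name, p_hat in scenarios.items():
    ce = -np.sum(y * np.log(p_hat))    # reduces to -log(p_hat[0]) here
    print(f"Scenario {name}: cross-entropy = {ce:.3f}")

# Scenario 1: -log(0.9) ≈ 0.105   (lowest)
# Scenario 2: -log(0.4) ≈ 0.916
# Scenario 3: -log(0.1) ≈ 2.303   (highest)
```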