Logistic Regression

From CS Wiki

Logistic regression is a statistical and machine learning algorithm used for binary classification tasks, where the output variable is categorical and typically represents two classes (e.g., yes/no, spam/not spam, fraud/not fraud). Despite its name, Logistic Regression is used as a classification algorithm: it models the probability of class membership (a value between 0 and 1), and that probability is then compared with a threshold to assign a class label.

How It Works

Logistic Regression models the probability of a binary outcome using the logistic function, also known as the sigmoid function. The sigmoid function maps any real-valued input to a value between 0 and 1, which is interpreted as the probability of belonging to a particular class. The model predicts the probability that the input belongs to the positive class (1) and assigns that class when the probability reaches a threshold, often 0.5.

The logistic function is represented by:

P(y=1 | X) = 1 / (1 + e^(-(b0 + b1X1 + b2X2 + ... + bnXn)))

where:

  • P(y=1 | X) is the probability of the output being 1 given the input features.
  • X1, X2, ..., Xn are the input features.
  • b0 is the intercept, and b1, b2, ..., bn are the coefficients of the features.
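The formula above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names (sigmoid, predict_proba, predict_class) and the example coefficient values are assumptions chosen for the example, not part of any particular library.

```python
import math

def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    # Branch on the sign of z so that math.exp never receives a large
    # positive argument (avoids overflow for very negative z).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(features, intercept, coefficients):
    """Return P(y=1 | X) for one observation."""
    # Linear combination b0 + b1*X1 + ... + bn*Xn
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return sigmoid(z)

def predict_class(features, intercept, coefficients, threshold=0.5):
    """Assign class 1 when the predicted probability reaches the threshold."""
    return 1 if predict_proba(features, intercept, coefficients) >= threshold else 0

# Hypothetical coefficients, picked so that the linear combination is exactly 0.
example_probability = predict_proba([1.0, 2.0], -1.0, [0.5, 0.25])
example_class = predict_class([1.0, 2.0], -1.0, [0.5, 0.25])
```

With these made-up values the linear combination is 0, so the predicted probability sits exactly on the 0.5 threshold. Raising the threshold trades recall for precision, which is why 0.5 is a common default rather than a fixed rule.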

Types of Logistic Regression

  • Binary Logistic Regression: Used for binary classification with two possible outcomes (e.g., yes/no).
  • Multinomial Logistic Regression: Used when the outcome variable has more than two categories without any ordering (e.g., classifying types of animals).
  • Ordinal Logistic Regression: Used when the outcome variable has ordered categories (e.g., ranking levels from low to high).

Applications of Logistic Regression

Logistic Regression is widely used across industries due to its simplicity, interpretability, and effectiveness in binary classification tasks:

  • Healthcare: Predicting disease outcomes, risk assessments, and patient survival chances.
  • Finance: Credit scoring, fraud detection, and risk analysis.
  • Marketing: Customer churn prediction, targeting potential buyers, and lead qualification.
  • Social Sciences: Survey analysis, where responses fall into categories like agree/disagree or support/oppose.

Key Metrics for Evaluating Logistic Regression

To assess the performance of a Logistic Regression model, common metrics include:

  • Accuracy: The proportion of correct predictions.
  • Precision: The ratio of true positive predictions to all positive predictions.
  • Recall: The ratio of true positive predictions to all actual positives.
  • F1 Score: The harmonic mean of precision and recall, useful when dealing with imbalanced data.
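  • AUC-ROC Curve: Measures the model’s ability to distinguish between classes across all classification thresholds. The Area Under the Curve (AUC) is 0.5 for random guessing and 1.0 for perfect separation, so a higher AUC indicates better performance.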
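The metrics listed above can be computed directly from true labels and predictions. The following is a small sketch using only plain Python; the function names and the sample labels are hypothetical, and the AUC is computed with the pairwise-ranking definition (the fraction of positive/negative pairs in which the positive example receives the higher score, with ties counted as half), which is equivalent to the area under the ROC curve.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))

# Made-up labels and predictions, for illustration only.
sample_metrics = classification_metrics(
    [1, 1, 1, 0, 0, 0, 0, 1],
    [1, 1, 0, 0, 0, 1, 0, 1],
)
sample_auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.2])
```

The pairwise AUC loop is quadratic in the number of observations, which is fine for a demonstration; library implementations typically sort the scores instead.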

Assumptions of Logistic Regression

Logistic Regression relies on several assumptions for accurate results:

1. Linearity of Independent Variables and Log-Odds: Assumes a linear relationship between the log-odds of the outcome and the independent variables.

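2. Independence of Observations: Observations should be independent of each other. Correlated observations (for example, repeated measurements of the same subject) can make standard errors and significance tests unreliable.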

3. No Multicollinearity: Independent variables should not be highly correlated with each other, which can be checked using Variance Inflation Factor (VIF).

4. Sufficient Sample Size: Logistic Regression requires a large enough sample size, especially for categorical variables, to make accurate predictions.
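Two of these assumptions can be illustrated numerically. The sketch below first checks that, for a one-feature model with hypothetical coefficients b0 = -1.0 and b1 = 0.8, each one-unit increase in the feature changes the log-odds by exactly b1. It then computes the VIF for the special case of two predictors, where the VIF reduces to 1 / (1 - r^2) with r the Pearson correlation between them. The function names and data are assumptions made for the example; with more than two predictors the VIF requires regressing each predictor on all the others.

```python
import math

def log_odds(p):
    """Logit: the log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

def probability(x, b0=-1.0, b1=0.8):
    """One-feature logistic model with hypothetical coefficients."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Linearity in the log-odds: consecutive differences all equal b1 (0.8).
log_odds_steps = [
    log_odds(probability(x + 1)) - log_odds(probability(x))
    for x in range(-3, 4)
]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def vif_two_predictors(xs, ys):
    """VIF for the two-predictor case only: 1 / (1 - r^2)."""
    r = pearson_r(xs, ys)
    return 1.0 / (1.0 - r * r)

# Made-up predictor values with correlation 0.8, giving a VIF of about 2.78.
example_vif = vif_two_predictors([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
```

Rules of thumb for what counts as a problematic VIF vary between sources, so the value is better read as a relative indicator than as a pass/fail test.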

Handling Limitations

Logistic Regression may not perform well if the relationship between the features and the log-odds of the outcome is highly non-linear. In such cases, consider transforming the features, adding polynomial terms, or using a more flexible model such as a Decision Tree or Neural Network.
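As a rough illustration of the polynomial-feature approach, the sketch below fits a logistic model by plain batch gradient descent on a made-up one-dimensional data set in which the positive class is |x| > 1. No single threshold on x can separate these classes, but adding x squared as a second feature makes them separable. The training routine, learning rate, epoch count, and data are all assumptions for the example, not a recommended training setup.

```python
import math

def sigmoid(z):
    # Sign-based branching avoids overflow in math.exp.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(rows, labels, learning_rate=0.1, epochs=2000):
    """Batch gradient descent on the log-loss; weights[0] is the intercept."""
    n_features = len(rows[0])
    weights = [0.0] * (n_features + 1)
    n = len(rows)
    for _ in range(epochs):
        grad = [0.0] * (n_features + 1)
        for row, y in zip(rows, labels):
            z = weights[0] + sum(w * x for w, x in zip(weights[1:], row))
            error = sigmoid(z) - y  # derivative of the log-loss w.r.t. z
            grad[0] += error
            for j, x in enumerate(row):
                grad[j + 1] += error * x
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad)]
    return weights

def accuracy(weights, rows, labels):
    """Share of rows classified correctly at the 0.5 threshold."""
    correct = 0
    for row, y in zip(rows, labels):
        z = weights[0] + sum(w * x for w, x in zip(weights[1:], row))
        predicted = 1 if z >= 0 else 0  # z >= 0 is the same as p >= 0.5
        correct += int(predicted == y)
    return correct / len(rows)

# Made-up data: class 1 lies on both sides of class 0.
xs = [-3.0, -2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0, 3.0]
labels = [1 if abs(x) > 1 else 0 for x in xs]

linear_rows = [[x] for x in xs]        # original feature only
poly_rows = [[x, x * x] for x in xs]   # original feature plus its square

linear_acc = accuracy(fit_logistic(linear_rows, labels), linear_rows, labels)
poly_acc = accuracy(fit_logistic(poly_rows, labels), poly_rows, labels)
```

On this toy data the model with only x cannot classify every point correctly, while the model with the squared term can. The model is still linear in its coefficients; only the feature set has changed.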

See Also