Data Science Cheat Sheet

From CS Wiki
Revision as of 00:26, 4 November 2024 by 핵톤

Confusion Matrix and F1 Score

Confusion Matrix

                   Predicted Positive     Predicted Negative
Actual Positive    True Positive (TP)     False Negative (FN)
Actual Negative    False Positive (FP)    True Negative (TN)
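
The four cells can be tallied directly from paired labels. The following is a minimal sketch, assuming binary labels where 1 marks the positive class; the function name `confusion_counts` and the sample data are illustrative only.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FN, FP, TN) for paired actual/predicted binary labels."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            if predicted == positive:
                tp += 1  # actual positive, predicted positive
            else:
                fn += 1  # actual positive, predicted negative
        else:
            if predicted == positive:
                fp += 1  # actual negative, predicted positive
            else:
                tn += 1  # actual negative, predicted negative
    return tp, fn, fp, tn


# Made-up labels for illustration
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fn, fp, tn = confusion_counts(y_true, y_pred)
print(tp, fn, fp, tn)  # 3 1 1 3
```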

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

Equivalent forms:

  • 2 * (Positive Predictive Value * True Positive Rate) / (Positive Predictive Value + True Positive Rate)
  • 2 * TP / (2 * TP + FP + FN)
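
As a quick check, F1 can be computed both from precision and recall and directly from counts. Substituting Precision = TP / (TP + FP) and Recall = TP / (TP + FN) into the harmonic mean gives 2 * TP / (2 * TP + FP + FN). The sketch below assumes TP > 0 so that no division by zero occurs; the function names and counts are illustrative.

```python
def f1_from_rates(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall (assumes tp > 0)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_from_counts(tp, fp, fn):
    """F1 written directly in terms of confusion-matrix counts."""
    return 2 * tp / (2 * tp + fp + fn)


# Made-up counts for illustration
tp, fp, fn = 30, 10, 20
f1_a = f1_from_rates(tp, fp, fn)
f1_b = f1_from_counts(tp, fp, fn)
print(round(f1_a, 4), round(f1_b, 4))  # both are about 0.6667
```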

Key Evaluation Metrics

True Positive Rate (TPR), Sensitivity, Recall

  • TPR = Sensitivity = Recall = TP / (TP + FN)
  • Application: Measures the model's ability to find actual positive cases; useful in medical diagnostics, where a missed case (false negative) is costly.

Precision (Positive Predictive Value)

  • Precision = TP / (TP + FP)
  • Application: Indicates the proportion of positive predictions that are correct, valuable in applications like spam filtering to minimize false alarms.

Specificity (True Negative Rate, TNR)

  • Specificity = TNR = TN / (TN + FP)
  • Application: Assesses the model's accuracy in identifying negative cases, crucial in fraud detection to avoid unnecessary scrutiny of legitimate transactions.

False Positive Rate (FPR)

  • FPR = FP / (FP + TN) = 1 - Specificity
  • Application: Reflects the rate of false alarms for negative cases, significant in security systems where false positives can lead to excessive interventions.

Negative Predictive Value (NPV)

  • NPV = TN / (TN + FN)
  • Application: Gives the probability that a negative prediction is correct, important in screening tests where a negative result must be trustworthy.

Accuracy

  • Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • Application: Provides an overall measure of model correctness, often used as a baseline metric but less informative for imbalanced datasets, where always predicting the majority class already scores highly.
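
The metrics above can all be derived from the four confusion-matrix counts. The following is a minimal sketch; the helper `binary_metrics` and the counts are made up for illustration, and every denominator is assumed to be non-zero. The example uses an imbalanced dataset (10 positives in 1,000 cases) to show how accuracy can look strong while recall and precision stay low.

```python
def binary_metrics(tp, fn, fp, tn):
    """Compute the metrics listed above from confusion-matrix counts."""
    return {
        "recall": tp / (tp + fn),        # TPR, sensitivity
        "precision": tp / (tp + fp),     # positive predictive value
        "specificity": tn / (tn + fp),   # TNR
        "fpr": fp / (fp + tn),           # equals 1 - specificity
        "npv": tn / (tn + fn),           # negative predictive value
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Illustrative imbalanced case: 10 actual positives, 990 actual negatives
m = binary_metrics(tp=5, fn=5, fp=10, tn=980)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Here accuracy is 0.985 even though only half of the positives are found (recall 0.5) and only a third of the positive predictions are correct.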

Curves & Charts

Lift Curve

  • X-axis: Percent of data (typically population percentile or cumulative population)
  • Y-axis: Lift (positive rate among the top-scored cases divided by the overall positive rate; a lift of 1 equals the random baseline)
  • Application: Helps in evaluating the effectiveness of a model in prioritizing high-response cases, often used in marketing to identify segments likely to respond to promotions.
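
One common way to compute lift at a given cutoff is to rank cases by model score and compare the positive rate in the top fraction with the overall positive rate. The sketch below takes that definition as an assumption; `lift_at` and the sample data are hypothetical, labels are 0/1, and at least one positive is assumed.

```python
def lift_at(y_true, scores, fraction):
    """Lift in the top `fraction` of cases when ranked by descending score."""
    ranked = [label for _, label in
              sorted(zip(scores, y_true), key=lambda p: p[0], reverse=True)]
    k = max(1, round(fraction * len(ranked)))  # number of cases in the top slice
    top_rate = sum(ranked[:k]) / k             # positive rate among top-k cases
    base_rate = sum(ranked) / len(ranked)      # overall positive rate
    return top_rate / base_rate


# Made-up labels and scores for illustration
y_true = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.95, 0.90, 0.85, 0.80, 0.60, 0.55, 0.50, 0.30, 0.20, 0.10]

for fraction in (0.2, 0.5, 1.0):
    print(fraction, round(lift_at(y_true, scores, fraction), 2))
```

With this data the lift is about 2.5 in the top 20%, about 1.5 in the top 50%, and 1.0 when the whole population is included.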

Gain Chart

  • X-axis: Percent of data (typically cumulative population)
  • Y-axis: Cumulative gain (proportion of positives captured)
  • Application: Illustrates the cumulative capture of positive responses at different cutoffs, useful in customer targeting to assess the efficiency of resource allocation.
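
The points of a gain chart can be sketched as follows: rank cases by score, then record the share of all positives captured after each additional case. This is a minimal illustration with a hypothetical helper `cumulative_gain`, assuming 0/1 labels and at least one positive.

```python
def cumulative_gain(y_true, scores):
    """Return (fraction of population, fraction of positives captured) points."""
    ranked = [label for _, label in
              sorted(zip(scores, y_true), key=lambda p: p[0], reverse=True)]
    total_pos = sum(ranked)
    n = len(ranked)
    points = [(0.0, 0.0)]  # nothing targeted, nothing captured
    captured = 0
    for i, label in enumerate(ranked, start=1):
        captured += label
        points.append((i / n, captured / total_pos))
    return points


# Made-up labels and scores for illustration
y_true = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.95, 0.90, 0.85, 0.80, 0.60, 0.55, 0.50, 0.30, 0.20, 0.10]

points = cumulative_gain(y_true, scores)
print(points[2])  # (0.2, 0.5): top 20% of cases holds half of the positives
print(points[5])  # (0.5, 0.75)
```

A random ordering would follow the diagonal (x = y) on average, so the gap between the curve and the diagonal shows how much the ranking helps.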

Cumulative Response Curve

  • X-axis: Percent of data (cumulative population)
  • Y-axis: Cumulative response (running count of actual positives captured)
  • Application: Evaluates model performance by showing how many true positives are captured as more of the population is included, applicable in direct marketing to optimize campaign reach.

ROC Curve

  • X-axis: False Positive Rate (FPR)
  • Y-axis: True Positive Rate (TPR or Sensitivity)
  • Application: Used to evaluate the trade-off between true positive and false positive rates at various thresholds, crucial in medical testing to balance sensitivity and specificity.
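
An ROC curve can be traced by lowering the decision threshold one distinct score at a time and recording (FPR, TPR) at each step. The sketch below is one straightforward way to do this, not a reference implementation; `roc_points` and `auc` are hypothetical helper names, labels are assumed to be 0/1, and both classes are assumed to be present. The area under the curve is approximated with the trapezoidal rule.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points, one per distinct score threshold."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    ranked = sorted(zip(scores, y_true), key=lambda p: p[0], reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Cases sharing a score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve via the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up labels and scores for illustration
y_true = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.95, 0.90, 0.85, 0.80, 0.60, 0.55, 0.50, 0.30, 0.20, 0.10]

pts = roc_points(y_true, scores)
print(round(auc(pts), 4))  # about 0.8333
```

The area matches the share of positive-negative pairs in which the positive case scores higher (20 of 24 pairs in this data).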

Precision-Recall Curve

  • X-axis: Recall (True Positive Rate)
  • Y-axis: Precision (Positive Predictive Value)
  • Application: Focuses on the trade-off between recall and precision, especially useful under class imbalance, as in fraud detection or medical diagnosis, where performance on the rare positive class matters most.
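
The points of a precision-recall curve can be sketched by ranking cases by score and treating the top k as predicted positive for each k. This minimal example assumes 0/1 labels, at least one positive, and distinct scores (ties are not grouped); `pr_points` is a hypothetical helper name.

```python
def pr_points(y_true, scores):
    """Return (recall, precision) after each case in descending score order."""
    total_pos = sum(y_true)
    ranked = sorted(zip(scores, y_true), key=lambda p: p[0], reverse=True)
    points = []
    tp = 0
    for k, (_, label) in enumerate(ranked, start=1):
        tp += label  # true positives among the top k cases
        points.append((tp / total_pos, tp / k))
    return points


# Made-up labels and scores for illustration
y_true = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.95, 0.90, 0.85, 0.80, 0.60, 0.55, 0.50, 0.30, 0.20, 0.10]

points = pr_points(y_true, scores)
print(points[0])   # (0.25, 1.0)
print(points[3])   # (0.75, 0.75)
print(points[-1])  # (1.0, 0.4)
```

When every case is predicted positive, recall is 1.0 and precision falls to the share of positives in the data (0.4 here). Unlike FPR, precision does not involve true negatives, which is why this curve stays informative when negatives vastly outnumber positives.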