Morgan May

Project Title

Research on Calibration of Deep Learning Models

Project Description

We work on finding new methods for improving the calibration, and hence the reliability, of deep learning models. We address the problem of uncalibrated models, which may be overconfident or underconfident in their predictions and can therefore lead to unreliable outcomes in real-world applications. We review several existing methods for calibrating deep learning models, such as temperature scaling, label smoothing, Bayesian neural networks, subnetwork ensembling, and data augmentation.
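
As a concrete illustration of one of these methods, below is a minimal sketch of post-hoc temperature scaling in Python (PyTorch). The variable names val_logits and val_labels are placeholders for held-out validation data and are not taken from any of the cited papers; the use of Adam to fit the temperature is one possible choice, not a prescribed one.

import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, lr=0.01, steps=200):
    """Learn a single scalar temperature T > 0 that minimizes NLL on held-out data."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log(T) so that T stays positive
    optimizer = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        optimizer.step()
    return log_t.exp().item()

# At test time, calibrated probabilities are softmax(logits / T); the argmax is
# unchanged, so accuracy is preserved while confidences are rescaled.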

[1] presents a survey of methods for calibrating deep learning models, grouped into three categories: post-hoc methods, in-training methods, and Bayesian methods. It also discusses open challenges in calibration research, such as evaluation metrics, uncertainty quantification, adversarial robustness, and out-of-distribution detection.

[2] compares different methods for improving the confidence calibration of deep learning models, where calibration is understood as the agreement between a model's predicted probability and the observed frequency of the true class. It evaluates methods such as temperature scaling and label smoothing on natural image classification and lung cancer risk estimation tasks with both balanced and imbalanced training sets.
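
This notion of agreement is typically quantified with the expected calibration error (ECE). The sketch below shows one common binned estimate of ECE; the inputs probs (an N x K array of softmax outputs) and labels are assumed placeholders, and the bin count of 15 is an arbitrary but common choice.

import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(accuracies[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight each bin's gap by its share of samples
    return ece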

[3] studies the performance of deep learning models for class-imbalanced medical image classification. It investigates the degree of imbalance in the training dataset, different calibration methods, and two classification thresholds: the default threshold of 0.5 and the optimal threshold derived from precision-recall curves. It concludes that, across varying degrees of imbalance, calibrated probabilities yield significantly better performance than uncalibrated ones at the default threshold of 0.5; at the PR-guided threshold, however, these gains are not significant.
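
For context, a PR-guided threshold is usually obtained by sweeping the precision-recall curve and picking the operating point that maximizes some criterion. The sketch below uses F1 as that criterion; this is a generic illustration with assumed inputs y_true (binary labels) and y_score (predicted probabilities), not the exact procedure of [3].

import numpy as np
from sklearn.metrics import precision_recall_curve

def pr_guided_threshold(y_true, y_score):
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # precision and recall have one more entry than thresholds; drop the last point
    f1 = 2 * precision[:-1] * recall[:-1] / np.clip(precision[:-1] + recall[:-1], 1e-12, None)
    return thresholds[np.argmax(f1)]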

[4] investigates how well deep learning models can predict the probabilities of different outcomes for classification problems in mechanics. It compares several methods for improving the calibration of deep learning models, such as ensemble averaging and temperature scaling.
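
Ensemble averaging in this context simply means averaging the softmax probabilities of several independently trained models before making a prediction, as in the sketch below; models and inputs are illustrative placeholders.

import torch
import torch.nn.functional as F

@torch.no_grad()
def ensemble_predict(models, inputs):
    probs = [F.softmax(m(inputs), dim=1) for m in models]
    return torch.stack(probs).mean(dim=0)  # averaged probabilities tend to be better calibrated than any single member's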

[5] studies the effect of combining ensembles with data augmentation in multi-class image classification problems and concludes that subnetwork ensembling with data augmentation improves model calibration and robustness. It also suggests that combining subnetwork ensembles with MixUp or CutMix improves accuracy without harming calibration.
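
As a reminder of what the augmentation does, the sketch below shows a minimal MixUp step: inputs and one-hot labels are convexly combined within a batch. The alpha value and variable names are illustrative defaults, not taken from the cited paper.

import numpy as np
import torch

def mixup_batch(x, y_onehot, alpha=0.2):
    lam = np.random.beta(alpha, alpha)          # mixing coefficient drawn from Beta(alpha, alpha)
    index = torch.randperm(x.size(0))           # random pairing of examples within the batch
    mixed_x = lam * x + (1 - lam) * x[index]
    mixed_y = lam * y_onehot + (1 - lam) * y_onehot[index]
    return mixed_x, mixed_y                     # train with cross-entropy on the soft targets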

Some possible directions for continuing the research are:

  • Exploring more subnetwork ensembling methods, e.g., pruning, distillation, and dropout. This could lead to more efficient and robust models that are well-calibrated and generalize better to unseen data.
  • Applying the calibration methods to other types of data and tasks, such as regression, segmentation, detection, and generation. This could improve the reliability and quality of the outputs of these tasks, which are often used in critical applications such as autonomous driving, medical diagnosis, and natural language generation.
  • Investigating the theoretical aspects of calibration, such as the relationship between calibration and generalization, the trade-off between accuracy and calibration, and the optimal calibration method for different scenarios. This could help in understanding the underlying principles and mechanisms of calibration and developing more principled and effective methods.

[1] arXiv:2308.01222v1

[2] arXiv:2206.08833v1

[3] https://doi.org/10.1371/journal.pone.0262838

[4] https://onlinelibrary.wiley.com/doi/10.1111/exsy.13252

[5] arXiv:2212.00881v2