Working Papers

Imputation Strategies Under Clinical Presence: Impact on Algorithmic Fairness

Under review at Management Science

Machine learning risks reinforcing biases present in data and, as we argue in this work, in what is absent from data. In healthcare, biases have marked medical history, leading to unequal care for marginalised groups. Patterns in missing data often reflect these group discrepancies, but the algorithmic fairness implications of group-specific missingness are not well understood. Despite its potential impact, imputation is often an overlooked preprocessing step: attention is placed on reducing reconstruction error and overall performance, ignoring how imputation can affect groups differently. Our work studies how imputation choices affect reconstruction errors across groups and the algorithmic fairness properties of downstream predictions. First, we provide a structured view of the relationship between clinical presence mechanisms and group-specific missingness patterns. Then, we theoretically demonstrate that the optimal choice between two common imputation strategies is under-determined, both in terms of group-specific imputation quality and of the gap in quality across groups. In particular, the use of group-specific imputation strategies may counter-intuitively reduce data quality for marginalised groups. We complement these theoretical results with simulations and real-world empirical evidence showing that imputation choices influence group-specific data quality and downstream algorithmic fairness, and that no imputation strategy consistently reduces group disparities in reconstruction error or predictions. Importantly, our results show that current practices may be detrimental to health equity, as imputation strategies that perform similarly at the population level can affect marginalised groups differently. Finally, we propose recommendations for mitigating inequities that may stem from this overlooked step of the machine learning pipeline.

Recommended citation: Jeanselme, V., De-Arteaga, M., Zhang, Z., Barrett, J., Tom, B. Imputation Strategies Under Clinical Presence: Impact on Algorithmic Fairness. https://arxiv.org/abs/2208.06648
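To make the contrast concrete, below is a minimal sketch, assuming a single Gaussian feature whose distribution and missingness rate differ by group, of the two common strategies the abstract compares: imputing with one population-level mean versus one mean per group. The data-generating process, variable names, and rates here are illustrative assumptions of ours, not the paper's experimental setup.

```python
# Illustrative sketch (not the paper's code): population-level vs
# group-specific mean imputation under group-dependent missingness.
import numpy as np

rng = np.random.default_rng(0)

# Two groups with different feature distributions and missingness rates,
# a toy stand-in for group-specific clinical presence patterns.
n = 5000
group = rng.integers(0, 2, size=n)               # 0 = majority, 1 = marginalised
x_true = rng.normal(np.where(group == 0, 0.0, 1.5), 1.0)
missing = rng.random(n) < np.where(group == 0, 0.2, 0.5)

def group_rmse(x_imputed):
    """Reconstruction error on imputed entries, reported per group."""
    return {g: np.sqrt(np.mean((x_imputed - x_true)[missing & (group == g)] ** 2))
            for g in (0, 1)}

# Strategy A: one population-level mean for every missing entry.
x_pop = x_true.copy()
x_pop[missing] = x_true[~missing].mean()

# Strategy B: one mean per group (group-specific imputation).
x_grp = x_true.copy()
for g in (0, 1):
    x_grp[missing & (group == g)] = x_true[~missing & (group == g)].mean()

print("population mean :", group_rmse(x_pop))
print("group-specific  :", group_rmse(x_grp))
```

In this particular toy setting the group-specific means reduce error for the marginalised group, but the abstract's theoretical point is precisely that this ordering is under-determined: under other presence mechanisms the group-specific strategy can degrade data quality for that same group, so neither strategy dominates in general.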

Leveraging Expert Consistency to Improve Algorithmic Decision Support

Under review at Management Science

Machine learning (ML) is increasingly used to support high-stakes decisions, a trend owed in part to its promise of superior predictive power relative to human assessment. However, there is frequently a gap between decision objectives and what is captured in the observed outcomes used as labels to train ML models. As a result, ML models may fail to capture important dimensions of decision criteria, hampering their utility for decision support. In this work, we explore the use of historical expert decisions as a rich, yet imperfect, source of information that is commonly available in organizational information systems, and show that it can be leveraged to bridge the gap between decision objectives and algorithm objectives. We consider the problem of indirectly estimating expert consistency when each case in the data is assessed by a single expert, and propose an influence function-based methodology as a solution to this problem. We then incorporate the estimated expert consistency into a predictive model through a training-time label amalgamation approach. This approach allows ML models to learn from experts when there is inferred expert consistency, and from observed labels otherwise. We also propose alternative ways of leveraging inferred consistency via hybrid and deferral models. In our empirical evaluation, focused on the context of child maltreatment hotline screenings, we show that (1) there are high-risk cases whose risk is considered by the experts but not wholly captured in the target labels used to train a deployed model, and (2) the proposed approach significantly improves precision for these cases.

Recommended citation: De-Arteaga, M., Jeanselme, V., Dubrawski, A., Chouldechova, A. Leveraging Expert Consistency to Improve Algorithmic Decision Support. https://arxiv.org/abs/2101.09648
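For intuition on the label amalgamation step, here is a minimal sketch with made-up inputs: in the paper the per-case consistency scores come from the influence function-based estimator (not reproduced here), whereas below they are random placeholders, and the threshold `tau` is a hypothetical choice of ours rather than the authors' specification.

```python
# Illustrative sketch (not the authors' implementation): training-time label
# amalgamation that learns from historical expert decisions where inferred
# expert consistency is high, and from observed outcomes otherwise.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Toy stand-ins: features, the observed (proxy) outcome, and the historical
# expert decision for each case. All three are synthetic placeholders.
n, d = 2000, 5
X = rng.normal(size=(n, d))
y_observed = rng.integers(0, 2, size=n)   # proxy outcome used as training label
y_expert = rng.integers(0, 2, size=n)     # what the expert historically decided

# Assumed input: an estimated probability that experts would agree on each
# case. The paper infers this indirectly; here it is simply random.
consistency = rng.random(n)

# Amalgamation: take the expert label where inferred consistency exceeds a
# threshold, and the observed label otherwise, then train on the result.
tau = 0.8                                  # hypothetical threshold
y_amalgamated = np.where(consistency >= tau, y_expert, y_observed)

model = LogisticRegression().fit(X, y_amalgamated)
print(model.predict_proba(X[:3]))
```

A hard threshold is only one reading of "learn from experts when there is inferred expert consistency"; a softer variant would instead weight the two label sources by the consistency estimate itself, which is closer in spirit to the hybrid models the abstract mentions.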