AI model generalization assessed on unlabeled pulmonary function data from daily use
Rationale: AI has shown promise in respiratory care, but how well models generalize to new data remains underexplored. We aim to investigate how representative the training data of an AI diagnostic model for pulmonary function testing (ArtiQ.PFT) are of its clinical usage.
Methods: We compared the training dataset (n=1430) against clinical usage data collected from March 2018 to January 2024 (n=104490). Univariate analysis covered demographic and PFT data (spirometry, lung volumes, diffusion). We estimated multivariate generalization using the Data Representativeness Criterion (DRC; Schat et al.), which estimates performance on unseen data from dataset similarity. A DRC below 1 indicates better generalization.
Results: In the training dataset and in clinical usage, 52% and 57% of participants were male, with mean BMIs of 26 (SD=5.4) and 27 (SD=5.7) kg/m² and mean ages of 54 (SD=16) and 61 (SD=15) years, respectively. In the training dataset, 43% were active smokers, compared with 36% in clinical usage. Kernel Density Estimation showed similar distributions of PFT parameters across age groups (Figure 1). The mean DRC value was 0.16 (SD=0.09).
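As an illustration of the kind of univariate comparison described above, the following is a minimal sketch of a Gaussian kernel density estimate applied to two samples. It is not the authors' implementation: the function names, the normal-reference bandwidth rule, and the overlap coefficient used to summarize similarity are assumptions made for this example, and the ages are synthetic values drawn to mimic the reported means and SDs rather than real patient data.

```python
import math
import random


def gaussian_kde(samples, bandwidth=None):
    """Return a function that estimates the density of `samples`
    using a Gaussian kernel (hypothetical helper for this sketch)."""
    n = len(samples)
    if bandwidth is None:
        mean = sum(samples) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in samples) / (n - 1))
        # Normal-reference rule of thumb for the bandwidth (an assumption here)
        bandwidth = 1.06 * sd * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def overlap_coefficient(f, g, lo, hi, steps=400):
    """Integrate min(f, g) over [lo, hi] with the trapezoidal rule.
    Values close to 1 mean the two densities are nearly identical."""
    h = (hi - lo) / steps
    vals = [min(f(lo + i * h), g(lo + i * h)) for i in range(steps + 1)]
    return h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))


rng = random.Random(0)
# Synthetic ages only: drawn from normal distributions with the reported
# means and SDs, which is itself a simplifying assumption.
train_ages = [rng.gauss(54, 16) for _ in range(300)]
usage_ages = [rng.gauss(61, 15) for _ in range(300)]

kde_train = gaussian_kde(train_ages)
kde_usage = gaussian_kde(usage_ages)
overlap = overlap_coefficient(kde_train, kde_usage, -20, 140)
print(f"Estimated overlap of age densities: {overlap:.2f}")
```

The overlap coefficient is only one possible summary of how closely two estimated densities agree; a visual comparison of the curves, as in Figure 1, conveys the same information qualitatively.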
Conclusions: Univariate and multivariate analyses indicate strong similarity between the training data and clinical usage, supporting the validity of AI interpretation of PFT in real clinical settings. Further research is needed to monitor usage across specific regions and sub-populations.
Authors: A. Elmahy¹, J. Maes², E. Smets², M. De Vos¹, M. Topalovic²
Affiliations:
1. ArtiQ NV & Department of Electrical Engineering (ESAT – STADIUS), KU Leuven – Leuven (Belgium)
2. ArtiQ NV – Leuven (Belgium)