Background: Early identification of individuals at risk for paroxysmal atrial fibrillation (PAF) remains challenging, particularly when conventional electrocardiography (ECG) demonstrates sinus rhythm. Artificial intelligence (AI)–enabled ECG analysis has emerged as a novel approach to detect subclinical electrical signatures predictive of future atrial fibrillation.
Methods: A systematic review and meta-analysis of external validation studies was conducted to evaluate the performance of AI-based ECG models applied to sinus rhythm ECGs for predicting incident PAF. Studies reporting diagnostic performance metrics were included. Pooled estimates of accuracy, precision, recall, and F1 score were calculated using random-effects restricted maximum likelihood models. Between-study heterogeneity was assessed using the I² statistic.
Results: Six external validation studies were included. The pooled accuracy was 72.32% (95% CI: 59.96–84.67), precision was 72.32% (95% CI: 60.79–83.85), recall was 77.53% (95% CI: 70.49–84.56), and the pooled F1 score was 67.22% (95% CI: 51.11–83.33). Substantial heterogeneity was observed across all analyses (I² > 99%), reflecting variability in study populations, ECG acquisition methods, and AI model architectures.
Conclusions: AI-enabled ECG analysis in sinus rhythm demonstrates moderate predictive performance for future PAF but is characterized by marked heterogeneity across studies. Standardized reporting, robust external validation, and prospective clinical impact studies are required before widespread clinical adoption.
Atrial fibrillation (AF) is the most common sustained cardiac arrhythmia worldwide and represents a major contributor to cardiovascular morbidity, mortality, and healthcare utilization [1]. Its prevalence increases markedly with advancing age and the burden of comorbidities such as hypertension, diabetes mellitus, heart failure, and structural heart disease [2]. Importantly, AF is a potent risk factor for ischemic stroke, heart failure hospitalization, cognitive decline, and all-cause mortality. Early identification of individuals at risk for AF, particularly before the onset of overt or persistent arrhythmia, remains a critical unmet clinical need, as timely initiation of rhythm monitoring, anticoagulation, and risk factor modification can substantially reduce adverse outcomes [3].
Paroxysmal atrial fibrillation (PAF) poses a unique diagnostic challenge [4]. By definition, PAF is intermittent and often asymptomatic, with episodes that may be brief and self-terminating [5]. Standard 12-lead electrocardiography (ECG), which is typically recorded during sinus rhythm in routine clinical practice, frequently fails to capture these transient arrhythmic events [6]. Even extended rhythm monitoring strategies such as Holter monitoring or event recorders have limited sensitivity, particularly when AF burden is low [7]. Consequently, a substantial proportion of individuals with PAF remain undiagnosed until they present with complications such as stroke [8]. This diagnostic gap has driven growing interest in novel strategies that can identify latent susceptibility to AF even when the ECG demonstrates normal sinus rhythm.
In parallel with these clinical challenges, advances in artificial intelligence (AI), particularly machine learning (ML) and deep learning (DL), have transformed the analysis of biomedical signals [9]. The ECG, a low-cost, noninvasive, and ubiquitously available diagnostic tool, has emerged as a promising substrate for AI-based analysis [10]. Unlike traditional ECG interpretation, which relies on visually discernible features and predefined criteria, AI models can detect complex, high-dimensional patterns within raw ECG waveforms that may be imperceptible to human observers [11]. These latent signatures may reflect subtle atrial structural remodeling, conduction heterogeneity, autonomic influences, or electrophysiological alterations that precede the clinical manifestation of AF [12].
Over the past decade, multiple AI-enabled ECG models have been developed to identify AF or predict its future occurrence from ECGs recorded during sinus rhythm [13]. Initial studies demonstrated that deep neural networks, particularly convolutional neural networks, could accurately distinguish individuals with a history of AF from those without, even when analyzing sinus rhythm ECGs alone [14]. Subsequent investigations extended this paradigm to the prediction of incident or paroxysmal AF, raising the possibility that a single resting ECG could serve as a powerful screening and risk stratification tool [15]. Such an approach has substantial clinical appeal, as it could be seamlessly integrated into routine workflows across outpatient clinics, emergency departments, and population-level screening programs [16].
However, despite the growing volume of literature in this field, several important limitations remain. Many early AI-ECG studies were conducted in single-center cohorts, often using retrospective designs and internally derived validation datasets [17]. While these studies demonstrated promising discriminatory performance, internal validation alone is insufficient to establish clinical generalizability. Model performance may degrade substantially when applied to new populations with different demographic characteristics, comorbidity profiles, ECG acquisition systems, or AF ascertainment strategies [18]. External validation—defined as testing model performance in independent cohorts not used during model development—is therefore essential before such tools can be considered for widespread clinical adoption [19].
Furthermore, the reported performance metrics across studies are heterogeneous. Some investigations report area under the receiver operating characteristic curve (AUC) alone, whereas others provide sensitivity, specificity, predictive values, or complete 2×2 contingency data. Prediction horizons, reference standards for AF diagnosis, duration and intensity of rhythm monitoring, and definitions of paroxysmal or incident AF vary considerably [20]. Differences also exist in ECG characteristics, including the use of standard 12-lead versus single-lead recordings, sampling frequencies, and preprocessing techniques [21]. Collectively, this heterogeneity complicates direct comparison across studies and obscures the true diagnostic and predictive accuracy of AI-based ECG approaches in sinus rhythm [22].
To date, although narrative reviews and methodological commentaries have summarized the potential of AI-enabled electrocardiography for AF detection, a focused quantitative synthesis of external validation studies is lacking [23]. In particular, no comprehensive meta-analysis has systematically evaluated the diagnostic accuracy of AI-based ECG models applied to sinus rhythm ECGs for the prediction of paroxysmal or future AF, while explicitly restricting inclusion to externally validated cohorts [24]. Such an analysis is essential to provide pooled estimates of performance, explore sources of heterogeneity, and assess the robustness and clinical applicability of these models across diverse settings.
Accordingly, the present study aims to systematically review and meta-analyze external validation studies evaluating artificial intelligence–based electrocardiography in sinus rhythm for the prediction of paroxysmal or incident atrial fibrillation. By synthesizing available evidence using established diagnostic accuracy meta-analytic methods and rigorously assessing risk of bias and applicability, this review seeks to clarify the current state of the field, identify gaps in evidence, and inform future research and clinical translation of AI-enabled ECG technologies for AF risk prediction.
Literature Search
A comprehensive literature search was conducted across PubMed/MEDLINE, Embase, Scopus, and Web of Science from database inception to the most recent update. Search terms combined controlled vocabulary and free-text keywords related to artificial intelligence, machine learning, deep learning, electrocardiography, and atrial fibrillation, with specific emphasis on sinus rhythm and prediction or detection of paroxysmal or incident atrial fibrillation. Reference lists of relevant reviews and included studies were manually screened to identify additional eligible articles. Only studies involving human participants and published in English were considered. The complete search strategy is provided in the Supplementary Appendix. The review was registered with PROSPERO (CRD420251272967) and followed the PRISMA guideline [25].
Study Selection and Data Extraction
All records identified through the database search were imported into reference management software, and duplicate citations were removed. Two reviewers independently screened titles and abstracts for eligibility based on predefined inclusion and exclusion criteria. Full-text articles were subsequently assessed for inclusion, with disagreements resolved by consensus or consultation with a third reviewer. Eligible studies included external validation cohorts evaluating artificial intelligence–based electrocardiography performed during sinus rhythm for prediction of paroxysmal or incident atrial fibrillation. Data were independently extracted using a standardized, prepiloted form capturing study design, population characteristics, ECG acquisition parameters, AI model architecture, reference standards for atrial fibrillation diagnosis, and performance metrics. When required data were unclear or incomplete, corresponding authors were contacted for clarification.
Statistical Analysis and Risk of Bias
Meta-analyses were performed using random-effects models to account for between-study heterogeneity. When available, 2×2 contingency data were extracted and pooled using bivariate or hierarchical summary receiver operating characteristic models. Otherwise, pooled estimates of area under the receiver operating characteristic curve with 95% confidence intervals were calculated. Statistical heterogeneity was assessed using the I² statistic. Prespecified subgroup analyses were conducted based on ECG type and atrial fibrillation ascertainment. All analyses were conducted using Stata version 18.0. Risk of bias was independently assessed using the ROBINS-I tool across seven domains, with each study categorized as low, moderate, serious, or critical risk of bias [26].
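As an illustration of how a random-effects pooled estimate and the I² statistic are obtained, the following is a minimal Python sketch. It is not the Stata procedure used in this review: it substitutes the simpler, closed-form DerSimonian-Laird estimator of between-study variance for REML, and the function name `pool_random_effects` and the example inputs are hypothetical.

```python
import math


def pool_random_effects(estimates, std_errors):
    """Pool study-level estimates with a DerSimonian-Laird random-effects model.

    NOTE: illustrative stand-in only. The review reports REML estimates from
    Stata; DerSimonian-Laird is used here because it has a closed form.
    """
    k = len(estimates)
    # Inverse-variance (fixed-effect) weights
    w = [1.0 / se ** 2 for se in std_errors]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum_w

    # Cochran's Q and its degrees of freedom
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    df = k - 1

    # Between-study variance (tau^2), truncated at zero
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # I^2: percentage of total variability attributed to heterogeneity
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # Random-effects weights add tau^2 to each within-study variance
    w_re = [1.0 / (se ** 2 + tau2) for se in std_errors]
    pooled = sum(wi * yi for wi, yi in zip(w_re, estimates)) / sum(w_re)
    se_pooled = math.sqrt(1.0 / sum(w_re))

    return {
        "pooled": pooled,
        "ci_low": pooled - 1.96 * se_pooled,
        "ci_high": pooled + 1.96 * se_pooled,
        "tau2": tau2,
        "i2": i2,
    }


# Hypothetical inputs: two precise studies with very different accuracies
result = pool_random_effects([0.60, 0.80], [0.01, 0.01])
print(result)
```

With large cohorts the within-study standard errors are small, so even moderate differences between study estimates produce a large Q relative to its degrees of freedom, which is how I² values above 99% can arise.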
Demographics
A total of six studies [27-32] were included in the final quantitative synthesis, comprising an aggregate population of 441,593 participants. Of these, 232,491 (52.6%) were male and 208,736 (47.4%) were female. The pooled mean age of the study population was 66.9 years, reflecting an older cohort at increased risk for atrial fibrillation. The mean follow-up duration across studies was 12 months. Regarding baseline comorbidities, hypertension was the most prevalent condition, present in 91,057 participants, followed by prior stroke in 71,080, diabetes mellitus in 49,280, heart failure in 26,365, and coronary artery disease in 7,338 participants. Overall, the included studies represented a high-risk cardiovascular population undergoing artificial intelligence–based electrocardiographic assessment during sinus rhythm.
Pooled Accuracy of AI-Based ECG Models for Prediction of Paroxysmal Atrial Fibrillation
This forest plot summarizes the accuracy of artificial intelligence–enabled electrocardiographic models developed to predict future paroxysmal atrial fibrillation from sinus rhythm ECG recordings across external validation studies. Each study is represented by a square indicating the point estimate of accuracy, with the horizontal line denoting the corresponding 95% confidence interval; the size of the square reflects the relative weight of the study in the meta-analysis. The pooled estimate is depicted by the diamond and was calculated using a random-effects restricted maximum likelihood (REML) model to account for between-study variability. Overall, the meta-analysis demonstrated a pooled accuracy of 72.32% (95% CI: 59.96–84.67). Notably, substantial heterogeneity was observed among the included studies (I² = 99.96%), highlighting considerable variation in model performance across different populations, ECG acquisition settings, and AI architectures.
Pooled Precision of AI-Based ECG Models for Prediction of Paroxysmal Atrial Fibrillation
This forest plot illustrates the precision (positive predictive value) of artificial intelligence–enabled electrocardiographic models applied to sinus rhythm ECGs for the prediction of paroxysmal atrial fibrillation across external validation studies. Individual study estimates are represented by squares with corresponding 95% confidence intervals, while the size of each square reflects the relative statistical weight of the study. The pooled precision estimate, depicted by the diamond, was calculated using a random-effects restricted maximum likelihood (REML) model to account for between-study variability. Overall, the pooled precision was 72.32% (95% CI: 60.79–83.85). Substantial heterogeneity was observed among studies (I² = 99.99%), indicating marked variability in positive predictive performance across different populations, ECG acquisition methods, and AI model architectures.
Pooled Recall (Sensitivity) of AI-Based ECG Models for Prediction of Paroxysmal Atrial Fibrillation
This forest plot depicts the recall (sensitivity) of artificial intelligence–enabled electrocardiographic models applied to sinus rhythm ECGs for the prediction of paroxysmal atrial fibrillation across external validation studies. Each square represents the study-specific recall estimate with corresponding 95% confidence intervals, while the size of the square reflects the relative contribution of each study to the pooled analysis. The diamond represents the overall pooled recall, estimated using a random-effects restricted maximum likelihood (REML) model to account for between-study heterogeneity. The pooled recall was 77.53% (95% CI: 70.49–84.56), indicating a moderate-to-high ability of AI-based ECG models to correctly identify individuals who subsequently developed paroxysmal atrial fibrillation. Considerable heterogeneity was observed across studies (I² = 99.83%), underscoring variability in sensitivity related to differences in study populations, ECG acquisition protocols, and AI model architectures.
Pooled F1 Score of AI-Based ECG Models for Prediction of Paroxysmal Atrial Fibrillation
This forest plot presents the F1 score of artificial intelligence–enabled electrocardiographic models used for predicting paroxysmal atrial fibrillation from sinus rhythm ECGs across external validation studies. The F1 score represents the harmonic mean of precision and recall, providing a balanced measure of model performance, particularly in the context of class imbalance. Each square denotes the study-specific F1 score with corresponding 95% confidence intervals, and the size of the square reflects the relative weight assigned to each study in the meta-analysis. The pooled F1 score, represented by the diamond, was estimated using a random-effects restricted maximum likelihood (REML) model. The overall pooled F1 score was 67.22% (95% CI: 51.11–83.33). Marked between-study heterogeneity was observed (I² = 99.97%), indicating substantial variability in the balance between sensitivity and precision across different AI models, ECG datasets, and study populations.
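The four metrics pooled in this review can all be derived from a study's 2×2 contingency table. The following is a minimal Python sketch of those definitions; the function name `classification_metrics` and the counts in the example are hypothetical and are not taken from any included study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from 2x2 counts.

    tp/fp/fn/tn are true positives, false positives, false negatives,
    and true negatives. Undefined ratios are returned as 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # positive predictive value
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # sensitivity
    # F1 is the harmonic mean of precision and recall
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical cohort: 100 future PAF cases among 1,000 sinus rhythm ECGs
metrics = classification_metrics(tp=80, fp=20, fn=20, tn=880)
print(metrics)
```

Because each metric is pooled separately across studies, the pooled F1 score need not equal the harmonic mean of the pooled precision and pooled recall (which would be roughly 74.8% for the values reported here).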
This systematic review and meta-analysis demonstrates that artificial intelligence–enabled electrocardiography applied to sinus rhythm ECGs shows moderate predictive performance for identifying individuals at risk of developing paroxysmal atrial fibrillation. The pooled accuracy, precision, recall, and F1 score indicate meaningful discriminatory capability, supporting the concept that subclinical electrical remodeling can be detected prior to overt arrhythmia. However, substantial between-study heterogeneity highlights considerable variability in model performance across populations, ECG acquisition settings, and AI architectures. Future studies should prioritize standardized reporting, robust external validation, and prospective outcome-driven evaluations before routine clinical implementation.
Conflict of Interest
The authors certify that there is no conflict of interest with any financial organization regarding the material discussed in the manuscript.
Funding
The authors report no involvement in the research by the sponsor that could have influenced the outcome of this work.
Authors’ Contributions
All authors contributed equally to the manuscript and read and approved the final version of the manuscript.
Acknowledgement
This paper is the collaborative work of all authors under the research mentorship of BIR (Biomedical and International Research). All authors acknowledge this mentorship for this meta-analysis.
39. Awwad A, Parashar Y, Bagchi S, Siddiqui SA, Ajari O, deFilippi C. Preclinical screening for cardiovascular disease with high-sensitivity cardiac troponins: ready, set, go? Front Cardiovasc Med. 2024 Nov 14;11:1350573. doi: 10.3389/fcvm.2024.1350573. PMID: 39610975; PMCID: PMC11602307.