Examination of disease trajectories through longitudinal clinical observation of symptoms led to the development of modern classification systems for mental disorders, which differentiate categories of serious mental illness (SMI), including schizophrenia (SCZ), bipolar disorder (BPD) and severe major depressive disorder (MDD). While such classification systems advocate a parsimonious approach in which patients are assigned unique diagnoses, this conflicts with the clinical reality that many features of psychiatric illness (such as suicidality or psychosis) are present across several diagnoses. Reference Forbes, Neo, Nezami, Fried, Faure and Michelsen1 Furthermore, while traditional classification systems reflect a longitudinal perspective, current research on SMI relies primarily on cross-sectional assessments, in which the only available trajectory information is supplied by patient recall. Reference Kendler2,Reference Hyman3 This lack of detailed longitudinal data may be a factor contributing to the heterogeneity observed in studies based on current SMI categories, Reference Regier, Narrow, Clarke, Kraemer, Kuramoto and Kuhl4 notably in cross-disorder genetic analyses. Reference Anttila, Bulik-Sullivan, Finucane, Walters and Bras5 In addition, while cross-sectional data may support the notion that each patient may be characterised according to a unique diagnosis, it may take several years after initial presentations for most patients with SMI to achieve a stable diagnosis. Reference Peralta, Janda, García de Jalón, Moreno-Izco, Sánchez-Torres and Cuesta6,Reference Scott, Graham, Yung, Morgan, Bellivier and Etain7
Recent studies using longitudinal data collected from participants in national registries, Reference Plana-Ripoll, Pedersen, Holtz, Benros, Dalsgaard and de Jonge8,Reference Høj Jørgensen, Osler, Jorgensen and Jorgensen9 precision health initiatives and birth cohorts Reference Caspi, Houts, Ambler, Danese, Elliott and Hariri10 have begun to identify transdiagnostic risk factors and to describe patterns of variation across diagnoses over time. These resources, which are mainly limited to upper-income countries (UICs), typically contain only sparse data for individual clinical features, such as symptoms and behaviours. In contrast, electronic health record (EHR) data, available in both UICs and in many low- and middle-income countries (LMICs), may contain extensive descriptions of such clinical features during the periods when patients experience them. As we demonstrate here for an institution located in a LMIC, EHR databases thus facilitate investigations of features that are important both transdiagnostically and longitudinally. These features typically vary by diagnosis and gender, and their patterns of emergence and stability unfold over years. The rich clinical data in EHRs enable investigation of these relationships and may be used to predict clinically important outcomes, such as the onset of psychosis, Reference Irving, Patel, Oliver, Colling, Pritchard and Broadbent11 the occurrence of suicidal behaviours Reference Tsui, Shi, Ruiz, Ryan, Biernesser and Iyengar12 or the stability of diagnoses. Reference Palomar-Ciria, Cegla-Schvartzman, Bello, Martínez-Alés, Migoya-Borja and Baca-García13
Method
EHR database
The Clínica San Juan de Dios in Manizales (CSJDM), Colombia, provides comprehensive mental healthcare to the one million inhabitants of the department (state) of Caldas. Reference Song, Ramírez, Okano, Service, de la Hoz and Díaz-Zuluaga14 For this study, we extracted structured EHR data collected between 2005 and 2022, including demographic information; duration, type and site of visits (in-patient, out-patient or emergency department); diagnostic codes (ICD-10 15 ); and unstructured data, consisting of free-text from clinical notes. For our analyses, we included all patients with at least one clinical note in their EHR, and excluded patients with missing gender information. We excluded visits that were outside the age range of 4–90, without a valid diagnostic code or with primary diagnostic codes outside of Chapter V of the ICD-10 (mental, behavioural and neurodevelopmental disorders categories); see Supplementary Fig. 1.
ICD-10 code extraction and cohort definition
Following each visit to the hospital, a patient is assigned a single primary ICD-10 diagnosis by their treating psychiatrist, generating a time-stamped sequence of diagnoses. We extracted this sequence for every patient and selected for analyses patients who had at least one primary diagnosis of SMI, defined here as BPD (F301, F302, F310–317), severe/recurrent MDD (F322, F323, F331–334), SCZ (F20X) and other chronic psychoses (delusional disorder; F22X, schizoaffective disorder; F25X) (Supplementary Table 1). In total, this cohort includes 22 447 patients with 157 003 visits (Supplementary Fig. 1B).
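As an illustration of the cohort definition above, the following minimal sketch maps a primary ICD-10 code to an SMI category and checks whether a patient qualifies for the cohort. The code lists are transcribed from the text; the function and variable names are hypothetical, and the handling of dotted codes (e.g. F32.2) is an assumption about how codes might be stored rather than a description of the CSJDM extraction pipeline.

```python
from typing import List, Optional

# Code lists transcribed from the cohort definition in the text.
# All names below are hypothetical and chosen for illustration only.
BPD_CODES = {"F301", "F302"} | {f"F31{i}" for i in range(0, 8)}  # F310-F317
MDD_CODES = {"F322", "F323"} | {f"F33{i}" for i in range(1, 5)}  # F331-F334
PREFIX_CATEGORIES = {
    "F20": "SCZ",              # F20X
    "F22": "OTHER_PSYCHOSIS",  # F22X, delusional disorder
    "F25": "OTHER_PSYCHOSIS",  # F25X, schizoaffective disorder
}


def smi_category(code: str) -> Optional[str]:
    """Return the SMI category of a primary ICD-10 code, or None if not SMI."""
    # Assumption: codes may arrive with or without a dot (F32.2 or F322).
    code = code.strip().upper().replace(".", "")
    if code in BPD_CODES:
        return "BPD"
    if code in MDD_CODES:
        return "MDD"
    return PREFIX_CATEGORIES.get(code[:3])


def in_smi_cohort(primary_codes: List[str]) -> bool:
    """A patient qualifies with at least one primary diagnosis of SMI."""
    return any(smi_category(c) is not None for c in primary_codes)
```

For example, a patient whose visits carry the codes F411 and F251 would enter the cohort through the schizoaffective code, whereas a patient with only F411 would not.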
Reliability of the current ICD-10 diagnosis and its association with clinical features
We quantified the reliability of the current ICD-10 diagnoses by comparing them to those obtained through a complete manual chart review (Supplementary Note 1; Supplementary Table 2). We then used a Spanish-language natural language processing (NLP) algorithm to extract clinical features from free-text notes (Supplementary Note 2; Supplementary Tables 3–6). Specifically, we focused on four transdiagnostic features that are routinely assessed in clinical practice: suicide attempts, suicidal ideation, delusions and hallucinations. To validate our NLP-derived features against established patterns, we tested three sets of relationships: (a) associations with diagnoses, for example, higher frequency of suicidal features in MDD and higher frequency of psychotic features in SCZ; (b) gender differences, where we expected to see higher frequency of suicidal ideation in females Reference Canetto and Sakinofsky16 and psychotic features in males; Reference Jongsma, Turner, Kirkbride and Jones17 and (c) feature co-occurrence, with the expectation of observing positive associations between suicidal and psychotic features. Reference Yates, Lång, Cederlöf, Boland, Taylor and Cannon18 Association tests were performed individually for each feature using logistic regression, adjusting for the length of patients’ records and history of hospital admission. For the third set of tests, we evaluated feature–feature associations using the same modelling framework, adding the presence of other features as predictors, and examined potential gender moderation through interaction terms (Supplementary Note 3). Data to test these associations were restricted to patients with at least two recorded clinical notes (Supplementary Fig. 1C; Supplementary Fig. 2).
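To make the association measure concrete, the sketch below computes an odds ratio with a Wald 95% confidence interval from a 2 × 2 table. This is a deliberate simplification: the analyses described above used logistic regression with adjustment for record length and admission history, whereas an unadjusted 2 × 2 odds ratio corresponds only to a logistic model with a single binary predictor and no covariates. The function name and the counts in the usage note are invented for illustration.

```python
import math
from typing import Tuple


def odds_ratio_wald(a: int, b: int, c: int, d: int,
                    z: float = 1.96) -> Tuple[float, float, float]:
    """Unadjusted odds ratio and Wald confidence interval from a 2 x 2 table.

    a: group 1 with the feature      b: group 1 without the feature
    c: group 2 with the feature      d: group 2 without the feature
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    odds_ratio = (a * d) / (b * c)
    # Standard error of the log odds ratio.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper
```

With made-up counts, `odds_ratio_wald(20, 80, 10, 90)` compares a group in which 20 of 100 patients show a feature against a group in which 10 of 100 do. An adjusted estimate would require fitting a regression model with the covariates named in the text.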
Diagnostic trajectories
To describe the diversity of diagnostic trajectories observed in the EHR database, we used the sequence of primary ICD-10 diagnoses extracted above. We defined two types of diagnostic changes: diagnostic switches and the addition of comorbidities. We use the term diagnostic switches to refer to changes between two psychiatric diagnoses that cannot, by definition, be held at the same time; specifically, the diagnoses in the ICD-10 F2 and F3 chapters (psychotic and mood disorders, respectively; Supplementary Note 4; Supplementary Table 7). By contrast, we use the term diagnostic comorbidities to refer to all other combinations of ICD-10 codes; comorbid psychiatric diagnoses can accumulate over time, without limit. We used this definition of diagnostic trajectories in patients with at least three recorded visits (Supplementary Fig. 1D) to estimate the proportion of patients with diagnostic switches, recorded comorbidities or both.
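The distinction between switches and comorbidities can be sketched as follows. The exact grouping of codes into mutually exclusive diagnoses is given in Supplementary Note 4 and Supplementary Table 7, which are not reproduced here, so this example assumes, purely for illustration, that the three-character ICD-10 category (e.g. F31, F32) within the F2 and F3 chapters serves as the unit of comparison. All function names are hypothetical.

```python
from typing import List, Tuple


def is_exclusive(code: str) -> bool:
    """True for codes in the F2 (psychotic) and F3 (mood) chapters."""
    return code[:2] in ("F2", "F3")


def classify_trajectory(codes: List[str]) -> Tuple[int, bool]:
    """Return (number of switches, any comorbidity) for time-ordered codes."""
    switches = 0
    last_exclusive = None
    comorbidity = False
    for raw in codes:
        code = raw.strip().upper().replace(".", "")
        if is_exclusive(code):
            # Assumption: the 3-character category stands in for the
            # mutually exclusive groups defined in the supplement.
            category = code[:3]
            if last_exclusive is not None and category != last_exclusive:
                switches += 1
            last_exclusive = category
        else:
            # Any code outside F2/F3 is treated as an added comorbidity.
            comorbidity = True
    return switches, comorbidity


def trajectory_label(codes: List[str]) -> str:
    """Summarise a trajectory as single, comorbidity, switch or both."""
    switches, comorbidity = classify_trajectory(codes)
    if switches and comorbidity:
        return "both"
    if switches:
        return "switch"
    if comorbidity:
        return "comorbidity"
    return "single"
```

Under these assumptions, the sequence F322, F322, F311 counts as one switch, while F411, F200, F200 counts as a comorbidity without a switch.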
Factors affecting diagnostic stability
We explored factors contributing to visit-to-visit diagnostic stability. First, we used a mixed-effect logistic regression to estimate the probability of switching diagnoses as a function of time, accounting for repeated patient observations. We measured time as the number of visits and, separately, as the number of years since the first visit. Then, we expanded this model to include ten additional factors: (a) patient’s gender, (b) age, (c) current primary diagnosis, (d) in-patient status, (e)–(h) the four NLP-derived clinical features, (i) receiving a ‘not otherwise specified’ (NOS) code and (j) previous diagnostic switching (Supplementary Note 5). A NOS code indicates diagnostic uncertainty in cases of atypical or confusing patient presentations or when temporal criteria are not yet met. Reference First, Rebello, Keeley, Bhargava, Dai and Kulygina19 As we expect NOS codes to be associated with a higher degree of diagnostic instability than other codes, we consider them to serve as a positive control.
To evaluate the possibility that clinical features extracted from the notes at a given visit anticipate specific diagnostic changes recorded in future visits, we tested whether psychosis features (delusions and hallucinations) predict a subsequent application of ICD-10 codes specifying psychotic features in diagnoses of BPD or MDD (Supplementary Note 6; Supplementary Fig. 1E).
Given that previous studies suggest reaching a definitive diagnosis typically takes several years, Reference Fritz, Russell, Allwang, Kuiper, Lampe and Malhi20 we analysed the time course to diagnostic stability in our sample. Using patients with records of 10 years or longer (Supplementary Fig. 1F), we defined stability as the absence of future diagnostic switches. For each year of follow-up, we calculated the percentage of patients reaching stability by that time. We also identified patients with high levels of diagnostic instability as those who had five or more diagnostic switches and at least one occurring after 5 years of illness.
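The stability calculation described above can be sketched as follows, assuming each patient's visits are available as time-ordered pairs of years since first visit and a mutually exclusive diagnostic category. Under the stated definition, a patient is stable from the time of their last recorded switch onward (or from the first visit if they never switch). Function names and the input layout are hypothetical.

```python
from typing import List, Sequence


def year_of_last_switch(years: Sequence[float],
                        categories: Sequence[str]) -> float:
    """Time (years since first visit) of the last switch; 0.0 if none."""
    last = 0.0
    for i in range(1, len(categories)):
        if categories[i] != categories[i - 1]:
            last = years[i]
    return last


def percent_stable_by_year(last_switch_years: Sequence[float],
                           max_year: int = 10) -> List[float]:
    """Percentage of patients with no further switches after each year."""
    n = len(last_switch_years)
    return [
        100.0 * sum(t <= year for t in last_switch_years) / n
        for year in range(max_year + 1)
    ]
```

Because stability is defined by the absence of future switches, the curve is only meaningful for patients with long follow-up, which is why the analysis was restricted to records of 10 years or longer.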
Ethical approval
All procedures were performed in compliance with Colombian and US laws and institutional guidelines and have been approved by Medical Institutional Review Board 3 at the University of California, Los Angeles (UCLA) (IRB#16-002084), the Comité de Ética del Instituto de Investigaciones Médicas at Universidad de Antioquia and the Comité de Bioética at CSJDM. The data that support the findings of this study are not publicly available because of restrictions by the local institutional review board (IRB) to protect participant privacy. These restrictions prohibit the authors from making the data-set available to other researchers.
Results
Study sample
As of June 2022, the CSJDM EHR included 157 003 visits from 22 447 patients who were assigned a SMI diagnosis at any point in their records (Supplementary Fig. 1B). The demographic and clinical characteristics of this sample are described in Supplementary Table 8.
Transdiagnostic characterisation of features extracted from EHR notes
We found that compared to manual chart reviews, patients’ current ICD-10 diagnoses of MDD, BPD and SCZ were highly accurate (accuracy of 0.90, 0.88 and 0.95, respectively, Supplementary Table 2). Each of the four NLP-extracted features (suicidal ideation, suicide attempt, delusions and hallucinations) occurred in all of the SMI diagnoses, stratified by gender, at frequencies above 5%, demonstrating their transdiagnostic quality (Fig. 1(a)).

Fig. 1 Transdiagnostic characterisation and co-occurrence of clinical features extracted from EHR notes. (a) The proportion of patients with each of the four features is stratified by primary diagnosis. (b) Number of patients with co-occurrence of two, three or four clinical features. All data in these plots are limited to patients with at least two electronic health record notes. EHR, electronic health record; MDD, major depressive disorder; BPD, bipolar disorder; SCZ, schizophrenia.
Associations between clinical features and diagnoses aligned with expected patterns and showed strong effect sizes: suicidal features were more frequent in MDD but less frequent in SCZ compared to BPD (odds ratio >2 and <0.5, respectively), while psychotic features were more frequent in SCZ but less frequent in MDD compared to BPD (odds ratio >3 and <0.5, respectively) (Supplementary Tables 9 and 10). Unexpectedly, we observed a negative association of female gender with suicidal ideation (odds ratio = 0.84, 95% CI 0.78–0.9), a finding largely driven by the relatively low frequency of suicidal ideation in females compared to males with MDD (34% v. 45%, interaction odds ratio = 0.65, 95% CI 0.56–0.75). Rates of suicide attempt, however, were similar in both genders. Consistent with previous literature, psychotic features showed reduced frequency in females compared to males (odds ratio = 0.67, 95% CI 0.62–0.73 delusions; odds ratio = 0.88, 95% CI 0.82–0.95 hallucinations). While this pattern remained constant across all diagnoses in our sample, it may contrast with meta-analytic evidence showing no gender differences in the incidence of affective psychoses. Reference Jongsma, Turner, Kirkbride and Jones17
The co-occurrence of the four features is displayed in Fig. 1(b). As expected, the two suicide-related features tended to co-occur, as did the two psychotic features. In addition, the mention of hallucinations increased the likelihood of suicidal features, and vice versa (odds ratio 1.29–1.98, 95% CI 1.17–2.16), accounting for gender, diagnosis, in-patient history and number of visits (Supplementary Table 10). However, unexpectedly, the mention of delusions in the notes decreased the likelihood of notes mentioning suicidal features in the same patient and vice versa (odds ratio 0.59–0.62, 95% CI 0.53–0.68). As a post hoc analysis, we examined these associations stratified by diagnosis to evaluate whether this was a diagnosis-specific pattern. The stratification revealed that the negative association between delusions and suicidality was unique to BPD (Supplementary Table 10C). In contrast, SCZ showed positive associations between both psychotic features and suicidal ideation (odds ratio 1.68–2.17, 95% CI 1.19–3.03), while MDD showed no association between delusions and suicidality (odds ratio 0.82–1.01, 95% CI 0.68–1.2). To better understand this BPD-specific pattern, we evaluated the frequency of features reported during different episode types. We found that different episodes have distinct symptom profiles: delusions were overwhelmingly more common in manic episodes, while suicidal ideation was the main feature in depressive episodes (Supplementary Fig. 3) and, in both depressive and mixed episodes, patients were more likely to present with either suicidal ideation or delusions alone rather than with both features simultaneously. Finally, BPD patients without documented manic episodes showed the strongest negative association between delusions and suicidal ideation (odds ratio 0.59–0.66, 95% CI 0.5–0.79; Supplementary Table 11).
Diverse diagnostic trajectories in SMI
We described diagnostic trajectories among SMI patients with at least three recorded visits (n = 12 962; Supplementary Fig. 1D). The majority (64%; Fig. 2(a)) had multiple diagnoses recorded in their EHR, broken down as follows: 30% displayed comorbidities (light blue bars; Supplementary Table 12), 19% displayed diagnostic switches (medium blue bars; Supplementary Fig. 4) and 15% displayed both switches and comorbidities (dark blue bars). Early switches often involved a change from brief psychotic disorder or single-episode MDD to BPD, Reference Fritz, Russell, Allwang, Kuiper, Lampe and Malhi20 while later switches frequently involved SCZ, BPD and schizoaffective disorder Reference Florentin, Reuveni, Rosca, Zwi-Ran and Neumark21 (Supplementary Fig. 5).

Fig. 2 Disease trajectories of serious mental illness (SMI) in patients with at least three visits. (a) UpSet plot presenting diagnostic switches (between SMI categories) and comorbidities (SMI and non-SMI categories). Patients with a single SMI diagnosis (light grey, medium grey, dark grey, total n = 4620); a single SMI diagnosis and other comorbidities (light blue, n = 3955); multiple SMI diagnoses and no other comorbidities (medium blue, n = 2468); multiple SMI diagnoses and other comorbidities (dark blue, n = 1919). Bars with n < 100 are not shown. (b) Sankey diagram of diagnostic trajectories. The left-hand nodes represent the diagnosis given at the initial visit, and the right-hand nodes represent the most recent SMI code (diagnostic switches within SMI are shown in Supplementary Fig. 4). ORG, other mental disorders caused by brain damage and dysfunction and physical disease (F06); SUD, mental and behavioural disorders caused by multiple drug use and use of other psychoactive substances (F19); BPE, acute and transient psychotic disorders (F23); MDE, major depressive episode (F32); PMD, persistent mood disorders (F34); UMD, unspecified mood disorder (F39); ANX, other anxiety disorders (F41); PTSD, reaction to severe stress and adjustment disorders (F43); ADHD, hyperkinetic disorders (F90); CON, conduct disorders (F91); MDD, major depressive disorder; SCZ, schizophrenia; BPD, bipolar disorder.
Some trajectories comprise diagnoses that are frequently paired, for example, the diagnostic switch from MDD to BPD (observed in 22% of current BPD patients) or the comorbidity between MDD and other anxiety disorders (observed in 28% of current MDD patients). We found that the majority of cases (58%) followed rare trajectories (occurring in fewer than 1% of patients). Altogether, we counted 3149 unique trajectories.
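The counts of unique and rare trajectories can be illustrated with a short sketch. How a trajectory is encoded is not specified in this section, so the example assumes each patient's trajectory is represented as a hashable value, such as a tuple of diagnostic categories; the function name is hypothetical, and the 1% threshold follows the definition of rare trajectories given above.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def trajectory_summary(trajectories: Sequence[Hashable],
                       rare_threshold: float = 0.01) -> Tuple[int, float]:
    """Return (number of unique trajectories, share of patients on rare ones).

    A trajectory is rare if it occurs in fewer than `rare_threshold`
    of all patients.
    """
    counts = Counter(trajectories)
    n = len(trajectories)
    rare_patients = sum(c for c in counts.values() if c / n < rare_threshold)
    return len(counts), rare_patients / n
```

Applied to one trajectory per patient, the first returned value corresponds to the number of unique trajectories and the second to the proportion of patients following rare ones.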
Clinical features, time and other factors affecting diagnostic stability
We identified multiple factors that influenced diagnostic stability. Diagnostic switching was most frequent during the early stages of treatment. While 11.3% of the patients changed diagnosis on their second visit, this percentage decreased over the patient’s course of illness (Fig. 3(a); log10(k) odds ratio = 0.57, 95% CI 0.54–0.61) and stabilised at around 4% after the tenth visit. Additional predictors of future diagnostic instability included the following observations at the current visit: a diagnostic switch from the previous visit (odds ratio = 4.01, 95% CI 3.7–4.34; Supplementary Fig. 6); an in-patient visit (odds ratio = 1.67, 95% CI 1.53–1.82); a NOS diagnosis (odds ratio = 1.58, 95% CI 1.48–1.69); and the presence of the clinical features of delusions or hallucinations (odds ratio = 1.47, 95% CI 1.34–1.61 and odds ratio = 1.19, 95% CI 1.09–1.3, respectively). Predictors of future diagnostic stability included diagnoses of SCZ or BPD compared to MDD (odds ratio = 0.30, 95% CI 0.27–0.32 and odds ratio = 0.32, 95% CI 0.28–0.36), male gender (odds ratio = 0.70, 95% CI 0.65–0.76) and age (odds ratio per decade = 0.96, 95% CI 0.94–0.98). The same pattern was observed when modelling switching by time rather than visit number (Supplementary Fig. 6).

Fig. 3 Diagnostic stability over time. (a) At each visit k, the proportion of patients that will switch primary diagnosis code on their next visit k + 1. Stratified by age groups: age at first visit before and after 30 years. (b) The x-axis shows the time since the first encounter instead of the visit number. For every year, the observed proportion of visits that will have a diagnostic switch on the next visit. The solid line is the average probability of switching at any given visit during that year, as estimated by the model. Lines and shaded areas correspond to 95% confidence intervals. (c) Proportion of patients by year who have reached a stable diagnosis. N = 1952 patients with 10 or more years in the electronic health record. It takes 6 years for 80% of patients to reach a stable diagnosis.
Reaching diagnostic stability
We identified 1952 patients with over 10 years of EHR data. Of these, more than 80% reached a stable diagnosis within 6 years (Fig. 3(c)). In addition, 162 individuals showed high levels of diagnostic instability. In this group, switches between BPD and SCZ were most common (Supplementary Table 13).
Discussion
We leveraged EHR data spanning 17 years and encompassing over 20 000 patients from a large mental health facility in a LMIC to characterise transdiagnostic and longitudinal features of SMI.
Transdiagnostic features
Previous studies describing the association of demographic variables and clinical features have varied greatly in methodology and scale. Reference Canetto and Sakinofsky16,Reference Jongsma, Turner, Kirkbride and Jones17,Reference Zalpuri and Rothschild22–Reference Fu, Qian, Jin, Yu, Wu and Du24 Our study is unique because, on the one hand, we uniformly applied a NLP approach to clinical notes to determine the presence or absence of four clinically important SMI features – suicidal ideation, suicide attempts, delusions and hallucinations – detecting them at high frequencies across all SMI diagnoses and, on the other hand, we tested for gender differences across these features after adjusting for factors such as diagnosis, treatment duration and age.
The ‘gender paradox of suicide’ – the observation of higher rates of suicidal ideation and suicide attempts in females compared to males but higher rates of completed suicide in males – is well documented globally, including in Latin America. Reference Canetto and Sakinofsky16,Reference Miranda-Mendizabal, Castellví, Parés-Badell, Alayo, Almenara and Alonso23–Reference Naghavi26 Our observation of a higher rate of suicidal ideation in males compared to females was therefore unexpected and warrants further investigation. Regarding suicide attempts, we found approximately equal frequencies among males and females. This may reflect the high severity of patients in the EHR database; supporting this interpretation, a cross-national study reported a male excess for suicide attempts that were designated as ‘serious’. Reference Freeman, Mergl, Kohls, Székely, Gusmao and Arensman27 However, our current methodology does not distinguish between different levels of severity of suicide attempt.
Psychotic symptoms have consistently been linked with higher suicidality rates across various SMI diagnoses. Reference Zalpuri and Rothschild22,Reference Kuperberg, Katz, Greenebaum, George, Sylvia and Kinrys28 Interestingly, our study found opposite associations between suicidal features and specific psychotic symptoms – positive for hallucinations, negative for delusions. Post hoc analyses by diagnosis revealed that this pattern was diagnosis-specific: the negative association between delusions and suicidality was unique to BPD patients, while within SCZ, both delusions and hallucinations showed the expected positive associations with suicidal features. While the mechanism underlying this pattern is unknown, we hypothesised that patients with BPD primarily seek care for prior depression (frequently marked by suicidality) or mania (frequently marked by delusions) – inducing a negative correlation between these features. Supporting this hypothesis, our analyses revealed that clinical features were associated with distinct mood episodes: suicidal ideation with depression and delusions with mania. Furthermore, BPD patients without manic episodes were more likely to present with either suicidal ideation or delusions alone rather than both features simultaneously. Our findings uniquely contribute to the existing literature, which typically does not differentiate between types of psychotic symptoms. However, future research using more granular classification of psychotic symptoms could help disentangle these potential mechanisms and their implications for risk assessment.
Evidence has accumulated indicating a high degree of shared genetic risk across SMI diagnoses. Reference Anttila, Bulik-Sullivan, Finucane, Walters and Bras5 It has been hypothesised that this shared risk may reflect genetic associations to transdiagnostic component phenotypes, but such phenotypes have rarely been assessed at a scale adequate to test this hypothesis. The results that we present here for suicidality and psychotic symptoms suggest that our approach for extracting transdiagnostic features from EHR notes may provide a general strategy for mounting well-powered genetic association studies of such phenotypes in cohorts ascertained for SMI broadly. In future studies, we plan to deploy such a strategy through investigations of additional EHR databases and the inclusion of a more extensive set of clinical features.
Diagnostic trajectories
The continuous EHR record since 2005 at the CSJDM allowed us to analyse longitudinal trajectories of SMI. We characterised the observed diagnostic trajectories in terms of diagnostic switches and the accumulation of comorbidities. Consistent with previous research, Reference Plana-Ripoll, Pedersen, Holtz, Benros, Dalsgaard and de Jonge8,Reference Høj Jørgensen, Osler, Jorgensen and Jorgensen9,Reference Bromet, Kotov, Fochtmann, Carlson, Tanenberg-Karant and Ruggero29,Reference Lopez-Castroman, Leiva-Murillo, Cegla-Schvartzman, Blasco-Fontecilla, Garcia-Nieto and Artes-Rodriguez30 we find that diagnostic instability is characteristic of the early stages of SMI and that this contributes to a large diversity of disease trajectories. This diversity of trajectories underscores the complexity of SMI and highlights the need to identify patterns relevant to understanding disease causation and informing clinical practice.
We found that most SMI patients in the CSJDM achieve diagnostic stability within 6 years, consistent with reports from UICs. Reference Peralta, Janda, García de Jalón, Moreno-Izco, Sánchez-Torres and Cuesta6,Reference Scott, Graham, Yung, Morgan, Bellivier and Etain7 This timeline reflects the complex process of diagnostic confirmation, where clinicians weigh current diagnostic impressions and past episodes. While some diagnostic evolution is expected, prolonged uncertainty can undermine patient trust and delay appropriate treatment. Reference Scott, Graham, Yung, Morgan, Bellivier and Etain7,Reference Fritz, Russell, Allwang, Kuiper, Lampe and Malhi20 Notably, psychotic features in clinical notes often preceded formal psychotic diagnoses, while patients with prior diagnostic changes showed increased likelihood of future diagnostic shifts. These clinical features could serve as elements to monitor to accelerate the process of diagnostic confirmation.
Clinical and research implications of integrated feature and trajectory analysis
Despite methodological differences, the overall alignment with previous registry studies Reference Plana-Ripoll, Pedersen, Holtz, Benros, Dalsgaard and de Jonge8,Reference Høj Jørgensen, Osler, Jorgensen and Jorgensen9 lends validation to our approach. However, our study’s unique contribution lies in the integration of features from clinical notes alongside diagnostic codes. This enabled us to delineate trajectories with greater granularity than typically available in registry data. Our detailed characterisation of SMI features, even before formal diagnosis, creates opportunities for earlier detection. As demonstrated in recent work, Reference Arribas, Oliver, Patel, Kornblum, Shetty and Damiani31 many transdiagnostic prodromal features manifest early in secondary care. Our findings at the CSJDM align with this observation, showing that many SMI patients initially present with conditions such as anxiety or attention-deficit hyperactivity disorder (ADHD), potentially representing opportunities for early detection and intervention. In addition, our systematic analysis of symptom emergence across diagnostic boundaries provides insights into naturalistic disease trajectories that may enhance clinical decision-making capabilities. For example, the longitudinal clinical data used here have supported the development of predictive models for conversion from MDD to BPD, Reference Service, De La Hoz, Diaz‐Zuluaga, Arias, Pimplaskar and Luu32 directly addressing the substantial diagnostic delays characteristic of BPD. Reference Fritz, Russell, Allwang, Kuiper, Lampe and Malhi20 Tools leveraging EHR data to assist clinicians in identifying patients who may benefit from closer monitoring or early intervention are starting to emerge in UICs Reference Irving, Patel, Oliver, Colling, Pritchard and Broadbent11,Reference Tsui, Shi, Ruiz, Ryan, Biernesser and Iyengar12 – our work paves the way for future developments of this kind in LMICs.
Our results support the presumption that research classifications incorporating past and future trajectories at both symptom and diagnosis levels will lead to less heterogeneous categories than those that are based only on a ‘lifetime’ diagnosis. Reference Wray, Lee and Kendler33 Prior studies have suggested that certain genetic risk profiles might contribute to specific SMI trajectories, such as polarity at the onset of BPD Reference Kalman, Olde Loohuis, Vreeker, McQuillin, Stahl and Ruderfer34 or conversion from non-psychotic to psychotic illness. Reference Perkins, Olde Loohuis, Barbee, Ford, Jeffries and Addington35–Reference Jonas, Lencz, Li, Malhotra, Perlman and Fochtmann37 Efforts to replicate and extend such findings, however, have been limited by variation in ascertainment strategies, reliance on patient recall and small sample sizes. Centring genetic studies of SMI trajectories on EHR databases, such as that of the CSJDM, could provide a means to overcome these limitations; but as large-scale analyses of thousands of different disease trajectories appear impractical, it will be crucial to develop methods for reducing dimensionality by clustering patients with similar trajectories. Reference Krebs, Themudo, Benros, Mors, Børglum and Hougaard38
Limitations
As described above, the limited granularity of clinical features extracted from free-text notes (e.g. we have not extracted specific types of delusions or the level of severity of suicide attempts) is a limitation of this work. We are currently improving our NLP algorithms to address both limitations (e.g. by identifying instances of language that signifies intent to die) and, simultaneously, we are exploring approaches to expand our NLP toolset (e.g. by including a combination of pattern-based detection and large language models Reference Taub-Tabib, Shamay, Shlain, Pinhasov, Polak and Tiktinsky39 ). Along with this, a key limitation of using administrative data for research, including in this study, is the inability to differentiate between true diagnostic switches and variation in clinician subjectivity. Future studies involving extensive chart reviews at switch points could help evaluate this limitation. Finally, given our goal of validating the use of LMIC-based EHRs for psychiatric research, this study is primarily descriptive in nature. While we tested several hypotheses drawn from the existing literature to validate our approach, these were not pre-registered. Our findings lay the groundwork for future hypothesis-driven studies using these validated EHR-based methods.
Our results demonstrate the utility of EHR databases for population-level research on SMI in a LMIC setting at unprecedented detail and scale. The availability of longitudinal EHRs enables the characterisation of SMI trajectories over extended periods of time, while the use of NLP to uniformly phenotype patients across diagnoses enables the investigation of transdiagnostic components of SMI. Extension of this approach could play an important role in advancing psychiatric research beyond categorical syndromes, transforming our understanding and treatment of mental illness globally.
Supplementary material
The supplementary material can be found online at https://doi.org/10.1192/bjp.2025.107
Author contributions
J.F.D.L.H., N.B.F. and L.M.O.L. formulated the research question; J.F.D.L.H., S.K.S., N.B.F. and L.M.O.L. designed the study; J.F.D.L.H., S.K.S., M.C., A.M.D.-Z., J.S., C.G., S.R.-S. and L.M.O.L. carried out the study; J.F.D.L.H., A.A. and S.K.S. analysed the data; J.F.D.L.H., L.M.O.L. and N.B.F. wrote the initial draft of the manuscript. All the authors reviewed the manuscript, provided feedback and approved the final version.
Funding
Research reported here was supported by R01 MH123157 (to L.M.O.L., C.L.-J. and N.B.F.), R01 MH113078 (to C.E.B., C.L.-J. and N.B.F.), R00 MH116115 (to L.M.O.L.), T32 MH073526 (to J.F.D.L.H.) and the Fulbright Commission in Colombia under the Fulbright-Colciencias grant (to J.F.D.L.H.). The content is solely the responsibility of the authors and does not represent the official views of the Fulbright Program or the National Institutes of Health.
Declaration of interest
N.B.F. and A.B. receive research funding from Apple, Inc. No other authors report financial relationships with commercial interests.