Analysts often seek to compare representations in high-dimensional space, e.g., embedding vectors of the same word across groups. We show that the distance measures calculated in such cases can exhibit considerable statistical bias that stems from uncertainty in the estimation of the elements of those vectors. This problem applies to Euclidean distance, cosine similarity, and other similar measures. After illustrating the severity of this problem for text-as-data applications, we provide and validate a bias correction for the squared Euclidean distance. This same correction also substantially reduces bias in ordinary Euclidean distance and cosine similarity estimates, but corrections for these measures are not quite unbiased and are (non-intuitively) bimodal when distances are close to zero. The estimators require obtaining the variance of the latent positions. We (will) implement the estimator in free software, and we offer recommendations for related work.
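To make the source of the bias concrete, the following is a minimal sketch rather than the authors' estimator. It assumes each vector element is estimated with independent, zero-mean noise of known variance, in which case the expected naive squared Euclidean distance exceeds the true one by the sum of the element-wise variances of both vectors. The function names and the simulation settings (`dim`, `noise_sd`, `reps`) are illustrative assumptions.

```python
import random


def squared_distance(u, v):
    """Plain squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def corrected_squared_distance(u_hat, v_hat, u_var, v_var):
    """Naive squared distance minus the summed element-wise sampling variances.

    Assumes independent, zero-mean estimation errors with known variances
    (an assumption of this sketch, not a claim about the published method).
    """
    return squared_distance(u_hat, v_hat) - sum(u_var) - sum(v_var)


def simulate(dim=50, noise_sd=0.5, reps=2000, seed=0):
    """Average naive and corrected estimates when the true distance is zero."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    y = list(x)  # identical latent positions, so the true distance is 0
    var = [noise_sd ** 2] * dim
    naive_total = 0.0
    corrected_total = 0.0
    for _ in range(reps):
        x_hat = [xi + rng.gauss(0.0, noise_sd) for xi in x]
        y_hat = [yi + rng.gauss(0.0, noise_sd) for yi in y]
        naive_total += squared_distance(x_hat, y_hat)
        corrected_total += corrected_squared_distance(x_hat, y_hat, var, var)
    return naive_total / reps, corrected_total / reps


if __name__ == "__main__":
    naive_mean, corrected_mean = simulate()
    print(f"naive mean: {naive_mean:.2f}, corrected mean: {corrected_mean:.2f}")
```

Under these settings the naive average should sit near 2 × 50 × 0.25 = 25 even though the two latent vectors are identical, while the corrected average should sit near zero. Individual corrected estimates can be negative, which is one reason taking a square root afterwards is not straightforward.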
Recent advances in natural language processing (NLP), particularly in language processing methods, have opened new avenues in semantic data analysis. A promising application of NLP is data harmonization in questionnaire-based cohort studies, where it can be used as an additional method, specifically when only different instruments are available for one construct as well as for the evaluation of potentially new construct-constellations. The present article therefore explores embedding models’ potential to detect opportunities for semantic harmonization.
Methods
Using models like SBERT and OpenAI’s ADA, we developed a prototype application (“Semantic Search Helper”) to facilitate the harmonization process of detecting semantically similar items within extensive health-related datasets. The approach’s feasibility and applicability were evaluated through a use case analysis involving data from four large cohort studies with heterogeneous data obtained with a different set of instruments for common constructs.
Results
With the prototype, we effectively identified potential harmonization pairs, which significantly reduced manual evaluation efforts. Expert ratings of semantic similarity candidates showed high agreement with model-generated pairs, confirming the validity of our approach.
Conclusions
This study demonstrates the potential of embeddings for matching semantically similar items, as a promising add-on tool to assist the harmonization of data sets that use multiple different instruments but cover similar content, within and across studies.
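The core matching step described above can be sketched as a pairwise cosine-similarity search over item embeddings. This is an illustration under stated assumptions, not the Semantic Search Helper itself: the three-dimensional vectors are hand-made stand-ins for real model embeddings, and the item identifiers, the `candidate_pairs` function and the 0.8 threshold are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def candidate_pairs(items, threshold=0.8):
    """Return (id_a, id_b, score) for every item pair at or above the threshold,
    sorted from most to least similar."""
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(items.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


# Toy embeddings (hypothetical); a real pipeline would obtain these from an
# embedding model applied to the questionnaire item texts.
toy_items = {
    "studyA_sleep_quality": [0.9, 0.1, 0.0],
    "studyB_sleep_problems": [0.8, 0.2, 0.1],
    "studyC_alcohol_use": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    for id_a, id_b, score in candidate_pairs(toy_items):
        print(f"{id_a} <-> {id_b}: {score:.3f}")
```

Pairs returned this way are only candidates; in the workflow described above, expert raters still judge whether a pair is a genuine harmonization opportunity.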
Exposure to maternal mental illness during foetal development may lead to altered development, resulting in permanent changes in offspring functioning.
Aims
To assess whether there is an association between prenatal maternal psychiatric disorders and offspring behavioural problems in early childhood, using linked health administrative data and the Australian Early Development Census from New South Wales, Australia.
Method
The sample included all mother–child pairs of children who commenced full-time school in 2009 in New South Wales and met the inclusion criteria (N = 69 165). Univariable logistic regression analysis assessed unadjusted associations between categories of maternal prenatal psychiatric disorders and indicators of offspring behavioural problems. Multivariable logistic regression adjusted the associations of interest for psychiatric categories and a priori selected covariates. Sensitivity analyses included adjusting the final model for primary psychiatric diagnoses and assessing the associations of interest for effect modification by child's biological gender.
Results
Children exposed in the prenatal period to maternal psychiatric disorders had greater odds of being developmentally vulnerable in their first year of school. Children exposed to maternal anxiety disorders prenatally had the greatest odds for behavioural problems (adjusted odds ratio 1.98; 95% CI 1.43–2.69). A statistically significant interaction was found between child biological gender and prenatal hospital admissions for substance use disorders, for emotional subdomains, aggression and hyperactivity/inattention.
Conclusions
Children exposed to prenatal maternal mental illness had greater odds for behavioural problems, independent of postnatal exposure. Those exposed to prenatal maternal anxiety were at greatest risk, highlighting the need for targeted interventions for, and support of, families with mental illness.
The Concluding Reflections explore democracy’s potential to overcome its contradictions and challenges. The rise of populism, seen as democratic autoimmunity, is examined, where leaders manipulate public sentiment, often through xenophobia and anti-elitism, undermining democratic principles. The tyranny of an exclusory majority is also cautioned against. The potential for democracy’s reimagining in the face of contemporary challenges such as cybernetic culture, migration, and globalization is considered. Ezrahi reflects on the role of creative individuals and cultural forces in shaping political imaginaries. The transformation of the internet and major platforms like Google, Facebook, Amazon, and Twitter from democratizing communication to powerful monopolies is analyzed, as well as the misuse of Big Data, illustrated by the Cambridge Analytica scandal, and the unintended consequences of digital platforms, including the spread of misinformation. The discussion concludes with a reflection on the broader deterioration of democratic epistemology. Ezrahi argues for a shift from a positivistic, naturalistic ontology to an ethical-normative anchorage, proposing to replace the current ontological defense of democracy with a commitment to preserving freedom based on novel axioms, framing politics as alternative productive fictions. Ezrahi proposes to reimagine a democratic epistemology which is anchored in ethics and collective commitment.
Making informed clinical decisions based on individualised outcome predictions is the cornerstone of precision psychiatry. Prediction models currently employed in psychiatry rely on algorithms that map a statistical relationship between clinical features (predictors/risk factors) and subsequent clinical outcomes. They rely on associations that overlook the underlying causal structures within the data, including the presence of latent variables, and the evolution of predictors and outcomes over time. As a result, predictions from sparse associative models from routinely collected data are rarely actionable at an individual level. To be actionable, prediction models should address these shortcomings. We provide a brief overview of a general framework for the rationale for implementing causal and actionable predictions using counterfactual explanations to advance predictive modelling studies, which has translational implications. We have included an extensive glossary of terminology used in this paper and the literature (Supplementary Box 1) and provide a concrete example to demonstrate this conceptually, and a reading list for those interested in this field (Supplementary Box 2).
An efficient compression scheme for modal flow analysis is proposed and validated on data sequences of compressible flow through a linear turbomachinery blade row. The key feature of the compression scheme is a minimal, user-defined distortion of the mutual distance of any snapshot pair in phase space. Through this imposed feature, the model reduction process preserves the temporal dynamics contained in the data sequence, while still decreasing the spatial complexity. The mathematical foundation of the scheme is the fast Johnson–Lindenstrauss transformation (FJLT) which uses randomized projections and a tree-based spectral transform to accomplish the embedding of a high-dimensional data sequence into a lower-dimensional latent space. The compression scheme is coupled to a proper orthogonal decomposition and dynamic mode decomposition analysis of flow through a linear blade row. The application to a complex flow-field sequence demonstrates the efficacy of the scheme, where compression rates of two orders of magnitude are achieved, while incurring very small relative errors in the dominant temporal dynamics. This FJLT technique should be attractive to a wide range of modal analyses of large-scale and multi-physics fluid motion.
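The distance-preserving property that the scheme relies on can be illustrated with a plain dense Gaussian random projection. This is a deliberate simplification: it is not the fast Johnson–Lindenstrauss variant with a spectral transform described above, and the dimensions, seeds and function names are assumptions chosen for a quick demonstration.

```python
import math
import random


def gaussian_projection(n_in, n_out, seed=0):
    """Dense n_out x n_in matrix with N(0, 1/n_out) entries, so that squared
    lengths are preserved in expectation."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_in)]
            for _ in range(n_out)]


def project(matrix, vector):
    """Matrix-vector product: maps a snapshot into the lower-dimensional space."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distortion(n_in=500, n_out=200, seed=1):
    """Ratio of projected to original distance for one random snapshot pair."""
    rng = random.Random(seed)
    snap_a = [rng.uniform(-1.0, 1.0) for _ in range(n_in)]
    snap_b = [rng.uniform(-1.0, 1.0) for _ in range(n_in)]
    proj = gaussian_projection(n_in, n_out, seed=seed + 1)
    return distance(project(proj, snap_a), project(proj, snap_b)) / distance(snap_a, snap_b)


if __name__ == "__main__":
    print(f"projected / original distance: {distortion():.3f}")
```

A ratio close to 1 means the mutual distance of the snapshot pair survives the projection. In the Johnson–Lindenstrauss setting, the target dimension is chosen from the acceptable distortion and the number of snapshots rather than from the original dimension.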
The new software package OpenMx 2.0 for structural equation and other statistical modeling is introduced and its features are described. OpenMx is evolving in a modular direction and now allows a mix-and-match computational approach that separates model expectations from fit functions and optimizers. Major backend architectural improvements include a move to swappable open-source optimizers such as the newly written CSOLNP. Entire new methodologies such as item factor analysis and state space modeling have been implemented. New model expectation functions including support for the expression of models in LISREL syntax and a simplified multigroup expectation function are available. Ease-of-use improvements include helper functions to standardize model parameters and compute their Jacobian-based standard errors, access to model components through standard R $ mechanisms, and improved tab completion from within the R Graphical User Interface.
There are ethnic differences, including differences related to indigeneity, in the incidence of first episode psychosis (FEP) and pathways into care, but research on ethnic disparities in outcomes following FEP is limited.
Aims
In this study we examined social and health outcomes following FEP diagnosis for a cohort of Māori (Indigenous people of New Zealand) and non-Māori (non-Indigenous) young people. We have focused on understanding the opportunities for better outcomes for Māori by examining the relative advantage of non-Māori with FEP.
Method
Statistics New Zealand's Integrated Data Infrastructure was accessed to describe mental health and social service interactions and outcomes for a retrospective FEP cohort comprising 918 young Māori and 1275 non-Māori aged 13 to 25 at diagnosis. Logistic regression models were used to examine whether social outcomes including employment, benefit receipt, education and justice involvement in year 5 differed by indigeneity.
Results
Non-Māori young people were more likely than Māori to have positive outcomes in the fifth year after FEP diagnosis, including higher levels of employment and income, and lower rates of benefit receipt and criminal justice system involvement. These patterns were seen across diagnostic groups, and for both those receiving ongoing mental healthcare and those who were not.
Conclusions
Non-Māori experience relative advantage in outcomes 5 years after FEP diagnosis. Indigenous-based social disparities following FEP urgently require a response from the health, education, employment, justice and political systems to avoid perpetuating these inequities, alongside efforts to address the disadvantages faced by all young people with FEP.
As people migrate to digital environments they produce an enormous amount of data, such as images, videos, data from mobile sensors, text, and usage logs. These digital footprints documenting people’s spontaneous behaviors in natural environments are a gold mine for social scientists, offering novel insights; more diversity; and more reliable, replicable, and ecologically valid results.
This last chapter summarizes most of the material in this book in a range of concluding statements. It provides a summary of the lessons learned. These lessons can be viewed as guidelines for research practice.
As the field of migration studies evolves in the digital age, big data analytics emerge as a potential game-changer, promising unprecedented granularity, timeliness, and dynamism in understanding migration patterns. However, the epistemic value added by this data explosion remains an open question. This paper critically appraises the claim, investigating the extent to which big data augments, rather than merely replicates, traditional data insights in migration studies. Through a rigorous literature review of empirical research, complemented by a conceptual analysis, we aim to map out the methodological shifts and intellectual advancements brought forth by big data. The potential scientific impact of this study extends into the heart of the discipline, providing critical illumination on the actual knowledge contribution of big data to migration studies. This, in turn, delivers a clarified roadmap for navigating the intersections of data science, migration research, and policymaking.
This chapter is dedicated to the memory of Sue Atkins, the Grande Dame of lexicography, who passed away in 2021. In a prologue we argue that she must be seen on a par with other visionaries and their visions, such as Paul Dirac in mathematics or Beethoven in music. We review the last half century through the eyes of Sue Atkins. In the process, insights of other luminaries come into the picture, including those of Patrick Hanks, Michael Rundell, Adam Kilgarriff, John Sinclair, and Charles Fillmore. This material serves as background to start thinking out of the box about the future of dictionaries. About fifty oppositions are presented, in which the past is contrasted with the future, divided into five subsections: the dictionary-making process, supporting tools and concepts, the appearance of the dictionary, facts about the dictionary, and the image of the dictionary. Moving from the future of dictionaries to the future of lexicographers, the argument is made that dictionary makers need to join forces with the Big Data companies, a move that, by its nature, brings us to the US and thus Americans, including Gregory Grefenstette, Erin McKean, Laurence Urdang, and Sidney I. Landau. In an epilogue, the presentation’s methodology is defined as being “a fact-based extrapolation of the future” and includes good advice from Steve Jobs.
This is the first of a two-part paper. We formulate a data-driven method for constructing finite-volume discretizations of an arbitrary dynamical system's underlying Liouville/Fokker–Planck equation. A method is employed that allows for flexibility in partitioning state space, generalizes to function spaces, applies to arbitrarily long sequences of time-series data, is robust to noise and quantifies uncertainty with respect to finite sample effects. After applying the method, one is left with Markov states (cell centres) and a random matrix approximation to the generator. When used in tandem, they emulate the statistics of the underlying system. We illustrate the method on the Lorenz equations (a three-dimensional ordinary differential equation), saving a fluid dynamical application for Part 2 (Souza, J. Fluid Mech., vol. 997, 2024, A2).
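As a rough illustration of how a generator can be read off partitioned time-series data, the sketch below computes a point estimate from a sequence of cell labels: off-diagonal entries are exit counts divided by the time spent in the source cell, and each diagonal entry makes its column sum to zero. This is a standard continuous-time Markov chain estimate offered under stated assumptions, not the paper's implementation. It omits the uncertainty quantification (the random matrix aspect) entirely, and the column convention and the name `estimate_generator` are choices made here.

```python
def estimate_generator(labels, n_states, dt):
    """Point estimate of a generator matrix Q from a sequence of cell labels.

    Q[i][j] (i != j) is the estimated rate of jumping from cell j into cell i;
    Q[j][j] is minus the total exit rate from cell j, so columns sum to zero.
    Cells that are never occupied keep an all-zero column.
    """
    time_in = [0.0] * n_states
    exits = [[0] * n_states for _ in range(n_states)]  # exits[i][j]: jumps j -> i
    for current, nxt in zip(labels, labels[1:]):
        time_in[current] += dt
        if nxt != current:
            exits[nxt][current] += 1

    generator = [[0.0] * n_states for _ in range(n_states)]
    for j in range(n_states):
        if time_in[j] == 0.0:
            continue
        for i in range(n_states):
            if i != j:
                generator[i][j] = exits[i][j] / time_in[j]
        generator[j][j] = -sum(generator[i][j] for i in range(n_states) if i != j)
    return generator


if __name__ == "__main__":
    # Hypothetical label sequence: each entry is the index of the cell that
    # contains the state at one sampling time.
    sequence = [0, 0, 1, 1, 1, 0, 0, 1]
    for row in estimate_generator(sequence, n_states=2, dt=0.5):
        print(row)
```

In practice the labels would come from assigning each time-series sample to its nearest cell centre, and the sequence would be long enough for every cell to be visited many times.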
This is the second part of a two-part paper. We apply the methodology of the first paper (Souza, J. Fluid Mech., vol. 997, 2024, A1) to construct a data-driven finite-volume discretization of the Liouville/Fokker–Planck equation of a high-dimensional dynamical system, i.e. the compressible Euler equations with gravity and rotation evolved on a thin spherical shell. We show that the method recovers a subset of the statistical properties of the underlying system, namely steady-state distributions of observables and autocorrelations of particular observables, and that it reveals the global Koopman modes of the system. We employ two different strategies for the partitioning of a high-dimensional state space, and explore their consequences.
This paper aims at exploring the dynamic interplay between advanced technological developments in AI and Big Data and the sustained relevance of theoretical frameworks in scientific inquiry. It questions whether the abundance of data in the AI era reduces the necessity for theory or, conversely, enhances its importance. Arguing for a synergistic approach, the paper emphasizes the need for integrating computational capabilities with theoretical insight to uncover deeper truths within extensive datasets. The discussion extends into computational social science, where elements from sociology, psychology, and economics converge. The application of these interdisciplinary theories in the context of AI is critically examined, highlighting the need for methodological diversity and addressing the ethical implications of AI-driven research. The paper concludes by identifying future trends and challenges in AI and computational social science, offering a call to action for the scientific community, policymakers, and society. Being positioned at the intersection of AI, data science, and social theory, this paper illuminates the complexities of our digital era and inspires a re-evaluation of the methodologies and ethics guiding our pursuit of knowledge.
The Conclusion chapter reiterates the book’s approach, focus and main points. It reminds the reader that the book has concentrated on local, provincial, peripatetic and otherwise relatively marginal sites of scientific activity and shown how a wide variety of spaces were constituted and reconfigured as meteorological observatories. The conclusion reiterates the point that nineteenth-century meteorological observatories, and indeed the very idea of observatory meteorology, were under constant scrutiny. The conclusion interrogates four crucial conditions of these observatory experiments: the significance of geographical particularity in justifications of observatory operations; the sustainability of coordinated observatory networks at a distance; the ability to manage, manipulate and interpret large datasets; and the potential public value of meteorology as it was prosecuted in observatory settings. Finally, the chapter considers the use of historic weather data in recent attempts by climate scientists to reconstruct past climates and extreme weather events.
Second language (L2) learners need to acquire large vocabularies to approach native-like proficiency. Many controlled experiments have investigated the factors facilitating and hindering word learning; however, few studies have validated these findings in real-world learning scenarios. We use data from the language learning app Lingvist to explore how L2 word learning is affected by valence (positivity/negativity) and concreteness of target words and their linguistic contexts. We found that valence, but not concreteness, affects learning. Users learned positive and negative words better than neutral ones. Moreover, positive words are learned best in positive contexts and negative words in more negative contexts. Word and context valence effects are strongest on the learner’s second encounter with the target word and diminish across subsequent encounters. These findings provide support for theories of embodied cognition and the lexical quality hypothesis and point to the linguistic factors that make learning words, and by extension languages, faster.
Personal independence payment (PIP) is a benefit that covers additional daily living costs people may incur from a long-term health condition or disability. Little is known about PIP receipt and associated factors among people who access mental health services, and trends over time. Individual-level data linking healthcare records with administrative records on benefits receipt have been non-existent in the UK.
Aims
To explore how PIP receipt varies over time, including PIP type, and its association with sociodemographic and diagnostic patient characteristics among people who access mental health services.
Method
A data-set was established by linking electronic mental health records from the South London and Maudsley NHS Foundation Trust with administrative records from the Department for Work and Pensions.
Results
Of 143 714 working-age patients, 37 120 (25.8%) had received PIP between 2013 and 2019, with PIP receipt steadily increasing over time. Two in three patients (63.2%) had received both the daily living and mobility components. PIP receipt increased with age. Those in more deprived areas were more likely to receive PIP. The likelihood of PIP receipt varied by ethnicity. Patients diagnosed with a severe mental illness had 1.48 times the odds (95% CI 1.42–1.53) of having received PIP, compared with those with a different psychiatric diagnosis.
Conclusions
One in four people who accessed mental health services had received PIP, with higher levels seen among those most likely in need, as indicated by a severe mental illness diagnosis. Future research using this data-set could explore the average duration of PIP receipt in people who access mental health services, and re-assessment patterns by psychiatric diagnosis.
High-dimensional dynamical systems projected onto a lower-dimensional manifold cease to be deterministic and are best described by probability distributions in the projected state space. Their equations of motion map onto an evolution operator with a deterministic component, describing the projected dynamics, and a stochastic one representing the neglected dimensions. This is illustrated with data-driven models for a moderate-Reynolds-number turbulent channel. It is shown that, for projections in which the deterministic component is dominant, relatively ‘physics-free’ stochastic Markovian models can be constructed that mimic many of the statistics of the real flow, even for fairly crude operator approximations, and this is related to general properties of Markov chains. Deterministic models converge to steady states, but the simplified stochastic models can be used to suggest what is essential to the flow and what is not.