In 1679, the astronomer Giovanni Domenico Cassini published a large print detailing the entire visible surface of the moon with unprecedented meticulousness. This Grand Selenography is undoubtedly one of the most spectacular pictures ever produced within the Académie royale des sciences. However, it has been largely neglected by historians until now. This study offers the first account of the making and early reception of the print. It argues that the Grand Selenography remained unfinished because it failed to satisfy Cassini and his contemporaries. Furthermore, its history allows us to shed new light on the range of issues that scientific pictures might have raised during Louis XIV’s reign.
This study examined the capacity of ChatGPT-4 to assess L2 writing in an accurate, specific, and relevant way. Based on 35 argumentative essays written by upper-intermediate L2 writers in higher education, we evaluated ChatGPT-4’s assessment capacity across four L2 writing dimensions: (1) Task Response, (2) Coherence and Cohesion, (3) Lexical Resource, and (4) Grammatical Range and Accuracy. The main findings were (a) ChatGPT-4 was exceptionally accurate in identifying the issues across the four dimensions; (b) ChatGPT-4 demonstrated more variability in feedback specificity, with more specific feedback in Grammatical Range and Accuracy and Lexical Resource, but more general feedback in Task Response and Coherence and Cohesion; and (c) ChatGPT-4’s feedback was highly relevant to the criteria in the Task Response and Coherence and Cohesion dimensions, but it occasionally misclassified errors in the Grammatical Range and Accuracy and Lexical Resource dimensions. Our findings contribute to a better understanding of ChatGPT-4 as an assessment tool, informing future research and practical applications in L2 writing assessment.
Dmitri Gallow has recently proposed an ingenious accuracy-first vindication of the claim that our credences should be updated in accordance with conditionalization. This paper extends his idea to cases where we undergo only a partial learning experience. In particular, I attempt to vindicate, in the spirit of accuracy-first epistemology, the claim that when we undergo what this paper calls ‘Jeffrey partial learning’, our credences should be updated in accordance with Jeffrey conditionalization or, at the very least, that the update should be rigid. In doing so, I propose what I call the ‘Jeffrey-accuracy function.’ This function is not strictly proper and, at first glance, seems to rationalize ill-motivated credence updating. However, this turns out not to be the case.
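For orientation, the update rule at issue can be stated in generic notation (the symbols below are standard and not drawn from the paper itself). If a partial learning experience shifts one’s credences over a partition $\{E_1, \dots, E_n\}$ to new values $q_1, \dots, q_n$, Jeffrey conditionalization requires that the new credence in any proposition $A$ be

$$P_{\mathrm{new}}(A) \;=\; \sum_{i=1}^{n} P_{\mathrm{old}}(A \mid E_i)\, q_i, \qquad q_i = P_{\mathrm{new}}(E_i),$$

and rigidity is the weaker requirement that the conditional credences be preserved, i.e., $P_{\mathrm{new}}(A \mid E_i) = P_{\mathrm{old}}(A \mid E_i)$ for every $i$.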
Designers often rely on their self-evaluations – either independently or using design tools – to make concept selection decisions. When evaluating designs for sustainability, novice designers, given their lack of experience, could demonstrate psychological distance from sustainability-related issues, leading to faulty concept evaluations. We aim to investigate the accuracy of novice designers’ self-evaluations of the sustainability of their solutions and the moderating role of (1) their trait empathy and (2) their beliefs, attitudes and intentions toward sustainability on this accuracy. We conducted an experiment with first-year engineering students comprising a sustainable design activity. In the activity, participants evaluated the sustainability of their own designs, and these self-evaluations were compared against expert evaluations. First, participants’ self-evaluations were consistent with the expert evaluations on the following sustainable design heuristics: (1) longevity and (2) finding wholesome alternatives. Second, trait empathy moderated the accuracy of self-evaluations, with lower levels of fantasy and perspective-taking relating to more accurate self-evaluations. Finally, beliefs, attitudes and intentions toward sustainability also moderated the accuracy of self-evaluations, and these effects varied by sustainable design heuristic. Taken together, these findings suggest that novice designers’ individual differences (e.g., trait empathy) could moderate the accuracy of the evaluation of their designs in the context of sustainability.
This research report presents the development and validation of Auto Error Analyzer, a prototype web application designed to automate the calculation of accuracy and related metrics for measuring second language (L2) production. Building on recent advancements in natural language processing (NLP) and artificial intelligence (AI), Auto Error Analyzer introduces an automated accuracy measurement component, bridging a gap in existing assessment tools, which traditionally require human judgment for accuracy evaluation. By utilizing a state-of-the-art generative AI model (Llama 3.3) for error detection, Auto Error Analyzer analyzes L2 texts efficiently and cost-effectively, producing accuracy metrics (e.g., errors per 100 words). Validation results demonstrate high agreement between the tool’s error counts and human rater judgments (r = .94), high micro-average precision and recall in error detection (.96 and .94, respectively; F1 = .95), and T-unit and clause counts that match outputs from established tools such as L2SCA. Developed under open science principles to ensure transparency and replicability, the tool aims to support researchers and educators while emphasizing the complementary role of human expertise in language assessment. The possibilities of Auto Error Analyzer for efficient and scalable error analysis, as well as its limitations in detecting context-dependent and first-language (L1)-influenced errors, are also discussed.
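The accuracy metrics named above have standard definitions. The following sketch shows how errors per 100 words and micro-averaged precision, recall, and F1 are typically computed from pooled counts; the function names and example figures are illustrative assumptions, not part of Auto Error Analyzer’s actual interface.

```python
# Illustrative sketch of standard accuracy metrics; names and data are
# hypothetical and do not reflect Auto Error Analyzer's implementation.

def errors_per_100_words(error_count: int, word_count: int) -> float:
    """Normalized error rate: number of errors per 100 words of text."""
    return 100.0 * error_count / word_count

def micro_prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 from pooled counts of
    true positives (tp), false positives (fp), and false negatives (fn)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: a 250-word essay with 12 detected errors, of which 11 match the
# human annotations (tp), 1 is spurious (fp), and 1 annotated error is missed (fn).
print(errors_per_100_words(12, 250))   # 4.8 errors per 100 words
print(micro_prf(tp=11, fp=1, fn=1))    # precision, recall, F1 ≈ 0.917 each
```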
As the field of heritage language acquisition expands, there is a need for proficiency measures that allow speakers to be compared across groups and studies. Elicited imitation tasks (EITs) are efficient, cost-effective tasks with a long tradition in proficiency assessment of second language (L2) learners, first language children, and adults. However, little research has investigated their use with heritage speakers (HSs), despite their oral nature, which makes them appropriate for speakers with variable literacy skills. This study is a partial replication of Solon, Park, Dehghan-Chaleshtori, Carver & Long (2022), who administered an EIT originally developed for advanced L2 learners to a group of HSs. In this study, we administered the same EIT, with minor modifications, to 70 HSs and 132 L2 learners of Spanish with different levels of proficiency and ran a Rasch analysis to evaluate the functioning of the task with the two groups. To obtain concurrent validity evidence, scores on the EIT were compared with participants’ performance in an oral narration, evaluated for complexity, accuracy, and fluency (CAF), and with a standardized oral proficiency test, the Versant Spanish Test. Rasch analyses showed that the EIT was effective at distinguishing different levels of ability for both groups, and correlational analyses showed moderate to strong correlations between CAF measures and the EIT and very strong correlations between the EIT and the Versant Spanish Test. These results provide evidence that the EIT is an efficient and adequate proficiency test for HSs and L2 learners of Spanish; its use in research settings is recommended.
One of the most significant challenges in nutritional epidemiology research is achieving dietary data of sufficient accuracy and validity to establish an adequate link between dietary exposure and health outcomes. Recently, the emergence of artificial intelligence (AI) in various fields has offered ways to address this challenge through advanced statistical models and techniques for nutrient and food analysis. We aimed to systematically review available evidence regarding the validity and accuracy of AI-based dietary intake assessment methods (AI-DIA). In accordance with PRISMA guidelines, an exhaustive search of the EMBASE, PubMed, Scopus and Web of Science databases was conducted to identify relevant publications from their inception to 1 December 2024. Thirteen studies that met the inclusion criteria were included in this analysis. Of these, 61·5 % were conducted in preclinical settings; 46·2 % used AI techniques based on deep learning and 15·3 % on machine learning. Correlation coefficients above 0·7 between the AI-based and traditional assessment methods were reported in six articles for calorie estimates. Similarly, six studies obtained correlations above 0·7 for macronutrients, and four studies did so for micronutrients. A moderate risk of bias was observed in 61·5 % (n = 8) of the articles analysed, with confounding bias being the most frequently observed. AI-DIA methods are promising, reliable and valid alternatives for nutrient and food estimation. However, more research comparing different populations is needed, as well as larger sample sizes, to ensure the validity of the experimental designs.
Lexical proficiency is a multifaceted phenomenon that greatly impacts human judgments of writing quality. However, the importance of collocations’ contribution to proficiency assessment has received less attention than that of single words, despite collocations’ essential role in language production. This study, therefore, investigated how aspects of collocational proficiency affect the ratings that examiners give to English learner essays. To do so, collocational features related to sophistication and accuracy were manipulated in a set of argumentative essays. Examiners then rated the texts and provided rationales for their choices. The findings revealed that the use of lower-frequency words significantly and positively impacted the experts’ ratings. When used as part of collocations, such words then provided a small yet significant additional boost to ratings. Notably, there was no significant effect for increased collocational accuracy. These findings suggest that low-frequency words within collocations are particularly salient to examiners and deserving of pedagogic focus.
Whether, when, and why perceivers are able to accurately infer the personality traits of other individuals is a key topic in psychological science. Studies examining this question typically ask a number of perceivers to judge a number of targets with regard to a specific trait. The resulting data are then analyzed by averaging the judgments across perceivers or by computing the respective statistic for each single perceiver. Here, we discuss the limitations of the average-perceiver and single-perceiver approaches. Furthermore, we argue that cross-classified structural equation models can be used for the flexible analysis of accuracy data, and we illustrate how.
Response times on test items are easily collected in modern computerized testing. When both (binary) responses and (continuous) response times on test items are collected, it is possible to measure both the accuracy and the speed of test takers. To study the relationships between these two constructs, the measurement model for speed and accuracy is extended with a multivariate multilevel regression structure that allows covariates to be incorporated to explain variance in speed and accuracy between individuals and groups of test takers. A Bayesian approach with Markov chain Monte Carlo (MCMC) computation enables straightforward estimation of all model parameters. Model-specific implementations of the Bayes factor (BF) and the deviance information criterion (DIC) are proposed for model selection; both are easily calculated as byproducts of the MCMC computation. Results from simulation studies and real-data examples illustrate several novel analyses made possible by this modeling framework.
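For orientation, the two model-selection tools named above have standard generic definitions (the model-specific implementations developed in the article are not reproduced here). The Bayes factor compares two candidate models via their marginal likelihoods, and the DIC penalizes the posterior mean deviance by the effective number of parameters:

$$\mathrm{BF}_{12} = \frac{p(\mathbf{y} \mid M_1)}{p(\mathbf{y} \mid M_2)}, \qquad \mathrm{DIC} = \bar{D} + p_D, \qquad p_D = \bar{D} - D(\bar{\theta}),$$

where $D(\theta) = -2 \log p(\mathbf{y} \mid \theta)$ is the deviance, $\bar{D}$ is its posterior mean, and $\bar{\theta}$ is the posterior mean of the parameters; both $\bar{D}$ and $D(\bar{\theta})$ can be computed directly from MCMC draws, which is what makes these criteria cheap byproducts of the estimation.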
The goal of this paper is to systematically review the literature on United States Department of Agriculture (USDA) forecast evaluation and to critically assess the methods and findings of these studies. The fundamental characteristics of optimal forecasts are bias, accuracy and efficiency, as well as encompassing and informativeness. This review revealed that the findings of these studies can differ considerably depending on the forecasts examined, the commodity, the sample period, and the methodology. Some forecasts performed very well, while others were not very reliable, resulting in a forecast-specific optimality record. We discuss the methodological and empirical contributions of these studies as well as their shortcomings and potential opportunities for future work.
In this chapter, we explore several important statistical models. Statistical models allow us to perform statistical inference—the process of selecting models and making predictions about the underlying distributions—based on the data we have. Many approaches exist, from the stochastic block model and its generalizations to the edge observer model, the exponential random graph model, and the graphical LASSO. As we show in this chapter, such models help us understand our data, but using them may at times be challenging, either computationally or mathematically. For example, the model must often be specified with great care, lest it seize on a drastically unexpected network property or fall victim to degeneracy. Or the model must make implausibly strong assumptions, such as conditionally independent edges, leading us to question its applicability to our problem. Or even our data may be too large for the inference method to handle efficiently. As we discuss, the search continues for better, more tractable statistical models and more efficient, more accurate inference algorithms for network data.
This study suggests that there may be considerable difficulties in providing accurate calendar age estimates in the Roman period in Europe, between ca. AD 60 and ca. AD 230, using the radiocarbon calibration datasets that are currently available. Incorporating the potential for systematic offsets between the measured data and the calibration curve using the ΔR approach suggested by Hogg et al. (2019) only marginally mitigates the observed biases in calendar date estimates. At present, it clearly behoves researchers working in this period to “caveat emptor” and validate the accuracy of their calibrated radiocarbon dates and chronological models against other sources of dating information.
Response times (RTs) have become ubiquitous in second language acquisition (SLA) research, providing empirical evidence for the theorization of the language learning process. Recently, there have been discussions of some fundamental psychometric properties of RT data, including, but not limited to, their reliability and validity. In this light, we take a step back to reflect on the use of RT data to tap into linguistic knowledge in SLA. First, we offer a brief overview of how RT data are most commonly used as vocabulary and grammar measures. We then point out three key limitations of such uses, namely that (a) RT data can lack substantive importance without considerations of accuracy, (b) RT differences may or may not be a satisfactory psychometric individual difference measure, and (c) some tasks designed to elicit RT data may not be sufficiently fine-grained to target specific language processes. Our overarching goal is to enhance the awareness among SLA researchers of these issues when interpreting RT results and stimulate research endeavors that delve into the unique properties of RT data when used in our field.
This chapter, authored by a computer scientist and an industry expert in computer vision, briefly explains the fundamentals of artificial intelligence and facial recognition technologies. The discussion encompasses the typical development life cycle of these technologies and unravels the essential building blocks integral to understanding the complexities of facial recognition systems. The authors further explore key challenges confronting computer and data scientists in their pursuit of ensuring the accuracy, effectiveness, and trustworthiness of these technologies, challenges that also drive many of the common concerns regarding facial recognition technologies.
Recently released Moderate-Resolution Imaging Spectroradiometer (MODIS) land surface temperature (LST) collection 6.1 (C6.1) products are useful for understanding ice–atmosphere interactions over East Antarctica, but their accuracy should be known prior to application. This study assessed the Level 2 and Level 3 MODIS C6.1 LST products (MxD11_L2 and MxD11C1) against radiance-derived in situ LSTs from 12 weather stations. Significant cloud-related issues were identified in both LST products. Applying a stricter filter based on automatic weather station cloud data greatly improved the accuracy of MODIS LST, despite discarding 29.4% of the data. The cloud-screened MODIS LST showed cold biases relative to in situ LSTs at most stations (−5.18 to −0.07°C, with root mean square errors from 2.37 to 6.28°C), with smaller cold biases at inland stations and larger ones in coastal regions and at the edge of the plateau. Accuracy was notably higher during warm periods (October–March) than during cold periods (April–September). The cloud-screened MODIS C6.1 LST did not show significant improvement over the C5 (Collection 5) version across East Antarctica. Ice-crystal precipitation occurring during temperature inversions at the surface (Tair-Tsurface) played a crucial role in MODIS LST accuracy on the inland plateau. In coastal regions, larger MODIS LST biases were observed when the original measurements were lower.
In 2000, The Clay Minerals Society established a biennial quantitative mineralogy round robin. The so-called Reynolds Cup competition is named after Bob Reynolds for his pioneering work in quantitative clay mineralogy and his exceptional contributions to clay science. The first contest was run in 2002 with 40 sets of three samples, which were prepared from mixtures of purified, natural, and synthetic minerals that are commonly found in clay-bearing rocks and soils and represent realistic mineral assemblages. The rules of the competition allow any method or combination of methods to be used in the quantitative analysis of the mineral assemblages. Throughout the competition, X-ray diffraction has been the method of choice for quantifying the mineralogy of the sample mixtures, with a multitude of other techniques used to assist with phase identification and quantification. In the first twelve years of the Reynolds Cup competition (2002 to 2014), around 14,000 analyses from 448 participants were carried out on a total of 21 samples. These analyses constitute an extensive database on the accuracy of quantitative mineral analysis, and the twelve-year span is long enough to track the progression of improvements in such analyses. In the Reynolds Cup competition, the accuracy of a particular quantification is judged by calculating a “bias” for each phase in an assemblage; determining exactly the true amount of a phase in the assemblage gives a bias of zero. Generally, the higher-placed participants correctly identified all or most of the mineral phases present. Conversely, the worst performers failed to identify or misidentified phases. Several contestants reported a long list of minor exotic phases, which were likely reported by automated search/match programs and were mineralogically implausible. Not surprisingly, clay minerals were among the greatest sources of error reported. This article reports on the results of the first 12 years of the Reynolds Cup competition and analyzes the competition data to determine the overall accuracy of the mineral assemblage quantities reported by the participants. The data from the competition were also used to ascertain trends in quantification accuracy over the 12-year period and to highlight sources of error in quantitative analyses.
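As a rough illustration of the scoring idea described above, the sketch below computes a per-phase bias as the difference between a reported and a true weight percentage. The function names, data layout, and example mixture are illustrative assumptions and do not reproduce the Reynolds Cup’s actual scoring and ranking procedure.

```python
# Illustrative sketch of the per-phase "bias" described above; the exact
# aggregation and ranking rules used by the Reynolds Cup are not shown.

def phase_biases(reported: dict[str, float], true: dict[str, float]) -> dict[str, float]:
    """Bias per phase = reported wt.% minus true wt.%; zero means exact agreement.
    Phases present in the sample but not reported count as 0 wt.% reported,
    and reported phases absent from the sample have a true value of 0 wt.%."""
    phases = set(reported) | set(true)
    return {p: reported.get(p, 0.0) - true.get(p, 0.0) for p in phases}

# Hypothetical three-phase mixture (wt.%):
true_mix = {"quartz": 40.0, "illite": 35.0, "calcite": 25.0}
reported_mix = {"quartz": 43.0, "illite": 30.0, "calcite": 25.0, "halite": 2.0}

for phase, bias in sorted(phase_biases(reported_mix, true_mix).items()):
    print(f"{phase}: {bias:+.1f}")
# A misidentified phase (halite) appears as a positive bias for a phase
# whose true amount in the mixture is zero.
```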
Chapter 11 provides an overview of the terms for talking about grammar instruction and learning, including implicit learning vs. explicit learning and implicit knowledge vs. explicit knowledge. With these common terms defined, the chapter then describes several instructional approaches that researchers have utilized to better understand how language learners build their understanding of the target language. Particular attention is paid to focus-on-form and form-focused instructional strategies.
GIRI (Glasgow International Radiocarbon Intercomparison) was designed to meet a number of objectives, including providing an independent assessment of the analytical quality of a laboratory’s measurements and an opportunity for a laboratory to participate and improve (if needed). The principles in the design of GIRI were to provide the following: (a) a series of unrelated individual samples spanning the dating age range, (b) samples linked to earlier intercomparisons to allow traceability, (c) known-age samples to allow independent accuracy checks, (d) a small number of duplicates to allow independent estimation of laboratory uncertainty, and (e) two categories of samples—bulk and individual—to support laboratory investigation of variability. All of the GIRI samples are natural (wood, peat, and grain), some are of known age, and overall their ages span from approximately >40,000 years BP to modern. The complete list of sample materials includes humic acid, whalebone, grain, single-ring dendro-dated samples, dendro-dated wood samples spanning a number of rings (e.g., 10 rings), and background and near-background samples of bone and wood. We present an overview of the results received and preliminary consensus values for the samples, supporting a more in-depth evaluation of laboratory performance and variability.
Status hierarchies are ubiquitous across cultures and have been so over deep time. Position in hierarchies shows important links with fitness outcomes. Consequently, humans should possess psychological adaptations for navigating the adaptive challenges posed by living in hierarchically organised groups. One hypothesised adaptation functions to assess, track, and store the status impacts of different acts, characteristics and events in order to guide hierarchy navigation. Although this status-impact assessment system is expected to be universal, there are several ways in which differences in assessment accuracy could arise. This variation may link to broader individual difference constructs. In a preregistered study with samples from India (N = 815) and the USA (N = 822), we sought to examine how individual differences in the accuracy of status-impact assessments covary with status motivations and personality. In both countries, greater overall status-impact assessment accuracy was associated with higher status motivations, as well as higher standing on two broad personality constructs: Honesty–Humility and Conscientiousness. These findings help map broad personality constructs onto variation in the functioning of specific cognitive mechanisms and contribute to an evolutionary understanding of individual differences.