Prediction and classification are two very active areas in modern data analysis. In this paper, prediction with nonlinear optimal scaling transformations of the variables is reviewed and extended to the use of multiple additive components, much in the spirit of statistical learning techniques currently popular in data mining and other areas. In addition, a classification/clustering method is described that is particularly suitable for analyzing attribute-value data from systems biology (genomics, proteomics, and metabolomics) and that is able to detect groups of objects with similar values on small subsets of the attributes.
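The additive structure described above can be sketched with a simple backfitting loop. This is a minimal sketch, not the paper's method: the quantile-bin smoother below is a hypothetical stand-in for the optimal scaling transformations, and all names and parameters are illustrative.

```python
import numpy as np

def bin_smoother(x, target, n_bins=10):
    """Crude smoother: average the target within quantile bins of x."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
    means = np.array([target[idx == b].mean() if np.any(idx == b) else 0.0
                      for b in range(n_bins)])
    return means[idx]

def backfit_additive(X, y, n_iter=25):
    """Fit y ~ sum_j f_j(x_j) by backfitting, one additive component per predictor."""
    n, p = X.shape
    fits = np.zeros((n, p))
    y_centered = y - y.mean()
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: remove every component except the j-th
            partial = y_centered - fits.sum(axis=1) + fits[:, j]
            fits[:, j] = bin_smoother(X[:, j], partial)
            fits[:, j] -= fits[:, j].mean()
    return fits

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 3))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.1, 200)
components = backfit_additive(X, y)   # one fitted transformation per column
```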
This last chapter distills most of the material in this book into a range of concluding statements that summarize the lessons learned. These lessons can be viewed as guidelines for research practice.
We first discuss the phenomenon of data mining, which can involve running multiple tests to decide which variables or correlations are relevant. Used improperly, data mining can shade into scientific misconduct. Next, we discuss one way to arrive at a single final model, involving stepwise methods, and see that different stepwise methods lead to different final models. We then see that different configurations in test situations, illustrated here for cointegration testing, lead to different outcomes. It may be possible to identify which configurations make the most sense and can be used for empirical analysis. However, we suggest that it is better to keep various models and somehow combine inferences. This is illustrated by an analysis of the losses in airline revenues in the United States owing to 9/11. We see that out of four different models, three estimate a similar loss, while the fourth suggests only 10 percent of that figure. We argue that it is better to maintain various models, that is, models that withstand various diagnostic tests, for inference and for forecasting, and to combine what can be learned from them.
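The sensitivity of stepwise selection to its configuration is easy to demonstrate. The sketch below implements forward selection by AIC under illustrative data; running backward elimination, or using a different criterion, on the same data can return a different final model, which is the chapter's point. This is a sketch, not the chapter's actual analysis.

```python
import numpy as np
import statsmodels.api as sm

def forward_stepwise_aic(X, y):
    """Greedy forward selection: add the regressor that lowers AIC the most."""
    selected, remaining = [], list(range(X.shape[1]))
    best_aic = sm.OLS(y, np.ones((len(y), 1))).fit().aic  # intercept-only model
    while remaining:
        trials = [(sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit().aic, j)
                  for j in remaining]
        aic, j = min(trials)
        if aic >= best_aic:          # stop when no candidate improves AIC
            break
        best_aic = aic
        selected.append(j)
        remaining.remove(j)
    return selected

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))
y = X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=100)
print(forward_stepwise_aic(X, y))    # e.g. [0, 2] on this synthetic data
```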
The sheer number of research outputs published every year makes systematic reviewing increasingly time- and resource-intensive. This paper explores the use of machine learning techniques to help navigate the systematic review process. Machine learning has previously been used to reliably “screen” articles for review – that is, to identify relevant articles based on reviewers’ inclusion criteria. The application of machine learning techniques to subsequent stages of a review, however, such as data extraction and evidence mapping, is in its infancy. We therefore set out to develop a series of tools to assist in the profiling and analysis of 1,952 publications on the theme of “outcomes-based contracting.” Tools were developed for the following tasks: assigning publications to “policy area” categories; identifying and extracting key information for evidence mapping, such as organizations, laws, and geographical information; connecting the evidence base to an existing dataset on the same topic; and identifying subgroups of articles that may share thematic content. An interactive tool using these techniques and a public dataset with their outputs have been released. Our results demonstrate the utility of machine learning techniques for enhancing evidence accessibility and analysis within the systematic review process. These efforts show promise in yielding substantial efficiencies for future systematic reviews and in broadening their analytical scope. Beyond this, our work suggests implications for the ease with which policymakers and practitioners can access evidence. While machine learning techniques seem poised to play a significant role in bridging the gap between research and policy by offering innovative ways of gathering, accessing, and analyzing data from systematic reviews, we also highlight their current limitations and the need to exercise caution in their application, particularly given the potential for errors and biases.
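A minimal sketch of the kind of supervised classifier that could assign publications to policy-area categories. The pipeline, labels, and texts below are hypothetical, not the study's actual tooling; in practice such a model would be trained on a properly labelled subset of the corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled abstracts (the study profiled 1,952 real publications)
texts = [
    "Social impact bond for homelessness services in London",
    "Outcomes-based contract for workforce training providers",
    "Payment-by-results scheme in primary health care",
    "Results-based financing for maternal health programmes",
]
labels = ["housing", "employment", "health", "health"]

# TF-IDF features fed into a linear classifier: a common baseline for screening
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(texts, labels)
print(clf.predict(["outcome payments for hospital readmission reduction"]))
```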
U.S. law imposes strict recording and reporting requirements on all entities that manufacture and distribute controlled substances. As a result, the prescription opioid crisis has unfolded in a data-saturated environment. This article asks why the systematic documentation of opioid transactions failed to prevent or mitigate the crisis. Drawing on a recently disclosed trove of 1.4 million internal records from Mallinckrodt Pharmaceuticals, a leading manufacturer of prescription opioids, we highlight a phenomenon we propose to call data diversion, whereby data ostensibly generated or collected for the purpose of regulating the distribution of controlled substances were repurposed by the industry for the opposite aim of increasing sales at all costs. Systematic data diversion, we argue, contributed substantially to the scale of drug diversion seen with opioids and should become a focus of policy intervention.
This study aimed to identify the major topics of discussion under the #sustainability hashtag on Twitter (now known as “X”) and to understand user engagement. The sharp increase in social media usage, combined with a rise in climate anomalies in recent years, makes sustainability on social media a critical topic. Python was used to gather Twitter posts between January 1, 2023, and March 1, 2023. User engagement metrics were analyzed with a variety of statistical methods, and keyword-frequency analysis and Latent Dirichlet Allocation (LDA) were used to identify significant topics of discussion under the #sustainability hashtag. Additionally, histograms and scatter plots were used to visualize user engagement. The LDA analysis was conducted with 7 topics, after trials with different topic counts were evaluated to determine the number that best fit the dataset. The frequency analysis provided a basic overview of the discourse surrounding #sustainability, covering the topics of technology, business and industry, environmental awareness, and discussion of the future. The LDA model provided a more comprehensive view, adding topics such as Environmental, Social, and Governance (ESG) and infrastructure, investing, collaboration, and education. These findings have implications for researchers, businesses, organizations, and politicians seeking to align their strategies and actions with the major topics surrounding sustainability on Twitter in order to have a greater impact on their audience. Researchers can use the results of this study to guide further research on the topic or to contextualize their work within the existing sustainability literature.
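As a sketch of the core method, LDA topic extraction with 7 topics can be run as below. The corpus here is a placeholder, not the study's tweets, and the scikit-learn pipeline is one plausible implementation rather than the authors' exact code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder corpus; the study used tweets collected under #sustainability
tweets = [
    "green hydrogen technology will transform heavy industry",
    "ESG investing and corporate governance are reshaping business",
    "teach kids about climate, education is key to a sustainable future",
    "new infrastructure must be resilient to extreme weather",
]

vec = CountVectorizer(stop_words="english")
dtm = vec.fit_transform(tweets)                      # document-term matrix
lda = LatentDirichletAllocation(n_components=7,      # 7 topics, as in the study
                                random_state=0).fit(dtm)

terms = vec.get_feature_names_out()
for k, comp in enumerate(lda.components_):
    top = ", ".join(terms[i] for i in comp.argsort()[-5:][::-1])
    print(f"Topic {k}: {top}")
```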
Understanding the spatio-temporal patterns of users’ travel behavior on public transport (PT) systems is essential for more assertive transit planning. With this in mind, the aim of this article is to diagnose the spatial and temporal travel patterns of users of Fortaleza’s PT network, a trunk-feeder network whose fares are charged through a tap-on system. To this end, 20 databases were used, including global positioning system, user registration, and PT smart card data from November 2018, prior to the pandemic. The data were processed and organized into a relational database through an Extraction, Transformation, and Loading (ETL) process. A data mining approach based on machine learning models was applied to evaluate travel patterns. As a result, it was observed that users’ first daily trip exhibits spatial and temporal patterns more frequently than their last daily trip. In addition, users rarely show spatial and temporal patterns at the same time.
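One simple way to operationalize a "spatial pattern" from tap-on records is to check whether a single stop dominates a user's first daily boardings. The records, column names, and threshold below are hypothetical illustrations, not the article's actual pipeline.

```python
import pandas as pd

# Hypothetical smart-card tap-on records; the study used November 2018 data
taps = pd.DataFrame({
    "user_id":   [1, 1, 1, 1, 2, 2],
    "stop_id":   ["A", "B", "A", "C", "D", "D"],
    "timestamp": pd.to_datetime([
        "2018-11-05 07:02", "2018-11-05 18:10",
        "2018-11-06 07:05", "2018-11-06 19:40",
        "2018-11-05 08:30", "2018-11-06 08:31",
    ]),
})

taps["date"] = taps["timestamp"].dt.date
first = taps.sort_values("timestamp").groupby(["user_id", "date"]).first()

# Call it a spatial pattern when one stop dominates first daily boardings
modal_share = (first.groupby("user_id")["stop_id"]
                    .agg(lambda s: s.value_counts(normalize=True).iloc[0]))
has_spatial_pattern = modal_share >= 0.7    # hypothetical threshold
print(has_spatial_pattern)
```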
The COVID-19 pandemic itself constitutes an environment in which people experience a potential loss of control and freedom due to social distancing measures and other government orders. Variety-seeking has been treated as a mechanism to regain a sense of self-control. Using a machine learning model and household-level data focused on the wine market in the United States, this study showcases changing variety-seeking behavior over the pandemic year of 2020, in which people’s perception of the status of restriction measures influenced the degree to which they used variety-seeking as a coping strategy. It is shopping patterns and store environments that drive the behavioral responses in wine purchases to freedom-limited circumstances. Coupon use is associated with a lower variety-seeking tendency at the beginning of the stay-at-home order, but the variety level resumes as more time passes in the restriction periods. Variety-seeking tendency increases with shopping frequency at the beginning of the social distancing measures but decreases to a level lower than in all the non-restriction periods.
Natural language processing (NLP) methods hold promise for improving clinical prediction by utilising information otherwise hidden in the clinical notes of electronic health records. However, clinical practice – as well as the systems and databases in which clinical notes are recorded and stored – changes over time. As a consequence, the content of clinical notes may also change over time, which could degrade the performance of prediction models. Despite its importance, the stability of clinical notes over time has rarely been tested.
Methods:
The lexical stability of clinical notes from the Psychiatric Services of the Central Denmark Region in the period from January 1, 2011, to November 22, 2021 (a total of 14,811,551 clinical notes describing 129,570 patients) was assessed by quantifying sentence length, readability, syntactic complexity and clinical content. Changepoint detection models were used to estimate potential changes in these metrics.
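A minimal sketch of offline changepoint detection on one such metric. The `ruptures` package and the synthetic series below stand in for the study's actual models and data; the original analysis may have used a different changepoint formulation.

```python
import numpy as np
import ruptures as rpt

# Synthetic stand-in: monthly mean sentence length with one induced shift
rng = np.random.default_rng(0)
signal = np.concatenate([rng.normal(15.0, 0.5, 80),
                         rng.normal(13.5, 0.5, 50)])

# PELT search with an RBF cost; the penalty trades sensitivity for parsimony
algo = rpt.Pelt(model="rbf").fit(signal)
changepoints = algo.predict(pen=10)
print(changepoints)   # indices where the metric's distribution shifts
```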
Results:
We find lexical stability of the clinical notes over time, with minor deviations during the COVID-19 pandemic. Out of 2988 data points, 17 possible changepoints (corresponding to 0.6%) were detected. The majority of these were related to the discontinuation of a specific note type.
Conclusion:
We find lexical and syntactic stability of clinical notes from psychiatric services over time, which bodes well for the use of NLP for predictive modelling in clinical psychiatry.
Machine learning (ML) and in particular deep learning (DL) methods push state-of-the-art solutions for many hard problems, for example, image classification, speech recognition, or time series forecasting. In the domain of climate science, ML and DL are known to be effective for identifying causally linked modes of climate variability, which is key to understanding the climate system and to improving the predictive skill of forecast systems. To attribute climate events in a data-driven way, we need sufficient training data, which is often limited for real-world measurements. The data science community provides standard data sets for many applications. As a new data set, we introduce a consistent and comprehensive collection of climate indices typically used to describe Earth System dynamics. To this end, we use 1000-year control simulations from Earth System Models. The data set is provided as an open-source framework that can be extended and customized to individual needs. It allows users to develop new ML methodologies and to compare results to existing methods and models as a benchmark. For example, we use the data set to predict rainfall in the African Sahel region and the El Niño Southern Oscillation with various ML models. Our aim is to build a bridge between the data science community and researchers and practitioners from the domain of climate science to jointly improve our understanding of the climate system.
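The kind of benchmark task the data set enables might look as follows: predicting an index several months ahead from the other indices, with a chronological train/test split. The data here are synthetic stand-ins and the model choice is illustrative, not the paper's specific setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for monthly climate indices from a long control run
rng = np.random.default_rng(2)
n_months, n_indices = 12_000, 8           # roughly 1000 simulated years
indices = rng.normal(size=(n_months, n_indices))
target = indices[:, 0]                    # stand-in for an ENSO-type index

lead = 6                                  # predict six months ahead
X, y = indices[:-lead], target[lead:]
split = int(0.8 * len(X))                 # chronological split, no shuffling
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[:split], y[:split])
print("R^2 on held-out years:", model.score(X[split:], y[split:]))
```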
Avian radar systems are effective for wide-area bird detection and tracking, but their practical applications need further exploration. Existing radar data mining methods provide long-term functionality, but they are problematic for bird activity modelling, especially in the temporal domain. This paper addresses this gap by introducing a method for extracting and interpreting temporal bird activity. Bird behaviour is quantified as an activity degree that integrates intensity and uncertainty characteristics through an entropy weighting algorithm, and the method is applicable at multiple temporal scales. A historical radar dataset from a system deployed at an airport is used for verification. The extracted temporal characteristics are highly consistent with the understanding of local observers and ornithologists: the daily commuting and roosting behaviour of local birds is well reflected, evening bat activity is also extracted, and night migration activity is demonstrated clearly. The results indicate that the proposed method is effective for temporal bird activity modelling and interpretation, and its integration with bird strike risk models could further support airport safety management in the presence of wildlife interference.
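The paper does not spell out its entropy weighting formula, but the standard entropy weight method gives the flavour: indicators whose values are more dispersed across time bins receive larger weights. The indicator matrix below is a toy example.

```python
import numpy as np

def entropy_weights(M):
    """Standard entropy weight method over an indicator matrix M.

    Rows are time bins, columns are activity indicators (e.g. track counts,
    track-quality scores); more dispersed indicators get larger weights.
    """
    P = M / M.sum(axis=0, keepdims=True)          # normalise each indicator
    k = 1.0 / np.log(M.shape[0])
    e = -k * np.where(P > 0, P * np.log(P), 0.0).sum(axis=0)  # entropy per column
    d = 1.0 - e                                   # degree of divergence
    return d / d.sum()

M = np.array([[120.0, 3.2],
              [ 40.0, 3.1],
              [200.0, 3.3]])                      # toy hourly indicators
w = entropy_weights(M)
activity_degree = (M / M.sum(axis=0)) @ w         # weighted, normalised activity
```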
As the world becomes increasingly connected, it is also more exposed to a myriad of cyber threats. We need to use multiple types of tools and techniques to learn and understand the evolving threat landscape. Data is a common thread linking various types of devices and end users. Analyzing data across different segments of cybersecurity domains, particularly data generated during cyber-attacks, can help us understand threats better, prevent future cyber-attacks, and provide insights into the evolving cyber threat landscape. This book takes a data-oriented approach to studying cyber threats, showing in depth how traditional methods such as anomaly detection can be extended using data analytics. It also applies data analytics to non-traditional views of cybersecurity, such as multi-domain analysis, time series and spatial data analysis, and human-centered cybersecurity.
FOLD-RM is an automated inductive learning algorithm for learning default rules from mixed (numerical and categorical) data. It generates an explainable answer set programming (ASP) rule set for multi-category classification tasks while maintaining efficiency and scalability. The FOLD-RM algorithm is competitive in performance with widely used, state-of-the-art algorithms such as XGBoost and multi-layer perceptrons; unlike these algorithms, however, it produces an explainable model. FOLD-RM outperforms XGBoost on some datasets, particularly large ones, and also provides human-friendly explanations for its predictions.
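FOLD-RM's output takes the shape of ASP default rules with exceptions. The rule and its procedural reading below are illustrative inventions, not taken from the paper, and the Python rendering is only meant to convey how a default with an abnormality predicate behaves.

```python
# Illustrative shape of a FOLD-RM-style rule (ASP syntax, hypothetical):
#   acceptable(X) :- mileage(X, M), M =< 100000, not ab1(X).
#   ab1(X) :- make(X, unknown).

def ab1(car: dict) -> bool:
    """Exception: an abnormal case that defeats the default rule."""
    return car["make"] == "unknown"

def acceptable(car: dict) -> bool:
    """Default: low mileage implies acceptable, unless an exception fires."""
    return car["mileage"] <= 100_000 and not ab1(car)

print(acceptable({"mileage": 60_000, "make": "toyota"}))   # True
print(acceptable({"mileage": 60_000, "make": "unknown"}))  # False (exception)
```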
Digital engineering is increasingly established in industrial practice, and the application of machine learning to geometry data in particular is a growing research area. Driven by this, the paper presents a new method for the classification of mechanical components, which projects points onto spherical detector surfaces to transfer the geometries into matrices. These matrices are then classified using deep learning networks. Different types of projection are examined, as are several deep learning models. Finally, a benchmark dataset is used to demonstrate the method's competitiveness.
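A sketch of one such projection: binning a centred point cloud by spherical angles and storing, per cell, the largest radius seen in that direction. This max-radius variant is one plausible choice among the projections the paper compares, not necessarily its exact formulation.

```python
import numpy as np

def spherical_projection(points, n_theta=64, n_phi=64):
    """Project a centred point cloud onto a spherical detector grid."""
    pts = points - points.mean(axis=0)
    r = np.linalg.norm(pts, axis=1)
    # Spherical angles: polar angle theta in [0, pi], azimuth phi in [0, 2*pi)
    theta = np.arccos(np.clip(pts[:, 2] / np.maximum(r, 1e-12), -1.0, 1.0))
    phi = np.arctan2(pts[:, 1], pts[:, 0]) + np.pi
    i = np.minimum((theta / np.pi * n_theta).astype(int), n_theta - 1)
    j = np.minimum((phi / (2 * np.pi) * n_phi).astype(int), n_phi - 1)
    grid = np.zeros((n_theta, n_phi))
    np.maximum.at(grid, (i, j), r)    # keep max radius per detector cell
    return grid                       # matrix ready for a CNN classifier

cloud = np.random.default_rng(3).normal(size=(5000, 3))
image = spherical_projection(cloud)
```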
Patent data have long been utilized in engineering design research because they contain a massive amount of design information. Recent advances in artificial intelligence and data science present unprecedented opportunities to mine, analyse, and make sense of patent data to develop design theory and methodology. Herein, we survey the patent-for-design literature by its contributions to design theories, methods, tools, and strategies, as well as by the forms of patent data and the analysis methods used. Our review sheds light on promising future research directions for the field.
Industrial Data Analytics needs access to huge amounts of data, which are scattered across different IT systems. As part of an integrated reference kit for Industrial Data Analytics, a data backend system is needed that provides access to these data, with solutions for data extraction, data management, and an analysis pipeline. This paper presents an approach for such a data backend system.
This paper chronicles and reflects on the processes and meanings of a speculative design project that creates a narrative based on the scientific notion of phytomining, the extraction of metals from the soil using plants. The paper reflects on the project's ability to bring together people with different expertise, presenting it as a successful case study of Speculative Design and Research through Practice. Beyond the scientific and technical challenges posed by GeoMerce, the authors reflect on the critical framework that set the basis for such a complex project.
This research envisages an automated system to inform engineers when opportunities arise to reuse existing features or configurations during the development of new products. Such a system could be termed a "predictive CAD system" because it would suggest feature choices that follow patterns established in existing products. The predictive CAD literature largely focuses on predicting components for assemblies using 3D solid models; in contrast, this work focuses on a feature-based predictive CAD system using B-rep models. This paper investigates the performance of predictive models that could enable such an intelligent CAD system by assessing three approaches to inference (sequential, machine learning, and probabilistic), using N-Grams, Neural Networks (NNs), and Bayesian Networks (BNs) as representative methods. After defining the functional properties that characterize a predictive design system, a generic development methodology is presented. The methodology is used to carry out a systematic assessment of the relative performance of the three methods, each used to predict the diameter of the next hole or boss feature added during the design of a hydraulic valve body. When predictive performance was evaluated with five recommendations ($k = 5$) for hole or boss features as a new design was developed, recall@k increased from around 30% to 50% and precision@k from around 50% to 70% as one to three features were added. The results indicate that the BN and NN models perform better than those using N-Grams. The practical impact of this contribution was assessed using a prototype (implemented as an extension to a commercial CAD system) by engineers whose comments defined an agenda for ongoing research in this area.
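The sequential approach is the easiest to sketch: a bigram model over feature sequences that returns the top-k most likely next features. The feature names and sequences below are hypothetical, and this is a minimal illustration of the N-Gram idea rather than the paper's implementation.

```python
from collections import Counter, defaultdict

def train_bigram(sequences):
    """Count which feature tends to follow which in past designs."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def suggest(counts, prev, k=5):
    """Top-k next-feature suggestions, mirroring the paper's k = 5 setting."""
    return [feat for feat, _ in counts[prev].most_common(k)]

# Hypothetical feature sequences from previously designed valve bodies
designs = [
    ["hole_6mm", "hole_6mm", "boss_10mm", "hole_8mm"],
    ["hole_6mm", "boss_10mm", "hole_8mm"],
    ["boss_10mm", "hole_6mm", "hole_6mm"],
]
model = train_bigram(designs)
print(suggest(model, "hole_6mm"))   # ranked suggestions for the next feature
```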
Data mining and knowledge discovery (DMKD) focuses on extracting useful information from data. In the chemical process industry, tasks such as process monitoring, fault detection, process control, and optimization can be achieved using DMKD. However, selecting the appropriate method for each step of the DMKD process – data cleaning, sampling, scaling, dimensionality reduction (DR), clustering, cluster analysis, and data visualization – to obtain meaningful insights is far from trivial. In this contribution, a computational environment (FastMan) is introduced and used to illustrate how method selection affects DMKD in chemical process data. Two case studies, using data from a simulated natural gas liquid plant and real data from an industrial pyrolysis unit, were conducted to demonstrate the applicability of these methodologies in real-life scenarios. Sampling and normalization methods were found to have a great impact on the quality of the DMKD results. In addition, t-distributed stochastic neighbor embedding, a neighbor-graph method for DR, outperformed principal component analysis, a matrix factorization method frequently used in the chemical process industry, at identifying both local and global changes.
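The PCA-versus-t-SNE comparison is straightforward to reproduce in outline. The data below are a synthetic stand-in for process measurements, and the scaling step is included because, as the paper notes, normalization choices strongly affect DMKD results.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Synthetic stand-in for process data: rows = samples, columns = sensor tags
X = np.random.default_rng(4).normal(size=(500, 30))
Xs = StandardScaler().fit_transform(X)   # normalization choice matters here

pca_2d = PCA(n_components=2).fit_transform(Xs)            # matrix factorization
tsne_2d = TSNE(n_components=2, perplexity=30,             # neighbor-graph method
               random_state=0).fit_transform(Xs)
```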
This chapter reviews recent developments in modern soft computing models, including heuristic algorithms, extreme learning machines, and models based on deep learning strategies applied to water management. In this context, we describe the basics and fundamentals of these soft computing methods. We then provide a brief review of the models applied in three fields of water management: drought forecasting, evapotranspiration modelling, and rainfall-runoff simulation. In doing so, we provide guidelines for applying modern soft computing techniques to water management.