Scientists must be ethical and conscientious, always. Data bring with them much promise to improve our understanding of the world around us and to improve our lives within it. But there are risks as well. Scientists must understand the potential harms of their work and follow norms and standards of conduct to mitigate those concerns. But network data are different: as we discuss in this chapter, they are some of the most important but also most sensitive data. Before we dive into the data, we discuss the ethics of data science in general and of network data in particular. The ethical issues that we face often do not have clear solutions but require thoughtful approaches and an understanding of complex contexts and difficult circumstances.
In this chapter, we discuss how to represent network data inside a computer, with some examples of computational tasks and the data structures that enable those computations. When working with network data using code, you have many choices of data structures, but which ones are best for your goals? Writing your own code to process network data can be valuable, yet existing libraries, which feature extensively tested and efficiently engineered functionality, are worth considering as well. Python and R, both excellent programming languages for data science, come well equipped with third-party libraries for working with network data, and we describe some examples. We also discuss choosing and using typical file formats for storing network data, as many standard formats exist.
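As a minimal sketch of these choices, and not the chapter's specific examples, the following Python snippet assumes the NetworkX library is available; it builds a small network from an edge list, views the same network as an adjacency structure, and writes it to a plain-text edge list file (the node names and filename are illustrative):

    import networkx as nx

    # Build a small undirected network from an edge list.
    edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
    G = nx.Graph()
    G.add_edges_from(edges)

    # The same network as an adjacency structure (node -> set of neighbors),
    # which makes neighbor lookups fast at the cost of some redundancy.
    adjacency = {node: set(G.neighbors(node)) for node in G.nodes}
    print(adjacency)

    # Persist the network in a common plain-text format (illustrative filename).
    nx.write_edgelist(G, "example_network.edgelist", data=False)

An edge list is compact and convenient for storage and exchange, while an adjacency structure favors fast neighbor queries; which trade-off matters most depends on the computations you need to run.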
Network data, like all data, are imperfect measures of the objects of study. There may be missing information or false information. For networks, these measurement errors can lead to missing nodes or links (network elements that exist in reality but are absent from the network data) or spurious nodes or links (nodes or links present in the data but absent in reality). More troubling, these conditions exist on a continuum: there is a spectrum of scenarios where nodes or links may exist but not be meaningful in some way. In this chapter, we describe how such errors can appear and affect network data, and we introduce some ways to handle them in the data processing steps. Fixes for errors can lead to different networks before and after processing, for example, so we must be careful and circumspect in identifying and planning for such errors.
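As one hedged illustration of such a processing step, and not a general prescription, the sketch below uses NetworkX to drop self-loops and isolated nodes, two features that can be, but are not always, artifacts of measurement; whether they are errors depends entirely on the system being studied:

    import networkx as nx

    # A toy network containing a self-loop and an isolated node (illustrative data).
    G = nx.Graph()
    G.add_edges_from([("a", "a"), ("a", "b"), ("b", "c")])
    G.add_node("orphan")  # a node with no links

    # Remove self-loops and isolated nodes; in a real study, first decide
    # whether these reflect measurement error or genuine structure.
    G.remove_edges_from(list(nx.selfloop_edges(G)))
    G.remove_nodes_from(list(nx.isolates(G)))

    print(G.number_of_nodes(), G.number_of_edges())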
What are the nodes? What are the links? These questions are not the start of your work—the upstream task makes sure of that—but they are an inflection point. Keep them front of mind. Your methods, the paths you take to analyze and interrogate your data, all unfold from the answers (plural!) to these questions. This chapter reflects on where we have gone, where we can go for more, and, perhaps, what the future has in store for data science, networks and network data.
Machine learning has revolutionized many fields, including science, healthcare, and business. It is also widely used in network data analysis. This chapter provides an overview of machine learning methods and how they can be applied to network data. Machine learning can be used to clean, process, and analyze network data, as well as make predictions about networks and network attributes. Methods that transform networks into meaningful representations are especially useful for specific network prediction tasks, such as classifying nodes and predicting links. The challenges of using machine learning with network data include recognizing data leakage and detecting dataset shift. As with all machine learning, effective use on networks depends on practicing good data hygiene when evaluating a predictive model’s performance.
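To make the data-hygiene point concrete, here is a minimal, illustrative sketch of leakage-aware link prediction evaluation, with a simple neighborhood-overlap score standing in for a full machine learning model; it assumes NetworkX and scikit-learn are installed, and the dataset and split are only examples:

    import random
    import networkx as nx
    from sklearn.metrics import roc_auc_score

    random.seed(42)
    G = nx.karate_club_graph()

    # Hold out a test set of edges; scores must be computed on the training
    # graph only, otherwise the held-out links leak into the predictor.
    edges = list(G.edges())
    test_edges = random.sample(edges, k=len(edges) // 5)
    G_train = G.copy()
    G_train.remove_edges_from(test_edges)

    # Negative examples: node pairs with no link in the full graph.
    non_edges = random.sample(list(nx.non_edges(G)), k=len(test_edges))

    pairs = test_edges + non_edges
    labels = [1] * len(test_edges) + [0] * len(non_edges)
    scores = [score for _, _, score in nx.jaccard_coefficient(G_train, pairs)]
    print("AUC:", roc_auc_score(labels, scores))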
Some networks, many in fact, vary with time. They may grow in size, gaining nodes and links. Or they may shrink, losing links and becoming sparser over time. Sitting behind many networks are drivers that change the structure, predictably or not, leading to dynamic networks that exhibit all manner of changes. This chapter focuses on describing and quantifying such dynamic networks, recognizing the challenges that dynamics bring, and finding ways to address those challenges. We show how to represent dynamic networks in different ways, how to devise null models for dynamic networks, and how to compare and contrast dynamical processes running on top of the network against a network structure that is itself dynamic. Dynamic network data also bring practical issues, and we discuss working with date and time data and file formats.
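As a small, illustrative sketch of that practical side, assuming pandas and NetworkX are available, the snippet below parses timestamps on an edge list and builds one network snapshot per day; the data and the daily window are stand-ins, and the right window size is an analysis choice:

    import pandas as pd
    import networkx as nx

    # A timestamped edge list (illustrative data).
    df = pd.DataFrame({
        "source": ["a", "a", "b", "c"],
        "target": ["b", "c", "c", "d"],
        "time": ["2021-03-01 09:00", "2021-03-01 17:30",
                 "2021-03-02 08:15", "2021-03-03 12:00"],
    })
    df["time"] = pd.to_datetime(df["time"])

    # Build one network snapshot per day.
    snapshots = {}
    for day, group in df.groupby(df["time"].dt.date):
        snapshots[day] = nx.from_pandas_edgelist(group, "source", "target")

    for day, G in snapshots.items():
        print(day, G.number_of_edges())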
In this chapter, we explore several important statistical models. Statistical models allow us to perform statistical inference (the process of selecting models and making predictions about the underlying distributions) based on the data we have. Many approaches exist, from the stochastic block model and its generalizations to the edge observer model, the exponential random graph model, and the graphical LASSO. As we show in this chapter, such models help us understand our data, but using them may at times be challenging, either computationally or mathematically. For example, the model must often be specified with great care, lest it seize on a drastically unexpected network property or fall victim to degeneracy. Or the model must make implausibly strong assumptions, such as conditionally independent edges, leading us to question its applicability to our problem. Or our data may simply be too large for the inference method to handle efficiently. As we discuss, the search continues for better, more tractable statistical models and more efficient, more accurate inference algorithms for network data.
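To give a flavor of one such model, the sketch below generates a network from a two-block stochastic block model using NetworkX; fitting these models to observed data generally requires specialized inference tools, so this only illustrates the generative side, with illustrative block sizes and connection probabilities:

    import networkx as nx

    # A two-block stochastic block model: dense within blocks, sparse between.
    sizes = [30, 30]
    probs = [[0.25, 0.02],
             [0.02, 0.25]]
    G = nx.stochastic_block_model(sizes, probs, seed=1)

    # The planted partition is stored as a graph attribute.
    partition = G.graph["partition"]
    print(G.number_of_nodes(), G.number_of_edges())
    print([len(block) for block in partition])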
In working with network data, data acquisition is often the most basic yet the most important and challenging step. The availability of data and norms around data vary drastically across different areas and types of research. A team of biologists may spend more than a decade running assays to gather a cell’s interactome; another team of biologists may only analyze publicly available data. A social scientist may spend years conducting surveys of underrepresented groups. A computational social scientist may examine the entire network of Facebook. An economist may comb through large financial documents to gather tables of data on stakes in corporate holdings. In this chapter, we move one step along the network study life-cycle. Key to data gathering is good record-keeping and data provenance. Good data gathering sets us up for future success (otherwise: garbage in, garbage out), making it critical to ensure that the best quality and most appropriate data are used to power your investigation.
This chapter covers ways to explore your network data using visual means and basic summary statistics, and how to apply statistical models to validate aspects of the data. Data analysis can generally be divided into two main approaches, exploratory and confirmatory. Exploratory data analysis (EDA) is a pillar of statistics and data mining, and we can leverage existing techniques when working with networks. However, we can also use specialized techniques for network data and uncover insights that general-purpose EDA tools, which neglect the network nature of our data, may miss. Confirmatory analysis, on the other hand, grounds the researcher in specific, preexisting hypotheses or theories, and then seeks to understand whether the given data support or refute that preexisting knowledge. Thus, complementing EDA, we can define statistical models for properties of the network, such as the degree distribution, or for the network structure itself. Fitting and analyzing these models then recapitulates effectively all of statistical inference, including hypothesis testing and Bayesian inference.
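As a simple, illustrative starting point for the exploratory side, assuming NetworkX is installed, the snippet below computes a few summary statistics and the degree distribution of a built-in example network:

    from collections import Counter
    import networkx as nx

    G = nx.karate_club_graph()

    # Basic summary statistics.
    print("nodes:", G.number_of_nodes())
    print("edges:", G.number_of_edges())
    print("density:", nx.density(G))
    print("average clustering:", nx.average_clustering(G))

    # Degree distribution: how many nodes have each degree.
    degree_counts = Counter(degree for _, degree in G.degree())
    for degree in sorted(degree_counts):
        print(degree, degree_counts[degree])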
Realistic networks are rich in information, often too rich for all that information to be easily conveyed. Summarizing the network then becomes useful, often necessary, for communication and understanding, while being wary, of course, that a summary necessarily loses information about the network. Further, networks often do not exist in isolation. Multiple networks may arise from a given dataset, or multiple datasets may each give rise to different views of the same network. In such cases and more, researchers need tools and techniques to compare and contrast those networks. In this chapter, we'll show you how to summarize a network using statistics, visualizations, and even other networks. From these summaries, we then describe ways to compare networks, for example by defining a distance between networks. Comparing multiple networks using the techniques we describe can help researchers choose the best data processing options and unearth intriguing similarities and differences between networks in diverse fields.
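As one simple, illustrative comparison among the many possible ones, and not necessarily the measures developed in this chapter, the sketch below compares the degree distributions of two synthetic networks using the Jensen-Shannon distance; it assumes NumPy, SciPy, and NetworkX are available:

    import numpy as np
    import networkx as nx
    from scipy.spatial.distance import jensenshannon

    def degree_distribution(G, max_degree):
        # Degree histogram padded to a common length, normalized to a distribution.
        hist = np.array(nx.degree_histogram(G), dtype=float)
        padded = np.zeros(max_degree + 1)
        padded[: len(hist)] = hist
        return padded / padded.sum()

    G1 = nx.erdos_renyi_graph(200, 0.05, seed=1)
    G2 = nx.barabasi_albert_graph(200, 5, seed=1)

    max_deg = max(max(dict(G1.degree()).values()),
                  max(dict(G2.degree()).values()))
    p = degree_distribution(G1, max_deg)
    q = degree_distribution(G2, max_deg)

    # Jensen-Shannon distance between the two degree distributions.
    print(jensenshannon(p, q))

A distance defined on degree distributions is only one of many possible network distances, and it is blind to structure beyond degrees; richer comparisons trade that simplicity for sensitivity to other features.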
Machine learning, especially neural network methods, is increasingly important in network analysis. This chapter discusses the theoretical aspects of network embedding methods and graph neural networks. As we have seen, much of the success of advanced machine learning is thanks to useful representations of data, known as embeddings. Embedding and machine learning are closely aligned. Translating network elements to embedding vectors and sending those vectors as features to a predictive model often leads to a simpler, more performant model than trying to work directly with the network. Embeddings help with network learning tasks, from node classification to link prediction. We can even embed entire networks and then use models to summarize and compare networks. Not only does machine learning benefit from embeddings; embeddings also benefit from machine learning. Inspired by the incredible recent progress with natural language data, embeddings created by predictive models are becoming more useful and important. Often these embeddings are produced by neural networks of various flavors, and we explore current approaches for using neural networks on network data.
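As a minimal sketch of the embedding idea, using a classical spectral embedding rather than a trained neural model, the snippet below assigns each node a low-dimensional vector built from eigenvectors of the normalized graph Laplacian; NetworkX and NumPy are assumed, and the two-dimensional choice is illustrative:

    import numpy as np
    import networkx as nx

    G = nx.karate_club_graph()

    # Spectral embedding: eigenvectors of the normalized Laplacian give each
    # node a coordinate vector that reflects the network's structure.
    L = nx.normalized_laplacian_matrix(G).toarray()
    eigenvalues, eigenvectors = np.linalg.eigh(L)

    # Skip the first (trivial) eigenvector; keep the next two dimensions.
    embedding = eigenvectors[:, 1:3]
    print(embedding.shape)  # (number of nodes, 2)

These vectors can then be fed to any standard predictive model, for instance to classify nodes; learned embeddings such as those produced by neural networks play the same role but are trained rather than computed in closed form.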
This chapter discusses record keeping, like maintaining a lab notebook. Historically, lab notebooks were analog, pen-and-paper affairs. With so much work being performed on the computer and with most scientific instruments creating digital data directly, most record-keeping efforts are digital. Therefore, we focus on strategies for establishing and maintaining records of computer-based work. Keeping good records of your work is essential. These records inform your future thoughts as you reflect on the work you have already done, acting as reminders and inspiration. They also provide important details for collaborators, and scientists working in large groups often have predefined standards for group members to use when keeping lab notebooks and the like. Computational work differs from traditional bench science, and this chapter describes practices for good record-keeping habits in the more slippery world of computer work.
Drawing examples from real-world networks, this essential book traces the methods behind network analysis and explains how network data is first gathered, then processed and interpreted. The text will equip you with a toolbox of diverse methods and data modelling approaches, allowing you to quickly start making your own calculations on a huge variety of networked systems. This book sets you up to succeed, addressing the questions of what you need to know and what to do with it, when beginning to work with network data. The hands-on approach adopted throughout means that beginners quickly become capable practitioners, guided by a wealth of interesting examples that demonstrate key concepts. Exercises using real-world data extend and deepen your understanding, and develop effective working patterns in network calculations and analysis. Suitable for both graduate students and researchers across a range of disciplines, this novel text provides a fast-track to network data expertise.