Transformation of the Earth’s social and environmental systems is happening at an incredible pace. The global population has more than doubled over the last five decades, while food and water consumption has tripled and fossil-fuel use has quadrupled. Attendant benefits such as longer lifespans and economic growth are increasingly joined by corresponding drawbacks, including mounting socioeconomic inequality, environmental degradation, and climate change. Over the past half-century, interregional differences in population growth rates, unprecedented urbanization, and international migration have led to profound shifts in the spatial distribution of the global population. Economic changes have been dramatic as well. The global per-capita gross domestic product has doubled while economic disparities have grown in many regions (Rosa et al. 2010).
These socioeconomic shifts have affected a host of natural systems and ecosystem services. Demographic shifts and economic development are distal causes of proximate drivers of environmental change, such as fossil-fuel emissions and land-cover change. These changes affect many natural systems, including land-cover composition, soil and water quality, climate regulation and temperature, and vegetation and animal communities. These environmental dynamics have profound implications for human well-being. Flooding, erosion of coastal areas, and drought already affect human societies in many ways, and these effects will grow sharply in coming decades. These shifts are all facets of interlinked human-environment systems that arise from complex interactions among individuals, society, and the environment (Ehrlich et al. 2012).
Can data science help address human-environment challenges? Scientific and policy bodies have called for more and better data and attendant analyses to support the research needed to meet the impacts of rapid human-environmental change (Millett & Estrin 2012). Socioeconomic, demographic, and other social data that can be closely integrated with Earth systems data are essential to describing the continuously unfolding transformation of human and ecological systems (Holm et al. 2013). Of particular interest are big data, or data sets that are larger and more difficult to handle than those typically used in most fields, and data science, the broader field concerned with collecting, managing, and analyzing such data.
Data science offers advances in processing and analysis for research and policy development. Special issues in leading journals like Science and Nature highlight the need for new data and methods to help answer a wide array of questions at the intersection of nature and society (Baraniuk 2011). National scientific bodies such as the US National Academy of Sciences, the United Kingdom’s Royal Society, the European Science Foundation, and the Chinese Academy of Sciences have issued high-profile calls to develop and use big data to understand and address scientific and policy challenges stemming from human-environment interactions. We also see the advent of specialized journals, such as Big Earth Data and the International Journal of Digital Earth, that focus on large human-environment data sets.
Researchers and policymakers see data science’s promise and pitfalls for human-environment systems. The move toward analyzing vast new data sets redefines disciplines that range from physics to economics to Earth sciences. These data are gleaned from a host of new sensors, internet activities, and the merging of existing databases. At the same time, some of the initial hype around data science and big data has been tempered by how this work plays out in real-world contexts. The fast growth of some forms of data has highlighted the considerable gaps in other kinds. Humans have studied only a tiny part of the world’s oceans and only a fraction of the millions of species on the Earth’s surface. There are also significant gaps in data on people and society over much of the globe. Human-environment data pose many significant unresolved methodological challenges because they represent complex social and environmental entities and relationships that span multiple organizational, spatial, and temporal levels (Kugler et al. 2015). Data science also faces many unsolved challenges around theory development and myriad policy dimensions. Even as vast databases become more readily accessible and tractable, many problems have yet to be addressed, and much of the promise of big data remains just that – a promise unfulfilled.
1.1 Data Science and Human-Environment Research
There is broad interest in using big data for understanding human-environment interactions and attendant issues – including climate change, natural hazards, ecosystem services, and sustainability. This volume brings together these various research streams while assessing the pros and cons of data science for human-environment scholarship. It draws on various sources but focuses almost exclusively on peer-reviewed research. The goal here is to bridge various camps of scholarly work on big data and data science for human-environment systems. Big data and data science are here to stay; maybe not in their current incarnation, but certainly in some form. Addressing the toughest human-environment issues requires scholars to work together across fields. This list includes (and is not limited to) data scientists, statisticians, and computer scientists; domain scientists working on social, environmental, and natural systems; and scholars in policy and law, and arts and humanities.
Global environmental change and other human-environment topics operate at vast spatial and temporal scales in some respects and at the hyperlocal in others. One need only look to action around climate change to see how global social and environmental systems are inextricably linked to individual behavior. These incredible scale shifts mean we deal with a vast range of data, methods, and theories across research domains. Scholars also deal with problems that do not neatly fall along human or environmental lines.
Given these pervasive scale-related problems and the inherent complexity they create, it is not surprising that interdisciplinary and transdisciplinary research are both seen as necessary; the problems of global change transcend conventional disciplinary inquiry. Global change is often treated largely as an environmental problem, but the environment is not simply an “independent variable”; indeed, global change is a consequence of social processes.
In simple terms, human-environment research is not the domain of any single research field. Doing this research well requires a deliberate commitment to boundary-crossing and integrated scholarship.
Data science is making deep inroads into many kinds of scholarship on human-environment topics, but the literature is splintered. Some of the most extensive work centers on big data (Section 1.2 dives into definitions of big data), primarily focused on providing wide-ranging and generic overviews. These are often trade books that cite primarily from the gray literature or nonpeer-reviewed blogs and web pages. Increasingly these works include research perspectives as data science has ramped up over the past decade. These resources often have an exuberant bent that is driven by just-so case studies that capture the attention of mass media. This large and general body of work directly (or often indirectly) reflects how big data is big business. Data science is vital to a growing array of economic sectors. Because big data and data science drive this commercial success, they are often couched in an optimistic viewpoint with a mercenary perspective at its core. Much of the early writing on big data was commercial, and the authors were understandably looking to sell their products (Wyly 2014).
Much of the early work in big data and data science relied on nonscholarly and nonpeer-reviewed sources. References to blog posts, web pages, and gray literature abound. Informal and nonpeer-reviewed sites will always be essential venues of information on rapidly emerging issues in technology since more deliberate and careful research and subsequent publications can require years. Apart from not being peer-reviewed, the major drawback of these sites is that they too often disappear. For example, the site www.bigdata-startups.com is cited by dozens of academic papers as a source of crucial information; however, it no longer exists beyond partial and fragmented backups in internet archives. Another example is the work of McKinsey & Company, a management consulting firm. This significant proponent of big data published well-cited work at the now-defunct website www.mckinseyonsociety.com, and its articles only live on as informal copies and references.
Scholarly work in data science and big data has proliferated over the past decade. This work falls into several camps and reflects the rapidity with which data science and big data worked their way into the arenas of science agenda setting, funding, and publication. Academia has always been as prone as any other human endeavor to embrace fads, fashions, and folderol (Dunnette 1966). The rapid embrace of all-things-data is driven in part by fashion, but it is also clear that data science approaches work well for many questions, even when there is room for improvement with others. As explored in later chapters, there are also deeper issues in how scholars can, or should, engage with these approaches. This book speaks to many communities in the hope of helping bring them together around a robust data science of human-environment systems.
Social scientists and humanities scholars have long been interested in nature and human-environment relationships. However, the recent increased visibility of human well-being, climate change, environmental justice, ecological resilience, and sustainability has rapidly expanded social science research on the environment. We are also seeing growth in the digital and environmental humanities, areas with an interest in data science as both a methodology and a subject of critical study. Social science and humanities scholarship comprises a large and growing body of perspectives on big data. The majority of this work critiques big data and its role in specific application areas, such as cities or policing, or from a specific perspective, especially in science and technology studies. There is also scholarship, still in the minority, that offers grounded accounts of the promise and drawbacks of big data for particular scientific and policy domains.
Earth, planetary, ecological, and natural scientists have embraced the study of the Earth as an integrated human-environment system. The physical, chemical, and biological impacts of human activities in the Anthropocene have taken on planetary import (Ruddiman 2013). The transdisciplinary field of Earth-system science focuses on ocean, land, and atmosphere processes, recognizing that changes in the Earth result from complex interactions among these Earth systems and human systems. Ecological, natural, and Earth sciences research with data science tends to center on fairly narrowly defined areas of interest, such as using remote sensing for climate change research or geospatial data to study animal movement. In keeping with environmental scientific publishing in general, this work is usually shared via articles, but a growing number of books, predominantly edited volumes, focus on specific research questions.
Information, data, and computer scientists perform much big data research. Many articles and editorials by these researchers call for greater engagement with domain experts to advance big data. One of the goals of this book is to offer these scholars an overview of significant challenges and opportunities in human-environment research. Information and computer science publications provide a mix of general overviews on the computational aspects of big data or advanced information on specific challenges. Articles and edited volumes also offer case studies within narrowly defined research topics. The vast majority of this work is in keeping with the general publishing model of computer sciences, which tends toward shorter pieces in conference proceedings that may or may not be peer-reviewed.
Debates over the potential and problems of data science can be uneven or narrowly defined. Hidalgo (2014) expresses frustration with these problems in his opinion piece “Saving big data from big mouths,” which argues that coverage of big data seems to oscillate between uncritical reports or even hyperbolic odes and underinformed critiques or jeremiads about a big data strawman. Calls for greater collaboration tend to revolve around linking core fields in data science, especially statistics and computer science, with domain fields in the social and natural sciences and with the arts and humanities. One common complaint is that data science focuses too often on important yet narrow technical and computational considerations and gives short shrift to many aspects of substantive domain knowledge. At the same time, domain scholars outside of data science run the risk of ham-handedly using data approaches or caricaturing the entire field based on limited engagement. As we explore later, there are many threads to this conversation. There are fundamental differences among fields and their conceptual and epistemological bases. There are marked disparities in funding and infrastructural support for some kinds of work over others that have far-reaching effects on the kinds of questions being asked and answered by scholars of all stripes.
Communication issues between data scientists and domain scholars are related to the need for better communication between human and environmental researchers. Three decades ago, Stern (1993) called for a second environmental science that highlighted the need for environmental science to embrace the human. While there have been positive developments in integration, there is much potential for greater collaboration. As Holm and others put it,
various important disciplines, mainly social and human, are too often overlooked or neglected as a science, such as law, architecture, history, literature, communication, sociology, and psychology. These are important disciplines to fully understand Earth systems and human motivation and to guide decision-makers. However, they are not routinely seen as fundamental to giving policy advice. Proponents of interdisciplinary research at times relegate human and social science research to an auxiliary, advisory, and essentially nonscientific status.
Finally, while the focus on relationships between humans and nature anchors most discussion in this book, it is helpful to recognize that this division can be seen as arbitrary. People have been looking at human-environment systems for thousands of years (Marsh 1864). At the same time, there is a long-standing body of work in posthumanism that questions human-centric explanations and correspondingly rejects the dual construction of nature and culture (Braun 2004). This scholarship rejects the concept that nonhuman beings lack agency and embraces the idea that human and nonhuman beings cocreate many spaces. These spaces range from our stomach microbiome to human relationships with animals to interactions with the Earth.
Posthumanism has critics. It can be seen as perpetuating Eurocentric forms of knowledge, as highlighted by Indigenous critiques that argue that its universalizing claims about ways of knowing and being are themselves problematic (Sundberg 2013). For example, there is an ongoing need for Euro-American scholarship to take Indigenous knowledge more seriously; the intellectual labor and activist work of Indigenous scholars and practitioners on the mutual interdependence of humans and the environment illustrates how this division may be illusory (Watts 2013). It is important to bear these issues in mind, even as this book primarily uses a human-environment framing as a helpful shorthand for a complex set of dynamics.
1.2 What Are Big Data?
Data science deals with data, unsurprisingly. Data science has subsumed many aspects of big data as a scholarly endeavor, but it is important to consider data and big data on their own. Most scholarly work relies on data harnessed to various methods and concepts. Most researchers can readily point to the kinds of data they use. The simple notion of data as measures of phenomena that we find interesting (e.g., temperature, population counts, or interviews) suffices for most conversations about data science. However, it is essential to dig a little deeper at times and recognize the long and fraught history of data in science. A note on terminology – we will use big data as a plural noun when speaking of the data as such (e.g., “big data are collected”) and as a singular noun when speaking of the larger field of big data (e.g., “big data offers perils and promise”).
People have collected data for millennia. People twenty thousand years ago were using tally sticks, making notches in pieces of wood or bone to keep track of important things, which presumably came in handy for activities such as trading and keeping inventory of possessions (Mankiewicz 2000). Four thousand years ago, people used calculating devices such as the abacus and stored information in libraries. The rise of modern statistics and record-keeping originated in the 1600s and was codified by the 1800s. In the nineteenth century, people used data in ways nearly indistinguishable from how we employ data, statistics, and modeling today to design descriptive measures and find associations in data (Porter 1986). The scientific meaning of data, which underpins big data, came into being in the 1600s. The term data is the Latin plural of datum, or “what is given,” from the verb dare, “to give,” but it has a deep, contested, and varied history over the centuries for notions of facts or evidence (Rosenberg 2013). Data are not always simple!
The key to understanding data science is understanding that data are made or captured by an observer. Indeed, some scholars would argue that the term data should be better considered as the term capta, from the Latin verb capere, meaning “to take” (Checkland & Holwell 2006). This book uses “data” since capta is a technical term for what most people think of as data, but it is helpful to consider what the concept implies for big data. Since observers capture data, this information is biased from initial observations to subsequent data handling, interpretation, and analysis. Statisticians spend much time developing new ways to plumb the nuances of data. Social scientists debate endlessly about how data map onto complicated social phenomena like race or trust. Natural scientists are heavily invested in ensuring their instrumentation and observations are free of systemic bias. The humanities have led the charge against naïve realism, noting that data are not the same as related phenomena, despite how they are often treated as inseparable. Nonetheless, despite best efforts to reduce bias in data, it is inescapable (Section 2.3).
Despite (or perhaps because of) big data being a trendy topic, there is no single commonly shared definition. There is an ongoing scholarly conversation around the origins of big data. Diebold (2012) dives into its definition as one of the earlier users of the term during an academic presentation in 2000. He argues that for the field of econometrics, he is likely one of the originators of big data as a term that refers to data sets being too large to be used with existing approaches. However, he uncovers several instances of the term before 2001. Weiss and Indurkhya (1998) use the term repeatedly in their data mining textbook, and researchers with the firm Silicon Graphics used it as early as the mid-1990s. Big data is composed of two common words and associated ideas, so perhaps it is not surprising that there are multiple routes to current usage.
Although the term was coined almost two decades ago, the definition of big data remains loose. For many scholars, a few critical characteristics define big data and data science. Among the most long-lived attributes are the “three v’s” of big data: volume, velocity, and variety. We will not belabor these because there is a tremendous amount written on them already, but they help frame the discussion. The v’s of big data trace back to a four-page memo written by Laney (2001) in his role as an analyst for the now-defunct Meta Group. Volume refers to the sheer amount of data, orders of magnitude larger than is commonly used in most research fields. Velocity describes how data are collected and stored at speed or in real time. Variety refers to how big data have varying degrees of organization and structure, from well-defined tables to text scraped from the web. Beyond these three basic characteristics, there are ongoing conversations on whether big data should have other v’s, such as veracity (accuracy of data) and value (the usefulness of data to answer specific questions) (Chen et al. 2014). Dozens of definitions relate to the three v’s, additional v’s, and other characteristics of big data that start with letters besides “v.”
Volume, or the raw amount of data, is central to any definition of big data and data science. Many fields have large volumes of data. Natural science disciplines, including particle physics, astronomy, and genomics, were early adopters of big data approaches. Genomics and astronomy are home to vast amounts of data, and these holdings will grow even more because research projects collect amounts of data that were unthinkable even a few decades ago – on the order of ~25 zettabytes per year. The volume of information generated globally doubles every three years, and this pace is increasing (Henke et al. 2016). Key challenges posed by these data are related to their acquisition, storage, distribution, and analysis. Outside of academia, platforms such as Twitter and Facebook collect and monetize large amounts of data, primarily by developing sophisticated analyses of their users to sell advertising.
A tremendous amount of ink has been dedicated to writing about the size of big data and attendant issues of measuring and defining what “big” means. There is not much value in rehashing those arguments here. Perhaps the easiest way to think about it is that context matters. “Big” is relative to the underlying technology and data format; video files are larger than tweets, but how data are used also matters, such as when trying to extract semantic understanding. Bigness tends to revolve around the inability of many existing computing systems or approaches to cope with data and the idea that the amount of data is increasing rapidly, exponentially in some cases. Bigness implies we are always moving toward the horizon and will never get there, in that what is big today will someday be merely large, or just plain old data.
Most authors are careful to note that the term big is almost meaningless, given how increases in storage, processing speed, and analytical power almost always make the big data of yesterday into the small data of today. There are also debates over whether the bigness of data matters when data science in many fields goes well beyond the engineering and computing challenges that are the focus of so much work in big data (for a more in-depth take, see Chang & Grady 2015). As Jacobs (2009, p. 44) puts it, big data are those “whose size forces us to look beyond the tried-and-true methods that are prevalent at that time.” However, this definition is (necessarily) vague in order to apply to many specific problems. No matter the measure, the size of global data holdings is increasing (Figure 1.1).
Figure 1.1 Size of global data holdings (2010–25). (Reinsel et al. 2018).
People have attempted to measure how much information exists. The International Data Corporation, a market research firm serving the data storage industry (with attendant biases), predicts that the amount of data in the world (termed the Datasphere) will grow from 33 zettabytes in 2018 to 175 zettabytes by 2025 (Reinsel et al. 2018, p. 3). The same study posits that over 75 percent of the world’s population will interact in some way with these data and, by definition, contribute to big data. Global data storage capacity is growing and increasingly moving to digital formats. In 1986, 99.2 percent of all storage capacity was in analog forms such as paper volumes; within two decades, 94 percent of storage capacity was digital (Hilbert & López 2011). Measuring data is an imprecise process and often relies on commercial interests using opaque methods, but it is safe to say there is a lot of data out there (more on data in Chapter 2).
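As a quick plausibility check, the growth rate implied by those two figures is easy to compute. The following is a minimal sketch using only the numbers cited above:

```python
# Implied compound annual growth rate of the global Datasphere,
# using the two figures cited above (Reinsel et al. 2018).
start_zb, end_zb = 33, 175   # zettabytes in 2018 and 2025
years = 2025 - 2018

cagr = (end_zb / start_zb) ** (1 / years) - 1
print(f"implied growth: {cagr:.1%} per year")
# ~27% per year, i.e., the Datasphere doubling roughly every three years
```

That implied doubling time is consistent with the three-year doubling of global information noted earlier in this section.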
Velocity is another defining characteristic of big data, referring to the rate at which data are collected or moved. Most big data conversations center on how fast data are collected, but velocity also involves how quickly a given computer or processing system can perform calculations over these data. Human-environment data are derived and stored across time frames spanning from paper records and ship logs in the 1600s to real-time digital sensors operating today. The flow rate increases exponentially, especially when considering how scientists use computational modeling to generate simulated data alongside traditional sources (Overpeck et al. 2011). Many data sources are termed streaming because they are collected constantly. A significant challenge for human-environment research and data science is developing ways to analyze these data on the fly without assuming that they will be stored in their entirety for later use.
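To make the streaming challenge concrete, consider summarizing a sensor feed without retaining it. The sketch below (an illustrative example, not drawn from the sources above) uses Welford’s online algorithm, a standard way to maintain running statistics over a stream in a single pass:

```python
import random

class RunningStats:
    """Welford's online algorithm: summarize a stream without storing it."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        return self._m2 / (self.n - 1) if self.n > 1 else float("nan")

# Hypothetical use: a temperature sensor streaming readings we never retain.
stats = RunningStats()
for _ in range(1_000_000):
    stats.update(random.gauss(15.0, 2.5))  # simulated readings (degrees C)
print(f"n={stats.n}, mean={stats.mean:.2f}, variance={stats.variance:.2f}")
```

Each reading is folded into the summary and discarded, so memory use stays constant no matter how long the stream runs.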
Standard units are used to measure data size and speed. Computer performance is typically measured by the number of floating-point arithmetic calculations a system can perform in a second (FLOPS), while data storage is usually measured in bits and bytes. The bit, a contraction of binary digit, is the smallest unit of computer data storage and usually takes the binary value of 0 or 1. A byte is a collection of eight bits and is usually written in binary notation (i.e., 00000000 to 11111111). When using FLOPS or bytes, we attach metric prefixes to indicate speed or size (Table 1.1); a short code sketch after the table shows these units in practice. More generically, the suffix scale denotes the class of computers dealing with data of a given size or processing at a certain speed. For example, a petascale system can perform calculations at a rate of at least one petaFLOPS or store at least one petabyte of data.
Table 1.1 Prefixes for data storage and processing speed

| Prefix | Storage unit | Storage in bytes | Speed unit | Speed in FLOPS | Storage example |
|---|---|---|---|---|---|
| – | Byte (B) | 1 | – | 10⁰ | Single character |
| Kilo | Kilobyte (KB) | 1,024¹ | KiloFLOPS | 10³ | Half a page of text |
| Mega | Megabyte (MB) | 1,024² | MegaFLOPS | 10⁶ | Photograph |
| Giga | Gigabyte (GB) | 1,024³ | GigaFLOPS | 10⁹ | Hour-long video |
| Tera | Terabyte (TB) | 1,024⁴ | TeraFLOPS | 10¹² | One day of Earth Observing System data in 2000 (Frew & Dozier 1997) |
| Peta | Petabyte (PB) | 1,024⁵ | PetaFLOPS | 10¹⁵ | One year of data collected by the United States National Aeronautics and Space Administration in 2015 |
| Exa | Exabyte (EB) | 1,024⁶ | ExaFLOPS | 10¹⁸ | One day of data from the Square Kilometer Array (SKA) telescope (Farnes et al. 2018) |
| Zetta | Zettabyte (ZB) | 1,024⁷ | ZettaFLOPS | 10²¹ | One year of digital data in 2010 (Gantz & Reinsel 2010) |
| Yotta | Yottabyte (YB) | 1,024⁸ | YottaFLOPS | 10²⁴ | One day of data generated globally in the mid-2020s (Parhami 2019) |
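As a minimal illustration of these units (the helper function below is hypothetical, written for this sketch), the following converts raw byte counts into the binary units of Table 1.1 and shows the small gap between decimal and binary prefixes:

```python
import math

# Binary storage units from Table 1.1.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def humanize_bytes(n_bytes: float) -> str:
    """Express a raw byte count in the largest sensible binary unit."""
    if n_bytes < 1024:
        return f"{n_bytes:.0f} B"
    exp = min(int(math.log(n_bytes, 1024)), len(UNITS) - 1)
    return f"{n_bytes / 1024 ** exp:.2f} {UNITS[exp]}"

print(humanize_bytes(2 ** 40))  # 1.00 TB (a binary terabyte)
print(humanize_bytes(1e12))     # 931.32 GB: a decimal "terabyte" falls short of 1,024^4 bytes
```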
Variety refers to the enormously heterogeneous nature of data organization, structure, meaning, and sources. Big data can span many knowledge or application domains, especially when dealing with human-environment systems. This range of domains makes these data flexible. They may be collected for one use but be applied to many, even those not foreseen when captured. In some ways, this situation is not new. Land-change scientists, climate modelers, hydrologists, and other researchers on human-environment topics are adept at drawing data from various sources. Traditionally, these data are transformed before analysis, such as converting them into variables for statistical modeling or into layers in a geographic information system. Big data usually focuses on integrating or otherwise reconciling these data on the fly. It also tackles varied sources, from social network posts to remotely sensed imagery to sensor feeds. Much attention on the engineering side of big data goes to handling vast data sets and reconciling a wide variety of formats and models, both in place and on the fly.
Definitions of big data center on volume, velocity, and variety. To these three core characteristics, scholars have added others. These include a pile of other “v”s, including variability, veracity, value, and visualization. Some newer characteristics do not begin with the letter “v” because there should be limits on how willing people are to torture the thesaurus. These include relationality, exhaustivity, and complexity. As time goes by, researchers are sure to add more! Human-environment data handily exemplify these features of big data.
Variability is often added as a fourth “v” and speaks to changes in the first three: shifts in some characteristic of data, such as volume or variety (Chang & Grady 2015). The ability of a system to accommodate bursts of data or temporary increases in volume, for example, is a significant challenge for computing hardware and software. Variability also poses financial or resource challenges because hardware capable of handling volume or velocity demands is often expensive. Even when computing hardware can be spun down or up quickly to reduce operating costs, its initial capital costs can be high, and its relative performance declines rapidly with each new hardware iteration (Section 3.3.4 goes into different kinds of computing and their pros and cons for research). Variability also encapsulates the challenges of ingesting new data formats, ideally without much human intervention, or accommodating a shift in the balance or makeup of different sources.
Veracity refers to data accuracy and other more nuanced notions of data quality. Accuracy and uncertainty have long been topics of interest in science, but they take on new meaning in the big data era (Gandomi & Haider 2015). One measure of accuracy is how well data match reality, but this entails understanding fitness for use (see Section 2.1). An advantage of big data is that common tasks like correlation or pattern matching can accommodate a fair amount of noise or error in data. Data science practitioners regularly lose hundreds of thousands of data points to handling errors, but these losses are typically immaterial when hundreds of millions are in play. At the same time, all data can suffer from systemic temporal, spatial, or attribute biases. Big data can overcome some of these through the sheer number of observations, but not all, and indeed, some kinds of biases are more pronounced with big data. A small but growing body of work examines how bias in data or methods impairs decision-making about people or the environment. Section 2.3 examines bias in data while much of Chapter 4 explores challenges in theory and explanation, and Chapter 5 examines the many real-world challenges of insufficient data.
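The gap between random error and systematic bias is easy to demonstrate. Below is a minimal simulated sketch (all values are illustrative assumptions, not drawn from the studies cited above) showing that randomly losing 10 percent of a large data set barely moves a correlation estimate, while a systematic loss does:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate a strong relationship measured noisily at big data scale.
n = 1_000_000
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(scale=0.6, size=n)

# Random losses: drop 10% of records, as might happen in data handling.
keep = rng.random(n) > 0.10
print(np.corrcoef(x, y)[0, 1])              # ~0.80 with all records
print(np.corrcoef(x[keep], y[keep])[0, 1])  # ~0.80 after random losses

# Systematic bias: silently losing the top decile of x shifts the estimate.
trunc = x < np.quantile(x, 0.90)
print(np.corrcoef(x[trunc], y[trunc])[0, 1])  # noticeably lower (range restriction)
```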
Value refers to the usefulness of data for answering specific questions; it is tied to veracity and accuracy and, like them, often only makes sense in terms of fitness for use. Some authors focus primarily on its business value (Ishwarappa & Anuradha 2015), but the value of big data goes far beyond this, applying to a wide array of fields. As data science expands its reach, more disciplines will examine the value of big data for their research questions.
Visualization, or data display, is essential to some data science researchers. They argue that the sheer size of big data means that humans are unusually reliant on visualization as an approach to understanding large and complex data sets. In addition, one of the original computational demands for big data was visualization. This focus has only gained more urgency with the advent of three-dimensional modeling and graphics (Liu et al. 2016).
Relationality is helpful for linking data. Big data become truly big when disparate data sets come together, and relationality refers to data containing common fields that enable two or more data sets to be joined. Name, identifying number, and location are typical examples, but these fields have few limits (boyd & Crawford 2012). Therefore, relational data must be indexical, having attributes that uniquely identify them and can be linked to other data. An extraordinary amount of effort and money is dedicated to making data indexical and relational and linking them. Location can make human-environment data relational in this sense. Researchers can use spatial coordinates to overlay spatial data layers or spatial traces left by people via phones or animals tracked with locational technology. Relationality becomes especially powerful when links between objects can be traced to form networks or graphs of connectivity (more in Section 2.1 on data basics).
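As a minimal sketch of relationality (the field names and values below are hypothetical), two data sets that share an indexical field, here a county FIPS code, can be joined into a single analysis-ready table:

```python
import pandas as pd

# Hypothetical social data set, keyed by county FIPS code.
households = pd.DataFrame({
    "fips": ["27053", "27123", "27037"],
    "median_income": [85_000, 68_000, 79_000],
})

# Hypothetical environmental data set sharing the same key.
land_cover = pd.DataFrame({
    "fips": ["27053", "27123", "27037"],
    "pct_impervious": [38.2, 41.7, 22.9],
})

# The shared, indexical field is what makes these data relational.
joined = households.merge(land_cover, on="fips")
print(joined)
```

The same logic scales from three rows to millions; the expensive part in practice is making messy real-world identifiers line up in the first place.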
Exhaustivity refers to how big data comprise a population of observations rather than a sample (Mayer-Schönberger & Cukier 2013). Big data can approach being exhaustive when their scope spans, say, the entire set of tweets about an event. Of course, these tweets are just a narrow slice of knowledge about a given phenomenon. Data can seem exhaustive in coverage but be sparse in many attributes (Poorthuis 2018). For example, this occurs when collecting global-scale remotely sensed imagery. These data are an excellent source of human and environmental data and encompass the globe but may be collected infrequently or offer only limited attributes (Section 2.4).
Complexity captures how data are complicated in how they are structured. While variety refers to the range of data, some of which can be more complicated than others, complexity refers to data with internal relationships that can require unique or challenging data structures. An example is data on households. Data on individuals, such as age or income, can be more difficult (but more rewarding) to work with when individuals can be linked to other people in a household. In order to answer some kinds of questions, knowing characteristics about a person, like their age or profession, is less helpful than understanding their partnerships or familial relationships. Complex data require more effort to link, clean, match, and transform (Katal et al. 2013).
There is a cottage industry in defining new “v”s and other letters, and different forms or flavors of big data can exist simultaneously. Other non-“v” attributes can matter a great deal. For example, in exploratory research, data should exhibit extensionality (whether one can easily add or change new fields or attributes) or scalability (data can expand in size rapidly and hopefully seamlessly) (after Marz & Warren 2015). There is general agreement about volume anchoring the notion of big data, and some scholars argue for using size as the primary criterion (Baro et al. 2015). However, such definitions are always contingent on the broader sociotechnical landscape. Over the past few decades, the overall trend is that yesterday’s big data are tomorrow’s small data sets as data infrastructure and computing power grow.
1.3 Data Science, a Science about How We Use Data?
Big data is increasingly associated with data science, a field centered on gathering and analyzing data. Indeed, big data is probably best considered subsidiary to data science as the latter evolves and takes on a broad array of tasks. One indicator of the rise of data science relative to big data is how internet searches for both terms grew rapidly after 2010; searches for data science overtook those for big data around 2014 and continued to grow while big data peaked. The term data science goes back decades, and there is growing recognition that big data and data science are not as new as many of their proponents and practitioners claim. Several data science journals were founded before the term big data became popular, including the Journal of Data Science and the Data Science Journal, both founded in 2002. The field is an increasingly essential focus for data, methods, theory, and policy in human-environment scholarship.
Data science has many antecedents. Its roots extend to various forms of data analysis as proposed by scholars from the 1960s onward (Hoaglin et al. 1984). It is well worth reading Tukey’s (1962) “The Future of Data Analysis,” a substantial (sixty-seven–page) article in the Annals of Mathematical Statistics. Tukey speaks primarily to statisticians in a polemic or provocation designed to spur work beyond mathematical and theoretical statistics. He argues that scholars should recognize how much data analysis goes beyond developing models or sample-to-population inferences because it involves using data to guide broader efforts in observation, experimentation, and analysis (Tukey 1962, p. 2). He also notes the importance of the growing use of computers in allowing scholars untrained in statistics or data analysis to engage in both to answer questions of interest. Naur (1974) similarly talks about the science of data, how computers may be used to understand data, and how researchers must explore their data.
Data science in its modern form started to coalesce in the 1990s. C. F. Jeff Wu is credited with one of the earliest uses of data science in statistics, delivering a talk in 1997 entitled “Statistics = Data Science?” (Chipman & Joseph 2016). The presentation described statistics as the trilogy of data collection, data analysis, and decision-making informed by data. It went so far as to suggest that the name of the field be changed from “statistics” to “data science,” and “statistician” to “data scientist.” Wu also advocated making statistical education broader and science-driven in ways that focused on modeling with large data sets and interaction with scholars in other disciplines (more on education in Section 4.4).
Soon after, Cleveland (2001) wrote “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics” as a manifesto for recognizing that statistics is essentially data science, and that statistics should better embrace all elements of data science. This work describes six areas and levels of effort for university-based statistics departments: multidisciplinary investigations (25 percent); models and methods for data (20 percent); computing with data (15 percent); pedagogy (15 percent); tool evaluation (5 percent); and theory (20 percent). This explicit push by statisticians to embrace data science was partially driven by a desire to claim territory they saw as theirs and a recognition that significant work on big data and data science was happening outside statistics.
Other fields were embracing data-intensive science as a precursor of data science. Data-intensive science is science in which data drive the initial analysis, and information and knowledge emerge from the data rather than from hypothesis-driven testing and exploration (Newman et al. 2003). This term predates most mentions of big data and data science. However, it captures much of their essence by focusing on the need for large-scale cyberinfrastructure that can manage vast amounts of data in short time frames (real-time in some instances) and allows researchers to sieve through these data to find insight. When data-intensive science is envisioned for the study of biodiversity, for example, there is a distinct focus on the utility of automated exploratory analysis techniques for discovering interesting patterns. These novel findings feed into the development of ecological hypotheses that are then tested with standard data collection and hypothesis-testing approaches (Kelling et al. 2009). Similar work was occurring in geography and regional science in artificial intelligence, data-intensive modeling, and automated exploratory analysis of problems including spatial interaction and point-pattern analysis (Openshaw 1992).
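As a minimal sketch of that exploratory-first workflow (the variables and values are hypothetical stand-ins for field observations, not drawn from the cited studies), one might cluster site-level records to surface candidate groupings, then treat any structure found as a hypothesis to test with purpose-built data collection:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)

# Hypothetical site-level observations: species richness and forest cover.
sites = np.column_stack([
    rng.normal(20, 5, 300),    # e.g., species richness per site
    rng.normal(0.4, 0.1, 300), # e.g., fraction of forest cover
])

# Automated exploratory step: look for groupings without a prior hypothesis.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(sites)
for k in range(3):
    members = sites[labels == k]
    print(f"cluster {k}: {len(members)} sites, "
          f"mean richness {members[:, 0].mean():.1f}, "
          f"mean forest cover {members[:, 1].mean():.2f}")
```

Any cluster that looks ecologically meaningful would then become a candidate hypothesis for conventional testing, rather than a finding in itself.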
Big data and data science share some myths (Jagadish 2015). A central myth of big data is that the primary research challenge lies in developing new computing architectures and approaches. Software and hardware are essential, but the focus on eking out performance gains in storage and processing can eclipse more challenging efforts in many workflows, especially in developing better human interfaces and ways for scholars to use their data to answer questions. Another myth is that data science is equivalent to big data. The two are certainly related, but there are also differences, as discussed throughout this book. A researcher can do data science with any data set, and some big data can ignore much of data science. Finally, the commercial impetus for big data has led to a focus on the raw size or volume of big data that has overshadowed other data characteristics. Relatively small and medium-sized data sets can pose their own knotty challenges that benefit from data science approaches, especially regarding variety and veracity.
Learning from data is data science. This focus on learning from data implies the creation of better forms of reasoning from data alongside an interest in understanding how people learn from data. It also means explicitly studying how the field of data science is unfolding in the era of big data. A learning-centric perspective takes a step back from the hype around big data and proposes that it is primarily the “science of learning from data; it studies the methods involved in the analysis and processing of data and proposes technology to improve methods in an evidence-based manner” (Donoho 2017, p. 763). It is not enough to see data science as a loose conglomeration of statistics, machine learning, and engineering centered on large data sets. A broader vision embraces the myriad scientific aspects of data science, centered on a workflow that begins with the initial creation of platforms for collecting these data through to data science approaches to analyzing these data. It also explicitly seeks to engage with other scholarly domains, including human-environment research (Table 1.2).
The book structure loosely follows this data science workflow. Chapter 2 examines data and delves into the first two phases of platform creation and generation and collection. Note that the remainder of Chapter 1 refers to methods that subsequent chapters examine more thoroughly. These include artificial intelligence, computers that can think, and the closely related topic of machine learning, the use of computational algorithms that improve solving a problem over time. Standard machine learning approaches include neural networks, computational analogs to biological brains, and deep learning, which uses advanced neural networks. Common tasks include data mining, searching through data for relationships among variables, and data modeling, or creating computational representations of entities and relationships. Chapter 3 looks at methods including artificial intelligence, machine learning, and others and covers the subsequent phases of the data science workflow from data preparation to visualization. Chapters 4 and 5 are on theory and policy, respectively. They tackle various components of taking action or making decisions with data science and doing science about data science.
1.4 Why Is Data Science Growing?
Like most sea changes in technology, there are many reasons why big data, data science, and cognate fields are rising in prominence. The fact that these fields are hard to define underscores that they are messy and evolving. It is no surprise that the reasons for their existence and growth are also messy and evolving. Here, we focus on three primary reasons: computation is becoming less expensive, networking is becoming near-ubiquitous in many settings, and big data is big business. There are other reasons why data science and data are garnering so much attention. However, these three are an excellent starting place to explore data science for human-environment systems.
Computation is getting less expensive or, seen another way, more powerful for a given unit of cost. The potential of big data relies on continued evolution in data storage, memory handling, and computational power. This growth focuses on basic technology, such as more powerful computer chips, better storage, and faster networking. It also involves new technologies, such as distributing computing over many different machines or harnessing specialized processing units to compete with supercomputers (more in Section 3.3). In many ways, the story of data science is part of an older tale of Moore’s law (computing power doubles on a regular schedule) and the attendant drop in the cost of computing power.
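A back-of-envelope sketch makes the compounding concrete (assuming the commonly quoted two-year doubling period, which is a stylized figure rather than a law of nature):

```python
# Stylized Moore's law: if computing power per dollar doubles every
# two years, relative capability grows about 32x per decade.
def moores_law_factor(years: float, doubling_period: float = 2.0) -> float:
    return 2 ** (years / doubling_period)

for decade in (10, 20, 30):
    print(f"after {decade} years: ~{moores_law_factor(decade):,.0f}x")
# after 10 years: ~32x; after 20 years: ~1,024x; after 30 years: ~32,768x
```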
Increased computational power leads to more data. One route is the ease of generation afforded by computing, such as when a person can take a few seconds to send out a picture or message via social media. Another is how computers are increasingly the creators of information. The Internet of Things (IoT) refers to networked, low-cost, and ubiquitous computing in everyday devices (more in Section 3.3.5). Computation is inexpensive to the point where devices as prosaic as toasters and alarm clocks are networked. They combine computer processing with near field communications (e.g., Bluetooth or radio-frequency identification), real-time location sharing (via satellite systems or triangulating access to cell phone towers or wireless access points), and the use of embedded sensors for motion and other physical characteristics (Ashton 2009).
Computation is increasingly embedded in everyday higher-end devices such as personal automobiles. They go beyond using data to help run the vehicle and become tools for collecting vast amounts of data about the environment that can be monetized and shared (more on mobile data collection in Section 2.2.5). Some estimates contend that the IoT generates more than half of the world’s internet traffic (Networking and Information Technology Research and Development [NITRD] 2016). The world reached a tipping point in 2007 when sensors and systems generated more data than could be kept in the entirety of global storage (Baraniuk 2011). Computerization is also at the heart of automated content generation ranging from search results and news stories generated with no human oversight to procedurally derived landscape simulations and computerized sensors collecting and sharing information without pause.
Second, advances in computation are enhanced by the proliferation of networks. The Internet and other networks offer more one-to-one and one-to-many forms of communication, such as texting a friend or making an environmental sensor reading available to many users. The move to cloud computing for many universities and researchers exemplifies this move to networks. While much data are held on private clouds or other forms of enterprise storage, growing amounts are stored on the public cloud and analyzed there instead of being downloaded first and then examined (Section 3.3.4). Much more data is being generated by converting existing analog data (e.g., scanning paper books or documents to make digital copies) or from the vast array of born-digital media, including text, photographs, and video (Kugler et al. 2017). These are usually shared or propagated across networks, inducing even greater demand. Dozens of human-environment domains embrace data science as a mode of inquiry and rely on networks for data and analysis. They deal with petabytes of data ranging from human epidemiology and mobility analysis to atmospheric dynamics, radar-based remote sensing, and climate modeling (Fox & Chang 2018).
Computing and networking combine to make data pervasive. Data permeate every facet of the human sphere and are rapidly impinging on the natural world. Throughout this volume, we will explore the data collection that is encompassing the globe and how these data and attendant analyses affect many human-environment systems. Another consideration is that network information itself is also a new form of data, particularly how researchers can examine relationships in a social network. Greenfield (2010) terms this move toward ubiquitous computing everyware (a portmanteau of everywhere and software), in that software (and by definition, the underlying computing infrastructure) are enmeshed into many facets of daily life. In essence, computation is becoming pervasive in how it is being embedded in a broad range of objects and services. It is ubiquitous in how it is found in many locations, especially as people around the globe adopt mobile telecommunications, and in how technology becomes increasingly locationally aware (Kitchin & Dodge 2011).
Third, big data is big business, pure and simple, leading to many issues in using data science approaches for research. While estimates vary, the value of big data is likely in the trillions of dollars, including using consumer locational data (US $600 billion), gains in the public sector in Europe (US $300 billion), or in the US health care system (US $300 billion) (Manyika et al. 2011). Even if only partially correct and increasingly outdated, these estimates are staggeringly large and illustrate how any seemingly valueless single datum, when joined to others, can be made into big business. Recent estimates of specific industries, such as artificial intelligence (US $330 billion) or data analytics (US $60 billion), point to the importance of data science and big data across sectors (Holst 2021). Section 5.3 looks at open data and examines how many policymakers see data science and big data as increasingly essential to national security and growth. For example, China is predicted to have the largest datasphere globally by 2025, in large part because it is building a vast video surveillance system (Reinsel et al. 2018). The desire of governments and the private sector worldwide to spy on people drives the growth in data and new ways of analyzing them.
Critically, data science has a large footprint outside of academia, and it behooves scholars to understand how this work affects their research. The rampant growth in data and surge in data science owes much to largely unmonitored data collection tied to advertising dollars. These advertising dollars support the optimistic and opportunistic growth in data science hardware and software through an enormous computing ecosystem (Wyly 2014). Writ large, big data and data science are integral to surveillance capitalism, a socioeconomic system driven by the collection and commodification of data (Zuboff 2019). Researchers have an ethical imperative to understand how data science’s underlying business orientation can lead to unscientific or harmful inquiry. Section 5.1 visits these and other ethical issues that data science poses for researchers and others.
1.5 Data Philosophy
Any broader discussion of data science should include contributions from the philosophy of science. This field has been a freestanding discipline for decades and draws on centuries of antecedent scholarship. In particular, the subfield of science and technology studies poses important questions for data science and big data. Feenberg (2009) argues that the philosophy of technology is separate from the philosophy of science, given that the focus on truth in science is different from the focus on control of technology. Others would contend that both fields are part of the larger corpus of work that examines science and technology’s epistemological and ontological underpinnings, alongside contextualizing their social and political nature (expanded in later work; Feenberg 2017). Either way, data science needs data philosophy.
When defining big data or data science, it is vital to go beyond computational issues and examine the theoretical or epistemological stance of the work. In this view, part of understanding data science and big data is discerning the underlying commercial motives of much of this work (Wyly 2014). More broadly, data science cannot be understood outside social, economic, and cultural contexts. There is so much money in big data that commercial dimensions are inescapable. Data science has social, economic, political, and cultural impacts that can only be barely discerned or guessed in some contexts. Kitchin (2014a) draws on science and technology concepts in describing data assemblage, where data must be seen as part of a vast, intricate, and largely unplanned enterprise intimately tied to culture, politics, and socioeconomic systems.
Another view of big data describes it as the interplay of technology, analysis, and mythology. The technology and analysis are straightforward: the former refers to how data tax existing computational methods, the latter to the ability to draw on large data sets to develop causal claims. The notion of big data as mythology helps situate its claims to develop new insights “with the aura of truth, objectivity, and accuracy” (boyd & Crawford 2012, p. 662). Others have called this mythology a form of hubris because it relies on blind faith in the power of data science to distill the essence of very complicated systems. There is an “often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis” (Lazer et al. 2014, p. 1203). Critical data studies examines how the data, methods, and assumptions of big data are rarely as simple or innocent as made out by its proponents (Iliadis & Russo 2016). It is also invested in establishing an ethical basis for data science and exploring its broader social and environmental ramifications.
Data science can also study the science of data. Research in science and technology studies often employs ethnographies of data science in action. In contrast, data scientists have called for using big data tools to understand how scholars, policymakers, and others work with data. Examples include using meta-analysis and citation studies to illuminate how science works. The work of Ioannidis (2008) exemplifies this trend by looking at vast swathes of medical research and illuminating issues such as the difficulty of replication or apparent misuse of statistics in reporting significance. Data science has a similar potential to help clarify the advantages and disadvantages of different workflows across studies or computing ecosystems. This research encourages consistency in scientific practice and replicability when running experiments and handling data (more on reproducibility in Section 3.4).
We dip into science and technology work on data in a few places in this volume, but it is hard to do it justice here. Instead, we follow the lead of geographer Muki Haklay (2013), who writes on data, technology, and democracy and draws on work in the philosophy of science (Dusek 2006; especially Feenberg 2002). This scholarship applies to spatial data in ways that resonate with conversations about big data and data science. Haklay frames technology as the interplay between values and autonomy, borrowing from and expanding on Feenberg's (2009) formulation (Table 1.3): technology is either value-neutral or value-laden, and either humanly controlled or autonomous.
Table 1.3 Technology in terms of values versus agency
| Technology is … | Autonomous | Humanly controlled |
| --- | --- | --- |
| Value-neutral: complete separation of means and ends | Determinism: technology has universal qualities that make it evolve along a set trajectory independent of human intervention | Instrumentalism: liberal faith in progress in which technology is value-free but the ends or purposes are important |
| Value-laden: forms a way of life that includes ends | Substantivism: means and ends are linked in systems via political and economic dynamics | Constructionism: humans can modify technological trajectories and choose among alternative means-ends systems, although dystopian facets of technology are never far away |

(Adapted from Feenberg 2009)
Under this schema, the term value connotes whether a technological means is tied to specific ends or outcomes. If technology is value-laden, it can be judged as falling somewhere along a continuum of good-to-bad for specific circumstances or issues that go beyond the technology itself and affect human or environmental systems. In this view, a method like artificial intelligence is assumed to have inherent characteristics that will lead to good outcomes, like freeing humans from unnecessary labor, or bad outcomes, like leading to a robot uprising against humans. In comparison, seeing technology as value-neutral implies that it may only be judged on its own terms, such as whether one technology is more efficient or efficacious than another. A given computational approach is better or worse solely in the context of the technology itself, such as the speed with which an analysis can be completed or the accuracy with which a given artificial intelligence method can classify data.
Autonomy speaks to the degree of human control over technology. While humans create technology, there is disagreement over whether its growth and character are open-ended or nearly preordained along some inherent evolutionary path. Put another way, how large a role do humans play in directing the development and use of technology? For example, assuming that artificial intelligence is autonomous means that this technology will evolve independently of human intervention: it does not matter where the technology is created or who is developing it, because most efforts will end up with similar technologies or outcomes. In contrast, if technology is humanly controlled, then artificial intelligence's nature can be shaped, and humans can and must control the ends to which it is applied.
This values and autonomy framework places data science into one of four categories.
Instrumentalism sees data science as a value-neutral tool that humans control. This view characterizes much of the published work by proponents of big data and data science, and much writing about technology generally. Explicitly or implicitly, this control is seen as being in the service of making the world a better place, although it is impossible to ignore the potential for adverse outcomes. Chapter 5 explores policy dimensions of data science and how big data is often considered an unalloyed opportunity to develop solutions to many of the world's human and environmental ills. The instrumentalist view recognizes that data can be misused but holds that the chances of negative impacts are lessened by a firm and deliberate hand on the tiller; most negative impacts can be attenuated and are outweighed by the benefits.
Determinism shares with instrumentalism the assumption of value-neutral technology but holds that technology has imperatives independent of its larger social context. Regardless of society's organizing principles, data science will be energetically used to pursue the goals of that system and will inherit its values. Technology is simply an extension of the overarching socioeconomic system while retaining much of its autonomy. We return to this view in discussing informational sovereignty, where data are seen as integral to the ability of nations and firms to further their interests through data science.
Substantivism agrees with determinism that technology is autonomous but differs in seeing it as having inbuilt values, focusing on the adverse effects of power, control, and domination. In this view, data science and big data have an inbuilt tendency toward centralizing power and control. Surveillance exemplifies the substantivist argument: data mining and artificial intelligence technologies are essentially autonomous and tend to support the centralization of power. Technology channels social norms but is not overly guided by them. Societies will choose among rationales to support surveillance – security, safety, commerce – that fit their guiding ethos, but they will ultimately choose surveillance. Section 5.1.2 explores the control of data and how data mining and artificial intelligence advance a surveillance society in which people and groups are tracked.
Constructionism is the longstanding body of thought positing that our shared understandings of many facets of the world, including technology, are developed via social processes, including communication and power dynamics. Constructionism holds that a range of social imperatives shapes technology, which in turn shapes many aspects of society (Feenberg 2002). This view shares with substantivism the notion that technology is not value-free, but it tends to be more optimistic than substantivism about the prospect of human control. Constructionism is not always sufficiently attuned to how socially contested technology can be, but it offers a sizeable academic arena that accommodates many views (Bantwal Rao et al. 2015). Constructionism is valuable in examining how technology is value-laden, contending that it is crucial to assess the ends to which technology is applied. One corollary is that data science can produce myriad outcomes depending on the norms and regulations that govern its use, so humans should direct the growth and use of technology. Of course, how best to guide the development and application of technology remains an open question. Throughout this volume, we examine whether and how humans can control the data science of coupled human-environment systems.
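To make the schema concrete, the toy sketch below encodes Table 1.3 as a simple lookup. The labels follow the table, but the code itself is our illustrative construct rather than anything proposed by Feenberg or Haklay:

```python
# A toy encoding of the values-versus-agency schema in Table 1.3.
# (values, agency) -> school of thought
FEENBERG_SCHEMA = {
    ("value-neutral", "autonomous"):         "determinism",
    ("value-neutral", "humanly controlled"): "instrumentalism",
    ("value-laden",   "autonomous"):         "substantivism",
    ("value-laden",   "humanly controlled"): "constructionism",
}

def classify(values: str, agency: str) -> str:
    """Return the school of thought for a given stance on values and agency."""
    return FEENBERG_SCHEMA[(values, agency)]

# Holding that data science is value-laden but under human control, for
# example, places one closest to constructionism.
print(classify("value-laden", "humanly controlled"))  # constructionism
```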
The tensions among these schools of thought around autonomy and values illustrate how important it is to go beyond common critiques of big data and data science. This volume focuses on many practical issues in data science but will occasionally foreground the discussion of values versus autonomy. For example, an instrumentalist approach recognizes that bias is a problem but argues that bias can be eradicated from data by developing better and more objective observing systems. In contrast, a constructionist approach contends that the problem of bias goes deeper than ridding data of errors or modifying data collection systems. Feminist constructionist work in particular demonstrates that data are theory-laden artifacts within a larger knowledge infrastructure of people, things, and institutions, tied to webs of race, gender, ethnicity, and class, among other human characteristics (D'Ignazio & Klein 2020). There are no unbiased data because observations are inextricably entwined with messy layers of human systems.
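The contrast can be made concrete with a small sketch of the instrumentalist remedy: post-stratification weighting to correct a skewed sample. The numbers below are hypothetical. Note what the code can and cannot do: it reweights known strata, while the constructionist point is that the strata, the question asked, and the measurement itself are social choices the code never touches:

```python
# Hypothetical survey: mean response by group, with each group's share
# of the sample versus its share of the population.
groups = {
    # group: (mean_response, sample_share, population_share)
    "urban": (0.70, 0.80, 0.55),
    "rural": (0.40, 0.20, 0.45),
}

# Naive estimate inherits the sample's skew toward urban respondents.
naive = sum(mean * s for mean, s, _ in groups.values())       # 0.640

# Post-stratification: reweight each group to its population share.
weighted = sum(mean * p for mean, _, p in groups.values())    # 0.565

print(f"naive estimate:    {naive:.3f}")
print(f"weighted estimate: {weighted:.3f}")
```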
These differences among ways of seeing technology can seem arbitrary or abstract, but they say a great deal about data science for human-environment systems. They reflect a longstanding engagement in the philosophy of science and cognate fields with the subtlety of data and its technological trappings. For the busy researcher who wants to get on with their work in data science, it can be hard to extract concrete lessons from this long and complicated history of scholarship, but those lessons must be heeded. At its most basic, the lesson is that data are fundamentally a social construction that defines research of all stripes (Latour 1986). There are times when knowing that data are social artifacts does not matter, and times when it does, as explored throughout this volume.
1.6 Promise and Pitfalls of Data Science for Human-Environment Research
Despite the great promise of data science for studying human-environment systems, there are also significant challenges. This book has six chapters: this introduction and a conclusion bookend four chapters that examine the pitfalls and promises of data science for human-environment scholarship in the focal areas of data, methods, theory, and policy. This detailed examination will draw widely on work in the social, natural, and information sciences and double back to examine the promising ways in which big data allows researchers, policymakers, and other stakeholders to understand human-environment dynamics better. Each chapter has a penultimate “focus” subsection that dives into a topic to tie together some of the chapter’s main points.
Chapter 2 examines how data science grapples with big human-environment data. These data fundamentally change how many scholars answer a range of human-environment questions, but they come with new challenges. They have large volume because they span broad spatial and temporal extents, are collected across multiple scales, and have increasingly high resolutions. They have high velocity because they are continuously collected and manipulated via a broad array of sensing platforms, ranging from ground-based stations and satellite remote sensing to newer sources like social network data. Human-environment data also often have complex structures and carry many forms of bias, error, and uncertainty. The chapter concludes with a look at remote sensing, a field that has for decades used many of the tools now considered central to data science and whose imagery has long served as one of the best sources of human-environment information.
Chapter 3 looks at data science's methodological shortcomings and opportunities in manipulating and analyzing big human-environment data sets. Core approaches, including big data, machine learning, and artificial intelligence, are challenged by the complex nature of human-environment data, which represent various social and biophysical entities and interactions across multiple spatial and temporal scales. Data science can draw on ongoing research in the information sciences on data lifecycles, metadata, ontologies, and data provenance. Many data science challenges are computational, as are the solutions, such as parallel, distributed, cloud, and high-performance computing; many of these approaches were developed for generic big data and then adapted to spatiotemporal data. Human-environment research also benefits from advances in smart computing, embedded processing and sensing, and the internet of things. Also central to data science is the burgeoning interest in formalizing and supporting how we share data, workflows, and models in the name of scientific reproducibility. The chapter concludes with a focus on research on handling spatial and temporal patterns and processes, examined in the context of global land use and land cover data.
Chapter 4 digs into the many outstanding questions and concerns around the theory of data science and big data. Data science is often positioned as theory-free or purely inductive, which oversimplifies a more complicated discussion of data science's various epistemologies. There is an emerging consensus on linking core approaches in data science, like machine learning or modeling, to domain knowledge to overcome some of the challenges of theory development. Work in these substantive areas of inquiry demonstrates the value of linking existing research built on smaller data sets to work with larger ones. How these various forms of data science play out in human-environment research is conditioned by the kinds of data science training students receive and how this education relates to competing conceptions of what data science should look like as an academic field or commercial enterprise. Many of the conversations about theory development and data science are happening in the context of the science of cities and the concept of smart cities as coupled human-environment systems.
Chapter 5 concerns the significant policy considerations arising from the profound legal, social, political, and ethical dimensions of data science in general and for human and environmental systems in particular. Data science is touted by many scholars, policymakers, and others as holding extraordinary promise for making policy better for people and the planet. Skeptics of the rapid and largely uncontrolled roll-out of data science in many policy arenas have identified a range of harms and argue for policy interventions. Many policy issues relate to discrimination, data dispossession, surveillance, and privacy and consent. How data science plays out in many human-environment contexts is governed in part by data divides, the gaps between people and places in how they may access and use data science tools. Also crucial to successful science and policymaking is open data, as defined by a range of competing public and private interests. The dilemmas and potentials of data science for policy development come to the fore in debates over its use in sustainable development around the globe.
Chapter 6 wraps up with a discussion of how data science is assuming an increasingly prominent role in examining and addressing human-environment dynamics. Data science is here to stay: it is a valuable way to understand human-environment dynamics and, at the same time, a source of challenges in how those dynamics play out for people and the environment. This final chapter revisits conversations around how data science can, or should, be carried out across research domains. It also delves into the role of data and research in democracy and decision-making. Finally, it brings together many of the threads teased apart in the volume by examining the promise and pitfalls of scientific infrastructure as a primary way forward in applying data science to advance our knowledge of coupled human-environment systems.