Policy Significance Statement
Preventing the spread of a global pandemic requires effective, swift access to high-quality and up-to-date data. Despite this urgency, data are often stored behind access-controlled systems that prevent effective re-use, and human time constraints for access requests result in bottlenecks, especially in high-demand pandemic-instigated scenarios. Even when data can be accessed, poor quality often makes it difficult or impossible for researchers to effectively use this data and may necessitate lengthy workarounds or “best guesses” to create policy-informing computational models. To reduce costly death tolls in the future, we must implement effective computational data sharing pipelines and permissive re-use policies. This paper provides new information that can be used to develop data sharing policies that will support effective responses in future pandemics.
1. Introduction
Barriers to the effective use of data to inform policy are not new but became noteworthy during a global pandemic that has already infected hundreds of millions (Mathieu et al., 2020). Building on data sharing challenges we encountered as scientists practising in our own fields, we designed a qualitative remote interview-based study to investigate these barriers.
After exploring the literature (Section 2), we present methods (Section 3) and then share a comprehensive list of the consecutive systemic barriers our participants encountered when working to address the effects of the pandemic (Section 4). This list is presented to inform policy and create effective sharing mechanisms before another such deadly event occurs.
Whilst this research was intended specifically to look at non-private data, such as hospital bed capacity or viral genomes, many participants had experience with private data types as well, and it quickly became clear throughout the study that most of the barriers we identified were common to both private and non-private data types. Much of the evidence we present consists of quotes and use cases of the barriers our participants encountered, and given the sensitivity of topics such as data privacy (Section 4.1.2) and governmental data suppression (Section 5.1), all participants and quotes are anonymous.
The barriers reported fall into five main categories, four of which are sequential: Knowing data exists, accessing that data, being able to understand and use the data, and being able to share your data and analyses in an online repository so others can re-use them. Throughout this, a fifth barrier is interwoven: Human throughput capacity, which may be stretched even before acute stress periods such as a pandemic.
This study focused specifically on recruiting people who wished to share barriers that they had encountered. Whilst it is not the primary focus of the study, we also discuss common themes and “wish list” items that our participants shared when describing what a good, or even “dream”, data source might look like (Section 4.3).
In the discussion (Section 5), we highlight three key areas of note: the governmental tension between following science and controlling the pandemic narrative; equity concerns as those with the least resilience are also the worst affected; and the need for structured computationally-readable temporal metadata for legal rulings such as COVID social distancing requirements.
Finally, we conclude the paper with a short summary and call to action (Section 6).
1.1. Research question
In times of a pandemic or epidemic when rapid response is required, what are attitudes towards pathogen-related data sharing? In particular, what barriers do researchers encounter, and what do they do, if anything, to get around these barriers?
2. Related literature
Literature related to this study’s goals spans a number of topics. We conducted a targeted literature review, specifically covering domains that were addressed by study participants or that emerged as recurring barriers during the interviews. Articles were retrieved by a combination of recommended reading from participants and domain experts as well as Google Scholar searches for the keywords “data sharing policy,” “privacy vs transparency,” and “open data,” in combination with modifiers such as “COVID” and “healthcare” when a narrower scope was required.
In particular, we have reviewed literature related to: governmental transparency and data sharing; tensions between the right to personal privacy in medical records and sharing data for the public good; COVID-19 data sharing policies; secondary data use and re-use; computational and mathematical modelling; industry and research institute data sharing policies; and data characteristics that make data useful (including the Findable, Accessible, Interoperable, and Reusable (FAIR) principles).
2.1. Personal medical privacy versus the collective greater good
Throughout this paper, we do not dispute that individuals have a right to confidentiality in health. The tension between sharing private medical data for the public good and the right to personal privacy is well documented in the literature, even outside pandemic and epidemic-related concerns. Acknowledgement of this right goes back to at least 320 BCE, if not earlier, with Hippocrates’ vow not to divulge what he heard in the course of his profession (Jones, 1923).
Epidemiologists went to great lengths to share data when possible and to call for further data sharing (Xu et al., 2020a). Lenert and McSwain (2020) observe that pandemic-related health data sharing presents a specific urgency: “In an emergency setting, we need to drastically reduce the barriers to frictionless [Health Information Exchange]: one law of the land for COVID–19-related data.” They further propose to ease common anonymisation measures in the face of pandemic needs, such as reducing the threshold of “anonymous” geographical location blocks from 20,000 people to a smaller range of 50 to 100 individuals. Aggregation of data can also reduce the privacy risks of sharing some types of data, such as mobility data (Buckee et al., 2020).
Henderson (2021) notes that the right to individual privacy in epidemics could reasonably be abrogated to provide a better population-level epidemic response, much like the right to free movement is restricted during lockdowns, but eventually concludes that the benefits are not likely to outweigh the risks, given that these datasets are not likely to be representative and bias-free, and risk being exploited by commercial entities.
Dron et al. highlight that despite our rapid response to the crisis, data sharing changes may not be optimal or equitable: many health outcomes were worse for marginalised populations than they were for “majority” populations. In addition, changes and policies made in the acute phase of the pandemic have in some cases been forgotten or de-emphasised during times of lower urgency, suggesting we have learned little about preparing for such challenges before there is an urgent need.
Biomedical data need to be drawn from varied sources in order to provide meaningful and timely responses. Leonelli (2023) highlights that attitudes and expectations around evidence-based medicine changed as a result of pandemic urgency, further illustrated by calls in 2020 and 2021 for better data sharing frameworks (Buckee et al., 2020; Xu et al., 2020b). In previous decades, researchers were more likely to favour randomised controlled trials (RCTs) over all other data sources. Data were often biased towards easily digestible or computation-friendly data sources, rather than “complex disaggregated data sources.” Leonelli highlights the need for varied and multidisciplinary data sources to create a whole and realistic picture of epidemiology. These sources should include social factors which are likely to influence outcomes and behaviours, linked data (where possible), and data sources from medical professionals and social workers, rather than only from researchers.
Starnini et al. (2021) report on COVID-19 data use in Italy and Spain, observing that these policy-informing studies show inadvertent biases in their analyses; they note that cross-country comparisons are challenging to perform effectively or correctly based on current non-uniform data collection and formatting mechanisms.
2.2. Transparency and data sharing in local and national governments
Effective government inter-organisational data sharing provides expertise, revenue, policy, and efficiency benefits (Gil-Garcia and Sayogo, 2016; Gil-Garcia et al., 2007). Given the inherent risk and tension around personal privacy vs transparency, it is not surprising that government data sharing policies vary not only from country to country and from local authority to local authority, but even between individual governmental agencies in the same country. Governmental data are often siloed; however, clear direction from high-level leadership can ease the silo culture and make inter-agency consensus and collaboration easier (Graham et al., 2016).
Governmental administrative data are often limited by practical constraints. Allard et al. (2018) list some of these: an internal lack of capacity to analyse the data; poor-quality data that need significant preparation before they can be analysed; and a lack of common data formats and identifiers. Whilst IT infrastructure and data-management systems are necessary to solve these issues, they are far from the only elements needed. Factors that can ease these socio-technical challenges include early intervention when or before data sharing problems arise, a favourable political environment and internal policy, and active setting of a data sharing culture, especially by organisational leadership (Dawes et al., 2009). Gil-Garcia and Sayogo (2016) find that both having a project manager dedicated to inter-organisational data sharing and having sufficient financial resources correlated significantly with governmental inter-organisation data sharing success. Allard et al. (2018) further observe that data associated with mandatory compliance reports tend to receive more attention and resources than other data types, and are often of higher quality.
Allard et al. (2018) and Graham et al. (2016) further note that researchers are well-positioned to lead with the use of data systems, and as such it would be advantageous for governments to collaborate with researchers in policy-forming partnerships. These partnerships are shaped not only by government agency perception of risks but also by the type of data access structures in use, suggesting that careful design for sharing early on is more likely to be effective.
Researchers may be disincentivised from working with governmental agencies due to the time taken to navigate data-access bureaucracy, especially if there is no standardised process to access the data and/or if data access agreements forbid re-sharing data.
2.3. Transparency, data sharing, and secondary data re-use in academic settings
A sub-set of academic researchers have been pushing for openly shared research data for some time; for example, the Bermuda principles for genomic data sharing were established in 1996 (Jones et al., 2018).
Generally, whilst opinions on research and academic data sharing are favourable, in practice many researchers find that the effort associated with data sharing (that is, preparing data to be shared and/or metadata that contextualises the raw data) and the risks (ethical risks, misinterpretation, risk of scoop, lack of incentive) often outweigh an individual’s reasons to share data (Zhu, 2020; Zenk-Möltgen et al., 2018). It is thus perhaps unsurprising that in a survey of 1800 UK-based academics by Zhu (2020), 86% of respondents rated data sharing as “very important” or “fairly important,” but only one-fifth of the respondents had personal experience of sharing their data.
It takes time and effort to prepare research data for sharing, both to make sure that the data are tidy and documented and to make sure that the data are sufficiently contextualised so they are not dangerously misinterpreted (Naudet et al., 2018; Zhu, 2020; Savage and Vickers, 2009). Whilst many journals require data availability statements when publishing scholarly results, data availability is rarely enforced by the journals. In replication and data sharing studies for journals that have data sharing expectations, Savage and Vickers (2009) were able to access only one dataset out of the ten they tried to access, while Naudet et al. (2018) successfully gained access to data for 19 out of 37 studies they wished to replicate. Zenk-Möltgen et al. (2018) note that only a small number of research journals state how data sharing policy adherence is checked, and further note that there is a gap between recommendations from policymakers and the realities of data sharing in social sciences.
In addition, data sharing practices vary by research domain. Zhu (2020) observed that natural sciences and engineering are more likely to share data than medical and life sciences, whilst Naudet et al. (2018) note that while biomedical sciences generally do not share data, its sub-domain of genetics does have a data sharing culture. Zenk-Möltgen et al. (2018) note that half as many sociology articles had data available compared with political science journals, observing that this may be down to specific ethical concerns in each domain - sociology is more likely to have qualitative interview data that cannot be effectively anonymised, compared with standardised quantitative political survey data.
Infrastructure to support data sharing is a final important element. Zhu (2020) notes that humanities data sharing is not only blocked by privacy concerns, but also by a lack of infrastructure on which to share the data, and Kim and Zhang (2015) found that when people had data repositories available to them, they were more likely to actually act on their good intent to share data.
2.4. Data standards, integration, quality and usability
One observation in the previous section on academic data sharing is that preparing data for re-use takes time and effort. It is therefore unsurprising that there are many calls for data standardisation and reviews on the (often less than ideal) current data standards compliance in many domains.
Fairchild et al. (2018) observe that epidemiological data are challenged both by a lack of standardised interfaces to data and by a variety of data formats, leading to fragmentation. Different access mechanisms span from computational Application Programming Interface (API) access to manual human extraction from Portable Document Format (PDF) files. Formats may range from general and simple data formats such as comma-separated value (CSV) files, which are easily readable but can vary massively in terms of content, to more complex dedicated domain-specific time-series formats such as EpiJSON. They also observe that just because a standard exists does not mean it is necessarily embraced by the field, and whether or not a given data format will be widely adopted is often an ongoing question.
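To make the fragmentation concrete, the following minimal sketch maps two CSV exports that describe the same quantity onto one shared layout. Every name and value in it is an assumption made for illustration: the two sources, their column names, delimiters, date formats, and case counts are invented, and do not come from any dataset discussed in the literature above.

```python
import csv
import io
from datetime import datetime

# Two hypothetical CSV exports reporting daily case counts.
# Both are "just CSV", yet they differ in delimiter, headers and date format.
SOURCE_A = "date,region,new_cases\n2020-04-01,North,12\n2020-04-02,North,15\n"
SOURCE_B = "Day;Area;Cases\n01/04/2020;South;7\n02/04/2020;South;9\n"

# Hand-written description of each source (all values are assumptions).
SCHEMAS = {
    "a": {"delimiter": ",", "date": "date", "region": "region",
          "cases": "new_cases", "date_format": "%Y-%m-%d"},
    "b": {"delimiter": ";", "date": "Day", "region": "Area",
          "cases": "Cases", "date_format": "%d/%m/%Y"},
}


def harmonise(text, schema):
    """Map one source's rows onto a shared (date, region, cases) layout."""
    reader = csv.DictReader(io.StringIO(text), delimiter=schema["delimiter"])
    rows = []
    for row in reader:
        day = datetime.strptime(row[schema["date"]], schema["date_format"]).date()
        rows.append({
            "date": day.isoformat(),          # ISO 8601 date string
            "region": row[schema["region"]],
            "cases": int(row[schema["cases"]]),
        })
    return rows


combined = harmonise(SOURCE_A, SCHEMAS["a"]) + harmonise(SOURCE_B, SCHEMAS["b"])
for record in combined:
    print(record)
```

Each additional source needs its own hand-written schema entry in this sketch, which is the kind of per-source effort that a widely adopted standard format would remove.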
Even when standards exist, they may not meet the needs of their users. Gardner et al. (2021) note that in 2020, systems for COVID-19 data reporting “were not empowered or equipped to fully meet the public’s expectation for timely open data at an actionable level of spatial resolution” - that is, whilst systems for sharing this data existed, they did not provide the depth of information needed to be useful.
In genomics and bioinformatics, Thorogood et al. (2021) observe that data may be standardised in a single data source (e.g. a particular research organisation or lab), but vary across different data sources of similar types. The result is that data adhere to standards, yet federated (cross-data-source) queries remain a difficult technical challenge. Thorogood et al. (2021) propose that an independent organisation such as the Global Alliance for Genomics and Health (GA4GH) can act as a convener for its member organisations to help create interoperable data standards.
Building on the collective experience of research data management, Wilkinson et al. (2016) launched the FAIR data principles, which assert that data should be FAIR if it is to effectively support research discovery and innovation. To be FAIR, data should be indexed on a publicly accessible data portal, use a common data standard and/or vocabulary, provide machine-readable data access and detailed descriptive metadata that is available even if the original data are not available, and provide clear data provenance and licence information about how the data can be re-used.
Similar principles have been described for other data-relevant domains, such as the FAIR software principles (Hong et al., 2021), which provide similar guidelines for the computer code that produces, analyses, and visualises datasets. The European Commission with ELIXIR has established a pan-European open COVID Data Portal (ELIXIR, 2021) for public data deposition and access, and hosts an open letter with over 750 signatories, calling for open and FAIR data sharing (COVID-19 data portal, 2021). Additional COVID-19 FAIR data recommendations can be found in Maxwell et al. (2022).
3. Methods and data used
To gain a nuanced understanding of the barriers to data access, sharing, and re-use, we conducted a qualitative study, interviewing professionals who work with COVID-19-related data. This study was approved by the University of Manchester Department of Computer Science ethics panel, original approval review reference 2020-9917-15859 and amendment approval reference 2021-9917-19823.
A detailed interview guide and study protocol are available on protocols.io for re-use (Yehudi et al., 2021).
3.1. Recruitment
Interviews were carried out in two rounds, and all participants who applied to participate were accepted, resulting in fifteen completed interviews.
Round-one recruitment (July–September 2020 and March 2021) targeted bulk-emailed mailing groups and Slack channels, with additional recruitment via social media (Twitter). No potential participants were approached individually, and everyone who volunteered to participate was accepted into the study. Areas approached were primarily COVID-19 hack events, open data communities, and bioinformatics/biology communities.
Due to the limited global spread of the round-one sample, we initiated a second round of recruitment. Round-two (July–September 2022) recruitment was more directly targeted towards specific individuals who had domain experience in non-private data, and/or were based in continents and countries that were previously not represented in our sample. Round-two participants were approached via instant messaging and email.
Due to the low level of response in round one, in round two, we did not focus on any specific domain beyond the experience of working with COVID-19 data in some way, resulting in the sample domains described in the results Section 4.1.
3.2. Data gathering and analysis
3.2.1. Data gathering: Interviews
Prior to the interviews, participants were sent an information sheet via email explaining the purpose of the interviews and how their data would be handled.
Interviews were semi-structured and conducted remotely via an institutional Zoom video conferencing account (Zoom Video Communications, 2022). We guided participants through onscreen consent forms before asking the following questions:
1. Tell me a little about the data sources you have been working with so far – have any been restrictive or hard to access?
(a) What did you do when you encountered the access restriction?
(b) Did you ever consider going to another source, or encouraging data creators and curators to share their information somewhere less restrictive?
(c) Did others you work with share your views and opinions? Can you provide any examples?
2. What about good experiences – are there any data sources that are exemplary?
(a) What did they do right?
(b) Was there anything they could do better?
3. If you could design your own “dream” COVID-19 data source, would there be anything else you’d like to see?
4. Are there any ethical issues you are aware of or have considered, with regard to this data sharing?
5. Is there anyone else working in this domain who you think might be interested in participating in this study?
3.2.2. Data management and analysis
Post-interview, recordings of the interviews as well as transcriptions were downloaded from Zoom, speech-to-text errors were corrected, and then the text transcriptions were analysed for themes. Transcripts were coded in NVivo 12 (QSR International Pty Ltd, 2018), resulting in 304 codes in total. Of these, we filtered a list of 101 codes that were present three or more times across the fifteen transcript files. A second coder reviewed the filtered list to confirm the presence or absence of codes in each transcript file.
Of the 1515 data points (101 codes × 15 files), both coders agreed on 1514 (approximately 99.93% agreement). After discussion between the coders, an agreement was reached on the remaining data point, giving 100% agreement for all codes present three or more times across the transcripts.
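The agreement figure can be reproduced with a short calculation. The sketch below uses only the counts reported in the text; the variable names are our own, and the result is simple percent agreement rather than a chance-corrected statistic.

```python
# Counts taken from the text above.
n_codes = 101     # codes present three or more times
n_files = 15      # transcript files
n_agreed = 1514   # presence/absence decisions on which both coders agreed

# Each code is judged present or absent in each file.
n_decisions = n_codes * n_files

# Simple percent agreement between the two coders.
percent_agreement = 100 * n_agreed / n_decisions

print(f"{n_decisions} decisions, {percent_agreement:.2f}% agreement")
```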
All codes that appeared three or more times relating to barriers are presented in Section 4, with illustrative quotes where possible. All quotes used in the text are shared as a supplemental data file.
3.2.2.1. The subject matter was potentially sensitive
Participants may have spoken about disapproval of data sharing practices in their institutions or countries, and/or may have witnessed, conducted, or experienced inappropriate or illegal data sharing activities. All interview videos were deleted after transcription. To preserve anonymity, corrected interview transcripts were assigned a pseudonym identifier and stored securely on the University of Manchester infrastructure for at least five years in line with University of Manchester guidelines. Transcripts are not deposited in an open repository.
3.2.3. Preserving anonymity in reporting
When writing, quotes are in some cases lightly paraphrased to replace identifying details, e.g. a phrase such as “Transport for London” might be replaced with the phrase “[municipal transit authority].” This approach has been applied in various cases to place names, institutions, governmental/healthcare agencies and committees, and data source names. All participant quotes that are re-used in publications were first shared with the participants, who were given the chance to amend or redact their statements to avoid inadvertent data breaches.
To reduce the risk of re-identifying the individuals whose data we share, we have created multiple separate tables for data points such as location, occupation, and domain expertise. For example, whilst we might share a table of occupations of participants, and separately share a list of countries where they come from, we never link the occupation and physical location data of participants in the same table, as a combination of specific datapoints may be sufficient to re-identify an individual.
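The disaggregation described above can be sketched as follows. This is an illustration only: the records are entirely made up and are not study data, and the field names and the `marginal_table` helper are hypothetical.

```python
from collections import Counter

# Entirely made-up participant records, for illustration only.
participants = [
    {"occupation": "epidemiologist", "region": "Europe"},
    {"occupation": "data scientist", "region": "Africa"},
    {"occupation": "epidemiologist", "region": "Asia"},
    {"occupation": "clinician", "region": "Europe"},
]


def marginal_table(records, field):
    """Count the values of a single field, so no row links two attributes."""
    return Counter(record[field] for record in records)


# Each table is published separately; the joint (occupation, region)
# combinations are never released.
occupation_table = marginal_table(participants, "occupation")
region_table = marginal_table(participants, "region")

print(dict(occupation_table))
print(dict(region_table))
```

A reader of the two tables can see that two participants are epidemiologists and that two are based in Europe, but cannot tell whether these are the same people.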
3.3. Reflexivity and positionality statement
This research was carried out by researchers based in the United Kingdom, and led by YY. Whilst YY has lived in various high-income settings around the world, including New Zealand, the United States, and Israel, they also have extensive experience working with scientific data and running open-focused, equity-driven global scientific communities. Most recently, this includes working with OLS (formerly Open Life Science), an open science mentoring and training organisation, as well as software engineering and data manipulation experience at InterMine, an open source biological data warehouse.
In particular, we recognise that the study was designed to focus on open data, by a researcher who advocates for and prefers to work with open data whenever possible and ethically prudent. Importantly, we do not assert that this should be taken as a comprehensive or balanced review of data access mechanisms - instead, this study specifically investigates barriers, and how professionals may handle them.
In the sample section, we also discuss the limitations of the sample and recruitment method - as noted above, our communities are open-focused and equity-focused, which is likely reflected in the sample we were able to reach for the interview.
4. Results
4.1. Sample
Fifteen participants were interviewed in total, through two recruitment rounds. Six participants signed up during round one, recruited via bulk-mailing lists and social media. All round-one participants were researchers based in Europe or North America and came from a range of domains spanning healthcare and computational, biological and social research.
Round-two recruitment was more directly targeted towards individuals, and resulted in an additional nine participants, for a total of fifteen participants overall in the study. Round two brought participants from Africa, Asia, Australasia, and South America, with two-thirds of the final sample based in High-Income Countries (as defined by the World Bank) and one-third based in Low or Middle-Income Countries.
Tables 1–3 show profiles of the participants. These results are intentionally disaggregated to present a profile of participants without linking the data in a way that might facilitate de-anonymisation.
Participants came from a broad range of overlapping professions and expertise domains, as shown in Table 2 below. The number of participants adds up to greater than fifteen due to some participants falling into more than one category.
4.1.1. Data and data sensitivity: Private, non-private, commercial, and political data
Pandemic data fall on a spectrum: from individually identifiable data, such as patient medical records; through data that may be identifiable depending on the level of detail, such as mobility data and hospital admissions; to data where there are no individual privacy concerns, such as viral genome sequences.
Throughout the paper, we will work with the following definitions:
Private data refers to records that, if shared inappropriately, could compromise an individual’s right to privacy, such as medical records, detailed geolocation data, or personal human genomes.
Semi-private data refers to data records that do not have direct identifying information such as names or addresses, but that nevertheless may be re-identifiable depending on the level of detail, such as phone mobility data, census-based commute information, or hospital admissions.
Non-private data refers to records that do not compromise individual privacy as defined above. This might include hospital bed capacity information, vaccine availability, or viral genome sequences.
We also define two categories of data that fall across the privacy spectrum: Commercial data and Political data.
Commercial data are data that are withheld from the public due to industrial competition and intellectual property reasons. This category spans both private and non-private data above - for example, a for-profit science lab might decline to share a viral genome it has sequenced, considering it proprietary sensitive information (which would be defined in this paper as commercial and non-private), or a health insurance company might consider information it could learn from its health records to be proprietary (which in this paper we would define as both commercial and private data).
Political data are data that may or may not be private, but that a government agency of some kind wishes to hide from the public view. In the case of this study, all data that fell in this category tended to be data that made the pandemic look severe in some way, e.g. hospitals over capacity, high community/school COVID-19 case loads, or deaths.
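These definitions can be read as one privacy level per data source plus two independent flags. The sketch below encodes that reading; the class and field names are our own illustrative choices, and the example sources are taken from the examples given in the definitions above.

```python
from dataclasses import dataclass
from enum import Enum


class Privacy(Enum):
    """The three points on the privacy spectrum defined in the text."""
    PRIVATE = "private"
    SEMI_PRIVATE = "semi-private"
    NON_PRIVATE = "non-private"


@dataclass(frozen=True)
class DataSource:
    """A data source with one privacy level and two cross-cutting flags."""
    name: str
    privacy: Privacy
    commercial: bool = False   # withheld for competition or IP reasons
    political: bool = False    # withheld by a government agency

    def labels(self):
        """Return every category that applies to this source."""
        tags = [self.privacy.value]
        if self.commercial:
            tags.append("commercial")
        if self.political:
            tags.append("political")
        return tags


examples = [
    DataSource("viral genome sequenced by a for-profit lab",
               Privacy.NON_PRIVATE, commercial=True),
    DataSource("health insurer's records", Privacy.PRIVATE, commercial=True),
    DataSource("hospitals over capacity", Privacy.NON_PRIVATE, political=True),
]

for source in examples:
    print(source.name, "->", ", ".join(source.labels()))
```

Keeping the commercial and political flags separate from the privacy level reflects the point that these two categories cut across the privacy spectrum rather than sitting on it.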
4.1.1.1. Data sensitivity: profile of study participants
Despite this study initially aiming to recruit participants who worked in non-private subject domains, many participants who signed up worked with private and semi-private data sources as well. These participants were not excluded as many of the access, quality, and re-sharing barriers were still relevant to the research question. As Table 3 shows, most participants worked with more than one of these categories of data.
We neither targeted nor disqualified participants based on their experience with commercial or political data.
4.1.1.2. Data types: profile of data types discussed by participants
Figure 1 illustrates the spectrum of private and non-private data discussed in this study. The study was initially designed in response to barriers the authors had encountered around non-private data: viral genome and COVID-19 spread information.
In the course of the interviews, the scope of data expanded to cover additional data types:
4.1.1.3. Non-private, but potentially commercial and/or political
• Genomic virus sequences.
• Vaccine supply.
• Vaccines administered.
• COVID-19 cases (tests and positivity rate) - in the general community and in schools.
• COVID-19 hospitalisations.
• COVID-19 and excess deaths.
• Hospital bed availability - intensive and regular beds.
• Hospital deaths (with or without COVID-19 positivity, as excess deaths are still pandemic-relevant).
• COVID-19-related lockdown and interpersonal mingling rules.
• Geographical regions and population density.
• Synthetic data designed to resemble real private data.
• Metadata and data structures for private data, without the private data itself.
Viral sequences are potentially commercial if sequenced by organisations that wish to exploit the sequence for commercial gain and can be political as well. More than one participant noted that there were viral sequences from Wuhan, where the pandemic originated, that later disappeared from the servers they were on - although these sequences were later recovered (Bloom, Reference Bloom2021).
Vaccine, hospital, and COVID spread data in particular are data types that were political and may have been obfuscated, redacted, or simply not tracked by government or health officials. This is further explored in the discussion section of this paper.
4.1.1.4. Semi-private
• Mobile phone mobility data - actual geolocation, exercise routes, or planned navigation using maps. Commercial.
• Any data from the non-private category, when it concerns individual people and the data available is in very low numbers and/or combined with additional data points that could de-anonymise individuals.
• Race, age, gender, sexuality, ethnicity, disability, socioeconomic status.
• Census commute data (place of work and place of dwelling).
• Social media.
4.1.1.5. Private
• Patient medical records (anonymised, pseudonymised, or not anonymised).
• Viral host/patient genome records.
• Any of the semi-private data above when very detailed and re-identifiable, even if it is pseudonymised.
• Lockdown-related domestic abuse, and femicides.
4.1.2. Sampling challenges and limitations
This study is not intended to be generalisable or comprehensive; instead, it is exploratory, designed to shed light on areas for potential future action, exploration, and research.
All participants who reached out to participate in the study were accepted, regardless of background, which resulted in a population that was skewed towards higher-income European and North American countries. Ideally, sampling would have been more evenly balanced across continents, cultures, and expertise domains. Because of the small sample size, we did not segregate the analysis by any of these features.
We speculate that there may be several possible reasons for the sample distribution:
1. Recruitment methods: Initially ethical consent for this study was granted on the condition that no individuals were approached directly to participate, as this direct approach may have seemed too coercive. Whilst we emailed international mailing lists, Slack groups, and had thousands of engagements with recruitment tweets, uptake was still very poor. Round two had a much higher uptake rate after an amendment to the ethics plan was approved which allowed researchers to directly approach potential participants - but was then limited to targeted individuals within the researchers’ networks.
2. Time constraints: Several potential participants expressed interest in participating but did not ultimately sign up or attend the interview slot, and one participant was originally contacted during round one but eventually only signed up to participate in round two, citing extreme busyness. Given that people who have expertise in pandemic-related data sharing were generally involved in the pandemic response, it seems likely that many potential participants self-selected out of participation due to high workload.
3. Personal and reputational vulnerability: Given the sensitive nature of the subject matter, people may have been reluctant to participate for fear of re-identification and subsequent responses. Even people who did consent to participate were cautious about the information they shared. Multiple participants redacted some of their statements when given the option, and/or indicated regret for having been as candid as they were in the interview. One participant even went so far as to change their Zoom display name and turn off their camera before they gave consent to record the interview.
4.2. Barriers
4.2.1. Barrier 1. Knowing the data exists, and being able to find it
In order to work with or request data, researchers and data analysts must know the data exists. Multiple interview participants talked about their experiences gaining access to non-private data that was neither deposited on a public data repository, nor made use of, nor publicised. Indeed, one participant describes accidentally stumbling onto restricted-access but highly relevant bed capacity data whilst browsing an institute-shared data archive:
It was only on an off-chance in perusing through these documents on a Sunday evening. I was having a chat with my colleague, and we were both going ‘What else exists in this data dump that we could do something useful with?’ when we stumbled across [this datasource].
Even if data are made accessible to the public in some way, e.g. by being deposited in a data repository that is not access-controlled, discoverability can still be a challenge, as data are usually distributed across a broad range of repositories and may not be designed to facilitate data discovery. Three separate participants reported that they had manually curated lists of data sources on a word processor document because there was nowhere else that effectively collated relevant data sources. Participants share their experiences finding data:
On finding data within a data source with minimal metadata:
They’re literally just sorted by date. PDF only. No underlying data, and unless you know what the [document] was about, it is impossible. If you say ‘what does it say about travel restrictions?’ unless you know when they met to discuss that, good luck finding it without literally going through 300 PDFs.
On finding out whether or not viral sequences exist anywhere accessible:
There are lots of viruses and sequences that fall under the radar […] when you don’t have the data in a central repository. So that’s been kind of the linchpin. A lot of people are working on, probably the same viruses and it’s not that we don’t want to cite them, it’s just when you’re doing a high throughput analysis it’s hard to track down every single case and really do your due diligence that someone didn’t say ‘oh this library actually has this sequence, and we named it.’
And on discovering data and data attributes in publicly accessible data:
Unless you know things are available, you can’t ask for them, so it’s not obvious that you can get data by sex and age from the dashboard. […] I’ve got this document I have collected over the last 18 months, and it’s just links. It’s got everything. New cases types of test, variants of concern, age profiles, countries, contact tracing data, international comparisons, positivity rates, public health. It’s a 10-page document now, and I still add new things to it. What’s frustrating is that you kind of have to know where these things are, otherwise, you can’t really access them. Although it’s all public data, it is not accessible, not really, right?
Even once data has been found, it may not be easy to find again if the naming conventions are not logical and there is little metadata:
It’s going to sound obvious, but more metadata would be great. Every time I need to transfer between [two tables] I spend longer than I should searching the website. It’s surprising how easy it is to lose one of these tables, and you can’t find it again because they have funny names.
4.2.2. Barrier 2. Access: Barriers to obtaining data
The next hurdle researchers face is gaining access to data sources they have identified as relevant or useful. While many participants in this study focused on access to non-private and non-clinical data, we still found that researchers faced systematic access barriers.
4.2.2.1. Cost
Costs can be prohibitive, ranging from thousands to tens of thousands or more for government agency-hosted data and for mobile phone-generated mobility data. Participants reported that some providers shared data without additional cost - for example, Facebook (https://dataforgood.facebook.com/dfg/covid-19) and Google (https://health.google.com/covid-19/open-data/) shared their mobility data publicly without charging for it. Prices for paid data ranged from just enough to cover infrastructure and maintenance costs up to commercial rates in the tens of thousands of pounds, euros, or dollars. One participant remarked that even “at cost” prices can still be significant when dealing with extremely large datasets.
We’ve allocated for one year, something like 30,000 pounds to pay to get data extracts, which just means that when we’re an unfunded volunteer effort - we can’t. Whose grant does that come out of? […] I think it’s just a business model. That’s part of [government agency], that’s how they pay their bills, to maintain their servers.
Mobility data providers in particular varied based on which organisation was providing data. One participant reports:
I think [mobile network operator] in Ireland shared their data initially for free, and then, after a couple of months, they charged the cost for that content, to collect and publish datasets and share this data with the central government. […] it’s not expensive. Low four figures or less, for complete access. Whereas [mobile network operator] and [multinational telecommunications company] expected, I think, upward of 10,000 pounds for the same data and they offered no discounts […] they basically said ‘look on our website: we haven’t changed anything during the pandemic, you’ve always been able to pay for this data with ten grand.’
4.2.2.2. Culture
Two-thirds of participants (ten out of fifteen) referenced culture as a barrier to data sharing and access. Sometimes access control was treated as a default, without questioning whether the data actually needed privacy-preserving measures. Participants reported this both from an industry angle, where corporations were striving to keep intellectual property secret, and from academia, where non-private data were access-controlled by default by being placed on gate-kept institutional servers.
On individual researchers wishing to share data when their institute does not encourage it:
Even people who I would say, want to do this [data sharing] - they’re just not going to do it - institutional momentum, right? ‘That’s not how things are done and therefore that’s not how we’re going to do it’, and I feel like that’s what’s carried over from the 80s and 90s in this field, and they just never stopped to really reflect on this: ‘Is this the way that we really want to do things moving forward?’
And when a supervisor discourages data sharing:
If the person who is overall responsible for organizing the whole thing doesn’t seem in general to be concerned with open research or open data, or anything like that […] at the point where there’s a supervisor who’s saying ‘this is how it’s done’ or not saying how it’s done… I guess you don’t want to step out of line.
4.2.2.3. Institutional access barriers
In some cases, the type of institution or mode of employment may affect access to relevant data. One researcher reported that whilst public universities were allowed to access national COVID-19-related data in their country, any non-government-sponsored universities were not permitted to access the same data, even though they too might be using the data for the exact same purposes as public institutions.
Data access agreements may not always be agreeable for a research institute, either:
I would really like to use Facebook data for that, but I can’t get my university to sign the agreement because it gives Facebook the right to use their branding in advertising.
4.2.2.4. Volunteer labour
Taking advantage of volunteer labour can also present institution-related access issues: Volunteers are not usually considered members of an organisation even if they are working on an institute-backed project. One participant reported needing to arrange honorary positions at a university for their volunteers, whilst another participant was asked to analyse data in a volunteer capacity but then denied access to the same data (including access to the metadata model, which would not reveal any private data) by the same institute that had asked for help. Volunteer-related access barriers may also be technical, e.g. if the volunteer does not have access to an institution’s computer network, they may not be able to connect to the institution’s servers to access the data.
4.2.2.5. Friends in high places
One unevenly distributed data access route that participants mentioned was having the right connections. Multiple participants noted that knowing someone who was already involved in the project, or knowing someone extremely prominent, was often one of the most effective ways to gain access to data.
One participant, reflecting on prominent people asking for data:
If the mechanism for accessing data is ‘email someone and wait for them to respond’, then if you don’t ideally have both professor and OBE involved in your name, you know you’re going to struggle.
Another participant, reflecting on gaining access to data due to incidental association with another researcher:
We got lucky - we were grandfathered in because we sit within an adjacent group to the person who sat on the original […] committee. There was no formal application process.
4.2.3. Barrier 3: Utility: Hard-to-use data, even once available
If a researcher successfully gains access to privileged data or uses open data, they face a new challenge. Access alone does not guarantee data is well-documented or easy to understand. As one participant asserts:
Essentially, you had a treasure trove of information that was not at all mapped to each other, that a lot could have been done with, which was being heavily access managed and not at all curated.
4.2.3.1. Data trustworthiness
The biggest barrier by far in this category – reported by fourteen of the fifteen participants – related to the trustworthiness of the data they were using. Often this was down to poor quality control – information that conflicted with itself, such as three PCR tests for a single patient in a single day: one negative, the second positive, and the third negative again. Other examples of poor quality control reported were:
• Obviously corrupt data, such as English sentences embedded in data that purported to be amino acid or nucleotide character sequences. This type of sequence usually consists of a limited sub-set of the alphabet, and usually appears to be random to the human eye – for example, the first few characters of the nucleotide sequence for BRCA1, a breast cancer gene, are “gctgagacttcctggacgggggacaggctgt” (InterMine, 2022; Smith et al., Reference Smith, Aleksic, Butano, Carr, Contrino, Hu, Lyne, Lyne, Kalderimis, Rutherford, Stepan, Sullivan, Wakeling, Watkins and Micklem2012).
• Lack of outlier or input error correction, such as hospital bed capacity dropping from 50 to 0 overnight, then returning to 50 the next day.
• Variations between input from different people in the same setting.
• Incorrect reporting dates: for time-series data such as vaccinations, infections, hospitalisations, and deaths, there may be reporting lags resulting in higher or lower numbers than the reality on a given date.
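Two of the quality problems listed above lend themselves to simple automated checks. The following is a minimal sketch, not a method used by the participants: it assumes sequences use the IUPAC nucleotide alphabet, and the function names and the `max_ratio` threshold are hypothetical choices made for illustration.

```python
import re

# Assumption: sequences use IUPAC nucleotide codes (A, C, G, T, U plus ambiguity codes).
NUCLEOTIDE_RE = re.compile(r"[ACGTURYSWKMBDHVN]+", re.IGNORECASE)


def looks_like_nucleotides(sequence: str) -> bool:
    """Return True if every character is a valid IUPAC nucleotide code."""
    return NUCLEOTIDE_RE.fullmatch(sequence) is not None


def flag_transient_drops(values: list[int], max_ratio: float = 0.1) -> list[int]:
    """Return indices where a value collapses for one step and then recovers.

    A point is flagged when it is at most `max_ratio` of both neighbours,
    e.g. bed capacity reported as 50, 0, 50 on consecutive days.
    """
    flagged = []
    for i in range(1, len(values) - 1):
        before, current, after = values[i - 1], values[i], values[i + 1]
        if before > 0 and after > 0 and current <= max_ratio * min(before, after):
            flagged.append(i)
    return flagged
```

A check like this only flags candidates for human review. A short English word made up of valid codes would still pass the alphabet test, and a genuine one-day closure would still be flagged as a drop.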
4.2.3.2. Data may have explained or unexplained gaps
Gaps in data may arise due to causes such as intentional institutional or governmental redaction, human recording error, and lack of time to record variables. Lack of foresight or planning can result in a failure to capture certain data aspects that might be useful including inpatient health-related variables, or socioeconomic status, gender, ethnicity, and disability variables.
4.2.3.3. Data provenance and methods may be missing
Being able to understand where data came from (provenance) and how it was gathered (methods) is an essential part of being able to rely on data and use it for trustworthy analyses and policy/behavioural advice.
Lack of provenance can result in data being discarded or ignored, even if it would otherwise have been useful. One participant illustrates a scenario where they know a data point, but are unable to provide the provenance of this knowledge:
Having access to highly protected things, like healthcare data, for half of my work is bizarre because, for example, we need to know what proportion of people who enter the hospital over the age of seventy dies? What is that rate? I know it because I generated it for one of my models, but I can’t use it for the other model.
Multiple participants discussed issues with “derived” data, where a number of datasets are provided, but the methods to calculate these numbers are not clearly laid out and can be difficult to trust or re-use. Often, this kind of barrier can result in laborious reverse engineering, and re-calculations based on guesswork.
[There are] many pages of supplementary where we have documented everything we’ve done, because what’s amazing is that I don’t think the way [a healthcare organisation] calculates those numbers is necessarily right. But given that we’re reverse engineering, it wasn’t something we could call out, because we don’t know the definition, but we’re like ‘this doesn’t make sense.’
4.2.3.4. Spreadsheets, CSV files, and Microsoft Excel
Multiple participants discussed issues with spreadsheets, mentioning CSV file issues, and sometimes expressly mentioned Microsoft’s Excel data file format.
It is in an Excel sheet, which is kind of fine, except […] the table appears partway down the sheet, with some explanatory text, and the order of the sheets changes, sometimes. […] It just has a series of different tables copied one after the other in one CSV file, and you just have to know which line the table you want starts on, so if they ever reorder them it completely breaks.
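One way to make a script less fragile against the reordering this participant describes is to locate a table by its header row instead of by line number. The sketch below assumes that each table starts with a known header row and ends at a blank line; the function name `extract_table` and these layout assumptions are illustrative and are not taken from any participant's pipeline.

```python
import csv
import io


def extract_table(text: str, header: list[str]) -> list[dict[str, str]]:
    """Return the rows of the table whose header row equals `header`.

    The table is assumed to end at the first blank row or at end of file.
    """
    rows = csv.reader(io.StringIO(text))
    for row in rows:
        if [cell.strip() for cell in row] == header:
            break  # found the header; the same iterator now points at the data rows
    else:
        raise ValueError(f"header {header!r} not found")

    table = []
    for row in rows:
        if not any(cell.strip() for cell in row):
            break  # a blank row marks the end of this table
        table.append(dict(zip(header, (cell.strip() for cell in row))))
    return table
```

Because the lookup fails loudly when the header is missing, a renamed column surfaces as an error instead of silently producing the wrong table.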
Whilst spreadsheet use facilitates data analysis and creation for non-data professionals, inexpert use can cause challenges. One participant reported that whilst their institution had a database-backed data repository, clinicians were extracting spreadsheet-based reports and editing/annotating them manually in the spreadsheets, with no effective way to bulk-import the data back into the institutional database. When data analysts were called in to create advanced data analysis pipelines, they then were forced to either sanitise and re-integrate the spreadsheet-based data sources or abandon the meaningful additions created by clinicians. In the long term, this institute was able to identify and disincentivise fragmented records, but this requires both training and technical functionality to allow sufficient free-text or structured annotation.
Another participant reported on spreadsheets used for time-series data, reporting on multiple healthcare institutions at a national level. Each day’s data was represented in a single sheet. In order to create time-series-based analyses, multiple sheets needed to be amalgamated across a long period of time, a highly technical challenge that might have been avoided by a more suitable data format. Institutional reporting lags and missing days further complicate this scenario.
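The amalgamation step described above can be sketched as follows, assuming each day's sheet has already been parsed into a mapping from institution name to a count. The function name `build_time_series` and the data shapes are hypothetical. Missing days are kept as explicit `None` values so that reporting gaps stay visible to the analyst.

```python
from datetime import date, timedelta
from typing import Optional


def build_time_series(
    daily_sheets: dict[date, dict[str, int]], site: str
) -> list[tuple[date, Optional[int]]]:
    """Combine one-sheet-per-day records into a continuous series for one site.

    Days with no sheet, or with no entry for the site, are recorded as None
    instead of being silently skipped.
    """
    if not daily_sheets:
        return []
    day, end = min(daily_sheets), max(daily_sheets)
    series = []
    while day <= end:
        series.append((day, daily_sheets.get(day, {}).get(site)))
        day += timedelta(days=1)
    return series
```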
Spreadsheets facilitate “hidden” columns, rows, and sheets of data. This may be used to appear tidy or to prevent inappropriate data editing, but when a human attempts to run data analysis, these hidden subsets of data may obscure important details, including meaningful data provenance.
One participant commented on the UK’s COVID-19 case misreporting due to an .xls spreadsheet file failure:
The other thing that this pandemic revealed […] is that our data infrastructure just is not up for this kind of task, at least in the health sector. […] I recall a case from the UK, where essentially the government was reporting […] most of the cases as an Excel sheet and at some time you know an Excel sheet was not going to be enough. I mean maybe a different kind of database could have done it. And you wonder like ‘well, hang on, I mean this is the way we’re counting these in one of the most advanced countries on earth? So, well, what’s up?’
4.2.3.5. Temporal changes and temporal context
Many participants pointed out that data often needs to change over time as columns are added, more efficient formats are created, or new analyses are devised. However, when a data format changes over time, this context may not be recorded, nor may the date of the change be noted.
4.2.3.6. Geographical region definitions
Different geographical data sources may name the same regions differently – e.g. “Cambridgeshire” and “Cambs” are equivalent to human readers but stymie a computer script. Regions also change over time as boundaries and districts are re-defined, making it hard to reliably define regions without significant temporal context. Discussing difficulties converting from one geographical coordinate set to another, one participant stated:
Guides on how to do that from the government come with a big warning sheet saying you should under no circumstances use this for any software that informs policy, which is not very helpful… because we have to.
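The naming half of this problem is commonly handled with an explicit alias table. The sketch below is illustrative only: the table contents and the name `canonical_region` are assumptions, and a real table would also need to record the period for which each name and boundary was valid.

```python
# Hypothetical alias table; a real one would be curated per data source.
REGION_ALIASES = {
    "cambs": "Cambridgeshire",
    "cambridgeshire": "Cambridgeshire",
}


def canonical_region(name: str) -> str:
    """Map a free-text region name to its canonical form.

    Unknown names raise KeyError so that new spellings are noticed
    instead of being passed through unchanged.
    """
    key = " ".join(name.lower().replace(".", "").split())
    try:
        return REGION_ALIASES[key]
    except KeyError:
        raise KeyError(f"unknown region name: {name!r}") from None
```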
4.2.3.7. Temporal contexts for mobility and behavioural changes
Geography is not the only area that is enriched by temporal context. When interpreting infection spread rates, it is important to understand what legal movement and behaviour controls were in place at the time, and when significant changes were made, such as the lifting of lockdowns or social distancing. One participant reported that whilst it is usually straightforward to find what a given legal requirement is right now, finding out what legal protection measures were in place for historical events becomes much harder.
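If rule changes were published as a dated change log, looking up the rules in force on any historical date would be straightforward. The sketch below assumes such a log exists and is sorted by date. The entries in `RULE_CHANGES` are placeholders and not real policy history, and `rule_in_force` is a hypothetical helper.

```python
from bisect import bisect_right
from datetime import date

# Placeholder change log: (effective date, rule). Not real policy history.
RULE_CHANGES = [
    (date(2020, 3, 1), "no restrictions"),
    (date(2020, 4, 1), "stay-at-home order"),
    (date(2020, 7, 1), "restrictions eased"),
]


def rule_in_force(changes, on):
    """Return the rule in force on date `on`, or None before the first entry.

    `changes` must be sorted by effective date.
    """
    effective_dates = [effective for effective, _ in changes]
    index = bisect_right(effective_dates, on)
    return changes[index - 1][1] if index else None
```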
4.2.3.8. Inability to integrate or link datasets
Sometimes one dataset could meaningfully answer a question raised by another, but the two nevertheless cannot be combined. This barrier may be technical – that is, there is no common unique identifier with which to harmonise records across data sources – or it may be a sociolegal barrier: some licence terms prevent data remixing. One participant illustrates how this can be a challenge:
What I don’t want to see happen is to create like a multi-tiered system or a system where data is not centralized, because that’s the nightmare of bioinformaticians, where you have to go to some private database for these sequences, and then you go to the GenBank for these sequences and then cross-referencing them, where are they duplicated…
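Where a shared identifier does exist, cross-referencing two sources reduces to a keyed join plus a report of what failed to match. The following sketch assumes records are dictionaries and that the identifier is unique within the second source. The name `link_records` and the field names used with it are hypothetical.

```python
def link_records(left, right, key):
    """Join two lists of dict records on `key`.

    Returns (linked, only_left, only_right): the merged records, plus the
    sorted identifiers that appear in only one of the two sources.
    Assumes `key` is unique within `right`.
    """
    right_by_key = {row[key]: row for row in right}
    left_keys = {row[key] for row in left}
    linked = [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]
    only_left = sorted(left_keys - right_by_key.keys())
    only_right = sorted(right_by_key.keys() - left_keys)
    return linked, only_left, only_right
```

Reporting the unmatched identifiers matters as much as the join itself, since records that silently drop out are a common source of the gaps discussed earlier.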
4.2.3.9. Data may not be designed for computational analyses
Participants reported accessing data as Excel “data dumps” - potentially hundreds of sheets that must be laboriously downloaded by hand - or as PDFs, which are not easily read by a computational script. Other participants reported having to copy/paste data from web pages daily and re-typing data embedded in graphics. When data are not designed to allow programmatic access (e.g. via an API), fetching data and slicing it into the correct sub-sets (for example, fetching only hospitalised patients over a certain age) becomes difficult or impossible.
4.2.3.10. Technical barriers
Many of the use barriers discussed thus far are a result of data structure design issues, training needs, and human error. Technical infrastructure, especially infrastructure that is suddenly used in unanticipated ways, may present additional barriers – it may not be designed to be accessed externally, e.g. because of the use of a data safe haven, or it may not be designed to cope with a heavy load of users. One participant reported that they regularly experienced incomplete/interrupted downloads, and potential rate limits, and had little choice but to script repeated attempts until a full data download was completed.
[The] server just terminates at some point, and your curl command just hangs for hours or days. I mean, we had no workarounds in place […], and resubmit data hundreds of jobs a hundred times until they succeed.
If you have a pandemic, and if you have data sets that […] thousands of researchers on a daily basis want to use, or need to use, to create dashboards - our current infrastructure was struggling.
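The scripted repeated attempts that participants described can be sketched as a retry loop with exponential backoff. This is a generic pattern and not the participant's actual script: `fetch` stands for any zero-argument download function, and the attempt limit and delays are illustrative defaults.

```python
import time


def fetch_with_retries(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `fetch()` until it succeeds, waiting longer after each failure.

    `fetch` is any zero-argument callable that returns the data or raises
    OSError (the base class of ConnectionError and TimeoutError).
    The final failure is re-raised so the caller can see it.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except OSError:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
```

Backing off between attempts also reduces load on a server that is already struggling, which blind immediate retries would make worse. A hung connection like the one in the quote above would additionally need a timeout inside `fetch` itself.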
4.2.3.11. “Data cleaning” is time-consuming
Given the rife data quality and usability issues, it is perhaps unsurprising that participants reported laborious processes to prepare the data for analysis. One participant stated:
The data came in a very basic format and we found a lot of issues. We spent a lot of time, in fact almost about a year, just to clean up the data.
4.2.4. Barrier 4: Further distribution: Barriers to re-sharing data and analyses
Once a researcher overcomes the barriers in finding data, accessing it, and preparing it for use, they may face difficulty disseminating the analyses they have created - perhaps due to data use agreements that prohibit re-sharing data, or permit re-sharing only of analyses and derived numbers, or simply require pre-approval before analyses are shared.
4.2.4.1. “Open” data, but no sharing onwards
Restrictive data sources may disallow any redistribution of their data, which can result in analyses being impossible to reproduce or researchers simply abandoning the data source entirely in favour of less restrictive data sources. Participants weigh in on data sources with strict or prohibitive re-use policies:
Two participants weigh in on GISAID, a viral genome sequence database:
The GISAID database which contains the SARS-CoV-2 sequences is blocked off, and it’s kind of not worth the hassle to use that data, even though it might be informative because if we use it we then can’t share our data, which is a derivative.
One of the things we care about is not sharing just the data itself, but also all the code and scripts and stuff that we use to analyze the data and produce the final output. But for the parts where we’re building trees to compare the final genetics of our isolates versus the world’s, we can’t actually bundle that data together. The GISAID data that is used as the context is not distributable, and so that’s where there’s kind of this gap between our ideal of being able to provide everything bundled together.
4.2.4.2. Re-sharing only approved items
Other data sources may require that they pre-approve all published items that use their data in some way:
On Strava Metro, a city mobility data source, which has a policy requiring all re-use to be approved by Strava:
I spoke to one person from local government who just went ‘we’re just not going to use it then’ - there’s just no value to it if you’re going to provide that amount of restriction.
Two participants talk about getting permission to share analyses of data that were deemed politically sensitive during the peak of a pandemic wave:
A big battle that we had with [the data creator] was that [the data user] had made an agreement that, before anything was published, it would be run past them and so as we put our preprint on [a preprint server], [the data user] let [the data creator] know, and there began a little six-week ordeal where we would just go back and forth being like ‘can you please sign this off’ and they’d be like… they didn’t even respond, and then after six weeks, the only reason we got signed off was that they released a portion of the [dataset] into the public domain as they normally would.
I think they’ve managed to publish a few papers, but they’re always kind of outdated because it takes so long for [governmental body] to say yes. And they weren’t allowed to preprint them, so I know that that has been quite frustrating and would have been actually really important data to help shape how people talk about COVID-19 and adherence.
4.2.4.3. Data integration results in redistribution challenges
A recurring theme researchers reported was that mixed-permissions datasets are hard to re-share. Data users are forced to choose whether to ignore restrictive datasets entirely and use only open-permissioned datasets, or to integrate richer restricted datasets but have difficulty distributing their work. Mixed-permission datasets can either be redistributed with gaps, sharing only the open elements and making the research non-reproducible, or data distributors may invest in time- and resource-heavy multi-tier access systems.
4.2.4.4. Sharing may be blocked due to data quality concerns
Whilst ensuring quality is unarguably important, over-strict rules can prevent even good-quality data from being shared onwards. One participant reports that despite their data being suitable for BLAST, a popular computational biology analysis tool, their re-shared data was deemed “third-party” (that is, less trustworthy) and was not stored in a way that permitted BLAST analysis:
A bit of a snag with [a genomic sequence database]. [The database] was like ‘Oh, we’ll take the sequences, but then they’re going to be classified as third-party assemblies,’ which means that they’re not BLASTable, they’re not really essentially useful in any real way. And so now I need to try to convince [the database] that ‘no, these are viral genomes’ - they might not be complete in every case, but these are very distinct viruses and there is no representation in [the database].
Another participant describes the trade-off between making data easy to share and enforcing quality:
Both go hand in hand, and the challenge is - and I don’t have a good answer for that - how can we keep good metadata, but still make it easy to upload? […] We are now at a stage where we have an Excel table, so people can use the tools that they know and put data in, and then we upload it with an Excel table. But still, it still can be more complicated than [another database that the participant expressed quality concerns with]. So yes, there’s a trade-off. But I guess, if we do our best, and if we make this as simple as possible, people will hopefully make the extra effort of the metadata.
4.2.5. Barrier 5: A barrier woven throughout: Human throughput
Human throughput creates bottlenecks throughout the entire process of working with data: it takes time and effort to find existing data sources, humans are gatekeepers for access control, data are often recorded manually by busy staff, many data pipelines and analyses require intervention to run, and human dissemination is required to share results.
4.2.5.1. No time to record data or respond to access requests
People working in these areas may have been resource-stretched even before COVID-19 created acute demand periods. A participant comments on pandemic response staff having trouble finding the time to accurately record patient information:
All this data is manually updated by the residents who ran helter-skelter across the various COVID-19 wards, and whenever they get time, they do it, so we can not expect much. But fairly, by and large, it’s okay.
Multiple participants commented that people often had too much going on to handle data access requests:
I would have loved the other stuff, but everything they had of value was being either guarded, or people were just too focused on doing the work to be able to facilitate others getting access.
Lots of people were working flat out and it’s just a time and resources barrier for being able to… lots of people in different bodies were working at capacity, and therefore just did not have time to do extra stuff.
4.2.5.2. No time to try new approaches or analyses
In other cases, a team may have access to data but not have enough time to try new things or to analyse the data they have, or the quality of data produced may be poorer than in non-crisis times.
One participant elaborates on an un-used data set:
[They were] given a real big data set and it sat untouched by the people on the [analysis] team. It was a really exciting kind of access and they could have probably done quite a lot with it, but they were just working on the things that they always worked on in a way that they always worked on. They didn’t have time to go ‘Okay, how do we actually integrate this into the data we already collect in the dashboards, that we already produce?’
Another participant comments on trying to encourage their colleagues to produce better-quality data:
They don’t exactly understand why they have to, and feel lost and even misunderstood - like a reaction of ‘come on, I have millions of things to do before, and you came and are telling me that I’m not doing my work well’, you know? No, because I’m gathering data in Excel spreadsheets…
4.2.6. Barriers: A summary
Table 4 concisely presents key points from each of the five barrier types.
4.3. Good experiences and “dream” data sources
Whilst presenting the many consecutive barriers to effective data sharing in a pandemic can appear bleak, there were also beacons of light. Participants described both specific datasets and aspects of existing data sources that worked well, as well as suggesting “dream” or “wish list” items that may not have all existed, or at least not all in a single place, but that participants considered desirable aspects of a high-quality data source. Here, we will describe the attributes that participants valued, and when there are no anonymity concerns, we will provide the example dataset names that illustrate a given point.
Both the wish list and the examples of existing good data sources covered areas that addressed barrier 1 (knowing data exists), barrier 2 (accessing data), and barrier 3 (hard-to-use data). No wish list items or good examples specifically addressed barrier 4 (re-sharing data) or barrier 5 (human throughput).
4.3.1. Addressing barriers 1 & 2: Actively sharing or updating data
Barriers 1 and 2 are around knowing data exists, and then being able to access that data, so it is perhaps unsurprising that pro-actively sharing data was a behaviour that made people particularly appreciate dataset creators and originators. Participants applauded organisations - research organisations, government sources, and commercial organisations - which pre-emptively shared their data without needing to be asked.
One participant discusses un-prompted sharing of viral genomic sequences:
The Craig Venter Institute had tons of Sanger sequencing on coronaviruses, and in January as soon as the pandemic started, they released everything into the SRA, and they didn’t make a big deal about it or anything - they were like ‘here are all these coronaviruses that we’ve had sitting in our database.’
Another participant pointed out that reducing data access application overhead when possible can make a difference. Here, they reference a Facebook-provided dataset:
They have shared data that usually you would need a very lengthy agreement to get. You do have to sign the agreement, but it’s like ‘tick this box on one form.’
Other cited examples included a dataset that was updated, without prompting, to comply with newer data standards, and the public release of Google, Apple, and Citymapper data from mobile phone mapping/navigation records.
In a similar vein, one participant reported the value of knowing when a dataset had been updated:
When the numbers are moving fast, they send us updates like two or three times per day. This is really nice, and we can actually have an idea of what’s going on, like on a daily basis.
4.3.2. Addressing barrier 3: Making data easy-to-use
The third barrier we presented was around the difficulties of using data, once (or if) a participant has successfully found and accessed the data they needed. Easy-to-use datasets as described by study participants were machine-readable, integrated multiple data sources and data types seamlessly, were trustworthy - they did not need pre-processing, verification, or “cleaning” - and they had standardised tooling available to wrangle the data with.
4.3.2.1. Integrated and/or linked data
Participants wished frequently for data that was integrated - creating a scenario where disparate datasets from different sources can be treated as a single large dataset and queried as one.
One participant cited two projects, PIONEER and DECOVID, that attempt to integrate data in this way, and explained how useful this could be in UK-based medical scenarios. More information on these datasets is available via Gallier et al. (Reference Gallier, Price, Pandya, McCarmack, James, Ruane, Forty, Crosby, Atkin, Evans, Dunn, Marston, Crawford, Levermore, Modhwadia, Attwood, Perks, Doal, Gkoutos, Dormer, Rosser, Fanning and Sapey2021).
Trying to bring together two complete electronic health record data sets in OMOP [an electronic health record data standard] is a non-trivial task, I don’t know of anyone that had actually done it prior. […] It’s attempting to […] create an end-to-end solution, where from the moment you interact with acute services, to when you end up in critical care, to your discharge […] every person that interacts through the NHS 111 [National Health Service health advice hotline] but then went to A&E [Accident and Emergency Department] and ended up in critical care, then got discharged - and being able to match it to ONS [Office for National Statistics] outcomes, so that you know, actually, who died, not just the partial observation that we get in medical records - that would be the dream, but I think that’s everyone’s dream.
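The kind of linkage this participant describes can be illustrated with a minimal sketch. It assumes, hypothetically, that every source already carries a shared pseudonymised identifier (`patient_id`); the source names, field names, and records below are invented for illustration and do not reflect the PIONEER, DECOVID, or OMOP schemas.

```python
from collections import defaultdict

# Invented records from three separate, hypothetical sources.
helpline_calls = [{"patient_id": "p1", "called": "2021-01-02"}]
critical_care = [
    {"patient_id": "p1", "admitted": "2021-01-05"},
    {"patient_id": "p2", "admitted": "2021-01-07"},
]
outcomes = [{"patient_id": "p2", "outcome": "discharged"}]


def link_records(**sources):
    """Group records from each named source under their shared patient_id.

    Simplification: if a source holds several records for one patient,
    only the last one is kept.
    """
    linked = defaultdict(dict)
    for source_name, records in sources.items():
        for record in records:
            fields = {k: v for k, v in record.items() if k != "patient_id"}
            linked[record["patient_id"]][source_name] = fields
    return dict(linked)


journeys = link_records(
    helpline=helpline_calls,
    critical_care=critical_care,
    outcomes=outcomes,
)
```

In practice, the hard part is usually agreeing on the shared identifier and doing so without de-anonymising individuals, which this sketch takes as given.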
4.3.2.2. Computational data access and “big data”
Large datasets can be hard to download – both challenging in terms of time and in terms of data storage. Mirroring large datasets online in multiple places may allow a data user to do their processing work entirely “in the cloud” – bringing the computational analysis to the data, rather than bringing the data to a local machine:
A good example for me in terms of datasets – the Thousand Genomes data set is a really nice one because the data is completely open. We now have high-quality versions of that dataset that we can host on cloud […] one of the nice aspects are you can find mirrors on various platforms, so that regardless of what infrastructure you’re using, it’s usually easy to access it and not have to do a lot of work to get a copy, because as these data sets get bigger and bigger you really start running into limitations of how reasonable, how feasible, it is to actually get a copy for yourself.
Programmatic access to data – that is, making it easy for a computer script to access the data without any human intervention needed – was cited as a key attribute of GenBank and RefSeq:
GenBank, RefSeq have been the anchors for this project, and that’s just purely based on their availability policy. I can grab everything, I can search it, I have programmatic access to get the right data. This is fundamental because when you’re dealing with 40,000 entries you can’t manually go through this much data. I mean, we still have to go through the data, but you can’t slice it efficiently without programmatic access, right?
Programmatic access could be further enhanced by creating highly predictable computational querying schemata, using an accession number (unique computational identifier) and clear URL structure:
All the data ends up being migrated into a data warehouse where you can predict the URL of every single object, as long as you know the accession […] that not only allowed us to build a website around database and data, but that also lets anyone access the data that they want just by typing in the URL.
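A predictable scheme of this kind might look like the following sketch. The base URL, path layout, and accession shown are hypothetical, chosen only to illustrate that a script can derive an object's location from its accession without any lookup step.

```python
from urllib.parse import quote

# Hypothetical base URL; a real data warehouse would publish its own scheme.
BASE_URL = "https://data.example.org"


def object_url(accession: str, file_format: str = "fasta") -> str:
    """Build the URL of a data object from its accession alone."""
    if not accession:
        raise ValueError("accession must not be empty")
    # Percent-encode the accession so unusual characters cannot alter the path.
    return f"{BASE_URL}/records/{quote(accession, safe='')}.{file_format}"


print(object_url("ABC123456"))
```

Because the mapping is a pure function of the accession, the same rule serves a website, a batch script, and a person typing the address by hand.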
4.3.2.3. Wish list: Open and standardised tooling
Making data easier to use also involves making sure there is useful tooling available to manipulate the data with – to allow people to clean, slice, filter, and visualise the data they are working with.
One participant wished for open source tooling – that is, tooling where the computer code for the tool is freely available to access, and the tool itself is generally available at no monetary cost to its user:
Maybe such Open Source solutions could be made available, so others could implement them at no cost, and this can become more commonplace, so analysis could be done and more can be extracted from that data to bring benefit back to the patients from whom the data came.
Fake or “synthetic” health records that match the exact data fields of real health records allow health record analysis to be carried out computationally. A coder can design the analysis and test it on the fake data, without ever having to access privileged private health records. Someone who does have access to the records could then run that analysis on the real document set. One participant wishes for standardised tooling to create datasets like this:
To generate synthetic data, to generate fake data, there are also starting to be tools in place, but there are no standards yet. This is another blocker because if you need to produce fake data, you have to figure out how. You don’t have a protocol or procedure.
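As a minimal sketch of the idea, and in the absence of the standards this participant wishes for, synthetic records can be generated from a schema that lists each field alongside a rule for producing a plausible value. The field names and value ranges below are hypothetical and are not drawn from any real health record system.

```python
import random

# Hypothetical schema: field name -> function producing a plausible fake value.
SCHEMA = {
    "age": lambda rng: rng.randint(0, 100),
    "sex": lambda rng: rng.choice(["F", "M", "unknown"]),
    "admitted_to_icu": lambda rng: rng.random() < 0.1,
}


def synthetic_records(n, seed=0):
    """Return n fake records with the same fields as the schema.

    Each field is drawn independently, so correlations present in real
    data are not preserved.
    """
    rng = random.Random(seed)  # seeded so test runs are repeatable
    return [{field: make(rng) for field, make in SCHEMA.items()} for _ in range(n)]


fake = synthetic_records(5)
```

An analysis written against `fake` exercises the same field names and types as the real records, so it can later be handed to someone with access to run unchanged.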
Another participant wished for tooling (and high-quality data) to allow pandemic predictions - not for modelling human movement and behaviour, but for modelling what effects different viral genome mutations have.
4.3.2.4. Wish list: Trustworthy data sources
A wish for trustworthy data sources was another aspect that mirrored the frustrations of barrier 3. Desired aspects of trustworthy data include “clean” data that does not need preparation before being used, data in standardised formats where possible, the ability to annotate existing data sources with third-party annotations, and knowing what biases may exist in your data.
Data standards: Participants expressed a desire for clear naming conventions for entities, and for schemata for data presentation.
Third-party data annotation: One way to facilitate data trustworthiness is by allowing data users to share their knowledge and experience using a given data source – such as by creating machine-readable third-party annotation options for data sets:
I think annotations of datasets are missing […] third-party annotations need to increase here. […] ‘We figured out that these hundred accession numbers are [incorrect] - the quality is not suitable for downstream experiments’, ‘I have my doubts that this is actually COVID-19’ - these kind of things.
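One way such annotations could be made machine-readable is sketched below. The record structure, concern codes, and accession numbers are all hypothetical; no existing annotation standard is implied.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Annotation:
    accession: str  # identifier of the annotated record
    annotator: str  # who raised the concern
    concern: str    # short machine-readable code
    note: str       # free-text explanation


# Invented annotations echoing the kinds of concern quoted above.
annotations = [
    Annotation("EX0001", "lab-a", "low_quality", "Not suitable for downstream experiments"),
    Annotation("EX0002", "lab-b", "identity_doubt", "Doubt that this is actually COVID-19"),
]


def flagged(accessions, annotations):
    """Return the accessions that carry at least one third-party annotation."""
    flagged_ids = {a.accession for a in annotations}
    return [acc for acc in accessions if acc in flagged_ids]


# Serialise to JSON so other tools can consume the annotations.
print(json.dumps([asdict(a) for a in annotations], indent=2))
```

A downstream user could then call `flagged` on their working set to exclude or double-check records before analysis.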
Knowing what biases may exist in your dataset: One participant wishes for machine-readable bias metadata for datasets:
[It would be useful if] machines know what part of the dataset you have access to or not. For example, what kind of biases are supposed? If you don’t have access to one hundred percent of the data if you only have access to 70%?
Other elements of trustworthy data that participants called out include intentional and clear null values (that can be expressly discerned from accidentally forgotten data gaps), consistent naming conventions for files and entities, and collaboration between the humans who create these datasets.
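The distinction between intentional nulls and accidental gaps can be sketched as follows, assuming a hypothetical convention in which a small set of reason codes marks a value as deliberately absent.

```python
# Hypothetical reason codes marking a value as deliberately absent.
NULL_REASONS = {"NOT_COLLECTED", "NOT_APPLICABLE", "WITHHELD"}


def classify(record, field):
    """Return 'value', 'intentional_null', or 'unexplained_gap' for a field."""
    value = record.get(field)
    if isinstance(value, str) and value in NULL_REASONS:
        return "intentional_null"
    if value is None or value == "":
        return "unexplained_gap"
    return "value"


print(classify({"age": "NOT_COLLECTED"}, "age"))
print(classify({"age": ""}, "age"))
```

With a convention like this, an analyst can tell "we chose not to record this" apart from "this was probably forgotten" without contacting the data creator.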
Finally, a participant highlights that a small dataset which meets multiple quality standards (a dream data source, rather than a real one) is far more valuable than a large but messy dataset:
With a few variables, you can do so much, and with more variables, you can, of course, do a lot more. But to me, my main pain point was this integration, harmonization and worry about quality. […] The last thing you want is to publish something, and it’s not interpreted correctly or it’s lacking, or we missed something. That scares me, and so my dream data would be quality in terms of well-done integration, harmonization, standardization, of the data and if that’s done, even if small, I would take it, and be happy with that.
4.3.3. Data source quality may change or improve over time, and no data are ever perfect
I try to steer away from examples that kind of held up as an ‘ideal use case’ because those don’t tend to exist.
Most of the good or “dream” attributes of data sources shown above are individual aspects of imperfect datasets - no one dataset combines all these attributes into a single neat package. Over time, however, around one-third of participants report that some data sources improved in response to pandemic pressure. Others reported that data sources did change, but not for the better.
One participant describes taking a year to analyse and clean a clinical data set, and create a computational pipeline to easily produce analyses with the data in the future:
So [after] slightly more than a year, we’ve gone through the manual [data], and we’ve also gone through the automated [data]. We have a tool now and we have the data that goes through the computational pipeline. I think there’s a big difference in the quality of that data compared to the original. We also identified issues that were maybe not obvious unless you had done those detailed analyses. So all in all, I think this was a good experience and a good exercise. The teams are very happy that there’s actually a tool […] they’re excited to be able to use this tool for their other research interests.
Another participant notes that digitisation of government services has improved in response to the pandemic:
The digitalisation of government services has been a long and overdue agenda here in [our region]. We say all governments to some degree, managed to improve at least a bit, the way they engage online because of the pandemic.
4.4. Risks and ethics in COVID-19 data sharing
We asked interview participants whether they had any ethical issues they wished to highlight. We do not report in detail on this section, as in most cases, there was little additional content that had not been highlighted in previous sections of the interview, such as the ethical imperative to share information smoothly in order to reduce the impact of pandemic waves. The need to share was occasionally balanced by the experience of participants who’d been allowed access to patient data so swiftly they doubted that appropriate consent had been sought or granted. We view this range of concerns as a spectrum, as shown in Figure 2, with a difficult-to-find “sweet spot” in the middle, where data are shared in a manner that allows use without breaching ethical and privacy concerns.
Equity was also raised as a worry; one participant from a high-income region noted that they wanted to deal equitably with lower-resource settings, but were unsure how to best do this without being patronising, colonialist, or extractive. On the other hand, participants from lower-resource settings did not raise extractivism and colonialism as concerns at all. One participant from a lower-resource setting expressly wished that people from a higher-income setting would treat low-resource setting-generated research data as valuable, useful, and something to learn from. We note that these are individual points of view, and unlikely to be highly representative of others in similar settings.
5. Discussion
When the COVID-19 pandemic began to affect countries globally in early 2020, governments, funding bodies, researchers, technologists, medical professionals, and many others responded en masse to the crisis. In some cases, existing data infrastructure was ready for the demands placed upon it, either because it had been designed with an epidemic in mind, or because its human, social, legal, and technical frameworks were designed to be swiftly adapted. In many other cases, the frameworks to cope with a sudden urgent need for data were insufficient or nonexistent.
Given that human throughput may be put under sudden and unplanned pressure due to crises such as a pandemic or epidemic, an emphasis should be placed on designing flexible knowledge management systems that can swiftly be adapted. Participants in this study reported in multiple instances that some of the best responses they had seen were in scenarios where there was such an infrastructure - technical (such as a medical reporting system that could easily be adapted to add new data columns) or human (such as “data champions” embedded in organisations). These data-sharing-ready infrastructures and cultures were able to rapidly pivot and focus their tools or skills on preparing better pandemic-relevant data, even though the systems themselves were not designed with pandemic data in mind. This is also consistent with Gil-Garcia and Sayogo (Reference Gil-Garcia and Sayogo2016)’s observations that dedicated data-sharing project managers were an essential component of effective inter-agency governmental data sharing.
Some of the barriers highlighted in the previous section are unsurprising and may have relatively straightforward solutions that primarily demand dedicated resourcing to overcome. Indeed, if data standards that “best practices” researchers have been pushing for (such as FAIR data) were implemented more widely, many of the barriers discussed in this paper would not exist. In other cases, however, the combination of social, commercial, ethical, or legal barriers may conflict and require trade-offs, or change may be needed at a more systemic and time-consuming level.
The need for complete datasets and contextual metadata in private records presents a quandary: in contrast to non-private data, federated or linked queries across multiple private or semi-private data sources present a risk of de-anonymisation that is often deemed too high-risk or complicated, although there are initiatives such as trusted research environments that do attempt this. One participant (Section 4.3.2.1) highlighted that end-to-end medical records would provide significant value both to research and to personal medical outcomes, providing context that otherwise would be missing from “the partial observation that we get in medical records.”
Getting as complete a picture as possible of data and its context is imperative if data analysts are to create valid interpretations. Existing literature shows that researchers may be reluctant to share their data unless it is accompanied by contextualised metadata, for fear of their data being misinterpreted (Naudet et al., Reference Naudet, Sakarovitch, Janiaud, Cristea, Fanelli, Moher and John2018; Zhu, Reference Zhu2020; Savage and Vickers, Reference Savage and Vickers2009). Despite this, a recurring theme in many of the data-use barriers (Barrier 3) was that contextual information may be missing from raw data sets, possibly due to a lack of time and resources to create the context (Barrier 5).
5.1. Geopolitical data suppression, or following the science? To be effective, the policy must be informed by evidence
An oft-cited pandemic mantra is “follow the science.” This assertion perhaps lacks the nuance to recognise that many scientific pandemic responses will require trade-offs between different pressures and needs, but one thing is clear nevertheless: without transparent and effective sharing of pandemic-related data, it becomes highly challenging to follow the science, as science will be missing a significant part of the picture.
Where data are provided by government bodies or by extensions of government bodies (such as national government-sponsored healthcare and schooling), data sharing and transparency can become political. Multiple participants, from multiple different countries, reported challenges, redactions, and inconsistencies around governmental data sharing in their region, and speculated upon multiple different reasons. It is usually unclear whether these gaps were accidental, perhaps due to human throughput challenges (Barrier 5), or whether they were part of small-scale or concerted government efforts to obscure or ignore the effects of the pandemic upon hospitals, schools, and other infrastructure. More than one participant talked about government-agency-generated or scientific data they had seen with their own eyes which painted a grim picture of COVID-19 transmission or evolution, and which was never published, was published in an aggregate form that obfuscated serious problems, or later disappeared without explanation.
As long as short-term political reputation is prioritised over transparency, governmental and healthcare responses to the pandemic will be insufficient, resulting in preventable death and chronic illness - a terrible humanitarian disaster and long-term economic burden that could otherwise have been avoidable.
5.2. A more severe crisis for those who have the least: Equity in pandemic data and pandemic response
The COVID-19 pandemic has also magnified inequity amongst the most disadvantaged (Alon et al., Reference Alon, Doepke, Olmstead-Rumsey and Tertilt2020; Pilecco et al., Reference Pilecco, Leite, Góes, Diele-Viegas and Aquino2020; Dron et al., Reference Dron, Kalatharan, Gupta, Haggstrom, Zariffa, Morris, Arora and Park2022). Multiple participants highlighted that datasets may not equitably represent populations (biased data), and/or may not have sufficient data to be able to tell if different populations are experiencing equitable outcomes (potentially biased data, with no easy way to tell if the bias is present or not). This once again underscores the importance of this study’s recurring theme of needing contextual data and metadata around pandemic data, such as socioeconomic status, gender, race/ethnicity, and so forth when designing human-centric systems for epidemic and healthcare data.
Participants in the study reported that open data - even if lower quality - is more valuable to them than data that had restrictions on access, use, or re-sharing. Chen et al. (Reference Chen, Pierson, Rose, Joshi, Ferryman and Ghassemi2021) highlighted that opportunistic use of non-representative data often serves to exaggerate and reinforce equity biases in data. Whilst some participants in the study were well aware that such “convenience sampling” may result in biases, they had relatively little control to improve this in any way, as the other choice would be not to perform their analysis at all.
5.3. Law as code: Machine-readable legal standards
One of the most unexpected findings of this study was the importance of highly detailed temporal legal metadata when modelling pathogen spread. Sections 4.2.3.5 to 4.2.3.7 highlight that data changes over time, and when geographical governmental boundaries change, knowing when a geographical file was produced will affect how the data are interpreted. Similarly, modelling pathogen spread based on recent and historical evidence requires knowledge of what interpersonal mingling rules were in place at the time and in the days or weeks beforehand. These rules might include whether a lockdown was in place, whether mask-wearing was mandatory, whether small or large groups could legally mingle, and whether work-from-home or school-from-home rules were in force. Spikes and lulls in metrics such as tests, positive cases, hospitalisations, and death rates become hard to explain when the legal behavioural context around the spread is not known.
Whilst the simplest solution to this type of quandary is a human-readable date-based list of temporal law and/or geographical boundary changes, there is also scope for an even more powerful implementation that would better aid computational modellers seeking to interpret and predict pathogen spread: versioned and timestamped machine-readable laws. This combines the need for temporal metadata with best practices for data access and would fulfil the participant wish list for programmatic access to data (Section 4.3.2.2), would be better compliant with FAIR data guidelines, and is supported by previous literature, such as the New Zealand Better Rules for Government Discovery (New Zealand Digital government, 2018), which asserts:
The models developed to create [machine consumable legal] rules can be re-used: to contribute to modelling and measurement of the real impact of the policy/rules on people, economy and government.
Careful implementation of this technique could be immensely powerful, enabling computational spread modellers and policymakers to collaborate on powerful computational pathogen spread models. These models could effectively explain and understand previous pandemic data whilst also predicting the effects of different pathogen-related interpersonal mingling laws by updating the legal parameters in a test environment and allowing this to feed into existing computational models automatically. Rapid feedback loops between existing data and potential outcomes will save lives, reduce long-term disability, and reduce undesirable economic consequences.
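To make the idea of versioned, timestamped, machine-readable rules concrete, the following is a minimal sketch under stated assumptions: the region name, rule names, values, and dates are invented, and the data structure is one possible design rather than an existing legal standard.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Rule:
    region: str
    name: str
    value: object
    start: date                 # first day the rule applies
    end: Optional[date] = None  # last day it applies; None means still in force


# Invented example rules, not real legislation.
RULES = [
    Rule("region-x", "max_gathering_size", 6, date(2020, 6, 1), date(2020, 9, 30)),
    Rule("region-x", "max_gathering_size", 2, date(2020, 10, 1), date(2020, 12, 31)),
    Rule("region-x", "masks_mandatory", True, date(2020, 7, 1)),
]


def rules_in_force(region, on):
    """Return {rule name: value} for every rule applying in a region on a date."""
    return {
        r.name: r.value
        for r in RULES
        if r.region == region and r.start <= on and (r.end is None or on <= r.end)
    }


print(rules_in_force("region-x", date(2020, 8, 15)))
```

A spread model could call `rules_in_force` for each day of a case time series, or swap in a modified rule list in a test environment to explore the effect of a proposed change.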
6. Conclusion and future work
Whilst reducing the barriers discussed here would likely make future pandemic responses more effective, the COVID-19 pandemic highlighted issues that already existed. By and large, it seems unlikely that creating more robust data-sharing systems, and computer-readable legal codes, The barriers we discussed throughout this paper are known and supported by existing data sharing and data standards literature and calls for reform; this paper highlights how they apply to the specific crises of pandemics and epidemics. In particular, the need for machine-readable temporal context for legal movement restrictions and geographical data files represents a new and meaningful contribution to existing knowledge. Similarly, it is important to draw attention to the fact that data are still hidden for the sake of political gain, even in a global and long-term humanitarian crisis.
Figure 3 shows a summary of the barrier types and their sequential and cumulative nature over the life cycle of a research project.
6.1. Planning for better responses in the future
A famous adage about investing in the future is that the best time to plant a tree was twenty years ago; the second-best time is now. Many of the barriers participants reported might have been mitigated or reduced, had measures around data sharing and data access been implemented before a public health crisis put significant pressure on our existing healthcare, health surveillance, and research systems. Whilst we cannot collectively go back and put better data-sharing systems in place, we can use the lessons from this pandemic to prepare better for future waves and other similar crises.
Data-finding, access, use, sharing, and re-sharing barriers may occur sequentially rather than concurrently, delaying useful responses to urgent health emergencies. Data access requests and grant applications typically turn around in weeks or months. If a pandemic modeller must first apply for a grant to purchase access to data, and only later request access to the data if funding is approved, one or many pandemic waves may pass before researchers can create realistic prediction models and recommend mitigation actions.
Governments and research institutes worldwide are recognising the need for better data-related infrastructure. Tooling and data standards such as Global.health have been developed to ingest pandemic-relevant data, and are capable of reading disparate sources of varying quality, ranging from PDF to API-based input (Kraemer et al., Reference Kraemer, Scarpino, Marivate, Gutierrez, Xu, Lee, Hawkins, Rivers, Pigott, Katz and Brownstein2021). Recent examples of this are the FAIR COVID-19 data recommendations for a coordinated response by Maxwell et al. (Reference Maxwell2022), and the Goldacre review (Morley et al., Reference Morley, Goldacre and Hamilton2022), which cites cultural, funding, technical, and privacy-aware recommendations for creating more effective healthcare-related data infrastructure. Implementing report recommendations such as these would dismantle many of the barriers found in this research, and improve the lot of healthcare research and outcomes significantly.
This study was carried out during the depths of lockdowns and early COVID-19 vaccine rollouts. As of late 2024, transmission tracking has largely ceased, and mitigation measures have largely disappeared globally. The vaccine prevents the most severe illness, but immunity wanes swiftly, and the virus mutates swiftly enough that some individuals have had two, three, or more cases of the virus (Collaboratory, 2022). All observed infection cases - even ones where the acute phase is “mild” - result in invisible cardiovascular damage, even in previously healthy people, and repeated infections increase the risk of the long-term consequences known as “Long COVID” (Davis et al., Reference Davis, McCorkell, Vogel and Topol2023). COVID-19 is simultaneously a serious public health crisis, likely to increase the levels of long-term disability, and yet is being ignored by governments, health policy, and the populace as “over.”
All this suggests that worldwide, nations do not attempt to “follow the science,” as some may have previously claimed. Further research in the area of systemic barriers and data quality control might be useful. Given that we already have a large amount of scientific evidence that is not being acted upon, however, we would argue that the most effective future work here might be policy work. Research on real-world problems seems to be of little point if the knowledge is not used to create real-world solutions.
The human cost of COVID-19 so far has been staggering. If we wish to ensure that we are in a position to mitigate future epidemics and pandemics, we must collectively dismantle systemic barriers, and build policies which not only enable efficient re-use of non-private data but also ensure the data itself is intentionally designed to be accessed computationally with a minimum of human input, reducing costly errors and deadly bottlenecks to informed decision making.
Data availability statement
Data for this study are stored on the University of Manchester infrastructure, but are not available or deposited publicly in order to protect the privacy of the individuals who were interviewed.
Acknowledgements
This paper has been deposited on arXiv as a preprint, DOI: https://doi.org/10.48550/arXiv.2205.12098. A preliminary version of this research was presented at Data for Policy 2021 as an extended abstract, DOI: https://doi.org/10.5281/zenodo.5234417. Yo Yehudi would like to thank Kirstie Whitaker for study design and ethics support, and the open science, open data, and OLS community in particular for helping to spread the word and recruit for the study.
Author contribution
Y.Y. designed the study, gathered and analysed the data, wrote the first draft, and approved the final version of the manuscript. L.H.N. analysed the data, revised the manuscript and approved the final version. C.J. and C.G. supervised ideation and study design, revised the manuscript and approved the final version.
Funding statement
This work was supported in part by grant EP/S021779/1.
Competing interest
None.
Ethical standard
The research was approved by the University of Manchester and meets all ethical guidelines, including adherence to the legal requirements of the study country.