by Jeff Evans
3rd August 2021

COVID-19 has left policy makers struggling to understand and to intervene. This article discusses the range of existing data sources, and new ones, that they can call on, including ‘big data’. I also look at changes in the professional groupings involved in the production of such data, and the importance of these changes for wider society.

At first sight, policy makers in the UK have turned largely to regular statistical series, such as the (estimated) number of daily deaths, the current number of hospitalisations and the number of daily cases reported. These are produced by the UK’s Office for National Statistics (ONS), by its counterparts in Scotland, Wales and Northern Ireland, or by other agencies forming part of the state apparatus. Policy makers also use a range of existing social surveys carried out by government or by other agencies, including universities and private-sector research organisations.

Surprisingly, we do not hear a lot about ‘big data’, despite a decade of substantial hype and coverage. Nonetheless, location data produced by Apple, Google and Facebook is one of several important sources for certain analyses, as I shall explore.

There are other sources that may well qualify as big data if we use these three criteria: sheer volume; velocity (speed of being updated); and variety (including semi-structured/unstructured data, such as records of doctor–patient consultations, rather than neatly structured data files more familiar to statisticians). Here we do see a range of non-public providers playing a crucial role in producing data and analyses. OpenSAFELY, for example, is a collaboration between The DataLab at the University of Oxford, the Electronic Health Records Group at the London School of Hygiene and Tropical Medicine, plus several private electronic health record software companies (which already manage many NHS patients’ records), working on behalf of NHS England and the Department of Health and Social Care (DHSC). This platform allows trusted researchers to run large-scale analyses across pseudonymised patient records (where personally identifiable information fields are replaced by one or more artificial identifiers). This is done inside the secure data centres managed by the health record companies. The data set is very large, and since it is being updated in ‘nearly real time’, it probably does qualify as big data in the sense above.
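To make the idea of pseudonymisation concrete, here is a minimal sketch in Python. It is not the method OpenSAFELY or the health record companies use; the field names, the secret key and the keyed-hash approach are all assumptions chosen for illustration. It simply shows identifying fields being replaced by one artificial identifier, so that records for the same person can still be linked.

```python
import hashlib
import hmac

# Hypothetical secret held only by the data controller (an assumption for illustration).
SECRET_KEY = b"example-key-not-for-real-use"

# Hypothetical field names; real health records will be structured differently.
IDENTIFYING_FIELDS = ("nhs_number", "name")


def pseudonymise(record, key=SECRET_KEY):
    """Return a copy of `record` with identifying fields replaced by one artificial ID."""
    # Derive a stable token from the identifying fields using a keyed hash.
    raw = "|".join(str(record[field]) for field in IDENTIFYING_FIELDS).encode("utf-8")
    pseudo_id = hmac.new(key, raw, hashlib.sha256).hexdigest()[:16]
    # Keep only the non-identifying fields, then attach the artificial identifier.
    cleaned = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
    cleaned["pseudo_id"] = pseudo_id
    return cleaned


# Invented example records: two for the same person, one for someone else.
records = [
    {"nhs_number": "0000000001", "name": "A. Patient", "age_band": "60-69", "covid_test": "positive"},
    {"nhs_number": "0000000001", "name": "A. Patient", "age_band": "60-69", "covid_test": "negative"},
    {"nhs_number": "0000000002", "name": "B. Patient", "age_band": "30-39", "covid_test": "negative"},
]
pseudonymised = [pseudonymise(r) for r in records]
```

The first two output records share a `pseudo_id`, so a researcher can follow one patient over time without seeing who that patient is.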

Concurrently, the role of the ONS has expanded, with more resources invested in enhancing its regular series and developing new, purpose-built surveys to help fight the pandemic. An example of the latter is the COVID-19 Infection Survey, a regular survey of COVID-19 infections and antibodies, which gives a periodic measure of the number of cases, independently of those recorded by the Test-and-Trace system. The Test-and-Trace figures depend crucially on the number of tests conducted by the DHSC.

In ‘expanded’ data production at research organisations, the REACT projects, funded by the DHSC, are large-scale surveys developed and run by Imperial College London and the polling firm Ipsos MORI. REACT-2 regularly uses finger-prick (lateral flow) antibody tests on over 100,000 volunteers in England to assess whether they have already had COVID-19 and have antibodies in the blood. REACT-1 regularly selects a random sample of around 100,000 people and sends a self-administered swab (PCR) test to those who volunteer, in order to estimate the number currently infected, and hence the number of people an infected person will pass the virus on to (‘R’). This is not really ‘big data’ in the sense specified above, but these are relatively large samples for academic institutions to study.
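As a rough illustration of how a random sample of this size yields an estimate of the proportion currently infected, the sketch below computes a prevalence with a simple 95% interval. The counts are invented, and the plain normal-approximation interval is an assumption for illustration; the REACT studies' own analyses will be more sophisticated (for example, in adjusting for who responds).

```python
import math


def prevalence_estimate(positives, sample_size, z=1.96):
    """Estimate prevalence with a normal-approximation confidence interval."""
    p = positives / sample_size
    # Standard error of a proportion from a simple random sample.
    se = math.sqrt(p * (1 - p) / sample_size)
    lower = max(0.0, p - z * se)
    upper = min(1.0, p + z * se)
    return p, lower, upper


# Hypothetical figures: 500 positive swabs among 100,000 returned tests.
p, lower, upper = prevalence_estimate(positives=500, sample_size=100_000)
print(f"Estimated prevalence: {p:.2%} (95% interval {lower:.2%} to {upper:.2%})")
```

With a sample this large the interval is narrow, which is why a well-designed survey of 100,000 people can be informative without being ‘big data’.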

The independent Alan Turing Institute hosts a range of projects and workshops, using data science and artificial intelligence (AI) in many areas. One of these projects supports the Greater London Authority’s (GLA) response to the COVID-19 pandemic. As Goldstein and Gilbert explain in their chapter on data linkage in our book, Data in Society, the project links multiple large-scale and heterogeneous datasets which capture mobility, transportation and traffic activity to better understand ‘busyness’ in London’s networks, and to enable effective policy making and targeted interventions. The approach takes geographically fixed time series data from the road network and estimates (via continuous data production) the dynamics across the capital. The Institute further addresses many aspects of AI use, including ethical ones.

The examples above show how much highly relevant data production and analysis have moved beyond official statistical methods. In terms of professional ‘élite’ expertise, recently discussed by Jenny Ozga, these developments take place in the context of the apparent eclipse of the statistical profession’s near-monopoly of quantitative data use – and the rise of ‘data science’.

One can discern here a struggle to claim expertise in the quantitative data area between classically trained statisticians and a growing group of data scientists. This is cultural – there are differences in skills, attitudes and values. On the one hand, statisticians emphasise general methodological commitments: careful and relatively enduring definitions of measured concepts; design of research – representative sampling and (sometimes) ‘randomised controlled trials’; and reflexivity in considering the effect of the context of data production. On the other hand, data scientists emphasise computing skills: facility with finding and managing multiple datasets with malleable concepts, and effectively real-time updating; use of data mining and AI; lack of interest in effectivity/causality; and the use of ‘haphazard sampling’ – claiming to justify these last two positions by the use of large samples. The difference is also to some extent generational.
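The statisticians' worry about ‘haphazard sampling’ can be illustrated with a small simulation. Everything here is invented for the purpose: the population size, the 5% infection rate, and the assumption that infected people are twice as likely to end up in the haphazard sample. Under those assumptions, a very large haphazard sample gives a worse estimate than a much smaller random one.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: True means infected, with roughly 5% prevalence.
population = [rng.random() < 0.05 for _ in range(200_000)]
true_prevalence = sum(population) / len(population)

# Representative approach: a simple random sample of 2,000 people.
random_sample = rng.sample(population, 2_000)
random_estimate = sum(random_sample) / len(random_sample)

# Haphazard approach: people select themselves, and (by assumption) infected
# people are twice as likely to be included as uninfected people.
haphazard_sample = [
    infected for infected in population
    if rng.random() < (0.6 if infected else 0.3)
]
haphazard_estimate = sum(haphazard_sample) / len(haphazard_sample)

print(f"True prevalence:     {true_prevalence:.3f}")
print(f"Random sample (n={len(random_sample)}): {random_estimate:.3f}")
print(f"Haphazard sample (n={len(haphazard_sample)}): {haphazard_estimate:.3f}")
```

The haphazard sample is many times larger, yet its estimate sits well away from the true figure, because sample size does nothing to correct for who gets included.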

These differences and the struggle between the two approaches are important for several reasons. First, they are self-perpetuating, reflected in the increasing numbers of data science courses at universities and the decreasing take-up of ‘applied statistics’ courses, as Kevin McConway’s chapter on changing statistical work explores. Second, they are reflected in the number and type of jobs created. Third, the increased take-up of data science approaches risks facilitating a trend towards more decision-making on ‘automatic pilot’ (i.e. using artificial intelligence), and/or using algorithms which may include hidden biases (for example, in gender or racial terms). As a shorthand, I think of the statistical approach as broadly deductive and data science as more inductive. The differences recall the recurring methodological disputes in the social sciences over the last 50 years. Overall, I agree with John Naughton’s recent evaluation in The Observer, which points to Google’s ‘crippled epistemology that equates volume of data with improved understanding’. In terms of data, big doesn’t always mean better.

Jeff Evans, Professor Emeritus, Middlesex University

Note: I thank Kevin McConway and other colleagues involved in the production of the book for their suggestions concerning this article.

Data in Society: Challenging Statistics in an Age of Globalisation, edited by Jeff Evans, Sally Ruane and Humphrey Southall, is available on the Policy Press website for £23.99.


The views and opinions expressed on this blog site are solely those of the original blog post authors and other contributors. These views and opinions do not necessarily represent those of the Policy Press and/or any/all contributors to this site.
