A few Big Data and analytics lessons from the COVID-19 pandemic

By Richard Self - 27 May 2020

The current wisdom is that ‘data, analytics and models are how we understand the world’, that ‘the data and models do not lie’ and that ‘we must follow The Science’.

This wisdom has been driving the developments of Big Data, analytics, AI and machine learning for the last twenty years. It is aligned with the insistence by the business science proponents that we must move away from management based on insights derived from intuition to management based on data and analytics. We have seen the development of an almost total trust in computer systems and models over the last three decades.

Trusting data

Computer models and systems are trusted to evaluate our eligibility for insurance and mortgages, check our identity at national borders, identify suspects on a police watchlist and in many other circumstances.

As we all now know, this unfounded trust in computers led to the credit crunch in 2007 and 2008. However, that experience seems to have had little effect on the levels of unthinking trust in computer models.

The last five months with the COVID-19 pandemic have been a particularly rich source of examples and information about the levels of trust we should or should not have in data and models.

Understanding COVID-19 data has been very complicated, because the virus is apparently so different from previous viruses and pandemics. We have a range of data about when it first appeared in China and the rest of the world, and a wide range of figures for the levels of infection and the mortality rates. There are very large differences in the predictions that different models make about the dangers of the virus and the rate and extent of its spread.

Lessons already learned from COVID-19

The question, therefore, arises: are there any lessons from COVID-19 that we can take forward to improve our understanding of the limits of data, models and analytics?

It turns out that there are many lessons that we can draw from just the last few months. The following three are possibly the most important.

No single ‘science’

In terms of the impact on government policies around the world, we now know that there is no single ‘science’, as implied by politicians, that can be followed. In the UK, the original Imperial College model, which predicted dire levels of deaths and led to the lockdown policies, contrasts significantly with the model from the Oxford group led by Professor Sunetra Gupta. These are two different models, each resting on many assumptions that had to be made for lack of data. In reality, there are many other models, and they all disagree with each other. Politicians have to use these sets of data as input to their political decision making. They do not have a single version of the truth called ‘The Science’.

Raw data

We have learned that even the raw data being used cannot be relied upon. If we look at the Johns Hopkins University COVID-19 site and the Worldometer website, we will see some variations in the data being presented. Worse than that, the definitions of the data vary to the extent that we cannot rely on them to draw valid comparisons between countries. In particular, the quoted number of people catching the virus is, in fact, the number of people who have tested positive for it. These numbers largely reflect the number of tests actually performed and do not tell us how widespread the virus is in the population, which is a vital datum in understanding the progress of the pandemic.
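The dependence of confirmed cases on testing volume can be illustrated with a minimal sketch. It assumes, purely for illustration, that people are tested at random and that the test is perfectly accurate; the two countries, the prevalence and the test counts are hypothetical and are not taken from any real dataset.

```python
def expected_positive_tests(tests_performed: int, true_prevalence: float) -> float:
    """Expected number of positive tests, assuming people are tested at random
    and the test is perfectly accurate (both simplifying assumptions)."""
    return tests_performed * true_prevalence


# Two hypothetical countries with the same population and the same true
# prevalence; only the number of tests performed differs.
population = 10_000_000
true_prevalence = 0.02  # illustrative value, not an estimate for any real country

true_infections = population * true_prevalence
confirmed_a = expected_positive_tests(50_000, true_prevalence)
confirmed_b = expected_positive_tests(500_000, true_prevalence)

print(f"True infections in each country: {true_infections:,.0f}")
print(f"Confirmed cases, country A (50,000 tests):  {confirmed_a:,.0f}")
print(f"Confirmed cases, country B (500,000 tests): {confirmed_b:,.0f}")
```

Under these assumptions the two countries have identical outbreaks, yet country B reports ten times as many confirmed cases simply because it ran ten times as many tests. Real testing is rarely random, which makes the comparison harder still.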

Data averages

We have learned that averages can be very misleading. This is particularly visible in the reproduction number R, which is being quoted as driving policy as the lockdowns are loosened. A single R value calculated for the whole of the UK does not provide useful public health insights.

We have discovered that the problem is not one large outbreak in the UK but many small, localised outbreaks, each with its own R value resulting from local demographics, age distributions, predisposing factors and so on.

Care homes are clearly one of the critical sub-populations, but even here we cannot treat all care homes as members of a single class; they are often very different.
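A small sketch shows how a national average can hide a growing local outbreak. It assumes that the outbreaks do not mix with one another, so the aggregate R is simply the infection-weighted average of the local values; the outbreak names, infection counts and R values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Outbreak:
    name: str
    current_infections: int
    r_value: float


def aggregate_r(outbreaks: List[Outbreak]) -> float:
    """Infection-weighted average R, assuming the outbreaks do not mix."""
    total = sum(o.current_infections for o in outbreaks)
    next_generation = sum(o.current_infections * o.r_value for o in outbreaks)
    return next_generation / total


# Hypothetical local outbreaks, for illustration only.
outbreaks = [
    Outbreak("City A", 1000, 0.6),
    Outbreak("Region B", 800, 0.7),
    Outbreak("Care home C", 100, 1.8),
]

national_r = aggregate_r(outbreaks)
growing = [o.name for o in outbreaks if o.r_value > 1]

print(f"Aggregate R: {national_r:.2f}")
print(f"Outbreaks still growing: {growing}")
```

With these made-up figures the aggregate R is comfortably below 1, even though the care home outbreak is still growing quickly. The sketch also takes each local R value as given, which is itself a generous assumption.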

However, it turns out that the R value is extremely difficult to measure. Currently it can only be modelled, with all the problems of assumptions that modelling involves. Professor Sunetra Gupta points this out at the end of her very interesting discussion with UnHerd on YouTube.

As outlined, an interesting aspect of the pandemic is the difficulty in using and trusting any of the data that we are able to see. This is something I will be discussing in an upcoming talk in July. Find out more about the event, which will focus on how the future of harnessing data is being transformed by technology.

For further information contact the press office at pressoffice@derby.ac.uk.