
Know Thy Error

Managing Principal – Data Science
Valtech UK

January 10, 2019

Understanding the uncertainty (“error”) on a measurement is just as important as the measurement itself.

Measuring things, which involves acquiring data, is a fundamental part of the way human beings understand how things work, and consequently our advancement as a human race.

However, every measurement carries some amount of uncertainty. Seeing a measurement and its uncertainty together is essential for distinguishing between different ideas about how things work, and subsequently for making better decisions, right down to deciding what steps to take next. It also influences how you build the systems that capture data, so that uncertainties can be estimated at all. Failing to see the uncertainty in a measurement will inevitably lead to poor decisions, and a much more costly path to reaching the truth about the things that matter to you.

Humans have gathered data for millennia, and our curiosity has been rewarded with technological advances based on the discoveries we have made about the way things work. The discovery of the electron in the late 1800s, and the measurement of its properties and behaviour, for example, is so important to our modern-day lives that we can even recognise the name of the particle in the words 'electronics' and 'electricity'. Curiosity, followed by measurement using data, leads to discovery. This in turn leads to innovation.

Nowadays the analysis of data isn't just in the hands of specialist research groups and scientific establishments. All kinds of organisations and businesses collect large amounts of data, but many are only just starting to realise the need for, and value of, a scientific process around data, and to arrange their organisations and processes accordingly. A crucial part of doing science with data is understanding the uncertainty on what we measure.

There are broadly two kinds of contributions to uncertainty when measuring something. The first is statistical uncertainty, also known as sampling error. It has a lot to do with the amount of data being used. The more data you have, the more precisely you can measure something. The second is systematic uncertainty, also known as non-sampling error. This is to do with things that you might be systematically getting wrong in a measurement. The word 'error' here is really used to mean uncertainty, rather than a mistake.

Very broadly speaking, sampling error decreases in proportion to one over the square root of the number of data points in a data set. This is a gross generalisation, but it serves to demonstrate that to halve the sampling error and be twice as precise, doubling the size of the dataset is not enough: you have to quadruple it! If you want a quarter of your current sampling uncertainty, you need sixteen times as much data! If you're gathering data on a website, for example, and the rate of data coming in stays the same, this translates to time - how long you have to gather data to get greater precision.
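The square-root rule can be sketched in a few lines of Python. This is a minimal illustration rather than a general recipe: it assumes the quantity being measured is a simple average of independent observations, and the spread `sigma` and starting sample size `base_n` are made-up numbers.

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Sampling error of an average of n independent observations."""
    return sigma / math.sqrt(n)


sigma = 2.0      # hypothetical spread of a single observation
base_n = 1_000   # hypothetical starting sample size
base = standard_error(sigma, base_n)

# Compare the sampling error as the dataset grows.
for factor in (1, 2, 4, 16):
    ratio = standard_error(sigma, base_n * factor) / base
    print(f"{factor:>2}x the data -> {ratio:.3f} of the original sampling error")
```

Doubling the data only brings the sampling error down to about 0.707 of where it started; it takes four times the data to reach 0.5, and sixteen times to reach 0.25.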

Systematic error is a lot more challenging, because it's to do with systematically getting measurements wrong, often because we have not realised there is a problem. Hiring a builder whose faulty tape measure mis-measures 1 metre as 99 centimetres could be a costly disaster. Simple sense-checks can save the day, but only if you think to make them. Your wits and curiosity to check things are vital.

Many real-life measurements of critical quantities are more complex, because they chain together many mathematical steps to produce a final measurement. This could be anything from measuring the mass of the Higgs boson at the Large Hadron Collider, through to getting the time-to-collision (TTC) of a fully automated vehicle like a driverless car, or a country producing an estimate of a national statistic like GDP. Each little step in the workflow will have its own sampling and non-sampling contribution to the total final error, and these contributions combine across the workflow in complex ways. The trick, and the role of an experienced scientist, is to understand which piece dominates the final uncertainty of a measurement, and to use that knowledge to focus effort where it is needed most rather than wasting it on inconsequential contributions. This is enabled by having the right computational framework and an organisational structure that gives a clear end-to-end view of the measurement.
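A simple way to see which piece dominates is to write down an uncertainty budget. The sketch below assumes the contributions are independent, so that they combine in quadrature (the square root of the sum of squares); real workflows often have correlated contributions and need more care. The step names and percentages are invented for illustration.

```python
import math

# Hypothetical relative uncertainties (in percent) from each step of a workflow.
budget = {
    "sampling": 1.0,
    "calibration": 3.0,
    "model choice": 0.5,
}


def total_uncertainty(contributions: dict) -> float:
    """Combine independent contributions in quadrature."""
    return math.sqrt(sum(u ** 2 for u in contributions.values()))


total = total_uncertainty(budget)

# List the contributions from largest to smallest, with their share of the variance.
for name, u in sorted(budget.items(), key=lambda item: item[1], reverse=True):
    share = u ** 2 / total ** 2
    print(f"{name:<12} {u:4.1f}%  ({share:.0%} of total variance)")
print(f"{'total':<12} {total:4.2f}%")
```

With these made-up numbers the calibration term accounts for close to 90% of the variance, so halving the sampling error would barely move the total. That is the kind of breakdown that tells you where effort is best spent.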

It can be the case that the uncertainty on a measurement has been reduced to such an extent that reducing it further in the same framework is too costly or impossible. There are two ways forward. Firstly, one can blend in other data sources if they are measuring a similar thing. In this case, the ability to blend the sources effectively comes from a deep understanding of the uncertainties of each data source, and from using those uncertainties and the relationship between the sources to find an optimal mix. Secondly, what is learned from understanding a particular measurement and its error components informs the design of future experiments and measurements.
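One common way to find such a mix, sketched here under stated assumptions, is the minimum-variance linear combination of two estimates. It assumes both sources are unbiased measurements of the same quantity, with known uncertainties `s1` and `s2` and a known correlation `rho` between them. The function name `blend` and the numbers are hypothetical.

```python
import math


def blend(x1: float, s1: float, x2: float, s2: float, rho: float = 0.0):
    """Minimum-variance linear blend of two unbiased estimates of one quantity.

    s1 and s2 are the standard uncertainties of x1 and x2, and rho is the
    correlation between them. Returns the blended value and its uncertainty.
    """
    denom = s1 ** 2 + s2 ** 2 - 2 * rho * s1 * s2
    if denom <= 0:
        raise ValueError("the two sources carry identical information")
    w1 = (s2 ** 2 - rho * s1 * s2) / denom  # weight on the first source
    value = w1 * x1 + (1 - w1) * x2
    variance = s1 ** 2 * s2 ** 2 * (1 - rho ** 2) / denom
    return value, math.sqrt(variance)


# Hypothetical example: a precise source and a less precise one, uncorrelated.
value, sigma = blend(10.0, 1.0, 12.0, 2.0)
print(f"blend: {value:.2f} +/- {sigma:.2f}")
```

With no correlation this reduces to weighting each source by one over its variance. Here the more precise source gets 80% of the weight, and the blended uncertainty (about 0.89) is smaller than that of either source alone.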

Seeing measurements change their quality as time passes is also an interesting way to think of the value of capturing and quantifying uncertainty. In some cases data might be available very quickly but be quite imprecise, whereas more data later on helps to reduce uncertainty. This does not mean that the data available sooner should be thrown away, but rather that they should be given the appropriate weight, because you have properly estimated their uncertainty. When new and more precise data land, they can be combined with the existing information to reduce the uncertainty. All the data are useful; it's just that the quality of the measurement has improved over time.

This might all seem rather abstract. However, one area where all of these principles are applied is the creation and running of investment portfolios. An optimised portfolio has to account for the risks (uncertainties) of putting certain amounts of money in each investment or stock, whilst maximising the expected return on the portfolio. So important is the balance of return to risk that the ratio of return (strictly, return in excess of a risk-free rate) to risk in this setting is known as the Sharpe Ratio, and the aim is to maximise it.
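As a rough sketch of the idea, the Sharpe Ratio of a series of returns can be computed as the average return in excess of a risk-free rate, divided by the standard deviation of those excess returns. The returns below are invented, the risk-free rate is assumed to be zero, and the result is per period rather than annualised. This only computes the ratio for one series; it does not optimise a portfolio.

```python
import statistics


def sharpe_ratio(returns, risk_free: float = 0.0) -> float:
    """Mean excess return divided by the standard deviation of excess returns."""
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess)


# Hypothetical per-period returns of a portfolio.
portfolio_returns = [0.02, -0.01, 0.03, 0.01, 0.00]
ratio = sharpe_ratio(portfolio_returns)
print(f"Sharpe ratio per period: {ratio:.2f}")
```

Here the average return is 1% per period against a spread of about 1.6%, giving a ratio of about 0.63. A portfolio with the same average return but less variation would score higher.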

Uncertainty can be an unnatural thing to think about in our daily lives. We want things to be certain, in order to know how to direct ourselves through life. When things are uncertain, we have very different psychological reactions depending on whether we imagine being on the winning or the losing side of the eventual outcome. Both of these factors come into play when we are presented with an apparent story that comes from a dataset.

What can you do with uncertainty to give your decisions and thinking more clarity?

Firstly, when you are presented with measurements, ask for that uncertainty to be estimated, and visualised, where appropriate. You have a right to know the limitations of the results you are seeing. If the measurement is complex and has many contributions to the final error, what does the breakdown of uncertainties look like? That will help to prioritise which uncertainties to reduce, and how to deploy valuable resources.

Secondly, hire people from data-driven scientific disciplines who know how to calculate and communicate uncertainties from their analyses of data. Discussing measurement uncertainty when you are hiring people with data science skills is a key indicator of how well each side engages the other on the topic. Having people who can estimate and communicate uncertainty in measurements is a major asset.

Finally, on a personal level, embrace uncertainty. In the words of Richard Feynman, one of my scientific heroes, “I can live with doubt and uncertainty and not knowing. I think it is much more interesting to live not knowing than to have answers that might be wrong. If we will only allow that, as we progress, we remain unsure, we will leave opportunities for alternatives. We will not become enthusiastic for the fact, the knowledge, the absolute truth of the day, but remain always uncertain … In order to make progress, one must leave the door to the unknown ajar.”
