My NHS – The Data Science View

September 14, 2016

Want to know what it was like for our data scientist on a 2-week open data project? Mikayel Mirzoyan gives us a short summary of his experiences with My NHS.

Being a data scientist, I will always jump at the chance to get my hands on real data and have a play around. The My NHS project gave me exactly this opportunity with healthcare data – not always the easiest to get hold of! – and so I was really excited to be part of the team.

We kicked off the project with a backup of the My NHS database as our main source of data. Among other things, it contained information about how each organisation (CCG, trust, hospital, etc.) performed across many different metrics. For hospitals, which were the main focus of our discovery project, typical metrics included mortality rate, infection control and cleanliness, and the percentage of patients waiting less than 18 weeks.

First, we wanted to explore the relationships between different metrics. For example, we wanted to see whether there was a higher mortality rate in hospitals that performed poorly on infection control and cleanliness. Of course, correlation doesn't necessarily imply a causal relationship between the two metrics – it could just be coincidental, or there might be a third factor that influences both of them.

Unfortunately, not all the hospitals had values for each metric. In fact, the vast majority of metrics didn’t cover even half the hospitals. Working closely with the My NHS team, we selected about two dozen metrics that were of high interest. While a couple of those happened to be among the better-populated ones, most weren't.

What we could easily have done is create a correlation matrix for all those metrics and pick out the most strongly correlated pairs. But with that approach, we would have been even more concerned about coincidental correlations: the more pairs you test, the more likely some will look correlated purely by chance. Instead, we preferred a hypothesis-driven approach. We created a dashboard tool in R Shiny that allows anyone to select two metrics and see a scatterplot of the values with the corresponding correlation coefficient. As a result of user testing, we added the ability to see a hospital's name by hovering the mouse cursor over its point on the scatterplot, and to search for and highlight a particular hospital on the screen.
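To make the calculation behind each scatterplot concrete, here is a minimal sketch of a Pearson correlation coefficient computed only over hospitals that have values for both selected metrics. It is written in Python rather than the R Shiny the dashboard actually used, the function name `pearson_complete` is hypothetical, and the numbers are made up for illustration. The post does not say which correlation coefficient or missing-data rule the dashboard applied, so both choices here are assumptions.

```python
from math import sqrt


def pearson_complete(xs, ys):
    """Return (r, n): Pearson's r over pairs where both values are present.

    Missing values are represented as None. Returns (None, n) when there
    are fewer than three complete pairs or when either metric is constant.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 3:
        return None, n
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return None, n
    return sxy / sqrt(sxx * syy), n


# Made-up values for six hospitals; None marks a metric that was not submitted.
infection_score = [1.0, 2.0, None, 4.0, 5.0, 3.0]
mortality_rate = [2.1, 3.9, 6.0, None, 9.5, 6.4]

r, n = pearson_complete(infection_score, mortality_rate)
print(f"r = {r:.2f} over {n} hospitals with both metrics")
```

Reporting `n` alongside `r` matters with data this sparse: a coefficient computed from a handful of hospitals deserves much less weight than one computed from most of them.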

In addition, with the help of my colleague George Hall, we augmented our data with affluence indicators for the local authority districts where the hospitals are located. As affluence indicators, we used the index of deprivation and the average house price, from DCLG and Land Registry respectively. The correlation between these two was just as one would expect: strong, but not too strong. The reason we wouldn't expect it to be too strong is certain metropolitan districts – as is widely known, some of the highest levels of deprivation are found in cities that also have very high average house prices. We also added data on obesity and on drug- and alcohol-related admissions, but due to lack of interest they didn't make it into the final version of the dashboard.

One of the key findings was that, contrary to our expectations, the affluence of the area where a hospital was located did not correlate with either of the key performance metrics. Since not all hospitals have values for those key metrics, we examined the possibility that hospitals in less affluent districts are more prone to not having their values submitted. That was not the case: the differences in the mean affluence indicators between the two groups (hospitals with values and hospitals without) were not statistically significant. Thus, the hypothesis that there is a postcode lottery didn't hold water.
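The check on missing values can be sketched as a two-group comparison of means. The post does not say which significance test was used, so the permutation test below is a swapped-in technique chosen because it needs nothing beyond the Python standard library. The function name `permutation_test` and the deprivation scores are hypothetical.

```python
import random
from statistics import fmean


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Repeatedly shuffles the pooled values, splits them into groups of the
    original sizes, and counts how often the shuffled difference in means
    is at least as large as the observed one. Returns an estimated p-value.
    """
    rng = random.Random(seed)
    observed = abs(fmean(group_a) - fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[:n_a]) - fmean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Made-up deprivation scores for the districts of two groups of hospitals.
submitted = [21.4, 18.9, 25.1, 30.2, 17.5, 22.8, 27.3]
not_submitted = [23.0, 19.7, 26.4, 28.8, 20.1, 24.5]

p_value = permutation_test(submitted, not_submitted)
print(f"estimated p-value: {p_value:.3f}")
```

A large p-value here would mean the data give no evidence that hospitals without submitted values sit in systematically more or less deprived districts, which is the conclusion described above.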

Another interesting thing we noticed was that one of the scatterplots looked as if it was composed of two different distributions, which is very unnatural for this type of data. After a bit of analysis we found the reason. When we added a third dimension as colour, we could clearly see that one group was mental health hospitals and the other was acute ones.
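The same idea can be checked numerically as well as visually: split the points by a third variable and summarise each group separately. The sketch below is an illustration only, in Python rather than R Shiny. The function name `summarise_by_type`, the record layout, and the values are all hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def summarise_by_type(records):
    """Group (hospital_type, x, y) records and summarise each group.

    Returns a dict mapping each hospital type to its point count and the
    mean of each metric, which makes separate clusters easy to spot.
    """
    groups = defaultdict(list)
    for hospital_type, x, y in records:
        groups[hospital_type].append((x, y))
    return {
        hospital_type: {
            "n": len(points),
            "mean_x": fmean([p[0] for p in points]),
            "mean_y": fmean([p[1] for p in points]),
        }
        for hospital_type, points in groups.items()
    }


# Made-up points: two metrics per hospital, plus the hospital type.
records = [
    ("acute", 1.0, 2.0),
    ("acute", 3.0, 4.0),
    ("mental health", 10.0, 20.0),
    ("mental health", 12.0, 24.0),
]

for hospital_type, summary in summarise_by_type(records).items():
    print(hospital_type, summary)
```

When the group means sit far apart, as in this made-up data, a correlation computed over all points together mostly reflects the gap between the groups rather than any relationship within them.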

These and other findings were presented at the NHS Expo in Manchester last week, where people could also get hands-on with the dashboard prototype and explore the relationships for themselves. It was a real privilege to contribute to a service so important to the future of the NHS, and it is something we’re all proud of here at Valtech.

Whilst the specific client and project needs, the nature of the data, and the time constraints (we had 2 weeks!) didn't allow us to use any particularly cutting-edge machine learning techniques or fully stretch the data team’s technical capabilities, this was still a fascinating project to be involved in, and it was great to work with the amazing people from the My NHS team. Now that we’ve kicked off the improvement programme, we’re all very interested to see what happens next at My NHS. Who knows – maybe machine learning could be part of phase 2…?