QuantiCode: Intelligent infrastructure for quantitative, coded longitudinal data

This cross-disciplinary project aims to develop novel data mining and visualization tools and techniques that will transform people's ability to analyse quantitative and coded longitudinal data. Such data are common in many sectors. For example, health data are classified using a hierarchy of hundreds of thousands of Read Codes (a thesaurus of clinical terms), with analysts needing to provide business intelligence for clinical commissioning decisions, and researchers tackling challenges such as modelling disease risk stratification.
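To make the nature of such coded data concrete, the sketch below shows one way coded clinical events might be rolled up a code hierarchy so they can be counted at any level of granularity. It is a minimal illustration only: the codes and parent links are invented for the example and are not real Read Codes.

```python
from collections import Counter

# Hypothetical fragment of a clinical code hierarchy: child -> parent.
# These codes are invented for illustration; they are not real Read Codes.
PARENT = {
    "G301.": "G30..",   # e.g. a specific infarction -> acute MI
    "G30..": "G3...",   # acute MI -> ischaemic heart disease
    "G3...": "G....",   # ischaemic heart disease -> circulatory system
}

def ancestors(code):
    """Yield a code and every ancestor above it in the hierarchy."""
    while code is not None:
        yield code
        code = PARENT.get(code)

def roll_up(event_codes):
    """Count coded events at every level of the hierarchy, so an
    analyst can query at whatever granularity suits the question."""
    counts = Counter()
    for code in event_codes:
        counts.update(ancestors(code))
    return counts

# A longitudinal record: coded events for one patient over time.
events = ["G301.", "G301.", "G30.."]
print(roll_up(events))
# Counter({'G30..': 3, 'G3...': 3, 'G....': 3, 'G301.': 2})
```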

Retailers such as Sainsbury's sell 50,000+ types of products, and want to combine data from purchasing, demographic and other sources to understand behavioural phenomena such as the convenience culture, to guide investment and reduce waste. To meet these needs, public and private sector organisations require an infrastructure that provides far more powerful analytical tools than are available today.

Today's analysis tools are deficient because they (a) provide only crude means of assessing data quality, (b) often involve analysis techniques that are designed to operate on aggregated, rather than fine-grained, data, and (c) are often laborious to use, which inhibits users from discovering important patterns.
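As a toy illustration of one facet of data quality that today's tools surface only crudely, the sketch below profiles field completeness across a set of records. The record layout and missing-value conventions are assumptions made for the example, not a description of any partner's data.

```python
def completeness(records, fields):
    """Fraction of records with a usable (non-missing) value per field --
    one simple data-quality facet alongside concordance and plausibility."""
    report = {}
    for field in fields:
        present = sum(
            1 for r in records
            if r.get(field) not in (None, "", "NA")  # assumed missing markers
        )
        report[field] = present / len(records)
    return report

records = [
    {"id": 1, "dob": "1970-03-01", "postcode": "LS2 9JT"},
    {"id": 2, "dob": None,         "postcode": "LS6 1AN"},
    {"id": 3, "dob": "1985-11-20", "postcode": ""},
]
print(completeness(records, ["dob", "postcode"]))
# {'dob': 0.6666..., 'postcode': 0.6666...}
```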

The QuantiCode project will address these deficiencies by bringing together experts in statistics, modelling, visualization, user evaluation and ethics. The project will be based in the Leeds Institute for Data Analytics (LIDA), which houses the ESRC Consumer Data Research Centre (£5m, ES/L011891/1) and the MRC Medical Bioinformatics Centre (£7m), and provides development facilities complete with high-performance computing (HPC), visualization and safe rooms for sensitive data.

Our project will deliver proof-of-concept visual analytics systems, which we will evaluate with a wide variety of users drawn from our partners and from researchers/external users based in LIDA. At the outset of the project we will engage with our partners to identify analysis use cases and requirements that drive the details of our research, which is divided into four workpackages (WPs).

WP1 (Data Fusion) will develop governance principles for the analysis of fine-grained data from multiple sources, implement tools that substantially reduce the effort of linking those sources, and develop new techniques to visualize completeness, concordance, plausibility and other aspects of data quality.

WP2 (Analytical Techniques) and WP3 (Abstraction Models) are the project's technical core. WP2 will deliver a new, robust approach for modelling data as they occur naturally in health and retail settings (irregularly dispersed or sampled over time; see the sketch after this overview), scale that approach with stochastic control to guide learning and resource usage, and develop a low-effort 'question-posing' visual interface that drastically lowers the human effort of investigating data and finding patterns. WP3 focuses on data granularity, and will deliver a tool that implements a working version of the governance principles developed in WP1, together with new computational and interactive techniques for exploring abstraction spaces to create inputs suited to each aspect of analysis.

WP4 will implement the above tools and techniques in three versions of our proof-of-concept system, evaluating each with our partners and LIDA researchers/users. This will ensure that our solutions are compatible with, and scale to, challenging real-world data analysis problems.
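The sketch below illustrates the representational issue WP2 targets: events (clinical readings, purchases) arrive at irregular times, so analysis must not presume a regular sampling grid. It simply groups irregular events into windows of an analyst-chosen width; the function and data are invented for the example and do not describe WP2's actual modelling approach.

```python
from datetime import datetime, timedelta
from collections import defaultdict

def bucket(events, start, width):
    """Group irregularly timed (timestamp, value) events into fixed windows
    of the analyst's chosen width, without assuming regular sampling."""
    windows = defaultdict(list)
    for when, value in events:
        index = int((when - start) / width)  # timedelta division -> float
        windows[index].append(value)
    return {start + i * width: vals for i, vals in sorted(windows.items())}

# Irregularly spaced observations, e.g. blood-pressure readings or purchases.
events = [
    (datetime(2016, 1, 3), 120),
    (datetime(2016, 1, 4), 118),
    (datetime(2016, 2, 19), 135),
]
for window_start, values in bucket(events, datetime(2016, 1, 1),
                                   timedelta(days=30)).items():
    print(window_start.date(), values)
# 2016-01-01 [120, 118]
# 2016-01-31 [135]
```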

Success criteria will be time saved, increased analysis scope, notable insights, and the tackling of previously infeasible types of analysis, all compared against a baseline provided by users' current analysis tools. We will encourage adoption via showcases, workshops and licensed installations at our partners' sites. The project's legacy will include tools embedded as an integral part of the LIDA infrastructure, a plan for their ongoing development, and a research roadmap.