May 3, 2018

Data Quality by Design

By Roger Robson

Accolade, like many companies, runs on data. We recognize that, to provide our clients with the best possible experience and to deliver the best possible care, our Health Assistants need access to high-quality data.

Recently, we embarked on a journey to completely redesign large parts of our data tier. During that redesign, maintaining high data quality was frequently cited as a requirement. But that raises the question: what exactly is data quality?

Thomas C. Redman defines it nicely in his book, Data Driven: Profiting from Your Most Important Business Asset:

...data is considered high quality data if it is fit for [its] intended uses in operations...

In other words, data quality is simply a way of assuring that the data in your organization is fit for its intended purpose. This fitness can have many dimensions, such as accuracy, precision, completeness, consistency, integrity, organization, relevance, timeliness, etc. Not all of these may be of equal importance in your application, and indeed many of these dimensions are not mutually orthogonal.

At Accolade, our data quality requirements tend to fall into three broad categories:

1. Correctness

Clearly, the primary data quality requirement is that the data be correct. Theoretically, correct should mean matching the real world, but in practice, we rarely have direct access to the real world for validation purposes. Our data tends to be reported to us, either via our clients and Health Assistants, or through our data partners, so we typically address correctness by indirect means, such as validation. Validation is really just logical rules that correct data (presumably) never violates, such as “a ZIP code is always 5 or 9 digits”, or “the sum of claim details must equal the claim total”.
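To make that concrete, here is a minimal sketch of what such rules might look like in Python; the field names and claim structure are illustrative, not Accolade's actual schema.

    import re

    def valid_zip(zip_code):
        """A ZIP code is always 5 or 9 digits (the 9-digit form may include a hyphen)."""
        return re.fullmatch(r"\d{5}(-?\d{4})?", zip_code) is not None

    def valid_claim(claim):
        """The sum of claim details must equal the claim total (compared in cents to avoid float error)."""
        detail_sum = sum(round(d["amount"] * 100) for d in claim["details"])
        return detail_sum == round(claim["total"] * 100)

    # Example usage with made-up data
    claim = {"total": 125.00, "details": [{"amount": 100.00}, {"amount": 25.00}]}
    assert valid_zip("19103") and valid_zip("19103-1234")
    assert valid_claim(claim)

Rules like these are cheap enough to run on every record, which is exactly why validation is the first line of defense.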

Sometimes, like when we're migrating from an old system to its replacement, we can specify correctness in a more direct way, for example, “the sum of each customer's claims in the old system must equal the sum of their claims in the new system”. However, once the migration is complete, such requirements are usually retired.
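For illustration, a reconciliation check of that kind might look like the sketch below; the use of pandas and the column names are assumptions, not a description of our actual migration tooling.

    import pandas as pd

    # Hypothetical extracts: one row per claim, with a customer id and an amount.
    old_claims = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [100.0, 50.0, 75.0]})
    new_claims = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [100.0, 50.0, 75.0]})

    # Sum each customer's claims in both systems and compare.
    old_totals = old_claims.groupby("customer_id")["amount"].sum()
    new_totals = new_claims.groupby("customer_id")["amount"].sum()

    # Customers missing from one side show up as a non-zero difference.
    diff = old_totals.subtract(new_totals, fill_value=0.0)
    mismatches = diff[diff != 0]
    if not mismatches.empty:
        print("Customers whose claim totals differ between systems:")
        print(mismatches)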

We can also apply statistics to help ensure correctness, flagging any statistically significant variation in our internal measurements. Of course, the number of possible measurements is effectively unlimited, so choosing which ones to watch is a bit of an art form. Essentially, we need to look at the common error modes of the past and watch the measurements that best predicted them.
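As a rough sketch, one lightweight way to flag such variation is to compare today's value of a measurement against its recent history with a simple z-score; the threshold, window, and numbers below are placeholders.

    from statistics import mean, stdev

    def is_anomalous(history, today, z_threshold=3.0):
        """Flag today's measurement if it is more than z_threshold standard
        deviations away from the mean of the recent history."""
        if len(history) < 2:
            return False  # not enough history to judge
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            return today != mu
        return abs(today - mu) / sigma > z_threshold

    # e.g. the daily count of claim rows ingested over the last few days
    daily_row_counts = [10230, 10105, 9980, 10340, 10150]
    if is_anomalous(daily_row_counts, today=4200):
        print("Row count is statistically unusual -- raise a data quality alert")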

Finally, for any data which Accolade itself creates, an important correctness requirement is enabling data stewardship, i.e. the ability to edit created data. This would apply, for example, to data mastering, the application of heuristic rules, or even manual data entry. In practice, since almost all data can be altered at some level, this is really about lowering the cost barrier of data correction by building stewardship interfaces and making sure that feedback channels work smoothly.

2. Organization

It is not usually enough that data be correct - it must also come in a convenient form. The most typical organizational requirements are around having exactly one of each piece of data (i.e., no duplicates, no gaps, and internal consistency), and having it in the right coding systems (e.g., translating from partner-specific codes to Accolade codes). For Accolade, an important consideration in code translation is deciding what to do when a partner sends a novel code - should it be treated as an error, flagged for review, or simply accepted?
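The sketch below shows one way that policy choice might be expressed in code; the mapping table and policy names are invented for illustration.

    # Illustrative partner-to-internal code mapping -- not a real code set.
    PARTNER_TO_INTERNAL = {"A1": "ACC-001", "A2": "ACC-002"}

    def translate_code(partner_code, on_unknown="flag"):
        """Translate a partner-specific code to an internal code.

        on_unknown controls what happens when a partner sends a novel code:
          "error"  -> raise, stopping the load
          "flag"   -> pass the raw code through, marked for later review
          "accept" -> pass the raw code through silently
        """
        if partner_code in PARTNER_TO_INTERNAL:
            return PARTNER_TO_INTERNAL[partner_code], False
        if on_unknown == "error":
            raise ValueError("Unknown partner code: " + partner_code)
        return partner_code, on_unknown == "flag"  # (code, needs_review)

    code, needs_review = translate_code("ZZ9", on_unknown="flag")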

3. Timeliness

Perhaps the least obvious data quality requirement is timeliness. Does the data usually arrive soon enough to be useful? Faster is always better, of course, but speed costs real-world money, and different applications have different requirements.
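A timeliness check can be as simple as comparing a feed's most recent arrival time against an agreed freshness window; the 24-hour SLA below is a placeholder, since the right number depends on the application.

    from datetime import datetime, timedelta, timezone

    # Placeholder SLA: assume this feed should land within 24 hours.
    FRESHNESS_SLA = timedelta(hours=24)

    def is_stale(last_arrival, now=None):
        """Return True if the feed's newest data arrived longer ago than the SLA allows."""
        now = now or datetime.now(timezone.utc)
        return now - last_arrival > FRESHNESS_SLA

    last_file = datetime(2018, 5, 1, 6, 0, tzinfo=timezone.utc)
    if is_stale(last_file, now=datetime(2018, 5, 3, 12, 0, tzinfo=timezone.utc)):
        print("Feed is past its freshness SLA -- raise a timeliness alert")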

Another important aspect of timeliness is how we handle errors and alerts. The more data quality checks we have, the more alerts will be generated, and at some point we need to stop the data flow and get some eyes on the problem. This becomes a tradeoff between quality and speed. We can't simply stop processing at every outlier, because the world throws us far too many outliers, and our data ingestion would quickly grind to a halt. Instead, we have to make informed choices about how many alerts to raise and when to stop processing. If we can't find a happy medium between those two, the next step is to invest in better tooling for investigating problems - after all, it is better to invest in lowering MTTR (mean time to repair) than in raising MTBF (mean time between failures).
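One way to frame that tradeoff is a simple policy that separates "raise an alert and keep going" from "halt the load"; the thresholds below are placeholders, not values we actually use.

    # Illustrative alerting policy -- thresholds are placeholders.
    ALERT_THRESHOLD = 0.01   # alert if more than 1% of rows fail a check
    HALT_THRESHOLD = 0.10    # stop the load entirely if more than 10% fail

    def decide(total_rows, failed_rows):
        """Decide whether a batch should continue, continue with an alert, or halt."""
        failure_rate = failed_rows / total_rows if total_rows else 1.0
        if failure_rate > HALT_THRESHOLD:
            return "halt"      # too many outliers -- stop and get eyes on the problem
        if failure_rate > ALERT_THRESHOLD:
            return "alert"     # keep the data flowing, but flag it for review
        return "continue"

    print(decide(total_rows=100000, failed_rows=250))   # "continue" -- 0.25% failure rate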

Implementation

We are implementing this data quality design approach iteratively, starting at the UI level. Each Agile team is asked to first consider what aspects of data quality are important for their application or service, using the above framework. Various assurance measures are considered, and, through a lean approach, the most useful ones are implemented. Often, the best way to address a data quality requirement is to pass it on to the next team “upstream”. There is a huge dividend to doing so - the further upstream data quality is assured, the more downstream systems benefit.

Ultimately, data quality can neither be bolted on as an afterthought, nor assigned to a single team to implement across the organization (although we do have specialists building data quality tooling). It must be addressed by design, and become part of every Agile team's responsibility, just like any other quality requirement.