Tuesday, 9 February 2010

A balanced approach to scoring data quality: Part 5

We've previously introduced the scorecard and discussed the 'customer', 'internal processes' and 'financial' perspectives. Now we're going to take a look at the final section of the scorecard, which deals with 'data dimensions'.

The sole reason I held this section back - preferring first to focus on other sections of the scorecard - is that this is the section you will be most familiar with. Profiling exercises have been the de facto standard for measuring data quality for many years. There are a number of tools on the market that will profile data in great detail. If you don't have the budget to buy a tool you could even build your own - Dylan Jones and Rich Murnane give us some great examples - which will allow you to profile your data against a multitude of dimensions of data quality.

What are dimensions of data quality?

When we talk about dimensions of data quality we are talking about aspects of quality such as accuracy, consistency, duplication, referential integrity and confirmity. Some basic practical examples could include:
  • Table 'SALES' should contain only 1 record per 'TRANSACTION_ID'.
  • Field 'AMOUNT' in 'SALES' table should always contain a numeric value
  • Field 'CUSTOMER_NAME' should never be NULL
  • Table 'CUSTOMER' should contain a corresponding record in our 'CUSTOMER_ADDRESSES' table
  • Field 'Post Code' in 'CUSTOMER_ADDRESSES' table adheres to Royal Mail PAF Standards

By profiling data against our data quality dimensions we can ascertain the number of records that are deemed OK, against those that are deemed to contain potential issues to review. However, in order to maximise the benefit of a data profiling exercise we should look towards answering the following question:

How well are our data quality dimensions performing when aligned to business rules and expectations?

Why Align to business rules and expectations?

Primarily because it allows us to add context to data. Context is important due to the fact that it will determine how data is translated by the business, and ultimately, how the data is used. What is deemed to be 'Good Quality Data' may differ between two business users, or two different business departments.

This is because each user/department may use data in a different way, and therefore have different requirements of what is essential to ensure that data is fit for the purpose it is intended.

How to ascertain?

In order to understand rules and expectations that should be placed upon data to ensure that it is of good, and fit, quality, we will need to speak to the business. If you have a data steward network, or have previously identified who the Subject matter experts are within the business, ask them. If you haven't already previously identified the key stakeholders and subject matter experts for business data, this would be a great time to kick start that exercise, and reap instance benefits.
  • Ask the business for their critical data items.
  • Which data items do they need to fulfill their responsibilities?
  • What should these data items look like?
  • Which rules should be applied?
  • When should data be available?

A practical example

The Sales Reporting team provide sales information to the executive team on a weekly basis. Their reporting pack consists of reports such as:
  • Total Volume/Value of Sales
  • Sales Per Store
  • Sales Per Product
  • Top/Bottom Performing Sales Staff

A number of critical data items were identified in order for this reporting pack to be built upon good quality data. Based upon these critical data items we identified a number of rules and expectations that must be met during a data profiling exercise in order for Sales data to be deemed fit for purpose.
  • The sales transactions table should only contain 1 record per 'transaction_id'
  • The sales transactions table should contain data for the previous six months, up to and including the previous working day
  • Each sale must be related to a 'store_id' and 'seller_id' that can be referenced back to our 'stores' and 'sellers' reference data.
  • The 'sales_amount' should never be a negative figure
  • A seller should only ever be attached to one store at a time. If a seller moves to a new store the previous seller/store relationship should be end dated.

If a record adheres to the above rules we can flag it as ‘GREEN’. Meaning that it adheres to business rules and expectations on data performance. Once the recordset has been aggregated we can measure % records that are of ‘Good Quality’ for Sales Reporting and visualise this upon our scorecard.

In Conclusion

The key takeaway point from this section of the scorecard is that we are attempting to benchmark data quality against business expectations. Profiling data quality without taking into account the context in which the data will be used may result in further data quality issues, and misuse of data.

In my next (and final) post of this series we'll go through an example of putting all the sections together into a usable dashboard for the business.

Related Posts

Part 1: Introduction

Part 2: Customer

Part 3: Internal Processes

Part 4: Financial

Part 6: The Dashboard


Dylan Jones said...

Really great post Phil, love where you've gone with this series (and thanks for the plug to the free pattern analysis tool I released!)

It's so important to take a bird's eye view, ask the "So What?" test and really connect those data issues to events that impact the business, profiling stats are a real turn-off to management and knowledge workers so make it simple as you say.

Great post as ever.

Phil Wright said...

Thanks Dylan, and not a problem.

Keep it simple indeed. Management are not going to react unless they understand how it impacts, for instance, their reporting.

Damian Cox said...

Hi Phil,

Great post.

We had to move away from scorecards because we found that percentages for complex data objects are sometimes misleading as it might actually be the remaining 5% for a given relationship that is critical to the object being fit for use.

An approach that we use is the fault tree approach to data quality where basically we try and make a prima facie case for "How Broken Is Broken?". That is, can it be a "bit wrong" and still function or is quality problem a show stopper.

This has been primarily driven by the behaviour of business process owners/custodians that saw headline %'s on our scorecards and wanted to "get on with it" when there were still serious problems in their data that take more than 30second sound bite to explain. Highly frustrating.



Phil Wright said...

Damian, fault trees - good approach.

I agree, % headlines can often cause viewers to think "90% sounds great, so lets just get on with it", even though they may require that particular data item to hit '100%' to ensure their processes are problem-free. This is why it's critical that we align scoring to the business requirements of 'fit for purpose' data.

Post a Comment