The sole reason I held this section back, preferring first to focus on other sections of the scorecard, is that this is the section you will be most familiar with. Profiling exercises have been the de facto standard for measuring data quality for many years, and there are a number of tools on the market that will profile data in great detail. If you don't have the budget to buy a tool you could even build your own - Dylan Jones and Rich Murnane give us some great examples - which will allow you to profile your data against a multitude of data quality dimensions.
What are dimensions of data quality?
When we talk about dimensions of data quality we are talking about aspects of quality such as accuracy, consistency, duplication, referential integrity and conformity. Some basic practical examples could include:
- Table 'SALES' should contain only 1 record per 'TRANSACTION_ID'
- Field 'AMOUNT' in the 'SALES' table should always contain a numeric value
- Field 'CUSTOMER_NAME' should never be NULL
- Each record in the 'CUSTOMER' table should have a corresponding record in our 'CUSTOMER_ADDRESSES' table
- Field 'Post Code' in the 'CUSTOMER_ADDRESSES' table should adhere to Royal Mail PAF standards
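Checks like these are straightforward to express in code. The sketch below, using pandas and an entirely made-up 'SALES' table, shows how the first three rules might be profiled (the data and column names are illustrative, not from a real system):

```python
import pandas as pd

# Hypothetical sales data for illustration only
sales = pd.DataFrame({
    "TRANSACTION_ID": [1, 2, 2, 3],
    "AMOUNT": ["10.50", "abc", "7.00", None],
    "CUSTOMER_NAME": ["Ann", None, "Bob", "Cec"],
})

# Duplication: only 1 record per TRANSACTION_ID
dup_count = sales["TRANSACTION_ID"].duplicated().sum()

# Conformity: AMOUNT should always contain a numeric value
non_numeric = pd.to_numeric(sales["AMOUNT"], errors="coerce").isna().sum()

# Completeness: CUSTOMER_NAME should never be NULL
null_names = sales["CUSTOMER_NAME"].isna().sum()

print(dup_count, non_numeric, null_names)  # → 1 2 1
```

Each check yields a count of offending records, which is exactly the raw material a profiling tool aggregates into its dimension scores.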
By profiling data against our data quality dimensions we can ascertain how many records are deemed OK versus how many are flagged with potential issues to review. However, to maximise the benefit of a data profiling exercise we should look to answer the following question:
How well are our data quality dimensions performing when aligned to business rules and expectations?
Why align to business rules and expectations?
Primarily because it allows us to add context to data. Context is important because it determines how data is interpreted by the business and, ultimately, how the data is used. What is deemed 'Good Quality Data' may differ between two business users, or two different business departments.
This is because each user or department may use data in a different way, and therefore have different requirements for what is essential to ensure that data is fit for its intended purpose.
How do we ascertain these rules?
In order to understand the rules and expectations that should be placed upon data to ensure it is of good, fit quality, we will need to speak to the business. If you have a data steward network, or have previously identified the subject matter experts within the business, ask them. If you haven't yet identified the key stakeholders and subject matter experts for business data, this would be a great time to kick-start that exercise and reap instant benefits.
- Ask the business for their critical data items:
  - Which data items do they need to fulfil their responsibilities?
  - What should these data items look like?
  - Which rules should be applied?
  - When should data be available?
A practical example
The Sales Reporting team provide sales information to the executive team on a weekly basis. Their reporting pack consists of reports such as:
- Total Volume/Value of Sales
- Sales Per Store
- Sales Per Product
- Top/Bottom Performing Sales Staff
A number of critical data items were identified to ensure that this reporting pack is built upon good quality data. Based upon these critical data items, we identified a number of rules and expectations that must be met during a data profiling exercise for sales data to be deemed fit for purpose:
- The sales transactions table should contain only 1 record per 'transaction_id'
- The sales transactions table should contain data for the previous six months, up to and including the previous working day
- Each sale must be related to a 'store_id' and 'seller_id' that can be referenced back to our 'stores' and 'sellers' reference data
- The 'sales_amount' should never be negative
- A seller should only ever be attached to one store at a time; if a seller moves to a new store, the previous seller/store relationship should be end-dated
If a record adheres to all of the above rules we can flag it as 'GREEN', meaning that it meets business rules and expectations on data performance. Once the recordset has been aggregated we can measure the percentage of records that are of 'Good Quality' for sales reporting and visualise this on our scorecard.
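This flag-and-aggregate step can be sketched in a few lines of pandas. The tables, column names and data below are hypothetical stand-ins for the sales transactions and reference data described above (the date-coverage and store-history rules are omitted for brevity):

```python
import pandas as pd

# Hypothetical, simplified sales transactions and reference data
sales = pd.DataFrame({
    "transaction_id": [101, 102, 103, 103],
    "store_id": [1, 1, 2, 9],        # store 9 is missing from reference data
    "seller_id": [10, 11, 12, 12],
    "sales_amount": [25.0, -5.0, 40.0, 40.0],
})
stores = pd.DataFrame({"store_id": [1, 2]})
sellers = pd.DataFrame({"seller_id": [10, 11, 12]})

# Express each rule as a boolean Series, one value per record
unique_txn = ~sales["transaction_id"].duplicated(keep=False)
valid_store = sales["store_id"].isin(stores["store_id"])
valid_seller = sales["seller_id"].isin(sellers["seller_id"])
non_negative = sales["sales_amount"] >= 0

# A record is GREEN only when every rule passes
sales["green"] = unique_txn & valid_store & valid_seller & non_negative

pct_good = 100 * sales["green"].mean()
print(f"{pct_good:.0f}% of records are Good Quality")  # → 25% of records are Good Quality
```

Keeping each rule as its own boolean column also makes it easy to report which rule a failing record broke, not just that it failed.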
The key takeaway from this section of the scorecard is that we are attempting to benchmark data quality against business expectations. Profiling data quality without taking into account the context in which the data will be used may result in further data quality issues and misuse of data.
In my next (and final) post of this series we'll go through an example of putting all the sections together into a usable dashboard for the business.