I recently visited a major insurance provider and learned that the company had proliferated nearly 200 data warehouses throughout its network. In addition, it has several terabyte-sized operational databases and, of course, a myriad of desktop computers and laptops.
The company representative was wondering what could be done about data quality in this petabyte universe, and whether a service-oriented architecture would help solve some of these proliferation problems.
How to respond to this question? Well, yes, but … no. It’s complicated. Truthfully, I don’t understand any of it. I challenge anyone to wrap their mind around that kind of a problem. You’re not going to solve this with a tool. This doesn’t require a fix; this requires philosophy, strategy, organization, and willpower. We’re talking about the capital asset of data, as my friend Tom Redman points out.
As we were thinking about how to approach this and similar problems, I was talking to Jay Bourland, who mentioned that he’d heard of a method of organizing different solutions through the use of patterns. “A method for organizing architecture,” he said. “Maybe there’s applicability.”
Whatsa Pattern?
The method Jay was talking about turned out to be the work of Christopher Alexander, most famously published in A Pattern Language. The primary audience of this work seemed to be architects, and maybe community or city designers. But data quality? Where’s the connection?
Looking further into the considerable body of work of Christopher Alexander, I read his seminal book Notes on the Synthesis of Form. Here, in 1964, he shows that by understanding and structuring the constraints of a design problem, a design itself emerges as a sort of ‘negative image’ of the constraint structure. He represents design constraints as a taxonomy of directed graphs, revealing picture patterns that are iteratively applied to all sorts of different problems.
In effect, what Alexander says is: Don’t look at the design itself; understand the patterns of the constraints, and the design emerges naturally.
After some research I found that the use of patterns in the development of software is not new. All the major technology providers have flirted with the idea of software patterns, but few real advances have been made recently. In this context, the word pattern refers principally to the use of object-oriented design structures: information hiding, abstraction (e.g., facade pattern or service proxy), polymorphism, etc. These patterns are designed to optimize structural features such as reusability and simplicity.
All good stuff. But if you’re a programmer and you’re looking for a design or information object, all you have in your hands is a bunch of constraints, not entry points. It’s supply-side economics, and we all know demand drives the game.
The Structure of Data Quality Patterns
The most notable, complete, and persistent discussion of patterns in software I could find is Hillside.net, the only such site to mention Alexander’s work. If I overlooked anyone, my apologies in advance. Here Dick Gabriel (in a piece by James Coplien) writes an excellent definition of patterns in software engineering:
Each pattern is a three-part rule, which expresses a relation between a certain context, a certain system of forces which occurs repeatedly in that context, and a certain software configuration which allows these forces to resolve themselves.
He divides constraints into three classes: context, forces, and configuration. To the configuration class I add (from Alexander) the subcategories solution, resulting context, and rationale.
Can patterns for the assurance of data quality be defined in this way?
Let’s take a look at an example:
- Name: Customer Validation
- Problem: Duplicate customer records keep creeping into databases and are then submitted in list form to downstream processes
- Context: There are several databases with customer records in them. This pattern resolves the unique database structure by presenting a subset of customer records as two or more simple records.
- Forces: Present a ‘clean’ list of records without duplicates. Various feeder systems may insert duplicates. How many duplicates were found is not relevant; this should act as a transparent ‘insurance’ routine that warrants quality data in normalized, non-duplicate terms.
- Solution: Described in the following directed graph:
- Resulting Context: The result is a list of names and addresses in a single format, which must then be placed appropriately into their various target contexts.
- Rationale: Database management and normalization routines may wish to ensure no duplicates for ad hoc list extracts. This pattern doesn’t use best-of-breed selection, as this is not always needed.
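To make the descriptor above concrete, here is a minimal Python sketch of how such a pattern might be captured as a data structure with an executable solution. Everything in it is an assumption made for illustration: the `DataQualityPattern` class, the `name` and `address` record fields, and the matching rule (case and whitespace are ignored, and the first record seen is kept, in line with the rationale's point that no best-of-breed selection is done). It is not the directed graph referred to in the Solution entry.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, str]  # hypothetical record shape: {"name": ..., "address": ...}


@dataclass(frozen=True)
class DataQualityPattern:
    """A pattern descriptor with the fields used in the example above."""
    name: str
    problem: str
    context: str
    forces: str
    solution: Callable[[Iterable[Record]], List[Record]]
    resulting_context: str
    rationale: str


def match_key(record: Record) -> Tuple[str, str]:
    """Assumed matching rule: compare name and address, ignoring case and extra spaces."""
    def normalize(value: str) -> str:
        return " ".join(value.lower().split())
    return normalize(record.get("name", "")), normalize(record.get("address", ""))


def drop_duplicates(records: Iterable[Record]) -> List[Record]:
    """Return the records in their original order, keeping the first of each match key."""
    seen = set()
    clean = []
    for record in records:
        key = match_key(record)
        if key not in seen:
            seen.add(key)
            clean.append(record)
    return clean


customer_validation = DataQualityPattern(
    name="Customer Validation",
    problem="Duplicate customer records keep creeping into databases",
    context="Several databases hold customer records",
    forces="Present a clean list without duplicates; the duplicate count is not relevant",
    solution=drop_duplicates,
    resulting_context="A list of names and addresses in a single format",
    rationale="Ensure no duplicates for ad hoc list extracts",
)

sample = [
    {"name": "Ann Lee", "address": "1 Main St"},
    {"name": "ann  lee", "address": "1 main st"},  # same customer, different spelling
    {"name": "Bob Ray", "address": "2 Oak Ave"},
]
clean = customer_validation.solution(sample)
```

The caller only sees a clean list; how many duplicates were dropped is deliberately not reported, which matches the Forces entry.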
Hierarchy and Fabrics
The example Customer Validation pattern is a very simple one that would be useful for list extracts. Of course, one could think of five variants of this pattern that better remedy the root cause of duplicate records. If you had that thought, you’re already starting to think of data quality in terms of patterns.
I call these constraint-based data quality patterns because they are categorized by their constraints, not necessarily by their structure. This is not to say that the structure of a data quality flow doesn’t matter.
The XML revolution has brought us a world of web services, and with it an ordered, cellular universe of objects that perform every function imaginable. By using the pattern descriptions above, along with appropriate WSDL descriptions, we can implement hundreds of data quality patterns that can be stored in registries, browsed, searched, and deployed in minutes.
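As a rough illustration of the registry idea, the sketch below keeps pattern descriptions in memory and lets a user search them by their constraints (problem, context, forces) rather than by their structure. The `PatternEntry` and `PatternRegistry` names are hypothetical, the "Address Standardization" entry is invented purely to have something to search against, and a real deployment would sit behind a service registry and WSDL descriptions rather than a Python dictionary.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PatternEntry:
    """The constraint side of a pattern description (hypothetical shape)."""
    name: str
    problem: str
    context: str
    forces: str


class PatternRegistry:
    """A toy in-memory registry: register, browse, and search by constraint text."""

    def __init__(self) -> None:
        self._entries: Dict[str, PatternEntry] = {}

    def register(self, entry: PatternEntry) -> None:
        if entry.name in self._entries:
            raise ValueError(f"pattern already registered: {entry.name}")
        self._entries[entry.name] = entry

    def browse(self) -> List[str]:
        """All registered pattern names, alphabetically."""
        return sorted(self._entries)

    def search(self, *keywords: str) -> List[PatternEntry]:
        """Entries whose problem, context, or forces mention every keyword."""
        hits = []
        for entry in self._entries.values():
            text = " ".join((entry.problem, entry.context, entry.forces)).lower()
            if all(keyword.lower() in text for keyword in keywords):
                hits.append(entry)
        return hits


registry = PatternRegistry()
registry.register(PatternEntry(
    name="Customer Validation",
    problem="Duplicate customer records keep creeping into databases",
    context="Several databases hold customer records",
    forces="Present a clean list of records without duplicates",
))
registry.register(PatternEntry(  # invented entry, for illustration only
    name="Address Standardization",
    problem="Addresses arrive in inconsistent formats",
    context="Several feeder systems capture addresses as free text",
    forces="Present every address in one format",
))

names = registry.browse()
matches = [entry.name for entry in registry.search("duplicate")]
```

Searching on constraint text is the point of the design: a programmer who knows the problem in hand ("duplicate") can find a pattern without knowing what it is called.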
Alexander tells us in A Pattern Language: “… no pattern is an isolated entity. Each pattern can exist […] only to the extent that it is supported by other patterns in which it is embedded, the patterns of the same size that surround it, and the smaller patterns that are embedded within it.”
In other words, data quality patterns form a hierarchical structure. Tree structures are defined by the dynamics of definition and inheritance. We call large data quality patterns, with many other patterns within them, data quality fabrics.
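One way to picture this hierarchy is as a tree in which a fabric contains patterns and other fabrics, and applying the fabric applies everything beneath it in order. The sketch below is only an assumption about how that might look in Python; the `Pattern` and `Fabric` classes and the two leaf routines are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Union

Record = Dict[str, str]
Step = Callable[[List[Record]], List[Record]]


@dataclass
class Pattern:
    """A leaf: one named data quality routine."""
    name: str
    apply: Step


@dataclass
class Fabric:
    """A large pattern that embeds smaller patterns and fabrics."""
    name: str
    children: List[Union[Pattern, "Fabric"]] = field(default_factory=list)

    def apply(self, records: List[Record]) -> List[Record]:
        # Run each embedded pattern in order, feeding its output to the next.
        for child in self.children:
            records = child.apply(records)
        return records

    def outline(self, depth: int = 0) -> List[str]:
        """Indented listing of the tree, one line per pattern or fabric."""
        lines = ["  " * depth + self.name]
        for child in self.children:
            if isinstance(child, Fabric):
                lines.extend(child.outline(depth + 1))
            else:
                lines.append("  " * (depth + 1) + child.name)
        return lines


def strip_whitespace(records: List[Record]) -> List[Record]:
    return [{key: value.strip() for key, value in r.items()} for r in records]


def require_name(records: List[Record]) -> List[Record]:
    return [r for r in records if r.get("name")]


cleansing = Fabric("Cleansing", [Pattern("Strip Whitespace", strip_whitespace)])
intake = Fabric("Customer Intake", [cleansing, Pattern("Require Name", require_name)])

result = intake.apply([{"name": " Ann Lee "}, {"name": "   "}])
outline = intake.outline()
```

Because a fabric exposes the same `apply` call as a single pattern, a caller can treat a whole tree of rules as one unit, which is what makes large rule sets manageable.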
As Navin Sharma says: “Fabrics are tailored for industry-specific business processes using the principles of data quality and location intelligence–these improve accuracy, completeness and business decision support. Omit Data Quality Fabrics from your SOA architecture, and you risk instability in business processes.”
Perhaps now that we have a method for managing large sets of data quality rules, we can start defining an approach to petabyte-sized data quality issues.