Data Cleansing, Analysis and Profiling
Data analysis and cleansing are essential first steps toward managing the quality of data in a master index system. Performing these processes early in the project helps ensure its success and can eliminate surprises down the road. EMPIs typically provide the tools needed to analyze, profile, cleanse, and standardize legacy data before loading it into a master index database; if they do not, we can supply these tools ourselves. The Data Profiler and Data Cleanser are generated from a master index application, and they use the object definition and configuration of that application to validate, transform, and standardize data.
Data profiling examines existing data and provides statistics and information about the data. It provides metrics on the quality of the incoming data to help determine the risks and challenges of data integration. The Data Cleanser detects and corrects invalid or inaccurate records based on rules you define to provide a clean and consistent data set. Together, these tools help ensure that the data loaded into the master index database is standard across all records, that it does not contain invalid values, and that it is formatted correctly.
Data profiling analyzes the frequency of data values and patterns in an existing data source based on predefined rules and user-defined rules. Rules are defined using a Rules Definition Language (RDL). The RDL provides a flexible and extensible framework that makes defining rules an easy and straightforward process. Data profiling performs an initial analysis of existing data to determine which fields contain invalid or default values, invalid formats, incorrect dates, and so on. This analysis spotlights values that need to be validated or modified during the cleansing process. For example, if several dates are in the incorrect format, they can be reformatted during cleansing. There may also be cases where the postal code extension is appended to the postal code, making the field value nine characters. Those fields could be truncated to five characters during the cleansing phase, leaving the actual zip code.
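The frequency-and-pattern analysis described above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the RDL or the Data Profiler itself; the sample postal-code values and the digit/letter pattern scheme are assumptions for the example.

```python
from collections import Counter
import re

def value_pattern(value: str) -> str:
    """Collapse a value into a shape pattern: digits become 9, letters become A.
    Punctuation (e.g. the hyphen in a ZIP+4 code) is left as-is."""
    pattern = re.sub(r"[0-9]", "9", value)
    pattern = re.sub(r"[A-Za-z]", "A", pattern)
    return pattern

def profile_field(values):
    """Return frequency counts of the raw values and of their patterns."""
    value_freq = Counter(values)
    pattern_freq = Counter(value_pattern(v) for v in values)
    return value_freq, pattern_freq

# Hypothetical postal-code column from a legacy source
postal_codes = ["32816", "32816-8005", "10001", "ABCDE", "10001"]
values, patterns = profile_field(postal_codes)
# patterns separates the "99999" shape from the nine-digit "99999-9999"
# shape and flags the all-letter "AAAAA" value as suspect
```

A profiler run like this immediately shows which shapes dominate a field and which outlier shapes (here, the ZIP+4 values and the alphabetic value) need a cleansing or validation rule.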
The final data analysis is performed after the data is cleansed using the Data Cleanser. We use this analysis to verify the blocking definitions for the blocking query used in the master index match process. This analysis indicates whether the data blocks defined for the query are too broad or too narrow in scope, either of which would result in an unreliable set of records being returned for a match search. This analysis can also show how reliably each field indicates a match between two records, which in turn indicates how much relative weight should be given to each field in the match string.
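One way to check whether a blocking definition is too broad or too narrow is to group the cleansed records by the blocking key and look at the block-size distribution. The following is a sketch of that check under assumed field names (`LastName`, `BirthYear`) and an assumed size threshold; it is not the master index's own analysis tool.

```python
from collections import Counter

def block_size_stats(records, key_fields, max_block=100):
    """Group records by the blocking key and summarize block sizes.
    Oversized blocks suggest the key is too broad (too many candidates
    per search); singleton blocks suggest it may be too narrow
    (true matches could fall outside the candidate set)."""
    blocks = Counter(tuple(r[f] for f in key_fields) for r in records)
    sizes = list(blocks.values())
    return {
        "blocks": len(sizes),
        "max_size": max(sizes),
        "avg_size": sum(sizes) / len(sizes),
        "oversized": sum(1 for s in sizes if s > max_block),
        "singletons": sum(1 for s in sizes if s == 1),
    }

# Hypothetical cleansed records
records = [
    {"LastName": "SMITH", "BirthYear": "1980"},
    {"LastName": "SMITH", "BirthYear": "1980"},
    {"LastName": "JONES", "BirthYear": "1975"},
    {"LastName": "SMITH", "BirthYear": "1981"},
]
stats = block_size_stats(records, ["LastName", "BirthYear"], max_block=3)
```

If `oversized` is large, the blocking key should be tightened; if most blocks are singletons, it is likely too restrictive for a reliable candidate search.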
Data cleansing validates and modifies data based on predefined rules and user-defined rules. Rules are defined using the same Rules Definition Language (RDL) as the Data Profiler, which provides a flexible framework for defining cleansing rules. The Data Cleanser not only validates and transforms data based on the user-defined rules, but it also parses, normalizes, and phonetically encodes data using the standardization configuration in the master index project. Rules can be defined that validate or transform data during the cleansing process, and conditional rules and operators can be included. If custom processing is needed, Java classes can be defined to extend the functionality of the rules.
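The validate-or-transform rule model can be sketched as a small rule engine: each rule either returns a corrected value or a failure reason. This is a simplified illustration in Python, not the RDL; the field names and the two example rules (ZIP truncation, non-blank check) are assumptions drawn from the scenarios above.

```python
def cleanse(record, rules):
    """Apply (field, rule) pairs to one record.
    Each rule returns (ok, result): the corrected value on success,
    or a failure reason on error. Returns (cleaned_record, errors)."""
    rec = dict(record)
    errors = []
    for field, rule in rules:
        ok, result = rule(rec.get(field, ""))
        if ok:
            rec[field] = result
        else:
            errors.append(f"{field}: {result}")
    return rec, errors

def truncate_zip(value):
    """Transform rule: strip a ZIP+4 extension down to five digits."""
    digits = value.replace("-", "")
    if not digits.isdigit():
        return False, "non-numeric postal code"
    return True, digits[:5]

def require_nonblank(value):
    """Validation rule: reject blank values, trim whitespace otherwise."""
    return (True, value.strip()) if value.strip() else (False, "blank value")

rules = [("PostalCode", truncate_zip), ("LastName", require_nonblank)]
good_rec, errs = cleanse({"PostalCode": "32816-8005", "LastName": " Smith "}, rules)
bad_rec, bad_errs = cleanse({"PostalCode": "ABCDE", "LastName": "  "}, rules)
```

Conditional rules and custom logic fit the same shape: a rule is just a function over the field value, so arbitrary conditions (or, in the real tool, custom Java classes) can be plugged in.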
The output of the Data Cleanser is two record sets: one contains the records that passed all validations and were successfully transformed and standardized, and the other contains all records that failed validation or could not be transformed correctly. The "bad" data file also provides the reason each record failed so that the user can easily determine how to fix the data. This is an iterative process; the Data Cleanser may be executed several times to make sure all data is processed correctly. The final run of the Data Cleanser should produce only a good data file, with no records failing the process.
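The good/bad split with per-record failure reasons can be sketched as follows. The validator shown (a nine-digit SSN check) and the record fields are hypothetical; the point is the shape of the output, mirroring the two files the cleanser produces.

```python
def split_records(records, validate):
    """Partition records into a 'good' set and a 'bad' set.
    validate(record) returns a list of error strings (empty if clean);
    each bad entry keeps its failure reasons alongside the record."""
    good, bad = [], []
    for rec in records:
        errors = validate(rec)
        if errors:
            bad.append({"record": rec, "reasons": errors})
        else:
            good.append(rec)
    return good, bad

def validate(rec):
    """Hypothetical validator: require a nine-digit SSN."""
    ssn = rec.get("SSN", "")
    if not (ssn.isdigit() and len(ssn) == 9):
        return ["SSN: must be 9 digits"]
    return []

records = [{"SSN": "123456789", "Name": "A"}, {"SSN": "12-34", "Name": "B"}]
good, bad = split_records(records, validate)
```

After fixing the records in the bad set and rerunning, the process converges when the bad set is empty, matching the "final run produces only a good data file" criterion above.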
The final result of the data cleansing process is a file containing records that conform to the master index object definition and no longer contain invalid or default values. The fields are formatted correctly, and any fields defined for standardization in the master index application are standardized in the file. This is the file to load into the master index database using the Initial Bulk Match and Load tools.