A robust EMPI engine runs standardization routines such as name, title, salutation, company, and address normalization. These let OHMPI generate an aggregate score based on each address component (e.g., “Ave.”, “Ave”, and “Avenue” are all treated as equal). The engine does this automatically, so customers don’t have to create complex business rules to parse an address. For example, suppose the address fed to the engine is “10144 Hiawatha Ave”. The engine automatically breaks the address into components and scores each one:
| Address1 | Address on file | Min Score | Max Score | Actual Score |
| Total for Address: | | | | 6.32 |
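As a rough illustration of component-level standardization, the sketch below expands a few common street-type abbreviations before comparison. The abbreviation table and function names are hypothetical, not OHMPI's actual configuration:

```python
# Hypothetical sketch of address standardization: expand common
# abbreviations so "Ave.", "Ave", and "Avenue" all compare as equal.
ABBREVIATIONS = {
    "ave": "avenue", "st": "street", "rd": "road", "blvd": "boulevard",
}

def normalize_component(token: str) -> str:
    """Lowercase a token, strip trailing punctuation, expand abbreviations."""
    cleaned = token.lower().strip(".,")
    return ABBREVIATIONS.get(cleaned, cleaned)

def normalize_address(address: str) -> list[str]:
    """Break an address into standardized components."""
    return [normalize_component(t) for t in address.split()]

# "10144 Hiawatha Ave" and "10144 Hiawatha Avenue" normalize identically,
# so component-by-component scoring treats them as the same address.
print(normalize_address("10144 Hiawatha Ave"))  # ['10144', 'hiawatha', 'avenue']
```

Once both addresses are reduced to the same component list, a per-component score can be computed without any customer-written parsing rules.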
Record Matching Process
All records are first filtered down to a candidate set by a selection query and then compared one-on-one for matches. Based on the matching score, each record is identified as a match, a potential match, or not a match.
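The score-to-decision step amounts to a threshold check. The threshold values below are invented for illustration, not product defaults:

```python
# Hypothetical thresholds; a real EMPI makes these configurable.
MATCH_THRESHOLD = 7.0
POTENTIAL_THRESHOLD = 4.0

def classify(matching_score: float) -> str:
    """Classify a candidate record by its aggregate matching score."""
    if matching_score >= MATCH_THRESHOLD:
        return "match"
    if matching_score >= POTENTIAL_THRESHOLD:
        return "potential match"
    return "not a match"

# The 6.32 address total from the example above would fall between the
# two thresholds, making the record a potential match for human review.
print(classify(6.32))  # potential match
```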
Most EMPIs rely on a two-step matching process. The first step, often described as “casting the wide net,” generates a set of probable candidates. The second is a refined, detailed pass that runs a field-by-field comparison. Algorithms such as Soundex and NYSIIS generate a score for each field, and the scores are aggregated into a total. Only close matches are returned, as determined by the scores and configured thresholds.
Vendors use this two-step approach for efficiency and response time: response times would suffer greatly if every record had to be compared field by field against millions of others. Robust EMPIs maintain indices that allow result sets and subsets to be derived in milliseconds.
Step 1) “Casting the Wide Net”
The EMPI leverages a combination of several “Blocking” (fuzzy) queries for the first step. Each Blocking query includes a different set of fields, and some Blocks might overlap others. For example, one blocker search might be “First Name / Last Name”, another might be “First Name / DOB / Gender”, and a third might be “Last Name / DOB / SSN”.
These fuzzy queries are already defined in the base product. They are editable and additional Blocks can be added via the configuration GUI.
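A minimal sketch of blocking follows. The records, field names, and block definitions are illustrative assumptions, not OHMPI's schema: each Blocking query builds a key from a subset of fields, and the union of the matching buckets forms the candidate set.

```python
from collections import defaultdict

# Illustrative records; field names are assumptions for this sketch.
records = [
    {"id": 1, "first": "John", "last": "Smith", "dob": "1980-01-01", "gender": "M"},
    {"id": 2, "first": "Jon",  "last": "Smith", "dob": "1980-01-01", "gender": "M"},
    {"id": 3, "first": "Mary", "last": "Jones", "dob": "1975-06-15", "gender": "F"},
]

# Each Block is a different combination of fields (simplified here).
BLOCKS = [("first", "last"), ("first", "dob", "gender"), ("last", "dob")]

def build_indices(records):
    """Index every record under its key for each Blocking query."""
    indices = [defaultdict(list) for _ in BLOCKS]
    for rec in records:
        for index, fields in zip(indices, BLOCKS):
            key = tuple(rec[f].lower() for f in fields)
            index[key].append(rec["id"])
    return indices

def candidates(query, indices):
    """Union of all records sharing at least one blocking key with the query."""
    found = set()
    for index, fields in zip(indices, BLOCKS):
        key = tuple(query[f].lower() for f in fields)
        found.update(index.get(key, []))
    return found

indices = build_indices(records)
query = {"first": "John", "last": "Smith", "dob": "1980-01-01", "gender": "M"}
print(candidates(query, indices))  # {1, 2} — record 2 is caught by the Last Name / DOB block
```

Note how record 2 (“Jon Smith”) misses the First Name / Last Name block but is still swept into the candidate set by the Last Name / DOB block; overlapping Blocks are what make the net wide.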
Step 2) “Refine Pass”
After retrieving the set of probable candidates (which could number in the hundreds or thousands), it is important to cut the list down. Because this subset is just a sliver of the overall database, field-by-field comparisons can now be run efficiently. Each field has attributes that the algorithms use to determine a score. Some of the weight considerations:
- Probabilistic and Deterministic scoring
- Reliability of data element / field or source
- Character uncertainty (phonetic errors, transpositions, character insertion, deletion, and replacement)
- Absolute difference in numbers (distance calculation)
- Specific to Names:
- Name aliasing (ex: John = Jonny = Jon = Jonathon…)
- Enhanced phonetic matching on last, middle, and first names (Soundex and NYSIIS)
- Filtering of “junk” values such as “baby” or “boy” or “girl”
- Frequency analysis (matches on statistically common names are weighted less than matches on unique names).
- Specific to Addresses:
- Tokenization of address (ex: One 1st St. is the same as 1 First Street)
- Enhanced phonetic matching on street names and cities.
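The refine pass above can be sketched as a weighted, field-by-field comparison. The Soundex routine below is the classic algorithm; the per-field weights and partial-credit value for a phonetic-only match are made-up illustrations, whereas a real engine derives weights probabilistically:

```python
def soundex(name: str) -> str:
    """Classic Soundex code: first letter plus three digits (e.g. Robert -> R163)."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    result = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # h and w do not reset the previous code
            prev = code
    return (result + "000")[:4]

# Hypothetical per-field weights for the second pass.
WEIGHTS = {"first": 2.0, "last": 3.0, "dob": 3.0}

def field_score(field: str, a: str, b: str) -> float:
    """Exact match scores 1.0; a phonetic-only name match gets partial credit."""
    if a == b:
        return 1.0
    if field in ("first", "last") and soundex(a) == soundex(b):
        return 0.8
    return 0.0

def total_score(query: dict, candidate: dict) -> float:
    """Aggregate the weighted per-field scores into one matching score."""
    return sum(WEIGHTS[f] * field_score(f, query[f], candidate[f]) for f in WEIGHTS)

q = {"first": "Jon",  "last": "Smith", "dob": "1980-01-01"}
c = {"first": "John", "last": "Smyth", "dob": "1980-01-01"}
print(total_score(q, c))  # 7.0 — both name pairs match phonetically, DOB exactly
```

“Jon”/“John” and “Smith”/“Smyth” each share a Soundex code, so the pair scores high despite zero exact name matches; the aggregate total is then compared against the match and potential-match thresholds.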