Cleansing & Standardization
Data refinement
Data refinement ensures that data elements pass a series of acceptance criteria and a predefined set of validations, that data values are valid, consistent and uniformly represented, and that data enrichment is carried out wherever possible.

Data quality is one of the key components of the customer consolidation process. Building a trusted customer master record from inconsistent, low-quality data is an almost impossible task. When data quality is poor, matching and linking of records and customer aggregation will most likely result in low match accuracy and produce an unacceptable number of false negatives and false positives.

A key challenge in improving data quality is an incomplete or unclear set of semantic definitions of what the data is supposed to represent and in what form. There are many data quality dimensions and contexts, each of which requires a different approach to improvement. Having worked extensively on large and varied databases, PrimeMatch® has built tools for data profiling and also maintains exhaustive reference data tables for enhancing data quality.

Cleaning of invalid data:
  • All the parameters are cleaned of unwanted characters.
  • Names are cleaned of titles (Mr, Sri, Dr, etc.) and postfixes such as M.Tech.
  • Certain cleaning routines are specific to the source system, such as names separated by a delimiter, for instance "Diaa Mittle Rep By Anup Mittle" or "Anup Sharma S/o Arvind Sharma". In these cases the delimiter and the data after it (i.e. "Rep By Anup Mittle") are ignored for matching. The list of such delimiters should be obtained from the source system.
  • There are 46 cleaning routines that can be applied.
  • The following are a few of the cleaning libraries available with us:
    • Account Name Splitter List
    • Corporate Identifier List
    • Relation Name Splitter List
    • Salutation List
    • Surname Identifier List
    • Terminals
    • City Clean List
    • Phone Exclusion List
    • Phone Extract List
    • Mobile Series List
    • Phone Clean List
    • Stdcode Series List
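The name cleaning steps above can be sketched as follows. This is a minimal illustration and not the product's implementation: the function name `clean_name` and the sample entries in the salutation, postfix and splitter lists are assumptions standing in for the actual reference lists, which are not reproduced on this page.

```python
import re

# Hypothetical sample entries; the real Salutation List and Relation Name
# Splitter List are reference data that this page does not reproduce.
SALUTATIONS = {"mr", "mrs", "ms", "sri", "dr"}
POSTFIXES = {"m.tech"}
SPLITTERS = ["rep by", "s/o"]


def clean_name(raw: str) -> str:
    """Return the part of a name that would be used for matching."""
    # Collapse repeated whitespace first so the delimiter search is reliable.
    name = " ".join(raw.split())

    # Drop the delimiter and everything after it (e.g. "Rep By Anup Mittle").
    for splitter in SPLITTERS:
        match = re.search(r"\s" + re.escape(splitter) + r"\s", name,
                          flags=re.IGNORECASE)
        if match:
            name = name[:match.start()]

    kept = []
    for token in name.split():
        key = token.lower().rstrip(".")
        # Skip titles and postfixes.
        if key in SALUTATIONS or key in POSTFIXES:
            continue
        # Remove unwanted characters from what remains.
        token = re.sub(r"[^A-Za-z]", "", token)
        if token:
            kept.append(token)
    return " ".join(kept)
```

With these sample lists, `clean_name("Diaa Mittle Rep By Anup Mittle")` gives `"Diaa Mittle"` and `clean_name("Dr. Anup Sharma M.Tech")` gives `"Anup Sharma"`.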
Junk Values Cleaning: Junk values are likely to mislead the dedupe process and are therefore ignored during it, for instance a phone number such as 11111111. The source system is expected to list such junk values.
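As a rough sketch of junk value handling, a value found in the junk list can be blanked so that it does not take part in matching. The list contents and the function name `phone_for_matching` below are assumptions; in practice the list would come from the source system.

```python
# Hypothetical junk list; the source system is expected to supply the real one.
JUNK_PHONES = {"11111111", "00000000", "12345678"}


def phone_for_matching(raw):
    """Return the digits of a phone number, or None if it is empty or junk."""
    digits = "".join(ch for ch in (raw or "") if ch.isdigit())
    if not digits or digits in JUNK_PHONES:
        return None  # ignored by the dedupe process
    return digits
```

Here `phone_for_matching("11111111")` returns `None`, while `phone_for_matching("080-2345 6789")` returns `"08023456789"`.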

Standardization: Standardization is the process of changing non-standard data values to a pre-decided standard format.
The following are a few of the standardization dictionaries available with us:
  • Generic Phonetic Dictionary
  • Leading Phonetic Dictionary
  • Address Standardization Dictionary
  • City Standardization Dictionary
  • Phone Standardization Dictionary
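Dictionary-driven standardization can be illustrated with a simple token lookup. The dictionary entries and the function name `standardize_tokens` are hypothetical and only show the idea; the actual dictionaries are far larger and are not listed here.

```python
# Hypothetical extracts of an address and a city standardization dictionary.
ADDRESS_DICT = {"rd": "ROAD", "st": "STREET", "nr": "NEAR"}
CITY_DICT = {"bombay": "MUMBAI", "calcutta": "KOLKATA"}


def standardize_tokens(text: str, dictionary: dict) -> str:
    """Replace each token by its standard form; upper-case unknown tokens."""
    result = []
    for token in text.split():
        bare = token.strip(".,")  # ignore trailing punctuation
        result.append(dictionary.get(bare.lower(), bare.upper()))
    return " ".join(result)
```

For example, `standardize_tokens("14 Lake Rd., Nr. Temple", ADDRESS_DICT)` yields `"14 LAKE ROAD NEAR TEMPLE"`, and `standardize_tokens("Bombay", CITY_DICT)` yields `"MUMBAI"`.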

Extraction: Extraction algorithms are used for finding hidden values. For instance, if the city is not stored separately and is part of one of the address lines, it is extracted and stored in a separate field, even if it appears with a spelling variation.
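One simple way to picture city extraction is a lookup of known spelling variants against the tokens of each address line. This is only a sketch under that assumption; the actual extraction algorithm is not described here, and the variant table and the name `extract_city` are hypothetical.

```python
# Hypothetical table mapping known spelling variants to a standard city name.
CITY_VARIANTS = {
    "bangalore": "BENGALURU",
    "banglore": "BENGALURU",
    "bengaluru": "BENGALURU",
    "bombay": "MUMBAI",
    "mumbai": "MUMBAI",
}


def extract_city(address_lines):
    """Return (city, address lines with the first city token removed)."""
    city = None
    remaining = []
    for line in address_lines:
        kept = []
        for token in line.split():
            key = token.strip(".,-").lower()
            if city is None and key in CITY_VARIANTS:
                city = CITY_VARIANTS[key]  # move the value to its own field
            else:
                kept.append(token)
        # Tidy up separators left behind after removing the city token.
        remaining.append(" ".join(kept).rstrip(" ,"))
    return city, remaining
```

Calling `extract_city(["14 Lake Road", "Indiranagar, Banglore"])` returns `("BENGALURU", ["14 Lake Road", "Indiranagar"])`.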

Pattern Checking: Regular expressions are built for checking patterns. For instance, a PAN number is checked against its standard format. Mobile numbers are checked for format (using reference data on the leading digits of mobile numbers for all service providers in all states) and, if found in land phone fields, are moved into the mobile fields.
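The checks described above might look like the sketch below. The PAN pattern follows the standard format of five letters, four digits and one letter. The mobile series set, the field names `land_phone` and `mobile`, and the function names are placeholders chosen for illustration, not the actual reference data or schema.

```python
import re

# PAN format: five letters, four digits, one letter.
PAN_RE = re.compile(r"[A-Z]{5}[0-9]{4}[A-Z]")

# Placeholder values standing in for the Mobile Series List reference data.
MOBILE_SERIES = {"9845", "9880"}


def is_valid_pan(value: str) -> bool:
    return PAN_RE.fullmatch(value.strip().upper()) is not None


def looks_like_mobile(number: str) -> bool:
    digits = re.sub(r"\D", "", number)
    return len(digits) == 10 and digits[:4] in MOBILE_SERIES


def relocate_mobile(record: dict) -> dict:
    """Move a mobile number found in the land phone field to the mobile field."""
    fixed = dict(record)  # leave the input record untouched
    land = fixed.get("land_phone") or ""
    if not fixed.get("mobile") and looks_like_mobile(land):
        fixed["mobile"] = re.sub(r"\D", "", land)
        fixed["land_phone"] = None
    return fixed
```

For instance, `is_valid_pan("ABCDE1234F")` is `True`, and a record with `land_phone` set to `"98450 12345"` and no mobile number comes back with `mobile` set to `"9845012345"` and `land_phone` cleared.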

Data Validation and Verification: Validation rules range from simple checks, such as data type, distinct value list or value range, to fairly complex rules that check attribute values against a predefined set of values. Other validations include optionality, uniqueness, data length, list of allowed values, and check digit/checksum. For instance, the mapping between states and cities, and between the leading digits of the PIN code and the state, is checked.
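A few of these rules can be sketched as a function that collects the reasons a record fails. The field names, the allowed value list and the small mapping tables are illustrative assumptions; the real reference tables are much larger.

```python
# Illustrative extracts only; the real reference tables are not shown here.
ALLOWED_GENDERS = {"M", "F", "O"}
STATE_CITIES = {"KARNATAKA": {"BENGALURU", "MYSURU"}}
PIN_PREFIX_STATE = {"56": "KARNATAKA"}


def validate_record(rec: dict) -> list:
    """Return a list of rejection reasons; an empty list means the record passes."""
    errors = []

    # Optionality: name is mandatory.
    if not (rec.get("name") or "").strip():
        errors.append("name is mandatory")

    # Data type and length: PIN code must be exactly six digits.
    pin = rec.get("pin") or ""
    if not (pin.isdigit() and len(pin) == 6):
        errors.append("pin must be 6 digits")
    # Cross-field check: leading PIN digits must agree with the state, if known.
    elif PIN_PREFIX_STATE.get(pin[:2]) not in (None, rec.get("state")):
        errors.append("pin does not belong to state")

    # List of allowed values.
    if rec.get("gender") not in ALLOWED_GENDERS:
        errors.append("gender not in allowed values")

    # Cross-field check: city must belong to the state, if the state is known.
    state, city = rec.get("state"), rec.get("city")
    if state in STATE_CITIES and city not in STATE_CITIES[state]:
        errors.append("city does not belong to state")

    return errors
```

Returning every reason, rather than stopping at the first failure, makes it straightforward to store the reasons alongside a rejected record.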

Data Enrichment: Enrichment is the process of augmenting the existing customer data fields with additional information. For example, the PIN code is used to populate the city where the city is null and the PIN code maps to a city in the PIN code master. Customer input records that are rejected at this stage are loaded into error tables along with the reason for rejection; refined records are loaded into a separate table for processing by the subsequent stages. Overall, this process ensures that data elements have passed a series of acceptance criteria and that data values are valid, consistent and uniformly represented.
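The enrichment and routing described above can be sketched as follows. The PIN code master extract, the field names, the `reject_reason` column and the function names are assumptions for illustration; in-memory lists stand in for the refined and error tables.

```python
# Illustrative extract of a PIN code master.
PINCODE_MASTER = {"560001": "BENGALURU", "400001": "MUMBAI"}


def enrich_city(rec: dict) -> dict:
    """Fill a missing city from the PIN code master when the PIN code is known."""
    out = dict(rec)
    if not out.get("city"):
        city = PINCODE_MASTER.get(out.get("pin"))
        if city:
            out["city"] = city
    return out


def route_records(records, validate):
    """Split records into refined rows and error rows with a rejection reason.

    `validate` is any function returning a list of reasons (empty if valid).
    """
    refined, errors = [], []
    for rec in records:
        rec = enrich_city(rec)
        reasons = validate(rec)
        if reasons:
            errors.append({**rec, "reject_reason": "; ".join(reasons)})
        else:
            refined.append(rec)
    return refined, errors
```

Enrichment runs before validation in this sketch so that a record with a missing city but a known PIN code is not rejected unnecessarily.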

Further, users can add their own dictionaries to meet cleaning requirements specific to their organization. The sequence in which the cleaning routines are applied is also significant and can be controlled by the user.
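To see why the sequence matters, consider the sketch below, in which a user-supplied dictionary extends a salutation list and two routines are applied in a configurable order. The routine names and dictionary entries are hypothetical and are not the product's actual routines.

```python
def remove_punctuation(value: str) -> str:
    """Keep only letters, digits and whitespace."""
    return "".join(ch for ch in value if ch.isalnum() or ch.isspace())


def make_salutation_stripper(salutations):
    """Build a routine that drops tokens found in the given salutation set."""
    def strip_salutations(value: str) -> str:
        return " ".join(t for t in value.split() if t.lower() not in salutations)
    return strip_salutations


def run_pipeline(value: str, routines) -> str:
    """Apply the cleaning routines in the order given."""
    for routine in routines:
        value = routine(value)
    return value


# Built-in sample entries plus a user-added, organization-specific entry.
salutations = {"mr", "dr"} | {"adv"}
strip_salutations = make_salutation_stripper(salutations)
```

With punctuation removed first, `run_pipeline("Dr. Anup Sharma", [remove_punctuation, strip_salutations])` gives `"Anup Sharma"`. With the order reversed, the token `"Dr."` does not match the dictionary entry `"dr"`, and the result is `"Dr Anup Sharma"`.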

For a demo, product and solution evaluations, and pricing details, please contact us.