Deduplication is the task of identifying records in a data set that refer to the same entity, based on matching information. Deduplicating one data set, or linking several data sets, is an increasingly important task in the data preparation steps of many data mining projects. Establishing record linkages involves comparing both unique and non-unique attributes across large volumes of data.
Traditional data matching compares the records in a database sequentially. The first record is searched against all other records in the set, then the second record is searched against all other records except the first, and so on. This sequential, linear search is very expensive: response is slow, and the approach becomes almost impossible once the data exceeds a certain volume, because the complexity of the problem is inherently quadratic. For instance, even when a query is very fast and fetches results in 1 second against a data set of 10 million records, the estimated time to deduplicate those 10 million records is about 4 months. As the number of matching rules increases, deduplication becomes almost impossible. Indexing helps searching to a certain extent, but matching partial identities across heterogeneous databases to find duplicate records is achievable only with specialized software such as the Bulk Deduplication and Clustering (BDC) engine from Posidex.
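The 4-month estimate follows from simple arithmetic, which the short Python sketch below reproduces. The function names are illustrative, and the only input assumed is the figure quoted above: one lookup of a record against the full data set takes 1 second.

```python
# Back-of-the-envelope cost of sequential deduplication.
# Assumption (from the text): searching one record against the whole
# data set takes 1 second, and every record is searched once.

SECONDS_PER_DAY = 24 * 60 * 60


def sequential_dedup_days(n_records: int, seconds_per_lookup: float) -> float:
    """Days needed when each record is searched once against the data set."""
    return n_records * seconds_per_lookup / SECONDS_PER_DAY


def pairwise_comparisons(n_records: int) -> int:
    """Number of distinct record pairs, n * (n - 1) / 2."""
    return n_records * (n_records - 1) // 2


n = 10_000_000
days = sequential_dedup_days(n, 1.0)
pairs = pairwise_comparisons(n)

print(f"{days:.1f} days")  # 115.7 days, roughly 4 months
print(f"{pairs:,} record pairs")  # 49,999,995,000,000 record pairs
```

The pair count, close to 50 trillion for 10 million records, is where the quadratic growth shows up: doubling the data roughly quadruples the number of comparisons.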
The Bulk Deduplication and Clustering engine is a next-level SetMatch search engine technology that aggregates voluminous data into multiple sets of clusters for efficient and very fast matching. It combines techniques from mathematics, statistics and machine learning to improve matching quality and clustering, and to increase performance and efficiency when linking or deduplicating very large data sets.
The SetMatch engine deduplicates and matches millions of records. Clusters are then formed from the deduplication results, and a Customer Master table (referred to as the Golden record) is generated. The challenges encountered in the process are:
- Gigantic task involving trillions of comparisons
- Matching is complicated for names and multiple addresses, while it remains simple for parameters such as Date of Birth, Mobile Number and Phone Number
- Highly resource intensive
- Leads to network clogging
SetMatch employs an innovative approach to deal with this problem, the salient features of which are:
- Based on set theory
- Caches the essential inputs of matching by means of persistent Java objects
- Clusters records of identical features of measure and builds nested sets
- Unlike the conventional process of finding matches record by record against the target, sets are compared for likeness; where sets are alike, the elements of the corresponding sets are sent for detailed matching
- I/O operations with the database, the major bottleneck in the process, are almost completely avoided
- Uses the PrimeMatch engine for matching of names
- Speed is phenomenal compared to conventional matching
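The idea of grouping records into sets and sending only the members of alike sets for detailed matching can be illustrated with blocking, a standard record-linkage technique. The Python sketch below is not the SetMatch algorithm, whose internals are not described here. The record fields and the `block_key` rule (exact date of birth plus first letter of the name) are hypothetical choices made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def block_key(record: dict) -> tuple:
    """Hypothetical grouping rule: date of birth plus first letter of name."""
    return (record["dob"], record["name"][:1].upper())


def candidate_pairs(records: list):
    """Yield id pairs of records that fall into the same set (block)."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec["id"])
    # Only records sharing a set are paired for detailed matching.
    for ids in blocks.values():
        yield from combinations(ids, 2)


# Made-up sample data.
records = [
    {"id": 1, "name": "Ravi Kumar", "dob": "1980-01-15"},
    {"id": 2, "name": "Ravi K", "dob": "1980-01-15"},
    {"id": 3, "name": "Sita Rao", "dob": "1975-06-30"},
    {"id": 4, "name": "R. Kumar", "dob": "1980-01-15"},
]

pairs = list(candidate_pairs(records))
print(pairs)  # [(1, 2), (1, 4), (2, 4)]
```

Four records give six possible pairs, but only three reach detailed matching here. Everything is held in memory, so no database round trip is needed per comparison, which is in the spirit of the I/O point above.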
While achieving this task, the following features are offered:
- Support for transforming data from disparate data sources
- Flexibility in building the matching rules
- Multi-clustering to target high Recall and Precision. The rules governing a cluster are built on matching strengths or match scores, by assigning appropriate weightages
- Splitting, merging and realignment of clusters
- GUI for different tasks, viz. user management, cluster rule building, cluster navigation, and verification with a maker-checker policy
- Merging of clusters to form the golden/master record
- Provision to merge manually
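A cluster rule built from weighted match scores might look like the minimal Python sketch below. The weights, the threshold and the field names are hypothetical, and exact equality stands in for the fuzzy name matching that a real engine would use. Grouping linked records with a union-find structure is one common choice, not necessarily the method used by the BDC engine.

```python
# Hypothetical weightages per attribute (integers, out of 100)
# and a hypothetical threshold for linking two records.
WEIGHTS = {"name": 50, "dob": 30, "mobile": 20}
THRESHOLD = 70


def match_score(a: dict, b: dict) -> int:
    """Sum the weights of the attributes on which both records agree."""
    return sum(w for field, w in WEIGHTS.items()
               if a[field] and a[field] == b[field])


def cluster(records: list) -> list:
    """Group record ids whose pairwise score reaches the threshold."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(records):
        for b in records[i + 1:]:
            if match_score(a, b) >= THRESHOLD:
                parent[find(a["id"])] = find(b["id"])

    groups = {}
    for r in records:
        groups.setdefault(find(r["id"]), []).append(r["id"])
    return sorted(groups.values())


# Made-up sample data.
records = [
    {"id": 1, "name": "Ravi Kumar", "dob": "1980-01-15", "mobile": "555-0101"},
    {"id": 2, "name": "Ravi Kumar", "dob": "1980-01-15", "mobile": "555-0102"},
    {"id": 3, "name": "Sita Rao", "dob": "1975-06-30", "mobile": "555-0103"},
    {"id": 4, "name": "Ravi Kumar", "dob": "1991-03-02", "mobile": "555-0102"},
]

clusters = cluster(records)
print(clusters)  # [[1, 2, 4], [3]]
```

Records 1 and 4 agree only on name (score 50), yet they end up in the same cluster because each links to record 2. Transitive links of this kind are one reason a splitting and realignment step, and manual verification, are useful.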
Deduplication and the creation of a unique customer base is usually the first task in implementing Master Data Management. It requires careful planning and many iterations before freezing the optimal rule set.
Because it is a complex process that requires a thorough understanding of data quality and of the need for proper cleansing and standardization routines, deduplication cannot be automated and requires manual intervention and iterations.
Some of the largest deduplication exercises in India have been carried out by Posidex.