Course curriculum
One challenge of building and maintaining error-free datasets is searching for and removing duplicate records. Detecting and eliminating duplicate database records is one of the major problems in the broad area of data cleansing and data quality: a single real-world entity may appear multiple times in a dataset under different records due to variations in spelling, field formats, etc. One method for approximately ("fuzzy") matching two field values is to compute the Levenshtein distance between the string representations of those values and accept a suitably low-valued result. One indexing technique that allows for this type of matching in a time-sensitive manner is the Deletion Neighborhood. Deletion Neighborhoods are a classic space-time trade-off: you precompute a large index structure so that your later search operations are fast. While Deletion Neighborhoods can help determine whether field values in two records approximately match, the problem of approximately matching the records themselves remains, and doing that quickly is necessary when working at big data scale. Interestingly, the problems of fuzzy deduplication and fuzzy search are essentially the same: the former is approximate matching of records within a dataset, while the latter is approximate matching between a single record -- constructed from a user's web form entries, perhaps -- and the dataset. In this talk, we review string-oriented Deletion Neighborhoods and present a novel application of them, where a similar technique is applied to entire records within a dataset. Further, we show that combining both string- and record-oriented techniques allows for powerful searching and record de-duplication capabilities.
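To make the string-oriented idea concrete, here is a minimal sketch of a Deletion Neighborhood index in Python. It is an illustration under assumed details, not the talk's implementation: each value is expanded into every variant obtainable by deleting up to a fixed number of characters, those variants are stored in a precomputed index (the space side of the trade-off), and a lookup intersects the query's own deletion variants with the index (the fast time side). The function and parameter names (`deletion_neighborhood`, `build_index`, `fuzzy_lookup`, `max_deletions`) are invented for this example.

```python
from itertools import combinations


def deletion_neighborhood(s, max_deletions=1):
    """All variants of s with up to max_deletions characters removed
    (including s itself)."""
    neighborhood = {s}
    for d in range(1, max_deletions + 1):
        for idxs in combinations(range(len(s)), d):
            neighborhood.add("".join(c for i, c in enumerate(s)
                                     if i not in idxs))
    return neighborhood


def build_index(values, max_deletions=1):
    """Precompute the index: map each deletion variant back to the
    original values that produce it (large structure, built once)."""
    index = {}
    for v in values:
        for variant in deletion_neighborhood(v, max_deletions):
            index.setdefault(variant, set()).add(v)
    return index


def fuzzy_lookup(index, query, max_deletions=1):
    """Fast search: candidates whose deletion neighborhoods intersect
    the query's. Cheap dictionary probes instead of pairwise
    Levenshtein comparisons against every stored value."""
    matches = set()
    for variant in deletion_neighborhood(query, max_deletions):
        matches |= index.get(variant, set())
    return matches
```

For example, indexing `["smith", "smyth", "jones"]` and looking up the misspelled `"smiith"` returns candidates containing `"smith"`, because deleting one `i` from the query lands in the precomputed neighborhood of `"smith"`. Candidates found this way can then be verified with an exact Levenshtein computation if a strict distance bound is required.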
A New Indexing Technique for Quickly Fuzzy-Matching Entire Dataset Records
Instructor
Dan S. Camper
Thaumaturge, HPCC Systems Solutions Lab