Course curriculum
One challenge of building and maintaining error-free datasets is searching for and removing duplicate records. Detecting and eliminating duplicate database records is one of the major problems in the broad area of data cleansing and data quality: a single real-world entity may appear multiple times in a dataset under different records due to variations in spelling, field formats, etc. One method for approximately ("fuzzy") matching two field values is to compute the Levenshtein distance between the string representations of those values and accept a suitably low-valued result. One indexing technique that allows for this type of matching in a time-sensitive manner is the Deletion Neighborhood. Deletion Neighborhoods are a classic space-time trade-off: you precompute a large index structure so that your later search operations are fast. While Deletion Neighborhoods can help determine whether field values in two records approximately match, the problem of approximately matching the records themselves remains, and doing that quickly is necessary when working at big data scale. Interestingly, the problems of fuzzy deduplication and fuzzy search are essentially the same: the former is approximate matching of records within a dataset, while the latter is approximate matching between a single record -- constructed from a user's web form entries, perhaps -- and the dataset. In this talk, we review string-oriented Deletion Neighborhoods and present a novel application of them, where a similar technique is applied to entire records within a dataset. Further, we show that combining both string- and record-oriented techniques allows for powerful searching and record de-duplication capabilities.
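To make the string-oriented idea concrete, here is a minimal sketch of a Deletion Neighborhood index in Python. It is an illustration under assumed details, not the talk's implementation: each value is expanded into every variant obtainable by deleting up to a fixed number of characters, those variants are stored in a precomputed index (the space side of the trade-off), and a lookup intersects the query's own deletion variants with the index (the fast time side). The function and parameter names (`deletion_neighborhood`, `build_index`, `fuzzy_lookup`, `max_deletions`) are invented for this example.

```python
from itertools import combinations


def deletion_neighborhood(s, max_deletions=1):
    """All variants of s with up to max_deletions characters removed
    (including s itself)."""
    neighborhood = {s}
    for d in range(1, max_deletions + 1):
        for idxs in combinations(range(len(s)), d):
            neighborhood.add("".join(c for i, c in enumerate(s)
                                     if i not in idxs))
    return neighborhood


def build_index(values, max_deletions=1):
    """Precompute the index: map each deletion variant back to the
    original values that produce it (large structure, built once)."""
    index = {}
    for v in values:
        for variant in deletion_neighborhood(v, max_deletions):
            index.setdefault(variant, set()).add(v)
    return index


def fuzzy_lookup(index, query, max_deletions=1):
    """Fast search: candidates whose deletion neighborhoods intersect
    the query's. Cheap dictionary probes instead of pairwise
    Levenshtein comparisons against every stored value."""
    matches = set()
    for variant in deletion_neighborhood(query, max_deletions):
        matches |= index.get(variant, set())
    return matches
```

For example, indexing `["smith", "smyth", "jones"]` and looking up the misspelled `"smiith"` returns candidates containing `"smith"`, because deleting one `i` from the query lands in the precomputed neighborhood of `"smith"`. Candidates found this way can then be verified with an exact Levenshtein computation if a strict distance bound is required.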
A New Indexing Technique for Quickly Fuzzy-Matching Entire Dataset Records
Instructor
Dan S. Camper
Thaumaturge, HPCC Systems Solutions Lab