Description

Reconciling data from disparate datasets can be a tricky and time-consuming process. Even when the data points refer to the same real-world entities, different data sources may use different conventions for describing their properties. For example, maintaining a global, up-to-date, and accurate dataset of infections and tests related to the COVID-19 pandemic is a challenging task, in part due to the different taxonomies that distinct nations and municipalities used to classify outcomes.

Ontologies, which are collections of class and relationship definitions, offer a simple solution to this problem. Data scientists can align disparate taxonomies with a centralized ontology, a “source of truth” for data classification, and unify their datasets in a consolidated hierarchy. As new datasets are added, they can be matched to the same ontology and reconciled with the existing data. Automating this process ensures that the datasets are unified in a consistent manner and reduces the risk of discrepancies arising from manual data curation. While manual taxonomy alignment may be easier in the short term, maintaining a process for ongoing taxonomy reconciliation is the only effective long-term solution.
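As a rough illustration of the idea above, the following sketch represents an ontology as a set of classes with parent relationships and aligns free-form source labels to it. Everything here is an assumption made for illustration: the class names, the synonym table, and the `normalize`, `align`, and `lineage` helpers are hypothetical, and the fuzzy matching uses Python's standard `difflib` module rather than any method confirmed by the session.

```python
from difflib import get_close_matches

# Hypothetical ontology: each class maps to its parent class (None marks the root).
ONTOLOGY = {
    "Product": None,
    "Electronics": "Product",
    "Laptop": "Electronics",
    "Mobile Phone": "Electronics",
    "Apparel": "Product",
    "Footwear": "Apparel",
}

# Hypothetical synonyms (already lowercased) that point at ontology classes.
SYNONYMS = {
    "notebook computer": "Laptop",
    "cell phone": "Mobile Phone",
    "smartphone": "Mobile Phone",
    "shoes": "Footwear",
    "clothing": "Apparel",
}


def normalize(label):
    """Lowercase a label and collapse underscores, hyphens, and extra spaces."""
    return " ".join(label.lower().replace("_", " ").replace("-", " ").split())


def align(label, cutoff=0.8):
    """Return the ontology class for a source label, or None if nothing is close enough."""
    lookup = {normalize(cls): cls for cls in ONTOLOGY}
    lookup.update(SYNONYMS)
    key = normalize(label)
    if key in lookup:
        return lookup[key]
    # Fall back to approximate string matching against known names and synonyms.
    close = get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
    return lookup[close[0]] if close else None


def lineage(cls):
    """Return the path from the root of the ontology down to the given class."""
    path = []
    while cls is not None:
        path.append(cls)
        cls = ONTOLOGY[cls]
    return path[::-1]


if __name__ == "__main__":
    for raw in ["Laptops", "cell_phone", "Groceries"]:
        matched = align(raw)
        print(raw, "->", matched, lineage(matched) if matched else "(unmatched)")
```

Labels that fall below the similarity cutoff are returned as unmatched rather than forced into a class, which leaves room for a manual review step.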

In this session, we demonstrate how taxonomies from distinct datasets can be quickly reconciled and unified using a centralized ontology. As an example, we extract the taxonomies used in two open-source retail product datasets and align them with a common retail ontology. We also use knowledge graph visualizations to show the impact of cross-dataset standardization. Finally, we discuss how this unification pipeline can be deployed at scale, using either open-source Python libraries or commercial graph platforms such as Neo4j.
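The workflow described above can be sketched in a few lines of standard-library Python. This is a minimal sketch under stated assumptions, not the session's actual pipeline: the two stores, their category labels, the `CROSSWALK` table, and the `unify`, `rollup`, and `triples` helpers are all hypothetical. It maps records from two datasets onto shared ontology classes, rolls counts up the hierarchy, and emits subject-predicate-object triples of the kind a knowledge graph tool could load.

```python
from collections import Counter

# Hypothetical ontology: child class -> parent class ("Product" is the root).
PARENT = {
    "Electronics": "Product",
    "Laptop": "Electronics",
    "Mobile Phone": "Electronics",
    "Footwear": "Product",
}

# Hypothetical crosswalk: (dataset, source label) -> ontology class.
CROSSWALK = {
    ("store_a", "Notebooks & Laptops"): "Laptop",
    ("store_a", "Cell Phones"): "Mobile Phone",
    ("store_a", "Shoes"): "Footwear",
    ("store_b", "computers/laptop"): "Laptop",
    ("store_b", "phones/smartphone"): "Mobile Phone",
    ("store_b", "sneakers"): "Footwear",
}

# Hypothetical records: (dataset, item id, source category label).
RECORDS = [
    ("store_a", "A-1", "Notebooks & Laptops"),
    ("store_a", "A-2", "Cell Phones"),
    ("store_a", "A-3", "Shoes"),
    ("store_b", "B-1", "computers/laptop"),
    ("store_b", "B-2", "phones/smartphone"),
    ("store_b", "B-3", "sneakers"),
    ("store_b", "B-4", "garden/hose"),  # no crosswalk entry: stays unmatched
]


def unify(records):
    """Split records into (item id, ontology class) pairs and unmatched leftovers."""
    unified, unmatched = [], []
    for dataset, item_id, label in records:
        cls = CROSSWALK.get((dataset, label))
        if cls is None:
            unmatched.append((dataset, item_id, label))
        else:
            unified.append((item_id, cls))
    return unified, unmatched


def rollup(unified):
    """Count items per ontology class, crediting every ancestor of each class."""
    counts = Counter()
    for _, cls in unified:
        while cls is not None:
            counts[cls] += 1
            cls = PARENT.get(cls)
    return counts


def triples(unified):
    """Build (subject, predicate, object) edges for a knowledge graph."""
    edges = [(item_id, "instance_of", cls) for item_id, cls in unified]
    edges += [(child, "subclass_of", parent) for child, parent in PARENT.items()]
    return edges


if __name__ == "__main__":
    unified, unmatched = unify(RECORDS)
    print(rollup(unified))
    print("unmatched:", unmatched)
    for edge in triples(unified):
        print(edge)
```

Keeping the crosswalk as data rather than code means a new dataset only needs new crosswalk rows, and unmatched records surface explicitly instead of being silently dropped.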

Main learning points:

1. The importance of having a unified taxonomy across data sources and the difficulties involved in building that universal taxonomy

2. How to use ontologies to find common ground between disparate taxonomies and align them in a systematic, sustainable way

Instructor's Bio

Elizabeth Michel

Senior Analytics Engineer at Tamr

Elizabeth Michel is a Senior Analytics Engineer at Tamr, a Boston-based enterprise data mastering software company. She graduated from Dartmouth College in 2019 with a degree in engineering modified with economics. She works to help Tamr’s clients derive analytic value from their mastered data and to integrate that value into Tamr’s core products.

Webinar

A Data Scientist’s Rosetta Stone: Reconciling Disparate Data with Ontologies
