Session Overview

Incomplete data is a long-standing, pervasive problem across science and engineering, arising for a variety of reasons. It hampers reliable data-driven research and trustworthy critical decision-making. The scientific and engineering research communities, as well as data science, lack curing tools designed for large incomplete data, while existing methods and theories often require complex distributional assumptions or difficult intervention by statistical experts. Naïve data-curing methods are therefore still widely used in data science and machine learning.
To meet this challenge in the emerging era of machine learning and data, our team combined the theory of fractional hot-deck imputation (FHDI), computational statistics, and parallel computing to cure ultra incomplete data, i.e., data that are concurrently big-n and big-p, with a very large number of instances and high dimensionality. The resulting ultra data-oriented FHDI is named UP-FHDI. The parallel program and source code of UP-FHDI are publicly available to benefit broader audiences in science, engineering, and data science. Measuring the uncertainty of the cured data is another important issue. Instead of the computationally expensive parallel Jackknife method, UP-FHDI assesses uncertainty with a computationally efficient parallel linearization technique. Results confirm that UP-FHDI can handle diverse ultra data with up to millions of instances and more than 10,000 variables. Further scale-up now depends on the amount of memory and computing power available. We also show that UP-FHDI has a positive impact on the prediction performance of subsequent deep learning. We believe that this achievement will catalyze large/big data-driven science and engineering, where large incomplete data pose a daunting challenge to advanced machine learning and statistical prediction.


  • 1

    Assumption-free, General-purpose Ultra Large Incomplete Data Curing


