Data Engineering Summit 2023 Sessions

    Data Engineering Talks

    • Spark, Cloud, DBT, Data Warehouses by Navdeep Kaur

    • Leveraging Data in Motion in a Cloud-first World by Jun Rao

    • Automated Data Classification by Alex Gorelik

    • Reliable Pipelines and High Quality Data Without The Toil by Kyle Kirwan

    • Streaming Featurization with Ibis, Substrait and Apache Arrow by Wes McKinney and David Palaitis

    • Beyond Monitoring: The Rise of Data Observability by Lior Gavish

    • Thrive in the Data Tooling Tornado: Lead, Hire, and Execute Better by Escaping Older Industrial Antipatterns by Adam Breindel

    • From BI to AI: Lakehouse is the modern data architecture by Vini Jaiswal

    • Applying Engineering Best Practices in Data Lakes Architectures by Einat Orr

    • Assessing Data Quality: The 3 Facets of Data 'Fitness' by Susan McGregor

    • Demystifying Data Mesh - Tackling common misconceptions about Data Mesh by Wannes Rosiers

    • Building a Data Mesh - Strategies and Best Practices for Navigating the Implementation of a Data Mesh by Hajar Khizou

    • Data: Planning to Implementation by Balaji Raghunathan

    • Getting into Data Engineering by Joe Reis

Beyond Monitoring: The Rise of Data Observability

Broken data is costly, time-consuming, and nowadays an all-too-common reality for even the most advanced data teams. In this talk, I’ll introduce this problem, called “data downtime” (periods of time when data is partial, erroneous, missing, or otherwise inaccurate), and discuss how to eliminate it in your data ecosystem with end-to-end data observability. Drawing parallels to application observability in software engineering, data observability is a critical component of the modern DataOps workflow and the key to ensuring data trust at scale. I’ll share why data observability matters for building a better data quality strategy and highlight tactics you can use to start addressing data downtime today.

Shane Murray

Field CTO | Monte Carlo

Streaming Featurization with Ibis, Substrait and Apache Arrow

In this talk, you'll learn how Two Sigma and Voltron Data are collaborating to improve the performance of featurization workflows using the Ibis, Substrait, and Apache Arrow software stack. Wes McKinney and David Palaitis have been working together since 2016 on the design and implementation of high-performance data engines for processing unstructured, high-volume streaming datasets for use in machine learning algorithms. While Palaitis has focused on using these tools to support machine learning at Two Sigma, Wes has built a new business to support the open source computing libraries that are critical to high-performance featurization for quant finance workloads.
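
To make the stack concrete, here is a minimal, hypothetical sketch (not taken from the talk): a small featurization expression written with Ibis over Arrow data. It assumes a recent ibis-framework install with its default DuckDB backend plus pyarrow; the table and column names are made up.

```python
# Illustrative only: a small featurization expression with Ibis over Arrow data.
# Assumes `pip install ibis-framework[duckdb] pyarrow`; names are hypothetical.
import ibis
import pyarrow as pa

# In-memory Arrow table standing in for a slice of a streaming trades feed.
trades = pa.table({
    "symbol": ["AAPL", "AAPL", "MSFT", "MSFT"],
    "price": [189.10, 190.40, 410.20, 409.75],
    "qty": [100, 250, 75, 120],
})

t = ibis.memtable(trades)  # wrap the Arrow table as an Ibis table expression

# Declarative feature definitions: nothing executes until a backend runs the plan.
features = t.group_by("symbol").aggregate(
    avg_price=t["price"].mean(),
    total_qty=t["qty"].sum(),
)

print(features.to_pyarrow())  # execute locally and return the result as Arrow
```

Roughly, Ibis supplies the expression API, Substrait a portable plan format (e.g. via the ibis-substrait package), and Arrow the in-memory data representation, so the feature logic is written once while execution can move between engines.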

Wes McKinney

CTO and Co-Founder | Voltron Data


David Palaitis

Managing Director | Two Sigma

Thrive in the Data Tooling Tornado: Lead, Hire, and Execute Better by Escaping Older Industrial Antipatterns

In this talk, we'll learn why many of the challenges facing data teams result from outdated antipatterns held over from 20th-century industry. These older patterns emphasize efficiency over effectiveness and are not appropriate for 2023, leading to results that are both ineffective and inefficient. We'll look at adjustments in approach that make it easier for data teams to hire, manage, retain, and execute effectively using modern data tooling, all while gaining that sought-after efficiency.

Adam Breindel

Independent Consultant

Applying Engineering Best Practices in Data Lakes Architectures

In this talk, we will show how applying engineering best practices to data lakes is a must: it gives us a safe environment to operate in and produces higher-quality data in less time, so that more of our time is spent on actual data engineering and less on manually plumbing data pipelines.

Einat Orr

CEO and Co-Founder | Treeverse

Demystifying Data Mesh - Tackling common misconceptions about Data Mesh

Data Mesh is moving forward on its hype cycle, and more and more vendors are branding their products as data mesh solutions. Inherently, this is wrong: data mesh is about federating responsibilities by acknowledging a distributed landscape. In this session, we will address more of these misconceptions and return to the core concepts of data mesh.

Wannes Rosiers

Product Manager | Ratio

From BI to AI: Lakehouse is the modern data architecture

Lakehouse combines the strengths of data warehouses and data lakes into a single system. Data teams can accelerate their use cases by working with one system rather than accessing multiple systems, which eliminates data silos and data duplication while offering reliability and cost efficiency. Lakehouse is based on open formats such as Delta Lake, which provides support for advanced analytics and AI with performance and reliability guarantees. In this talk, we will cover the evolution of modern data architecture and its foundational principles, and walk through some production examples of lakehouses.
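
To illustrate the open-format point, here is a minimal sketch (not from the talk) of writing and reading a Delta table with PySpark. It assumes the pyspark and delta-spark packages are installed and runs locally; the path and schema are made up.

```python
# Illustrative only: create and query a Delta Lake table from PySpark.
# Assumes `pip install pyspark delta-spark`; path and schema are hypothetical.
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

events = spark.createDataFrame(
    [("u1", "click", "2023-05-01"), ("u2", "purchase", "2023-05-01")],
    ["user_id", "event_type", "event_date"],
)

# Delta stores the data as Parquet plus a transaction log, which is what gives
# the lakehouse its ACID and time-travel guarantees on top of object storage.
events.write.format("delta").mode("overwrite").save("/tmp/events_delta")

(spark.read.format("delta").load("/tmp/events_delta")
    .groupBy("event_type").count().show())
```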


Vini Jaiswal

Developer Advocate | Databricks

Data: Planning to Implementation

How can businesses leverage big data, fast data, traditional data, and modern data for decision-making? How can businesses realize value from data? What are the capabilities needed for enterprise data management? “Data: Planning to Implementation” provides a strategic perspective on the "why, what, where, when, how, and whom" of data management across industries.

Balaji Raghunathan
VP of Digital Experience | ITC Infotech

Reliable Pipelines and High Quality Data Without The Toil

Bad data sucks. But it's a struggle keeping data fresh and high quality as pipelines get complicated. Data observability is the ability to understand what’s happening to, and within, your pipelines at all times. It enables data engineers to identify pipeline issues sooner, spot pipeline performance opportunities more easily, and reduce toilsome maintenance work. Data observability techniques were pioneered by large-scale data teams at companies like Uber, Airbnb, and Intuit, but today they’re accessible to teams of nearly any size. In this talk you’ll hear about the history of data quality testing and data observability inside Uber, the differences between data observability and other methods like data pipeline tests, how techniques developed there can be applied by data engineers anywhere, and an overview of both commercially available and open source tools available today.
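
For intuition, here is a toy sketch of the kind of metadata checks observability builds on (not any vendor's product): collecting freshness and volume metrics for a table and flagging values outside expectations. SQLite keeps the example self-contained; the table name and thresholds are hypothetical.

```python
# Illustrative only: collect basic freshness and volume metrics for one table
# and flag anomalies. Uses SQLite to stay self-contained; a real setup would
# run checks like these against the warehouse on a schedule.
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
conn.execute("INSERT INTO orders VALUES (1, ?), (2, ?)",
             (datetime.now(timezone.utc).isoformat(),
              datetime.now(timezone.utc).isoformat()))

row_count, latest = conn.execute(
    "SELECT COUNT(*), MAX(updated_at) FROM orders"
).fetchone()

# Hypothetical expectations for this table.
MIN_ROWS = 1
MAX_STALENESS = timedelta(hours=6)

staleness = datetime.now(timezone.utc) - datetime.fromisoformat(latest)
issues = []
if row_count < MIN_ROWS:
    issues.append(f"volume anomaly: only {row_count} rows")
if staleness > MAX_STALENESS:
    issues.append(f"freshness anomaly: last update {staleness} ago")

print(issues or "orders looks healthy")
```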

Kyle Kirwan
Co-Founder and CEO | Bigeye

Automated Data Classification

Automating Data Classification is key to a successful data privacy program. Data privacy policies apply to specific types of data, and without knowing which datasets contain this regulated data, it is impossible to protect it. In any vast and dynamic data estate, manual labeling or classification of data is impractical. This talk will cover the challenges and different approaches for automating data classification.
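
As a simplified illustration of one approach (rules over sampled values, not the full method covered in the talk), here is a toy pattern-based classifier; the patterns, labels, and threshold are hypothetical.

```python
# Illustrative only: a naive pattern-based data classifier. Real systems combine
# rules with column-name signals, ML models, and human review.
import re

PATTERNS = {
    "EMAIL": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$"),
    "US_SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "PHONE": re.compile(r"^\+?\d[\d\s().-]{7,}$"),
}

def classify_column(sample_values, threshold=0.8):
    """Label a column if most sampled values match a sensitive-data pattern."""
    for label, pattern in PATTERNS.items():
        matches = sum(bool(pattern.match(str(v))) for v in sample_values if v)
        if sample_values and matches / len(sample_values) >= threshold:
            return label
    return "UNCLASSIFIED"

print(classify_column(["ada@example.com", "bob@example.org"]))  # EMAIL
print(classify_column(["123-45-6789", "987-65-4321"]))          # US_SSN
```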

Alex Gorelik
Distinguished Engineer | LinkedIn

Getting into Data Engineering

Curious about becoming a data engineer? This talk will cover the key things you should consider about a career in data engineering, particularly against the backdrop of 2023's economic climate.

Joe Reis
CEO | Ternary Data

Leveraging Data in Motion in a Cloud-first World

Apache Kafka has emerged as the de facto standard event streaming platform in enterprise architectures. Many business applications are moving away from data-at-rest to an event-driven architecture so that they can leverage data in real time as new events occur. More than 80% of the Fortune 100 are building their businesses on this new platform. In this talk, I will first share the story behind Kafka: how it was invented, what problem it was trying to solve, and how it has been evolving. Then, I will talk about how making Kafka cloud native creates new opportunities for building one system of record, and cover some real-world use cases.
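
As a minimal sketch of the data-in-motion model (not from the talk), the snippet below produces and consumes one event with the confluent-kafka Python client, assuming a broker at localhost:9092; the topic and payload are made up.

```python
# Illustrative only: produce and consume one event with the confluent-kafka client.
# Assumes a broker at localhost:9092 and an existing (or auto-created) topic.
import json
from confluent_kafka import Producer, Consumer

producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce(
    "orders",                                   # topic
    key="order-1001",
    value=json.dumps({"order_id": 1001, "status": "created"}),
)
producer.flush()  # block until the broker acknowledges the event

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "orders-readers",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])

msg = consumer.poll(10.0)  # wait up to 10 seconds for the event
if msg is not None and msg.error() is None:
    print(msg.key(), msg.value())  # react to the event as it arrives
consumer.close()
```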

Jun Rao
Co-Founder | Confluent

Spark, Cloud, DBT, Data Warehouses

We are going to discuss current data engineering trends and how the industry is moving toward a new data stack. I will first discuss the tech stack that most companies are using today, why there is a need for a shift, and how that stack is moving toward data warehouses and delta lakes.

Navdeep Kaur
Founder | Techno Avengers

Assessing Data Quality: The 3 Facets of Data "Fitness"

While most of us are used to assessing the quality of data for gaps, errors, and other data integrity problems, understanding whether the information we have is "fit" for our intended purpose can be a little trickier. In this session, we'll cover the three essential facets of data "fitness" that can help you ensure that your data can really give you the answers you want.

Susan McGregor

Associate Research Scholar | Columbia University

Building a Data Mesh: Strategies and Best Practices for Navigating the Implementation of a Data Mesh

Data mesh is a new approach to thinking about data, based on a distributed architecture for data management that promotes decentralized ownership and control of data assets. It emphasizes the use of domain-driven design and self-service access to data, with the goal of improving the quality and usability of data for business decision-making. In this talk, we will explore the principles and practices of data mesh and how to implement it in an organization.

Hajar Khizou

Lead Data Engineer | SustainCERT