Spark, Cloud, DBT, Data warehouses by Navdeep Kaur
Leveraging Data in Motion in a Cloud-first World by Jun Rao
Automated Data Classification by Alex Gorelik
Reliable Pipelines and High Quality Data Without The Toil by Kyle Kirwan
Streaming Featurization with Ibis, Substrait and Apache Arrow by Wes McKinney and David Palaitis
Beyond Monitoring: The Rise of Data Observability by Lior Gavish
Thrive in the Data Tooling Tornado: Lead, Hire, and Execute Better by Escaping Older Industrial Antipatterns by Adam Breindel
From BI to AI: Lakehouse is the modern data architecture by Vini Jaiswal
Applying Engineering Best Practices in Data Lakes Architectures by Einat Orr
Assessing Data Quality: The 3 Facets of Data 'Fitness' by Susan McGregor
Demystifying Data Mesh - Tackling common misconceptions about Data Mesh by Wannes Rosiers
Building a Data Mesh - Strategies and Best Practices for Navigating the Implementation of a Data Mesh by Hajar Khizou
Data: Planning to Implementation by Balaji Raghunathan
Getting into Data Engineering by Joe Reis
Broken data is costly, time-consuming, and nowadays, an all-too-common reality for even the most advanced data teams. In this talk, I’ll introduce this problem, called “data downtime” (periods of time when data is partial, erroneous, missing or otherwise inaccurate), and discuss how to eliminate it in your data ecosystem with end-to-end data observability. Drawing parallels to application observability in software engineering, I’ll show that data observability is a critical component of the modern DataOps workflow and the key to ensuring data trust at scale. I’ll share why data observability matters when it comes to building a better data quality strategy and highlight tactics you can use to address data downtime today.
Field CTO | Monte Carlo
In this talk, you'll learn how Two Sigma and Voltron Data are collaborating to improve the performance of featurization workflows using the Ibis, Substrait, and Arrow software stack. Wes McKinney and David Palaitis have been working together since 2016 on the design and implementation of high-performance data engines for processing unstructured, high-volume, streaming datasets for use in machine learning algorithms. While Palaitis has focused on using these tools to support machine learning at Two Sigma, McKinney has built out a new business to support the open source computing libraries that are critical to high-performance featurization for quant finance workloads.
CTO and Co-Founder | Voltron Data
Managing Director | Two Sigma
In this talk, we'll learn why many of these challenges result from outdated anti-patterns held over from 20th-century industry. These older patterns emphasize efficiency over effectiveness and are not appropriate for 2023 -- leading to results that are both ineffective and inefficient. We'll look at adjustments in approach that make it easier for data teams to hire, manage, retain, and execute effectively using modern data tooling -- all while gaining that sought-after efficiency.
In this talk, we will show why applying engineering best practices to data lakes is a must: they provide a safe environment to operate in, one that produces higher-quality data in less time. More of our time will be spent on actual data engineering and less on manual plumbing of data pipelines.
CEO and Co-Founder | Treeverse
Data mesh is moving forward on its hype cycle. More and more vendors are calling their products data mesh solutions. This is inherently wrong: data mesh is about federating responsibilities by acknowledging a distributed landscape. In this session, we will address more misconceptions and return to the core concepts of data mesh.
Product Manager | Ratio
Lakehouse combines the strengths of data warehouses and data lakes in a single system. Data teams can accelerate their use cases because they work in one system rather than across several, which eliminates data silos and data duplication and offers reliability and cost efficiency. Lakehouse is based on open formats such as Delta Lake, which provides support for advanced analytics and AI with performance and reliability guarantees. In this talk, we will cover the evolution of modern data architecture, its foundational principles, and some production examples of lakehouses.
Developer Advocate | Databricks
How can businesses leverage big data, fast data, traditional data and modern data for decision making? How can businesses realize value from data? What are the capabilities needed for enterprise data management? “Data: Planning to Implementation” will provide a strategic perspective on the “why, what, where, when, how and whom” of data management across industry.
VP of Digital Experience | ITC Infotech
Bad data sucks. But it's a struggle keeping data fresh and high quality as pipelines get complicated. Data observability is the ability to understand what’s happening to, and within, your pipelines at all times. It enables data engineers to identify pipeline issues sooner, spot pipeline performance opportunities more easily, and reduce toilsome maintenance work. Data observability techniques were pioneered by large-scale data teams at companies like Uber, Airbnb, and Intuit. But today they’re accessible to teams of nearly any size. In this talk you’ll hear about the history of data quality testing and data observability inside Uber, the differences between data observability and other methods like data pipeline tests, how techniques developed there can be applied by data engineers anywhere, and an overview of the commercial and open source tools available today.
Automating data classification is key to a successful data privacy program. Data privacy policies apply to specific types of data, and without knowing which datasets contain this regulated data, it is impossible to protect it. In any vast and dynamic data estate, manual labeling or classification of data is impractical. This talk will cover the challenges of automating data classification and the different approaches to it.
Apache Kafka has emerged as the de facto standard event streaming platform in enterprise architectures. Many business applications are moving away from data-at-rest to an event-driven architecture so that they can leverage data in real time as new events occur. More than 80% of the Fortune 100 are building their businesses on this new platform. In this talk, I will first share the story behind Kafka: how it was invented, what problem it was trying to solve and how it has been evolving. Then, I will talk about how making Kafka cloud-native creates new opportunities for building one system of record, and share some real-world use cases.
We are going to discuss current data engineering trends and how the industry is moving toward a new data stack. I will first discuss the tech stack that most companies are using today, then why there is a need for a shift and how that stack is moving towards data warehouses and delta lakes.
While most of us are used to assessing the quality of data for gaps, errors, and other data integrity problems, understanding whether the information we have is "fit" for our intended purpose can be a little trickier. In this session, we'll cover the three essential facets of data "fitness" that can help you ensure that your data can really give you the answers you want.
Associate Research Scholar | Columbia University
Data mesh is a new approach to thinking about data, based on a distributed architecture for data management that promotes decentralized ownership and control of data assets. It emphasizes the use of domain-driven design and self-service access to data, with the goal of improving the quality and usability of data for business decision-making. In this talk, we will explore the principles and practices of data mesh and how to implement it in an organization.
Lead Data Engineer | SustainCERT