The whole industry is talking about the data lakehouse, but what is it on a technical level, beyond all the hype? At the data management level, a data lakehouse combines the best elements of data lakes and data warehouses: it delivers the reliability, strong governance, and performance of a data warehouse together with the openness, flexibility, and machine learning support of a data lake. Open source projects such as Delta Lake (https://github.com/delta-io) turn your data lake into a data lakehouse and bring back ACID transactions, schema enforcement, upserts, efficient metadata handling, and even time travel! But how far do you get with open source alone? Does it support streaming data? And what improvements can we expect in the near future?

With a focus on streaming data, this presentation explores the open source table format, data ingestion, data pipelines, data quality, workflows, streaming data analysis, and machine learning on the lakehouse. I will conclude with an outlook on Project Lightspeed, which brings predictable low latencies to Apache Spark Structured Streaming. Expect lots of code, including the continuous ingestion of a live Twitter stream with a declarative, auto-scaling data pipeline for sentiment analysis with Hugging Face.

This talk is for data architects who are not afraid of some code, for data engineers who love open source and cloud services, and for practitioners who enjoy a fun end-to-end demo. The Databricks Lakehouse is used for the demos.
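The mechanism that makes ACID transactions, upserts, and time travel possible on top of plain data-lake files is an ordered transaction log of file-level add/remove actions, which readers replay to reconstruct the table at any version. The stdlib-only Python toy below sketches that idea; `TinyDeltaLog` and its methods are hypothetical illustrations, not Delta Lake's actual API or on-disk format.

```python
class TinyDeltaLog:
    """Toy, in-memory sketch of a Delta-style transaction log.

    Real Delta Lake persists one JSON commit file per transaction
    under _delta_log/; here each commit is just a list of actions.
    """

    def __init__(self):
        self.commits = []  # commits[v] = list of actions in version v

    def commit(self, actions):
        """Atomically append one transaction; return its version number."""
        self.commits.append(list(actions))
        return len(self.commits) - 1

    def snapshot(self, version=None):
        """Replay the log up to `version` to get the set of live data files.

        Passing an older version number is the essence of time travel.
        """
        if version is None:
            version = len(self.commits) - 1
        files = set()
        for actions in self.commits[: version + 1]:
            for action in actions:
                if "add" in action:
                    files.add(action["add"])
                elif "remove" in action:
                    files.discard(action["remove"])
        return files


log = TinyDeltaLog()
v0 = log.commit([{"add": "part-000.parquet"}])
v1 = log.commit([{"add": "part-001.parquet"}])
# An "upsert" rewrites an affected file: remove the old copy and add
# the rewritten one in a single atomic commit.
v2 = log.commit([{"remove": "part-000.parquet"},
                 {"add": "part-002.parquet"}])

print(sorted(log.snapshot()))    # current table state
print(sorted(log.snapshot(v0)))  # time travel back to version 0
```

Because readers only ever see complete commits, a query observes either the table before or after the upsert, never a half-applied state; that is the transactional guarantee the abstract refers to.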