Data lakes and data warehouses used to be completely different animals, but now they seem to be merging. A data lake was a single data repository that held all your data for analysis. The data was stored in its native form, at least initially. A data warehouse was an analytic database, usually relational, created from two or more data sources. The data warehouse was typically used to store historical data, most often using a star schema or at least a large set of indexes to support queries.
Data lakes contained a very large amount of data and usually resided on Apache Hadoop clusters of commodity computers, using HDFS (Hadoop Distributed File System) and open source analytics frameworks. Originally, analytics meant MapReduce, but Apache Spark made a huge improvement in processing speed. It also supported stream processing and machine learning, as well as analyzing historical data. Data lakes didn't impose a schema on data until it was used, a process known as schema on read.
Data warehouses tended to have less data, but it was better curated, with a predetermined schema that was imposed as the data was written (schema on write). Since they were designed primarily for fast analysis, data warehouses used the fastest possible storage, including solid-state disks (SSDs) once they were available, and as much RAM as possible. That made the storage hardware for data warehouses expensive.
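To make the difference between schema on read and schema on write concrete, here is a minimal sketch in plain Python. It is an illustration only, not how Hadoop, Spark, or any particular warehouse implements either idea. The sample records, the `sales` table, and the function names are all hypothetical, and an in-memory SQLite database stands in for a warehouse.

```python
import json
import sqlite3

# Hypothetical raw events, kept exactly as they arrived (lake style).
RAW_EVENTS = [
    '{"customer": "ana", "amount": "19.99", "ts": "2023-01-05"}',
    '{"customer": "bo", "amount": "5", "note": "extra field, kept as-is"}',
    '{"customer": "cy", "amount": "not-a-number"}',
]


def schema_on_read(raw_lines):
    """Lake style: store everything raw; interpret fields only when querying."""
    total = 0.0
    for line in raw_lines:
        record = json.loads(line)
        try:
            # The "schema" (amount is a number) is applied here, at read time.
            total += float(record["amount"])
        except (KeyError, ValueError):
            # Malformed rows are discovered only when someone reads them.
            continue
    return total


def schema_on_write(raw_lines):
    """Warehouse style: validate against a fixed schema before storing."""
    conn = sqlite3.connect(":memory:")  # stand-in for a warehouse
    conn.execute(
        "CREATE TABLE sales (customer TEXT NOT NULL, amount REAL NOT NULL)"
    )
    rejected = 0
    for line in raw_lines:
        record = json.loads(line)
        try:
            # Rows must fit the predetermined schema or they are rejected.
            row = (str(record["customer"]), float(record["amount"]))
        except (KeyError, ValueError):
            rejected += 1
            continue
        conn.execute("INSERT INTO sales VALUES (?, ?)", row)
    (total,) = conn.execute("SELECT SUM(amount) FROM sales").fetchone()
    conn.close()
    return total, rejected
```

In this sketch both approaches arrive at the same total, but the bad record is caught at different times: at load time in the warehouse-style path, and only at query time in the lake-style path, which also keeps the unused extra fields around for later.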
Databricks was founded by the people behind Apache Spark, and the company still contributes heavily to the open source Spark project. Databricks has also contributed several other products to open source, including MLflow, Delta Lake, Delta Sharing, Redash, and Koalas.
This review is about Databricks' current commercial cloud offering, Databricks Lakehouse Platform. Lakehouse, as you might guess, is a portmanteau of data lake and data warehouse. The platform essentially adds fast SQL, a data catalog, and analytics capabilities to a data lake. It has the functionality of a data warehouse without the need for expensive storage.