This topic compares Databricks Runtime Delta, as provided in Databricks DataInsight, with the community open source version of Delta Lake.

Databricks Runtime vs Apache Spark

The feature list in the following table comes from the Databricks official website (https://databricks.com/spark/comparing-databricks-to-apache-spark).

| Feature | Apache Spark | Databricks DataInsight |
| --- | --- | --- |
| Built-in file system optimized for cloud storage access (AWS S3, Redshift, Azure Blob) | No | Yes |
| Spark-native fine grained resource sharing for optimum utilization | No | Yes |
| Fault isolation of compute resources | No | Yes |
| Faster writes to OSS | No | Yes |
| Compute optimization during joins and filters | No | Yes |
| Rapid release cycles | No | Yes |
| Auto-scaling compute | No | Coming soon |
| High availability for cluster | No | Coming soon |

Databricks Delta vs Open-source Delta Lake

| Feature | Open Source Delta Lake | Databricks Delta |
| --- | --- | --- |
| Snapshot Isolation / Transactional Guarantees | Yes | Yes |
| Efficient directory / File listing | Yes | Yes |
| Version history and time travel | Yes | Yes |
| Schema evolution & enforcement | Yes | Yes |
| Hidden partitions / Partitioning by expressions | In Roadmap | In Roadmap |
| HDFS Support | Yes | Yes |
| Object Storage Support | Yes | Yes |
| Streaming Data Sink | Yes | Yes |
| Streaming Data Source | Yes | Yes |
| Basic Upsert (merge into) | Yes | Yes |
| Scalable Upsert (merge into) | No | Yes |
| Data skipping based on stats | No | Yes |
| Compact small files | Yes | Yes |
| Optimize (efficiently compact small files) | No | Yes |
| Auto Optimize | No | Yes |
| Native Parquet Reader | No | Yes |
| Local SSD Caching | No | No |
| Read from Presto | Yes | Yes |
| Read from Hive | In Roadmap | In Roadmap |
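
The "Data skipping based on stats" row is commonly understood to mean that per-file minimum and maximum column values are recorded, so that a query can skip files whose value range cannot match its filter. The following is a minimal pure-Python sketch of that general idea only. It is not the Databricks Delta implementation, and the names `FileStats` and `prune_files`, as well as the file names and value ranges, are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file statistics for a single integer column."""
    path: str
    min_value: int
    max_value: int


def prune_files(files: List[FileStats], lower: int, upper: int) -> List[str]:
    """Return the paths of files whose [min, max] range overlaps [lower, upper].

    Files outside the requested range cannot contain matching rows,
    so a reader could skip them without opening them.
    """
    return [
        f.path
        for f in files
        if f.max_value >= lower and f.min_value <= upper
    ]


# Illustrative statistics for three data files (assumed values).
files = [
    FileStats("part-0000.parquet", 1, 100),
    FileStats("part-0001.parquet", 101, 200),
    FileStats("part-0002.parquet", 201, 300),
]

# A filter such as `id BETWEEN 150 AND 220` overlaps only the last two files.
selected = prune_files(files, 150, 220)
print(selected)
```

In this sketch the first file is never read for the example filter, which is the effect the feature name describes. How well real workloads benefit depends on how the data is laid out across files.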