# deltalake-questions
  • s

    Sireesha Madabhushi

    10/03/2023, 6:52 AM
    Hi all, I have a Delta table partitioned by date and channel ID. The very first call to load data for a channel takes a lot of time (> 10 seconds). Subsequent calls to get data for different channels over the same time period are faster, less than 1 second. Does this have anything to do with how the data is stored? Any leads are welcome.
  • p

    Perfect Stranger

    10/03/2023, 1:11 PM
    In Delta Lake, what happens if a reader is reading a snapshot that is being vacuumed? Say Trino started reading from Delta Lake, and 5 seconds later we started vacuuming the snapshot that Trino is reading. Will Trino's query fail and need to be restarted? Is it guaranteed that it will fail rather than read partial data from the snapshot in question? Or is there a lock that prevents vacuuming snapshots that are being read from?
  • s

    Sardar Khan

    10/03/2023, 7:08 PM
    I am trying to create a new Delta table with a DataFrame and enable Iceberg at the same time:
    df.write.format("delta").mode("overwrite") \
        .option("delta.columnMapping.mode", "name") \
        .option("delta.enableIcebergCompatV1", "true") \
        .option("delta.universalFormat.enabledFormats", "iceberg") \
        .save("s3a://cof-card-data-iceberg-research-qa/skDummyTest/")
    I am getting the following error in return:
    23/10/03 18:54:52 ERROR DeltaLog: Failed to find Iceberg converter class
    java.lang.ClassNotFoundException: org.apache.spark.sql.delta.icebergShaded.IcebergConverter
    Am I missing a Spark jar of some sort? Has anyone dealt with this before?
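    The converter class in that error appears to live in the separate delta-iceberg artifact rather than in delta-spark, so one thing to check is whether that jar is on the classpath. A minimal sketch of a session setup that pulls it in, assuming Delta 3.x (the artifact versions below are placeholders to match your Spark/Delta build):

    from pyspark.sql import SparkSession

    # Sketch only: pulls in delta-spark plus the delta-iceberg jar that contains the
    # shaded Iceberg converter. Adjust the Scala/Delta versions to your environment.
    spark = (
        SparkSession.builder
        .appName("uniform-test")
        .config("spark.jars.packages",
                "io.delta:delta-spark_2.12:3.0.0,io.delta:delta-iceberg_2.12:3.0.0")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )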
  • r

    Rahul Goswami

    10/04/2023, 5:46 AM
    Hi, I am looking to host a Delta Sharing server (provider) for a Databricks recipient. Is it possible to share files using Delta Sharing, or does it only support sharing data as Delta tables?
  • q

    quanns

    10/04/2023, 7:08 AM
    DATA ENCRYPTION ON DELTA LAKE
    Hi, I'm trying to apply data encryption for PII data in my organization using Spark and delta-lake. For parquet, columnar data encryption is supported by integrating a KMS client in the Spark application. Because delta-lake is based on parquet, I thought columnar encryption would also work with delta-lake, but it does not. I tried to write data to HDFS in delta format with the same configuration I used for parquet, and the output data is not encrypted (I can read it with pandas).
    • Are there any configurations I need to use to apply columnar encryption on delta-lake?
    • If delta-lake doesn't support this feature, are there any solutions that work the same way as columnar encryption in parquet (keeping all the advantages of parquet, such as pushed-down filters)?
    👀 1
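    For context, plain Parquet columnar encryption in Spark is normally driven by Hadoop configuration such as the snippet below; whether Delta's Parquet writer honours these settings is exactly the open question here, so treat this only as a sketch of the Parquet-side setup (the key IDs, key material and column names are made up for illustration):

    # Standard Parquet modular encryption settings (Spark 3.2+ / parquet-mr 1.12+).
    hconf = spark.sparkContext._jsc.hadoopConfiguration()
    hconf.set("parquet.crypto.factory.class",
              "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
    # Replace the mock in-memory KMS with your real KMS client class.
    hconf.set("parquet.encryption.kms.client.class",
              "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
    hconf.set("parquet.encryption.key.list",
              "keyA:AAECAwQFBgcICQoLDA0ODw==, keyB:AAECAAECAAECAAECAAECAA==")
    hconf.set("parquet.encryption.column.keys", "keyA:ssn,credit_card")  # placeholder PII columns
    hconf.set("parquet.encryption.footer.key", "keyB")

    # df is any existing DataFrame containing the PII columns named above.
    # The question is whether these settings are honoured when the format is "delta" instead of "parquet".
    df.write.format("delta").save("/tmp/encryption_test_delta")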
  • m

    Marius Grama

    10/04/2023, 3:06 PM
    Can anyone explain to me what the WriteSerializable isolation level in Delta Lake actually means? https://docs.databricks.com/en/optimizations/isolation-level.html#write-serializable-vs-serializable-isolation-levels
    👍 1
  • v

    Vaiva

    10/05/2023, 6:06 AM
    Hi, a multi-hop / medallion architecture question. Does schema validation / enforcement happen in silver/gold, or bronze as well?
  • p

    Pranit Sherkar

    10/05/2023, 8:47 AM
    Hello, I am trying to implement DLT pipelines on Databricks. I have done this before and am aware of the syntax. The new requirement is to load or stream this data back into a MongoDB database. Is that possible? We are not fixated on MongoDB as such, but the idea is to use the DLT pipeline's change data capture to propagate changes as updates/inserts into a Mongo database. Let me know if anyone has done something similar.
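    One pattern that could fit (outside DLT itself) is a downstream Structured Streaming job that reads the change data feed of the table the DLT pipeline maintains and pushes each micro-batch to MongoDB via foreachBatch. A sketch, assuming the MongoDB Spark connector 10.x; the table name, URI, database and collection are placeholders:

    def write_batch_to_mongo(batch_df, batch_id):
        # Push one micro-batch of changes to MongoDB (connector 10.x style options).
        (batch_df.write
            .format("mongodb")
            .mode("append")
            .option("connection.uri", "mongodb://host:27017")   # placeholder
            .option("database", "analytics")                    # placeholder
            .option("collection", "gold_changes")               # placeholder
            .save())

    (spark.readStream
        .format("delta")
        .option("readChangeFeed", "true")       # table needs delta.enableChangeDataFeed = true
        .table("gold.my_dlt_output")            # placeholder: table produced by the DLT pipeline
        .writeStream
        .foreachBatch(write_batch_to_mongo)
        .option("checkpointLocation", "/tmp/chk/mongo_sync")
        .start())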
  • b

    Benny Elgazar

    10/05/2023, 7:07 PM
    Every MERGE INTO I run ends up adding a file to the _delta_log, from 000000000000000001.json to 00000000000000n.json, but always when it reaches checkpoint number 10 I additionally get plenty of parquet files. Example:
    2023-10-05 12:42:02       9793 00000000000000000010.checkpoint.0000001044.0000001066.parquet
    2023-10-05 12:42:02       9354 00000000000000000010.checkpoint.0000001045.0000001066.parquet
    2023-10-05 12:42:02       9428 00000000000000000010.checkpoint.0000001046.0000001066.parquet
    2023-10-05 12:42:02       9468 00000000000000000010.checkpoint.0000001047.0000001066.parquet
    2023-10-05 12:42:02       9400 00000000000000000010.checkpoint.0000001048.0000001066.parquet
    2023-10-05 12:42:02       9543 00000000000000000010.checkpoint.0000001049.0000001066.parquet
    2023-10-05 12:42:02       9428 00000000000000000010.checkpoint.0000001050.0000001066.parquet
    2023-10-05 12:42:02       9436 00000000000000000010.checkpoint.0000001051.0000001066.parquet
    2023-10-05 12:42:02       9531 00000000000000000010.checkpoint.0000001052.0000001066.parquet
    2023-10-05 12:42:02       9354 00000000000000000010.checkpoint.0000001053.0000001066.parquet
    2023-10-05 12:42:02      15689 00000000000000000010.checkpoint.0000001054.0000001066.parquet
    2023-10-05 12:42:02       9400 00000000000000000010.checkpoint.0000001055.0000001066.parquet
    2023-10-05 12:42:02       9571 00000000000000000010.checkpoint.0000001056.0000001066.parquet
    2023-10-05 12:42:02       9634 00000000000000000010.checkpoint.0000001057.0000001066.parquet
    2023-10-05 12:42:02       9468 00000000000000000010.checkpoint.0000001058.0000001066.parquet
    2023-10-05 12:42:02       9426 00000000000000000010.checkpoint.0000001059.0000001066.parquet
    2023-10-05 12:42:02       4235 00000000000000000010.checkpoint.0000001060.0000001066.parquet
    2023-10-05 12:42:02       9644 00000000000000000010.checkpoint.0000001061.0000001066.parquet
    2023-10-05 12:42:02       9638 00000000000000000010.checkpoint.0000001062.0000001066.parquet
    2023-10-05 12:42:02       9427 00000000000000000010.checkpoint.0000001063.0000001066.parquet
    2023-10-05 12:42:02       9743 00000000000000000010.checkpoint.0000001064.0000001066.parquet
    2023-10-05 12:42:02       9401 00000000000000000010.checkpoint.0000001065.0000001066.parquet
    2023-10-05 12:42:02       9428 00000000000000000010.checkpoint.0000001066.0000001066.parquet
    When this happens, I can no longer query the table using Athena; the query never returns. Has anyone experienced this issue and knows how to solve it? The only solution I have found is to rewrite the whole table again.
  • b

    Beni

    10/05/2023, 8:33 PM
    Hi, we are using Azure Event Hubs with Databricks 13.3 to ingest about 500 events/minute using (arbitrary stateful operations in) Structured Streaming with a 3 second interval for the micro batches. Each micro batch needs to be joined to some 'reference data' from another Delta Table that is updated by a separate Notebook/Job. The Reference Data Delta table contains about 500 rows and has about 5-10 changes per day but these changes need to be reflected (ideally) in the next 'join' to the next micro-batch following the change. To avoid extra network-related delays and cost (Blob List/Read) by reading from external Azure Blob Storage, we wondered whether creating the Reference Data Delta Table on the Spark Driver's "local_disk0" would be an alternative. We don't need to persist the Reference Data after a Cluster restart as it will be re-generated and re-loaded prior to us starting the Structured Streaming Job. Is using the 'dbfs:/local_disk0' in this instance encouraged or best to be avoided? If it's not encouraged, what would be the alternative? Broadcast Variables that are mutable? Simple in-memory caching or a 'memory sink' (although it's intended for debugging only)? Thank you Beni Sample Delta Table:
    %sql
    CREATE OR REPLACE TABLE Testing(
    Id INTEGER NOT NULL,
    Name STRING NOT NULL)
    USING DELTA
    LOCATION 'dbfs:/local_disk0/tmp/deltattest'
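    One pattern that keeps the reference data fresh without any special storage is to re-read the small reference Delta table inside foreachBatch, so every micro-batch joins against the latest committed version. A rough sketch under that assumption (the paths, the join key and the events_stream DataFrame are placeholders):

    REF_PATH = "dbfs:/mnt/ref/reference_data"   # placeholder reference table location

    def join_with_reference(batch_df, batch_id):
        # ~500 rows, so re-reading per 3-second micro-batch is cheap and always current.
        ref_df = spark.read.format("delta").load(REF_PATH)
        enriched = batch_df.join(ref_df, on="Id", how="left")   # "Id" is a placeholder join key
        enriched.write.format("delta").mode("append").save("dbfs:/mnt/out/enriched")

    # events_stream is the streaming DataFrame read from Event Hubs (not shown here).
    (events_stream.writeStream
        .foreachBatch(join_with_reference)
        .option("checkpointLocation", "dbfs:/mnt/chk/enrich")
        .start())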
  • h

    Henrique Viana

    10/05/2023, 8:51 PM
    Hello, everyone. How do you optimize and use Z-Order on a continuously streamed table (Structured Streaming in Databricks)? What is the recommended best practice: running it once a day in a decoupled flow?
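    For reference, a compaction job decoupled from the streaming write usually boils down to a scheduled OPTIMIZE with ZORDER BY; a minimal sketch where the table name and column are placeholders:

    from delta.tables import DeltaTable

    # Run from a separate scheduled job (e.g. once a day), not inside the streaming query.
    (DeltaTable.forName(spark, "events_silver")    # placeholder table name
        .optimize()
        .executeZOrderBy("device_id"))             # placeholder high-cardinality filter column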
  • a

    Abhishek Shan

    10/05/2023, 8:54 PM
    #DLT Question We have an existing lakehouse where silver tables are queried and loaded into gold tables; it's a 1:M relationship. There is a single notebook created for each gold table, and it has many complex business transformations, mostly SQL. All gold tables follow a merge load pattern. The whole setup is orchestrated by Azure Data Factory and batch updates happen once a day. There are custom components built to log run information and also to set dependencies. Is it worth migrating the setup to DLT, and is it appropriate for such a use case? If we are to migrate to DLT it seems we can't lift and shift the code and instead have to rewrite it. Can we encapsulate all the business transformations into a function that returns the final dataset as a DataFrame and call this function in DLT? Are there ways to do this migration with limited effort?
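    A rough sketch of the "encapsulate the transformation in a function" idea, assuming the Python DLT API; the table and function names are placeholders, and the merge-load aspect is not covered here:

    import dlt
    from pyspark.sql import DataFrame

    def build_gold_orders(silver_df: DataFrame) -> DataFrame:
        # Existing business logic lives here (joins, aggregations, SQL via selectExpr, ...).
        return silver_df.groupBy("customer_id").count()

    @dlt.table(name="gold_orders")               # placeholder gold table name
    def gold_orders():
        # DLT materializes whatever DataFrame the decorated function returns.
        silver_df = dlt.read("silver_orders")    # placeholder silver table name
        return build_gold_orders(silver_df)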
  • a

    Abhishek Shan

    10/05/2023, 9:11 PM
    #Delta Table CI/CD Question We have 70-100 Delta tables to be deployed from a lower environment to production. As part of the CI/CD pipeline we have a notebook that contains all the DDL scripts. When the deployment pipeline is triggered, it runs the DDL notebook. If we need to amend the schema of any deployed Delta table (say, adding a new column), the only way I can see to do it is outside the release process by running ALTER statements. Is there an automatic way to do this as part of CI/CD, like we do in the SQL world where we just update the script and .sln file and schema changes happen automatically?
  • s

    Sahil Shah

    10/06/2023, 10:39 AM
    Hello everyone, when doing a merge on a partitioned table, the merge operations are writing data to a single file. This causes the file size to grow with more data; we ended up with > 10 GB of data in one single file, which is slowing down the write operation with every batch.
  • s

    Samrose

    10/06/2023, 9:36 PM
    Does the delta kernel have support for creating and appending to a Delta table? It seems not. We're starting a new component and deciding whether to use Kernel or Delta standalone.
  • c

    Christian Daudt

    10/06/2023, 10:44 PM
    Hi, a question on VACUUM operation. The doc (here) says that vacuum deletes data files no longer referenced by a delta table and older than the retention threshold. Does the "delta table" reference there mean any version or the current version only? I.e. if I have 30 days of logs and run vacuum with the default 7 days, will it delete only data files not referenced in any of the last 30 days of logs that are also >7 days old, or any files not referenced by the current version that are >7 days old? TIA
  • b

    Ben Magee

    10/07/2023, 4:46 PM
    If I configure Delta Lake to retain 30 days of history for time travel, insert some data on day 1, overwrite it on day 10, then perform an optimize, and then a vacuum operation, can I still access the historical version of my data from day 1 on day 20, or does the vacuum operation remove the historical version once the optimize operation is complete?
  • s

    Samrose

    10/07/2023, 9:59 PM
    Anyone seeing a NullPointerException when trying to use generated columns with Databricks?
    stage 26.0 failed 4 times, most recent failure: Lost task 0.3 in stage 26.0 (TID 34) (10.59.220.113 executor 0): java.lang.NullPointerException
    	at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:113)
  • a

    Aishik Saha

    10/09/2023, 8:54 AM
    Hi there! I'm unable to read Delta tables stored on a MinIO-like object store, which have caret (^) symbols in the filename, using the deltalake Python library.
    Table URL: "s3a://my_bucket/bronze/hello^delta^team"
    Python Error:
    File ".../.venv/lib/python3.9/site-packages/deltalake/table.py", line 247, in __init__
        self._table = RawDeltaTable(
    OSError: Encountered object with invalid path: Error parsing Path "/bronze/hello^delta^team": Encountered illegal character sequence "^" whilst parsing path segment "hello^delta^team"
    Is there a way to override this parsing check, so that I can read the files? Thanks for reading.
    👀 1
  • g

    Gokhan Ozturk

    10/09/2023, 1:17 PM
    Hi, I am investigating OSS Delta Lake for use in our data lakehouse solution. I am wondering if OSS Delta Lake handles the small file problem (using bin-packing) automatically or manually. Can anyone help? Thanks in advance,
  • c

    Carly Akerly

    10/09/2023, 7:03 PM
    Hey all Delta users, per the earlier announcement (https://delta-users.slack.com/archives/CJ70UCSHM/p1695922778709909), we were going to release Delta 3.0 on Spark 3.4 and Delta 3.1 on Spark 3.5 by the end of October. But in the meantime, see https://github.com/delta-io/delta/issues/2127. So we are switching to release Delta 3.0 on Spark 3.5 directly, and Delta 3.1 will have a new release date that is yet to be decided. Thank you all for being patient! @here
    👍 1
  • c

    Chinhvu1111

    10/10/2023, 3:09 AM
    Hello team, we want to read the file changes in a Delta table using Structured Streaming. Do we have any solution for that? I think this use case is the same as reading new records from a Kafka topic. We just want to read the new change file information by reading the metadata of the Delta table.
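    Two streaming-read sketches that might match this, assuming Spark Structured Streaming with delta-spark (the path is a placeholder): a plain streaming read surfaces newly added files as micro-batches, and with change data feed enabled each row also carries its change type.

    TABLE_PATH = "s3a://bucket/path/to/table"   # placeholder

    # 1) Plain streaming read: each new commit's added files become new micro-batch input.
    appends = spark.readStream.format("delta").load(TABLE_PATH)

    # 2) Change data feed: requires the table property delta.enableChangeDataFeed = true;
    #    rows then include _change_type, _commit_version and _commit_timestamp columns.
    changes = (spark.readStream
        .format("delta")
        .option("readChangeFeed", "true")
        .load(TABLE_PATH))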
  • c

    Chinhvu1111

    10/10/2023, 4:00 AM
    @Carly Akerly
  • l

    Lucas Zago

    10/10/2023, 9:26 AM
    Hello team, working with AWS Glue, is there a way to work with Delta and at the same time write to the Glue catalog? My question is: suppose I have 3 partitions (a, b, c) and I want to overwrite only a; how do I handle that with Delta/Glue?
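    For the "overwrite only partition a" part, Delta's replaceWhere overwrite is the usual tool; a minimal sketch (the partition column name and path are placeholders, and registering the table in the Glue catalog is a separate step):

    # Overwrite only the rows where part = 'a'; the other partitions are left untouched.
    (df.write
        .format("delta")
        .mode("overwrite")
        .option("replaceWhere", "part = 'a'")    # "part" is a placeholder partition column
        .save("s3://my-bucket/tables/events"))   # placeholder path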
  • j

    Jatin Sharma

    10/10/2023, 9:35 AM
    Hi, when I'm doing a MERGE operation on a Delta table through the Scala delta-spark APIs, how can I find out the version number committed as part of that MERGE command? I also want to view the transaction's operationMetrics for this particular write. Is there a good way to do this?
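    One way that should work is to read the latest entry of the table's history right after the merge; a sketch in PySpark (the Scala DeltaTable API is analogous, and the path is a placeholder):

    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, "/mnt/tables/target")   # placeholder path

    # ... run the MERGE here via target.merge(...)...execute() ...

    # history(1) returns the most recent commit, including its version and operationMetrics.
    # Note: with concurrent writers another commit could land between the merge and this read.
    last_commit = target.history(1).select("version", "operation", "operationMetrics").collect()[0]
    print(last_commit["version"], last_commit["operationMetrics"])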
  • g

    Gokhan Ozturk

    10/10/2023, 12:48 PM
    Hi everyone, I am investigating how Delta Lake's performance with a cloud storage solution compares to its performance with HDFS or Ceph. Can anyone share useful links or info? Thanks in advance,
  • d

    Douglas

    10/10/2023, 1:21 PM
    @Scott Sandre (Delta Lake) when does the .rc2 release make it to pypi? https://delta-users.slack.com/archives/CG9LR6LN4/p1696894066958009
  • p

    Pedro Salgado

    10/10/2023, 2:11 PM
    Hi all, quick question: I am using the delta-rs package from Python, and I can't find any documentation on setting table properties (e.g. delta.logRetentionDuration) the way we usually do through Spark SQL with ALTER TABLE ... SET TBLPROPERTIES. I was able to check the Rust code and found a reference to the config and its usage in checkpoint creation, but the cleanup flag is not exposed in Python... source: https://github.com/delta-io/delta-rs/blob/main/rust/src/protocol/checkpoints.rs#L95 Any idea how I can achieve log cleanup in delta-rs similar to Delta Spark? Thanks!
  • c

    Christina

    10/11/2023, 1:42 PM
    Hi folks, a Spark / JDBC question. From this doc https://join.slack.com/t/carvanaalumni-dmr1806/shared_invite/zt-24tqc2ac5-ZZ3O5yUIREntCXixBjoMRg we know there is a query vs. dbtable option. We like the dbtable option because we can control parallelism, but wonder about the case of a very wide table. In the case below, do I end up pulling all the data out of the source DB, or will the JDBC connector only pull the columns used in the display query?
    employees_table = (spark.read
      .format("jdbc")
      .option("url", "<jdbc-url>")
      .option("dbtable", "<table-name>")
      .option("user", "<username>")
      .option("password", "<password>")
      .option("partitionColumn", "<partition-key>")
      .option("lowerBound", "<min-value>")
      .option("upperBound", "<max-value>")
      .option("numPartitions", 8)
      .load()
    )
    
    display(employees_table.select("age", "salary").groupBy("age").avg("salary"))
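    One way to see what actually gets requested from the source is to look at the physical plan of the narrowed query; a short sketch reusing the employees_table DataFrame from above:

    # The JDBC scan node in the plan lists the columns (and any pushed filters) that will be
    # selected from the source table; with column pruning, only "age" and "salary" should
    # appear rather than every column of the wide table.
    employees_table.select("age", "salary").groupBy("age").avg("salary").explain(True)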
  • p

    Perfect Stranger

    10/11/2023, 4:18 PM
    Can Delta Lake be used with the Hive metastore? I read something about generating manifests and then registering them in the Hive metastore in order for Trino to read Delta tables, but is there a newer way that doesn't require this? https://docs.delta.io/latest/presto-integration.html <- this article says that the manifest-based approach is old and that "Trino natively supports reading Delta tables"; however, the Trino Delta Lake connector article talks about configuring Delta Lake in Trino through the Hive metastore. How do I configure Delta Lake through the Hive metastore without manifests?