# general
  • g

    Garrett Weaver

    04/09/2025, 8:46 PM
    when using
    sort_merge
    , would there be any known reason why Ray might "hang" and stop scheduling tasks?
    c
    • 2
    • 3
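    For context, a minimal sketch of how a sort-merge join is requested on the Ray runner (assuming the strategy parameter on DataFrame.join; the paths and join key below are hypothetical):
    Copy code
    import daft

    # Hypothetical inputs; requesting the sort-merge strategy explicitly
    # triggers a distributed sort of both sides before the merge.
    left = daft.read_parquet("s3://bucket/left/*.parquet")
    right = daft.read_parquet("s3://bucket/right/*.parquet")

    joined = left.join(right, on="id", strategy="sort_merge")
    joined.collect()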
  • z

    Zapier

    04/10/2025, 7:33 PM
    What's Changed 🚀

    ✨ Features
    - feat: Set num threads for local runners @colin-ho (#4170)
    - feat: add strftime @universalmind303 (#4146)
    - feat: adds case-sensitive and case-normalized catalog and table identifiers [1/3] @rchowell (#4149)
    - feat(window): Add window partition execution @f4t4nt (#4097)
    - feat: add is_t and inner property methods to DataType @universalmind303 (#4141)
    - feat: add unix_timestamp function @universalmind303 (#4130)
    - feat: add io config for spark @universalmind303 (#4099)
    - feat: Implement flight server in rust @colin-ho (#4004)
    - feat(window): Add ExtractWindowFunction optimizer rule @f4t4nt (#4093)
    - feat(window): Add window function definitions and skeleton API @f4t4nt (#4082)
    - feat: add dt.day_of_year @universalmind303 (#4129)

    🐛 Bug Fixes
    - fix: Re-enable mypy @colin-ho (#4160)
    - fix: TimeUnit repr format @universalmind303 (#4164)
    - fix: Fix list aggregates on empty series @desmondcheongzx (#4155)
    - fix: sorting with nulls_first @universalmind303 (#4154)
    - fix: ignore hugging face in broken link workflow @rchowell (#4137)

    🚀 Performance
    - perf: Optimize shuffle cache @colin-ho (#4101)
    - perf(optimizer): anti/semi join pushdown rule @kevinzwang (#4132)

    📖 Documentation
    - docs: Fix tutorial notebooks @desmondcheongzx (#4161)
    - docs: documents iceberg integration @rchowell (#4070)

    👷 CI
    - ci: Native doc tests @colin-ho (#4142)

    🔧 Maintenance
    - chore: Update bug_report.yml to use native runner @colin-ho (#4169)
    - chore: simplify rust identifier to a string vector @rchowell (#4140)

    Full Changelog: https://github.com/Eventual-Inc/Daft/compare/v0.4.9...v0.4.10
    Release Notes: https://github.com/Eventual-Inc/Daft/releases/tag/v0.4.10
    🙌 1
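    For anyone trying the new expression functions from this release, a rough sketch (method names taken from the PR titles; exact signatures may differ):
    Copy code
    import datetime

    import daft
    from daft import col

    df = daft.from_pydict({"ts": [datetime.datetime(2025, 4, 10, 19, 33)]})

    # strftime (#4146) and day_of_year (#4129) are exposed under the dt expression namespace.
    df = df.with_column("formatted", col("ts").dt.strftime("%Y-%m-%d"))
    df = df.with_column("doy", col("ts").dt.day_of_year())
    df.show()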
  • k

    Kesav Kolla

    04/10/2025, 7:45 PM
    Hi, my use case is to connect to an RDBMS, execute a query, and write the results as Parquet files. The table I'm querying has 10B rows. I would like to know how to control partitioned reads and whether there is a way to pause/resume. The query execution and data transfer typically spans days, and if the process fails I don't want to rerun all 10B rows again. So I'm wondering how to use checkpoints for resuming, and secondly how to control the parallel reads.
    k
    • 2
    • 3
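    Not an official answer, but a sketch of the read-parallelism side using daft.read_sql with its partition_col and num_partitions parameters; the query, connection string, and paths below are made up:
    Copy code
    import daft

    # Daft splits the query into num_partitions range scans over partition_col,
    # which bounds how many concurrent reads hit the database.
    df = daft.read_sql(
        "SELECT * FROM big_table",           # hypothetical query
        "postgresql://user:pass@host/db",    # hypothetical connection string
        partition_col="id",
        num_partitions=64,
    )
    df.write_parquet("s3://bucket/output/")

    # For pause/resume, one hand-rolled option is a driver loop that issues
    # bounded queries per id range and skips ranges whose output already
    # exists, so a failure only reruns the in-flight chunk.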
  • w

    Wes Madrigal

    04/11/2025, 3:19 PM
    If you're doing multi-table data munging, we are now integrated with
    daft
    : https://github.com/wesmadrigal/GraphReduce
    โค๏ธ 6
  • c

    Cory Grinstead

    04/11/2025, 4:27 PM
    is it possible to enforce type constraints on udfs? for example:
    Copy code
    @daft.udf(return_dtype=str)
    def make_greeting(a,b, greeting: str ="hello"):
        return [f"{greeting}, {a} {b}" for a, b in zip(a, b)]
    
    df.select(make_greeting(
        col('first_name'), 
        col('last_name'), 
        lit(1) # <-------------------- any way to make this error out and enforce that it's actually a string 
    )).collect()
    k
    • 2
    • 1
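    One workaround sketch, assuming Series.datatype() returns the column's DataType: validate the argument inside the UDF at runtime and raise.
    Copy code
    import daft

    @daft.udf(return_dtype=daft.DataType.string())
    def make_greeting(a, b, greeting="hello"):
        # Manual guard: when greeting arrives as a column (e.g. lit(1)),
        # reject anything that is not a Utf8 column.
        if isinstance(greeting, daft.Series):
            if greeting.datatype() != daft.DataType.string():
                raise TypeError(f"greeting must be a string, got {greeting.datatype()}")
            greeting = greeting.to_pylist()[0]
        return [f"{greeting}, {x} {y}" for x, y in zip(a.to_pylist(), b.to_pylist())]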
  • e

    Elvn

    04/16/2025, 9:39 AM
    Hi folks, I am new to Daft and wanted to see how it is different from Spark. I am testing it on Databricks Cloud. I tried to connect to Unity Catalog but am getting an error. Let me know how to resolve it.
    n
    • 2
    • 6
  • a

    Andrew Fuqua

    04/16/2025, 1:34 PM
    Hi, I have a Daft dataframe
    df
    with a single column of type Map<utf8, utf8>. When I try to write it to parquet, I get an error:
    ArrowInvalid: Map keys must be annotated as required.
    Copy code
    print(df.schema())
    ╭─────────────────────────────┬─────────────────╮
    │ column_name                 ┆ type            │
    ╞═════════════════════════════╪═════════════════╡
    │ foo_map                     ┆ Map[Utf8: Utf8] │
    ╰─────────────────────────────┴─────────────────╯
    
    print(df.schema().to_pyarrow_schema())
    foo_map: map<large_string, large_string>
      child 0, entries: struct<key: large_string, value: large_string> not null
          child 0, key: large_string
          child 1, value: large_string
    For more context, I constructed the dataframe by reading from an existing parquet file and selecting only a map type column. Writing non-map columns from the source parquet to a new parquet file works fine. I've narrowed down the issue to columns that have map types. Any advice for how to proceed here?
    c
    c
    • 3
    • 8
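    One possible workaround sketch while this gets sorted out: round-trip through Arrow and cast the map column to a type whose key field is non-nullable before writing with pyarrow. Whether the cast is accepted depends on the Arrow version; the column and output file names come from the example above.
    Copy code
    import pyarrow as pa
    import pyarrow.parquet as pq

    # `df` is the Daft dataframe holding the Map[Utf8: Utf8] column.
    arrow_table = df.to_arrow()

    # pa.map_() marks the key field as non-nullable, which is what the
    # Parquet writer insists on.
    target_schema = pa.schema(
        [pa.field("foo_map", pa.map_(pa.large_string(), pa.large_string()))]
    )
    pq.write_table(arrow_table.cast(target_schema), "foo_map_only.parquet")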
  • u

    עמית גלעד

    04/16/2025, 2:26 PM
    Hello, I am starting a PoC with Daft + Ray. I created an EC2 instance manually and started Ray with the following command:
    Copy code
    ray start --head --port=6379 --node-ip-address 44.222.217.130 --dashboard-host 0.0.0.0
    I even checked that all is good with ray status:
    Copy code
    ======== Autoscaler status: 2025-04-16 14:23:06.641656 ========
    Node status
    ---------------------------------------------------------------
    Active:
     1 node_8da791ef7c8b8034fcb2fdd215c5aee700d6c6b8a9b2b095848a60bb
    Pending:
     (no pending nodes)
    Recent failures:
     (no failures)
    
    Resources
    ---------------------------------------------------------------
    Usage:
     0.0/48.0 CPU
     0B/259.13GiB memory
     0B/111.06GiB object_store_memory
    
    Demands:
     (no resource demands)
    I opened all ports from my local laptop to the machine and have access to the Ray dashboard. But when I try to run the example in the docs (https://www.getdaft.io/projects/docs/en/stable/distributed/)
    Copy code
    import daft
    
    daft.context.set_runner_ray(address="ray://<ray_address>:<port>")
    
    df = daft.from_pydict({
        "a": [3, 2, 5, 6, 1, 4],
        "b": [True, False, False, True, True, False]
    })
    
    print(df)
    But I keep getting the following error:
    Copy code
    Traceback (most recent call last):
      File "/Users/amit/projects/lakesphere/Glacierops/ray.py", line 2, in <module>
        import ray
      File "/Users/amit/projects/lakesphere/Glacierops/ray.py", line 9, in <module>
        daft.context.set_runner_ray(address="<ray://44.222.217.130>")
      File "/Users/amit/projects/lakesphere/Glacierops/venv/lib/python3.12/site-packages/daft/context.py", line 84, in set_runner_ray
        py_ctx = _set_runner_ray(
                 ^^^^^^^^^^^^^^^^
    RuntimeError: Cannot set runner more than once
    Any help would be appreciated 🙂
    c
    c
    • 3
    • 9
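    Worth noting: the traceback shows the script itself is named ray.py, so import ray re-imports the script and runs it a second time, which is one way to end up with "Cannot set runner more than once". A minimal sketch with a non-conflicting file name (Ray Client listens on port 10001 by default, not the 6379 GCS port):
    Copy code
    # connect_daft.py  (hypothetical name; anything except ray.py or daft.py)
    import daft

    # Set the runner exactly once, before building any dataframes.
    daft.context.set_runner_ray(address="ray://44.222.217.130:10001")

    df = daft.from_pydict({
        "a": [3, 2, 5, 6, 1, 4],
        "b": [True, False, False, True, True, False],
    })
    print(df.collect())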
  • g

    Garrett Weaver

    04/17/2025, 12:16 AM
    Anyone know how to quiet these logs? Getting spammed with the below on the Ray runner:
    Copy code
    INFO:daft_io.stats:IOStatsContext: PyMicroPartition::to_record_batch, Gets: 0, Heads: 0, Lists: 0, BytesRead: 0, AvgGetSize: 0, BytesUploaded: 0, AvgPutSize: 0
    INFO:daft_io.stats:IOStatsContext: PyMicroPartition::to_record_batch, Gets: 0, Heads: 0, Lists: 0, BytesRead: 0, AvgGetSize: 0, BytesUploaded: 0, AvgPutSize: 0
    INFO:daft_io.stats:IOStatsContext: PyMicroPartition::to_record_batch, Gets: 0, Heads: 0, Lists: 0, BytesRead: 0, AvgGetSize: 0, BytesUploaded: 0, AvgPutSize: 0
    INFO:daft_io.stats:IOStatsContext: PyMicroPartition::to_record_batch, Gets: 0, Heads: 0, Lists: 0, BytesRead: 0, AvgGetSize: 0, BytesUploaded: 0, AvgPutSize: 0
    INFO:daft_io.stats:IOStatsContext: PyMicroPartition::to_record_batch, Gets: 0, Heads: 0, Lists: 0, BytesRead: 0, AvgGetSize: 0, BytesUploaded: 0, AvgPutSize: 0
    INFO:daft_io.stats:IOStatsContext: PyMicroPartition::to_record_batch, Gets: 0, Heads: 0, Lists: 0, BytesRead: 0, AvgGetSize: 0, BytesUploaded: 0, AvgPutSize: 0
    INFO:daft_io.stats:IOStatsContext: PyMicroPartition::to_record_batch, Gets: 0, Heads: 0, Lists: 0, BytesRead: 0, AvgGetSize: 0, BytesUploaded: 0, AvgPutSize: 0
    INFO:daft_io.stats:IOStatsContext: PyMicroPartition::to_record_batch, Gets: 0, Heads: 0, Lists: 0, BytesRead: 0, AvgGetSize: 0, BytesUploaded: 0, AvgPutSize: 0
    INFO:daft_io.stats:IOStatsContext: PyMicroPartition::to_record_batch, Gets: 0, Heads: 0, Lists: 0, BytesRead: 0, AvgGetSize: 0, BytesUploaded: 0, AvgPutSize: 0
    INFO:daft_io.stats:IOStatsContext: PyMicroPartition::to_record_batch, Gets: 0, Heads: 0, Lists: 0, BytesRead: 0, AvgGetSize: 0, BytesUploaded: 0, AvgPutSize: 0
    INFO:daft_io.stats:IOStatsContext: PyMicroPartition::to_record_batch, Gets: 0, Heads: 0, Lists: 0, BytesRead: 0, AvgGetSize: 0, BytesUploaded: 0, AvgPutSize: 0
    INFO:daft_io.stats:IOStatsContext: PyMicroPartition::to_record_batch, Gets: 0, Heads: 0, Lists: 0, BytesRead: 0, AvgGetSize: 0, BytesUploaded: 0, AvgPutSize: 0
    INFO:daft_io.stats:IOStatsContext: PyMicroPartition::to_record_batch, Gets: 0, Heads: 0, Lists: 0, BytesRead: 0, AvgGetSize: 0, BytesUploaded: 0, AvgPutSize: 0
    INFO:daft_io.stats:IOStatsContext: PyMicroPartition::to_record_batch, Gets: 0, Heads: 0, Lists: 0, BytesRead: 0, AvgGetSize: 0, BytesUploaded: 0, AvgPutSize: 0
    c
    • 2
    • 2
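    One option: the prefix INFO:daft_io.stats: is the standard Python logging format, so raising that logger's level should silence these. On the Ray runner the same call may need to run on the workers as well (for example via runtime_env or a worker setup hook).
    Copy code
    import logging

    # The log prefix "INFO:daft_io.stats:" gives away the logger name.
    logging.getLogger("daft_io.stats").setLevel(logging.WARNING)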
  • y

    Yufan

    04/17/2025, 9:08 AM
    Hey folks, I'd love to get some input from the Daft team on
    file writes with custom name templates
    . Frameworks like TensorFlow impose a requirement on the output file name (e.g., https://www.tensorflow.org/datasets/api_docs/python/tfds/core/ShardedFileTemplate) in
    ${SHARD_INDEX}-of-${NUM_SHARDS}
    format, but it looks like Daft currently only supports the https://github.com/Eventual-Inc/Daft/blob/260a408fd6917ce1b09aa83b835c132510be3271/daft/recordbatch/recordbatch_io.py#L535 pattern. I'm not able to use the
    to_ray_dataset
    converter as it requires full materialization atm, so I just wanted to think out loud and see if you have any clues 😄
    c
    • 2
    • 12
  • y

    Yufan

    04/17/2025, 10:10 PM
    One more question: what's the best way in Daft to do column prefetching? 🧵
    c
    • 2
    • 3
  • y

    Yufan

    04/21/2025, 5:13 PM
    Hey folks, I ran into the error below when trying to read a bulk of Parquet files into Daft with
    daft.read_parquet
    , just wondering if there's any way to configure retries for this issue. It would be great if I could get some guidance on where to patch it; I'm able to build Daft locally and happy to try any patches 😄
    Copy code
    Error Type: TASK_EXECUTION_EXCEPTION
    
    User exception:
    ceptions.ByteStreamError: Io error: Cached error: Misc Transient error trying to read <s3://foo.parquet>
    Details:
    DispatchFailure(DispatchFailure { source: ConnectorError { kind: Other(None), source: hyper::Error(Connect, Ssl(Error { code: ErrorCode(5), cause: None }, X509VerifyResult { code: 0, error: "ok" })), connection: Unknown } })
    j
    k
    • 3
    • 12
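    A sketch of the retry knobs worth trying first, assuming the num_tries and retry_initial_backoff_ms parameters on daft.io.S3Config (the path is a stand-in):
    Copy code
    import daft
    from daft.io import IOConfig, S3Config

    # More aggressive retries for transient connect/SSL failures on S3 reads.
    io_config = IOConfig(s3=S3Config(num_tries=10, retry_initial_backoff_ms=1000))

    df = daft.read_parquet("s3://bucket/path/*.parquet", io_config=io_config)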
  • c

    ChanChan Mao

    04/21/2025, 5:26 PM
    The complete latest docs version is now live at https://www.getdaft.io/projects/docs/en/latest/ 🎉 Once we cut the release this week, it'll be available as the stable version. Let us know if you have any feedback!
    🙌 3
    daft party 1
    🔥 4
    r
    j
    • 3
    • 2
  • a

    Andrew Fuqua

    04/21/2025, 10:59 PM
    Does the read splitting mentioned in this blog post work in the native runner? I'm running into resource exhaustion when reading a single 2.5GB parquet file and attempting to write it back out repartitioned across 30 buckets using iceberg writer. This works for smaller files (tested with 200MB). I've set the daft execution config to the same values as in the post. The node has 32GB RAM which seems like enough for the task, but it is actually quickly exhausted (within 2 minutes), same for swap. Any other params I could tune to make this work? Would a local Ray cluster handle this task within the same resources?
    c
    • 2
    • 1
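    For reference, a hedged sketch of the scan-task splitting knobs that post refers to (parameter names assumed from daft.set_execution_config; the values and path are illustrative):
    Copy code
    import daft

    # Smaller scan tasks mean each worker holds less of the 2.5GB file in memory at once.
    daft.set_execution_config(
        scan_tasks_min_size_bytes=64 * 1024 * 1024,
        scan_tasks_max_size_bytes=256 * 1024 * 1024,
    )

    df = daft.read_parquet("big_file.parquet")   # hypothetical path
    # df.write_iceberg(iceberg_table)            # iceberg_table: a pyiceberg Table handle (not shown)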
  • e

    Everett Kleven

    04/24/2025, 4:45 PM
    Is anyone else experiencing strange scrolling behavior on the docs? It seems to have started post-update. If you scroll down, it automatically scrolls back to the top.
    Screen Recording 2025-04-24 at 11.39.28 AM.mov
    c
    d
    +2
    • 5
    • 17
  • y

    Yufan

    04/25/2025, 3:56 PM
    Hey folks, I want to get some input from the experts to see if this pattern is something Daft could support 🧵
    c
    • 2
    • 4
  • a

    Andrew Kursar

    04/25/2025, 9:29 PM
    Hello! The recent 0.4.11 Daft release included a <=16 pyarrow constraint (see https://github.com/Eventual-Inc/Daft/pull/4225), but the latest pyiceberg requires >=17 (see https://github.com/apache/iceberg-python/blob/pyiceberg-0.9.0/pyproject.toml#L64). Does anyone know what the newly added max bound in Daft is about? Or, more directly, are there any issues to watch that are blocking the use of later pyarrow releases? I've been using pyiceberg 0.9.0 with Daft 0.4.10 but now can't upgrade Daft without downgrading pyiceberg.
    c
    • 2
    • 3
  • y

    yashovardhan chaturvedi

    05/02/2025, 2:34 PM
    Hey folks, does Daft have something like https://fastht.ml/docs/#getting-help-from-ai or https://fastht.ml/docs/llms-ctx.txt that can be fed to LLMs? It might make using Daft more productive with Cursor etc.
    👀 1
    ❤️ 2
    n
    d
    • 3
    • 5
  • n

    Neil Wadhvana

    05/04/2025, 1:55 AM
    Hey guys, I'd like to do something like this in
    daft
    without needing to specify each column (since I can have anywhere from 1 to 5). Is there another syntax that would work? This is not working as is:
    Copy code
    from typing import Dict, List

    import daft
    import numpy as np

    @daft.udf(return_dtype=daft.DataType.python())
    def mean_ensemble(*depth_value_series: daft.Series) -> List[Dict[str, np.ndarray]]:
        """Apply mean ensemble to depth maps."""
        depth_value_lists = [series.to_pylist() for series in depth_value_series]
        reduced_depth_maps: List[Dict[str, np.ndarray]] = []
        for depth_value_list in depth_value_lists:
            # Calculate mean and standard deviation across all depth maps in the list
            stacked_depths = np.stack(depth_value_list, axis=0)
            mean_depth = np.mean(stacked_depths, axis=0)
            std_depth = np.std(stacked_depths, axis=0)
            reduced_depth_maps.append(
                {
                    "mean": mean_depth,
                    "std": std_depth,
                }
            )
    
        return reduced_depth_maps
    c
    • 2
    • 4
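    Since the UDF above already accepts *args, one option is to build the expression list on the caller side and unpack it, so the number of depth columns never has to be hard-coded (the column names here are hypothetical):
    Copy code
    import daft
    from daft import col

    # df: the dataframe holding the depth columns (1 to 5 of them in practice).
    depth_cols = ["depth_1", "depth_2", "depth_3"]
    df = df.with_column("ensemble", mean_ensemble(*[col(c) for c in depth_cols]))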
  • g

    Garrett Weaver

    05/14/2025, 1:05 PM
    👋 Any timeline for when window functions will be available for the Ray runner?
    c
    • 2
    • 1
  • y

    Yufan

    05/21/2025, 11:29 PM
    Hey folks, I'd like your input on what's the most efficient way of resizing partitions.
    k
    c
    • 3
    • 8
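    For anyone searching later, a sketch of the two relevant calls (the partition counts, path, and key column are made up):
    Copy code
    import daft

    df = daft.read_parquet("s3://bucket/data/*.parquet")      # hypothetical path

    # into_partitions() splits or coalesces to a target count without a full shuffle;
    # repartition() hash-shuffles, optionally clustering rows by key columns.
    df_resized = df.into_partitions(128)
    df_clustered = df.repartition(128, daft.col("user_id"))   # hypothetical key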
  • y

    Yuri Gorokhov

    05/23/2025, 9:29 PM
    I am trying to explode on a column that is a list of structs (it's a fairly nested schema) and encountering this error:
    Copy code
    Attempting to downcast Map { key: Utf8, value: List(Utf8) } to \"daft_core::array::list_array::ListArray\"
    Wondering if someone has seen this before?
    c
    • 2
    • 5
  • g

    Giridhar Pathak

    05/25/2025, 1:37 AM
    Hey folks, I'm getting a weird TypeError when reading from an Iceberg table:
    Copy code
    TypeError: pyarrow.lib.large_list() takes no keyword arguments
    the code:
    Copy code
    table = catalog.load_table(table)
        return df.read_iceberg(table)
    has anyone experienced this before?
    • 1
    • 1
  • g

    Giridhar Pathak

    05/25/2025, 10:18 PM
    I'm querying an Iceberg table from a Jupyter notebook (backed by 12GB RAM and 4 CPUs)
    Copy code
    daft.read_table("platform.messages").filter("event_time > TIMESTAMP '2025-05-24T00:00:00Z'").limit(5).show()
    Running this makes the process crash; it looks like memory goes through the roof. Not sure if it's trying to read the whole table into memory. Pre-materialization, I can get the schema just fine.
    c
    d
    • 3
    • 29
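    One thing worth checking before materializing: whether the filter and limit actually get pushed into the Iceberg scan. A quick sketch using explain (assuming the show_all flag):
    Copy code
    import daft

    df = (
        daft.read_table("platform.messages")
        .filter("event_time > TIMESTAMP '2025-05-24T00:00:00Z'")
        .limit(5)
    )

    # Print the optimized plan; the filter and limit should show up as
    # pushdowns on the Iceberg scan rather than as separate operators.
    df.explain(show_all=True)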
  • e

    Everett Kleven

    05/28/2025, 2:15 PM
    https://github.com/JanKaul/iceberg-rust 👀
    👌 1
    k
    • 2
    • 6
  • y

    Yuri Gorokhov

    05/28/2025, 4:14 PM
    Is there an equivalent to pyspark's
    Copy code
    .dropDuplicates(subset: Optional[List[str]] = None)
    where you can specify which columns to consider?
    r
    k
    s
    • 4
    • 9
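    Not sure there is a direct subset argument, but one workaround sketch is to group on the subset columns and keep an arbitrary value for the rest (any_value is assumed to exist as an aggregation; min or max works as a fallback for orderable columns). Column names below are hypothetical:
    Copy code
    import daft
    from daft import col

    # df: the dataframe to de-duplicate.
    # Keep one row per (user_id, event_date) pair, regardless of the other columns.
    deduped = df.groupby("user_id", "event_date").agg(
        col("payload").any_value(),   # assumed aggregation; .min() as a fallback
    )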
  • p

    Pat Patterson

    05/29/2025, 11:37 PM
    Hi there - I'm trying out Daft after meeting @ChanChan Mao and @Sammy Sidhu at Data Council a few weeks ago. I got all the queries from my recent Iceberg and Backblaze B2 blog post working (see https://gist.github.com/metadaddy/ec9e645fa0929321b626d8be6e11162e). Performance in general is not great, but one query in particular is extremely slow:
    Copy code
    # How many records are in the current Drive Stats dataset?
    count, elapsed_time = time_collect(drivestats.count())
    print(f"Total record count: {count.to_pydict()['count'][0]} ({elapsed_time:.2f} seconds)")
    With the other systems I tested in my blog post, the equivalent query takes between a fraction of a second and 15 seconds. That Daft call to
    drivestats.count()
    takes 80 seconds. I'm guessing it's doing way more work than it needs to - reading the record counts from each of the 365 Parquet files rather than simply reading
    total-records
    from the most recent metadata file. Since
    SELECT COUNT(*)
    is such a common operation, I think it's worth short-circuiting the current behavior.
    c
    • 2
    • 8
  • g

    Giridhar Pathak

    06/03/2025, 2:17 PM
    Question on the Daft Native runtime 🧵
    c
    • 2
    • 5
  • e

    Everett Kleven

    06/03/2025, 2:59 PM
    Hey Daft squad, if I'm using a MemoryCatalog to track LanceDB tables, am I restricted to only using dataframes at the moment?
    r
    • 2
    • 1
  • p

    Pat Patterson

    06/06/2025, 10:47 PM
    Where does the work take place when I use Daft with Ray? For example, consider the following minimal code:
    Copy code
    import daft
    import ray
    from pyiceberg.catalog import load_catalog

    ray.init("ray://<head_node_host>:10001", runtime_env={"pip": ["daft"]})

    daft.context.set_runner_ray("ray://<head_node_host>:10001")

    catalog = load_catalog(
        'iceberg',
        **{
            'uri': 'sqlite:///:memory:',
            # configuration to access Backblaze B2's S3-compatible API such as
            # s3.endpoint, s3.region, etc
        }
    )

    catalog.create_namespace('default', {'location': 's3://my-bucket/'})
    table = catalog.register_table('default.drivestats', metadata_location)

    drivestats = daft.read_iceberg(table)

    result = drivestats.count().collect()
    print(f"Total record count: {result.to_pydict()['count'][0]}")
    Presumably, the code to read Parquet files from Backblaze B2 via the AWS SDK executes on the Ray cluster, so I have to either install the necessary packages there ahead of time or specify them, and environment variables, in
    runtime_env
    ? For example:
    Copy code
    ray.init("ray://<head_node_host>:10001", runtime_env={
        "pip": ["daft==0.5.2", "boto3==1.34.162", "botocore==1.34.162", ...etc...],
        "env_vars": {
            "AWS_ENDPOINT_URL": os.environ["AWS_ENDPOINT_URL"],
            "AWS_ACCESS_KEY_ID": os.environ["AWS_ACCESS_KEY_ID"],
            "AWS_SECRET_ACCESS_KEY": os.environ["AWS_SECRET_ACCESS_KEY"],
            ...etc...
        }
    })
    j
    • 2
    • 2