# general
  • Yufan

    05/21/2025, 11:29 PM
    hey folks, wanna get your input on what's the most efficient way of resizing partitions
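    For later readers, a minimal sketch of the two usual options, assuming both are still in the DataFrame API: repartition does a full hash shuffle, while into_partitions splits or coalesces without shuffling.
    Copy code
    import daft

    df = daft.read_parquet("s3://bucket/data/*.parquet")  # hypothetical input

    # Full hash shuffle into 64 partitions; rebalances skew but moves data:
    df = df.repartition(64)

    # Cheaper resize that splits/coalesces existing partitions, no shuffle:
    df = df.into_partitions(64)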
  • Yuri Gorokhov

    05/23/2025, 9:29 PM
    I am trying to explode on a column that is a list of structs (it's a fairly nested schema) and encountering this error:
    Copy code
    Attempting to downcast Map { key: Utf8, value: List(Utf8) } to "daft_core::array::list_array::ListArray"
    Wondering if someone has seen this before?
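    For reference, explode on a plain list-of-structs column works as in this minimal sketch (the error above looks specific to Map-typed children):
    Copy code
    import daft

    df = daft.from_pydict({
        "id": [1, 2],
        "events": [
            [{"name": "a", "ts": 1}, {"name": "b", "ts": 2}],
            [{"name": "c", "ts": 3}],
        ],
    })

    # explode() emits one row per list element; struct fields stay intact
    df.explode("events").show()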
  • Giridhar Pathak

    05/25/2025, 1:37 AM
    hey folks, I'm getting a weird TypeError when reading from an Iceberg table:
    Copy code
    TypeError: pyarrow.lib.large_list() takes no keyword arguments
    the code:
    Copy code
    table = catalog.load_table(table)
    return daft.read_iceberg(table)
    has anyone experienced this before?
  • Giridhar Pathak

    05/25/2025, 10:18 PM
    I'm querying an Iceberg table from a Jupyter notebook (backed by 12 GB RAM and 4 CPUs):
    Copy code
    daft.read_table("platform.messages").filter("event_time > TIMESTAMP '2025-05-24T00:00:00Z'").limit(5).show()
    Running this makes the process crash; memory goes through the roof. Not sure if it's trying to read the whole table into memory. Pre-materialization, I can get the schema just fine.
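    One way to narrow this down is to print the query plan without executing anything, to check whether the filter and limit are being pushed into the Iceberg scan. A sketch, assuming explain() with show_all as in the docs:
    Copy code
    import daft

    df = (
        daft.read_table("platform.messages")
        .filter("event_time > TIMESTAMP '2025-05-24T00:00:00Z'")
        .limit(5)
    )

    # Prints the unoptimized and optimized plans without running the query:
    df.explain(show_all=True)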
  • Everett Kleven

    05/28/2025, 2:15 PM
    https://github.com/JanKaul/iceberg-rust 👀
    👌 1
  • Yuri Gorokhov

    05/28/2025, 4:14 PM
    Is there an equivalent to pyspark's
    Copy code
    .dropDuplicates(subset: Optional[List[str]] = None)
    where you can specify which columns to consider?
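    Not aware of a direct equivalent; a possible workaround sketch is a group-by on the subset columns, assuming any_value() (or min()/max()) is available as an aggregation to pick a representative value per key:
    Copy code
    import daft
    from daft import col

    df = daft.from_pydict({"k": [1, 1, 2], "v": ["a", "b", "c"]})

    # Keep one row per distinct "k", like dropDuplicates(subset=["k"]):
    deduped = df.groupby("k").agg(col("v").any_value())
    deduped.show()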
  • Pat Patterson

    05/29/2025, 11:37 PM
    Hi there - I'm trying out Daft after meeting @ChanChan Mao and @Sammy Sidhu at Data Council a few weeks ago. I got all the queries from my recent Iceberg and Backblaze B2 blog post working - see https://gist.github.com/metadaddy/ec9e645fa0929321b626d8be6e11162e. Performance in general is not great, but one query in particular is extremely slow:
    Copy code
    # How many records are in the current Drive Stats dataset?
    count, elapsed_time = time_collect(drivestats.count())
    print(f"Total record count: {count.to_pydict()['count'][0]} ({elapsed_time:.2f} seconds)")
    With the other systems I tested in my blog post, the equivalent query takes between a fraction of a second and 15 seconds. That Daft call to drivestats.count() takes 80 seconds. I'm guessing it's doing way more work than it needs to - reading the record counts from each of the 365 Parquet files rather than simply reading total-records from the most recent metadata file. Since SELECT COUNT(*) is such a common operation, I think it's worth short-circuiting the current behavior.
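    Until a short-circuit like that exists, one workaround sketch is to read the count straight from the Iceberg snapshot metadata via pyiceberg ('total-records' is a standard snapshot summary field, though the exact accessor may vary across pyiceberg versions):
    Copy code
    # table is the pyiceberg table backing daft.read_iceberg(table)
    snapshot = table.current_snapshot()
    if snapshot is not None:
        # accessor may differ by pyiceberg version
        print(f"Total record count: {snapshot.summary['total-records']}")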
  • Giridhar Pathak

    06/03/2025, 2:17 PM
    Question on the Daft Native runtime 🧵
  • Everett Kleven

    06/03/2025, 2:59 PM
    Hey daft squad, If I'm using a MemoryCatalog to track lancedb tables, am I restricted to only using dataframes at the moment?
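    For what it's worth, a Lance dataset can also be read directly as a DataFrame without going through a catalog; a minimal sketch, assuming read_lance and a hypothetical path:
    Copy code
    import daft

    df = daft.read_lance("s3://bucket/my_table.lance")
    df.show()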
  • Pat Patterson

    06/06/2025, 10:47 PM
    Where does the work take place when I use Daft with Ray? For example, consider the following minimal code:
    Copy code
    import daft
    import ray
    from pyiceberg.catalog import load_catalog

    ray.init("ray://head_node_host:10001", runtime_env={"pip": ["daft"]})

    daft.context.set_runner_ray("ray://head_node_host:10001")

    catalog = load_catalog(
        'iceberg',
        **{
            'uri': 'sqlite:///:memory:',
            # configuration to access Backblaze B2's S3-compatible API such as
            # s3.endpoint, s3.region, etc
        }
    )

    catalog.create_namespace('default', {'location': 's3://my-bucket/'})
    table = catalog.register_table('default.drivestats', metadata_location)

    drivestats = daft.read_iceberg(table)

    result = drivestats.count().collect()
    print(f"Total record count: {result.to_pydict()['count'][0]}")
    Presumably, the code to read Parquet files from Backblaze B2 via the AWS SDK executes on the Ray cluster, so I have to either install the necessary packages there ahead of time or specify them, and environment variables, in runtime_env? For example:
    Copy code
    ray.init("ray://<head_node_host>:10001", runtime_env={
        "pip": ["daft==0.5.2", "boto3==1.34.162", "botocore==1.34.162", ...etc...],
        "env_vars": {
            "AWS_ENDPOINT_URL": os.environ["AWS_ENDPOINT_URL"],
            "AWS_ACCESS_KEY_ID": os.environ["AWS_ACCESS_KEY_ID"],
            "AWS_SECRET_ACCESS_KEY": os.environ["AWS_SECRET_ACCESS_KEY"],
            ...etc...
        }
    })
  • Dimitris

    06/10/2025, 8:41 PM
    Hi, how does one insert a row into a df with daft? All the examples I see load an existing dataset and add columns. Thanks!
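    Daft DataFrames are immutable, so the usual pattern is to build a one-row frame with the same schema and concatenate; a minimal sketch:
    Copy code
    import daft

    df = daft.from_pydict({"id": [1, 2], "name": ["a", "b"]})

    # "Insert" = concat a one-row DataFrame with a matching schema:
    row = daft.from_pydict({"id": [3], "name": ["c"]})
    df = df.concat(row)
    df.show()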
  • Tabrez Mohammed

    06/16/2025, 10:41 PM
    We use Daft on Ray with Glue and Iceberg. We recently changed the table config from CoW to MoR to improve write perf in our Spark jobs. Unfortunately, Daft can't read the tables, let alone run any operations like joins, without running out of memory. We tried up to a 200 GB worker with a 20 GB table, just loading it in Daft and converting to a Ray dataset.
    👀 1
  • Dimitris

    06/16/2025, 11:22 PM
    Hi, what are the recommended ways to debug UDFs? Is there a way to print or log to the console when using Ray?
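    A plain print() inside the UDF generally works; on the Ray runner, worker stdout/stderr gets forwarded to the driver and also lands in the Ray dashboard logs. A minimal sketch:
    Copy code
    import daft

    @daft.udf(return_dtype=daft.DataType.int64())
    def doubled(x):
        # Runs once per batch; shows up in driver output / Ray worker logs
        print(f"processing a batch of {len(x)} values")
        return [v * 2 for v in x.to_pylist()]

    df = daft.from_pydict({"x": [1, 2, 3]})
    df.with_column("y", doubled(df["x"])).collect()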
  • Sasha Shtern

    06/17/2025, 8:06 PM
    Hi Data Engineering Friends, I'm looking for someone to help us finalize our Daft / Ray setup. Is there anyone in the group who has solid experience with Daft and is willing to do some consulting work? We're currently using Spark, and I've ported our code over to Daft/Ray to give it a try. We really love the tooling and API compared to Spark so far. I was hoping to see performance improvements, but we're hitting a few snags out of the gate: Iceberg table writes are much slower than Spark, and there are some OOMs we weren't seeing before. I'm optimistic that these issues are solvable, but I could use the expertise of someone who's been around the block with Daft. Thank you!
  • Dimitris

    06/18/2025, 10:21 PM
    Hi, do you have recommendations for performing semantic search (with the use of embeddings) on daft? I haven't found much so far. I'm mainly interested in the distributed case.
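    There's no built-in vector index as far as I can tell; for the distributed case, a brute-force scoring UDF over an embedding column parallelizes naturally. A sketch, where the 384-dim size and the "embedding" column are made-up assumptions:
    Copy code
    import daft
    import numpy as np

    query = np.random.rand(384)  # stand-in for the embedded query vector

    @daft.udf(return_dtype=daft.DataType.float64())
    def cosine_score(emb):
        m = np.array(emb.to_pylist())  # (rows, 384)
        return (m @ query) / (np.linalg.norm(m, axis=1) * np.linalg.norm(query))

    # Assuming df has an "embedding" column of fixed-size float lists:
    # top = df.with_column("score", cosine_score(df["embedding"])) \
    #         .sort("score", desc=True).limit(10)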
  • Garrett Weaver

    06/19/2025, 4:25 PM
    👋 qq, I am seeing mixed support with respect to identifiers for Iceberg tables. Specifically, for some methods I can provide catalog.namespace.table (e.g. write_table), but others break and seem to expect namespace.table (e.g. create_table_if_not_exists, which calls pyiceberg under the hood, and pyiceberg had a breaking change that no longer allows including the catalog). Any advice on how I should be providing identifiers?
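    One stopgap is to normalize identifiers before they reach the pyiceberg-backed methods, since newer pyiceberg rejects a leading catalog name; a minimal sketch:
    Copy code
    def strip_catalog(identifier: str) -> str:
        # "catalog.namespace.table" -> "namespace.table"
        parts = identifier.split(".")
        return ".".join(parts[1:]) if len(parts) > 2 else identifier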
  • Garrett Weaver

    06/20/2025, 7:52 PM
    👋 I am running into a weird issue where the native runner is "hanging" when trying to read an Iceberg table, but switching to the Ray runner (local cluster) works fine (plan in 🧵). Maybe I am hitting an edge case, as this is a single-row test table in a staging environment for testing the Nessie integration.
  • Marco Gorelli

    06/24/2025, 8:58 AM
    If daft.pyspark were complete, then assuming it covers the operations one needs, would there still be an advantage to using Daft's own API instead?
  • Everett Kleven

    06/24/2025, 4:11 PM
    LFG DAFT SQUAD!
    ๐Ÿ™ 4
    ๐Ÿ™Œ 2
    ๐ŸŽ‰ 2
  • Sammy Sidhu

    06/24/2025, 4:19 PM
    Today we're thrilled to announce that Eventual has raised $30M in funding to power the future of multimodal AI infrastructure! Jay Chia and I started this journey 3 years ago, frustrated by the same wall every AI team hits: processing images, video, and documents at scale with tools built for entirely different use cases. What began as pure frustration in my basement has become the data engine trusted by Amazon, CloudKitchens, Essential AI, Together AI and other Fortune 25 companies.

    The numbers speak for themselves: Daft improved Amazon's most critical data job efficiency by 24%, saving them 40,000 years of compute time annually. Together AI replaced their custom pipelines with simple Daft queries for 100TB+ datasets while achieving 10x performance gains. But this is just the beginning. AI applications are now generating massive amounts of multimodal data at machine speed, and we're building the engine to power it all.

    Thank you to our incredible community and supporters who made this possible - from our earliest believers at Y Combinator and Array Ventures to our lead investors CRV and Felicis, plus M12, Microsoft's Venture Fund, Citi, and everyone who's been part of this journey.

    What's next? We're launching early access to Eventual Cloud - the first production platform built from scratch for multimodal AI workloads. We're also hiring across Engineering, DevRel, Design and Product Marketing - check out our open roles here. Please help share the love on LinkedIn & Twitter! https://daft-amplify.lovable.app/ (PS, check out the video in the posts, it's pretty cool)
    ✅ 5
    🔥 6
    🙏 5
    ➕ 4
    daft party 4
    👍 1
    ❤️ 10
    🙌 9
  • Artur Ciocanu

    06/25/2025, 7:55 PM
    Hello community, I saw https://daft.ai/ linked from this repo: https://github.com/Eventual-Inc/Daft, but the address doesn't open. Is this a known issue?
  • ChanChan Mao

    06/26/2025, 4:41 PM
    Another huge milestone achieved this week - Daft has surpassed 3000 stars! Thank you to our growing community for continuing to support us and sharing the love of Daft. And thank you to everyone who is using Daft and believing in the future that we're building. https://github.com/Eventual-Inc/Daft
    โญ 4
    ๐Ÿ™Œ 3
    ๐Ÿคฉ 5
  • Garrett Weaver

    06/27/2025, 7:23 PM
    fyi, I am hitting this error https://github.com/apache/arrow/issues/21526 with pyarrow when using Ray data directly, and I'm also seeing errors trying to read the same data with daft. pyspark works fine. I assume daft might be impacted by the same issue (too-large row groups)?
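    If over-large row groups are the root cause, the durable fix is usually on the producing side: write files with a bounded row-group size so no single column chunk exceeds the 2 GB offsets that trip up the readers above. A sketch using pyarrow directly (the 100k figure is arbitrary):
    Copy code
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"x": list(range(1_000_000))})

    # Bounding row_group_size keeps each column chunk comfortably small:
    pq.write_table(table, "out.parquet", row_group_size=100_000)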
  • delasoul

    07/14/2025, 8:47 AM
    Hello, are you planning to support the DuckLake format?
  • Amir Shukayev

    07/19/2025, 12:32 AM
    Hey! It seems like a lot of the docs indexed on Google are pointing to dead links 🤔
  • Coury Ditch

    07/22/2025, 7:34 PM
    Has anyone had the experience of the native runner being faster than the ray runner?
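    For anyone comparing, the runner can be toggled per process before any query runs; a minimal sketch (on a single machine the native runner can win by avoiding Ray's serialization and scheduling overhead):
    Copy code
    import daft

    daft.context.set_runner_native()   # single-node streaming engine
    # daft.context.set_runner_ray()    # distributed engine on a Ray cluster

    df = daft.from_pydict({"k": [1, 1, 2], "x": [1.0, 2.0, 3.0]})
    print(df.groupby("k").agg(daft.col("x").sum()).collect())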
  • Garrett Weaver

    07/24/2025, 3:45 AM
    👋 I am seeing the following error with the native parquet writer on 0.5.9; it goes away if I set native_parquet_writer=False. I am using anonymous access (we use an alternative way to authenticate in k8s):
    Copy code
    daft.exceptions.DaftCoreException: DaftError::External task 6617 panicked with message "Failed to create S3 multipart writer: Generic { store: S3, source: UploadsCannotBeAnonymous }"
    👀 1
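    Multipart uploads can't be anonymous on S3, so until the alternative auth path is supported by the native writer, one workaround sketch is to pass explicit, non-anonymous credentials just for the write via IOConfig (field names here may differ by version):
    Copy code
    import daft
    from daft.io import IOConfig, S3Config

    io_config = IOConfig(s3=S3Config(
        anonymous=False,
        # key_id=..., access_key=..., or rely on the default credential chain
    ))

    df = daft.from_pydict({"x": [1]})
    df.write_parquet("s3://bucket/out/", io_config=io_config)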
  • Garrett Weaver

    07/25/2025, 4:44 AM
    👋 Added an issue around column order when writing out to parquet. In the latest version on the native runner, if I do a final select prior to writing to parquet, it is not necessarily respected, such that the column order when reading back is different. Is this expected?
    👀 1
  • Garrett Weaver

    07/28/2025, 5:33 PM
    Hi, do window functions and joins work in a single query with the new distributed engine on Ray, with the join pieces falling back to the old Ray distributed engine? I know that if I toggle the new distributed engine off, I get a not-implemented error:
    daft.exceptions.DaftCoreException: Not Yet Implemented: Window functions are currently only supported on the native runner.
    A small test with the new engine on seems to work, but I want to make sure there are not any caveats.
  • Everett Kleven

    07/28/2025, 9:41 PM
    Hey Daft team, how expensive is the average concat operation? Is it more recommended to append rows with pyarrow RecordBatches?
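    A hedged sketch of both approaches: Daft is lazy, so concat itself just extends the query plan and the cost shows up at collect time; for many tiny appends, accumulating pyarrow RecordBatches and converting once avoids growing the plan:
    Copy code
    import daft
    import pyarrow as pa

    # Lazy concat: nothing moves until .collect()/.show()
    a = daft.from_pydict({"x": [1, 2]})
    b = daft.from_pydict({"x": [3, 4]})
    c = a.concat(b)

    # Batch-first alternative: accumulate rows in Arrow, convert once
    batches = [pa.record_batch({"x": [5, 6]}), pa.record_batch({"x": [7]})]
    df = daft.from_arrow(pa.Table.from_batches(batches))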