# general
  • k

    Kesav Kolla

    08/27/2025, 11:26 AM
    Is there any benefit to writing Rust functions instead of Python UDFs? Wondering what the performance penalty of Python UDFs is? I have billions of rows in my dataframe and need to perform row-wise transformations.
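    For context, a minimal sketch of the two Python-side options: built-in expressions run as native (Rust) kernels, while a daft.udf receives whole batches as a Series, so per-row Python overhead is amortized but not eliminated. The column names and the doubling logic below are purely illustrative.
    Copy code
    import daft
    from daft import col, DataType

    df = daft.from_pydict({"a": [1, 2, 3]})

    # Fastest path: built-in expressions execute as native kernels
    df = df.with_column("b_native", col("a") * 2)

    # Python batch UDF: called once per batch with a Series, not once per row
    @daft.udf(return_dtype=DataType.int64())
    def double_it(s):
        return [v * 2 for v in s.to_pylist()]

    df = df.with_column("b_udf", double_it(col("a")))
    df.show()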
  • g

    Garrett Weaver

    08/27/2025, 6:18 PM
    Is there general guidance on using daft.func vs daft.udf? I would guess that if the underlying python code is not taking advantage of any vectorization but is just a list comprehension [my_func(x) for x in some_series], then just use daft.func?
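    For illustration, a minimal sketch of the two styles. The assumption here is that daft.func wraps a plain per-row function (return type taken from the type hint), while daft.udf operates on whole batches and is the one to reach for when you can vectorize; names and dtypes below are made up.
    Copy code
    import numpy as np
    import daft
    from daft import col, DataType

    # row-wise: a plain scalar function (assumed daft.func usage)
    @daft.func
    def add_one(x: int) -> int:
        return x + 1

    # batch-wise: receives a Series, so it can use vectorized numpy ops
    @daft.udf(return_dtype=DataType.int64())
    def add_one_batch(s):
        return np.asarray(s.to_pylist()) + 1

    df = daft.from_pydict({"a": [1, 2, 3]})
    df.with_column("b", add_one(col("a"))).show()
    df.with_column("c", add_one_batch(col("a"))).show()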
  • g

    Garrett Weaver

    08/28/2025, 4:21 PM
    sqlmesh python models + daft would be 🔥 https://sqlmesh.readthedocs.io/en/latest/concepts/models/python_models/#pyspark
  • v

    VOID 001

    08/29/2025, 3:55 AM
    Hi, does Daft support JSON unnest in SQL queries? Is there any syntax like DuckDB's struct explode? Something similar to the following SQL would be nice
    Copy code
    df = daft.from_pydict({
        "json": [
            '{"a": 1, "b": 2}',
            '{"a": 3, "b": 4}',
        ],
    })
    df = daft.sql("SELECT json.* FROM df")
    df.collect()
  • a

    Amir Shukayev

    08/29/2025, 4:01 AM
    is concat lazy? Like
    Copy code
    from functools import reduce

    df = reduce(
        lambda df1, df2: df1.concat(df2),
        [
            df_provider[i].get_daft_df()
            for i in range(num_dfs)
        ],
    )
    Is there any way to lazily combine a set of dfs, in any order?
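    As far as I understand, Daft DataFrame operations (including concat) only build a logical plan, and nothing executes until collect()/show(). A quick self-contained way to check, using made-up toy frames:
    Copy code
    from functools import reduce
    import daft

    dfs = [daft.from_pydict({"a": [i]}) for i in range(3)]
    combined = reduce(lambda a, b: a.concat(b), dfs)

    combined.explain()         # prints the plan; nothing has run yet
    print(combined.collect())  # execution happens here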
  • s

    Sky Yin

    08/29/2025, 10:31 PM
    When looking at the documentation, I don't see any data connector for GCP. How does Daft query data in Google Cloud Storage?
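    A minimal sketch of what this usually looks like, assuming the gs:// scheme plus a GCS config on the IOConfig; the bucket, path, and exact GCSConfig parameter names here are illustrative and may differ by version.
    Copy code
    import daft
    from daft.io import IOConfig, GCSConfig

    # assumed: credentials are picked up from the environment unless set explicitly
    io_config = IOConfig(gcs=GCSConfig(project_id="my-project"))

    df = daft.read_parquet("gs://my-bucket/my-path/**/*.parquet", io_config=io_config)
    df.show()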
  • g

    Garrett Weaver

    09/04/2025, 8:41 PM
    Hi, with the new flotilla runner, should I expect OOM on the head node? I see get_next_partition is running there.
  • d

    Desmond Cheong

    09/04/2025, 11:58 PM
    Apparently we're trending on github (in rust) now! Thank you for all your support and love, and thank you to everyone who's been using Daft and building Daft alongside us :') https://www.reddit.com/r/rust/comments/1n8o8ud/daft_is_trending_on_github_in_rust/
    🔥 6
    ❤️ 8
  • v

    VOID 001

    09/05/2025, 5:56 AM
    Hi group, is there any benchmark comparing Ray Data and Daft?
    💯 1
  • p

    Peer Schendel

    09/07/2025, 9:10 AM
    Hi guys, I am a data engineer on an AI team. I stumbled upon Daft and see a lot of benefits in using it 🙂 I saw the llm_generate() function. I was wondering, does it also work with LLM-proxy providers like LiteLLM? I also heard in the video that it runs in batches, like batch inference. But I was wondering if there might be a nice implementation to run the real batch API from Azure OpenAI, OpenAI, or other providers: https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/batch?tabs=global-bat[…]ndard-input%2Cpython-key&pivots=programming-language-python
    Copy code
    import os
    from datetime import datetime

    from openai import AzureOpenAI

    client = AzureOpenAI(
        api_key=os.getenv("AZURE_OPENAI_API_KEY"),
        api_version="2025-03-01-preview",
        azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    )

    # Upload a file with a purpose of "batch"
    file = client.files.create(
        file=open("test.jsonl", "rb"),
        purpose="batch",
        # Optional: expiry between 1209600 and 2592000 seconds (14-30 days)
        extra_body={"expires_after": {"seconds": 1209600, "anchor": "created_at"}},
    )

    print(file.model_dump_json(indent=2))

    print(f"File expiration: {datetime.fromtimestamp(file.expires_at) if file.expires_at is not None else 'Not set'}")

    file_id = file.id
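    Not Daft-specific, but a possible continuation using the standard OpenAI Batch API: submit the uploaded JSONL as a batch job, then poll and fetch the output file. The endpoint and completion window values are just examples.
    Copy code
    batch = client.batches.create(
        input_file_id=file_id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
    )
    print(batch.id, batch.status)

    # later: poll until the job finishes, then download the results file
    batch = client.batches.retrieve(batch.id)
    if batch.status == "completed":
        results_jsonl = client.files.content(batch.output_file_id).read()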
  • e

    Edmondo Porcu

    09/07/2025, 4:17 PM
    Hello world, minor member of the DataFusion community here 🙂 I actually think I met one of the founders of Daft at a startup event
    ❤️ 5
  • c

    ChanChan Mao

    09/08/2025, 5:29 PM
    The growth we saw last week was absolutely incredible 🔥 In the past 10 days, we've gained 700+ stars 🤯 Thank you to all for your support and for believing in Daft 🫶
    🎉 6
    🚀 8
    daft party 9
  • c

    ChanChan Mao

    09/09/2025, 6:23 PM
    aaaaand we're live on Hugging Face documentation! Thank you to Quentin Lhoest, Daniel van Strien, and the Hugging Face team for all their help pushing this through, and excited for our continued collaboration! https://huggingface.co/docs/hub/datasets-daft
    🤗 7
    🙌 6
  • k

    Kyle

    09/11/2025, 5:04 AM
    For llm_generate, is it possible to run a local Hugging Face model? Perhaps by directly putting the local model repo path in the params instead of an open Hugging Face repo name?
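    An untested sketch of what that might look like. It assumes llm_generate is imported from daft.functions, that the vLLM provider is in use, and that the provider accepts a local directory wherever it accepts a Hub repo id; the path and model arguments below are hypothetical.
    Copy code
    import daft
    from daft import col
    from daft.functions import llm_generate

    df = daft.from_pydict({"prompt": ["Summarize: Daft is a dataframe library."]})

    # hypothetical local path standing in for a Hugging Face Hub repo name
    df = df.with_column(
        "output",
        llm_generate(col("prompt"), model="/models/qwen2.5-1.5b-instruct", provider="vllm"),
    )
    df.collect()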
  • e

    Edmondo Porcu

    09/12/2025, 6:36 PM
    Quick question about Daft: how reusable is its integration with Ray? The reason I ask is that datafusion-ray was an interesting project, tried to do some work, had no time, someone else picked it up, they had no time... Daft seems to be using Ray for distributing the workload.
  • r

    Rakesh Jain

    09/12/2025, 10:15 PM
    Hello Daft team, for the Lakevision project, which is for visualizing Iceberg based Lakehouses, we use daft for SQL and Sample Data, and we are very happy with it. Thanks for the great work!
    ❤️ 4
  • k

    Kyle

    09/15/2025, 6:22 AM
    Are there any plans to have a function to generate the perplexity of a model on a given text? E.g. perplexity of qwen1.5b on a particular string?
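    There isn't such a built-in that I know of, but here is a sketch of how it could be done today with a stateful batch UDF: Hugging Face transformers computes the mean token-level cross-entropy, which is exponentiated to get perplexity. The model name and dtype choices are illustrative only.
    Copy code
    import math

    import daft
    from daft import col, DataType

    @daft.udf(return_dtype=DataType.float64())
    class Perplexity:
        def __init__(self, model_name: str = "Qwen/Qwen2.5-1.5B"):
            import torch
            from transformers import AutoModelForCausalLM, AutoTokenizer

            self.torch = torch
            self.tokenizer = AutoTokenizer.from_pretrained(model_name)
            self.model = AutoModelForCausalLM.from_pretrained(model_name).eval()

        def __call__(self, texts):
            ppls = []
            for text in texts.to_pylist():
                enc = self.tokenizer(text, return_tensors="pt")
                with self.torch.no_grad():
                    # labels=input_ids makes the model return the mean cross-entropy loss
                    loss = self.model(**enc, labels=enc["input_ids"]).loss
                ppls.append(math.exp(loss.item()))
            return ppls

    df = daft.from_pydict({"text": ["The quick brown fox jumps over the lazy dog."]})
    df.with_column("ppl", Perplexity(col("text"))).show()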
  • u

    吕威

    09/16/2025, 7:22 AM
    Hi guys, I'm confused about DataType.image(). I am trying to use a YOLO model to detect objects in images, then crop each object out for a later embedding step.
    Copy code
    from typing import List

    import cv2
    import numpy as np
    from daft import DataType, Series, udf


    @udf(
        return_dtype=DataType.list(
            DataType.struct(
                {
                    "class": DataType.string(),
                    "score": DataType.float64(),
                    "cropped_img": DataType.image(),
                    "bbox": DataType.list(DataType.int64()),
                }
            )
        ),
        num_gpus=1,
        batch_size=16,
    )
    class YOLOWorldOnnxObjDetect:
        def __init__(
            self,
            model_path: str,
            device: str = "cuda:0",
            confidence: float = 0.25,
        ):
            # init model here (loads the ONNX weights into self.yolo, stores self.confidence)
            pass

        def __call__(self, images_2d_col: Series) -> List[List[dict]]:
            images: List[np.ndarray] = images_2d_col.to_pylist()
            results = self.yolo.predict(source=images, conf=self.confidence)
            objs = []  # one list of detections per input image
            for r in results:
                img_result = []
                orig_img = r.orig_img
                for box in r.boxes:
                    x1, y1, x2, y2 = box.xyxy[0].cpu().numpy().astype(int)
                    x1, y1 = max(0, x1), max(0, y1)
                    x2, y2 = min(orig_img.shape[1], x2), min(orig_img.shape[0], y2)
                    x1, y1, x2, y2 = int(x1), int(y1), int(x2), int(y2)
                    cls = int(box.cls[0])
                    img_result.append(
                        {
                            "class": self.yolo.names[cls],
                            "score": float(box.conf[0]),
                            "cropped_img": {
                                "cropimg": cv2.cvtColor(
                                    orig_img[y1:y2, x1:x2], cv2.COLOR_BGR2RGB
                                ),
                            },
                            "bbox": [x1, y1, x2, y2],
                        }
                    )
                objs.append(img_result)
            return objs
    The cropped_img has to be returned as a dict; if I return an np.ndarray directly, it raises a "Could not convert array(..., dtype=uint8) with type numpy.ndarray: was expecting tuple of (key, value) pair" error. Why?
  • c

    ChanChan Mao

    09/22/2025, 4:00 PM
    hey all! i wanted to share this use case that we came across in our community 😊 authored by @YK
    When John Shelburne of CatFIX Technology needed to process 59 million bond market records (28M bonds + 31M prices + trade ideas) for his neural network model, he had a few options:
    • Continue using pandas (taking 3 days for processing)
    • Move to Vertex AI or another cloud service (more costs)
    Essentially, he was choosing between painfully slow local processing or expensive cloud compute. That's when he discovered Daft. By switching from pandas to Daft, John achieved:
    • 7.3 minutes total runtime (vs 3 days previously)
    • 600x faster processing
    • 50% lower memory usage
    • All running locally on his iMac
    As John put it: "HOLY SMOKES WAS IT FAST! My neural network model is now trainable in real time."
    See his original post: https://www.linkedin.com/posts/shelburne_daft-eventual-katana-activity-7371644995176460288-a02g
    daft party 6
    🙌 6
  • c

    ChanChan Mao

    09/23/2025, 12:33 AM
    Hey everyone! Just wanted to share in this channel that we're bringing back the Daft Contributor Sync series, where we'll highlight work in the open source, cover the latest releases and features, and shout out our contributors! This month's contributor sync will be this Thursday, September 25, at 4pm PT. We'll be talking about major improvements that we've shipped in the last few months, like Model APIs, UDF improvements, integrations with Turbopuffer, ClickHouse, and Lance, and our new daft.File datatype. Following that, @Colin Ho will dive into his work on Flotilla, our distributed engine, and showcase some exciting benchmark results 👀 We'll leave plenty of time at the end for questions and discussions. Add it to your calendar and we'll see you then! 👋
    daft party 4
  • g

    Garrett Weaver

    09/23/2025, 10:28 PM
    any recommendations on dealing with protobuf? technically I can convert to JSON, but would like to convert to a structured type automatically for easier access.
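    One rough option (an assumption, not a built-in Daft integration): decode the messages with the generated protobuf classes, turn them into nested dicts, and let Daft infer a struct schema from those. MyMessage and raw_blobs below are hypothetical placeholders.
    Copy code
    import daft
    from google.protobuf.json_format import MessageToDict

    # hypothetical: MyMessage is your generated protobuf class,
    # raw_blobs is a list of serialized message bytes
    msgs = [MyMessage.FromString(b) for b in raw_blobs]
    rows = [MessageToDict(m, preserving_proto_field_name=True) for m in msgs]

    # from_pylist infers nested struct types from the dicts
    df = daft.from_pylist(rows)
    df.show()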
  • a

    Amir Shukayev

    09/24/2025, 11:38 PM
    any recommendations on running daft in a managed instance group on GCP?
  • n

    Nathan Cai

    09/24/2025, 11:59 PM
    Hey guys, quick question, but isn't this supposed to say
    Copy code
    # Supply actual values for the s3
    Not
    Copy code
    # Supply actual values for the se
    in the docs? https://docs.daft.ai/en/stable/connectors/aws/#rely-on-environment
    Copy code
    from daft.io import IOConfig, S3Config

    # Supply actual values for the se
    io_config = IOConfig(s3=S3Config(key_id="key_id", session_token="session_token", secret_key="secret_key"))

    # Globally set the default IOConfig for any subsequent I/O calls
    daft.set_planning_config(default_io_config=io_config)

    # Perform some I/O operation
    df = daft.read_parquet("s3://my_bucket/my_path/**/*")
  • c

    ChanChan Mao

    09/25/2025, 11:01 PM
    Hey everyone, we're live for the contributor sync! Join us 🙂 https://us06web.zoom.us/j/89647699067?pwd=b0jsNnL9yT1L2wTsDoG6kh9e83kcp7.1&jst=2
    ❤️ 3
  • g

    Garrett Weaver

    09/26/2025, 4:58 AM
    when running a UDF with the native runner and use_process=True, everything works fine locally (mac), but I'm seeing
    /usr/bin/bash: line 1:    58 Bus error               (core dumped)
    when run on k8s (argo workflows). any thoughts?
  • g

    Garrett Weaver

    09/26/2025, 5:11 PM
    I am using explode with the average result being 1 row --> 12 rows and the max 1 row --> 366 rows (~5m rows --> ~66m rows). Seeing decently high memory usage during the explode even with a repartition prior to it. Is the only remedy more partitions and/or a reduced number of CPUs to lower parallelism?
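    For what it's worth, a small sketch of the "more, smaller partitions before exploding" knob, so each task only materializes a modest exploded chunk; the partition count and column names are illustrative.
    Copy code
    import daft
    from daft import col

    df = daft.from_pydict({"id": [1, 2], "vals": [[1, 2, 3], [4, 5]]})

    # split into more partitions before the explode (8 is just an example)
    df = df.into_partitions(8)
    df = df.explode(col("vals"))
    df.collect()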
  • n

    Nathan Cai

    09/29/2025, 1:00 PM
    Hi there, just a quick question, what value does Daft provide compared to me just using multi-threaded programming? In which use cases does Daft go above and beyond just multi-threading?
  • r

    Robert Howell

    09/30/2025, 4:39 PM
    👋 Looking to get started with Daft? We’ve curated some approachable Good First Issues which the team is happy to help you with!
    New Expressions: these are fun because you'll get to introduce entirely new capabilities.
    • Add `Expression.var`
    • Add `Expression.pow`
    • Add `Expression.product`
    • Support `ddof` argument for `stddev`
    • Support simple case and searched case with a list of branches
    • UUID function <-- very easy and very 🆒
    Enhancements: these are interesting because you'll get to learn from the existing functionality and extend current capabilities.
    • Support `Series[start:end]` slicing
    • Support image hash functions for deduplication
    • Equality on structured types <-- this would be a great addition
    • Hash rows of dataframes
    • Custom CSV delimiters <-- also a great addition!!
    • Custom CSV quote characters
    • Custom CSV date/time formatting
    Documentation: we are ALWAYS interested in documentation PRs that improve docs.daft.ai. Here's a mega-list which you could work with coding agents to curate something nice!
    • https://github.com/Eventual-Inc/Daft/issues/4125
    Full List
    • https://github.com/Eventual-Inc/Daft/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22good%20first%20issue%22&page=1
    ❤️ 6
    🙌 1
  • d

    Dan Reverri

    09/30/2025, 9:50 PM
    I’m trying to read a large CSV file and want to iterate over Arrow record batches, but Daft is reading the whole file before starting to yield batches. Are there any options to process the CSV file in chunks?
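    A possible workaround sketch, pulling results partition-by-partition instead of collecting everything; whether the CSV scan itself streams depends on the runner/version, and the per-partition Arrow conversion used below is an assumption.
    Copy code
    import daft

    df = daft.read_csv("large_file.csv")

    for part in df.iter_partitions():
        tbl = part.to_arrow()  # assumed per-partition conversion to a pyarrow Table
        for batch in tbl.to_batches():
            print(batch.num_rows)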
  • c

    ChanChan Mao

    10/03/2025, 9:35 PM
    If you're new to Daft and our community, I highly recommend reading this overview-of-Daft blog post written by @Euan Lim! The technical concepts are explained very well and are easily digestible. Thanks for this writeup!
    🙌 5