Yufan
05/21/2025, 11:29 PMYuri Gorokhov
05/23/2025, 9:29 PMAttempting to downcast Map { key: Utf8, value: List(Utf8) } to \"daft_core::array::list_array::ListArray\"
Wondering if someone has seen this before?Giridhar Pathak
05/25/2025, 1:37 AMTypeError: pyarrow.lib.large_list() takes no keyword arguments
the code:
table = catalog.load_table(table)
return df.read_iceberg(table)
has anyone experienced this before?
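For context, this TypeError usually means the installed pyarrow build only accepts the element type positionally; a minimal illustration of the error itself (not a fix for the Iceberg read path):
import pyarrow as pa

# Passing the element type positionally works across pyarrow versions.
ok = pa.large_list(pa.string())

# On some pyarrow versions the binding rejects keyword arguments, raising
# "TypeError: pyarrow.lib.large_list() takes no keyword arguments".
# broken = pa.large_list(value_type=pa.string())
Giridhar Pathak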
05/25/2025, 10:18 PMdaft.read_table("platform.messages").filter("event_time > TIMESTAMP '2025-05-24T00:00:00Z'").limit(5).show()
running this makes the process crash. Looks like memory goes through the roof. Not sure if it's trying to read the whole table into memory.
Pre-materialization, I can get the schema just fine.
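One way to check whether the whole table is being scanned is to inspect the query plan before materializing anything; a minimal sketch against the same table, assuming the catalog is already attached:
import daft

df = (
    daft.read_table("platform.messages")
    .filter("event_time > TIMESTAMP '2025-05-24T00:00:00Z'")
    .limit(5)
)

# Print the logical and physical plans; if the filter and limit are not pushed
# into the scan, Daft may be reading far more of the table than expected.
df.explain(show_all=True)
Everett Kleven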
05/28/2025, 2:15 PMYuri Gorokhov
05/28/2025, 4:14 PM.dropDuplicates(subset: Optional[List[str]] = None)
where you can specify which columns to consider?
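Until a subset argument exists, one workaround is to group on the key columns and keep an arbitrary value for every other column; a rough sketch, assuming any_value is available as an aggregation in your Daft version:
import daft
from daft import col

df = daft.from_pydict({
    "user_id": [1, 1, 2],
    "event": ["a", "b", "c"],
    "ts": [10, 20, 30],
})

# Deduplicate on user_id only: group on the subset, take one arbitrary value
# for each remaining column.
deduped = df.groupby("user_id").agg(
    col("event").any_value(),
    col("ts").any_value(),
)
Pat Patterson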
05/29/2025, 11:37 PM# How many records are in the current Drive Stats dataset?
count, elapsed_time = time_collect(drivestats.count())
print(f"Total record count: {count.to_pydict()['count'][0]} ({elapsed_time:.2f} seconds)")
With the other systems I tested in my blog post, the equivalent query takes between a fraction of a second and 15 seconds. That Daft call to drivestats.count()
takes 80 seconds. I'm guessing it's doing way more work than it needs to: reading the record counts from each of the 365 Parquet files rather than simply reading total-records
from the most recent metadata file. Since SELECT COUNT(*)
is such a common operation, I think it's worth short-circuiting the current behavior.
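The short-circuit Pat describes can be approximated today from the table metadata; a hedged sketch, assuming a pyiceberg Table object named table, that the writers populated the standard total-records snapshot summary property, and that the summary behaves like a mapping (as in recent pyiceberg releases):
# Pull the pre-computed row count from the current Iceberg snapshot summary
# instead of scanning every Parquet footer.
snapshot = table.current_snapshot()
if snapshot is not None:
    total = snapshot.summary.get("total-records")
    if total is not None:
        print(f"Total record count (from metadata): {int(total)}")
Giridhar Pathak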
06/03/2025, 2:17 PMEverett Kleven
06/03/2025, 2:59 PMPat Patterson
06/06/2025, 10:47 PMimport daft
import ray
ray.init("<ray://head_node_host:10001>", runtime_env={"pip": ["daft"]})
daft.context.set_runner_ray("<ray://head_node_host:10001>")
catalog = load_catalog(
'iceberg',
**{
'uri': 'sqlite:///:memory:',
# configuration to access Backblaze B2's S3-compatible API such as
# s3.endpoint, s3.region, etc
}
}
catalog.create_namespace('default', { 'location': f'<s3://my-bucket/'}>)
table = catalog.register_table('default.drivestats', metadata_location)
drivestats = daft.read_iceberg(table)
result = drivestats.count().collect()
print(f'Total record count: {result.to_pydict()['count'][0]}')
Presumably, the code to read Parquet files from Backblaze B2 via the AWS SDK executes on the Ray cluster, so I have to either install the necessary packages there ahead of time or specify them, and environment variables, in runtime_env
? For example:
ray.init("ray://<head_node_host>:10001", runtime_env={
"pip": ["daft==0.5.2", "boto3==1.34.162", "botocore==1.34.162", ...etc...],
"env_vars": {
"AWS_ENDPOINT_URL": os.environ["AWS_ENDPOINT_URL"],
"AWS_ACCESS_KEY_ID": os.environ["AWS_ACCESS_KEY_ID"],
"AWS_SECRET_ACCESS_KEY": os.environ["AWS_SECRET_ACCESS_KEY"],
...etc...
}
})
Dimitris
06/10/2025, 8:41 PMTabrez Mohammed
06/16/2025, 10:41 PMDimitris
06/16/2025, 11:22 PMSasha Shtern
06/17/2025, 8:06 PMDimitris
06/18/2025, 10:21 PMGarrett Weaver
06/19/2025, 4:25 PMcatalog.namepace.table
(e.g. write_table
), but others break and seem to expect namespace.table
(e.g. create_table_if_not_exists
which calls pyiceberg under the hood, and pyiceberg had a breaking change that no longer allows including the catalog name). Any advice on how I should be providing identifiers?
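As a stopgap, normalizing identifiers before calling the pyiceberg-backed methods avoids mixing the two forms; a trivial sketch with made-up names:
# Keep one fully-qualified identifier and strip the catalog prefix for APIs
# that only accept "namespace.table".
full_ident = "my_catalog.analytics.events"            # catalog.namespace.table
catalog_name, pyiceberg_ident = full_ident.split(".", 1)
# pyiceberg_ident == "analytics.events"
Garrett Weaver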
06/20/2025, 7:52 PMMarco Gorelli
06/24/2025, 8:58 AMEverett Kleven
06/24/2025, 4:11 PMSammy Sidhu
06/24/2025, 4:19 PMArtur Ciocanu
06/25/2025, 7:55 PMChanChan Mao
06/26/2025, 4:41 PMGarrett Weaver
06/27/2025, 7:23 PMpyarrow
when using Ray data directly and also seeing errors trying to read the same data with daft
. pyspark
works fine. I assume daft might be impacted by the same issue (too large row groups)?
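If oversized row groups turn out to be the culprit, rewriting the files with a smaller row group size is a quick way to test that theory; a sketch using plain pyarrow with made-up paths:
import pyarrow.parquet as pq

# Inspect the existing row-group layout first.
meta = pq.ParquetFile("data/big_row_groups.parquet").metadata
print(meta.num_row_groups, meta.row_group(0).num_rows)

# Rewrite with smaller row groups so readers never materialize a huge chunk.
table = pq.read_table("data/big_row_groups.parquet")
pq.write_table(table, "data/smaller_row_groups.parquet", row_group_size=128 * 1024)
delasoul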
07/14/2025, 8:47 AMAmir Shukayev
07/19/2025, 12:32 AMCoury Ditch
07/22/2025, 7:34 PMGarrett Weaver
07/24/2025, 3:45 AM0.5.9
, goes away if I set native_parquet_writer=False
, I am using anonymous
(we use an alternative way to authenticate in k8s)
daft.exceptions.DaftCoreException: DaftError::External task 6617 panicked with message "Failed to create S3 multipart writer: Generic { store: S3, source: UploadsCannotBeAnonymous }"
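The native writer refuses multipart uploads with anonymous credentials, so one workaround that keeps the native writer is to hand it explicit (or ambient) credentials through an IOConfig; a hedged sketch with placeholder values:
import daft
from daft.io import IOConfig, S3Config

# anonymous=False lets the S3 client resolve real credentials (static keys,
# instance profile, IRSA in k8s, etc.) instead of signing nothing.
io_config = IOConfig(
    s3=S3Config(
        anonymous=False,
        region_name="us-east-1",  # placeholder
        # key_id=..., access_key=... if not relying on ambient credentials
    )
)

df = daft.from_pydict({"x": [1, 2, 3]})
df.write_parquet("s3://my-bucket/out/", io_config=io_config)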
Garrett Weaver
07/25/2025, 4:44 AMselect
prior to writing to Parquet, it is not necessarily respected and the order when reading back is different. Is this expected?
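Assuming the question is about column order, one workaround is to re-assert the desired order with a select after reading back; column names below are made up:
import daft

df = daft.read_parquet("s3://my-bucket/out/*.parquet")
# Re-impose the column order that was selected before the write.
df = df.select("id", "event_time", "payload")
Garrett Weaver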
07/28/2025, 5:33 PMdaft.exceptions.DaftCoreException: Not Yet Implemented: Window functions are currently only supported on the native runner.
A small test with the new engine on seems to work, but I want to make sure there are not any caveats.
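For reference, the error points at the runner setting; a minimal sketch of switching to the native runner before building the query, assuming daft.context.set_runner_native() is available in this version:
import daft

# Select the native (single-node) runner, where window functions are currently
# supported; setting the DAFT_RUNNER=native environment variable is an
# alternative way to do the same thing.
daft.context.set_runner_native()
Everett Kleven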
07/28/2025, 9:41 PM