feat(sedonadb/python): Add zero copy raster IO to/from numpy#999
Conversation
Add _get_binary_view_buffer() to extract zero-copy memoryviews from BinaryViewArray elements by decoding view descriptors and resolving variadic buffers. - Update Raster and Band __init__ to use slice() directly instead of pa.array() which copies - Add source_data_size property to Band - Update Band.source_data to use zero-copy extraction for out-of-line data, falling back to copy for inline data (≤12 bytes) - Update Band.to_numpy() to use np.frombuffer for zero-copy - Add tests verifying zero-copy behavior
There was a problem hiding this comment.
Pull request overview
This PR adds (and tests) zero-copy raster band data interchange between SedonaDB’s Python Raster objects and NumPy by avoiding PyArrow BinaryView materialization/copies and by improving Arrow→Pandas conversion behavior.
Changes:
- Updated
Raster.from_numpy,Raster/BandArrow slicing, and bandto_numpy()to enable zero-copy NumPy views when possible. - Added BinaryView decoding helpers (
_get_binary_view_buffer,_wrap_data_zero_copy) to extract out-of-linebinary_viewpayloads without forcing PyArrow copies. - Expanded Python tests to validate zero-copy behavior across Arrow tables, chunked arrays, DB roundtrips, and Pandas conversion.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| python/sedonadb/tests/test_raster.py | Adds coverage for chunked arrays, zero-copy to_numpy(), DB/Pandas roundtrip, and BinaryView buffer extraction edge cases. |
| python/sedonadb/python/sedonadb/raster.py | Implements zero-copy wrapping/extraction for binary_view, updates constructors to avoid copying, and adds Pandas conversion hooks. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| return np.array(self.data) | ||
| # Use frombuffer for zero-copy when data is loaded | ||
| if self.outdb_uri is not None and len(self.source_data) == 0: |
There was a problem hiding this comment.
the visible shape is how we determine if data is loaded or not. len(self.source_data) == 0 can mean loaded if the product of the elements of the shape vector is 0.
There was a problem hiding this comment.
Ah sorry, you fixed that for me in a previous PR but it got lost in the shuffle. I'll fix.
There was a problem hiding this comment.
The reuse of data should solve this...we do have a test for an empty outdb view.
|
|
||
| # Build BinaryArray zero-copy using from_buffers, then cast to binary_view | ||
| # BinaryArray buffers: [validity, offsets, data] | ||
| offsets = pa.py_buffer(np.array([0, nbytes], dtype=np.int32)) |
There was a problem hiding this comment.
I think we need to check if nbytes is addressable by the int32 from the BinaryView.
if nbytes > 2**31 - 1:
raise ValueError(
f"Cannot store {nbytes} bytes in a single raster band: "
"the Arrow BinaryView storage type limits a band to 2 GiB. "
"Split the raster into multiple bands or tiles."
)| # Build band struct array with zero-copy data field | ||
| band_data_array = _wrap_data_zero_copy(data) | ||
|
|
||
| band_struct = pa.StructArray.from_arrays( |
There was a problem hiding this comment.
do we really need to redefine the schema here or can we use RASTER_STORAGE_TYPE?
There was a problem hiding this comment.
Unfortunately, we do. This is because we need to use StructArray.from_arrays() specifically to avoid the copy.
There was a problem hiding this comment.
claude seems to think you can put a dummy data field in to avoid redefining the schema, then update the data field in a zero copy manner. My arrow knowledge is pretty limited here.
Details
the claude code:def _replace_child(struct_arr, name, new_child):
"""Return a copy of a StructArray with one named child replaced, deriving
every field definition (type, order, nullability) from the array's own
type — nothing about the schema is restated here."""
t = struct_arr.type
children = [
new_child if t.field(i).name == name else struct_arr.field(i)
for i in range(t.num_fields)
]
fields = [t.field(i) for i in range(t.num_fields)]
return pa.StructArray.from_arrays(children, fields=fields)
def _with_band_data(raster_array, data_array):
"""Inject `data_array` as the single band's `data` field, zero-copy,
without touching any other buffer."""
bands = raster_array.field("bands")
new_band = _replace_child(bands.values, "data", data_array)
# pass type= to preserve the list's non-null `item` field
new_bands = pa.ListArray.from_arrays(bands.offsets, new_band, type=bands.type)
return _replace_child(raster_array, "bands", new_bands)
def _build_raster(dim_names, shape, data_type_id, *, data, crs=None,
nodata=None, transform=None, outdb_uri=None, outdb_format=None):
if crs is not None:
crs = gat.type_spec(crs=crs).crs.to_json()
nodata_bytes = None
if nodata is not None:
nodata_bytes = struct.pack("<" + BAND_DATA_TYPE_STRUCT_CHARS[data_type_id], nodata)
y_name, x_name = dim_names[-2], dim_names[-1]
height, width = shape[-2], shape[-1]
# Build the whole structure declaratively with an empty `data` placeholder,
# so types / field order / nullability all come from the canonical type.
raster = {
"crs": crs,
"transform": list(transform) if transform is not None
else [0.0, 1.0, 0.0, 0.0, 0.0, -1.0],
"spatial_dims": [x_name, y_name],
"spatial_shape": [width, height],
"bands": [{
"name": None,
"dim_names": list(dim_names),
"source_shape": list(shape),
"data_type": data_type_id,
"nodata": nodata_bytes,
"view": None,
"outdb_uri": outdb_uri,
"outdb_format": outdb_format,
"data": b"", # placeholder, replaced zero-copy below
}],
}
raster_array = pa.array([raster], type=RASTER_STORAGE_TYPE)
# Swap in the pixel data zero-copy (the one field pa.array() would copy).
return Raster(_with_band_data(raster_array, _wrap_data_zero_copy(data))) There was a problem hiding this comment.
Thanks for that...you/Claude is/are probably correct here. This errors loudly at the planning stage if incorrect and so I'm inclined to keep it as is because it reads better (but I'm happy to change if you feel strongly).
There was a problem hiding this comment.
Fine by me. I’m not planning on changing the schema any time soon.
Maybe a test that’ll fail if they drift is a good idea
There was a problem hiding this comment.
They all fail if this schema doesn't match 🙂 (used to be a crash before we fixed the panics!)
| if self.outdb_uri is not None and len(self.source_data) == 0: | ||
| raise ValueError("Can't extract buffer from a reference to external data.") | ||
|
|
||
| return np.frombuffer(self.source_data, dtype=self.data_type).reshape(self.shape) |
There was a problem hiding this comment.
is this cast duplicate to line 311?
There was a problem hiding this comment.
Good catch...I updated this to just reuse the casted memoryview from data.
This PR adds zero-coy support to/from numpy. I had thought my initial support was zero copy but there are a few operations that force a binaryview copy in pyarrow that make it tricky to make this work.