feat(sedonadb/python): Add zero copy raster IO to/from numpy by paleolimbot · Pull Request #999 · apache/sedona-db

paleolimbot · 2026-06-22T18:07:32Z

This PR adds zero-coy support to/from numpy. I had thought my initial support was zero copy but there are a few operations that force a binaryview copy in pyarrow that make it tricky to make this work.

import sedona.db
from sedonadb.raster import Raster
import numpy as np

arr = np.arange(100 * 100).reshape((100, 100))
r = Raster.from_numpy(arr) 
assert np.shares_memory(r.bands[0].to_numpy(), arr)

sd = sedona.db.connect()
df = sd.sql("SELECT $1 as raster", params=(r, )).to_pandas()

r_from_df = df.iloc[0, 0]
assert np.shares_memory(r_from_df.bands[0].to_numpy(), arr)

Add _get_binary_view_buffer() to extract zero-copy memoryviews from BinaryViewArray elements by decoding view descriptors and resolving variadic buffers. - Update Raster and Band __init__ to use slice() directly instead of pa.array() which copies - Add source_data_size property to Band - Update Band.source_data to use zero-copy extraction for out-of-line data, falling back to copy for inline data (≤12 bytes) - Update Band.to_numpy() to use np.frombuffer for zero-copy - Add tests verifying zero-copy behavior

…cess

Copilot

Pull request overview

This PR adds (and tests) zero-copy raster band data interchange between SedonaDB’s Python Raster objects and NumPy by avoiding PyArrow BinaryView materialization/copies and by improving Arrow→Pandas conversion behavior.

Changes:

Updated Raster.from_numpy, Raster/Band Arrow slicing, and band to_numpy() to enable zero-copy NumPy views when possible.
Added BinaryView decoding helpers (_get_binary_view_buffer, _wrap_data_zero_copy) to extract out-of-line binary_view payloads without forcing PyArrow copies.
Expanded Python tests to validate zero-copy behavior across Arrow tables, chunked arrays, DB roundtrips, and Pandas conversion.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File	Description
python/sedonadb/tests/test_raster.py	Adds coverage for chunked arrays, zero-copy `to_numpy()`, DB/Pandas roundtrip, and BinaryView buffer extraction edge cases.
python/sedonadb/python/sedonadb/raster.py	Implements zero-copy wrapping/extraction for `binary_view`, updates constructors to avoid copying, and adds Pandas conversion hooks.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

james-willis · 2026-06-23T20:06:12Z


-        return np.array(self.data)
+        # Use frombuffer for zero-copy when data is loaded
+        if self.outdb_uri is not None and len(self.source_data) == 0:


the visible shape is how we determine if data is loaded or not. len(self.source_data) == 0 can mean loaded if the product of the elements of the shape vector is 0.

Ah sorry, you fixed that for me in a previous PR but it got lost in the shuffle. I'll fix.

The reuse of data should solve this...we do have a test for an empty outdb view.

james-willis · 2026-06-23T20:47:35Z

+
+    # Build BinaryArray zero-copy using from_buffers, then cast to binary_view
+    # BinaryArray buffers: [validity, offsets, data]
+    offsets = pa.py_buffer(np.array([0, nbytes], dtype=np.int32))


I think we need to check if nbytes is addressable by the int32 from the BinaryView.

if nbytes > 2**31 - 1: raise ValueError( f"Cannot store {nbytes} bytes in a single raster band: " "the Arrow BinaryView storage type limits a band to 2 GiB. " "Split the raster into multiple bands or tiles." )

james-willis · 2026-06-23T21:36:08Z

+    # Build band struct array with zero-copy data field
+    band_data_array = _wrap_data_zero_copy(data)
+
+    band_struct = pa.StructArray.from_arrays(


do we really need to redefine the schema here or can we use RASTER_STORAGE_TYPE?

Unfortunately, we do. This is because we need to use StructArray.from_arrays() specifically to avoid the copy.

claude seems to think you can put a dummy data field in to avoid redefining the schema, then update the data field in a zero copy manner. My arrow knowledge is pretty limited here.

Details
the claude code:
def _replace_child(struct_arr, name, new_child): """Return a copy of a StructArray with one named child replaced, deriving every field definition (type, order, nullability) from the array's own type — nothing about the schema is restated here.""" t = struct_arr.type children = [ new_child if t.field(i).name == name else struct_arr.field(i) for i in range(t.num_fields) ] fields = [t.field(i) for i in range(t.num_fields)] return pa.StructArray.from_arrays(children, fields=fields) def _with_band_data(raster_array, data_array): """Inject `data_array` as the single band's `data` field, zero-copy, without touching any other buffer.""" bands = raster_array.field("bands") new_band = _replace_child(bands.values, "data", data_array) # pass type= to preserve the list's non-null `item` field new_bands = pa.ListArray.from_arrays(bands.offsets, new_band, type=bands.type) return _replace_child(raster_array, "bands", new_bands) def _build_raster(dim_names, shape, data_type_id, *, data, crs=None, nodata=None, transform=None, outdb_uri=None, outdb_format=None): if crs is not None: crs = gat.type_spec(crs=crs).crs.to_json() nodata_bytes = None if nodata is not None: nodata_bytes = struct.pack("<" + BAND_DATA_TYPE_STRUCT_CHARS[data_type_id], nodata) y_name, x_name = dim_names[-2], dim_names[-1] height, width = shape[-2], shape[-1] # Build the whole structure declaratively with an empty `data` placeholder, # so types / field order / nullability all come from the canonical type. raster = { "crs": crs, "transform": list(transform) if transform is not None else [0.0, 1.0, 0.0, 0.0, 0.0, -1.0], "spatial_dims": [x_name, y_name], "spatial_shape": [width, height], "bands": [{ "name": None, "dim_names": list(dim_names), "source_shape": list(shape), "data_type": data_type_id, "nodata": nodata_bytes, "view": None, "outdb_uri": outdb_uri, "outdb_format": outdb_format, "data": b"", # placeholder, replaced zero-copy below }], } raster_array = pa.array([raster], type=RASTER_STORAGE_TYPE) # Swap in the pixel data zero-copy (the one field pa.array() would copy). return Raster(_with_band_data(raster_array, _wrap_data_zero_copy(data)))

Thanks for that...you/Claude is/are probably correct here. This errors loudly at the planning stage if incorrect and so I'm inclined to keep it as is because it reads better (but I'm happy to change if you feel strongly).

Fine by me. I’m not planning on changing the schema any time soon.

Maybe a test that’ll fail if they drift is a good idea

They all fail if this schema doesn't match 🙂 (used to be a crash before we fixed the panics!)

james-willis · 2026-06-23T22:34:25Z

+        if self.outdb_uri is not None and len(self.source_data) == 0:
+            raise ValueError("Can't extract buffer from a reference to external data.")
+
+        return np.frombuffer(self.source_data, dtype=self.data_type).reshape(self.shape)


is this cast duplicate to line 311?

Good catch...I updated this to just reuse the casted memoryview from data.

paleolimbot added 9 commits June 22, 2026 13:03

Merge remote-tracking branch 'upstream/main' into zero-copy-raster-ac…

715b017

…cess

fix test

9d90f6d

add pandas conversion

140a6e1

add table accessor and tests

f9e3de8

check the roundtrip

77c22e9

format

fe6570a

add note

9a2a4ca

zero copy other direction

65ccf84

paleolimbot requested a review from Copilot June 22, 2026 21:15

Copilot started reviewing on behalf of paleolimbot June 22, 2026 21:15 View session

Copilot AI reviewed Jun 22, 2026

View reviewed changes

paleolimbot added 3 commits June 23, 2026 08:51

comments

aaf47a9

remove doc warnings

743f598

fix

8284f25

paleolimbot marked this pull request as ready for review June 23, 2026 13:57

paleolimbot mentioned this pull request Jun 23, 2026

feat(rust/sedona-raster-functions): add RS_Value for spatial 2d rasters #974

Merged

james-willis reviewed Jun 23, 2026

View reviewed changes

paleolimbot added 2 commits June 23, 2026 23:06

fix empty with numpy, dedup handling with .data

0e4c8ac

check

7c98c35

paleolimbot mentioned this pull request Jun 24, 2026

[Python] PyArrow copies large BinaryView buffers during construction and scalar extraction apache/arrow#50246

Open

paleolimbot merged commit e46ff75 into apache:main Jun 24, 2026
5 checks passed

paleolimbot mentioned this pull request Jun 24, 2026

chore(python/sedonadb): Error on attempt to convert a Raster array to Pandas for Pandas < 3.0 #1002

Open

Uh oh!

Conversation

paleolimbot commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

james-willis Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

paleolimbot commented Jun 22, 2026 •

edited

Loading

james-willis Jun 23, 2026 •

edited

Loading