Skip to content

feat(sedonadb/python): Add zero copy raster IO to/from numpy#999

Merged
paleolimbot merged 14 commits into
apache:mainfrom
paleolimbot:zero-copy-raster-access
Jun 24, 2026
Merged

feat(sedonadb/python): Add zero copy raster IO to/from numpy#999
paleolimbot merged 14 commits into
apache:mainfrom
paleolimbot:zero-copy-raster-access

Conversation

@paleolimbot

@paleolimbot paleolimbot commented Jun 22, 2026

Copy link
Copy Markdown
Member

This PR adds zero-coy support to/from numpy. I had thought my initial support was zero copy but there are a few operations that force a binaryview copy in pyarrow that make it tricky to make this work.

import sedona.db
from sedonadb.raster import Raster
import numpy as np

arr = np.arange(100 * 100).reshape((100, 100))
r = Raster.from_numpy(arr) 
assert np.shares_memory(r.bands[0].to_numpy(), arr)

sd = sedona.db.connect()
df = sd.sql("SELECT $1 as raster", params=(r, )).to_pandas()

r_from_df = df.iloc[0, 0]
assert np.shares_memory(r_from_df.bands[0].to_numpy(), arr)

Add _get_binary_view_buffer() to extract zero-copy memoryviews from
BinaryViewArray elements by decoding view descriptors and resolving
variadic buffers.

- Update Raster and Band __init__ to use slice() directly instead of
  pa.array() which copies
- Add source_data_size property to Band
- Update Band.source_data to use zero-copy extraction for out-of-line
  data, falling back to copy for inline data (≤12 bytes)
- Update Band.to_numpy() to use np.frombuffer for zero-copy
- Add tests verifying zero-copy behavior

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds (and tests) zero-copy raster band data interchange between SedonaDB’s Python Raster objects and NumPy by avoiding PyArrow BinaryView materialization/copies and by improving Arrow→Pandas conversion behavior.

Changes:

  • Updated Raster.from_numpy, Raster/Band Arrow slicing, and band to_numpy() to enable zero-copy NumPy views when possible.
  • Added BinaryView decoding helpers (_get_binary_view_buffer, _wrap_data_zero_copy) to extract out-of-line binary_view payloads without forcing PyArrow copies.
  • Expanded Python tests to validate zero-copy behavior across Arrow tables, chunked arrays, DB roundtrips, and Pandas conversion.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
python/sedonadb/tests/test_raster.py Adds coverage for chunked arrays, zero-copy to_numpy(), DB/Pandas roundtrip, and BinaryView buffer extraction edge cases.
python/sedonadb/python/sedonadb/raster.py Implements zero-copy wrapping/extraction for binary_view, updates constructors to avoid copying, and adds Pandas conversion hooks.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread python/sedonadb/python/sedonadb/raster.py Outdated
Comment thread python/sedonadb/python/sedonadb/raster.py Outdated
Comment thread python/sedonadb/python/sedonadb/raster.py Outdated
Comment thread python/sedonadb/python/sedonadb/raster.py Outdated
Comment thread python/sedonadb/python/sedonadb/raster.py

return np.array(self.data)
# Use frombuffer for zero-copy when data is loaded
if self.outdb_uri is not None and len(self.source_data) == 0:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the visible shape is how we determine if data is loaded or not. len(self.source_data) == 0 can mean loaded if the product of the elements of the shape vector is 0.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, you fixed that for me in a previous PR but it got lost in the shuffle. I'll fix.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reuse of data should solve this...we do have a test for an empty outdb view.


# Build BinaryArray zero-copy using from_buffers, then cast to binary_view
# BinaryArray buffers: [validity, offsets, data]
offsets = pa.py_buffer(np.array([0, nbytes], dtype=np.int32))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to check if nbytes is addressable by the int32 from the BinaryView.

if nbytes > 2**31 - 1:
    raise ValueError(
        f"Cannot store {nbytes} bytes in a single raster band: "
        "the Arrow BinaryView storage type limits a band to 2 GiB. "
        "Split the raster into multiple bands or tiles."
    )

# Build band struct array with zero-copy data field
band_data_array = _wrap_data_zero_copy(data)

band_struct = pa.StructArray.from_arrays(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need to redefine the schema here or can we use RASTER_STORAGE_TYPE?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, we do. This is because we need to use StructArray.from_arrays() specifically to avoid the copy.

@james-willis james-willis Jun 23, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

claude seems to think you can put a dummy data field in to avoid redefining the schema, then update the data field in a zero copy manner. My arrow knowledge is pretty limited here.

Details the claude code:
def _replace_child(struct_arr, name, new_child):
    """Return a copy of a StructArray with one named child replaced, deriving
    every field definition (type, order, nullability) from the array's own
    type — nothing about the schema is restated here."""
    t = struct_arr.type
    children = [
        new_child if t.field(i).name == name else struct_arr.field(i)
        for i in range(t.num_fields)
    ]
    fields = [t.field(i) for i in range(t.num_fields)]
    return pa.StructArray.from_arrays(children, fields=fields)


def _with_band_data(raster_array, data_array):
    """Inject `data_array` as the single band's `data` field, zero-copy,
    without touching any other buffer."""
    bands = raster_array.field("bands")
    new_band = _replace_child(bands.values, "data", data_array)
    # pass type= to preserve the list's non-null `item` field
    new_bands = pa.ListArray.from_arrays(bands.offsets, new_band, type=bands.type)
    return _replace_child(raster_array, "bands", new_bands)


def _build_raster(dim_names, shape, data_type_id, *, data, crs=None,
                  nodata=None, transform=None, outdb_uri=None, outdb_format=None):
    if crs is not None:
        crs = gat.type_spec(crs=crs).crs.to_json()
    nodata_bytes = None
    if nodata is not None:
        nodata_bytes = struct.pack("<" + BAND_DATA_TYPE_STRUCT_CHARS[data_type_id], nodata)

    y_name, x_name = dim_names[-2], dim_names[-1]
    height, width = shape[-2], shape[-1]

    # Build the whole structure declaratively with an empty `data` placeholder,
    # so types / field order / nullability all come from the canonical type.
    raster = {
        "crs": crs,
        "transform": list(transform) if transform is not None
                     else [0.0, 1.0, 0.0, 0.0, 0.0, -1.0],
        "spatial_dims": [x_name, y_name],
        "spatial_shape": [width, height],
        "bands": [{
            "name": None,
            "dim_names": list(dim_names),
            "source_shape": list(shape),
            "data_type": data_type_id,
            "nodata": nodata_bytes,
            "view": None,
            "outdb_uri": outdb_uri,
            "outdb_format": outdb_format,
            "data": b"",  # placeholder, replaced zero-copy below
        }],
    }
    raster_array = pa.array([raster], type=RASTER_STORAGE_TYPE)

    # Swap in the pixel data zero-copy (the one field pa.array() would copy).
    return Raster(_with_band_data(raster_array, _wrap_data_zero_copy(data))) 

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for that...you/Claude is/are probably correct here. This errors loudly at the planning stage if incorrect and so I'm inclined to keep it as is because it reads better (but I'm happy to change if you feel strongly).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine by me. I’m not planning on changing the schema any time soon.

Maybe a test that’ll fail if they drift is a good idea

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They all fail if this schema doesn't match 🙂 (used to be a crash before we fixed the panics!)

if self.outdb_uri is not None and len(self.source_data) == 0:
raise ValueError("Can't extract buffer from a reference to external data.")

return np.frombuffer(self.source_data, dtype=self.data_type).reshape(self.shape)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this cast duplicate to line 311?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch...I updated this to just reuse the casted memoryview from data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants