Refactor GeoTIFF Phase 5b: extract _decode.py from _reader.py#2254
Conversation
brendancol
left a comment
There was a problem hiding this comment.
PR Review: Refactor GeoTIFF Phase 5b, extract _decode.py from _reader.py
Blockers
None. The extraction is mechanical and behavior-preserving. All 5033 geotiff tests pass.
Suggestions
-
xrspatial/geotiff/_decode.pylines 343-348 and 543-548: the lazyfrom ._reader import ...inside_read_strips/_read_tilesis a hidden runtime dependency on_reader.py. The module docstring already calls this a leaf module for transport-independent decode, but it doesn't say the module deliberately avoids a top-level_readerimport to break a cycle. The inline comments are good; lifting that note into the docstring is a one-liner. -
xrspatial/geotiff/_decode.pylines 364 and 521:_read_strips/_read_tileslost theirmax_pixels: intannotation when the signature becamemax_pixels=_MAX_PIXELS_UNSET._resolve_max_pixelskeeps the runtime behavior, but type checkers and IDEs no longer see the int. The sentinel itself is the right shape givenMAX_PIXELS_DEFAULTstays in_reader.py. Annotatingmax_pixels: int = ...(callers should still pass an int) restores the hint at zero cost. -
xrspatial/geotiff/_decode.pylines 478 and 678: the function-localfrom concurrent.futures import ThreadPoolExecutoris the seam tests now patch. The comment says so on the tile path. The strip path got the same comment but lost the#1551issue cross-reference the tile path has. One-line fix.
Nits
-
xrspatial/geotiff/_decode.pyline 22:import os as _os_modulemirrors the alias in_reader.py. There's noosshadow risk inside_decode.py, so plainimport oswould read fine here. Keeping the alias for grep continuity is also defensible. -
xrspatial/geotiff/tests/test_parallel_strip_decode_2100.py: a few patches still target_reader_mod._PARALLEL_DECODE_PIXEL_THRESHOLD. They work because the back-import binds at module load, but the canonical home is now_decode. Switching the patch target (or leaving a comment about why both work) avoids a future puzzle. -
xrspatial/geotiff/_decode.pylines 50-60:_resolve_max_pixelsdoesfrom ._reader import MAX_PIXELS_DEFAULTon every default-value call. It's asys.modulesdict lookup once_readeris loaded, so the cost is microseconds. Mentioning it for completeness, not worth changing.
What looks good
- Function bodies are byte-identical to the originals, diffed against
origin/main. - The back-import block in
_reader.pykeeps every downstream consumer working:_backends/dask.py,_backends/gpu.py,_backends/vrt.py,_vrt.py,_sidecar.py,_gpu_decode.py,_writer.py, the test suite. _decode.pyhas zero top-level dependency on_reader.py, so the import cycle is broken at module load.- Tests that monkey-patched
_reader_mod.ThreadPoolExecutorwere updated to patchconcurrent.futures.ThreadPoolExecutor, which is the more durable seam now that the executor is function-local in both readers. - The redundant
import os as _osshadow inside the original_read_tileswas cleaned up to use the module-level_os_modulebinding. Behavior-neutral side benefit. - Unused imports dropped from
_reader.py:decompress,predictor_decode,fp_predictor_decode,unpack_bits,lerc_decompress_with_mask,COMPRESSION_NONE,extract_geo_info,RASTER_PIXEL_IS_POINT,GeoTransform. All were only used inside the moved functions. _reader.pydrops from 2201 to 1407 lines (-794), in the "another ~600-800 lines move" range the issue targets.
Checklist
- Algorithm matches reference / paper, n/a, mechanical extraction
- All implemented backends produce consistent results
- NaN handling is correct (unchanged)
- Edge cases are covered by tests (5033 passed)
- Dask chunk boundaries handled correctly (unchanged)
- No premature materialization or unnecessary copies
- Benchmark exists or is not needed, n/a
- README feature matrix updated (if applicable), n/a
- Docstrings present and accurate (preserved verbatim)
brendancol
left a comment
There was a problem hiding this comment.
PR Review (round 2): follow-up after review fixes
Blockers
None.
Suggestions
None. The three round-1 suggestions are addressed:
- The
_decode.pymodule docstring now calls out that the module deliberately has no top-level_readerimport, names the lazily-imported symbols, and points at PR-H (#2247) as the point where the lazy imports collapse back. _read_stripsand_read_tilesgot theirmax_pixels: intannotation back (with atype: ignore[assignment]on the sentinel default).- The strip path's
ThreadPoolExecutorcomment now cross-references issue #1551.
Nits
The audit on nit #5 from round 1 turned up a real correctness issue worth recording. Several local-path patches of _PARALLEL_DECODE_PIXEL_THRESHOLD were targeting _reader_mod, but after PR-G the live binding _read_strips / _read_tiles reads is in _decode. Patching the _reader back-import alone leaves the live binding in _decode unchanged, so the "force serial" half of the parallel-vs-serial parity tests was silently a no-op (both sides ran parallel and trivially compared equal). Local-path patches in test_parallel_strip_decode_2100.py and test_parallel_strip_decode_sparse_2100.py now target _decode_mod; HTTP-path patches stay on _reader_mod because _fetch_decode_cog_http_* still lives there and uses the back-imported binding.
Remaining round-1 nits intentionally not addressed:
- Nit #4 (
import os as _os_module): kept for grep continuity with_reader.py. - Nit #6 (
_resolve_max_pixelslazy lookup): microsecond optimization in a non-hot path. PR-H will collapse the lazy imports anyway.
What looks good
- 5033 geotiff tests still pass on the latest commit (pre-existing
lz4writer test deselected, unrelated). - The patch-target audit confirms the only
_reader_mod._decode_strip_or_tilepatches remaining are in HTTP-path tests (test_cog_http_parallel_decode_2026_05_15.py,test_parallel_strip_decode_2100.pyline 229/236) where_fetch_decode_cog_http_*calls the back-imported binding. - The doc note about PR-H means a future maintainer reading
_decode.pycold will not be surprised by the lazyfrom ._reader import ...inside the read functions.
Checklist
- Algorithm matches reference / paper, n/a, mechanical extraction
- All implemented backends produce consistent results
- NaN handling is correct (unchanged)
- Edge cases are covered by tests (5033 passed)
- Dask chunk boundaries handled correctly (unchanged)
- No premature materialization or unnecessary copies
- Benchmark exists or is not needed, n/a
- README feature matrix updated (if applicable), n/a
- Docstrings present and accurate (preserved verbatim, plus expanded module docstring)
Clean. Nits-only remaining, all by design.
Mechanical extraction of strip/tile decode orchestration out of _reader.py into a new _decode.py module. Behavior-neutral. Moves the transport-independent decode helpers: - _apply_predictor, _packed_byte_count, _int_nodata_in_range, _resolve_masked_fill, _decode_strip_or_tile - _read_strips, _read_tiles (the local strip/tile readers) - _apply_orientation, _apply_orientation_with_geo, _apply_photometric_miniswhite, _miniswhite_inverted_nodata - _NATIVE_ORDER, _PARALLEL_DECODE_PIXEL_THRESHOLD Left in _reader.py: top-level entry points (_read_to_array, read_to_array), the COG-HTTP fetch+decode paths, the pixel-safety guards (MAX_PIXELS_DEFAULT, _check_dimensions, _check_source_dimensions, PixelSafetyLimitError), and the sparse- layout helpers (_sparse_fill_value, _has_sparse, _compute_full_image_byte_budget, _ifd_required_extent) which move with _layout.py in PR-H (#2247). _reader.py drops from 2201 to 1407 lines; the moved code lands in _decode.py at 897 lines. The full public/internal import surface from xrspatial.geotiff._reader is preserved via a back-import block, matching the pattern PR-E used for _sources. Two strip-decode tests that monkey-patched ``_reader_mod.ThreadPoolExecutor`` now patch ``concurrent.futures.ThreadPoolExecutor`` instead, tracking the decode functions to their new home. The dispatch contract under test (parallel branch engages on multi-strip, sparse-only short- circuits) is unchanged. Part of #2211.
- Docstring: spell out that _decode.py deliberately has no top-level _reader import so the two modules can sit on either side of the circular relationship. - Restore the ``max_pixels: int`` annotation on _read_strips and _read_tiles. _MAX_PIXELS_UNSET is still the runtime default so the lookup of MAX_PIXELS_DEFAULT can stay lazy. - Cross-reference issue #1551 in the strip path's ThreadPoolExecutor comment so it matches the tile path note. - Update local-path _PARALLEL_DECODE_PIXEL_THRESHOLD patches in the parallel-strip tests to target ``_decode`` instead of ``_reader``. After PR-G the live binding _read_strips reads is in _decode; the back-imported name in _reader is a separate reference, so patching _reader alone silently no-ops the "force serial" half of the parallel-vs-serial parity tests. HTTP-path patches stay on _reader because _fetch_decode_cog_http_strips still lives there and uses the back-imported binding.
The previous assertion required coalesced wall time to be less than half the baseline, which couples the test to per-tile decode cost on the runner. On macOS arm64 in CI the constant decode overhead is large enough that the ratio fails even when coalescing correctly saves the expected ~7 RTTs of network time. Assert on absolute RTTs saved instead, which is what coalescing actually controls.
bd13f0e to
20bdb28
Compare
Closes #2246
Part of #2211
Summary
Mechanical extraction of strip/tile decode orchestration out of
_reader.pyinto a new_decode.py. Behavior-neutral.Moves to
_decode.py:_apply_predictor,_packed_byte_count,_int_nodata_in_range,_resolve_masked_fill,_decode_strip_or_tile_read_strips,_read_tiles(the local strip/tile readers)_apply_orientation,_apply_orientation_with_geo,_apply_photometric_miniswhite,_miniswhite_inverted_nodata_NATIVE_ORDER,_PARALLEL_DECODE_PIXEL_THRESHOLDLeft in
_reader.py:_read_to_array,read_to_array)MAX_PIXELS_DEFAULT,_check_dimensions,_check_source_dimensions,PixelSafetyLimitError)_sparse_fill_value,_has_sparse,_compute_full_image_byte_budget,_ifd_required_extent) — these move with_layout.pyin PR-H (Refactor GeoTIFF Phase 5c-read: extract _layout.py from _reader.py (PR-H of #2211) #2247)_reader.pydrops from 2201 to 1407 lines; the moved code lands in_decode.pyat 897 lines. The public and internal import surface fromxrspatial.geotiff._readeris preserved via a back-import block, matching how PR-E handled_sources.Test plan
lz4failure onmaindeselected and unrelated to this PR)from xrspatial.geotiff._reader import _read_strips, _read_tiles, _apply_predictor, _decode_strip_or_tile, ...from xrspatial.geotiff._decode import ...ThreadPoolExecutorconstruction (patched atconcurrent.futures.ThreadPoolExecutorrather than_reader.ThreadPoolExecutorto track the move)Notes
Two strip-decode tests that monkey-patched
_reader_mod.ThreadPoolExecutornow patchconcurrent.futures.ThreadPoolExecutorinstead. The dispatch contract under test (parallel branch engages on multi-strip, sparse-only short-circuits) is unchanged; the patch location is an implementation detail that tracks the move.No public API change.