Add support for PyMongo Async and N+1 Dereference/select_related using Pipeline #2904

Draft
arunsureshkumar wants to merge 66 commits into MongoEngine:master from strollby:feat/async
@arunsureshkumar

This PR contributes to #2902, which tracks the ongoing effort to add native async support to MongoEngine using PyMongo’s official async API (>= 4.14).

It includes foundational changes toward async-first workflows, improvements to select_related via aggregation pipelines, and related updates across core internals, tests, CI, and documentation.

Notes

  • Async support is native, not layered on top of sync behavior
  • Focus on unified sync/async code paths and improved performance
  • Version and compatibility changes apply (see issue for details)

📌 Details, motivation, and full scope are documented in the issue: #2902
🚧 Work is still in progress — feedback is welcome.

arunsureshkumar and others added 30 commits December 27, 2025 22:43
- Refactored the core ORM to support PyMongo’s native async API
- Unified sync and async code paths across documents, querysets, and transactions
- Replaced legacy async implementations
- Removed deprecated and compatibility code

BREAKING CHANGE:
- Removed legacy async behavior
- Removed LazyReferenceField
- Removed GenericLazyReferenceField
- GenericReferenceField now requires `choices`
- Dropped support for PyMongo < 4.14
- Dropped support for MongoDB < 4.2
BaseQuerySet is now defined only in the synchronous queryset implementation.
add: TestQuerysetLookupMatch
…uilder stages

- Extract query normalization, match planning, lookup planning, stage building, and tail stages
- Introduce clear aggregation pipeline architecture aligned with MongoDB stages
- Reduce PipelineBuilder to a small orchestration layer
- Improve readability, isolation, and long-term maintainability
…ilter-only lookups

Body:
- Refactor StageBuilder traversal/handlers for readability
- Keep $lookup unfiltered for correct hydration; apply foreign predicates via local $filter
- Emit explicit _missing_reference markers so deref raises DoesNotExist
- Reduce duplication with shared helpers and structured dispatch
- Normalize refIds generation across scalar and container fields
- Use reduce-based flattening for ListField reference lookups
- Ensure missing references emit `{_missing_reference, _ref}` only
- Fix select_related pipelines to match MongoDB 4.2+ semantics
- Expand pipeline builder tests for nested and container references
…ort, updated installation steps, supported MongoDB versions, and improved examples.
…project

- Use uv for dependency management and builds
- Simplify GitHub Actions matrix and MongoDB setup
- Replace custom CI scripts with uv-based dependency management and actions
- Match tox environment with GitHub Action matrix
…registry cleanup

- Update assertions to use `await` where needed for async compatibility.
- Introduce `_DocumentRegistry.clear()` and `_CollectionRegistry.clear()` calls in test setups and teardowns.
- Normalize test workflows to ensure proper registry state management across synchronous and asynchronous environments.
- Simplify pipeline builder tests by removing unnecessary async and ensuring compatibility with recent updates.
…ts, and async example usage in `query_counter` and `async_query_counter`
@abhinand-c

@bagerard @rozza @hmarr
Could you help us with a review and feedback?

abhinand-c and others added 23 commits January 6, 2026 16:09
Reducing the flakiness of transaction
…ression

When a filter condition targets a reference/list-of-reference field (e.g.
articles__headline='Hello') and select_related is used, the $addFields
hydration stage now applies a $filter on the docs_alias array instead of
using the full unfiltered result. This ensures the hydrated field contains
only the matching documents rather than all fetched documents.

For ListField references the filtered array is assigned directly; for scalar
ReferenceField the first matching element is extracted via $arrayElemAt (or
null if none match).
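
The hydration stage described above can be sketched as a plain Python dict, which makes it inspectable without a database. This is only an illustration: the helper `build_hydration_stage`, the alias `__docs_articles`, and the field names are hypothetical, and the stage the PR's StageBuilder actually emits may differ.

```python
def build_hydration_stage(field, docs_alias, cond, is_list):
    """Return an $addFields stage that hydrates `field` from `docs_alias`.

    `cond` is an aggregation expression evaluated once per joined document,
    which is exposed inside $filter as `$$doc`.
    """
    filtered = {
        "$filter": {"input": f"${docs_alias}", "as": "doc", "cond": cond}
    }
    if is_list:
        # ListField of references: assign the filtered array directly.
        value = filtered
    else:
        # Scalar reference: take the first match. $arrayElemAt yields no
        # value for an empty array, so $ifNull turns that into an explicit null.
        value = {"$ifNull": [{"$arrayElemAt": [filtered, 0]}, None]}
    return {"$addFields": {field: value}}


# Hypothetical filter equivalent to articles__headline='Hello'.
headline_cond = {"$eq": ["$$doc.headline", "Hello"]}
list_stage = build_hydration_stage("articles", "__docs_articles", headline_cond, True)
scalar_stage = build_hydration_stage("article", "__docs_article", headline_cond, False)
```

Keeping the `$lookup` itself unfiltered and filtering only at hydration time matches the earlier commit in this PR that applies foreign predicates via a local `$filter`.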

Also remove unused `Union` import from synchronous queryset.
When _walk_lookups recursed into a ListField(ReferenceField) subtree
(e.g. before_child -> parent), it passed embedded_list_path=None,
causing the nested ref to use _add_structured_ref_lookup with a dotted
path over an array. This generated $indexOfArray(ids, array_of_ids)
which always returned -1, writing {_missing_reference: True} to every
element even when the referenced document existed.

Fix: pass embedded_list_path=full_path so the recursive call uses
_add_embedded_list_structured_ref_lookup, which correctly uses $map
to update each array element's field individually.
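
The failure mode can be reproduced in pure Python by imitating how `$indexOfArray` compares values. This is a simplified model, not the PR's code: `index_of_array` is a hypothetical stand-in for the operator, and the ids are plain strings rather than ObjectIds.

```python
def index_of_array(array, search):
    """Imitate $indexOfArray: index of the first element equal to `search`, else -1."""
    for position, item in enumerate(array):
        if item == search:
            return position
    return -1


joined_ids = ["a", "b", "c"]  # ids of the documents fetched by $lookup

# A dotted path over an array resolves to an array of ids, not a single id.
ids_from_dotted_path = ["b", "c"]

# Buggy shape: searching for a whole array among scalar ids never matches.
buggy_index = index_of_array(joined_ids, ids_from_dotted_path)

# Fixed shape: resolve each element individually, as a $map would.
fixed_indexes = [index_of_array(joined_ids, ref_id) for ref_id in ids_from_dotted_path]
```

Here `buggy_index` is -1 even though both referenced documents exist, which is what led to every element being marked as a missing reference.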
…= 5.0

- StageBuilder now builds the joined-docs hash {id_str: doc} once in an outer
  $let and uses $getField for O(1) lookup per ref leaf. Cuts hydration cost
  for List/Map/Dict of ReferenceField from O(N*M) to O(N+M) against large
  joined collections. Falls back to legacy $indexOfArray when MongoDB < 5.0.
- PipelineBuilder accepts mongo_version; all 4 caller sites resolve the
  queryset's effective alias (matching _get_collection's logic, with
  using(None,None) guard) so multi-cluster setups probe the correct cluster.
- get_mongodb_version/async_get_mongodb_version now accept an alias and
  cache per-alias to avoid a server_info() roundtrip on every aggregation.
  Disconnect clears only the disconnected alias's entry.
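
The complexity change in the first bullet can be illustrated outside MongoDB with two small Python functions. This is a sketch of the idea only: the function names are hypothetical, and the marker shape `{"_missing_reference": True, "_ref": ...}` is an assumption based on the field names mentioned in earlier commits.

```python
def hydrate_by_scan(ref_ids, joined_docs):
    """Legacy-style hydration: scan the joined docs once per reference, O(N*M)."""
    hydrated = []
    for ref_id in ref_ids:
        match = next((doc for doc in joined_docs if doc["_id"] == ref_id), None)
        if match is None:
            match = {"_missing_reference": True, "_ref": ref_id}
        hydrated.append(match)
    return hydrated


def hydrate_by_hash(ref_ids, joined_docs):
    """Hash-style hydration: build {id_str: doc} once, then O(1) per reference."""
    docs_by_id = {str(doc["_id"]): doc for doc in joined_docs}
    return [
        docs_by_id.get(str(ref_id), {"_missing_reference": True, "_ref": ref_id})
        for ref_id in ref_ids
    ]


docs = [{"_id": 1, "name": "one"}, {"_id": 2, "name": "two"}]
refs = [2, 1, 99, 2]
scan_result = hydrate_by_scan(refs, docs)
hash_result = hydrate_by_hash(refs, docs)
```

Both functions return the same result; only the cost differs, since the dictionary is built once (M steps) and each of the N references then costs a single lookup.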

Cleanup bundled in:
- Consolidate is_list_of_embedded / embedded_doc_type into Schema; drop the
  duplicated copies in StageBuilder and utils.py.
- Delegate LookupPlanner._get_field_by_db_part and MatchPlanner's local
  field-lookup closure to Schema.resolve_field_name.
- Remove the verbatim duplicate of needs_aggregation in pipeline_builder.py
  (utils.py is the canonical home, exported via __init__).
- Fix __init__.py NameError from referencing modules after star-import.
- Drop dead `ids` $let variable in _build_value_expr that computed an unused
  $map on every hydrated document.
- Merge the two sequential `if isinstance(field, DictField)` blocks in
  _walk_lookups into one explicit dispatch.
- Expand pipeline_builder/README.md with the missing schema.py / utils.py,
  data flow, both build paths, and design invariants.
- Enable ruff-format hook in .pre-commit-config.yaml (ruff-check stays
  disabled until the ~6,700-error backlog — mostly F403/F405 from `import *`
  in tests and E501 — is triaged separately).
- Run `ruff format` once across the codebase to establish a clean baseline.
- Run `ruff check --select F401 --fix` to drop 32 unused imports.
- Trailing-newline fix in pipeline_builder/README.md from end-of-file-fixer.

No functional changes — purely whitespace, formatting, and dead-import cleanup.
Split the 2,559-line fields.py monolith into a logical folder hierarchy:

- string/ - StringField, URLField, EmailField (individual files)
- numeric/ - IntField, FloatField, DecimalField, Decimal128Field
- datetime/ - DateTimeField, DateField, ComplexDateTimeField
- complex/ - ListField, DictField, MapField (renamed from container/)
- document/ - EmbeddedDocumentField, GenericEmbeddedDocumentField, DynamicField
- reference/ - ReferenceField, GenericReferenceField
- file/ - BinaryField, FileField, ImageField, GridFSProxy
- geo/ - GeoPointField + 6 GeoJSON types (individual files)
- boolean.py, enum.py, uuid.py, sequence.py (single-file modules)
- exceptions.py - GridFSError, ImproperlyConfigured

All imports remain backward compatible via fields/__init__.py re-exports.
Tests pass: 597 field tests (299 sync + 298 async).
Split the 689-line base/fields.py into individual class files:

- base_field.py - BaseField (260 lines, core field descriptor)
- complex_base_field.py - ComplexBaseField (217 lines, for lists/dicts)
- object_id_field.py - ObjectIdField (31 lines, ObjectId wrapper)
- geo_json_base_field.py - GeoJsonBaseField (159 lines, GeoJSON validation)

All imports remain backward compatible via base/fields/__init__.py.
Tests pass: 198 base field tests (99 sync + 99 async).
Split the 516-line base/datastructures.py into individual class files:

- helpers.py - mark_as_changed_wrapper decorators (26 lines)
- base_dict.py - BaseDict (74 lines, change-tracking dict)
- base_list.py - BaseList (106 lines, change-tracking list)
- embedded_document_list.py - EmbeddedDocumentList (173 lines, queryable embedded doc list)
- strict_dict.py - StrictDict (85 lines, slot-based efficient dict)
- lazy_reference.py - LazyReference (70 lines, deferred document loading)

All imports remain backward compatible via base/datastructures/__init__.py.
Tests pass: 60 dereference tests + 31 embedded document list tests.
Add ZonedDateTimeField that stores both UTC time and timezone name,
enabling accurate time comparisons while preserving the original timezone
for frontend display.

Storage format:
- MongoDB: {"utc": ISODate(...), "tz": "Asia/Kolkata"}
- Python: timezone-aware datetime in original timezone

Key features:
- DST-safe: stores timezone name (e.g., "America/New_York"), not offset
- Query support: start_time__utc__gte for time queries, start_time__tz for timezone queries
- Automatic index expansion: 'start_time' → 'start_time.utc' in meta.indexes
- Works with both sync and async APIs

Tests cover storage, retrieval, DST handling, querying, ordering, and indexing.
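
A minimal sketch of the storage format above, using only the standard library. The helpers `to_storage` and `from_storage` are hypothetical and are not the field's actual methods; the sketch also assumes the timezone is supplied as a `zoneinfo.ZoneInfo`, whose `key` attribute holds the IANA name.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def to_storage(value):
    """Convert an aware datetime into the {"utc": ..., "tz": ...} shape."""
    if value.tzinfo is None or value.utcoffset() is None:
        raise ValueError("expected a timezone-aware datetime")
    tz_name = getattr(value.tzinfo, "key", None)  # set on ZoneInfo instances
    if tz_name is None:
        raise ValueError("tzinfo must carry an IANA zone name")
    return {"utc": value.astimezone(timezone.utc), "tz": tz_name}


def from_storage(doc):
    """Rebuild the aware datetime in its original timezone."""
    utc_value = doc["utc"]
    if utc_value.tzinfo is None:
        # A driver may hand back naive datetimes; treat them as UTC.
        utc_value = utc_value.replace(tzinfo=timezone.utc)
    return utc_value.astimezone(ZoneInfo(doc["tz"]))


local = datetime(2026, 1, 6, 16, 9, tzinfo=ZoneInfo("Asia/Kolkata"))
stored = to_storage(local)
restored = from_storage(stored)
```

Storing the zone name rather than a fixed offset is what keeps the round trip DST-safe: the offset is recomputed from the zone rules for the stored instant.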
- Add ZonedDateTimeField to API reference
- Add ZonedDateTimeField to field list in defining-documents guide
- Fix Sphinx 9.x incompatibility with readthedocs_ext (only load on ReadTheDocs)
- Update dependencies to latest versions with environment markers for Sphinx
  - Sphinx 8.1.3 for Python 3.10-3.11, 9.1.0 for Python 3.12+
  - ruff 0.15, pre-commit 4.6, pytest 9.0.3, coverage 7.14, pillow 12.2
  - tox 4.54, tox-uv 1.35.2, uv_build 0.11.16
MongoDB 5.0-7.0 support $getField but require the 'field' parameter to be
a constant, not a variable expression. The pipeline builder was using
{"$getField": {"field": {"$toString": "$$rid"}, ...}} which works on
MongoDB 4.4 (lenient) and 8.0+ (relaxed), but fails on 5.0-7.0 with
error 5654601: "$getField requires 'field' to evaluate to a constant".

Changed the version check to only enable $getField optimization on
MongoDB >= 8.0, falling back to the legacy $indexOfArray approach on
earlier versions.

This trades O(1) lookup performance for compatibility on 5.0-7.0, while
MongoDB 8.0+ still gets the optimized path.

Tested on MongoDB 5.0.31, 6.0.28, 7.0.34 - all tests pass.
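
The version gate described in this commit can be sketched as follows. The names `parse_version`, `GETFIELD_MIN_VERSION`, and `use_getfield_lookup` are hypothetical and do not necessarily match the identifiers in the PR.

```python
# Assumed threshold, taken from the commit message: only MongoDB >= 8.0
# takes the $getField path; everything earlier uses $indexOfArray.
GETFIELD_MIN_VERSION = (8, 0)


def parse_version(version_string):
    """Turn a server version such as '7.0.34' into a (major, minor) tuple."""
    major, minor = version_string.split(".")[:2]
    return int(major), int(minor)


def use_getfield_lookup(mongo_version):
    """Return True when the optimized $getField hydration path may be used."""
    return tuple(mongo_version[:2]) >= GETFIELD_MIN_VERSION


tested_versions = ["5.0.31", "6.0.28", "7.0.34"]
fallback_flags = [use_getfield_lookup(parse_version(v)) for v in tested_versions]
```

Comparing integer tuples rather than strings avoids ordering mistakes such as `"10.0"` sorting before `"8.0"`.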
The tox -a command outputs environment names separated by newlines, but
tox -e expects comma-separated values. This was causing CI failures with:
'provided environments not found in configuration file'.

Added 'tr "\n" "," | sed "s/,$//"' to convert the newline-separated
list to comma-separated format that tox expects.
The 'readthedocs' builder doesn't exist in Sphinx. Changed the
html-readthedocs target to use the standard 'html' builder with
the -T -E flags for strict checking and fresh build.

This fixes the CI build_doc_dryrun job that was failing with:
'Builder name readthedocs not registered or available through entry point'
Split the build-n-publish job into two separate jobs:

1. build-release: Builds wheel and sdist, uploads as artifacts
2. publish-to-pypi: Downloads artifacts and publishes to PyPI

Benefits:
- Better separation of concerns
- Build artifacts can be verified before publishing
- Failed publish doesn't require rebuilding
- Follows GitHub Actions best practices for release workflows

The publish job depends on build-release and both only run on tag
creation (refs/tags/v*).
- Add pymongo 4.16 and 4.17 to tox environments
- Add MongoDB 8.3 to CI workflow matrix
- Keep all MongoDB versions (4.4-8.3) for backward compatibility
- Add pymongo-version to test matrix to isolate test runs
- Run one Python × MongoDB × PyMongo combination per job
- Change from tox run-parallel to single tox run per job
- Update cache key to include PyMongo version

This prevents transaction/lock conflicts that occurred when
multiple PyMongo versions ran concurrently against the same
MongoDB instance.

Matrix: 5 Python × 6 MongoDB × 4 PyMongo = 120 jobs
- Add job name template with proper capitalization
- Change pymongo-version format from "414" to "4.14" for readability
- Update tox env construction to strip dots from pymongo version

Job names now show as:
"test (Python 3.10, MongoDB 4.4, PyMongo 4.14)"
instead of:
"test (3.10, 4.4, 414)"
Adopt Python's standard terminology for timezone-aware datetimes
("aware" vs "naive") instead of "zoned" vs "unzoned".

Changes:
- Rename ZonedDateTimeField class to AwareDateTimeField
- Update all error messages and docstrings
- Rename file from zoned_datetime_field.py to aware_datetime_field.py
- Update imports in mongoengine/fields/__init__.py
- Update imports in mongoengine/fields/datetime/__init__.py
- Rename test files and update all test references

This is a breaking change for code using ZonedDateTimeField.
Users should update their imports to use AwareDateTimeField.
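
For readers unfamiliar with the terminology, the standard-library definition of "aware" versus "naive" can be checked with a short helper. `is_aware` is a hypothetical name used here for illustration; it is not part of the PR.

```python
from datetime import datetime, timezone


def is_aware(value):
    """Return True if `value` is timezone-aware per the datetime module's definition."""
    return value.tzinfo is not None and value.tzinfo.utcoffset(value) is not None


naive_value = datetime(2026, 1, 6, 16, 9)
aware_value = datetime(2026, 1, 6, 16, 9, tzinfo=timezone.utc)
```

A datetime is aware only when `tzinfo` is set and its `utcoffset()` returns a value, which is the same rule the Python documentation gives.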