Skip to content

Test: Assert specific evaluation scores for sample data#1249

Draft
musaqlain wants to merge 1 commit into
weecology:mainfrom
musaqlain:test-model-accuracy
Draft

Test: Assert specific evaluation scores for sample data#1249
musaqlain wants to merge 1 commit into
weecology:mainfrom
musaqlain:test-model-accuracy

Conversation

@musaqlain
Copy link
Copy Markdown
Contributor

@musaqlain musaqlain commented Dec 25, 2025

Description

Previously, test_evaluate only checked if the execution completed without errors, but did not assert the quality of the results. This PR adds assertions to ensure the model produces consistent precision and recall scores on the sample data.

Changes Made

  • Relaxed Unit Test: I modified test_evaluate in tests/test_main.py to use looser bounds (> 0.7) so it acts as a sanity check for future models.
  • New Benchmark Test: I created tests/test_benchmark.py which loads the specific model revision (SHA) using m.load_model(..., revision=...). This test asserts the exact precision/recall values (approx 0.80 / 0.72) to ensure reproducibility for this specific release.

Testing

  • Ran pytest tests/test_main.py::test_evaluate locally.
  • Result: Passed.

Linked Issue

Closes #1233

AI-Assisted Development

I utilized AI for sanity checking code logic and conducting an assisted review to identify potential missing resets and schema constraints.

  • I used AI tools (e.g., GitHub Copilot, ChatGPT, etc.) in developing this PR
  • I understand all the code I'm submitting

@jveitchmichaelis
Copy link
Copy Markdown
Collaborator

jveitchmichaelis commented Jan 1, 2026

Thanks for the contribution. I would prefer a "benchmark" unit test that specifies the model via config as explicitly as possible (for example using a commit hash from hugging face, not just revision=main as that could also change), and then makes this check. Then we could add other tests for newer models etc.

We already have this assertion in the unit test in the PR:

assert np.round(results["box_precision"], 2) > 0.5
assert np.round(results["box_recall"], 2) > 0.5

which we could bump to 0.7 if we want tighter checks. But I don't think this test is the right place for a "close" assertion as it will break if we ever release an improved model.

Please could you also include the AI assistance declaration from the PR template? (you can see an example here)

@musaqlain
Copy link
Copy Markdown
Contributor Author

Thanks for the guidance! I completely understand your point that separating the sanity checks from strict benchmarking is a better approach for long-term maintenance of this package. Now I have updated it accordingly also updated this PR's description. 👍

Copy link
Copy Markdown
Member

@ethanwhite ethanwhite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really close to me. Just one recommendation for clarifying a code comment. @jveitchmichaelis - do you see anything else?

Comment thread tests/test_main.py Outdated
Comment thread tests/test_benchmark.py Outdated
@jveitchmichaelis
Copy link
Copy Markdown
Collaborator

jveitchmichaelis commented May 8, 2026

Looks like we also need to rebase, and there is another comment I would suggest removing as it doesn't provide much. I think that in the long run, we probably want to use tags or branches rather than opaque commit hashes but that's not an issue right now.

@musaqlain musaqlain force-pushed the test-model-accuracy branch from a9b3b5e to 509015b Compare May 20, 2026 17:47
@musaqlain
Copy link
Copy Markdown
Contributor Author

it is ready for review, I have resolved the conflict, squashed the commit messages into 1, and also remove those comments.
PTAL now.

@musaqlain musaqlain marked this pull request as draft May 20, 2026 20:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Assert the outputs of sample data

3 participants