Update Python version to 3.12 and refresh PR template #1648
kayametehan wants to merge 1 commit into openai:main
Conversation
- Bump python-version from 3.9 to 3.12 in run_tests.yaml and test_eval.yaml workflows (closes openai#1606)
- Remove stale GPT-4 private-access language from PR template; replace with current contribution guidelines (closes openai#1608)
Pull request overview
Updates CI to run on a supported Python version and refreshes contributor guidance in the PR template to remove obsolete model-specific requirements.
Changes:
- Bump GitHub Actions workflows from Python 3.9 to Python 3.12.
- Refresh .github/PULL_REQUEST_TEMPLATE.md by removing outdated GPT-4-specific merge requirements and adding updated guidance/links.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| .github/workflows/test_eval.yaml | Updates workflow Python runtime to 3.12 for eval validation. |
| .github/workflows/run_tests.yaml | Updates unit test workflow Python runtime to 3.12. |
| .github/PULL_REQUEST_TEMPLATE.md | Updates PR template instructions to reflect current eval submission expectations. |
  **PLEASE READ THIS**:
- In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject it since GPT-4 is already capable of completing the task.
+ We are currently only accepting evals that use **model-graded eval classes** (no custom code). Your eval should include a minimum of **15 high-quality samples**.
The PR template now says we are only accepting evals that use model-graded eval classes. This conflicts with the repo docs, which explicitly say contributors can follow an existing eval template to build a basic or model-graded eval (and the only stated restriction is “no custom code”). Please align the template wording with README guidance so contributors aren’t incorrectly discouraged from submitting basic-template evals.
Suggested change:
- We are currently only accepting evals that use **model-graded eval classes** (no custom code). Your eval should include a minimum of **15 high-quality samples**.
+ We are currently only accepting evals that use **existing eval classes** such as **Basic** or **model-graded eval classes** (no custom code). Your eval should include a minimum of **15 high-quality samples**.
Fixes #1606 and #1608.
Changes
Python version bump (fixes #1606)
- `run_tests.yaml`: `python-version: 3.9` → `"3.12"`
- `test_eval.yaml`: `python-version: 3.9` → `"3.12"`

Python 3.9 reached end-of-life in October 2025; 3.12 is a current, actively supported release. The resulting workflow step is sketched below.
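A minimal sketch of what the updated `setup-python` step looks like after this change. The job name, checkout step, and test commands are illustrative assumptions; only the `python-version` value comes from this PR:

```yaml
# Sketch of the updated step in run_tests.yaml / test_eval.yaml.
# Job and step layout here are assumptions; this PR only changes
# python-version from 3.9 to "3.12".
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"   # was: 3.9 (EOL October 2025)
      - name: Run tests
        run: |
          pip install -e .
          pytest
```

Quoting the version (`"3.12"`) also sidesteps YAML's float parsing, where an unquoted `3.10` would be read as `3.1`.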
PR template refresh (fixes #1608)