Check GitHub API proactively for renamed or deleted users by carols10cents · Pull Request #13143 · rust-lang/crates.io

carols10cents · 2026-03-10T15:26:29Z

Also to hopefully get the database to a state where users.gh_login is unique.

Rates

The part that I haven't decided/completed yet is how often to run the UpdateFromGithub job and at what pace. Currently, the job must be enqueued manually; each run processes a batch of 100 crates.io users, paced to stay under the GitHub API rate limits, and then stops.
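To make the pacing question concrete, here is a minimal sketch of the arithmetic, not code from this PR. The function names are hypothetical and the hourly limit is a parameter, because which GitHub limit applies to the job is an assumption here (GitHub documents 60 requests per hour for unauthenticated requests and 5,000 per hour for authenticated ones). It also assumes one API request per user.

```rust
use std::time::Duration;

/// Smallest delay between requests that keeps a steady stream at or under
/// `requests_per_hour`. Returns `None` when the limit is zero.
/// (Hypothetical helper, for illustration only.)
fn min_delay_between_requests(requests_per_hour: u32) -> Option<Duration> {
    if requests_per_hour == 0 {
        return None;
    }
    const MILLIS_PER_HOUR: u64 = 3_600_000;
    let limit = u64::from(requests_per_hour);
    // Round up so the pace never exceeds the limit.
    let millis = (MILLIS_PER_HOUR + limit - 1) / limit;
    Some(Duration::from_millis(millis))
}

/// Time needed to work through one batch at that pace, assuming one
/// request per user.
fn batch_duration(batch_size: u32, requests_per_hour: u32) -> Option<Duration> {
    min_delay_between_requests(requests_per_hour).map(|delay| delay * batch_size)
}
```

Under these assumptions, a batch of 100 users takes 100 minutes at 60 requests per hour and 72 seconds at 5,000 requests per hour, which gives a rough sense of how long a full pass over all users would take at either limit.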

We definitely need to run the job against all users once before thinking about implementing rust-lang/rfcs#3946, so that 1. we start all accounts off with their current GitHub username as their crates.io username and 2. the crates.io username based on the GitHub username can be unique.

What I'm imagining is that once we decide to implement the part of rust-lang/rfcs#3946 that stops updating your crates.io username in sync with your GitHub username, then the UpdateFromGithub job will only update oauth_github.login and not users.gh_login (or whatever that column ends up getting renamed to when it becomes the crates.io username).

Reasons to be cautious about how often and how fast this job runs:

- I don't know how the queries and updates to our database will affect database load in production. I hope I've written this in such a way as to not be noticeable, but I could be wrong! If I'm wrong, are there better ways to write the queries? Or will we need to schedule this task to run during times when our traffic is lower?
- We haven't been doing these proactive updates, so there isn't a huge reason to rush into doing them as fast as possible.

Reasons to turn this on to run all the time as fast as possible:

- To get the best mitigation of such long-standing issues as:
  - We should have a way to mark that a user has been deleted #1585
  - It is possible for more than one user record to have the same gh_login #1584
- We're confident it doesn't affect performance or cause issues (once we've established that confidence, however we need to)

And of course there are places between "as slow as possible" and "as fast as possible" that we could land; I'm interested to get people's thoughts on the tradeoffs.

User IDs

@LawnGnome You mentioned the other day something about the crates.io user account ID being PII that we should be cautious about using, mostly because archives exist. I'm interested in your thoughts about using ghost_{user id} as the crates.io username for deleted GitHub accounts. I think this is fine, and here's why:

- Our API responses to https://crates.io/api/v1/users/{username} include the user's crates.io ID; it's not secret information.
- I think the information that could be gleaned from using the user ID in conjunction with an archive is that crates.io says there's an account ghost_123 that owns crate foo; the archive can tell us through user ID association that this user previously had the username LawnGnome, and perhaps this person deleted their GitHub account because they didn't want people to know they published crate foo. I don't think this is actually a reason not to use user IDs in this way because:
  - The archive may go back to a point in time where you could see the owner of foo was LawnGnome anyway
  - Users can request that crates.io delete all data we have associated with them, including their user record and sometimes including their crate, which we're likely to grant if the reason is to remove PII the user doesn't want to be exposed

Am I missing a way the user ID could be used that would make ghost_{user id} a bad idea?
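As a small illustration of the tradeoff being asked about, the sketch below builds a `ghost_{user id}` name and shows that the ID can be read straight back out of it. Both function names and the `i32` ID type are assumptions made for the example, not the PR's actual code.

```rust
/// Hypothetical helper: placeholder username for a user whose GitHub
/// account has been deleted, following the `ghost_{user id}` proposal.
fn ghost_username(user_id: i32) -> String {
    format!("ghost_{user_id}")
}

/// Recover the user ID from a placeholder name, if the name has that shape.
/// Anyone who sees the username can do the same, which is the concern
/// raised in the discussion.
fn parse_ghost_username(name: &str) -> Option<i32> {
    let id = name.strip_prefix("ghost_")?;
    // Accept only plain decimal digits (no sign, no whitespace).
    if id.is_empty() || !id.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    id.parse().ok()
}
```

Because user IDs are unique, two deleted users never receive the same placeholder under this scheme, and the name reveals nothing that the user API response does not already include.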

Other

Any other concerns I haven't addressed?

eth3lbert

I'll try to review this over the weekend!

carols10cents · 2026-05-08T21:19:02Z

Ok, new approach here. I'm leaving the failing migration check for now so that this doesn't get merged (I think there are still a few details to work out) and because I have to go 😅

But I'd love thoughts!

Turbo87 · 2026-05-11T15:42:48Z

+            error!(
+                "Could not update user ID {} from username {} to username {}: {e}",
+                self.user_id, self.old_username, github_user.login,
+            );


Any reason for logging the error instead of failing the background job so that it stays in the queue and will be retried?

carols10cents · May 13, 2026

I was trying to be conservative and only retrying potentially the next time we enqueue a batch with the admin job, rather than retrying a bunch of times via the background job retry, so that if something is completely wrong when we try this for real, we aren't doing a whole bunch of wrong stuff a bunch of times (and not needing to scramble to get the jobs out of the queue).

What do you think?
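The two options in this thread can be sketched as follows. This is a simplified model, not the PR's code: the `OnError` enum, the `finish_job` function, and the in-memory log are hypothetical stand-ins. Returning `Ok` marks the job as done, so only the next admin-enqueued batch would try again, while returning `Err` leaves the job for the queue's own retry mechanism.

```rust
/// Hypothetical policy switch for what to do when a username update fails.
#[derive(Debug, Clone, Copy, PartialEq)]
enum OnError {
    /// Log the failure and report success, so the queue does not retry.
    LogOnly,
    /// Propagate the failure, so the job stays queued and is retried.
    FailJob,
}

/// Turn the result of the update step into the job's final result.
/// `log` stands in for the real logger so the behavior is observable.
fn finish_job(
    update: Result<(), String>,
    policy: OnError,
    log: &mut Vec<String>,
) -> Result<(), String> {
    match (update, policy) {
        (Ok(()), _) => Ok(()),
        (Err(e), OnError::LogOnly) => {
            log.push(format!("Could not update user: {e}"));
            Ok(())
        }
        (Err(e), OnError::FailJob) => Err(e),
    }
}
```

With `LogOnly`, a systematic bug produces one log line per user per batch; with `FailJob`, the same bug is repeated on every retry until the jobs are removed from the queue.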

rustbot · 2026-05-13T18:14:02Z

This PR was rebased onto a different main commit. Here's a range-diff highlighting what actually changed.

Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.

The diesel-guard message says "Note: For Postgres 11+, this is safe if the default is a constant value", which it is. We don't actually need the part where I changed the default after adding the column because we set the default explicitly for new rows in the code.

Anonymously via the regular GitHub API. Except for enterprise managed users that have an underscore in their username; anonymous API requests will always return 404 for them (so don't try, just keep their username). If the fallback request to get public user info by the user ID fails, probably because we've hit the rate limit, fail the job and try again later.
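The decision flow described in this commit message could look roughly like the sketch below. The types and function names are hypothetical, and the GitHub request itself is replaced by a closure so the logic can be read and tested without network access. Only the cases the message spells out are modeled: skip the lookup for usernames containing an underscore, retry later when the by-ID lookup fails, and otherwise compare the returned login with the stored one.

```rust
/// Result of the (hypothetical) anonymous by-ID lookup against GitHub.
#[derive(Debug, PartialEq)]
enum Lookup {
    Found { login: String },
    /// The request failed, for example because the rate limit was hit.
    Failed,
}

/// What the job should do for one user.
#[derive(Debug, PartialEq)]
enum Action {
    /// Enterprise managed user: no request made, username kept.
    SkipLookup,
    /// GitHub reports the same login; nothing to write.
    NoChange,
    /// GitHub reports a different login.
    Rename { to: String },
}

/// Signals that the job should fail now and be tried again later.
#[derive(Debug, PartialEq)]
struct RetryLater;

/// Usernames containing an underscore belong to enterprise managed users,
/// for whom anonymous requests always return 404.
fn needs_lookup(stored_login: &str) -> bool {
    !stored_login.contains('_')
}

fn decide(
    stored_login: &str,
    lookup: impl FnOnce() -> Lookup,
) -> Result<Action, RetryLater> {
    if !needs_lookup(stored_login) {
        return Ok(Action::SkipLookup);
    }
    match lookup() {
        Lookup::Failed => Err(RetryLater),
        Lookup::Found { login } if login == stored_login => Ok(Action::NoChange),
        Lookup::Found { login } => Ok(Action::Rename { to: login }),
    }
}
```

Passing the lookup as a closure means the request is never issued for skipped users, which matters when every request counts against the rate limit.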

And oauth_github is the only source used to enqueue jobs. I'll update the `user.gh_login` records for users with gh_id < 1 manually once this whole strategy is approved, to get us to a state where `user.gh_login` is unique-- but that isn't what this job needs to do in perpetuity, so don't.

carols10cents · 2026-05-13T21:50:27Z

@Turbo87 @eth3lbert this is ready for rereview!

LawnGnome · 2026-05-17T11:44:33Z

I'll try to do a full review on this next week, but don't block on me if @Turbo87 and/or @eth3lbert approve it; I'm happy to review this ex post facto in my copious amounts of spare time.

Just on your question for me in your post, though:

> Am I missing a way the user ID could be used that would make ghost_{user id} a bad idea?

I think your analysis is basically correct, and you've captured my concern: that having the ID might make it easier to work backwards to identify a user who no longer wants to be identified.¹

But, as you also said, the API is returning the ID anyway right now, unless we decide to take extra steps to merge deleted users into a singular ghost user (which I am explicitly not suggesting here, and think we'd want to RFC if we ever did go down that road), so having it also be in the user name seems reasonable enough for now.

So no, I don't think you're missing a way, and I don't think it's a blocker here.

¹ You didn't ask, but to get it off my chest: having just gone through the GDPR erasure process, I think it's a slightly unfortunate combination of well-intentioned, rooted in a real concern, and fundamentally useless on an Internet where public things are archived and mirrored willy-nilly. I think it's reasonable for us to enable steps that will make it harder for a bad person to figure out who a deleted user is, but it's basically always a balancing act, since I can pretty much guarantee there are people archiving things like database dumps and mirroring metadata whether we, or the deleted users, like it or not.

rustbot added the A-backend ⚙️ label Mar 10, 2026

carols10cents force-pushed the refresh-github branch 3 times, most recently from 11dbdda to 4514edb Compare March 20, 2026 19:41

carols10cents force-pushed the refresh-github branch from 36268b8 to f166d51 Compare March 20, 2026 21:40

carols10cents mentioned this pull request Apr 17, 2026

Propose the concept of a crates.io username for identity rust-lang/rfcs#3946

Open

carols10cents force-pushed the refresh-github branch 2 times, most recently from f9c1c72 to 27d3a01 Compare April 22, 2026 15:07

carols10cents force-pushed the refresh-github branch 2 times, most recently from b872321 to a4b72c9 Compare May 3, 2026 01:51

carols10cents marked this pull request as ready for review May 4, 2026 14:10

carols10cents changed the title ~~WIP: Check GitHub API proactively for renamed or deleted users~~ Check GitHub API proactively for renamed or deleted users May 4, 2026

carols10cents requested review from LawnGnome, Turbo87 and eth3lbert May 4, 2026 14:34

carols10cents mentioned this pull request May 4, 2026

Refactoring to disentangle account creation from only one identity provider #10611

Open

6 tasks

Turbo87 reviewed May 5, 2026

carols10cents force-pushed the refresh-github branch from a925015 to 82d36bb Compare May 5, 2026 17:15

eth3lbert reviewed May 7, 2026

Comment thread crates/crates_io_database/src/schema.rs Outdated

carols10cents force-pushed the refresh-github branch from 82d36bb to f7cfc39 Compare May 8, 2026 21:12

eth3lbert reviewed May 10, 2026

Comment thread migrations/2026-05-05-180007-0000_add_last_sync_to_oauth_github/up.sql Outdated

Comment thread src/worker/jobs/update_user_from_github.rs Outdated

Comment thread src/worker/jobs/update_user_from_github.rs Outdated

Comment thread src/bin/crates-admin/enqueue_job.rs

Turbo87 reviewed May 11, 2026

carols10cents added 5 commits May 13, 2026 14:06

Skeleton of what a job to update a user from GitHub will do

47d178e

Steps broken down roughly into functions that the background job calls that have `todo!()` implementations for now

Add a last_sync column to the oauth_github table

2486b4a

Set all existing rows to 1970 so that we'll refresh them soon. Set new rows to now, because creating a new user/oauth_github record means we just got the user's github information (which counts as a sync).
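One way to read this commit message: if batches are picked by oldest `last_sync` first, backfilled rows (set to 1970) are refreshed before rows created after the migration. The sketch below models that ordering in memory. The struct, its field names, and the oldest-first selection are assumptions for illustration; the real selection would be a database query.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Simplified, hypothetical stand-in for a row of the oauth_github table.
struct OauthGithub {
    account_id: i64,
    last_sync: SystemTime,
}

/// Rows that existed before the migration: never synced, so 1970.
fn backfilled(account_id: i64) -> OauthGithub {
    OauthGithub { account_id, last_sync: UNIX_EPOCH }
}

/// Rows created after the migration: creating the record counts as a sync.
fn freshly_created(account_id: i64, now: SystemTime) -> OauthGithub {
    OauthGithub { account_id, last_sync: now }
}

/// Pick the next batch: least recently synced first, with account ID as a
/// tie-breaker so the order is deterministic.
fn next_batch(rows: &[OauthGithub], batch_size: usize) -> Vec<i64> {
    let mut sorted: Vec<&OauthGithub> = rows.iter().collect();
    sorted.sort_by_key(|row| (row.last_sync, row.account_id));
    sorted
        .into_iter()
        .take(batch_size)
        .map(|row| row.account_id)
        .collect()
}
```

Under this ordering, a newly created account is not re-checked until every backfilled account has been processed at least once.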

Implement refreshing user from GitHub

9dff332

Add a dry run mode to only log updates from GitHub

a7bec20

Add an admin command to enqueue a batch of user update jobs

773b789

carols10cents force-pushed the refresh-github branch from f7cfc39 to 2234817 Compare May 13, 2026 18:13

carols10cents force-pushed the refresh-github branch from 2234817 to 748df9c Compare May 13, 2026 18:29

carols10cents added 5 commits May 13, 2026 15:59

Don't update the users table if the username hasn't changed

8440bc6

Only put dry_run and account_id in job params

7240582

Put info in text of log messages, not structured fields

dc44cff

Thanks DataDog 😞

carols10cents force-pushed the refresh-github branch from 21f5842 to dc44cff Compare May 13, 2026 20:49

Make priority of username updating bg jobs negative

7efc1be
