Check GitHub API proactively for renamed or deleted users#13143

Open
carols10cents wants to merge 12 commits into
rust-lang:mainfrom
carols10cents:refresh-github

@carols10cents
Member

@carols10cents carols10cents commented Mar 10, 2026

Also to hopefully get the database to a state where users.gh_login is unique.

Rates

The part that I haven't decided on or completed yet is how often to run the UpdateFromGithub job and at what pace. Currently, the job must be enqueued manually; it processes a batch of 100 crates.io users at a pace that stays under the GitHub API rate limits, then stops.
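
To make the pacing question concrete, here is a minimal sketch of how a per-request delay could be derived from an hourly request budget. The function names and the budget values are assumptions for illustration; they are not taken from this PR, and the real job may pace itself differently.

```rust
use std::time::Duration;

/// Smallest delay between requests that keeps a job under an hourly budget.
/// `requests_per_hour` is an assumed budget, not a value from this PR.
fn min_delay_between_requests(requests_per_hour: u32) -> Duration {
    assert!(requests_per_hour > 0, "budget must be non-zero");
    let budget = u64::from(requests_per_hour);
    // Ceiling division so that rounding never pushes us over the budget.
    let millis = (3_600_000 + budget - 1) / budget;
    Duration::from_millis(millis)
}

/// Approximate wall-clock time for one batch, assuming one request per user.
fn batch_duration(batch_size: u32, requests_per_hour: u32) -> Duration {
    min_delay_between_requests(requests_per_hour) * batch_size
}
```

With an assumed budget of 5,000 requests per hour, this gives 720 ms between requests, so a batch of 100 users would take about 72 seconds.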

We definitely need to run the job against all users once before thinking about implementing rust-lang/rfcs#3946, so that 1. we start all accounts off with their current GitHub username as their crates.io username and 2. the crates.io username based on the GitHub username can be unique.

What I'm imagining is that once we decide to implement the part of rust-lang/rfcs#3946 that stops updating your crates.io username in sync with your GitHub username, then the UpdateFromGithub job will only update oauth_github.login and not users.gh_login (or whatever that column ends up getting renamed to when it becomes the crates.io username).

Reasons to be cautious about how often and how fast this job runs:

  • I don't know how the queries and updates to our database will affect database load in production. I hope I've written this in such a way as to not be noticeable, but I could be wrong! If I'm wrong, are there better ways to write the queries? Or will we need to schedule this task to run during times when our traffic is lower?
  • We haven't been doing these proactive updates, so there isn't a huge reason to rush into doing them as fast as possible.

Reasons to turn this on to run all the time as fast as possible:

And of course there are places between "as slow as possible" and "as fast as possible" that we could land; I'm interested to get people's thoughts on the tradeoffs.

User IDs

@LawnGnome You mentioned the other day something about the crates.io user account ID being PII that we should be cautious about using, mostly because archives exist. I'm interested in your thoughts about using ghost_{user id} as the crates.io username for deleted GitHub accounts. I think this is fine, and here's why:

  • Our API responses to https://crates.io/api/v1/users/{username} include the user's crates.io ID; it's not secret information
  • I think the information that could be gleaned from using the user ID in conjunction with an archive is that crates.io says there's an account ghost_123 that owns crate foo; the archive can tell us through user ID association that this user previously had the username LawnGnome, and perhaps this person deleted their GitHub account because they didn't want people to know they published crate foo. I don't think this is actually a reason not to use user IDs in this way because:
    • The archive may go back to a point in time where you could see the owner of foo was LawnGnome anyway
    • Users can request that crates.io delete all data we have associated with them, including their user record and sometimes including their crate, which we're likely to grant if the reason is to remove PII the user doesn't want to be exposed

Am I missing a way the user ID could be used that would make ghost_{user id} a bad idea?
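
For reference, a minimal sketch of the naming scheme under discussion. Both helper names are hypothetical, and the `i32` ID type is an assumption. The second function is there only to show that the user ID is trivially recoverable from such a username, which is the property being weighed above.

```rust
/// Hypothetical helper: placeholder username for a deleted GitHub account.
/// The `i32` ID type is an assumption for illustration.
fn ghost_username(user_id: i32) -> String {
    format!("ghost_{user_id}")
}

/// Hypothetical inverse: anyone who sees the username can read the ID back.
fn user_id_from_ghost(username: &str) -> Option<i32> {
    username.strip_prefix("ghost_")?.parse().ok()
}
```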

Other

Any other concerns I haven't addressed?

@carols10cents carols10cents force-pushed the refresh-github branch 3 times, most recently from 11dbdda to 4514edb Compare March 20, 2026 19:41
@carols10cents carols10cents force-pushed the refresh-github branch 2 times, most recently from f9c1c72 to 27d3a01 Compare April 22, 2026 15:07
@carols10cents carols10cents force-pushed the refresh-github branch 2 times, most recently from b872321 to a4b72c9 Compare May 3, 2026 01:51
@carols10cents carols10cents marked this pull request as ready for review May 4, 2026 14:10
@carols10cents carols10cents changed the title WIP: Check GitHub API proactively for renamed or deleted users Check GitHub API proactively for renamed or deleted users May 4, 2026
Comment thread src/worker/jobs/update_from_github.rs Outdated
Comment thread crates/crates_io_github/src/lib.rs Outdated
Comment thread src/worker/jobs/update_from_github.rs Outdated
Comment thread crates/crates_io_database/src/schema.rs Outdated
Comment thread src/worker/jobs/update_from_github.rs Outdated
Contributor

@eth3lbert eth3lbert left a comment

I'll try to review this over the weekend!

Comment thread crates/crates_io_database/src/schema.rs Outdated
@carols10cents
Member Author

OK, new approach here. I'm leaving the failing migration check for now so that this doesn't get merged (I think there are still a few details to work out) and because I have to go 😅

But I'd love thoughts!

Comment thread migrations/2026-05-05-180007-0000_add_last_sync_to_oauth_github/up.sql Outdated
Comment thread src/worker/jobs/update_user_from_github.rs Outdated
Comment thread src/worker/jobs/update_user_from_github.rs Outdated
Comment thread src/bin/crates-admin/enqueue_job.rs
Comment thread src/worker/jobs/update_user_from_github.rs Outdated
Comment thread src/worker/jobs/update_user_from_github.rs
Comment thread src/worker/jobs/update_user_from_github.rs Outdated
Comment thread src/worker/jobs/update_user_from_github.rs Outdated
Comment thread src/worker/jobs/update_user_from_github.rs
Comment thread src/worker/jobs/update_user_from_github.rs Outdated
Comment on lines +149 to +152
error!(
"Could not update user ID {} from username {} to username {}: {e}",
self.user_id, self.old_username, github_user.login,
);
Member

@Turbo87 Turbo87 May 11, 2026

Any reason for logging the error instead of failing the background job, so that it stays in the queue and will be retried?

Member Author

I was trying to be conservative: a failed update would be retried only the next time we enqueue a batch with the admin job, rather than many times via the background job retry. That way, if something is completely wrong when we try this for real, we aren't doing a lot of wrong work repeatedly, and we don't need to scramble to get the jobs out of the queue.

What do you think?
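
The two options in this thread can be sketched as follows. All types and names here are hypothetical stand-ins rather than the crates.io worker API; the sketch only shows how the choice changes what the queue sees when an update fails.

```rust
/// What the job queue sees when a job finishes (hypothetical type).
#[derive(Debug, PartialEq)]
enum JobOutcome {
    /// The job leaves the queue, whether or not the update succeeded.
    Done,
    /// The job stays in the queue and is retried by the worker.
    Retry(String),
}

/// The two policies discussed above (hypothetical names).
enum OnUpdateError {
    LogAndFinish,
    FailAndRetry,
}

fn finish_job(update_result: Result<(), String>, policy: OnUpdateError) -> JobOutcome {
    match (update_result, policy) {
        (Ok(()), _) => JobOutcome::Done,
        (Err(e), OnUpdateError::LogAndFinish) => {
            // Conservative: record the failure and wait for the next admin-enqueued batch.
            eprintln!("Could not update user: {e}");
            JobOutcome::Done
        }
        // Propagate the error so the background worker retries the job.
        (Err(e), OnUpdateError::FailAndRetry) => JobOutcome::Retry(e),
    }
}
```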

Steps broken down roughly into functions that the background job
calls that have `todo!()` implementations for now
Set all existing rows to 1970 so that we'll refresh them soon. Set new
rows to now, because creating a new user/oauth_github record means we
just got the user's github information (which counts as a sync).
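
A minimal in-memory sketch of why the 1970 backfill works, assuming the job picks rows with the oldest sync time first. The struct, its field names, and the selection function are hypothetical; the real job would presumably do this ordering in a database query.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical in-memory stand-in for an oauth_github row.
struct OauthGithubRow {
    account_id: i64,
    last_sync: SystemTime,
}

/// Value backfilled into existing rows: 1970-01-01, older than any real sync.
fn backfilled_last_sync() -> SystemTime {
    UNIX_EPOCH
}

/// Pick up to `limit` accounts, least recently synced first.
fn next_batch(rows: &[OauthGithubRow], limit: usize) -> Vec<i64> {
    let mut sorted: Vec<&OauthGithubRow> = rows.iter().collect();
    // Stable sort: rows with equal timestamps keep their original order.
    sorted.sort_by_key(|row| row.last_sync);
    sorted.into_iter().take(limit).map(|row| row.account_id).collect()
}
```

Under that assumption, every backfilled row sorts ahead of any row created after the migration, so existing users are refreshed before newly created ones.
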
@rustbot
Collaborator

rustbot commented May 13, 2026

This PR was rebased onto a different main commit. Here's a range-diff highlighting what actually changed.

Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.

The diesel-guard message says:

>  Note: For Postgres 11+, this is safe if the default is a constant value.

which it is.

We don't actually need the part where I changed the default after
adding the column because we set the default explicitly for new rows in
the code.
Anonymously via the regular GitHub API. Except for enterprise managed
users that have an underscore in their username; anonymous API requests
will always return 404 for them (so don't try, just keep their
username).

If the fallback request to get public user info by the user ID fails,
probably because we've hit the rate limit, fail the job and try again
later.
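
The decision flow described in the two commit messages above could be sketched roughly as follows. The types and names are hypothetical, and treating a "not found" response as a deleted account is an assumption based on the PR's stated goal, not something these commit messages confirm.

```rust
/// Result of an anonymous lookup of public user info by GitHub user ID (hypothetical).
enum Lookup {
    Found(String),
    NotFound,
    /// For example, the rate limit was hit.
    Failed,
}

#[derive(Debug, PartialEq)]
enum SyncAction {
    KeepUsername,
    Rename(String),
    /// Assumption: a missing account is handled as deleted.
    MarkDeleted,
    /// Fail the job so that it is tried again later.
    RetryLater,
}

fn decide(current_login: &str, lookup_by_id: impl FnOnce() -> Lookup) -> SyncAction {
    // Usernames with an underscore belong to enterprise managed users; anonymous
    // requests always return 404 for them, so skip the request entirely.
    if current_login.contains('_') {
        return SyncAction::KeepUsername;
    }
    match lookup_by_id() {
        Lookup::Found(login) if login == current_login => SyncAction::KeepUsername,
        Lookup::Found(login) => SyncAction::Rename(login),
        Lookup::NotFound => SyncAction::MarkDeleted,
        Lookup::Failed => SyncAction::RetryLater,
    }
}
```

Passing the lookup as a closure keeps the underscore check ahead of any network call, which matches the "don't try" instruction in the commit message.
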
And oauth_github is the only source used to enqueue jobs.

I'll update the `user.gh_login` records for users with gh_id < 1
manually once this whole strategy is approved, to get us to a state
where `user.gh_login` is unique-- but that isn't what this job needs to
do in perpetuity, so don't.
@carols10cents
Member Author

@Turbo87 @eth3lbert this is ready for rereview!

@LawnGnome
Contributor

I'll try to do a full review on this next week, but don't block on me if @Turbo87 and/or @eth3lbert approve it; I'm happy to review this ex post facto in my copious amounts of spare time.

Just on your question for me in your post, though:

Am I missing a way the user ID could be used that would make ghost_{user id} a bad idea?

I think your analysis is basically correct, and you've captured my concern: that having the ID might make it easier to work backwards to identify a user who no longer wants to be identified.[1]

But, as you also said, the API is returning the ID anyway right now, unless we decide to take extra steps to merge deleted users into a singular ghost user (which I am explicitly not suggesting here, and think we'd want to RFC if we ever did go down that road), so having it also be in the user name seems reasonable enough for now.

So no, I don't think you're missing a way, and I don't think it's a blocker here.

Footnotes

  1. You didn't ask, but to get it off my chest: having just gone through the GDPR erasure process, I think it's a slightly unfortunate combination of well-intentioned, rooted in a real concern, and fundamentally useless on an Internet where public things are archived and mirrored willy-nilly. I think it's reasonable for us to enable steps that will make it harder for a bad person to figure out who a deleted user is, but it's basically always a balancing act, since I can pretty much guarantee there are people archiving things like database dumps and mirroring metadata whether we — or the deleted users — like it or not.

6 participants