Check GitHub API proactively for renamed or deleted users#13143
Check GitHub API proactively for renamed or deleted users#13143carols10cents wants to merge 12 commits into
Conversation
11dbdda to
4514edb
Compare
36268b8 to
f166d51
Compare
f9c1c72 to
27d3a01
Compare
b872321 to
a4b72c9
Compare
a925015 to
82d36bb
Compare
82d36bb to
f7cfc39
Compare
This comment has been minimized.
This comment has been minimized.
|
Ok new approach here, leaving the failing migration check for now so that this doesn't get merged (i think there are still a few details to work out) and because I have to go 😅 But I'd love thoughts! |
| error!( | ||
| "Could not update user ID {} from username {} to username {}: {e}", | ||
| self.user_id, self.old_username, github_user.login, | ||
| ); |
There was a problem hiding this comment.
any reason for logging the error instead of failing the background job so that it stays in the queue and will be retried?
There was a problem hiding this comment.
I was trying to be conservative and only retrying potentially the next time we enqueue a batch with the admin job, rather than retrying a bunch of times via the background job retry, so that if something is completely wrong when we try this for real, we aren't doing a whole bunch of wrong stuff a bunch of times (and not needing to scramble to get the jobs out of the queue).
What do you think?
Steps broken down roughly into functions that the background job calls that have `todo!()` implementations for now
Set all existing rows to 1970 so that we'll refresh them soon. Set new rows to now, because creating a new user/oauth_github record means we just got the user's github information (which counts as a sync).
f7cfc39 to
2234817
Compare
|
This PR was rebased onto a different main commit. Here's a range-diff highlighting what actually changed. Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers. |
The diesel-guard message says: > Note: For Postgres 11+, this is safe if the default is a constant value. which it is. We don't actually need the part where I changed the default after adding the column because we set the default explicitly for new rows in the code.
2234817 to
748df9c
Compare
Anonymously via the regular GitHub API. Except for enterprise managed users that have an underscore in their username; anonymous API requests will always return 404 for them (so don't try, just keep their username). If the fallback request to get public user info by the user ID fails, probably because we've hit the rate limit, fail the job and try again later.
And oauth_github is the only source used to enqueue jobs. I'll update the `user.gh_login` records for users with gh_id < 1 manually once this whole strategy is approved, to get us to a state where `user.gh_login` is unique-- but that isn't what this job needs to do in perpetuity, so don't.
Thanks DataDog 😞
21f5842 to
dc44cff
Compare
|
@Turbo87 @eth3lbert this is ready for rereview! |
|
I'll try to do a full review on this next week, but don't block on me if @Turbo87 and/or @eth3lbert approve it; I'm happy to review this ex post facto in my copious amounts of spare time. Just on your question for me in your post, though:
I think your analysis is basically correct, and you've captured my concern: that having the ID might make it easier to work backwards to identify a user who no longer wants to be identified.1 But, as you also said, the API is returning the ID anyway right now, unless we decide to take extra steps to merge deleted users into a singular So no, I don't think you're missing a way, and I don't think it's a blocker here. Footnotes
|
Also to hopefully get the database to a state where
users.gh_loginis unique.Rates
The part that I haven't decided/completed yet is how often to run the
UpdateFromGithubjob and at what pace. Currently, the job must be enqueued completely manually and it does a batch of 100 crates.io users then stops at a pace under the GitHub API rate limits.We definitely need to run the job against all users once before thinking about implementing rust-lang/rfcs#3946, so that 1. we start all accounts off with their current GitHub username as their crates.io username and 2. the crates.io username based on the GitHub username can be unique.
What I'm imagining is that once we decide to implement the part of rust-lang/rfcs#3946 that stops updating your crates.io username in sync with your GitHub username, then the
UpdateFromGithubjob will only updateoauth_github.loginand notusers.gh_login(or whatever that column ends up getting renamed to when it becomes the crates.io username).Reasons to be cautious about how often and how fast this job runs:
Reasons to turn this on to run all the time as fast as possible:
And of course there are places between "as slow as possible" and "as fast as possible" that we could land; I'm interested to get peoples' thoughts on the tradeoffs.
User IDs
@LawnGnome You mentioned the other day something about the crates.io user account ID being PII that we should be cautious about using, mostly because archives exist. I'm interested in your thoughts about using
ghost_{user id}as the crates.io username for deleted GitHub accounts. I think this is fine, and here's why:https://crates.io/api/v1/users/{username}include the user's crates.io ID; it's not secret informationghost_123that owns cratefoo; the archive can tell us through user ID association that this user previously had the usernameLawnGnome, and perhaps this person deleted their GitHub account because they didn't want people to know they published cratefoo. I don't think this is actually a reason not to use user IDs in this way because:foowasLawnGnomeanywayAm I missing a way the user ID could be used that would make
ghost_{user id}a bad idea?Other
Any other concerns I haven't addressed?