48 commits
aa1e18b
Initial stub of python files
timj May 28, 2026
d016dba
Very basic CLAUDE.md
timj May 28, 2026
4f7b076
Initial summary of what the repo is trying to achieve
timj May 28, 2026
cf3e82a
Add design spec for github-summarizer
timj May 28, 2026
fc52c39
fixup! Initial stub of python files
timj May 28, 2026
2cfc464
Revise spec: split derived model, exclude archived/disabled by default
timj May 28, 2026
8cd1631
Add implementation plan for github-summarizer
timj May 28, 2026
6d0b108
Add to CLAUDE instructions
timj May 28, 2026
f58dbde
Add dependencies, entry point, and test scaffolding
timj May 28, 2026
7bbe956
Add Repository and RepositorySummary models
timj May 28, 2026
ba5e096
Add precommit
timj May 28, 2026
f057df9
Add configuration models and YAML loader
timj May 28, 2026
2b78db5
Add activity classification
timj May 28, 2026
240e4b4
Add semantic grouping with auto-topic clustering
timj May 28, 2026
4de9573
Add GitHub GraphQL source and token resolution
timj May 28, 2026
34ee509
Add orchestrator with archived/disabled filtering
timj May 28, 2026
3ccdfa4
Add markdown, CSV, and JSON report writers
timj May 28, 2026
9508d40
Add Click CLI with report subcommand
timj May 28, 2026
a45f50f
Add CI workflow and example configuration
timj May 28, 2026
93931d9
Remove unused fsspec mypy override
timj May 28, 2026
a2d43ce
Add request diagnostics and configurable HTTP timeout
timj May 29, 2026
5413804
Add design spec for raw data cache
timj May 29, 2026
8056a71
Add implementation plan for raw data cache
timj May 29, 2026
16b436e
Add raw repository cache module
timj May 29, 2026
6832c99
Refactor CLI error handling into shared context manager
timj May 29, 2026
3f5e3fd
Add fetch subcommand to save raw repository data
timj May 29, 2026
c3bf3dd
Add report --from-raw to render from a cached file
timj May 29, 2026
3a2508b
Show data-fetched timestamp in markdown reports from a cache
timj May 29, 2026
822fde1
Sort report tables case-insensitively
timj May 29, 2026
aa6a02f
Add -h as an alias for --help
timj May 29, 2026
e394945
Add lsst-dm config
timj May 29, 2026
1334e1d
Omit archived/disabled summary counts unless included
timj May 29, 2026
3e37ec6
Move example config to configs dir
timj May 29, 2026
44bed5d
Add lsst org config
timj May 29, 2026
e817882
Appease Codex by linking AGENTS to CLAUDE
timj May 29, 2026
dfa7371
Add CSV metadata update workflow
timj May 29, 2026
cee4233
Annotate topic metadata diffs
timj May 29, 2026
881186f
Update the lsst org config
timj May 29, 2026
69a17fc
Prefer name rules over topic grouping
timj May 29, 2026
d315660
Combine all statements of work into one group
timj May 29, 2026
f758fd1
Add ignored topics for grouping
timj May 29, 2026
570003e
Allow groups of size 2
timj May 29, 2026
ab89901
Use commit history for activity classification
timj May 30, 2026
45b33a2
Add an "abandoned" state for older than 5 years
timj May 30, 2026
ffdc992
Make REST JSON payloads deterministic
timj Jun 1, 2026
6dd210c
Add support for typst output for better PDF creation
timj Jun 1, 2026
4cb03ee
Add basic lsst-sqre config
timj Jun 1, 2026
48b08f1
Add stubs for lsst-ts and lsst-sims
timj Jun 3, 2026
28 changes: 28 additions & 0 deletions .github/workflows/ci.yaml
@@ -0,0 +1,28 @@
name: CI

on:
  push:
    branches: [main]
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.13", "3.14"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install
        run: pip install -e '.[test]' ruff mypy
      - name: Ruff check
        run: ruff check .
      - name: Ruff format
        run: ruff format --check .
      - name: Mypy
        run: mypy python
      - name: Pytest
        run: pytest -v
17 changes: 17 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,17 @@
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
    hooks:
      - id: check-yaml
        args:
          - "--unsafe"
      - id: end-of-file-fixer
      - id: trailing-whitespace
      - id: check-toml
  - repo: https://github.com/astral-sh/ruff-pre-commit
    # Ruff version.
    rev: v0.14.5
    hooks:
      - id: ruff-check
        args: [--fix]
      - id: ruff-format
1 change: 1 addition & 0 deletions AGENTS.md
19 changes: 19 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,19 @@
# CLAUDE Guidelines

## What is this?

This is a tool for scanning a GitHub org and generating reports on the state of the repositories in that org.

## Build Guidelines

Uses standard Python tooling.

* The code must always pass `ruff check`, `ruff format` and `mypy`.
* There should be unit tests for all public APIs. Pytest can be used for testing infrastructure.
* Use Click for command line tooling with subcommands.
* Use GitHub workflows for CI.

## Coding Policies

* Place imports at the top of the file unless there is a circular import problem.
  We do not want imports inside individual methods or functions.
234 changes: 234 additions & 0 deletions GOALS.md
@@ -0,0 +1,234 @@
# GitHub Organization Repository Report Generator

## Goal

Create a Python command-line tool that generates a report for all repositories in a GitHub organization.

The report must summarize basic repository metadata, activity status, and semantic grouping. The tool should be flexible enough to add more fields later without large architectural changes.

## Required repository fields

For every repository, collect:

- Repository name
- Repository URL
- Archived status
- Disabled status
- Description
- Primary language
- GitHub topic labels
- Last push timestamp
- Activity status
- Semantic group
- Grouping reason

## Data source

Use the GitHub GraphQL API as the primary data source.

The query should paginate over all repositories in the configured organization and retrieve at least:

- name
- url
- description
- isArchived
- isDisabled
- primaryLanguage { name }
- pushedAt
- repositoryTopics
- defaultBranchRef { name }

The tool should authenticate using a GitHub token from either:

- GITHUB_TOKEN
- an explicit CLI option
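
As a rough illustration of the pagination described above, the sketch below walks the `repositories` connection with cursor-based paging. The query text, the `iter_org_repositories` name, and the injected `execute` callable (assumed to return the `data` portion of a GraphQL response) are all hypothetical; they are not taken from this repository's implementation.

```python
from collections.abc import Callable, Iterator
from typing import Any

# Assumed query shape, built from the field list above. GitHub limits
# connection page sizes to 100 items.
ORG_REPOS_QUERY = """
query($org: String!, $cursor: String) {
  organization(login: $org) {
    repositories(first: 100, after: $cursor) {
      pageInfo { hasNextPage endCursor }
      nodes {
        name
        url
        description
        isArchived
        isDisabled
        primaryLanguage { name }
        pushedAt
        repositoryTopics(first: 20) { nodes { topic { name } } }
        defaultBranchRef { name }
      }
    }
  }
}
"""

# Hypothetical transport: takes (query, variables), returns the "data" payload.
Execute = Callable[[str, dict[str, Any]], dict[str, Any]]


def iter_org_repositories(org: str, execute: Execute) -> Iterator[dict[str, Any]]:
    """Yield every repository node, following cursors until the last page."""
    cursor: str | None = None
    while True:
        data = execute(ORG_REPOS_QUERY, {"org": org, "cursor": cursor})
        page = data["organization"]["repositories"]
        yield from page["nodes"]
        if not page["pageInfo"]["hasNextPage"]:
            return
        cursor = page["pageInfo"]["endCursor"]
```

Passing the transport in as a callable keeps the paging logic testable with mocked responses, in line with the testing section of this document.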

## Activity status

Compute activity status from the last push date.

Default thresholds:

- Active: pushed within 90 days
- Warm: pushed within 180 days
- Quiet: pushed within 365 days
- Dormant: no push in more than 365 days
- Archived: repository is archived
- Disabled: repository is disabled

Archived and disabled statuses take precedence over date-based activity status.

Thresholds must be configurable.

## Semantic grouping

Repositories should be assigned to semantic groups using configurable rules.

Rule precedence:

1. Explicit repository-name override
2. Topic-based grouping
3. Regex/name-based grouping
4. Fallback group

Each classified repository should include a grouping_reason, such as:

- override:special-repo-name
- topic:pipelines
- regex:^DMTN-[0-9]+$
- fallback

Example use cases:

- Repositories with the pipelines topic should be grouped together as a virtual monorepo.
- Repositories matching ^DMTN-[0-9]+$ should be grouped as Data Management Tech Notes.
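
The precedence order above could be implemented along the following lines. `GroupRule`, `assign_group`, and the `"Uncategorized"` fallback name are hypothetical and exist only to illustrate the ordering and the `grouping_reason` format.

```python
import re
from dataclasses import dataclass, field


@dataclass
class GroupRule:
    """Hypothetical rule model: a group matched by topics and/or name regexes."""

    name: str
    topics: list[str] = field(default_factory=list)
    regex: list[str] = field(default_factory=list)


def assign_group(
    repo_name: str,
    repo_topics: list[str],
    rules: list[GroupRule],
    overrides: dict[str, str],
    fallback: str = "Uncategorized",
) -> tuple[str, str]:
    """Return (group, grouping_reason) using override > topic > regex > fallback."""
    if repo_name in overrides:
        return overrides[repo_name], f"override:{repo_name}"
    # All topic rules are tried before any regex rule, so a topic match on a
    # later rule still beats a regex match on an earlier one.
    for rule in rules:
        for topic in rule.topics:
            if topic in repo_topics:
                return rule.name, f"topic:{topic}"
    for rule in rules:
        for pattern in rule.regex:
            if re.search(pattern, repo_name):
                return rule.name, f"regex:{pattern}"
    return fallback, "fallback"


if __name__ == "__main__":
    rules = [
        GroupRule("Pipelines", topics=["pipelines"]),
        GroupRule("Data Management Tech Notes", regex=["^DMTN-[0-9]+$"]),
    ]
    print(assign_group("DMTN-123", [], rules, {}))
```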

## Configuration

Use a YAML configuration file.

Example:
```yaml
org: lsst

activity:
  active_days: 90
  warm_days: 180
  quiet_days: 365

groups:
  - name: Pipelines
    topics:
      - pipelines
    description: Virtual monorepo / coordinated pipelines repositories

  - name: Data Management Tech Notes
    regex:
      - "^DMTN-[0-9]+$"
      - "^dmtn-[0-9]+$"

  - name: Documentation
    topics:
      - documentation
      - docs

overrides:
  special-repo-name:
    group: Pipelines
    notes: Manual override
```

## Output formats

The tool should support at least:

- Markdown report
- CSV inventory
- JSON inventory

Markdown report structure:

1. Title and generation timestamp
2. Summary counts:
   - total repositories
   - active repositories
   - warm repositories
   - quiet repositories
   - dormant repositories
   - archived repositories
   - disabled repositories
3. Grouped repository sections
4. Uncategorized repositories section
5. Optional appendix containing the raw inventory table

Each group section should include:

- Group name
- Optional group description
- Number of repositories
- Table of repositories

Repository table columns:

- Repo
- Description
- Primary language
- Topics
- Last push
- Activity status
- Archived
- Disabled
- Grouping reason

## CLI interface

Suggested command:

```bash
github-org-report \
  --config report-config.yaml \
  --format markdown \
  --output repo-report.md
```

Consider using Click with subcommands to simplify the user interface and leave room to expand it.
For example, a `github-org report` subcommand would give us more flexibility later if we wanted to generate other kinds of output.

Optional flags:

```bash
--org ORG
--token TOKEN
--format markdown|csv|json
--output PATH
--include-archived
--include-disabled
--verbose
```

## Architecture

Suggested modules:

- github_client.py
  - GraphQL client
  - pagination
  - rate-limit handling
- models.py
  - repository data model
  - group rule model
- activity.py
  - activity classification logic
- grouping.py
  - semantic grouping logic
- report.py
  - markdown, CSV, and JSON output writers
- cli.py
  - command-line interface

## Testing

Add tests for:

- activity status classification
- topic-based grouping
- regex-based grouping
- override precedence
- fallback grouping
- markdown report rendering

Use mocked GitHub API responses for tests.

## Future extensions

The implementation should make it easy to add:

- commit counts over the last 30/90/365 days
- open issue and pull request counts
- default branch
- license
- branch protection status
- CODEOWNERS presence
- CI workflow presence
- release recency
- ownership/team metadata
- repo health score
28 changes: 28 additions & 0 deletions README.md
@@ -1,2 +1,30 @@
# github-summarizer
Summarize the repos in a GitHub org and generate a report.

## Grouping topic exemptions

Use top-level `ignored_topics` for broadly applied GitHub topics that should
not influence grouping. Ignored topics still appear in reports and metadata,
but they do not match configured topic rules or create automatic topic groups.

```yaml
ignored_topics: [hacktoberfest]
```

`auto_group_by_topic.ignore_topics` remains available for topics that should
only be skipped by automatic topic clusters.
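
The difference between the two settings can be sketched as follows. Both helper names are hypothetical, and the clustering shown here (one cluster per topic, kept when it reaches `min_repos`) is a simplification that may not match the tool's actual behaviour.

```python
from collections import defaultdict


def topics_for_rules(repo_topics: list[str], ignored_topics: list[str]) -> list[str]:
    """Topics that may match configured topic rules (global ignores removed)."""
    ignored = set(ignored_topics)
    return [topic for topic in repo_topics if topic not in ignored]


def auto_topic_clusters(
    repos: dict[str, list[str]],
    ignored_topics: list[str],
    auto_ignore_topics: list[str],
    min_repos: int = 3,
) -> dict[str, list[str]]:
    """Cluster repo names by topic, skipping both ignore lists."""
    # Automatic clusters skip the global list and the auto-only list.
    skip = set(ignored_topics) | set(auto_ignore_topics)
    clusters: dict[str, list[str]] = defaultdict(list)
    for name, topics in repos.items():
        for topic in topics:
            if topic not in skip:
                clusters[topic].append(name)
    return {t: sorted(names) for t, names in clusters.items() if len(names) >= min_repos}
```

In this sketch a topic listed only in `auto_group_by_topic.ignore_topics` still reaches `topics_for_rules`, so it can match an explicit group rule while never forming an automatic cluster.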

## CSV metadata updates

Export a CSV report, edit only the `description` and `topics` columns in a
spreadsheet, save it as CSV, then apply the intentional edits:

```sh
github-summarizer report --config configs/lsst-config.yaml --format csv --output repos-baseline.csv
github-summarizer apply-metadata --config configs/lsst-config.yaml --baseline repos-baseline.csv --input repos-edited.csv
```

By default `apply-metadata` prompts before each GitHub update. Use `--dry-run`
to preview updates or `--yes` to apply all safe updates without prompting. A
safe update means the live GitHub value still matches the baseline CSV value;
stale edits are skipped unless `--allow-stale` is supplied.
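
The safety rule described above amounts to a three-way comparison of baseline, edited, and live values. The sketch below is only an illustration of that rule; `plan_update` and its return strings are hypothetical and are not part of the `apply-metadata` command's interface.

```python
def plan_update(baseline: str, edited: str, live: str, *, allow_stale: bool = False) -> str:
    """Decide what to do with one edited field.

    Returns "unchanged" if the field was not edited, "apply" if the edit is
    safe (or stale edits are allowed), and "skip-stale" otherwise.
    """
    if edited == baseline:
        return "unchanged"
    # Safe: nobody has changed the value on GitHub since the baseline export.
    if live == baseline:
        return "apply"
    return "apply" if allow_stale else "skip-stale"


if __name__ == "__main__":
    print(plan_update("Old text", "New text", "Old text"))
    print(plan_update("Old text", "New text", "Changed on GitHub"))
```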
28 changes: 28 additions & 0 deletions configs/example-config.yaml
@@ -0,0 +1,28 @@
org: lsst

activity:
  active_days: 90
  warm_days: 180
  quiet_days: 365
  abandoned_days: 1825

ignored_topics: [hacktoberfest]

groups:
  - name: Pipelines
    topics: [pipelines]
    description: Virtual monorepo / coordinated pipelines repositories
  - name: Data Management Tech Notes
    glob: ["DMTN-*"]
  - name: Documentation
    topics: [documentation, docs]

auto_group_by_topic:
  enabled: true
  min_repos: 3
  ignore_topics: []

overrides:
  special-repo-name:
    group: Pipelines
    notes: Manual override