Anki Miner is a PyQt6 desktop application. It processes anime video/subtitle files through a 5-stage pipeline to create Japanese vocabulary flashcards in Anki.
The core data flow is a linear 5-stage pipeline orchestrated by EpisodeProcessor. YouTube mining prepends a fetch pre-stage that produces the same (video, subtitle) pair the file-based flow starts from; everything downstream is unchanged.
YouTube URL (optional entry point)
│
▼
┌─────────────────────────────────────────────────────┐
│ 0. Fetch (YouTube only) │
│ YouTubeFetcherService (yt-dlp subprocess) │
│ probe_metadata(url) → VideoInfo │
│ fetch_video(url, workspace, sub_mode) │
│ → FetchedMedia(video_file, subtitle_file, ...) │
└─────────────────────────────────────────────────────┘
│
▼
Subtitle file (ASS/SRT/SSA)
│
▼
┌─────────────────────────────────────────────────────┐
│ 1. Parse Subtitles │
│ SubtitleParserService (pysubs2 + fugashi/MeCab) │
│ → list[TokenizedWord] │
├─────────────────────────────────────────────────────┤
│ 2. Filter Unknown Words │
│ WordFilterService + AnkiService │
│ + optional: FrequencyService, WordListService, │
│ KnownWordDB, cross-episode counts │
│ → list[TokenizedWord] (unknown only) │
├─────────────────────────────────────────────────────┤
│ 3. Extract Media │
│ MediaExtractorService (ffmpeg, parallel) │
│ → list[(TokenizedWord, MediaData)] │
├─────────────────────────────────────────────────────┤
│ 4. Fetch Definitions │
│ DefinitionService → DictionaryProviders │
│ (JMdictProvider offline → JishoProvider fallback) │
│ → list[str | None] │
├─────────────────────────────────────────────────────┤
│ 5. Create Anki Cards │
│ AnkiService (AnkiConnect HTTP API) │
│ → cards_created count │
└─────────────────────────────────────────────────────┘
│
▼
ProcessingResult
Cancellation is checked between each phase. Preview mode exits after stage 2 (shows words, creates no cards). An optional curation callback lets the GUI present a word selection dialog between stages 2 and 3.
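The checkpoint-between-stages pattern can be sketched as follows (class, stage, and flag names here are illustrative, not the actual `EpisodeProcessor` API):

```python
import threading

class CancelledError(Exception):
    """Raised when processing is cancelled between stages."""

class PipelineSketch:
    def __init__(self):
        self._cancel = threading.Event()

    def cancel(self):
        self._cancel.set()

    def _checkpoint(self):
        if self._cancel.is_set():
            raise CancelledError("cancelled between stages")

    def run(self, stages, preview_mode=False):
        result = None
        for index, stage in enumerate(stages, start=1):
            self._checkpoint()          # cancellation is checked between phases
            result = stage(result)
            if preview_mode and index == 2:
                return result           # preview exits after filtering; no cards
        return result

# Hypothetical stages standing in for parse/filter/extract/define/create
stages = [lambda _: ["word"], lambda w: w, lambda w: w, lambda w: w, lambda w: len(w)]
print(PipelineSketch().run(stages))                      # all 5 stages
print(PipelineSketch().run(stages, preview_mode=True))   # stops after stage 2
```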
gui/
│
▼
orchestration/
│
▼
services/
services/providers/
│
┌───────┼───────┐
▼ ▼ ▼
interfaces/ models/ utils/
│
▼
models/
config/ ← used by all packages
exceptions/ ← used by all packages
Leaf packages (config, models, exceptions, utils) have no internal dependencies. interfaces depends only on models for type signatures. services depends on interfaces, models, config, exceptions, and utils. orchestration composes services. gui is the sole top-level entry point.
Three protocols in interfaces/ define the system's extension points:
PresenterProtocol (interfaces/presenter.py): output abstraction with 7 methods.
`show_info`, `show_success`, `show_warning`, `show_error`: message display. `show_validation_result(ValidationResult)`: system check results. `show_processing_result(ProcessingResult)`: episode processing summary. `show_word_preview(list[TokenizedWord])`: discovered word listing.
Implementations: GUIPresenter (Qt signals) and NullPresenter (tests). The protocol is preserved even without a CLI so that workers, orchestration, and services stay UI-agnostic and fully testable.
ProgressCallback (interfaces/progress.py): progress reporting with 4 methods.
`on_start(total, description)`, `on_progress(current, item_description)`, `on_complete()`, `on_error(item_description, error_message)`.
DictionaryProvider (interfaces/dictionary_provider.py): pluggable dictionary backend.
`name` property, `is_available()`, `load()`, `lookup(word) -> str | None`.
All use typing.Protocol for structural subtyping. Implementations satisfy the protocol via duck typing, without explicit inheritance.
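The structural-subtyping pattern in miniature (a simplified stand-in, not the real `PresenterProtocol`):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class PresenterSketch(Protocol):
    """Toy stand-in for a PresenterProtocol-style interface."""
    def show_info(self, message: str) -> None: ...

class ConsolePresenter:
    # No inheritance from PresenterSketch -- matching the method
    # signature is enough for the protocol to be satisfied.
    def show_info(self, message: str) -> None:
        print(f"INFO: {message}")

presenter = ConsolePresenter()
# @runtime_checkable allows a structural isinstance() check:
assert isinstance(presenter, PresenterSketch)
presenter.show_info("hello")
```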
Data classes in models/:
| Model | File | Purpose |
|---|---|---|
| `TokenizedWord` | `word.py` | Parsed word with surface, lemma, reading, sentence, timing, furigana, frequency_rank |
| `WordData` | `word.py` | `TokenizedWord` + definition + media paths + pitch accent |
| `MediaData` | `media.py` | Screenshot/audio file paths and filenames |
| `ProcessingResult` | `processing.py` | Pipeline output: word counts, card count, errors, elapsed time, comprehension %, card IDs |
| `ValidationResult` | `processing.py` | System check results: connectivity, tool availability, issues list |
| `ValidationIssue` | `processing.py` | Component + severity (ERROR/WARNING) + message |
| `BatchQueue` / `QueueItem` | `batch_queue.py` | Batch processing queue with PENDING/PROCESSING/COMPLETED/ERROR states |
| `MiningSession` | `stats.py` | Analytics: series, episode, word counts, timing |
| `SeriesStats` / `OverallStats` | `stats.py` | Aggregated analytics |
| `DifficultyEntry` | `stats.py` | Per-episode difficulty tracking |
| `HistoryEntry` | `history.py` | Mining history record with undo support (stores card IDs) |
| `VideoInfo` | `youtube.py` | YouTube probe result: id, title, duration, sub availability, is_live, is_age_restricted |
| `FetchedMedia` | `youtube.py` | yt-dlp fetch result: video path, subtitle path, sub_source ("manual" or "auto") |
| `SubMode` | `youtube.py` | `Literal["manual_only", "auto_only"]`; resolved in the GUI from the probe + user acceptance |
Stateless business logic classes in services/. Each receives the frozen AnkiMinerConfig in its constructor.
Core services (always created):
- SubtitleParserService: parses ASS/SRT/SSA files via `pysubs2`, tokenizes Japanese text with `fugashi` (MeCab wrapper), generates furigana annotations, deduplicates by lemma+surface.
- WordFilterService: multi-layer filtering via `filter_unknown` (against known vocabulary), `filter_by_length`, `filter_by_frequency`, `filter_by_word_lists`, `deduplicate_by_sentence`, and `filter_by_episode_count`.
- MediaExtractorService: extracts screenshots (`ffmpeg -frames:v 1`) and audio clips (`ffmpeg` with `libmp3lame`) at subtitle timestamps. Runs in parallel via `ThreadPoolExecutor` with `max_parallel_workers` threads. Auto-detects the Japanese audio stream via `ffprobe` with thread-safe caching.
- DefinitionService: orchestrates the provider chain. Default mode: JMdict first, Jisho only if JMdict is unavailable. Returns HTML-formatted definition strings.
- AnkiService: AnkiConnect HTTP API wrapper (localhost:8765). Key operations: `get_existing_vocabulary`, `store_media_file`, `create_cards_batch` (batch size 50), `delete_notes`. Stores `last_created_note_ids` for undo support.
- ValidationService: checks AnkiConnect connectivity, ffmpeg presence, deck existence, and note type existence. Returns `ValidationResult` (never raises).
- YouTubeFetcherService (`services/youtube_fetcher.py`): wraps the `yt-dlp` subprocess. Two entry points: `probe_metadata(url) → VideoInfo` (fast, `--skip-download --dump-single-json`) and `fetch_video(url, workspace, sub_mode, progress_cb, cancel_event) → FetchedMedia`. Detects native vs. translated auto-captions via `_has_native_auto_ja()`. Tracks the `Popen` handle so cancellation can kill the full process tree (yt-dlp → ffmpeg child) via `psutil`. Writes the (video, subtitle) pair into a caller-owned workspace directory.
Optional services (created based on config flags):
- FrequencyService: loads word frequency CSV, exposes `lookup(word) -> rank`.
- PitchAccentService: loads pitch accent CSV, exposes `lookup_batch`.
- KnownWordDB: SQLite-backed persistent known word cache. Supports differential sync with Anki vocabulary.
- WordListService: loads blacklist/whitelist text files for word filtering.
- HistoryService: SQLite-backed mining history (`mining_history` table). Records what was mined, supports undo via stored card IDs.
- StatsService: SQLite-backed analytics (`mining_sessions`, `difficulty_entries` tables). Provides aggregated stats and milestones.
- UpdateChecker: queries the GitHub Releases API for newer versions.
- ExportService: exports results to CSV, TSV, or vocabulary list formats.
Dictionary providers (services/providers/):
- JMdictProvider: parses JMdict XML (~60MB) into an in-memory dict on `load()`. Lookup returns HTML-formatted numbered definitions (max 5 per word).
- JishoProvider: REST client for the jisho.org API. Always available. Rate-limited with a configurable delay (`jisho_delay`).
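The JMdict-first, Jisho-fallback chain can be sketched against the `DictionaryProvider` interface listed earlier (the stub class and chain function below are illustrative, not the real `DefinitionService`):

```python
class StubProvider:
    """Minimal object satisfying the DictionaryProvider shape via duck typing."""
    def __init__(self, name, available, results=None):
        self.name = name
        self._available = available
        self._results = results or {}

    def is_available(self):
        return self._available

    def load(self):
        pass  # real providers parse XML / set up HTTP state here

    def lookup(self, word):
        return self._results.get(word)  # str | None

def lookup_with_fallback(providers, word):
    # Try providers in priority order (JMdict first); skip any that are
    # unavailable, and fall through when a provider has no entry.
    for provider in providers:
        if provider.is_available():
            definition = provider.lookup(word)
            if definition is not None:
                return definition
    return None

jmdict = StubProvider("jmdict", available=False)   # e.g. the XML file is missing
jisho = StubProvider("jisho", available=True, results={"猫": "cat"})
print(lookup_with_fallback([jmdict, jisho], "猫"))  # cat
```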
EpisodeProcessor (orchestration/episode_processor.py):
- Receives all services via constructor injection.
- `process_episode()` runs the 5-stage pipeline.
- `process_youtube_url()` calls `YouTubeFetcherService.fetch_video`, then delegates to the unchanged `process_episode` with `episode_name_override=f"YT:{video_id}"` and `series_name_override="YT:<channel_id>"` (or `"YouTube"` fallback). The workspace is allocated and cleaned by the worker, not the orchestrator.
- Cancellation checkpoints between each phase (`self._cancelled` flag); the YouTube flow additionally threads a `threading.Event` into the fetcher so an in-flight yt-dlp subprocess can be killed.
- Supports `preview_mode` (exits after filtering, no cards created).
- Supports `curation_callback` (GUI presents word selection dialog).
- Supports `cross_episode_counts` for batch frequency filtering.
- Supports `episode_name_override`/`series_name_override` so YouTube-sourced sessions have stable, file-name-independent identity.
- Records to StatsService and KnownWordDB after successful processing.
- Cleans up temp media files in a `finally` block.
FolderProcessor (orchestration/folder_processor.py):
- Wraps `EpisodeProcessor` for batch processing.
- `find_video_subtitle_pairs()` matches files by stem name.
- `collect_cross_episode_frequencies()` does a two-pass approach: first parses all subtitles, then counts word appearances across episodes.
- `process_folder()` iterates matched pairs sequentially.
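Matching by filename stem can be sketched like this (the extension sets are illustrative, not the service's exact lists):

```python
from pathlib import Path

VIDEO_EXTS = {".mkv", ".mp4"}            # illustrative subset
SUBTITLE_EXTS = {".ass", ".srt", ".ssa"}

def find_pairs(files: list[Path]) -> list[tuple[Path, Path]]:
    """Pair each video with the subtitle file sharing the same stem."""
    videos = {f.stem: f for f in files if f.suffix.lower() in VIDEO_EXTS}
    subs = {f.stem: f for f in files if f.suffix.lower() in SUBTITLE_EXTS}
    # Only stems present in both maps form a processable pair
    return [(videos[s], subs[s]) for s in sorted(videos.keys() & subs.keys())]

files = [Path("ep01.mkv"), Path("ep01.ass"), Path("ep02.mkv"), Path("notes.txt")]
for video, sub in find_pairs(files):
    print(video.name, "<->", sub.name)   # ep01.mkv <-> ep01.ass
```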
AnkiMinerConfig (config/config.py) is a frozen (immutable) dataclass with ~30 settings:
- Anki: deck name, note type, field mappings, AnkiConnect URL
- Media: audio padding, screenshot offset, temp folder, subtitle offset
- Filtering: min word length, allowed POS tags, excluded subtypes, deduplication
- Dictionary: JMdict path, offline toggle, Jisho URL/delay
- Optional data: pitch accent, frequency, known words DB, blacklist/whitelist paths and toggles
- History/analytics: DB paths, enable flags
- Performance: max parallel workers (default 6)
The `__post_init__` method uses `object.__setattr__` to convert string paths to `Path` objects (required because the dataclass is frozen). New config instances are created with `dataclasses.replace()`.
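The frozen-dataclass path-coercion trick looks roughly like this (field names below are illustrative, not the real `AnkiMinerConfig` fields):

```python
from dataclasses import dataclass, replace
from pathlib import Path

@dataclass(frozen=True)
class ConfigSketch:
    deck_name: str = "Mining"
    temp_folder: Path = Path("anki_miner_temp")

    def __post_init__(self):
        # frozen=True blocks normal attribute assignment, so the
        # coercion must deliberately bypass it with object.__setattr__
        if isinstance(self.temp_folder, str):
            object.__setattr__(self, "temp_folder", Path(self.temp_folder))

cfg = ConfigSketch(temp_folder="work/tmp")   # a str is coerced to Path
assert isinstance(cfg.temp_folder, Path)

# New instances come from dataclasses.replace(), never from mutation:
cfg2 = replace(cfg, deck_name="Anime")
print(cfg2.deck_name)  # Anime
```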
Config source:
- GUI: `GUIConfigManager` (`gui/utils/config_manager.py`) persists to `~/.anki_miner/gui_config.json`. Defaults come from the `AnkiMinerConfig` dataclass field defaults.
MainWindow contains a QTabWidget with five tabs:
- SingleEpisodeTab: file selectors (drag-and-drop), subtitle offset control, process/preview buttons, log widget, progress widget.
- BatchProcessingTab: folder selection, `BatchQueue` management via queue panel, dual progress bars.
- YouTubeTab (`gui/widgets/youtube_tab.py`): URL input, Fetch Info button, metadata preview, auto-caption warning + explicit "Accept auto-captions" button, Mine button, progress bar. Deck/note-type/tags widgets live in the tab (not pulled from global settings) so YouTube and file-based mining can target different decks.
- AnalyticsTab: mining statistics dashboard (queries `StatsService`).
- SettingsTab: config editing with sub-panels (Anki, media, dictionary, filtering, YouTube). Emits a `config_changed` signal.
CancellableWorker base class (QThread + threading.Event) provides:
- Thread-safe cancellation via `cancel()`/`is_cancelled()`
- Qt signals for results, errors, and progress
Worker implementations:
- `EpisodeWorkerThread`: runs `EpisodeProcessor.process_episode()` in the background.
- `BatchQueueWorkerThread`: processes batch queue items sequentially.
- `ManualPairWorkerThread`: processes manually paired files.
- `ValidationWorkerThread`: runs system validation checks.
- `UpdateWorkerThread`: checks for updates.
- `YouTubeProbeWorker` (`gui/workers/youtube_probe_worker.py`): short-lived QThread that calls `YouTubeFetcherService.probe_metadata` so the GUI stays responsive during network I/O.
- `YouTubeWorkerThread` (`gui/workers/youtube_worker.py`): `CancellableWorker` subclass. Allocates a fresh workspace under `media_temp_folder/youtube/run-<uuid>/`, calls `EpisodeProcessor.process_youtube_url(...)`, and deletes the workspace on every exit path (success, cancel, error) via `shutil.rmtree` in `finally`. Threads its `threading.Event` through to the fetcher so cancel can kill the yt-dlp process tree.
GUIPresenter emits Qt signals from worker threads. Main window slots receive them on the GUI thread. Per-tab presenters avoid cross-tab signal pollution. GUIProgressCallback bridges the ProgressCallback protocol to Qt signals.
GUIPresenter does not explicitly inherit from PresenterProtocol. It satisfies the protocol via structural subtyping, which avoids a metaclass conflict between QObject and Protocol.
Theme singleton backed by JSON theme files in gui/resources/styles/themes/. Four built-in themes: Light, Dark, Sakura, and Tokyo Night. The discover_themes() function scans the themes directory at startup, validates each JSON file against a required color key schema (REQUIRED_COLOR_KEYS), and registers valid themes. A single common.qss stylesheet uses ${color-*} variable substitution. The Theme._substitute_variables() method merges layout variables from _variables.py with color variables extracted from the active theme JSON. Custom themes can be added by dropping a valid JSON file into the themes directory. Theme preference is saved via QSettings.
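The `${color-*}` substitution amounts to a simple templating pass over the stylesheet. A sketch (the key names and QSS snippet are illustrative; the real method also merges layout variables from `_variables.py`):

```python
import re

def substitute_variables(qss: str, variables: dict[str, str]) -> str:
    """Replace ${name} placeholders in a stylesheet with theme values."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            # mirrors schema validation: a theme must define every key it uses
            raise KeyError(f"theme is missing variable: {name}")
        return variables[name]
    # [\w-]+ allows hyphenated names like color-background
    return re.sub(r"\$\{([\w-]+)\}", replace, qss)

qss = "QWidget { background: ${color-background}; color: ${color-text}; }"
theme = {"color-background": "#1a1b26", "color-text": "#c0caf5"}
print(substitute_variables(qss, theme))
# QWidget { background: #1a1b26; color: #c0caf5; }
```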
- WordCurationDialog: user selects which discovered words to mine (cross-thread via a `threading.Event` bridge).
- WordPreviewDialog: preview discovered words.
- PairPreviewDialog: preview video/subtitle file pairing.
- ResultsDialog: summary of a mining session with undo option.
- ExportDialog: export results to file.
- QueueManagerDialog: manage the batch processing queue.
HTTP POST to localhost:8765 (configurable). Protocol version 6. Key actions:
- `version`, `deckNames`, `modelNames`, `modelFieldNames`: validation
- `findNotes`, `notesInfo`: vocabulary lookup
- `storeMediaFile`: upload screenshots/audio
- `addNote`, `addNotes`: card creation (batch size 50)
- `deleteNotes`: undo support
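Every AnkiConnect call shares one envelope shape. A minimal sketch using only the stdlib (AnkiService presumably builds similar payloads; `invoke` requires Anki running with the AnkiConnect add-on):

```python
import json
import urllib.request

ANKI_CONNECT_URL = "http://localhost:8765"

def build_request(action: str, **params) -> bytes:
    """Build an AnkiConnect protocol-version-6 request body."""
    return json.dumps({"action": action, "version": 6, "params": params}).encode("utf-8")

def invoke(action: str, **params):
    """Send a request to AnkiConnect and unwrap {result, error}."""
    req = urllib.request.Request(
        ANKI_CONNECT_URL,
        data=build_request(action, **params),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        payload = json.loads(resp.read())
    if payload.get("error"):
        raise RuntimeError(payload["error"])
    return payload["result"]

# Payload shape only -- no network needed:
body = json.loads(build_request("findNotes", query="deck:Mining"))
print(body)  # {'action': 'findNotes', 'version': 6, 'params': {'query': 'deck:Mining'}}
```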
GET https://jisho.org/api/v1/search/words?keyword=<word>. Rate-limited with configurable delay (default 0.5s). Used as fallback when JMdict is unavailable.
- ffmpeg: `-ss` seek + `-i` input + `-frames:v 1` for screenshots, `libmp3lame` for audio extraction
- ffprobe: `-show_streams -select_streams a` for Japanese audio track detection
- Parallel execution via `ThreadPoolExecutor` (default 6 workers)
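The screenshot command and its parallelization can be sketched as follows (command construction only; paths, flags beyond those listed above, and output naming are illustrative):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def screenshot_cmd(video: str, timestamp: float, out_png: str) -> list[str]:
    # -ss before -i seeks fast (keyframe-based); -frames:v 1 grabs one frame
    return ["ffmpeg", "-y", "-ss", f"{timestamp:.3f}", "-i", video,
            "-frames:v", "1", out_png]

def extract_screenshots(video: str, timestamps: list[float], max_workers: int = 6):
    """Run one ffmpeg process per timestamp, max_workers at a time.

    Requires ffmpeg on PATH; each ffmpeg run is I/O- and subprocess-bound,
    which is why a thread pool (not a process pool) is sufficient.
    """
    def run(job):
        i, ts = job
        subprocess.run(screenshot_cmd(video, ts, f"shot_{i:04d}.png"),
                       check=True, capture_output=True)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(run, enumerate(timestamps)))

print(screenshot_cmd("ep01.mkv", 12.5, "shot.png"))
```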
Subprocess invoked by YouTubeFetcherService. Probe uses `--skip-download --dump-single-json --no-playlist`; fetch uses `--write-sub` (or `--write-auto-sub` for auto-caption mode) plus `--sub-lang ja --sub-format vtt/best --convert-subs srt` and a height-capped format string. Progress is parsed from a custom `--progress-template`; post-download merge phases are detected by scanning for `[Merger]`/`[SubtitleConvertor]` line signatures. The process tree is killed via `psutil` on cancel (yt-dlp spawns ffmpeg as a child for merging; `Popen.terminate()` alone leaks it on Windows). An optional `--cookies-from-browser` flag enables access past bot-detection prompts and to age-restricted content.
yt-dlp lazy-loads ~1600 extractor modules plus optional deps (websockets, mutagen, brotli) that PyInstaller's static analysis misses. `PyInstaller-Hooks/hook-yt_dlp.py` calls `collect_all("yt_dlp")` and the release workflow passes `--additional-hooks-dir=PyInstaller-Hooks`. The release workflow's bundled smoke step (`ANKI_MINER_SMOKE=youtube` env var in `anki_miner/gui/app.py`) walks `yt_dlp.extractor.gen_extractors()` offline to verify the registry survived `collect_all`.
AnkiMinerException (base)
├── ValidationError
├── SetupError
├── AnkiConnectionError
├── DeckNotFoundError
├── NoteTypeNotFoundError
├── CardCreationError
├── MediaExtractionError
├── SubtitleParseError
└── FFmpegError
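Because the hierarchy is flat under one base, a single `except AnkiMinerException` handler catches every domain error. A minimal sketch showing the pattern (only a subset of the subclasses above is reproduced):

```python
class AnkiMinerException(Exception):
    """Base class for all Anki Miner errors."""

class AnkiConnectionError(AnkiMinerException):
    """AnkiConnect is unreachable."""

class FFmpegError(AnkiMinerException):
    """An ffmpeg/ffprobe invocation failed."""

try:
    raise FFmpegError("ffmpeg not found on PATH")
except AnkiMinerException as exc:
    # One handler covers every domain-specific failure
    print(type(exc).__name__, "-", exc)
```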
All persistent user data under ~/.anki_miner/:
| File | Format | Purpose |
|---|---|---|
| `gui_config.json` | JSON | GUI configuration persistence |
| `JMdict_e` | XML | Offline Japanese-English dictionary (~60MB) |
| `known_words.db` | SQLite | Known word cache with Anki sync |
| `history.db` | SQLite | Mining history with undo support |
| `stats.db` | SQLite | Analytics (sessions, difficulty, milestones) |
| `pitch_accent.csv` | CSV | Pitch accent lookup data |
| `frequency.csv` | CSV | Word frequency rankings |
Temporary media files are stored in the system temp directory under `anki_miner_temp/` and cleaned up after each processing run. YouTube downloads go one level deeper, into `anki_miner_temp/youtube/run-<uuid>/`, which is owned by `YouTubeWorkerThread` and `rmtree`'d on every exit path (success, cancel, exception).