---
description: Automated documentation pipeline for llm-code-docs. Discovers, fetches, cleans, reviews, and commits upstream docs for 11K+ libraries.
category: personal
---
flowchart TD
_HEADER_["
auto-doc Pipeline
Automated documentation pipeline for llm-code-docs. Discovers, fetches, cleans, reviews, and commits upstream docs for 11K+ libraries.
"]:::headerStyle classDef headerStyle fill:none,stroke:none subgraph _MAIN_[" "] subgraph Plow["Batch Mode (plow)"] QUEUE[Pull Next Ticket] --> EXTRACT[Extract Library Name] EXTRACT --> SET_IP[Set Ticket In-Progress] end subgraph Discover["Stage 0: Parallel Discovery"] SET_IP --> DOC_IDX_PRE[doc-index Lookup
138K libraries]
DOC_IDX_PRE --> PARALLEL{8 Parallel Workers}
PARALLEL --> PYPI[PyPI Registry]
PARALLEL --> NPM[npm Registry]
PARALLEL --> CRATES[crates.io Registry]
PARALLEL --> GH_SEARCH[GitHub Search
Top 25 by Stars]
PARALLEL --> TAVILY[Tavily Web Search
Grounding]
PARALLEL --> DOMAIN_GUESS[Domain Guessing
HEAD-probe .dev/.io/.org]
PARALLEL --> UNI_IDX[unified-index
5.14M packages]
PARALLEL --> ENUM_DOM[Enumerated Domains
6.8K sites]
PYPI --> MERGE_DISC[Merge + Score
Best Guess]
NPM --> MERGE_DISC
CRATES --> MERGE_DISC
GH_SEARCH --> MERGE_DISC
TAVILY --> MERGE_DISC
DOMAIN_GUESS --> MERGE_DISC
UNI_IDX --> MERGE_DISC
ENUM_DOM --> MERGE_DISC
end
subgraph LlmsProbe["llms.txt + GitHub Assessment"]
MERGE_DISC --> LLMS_TXT[llms.txt Probing
All Discovered Domains]
MERGE_DISC --> GH_ASSESS[GitHub Assessment
Doc File Count + Fork Check]
MERGE_DISC --> STACK_V2[stack-v2 Enrichment
859K repos]
end
subgraph Decide["Stage 1: LLM Source Selection"]
LLMS_TXT --> GATHER[Gather All
Candidate Sources]
GH_ASSESS --> GATHER
STACK_V2 --> GATHER
GATHER --> CAND_IDX[doc-index hits]
GATHER --> CAND_LLMS[Valid llms.txt
>5KB, score>=75]
GATHER --> CAND_GH[GitHub repos
with docs]
GATHER --> CAND_LANG[Language hosts
docs.rs / gomarkdoc /
hexdocs / metacpan]
GATHER --> CAND_HP[GitHub repo
homepage]
CAND_IDX --> LLM_DECIDE[MiniMax-M2.5
Arbiter]
CAND_LLMS --> LLM_DECIDE
CAND_GH --> LLM_DECIDE
CAND_LANG --> LLM_DECIDE
CAND_HP --> LLM_DECIDE
LLM_DECIDE -->|Picks best| LLM_RESULT[Selected Source]
LLM_DECIDE -->|No viable source| SKIP[Skip]
end
subgraph Fetch["Stage 2: Fetch"]
LLM_RESULT --> FETCH_ROUTE{Route by
Source Type}
FETCH_ROUTE -->|llms-txt| FETCH_LLMS[llms-txt-scraper.py
Bulk Download]
FETCH_ROUTE -->|github| FETCH_GH[Shallow Clone
Extract docs/]
FETCH_ROUTE -->|go-native| FETCH_GO[gomarkdoc
Generate from Source]
FETCH_ROUTE -->|web| FETCH_WEB[Jina Reader
Crawl + Convert to MD]
end
subgraph Clean["Stage 3: Clean"]
FETCH_LLMS --> MDLINT[markdownlint --fix
Batch of 50 Files]
FETCH_GH --> MDLINT
FETCH_GO --> MDLINT
FETCH_WEB --> MDLINT
end
subgraph Review["Stage 4: Review"]
MDLINT --> REVIEW_LLM[MiniMax-M2.5 Review
Sample 10 Files, 3 Full]
REVIEW_LLM --> REVIEW_CHECK{Pass?}
REVIEW_CHECK -->|Identity OK
Substance OK
Encoding OK
No Duplicates
Good Coverage
Complete| COMMIT_STAGE[Proceed to Commit]
REVIEW_CHECK -->|Fail| REVERT[Delete Docs + Revert Config]
end
subgraph Commit["Stage 5: Commit"]
COMMIT_STAGE --> GIT_ADD[git add docs/ + configs]
GIT_ADD --> GIT_COMMIT[git commit + push master]
end
subgraph Outcome["Ticket Resolution"]
GIT_COMMIT --> DONE[Ticket Done
SHA in Comment]
REVERT --> IN_REVIEW[Ticket In-Review
Error in Comment]
SKIP --> IN_REVIEW
DONE --> NEXT{More Tickets?}
IN_REVIEW --> NEXT
NEXT -->|Yes| QUEUE
NEXT -->|No| FINISHED[Batch Complete]
end
click QUEUE "#" "**Pull Next Ticket**\nQuery trckr for next 'todo' ticket in the DOCS project.\n`trckr ticket list --project DOCS --status todo --limit 1`"
click EXTRACT "#" "**Extract Library Name**\nParse library name from ticket title.\nTitle format: 'Add docs for '"
click SET_IP "#" "**Set Ticket In-Progress**\n`trckr ticket update DOCS-NNN --status in-progress`\nPrevents other workers from picking up the same ticket."
click DOC_IDX_PRE "#" "**doc-index Lookup**\nSQLite DB with 138K+ libraries from 11 platform adapters:\nDevDocs, Dash, devhints, APIs.guru, HexDocs, pub.dev,\nHackage, SwiftPkg, Go modules, ReadTheDocs, cljdoc.\nRuns first (sequential) before parallel workers launch.\n`python3 scripts/doc_index.py lookup-cmd `"
click PARALLEL "#" "**8 Parallel Workers**\nAll 8 discovery streams run simultaneously via ThreadPoolExecutor.\nTavily always runs (even with index hits) for grounding."
click PYPI "#" "**PyPI Registry**\nQuery PyPI JSON API for package metadata.\nExtracts homepage, project URLs, repo link.\nFree, unauthenticated."
click NPM "#" "**npm Registry**\nQuery npm registry for package metadata.\nExtracts homepage, repository, docs URL.\nFree, unauthenticated."
click CRATES "#" "**crates.io Registry**\nQuery crates.io API for crate metadata.\nExtracts homepage, repository, documentation URL.\nFree, unauthenticated."
click GH_SEARCH "#" "**GitHub Search**\nSearch GitHub for top 25 repos matching the library name.\nUses `gh api search/repositories` with sort by stars.\nReturns: name, stars, description, homepage, language."
click TAVILY "#" "**Tavily Web Search**\nAlways runs for grounding, even with index hits.\nQuery: library name + 'documentation site official'\nProvides real-time verification that URLs are alive.\nResults cached in `~/gitea/auto-doc/cache/tavily/`."
click DOMAIN_GUESS "#" "**Domain Guessing**\nHEAD-probe common TLD patterns:\n`.dev`, `.io`, `.org`, `.com`\nFilter: library name must appear as a dot-separated domain segment."
click UNI_IDX "#" "**unified-index Lookup**\n5.14M packages across 53 registries.\nReturns `docs_url`, `homepage`, `repository_url` per package.\nBuilt from the reverse-dns-indexer project.\nConfidence score: 65-75 depending on field."
click ENUM_DOM "#" "**Enumerated Domains**\n6,808 known doc-hosting domains across 7 platforms:\n- GitBook, ReadMe, Mintlify, Fern\n- Writerside, Archbee, Zudoku\n- Plus custom validated domains\nChecks if the library name is a substring of any domain."
click MERGE_DISC "#" "**Merge + Score Best Guess**\nConfidence-weighted scoring selects `best_guess`:\n- GitHub stars: 50 + stars/1000 (max 100)\n- Registry github link: 70\n- Registry domain: 40\n- Tavily: 60\n- Unified index: 65-75\n- Enumerated domains: 55\n- Domain guess: 20\nCross-validates GitHub homepage against domain candidates."
click LLMS_TXT "#" "**llms.txt Probing**\nCheck all discovered domains for a `/llms.txt` endpoint.\nReturns LLM-optimized markdown when available.\nScored by size and content quality (threshold: >5KB, score>=75).\nOnly relevant domains checked (registry-backed or name-matched)."
click GH_ASSESS "#" "**GitHub Assessment**\nFor the best-guess GitHub repo, checks:\n- Doc file count (threshold: >5)\n- Has markdown content\n- Is not a fork\n- Docs folder location\nResults feed the GitHub repo candidate list."
click STACK_V2 "#" "**stack-v2 Enrichment**\nDuckDB scan of Stack V2 parquet data (859K GitHub repos).\nReturns which doc platform a repo uses\n(MkDocs, Sphinx, Docusaurus, etc.).\nInforms the fetch strategy, not the decision."
click GATHER "#" "**Gather All Candidate Sources**\nCollects every viable source into a structured list.\nNo filtering or ranking yet -- just enumeration.\nEach candidate has: type, source, url, detail.\nReplaces the old Rule 0-3 heuristic cascade."
click CAND_IDX "#" "**doc-index Candidates**\nPre-computed URLs from 138K+ library index.\nMaps doc_type to source type:\n- llms-txt type -> llms-txt source\n- html/api-spec -> web source"
click CAND_LLMS "#" "**llms.txt Candidates**\nValid llms.txt files from probed domains:\n- HTTP 200, not a stub\n- Size >5KB\n- Validation score >= 75\nIncludes size, score, heading count as detail."
click CAND_GH "#" "**GitHub Repo Candidates**\nRepos with substantial documentation:\n- >5 doc files, has markdown, not a fork\nIncludes doc file count and folder location."
click CAND_LANG "#" "**Language Host Candidates**\nDeterministic URLs by ecosystem:\n- Rust: docs.rs\n- Go: gomarkdoc (or homepage if real doc site)\n- Elixir: hexdocs.pm\n- Perl: metacpan.org\n- And more (pub.dev, Hackage, RubyGems, etc.)"
click CAND_HP "#" "**GitHub Homepage Candidate**\nThe best-guess repo's homepage field.\nOften points to a dedicated docs site.\nFiltered against NOISE_DOMAINS list."
click LLM_DECIDE "#" "**MiniMax-M2.5 Arbiter**\nThe LLM makes the final judgement call.\nReceives ALL candidates + full discovery context:\n- Top 5 GitHub repos with stars\n- Tavily search results + AI answer\n- Unified index matches (5.14M packages)\n- Enumerated doc domains\n- Stack V2 doc platforms\n- Registry data\n\nHandles fuzzy cases heuristics miss\n(e.g. SvelteKit -> svelte.dev).\nPicks a single best source + one-sentence reason.\nFree (MiniMax via `clauded-mm -p`)."
click LLM_RESULT "#" "**Selected Source**\nLLM picked the best candidate.\nReturns JSON: source type, url, reason.\nSource type must match a candidate exactly:\nllms-txt, github, go-native, web, or skip."
click SKIP "#" "**Skip**\nLLM determined no viable documentation source.\nTicket goes to in-review for human triage."
click FETCH_ROUTE "#" "**Route by Source Type**\nFour fetch strategies, each optimized for its content format:\n- llms-txt: bulk scraper\n- github: shallow clone\n- go-native: gomarkdoc generation\n- web: Jina Reader crawl"
click FETCH_LLMS "#" "**llms-txt Fetch**\nBulk download via `llms-txt-scraper.py`.\nHandles pagination, nested sections, and\nmarkdown formatting from llms.txt endpoints."
click FETCH_GH "#" "**GitHub Fetch**\nShallow clone (`git clone --depth 1`).\nExtract docs/ folder contents.\nRespects .gitignore, skips binary files."
click FETCH_GO "#" "**Go Module Fetch**\n`gomarkdoc` generates documentation from Go source.\nProduces well-structured markdown with\npackage docs, function signatures, examples.\nModule path derived from library name or GitHub repo."
click FETCH_WEB "#" "**Jina Reader Fetch**\nCrawls docs via `https://r.jina.ai/` API:\n1. Fetch index page, extract links matching docs prefix\n2. Crawl up to 100 pages (20MB total cap)\n3. Convert each page to markdown\n4. Rate limited: 3.5s between requests (Jina free tier)\n5. Min content: 500 bytes per page\nReplaced MiniMax scraper generation (~8% success)\nwith the deterministic Jina Reader approach."
click MDLINT "#" "**markdownlint --fix**\nBatches of 50 files processed via:\n`npx markdownlint-cli --fix`\nProject-standard disabled rules applied.\nFixes heading levels, trailing whitespace, etc."
click REVIEW_LLM "#" "**MiniMax-M2.5 Content Review**\nSamples up to 10 files (prioritizing largest by size).\nReads the 3 largest in depth (first 2000 chars each).\nChecks 6 criteria via `clauded-mm -p`:\n- Identity: correct library?\n- Substance: real docs, not nav cruft?\n- Encoding: no mojibake?\n- Duplication: near-identical files?\n- Coverage: sufficient completeness?\n- Completeness: obvious gaps or broken refs?\nCan flag specific files for deletion."
click REVIEW_CHECK "#" "**Review Gate**\nMandatory -- no bypass. All 6 criteria must pass.\nReturns JSON with pass/fail, issues list,\nfiles_to_delete, and summary.\nZero false positives across 125+ processed libraries."
click COMMIT_STAGE "#" "**Proceed to Commit**\nReview passed all 6 criteria.\nContent is verified upstream documentation."
click REVERT "#" "**Revert on Failure**\n1. Delete flagged files (from review response)\n2. `shutil.rmtree()` the entire docs directory\n3. `git checkout -- scripts/llms-sites.yaml scripts/repo_config.yaml`\nTicket goes to in-review with error details."
click GIT_ADD "#" "**git add**\nStage docs directory and config files:\n- `docs///`\n- `scripts/llms-sites.yaml` or `scripts/repo_config.yaml`"
click GIT_COMMIT "#" "**git commit + push**\nCommit with message: 'Add docs ()'\nPush directly to master.\n`_git_commit_and_push()` handles both operations."
click DONE "#" "**Ticket Done**\nUpdate trckr ticket to 'done' status.\nAdd comment with commit SHA, source type, file count.\n`trckr ticket update DOCS-NNN --status done`"
click IN_REVIEW "#" "**Ticket In-Review**\nUpdate trckr ticket to 'in-review' status.\nAdd comment with error details for human triage.\nPipeline moves on to the next ticket."
click NEXT "#" "**More Tickets?**\nCheck if batch limit reached or queue empty.\nSerial processing -- one library at a time.\nIndividual failures don't stop the batch."
click FINISHED "#" "**Batch Complete**\nAll tickets in batch processed.\nBatch performance: ~82% success rate across 125+ libraries.\nNew-library rate ~75-80% (rest are duplicates)."
classDef batch fill:#e8daef,stroke:#b07cc6
classDef discover fill:#d1ecf1,stroke:#7ec8d8
classDef assess fill:#d1ecf1,stroke:#7ec8d8
classDef decision fill:#fff3cd,stroke:#f0c040
classDef fetch fill:#ffeaa7,stroke:#e0c040
classDef clean fill:#ffeaa7,stroke:#e0c040
classDef review fill:#fff3cd,stroke:#f0c040
classDef success fill:#d4edda,stroke:#5cb85c
classDef failure fill:#f8d7da,stroke:#e06070
classDef merge fill:#d1ecf1,stroke:#7ec8d8
class QUEUE,EXTRACT,SET_IP batch
class DOC_IDX_PRE,PYPI,NPM,CRATES,GH_SEARCH,TAVILY,DOMAIN_GUESS,UNI_IDX,ENUM_DOM,MERGE_DISC discover
class LLMS_TXT,GH_ASSESS,STACK_V2 assess
class PARALLEL,GATHER,LLM_DECIDE,REVIEW_CHECK,NEXT decision
class CAND_IDX,CAND_LLMS,CAND_GH,CAND_LANG,CAND_HP,LLM_RESULT merge
class FETCH_ROUTE,FETCH_LLMS,FETCH_GH,FETCH_GO,FETCH_WEB fetch
class MDLINT clean
class REVIEW_LLM review
class COMMIT_STAGE,GIT_ADD,GIT_COMMIT success
class DONE,FINISHED success
class SKIP,REVERT,IN_REVIEW failure
end
style _MAIN_ fill:none,stroke:none,padding:0
_HEADER_ ~~~ _MAIN_