---
description: Automated documentation pipeline for llm-code-docs. Discovers, fetches, cleans, reviews, and commits upstream docs for 11K+ libraries.
category: personal
---
flowchart TD
_HEADER_["
auto-doc Pipeline
Automated documentation pipeline for llm-code-docs. Discovers, fetches, cleans, reviews, and commits upstream docs for 11K+ libraries.
"]:::headerStyle
classDef headerStyle fill:none,stroke:none
subgraph _MAIN_[" "]
subgraph Plow["Batch Mode (plow)"]
QUEUE[Pull Next Ticket] --> EXTRACT[Extract Library Name]
EXTRACT --> SET_IP[Set Ticket In-Progress]
end
subgraph Discover["Stage 0: Parallel Discovery"]
SET_IP --> DOC_IDX_PRE[doc-index Lookup
138K libraries]
DOC_IDX_PRE --> PARALLEL{8 Parallel Workers}
PARALLEL --> PYPI[PyPI Registry]
PARALLEL --> NPM[npm Registry]
PARALLEL --> CRATES[crates.io Registry]
PARALLEL --> GH_SEARCH[GitHub Search
Top 25 by Stars]
PARALLEL --> TAVILY[Tavily Web Search
Grounding]
PARALLEL --> DOMAIN_GUESS[Domain Guessing
HEAD-probe .dev/.io/.org]
PARALLEL --> UNI_IDX[unified-index
5.14M packages]
PARALLEL --> ENUM_DOM[Enumerated Domains
6.8K sites]
PYPI --> MERGE_DISC[Merge + Score
Best Guess]
NPM --> MERGE_DISC
CRATES --> MERGE_DISC
GH_SEARCH --> MERGE_DISC
TAVILY --> MERGE_DISC
DOMAIN_GUESS --> MERGE_DISC
UNI_IDX --> MERGE_DISC
ENUM_DOM --> MERGE_DISC
end
subgraph LlmsProbe["llms.txt + GitHub Assessment"]
MERGE_DISC --> LLMS_TXT[llms.txt Probing
All Discovered Domains]
MERGE_DISC --> GH_ASSESS[GitHub Assessment
Doc File Count + Fork Check]
MERGE_DISC --> STACK_V2[stack-v2 Enrichment
859K repos]
end
subgraph Decide["Stage 1: LLM Source Selection"]
LLMS_TXT --> GATHER[Gather All
Candidate Sources]
GH_ASSESS --> GATHER
STACK_V2 --> GATHER
GATHER --> CAND_IDX[doc-index hits]
GATHER --> CAND_LLMS[Valid llms.txt
>5KB, score>=75]
GATHER --> CAND_GH[GitHub repos
with docs]
GATHER --> CAND_LANG[Language hosts
docs.rs / gomarkdoc /
hexdocs / metacpan]
GATHER --> CAND_HP[GitHub repo
homepage]
CAND_IDX --> LLM_DECIDE[MiniMax-M2.5
Arbiter]
CAND_LLMS --> LLM_DECIDE
CAND_GH --> LLM_DECIDE
CAND_LANG --> LLM_DECIDE
CAND_HP --> LLM_DECIDE
LLM_DECIDE -->|Picks best| LLM_RESULT[Selected Source]
LLM_DECIDE -->|No viable source| SKIP[Skip]
end
subgraph Fetch["Stage 2: Fetch"]
LLM_RESULT --> FETCH_ROUTE{Route by
Source Type}
FETCH_ROUTE -->|llms-txt| FETCH_LLMS[llms-txt-scraper.py
Bulk Download]
FETCH_ROUTE -->|github| FETCH_GH[Shallow Clone
Extract docs/]
FETCH_ROUTE -->|go-native| FETCH_GO[gomarkdoc
Generate from Source]
FETCH_ROUTE -->|web| FETCH_WEB[Jina Reader
Crawl + Convert to MD]
end
subgraph Clean["Stage 3: Clean"]
FETCH_LLMS --> MDLINT[markdownlint --fix
Batch of 50 Files]
FETCH_GH --> MDLINT
FETCH_GO --> MDLINT
FETCH_WEB --> MDLINT
end
subgraph Review["Stage 4: Review"]
MDLINT --> REVIEW_LLM[MiniMax-M2.5 Review
Sample 10 Files, 3 Full]
REVIEW_LLM --> REVIEW_CHECK{Pass?}
REVIEW_CHECK -->|Identity OK
Substance OK
Encoding OK
No Duplicates
Good Coverage
Complete| COMMIT_STAGE[Proceed to Commit]
REVIEW_CHECK -->|Fail| REVERT[Delete Docs + Revert Config]
end
subgraph Commit["Stage 5: Commit"]
COMMIT_STAGE --> GIT_ADD[git add docs/ + configs]
GIT_ADD --> GIT_COMMIT[git commit + push master]
end
subgraph Outcome["Ticket Resolution"]
GIT_COMMIT --> DONE[Ticket Done
SHA in Comment]
REVERT --> IN_REVIEW[Ticket In-Review
Error in Comment]
SKIP --> IN_REVIEW
DONE --> NEXT{More Tickets?}
IN_REVIEW --> NEXT
NEXT -->|Yes| QUEUE
NEXT -->|No| FINISHED[Batch Complete]
end
click QUEUE "#" "**Pull Next Ticket**\nQuery trckr for next 'todo' ticket in the DOCS project.\n`trckr ticket list --project DOCS --status todo --limit 1`"
click EXTRACT "#" "**Extract Library Name**\nParse library name from ticket title.\nTitle format: 'Add docs for <library>'"
click SET_IP "#" "**Set Ticket In-Progress**\n`trckr ticket update DOCS-NNN --status in-progress`\nPrevents other workers from picking up the same ticket."
click DOC_IDX_PRE "#" "**doc-index Lookup**\nSQLite DB with 138K+ libraries from 11 platform adapters:\nDevDocs, Dash, devhints, APIs.guru, HexDocs, pub.dev,\nHackage, SwiftPkg, Go modules, ReadTheDocs, cljdoc.\nRuns first (sequential) before parallel workers launch.\n`python3 scripts/doc_index.py lookup-cmd <library>`"
click PARALLEL "#" "**8 Parallel Workers**\nAll 8 discovery streams run simultaneously via ThreadPoolExecutor.\nTavily always runs (even with index hits) for grounding."
click PYPI "#" "**PyPI Registry**\nQuery PyPI JSON API for package metadata.\nExtracts homepage, project URLs, repo link.\nFree, unauthenticated."
click NPM "#" "**npm Registry**\nQuery npm registry for package metadata.\nExtracts homepage, repository, docs URL.\nFree, unauthenticated."
click CRATES "#" "**crates.io Registry**\nQuery crates.io API for crate metadata.\nExtracts homepage, repository, documentation URL.\nFree, unauthenticated."
click GH_SEARCH "#" "**GitHub Search**\nSearch GitHub for top 25 repos matching the library name.\nUses `gh api search/repositories` with sort by stars.\nReturns: name, stars, description, homepage, language."
click TAVILY "#" "**Tavily Web Search**\nAlways runs for grounding, even with index hits.\nQuery: library name + 'documentation site official'\nProvides real-time verification that URLs are alive.\nResults cached in `~/gitea/auto-doc/cache/tavily/`."
click DOMAIN_GUESS "#" "**Domain Guessing**\nHEAD-probe common TLD patterns:\n`.dev`, `.io`, `.org`, `.com`\nFilters: library name must appear as dot-separated domain segment."
click UNI_IDX "#" "**unified-index Lookup**\n5.14M packages across 53 registries.\nReturns `docs_url`, `homepage`, `repository_url` per package.\nBuilt from reverse-dns-indexer project.\nConfidence score: 65-75 depending on field."
click ENUM_DOM "#" "**Enumerated Domains**\n6,808 known doc-hosting domains across 7 platforms:\n- GitBook, ReadMe, Mintlify, Fern\n- Writerside, Archbee, Zudoku\n- Plus custom validated domains\nChecks if library name is a substring of any domain."
click MERGE_DISC "#" "**Merge + Score Best Guess**\nConfidence-weighted scoring selects `best_guess`:\n- GitHub stars: 50 + stars/1000 (max 100)\n- Registry github link: 70\n- Registry domain: 40\n- Tavily: 60\n- Unified index: 65-75\n- Enumerated domains: 55\n- Domain guess: 20\nCross-validates GitHub homepage against domain candidates."
click LLMS_TXT "#" "**llms.txt Probing**\nCheck all discovered domains for `/llms.txt` endpoint.\nReturns LLM-optimized markdown when available.\nScored by size and content quality (threshold: >5KB, score>=75).\nOnly relevant domains checked (registry-backed or name-matched)."
click GH_ASSESS "#" "**GitHub Assessment**\nFor best-guess GitHub repo, checks:\n- Doc file count (threshold: >5)\n- Has markdown content\n- Is not a fork\n- Docs folder location\nResults used by Rule 3."
click STACK_V2 "#" "**stack-v2 Enrichment**\nDuckDB scan of Stack V2 parquet data (859K GitHub repos).\nReturns which doc platform a repo uses\n(MkDocs, Sphinx, Docusaurus, etc.).\nInforms fetch strategy, not decision."
click GATHER "#" "**Gather All Candidate Sources**\nCollects every viable source into a structured list.\nNo filtering or ranking yet -- just enumeration.\nEach candidate has: type, source, url, detail.\nReplaces the old Rule 0-3 heuristic cascade."
click CAND_IDX "#" "**doc-index Candidates**\nPre-computed URLs from 138K+ library index.\nMaps doc_type to source type:\n- llms-txt type -> llms-txt source\n- html/api-spec -> web source"
click CAND_LLMS "#" "**llms.txt Candidates**\nValid llms.txt files from probed domains:\n- HTTP 200, not a stub\n- Size >5KB\n- Validation score >= 75\nIncludes size, score, heading count as detail."
click CAND_GH "#" "**GitHub Repo Candidates**\nRepos with substantial documentation:\n- >5 doc files, has markdown, not a fork\nIncludes doc file count and folder location."
click CAND_LANG "#" "**Language Host Candidates**\nDeterministic URLs by ecosystem:\n- Rust: docs.rs\n- Go: gomarkdoc (or homepage if real doc site)\n- Elixir: hexdocs.pm\n- Perl: metacpan.org\n- And more (pub.dev, Hackage, RubyGems, etc.)"
click CAND_HP "#" "**GitHub Homepage Candidate**\nThe best-guess repo's homepage field.\nOften points to a dedicated docs site.\nFiltered against NOISE_DOMAINS list."
click LLM_DECIDE "#" "**MiniMax-M2.5 Arbiter**\nThe LLM makes the final judgement call.\nReceives ALL candidates + full discovery context:\n- Top 5 GitHub repos with stars\n- Tavily search results + AI answer\n- Unified index matches (5.14M packages)\n- Enumerated doc domains\n- Stack V2 doc platforms\n- Registry data\n\nHandles fuzzy cases heuristics miss\n(e.g. SvelteKit -> svelte.dev).\nPicks single best source + one-sentence reason.\nFree (MiniMax via `clauded-mm -p`)."
click LLM_RESULT "#" "**Selected Source**\nLLM picked the best candidate.\nReturns JSON: source type, url, reason.\nSource type must match a candidate exactly:\nllms-txt, github, go-native, web, or skip."
click SKIP "#" "**Skip**\nLLM determined no viable documentation source.\nTicket goes to in-review for human triage."
click FETCH_ROUTE "#" "**Route by Source Type**\nFour fetch strategies, each optimized for its content format:\n- llms-txt: bulk scraper\n- github: shallow clone\n- go-native: gomarkdoc generation\n- web: Jina Reader crawl"
click FETCH_LLMS "#" "**llms-txt Fetch**\nBulk download via `llms-txt-scraper.py`.\nHandles pagination, nested sections, and\nmarkdown formatting from llms.txt endpoints."
click FETCH_GH "#" "**GitHub Fetch**\nShallow clone (`git clone --depth 1`).\nExtract docs/ folder contents.\nRespects .gitignore, skips binary files."
click FETCH_GO "#" "**Go Module Fetch**\n`gomarkdoc` generates documentation from Go source.\nProduces well-structured markdown with\npackage docs, function signatures, examples.\nModule path derived from library name or GitHub repo."
click FETCH_WEB "#" "**Jina Reader Fetch**\nCrawls docs via `https://r.jina.ai/` API:\n1. Fetch index page, extract links matching docs prefix\n2. Crawl up to 100 pages (20MB total cap)\n3. Convert each page to markdown\n4. Rate limited: 3.5s between requests (Jina free tier)\n5. Min content: 500 bytes per page\nReplaced MiniMax scraper generation (~8% success)\nwith deterministic Jina Reader approach."
click MDLINT "#" "**markdownlint --fix**\nBatch of 50 files processed via:\n`npx markdownlint-cli --fix`\nProject-standard disabled rules applied.\nFixes heading levels, trailing whitespace, etc."
click REVIEW_LLM "#" "**MiniMax-M2.5 Content Review**\nSamples up to 10 files (prioritizing largest by size).\nReads full content of 3 largest (first 2000 chars each).\nChecks 6 criteria via `clauded-mm -p`:\n- Identity: correct library?\n- Substance: real docs, not nav cruft?\n- Encoding: no mojibake?\n- Duplication: near-identical files?\n- Coverage: sufficient completeness?\n- Completeness: obvious gaps or broken refs?\nCan flag specific files for deletion."
click REVIEW_CHECK "#" "**Review Gate**\nMandatory -- no bypass. All 6 criteria must pass.\nReturns JSON with pass/fail, issues list,\nfiles_to_delete, and summary.\nZero false positives across 125+ processed libraries."
click COMMIT_STAGE "#" "**Proceed to Commit**\nReview passed all 6 criteria.\nContent is verified upstream documentation."
click REVERT "#" "**Revert on Failure**\n1. Delete flagged files (from review response)\n2. `shutil.rmtree()` the entire docs directory\n3. `git checkout -- scripts/llms-sites.yaml scripts/repo_config.yaml`\nTicket goes to in-review with error details."
click GIT_ADD "#" "**git add**\nStage docs directory and config files:\n- the library's subdirectory under `docs/`\n- `scripts/llms-sites.yaml` or `scripts/repo_config.yaml`"
click GIT_COMMIT "#" "**git commit + push**\nCommit with message: 'Add docs (<library>)'\nPush directly to master.\n`_git_commit_and_push()` handles both operations."
click DONE "#" "**Ticket Done**\nUpdate trckr ticket to 'done' status.\nAdd comment with commit SHA, source type, file count.\n`trckr ticket update DOCS-NNN --status done`"
click IN_REVIEW "#" "**Ticket In-Review**\nUpdate trckr ticket to 'in-review' status.\nAdd comment with error details for human triage.\nPipeline moves on to next ticket."
click NEXT "#" "**More Tickets?**\nCheck if batch limit reached or queue empty.\nSerial processing -- one library at a time.\nIndividual failures don't stop the batch."
click FINISHED "#" "**Batch Complete**\nAll tickets in batch processed.\nBatch performance: ~82% success rate across 125+ libraries.\nNew-library rate ~75-80% (rest are duplicates)."
classDef batch fill:#e8daef,stroke:#b07cc6
classDef discover fill:#d1ecf1,stroke:#7ec8d8
classDef assess fill:#d1ecf1,stroke:#7ec8d8
classDef decision fill:#fff3cd,stroke:#f0c040
classDef fetch fill:#ffeaa7,stroke:#e0c040
classDef clean fill:#ffeaa7,stroke:#e0c040
classDef review fill:#fff3cd,stroke:#f0c040
classDef success fill:#d4edda,stroke:#5cb85c
classDef failure fill:#f8d7da,stroke:#e06070
classDef merge fill:#d1ecf1,stroke:#7ec8d8
class QUEUE,EXTRACT,SET_IP batch
class DOC_IDX_PRE,PYPI,NPM,CRATES,GH_SEARCH,TAVILY,DOMAIN_GUESS,UNI_IDX,ENUM_DOM,MERGE_DISC discover
class LLMS_TXT,GH_ASSESS,STACK_V2 assess
class PARALLEL,GATHER,LLM_DECIDE,REVIEW_CHECK,NEXT decision
class CAND_IDX,CAND_LLMS,CAND_GH,CAND_LANG,CAND_HP,LLM_RESULT merge
class FETCH_ROUTE,FETCH_LLMS,FETCH_GH,FETCH_GO,FETCH_WEB fetch
class MDLINT clean
class REVIEW_LLM review
class COMMIT_STAGE,GIT_ADD,GIT_COMMIT success
class DONE,FINISHED success
class SKIP,REVERT,IN_REVIEW failure
end
style _MAIN_ fill:none,stroke:none,padding:0
_HEADER_ ~~~ _MAIN_