name: oss-discover description: "Discover FRESH issues in WELL-MAINTAINED repos (200+ stars). Merge-optimized: 60% easy wins (docs, typos, tests) + 40% bug fixes. Target agentic AI repos by CRITERIA (topic:llm/agent/rag + stars:>200). Verify repo health before queuing." user-invocable: true

OSS Issue Discovery (Merge-Optimized)

Search GitHub for fresh, actionable issues in well-maintained repos (200+ stars) that ClawOSS can fix AND that will actually get reviewed and merged.

Philosophy

Our goal is MERGED contributions, not submitted PRs. 50 unreviewed PRs = 0 impact. A merged typo fix > an unreviewed bug fix. We optimize for merge rate. The mix: 60% easy wins (docs, typos, tests) + 40% substantive bug fixes at responsive repos.

Date Calculation

Before running queries, compute the date cutoffs:

THREE_DAYS_AGO=$(date -v-3d +%Y-%m-%d)    # macOS
TWO_WEEKS_AGO=$(date -v-14d +%Y-%m-%d)    # macOS
# Linux: date -d "3 days ago" +%Y-%m-%d

Bug queries use created:>$THREE_DAYS_AGO. Easy-win queries extend to 2 weeks. Issues older than 1 month are SKIPPED entirely.

Pre-Checks (before ANY query)

Read memory/pr-ledger.md — SKIP issues already attempted, superseded, or assigned.
For each candidate issue, quick-check supersession before scoring:
- gh api "repos/{owner}/{repo}/issues/{number}" --jq '{assignees: (.assignees | length), linked_prs: 0}'
- If issue has assignees > 0, SKIP (assigned to someone else).
- Check issue timeline for linked PRs: if open PRs exist, SKIP (already being worked on).
- Mark skipped issues as superseded or assigned in pr-ledger.md.

Trust-Building Strategy (CRITICAL for merge rate)

Depth over breadth. 3 merged PRs at one repo > 30 unreviewed PRs across 30 repos.

Check memory/trust-repos.md FIRST — search for new issues in trusted repos before broad queries.
Return to winners: If a repo merged our PR, search it for new issues immediately.
Prefer trusted repos but no hard cap on new repo discovery.
Abandon losers: If a repo closed our PR without review within 24h, skip for 30 days. Trusted repos get +8 bonus in scoring. This is the single biggest lever for merge rate.

Process

FIRST: Search trusted repos (memory/trust-repos.md) for fresh issues — these are highest priority.
Run Priority Queries (Tier 0 first, then 1, then 2) for new repo discovery.
Filter: stars >= 200, not in pr-ledger, created within time window
Repo health pre-filter (BEFORE scoring): quick-check via /Users/kevinlin/clawOSS/scripts/repo-health-check.sh or gh api. SKIP repos that fail.
Score: merge probability (most important), recency, fix feasibility, repo health. Minimum score 5. +8 trusted repo bonus.
Return ranked top 10. Write full list to memory/today.md.

Discovery Niches (rotate through ALL — the AI niche is saturated)

Diversify targets across the full open-source ecosystem. Do NOT camp on the same 10 AI repos.

Niche 1: Agentic AI (familiar territory)

Topics: topic:llm, topic:agent, topic:rag, topic:ai, topic:machine-learning
Combined with: stars:>200, label:bug or label:help-wanted

Niche 2: Developer Tools & CLIs

Topics: topic:cli, topic:devtools, topic:developer-tools, topic:terminal, topic:editor
Many responsive maintainers, fast review cycles

Niche 3: Web Frameworks & Libraries

Topics: topic:web-framework, topic:nextjs, topic:fastapi, topic:django, topic:flask, topic:express
High star counts, active communities

Niche 4: Databases & Storage

Topics: topic:database, topic:sql, topic:nosql, topic:vector-database, topic:redis
Well-maintained, clear bug reports

Niche 5: Cloud-Native & Infrastructure

Topics: topic:kubernetes, topic:docker, topic:cloud-native, topic:infrastructure
Massive ecosystem, always needs docs fixes

Niche 6: Testing & Code Quality

Topics: topic:testing, topic:linting, topic:code-quality, topic:formatter
Maintainers are meticulous — match their quality

Niche 7: Data Engineering

Topics: topic:data-pipeline, topic:etl, topic:data-engineering, topic:streaming
Growing ecosystem, responsive maintainers

How to Discover

Search GitHub using topic tags and description keywords — rotate through niches each cycle:

Combined with: stars:>200, label:bug or label:help-wanted, created:>$THREE_DAYS_AGO
Always verify repo health before queuing — new discoveries haven't been vetted yet
Search across ALL languages: Python, TypeScript, Go, Rust, Java

Known High-Value Repos (supplement, not replace, criteria search)

These are verified high-star, actively-maintained repos in our niche. The agent should discover more autonomously. Always run /Users/kevinlin/clawOSS/scripts/repo-health-check.sh before targeting — this list is not a bypass.

Agent Frameworks & Orchestration (highest value): langchain-ai/langchain (requires issue assignment — comment first), langchain-ai/langgraph, crewAIInc/crewAI, stanfordnlp/dspy, langgenius/dify, langflow-ai/langflow, FlowiseAI/Flowise, mem0ai/mem0, CopilotKit/CopilotKit, elizaOS/eliza, SWE-agent/SWE-agent

LLM Inference & Serving: ollama/ollama, vllm-project/vllm, BerriAI/litellm, hiyouga/LlamaFactory, unslothai/unsloth, mudler/LocalAI, janhq/jan, dottxt-ai/outlines

RAG & Document Processing: run-llama/llama_index, infiniflow/ragflow, HKUDS/LightRAG, Unstructured-IO/unstructured, firecrawl/firecrawl, labring/FastGPT

Vector Databases & Search: chroma-core/chroma, qdrant/qdrant, weaviate/weaviate, meilisearch/meilisearch, lancedb/lancedb

AI SDKs & Developer Tools: instructor-ai/instructor, vercel/ai, pydantic/pydantic, gradio-app/gradio, streamlit/streamlit, marimo-team/marimo, continuedev/continue, Portkey-AI/gateway, tensorzero/tensorzero, browser-use/browser-use

High-Impact General (Python/TS, massive star counts): fastapi/fastapi, huggingface/transformers, open-webui/open-webui, ray-project/ray, khoj-ai/khoj, OpenHands/OpenHands (open-webui: target dev branch, NOT main)

Priority Queries

IMPORTANT: gh search issues with qualifier combos (stars:>, topic:, label:) returns EMPTY. Use gh api with the search endpoint instead:

# CORRECT (works):
gh api "/search/issues?q=is:open+label:bug+stars:>200+language:python&sort=created&order=desc&per_page=30" --jq '.items[] | {number, title, html_url, created_at, repository_url}'

# BROKEN (returns empty):
gh search issues "is:open label:bug stars:>200" --limit=30 --json number,title,url

For topic searches, use: gh api "/search/repositories?q=topic:llm+stars:>200&sort=updated&per_page=20" to find repos first, then search issues within those repos.

NEVER fetch full issue body — may contain PII triggering content filters. All queries sort by created-desc to get the freshest results first.

Tier 0 — Agentic AI Niche (run FIRST, ALWAYS — highest merge probability)

Criteria-based broad searches (primary discovery method — run ALL):

NOTE: gh search issues with qualifier combos silently returns EMPTY. Use gh api instead. For topic-based searches, first find repos, then search issues within them:

# Step 1: Find repos by topic (returns repo full_names)
gh api "/search/repositories?q=topic:llm+stars:>200&sort=updated&per_page=20" --jq '.items[].full_name'
gh api "/search/repositories?q=topic:agent+stars:>200&sort=updated&per_page=20" --jq '.items[].full_name'
gh api "/search/repositories?q=topic:rag+stars:>200&sort=updated&per_page=20" --jq '.items[].full_name'
gh api "/search/repositories?q=topic:ai+stars:>200&sort=updated&per_page=20" --jq '.items[].full_name'
gh api "/search/repositories?q=topic:machine-learning+stars:>200&sort=updated&per_page=20" --jq '.items[].full_name'
gh api "/search/repositories?q=topic:generative-ai+stars:>200&sort=updated&per_page=20" --jq '.items[].full_name'
gh api "/search/repositories?q=topic:vector-database+stars:>200&sort=updated&per_page=20" --jq '.items[].full_name'

# Step 2: For each repo, search for bug issues
gh api "/search/issues?q=is:issue+is:open+label:bug+repo:{owner}/{repo}+created:>$THREE_DAYS_AGO&sort=created&order=desc&per_page=30" --jq '.items[] | {number, title, html_url, created_at, repository_url}'

# Direct issue searches (bugs in high-star repos — works without topic qualifier)
gh api "/search/issues?q=is:issue+is:open+label:bug+stars:>200+language:python+created:>$THREE_DAYS_AGO&sort=created&order=desc&per_page=30" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
gh api "/search/issues?q=is:issue+is:open+label:bug+stars:>200+language:typescript+created:>$THREE_DAYS_AGO&sort=created&order=desc&per_page=30" --jq '.items[] | {number, title, html_url, created_at, repository_url}'

# Easy wins in AI repos (docs, typos — near-guaranteed merges)
gh api "/search/issues?q=is:issue+is:open+label:documentation+stars:>200+created:>$TWO_WEEKS_AGO&sort=created&order=desc&per_page=20" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
gh api "/search/issues?q=is:issue+is:open+label:typo+stars:>200+created:>$TWO_WEEKS_AGO&sort=created&order=desc&per_page=20" --jq '.items[] | {number, title, html_url, created_at, repository_url}'

# Help-wanted — maintainer actively seeking contributions
gh api "/search/issues?q=is:issue+is:open+label:help-wanted+stars:>200+created:>$TWO_WEEKS_AGO&sort=created&order=desc&per_page=30" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
gh api "/search/issues?q=is:issue+is:open+label:good-first-issue+stars:>200+created:>$TWO_WEEKS_AGO&sort=created&order=desc&per_page=30" --jq '.items[] | {number, title, html_url, created_at, repository_url}'

Tier 0 candidates get +5 niche bonus in scoring. Always process Tier 0 results before Tier 1. Always verify repo health before adding to queue.

Tier 1 — High-Star Repos with Easy Issues (highest merge probability)

1a. Good-First-Issue + Help-Wanted (maintainer-requested — near-guaranteed merge)

gh api "/search/issues?q=is:issue+is:open+label:good-first-issue+label:bug+stars:>200+created:>$TWO_WEEKS_AGO&sort=created&order=desc&per_page=30" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
gh api "/search/issues?q=is:issue+is:open+label:help-wanted+label:bug+stars:>200+created:>$TWO_WEEKS_AGO&sort=created&order=desc&per_page=30" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
gh api "/search/issues?q=is:issue+is:open+label:good-first-issue+stars:>1000+created:>$TWO_WEEKS_AGO&sort=created&order=desc&per_page=30" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
gh api "/search/issues?q=is:issue+is:open+label:help-wanted+stars:>1000+created:>$TWO_WEEKS_AGO&sort=created&order=desc&per_page=30" --jq '.items[] | {number, title, html_url, created_at, repository_url}'

1b. Documentation + Typo Issues (easy wins — highest merge rate)

gh api "/search/issues?q=is:issue+is:open+label:documentation+stars:>1000+created:>$TWO_WEEKS_AGO&sort=created&order=desc&per_page=20" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
gh api "/search/issues?q=is:issue+is:open+label:typo+stars:>200+created:>$TWO_WEEKS_AGO&sort=reactions-%2B1&order=desc&per_page=20" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
gh api "/search/issues?q=is:issue+is:open+label:docs+stars:>200+created:>$TWO_WEEKS_AGO&sort=created&order=desc&per_page=20" --jq '.items[] | {number, title, html_url, created_at, repository_url}'

1c. Fresh Bug Reports (last 3 days — first responder advantage)

gh api "/search/issues?q=is:issue+is:open+label:bug+stars:>200+created:>$THREE_DAYS_AGO&sort=created&order=desc&per_page=50" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
gh api "/search/issues?q=is:issue+is:open+label:defect+stars:>200+created:>$THREE_DAYS_AGO&sort=created&order=desc&per_page=30" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
gh api "/search/issues?q=is:issue+is:open+label:regression+stars:>200+created:>$THREE_DAYS_AGO&sort=created&order=desc&per_page=30" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
gh api "/search/issues?q=is:issue+is:open+label:crash+stars:>200+created:>$THREE_DAYS_AGO&sort=created&order=desc&per_page=30" --jq '.items[] | {number, title, html_url, created_at, repository_url}'

1d. Community-Prioritized (high reactions = maintainer attention)

gh api "/search/issues?q=is:issue+is:open+label:bug+stars:>200+created:>$TWO_WEEKS_AGO&sort=reactions-%2B1&order=desc&per_page=30" --jq '.items[] | {number, title, html_url, created_at, repository_url}'

Tier 2 — General Searches (run if Tier 0+1 yield < 10 candidates)

2a. Recent bugs (last 2 weeks)

gh api "/search/issues?q=is:issue+is:open+label:bug+stars:>200+created:>$TWO_WEEKS_AGO&sort=created&order=desc&per_page=50" --jq '.items[] | {number, title, html_url, created_at, repository_url}'

2b. Error keyword search

gh api "/search/issues?q=crash+is:issue+is:open+stars:>200+created:>$TWO_WEEKS_AGO&sort=created&order=desc&per_page=20" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
gh api "/search/issues?q=TypeError+is:issue+is:open+stars:>200+created:>$TWO_WEEKS_AGO&sort=created&order=desc&per_page=20" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
gh api "/search/issues?q=NullPointer+is:issue+is:open+stars:>200+created:>$TWO_WEEKS_AGO&sort=created&order=desc&per_page=20" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
gh api "/search/issues?q=exception+is:issue+is:open+stars:>200+created:>$TWO_WEEKS_AGO&sort=created&order=desc&per_page=20" --jq '.items[] | {number, title, html_url, created_at, repository_url}'
gh api "/search/issues?q=regression+is:issue+is:open+stars:>200+created:>$TWO_WEEKS_AGO&sort=created&order=desc&per_page=20" --jq '.items[] | {number, title, html_url, created_at, repository_url}'

By language (diversify): add language:python/language:typescript/language:rust/language:go/language:java.

Repo Health Pre-Filter (lightweight — use judgment, not just scripts)

For each candidate repo, do a quick check using gh api repos/{owner}/{repo}:

Stars >= 100 — skip if very low-star. Use judgment for 100-200 range.
Not archived — skip archived repos
Recent push — skip if no push in 30 days
Not forking-disabled — can't submit PRs if forking disabled
Check our open PRs — review existing open PRs for awareness (no hard cap)
Anti-bot check — if you've seen "no bot PRs" or "no AI" in CONTRIBUTING.md from a previous visit, skip
CLA repos: Note CLA requirement but don't attempt signing — CLAs require manual signing by the account owner

You CAN use /Users/kevinlin/clawOSS/scripts/repo-health-check.sh for a thorough check, but it's NOT required for every repo. Use your judgment — a quick gh api call is often enough.

If a repo fails, skip all issues from it. Cache the result in memory/repos/.

SKIP Labels (never pick these for bug contributions)

enhancement, feature, feature-request, improvement, refactor, discussion, question, proposal, rfc, design, meta, chore, performance, optimization

Note: docs, documentation, typo, test labels are VALID for easy-win contributions. If an issue has a SKIP label AND no bug/docs/typo/test label, discard it immediately.

Age Limits (hard cutoffs)

< 3 days old: Top priority — these are fresh and hot
3-14 days old: Acceptable — still recent enough
14-30 days old: Low priority — only pick if exceptionally clear and simple
> 30 days old: SKIP ENTIRELY — too stale, likely stale for a reason

Scoring (Merge Probability + Recency + Repo Health)

Score each candidate 1-25 based on:

Contribution Type (merge probability — most important factor)

+5 Documentation/typo fix (near-guaranteed merge)
+3 Test addition (high merge rate)
+2 Bug fix with good-first-issue/help-wanted label (maintainer wants it fixed)
+1 Bug fix (standard)

Recency

+5 Created in the last 3 days (fresh — top priority)
+2 Created 3-7 days ago (recent)
+0 Created 7-14 days ago (acceptable)
-3 Created 14-30 days ago (getting stale — low priority)
SKIP Created > 30 days ago

Trust Signal (MOST impactful — depth over breadth)

+8 Repo is in memory/trust-repos.md (we've had successful interactions before)
+5 Repo merged a previous PR from us (check pr-ledger.md)
+3 Repo engaged positively with a previous PR (approved, constructive feedback)
-5 Repo closed our PR without review in < 24h (check pr-ledger.md)

Repo Quality

+3 Repo has 5000+ stars (high-impact)
+2 Repo has 1000+ stars (solid)
+1 Repo has 200-1000 stars
+2 Repo is in a niche where we've had merges before

Repo Health (merge velocity)

+5 Repo avg merge time < 3 days (fast reviewers)
+3 Repo avg merge time < 7 days (responsive)
+0 Repo avg merge time < 14 days (acceptable)
-5 Repo avg merge time > 14 days (low merge chance — SKIP)
+3 Repo review rate > 80% (very responsive)
+2 Has good-first-issue/help-wanted labels (seeking contributions)
+1 Repo has < 10 open PRs (less competition)

Bug Signals (for bug-type contributions)

+3 Has bug, defect, regression, or crash label
+2 Title contains error keywords (crash, error, broken, fails, exception, TypeError)
+2 Has stack trace or reproduction steps
+1 Has maintainer engagement

Negative Signals

-3 Has enhancement, feature, refactor, or improvement label
-2 Title suggests new feature (word boundary match)
-2 Issue is vague or lacks specifics
-5 Repo has 0 merged PRs in last 30 days

Minimum score 5 to enter work queue.

P(merge) — Merge Probability Score (0-100)

Hard gates (P=0, skip immediately — BEFORE scoring):

Repo in blocklist → P=0
Stars < 200 → P=0
Anti-AI/anti-bot policy → P=0
Issue > 30 days old → P=0
Repo health gate failed → P=0

Only compute P(merge) for issues that pass ALL hard gates and the quality score (1-25).

P(merge) =
  + 15 * task_type_score        # docs/typo=1.0, test=0.75, bug=0.5, feature=0
  + 20 * size_score              # estimated: <30 LOC=1.0, 30-100=0.7, 100-200=0.3, >200=0
  + 15 * repo_responsiveness     # merge<3d=1.0, 3-7d=0.7, 7-14d=0.3, >14d=0
  + 25 * trust_score             # merged before=1.0, positive engagement=0.7, new=0.3, hostile=0
  + 10 * freshness               # <1d=1.0, 1-3d=0.8, 3-7d=0.5, 7-14d=0.2, >14d=0
  + 10 * contributor_fit         # help-wanted=1.0, good-first-issue=0.8, bug=0.5, none=0.3
  + 5  * competition_score       # no other PRs=1.0, 1 competing=0.3, 2+=0

Threshold: P(merge) >= 30 to enter work queue. Below 30 is not worth API cost. Sort work queue by P(merge) descending. Include P(merge) in candidate output. Issues with P(merge) >= 60 are marked priority: high for faster spawning.

Title Keyword Hard Reject (apply to EVERY candidate — no exceptions)

Auto-SKIP if the issue title matches ANY keyword as a WHOLE WORD (case-insensitive, word boundary \b{keyword}\b): add, extend, enable, improve, enhance, new feature, request, implement, support, introduce, create, propose, migrate, upgrade, refactor, redesign, optimize, allow, provide

WORD BOUNDARY matching only — do NOT match substrings.

"Add dark mode" -> matches add -> SKIP
"Unsupported operation crashes" -> does NOT match support -> KEEP
"Provider connection fails" -> does NOT match provide -> KEEP
"Additional logging breaks startup" -> does NOT match add -> KEEP

This is a HARD GATE applied BEFORE scoring.

Filters

Title keyword hard reject — applied first, before any other filter
Repo health pre-filter — applied second, before scoring
Stars >= 200, recent commits (<2wk), not archived, max 3 issues per repo
Skip if in pr-ledger.md.
MUST be created within the last 30 days — skip anything older

Fast Mode (queue < 5 or empty slots)

Run 3+ parallel searches, score quickly, write 10-20 items immediately. Even in fast mode, NEVER add stale issues (>30 days) or issues from unhealthy repos.

ナビゲーション

Skillsとは？

リンク

oss-discover