---
name: scientific-literature-search
description: "Systematic strategies for searching scientific literature across PubMed, arXiv, Google Scholar, and AI-assisted tools. Covers PICO framework for clinical questions, three-tiered search (database-specific, AI-assisted, content extraction), PubMed field tags and MeSH, boolean query construction, and full-text extraction. Use when planning a literature search or choosing a search tier."
license: CC-BY-4.0
---
# Scientific Literature Search

## Overview
Scientific literature search is the foundation of evidence-based research. A well-executed search maximizes recall (finding all relevant papers) while maintaining precision (avoiding irrelevant results). This guide provides a systematic approach that combines database-specific query strategies, AI-assisted synthesis, and direct content extraction, organized into a three-tiered framework that scales from targeted lookups to comprehensive landscape reviews.
## Key Concepts

### The PICO Framework
For clinical and biomedical questions, structure queries using the PICO framework:
- P (Population): Who are you studying? (e.g., "Diabetes Mellitus"[MeSH])
- I (Intervention): What treatment or exposure? (e.g., "Metformin"[MeSH])
- C (Comparison): What is the alternative? (e.g., placebo, standard care)
- O (Outcome): What result are you measuring? (e.g., "Cardiovascular Diseases"[MeSH])
PICO queries can be combined with publication type filters to target specific evidence levels:
```
"Diabetes Mellitus"[MeSH] AND "Metformin"[MeSH] AND "Cardiovascular Diseases"[MeSH] AND ("clinical trial"[Publication Type] OR "meta-analysis"[Publication Type])
```
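The mapping from PICO components to a query string is mechanical enough to script. A minimal sketch (the `build_pico_query` helper and its argument names are illustrative, not part of any library):

```python
def build_pico_query(population, intervention, outcome, pub_types=None):
    """Combine PICO components into a PubMed boolean query string."""
    parts = [population, intervention, outcome]
    query = " AND ".join(p for p in parts if p)
    if pub_types:
        # OR the publication-type filters together, then AND onto the query
        filters = " OR ".join(f'"{t}"[Publication Type]' for t in pub_types)
        query += f" AND ({filters})"
    return query

query = build_pico_query(
    '"Diabetes Mellitus"[MeSH]',
    '"Metformin"[MeSH]',
    '"Cardiovascular Diseases"[MeSH]',
    pub_types=["clinical trial", "meta-analysis"],
)
print(query)
```

This reproduces the example query above; the Comparison component is often implicit (placebo or standard care) and so is omitted here.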
### Three-Tiered Search Strategy

Literature search is most effective when approached in tiers of increasing breadth:

#### Tier 1 -- Database-Specific Searches (Most Reliable)
Query established academic databases (PubMed, arXiv, Google Scholar) for peer-reviewed, indexed content. This is the most reliable tier and should always be the starting point.
- **PubMed** (via Biopython `Bio.Entrez`): Primary database for biomedical and life science literature. Supports MeSH controlled vocabulary and advanced field tags.
- **arXiv** (via the `arxiv` package): Preprint server for physics, mathematics, computer science, and quantitative biology. Results appear faster than peer-reviewed journals.
- **Google Scholar** (via the `scholarly` package): Broadest coverage across all academic disciplines. Note: has aggressive rate limits on automated queries.
Best for: finding specific papers, systematic reviews, clinical evidence, preprints.
#### Tier 2 -- AI-Assisted Web Search (Comprehensive)
Use the Claude API with the `web_search_20250305` server-side tool to synthesize broader context, identify research trends, and surface recent developments not yet indexed in databases. Also use general web search (e.g., via the `duckduckgo-search` package) for protocols, tutorials, and software documentation.
Best for: understanding the research landscape, complex multi-faceted questions, finding recent developments, identifying key researchers.
Avoid for: specific paper lookups (use Tier 1), citation counts (use Google Scholar), systematic reviews requiring reproducibility, searches where exact query terms must be documented.
#### Tier 3 -- Direct Content Extraction (Deep Dive)
Extract and analyze full-text content, PDFs, and supplementary materials from identified papers using `trafilatura` (HTML article extraction), `pypdf` (PDF text), and the Crossref API (DOI → supplementary file URLs).
Best for: detailed methodology extraction, data retrieval, protocol identification, supplementary data access.
### PubMed Field Tags
PubMed supports field-specific searching to improve precision:
| Tag | Description | Example |
|---|---|---|
| `[MeSH]` | Medical Subject Heading (controlled vocabulary) | `"Neoplasms"[MeSH]` |
| `[Title]` | Title field only | `"CRISPR"[Title]` |
| `[Title/Abstract]` | Title or abstract | `"gene therapy"[Title/Abstract]` |
| `[Author]` | Author name | `"Zhang F"[Author]` |
| `[Journal]` | Journal name | `"Nature"[Journal]` |
| `[Publication Type]` | Article type filter | `"Review"[Publication Type]` |
| `[Date - Publication]` | Publication date range | `"2020/01/01"[Date - Publication]:"2024/12/31"[Date - Publication]` |
| `[MeSH Major Topic]` | MeSH term as major focus of the article | `"CRISPR-Cas Systems"[MeSH Major Topic]` |
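Because field tags are plain string suffixes, tagged queries can be assembled with ordinary string formatting before being sent to any client library. A small sketch (the topic, filter, and date values are arbitrary examples):

```python
# Combine field tags: topic in title/abstract, reviews only, date-bounded.
topic = '"gene therapy"[Title/Abstract]'
article_type = '"Review"[Publication Type]'
date_range = '"2020/01/01"[Date - Publication]:"2024/12/31"[Date - Publication]'

query = f"{topic} AND {article_type} AND ({date_range})"
print(query)
```

The resulting string can be passed as the `term` argument of an Entrez search, as shown in the Workflow section.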
### Boolean Operators
Boolean operators control how search terms combine:
```python
# AND: All terms must be present -- narrows results
results = query_pubmed("CRISPR AND cancer AND therapy")

# OR: Any term can be present -- broadens results (use for synonyms)
results = query_pubmed("(tumor OR tumour OR neoplasm) AND immunotherapy")

# NOT: Exclude terms -- use sparingly to avoid losing relevant papers
results = query_pubmed("cancer immunotherapy NOT review")
```
Use parentheses to group OR terms together before combining with AND.
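The parenthesized OR-group pattern is easy to generate when you maintain synonym lists. A minimal sketch (the `or_group` helper is illustrative, not part of any package):

```python
def or_group(*synonyms):
    """Wrap synonyms in parentheses joined by OR, ready to AND with other terms."""
    return "(" + " OR ".join(synonyms) + ")"

query = or_group("tumor", "tumour", "neoplasm") + " AND immunotherapy"
print(query)  # (tumor OR tumour OR neoplasm) AND immunotherapy
```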
### arXiv Subject Categories
arXiv organizes preprints by subject category. Biology-related categories include:
| Category | Description |
|---|---|
| `q-bio.BM` | Biomolecules |
| `q-bio.CB` | Cell Behavior |
| `q-bio.GN` | Genomics |
| `q-bio.MN` | Molecular Networks |
| `q-bio.NC` | Neurons and Cognition |
| `q-bio.QM` | Quantitative Methods |
| `cs.AI` | Artificial Intelligence |
| `cs.LG` | Machine Learning |
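Categories can be used directly in search queries: the arXiv API query syntax supports field prefixes (`cat:`, `ti:`, `abs:`, `all:`) joined with AND/OR, and the `arxiv` package passes such strings through unchanged. A sketch (the category and keyword choices are arbitrary examples):

```python
# Restrict a keyword search to specific arXiv categories.
categories = ["q-bio.GN", "q-bio.QM"]
cat_clause = "(" + " OR ".join(f"cat:{c}" for c in categories) + ")"
search_query = f'{cat_clause} AND all:"single cell"'
print(search_query)  # (cat:q-bio.GN OR cat:q-bio.QM) AND all:"single cell"
```

The resulting string can be supplied as the `query` argument to `arxiv.Search`.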
## Decision Framework
Use this tree to determine which search tier and database to start with:
```
What type of question are you answering?
├── Clinical / biomedical question
│   ├── Specific drug or treatment → Tier 1: PubMed with PICO query
│   ├── Disease mechanism → Tier 1: PubMed with MeSH terms
│   └── Clinical trial evidence → Tier 1: PubMed filtered by Publication Type
├── Computational / quantitative methods
│   ├── ML model or algorithm → Tier 1: arXiv (cs.LG, cs.AI)
│   ├── Computational biology method → Tier 1: arXiv (q-bio.*) + PubMed
│   └── Software tool or pipeline → Tier 2: AI-assisted web search
├── Broad research landscape
│   ├── Current state of a field → Tier 2: AI-assisted web search
│   ├── Recent developments (last 6 months) → Tier 2: AI-assisted web search
│   └── Cross-disciplinary question → Tier 1: Google Scholar + Tier 2
├── Specific paper or data
│   ├── Known paper details → Tier 1: any database by title/author/DOI
│   ├── Methodology or protocol → Tier 3: full-text extraction
│   └── Supplementary data → Tier 3: DOI-based supplementary fetch
└── Protocols / reagents
    ├── Lab protocol → Tier 2: web search for protocols.io, etc.
    └── Validated reagents → Tier 2: AI-assisted web search
```
| Scenario | Recommended Tier and Database | Rationale |
|---|---|---|
| Systematic review of clinical evidence | Tier 1: PubMed with MeSH + publication type filters | Reproducible, documented search strategy required |
| Finding a preprint on a new ML method | Tier 1: arXiv with category and keyword search | Preprints appear on arXiv before journals |
| Understanding the research landscape | Tier 2: AI-assisted web search | Requires synthesis across many sources |
| Extracting a specific protocol from a paper | Tier 3: PDF content extraction | Need full-text access to methods section |
| Finding papers across disciplines | Tier 1: Google Scholar | Broadest coverage across fields |
| Identifying key researchers in a niche area | Tier 2: AI-assisted web search | Requires contextual synthesis |
| Downloading supplementary data tables | Tier 3: DOI-based supplementary fetch | Direct access to supplementary files |
## Best Practices
- **Use controlled vocabulary (MeSH) for PubMed searches**: Free-text searches miss papers that use different terminology. MeSH terms map synonyms to a single concept, improving recall without sacrificing precision.

  ```python
  # Free text misses synonyms
  query_pubmed("heart attack treatment")
  # MeSH captures all synonyms
  query_pubmed('"Myocardial Infarction"[MeSH] AND "Drug Therapy"[MeSH]')
  ```

- **Include synonyms and alternative terms with OR**: Scientific concepts often have multiple names (e.g., tumor/tumour/neoplasm). Group synonyms with OR inside parentheses to avoid missing relevant papers.

  ```python
  query_pubmed("(myocardial infarction OR heart attack) AND (treatment OR therapy)")
  ```

- **Use phrase searching for multi-word concepts**: Quoting exact phrases prevents the search engine from splitting terms and matching them independently.

  ```python
  query_pubmed('"single cell RNA sequencing" AND methods')
  ```

- **Filter by publication type when seeking specific evidence**: Clinical trials, systematic reviews, and meta-analyses each answer different questions. Use `[Publication Type]` to target the evidence level you need.

  ```python
  query_pubmed("COVID-19 vaccine efficacy AND clinical trial[Publication Type]")
  ```

- **Start broad, then narrow iteratively**: Begin with core concepts (2-3 terms) and review initial results. Add specificity based on what you find -- more terms, date ranges, field tags, or publication types.

  ```python
  # Step 1: Broad
  results = query_pubmed("CRISPR base editing iPSC", max_papers=20)

  # Step 2: Add MeSH and specificity
  results = query_pubmed(
      '"CRISPR-Cas Systems"[MeSH] AND "base editing" AND "induced pluripotent stem cells" AND efficiency',
      max_papers=20,
  )

  # Step 3: Filter by date
  results = query_pubmed(
      '"CRISPR-Cas Systems"[MeSH] AND "base editing" AND "induced pluripotent stem cells" AND efficiency AND ("2022"[Date - Publication]:"2024"[Date - Publication])',
      max_papers=20,
  )
  ```

- **Cross-reference multiple databases**: No single database covers all literature. Use PubMed for biomedical content, arXiv for computational preprints, and Google Scholar for cross-disciplinary coverage.

- **Assess result quality systematically**: Evaluate papers for source reliability (peer-reviewed journal), author credentials, recency, study design appropriateness, sample size adequacy, reproducibility, declared conflicts of interest, and citation count.
## Common Pitfalls
- **Overly long and specific queries**: Packing too many terms into a single query causes missed results because all terms must match simultaneously.
  - How to avoid: Limit queries to core concepts (3-5 terms). Run separate searches for sub-topics and combine results manually.

  ```python
  # Too specific -- misses relevant papers
  query_pubmed("CRISPR Cas9 gene editing HEK293T cells 2024 efficiency optimization delivery")
  # Better -- core concepts only
  query_pubmed("CRISPR Cas9 gene editing optimization efficiency")
  ```

- **Relying on a single database**: PubMed has a biomedical focus, arXiv covers preprints, Google Scholar spans disciplines. Using only one database guarantees blind spots.
  - How to avoid: Always search at least two databases. For computational biology, combine PubMed and arXiv. For cross-disciplinary topics, include Google Scholar.

- **Ignoring publication dates**: Scientific knowledge evolves rapidly. Foundational papers remain relevant, but methods and clinical evidence may be superseded.
  - How to avoid: Check publication dates in all results. For methods papers, prefer the last 3-5 years. For foundational concepts, older papers are acceptable but verify with recent reviews.

- **Skipping title and abstract review before deep-diving**: Not all search results that match keywords are actually relevant. Downloading and reading full texts without screening wastes time.
  - How to avoid: Always screen titles and abstracts first. Only extract full text (Tier 3) for papers that pass screening.

- **Using NOT operators too aggressively**: The NOT operator can inadvertently exclude relevant papers that mention the excluded term in a different context.
  - How to avoid: Use NOT sparingly. Prefer adding positive terms to narrow results rather than excluding terms. When you must use NOT, verify that excluded results are genuinely irrelevant.

- **Ignoring Google Scholar rate limits**: Google Scholar aggressively rate-limits automated queries, which can block further searches.
  - How to avoid: Use Google Scholar sparingly. Add delays between requests. Prefer PubMed or arXiv for bulk searching and reserve Google Scholar for cross-disciplinary checks.

- **Not documenting the search strategy**: For systematic reviews and reproducible research, an undocumented search cannot be verified or reproduced.
  - How to avoid: Record your search terms, databases queried, date ranges, and number of results at each stage. This is essential for systematic reviews and good practice for all searches.
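A search log need not be elaborate. One lightweight sketch that appends each search as a JSON line (the field names and counts below are illustrative, not a standard schema):

```python
import json
from datetime import date

# Record one search as a JSON line; rerun after each query iteration.
entry = {
    "date": date.today().isoformat(),
    "database": "PubMed",
    "query": '"CRISPR-Cas Systems"[MeSH] AND "Gene Editing"[MeSH]',
    "filters": "2020-2024, no publication-type filter",
    "results_returned": 143,  # example count
    "screened_in": 12,        # example count
}
with open("search_log.jsonl", "a") as fh:
    fh.write(json.dumps(entry) + "\n")
```

A file like this is enough to reconstruct the search history for a PRISMA-style flow diagram later.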
## Workflow
- **Step 1: Define the research question**
  - Identify the main concept, population/model, intervention/method, desired outcome, and time frame
  - For clinical questions, map to the PICO framework
  - Example: "Find recent papers on CRISPR base editing efficiency in human iPSCs" decomposes to: main concept = CRISPR base editing, model = human iPSCs, outcome = efficiency, time frame = last 3 years

- **Step 2: Construct and execute database queries (Tier 1)**
  - Start with PubMed for biomedical topics, arXiv for computational topics
  - Begin with a broad query using 2-3 core terms
  - Refine with MeSH terms, field tags, date filters, and publication type filters

  ```python
  from Bio import Entrez
  import arxiv
  from scholarly import scholarly

  Entrez.email = "your.email@example.com"  # NCBI requires a contact email

  # PubMed: biomedical literature
  handle = Entrez.esearch(
      db="pubmed",
      term='"CRISPR-Cas Systems"[MeSH] AND "Gene Editing"[MeSH]',
      retmax=20,
  )
  pubmed_ids = Entrez.read(handle)["IdList"]
  handle.close()

  # arXiv: computational biology preprints
  arxiv_results = list(
      arxiv.Search(query="protein structure prediction", max_results=10).results()
  )

  # Google Scholar: broad cross-disciplinary coverage
  scholar_results = scholarly.search_pubs("single cell RNA sequencing analysis methods")
  ```

- **Step 3: Supplement with AI-assisted search (Tier 2)**
  - Use AI-assisted web search for landscape overviews and recent developments
  - Use general web search for protocols, tutorials, and documentation

  ```python
  from anthropic import Anthropic

  client = Anthropic()
  response = client.messages.create(
      model="claude-opus-4-7",
      max_tokens=4096,
      tools=[{"type": "web_search_20250305", "name": "web_search", "max_uses": 3}],
      messages=[{
          "role": "user",
          "content": "What are the latest developments in CAR-T cell therapy for solid tumors in 2024?",
      }],
  )
  print(response.content)
  ```

- **Step 4: Evaluate and filter results**
  - Screen titles and abstracts for relevance
  - Prioritize by recency, journal quality, citation count, and study design
  - For clinical evidence, prioritize RCTs, systematic reviews, and meta-analyses
  - For methods, prioritize protocol papers and method comparisons
  - Decision point: If too many results, add more specific terms or filters. If too few, broaden terms and add synonyms.

- **Step 5: Deep dive into key papers (Tier 3)**
  - Extract full text from high-priority papers
  - Download supplementary materials for data and protocols
  - Check reference lists for additional relevant papers

  ```python
  import io
  import os
  from pathlib import Path
  from urllib.parse import urlparse

  import requests
  import trafilatura
  from pypdf import PdfReader

  # Extract article content from URL (clean main text, drops nav/ads)
  downloaded = trafilatura.fetch_url("https://www.nature.com/articles/nature12373")
  article_text = trafilatura.extract(downloaded)

  # Extract text from a PDF
  pdf_bytes = requests.get("https://arxiv.org/pdf/1706.03762.pdf", timeout=30).content
  reader = PdfReader(io.BytesIO(pdf_bytes))
  pdf_text = "\n".join(page.extract_text() or "" for page in reader.pages)

  # Download supplementary files via Crossref DOI metadata
  doi = "10.1038/nature12373"
  meta = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30).json()
  out_dir = Path("./supplementary_materials")
  out_dir.mkdir(exist_ok=True)
  for link in meta.get("message", {}).get("link", []):
      url = link.get("URL")
      if not url:
          continue
      fname = os.path.basename(urlparse(url).path) or "supplement.bin"
      (out_dir / fname).write_bytes(requests.get(url, timeout=60).content)
  ```

- **Step 6: Document and iterate**
  - Record all search terms, databases, filters, and result counts
  - If gaps remain, revisit Steps 2-3 with refined queries
  - For systematic reviews, follow PRISMA guidelines for reporting
## Common Search Scenarios
The following scenarios illustrate how to combine the three tiers for typical research questions.
### Finding Methods and Protocols
Start with PubMed for published methodology papers, then supplement with web search for step-by-step protocols from resources like protocols.io.
```python
from Bio import Entrez
from duckduckgo_search import DDGS

Entrez.email = "your.email@example.com"

# Search for methodology papers in PubMed
handle = Entrez.esearch(
    db="pubmed",
    term='"Western Blotting"[MeSH] AND (protocol OR method OR technique)',
    retmax=10,
)
pubmed_ids = Entrez.read(handle)["IdList"]
handle.close()

# Check web for step-by-step protocols
web_hits = DDGS().text("Western blot protocol for membrane proteins", max_results=5)
```
### Understanding Disease Mechanisms
Begin with review articles for a broad overview, then drill into specific mechanistic studies.
```python
# Find review articles first for an overview
results = query_pubmed(
    '"Alzheimer Disease"[MeSH] AND pathophysiology AND review[Publication Type]',
    max_papers=10,
)

# Then find specific mechanistic studies
results = query_pubmed(
    '"Alzheimer Disease"[MeSH] AND ("amyloid beta"[MeSH] OR tau) AND mechanism',
    max_papers=20,
)
```
### Finding Drug and Treatment Information
Use publication type filters to separate clinical trial evidence from systematic reviews.
```python
# Clinical trials for a specific drug-condition pair
results = query_pubmed(
    '"Drug Name"[Substance Name] AND "Condition"[MeSH] AND clinical trial[Publication Type]',
    max_papers=20,
)

# Systematic reviews and meta-analyses
results = query_pubmed(
    '"Drug Name" AND "Condition" AND (systematic review[Publication Type] OR meta-analysis[Publication Type])',
    max_papers=10,
)
```
### Tracking Latest Developments
Combine AI-assisted search for synthesis with database searches for recent indexed publications.
```python
from anthropic import Anthropic
from Bio import Entrez

client = Anthropic()
Entrez.email = "your.email@example.com"

# AI-assisted synthesis of recent advances (Claude API web search tool)
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    tools=[{"type": "web_search_20250305", "name": "web_search", "max_uses": 3}],
    messages=[{
        "role": "user",
        "content": "What are the most significant advances in CAR-T cell therapy in 2024?",
    }],
)

# Supplement with recent PubMed results
handle = Entrez.esearch(
    db="pubmed",
    term='"Chimeric Antigen Receptor T-Cell Therapy"[MeSH] AND "2024"[Date - Publication]',
    retmax=20,
)
pubmed_ids = Entrez.read(handle)["IdList"]
handle.close()
```
### Finding Specific Reagents and Materials
Use AI-assisted search for validated reagent recommendations, supplemented by general web search.
```python
from anthropic import Anthropic
from duckduckgo_search import DDGS

client = Anthropic()

# Search for validated reagents (Claude API + web search tool)
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    tools=[{"type": "web_search_20250305", "name": "web_search", "max_uses": 2}],
    messages=[{
        "role": "user",
        "content": "validated antibodies for Western blot detection of p53 protein",
    }],
)

# Search supplier databases
supplier_hits = DDGS().text("p53 antibody Western blot validated", max_results=5)
```
### Comparative Analysis Across Methods
Use AI-assisted search for synthesized comparisons of techniques or tools.
```python
from anthropic import Anthropic

client = Anthropic()

# Compare approaches with AI synthesis (Claude API web search tool)
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    tools=[{"type": "web_search_20250305", "name": "web_search", "max_uses": 5}],
    messages=[{
        "role": "user",
        "content": "Compare different CRISPR delivery methods for in vivo gene editing: viral vectors vs lipid nanoparticles",
    }],
)
print(response.content)
```
## Quality Assessment Checklist
When evaluating search results, apply these criteria:
- Source reliability: Is the paper from a peer-reviewed journal?
- Author credentials: Are the authors established experts in the field?
- Recency: Is the information current enough for your purpose?
- Study design: Is the design appropriate for the question (e.g., RCT for efficacy, cohort for risk)?
- Sample size: Is it adequate for the conclusions drawn?
- Reproducibility: Are methods described clearly enough to replicate?
- Conflicts of interest: Are any conflicts declared?
- Citation count: Has the paper been well-cited by subsequent work?
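When screening many papers, the checklist can be encoded as a simple record so results are comparable across a batch. An illustrative sketch (the field names mirror the criteria above; the example values and any score cutoff are arbitrary, not established thresholds):

```python
from dataclasses import dataclass, fields

@dataclass
class QualityCheck:
    """One boolean per checklist criterion."""
    peer_reviewed: bool
    authors_established: bool
    recent_enough: bool
    appropriate_design: bool
    adequate_sample: bool
    reproducible_methods: bool
    conflicts_declared: bool
    well_cited: bool

    def score(self) -> int:
        # Count how many of the eight criteria this paper satisfies
        return sum(getattr(self, f.name) for f in fields(self))

paper = QualityCheck(True, True, True, True, False, True, True, False)
print(paper.score())  # 6 of 8 criteria met
```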
## Further Reading
- PubMed Help -- Official guide to PubMed search syntax, field tags, filters, and advanced features
- arXiv Help Pages -- Documentation on arXiv search, subject categories, and submission process
- MeSH Browser -- NLM tool for browsing and searching the Medical Subject Headings controlled vocabulary
- PRISMA Statement -- Guidelines for transparent reporting of systematic reviews and meta-analyses
- Cochrane Handbook for Systematic Reviews -- Gold-standard methodology for systematic literature reviews
## Related Skills

- `pubmed-database` -- Direct PubMed API access for programmatic literature retrieval
- `scientific-manuscript-writing` -- Structuring literature review sections within manuscripts
- `research-question-formulation` -- Frameworks for defining answerable research questions