---
name: scientific-literature-search
description: "Systematic strategies for searching scientific literature across PubMed, arXiv, Google Scholar, and AI-assisted tools. Covers PICO framework for clinical questions, three-tiered search (database-specific, AI-assisted, content extraction), PubMed field tags and MeSH, boolean query construction, and full-text extraction. Use when planning a literature search or choosing a search tier."
license: CC-BY-4.0
---
# Scientific Literature Search

## Overview
Scientific literature search is the foundation of evidence-based research. A well-executed search maximizes recall (finding all relevant papers) while maintaining precision (avoiding irrelevant results). This guide provides a systematic approach that combines database-specific query strategies, AI-assisted synthesis, and direct content extraction, organized into a three-tiered framework that scales from targeted lookups to comprehensive landscape reviews.
## Key Concepts

### The PICO Framework
For clinical and biomedical questions, structure queries using the PICO framework:
- P (Population): Who are you studying? (e.g., "Diabetes Mellitus"[MeSH])
- I (Intervention): What treatment or exposure? (e.g., "Metformin"[MeSH])
- C (Comparison): What is the alternative? (e.g., placebo, standard care)
- O (Outcome): What result are you measuring? (e.g., "Cardiovascular Diseases"[MeSH])
PICO queries can be combined with publication type filters to target specific evidence levels:
```
"Diabetes Mellitus"[MeSH] AND "Metformin"[MeSH] AND "Cardiovascular Diseases"[MeSH] AND ("clinical trial"[Publication Type] OR "meta-analysis"[Publication Type])
```
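The mapping from PICO components to a query string is mechanical enough to script. A minimal sketch (the `build_pico_query` helper and its argument names are illustrative, not part of any library):

```python
def build_pico_query(population, intervention, outcome, pub_types=None):
    """Combine PICO components into a PubMed boolean query string."""
    parts = [population, intervention, outcome]
    query = " AND ".join(p for p in parts if p)
    if pub_types:
        # OR the publication-type filters together, then AND onto the query
        filters = " OR ".join(f'"{t}"[Publication Type]' for t in pub_types)
        query += f" AND ({filters})"
    return query

query = build_pico_query(
    '"Diabetes Mellitus"[MeSH]',
    '"Metformin"[MeSH]',
    '"Cardiovascular Diseases"[MeSH]',
    pub_types=["clinical trial", "meta-analysis"],
)
print(query)
```

This reproduces the example query above; the Comparison component is often implicit (placebo or standard care) and so is omitted here.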
### Three-Tiered Search Strategy

Literature search is most effective when approached in tiers of increasing breadth:

#### Tier 1 -- Database-Specific Searches (Most Reliable)
Query established academic databases (PubMed, arXiv, Google Scholar) for peer-reviewed, indexed content. This is the most reliable tier and should always be the starting point.
- **PubMed** (via Biopython `Bio.Entrez`): Primary database for biomedical and life science literature. Supports MeSH controlled vocabulary and advanced field tags.
- **arXiv** (via the `arxiv` package): Preprint server for physics, mathematics, computer science, and quantitative biology. Results appear faster than peer-reviewed journals.
- **Google Scholar** (via the `scholarly` package): Broadest coverage across all academic disciplines. Note: has aggressive rate limits on automated queries.
Best for: finding specific papers, systematic reviews, clinical evidence, preprints.
#### Tier 2 -- AI-Assisted Web Search (Comprehensive)
Use the Claude API with the `web_search_20250305` server-side tool to synthesize broader context, identify research trends, and surface recent developments not yet indexed in databases. Also use general web search (e.g., via the `duckduckgo-search` package) for protocols, tutorials, and software documentation.
Best for: understanding the research landscape, complex multi-faceted questions, finding recent developments, identifying key researchers.
Avoid for: specific paper lookups (use Tier 1), citation counts (use Google Scholar), systematic reviews requiring reproducibility, searches where exact query terms must be documented.
#### Tier 3 -- Direct Content Extraction (Deep Dive)
Extract and analyze full-text content, PDFs, and supplementary materials from identified papers using `trafilatura` (HTML article extraction), `pypdf` (PDF text), and the Crossref API (DOI → supplementary file URLs).
Best for: detailed methodology extraction, data retrieval, protocol identification, supplementary data access.
### PubMed Field Tags
PubMed supports field-specific searching to improve precision:
| Tag | Description | Example |
|---|---|---|
| `[MeSH]` | Medical Subject Heading (controlled vocabulary) | `"Neoplasms"[MeSH]` |
| `[Title]` | Title field only | `"CRISPR"[Title]` |
| `[Title/Abstract]` | Title or abstract | `"gene therapy"[Title/Abstract]` |
| `[Author]` | Author name | `"Zhang F"[Author]` |
| `[Journal]` | Journal name | `"Nature"[Journal]` |
| `[Publication Type]` | Article type filter | `"Review"[Publication Type]` |
| `[Date - Publication]` | Publication date range | `"2020/01/01"[Date - Publication]:"2024/12/31"[Date - Publication]` |
| `[MeSH Major Topic]` | MeSH term as major focus of the article | `"CRISPR-Cas Systems"[MeSH Major Topic]` |
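Because field tags are plain string suffixes, tagged queries can be assembled with ordinary string formatting before being sent to any client library. A small sketch (the topic, filter, and date values are arbitrary examples):

```python
# Combine field tags: topic in title/abstract, reviews only, date-bounded.
topic = '"gene therapy"[Title/Abstract]'
article_type = '"Review"[Publication Type]'
date_range = '"2020/01/01"[Date - Publication]:"2024/12/31"[Date - Publication]'

query = f"{topic} AND {article_type} AND ({date_range})"
print(query)
```

The resulting string can be passed as the `term` argument of an Entrez search, as shown in the Workflow section.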
### Boolean Operators
Boolean operators control how search terms combine:
```python
# AND: All terms must be present -- narrows results
results = query_pubmed("CRISPR AND cancer AND therapy")

# OR: Any term can be present -- broadens results (use for synonyms)
results = query_pubmed("(tumor OR tumour OR neoplasm) AND immunotherapy")

# NOT: Exclude terms -- use sparingly to avoid losing relevant papers
results = query_pubmed("cancer immunotherapy NOT review")
```
Use parentheses to group OR terms together before combining with AND.
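The parenthesized OR-group pattern is easy to generate when you maintain synonym lists. A minimal sketch (the `or_group` helper is illustrative, not part of any package):

```python
def or_group(*synonyms):
    """Wrap synonyms in parentheses joined by OR, ready to AND with other terms."""
    return "(" + " OR ".join(synonyms) + ")"

query = or_group("tumor", "tumour", "neoplasm") + " AND immunotherapy"
print(query)  # (tumor OR tumour OR neoplasm) AND immunotherapy
```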
### arXiv Subject Categories
arXiv organizes preprints by subject category. Biology-related categories include:
| Category | Description |
|---|---|
| `q-bio.BM` | Biomolecules |
| `q-bio.CB` | Cell Behavior |
| `q-bio.GN` | Genomics |
| `q-bio.MN` | Molecular Networks |
| `q-bio.NC` | Neurons and Cognition |
| `q-bio.QM` | Quantitative Methods |
| `cs.AI` | Artificial Intelligence |
| `cs.LG` | Machine Learning |
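Categories can be used directly in search queries: the arXiv API query syntax supports field prefixes (`cat:`, `ti:`, `abs:`, `all:`) joined with AND/OR, and the `arxiv` package passes such strings through unchanged. A sketch (the category and keyword choices are arbitrary examples):

```python
# Restrict a keyword search to specific arXiv categories.
categories = ["q-bio.GN", "q-bio.QM"]
cat_clause = "(" + " OR ".join(f"cat:{c}" for c in categories) + ")"
search_query = f'{cat_clause} AND all:"single cell"'
print(search_query)  # (cat:q-bio.GN OR cat:q-bio.QM) AND all:"single cell"
```

The resulting string can be supplied as the `query` argument to `arxiv.Search`.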
## Decision Framework
Use this tree to determine which search tier and database to start with:
```
What type of question are you answering?
├── Clinical / biomedical question
│   ├── Specific drug or treatment → Tier 1: PubMed with PICO query
│   ├── Disease mechanism → Tier 1: PubMed with MeSH terms
│   └── Clinical trial evidence → Tier 1: PubMed filtered by Publication Type
├── Computational / quantitative methods
│   ├── ML model or algorithm → Tier 1: arXiv (cs.LG, cs.AI)
│   ├── Computational biology method → Tier 1: arXiv (q-bio.*) + PubMed
│   └── Software tool or pipeline → Tier 2: AI-assisted web search
├── Broad research landscape
│   ├── Current state of a field → Tier 2: AI-assisted web search
│   ├── Recent developments (last 6 months) → Tier 2: AI-assisted web search
│   └── Cross-disciplinary question → Tier 1: Google Scholar + Tier 2
├── Specific paper or data
│   ├── Known paper details → Tier 1: any database by title/author/DOI
│   ├── Methodology or protocol → Tier 3: full-text extraction
│   └── Supplementary data → Tier 3: DOI-based supplementary fetch
└── Protocols / reagents
    ├── Lab protocol → Tier 2: web search for protocols.io, etc.
    └── Validated reagents → Tier 2: AI-assisted web search
```
| Scenario | Recommended Tier and Database | Rationale |
|---|---|---|
| Systematic review of clinical evidence | Tier 1: PubMed with MeSH + publication type filters | Reproducible, documented search strategy required |
| Finding a preprint on a new ML method | Tier 1: arXiv with category and keyword search | Preprints appear on arXiv before journals |
| Understanding the research landscape | Tier 2: AI-assisted web search | Requires synthesis across many sources |
| Extracting a specific protocol from a paper | Tier 3: PDF content extraction | Need full-text access to methods section |
| Finding papers across disciplines | Tier 1: Google Scholar | Broadest coverage across fields |
| Identifying key researchers in a niche area | Tier 2: AI-assisted web search | Requires contextual synthesis |
| Downloading supplementary data tables | Tier 3: DOI-based supplementary fetch | Direct access to supplementary files |
## Best Practices
- **Use controlled vocabulary (MeSH) for PubMed searches**: Free-text searches miss papers that use different terminology. MeSH terms map synonyms to a single concept, improving recall without sacrificing precision.

  ```python
  # Free text misses synonyms
  query_pubmed("heart attack treatment")
  # MeSH captures all synonyms
  query_pubmed('"Myocardial Infarction"[MeSH] AND "Drug Therapy"[MeSH]')
  ```

- **Include synonyms and alternative terms with OR**: Scientific concepts often have multiple names (e.g., tumor/tumour/neoplasm). Group synonyms with OR inside parentheses to avoid missing relevant papers.

  ```python
  query_pubmed("(myocardial infarction OR heart attack) AND (treatment OR therapy)")
  ```

- **Use phrase searching for multi-word concepts**: Quoting exact phrases prevents the search engine from splitting terms and matching them independently.

  ```python
  query_pubmed('"single cell RNA sequencing" AND methods')
  ```

- **Filter by publication type when seeking specific evidence**: Clinical trials, systematic reviews, and meta-analyses each answer different questions. Use `[Publication Type]` to target the evidence level you need.

  ```python
  query_pubmed("COVID-19 vaccine efficacy AND clinical trial[Publication Type]")
  ```

- **Start broad, then narrow iteratively**: Begin with core concepts (2-3 terms) and review initial results. Add specificity based on what you find -- more terms, date ranges, field tags, or publication types.

  ```python
  # Step 1: Broad
  results = query_pubmed("CRISPR base editing iPSC", max_papers=20)

  # Step 2: Add MeSH and specificity
  results = query_pubmed(
      '"CRISPR-Cas Systems"[MeSH] AND "base editing" AND "induced pluripotent stem cells" AND efficiency',
      max_papers=20,
  )

  # Step 3: Filter by date
  results = query_pubmed(
      '"CRISPR-Cas Systems"[MeSH] AND "base editing" AND "induced pluripotent stem cells" AND efficiency AND ("2022"[Date - Publication]:"2024"[Date - Publication])',
      max_papers=20,
  )
  ```

- **Cross-reference multiple databases**: No single database covers all literature. Use PubMed for biomedical content, arXiv for computational preprints, and Google Scholar for cross-disciplinary coverage.

- **Assess result quality systematically**: Evaluate papers for source reliability (peer-reviewed journal), author credentials, recency, study design appropriateness, sample size adequacy, reproducibility, declared conflicts of interest, and citation count.
## Common Pitfalls
- **Overly long and specific queries**: Packing too many terms into a single query causes missed results because all terms must match simultaneously.
  - How to avoid: Limit queries to core concepts (3-5 terms). Run separate searches for sub-topics and combine results manually.

  ```python
  # Too specific -- misses relevant papers
  query_pubmed("CRISPR Cas9 gene editing HEK293T cells 2024 efficiency optimization delivery")
  # Better -- core concepts only
  query_pubmed("CRISPR Cas9 gene editing optimization efficiency")
  ```

- **Relying on a single database**: PubMed has a biomedical focus, arXiv covers preprints, Google Scholar spans disciplines. Using only one database guarantees blind spots.
  - How to avoid: Always search at least two databases. For computational biology, combine PubMed and arXiv. For cross-disciplinary topics, include Google Scholar.

- **Ignoring publication dates**: Scientific knowledge evolves rapidly. Foundational papers remain relevant, but methods and clinical evidence may be superseded.
  - How to avoid: Check publication dates in all results. For methods papers, prefer the last 3-5 years. For foundational concepts, older papers are acceptable but verify with recent reviews.

- **Skipping title and abstract review before deep-diving**: Not all search results that match keywords are actually relevant. Downloading and reading full texts without screening wastes time.
  - How to avoid: Always screen titles and abstracts first. Only extract full text (Tier 3) for papers that pass screening.

- **Using NOT operators too aggressively**: The NOT operator can inadvertently exclude relevant papers that mention the excluded term in a different context.
  - How to avoid: Use NOT sparingly. Prefer adding positive terms to narrow results rather than excluding terms. When you must use NOT, verify that excluded results are genuinely irrelevant.

- **Ignoring Google Scholar rate limits**: Google Scholar aggressively rate-limits automated queries, which can block further searches.
  - How to avoid: Use Google Scholar sparingly. Add delays between requests. Prefer PubMed or arXiv for bulk searching and reserve Google Scholar for cross-disciplinary checks.

- **Not documenting the search strategy**: For systematic reviews and reproducible research, an undocumented search cannot be verified or reproduced.
  - How to avoid: Record your search terms, databases queried, date ranges, and number of results at each stage. This is essential for systematic reviews and good practice for all searches.
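A search log need not be elaborate. One lightweight sketch that appends each search as a JSON line (the field names and counts below are illustrative, not a standard schema):

```python
import json
from datetime import date

# Record one search as a JSON line; rerun after each query iteration.
entry = {
    "date": date.today().isoformat(),
    "database": "PubMed",
    "query": '"CRISPR-Cas Systems"[MeSH] AND "Gene Editing"[MeSH]',
    "filters": "2020-2024, no publication-type filter",
    "results_returned": 143,  # example count
    "screened_in": 12,        # example count
}
with open("search_log.jsonl", "a") as fh:
    fh.write(json.dumps(entry) + "\n")
```

A file like this is enough to reconstruct the search history for a PRISMA-style flow diagram later.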
## Workflow
- **Step 1: Define the research question**
  - Identify the main concept, population/model, intervention/method, desired outcome, and time frame
  - For clinical questions, map to the PICO framework
  - Example: "Find recent papers on CRISPR base editing efficiency in human iPSCs" decomposes to: main concept = CRISPR base editing, model = human iPSCs, outcome = efficiency, time frame = last 3 years

- **Step 2: Construct and execute database queries (Tier 1)**
  - Start with PubMed for biomedical topics, arXiv for computational topics
  - Begin with a broad query using 2-3 core terms
  - Refine with MeSH terms, field tags, date filters, and publication type filters

  ```python
  from Bio import Entrez
  import arxiv
  from scholarly import scholarly

  Entrez.email = "your.email@example.com"  # NCBI requires a contact email

  # PubMed: biomedical literature
  handle = Entrez.esearch(
      db="pubmed",
      term='"CRISPR-Cas Systems"[MeSH] AND "Gene Editing"[MeSH]',
      retmax=20,
  )
  pubmed_ids = Entrez.read(handle)["IdList"]
  handle.close()

  # arXiv: computational biology preprints
  arxiv_results = list(
      arxiv.Search(query="protein structure prediction", max_results=10).results()
  )

  # Google Scholar: broad cross-disciplinary coverage
  scholar_results = scholarly.search_pubs("single cell RNA sequencing analysis methods")
  ```

- **Step 3: Supplement with AI-assisted search (Tier 2)**
  - Use AI-assisted web search for landscape overviews and recent developments
  - Use general web search for protocols, tutorials, and documentation

  ```python
  from anthropic import Anthropic

  client = Anthropic()
  response = client.messages.create(
      model="claude-opus-4-7",
      max_tokens=4096,
      tools=[{"type": "web_search_20250305", "name": "web_search", "max_uses": 3}],
      messages=[{
          "role": "user",
          "content": "What are the latest developments in CAR-T cell therapy for solid tumors in 2024?",
      }],
  )
  print(response.content)
  ```

- **Step 4: Evaluate and filter results**
  - Screen titles and abstracts for relevance
  - Prioritize by recency, journal quality, citation count, and study design
  - For clinical evidence, prioritize RCTs, systematic reviews, and meta-analyses
  - For methods, prioritize protocol papers and method comparisons
  - Decision point: If too many results, add more specific terms or filters. If too few, broaden terms and add synonyms.

- **Step 5: Deep dive into key papers (Tier 3)**
  - Extract full text from high-priority papers
  - Download supplementary materials for data and protocols
  - Check reference lists for additional relevant papers

  ```python
  import io
  import os
  from pathlib import Path
  from urllib.parse import urlparse

  import requests
  import trafilatura
  from pypdf import PdfReader

  # Extract article content from URL (clean main text, drops nav/ads)
  downloaded = trafilatura.fetch_url("https://www.nature.com/articles/nature12373")
  article_text = trafilatura.extract(downloaded)

  # Extract text from a PDF
  pdf_bytes = requests.get("https://arxiv.org/pdf/1706.03762.pdf", timeout=30).content
  reader = PdfReader(io.BytesIO(pdf_bytes))
  pdf_text = "\n".join(page.extract_text() or "" for page in reader.pages)

  # Download supplementary files via Crossref DOI metadata
  doi = "10.1038/nature12373"
  meta = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30).json()
  out_dir = Path("./supplementary_materials")
  out_dir.mkdir(exist_ok=True)
  for link in meta.get("message", {}).get("link", []):
      url = link.get("URL")
      if not url:
          continue
      fname = os.path.basename(urlparse(url).path) or "supplement.bin"
      (out_dir / fname).write_bytes(requests.get(url, timeout=60).content)
  ```

- **Step 6: Document and iterate**
  - Record all search terms, databases, filters, and result counts
  - If gaps remain, revisit Steps 2-3 with refined queries
  - For systematic reviews, follow PRISMA guidelines for reporting
## Common Search Scenarios
The following scenarios illustrate how to combine the three tiers for typical research questions.
### Finding Methods and Protocols
Start with PubMed for published methodology papers, then supplement with web search for step-by-step protocols from resources like protocols.io.
```python
from Bio import Entrez
from duckduckgo_search import DDGS

Entrez.email = "your.email@example.com"

# Search for methodology papers in PubMed
handle = Entrez.esearch(
    db="pubmed",
    term='"Western Blotting"[MeSH] AND (protocol OR method OR technique)',
    retmax=10,
)
pubmed_ids = Entrez.read(handle)["IdList"]
handle.close()

# Check web for step-by-step protocols
web_hits = DDGS().text("Western blot protocol for membrane proteins", max_results=5)
```
### Understanding Disease Mechanisms
Begin with review articles for a broad overview, then drill into specific mechanistic studies.
```python
# Find review articles first for an overview
results = query_pubmed(
    '"Alzheimer Disease"[MeSH] AND pathophysiology AND review[Publication Type]',
    max_papers=10,
)

# Then find specific mechanistic studies
results = query_pubmed(
    '"Alzheimer Disease"[MeSH] AND ("amyloid beta"[MeSH] OR tau) AND mechanism',
    max_papers=20,
)
```
### Finding Drug and Treatment Information
Use publication type filters to separate clinical trial evidence from systematic reviews.
```python
# Clinical trials for a specific drug-condition pair
results = query_pubmed(
    '"Drug Name"[Substance Name] AND "Condition"[MeSH] AND clinical trial[Publication Type]',
    max_papers=20,
)

# Systematic reviews and meta-analyses
results = query_pubmed(
    '"Drug Name" AND "Condition" AND (systematic review[Publication Type] OR meta-analysis[Publication Type])',
    max_papers=10,
)
```
### Tracking Latest Developments
Combine AI-assisted search for synthesis with database searches for recent indexed publications.
```python
from anthropic import Anthropic
from Bio import Entrez

client = Anthropic()
Entrez.email = "your.email@example.com"

# AI-assisted synthesis of recent advances (Claude API web search tool)
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    tools=[{"type": "web_search_20250305", "name": "web_search", "max_uses": 3}],
    messages=[{
        "role": "user",
        "content": "What are the most significant advances in CAR-T cell therapy in 2024?",
    }],
)

# Supplement with recent PubMed results
handle = Entrez.esearch(
    db="pubmed",
    term='"Chimeric Antigen Receptor T-Cell Therapy"[MeSH] AND "2024"[Date - Publication]',
    retmax=20,
)
pubmed_ids = Entrez.read(handle)["IdList"]
handle.close()
```
### Finding Specific Reagents and Materials
Use AI-assisted search for validated reagent recommendations, supplemented by general web search.
```python
from anthropic import Anthropic
from duckduckgo_search import DDGS

client = Anthropic()

# Search for validated reagents (Claude API + web search tool)
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    tools=[{"type": "web_search_20250305", "name": "web_search", "max_uses": 2}],
    messages=[{
        "role": "user",
        "content": "validated antibodies for Western blot detection of p53 protein",
    }],
)

# Search supplier databases
supplier_hits = DDGS().text("p53 antibody Western blot validated", max_results=5)
```
### Comparative Analysis Across Methods
Use AI-assisted search for synthesized comparisons of techniques or tools.
```python
from anthropic import Anthropic

client = Anthropic()

# Compare approaches with AI synthesis (Claude API web search tool)
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    tools=[{"type": "web_search_20250305", "name": "web_search", "max_uses": 5}],
    messages=[{
        "role": "user",
        "content": "Compare different CRISPR delivery methods for in vivo gene editing: viral vectors vs lipid nanoparticles",
    }],
)
print(response.content)
```
## Quality Assessment Checklist
When evaluating search results, apply these criteria:
- Source reliability: Is the paper from a peer-reviewed journal?
- Author credentials: Are the authors established experts in the field?
- Recency: Is the information current enough for your purpose?
- Study design: Is the design appropriate for the question (e.g., RCT for efficacy, cohort for risk)?
- Sample size: Is it adequate for the conclusions drawn?
- Reproducibility: Are methods described clearly enough to replicate?
- Conflicts of interest: Are any conflicts declared?
- Citation count: Has the paper been well-cited by subsequent work?
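When screening many papers, the checklist can be encoded as a simple record so results are comparable across a batch. An illustrative sketch (the field names mirror the criteria above; the example values and any score cutoff are arbitrary, not established thresholds):

```python
from dataclasses import dataclass, fields

@dataclass
class QualityCheck:
    """One boolean per checklist criterion."""
    peer_reviewed: bool
    authors_established: bool
    recent_enough: bool
    appropriate_design: bool
    adequate_sample: bool
    reproducible_methods: bool
    conflicts_declared: bool
    well_cited: bool

    def score(self) -> int:
        # Count how many of the eight criteria this paper satisfies
        return sum(getattr(self, f.name) for f in fields(self))

paper = QualityCheck(True, True, True, True, False, True, True, False)
print(paper.score())  # 6 of 8 criteria met
```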
## Further Reading
- PubMed Help -- Official guide to PubMed search syntax, field tags, filters, and advanced features
- arXiv Help Pages -- Documentation on arXiv search, subject categories, and submission process
- MeSH Browser -- NLM tool for browsing and searching the Medical Subject Headings controlled vocabulary
- PRISMA Statement -- Guidelines for transparent reporting of systematic reviews and meta-analyses
- Cochrane Handbook for Systematic Reviews -- Gold-standard methodology for systematic literature reviews
## Related Skills

- `pubmed-database` -- Direct PubMed API access for programmatic literature retrieval
- `scientific-manuscript-writing` -- Structuring literature review sections within manuscripts
- `research-question-formulation` -- Frameworks for defining answerable research questions