name: metagenomics-amplicon description: 16S/amplicon metagenomics diversity and community analysis. tool_type: mixed primary_tool: Python

16S Amplicon Metagenomics

When to Use

16S rRNA amplicon microbiome studies (OTU/ASV, alpha/beta diversity)
Promoter scanning: TATA box detection, CpG island calling, PWM/TFBS scanning

Quick Reference

16S Alpha Diversity

Metric	Formula	Notes
Observed species	`count(counts > 0)`	Richness only
Shannon H'	`-Σ pᵢ ln(pᵢ)`	Richness + evenness
Simpson 1-D	`1 - Σ pᵢ²`	Robust to rare taxa
Faith's PD	`Σ branch lengths`	Phylogenetic richness

16S Beta Diversity

Metric	Phylogenetic	Quantitative	Formula
Jaccard	No	No	`1 - \|A∩B\| / \|A∪B\|`
Bray-Curtis	No	Yes	`Σ\|Xᵢⱼ - Xᵢₖ\| / Σ(Xᵢⱼ + Xᵢₖ)`
Unweighted UniFrac	Yes	No	unique / total observed branch length
Weighted UniFrac	Yes	Yes	abundance-weighted branch fractions

Promoter Key Elements

Element	Position	Consensus	Notes
TATA box	−25 to −30 bp	TATAAA (TATAWAW)	~10–20% human genes
CpG island	Near TSS	GC ≥ 50%, CpG O/E ≥ 0.6, ≥ 200 bp	~70% human promoters
Inr	+1 (TSS)	YYANWYY	TATA-less promoters
DPE	+28 to +32	RGWYV	Pairs with Inr

CpG O/E: (N_CpG × L) / (N_C × N_G)

Key Patterns

Shannon / Simpson

def shannon_diversity(counts):
    p = counts[counts > 0] / counts[counts > 0].sum()
    return -np.sum(p * np.log(p))

def simpson_diversity(counts):
    p = counts[counts > 0] / counts[counts > 0].sum()
    return 1 - np.sum(p ** 2)

Bray-Curtis Distance

def bray_curtis(s1, s2):
    s1, s2 = np.array(s1, float), np.array(s2, float)
    return np.sum(np.abs(s1 - s2)) / np.sum(s1 + s2)

PCoA from Distance Matrix

def pcoa(dm_df):
    dm = dm_df.values; n = len(dm)
    H = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * H @ (dm ** 2) @ H
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[idx], eigvecs[:, idx]
    pos = eigvals > 0
    coords = eigvecs[:, pos] * np.sqrt(eigvals[pos])
    prop = eigvals[pos] / eigvals[pos].sum()
    return pd.DataFrame(coords[:, :2], index=dm_df.index, columns=['PC1', 'PC2']), prop

CpG Island Scanner

def cpg_island_scanner(seq, window=200, step=1, gc_thresh=0.5, oe_thresh=0.6):
    seq = seq.upper(); islands = []
    for i in range(0, len(seq) - window + 1, step):
        w = seq[i:i+window]
        nc, ng, ncpg = w.count('C'), w.count('G'), w.count('CG')
        gc = (nc + ng) / window
        oe = (ncpg * window) / (nc * ng) if nc > 0 and ng > 0 else 0
        if gc >= gc_thresh and oe >= oe_thresh:
            islands.append((i, i+window, gc, oe))
    return islands

PWM Construction and Scanning

def build_pfm(sites):
    pfm = {b: [0]*len(sites[0]) for b in 'ACGT'}
    for site in sites:
        for i, b in enumerate(site.upper()): pfm[b][i] += 1
    return pfm

def pfm_to_pwm(pfm, pseudocount=0.5, bg=None):
    if bg is None: bg = {b: 0.25 for b in 'ACGT'}
    n = sum(pfm[b][0] for b in 'ACGT')
    return {b: [np.log2((pfm[b][i]+pseudocount)/(n+4*pseudocount)/bg[b])
                for i in range(len(pfm['A']))] for b in 'ACGT'}

def scan_pwm(seq, pwm, threshold=0.0):
    L = len(pwm['A']); hits = []
    for i in range(len(seq) - L + 1):
        kmer = seq[i:i+L].upper()
        score = sum(pwm[b][j] for j, b in enumerate(kmer) if b in pwm)
        if score >= threshold: hits.append((i, kmer, score))
    return hits

Code Templates

PERMANOVA

def permanova(dm, grouping, n_perm=999):
    dm_v = dm.values; groups = grouping.loc[dm.index].values
    unique = np.unique(groups); n = len(groups)
    def f_stat(g):
        ss_tot = np.sum(dm_v**2) / n
        ss_w = sum(np.sum(dm_v[np.ix_(g==grp,g==grp)]**2)/(2*(g==grp).sum())
                   for grp in unique if (g==grp).sum() > 1)
        df_b, df_w = len(unique)-1, n-len(unique)
        return ((ss_tot-ss_w)/df_b) / (ss_w/df_w) if df_w and ss_w else 0
    obs = f_stat(groups)
    p = (sum(f_stat(np.random.permutation(groups)) >= obs for _ in range(n_perm)) + 1) / (n_perm + 1)
    return obs, p

TATA Box Detection

import re
def find_tata_boxes(seq, strict=True):
    pat = 'TATAAA' if strict else r'TATA[AT]A[AT]'
    return [(m.start(), m.group()) for m in re.finditer(pat, seq.upper())]

Pitfalls

Rarefying without curves — rarefy only when rarefaction curves plateau; raw alpha diversity requires rarefaction (deeper samples appear more diverse).
PCoA axes without variance-explained — axis 1 may explain only 20%; check percentages.
PERMANOVA + dispersion — sensitive to within-group variance, not just centroids; pair with betadisper (R).
CpG step=1 on large sequences — use step=10–50 for >10 kb, then merge overlapping windows.
PWM threshold — use 80–90th percentile of random-sequence scores as baseline.
Multiple testing in ORA — BH-correct across all tested pathways.

Related Skills

biostatistics-r — hypothesis testing, BH correction, power analysis
numpy-pandas-wrangling — count matrix manipulation
python-advanced-sql — annotation lookups

ナビゲーション

Skillsとは？

リンク

metagenomics-amplicon

name: metagenomics-amplicon description: 16S/amplicon metagenomics diversity and community analysis. tool_type: mixed primary_tool: Python

16S Amplicon Metagenomics

When to Use

Quick Reference

16S Alpha Diversity

16S Beta Diversity

Promoter Key Elements

Key Patterns

Shannon / Simpson

Bray-Curtis Distance

PCoA from Distance Matrix

CpG Island Scanner

PWM Construction and Scanning

Code Templates

PERMANOVA

TATA Box Detection

Pitfalls

Related Skills

関連スキル(🔧 開発ツール)