name: r-python-translation
description: >-
  R-to-Python translation for data analysis. Maps R packages (tidyverse, ggplot2, fixest, survey, sf, plm) to Python equivalents (polars, plotnine, pyfixest, svy, geopandas). Use when user has R background or requests R-equivalent code comments.
metadata:
  audience: research-coders
  domain: research-methodology
  skill-last-updated: "2026-03-28"
R-to-Python Translation Skill
R-to-Python translation reference for quantitative social science data analysis. Maps R ecosystem packages (tidyverse/dplyr, ggplot2, fixest, survey, sf, plm, lme4, marginaleffects, rdrobust) to DAAF Python equivalents (polars, plotnine, pyfixest, statsmodels, linearmodels, svy, geopandas). Use when user mentions R/RStudio background, requests R-equivalent code comments, needs to understand Python analysis code from an R perspective, or wants to translate R data analysis concepts to Python. Covers paradigm differences, verb-by-verb operation translations, regression modeling, causal inference, visualization, and workflow adaptation.
Cross-language translation reference for researchers moving between the R and Python data analysis ecosystems. This skill maps R packages, idioms, and workflows to their DAAF Python equivalents so that R-background users can audit, understand, and learn from DAAF-produced code, and so that code-producing agents can annotate their output with R equivalents when directed.
This skill is a routing hub — it provides overview tables, decision trees, and directs readers to the detailed reference files listed below. The reference files contain the exhaustive verb-by-verb mappings, code examples, and edge-case documentation.
What This Skill Does
- Maps the R data analysis ecosystem to DAAF's Python stack across data wrangling, modeling, visualization, causal inference, surveys, spatial analysis, and workflow tooling
- Provides a structured annotation protocol for agents to add inline R-equivalent comments to Python code
- Identifies paradigm gaps where R and Python diverge fundamentally, so users know where to expect friction
Use cases:
- R user auditing DAAF Python code and needing to understand what operations are being performed
- Agent annotating code with R-equivalent comments for an R-background researcher
- R user learning Python for data analysis and needing a conceptual bridge
- Translating a specific R operation or idiom to its Python equivalent
- Understanding where R tools have no direct Python equivalent (and what the workaround is)
How to Use This Skill
Reference File Structure
Each topic in ./references/ contains focused documentation:
| File | Purpose | When to Read |
|---|---|---|
| `paradigm-differences.md` | Core language and paradigm differences | Encountering fundamental R-vs-Python confusion |
| `polars-dplyr.md` | Core dplyr/tidyr to polars verb mapping (select, filter, mutate, joins, reshaping, window functions, lazy eval) | Reading or writing data manipulation code |
| `polars-strings-dates-factors.md` | String, date/time, and factor operations (stringr, lubridate, forcats to polars) | Working with string/date/categorical columns |
| `regression-modeling.md` | fixest/stats/plm to pyfixest/statsmodels/linearmodels | Reading or writing regression code |
| `visualization.md` | ggplot2/plotly R to plotnine/plotly Python | Reading or writing visualization code |
| `causal-inference.md` | R causal inference ecosystem to Python equivalents | Working with DiD, RDD, IV, event studies |
| `survey-spatial-ml.md` | survey/sf/tidymodels to svy/geopandas/scikit-learn | Working with surveys, spatial data, or ML |
| `workflow-environment.md` | RStudio/Quarto workflow to DAAF/marimo workflow | Adapting to DAAF's execution model |
| `external-resources.md` | Curated guides and tutorials with provenance | Seeking additional learning materials |
| `gotchas.md` | Common R-user mistakes in Python | Debugging or reviewing code from R perspective |
Reading Order
- R user auditing DAAF code: `paradigm-differences.md`, then the relevant domain file (e.g., `polars-dplyr.md` for data wrangling, `regression-modeling.md` for models), then `gotchas.md`
- Agent annotating code with R equivalents: Agent Code Annotation Protocol section below, then the relevant domain file for the code being annotated
- Learning Python from R background: `paradigm-differences.md`, then `polars-dplyr.md`, then `workflow-environment.md`, then `external-resources.md`
- Looking up a specific translation: Quick Decision Trees below, then the relevant reference file
Quick Decision Trees
"How do I do X from R in Python?"
What kind of R operation?
├─ Data wrangling (filter, mutate, join, pivot, summarise)
│ └─ ./references/polars-dplyr.md
├─ Regression / statistical modeling
│ └─ ./references/regression-modeling.md
├─ Plotting / visualization
│ └─ ./references/visualization.md
├─ Causal inference (DiD, RDD, IV, event studies)
│ └─ ./references/causal-inference.md
├─ Surveys / spatial / machine learning
│ └─ ./references/survey-spatial-ml.md
└─ Fundamental language differences (types, syntax, environment)
  └─ ./references/paradigm-differences.md
"Why does this Python code look different from R?"
What looks unfamiliar?
├─ Expression syntax (pl.col().method().alias())
│ └─ ./references/paradigm-differences.md
├─ Missing values (None vs NaN vs null vs NA)
│ └─ ./references/paradigm-differences.md
├─ Formula interface (~) behaves differently
│ └─ ./references/regression-modeling.md
├─ Import patterns and namespacing
│ └─ ./references/gotchas.md
└─ No interactive REPL / console workflow
  └─ ./references/workflow-environment.md
"I want to translate an R script to Python"
What does the R script do?
├─ Loads and wrangles data (read_csv, dplyr verbs)
│ └─ ./references/polars-dplyr.md
├─ Runs regressions (lm, feols, plm)
│ └─ ./references/regression-modeling.md
├─ Creates plots (ggplot, plotly)
│ └─ ./references/visualization.md
├─ Uses survey weights (svydesign, svymean)
│ └─ ./references/survey-spatial-ml.md
├─ Spatial operations (sf, st_join)
│ └─ ./references/survey-spatial-ml.md
├─ Multiple of the above
│ └─ Start with ./references/paradigm-differences.md, then each relevant file
└─ Uses a package not listed above
  └─ ./references/external-resources.md for broader ecosystem guidance
"Something isn't working and I think it's an R habit"
What went wrong?
├─ 1-indexed access gave wrong element
│ └─ ./references/gotchas.md
├─ Factor/categorical behaves differently
│ └─ ./references/gotchas.md
├─ NA handling surprised me
│ └─ ./references/paradigm-differences.md
├─ Pipe operator (|> or %>%) not available
│ └─ ./references/paradigm-differences.md
├─ library() vs import confusion
│ └─ ./references/gotchas.md
└─ Model output structure is different
  └─ ./references/regression-modeling.md
"Which Python package replaces my R package?"
Which R package?
├─ dplyr / tidyr / readr / tibble → polars
│ └─ ./references/polars-dplyr.md
├─ ggplot2 → plotnine
│ └─ ./references/visualization.md
├─ plotly (R) → plotly (Python)
│ └─ ./references/visualization.md
├─ fixest → pyfixest
│ └─ ./references/regression-modeling.md
├─ stats (lm, glm) → statsmodels
│ └─ ./references/regression-modeling.md
├─ plm / lme4 / estimatr → linearmodels
│ └─ ./references/regression-modeling.md
├─ survey → svy
│ └─ ./references/survey-spatial-ml.md
├─ sf / terra → geopandas
│ └─ ./references/survey-spatial-ml.md
├─ tidymodels / caret → scikit-learn
│ └─ ./references/survey-spatial-ml.md
├─ marginaleffects → marginaleffects (Python)
│ └─ ./references/regression-modeling.md
├─ rdrobust / did / synthdid → rdrobust / pyfixest DiD
│ └─ ./references/causal-inference.md
└─ Quarto / RMarkdown → marimo
  └─ ./references/workflow-environment.md
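The package tree above can also be read as a plain lookup table. The sketch below encodes a handful of its branches in a dictionary; `R_TO_PYTHON`, `FALLBACK`, and `route` are hypothetical names used only for illustration and are not part of DAAF. The fallback mirrors the "package not listed" branch of the script-translation tree.

```python
from typing import Optional, Tuple

# Hypothetical lookup built from a few branches of the decision tree above.
R_TO_PYTHON = {
    "dplyr": ("polars", "./references/polars-dplyr.md"),
    "tidyr": ("polars", "./references/polars-dplyr.md"),
    "ggplot2": ("plotnine", "./references/visualization.md"),
    "fixest": ("pyfixest", "./references/regression-modeling.md"),
    "plm": ("linearmodels", "./references/regression-modeling.md"),
    "survey": ("svy", "./references/survey-spatial-ml.md"),
    "sf": ("geopandas", "./references/survey-spatial-ml.md"),
    "tidymodels": ("scikit-learn", "./references/survey-spatial-ml.md"),
}

# Packages not in the table are routed to the external resources file.
FALLBACK = (None, "./references/external-resources.md")


def route(r_package: str) -> Tuple[Optional[str], str]:
    """Return (python_package, reference_file) for an R package name."""
    return R_TO_PYTHON.get(r_package.strip().lower(), FALLBACK)


print(route("fixest"))  # a listed package
print(route("brms"))    # an unlisted package falls through to FALLBACK
```

A dictionary keeps the mapping auditable in one place; extending it to the remaining branches is a matter of adding rows.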
Package Mapping Overview
| Python Package | R Equivalent | Fidelity | Key Difference |
|---|---|---|---|
| polars | dplyr + tidyr + data.table | Low | Expression system vs verb grammar; method chaining vs pipe |
| pyfixest | fixest | High | Near-identical formula syntax; minor SE default differences |
| plotnine | ggplot2 | High | Same grammar of graphics; Python string quoting for aes |
| plotly | plotly (R) | High | px.scatter() vs plot_ly(); similar output |
| statsmodels | base R stats + lmtest + sandwich | Medium | Three formula dialects; manual vcov specification |
| linearmodels | plm + lme4 + estimatr | Medium | Requires pandas MultiIndex for panel structure |
| scikit-learn | tidymodels / caret | Medium | Imperative fit/predict vs declarative recipe pipeline |
| geopandas | sf + terra | Medium | shapely geometries vs sfc; different CRS handling |
| svy | survey (Lumley) | Medium | Limited GLM family coverage (gaussian/binomial/Poisson only) |
| marimo | Quarto / RMarkdown | Medium | Reactive cells vs knit-based linear execution |
Fidelity key: High = near-direct translation, same mental model. Medium = same capability, different API patterns. Low = fundamentally different paradigm requiring conceptual remapping.
Library Versions
Translations in this skill reference specific library versions. Python versions are pinned in DAAF's Docker environment (Python 3.12). R versions reference CRAN releases as of March 2026. When syntax or behavior has changed between versions, the reference files note the change.
| Python Package | DAAF Version | R Equivalent | R Version (CRAN) |
|---|---|---|---|
| polars | 1.38.1 | dplyr + tidyr + data.table | dplyr 1.2.0, tidyr 1.3.2, data.table 1.18.2 |
| pyfixest | 0.40.0 | fixest | 0.14.0 |
| plotnine | 0.15.3 | ggplot2 | 4.0.2 |
| plotly | 6.5.2 | plotly (R) | 4.12.0 |
| statsmodels | 0.14.6 | base R stats + lmtest + sandwich | lmtest 0.9-40, sandwich 3.1-1 |
| linearmodels | unpinned | plm + lme4 + estimatr | plm 2.6-7, lme4 2.0-1 |
| scikit-learn | 1.8.0 | tidymodels / caret | tidymodels 1.4.1, caret 7.0-1 |
| geopandas | 1.1.3 | sf + terra | sf 1.1-0, terra 1.9-11 |
| svy | 0.13.0 | survey | survey 4.5 |
| marginaleffects | unpinned | marginaleffects (R) | 0.32.0 |
| rdrobust | unpinned | rdrobust (R) | 3.0.0 |
| marimo | 0.19.11 | Quarto / RMarkdown | Quarto 1.6.x |
Unpinned packages: linearmodels, marginaleffects, and rdrobust install the latest version at Docker build time. Translations for these packages reference their documented API as of March 2026.
R version note: R package versions are from CRAN as of March 2026 (R 4.5.3). Check `packageVersion("pkg")` in your R installation to verify your local version matches.
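On the Python side, a comparable check can be done with the standard library's `importlib.metadata`. The sketch below compares a few pinned versions from the table above against whatever is installed locally; the `PINNED` subset and the helper names `installed_version` and `version_report` are illustrative assumptions, not DAAF tooling.

```python
from importlib.metadata import PackageNotFoundError, version

# Pinned versions copied from the table above; unpinned packages are omitted.
PINNED = {
    "polars": "1.38.1",
    "pyfixest": "0.40.0",
    "plotnine": "0.15.3",
    "statsmodels": "0.14.6",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)  # R: packageVersion("pkg")
    except PackageNotFoundError:
        return None


def version_report(pinned):
    """Map each package name to (pinned, installed, matches)."""
    report = {}
    for name, expected in pinned.items():
        found = installed_version(name)
        report[name] = (expected, found, found == expected)
    return report


for name, (expected, found, ok) in version_report(PINNED).items():
    status = "ok" if ok else "differs"
    print(f"{name}: pinned {expected}, installed {found} ({status})")
```

Outside the DAAF Docker image the report will usually show differences, which is a cue to re-check any version-sensitive translation.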
Top 10 Paradigm Differences
These are the friction points R users encounter most frequently when reading or writing DAAF Python code. Each is covered in depth in the referenced file.
| # | Friction Point | R Way | Python Way | Reference |
|---|---|---|---|---|
| 1 | Expression system | df %>% mutate(x = a + b) | df.with_columns((pl.col("a") + pl.col("b")).alias("x")) | paradigm-differences.md |
| 2 | Formula fragmentation | One universal ~ syntax | Three dialects (pyfixest, statsmodels, linearmodels) | regression-modeling.md |
| 3 | Missing values | Single NA type | None, NaN, and null (context-dependent) | paradigm-differences.md |
| 4 | mutate equivalent | mutate(new = expr) | with_columns(expr.alias("new")) | polars-dplyr.md |
| 5 | No row index | Tibbles have row numbers | Polars has no row index; use with_row_index() | paradigm-differences.md |
| 6 | Polars-to-pandas bridge | Data frames go directly into models | Must call .to_pandas() before statsmodels/pyfixest | paradigm-differences.md |
| 7 | Factor vs Categorical | factor() with ordered levels | pl.Categorical / pd.Categorical (different semantics) | gotchas.md |
| 8 | Package fragmentation | One package per domain (fixest does it all) | Multiple packages per domain (statsmodels + linearmodels + pyfixest) | paradigm-differences.md |
| 9 | 1-indexed vs 0-indexed | x[1] is first element | x[0] is first element | gotchas.md |
| 10 | Namespace model | library() exports all names | import requires explicit namespacing | gotchas.md |
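Rows 3 and 9 of the table can be seen without any third-party library. The following is a minimal plain-Python sketch (not polars); the `is_missing` helper is a hypothetical stand-in that only roughly approximates R's `is.na()`.

```python
import math

values = [10, 20, 30]

# R: values[1]
first = values[0]

# R: values[length(values)]  (in R, values[-1] would drop the first element instead)
last = values[-1]

# R has a single NA; plain Python has both None and the float NaN.
nan = float("nan")
nan_equals_itself = nan == nan  # NaN never compares equal to itself


def is_missing(x):
    """Treat both None and NaN as missing, roughly like is.na() in R."""
    return x is None or (isinstance(x, float) and math.isnan(x))


# R: is.na(x) applied element-wise
flags = [is_missing(v) for v in [1.0, None, nan, "a"]]
print(first, last, nan_equals_itself, flags)
```

Because `nan == nan` is false, equality checks are not a reliable way to detect missing values; an explicit test such as `math.isnan` or an identity check against `None` is needed.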
Agent Code Annotation Protocol
This section defines when and how code-producing agents add inline R-equivalent comments to DAAF Python scripts.
When to Annotate
Annotations are added only when the orchestrator explicitly passes an R-background directive to the agent. This is not a default behavior.
Trigger conditions (orchestrator activates this when any apply):
- User states they have an R / RStudio background
- User requests R-equivalent comments in code
- User asks to understand Python code from an R perspective
How the orchestrator passes the directive: The orchestrator adds the following to the agent prompt:
"User has R background. Load r-python-translation skill. Add inline R-equivalent comments for non-trivial data operations."
Comment Format
# R: df %>% filter(year == 2020)
filtered = df.filter(pl.col("year") == 2020)
# R: df %>% mutate(pct = count / sum(count))
result = df.with_columns(
    (pl.col("count") / pl.col("count").sum()).alias("pct")
)
# R: feols(y ~ x1 + x2 | state + year, data = df, cluster = ~state)
fit = pf.feols("y ~ x1 + x2 | state + year", data=pdf, vcov={"CRV1": "state"})
What to Annotate
- Annotate: Data wrangling (polars operations), modeling calls (pyfixest, statsmodels, linearmodels), visualization layer construction (plotnine, plotly), causal inference method calls
- Do NOT annotate: Import statements, `print()`/`assert` validation lines, file I/O boilerplate (`pl.read_parquet`, `df.write_parquet`), config sections, section separator comments
Rules
- One `# R:` comment per logical operation, placed on the line immediately above the Python code
- Keep annotations to a single line; abbreviate complex R pipelines if needed
- R annotations are in addition to standard IAT comments (`# INTENT:`, `# REASONING:`, `# ASSUMES:`), not a replacement
- Consumer agents: research-executor, code-reviewer, debugger, data-ingest
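The placement rules above lend themselves to a mechanical check. The sketch below is one possible linter, assuming annotations always begin with the literal prefix `# R:`; the function name `check_r_annotations` is hypothetical and no such tool is described in this skill.

```python
def check_r_annotations(source: str) -> list:
    """Return 1-based line numbers of '# R:' comments not directly above code.

    A comment is flagged when the next line is missing, blank, or another
    comment, which also catches annotations that spill over several lines.
    """
    lines = source.splitlines()
    problems = []
    for i, line in enumerate(lines):
        if not line.strip().startswith("# R:"):
            continue
        following = lines[i + 1].strip() if i + 1 < len(lines) else ""
        if not following or following.startswith("#"):
            problems.append(i + 1)
    return problems


good = (
    "# R: df %>% filter(year == 2020)\n"
    'filtered = df.filter(pl.col("year") == 2020)\n'
)
bad = (
    "# R: df %>% filter(year == 2020)\n"
    "\n"
    'filtered = df.filter(pl.col("year") == 2020)\n'
)

print(check_r_annotations(good))  # no problems expected
print(check_r_annotations(bad))   # the blank line separates comment from code
```

The check treats the script as text and never imports or runs it, so it can be applied to code that depends on packages the reviewer has not installed.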
Related Skills
| Skill | Relationship |
|---|---|
| `polars` | Python-side data wrangling: detailed API reference for the dplyr/tidyr equivalent |
| `pyfixest` | Python-side fixed effects regression: detailed API for the fixest equivalent |
| `plotnine` | Python-side static visualization: detailed API for the ggplot2 equivalent |
| `plotly` | Python-side interactive visualization: detailed API for the plotly R equivalent |
| `statsmodels` | Python-side general modeling: covers base R stats, lmtest, sandwich equivalents |
| `linearmodels` | Python-side panel/IV models: covers plm, lme4, estimatr equivalents |
| `scikit-learn` | Python-side ML: covers tidymodels/caret equivalents |
| `geopandas` | Python-side spatial data: covers sf/terra equivalents |
| `svy` | Python-side survey analysis: covers survey (Lumley) equivalents |
| `marimo` | Python-side notebooks: covers Quarto/RMarkdown workflow equivalents |
| `stata-python-translation` | Parallel skill for Stata-background users: shares the same Python target stack |
Note: Individual tool skills contain library-specific usage guidance (syntax, gotchas, performance). This skill provides the R-to-Python conceptual bridge — use both together when an R-background user is working with a specific library.
Topic Index
| Topic | Reference File |
|---|---|
| Pipe operator (`%>%` / `\|>`) equivalents | ./references/paradigm-differences.md |
| Expression system (pl.col, .alias) | ./references/paradigm-differences.md |
| Missing value semantics (NA vs None/NaN/null) | ./references/paradigm-differences.md |
| Type system differences | ./references/paradigm-differences.md |
| Package/namespace model | ./references/paradigm-differences.md |
| 0-indexing vs 1-indexing | ./references/paradigm-differences.md |
| Polars-to-pandas conversion for modeling | ./references/paradigm-differences.md |
| Row index differences | ./references/paradigm-differences.md |
| dplyr verb mapping (filter, select, mutate, arrange) | ./references/polars-dplyr.md |
| summarise / group_by equivalents | ./references/polars-dplyr.md |
| tidyr verbs (pivot_longer, pivot_wider, separate, unite) | ./references/polars-dplyr.md |
| Join operations (left_join, inner_join, anti_join) | ./references/polars-dplyr.md |
| String operations (stringr vs polars .str) | ./references/polars-strings-dates-factors.md |
| Date operations (lubridate vs polars .dt) | ./references/polars-strings-dates-factors.md |
| across() / where() equivalents | ./references/polars-dplyr.md |
| case_when equivalent | ./references/polars-dplyr.md |
| readr I/O equivalents | ./references/polars-dplyr.md |
| fixest formula syntax in pyfixest | ./references/regression-modeling.md |
| lm() / glm() in statsmodels | ./references/regression-modeling.md |
| Formula interface comparison (three Python dialects) | ./references/regression-modeling.md |
| Standard error specification differences | ./references/regression-modeling.md |
| plm panel models in linearmodels | ./references/regression-modeling.md |
| lme4 mixed effects equivalents | ./references/regression-modeling.md |
| marginaleffects (R to Python) | ./references/regression-modeling.md |
| Model summary / tidy output | ./references/regression-modeling.md |
| Sandwich / robust SE equivalents | ./references/regression-modeling.md |
| ggplot2 layer mapping to plotnine | ./references/visualization.md |
| aes() string quoting in plotnine | ./references/visualization.md |
| Theme customization | ./references/visualization.md |
| Scale functions | ./references/visualization.md |
| Faceting (facet_wrap, facet_grid) | ./references/visualization.md |
| plotly R vs plotly Python | ./references/visualization.md |
| ggsave equivalent | ./references/visualization.md |
| Difference-in-differences (did, did2s) | ./references/causal-inference.md |
| Regression discontinuity (rdrobust) | ./references/causal-inference.md |
| Instrumental variables (ivreg vs pyfixest IV) | ./references/causal-inference.md |
| Event study designs | ./references/causal-inference.md |
| Synthetic control | ./references/causal-inference.md |
| Matching / propensity scores | ./references/causal-inference.md |
| survey package to svy | ./references/survey-spatial-ml.md |
| svydesign / svymean / svyglm equivalents | ./references/survey-spatial-ml.md |
| sf spatial operations to geopandas | ./references/survey-spatial-ml.md |
| CRS / projection handling | ./references/survey-spatial-ml.md |
| Spatial joins (st_join vs sjoin) | ./references/survey-spatial-ml.md |
| tidymodels pipeline to scikit-learn | ./references/survey-spatial-ml.md |
| RStudio vs DAAF workflow | ./references/workflow-environment.md |
| Quarto / RMarkdown vs marimo | ./references/workflow-environment.md |
| Interactive console vs file-first execution | ./references/workflow-environment.md |
| Package management (renv vs pip/uv) | ./references/workflow-environment.md |
| Project structure conventions | ./references/workflow-environment.md |
| Curated R-to-Python migration guides | ./references/external-resources.md |
| Package documentation links | ./references/external-resources.md |
| Tutorial recommendations with provenance | ./references/external-resources.md |
| 1-indexed list/vector access | ./references/gotchas.md |
| Factor vs Categorical pitfalls | ./references/gotchas.md |
| library() vs import habits | ./references/gotchas.md |
| T/F vs True/False | ./references/gotchas.md |
| Assignment operator (<- vs =) | ./references/gotchas.md |
| Vectorized operations expectations | ./references/gotchas.md |
| NULL vs None differences | ./references/gotchas.md |
| apply family vs map/list comprehension | ./references/gotchas.md |
| Copying semantics (R copy-on-modify vs Python references) | ./references/gotchas.md |
| Logical operators (`&` / `\|` vs `and` / `or`) | ./references/gotchas.md |
| String interpolation (glue vs f-strings) | ./references/gotchas.md |
| data.table vs polars | ./references/polars-strings-dates-factors.md |
| Lazy evaluation (polars LazyFrame vs R lazy tibble) | ./references/polars-dplyr.md |
| nest/unnest equivalents | ./references/polars-dplyr.md |
| Window functions (over vs mutate + group_by) | ./references/polars-dplyr.md |
| Coordinate systems (coord_flip, coord_polar) | ./references/visualization.md |
| Stat layers (stat_smooth, stat_summary) | ./references/visualization.md |
| Color palette mapping (viridis, brewer) | ./references/visualization.md |
| Multi-panel layouts (patchwork vs subplot) | ./references/visualization.md |
| Staggered DiD estimators | ./references/causal-inference.md |
| Parallel trends testing | ./references/causal-inference.md |
| BRR / jackknife replication weights | ./references/survey-spatial-ml.md |
| Raster data handling (terra vs rasterio) | ./references/survey-spatial-ml.md |
| Feature engineering (recipes vs sklearn Pipeline) | ./references/survey-spatial-ml.md |
| Cross-validation (rsample vs sklearn) | ./references/survey-spatial-ml.md |
| Environment/workspace differences (.RData vs nothing) | ./references/workflow-environment.md |
| Debugging workflow (browser() vs breakpoint()) | ./references/workflow-environment.md |
| R help system (?func) vs Python help(func) | ./references/workflow-environment.md |
| Cheat sheet and quick-reference links | ./references/external-resources.md |
| Community resources (Stack Overflow tags, forums) | ./references/external-resources.md |