Tiangong LCA Spec Coding Development Guidelines
This guide focuses on general conventions for engineering collaboration, helping the team understand repository organization, development environment, credential management, and quality self-check procedures. Detailed responsibilities for process extraction and the staged scripts are described in `.github/prompts/extract-process-workflow.prompt.md`.
1. Project Overview
- Objective: Deliver end-to-end automation of the Tiangong LCA Spec Coding workflow, covering paper cleaning, process extraction, exchange alignment, data merging, TIDAS validation, and final delivery.
- Core directories:
  - `src/tiangong_lca_spec/`: Workflow services, MCP clients, data models, and logging utilities.
  - `scripts/md/`: Staged CLIs from `stage1_preprocess.py` through `stage4_publish.py`, plus the regression entry point `run_test_workflow.py`.
  - `scripts/jsonld/`: JSON-LD extraction, validation, and publishing helpers.
  - `.github/prompts/`: Prompt specifications for Codex, with `extract-process-workflow.prompt.md` dedicated to the process extraction task.
  - `scripts/kb/`: Knowledge base tooling (e.g., `import_ris.py` for ingestion, `minio_fetch.py` for downloading parsed bundles, and `retrieve.py` for issuing test retrieval queries) for pushing bibliographic PDFs into Tiangong datasets.
  - `test/`: Unit tests and regression fixtures.
- Collaboration interfaces: The standard workflow depends on `.secrets/secrets.toml`, where the OpenAI and Tiangong LCA Remote services are configured. During your first integration, validate credentials before running Stage 3 or any later stage in batch.
- Further references: Requirements, alignment strategies, and exception handling for each stage are documented in `.github/prompts/extract-process-workflow.prompt.md`. For supplemental classification or geographic information, use the helper CLIs provided by `scripts/md/list_*_children.py`.
- Stage 4 flow publishing: When filling in missing flow definitions, the publisher now leans on the configured LLM to infer both the flow type and the most specific product classification. Follow the credential setup above so the scripts can call `scripts/md/list_product_flow_category_children.py` via the LLM-assisted selector.
- Run ID management: The markdown pipeline continues to use the default `artifacts/.latest_run_id`, which downstream stages reuse whenever `--run-id` is omitted. JSON-LD stages keep their own `artifacts/.latest_jsonld_run_id`; running `scripts/jsonld/run_pipeline.py` without `--run-id` generates a fresh identifier and records it there, and Stage 2/Stage 3 fall back to that file when `--run-id` is omitted. Pass `--run-id` explicitly to rerun an older pipeline output.
2. Development Environment and Dependencies
- Python version: ≥ 3.12. Manage it with the `uv` toolchain; the default virtual environment lives in `.venv/`.
- Command conventions: Workstations do not expose a system-level `python`. Use `uv run python …` or `uv run -- python script.py`. For one-liners, use `uv run python - <<'PY'`.
- Dependency installation:

  ```shell
  uv sync               # Install runtime dependencies
  uv sync --upgrade     # Upgrade all dependencies to the latest allowed versions
  uv sync --group dev   # Install development dependencies, including black/ruff
  ```

  Set `UV_PYPI_URL=https://pypi.tuna.tsinghua.edu.cn/simple` temporarily if you need a mirror.
- Key runtime libraries: `anyio`, `httpx`, `jsonschema`, `langgraph`, `mcp`, `openai`, `pydantic`, `pydantic-settings`, `python-dotenv`, `structlog`, `tenacity`. (`langgraph` is currently included for future graph-style workflows; the shipped CLIs/orchestrator do not depend on it yet.)
- Build system: The project uses `hatchling`. In `pyproject.toml`, `[tool.hatch.build.targets.wheel]` declares `src/tiangong_lca_spec` as the build target.
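A throwaway one-liner (run it via `uv run python - <<'PY'` … `PY`) can confirm the interpreter satisfies the ≥ 3.12 floor; this snippet is only an illustration, not part of the repo:

```python
import sys


def meets_python_floor(
    version: tuple[int, int] = tuple(sys.version_info[:2]),
    minimum: tuple[int, int] = (3, 12),
) -> bool:
    """Return True when the interpreter meets the project's version floor."""
    return version >= minimum
```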
3. Credentials and Remote Services
- Copy the template to create a local configuration: `cp .secrets/secrets.example.toml .secrets/secrets.toml`.
- Edit `.secrets/secrets.toml`:
  - `[openai]`: `api_key`, `model` (default `gpt-5`; override as needed).
  - `[tiangong_lca_remote]`: `url`, `service_name`, `tool_name`, `api_key`.
  - Additional MCP services: add extra tables (e.g., `[tiangong_kb_remote]`, `[tavily_web_mcp]`) with `transport`, `service_name`, `url`, `api_key`, optional `timeout`, and optional `api_key_header`/`api_key_prefix` to override the default `Authorization: Bearer` header; all such blocks are loaded into `mcp_connections` so `MCPToolClient` can talk to multiple servers concurrently.
  - `[kb]`: `base_url`, `dataset_id`, `api_key`, optional `timeout`, and `metadata_fields` (defaults already set to the `meta` and `category` fields).
  - `[kb.pipeline]`: `datasource_type`, `start_node_id`, `is_published`, `response_mode`, and optional `inputs` for the RAG pipeline runner. The pipeline node ID is available from the dataset's pipeline designer.
  - `[minio]`: `endpoint`, `access_key`, `secret_key`, `bucket_name`, and `prefix` for the KB bundle bucket; optional `secure` (defaults to the endpoint scheme when present, otherwise `https`) and `session_token` are supported for custom deployments.
- Write plaintext tokens directly into `api_key`; the framework defaults to `Authorization: Bearer <token>`. Override with `api_key_header`/`api_key_prefix` when the service requires it.
- Before running Stage 3, call `FlowSearchService` with one or two sample exchanges as a connectivity self-test (see the workflow prompt document for Python snippets).
- If operations has already provisioned `.secrets/secrets.toml`, Codex uses it as-is. Only revisit the local configuration when scripts raise missing-credential errors or connection failures.
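A minimal `.secrets/secrets.toml` sketch follows. The table and key names mirror the list above; every value is a placeholder you must replace, and optional tables (`[minio]`, extra MCP services) are omitted here:

```toml
[openai]
api_key = "replace-me"            # plaintext token, placeholder
model = "gpt-5"                   # default; override as needed

[tiangong_lca_remote]
url = "https://replace-me.example/mcp"
service_name = "replace-me"
tool_name = "replace-me"
api_key = "replace-me"

[kb]
base_url = "https://replace-me.example/v1"
dataset_id = "replace-me"
api_key = "replace-me"

[kb.pipeline]
datasource_type = "local_file"    # typical value per this guide
start_node_id = "copy-from-pipeline-designer"
```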
- Local TIDAS validation relies on the CLI command `uv run tidas-validate -i artifacts`, which `scripts/md/stage3_align_flows.py` runs by default (disable with `--skip-artifact-validation`). No additional MCP credentials are required for this step.
Knowledge base ingestion
- Populate the `[kb]` section in `.secrets/secrets.toml` with the real host (e.g., `https://<kb-host>/v1`), dataset ID, and API key.
- Configure `[kb.pipeline]` so the importer can trigger the published RAG pipeline: `datasource_type` should match your FILE block (typically `local_file`), `start_node_id` must be copied from the pipeline designer, and `inputs` carries any required input-field values. If omitted, pipeline defaults apply.
- Default metadata includes `meta` (auto-generated citation text) and `category` (taken from the first subdirectory under `input_data/`, e.g., `battery`). Override via `--category` if needed.
- Use `uv run python scripts/kb/import_ris.py --ris-dir input_data/<dir>` (or `--ris-path ...`) to ingest RIS files; add `--dry-run` for previews. Attachments must live under the same `input_data/<dir>` root.
- Use `uv run python scripts/kb/retrieve.py --query "<text>" --top-k 10` as the default sanity-check command when retrieving KB chunks; agents should prefer this invocation unless a task explicitly requests different parameters.
- The importer now uploads files through the dataset pipeline (`/pipeline/file-upload` + `/pipeline/run`), so the configured pipeline stages run exactly as in the UI workflow. Metadata is attached after the pipeline reports the generated document IDs.
- When you need to pull parsed artifacts from MinIO, populate `[minio]` as described above and run `uv run python scripts/kb/minio_fetch.py list --path <remote_subdir>` to inspect available bundles, or `uv run python scripts/kb/minio_fetch.py download --path <remote_subdir> --output input_data/<dir> --include-source` to materialize the `meta.txt`, `parsed.json`, `pages/`, and optional `source.pdf` files locally. Omit `--include-source` to skip the PDF.
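The default-category rule above (first subdirectory under `input_data/`) can be illustrated with a tiny helper; this is a hypothetical sketch of the documented rule, not the importer's actual code:

```python
from pathlib import Path


def default_category(attachment: Path, root: Path = Path("input_data")) -> str:
    """Derive the default category from the first subdirectory under root.

    Hypothetical illustration of the rule described in this guide; pass
    --category to the importer to override the derived value.
    """
    return attachment.relative_to(root).parts[0]
```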
4. Quality Assurance and Self-Checks
- After modifying Python source code, run:

  ```shell
  uv run black .
  uv run ruff check
  uv run python -m compileall src scripts
  uv run pytest
  ```

- When changes involve process extraction or alignment logic, prioritize a minimal end-to-end run (literature pipeline: Stage 1 → Stage 4; JSON-LD pipeline: Stage 1 → Stage 3). Command examples and stage requirements are in `.github/prompts/extract-process-workflow.prompt.md`.
- Structured logging defaults to `structlog`. While running CLIs, monitor `flow_alignment.*`, `process_extraction.*`, and similar events to localize issues quickly.
- Before committing, ensure intermediate files under `artifacts/` are not accidentally staged. Clean them locally or add them to `.gitignore` if needed.
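The staging check above can be automated with a small helper fed the output of `git diff --cached --name-only`; this is an illustrative sketch, not an existing repo script:

```python
def staged_artifact_paths(staged_paths: list[str]) -> list[str]:
    """Return staged paths that live under artifacts/.

    Feed this the lines from `git diff --cached --name-only`; a non-empty
    result means intermediate files are about to be committed.
    """
    return [p for p in staged_paths if p.startswith("artifacts/")]
```

A pre-commit hook could call this and abort the commit when the result is non-empty.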
5. Support and Communication
- Consult the docstrings and type definitions in the relevant modules under `src/tiangong_lca_spec/` to keep terminology consistent when questions arise.
- If external services are unavailable or time out for extended periods, capture reproduction steps and log excerpts, then escalate to operations or the workflow owner promptly.
- When expanding prompts or reorganizing workflow responsibilities, update the corresponding documents in `.github/prompts/` first to keep roles aligned with this guide.