Tiangong LCA Spec Coding Development Guidelines
This guide focuses on general conventions for engineering collaboration, helping the team understand repository organization, development environment, credential management, and quality self-check procedures. Detailed responsibilities for process extraction and the staged scripts are described in `.github/prompts/extract-process-workflow.prompt.md`.
1. Project Overview
- Objective: Deliver end-to-end automation of the Tiangong LCA Spec Coding workflow, covering paper cleaning, process extraction, exchange alignment, data merging, TIDAS validation, and final delivery.
- Core directories:
  - `src/tiangong_lca_spec/`: Workflow services, MCP clients, data models, and logging utilities.
  - `scripts/md/`: Staged CLIs from `stage1_preprocess.py` through `stage4_publish.py`, plus the regression entry point `run_test_workflow.py`.
  - `scripts/jsonld/`: JSON-LD extraction, validation, and publishing helpers.
  - `.github/prompts/`: Prompt specifications for Codex, with `extract-process-workflow.prompt.md` dedicated to the process extraction task.
  - `scripts/kb/`: Knowledge base tooling (e.g., `import_ris.py` for ingestion, `minio_fetch.py` for downloading parsed bundles, and `retrieve.py` for issuing test retrieval queries) for pushing bibliographic PDFs into Tiangong datasets.
  - `test/`: Unit tests and regression fixtures.
- Collaboration interfaces: The standard workflow depends on `.secrets/secrets.toml`, where the OpenAI and Tiangong LCA Remote services are configured. During your first integration, validate credentials before running Stage 3 or any later stage in batch.
- Further references: Requirements, alignment strategies, and exception handling for each stage are documented in `.github/prompts/extract-process-workflow.prompt.md`. For supplemental classification or geographic information, use the helper CLIs provided by `scripts/md/list_*_children.py`.
- Stage 4 flow publishing: When filling in missing flow definitions, the publisher now leans on the configured LLM to infer both the flow type and the most specific product classification. Follow the credential setup above so the scripts can call `scripts/md/list_product_flow_category_children.py` via the LLM-assisted selector.
- Run ID management: The markdown pipeline continues to use the default `artifacts/.latest_run_id`, which downstream stages reuse whenever `--run-id` is omitted. JSON-LD stages keep their own `artifacts/.latest_jsonld_run_id`; running `scripts/jsonld/run_pipeline.py` without `--run-id` generates a fresh identifier and records it there, and Stage 2/Stage 3 fall back to that file when `--run-id` is omitted. Pass `--run-id` explicitly to rerun an older pipeline output.
2. Development Environment and Dependencies
- Python version: ≥ 3.12. Manage it with the `uv` toolchain; the default virtual environment lives in `.venv/`.
- Command conventions: Workstations do not expose a system-level `python`. Use `uv run python …` or `uv run -- python script.py`. For one-liners, use `uv run python - <<'PY'`.
- Dependency installation:

  ```shell
  uv sync               # Install runtime dependencies
  uv sync --upgrade     # Upgrade all dependencies to the latest allowed versions
  uv sync --group dev   # Install development dependencies, including black/ruff
  ```

  Set `UV_PYPI_URL=https://pypi.tuna.tsinghua.edu.cn/simple` temporarily if you need a mirror.
- Key runtime libraries: `anyio`, `httpx`, `jsonschema`, `langgraph`, `mcp`, `openai`, `pydantic`, `pydantic-settings`, `python-dotenv`, `structlog`, `tenacity`. (`langgraph` is currently included for future graph-style workflows; the shipped CLIs/orchestrator do not depend on it yet.)
- Build system: The project uses `hatchling`. In `pyproject.toml`, `[tool.hatch.build.targets.wheel]` declares `src/tiangong_lca_spec` as the build target.
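A throwaway one-liner (run it via `uv run python - <<'PY'` … `PY`) can confirm the interpreter satisfies the ≥ 3.12 floor; this snippet is only an illustration, not part of the repo:

```python
import sys


def meets_python_floor(
    version: tuple[int, int] = tuple(sys.version_info[:2]),
    minimum: tuple[int, int] = (3, 12),
) -> bool:
    """Return True when the interpreter meets the project's version floor."""
    return version >= minimum
```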
3. Credentials and Remote Services
- Copy the template to create a local configuration: `cp .secrets/secrets.example.toml .secrets/secrets.toml`.
- Edit `.secrets/secrets.toml`:
  - `[openai]`: `api_key`, `model` (default `gpt-5`; override as needed).
  - `[tiangong_lca_remote]`: `url`, `service_name`, `tool_name`, `api_key`.
  - Additional MCP services: add extra tables (e.g., `[tiangong_kb_remote]`, `[tavily_web_mcp]`) with `transport`, `service_name`, `url`, `api_key`, optional `timeout`, and optional `api_key_header`/`api_key_prefix` to override the default `Authorization: Bearer` header; all such blocks are loaded into `mcp_connections` so `MCPToolClient` can talk to multiple servers concurrently.
  - `[kb]`: `base_url`, `dataset_id`, `api_key`, optional `timeout`, and `metadata_fields` (defaults already set to the `meta` and `category` fields).
  - `[kb.pipeline]`: `datasource_type`, `start_node_id`, `is_published`, `response_mode`, and optional `inputs` for the RAG pipeline runner. The pipeline node ID is available from the dataset's pipeline designer.
  - `[minio]`: `endpoint`, `access_key`, `secret_key`, `bucket_name`, and `prefix` for the KB bundle bucket; optional `secure` (defaults to the endpoint scheme when present, otherwise `https`) and `session_token` are supported for custom deployments.
- Write plaintext tokens directly into `api_key`; the framework defaults to `Authorization: Bearer <token>`. Override with `api_key_header`/`api_key_prefix` when the service requires it.
- Before running Stage 3, call `FlowSearchService` with one or two sample exchanges as a connectivity self-test (see the workflow prompt document for Python snippets).
- If operations has already provisioned `.secrets/secrets.toml`, Codex uses it as-is. Only revisit the local configuration when scripts raise missing-credential errors or connection failures.
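A minimal `.secrets/secrets.toml` sketch follows. The table and key names mirror the list above; every value is a placeholder you must replace, and optional tables (`[minio]`, extra MCP services) are omitted here:

```toml
[openai]
api_key = "replace-me"            # plaintext token, placeholder
model = "gpt-5"                   # default; override as needed

[tiangong_lca_remote]
url = "https://replace-me.example/mcp"
service_name = "replace-me"
tool_name = "replace-me"
api_key = "replace-me"

[kb]
base_url = "https://replace-me.example/v1"
dataset_id = "replace-me"
api_key = "replace-me"

[kb.pipeline]
datasource_type = "local_file"    # typical value per this guide
start_node_id = "copy-from-pipeline-designer"
```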
- Local TIDAS validation relies on the CLI command `uv run tidas-validate -i artifacts`, which `scripts/md/stage3_align_flows.py` runs by default (disable with `--skip-artifact-validation`). No additional MCP credentials are required for this step.
Knowledge base ingestion
- Populate the `[kb]` section in `.secrets/secrets.toml` with the real host (e.g., `https://<kb-host>/v1`), dataset ID, and API key.
- Configure `[kb.pipeline]` so the importer can trigger the published RAG pipeline: `datasource_type` should match your FILE block (typically `local_file`), `start_node_id` must be copied from the pipeline designer, and `inputs` carries any required input-field values. If omitted, pipeline defaults apply.
- Default metadata includes `meta` (auto-generated citation text) and `category` (taken from the first subdirectory under `input_data/`, e.g., `battery`). Override via `--category` if needed.
- Use `uv run python scripts/kb/import_ris.py --ris-dir input_data/<dir>` (or `--ris-path ...`) to ingest RIS files; add `--dry-run` for previews. Attachments must live under the same `input_data/<dir>` root.
- Use `uv run python scripts/kb/retrieve.py --query "<text>" --top-k 10` as the default sanity-check command when retrieving KB chunks; agents should prefer this invocation unless a task explicitly requests different parameters.
- The importer now uploads files through the dataset pipeline (`/pipeline/file-upload` + `/pipeline/run`), so the configured pipeline stages run exactly as in the UI workflow. Metadata is attached after the pipeline reports the generated document IDs.
- When you need to pull parsed artifacts from MinIO, populate `[minio]` as described above and run `uv run python scripts/kb/minio_fetch.py list --path <remote_subdir>` to inspect available bundles, or `uv run python scripts/kb/minio_fetch.py download --path <remote_subdir> --output input_data/<dir> --include-source` to materialize the `meta.txt`, `parsed.json`, `pages/`, and optional `source.pdf` files locally. Omit `--include-source` to skip the PDF.
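The default-category rule above (first subdirectory under `input_data/`) can be illustrated with a tiny helper; this is a hypothetical sketch of the documented rule, not the importer's actual code:

```python
from pathlib import Path


def default_category(attachment: Path, root: Path = Path("input_data")) -> str:
    """Derive the default category from the first subdirectory under root.

    Hypothetical illustration of the rule described in this guide; pass
    --category to the importer to override the derived value.
    """
    return attachment.relative_to(root).parts[0]
```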
4. Quality Assurance and Self-Checks
- After modifying Python source code, run:

  ```shell
  uv run black .
  uv run ruff check
  uv run python -m compileall src scripts
  uv run pytest
  ```

- When changes involve process extraction or alignment logic, prioritize a minimal end-to-end run (literature pipeline: Stage 1 → Stage 4; JSON-LD pipeline: Stage 1 → Stage 3). Command examples and stage requirements are in `.github/prompts/extract-process-workflow.prompt.md`.
- Structured logging defaults to `structlog`. While running CLIs, monitor `flow_alignment.*`, `process_extraction.*`, and similar events to localize issues quickly.
- Before committing, ensure intermediate files under `artifacts/` are not accidentally staged. Clean them locally or add them to `.gitignore` if needed.
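The staging check above can be automated with a small helper fed the output of `git diff --cached --name-only`; this is an illustrative sketch, not an existing repo script:

```python
def staged_artifact_paths(staged_paths: list[str]) -> list[str]:
    """Return staged paths that live under artifacts/.

    Feed this the lines from `git diff --cached --name-only`; a non-empty
    result means intermediate files are about to be committed.
    """
    return [p for p in staged_paths if p.startswith("artifacts/")]
```

A pre-commit hook could call this and abort the commit when the result is non-empty.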
5. Support and Communication
- Consult the docstrings and type definitions in the relevant modules under `src/tiangong_lca_spec/` to keep terminology consistent when questions arise.
- If external services are unavailable or time out for extended periods, capture reproduction steps and log excerpts, then escalate to operations or the workflow owner promptly.
- When expanding prompts or reorganizing workflow responsibilities, update the corresponding documents in `.github/prompts/` first to keep roles aligned with this guide.