skill_id: ARCH-FAULT-TOL
version: 1.0.0
last_updated: 2026-01-04
applies_to: [Class A, Class B, Class C]
jurisdiction: [Global]
prerequisites: [ARCH-SAFETY-CLASS]

Fault Tolerance Design

Purpose

Provide patterns for detecting, containing, and recovering from faults in medical device software, scaled to safety class.

When to Apply

Safety-critical control loops, sensing/actuation, communication paths.
Watchdogs, redundancy, health monitoring, self-test.
Power, memory, and comms error handling.

Requirements (testable)

Fault Detection: Implement monitoring for critical resources (tasks, sensors, comms) with thresholds and alarms. Rationale: early detection.
Graceful Degradation: Define degraded modes or safe state when partial functionality fails. Rationale: bounded failure.
Redundancy Strategy: For Class C functions, consider redundancy (sensing, computation, or communication) with voter/consistency checks. Rationale: resilience.
Watchdog Use: Configure hardware/software watchdogs with bounded servicing windows; service only after critical checks pass. Rationale: recover from hangs.
Self-Test/BIST: Run self-tests at startup and periodically for critical components; handle failures deterministically. Rationale: latent fault detection.
Error Propagation Control: Sanitize/contain errors at boundaries; avoid cascading faults. Rationale: containment.
Logging & Alarms: Log and, where required, annunciate safety-relevant faults; ensure tamper-evident logs for post-incident analysis. Rationale: traceability.
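
The graceful degradation requirement can be sketched as a small mode state machine. This is a minimal illustration, not a prescribed design: the names `system_mode_t`, `fault_severity_t`, and `next_mode` are hypothetical, and the two-level severity split is an assumption that a real risk analysis would replace.

```c
/* Hypothetical operating modes, ordered from full function to safe state. */
typedef enum { MODE_NORMAL, MODE_DEGRADED, MODE_SAFE } system_mode_t;

/* Hypothetical severity classes; a real project derives these from its hazard analysis. */
typedef enum { FAULT_NONE, FAULT_MINOR, FAULT_CRITICAL } fault_severity_t;

/* Compute the next mode from the current mode and the worst fault seen this cycle.
 * The mode never improves here: leaving DEGRADED or SAFE is assumed to need an
 * explicit, separately verified recovery path. */
system_mode_t next_mode(system_mode_t current, fault_severity_t fault)
{
    if (current == MODE_SAFE || fault == FAULT_CRITICAL) {
        return MODE_SAFE;       /* safe state is latched */
    }
    if (fault == FAULT_MINOR) {
        return MODE_DEGRADED;   /* partial functionality lost */
    }
    return current;             /* no new fault: hold the current mode */
}
```

Because the function is pure, every transition can be enumerated in a unit test, which helps keep the requirement testable.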

Recommended Practices

Use majority voting or reasonableness checks instead of blind trust in single sensors.
Employ brownout/power-fail detection to enter safe state gracefully.
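On an RTOS, run a dedicated safety monitor task at a higher priority than all non-critical tasks.
Debounce fault signals to reduce false positives, but cap the debounce period with a timeout so that a persistent or intermittent fault is still reported within a bounded time.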
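
The debounce practice above might look like the following sketch. It assumes a fixed sampling period; the names (`fault_debounce_t`, `debounce_update`) and the three thresholds are hypothetical. A fault is confirmed after several consecutive asserted samples, and a chattering signal that never settles is confirmed once the episode reaches the cap.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical thresholds, expressed in samples. */
#define CONFIRM_SAMPLES 3u  /* consecutive asserted samples that confirm a fault */
#define CLEAR_SAMPLES   3u  /* consecutive clear samples that end an episode */
#define CAP_SAMPLES     8u  /* an unresolved episode is confirmed at this age */

typedef struct {
    uint8_t asserted_run;   /* current run of asserted samples */
    uint8_t clear_run;      /* current run of clear samples inside an episode */
    uint8_t episode_age;    /* samples since the episode began; 0 = idle */
} fault_debounce_t;

/* Increment without wrapping past UINT8_MAX. */
static uint8_t sat_inc(uint8_t v)
{
    return (v < UINT8_MAX) ? (uint8_t)(v + 1u) : v;
}

/* Feed one raw sample; returns true while the fault is considered confirmed. */
bool debounce_update(fault_debounce_t *d, bool raw_fault)
{
    if (raw_fault) {
        d->asserted_run = sat_inc(d->asserted_run);
        d->clear_run = 0u;
    } else {
        d->asserted_run = 0u;
        if (d->episode_age == 0u) {
            return false;               /* idle and still clear */
        }
        d->clear_run = sat_inc(d->clear_run);
        if (d->clear_run >= CLEAR_SAMPLES) {
            d->episode_age = 0u;        /* signal settled: end the episode */
            d->clear_run = 0u;
            return false;
        }
    }
    d->episode_age = sat_inc(d->episode_age);
    return (d->asserted_run >= CONFIRM_SAMPLES) || (d->episode_age >= CAP_SAMPLES);
}
```

Counting the cap in samples rather than reading a clock keeps the sketch deterministic; a time-based cap would work the same way given a trusted tick source.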

Patterns

Watchdog servicing with checks:

// REQ-FT-WD-01; TEST-FT-03
void service_watchdog(void) {
    if (critical_tasks_healthy() && comms_alive()) {
        wdt_kick();
    } else {
        // Do not kick; let watchdog reset into safe boot
    }
}
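
One possible way to back a check such as `critical_tasks_healthy()` is a heartbeat table that each critical task updates once per cycle. The sketch below is an assumption about how that check could be built, not part of the pattern itself; `heartbeat_monitor_t`, `hb_init`, `hb_beat`, `hb_all_healthy`, the task count, and the timeout are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical values; real ones come from the system's timing analysis. */
#define MONITORED_TASKS      3u
#define HEARTBEAT_TIMEOUT_MS 50u

typedef struct {
    uint32_t last_beat_ms[MONITORED_TASKS];  /* tick of each task's last heartbeat */
} heartbeat_monitor_t;

/* Give every task a fresh heartbeat at time now_ms. */
void hb_init(heartbeat_monitor_t *m, uint32_t now_ms)
{
    for (uint32_t i = 0u; i < MONITORED_TASKS; i++) {
        m->last_beat_ms[i] = now_ms;
    }
}

/* Called by each monitored task once per cycle; out-of-range ids are ignored. */
void hb_beat(heartbeat_monitor_t *m, uint32_t task_id, uint32_t now_ms)
{
    if (task_id < MONITORED_TASKS) {
        m->last_beat_ms[task_id] = now_ms;
    }
}

/* True only if every task has reported within the timeout.
 * Unsigned subtraction keeps the comparison valid across tick wraparound. */
bool hb_all_healthy(const heartbeat_monitor_t *m, uint32_t now_ms)
{
    for (uint32_t i = 0u; i < MONITORED_TASKS; i++) {
        if ((uint32_t)(now_ms - m->last_beat_ms[i]) > HEARTBEAT_TIMEOUT_MS) {
            return false;
        }
    }
    return true;
}
```

With this in place, a hung task stops calling `hb_beat`, the health check fails, and the watchdog is no longer serviced.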

Sensor plausibility check:

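// REQ-FT-SNS-02; TEST-FT-07
#include <stdbool.h>

// Accepts only readings inside the expected range; a NaN reading fails both comparisons.
bool validate_pressure(float p_kpa) {
    return (p_kpa >= 0.0f && p_kpa <= 300.0f);
}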
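
A range check alone does not catch a reading that jumps implausibly fast while staying in range. As a hedged extension, the sketch below adds a rate-of-change limit to the same 0 to 300 kPa range. The function name `pressure_plausible` and the 20 kPa per-sample step limit are assumptions for illustration; a real limit depends on the physical process and the sampling rate.

```c
#include <stdbool.h>

#define PRESSURE_MIN_KPA      0.0f
#define PRESSURE_MAX_KPA      300.0f
#define PRESSURE_MAX_STEP_KPA 20.0f   /* hypothetical largest credible change per sample */

/* Returns true if curr_kpa is in range and did not move further from
 * prev_kpa than the step limit. NaN in either argument yields false. */
bool pressure_plausible(float prev_kpa, float curr_kpa)
{
    /* Negated form so that a NaN reading is rejected. */
    if (!(curr_kpa >= PRESSURE_MIN_KPA && curr_kpa <= PRESSURE_MAX_KPA)) {
        return false;
    }
    float delta = curr_kpa - prev_kpa;
    if (delta < 0.0f) {
        delta = -delta;               /* absolute change since the last sample */
    }
    return delta <= PRESSURE_MAX_STEP_KPA;
}
```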

Redundant reading vote:

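// REQ-FT-RED-01; TEST-FT-10
#include <math.h>  // fabsf, NAN

// Returns the mean of two redundant readings, or NAN if they disagree.
float fused_temp(float a, float b) {
    // Negated comparison so that a NaN reading also counts as a disagreement.
    if (!(fabsf(a - b) <= 2.0f)) {
        alarm_sensor_disagree();
        enter_safe_state();
        return NAN;  // never hand back a value fused from disagreeing sensors
    }
    return (a + b) * 0.5f;
}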
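
With two sensors a disagreement can be detected but not resolved. Where a third sensor is available, a median-based 2-out-of-3 vote can keep operating through a single sensor fault. The following is a minimal sketch under that assumption; `median3`, `vote3`, and the tolerance parameter are hypothetical names, not part of the pattern above.

```c
#include <stdbool.h>

static float min2(float x, float y) { return (x < y) ? x : y; }
static float max2(float x, float y) { return (x > y) ? x : y; }

/* Absolute difference without depending on the math library. */
static float abs_diff(float x, float y)
{
    return (x > y) ? (x - y) : (y - x);
}

/* Median of three readings: a single outlier can never be selected. */
float median3(float a, float b, float c)
{
    return max2(min2(a, b), min2(max2(a, b), c));
}

/* 2-out-of-3 vote. Writes the median to *out and returns true when at least
 * one other reading agrees with the median within tol; returns false when all
 * three readings disagree, in which case *out is left untouched. */
bool vote3(float a, float b, float c, float tol, float *out)
{
    const float m = median3(a, b, c);
    int agree = 0;
    if (abs_diff(a, m) <= tol) { agree++; }
    if (abs_diff(b, m) <= tol) { agree++; }
    if (abs_diff(c, m) <= tol) { agree++; }
    /* The median always agrees with itself, so two or more means a majority. */
    if (agree < 2) {
        return false;
    }
    *out = m;
    return true;
}
```

A caller would treat a `false` return the same way the two-sensor pattern treats a disagreement: raise the alarm and enter the safe state.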

Anti-Patterns (risks)

Servicing watchdog unconditionally in main loop -> risk: hides deadlocks.
Single-point sensors without plausibility checks -> risk: unsafe outputs.
Logging faults without annunciation where required -> risk: latent hazards.
No degraded mode or safe fallback -> risk: uncontrolled failure behavior.

Verification Checklist

Fault monitors implemented for critical resources with thresholds/timeouts.
Watchdog configuration reviewed; serviced only after health checks.
Degraded modes or safe state defined and reachable on fault.
Redundancy/plausibility checks implemented for critical sensors/paths.
Self-tests executed at startup/periodically; failures handled deterministically.
Errors contained at boundaries; no unchecked propagation.
Faults logged and annunciated as applicable; integrity of logs maintained.

Traceability

Link REQ-FT-### to hazards and controls; map to tests (TEST-FT-###).
Store watchdog and fault monitor configuration with release artifacts.

References

IEC 62304 design/implementation expectations (fault control).
ISO 14971 for risk-driven fault handling.
IEC 60601-1 (power/brownout considerations; informative).

Changelog

1.0.0 (2026-01-04): Initial fault tolerance patterns with watchdog, redundancy, and safe fallback guidance.

Audit History

2026-01-04: Audit performed. Verified:
- Fault tolerance patterns technically accurate
- IEC 60601-1 reference appropriate as informative for power/brownout considerations
- Watchdog and redundancy patterns follow industry best practices
