

# Dependable and Safety-Critical Systems

I



### References

- Practical Design of Safety-Critical Computer Systems, William R. Dunn, Reliability Press,
   2002
- MIL-STD-882E: Standard Practice for System Safety

2

# **BASIC CONCEPTS**

3

Reduce Handle Handle at Run Time Categorite Prevent Blame Yourself Runs normally Fghore First

### Fault Model: Which Faults Are We Considering?

- Hardware examples
  - 3.3 V regulator fails.
    - Fail short (5 V), fail open (0 V), fail at some other voltage?
  - Inductor not soldered fully (open circuit)
  - I<sup>2</sup>C pullup resistor open
  - LCD module connection has some open circuits
  - RAM, Flash ROM bit flips
  - CPU multiply instruction fails?
- Software examples
  - Bugs in code: Generic, could do almost anything!
  - Need to be more specific (Project)
    - Corrupt important shared data
    - Reconfigure or disable the ADC or TPM
    - Overwrite TPM PWM duty cycle

- Reconfigure the port pins.
- Disable CPU clock, drop the CPU clock speed so current rises.
- Disable all interrupts
- Keep ADC from generating interrupt
- Put an infinite loop in a high-priority ISR
- Keep key threads from running (lower their priority, hog processor, take but don't release resources they need)
- Misuse queues (make full or empty)
- Use up RTOS object memory preventing queues from working
- Overflow the stack
- Crash the RTOS

5



## Software doesn't wear out. It's already broken.

- Embedded systems should always work, but it's not easy to get the software perfect
  - Must handle many different scenarios and error conditions
  - Hard to anticipate all of them when creating requirements and code
  - Hard, slow and expensive to test all of them (if even possible)
- So what should we do?

6

How Could The System Fail?



What could possibly go wrong?

- Software isn't perfect, so it has latent defects lurking, waiting to be triggered by a fault (input and/or condition)
  - Bad inputs or bad system state → defect activated → device fails to perform properly
  - Good inputs and system state, but not considered in design → defect activated → device fails to perform properly
- Hardware failure (transient or permanent) → device fails to perform properly

### Responses

- How should system respond to each fault?
  - Ignore the problem?
    - "Problem? What problem?"
  - Declare out of scope?
    - "Not my problem!"
  - Stop working?
    - Shut down.
    - Fail Stop.
  - Stop working in a way that minimizes further damage?
    - Move to safe mode, then stop.
    - Fail Safe (subset of Fail Stop)

- Keep working (as able)?
  - Handle failure and continue (possibly in degraded mode)
  - Fail Operate. Much harder to provide this, but very useful.
- Restart and hope it doesn't happen again?
  - May want to save a checkpoint.
- How should system respond to multiple concurrent faults?
  - Non-trivial (hard)

Ö

### **Common Types of Solutions**

- Design the fault out of the system
  - May be feasible to modify design so that fault is impossible
- Design-in or code-in the fault, but find it as early as possible during development
  - Testing, peer review

- Don't design-in or code-in the fault
  - Be more careful!

- Accept the fault may happen, so design system to tolerate it during operation
  - System masks the fault. E.g. RAM protected by hardware ECC (error-correcting code).
  - System detect the faults and responds (manages it)

9

### **Basic Fault Management Concepts**

- Start with
  - System diagram
  - Fault model
- Identify where system is vulnerable to faults
- Prioritize faults
- For each fault
  - Define a fault-management strategy (more later in process discussion)
    - Prevent during development?
    - Handle during operation?
- Fault handling during operation
  - Develop detection method
  - Develop response (continue operating? restart?)
  - Discuss, implement, test





# **FAULT DETECTION AND RESPONSE**

### Range of Fault Detection Methods

 As you know more about the application, more precise fault detection becomes possible

Coarse-Grain Example: Apply redundancy at very high level

- Duplicate the controller and compare the outputs
- Mismatched outputs indicate fault has occurred
  - Misses common-mode faults
- Apply generic Built-In Test (BIT)
  - Memory tests, CPU tests, etc.
- Safety Monitors
  - Watchdog timer (WDT)
  - Memory protection unit (MPU) or memory management unit (MMU),
  - Low voltage detector
  - Others possible...
- More detailed application knowledge → more accurate fault detection
  - Input range checking, variable protection with CRC, etc.



## **Detecting Data Faults**



- Store extra information to confirm valid updates
- Use extra memory (fine-grain to coarse-grain)

5: 000101

58: 111010

- Parity bit, EDC/ECC
- Data copy/copies
- Data complement
- CRC/hash of data
- Read: Confirm validity before use
- Write: Update redundancy with data
- Is data reasonable, within expected range?
  - Simple bounds
  - Near previous value (or trend)



Other ways?



Use: Check all all 3coptes match?
Else, if 2 match, use that value
and upSate the other

### Responding to Data Faults

- Logging
  - Log a fault code?
- Processing flow
  - Restart system?
    - Save checkpoint first?
  - Switch to degraded mode?
    - Which data value to use? Corrected, old, estimated...
  - Continue normally
    - Which data value to use? Corrected, old, estimated...





# Protecting the HBLED Buck Driver (High and Low)

| Fault                                                                        | Detection Methods                                                                                      | Possible Responses                       |
|------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------|------------------------------------------|
| Corrupt important shared data.                                               | Hide data (make it file local), add access functions, integrity check (CRC/complement), range checking | Use previous value? Clip to valid range? |
| Reconfigure or disable the ADC or TPM.                                       | Check configuration periodically?                                                                      | Scrub peripheral configuration           |
| Overwrite TPM PWM duty cycle                                                 | ?                                                                                                      | ?                                        |
| Reconfigure the port pins (if not lockable)                                  | Check configuration periodically?                                                                      | Scrub peripheral configuration           |
| Disable CPU clock or drop the CPU clock speed so peak current rises.         | Check configuration periodically?                                                                      | Scrub clock configuration                |
| Disable ADC interrupt (or all), put an infinite loop in a high-priority ISR. | WDT                                                                                                    | ?                                        |
| Keep key threads from running (via priority or shared resources).            | WDT                                                                                                    | ?                                        |
| Misuse queues (make full or empty),                                          | Test status returned by OS calls                                                                       | ?                                        |
| Use up RTOS object memory preventing queues from working.                    | Test status returned by OS calls                                                                       | ?                                        |
| Overflow the stack.                                                          | Perform stack overflow detection                                                                       | ?                                        |
| Crash the RTOS                                                               | WDT                                                                                                    | Reset                                    |

## **SYSTEM ARCHITECTURE OPTIONS**

## System Architecture Options – Independent of Application



Fig. 4. Simplex, no fault tolerance



Fig. 5. Simplex, disengagement features



Fig. 6. Dual standby FT architecture

- Built-in Test (BIT) used to detect faults
- Functional Safety requirements (e.g. IEC 60730) specify running BIT
  - At start-up
  - Periodically

### More System Architecture Options



Fig. 7. Self-checking pair



Fig. 8. Self-checking pair with simplex fault down

- Note that inputs are still shared between computers (channels)
  - Single point of failure. If the oven's thermal sensor fails, then both channels will be wrong (but agree)

### More System Architecture Options



Fig. 9. Dual self-checking pair



Fig. 10. Triple modular redundancy

- Duplicate sensors to eliminate single point of failure
- Reduces some common mode failures, but some may remain

## **Safety Invariant Conditions**

- Can we define simple tests which tell us system may be operating unsafely?
- Then we can add monitor to disable system when we enter the maybe unsafe zone
- Where do we draw the line?
  - False positive: System is safe, but test says maybe unsafe
  - False negative: System is unsafe, but test says not unsafe (i.e. safe)



### **External Safety Monitor**

- Use separate simple device to monitor system, disable it if safety invariant conditions are violated
- Over-temperature protection for oven
  - Temperature must never rise above 500º F
  - Thermal switch (bimetallic) opens circuit above max. allowed temperature, removing power to heating element (or controller?)
- Over-current protection for switch-mode power conversion and motor control
  - Motor current never above 5 A
  - Control system will eventually detect and respond to over-current, but may be too late to prevent damage
  - Use dedicated hardware: analog comparator triggers when overcurrent, turns off PWM output via fault input



# **HBLED DRIVER ANALYSIS**

**Current and Temperature Limits** 

- Transistor Q3: BSS215P.
  - Max I<sub>D</sub> @ 25C = -1.5 A.
  - $R_{DS(on)} = 105-280 \text{ m}\Omega$
- Diode D1: DB2J20900LSMD
  - $I_{FAve} = 0.5 A$ ,  $I_{FSurge} = 3 A$
- Inductor L1: SDR0604-680KLCT-ND.
  - $I_{RMS Max} = 0.62 A$
  - $I_{sat} = 0.84 A$
  - $R_{DCMax} = 0.52\Omega$
- LED1: MLEAWT-A1-0000-0003F7
  - Nominal  $I_{max} = 500 \text{ mA}$
  - T<sub>max</sub> for LED Junction = 150º C



**NC STATE UNIVERSITY** 

## Fault Tree Analysis for HBLED Driver

• How can the hardware be damaged?

Too much current for LED at given Temperature Q3-fail closed ctt. Metal shorting Q3 Inductor fail short or Saturate Bad Soldering Bad Temp. reading

Crashing the HBLED Buck Driver with Code

Reconfigure or disable the ADC or TPM

- Overwrite TPM PWM duty cycle
- Reconfigure the port pins
- Disable CPU clock, drop the CPU clock speed so current rises
- Disable all interrupts, keep ADC from generating interrupt, put an infinite loop in a high-priority ISR
- Keep key threads from running (lower their priority, hog processor, take but don't release resources they need)
- Misuse queues (make full or empty)
- Use up RTOS object memory preventing queues from working
- Overflow the stack
- Corrupt important shared data
- Crash the RTOS



g\_enable\_flash g\_peak\_ set\_current

g\_flash\_period

**NC STATE UNIVERSITY** 

# Generic Methods to Tolerate Imperfect Software

- Software methods
  - Architecture and Implementation:
    - Structure software, hardware to isolate critical functions
    - Provide redundancy in time or space (multicore) for fault detection (and masking)
  - Development Process:
    - Design, implement and test that critical software much more carefully than the rest. Refer to Safety-Critical System processes

- Hardware support
  - Use memory protection unit (MPU) or MMU to prevent tasks from corrupting each other's memory or control registers
  - Detect when program is out of control, and reset it
    - Watchdog timer covered next
    - Safety monitor
  - Use MCU peripheral hardware limitations (e.g. only one write to register allowed after reset)
  - Apply thread privilege levels to prevent execution of critical instructions, access to key control registers
  - Etc.

## Examples of Limited Privileges for Cortex-M0+

- From ARM CM0+ Device Generic User Guide
- Types
  - All handler code is privileged
  - Thread code may or may not be privileged
- Unprivileged code
  - Limited use of MSR, MRS
  - Cannot use CPS to mask interrupts
  - Cannot access system timer, NVIC, system control block
  - Possible restricted access to memory or peripherals
  - Cannot write to CONTROL register
- Unprivileged thread uses SVC (service call, SW interrupt) to request execution of privileged software
- Is option for CM0+, but not provided in our KL25Z MCU

### Protecting the HBLED Buck Driver

- Reconfigure or disable the ADC or TPM scrub system to reconfigure peripherals correctly
- Overwrite TPM PWM duty cycle
- Reconfigure the port pins. Scrub system.
- Disable CPU clock, drop the CPU clock speed so current rises. Check CPU clock speed periodically
- Disable all interrupts, keep ADC from generating interrupt, put an infinite loop in a high-priority ISR. WDT
- Keep key threads from running (lower their priority, hog processor, take but don't release resources they need). WDT.
- Misuse queues (make full or empty)
- Use up RTOS object memory preventing queues from working. Check return codes
- Overflow the stack. Perform stack overflow detection and protection.
- Corrupt important shared data. Hide data (make it file local), add access functions, integrity check (CRC/complement), range checking, duty cycle < period?.</li>
- Crash the RTOS
- (Temperature sensor)

## **DEVELOPMENT PROCESSES**

Software doesn't wear out. It's already broken.



- Systems have latent defects lurking, waiting to be triggered
  - defect activated -> device fails to perform
  - Bad inputs -> defect activated -> device fails to perform
  - Hardware failures -> device fails to perform

- Can eliminate some through better development process
- Good but unexpected inputs and system state -> 
   Address others by designing system to detect and handle faults

## Overview of System Development Process So Far



## **DESIGN IN FEWER DEFECTS**

## Perform Peer Reviews Upstream of Testing

- Peer reviews of code highly effective
  - Good at finding different kinds of bugs than testing is
  - Use both as complements
- Peer reviews of Design **Documents** 
  - Test the ideas the system will be built upon
  - Express ideas more concisely tha with code

Denotits

2. Makes list of issues - bugs, risks, questions

2. Makes list of issues - bugs, risks, questions

2. Institutional wisdom

4. Offline Cada and in 4. Offline Code revision.

Peer Review of Design Documuss - vs Code. - Details hurt efficiency. - Pseudo code - Flow charts - State d'agrams - Tasks+Communications, 15Rs, handlers Task

# **Software Development Models**





# **TEST TO FIND IMPORTANT DEFECTS EARLIER**

# Improve Testing to Find Relevant Defects Earlier



- Improve testing to identify defects earlier, reducing cost of rework
- Impossible to test all possible inputs and states
- Range of testing methods available. Apply appropriately

# DEVELOPMENT PROCESSES FOR SAFETY-CRITICAL SYSTEMS

# Safety-Critical System Concepts

- Safety-Critical System
  - Failure can lead to injury, death, damage to property or environment (a mishap)
- Example Applications
  - Motor control, aircraft, chemical process, medical electronics, automobile anti-lock braking system
- Terminology
  - Mishap: Event (or chain of events) resulting in unintentional death, injury, damage, etc.
  - Hazard: Condition that could lead to a mishap
  - Failure: Failing to perform an expected action
  - Fault: A defect

- Types of Computer Use
  - Safety System: passively monitor system, take control when application enters dangerous state
  - Control System: actively control system based on inputs (operator, sensors) and control logic
- Types of Requirements
  - Functional and Operational
    - Maintain oven temperature within 1%
  - Safety-Related
    - Even with failures, don't let temperature go over a specified limit
    - Don't electrocute the user

# **Activities Required for Achieving System Safety**

- Must address system's life cycle
  - Design of system, development, testing, production, installation, operation (control, monitoring, diagnostics, repair, upgrade), disposal
- Requires a Systems Engineering management approach (process) specific to safety
  - Processes: documentation standards and practices, issue tracking system, system configuration management, etc.
  - Many Other Risks: Changes to design may require additional training for users, employee turn-over, component obsolescence, etc.

- Requires multidisciplinary effort
  - Span design from hardware through software through application through ...
  - Span entire life cycle
- Apply system safety standards
  - MIL-STD-882E: Standard Practice for System Safety
  - IEC 61508: Functional Safety of electrical/electronic/programmable electronic safety-related systems

## MIL-STD-882E System Safety Process Elements



- Just a design (hardware and code) is not nearly enough!
- Systems must be developed according to a process
  - Must explain context, the approach used, considerations, etc.
- Development activities are typically audited: "Show me the documentation for all of the steps"

## Risk Assessment Matrix

Depends on severity and probability

| RISK ASSESSMENT MATRIX  |                  |                 |                 |                   |  |
|-------------------------|------------------|-----------------|-----------------|-------------------|--|
| SEVERITY<br>PROBABILITY | Catastrophic (1) | Critical<br>(2) | Marginal<br>(3) | Negligible<br>(4) |  |
| Frequent<br>(A)         | High             | High            | Serious         | Medium            |  |
| Probable<br>(B)         | High             | High            | Serious         | Medium            |  |
| Occasional<br>(C)       | High             | Serious         | Medium          | Low               |  |
| Remote<br>(D)           | Serious          | Medium          | Medium          | Low               |  |
| Improbable<br>(E)       | Medium           | Medium Medium   |                 | Low               |  |
| Eliminated<br>(F)       | Eliminated       |                 |                 |                   |  |

## Mishap Analysis for Basic System



Assessing Risk



- Design evaluation and analysis
  - Failure Modes and Effects Analysis: In what ways can each component fail? What is the impact on the system of that component failing that way?
  - Failure Modes and Effects Testing: Inject component failures and evaluate system responses, confirming design mitigates failures.
  - Fault Tree Analysis: What could cause each mishap?

- Risk Analysis
  - Perform Fault Tree Analysis to identify basic component failure events
  - Calculate probability of all these failure events
  - Combine probabilities to determine mishap probability



- Design evaluation and analysis
  - Failure Modes and Effects Analysis: In what ways can each component fail? What is the impact on the system of that component failing that way?
  - Failure Modes and Effects Testing: Inject component failures and evaluate system responses, confirming design mitigates failures.
  - Fault Tree Analysis: What could cause each mishap?

Risk Analysis

- Perform Fault Tree Analysis to identify basic component failure events
- Calculate probability of all these failure events
- Combine probabilities to determine mishap probability

# Methods to Mitigate Mishap Risk (In Preferred Order)

Hardware

**Faults** 

- Eliminate hazards through design selection
- Reduce risk through design alteration
- Improve component reliability and quality
- 4. Incorporate safety devices
  - External
  - Internal
- Provide warning devices.
- Develop procedures, training and personal protective equipment



#### MIL-STD-882E on Software

#### 4.4 <u>Software contribution to system risk</u>.

The assessment of risk for software, and consequently software-controlled or software-intensive systems, cannot rely solely on the risk severity and probability.

Determining the probability of failure of a single software function is difficult at best and cannot be based on historical data.

Software is generally application-specific and reliability parameters associated with it cannot be estimated in the same manner as hardware.

Therefore, another approach shall be used for the assessment of software's contributions to system risk that considers the potential risk severity and the degree of control that software exercises over the hardware.

# Process Requires Safety Decisions to be Documented and Justified

| RELATIONSHIP BETWEEN SwCI, RISK LEVEL, LOR Tasks, AND RISK |            |                                                                                                                                                                                                                                                                                                                                                             |  |  |
|------------------------------------------------------------|------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|
| Software<br>Criticality<br>Index (SwCI)                    | Risk Level | Software LOR Tasks and Risk Assessment/Acceptance                                                                                                                                                                                                                                                                                                           |  |  |
| SwCl 1                                                     | High       | If SwCl 1 LOR tasks are unspecified or incomplete, the contributions to system risk will be documented as HIGH and provided to the PM for decision. The PM shall document the decision of whether to expend the resources required to implement SwCl 1 LOR tasks or prepare a formal risk assessment for acceptance of a HIGH risk.                         |  |  |
| SwCl 2                                                     | Serious    | If SwCl 2 LOR tasks are unspecified or incomplete, the contributions to system risk will be documented as SERIOUS and provided to the PM for decision. The PM shall document the decision of whether to expend the resources required to implement SwCl 2 LOR tasks or prepare a formal risk assessment for acceptance of a SERIOUS risk.                   |  |  |
| SwCl 3                                                     | Medium     | <ul> <li>If SwCI 3 LOR tasks are unspecified or incomplete, the contributions to system risk will be documented as MEDIUM and provided to the PM for decision. The PM shall document the decision of whether to expend the resources required to implement SwCI 3 LOR tasks or prepare a formal risk assessment for acceptance of a MEDIUM risk.</li> </ul> |  |  |
| SwCl 4                                                     | Low        | If SwCl 4 LOR tasks are unspecified or incomplete, the contributions to system risk will be documented as LOW and provided to the PM for decision. The PM shall document the decision of whether to expend the resources required to implement SwCl 4 LOR tasks or prepare a formal risk assessment for acceptance of a LOW risk.                           |  |  |
| SwCl 5                                                     | Not Safety | No safety-specific analyses or testing is required.                                                                                                                                                                                                                                                                                                         |  |  |

## 4.4.1 Software Assessments

- Software Control Categories (SCC)
  - Describes how much of the system the software controls
- Software Safety Criticality Matrix (SSCM)
  - Describes how critical the software is based on the severity of a mishap and the SCC
  - Uses a number: Software Criticality Index (SwCI)
- Level of Rigor Tasks (LoR)
  - Indicates which tasks must be performed for this SwCI



# **Software Control Categories**

| Level  | Name                                 | Description (Paraphrased)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |  |  |
|--------|--------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|
| l<br>T | Autonomous<br>(AT)                   | Autonomous control authority over potentially safety-significant components without the possibility of predetermined safe detection and intervention by a control entity to prevent mishap or hazard.                                                                                                                                                                                                                                                                                                                                 |  |  |
| 2      | Semi-<br>Autonomous<br>(SAT)         | Control authority over potentially safety-significant components, allowing time for predetermined safe detection and intervention by independent safety mechanisms to mitigate or control the mishap or hazard.  Software item that displays safety-significant information requiring immediate operator entity to execute a predetermined action for mitigation or control over a mishap or hazard. Software exception, failure, fault, or delay will allow mishap occurrence.                                                       |  |  |
| 3      | Redundant<br>Fault Tolerant<br>(RFT) | Software functionality that issues commands over safety-significant hardware components requiring a control entity to complete the command function. The system detection and functional reaction includes redundant, independent fault tolerant mechanisms for each defined hazardous condition.  Software that generates information of a safety-critical nature used to make critical decisions. The system includes several redundant, independent fault tolerant mechanisms for each hazardous condition, detection and display. |  |  |
| 4      | Influential                          | Software <b>generates information of a safety-related nature</b> used to make decisions by the operator, but <b>does not require operator action to avoid a mishap</b> .                                                                                                                                                                                                                                                                                                                                                              |  |  |
| 5      | No Safety<br>Impact (NSI)            | No command or control authority over safety-significant hardware components, does not provide safety-significant information. Software does not provide safety-significant or time sensitive data or information that requires control entity interaction. Software does not transport or resolve communication of safety-significant or time sensitive data.                                                                                                                                                                         |  |  |

| CCCNA                         | l (D:                                              | + 1          |                                    |                     |                 | INC             | JIA              | IE UNIVENS        |
|-------------------------------|----------------------------------------------------|--------------|------------------------------------|---------------------|-----------------|-----------------|------------------|-------------------|
| SSCM and Level of Rigor Tasks |                                                    |              | SOFTWARE SAFETY CRITICALITY MATRIX |                     |                 |                 |                  |                   |
|                               |                                                    |              |                                    | SEVERITY CATEGORY   |                 |                 |                  |                   |
|                               |                                                    |              | SOFTWARE<br>CONTROL<br>CATEGORY    | Catastrophic<br>(1) | Critical<br>(2) | Marg<br>(3)     |                  | Negligible<br>(4) |
|                               |                                                    |              | 1                                  | SwCl 1              | SwCl 1          | SwC             | 13               | SwCl 4            |
|                               |                                                    |              | 2                                  | SwCl 1              | SwCl 2          | SwC             | 1 3              | SwCl 4            |
|                               |                                                    |              | 3                                  | SwCl 2              | SwCl 3          | SwC             | 1 4              | SwCl 4            |
|                               |                                                    |              | 4                                  | SwCI 3              | SwCl 4          | SwC             | 1 4              | SwCl 4            |
|                               |                                                    |              | 5                                  | SwCI 5              | SwCI 5          | SwC             | 1 5              | SwCl 5            |
|                               | Software Criticality Perform analysis of Conduct 3 |              |                                    |                     |                 |                 |                  |                   |
| •                             |                                                    |              |                                    |                     |                 | Conduct Safety- |                  |                   |
| Index                         | Requirements                                       | Architecture | Desig                              | gn                  | Code            |                 | Specific Testing |                   |
|                               | •                                                  | •            |                                    | •                   | •               |                 |                  | In-depth          |
| 2                             | •                                                  | •            |                                    | •                   |                 |                 |                  | In-depth          |
| 3                             | •                                                  | •            |                                    |                     |                 |                 |                  | In-depth          |
| 4                             |                                                    |              |                                    |                     |                 |                 |                  | •                 |

# **APPENDIX**

# Mapping Solutions onto the System Development Process

| Requirements    | Peer Review |  |  |
|-----------------|-------------|--|--|
|                 |             |  |  |
| Architecture    | Peer Review |  |  |
|                 |             |  |  |
| Detailed Design | Peer Review |  |  |
| Coding          | Tooting     |  |  |
| Coding          | Testing     |  |  |
| Testing         |             |  |  |
|                 |             |  |  |
| Operation       |             |  |  |
|                 |             |  |  |

- Design the fault out of the system
- Don't design-in or code-in the fault
  - Use a better development process
  - Prototype or model high-risk areas
  - During development, find the fault as early as possible
    - Test early and often
    - Use peer reviews for non-functional items (architectures, etc.)
- Accept the fault may happen, so design system to tolerate it during operation

# Causes of Embedded System Failures

- Latent faults
  - The system was designed and built that way
- Environmental factors
  - Device failures
  - Noise, single event upsets
  - Hacker attacks
  - New unexpected inputs/conditions

System

# **Definitions**

- Terms
  - Safety
    - System does not cause any harm
  - Reliability
    - Probability that item will operate reliably for a specified amount of time
  - Availability
    - Probability that item will operate correctly at a given time

# **Fault Detection Methods**

