

Chinese Society of Aeronautics and Astronautics & Beihang University

**Chinese Journal of Aeronautics** 

cja@buaa.edu.cn www.sciencedirect.com



# Multi-objective evolutionary design of selective triple modular redundancy systems against SEUs



Yao Rui \*, Chen Qinqin, Li Zengwu, Sun Yanmei

College of Automation and Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China

Received 17 June 2014; revised 5 September 2014; accepted 21 February 2015; available online 8 April 2015

### **KEYWORDS**

Evolvable hardware; Field programmable gate array; Multi-objective approach; Selective triple modular redundancy; Single event upset

Abstract To improve the reliability of spaceborne electronic systems, this paper presents a fault-tolerant strategy of selective triple modular redundancy (STMR), based on multi-objective optimization and evolvable hardware (EHW), against single-event upsets (SEUs) for circuits implemented on field programmable gate arrays (FPGAs) based on static random access memory (SRAM). First, various circuit topologies with the same functionality are evolved using EHW. Then the SEU-sensitive gates of each circuit are identified using the signal probabilities of all its lines, and each circuit is hardened against SEUs by selectively applying triple modular redundancy (TMR) to these SEU-sensitive gates. Afterward, each hardened circuit is evaluated by SEU simulation, and multi-objective optimization is introduced to optimize the area overhead and the number of functional errors of all the circuits. The proposed fault-tolerant strategy is tested on four circuits from the Microelectronics Center of North Carolina (MCNC) benchmark suite. The experimental results show that it can generate innovative trade-off solutions that compromise between hardware resource consumption and system reliability. The maximum saving in area overhead of the STMR circuit over the full TMR design is 58% with the same SEU immunity. © 2015 The Authors. Production and hosting by Elsevier Ltd. on behalf of CSAA & BUAA. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

## 1. Introduction

With the fast development of science and technology, the reliability of electronics in space and avionics has become crucial due to the increased complexity of their architecture and function. Field programmable gate arrays (FPGAs) based on static random access memory (SRAM) have gained steadily increasing interest for such applications because of their short time to market, good reconfigurability and low cost. Unfortunately, along with these advantages, this technology is highly susceptible to so-called single event upsets (SEUs).<sup>1</sup> An SEU is the inversion of a memory bit caused by heavy ions, protons and/or ground level radiation. SEUs are the largest contributor to device soft failures,<sup>2</sup> which may even lead to mission failure. Hence, the aerospace industry will benefit significantly from SEU mitigation technologies for SRAM-based FPGAs.

\* Corresponding author. Tel.: +86 25 84892352.

E-mail address: yaorui@nuaa.edu.cn (R. Yao).

Peer review under responsibility of Editorial Committee of CJA.

http://dx.doi.org/10.1016/j.cja.2015.03.005

1000-9361 © 2015 The Authors. Production and hosting by Elsevier Ltd. on behalf of CSAA & BUAA. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Triple modular redundancy (TMR)<sup>3,4</sup> is the most widely adopted technique for hardening circuits implemented on SRAM-based FPGAs. For digital circuits mapped on FPGAs, not only the flip-flops (FFs) that form the feedback paths of sequential circuits, but also the logic gates in combinational and sequential circuits, need to be hardened. The reason is that the logic gates are mapped on the FPGA using look-up tables (LUTs), which consist of SRAM cells. Even the interconnection is controlled by data stored in SRAM cells.

TMR can be applied at different granularities, such as device redundancy, system redundancy, module redundancy or logic element redundancy. The finer the granularity is, the higher the probability that an upset is masked. For example, in a system level TMR system, the original system is replicated three times and the output is taken from a majority voter. Each replica of the system works independently and is called a domain. If an SEU occurs in one domain, TMR masks the fault by majority voting and thus propagates the correct output. This gives the TMR system resistance against SEUs and hardens the system without affecting its normal operation. However, a system level TMR system can withstand only a single upset at any instant. If two out of three domains give faulty results, the system will produce wrong answers. To enable the system to mask multiple faults, logic element level TMR can be applied. In a logic element level TMR system, each logic element, including logic gates and FFs, is hardened by TMR, so every logic element can tolerate one failure. Obviously, the anti-SEU ability of logic element level TMR is better than that of system level TMR, but the area overhead of voter insertion is also significant.
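The masking behavior of system level TMR can be illustrated with a minimal sketch. The function `original`, the helper `system_tmr` and the XOR-based fault model are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority vote over three domain outputs."""
    return (a & b) | (a & c) | (b & c)


def original(x: int) -> int:
    """Hypothetical 4-bit function standing in for the unhardened system."""
    return (x + 3) & 0xF


def system_tmr(fn, x: int, flips=(0, 0, 0)) -> int:
    """Run three domains of fn and vote on their outputs.

    flips[i] is XORed into the output of domain i to model an upset
    (assumed fault model: the upset shows up as flipped output bits).
    """
    outputs = [fn(x) ^ flip for flip in flips]
    return majority(*outputs)


golden = original(5)
# One faulty domain: the voter masks the upset.
single = system_tmr(original, 5, flips=(0b0100, 0, 0))
# Two domains with the same faulty bit: the voter is outvoted.
double = system_tmr(original, 5, flips=(0b0100, 0b0100, 0))

print(golden, single, double)
```

In this sketch the single upset is masked, while the same upset in two domains reaches the output, which is the limitation that motivates finer-grained redundancy.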

To reduce the area overhead of a TMR system, a special kind of TMR named reduced triple modular redundancy (RTMR) for specific very long instruction word (VLIW) processors is proposed in Ref. 5. The key idea is to exploit the redundancy of operators in the data path of a VLIW processor. That is, every operation is executed twice by two different operators during normal program execution. Only when the two computed results mismatch is the operation executed by a third operator, whose result is used for voting. Therefore, during most of the execution time, the area overhead of RTMR is only 100%. However, RTMR is essentially a system level TMR, and it is only suitable for specific VLIW processors. Moreover, the VLIW architecture must be modified in order to detect a mismatch in computed results, and program transformations must be introduced to obtain an internal representation of fault-tolerant programs that can be scheduled on the proposed VLIW architecture. So if it were used for logic element level TMR, much additional hardware logic and complex scheduling mechanisms would have to be added.
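The duplicate-then-vote idea behind RTMR can be sketched in a few lines. This is only a software analogy under stated assumptions: the name `rtmr_execute` is hypothetical, the operators are plain Python callables, and at most one operator is assumed to be faulty at a time. It does not model the VLIW scheduling described in Ref. 5.

```python
def rtmr_execute(operators, a, b):
    """Execute an operation on two operators, and on a third only on mismatch.

    operators: three callables implementing the same operation.
    Returns (result, number_of_operators_used).
    """
    r1 = operators[0](a, b)
    r2 = operators[1](a, b)
    if r1 == r2:
        # Normal case: two results agree, no third execution is needed.
        return r1, 2
    # Mismatch: the third operator breaks the tie (single-fault assumption).
    r3 = operators[2](a, b)
    return (r1 if r3 == r1 else r2), 3


def add(a, b):
    return a + b


def faulty_add(a, b):
    # Hypothetical upset: the lowest result bit is inverted.
    return (a + b) ^ 1


healthy = rtmr_execute([add, add, add], 3, 4)
upset = rtmr_execute([faulty_add, add, add], 3, 4)
print(healthy, upset)
```

In the fault-free case only two operators are busy, which corresponds to the 100% overhead mentioned above; the third is used only after a mismatch.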

To reduce the hardware resource consumption of a logic element level full TMR system, a fault-tolerant method of selective TMR (STMR) for circuits mapped on FPGAs is proposed in Ref. 6. In this method, only the SEU-sensitive gates, i.e., gates that are prone to upset in case of an SEU, are detected using the signal probabilities of the lines and hardened with TMR, while gates that are not sensitive to SEUs are left unhardened. Because only part of the gates are hardened by TMR, the STMR method can significantly reduce the area overhead of the hardened circuit compared to full TMR; moreover, since the unhardened gates are not prone to upset by SEUs, the loss of SEU immunity is small. However, the area overhead and the SEU immunity of a circuit (namely its reliability, which is inversely related to the number of functional errors in case of an SEU) conflict with each other: improving one worsens the other. If we want to increase the reliability of a circuit, more gates need to be hardened by TMR, so the area overhead increases too; conversely, if we want to decrease the area overhead, fewer gates can be hardened by TMR, so the reliability in the case of an SEU decreases. Therefore, a compromise between the area overhead and the number of functional errors is required. Moreover, faulty domains in the STMR system cannot be repaired.
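The role of signal probabilities can be made concrete with a small sketch. The probability propagation below uses the standard independence assumption for gate inputs. The netlist, the threshold and the selection rule (harden a gate if a flip on one of its inputs is likely to reach its output) are illustrative assumptions and are not necessarily the exact criterion of Ref. 6.

```python
# Hypothetical netlist in topological order: gate name -> (type, input lines).
NETLIST = {
    "g1": ("AND", ("a", "b")),
    "g2": ("OR", ("c", "d")),
    "g3": ("AND", ("g1", "g2")),
}


def signal_probabilities(netlist, input_probs):
    """Probability of each line being 1, assuming independent gate inputs."""
    probs = dict(input_probs)
    for name, (kind, inputs) in netlist.items():
        if kind == "AND":
            value = 1.0
            for line in inputs:
                value *= probs[line]
        elif kind == "OR":
            none_high = 1.0
            for line in inputs:
                none_high *= 1.0 - probs[line]
            value = 1.0 - none_high
        elif kind == "NOT":
            value = 1.0 - probs[inputs[0]]
        else:
            raise ValueError(f"unsupported gate type: {kind}")
        probs[name] = value
    return probs


def propagation_probability(kind, inputs, target, probs):
    """Probability that a flip on `target` changes the gate output.

    The other inputs must hold the non-controlling value:
    1 for AND, 0 for OR. A NOT gate always propagates the flip.
    """
    result = 1.0
    for line in inputs:
        if line == target:
            continue
        result *= probs[line] if kind == "AND" else 1.0 - probs[line]
    return result


def sensitive_gates(netlist, probs, threshold):
    """Assumed rule: a gate is sensitive if any input flip likely propagates."""
    selected = []
    for name, (kind, inputs) in netlist.items():
        worst = max(propagation_probability(kind, inputs, t, probs) for t in inputs)
        if worst >= threshold:
            selected.append(name)
    return selected


probs = signal_probabilities(NETLIST, {"a": 0.5, "b": 0.5, "c": 0.5, "d": 0.5})
selected = sensitive_gates(NETLIST, probs, threshold=0.6)
print(probs["g3"], selected)
```

Raising the threshold in this sketch hardens fewer gates (less area, more exposure), and lowering it hardens more, which mirrors the conflict between area overhead and reliability described above.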

In this paper, the multi-objective evolutionary design of STMR systems against SEUs is presented, which combines the novel design and self-repairing capabilities of evolvable hardware (EHW) with the lower area overhead of the STMR technique. Moreover, this strategy achieves a tradeoff between reliability and resource consumption by using multi-objective optimization. The procedure can be divided into three steps. First, various circuit topologies with the same functionality are evolved using EHW. Then these circuits are hardened against SEUs with the STMR technique, which greatly reduces the area overhead at a slight loss of reliability. Lastly, a multi-objective optimization algorithm based on weighted summation simultaneously optimizes the number of functional errors and the area overhead of the circuits with different topologies. The proposed fault-tolerant strategy is tested on four circuits taken from the Microelectronics Center of North Carolina (MCNC) benchmark library. Not only is the area overhead of the STMR circuit significantly lower than that of the full TMR design of the same circuit, with a small loss of reliability, but a tradeoff between the number of functional errors and the area overhead is also achieved.
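A weighted-summation selection over the two objectives could look like the following sketch. The candidate values, the weights and the max-normalization are made-up illustrations, not results or settings from the paper; the paper's actual objective formulation may differ.

```python
def weighted_sum_best(candidates, w_err=0.5, w_area=0.5):
    """Pick the candidate with the lowest weighted, normalized cost.

    Each candidate is a dict with 'errors' (functional errors under SEU
    simulation) and 'area' (area overhead); both are to be minimized.
    Normalizing by the maximum keeps the two objectives on a common scale.
    """
    max_err = max(c["errors"] for c in candidates) or 1
    max_area = max(c["area"] for c in candidates) or 1

    def cost(c):
        return w_err * c["errors"] / max_err + w_area * c["area"] / max_area

    return min(candidates, key=cost)


# Hypothetical hardened variants of one circuit (illustrative numbers only).
CANDIDATES = [
    {"name": "A", "area": 30, "errors": 0},   # heavily hardened
    {"name": "B", "area": 18, "errors": 4},   # intermediate
    {"name": "C", "area": 12, "errors": 20},  # lightly hardened
]

balanced = weighted_sum_best(CANDIDATES)
reliability_first = weighted_sum_best(CANDIDATES, w_err=0.9, w_area=0.1)
area_first = weighted_sum_best(CANDIDATES, w_err=0.1, w_area=0.9)
print(balanced["name"], reliability_first["name"], area_first["name"])
```

Changing the weights moves the chosen design along the trade-off between area overhead and functional errors, which is how a weighted sum expresses a designer's preference.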

## 2. STMR method based on multi-objective optimization and EHW

### 2.1. Evolvable hardware

EHW is a novel kind of bio-inspired smart hardware that is capable of self-assembly, self-repair and self-adaptation. It integrates evolutionary computation with reconfigurable hardware devices.<sup>8,9</sup> It applies evolutionary algorithms, particularly the genetic algorithm (GA), as the global search engine, and in-situ configurable devices as the physical medium. The goal of EHW is to obtain the expected circuits and topologies through evolution without human intervention or designers' knowledge,<sup>10</sup> and then to adapt to a changing environment by dynamically and autonomously reconfiguring its own internal structure.

Extrinsic or intrinsic evolution<sup>11</sup> is applied to EHW: the circuit architecture and its property parameters are encoded into chromosomes, and each candidate circuit is then either simulated (extrinsic) or implemented physically on reconfigurable devices (intrinsic) and scored with an evaluation function. The evaluation function, known as the fitness function, measures how good a solution each chromosome is and serves as the optimization objective of the GA. Offspring individuals are then derived by operators such as selection, crossover and mutation according to their fitness. The evolution cycle is repeated until a satisfying solution (a circuit providing the desired behavior) is found.
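The evolution cycle described above can be sketched as a minimal extrinsic GA. Everything specific here is an assumption made for illustration: the chromosome is the 8-bit content of a single 3-input LUT, the target behavior is a 3-input majority function, and the operators (tournament selection, one-point crossover, bit-flip mutation, elitism) and their parameters are generic choices rather than the paper's configuration.

```python
import random

# Target truth table: 3-input majority, one entry per input combination.
TARGET = [(a & b) | (a & c) | (b & c)
          for a in (0, 1) for b in (0, 1) for c in (0, 1)]


def fitness(chromosome):
    """Number of truth-table entries that match the desired behavior."""
    return sum(int(gene == want) for gene, want in zip(chromosome, TARGET))


def evolve(pop_size=20, generations=50, p_mut=0.05, seed=0):
    """Evolve LUT contents; returns (best chromosome, best fitness per generation)."""
    rng = random.Random(seed)
    n = len(TARGET)
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        if history[-1] == n:
            break  # a circuit with the desired behavior has been found
        next_population = [population[0][:]]  # elitism: keep the best individual
        while len(next_population) < pop_size:
            # Tournament selection of two parents.
            parent1 = max(rng.sample(population, 3), key=fitness)
            parent2 = max(rng.sample(population, 3), key=fitness)
            # One-point crossover.
            cut = rng.randrange(1, n)
            child = parent1[:cut] + parent2[cut:]
            # Bit-flip mutation.
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            next_population.append(child)
        population = next_population
    return max(population, key=fitness), history


best, history = evolve()
print(best, fitness(best), len(history))
```

Because the best individual is always carried over, the best fitness per generation never decreases in this sketch; an intrinsic variant would replace `fitness` with a measurement on the configured device.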
