Programa de Pós-Graduação em Computação
Instituto de Informática
Universidade Federal do Rio Grande do Sul
Porto Alegre – RS – Brazil
Semana Acadêmica PPGC/UFRGS 17/10/2006
Dealing with Multiple Simultaneous Faults in Future Technologies
PhD candidate: Carlos Arthur Lang Lisbôa
Advisor: Luigi Carro
Carlos A. L. Lisbôa Semana Acadêmica PPGC/UFRGS 17/10/2006 2
Why Multiple Simultaneous Faults ?
• Future technologies (2010 and beyond)
  • very small transistors and fewer electrons to form the channel (SETs)
  • transient pulses due to radiation attack will last longer than the propagation delays of gates and cycle times
  • devices will be more sensitive to the effects of electromagnetic noise, neutrons and alpha particles
Single Event Upset Origin
[Figure: bit patterns illustrating a single event upset]
Why Should One Study Multiple Faults ?
Changes in paradigm:
• Gates will behave statistically, producing correct outputs only a fraction of the time
• Faster devices: cycle times shorter than the duration of transient pulses
How to Deal with Multiple Faults ?
• New paradigm: multiple simultaneous faults
  • new fault tolerance techniques will be required (TMR will no longer provide enough protection)
• How to deal with this problem ?
  • new materials and manufacturing technologies must be developed
  OR
  • new design approaches must be taken (our bet !)
Research Evolution - Overview
[Figure: research timeline, 2004 to 2007]
• 2004: Stochastic Operators (IOLTS 04); Bit Stream Operators (DFT 04, WDES 04)
• 2005: TMR and Analog Voter (ETS 05, SBCCI 05); Research Report (SRC 2005 TechCon)
• 2006: Research Report (DATE 06 PhD Forum); MemProc (LATW 06, ETS 06); Majority Logic (DFT 06); Online Hardening (DFT 06)
• 2007: Low cost redundancy, Statistical Computation (VTS 07, submitted)
Published Papers
• Lisbôa, C. and Carro, L., “Arithmetic Operators Robust to Multiple Simultaneous Upsets”, 10th IEEE International Online Test Symposium - IOLTS 2004, IEEE Computer Society, Funchal, Madeira Island, Portugal, July 2004.
• Lisbôa, C. and Carro, L., “Highly Reliable Arithmetic Multipliers for Future Technologies”, in Proceedings of the International Workshop on Dependable Embedded Systems - WDES 2004 - in conjunction with the 23rd International Symposium on Reliable Distributed Systems - SRDS 2004, pp. 13-18. Edited by Becker, L. B. and Kaiser, J., Florianópolis, October 17, 2004.
• Lisbôa, C. and Carro, L., “Arithmetic Operators Robust to Multiple Simultaneous Upsets”, in Proceedings of the 19th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems - DFT 2004, pp. 289-297, ISBN 0-7695-2241-6. IEEE Computer Society, New York, October 2004.
• Lisbôa, C. A. L., Carro, L. and Cota, E., “RobOps - Arithmetic Operators for Future Technologies”, 10th European Test Symposium - ETS 2005, Tallinn, Estonia, May 2005.
• Lisbôa, C. A. L., Schüler, E. and Carro, L., “Going Beyond TMR for Protection Against Multiple Faults”, in Proceedings of the 18th Symposium on Integrated Circuits and Systems Design - SBCCI 2005, September 2005.
• Rhod, E.; Lisbôa, C. A. L. and Carro, L., “Using Memory to Cope with Simultaneous Transient Faults”, in Proceedings of the 7th Latin-American Test Workshop - LATW 2006, pp. 151-156, IEEE Computer Society, New York, March 2006.
• Rhod, E.; Lisbôa, C. A. L.; Michels, Á. and Carro, L., “Fault Tolerance Against Multiple SEUs using Memory-Based Circuits to Improve the Architectural Vulnerability Factor”, in Informal Digest of Papers of the 11th IEEE European Test Symposium - ETS 2006, pp. 229-234, IEEE Computer Society, New York, May 2006.
• Michels, Á., Petroli, L., Lisbôa, C. A. L., Kastensmidt, F. and Carro, L. “SET Fault Tolerant Combinational Circuits Based on Majority Logic”, in Proceedings of the 21st IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems - DFT 2006, pp. 345-352, IEEE Computer Society, Los Alamitos, CA, October 2006.
• Lisbôa, C. A. L., Carro, L., Sonza Reorda, M., and Violante, M. “Online Hardening of Programs against SEUs and SETs”, in Proceedings of the 21st IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems - DFT 2006, pp. 280-288, IEEE Computer Society, Los Alamitos, CA, October 2006.
Research Approaches - 2004 / 2005
• Use of stochastic operators
• Use of bit stream operators
• Ensuring voter reliability to use n-MR while dealing with multiple simultaneous faults
Research Evolution - 2004 / 2005
• Stochastic Operators (IOLTS 2004): OK for some DSP applications
• Looking for more speed: Bit Stream Operators (DFT 2004, WDES 2004), small footprint and fast
• Looking for tolerant converter: TMR and Analog Voter (ETS 2005, SBCCI 2005), tolerant to multiple faults in n-MR solutions
• Research Report (SRC 2005 TechCon)
Research approach - 2006 / 2007
• cooperation with peers
• use of memory for computation
• analog voter + majority logic
• use of an I-IP to harden instructions
• low cost redundancy using statistical parallel computation
Research Evolution - 2006 / 2007
• Research Report (DATE 06 PhD Forum)
• MemProc (LATW 06, ETS 06)
• Majority Logic (DFT 06)
• Online Hardening (DFT 06)
• Low cost redundancy: Statistical Computation (VTS 07, submitted)
Current research - motivation
• future technologies: faster devices
  • transient pulse duration scaling not proportional to speed scaling
  • transient pulses will last longer than one cycle
• techniques relying on time redundancy will fail
Current research - motivation
• alternative approach: space redundancy
  • current solutions: area overhead 100%
  • small granularity does not provide low overhead (what can one do with 50% of a MOSFET ?)
• proposed solution: fingerprinting
  • parallel processing on subset of possible inputs
  • small transient fault probability (desired: 0%)
Current research - focus
• use of low cost redundancy and statistical computation to cope with transient faults
[Figure: block diagram with a main circuit and a random checker; signals labeled inputs, output and error]
Sample application
• Freivalds: matrix multiplication correctness
  • given matrices A and B, n x n
  • given one algorithm that calculates C = A x B
  • goal: check if the algorithm performs correctly by executing thousands of multiplications and comparing the results
  • naive solution: calculate again and compare, $O(n^3)$
Sample application
• Freivalds technique
  1. generate a random vector r, with values from {0,1}
  2. compute vector $Cr = C \times r$, $O(n^2)$
  3. compute vector $ABr = A \times (B \times r)$, $O(n^2)$
  4. if $C \neq A \times B$, then $\Pr[ABr = Cr] \leq 1/2$
  After k independent repetitions of steps 1, 2 and 3: $\Pr[ABr = Cr] \leq 1/2^k$
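The steps of the Freivalds technique can be sketched in a few lines of Python. This is a minimal illustration on plain lists, not the hardware checker discussed in the talk; the function names and the 2 x 2 example matrices are assumptions made for the example.

```python
import random

def mat_vec(M, v):
    """Multiply an n x n matrix (list of rows) by a vector: O(n^2)."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def freivalds_check(A, B, C, k=10, rng=random):
    """Return True if C passes k rounds of the Freivalds test for C == A x B."""
    n = len(A)
    for _ in range(k):
        r = [rng.randint(0, 1) for _ in range(n)]   # step 1: random 0/1 vector
        Cr = mat_vec(C, r)                           # step 2: C x r
        ABr = mat_vec(A, mat_vec(B, r))              # step 3: A x (B x r)
        if ABr != Cr:
            return False                             # a mismatch proves C != A x B
    return True   # if C != A x B, this happens with probability at most 1/2**k

# Hypothetical example data (not from the slides)
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C_good = [[19, 22], [43, 50]]
C_bad = [[19, 22], [43, 51]]   # one wrong element

print(freivalds_check(A, B, C_good))
print(freivalds_check(A, B, C_bad, k=64))
```

A correct C always passes; a wrong C is accepted only if every one of the k random vectors happens to hide the error.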
Sample application
• Our extension of Freivalds technique
  1. generate a random vector r, with values from {0,1}
  2. generate a vector rc with $rc_i = \mathrm{not}(r_i)$ for i = 1:n
  3. compute $Cr = C \times r$ and $Crc = C \times rc$
  4. compute $ABr = A \times (B \times r)$ and $ABrc = A \times (B \times rc)$
  5. if $ABr \neq Cr$ OR $ABrc \neq Crc$, then $\Pr[ABr \neq Cr] = 1$
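A small simulation helps to see why checking with both r and its complement rc is attractive. The sketch below assumes a fault model in which exactly one element of C is corrupted (a hypothetical choice for the example): the wrong column is then selected either by r or by rc, so one of the two comparisons must fail. All function names and the test data are illustrative.

```python
import random

def mat_vec(M, v):
    """Matrix-vector product, O(n^2)."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def mat_mul(A, B):
    """Plain O(n^3) matrix product, used here only to build test data."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def error_flag(A, B, C, rng):
    """Steps 1 to 5: flag an error if either r or rc exposes a mismatch."""
    n = len(A)
    r = [rng.randint(0, 1) for _ in range(n)]
    rc = [1 - ri for ri in r]                      # complement of r
    mismatch_r = mat_vec(A, mat_vec(B, r)) != mat_vec(C, r)
    mismatch_rc = mat_vec(A, mat_vec(B, rc)) != mat_vec(C, rc)
    return mismatch_r or mismatch_rc

def inject_single_fault(C, rng):
    """Assumed fault model: flip one bit of one element of C."""
    faulty = [row[:] for row in C]
    i, j = rng.randrange(len(C)), rng.randrange(len(C))
    faulty[i][j] ^= 1 << rng.randrange(8)
    return faulty

rng = random.Random(2006)
n = 4
A = [[rng.randint(0, 9) for _ in range(n)] for _ in range(n)]
B = [[rng.randint(0, 9) for _ in range(n)] for _ in range(n)]
C = mat_mul(A, B)

trials = 1000
detected = sum(error_flag(A, B, inject_single_fault(C, rng), rng)
               for _ in range(trials))
print(detected, "of", trials, "single-element faults flagged")
```

Under this single-element fault model every injected fault is flagged, while a fault-free C never raises the flag. Faults touching several elements of the same row can still cancel out, so the guarantee does not extend to arbitrary errors.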
Sample Implementation
• matrix multiplier with checker
• application of Freivalds technique
[Figure: block diagram. Inputs (A, B); multiplier block C = A * B producing output (C); checker blocks Cr = C * r and ABr = A*(B*r) producing the error signal]
Sample Implementation
Area overhead (# of gates)
Matrix width (n)              3        16         32
Size of the multiplier   10,773   218,688  1,830,144
Size of the checker      11,181   125,203    765,523
Checker area overhead      104%       57%        42%
Sample implementation
Time overhead (# of instructions)
Matrix width (n)      3      8      16      32
Multiplication       45    960   7,936  64,512
Verification         90    720   2,976  12,096
Overhead (%)       200%    75%   37.5%  18.75%
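The instruction counts in the time overhead table can be reproduced with a simple operation count. The formulas below are an inference that happens to match every entry, not something stated on the slide: each result element costs n multiplications and n - 1 additions, and the verification is assumed to need six matrix-vector products (B x r, A x (B x r), C x r, and the same three for rc).

```python
def multiplication_ops(n):
    """n*n result elements, each with n multiplications and n - 1 additions."""
    return n * n * (2 * n - 1)

def verification_ops(n, products=6):
    """Assumed: six matrix-vector products, each costing n * (2n - 1) operations."""
    return products * n * (2 * n - 1)

for n in (3, 8, 16, 32):
    mult = multiplication_ops(n)
    ver = verification_ops(n)
    print(n, mult, ver, f"{100 * ver / mult:g}%")
```

Under these assumptions the relative overhead is 6/n, which explains why it halves each time the matrix width doubles.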
Sample implementation
Fault injection results
# of repetitions                          1,000,000
faults propagated to C                      332,142
A*(B*r) = C*r                               166,285
A*(B*r) ≠ C*r                               165,857
A*(B*rc) = C*rc                             165,857
A*(B*rc) ≠ C*rc                             166,285
A*(B*r) ≠ C*r OR A*(B*rc) ≠ C*rc            332,142
PhD program requirements
• 36 credits
• qualifying examination
• proficiency exam in 2 foreign languages
• academic week seminar
• Thesis proposal: February 2007
• Thesis presentation: December 2007
Questions ?
Using Stochastic Operators
• SEU induced transient errors are of random nature
• Stochastic operators rely on randomness to produce approximate results
• The injection of random faults in the input signals processed by stochastic operators did not impact the precision of the results
• Several application areas (DSP) can deal with approximate values and still produce acceptable results (outputs)

% Errors in 1,000 additions
                   0 faults   2 faults   4 faults   8 faults
Stochastic Adder     0.1412     0.2580     0.1768     0.2196
Conventional         0.0000
Using Stochastic Operators
• Benefit: reduced area of the operators
[Figure: stochastic multiplier circuit and stochastic adder circuit (signals S1, S2, S3 and Sum), annotated with example bit streams]
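The circuits themselves are not recoverable from this transcript, so the sketch below falls back on the textbook form of stochastic computing, which is an assumption about what the slide shows: a value in [0, 1] is encoded as the probability of a 1 in a bit stream, multiplication is a bitwise AND, and scaled addition is a multiplexer driven by a select stream with probability 1/2. All names are illustrative.

```python
import random

def to_stream(p, length, rng):
    """Encode p in [0, 1] as a random bit stream with P(bit = 1) = p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def value(stream):
    """Decode a stream: fraction of 1's."""
    return sum(stream) / len(stream)

def stochastic_multiply(s1, s2):
    """One AND gate: for independent streams, P(out = 1) = p1 * p2."""
    return [a & b for a, b in zip(s1, s2)]

def stochastic_add(s1, s2, select):
    """One multiplexer: with P(select = 1) = 1/2 the output encodes (p1 + p2) / 2."""
    return [a if s else b for a, b, s in zip(s1, s2, select)]

def flip_bits(stream, count, rng):
    """Inject `count` bit flips at distinct random positions."""
    out = list(stream)
    for i in rng.sample(range(len(out)), count):
        out[i] ^= 1
    return out

rng = random.Random(1)
length = 20000
s1 = to_stream(0.5, length, rng)
s2 = to_stream(0.4, length, rng)
sel = to_stream(0.5, length, rng)

product = value(stochastic_multiply(s1, s2))            # close to 0.5 * 0.4
scaled_sum = value(stochastic_add(s1, s2, sel))         # close to (0.5 + 0.4) / 2
faulty = value(stochastic_multiply(flip_bits(s1, 8, rng), s2))
print(product, scaled_sum, faulty)
```

Each operator is a single gate, which illustrates the area benefit, and eight flipped input bits can move the decoded product by at most 8/length, which is the intuition behind the robustness claim.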
Using Bit Stream Operators
• Computation principles similar to those of the stochastic adder and multiplier
• Operators can produce bit streams which represent the exact results of the operation
• Redundancy is added to the bit streams in order to withstand multiple bit flips
• Conversion of bit streams to binary coded values is delayed as much as possible, and conversion circuits must use TMR or n-MR for protection against faults
• Issues to be further investigated: size of bit streams and area of the conversion circuits

Proposed Multiplication Algorithm - bit stream product (the count of 1's in the stream is equal to the product value)
                      F12       F11       F10
  x                   F22       F21       F20
                  F20.F12   F20.F11   F20.F10
        F21.F12   F21.F11   F21.F10
F22.F12 F22.F11   F22.F10
b48 .. b33   b32 .. b17   b16 .. b5   b4 .. b1   b0

Adding robustness to the bit stream through redundancy
b48 .. b48   b47 .. b47   ...   b0 .. b0   1 1 1 1 0 0 0
  8 times      8 times           8 times       +4
total count of 1's = 8 * product + 4
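The bit stream product can be sketched as follows. This is one reading of the figure, labeled here as an assumption: each partial product F2i.F1j is replicated 2^(i+j) times, which gives 16 + 16 + 12 + 4 + 1 = 49 stream bits (b48 .. b0) for 3-bit operands. The decoding rule (integer division of the count by 8) is also an assumption; the slide only states that the total count is 8 * product + 4.

```python
def bit_stream_product(x, y, width=3):
    """Build a stream whose count of 1's equals x * y (bit order is irrelevant)."""
    stream = []
    for i in range(width):
        for j in range(width):
            bit = ((y >> i) & 1) & ((x >> j) & 1)    # partial product F2i.F1j
            stream.extend([bit] * (1 << (i + j)))    # replicated 2**(i+j) times
    return stream

def add_redundancy(stream):
    """Repeat every bit 8 times and append the pattern 1 1 1 1 0 0 0."""
    out = []
    for b in stream:
        out.extend([b] * 8)
    return out + [1, 1, 1, 1, 0, 0, 0]

def decode(redundant_stream):
    """Assumed decoder: count = 8 * product + 4, so count // 8 recovers the product."""
    return sum(redundant_stream) // 8

stream = bit_stream_product(5, 6)
print(len(stream), sum(stream))          # stream length and count of 1's

protected = add_redundancy(stream)
for position in (0, 100, 398):           # inject three bit flips
    protected[position] ^= 1
print(decode(protected))
```

Because the count sits at 8 * product + 4, up to three flips in either direction leave the quotient unchanged, which is one way the +4 offset can buy tolerance to multiple bit flips.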
What is Wrong with TMR ?
• TMR protects only against single faults in one of the modules
  [Figure: Modules 1, 2 and 3 all produce the correct output; the voter produces the correct output]
  [Figure: Module 2 produces a wrong output, Modules 1 and 3 produce the correct output; the voter still produces the correct output]
• TMR does not protect against double faults in different modules
  [Figure: Module 2 produces the correct output, Modules 1 and 3 produce wrong outputs; the voter produces a wrong output]
• When a single fault occurs in the voter circuit, the voter output may be wrong
  [Figure: Modules 1, 2 and 3 all produce the correct output; voter output: correct output ?]
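The three TMR scenarios can be replayed with a bitwise majority voter. This is a minimal sketch; the 4-bit word, the flipped bit and the way the voter fault is modeled (an XOR on the voter output) are assumptions chosen for the example.

```python
def majority_vote(a, b, c):
    """Bitwise 2-out-of-3 majority of the three module outputs."""
    return (a & b) | (a & c) | (b & c)

def faulty_voter(a, b, c, upset_mask):
    """Hypothetical model of a fault inside the voter: output bits get flipped."""
    return majority_vote(a, b, c) ^ upset_mask

correct = 0b1011
wrong = correct ^ 0b0100          # the same bit is upset in each faulty module

single_fault = majority_vote(correct, wrong, correct)       # Module 2 wrong
double_fault = majority_vote(wrong, correct, wrong)         # Modules 1 and 3 wrong
voter_fault = faulty_voter(correct, correct, correct, 0b0001)

print(single_fault == correct)    # single module fault is masked
print(double_fault == correct)    # double fault wins the vote
print(voter_fault == correct)     # fault-free modules, wrong voter output
```

The voter itself is a single point of failure, which is the motivation for looking at more robust voter designs.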
Making TMR (n-MR) more reliable
• Known solutions imply
  • area, performance and / or power penalties
  • deadlock: how to protect the output generator ?
• Proposed solution:
  • use TMR to cope with single faults in the modules
  • replace the digital voter by an analog voter that
    • uses a comparator to generate the output
    • can tolerate some noise while still producing the correct result
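A purely behavioral model gives a feel for why a comparator-based voter can absorb noise. Everything in this sketch is an assumption, since the circuit is not recoverable from this transcript: the module outputs are taken as voltages, averaged on a single node, and compared against half the supply; the 3.3 V supply value is likewise hypothetical.

```python
VDD = 3.3   # assumed supply voltage in volts (hypothetical, not from the slides)

def analog_vote(inputs, noise=0.0, vdd=VDD):
    """Behavioral sketch: average the module output voltages, then compare.

    inputs: logic values (0 or 1) produced by the n redundant modules
    noise:  voltage disturbance added to the averaging node
    """
    node = sum(vdd * bit for bit in inputs) / len(inputs) + noise
    return 1 if node > vdd / 2 else 0      # comparator against mid-rail

print(analog_vote([1, 1, 0]))                 # one faulty module out of three
print(analog_vote([1, 1, 0], noise=-0.5))     # same, with a disturbance on the node
print(analog_vote([1, 1, 1, 0, 0]))           # 5-MR with two faulty modules
```

In this model a 2-out-of-3 vote leaves a margin of VDD/6 around the threshold, so a disturbance smaller than that margin does not change the decision.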
The Analog Voter
Injection of faults in the comparator (*)
Minimum Area Comparator
(*) using CMOS 0.35 µm
Electrical Simulation: Multiple Faults (SPICE and CMOS 0.35 µm)
Dealing with Multiple Simultaneous Faults: n-MR
The Analog Voter with 5 Inputs (for 5-MR)
Simulations with injection of 2 simultaneous faults also succeeded
The Analog Voter ... Oops !
Does this work ???