Fault Tolerant Computing Homework 2

###### r12921a10 謝宗翰

**Question 1:**
---

*Some systems are designed for reliability whereas others are designed for availability.*

1. *Explain the difference between reliability and availability.*
    - Reliability is defined over an interval of time: it is the probability that the system meets its specification and produces correct output throughout that interval.
    - Availability is defined at an instant of time: it is the percentage of time during which the system is operational and able to serve its intended purpose. A system that becomes non-operational frequently, but only for very short periods, can still be called highly available.
2. *Give an example of an application requiring high availability and one requiring high reliability.*
    - High reliability: devices such as cardiac pacemakers and ventilators must be highly reliable to keep patients safe. These devices usually have multiple layers of fault detection to make sure they operate correctly at critical moments.
    - High availability: cloud computing platforms such as AWS or GCP aim to provide highly available services. They need to make sure the service stays available even when a server or a data center fails.

**Question 2:**
---

*For the following systems (A and B), identify which attribute (reliability, availability etc.) is considered least important. Justify your answers.*

> *A. An aircraft system has three computers voting on the results of every operation performed by the auto-pilot. If the auto-pilot fails, a warning alarm goes off in the cockpit to alert the pilot, who can then take over the manual controls of the aircraft and guide it to safety. However, the pilot does not interfere as long as the autopilot does not raise the alarm.*

1. Reliability: reliability is relatively **important** for the autopilot. If the autopilot fails or produces wrong results, it can cause safety problems.
2. Availability: availability is relatively **unimportant** for the autopilot. As long as the autopilot works correctly the pilot does not need to operate the aircraft, and takes over only when the alarm goes off. **Therefore, even if the autopilot is not available all of the time, the overall operation is not greatly affected.**

> *B. An online trading website allows its customers to place bids on various items, and to track their bidding online. While it is acceptable for a user to not be able to place bids if the traffic is too high, it is not acceptable for a user who has placed bids to not track their bid’s status and modify the bid. Also, as far as possible, the website should not display an incorrect value of the item’s current bid, as this can cause users to over/under-bid for it.*

1. Reliability: from the point of view of the **transaction**, reliability is very important, because the website must process users' information correctly. A transaction error or an incorrectly listed item could cause serious financial loss to the users or to the platform.
2. Availability: from the point of view of the **web page**, availability also matters, because users want to be able to use the website at any time. If the website is unavailable, users cannot trade and may miss opportunities.

Combining the two points of view, a transaction error is the more serious problem for an online trading website, and the problem statement itself accepts that users may be unable to place bids when traffic is too high. **Reliability is therefore more important than availability**, which makes availability the less important attribute.

*In each of the following descriptions (C and D), identify the fault, error and failure.*

> *C. A program contains a rare race condition that is only triggered when the OS schedules threads in a certain order. Once triggered however, the race condition corrupts a value in the program, which in turn is used to make a branching decision. If the branching decision is incorrect, the program will go into an infinite loop and hang, thus failing to produce any output.*

1. Fault: the latent defect in the program, here the race condition, which leads to an error under specific conditions.
   → *A program contains a rare race condition that is only triggered when the OS schedules threads in a certain order.*
2. Error: the abnormal state that arises during execution, here the value corrupted once the race condition is triggered.
   → *Once triggered however, the race condition corrupts a value in the program, which in turn is used to make a branching decision.*
3. Failure: the program hangs and delivers no output.
   → *If the branching decision is incorrect, the program will go into an infinite loop and hang, thus failing to produce any output.*

> *D. A radar system uses an array of processors to track its target in real-time. A soft error in a processor can lead to the processor computing an incorrect value for the target’s location. However, the system can compensate for this effect by redundantly allocating the tasks to processors and comparing the results. But this compensation entails a performance overhead, which in some cases, can cause the system to miss the tasks’ deadlines and lose the target.*

1. Fault: the underlying defect that leads to an error under specific conditions, here the soft error in a processor.
   → *A soft error in a processor*
2. Error: the abnormal state that arises during execution, here the incorrect target location computed by the affected processor.
   → *A soft error in a processor can lead to the processor computing an incorrect value for the target’s location.*
3. Failure: the system no longer delivers its service, here missing the deadlines and losing the target.
   → *the system misses the tasks’ deadlines and loses the target.*

**Question 3:**
---

*A telephone system has less than 3 min per year downtime. What is its steady-state availability?*

1. ${A_{ss} = \frac{MTTF}{MTTF+MTTR} = \frac{\text{uptime}}{\text{uptime}+\text{downtime}}}$
2. ${A_{ss} \geq 1-\frac{3}{365\times24\times60} \approx 0.999994}$

The steady-state availability of the telephone system is therefore at least 99.9994% (almost no downtime).

**Question 4:**
---

*A copy machines manufacturer estimates that the reliability of the machines he produces is 73% during the first 3 years of operation.*

1. *How many copy machines will need a repair during the first year of operation?*
    - Assume the exponential failure law.
    - ${R(t) = e^{-\lambda t}}$; substitute ${R(t) = 0.73}$ at ${t = 3}$ years.
    - ${0.73 = e^{-3\lambda}}$
    - ${\lambda = -\frac{1}{3}\ln 0.73 \approx 0.105 \text{ per year} \Rightarrow R(1) = e^{-\lambda} \approx 0.9}$
    - About ${1 - 0.9 = 10\%}$ of the copy machines will need a repair during the first year.
2. *What is the MTTF of the copy machines?*
    - ${MTTF = \frac{1}{\lambda} = \frac{1}{-\frac{1}{3}\ln 0.73} \approx 9.5 \text{ years}}$
3. *The manufacturer guarantees MTTR = 2 days. What is the MTBF of the copy machines?*
    - ${MTBF = MTTF + MTTR}$
    - ${MTTF \approx 9.5 \text{ years}}$
    - ${MTTR = 2 \text{ days} = \frac{2}{365} \text{ years} \approx 0.0055 \text{ years}}$
    - ${MTBF \approx 9.5 + 0.0055 = 9.5055 \text{ years}}$
4. *Suppose that two copy machines work in parallel and the failures are independent. What is the probability of failure during the first year of operation?*
    - ${(1-0.9)^{2} = 1\%}$

**Question 5:**
---

*Devise an original example (different from the lecture examples) to illustrate the difference between faults, errors, and failures. As you illustrate these concepts, relate them to the three-universe model.*

The relationship among the three is shown in the figure below:

![](https://hackmd.io/_uploads/SJb_u8Yg6.png)

Consider a smart-home security system with cameras, motion sensors, and a central control hub.

1. A fault is a latent problem or anomaly in the system that may lead to an error or a failure. In this example the motion sensor has a fault: because of a hardware defect it occasionally sends an incorrect signal, which may trigger a false alarm.
    - Fault: the sensor hardware
2. An error is the state in which the system shows unintended or unexpected behavior because of a fault. In this example the faulty sensor sends an incorrect signal that leads to a false alarm, and this false alarm is the error.
    - Error 1: the false alarm generated by the central control hub
    - Error 2: the response of the smart-home security system, e.g. activating the siren or calling the police
3. A failure means that the system as a whole cannot perform its intended function. In this example, when the false alarm has unnecessary consequences, such as setting off the security alarm and calling the police and thereby disturbing the homeowner, the smart-home system has failed to provide a reliable and accurate security service.
    - Failure 1: the nuisance caused to the homeowner
    - Failure 2: the system cannot tell a real threat from a false alarm

**Question 6**
---

*Why redundancy techniques used in hardware system cannot be used for software fault tolerance. If you are employed as a software quality engineer, what techniques you will prefer?*

1. Fault tolerance is more complex for software than for hardware, because software modules tend to have highly correlated failures, whereas hardware can usually be broken down into small components.
2. Software reliability depends on the inputs that the environment supplies over time, whereas hardware fault-tolerance techniques were developed mainly to mitigate permanent component faults. If the specification of a module is detailed enough to write several equivalent backups, I would prefer the recovery block (RB) technique. With RB, an alternate only has to be written once a problem is actually encountered. With N-version programming (NVP), more time is spent up front, before any problem has appeared, looking for bugs that may not exist.

**Question 7**
---

*The company that you work for is designing an industrial controller that maintains the temperature of a fluid during a chemical reaction. The non-redundant controller (figure below) contains:*

1. *temperature sensor*
2. *analog circuitry to process the temperature sensor’s output signal*
3. *analog-to-digital converter (ADC)*
4. *microprocessor (including hardware and software)*
5. *digital-to-analog converter (DAC)*
6. *analog circuitry to process the output of the DAC*
7. *heating coil to control the temperature.*

*You have been asked to develop at least two approaches for making the controller tolerant of **any two faulty components**. The term “component” means one of the blocks of functionality listed above, **excluding the heating coil.***

![](https://hackmd.io/_uploads/rJEVNMue6.png)

1. *Show block diagrams of your two approaches and compare them qualitatively. Note that your designs should be able to handle faults of any two components, including any two same components (e.g., 2 ADCs) and any two different components (e.g., 1 ADC and 1 temperature sensor).*

    ![](https://hackmd.io/_uploads/SyMZuacbT.png)

    (TMR/Simplex)

    ![](https://hackmd.io/_uploads/rkl6J9pq-T.png)

2. *Which approach would you recommend for implementation and why?*

    The voter-based TMR/Simplex approach adds one module (the voter), but it also reduces the number of error detectors needed, and a voter is simpler to implement than a 3-to-1 switch. I would therefore recommend the TMR/Simplex approach.

**Question 8**
---

*Moon Systems, a manufacturer of scientific workstations, produces its Model 13 System at sites S1, S2, S3; 20% at S1, 35% at S2, and the remaining 45% at S3. The probability that a Model 13 System will be found defective upon receipt by a customer is 0.01 if it is shipped from site S1, 0.06 if from S2, and 0.03 if from S3.*

1. *What is the probability that a Model 13 System selected at random at a customer location will be found defective?*

    Here ${S_i}$ denotes the probability that a system comes from site ${i}$ and is defective.
    - ${S_1: 0.2\times 0.01 = 0.002}$
    - ${S_2: 0.35\times 0.06 = 0.021}$
    - ${S_3: 0.45\times 0.03 = 0.0135}$
    - ${S_1 + S_2 + S_3 = 0.0365}$
2. *Suppose a Model 13 System selected at random is found to be defective at a customer location. What is the probability that it was manufactured at site S3?*
    - ${\frac{S_3}{S_1 + S_2 + S_3} = \frac{0.0135}{0.0365} \approx 0.37}$

**Question 9**
---

*Show that the reliability of TMR/Simplex is always better than either TMR or Simplex alone.*

In terms of reliability, TMR/Simplex keeps working after one module has failed, because it switches to a single good module. Plain TMR, after one module has failed, still has to compare the two remaining modules. If they disagree, the whole system fails. TMR/Simplex therefore survives a second module failure that would bring TMR down, so it has the higher reliability and is better than either TMR or Simplex alone.

**Question 10**
---

*Find all tests for the stuck-at-0 fault on the marked line*

![](https://hackmd.io/_uploads/HyU24f_xT.png =70%x)

* For the s-a-0 fault to be detectable, (B, C) must not be (1,1).
* For the NAND gate connected to A, the error appears when A = 1.
* For the NAND gate whose output is f, the upper input is always 1 because of the s-a-0 fault, so the error can be observed when the lower input is 1. Tracing back, the error can be observed when e = 0. Therefore (A,B,C,D) = (1,0,0,0) or (1,0,1,0) or (1,1,0,0).
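
The numeric answers to Questions 3, 4, 8, and 9 above can be double-checked with a short script. This is only a minimal sketch and not part of the original homework: all variable and function names are made up for illustration. The Question 9 check additionally assumes the standard closed forms $R_{TMR} = 3R^2 - 2R^3$ and $R_{TMR/Simplex} = \frac{3}{2}R - \frac{1}{2}R^3$ for identical, independent modules with an exponential failure law and a perfect voter and switch, which the answer above does not state.

```python
import math

# Question 3: availability with at most 3 minutes of downtime per year.
MINUTES_PER_YEAR = 365 * 24 * 60
availability = 1 - 3 / MINUTES_PER_YEAR

# Question 4: exponential failure law R(t) = exp(-rate * t), with R(3 years) = 0.73.
failure_rate = -math.log(0.73) / 3      # failures per year
r_one_year = math.exp(-failure_rate)    # reliability over the first year
mttf_years = 1 / failure_rate
mttr_years = 2 / 365                    # 2 days expressed in years
mtbf_years = mttf_years + mttr_years
p_both_fail = (1 - r_one_year) ** 2     # two independent machines in parallel

# Question 8: total probability, then Bayes' rule for site S3.
site_share = {"S1": 0.20, "S2": 0.35, "S3": 0.45}
p_defect_given_site = {"S1": 0.01, "S2": 0.06, "S3": 0.03}
joint = {s: site_share[s] * p_defect_given_site[s] for s in site_share}
p_defective = sum(joint.values())
p_s3_given_defective = joint["S3"] / p_defective


# Question 9: assumed closed forms, with r the reliability of one module.
def r_tmr(r: float) -> float:
    """Reliability of plain TMR (at least 2 of 3 modules working)."""
    return 3 * r**2 - 2 * r**3


def r_tmr_simplex(r: float) -> float:
    """Reliability of TMR/Simplex under the assumptions stated above."""
    return 1.5 * r - 0.5 * r**3


def tmr_simplex_dominates(steps: int = 99) -> bool:
    """Check R_TMR/Simplex > max(R_simplex, R_TMR) on a grid of 0 < r < 1."""
    grid = [k / (steps + 1) for k in range(1, steps + 1)]
    return all(r_tmr_simplex(r) > max(r, r_tmr(r)) for r in grid)


print(f"Q3 availability        = {availability:.6f}")
print(f"Q4 R(1 year)           = {r_one_year:.4f}")
print(f"Q4 MTTF, MTBF (years)  = {mttf_years:.3f}, {mtbf_years:.3f}")
print(f"Q4 P(both fail)        = {p_both_fail:.4f}")
print(f"Q8 P(defective)        = {p_defective:.4f}")
print(f"Q8 P(S3 | defective)   = {p_s3_given_defective:.3f}")
print(f"Q9 TMR/Simplex is best = {tmr_simplex_dominates()}")
```

The script keeps full precision for $\lambda$, so its MTTF and MTBF differ slightly in the second decimal place from the hand calculation, which rounds MTTF to 9.5 years. The grid check for Question 9 agrees with the algebra under the same assumptions: $R_{TMR/Simplex} - R = \frac{1}{2}R(1 - R^2)$ and $R_{TMR/Simplex} - R_{TMR} = \frac{3}{2}R(1 - R)^2$, both positive for $0 < R < 1$.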