Q1) Exercise 4.13 Instruction sequence a lw $1,40($6) I1 add $6,$2,$2 I2 sw $6,50($1) I3 b lw $5,-16($5) I1 sw $5,-16($5) I2 add $5,$5,$5 I3 4.13.1. Indicate dependences and their type. From Ia to Ib on $x : Ia -> Ib ($x) a : RAW : I1 -> I3 ($1), I2 -> I3 ($6) WAR : I1 -> I2 ($6) b : RAW : I1 -> I2 ($5), I1 -> I3 ($5) WAR : I2 -> I3 ($5), I1 -> I3 ($5) WAW : I1 -> I3 ($5) 4.13.2 Assume there is no forwarding in this pipelined processor. Indicate hazards and add instructions to eliminate them. WAW and WAR does not affect the single five stage pipeline. a : I1 -> I3 ($1) -> Hazard because dist(i1, I3) = 2 <= dist(id, WB) = 3 I2 -> I3 ($6) -> Hazard because dist(i2, I3) = 1 <= 3 lw $1,40($6) I1 add $6,$2,$2 I2
sw $6,50($1) I3 If register read/write can be done in one cycle, only 2 s are needed. (Both are correct) lw $1,40($6) I1 add $6,$2,$2 I2 sw $6,50($1) I3 b : I1 -> I2 ($5) -> Hazard because dist(i1, I2) = 1 < 3 I1 -> I3 ($5) -> Hazard because dist(i1, I3) = 2 < 3. lw $5,-16($5) I1 sw $5,-16($5) I2 add $5,$5,$5 I3 If register read/write can be done in one cycle, only 2 s are needed. (Both are correct) lw $1,40($6) I1 add $6,$2,$2 I2 sw $6,50($1) I3 4.13.3 Assume there is full forwarding. Indicate hazards and add instructions to eliminate them. The remaining problems in this exercise assume the following clock cycle
times: without fowarding with full fowarding with ALU-ALU forwarding only a 300ps 400ps 360ps b 200ps 250ps 220ps a : I1 ->I3 ($1) -> The result of MEM(I1) can be forwarded to EX(I3) stage. I2 -> I3 ($6) -> The result of EX(I2) can be forwarded to EX(I3) stage. lw $1,40($6) I1 add $6,$2,$2 I2 sw $6,50($1) I3 b : I1 -> I2 ($5) -> The result of MEM(I1) cannot be foworaded to EX(I2) stage : 1. I1 -> I3 ($5) -> The result of MEM(I1) can be foworaded to EX(I3) stage. lw $5,-16($5) I1 sw $5,-16($5) I2 add $5,$5,$5 I3 4.13.4 What is the total execution time of this instruction sequence without forwarding and with full forwarding? What is the speed-up achieved by adding full forwarding? a : Without forwarding t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 IF I1 I2 I3 ID I1 I2 I3
EX I1 I2 I3 MEM I1 I2 I3 WB I1 I2 I3 10 cycles for 3 instructions: 10 * 300ps = 3000ps If register read/write can be done in one cycle, only 2 s are needed. (Both are correct) t0 t1 t2 t3 t4 t5 t6 t7 t8 IF I1 I2 I3 ID I1 I2 I3 EX I1 I2 I3 MEM I1 I2 I3 WB I1 I2 I3 9 cycles for 3 instructions: 9 * 300 = 2700ps With forwarding t0 t1 t2 t3 t4 t5 t6 IF I1 I2 I3 ID I1 I2 I3 EX I1 I2 I3 MEM I1 I2 I3 WB I1 I2 I3 7 cycles for 3 instructions: 7 * 400ps = 2800ps Speedup: 3000/2800 = 1.07 ( 2700/2800 = 0.96)
b : Without forwarding t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 IF I1 I2 I3 ID I1 I2 I3 EX I1 I2 I3 MEM I1 I2 I3 WB I1 I2 I3 10 cycles for 3 instructions: 10 * 200ps = 2000ps If register read/write can be done in one cycle, only 2 s are needed. (Both are correct) t0 t1 t2 t3 t4 t5 t6 t7 t8 IF I1 I2 I3 ID I1 I2 I3 EX I1 I2 I3 MEM I1 I2 I3 WB I1 I2 I3 9 cycles for 3 instructions: 9 * 200ps = 1800ps With forwarding t0 t1 t2 t3 t4 t5 t6 t7 IF I1 I2 I3 ID I1 I2 I3 EX I1 I2 I3
MEM I1 I2 I3 WB I1 I2 I3 8 cycles for 3 instructions: 8 * 250ps = 2000ps Speedup: 2000/2000 = 1. (1800/2000 = 0.9) 4.13.5 Add instructions to this code to eliminate hazards if there is ALU-ALU forwarding only (no forwarding from the MEM to EX stage)? a : I1 ->I3 ($1) -> The result of MEM(I1) cannot be forwarded to EX(I3) stage. I2 -> I3 ($6) -> The result of EX(I2) can be forwarded to EX(I3) stage. t0 t1 t2 t3 t4 t5 t6 t7 t8 IF I1 I2 I3 ID I1 I2 I3 EX I1 I2 I3 MEM I1 I2 I3 WB I1 I2 I3 We need 2 instructions between I2 and I3. lw $1,40($6) I1 add $6,$2,$2 I2 sw $6,50($1) I3 If register read/write can be done in one cycle, only 1 s are needed. (Both are correct) t0 t1 t2 t3 t4 t5 t6 t7 IF I1 I2 I3
ID I1 I2 I3 EX I1 I2 I3 MEM I1 I2 I3 WB I1 I2 I3 lw $1,40($6) I1 add $6,$2,$2 I2 sw $6,50($1) I3 b : I1 -> I2 ($5) -> The result of MEM(I1) cannot be forwarded to EX(I2) stage. I1 -> I3 ($5) -> The result of MEM(I1) cannot be forwarded to EX(I3) stage. No dependencies can be forwarded: Same as without forwarding lw $5,-16($5) I1 sw $5,-16($5) I2 add $5,$5,$5 I3 If register read/write can be done in one cycle, only 1 s are needed. (Both are correct) lw $5,-16($5) I1 sw $5,-16($5) I2 add $5,$5,$5 I3 4.13.6. What is the total execution time of this instruction sequence with only ALU-ALU
forwarding? What is the speed-up over a no-forwarding pipeline? a : Without forwarding : 3000ps(2700ps) With ALU-ALU forwarding : 9 cycles for 3 instructions, 9 * 360 = 3240ps (8 * 360 = 2880ps) Speed up : 0.93(0.94) b : Without forwarding = 2000ps ( 1800ps) With ALU-ALU forwarding : 10 * 220ps = 2200ps (9 * 220 = 1980ps) Speed up : 0.91(0.91) Q2) Exercise 4.14
Instruction sequence a lw $1,40($6) I1 beq $2,$0,Label ; Assume $2 == $0 I2 sw $6,50($2) I3 Label: add $2,$3,$4 I4 sw $3,50($4) I5 b lw $5,-16($5) I1 sw $4,-16($4) I2 lw $3,-20($4) I3 beq $2,$0,Label ; Assume $2!= $0 I4 add $5,$1,$4 I5 4.14.1 For this problem, assume that all branches are perfectly predicted and that no delay slots are used. If we have only one memory (for both instructions and data), there is a structural hazard every time we need to fetch an instruction in the same cycle in which another instruction accesses data. To guarantee forward progress, this hazard must always be resolved in favor of the instruction that accesses data. What is the total execution time of this instruction sequence in the five-stage pipeline that only has one memory? We have seen that data hazards can be eliminated by adding s to the code. Can you do the same with this structural hazard? Why? a : 9 cycles for 4 instructions t0 t1 t2 t3 t4 t5 t6 t7 t8 IF I1 I2 I4 stall I5 ID I1 I2 I4 stall I5 EX I1 I2 I4 stall I5 MEM I1 I2 I4 stall I5 WB I1 I2 I4 stall I5
b : 12 cycles for 5 instructions t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 IF I1 I2 I3 stall stall stall I4 I5 ID I1 I2 I3 stall stall stall I4 I5 EX I1 I2 I3 stall stall stall I4 I5 MEM I1 I2 I3 stall stall stall I4 I5 WB I1 I2 I3 stall stall stall I4 I5 Because also needs to access memory, we cannot resolve structural hazard through inserting s. 4.14.2. For this problem, assume that all branches are perfectly predicted and that no delay slots are used. If we change load/store instructions to use a register (without an offset) as the address, these instructions no longer need to use the ALU. As a result, MEM and EX stages can be overlapped and the pipeline has only four stages. Change this code to accommodate this changed ISA. Assuming this change does not affect clock cycle time, what speed-up is achieved this instruction sequence? Assume that $6 + 40 = $a, $2 + 50 = $b, $4 + 50 = $c, Instruction sequence a lw $1,$a I1 beq $2,$0,Label ; Assume $2 == $0 I2 sw $6,$b I3 Label: add $2,$3,$4 I4 sw $3,$c I5 b lw $5,$d I1 sw $4,$e I2 lw $3,$f I3
beq $2,$0,Label ; Assume $2!= $0 I4 add $5,$1,$4 I5 a : 5 stages -> 8 cycles t0 t1 t2 t3 t4 t5 t6 t7 IF I1 I2 I4 I5 ID I1 I2 I4 I5 EX I1 I2 I4 I5 MEM I1 I2 I4 I5 WB I1 I2 I4 I5 4 stages -> 7 cycles t0 t1 t2 t3 t4 t5 t6 IF I1 I2 I3 I5 ID I1 I2 I3 I5 E/M I1 I2 I3 I5 WB I1 I2 I3 I5 Speedup = 8/7 = 1.14 b : 5 stages -> 9 cycles 4 stages -> 8 cycles Speedup = 9/8 = 1.13 4.13.3 Assuming stall-on-branch and no delay slots, what speed-up is achieved on this code if branch outcomes are determined in the ID stage, relative to the executions where branch outcomes are determined in the EX stage?
a : ID stage -> 9 cycles t0 t1 t2 t3 t4 t5 t6 t7 t8 IF I1 I2 I3 I4 I5 ID I1 I2 stall I4 I5 EX I1 I2 stall I4 I5 MEM I1 I2 stall I4 I5 WB I1 I2 stall I4 I5 Ex stage -> 10 cycles t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 IF I1 I2 I3 I3 I4 I5 ID I1 I2 stall stall I4 I5 EX I1 I2 stall stall I4 I5 MEM I1 I2 stall stall I4 I5 WB I1 I2 stall stall I4 I5 Speedup = 10/9 = 1.11 b : ID stage -> 10 cycles EX stage -> 11 cycles Speedup = 11/10 = 1.10 The remaining problems in this exercise assume that individual pipeline stages have the following latencies: IF ID EX MEM WB a 100ps 120ps 90ps 130ps 60ps
b 180ps 100ps 170ps 220ps 60ps 4.14.4 Given these pipeline stage latencies, repeat the speed-up calculation from 4.14.2, but take into account the change in clock cycle time. When EX and MEM are don in a single stage, most of their work can be done in parallel. As a result, the resulting EX/MEM stage has a latency that is the larger of the original two, plus 20ps needed for the work that could not be done in parallel. The latency of the pipeline stage should be greater than the slowest stage latency. a : 5 stage -> 130ps(MEM) * 8 cycles = 1040ps 4 stage -> (130 + 20)ps(EX/MEM) * 7 cycles = 1050ps Speedup = 0.99 b : 5 stage -> 220ps(MEM) * 9 cycles = 1980ps 4 stage -> (220 + 20)(EX/MEM) * 8 cycles = 1920ps Speedup = 1.03 4.14.5 Given these pipeline stage latencies, repeat the speed-up calculation from Exercise 4.14.3, taking into account the change in clock cycle time. Assume that the latency ID stage increases by 50% and the latency of the EX stage decreases by 10ps when branch outcome resolution is moved from EX to ID. a : Branch resolution at EX stage -> 130ps(MEM) * 10cycles = 1300ps Branch resolution at ID stage -> 180ps(ID) * 9 cycles = 1620ps Speedup = 0.80 b : Branch resolution at EX stage -> 220ps(MEM) * 11cycles = 2420ps Branch resolution at ID stage -> 220ps(MEM) * 10 cycles = 2200ps Speedup = 1.1 4.14.6 Assuming stall-on-branch and no delay slots, what is the new clock cycle time and
execution time of this instruction sequence if beq address computation is moved to the MEM stage? What is the speed-up from this change? Assume that the latency of the EX stage is reduced by 20ps and the latency of the MEM stage is unchanged when branch outcome resolution is moved from EX to MEM. a : Branch resolution at EX stage -> 130ps(MEM) * 10cycles = 1300ps Branch resolution at MEM stage -> 130ps(MEM) * 11 cycles = 1430ps Speedup = 0.91 t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 IF I1 I2 I3 I3 I3 I4 I5 ID I1 I2 stall stall stall I4 I5 EX I1 I2 stall stall stall I4 I5 MEM I1 I2 stall stall stall I4 I5 WB I1 I2 stall stall stall I4 I5 b : Branch resolution at EX stage -> 220ps(MEM) * 11cycles = 1300ps Branch resolution at MEM stage -> 220ps(MEM) * 12 cycles = 2640ps Speedup = 0.92 Q3) 4.19.1. If we use no forwarding, what fraction of cycles are we stalling due to data hazards? (3) Dependences to the 1 st next instruction: 2 stall cycles Dependences to both 1 st and 2 nd next instruction: 2 stall cycles Dependences to only the 2 nd next instruction: 1 stall cycle
CPI Fraction a. 1 + 0.45x2 + 0.05x1 = 1.95 49% (0.95/1.95) b. 1 + 0.4x2 + 0.1x1 = 1.9 47% (0.9/1.9) 4.19.2. If we use full forwarding (forward all results that can be forwarded), what fraction of cycles are we stalling due to data hazards? (3) The only RAW data dependences that cause stalls are those from the MEM state of one instruction to the 1 st next instruction, and the stall is only 1 cycle. CPI Fraction a. 1 + 0.25 = 1.25 20% (0.25/1.25) b. 1 + 0.2 = 1.2 17% (0.2/1.2) 4.19.3. Which of the two options results in fewer data stall cycles? (3) With forwarding only from the EX/MEM register: EX to 1st dependences: no stalls. EX to 2nd dependences: 1 stall cycles MEM to 1st dependences: 2 stall cycles With forwarding only from the MEM/WB register: EX to 2nd dependences: no stalls. MEM to 1st dependences: 1 stall cycle (no time travel). EX to 1st dependences: 1 stall cycle because we must wait for the instruction to complete the MEM stage to be able to forward to the next instruction. EX/MEM MEM/WB Fewer stall cycles with a. 0.1 + 0.05 + 0.25x2 = 0.65 0.1 + 0.1 + 0.25 = 0.45 MEM/WB b. 0.05 + 0.1 + 0.2x2 = 0.55 0.15 + 0.05 + 0.2 = 0.4 MEM/WB
4.19.4. What is the speed-up achieved by adding full forwarding to a pipeline that had no forwarding? (2) We have already computed the CPI without forwarding and with full forwarding in 4.19.1 and 4.19.2. We can compute time per instruction by taking into account the clock cycle time: Without forwarding With forwarding Speed-up a. 1.95 x 100ps = 195ps 1.25 x 110ps = 137.5ps 1.42 b. 1.9 x 300ps = 570ps 1.2 x 350ps = 420ps 1.36 4.19.5. What would be the additional speed-up if we added time-travel forwarding that eliminates all data hazards? (2) We already computed the time per instruction for full forwarding in 4.19.4. We can compute timeper instruction with time-travel forwarding and the speed-up over full forwarding. Without forwarding Time-travel forwarding Speed-up a. 1.25 x 110ps = 137.5ps 1 x 210ps = 210ps 0.65 b. 1.2 x 350ps = 420ps 1 x 450ps = 450ps 0.93 4.19.6. Repeat Exercise 4.19.3 but this time determine which of the two options results in shorter time per instruction. (1) EX/MEM MEM/WB Fewer stall cycles with a. 1.4 x 100ps = 140ps 1.45 x 100ps = 145ps EX/MEM b. 1.35 x 320ps = 432ps 1.4 x 310ps = 434ps EX/MEM Q4) 4.23.1. Stall cycles due to mispredicted branches increase the CPI. What is the extra CPI due to mispredicted branches with the always-taken predictor? Assume that branch outcomes are
determined in the EX stage, that there are no data hazards, and that no delay slots are used. (2) - cycle stall 을 2cycle, 3cycle 모두정답으로처리. a. Misprediction 에의해생기는 cycle stall 은 3 cycle 이다. Branch instruction 의전체양은 15% 이고, always-taken predictor 의 miss rate 은 60% 이므로상승하는 CPI 을계산하면, 3 * 0.6 * 0.15 = 0.27 만큼 CPI 상승이발생한다. b. predictor 의 miss rate 은 40% 이고, branch instruction 의양은 10% 이므로, 3 * 0.4 * 0.10 =0.12 4.23.2. Repeat Exercise 4.21.1 for the "always not-taken" predictor. (2) a. always not-taken 의 miss rate 은 40% 이고, branch instruction 의양은 15% 이므로, 3 * 0.4 * 0.15 = 0.18 b. miss-rate 은 60% 이고, branch instruction 의양은 10% 이므로, 3 * 0.6 * 0.10 = 0.18 4.23.3. Repeat Exercise 4.23.1 for the 2-bit predictor. (2) a. 2-bit predictor 의 miss rate 은 20% 이고, branch instruction 의양은 15% 이므로, 3 * 0.2 * 0.15 = 0.09 b. miss-rate 은 5% 이고, branch instruction 의양은 10% 이므로, 3 * 0.05 * 0.10 = 0.015 3 * 0.2 4.23.4. With the 2-bit predictor, what speed-up would be achieved if we could convert half of the branch instructions in a way that replaces a branch instruction with an ALU instruction? Assume that correctly and incorrectly predicted instructions have the same chance of being replaced. (3) 2-bit predictor의 CPI는 ALU instruction으로의 conversion이없는경우, 4.23.3의 CPI만큼의증가가일어난 a = 1.09, b = 1.015이다. 여기서 branch instruction의절반을 (branch prediction이 correct 하던, incorrect 하던지상관없이 ) ALU instruction으로교체가가능하다면 CPI는 a = 1 + 3 * 0.2 * 0.15 * 0.5 = 1.045, b = 1+ 3 * 0.05 * 0.10 * 0.5 = 1.0075이된다. 따라서 conversion으로생기는 speed-up은 a = 1.09 / 1.045 = 1.043, b = 1.015 / 1.0075 = 1.007가된다. 4.23.5. With the 2-bit predictor, what speed-up would be achieved if we could convert half of the branch instructions in a way that replaced each branch instruction with two ALU instructions? Assume that correctly and incorrectly predicted instructions have the same
chance of being replaced. (3) Branch instruction의 50% 를 2개의 ALU instruction으로 conversion하게되면총 program의 instruction의개수는늘어나게된다. 따라서 CPI를계산하면, a = 1 + (1 + 3 * 0.2) * 0.15 * 0.5 = 1.12, b = 1 + (1 + 3 * 0.05) * 0.1 * 0.5 = 1.058로나오게되므로, speed-up을계산하면, a = 1.09 / 1.12 = 0.97, b = 1.015 / 1.058 = 0.96이나오게된다. 4.23.6. Some branch instructions are much more predictable than others. If we know that 80% of all executed branch instructions are easy-to-predict look-back branches that are always predicted correctly, what is the accuracy of the 2-bit predictor on the remaining 20% of the branch instructions? (2) a. 2-bit predictor 의 accuracy 는 80% 인데, 위문제에서 80% 의 branch instruction 은항상 correctly predicted 하므로나머지 20% 의 branch instruction 에대해서는 100% miss 가난다고볼 수있다. b. 2-bit predictor의 accuracy는 95% 인데, 위문제에서 80% 의 branch instruction은항상 correctly predicted 하므로나머지 20% 의 branch instruction 중 15% 에대해서정확히맞춘다고볼수있다. 따라서이경우나머지 20% 의 branch instruction에대한 accuracy는 15/20 = 75% 가된다. Q5) 4.24.1. What is the accuracy of always-taken and always-not-taken predictors for this sequence of branch outcomes? (2) a. always-taken = 75%, always not-taken = 25% b. always-taken = 60%, always not-taken = 40% 4.24.2. What is the accuracy of the 2-bit predictor for the first four branches in this pattern, assuming that the predictor starts off in the bottom left state from Figure 4.63 (predict not taken) (2) 처음 2-bit predictor 의 prediction 은 predict not taken 에서시작하므로, a. 0%, b. 25% 4.24.3. What is the accuracy of the 2-bit predictor if this pattern is repeated forever? (2) a. 75%, b. 40%
4.24.4. Design a predictor that would achieve a perfect accuracy if this pattern is repeated forever. Your predictor should be a sequential circuit with one output that provides a prediction (1 for taken, 0 for not taken) and no inputs other than the clock and the control signal that indicates that the instruction is a conditional branch. (3) Shift register 를통해구현이가능하다. 1 for taken, 0 for not taken 으로 shift 하는 register 를만들어 a 의경우 100% prediction 을위한초기값은 1101 로준다. ( 즉 4-bit shift register 를 left shift 방식 으로구현 ) b 의경우 5-bit shift register 를두어서초기값을 11100 으로두면된다. 4.24.5. what is the accuracy of your predictor from Exercise 4.24.4 if it is given a repeating pattern that is the exact opposite of this one? (2) Predictor 의 output 이항상 opposite 하게반복되는 pattern 이라면 accuracy 는 0% 가된다. 4.24.6. Repeat the Exercise 4.24.4, but now your predictor should be able to eventually (after a warm-up period during which it can make wrong predictions) start perfectly predicting both this pattern and its opposite. Your predictor should have an input that tells it what the real outcome was. Hint: this input lets your predictor determine which of the two repeating patterns it is given. (3) 4.24.5에서구현된 shift register에서처음 predictor의결과가 incorrect로나올경우, shift register 의 pattern을 invert하여인식할수있게만들면된다. 이렇게하면처음 prediction에서만 incorrect가발생하고, 그후부터 opposite pattern에대해서도 100% prediction이가능하게된다. Opposite pattern으로시작하지않는다면처음부터 shift register를통해 100% prediction이가능하게될것이다. Q6) 4.25.1. Which exceptions can each of these instructions trigger? For each of these exceptions, specify the pipeline state in which it is detected. (2) Instruction 1 Instruction 2 a. Overflow (EX) Invalid target address (EX) b. Invalid data address (MEM) No exceptions
4.25.2. Show how the pipeline organization must be changed to be able to handle this exception. (2) The Mux that selects the next PC must have inputs added to it. Each input is a constant address of an exception handler. The exception detectors must be added to the appropriate pipeline stage and the outputs of these detectors must be used to control the pre-pc Mux, and also to convert to s instructions that are already in the pipeline behind the exception-triggering instruction. 4.25.3. What happens in the pipeline when the first instruction causes the first exception? (3) Instructions are fetched normally until the exception is detected. When the exception is detected, all instructions that are in the pipeline after the first instruction must be converted to s. As a result, the second instruction never completes and does not affect pipeline state. In the cycle that immediately follows the cycle in which the exception is detected, the processor will fetch the first instruction of the exception handler. 4.25.4. What is the address of the exception handler? What happens if there is an invalid instruction at that address in instruction memory? (3) Handler address: a. 0xFFFFF000 b. 0x00000010 The first instruction word from the handler address is fetched in the cycle after the one in which the original exception is detected. When this instruction is decoded in the next cycle, the processor detects that the instruction is invalid. This exception is treated just like a normal exception it converts the instruction being fetched in that cycle into a and puts the address of the Invalid Instruction handler into the PC at the end of the cycle in which the Invalid Instruction exception is detected. 4.25.5. Repeat Exercise 4.25.3 using this modified pipeline and vectored exception handling. (3) This approach requires us to fetch the address of the handler from memory. We must add the
code of the exception to the address of the exception vector table, read the handler s address from memory, and jump to that address. One way of doing this is to handle it like a special instruction that computer the address in EX, loads the handler s address in MEM, and sets the PC in WB. 4.25.6. We want to emulate vectored exception handling on a machine that has only one fixed handler address. Write the code that should be at that fixed address. (2) We need a special instruction that allows us to move a value from the (exception) Cause register to a general-purpose register. We must first save the general-purpose register (so we can restore it later), load the Cause register into it, add the address of the vector table to it, use the result as an address for a load that gets the address of the right exception handler from memory, and finally jump to that handler. Q7) 4.28.1. a. b. add $1, $0, $0 # i = 0 add $4, $0, $0 loop: loop: beq $1, $2, end # if i == j, goto end sll $6, $1, 2 # $6 = i << 2, for array index add $7, $3, $6 # $7 = a+(i*4) 주소 계산 sll $1, $4, 2 # $1 = i << 2, for array index add $2, $6, $1 # $2 = a+(i*4) 주소 계산 lw $3, 0($2) # $3 = a[i] 값 lw $8, 0($7) # $8 = a[i] 값로드 lw $1, 4($2) # $1 = a[i+1] 값 add $7, $4, $6 # $7 = b+(i*4) 주소 계산 beq $3, $1, end # exit loop sw $0, 0($2) # a[i] = 0
sw $8, 0($7) # b[i] = a[i] addi $4, $4, 1 # i++ addi $1, $1, 1 # i++ j loop: # for j loop: # for end: end: 4.28.2.a Instructions 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 add $1, $0, $0 IF ID EX ME WB beq $1, $2, end IF ID ** EX ME WB sll $6, $1, 2 IF ** ID EX ME WB add $7, $3, $6 IF ** ID ** EX ME WB lw $8, 0($7) IF ** ID EX ME WB add $7, $4, $6 IF ** ID EX ME WB sw $8, 0($7) IF ** ID EX ME WB addi $1, $1, 1 IF ** ID EX ME WB beq $0, $0, loop IF ID EX ME WB beq $1, $2, end IF ID ** EX ME WB sll $6, $1, 2 IF ** ID EX ME WB add $7, $3, $6 IF ** ID ** EX ME WB lw $8, 0($7) IF ** ID EX ME WB add $7, $4, $6 IF ** ID ** EX ME WB sw $8, 0($7) IF ** ID EX ME WB addi $1, $1, 1 IF ** ID EX ME WB beq $0, $0, loop IF ID EX ME WB beq $1, $2, end IF ID ** EX ME WB 4.28.2.b
Instructions 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 add $4, $0, $0 IF ID EX ME WB sll $1, $4, 2 IF ID ** EX ME WB add $2, $6, $1 IF ** ID EX ME WB lw $3, 0($2) IF ** ID ** EX ME WB lw $1, 4($2) IF ** ID EX ME WB beq $3, $1, end IF ** ID ** ** EX ME WB sw $0, 0($2) IF ** ** ID EX ME WB addi $4, $4, 1 IF ** ** ID EX ME WB beq $0, $0, loop IF ID EX ME WB sll $1, $4, 2 IF ID EX ME WB add $2, $6, $1 IF ID EX ME WB lw $3, 0($2) IF ID ** EX ME WB lw $1, 4($2) IF ** ID EX ME WB beq $3, $1, end IF ** ID ** ** EX ME WB sw $0, 0($2) IF ** ** ID EX ME WB addi $4, $4, 1 IF ** ** ID EX ME WB beq $0, $0, loop IF ID EX ME WB sll $1, $4, 2 IF ID EX ME WB add $2, $6, $1 IF ID EX ME WB lw $3, 0($2) IF ID EX ** ME WB lw $1, 4($2) IF ID ** EX ME WB beq $3, $1, end IF ID ** ** ** EX ME 4.28.3.a a. 변화없음 / 바꿀수있는것이 sw $8, 0($7), addi $1, $1, 1 뿐이지만바꿔도전체수행량은다 르지가않다. b. 변화없음 4.28.4.a 변화없음 4.28.4.b 변화없음
4.28.5.a 1-issue processor Instructions 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 add $1, $0, $0 IF ID EX ME WB beq $1, $2, end IF ID EX ME WB sll $6, $1, 2 IF ID EX ME WB add $7, $3, $6 IF ID EX ME WB lw $8, 0($7) IF ID EX ME WB add $7, $4, $6 IF ID EX ME WB sw $8, 0($7) IF ID EX ME WB addi $1, $1, 1 IF ID EX ME WB beq $0, $0, loop IF ID EX ME WB beq $1, $2, end IF ID EX ME WB 위표에서굵은글씨가 loop 안내용이다. 전제조건이 1,000,000 반복이므로파이프라인에서의앞 / 뒤를잘라내면각명령당한번의 cycle 이사용됨을알수있다. 따라서 CPI = 1 2-issue processor Instructions 1 2 3 4 5 6 7 8 9 10 11 12 13 14 add $1, $0, $0 IF ID EX ME WB beq $1, $2, end IF ID ** EX ME WB sll $6, $1, 2 IF ** ID EX ME WB add $7, $3, $6 IF ** ID ** EX ME WB lw $8, 0($7) IF ** ID EX ME WB add $7, $4, $6 IF ** ID EX ME WB sw $8, 0($7) IF ** ID EX ME WB addi $1, $1, 1 IF ** ID EX ME WB beq $0, $0, loop IF ID EX ME WB beq $1, $2, end IF ID ** EX ME WB
sll $6, $1, 2 IF ** ID EX ME WB add $7, $3, $6 IF ** ID ** EX ME WB lw $8, 0($7) IF ** ID EX ME WB add $7, $4, $6 IF ** ID ** EX ME WB sw $8, 0($7) IF ** ID EX ME WB addi $1, $1, 1 IF ** ID EX ME WB beq $0, $0, loop IF ID EX ME WB beq $1, $2, end IF ID ** EX ME WB 총명령어가 16 개, cycle 이 14 이므로 CPI = 14/16 이다. 따라서 speed-up = 16/14 4.28.5.b 1-issue processor Instructions 1 2 3 4 5 6 7 8 9 add $4, $0, $0 IF ID EX ME WB sll $1, $4, 2 IF ID EX ME WB add $2, $6, $1 IF ID EX ME WB lw $3, 0($2) IF ID EX ME WB lw $1, 4($2) IF ID EX ME WB beq $3, $1, end IF ID ** EX ME WB sw $0, 0($2) IF ** ID EX ME WB addi $4, $4, 1 IF ID EX ME WB beq $0, $0, loop IF ID EX ME WB CPI = 9/8 2-issue processor Instructions 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 add $4, $0, $0 IF ID EX ME WB sll $1, $4, 2 IF ID ** EX ME WB add $2, $6, $1 IF ** ID EX ME WB lw $3, 0($2) IF ** ID ** EX ME WB lw $1, 4($2) IF ** ID EX ME WB
beq $3, $1, end IF ** ID ** ** EX ME WB sw $0, 0($2) IF ** ** ID EX ME WB addi $4, $4, 1 IF ** ** ID EX ME WB beq $0, $0, loop IF ID EX ME WB sll $1, $4, 2 IF ID EX ME WB add $2, $6, $1 IF ID EX ME WB lw $3, 0($2) IF ID ** EX ME WB lw $1, 4($2) IF ** ID EX ME WB beq $3, $1, end IF ** ID ** ** EX ME WB sw $0, 0($2) IF ** ** ID EX ME WB addi $4, $4, 1 IF ** ** ID EX ME WB beq $0, $0, loop IF ID EX ME WB sll $1, $4, 2 IF ID EX ME WB add $2, $6, $1 IF ID EX ME WB lw $3, 0($2) IF ID EX ** ME WB lw $1, 4($2) IF ID ** EX ME WB beq $3, $1, end IF ID ** ** ** EX ME CPI = 15/16, Speedup = (9/8) / (15/16) = 1.2 4.28.6.a Instructions add $1, $0, $0 beq $1, $2, end a 1 sll $6, $1, 2 b 2 add $7, $3, $6 b 3 lw $8, 0($7) c 4 add $7, $4, $6 c 5 sw $8, 0($7) d 6 addi $1, $1, 1 d 7 beq $0, $0, loop e 8 beq $1, $2, end e 8
sll $6, $1, 2 f 9 add $7, $3, $6 f 10 lw $8, 0($7) g 11 add $7, $4, $6 g 12 sw $8, 0($7) h 13 addi $1, $1, 1 h 14 beq $0, $0, loop a 1 beq $1, $2, end 위에서같은알파벳으로되어있는명령어가같이수행될수있는명령어들이다. 하지만대부분 메모리에쓰는명령어기때문에같이수행불가. beq 에서만실제로한클락에일어나는것을 확인할수가있다. CPI = 14/16, Speed-up = 16/14 4.28.6.b Instructions add $4, $0, $0 sll $1, $4, 2 a 1 add $2, $6, $1 b 2 lw $3, 0($2) b 3 lw $1, 4($2) c 4 beq $3, $1, end c 5 sw $0, 0($2) d 6 addi $4, $4, 1 d 7 beq $0, $0, loop e 8 sll $1, $4, 2 e 8 add $2, $6, $1 f 9 lw $3, 0($2) f 10 lw $1, 4($2) g 11 beq $3, $1, end g 12 sw $0, 0($2) h 13 addi $4, $4, 1 h 14 beq $0, $0, loop a 1
A 와마찬가지다. CPI = 14/16. Speed-up = (8/9) / (14/16) = 1.29