sw $6,50($1) I3 If register read/write can be done in one cycle, only 2 s are needed. (Both are correct) lw $1,40($6) I1 add $6,$2,$2 I2 sw $6,50($1)

Similar documents
Page 2 of 6 Here are the rules for conjugating Whether (or not) and If when using a Descriptive Verb. The only difference here from Action Verbs is wh

Stage 2 First Phonics

Hi-MO 애프터케어 시스템 편 5. 오비맥주 카스 카스 후레쉬 테이블 맥주는 천연식품이다 편 처음 스타일 그대로, 부탁 케어~ Hi-MO 애프터케어 시스템 지속적인 모발 관리로 끝까지 스타일이 유지되도록 독보적이다! 근데 그거 아세요? 맥주도 인공첨가물이

Page 2 of 5 아니다 means to not be, and is therefore the opposite of 이다. While English simply turns words like to be or to exist negative by adding not,

Microsoft PowerPoint - ch03ysk2012.ppt [호환 모드]

Lectures for 2nd Edition

본문01

2 min 응용 말하기 01 I set my alarm for It goes off. 03 It doesn t go off. 04 I sleep in. 05 I make my bed. 06 I brush my teeth. 07 I take a shower.

하나님의 선한 손의 도우심 이세상에서 가장 큰 축복은 하나님이 나와 함께 하시는 것입니다. 그 이 유는 하나님이 모든 축복의 근원이시기 때문입니다. 에스라서에 보면 하나님의 선한 손의 도우심이 함께 했던 사람의 이야기 가 나와 있는데 에스라 7장은 거듭해서 그 비결을


09김정식.PDF

- 2 -

¹Ìµå¹Ì3Â÷Àμâ

11¹ÚÇý·É

300 구보학보 12집. 1),,.,,, TV,,.,,,,,,..,...,....,... (recall). 2) 1) 양웅, 김충현, 김태원, 광고표현 수사법에 따른 이해와 선호 효과: 브랜드 인지도와 의미고정의 영향을 중심으로, 광고학연구 18권 2호, 2007 여름

퇴좈저널36호-4차-T.ps, page Preflight (2)

182 동북아역사논총 42호 금융정책이 조선에 어떤 영향을 미쳤는지를 살펴보고자 한다. 일제 대외금융 정책의 기본원칙은 각 식민지와 점령지마다 별도의 발권은행을 수립하여 일본 은행권이 아닌 각 지역 통화를 발행케 한 점에 있다. 이들 통화는 일본은행권 과 等 價 로 연

I&IRC5 TG_08권

2011´ëÇпø2µµ 24p_0628


step 1-1

#Ȳ¿ë¼®


歯15-ROMPLD.PDF

슬라이드 제목 없음

<32382DC3BBB0A2C0E5BED6C0DA2E687770>


서론 34 2

Microsoft PowerPoint - 7-Work and Energy.ppt

<B3EDB9AEC1FD5F3235C1FD2E687770>

<C7D1B9CEC1B7BEEEB9AEC7D03631C1FD28C3D6C1BE292E687770>

PowerPoint 프레젠테이션

DBPIA-NURIMEDIA

<31342D3034C0E5C7FDBFB52E687770>

49-9분동안 표지 3.3

2007 학년도 하반기 졸업작품 아무도 모른다 (Nobody Knows) 얄리, 보마빼 (AIi, Bomaye) 외계인간 ( 外 界 人 間 ) 한국예술종합학교 연극원 극작과 예술전문사 안 재 승

<B1E2C8B9BEC828BFCFBCBAC1F7C0FC29322E687770>

Microsoft PowerPoint - CHAP-03 [호환 모드]

hwp

May 2014 BROWN Education Webzine vol.3 감사합니다. 그리고 고맙습니다. 목차 From Editor 당신에게 소중한 사람은 누구인가요? Guidance 우리 아이 좋은 점 칭찬하기 고맙다고 말해주세요 Homeschool [TIP] Famil

PowerChute Personal Edition v3.1.0 에이전트 사용 설명서


Journal of Educational Innovation Research 2017, Vol. 27, No. 2, pp DOI: : Researc

K7VT2_QIG_v3

272 石 堂 論 叢 49집 기꾼이 많이 확인된 결과라 할 수 있다. 그리고 이야기의 유형이 가족 담, 도깨비담, 동물담, 지명유래담 등으로 한정되어 있음도 확인하였 다. 전국적인 광포성을 보이는 이인담이나 저승담, 지혜담 등이 많이 조사되지 않은 점도 특징이다. 아울

2009년 국제법평론회 동계학술대회 일정

- 이 문서는 삼성전자의 기술 자산으로 승인자만이 사용할 수 있습니다 Part Picture Description 5. R emove the memory by pushing the fixed-tap out and Remove the WLAN Antenna. 6. INS

<30322D28C6AF29C0CCB1E2B4EB35362D312E687770>

2 2010년 1월 15일 경상북도 직업 스쿨 운영 자격 취득 위한 맞춤형 교육 시 10곳 100명에 교육 기회 제공 본인에게 적합한 직종 스스로 선택 1인당 최고 100만원까지 교육비 지원 경상북도는 결혼이주여성 100명에게 맞춤형 취업교 육을 제공하는 결혼이민자 직


_KF_Bulletin webcopy

OP_Journalism

74 현대정치연구 2015년 봄호(제8권 제1호) Ⅰ. 서론 2015년 1월 7일, 프랑스 파리에서 총격 사건이 발생했다. 두 명의 남성이 풍자 잡지 주간 샤를리 의 본사에 침입하여 총기를 난사한 것이다. 이 사건으로 인해 열두 명의 사람이 목숨을 잃었다. 얼마 후에

?????

지능정보연구제 16 권제 1 호 2010 년 3 월 (pp.71~92),.,.,., Support Vector Machines,,., KOSPI200.,. * 지능정보연구제 16 권제 1 호 2010 년 3 월

#중등독해1-1단원(8~35)학


6자료집최종(6.8))

2 동북아역사논총 50호 구권협정으로 해결됐다 는 일본 정부의 주장에 대해, 일본군 위안부 문제는 일 본 정부 군 등 국가권력이 관여한 반인도적 불법행위이므로 한일청구권협정 에 의해 해결된 것으로 볼 수 없다 는 공식 입장을 밝혔다. 또한 2011년 8월 헌 법재판소는

04-다시_고속철도61~80p

Journal of Educational Innovation Research 2019, Vol. 29, No. 1, pp DOI: (LiD) - - * Way to

민속지_이건욱T 최종

untitled

,.,..,....,, Abstract The importance of integrated design which tries to i

untitled

... 수시연구 국가물류비산정및추이분석 Korean Macroeconomic Logistics Costs in 권혁구ㆍ서상범...

PJTROHMPCJPS.hwp

975_983 특집-한규철, 정원호

1. 서론 1-1 연구 배경과 목적 1-2 연구 방법과 범위 2. 클라우드 게임 서비스 2-1 클라우드 게임 서비스의 정의 2-2 클라우드 게임 서비스의 특징 2-3 클라우드 게임 서비스의 시장 현황 2-4 클라우드 게임 서비스 사례 연구 2-5 클라우드 게임 서비스에

Microsoft Word Hanwha Daily_New.doc


public key private key Encryption Algorithm Decryption Algorithm 1

274 한국문화 73

<32B1B3BDC32E687770>


<3136C1FD31C8A320C5EBC7D52E687770>

[ 영어영문학 ] 제 55 권 4 호 (2010) ( ) ( ) ( ) 1) Kyuchul Yoon, Ji-Yeon Oh & Sang-Cheol Ahn. Teaching English prosody through English poems with clon

Vol.259 C O N T E N T S M O N T H L Y P U B L I C F I N A N C E F O R U M

5/12¼Ò½ÄÁö

<C1DF3320BCF6BEF7B0E8C8B9BCAD2E687770>

4. 수업의 흐름 차시 창의 인성 수업모형에 따른 단계 수업단계 활동내용 창의 요소 인성 요소 관찰 사전학습: 날짜와 힌트를 보고 기념일 맞춰보기 호기심 논리/ 분석적 사고 유추 5 차시 분석 핵심학습 그림속의 인물이나 사물의 감정을 생각해보고 써보기 타인의 입장 감정

3항사가 되기 위해 매일매일이 시험일인 듯 싶다. 방선객으로 와서 배에서 하루 남짓 지내며 지내며 답답함에 몸서리쳤던 내가 이제는 8개월간의 승선기간도 8시간같이 느낄 수 있을 만큼 항해사로써 체질마저 변해가는 듯해 신기하기도 하고 한편으론 내가 생각했던 목표를 향해

11¹Ú´ö±Ô

歯kjmh2004v13n1.PDF

<B3EDB9AEC1FD5F3235C1FD2E687770>

<BFACBCBCC0C7BBE7C7D E687770>

IKC43_06.hwp

16-기06 환경하중237~246p


<C7D1B9CEC1B7BEEEB9AEC7D C3D6C1BE295F31392EB9E8C8A3B3B22E687770>

새천년복음화연구소 논문집 제 5 권 [특별 기고] 說 敎 의 危 機 와 展 望 조재형 신부 한국천주교회의 새로운 복음화에 대한 小 考 정치우 복음화학교 설립자, 교장 [심포지엄] 한국 초기 교회와 순교영성 한반도 평화통일과 한국 교회의 과제 교황 방한의 메시지와 복음의

영어-중2-천재김-07과-어순-B.hwp

<31325FB1E8B0E6BCBA2E687770>

대한한의학원전학회지24권6호-전체최종.hwp

Journal of Educational Innovation Research 2016, Vol. 26, No. 2, pp DOI: * Experiences of Af

012임수진

10송동수.hwp

ºÎ·ÏB

슬라이드 1

<31372DB9CCB7A1C1F6C7E22E687770>

Transcription:

Q1) Exercise 4.13 Instruction sequence a lw $1,40($6) I1 add $6,$2,$2 I2 sw $6,50($1) I3 b lw $5,-16($5) I1 sw $5,-16($5) I2 add $5,$5,$5 I3 4.13.1. Indicate dependences and their type. From Ia to Ib on $x : Ia -> Ib ($x) a : RAW : I1 -> I3 ($1), I2 -> I3 ($6) WAR : I1 -> I2 ($6) b : RAW : I1 -> I2 ($5), I1 -> I3 ($5) WAR : I2 -> I3 ($5), I1 -> I3 ($5) WAW : I1 -> I3 ($5) 4.13.2 Assume there is no forwarding in this pipelined processor. Indicate hazards and add instructions to eliminate them. WAW and WAR does not affect the single five stage pipeline. a : I1 -> I3 ($1) -> Hazard because dist(i1, I3) = 2 <= dist(id, WB) = 3 I2 -> I3 ($6) -> Hazard because dist(i2, I3) = 1 <= 3 lw $1,40($6) I1 add $6,$2,$2 I2

sw $6,50($1) I3 If register read/write can be done in one cycle, only 2 s are needed. (Both are correct) lw $1,40($6) I1 add $6,$2,$2 I2 sw $6,50($1) I3 b : I1 -> I2 ($5) -> Hazard because dist(i1, I2) = 1 < 3 I1 -> I3 ($5) -> Hazard because dist(i1, I3) = 2 < 3. lw $5,-16($5) I1 sw $5,-16($5) I2 add $5,$5,$5 I3 If register read/write can be done in one cycle, only 2 s are needed. (Both are correct) lw $1,40($6) I1 add $6,$2,$2 I2 sw $6,50($1) I3 4.13.3 Assume there is full forwarding. Indicate hazards and add instructions to eliminate them. The remaining problems in this exercise assume the following clock cycle

times: without fowarding with full fowarding with ALU-ALU forwarding only a 300ps 400ps 360ps b 200ps 250ps 220ps a : I1 ->I3 ($1) -> The result of MEM(I1) can be forwarded to EX(I3) stage. I2 -> I3 ($6) -> The result of EX(I2) can be forwarded to EX(I3) stage. lw $1,40($6) I1 add $6,$2,$2 I2 sw $6,50($1) I3 b : I1 -> I2 ($5) -> The result of MEM(I1) cannot be foworaded to EX(I2) stage : 1. I1 -> I3 ($5) -> The result of MEM(I1) can be foworaded to EX(I3) stage. lw $5,-16($5) I1 sw $5,-16($5) I2 add $5,$5,$5 I3 4.13.4 What is the total execution time of this instruction sequence without forwarding and with full forwarding? What is the speed-up achieved by adding full forwarding? a : Without forwarding t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 IF I1 I2 I3 ID I1 I2 I3

EX I1 I2 I3 MEM I1 I2 I3 WB I1 I2 I3 10 cycles for 3 instructions: 10 * 300ps = 3000ps If register read/write can be done in one cycle, only 2 s are needed. (Both are correct) t0 t1 t2 t3 t4 t5 t6 t7 t8 IF I1 I2 I3 ID I1 I2 I3 EX I1 I2 I3 MEM I1 I2 I3 WB I1 I2 I3 9 cycles for 3 instructions: 9 * 300 = 2700ps With forwarding t0 t1 t2 t3 t4 t5 t6 IF I1 I2 I3 ID I1 I2 I3 EX I1 I2 I3 MEM I1 I2 I3 WB I1 I2 I3 7 cycles for 3 instructions: 7 * 400ps = 2800ps Speedup: 3000/2800 = 1.07 ( 2700/2800 = 0.96)

b : Without forwarding t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 IF I1 I2 I3 ID I1 I2 I3 EX I1 I2 I3 MEM I1 I2 I3 WB I1 I2 I3 10 cycles for 3 instructions: 10 * 200ps = 2000ps If register read/write can be done in one cycle, only 2 s are needed. (Both are correct) t0 t1 t2 t3 t4 t5 t6 t7 t8 IF I1 I2 I3 ID I1 I2 I3 EX I1 I2 I3 MEM I1 I2 I3 WB I1 I2 I3 9 cycles for 3 instructions: 9 * 200ps = 1800ps With forwarding t0 t1 t2 t3 t4 t5 t6 t7 IF I1 I2 I3 ID I1 I2 I3 EX I1 I2 I3

MEM I1 I2 I3 WB I1 I2 I3 8 cycles for 3 instructions: 8 * 250ps = 2000ps Speedup: 2000/2000 = 1. (1800/2000 = 0.9) 4.13.5 Add instructions to this code to eliminate hazards if there is ALU-ALU forwarding only (no forwarding from the MEM to EX stage)? a : I1 ->I3 ($1) -> The result of MEM(I1) cannot be forwarded to EX(I3) stage. I2 -> I3 ($6) -> The result of EX(I2) can be forwarded to EX(I3) stage. t0 t1 t2 t3 t4 t5 t6 t7 t8 IF I1 I2 I3 ID I1 I2 I3 EX I1 I2 I3 MEM I1 I2 I3 WB I1 I2 I3 We need 2 instructions between I2 and I3. lw $1,40($6) I1 add $6,$2,$2 I2 sw $6,50($1) I3 If register read/write can be done in one cycle, only 1 s are needed. (Both are correct) t0 t1 t2 t3 t4 t5 t6 t7 IF I1 I2 I3

ID I1 I2 I3 EX I1 I2 I3 MEM I1 I2 I3 WB I1 I2 I3 lw $1,40($6) I1 add $6,$2,$2 I2 sw $6,50($1) I3 b : I1 -> I2 ($5) -> The result of MEM(I1) cannot be forwarded to EX(I2) stage. I1 -> I3 ($5) -> The result of MEM(I1) cannot be forwarded to EX(I3) stage. No dependencies can be forwarded: Same as without forwarding lw $5,-16($5) I1 sw $5,-16($5) I2 add $5,$5,$5 I3 If register read/write can be done in one cycle, only 1 s are needed. (Both are correct) lw $5,-16($5) I1 sw $5,-16($5) I2 add $5,$5,$5 I3 4.13.6. What is the total execution time of this instruction sequence with only ALU-ALU

forwarding? What is the speed-up over a no-forwarding pipeline? a : Without forwarding : 3000ps(2700ps) With ALU-ALU forwarding : 9 cycles for 3 instructions, 9 * 360 = 3240ps (8 * 360 = 2880ps) Speed up : 0.93(0.94) b : Without forwarding = 2000ps ( 1800ps) With ALU-ALU forwarding : 10 * 220ps = 2200ps (9 * 220 = 1980ps) Speed up : 0.91(0.91) Q2) Exercise 4.14

Instruction sequence a lw $1,40($6) I1 beq $2,$0,Label ; Assume $2 == $0 I2 sw $6,50($2) I3 Label: add $2,$3,$4 I4 sw $3,50($4) I5 b lw $5,-16($5) I1 sw $4,-16($4) I2 lw $3,-20($4) I3 beq $2,$0,Label ; Assume $2!= $0 I4 add $5,$1,$4 I5 4.14.1 For this problem, assume that all branches are perfectly predicted and that no delay slots are used. If we have only one memory (for both instructions and data), there is a structural hazard every time we need to fetch an instruction in the same cycle in which another instruction accesses data. To guarantee forward progress, this hazard must always be resolved in favor of the instruction that accesses data. What is the total execution time of this instruction sequence in the five-stage pipeline that only has one memory? We have seen that data hazards can be eliminated by adding s to the code. Can you do the same with this structural hazard? Why? a : 9 cycles for 4 instructions t0 t1 t2 t3 t4 t5 t6 t7 t8 IF I1 I2 I4 stall I5 ID I1 I2 I4 stall I5 EX I1 I2 I4 stall I5 MEM I1 I2 I4 stall I5 WB I1 I2 I4 stall I5

b : 12 cycles for 5 instructions t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 IF I1 I2 I3 stall stall stall I4 I5 ID I1 I2 I3 stall stall stall I4 I5 EX I1 I2 I3 stall stall stall I4 I5 MEM I1 I2 I3 stall stall stall I4 I5 WB I1 I2 I3 stall stall stall I4 I5 Because also needs to access memory, we cannot resolve structural hazard through inserting s. 4.14.2. For this problem, assume that all branches are perfectly predicted and that no delay slots are used. If we change load/store instructions to use a register (without an offset) as the address, these instructions no longer need to use the ALU. As a result, MEM and EX stages can be overlapped and the pipeline has only four stages. Change this code to accommodate this changed ISA. Assuming this change does not affect clock cycle time, what speed-up is achieved this instruction sequence? Assume that $6 + 40 = $a, $2 + 50 = $b, $4 + 50 = $c, Instruction sequence a lw $1,$a I1 beq $2,$0,Label ; Assume $2 == $0 I2 sw $6,$b I3 Label: add $2,$3,$4 I4 sw $3,$c I5 b lw $5,$d I1 sw $4,$e I2 lw $3,$f I3

beq $2,$0,Label ; Assume $2!= $0 I4 add $5,$1,$4 I5 a : 5 stages -> 8 cycles t0 t1 t2 t3 t4 t5 t6 t7 IF I1 I2 I4 I5 ID I1 I2 I4 I5 EX I1 I2 I4 I5 MEM I1 I2 I4 I5 WB I1 I2 I4 I5 4 stages -> 7 cycles t0 t1 t2 t3 t4 t5 t6 IF I1 I2 I3 I5 ID I1 I2 I3 I5 E/M I1 I2 I3 I5 WB I1 I2 I3 I5 Speedup = 8/7 = 1.14 b : 5 stages -> 9 cycles 4 stages -> 8 cycles Speedup = 9/8 = 1.13 4.13.3 Assuming stall-on-branch and no delay slots, what speed-up is achieved on this code if branch outcomes are determined in the ID stage, relative to the executions where branch outcomes are determined in the EX stage?

a : ID stage -> 9 cycles t0 t1 t2 t3 t4 t5 t6 t7 t8 IF I1 I2 I3 I4 I5 ID I1 I2 stall I4 I5 EX I1 I2 stall I4 I5 MEM I1 I2 stall I4 I5 WB I1 I2 stall I4 I5 Ex stage -> 10 cycles t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 IF I1 I2 I3 I3 I4 I5 ID I1 I2 stall stall I4 I5 EX I1 I2 stall stall I4 I5 MEM I1 I2 stall stall I4 I5 WB I1 I2 stall stall I4 I5 Speedup = 10/9 = 1.11 b : ID stage -> 10 cycles EX stage -> 11 cycles Speedup = 11/10 = 1.10 The remaining problems in this exercise assume that individual pipeline stages have the following latencies: IF ID EX MEM WB a 100ps 120ps 90ps 130ps 60ps

b 180ps 100ps 170ps 220ps 60ps 4.14.4 Given these pipeline stage latencies, repeat the speed-up calculation from 4.14.2, but take into account the change in clock cycle time. When EX and MEM are don in a single stage, most of their work can be done in parallel. As a result, the resulting EX/MEM stage has a latency that is the larger of the original two, plus 20ps needed for the work that could not be done in parallel. The latency of the pipeline stage should be greater than the slowest stage latency. a : 5 stage -> 130ps(MEM) * 8 cycles = 1040ps 4 stage -> (130 + 20)ps(EX/MEM) * 7 cycles = 1050ps Speedup = 0.99 b : 5 stage -> 220ps(MEM) * 9 cycles = 1980ps 4 stage -> (220 + 20)(EX/MEM) * 8 cycles = 1920ps Speedup = 1.03 4.14.5 Given these pipeline stage latencies, repeat the speed-up calculation from Exercise 4.14.3, taking into account the change in clock cycle time. Assume that the latency ID stage increases by 50% and the latency of the EX stage decreases by 10ps when branch outcome resolution is moved from EX to ID. a : Branch resolution at EX stage -> 130ps(MEM) * 10cycles = 1300ps Branch resolution at ID stage -> 180ps(ID) * 9 cycles = 1620ps Speedup = 0.80 b : Branch resolution at EX stage -> 220ps(MEM) * 11cycles = 2420ps Branch resolution at ID stage -> 220ps(MEM) * 10 cycles = 2200ps Speedup = 1.1 4.14.6 Assuming stall-on-branch and no delay slots, what is the new clock cycle time and

execution time of this instruction sequence if beq address computation is moved to the MEM stage? What is the speed-up from this change? Assume that the latency of the EX stage is reduced by 20ps and the latency of the MEM stage is unchanged when branch outcome resolution is moved from EX to MEM. a : Branch resolution at EX stage -> 130ps(MEM) * 10cycles = 1300ps Branch resolution at MEM stage -> 130ps(MEM) * 11 cycles = 1430ps Speedup = 0.91 t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 IF I1 I2 I3 I3 I3 I4 I5 ID I1 I2 stall stall stall I4 I5 EX I1 I2 stall stall stall I4 I5 MEM I1 I2 stall stall stall I4 I5 WB I1 I2 stall stall stall I4 I5 b : Branch resolution at EX stage -> 220ps(MEM) * 11cycles = 1300ps Branch resolution at MEM stage -> 220ps(MEM) * 12 cycles = 2640ps Speedup = 0.92 Q3) 4.19.1. If we use no forwarding, what fraction of cycles are we stalling due to data hazards? (3) Dependences to the 1 st next instruction: 2 stall cycles Dependences to both 1 st and 2 nd next instruction: 2 stall cycles Dependences to only the 2 nd next instruction: 1 stall cycle

CPI Fraction a. 1 + 0.45x2 + 0.05x1 = 1.95 49% (0.95/1.95) b. 1 + 0.4x2 + 0.1x1 = 1.9 47% (0.9/1.9) 4.19.2. If we use full forwarding (forward all results that can be forwarded), what fraction of cycles are we stalling due to data hazards? (3) The only RAW data dependences that cause stalls are those from the MEM state of one instruction to the 1 st next instruction, and the stall is only 1 cycle. CPI Fraction a. 1 + 0.25 = 1.25 20% (0.25/1.25) b. 1 + 0.2 = 1.2 17% (0.2/1.2) 4.19.3. Which of the two options results in fewer data stall cycles? (3) With forwarding only from the EX/MEM register: EX to 1st dependences: no stalls. EX to 2nd dependences: 1 stall cycles MEM to 1st dependences: 2 stall cycles With forwarding only from the MEM/WB register: EX to 2nd dependences: no stalls. MEM to 1st dependences: 1 stall cycle (no time travel). EX to 1st dependences: 1 stall cycle because we must wait for the instruction to complete the MEM stage to be able to forward to the next instruction. EX/MEM MEM/WB Fewer stall cycles with a. 0.1 + 0.05 + 0.25x2 = 0.65 0.1 + 0.1 + 0.25 = 0.45 MEM/WB b. 0.05 + 0.1 + 0.2x2 = 0.55 0.15 + 0.05 + 0.2 = 0.4 MEM/WB

4.19.4. What is the speed-up achieved by adding full forwarding to a pipeline that had no forwarding? (2) We have already computed the CPI without forwarding and with full forwarding in 4.19.1 and 4.19.2. We can compute time per instruction by taking into account the clock cycle time: Without forwarding With forwarding Speed-up a. 1.95 x 100ps = 195ps 1.25 x 110ps = 137.5ps 1.42 b. 1.9 x 300ps = 570ps 1.2 x 350ps = 420ps 1.36 4.19.5. What would be the additional speed-up if we added time-travel forwarding that eliminates all data hazards? (2) We already computed the time per instruction for full forwarding in 4.19.4. We can compute timeper instruction with time-travel forwarding and the speed-up over full forwarding. Without forwarding Time-travel forwarding Speed-up a. 1.25 x 110ps = 137.5ps 1 x 210ps = 210ps 0.65 b. 1.2 x 350ps = 420ps 1 x 450ps = 450ps 0.93 4.19.6. Repeat Exercise 4.19.3 but this time determine which of the two options results in shorter time per instruction. (1) EX/MEM MEM/WB Fewer stall cycles with a. 1.4 x 100ps = 140ps 1.45 x 100ps = 145ps EX/MEM b. 1.35 x 320ps = 432ps 1.4 x 310ps = 434ps EX/MEM Q4) 4.23.1. Stall cycles due to mispredicted branches increase the CPI. What is the extra CPI due to mispredicted branches with the always-taken predictor? Assume that branch outcomes are

determined in the EX stage, that there are no data hazards, and that no delay slots are used. (2) - cycle stall 을 2cycle, 3cycle 모두정답으로처리. a. Misprediction 에의해생기는 cycle stall 은 3 cycle 이다. Branch instruction 의전체양은 15% 이고, always-taken predictor 의 miss rate 은 60% 이므로상승하는 CPI 을계산하면, 3 * 0.6 * 0.15 = 0.27 만큼 CPI 상승이발생한다. b. predictor 의 miss rate 은 40% 이고, branch instruction 의양은 10% 이므로, 3 * 0.4 * 0.10 =0.12 4.23.2. Repeat Exercise 4.21.1 for the "always not-taken" predictor. (2) a. always not-taken 의 miss rate 은 40% 이고, branch instruction 의양은 15% 이므로, 3 * 0.4 * 0.15 = 0.18 b. miss-rate 은 60% 이고, branch instruction 의양은 10% 이므로, 3 * 0.6 * 0.10 = 0.18 4.23.3. Repeat Exercise 4.23.1 for the 2-bit predictor. (2) a. 2-bit predictor 의 miss rate 은 20% 이고, branch instruction 의양은 15% 이므로, 3 * 0.2 * 0.15 = 0.09 b. miss-rate 은 5% 이고, branch instruction 의양은 10% 이므로, 3 * 0.05 * 0.10 = 0.015 3 * 0.2 4.23.4. With the 2-bit predictor, what speed-up would be achieved if we could convert half of the branch instructions in a way that replaces a branch instruction with an ALU instruction? Assume that correctly and incorrectly predicted instructions have the same chance of being replaced. (3) 2-bit predictor의 CPI는 ALU instruction으로의 conversion이없는경우, 4.23.3의 CPI만큼의증가가일어난 a = 1.09, b = 1.015이다. 여기서 branch instruction의절반을 (branch prediction이 correct 하던, incorrect 하던지상관없이 ) ALU instruction으로교체가가능하다면 CPI는 a = 1 + 3 * 0.2 * 0.15 * 0.5 = 1.045, b = 1+ 3 * 0.05 * 0.10 * 0.5 = 1.0075이된다. 따라서 conversion으로생기는 speed-up은 a = 1.09 / 1.045 = 1.043, b = 1.015 / 1.0075 = 1.007가된다. 4.23.5. With the 2-bit predictor, what speed-up would be achieved if we could convert half of the branch instructions in a way that replaced each branch instruction with two ALU instructions? Assume that correctly and incorrectly predicted instructions have the same

chance of being replaced. (3) Branch instruction의 50% 를 2개의 ALU instruction으로 conversion하게되면총 program의 instruction의개수는늘어나게된다. 따라서 CPI를계산하면, a = 1 + (1 + 3 * 0.2) * 0.15 * 0.5 = 1.12, b = 1 + (1 + 3 * 0.05) * 0.1 * 0.5 = 1.058로나오게되므로, speed-up을계산하면, a = 1.09 / 1.12 = 0.97, b = 1.015 / 1.058 = 0.96이나오게된다. 4.23.6. Some branch instructions are much more predictable than others. If we know that 80% of all executed branch instructions are easy-to-predict look-back branches that are always predicted correctly, what is the accuracy of the 2-bit predictor on the remaining 20% of the branch instructions? (2) a. 2-bit predictor 의 accuracy 는 80% 인데, 위문제에서 80% 의 branch instruction 은항상 correctly predicted 하므로나머지 20% 의 branch instruction 에대해서는 100% miss 가난다고볼 수있다. b. 2-bit predictor의 accuracy는 95% 인데, 위문제에서 80% 의 branch instruction은항상 correctly predicted 하므로나머지 20% 의 branch instruction 중 15% 에대해서정확히맞춘다고볼수있다. 따라서이경우나머지 20% 의 branch instruction에대한 accuracy는 15/20 = 75% 가된다. Q5) 4.24.1. What is the accuracy of always-taken and always-not-taken predictors for this sequence of branch outcomes? (2) a. always-taken = 75%, always not-taken = 25% b. always-taken = 60%, always not-taken = 40% 4.24.2. What is the accuracy of the 2-bit predictor for the first four branches in this pattern, assuming that the predictor starts off in the bottom left state from Figure 4.63 (predict not taken) (2) 처음 2-bit predictor 의 prediction 은 predict not taken 에서시작하므로, a. 0%, b. 25% 4.24.3. What is the accuracy of the 2-bit predictor if this pattern is repeated forever? (2) a. 75%, b. 40%

4.24.4. Design a predictor that would achieve a perfect accuracy if this pattern is repeated forever. Your predictor should be a sequential circuit with one output that provides a prediction (1 for taken, 0 for not taken) and no inputs other than the clock and the control signal that indicates that the instruction is a conditional branch. (3) Shift register 를통해구현이가능하다. 1 for taken, 0 for not taken 으로 shift 하는 register 를만들어 a 의경우 100% prediction 을위한초기값은 1101 로준다. ( 즉 4-bit shift register 를 left shift 방식 으로구현 ) b 의경우 5-bit shift register 를두어서초기값을 11100 으로두면된다. 4.24.5. what is the accuracy of your predictor from Exercise 4.24.4 if it is given a repeating pattern that is the exact opposite of this one? (2) Predictor 의 output 이항상 opposite 하게반복되는 pattern 이라면 accuracy 는 0% 가된다. 4.24.6. Repeat the Exercise 4.24.4, but now your predictor should be able to eventually (after a warm-up period during which it can make wrong predictions) start perfectly predicting both this pattern and its opposite. Your predictor should have an input that tells it what the real outcome was. Hint: this input lets your predictor determine which of the two repeating patterns it is given. (3) 4.24.5에서구현된 shift register에서처음 predictor의결과가 incorrect로나올경우, shift register 의 pattern을 invert하여인식할수있게만들면된다. 이렇게하면처음 prediction에서만 incorrect가발생하고, 그후부터 opposite pattern에대해서도 100% prediction이가능하게된다. Opposite pattern으로시작하지않는다면처음부터 shift register를통해 100% prediction이가능하게될것이다. Q6) 4.25.1. Which exceptions can each of these instructions trigger? For each of these exceptions, specify the pipeline state in which it is detected. (2) Instruction 1 Instruction 2 a. Overflow (EX) Invalid target address (EX) b. Invalid data address (MEM) No exceptions

4.25.2. Show how the pipeline organization must be changed to be able to handle this exception. (2) The Mux that selects the next PC must have inputs added to it. Each input is a constant address of an exception handler. The exception detectors must be added to the appropriate pipeline stage and the outputs of these detectors must be used to control the pre-pc Mux, and also to convert to s instructions that are already in the pipeline behind the exception-triggering instruction. 4.25.3. What happens in the pipeline when the first instruction causes the first exception? (3) Instructions are fetched normally until the exception is detected. When the exception is detected, all instructions that are in the pipeline after the first instruction must be converted to s. As a result, the second instruction never completes and does not affect pipeline state. In the cycle that immediately follows the cycle in which the exception is detected, the processor will fetch the first instruction of the exception handler. 4.25.4. What is the address of the exception handler? What happens if there is an invalid instruction at that address in instruction memory? (3) Handler address: a. 0xFFFFF000 b. 0x00000010 The first instruction word from the handler address is fetched in the cycle after the one in which the original exception is detected. When this instruction is decoded in the next cycle, the processor detects that the instruction is invalid. This exception is treated just like a normal exception it converts the instruction being fetched in that cycle into a and puts the address of the Invalid Instruction handler into the PC at the end of the cycle in which the Invalid Instruction exception is detected. 4.25.5. Repeat Exercise 4.25.3 using this modified pipeline and vectored exception handling. (3) This approach requires us to fetch the address of the handler from memory. We must add the

code of the exception to the address of the exception vector table, read the handler s address from memory, and jump to that address. One way of doing this is to handle it like a special instruction that computer the address in EX, loads the handler s address in MEM, and sets the PC in WB. 4.25.6. We want to emulate vectored exception handling on a machine that has only one fixed handler address. Write the code that should be at that fixed address. (2) We need a special instruction that allows us to move a value from the (exception) Cause register to a general-purpose register. We must first save the general-purpose register (so we can restore it later), load the Cause register into it, add the address of the vector table to it, use the result as an address for a load that gets the address of the right exception handler from memory, and finally jump to that handler. Q7) 4.28.1. a. b. add $1, $0, $0 # i = 0 add $4, $0, $0 loop: loop: beq $1, $2, end # if i == j, goto end sll $6, $1, 2 # $6 = i << 2, for array index add $7, $3, $6 # $7 = a+(i*4) 주소 계산 sll $1, $4, 2 # $1 = i << 2, for array index add $2, $6, $1 # $2 = a+(i*4) 주소 계산 lw $3, 0($2) # $3 = a[i] 값 lw $8, 0($7) # $8 = a[i] 값로드 lw $1, 4($2) # $1 = a[i+1] 값 add $7, $4, $6 # $7 = b+(i*4) 주소 계산 beq $3, $1, end # exit loop sw $0, 0($2) # a[i] = 0

sw $8, 0($7) # b[i] = a[i] addi $4, $4, 1 # i++ addi $1, $1, 1 # i++ j loop: # for j loop: # for end: end: 4.28.2.a Instructions 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 add $1, $0, $0 IF ID EX ME WB beq $1, $2, end IF ID ** EX ME WB sll $6, $1, 2 IF ** ID EX ME WB add $7, $3, $6 IF ** ID ** EX ME WB lw $8, 0($7) IF ** ID EX ME WB add $7, $4, $6 IF ** ID EX ME WB sw $8, 0($7) IF ** ID EX ME WB addi $1, $1, 1 IF ** ID EX ME WB beq $0, $0, loop IF ID EX ME WB beq $1, $2, end IF ID ** EX ME WB sll $6, $1, 2 IF ** ID EX ME WB add $7, $3, $6 IF ** ID ** EX ME WB lw $8, 0($7) IF ** ID EX ME WB add $7, $4, $6 IF ** ID ** EX ME WB sw $8, 0($7) IF ** ID EX ME WB addi $1, $1, 1 IF ** ID EX ME WB beq $0, $0, loop IF ID EX ME WB beq $1, $2, end IF ID ** EX ME WB 4.28.2.b

Instructions 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 add $4, $0, $0 IF ID EX ME WB sll $1, $4, 2 IF ID ** EX ME WB add $2, $6, $1 IF ** ID EX ME WB lw $3, 0($2) IF ** ID ** EX ME WB lw $1, 4($2) IF ** ID EX ME WB beq $3, $1, end IF ** ID ** ** EX ME WB sw $0, 0($2) IF ** ** ID EX ME WB addi $4, $4, 1 IF ** ** ID EX ME WB beq $0, $0, loop IF ID EX ME WB sll $1, $4, 2 IF ID EX ME WB add $2, $6, $1 IF ID EX ME WB lw $3, 0($2) IF ID ** EX ME WB lw $1, 4($2) IF ** ID EX ME WB beq $3, $1, end IF ** ID ** ** EX ME WB sw $0, 0($2) IF ** ** ID EX ME WB addi $4, $4, 1 IF ** ** ID EX ME WB beq $0, $0, loop IF ID EX ME WB sll $1, $4, 2 IF ID EX ME WB add $2, $6, $1 IF ID EX ME WB lw $3, 0($2) IF ID EX ** ME WB lw $1, 4($2) IF ID ** EX ME WB beq $3, $1, end IF ID ** ** ** EX ME 4.28.3.a a. 변화없음 / 바꿀수있는것이 sw $8, 0($7), addi $1, $1, 1 뿐이지만바꿔도전체수행량은다 르지가않다. b. 변화없음 4.28.4.a 변화없음 4.28.4.b 변화없음

4.28.5.a 1-issue processor Instructions 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 add $1, $0, $0 IF ID EX ME WB beq $1, $2, end IF ID EX ME WB sll $6, $1, 2 IF ID EX ME WB add $7, $3, $6 IF ID EX ME WB lw $8, 0($7) IF ID EX ME WB add $7, $4, $6 IF ID EX ME WB sw $8, 0($7) IF ID EX ME WB addi $1, $1, 1 IF ID EX ME WB beq $0, $0, loop IF ID EX ME WB beq $1, $2, end IF ID EX ME WB 위표에서굵은글씨가 loop 안내용이다. 전제조건이 1,000,000 반복이므로파이프라인에서의앞 / 뒤를잘라내면각명령당한번의 cycle 이사용됨을알수있다. 따라서 CPI = 1 2-issue processor Instructions 1 2 3 4 5 6 7 8 9 10 11 12 13 14 add $1, $0, $0 IF ID EX ME WB beq $1, $2, end IF ID ** EX ME WB sll $6, $1, 2 IF ** ID EX ME WB add $7, $3, $6 IF ** ID ** EX ME WB lw $8, 0($7) IF ** ID EX ME WB add $7, $4, $6 IF ** ID EX ME WB sw $8, 0($7) IF ** ID EX ME WB addi $1, $1, 1 IF ** ID EX ME WB beq $0, $0, loop IF ID EX ME WB beq $1, $2, end IF ID ** EX ME WB

sll $6, $1, 2 IF ** ID EX ME WB add $7, $3, $6 IF ** ID ** EX ME WB lw $8, 0($7) IF ** ID EX ME WB add $7, $4, $6 IF ** ID ** EX ME WB sw $8, 0($7) IF ** ID EX ME WB addi $1, $1, 1 IF ** ID EX ME WB beq $0, $0, loop IF ID EX ME WB beq $1, $2, end IF ID ** EX ME WB 총명령어가 16 개, cycle 이 14 이므로 CPI = 14/16 이다. 따라서 speed-up = 16/14 4.28.5.b 1-issue processor Instructions 1 2 3 4 5 6 7 8 9 add $4, $0, $0 IF ID EX ME WB sll $1, $4, 2 IF ID EX ME WB add $2, $6, $1 IF ID EX ME WB lw $3, 0($2) IF ID EX ME WB lw $1, 4($2) IF ID EX ME WB beq $3, $1, end IF ID ** EX ME WB sw $0, 0($2) IF ** ID EX ME WB addi $4, $4, 1 IF ID EX ME WB beq $0, $0, loop IF ID EX ME WB CPI = 9/8 2-issue processor Instructions 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 add $4, $0, $0 IF ID EX ME WB sll $1, $4, 2 IF ID ** EX ME WB add $2, $6, $1 IF ** ID EX ME WB lw $3, 0($2) IF ** ID ** EX ME WB lw $1, 4($2) IF ** ID EX ME WB

beq $3, $1, end IF ** ID ** ** EX ME WB sw $0, 0($2) IF ** ** ID EX ME WB addi $4, $4, 1 IF ** ** ID EX ME WB beq $0, $0, loop IF ID EX ME WB sll $1, $4, 2 IF ID EX ME WB add $2, $6, $1 IF ID EX ME WB lw $3, 0($2) IF ID ** EX ME WB lw $1, 4($2) IF ** ID EX ME WB beq $3, $1, end IF ** ID ** ** EX ME WB sw $0, 0($2) IF ** ** ID EX ME WB addi $4, $4, 1 IF ** ** ID EX ME WB beq $0, $0, loop IF ID EX ME WB sll $1, $4, 2 IF ID EX ME WB add $2, $6, $1 IF ID EX ME WB lw $3, 0($2) IF ID EX ** ME WB lw $1, 4($2) IF ID ** EX ME WB beq $3, $1, end IF ID ** ** ** EX ME CPI = 15/16, Speedup = (9/8) / (15/16) = 1.2 4.28.6.a Instructions add $1, $0, $0 beq $1, $2, end a 1 sll $6, $1, 2 b 2 add $7, $3, $6 b 3 lw $8, 0($7) c 4 add $7, $4, $6 c 5 sw $8, 0($7) d 6 addi $1, $1, 1 d 7 beq $0, $0, loop e 8 beq $1, $2, end e 8

sll $6, $1, 2 f 9 add $7, $3, $6 f 10 lw $8, 0($7) g 11 add $7, $4, $6 g 12 sw $8, 0($7) h 13 addi $1, $1, 1 h 14 beq $0, $0, loop a 1 beq $1, $2, end 위에서같은알파벳으로되어있는명령어가같이수행될수있는명령어들이다. 하지만대부분 메모리에쓰는명령어기때문에같이수행불가. beq 에서만실제로한클락에일어나는것을 확인할수가있다. CPI = 14/16, Speed-up = 16/14 4.28.6.b Instructions add $4, $0, $0 sll $1, $4, 2 a 1 add $2, $6, $1 b 2 lw $3, 0($2) b 3 lw $1, 4($2) c 4 beq $3, $1, end c 5 sw $0, 0($2) d 6 addi $4, $4, 1 d 7 beq $0, $0, loop e 8 sll $1, $4, 2 e 8 add $2, $6, $1 f 9 lw $3, 0($2) f 10 lw $1, 4($2) g 11 beq $3, $1, end g 12 sw $0, 0($2) h 13 addi $4, $4, 1 h 14 beq $0, $0, loop a 1

A 와마찬가지다. CPI = 14/16. Speed-up = (8/9) / (14/16) = 1.29