MO601/MC973 - Computer Architecture II

22/12	Final grades.
18/12	Exam 2 grades.
12/12	Project 4 deadline postponed to Dec. 15th.
11/12	Project 2 grades.
05/12	Project 1 grades.
20/10	Class for 21/oct cancelled. Project 2 postponed to 23/oct (end of day).
16/10	Exam 1 grades.
10/08	Lectures will start on Monday, 22nd.
22/jul	Important dates in the school calendar: Graduate and Undergraduate programs.

This course will cover tools and methodologies for Computer Architecture research, including modern simulators and benchmarks for single-core and multi-core processors and clusters. We will study recent papers in the area and how they model pipelines, caches, execution engines, power evaluation, etc.

The recommended bibliography contains:

Processor Microarchitecture: An Implementation Perspective. Antonio González, Fernando Latorre and Grigorios Magklis. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers.
Modern Processor Design: Fundamentals of Superscalar Processors. John Paul Shen, Mikko H. Lipasti. Waveland Press. 2013.
Papers from Top Computer Architecture Conferences

Videos about virtual memory (Contributed by Alberto Oliveira): 1, 2, 3, 4.

2 Written exams: 60% of final grade (30% each).

Practical Projects: 40% of final grade.

Grade ranges: A for grade > 8.4, B for grade > 6.9, C for grade > 4.9, D for grade < 5.

Any unethical behavior related to the evaluation process will result in failing the course with the lowest possible grade. Every assignment is an individual assignment unless otherwise mentioned. Students are not expected to see each other's solutions to the assignments.

Exercises provided by Marcus Angeloni

What are benchmarks?
What is the difference between computer architecture and microarchitecture?
What are the most common dimensions for classifying processor microarchitectures? Explain three of them.
What is the difference between multicore and multithreaded processors?
What are the seven stages of a microprocessor pipeline?
What would the cache size be in an ideal system? Why?
What is the difference between the first-level cache and the caches at the other levels?
What are the motivations for using virtual addressing?
What is virtual aliasing?
What is a page?
What is the TLB, and what is it used for?
What is the difference between parallel and serial tag access in the context of data array access?
What are the types of misses in the context of lockup-free caches?
What is the difference between explicitly addressed and implicitly addressed MSHRs?
What is the difference between multiported and multibanked caches?
What is the responsibility of the instruction fetch unit?
How does branch prediction work, and what is it used for?
What is the difference between the BTB and the RAS?
What is the difference between a conventional cache and a trace cache?
What is the difference between static and dynamic prediction?
How is the choice of which branch predictor to use made?
What is the purpose of the instruction decoder? What does it identify?
What is the difference between RISC and CISC?
What sizes, in bytes, can an instruction have in the x86 architecture?
What are micro-operations, and what are they used for?
What is the role of the allocation phase in a microprocessor pipeline?
Why is register renaming necessary?
What is the main limiting factor on the number of registers in a processor?
What is the reorder buffer used for?
What is the difference between the rename buffer and the reorder buffer?
How does the merged register file strategy work?
What are the strategies for when register values are read? What are the advantages of each?
What is the idea behind the SimPoints approach? What are its benefits?
What are the phases of a program?
What are SimPoints?

Other exercises

What are the main limiting factors on the number of instructions executed in parallel?
Design a piece of C code containing a function call, a loop, and an if statement. Show where the branches would be in the assembly code. For each branch, explain how easy or difficult it is to predict.
Show a piece of assembly code containing at least one each of the WAR, RAR, WAW, and RAW dependencies. What will happen to this code after register renaming?
Show an example (piece of code) of a Trace Cache performing better than a conventional cache. How good is this cache in this example?
How can you decode multiple x86 instructions from a block of bytes retrieved from the instruction cache?
Consider a sequence of 10 instructions to be fetched from memory in a scalar processor. Consider also that this processor has a branch predictor. How many times will the branch predictor hardware be accessed? Why?

Exercises provided by Marcus Angeloni

What is the goal of the issue stage?
What is the difference between in-order and out-of-order issue?
How does scoreboarding work?
How does the issue queue work when operands are read before issue?
What are the different issue queue events, and what does each of them do?
What is the main difference between reading operands after issue and before issue?
How does the issue queue work when operands are read after issue?
How do the different types of memory disambiguation work?
What are the concepts behind a distributed issue queue?
What are the indetermination and dependency matrices used for?
Why is speculating on memory accesses considered so critical?
What are the differences between conservative and speculative wakeup?
What is the goal of the execution stage?
What are the most common execution units?
What is the bypass network?
Why are the arithmetic and logic units generally separate from the multiplication and division units?
How do multiplication and division operations work in processors that do not implement these units?
What is the difference between the segmented and flat memory models?
What is an effective address?
What is the goal of the branch unit?
How do SIMD units work? What are their advantages?
How is a SIMD unit typically composed? What are lanes?
What are bubbles? Can bypassing minimize them?
What are the advantages and disadvantages of bypassing?
Why is result bypassing in in-order processors usually more complex than in out-of-order processors?
What are SRFs, and what are they used for?
How does clustering work?
Name and explain two types of clustering.
Why is a commit stage necessary?
What are the architectural and speculative states, and how are they related?
What is the difference between architectural state based on a Retire Register File and on a Merged Register File? Which type of processor is each best suited to?
How is recovery from a branch misprediction performed?
What are the ways of handling a branch misprediction?
How is recovery performed after an exception?

Exercises provided by Pedro Henrique Amorim

What is the role of the issue stage?
What are the main issue schemes?
In general, how are the in-order and out-of-order schemes carried out?
What is the role of scoreboarding?
Briefly explain how the in-order technique works in VLIW processors.
Describe the scenario in which a unified issue queue is assumed.
Describe the scenario in which reservation stations are assumed.
A read-before-issue scheme has several components. Explain the purpose of each of the following:

What is the role of the "ctrl info" block in the out-of-order issue scheme?
What is the function of the CAM (Content-Addressable Memory)?
The arrays named Src1Id and Src2Id?
The V1 and V2 blocks?
The R1 and R2 blocks?

Describe the following events:

Issue queue allocation.
Instruction wakeup.
Instruction selection.
Issue queue reclamation.

Give two reasons why an instruction may stall in the issue stage.
In the pipeline of a given processor, the issue stage handles 8 instructions at a time, while the fetch stage can handle 16 instructions simultaneously. How many instructions are obtained simultaneously after the commit stage completes?

Exercises provided by Rafael Junio

How is the size of a flat page table computed? Compute the size of a flat page table for a 64-bit processor using 4 KB pages.
How does a multilevel page table save space compared to a flat page table?
Given a 4-level page table with 64-bit virtual addresses, a 64 KB page size, and 8 B page table entries, where only the addresses between 0 and 4 GB are used, compute the size of the flat page table and of the 4-level page table with the levels divided equally.
Compute the latency of a miss in a virtual-to-physical address translation, considering a 3-level page table. Assume:

1 cycle to compute the virtual address
1 cycle to access the cache
20 cycles to access memory
90% hit rate for data

The page table is not in the cache.
The page table is in the cache and has a 90% data hit rate.

How large must a TLB with 4 KB pages be to have the same hit rate as a 32 KB cache with 64 B blocks?

I will hold office hours after each class. If you need more time or a different time, feel free to schedule it by email.

Every assignment is an individual assignment unless otherwise mentioned. Students are not expected to see each other's solutions to the assignments.

Project 1

Infrastructure: PIN and SPEC 2006

Tasks:

Install SPEC 2006, execute it, understand the runspec script.
Install PIN, execute a few examples. Understand how it works.
Use the available pintools to count the number of instructions of each SPEC program.
Create a new pintool and use it in, at least, 5 SPEC programs.

Report:

Create a folder called project1 in your repository
Due: 19/Sep, 10 AM
Report document. SBC Template. You can choose either English or Portuguese for your report. Maximum of 6 pages containing the count (item 3 above) and the description and result of your pintool. Filename: report.pdf
Presentation Document. Create a file called presentation.pdf containing slides for a 5-minute presentation of your project.
CSV file. Include a file results.csv containing the results of the instruction count. This file should contain two columns where the first is the program name and the second is the instruction count. For programs that execute more than one time, include the execution number (1, 2, 3) after their names.
Source code. Include your source code in the folder src. This folder should contain all code that you created/modified together with scripts to execute and a README.md file explaining dependencies. Your script could rely on environment variables for pre-requisites.
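
As an illustration of the results.csv layout described above, here is a minimal Python sketch. The function name write_results, the use of a space between the program name and the execution number, the absence of a header row, and the instruction counts are all assumptions made for the example; they are not part of the project specification.

```python
import csv
import io


def write_results(counts, out):
    """Write one two-column CSV row per program execution.

    `counts` maps a program name to a list of instruction counts, one per
    execution. A program that runs once keeps its plain name; a program
    that runs several times gets the execution number appended
    (assumption: separated by a space).
    """
    writer = csv.writer(out, lineterminator="\n")
    for program, runs in counts.items():
        for number, count in enumerate(runs, start=1):
            name = program if len(runs) == 1 else f"{program} {number}"
            writer.writerow([name, count])


# Made-up instruction counts, purely for illustration.
example = {"401.bzip2": [100, 200, 300], "429.mcf": [42]}
buffer = io.StringIO()
write_results(example, buffer)
csv_text = buffer.getvalue()
```

In a real run, the counts would come from the output of the instruction-counting pintool, and the text would be written to results.csv instead of an in-memory buffer.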

Project 2

Infrastructure: PIN, and 10 benchmarks

Task:

Evaluate Virtual to Physical memory translation for 4KB and 4MB pages.
Consider TLBs of up to 512 entries for instructions and data.
Consider page tables with 3 or 4 levels.
Look for benchmarks with large memory footprint.
Create one toy benchmark to check your environment.
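
To make the measurements in the tasks above concrete, the following Python sketch models one TLB and counts accesses, TLB misses, and page table accesses for a list of virtual addresses. It is only a toy model under stated assumptions: a fully associative TLB with LRU replacement, one page table access per level on every miss regardless of page size, and a synthetic address trace. The names TLB and run_trace are hypothetical, and the project itself is expected to collect addresses with PIN.

```python
from collections import OrderedDict


class TLB:
    """Fully associative TLB with LRU replacement (assumed organization)."""

    def __init__(self, entries=512, page_bits=12, levels=4):
        self.entries = entries
        self.page_bits = page_bits   # 12 -> 4 KB pages, 22 -> 4 MB pages
        self.levels = levels         # page table levels walked on a miss
        self.cache = OrderedDict()   # virtual page numbers, oldest first
        self.accesses = 0
        self.misses = 0
        self.page_table_accesses = 0

    def access(self, vaddr):
        """Look up one virtual address and update the counters."""
        self.accesses += 1
        vpn = vaddr >> self.page_bits
        if vpn in self.cache:
            self.cache.move_to_end(vpn)          # mark as most recently used
            return True
        self.misses += 1
        self.page_table_accesses += self.levels  # assumed: one access per level
        if len(self.cache) >= self.entries:
            self.cache.popitem(last=False)       # evict least recently used
        self.cache[vpn] = True
        return False


def run_trace(addresses, page_bits):
    """Replay a trace on a fresh TLB and return the three counters."""
    tlb = TLB(page_bits=page_bits)
    for vaddr in addresses:
        tlb.access(vaddr)
    return tlb.accesses, tlb.misses, tlb.page_table_accesses


# Synthetic trace: touch 2048 consecutive 4 KB pages (8 MB), twice.
trace = [page * 4096 for page in range(2048)] * 2
small_pages = run_trace(trace, page_bits=12)  # 4 KB pages
large_pages = run_trace(trace, page_bits=22)  # 4 MB pages
```

With 4 KB pages the 2048-page working set does not fit in 512 entries, so every access misses under LRU; with 4 MB pages the same trace touches only two pages. Separate TLB instances would be needed to model instruction and data translation independently.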

Report:

Create a folder called project2 in your repository
Due: 23/Oct (end of the day)
Report document. SBC Template. You can choose either English or Portuguese for your report. Maximum of 6 pages containing the count (item 3 above) and the description and result of your pintool. Filename: report.pdf
Presentation Document. Create a file called presentation.pdf containing slides for a 5-minute presentation of your project.
CSV file. Include a file results.csv containing your results. For each benchmark, include the following columns: benchmark name (and input if necessary), total memory accesses for instructions, total TLB misses for instructions, total page table accesses for instructions, total memory accesses for data, total TLB misses for data, total page table accesses for data.
Source code. Include your source code in the folder src. This folder should contain all code that you created/modified together with scripts to execute and a README.md file explaining dependencies. Your script could rely on environment variables for pre-requisites.
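
One possible way for a run script to rely on environment variables for prerequisites is sketched below in Python. The variable names PIN_ROOT and SPEC_DIR and the function resolve_prerequisites are assumptions chosen for the example; use whatever names your README.md documents.

```python
import os

# Hypothetical variable names; document the real ones in README.md.
REQUIRED_VARS = ("PIN_ROOT", "SPEC_DIR")


def resolve_prerequisites(env=None, required=REQUIRED_VARS):
    """Return {name: value} for the required variables.

    Raises RuntimeError listing every variable that is unset or empty,
    so the user sees all missing prerequisites at once.
    """
    env = os.environ if env is None else env
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Set these environment variables first: " + ", ".join(missing)
        )
    return {name: env[name] for name in required}
```

Calling resolve_prerequisites() at the top of the script fails early with a readable message instead of breaking halfway through a long benchmark run.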

Project 3

Goal: Reproduce one item (graph, table, etc) of a pre-selected paper from the last three editions of the following conferences: ISCA, ASPLOS, MICRO, HPCA.

Tasks:

Create a folder called project3 in your repository
Every Friday, up to 21/Oct, add a reference to one paper that you have preliminarily inspected to a file called papers.txt
When you feel that you have selected the desired paper, talk to me to reserve it and avoid conflicts. Insert the paper PDF in your repository as paper.pdf and create a short overview of it together with the specification of your task and how you plan to execute it in the following weeks. This presentation should be in a file called presentation1.pdf and you are expected to talk about it for 15 minutes in the classes of 31/Oct and 04/Nov.

Report:

Create a folder called project3 in your repository
Due: 17/Nov (end of the day)
Report document. SBC Template. You can choose either English or Portuguese for your report. Maximum of 6 pages containing the count (item 3 above) and the description and result of your pintool. Filename: report.pdf
Presentation Document. Create a file called presentation2.pdf containing slides for a presentation of your project of up to 15 minutes.
Source code. Include your source code in the folder src. This folder should contain all code that you created/modified together with scripts to execute and a README.md file explaining dependencies. Your script could rely on environment variables for pre-requisites.

Selected papers

João Paulo Labegalini de Carvalho: Varun Agrawal, Abhiroop Dabral, Tapti Palit, Yongming Shen, and Michael Ferdman. 2015. Architectural Support for Dynamic Linking. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '15). ACM, New York, NY, USA, 691-702. DOI: http://dx.doi.org/10.1145/2694344.2694392
Gustavo Ciotto Pinton: R. Parihar and M. C. Huang, "Accelerating decoupled look-ahead via weak dependence removal: A metaheuristic approach," 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), Orlando, FL, 2014, pp. 662-677.
Ciro Ceissler: Shao, Yakun Sophia, et al. "Aladdin: A pre-rtl, power-performance accelerator simulator enabling large design space exploration of customized architectures." 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA). IEEE, 2014.
Rafael Junio: FLUSH+RELOAD: A High Resolution, Low Noise, L3 Cache Side-Channel Attack. USENIX Security Symposium 2014.
Lucas Prado Melo: Hilton, A. D., B. C. Lee, and Z. Huang. "Decoupling loads for nano-instruction set computers." Proceedings of The 43rd International Symposium on Computer Architecture. 2016.
Paulo Henrique Junqueira Amorim: Tri M. Nguyen and David Wentzlaff. 2015. MORC: a manycore-oriented compressed cache. In Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48).
Uglaybe Fernandes: Albericio, Jorge, et al. "Wormhole: Wisely predicting multidimensional branches." Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2014.
Alceu Emanuel Bissoto: Akanksha Jain, Calvin Lin. "Back to the Future: Leveraging Belady’s Algorithm for Improved Cache Replacement". Proceedings of The 43rd International Symposium on Computer Architecture (ISCA). 2016.
Marcus de Assis Angeloni: S. Bucur, J. Kinder, and G. Candea - "Prototyping symbolic execution engines for interpreted languages". In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, New York, NY, USA, 239-254.
Rafael Soares: S. Girbal, G. Mouchard, A. Cohen, and O. Temam. DiST: A simple, reliable and scalable method to significantly reduce processor architecture simulation time. In Proceedings of the 2003 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 1–12, June 2003.
Alfredo Salvarani: D. Gope and M. H. Lipasti, "Bias-Free Branch Predictor," 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, Cambridge, 2014, pp. 521-532.
George Araújo: S. Khan, A. R. Alameldeen, C. Wilkerson, O. Mutlu and D. A. Jimenez, "Improving cache performance using read-write partitioning," 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), Orlando, FL, 2014, pp. 452-463.
Paulo Henrique: Eric Rotenberg, Steve Bennett, and James E. Smith. 1996. Trace cache: a low latency approach to high bandwidth instruction fetching. In Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture (MICRO 29). IEEE Computer Society, Washington, DC, USA, 24-35.
Davi Castro: G. Aşılıoğlu, Z. Jin, M. Köksal, O. Javeri and S. Önder, "LaZy Superscalar," 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), Portland, OR, 2015, pp. 260-271.
Pedro Tadahiro: The Inner Most Loop Iteration counter: a new dimension in branch history - Andre Seznec (INRIA/IRISA), Joshua San Miguel (University of Toronto), Jorge Albericio (University of Toronto)

Project 4

Goal: Expand one activity of the project 3 paper.

Tasks:

Create a folder called project4 in your repository
You can explore more configurations, make any variation to the algorithm, or choose other opportunities. You do not need to get better results. You can even work to better explain the official result.

Report:

Create a folder called project4 in your repository
Due: 12/Dec (end of the day)
Report document. SBC Template. You can choose either English or Portuguese for your report. Maximum of 6 pages containing the count (item 3 above) and the description and result of your pintool. Filename: report.pdf
Presentation Document. Create a file called presentation.pdf containing slides for a presentation of your project of up to 15 minutes.
Source code. Include your source code in the folder src. This folder should contain all code that you created/modified together with scripts to execute and a README.md file explaining dependencies. Your script could rely on environment variables for pre-requisites.

Date	Topic
22/Aug	Introduction
26/Aug	Reading a paper - Project 3 description
29/Aug	Introduction to Microarchitecture
02/Sep	Overview of Execution Environments
05/Sep	Caches
09/Sep	Work time for Project 1
12/Sep	Work time for Project 1
16/Sep	Fetch Unit
19/Sep	Project 1
23/Sep	Decode Unit
26/Sep	Allocation
30/Sep	SimPoints
03/Oct	Work time for Project 2
07/Oct	Work time for Project 2
10/Oct	Review
14/Oct	Exam 1
17/Oct	Exam 1 resolution
21/Oct	Class cancelled
24/Oct	Project 2 - Presentation
28/Oct	Holiday
31/Oct	Project 3 - Preliminary presentation
04/Nov	Project 3 - Preliminary presentation
07/Nov	Issue
11/Nov	Review Project 3
14/Nov	Holiday
18/Nov	Project 3 - Presentation
21/Nov	Project 3 - Presentation
25/Nov	Execute
28/Nov	Commit
02/Dec	Cache Coherence
05/Dec	Review
09/Dec	Holiday
12/Dec	Exam 2
16/Dec	Project 4
