6354 schedule

This schedule will be updated as the semester progresses; it is likely to be accurate only about a week in advance.

H&P refers to Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 5th edition. Textbook readings are provided primarily for reference; most discussion will be based on the assigned papers and material presented in lecture.

Some papers can only be accessed through the UVa network (or another institution that subscribes to the appropriate service). If you are Off-Grounds, you can use the UVa proxy or UVa VPN.

no class

blank

Logistics / Tech Trends (slides 1up/4up)

H&P 1.4-6, 1.8-1.10
* (no review) Moore, Progress in Digital Integrated Electronics
* (no review) Amdahl, Validity of the single processor approach to achieving large scale computing capabilities

nothing assigned

Memory Hierarchy 1 (slides 1up/4up)

H&P B
Smith, Cache memories, 1982
Bernstein, Cache-timing attacks on AES, 2005

Homework 1 (Membench) out

Memory Hierarchy 2 (slides 1up/4up)

H&P 2.2-4
Jouppi, Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers, 1990
Cook et al, A Hardware Evaluation of Cache Partitioning to Improve Utilization and Energy-Efficiency while Preserving Responsiveness, 2013

nothing assigned

Memory Hierarchy 3 (slides 1up/4up)

Goto and van de Geijn, Anatomy of a High-Performance Matrix Multiplication, 2008
Beamer et al, Locality Exists in Graph Processing: Workload Characterization on an Ivy Bridge Server, 2015
* (no review) Bilmes et al, The PHiPAC v1.0 Matrix-Multiply Distribution, 1998

nothing assigned

Pipelining (slides 1up/4up)

H&P C, 3.3
Bhandarkar and Clark, Performance from architecture: comparing a RISC and CISC with similar hardware organization, 1991
Waterman et al, The RISC-V Instruction Set Manual: Volume I: User-Level ISA, Chapter 1 (including commentary) only

nothing assigned

Out-of-Order 0: Static Scheduling / Branch Prediction (slides 1up/4up)

H&P 3.1-3
Weiss and Smith, A Study of Scalar Compilation Techniques for Pipelined Supercomputers, 1990
McFarling, Combining Branch Predictors, 1993

Homework 1 Checkpoint DUE

Out-of-Order 1: Multiple Issue (slides 1up/4up)

H&P 3.7, H.3-4
Fisher, Very Long Instruction Word Architectures and the ELI-512, 1983
Colwell et al, A VLIW Architecture for a Trace Scheduling Compiler, 1987

nothing assigned

Out-of-Order 2: Dynamic Issue I / Precise Interrupts (slides)

H&P 3.6
Smith and Pleszkun, Implementation of Precise Interrupts in Pipelined Processors

blank

Out-of-Order 3: Dynamic Issue II (slides)

H&P 3.4-5, 3.8
Tomasulo, An Efficient Algorithm for Exploiting Multiple Arithmetic Units

~~Homework 1 DUE Friday~~

Out-of-Order 4: Dynamic Issue III (slides)

H&P 3.9
Yeager, The MIPS R10000 Superscalar microprocessor
* (no review) Kanter, Intel’s Haswell CPU Microarchitecture

Homework 1 DUE NOON

Out-of-Order 5: SMT (slides)

H&P 3.12
Tullsen et al, Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor
Alverson et al, The Tera Computer System

Homework 2 (OOO) out

reading day

blank

Multicore 1: Processor networks (slides audio screencapture)

H&P 5.1-2
Wulf and Bell, C.mmp—A multi-mini-processor
Scott, Synchronization and Communication in the T3E Multiprocessor
* (no review) Leiserson, Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing

nothing assigned

Multicore 2: Snooping cache coherence (slides audio screencapture)

H&P 5.3
Goodman, Using cache memory to reduce processor-memory traffic
Archibald and Baer, Cache coherence protocols: evaluation using a multiprocessor simulation model

nothing assigned

Multicore 3: Directory-based cache coherence (slides audio screencapture)

H&P 5.4
Lenoski et al, The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor
* (no review) Le et al, IBM POWER6 Microarchitecture

Homework 2 (OOO) Checkpoint DUE SATURDAY

Multicore 4: Memory models (slides audio screencapture)

H&P 5.6
Adve and Gharachorloo, Shared Memory Consistency Models: A Tutorial
Boehm and Adve, Foundations of the C++ Concurrency Memory Model, section 1 only

nothing assigned

Multicore 5: Synchronization support (slides audio screencapture)

H&P 5.5
Anderson, The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors
Guiroux and Lachaize, Multicore Locks: The Case Is Not Closed Yet
* (no review) David et al, Everything you always wanted to know about synchronization but were afraid to ask

nothing assigned

Multicore 6: Transactional Memory (slides audio)

Herlihy and Moss, Transactional Memory: Architectural Support for Lock-Free Data Structures
McKenney et al, Why The Grass May Not Be Greener On The Other Side: A Comparison of Locking vs. Transactional Memory
* (no review) Cutress, Intel Disables TSX Instructions: Erratum Found in Haswell, Haswell-E/EP, Broadwell-Y

Homework 2 (OOO) DUE TUESDAY

Vector 1: Vector supercomputers / GPUs (slides audio screencapture)

H&P 4.1-4.2
Russell, The CRAY-1 Computer System
Lindholm et al, A User-Programmable Vertex Engine

blank

Vector 2: Vector Programming Interfaces 1 (slides audio)

Guest lecture (Jack Wadden)

blank

Vector 3: Vector Programming Interfaces 2 (slides audio)

Guest lecture (Jack Wadden)

Homework 3 (GPGPU) out

Homework 2 Post-Mortem / Vector 4: GPGPU Case Studies (slides audio video)

H&P Chapter 4
Lee et al, Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU
Lee et al, Exploring the Design Space of SPMD Divergence Management on Data-Parallel Architectures
* (no review) Volkov and Demmel, Benchmarking GPUs to Tune Dense Linear Algebra

blank

FPGAs (slides audio video)

* (no review) Brown and Rose, Architecture of FPGAs and CPLDs: A Tutorial. Sections 1.3 and 2.2.1-7 are optional.
Putnam et al, A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services

blank

ASIC accelerators (slides audio)

Reagen et al, Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators
Shao et al, The Aladdin Approach to Accelerator Design and Modeling
* (no review) Han et al, EIE: Efficient Inference Engine on Compressed Neural Networks

Homework 3 Part 1 DUE

Warehouse Scale Computers (slides audio screencapture)

* (no review) Barroso et al, The Datacenter as a Computer, chapters 1, 3, and 6

blank

Security (slides audio screencapture)

Smith and Weingart, Building a high-performance, programmable secure coprocessor, 1998, Sections 1-6, 10
* (no review) Costan and Devadas, Intel SGX Explained (probably too long/detailed to read in full; focus on sections 1, 3, 4, 6)

blank

no class

blank

Exam Review (slides audio screencapture)

TBA

Homework 3 (GPGPU) DUE TUESDAY

Exam Review (slides audio screencapture)

blank

Exam (in-class)

List of topics

blank