Reading List for CS 6501, Special Topics in Computer Architecture: Hardware Accelerators, Spring 2024
Kevin Skadron
University of Virginia
Motivations for specialized hardware:
· Esmaeilzadeh et al., "Dark silicon and the end of multicore scaling," ISCA 2011. https://ieeexplore.ieee.org/abstract/document/6307773
· Hameed et al., "Understanding Sources of Inefficiency in General-Purpose Chips," ISCA 2010. https://dl.acm.org/doi/abs/10.1145/1815961.1815968
· Goulding-Hotta et al., "The GreenDroid Mobile Application Processor: An Architecture for Silicon's Dark Future," IEEE Micro, 2011. https://cseweb.ucsd.edu//~swanson/papers/IEEEMicro2011GreenDroid.pdf
· Sampson et al., "Efficient Complex Operators for Irregular Codes," HPCA 2011. https://cseweb.ucsd.edu//~swanson/papers/HPCA2011EcoCores.pdf
Vector/SIMD architectures:
· Rivoire et al., "Vector lane threading," ICPP 2006. https://ieeexplore.ieee.org/document/1690605
· Lee et al., "Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators," ISCA 2011. https://dl.acm.org/doi/abs/10.1145/2000064.2000080
Stream processing (these predate GPGPU but form a useful background):
· Kapasi et al., "The Imagine Stream Processor," ICCD 2002. https://ieeexplore.ieee.org/abstract/document/1106783
· Dally et al., "Merrimac: Supercomputing with Streams," SC 2003. https://dl.acm.org/doi/abs/10.1145/1048935.105018
· Buck et al., "Brook for GPUs: stream computing on graphics hardware," ACM Transactions on Graphics, 2004. https://doi.org/10.1145/1015706.1015800
GPUs overview:
· Nickolls et al., "Scalable Parallel Programming with CUDA," ACM Queue, 2008. https://dl.acm.org/doi/pdf/10.1145/1401132.1401152
· Bakhoda et al., "Analyzing CUDA workloads using a detailed GPU simulator," ISPASS 2009. https://ieeexplore.ieee.org/abstract/document/4919648
· Lee et al., "Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU," ISCA 2010. https://dl.acm.org/doi/10.1145/1815961.1816021
· Meng et al., "Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance," ISCA 2010. https://dl.acm.org/doi/10.1145/1816038.1815992
· Jia et al., "MRPB: Memory request prioritization for massively parallel processors," HPCA 2014. https://ieeexplore.ieee.org/abstract/document/6835938
· Chatterjee et al., "Architecting an Energy-Efficient DRAM System for GPUs," HPCA 2017. https://www.cs.utexas.edu/~skeckler/pubs/HPCA_2017_Subchannels.pdf
· Zhu et al., "Sparse Tensor Core: Algorithm and Hardware Co-Design for Vector-wise Sparse Neural Networks on Modern GPUs," MICRO 2019. https://dl.acm.org/doi/pdf/10.1145/3352460.3358269
· NVIDIA's whitepaper on the Grace Hopper architecture: https://resources.nvidia.com/en-us-tensor-core
Xeon Phi (Intel's GPU-like competitor):
· Sodani et al., "Knights Landing: Second-Generation Intel Xeon Phi Product," IEEE Micro, 2016. https://ieeexplore.ieee.org/abstract/document/7453080
FPGAs:
· Boutros and Betz, "FPGA Architecture: Principles and Progression." https://ieeexplore.ieee.org/abstract/document/9439568
· Cong et al., "Understanding Performance Differences of FPGAs and GPUs," FCCM 2018. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=845763
· Putnam et al., "A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services," ISCA 2014. https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/Catapult_ISCA_2014.pdf
· Liu et al., "OverGen: Improving FPGA Usability through Domain-specific Overlay Generation," MICRO 2022. https://ieeexplore.ieee.org/abstract/document/9923882
CGRAs:
· Ansaloni et al., "EGRA: A Coarse Grained Reconfigurable Architectural Template," TVLSI 2011. https://uweb.engr.arizona.edu/~ece506/readings/project-reading/3-coarse-grain/EGRA.pdf
· Prabhakar et al., "Plasticine: A Reconfigurable Architecture For Parallel Patterns," ISCA 2017. https://stanford-ppl.github.io/website/papers/isca17-raghu-plasticine.pdf
TPU:
· Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit," ISCA 2017. https://dl.acm.org/doi/10.1145/3079856.3080246
· Jouppi et al., "TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings," ISCA 2023. https://dl.acm.org/doi/abs/10.1145/3579371.3589350
PIM:
· M. He et al., "Newton: A DRAM-maker's Accelerator-in-Memory (AiM) Architecture for Machine Learning," MICRO 2020. https://ieeexplore.ieee.org/document/9251855
· S. Lee et al., "Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology: Industrial Product," ISCA 2021. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9499894
· Hajinazar et al., "SIMDRAM," ASPLOS 2021 (we will read a slightly extended version on arXiv). https://arxiv.org/pdf/2105.12839.pdf
· Lenjani et al., "Gearbox," ISCA 2022. http://www.cs.virginia.edu/~ml2au/papers/GearboxISCAFinalVersion.pdf
PIM+ML:
· Liu et al., "Accelerating Personalized Recommendation with Cross-level Near-Memory Processing," ISCA 2023. https://dl.acm.org/doi/abs/10.1145/3579371.3589101
· Zhou et al., "TransPIM: A Memory-based Acceleration via Software-Hardware Co-Design for Transformer," HPCA 2022. https://ieeexplore.ieee.org/document/9773212
· Park et al., "AttAcc! Unleashing the Power of PIM for Batched Transformer-based Generative Model Inference," ASPLOS 2024. https://ieeexplore.ieee.org/abstract/document/10218731
ML accelerators – Eyeriss and others:
· Eyeriss v1: Chen et al., ISCA 2016. http://www.rle.mit.edu/eems/wp-content/uploads/2016/04/eyeriss_isca_2016.pdf
· Eyeriss v2: Chen et al., JETCAS 2019. https://www.rle.mit.edu/eems/wp-content/uploads/2019/04/2019_jetcas_eyerissv2.pdf
· Judd et al., "Stripes: Bit-Serial Deep Neural Network Computing," MICRO 2016. https://people.ece.ubc.ca/~aamodt/publications/papers/stripes-final.pdf
· Gerogiannis et al., "HotTiles: Accelerating SpMM with Heterogeneous Accelerator Architectures," HPCA 2024. https://iacoma.cs.uiuc.edu/iacoma-papers/hpca24_2.pdf
Cerebras Wafer-Scale Engine for ML:
· (It's hard to find papers about the Cerebras architecture, but this seems to be the best one) Lauterbach, "The Path to Successful Wafer-Scale Integration: The Cerebras Story," IEEE Micro, 2021. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9623424
· (And an overview of earlier work on wafer-scale integration) McDonald et al., "The trials of wafer-scale integration: Although major technical problems have been overcome since WSI was first tried in the 1960s, commercial companies can't yet make it fly," IEEE Spectrum, 1984. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6370295
SmartNICs:
· Lin et al., "PANIC: A High-Performance Programmable NIC for Multi-tenant Networks," OSDI 2020. https://wisr.cs.wisc.edu/papers/osdi20-panic.pdf
· Shashidhara et al., "FlexTOE: Flexible TCP Offload with Fine-Grained Parallelism," NSDI 2022. https://www.usenix.org/system/files/nsdi22-paper-shashidhara.pdf
Graph Analytics:
· Mukkara et al., "Exploiting Locality in Graph Analytics through Hardware-Accelerated Traversal Scheduling," MICRO 2018. https://www.cs.cmu.edu/~beckmann/publications/papers/2018.micro.hats.pdf
· Zhang et al., "GraphP: Reducing Communication for PIM-based Graph Processing with Efficient Data Partition," HPCA 2018. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8327036
· Talati et al., "Mint: An Accelerator For Mining Temporal Motifs," MICRO 2022. https://tnm.engin.umich.edu/wp-content/uploads/sites/353/2023/03/2022.10.Mint_An_Accelerator_For_Mining_Temporal_Motifs.pdf
Google hardware accelerators:
· Ranganathan et al., "Warehouse-scale video acceleration: co-design and deployment in the wild," ASPLOS 2021. https://dl.acm.org/doi/10.1145/3445814.3446723
· Karandikar et al., "A Hardware Accelerator for Protocol Buffers," MICRO 2021. https://dl.acm.org/doi/10.1145/3466752.3480051
Potpourri:
· Shiflett et al., "Flumen: Dynamic Processing in the Photonic Interconnect," ISCA 2023. https://dl.acm.org/doi/10.1145/3579371.3589110
· Zhu et al., "Lightening-Transformer: A Dynamically-operated Optically-interconnected Photonic Transformer Accelerator," HPCA 2024. https://arxiv.org/pdf/2305.19533.pdf
· Sriram et al., "SCALO: An Accelerator-Rich Distributed System for Scalable Brain-Computer Interfacing," ISCA 2023. https://dl.acm.org/doi/abs/10.1145/3579371.3589107
· Kim et al., "SHARP: A Short-Word Hierarchical Accelerator for Robust and Practical Fully Homomorphic Encryption," ISCA 2023. https://dl.acm.org/doi/10.1145/3579371.3589053
· Gao et al., "BeeZip: Towards An Organized and Scalable Architecture for Data Compression," ASPLOS 2024. https://dl.acm.org/doi/10.1145/3620666.3651323