Reading List for CS 6501, Special Topics in Computer Architecture: Hardware Accelerators, Spring 2024

Kevin Skadron

University of Virginia

 

Motivations for specialized hardware:

·       Esmaeilzadeh et al, “Dark Silicon and the End of Multicore Scaling,” ISCA 2011, https://ieeexplore.ieee.org/abstract/document/6307773

·       Hameed et al, “Understanding Sources of Inefficiency in General-Purpose Chips,” ISCA 2010, https://dl.acm.org/doi/abs/10.1145/1815961.1815968

·       Goulding-Hotta et al., “The GreenDroid Mobile Application Processor: An Architecture for Silicon’s Dark Future,” IEEE Micro, 2011, https://cseweb.ucsd.edu/~swanson/papers/IEEEMicro2011GreenDroid.pdf

·       Sampson et al., “Efficient Complex Operators for Irregular Codes.” HPCA 2011, https://cseweb.ucsd.edu//~swanson/papers/HPCA2011EcoCores.pdf

 

Vector/SIMD architectures:

·       Rivoire et al., "Vector lane threading." ICPP 2006. https://ieeexplore.ieee.org/document/1690605

·       Lee et al., "Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators." ISCA 2011. https://dl.acm.org/doi/abs/10.1145/2000064.2000080

 

Stream processing (these papers predate GPGPU but provide useful background):

·       Kapasi et al, "The Imagine Stream Processor", ICCD 2002. https://ieeexplore.ieee.org/abstract/document/1106783

·       Dally et al. "Merrimac: Supercomputing with Streams." SC 2003. https://dl.acm.org/doi/abs/10.1145/1048935.1050187

·       Buck et al. “Brook for GPUs: stream computing on graphics hardware.” ACM Transactions on Graphics, 2004. https://doi.org/10.1145/1015706.1015800

 

GPUs overview:

·       Nickolls et al. "Scalable Parallel Programming with CUDA," ACM Queue 2008, https://dl.acm.org/doi/pdf/10.1145/1401132.1401152

·       Bakhoda et al., "Analyzing CUDA workloads using a detailed GPU simulator", ISPASS 2009, https://ieeexplore.ieee.org/abstract/document/4919648

·       Lee et al., "Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU", ISCA 2010, https://dl.acm.org/doi/10.1145/1815961.1816021

·       Meng et al. "Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance." ISCA 2010. https://dl.acm.org/doi/10.1145/1816038.1815992

·       Jia et al. "MRPB: Memory request prioritization for massively parallel processors", HPCA 2014. https://ieeexplore.ieee.org/abstract/document/6835938

·       Chatterjee et al. "Architecting an Energy-Efficient DRAM System for GPUs", HPCA 2017, https://www.cs.utexas.edu/~skeckler/pubs/HPCA_2017_Subchannels.pdf

·       Zhu et al. "Sparse Tensor Core: Algorithm and Hardware Co-Design for Vector-wise Sparse Neural Networks on Modern GPUs", MICRO 2019, https://dl.acm.org/doi/pdf/10.1145/3352460.3358269

·       NVIDIA's whitepaper on the Grace Hopper architecture: https://resources.nvidia.com/en-us-tensor-core

 

Xeon Phi (Intel’s GPU-like competitor):

·       Sodani et al, "Knights Landing: Second-Generation Intel Xeon Phi Product," IEEE Micro 2016, https://ieeexplore.ieee.org/abstract/document/7453080

 

FPGAs:

·       Boutros and Betz. “FPGA Architecture: Principles and Progression,” IEEE Circuits and Systems Magazine, 2021, https://ieeexplore.ieee.org/abstract/document/9439568

·       Cong et al, "Understanding Performance Differences of FPGAs and GPUs," FCCM 2018, https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8457638

·       Putnam et al. "A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services," ISCA 2014. https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/Catapult_ISCA_2014.pdf

·       Liu et al. "OverGen: Improving FPGA Usability through Domain-specific Overlay Generation," MICRO 2022. https://ieeexplore.ieee.org/abstract/document/9923882

 

CGRAs:

·       Ansaloni et al, "EGRA: A Coarse Grained Reconfigurable Architectural Template," TVLSI 2011 - https://uweb.engr.arizona.edu/~ece506/readings/project-reading/3-coarse-grain/EGRA.pdf

·       Prabhakar et al, "Plasticine: A Reconfigurable Architecture For Parallel Patterns," ISCA 2017 - https://stanford-ppl.github.io/website/papers/isca17-raghu-plasticine.pdf

 

TPU:

·       Jouppi et al, "In-Datacenter Performance Analysis of a Tensor Processing Unit," https://dl.acm.org/doi/10.1145/3079856.3080246

·       Jouppi et al. "TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings," ISCA'23. https://dl.acm.org/doi/abs/10.1145/3579371.3589350

 

PIM:

·       M. He et al. "Newton: A DRAM-maker’s Accelerator-in-Memory (AiM) Architecture for Machine Learning," MICRO'20. https://ieeexplore.ieee.org/document/9251855

·       S. Lee et al. "Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology: Industrial Product," ISCA'21. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9499894

·       Hajinazar et al, SIMDRAM, ASPLOS’21 – we will read a slightly extended version on arXiv: https://arxiv.org/pdf/2105.12839.pdf

·       Lenjani et al, Gearbox, ISCA'22 - http://www.cs.virginia.edu/~ml2au/papers/GearboxISCAFinalVersion.pdf

 

PIM+ML:

·       Liu et al, "Accelerating Personalized Recommendation with Cross-level Near-Memory Processing," ISCA'23. https://dl.acm.org/doi/abs/10.1145/3579371.3589101

·       Zhou et al. "TransPIM: A Memory-based Acceleration via Software-Hardware Co-Design for Transformer," HPCA'22, https://ieeexplore.ieee.org/document/9773212  

·       Park et al. "AttAcc! Unleashing the Power of PIM for Batched Transformer-based Generative Model Inference." ASPLOS'24.  https://ieeexplore.ieee.org/abstract/document/10218731

 

ML accelerators:

·       Chen et al, "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks," ISCA 2016, http://www.rle.mit.edu/eems/wp-content/uploads/2016/04/eyeriss_isca_2016.pdf

·       Chen et al, "Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices," JETCAS 2019, https://www.rle.mit.edu/eems/wp-content/uploads/2019/04/2019_jetcas_eyerissv2.pdf

·       Judd et al. "Stripes: Bit-Serial Deep Neural Network Computing," MICRO'16, https://people.ece.ubc.ca/~aamodt/publications/papers/stripes-final.pdf

·       Gerogiannis, et al. "HotTiles: Accelerating SpMM with Heterogeneous Accelerator Architectures", HPCA'24, https://iacoma.cs.uiuc.edu/iacoma-papers/hpca24_2.pdf

 

Cerebras Wafer-Scale Engine for ML:

·       (It’s hard to find papers about the Cerebras architecture, but this seems to be the best one) Lauterbach. “The Path to Successful Wafer-Scale Integration: The Cerebras Story.” IEEE Micro 2021. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9623424

·       (And an overview of earlier work on wafer-scale integration) McDonald et al, “The trials of wafer-scale integration: Although major technical problems have been overcome since WSI was first tried in the 1960s, commercial companies can't yet make it fly.” IEEE Spectrum, 1984. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6370295

 

SmartNICs:

·       Lin et al, "PANIC: A High-Performance Programmable NIC for Multi-tenant Networks," OSDI'20, https://wisr.cs.wisc.edu/papers/osdi20-panic.pdf

·       Shashidhara et al, "FlexTOE: Flexible TCP Offload with Fine-Grained Parallelism," NSDI 2022, https://www.usenix.org/system/files/nsdi22-paper-shashidhara.pdf

 

Graph Analytics:

·       Mukkara et al, "Exploiting Locality in Graph Analytics through Hardware-Accelerated Traversal Scheduling," MICRO 2018, https://www.cs.cmu.edu/~beckmann/publications/papers/2018.micro.hats.pdf

·       Zhang et al, "GraphP: Reducing Communication for PIM-based Graph Processing with Efficient Data Partition," HPCA2018, https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8327036

·       Talati et al, "Mint: An Accelerator For Mining Temporal Motifs," MICRO'22, https://tnm.engin.umich.edu/wp-content/uploads/sites/353/2023/03/2022.10.Mint_An_Accelerator_For_Mining_Temporal_Motifs.pdf

 

Google hardware accelerators:

·       Ranganathan et al, "Warehouse-scale video acceleration: co-design and deployment in the wild", ASPLOS'21, https://dl.acm.org/doi/10.1145/3445814.3446723

·       Karandikar et al, "A Hardware Accelerator for Protocol Buffers," MICRO'21, https://dl.acm.org/doi/10.1145/3466752.3480051

 

Potpourri:

·       Shiflett et al, "Flumen: Dynamic Processing in the Photonic Interconnect", ISCA'23, https://dl.acm.org/doi/10.1145/3579371.3589110

·       Zhu et al, "Lightening-Transformer: A Dynamically-operated Optically-interconnected Photonic Transformer Accelerator", HPCA'24, https://arxiv.org/pdf/2305.19533.pdf

·       Sriram et al, "SCALO: An Accelerator-Rich Distributed System for Scalable Brain-Computer Interfacing," ISCA’23, https://dl.acm.org/doi/abs/10.1145/3579371.3589107

·       Kim et al, "SHARP: A Short-Word Hierarchical Accelerator for Robust and Practical Fully Homomorphic Encryption," ISCA'23, https://dl.acm.org/doi/10.1145/3579371.3589053

·       Gao et al, "BeeZip: Towards An Organized and Scalable Architecture for Data Compression," ASPLOS'24, https://dl.acm.org/doi/10.1145/3620666.3651323