Bibliography

[ABC+16] M. Abadi, P. Barham, J. Chen, et al. TensorFlow: a system for large-scale machine learning. OSDI, 2016.

[ACG+16] M. Abadi, A. Chu, I. Goodfellow, H. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. CCS, Oct. 2016.

[AA16] M. Abadi and D. Andersen. Learning to protect communications with adversarial neural cryptography. Oct. 2016.

[AGM+18] J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim. Sanity checks for saliency maps. NeurIPS, Dec. 2018.

[ARS+20] D. Abts, J. Ross, J. Sparling, et al. Think fast: a tensor streaming processor (TSP) for accelerating deep learning workloads. ISCA, Jun. 2020.

[AMP+19] A. Agrawal, A. Modi, A. Passos, et al. TensorFlow Eager: a multi-stage, Python-embedded DSL for machine learning (slides). MLSys, Feb. 2019.

[AAB+19] Z. Ahmed, S. Amizadeh, M. Bilenko, et al. Machine learning at Microsoft with ML.NET. SIGKDD, Jul. 2019.

[ALV08] M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable, commodity data center network architecture. SIGCOMM, Oct. 2008.

[Ali20] Alibaba. Machine Learning Platform for AI. 2020.

[Ala18] J. Alammar. The illustrated transformer. June 2018.

[AHJ+18] D. Alistarh, T. Hoefler, M. Johansson, S. Khirirat, N. Konstantinov, and C. Renggli. The convergence of sparsified gradient methods. NeurIPS, Dec. 2018.

[AVG+15] L. Alvarez, L. Vilanova, M. Gonzalez, X. Martorell, N. Navarro, and E. Ayguade. Hardware-software coherence protocol for the coexistence of caches and local memories. TC, Jan. 2015.

[Ama19] Amazon. EC2 Inf1 Instances. 2019.

[Ama19b] Amazon. AWS re:Invent 2019: deliver high performance ML inference with AWS Inferentia. Dec. 2019.

[Ama20] Amazon. SageMaker. 2020.

[Amd67] G. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. AFIPS, Apr. 1967.

[Amd19] AMD. EPYC 7742. 2019.

[AAB+15] D. Amodei, R. Anubhai, E. Battenberg, et al. Deep Speech 2: end-to-end speech recognition in English and Mandarin. ICML, Dec. 2015.

[AC16] D. Amodei and J. Clark. Faulty reward functions in the wild. OpenAI, Dec. 2016.

[DH18] D. Amodei and D. Hernandez. AI and compute. OpenAI, May 2018.

[AES19] A. Antoniou, H. Edwards, and A. Storkey. How to train your MAML. ICLR, Mar. 2019.

[AP19] S. Arik and T. Pfister. ProtoAttend: attention-based prototypical learning. Sep. 2019.

[ABF+19] N. Arivazhagan, A. Bapna, O. Firat, et al. Massively multilingual neural machine translation in the wild: findings and challenges. July 2019.

[ACB17] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. Jan. 2017.

[ADC11] T. Ashby, P. Diaz, and M. Cintra. Software-based cache coherence with hardware-assisted selective self-invalidations using Bloom filters. TC, Apr. 2011.

[AFO18] S. Ashkiani, M. Farach-Colton, and J. Owens. A dynamic hash table for the GPU. IPDPS, May 2018.

[ACW18] A. Athalye, N. Carlini, and D. Wagner. Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. ICML, Jul. 2018.

[BKH16] J. Ba, J. Kiros, and G. Hinton. Layer normalization. July 2016.

[BGJ+18] V. Bacoyannis, V. Glukhov, T. Jin, J. Kochems, and D. Song. Idiosyncrasies and challenges of data driven learning in electronic trading. NeurIPS, Dec. 2018.

[Bai20] Baidu. Kunlun. 2020.

[BKK18] S. Bai, J. Kolter, and V. Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. Mar. 2018.

[BKK19] S. Bai, J. Kolter, and V. Koltun. Deep equilibrium models. NeurIPS, Dec. 2019.

[BTV06] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: speeded up robust features. ECCV, 2006.

[BES+19] P. Balaprakash, R. Egele, M. Salim, V. Vishwanath, F. Xia, T. Brettin, and R. Stevens. Scalable reinforcement learning based neural architecture search for cancer deep learning research. SC, Nov. 2019.

[BV20] M. Balunovic and M. Vechev. Adversarial training and provable defenses: bridging the gap. ICLR, Feb. 2020.

[BHR18] L. Barroso, U. Hölzle, and P. Ranganathan. The datacenter as a computer: designing warehouse-scale machines. M\&C, Oct. 2018.

[BLK+19] F. Belletti, K. Lakshmanan, W. Krichene, et al. Scaling up collaborative filtering data sets through randomized fractal expansions. Apr. 2019.

[Ben12] Y. Bengio. Practical recommendations for gradient-based training of deep architectures. NNs: Tricks of the Trade, Sep. 2012.

[BBC+19] C. Berner, G. Brockman, B. Chan, et al. Dota 2 with large scale deep reinforcement learning. Dec. 2019.

[BCC+19] D. Berg, R. Chirravuri, R. Cledat, S. Goyal, F. Hamad, and V. Tuulos. Open-sourcing Metaflow, a human-centric framework for data science. Netflix Tech Blog, Dec. 2019.

[Ber19] Berkeley. Ray. 2019.

[BDD+20] M. Binkowski, J. Donahue, S. Dieleman, et al. High fidelity speech synthesis with adversarial networks. ICLR, Apr. 2020.

[BHH20] P. Blanchard, D. Higham, and N. Higham. Accurately computing the log-sum-exp and softmax functions. J. Num. Analysis, Aug. 2020.

[BCK+15] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. ICML, July 2015.

[BCZ+16] T. Bolukbasi, K. Chang, J. Zou, V. Saligrama, and A. Kalai. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. NeurIPS, Dec. 2016.

[BIK+17] K. Bonawitz, V. Ivanov, B. Kreuter, et al. Practical secure aggregation for privacy-preserving machine learning. CCS, Oct. 2017.

[BHR+08] U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral parallelizer and locality optimizer. SIGPLAN, June 2008.

[BAC16] U. Bondhugula, A. Acharya, and A. Cohen. The Pluto+ algorithm: a practical approach for parallelization and locality optimization of affine loop nests. TOPLAS, Apr. 2016.

[BLB17] A. Botev, G. Lever, and D. Barber. Nesterov's accelerated gradient and momentum as approximations to regularised update descent. IJCNN, Jul. 2017.

[BCD+18] T. Boyd, Y. Cao, S. Das, T. Joerg, and J. Lebar. Pushing the limits of GPU performance with XLA. Nov. 2018.

[BGL+93] J. Bromley, I. Guyon, Y. LeCun, E. Sackinger, and R. Shah. Signature verification using a ``Siamese'' time delay neural network. NeurIPS, Dec. 1993.

[Bro19] Y. Brovman. Complementary item recommendations at eBay scale. Feb. 2019.

[BMR+20] T. Brown, B. Mann, N. Ryder, M. Subbiah, et al. Language models are few-shot learners. May 2020.

[BCN06] C. Bucila, R. Caruana, and A. Niculescu-Mizil. Model compression. SIGKDD, Aug. 2006.

[BEP+18] Y. Burda, H. Edwards, D. Pathak, A. Storkey, T. Darrell, and A. Efros. Large-scale study of curiosity-driven learning. Aug. 2018.

[CZH19] H. Cai, L. Zhu, and S. Han. ProxylessNAS: direct neural architecture search on target task and hardware. ICLR, Feb. 2019.

[CBG+20] L. Cambier, A. Bhiwandiwalla, T. Gong, O. H. Elibol, M. Nekuii, and H. Tang. Shifted and squeezed 8-bit floating point format for low-precision training of deep neural networks. ICLR, Jan. 2020.

[CHS+18] Z. Cao, G. Hidalgo, T. Simon, S. Wei, and Y. Sheikh. OpenPose: realtime multi-person 2D pose estimation using part affinity fields. CVPR, Dec. 2018.

[CLN+17] I. Caspi, G. Leibovich, G. Novik, and S. Endrawis. Reinforcement Learning Coach. Dec. 2017.

[CMG+18] P. Castro, S. Moitra, C. Gelada, S. Kumar, and M. Bellemare. Dopamine: a research framework for deep reinforcement learning. Dec. 2018.

[CJL+16] W. Chan, N. Jaitly, Q. Le, and O. Vinyals. Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. ICASSP, 2016.

[CFL20] O. Chang, L. Flokas, and H. Lipson. Principled weight initialization for hypernetworks. ICLR, Feb. 2020.

[CCS+17] P. Chaudhari, A. Choromanska, S. Soatto, et al. Entropy-SGD: biasing gradient descent into wide valleys. ICLR, Mar. 2017.

[CBH+02] N. Chawla, K. Bowyer, L. Hall, and W. Kegelmeyer. SMOTE: synthetic minority over-sampling technique. JAIR, June 2002.

[CHM+19] Y. Chebotar, A. Handa, V. Makoviychuk, M. Macklin, J. Issac, N. Ratliff, and D. Fox. Closing the sim-to-real loop: adapting simulation randomization with real world experience. ICRA, May 2019.

[CXZ+16] T. Chen, B. Xu, C. Zhang, and C. Guestrin. Training deep nets with sublinear memory cost. Apr. 2016.

[CES16] Y. Chen, J. Emer, and V. Sze. Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks. ISCA, June 2016.

[CG16] T. Chen and C. Guestrin. XGBoost: a scalable tree boosting system. SIGKDD, Aug. 2016.

[CPS+17] L. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking Atrous convolution for semantic image segmentation. June 2017.

[CES17] Y. Chen, J. Emer, and V. Sze. Using dataflow to optimize energy efficiency of deep neural network accelerators. MICRO, June 2017.

[CMJ+18] T. Chen, T. Moreau, Z. Jiang, et al. TVM: an automated end-to-end optimizing compiler for deep learning. OSDI, 2018.

[CYC19] C. Chen, C. Yang, and H. Cheng. Efficient and robust parallel DNN training through model parallelism on multi-GPU platform. Oct. 2019.

[CZZ+19] C. Chen, M. Zhang, M. Zhang, Y. Liu, Y. Li, and S. Ma. Social attentional memory network: modeling aspect- and friend-level differences in recommendation. WSDM, Jan. 2019.

[CZL+19] Q. Chen, H. Zhao, W. Li, P. Huang, and W. Ou. Behavior sequence transformer for e-commerce recommendation in Alibaba. DLP-KDD, Aug. 2019.

[CMF+20] B. Chen, T. Medini, J. Farwell, S. Gobriel, C. Tai, and A. Shrivastava. SLIDE: in defense of smart algorithms over hardware acceleration for large-scale deep learning systems. MLSys, Mar. 2020.

[CYE+19] Y. Chen, T. Yang, J. Emer, and V. Sze. Eyeriss v2: a flexible accelerator for emerging deep neural networks on mobile devices. JETCAS, June 2019.

[CKH+16] H. Cheng, L. Koc, J. Harmsen, et al. Wide \& deep learning for recommender systems. DLRS, Sep. 2016.

[CWV+14] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer. cuDNN: efficient primitives for deep learning. Dec. 2014.

[CCK+17] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo. StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. CVPR, Nov. 2017.

[CWV+18] J. Choi, Z. Wang, S. Venkataramani, P. Chuang, V. Srinivasan, and K. Gopalakrishnan. PACT: parameterized clipping activation for quantized neural networks. July 2018.

[Cho16] F. Chollet. Xception: deep learning with depthwise separable convolutions. CVPR, Oct. 2016.

[CB18] N. Choma and J. Bruna. Graph neural networks for neutrino classification. Big Data Summit, Feb. 2018.

[CGC+14] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. Dec. 2014.

[CFO+18] E. Chung, J. Fowers, K. Ovtcharov, et al. Serving DNNs in real time at datacenter scale with project Brainwave. MICRO, Mar. 2018.

[CAL+16] O. Cicek, A. Abdulkadir, S. Lienkamp, T. Brox, and O. Ronneberger. 3D U-Net: learning dense volumetric segmentation from sparse annotation. MICCAI, June 2016.

[Cor20] Cortex. Deploy machine learning models in production. 2020.

[CAS16] P. Covington, J. Adams, and E. Sargin. Deep neural networks for YouTube recommendations. RecSys, Sep. 2016.

[DB19] W. Dai and D. Berleant. Benchmarking contemporary deep learning hardware and frameworks: a survey of qualitative metrics. CogMI, Dec. 2019.

[DAM+16] D. Das, S. Avancha, D. Mudigere, et al. Distributed deep learning using synchronous stochastic gradient descent. Feb. 2016.

[Dal17] B. Dally. High-performance hardware for machine learning. ENN, Feb. 2017.

[DMM+18] D. Das, N. Mellempudi, D. Mudigere, et al. Mixed precision training of convolutional neural networks using integer operations. ICLR, Feb. 2018.

[DPG+14] Y. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NeurIPS, Dec. 2014.

[DKA+19] S. Dave, Y. Kim, S. Avancha, K. Lee, and A. Shrivastava. DMazeRunner: executing perfectly nested loops on dataflow accelerators. TECS, Oct. 2019.

[Daw20] DAWNBench. DAWNBench: an end-to-end deep learning benchmark and competition. 2020.

[DCJ19] M. Dacrema, P. Cremonesi, and D. Jannach. Are we really making much progress? A worrying analysis of recent neural recommendation approaches. RecSys, Sep. 2019.

[Dee19] DeepBench. Benchmarking deep learning operations on different hardware. 2019.

[DGY+74] R. Dennard, F. Gaensslen, H. Yu, V. Rideout, E. Bassous, and A. LeBlanc. Design of ion-implanted MOSFET's with very small physical dimensions. JSSC, Oct. 1974.

[DAM+19] D. Dennis, D. Acar, V. Mandikal, V. Sadasivan, H. Simhadri, V. Saligrama, and P. Jain. Shallow RNNs: a method for accurate time-series classification on tiny devices. NeurIPS, Dec. 2019.

[Dev17] J. Devlin. Sharp models on dull hardware: fast and accurate neural machine translation decoding on the CPU. May 2017.

[DCL+18] J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. Oct. 2018.

[DAL+18] G. Dhillon, K. Azizzadenesheli, Z. Lipton, et al. Stochastic activation pruning for robust adversarial defense. ICLR, Mar. 2018.

[dDF+19] F. de Dinechin, L. Forget, J. Muller, and Y. Uguen. Posits: the good, the bad and the ugly. CoNGA, Mar. 2019.

[DSK+19] Y. Ding, J. Sohn, M. Kawczynski, et al. A deep learning model to predict a diagnosis of Alzheimer disease by using 18F-FDG PET of the brain. Radiology, Feb. 2019.

[DPB+17] L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio. Sharp minima can generalize for deep nets. ICML, Aug. 2017.

[DWO+19] Z. Doctor, D. Wysocki, R. O'Shaughnessy, D. Holz, and B. Farr. Black hole coagulation: modeling hierarchical mergers in black hole populations. Nov. 2019.

[DDV+20] T. Domhan, M. Denkowski, D. Vilar, X. Niu, F. Hieber, and K. Heafield. The Sockeye 2 neural machine translation toolkit at AMTA 2020. Aug. 2020.

[Don19] L. Dong. eBay's hyperscale platforms. Sep. 2019.

[DYC+19] Z. Dong, Z. Yao, Y. Cai, D. Arfeen, A. Gholami, M. Mahoney, and K. Keutzer. HAWQ-V2: Hessian aware trace-weighted quantization of neural networks. Nov. 2019.

[Doz16] T. Dozat. Incorporating Nesterov momentum into Adam. ICLR, May 2016.

[DMM+19] N. Dryden, N. Maruyama, T. Moon, T. Benson, M. Snir, and B. Van Essen. Channel and filter parallelism for large-scale CNN training. SC, Nov. 2019.

[DJS20] M. Du, R. Jia, and D. Song. Robust anomaly detection and backdoor attack detection via differential privacy. ICLR, Feb. 2020.

[DHS11] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, July 2011.

[Efr20] A. Efrati. AI startups proliferate as businesses look for savings. The Information, Aug. 2020.

[ERR+18] V. Elango, N. Rubin, M. Ravishankar, H. Sandanagobalane, and V. Grover. Diesel: DSL for linear algebra and neural net computations on GPUs. MAPL, June 2018.

[Eid18] Eider. Expo Demo. NeurIPS, Dec. 2018.

[ENG+18] A. Eisenman, M. Naumov, D. Gardner, M. Smelyanskiy, S. Pupyrev, K. Hazelwood, A. Cidon, and S. Katti. Bandana: using non-volatile memory for storing deep learning models. Nov. 2018.

[ETT15] T. Erez, Y. Tassa, and E. Todorov. Simulation tools for model-based robotics: comparison of Bullet, Havok, MuJoCo, ODE and PhysX. ICRA, May 2015.

[EBA+11] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger. Dark silicon and the end of multicore scaling. ISCA, June 2011.

[EG16] R. Evans and J. Gao. DeepMind AI reduces Google data centre cooling bill by 40 percent. July 2016.

[Fac18] Facebook. Glow IR. Oct. 2018.

[Fac20] Facebook. Compiler for neural network hardware accelerators. Feb. 2020.

[FHY19] F. Farshchi, Q. Huang, and H. Yun. Integrating NVIDIA deep learning accelerator (NVDLA) with RISC-V SoC on FireSim. EMC2, Dec. 2019.

[Fel19] M. Feldman. AI recommendation systems get a GPU makeover. 2019.

[Fel19b] A. Feldman. Cerebras deploys the CS-1, the industry's fastest AI computer, at Argonne National Lab. Nov. 2019.

[FGM+10] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. PAMI, Sep. 2010.

[Fey20] M. Fey. PyTorch geometric documentation. 2020.

[FL19] M. Fey and J. Lenssen. Fast graph representation learning with PyTorch geometric. Mar. 2019.

[FAL17] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. ICML, July 2017.

[FWT17] V. Firoiu, W. Whitney, and J. Tenenbaum. Beating the world's best at Super Smash Bros. with deep reinforcement learning. May 2017.

[FRP+20] S. Flennerhag, A. Rusu, R. Pascanu, F. Visin, H. Yin, and R. Hadsell. Meta-learning with warped gradient descent. ICLR, Apr. 2020.

[FC19] J. Frankle and M. Carbin. The lottery ticket hypothesis: finding sparse, trainable neural networks. ICLR, Mar. 2019.

[FLP+99] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. FOCS, Oct. 1999.

[Gab46] D. Gabor. Theory of communication. Part 1: the analysis of information. Radio \& Comm. Eng., Nov. 1946.

[GZY+20] T. Gale, M. Zaharia, C. Young, and E. Elsen. Sparse GPU kernels for deep learning. June 2020.

[GCL+19] J. Gauci, E. Conti, Y. Liang, et al. Horizon: Facebook's open source applied reinforcement learning platform. Sep. 2019.

[GMV+20] T. Gebru, J. Morgenstern, B. Vecchione, J. Vaughan, H. Wallach, H. Daume III, and K. Crawford. Datasheets for datasets. Mar. 2020.

[GAG+17] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. Dauphin. Convolutional sequence to sequence learning. ICML, May 2017.

[GRM+18] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. Wichmann, and W. Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. Nov. 2018.

[Gen09] C. Gentry. A fully homomorphic encryption scheme. Sep. 2009.

[GAB+18] E. Georganas, S. Avancha, K. Banerjee, D. Kalamkar, G. Henry, H. Pabst, and A. Heinecke. Anatomy of high-performance deep learning convolutions on SIMD architectures. SC, Aug. 2018.

[GSC99] F. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: continual prediction with LSTM. ICANN, Sep. 1999.

[Gha17] A. Gharakhanian. Generative adversarial networks-hot topic in machine learning. KDnuggets, Jan. 2017.

[GAJ+18] A. Gholami, A. Azad, P. Jin, K. Keutzer, and A. Buluc. Integrated model, batch, and domain parallelism in training neural networks. SPAA, July 2018.

[GLH+19] S. Ghose, T. Li, N. Hajinazar, D. Cali, and O. Mutlu. Understanding the interactions of workloads and DRAM types: a comprehensive experimental study. Oct. 2019.

[GCH+20] B. Ginsburg, P. Castonguay, O. Hrinchuk, et al. Stochastic gradient methods with layer-wise adaptive moments for training of deep networks. Feb. 2020.

[GBB11] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. AISTATS, 2011.

[GB10] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. AISTATS, 2010.

[GPM+14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, et al. Generative adversarial networks. Jun. 2014.

[Goo19] Google. MLIR: a new intermediate representation and compiler framework. Apr. 2019.

[Goo20] Google. Embeddings: translating to a lower-dimensional space. 2020.

[Goo20b] Google. C++ differential privacy library. Feb. 2020.

[Goo20c] Google. TensorFlow XLA index. Feb. 2020.

[Goo20d] Google. AI Platform. 2020.

[Goo20e] Google. TensorFlow Extended (TFX) is an end-to-end platform for deploying production ML pipelines. 2020.

[Goo20f] Google. TensorFlow TCAV. 2020.

[Goo20g] Google. TensorFlow-XLA Operation Semantics. 2020.

[Gvd08] K. Goto and R. van de Geijn. Anatomy of high-performance matrix multiplication. TOMS, May 2008.

[Gra19] Graphcore. Microsoft and Graphcore collaborate to accelerate artificial intelligence. 2019.

[Gra20] Graphcore. Intelligent processing unit. July 2020.

[GSK+17] K. Greff, R. Srivastava, J. Koutn\'{\i}k, B. Steunebrink, and J. Schmidhuber. LSTM: a search space odyssey. TNNLS, Oct. 2017.

[GW00] A. Griewank and A. Walther. Algorithm 799: revolve: an implementation of checkpointing for the reverse or adjoint mode of computational differentiation. TOMS, Mar. 2000.

[GMY+19] H. Guan, A. Malevich, J. Yang, J. Park, and H. Yuen. Post-training 4-bit quantization on embedding tables. NeurIPS, Dec. 2019.

[GWY+19] S. Gui, H. Wang, C. Yu, H. Yang, Z. Wang, and J. Liu. Model compression with adversarial robustness: a unified optimization framework. NeurIPS, Dec. 2019.

[Gui20] GuildAI. The ML Engineering Platform. 2020.

[Gun17] D. Gunning. Explainable Artificial Intelligence (XAI). DARPA, Nov. 2017.

[GPV+20] P. Gupta, N. Puri, S. Verma, D. Kayastha, S. Deshmukh, B. Krishnamurthy, and S. Singh. Explain your move: understanding agent actions using focused feature saliency. ICLR, 2020.

[GTY+17] H. Guo, R. Tang, Y. Ye, Z. Li, and X. He. DeepFM: a factorization-machine based neural network for CTR prediction. Mar. 2017.

[Gus17] J. Gustafson. Posit arithmetic. 2017.

[Hab19] Habana Labs. Goya inference platform white paper. Aug. 2019.

[Hab19b] Habana Labs. System-1. June 2019.

[HKK16] D. Han, J. Kim, and J. Kim. Deep pyramidal residual networks. CVPR, Oct. 2016.

[HPN+17] S. Han, J. Pool, S. Narang, et al. DSD: dense-sparse-dense training for deep neural networks. ICLR, Feb. 2017.

[HRM+19] A. Hard, K. Rao, R. Mathews, et al. Federated learning for mobile keyboard prediction. Feb. 2019.

[HNP+18] A. Harlap, D. Narayanan, A. Phanishayee, V. Seshadri, N. Devanur, G. Ganger, and P. Gibbons. PipeDream: fast and efficient pipeline parallel DNN training. June 2018.

[Har18] F. Hartmann. Federated learning for Firefox. Aug. 2018.

[Has18] M. Hassan. AlexNet-1.png. 2018.

[Haz18] K. Hazelwood. Applied machine learning at Facebook: an infrastructure perspective. Sep. 2018.

[HBB+18] K. Hazelwood, S. Bird, D. Brooks, et al. Applied machine learning at Facebook: a datacenter infrastructure perspective. HPCA, Feb. 2018.

[Haz20] K. Hazelwood. Deep learning: it's not all about recognizing cats and dogs. SAIS, June 2020.

[HBG+08] H. He, Y. Bai, E. A. Garcia, and S. Li. ADASYN: adaptive synthetic sampling approach for imbalanced learning. IJCNN, June 2008.

[HZR+15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CVPR, Dec. 2015.

[HZR+15b] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. ICCV, Feb. 2015.

[HZR+15c] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. Apr. 2015.

[HGD+17] K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask R-CNN. ICCV, Mar. 2017.

[HLZ+17] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T. Chua. Neural collaborative filtering. ICIWWW, Apr. 2017.

[HSP+19] Y. He, T. Sainath, R. Prabhavalkar, et al. Streaming end-to-end speech recognition for mobile devices. ICASSP, Apr. 2019.

[HLL+19] Y. He, J. Lin, Z. Liu, H. Wang, L. Li, and S. Han. AMC: AutoML for model compression and acceleration on mobile devices. ECCV, Jan. 2019.

[HAP+19] K. Hegde, H. Asghari-Moghaddam, M. Pellauer, et al. ExTensor: an accelerator for sparse tensor algebra. MICRO, Oct. 2019.

[HIB+19] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger. Deep reinforcement learning that matters. Jan. 2019.

[HG16] D. Hendrycks and K. Gimpel. Gaussian error linear units (GELUs). June 2016.

[HDB17] J. Hermann and M. Del Balso. Meet Michelangelo: Uber's machine learning platform. Sep. 2017.

[HMv+17] M. Hessel, J. Modayil, H. van Hasselt, et al. Rainbow: combining improvements in deep reinforcement learning. AAAI, Oct. 2017.

[HR15] T. Highlander and A. Rodriguez. Very efficient training of convolutional neural networks using fast Fourier transform and overlap-and-add. BMVC, Sep. 2015.

[HSS12] G. Hinton, N. Srivastava, and K. Swersky. RMSProp: divide the gradient by a running average of its recent magnitude. Coursera, 2012.

[HVD15] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. Mar. 2015.

[HAP17] B. Hitaj, G. Ateniese, and F. Perez-Cruz. Deep models under the GAN: information leakage from collaborative deep learning. SIGSAC CCS, Sep. 2017.

[HS97] S. Hochreiter and J. Schmidhuber. Flat minima. Neural Comp., Jan. 1997.

[HS97b] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comp., Nov. 1997.

[HHS17] E. Hoffer, I. Hubara, and D. Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. NeurIPS, Dec. 2017.

[HM19] A. Holler and M. Mui. Evolving Michelangelo model representation for flexibility at scale. Oct. 2019.

[HEK+19] S. Hooker, D. Erhan, P. Kindermans, and B. Kim. A benchmark for interpretability methods in deep neural networks. NeurIPS, Dec. 2019.

[HSW89] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. NNs, Mar. 1989.

[Hor14] M. Horowitz. 1.1 Computing's energy problem (and what we can do about it). ISSCC, Feb. 2014.

[Hou19] J. Hou. New research on quantization could revolutionize power-efficient AI. July 2019.

[HZC+17] A. G. Howard, M. Zhu, B. Chen, et al. MobileNets: efficient convolutional neural networks for mobile vision applications. Apr. 2017.

[HSA+19] J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu. Squeeze-and-excitation networks. CVPR, May 2019.

[HLG+19] W. Hu, B. Liu, J. Gomes, M. Zitnik, P. Liang, V. Pande, and J. Leskovec. Strategies for pre-training graph neural networks. ICLR, Sep. 2019.

[HZS+19] W. Hua, Y. Zhou, C. De Sa, Z. Zhang, and G. Suh. Channel gating neural networks. NeurIPS, Dec. 2019.

[HLv+16] G. Huang, Z. Liu, L. van der Maaten, and K. Weinberger. Densely connected convolutional networks. CVPR, Aug. 2016.

[HLP+17] X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, and S. Belongie. Stacked generative adversarial networks. CVPR, June 2017.

[HCB+19] Y. Huang, Y. Cheng, A. Bapna, et al. GPipe: efficient training of giant neural networks using pipeline parallelism. NeurIPS, Dec. 2019.

[HDS+19] D. Huang, P. Dhariwal, D. Song, and I. Sutskever. GamePad: a learning environment for theorem proving. ICLR, 2019.

[Hua19] Huawei. Ascend 910 AI processor. 2019.

[Hug15] C. Hughes. Single-instruction multiple-data execution. M\&C, May 2015.

[HS14] K. Hwang and W. Sung. Fixed-point feedforward deep neural network design using weights +1, 0, and -1. SiPS, Oct. 2014.

[IHM+16] F. Iandola, S. Han, M. Moskewicz, K. Ashraf, W. Dally, and K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. CVPR, Feb. 2016.

[Ibm20] IBM. IBM reveals next-generation IBM POWER10 processor. Aug. 2020.

[Int18] Intel. Knowledge Distillation. 2018.

[Int19] Intel. Next-generation Intel Xeon Scalable processors to deliver breakthrough platform performance with up to 56 processor cores. Aug. 2019.

[Int19b] Intel. Aurora supercomputer. Nov. 2019.

[Int20] Intel. Innovation through intelligence. Jan. 2020.

[Int20b] Intel. Intel architecture instruction set extensions and future features programming reference. June 2020.

[Int20c] Intel. Analytics Zoo. 2020.

[IS15] S. Ioffe and C. Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. Feb. 2015.

[Iof17] S. Ioffe. Batch renormalization: towards reducing minibatch dependence in batch-normalized models. NeurIPS, Dec. 2017.

[IZZ+16] P. Isola, J. Zhu, T. Zhou, and A. Efros. Image-to-image translation with conditional adversarial networks. CVPR, Nov. 2016.

[Iva71] A. Ivakhnenko. Polynomial theory of complex systems. SMC, Oct. 1971.

[IPG+19] P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov, and A. Wilson. Averaging weights leads to wider optima and better generalization. UAI, Feb. 2019.

[Jad19] A. Jadhav. Applications of graph neural networks. Feb. 2019.

[JJN+19] P. Jain, A. Jain, A. Nrusimha, A. Gholami, P. Abbeel, K. Keutzer, I. Stoica, and J. Gonzalez. Checkmate: breaking the memory wall with optimal tensor rematerialization. Oct. 2019.

[JFZ+19] M. Janner, J. Fu, M. Zhang, and S. Levine. When to trust your model: model-based policy optimization. NeurIPS, Dec. 2019.

[JYS19] D. Jauk, D. Yang, and M. Schulz. Predicting faults in high performance computing systems: an in-depth survey of the state-of-the-practice. SC, Nov. 2019.

[Jax20] JAX. Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more. Feb. 2020.

[JJM+18] Y. Jia, M. Johnson, W. Macherey, et al. Leveraging weakly supervised data to improve end-to-end speech-to-text translation. ICASSP, Nov. 2018.

[JWB+19] Y. Jia, R. Weiss, F. Biadsy, W. Macherey, M. Johnson, Z. Chen, and Y. Wu. Direct speech-to-speech translation with a sequence-to-sequence model. Apr. 2019.

[JZW+18] Y. Jia, Y. Zhang, R. Weiss, et al. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. NeurIPS, Dec. 2018.

[JZA18] Z. Jia, M. Zaharia, and A. Aiken. Beyond data and model parallelism for deep neural networks. July 2018.

[JHJ+20] Y. Jiao, L. Han, R. Jin, et al. 12nm programmable convolution-efficient neural-processing-unit chip achieving 825 TOPS. ISSCC, Feb. 2020.

[JGK18] P. Jin, B. Ginsburg, and K. Keutzer. Spatially parallel convolution. ICLR, 2018.

[Joh18] J. Johnson. Rethinking floating point for deep learning. NeurIPS, Dec. 2018.

[JS18] M. Johnson and B. Stevens. Pruning hypothesis comes of age. Nature, Feb. 2018.

[JYv19] J. Jordon, J. Yoon, and M. van der Schaar. PATE-GAN: generating synthetic data with differential privacy guarantees. ICLR, Feb. 2019.

[JYP+17] N. Jouppi, C. Young, N. Patil, D. Patterson, et al. In-datacenter performance analysis of a tensor processing unit. ISCA, June 2017.

[JYK+20] N. Jouppi, D. Yoon, G. Kurian, S. Li, N. Patil, J. Laudon, C. Young, and D. Patterson. A domain-specific supercomputer for training deep neural networks. CACM, July 2020.

[JZS15] R. Jozefowicz, W. Zaremba, and I. Sutskever. An empirical exploration of recurrent network architectures. ICML, July 2015.

[KZK+19] D. Kaji, J. Zech, J. Kim, S. Cho, N. Dangayach, A. Costa, and E. Oermann. An attention based deep learning model of clinical events in the intensive care unit. Feb. 2019.

[KES+18] N. Kalchbrenner, E. Elsen, K. Simonyan, et al. Efficient neural audio synthesis. June 2018.

[KMM+19] D. Kalamkar, D. Mudigere, N. Mellempudi, et al. A study of bfloat16 for deep learning training. June 2019.

[KMH+20] J. Kaplan, S. McCandlish, T. Henighan, T. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. Jan. 2020.

[KBA18] S. Karandikar, D. Biancolin, and A. Amid. FireSim. 2018.

[KCH+19] S. Karita, N. Chen, T. Hayashi, et al. A comparative study on transformer vs RNN in speech applications. Sep. 2019.

[Kar19] A. Karpathy. A recipe for training neural networks. Apr. 2019.

[KLA19] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. CVPR, Mar. 2019.

[KR19] S. Katariya and A. Ramani. eBay's transformation to a modern AI platform. Dec. 2019.

[KMN+17] N. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. Tang. On large-batch training for deep learning: generalization gap and sharp minima. ICLR, Apr. 2017.

[KS17] N. Keskar and R. Socher. Improving generalization performance by switching from Adam to SGD. Dec. 2017.

[KDT+05] J. Kim, W. Dally, B. Towles, and A. Gupta. Microarchitecture of a high-radix router. ISCA, June 2005.

[KDS+08] J. Kim, W. Dally, S. Scott, and D. Abts. Technology-driven, highly-scalable dragonfly topology. ISCA, June 2008.

[KWG+18] B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, and R. Sayres. Interpretability beyond feature attribution: quantitative testing with concept activation vectors (TCAV). ICML, June 2018.

[KKS+19] C. Kim, S. Kang, D. Shin, S. Choi, Y. Kim, and H. Yoo. A 2.1TFLOPS/W mobile deep RL accelerator with transposable PE array and experience compression. ISSCC, Feb. 2019.

[KB17] D. Kingma and J. Ba. Adam: a method for stochastic optimization. ICLR, Jan. 2017.

[KKC+17] F. Kjolstad, S. Kamil, S. Chou, D. Lugato, and S. Amarasinghe. The tensor algebra compiler. OOPSLA, Oct. 2017.

[KUM+17] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter. Self-normalizing neural networks. NeurIPS, Dec. 2017.

[Kod19] R. Koduri. Intel unveils new GPU architecture with high-performance computing and AI acceleration, and oneAPI software stack with unified and scalable abstraction for heterogeneous architectures. Intel HPC Dev. Conf., Nov. 2019.

[KSA+15] R. Komuravelli, M. Sinclair, J. Alsop, et al. Stash: have your scratchpad and cache it too. ISCA, Oct. 2015.

[KMY+17] J. Konecny, H. McMahan, F. Yu, P. Richtarik, A. Suresh, and D. Bacon. Federated learning: strategies for improving communication efficiency. Oct. 2017.

[KCV+20] A. Kosson, V. Chiley, A. Venigalla, J. Hestness, and U. Koster. Pipelined backpropagation at scale: training large models without batches. Mar. 2020.

[KWW+17] U. Koster, T. Webb, X. Wang, et al. Flexpoint: an adaptive numerical format for efficient training of deep neural networks. NeurIPS, Dec. 2017.

[KL19] W. Kouw and M. Loog. An introduction to domain adaptation and transfer learning. Jan. 2019.

[KBC+18] T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis. The case for learned index structures. Apr. 2018.

[KSH12] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. NeurIPS, Dec. 2012.

[Kri14] A. Krizhevsky. One weird trick for parallelizing convolutional neural networks. Apr. 2014.

[KGG+18] O. Kuchaiev, B. Ginsburg, I. Gitman, et al. Mixed-precision training for NLP and speech recognition with OpenSeq2Seq. Nov. 2018.

[LMM+19] I. Laguna, R. Marshall, K. Mohror, M. Ruefenacht, A. Skjellum, and N. Sultana. A large-scale study of MPI usage in open-source HPC applications. SC, Nov. 2019.

[LCG+19] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. ALBERT: a lite BERT for self-supervised learning of language representations. Sep. 2019.

[LS19] R. Larsen and T. Shpeisman. TensorFlow graph optimizations. 2019.

[LA04] C. Lattner and V. Adve. LLVM: a compilation framework for lifelong program analysis \& transformation. CGO, Mar. 2004.

[LP19] C. Lattner and J. Pienaar. MLIR primer: a compiler infrastructure for the end of Moore's Law. CGO, Feb. 2019.

[LG16] A. Lavin and S. Gray. Fast algorithms for convolutional neural networks. CVPR, Sep. 2015.

[Lec16] Y. LeCun. RI Seminar: Yann LeCun: the next frontier in AI: unsupervised learning. Nov. 2016.

[LBB+98] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. IEEE, Nov. 1998.

[LDS89] Y. LeCun, J. Denker, and S. Solla. Optimal brain damage. NeurIPS, 1989.

[LAG+19] M. Lecuyer, V. Atlidakis, R. Geambasu, D. Hsu, and S. Jana. Certified robustness to adversarial examples with differential privacy. S\&P, May 2019.

[LTH+16] C. Ledig, L. Theis, F. Huszar, et al. Photo-realistic single image super-resolution using a generative adversarial network. CVPR, Sep. 2016.

[LMC+17] E. Lee, D. Miyashita, E. Chai, B. Murmann, and S. Wong. LogNet: energy-efficient neural networks using logarithmic computation. ICASSP, Mar. 2017.

[LLH+19] J. Lee, J. Lee, D. Han, J. Lee, G. Park, and H. Yoo. 7.7 LNPU: a 25.3TFLOPS/W sparse deep-neural-network learning processor with fine-grained mixed precision of FP8-FP16. ISSCC, Feb. 2019.

[LMR+19] K. Lee, S. Maji, A. Ravichandran, and S. Soatto. Meta-learning with differentiable convex optimization. CVPR, Apr. 2019.

[LLX+20] D. Lepikhin, H. Lee, Y. Xu, et al. GShard: scaling giant models with conditional computation and automatic sharding. June 2020.

[LAS+07] J. Leverich, H. Arakida, A. Solomatnikov, A. Firoozshahian, M. Horowitz, and C. Kozyrakis. Comparing memory systems for chip multiprocessors. ISCA, June 2007.

[LM18] Y. Leviathan and Y. Matias. Google Duplex: an AI system for accomplishing real-world tasks over the phone. May 2018.

[LSZ+19] T. Li, A. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith. Federated optimization in heterogeneous networks. Sep. 2019.

[LCH+19] X. Li, S. Chen, X. Hu, and J. Yang. Understanding the disharmony between dropout and batch normalization by variance shift. CVPR, Jan. 2019.

[LKH+18] D. Liang, R. Krishnan, M. Hoffman, and T. Jebara. Variational autoencoders for collaborative filtering. IW3C2, Feb. 2018.

[LHP+19] T. Lillicrap, J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. July 2019.

[LGH19] J. Lin, C. Gan, and S. Han. Defensive quantization: when efficiency meets robustness. ICLR, Apr. 2019.

[LGH+16] T. Lin, P. Doll\'ar, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. CVPR, Dec. 2016.

[LGG+17] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar. Focal loss for dense object detection. ICCV, Aug. 2017.

[LSP+19] T. Lin, S. Stich, K. Patel, and M. Jaggi. Don't use large mini-batches, use local SGD. June 2019.

[LHM+18] Y. Lin, S. Han, H. Mao, Y. Wang, and W. Dally. Deep gradient compression: reducing the communication bandwidth for distributed training. ICLR, Feb. 2018.

[LHL+18] P. Lindstrom, J. Hittinger, M. Larsen, S. Lloyd, and M. Salasoo. Alternatives to IEEE: NextGen number formats for scientific computing. IPAM, Oct. 2018.

[LRS+18] G. Liu, F. Reda, K. Shih, T. Wang, A. Tao, and B. Catanzaro. Image inpainting for irregular holes using partial convolutions. ECCV, Apr. 2018.

[LDR+18] L. Liu, S. Dean, E. Rolf, M. Simchowitz, and M. Hardt. Delayed impact of fair machine learning. ICML, Apr. 2018.

[LPH+18] X. Liu, J. Pool, S. Han, and W. Dally. Efficient sparse Winograd convolutional neural networks. ICLR, Feb. 2018.

[LSY19] H. Liu, K. Simonyan, and Y. Yang. DARTS: differentiable architecture search. ICLR, Apr. 2019.

[LJH+19] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han. On the variance of the adaptive learning rate and beyond. Aug. 2019.

[LZL+19] L. Liu, J. Zhu, Z. Li, Y. Lu, Y. Deng, J. Han, S. Yin, and S. Wei. A survey of coarse-grained reconfigurable architecture and design: taxonomy, challenges, and applications. CSUR, Oct. 2019.

[LAE+15] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. Berg. SSD: single shot multibox detector. ECCV, Dec. 2015.

[LOG+19] Y. Liu, M. Ott, N. Goyal, et al. RoBERTa: a robustly optimized BERT pretraining approach. July 2019.

[LSZ+19b] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell. Rethinking the value of network pruning. ICLR, Mar. 2019.

[Llv20] LLVM. MLIR: the case for a simplified polyhedral form. 2020.

[LSD14] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. CVPR, Nov. 2014.

[Lor19] B. Lorica. One simple graphic: researchers love PyTorch and TensorFlow. July 2019.

[LH17] I. Loshchilov and F. Hutter. SGDR: stochastic gradient descent with warm restarts. ICLR, May 2017.

[LH19] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. ICLR, Jan. 2019.

[Lov19] S. Lovely. How many titles are available on Netflix in your country? May 2019.

[LPM15] M. Luong, H. Pham, and C. Manning. Effective approaches to attention-based neural machine translation. Aug. 2015.

[LCZ+19] S. Lym, E. Choukse, S. Zangeneh, W. Wen, S. Sanghavi, and M. Erez. PruneTrain: fast neural network training by dynamic sparse model reconfiguration. SC, Nov. 2019.

[MYM+19] L. Ma, Z. Yang, Y. Miao, J. Xue, M. Wu, L. Zhou, and Y. Dai. NeuGraph: parallel deep neural network computation on large graphs. ATC, July 2019.

[MMS+19] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. ICLR, Sep. 2019.

[MHP+17] H. Mao, S. Han, J. Pool, W. Li, X. Liu, Y. Wang, and W. Dally. Exploring the regularity of sparse structure in convolutional neural networks. NeurIPS, Dec. 2017.

[ML18] D. Masters and C. Luschi. Revisiting small batch training for deep neural networks. Apr. 2018.

[MKA+18] S. McCandlish, J. Kaplan, D. Amodei, et al. An empirical model of large-batch training. Dec. 2018.

[MMR+17] H. McMahan, E. Moore, D. Ramage, S. Hampson, and B. Arcas. Communication-efficient learning of deep networks from decentralized data. Feb. 2017.

[MSD+19] N. Mellempudi, S. Srinivasan, D. Das, and B. Kaul. Mixed precision training with 8-bit floating point. May 2019.

[MC17] D. Meng and H. Chen. MagNet: a two-pronged defense against adversarial examples. CCS, Sep. 2017.

[Mer19] S. Merity. Single headed attention RNN: stop thinking with your head. Nov. 2019.

[Met19] Metaflow. A framework for real-life data science. 2019.

[Met19b] Metaflow. Metaflow on AWS. 2019.

[MLN19] P. Michel, O. Levy, and G. Neubig. Are sixteen heads really better than one? NeurIPS, Dec. 2019.

[Mic20] Microsoft. ML.NET Documentation. 2020.

[Mic20b] Microsoft. Azure Cognitive services. 2020.

[Mig17] S. Migacz. 8-bit inference with TensorRT. GTC, May 2017.

[MSU+19] H. Mikami, H. Suganuma, P. U-chupala, Y. Tanaka, and Y. Kageyama. Massively distributed SGD: ImageNet/ResNet-50 training in a flash. Mar. 2019.

[MSC+13] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. NeurIPS, Dec. 2013.

[MNA16] F. Milletari, N. Navab, and S. Ahmadi. V-Net: fully convolutional neural networks for volumetric medical image segmentation. 3DV, June 2016.

[MGP+18] A. Mirhoseini, A. Goldie, H. Pham, B. Steiner, Q. Le, and J. Dean. A hierarchical model for device placement. ICLR, 2018.

[MFL+19] S. Mirzadeh, M. Farajtabar, A. Li, N. Levine, A. Matsukawa, and H. Ghasemzadeh. Improved knowledge distillation via teacher assistant. AAAI, Dec. 2019.

[MWZ+19] M. Mitchell, S. Wu, A. Zaldivar, et al. Model cards for model reporting. Jan. 2019.

[MZH+16] I. Mitliagkas, C. Zhang, S. Hadjis, and C. Re. Asynchrony begets momentum, with an application to deep learning. Comm., Control, and Comp., Nov. 2016.

[Mlf20] MLflow. An open source platform for the machine learning lifecycle. 2020.

[Mlp18] MLPerf. MLPerf. 2018.

[MBM+16] V. Mnih, A. Badia, M. Mirza, et al. Asynchronous methods for deep reinforcement learning. ICML, June 2016.

[KSe+13] V. Mnih, K. Kavukcuoglu, D. Silver, et al. Playing Atari with deep reinforcement learning. Dec. 2013.

[MKS+15] V. Mnih, K. Kavukcuoglu, D. Silver, et al. Human-level control through deep reinforcement learning. Nature, Feb. 2015.

[Moo65] G. Moore. Cramming more components onto integrated circuits. Electronics, Apr. 1965.

[Moo75] G. Moore. Progress in digital integrated electronics. Technical Digest, Sep. 1975.

[MPG+20] R. Mor, E. Peterfreund, M. Gavish, and A. Globerson. Optimal strategies against generative attacks. ICLR, Feb. 2020.

[MYP+19] A. Morcos, H. Yu, M. Paganini, and Y. Tian. One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers. NeurIPS, Dec. 2019.

[Mor19] T. Morgan. Nvidia shows off tech chops with RC18 inference chip. Next Platform, Sep. 2019.

[MNW+18] P. Moritz, R. Nishihara, S. Wang, et al. Ray: a distributed framework for emerging AI applications. OSDI, Sep. 2018.

[Mos17] R. Mosic. Deep reinforcement learning based trading application at JP Morgan Chase. July 2017.

[MY17] T. Munkhdalai and H. Yu. Meta networks. ICML, June 2017.

[NvB+19] M. Nagel, M. van Baalen, T. Blankevoort, and M. Welling. Data-free quantization through weight equalization and bias correction. ICCV, Nov. 2019.

[NIG+18] D. Nagy, G. Indalecio, A. Garcia-Loureiro, M. Elmessary, K. Kalna, and N. Seoane. FinFET versus gate-all-around nanowire FET: performance, scaling, and variability. EDS, Feb. 2018.

[Nak19] P. Nakkiran. Adversarial robustness may be at odds with simplicity. Jan. 2019.

[NKB+20] P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak, and I. Sutskever. Deep double descent: where bigger models and more data hurt. ICLR, Apr. 2020.

[Nar19] A. Narayanan. How to recognize AI snake oil. 2019.

[NSA+19] A. Nassif, I. Shahin, I. Attili, M. Azzeh, and K. Shaalan. Speech recognition using deep neural networks: a systematic review. Access, 2019.

[NMS+19] M. Naumov, D. Mudigere, H. Shi, et al. Deep learning recommendation model for personalization and recommendation systems. May 2019.

[NKM+20] M. Naumov, J. Kim, D. Mudigere, et al. Deep learning training in Facebook data centers: design of scale-up and scale-out systems. Mar. 2020.

[Nay19] P. Nayak. Understanding searches better than ever before. Oct. 2019.

[NMZ19] E. Neftci, H. Mostafa, and F. Zenke. Surrogate gradient learning in spiking neural networks: bringing the power of gradient-based optimization to spiking neural networks. SPM, Nov. 2019.

[Nea95] R. Neal. Bayesian learning for neural networks. Ph.D. Thesis, University of Toronto, 1995.

[Nim20] Nimbix. Groq tensor streaming processors. 2020.

[NDC+17] J. Novikova, O. Dusek, A. Curry, and V. Rieser. Why we need new evaluation metrics for NLG. July 2017.

[NKJ+19] E. Nurvitadhi, D. Kwon, A. Jafari, et al. Why compete when you can work together: FPGA-ASIC integration for persistent RNNs. FCCM, May 2019.

[Nvi15] Nvidia. PTX and SASS assembly debugging. 2015.

[Nvi20] Nvidia. RAPIDS. 2020.

[Nvi20b] Nvidia. T4. 2020.

[Nvi20c] Nvidia. Data center deep learning product performance. July 2020.

[OSJ+18] C. Olah, A. Satyanarayan, I. Johnson, S. Carter, L. Schubert, K. Ye, and A. Mordvintsev. The building blocks of interpretability. 2018.

[OPM02] T. Ojala, M. Pietik\"ainen, and T. M\"aenp\"a\"a. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. PAMI, July 2002.

[Ope18] OpenAI. Kinds of RL algorithms. 2018.

[Orr99] G. Orr. Momentum and learning rate adaptation. Willamette University, 1999.

[Pad19] S. Padmanabhan. Building a product catalog: eBay's university machine learning competition. Oct. 2019.

[PdN18] M. Paganini, L. de Oliveira, and B. Nachman. Accelerating science with generative adversarial networks: an application to 3D particle showers in multi-layer calorimeters. PRL, Jan. 2018.

[PY10] S. Pan and Q. Yang. A survey on transfer learning. TKDE, Oct. 2010.

[PMW+16] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami. Distillation as a defense to adversarial perturbations against deep neural networks. S\&P, Mar. 2016.

[PCZ+19] D. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. Cubuk, and Q. Le. SpecAugment: a simple data augmentation method for automatic speech recognition. Apr. 2019.

[PNB+18] J. Park, M. Naumov, P. Basu, S. Deng, et al. Deep learning inference in Facebook data centers: characterization, performance optimizations and hardware implications. Nov. 2018.

[PRH+17] A. Pedram, S. Richardson, M. Horowitz, S. Galal, and S. Kvatinsky. Dark memory and accelerator-rich system optimization in the dark silicon era. D\&T, May 2016.

[PSC+19] M. Pellauer, Y. Shao, J. Clemons, et al. Buffets: an efficient and composable storage idiom for explicit decoupled data orchestration. ASPLOS, Apr. 2019.

[PSM14] J. Pennington, R. Socher, and C. Manning. GloVe: global vectors for word representation. EMNLP, 2014.

[PGZ+18] H. Pham, M. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient neural architecture search via parameter sharing. Feb. 2018.

[Phi18] M. Phi. Illustrated guide to LSTM's and GRU's: a step by step explanation. TDS. Sep. 2018.

[PPG+17] W. Ping, K. Peng, A. Gibiansky, S. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller. Deep Voice 3: scaling text-to-speech with convolutional sequence learning. Oct. 2017.

[PPC18] W. Ping, K. Peng, and J. Chen. ClariNet: parallel wave generation in end-to-end text-to-speech. July 2018.

[Pol99] F. Pollack. New microarchitecture challenges in the coming generations of CMOS process technologies. MICRO, Nov. 1999.

[PZK+17] R. Prabhakar, Y. Zhang, D. Koeplinger, et al. Plasticine: a reconfigurable architecture for parallel patterns. SIGARCH, June 2017.

[PHX+18] V. Pratap, A. Hannun, Q. Xu, et al. wav2letter++: the fastest open-source speech recognition system. Dec. 2018.

[Qia99] N. Qian. On the momentum term in gradient descent learning algorithms. Jan. 1999.

[RMC15] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. ICLR, Nov. 2015.

[RWC+19] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. 2019.

[RBA+13] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. PLDI, June 2013.

[RSR+19] C. Raffel, N. Shazeer, A. Roberts, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. Oct. 2019.

[RZQ+19] K. Rakelly, A. Zhou, D. Quillen, C. Finn, and S. Levine. Efficient off-policy meta-reinforcement learning via probabilistic context variables. Mar. 2019.

[ROR+16] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. ECCV, Sep. 2016.

[RD19] S. Raza and C. Ding. Progress in context-aware recommender systems: an overview. Jan. 2019.

[RAH+19] E. Real, A. Aggarwal, Y. Huang, and Q. Le. Regularized evolution for image classifier architecture search. AAAI, Feb. 2019.

[RKK19] S. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. ICLR, Apr. 2019.

[RDG+16] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: unified, real-time object detection. CVPR, 2016.

[RF18] J. Redmon and A. Farhadi. YOLOv3: an incremental improvement. Apr. 2018.

[RHG+15] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. NeurIPS, Dec. 2015.

[RAA+19] C. Renggli, S. Ashkboos, M. Aghagolzadeh, D. Alistarh, and T. Hoefler. SparCML: high-performance sparse communication for machine learning. SC, Aug. 2019.

[RKL+18] A. Rodriguez, T. Kacprzak, A. Lucchi, et al. Fast cosmic web simulations with generative adversarial networks. CompAC, Nov. 2018.

[RKB+09] B. Rogers, A. Krishna, G. Bell, K. Vu, X. Jiang, and Y. Solihin. Scaling the bandwidth wall: challenges in and avenues for CMP scaling. SIGARCH, Jun. 2009.

[RDK+19] D. Rolnick, P. Donti, L. Kaack, et al. Tackling climate change with machine learning. Nov. 2019.

[RDK+19b] D. Rolnick, P. Donti, L. Kaack, et al. Tackling climate change with machine learning workshop. NeurIPS, Dec. 2019.

[RFB15] O. Ronneberger, P. Fischer, and T. Brox. U-Net: convolutional networks for biomedical image segmentation. May 2015.

[Ros20] C. Rosset. Turing-NLG: a 17-billion-parameter language model by Microsoft. Feb. 2020.

[RXT19] B. Roune and XLA Team. Compiling ML with XLA. Feb. 2019.

[RJP19] K. Roy, A. Jaiswal, and P. Panda. Towards spike-based machine intelligence with neuromorphic computing. Nature, 2019.

[Rud17] S. Ruder. An overview of multi-task learning in deep neural networks. June 2017.

[Rup20] K. Rupp. Microprocessor trend data. 2020.

[RDS+15] O. Russakovsky, J. Deng, H. Su, et al. ImageNet large scale visual recognition challenge. IJCV, 2015.

[RRS+19] A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell. Meta-learning with latent embedding optimization. ICLR, Mar. 2019.

[Sam16] Samsung. Samsung begins mass producing world's fastest DRAM, based on newest high bandwidth memory (HBM) interface. 2016.

[SST09] P. Sanders, J. Speck, and J. Tr\"aff. Two-tree algorithms for full bandwidth broadcast, reduction and scan. Sep. 2009.

[SDC+19] V. Sanh, L. Debut, J. Chaumond, and T. Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. Oct. 2019.

[San19] V. Sanh. Smaller, faster, cheaper, lighter: introducing DistilBERT, a distilled version of BERT. Medium, Aug. 2019.

[Sas19] K. Sasaki. Federated Learning with TensorFlow. 2019.

[SYP17] K. Sato, C. Young, and D. Patterson. An in-depth look at Google's first Tensor Processing Unit (TPU). May 2017.

[SGT+09] F. Scarselli, M. Gori, A. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. TNNLS, Jan. 2009.

[Sch19] J. Schalkwyk. An all-neural on-device speech recognizer. Mar. 2019.

[SAH+20] J. Schrittwieser, I. Antonoglou, T. Hubert, et al. Mastering Atari, Go, Chess and Shogi by planning with a learned model. Feb. 2020.

[SKP15] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: a unified embedding for face recognition and clustering. CVPR, Mar. 2015.

[SLM+17] J. Schulman, S. Levine, P. Moritz, M. Jordan, and P. Abbeel. Trust region policy optimization. Apr. 2017.

[SFD+14] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and application to data-parallel distributed training of speech DNNs. Interspeech, Sep. 2014.

[SDB18] A. Sergeev and M. Del Balso. Horovod: fast and easy distributed deep learning in TensorFlow. Feb. 2018.

[SHB15] R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units. Aug. 2015.

[SKF+16] M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi. Bidirectional attention flow for machine comprehension. Nov. 2016.

[SLA+19] C. Shallue, J. Lee, J. Antognini, J. Sohl-Dickstein, R. Frostig, and G. Dahl. Measuring the effects of data parallelism on neural network training. JMLR, July 2019.

[SWR18] Y. Sharan, H. Wang, and S. Rath. GUI testing powered by deep learning. eBay Tech Blog, June 2018.

[SCP+18] N. Shazeer, Y. Cheng, N. Parmar, et al. Mesh-TensorFlow: deep learning for supercomputers. NeurIPS, Dec. 2018.

[SPW+17] J. Shen, R. Pang, R. Weiss, et al. Natural TTS synthesis by conditioning WaveNet on Mel Spectrogram predictions. ICASSP, Dec. 2017.

[SDY+19] S. Shen, Z. Dong, J. Ye, L. Ma, Z. Yao, A. Gholami, M. Mahoney, and K. Keutzer. Q-BERT: Hessian based ultra low precision quantization of BERT. Sep. 2019.

[She18] R. Sheth. Introducing PyTorch across Google Cloud. Oct. 2018.

[SLA+19b] B. Shickel, T. Loftus, L. Adhikari, T. Ozrazgat-Baslanti, A. Bihorac, and P. Rashidi. DeepSOFA: a continuous acuity score for critically ill patients using clinically interpretable deep learning. Feb. 2019.

[SPP+19] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro. Megatron-LM: training multi-billion parameter language models using model parallelism. Oct. 2019.

[SL19] T. Shpeisman and C. Lattner. MLIR: multi-level intermediate representation for compiler infrastructure. Apr. 2019.

[SHM+16] D. Silver, A. Huang, C. Maddison, et al. Mastering the game of Go with deep neural networks and tree search. Nature, Jan. 2016.

[SSS+17] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, et al. Mastering the game of Go without human knowledge. Nature, Oct. 2017.

[SSS+18] D. Silver, J. Schrittwieser, K. Simonyan, et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, Dec. 2018.

[SZ14] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. Sep. 2014.

[Smi17] L. Smith. Cyclical learning rates for training neural networks. WACV, Apr. 2017.

[SSZ17] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. NeurIPS, Dec. 2017.

[Ste19] I. Steinwart. A sober look at neural network initializations. Sep. 2019.

[Ste19b] N. Stephens. BFloat16 processing for neural networks on Armv8-A. Aug. 2019.

[SA19] A. Stooke and P. Abbeel. Accelerated methods for deep reinforcement learning. Jan. 2019.

[SPE19] A. Straw, A. Procter, and R. Earhart. nGraph: unlocking next-generation performance with deep learning compilers. 2019.

[SGB+19] S. Sukhbaatar, E. Grave, P. Bojanowski, and A. Joulin. Adaptive attention span in transformers. May 2019.

[SCC+19] X. Sun, J. Choi, C. Chen, et al. Hybrid 8-bit floating point (HFP8) training and inference for deep neural networks. NeurIPS, Dec. 2019.

[SWL+19] Y. Sun, S. Wang, Y. Li, S. Feng, H. Tian, H. Wu, and H. Wang. ERNIE 2.0: a continual pre-training framework for language understanding. 2019.

[SAD+20] Y. Sun, N. Agostini, S. Dong, and D. Kaeli. Summarizing CPU and GPU design trends with product data. 2020.

[SVL14] I. Sutskever, O. Vinyals, and Q. Le. Sequence to sequence learning with neural networks. NeurIPS, Dec. 2014.

[SCY+17] V. Sze, Y. Chen, T. Yang, and J. Emer. Efficient processing of deep neural networks: a tutorial and survey. Proc. IEEE, Dec. 2017.

[SCY+20] V. Sze, Y. Chen, T. Yang, and J. Emer. Efficient processing of deep neural networks. M\&C, June 2020.

[SLJ+14] C. Szegedy, W. Liu, Y. Jia, et al. Going deeper with convolutions. CVPR, Sep. 2014.

[SVI+15] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. CVPR, Dec. 2015.

[SZS+14] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. Feb. 2014.

[Syn17] Synced. A brief overview of attention mechanism. Medium, Sep. 2017.

[TPL19] M. Tan, R. Pang, and Q. Le. EfficientDet: scalable and efficient object detection. Nov. 2019.

[TL19] M. Tan and Q. Le. EfficientNet: rethinking model scaling for convolutional neural networks. May 2019.

[TYD+18] Y. Tassa, Y. Doron, A. Muldal, et al. DeepMind control suite. Jan. 2018.

[TKT+16] S. Tavarageri, W. Kim, J. Torrellas, and P. Sadayappan. Compiler support for software cache coherence. HiPC, Dec. 2016.

[Ter19] Terry. Inlining decisions in Visual Studio. July 2019.

[TRG05] R. Thakur, R. Rabenseifner, and W. Gropp. Optimization of collective communication operations in MPICH. IJHPCA, Feb. 2005.

[TGL+20] N. Thompson, K. Greenewald, K. Lee, and G. Manso. The computational limits of deep learning. July 2020.

[TKP+18] F. Tramer, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel. Ensemble adversarial training: attacks and defenses. ICLR, July 2018.

[TAN+18] H. Tsai, S. Ambrogio, P. Narayanan, R. Shelby, and G. Burr. Recent progress in analog memory-based accelerators for deep learning. J. Phys. D: Appl. Phys, June 2018.

[Tsa18] S. Tsang. Review: YOLOv1 - you only look once (object detection). TDS, Oct. 2018.

[TSE+19] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry. Robustness may be at odds with accuracy. ICLR, Sep. 2019.

[Tvm19] TVM. TVM deep learning compiler joins Apache Software Foundation. Mar. 2019.

[Tvm19b] TVM. Introduction to Relay IR. 2019.

[vKK+16] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. Jan. 2016.

[vDZ+16] A. van den Oord, S. Dieleman, H. Zen, et al. WaveNet: a generative model for raw audio. Sep. 2016.

[vLB+17] A. van den Oord, Y. Li, I. Babuschkin, et al. Parallel WaveNet: fast high-fidelity speech synthesis. Nov. 2017.

[VS19] J. Valin and J. Skoglund. LPCNet: improving neural speech synthesis through linear prediction. ICASSP, May 2019.

[VZT+18] N. Vasilache, O. Zinenko, T. Theodoridis, et al. Tensor Comprehensions: framework-agnostic high-performance machine learning abstractions. Feb. 2018.

[VSP+17] A. Vaswani, N. Shazeer, N. Parmar, et al. Attention is all you need. NeurIPS, Dec. 2017.

[VSZ+19] R. Venkatesan, Y. Shao, B. Zimmer, et al. A 0.11 pJ/op, 0.32-128 TOPS, scalable multi-chip-module-based deep neural network accelerator designed with a high-productivity VLSI methodology. HCS, Aug. 2019.

[Vil18] M. Villmow. Optimizing NMT with TensorRT. Mar. 2018.

[VTB+14] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: a neural image caption generator. CVPR, Nov. 2014.

[VBL+17] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching networks for one shot learning. NeurIPS, Dec. 2016.

[VBC+19] O. Vinyals, I. Babuschkin, J. Chung, et al. AlphaStar: mastering the real-time strategy game StarCraft II. Dec. 2019.

[VAK19] A. Vladimirov, R. Asai, and V. Karpusenko. Parallel programming and optimization with Intel Xeon Phi coprocessors. Jan. 2019.

[Wal13] C. Walsh. Peter Huttenlocher (1931-2013). Nature, Oct. 2013.

[SMH+18] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman. GLUE: a multi-task benchmark and analysis platform for natural language understanding. Apr. 2018.

[WYL+20] H. Wang, J. Yang, H. Lee, and S. Han. Learning to design circuits. Jan. 2020.

[WCB+18] N. Wang, J. Choi, D. Brand, C. Chen, and K. Gopalakrishnan. Training deep neural networks with 8-bit floating point numbers. NeurIPS, Dec. 2018.

[WVP+19] G. Wang, S. Venkataraman, A. Phanishayee, J. Thelin, N. Devanur, and I. Stoica. Blink: fast and generic collectives for distributed ML. Oct. 2019.

[WYZ+17] J. Wang, L. Yu, W. Zhang, Y. Gong, Y. Xu, B. Wang, P. Zhang, and D. Zhang. IRGAN: a minimax game for unifying generative and discriminative information retrieval models. SIGIR, May 2017.

[WML+19] Y. Wang, A. Mohamed, D. Le, et al. Transformer-based acoustic modeling for hybrid speech recognition. Oct. 2019.

[WYK+19] Y. Wang, Q. Yao, J. Kwok, and L. Ni. Generalizing from a few examples: a survey on few-shot learning. Comp. Surveys, May 2019.

[WWB19] Y. Wang, G. Wei, and D. Brooks. Benchmarking TPU, GPU, and CPU platforms for deep learning. Oct. 2019.

[WSS+17] Y. Wang, R. Skerry-Ryan, D. Stanton, et al. Tacotron: towards end-to-end speech synthesis. Mar. 2017.

[WWS+19] Y. Wang, Q. Wang, S. Shi, X. He, Z. Tang, K. Zhao, and X. Chu. Benchmarking the performance and power of AI accelerators for AI training. Nov. 2019.

[WSA18] R. Wei, L. Schwartz, and V. Adve. DLVM: a modern compiler infrastructure for deep learning systems. ICLR, Apr. 2018.

[WWW+16] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. NeurIPS, Dec. 2016.

[WXY+17] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. TernGrad: ternary gradients to reduce communication in distributed deep learning. NeurIPS, Dec. 2017.

[Wen17] L. Weng. From GAN to WGAN. Aug. 2017.

[Wik11] Wikimedia. Kernel Machine.svg. 2011.

[Wik12] Wikimedia. Cart-pendulum.svg. 2012.

[Wik15] Wikimedia. Typical cnn.png. 2015.

[Wik17] Wikimedia. MnistExamples.png. 2017.

[Wik18] Wikimedia. Spectrogram-19thC.png. 2018.

[Wik19] Wikipedia. Apple A13. 2019.

[Wik20] Wikipedia. Authors Guild, Inc. v. Google, Inc. Feb. 2020.

[Wik20b] Wikipedia. RankBrain. Feb. 2020.

[WWP09] S. Williams, A. Waterman, and D. Patterson. Roofline: an insightful visual performance model for multicore architectures. CACM, Apr. 2009.

[WRS+18] A. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht. The marginal value of adaptive gradient methods in machine learning. NeurIPS, Dec. 2018.

[WZL+19] R. Wilson, C. Zhang, W. Lam, D. Desfontaines, D. Simmons-Marengo, and B. Gipson. Differentially private SQL with bounded user contribution. Nov. 2019.

[Win20] P. Winder. \MYhref{https://rl-book.com}{Reinforcement Learning: industrial applications of intelligent agents}. O'Reilly, Nov. 2020.

[Wri19] L. Wright. New deep learning optimizer, Ranger: synergistic combination of RAdam + LookAhead for the best of both. Aug. 2019.

[WZX+16] J. Wu, C. Zhang, T. Xue, W. Freeman, and J. Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. NeurIPS, Dec. 2016.

[WSC+16] Y. Wu, M. Schuster, Z. Chen, et al. Google's neural machine translation system: bridging the gap between human and machine translation. Sep. 2016.

[WAB+17] C. Wu, A. Ahmed, A. Beutel, A. Smola, and H. Jing. Recurrent recommender networks. WSDM, Feb. 2017.

[WWF+17] S. Wu, J. Wieland, O. Farivar, and J. Schiller. Automatic alt-text: computer-generated image descriptions for blind users on a social network service. CSCW, Feb. 2017.

[WH18] Y. Wu and K. He. Group normalization. ECCV, Mar. 2018.

[WZZ+19] B. Wu, X. Zhou, S. Zhao, X. Yue, and K. Keutzer. SqueezeSegV2: improved model structure and unsupervised domain adaptation for road-object segmentation from a LiDAR point cloud. ICRA, May 2019.

[WFB+19] F. Wu, A. Fan, A. Baevski, Y. Dauphin, and M. Auli. Pay less attention with lightweight and dynamic convolutions. Jan. 2019.

[WKM+19] Y. Wu, A. Kirillov, F. Massa, W. Lo, and R. Girshick. Detectron2: a PyTorch-based modular object detection library. 2019.

[Wu19] H. Wu. Low precision inference on GPU. GTC, Mar. 2019.

[WDZ+19] B. Wu, X. Dai, P. Zhang, et al. FBNet: hardware-aware efficient ConvNet design via differentiable neural architecture search. CVPR, May 2019.

[WM95] W. Wulf and S. McKee. Hitting the memory wall: implications of the obvious. SIGARCH, Mar. 1995.

[XYB+19] S. Xi, Y. Yao, K. Bhardwaj, P. Whatmough, G. Wei, and D. Brooks. SMAUG: end-to-end full-stack simulation infrastructure for deep learning workloads. Dec. 2019.

[XZZ20] C. Xiao, P. Zhong, and C. Zheng. Enhancing adversarial defense by k-winners-take-all. ICLR, Feb. 2020.

[XGD+17] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. CVPR, Jul. 2017.

[Xil19] Xilinx. Versal: the first adaptive compute acceleration platform (ACAP). 2019.

[XAT+18] C. Xing, D. Arpit, C. Tsirigotis, and Y. Bengio. A walk with SGD. May 2018.

[XEQ17] W. Xu, D. Evans, and Y. Qi. Feature squeezing: detecting adversarial examples in deep neural networks. Dec. 2017.

[XLF+18] X. Xu, C. Liu, Q. Feng, H. Yin, L. Song, and D. Song. Neural network-based graph embedding for cross-platform binary code similarity detection. CCS, Jul. 2018.

[YKT+19] M. Yamazaki, A. Kasagi, A. Tabuchi, et al. Yet another accelerated SGD: ResNet-50 training on ImageNet in 74.7 seconds. Mar. 2019.

[Yam12] R. Yampolskiy. Turing test as a defining feature of AI-Completeness. SCI, 2012.

[YCS17] T. Yang, Y. Chen, and V. Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. CVPR, Apr. 2017.

[YDY+19] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. Le. XLNet: generalized autoregressive pretraining for language understanding. NeurIPS, Dec. 2019.

[YHG+15] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. CVPR, Nov. 2015.

[YGL+18] Z. Yao, A. Gholami, Q. Lei, K. Keutzer, and M. Mahoney. Hessian-based analysis of large batch training and robustness to adversaries. NeurIPS, Dec. 2018.

[YGS+20] Z. Yao, A. Gholami, S. Shen, K. Keutzer, and M. Mahoney. AdaHessian: an adaptive second order optimizer for machine learning. Jun. 2020.

[YSE+20] J. Yin, S. Sethumurugan, Y. Eckert, N. Enright Jerger, et al. Experiences with ML-driven design: a NoC case study. HPCA, Feb. 2020.

[YKC+18] C. Ying, S. Kumar, D. Chen, T. Wang, and Y. Cheng. Image classification at supercomputer scale. NeurIPS, Dec. 2018.

[YGG17] Y. You, I. Gitman, and B. Ginsburg. Large batch training of convolutional networks. Sep. 2017.

[YLR+20] Y. You, J. Li, S. Reddi, et al. Large batch optimization for deep learning: training BERT in 76 minutes. ICLR, Jan. 2020.

[YZH+18] Y. You, Z. Zhang, C. Hsieh, J. Demmel, and K. Keutzer. ImageNet training in minutes. Jan. 2018.

[YAB+18] Y. Yu, M. Abadi, P. Barham, et al. Dynamic control flow in large-scale machine learning. EUROSYS, May 2018.

[YTL+19] L. Yuan, F. Tay, G. Li, T. Wang, and J. Feng. Revisit knowledge distillation: a teacher-free framework. Sep. 2019.

[ZK15] S. Zagoruyko and N. Komodakis. Learning to compare image patches via convolutional neural networks. CVPR, Jun. 2015.

[ZXL+18] N. Zeghidour, Q. Xu, V. Liptchinsky, N. Usunier, G. Synnaeve, and R. Collobert. Fully convolutional speech recognition. Dec. 2018.

[Zei12] M. Zeiler. ADADELTA: an adaptive learning rate method. Dec. 2012.

[ZF13] M. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. ECCV, Nov. 2013.

[ZF13b] M. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. ICLR, May 2013.

[ZES+20] A. Zela, T. Elsken, T. Saikia, Y. Marrakchi, T. Brox, and F. Hutter. Understanding and robustifying differentiable architecture search. ICLR, Jan. 2020.

[ZB19] T. Zerrell and J. Bruestle. Stripe: tensor compilation via the nested polyhedral model. Mar. 2019.

[ZDH19] B. Zhang, A. Davoodi, and Y. Hu. Efficient inference of CNNs via channel pruning. Aug. 2019.

[ZYY18] J. Zhang, J. Yang, and H. Yuen. Training with low-precision embedding tables. NeurIPS, Dec. 2018.

[ZRW+18] M. Zhang, S. Rajbhandari, W. Wang, and Y. He. DeepCPU: serving RNN-based deep learning models 10x faster. ATC, 2018.

[ZLH+19] M. Zhang, J. Lucas, G. Hinton, and J. Ba. Lookahead optimizer: k steps forward, 1 step back. NeurIPS, Dec. 2019.

[ZL19] W. Zhang and P. Li. Spike-train level backpropagation for training deep recurrent spiking neural networks. NeurIPS, Dec. 2019.

[ZZL+17] X. Zhang, X. Zhou, M. Lin, and J. Sun. ShuffleNet: an extremely efficient convolutional neural network for mobile devices. CVPR, Jul. 2017.

[ZXH+18] Y. Zhang, T. Xiang, T. Hospedales, and H. Lu. Deep mutual learning. CVPR, Jan. 2018.

[ZZZ+19] C. Zhao, S. Zhao, M. Zhao, Z. Chen, C. Gao, H. Li, and Y. Tan. Secure multi-party computation: theory, practice and applications. Inf. Sciences, Feb. 2019.

[ZZX+19] W. Zhao, J. Zhang, D. Xie, Y. Qian, R. Jia, and P. Li. AIBox: CTR prediction model training on a single node. CIKM, Nov. 2019.

[ZHW+19] Z. Zhao, L. Hong, L. Wei, et al. Recommending what video to watch next: a multitask ranking system. RecSys, Sep. 2019.

[ZZZ+18] G. Zheng, F. Zhang, Z. Zheng, Y. Xiang, N. Yuan, X. Xie, and Z. Li. DRN: a deep reinforcement learning framework for news recommendation. IW3C2, Apr. 2018.

[ZMF+18] G. Zhou, N. Mou, Y. Fan, Q. Pi, W. Bian, C. Zhou, X. Zhu, and K. Gai. Deep interest evolution network for click-through rate prediction. AAAI, Nov. 2018.

[ZTZ+18] Z. Zhuang, M. Tan, B. Zhuang, J. Liu, Y. Guo, Q. Wu, J. Huang, and J. Zhu. Discrimination aware channel pruning for deep neural networks. NeurIPS, Dec. 2018.

[ZZY+19] R. Zhu, K. Zhao, H. Yang, W. Lin, C. Zhou, B. Ai, Y. Li, and J. Zhou. AliGraph: a comprehensive graph neural network platform. PVLDB, Aug. 2019.

[Zis18] A. Zisserman. Self-supervised learning. Jul. 2018.