@techreport{TR-IC-PFG-21-29,
  number      = {IC-PFG-21-29},
  author      = {Victor Ferreira {Ferrari} and Guido Costa Souza de {Araujo}},
  title       = {{Improving Convolutions with Tensor Hardware Accelerators}},
  month       = {December},
  year        = {2021},
  institution = {Institute of Computing, University of Campinas},
  note        = {In English, 35 pages. \par\selectlanguage{english}\textbf{Abstract} Convolutional Neural Network (CNN) models are among the most popular choices for deep learning solutions to problems with huge data sets. Given that CNNs are very computationally expensive, optimizing convolutions is central to enabling larger models and speeding up inference. Tensor operations, e.g. matrix multiplication, have increasingly relied on hardware accelerators, such as IBM POWER10's MMA engine. \par This work explores how to exploit MMA and the POWER10 architecture to improve convolution performance, and proposes a novel algorithm for the operation, named Convolution Slicing Optimization (CSO), which tiles the convolution instance into multiple sub-problems and schedules the resulting tiles so as to minimize DRAM memory accesses. After the convolution is tiled, a micro-kernel is used to increase throughput on the MMA engine. \par To evaluate the proposed approach, a set of experiments was performed on a POWER10 CPU, and the results show that CSO efficiently tiles the convolution according to a set of parameters calculated at compile time. Speedups of up to $229\%$ were obtained when comparing the CSO convolution-slicing technique to a widely used reduction to matrix multiplication.}
}