
Matrix multiplication of n-bit fixed point numbers via n/2-bit vector instructions  

Authors
 Safonov I.V.
 Ayupov A.B.
 Burns S.
Date of publication
 2016

Abstract
The market of wearable and mobile devices is growing fast. Many mobile devices have a video camera and are capable of performing various computer vision tasks. We selected an IP core that contains a low-power DSP intended for image and video processing. The DSP has a VLIW/SIMD architecture, with instructions for processing 32-bit integer scalars and vector instructions for 16-bit integers. The IP core allows effective implementation of a large number of image processing algorithms. However, there are problems in computer vision that require intensive use of dense rectangular matrix multiplication, where the matrix dimensions are several hundred by several hundred. An example of such a task is visual SLAM via various modifications of the Kalman filter [1].
In the general case the filter performs calculations with floating point numbers; with some restrictions, it is possible to use 32-bit fixed point numbers. Matrix multiplication via 32-bit scalars has an unacceptably long runtime. It is necessary to employ the vector instructions of the processor, but the DSP has SIMD instructions only for 16-bit integers. In this paper we propose a method for matrix multiplication of 32-bit fixed point numbers by means of vector instructions for 16-bit integers. The technique can easily be generalized to n-bit numbers and n/2-bit vector instructions. In addition, we describe several approaches to performance optimization and parallel processing that can be applied to various DSP engines.
It is well known that n-bit unsigned integers can be multiplied via n/2-bit arithmetic operations on the low (less significant) and high (most significant) halves of the n-bit words. The book [2] describes one such method that uses four multiplications of n/2-bit unsigned integers as well as several arithmetic shifts and additions. The monograph [3] describes a technique with three multiplications of n/2-bit unsigned integers. In general the latter method is less suitable for parallel processing, because it uses a larger number of additions, subtractions and shifts in comparison with the method described in [2]. Reference [3] also describes three ways to adapt the multiplication to signed integers: a) take the absolute values of the operands, perform unsigned multiplication of the low and high halves, and then change the sign of the result depending on the signs of the operands; b) use unsigned-by-unsigned multiplication for the low halves of the operands, signed-by-signed multiplication for the high halves, and signed-by-unsigned multiplications for the high and low halves (this also requires a sign extension instruction); c) correct the outcome of unsigned multiplication by subtracting the operands depending on their signs.
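For illustration, a plain-C scalar sketch of the four-multiplication scheme of [2] may look as follows (the function name and types are ours, not from the paper):

#include <stdint.h>

/* A minimal scalar sketch of the four-multiplication scheme from [2]:
   the 64-bit product of two 32-bit unsigned integers is assembled from
   four 16x16->32-bit products of the low (L) and high (H) halves,
   combined with shifts and additions. */
uint64_t mul32_via16(uint32_t a, uint32_t b)
{
    uint32_t aL = a & 0xFFFFu, aH = a >> 16;
    uint32_t bL = b & 0xFFFFu, bH = b >> 16;
    uint64_t ll = (uint64_t)aL * bL;   /* contributes to bits  0..31 */
    uint64_t lh = (uint64_t)aL * bH;   /* contributes to bits 16..47 */
    uint64_t hl = (uint64_t)aH * bL;   /* contributes to bits 16..47 */
    uint64_t hh = (uint64_t)aH * bH;   /* contributes to bits 32..63 */
    return ll + ((lh + hl) << 16) + (hh << 32);
}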
For the given matrix dimensions, fast techniques such as various modifications of Strassen’s algorithm [4] are ineffective. Matrix multiplication via dot products of rows of the leftmost operand and columns of the rightmost matrix is the most popular method. However, the technique is ill-suited for implementation on a DSP with SIMD instructions, because the elements of columns do not lie sequentially in memory. Reference [5] describes a special architecture for effective access to column elements, but in practice a transposition of the rightmost matrix or copying of the column elements to a separate array is required. Both actions lead to significant overhead, despite the fact that transposition by Eklundh’s algorithm [7] operates quite fast on our DSP. In addition, the dot-product approach requires zero-padding of rows and an instruction for efficient summation of vector register elements.
It is preferable to implement matrix multiplication as a sum of outer (vector tensor) products of the columns of the leftmost operand and the rows of the rightmost matrix. A so-called wide vector register should be used to accumulate the sums of outer products. The algorithm has many advantages: memory cells are accessed sequentially, and there is no need for zero-padding or for summation of vector register elements. We present processing times of matrix multiplication by means of dot products and outer products for NxN matrices of 16-bit unsigned integers; both methods were implemented with vector instructions. The realization via outer products outperforms the alternative solution even without taking into account the time required for transposing the rightmost matrix.
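A scalar sketch of the outer-product formulation (our illustration, not the vectorized DSP kernel) is:

#include <stdint.h>

/* Matrix multiplication as a sum of outer products: for each k, the outer
   product of column k of A and row k of B is accumulated into C. The inner
   loop walks rows of B and C sequentially in memory, which is what makes
   the scheme SIMD-friendly; for the dimensions in question, products of
   16-bit elements accumulated this way fit into 48 bits. */
void matmul_outer(const uint16_t *A, const uint16_t *B, uint64_t *C, int N)
{
    for (int i = 0; i < N * N; i++)
        C[i] = 0;
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++) {
            uint64_t aik = A[i * N + k];     /* element of column k of A */
            for (int j = 0; j < N; j++)      /* row k of B, row i of C */
                C[i * N + j] += aik * B[k * N + j];
        }
}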
As a start, let’s combine it all together: matrix product and multiplication of 32-bit unsigned integers via 16-bit arithmetic operations. There are two approaches to an implementation with SIMD instructions. The first is decomposition of 32-bit numbers into vector registers of 16-bit low and high halves and calculation with those registers within a single matrix multiplication procedure. We measured the performance of this approach: for 128x128 matrices it operates only 30% faster than multiplication by means of scalar instructions for 32-bit numbers, which is unsatisfactory. Analysis of the program with a profiler showed that this approach does not employ the VLIW pipeline effectively. The second approach is decomposition of both matrices of 32-bit integers into matrices of 16-bit numbers and performing four matrix products between those matrices; the outcomes of the four matrix multiplications are combined into the final matrix of 32-bit integers, as sketched below. We present plots of processing speed depending on matrix dimension for both approaches; the second one is much faster.
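A scalar sketch of this second approach (all names are ours) could be:

#include <stdint.h>

/* Decompose a 32-bit matrix into matrices of its 16-bit halves. */
void split_u32(const uint32_t *M, uint16_t *ML, uint16_t *MH, int N)
{
    for (int i = 0; i < N * N; i++) {
        ML[i] = (uint16_t)(M[i] & 0xFFFFu);   /* low 16-bit half */
        MH[i] = (uint16_t)(M[i] >> 16);       /* high 16-bit half */
    }
}

/* Merge the four partial products LL = AL*BL, LH = AL*BH, HL = AH*BL and
   HH = AH*BH (each computed, e.g., by the outer-product kernel above) with
   the same shifts as in mul32_via16. Arithmetic is modulo 2^64, which
   preserves all result bits below bit 64 -- enough for the fixed point
   extraction discussed further below. */
void merge_u32(const uint64_t *LL, const uint64_t *LH, const uint64_t *HL,
               const uint64_t *HH, uint64_t *C, int N)
{
    for (int i = 0; i < N * N; i++)
        C[i] = LL[i] + ((LH[i] + HL[i]) << 16) + (HH[i] << 32);
}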
Further, let’s modify the second approach for 32-bit fixed point numbers. The main concepts of fixed point arithmetic are described in [6]. Multiplication of fixed point numbers can be considered as multiplication of signed integers followed by an arithmetic right shift by the length (in bits) of the fractional part. Thus, it is necessary to extend the matrix multiplication algorithm for unsigned integers to a method for matrices of signed integers. The existing approaches to such an extension for scalars were enumerated above. Way a) is unfeasible for matrix multiplication, because each element of the final matrix is a sum of products of elements of the matrix operands. IVP-IP does not have all the vector instructions necessary for way b). Way c) can be applied in our case.
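As a scalar illustration (our sketch) of this definition for 16.16 numbers:

#include <stdint.h>

/* 16.16 fixed point multiplication: multiply as signed integers, then
   arithmetic right shift by the length of the fractional part (16 bits). */
int32_t fixmul_16_16(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 16);
}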
Finally, we have the following algorithm for the matrix multiplication AB. For the given range of dimensions, a matrix multiplication of 16-bit numbers produces a matrix of 48-bit numbers. We denote the matrices of 16-bit halves of the 32-bit numbers from A and B by indices: L for low, H for high. The first step is decomposition of A into AL and AH, and of B into BL and BH. Four matrix multiplications ALBL, AHBL, ALBH and AHBH are performed via outer products independently of each other. ALBL corresponds to the lower 48 bits of the outcome. The products ALBH and AHBL should be considered as shifted 16 bits to the left, so they correspond to bits 63 to 16. AHBH should be considered as shifted 32 bits to the left, so it corresponds to bits 79 to 32. The maximum length of the elements of the resulting matrix is 80 bits. To obtain the resulting matrix, it is necessary to sum the shifted outcomes of ALBL, AHBL, ALBH and AHBH and to subtract the shifted matrices Da, Db and S:
if A(i,j)<0, then Da(i,j) = B(i,j), else Da(i,j) = 0;
if B(i,j)<0, then Db(i,j) = A(i,j), else Db(i,j) = 0;
if A(i,j)<0 and B(i,j)<0, then S(i,j) = 1, else S(i,j) = 0,
where the matrices Da and Db should be considered as shifted 32 bits to the left, and the matrix S as shifted 64 bits to the left. Depending on the fixed point format, we extract the corresponding bits from the 80-bit outcome. For example, for 16.16 fixed point numbers (that is, a 16-bit integer part and a 16-bit fractional part) we take bits 47 to 16.
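A scalar illustration of correction way c) together with this bit extraction (our sketch, not the vectorized matrix kernel) may look as follows; in 64-bit arithmetic the S term lies at bit 64 and above and therefore vanishes modulo 2^64, as does the difference between subtracting the signed and the unsigned shifted operands:

#include <stdint.h>

/* Signed 16.16 multiplication built from an unsigned product plus the
   way c) correction: subtract each operand, shifted 32 bits left, when
   the other operand is negative (the Da- and Db-style terms). The result
   is correct modulo 2^64, which is enough to extract bits 47..16. */
int32_t fixmul_16_16_via_unsigned(int32_t a, int32_t b)
{
    uint64_t p = (uint64_t)(uint32_t)a * (uint32_t)b; /* unsigned 32x32->64 */
    if (a < 0) p -= (uint64_t)(uint32_t)b << 32;      /* Da-style correction */
    if (b < 0) p -= (uint64_t)(uint32_t)a << 32;      /* Db-style correction */
    return (int32_t)(p >> 16);                        /* bits 47..16 */
}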
There are two optimization tricks that can decrease processing time significantly, sketched below. The first is alignment of the addresses of matrix rows: addresses should be divisible by the length (in bytes) of the vector register. Matrix multiplication via outer products does not require zero-padding, so the padding bytes can have any values. The second trick is manual loop unrolling. In general, modern compilers automatically perform effective optimizations, including loop unrolling; however, we were able to write faster code than the optimizing compiler produced.
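The two tricks might be sketched as follows (illustrative names; VEC_BYTES = 64 corresponds to a vector register holding 32 16-bit numbers):

#include <stdint.h>

#define VEC_BYTES 64  /* assumed vector register length in bytes */

/* Round a row width up so that every row address stays a multiple of
   VEC_BYTES; the padding elements may hold arbitrary values. */
int padded_width(int n, int elem_bytes)
{
    int per_row = VEC_BYTES / elem_bytes;
    return (n + per_row - 1) / per_row * per_row;
}

/* Row update of the outer-product kernel, manually unrolled by 4
   (a scalar stand-in for the vector code; Np is the padded row width
   and hence a multiple of 4). */
void row_update_unrolled(uint64_t *Crow, const uint16_t *Brow,
                         uint16_t aik, int Np)
{
    for (int j = 0; j < Np; j += 4) {
        Crow[j + 0] += (uint64_t)aik * Brow[j + 0];
        Crow[j + 1] += (uint64_t)aik * Brow[j + 1];
        Crow[j + 2] += (uint64_t)aik * Brow[j + 2];
        Crow[j + 3] += (uint64_t)aik * Brow[j + 3];
    }
}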
The decompositions into matrices of 16-bit numbers and the products of those matrices can be done in parallel. This can be applied in conjunction with the “natural” approach to parallelization of matrix multiplication: splitting the rightmost matrix into vertical blocks, multiplying the leftmost matrix by each block, and concatenating the outcomes, as in the sketch below. The block width should be divisible by the width of the vector register.
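A scalar sketch of this block parallelization (our names; the OpenMP pragma stands in for whatever tasking mechanism the platform provides, and C is assumed zero-initialized):

#include <stdint.h>

/* The rightmost matrix B (padded row width Np) is split into vertical
   blocks of width W, a multiple of the vector register width. Each block
   product A*B_block writes a disjoint column range of C, so the tasks are
   independent and the "concatenation" happens in place. */
void matmul_blocked(const uint16_t *A, const uint16_t *B, uint64_t *C,
                    int N, int Np, int W)
{
    #pragma omp parallel for       /* each block is an independent task */
    for (int j0 = 0; j0 < Np; j0 += W)
        for (int k = 0; k < N; k++)
            for (int i = 0; i < N; i++) {
                uint64_t aik = A[i * Np + k];
                for (int j = j0; j < j0 + W; j++)
                    C[i * Np + j] += aik * B[k * Np + j];
            }
}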
The processing time of the proposed algorithm, implemented by means of 16-bit vector instructions, for a 160x160 matrix of 16.16 fixed point numbers is more than 6 times shorter than that of the implementation via 32-bit scalar instructions. What is the theoretical limit of achievable performance? Multiplication of square NxN matrices requires N^3 scalar products. In the ideal case, thanks to the VLIW pipeline, memory operations and additions can be performed simultaneously with the multiplication instructions. A vector register of our DSP contains 32 16-bit numbers, and the algorithm performs four matrix multiplications. Assuming each instruction takes one processor cycle, the theoretical limit is 4N^3/32 cycles. Our matrix multiplication of 32-bit unsigned integers performs only 8-10% worse than this ideal-case estimate, which is quite good. For matrices of fixed point numbers, the generation and subtraction of Da and Db add a 10-15% overhead; nevertheless, the processing speed remains high.
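As a worked instance of the formula above: for N = 160 the estimate gives 4·160^3/32 = 16,384,000/32 = 512,000 cycles.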
Keywords
 rectangular matrix multiplication, SIMD instructions, outer product, fixed point, DSP programming, parallel processing.
Library reference
 Safonov I.V., Ayupov A.B., Burns S. Matrix multiplication of n-bit fixed point numbers via n/2-bit vector instructions // Problems of Perspective Micro- and Nanoelectronic Systems Development - 2016. Proceedings / edited by A. Stempkovsky, Moscow, IPPM RAS, 2016. Part 3. P. 141-148.
URL of paper
 http://www.mes-conference.ru/data/year2016/pdf/D197.pdf
