
Implementation of functions of the linear algebra subroutines on a vector coprocessor for unaligned arrays  

Authors
 Aryashev S.I.
 Zubkovskiy P.S.
 Tsvetkov V.V.
Date of publication
 2020
DOI
 10.31114/2078-7707-2020-4-181-186

Abstract
 The efficiency of a vector coprocessor can be estimated by the speedup of programs that implement functions on the vector coprocessor relative to their execution on the coprocessor of real arithmetic. Program execution time is determined by the time to load data into registers, the time to process the data, and the time to store the results in memory. A higher acceleration coefficient can be obtained by reducing the time of each stage and by overlapping the execution of stages. For this purpose, the programmer is offered a set of vector commands and efficient commands for loading/saving data through the level 1 cache, or pairs of vectors through the level 2 cache.
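The relationship between stage times and the acceleration coefficient can be illustrated with a small model. This is only a sketch: the struct, the function names, and the assumption that stages overlap perfectly (so the slowest stage dominates) are illustrative and are not taken from the paper.

```c
/* Illustrative model only: all names and the "perfect overlap" bound
 * are assumptions, not measurements or formulas from the paper. */
typedef struct {
    double load;    /* time to load data into registers */
    double compute; /* time to process the data         */
    double store;   /* time to store results to memory  */
} stage_times;

/* Stages executed strictly one after another. */
double serial_time(stage_times t)
{
    return t.load + t.compute + t.store;
}

/* Idealized lower bound when all stages overlap completely:
 * the slowest stage determines the total time. */
double overlapped_time(stage_times t)
{
    double m = t.load;
    if (t.compute > m) m = t.compute;
    if (t.store > m) m = t.store;
    return m;
}

/* Acceleration coefficient: baseline time divided by vector time. */
double speedup(double t_baseline, double t_vector)
{
    return t_baseline / t_vector;
}
```

In this model, shortening the load or store stage raises the speedup directly in the serial case, while in the overlapped case it helps only as long as that stage is the slowest one.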
Which load/save commands can be used depends on the alignment of the arrays involved. For example, the most efficient commands, VLDQ/VSDQ, which load/save two 128-bit vectors, require arrays aligned on a 32-byte boundary (align32). Commands that load/save a single vector (VLDM/VSDM) can be used with arrays aligned on at least a 16-byte boundary (align16). For arrays aligned on an 8-byte boundary (align8) or higher, one can use commands that load/save data in the upper half of the register (VLDH/VSDH), or commands (VLD/VSDH) that load/save the same data in both halves of the register.
Higher values of the acceleration coefficient can be achieved if the arrays the program works with are aligned on a 32-byte boundary. At the align16 or align8 levels, the load/save stage takes longer because only less productive commands can be used, and the speedup from executing functions on the vector coprocessor is reduced.
Problems arise with float arrays aligned on a 4-byte boundary (align4). Since the vector coprocessor's command set has no vector commands for loading/saving such arrays to CPV registers, the vector registers have to be loaded/saved via the processor's general-purpose registers (GPRs), which is time-consuming and can offset the advantages of the vector coprocessor.
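The alignment levels described above can be checked at run time from an array's address. The following is a minimal sketch: the enum and function names are hypothetical, and the mapping only mirrors the command classes named in the abstract, not an actual coprocessor API.

```c
#include <stdint.h>

/* Hypothetical labels for the load/save classes named in the abstract. */
typedef enum {
    LS_SCALAR_GPR, /* align4: no vector load/save, go through GPRs      */
    LS_HALF,       /* align8: half-register commands such as VLDH/VSDH  */
    LS_SINGLE,     /* align16: single-vector commands VLDM/VSDM         */
    LS_PAIR        /* align32: vector-pair commands VLDQ/VSDQ           */
} ls_class;

/* Return the most efficient load/save class usable at this address. */
ls_class ls_class_for_addr(uintptr_t addr)
{
    if (addr % 32 == 0) return LS_PAIR;
    if (addr % 16 == 0) return LS_SINGLE;
    if (addr % 8 == 0)  return LS_HALF;
    return LS_SCALAR_GPR;
}

/* Convenience wrapper for array pointers. */
ls_class ls_class_for_ptr(const void *p)
{
    return ls_class_for_addr((uintptr_t)p);
}
```

A dispatcher could call `ls_class_for_ptr` once per array and choose the corresponding loop variant.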
In this paper, we consider approaches that minimize the time of loading/saving data to the vector registers of the coprocessor by using VLDQ/VSDQ commands with arrays whose alignment level is not necessarily align32 but may be lower, for example align16/align8 for double/float arrays or align4 for float arrays. The proposed approaches are considered in relation to the Level 1 functions of the BLAS library.
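The abstract does not describe the approaches themselves. One common technique for a single-array Level 1 function, shown here purely as an illustration and not as the authors' method, is to process leading elements one by one until the pointer reaches a 32-byte boundary, then handle the bulk in 32-byte blocks. The sketch below is portable C for a scaling routine in the style of BLAS `sscal`; the names `head_count` and `sscal_peeled` are hypothetical, and the inner block loop stands in for where VLDQ/VSDQ would be used on real hardware.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_BYTES  32                              /* two 128-bit vectors */
#define BLOCK_FLOATS (BLOCK_BYTES / sizeof(float))   /* 8 floats per block  */

/* Number of leading floats to process before x reaches a 32-byte
 * boundary, capped at n. Assumes x is at least 4-byte aligned. */
size_t head_count(const float *x, size_t n)
{
    uintptr_t mis = (uintptr_t)x % BLOCK_BYTES;
    size_t head = mis ? (BLOCK_BYTES - mis) / sizeof(float) : 0;
    return head < n ? head : n;
}

/* x[i] *= alpha for i in [0, n), split into head, aligned body, tail. */
void sscal_peeled(size_t n, float alpha, float *x)
{
    size_t i = 0;
    size_t head = head_count(x, n);

    /* Scalar head: brings x + i up to a 32-byte boundary. */
    for (; i < head; ++i)
        x[i] *= alpha;

    /* Aligned body: each iteration covers one 32-byte block. On the
     * target hardware this is where a vector-pair load, a vector
     * multiply, and a vector-pair store would go. */
    for (; i + BLOCK_FLOATS <= n; i += BLOCK_FLOATS)
        for (size_t k = 0; k < BLOCK_FLOATS; ++k)
            x[i + k] *= alpha;

    /* Scalar tail: fewer than 8 elements remain. */
    for (; i < n; ++i)
        x[i] *= alpha;
}
```

Peeling aligns only one pointer, so functions with two arrays of different misalignment (for example `saxpy` or `sdot`) need additional handling that this sketch does not cover.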
Keywords
 vector coprocessor, coprocessor of real arithmetic, acceleration factor, loading instructions, save instructions.
Library reference
 Aryashev S.I., Zubkovskiy P.S., Tsvetkov V.V. Implementation of functions of the linear algebra subroutines on a vector coprocessor for unaligned arrays // Problems of Perspective Micro- and Nanoelectronic Systems Development - 2020. Issue 4. P. 181-186. doi:10.31114/2078-7707-2020-4-181-186
URL of paper
 http://www.mes-conference.ru/data/year2020/pdf/D096.pdf
