Single-Pass Covariance Matrix Calculation on a Hybrid FPGA/CPU Platform

Abstract. Covariance matrices are used for a wide range of applications in particle physics, including the Kálmán filter for tracking purposes and Principal Component Analysis for dimensionality reduction. Based on a novel decomposition of the covariance matrix, a design is presented that requires only one pass over the data to calculate the covariance matrix. Two computation engines are used, depending on the parallelizability of the necessary computation steps. The design is implemented on a hybrid FPGA/CPU system and yields a speed-up of up to 5 orders of magnitude compared to a previous FPGA implementation.


Covariance matrix
Covariance is a measure of variability between two different data sets X and Y, defined as:

$$\mathrm{cov}(X, Y) = E\left[(X - E[X])(Y - E[Y])\right] \qquad (1)$$

We will use the understanding of the expected value as the sample mean:

$$E[X] = \frac{1}{n} \sum_{k=1}^{n} X_k \qquad (2)$$

The covariance matrix K represents the covariances between all the permutations of the different data sets. We use the convention that m denotes dimensionality, i.e. the number of different data sets, and n the number of samples. Data are represented by an m × n matrix X. Then the covariance matrix is defined as:

$$K_{ij} = \mathrm{cov}(X_{i\cdot}, X_{j\cdot}) \qquad (3)$$

Note: subscripts denote indices, and the subscript $A_{z\cdot}$ refers to the subvector with all the elements in row z of matrix A. The covariance matrix will be a symmetrical m × m matrix, with the diagonal $K_{ii}$ being the variance of data set $X_{i\cdot}$:

$$K = \begin{pmatrix} \mathrm{var}(X_{1\cdot}) & \mathrm{cov}(X_{1\cdot}, X_{2\cdot}) & \cdots & \mathrm{cov}(X_{1\cdot}, X_{m\cdot}) \\ \mathrm{cov}(X_{1\cdot}, X_{2\cdot}) & \mathrm{var}(X_{2\cdot}) & \cdots & \vdots \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{cov}(X_{1\cdot}, X_{m\cdot}) & \cdots & \cdots & \mathrm{var}(X_{m\cdot}) \end{pmatrix} \qquad (4)$$
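As a brief illustration of these conventions (not part of the presented design), the covariance matrix of an m × n data matrix can be obtained with NumPy; bias=True reproduces the normalization by n used in eq. 2:

```python
import numpy as np

# m data sets (rows) with n samples each (columns), following the X convention above
m, n = 3, 1000
rng = np.random.default_rng(0)
X = rng.normal(size=(m, n))

# K[i, j] = cov(X[i, :], X[j, :]); bias=True divides by n as in eq. 2
K = np.cov(X, bias=True)

print(K.shape)  # (m, m): symmetric, with the variances on the diagonal
```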

Applications and challenges
Covariance matrices serve a variety of purposes within particle physics, particularly in dimensionality reduction and filtering techniques. They are used within Principal Component Analysis (PCA), where the eigenvalues are decomposed from the covariance matrix [1], and they form the basis of the whitening transformation [2]. Their most prevalent use in particle physics is in the Kálmán filter. Kálmán filters are linear estimators that rely on covariance matrices of state vectors. They are widely applied as particle tracking algorithms, and their implementation on FPGAs forms one of the approaches to increase low-level trigger capabilities for future luminosity upgrades [3, 4]. Covariance matrices face the challenge of the curse of dimensionality, i.e. the exponential increase in computational complexity with increasing dimensionality, leading to prohibitive memory usage and computation times. To mitigate this obstacle, estimators such as the maximum likelihood estimator can often approximate the covariance matrix to sufficient accuracy [5], but may be vulnerable to non-normally distributed random variables.

Related work
As the implementation baseline we use the work of Perera and Li (2011) [6]. They implemented an accurate covariance matrix calculator on a Xilinx Virtex-6 FPGA clocked at 100 MHz. Their design was tested on 8.56 MB of integer input data with a fixed number of 64 dimensions. The design is built upon two passes over the data: first, the means are calculated according to eq. 2, and after the means $\overline{X_{i\cdot}}$ and $\overline{X_{j\cdot}}$ have been calculated, the corresponding covariance is calculated according to:

$$\mathrm{cov}(X_{i\cdot}, X_{j\cdot}) = \frac{1}{n} \sum_{k=1}^{n} \left(X_{ik} - \overline{X_{i\cdot}}\right)\left(X_{jk} - \overline{X_{j\cdot}}\right) \qquad (5)$$

This approach requires two passes over the data: the data has to be passed a first time to calculate the means, including a division, and a second time to calculate the actual covariance values, requiring of the order of $m^2 n$ multiplications, as eq. 5 is evaluated for each covariance matrix entry.
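For illustration, a minimal software sketch of this two-pass scheme (following eqs. 2 and 5, not reproducing the FPGA architecture of [6]; the function name is ours) could look as follows:

```python
import numpy as np

def covariance_two_pass(X):
    """Two-pass covariance: first pass for the means (eq. 2),
    second pass for the covariances (eq. 5). X has shape (m, n)."""
    m, n = X.shape
    means = X.sum(axis=1) / n          # first pass over the data
    K = np.empty((m, m))
    for i in range(m):                 # second pass: n multiplications
        for j in range(i, m):          # per covariance matrix entry
            K[i, j] = np.dot(X[i] - means[i], X[j] - means[j]) / n
            K[j, i] = K[i, j]          # symmetry
    return K
```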

Theory
For our implementation, the focus lies on a single-pass design that requires just one pass over the data. The reasoning is that in potential applications we expect to face a huge number of data samples within a lower (≤ 160) range of dimensions. The computation task is thus an I/O-bound problem, limited by data transfer via cache lines from CPU to FPGA, up to the point where dimensionality exceeds parallelizability. Assuming that computation time is proportional to sample size, a single-pass design will be able to speed up every dual-pass design by at least a factor of two, and it can potentially be used in real-time applications. We use a novel decomposition of the covariance matrix:

$$K = \frac{1}{n} X X^T - \frac{1}{n^2} X \mathbf{1} X^T \qquad (6)$$

where $\mathbf{1}$ refers to an n × n matrix of ones.
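The decomposition can be checked numerically; the following NumPy sketch (illustrative only) compares eq. 6 against the textbook definition:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 400
X = rng.integers(0, 100, size=(m, n)).astype(np.float64)

ones = np.ones((n, n))                               # the n x n matrix of ones
K_decomposed = X @ X.T / n - X @ ones @ X.T / n**2   # eq. 6
K_reference = np.cov(X, bias=True)                   # definition of eqs. 1-3

assert np.allclose(K_decomposed, K_reference)
```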

Proof
The first part of the proof is commonly known: taking the definition of covariance from eq. 1 with our understanding of the expected value given in eq. 2 and the definition of the covariance matrix in eq. 3, we come from

$$K_{ij} = \frac{1}{n} \sum_{k=1}^{n} \left(X_{ik} - \overline{X_{i\cdot}}\right)\left(X_{jk} - \overline{X_{j\cdot}}\right) \qquad (7)$$

to

$$K_{ij} = A_{ij} - B_{ij}, \quad A_{ij} = \frac{1}{n} \sum_{k=1}^{n} X_{ik} X_{jk}, \quad B_{ij} = \overline{X_{i\cdot}}\,\overline{X_{j\cdot}} \qquad (8)$$

For the second part, we use individual equivalencies for the terms $A_{ij}$ and $B_{ij}$ defined in the previous eq. 8. For the first term, we can use the general definition of matrix multiplication to derive

$$A_{ij} = \frac{1}{n} \sum_{k=1}^{n} X_{ik} X_{jk} = \frac{1}{n} \left(X X^T\right)_{ij} \qquad (9)$$

and for the second term, we use an n-dimensional vector of ones $\mathbf{1}_n$:

$$\overline{X_{i\cdot}} = \frac{1}{n} \sum_{k=1}^{n} X_{ik} = \frac{1}{n} \left(X \mathbf{1}_n\right)_i \qquad (10)$$

and, building an n × n matrix of ones $\mathbf{1}$ and using the definition of matrix multiplication again, we get:

$$B_{ij} = \frac{1}{n^2} \left(\sum_{k=1}^{n} X_{ik}\right)\left(\sum_{l=1}^{n} X_{jl}\right) = \frac{1}{n^2} \left(X \mathbf{1} X^T\right)_{ij} \qquad (11)$$

Eq. 9 and eq. 11 correspond to the term definitions in eq. 8 and thus to the decomposition in eq. 6. Also, another approach is given. From eq. 8, it can be seen that

$$K_{ij} = \overline{X_{i\cdot} X_{j\cdot}} - \overline{X_{i\cdot}}\,\overline{X_{j\cdot}} \qquad (12)$$

The equivalency of $\frac{1}{n}\left(X X^T\right)_{ij}$ to $\overline{X_{i\cdot} X_{j\cdot}}$ can be directly derived from eq. 9. The second part also holds:

$$\frac{1}{n^2}\left(X \mathbf{1} X^T\right)_{ij} = \frac{1}{n^2}\left(\sum_{k=1}^{n} X_{ik}\right)\left(\sum_{l=1}^{n} X_{jl}\right) = \overline{X_{i\cdot}}\,\overline{X_{j\cdot}} \qquad (13)$$

Design

Algorithmic complexity
For our assumption on speed-up to hold, the computation task has to be I/O-bound.
As would be the case in real-time applications, data will be sent sample by sample, or, in terms of the above-defined data matrix X, column by column. We will split our calculations according to the two terms described in eq. 6: Firstly, we will calculate the values $X_{\cdot k} \otimes X_{\cdot k}^{T}$ for each data sample k via a tensor product, whose outputs are then added up. This exploits the equivalency

$$X X^T = \sum_{k=1}^{n} X_{\cdot k} \otimes X_{\cdot k}^{T} \qquad (14)$$

Secondly, we will compute the sums $\sum_{k=1}^{n} X_{ik}$ in parallel, as they form the basis for calculating the second term $B_{ij}$, as can be derived from eq. 13.
The computationally most complex step is the tensor product, which requires $\mathcal{O}(m^2 + m)$ multiplications (m denotes the number of dimensions) to be performed in parallel in order for the calculation to remain a strictly I/O-bound problem, meaning that cache transfer, and not compute, limits the throughput. By relaxing this requirement, we can increase the dimensionality.
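In software form, the single-pass accumulation described above can be sketched as follows (the function is illustrative; on the FPGA the outer products are computed by the DSPs and accumulated in RAM blocks):

```python
import numpy as np

def accumulate_single_pass(samples, m):
    """One pass over the data: accumulate XX^T via per-sample tensor
    products (eq. 14) and the row sums needed for the B term (eq. 13)."""
    xxt = np.zeros((m, m))      # running sum of outer products X.k (x) X.k^T
    sums = np.zeros(m)          # running sums over k of X_ik
    n = 0
    for x in samples:           # samples arrive one at a time (column by column)
        xxt += np.outer(x, x)   # tensor product of the current sample
        sums += x
        n += 1
    return xxt, sums, n
```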

Hardware and framework
The covariance matrix accelerator is implemented on a hybrid FPGA/CPU device: the Intel HARP. Intel HARP v2 contains both an FPGA and a CPU, and both have access to a shared memory region [7]. The specifications are listed in table 1; among them are the Intel Arria 10 FPGA and an FPGA-CPU bandwidth of 20 GB/s. Software-wise, the design was implemented using doppioDB, a hardware-accelerated database framework [9].

Implementation
A strictly I/O-bound solution only works up to the number of dimensions for which the tensor operation is fully parallelizable. For higher dimensions, a compromise has to be found. Thus, our design is split into two cases: the first case holds if the number of dimensions is below this threshold (which, by coincidence, is about the size of one cache line); in the second case, where dimensionality exceeds parallelizability, an adjusted design is used. The scheme is outlined in fig. 1 and the data flow for both cases is illustrated in fig. 2. The computation engine is synthesized onto the FPGA, where it can take advantage of the parallelizability of tensor operations using Digital Signal Processors (DSPs): the engine receives data sample by sample, and the tensor operation is performed by the DSPs. Depending on dimensionality, this is either done fully in parallel (fig. 2(a)) or batched, by shifting the multipliers consecutively into the DSP registers. An accumulator then adds the DSP outputs to the $XX^T$ matrix and the sample values to the running sums $\sum_k X_{ik}$, from which the $X \mathbf{1} X^T$ term is later formed.
The values are stored in RAM blocks that represent rows of the $XX^T$ matrix. As this matrix is symmetrical, only the upper diagonal is stored. Data is sent sample by sample, one cache line at a time; thus, for lower dimensions, the cache line is not completely filled. Therefore, although the algorithm is theoretically I/O-bound, the throughput stays below the theoretical I/O-bound value. For dimensionalities that require semi-parallel computation, computation time is bound by the time needed to shift values into the DSP multiplication registers, which is proportional to the number of dimensions; therefore, for higher dimensions, computation time increases linearly with dimensionality.
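A possible software analogue of the upper-triangular storage mentioned above is sketched below; the index mapping is illustrative and not necessarily the exact RAM-block layout of the design:

```python
import numpy as np

def upper_index(i, j, m):
    """Map (i, j) with i <= j to a position in a packed array of length
    m*(m+1)//2, mirroring that only the upper diagonal of the symmetric
    XX^T matrix is stored."""
    return i * m - i * (i - 1) // 2 + (j - i)

m = 4
packed_xxt = np.zeros(m * (m + 1) // 2)
x = np.arange(1.0, m + 1)                 # one incoming sample of dimension m
for i in range(m):
    for j in range(i, m):                 # accumulate only the upper triangle
        packed_xxt[upper_index(i, j, m)] += x[i] * x[j]
```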
After these operations are completed, the m × m output matrices are transferred to the CPU to perform the divisions. As dimensionality is assumed to be much lower than the sample size, the computational cost of this step is low. Since these are mostly floating-point operations, it would be disadvantageous to perform them on the FPGA, as this would result in either high hardware resource usage or high latency. The current design thus exploits the advantages of both the FPGA and the CPU.
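The CPU-side finalization then amounts to a few divisions and one outer product, sketched here under the same illustrative conventions as the accumulation sketch above:

```python
import numpy as np

def finalize_covariance(xxt, sums, n):
    """Combine the accumulated terms into the covariance matrix (eq. 6):
    K = XX^T / n - (X1)(X1)^T / n^2. The divisions are floating-point
    operations and are therefore performed on the CPU."""
    return xxt / n - np.outer(sums, sums) / n**2
```

Together with the accumulation sketch above, `finalize_covariance(*accumulate_single_pass(samples, m))` yields the full covariance matrix.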

Experimental setup
The test set-up uses uniformly distributed random integer values. The sample size ranges from 200k to 3G values, and the number of dimensions ranges from 2 to 160. This corresponds to data sizes of 3 MB to 10 GB. The maximum number of dimensions is limited by the design; whereas there is no formal limit on the sample size, it is constrained in practice by accumulator overflow. Two different baselines are used:
• as the FPGA baseline, the design by Perera and Li (2011) [6]. This design has, unlike the presented design, a fixed dimensionality of 64.
• as the CPU baseline, the NumPy covariance estimation function, run on an Intel i5 with 4 cores at a 2.3 GHz clock speed [10].
For both baselines, timing is measured from the beginning of the computation until the results have been obtained by the host machine.
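For reference, the CPU baseline corresponds to a call of the NumPy covariance routine on equivalent data; the snippet below is schematic (sizes and the normalization convention are illustrative, and timing code is omitted):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.integers(0, 2**16, size=(64, 1_000_000))  # uniform random integer test data
K = np.cov(X)   # NumPy covariance estimator used as the CPU baseline [10]
```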

Comparison to FPGA designs
Throughput of the covariance matrix algorithm has been measured for data samples containing 2 to 160 dimensions, as presented in fig. 3. The threshold up to which the fully parallel data flow is used is 16 dimensions. Up to this threshold, memory throughput increases steadily from 450 MB/s to 3.9 GB/s. For dimensions beyond that, throughput drops sharply to, and stays constant at, 850 MB/s, as the semi-parallel data flow is used and throughput is bound by computation time rather than I/O.

Figure 3. Throughput of the presented design and of the baselines [6], [11]. Note that the current implementation has variable input dimensionality, whereas the baseline designs have fixed dimensionality.

The result shows a speed-up (increase of throughput) of up to 183000× compared to the baseline (21.2 kB/s [6]). The speed-up for dimensions above 16 is 40000×. For comparison, a real-time covariance updating algorithm for the Kálmán filter has also been included [11], for which the presented implementation achieves 3× higher throughput; this comparison is, however, of limited significance due to the different nature of the algorithms. Compared to a standard CPU implementation of maximum likelihood covariance matrix estimation [10], significant speed-up is only achieved for numbers of dimensions handled by the fully parallel engine (≤ 16 dimensions, up to 5×), while throughput remains within the same order of magnitude for numbers of dimensions handled by the semi-parallel engine.

Conclusion and outlook
A single-pass covariance algorithm design based on a novel decomposition of the covariance matrix has been implemented and successfully tested. It shows significant speed-up compared to a previous FPGA implementation as well as, for lower numbers of dimensions, compared to standard covariance estimation on a CPU. An outlook towards a possible real-time application would be to use the decomposition for a covariance matrix update: $XX^T$ would be updated according to

$$\left(X X^T\right)^{(k)} = \left(X X^T\right)^{(k-1)} + X_{\cdot k} \otimes X_{\cdot k}^{T} \qquad (15)$$

(k denotes the index of the new sample), the sums would be accumulated in parallel, and the covariance values would then be calculated based on these updated values. This could provide an alternative to the most commonly used covariance updating algorithm [12].
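A software sketch of such a running update is given below; the class name and interface are illustrative, and the exact update scheme of [12] is not reproduced here:

```python
import numpy as np

class RunningCovariance:
    """Single-pass running covariance based on the decomposition of eq. 6:
    each new sample updates XX^T (eq. 15) and the running sums in parallel."""

    def __init__(self, m):
        self.xxt = np.zeros((m, m))
        self.sums = np.zeros(m)
        self.n = 0

    def update(self, x):
        self.xxt += np.outer(x, x)   # eq. 15: add the outer product of sample k
        self.sums += x
        self.n += 1

    def covariance(self):
        # eq. 6 evaluated with the current accumulators
        return self.xxt / self.n - np.outer(self.sums, self.sums) / self.n**2
```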