In the “Word-Aligned hybrid compression (WAH)” section, we briefly described how a range query of the form A1∨…∨An, where Ai is a bitmap bin, can be solved iteratively. However, the same problem can be solved in parallel by exploiting independent operations. For example, R1=A1∨A2 and R2=A3∨A4 can be computed simultaneously; an additional step, R1∨R2, then yields the final result. This pattern of processing is called a parallel reduction. Such a reduction transforms a serial \(\mathcal {O}(n)\) time process into an \(\mathcal {O}(\log n)\) algorithm, where n is the number of bins in the query.
Further potential for parallel processing arises from the fact that row operations are independent of one another (e.g., the reduction along row i is independent of the reduction along row i+1). In practice, however, independent processing of rows in compressed bitmaps is very challenging. The difficulty stems from the variable compression achieved by fill atoms. In the sequential-query algorithm this is not a problem, as compressed bit vectors are treated like stacks: only the top atom on the stack is processed, and it is removed only after all of the rows it represents have been exhausted. This approach ensures row alignment. Without additional information, it would be impossible to exploit row independence: when selecting an atom in the middle of a compressed bit vector, its row number cannot be known without first examining the preceding atoms to account for the number of rows compressed in fills.
In the remainder of this section, we present parallel algorithms for processing WAH range queries using GPUs and multi-core CPUs.
GPU decompression strategy
All of our GPU-based range query algorithms rely on the same preparation stage. In this stage, the CPU sends compressed columns to the GPU. As concluded in [12], decompressing bitmaps on the GPU when executing queries is a natural choice, as it reduces the communication costs between CPU and GPU. Once the GPU obtains the compressed columns, it decompresses them in parallel using Algorithm 1. Once decompressed, the bit vectors involved in the query are word-aligned. This alignment makes the bitwise operation on two bit vectors embarrassingly parallel and an excellent fit for the massively parallel nature of GPUs.
The procedure Decompression (Algorithm 1) takes, as input, a set of compressed bit vectors, Cols. This is the set of bins that have been identified as necessary to answer a range query. Decompression processes each bit vector in Cols in parallel, sending each of them to the Decomp function. Ultimately, Decompression returns a set of fully decompressed bit vectors.
The Decomp procedure (Algorithm 2) is a slightly modified version of the decompression algorithm presented by Andrzejewski and Wrembel [11]. One notable implementation difference is that 32-bit words were used in [11, 12], while we use 64-bit words. We also modified their algorithm to exploit data structure reuse. Algorithm 2 takes a single WAH compressed bit vector, CData. It also requires Csize, the size of CData in 64-bit words, and Dsize, the number of 64-bit words required to store the decompressed version of CData.
In Algorithm 2, lines 2 to 8, Decomp generates the DecompSizes array, which is the same size as CData. For each WAH atom in CData, the DecompSizes element at the same index holds the number of decompressed words represented by that atom. The algorithm does this by generating a thread for each atom in CData. If CData[i] is a literal, then thread i writes a 1 to DecompSizes[i], as that atom encodes one word (line 4). If CData[i] holds a fill atom, which is of the form (flag,value,len) (see “Background” section), then thread i writes the number of words compressed by that atom, len, to DecompSizes[i] (line 6).
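This sizing pass maps naturally onto a one-thread-per-atom CUDA kernel. The sketch below assumes the 64-bit atom layout implied above (MSB flag, second MSB fill value, remaining 62 bits of fill length); the identifier names and bit masks are our assumptions, not the authors' code.

```cuda
#include <cstdint>

// One thread per compressed atom: record how many decompressed words the
// atom represents (Algorithm 2, lines 2-8, in spirit).
__global__ void atomSizes(const uint64_t* CData, uint64_t* DecompSizes, int Csize) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= Csize) return;
    uint64_t atom = CData[i];
    if ((atom >> 63) == 0)                          // literal: MSB flag is 0
        DecompSizes[i] = 1;                         // encodes exactly one word
    else                                            // fill: len assumed in the low 62 bits
        DecompSizes[i] = atom & 0x3FFFFFFFFFFFFFFFULL;
}
```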
Next, Decomp performs an exclusive scan (a parallel prefix sum) on DecompSizes, storing the results in StartingPoints (line 9). StartingPoints[i] contains the total number of decompressed words represented by CData[0] to CData[i−1], inclusive. StartingPoints[i] ∗ 63 is the number of the first bitmap row represented in CData[i].
In lines 10 to 13, the EndPoints array is created and initialized. This array has a length of Dsize and is initially filled with 0s. Decomp then processes each element of StartingPoints in parallel, assigning a 1 to EndPoints at index StartingPoints[i]−1 for i<|StartingPoints|. In essence, each 1 in EndPoints marks the last decompressed word represented by an atom. At line 14 another exclusive scan is performed, this time on EndPoints, and the result is saved to WordIndex. WordIndex[i] stores the index of the atom in CData that contains the information for the ith decompressed word.
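Both exclusive scans (Algorithm 2, lines 9 and 14) are standard parallel primitives. The paper does not prescribe a particular scan implementation; as one illustration, they could be expressed with Thrust:

```cuda
#include <cstdint>
#include <thrust/scan.h>
#include <thrust/execution_policy.h>

// Illustrative only: device pointers are assumed; names follow the text.
void exclusiveScans(const uint64_t* DecompSizes, uint64_t* StartingPoints,
                    const uint64_t* EndPoints, uint64_t* WordIndex,
                    int Csize, int Dsize) {
    // StartingPoints[i] = DecompSizes[0] + ... + DecompSizes[i-1]
    thrust::exclusive_scan(thrust::device, DecompSizes, DecompSizes + Csize,
                           StartingPoints);
    // WordIndex[i] = number of atoms that end before decompressed word i,
    // i.e., the index of the CData atom that produces word i.
    thrust::exclusive_scan(thrust::device, EndPoints, EndPoints + Dsize,
                           WordIndex);
}
```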
The final for-loop (lines 15-26) processes every element of WordIndex in parallel. For each WordIndex element, the associated atom is retrieved from CData. If CData[WordIndex[i]] is a literal atom (designated by a 0 in the most significant bit (MSB)), then it is placed directly into DecompData[i]. Otherwise, the atom must be a fill. If it is a fill of zeroes (the second MSB is a zero), then 64 zeroes are assigned to DecompData[i]. If it is a fill of ones, a word consisting of 1 zero (to account for the flag bit) and 63 ones is assigned to DecompData[i]. The resulting DecompData is the fully decompressed bitmap.
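A minimal CUDA sketch of this expansion step, assuming the bit layout above and hypothetical identifier names, is:

```cuda
#include <cstdint>

// One thread per decompressed word (Algorithm 2, lines 15-26, in spirit).
__global__ void expandWords(const uint64_t* CData, const uint64_t* WordIndex,
                            uint64_t* DecompData, int Dsize) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= Dsize) return;
    uint64_t atom = CData[WordIndex[i]];
    if ((atom >> 63) == 0)                     // literal atom: copy as-is
        DecompData[i] = atom;
    else if (((atom >> 62) & 1ULL) == 0)       // fill of zeroes: 64 zero bits
        DecompData[i] = 0ULL;
    else                                       // fill of ones: zero flag bit + 63 ones
        DecompData[i] = 0x7FFFFFFFFFFFFFFFULL;
}
```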
Figure 3 illustrates a thread access pattern for the final stage of Decomp. As shown, CData, the WAH compressed bit vector, is composed of three literal atoms (L0, L1, and L2) and two fill atoms (the shaded sectors).
For each literal, Decomp uses a thread that writes the value portion of the atom to the DecompData bit vector. Fill atoms need as many threads as the number of compressed words the atom represents. For example, consider the first fill in CData: it encodes a run of three words of 0. Decomp creates three threads, all reading the same compressed word but writing 0 to three different locations in DecompData. If a run of 1s had been encoded, a value of 0x7FFFFFFFFFFFFFFF would have been written instead of 0.
GPU range query execution strategies
Here we present four methods for executing range queries in parallel on GPUs: the column-oriented access (COA), row-oriented access (ROA), hybrid, and ideal hybrid access approaches. These approaches are analogous to structure-of-arrays, array-of-structures, and blends thereof. Structure-of-arrays and array-of-structures approaches have been used successfully to accelerate scientific simulations on GPUs [17, 18], but differ in how data is organized and accessed, which can impact GPU efficiency.
Column-oriented access (COA)
Our COA approach to range query processing is shown in Algorithm 3. The COA procedure takes a collection of decompressed bit vectors needed to answer the query and performs a column-oriented reduction on them. At each level of the reduction, the bit vectors are divided into two equal groups: low-order vectors and high-order vectors. The s variable in Algorithm 3 stores the divide position (lines 5 and 14). During processing, the first low-order vector is paired with the first high-order vector, the second with the second, and so on (lines 8 and 9). The bitwise operation is performed between these pairs. To increase memory efficiency, the result of the query operation is written back to the low-order column (Algorithm 3, line 11). The process is then repeated using only the low-order half of the bit vectors as input until a single decompressed bit vector remains. This final bit vector contains the result and can be copied back to the CPU.
Figure 4a shows the COA reduction pattern for a range query across bit vectors 0 through 3. A 1-dimensional thread grid is assigned to process each pair of bit vectors. Note that multiple thread blocks are used within the grid, as a single GPU thread block cannot span the full length of a decompressed bit vector. Figure 4b shows how the thread grid spans two columns and illustrates the inner workings of a thread block. As shown, a thread block encompasses 1024 matched 64-bit word pairs from two columns. A thread is assigned to each pair of words. Each thread performs the OR operation on its word pair and writes the result back to the operand word location in the low-order column. As each thread block has access to only a very limited amount of shared memory (96 kB for the GPU used in this study), and since each round of the COA reduction requires the complete result of the column pairings, all of COA's memory reads and writes must go to global memory. Specifically, given a range query of m bit vectors, each with n rows, and a system word size of w bits, the COA approach performs \((2m-2)\frac {n}{w}\) coalesced global memory reads and \((m-1)\frac {n}{w}\) coalesced global memory writes on the GPU.
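The sketch below illustrates this access pattern: a 1-D grid ORs one high-order column into its matching low-order column, and a host-side driver halves the set of active columns each level. The names, the host-side vector of device column pointers, and the handling of odd column counts are our assumptions rather than the authors' implementation.

```cuda
#include <cstdint>
#include <vector>

// One thread per 64-bit word pair; the result is written back to the
// low-order column in global memory (coalesced reads and writes).
__global__ void orColumns(uint64_t* lowCol, const uint64_t* highCol, int nWords) {
    int w = blockIdx.x * blockDim.x + threadIdx.x;   // word index within the column
    if (w < nWords)
        lowCol[w] |= highCol[w];
}

// Host-side driver: repeat until only column 0, the query result, remains.
// d_cols holds device pointers to the decompressed columns.
void coaReduce(const std::vector<uint64_t*>& d_cols, int nWords) {
    int n = (int)d_cols.size();
    dim3 block(1024), grid((nWords + 1023) / 1024);
    while (n > 1) {
        int s = (n + 1) / 2;                         // divide position for this level
        for (int i = 0; i < n - s; ++i)              // independent pairs of this level
            orColumns<<<grid, block>>>(d_cols[i], d_cols[i + s], nWords);
        n = s;                                       // keep only the low-order half
    }
}
```

On the default stream the per-pair launches of a level serialize, although each launch already exposes one thread per word pair; independent streams could be used to overlap pairs within a level.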
Row-oriented access (ROA)
Algorithm 4 presents our ROA approach to range query processing. Because all rows are independent, they can be processed in parallel. To accomplish this, ROA uses many 1-dimensional thread blocks that are arranged to create a one-to-one mapping between thread blocks and rows (Algorithm 4, line 5).
This data access pattern is shown in Fig. 5. The figure represents the query C0∨C1∨C2∨C3, where Cx is a decompressed bit vector. As shown, the individual thread blocks are represented by rectangles with perforated borders. Unlike COA, where thread blocks only span two columns, the ROA thread blocks span all columns of the query (up to 2048, i.e., 2× the maximum number of threads in a thread block).
Inside any given ROA thread block, the column access pattern is identical to the COA pattern (Algorithm 4, lines 8-11). The words of the row are partitioned into low-order and high-order sets by column ID. Each thread performs a bitwise OR on a word pair, where one operand word is from the low-order columns and the other is from the high-order set (shown in the thread block of Fig. 5). The result of the operation is written back to the low-order word.
Like COA, a ROA reduction has log2(n) levels, where n is the number of bit vectors in the query. However, all ROA processing is limited in scope to a single row. By operating along rows, the ROA approach loses coalesced global memory accesses, as row data are not contiguous in memory. However, for the majority of queries, the number of bit vectors is significantly less than the number of words in a bit vector. This means that ROA can use low-latency GPU shared memory (up to 96 kB) to store the row data and intermediate results necessary for performing the reduction. Using shared memory for the reduction avoids repeated reads and writes to high-latency global memory (∼100× slower than shared memory). Given a range query of m bins, each with n rows, and a system word size of w bits, the ROA approach performs \(\frac {mn}{w}\) global memory reads and \(\frac {n}{w}\) global memory writes, a significant reduction in both relative to COA.
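The following sketch shows one way an ROA thread block could stage its row in shared memory and reduce it; the identifiers, the dynamic shared-memory layout, and the handling of odd column counts are assumptions rather than the authors' code.

```cuda
#include <cstdint>

// One thread block per 64-bit row word. cols is a device-resident array of
// device pointers to the decompressed columns.
__global__ void roaQuery(uint64_t** cols, int nCols) {
    extern __shared__ uint64_t row[];        // one word per column for this row
    int r = blockIdx.x;                      // row-word index
    int t = threadIdx.x;

    // Stage this row's words into low-latency shared memory
    // (one global read per column: m*n/w reads overall).
    for (int c = t; c < nCols; c += blockDim.x)
        row[c] = cols[c][r];
    __syncthreads();

    // COA-style pairwise reduction, confined to shared memory.
    for (int s = (nCols + 1) / 2; nCols > 1; nCols = s, s = (nCols + 1) / 2) {
        if (t < nCols - s)
            row[t] |= row[t + s];
        __syncthreads();
    }
    if (t == 0)
        cols[0][r] = row[0];                 // single global write per row word
}
```

A launch would use one block per row word and nCols × sizeof(uint64_t) bytes of dynamic shared memory, e.g., roaQuery<<<nWords, 512, nCols * sizeof(uint64_t)>>>(d_cols, nCols) for queries of up to roughly a thousand columns.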
Hybrid
We form the hybrid approach to range query processing by combining the 1-dimensional COA and ROA data access patterns into 2-dimensional thread blocks. These 2D thread blocks are tiled to provide complete coverage of the query data. An example tiling is shown in Fig. 6. To accomplish this tiling the hybrid method uses a thread grid of p×q thread blocks, where p and q are integers. Each thread block is composed of k×j threads and spans 2k columns and j rows, where k and j are integers. With this layout, each thread block can use the maximum of 1024 threads.
A single thread block in the hybrid approach performs the same work as multiple ROA thread blocks stacked vertically. A major difference is that thread blocks in the hybrid approach do not span all bit vectors. Using these 2-dimensional thread blocks gives the hybrid approach the advantages of both COA's coalesced memory accesses and ROA's use of GPU shared memory to process the query along rows. The disadvantage of the hybrid approach is that the lowest-order column of each thread block along the rows must undergo a second round of the reduction process to obtain the final result of the range query. This step combines the answers of the individual thread block tiles.
The hybrid process is shown in Algorithm 5, where the first round of reductions is on lines 8-20 and the second round is on lines 22-34.
Due to the architectural constraints of NVIDIA GPUs, the hybrid design is limited to processing range queries of ≤ \(2^{22}\) bins. This is far beyond the scope of typical bitmap range queries and GPU memory capacities. Given a range query of m bins, each with n rows, a system word size of w, and k thread blocks needed to span the bins, up to \((m+k)\frac {n}{w}\) global memory reads and \((k+1)\frac {n}{w}\) global memory writes are performed. Although the hybrid approach requires more global memory reads and writes than the ROA approach, its use of memory coalescing can enhance the potential for computational throughput.
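A hedged sketch of the hybrid first round (Algorithm 5, lines 8-20) is shown below. Each thread block stages a tile of j consecutive row words from 2k columns into shared memory, reduces the tile's columns to one, and writes the partial result back to the tile's lowest-order column; tiles sharing the same rows are combined by the second round (not shown). The thread-index mapping (threadIdx.x walking consecutive words of one column to obtain coalesced loads), the shared-memory layout, and all identifiers are our assumptions rather than the authors' implementation.

```cuda
#include <cstdint>

// blockDim = (j, k); launched with 2*k*j*sizeof(uint64_t) bytes of dynamic
// shared memory. cols is a device-resident array of column pointers.
__global__ void hybridFirstRound(uint64_t** cols, int nCols, int nWords) {
    extern __shared__ uint64_t tile[];             // 2k columns x j row words
    int j = blockDim.x, k = blockDim.y;
    int row = blockIdx.x * j + threadIdx.x;        // row word handled by this thread
    int colBase = blockIdx.y * 2 * k;              // first column of this tile
    int span = nCols - colBase;                    // columns covered by this tile
    if (span > 2 * k) span = 2 * k;

    // Coalesced loads: consecutive threadIdx.x read consecutive words of the
    // same column.
    for (int c = threadIdx.y; c < span; c += k)
        if (row < nWords)
            tile[c * j + threadIdx.x] = cols[colBase + c][row];
    __syncthreads();

    // Pairwise reduction of the tile's columns in shared memory.
    for (int s = (span + 1) / 2; span > 1; span = s, s = (span + 1) / 2) {
        if (threadIdx.y < span - s && row < nWords)
            tile[threadIdx.y * j + threadIdx.x] |=
                tile[(threadIdx.y + s) * j + threadIdx.x];
        __syncthreads();
    }
    if (threadIdx.y == 0 && row < nWords)
        cols[colBase][row] = tile[threadIdx.x];    // tile's partial result (coalesced)
}
```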
Ideal hybrid
In practice, most WAH range queries involve fewer than 1024 columns. This means that in most query scenarios it is possible to map a single 2-dimensional thread block tile across multiple rows and all of the columns (bit vectors) of the query. This purely vertical tiling is shown in Fig. 7. Such a tiling improves throughput by allowing each thread block to comprise the maximum of 1024 threads. Like the hybrid method, each thread block retains the advantages of coalesced memory accesses and the use of GPU shared memory. Further, this tiling pattern eliminates the need for a second round of reduction. The result of this arrangement is the ideal hybrid algorithm, described in Algorithm 6.
The theoretical expressions for global reads and writes in the hybrid algorithm imply that an “ideal” hybrid layout is one where a single thread block of k×j threads spans all 2k columns. Multiple k×j thread blocks are still used to span all of the rows. This layout limits the first round to writing a single result column and removes the need to perform the second reduction between thread blocks along rows. For processing a range query of m bins, each with n rows, and a system word size of w, the ideal hybrid layout thereby reduces the total number of global memory reads and writes to \(\frac {mn}{w}\) and \(\frac {n}{w}\), respectively. These are the same quantities obtained for ROA, but the ideal hybrid method guarantees a higher computational throughput, as each k×j thread block contains the full 1024 threads.
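In terms of the hybridFirstRound sketch above, the ideal layout corresponds to choosing the block shape so that one tile spans every column of the query and gridDim.y = 1, making each tile's partial result the final answer for its rows. The launch sketch below, with its identifiers and the assumption that the query has at most 2048 columns, is illustrative rather than the authors' code.

```cuda
#include <cstdint>

// Ideal hybrid launch (Algorithm 6 in spirit): a single tile spans all
// columns, so no second reduction round is needed and column 0 holds the
// complete range-query result afterwards.
void idealHybridLaunch(uint64_t** d_cols, int nCols, int nWords) {
    int k = (nCols + 1) / 2;            // half the columns: one tile spans all of them
    int j = 1024 / k;                   // use as many of the 1024 threads as k allows
    dim3 block(j, k);                   // blockDim.x = j rows, blockDim.y = k column pairs
    dim3 grid((nWords + j - 1) / j, 1); // tiles stacked only along the rows
    size_t shmem = size_t(2) * k * j * sizeof(uint64_t);
    hybridFirstRound<<<grid, block, shmem>>>(d_cols, nCols, nWords);
}
```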
Multi-core CPU methods
For an experimental baseline, we created CPU-based parallel algorithms for processing range queries. Most multi-core CPUs cannot support the number of concurrent operations needed to fully exploit all of the available parallelism in WAH bitmap query processing. For this reason, we limited the CPU algorithms to two approaches: 1) a baseline approach that iterates through bit vectors to execute a query, and 2) a COA-style reduction approach.
Given an np-core CPU, approach 1 uses OpenMP [19] to execute up to np parallel bitwise operations on paired compressed bit vectors. Once a set of paired bit vectors is processed, the CPU iterates to execute up to np parallel bitwise operations on the result and the next remaining bit vector to process.
Approach 2 uses OpenMP to execute up to np parallel bitwise operations on paired compressed bit vectors for any reduction level. If more than np bit vector pairs exist in a given reduction level, the CPU must iterate until all pairs are processed and the reduction level is complete. The range-query result is obtained once the final reduction level is processed. The pattern of the CPU reduction process is similar to the COA pattern shown in Fig. 4a.
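As one concrete illustration of approach 2, the sketch below uses an OpenMP parallel for to run the independent column pairs of each reduction level concurrently, up to np at a time on an np-core CPU. The vector type and the word-wise orInto stand-in are our assumptions; in the paper's setting the operands are WAH-compressed and the pairwise OR is the compressed-domain algorithm described in the “Background” section.

```cpp
#include <omp.h>
#include <cstdint>
#include <vector>

using BitVec = std::vector<uint64_t>;

// Stand-in for the pairwise WAH bitwise OR over compressed bit vectors.
static void orInto(BitVec& dst, const BitVec& src) {
    for (size_t w = 0; w < dst.size(); ++w)
        dst[w] |= src[w];
}

BitVec cpuCoaReduce(std::vector<BitVec> cols) {
    for (size_t n = cols.size(); n > 1; ) {
        size_t s = (n + 1) / 2;           // divide position for this level
        #pragma omp parallel for          // up to np independent pairs at once
        for (long long i = 0; i < (long long)(n - s); ++i)
            orInto(cols[i], cols[i + s]);
        n = s;                            // only the low-order half advances
    }
    return cols[0];                       // range-query result
}
```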