Optimally selecting the top k values from X + Y with layer-ordered heaps

Selection and sorting the Cartesian sum, X + Y, are classic and important problems. Here, a new algorithm is presented, which generates the top k values of the form Xi+Yj. The algorithm relies on layer-ordered heaps, partial orderings of exponentially sized layers. The algorithm relies only on median-of-medians and is simple to implement. Furthermore, it uses data structures contiguous in memory, cache efficient, and fast in practice. The presented algorithm is demonstrated to be theoretically optimal.


INTRODUCTION
Given two vectors of length n, X and Y, top-k on X + Y finds the k smallest values of the form X i + Y j . Note that this problem definition is presented w.l.o.g.; X and Y need not share the same length. Top-k is important to practical applications, such as selecting the most abundant k isotopologue peaks from a compound (Kreitzberg et al., 2020). Top-k is ∈ Ω(n + k), because loading the vectors is ∈ Θ(n) and returning the minimal k values is ∈ Θ(k).

Naive approach
Top-k can be solved trivially in Oðn 2 logðnÞ þ kÞ ¼ Oðn 2 logðnÞÞ steps by generating and sorting all n 2 values of the form X i + Y j . By using median-of-medians (Blum et al., 1973), this can be improved to O(n 2 ) steps by generating all n 2 values and performing k-selection on them.

Existing, tree-based methods for top-k
In 1982, Frederickson & Johnson introduced a method reminiscent of median-of-medians (Blum et al., 1973); their method selects only the k th minimum value from X + Y in O n þ minðn; kÞ log k minðn;kÞ Kaplan et al. described an alternative method for selecting the k th smallest value (Kaplan et al., 2019); that method explicitly used Chazelle's soft heaps (Chazelle, 2000). By heapifying X and Y in linear time (i.e., guaranteeing w.l.o.g. that X i X 2i ; X 2iþ1 ), min i,j X i + Y j = X 1 + Y 1 . Likewise, X i þ Y j X 2i þ Y j ; X 2iþ1 þ Y j ; X i þ Y 2j ; X i þ Y 2jþ1 . The soft heap is initialized to contain tuple (X 1 + Y 1 ,1,1). Then, as tuple (v,i,j) is popped from soft heap, lower-quality tuples are inserted into the soft heap. These lower-quality tuples of (i,j) are fð2i; 1Þ; ð2i þ 1; 1Þ; ði; 2Þ; ði; 3Þg; j ¼ 1 fði; 2jÞ; ði; 2j þ 1Þg; j > 1: In the matrix X i + Y j (which is not realized), this scheme progresses in row-major order, thereby avoiding a tuple being added multiple times.
To compute the k th smallest value from X + Y, the best k values are popped from the soft heap. Even though only the minimal k values are desired, "corruption" in the soft heap means that the soft heap will not always pop the minimal value; however, as a result, soft heaps can run faster than the ðn logðnÞÞ lower bound on comparison sorting. The free parameter e 2 ð0; 1 2 bounds the number of corrupt elements in the soft heap (which may be promoted earlier in the queue than they should be) as ≤ t · ε, where t is the number of insertions into the soft heap thus far. Thus, instead of popping k items (and inserting their lower-quality dependents as described in Eq. (1)), the total number of pops, p, can be found: The maximal size of the soft heap after p pops is ≤ 3p (because each pop removes one element and inserts ≤ 4 elements according to Eq. (1)); therefore, p − corruption ≥ p − 4p· ε, and thus p À 4p Á e ! k guarantees that p À corruption ! k. This leads to p ¼ k 1À4e , e 1 4 . This guarantees that Θ(k) values, which must include the minimal k values, are popped. These values are post-processed to retrieve the minimal k values via linear time one-dimensional selection (Blum et al., 1973). For constant ε, both pop and insertion operations to the soft heap are 2Õð1Þ, and thus the overall runtime of the algorithm is ∈ O(n + k).
Note that the Kaplan et al. method easily solves top-k in O(n + k) steps; this is because computing the k th smallest value from X + Y pops the minimal k values from the soft heap.
Layer-ordered heaps and a novel selection algorithm on X + Y This paper uses layer-ordered heaps (LOHs) (Kreitzberg et al., 2020) to produce an optimal selection algorithm on X + Y. LOHs are stricter than heaps but not as strict as sorting: Heaps guarantee only that X i ≤ X children(i) , but do not guarantee any ordering between one child of X i , a, and the child of the sibling of a. Sorting is stricter still, but sorting n values cannot be done faster than log 2 ðn!Þ 2 ðn logðnÞÞ. LOHs partition the array into several layers such that the values in a layer are ≤ to the values in subsequent layers: 1 , X (u) 2 , … ≤ X (u + 1) . The size of these layers starts with |X (1) | = 1 and grows exponentially such that lim u!1 jX ðuþ1Þ j jX ðuÞ j ¼ a ! 1 (note that a = 1 is equivalent to sorting because all layers have size 1). By assigning values in layer u children from layer u + 1, this can be seen as a more constrained form of the heap; however, unlike sorting, for any constant a > 1, LOHs can be constructed ∈ O(n) by performing iterative linear time one-dimensional selection, iteratively selecting and removing the largest layer until all layers have been partitioned. For example, 8,1,6,4,5,3,2 can be LOHified with a = 2 into an LOH with three layers (1 ≤ 3,2 ≤ 8,4,6,5) by first selecting the largest 4 values on the entire list (8,4,6,5), removing them, and then selecting the largest 2 values from the remaining 3 values (3,2).
Although selections reminiscent of LOHs may have been used previously, formalization of rank α LOHs has been necessary to demonstrate that for 1 ( a ( 2, a combination of LOHs and soft heaps allow generating the minimum k values from X 1 + X 2 + ⋯ + X m (where each X i has length n) in oðn Á m þ k Á mÞ (Kreitzberg, Lucke & Serang, 2020). Furthermore, efficiently constructing an LOH of rank α is not trivial when a ( 2; after all, a ! 1 results in layers of size |X (1) | = |X (2) | = ⋯ = 1, indicating a sorting, which implies a runtime ∈ (nlog(n)) (Pennington et al., 2020).
A python implementation of a LOH is shown in listing 1.

Contribution in this manuscript
The new, optimal algorithm for solving top-k presented here makes extensive use of LOHs. It is simple to implement, does not rely on anything more complicated than linear time one-dimensional selection (i.e., it does not use soft heap). Due to its simplicity and contiguous memory access, it has a fast performance in practice.

METHODS Algorithm
The algorithm presented is broken into phases. An illustration of these phases is provided in Fig. 1.

Phase 0
The algorithm first LOHifies (i.e., constructs a layer order heap from) both X and Y. This is performed by using linear time one-dimensional selection to iteratively remove the largest remaining layer (i.e., the simplest LOH construction method, which is optimal when a ) 1).

Phase 1
Now layer products of the form 1 ; . . . are considered, where X (u) and Y (v) are layers of their respective LOHs.
In phases 1-2, the algorithm initially considers only the minimum and maximum values in each layer product: bðu; vÞc ¼ ðminðX ðuÞ þ Y ðvÞ Þ; ðu; vÞ; 0Þ, dðu; vÞe ¼ ðmaxðX ðuÞ þ Y ðvÞ Þ; ðu; vÞ; 1Þ. It is unnecessary to compute the Cartesian product of values to build a layer product; instead, only the minimum or maximum values in X (u) and Y (v) are needed. Note that the final value in the tuple uses 0 to indicate that this is the minimum value in the layer product or 1 to indicate the maximum value in the layer product; this ensures that even layer products with homogeneous values satisfy bðu; vÞc < dðu; vÞe. Scalar values can be compared to tuples: X i þ Y j dðu; vÞe ¼ ðmaxðX ðuÞ þ Y ðvÞ Þ; ðu; vÞ; 1Þ $ X i þ Y j maxðX ðuÞ þ Y ðvÞ Þ.
Binary heap H is initialized to contain tuple bð1; 1Þc. A set of all tuples in H is maintained to prevent duplicates from being inserted into H (this set could be excluded by using the Kaplan et al. proposal scheme). The algorithm proceeds by popping the lexicographically minimum tuple from H. W.l.o.g., there is no guaranteed ordering of the form X (u) + Y (v) ≤ X (u + 1) + Y (v) , because it may be that max(X (u) + Y (v) ) > min(X (u + 1) + Y (v) ); however, lexicographically, bðu; vÞc < bðu þ 1; vÞc; bðu; v þ 1Þc; dðu; vÞe; thus, the latter tuples need be inserted into H only after ⌊(u,v)⌋ has been popped from H. Note that for this reason and to break ties where layer products contain identical values, (u,v) are included in the tuple. ⌈(u,v)⌉ tuples do not insert any new tuples into H when they're popped.
Whenever a tuple of the form ⌈(u,v)⌉ is popped from H, the index (u,v) is appended to list q and the size of the layer product |X (u) + Y (v) | = |X (u) |·|Y (v) | is accumulated into integer s. This method proceeds until s ≥ k.

Phase 2
Any remaining tuple in H of the form ðmaxðX ðu 0 Þ þ Y ðv 0 Þ Þ; ðu 0 ; v 0 Þ; 1Þ has its index (u′,v′) appended to list q. s′ is the total number of elements in each of these (u′,v′) layer products appended to q during phase 2. 1 2,4 5,8,7,9 12,26,40,43,14,46,20,49 0 1,5 6,8,7,9 11,41,31,26,33,39,20 Max corner sizes to exceed k is s=17 Final max corner visited has value τ=10  ,5,11,7,33,6,39,42,20,0,9,1,41,26,8} and Y = {12,26,40,9,14,49,8,2,20,1,46,43,4,5,7} are both LOHified to axes in O(n) time. Note that the minimum and maximum values in a layer are placed at the first and last positions in the layer, respectively; otherwise values within layers are themselves unordered. Phase 1: The minimum and maximum corners of all layer products (grid) are visited together in ascending order until the area of the layer products whose max corners are visited exceeds k (inset), and the largest value visited is labeled as τ = 10. Phase 2: The layer products whose max corners have been visited (blue) has area s that exceeds k but has s ∈ O(k). Likewise, the layer products whose min corners have been visited but whose max corners have not been visited, and which therefore contain some elements < τ, have area s′ ∈ O(k). Phase 3 The values from every element in each layer product in q are generated. A linear time onedimensional k-selection is performed on these values and returned.

Proof of correctness
Lemma 2.4 proves that at termination all layer products found in q must contain the minimal k values in X + Y. Thus, by performing one-dimensional k-selection on those values in phase 3, the minimal k values in X + Y are found.
Thus, all tuples are popped in ascending order. □ Lemma 2.4 At the end of phase 2, the layer products whose indices are found in q contain the minimal k values. Proof. Let (u,v) be the layer product that first makes s≥ k. There are at least k values of X + Y that are maxðX ðuÞ þ Y ðvÞ Þ; this means that s ¼ ðselectðX þ Y; kÞÞ maxðX ðuÞ þ Y ðvÞ Þ. The quality of the elements in layer products in q at the end of phase 1 can only be improved by trading some value for a smaller value, and thus require a new value <max(X (u) + Y (v) ).
By lemma 2.3, tuples will be popped from H in ascending order; therefore, any layer product ðu 0 ; v 0 Þ containing values < maxðX ðuÞ þ Y ðvÞ Þ must have had bðu 0 ; v 0 Þc popped before dðu; vÞe. If dðu 0 ; v 0 Þe was also popped, then this layer product is already included in q and cannot improve it. Thus the only layers that need be considered further have had bðu 0 ; v 0 Þc popped but not dðu 0 ; v 0 Þe popped; these can be found by looking for all dðu 0 ; v 0 Þe that have been inserted into H but not yet popped.
Phase 2 appends to q all such remaining layer products of interest. Thus, at the end of phase 2, q contains all layer products that will be represented in the k-selection of X + Y. □ A python implementation of this method is shown in listing 2.

Runtime
Theorem 2.8 proves that the total runtime is ∈ O(n + k).
Proof. (u,v) is the layer product whose inclusion during phase 1 in q achieves s ≥ k; therefore, s À jX ðuÞ þ Y ðvÞ j < k. This happens when dðu; vÞe is popped from H.
When u′ = v′ = 1, the layer product contains 1 element. Therefore, s′, the total number of elements found in layer products appended to q during phase 2, has s 0 ða 2 þ 2aÞ Á s þ 1. By lemma 2.6, s 2 OðkÞ, and thus s′∈ O(k). □ Theorem 2.8 The total runtime of the algorithm is ∈ O(n + k).
Proof. For any constant α > 1, LOHification of X and Y runs in linear time, and so phase 0 runs ∈ O(n).
The total number of layers in each LOH is ≈ log α (n); therefore, the total number of layer products is % log 2 a ðnÞ. In the worst-case scenario, the heap insertions and pops (and corresponding set insertions and removals) will sort ≈ 2 log 2 α (n) elements, because each layer product may be inserted as both bÁc or dÁe; the worst-case runtime via comparison sort will be 2 Oðlog 2 a ðnÞ logðlog 2 a ðnÞÞÞ & oðnÞ. The operations to maintain a set of indices in the heap have the same runtime per operation as those inserting/removing to a binary heap, and so can be amortized out. Thus, the runtimes of phases 1-2 are amortized out by the O(n) runtime of phase 0. Lemma 2.6 shows that s∈ O(k). Likewise, lemma 2.7 shows that s′∈ O(k). The number of elements in all layer products in q during phase 3 is s + s′∈ O(k). Thus, the number of elements on which the one-dimensional selection is performed will be 2 OðkÞ. Using a linear time one-dimensional selection algorithm, the runtime of the k-selection in phase 3 is ∈ O(k).
The total runtime of all phases ∈ O(n + k + k + k) = O(n + k). □

RESULTS
Runtimes of the naive O(n 2 log(n) + k) method (chosen for reference because it is the easiest method to implement and because of the fast runtime constant on python's built-in sorting routine), the soft heap-based method from Kaplan et al., and the LOH-based method in this paper are shown in Table 1. The proposed approach achieves a >295× speedup over the naive approach and >18× speedup over the soft heap approach. LOHs are more lightweight than soft heaps, including contiguous memory access patterns and far fewer pointer dereferences than soft heaps.

DISCUSSION
The algorithm can be thought of as "zooming out" as it pans through the layer products, thereby passing the value threshold at which the k th best value X i + Y j occurs. It is somewhat reminiscent of skip lists (Pugh, 1990); however, where a skip list begins coarse and progressively refines the search, this approach begins finely and becomes progressively coarser. The notion of retrieving the best k values while "overshooting" the target by as little as possible results in some values that may be considered but which will Table 1 Average runtimes (in seconds) on random uniform integer X and Y with |X| = |Y| = n. The layer-ordered heap implementation used α = 2 and resulted in s + s′/k = 3.637 on average. Individual and total runtimes are rounded to three significant figures. not survive the final one-dimensional selection in phase 3. This is reminiscent of "corruption" in Chazelle's soft heaps. Like soft heaps, this method eschews sorting in order to prevent a runtime 2 ðn logðnÞÞ or ∈ ω(k log(k)). But unlike soft heaps, LOHs can be constructed easily using only an implementation of median-of-medians (or any other linear time one-dimensional selection algorithm). Phase 3 is the only part of the algorithm in which k appears in the runtime formula. This is significant because the layer products in q at the end of phase 2 could be returned in their compressed form (i.e., as the two layers to be combined). The total runtime of phases 0-2 is ∈ O(n). It may be possible to recursively perform X + Y selection on layer products X (u) + Y (v) to compute layer products constituting exactly the k values in the solution, still in factored Cartesian layer product form. Similarly, it may be possible to perform the one-dimensional selection without fully inflating every layer product into its constituent elements. For some applications, a compressed form may be acceptable, thereby making it plausible to remove the requirement that the runtime be ∈ ω(k).
As noted in theorem 2.8, even fully sorting all of the minimum and maximum layer products would be 2 oðnÞ; sorting in this manner may be preferred in practice, because it simplifies the implementation (Listing 3) at the cost of incurring greater runtime in practice when k ( n 2 . Furthermore, listing 3 is unsuitable for online processing (i.e., where X and Y are extended on the fly or where several subsequent selections are performed), whereas listing 2 could be adapted to those uses.
Phase 0 (which performs LOHification) is the slowest part of the presented python implementation; it would benefit from having a practically faster implementation to perform LOHify.
The fast practical performance is partially due to the algorithm's simplicity and partially due to the contiguous nature of LOHs. Online data structures like soft heap are less easily suited to contiguous access, because they support efficient removal and therefore move pointers to memory rather than moving the contents of the memory.
The choice of α affects performance through the cost of LOHifying and the amount by which the number of generated values overshoots the k minimum values wanted: when α ≈ 1, LOHify effectively sorts X and Y, but generates few extra values; a ) 1, LOHify has a linear runtime, but generates more extra values, which need to be removed by the final k-selection.

CONCLUSION
LOHs can be constructed in linear time and used to produce a theoretically optimal algorithm for selecting the minimal k values from X + Y. The new optimal algorithm presented here is faster in practice than the existing soft heap-based optimal algorithm.

APPENDIX Python code
Listing 1. LayerOrderedHeap.py: A class for LOHifying, retrieving layers, and the minimum and maximum value in a layer.