Best Performance for Array Intersection in PHP

Among some PHP aficionados, this is probably already well known, but it was news to me: array_intersect_key is vastly more efficient than array_intersect. Even when intersecting one-dimensional arrays, it turns out to be much faster to artificially turn them into associative arrays with array_combine first, just to avoid the slowness of array_intersect.
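As a minimal sketch of the swap (with made-up data), the same intersection can be computed either way; `array_combine` turns each list into an associative array whose keys are the values themselves, so that `array_intersect_key` can do the work:

```php
<?php
// Two plain integer lists (hypothetical data).
$a = [1, 2, 3, 5, 8];
$b = [2, 3, 8, 13];

// Slow path: value-based intersection.
$slow = array_intersect($a, $b);

// Fast path: promote each list to an associative array whose keys are
// the values themselves, then intersect on keys.
$fastAssoc = array_intersect_key(array_combine($a, $a), array_combine($b, $b));
$fast = array_keys($fastAssoc); // recover a plain list of values
```

Both paths yield the common values 2, 3 and 8; the only difference is how the lookup is performed internally.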

One of the tasks I need done on a fairly regular basis requires computing all possible paths connecting a given set of points in a hypercube, where the paths are characterized by each point on the path being separated from the next point by a Hamming distance of one. The details aren’t important, but the initial set of points lives in a MySQL database, and what I need to do via PHP is to figure out all the different paths which can be created between those points by changing the value along one axis at a time.

This job sucks up a fair bit of computing power, and recently I set out to parallelize it so I could spread the load over more than one processor core, while also improving the efficiency of each thread. Among other things, this involves an initial pass at the database to extract a set of unique IDs for each possible value along each of several axes, throwing out any sets that are empty. That part is straightforward and only has to be done once, and fortunately we don’t have to go back to the database again, except to store results. (Performing any of the path computations using MySQL itself, via cleverly constructed SQL queries and COUNT, is entirely pointless: even with diligent indexing, database access is far too slow for that to be worth considering.)

Having extracted what we need from the database, we are left with a set of sets of integers, and computing paths is a relatively simple but also relatively time-consuming matter of performing a sequence of intersections on those sets. For example, suppose ID 1 lies at the point specified by values of ‘red’ and ’round’ for the two axes ‘colour’ and ‘shape’, while ID 2 lies at ‘red’ and ‘square’. Then our set corresponding to red will include {1,2}, while our set for round will include {1} and our set for square will include {2}. Intersecting {1,2} and {1} tells us that there is a path from red that involves round, while intersecting {1,2} and {2} tells us that there is a path from red that involves square.
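The red/round/square example might be sketched like this (the variable names and the key-equals-value encoding are my own illustration, not the actual production code):

```php
<?php
// Hypothetical ID sets extracted from the database, keyed by axis value.
// ID 1 is red and round; ID 2 is red and square. Each set is stored as
// key => key so that intersections can run on keys.
$red    = [1 => 1, 2 => 2];
$round  = [1 => 1];
$square = [2 => 2];

// A non-empty intersection means a path exists between the two values.
$redRound  = array_intersect_key($red, $round);
$redSquare = array_intersect_key($red, $square);
```

Here `$redRound` comes back non-empty (it contains ID 1), telling us there is a path from red involving round, and likewise `$redSquare` contains ID 2.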

Again, the details are not important, but in terms of numbers of potential paths through the hypercube, the problem blows up quickly. Even if we have only a few thousand points stored in the database, with just a few different axes (i.e., database columns) and corresponding sets of possible values, the number of potential paths to be evaluated rapidly zooms through the hundreds of millions, on to tens of billions, and beyond.

What is important is that whichever way you slice it, if you’re having to perform large numbers of intersections, the speed of each one carries quite a bit of weight in determining the overall computational performance. And what I was astonished to discover was that array_intersect is vastly slower than array_intersect_key, by at least an order of magnitude. From what I can gather from a few Google searches trying to figure out why the code was so unbearably slow (including Stack Overflow discussions here, here and here), array_intersect scales like O(m × n) on each and every run, where m and n are the sizes of the arrays, regardless of whether the arrays are already sorted, as if the underlying code were running an O(n) linear search for each element. As I understand it, some array functions in PHP are implemented with O(1) hash-table lookups, but apparently this one is not. By contrast, array_intersect_key does something vastly more efficient that apparently does make use of hash tables.
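A rough micro-benchmark along these lines (the array sizes and iteration count here are arbitrary, and absolute timings will vary by machine and PHP version) might look like:

```php
<?php
// Two overlapping integer ranges, plain and key-mapped versions.
$m = range(0, 9999);
$n = range(5000, 14999);
$mAssoc = array_combine($m, $m);
$nAssoc = array_combine($n, $n);

// Time the value-based intersection.
$t = microtime(true);
for ($i = 0; $i < 10; $i++) {
    array_intersect($m, $n);
}
$plain = microtime(true) - $t;

// Time the key-based intersection on the pre-built associative arrays.
$t = microtime(true);
for ($i = 0; $i < 10; $i++) {
    array_intersect_key($mAssoc, $nAssoc);
}
$keyed = microtime(true) - $t;

printf("array_intersect: %.4fs, array_intersect_key: %.4fs\n", $plain, $keyed);
```

Note that building the associative arrays with array_combine is done once, outside the timed loops, mirroring the approach described below of doing the conversion at set-creation time.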

The funny thing is that in the hypercube example, our sets are only one-dimensional anyway, so making use of array_intersect_key requires artificially turning them into associative arrays with array_combine. Even with the overhead of doubling their size, array_intersect_key is still vastly faster. Of course, I only do that once, at the stage of creating the sets in the first place, but a quick just-for-fun test suggested that it was still faster when doing it every single time an intersection was performed. But if that’s the case, then why wouldn’t array_intersect itself just do that first? Who knows; maybe there is something after all that is locked up in the details which I keep saying are unimportant!

In any case, the moral of the story is that if you ever find yourself needing to perform a large number of intersections in PHP, it’s very likely to be worth your while to swap out array_intersect in favour of array_intersect_key, even if that means introducing an artifice like turning your one-dimensional arrays into associative arrays.
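For instance, the swap could be wrapped in a hypothetical drop-in helper like the one below (this assumes PHP 7.4+ for arrow functions, and that the inputs are lists of unique scalar values, since duplicates would collapse when used as keys):

```php
<?php
// Intersect plain lists of unique scalars by promoting each one to a
// key-map (value => value) and intersecting on keys.
function fast_intersect(array ...$lists): array
{
    $maps = array_map(fn ($list) => array_combine($list, $list), $lists);
    return array_keys(array_intersect_key(...$maps));
}
```

In a hot loop you would want to build the key-maps once up front rather than inside the helper, as described above, but even with the conversion done per call this tends to come out ahead of array_intersect.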

In my particular case, array_intersect_key made the difference between being able to perform the task at all using a particular approach, and having to give up on that approach and use an alternative, which was significantly slower. Using array_intersect_key made it possible to keep everything in RAM, to split up the job into separate threads, and to achieve an overall speedup between 10x and 42x, relative to the alternative method. (The speedup varies depending on the topology of the paths within the hypercube.)

All material on this site is carefully reviewed, but its accuracy cannot be guaranteed, and some suggestions offered here might just be silly ideas. For best results, please do your own checking and verifying. This specific article was last reviewed or updated by Greg on .

This site is provided for informational and entertainment purposes only. It is not intended to provide advice of any kind.