How to make your code fast with parallel streams
In the eyes of many programmers, Java 8 streams have been a dissapointment regarding the preformance of parallel streams. This article is a basic guide on how to avoid performance pitfalls when using parallel streams.
When to use parallel streams
First off, only use parallel streams if your N is large. A good empirical rule of thumb is to only use parallel streams when the number of elements reaches 10_000. You should, however, keep in mind that it's the number of computations and not the number of elements that matter. One could, in theory, create an example where parallel stream wins even for a stream of 2 elements! Also, using parallel streams for blocking io is a poor substitute for proper nonblocking, asynchronous io. Don't do that. Parallel streams may choose the amount of threads based on available processors (in which case it will be too small).
Choosing the smallest sequential unit of computation
This is perhaps the most important thing to keep in mind. Not everything should be parallel. On the other hand, putting too much work in sequential computations will limit your scalability in the presence of many processors. It is very easy to parallelize too much in which case the performance penalty is incredible . Here is an example of a poorly parallelized prime number generator (please do not ever generate primes like this... there are better ways):
public class JavaPrimes { public static void main(String[] args) { int limit = Integer.parseInt(args[0]); java.util.stream.IntStream.range(2, limit) .parallel() .filter(JavaPrimes::isPrime) .forEach(System.out::println); } private static boolean isPrime(int n) { return java.util.stream.IntStream.range(2, n) .parallel() .mapToDouble(i -> (float)n / i) .noneMatch(JavaPrimes::isInt); } private static boolean isInt(double f) { return (int)f == f; } }
Compared to a simple, for-loop based approach this code is 4x slower when using a limit of 50_000. Instead of complaining that streams are slow, let's examine this code for a bit. The smallest unit of computation here consists of float division, double-to-int rounding and double equality comparison. We exclude the expense of static method calls under the assumption that they, being monomorphic call sites, are fully inlined. Now, this unit of computation is way too small – the time required to compute it is practically zero. Let's modify the program so that the smallest unit of sequential computation is checking whether a given number is prime.
public class JavaPrimes { public static void main(String[] args) { int limit = Integer.parseInt(args[0]); java.util.stream.IntStream.range(2, limit) .parallel() .filter(JavaPrimes::isPrime) .forEach(System.out::println); } private static boolean isPrime(int n) { for(int i = 2; i < n; i++) { if(isInt((float)n / i)) return false; return true; } private static boolean isInt(float f) { return (int)f == f; } }
Now, the code runs 2x faster than sequential loops not to mention 8x faster than poorly written parallel streams. As an additional bonus we can now run some computations using floats instead of doubles (there is no FloatStream so we had to use DoubleStream). On some hardware this may have significant benefits.