Big Oh in the parallel world
Big-Oh notation is a simple and powerful way to express how running time of a particular algorithm depends on the size of the input. When you say that a particular algorithm runs in O(N2) time, you mean that the number of steps the algorithm takes is proportional to the input size squared. Or, in mathematical terms, there is some fixed constant C, such that to process input of size N, the algorithm needs at most C x N2 steps. One interesting question is how to define a “step”. The beauty of the Big-Oh notation is that any somewhat reasonable definition of a step will do. The step could be a clock cycle, a hardware instruction, or an expression in source code. If an algorithm takes O(N) steps according to one definition of a “step”, it takes O(N) according to all reasonable definitions.
While this is great, you already probably heard it a thousand times. But, what about the complexity of parallel programs? A major departure from the sequential case is that the number of steps an algorithm takes and the actual real-world time may not be proportional. In the parallel case, we may have potentially many cores executing the computation steps!
In fact, the number of computational cores is a new variable that we need to take into account. There are several ways to incorporate this variable into the Big-Oh notation, and I am going to describe them one-by-one.
Method 1: “the algorithm runs in O(N) time in parallel”
The crudest approach is simply to assume that there will be enough processors available on the machine, regardless of what “enough” means. Even this approximation tells us something deep about a particular algorithm. If the algorithm runs in O(N) time sequentially but in O(log N) in parallel, that means that the algorithm parallelizes well. On the other hand, if the algorithm runs in O(N) both sequentially and in parallel, then a large number of cores will probably not make the algorithm run much faster.
For example, consider the standard algorithm for multiplication of square matrices:
static int[,] Multiply(int[,] matrix1, int[,] matrix2) { int N = matrix1.GetLength(0); int[,] result = new int[N, N]; for (int i = 0; i < N; i++) { for (int j = 0; j < N; j++) { for (int k = 0; k < N; k++) { result[j, k] += matrix1[j, i] * matrix2[i, k]; } } } return result; }
If we have enough cores, the iterations of the loop over parameter j can execute in parallel, and so can iterations over parameter k. In fact, if we had a matrix of processors of size j * k, each of them would have to only perform one multiplication and one addition for each iteration of the parameter i, bringing down the running time to O(N).
So, in theory, we can say that provided that on a parallel machine, the program runs in O(N) time. There are two problems with that statement:
- We may need a lot of processors to achieve the O(N) running time. To multiply two 100×100 matrices, we would need a machine with a 10,000 processors.
- Today’s mainstream CPUs tends to perform poorly in cases where threads have to wait on each other a lot. In the matrix multiplication example, each thread would perform one multiplication and one addition and then have to wait. This would result in terrible performance in practice. On the other hand, SIMD hardware such as GPUs could potentially do much better.
Method 2: “the algorithm runs in O(N2) time with O(N) processors, or O(N) with O(N2) processors”
To address the first of the two problems I mentioned, we can use the Big-Oh notation to also express the upper limit on the number of processors required to achieve a particular running time. For example, we can say that the matrix multiplication algorithm runs in O(N2) time on a machine with O(N) processors, or in O(N) time on a machine with O(N2) processors.
Now, we are not only saying how fast a program can run on a parallel machine, but also how many processors do we need to get there.
Method 3: “the algorithm runs in O(N3/P) time, so long as P is in O(N2)”
Finally, my favorite way to express complexity of a parallel algorithm is to introduce a new parameter P that represents the number of cores on a machine. For example, we would say that the matrix multiplication algorithm runs in O(N3/P) time. For completeness, we should also say that this only holds so long as P is in O(N2). After all, even if the computer has O(N3) cores, the matrix multiplication algorithm still cannot beat O(N).
One interesting thought exercise is to think about this: if the fastest possible sequential algorithm is O(F(N)), is it possible that there is a parallel algorithm which is asymptotically faster than O(F(N) / P) on a machine with P processors? If you you know the answer, let’s hear about it in the comments section!
A few examples
This table lists several sequential algorithms, and their parallel complexities using methods 1 and 3. Note that I don’t claim that each complexity listed in this table is optimal. A value in the table simply means that I am aware of an algorithm with this complexity. For example, I list the complexity of sequential matrix multiplication as O(N3), but there are somewhat faster algorithms out there.
| Algorithm | Sequential | Method 1 | Method 3 |
| Sum, Max or Prefix sum | O(N) | O(log N) | O((N log N) / P), with P in O(N) OR O(N / P) with P in O(sqrt N) |
| Edit distance | O(N^2) | O(N) | O(N2 / P), with P in O(N) |
| Sorting | O(N log N) | O((log N)2) | O(N (log N)2/P), with P in O(N) |
| Matrix multiplication, Floyd Warshall | O(N3) | O(N) | O(N3/P), with P in O(N2) |
| Binary search | O(log N) | O(log N) | O(log N) |
Monday 11 Aug 2008 | Igor Ostrovsky | Uncategorized