Big-Oh notation is a simple and powerful way to express how the running time of a particular algorithm depends on the size of the input. When you say that a particular algorithm runs in O(N²) time, you mean that the number of steps the algorithm takes is proportional to the input size squared. Or, in mathematical terms, there is some fixed constant C such that to process input of size N, the algorithm needs at most C × N² steps. One interesting question is how to define a “step”. The beauty of the Big-Oh notation is that any somewhat reasonable definition of a step will do: a step could be a clock cycle, a hardware instruction, or an expression in source code. If an algorithm takes O(N) steps according to one definition of a “step”, it takes O(N) steps according to all reasonable definitions.
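For concreteness, here is a toy example (mine, not from the original post): the comparison in the inner loop below executes fewer than N × N times, so the method runs in O(N²) steps under any of the definitions above.

static int CountDuplicatePairs(int[] items)
{
    int N = items.Length;
    int count = 0;
    // The inner comparison executes at most N * N times: O(N²) steps.
    for (int i = 0; i < N; i++)
        for (int j = i + 1; j < N; j++)
            if (items[i] == items[j])
                count++;
    return count;
}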
While this is great, you have probably heard it a thousand times already. But what about the complexity of parallel programs? A major departure from the sequential case is that the number of steps an algorithm takes and the actual real-world time are no longer necessarily proportional: in the parallel case, we may have potentially many cores executing the computation steps at once!
In fact, the number of computational cores is a new variable that we need to take into account. There are several ways to incorporate this variable into the Big-Oh notation, and I am going to describe them one by one.
Method 1: “the algorithm runs in O(N) time in parallel”
The crudest approach is simply to assume that there will be enough processors available on the machine, regardless of what “enough” means. Even this approximation tells us something deep about a particular algorithm. If the algorithm runs in O(N) time sequentially but in O(log N) in parallel, that means that the algorithm parallelizes well. On the other hand, if the algorithm runs in O(N) both sequentially and in parallel, then a large number of cores will probably not make the algorithm run much faster.
For example, consider the standard algorithm for multiplication of square matrices:
static int[,] Multiply(int[,] matrix1, int[,] matrix2)
{
    int N = matrix1.GetLength(0);
    int[,] result = new int[N, N];
    // Note that the summation index i is the outermost loop;
    // the iterations over j and k are independent of one another.
    for (int i = 0; i < N; i++)
    {
        for (int j = 0; j < N; j++)
        {
            for (int k = 0; k < N; k++)
            {
                result[j, k] += matrix1[j, i] * matrix2[i, k];
            }
        }
    }
    return result;
}
If we have enough cores, the iterations of the loop over j can execute in parallel, and so can the iterations over k. In fact, if we had an N × N matrix of processors, one per (j, k) pair, each of them would only have to perform one multiplication and one addition for each iteration of i, bringing the running time down to O(N).
So, in theory, we can say that on a sufficiently parallel machine, the program runs in O(N) time. There are two problems with that statement:
- We may need a lot of processors to achieve the O(N) running time. To multiply two 100×100 matrices, we would need a machine with 10,000 processors.
- Today’s mainstream CPUs tend to perform poorly in cases where threads have to wait on each other a lot. In the matrix multiplication example, each thread would perform one multiplication and one addition and then have to wait, which would result in terrible performance in practice. On the other hand, SIMD hardware such as GPUs could potentially do much better. (A sketch of a more multicore-friendly version follows below.)
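For illustration, here is a minimal sketch of such a multicore-friendly version using .NET’s Parallel.For (the name MultiplyParallel and the restructured loops are mine, not from the original post). Each core processes whole rows of the result, so threads never have to wait on one another:

static int[,] MultiplyParallel(int[,] matrix1, int[,] matrix2)
{
    int N = matrix1.GetLength(0);
    int[,] result = new int[N, N];
    // The iterations over j write to disjoint rows of result,
    // so they can safely execute in parallel.
    System.Threading.Tasks.Parallel.For(0, N, j =>
    {
        for (int k = 0; k < N; k++)
        {
            int sum = 0;
            for (int i = 0; i < N; i++)
            {
                sum += matrix1[j, i] * matrix2[i, k];
            }
            result[j, k] = sum;
        }
    });
    return result;
}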
Method 2: “the algorithm runs in O(N²) time with O(N) processors, or O(N) with O(N²) processors”
To address the first of the two problems I mentioned, we can use the Big-Oh notation to also express the upper limit on the number of processors required to achieve a particular running time. For example, we can say that the matrix multiplication algorithm runs in O(N²) time on a machine with O(N) processors, or in O(N) time on a machine with O(N²) processors.
Now, we are not only saying how fast a program can run on a parallel machine, but also how many processors we need to get there.
Method 3: “the algorithm runs in O(N³/P) time, so long as P is in O(N²)”
Finally, my favorite way to express the complexity of a parallel algorithm is to introduce a new parameter P that represents the number of cores on the machine. For example, we would say that the matrix multiplication algorithm runs in O(N³/P) time. For completeness, we should also say that this only holds so long as P is in O(N²). After all, even if the computer had O(N³) cores, this matrix multiplication algorithm still could not beat O(N): each cell of the result is computed as a chain of N dependent additions.
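In symbols (my notation, not from the original post), Method 3’s claim for this algorithm is:

$$T(N, P) = O\!\left(\frac{N^3}{P}\right) \;\text{ for } P = O(N^2), \qquad T(N, P) = \Omega(N) \;\text{ for every } P,$$

where T(N, P) is the running time on inputs of size N with P cores; the Ω(N) floor comes from the chain of N dependent additions mentioned above.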
One interesting thought exercise: if the fastest possible sequential algorithm is O(F(N)), is it possible that there is a parallel algorithm that is asymptotically faster than O(F(N)/P) on a machine with P processors? If you know the answer, let’s hear about it in the comments section!
A few examples
This table lists several sequential algorithms and their parallel complexities using methods 1 and 3. Note that I don’t claim that each complexity listed in this table is optimal. A value in the table simply means that I am aware of an algorithm with this complexity. For example, I list the complexity of sequential matrix multiplication as O(N³), but there are somewhat faster algorithms out there.
| Algorithm | Sequential | Method 1 | Method 3 |
| --- | --- | --- | --- |
| Sum, max, or prefix sum | O(N) | O(log N) | O((N log N)/P) with P in O(N), or O(N/P) with P in O(√N) |
| Edit distance | O(N²) | O(N) | O(N²/P) with P in O(N) |
| Sorting | O(N log N) | O((log N)²) | O(N(log N)²/P) with P in O(N) |
| Matrix multiplication, Floyd–Warshall | O(N³) | O(N) | O(N³/P) with P in O(N²) |
| Binary search | O(log N) | O(log N) | O(log N) |
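To make the first row concrete, here is a minimal sketch of a parallel sum (my code, not from the post), using the Parallel.For overload that keeps a per-task partial sum. Each of the P worker tasks sums roughly N/P elements, matching the O(N/P) entry:

static long ParallelSum(int[] data)
{
    long total = 0;
    object gate = new object();
    System.Threading.Tasks.Parallel.For(
        0, data.Length,
        () => 0L,                                      // each task starts with a partial sum of 0
        (i, loopState, partial) => partial + data[i],  // accumulate locally, without locking
        partial => { lock (gate) { total += partial; } }); // combine the partial sums
    return total;
}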
Tags: Concurrency
“One interesting thought exercise is to think about this: if the fastest possible sequential algorithm is O(F(N)), is it possible that there is a parallel algorithm which is asymptotically faster than O(F(N) / P) on a machine with P processors?”
I think this isn’t possible. A single processor can simulate P processors with a factor-of-P increase in running time. So, if a parallel algorithm were asymptotically faster than O(F(N)/P), a single processor could simulate it and obtain a sequential algorithm asymptotically faster than O(F(N)), contradicting the assumption that O(F(N)) is the best possible sequential running time. Such a speedup could also contradict trivial lower bounds (such as the N² memory accesses needed for matrix multiplication) or information-theoretic lower bounds (e.g. the N log N comparisons needed for comparison sorting).
(This is in the realm of theoretical computer science, disregarding implementation issues like memory hierarchy or superscalar execution or out-of-order execution.)
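In symbols (my notation): if T_seq(N) is the best possible sequential running time and T_par(N, P) the parallel running time, the simulation argument gives

$$T_{\mathrm{seq}}(N) \;\le\; c \cdot P \cdot T_{\mathrm{par}}(N, P)$$

for some constant c, so T_par(N, P) = o(F(N)/P) would imply a sequential running time of o(F(N)), contradicting the optimality of the sequential algorithm.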
Thank you for this post. I guess the edit distance example is based on the dynamic programming algorithm on two strings of length N.
If we had a “binary” data set with m objects, each of size N, how would we write the parallel complexity of calculating the distance matrix using method 3? Here the edit distance turns out to be the Hamming distance, computed by XORing two objects and counting the 1s.
O((m² · N) / P), with P in O(m²)???
dummk:
Counting the ones takes O(N) time, so the sequential complexity would be O(N · m²).
Then, the parallel complexity would be O(N · m² / P), with P in O(m²), since each of the m² distance computations is independent.
Basically, each entry of the distance matrix now takes a factor of N more work to compute, but the parallelism still comes only from the m² independent entries, because each individual distance is computed sequentially.
Of course, you could try to further parallelize the work of XORing and counting the 1s. For that algorithm, you could claim a higher bound on P. In practice, I doubt the extra parallelism would be worthwhile on a typical modern computer unless N is very large.
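For what it’s worth, here is a sketch of that scheme in C# (my code; the name DistanceMatrix and the ulong[] representation of each object are assumptions, not from this thread). It parallelizes across the independent pairs, while each XOR-and-count still runs sequentially:

static int[,] DistanceMatrix(ulong[][] objects)
{
    int m = objects.Length;
    int[,] dist = new int[m, m];
    // The pairwise distances are independent of each other,
    // so we compute them in parallel, one row of pairs per task.
    System.Threading.Tasks.Parallel.For(0, m, a =>
    {
        for (int b = a + 1; b < m; b++)
        {
            int d = 0;
            for (int w = 0; w < objects[a].Length; w++)
            {
                ulong x = objects[a][w] ^ objects[b][w];  // XOR the two objects
                while (x != 0) { x &= x - 1; d++; }       // count the 1 bits
            }
            dist[a, b] = dist[b, a] = d;
        }
    });
    return dist;
}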
Igor,
I am writing a paper and had a big argument with a friend about the way I said one algorithm is asymptotically faster running in parallel than one running serially, using the Big-Oh notation. BTW, your post helped a lot in my argumentation. But do you know of any paper or scientific work that has explored this concept?
Thanks
Diogo – No, I do not have any references, unfortunately. Glad to hear that the post helped!
Binary search is O(log(N/P)) with P in O(N).
Given P processors, split the search space of N elements into P segments of size N/P and search them all in parallel. Each processor runs a binary search over its own N/P elements, so the run time is O(log(N/P)). It should be obvious that a binary search of 8 elements requires at most 3 compare cycles, but with P = 8 each segment holds a single element, so I need only one cycle.
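Here is a sketch of that scheme in C# (my code; the names ParallelSearch and P are mine): P tasks each binary-search their own segment of a sorted array.

static int ParallelSearch(int[] sorted, int target, int P)
{
    int found = -1;
    int chunk = (sorted.Length + P - 1) / P;  // segment size, roughly N/P
    System.Threading.Tasks.Parallel.For(0, P, p =>
    {
        int lo = p * chunk;
        int len = System.Math.Min(chunk, sorted.Length - lo);
        if (len <= 0) return;
        // Each task searches its own segment in O(log(N/P)) time.
        int idx = System.Array.BinarySearch(sorted, lo, len, target);
        if (idx >= 0)
            System.Threading.Interlocked.Exchange(ref found, idx);
    });
    return found;  // index of target, or -1 if it is not present
}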