Igor Ostrovsky Blogging » Concurrency

Video of my PLINQ session at PDC 2009

Igor Ostrovsky — Sun, 22 Nov 2009 05:25:21 +0000

PDC 2009 was an exciting event, with announcements about Azure, Silverlight 4, and Office 2010 popping up one after another. For me, there was another reason why this year’s PDC was exciting – it was my first chance to present at a major conference.

The video of my session is available as Silverlight video, and also as high-quality WMV and standard-quality WMV. Here are the slides. Check it out and let me know how I did!

And, here are videos of all PDC sessions related to parallel programming.

Never having presented at PDC before, I didn’t know how many people to expect at my session. I was happy with the turnout.

[Photo: a view of my session from the back of the room]

Most Common Performance Issues in Parallel Programs

Igor Ostrovsky — Wed, 13 Aug 2008 08:15:34 +0000

Performance of parallel programs is an interesting – but also tricky – issue. I put together an article for our team blog that talks about the most common reasons why a parallel program may not scale as desired:

Developers ask why one program shows a parallel speedup but another one does not, or how to modify a program so that it scales better on multi-core machines.
The answers tend to vary. In some cases, the problem observed is a clear issue in our code base that is relatively easy to fix. In other cases, the problem is that Parallel Extensions does not adapt well to a particular workload. This may be harder to fix on our side, but understanding real-world workloads is a first step toward doing that, so please continue to send us your feedback. In yet another class of issues, the problem is in the way the program uses Parallel Extensions, and there is little we can do on our side to resolve the issue.
Interestingly, it turns out that there are common patterns that underlie most performance issues, regardless of whether the source of the problem is in our code or the user code. In this blog posting, we will walk through the common performance issues to take into consideration while developing applications using Parallel Extensions.

Read the rest of the article here.

Big Oh in the parallel world

Igor Ostrovsky — Mon, 11 Aug 2008 07:34:29 +0000

Big-Oh notation is a simple and powerful way to express how the running time of a particular algorithm depends on the size of the input. When you say that a particular algorithm runs in O(N²) time, you mean that the number of steps the algorithm takes is at most proportional to the input size squared. Or, in mathematical terms, there is some fixed constant C, such that to process input of size N, the algorithm needs at most C × N² steps. One interesting question is how to define a “step”. The beauty of the Big-Oh notation is that any somewhat reasonable definition of a step will do. The step could be a clock cycle, a hardware instruction, or an expression in source code. If an algorithm takes O(N) steps according to one definition of a “step”, it takes O(N) according to all reasonable definitions.
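The claim that the definition of a “step” does not matter can be written out in one line. This is only a sketch: the symbols T_A, T_B and b are introduced here for illustration, and b stands for the assumed maximum number of B-steps needed to carry out one A-step.

```latex
% Running time under step definition A, for the O(N^2) example:
T_A(N) \le C \cdot N^2
% If every A-step costs at most b steps under definition B, then
T_B(N) \le b \cdot T_A(N) \le (b \cdot C) \cdot N^2
```

The constant changes from C to b·C, but the bound is still a fixed constant times N², so the algorithm is O(N²) under both definitions.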

While this is great, you have probably heard it a thousand times already. But, what about the complexity of parallel programs? A major departure from the sequential case is that the number of steps an algorithm takes and the actual real-world time may not be proportional. In the parallel case, we may have potentially many cores executing the computation steps!

In fact, the number of computational cores is a new variable that we need to take into account. There are several ways to incorporate this variable into the Big-Oh notation, and I am going to describe them one-by-one.

Method 1: “the algorithm runs in O(N) time in parallel”

The crudest approach is simply to assume that there will be enough processors available on the machine, regardless of what “enough” means. Even this approximation tells us something deep about a particular algorithm. If the algorithm runs in O(N) time sequentially but in O(log N) in parallel, that means that the algorithm parallelizes well. On the other hand, if the algorithm runs in O(N) both sequentially and in parallel, then a large number of cores will probably not make the algorithm run much faster.

For example, consider the standard algorithm for multiplication of square matrices:

static int[,] Multiply(int[,] matrix1, int[,] matrix2) {
    int N = matrix1.GetLength(0);
    int[,] result = new int[N, N];

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            for (int k = 0; k < N; k++) {
                result[j, k] += matrix1[j, i] * matrix2[i, k];
            }
        }
    }

    return result;
}

If we have enough cores, the iterations of the loop over parameter j can execute in parallel, and so can the iterations over parameter k. In fact, if we had an N × N grid of processors, one for each (j, k) pair, each of them would only have to perform one multiplication and one addition for each iteration of the parameter i, bringing the running time down to O(N).

So, in theory, we can say that on a parallel machine with enough processors, the program runs in O(N) time. There are two problems with that statement:

We may need a lot of processors to achieve the O(N) running time. To multiply two 100×100 matrices, we would need a machine with 10,000 processors.
Today’s mainstream CPUs tend to perform poorly in cases where threads have to wait on each other a lot. In the matrix multiplication example, each thread would perform one multiplication and one addition and then have to wait. This would result in terrible performance in practice. On the other hand, SIMD hardware such as GPUs could potentially do much better.

Method 2: “the algorithm runs in O(N²) time with O(N) processors, or O(N) with O(N²) processors”

To address the first of the two problems I mentioned, we can use the Big-Oh notation to also express the upper limit on the number of processors required to achieve a particular running time. For example, we can say that the matrix multiplication algorithm runs in O(N²) time on a machine with O(N) processors, or in O(N) time on a machine with O(N²) processors.

Now, we are not only saying how fast a program can run on a parallel machine, but also how many processors we need to get there.

Method 3: “the algorithm runs in O(N³/P) time, so long as P is in O(N²)”

Finally, my favorite way to express complexity of a parallel algorithm is to introduce a new parameter P that represents the number of cores on a machine. For example, we would say that the matrix multiplication algorithm runs in O(N³/P) time. For completeness, we should also say that this only holds so long as P is in O(N²). After all, even if the computer has O(N³) cores, the matrix multiplication algorithm still cannot beat O(N).
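To make the O(N³/P) bound more concrete, here is a minimal sketch that splits the N² output cells evenly among P threads. The class name ParallelMatrix and the parameter p are made up for illustration, and the sketch uses plain threads rather than any particular parallel library. Each thread handles roughly N²/P cells at N steps per cell, which is where the N³/P comes from, and asking for more than N² threads would leave some of them with nothing to do.

```csharp
using System;
using System.Threading;

static class ParallelMatrix {
    // Multiplies two N x N matrices using p worker threads (p is an illustrative parameter).
    public static int[,] Multiply(int[,] a, int[,] b, int p) {
        int n = a.GetLength(0);
        int[,] result = new int[n, n];
        int cells = n * n;
        Thread[] workers = new Thread[p];

        for (int w = 0; w < p; w++) {
            // Worker w owns cells [first, last); the ranges cover all cells without overlap.
            int first = (int)((long)cells * w / p);
            int last = (int)((long)cells * (w + 1) / p);
            workers[w] = new Thread(() => {
                for (int cell = first; cell < last; cell++) {
                    int j = cell / n, k = cell % n;
                    int sum = 0;
                    for (int i = 0; i < n; i++) {   // N steps per output cell
                        sum += a[j, i] * b[i, k];
                    }
                    result[j, k] = sum;             // each cell is written by exactly one thread
                }
            });
            workers[w].Start();
        }

        foreach (Thread t in workers) t.Join();     // wait for all workers before returning
        return result;
    }
}
```

Calling ParallelMatrix.Multiply(m1, m2, Environment.ProcessorCount) would use one thread per core. Creating threads has a cost of its own, so a split like this only pays off for reasonably large matrices.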

One interesting thought exercise is to think about this: if the fastest possible sequential algorithm is O(F(N)), is it possible that there is a parallel algorithm which is asymptotically faster than O(F(N) / P) on a machine with P processors? If you know the answer, let’s hear about it in the comments section!

A few examples

This table lists several sequential algorithms, and their parallel complexities using methods 1 and 3. Note that I don’t claim that each complexity listed in this table is optimal. A value in the table simply means that I am aware of an algorithm with this complexity. For example, I list the complexity of sequential matrix multiplication as O(N³), but there are somewhat faster algorithms out there.

Algorithm	Sequential	Method 1	Method 3
Sum, Max or Prefix sum	O(N)	O(log N)	O((N log N) / P) with P in O(N), or O(N / P) with P in O(√N)
Edit distance	O(N²)	O(N)	O(N² / P), with P in O(N)
Sorting	O(N log N)	O((log N)²)	O(N (log N)²/P), with P in O(N)
Matrix multiplication, Floyd Warshall	O(N³)	O(N)	O(N³/P), with P in O(N²)
Binary search	O(log N)	O(log N)	O(log N)
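One way to read the “O(N / P) with P in O(√N)” entry for Sum is the simple chunked approach sketched below. Each of the P threads adds up about N / P elements, and then the P partial sums are combined sequentially, for O(N / P + P) steps in total. That is O(N / P) as long as P is at most about √N. The class name ParallelSum and the parameter p are illustrative names, not taken from any library.

```csharp
using System;
using System.Threading;

static class ParallelSum {
    // Sums 'values' using p threads (p is an illustrative parameter).
    public static long Sum(int[] values, int p) {
        long[] partial = new long[p];
        Thread[] workers = new Thread[p];

        for (int w = 0; w < p; w++) {
            int index = w;  // copy, so that each lambda captures its own value
            int first = (int)((long)values.Length * w / p);
            int last = (int)((long)values.Length * (w + 1) / p);
            workers[w] = new Thread(() => {
                long sum = 0;
                for (int i = first; i < last; i++) sum += values[i];  // O(N / P) per thread
                partial[index] = sum;
            });
            workers[w].Start();
        }

        long total = 0;
        for (int w = 0; w < p; w++) {   // O(P) sequential combine
            workers[w].Join();          // Join makes the worker's write to partial[w] visible
            total += partial[w];
        }
        return total;
    }
}
```

Reaching the O(log N) figure from the Method 1 column would require combining the partial sums pairwise in a tree instead of in a sequential loop.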

Overview of concurrency in .NET Framework 3.5

Igor Ostrovsky — Mon, 16 Jun 2008 08:22:08 +0000

There is a lot of information on the concurrent primitives and concepts exposed by the .NET Framework 3.5 available on MSDN, blogs, and other websites. The goal of this post is to distill the information into an easy-to-digest high-level summary: what are the different pieces, where they differ and how they relate. If you want to know the difference between a Thread and a BackgroundWorker, or what is the point of interlocked operations, you are reading the right article.

For each construct, I will give a motivating usage example and explain how it relates to other concurrent constructs. I will not attempt to cover everything there is to say about concurrent programming on .NET Framework, but whenever possible, I will link to more in-depth resources on each topic. Entire books have been written on concurrent programming for different platforms, so I certainly cannot fit everything into a single blog posting. But, I will walk you through the concurrent constructs and primitives so that you understand what is out there, and know where to look for more information if you need to.

There are three key concepts that .NET concurrency primitives relate to: concurrent execution, synchronization and memory sharing. Let’s go over them one by one.

Concurrent execution

In order to move from sequential to concurrent programming, we need some way to do multiple things at the same time. In the .NET Framework, threads are the main mechanism for concurrent execution.

For example, this program reads integers from the console and factors them into primes, executing each factorization on a separate thread:

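// Requires: using System; using System.Collections.Generic; using System.Linq; using System.Threading;
static void Factor(long x) {
    long original = x;  // x is divided down below, so remember the input for printing
    var factors = new List<long>();
    for (long i = 2; i * i <= Math.Abs(x); i++) {
        while (x % i == 0) {
            x /= i;
            factors.Add(i);
        }
    }

    // Whatever remains is a prime factor (or the input itself, if nothing divided it).
    if (x != 1 || factors.Count == 0) factors.Add(x);

    Console.WriteLine(
        "{0} = {1}",
        original,
        string.Join(" x ", factors.Select(i => i.ToString()).ToArray()));
}

static void Main() {
    while (true) {
        string line = Console.ReadLine();
        long x = long.Parse(line);
        // Run each factorization on its own thread so the console stays responsive.
        new Thread(() => Factor(x)).Start();
    }
}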

The nice thing about this program is that even if some integer takes a long time to factor, the program will remain responsive. To see this, run the above program, and enter 100000000000000019. While the program is busy attempting to factor the large prime, you can continue to type in integers, and each answer pops up as soon as it is computed.

A disadvantage of the above approach is that if you use this program to factor many small integers, you will end up creating a lot of threads. Creation and destruction of threads is fairly expensive, so you will pay a significant performance cost. A thread pool is a construct to mitigate that cost. Instead of creating a new thread each time we want to perform some work asynchronously, we just enqueue the work on the thread pool. The thread pool keeps recycling the same threads to execute more and more work items, and thus avoids the cost of creating and destroying threads frequently. A thread pool is exposed in .NET via the ThreadPool class. We can rewrite the Main method from the previous example to use ThreadPool instead of directly creating threads:

static void Main()
{
    while (true)
    {
        string line = Console.ReadLine();
        long x = long.Parse(line);
        ThreadPool.QueueUserWorkItem((o) => Factor(x));
    }
}

Applications with a graphical user interface are one class of programs that benefit from concurrency. A frozen interface is a terrible user experience, and concurrent programming is one way to solve that problem. BackgroundWorker is a class designed to make it easier to integrate asynchronous execution with applications that use Windows Forms. BackgroundWorker provides the RunWorkerCompleted event that fires once all work is done, as well as the ProgressChanged event to report periodically how much of the work has already been completed.

The main feature of the BackgroundWorker, and arguably the reason for its existence, is that events RunWorkerCompleted and ProgressChanged fire on the UI thread rather than on the thread on which the worker is executing. This is very important for integration with Windows Forms, because UI elements may only be modified from the UI thread. BackgroundWorker accomplishes this by posting messages from the worker thread, which get picked up by the event loop on the UI thread.
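Here is a minimal sketch of the three BackgroundWorker events working together. The class and method names are illustrative. To stay self-contained it runs without a form, and that changes one thing: with no UI thread present, the ProgressChanged and RunWorkerCompleted handlers run on thread-pool threads, whereas in a Windows Forms application they would run on the UI thread as described above.

```csharp
using System;
using System.ComponentModel;
using System.Threading;

static class BackgroundWorkerSketch {
    // Sums 1..n on a background thread and returns the result.
    public static long SumInBackground(int n) {
        long result = 0;
        var worker = new BackgroundWorker();
        worker.WorkerReportsProgress = true;

        using (var done = new ManualResetEvent(false)) {
            worker.DoWork += (sender, e) => {
                int count = (int)e.Argument;
                int step = Math.Max(1, count / 4);   // report roughly every 25%
                long sum = 0;
                for (int i = 1; i <= count; i++) {
                    sum += i;
                    if (i % step == 0) worker.ReportProgress((int)(100L * i / count));
                }
                e.Result = sum;   // handed to RunWorkerCompleted
            };
            // Without a UI thread, progress messages may print out of order.
            worker.ProgressChanged += (sender, e) =>
                Console.WriteLine("{0}% done", e.ProgressPercentage);
            worker.RunWorkerCompleted += (sender, e) => {
                result = (long)e.Result;
                done.Set();
            };

            worker.RunWorkerAsync(n);
            // Blocking is only for this sketch; never block the UI thread like this in a real form.
            done.WaitOne();
        }
        return result;
    }
}
```

In a real Windows Forms application, the RunWorkerCompleted handler would update a label or a progress bar directly, and the call to WaitOne would be dropped. Blocking the UI thread there would prevent the completion event from ever being delivered.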

Synchronization

In most real-world usages, threads are not fully independent. Instead, they coordinate their progress and wait on each other at predetermined places. The mechanism for thread coordination is called synchronization.

The most important example of synchronization is mutual exclusion. Often, it is important to be able to ensure that two threads will not execute in some section of code at the same time. If one thread is executing in the critical section, and another thread attempts to enter that section, the second thread will have to wait until the first thread exits the critical section.

Monitor is the simplest primitive for mutual exclusion provided by the .NET framework. It allows any .NET object to be treated as a lock. Only one thread can be inside a region protected by a monitor which is associated with a particular object. However, note that two different threads can be inside two monitors which are associated with different objects. Here is an example that uses Monitor to implement a thread-safe Account class:

class Account {
    private int balance;
    private object myLock = new object();
    
    public void Deposit(int amount) {
        Monitor.Enter(this.myLock);
        try {
            this.balance += amount;
        }
        finally {
            Monitor.Exit(this.myLock);
        }
    }
    
    public void Withdraw(int amount) {
        Monitor.Enter(this.myLock);
        try { 
            if (this.balance < amount) {
                throw new InvalidOperationException("Balance too low.");
            } 
            this.balance -= amount; 
        } 
        finally {
            Monitor.Exit(this.myLock);
        }
    }
}

Since mutual exclusion using Monitor is such a common pattern, some .NET languages have a special language construct for it. C# is one of those languages, so we can rewrite the Withdraw() method from the Account class as follows:

public void Withdraw(int amount) {
    lock (myLock) {
        if (this.balance < amount) {
            throw new InvalidOperationException("Balance too low.");
        }
        this.balance -= amount;
    }
}

Mutual exclusion imposes a constraint that at most one thread executes in a particular critical section at a time. However, a looser synchronization pattern is often useful, where we distinguish between a read and write access to a critical section. Many readers can access the same critical section at the same time, but if a writer is in the critical section, all other readers and writers must stay out. As a result, throughput is improved if there are many more readers than writers, particularly on a multi-core machine.

This pattern of usage is supported by the ReaderWriterLock class in the .NET Framework. However, it turns out that the implementation of ReaderWriterLock has performance issues, and as of version 3.5, the .NET Framework includes a new class, ReaderWriterLockSlim. Long story short, you should always use ReaderWriterLockSlim rather than ReaderWriterLock.

As an example, let’s implement another class to represent a bank account, this time with reader-writer locks:

class Account2 {
    private int balance;
    private int overdraftLimit;
    ReaderWriterLockSlim rwLock = new ReaderWriterLockSlim();
    
    public Account2(int balance, int overdraftLimit) { ... }
    
    public void SetOverdraftLimit(int overdraftLimit) { ... }
    
    public void Deposit(int amount) { ... }
    
    public void Withdraw(int amount) {
        this.rwLock.EnterWriteLock();
        try {
            if (this.balance + this.overdraftLimit < amount) {
                throw new InvalidOperationException("Balance too low.");
            }
            this.balance -= amount;
        }
        finally {
            this.rwLock.ExitWriteLock();
        }
    }
    
    public int GetBalance() {
        this.rwLock.EnterReadLock();
        try {
            return this.balance;
        }
        finally {
            this.rwLock.ExitReadLock();
        }
    }
}

If calls to GetBalance() are much more frequent than calls to Deposit(), Withdraw() and SetOverdraftLimit(), we should observe a significant performance improvement from using ReaderWriterLockSlim over Monitor.

Reader-writer lock is one possible generalization of the mutual exclusion concept. Another possible generalization is to limit the number of threads executing in a critical section to a number larger than one. This problem certainly does not arise as frequently as the critical section problem, but it is frequent enough to be addressed by a classic computer science construct known as a semaphore. In .NET, semaphores are exposed via the Semaphore class. This code sample implements method ThrottledQueryDB which limits the number of threads that can concurrently call QueryDB() to 5:

class SemaphoreSample {
    // Initial count 5, maximum count 5: all five slots start out available.
    private Semaphore semaphore = new Semaphore(5, 5);
    
    public void ThrottledQueryDB() {
        semaphore.WaitOne();
        try {
            QueryDB();
        }
        finally {
            semaphore.Release();
        }
    }
}

However, there are many synchronization scenarios that are more complicated than mutual exclusion. For example, we may want one thread to wait until another thread computes a result, and only proceed once the result is ready. Monitor’s methods Wait, Pulse and PulseAll are often the easiest way to implement this scenario. The nice thing about using Monitor’s methods is that Pulse and Wait are called while holding a lock. If we use events (described below) rather than Monitor signaling, it becomes much harder to avoid race conditions. Here is a simple implementation of a producer-consumer exchange data structure that has a capacity for one element, and blocks both consumers and producers as appropriate:

class ProducerConsumer<T> {
    private T value;
    private bool isEmpty = true;

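    public void Produce(T t) {
        lock (this) {
            while (!isEmpty) {
                Monitor.Wait(this);
            }
            this.value = t;
            isEmpty = false;
            // PulseAll rather than Pulse: producers and consumers wait on the same
            // monitor, so a single Pulse could wake a thread that cannot make progress.
            Monitor.PulseAll(this);
        }
    }
    
    public T Consume() {
        lock (this) {
            while (isEmpty) {
                Monitor.Wait(this);
            }

            isEmpty = true;
            Monitor.PulseAll(this);
            return this.value;
        }
    }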
}

An even more powerful construct than Monitor signaling is an event. In .NET, events are represented via the ManualResetEvent and AutoResetEvent classes. The two main operations on an event are WaitOne and Set. Various threads can "wait" on the event by calling WaitOne. Each call to WaitOne blocks until some other thread calls Set. The difference between the two classes lies in what happens next: a ManualResetEvent wakes up all threads that have been waiting on it and stays set until some thread calls the Reset method, while an AutoResetEvent wakes up a single waiting thread and automatically resets itself into the "unset" state.

For a usage example of events, see for example the MSDN article, How to: Synchronize a Producer and a Consumer Thread.
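As a smaller illustration than the MSDN article, here is a sketch of several threads waiting on a ManualResetEvent until another thread has published a value. The class name EventSketch and the method name SignalAll are made up for this example.

```csharp
using System;
using System.Threading;

static class EventSketch {
    // Starts 'waiters' threads that block on a ManualResetEvent, then signals it once.
    // Returns how many of the threads observed the published value.
    public static int SignalAll(int waiters) {
        int value = 0;
        int observed = 0;
        var ready = new ManualResetEvent(false);   // starts in the unset state
        var threads = new Thread[waiters];

        for (int i = 0; i < waiters; i++) {
            threads[i] = new Thread(() => {
                ready.WaitOne();                   // blocks until Set is called
                if (value == 42) Interlocked.Increment(ref observed);
            });
            threads[i].Start();
        }

        value = 42;    // publish the result first...
        ready.Set();   // ...then release every waiting thread
        foreach (Thread t in threads) t.Join();
        return observed;
    }
}
```

With an AutoResetEvent in place of the ManualResetEvent, a single call to Set would release only one of the waiting threads, and the Join calls would then block waiting for the others.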

Memory sharing

I talked about how to run code on different threads, and also how to synchronize them. One thing that I did not talk about is how to access shared memory from different threads.

First, the nice part: if all reads and writes to a particular memory location are protected by the same lock, each read will see the most up-to-date value written to that memory location. Reads and writes will behave exactly as you would expect. That is pretty nice and simple, and in 99% of cases, all that you need.

But, once we leave the safety of locks and attempt to access the same memory from multiple threads, things get much, much more complicated. I won’t get into details here, but in summary: many properties that we take for granted in single-threaded programming (and get for free in multi-threaded programming with proper locking) break down when we try to share memory among threads without locking. Reads and writes may get reordered in strange ways. For example, if one thread updates value X and then updates value Y, and another thread reads value Y and then value X, the reader thread may see the new value of Y, but not the new value of X, even though X was written first. Or, one thread may set a flag, and another thread may loop forever checking the flag, never finding out that the flag has in fact already been set.

If you want to use low-lock techniques, you need to precisely understand the guarantees that the CLR makes about memory operations in situations where multiple threads access the same memory without locking. This is not for the faint of heart, but if you are interested, you can read up on the CLR memory model in Vance Morrison’s article.

As I said, low-lock techniques are hard. Despite that, it is useful to know what is out there. So, let’s move on to describe the constructs offered by the framework.

One technique to avoid locking is to take advantage of atomic operations exposed as static methods on the Interlocked class. These atomic operations are available:

Add: given two integers, replaces the first one with their sum, and returns the sum.
CompareExchange: if a variable is equal to a comparand, replaces it with another value. Returns the original value either way.
Decrement: decrements an integer and returns the new value.
Exchange: sets a variable to some value and returns the original value.
Increment: increments an integer and returns the new value.
Read: atomically reads a 64-bit value, which matters on 32-bit platforms where a plain 64-bit read is not atomic.

Now, the operations on the above list may seem pretty uninteresting. The reason why they are useful is that they are atomic. The entire operation either takes place fully or not at all, and no other thread may observe the intermediate invalid state. That is a very nice property, typically difficult to achieve in low-lock situations.

For example, consider this Counter class:

class Counter {
    int count = 0;

    public void Increment() {
        Interlocked.Increment(ref count);
    }
    
    public int Count {
        get { return count; }
    }
}

You could implement the same class by replacing Interlocked.Increment(ref count) with count++. But, that would only work if all calls to the Increment method are externally synchronized. The ++ increment operator is a notorious example of a simple operation that is not atomic, and if multiple threads attempt to increment the same field at the same time, some increments may get lost, and an incorrect count will get computed. Interlocked.Increment provides an atomic increment operation that does not require locks, which can be very useful.
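CompareExchange is the most general operation on the list, because other atomic updates can be built from it with a retry loop. As a sketch, this hypothetical MaxTracker class (the name and its members are invented for illustration) keeps the largest value reported by any thread without taking a lock:

```csharp
using System;
using System.Threading;

class MaxTracker {
    private int max = int.MinValue;

    // Atomically raises 'max' to 'candidate' if the candidate is larger.
    public void Report(int candidate) {
        while (true) {
            int seen = max;
            if (candidate <= seen) return;   // max only grows, so a stale read is still safe here
            // Install the candidate only if no other thread changed 'max' since we read it.
            if (Interlocked.CompareExchange(ref max, candidate, seen) == seen) return;
            // Another thread got in first: loop around and re-check against the new value.
        }
    }

    // Like Counter.Count above, meant to be read once the reporting threads are done.
    public int Max {
        get { return max; }
    }
}
```

The loop retries only when another thread changed the field between the read and the CompareExchange call, so under low contention it usually succeeds on the first attempt.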

Another low-lock technique is the use of volatile fields. Some of the strange cases I mentioned earlier (such as a thread seeing a value written later, but not a value written earlier) go away when you mark the shared memory as volatile. Volatile fields prevent some kinds of reordering from happening: no reads or writes can move before a volatile read, and no reads or writes can move after a volatile write. Understanding precisely what these restrictions allow you to assume in practice is unfortunately quite hard, but Vance Morrison’s article helps.

If a field is volatile, then all reads and writes to that field will be volatile. To make a particular read of a non-volatile field volatile, use the VolatileRead static method on the Thread class. VolatileWrite method is available as well, but due to some bizarre consequences of the .NET memory model, there is no difference between a volatile write and a regular write. At least, that is the case to the best of my knowledge, in .NET 3.5.

As I said earlier, volatile reads and writes prevent other reads and writes from moving across them in one direction. The MemoryBarrier method on the Thread class creates a fence that cannot be crossed in either direction by reads or writes.
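The flag scenario mentioned earlier, where one thread loops forever without noticing that another thread has set a flag, is probably the most common use of a volatile field. Here is a minimal sketch, where StoppableWorker and its members are illustrative names:

```csharp
using System;
using System.Threading;

class StoppableWorker {
    // volatile: every pass through the loop in Run performs a fresh read of the flag.
    private volatile bool stopRequested;
    private long iterations;   // written only by the worker thread

    public void Run() {
        while (!stopRequested) {
            iterations++;
        }
    }

    public void Stop() {
        stopRequested = true;
    }

    // Runs a worker for roughly the given time and returns how many iterations it completed.
    public static long RunFor(int milliseconds) {
        var worker = new StoppableWorker();
        var thread = new Thread(worker.Run);
        thread.Start();
        Thread.Sleep(milliseconds);
        worker.Stop();
        thread.Join();   // returns only once the worker has observed the flag
        return worker.iterations;   // safe to read after Join
    }
}
```

Without the volatile modifier, the JIT compiler would be allowed to read stopRequested once and keep the value in a register, in which case the loop might never terminate.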

Finally, one construct that is quite often useful is a thread-local static field. A thread-local static field differs from regular static fields in that it can hold multiple values at the same time. Specifically, each thread has its own copy of the field that is completely independent from the copies that other threads see. Each thread can read and write into its own copy. To make a static field thread-local, mark it with the ThreadStatic attribute.

This static class uses a thread-local field to track how many times each thread called the Register() method:

static class ThreadAccessCounter {
    [ThreadStatic]
    private static int count;

    public static void Register() {
        count++;
        Console.WriteLine("This thread has called Register() {0} times", count);
    }
}

Comments and Conclusion

These are the most important concurrency constructs and primitives in .NET 3.5… quite a list! And, that list does not even include Parallel Extensions, the community technology preview of which was released two weeks ago. (For those unaware, I am a developer on the Microsoft team developing Parallel Extensions.)

As is usual, all code samples in this posting are provided as-is, with no guarantees.

Related

Managed Threading Best Practices [msdn.microsoft.com] covers concurrency gotchas and how to avoid them, which is something I did not get to in this article.
C# 3.0 in a Nutshell [book] contains an in-depth chapter on concurrency. The chapter is also available online.
Concurrent Programming on Windows Vista [book] is available for pre-order, and will be an excellent resource for concurrency on Windows and .NET.