The memory model is a fascinating topic – it touches on hardware, concurrency, compiler optimizations, and even math.
The memory model defines what state a thread may see when it reads a memory location modified by other threads. For example, if one thread updates a regular non-volatile field, it is possible that another thread reading the field will never observe the new value. This program never terminates (in a release build):
class Test { private bool _loop = true; public static void Main() { Test test1 = new Test(); // Set _loop to false on another thread new Thread(() => { test1._loop = false;}).Start(); // Poll the _loop field until it is set to false while (test1._loop == true) ; // The loop above will never terminate! } }
There are two possible ways to get the while loop to terminate:
- Use a lock to protect all accesses (reads and writes) to the _loop field
- Mark the _loop field as volatile
There are two reasons why a read of a non-volatile field may observe a stale value: compiler optimizations and processor optimizations.
In concurrent programming, threads can get interleaved in many different ways, resulting in possibly many different outcomes. But as the example with the infinite loop shows, threads do not just get interleaved – they potentially interact in more complex ways, unless you correctly use locks and volatile fields.
Compiler optimizations
The first reason why a non-volatile read may return a stale value has to do with compiler optimizations. In the infinite loop example, the JIT compiler optimizes the while loop from this:
while (test1._loop == true) ;
To this:
if (test1._loop) { while (true); }
This is an entirely reasonable transformation if only one thread accesses the _loop field. But, if another thread changes the value of the field, this optimization can prevent the reading thread from noticing the updated value.
If you mark the _loop field as volatile, the compiler will not hoist the read out of the loop. The compiler will know that other threads may be modifying the field, and so it will be careful to avoid optimizations that would result in a read of a stale value.
The code transformation I showed is a close approximation of the optimization done by the CLR JIT compiler, but not completely exact.
The full story is that the assembly code emitted by the JIT compiler will store the value test1._loop in the EAX register. The loop condition will keep polling the register, and will read test1._loop from memory again. Even when the thread is pre-empted, the CPU registers get saved. Once the thread is again scheduled to run, the same stale EAX register value will be restored, and the loop never terminates.
The assembly code generated by the while loop looks as follows:
00000068 test eax,eax 0000006a jne 00000068
If you make the _loop field volatile, this code is generated instead:
00000064 cmp byte ptr [eax+4],0 00000068 jne 00000064
If the _loop field is not volatile, the compiler will store _loop in the EAX register. If _loop is volatile, the compiler will instead keep the test1 variable in EAX, and the value of _loop will be re-fetched from memory on each access (by “ptr [eax+4]”).
From my experience playing around with the current version of the CLR, I get the impression that these kinds of compiler optimizations are not terribly frequent. On x86 and x64, often the same assembly code will be generated regardless of whether a field is volatile or not. On IA64, the situation is a bit different – see the next section.
Processor optimizations
On some processors, not only must the compiler avoid certain optimizations on volatile reads and writes, it also has to use special instructions. On a multi-core machine, different cores have different caches. The processors may not bother to keep those caches coherent by default, and special instructions may be needed to flush and refresh the caches.
The mainstream x86 and x64 processors implement a strong memory model where memory access is effectively volatile. So, a volatile field forces the compiler to avoid some high-level optimizations like hoisting a read out of a loop, but otherwise results in the same assembly code as a non-volatile read.
The Itanium processor implements a weaker memory model. To target Itanium, the JIT compiler has to use special instructions for volatile memory accesses: LD.ACQ and ST.REL, instead of LD and ST. Instruction LD.ACQ effectively says, “refresh my cache and then read a value” and ST.REL says, “write a value to my cache and then flush the cache to main memory”. LD and ST on the other hand may just access the processor’s cache, which is not visible to other processors.
For the reasons explained in this section and the previous sections, marking a field as volatile will often incur zero performance penalty on x86 and x64.
The x86/x64 instruction set actually does contains three fence instructions: LFENCE, SFENCE, and MFENCE. LFENCE and SFENCE are apparently not needed on the current architecture, but MFENCE is useful to go around one particular issue: if a core reads a memory location it previously wrote, the read may be served from the store buffer, even though the write has not yet been written to memory. [Source] I don’t actually know whether the CLR JIT ever inserts MFENCE instructions.
Volatile accesses in more depth
To understand how volatile and non-volatile memory accesses work, you can imagine each thread as having its own cache. Consider a simple example with a non-volatile memory location (i.e. a field) u, and a volatile memory location v.

A non-volatile write could just update the value in the thread’s cache, and not the value in main memory:
However, in C# all writes are volatile (unlike say in Java), regardless of whether you write to a volatile or a non-volatile field. So, the above situation actually never happens in C#.
A volatile write updates the thread’s cache, and then flushes the entire cache to main memory. If we were to now set the volatile field v to 11, both values u and v would get flushed to main memory:

Since all C# writes are volatile, you can think of all writes as going straight to main memory.
A regular, non-volatile read can read the value from the thread’s cache, rather than from main memory. Despite the fact that thread 1 set u to 11, when thread 2 reads u, it will still see value 10:

When you read a non-volatile field in C#, a non-volatile read occurs, and you may see a stale value from the thread’s cache. Or, you may see the updated value. Whether you see the old or the new value depends on your compiler and your processor.
Finally, let’s take a look at an example of a volatile read. Thread 2 will read the volatile field v:
Before the volatile read, thread 2 refreshes its entire cache, and then reads the updated value of v: 11. So, it will observe the value that is really in main memory, and also refresh its cache as a bonus.
Note that the thread caches that I described are imaginary – there really is no such thing as a thread cache. Threads only appear to have these caches as an artifact of compiler and processor optimizations.
One interesting point is that all writes in C# are volatile according to the memory model as documented here and here, and are also presumably implemented as such. The ECMA specification of the C# language actually defines a weaker model where writes are not volatile by default.
You may find it surprising that a volatile read refreshes the entire cache, not just the read value. Similarly, a volatile write (i.e., every C# write) flushes the entire cache, not just the written value. These semantics are sometimes referred to as “strong volatile semantics”.
The original Java memory model designed in 1995 was based on weak volatile semantics, but was changed in 2004 to strong volatile. The weak volatile model is very inconvenient. One example of the problem is that the “safe publication” pattern is not safe. Consider this example:
volatile string[] _args = null;
public void Write() { string[] a = new string[2]; a[0] = "arg1"; a[1] = "arg2"; _args = a; ... } public void Read() { if (_args != null) { // Under weak volatile semantics, this assert could fail! Debug.Assert(_args[0] != null); } }
Under strong volatile semantics (i.e., the .NET and C# volatile semantics), a non-null value in the _args field guarantees that the elements of _args are also not null. The safe publication pattern is very useful and commonly used in practice.
Memory model and .NET operations
Here is a table of how various .NET operations interact with the imaginary thread cache:
| Construct | Refreshes thread cache before? | Flushes thread cache after? | Notes |
| Ordinary read | No | No | Read of a non-volatile field |
| Ordinary write | No | Yes | Write of a non-volatile field |
| Volatile read | Yes | No | Read of volatile field, or Thread.VolatileRead |
| Volatile write | No | Yes | Write of a volatile field – same as non-volatile |
| Thread.MemoryBarrier | Yes | Yes | Special memory barrier method |
| Interlocked operations | Yes | Yes | Increment, Add, Exchange, etc. |
| Lock acquire | Yes | No | Monitor.Enter or entering a lock {} region |
| Lock release | No | Yes | Monitor.Exit or exiting a lock {} region |
For each operation, the table shows two things:
- Is the entire imaginary thread cache refreshed from main memory before the operation?
- Is the entire imaginary thread cache flushed to main memory after the operation?
Disclaimer and limitations of the model
This blog post reflects my personal understanding of the .NET memory model, and is based purely on publicly available information.
I find the explanation based on imaginary thread caches more intuitive than the more commonly used explanation based on operation reordering. The thread cache model is also accurate for most intents and purposes.
To be even more accurate, you should assume that the thread caches can form an arbitrary large hierarchy, and so you cannot assume that a read is served only from two possible places – main memory or the thread’s cache. I think that you would have to construct a somewhat of a clever case in order for the cache hierarchy to make a difference, though. If anyone is aware of a case where the hierarchical thread cache model makes a prediction different from the reordering-based model, I would love to hear about it.
If you are interested in the .NET memory model, I encourage you to read Understand the Impact of Low-Lock Techniques in Multithreaded Apps in the MSDN Magazine, and the Memory model blog post from Chris Brumme.
Read more of my articles:
Gallery of processor cache effects
What really happens when you navigate to a URL
Human heart is a Turing machine, research on XBox 360 shows. Wait, what?
7 tricks to simplify your programs with LINQ
And if you like my blog, subscribe!
As a software developer, you certainly have a high-level picture of how web apps work and what kinds of technologies are involved: the browser, HTTP, HTML, web server, request handlers, and so on.
In this article, we will take a deeper look at the sequence of events that take place when you visit a URL.
1. You enter a URL into the browser
It all starts here:
2. The browser looks up the IP address for the domain name
The first step in the navigation is to figure out the IP address for the visited domain. The DNS lookup proceeds as follows:
Most of my readers will understand that cache is a fast but small type of memory that stores recently accessed memory locations. This description is reasonably accurate, but the “boring” details of how processor caches work can help a lot when trying to understand program performance.
In this blog post, I will use code samples to illustrate various aspects of how caches work, and what is the impact on the performance of real-world programs.
The examples are in C#, but the language choice has little impact on the performance scores and the conclusions they lead to.
Example 1: Memory accesses and performance
How much faster do you expect Loop 2 to run, compared Loop 1?
int[] arr = new int[64 * 1024 * 1024]; Read the rest of this entry »
PDC 2009 was an exciting event, with announcements about Azure, Silverlight 4, and Office 2010 popping up one after another. For me, there was another reason why this year’s PDC was exciting – it was my first chance to present in a major conference.
Around two weeks ago, I found this email in my inbox, with the subject “Complaint about Robozzle”:
Hi Igor
Robozzle is really cute, I like it, but why on earth is it polluted with hundreds of invisible links to porn sites? From a guy like you I don’t expect to do such dirty things. pls remove them.
David
Did you ever see one of those auto-generated random “academic papers” like this one? When I first saw the following title, my first thought was that it is a randomly-generated “paper”:
Implications of the Turing completeness of reaction-diffusion models, informed by GPGPU simulations on an XBox 360: Cardiac arrhythmias, re-entry and the Halting problem [PDF]
Turing completeness, cardiac arrhythmias, XBox 360… those things don’t seem to have much in common. But, I had my interest piqued. I looked up the paper and read through it. And, it turns out that not only is the paper serious, what it has to say is also quite interesting.
Not many developers have the insights of Michael Abrash. He is a game developer with decades of experience building commercial games, including a game you may recognize as "Quake". His Graphics Programming Black Book is years old, but much of it is just as interesting as it was at the time of writing. And, the entire book is available online, for free.
The book consists of 70 chapters on optimization, graphics, and assembly programming. The entire book is insightful and interesting, but my favorite chapters are these:
- The Best Optimizer Is between Your Ears
- There Ain’t No Such Thing as the Fastest Code
- Optimizing for 286 and 386, 486 (continued), and Pentium (continued here and here).
- Algorithms for fast drawing of lines, anti-aliased lines, polygons, fast convex polygons, and 3d objects (continued)
- Quake’s visible-surface determination, 3D clipping, hidden-surface removal, span sorting, lighting, surface caching and post-mortem.
Enjoy! Again, the entire book is accessible here: Graphics Programming Black Book.
Tags: Algorithms
Over the past couple of years, auto-complete has popped up all over the web. Facebook, YouTube, Google, Bing, MSDN, LinkedIn and lots of other websites all try to complete your phrase as soon as you start typing.
Auto-complete definitely makes for a nice user experience, but it can be a challenge to implement efficiently. In many cases, an efficient implementation requires the use of interesting algorithms and data structures. In this blog post, I will describe one simple data structure that can be used to implement auto-complete: a ternary search tree.
Trie: simple but space-inefficient
Tags: Algorithms
Silverlight 2 is an awesome platform for development of rich web applications. One issue to be aware of, though, is that some browser features do not extend into Silverlight apps. For example, the Back and Forward browser buttons do not always work as the user may expect with Silverlight apps. The set of browser features you’ll miss in Silverlight apps is pretty much identical to what you’d miss in Flash or AJAX, and some of them are about to be addressed in the Silverlight 3 release.
Let’s go over the affected browser features one by one, and discuss ways of getting them working in Silverlight apps.
Feature 1: Bookmarking and deep linking
Tags: Silverlight
It took significantly longer than I expected, but the RoboZZle demo video is finally on YouTube.
The video made the reddit front page, so you can read the usual mixture of insightful, funny, and outright insane comments on the reddit page and the YouTube page.
My favorite funny comment is this one:

And finally, this is the video:
Tags: RoboZZle


Recent Comments