Igor Ostrovsky Blogging » Cool Stuff

Is two to the power of infinity more than infinity?

Igor Ostrovsky — Thu, 21 Apr 2011 15:11:45 +0000

Do you know whether this inequality is true?

2^∞ > ∞

It’s a simple question, but with an intriguing and revealing answer.

Infinity #1

One concept of infinity that most people would have encountered in a math class is the infinity of limits. With limits, we can try to understand 2^∞ as follows:

lim (x→∞) 2^x = ∞

The infinity symbol is used twice here: the first time to represent “as x grows”, and the second time to represent “2^x eventually permanently exceeds any specific bound”.

If we use the notation a bit loosely, we could “simplify” the limit above as follows:

2^∞ = ∞

This would suggest that the answer to the question in the title is “No”, but as will be apparent shortly, using infinity notation loosely is not a good idea.

Infinity #2

In addition to limits, there is another place in mathematics where infinity is important: in set theory.

Set theory recognizes infinities of multiple “sizes”, the smallest of which is the size of the set of positive integers: { 1, 2, 3, … }. A set whose size is equal to the size of the set of positive integers is called countably infinite.

“Countable infinity plus one”
If we add another element (say 0) to the set of positive integers, is the new set any larger? To see that it cannot be larger, you can look at the problem differently: in set { 0, 1, 2, … } each element is simply smaller by one, compared to the set { 1, 2, 3, … }. So, even though we added an element to the infinite set, we really just “relabeled” the elements by decrementing every value.
“Two times countable infinity”
Now, let’s “double” the set of positive integers by adding values 0.5, 1.5, 2.5, … The new set might seem larger, since it contains an infinite number of new values. But again, you can say that the sets are the same size, just each element is half the size:
“Countable infinity squared”
To “square” countable infinity, we can form a set that will contain all integer pairs, such as [1,1], [1,2], [2,2] and so on. By pairing up every integer with every integer, we are effectively squaring the size of the integer set.
Can pairs of integers also be basically just relabeled with integers? Yes, they can, and so the set of integer pairs is no larger than the set of integers. The diagram below shows how integer pairs can be “relabeled” with ordinary integers (e.g., pair [2,2] is labeled as 5):
“Two to the power of countable infinity”
The set of integers contains a countable infinity of elements, and so the set of all integer subsets should – loosely speaking – contain two to the power of countable infinity elements. So, is the number of integer subsets equal to the number of integers? It turns out that the “relabeling” trick we used in the first three examples does not work here, and so it appears that there are more integer subsets than there are integers.
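
To make the “relabeling” in the first three examples concrete, here is a minimal Python sketch. The function names are made up for illustration, and the diagonal order used for pairs is an assumption (the original diagram may walk the pairs in a different order), although it does give [2,2] the label 5:

```python
def plus_one_label(n):
    """Relabel an element of {0, 1, 2, ...} as a positive integer."""
    return n + 1

def half_step_label(x):
    """Relabel an element of {0.5, 1, 1.5, 2, ...} as a positive integer."""
    return int(2 * x)

def pair_label(a, b):
    """Relabel a pair of positive integers by walking the diagonals a + b = 2, 3, 4, ..."""
    d = a + b - 2                  # number of complete diagonals before this one
    return d * (d + 1) // 2 + a    # pairs on earlier diagonals, plus position on this one

print(pair_label(1, 1), pair_label(1, 2), pair_label(2, 1), pair_label(2, 2))
```

Each function can be inverted, which is what makes it a relabeling rather than a lossy encoding.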

Let’s look at the fourth example in more detail to understand why it is so fundamentally different from the first three. You can think of an integer subset as a binary number with an infinite sequence of digits: i-th digit is 1 if i is included in the subset and 0 if i is excluded. So, a typical integer subset is a sequence of ones and zeros going forever and ever, with no pattern emerging.

And now we are getting to the key difference. Every integer, half-integer, or integer pair can be described using a finite number of bits. That’s why we can squint at the set of integer pairs and see that it really is just a set of integers. Each integer pair can be easily converted to an integer and back.

However, an integer subset is an infinite sequence of bits. It is impossible to describe a general scheme for converting an infinite sequence of bits into a finite sequence without information loss. That is why it is impossible to squint at the set of integer subsets and argue that it really is just a set of integers.
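
The information-loss argument above is informal. The standard rigorous proof is Cantor's diagonal argument: given any list of subsets, build a subset that disagrees with the i-th listed subset about element i, so it cannot appear anywhere in the list. Here is a finite Python sketch of that construction (the function name is hypothetical, and indices start at 0 rather than 1):

```python
def diagonal_subset(listed_subsets):
    """Build a subset that differs from the i-th listed subset at element i."""
    return {i for i, subset in enumerate(listed_subsets) if i not in subset}

listed = [{0, 1}, {1}, set(), {0, 3}]
missing = diagonal_subset(listed)
print(missing)
```

Whatever list you start with, the constructed subset is not on it, which is why no labeling of subsets by integers can be complete.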

The diagram below shows examples of infinite sets of three different sizes:

So, in set theory, there are multiple infinities. The smallest infinity is the “countable” infinity, ℵ₀, that matches the number of integers. A larger infinity is 2^ℵ₀, which matches the number of real numbers or integer subsets; whether it equals ℵ₁, the next infinity after ℵ₀, is exactly the question posed by the Continuum Hypothesis. And there are even larger and larger infinite sets.

Since there are more integer subsets than there are integers, it should not be surprising that the mathematical formula below holds (you can find the formula in the Wikipedia article on the Continuum Hypothesis):

2^ℵ₀ > ℵ₀

And since ℵ₀ denotes infinity (the smallest kind), it seems that it would not be much of a stretch to write this:

2^∞ > ∞

… and now it seems that the answer to the question from the title should be “Yes”.

The answer

So, is it true that 2^∞ > ∞? The answer depends on which notion of infinity we use. The infinity of limits has no size concept, and the formula would be false. The infinity of set theory does have a size concept and the formula would be kind of true.

Technically, the statement 2^∞ > ∞ is neither true nor false. Due to the ambiguous notation, it is impossible to tell which concept of infinity is being used, and consequently which rules apply.

Who cares?

OK… but why would anyone care that there are two different notions of infinity? It is easy to get the impression that the discussion is just an intellectual exercise with no practical implications.

On the contrary, rigorous understanding of the two kinds of infinity has been very important. After properly understanding the first kind of infinity, Isaac Newton was able to develop calculus, followed by the theory of gravity. And, the second kind of infinity was a pre-requisite for Alan Turing to define computability (see my article on Numbers that cannot be computed) and Kurt Gödel to prove Gödel’s Incompleteness Theorem.

So, understanding both kinds of infinity has led to important insights and practical advancements.

Graphs, trees, and origins of humanity

Igor Ostrovsky — Fri, 02 Jul 2010 05:44:18 +0000

Imagine that 90,000 years ago, every man alive at the time picked a different last name. Assuming that last names are inherited from father to son, how many different last names do you think there would be today?

It turns out that there would be only one last name!

Similarly, imagine that 200,000 years ago, every woman alive picked a different secret word, and told the secret word to her daughters. And, the female descendants would follow this tradition – a mother would always pass her secret word to her daughters.

As you might guess, today there would be only one secret word in circulation.

The man whose last name all men would carry is called “Y-chromosomal Adam” and the woman whose secret word all women would know is “Mitochondrial Eve”. The names come from the fact that the Y chromosome is a piece of DNA inherited from father to son (just like last names), and mitochondrial DNA is inherited from mother to children (just like the hypothetical secret words).

Why the convergence?

So, how likely is it that one last name and one secret word will eventually come to dominate? Given enough time, it is virtually guaranteed, under some assumptions (e.g., the population does not become separated).

Here is a simulation of how last names of five men could flow through six generations:

After six generations, the last name shown as green is the only last name around, purely by chance! In biology, this random effect is called genetic drift.

And, the convergence does not only happen for small populations. Here are the numbers that I got by simulating different population sizes:

Population	Generations to convergence
10	23
50	73
100	239
500	891
1000	1395
5000	7312
10000	13491

Here is the implementation of this simulation:

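using System;
using System.Linq;   // needed for Max() and Min() on int[]

// Returns the number of generations until only one lineage remains.
static int MitochondrialEve(int populationSize)
{
    Random random = new Random();

    // Initially, every individual carries a distinct label (last name or secret word).
    int generations = 0;
    int[] cur = new int[populationSize];
    for (int i = 0; i < populationSize; i++) cur[i] = i;

    // Keep going until all individuals carry the same label.
    for ( ; cur.Max() != cur.Min(); generations++)
    {
        // Each member of the next generation inherits the label of a random parent.
        int[] next = new int[populationSize];
        for (int i = 0; i < next.Length; i++)
        {
            next[i] = cur[random.Next(populationSize)];
        }
        cur = next;
    }

    return generations;
}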

This simulation is not a sophisticated model of a human population, but it is sufficient for the purposes of illustrating genetic drift. It assumes that every man has on average one son, with a standard deviation of roughly one, and a binomial probability distribution.
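
A single run of a random simulation like this can vary a lot from one execution to the next. As a sketch, here is the same model ported to Python (a port for illustration, not the original code; the function names and the seed parameter are assumptions), with a helper that averages several runs:

```python
import random

def generations_to_convergence(population_size, rng):
    """Count generations until every individual carries the same label."""
    cur = list(range(population_size))
    generations = 0
    while len(set(cur)) > 1:
        # Each child inherits the label of a uniformly random parent.
        cur = [cur[rng.randrange(population_size)] for _ in range(population_size)]
        generations += 1
    return generations

def average_generations(population_size, runs, seed=0):
    """Average the convergence time over several independent runs."""
    rng = random.Random(seed)
    total = sum(generations_to_convergence(population_size, rng) for _ in range(runs))
    return total / runs

print(average_generations(10, runs=20))
```

Averaging smooths out the run-to-run noise, which makes the growth with population size easier to see.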

By the way, the Y-chromosomal Adam lived roughly 60,000–90,000 years ago, while Eve lived roughly 200,000 years ago. The reason why genetic drift acted faster for men is that men have a larger variation in the number of offspring – one man can have many more children than one woman.

UPDATE: Based on comments at Hacker News and Reddit, some readers are dissatisfied with the assumption of a fixed population size. Of course, human population grew over the ages, but there were also periods when it shrank, sometimes a lot. For example, roughly 70,000 years ago, the human population may have dropped down to thousands of individuals (1, 2). So, a fixed population size is a reasonable simplification for my example.

The roles of Mitochondrial Eve and Y-chromosomal Adam

One noteworthy fact about Mitochondrial Eve and Y-chromosomal Adam is that their positions in history are not as special as they may first appear. Let’s take a look at this in more depth.

Tracing from me, I can follow a path of paternal ancestry all the way to Y-chromosomal Adam:

If I only trace the male ancestry, there is exactly one path that starts at me, and that path leads to Y-chromosomal Adam. You could start the diagram at any man alive today and you’d get a similar picture, with the lineage finally reaching the Y-chromosomal Adam.

However, if I trace ancestry via both parents, the number of ancestors explodes. I have two parents, four grandparents, eight great grandparents, and so forth. The number of ancestors in each generation grows exponentially, although that cannot continue for long. If all ancestors in each generation were distinct, I would have more than 1 billion ancestors just 30 generations back, so the tree certainly has to start collapsing into a directed acyclic graph by then.
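
The arithmetic behind the “1 billion” figure is just repeated doubling. A quick Python check (the function name is made up for illustration):

```python
def ancestor_slots(generations_back):
    """Number of positions in the family tree, if every ancestor were a distinct person."""
    return 2 ** generations_back

# First generation whose ancestor count exceeds one billion
first_over_a_billion = next(g for g in range(1, 64) if ancestor_slots(g) > 1_000_000_000)
print(first_over_a_billion, ancestor_slots(first_over_a_billion))
```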

Tracing ancestry through both parents, there are many paths to follow, and each generation of ancestors contains a lot of people. Some of those paths will reach Y-chromosomal Adam, but other paths will reach other men in his generation. Similarly, some paths will reach Mitochondrial Eve, but other paths will reach other women in her generation.

Most recent common ancestor

So, Mitochondrial Eve only has a special position with respect to the mother-daughter relationships, and Y-chromosomal Adam only with respect to the father-son relationships.

What if you consider all types of ancestry: father-son, father-daughter, mother-son and mother-daughter? In the resulting directed acyclic graph, neither Y-chromosomal Adam nor Mitochondrial Eve appears in a special position. In fact, in the combined graph, the most recent common ancestor of all of today’s people lived much later than Y-chromosomal Adam. The most recent common ancestor is estimated to have lived roughly 5,000 to 15,000 years ago.

One way to visualize the relationship between Mitochondrial Eve, Y-chromosomal Adam, and the Most Recent Common Ancestor (MRCA) is to look at a small genealogy diagram with just a few people:

For the last generation consisting of just four people, this graph shows the Mitochondrial Eve, the Y-chromosomal Adam, and the most recent common ancestors (a couple in this example, but could also be one man or one woman). Adam is at the root of the blue tree, Eve is at the root of the red tree, and the most recent common ancestors are much lower in the graph.
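
The same relationships can be computed on a small graph. The Python sketch below uses a made-up genealogy (not the one in the diagram; all names are hypothetical) with four living people, and finds the latest common ancestors along the father line, the mother line, and all lines combined:

```python
FATHER, MOTHER = 0, 1

# child -> (father, mother); people without an entry are the founding generation
PARENTS = {
    "a1": ("adam", "w0"), "a2": ("adam", "w0"),
    "e1": ("m0", "eve"),  "e2": ("m0", "eve"),
    "p": ("a1", "e1"), "q": ("a1", "e1"),
    "r": ("a2", "e2"), "s": ("a2", "e2"),
    "x": ("p", "s"), "y": ("p", "s"),
    "z": ("r", "q"), "w": ("r", "q"),
}

def generation(person):
    """0 for founders, otherwise one more than the later parent."""
    parents = PARENTS.get(person)
    return 0 if parents is None else 1 + max(generation(p) for p in parents)

def ancestors(person, roles):
    """All ancestors reachable by following only the given parent roles."""
    found, stack = set(), [person]
    while stack:
        current = stack.pop()
        for role in roles:
            parent = PARENTS.get(current, (None, None))[role]
            if parent is not None and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

def latest_common_ancestors(people, roles):
    """Common ancestors of all the given people that belong to the latest generation."""
    common = set.intersection(*(ancestors(p, roles) for p in people))
    if not common:
        return set()
    latest = max(generation(a) for a in common)
    return {a for a in common if generation(a) == latest}

y_adam = latest_common_ancestors(["x", "z"], [FATHER])            # living men, father line
m_eve = latest_common_ancestors(["x", "y", "z", "w"], [MOTHER])   # everyone, mother line
mrca = latest_common_ancestors(["x", "y", "z", "w"], [FATHER, MOTHER])
print(y_adam, m_eve, mrca)
```

In this toy genealogy, the common ancestors through all lines sit one generation later than Adam and Eve, which mirrors the point that the MRCA is more recent than either of them.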

The dating of Y-Adam and M-Eve

Finally, I’ll briefly give you an idea on how biologists calculate when Mitochondrial Eve and Y-chromosomal Adam lived.

The dating is based on DNA analysis. Changes in DNA accumulate at a certain rate that depends on various factors – region of the DNA, the species, population size, etc. To date Mitochondrial Eve, biologists calculate an estimate of the mitochondrial DNA (mtDNA) mutation rate. Then, they look at how much mtDNA varies between today’s women, and then calculate how long it would take to achieve that degree of variation.

Another interesting fact is that the titles of Mitochondrial Eve and Y-chromosomal Adam are not permanent, but instead are reassigned over time. For example, the woman who we call “Mitochondrial Eve” today did not hold that title during her lifetime. Instead, there was another unknown woman who was the most recent common matrilineal ancestor of all women alive at Eve’s time.

Final words

I hope you enjoyed the article. I originally learned about Y-chromosomal Adam and Mitochondrial Eve from reading Before the Dawn, and immediately knew I had to blog about them from a programmer’s perspective. If you want to read more on the topic, the Wikipedia page on Mitochondrial Eve is a good start.

Read more of my articles:

Fast and slow if statements: branch prediction in modern processors

Human heart is a Turing machine, research on XBox 360 shows. Wait, what?

Self-printing Game of Life in C#

And if you like my blog, subscribe!

Fast and slow if-statements: branch prediction in modern processors

Igor Ostrovsky — Sat, 15 May 2010 21:11:32 +0000

Did you know that the performance of an if-statement depends on whether its condition has a predictable pattern? If the condition is always true or always false, the branch prediction logic in the processor will pick up the pattern. On the other hand, if the pattern is unpredictable, the if-statement will be much more expensive. In this article, I’ll explain why today’s processors behave this way.

Let’s measure the performance of this loop with different conditions:

for (int i = 0; i < max; i++) if (<condition>) sum++;

Here are the timings of the loop with different True-False patterns:

Condition	Pattern	Time (ms)
(i & 0x80000000) == 0	T repeated	322
(i & 0xffffffff) == 0	F repeated	276
(i & 1) == 0	TF alternating	760
(i & 3) == 0	TFFFTFFF…	513
(i & 2) == 0	TTFFTTFF…	1675
(i & 4) == 0	TTTTFFFFTTTTFFFF…	1275
(i & 8) == 0	8T 8F 8T 8F …	752
(i & 16) == 0	16T 16F 16T 16F …	490

A “bad” true-false pattern can make an if-statement up to six times slower than a “good” pattern! Of course, which pattern is good and which is bad depends on the exact instructions generated by the compiler and on the specific processor.
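
To see where the patterns in the table come from, this small Python sketch evaluates a few of the conditions for the first 16 values of i. It only prints the True-False patterns; timing the loop in Python would not reproduce the measurements, since interpreter overhead dominates:

```python
def pattern(condition, length=16):
    """Return the True/False pattern of a condition for i = 0 .. length-1."""
    return "".join("T" if condition(i) else "F" for i in range(length))

conditions = {
    "(i & 0x80000000) == 0": lambda i: (i & 0x80000000) == 0,
    "(i & 0xffffffff) == 0": lambda i: (i & 0xffffffff) == 0,
    "(i & 1) == 0": lambda i: (i & 1) == 0,
    "(i & 3) == 0": lambda i: (i & 3) == 0,
    "(i & 2) == 0": lambda i: (i & 2) == 0,
    "(i & 4) == 0": lambda i: (i & 4) == 0,
}

for text, condition in conditions.items():
    print(f"{text:24} {pattern(condition)}")
```

Note that the “F repeated” condition is actually true once, for i = 0, and false for every later iteration.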

Let’s look at the processor counters

One way to understand how the processor used its time is to look at the hardware counters. To help with performance tuning, modern processors track various counters as they execute code: the number of instructions executed, the number of various types of memory accesses, the number of branches encountered, and so forth. To read the counters, you’ll need a tool such as the profiler in Visual Studio 2010 Premium or Ultimate, AMD CodeAnalyst or Intel VTune.

To verify that the slowdowns we observed were really due to the if-statement performance, we can look at the Mispredicted Branches counter:

The worst pattern (TTFFTTFF…) results in 774 branch mispredictions, while the good patterns only get around 10. No wonder that the bad case took the longest, 1.67 seconds, while the good patterns only took around 300ms!

Let’s take a look at what “branch prediction” does, and why it has a major impact on the processor performance.

What’s the role of branch prediction?

To explain what branch prediction is and why it impacts the performance numbers, we first need to take a look at how modern processors work. To complete each instruction, the CPU goes through these (and more) stages:

1. Fetch: Read the next instruction.

2. Decode: Determine the meaning of the instruction.

3. Execute: Perform the real ‘work’ of the instruction.

4. Write-back: Store results into memory.

An important optimization is that the stages of the pipeline can process different instructions at the same time. So, as one instruction is getting fetched, a second one is being decoded, a third is executing and the results of a fourth are getting written back. Modern processors have pipelines with 10 – 31 stages (e.g., Pentium 4 Prescott has 31 stages), and for optimum performance, it is very important to keep all stages as busy as possible.

Image from http://commons.wikimedia.org/wiki/File:Pipeline,_4_stage.svg
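
The overlap between stages can be sketched in a few lines of Python. This is an idealized model, assuming every stage takes exactly one clock cycle and nothing ever stalls; the function name is made up for illustration:

```python
STAGES = ["Fetch", "Decode", "Execute", "Write-back"]

def pipeline_schedule(num_instructions, stages=STAGES):
    """For each clock cycle, list which instruction (0-based) occupies each stage, or None."""
    cycles = num_instructions + len(stages) - 1
    schedule = []
    for cycle in range(cycles):
        row = []
        for stage_index in range(len(stages)):
            instruction = cycle - stage_index   # instruction k enters stage s at cycle k + s
            row.append(instruction if 0 <= instruction < num_instructions else None)
        schedule.append(row)
    return schedule

for cycle, row in enumerate(pipeline_schedule(4)):
    print(cycle, row)
```

With four instructions and four stages, the pipelined schedule finishes in 7 cycles, compared to 16 if each instruction had to complete all stages before the next one started.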

Branches (i.e. conditional jumps) present a difficulty for the processor pipeline. After fetching a branch instruction, the processor needs to fetch the next instruction. But, there are two possible “next” instructions! The processor won’t be sure which instruction is the next one until the branching instruction makes it to the end of the pipeline.

Instead of stalling the pipeline until the branching instruction is fully executed, modern processors attempt to predict whether the jump will or will not be taken. Then, the processor can fetch the instruction that it thinks is the next one. If the prediction turns out wrong, the processor will simply discard the partially executed instructions that are in the pipeline. See the Wikipedia page on branch predictor implementation for some typical techniques used by processors to collect and interpret branch statistics.
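
One textbook technique is the 2-bit saturating counter: a small state machine that needs two wrong guesses in a row before it changes its prediction. The Python sketch below simulates a single such counter on a few of the patterns from the table. It is a toy model with a hypothetical function name, not a description of any real processor, and it will not reproduce the measured timings:

```python
def count_mispredictions(condition, iterations, state=2):
    """Simulate one 2-bit saturating counter: states 0-1 predict false, 2-3 predict true."""
    misses = 0
    for i in range(iterations):
        predicted = state >= 2
        actual = condition(i)
        if predicted != actual:
            misses += 1
        # Move one step toward the observed outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if actual else max(state - 1, 0)
    return misses

print(count_mispredictions(lambda i: True, 1000))           # T repeated
print(count_mispredictions(lambda i: (i & 1) == 0, 1000))   # TF alternating
print(count_mispredictions(lambda i: (i & 2) == 0, 1000))   # TTFFTTFF...
```

Even in this toy model, a pattern with a short period such as TTFF can defeat a simple predictor: the counter keeps arriving at the right prediction just as the pattern flips.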

Modern branch predictors are good at predicting simple patterns: all true, all false, true-false alternating, and so on. But if the pattern happens to be something that throws off the branch predictor, the performance hit will be significant. Thankfully, most branches have easily predictable patterns, like the two examples highlighted below:

int SumArray(int[] array) {
    if (array == null) throw new ArgumentNullException("array");   // condition 1: input validation

    int sum = 0;
    for (int i = 0; i < array.Length; i++) sum += array[i];        // condition 2: loop termination
    return sum;
}

The first highlighted condition validates the input, and so the branch will be taken only very rarely. The second highlighted condition is a loop termination condition. This will also almost always go one way, unless the arrays processed are extremely short. So, in these cases – as in most cases – the processor branch prediction logic will be effective at preventing stalls in the processor pipeline.

Updates and Clarifications

This article got picked up by reddit, and got a fair bit of attention in the reddit comments. I’ll respond to the questions, comments and criticisms below.

First, regarding the comments that optimizing for branch prediction is generally a bad idea: I agree. I do not argue anywhere in the article that you should try to write your code to optimize for branch prediction. For the vast majority of high-level code, I can’t even imagine how you’d do that.

Second, there was a concern whether the executed instructions for different cases differ in something else other than the constant value. They don’t – I looked at the JIT-ted assembly. If you’d like to see the JIT-ted assembly code or the C# source code, send me an email and I’ll send them back. (I am not posting the code here because I don’t want to blow up this update.)

Third, another question was on the surprisingly poor performance of the TTFF* pattern. The TTFF* pattern has a short period, and as such should be an easy case for the branch prediction algorithms.

However, the problem is that modern processors don’t track history for each branching instruction separately. Instead, they either track global history of all branches, or they have several history slots, each potentially shared by multiple branching instructions. Or, they can use some combination of these tricks, together with other techniques.

So, the TTFF pattern in the if-statement may not be TTFF by the time it gets to the branch predictor. It may get interleaved with other branches (there are 2 branching instructions in the body of a for-loop), and possibly approximated in other ways too. But, I don’t claim to be an expert on what precisely each processor does, and if someone reading this has an authoritative reference to how different processors behave (esp. Intel Core2 that I tested on), please let me know in comments.

How GPU came to be used for general computation

Igor Ostrovsky — Fri, 12 Mar 2010 09:30:56 +0000

The story of how the GPU came to be used for high-performance computation is pretty cool. Hardware heavily optimized for graphics turned out to be useful for another purpose: certain types of high-performance computations. In this article, I will explore how and why this happened, and summarize the state of general computation on GPUs today.

Programmable graphics

The first step towards computation on the GPU was the introduction of programmable shaders. Both DirectX and OpenGL added support for programmable shaders roughly a decade ago, giving game designers more freedom to create custom graphics effects. Instead of just composing pre-canned effects, graphic artists can now write little programs that execute directly on the GPU. As of DirectX 8, they can specify two types of shader programs for every object in the scene: a vertex shader and a pixel shader.

A vertex shader is a function invoked on every vertex in the 3D object. The function transforms the vertex and returns its position relative to the camera view. By transforming vertices, vertex shaders help implement realistic skin, clothes, facial expressions, and similar effects.

A pixel shader is a function invoked on every pixel covered by a particular object and returns the color of the pixel. To compute the output color, the pixel shader can use a variety of optional inputs: XY-position on the screen, XYZ-position in the scene, position in the texture, the direction of the surface (i.e., the normal vector), etc. A pixel shader can also read textures, bump maps, and other inputs.

Here is a simple scene, rendered with six different pixel shaders applied to the teapot:

A always returns the same color. B varies the color based on the screen Y-coordinate. C sets the color depending on the XYZ screen coordinates. D sets the color proportionally to the cosine of the angle between the surface normal and the light direction (“diffuse lighting”). E uses a slightly more complex lighting model and a texture, and F also adds a bump map.

If you are curious how lighting shaders are implemented, check out Gamasutra’s Implementing Lighting Models With HLSL.

Realization: shaders can be used for computation!

Let’s take a look at a simple pixel shader that just blurs a texture. This shader is implemented in HLSL (a DirectX shader language):

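// Assumed declarations; the original listing does not show them.
sampler2D baseMap;
float fViewportWidth;
float fViewportHeight;

float4 ps_main( float2 t: TEXCOORD0 ) : COLOR0
{
   // Step size in texture coordinates (2 / viewport dimension)
   float dx = 2/fViewportWidth;
   float dy = 2/fViewportHeight;
   // Equal-weight average of the sample and its four neighbors
   return
      0.2 * tex2D( baseMap, t ) +
      0.2 * tex2D( baseMap, t + float2(dx, 0) ) +
      0.2 * tex2D( baseMap, t + float2(-dx, 0) ) +
      0.2 * tex2D( baseMap, t + float2(0, dy) ) +
      0.2 * tex2D( baseMap, t + float2(0, -dy) );
}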

The texture blur has this effect:

This is not exactly a breath-taking effect, but the interesting part is that simulations of car crashes, wind tunnels and weather patterns all follow this basic pattern of computation! All of these simulations are computations on a grid of points. Each point has one or more quantities associated with it: temperature, pressure, velocity, force, air flow, etc. In each iteration of the simulation, the neighboring points interact: temperatures and pressures are equalized, forces are transferred, grid is deformed, and so forth. Mathematically, the programs that run these simulations are partial differential equation (PDE) solvers.

As a trivial example, here is a simple simulation of heat dissipation. In each iteration, the temperature of each grid point is recomputed as an average over its nearest neighbors:

It is hard to overlook the fact that an iteration of this simulation is nearly identical to the blur operation. Hardware highly optimized for running pixel shaders will be able to run this simulation very fast. And, after years of refinement and challenges from latest and greatest games, GPUs became very efficient at using massive parallelism to execute shaders blazingly fast.
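
To make the similarity concrete, here is a minimal Python sketch of one heat-dissipation step on a small grid. It uses the same five equal weights as the blur shader. The function name and the clamped handling of the grid edges are assumptions made for the sketch:

```python
def diffuse_step(grid):
    """One iteration: each cell becomes the average of itself and its 4 nearest neighbors."""
    rows, cols = len(grid), len(grid[0])

    def at(r, c):
        # Clamp out-of-range coordinates to the nearest edge cell.
        r = min(max(r, 0), rows - 1)
        c = min(max(c, 0), cols - 1)
        return grid[r][c]

    return [
        [0.2 * (at(r, c) + at(r - 1, c) + at(r + 1, c) + at(r, c - 1) + at(r, c + 1))
         for c in range(cols)]
        for r in range(rows)
    ]

hot_center = [[0, 0, 0], [0, 100, 0], [0, 0, 0]]
after = diffuse_step(hot_center)
for row in after:
    print(row)
```

Every cell is computed independently from the previous grid, which is exactly the kind of work a pixel shader does once per pixel.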

One cool example of a PDE solver is a liquid and smoke simulator. The structure of the simulation is similar to my trivial heat dissipation example, but instead of tracking the temperature of each grid point, the smoke simulator tracks pressure and velocity. Just as in the heat dissipation example, a grid point is affected by all of its nearest neighbors in each iteration.

This simulation was developed by Keenan Crane.

For a view into general computation on GPUs in 2004 when hacked-up pixel shaders were the state of the art, see the General-Purpose Computation on GPUs section of GPU Gems 2.

Arrival of GPGPU

Once GPUs had shown themselves to be a good fit for certain types of high-performance computations (like PDEs), GPU manufacturers moved to make GPUs more attractive for general-purpose computing. The idea of General Purpose computation on a GPU (“GPGPU”) became a hot topic in high-performance computing.

GPGPU computing is based around compute kernels. A compute kernel is a generalization of a pixel shader:

Like a pixel shader, a compute kernel is a routine that will be invoked on each point in the input space.
A pixel shader always operates on two-dimensional space. A compute kernel can work on space of any dimensionality.
A pixel shader returns a single color. A compute kernel can write an arbitrary number of outputs.
A pixel shader operates on 32-bit floating-point numbers. A compute kernel also supports 64-bit floating-point numbers and integers.
A pixel shader reads from textures. A compute kernel can read from any place in GPU memory.
Additionally, compute kernels running on the same core can share data via an explicitly managed per-core cache.
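
The core idea of the list above, a routine invoked once per point in an index space of any dimensionality and free to write several outputs, can be emulated in Python. This sketch runs the invocations one after another, whereas a GPU would run them in parallel; `launch` and both kernels are hypothetical names, not part of any GPU API:

```python
from itertools import product

def launch(kernel, shape, *args):
    """Invoke the kernel once for every index in an N-dimensional index space."""
    for index in product(*(range(n) for n in shape)):
        kernel(index, *args)

def saxpy(index, a, x, y, out):
    """1D kernel: one output per point."""
    (i,) = index
    out[i] = a * x[i] + y[i]

def sum_and_diff(index, left, right, sums, diffs):
    """2D kernel: two outputs per point."""
    r, c = index
    sums[r][c] = left[r][c] + right[r][c]
    diffs[r][c] = left[r][c] - right[r][c]

out = [0, 0, 0]
launch(saxpy, (3,), 2, [1, 2, 3], [10, 20, 30], out)

sums = [[0, 0], [0, 0]]
diffs = [[0, 0], [0, 0]]
launch(sum_and_diff, (2, 2), [[1, 2], [3, 4]], [[10, 20], [30, 40]], sums, diffs)
print(out, sums, diffs)
```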

Comparison of a GPU and a CPU

The control flow of a modern application is typically very complicated. Just think about all the different tasks that must be completed to show this article in your browser. To display this blog article, the CPU has to communicate with various devices, maintain thousands of data structures, position thousands of UI elements, parse perhaps a hundred file formats, … that does not even begin to scratch the surface. And, not only are all of these tasks different, they also depend on each other in very complex ways.

Compare that with the control flow of a pixel shader. A pixel shader is a single routine that needs to be invoked roughly a million times, each time on a different input. Different shader invocations are pretty much independent and once all are done, the GPU starts over again with a scene where objects have moved a bit.

It shouldn’t come as a surprise that hardware optimized for running a pixel shader will be quite different from hardware optimized for tasks like viewing web pages. A CPU greatly benefits from a sophisticated execution pipeline with multi-level caches, instruction reordering, prefetching, branch prediction, and other clever optimizations. A GPU does not need most of those complex features for its much simpler control flow. Instead, a GPU benefits from lots of Arithmetic Logic Units (ALUs) to add, multiply and divide floating point numbers in parallel.

This table shows the most important differences between a CPU and a GPU today:

CPU	GPU
2-4 cores	16-32 cores
Each core runs 1-2 independent threads in parallel	Each core runs 16-32 threads in parallel. All threads on a core must execute the same instruction at any time.
Automatically managed hierarchy of caches	Each core has 16-64kB of cache, explicitly managed by the programmer
0.1 trillion floating-point operations / second (0.1 TFLOPS)	1 trillion floating-point operations / second (1 TFLOPS)
Main memory throughput: 10GB / sec	GPU memory throughput: 100GB / sec

All of this means that if a program can be broken up into many threads all doing the same thing on different data (ideally executing arithmetic operations), a GPU will probably be able to do this an order of magnitude faster than a CPU. On the other hand, on an application with a complex control flow, a CPU is going to be the one winning by orders of magnitude. Going back to my earlier example, it should be clear why a CPU will excel at running a browser and a GPU will excel at executing a pixel shader.

This chart illustrates how a CPU and a GPU use up their “silicon budget”. A CPU uses most of its transistors for the L1 cache and for execution control. A GPU dedicates the bulk of its transistors to Arithmetic Logic Units (ALUs).

Adapted from NVidia’s CUDA Programming Guide.

NVidia’s upcoming Fermi chip will slightly change the comparison table. Fermi introduces a per-core automatically-managed L1 cache. It will be very interesting to see what kind of impact the introduction of an L1 cache will have on the types of programs that can run on the GPU. One point is fairly clear – the penalty for register spills into main memory will be greatly reduced (this point may not make sense until you read the next section).

GPGPU Programming

Today, writing efficient GPGPU programs requires in-depth understanding of the hardware. There are three popular programming models:

DirectCompute – Microsoft’s API for defining compute kernels, introduced in DirectX 11
CUDA – NVidia’s C-based language for programming compute kernels
OpenCL – API originally proposed by Apple and now developed by Khronos Group

Conceptually, the models are very similar. The table below summarizes some of the terminology differences between the models:

DirectCompute	CUDA	OpenCL
thread	thread	work item
thread group	thread block	work group
group-shared memory	shared memory	local memory
warp?	warp	wavefront
barrier	barrier	barrier

Writing high-performance GPGPU code is not for the faint of heart (although the same could probably be said about any type of high-performance computing). Here are examples of some issues you need to watch out for when writing compute kernels:

The program has to have plenty of threads (thousands)
Not too many threads, though, or cores will run out of registers and will have to simulate additional registers using main GPU memory.
It is important that threads running on one core access main memory in such a pattern that the hardware will be able to coalesce the memory accesses from different threads. This optimization alone can make an order of magnitude difference.
… and so on.
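
To give a feel for the coalescing point, here is a deliberately simplified Python model. It assumes 32 threads that each read one 4-byte element, and memory that is fetched in aligned 128-byte segments; those numbers and the function name are assumptions for illustration, and real hardware rules are more involved:

```python
def memory_transactions(stride, warp_size=32, element_bytes=4, segment_bytes=128):
    """Count the distinct memory segments touched when thread t reads element t * stride."""
    segments = {
        (thread * stride * element_bytes) // segment_bytes
        for thread in range(warp_size)
    }
    return len(segments)

for stride in (1, 2, 32):
    print(stride, memory_transactions(stride))
```

In this model, neighboring threads reading neighboring elements need a single segment, while a stride of 32 elements forces a separate segment for every thread.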

Explaining all of these performance topics in detail is well beyond the scope of this article, but hopefully this gives you an idea of what GPGPU programming is about, and what kinds of problems it can be applied to.

What really happens when you navigate to a URL

Igor Ostrovsky — Tue, 09 Feb 2010 08:14:26 +0000

As a software developer, you certainly have a high-level picture of how web apps work and what kinds of technologies are involved: the browser, HTTP, HTML, web server, request handlers, and so on.

In this article, we will take a deeper look at the sequence of events that take place when you visit a URL.

1. You enter a URL into the browser

It all starts here:

2. The browser looks up the IP address for the domain name

The first step in the navigation is to figure out the IP address for the visited domain. The DNS lookup proceeds as follows:

Browser cache – The browser caches DNS records for some time. Interestingly, the OS does not tell the browser the time-to-live for each DNS record, and so the browser caches them for a fixed duration (varies between browsers, 2 – 30 minutes).
OS cache – If the browser cache does not contain the desired record, the browser makes a system call (gethostbyname in Windows). The OS has its own cache.
Router cache – The request continues on to your router, which typically has its own DNS cache.
ISP DNS cache – The next place checked is your ISP’s DNS server. With a cache, naturally.
Recursive search – Your ISP’s DNS server begins a recursive search, from the root nameserver, through the .com top-level nameserver, to Facebook’s nameserver. Normally, the DNS server will have names of the .com nameservers in cache, and so a hit to the root nameserver will not be necessary.
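The lookup order above can be sketched as a chain of caches that is walked from the closest layer outwards. This is only an illustration: DnsCacheLayer, DnsLookupSketch and the fixed per-layer lifetime are names and simplifications I made up for the sketch, not how any particular browser or OS implements its cache.

```csharp
using System;
using System.Collections.Generic;

// One cache layer (browser, OS, router, ISP). Hypothetical type for illustration.
public sealed class DnsCacheLayer
{
    private readonly Dictionary<string, (string Address, DateTime Expires)> _entries
        = new Dictionary<string, (string Address, DateTime Expires)>();

    public string Name { get; }
    public TimeSpan Lifetime { get; }

    public DnsCacheLayer(string name, TimeSpan lifetime)
    {
        Name = name;
        Lifetime = lifetime;
    }

    // A record counts as a hit only while it has not expired.
    public bool TryGet(string host, DateTime now, out string address)
    {
        if (_entries.TryGetValue(host, out var entry) && entry.Expires > now)
        {
            address = entry.Address;
            return true;
        }
        address = null;
        return false;
    }

    public void Put(string host, string address, DateTime now)
    {
        _entries[host] = (address, now + Lifetime);
    }
}

public static class DnsLookupSketch
{
    // Walks the layers in order and reports which one answered.
    // If every layer misses, falls back to the (assumed) recursive search delegate.
    public static string Resolve(
        string host,
        IReadOnlyList<DnsCacheLayer> layers,
        Func<string, string> recursiveSearch,
        DateTime now,
        out string answeredBy)
    {
        for (int i = 0; i < layers.Count; i++)
        {
            if (layers[i].TryGet(host, now, out string cached))
            {
                // Refill the closer layers that missed.
                for (int j = 0; j < i; j++) layers[j].Put(host, cached, now);
                answeredBy = layers[i].Name;
                return cached;
            }
        }

        string address = recursiveSearch(host);
        foreach (var layer in layers) layer.Put(host, address, now);
        answeredBy = "recursive search";
        return address;
    }
}
```

With a two-minute browser layer and a ten-minute OS layer, the first lookup falls through to the recursive search, a lookup a minute later is answered by the browser layer, and a lookup five minutes later is answered by the OS layer.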

Here is a diagram of what a recursive DNS search looks like:

One worrying thing about DNS is that the entire domain like wikipedia.org or facebook.com seems to map to a single IP address. Fortunately, there are ways of mitigating the bottleneck:

Round-robin DNS is a solution where the DNS lookup returns multiple IP addresses, rather than just one. For example, facebook.com actually maps to four IP addresses.
A load balancer is a piece of hardware that listens on a particular IP address and forwards the requests to other servers. Major sites will typically use expensive high-performance load balancers.
Geographic DNS improves scalability by mapping a domain name to different IP addresses, depending on the client’s geographic location. This is great for hosting static content so that different servers don’t have to update shared state.
Anycast is a routing technique where a single IP address maps to multiple physical servers. Unfortunately, anycast does not fit well with TCP and is rarely used in that scenario.

Most of the DNS servers themselves use anycast to achieve high availability and low latency of the DNS lookups.

3. The browser sends an HTTP request to the web server

You can be pretty sure that Facebook’s homepage will not be served from the browser cache because dynamic pages expire either very quickly or immediately (expiry date set to past).

So, the browser will send this request to the Facebook server:

GET http://facebook.com/ HTTP/1.1
Accept: application/x-ms-application, image/jpeg, application/xaml+xml, [...]
User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; [...]
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
Host: facebook.com
Cookie: datr=1265876274-[...]; locale=en_US; lsd=WW[...]; c_user=2101[...]

The GET request names the URL to fetch: “http://facebook.com/”. The browser identifies itself (User-Agent header), and states what types of responses it will accept (Accept and Accept-Encoding headers). The Connection header asks the server to keep the TCP connection open for further requests.
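To make the structure of the request concrete, here is a minimal sketch of how a server might split such a request into the request line and the headers. ParsedRequest and HttpRequestSketch are hypothetical names, and the parser ignores many details that a real HTTP implementation has to handle (header folding, repeated headers, request bodies).

```csharp
using System;
using System.Collections.Generic;

public sealed class ParsedRequest
{
    public string Method { get; set; }
    public string Target { get; set; }
    public string Version { get; set; }

    // Header names are case-insensitive in HTTP.
    public Dictionary<string, string> Headers { get; }
        = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
}

public static class HttpRequestSketch
{
    public static ParsedRequest Parse(string raw)
    {
        string[] lines = raw.Replace("\r\n", "\n").Split('\n');

        // The first line is "METHOD TARGET VERSION".
        string[] requestLine = lines[0].Split(' ');
        if (requestLine.Length != 3)
            throw new FormatException("Expected: METHOD TARGET VERSION");

        var request = new ParsedRequest
        {
            Method = requestLine[0],
            Target = requestLine[1],
            Version = requestLine[2]
        };

        for (int i = 1; i < lines.Length; i++)
        {
            if (lines[i].Length == 0) break; // a blank line ends the header section

            int colon = lines[i].IndexOf(':');
            if (colon <= 0) throw new FormatException("Malformed header: " + lines[i]);

            string name = lines[i].Substring(0, colon).Trim();
            string value = lines[i].Substring(colon + 1).Trim();
            request.Headers[name] = value;
        }
        return request;
    }
}
```

Parsing the request shown above would give Method "GET", Target "http://facebook.com/", and a Headers dictionary where Headers["Host"] is "facebook.com".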

The request also contains the cookies that the browser has for this domain. As you probably already know, cookies are key-value pairs that track the state of a web site in between different page requests. And so the cookies store the name of the logged-in user, a secret number that was assigned to the user by the server, some of user’s settings, etc. The cookies will be stored in a text file on the client, and sent to the server with every request.

There is a variety of tools that let you view the raw HTTP requests and corresponding responses. My favorite tool for viewing the raw HTTP traffic is Fiddler, but there are many other tools (e.g., Firebug). These tools are a great help when optimizing a site.

In addition to GET requests, another type of request that you may be familiar with is a POST request, typically used to submit forms. A GET request sends its parameters via the URL (e.g.: http://robozzle.com/puzzle.aspx?id=85). A POST request sends its parameters in the request body, just under the headers.

The trailing slash in the URL “http://facebook.com/” is important. In this case, the browser can safely add the slash. For URLs of the form http://example.com/folderOrFile, the browser cannot automatically add a slash, because it is not clear whether folderOrFile is a folder or a file. In such cases, the browser will visit the URL without the slash, and the server will respond with a redirect, resulting in an unnecessary roundtrip.

4. The Facebook server responds with a permanent redirect

This is the response that the Facebook server sent back to the browser request:

HTTP/1.1 301 Moved Permanently
Cache-Control: private, no-store, no-cache, must-revalidate, post-check=0,
      pre-check=0
Expires: Sat, 01 Jan 2000 00:00:00 GMT
Location: http://www.facebook.com/
P3P: CP="DSP LAW"
Pragma: no-cache
Set-Cookie: made_write_conn=deleted; expires=Thu, 12-Feb-2009 05:09:50 GMT;
      path=/; domain=.facebook.com; httponly
Content-Type: text/html; charset=utf-8
X-Cnection: close
Date: Fri, 12 Feb 2010 05:09:51 GMT
Content-Length: 0

The server responded with a 301 Moved Permanently response to tell the browser to go to “http://www.facebook.com/” instead of “http://facebook.com/”.
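As a rough sketch of the server-side decision, the check could look like the code below. CanonicalHostSketch and RedirectFor are made-up names, and the response is reduced to the status line plus two headers; I don't know how Facebook actually implements this.

```csharp
using System;

public static class CanonicalHostSketch
{
    // Returns null when the request already targets the canonical host;
    // otherwise returns a minimal 301 response pointing at the canonical URL.
    // The sketch ignores non-default ports.
    public static string RedirectFor(Uri requested, string canonicalHost)
    {
        if (string.Equals(requested.Host, canonicalHost, StringComparison.OrdinalIgnoreCase))
            return null;

        string location = requested.Scheme + "://" + canonicalHost + requested.PathAndQuery;
        return "HTTP/1.1 301 Moved Permanently\r\n" +
               "Location: " + location + "\r\n" +
               "Content-Length: 0\r\n\r\n";
    }
}
```

For new Uri("http://facebook.com/") and the canonical host "www.facebook.com", the Location header points at "http://www.facebook.com/". A request that already uses the www host gets no redirect.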

There are interesting reasons why the server insists on the redirect instead of immediately responding with the web page that the user wants to see.

One reason has to do with search engine rankings. See, if there are two URLs for the same page, say http://www.igoro.com/ and http://igoro.com/, a search engine may consider them to be two different sites, each with fewer incoming links and thus a lower ranking. Search engines understand permanent redirects (301), and will combine the incoming links from both sources into a single ranking.

Also, multiple URLs for the same content are not cache-friendly. When a piece of content has multiple names, it will potentially appear multiple times in caches.

5. The browser follows the redirect

The browser now knows that “http://www.facebook.com/” is the correct URL to go to, and so it sends out another GET request:

GET http://www.facebook.com/ HTTP/1.1
Accept: application/x-ms-application, image/jpeg, application/xaml+xml, [...]
Accept-Language: en-US
User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; [...]
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
Cookie: lsd=XW[...]; c_user=21[...]; x-referer=[...]
Host: www.facebook.com

The meaning of the headers is the same as for the first request.

6. The server ‘handles’ the request

The server will receive the GET request, process it, and send back a response.

This may seem like a straightforward task, but in fact there is a lot of interesting stuff that happens here – even on a simple site like my blog, let alone on a massively scalable site like Facebook.

Web server software
The web server software (e.g., IIS or Apache) receives the HTTP request and decides which request handler should be executed to handle this request. A request handler is a program (in ASP.NET, PHP, Ruby, …) that reads the request and generates the HTML for the response.
In the simplest case, the request handlers can be stored in a file hierarchy whose structure mirrors the URL structure, and so for example http://example.com/folder1/page1.aspx URL will map to file /httpdocs/folder1/page1.aspx. The web server software can also be configured so that URLs are manually mapped to request handlers, and so the public URL of page1.aspx could be http://example.com/folder1/page1.
Request handler
The request handler reads the request, its parameters, and cookies. It will read and possibly update some data stored on the server. Then, the request handler will generate an HTML response.
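The two ways of mapping a URL to a request handler described under “Web server software” can be sketched as follows. The route table and the names HandlerMappingSketch and ResolveHandler are hypothetical; real web servers have their own configuration formats for this, and they also validate the resulting path.

```csharp
using System;
using System.Collections.Generic;

public static class HandlerMappingSketch
{
    // Manually configured routes: public URL path -> handler file (made-up example table).
    private static readonly Dictionary<string, string> Routes =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "/folder1/page1", "/httpdocs/folder1/page1.aspx" }
        };

    public static string ResolveHandler(Uri url, string documentRoot)
    {
        string path = url.AbsolutePath;

        // An explicit route wins if one is configured for this path.
        if (Routes.TryGetValue(path, out string mapped))
            return mapped;

        // Simplest case: the file hierarchy mirrors the URL structure.
        return documentRoot.TrimEnd('/') + path;
    }
}
```

With a document root of "/httpdocs", both http://example.com/folder1/page1.aspx and http://example.com/folder1/page1 resolve to /httpdocs/folder1/page1.aspx.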

One interesting difficulty that every dynamic website faces is how to store data. Smaller sites will often have a single SQL database to store their data, but sites that store a large amount of data and/or have many visitors have to find a way to split the database across multiple machines. Solutions include sharding (splitting up a table across multiple databases based on the primary key), replication, and usage of simplified databases with weakened consistency semantics.
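Sharding by primary key can be as simple as taking the key modulo the number of databases. The sketch below assumes a numeric key and a fixed shard count; ShardingSketch and ShardFor are names I picked for illustration, and real systems often use more elaborate schemes so that shards can be added later without moving most of the rows.

```csharp
using System;

public static class ShardingSketch
{
    // Picks the database (0 .. shardCount - 1) that stores the row with this primary key.
    public static int ShardFor(long primaryKey, int shardCount)
    {
        if (shardCount <= 0) throw new ArgumentOutOfRangeException(nameof(shardCount));

        // C# '%' can return a negative remainder, so shift the result into [0, shardCount).
        return (int)(((primaryKey % shardCount) + shardCount) % shardCount);
    }
}
```

With four shards, the row with key 85 would live on shard 1, and every query for that key goes to that one machine.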

One technique to keep data updates cheap is to defer some of the work to a batch job. For example, Facebook has to update the newsfeed in a timely fashion, but the data backing the “People you may know” feature may only need to be updated nightly (my guess, I don’t actually know how they implement this feature). Batch job updates result in staleness of some less important data, but can make data updates much faster and simpler.

7. The server sends back an HTML response

Here is the response that the server generated and sent back:

HTTP/1.1 200 OK
Cache-Control: private, no-store, no-cache, must-revalidate, post-check=0,
    pre-check=0
Expires: Sat, 01 Jan 2000 00:00:00 GMT
P3P: CP="DSP LAW"
Pragma: no-cache
Content-Encoding: gzip
Content-Type: text/html; charset=utf-8
X-Cnection: close
Transfer-Encoding: chunked
Date: Fri, 12 Feb 2010 09:05:55 GMT

2b3
[... gzip-compressed bytes, trimmed ...]

The entire response is 36 kB, the bulk of it in the byte blob at the end that I trimmed.

The Content-Encoding header tells the browser that the response body is compressed using the gzip algorithm. After decompressing the blob, you’ll see the HTML you’d expect:

...

In addition to compression, headers specify whether and how to cache the page, any cookies to set (none in this response), privacy information, etc.
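To see the gzip step in isolation, here is a small round-trip sketch using the GZipStream class from System.IO.Compression. The GzipSketch class and its method names are mine; a real server and browser apply the same kind of compression and decompression to the response body.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

public static class GzipSketch
{
    // What the server does to the body before sending "Content-Encoding: gzip".
    public static byte[] Compress(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(raw, 0, raw.Length);
            } // disposing the GZipStream flushes the remaining compressed bytes
            return output.ToArray();
        }
    }

    // What the browser does after reading the Content-Encoding header.
    public static string Decompress(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

HTML tends to compress well because tags and attribute names repeat a lot, which is why the compressed blob is much smaller than the page it decodes to.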

Notice the header that sets Content-Type to text/html. The header instructs the browser to render the response content as HTML, instead of, say, downloading it as a file. The browser will use the header to decide how to interpret the response, but will consider other factors as well, such as the extension of the URL.

8. The browser begins rendering the HTML

Even before the browser has received the entire HTML document, it begins rendering the website:

9. The browser sends requests for objects embedded in HTML

As the browser renders the HTML, it will notice tags that require fetching of other URLs. The browser will send a GET request to retrieve each of these files.

Here are a few URLs that my visit to facebook.com retrieved:

Images
http://static.ak.fbcdn.net/rsrc.php/z12E0/hash/8q2anwu7.gif
http://static.ak.fbcdn.net/rsrc.php/zBS5C/hash/7hwy7at6.gif
…
CSS style sheets
http://static.ak.fbcdn.net/rsrc.php/z448Z/hash/2plh8s4n.css
http://static.ak.fbcdn.net/rsrc.php/zANE1/hash/cvtutcee.css
…
JavaScript files
http://static.ak.fbcdn.net/rsrc.php/zEMOA/hash/c8yzb6ub.js
http://static.ak.fbcdn.net/rsrc.php/z6R9L/hash/cq2lgbs8.js
…

Each of these URLs will go through a process similar to what the HTML page went through. So, the browser will look up the domain name in DNS, send a request to the URL, follow redirects, etc.

However, static files – unlike dynamic pages – allow the browser to cache them. Some of the files may be served up from cache, without contacting the server at all. The browser knows how long to cache a particular file because the response that returned the file contained an Expires header. Additionally, each response may also contain an ETag header that works like a version number – if the browser already holds the version of the file with that ETag, the server can answer a conditional request with a short 304 Not Modified response instead of sending the file again.
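The caching decision can be sketched roughly like this. CacheAction, CachedResponse and BrowserCacheSketch are illustrative names, and the logic only looks at Expires and ETag; real browsers also take Cache-Control and other headers into account. The If-None-Match request header and the 304 Not Modified status are the standard HTTP mechanism for revalidating with an ETag.

```csharp
using System;

public enum CacheAction { UseCachedCopy, Revalidate, FetchFromServer }

public sealed class CachedResponse
{
    public DateTime Expires { get; set; }
    public string ETag { get; set; } // null when the response had no ETag header
}

public static class BrowserCacheSketch
{
    public static CacheAction Decide(CachedResponse cached, DateTime now)
    {
        if (cached == null) return CacheAction.FetchFromServer;

        // Still fresh: no need to contact the server at all.
        if (now < cached.Expires) return CacheAction.UseCachedCopy;

        // Expired, but the ETag lets us ask "has this changed?" instead of re-downloading.
        return cached.ETag != null ? CacheAction.Revalidate : CacheAction.FetchFromServer;
    }

    // Header sent with a revalidation request; a 304 Not Modified reply
    // means the cached copy can be reused.
    public static string ConditionalHeader(CachedResponse cached)
    {
        return "If-None-Match: " + cached.ETag;
    }
}
```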

Can you guess what “fbcdn.net” in the URLs stands for? A safe bet is that it means “Facebook content delivery network”. Facebook uses a content delivery network (CDN) to distribute static content – images, style sheets, and JavaScript files. So, the files will be copied to many machines across the globe.

Static content often represents the bulk of the bandwidth of a site, and can be easily replicated across a CDN. Often, sites will use a third-party CDN provider, instead of operating a CDN themselves. For example, Facebook’s static files are hosted by Akamai, the largest CDN provider.

As a demonstration, when you try to ping static.ak.fbcdn.net, you will get a response from an akamai.net server. Also, interestingly, if you ping the host a couple of times, you may get responses from different servers, which demonstrates the load-balancing that happens behind the scenes.

10. The browser sends further asynchronous (AJAX) requests

In the spirit of Web 2.0, the client continues to communicate with the server even after the page is rendered.

For example, Facebook chat will continue to update the list of your logged-in friends as they come and go. To update the list of your logged-in friends, the JavaScript executing in your browser has to send an asynchronous request to the server. The asynchronous request is a programmatically constructed GET or POST request that goes to a special URL. In the Facebook example, the client sends a POST request to http://www.facebook.com/ajax/chat/buddy_list.php to fetch the list of your friends who are online.

This pattern is sometimes referred to as “AJAX”, which stands for “Asynchronous JavaScript And XML”, even though there is no particular reason why the server has to format the response as XML. For example, Facebook returns snippets of JavaScript code in response to asynchronous requests.

Among other things, the Fiddler tool lets you view the asynchronous requests sent by your browser. In fact, not only can you observe the requests passively, but you can also modify and resend them. The fact that it is this easy to “spoof” AJAX requests causes a lot of grief to developers of online games with scoreboards. (Obviously, please don’t cheat that way.)

Facebook chat provides an example of an interesting problem with AJAX: pushing data from server to client. Since HTTP is a request-response protocol, the chat server cannot push new messages to the client. Instead, the client has to poll the server every few seconds to see if any new messages arrived.

Long polling is an interesting technique to decrease the load on the server in these types of scenarios. If the server does not have any new messages when polled, it simply holds the request open instead of sending a response back right away. And, if a message for this client is received within the timeout period, the server will find the outstanding request and return the message with the response.
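The server side of long polling can be sketched with a TaskCompletionSource that stands in for the outstanding request. LongPollChannel is a hypothetical class that handles a single client with at most one outstanding poll; a real chat server would keep one such channel per connection and tie it into its HTTP layer.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public sealed class LongPollChannel
{
    private readonly object _gate = new object();
    private readonly Queue<string> _pending = new Queue<string>();
    private TaskCompletionSource<string> _waiter; // the outstanding poll, if any

    // Called when the client polls. Returns a queued message right away,
    // or waits up to 'timeout' for one to arrive. Returns null on timeout.
    public async Task<string> PollAsync(TimeSpan timeout)
    {
        TaskCompletionSource<string> waiter;
        lock (_gate)
        {
            if (_pending.Count > 0) return _pending.Dequeue();
            waiter = new TaskCompletionSource<string>(
                TaskCreationOptions.RunContinuationsAsynchronously);
            _waiter = waiter;
        }

        await Task.WhenAny(waiter.Task, Task.Delay(timeout));

        lock (_gate)
        {
            if (_waiter == waiter) _waiter = null; // the poll is no longer outstanding
        }

        // If a message slipped in right as the timeout fired, still deliver it.
        return waiter.Task.IsCompleted ? waiter.Task.Result : null;
    }

    // Called when a new message for this client arrives.
    public void Publish(string message)
    {
        lock (_gate)
        {
            if (_waiter != null)
            {
                var waiter = _waiter;
                _waiter = null;
                waiter.TrySetResult(message); // completes the outstanding request
            }
            else
            {
                _pending.Enqueue(message); // nobody is waiting; keep it for the next poll
            }
        }
    }
}
```

When a poll times out with no message, the client simply polls again, so the server sees one request per timeout period instead of one every few seconds.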

Conclusion

Hopefully this gives you a better idea of how the different web pieces work together.

Gallery of Processor Cache Effects

Igor Ostrovsky — Tue, 19 Jan 2010 10:28:11 +0000

Most of my readers will understand that cache is a fast but small type of memory that stores recently accessed memory locations. This description is reasonably accurate, but the “boring” details of how processor caches work can help a lot when trying to understand program performance.

In this blog post, I will use code samples to illustrate various aspects of how caches work, and what the impact is on the performance of real-world programs.

The examples are in C#, but the language choice has little impact on the performance scores and the conclusions they lead to.

Example 1: Memory accesses and performance

How much faster do you expect Loop 2 to run, compared to Loop 1?

int[] arr = new int[64 * 1024 * 1024];

// Loop 1
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;

// Loop 2
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;

The first loop multiplies every value in the array by 3, and the second loop multiplies only every 16th. The second loop only does about 6% of the work of the first loop, but on modern machines, the two for-loops take about the same time: 80 and 78 ms respectively on my machine.

The reason why the loops take the same amount of time has to do with memory. The running time of these loops is dominated by the memory accesses to the array, not by the integer multiplications. And, as I’ll explain in Example 2, the hardware will perform the same main memory accesses for the two loops.

Example 2: Impact of cache lines

Let’s explore this example deeper. We will try other step values, not just 1 and 16:

for (int i = 0; i < arr.Length; i += K) arr[i] *= 3;

Here are the running times of this loop for different step values (K):

Notice that while step is in the range from 1 to 16, the running time of the for-loop hardly changes. But from 16 onwards, the running time is halved each time we double the step.

The reason behind this is that today’s CPUs do not access memory byte by byte. Instead, they fetch memory in chunks of (typically) 64 bytes, called cache lines. When you read a particular memory location, the entire cache line is fetched from the main memory into the cache. And, accessing other values from the same cache line is cheap!

Since 16 ints take up 64 bytes (one cache line), for-loops with a step between 1 and 16 have to touch the same number of cache lines: all of the cache lines in the array. But once the step is 32, we’ll only touch roughly every other cache line, and once it is 64, only every fourth.
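The arithmetic behind this can be written down directly. The sketch below counts how many distinct cache lines a loop with a given step touches, assuming 4-byte ints, 64-byte lines, and an array that starts on a cache line boundary (CacheLineSketch and LinesTouched are names made up for this illustration).

```csharp
public static class CacheLineSketch
{
    // Number of distinct cache lines touched by: for (i = 0; i < arrayLength; i += step) arr[i]...
    // Assumes the array starts at a cache line boundary.
    public static long LinesTouched(long arrayLength, int step, int elementSize = 4, int lineSize = 64)
    {
        if (arrayLength <= 0 || step <= 0) return 0;

        long lastIndex = ((arrayLength - 1) / step) * step; // last index the loop visits
        long strideBytes = (long)step * elementSize;

        if (strideBytes <= lineSize)
        {
            // The stride is no larger than a line, so no line can be skipped:
            // every line up to the one holding the last access is touched.
            return (lastIndex * elementSize) / lineSize + 1;
        }

        // The stride is larger than a line, so every access lands on its own line.
        return (arrayLength - 1) / step + 1;
    }
}
```

For the 64 * 1024 * 1024 element array from Example 1, steps 1 and 16 both touch 4,194,304 lines, step 32 touches 2,097,152, and step 64 touches 1,048,576, which matches the halving of the running time in the chart.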

An understanding of cache lines can be important for certain types of program optimizations. For example, alignment of data may determine whether an operation touches one or two cache lines. As we saw in the example above, this can easily mean that in the misaligned case, the operation will be twice as slow.

Example 3: L1 and L2 cache sizes

Today’s computers come with two or three levels of caches, usually called L1, L2 and possibly L3. If you want to know the sizes of the different caches, you can use the CoreInfo SysInternals tool, or use the GetLogicalProcessorInformation Windows API call. Both methods will also tell you the cache line sizes, in addition to the cache sizes.

On my machine, CoreInfo reports that I have a 32kB L1 data cache, a 32kB L1 instruction cache, and a 4MB L2 data cache. The L1 caches are per-core, and the L2 caches are shared between pairs of cores:

Logical Processor to Cache Map:
*---  Data Cache          0, Level 1,   32 KB, Assoc   8, LineSize  64
*---  Instruction Cache   0, Level 1,   32 KB, Assoc   8, LineSize  64
-*--  Data Cache          1, Level 1,   32 KB, Assoc   8, LineSize  64
-*--  Instruction Cache   1, Level 1,   32 KB, Assoc   8, LineSize  64
**--  Unified Cache       0, Level 2,    4 MB, Assoc  16, LineSize  64
--*-  Data Cache          2, Level 1,   32 KB, Assoc   8, LineSize  64
--*-  Instruction Cache   2, Level 1,   32 KB, Assoc   8, LineSize  64
---*  Data Cache          3, Level 1,   32 KB, Assoc   8, LineSize  64
---*  Instruction Cache   3, Level 1,   32 KB, Assoc   8, LineSize  64
--**  Unified Cache       1, Level 2,    4 MB, Assoc  16, LineSize  64

Let’s verify these numbers by an experiment. To do that, we’ll step over an array incrementing every 16th integer – a cheap way to modify every cache line. When we reach the last value, we loop back to the beginning. We’ll experiment with different array sizes, and we should see drops in the performance at the array sizes where the array spills out of one cache level.

Here is the program:

int steps = 64 * 1024 * 1024; // Arbitrary number of steps
int lengthMod = arr.Length - 1;
for (int i = 0; i < steps; i++)
{
    arr[(i * 16) & lengthMod]++; // (x & lengthMod) equals (x % arr.Length) when arr.Length is a power of two
}

And here are the timings:

You can see distinct drops after 32kB and 4MB – the sizes of L1 and L2 caches on my machine.

Example 4: Instruction-level parallelism

Now, let’s take a look at something different. Out of these two loops, which one would you expect to be faster?

int steps = 256 * 1024 * 1024;
int[] a = new int[2];

// Loop 1
for (int i=0; i<steps; i++) { a[0]++; a[0]++; }

// Loop 2
for (int i=0; i<steps; i++) { a[0]++; a[1]++; }

It turns out that the second loop is about twice as fast as the first loop, at least on all of the machines I tested. Why? This has to do with the dependencies between operations in the two loop bodies.
In the body of the first loop, operations depend on each other as follows:

But in the second example, we only have these dependencies:

The modern processor has various parts that have a little bit of parallelism in them: it can access two memory locations in L1 at the same time, or perform two simple arithmetic operations. In the first loop, the processor cannot exploit this instruction-level parallelism, but in the second loop, it can.
[UPDATE]: Many people on reddit are asking about compiler optimizations, and whether { a[0]++; a[0]++; } would just get optimized to { a[0]+=2; }. In fact, the C# compiler and CLR JIT will not do this optimization – not when array accesses are involved. I built all of the tests in release mode (i.e. with optimizations), but I looked at the JIT-ted assembly to verify that optimizations aren’t skewing the results.
Example 5: Cache associativity
One key decision in cache design is whether each chunk of main memory can be stored in any cache slot, or in just some of them.
There are three possible approaches to mapping cache slots to memory chunks:

Direct mapped cache
Each memory chunk can be stored in only one particular slot in the cache. One simple solution is to map the chunk with index chunk_index to cache slot (chunk_index % cache_slots). Two memory chunks that map to the same slot cannot be stored simultaneously in the cache.
N-way set associative cache
Each memory chunk can be stored in any one of N particular slots in the cache. As an example, in a 16-way cache, each memory chunk can be stored in 16 different cache slots. Commonly, chunks with indices with the same lowest order bits will all share 16 slots.
Fully associative cache
Each memory chunk can be stored in any slot in the cache. Effectively, the cache operates like a hash table.

Direct mapped caches can suffer from conflicts – when multiple values compete for the same slot in the cache, they keep evicting each other, and the hit rate plummets. On the other hand, fully associative caches are complicated and costly to implement in the hardware. N-way set associative caches are the typical solution for processor caches, as they make a good trade-off between implementation simplicity and good hit rate.
For example, the 4MB L2 cache on my machine is 16-way associative. All 64-byte memory chunks are partitioned into sets (based on the lowest order bits of the chunk index), and chunks in the same set compete for 16 slots in the L2 cache.
Since the L2 cache has 65,536 slots, and each set will need 16 slots in the cache, we will have 4,096 sets. So, the lowest 12 bits of the chunk index will determine which set the chunk belongs to (2¹² = 4,096). As a result, cache lines at addresses that differ by a multiple of 262,144 bytes (4096 * 64) will compete for the same set of 16 slots in the cache. The cache on my machine can hold at most 16 such cache lines.
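The same arithmetic as a small sketch, using the numbers for the L2 cache on my machine. CacheSetSketch is a made-up name, and the mapping is the simple “lowest bits of the chunk index” scheme described above; actual hardware works on physical addresses and may use a more elaborate mapping.

```csharp
public static class CacheSetSketch
{
    public const int LineSize = 64;               // bytes per cache line
    public const int CacheSize = 4 * 1024 * 1024; // 4MB L2 cache
    public const int Ways = 16;                   // 16-way set associative

    public const int Slots = CacheSize / LineSize; // 65,536 slots
    public const int Sets = Slots / Ways;          // 4,096 sets

    // Which set a given byte address falls into: the lowest 12 bits of the chunk index.
    public static int SetIndex(long address)
    {
        long chunkIndex = address / LineSize;
        return (int)(chunkIndex % Sets);
    }
}
```

Addresses 0 and 262,144 both map to set 0, so they compete for the same 16 slots, while address 64 (the next cache line) maps to set 1.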
In order for the effects of cache associativity to become apparent, I need to repeatedly access more than 16 elements from the same set. I will demonstrate this using the following method:
public static long UpdateEveryKthByte(byte[] arr, int K)
{
    Stopwatch sw = Stopwatch.StartNew();
    const int rep = 1024*1024; // Number of iterations – arbitrary

    int p = 0;
    for (int i = 0; i < rep; i++)
    {
        arr[p]++;
        p += K;
        if (p >= arr.Length) p = 0;
    }

    sw.Stop();
    return sw.ElapsedMilliseconds;
}
This method increments every K-th value in the array. Once it reaches the end of the array, it starts again from the beginning. After running sufficiently long (2^20 steps), the loop stops.
I ran UpdateEveryKthByte() with different array sizes (in 1MB increments) and different step sizes. Here is a plot of the results, with blue representing long running time, and white representing short:
 
The blue areas (long running times) are cases where the updated values could not be simultaneously held in the cache as we repeatedly iterated over them. The bright blue areas correspond to running times of ~80 ms, and the nearly white areas to ~10 ms.
Let’s explain the blue parts of the chart:

Why the vertical lines?
The vertical lines show the step values that touch too many memory locations (>16) from the same set. For those steps, we cannot simultaneously hold all touched values in the 16-way associative cache on my machine.
Some bad step values are powers of two: 256 and 512. As an example, consider step 512 on an 8MB array. Such an array contains 32 values that are spaced 262,144 bytes apart. All of those values will be updated by each pass of our loop, because 512 divides 262,144.
And since 32 > 16, those 32 values will keep competing for the same 16 slots in the cache.
Some values that are not powers of two are simply unfortunate, and will end up visiting disproportionately many values from the same set. Those step values will also show up as blue lines.
Why do the vertical lines stop at 4MB array length?
On arrays of 4MB or less, a 16-way associative cache is just as good as a fully associative one.
A 16-way associative cache can hold at most 16 cache lines that are a multiple of 262,144 bytes apart. There is no set of 17 or more cache lines all aligned on 262,144-byte boundaries within 4MB, because 16 * 262,144 = 4,194,304.
Why the blue triangle in upper left?
In the triangle area, we cannot hold all necessary data in cache simultaneously … not due to the associativity, but simply because of the L2 cache size limit.
For example, consider the array length 16MB with step 128. We are repeatedly updating every 128th byte in the array, which means that we touch every other 64-byte memory chunk. To store every other cache line of a 16MB array, we’d need 8MB cache. But, my machine only has 4MB of cache.
Even if the 4MB cache on my machine was fully associative, it still wouldn’t be able to hold 8MB of data.
Why does the triangle fade out on the left?
Notice that the gradient goes from 0 to 64 bytes – one cache line! As explained in examples 1 and 2, additional accesses to the same cache line are nearly free. For example, when stepping by 16 bytes, it will take 4 steps to get to the next cache line. So, we get four memory accesses for the price of one.
Since the number of steps is the same for all cases, a cheaper step results in a shorter running time.

These patterns continue to hold as you extend the chart:

Cache associativity is interesting to understand and can certainly be demonstrated, but it tends to be less of a problem compared to the other issues discussed in this article. It is certainly not something that should be at the forefront of your mind as you write programs.
Example 6: False cache line sharing
On multi-core machines, caches encounter another problem – consistency. Different cores have fully or partly separate caches. On my machine, L1 caches are separate (as is common), and there are two pairs of processors, each pair sharing an L2 cache. While the details vary, a modern multi-core machine will have a multi-level cache hierarchy, where the faster and smaller caches belong to individual processors.
When one processor modifies a value in its cache, other processors cannot use the old value anymore. That memory location will be invalidated in all of the caches. Furthermore, since caches operate on the granularity of cache lines and not individual bytes, the entire cache line will be invalidated in all caches!
To demonstrate this issue, consider this example:
private static int[] s_counter = new int[1024];
private void UpdateCounter(int position)
{
    for (int j = 0; j < 100000000; j++)
    {
        s_counter[position] = s_counter[position] + 3;
    }
}
On my quad-core machine, if I call UpdateCounter with parameters 0,1,2,3 from four different threads, it will take 4.3 seconds until all threads are done.
On the other hand, if I call UpdateCounter with parameters 16,32,48,64 the operation will be done in 0.28 seconds!
Why? In the first case, all four values are very likely to end up on the same cache line. Each time a core increments the counter, it invalidates the cache line that holds all four counters. All other cores will suffer a cache miss the next time they access their own counters. This kind of thread behavior effectively disables caches, crippling the program’s performance.
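If you want to try this yourself, a harness along these lines should do. It is a sketch rather than the exact program I used: FalseSharingDemo and Run are made-up names, and the iteration count is a parameter so that you can keep the run short. The timings will vary from machine to machine, and some processors reduce the effect in simple cases.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class FalseSharingDemo
{
    // Starts one thread per position; each thread repeatedly updates its own counter.
    // Returns the elapsed time in milliseconds until all threads are done.
    public static long Run(int[] counters, int[] positions, int iterations)
    {
        var threads = new Thread[positions.Length];
        var sw = Stopwatch.StartNew();

        for (int t = 0; t < positions.Length; t++)
        {
            int position = positions[t]; // copy, so each thread captures its own value
            threads[t] = new Thread(() =>
            {
                for (int j = 0; j < iterations; j++)
                {
                    counters[position] = counters[position] + 3;
                }
            });
            threads[t].Start();
        }

        foreach (Thread thread in threads) thread.Join();
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }
}
```

Calling Run(new int[1024], new[] { 0, 1, 2, 3 }, 100000000) puts all four counters on one cache line, while Run(new int[1024], new[] { 16, 32, 48, 64 }, 100000000) spaces them 64 bytes apart so that each counter gets a cache line of its own.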
Example 7: Hardware complexities
Even when you know the basics of how caches work, the hardware will still sometimes surprise you. Different processors differ in optimizations, heuristics, and subtle details of how they do things.
On some processors, L1 cache can process two accesses in parallel if they access cache lines from different banks, and serially if they belong to the same bank. Also, processors can surprise you with clever optimizations. For example, the false-sharing example that I’ve used on several machines in the past did not work well on my machine without tweaks – my home machine can optimize the execution in the simplest cases to reduce the cache invalidations.
Here is one odd example of “hardware weirdness”:
private static int A, B, C, D, E, F, G;
private static void Weirdness()
{
    for (int i = 0; i < 200000000; i++)
    {
        <something>
    }
}
When I substitute three different blocks for “<something>”, I get these timings:

<something>              Time
A++; B++; C++; D++;      719 ms
A++; C++; E++; G++;      448 ms
A++; C++;                518 ms

Incrementing fields A,B,C,D takes longer than incrementing fields A,C,E,G. And what’s even weirder, incrementing just A and C takes longer than incrementing A and C and E and G!
I don’t know for sure what is the reason behind these numbers, but I suspect it is related to memory banks. If someone can explain these numbers, I’d be very curious to hear about it.
The lesson of this example is that it can be difficult to fully predict hardware performance. There is a lot that you can predict, but ultimately, it is very important to measure and verify your assumptions.

Conclusion
Hopefully all of this helps you understand how caches work, and apply that knowledge when tuning your programs.

Human heart is a Turing machine, research on XBox 360 shows. Wait, what?

Igor Ostrovsky — Thu, 24 Sep 2009 08:13:31 +0000

Did you ever see one of those auto-generated random “academic papers” like this one? When I first saw the following title, my first thought was that it is a randomly-generated “paper”:

Implications of the Turing completeness of reaction-diffusion models, informed by GPGPU simulations on an XBox 360: Cardiac arrhythmias, re-entry and the Halting problem [PDF]

Turing completeness, cardiac arrhythmias, XBox 360… those things don’t seem to have much in common. But, I had my interest piqued. I looked up the paper and read through it. And, it turns out that not only is the paper serious, what it has to say is also quite interesting.

Heart and logic circuits

I had to take the author’s word on the medical aspects of the article, since I know nothing about cardiology. Apparently, understanding of electrical signals between cells in a human heart is important for research into heart arrhythmias. Makes sense.

The important insight of the article is that a logic NOR gate can be simulated using electrical signals between heart cells. Constructing a NOR gate is a powerful result, because similarly to NAND, NOR is a universal logic gate. That means that all other logic gates like AND, OR and NOT can be built out of NORs:

NOT(A) = NOR(A, A)
OR(A, B) = NOR(NOR(A, B), NOR(A, B))
AND(A, B) = NOR(NOR(A, A), NOR(B, B))

So, since you can simulate a NOR gate using cardiac cells, you can simulate an arbitrary logic circuit in heart tissue.
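The three identities are easy to check mechanically. In the sketch below (the NorGates class is just for illustration), Nor is the only primitive, and Not, Or and And are built out of it exactly as in the formulas above.

```csharp
public static class NorGates
{
    // The only primitive gate.
    public static bool Nor(bool a, bool b) { return !(a || b); }

    // Everything else is built purely out of NOR.
    public static bool Not(bool a) { return Nor(a, a); }
    public static bool Or(bool a, bool b) { return Nor(Nor(a, b), Nor(a, b)); }
    public static bool And(bool a, bool b) { return Nor(Nor(a, a), Nor(b, b)); }
}
```

Running through all four combinations of inputs confirms that these agree with the built-in !, || and && operators.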

Heart and Turing machines

But, there has to be more to the story. The paper title mentioned Turing machines, and logic circuits are not Turing-complete. The Halting problem doesn’t even make sense when applied to boolean expressions!

The missing part is the passage of time. See, a logic circuit is not Turing-complete. But, if you take a logic circuit with multiple inputs, the same number of outputs as inputs, and repeatedly apply the circuit over the results of the previous iteration, you get a Turing-complete system. This type of system can be modeled with the behavior of heart tissue over a period of time.

The most intuitive explanation that I can think of is via Game of Life. In the Game of Life, each cell dies, becomes alive or stays alive depending on how many live neighbors it had in the previous generation. Here is one example of a Game of Life board in action:

Now, the important observation is that the Game of Life rules can be encoded using a logic circuit. For example, if the eight neighbors of a particular cell are represented as A, B … H, then the rule that the cell becomes alive if it has exactly three neighbors can be encoded as ((A and B and C and not(D) and … and not(H)) or (A and B and not(C) and D and not(E) … and not(H)) or …). This expression will be combined with an OR together with the situations under which the cell remains alive, rather than becoming alive. This will be a pretty ugly logic circuit, but it should be clear that it can be constructed.
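To make the “ugly logic circuit” a bit more concrete, here is a rough sketch in Python (not the C# used elsewhere on this blog). The function exactly builds the big OR-of-ANDs expression described above, and step applies the same circuit to every cell. The names exactly, next_state and step are made up for this illustration, and the wrap-around edges are an assumption chosen to keep the example small:

```python
from itertools import combinations

def exactly(k, neighbors):
    """OR over every way to pick k neighbors; each term ANDs the picked
    neighbors with the negation of all the others."""
    n = len(neighbors)
    return any(
        all(neighbors[i] if i in chosen else not neighbors[i] for i in range(n))
        for chosen in combinations(range(n), k)
    )

def next_state(cell, neighbors):
    """Game of Life rule written with and/or/not only."""
    born = (not cell) and exactly(3, neighbors)
    survives = cell and (exactly(2, neighbors) or exactly(3, neighbors))
    return born or survives

def step(grid):
    """One iteration: feed every cell and its eight neighbors through the
    same circuit. Edges wrap around (an assumption of this sketch)."""
    rows, cols = len(grid), len(grid[0])
    offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
               if (dr, dc) != (0, 0)]
    return [
        [next_state(grid[r][c],
                    [grid[(r + dr) % rows][(c + dc) % cols] for dr, dc in offsets])
         for c in range(cols)]
        for r in range(rows)
    ]
```

The grid is a list of rows of booleans. Calling step repeatedly on its own output is the “iterated logic circuit”: the number of outputs equals the number of inputs, so the result can always be fed back in.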

Since Game of Life is known to be Turing-complete, an “iterated logic circuit” is also Turing-complete, and the behavior of cardiac tissue is … Turing-complete.

Also read about Numbers that cannot be computed and Self-printing Game of Life in C#.

Why is this useful?

Proving that a particular behavior of cardiac tissue is Turing-complete is a useful result because it shows that the cardiac tissue is in a certain sense “unpredictable”.

For example, since Game of Life is Turing-complete, the Halting problem applies. So, it is a proven fact that there is no general algorithm that can look at a particular Game of Life board and decide whether the movement will eventually stop or continue forever.

Similarly, there is no algorithm that can decide any of these properties for all Game-of-Life boards:

Whether the board will ever reach a particular configuration
Whether the number of live cells will ever exceed X
Whether the game will ever enter a cycle
etc.

Since the studied behavior of cardiac tissue is also Turing-complete, there is no general algorithm that can look at the state of cardiac tissue and decide whether the activity will ever stop, enter a regular pattern, achieve a particular configuration, etc. That is certainly a worthwhile result!

What about the XBox?

Simulating a NOR gate built out of cardiac cells is computationally intensive, and the researcher used a GPU in an XBox 360 for that task.

However, the paper doesn’t conclusively show that using an XBox was a real benefit. The paper says that a C++ implementation that was originally “designed more for ease of expansion […] than for speed” ran slower on an XBox 360 CPU than an “unoptimized” shader-based implementation on an XBox 360 GPU. Comparing implementations not designed for speed is unconvincing. If the goal was to speed up the computation, why not first try to optimize the original code instead of porting it to shaders, which is undoubtedly a much more difficult task?

Also, the article doesn’t say how the XBox 360 CPU compared against an ordinary x86 machine, or against, say, a CUDA-based implementation on a common NVidia card. So, the paper doesn’t come close to showing that the researchers gained much by coding against the XBox 360 GPU, rather than following the current state-of-the-art approaches.

But, it is still a cool paper, and adding XBox 360 into the picture certainly attracted attention. The paper was reported in the press, with titles such as these:

How Xbox Can Help Fight Heart Disease [time.com].
Parallel processor computing of XBox chip could save thousands [techshout.com].

And, if it weren’t for the sensational articles, I wouldn’t have found out about the paper at all, so I guess I shouldn’t complain.

How to write a self-printing program

Igor Ostrovsky — Fri, 31 Oct 2008 16:46:38 +0000

As promised at the end of my recent post, I am going to explain how to implement a program that prints itself, in addition to doing other things (like playing Game of Life).

A self-printing program – also called a quine – is a program that prints out its own source code. I will describe one simple way to implement a quine that can be adapted to just about any programming language. The technique does not depend on any unusual language features, but also does not necessarily yield the shortest possible quine in a particular language.

The main idea behind this quine implementation is simple. The quine will consist of two parts: a definition of a string and the program core. The string will contain the source code of the program core. And, what will the program core do? It will print the string twice: once to print the string definition, and again to print the program core.

That’s pretty much all we need to do, except for some details. When printing the string the first time, we need to escape special characters and surround each line with quotes. Also, some code may need to come before the string definition (the “header”), and some may need to come after the program core (the “footer”). Fortunately, we can stuff all the code we need into the program core, so handling these details is not a big problem.

Let’s walk through a quine construction in C#. The quine will look like this:

    using System;
    class P {
        static void Main() {
            string[] S = {
                // the program core, as a string array
            };

            // the program core, as source code:
            //     1. Print the program header
            //     2. Print S, formatted as the definition of S
            //     3. Print S, formatted as source code
            //     4. Print the program footer
        }
    }

Let’s implement the program core. That’s easy:

    using System;
    class P {
        static void Main() {
            string[] S = {
                // the program core, represented as a string array 
            };

            // 1. Print the program header
            Console.WriteLine("using System;");
            Console.WriteLine("class P {");
            Console.WriteLine("    static void Main() {");

            // 2. Print S, formatted as the definition of S
            Console.WriteLine("        string[] S = {");
            foreach (string line in S)
            {
                string escapedLine = line.Replace(@"\", @"\\").Replace("\"", "\\\"");
                Console.WriteLine("\"{0}\",", escapedLine);
            }
            Console.WriteLine("        };");

            // 3. Print S, formatted as source code
            foreach (string line in S) Console.WriteLine(line);

            // 4. Print the program footer
            Console.WriteLine("    }");
            Console.WriteLine("}");
        }
    }

We also need to initialize string S. To do that, simply copy & paste the program core source code into the definition of S, add a backslash before each occurrence of " or \, and surround each line with quotes.

Here is the final quine:

using System;
class P {
    static void Main() {
        string[] S = {
"        Console.WriteLine(\"using System;\");",
"        Console.WriteLine(\"class P {\");",
"        Console.WriteLine(\"    static void Main() {\");",
"",
"        Console.WriteLine(\"        string[] S = {\");",
"        foreach (string line in S) {",
"            string escapedLine = line.Replace(@\"\\\", @\"\\\\\")",
"                .Replace(\"\\\"\", \"\\\\\\\"\");",
"            Console.WriteLine(\"\\\"{0}\\\",\", escapedLine);",
"        }",
"        Console.WriteLine(\"        };\");",
"",
"        foreach (string line in S) Console.WriteLine(line);",
"",
"        Console.WriteLine(\"    }\");",
"        Console.WriteLine(\"}\");",
        };
        Console.WriteLine("using System;");
        Console.WriteLine("class P {");
        Console.WriteLine("    static void Main() {");

        Console.WriteLine("        string[] S = {");
        foreach (string line in S) {
            string escapedLine = line.Replace(@"\", @"\\")
                .Replace("\"", "\\\"");
            Console.WriteLine("\"{0}\",", escapedLine);
        }
        Console.WriteLine("        };");

        foreach (string line in S) Console.WriteLine(line);

        Console.WriteLine("    }");
        Console.WriteLine("}");
    }
}

Pretty simple, isn’t it?
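The same two-part construction carries over to other languages. As a rough sketch, here it is in Python rather than C#: build_quine automates the “copy the core into S, escape it, and quote it” step by relying on Python’s repr, and run executes the resulting source and captures what it prints. The names CORE, build_quine and run are my own, made up for this illustration.

```python
import contextlib
import io

# The program core: it prints the definition of S, then prints S as code.
CORE = [
    'print("S = [")',
    'for line in S:',
    '    print("    " + repr(line) + ",")',
    'print("]")',
    'for line in S:',
    '    print(line)',
]

def build_quine(core):
    """Return the source of a program made of two parts: the definition of S
    (holding `core` as string literals), followed by `core` itself."""
    definition = ["S = ["] + ["    " + repr(line) + "," for line in core] + ["]"]
    return "\n".join(definition + list(core)) + "\n"

def run(source):
    """Execute `source` and return everything it wrote to standard output."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})
    return buffer.getvalue()

if __name__ == "__main__":
    source = build_quine(CORE)
    print(run(source) == source)  # prints True
```

This only works because the core prints the definition of S in exactly the layout that build_quine uses. Appending extra lines to CORE before calling build_quine still yields a program that prints itself, as long as the extra lines do not write to standard output.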

One interesting point is that if we add extra code into the program core, and add the same code into the definition of string S, we will still have a quine. For example, here is how you can extend the quine to print an arbitrary string to standard error, after printing its own source code to standard output:

using System;
class P {
    static void Main() {
        string[] S = {
"        Console.WriteLine(\"using System;\");",
"        Console.WriteLine(\"class P {\");",
"        Console.WriteLine(\"    static void Main() {\");",
"",
"        Console.WriteLine(\"        string[] S = {\");",
"        foreach (string line in S) {",
"            string escapedLine = line.Replace(@\"\\\", @\"\\\\\")",
"                .Replace(\"\\\"\", \"\\\\\\\"\");",
"            Console.WriteLine(\"\\\"{0}\\\",\", escapedLine);",
"        }",
"        Console.WriteLine(\"        };\");",
"",
"        foreach (string line in S) Console.WriteLine(line);",
"",
"        Console.WriteLine(\"    }\");",
"        Console.WriteLine(\"}\");",
"",
"        Console.Error.WriteLine(\"This quine can do other things too!\");",
        };
        Console.WriteLine("using System;");
        Console.WriteLine("class P {");
        Console.WriteLine("    static void Main() {");

        Console.WriteLine("        string[] S = {");
        foreach (string line in S) {
            string escapedLine = line.Replace(@"\", @"\\")
                .Replace("\"", "\\\"");
            Console.WriteLine("\"{0}\",", escapedLine);
        }
        Console.WriteLine("        };");

        foreach (string line in S) Console.WriteLine(line);

        Console.WriteLine("    }");
        Console.WriteLine("}");

        Console.Error.WriteLine("This quine can do other things too!");
    }
}

So, we can write a program that knows how to print its source code, but also does other things. This is interesting for three reasons:

It is an important result in theoretical computer science, known as the recursion theorem. The recursion theorem can be used to prove a variety of interesting results. For example, using the recursion theorem is one way to prove that the halting problem is undecidable.
Virus writers write programs that know how to replicate themselves, but also do other things, such as messing up your computer.
You can write pointless but cool little programs, like my self-printing Game of Life.

By the way, the quine that I described in this article is by no means the shortest possible one in C#. Joey Wescott came up with a C# quine that is only 166 characters long. I suggested two improvements, and we got it down to 149 characters (all one line):

class P{static void Main(){var S="class P{{static void Main(){{var S={1}{0}{1};System.Console.Write(S,S,'{1}');}}}}";System.Console.Write(S,S,'"');}}

And there you have it – that’s how you write quines. In my next article, I will talk about an interesting little problem, and seven very different algorithms that can be used to solve it.

Other articles you may like:

Numbers that cannot be computed

Skip list are fascinating!

Quicksort killer

Ruby quines [metaspring.com]

Self-printing Game of Life in C#

Igor Ostrovsky — Thu, 30 Oct 2008 08:46:51 +0000

Conway’s Game of Life has fascinated computer scientists for decades. Even though its rules are ridiculously simple, Conway’s universe gives rise to a variety of gliders, spaceships, oscillators, glider guns, and other forms of “life”. Self-printing programs are similarly curious, and – rather surprisingly – have an important place in the theory of computation.

What happens when you combine the two? You are about to find out, but one thing is for sure: the geekiness factor should be pretty high.

I wrote a little C# program that contains a Game-of-Life grid. The program advances the game grid to the next generation and prints out a copy of itself, with the grid updated. You can take the output, compile it with a C# compiler, run it, and you’ll get the next generation of the game. You can iterate the process, or change the initial grid state manually.

Here is the source code:

using System;class G  /* GAME OF LIFE by Igor Ostrovsky */  {static string[]S={
"############################################################################",
"#                                                               * *        #",
"#  ***                                                         *           #",
"#                                       *                      *           #",
"#                                     * *                      *  *        #",
"#                           **      **            **           ***         #",
"#                          *   *    **            **                       #",
"#               **        *     *   **                                     #",
"#               **        *   * **    * *                                  #",
"#                         *     *       *                                  #",
"#                          *   *                                           #",
"#                           **                                             #",
"#   **     **                                                              #",
"#    **   **                                        *  *                   #",
"# *  * * * *  *                                         *                  #",
"# *** ** ** ***                                     *   *                  #",
"#  * * * * * *                                       ****                  #",
"#   ***   ***                                                              #",
"#                                                                          #",
"#   ***   ***                                                              #",
"#  * * * * * *            *                                                #",
"# *** ** ** ***           *           *  *                             *** #",
"# *  * * * *  *                           *                           *  * #",
"#    **   **             ***          *   *                **            * #",
"#   **     **                          ****                              * #",
"#                                                                     * *  #",
"############################################################################",    
};static void Main(){string T="\",r=\"using System;class G  /* GAME OF LIFE b"+
"y Igor Ostrovsky \"+\"*/  {static string[]S={\\n\";int p=31,i,j,b,d;for(i=0;"+
"i<27;i++){r+='\"'; for(j=0;j<76;j++){if(S[i][j]!='#'){b=0;for(d=0;d<9;d++)if"+
"(S[i-1+d/3][j-1+d%3]=='*')b++;r+=b==3 ||(S[i][j]=='*'&&b==4)?'*':' ';} else "+
"r+='#';}r+=\"\\\",\\n\";}r+=\"};static\"+\" void Main(){string T=\\\"\";fore"+
"ach(var c in T){if(c=='\\\\'||c=='\"'){r+='\\\\';p++;} r+=c; if(++p>=77){r+="+
"\"\\\"+\\n\\\"\";p=1;}} foreach(var c in T){r+=c;if(++p%79==0)r+='\\n';}Cons"+
"ole.Write(r);}}",r="using System;class G  /* GAME OF LIFE by Igor Ostrovsky "+
"*/  {static string[]S={\n";int p=31,i,j,b,d;for(i=0;i<27;i++){r+='"'; for(j=0;
j<76;j++){if(S[i][j]!='#'){b=0;for(d=0;d<9;d++)if(S[i-1+d/3][j-1+d%3]=='*')b++;
r+=b==3 ||(S[i][j]=='*'&&b==4)?'*':' ';} else r+='#';}r+="\",\n";}r+="};static"
+" void Main(){string T=\"";foreach(var c in T){if(c=='\\'||c=='"'){r+='\\';p++
;} r+=c; if(++p>=77){r+="\"+\n\"";p=1;}} foreach(var c in T){r+=c;if(++p%79==0)
r+='\n';}Console.Write(r);}}

And here is the output the program prints. The output is the same as the source code, except that the game has advanced to the next generation:

using System;class G  /* GAME OF LIFE by Igor Ostrovsky */  {static string[]S={
"############################################################################",
"#   *                                                                      #",
"#   *                                                          **          #",
"#   *                                  *                      ***          #",
"#                                    * *                      ** *         #",
"#                           *       * *           **           ***         #",
"#                          **      *  *           **            *          #",
"#               **        **    **  * *                                    #",
"#               **       ***    **   * *                                   #",
"#                         **    **     *                                   #",
"#                          **                                              #",
"#                           *                                              #",
"#   ***   ***                                                              #",
"#                                                                          #",
"# *    * *    *                                        **                  #",
"# *    * *    *                                      ** **                 #",
"# *    * *    *                                      ****                  #",
"#   ***   ***                                         **                   #",
"#                                                                          #",
"#   ***   ***                                                              #",
"# *    * *    *                                                         *  #",
"# *    * *    *                                                        *** #",
"# *    * *    *          * *             **                            * **#",
"#                         *            ** **                            ***#",
"#   ***   ***             *            ****                             ** #",
"#                                       **                                 #",
"############################################################################",
};static void Main(){string T="\",r=\"using System;class G  /* GAME OF LIFE b"+
"y Igor Ostrovsky \"+\"*/  {static string[]S={\\n\";int p=31,i,j,b,d;for(i=0;"+
"i<27;i++){r+='\"'; for(j=0;j<76;j++){if(S[i][j]!='#'){b=0;for(d=0;d<9;d++)if"+
"(S[i-1+d/3][j-1+d%3]=='*')b++;r+=b==3 ||(S[i][j]=='*'&&b==4)?'*':' ';} else "+
"r+='#';}r+=\"\\\",\\n\";}r+=\"};static\"+\" void Main(){string T=\\\"\";fore"+
"ach(var c in T){if(c=='\\\\'||c=='\"'){r+='\\\\';p++;} r+=c; if(++p>=77){r+="+
"\"\\\"+\\n\\\"\";p=1;}} foreach(var c in T){r+=c;if(++p%79==0)r+='\\n';}Cons"+
"ole.Write(r);}}",r="using System;class G  /* GAME OF LIFE by Igor Ostrovsky "+
"*/  {static string[]S={\n";int p=31,i,j,b,d;for(i=0;i<27;i++){r+='"'; for(j=0;
j<76;j++){if(S[i][j]!='#'){b=0;for(d=0;d<9;d++)if(S[i-1+d/3][j-1+d%3]=='*')b++;
r+=b==3 ||(S[i][j]=='*'&&b==4)?'*':' ';} else r+='#';}r+="\",\n";}r+="};static"
+" void Main(){string T=\"";foreach(var c in T){if(c=='\\'||c=='"'){r+='\\';p++
;} r+=c; if(++p>=77){r+="\"+\n\"";p=1;}} foreach(var c in T){r+=c;if(++p%79==0)
r+='\n';}Console.Write(r);}}

If you want to see the program iterate, save the source code into a file named life.cs, and run this command repeatedly from a Visual Studio console:

csc.exe life.cs && (life > life.cs) && life

Cool, isn’t it? I have a follow-up article nearly ready that explains how to write programs like this one… just in case you ever wanted to.

[Update] The follow-up How to write a self-printing program is up.

More articles:

Numbers that cannot be computed

Skip list are fascinating!

Quicksort killer

Numbers that cannot be computed

Igor Ostrovsky — Sun, 19 Oct 2008 04:59:54 +0000

Did you know that there are numbers that cannot be computed by any computer program? It is weird, but true.

And by number, I mean just an ordinary real number. As a perhaps unnecessarily simple example, the result of the division 1/7 looks like this:

0.142857142857142857142857142857142857142857142857142857142857…

We can easily implement a program that prints this number. The decimal expansion of 1/7 is infinite, so the program will have to run in an infinite loop to print the “whole” number. Here is a C# implementation:

    static void Main()
    {
        Console.Write("0.");
        while (true) Console.Write("142857");
    }

It is a bit of a philosophical question whether this program is really “computing” 1/7 or whether it has the answer hard-coded. You can certainly write a program that will compute the answer more legitimately by long division. The part that really matters is that there is some program that prints the infinite decimal expansion of 1/7, which makes 1/7 a computable number.
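For the curious, here is roughly what the “more legitimate” long-division version could look like. This is a small sketch in Python rather than C#, and decimal_digits is a name I made up for it. The generator never terminates on its own, just like the infinite loop above, so the caller decides how many digits to take:

```python
from itertools import islice

def decimal_digits(numerator, denominator):
    """Yield the digits after the decimal point of numerator/denominator,
    one at a time and forever, using schoolbook long division."""
    remainder = numerator % denominator
    while True:
        remainder *= 10                     # bring down the next zero
        yield remainder // denominator      # next digit of the quotient
        remainder %= denominator            # carry the rest forward

# First twelve digits of 1/7 after the decimal point.
print("0." + "".join(str(d) for d in islice(decimal_digits(1, 7), 12)))
# prints 0.142857142857
```

For a number with a finite expansion, such as 1/4, the remainder reaches zero and the generator keeps producing trailing zeros.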

Similarly, you can write programs that will print the infinite decimal expansion of any rational number, sqrt(2), any algebraic number, π, e, and just about any other number that you can describe. For example, to compute π, you could use one of the known approximation algorithms. You would compute π to greater and greater accuracy, and print more and more digits to the screen. As a side note, we are assuming that the computer that will execute your program has an unlimited amount of memory. While not very realistic, this abstraction is analogous to the famous Turing machine, and extremely useful to understand certain deep truths about computation.

Interestingly, there are numbers that are non-computable. A number is non-computable if there is no program that prints its infinite decimal expansion (adding trailing zeros if a finite expansion is possible). How do we know that there are such numbers? The key insight is that there are more real numbers than there are C# programs. That is pretty surprising, given that both the number of real numbers and the number of C# programs are infinite. Nevertheless, it is true.

C# programs are countable. That means that we can assign a different positive integer to each program. The shortest valid C# program will be 1, the next shortest will be 2, and so forth. If there are multiple valid programs of the same length, we will sort the programs lexicographically and assign integers in that order. By the way, there are many different ways in which to define a “valid program”. One approach is to say that to be valid, the program must compile, run without exceptions, and print an infinite sequence of digits, separated at exactly one place with a decimal point.
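The numbering scheme can be sketched in a few lines. The example below is in Python rather than C#, and it uses a toy two-letter alphabet and a toy is_valid test, both of which are assumptions for illustration only. The real validity test described above (the program compiles and prints an infinite sequence of digits) cannot be decided by any algorithm, but the numbering is still well defined:

```python
from itertools import count, islice, product

def shortlex(alphabet):
    """Yield every non-empty string over `alphabet`: shortest first,
    and in lexicographic order among strings of the same length."""
    letters = sorted(alphabet)
    for length in count(1):
        for chars in product(letters, repeat=length):
            yield "".join(chars)

def numbered_programs(alphabet, is_valid):
    """Pair each valid text with a distinct positive integer: 1, 2, 3, ..."""
    valid = (text for text in shortlex(alphabet) if is_valid(text))
    return enumerate(valid, start=1)

# Toy stand-in for "is a valid program": the text ends with the letter b.
for number, text in islice(numbered_programs("ab", lambda t: t.endswith("b")), 4):
    print(number, text)
```

Every finite string shows up at some finite position in this enumeration, which is exactly what it means for the set of programs to be countable.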

Unlike C# programs, real numbers are uncountable. There is no way to assign a different integer to each real number… there are just too many real numbers! This fact was proved by Georg Cantor in a couple of different ways, the most famous of which is the diagonalization argument.

Not only do non-computable numbers exist, but in fact they are vastly more abundant than computable numbers. Many, many real numbers are simply infinite sequences of seemingly random digits, with no pattern or special property. But, even though there are so many non-computable numbers, examples of them tend to be weird and a little strenuous to explain.

As one such example, consider a number whose part before the decimal point is 0. We choose the i-th digit after the decimal point to be different from the digit in the same position in the number printed by program i (by “program i”, I mean the program associated with integer i, as described earlier in the article). So, each digit after the decimal point guarantees that the constructed number differs from the number printed by one particular program. This demonstrates that the constructed number will be different from any number printed by a computer program! By the way, this construction is basically Cantor’s diagonalization argument, only recast in different terminology.
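Here is a small sketch of the diagonal construction, in Python rather than C#. It assumes we are handed a function program_digits(i) that returns the stream of digits after the decimal point printed by program i. For the real enumeration of all programs no such function can be computed, so the toy_program used here is purely hypothetical. The rule “use 5, unless the program’s digit is 5, in which case use 4” is my own choice; any rule that picks a different digit would illustrate the same idea, and avoiding 0 and 9 sidesteps the fact that numbers like 0.4999… and 0.5000… are equal.

```python
from itertools import count, islice, repeat

def diagonal_digits(program_digits):
    """Yield the digits of the diagonal number: the i-th digit is chosen to
    differ from the i-th digit printed by program i (counting from 1)."""
    for i in count(1):
        ith_digit = next(islice(program_digits(i), i - 1, None))
        yield 4 if ith_digit == 5 else 5

def toy_program(i):
    """Hypothetical stand-in for 'program i': prints the digit i % 10 forever."""
    return repeat(i % 10)

print(list(islice(diagonal_digits(toy_program), 10)))
# prints [5, 5, 5, 5, 4, 5, 5, 5, 5, 5]
```

By construction, the output disagrees with toy program 1 in the first digit, with toy program 2 in the second digit, and so on, so it cannot be the output of any program in the list.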

The theoretical foundations of computer science, which underlie everything we do as programmers, are nothing short of amazing. If you would like to read more about computability and related concepts, check out Charles Petzold’s book, The Annotated Turing. Alan Turing’s original paper that introduces computable numbers is available here, but Petzold’s book is a lot easier to read.

[Update] See the forum on the reddit page of this article for additional in-depth discussion of this topic.

More articles:

Skip list are fascinating!

Data structure zoo: ordered set

Quicksort killer

	Time
A++; B++; C++; D++;	719 ms
A++; C++; E++; G++;	448 ms
A++; C++;	518 ms