<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Gallery of Processor Cache Effects</title>
	<atom:link href="http://igoro.com/archive/gallery-of-processor-cache-effects/feed/" rel="self" type="application/rss+xml" />
	<link>http://igoro.com/archive/gallery-of-processor-cache-effects/</link>
	<description>On programming, technology, and random things of interest</description>
	<lastBuildDate>Thu, 29 Jul 2010 00:28:29 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
	<item>
		<title>By: Didier Trosset</title>
		<link>http://igoro.com/archive/gallery-of-processor-cache-effects/comment-page-2/#comment-1008</link>
		<dc:creator>Didier Trosset</dc:creator>
		<pubDate>Fri, 18 Jun 2010 10:02:24 +0000</pubDate>
		<guid isPermaLink="false">http://igoro.com/?p=366#comment-1008</guid>
		<description>@Andrew Borodin

I am not sure you can calculate the processing time the way you do. I am fairly certain that processors include a memory read-ahead mechanism, and that it is at work in the current code: when you simply read memory forward, the next few cache lines are fetched in advance. Hence my thinking that 417 is half of 908 (for very large values of 417 ;-) ), because the only limit here is computation time.

I remember Herb Sutter presenting tests of algorithms that accessed memory both sequentially and randomly. The results were _very_ different, which is explained by this read-ahead mechanism.</description>
		<content:encoded><![CDATA[<p>@Andrew Borodin</p>
<p>I am not sure you can calculate the processing time the way you do. I am fairly certain that processors include a memory read-ahead mechanism, and that it is at work in the current code: when you simply read memory forward, the next few cache lines are fetched in advance. Hence my thinking that 417 is half of 908 (for very large values of 417 <img src='http://igoro.com/wordpress/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' />  ), because the only limit here is computation time.</p>
<p>I remember Herb Sutter presenting tests of algorithms that accessed memory both sequentially and randomly. The results were _very_ different, which is explained by this read-ahead mechanism.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Andrew Borodin</title>
		<link>http://igoro.com/archive/gallery-of-processor-cache-effects/comment-page-2/#comment-1007</link>
		<dc:creator>Andrew Borodin</dc:creator>
		<pubDate>Fri, 18 Jun 2010 06:55:58 +0000</pubDate>
		<guid isPermaLink="false">http://igoro.com/?p=366#comment-1007</guid>
		<description>Even this
http://msdn.microsoft.com/en-us/library/ms973852.aspx
does not explain the first two values.</description>
		<content:encoded><![CDATA[<p>Even this<br />
<a href="http://msdn.microsoft.com/en-us/library/ms973852.aspx" rel="nofollow">http://msdn.microsoft.com/en-u.....73852.aspx</a><br />
does not explain the first two values.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Andrew Borodin</title>
		<link>http://igoro.com/archive/gallery-of-processor-cache-effects/comment-page-2/#comment-1006</link>
		<dc:creator>Andrew Borodin</dc:creator>
		<pubDate>Fri, 18 Jun 2010 06:55:04 +0000</pubDate>
		<guid isPermaLink="false">http://igoro.com/?p=366#comment-1006</guid>
		<description>@Didier Trosset
I still can&#039;t understand the first two values. Why is processing all items about 19 times slower than processing only every other item?
Processing all items: 908 - 390 = 518
Processing every other item: 417 - 390 = 27

Maybe I should clean up the results statistically.</description>
		<content:encoded><![CDATA[<p>@Didier Trosset<br />
I still can&#8217;t understand the first two values. Why is processing all items about 19 times slower than processing only every other item?<br />
Processing all items: 908 - 390 = 518<br />
Processing every other item: 417 - 390 = 27</p>
<p>Maybe I should clean up the results statistically.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Didier Trosset</title>
		<link>http://igoro.com/archive/gallery-of-processor-cache-effects/comment-page-2/#comment-1004</link>
		<dc:creator>Didier Trosset</dc:creator>
		<pubDate>Thu, 17 Jun 2010 12:46:43 +0000</pubDate>
		<guid isPermaLink="false">http://igoro.com/?p=366#comment-1004</guid>
		<description>@Andrew Borodin

BTW, I can explain the figures you got from the program in your June 6th comment.

    &lt;cite&gt;908, 417, 390, 389, 390, 217, 106&lt;/cite&gt;

The value 390 is limited by memory accesses. When you traverse your data with a step of 4, 8, or 16 (32-bit integer values), every 64-byte cache line is still touched, so the limit is memory.

When you traverse with a step of 32 (217) or 64 (106), you only use 1 cache line (64 bytes) out of 2, or out of 4. This is shown by the last 2 values, which are approximately the 390 limit divided by 2 and by 4.

For the first two values, the limit is not memory but processing. It simply shows that computing `arr[i] *= 3` 16 times per cache line (908) takes longer than fetching that 64-byte line from memory. Computing it 8 times halves the time (417), which is almost equal to the memory access limit.

&lt;em&gt;I fully agree with your last comment.&lt;/em&gt;</description>
		<content:encoded><![CDATA[<p>@Andrew Borodin</p>
<p>BTW, I can explain the figures you got from the program in your June 6th comment.</p>
<p>    <cite>908, 417, 390, 389, 390, 217, 106</cite></p>
<p>The value 390 is limited by memory accesses. When you traverse your data with a step of 4, 8, or 16 (32-bit integer values), every 64-byte cache line is still touched, so the limit is memory.</p>
<p>When you traverse with a step of 32 (217) or 64 (106), you only use 1 cache line (64 bytes) out of 2, or out of 4. This is shown by the last 2 values, which are approximately the 390 limit divided by 2 and by 4.</p>
<p>For the first two values, the limit is not memory but processing. It simply shows that computing `arr[i] *= 3` 16 times per cache line (908) takes longer than fetching that 64-byte line from memory. Computing it 8 times halves the time (417), which is almost equal to the memory access limit.</p>
<p><em>I fully agree with your last comment.</em></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Andrew Borodin</title>
		<link>http://igoro.com/archive/gallery-of-processor-cache-effects/comment-page-2/#comment-1003</link>
		<dc:creator>Andrew Borodin</dc:creator>
		<pubDate>Thu, 17 Jun 2010 06:21:11 +0000</pubDate>
		<guid isPermaLink="false">http://igoro.com/?p=366#comment-1003</guid>
		<description>@Didier Trosset:
&quot;Real need&quot; - that&#039;s a matter beyond CS, it&#039;s philosophy.
There is no real need to understand how a computer works in order to write programs. But this knowledge dramatically increases the probability of creating good programs.
Most of the things we discover will be useless most of the time we are working on our projects. But sometimes some of these &quot;useless things&quot; are very useful.
Euler&#039;s totient function was nearly useless to engineers for about 200 years, but in 1977 it became very important to RSA.

Cache effects do not matter in ERP, CMS and many other systems. But in shaders, for example, you cannot avoid them. It&#039;s better to take them into account if you are creating a DBMS or a graphics rendering engine.</description>
		<content:encoded><![CDATA[<p>@Didier Trosset:<br />
&#8220;Real need&#8221; &#8211; that&#8217;s a matter beyond CS, it&#8217;s philosophy.<br />
There is no real need to understand how a computer works in order to write programs. But this knowledge dramatically increases the probability of creating good programs.<br />
Most of the things we discover will be useless most of the time we are working on our projects. But sometimes some of these &#8220;useless things&#8221; are very useful.<br />
Euler&#8217;s totient function was nearly useless to engineers for about 200 years, but in 1977 it became very important to RSA.</p>
<p>Cache effects do not matter in ERP, CMS and many other systems. But in shaders, for example, you cannot avoid them. It&#8217;s better to take them into account if you are creating a DBMS or a graphics rendering engine.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Didier Trosset</title>
		<link>http://igoro.com/archive/gallery-of-processor-cache-effects/comment-page-2/#comment-1002</link>
		<dc:creator>Didier Trosset</dc:creator>
		<pubDate>Wed, 16 Jun 2010 14:01:28 +0000</pubDate>
		<guid isPermaLink="false">http://igoro.com/?p=366#comment-1002</guid>
		<description>On a very recent Intel(R) Xeon(R) W3520 @ 2.67GHz, I got similar results. The graph from the end of example 3, which I recreated for my system, clearly shows the L1, L2, and L3 cache levels and main memory, at 1.1, 1.5, 2.1, and 5.5 nanoseconds per access, respectively.

But this is when reading only 1 integer value (32 bits) from a single cache line that holds 16 of them. Whenever I try to do something with the other 15 integers (which is a better approximation of real memory use), anything as simple as summing them up makes the CPU time larger than the memory latency.

Maybe memory latency and throughput have improved a lot recently, but in the end, it looks to me that there&#039;s no real need to care about this.</description>
		<content:encoded><![CDATA[<p>On a very recent Intel(R) Xeon(R) W3520 @ 2.67GHz, I got similar results. The graph from the end of example 3, which I recreated for my system, clearly shows the L1, L2, and L3 cache levels and main memory, at 1.1, 1.5, 2.1, and 5.5 nanoseconds per access, respectively.</p>
<p>But this is when reading only 1 integer value (32 bits) from a single cache line that holds 16 of them. Whenever I try to do something with the other 15 integers (which is a better approximation of real memory use), anything as simple as summing them up makes the CPU time larger than the memory latency.</p>
<p>Maybe memory latency and throughput have improved a lot recently, but in the end, it looks to me that there&#8217;s no real need to care about this.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: yzt</title>
		<link>http://igoro.com/archive/gallery-of-processor-cache-effects/comment-page-2/#comment-998</link>
		<dc:creator>yzt</dc:creator>
		<pubDate>Mon, 14 Jun 2010 14:34:35 +0000</pubDate>
		<guid isPermaLink="false">http://igoro.com/?p=366#comment-998</guid>
		<description>@gulgi:
I&#039;m not that knowledgeable about this stuff, but it is possible that your computer has 32-byte cache lines (instead of Igor&#039;s 64 bytes.)</description>
		<content:encoded><![CDATA[<p>@gulgi:<br />
I&#8217;m not that knowledgeable about this stuff, but it is possible that your computer has 32-byte cache lines (instead of Igor&#8217;s 64 bytes.)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Andrew Borodin</title>
		<link>http://igoro.com/archive/gallery-of-processor-cache-effects/comment-page-2/#comment-986</link>
		<dc:creator>Andrew Borodin</dc:creator>
		<pubDate>Sun, 06 Jun 2010 16:51:30 +0000</pubDate>
		<guid isPermaLink="false">http://igoro.com/?p=366#comment-986</guid>
		<description>@gulgi:
On my Centrino laptop, this:
int[] arr = new int[64 * 1024 * 1024];

// Double the step on each pass: 1, 2, 4, ..., 64
for (int x = 1; x &lt; 65; x &lt;&lt;= 1)
{
    Stopwatch sw = new Stopwatch();
    sw.Start();
    for (int i = 0; i &lt; arr.Length; i += x) arr[i] *= 3;
    sw.Stop();
    Console.WriteLine(sw.ElapsedMilliseconds);
}

produces output:
908
417
390
389
390
217
106

Not everything is clear-cut, but it is close enough to the described behavior.</description>
		<content:encoded><![CDATA[<p>@gulgi:<br />
On my Centrino laptop, this:<br />
int[] arr = new int[64 * 1024 * 1024];</p>
<p>// Double the step on each pass: 1, 2, 4, ..., 64<br />
for (int x = 1; x &lt; 65; x &lt;&lt;= 1)<br />
{<br />
    Stopwatch sw = new Stopwatch();<br />
    sw.Start();<br />
    for (int i = 0; i &lt; arr.Length; i += x) arr[i] *= 3;<br />
    sw.Stop();<br />
    Console.WriteLine(sw.ElapsedMilliseconds);<br />
}</p>
<p>produces output:<br />
908<br />
417<br />
390<br />
389<br />
390<br />
217<br />
106</p>
<p>Not everything is clear-cut, but it is close enough to the described behavior.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: gulgi</title>
		<link>http://igoro.com/archive/gallery-of-processor-cache-effects/comment-page-2/#comment-985</link>
		<dc:creator>gulgi</dc:creator>
		<pubDate>Sat, 05 Jun 2010 22:50:48 +0000</pubDate>
		<guid isPermaLink="false">http://igoro.com/?p=366#comment-985</guid>
		<description>Very interesting, but is there a big difference between Intel and AMD?
On my X2 machine, I tried example #1:
 The += 16 loop takes 1/3 as long as the += 1 loop (+= 8 takes about the same as += 16), and += 32 takes even less time.</description>
		<content:encoded><![CDATA[<p>Very interesting, but is there a big difference between Intel and AMD?<br />
On my X2 machine, I tried example #1:<br />
 The += 16 loop takes 1/3 as long as the += 1 loop (+= 8 takes about the same as += 16), and += 32 takes even less time.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Andrew Borodin</title>
		<link>http://igoro.com/archive/gallery-of-processor-cache-effects/comment-page-2/#comment-979</link>
		<dc:creator>Andrew Borodin</dc:creator>
		<pubDate>Sun, 30 May 2010 12:17:44 +0000</pubDate>
		<guid isPermaLink="false">http://igoro.com/?p=366#comment-979</guid>
		<description>Oh, it&#039;s just AWESOME.
I gave a lecture on this post this Saturday.
I teach computer graphics, but students need to know this.
I have never seen a better explanation of this topic, even in Richter&#039;s books.
Sometimes cache effects wipe out everything that big-O analysis predicts.</description>
		<content:encoded><![CDATA[<p>Oh, it&#8217;s just AWESOME.<br />
I gave a lecture on this post this Saturday.<br />
I teach computer graphics, but students need to know this.<br />
I have never seen a better explanation of this topic, even in Richter&#8217;s books.<br />
Sometimes cache effects wipe out everything that big-O analysis predicts.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
