
Title

Toub's 234-page tour-de-force on performance in .NET 9

Description

<img attachment="stephen_toub.jpg" align="right">The <s>article</s>book <a href="https://devblogs.microsoft.com/dotnet/performance-improvements-in-net-9/" author="Stephen Toub" source=".NET Blog">Performance Improvements in .NET 9</a> was published about six months ago. It contains a tremendous amount of interesting information, which I've attempted to summarize below, following the document structure in the original. <h level="3">Tier 0</h> <bq>Another tier 0 boxing example is dotnet/runtime#90496. There’s a hot path method in the async/await machinery: <c>AsyncTaskMethodBuilder<tresult>.AwaitUnsafeOnCompleted</c> (see <a href="https://devblogs.microsoft.com/dotnet/how-async-await-really-works/">How Async/Await Really Works in C#</a> for all the details). It’s really important that this method be optimized well, but it performs various type tests that can end up boxing in tier 0. In a previous release, that boxing was deemed too impactful to startup for async methods invoked early in an application’s lifetime, so <c>[MethodImpl(MethodImplOptions.AggressiveOptimization)]</c> was used to opt the method out of tiering, such that it gets optimized from the get-go. But that itself has downsides, because if it skips tiering up, it also skips dynamic PGO, and thus the optimized code isn’t as good as it possibly could be. So, <b>this PR specifically addresses those type tests patterns that box, removing the boxing in tier 0, enabling removing that <c>AggressiveOptimization</c> from <c>AwaitUnsafeOnCompleted</c>, and thereby enabling better optimized code generation for it.</b></bq> <h level="3">Loops</h> <bq>In .NET 8, as part of the work to improve dynamic PGO, a more powerful graph-based loop analyzer was added that was able to recognize many more loops. For .NET 9 with dotnet/runtime#95251, <b>that analyzer was factored out so that it could be used for generalized loop reasoning.</b> And then with PRs like dotnet/runtime#96756 for loop alignment, dotnet/runtime#96754 and dotnet/runtime#96553 for loop cloning, dotnet/runtime#96752 for loop unrolling, dotnet/runtime#96751 for loop canonicalization, and dotnet/runtime#96753 for loop hoisting, <b>many of these loop-related optimizations have now been moved to the better scheme.</b> All of that means that more loops get optimized.</bq> <h level="3">ARM SVE</h> <bq>There are multiple ways such an ISA impacts .NET, and in particular the JIT. The JIT needs to be able to be able to work with the ISA, understand the associated registers and be able to do register allocation, be taught about encoding and emitting the instructions, and so on. <b>The JIT needs to be taught when and where it’s appropriate to use these instructions, so that as part of compiling IL down to assembly, if operating on a machine that supports SVE, the JIT might be able to pick SVE instructions for use in the generated assembly.</b> And the JIT needs to be taught how to represent this data, these vectors, to user code. All of that is a huge amount of work, especially when you consider that <b>there are thousands of operations represented. What makes it even more work is hardware intrinsics.</b></bq> <bq><b>Designing and enabling the SVE support is a monstrous, multi-year effort</b>, and while the support is functional and folks are encouraged to take it for a spin, it’s not yet baked enough for us to be 100% confident the shape won’t need to evolve (for .NET 9, it’s also restricted to hardware with a vector width of 128 bits, but that restriction will be removed subsequently). 
<h level="3">AVX512</h>

<bq>So the values are <c>0 1 1 0 1 0 0 0</c>, which we read as the binary <c>0b01101000</c>, which is <c>0x68</c>. That byte is used as a “control code” to the <c>vpternlog</c> instruction to encode which of the 256 possible truth tables that exist for any possible (deterministic) Boolean combination of those inputs is being chosen. <b>This PR then teaches the JIT how to analyze the tree structures produced by the JIT to recognize such sequences of Boolean operations, compute the control code, and substitute in the use of the better instruction.</b> Of course, the JIT isn’t going to do the enumeration I did above; turns out there’s a more efficient way to compute the control code, performing the same sequence of operations but on specific <c>byte</c> values instead of <c>Booleans</c>.</bq>

<bq>This is beneficial for a variety of reasons, including less data to store, less data to load, and <b>if the register containing this state needed to be spilled</b> (meaning something else needs to be put into the register, so the value currently in the register is temporarily stored in memory), <b>reloading it is similarly cheaper.</b></bq>

All the considerations are mind-boggling. Does it fit in a cache line? How many registers does it use? Is it colocated with similar data? Is the data aligned on a boundary?

<h level="3">Vectorization</h>

<bq>Of course, you may then wonder, why wasn’t <c>bool.TryFormat</c> reverted to use the simpler code? The unfortunate answer is that this optimization only currently applies to <c>array</c> targets rather than span targets. That’s because there are alignment requirements for performing these kinds of writes, and <b>whereas the JIT can make certain assumptions about the alignment of arrays, it can’t make those same assumptions about spans, which can represent slices of something else at unaligned boundaries.</b> This is now one of the few cases where arrays are better than spans; typically span is as good or better. But I’m hopeful it will be improved in the future.</bq>

<h level="3">Object Stack Allocation</h>

<bq>The hardest part of stack allocating objects is ensuring that it’s safe. If a reference to the object were to escape and end up being stored somewhere that outlived the stack frame containing the stack-allocated object, that would be very bad; when the method returned, those outstanding references would be pointing to garbage. So, <b>the JIT needs to perform escape analysis to ensure that never happens, and doing that well is extremely challenging.</b></bq>

This is a case where you have to be exceedingly clever to keep pessimism from killing the feature entirely. That is, if you can't prove enough, you end up having to assume that escape is possible in too many cases---and the optimization applies far less often than you'd hoped.
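To make "escape" concrete, here's a minimal, purely illustrative sketch (not taken from the article): the first instance never leaves its method, so an escape analysis that can prove as much is free to let the JIT place the object on the stack; the second is returned to the caller and therefore has to live on the heap.

<code>
// Illustrative only: a small reference type used locally.
class Point { public int X; public int Y; }

static int DistanceSquared(int x, int y)
{
    // Never stored to a field, never returned, never passed somewhere that keeps it:
    // a candidate for stack allocation once escape analysis proves it doesn't escape.
    var p = new Point { X = x, Y = y };
    return p.X * p.X + p.Y * p.Y;
}

static Point CreatePoint(int x, int y)
{
    // Escapes via the return value, so it must remain heap-allocated.
    return new Point { X = x, Y = y };
}
</code>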
<h level="3">VM</h>

<bq>The .NET runtime provides many services to managed code. There’s the GC, of course, and the JIT compiler, and then there’s a whole bunch of functionality around things like <b>assembly and type loading, exception handling, configuration management, virtual dispatch, interop infrastructure, stub management</b>, and so on. All of that functionality is generally referred to as being part of the <b>coreclr virtual machine (VM).</b></bq>

<h level="3">Mono</h>

<bq>We frequently say “the runtime,” but <b>in reality there are currently multiple runtime implementations in .NET.</b> “coreclr” is the runtime thus far referred to, which is the default runtime used on Windows, Linux, and macOS, and for services and desktop applications, but there’s also “mono,” which is mainly used when the form factor of the target application requires a small runtime: by default, <b>it’s the runtime that’s used when building mobile apps for Android and iOS today, as well as the runtime used for Blazor WASM apps.</b></bq>

<bq>[...] <b>when targeting WASM, the interpreter has a form of PGO where after methods have been invoked some number of times and are deemed important, it’ll generate WASM on-the-fly to optimize those methods.</b> This tiering gets better in .NET 9 with dotnet/runtime#92981, which enables keeping track of which methods tiered up, and if the code is running in a browser, <b>storing that information in the browser’s cache for subsequent runs.</b> When the code then runs subsequently, it can incorporate the previous learnings to tier up better and more quickly.</bq>

<h level="3">Threading / <c>Debugger.NotifyOfCrossThreadDependency</c></h>

<bq>When you’re debugging a .NET process and you break in the debugger, it pauses all threads in the debuggee process so that nothing is making forward progress while you examine state. However, .NET debuggers, like the one in Visual Studio, <b>support invoking properties and methods in the debuggee while debugging.</b> That can be a big problem if the functionality being invoked relies on one of those paused threads to do something, e.g. if the property you access tries to take a lock that’s held by another thread or tries to <c>Wait</c> on a <c>Task</c>. To mitigate problems here, the <c>Debugger.NotifyOfCrossThreadDependency</c> method exists. Functionality that relies on another thread to do something can call <c>NotifyOfCrossThreadDependency</c>; if there’s no debugger attached, it’s a nop, but if there is a debugger attached, this signals the problem to the debugger, which can then react accordingly. <b>The Visual Studio debugger reacts by stopping the evaluation but then offering an opt-in option of “slipping” all threads, unpausing all threads until the evaluated operation completes</b>, at which point all threads will be paused again, thereby again trying to mitigate any problems that might occur from the cross-thread dependency.</bq>

<h level="3">VM</h>

<bq>The <a href="https://github.com/dotnet/runtime/blob/main/docs/design/specs/Memory-model.md">official .NET memory model</a> has now been documented. However, <b>some of the practices that were being employed in the core libraries (due to defensive coding or uncertainty of the memory model or out-of-date requirements) are no longer necessary.</b> One of the main tools available for folks coding at a level where memory model is relevant is the <c>volatile</c> keyword / the <c>Volatile</c> class.</bq>
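As a reminder of what that tool looks like in practice, here's a minimal, illustrative use of the <c>Volatile</c> class (not from the article) for publishing a completion flag across threads:

<code>
using System.Threading;

// Illustrative only: Volatile.Write publishes the store to other threads, and
// Volatile.Read keeps the read from being hoisted out of the polling loop by
// the compiler or the JIT.
class CompletionSignal
{
    private bool _completed;

    public void MarkCompleted() => Volatile.Write(ref _completed, true);

    public void WaitForCompletion()
    {
        while (!Volatile.Read(ref _completed))
        {
            Thread.Yield();   // let other threads make progress while we poll
        }
    }
}
</code>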
<bq><b>Marking fields or operations as <c>volatile</c> can come with an expense</b>, depending on the circumstance and the target platform. For example, it can restrict the C# compiler and the JIT compiler from performing certain optimizations.</bq>

<h level="3">Reflection</h>

<bq>Delegates in .NET are “multicast,” meaning a single delegate instance might actually represent multiple methods to be invoked; this is how .NET events are implemented. If I invoke a delegate, the delegate implementation handles invoking each constituent method, sequentially, in turn. But what if I want to customize the invocation logic? Maybe I want to wrap each individual method in a <c>try</c>/<c>catch</c>, or maybe I want to track the return values from all of the methods rather than just the last, or some such behavior. To achieve that, <b>delegates expose a way to get an array of delegates, one for each method that’s part of the original.</b></bq>

There are a lot of long chapters on number- and text-processing, which is fascinating but not eminently quotable. You can really see how so many of the various improvements build on each other to finally offer incredible speed improvements (e.g. <c>Quaternion.Cosh()</c>). So many operations have been improved to reduce allocations to zero while reducing time to a few percent of the previous time, often with even more code defined in C# rather than in the JIT as native code (see <a href="https://github.com/dotnet/runtime/pull/98623">Move memset/memcpy helpers to managed impl #98623</a> for an extreme example that touched 68 files in 48 commits). I find this to be quite elegant. It shows that the investment in the new C# constructs is paying off, because it allows framework developers to build faster and better primitives without escaping to a different language and runtime. This, in turn, allows other skilled developers to benefit from the same. Not only that, but managed code is accessible to the GC whereas native code is not.

It's very clear how .NET and C# are being positioned to take over numeric and text processing from Python and C++/C. Everything is being made more generic and funneled to vectorized types, which, in turn, map to the optimal set of instructions for the myriad supported scenarios, like AOT, ARM, WASM, x64, x86, etc. It's quite an incredible effort. All of these things combine to make your regular expressions and text searches faster, even if you stick to the existing APIs. In some cases, there are new APIs to use, but not too many. Instead, the beauty of .NET 9 is that it will just make everything so much more efficient---faster and with fewer allocations and GC churn---without programmers having to do a thing. A true feat of engineering.

<bq>[...] it’s important to recognize that many of the changes discussed thus far implicitly accrue to <c>Regex</c>. <c>Regex</c> already uses <c>SearchValues</c>, and so improvements to <c>SearchValues</c> benefit <c>Regex</c> (it’s one of my favorite things about working at the lowest levels of the stack: <b>improvements there have a multiplicative effect, in that direct use of them improves, but so too does indirect use via intermediate components that instantly get better as the lower level does</b>).</bq>
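For context, direct use of <c>SearchValues</c> looks roughly like this minimal sketch (the character set and names are made up for illustration):

<code>
using System;
using System.Buffers;

static class Tokenizer
{
    // Created once and cached: SearchValues precomputes an efficient, often
    // vectorized, strategy for searching this particular set of characters.
    private static readonly SearchValues<char> s_delimiters = SearchValues.Create(",;: \t");

    // Returns the index of the first delimiter in the text, or -1 if there is none.
    public static int IndexOfDelimiter(ReadOnlySpan<char> text) => text.IndexOfAny(s_delimiters);
}
</code>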
<h level="3">DFA Limits</h>

There is a ton of detail about the specifics of regular-expression optimization---enough to make your head spin. Like this:

<bq>The non-backtracking implementation works by constructing a finite automata, which can be thought of as a graph, with the implementation walking around the graph as it consumes additional characters from the input and uses those to guide what node(s) it transitions to next. <b>The graph is built out lazily, such that nodes are only added as those states are explored, and the nodes can be one of two kinds: DFA (deterministic) or NFA (non-deterministic).</b> DFA nodes ensure that for any given character that comes next in the input, there’s only ever one possible node to which to transition. Not so for NFA, where at any point in time there’s a list of all the possible nodes the system could be in, and moving to the next state means examining each of the current states, finding all possible transitions out of each, and treating the union of all of those new positions as the next state. <b>DFA is thus much cheaper than NFA in terms of the overheads involved in walking around the graph, and we want to fall back to NFA only when we absolutely have to, which is when the DFA graph would be too large</b>: some patterns have the potential to create massive numbers of DFA nodes. Thus, there’s a threshold where once that number of constructed nodes in the graph is hit, new nodes are constructed as NFA rather than DFA. In .NET 8 and earlier, that limit was somewhat arbitrarily set at 10,000. For .NET 9 as part of this PR, <b>analysis was done to show that a much higher limit was worth the memory trade-offs, and the limit was raised to 125,000, which means many more patterns can fully execute as DFA.</b></bq>

<bq><b>The inner matching loop is the hot path for a matching operation: read the next character, look up its minterm, follow the corresponding edge to the next node in the graph, rinse and repeat.</b> Performance of the engine is tied to efficiency of this loop. These PRs recognized that there were some checks being performed in that inner loop which were only relevant to a minority of patterns. For the majority, <b>the code could be specialized such that those checks wouldn’t be needed in the hot path.</b></bq>

<h level="3">Span, Span, and more Span</h>

<bq>The introduction of <c>Span<T></c> and <c>ReadOnlySpan<T></c> back in .NET Core 2.1 have revolutionized how we write .NET code (especially in the core libraries) and what APIs we expose (see <a href="https://www.youtube.com/watch?v=5KdICNWOfEQ">A Complete .NET Developer’s Guide to Span</a> if you’re interested in a deeper dive). <b>.NET 9 has continued the trend of doubling-down on spans as a great way to both implicitly provide performance boosts and also expose APIs that enable developers to do more for performance in their own code.</b></bq>

<bq>One of the really nice optimizations the C# compiler added several years back was the ability to recognize when a new <c>byte</c>/<c>sbyte</c>/<c>bool</c> array was being constructed, filled with only constants, and directly assigned to a <c>ReadOnlySpan<T></c>. In such a case, <b>it would recognize that the data was all blittable and could never be modified, so rather than allocating an array and wrapping a span around it, it would blit the data into the assembly and then just construct a span around a pointer into the assembly data with the appropriate length.</b></bq>

This is a wonderful optimization. Clever in a way that only a systems programmer would invent.
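The pattern the compiler recognizes looks roughly like this (the lookup table is illustrative): no array is allocated at runtime; the bytes are baked into the assembly's data section, and the span points straight at them.

<code>
static class Hex
{
    // A constant-only byte[] assigned directly to ReadOnlySpan<byte>: the C# compiler
    // emits the data into the assembly and returns a span over it, with no allocation.
    private static ReadOnlySpan<byte> Digits => new byte[]
    {
        (byte)'0', (byte)'1', (byte)'2', (byte)'3', (byte)'4', (byte)'5', (byte)'6', (byte)'7',
        (byte)'8', (byte)'9', (byte)'a', (byte)'b', (byte)'c', (byte)'d', (byte)'e', (byte)'f'
    };

    // Maps the low four bits of a value to its lowercase hex character.
    public static char ToHexChar(int value) => (char)Digits[value & 0xF];
}
</code>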
<bq><code>foreach (Range r in clientSecWebSocketProtocol.AsSpan().Split(',')) { if (clientSecWebSocketProtocol.AsSpan(r).Trim().Equals(acceptProtocol, StringComparison.OrdinalIgnoreCase)) { return true; } }</code>In doing so, <b>it becomes allocation-free, as this <c>Split</c> doesn’t need to allocate a <c>string[]</c> to hold results and doesn’t need to allocate a string for each segment</b>: instead, it’s returning a <c>ref struct</c> enumerator that yields a <c>Range</c> representing each segment. The caller can then use that <c>Range</c> to slice the input. It’s yielding a <c>Range</c> rather than, say, a <c>ReadOnlySpan<T></c>, to enable the splitting to be used with original sources other than spans and be able to get the segments in the original form.</bq>

There is such a strong focus on <c>structs</c> and <c>refs</c> to make allocation-free code. And now we see how they leverage the recently introduced <c>Range</c> to provide indexes into a sequence, which the calling code can then decide how to extract. This offers maximum flexibility to the caller, as the algorithm isn't making any costly decisions for it. In this case, he's discussing how they've made it relatively easy and intuitive to write code that searches a string without any allocations. The sequence doesn't allocate, examining each chunk as a span doesn't allocate, and even the <c>Trim()</c> on a <c>Span</c> doesn't allocate anything.

<h level="3">LINQ</h>

There is a long chapter on LINQ optimizations that boils down to having cleaned up a ton of internal implementation to consolidate on a common base class for common iterator combinations like <c>Where</c>/<c>First</c>, <c>Where</c>/<c>OrderBy</c>, etc. Instead of testing for interfaces, it can now test for a single base class and perform a virtual dispatch rather than an interface dispatch (which is cheaper). This massive cleanup has the dual benefit of having made many, many LINQ operations 10, 20, and even 100 times faster---and many of them (if not most) are now completely allocation-free. Reducing allocations reduces churn in the GC, which also makes the app faster.

<h level="3">Core Collections</h>

There is also a long chapter on dictionary optimizations. In particular, you can now store data in a dictionary with <c>string</c> keys but request an <i>alternate</i> view on the dictionary that lets you work with it as if it used <c>ReadOnlySpan<char></c> keys, which can <i>drastically</i> reduce allocations, as the spans you have don't need to be converted to strings simply in order to do the lookups and stores. The changes apply to <c>HashSet</c> as well.
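A minimal sketch of what that alternate view looks like in use (the data, the splitting, and the names are illustrative; this assumes .NET 9's span-based <c>Split</c> and <c>GetAlternateLookup</c> with a comparer that supports span lookups):

<code>
using System;
using System.Collections.Generic;

static class WordCounter
{
    public static Dictionary<string, int> Count(ReadOnlySpan<char> text)
    {
        var counts = new Dictionary<string, int>(StringComparer.Ordinal);

        // The alternate lookup is keyed by ReadOnlySpan<char>, so slices of the
        // input only become strings when a new key actually has to be added.
        var bySpan = counts.GetAlternateLookup<ReadOnlySpan<char>>();

        foreach (Range r in text.Split(' '))
        {
            ReadOnlySpan<char> word = text[r].Trim();
            if (!word.IsEmpty)
            {
                bySpan[word] = bySpan.TryGetValue(word, out int n) ? n + 1 : 1;
            }
        }

        return counts;
    }
}
</code>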
<h level="3">Compression</h>

This is less about compression and more about the general philosophy and tactics underlying performance optimization in .NET (and, presumably, any runtime).

<bq>It’s an important goal of the core .NET libraries to be as platform-agnostic as possible. Things should generally behave the same way regardless of which operating system or which hardware is being used, excepting things that really are operating system or hardware specific (e.g. we purposefully don’t try to paper over casing differences of different file systems). To that end, <b>we generally implement as much as possible in C#, deferring down to the operating system and native platform libraries only when necessary.</b></bq>

<h level="3">Networking</h>

<bq>dotnet/runtime#99364 <b>changes the synchronization mechanism from using a pure lock-based scheme to a more opportunistic concurrency scheme that employs a first-layer of lockless synchronization.</b> There’s now still a lock, but for the hot path it’s avoided as long as there are connections in the pool by using a <c>ConcurrentStack<T></c>, such that renting is a <c>TryPop</c> and returning is a <c>Push</c>. <b><c>ConcurrentStack<T></c> itself uses a lock-free algorithm that’s a lot more scalable than a lock.</b></bq>

<bq><c>UrlEncode</c> had a complicated scheme where it would UTF8-encode into a newly-allocated <c>byte[]</c>, percent-encode in place in that (thanks to the ability to reinterpret cast with spans), and then use the resulting chars to create a new string. <b>Instead, <c>string.Create</c> can be used, with all of the work done in-place in the buffer generated for that operation.</b></bq>

<bq>[...] updated <c>UrlEncodeToBytes</c>, using stack space instead of allocation for smaller inputs, and <b>using <c>SearchValues<byte></c> to optimize the search for invalid bytes.</b></bq>

You can really see how the changes made over the last several versions allow a literal horde of open-source programmers to optimize the hell out of hot paths in the .NET library. Use <c>Span</c> and <c>ReadOnlySpan</c> and <c>ref struct</c> and <c>readonly ref struct</c> to avoid allocations; allocate on the stack wherever you can when you can't avoid allocations; return enumerators instead of allocating array results; and use highly optimized building blocks like <c>SearchValues</c> and <c>ConcurrentStack</c>, which employ lock-free algorithms or include custom implementations for common patterns. It all adds up to being able to just write performant code by default, writing legible, maintainable, and concise high-level code that the compiler and/or the JIT carefully marshal down to super-efficient IL and assembly. You can visualize your code being analyzed and then <a href="https://youtu.be/xw5ErADiuec?t=371">sorted like Plinko chips</a> until it finally lands in the processor cache as instructions.

<h level="3">Profiling with BenchmarkDotNet</h>

<bq>There’s another very handy nuget package, <i>Microsoft.VisualStudio.DiagnosticsHub.BenchmarkDotNetDiagnosers</i>, which contains additional “diagnosers” for BenchmarkDotNet. <b>Diagnosers are one of the main extensibility points within BenchmarkDotNet, enabling developers to perform additional tracking and analyses over benchmarks.</b> You’ve already seen me use some, including the built-in <c>[MemoryDiagnoser(false)]</c> and <c>[DisassemblyDiagnoser]</c>; there are other built-in ones we haven’t used in this post but that are helpful in various situations, like <c>[ThreadingDiagnoser]</c> and <c>[ExceptionDiagnoser]</c>, but diagnosers can come from anywhere, and the aforementioned nuget package provides several more. <b>The purpose of those diagnosers is to collect and export performance traces that Visual Studio’s performance tools can then consume.</b> In my case, I want to collect a CPU trace, so as to understand where CPU consumption is going, so I added a <c>[CPUUsageDiagnoser]</c> attribute to my <c>Tests</c> class.</bq>
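For orientation, a bare-bones BenchmarkDotNet setup with a couple of the built-in diagnosers mentioned above looks something like the following (the benchmark body is just an illustration):

<code>
using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser(false)]   // track allocations, without the per-generation columns
[DisassemblyDiagnoser]     // capture the generated assembly for each benchmark
public class Tests
{
    private readonly string _text = "alpha,beta,gamma";

    [Benchmark]
    public bool ContainsComma() => _text.AsSpan().Contains(',');
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<Tests>();
}
</code>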
<h level="3">Lock-free programming</h>

What does lock-free programming look like? You replace a <c>lock</c> with an atomic compare/exchange operation, usually inside a retry loop that spins until the exchange succeeds (which is why you'll sometimes see this called a spin loop).

<bq>[Rather than] <code>lock (this) { _delta += value; }</code>, it used an interlocked operation to perform the addition atomically. Here <c>_delta</c> is a <c>double</c>, and there’s no Interlocked.Add that works with <c>double</c> values, so instead the standard approach of using a loop around an <c>Interlocked.CompareExchange</c> was employed.<code>double currentValue; do { currentValue = _delta; } while (Interlocked.CompareExchange(ref _delta, currentValue + value, currentValue) != currentValue);</code></bq>

<h level="3">Cache lines</h>

Finally, an optimization that takes CPU cache lines into account. I hadn't seen anything else that low-level so far.

<bq>In this benchmark, one thread is incrementing <c>_values[0]</c> and the other thread is incrementing either <c>_values[1]</c> or <c>_values[31]</c>. That index is the only difference, yet the one accessing <c>_values[31]</c> is several times faster than the one accessing <c>_values[1]</c>. That’s because there’s contention here even if it’s not obvious in the code. The contention comes from the fact that the hardware works with memory in groups of bytes called a “cache line.” <b>Most hardware has cache lines of 64 bytes. In order to update a particular memory location, the hardware will acquire the whole cache line. If another core wants to update that same cache line, it’ll need to acquire it. That back and forth results in a lot of overhead.</b> It doesn’t matter if one core is touching the first of those 64 bytes and another thread is touching the last, from the hardware’s perspective there’s still sharing happening. <b>“False sharing.”</b> Thus, the <c>Counter</c> fix is using padding around the double values to try to space them out more so as to <b>minimize the sharing that limits scalability.</b></bq>

👏 File that under something I understand but would never have programmed.

<bq>In the two benchmarks, we can see that the number of instructions executed is almost the same between when false sharing occurred <c>(Index == 1)</c> and didn’t <c>(Index == 31)</c>, but <b>the number of cache misses is more than three times larger in the false sharing case</b>, and reasonably well correlated with the time increase. When one core performs a write, that invalidates the corresponding cache line in the other core’s cache, such that the other core then needs to reload the cache line, resulting in cache misses.</bq>
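A sketch of the kind of fix being described---padding so that values updated by different threads are spaced a cache line apart (the layout and sizes are illustrative, not the actual <c>Counter</c> implementation):

<code>
using System.Runtime.InteropServices;

// Illustrative only: each value occupies its own 64-byte slot, so two threads
// incrementing different slots are far less likely to contend on the same cache line.
[StructLayout(LayoutKind.Sequential, Size = 64)]
struct PaddedDouble
{
    public double Value;
}

class Counters
{
    private readonly PaddedDouble[] _values = new PaddedDouble[2];

    public void IncrementFirst()  => _values[0].Value++;
    public void IncrementSecond() => _values[1].Value++;
}
</code>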
<h level="3">Conclusion</h>

<bq>There are multiple forms of performance improvements covered throughout the post. <b>Some of the improvements you get completely for free just by upgrading the runtime</b>; the implementations in the runtime are better, and so when you run on them, your code just gets better, too. <b>Some of the improvements you get completely for free by upgrading the runtime and recompiling</b>; the C# compiler itself generates better code, often taking advantage of newer surface area exposed in the runtime. And other improvements are new features that, in addition to the runtime and compiler utilizing, you can utilize directly and make your code even faster. <b>Educating about those capabilities and why and where you’d want to utilize them is important to me.</b> But beyond the new features, the techniques employed in making all of the rest of the optimizations throughout the runtime are often more broadly applicable. <b>By learning how these optimizations are applied in the runtime, you can extrapolate and apply similar techniques to your own code, making it that much faster.</b></bq>

And that is much appreciated, Stephen. Having seen the tools available, I feel much better equipped not only to write performant code myself but also to advise others on writing it.