This page shows the source for this entry, with WebCore formatting language tags and attributes highlighted.

Title

Building RegEx from scratch with Stephen Toub

Description

This is another excellent one-hour tour of a complex corner of .NET. Toub describes and shows how the source-generated RegEx engine works. <media href="https://www.youtube.com/watch?v=ptKjWPC7pqw" src="https://www.youtube.com/v/ptKjWPC7pqw" source="YouTube" width="560px" caption="Deep Dive into RegEx with Stephen Toub" author="dotnet / Scott Hanselman"> <ul>The generated source is human-readable, debuggable, and well-commented. It updates in real time as you change the expression. It includes XML documentation that describes the regular expression in plain English. They rewrote the RegEx compiler in .NET 7 both to better support source generators and to be able to emit C# source code as well as IL. They rebuilt the emitter to allow more leeway in code generation; the first generation emitted C# that looked very much like IL. They have a gigantic test suite that they culled from open-source code: about 4M expressions, deduplicated down to about 20,000 unique expressions, which they run against all four RegEx engines to verify that nothing runs pathologically long or uses excessive memory. There is an analyzer that tries very hard to eliminate unnecessary backtracking from greedy loops by making them atomic wherever it can prove that doing so is safe. Fascinating. At <b>47:00</b>, he shows a great example of a regex that requires backtracking, which can lead to pathological, exponential performance. Backtracking engines support back-references, which are powerful. They can be super-fast for matches, but they have very bad worst-case behavior that can be exploited for denial-of-service (ReDoS) attacks. In .NET, you can set a timeout on your regular-expression evaluation to avoid this. You can also set a global timeout. You can also turn off backtracking with <c>RegexOptions.NonBacktracking</c>. If .NET can produce the engine to evaluate the expression, then it will evaluate in time linear in the length of the input. If it cannot, it's probably a compile-time error if you're using source generators, which is quite nice. 
They also examine an email-address RegEx, which leads Toub to show how the generated source uses the <c>SearchValues</c> variants, which are a highly optimized way of searching text, with dozens of algorithms that it chooses among by analyzing the set of values to search for. They have SIMD/Vector/Arm Intrinsics support where possible and are exactly the kind of optimization that a framework like .NET can offer, but that an app developer would never have time to make.</ul> 💙 Stephen Toub. He's absolutely brilliant. Mad props to Scott Hanselman for reining him in and providing a great sparring partner.
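The timeout and non-backtracking options from the notes above can be tried out with a small C# sketch. It is only an illustration under stated assumptions, not code from the video: the pattern, the input, and the helper names <c>TimesOut</c> and <c>IsRejectedByNonBacktracking</c> are made up for this example. It assumes .NET 7 or later with top-level statements. Whether the first call actually hits the timeout depends on the runtime version and on how much the optimizer manages to simplify the pattern.

```csharp
using System;
using System.Text.RegularExpressions;

// Process-wide default timeout for Regex instances that do not specify one.
// It has to be set before the Regex type is used for the first time.
AppDomain.CurrentDomain.SetData("REGEX_DEFAULT_MATCH_TIMEOUT", TimeSpan.FromSeconds(2));

// Hypothetical pattern and input, chosen for illustration only.
// Nested quantifiers give a backtracking engine many ways to split the run
// of 'a's, and the trailing '!' ensures that none of them can succeed.
const string Pattern = @"^(a+)+$";
string nearMiss = new string('a', 40) + "!";

// 1. Backtracking engine with an explicit per-call timeout.
bool timedOut = TimesOut(nearMiss, Pattern, TimeSpan.FromMilliseconds(250));
Console.WriteLine($"Backtracking engine hit the timeout: {timedOut}");

// 2. Non-backtracking engine: same pattern, linear-time matching.
bool matched = Regex.IsMatch(nearMiss, Pattern, RegexOptions.NonBacktracking);
Console.WriteLine($"Non-backtracking engine matched: {matched}");

// 3. Back-references need backtracking, so this engine refuses the pattern.
bool rejected = IsRejectedByNonBacktracking(@"(\w)\1");
Console.WriteLine($"Back-reference rejected by non-backtracking engine: {rejected}");

// Returns true if matching was aborted because the timeout elapsed.
static bool TimesOut(string input, string pattern, TimeSpan timeout)
{
    try
    {
        Regex.IsMatch(input, pattern, RegexOptions.None, timeout);
        return false;
    }
    catch (RegexMatchTimeoutException)
    {
        return true;
    }
}

// Returns true if the non-backtracking engine cannot be built for the pattern.
static bool IsRejectedByNonBacktracking(string pattern)
{
    try
    {
        _ = new Regex(pattern, RegexOptions.NonBacktracking);
        return false;
    }
    catch (NotSupportedException)
    {
        return true;
    }
}
```

The sketch uses the runtime <c>Regex</c> API rather than the source generator so that it fits in one file. The timeout limits the damage of a pathological match, whereas <c>RegexOptions.NonBacktracking</c> avoids the problem altogether at the cost of giving up constructs like back-references.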