Welcome to Software Development on Codidact!
Measuring arithmetic overflow checking overhead in C#
Overflow checking for integral-type arithmetic operations is disabled by default. It can be enabled explicitly with the `checked` keyword or the `-checked` compiler switch.

Since I mainly develop business applications, where the main focus is correctness, I am wondering whether globally switching to `checked` makes sense and what the performance impact would be.
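For illustration, this is the behavioral difference in question: unchecked arithmetic wraps silently, while `checked` raises an `OverflowException` (a minimal standalone sketch):

```csharp
using System;

class CheckedDemo
{
    static void Main()
    {
        int max = int.MaxValue;

        // Default (unchecked) arithmetic: overflow wraps around silently.
        int wrapped = unchecked(max + 1);
        Console.WriteLine(wrapped); // -2147483648 (int.MinValue)

        // The same operation under checked throws at runtime instead.
        try
        {
            int overflowed = checked(max + 1);
            Console.WriteLine(overflowed);
        }
        catch (OverflowException)
        {
            Console.WriteLine("overflow detected");
        }
    }
}
```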
My benchmarking code is the following:
```csharp
using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
public class Program
{
    private const int Loops = 100 * 1000;

    private static readonly int[][] Data = new int[Loops][];
    private static readonly int[] Results = new int[Loops];

    private const int MaxValue = int.MaxValue / 3 + 5 * 1000 * 1000;

    [GlobalSetup]
    // populate data up front to avoid wasting benchmarking time on it
    public static void CreateData()
    {
        Random r = new();
        for (int i = 0; i < Loops; i++)
        {
            int n1 = r.Next(MaxValue);
            int n2 = r.Next(MaxValue);
            int n3 = r.Next(MaxValue);
            Data[i] = new[] { n1, n2, n3 };
        }
    }

    // checked is applied based on a flag,
    // since checked(function_call) does not seem to work
    private static int GetSum(bool isChecked, int[] numbers)
    {
        if (isChecked)
            return checked(numbers[0] + numbers[1] + numbers[2]);
        return numbers[0] + numbers[1] + numbers[2];
    }

    // execute the sum in a loop to get a significant computation time
    private static void Compute(bool isChecked)
    {
        for (int i = 0; i < Loops; i++)
        {
            int[] row = Data[i];
            int result = GetSum(isChecked, new[] { row[0], row[1], row[2] });
            Results[i] = result;
        }
    }

    [Benchmark(Baseline = true)]
    public void ComputeBaseline()
    {
        Compute(false);
    }

    [Benchmark(Baseline = false)]
    public void ComputeChecked_()
    {
        Compute(true);
    }

    static void Main(string[] args)
    {
        var summary = BenchmarkRunner.Run<Program>();
    }
}
```
My results are as follows (Release build, without a debugger attached to the process):
| Method | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Allocated |
|---|---|---|---|---|---|---|---|
| ComputeBaseline | 1.257 ms | 0.0127 ms | 0.0106 ms | 1.00 | 0.00 | 849.6094 | 4 MB |
| ComputeChecked_ | 1.153 ms | 0.0213 ms | 0.0209 ms | 0.92 | 0.02 | 849.6094 | 4 MB |
I am not sure I am reading this correctly, because it seems that the `checked` version is a little bit faster than the unchecked one, or my code is incorrectly written for the benchmark.

I would appreciate a code review to understand whether I am doing something wrong in my assessment.
I have made some changes based on the feedback received:

- fixed a bug related to initialization - the checked version gets closer to the unchecked (base) one
- disabled inlining - did not notice any change
- increased the runtime - base and checked got even closer
- reduced bias by introducing a small chance of an actual overflow - the checked version now clearly takes longer, even though the overflow occurs very rarely (a couple of dozen times per 10M loops, which could reasonably happen in a real-life scenario)
The final result looks like this:
| Method | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Allocated |
|---|---|---|---|---|---|---|---|
| ComputeBaseline | 218.0 ms | 4.34 ms | 7.82 ms | 1.00 | 0.00 | 85000.0000 | 381 MB |
| ComputeChecked_ | 229.5 ms | 2.51 ms | 2.34 ms | 1.02 | 0.04 | 85000.0000 | 381 MB |
For those interested in what the benchmark is actually outputting, I will add the relevant parts here:
```text
// Validating benchmarks:
// ***** BenchmarkRunner: Start *****
// ***** Found 2 benchmark(s) in total *****
// ***** Building 1 exe(s) in Parallel: Start *****
// start dotnet restore /p:UseSharedCompilation=false /p:BuildInParallel=false /m:1 /p:Deterministic=true /p:Optimize=true in C:\Users\Alex\source\repos\CheckedOperationsBenchmark\bin\Release\net5.0\8844cf42-ce2d-4210-8746-304950dee939
// command took 2.27s and exited with 0
// start dotnet build -c Release --no-restore /p:UseSharedCompilation=false /p:BuildInParallel=false /m:1 /p:Deterministic=true /p:Optimize=true in C:\Users\Alex\source\repos\CheckedOperationsBenchmark\bin\Release\net5.0\8844cf42-ce2d-4210-8746-304950dee939
// command took 3.78s and exited with 0
// ***** Done, took 00:00:06 (6.21 sec) *****
// Found 2 benchmarks:
//   Program.ComputeBaseline: DefaultJob
//   Program.ComputeChecked_: DefaultJob

// **************************
// Benchmark: Program.ComputeBaseline: DefaultJob
// *** Execute ***
// Launch: 1 / 1
// Execute: dotnet "8844cf42-ce2d-4210-8746-304950dee939.dll" --benchmarkName "CheckedOperationsBenchmark.Program.ComputeBaseline" --job "Default" --benchmarkId 0 in C:\Users\Alex\source\repos\CheckedOperationsBenchmark\bin\Release\net5.0\8844cf42-ce2d-4210-8746-304950dee939\bin\Release\net5.0
// BeforeAnythingElse
// Benchmark Process Environment Information:
// Runtime=.NET 5.0.10 (5.0.1021.41214), X64 RyuJIT
// GC=Concurrent Workstation
// Job: DefaultJob
OverheadJitting  1: 1 op, 405100.00 ns, 405.1000 us/op
WorkloadJitting  1: 1 op, 271396400.00 ns, 271.3964 ms/op
WorkloadPilot    1: 2 op, 423324000.00 ns, 211.6620 ms/op
WorkloadPilot    2: 3 op, 646507700.00 ns, 215.5026 ms/op
WorkloadWarmup   1: 3 op, 650490500.00 ns, 216.8302 ms/op
...
WorkloadWarmup   6: 3 op, 671575300.00 ns, 223.8584 ms/op
// BeforeActualRun
WorkloadActual   1: 3 op, 641281200.00 ns, 213.7604 ms/op
...
WorkloadActual  49: 3 op, 634473700.00 ns, 211.4912 ms/op
// AfterActualRun
WorkloadResult   1: 3 op, 641281200.00 ns, 213.7604 ms/op
...
WorkloadResult  41: 3 op, 634473700.00 ns, 211.4912 ms/op
GC:  255 0 0 1200000000 3
Threading:  3 0 3
// AfterAll
// Benchmark Process 18616 has exited with code 0.
Mean = 218.037 ms, StdErr = 1.221 ms (0.56%), N = 41, StdDev = 7.817 ms
Min = 208.001 ms, Q1 = 212.466 ms, Median = 216.227 ms, Q3 = 221.466 ms, Max = 239.382 ms
IQR = 8.999 ms, LowerFence = 198.967 ms, UpperFence = 234.965 ms
ConfidenceInterval = [213.702 ms; 222.372 ms] (CI 99.9%), Margin = 4.335 ms (1.99% of Mean)
Skewness = 1.24, Kurtosis = 3.89, MValue = 2

// **************************
// Benchmark: Program.ComputeChecked_: DefaultJob
// *** Execute ***
// Launch: 1 / 1
// Execute: dotnet "8844cf42-ce2d-4210-8746-304950dee939.dll" --benchmarkName "CheckedOperationsBenchmark.Program.ComputeChecked_" --job "Default" --benchmarkId 1 in C:\Users\Alex\source\repos\CheckedOperationsBenchmark\bin\Release\net5.0\8844cf42-ce2d-4210-8746-304950dee939\bin\Release\net5.0
// BeforeAnythingElse
// Benchmark Process Environment Information:
// Runtime=.NET 5.0.10 (5.0.1021.41214), X64 RyuJIT
// GC=Concurrent Workstation
// Job: DefaultJob
OverheadJitting  1: 1 op, 377800.00 ns, 377.8000 us/op
Arithmetic operation resulted in an overflow.
... (this message repeats many times between each of the measurements below)
WorkloadJitting  1: 1 op, 221698200.00 ns, 221.6982 ms/op
WorkloadPilot    1: 2 op, 468579600.00 ns, 234.2898 ms/op
WorkloadPilot    2: 3 op, 690754700.00 ns, 230.2516 ms/op
WorkloadWarmup   1: 3 op, 681204300.00 ns, 227.0681 ms/op
WorkloadWarmup   2: 3 op, 687082100.00 ns, 229.0274 ms/op
WorkloadWarmup   3: 3 op, 694201200.00 ns, 231.4004 ms/op
WorkloadWarmup   4: 3 op, 688857400.00 ns, 229.6191 ms/op
WorkloadWarmup   5: 3 op, 688555700.00 ns, 229.5186 ms/op
WorkloadWarmup   6: 3 op, 691014900.00 ns, 230.3383 ms/op
WorkloadWarmup   7: 3 op, 685841900.00 ns, 228.6140 ms/op
// BeforeActualRun
WorkloadActual   1: 3 op, 690004100.00 ns, 230.0014 ms/op
WorkloadActual   2: 3 op, 687871600.00 ns, 229.2905 ms/op
WorkloadActual   3: 3 op, 697458600.00 ns, 232.4862 ms/op
WorkloadActual   4: 3 op, 697258400.00 ns, 232.4195 ms/op
WorkloadActual   5: 3 op, 695430000.00 ns, 231.8100 ms/op
WorkloadActual   6: 3 op, 686219800.00 ns, 228.7399 ms/op
WorkloadActual   7: 3 op, 689060900.00 ns, 229.6870 ms/op
WorkloadActual   8: 3 op, 690860200.00 ns, 230.2867 ms/op
WorkloadActual   9: 3 op, 699231000.00 ns, 233.0770 ms/op
WorkloadActual  10: 3 op, 686350000.00 ns, 228.7833 ms/op
WorkloadActual  11: 3 op, 682036900.00 ns, 227.3456 ms/op
WorkloadActual  12: 3 op, 675138400.00 ns, 225.0461 ms/op
WorkloadActual  13: 3 op, 688420500.00 ns, 229.4735 ms/op
WorkloadActual  14: 3 op, 679475100.00 ns, 226.4917 ms/op
WorkloadActual  15: 3 op, 681246000.00 ns, 227.0820 ms/op
// AfterActualRun
WorkloadResult   1: 3 op, 690004100.00 ns, 230.0014 ms/op
...
WorkloadResult  15: 3 op, 681246000.00 ns, 227.0820 ms/op
GC:  255 0 0 1200007200 3
Threading:  3 0 3
// AfterAll
// Benchmark Process 2072 has exited with code 0.
Mean = 229.468 ms, StdErr = 0.605 ms (0.26%), N = 15, StdDev = 2.345 ms
Min = 225.046 ms, Q1 = 228.043 ms, Median = 229.474 ms, Q3 = 231.048 ms, Max = 233.077 ms
IQR = 3.006 ms, LowerFence = 223.534 ms, UpperFence = 235.557 ms
ConfidenceInterval = [226.961 ms; 231.975 ms] (CI 99.9%), Margin = 2.507 ms (1.09% of Mean)
Skewness = -0.12, Kurtosis = 1.93, MValue = 2

// ***** BenchmarkRunner: Finish *****

// * Export *
  BenchmarkDotNet.Artifacts\results\CheckedOperationsBenchmark.Program-report.csv
  BenchmarkDotNet.Artifacts\results\CheckedOperationsBenchmark.Program-report-github.md
  BenchmarkDotNet.Artifacts\results\CheckedOperationsBenchmark.Program-report.html

// * Detailed results *
Program.ComputeBaseline: DefaultJob
Runtime = .NET 5.0.10 (5.0.1021.41214), X64 RyuJIT; GC = Concurrent Workstation
Mean = 218.037 ms, StdErr = 1.221 ms (0.56%), N = 41, StdDev = 7.817 ms
Min = 208.001 ms, Q1 = 212.466 ms, Median = 216.227 ms, Q3 = 221.466 ms, Max = 239.382 ms
IQR = 8.999 ms, LowerFence = 198.967 ms, UpperFence = 234.965 ms
ConfidenceInterval = [213.702 ms; 222.372 ms] (CI 99.9%), Margin = 4.335 ms (1.99% of Mean)
Skewness = 1.24, Kurtosis = 3.89, MValue = 2
-------------------- Histogram --------------------
[205.025 ms ; 210.959 ms) | @@@@
[210.959 ms ; 216.910 ms) | @@@@@@@@@@@@@@@@@@@@@
[216.910 ms ; 222.730 ms) | @@@@@@@
[222.730 ms ; 229.247 ms) | @@@@@@
[229.247 ms ; 235.592 ms) |
[235.592 ms ; 242.357 ms) | @@@
---------------------------------------------------

Program.ComputeChecked_: DefaultJob
Runtime = .NET 5.0.10 (5.0.1021.41214), X64 RyuJIT; GC = Concurrent Workstation
Mean = 229.468 ms, StdErr = 0.605 ms (0.26%), N = 15, StdDev = 2.345 ms
Min = 225.046 ms, Q1 = 228.043 ms, Median = 229.474 ms, Q3 = 231.048 ms, Max = 233.077 ms
IQR = 3.006 ms, LowerFence = 223.534 ms, UpperFence = 235.557 ms
ConfidenceInterval = [226.961 ms; 231.975 ms] (CI 99.9%), Margin = 2.507 ms (1.09% of Mean)
Skewness = -0.12, Kurtosis = 1.93, MValue = 2
-------------------- Histogram --------------------
[223.798 ms ; 234.325 ms) | @@@@@@@@@@@@@@@
---------------------------------------------------

// * Summary *

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.1466 (21H2)
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK=6.0.100
  [Host]     : .NET 5.0.10 (5.0.1021.41214), X64 RyuJIT
  DefaultJob : .NET 5.0.10 (5.0.1021.41214), X64 RyuJIT

| Method          | Mean     | Error   | StdDev  | Ratio | RatioSD | Gen 0      | Allocated |
|---------------- |---------:|--------:|--------:|------:|--------:|-----------:|----------:|
| ComputeBaseline | 218.0 ms | 4.34 ms | 7.82 ms |  1.00 |    0.00 | 85000.0000 |    381 MB |
| ComputeChecked_ | 229.5 ms | 2.51 ms | 2.34 ms |  1.02 |    0.04 | 85000.0000 |    381 MB |

// * Hints *
Outliers
  Program.ComputeBaseline: Default -> 8 outliers were removed (259.93 ms..346.44 ms)

// * Legends *
  Mean      : Arithmetic mean of all measurements
  Error     : Half of 99.9% confidence interval
  StdDev    : Standard deviation of all measurements
  Ratio     : Mean of the ratio distribution ([Current]/[Baseline])
  RatioSD   : Standard deviation of the ratio distribution ([Current]/[Baseline])
  Gen 0     : GC Generation 0 collects per 1000 operations
  Allocated : Allocated memory per single operation (managed only, inclusive, 1KB = 1024B)
  1 ms      : 1 Millisecond (0.001 sec)

// * Diagnostic Output - MemoryDiagnoser *

// ***** BenchmarkRunner: End *****
// ** Remained 0 benchmark(s) to run **
Run time: 00:02:23 (143.72 sec), executed benchmarks: 2
Global total time: 00:02:29 (149.95 sec), executed benchmarks: 2
// * Artifacts cleanup *
```
2 answers
Most general-purpose computing operating systems can't be counted on for accurate timing as short as 1 to 2 ms. Any test case should run for a few seconds at least. Those runs should then be repeated a few times to get a sense for how much noise is in the data.
Either increase the complexity of the problem or run more iterations. I'd target the shortest run time to be in the 3-5 second range.
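That approach can be sketched with `Stopwatch`; the toy workload and iteration count here are placeholders that would need tuning so a single run actually lands in the multi-second range on real hardware:

```csharp
using System;
using System.Diagnostics;

class TimingSketch
{
    // Toy workload standing in for the code under test; size n so that
    // one run takes seconds rather than milliseconds.
    static long Work(int n)
    {
        long sum = 0;
        for (int i = 0; i < n; i++) sum += i;
        return sum;
    }

    static void Main()
    {
        var sw = new Stopwatch();
        // Repeat the measurement to get a feel for run-to-run noise.
        for (int run = 1; run <= 5; run++)
        {
            sw.Restart();
            long result = Work(50_000_000);
            sw.Stop();
            Console.WriteLine($"run {run}: {sw.ElapsedMilliseconds} ms (result {result})");
        }
    }
}
```

Printing the result keeps the compiler from eliminating the loop as dead code.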
0 comment threads
Benchmarking is hard and there are a lot of potential issues with your code. I agree with Olin Lathrop that you want at least a few seconds' worth of run time, not just due to the potential for a low-precision clock but also to deal with noise from various sources.
However, there are still other problems. One problem is the proportion of what you're measuring compared to what you want to be measuring. Adding three `int`s can be (and you'd hope is) an extremely simple operation that could quite possibly take less than 2 cycles on average, possibly even less than 1. Particularly if `GetSum` isn't inlined, it's quite possible that the (unchecked) addition takes less than 1% of the execution time of an iteration of the loop in `Compute`. If it was 0.1%, for example, then even a 10x difference would barely register. With enough data you'd be able to detect that there's a statistically significant difference, but that difference would still look minuscule.
The second problem is systematic errors, which can bias the result and are especially bad when, per the previous paragraph, the effect size is so small. Some sources of bias may be handled by the benchmarking framework, such as warming up caches so that whatever is benchmarked first doesn't eat a bunch of cache misses. A subtle, but likely minor, example is the branch in `GetSum`. Usually one branch of an `if` will be slightly faster than the other due to instruction cache and branch prediction issues. In this case, the CPU branch prediction should quickly figure out which branch is taken very early in the loop, so the effect should be negligible, though there may still be an effect. FoggyFinder points out another potential, and much larger, source of bias depending on how you intend to do checked arithmetic. If you intend to execute a handful of arithmetic operations, then catch any potential overflow exceptions and apply fallback logic, then you should do as FoggyFinder suggests and include such fallback logic, but I would not bother producing actual overflow cases, assuming your assumption of their frequency in practice holds. (Again, even several orders of magnitude of difference for the actual overflow cases isn't going to change the overall result by much.) If, instead, you are going to do larger-scale operations and catch an overflow only for the whole result, then you should NOT include the `try`-`catch` block.
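To make the two scenarios concrete, here is a sketch (the helper names and the widening fallback are my own illustration, not code from the question):

```csharp
using System;

class FallbackScenarios
{
    // Scenario 1: fallback after each small group of operations. Here the
    // try/catch is part of the real work, so it belongs inside the benchmark.
    static long SumWithFallback(int a, int b, int c)
    {
        try
        {
            return checked(a + b + c);
        }
        catch (OverflowException)
        {
            // Illustrative fallback: redo the arithmetic in a wider type.
            return (long)a + b + c;
        }
    }

    // Scenario 2: one overflow aborts the whole batch. The try/catch wraps
    // the entire computation, so it should NOT sit inside the measured loop.
    static int SumChecked(int a, int b, int c) => checked(a + b + c);

    static void Main()
    {
        Console.WriteLine(SumWithFallback(1, 2, 3));
        Console.WriteLine(SumWithFallback(int.MaxValue, 1, 0)); // takes the fallback path
        Console.WriteLine(SumChecked(1, 2, 3));
    }
}
```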
A third problem is missed opportunities. Even if you did a completely perfect job of computing how much costlier `add.ovf` is compared to `add` (and frankly this would be easy to do, easier than benchmarking, by just looking at the assembly), it would quite likely be meaningless in practice. The real cost of `add.ovf` may be in missed optimizations. You may have wondered how performing two add operations could take less than a cycle to execute. The answer is superscalar and/or SIMD execution, or more generally what's known as instruction-level parallelism. If a SIMD opcode lets you execute 4 adds at once in a single cycle, then each only costs a quarter of a cycle. The cost could even be 0 cycles if the compiler is able to eliminate the add entirely. A trivial but illustrative example: `(a + b) - b` is equivalent to just `a` in `int` arithmetic even if `a + b` overflows. The need to ensure `a + b` doesn't overflow would block this optimization in checked arithmetic.
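The `(a + b) - b` point can be demonstrated directly; whether a given JIT actually performs the elimination is a separate question, but the semantics that permit or forbid it are visible (a sketch):

```csharp
using System;

class RoundTrip
{
    // In unchecked int arithmetic, (a + b) - b always equals a, even when
    // a + b wraps around, so a compiler is free to drop both operations.
    static int Unchecked(int a, int b) => unchecked((a + b) - b);

    // In checked arithmetic the intermediate a + b must not overflow,
    // which forbids that elimination.
    static int Checked(int a, int b) => checked((a + b) - b);

    static void Main()
    {
        Console.WriteLine(Unchecked(int.MaxValue, 1)); // 2147483647
        try
        {
            Console.WriteLine(Checked(int.MaxValue, 1));
        }
        catch (OverflowException)
        {
            Console.WriteLine("overflow");
        }
    }
}
```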
So now some actual advice.
- Most importantly, benchmark real and representative code. This is what you care about and this will incorporate any relevant second-order effects.
- Minimize overhead. In this case, this means inlining everything and avoiding heap allocation in the inner loop. In other contexts, these may be negligible overhead.
- Get enough samples. The higher the precision you need and the noisier the environment, the more samples you should get. More is always better. So add, at least, another factor of 1,000 to `Loops`.
- If possible, measure the overhead. In terms of your original code, this might involve a variant of `GetSum` that does no addition and, perhaps, just returns `numbers[1]`, say. Admittedly, there's a risk that the compiler does something smart. However, if the result of doing no additions is roughly the same as with additions, then you know most of what you're measuring is overhead.
- Don't put any benchmark-choosing logic in the benchmark itself. This is overhead, but, more importantly, it is an easily avoidable source of potential bias.
- Use an AOT compiler and check the assembly NOT (just) the bytecode. This is less important when benchmarking real code as opposed to a microbenchmark because the different assembly output is part of the cost that you want to measure. Nevertheless, you want the difference to be due to the different code-under-test and not the benchmarking setup.
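For the "measure the overhead" bullet, a no-addition variant of `GetSum` might look like the following. This is a standalone sketch rather than a BenchmarkDotNet method so it runs on its own; the field shapes mirror the question's code:

```csharp
using System;

class OverheadProbe
{
    const int Loops = 100 * 1000;
    static readonly int[][] Data = new int[Loops][];
    static readonly int[] Results = new int[Loops];

    // Overhead-only variant: same array access as GetSum but no addition,
    // so timing it approximates everything *except* the add itself.
    static int GetNoSum(int[] numbers) => numbers[1];

    static void Main()
    {
        var r = new Random(42); // fixed seed for reproducibility
        for (int i = 0; i < Loops; i++)
            Data[i] = new[] { r.Next(1000), r.Next(1000), r.Next(1000) };

        for (int i = 0; i < Loops; i++)
            Results[i] = GetNoSum(Data[i]);

        Console.WriteLine(Results[0] == Data[0][1]);
    }
}
```

Comparing the timing of this loop against the real one indicates what fraction of the measurement is overhead rather than arithmetic.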
Within the context of your current code, the result of the above advice, assuming you don't need fallback code every few arithmetic operations, would be:

```csharp
public void ComputeBaseline()
{
    for (int i = 0; i < Loops; i++)
    {
        int[] row = Data[i];
        Results[i] = row[0] + row[1] + row[2];
    }
}
```

and similarly for `ComputeChecked_`.
As an aside that I won't go into in detail here, as it's not the topic of this question: I find it unlikely that a benchmark is needed to make the decision here. It's fine as a matter of curiosity. It also seems unlikely to me that checked arithmetic is the right solution to this problem (assuming it's a problem that needs solving), as opposed to larger or even unbounded numeric types.
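For completeness, the "larger or unbounded numeric types" alternative would look something like this sketch:

```csharp
using System;
using System.Numerics;

class WiderTypes
{
    static void Main()
    {
        int a = int.MaxValue, b = int.MaxValue;

        // Widening to long makes the sum of a few ints safe without checked.
        long sum = (long)a + b;
        Console.WriteLine(sum); // 4294967294

        // BigInteger is unbounded: it can never overflow, at a (much) higher cost.
        BigInteger big = (BigInteger)long.MaxValue * long.MaxValue;
        Console.WriteLine(big > long.MaxValue); // True
    }
}
```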
2 comment threads