
A comparison of C++ and Rust compiler performance

Large code bases take a long time to compile and build. How much impact does the choice of programming language have on compiler/build times?

The small number of industrial strength compilers available for the widely used languages makes the discussion as much about distinct implementations as distinct languages. And then there is the issue of different versions of the same compiler having different performance characteristics; for instance, the performance of Microsoft’s C++ Visual Studio compiler depends on the release of the compiler and the version of the standard specified.

Implementation details can have a significant impact on compile time. Compilation may be dominated by lexical analysis, while support for lots of fancy optimizations shifts the time cost to code generation, and poorly chosen algorithms can result in symbol table lookup becoming a time sink (especially for large programs). For short programs, compile time may be dominated by start-up costs. These days, developers rarely have to worry about a small memory size causing problems when compiling a large source file.

A recent blog post compared the compile/build time performance of Rust and C++ on Linux and Apple’s OS X, and strager kindly made the data available.

The most obvious problem with attempting to compare the performance of compilers for different languages is that the amount of code contained in programs implementing the same functionality will differ (this is also true when the programs are written in the same language).

strager’s solution was to take a C++ program containing 9.3K LOC, plus 7.3K LOC of tests, and convert everything to Rust (9.5K LOC, plus 7.6K LOC of tests). In total, the benchmark consisted of compiling/building three C++ source files and three equivalent Rust source files.

The performance impact of the number of lines of source code was measured by taking the largest file and copy-pasting it 8, 16, and 32 times, to create three ever larger versions.

The OS X benchmarks did not include multiple file sizes, and are not included in the following analysis.

A variety of toolchain options were included in different runs, along with various command line options for Rust; most distinct benchmarks were run 10 times each. On Linux, there were a total of 360 C++ runs, and for Rust 1,066 runs (code+data).

The model duration=K_1*benchmark_i, where K_1 is a fitted regression constant, and benchmark_i is the fitted regression coefficient for the i‘th benchmark, explains just over 50% of the variance. The language is implicit in the benchmark information.
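
In R, this model can be fitted along the following lines (a minimal sketch, assuming the timing data is held in a data frame, stra, with the column names used in the comment code below):

    # multiplicative model, so fit the log of the duration
    bench_mod=glm(log(duration) ~ benchmark, data=stra)
    1-bench_mod$deviance/bench_mod$null.deviance  # fraction of variance explained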

The model duration=K_2*benchmark_j e^{copies*(0.028+0.035L)-0.084L}, where copies is the number of copies, L is 0 for C++ and 1 for Rust, explains 92% of the variance, i.e., it is a very good fit.
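
The corresponding R call adds the number of copies and its interaction with the language (again a sketch using the stra data frame from the comment code):

    # copies*language expands to copies+language+copies:language
    lang_mod=glm(log(duration) ~ benchmark+copies*language, data=stra)
    summary(lang_mod)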

The expression e^{copies*(0.028+0.035L)-0.084L} is a language and source code size multiplication factor. The numeric values are:

   Copies    1      8     16     32
   C++     1.03   1.25   1.57   2.45
   Rust    0.98   1.52   2.52   6.90

showing that while Rust compilation is faster than C++ for ‘shorter’ files, it becomes relatively much slower as the quantity of source increases.

Over the measured range, the size factor for Rust grows roughly quadratically, while for C++ it is much closer to linear.
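
The factors in the table can be reproduced by plugging the fitted coefficients into the expression above (a quick sketch):

    copies=c(1, 8, 16, 32)
    round(exp(copies*0.028), 2)        # C++ (L=0): 1.03 1.25 1.57 2.45
    round(exp(copies*0.063-0.084), 2)  # Rust (L=1): 0.98 1.52 2.52 6.90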

What are the threats to the validity of this benchmark comparison?

The most obvious is the small number of files benchmarked. I don’t have any ideas for creating a larger sample.

How representative is the benchmark of typical projects? An explicit goal in benchmark selection was to minimise external dependencies (to reduce the effort likely to be needed to translate to Rust). Typically, projects contain many dependencies, with compilers sometimes spending more time processing dependency information than compiling the actual top-level source, e.g., header files in C++.

A less obvious threat is the fact that the benchmarks for each language were run in contiguous time intervals, i.e., all Rust, then all C++, then all Rust (see plot below; code+data):

Point in time when each benchmark was run, stratified by language.

It is possible that one or more background processes were running while one set of language benchmarks was being run, which would skew the results. One solution is to intermix the runs for each language (switching off all background tasks is much harder).
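
One way to implement this interleaving (a sketch, assuming the individual benchmark commands are collected in a character vector, cmds, which is hypothetical):

    set.seed(42)              # make the random ordering reproducible
    for (cmd in sample(cmds)) # permute the run order, mixing the two languages
       system(cmd)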

I was surprised by how well the regression model fitted the data; the fit is rarely this good. Perhaps a larger set of benchmarks would increase the unexplained variance.

  1. strager
    January 31, 2023 23:06 | #1

    Nitpick: My name is “strager”, not “Strager”.

    > A recent blog post compared the compile/build time performance of Rust and C++ on Linux and Apple’s OS X

    Correction: The blog post mostly analyzes build+test times. Sometimes it compares only test times (no building). The post never compares build times only. See the appendix in the post for details, including what “incremental” means.

    > A less obvious threat is the fact that the benchmarks for each language were run in contiguous time intervals

    You’re right! I should have randomized the runs. This is something to keep in mind for the future.

    > The expression e^{copies*(0.028+0.035L)-0.084L} is a language and source code size multiplication factor.

    Can we use this equation to estimate incremental build+test times given project size (in lines of code) as input?

  2. February 1, 2023 00:02 | #2

    @strager
    Thanks for the corrections. Fixed the “strager” issue.

    Fitting a model to just the Rust timings would be cleaner, e.g.,

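    # stra holds the timing data; keep just the Rust measurements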
    rust=subset(stra, language == "rust")
    strm_mod=glm(log(duration) ~ copies+benchmark, data=rust)

    giving the code size multiplication factors:
    e^{copies*0.063} -> 1.06, 1.66, 2.76, 7.61
    and then intermediate values could be interpolated, and size mapped to LOC.

    But you have the problem that the value of benchmark varies between benchmarks, i.e., each benchmark makes its own unique contribution to the timing.

    With only three distinct incremental build+test sets of measurements, any generalization is likely to be unreliable.

    What is needed is data on a wider range of build+test.
