1 Billion Rows Challenge
I recently stumbled upon the "One Billion Rows Challenge"—an intriguing test of speed and efficiency. Although the official contest had already wrapped up, I couldn't resist giving it a shot myself. Armed with my growing knowledge of modern C++, I set out to see just how quickly I could process a billion rows. This post is a personal log of that journey: what I tried, what failed, what succeeded, and the final results.
Base case and measurements
Code
Remembering a relevant quote ("something something premature optimization, evil something"), I started off with the simplest possible effort - just writing the code. We can always optimize it later. This first version is readable, understandable, and maintainable. The code outline is shown below (with the actual code living in the source for this blog post).
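Here is a minimal sketch of that baseline, assuming the shape described later in this post: a `Measurement` struct tracking min/max/sum/count as floats, a `parse_value` wrapper around `std::stof`, and a `std::getline` loop feeding a `std::unordered_map`. The variable names, the hardcoded `measurements.txt` path, and the output formatting are illustrative.

```cpp
#include <algorithm>
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>

struct Measurement {
    float min;
    float max;
    int count;
    float sum;

    Measurement(float value) : min(value), max(value), count(1), sum(value) {}

    inline void update(float value) {
        min = std::min(min, value);
        max = std::max(max, value);
        sum += value;
        ++count;
    }
};

static inline float parse_value(const std::string& to_parse) {
    return std::stof(to_parse);
}

int main() {
    std::ifstream file("measurements.txt");  // hardcoded path, no error handling
    std::unordered_map<std::string, Measurement> stats;

    std::string line;
    while (std::getline(file, line)) {
        const auto sep = line.find(';');
        std::string city = line.substr(0, sep);              // allocation per line
        const float value = parse_value(line.substr(sep + 1));  // another allocation per line

        auto it = stats.find(city);
        if (it == stats.end())
            stats.emplace(std::move(city), Measurement(value));
        else
            it->second.update(value);
    }

    for (const auto& [city, m] : stats)
        std::cout << city << '=' << m.min << '/' << m.sum / m.count << '/' << m.max << '\n';
}
```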
I have tried to follow some best practices in this code; however, a few things immediately stand out:
- Excessive string allocation in hot loop
- Inefficient parsing
- Hardcoded measurements file location
- No error handling for file operations
- No support for parallelism or concurrency
There are, of course, many more issues lurking beneath the surface. We will get to those soon.
Timings and Measurements
I ran this code on a 1-billion-row file generated by the script in the original challenge repo, and the timing I observed was a little over 220 seconds of wall-clock time.
Remarkably, the top entries in the challenge completed this task in less than 2 seconds. That means my baseline implementation is over 100 times slower! Even if I could perfectly parallelize the workload across all 20 CPU cores (ignoring the realities of Amdahl's Law), it would still take more than 10 seconds. Clearly, there is significant room for optimization. But to improve anything, metrics need to exist first - so that's exactly what we're going to do next: measure.
Tools and methodology
I'm using Arch, btw, which means I get to leverage the excellent `perf` utility to analyze performance bottlenecks in my code. I've been working with the perf + hotspot profiler combination, and they integrate seamlessly out of the box. This eliminates the hassle of writing custom scripts to generate flamegraphs and removes the complexity of analyzing per-thread performance and activity. For the remainder of this post, I'll follow this systematic methodology:
- Measure time and record execution with `perf`
- Analyze results in `hotspot`
- Create targeted micro-benchmarks with Google Benchmark
- Replace performance-critical hotspots with optimized implementations
- Iterate
Since I've already completed the first step with `time`, it's time (pun intended) to dive into the profiling phase.
Performance profiling with perf and hotspot
I started by running `perf stat` on the baseline implementation:
    task-clock:u                #    1.000 CPUs utilized
    context-switches:u          #    0.000 /sec
    cpu-migrations:u            #    0.000 /sec
    page-faults:u               #    2.071 /sec
    instructions:u              #    2.78  insn per cycle
                                #    0.07  stalled cycles per insn    (35.71%)
    cycles:u                    #    1.944 GHz                        (35.71%)
    stalled-cycles-frontend:u   #   20.42% frontend cycles idle       (35.72%)
    branches:u                  #    1.107 G/sec                      (35.71%)
    branch-misses:u             #    1.36% of all branches            (35.71%)
    L1-dcache-loads:u           #    2.765 G/sec                      (35.71%)
    L1-dcache-load-misses:u     #    0.82% of all L1-dcache accesses  (35.72%)
    L1-icache-loads:u           #    3.071 M/sec                      (35.71%)
    L1-icache-load-misses:u     #    4.90% of all L1-icache accesses  (35.72%)
    dTLB-loads:u                #    8.439 M/sec                      (35.72%)
    dTLB-load-misses:u          #    0.02% of all dTLB cache accesses (35.72%)
    iTLB-loads:u                #    4.469 K/sec                      (35.71%)
    iTLB-load-misses:u          #    4.84% of all iTLB cache accesses (35.71%)
    L1-dcache-prefetches:u      #    4.503 M/sec                      (35.71%)
There are a lot of things that stand out in this report:
- Total time taken was 221.726s, with a full CPU utilized.
- The task did not migrate between CPUs, i.e. the scheduler did not move our task around.
- The number of page faults is trivial. The file itself is 15G in size, so 459 page faults are nothing.
- The instruction count is huge.
- A lot of time was spent waiting on memory (frontend cycles idle). The IPC itself is good, which means that when the CPU did have work to do, the code exhibited decent instruction-level parallelism.
As such, the code looks... mostly okay. Which means the issue is not at the system level, but in how I wrote the code itself. Time to dig deeper. For this, we will record the execution again with `perf record`, and analyze the flamegraph in `hotspot`.
In the flamegraph, it is evident that a large chunk of the total time is spent in the `parse_value` function - almost 40%. After this, the second most expensive call is to `getline`. In this code, I'm reading the file line by line, and each read triggers a copy from kernel space to user space. This can be reduced using `mmap`, which I will benchmark later.
Looking at possible solutions for faster float parsing, I found `from_chars` in the standard library. Along with this, looking at the data, the measured values always lie between -99.9 and 99.9 with a single decimal digit. Therefore, if we represent them as integers in the range -999 to 999 instead of floats, it will make the parsing, and later the calculations, faster down the line. This approach is called scaled integers, and it has been popular for a long time. I am especially interested in this last part: since all modern processors have FPUs, the switch to integers should not have a very big impact on the computation side of things.
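To make the representation concrete, here is a tiny self-contained sketch (the values are mine, purely for illustration): the temperature is stored as tenths of a degree in an int, all aggregation stays integer, and the division by 10 happens once at output time.

```cpp
#include <cstdio>

int main() {
    // "-12.3" parsed as a scaled integer: -123 (tenths of a degree).
    const int scaled = -123;

    // min/max/sum/count all stay in integer land during aggregation...
    long long sum = scaled;
    const int count = 1;

    // ...and the conversion back to a real value happens once, when printing.
    std::printf("mean = %.1f\n", sum / (10.0 * count));  // prints: mean = -12.3
}
```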
However, as the age-old saying goes: measure twice, cut once. So I want to test all these approaches before making any changes to my code.
Micro Benchmarking
I created a very small benchmark suite for all the things I wanted to test (a sketch of the suite follows the list below). It does three things:
- Create the data for the benchmark (10M float strings).
- Create benchmarks for `stof` and `from_chars`.
- Finally, the "scaled integer" implementation.
In this implementation, I make use of the fact that the numbers can be one of the following:
- -xx.x
- -x.x
- xx.x
- x.x
So, after we skip the sign, the '.' is going to be at either position 1 or position 2, and depending on that we know how to assemble the scaled int.
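The full suite lives in the source for this post; below is a minimal sketch of its shape using Google Benchmark. The data generation, the RNG seed, and the benchmark names (`BM_stof`, `BM_from_chars`, `BM_scaled_int`) are my own placeholders; the scaled-integer branch mirrors the position-of-the-dot logic described above.

```cpp
#include <benchmark/benchmark.h>

#include <charconv>
#include <cstdlib>
#include <random>
#include <string>
#include <vector>

// 10M strings in the same format as the data set: -99.9 .. 99.9, one decimal digit.
std::vector<std::string> float_strings;

static void generate_data() {
    if (!float_strings.empty()) return;
    std::mt19937 rng{42};
    std::uniform_int_distribution<int> dist(-999, 999);
    float_strings.reserve(10'000'000);
    for (int i = 0; i < 10'000'000; ++i) {
        const int v = dist(rng);
        const int a = std::abs(v);
        std::string s = (v < 0) ? "-" : "";
        s += std::to_string(a / 10) + "." + std::to_string(a % 10);
        float_strings.push_back(std::move(s));
    }
}

static void BM_stof(benchmark::State& state) {
    generate_data();
    for (auto _ : state) {
        float sum = 0.0f;
        for (const auto& s : float_strings) sum += std::stof(s);
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK(BM_stof);

static void BM_from_chars(benchmark::State& state) {
    generate_data();
    for (auto _ : state) {
        float sum = 0.0f;
        for (const auto& s : float_strings) {
            float v = 0.0f;
            std::from_chars(s.data(), s.data() + s.size(), v);
            sum += v;
        }
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK(BM_from_chars);

static void BM_scaled_int(benchmark::State& state) {
    generate_data();
    for (auto _ : state) {
        long long sum = 0;
        for (const auto& s : float_strings) {
            const int off = (s[0] == '-') ? 1 : 0;
            const int sign = (s[0] == '-') ? -1 : 1;
            int value;
            if (s[off + 1] == '.')   // x.x
                value = (s[off] - '0') * 10 + (s[off + 2] - '0');
            else                     // xx.x
                value = ((s[off] - '0') * 10 + (s[off + 1] - '0')) * 10 + (s[off + 3] - '0');
            sum += sign * value;
        }
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK(BM_scaled_int);

BENCHMARK_MAIN();
```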
The results were not even close: the scaled-integer parser beat the other two implementations easily. For the next benchmark, I wanted to see whether FP calculations were also contributing to the slowdown (even though the parsing is already so much faster that I doubt it will make a difference).
Here are the FP benchmarks (sketched below):
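A minimal sketch of what these benchmarks might look like: a shared setup function filling a float vector and an int vector with the same values, plus one plain-sum benchmark per type. The names (`BM_float_math`, `BM_int_math`), the sizes, and the workload are illustrative rather than the exact code from the post's source.

```cpp
#include <benchmark/benchmark.h>

#include <random>
#include <vector>

std::vector<float> floats;
std::vector<int> ints;

// Fill both vectors with the same 10M values: ints hold tenths, floats hold the real value.
static void generate_numbers() {
    if (!floats.empty()) return;
    std::mt19937 rng{42};
    std::uniform_int_distribution<int> dist(-999, 999);
    floats.reserve(10'000'000);
    ints.reserve(10'000'000);
    for (int i = 0; i < 10'000'000; ++i) {
        const int v = dist(rng);
        ints.push_back(v);
        floats.push_back(v / 10.0f);
    }
}

// Aggregate over floats (a stand-in for the min/max/sum work done per measurement).
static void BM_float_math(benchmark::State& state) {
    generate_numbers();
    for (auto _ : state) {
        float sum = 0.0f;
        for (float f : floats) sum += f;
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK(BM_float_math);

// The same aggregation over scaled integers.
static void BM_int_math(benchmark::State& state) {
    generate_numbers();
    for (auto _ : state) {
        long long sum = 0;
        for (int v : ints) sum += v;
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK(BM_int_math);

BENCHMARK_MAIN();
```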
Looks like I have some reading to do at this point 😅
Fixing the value parsing
From the above benchmarks, it is clear that the value parsing section of the code needs a major change.
> #include <cstdint>
> using u64 = uint64_t;
>
> u64 parse_value(const std::string& str) {
> int off = (str[0] == '-') ? 1 : 0;
> int sign = (str[0] == '-') ? -1 : 1;
>
> int value;
> if (str[off + 1] == '.') {
> value = (str[off] - '0') * 10 + (str[off + 2] - '0');
> } else {
> value = ((str[off] - '0') * 10 + (str[off + 1] - '0')) * 10 +
> (str[off + 3] - '0');
> }
> return sign * value;
> }
>
< float min;
< float max;
< int count;
< float sum;
> u64 min;
> u64 max;
> u64 count;
> u64 sum;
< Measurement(float value) : min(value), max(value), count(1), sum(value) {}
< inline void update(float value) {
> Measurement(u64 value) : min(value), max(value), count(1), sum(value) {}
> inline void update(u64 value) {
<
< static inline float parse_value(const std::string& to_parse) {
< return std::stof(to_parse);
< }
Performance counter stats for './2':
156,116.37 msec task-clock:u # 0.997 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
513 page-faults:u # 3.286 /sec
537,873,827,732 instructions:u # 1.80 insn per cycle
# 0.12 stalled cycles per insn (35.71%)
299,303,987,541 cycles:u # 1.917 GHz (35.71%)
62,477,181,732 stalled-cycles-frontend:u # 20.87% frontend cycles idle (35.71%)
108,629,707,400 branches:u # 695.825 M/sec (35.71%)
2,616,411,954 branch-misses:u # 2.41% of all branches (35.72%)
346,101,561,152 L1-dcache-loads:u # 2.217 G/sec (35.72%)
5,471,897,627 L1-dcache-load-misses:u # 1.58% of all L1-dcache accesses (35.72%)
528,608,055 L1-icache-loads:u # 3.386 M/sec (35.71%)
37,194,524 L1-icache-load-misses:u # 7.04% of all L1-icache accesses (35.72%)
1,722,913,805 dTLB-loads:u # 11.036 M/sec (35.71%)
935,347 dTLB-load-misses:u # 0.05% of all dTLB cache accesses (35.72%)
2,185,510 iTLB-loads:u # 13.999 K/sec (35.72%)
75,753 iTLB-load-misses:u # 3.47% of all iTLB cache accesses (35.72%)
1,127,256,983 L1-dcache-prefetches:u # 7.221 M/sec (35.71%)
156.633191854 seconds time elapsed
149.286329000 seconds user
4.655501000 seconds sys
Woo hoo! The runtime decreased by about 65 seconds! Let's observe some differences in this run:
- The IPC got worse (1.80 vs 2.78 insn per cycle), but the instruction count went down, which means the codegen became much more terse (makes sense: parsing strings to floats is hard, and we no longer do it).
- Frontend cycles are still stalled a lot; memory is still the bottleneck.
Time for more hotspot flamegraphs:
The following observations can be made:
- `parse_value` went down to 6% of the total runtime!
- `std::getline` is still taking 21% of the total runtime.
- In `_M_find_before_node` in the `unordered_map`, I can see 8% of total CPU time being spent on `operator==`. This means our hash map is probing, and we have too many hash collisions.
- The keen-eyed might already have noticed `__memchr_evex`, which is used to find the line boundaries.
In addition to this, the old problem still exists: the code does a lot of string allocations in the hot code path. Essentially, for a city that is already in the map, every time we encounter that city, we:
- Allocate a string for this city
- Look up in the map
- If not found, push the city into the map
- If found, discard the city, and update the value
Even worse, every "measurement" we parse out of the line requires its own allocation, which unconditionally gets discarded.
One solution would be to read the entire file into a buffer and then operate on pointers and offsets into that buffer. However, the file is a whopping 16GB, and I am not making any assumptions about available memory for this challenge.
mmap, string_view and custom parsing
Using `mmap` in the code will have multiple benefits:
- It reduces time spent copying data from kernel space to user space. Although this is < 5s right now, it is better to get it out of the way (and we will see the impact later).
- We can map the file once and use `string_view` to index into the mapping, instead of allocating strings.
- Apparently `getline` does a lot of bounds checking, locale handling, and iostream bookkeeping. I am not interested in any of that. If I write a custom parser, I can work with the raw UTF-8 data as plain char sequences, scanning until I hit a ';' or a '\n'.
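To make the first two points concrete, here is a minimal sketch of mapping the file read-only and exposing it as a single `string_view`. Error handling and `munmap` are omitted, and the helper name `map_file` is mine, not from the final implementation.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <string_view>

// Map the whole file read-only and expose it as one big string_view.
// All "strings" (city names, values) then become zero-copy slices of this view.
static std::string_view map_file(const char* path) {
    const int fd = open(path, O_RDONLY);
    struct stat st {};
    fstat(fd, &st);
    void* data = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping remains valid after the descriptor is closed
    return {static_cast<const char*>(data), static_cast<size_t>(st.st_size)};
}
```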
Using `mmap` and `string_view` is one thing, but the parser is another thing entirely. To parse a single line, I wanted to benchmark three implementations: one with `memchr`, one with a simple hand-rolled loop, and one with SWAR. The SWAR version has the additional benefit that I can compute the hash of the city name "on the fly" while finding its bounds. But enough talk.
Here are the 3 respective benchmarks: