1 Billion Rows Challenge - Chapter 1
I stumbled upon Gunnar Morling's One Billion Row Challenge and couldn't resist giving it a shot. The premise is straightforward: you get a text file with a billion lines, each line has a city name and a temperature reading. The task is to print every city (sorted alphabetically) along with its minimum, average, and maximum temperature across all measurements. Here's a sample of what the input looks like:
...
London;10.0
New York;5.0
London;-1.0
...
And yes, there are a billion lines. Sounds easy, right? The twist is speed. The fastest solution clocks in at 1.535 seconds on a Hetzner AX161. Morling’s own baseline runs in 4:49 minutes on the same hardware. The challenge was aimed at Java folks, but I wanted to see what I could do with modern C++.
Wait, aren’t you a Rustacean?
Always good to pick up new languages. My last serious C++ project was irpam_cpp, an open-source IR camera authentication tool for Linux. That was more about functionality than raw performance. With this challenge, I’m diving into C++ optimization and getting a closer look at how the STL ticks.
Maybe I’ll try a Rust version later 😉
Here’s how I’m approaching this series:
- Write the code and measure its speed
- Analyze performance bottlenecks
- Micro-benchmark the hot spots
- Apply fixes and optimizations
- Repeat until satisfied
Step 1: Implementing a baseline
The first step in this challenge is to actually write a C++ baseline. I implemented a very simple version, which you can find below, where I try to write idiomatic code with a couple of optimizations in mind:
- Optimize for readability of the code first of all. Unreadable code is unmaintainable code, and improving it later will be hard.
- Pre-allocate an `unordered_map` to reduce rehashing (and avoid `map`, which uses a red-black tree under the hood and is much slower in this case).
And that's about it. Pretty dry. Just looking at this first draft, I already know there are a lot of improvements to be made:
- Excessive string allocation in the hot loop: `city` and `measurement`
- Inefficient float parsing (use of `stof`)
- No support for parallelism or concurrency
- Not data-specific
The last one is pretty heavy - I already know the structure of the data file, its size, and so on. So the code should try to exploit that fact as much as possible - optimize the code around the data. However, all those optimizations will come later with their own benchmarks. This will suffice for now.
C++ Baseline Implementation
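Here is a minimal sketch of what this baseline looks like; the `Stats` struct, the reserve size, and the output formatting details are my own choices rather than the exact original listing:

```cpp
#include <algorithm>
#include <fstream>
#include <iostream>
#include <limits>
#include <map>
#include <string>
#include <unordered_map>

struct Stats {
    float min = std::numeric_limits<float>::max();
    float max = std::numeric_limits<float>::lowest();
    double sum = 0.0;
    std::size_t count = 0;
};

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    std::ifstream file(argv[1]);

    // Pre-allocate buckets up front; the dataset has at most ~10k distinct cities.
    std::unordered_map<std::string, Stats> stats;
    stats.reserve(1 << 14);

    std::string line;
    while (std::getline(file, line)) {
        // Each line looks like "City;-12.3".
        const std::size_t sep = line.find(';');
        const std::string city = line.substr(0, sep);               // allocates in the hot loop
        const float measurement = std::stof(line.substr(sep + 1));  // robust but slow

        Stats& s = stats[city];
        s.min = std::min(s.min, measurement);
        s.max = std::max(s.max, measurement);
        s.sum += measurement;
        s.count += 1;
    }

    // std::map gives the alphabetical ordering required for the output.
    std::map<std::string, Stats> sorted(stats.begin(), stats.end());
    for (const auto& [city, s] : sorted) {
        std::cout << city << '=' << s.min << '/' << s.sum / s.count << '/'
                  << s.max << '\n';
    }
    return 0;
}
```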
Benchmark conditions
Here are my laptop's specs that I will use to run all benchmarks. In addition, all benchmarks will start with the measurements file evicted from the page cache, all Google Benchmark binaries are built in release mode, and CPU frequency scaling will be disabled, with the governor set to performance mode by default.
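For reference, those conditions can be set up on Linux along these lines (a sketch; exact tooling varies by distro):

```sh
# Drop the page cache so the measurements file is read cold
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

# Pin the CPU frequency governor to performance mode
sudo cpupower frequency-set -g performance
```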
Step 2: Compiling and perf analysis
Compilation
I tried to keep the set of compilation flags as small as possible. Here are the flags I will be using across all benchmarks:
- `-Ofast` to make the code go brrr
- `-std=c++17` to use modern C++ standards. I chose C++17 for no particular reason other than to use `string_view` (more on that later).
- `-g` to include debug symbols
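Put together, the full compile command looks something like this (the compiler choice and file names here are assumptions):

```sh
g++ -Ofast -std=c++17 -g main.cpp -o 1brc
```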
`time`, `perf` and Friends
Since running perf on the binary introduces some overhead due to periodic sampling and stack unwinding, I will first run the generated binary with `time`, native to my shell (fish). Then, I will run `perf` to generate a detailed report on the code, showing the hot paths and memory characteristics.
The measurements file itself is pretty big - around 15G! On this code, the run clocked in at roughly 96 seconds of wall-clock time.
The implementation is already much faster than Morling's Java baseline. Time to retire and become a professional speedrunner, right? Well, not so fast.
You can't optimize what you can't measure (unless you enjoy flying blind and writing code that only impresses your girlfriend, and/or your cat). So, to get some real numbers, I fired up `perf stat`.
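An invocation along these lines, with `-d` requesting the detailed counter set (binary and file names here are placeholders):

```sh
perf stat -d ./1brc measurements.txt
```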
One line from the counter output stood out: `# 0.07 stalled cycles per insn (35.69%)`
These detailed performance counters give a rough idea of the areas of potential improvement.
- Starting out, we can see the program ran on a single CPU core (not fully utilized, mind you), with 0 context-switches and cpu-migrations, which means our program stayed on the same CPU core the entire time it was running. This is nice.
- There are 450 page faults. We are using buffered IO here, which implicitly fetches pages while reading a chunk of the file, so for a 15G file this page-fault count is trivial.
- 1.2B instructions 😵💫 This is way too high a number of instructions for a task this simple. But again, we are doing a lot of redundant work here.
- 20.29% stalled CPU frontend cycles. This means the CPU could not fetch and decode instructions fast enough to keep the backend fed. This can also be seen in the very high L1-icache miss rate of 6.67%: the instructions the CPU wanted to execute could not be found in the L1 cache, and it had to go further out in the memory hierarchy to fetch them!
- I am not worried about branch misses (for now). They could be much higher, but I will hopefully, eventually, get around to optimizing the branch misses.
- Syscalls consumed ~5s of the total time
The next step will be to run `perf` again, but this time to record the program execution itself. For that, I used `perf record`.
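With frame-pointer call graphs (hence the "without dwarf" note below), a minimal invocation looks like this (binary and file names are placeholders):

```sh
perf record -g ./1brc measurements.txt
perf report   # or open the resulting perf.data in hotspot
```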
A flamegraph without dwarf
The tool I use these days to analyze perf data is hotspot. It is an excellent profiler + visualizer because of the built-in flamegraphs, disassembly, and caller-callee graphs.
This is the flamegraph I obtained from hotspot and Brendan Gregg's Flamegraph tools. You can right-click on the image, open it in a new tab and play around with the flamegraph to see how much time each call is spending.
Here are the main takeaways:
- I love flamegraphs. Ever since my friend Pravesh introduced them to me, I've been using them to quickly visualize my code performance. Brendan is truly great.
- The code spends 33% of its total time in `__memchr_evex`. This is a SIMD-optimized string matching function used to quickly find characters within a string (or bytes within memory, hence memchr). This is a great thing in itself, since the C library is already handing us vectorized code. If you open the flamegraph in a new tab, hit "Ctrl+F" (or whatever Mac uses, or click "Search") and search for "__memchr", the flamegraph blocks will actually highlight themselves! How great is that!!
- `std::stof` is using a lot of CPU cycles - a whopping 31% of all! This needs to be optimized.
- In the hashtable's `find_before_node` function, the call to `std::__detail::_Hashtable_base<…>::_M_equals(std::string const&, unsigned long, std::__detail::_Hash_node_value<…> const&) const` is taking up 11% of the total CPU cycles! This means we have a lot of hash collisions. Our hashtable's hashing function needs to be optimized.
- `std::getline` is taking up 10.7% of the total time!
Step 3: Baby's first benchmark and optimization
Phew. That was a lot of work so far! Now that we know the root causes of the slowness, we can actually begin to optimize the code. We cannot simply rewrite vast swathes of it; every change needs to be deliberate and measured. To accomplish that, I will be using a little library called Google Benchmark.
Optimizing float-parsing
Parsing floats with `std::stof` is slow mainly due to its robustness: it validates inputs and handles edge cases, which adds overhead (I don't want to get into `setlocale` right now, btw). Since the challenge data is predictable and well-formed, we can use a custom float parser tailored to our input format for better performance. Alternatives like `std::from_chars` offer faster, locale-independent parsing.
Another approach is scaled integers, which avoids floating-point arithmetic entirely by representing values as integers. Since all our values lie between -99.9 and 99.9, I can just multiply them by 10 and work with integers in the range -999 to 999. That way, I avoid expensive FP computation. I'll benchmark all these methods below; check the code details for implementation specifics.
The argument here is that the input can only take four forms:
- -xx.x
- -x.x
- xx.x
- x.x
After we process the leading '-', the remaining string is either 4 or 3 characters long. Then we can simply check whether the '.' comes after one digit or two, and parse accordingly, as in the sketch below.
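Here is a minimal sketch of that parser, returning a scaled integer directly (the function name and signature are my own):

```cpp
// Parse "-99.9".."99.9" into a scaled integer in [-999, 999].
// Assumes well-formed input with exactly one fractional digit.
static inline int parse_scaled(const char* s) {
    int sign = 1;
    if (*s == '-') {
        sign = -1;
        ++s;
    }
    // Remaining string is "x.x" (3 chars) or "xx.x" (4 chars).
    int value;
    if (s[1] == '.') {
        // x.x
        value = (s[0] - '0') * 10 + (s[2] - '0');
    } else {
        // xx.x
        value = (s[0] - '0') * 100 + (s[1] - '0') * 10 + (s[3] - '0');
    }
    return sign * value;
}
```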
Benchmark code
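As a sketch, a Google Benchmark harness for the three parsing variants might look like this (the corpus here is a tiny stand-in for strings sampled from the measurements file, and `parse_scaled` is the parser sketched above):

```cpp
#include <benchmark/benchmark.h>
#include <charconv>  // std::from_chars (float support needs a recent libstdc++)
#include <string>
#include <vector>

static const std::vector<std::string> float_strings = {"12.3", "-4.5", "99.9", "-0.1"};

static void BM_Stof(benchmark::State& state) {
    for (auto _ : state)
        for (const auto& s : float_strings)
            benchmark::DoNotOptimize(std::stof(s));
}
BENCHMARK(BM_Stof);

static void BM_FromChars(benchmark::State& state) {
    for (auto _ : state)
        for (const auto& s : float_strings) {
            float f = 0.0f;
            std::from_chars(s.data(), s.data() + s.size(), f);
            benchmark::DoNotOptimize(f);
        }
}
BENCHMARK(BM_FromChars);

static void BM_CustomScaled(benchmark::State& state) {
    for (auto _ : state)
        for (const auto& s : float_strings)
            benchmark::DoNotOptimize(parse_scaled(s.c_str()));  // parser sketched earlier
}
BENCHMARK(BM_CustomScaled);

BENCHMARK_MAIN();
```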
All modern CPUs contain FPUs. However, floating-point computations are still generally slower than integer-based computations.
Benchmarking FP vs Integer computation speed
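A sketch of that comparison, summing a pre-generated vector of each type (vector names match the original; sizes and values are my own):

```cpp
#include <benchmark/benchmark.h>
#include <numeric>
#include <vector>

static std::vector<float> floats(1'000'000, 1.5f);
static std::vector<int> ints(1'000'000, 15);  // same values, scaled by 10

static void BM_FloatSum(benchmark::State& state) {
    for (auto _ : state) {
        float sum = std::accumulate(floats.begin(), floats.end(), 0.0f);
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK(BM_FloatSum);

static void BM_IntSum(benchmark::State& state) {
    for (auto _ : state) {
        long long sum = std::accumulate(ints.begin(), ints.end(), 0LL);
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK(BM_IntSum);

BENCHMARK_MAIN();
```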
After running all benchmarks, the scaled-integer approach came out far ahead of every float-parsing variant.
It's a wipeout. Not even close.
Making the code changes
Now that I know scaled ints work, I rewrote the original program to use them and timed it again.
Rewritten code with scaled integers
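Here's a sketch of the shape of that change; only the record type and the parse call differ from the baseline (printing divides the scaled values back by 10.0):

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>

using u64 = uint64_t;

// Scaled representation: -99.9..99.9 is stored as an integer in -999..999.
struct Stats {
    int min = 999;
    int max = -999;
    int64_t sum = 0;  // divide by 10.0 * count when printing the average
    u64 count = 0;
};

// The hot-loop update, with std::stof replaced by the scaled-integer parser:
inline void update(std::unordered_map<std::string, Stats>& stats,
                   const std::string& city, int measurement) {
    Stats& s = stats[city];
    s.min = std::min(s.min, measurement);
    s.max = std::max(s.max, measurement);
    s.sum += measurement;
    s.count += 1;
}
```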
With the switch to scaled integers, the runtime dropped from 96.11 seconds to 66.9 seconds, a solid 30-second improvement. That's a huge win for a single change. If scaling across all 20 CPU cores were perfect, I'd be within striking distance of the leaderboard. Of course, real-world scaling is never perfect, but this is a promising start. There's still plenty of room to optimize, so stay tuned for more deep dives and speedups. This final code will be the starting point for the next optimization, and so on.