1 Billion Rows Challenge

Sep 5, 2025
1440 words · 8 min read

I recently stumbled upon the "One Billion Rows Challenge"—an intriguing test of speed and efficiency. Although the official contest had already wrapped up, I couldn't resist giving it a shot myself. Armed with my growing knowledge of modern C++, I set out to see just how quickly I could process a billion rows. This post is a personal log of that journey: what I tried, what failed, what succeeded, and the final results.

Base case and measurements

Code

Remembering a relevant quote ("something something premature optimization, evil something"), I started off the simple way: just writing the code. We can always optimize it later. This baseline is readable, understandable and maintainable. The entire code outline is written below (with the actual code living in the source for this blog post).

#include <algorithm>
#include <cassert>
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

#define NUM_CITIES 10000

struct Measurement {
  float min;
  float max;
  int count;
  float sum;

  Measurement() : min(0), max(0), count(0), sum(0) {}
  Measurement(float value) : min(value), max(value), count(1), sum(value) {}
  inline void update(float value) {
    min = std::min(min, value);
    max = std::max(max, value);
    count += 1;
    sum += value;
  }
};

static inline float parse_value(const std::string& to_parse) {
    return std::stof(to_parse);
}

int main() {
  std::ifstream file("/home/sbhusal/temp/1brc/solutions/measurements.txt");
  std::unordered_map<std::string, Measurement> measurements(10000);

  std::string line;

  while (std::getline(file, line)) {
    auto city_off = line.find(';');
    auto city = line.substr(0, city_off);
    auto measurement = line.substr(city_off + 1, line.size());

    auto parsed = parse_value(measurement);

    auto [it, inserted] = measurements.try_emplace(city, parsed);
    if (!inserted) {
      it->second.update(parsed);
    }
  }

  std::vector<std::pair<std::string, Measurement>> ordered(measurements.begin(),
                                                           measurements.end());
  std::sort(ordered.begin(), ordered.end(),
            [](const std::pair<std::string, Measurement> &first,
               const std::pair<std::string, Measurement> &second) {
              return first.first < second.first;
            });

  std::cout << "{";

  for (size_t i = 0; i < ordered.size(); ++i) {
    const auto &[city, measured] = ordered[i];
    std::cout << city << ":" << measured.min << "/"
              << (measured.sum / measured.count) << "/" << measured.max;
    if (i != ordered.size() - 1) {
      std::cout << ",";
    }
  }

  std::cout << "}";

  return 0;
}

I have tried to follow some best practices in this code; however, some things immediately stand out:

  • Excessive string allocation in hot loop
  • Inefficient parsing
  • Hardcoded measurements file location
  • No error handling for file operations
  • No support for parallelism or concurrency

There are, of course, many more issues lurking beneath the surface. We will get to those soon.
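As a quick illustration of the missing error handling from the list above, a minimal sketch (the helper name and path are mine, not from the original code):

```cpp
#include <fstream>
#include <iostream>
#include <optional>

// Return the stream only if the file actually opened; a sketch of the
// check the baseline skips entirely.
std::optional<std::ifstream> open_measurements(const char* path) {
  std::ifstream file(path);
  if (!file.is_open()) {
    std::cerr << "failed to open " << path << '\n';
    return std::nullopt;
  }
  return file;
}
```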

Timings and Measurements

I ran this code on a 1-billion row file generated by the script in the original challenge repo. This is the timing I observed:

Executed in  219.02 secs    fish           external
   usr time  213.70 secs  549.00 micros  213.70 secs
   sys time    4.22 secs  461.00 micros    4.22 secs

Remarkably, the top entries in the challenge completed this task in less than 2 seconds. That means my baseline implementation is over 100 times slower! Even if I could perfectly parallelize the workload across all 20 CPU cores—ignoring the realities of Amdahl's Law—it would still take more than 10 seconds. Clearly, there’s significant room for optimization. In order to improve anything, metrics need to exist. So that's exactly what we're going to do next - measure.
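For intuition on that ceiling: Amdahl's Law bounds the speedup at 1/((1 - p) + p/n) for a parallel fraction p on n cores. A quick sketch, with p = 0.95 as an illustrative guess rather than a measurement:

```cpp
// Amdahl's Law: the best possible speedup when a fraction p of the
// work is parallelized across n cores.
constexpr double amdahl(double p, int n) {
  return 1.0 / ((1.0 - p) + p / n);
}

// Even at p = 0.95 (an assumed fraction), 20 cores cap out near 10x,
// so a ~220s baseline could not drop much below ~21s by threading alone.
static_assert(amdahl(0.95, 20) > 10.0 && amdahl(0.95, 20) < 10.5);
```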

Tools and methodology

I'm using Arch, btw, which means I get to leverage the excellent perf utility to analyze performance bottlenecks in my code. I've been working with the perf+hotspot profiler combination, and they integrate seamlessly out of the box. This eliminates the hassle of writing custom scripts to generate flamegraphs and removes the complexity of analyzing per-thread performance and activity. For the remainder of this post, I'll follow this systematic methodology:

  • Measure time and record execution with perf
  • Analyze results in hotspot
  • Create targeted micro-benchmarks with Google Benchmark
  • Replace performance-critical hotspots with optimized implementations
  • Iterate

Since I've already completed the first step with time, it's time (pun intended) to dive into the profiling phase.

Performance profiling with perf and hotspot

I started by running perf stat on the baseline implementation:

Performance counter stats for './1':

        221,621.88 msec task-clock:u                     #    1.000 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
               459      page-faults:u                    #    2.071 /sec                      
 1,196,244,505,537      instructions:u                   #    2.78  insn per cycle            
                                                  #    0.07  stalled cycles per insn     (35.71%)
   430,775,078,819      cycles:u                         #    1.944 GHz                         (35.71%)
    87,956,512,926      stalled-cycles-frontend:u        #   20.42% frontend cycles idle        (35.72%)
   245,403,942,423      branches:u                       #    1.107 G/sec                       (35.71%)
     3,325,571,282      branch-misses:u                  #    1.36% of all branches             (35.71%)
   612,676,868,580      L1-dcache-loads:u                #    2.765 G/sec                       (35.71%)
     5,006,231,868      L1-dcache-load-misses:u          #    0.82% of all L1-dcache accesses   (35.72%)
       680,540,330      L1-icache-loads:u                #    3.071 M/sec                       (35.71%)
        33,346,573      L1-icache-load-misses:u          #    4.90% of all L1-icache accesses   (35.72%)
     1,870,232,874      dTLB-loads:u                     #    8.439 M/sec                       (35.72%)
           386,333      dTLB-load-misses:u               #    0.02% of all dTLB cache accesses  (35.72%)
           990,446      iTLB-loads:u                     #    4.469 K/sec                       (35.71%)
            47,933      iTLB-load-misses:u               #    4.84% of all iTLB cache accesses  (35.71%)
       998,072,313      L1-dcache-prefetches:u           #    4.503 M/sec                       (35.71%)

     221.726261575 seconds time elapsed

     215.541935000 seconds user
       3.689600000 seconds sys

There are a lot of things that stand out in this report:

  • Total time taken was 221.726s, with a full CPU utilized.
  • The task did not migrate between CPUs, i.e. the scheduler did not move our task
  • The number of page faults is trivial. The file itself is 15G in size, so 459 page faults are nothing
  • The instruction count is huge.
  • A fifth of frontend cycles were idle, so the CPU spent a fair amount of time waiting for work to arrive. The IPC itself is good, which means that when the CPU did have instructions to execute, the code exhibited good ILP.

As such, the code looks... mostly okay at the system level. Which means the issue is not in the environment but in how I wrote the code itself. Time to dig deeper. For this, we will record the execution again with perf record, and analyze the flamegraph with hotspot.

(Flamegraph: initial implementation)

In the flamegraph, it is evident that a large chunk of the total time is spent in the parse_value function - almost 40%. After that, the second most expensive call is getline. This code reads the file line-by-line, and each read triggers a copy from kernel space to user space. This can be reduced using mmap, which I will benchmark later.

Looking for faster float parsing, I found from_chars in the standard library. Beyond that, looking at the data, the measured values always lie between -99.9 and 99.9 with one decimal place. If we represent them as integers in the range -999 to 999 instead of floats, parsing, and later the arithmetic, gets faster down the line. This approach, called scaled integers, has been popular for a long time. I was especially curious about this last approach, since all modern processors have FPUs and I didn't expect floating-point math itself to have a big impact on the computation side of things.
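As a reference point for the from_chars API, a minimal sketch (the helper below is mine; note that the floating-point overloads live in &lt;charconv&gt; and need a fairly recent standard library):

```cpp
#include <cassert>
#include <charconv>
#include <string_view>

// std::from_chars: locale-free, exception-free, allocation-free parsing.
float parse_float(std::string_view sv) {
  float value = 0.0f;
  auto result = std::from_chars(sv.data(), sv.data() + sv.size(), value);
  assert(result.ec == std::errc());  // a sketch; real code would handle failure
  return value;
}
```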

However, as the age-old saying goes, measure twice, cut once. So I want to test all these approaches before making any changes to my code.

Micro Benchmarking

I created a small benchmark suite for everything I wanted to test. Here is the complete benchmark:

  1. Create the data for the benchmark (10M float strings):
#include <benchmark/benchmark.h>
#include <iomanip>
#include <random>
#include <sstream>
#include <string>
#include <vector>

std::vector<std::string> float_strings;

static void SetupFloatStrings() {
    static bool initialized = false;
    if (initialized) return;
    initialized = true;

    std::mt19937 rng(42);
    std::uniform_real_distribution<float> dist(-99.9f, 99.9f);

    float_strings.reserve(10000000);
    for (size_t i = 0; i < 10000000; ++i) {
        float val = dist(rng);
        std::ostringstream oss;
        oss << std::fixed << std::setprecision(1) << val;
        float_strings.push_back(oss.str());
    }
}
  2. Create benchmarks for stof and from_chars:
void bench_stof(benchmark::State& state) {
    SetupFloatStrings();

    size_t index = 0;
    float sum = 0;
    for (auto _ : state) {
        const std::string& str = float_strings[index % float_strings.size()];
        sum += std::stof(str);
        index++;
    }
    benchmark::DoNotOptimize(sum);
}

void bench_from_chars(benchmark::State& state) {
    SetupFloatStrings();

    size_t index = 0;
    float sum = 0;
    for (auto _ : state) {
        const std::string& str = float_strings[index % float_strings.size()];
        float value;
        std::from_chars(str.data(), str.data() + str.size(), value);
        sum += value;
        index++;
    }
    benchmark::DoNotOptimize(sum);
}
  3. Finally, the "scaled integer" implementation.

In this implementation, I make use of the fact that the numbers can be one of the following:

  • -xx.x
  • -x.x
  • xx.x
  • x.x

So, after we remove the sign, the '.' is going to either be in position 1, or 2, and depending on that, we can scale the int.

void bench_scalar_int(benchmark::State& state) {
    SetupFloatStrings();

    size_t index = 0;
    long sum = 0;
    for (auto _ : state) {
        const std::string& str = float_strings[index % float_strings.size()];
        int off = (str[0] == '-') ? 1 : 0;
        int sign = (str[0] == '-') ? -1 : 1;

        int value;
        if (str[off + 1] == '.') {
          value = (str[off] - '0') * 10 + (str[off + 2] - '0');
        } else {
          value = ((str[off] - '0') * 10 + (str[off + 1] - '0')) * 10 +
                  (str[off + 3] - '0');
        }
        sum += sign * value;
        index++;  // step through the dataset like the other benchmarks
    }
    benchmark::DoNotOptimize(sum);
}

And here are the results from those benchmarks:

 ./build/bench1
2025-09-20T15:17:56-04:00
Running ./build/bench1
Run on (20 X 2000.51 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x10)
  L1 Instruction 32 KiB (x10)
  L2 Unified 1024 KiB (x10)
  L3 Unified 16384 KiB (x2)
Load Average: 1.31, 1.41, 1.14
-----------------------------------------------------------
Benchmark                 Time             CPU   Iterations
-----------------------------------------------------------
bench_from_chars       31.4 ns         31.2 ns     22207030
bench_stof             87.8 ns         87.3 ns      7992391
bench_scalar_int       2.65 ns         2.64 ns    271563776

It's not even close: scaled integers easily beat the other two implementations at parsing. For the next benchmark, I wanted to see if FP calculations were causing a slowdown (even though parsing is already so much faster that I doubt it will make a difference).

Here's the FP benchmarks:

std::vector<float> floats;
std::vector<int> ints;

static void SetupNums() {
    static bool initialized = false;
    if (initialized) return;
    initialized = true;

    floats.reserve(10000000);
    ints.reserve(10000000);
    for (size_t i = 0; i < 10000000; ++i) {
        floats.push_back(static_cast<float>(i) / 10.0f);
        ints.push_back(static_cast<int>(i));
    }
}
static void sum_ints(benchmark::State& state) {
    SetupNums();
    for (auto _ : state) {
        long sum = 0;
        for (const auto& i : ints) {
            sum += i;
        }
        benchmark::DoNotOptimize(sum);
    }
}
static void sum_floats(benchmark::State& state) {
    SetupNums();
    for (auto _ : state) {
        double sum = 0;
        for (const auto& f : floats) {
            sum += f;
        }
        benchmark::DoNotOptimize(sum);
    }
}
Load Average: 4.08, 3.13, 2.13
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
sum_ints      5109059 ns      5068193 ns          138
sum_floats   15250422 ns     15131541 ns           46

Looks like I have some reading to do at this point 😅

Fixing the value parsing

From the above benchmarks, it is clear that the value parsing section of the code needs a major change.

2a3
> #include <cstdint>
10a12,27
> using i64 = int64_t;
> 
> i64 parse_value(const std::string& str) {
>   int off = (str[0] == '-') ? 1 : 0;
>   int sign = (str[0] == '-') ? -1 : 1;
> 
>   int value;
>   if (str[off + 1] == '.') {
>     value = (str[off] - '0') * 10 + (str[off + 2] - '0');
>   } else {
>     value = ((str[off] - '0') * 10 + (str[off + 1] - '0')) * 10 +
>             (str[off + 3] - '0');
>   }
>   return sign * value;
> }
> 
12,15c29,32
<   float min;
<   float max;
<   int count;
<   float sum;
---
>   i64 min;
>   i64 max;
>   i64 count;
>   i64 sum;
18,19c35,36
<   Measurement(float value) : min(value), max(value), count(1), sum(value) {}
<   inline void update(float value) {
---
>   Measurement(i64 value) : min(value), max(value), count(1), sum(value) {}
>   inline void update(i64 value) {
26,29d42
< 
< static inline float parse_value(const std::string& to_parse) {
<     return std::stof(to_parse);
< }

 Performance counter stats for './2':

        156,116.37 msec task-clock:u                     #    0.997 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
               513      page-faults:u                    #    3.286 /sec                      
   537,873,827,732      instructions:u                   #    1.80  insn per cycle            
                                                  #    0.12  stalled cycles per insn     (35.71%)
   299,303,987,541      cycles:u                         #    1.917 GHz                         (35.71%)
    62,477,181,732      stalled-cycles-frontend:u        #   20.87% frontend cycles idle        (35.71%)
   108,629,707,400      branches:u                       #  695.825 M/sec                       (35.71%)
     2,616,411,954      branch-misses:u                  #    2.41% of all branches             (35.72%)
   346,101,561,152      L1-dcache-loads:u                #    2.217 G/sec                       (35.72%)
     5,471,897,627      L1-dcache-load-misses:u          #    1.58% of all L1-dcache accesses   (35.72%)
       528,608,055      L1-icache-loads:u                #    3.386 M/sec                       (35.71%)
        37,194,524      L1-icache-load-misses:u          #    7.04% of all L1-icache accesses   (35.72%)
     1,722,913,805      dTLB-loads:u                     #   11.036 M/sec                       (35.71%)
           935,347      dTLB-load-misses:u               #    0.05% of all dTLB cache accesses  (35.72%)
         2,185,510      iTLB-loads:u                     #   13.999 K/sec                       (35.72%)
            75,753      iTLB-load-misses:u               #    3.47% of all iTLB cache accesses  (35.72%)
     1,127,256,983      L1-dcache-prefetches:u           #    7.221 M/sec                       (35.71%)

     156.633191854 seconds time elapsed

     149.286329000 seconds user
       4.655501000 seconds sys

Woo hoo! The runtime decreased by a full 65 seconds! Let's observe some differences in this code:

  • The IPC got noticeably worse (1.80 vs 2.78), but the instruction count more than halved, so the generated code became much leaner (makes sense: parsing strings to floats is hard).
  • Frontend cycles are still stalled a lot, memory is still the bottleneck.

Time for more hotspot flamegraphs:

The following observations can be made:

  • parse_value went to 6% of the total runtime!
  • std::getline is still taking 21% of the total runtime.
  • In M_find_before_node in the unordered_map, I can see 8% of total CPU time being spent on operator==. This means our hash map is walking collision chains; we have too many hash collisions.
  • The keen-eyed might already have noticed __memchr_evex which is used to find the line boundaries.

In addition to this, the old problem still exists: the code does a lot of string allocations in the hot path. Essentially, every time we encounter a city that is already in the map, we:

  • Allocate a string for this city
  • Look up in the map
  • If not found, push the city into the map
  • If found, discard the city, and update the value

Even worse, every "measurement" we parse out of the line requires its own allocation, which is unconditionally discarded.

One solution would be to read the entire file into a buffer and then operate on pointers and offsets into that buffer. However, the file is a whopping 16GB, and I am not assuming it fits in RAM for this challenge.

mmap, string_view and custom parsing

Using mmap in the code will have multiple benefits:

  • Reduce time spent copying data from kernel to user space. Although this is < 5s right now, better to get it out of the way (and we can see impact later).
  • We can allocate a single buffer and use string_view to index into the buffer, instead of allocating strings.
  • Apparently getline does a lot of bounds checking, locale interpretation, and iostream management. I am not interested in any of that. I can work with raw utf-8 strings, as char sequences, until I hit a ';' char or a '\n' char if I write a custom parser.
For reference, the top of the hotspot call tree at this point:

+  100.00%     0.00%  2        2                     [.] _start
+  100.00%     0.00%  2        libc.so.6             [.] __libc_start_main
+  100.00%     0.00%  2        libc.so.6             [.] 0x00007f606ee27675
+   79.18%    41.31%  2        2                     [.] main
+   21.92%    11.46%  2        libstdc++.so.6.0.34   [.] std::basic_istream<char, std::char_traits<char> >& std::getline<char, std::char_traits<char>, std::allocator<char> >(std::basic_istream<char, std::char_t
+   11.62%    11.61%  2        libstdc++.so.6.0.34   [.] std::_Hash_bytes(void const*, unsigned long, unsigned long)
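To make the mmap plan concrete, here is a minimal sketch of mapping the file and walking lines as string_views (POSIX-only, with error handling deliberately elided; not the final code):

```cpp
#include <fcntl.h>
#include <string_view>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a file read-only and return a view over its bytes; a sketch,
// assuming open/fstat/mmap all succeed (real code must check each).
std::string_view map_file(const char* path) {
  int fd = open(path, O_RDONLY);
  struct stat st{};
  fstat(fd, &st);
  void* data = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  close(fd);  // the mapping outlives the descriptor
  return {static_cast<const char*>(data), static_cast<size_t>(st.st_size)};
}

// Walk lines as views into the mapping: no copies, no allocations.
template <typename F>
void for_each_line(std::string_view buf, F&& f) {
  size_t pos = 0;
  while (pos < buf.size()) {
    size_t nl = buf.find('\n', pos);
    if (nl == std::string_view::npos) nl = buf.size();
    f(buf.substr(pos, nl - pos));
    pos = nl + 1;
  }
}
```

The kernel pages the file in on demand, so even a 16GB file doesn't need to fit in a user-space buffer all at once.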

Using mmap and string_view is one thing, but the parser is another thing entirely. To parse a single line, I wanted to benchmark three implementations: one with memchr, one with a simple hand-rolled loop, and one with SWAR. The SWAR version has an additional benefit: I can compute the city name's hash "on the fly" while finding its bounds. But enough talk.

Here are the 3 respective benchmarks:
