Finding the Maximum of Four Integers

Inspired by an analysis of the source code of Rogue, I decided to find out how modern compilers encode selecting the maximum of four integers.

int max(int a, int b, int c, int d) {
  return a > b ? (a > c ? (a > d ? a : d) : (c > d ? c : d))
               : (b > c ? (b > d ? b : d) : (c > d ? c : d));
}

It is important to note that this is isolated code generation and a modern compiler could probably do better if more context was available.

Using Clang 5.0, I get the following (with optimizations enabled, of course).

 0:  39 f7                	cmp    edi, esi
 2:  0f 4d f7             	cmovge esi, edi
 5:  39 d6                	cmp    esi, edx
 7:  0f 4c f2             	cmovl  esi, edx
 a:  39 ce                	cmp    esi, ecx
 c:  0f 4c f1             	cmovl  esi, ecx
 f:  89 f0                	mov    eax, esi
11:  c3                   	ret

I will rename the code according to the calling convention used here.

a -> edi
b -> esi
c -> edx
d -> ecx

The result follows with some comments on what the code is doing.

cmp    a, b
cmovge b, a    # If a >= b, assign a to b.
               # Now b has max(a, b).
cmp    b, c
cmovl  b, c    # If max(a, b) < c, assign c to b.
               # Now b has max(max(a, b), c).
cmp    b, d
cmovl  b, d    # If max(max(a, b), c) < d, assign d to b.
               # Now b has max(max(max(a, b), c), d).
mov    eax, b  # Return b through eax.
ret

I don’t think it can be done in less code than this.

As one might expect, using C++ and std::max, we get almost identical x86.

Examining Compiler Output 2

In this post I present a short comparison between the machine code generated by GCC 7.2 and Clang 5.0.0 for the evaluation of a floating point polynomial.

For both compilers, only -O2 was used, but -O3 produced the same code.

float evaluate(float a, float b) {
    return (a - b + 1.0f) * (a - b) * (a - b - 1.0f);
}

GCC

evaluate(float, float):
  subss   xmm0, xmm1
  movss   xmm2, DWORD PTR .LC0[rip]
  movaps  xmm1, xmm0
  addss   xmm1, xmm2
  mulss   xmm1, xmm0
  subss   xmm0, xmm2
  mulss   xmm0, xmm1
  ret
.LC0:
  .long 1065353216

Clang

.LCPI0_0:
  .long 1065353216 # float 1
.LCPI0_1:
  .long 3212836864 # float -1
evaluate(float, float): # @evaluate(float, float)
  subss   xmm0, xmm1
  movss   xmm1, dword ptr [rip + .LCPI0_0]
  addss   xmm1, xmm0
  mulss   xmm1, xmm0
  addss   xmm0, dword ptr [rip + .LCPI0_1]
  mulss   xmm0, xmm1
  ret

Explanation (for the Clang output)

xmm0 = a
xmm1 = b
xmm0 = a - b
xmm1 = 1.0f
xmm1 = a - b + 1.0f
xmm1 = (a - b + 1.0f) * (a - b)
xmm0 = a - b - 1.0f
xmm0 = (a - b - 1.0f) * (a - b + 1.0f) * (a - b)

GCC seems to prefer movaps over movss, even though movss is sufficient in this case. A reason for doing so is that using movaps avoid stalls from partial updates to XMM registers. Clang doesn’t generate movaps, but uses two constants and only addss for them rather than only having one and using subss to subtract one from a register.

After benchmarking these alternatives, they had roughly the same throughput.

Examining Compiler Output 1

In this post I present a short comparison between the machine code generated by Clang 5.0.0 for two different for loops.

Sum up to N

The first function sums from 1 to N and returns the result.

C++

int sumUpTo(int n) {
  int x = 0;
  for (int i = 1; i <= n; i++) {
    x += i;
  }
  return x;
}

Output

sumUpTo(int):
  test  edi, edi
  jle   .LBB0_1
  lea   eax, [rdi - 1]
  lea   ecx, [rdi - 2]
  imul  rcx, rax
  shr   rcx
  lea   eax, [rcx + 2*rdi - 1]
  ret
.LBB0_1:
  xor   eax, eax
  ret

As you can see, Clang used a closed formula to evaluate the expression!

What should it be

  (1 + n)(n) / 2
= (n^2 + n) / 2

What Clang does

  (n - 1)(n - 2) / 2 + (2n - 1)
= (n^2 - 3n + 2) / 2 + (4n - 2) / 2
= (n^2 + n) / 2

It is important to note that the reason why a compiler would rather evaluate this convoluted expression is that it will not overflow like the first expression would.

Sum up to N multiplying by 2

C++

int sumTwiceUpTo(int n) {
  int x = 0;
  for (int i = 1; i <= n; i++) {
    x += 2 * i;
  }
  return x;
}

Output

sumTwiceUpTo(int):
  test  edi, edi
  jle   .LBB1_1
  lea   eax, [rdi - 1]
  lea   ecx, [rdi - 2]
  imul  ecx, eax
  and   ecx, -2
  lea   eax, [rcx + 4*rdi - 2]
  ret
.LBB1_1:
  xor   eax, eax
  ret

What is is

  (1 + n)(n)
= (n^2 + n)

What Clang does

  ((n - 1)(n - 2) & 111...0) + 4n - 2
= (n - 1)(n - 2) + 4n - 2
= n^2 + n

Again, a closed formula!

It is interesting to note that the and is completely useless here: it is forcing the product (which is always even) into a even number. Also, it requires the multiplication result to be evaluated, so it could add measurable delay.

Wunderlist to Taskwarrior

Description

Wunderlist to Taskwarrior is an application that fetches your tasks from Wunderlist and inserts them into Taskwarrior. It can be regularly ran by the operating system to keep Taskwarrior up-to-date.

Why I Made This

As I cannot update Taskwarrior when I am not near one of my computers I needed a way to update Taskwarrior while on the road. The solution I found uses Wunderlist as an intermediate. I already used Wunderlist for simple lists such as the weekly groceries but I greatly prefer Taskwarrior over it, so I still use Taskwarrior when I am at my computer.

Relevant Features

  • Selective Synchronization. Lists starting with “!” are never synchronized.
  • Data Safety. All communication is done via HTTPS.
  • Persistence. SQLite is used to keep track of what has been synchronized.

I had to decide between a continuous process which would update Taskwarrior or a utility application that would be scheduled to run regularly. I opted for the latter as the application initialization and termination are not so expensive. Currently, I’ve been running it every minute on one of my computers.

The Code

Wunderlist to Taskwarrior was written in Haskell using Stack so it should be easy enough to get a build for your own computer.

This is the GitHub repository.

Minimum Window Substring

Description

This is a post regarding the solution to a substring problem which follows a pattern that is quite common among substring problems: it is solvable with a pair of fast and slow iterators.

This problem can often appear as an interview question or as part of a competitive programming problem in sites such as LeetCode and Codeforces.

If this is not possible, we will return the empty string.

Problem Statement

Given two strings, S and T, find the minimum window in S which will contain all the characters in T (with their frequencies) in linear time.

S T Answer
"" "" ""
"" "A" ""
"A" "" ""
"A" "A" "A"
"A" "AB" "A"
"ADOBECODEBANC" "ABC" "BANC"
"ADOBECODECABANC" "ABC" "CAB"

C++ Solution

string minimum_window_substring(string s, string t) {
  if (t.empty()) {
    return "";
  }
  // Frequency in the token.
  unordered_map<char, size_t> tf;
  for (char c : t) {
    tf[c]++;
  }
  // Frequency in the string.
  unordered_map<char, size_t> sf;
  // How many characters we have to match.
  const auto target = t.size();
  const auto n = s.size();
  size_t best_i = 0;
  const auto no_size = numeric_limits<size_t>::max();
  size_t best_size = no_size;
  size_t slow = 0;
  size_t fast = 0;
  size_t matched = 0;
  // Advance fast until we met the target.
  // Then shrink with slow until we no longer meet the conditions.
  // As this advances each pointer at most N times, this is O(n).
  while (slow < n && fast < n) {
    sf[s[fast]]++;
    if (sf[s[fast]] <= tf[s[fast]]) {
      matched++;
    }
    fast++;
    // Here, the range is [slow, fast).
    while (matched == target && slow < fast) {
      const size_t match_size = fast - slow;
      if (match_size < best_size) {
        best_i = slow;
        best_size = match_size;
      }
      sf[s[slow]]--;
      if (sf[s[slow]] < tf[s[slow]]) {
        matched--;
      }
      slow++;
    }
  }
  if (best_size == no_size) {
    return "";
  }
  return s.substr(best_i, best_size);
}