Examining Compiler Output 2


In this post I present a short comparison of the machine code generated by GCC 7.2 and Clang 5.0.0 for evaluating a floating-point polynomial.

For both compilers, only -O2 was used, but -O3 produced the same code.

float evaluate(float a, float b) {
    return (a - b + 1.0f) * (a - b) * (a - b - 1.0f);
}


GCC 7.2 output

evaluate(float, float):
  subss   xmm0, xmm1
  movss   xmm2, DWORD PTR .LC0[rip]
  movaps  xmm1, xmm0
  addss   xmm1, xmm2
  mulss   xmm1, xmm0
  subss   xmm0, xmm2
  mulss   xmm0, xmm1
  ret
.LC0:
  .long 1065353216


Clang 5.0.0 output

.LCPI0_0:
  .long 1065353216 # float 1
.LCPI0_1:
  .long 3212836864 # float -1
evaluate(float, float): # @evaluate(float, float)
  subss   xmm0, xmm1
  movss   xmm1, dword ptr [rip + .LCPI0_0]
  addss   xmm1, xmm0
  mulss   xmm1, xmm0
  addss   xmm0, dword ptr [rip + .LCPI0_1]
  mulss   xmm0, xmm1
  ret

Explanation (for the Clang output)

xmm0 = a
xmm1 = b
xmm0 = a - b
xmm1 = 1.0f
xmm1 = a - b + 1.0f
xmm1 = (a - b + 1.0f) * (a - b)
xmm0 = a - b - 1.0f
xmm0 = (a - b - 1.0f) * (a - b + 1.0f) * (a - b)
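
To double-check the register trace above, here is a minimal sketch in C that replays the Clang instruction sequence one step at a time. The function evaluate_traced and the variable names xmm0 and xmm1 are my own illustration, not something either compiler emits, and the sketch assumes plain single-precision arithmetic with no excess precision.

```c
/* Original expression from the post. */
float evaluate(float a, float b) {
    return (a - b + 1.0f) * (a - b) * (a - b - 1.0f);
}

/* Hypothetical helper: each statement mirrors one instruction of the
   Clang output, with local variables standing in for the registers. */
float evaluate_traced(float a, float b) {
    float xmm0 = a;          /* first argument arrives in xmm0    */
    float xmm1 = b;          /* second argument arrives in xmm1   */
    xmm0 = xmm0 - xmm1;      /* subss xmm0, xmm1                  */
    xmm1 = 1.0f;             /* movss xmm1, [rip + .LCPI0_0]      */
    xmm1 = xmm1 + xmm0;      /* addss xmm1, xmm0                  */
    xmm1 = xmm1 * xmm0;      /* mulss xmm1, xmm0                  */
    xmm0 = xmm0 + (-1.0f);   /* addss xmm0, [rip + .LCPI0_1]      */
    xmm0 = xmm0 * xmm1;      /* mulss xmm0, xmm1                  */
    return xmm0;             /* result is returned in xmm0        */
}
```

For a = 3.0f and b = 1.0f both functions return 6.0f, since a - b = 2 and 3 * 2 * 1 = 6.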

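GCC copies the register with movaps rather than movss, even though movss is sufficient in this case. A likely reason is that a register-to-register movss writes only the low 32 bits and merges them with the old contents of the destination, which makes the result depend on the destination's previous value; movaps overwrites the whole register and so avoids stalls from partial updates to XMM registers. Clang doesn’t generate movaps at all because it loads 1.0f straight into xmm1, but it needs two constants (1.0f and -1.0f) and uses addss for both, whereas GCC keeps a single constant in a register and uses subss to subtract it.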
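
The two ways of subtracting one can be sketched in C as follows. The function names are made up for illustration, and an optimizing compiler is free to emit the same instruction for both; the point is only that subtracting 1.0f and adding -1.0f give the same floating-point result, which is why either code shape is valid.

```c
/* GCC-style: a single constant that the surrounding code can also add. */
float minus_one_sub(float x) {
    const float one = 1.0f;        /* plays the role of .LC0      */
    return x - one;                /* corresponds to subss        */
}

/* Clang-style: a second constant holding -1.0f, so an add suffices. */
float minus_one_add(float x) {
    const float neg_one = -1.0f;   /* plays the role of .LCPI0_1  */
    return x + neg_one;            /* corresponds to addss        */
}
```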

In my benchmarks, the two alternatives had roughly the same throughput.
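
For anyone who wants to repeat the comparison, one rough way to time the function is sketched below. This is not the harness behind the result above; time_evaluate and its parameters are hypothetical, and the idea is simply to compile the same file once with each compiler and compare the elapsed times. The call through a function pointer and the running sum add overhead of their own, so treat the numbers as a coarse indication only.

```c
#include <stddef.h>
#include <time.h>

float evaluate(float a, float b) {
    return (a - b + 1.0f) * (a - b) * (a - b - 1.0f);
}

/* Hypothetical harness: calls fn `iters` times and returns the elapsed
   CPU time in seconds. The running sum is written to *sink so the
   compiler cannot discard the calls as dead code. */
double time_evaluate(float (*fn)(float, float), size_t iters, float *sink) {
    /* volatile inputs keep the compiler from folding the result. */
    volatile float a = 3.0f;
    volatile float b = 1.0f;
    float acc = 0.0f;

    clock_t start = clock();
    for (size_t i = 0; i < iters; ++i) {
        acc += fn(a, b);
    }
    clock_t end = clock();

    *sink = acc;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

A call such as time_evaluate(evaluate, 100000000, &sink) with a float sink gives one measurement; repeating it several times and taking the minimum helps reduce noise.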