performance - Logarithm in C++ and assembly
Apparently the MSVC++ 2017 toolset v141 (x64 Release configuration) doesn't use the fyl2x x86_64 assembly instruction via a C/C++ intrinsic; rather, C++ log() or log2() usages result in a real call to a long function which seems to implement an approximation of the logarithm (without using fyl2x). The performance I measured is also strange: log() (natural logarithm) is 1.7667 times faster than log2() (base 2 logarithm), even though a base 2 logarithm should be easier for the processor because it stores the exponent in binary format (and the mantissa too), and that seems to be why the CPU instruction fyl2x calculates a base 2 logarithm (multiplied by a parameter).
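To make the "exponent is already stored in binary" argument concrete, here is a minimal sketch of how a base 2 logarithm splits into an integer part read from the exponent and a fractional part that still needs an approximation. The names Log2Parts, splitForLog2 and log2ViaExponent are hypothetical and introduced only for illustration; std::log2 is still called on the mantissa, so this shows the decomposition, not a faster implementation and not what the MSVC runtime actually does.

```cpp
#include <cassert>
#include <cmath>

// x = mantissa * 2^exponent, with mantissa in [0.5, 1) for positive finite x.
struct Log2Parts {
    int exponent;     // contributes an exact integer to log2(x)
    double mantissa;  // the only part that needs a real approximation
};

// Hypothetical helper: read the binary exponent and mantissa with std::frexp.
Log2Parts splitForLog2(double x) {
    assert(x > 0.0 && std::isfinite(x));
    Log2Parts p{};
    p.mantissa = std::frexp(x, &p.exponent);
    return p;
}

// log2(x) = exponent + log2(mantissa).
double log2ViaExponent(double x) {
    const Log2Parts p = splitForLog2(x);
    return p.exponent + std::log2(p.mantissa);
}
```

For example, splitForLog2(8.0) yields exponent 4 and mantissa 0.5, so the sum is 4 + (-1) = 3.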
Here is the code used for the measurements:

    #include <chrono>
    #include <cmath>
    #include <cstdint>
    #include <cstdio>

    const int64_t cnlogs = 100 * 1000 * 1000;

    void benchmarklog2() {
      double sum = 0;
      auto start = std::chrono::high_resolution_clock::now();
      for (int64_t i = 1; i <= cnlogs; i++) {
        sum += std::log2(double(i));
      }
      auto elapsed = std::chrono::high_resolution_clock::now() - start;
      // Elapsed time in seconds (microseconds * 1e-6), despite the variable name.
      double nsec = 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
      // Printing the sum keeps the compiler from optimizing the loop away.
      printf("log2: %.3lf ops/sec calculated %.3lf\n", cnlogs / nsec, sum);
    }

    void benchmarkln() {
      double sum = 0;
      auto start = std::chrono::high_resolution_clock::now();
      for (int64_t i = 1; i <= cnlogs; i++) {
        sum += std::log(double(i));
      }
      auto elapsed = std::chrono::high_resolution_clock::now() - start;
      double nsec = 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
      printf("ln: %.3lf ops/sec calculated %.3lf\n", cnlogs / nsec, sum);
    }

    int main() {
      benchmarklog2();
      benchmarkln();
      return 0;
    }
The output on a Ryzen 1800X is:

    log2: 95152910.728 ops/sec calculated 2513272986.435
    ln: 168109607.464 ops/sec calculated 1742068084.525
So I would like to elucidate these phenomena (no usage of fyl2x, the strange performance difference), and also to test the performance of fyl2x and, if it's faster, use it instead of <cmath>'s functions. MSVC++ doesn't allow inline assembly on x64, so a function in a separate assembly file that uses fyl2x is needed.
Could you answer with the assembly code for such a function, which uses fyl2x or a better instruction for computing a logarithm (without the need for a specific base), if there is one on newer x86_64 processors?
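Here is the assembly code using fyl2x:

    _data segment
    _data ends

    _text segment

    public srlog2muld

    ; xmm0l=tolog
    ; xmm1l=tomul
    srlog2muld proc
      ; spill both arguments to the caller-provided shadow space
      movq qword ptr [rsp+16], xmm1
      movq qword ptr [rsp+8], xmm0
      fld qword ptr [rsp+16]      ; st(0) = tomul
      fld qword ptr [rsp+8]       ; st(0) = tolog, st(1) = tomul
      fyl2x                       ; st(0) = st(1) * log2(st(0)), pops once
      fstp qword ptr [rsp+8]      ; store the result and empty the x87 stack
      movq xmm0, qword ptr [rsp+8]
      ret
    srlog2muld endp

    _text ends
    end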
The calling convention is according to https://docs.microsoft.com/en-us/cpp/build/overview-of-x64-calling-conventions , e.g.:

"The x87 register stack is unused. It may be used by the callee, but must be considered volatile across function calls."
The prototype in C++ is:

    extern "C" double __fastcall srlog2muld(const double tolog, const double tomul);
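Since the routine returns tomul * log2(tolog), the multiplier is what selects the base: passing ln(2) gives a natural logarithm and passing log10(2) gives a base 10 logarithm. The sketch below illustrates this with a portable stand-in, yLog2X, which is a hypothetical name and merely models the y * log2(x) contract with std::log2 so that it runs without the assembly file; lnViaLog2 and log10ViaLog2 are likewise illustrative names.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical portable stand-in with the same contract as fyl2x: y * log2(x).
// It does not execute the x87 instruction; it only models its result.
double yLog2X(double x, double y) {
    assert(x > 0.0);
    return y * std::log2(x);
}

// ln(x) = ln(2) * log2(x)
double lnViaLog2(double x) {
    const double ln2 = std::log(2.0);
    return yLog2X(x, ln2);
}

// log10(x) = log10(2) * log2(x)
double log10ViaLog2(double x) {
    const double log10of2 = std::log10(2.0);
    return yLog2X(x, log10of2);
}
```

With the real assembly routine, the equivalent call would presumably be srlog2muld(x, ln2), where ln2 is a precomputed constant.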
The performance of this function is 2 times slower than std::log2(), and more than 3 times slower than std::log():

    log2: 94803174.389 ops/sec calculated 2513272986.435
    fpu log2: 52008300.525 ops/sec calculated 2513272986.435
    ln: 169392473.892 ops/sec calculated 1742068084.525
The benchmarking code follows:

    void benchmarkfpulog2() {
      double sum = 0;
      auto start = std::chrono::high_resolution_clock::now();
      for (int64_t i = 1; i <= cnlogs; i++) {
        sum += srplat::srlog2muld(double(i), 1);
      }
      auto elapsed = std::chrono::high_resolution_clock::now() - start;
      double nsec = 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
      printf("fpu log2: %.3lf ops/sec calculated %.3lf\n", cnlogs / nsec, sum);
    }
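The three benchmark functions differ only in the function being timed, so the loop could be factored into one helper. The following is a sketch under a few assumptions: BenchResult, benchmarkUnary and benchmarkStdLog2 are hypothetical names, and std::chrono::steady_clock is swapped in for high_resolution_clock because it is guaranteed to be monotonic. The measurement approach is otherwise the same as in the code above.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstdint>

// Hypothetical result type: throughput plus the accumulated sum.
struct BenchResult {
    double opsPerSec;
    double sum;  // returned so the compiler cannot discard the loop
};

// Calls f(1.0), f(2.0), ..., f(count) and times the whole loop.
template <typename F>
BenchResult benchmarkUnary(F f, int64_t count) {
    assert(count > 0);
    double sum = 0;
    const auto start = std::chrono::steady_clock::now();
    for (int64_t i = 1; i <= count; i++) {
        sum += f(double(i));
    }
    // duration<double> with the default period holds seconds.
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return { count / elapsed.count(), sum };
}

// Example instantiation for std::log2.
BenchResult benchmarkStdLog2(int64_t count) {
    return benchmarkUnary([](double x) { return std::log2(x); }, count);
}
```

The assembly routine could be timed the same way by passing a lambda such as [](double x) { return srplat::srlog2muld(x, 1); }.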