Why is this code 6.5x slower with optimizations enabled?With arrays, why is it the case that a[5] == 5[a]?Why...

LWC and complex parameters

When blogging recipes, how can I support both readers who want the narrative/journey and ones who want the printer-friendly recipe?

What is the command to reset a PC without deleting any files

Lied on resume at previous job

Was there ever an axiom rendered a theorem?

Is there a familial term for apples and pears?

Patience, young "Padovan"

Does a dangling wire really electrocute me if I'm standing in water?

Can I find out the caloric content of bread by dehydrating it?

Does it makes sense to buy a new cycle to learn riding?

Landing in very high winds

How can I plot a Farey diagram?

Could a US political party gain complete control over the government by removing checks & balances?

Symmetry in quantum mechanics

How can I add custom success page

Why doesn't a const reference extend the life of a temporary object passed via a function?

Information to fellow intern about hiring?

aging parents with no investments

What does 'script /dev/null' do?

Is there a name of the flying bionic bird?

What are the advantages and disadvantages of running one shots compared to campaigns?

Pristine Bit Checking

New order #4: World

How to move the player while also allowing forces to affect it

Why is this code 6.5x slower with optimizations enabled?

With arrays, why is it the case that a[5] == 5[a]?Why is the Android emulator so slow? How can we speed up the Android emulator?Why doesn't GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)?Why are elementwise additions much faster in separate loops than in a combined loop?What is “:-!!” in C code?Why does changing 0.1f to 0 slow down performance by 10x?Why is it faster to process a sorted array than an unsorted array?Why does the C preprocessor interpret the word “linux” as the constant “1”?Why is printing “B” dramatically slower than printing “#”?Why is “1000000000000000 in range(1000000000000001)” so fast in Python 3?

.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{ height:90px;width:728px;box-sizing:border-box;
}

I wanted to benchmark glibc's strlen function for some reason and found out it apparently performs much slower with optimizations enabled in GCC and I have no idea why.

Here's my code:

#include <time.h>

#include <string.h>

#include <stdlib.h>

#include <stdio.h>



int main() {

    char *s = calloc(1 << 20, 1);

    memset(s, 65, 1000000);

    clock_t start = clock();

    for (int i = 0; i < 128; ++i) {

        s[strlen(s)] = 'A';

    }

    clock_t end = clock();

    printf("%lldn", (long long)(end-start));

    return 0;

}

On my machine it outputs:

$ gcc test.c && ./a.out

13336

$ gcc -O1 test.c && ./a.out

199004

$ gcc -O2 test.c && ./a.out

83415

$ gcc -O3 test.c && ./a.out

83415

Somehow, enabling optimizations causes it to execute longer.

edited yesterday

Muntasir

6321919

asked yesterday

TsarN

17328

With gcc-8.2 debug version takes 51334, release 8246. Release compiler options -O3 -march=native -DNDEBUG

– Maxim Egorushkin
yesterday

1

Please report it to gcc's bugzilla.

– Marc Glisse
yesterday

1

Using -fno-builtin makes the problem go away. So presumably the issue is that in this particular instance, GCC's builtin strlen is slower than the library's.

– David Schwartz
yesterday

1

It is generating repnz scasb for strlen at -O1.

– Marc Glisse
yesterday

3

@MarcGlisse It's already been filed: gcc.gnu.org/bugzilla/show_bug.cgi?id=88809

– Justin
yesterday

|
show 5 more comments

I wanted to benchmark glibc's strlen function for some reason and found out it apparently performs much slower with optimizations enabled in GCC and I have no idea why.

Here's my code:

#include <time.h>

#include <string.h>

#include <stdlib.h>

#include <stdio.h>



int main() {

    char *s = calloc(1 << 20, 1);

    memset(s, 65, 1000000);

    clock_t start = clock();

    for (int i = 0; i < 128; ++i) {

        s[strlen(s)] = 'A';

    }

    clock_t end = clock();

    printf("%lldn", (long long)(end-start));

    return 0;

}

On my machine it outputs:

$ gcc test.c && ./a.out

13336

$ gcc -O1 test.c && ./a.out

199004

$ gcc -O2 test.c && ./a.out

83415

$ gcc -O3 test.c && ./a.out

83415

Somehow, enabling optimizations causes it to execute longer.

edited yesterday

Muntasir

6321919

asked yesterday

TsarN

17328

With gcc-8.2 debug version takes 51334, release 8246. Release compiler options -O3 -march=native -DNDEBUG

– Maxim Egorushkin
yesterday

1

Please report it to gcc's bugzilla.

– Marc Glisse
yesterday

1

Using -fno-builtin makes the problem go away. So presumably the issue is that in this particular instance, GCC's builtin strlen is slower than the library's.

– David Schwartz
yesterday

1

It is generating repnz scasb for strlen at -O1.

– Marc Glisse
yesterday

3

@MarcGlisse It's already been filed: gcc.gnu.org/bugzilla/show_bug.cgi?id=88809

– Justin
yesterday

|
show 5 more comments

I wanted to benchmark glibc's strlen function for some reason and found out it apparently performs much slower with optimizations enabled in GCC and I have no idea why.

Here's my code:

#include <time.h>

#include <string.h>

#include <stdlib.h>

#include <stdio.h>



int main() {

    char *s = calloc(1 << 20, 1);

    memset(s, 65, 1000000);

    clock_t start = clock();

    for (int i = 0; i < 128; ++i) {

        s[strlen(s)] = 'A';

    }

    clock_t end = clock();

    printf("%lldn", (long long)(end-start));

    return 0;

}

On my machine it outputs:

$ gcc test.c && ./a.out

13336

$ gcc -O1 test.c && ./a.out

199004

$ gcc -O2 test.c && ./a.out

83415

$ gcc -O3 test.c && ./a.out

83415

Somehow, enabling optimizations causes it to execute longer.

edited yesterday

Muntasir

6321919

asked yesterday

TsarN

17328

I wanted to benchmark glibc's strlen function for some reason and found out it apparently performs much slower with optimizations enabled in GCC and I have no idea why.

Here's my code:

#include <time.h>

#include <string.h>

#include <stdlib.h>

#include <stdio.h>



int main() {

    char *s = calloc(1 << 20, 1);

    memset(s, 65, 1000000);

    clock_t start = clock();

    for (int i = 0; i < 128; ++i) {

        s[strlen(s)] = 'A';

    }

    clock_t end = clock();

    printf("%lldn", (long long)(end-start));

    return 0;

}

On my machine it outputs:

$ gcc test.c && ./a.out

13336

$ gcc -O1 test.c && ./a.out

199004

$ gcc -O2 test.c && ./a.out

83415

$ gcc -O3 test.c && ./a.out

83415

Somehow, enabling optimizations causes it to execute longer.

c performance gcc glibc

edited yesterday

Muntasir

6321919

asked yesterday

TsarN

17328

edited yesterday

Muntasir

6321919

asked yesterday

TsarN

17328

edited yesterday

Muntasir

6321919

edited yesterday

Muntasir

6321919

edited yesterday

Muntasir

6321919

asked yesterday

TsarN

17328

asked yesterday

TsarN

17328

asked yesterday

TsarN

17328

With gcc-8.2 debug version takes 51334, release 8246. Release compiler options -O3 -march=native -DNDEBUG

– Maxim Egorushkin
yesterday

1

Please report it to gcc's bugzilla.

– Marc Glisse
yesterday

1

Using -fno-builtin makes the problem go away. So presumably the issue is that in this particular instance, GCC's builtin strlen is slower than the library's.

– David Schwartz
yesterday

1

It is generating repnz scasb for strlen at -O1.

– Marc Glisse
yesterday

3

@MarcGlisse It's already been filed: gcc.gnu.org/bugzilla/show_bug.cgi?id=88809

– Justin
yesterday

|
show 5 more comments

With gcc-8.2 debug version takes 51334, release 8246. Release compiler options -O3 -march=native -DNDEBUG

– Maxim Egorushkin
yesterday

1

Please report it to gcc's bugzilla.

– Marc Glisse
yesterday

1

Using -fno-builtin makes the problem go away. So presumably the issue is that in this particular instance, GCC's builtin strlen is slower than the library's.

– David Schwartz
yesterday

1

It is generating repnz scasb for strlen at -O1.

– Marc Glisse
yesterday

3

@MarcGlisse It's already been filed: gcc.gnu.org/bugzilla/show_bug.cgi?id=88809

– Justin
yesterday

With gcc-8.2 debug version takes 51334, release 8246. Release compiler options -O3 -march=native -DNDEBUG

– Maxim Egorushkin
yesterday

Please report it to gcc's bugzilla.

– Marc Glisse
yesterday

Using -fno-builtin makes the problem go away. So presumably the issue is that in this particular instance, GCC's builtin strlen is slower than the library's.

– David Schwartz
yesterday

It is generating repnz scasb for strlen at -O1.

– Marc Glisse
yesterday

@MarcGlisse It's already been filed: gcc.gnu.org/bugzilla/show_bug.cgi?id=88809

– Justin
yesterday

|
show 5 more comments

1 Answer
1

active

oldest

votes

Testing your code on Godbolt's Compiler Explorer provides this explanation:

at -O0 or without optimisations, the generated code call the C library function strlen

at -O1 the generated code uses a simple inline expansion using a rep scasb instruction.

at -O2 and above, the generated code uses a more elaborate inline expansion.

Benchmarking your code repeatedly shows a substantial variation from one run to another, but increasing the number of iterations shows that:

the -O1 code is much slower than the C library implementation: 32240 vs 3090

the -O2 code is faster than the -O1 but still substantially slower than the C ibrary code: 8570 vs 3090.

This behavior is specific to gcc and the glibc. The same test on OS/X with clang and Apple's Libc does not show a significant difference, which is not a surprise as Godbolt shows that clang generates a call to the C library strlen at all optimisation levels.

This could be considered a bug in gcc/glibc but more extensive benchmarking might show that the overhead of calling strlen has a more important impact than the lack of performance of the inline code for small strings. The strings on which you benchmark are uncommonly large, so focusing the benchmark on ultra-long strings might not give meaningful results.

I improved this benchmark and tested various string lengths. It appears from the benchmarks on linux with gcc (Debian 4.7.2-5) 4.7.2 running on an Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz that the inline code generated by -O1 is always slower, by as much as a factor of 10 for moderately long strings, while -O2 is only slightly faster than the libc strlen for very short strings and half as fast for longer strings. From this data, the GNU C library version of strlen is quite efficient for most string lengths, at least on my specific hardware. Also keeping in mind that cacheing has a major impact on benchmark measurements.

Here is the updated code:

#include <stdlib.h>

#include <string.h>

#include <time.h>



void benchmark(int repeat, int minlen, int maxlen) {

    char *s = malloc(maxlen + 1);

    memset(s, 'A', minlen);

    long long bytes = 0, calls = 0;

    clock_t clk = clock();

    for (int n = 0; n < repeat; n++) {

        for (int i = minlen; i < maxlen; ++i) {

            bytes += i + 1;

            calls += 1;

            s[i] = '';

            s[strlen(s)] = 'A';

        }

    }

    clk = clock() - clk;

    free(s);

    double avglen = (minlen + maxlen - 1) / 2.0;

    double ns = (double)clk * 1e9 / CLOCKS_PER_SEC;

    printf("average length %7.0f -> avg time: %7.3f ns/byte, %7.3f ns/calln",

           avglen, ns / bytes, ns / calls);

}



int main() {

    benchmark(10000000, 0, 1);

    benchmark(1000000, 0, 10);

    benchmark(1000000, 5, 15);

    benchmark(100000, 0, 100);

    benchmark(100000, 50, 150);

    benchmark(10000, 0, 1000);

    benchmark(10000, 500, 1500);

    benchmark(1000, 0, 10000);

    benchmark(1000, 5000, 15000);

    benchmark(100, 1000000 - 50, 1000000 + 50);

    return 0;

}

Here is the output:



chqrlie> gcc -std=c99 -O0 benchstrlen.c && ./a.out

average length       0 -> avg time:  14.000 ns/byte,  14.000 ns/call

average length       4 -> avg time:   2.364 ns/byte,  13.000 ns/call

average length      10 -> avg time:   1.238 ns/byte,  13.000 ns/call

average length      50 -> avg time:   0.317 ns/byte,  16.000 ns/call

average length     100 -> avg time:   0.169 ns/byte,  17.000 ns/call

average length     500 -> avg time:   0.074 ns/byte,  37.000 ns/call

average length    1000 -> avg time:   0.068 ns/byte,  68.000 ns/call

average length    5000 -> avg time:   0.064 ns/byte, 318.000 ns/call

average length   10000 -> avg time:   0.062 ns/byte, 622.000 ns/call

average length 1000000 -> avg time:   0.062 ns/byte, 62000.000 ns/call

chqrlie> gcc -std=c99 -O1 benchstrlen.c && ./a.out

average length       0 -> avg time:  20.000 ns/byte,  20.000 ns/call

average length       4 -> avg time:   3.818 ns/byte,  21.000 ns/call

average length      10 -> avg time:   2.190 ns/byte,  23.000 ns/call

average length      50 -> avg time:   0.990 ns/byte,  50.000 ns/call

average length     100 -> avg time:   0.816 ns/byte,  82.000 ns/call

average length     500 -> avg time:   0.679 ns/byte, 340.000 ns/call

average length    1000 -> avg time:   0.664 ns/byte, 664.000 ns/call

average length    5000 -> avg time:   0.651 ns/byte, 3254.000 ns/call

average length   10000 -> avg time:   0.649 ns/byte, 6491.000 ns/call

average length 1000000 -> avg time:   0.648 ns/byte, 648000.000 ns/call

chqrlie> gcc -std=c99 -O2 benchstrlen.c && ./a.out

average length       0 -> avg time:  10.000 ns/byte,  10.000 ns/call

average length       4 -> avg time:   2.000 ns/byte,  11.000 ns/call

average length      10 -> avg time:   1.048 ns/byte,  11.000 ns/call

average length      50 -> avg time:   0.337 ns/byte,  17.000 ns/call

average length     100 -> avg time:   0.299 ns/byte,  30.000 ns/call

average length     500 -> avg time:   0.202 ns/byte, 101.000 ns/call

average length    1000 -> avg time:   0.188 ns/byte, 188.000 ns/call

average length    5000 -> avg time:   0.174 ns/byte, 868.000 ns/call

average length   10000 -> avg time:   0.172 ns/byte, 1716.000 ns/call

average length 1000000 -> avg time:   0.172 ns/byte, 172000.000 ns/call

edited 49 mins ago

answered yesterday

chqrlie

63.1k850108

2

It would, but the hand optimized version in the C library might be larger and more complicated to inline. I have not looked into this recently, but there used to be a mix of complex platform specific macros in <string.h> and hard coded optimisations in the gcc code generator. Definitely still room for improvement on intel targets.

– chqrlie
yesterday

1

@chqrlie: The problem I'm trying to highlight here is that people benchmark on "wildly unrealistic in practice" scenarios, then make false assumptions based on the unrealistic results, then optimise code (e.g. in libraries) based on these false assumptions. If strlen() is a bottleneck (e.g. because the strings actually are large) anyone that cares about performance will keep track of string lengths themselves and will not use strlen(); and (for people that care about performance) strlen()` is mostly only used when the strings are too small to bother keeping track of their lengths.

– Brendan
yesterday

4

@chqrlie: I'd also say that this is partly a symptom of a larger problem - code in libraries can't be optimised for any specific case and therefore must be "un-optimal" for some cases. To work around that it would've been nice if there was a strlen_small() and a separate strlen_large(), but there isn't.

– Brendan
yesterday

1

@MatthieuM.: you can look on Godbolt (and note that -march= implies -mtune). gcc -march=skylake -O3 still uses the same 4-byte-at-a-time scalar integer with large cleanup, even though gcc is only inlining strlen at all when it knows the buffer is aligned. Knowing the alignment makes it easy to use SSE2 pcmpeqb / pmovmskb / test/jz, then bsf, without risking page faults for crossing into the next page. (We know calloc gave us 16-byte aligned memory because alignof(maxalign_t) = 16 in x86-64 System V.) Total code-size would probably be smaller because of simpler cleanup.

– Peter Cordes
11 hours ago

1

This is a major missed optimization because SSE2 is baseline for x86-64, and a simple compact inline loop could probably go at least half the speed of the library version for the known-alignment case for large inputs, and probably be better for small inputs, too. (An AVX2 / AVX512 version would require 32-byte / 64-byte alignment to avoid maybe reading into an unmapped page, which we don't have.)

– Peter Cordes
11 hours ago

|
show 12 more comments

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55563598%2fwhy-is-this-code-6-5x-slower-with-optimizations-enabled%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

Testing your code on Godbolt's Compiler Explorer provides this explanation:

at -O0 or without optimisations, the generated code call the C library function strlen

at -O1 the generated code uses a simple inline expansion using a rep scasb instruction.

at -O2 and above, the generated code uses a more elaborate inline expansion.

Benchmarking your code repeatedly shows a substantial variation from one run to another, but increasing the number of iterations shows that:

the -O1 code is much slower than the C library implementation: 32240 vs 3090

the -O2 code is faster than the -O1 but still substantially slower than the C ibrary code: 8570 vs 3090.

Here is the updated code:

#include <stdlib.h>

#include <string.h>

#include <time.h>



void benchmark(int repeat, int minlen, int maxlen) {

    char *s = malloc(maxlen + 1);

    memset(s, 'A', minlen);

    long long bytes = 0, calls = 0;

    clock_t clk = clock();

    for (int n = 0; n < repeat; n++) {

        for (int i = minlen; i < maxlen; ++i) {

            bytes += i + 1;

            calls += 1;

            s[i] = '';

            s[strlen(s)] = 'A';

        }

    }

    clk = clock() - clk;

    free(s);

    double avglen = (minlen + maxlen - 1) / 2.0;

    double ns = (double)clk * 1e9 / CLOCKS_PER_SEC;

    printf("average length %7.0f -> avg time: %7.3f ns/byte, %7.3f ns/calln",

           avglen, ns / bytes, ns / calls);

}



int main() {

    benchmark(10000000, 0, 1);

    benchmark(1000000, 0, 10);

    benchmark(1000000, 5, 15);

    benchmark(100000, 0, 100);

    benchmark(100000, 50, 150);

    benchmark(10000, 0, 1000);

    benchmark(10000, 500, 1500);

    benchmark(1000, 0, 10000);

    benchmark(1000, 5000, 15000);

    benchmark(100, 1000000 - 50, 1000000 + 50);

    return 0;

}

Here is the output:



chqrlie> gcc -std=c99 -O0 benchstrlen.c && ./a.out

average length       0 -> avg time:  14.000 ns/byte,  14.000 ns/call

average length       4 -> avg time:   2.364 ns/byte,  13.000 ns/call

average length      10 -> avg time:   1.238 ns/byte,  13.000 ns/call

average length      50 -> avg time:   0.317 ns/byte,  16.000 ns/call

average length     100 -> avg time:   0.169 ns/byte,  17.000 ns/call

average length     500 -> avg time:   0.074 ns/byte,  37.000 ns/call

average length    1000 -> avg time:   0.068 ns/byte,  68.000 ns/call

average length    5000 -> avg time:   0.064 ns/byte, 318.000 ns/call

average length   10000 -> avg time:   0.062 ns/byte, 622.000 ns/call

average length 1000000 -> avg time:   0.062 ns/byte, 62000.000 ns/call

chqrlie> gcc -std=c99 -O1 benchstrlen.c && ./a.out

average length       0 -> avg time:  20.000 ns/byte,  20.000 ns/call

average length       4 -> avg time:   3.818 ns/byte,  21.000 ns/call

average length      10 -> avg time:   2.190 ns/byte,  23.000 ns/call

average length      50 -> avg time:   0.990 ns/byte,  50.000 ns/call

average length     100 -> avg time:   0.816 ns/byte,  82.000 ns/call

average length     500 -> avg time:   0.679 ns/byte, 340.000 ns/call

average length    1000 -> avg time:   0.664 ns/byte, 664.000 ns/call

average length    5000 -> avg time:   0.651 ns/byte, 3254.000 ns/call

average length   10000 -> avg time:   0.649 ns/byte, 6491.000 ns/call

average length 1000000 -> avg time:   0.648 ns/byte, 648000.000 ns/call

chqrlie> gcc -std=c99 -O2 benchstrlen.c && ./a.out

average length       0 -> avg time:  10.000 ns/byte,  10.000 ns/call

average length       4 -> avg time:   2.000 ns/byte,  11.000 ns/call

average length      10 -> avg time:   1.048 ns/byte,  11.000 ns/call

average length      50 -> avg time:   0.337 ns/byte,  17.000 ns/call

average length     100 -> avg time:   0.299 ns/byte,  30.000 ns/call

average length     500 -> avg time:   0.202 ns/byte, 101.000 ns/call

average length    1000 -> avg time:   0.188 ns/byte, 188.000 ns/call

average length    5000 -> avg time:   0.174 ns/byte, 868.000 ns/call

average length   10000 -> avg time:   0.172 ns/byte, 1716.000 ns/call

average length 1000000 -> avg time:   0.172 ns/byte, 172000.000 ns/call

edited 49 mins ago

answered yesterday

chqrlie

63.1k850108

2

It would, but the hand optimized version in the C library might be larger and more complicated to inline. I have not looked into this recently, but there used to be a mix of complex platform specific macros in <string.h> and hard coded optimisations in the gcc code generator. Definitely still room for improvement on intel targets.

– chqrlie
yesterday

1

@chqrlie: The problem I'm trying to highlight here is that people benchmark on "wildly unrealistic in practice" scenarios, then make false assumptions based on the unrealistic results, then optimise code (e.g. in libraries) based on these false assumptions. If strlen() is a bottleneck (e.g. because the strings actually are large) anyone that cares about performance will keep track of string lengths themselves and will not use strlen(); and (for people that care about performance) strlen()` is mostly only used when the strings are too small to bother keeping track of their lengths.

– Brendan
yesterday

4

@chqrlie: I'd also say that this is partly a symptom of a larger problem - code in libraries can't be optimised for any specific case and therefore must be "un-optimal" for some cases. To work around that it would've been nice if there was a strlen_small() and a separate strlen_large(), but there isn't.

– Brendan
yesterday

1

@MatthieuM.: you can look on Godbolt (and note that -march= implies -mtune). gcc -march=skylake -O3 still uses the same 4-byte-at-a-time scalar integer with large cleanup, even though gcc is only inlining strlen at all when it knows the buffer is aligned. Knowing the alignment makes it easy to use SSE2 pcmpeqb / pmovmskb / test/jz, then bsf, without risking page faults for crossing into the next page. (We know calloc gave us 16-byte aligned memory because alignof(maxalign_t) = 16 in x86-64 System V.) Total code-size would probably be smaller because of simpler cleanup.

– Peter Cordes
11 hours ago

1

This is a major missed optimization because SSE2 is baseline for x86-64, and a simple compact inline loop could probably go at least half the speed of the library version for the known-alignment case for large inputs, and probably be better for small inputs, too. (An AVX2 / AVX512 version would require 32-byte / 64-byte alignment to avoid maybe reading into an unmapped page, which we don't have.)

– Peter Cordes
11 hours ago

|
show 12 more comments

Testing your code on Godbolt's Compiler Explorer provides this explanation:

at -O0 or without optimisations, the generated code call the C library function strlen

at -O1 the generated code uses a simple inline expansion using a rep scasb instruction.

at -O2 and above, the generated code uses a more elaborate inline expansion.

Benchmarking your code repeatedly shows a substantial variation from one run to another, but increasing the number of iterations shows that:

the -O1 code is much slower than the C library implementation: 32240 vs 3090

the -O2 code is faster than the -O1 but still substantially slower than the C ibrary code: 8570 vs 3090.

Here is the updated code:

#include <stdlib.h>

#include <string.h>

#include <time.h>



void benchmark(int repeat, int minlen, int maxlen) {

    char *s = malloc(maxlen + 1);

    memset(s, 'A', minlen);

    long long bytes = 0, calls = 0;

    clock_t clk = clock();

    for (int n = 0; n < repeat; n++) {

        for (int i = minlen; i < maxlen; ++i) {

            bytes += i + 1;

            calls += 1;

            s[i] = '';

            s[strlen(s)] = 'A';

        }

    }

    clk = clock() - clk;

    free(s);

    double avglen = (minlen + maxlen - 1) / 2.0;

    double ns = (double)clk * 1e9 / CLOCKS_PER_SEC;

    printf("average length %7.0f -> avg time: %7.3f ns/byte, %7.3f ns/calln",

           avglen, ns / bytes, ns / calls);

}



int main() {

    benchmark(10000000, 0, 1);

    benchmark(1000000, 0, 10);

    benchmark(1000000, 5, 15);

    benchmark(100000, 0, 100);

    benchmark(100000, 50, 150);

    benchmark(10000, 0, 1000);

    benchmark(10000, 500, 1500);

    benchmark(1000, 0, 10000);

    benchmark(1000, 5000, 15000);

    benchmark(100, 1000000 - 50, 1000000 + 50);

    return 0;

}

Here is the output:



chqrlie> gcc -std=c99 -O0 benchstrlen.c && ./a.out

average length       0 -> avg time:  14.000 ns/byte,  14.000 ns/call

average length       4 -> avg time:   2.364 ns/byte,  13.000 ns/call

average length      10 -> avg time:   1.238 ns/byte,  13.000 ns/call

average length      50 -> avg time:   0.317 ns/byte,  16.000 ns/call

average length     100 -> avg time:   0.169 ns/byte,  17.000 ns/call

average length     500 -> avg time:   0.074 ns/byte,  37.000 ns/call

average length    1000 -> avg time:   0.068 ns/byte,  68.000 ns/call

average length    5000 -> avg time:   0.064 ns/byte, 318.000 ns/call

average length   10000 -> avg time:   0.062 ns/byte, 622.000 ns/call

average length 1000000 -> avg time:   0.062 ns/byte, 62000.000 ns/call

chqrlie> gcc -std=c99 -O1 benchstrlen.c && ./a.out

average length       0 -> avg time:  20.000 ns/byte,  20.000 ns/call

average length       4 -> avg time:   3.818 ns/byte,  21.000 ns/call

average length      10 -> avg time:   2.190 ns/byte,  23.000 ns/call

average length      50 -> avg time:   0.990 ns/byte,  50.000 ns/call

average length     100 -> avg time:   0.816 ns/byte,  82.000 ns/call

average length     500 -> avg time:   0.679 ns/byte, 340.000 ns/call

average length    1000 -> avg time:   0.664 ns/byte, 664.000 ns/call

average length    5000 -> avg time:   0.651 ns/byte, 3254.000 ns/call

average length   10000 -> avg time:   0.649 ns/byte, 6491.000 ns/call

average length 1000000 -> avg time:   0.648 ns/byte, 648000.000 ns/call

chqrlie> gcc -std=c99 -O2 benchstrlen.c && ./a.out

average length       0 -> avg time:  10.000 ns/byte,  10.000 ns/call

average length       4 -> avg time:   2.000 ns/byte,  11.000 ns/call

average length      10 -> avg time:   1.048 ns/byte,  11.000 ns/call

average length      50 -> avg time:   0.337 ns/byte,  17.000 ns/call

average length     100 -> avg time:   0.299 ns/byte,  30.000 ns/call

average length     500 -> avg time:   0.202 ns/byte, 101.000 ns/call

average length    1000 -> avg time:   0.188 ns/byte, 188.000 ns/call

average length    5000 -> avg time:   0.174 ns/byte, 868.000 ns/call

average length   10000 -> avg time:   0.172 ns/byte, 1716.000 ns/call

average length 1000000 -> avg time:   0.172 ns/byte, 172000.000 ns/call

edited 49 mins ago

answered yesterday

chqrlie

63.1k850108

2

It would, but the hand optimized version in the C library might be larger and more complicated to inline. I have not looked into this recently, but there used to be a mix of complex platform specific macros in <string.h> and hard coded optimisations in the gcc code generator. Definitely still room for improvement on intel targets.

– chqrlie
yesterday

1

@chqrlie: The problem I'm trying to highlight here is that people benchmark on "wildly unrealistic in practice" scenarios, then make false assumptions based on the unrealistic results, then optimise code (e.g. in libraries) based on these false assumptions. If strlen() is a bottleneck (e.g. because the strings actually are large) anyone that cares about performance will keep track of string lengths themselves and will not use strlen(); and (for people that care about performance) strlen()` is mostly only used when the strings are too small to bother keeping track of their lengths.

– Brendan
yesterday

4

@chqrlie: I'd also say that this is partly a symptom of a larger problem - code in libraries can't be optimised for any specific case and therefore must be "un-optimal" for some cases. To work around that it would've been nice if there was a strlen_small() and a separate strlen_large(), but there isn't.

– Brendan
yesterday

1

@MatthieuM.: you can look on Godbolt (and note that -march= implies -mtune). gcc -march=skylake -O3 still uses the same 4-byte-at-a-time scalar integer with large cleanup, even though gcc is only inlining strlen at all when it knows the buffer is aligned. Knowing the alignment makes it easy to use SSE2 pcmpeqb / pmovmskb / test/jz, then bsf, without risking page faults for crossing into the next page. (We know calloc gave us 16-byte aligned memory because alignof(maxalign_t) = 16 in x86-64 System V.) Total code-size would probably be smaller because of simpler cleanup.

– Peter Cordes
11 hours ago

1

This is a major missed optimization because SSE2 is baseline for x86-64, and a simple compact inline loop could probably go at least half the speed of the library version for the known-alignment case for large inputs, and probably be better for small inputs, too. (An AVX2 / AVX512 version would require 32-byte / 64-byte alignment to avoid maybe reading into an unmapped page, which we don't have.)

– Peter Cordes
11 hours ago

|
show 12 more comments

Testing your code on Godbolt's Compiler Explorer provides this explanation:

at -O0 or without optimisations, the generated code call the C library function strlen

at -O1 the generated code uses a simple inline expansion using a rep scasb instruction.

at -O2 and above, the generated code uses a more elaborate inline expansion.

Benchmarking your code repeatedly shows a substantial variation from one run to another, but increasing the number of iterations shows that:

the -O1 code is much slower than the C library implementation: 32240 vs 3090

the -O2 code is faster than the -O1 but still substantially slower than the C ibrary code: 8570 vs 3090.

Here is the updated code:

#include <stdlib.h>

#include <string.h>

#include <time.h>



void benchmark(int repeat, int minlen, int maxlen) {

    char *s = malloc(maxlen + 1);

    memset(s, 'A', minlen);

    long long bytes = 0, calls = 0;

    clock_t clk = clock();

    for (int n = 0; n < repeat; n++) {

        for (int i = minlen; i < maxlen; ++i) {

            bytes += i + 1;

            calls += 1;

            s[i] = '';

            s[strlen(s)] = 'A';

        }

    }

    clk = clock() - clk;

    free(s);

    double avglen = (minlen + maxlen - 1) / 2.0;

    double ns = (double)clk * 1e9 / CLOCKS_PER_SEC;

    printf("average length %7.0f -> avg time: %7.3f ns/byte, %7.3f ns/calln",

           avglen, ns / bytes, ns / calls);

}



int main() {

    benchmark(10000000, 0, 1);

    benchmark(1000000, 0, 10);

    benchmark(1000000, 5, 15);

    benchmark(100000, 0, 100);

    benchmark(100000, 50, 150);

    benchmark(10000, 0, 1000);

    benchmark(10000, 500, 1500);

    benchmark(1000, 0, 10000);

    benchmark(1000, 5000, 15000);

    benchmark(100, 1000000 - 50, 1000000 + 50);

    return 0;

}

Here is the output:



chqrlie> gcc -std=c99 -O0 benchstrlen.c && ./a.out

average length       0 -> avg time:  14.000 ns/byte,  14.000 ns/call

average length       4 -> avg time:   2.364 ns/byte,  13.000 ns/call

average length      10 -> avg time:   1.238 ns/byte,  13.000 ns/call

average length      50 -> avg time:   0.317 ns/byte,  16.000 ns/call

average length     100 -> avg time:   0.169 ns/byte,  17.000 ns/call

average length     500 -> avg time:   0.074 ns/byte,  37.000 ns/call

average length    1000 -> avg time:   0.068 ns/byte,  68.000 ns/call

average length    5000 -> avg time:   0.064 ns/byte, 318.000 ns/call

average length   10000 -> avg time:   0.062 ns/byte, 622.000 ns/call

average length 1000000 -> avg time:   0.062 ns/byte, 62000.000 ns/call

chqrlie> gcc -std=c99 -O1 benchstrlen.c && ./a.out

average length       0 -> avg time:  20.000 ns/byte,  20.000 ns/call

average length       4 -> avg time:   3.818 ns/byte,  21.000 ns/call

average length      10 -> avg time:   2.190 ns/byte,  23.000 ns/call

average length      50 -> avg time:   0.990 ns/byte,  50.000 ns/call

average length     100 -> avg time:   0.816 ns/byte,  82.000 ns/call

average length     500 -> avg time:   0.679 ns/byte, 340.000 ns/call

average length    1000 -> avg time:   0.664 ns/byte, 664.000 ns/call

average length    5000 -> avg time:   0.651 ns/byte, 3254.000 ns/call

average length   10000 -> avg time:   0.649 ns/byte, 6491.000 ns/call

average length 1000000 -> avg time:   0.648 ns/byte, 648000.000 ns/call

chqrlie> gcc -std=c99 -O2 benchstrlen.c && ./a.out

average length       0 -> avg time:  10.000 ns/byte,  10.000 ns/call

average length       4 -> avg time:   2.000 ns/byte,  11.000 ns/call

average length      10 -> avg time:   1.048 ns/byte,  11.000 ns/call

average length      50 -> avg time:   0.337 ns/byte,  17.000 ns/call

average length     100 -> avg time:   0.299 ns/byte,  30.000 ns/call

average length     500 -> avg time:   0.202 ns/byte, 101.000 ns/call

average length    1000 -> avg time:   0.188 ns/byte, 188.000 ns/call

average length    5000 -> avg time:   0.174 ns/byte, 868.000 ns/call

average length   10000 -> avg time:   0.172 ns/byte, 1716.000 ns/call

average length 1000000 -> avg time:   0.172 ns/byte, 172000.000 ns/call

edited 49 mins ago

answered yesterday

chqrlie

63.1k850108

Testing your code on Godbolt's Compiler Explorer provides this explanation:

at -O0 or without optimisations, the generated code call the C library function strlen

at -O1 the generated code uses a simple inline expansion using a rep scasb instruction.

at -O2 and above, the generated code uses a more elaborate inline expansion.

Benchmarking your code repeatedly shows a substantial variation from one run to another, but increasing the number of iterations shows that:

the -O1 code is much slower than the C library implementation: 32240 vs 3090

the -O2 code is faster than the -O1 but still substantially slower than the C ibrary code: 8570 vs 3090.

Here is the updated code:

#include <stdlib.h>

#include <string.h>

#include <time.h>



void benchmark(int repeat, int minlen, int maxlen) {

    char *s = malloc(maxlen + 1);

    memset(s, 'A', minlen);

    long long bytes = 0, calls = 0;

    clock_t clk = clock();

    for (int n = 0; n < repeat; n++) {

        for (int i = minlen; i < maxlen; ++i) {

            bytes += i + 1;

            calls += 1;

            s[i] = '';

            s[strlen(s)] = 'A';

        }

    }

    clk = clock() - clk;

    free(s);

    double avglen = (minlen + maxlen - 1) / 2.0;

    double ns = (double)clk * 1e9 / CLOCKS_PER_SEC;

    printf("average length %7.0f -> avg time: %7.3f ns/byte, %7.3f ns/calln",

           avglen, ns / bytes, ns / calls);

}



int main() {

    benchmark(10000000, 0, 1);

    benchmark(1000000, 0, 10);

    benchmark(1000000, 5, 15);

    benchmark(100000, 0, 100);

    benchmark(100000, 50, 150);

    benchmark(10000, 0, 1000);

    benchmark(10000, 500, 1500);

    benchmark(1000, 0, 10000);

    benchmark(1000, 5000, 15000);

    benchmark(100, 1000000 - 50, 1000000 + 50);

    return 0;

}

Here is the output:



chqrlie> gcc -std=c99 -O0 benchstrlen.c && ./a.out

average length       0 -> avg time:  14.000 ns/byte,  14.000 ns/call

average length       4 -> avg time:   2.364 ns/byte,  13.000 ns/call

average length      10 -> avg time:   1.238 ns/byte,  13.000 ns/call

average length      50 -> avg time:   0.317 ns/byte,  16.000 ns/call

average length     100 -> avg time:   0.169 ns/byte,  17.000 ns/call

average length     500 -> avg time:   0.074 ns/byte,  37.000 ns/call

average length    1000 -> avg time:   0.068 ns/byte,  68.000 ns/call

average length    5000 -> avg time:   0.064 ns/byte, 318.000 ns/call

average length   10000 -> avg time:   0.062 ns/byte, 622.000 ns/call

average length 1000000 -> avg time:   0.062 ns/byte, 62000.000 ns/call

chqrlie> gcc -std=c99 -O1 benchstrlen.c && ./a.out

average length       0 -> avg time:  20.000 ns/byte,  20.000 ns/call

average length       4 -> avg time:   3.818 ns/byte,  21.000 ns/call

average length      10 -> avg time:   2.190 ns/byte,  23.000 ns/call

average length      50 -> avg time:   0.990 ns/byte,  50.000 ns/call

average length     100 -> avg time:   0.816 ns/byte,  82.000 ns/call

average length     500 -> avg time:   0.679 ns/byte, 340.000 ns/call

average length    1000 -> avg time:   0.664 ns/byte, 664.000 ns/call

average length    5000 -> avg time:   0.651 ns/byte, 3254.000 ns/call

average length   10000 -> avg time:   0.649 ns/byte, 6491.000 ns/call

average length 1000000 -> avg time:   0.648 ns/byte, 648000.000 ns/call

chqrlie> gcc -std=c99 -O2 benchstrlen.c && ./a.out

average length       0 -> avg time:  10.000 ns/byte,  10.000 ns/call

average length       4 -> avg time:   2.000 ns/byte,  11.000 ns/call

average length      10 -> avg time:   1.048 ns/byte,  11.000 ns/call

average length      50 -> avg time:   0.337 ns/byte,  17.000 ns/call

average length     100 -> avg time:   0.299 ns/byte,  30.000 ns/call

average length     500 -> avg time:   0.202 ns/byte, 101.000 ns/call

average length    1000 -> avg time:   0.188 ns/byte, 188.000 ns/call

average length    5000 -> avg time:   0.174 ns/byte, 868.000 ns/call

average length   10000 -> avg time:   0.172 ns/byte, 1716.000 ns/call

average length 1000000 -> avg time:   0.172 ns/byte, 172000.000 ns/call

edited 49 mins ago

answered yesterday

chqrlie

63.1k850108

edited 49 mins ago

answered yesterday

chqrlie

63.1k850108

answered yesterday

chqrlie

63.1k850108

answered yesterday

chqrlie

63.1k850108

2

It would, but the hand optimized version in the C library might be larger and more complicated to inline. I have not looked into this recently, but there used to be a mix of complex platform specific macros in <string.h> and hard coded optimisations in the gcc code generator. Definitely still room for improvement on intel targets.

– chqrlie
yesterday

1

@chqrlie: The problem I'm trying to highlight here is that people benchmark on "wildly unrealistic in practice" scenarios, then make false assumptions based on the unrealistic results, then optimise code (e.g. in libraries) based on these false assumptions. If strlen() is a bottleneck (e.g. because the strings actually are large) anyone that cares about performance will keep track of string lengths themselves and will not use strlen(); and (for people that care about performance) strlen()` is mostly only used when the strings are too small to bother keeping track of their lengths.

– Brendan
yesterday

4

@chqrlie: I'd also say that this is partly a symptom of a larger problem - code in libraries can't be optimised for any specific case and therefore must be "un-optimal" for some cases. To work around that it would've been nice if there was a strlen_small() and a separate strlen_large(), but there isn't.

– Brendan
yesterday

1

@MatthieuM.: you can look on Godbolt (and note that -march= implies -mtune). gcc -march=skylake -O3 still uses the same 4-byte-at-a-time scalar integer with large cleanup, even though gcc is only inlining strlen at all when it knows the buffer is aligned. Knowing the alignment makes it easy to use SSE2 pcmpeqb / pmovmskb / test/jz, then bsf, without risking page faults for crossing into the next page. (We know calloc gave us 16-byte aligned memory because alignof(maxalign_t) = 16 in x86-64 System V.) Total code-size would probably be smaller because of simpler cleanup.

– Peter Cordes
11 hours ago

1

This is a major missed optimization because SSE2 is baseline for x86-64, and a simple compact inline loop could probably go at least half the speed of the library version for the known-alignment case for large inputs, and probably be better for small inputs, too. (An AVX2 / AVX512 version would require 32-byte / 64-byte alignment to avoid maybe reading into an unmapped page, which we don't have.)

– Peter Cordes
11 hours ago

|
show 12 more comments

2

It would, but the hand optimized version in the C library might be larger and more complicated to inline. I have not looked into this recently, but there used to be a mix of complex platform specific macros in <string.h> and hard coded optimisations in the gcc code generator. Definitely still room for improvement on intel targets.

– chqrlie
yesterday

1

@chqrlie: The problem I'm trying to highlight here is that people benchmark on "wildly unrealistic in practice" scenarios, then make false assumptions based on the unrealistic results, then optimise code (e.g. in libraries) based on these false assumptions. If strlen() is a bottleneck (e.g. because the strings actually are large) anyone that cares about performance will keep track of string lengths themselves and will not use strlen(); and (for people that care about performance) strlen()` is mostly only used when the strings are too small to bother keeping track of their lengths.

– Brendan
yesterday

4

@chqrlie: I'd also say that this is partly a symptom of a larger problem - code in libraries can't be optimised for any specific case and therefore must be "un-optimal" for some cases. To work around that it would've been nice if there was a strlen_small() and a separate strlen_large(), but there isn't.

– Brendan
yesterday

1

@MatthieuM.: you can look on Godbolt (and note that -march= implies -mtune). gcc -march=skylake -O3 still uses the same 4-byte-at-a-time scalar integer with large cleanup, even though gcc is only inlining strlen at all when it knows the buffer is aligned. Knowing the alignment makes it easy to use SSE2 pcmpeqb / pmovmskb / test/jz, then bsf, without risking page faults for crossing into the next page. (We know calloc gave us 16-byte aligned memory because alignof(maxalign_t) = 16 in x86-64 System V.) Total code-size would probably be smaller because of simpler cleanup.

– Peter Cordes
11 hours ago

1

This is a major missed optimization because SSE2 is baseline for x86-64, and a simple compact inline loop could probably go at least half the speed of the library version for the known-alignment case for large inputs, and probably be better for small inputs, too. (An AVX2 / AVX512 version would require 32-byte / 64-byte alignment to avoid maybe reading into an unmapped page, which we don't have.)

– Peter Cordes
11 hours ago

It would, but the hand optimized version in the C library might be larger and more complicated to inline. I have not looked into this recently, but there used to be a mix of complex platform specific macros in <string.h> and hard coded optimisations in the gcc code generator. Definitely still room for improvement on intel targets.

– chqrlie
yesterday

@chqrlie: The problem I'm trying to highlight here is that people benchmark on "wildly unrealistic in practice" scenarios, then make false assumptions based on the unrealistic results, then optimise code (e.g. in libraries) based on these false assumptions. If

strlen() is a bottleneck (e.g. because the strings actually are large) anyone that cares about performance will keep track of string lengths themselves and will not use

strlen(); and (for people that care about performance) strlen()` is mostly only used when the strings are too small to bother keeping track of their lengths.

– Brendan
yesterday

strlen() is a bottleneck (e.g. because the strings actually are large) anyone that cares about performance will keep track of string lengths themselves and will not use

strlen(); and (for people that care about performance) strlen()` is mostly only used when the strings are too small to bother keeping track of their lengths.

– Brendan
yesterday

@chqrlie: I'd also say that this is partly a symptom of a larger problem - code in libraries can't be optimised for any specific case and therefore must be "un-optimal" for some cases. To work around that it would've been nice if there was a strlen_small() and a separate strlen_large(), but there isn't.

– Brendan
yesterday

@MatthieuM.: you can look on Godbolt (and note that -march= implies -mtune). gcc -march=skylake -O3 still uses the same 4-byte-at-a-time scalar integer with large cleanup, even though gcc is only inlining strlen at all when it knows the buffer is aligned. Knowing the alignment makes it easy to use SSE2 pcmpeqb / pmovmskb / test/jz, then bsf, without risking page faults for crossing into the next page. (We know calloc gave us 16-byte aligned memory because alignof(maxalign_t) = 16 in x86-64 System V.) Total code-size would probably be smaller because of simpler cleanup.

– Peter Cordes
11 hours ago

This is a major missed optimization because SSE2 is baseline for x86-64, and a simple compact inline loop could probably go at least half the speed of the library version for the known-alignment case for large inputs, and probably be better for small inputs, too. (An AVX2 / AVX512 version would require 32-byte / 64-byte alignment to avoid maybe reading into an unmapped page, which we don't have.)

– Peter Cordes
11 hours ago

|
show 12 more comments

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Ggthjy