AVR-GCC Compiler Makes Questionable Code
Most people believe that modern compilers generate better-optimized assembly code than humans, but look at this example from AVR-GCC 5.4.0 with -O2 optimization level:
7b96: 10 92 34 37 sts 0x3734, r1 ; 0x803734 <tachFlutter> 7b9a: e0 e0 ldi r30, 0x00 ; 0 7b9c: f0 e0 ldi r31, 0x00 ; 0 7b9e: a0 91 35 37 lds r26, 0x3735 ; 0x803735 <driveTachHalfPeriod> 7ba2: b0 91 36 37 lds r27, 0x3736 ; 0x803736 <driveTachHalfPeriod+0x1> 7ba6: ae 1b sub r26, r30 7ba8: bf 0b sbc r27, r31 7baa: b0 93 89 00 sts 0x0089, r27 ; 0x800089 <OCR1AH> 7bae: a0 93 88 00 sts 0x0088, r26 ; 0x800088 <OCR1AL> 7bb2: 10 92 95 00 sts 0x0095, r1 ; 0x800095 <TCNT3H> 7bb6: 10 92 94 00 sts 0x0094, r1 ; 0x800094 <TCNT3L> 7bba: 32 2d mov r19, r2 7bbc: e0 e0 ldi r30, 0x00 ; 0 7bbe: f0 e0 ldi r31, 0x00 ; 0 7bc0: f0 93 e3 33 sts 0x33E3, r31 ; 0x8033e3 <currentTrackBytePos+0x1> 7bc4: e0 93 e2 33 sts 0x33E2, r30 ; 0x8033e2 <currentTrackBytePos>
This is straight-line code with no branching. All registers and memory references are 8-bit. With AVR-GCC, the register r1 always holds the value 0, so the code is doing this: Set tachFlutter to 0, load driveTachHalfPeriod, set OCR1A to driveTachHalfPeriod minus 0, set TCNT3 to 0, set currentTrackBytePos to 0. There’s also a move of r2 to r19, which is used later, and I’m not sure why the compiler located the instruction here. There are at least three glaring inefficiences:
- the compiler wastes time loading 0 into r30 and r31, when it could have just used r1
- it does this TWICE, when we know r30 and r31 were already zero after the first time
- it subtracts a constant 0 from driveTachHalfPeriod
I can maybe understand the subtraction of constant 0, if there’s another code path that jumps to 7ba6 where the value in r30:r31 isn’t 0. But why wouldn’t the compiler make a completely separate path for that, with faster execution speed when the subtracted value is known to be 0, even if the code size is greater? After all this is -O2, not -Os.
It also appears there’s no optimization for setting multi-byte variables like currentTrackBytePos to zero. Instead of just storing r1 twice for the low and high bytes, the compiler first creates an unnamed 16-bit temporary variable in r30:r31 and sets its value to 0, then stores the unnamed variable at currentTrackBytePos.
This whole block of code could easily be rewritten:
sts 0x3734, r1 ; 0x803734 <tachFlutter> lds r26, 0x3736 ; 0x803736 <driveTachHalfPeriod+0x1> sts 0x0089, r26 ; 0x800089 <OCR1AH> lds r26, 0x3735 ; 0x803735 <driveTachHalfPeriod> sts 0x0088, r26 ; 0x800088 <OCR1AL> sts 0x0095, r1 ; 0x800095 <TCNT3H> sts 0x0094, r1 ; 0x800094 <TCNT3L> mov r19, r2 sts 0x33E3, r1 ; 0x8033e3 <currentTrackBytePos+0x1> sts 0x33E2, r1 ; 0x8033e2 <currentTrackBytePos>
This is much shorter, and avoids using r27, r30, and r31, so there are more free registers available for other purposes.
Read 17 comments and join the conversation17 Comments so far
Leave a reply. For customer support issues, please use the Customer Support link instead of writing comments.
Disclaimers: I write plenty of questionable code of my own, both in compiled languages and assembly. I also have near zero experience with compiler internals and thus am completely unqualified to talk about this, but I’m not going to let that stop me. Lastly, none of this represents the views of $DAY_JOB, where I’m probably only 2 degrees away from the folks that work on avr-gcc (No I don’t work at Microchip).
Despite that I use various ARM processors far more (and where every cycle counts), I still can fluently (mostly) read/write AVR asm, but only really can read ARM (at the same level of effort). A significant portion of that is that I can trust the compiler is usually better than I am, and so my optimization wins are more directed at finding the stupid inlining issue, or the slow inner loop (that maybe needs a small bit of inline ASM). A straight line function that’s too slow, I’m just far more likely to jump into asm on the AVR.
Don’t get me wrong, there’s a bunch of lower hanging fruit (I’m not a compiler developer, and it’s easy to criticize other people’s code) with optimization on arm-gcc (or places where the optimization choices are tuned more for superscalar cores), especially around support code (like the 64×64 multiply), but I don’t go into optimization expecting a 40% win like I do for AVR.
This page I found a while back talks about the 64×64 multiply on Cortex-M0, which IIRC still looks the same as the GCC7 column in GCC10 which is the last time I updated the compiler at $DAY_JOB.
https://www.purposeful.co.uk/arm_rtabi/ARMv6M__aeabi_lmul.html
I think that the primary reason for all of this has quite a lot to do with the sheer number of people that work and contribute (and are paid to do so) to ARM GCC (even if a larger proportion are optimizing for smartphone class processors these days) versus the much smaller (and probably likely mostly working at one company) group of people available to work on AVR support and optimizations. Couple that with being an 8-bit CPU where larger width variables do need to be treated differently at least some of the time, and it’s less of a surprise that optimizations are weirder on the AVR. I suspect there’s some rule about thinking of 16-bit values as pairs of registers when at all possible that means it’s not able to see the zero register as being valid, because a 16-bit store *always* uses a pair of registers.
I’ve got to admit though, my first thought when I saw the title of this pop up in my rss, was something along the lines of “And this is news?” but that’s not a helpful comment or one that adds to the discussion.
Have you tried the same with the most recent AVR GCC? I’m on version 12.2.0 because I use some of the newer Microchip variants. Can you provide some example code to prove the point? Cheers.
It’s also possibly a self perpetuating problem, where because the compiler is not as good at optimizing, most of the people who’d complain the loudest also probably just end up writing it in asm because it gets done on schedule that way, and so there ends up not being the push to improve as much, and so the cycle repeats.
256byteram – At least I’ve never thought about newer versions of the compiler above and beyond that Microchip provides on their toolchain page (7.3.0) even for the May 2022 release. I don’t know if they have a newer version somewhere else (maybe directly in Microchip Studio?) or if you’ve got somewhere that distributes builds of new versions I’ve missed, but all I found were pre-patched versions of 7.3.0 with the newer device packs added automatically (which is what I use at $DAY_JOB for AVR-DB and the newer tiny series). I’d love to consider updating compilers on new projects if there’s newer out there. A GCC 7 build in 2022 did strike me as odd, but I didn’t really think hard enough to realize that there might be newer versions, and even my ARM compilers are GCC10, let alone current development.
So if I’m out of date, and the compilers are way better these days, then I’m sorry for continuing to spout the outdated information.
– https://www.microchip.com/en-us/tools-resources/develop/microchip-studio/gcc-compilers
I don’t think the problem is being out of date, the problem is it’s a *real* pain to update to the most recent version. There are no binary packages available that I could see, so I compiled GCC and binutils from source on Linux. I needed them for a project that uses the ATTiny414. I spent several hours getting it right.
The source code is available here (from one of the mirrors), but really, unless you’re desperate for the latest optimisation routines or hardware support there’s not a lot of point.
https://mirrorservice.org/sites/sourceware.org/pub/gcc/releases/gcc-12.2.0/
https://mirrorservice.org/sites/sourceware.org/pub/binutils/releases/
I did not mean to bash any avr-gcc authors, should they be reading this. I just thought it was interesting that some “simple” straight line code could be optimized so much with hand-written assembly, and it means I need to revisit my development assumptions. I’ve never really looked in detail at the AVR compiler’s output before, because I assumed any speed-ups I might get with hand-crafted code would be minimal.
mind posting the c code for the snippet? along with the variable defs? I’ve managed to have cases where I thought the compiler was doing a bad job but I’d made a bad assumption on what the compiler is allowed to do.
256byteram:
I was afraid you’d say that you compiled them from source. I was hoping you’d tell me I’m bad at searching the web. Although there are Linux/Windows builds of GCC12 available from zak kemble I just found. It is strange that Microchip doesn’t have a release that supports newer chips they want you to use, w/o manual adding stuff. I guess they are pushing their IDE? The folks behind the Arduino DxCoe for AVR-DA/DB/etc provide an updated 7.3.0 with newer devices added, if that interests anybody. I use that compiler release at work, but oviously could be missing out on newer optimzations and features.
Steve:
Yeah, looking at the assembly is really pretty useful, because sometimes you do spot those glaring issues, and can improve them. It’s always a good thing to have in your toolkit especially as you keep adding more newer and more complicated features that need ever more timing critical code paths.
If you wanted to try a newer version easily, you could try either Microchip’s official 7.3.0 releases, or zak’s 12.1.0 releases.
And to put a different spin on it, the optimized code was enough to get you to this point, where you’ve got a successful product (even if you may have had extra pain along the way). And to be honest, AVR-GCC is nowhere near the worst at optimizing code (or the compiler with the most bugs) that I’ve encountered, even from a big name in the industry. But I ranted enough earlier, and that’s a different and only tangentially related rant.
GCC 5.4.0 is not what I would call a “modern compiler”.
> GCC 5.4.0 is not what I would call a “modern compiler”.
How so? It’s only 6 or 7 years old, and was the default compiler for this version of Microchip Studio 7 that I downloaded and installed in 2020. It may not be the latest and greatest, but I’d say it’s certainly “modern”.
In general, I would assume GCC to be quite good at optimizing code. Not perfect, but good. Perfect optimization almost always requires more knowledge than just the bare high-level code (e.g. C) can express.
A second point is what to optimize for. See https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
“-Og” disables a lot of optimization passes, mainly to make the assembly code look like the equivalent of a word-by-word translation. Instructions have the same order as in the high-level code, loops are the same, no smart or dirty tricks. This optimizes the “debugging experience”, i.e. ensures that the debugger does not jump around the high-level code like crazy.
“-Os” and the more aggressive “-Oz” optimizes for smaller code size, at the cost of sometimes using a slower implementation.
“-O1″ to -O3” optimize for execution speed, more aggressively with larger numbers, at the cost of sometimes using a larger implementation or “wasting” resources. See the gcc link above for more details.
At work, we had some negative experiences with AVR-GCC and “-O2”. No one really bothered to look into the details, but code that worked perfectly both with “-O1” and “-Og” misbehaved when compiled with “-O2”. So we simply decided to use only “-O1” for AVR-GCC (as included in Atmelstudio 7).
On a modern x86 CPU, I would assume setting three independant registers to 0, and then using them to write 0 somewhere else, was done to optimize for parallel execution and/or register renaming in the complex x86 microcode queues. But as far as I remember, AVR executes strictly linear. So that does not make much sense.
Another idea might be that some registers are defined as scratch registers that may be changed at any time and don’t have to be safed at all, so it might be reasonable to initialise them close to their usage.
https://gcc.gnu.org/wiki/avr-gcc#Register_Layout confirms that R18 to R27, R30, R31 are call clobbered, i.e. any called function mas clobber them without restoring their contents. ISRs need to save and restore those registers.
Maybe there was a function call between the first and second assignments that was optimized away?
Maybe the GCC simply does not track the use of those registers?
I’ve extracted the relevant code snippet and posted it here: avr-gcc-main.cpp and a slightly longer snippet of the relevant disassembly here: avr-gcc-main.lss You’ll notice a pattern I used twice where instead of something like:
I instead use:
The goal is to avoid doing two separate writes of globalVar from a register back to memory, in the case where it’s incremented and then reset to zero. This worked well in previous testing for 8-bit variables, but it may be throwing off the compiler when 16-bit variables are used.
I tried pasting the C code directly into this comment, but it’s not working even when I escape all the HTML entities. Just use the link to view it. avr-gcc-main.cpp
I tried reproducing the disassembly in Compiler Explorer with AVR GCC 5.4.0 here:
https://godbolt.org/z/1Mq9Y45ov
It seems that with just the snippet the compiler is smart enough to use the zero register directly. There is some extra “subtract by zero” that I think is an artifact of doing 16-bit arithmetic on 8-bit arch.
BTW if you compile with GCC 12.2.0 the disassembly looks even more compact.
Thanks, that’s fascinating. It’s not until AVR GCC 11.1 that the output looks like what I’d expect to see. I’ll have to investigate why the code snippet compiled in isolation with 5.4.0 produces such a different result than in my actual monster-sized function.
There’s a strange behavior (bug?) in the 5.4.0 compiler output from the snippet on godbolt.org that’s not present in my actual compiled code. Look at what’s happening on lines 26-28 of Tony’s assembly listing. It loads driveTachHalfPeriod to r18:r19, then immediately copies the word to r20:r21 and never references r18:r19 again until it’s later overwritten with zeros. I can’t think of any reasonable explanation. None of the other AVR GCC versions do this, neither the older nor the newer versions.
Clang and gcc invest more effort identifying possibilities for platform-independent ways of “optimizing” constructs where the Standard failed to specify things the authors thought were obvious (thus allowing them to process such constructs in astonishing ways) than in avoiding inefficiencies.
Consider, for example, https://godbolt.org/z/zPEWcnc43 which is shown below (but may be garbled):
unsigned mul(unsigned char x, unsigned char y)
{
return x*y;
}
unsigned char arr[140];
void test(unsigned char x)
{
unsigned char i;
unsigned q = 128;
for (i=128; x > i; i++)
q = mul(i,255);
if (130 > x)
arr[x] = q;
}
What would you expect the test() function to do on the AVR if x is 130? Would the generated code:
test:
mov r30,r24
ldi r31,0
subi r30,lo8(-(arr))
sbci r31,hi8(-(arr))
st Z,__zero_reg__
ret
behave in a manner consistent with that?
Note that as the maintainers see it, the astonishing behavior isn’t caused by a bug in gcc, but in the supplied code. Although the authors of the Standard expected that after promoting the argumenting to ‘mul’ to type ‘signed int’, implementations for quiet-wraparound two’s-complement platforms would then perform the multiply in a manner equivalent to using ‘unsigned int’, they didn’t mandate that such platforms do so. Consequently, gcc will cleverly realize that because behavior of the function would only be defined in cases where ‘x’ is 128 or less, and because it is supposed to store the number 128 to ‘arr[x]’ in all such cases, it can generate code that unconditionally stores the value 128 to ‘arr[x]’.
I’m not familiar with the AVR, but on the Cortex-M0, there are some situations where having a function accept a constant a register-qualified argument will allow gcc’s optimizer to generate more efficient code than it would if it could recognize that the value would always be the same (sometimes this can cause gcc -O0 to generate more efficient loops than higher optimization settings!) Perhaps you might try using -Og rather than -O2.
From what I’ve heard, GCC has had massive improvements in code generation recently, even since recent-ish releases like 5.4.0.