-
Multiplication / Division...
11/26/2016 at 10:54 • 19 comments
These can be *huge* operations... Here are some thought-points on alternatives.
Again, these techniques won't save you much space (nor maybe *any*) if you use libraries which make use of them... So, when space is a concern, you're probably best-off not using others' libraries.
------
So, here's an alternative... Say you need to do a multiplication of an integer by 4...
A *really* fast way of doing-so is (a<<2), two instructions on an AVR.
If you need to do a division by 4? (a>>2), two instructions on an AVR.
(Beware that signed-integer operations may be a bit more difficult).
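To make the signed-integer caveat concrete, here's a minimal sketch (the function names are mine, purely for illustration). For unsigned values the shifts match *4 and /4 exactly. For negative values, a plain (a>>2) is implementation-defined in C and typically rounds toward minus-infinity, whereas a/4 rounds toward zero, so the signed version below shifts the magnitude and restores the sign afterwards.

```c
#include <stdint.h>

/* Multiply by 4 with a shift; exact for unsigned values (modulo 256). */
uint8_t times4(uint8_t a) { return (uint8_t)(a << 2); }

/* Divide by 4 with a shift; exact for unsigned values. */
uint8_t div4(uint8_t a) { return (uint8_t)(a >> 2); }

/* Signed divide by 4 that rounds toward zero, the way a/4 does. */
int8_t sdiv4(int8_t a)
{
    uint8_t mag = (uint8_t)(a < 0 ? -a : a);  /* magnitude; -(-128) is 128, which fits */
    int8_t q = (int8_t)(mag >> 2);
    return (int8_t)(a < 0 ? -q : q);
}
```

Whether the extra sign-handling is still smaller than a call to a division routine is something to measure, not assume.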
.....
Another alternative is... a bit less-self-explanatory, and likely quite a bit more messy...
In most cases, there will be *numerous* functions automatically-generated which handle multiplications/divisions between integers of different sizes. That's a lot of code generated which mightn't be necessary with some pre-planning.
(and don't even try to use floating-point... I'm not certain, but I'm guessing a floating-point division function alone is probably close to 1kB).
----------
ON THE OTHER HAND: Some architectures have inbuilt support for some of these things... E.G. whereas
(a<<3)
might require three instructions on any AVR,
(uint8_t)a * (uint8_t)8
may be only *one* instruction on a megaAVR containing a MULT instruction (the actual AVR mnemonic is MUL), but may be darn-near-countless instructions on a tinyAVR.
Read that again... On both architectures, using <<3 may result in exactly *three* instructions, whereas on one architecture (e.g. megaAVR) *8 may result in *one* instruction, while on another (e.g. tinyAVR) it may result in loading two registers, calling a function, and a return. AND, doing-so requires not only the instructions to *call* that function, but also the function itself, which may be *numerous* instructions...
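If you want to see this for yourself, here's a small sketch of the two spellings as separate functions (the names times8_shift and times8_mul are made up for this example). They always return the same value; what differs, possibly, is the code the compiler emits for each target.

```c
#include <stdint.h>

/* Same result, two spellings; which is smaller depends on the target. */
uint8_t times8_shift(uint8_t a) { return (uint8_t)(a << 3); }
uint8_t times8_mul(uint8_t a)   { return (uint8_t)(a * (uint8_t)8); }
```

Compiling this with avr-gcc -Os -S, once with -mmcu=atmega8515 and once with -mmcu=attiny861, and reading the two assembly listings shows what each one costs on each chip. Don't be surprised if the optimizer emits identical code for both functions.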
---------
OTOH, again... Say you're using a TinyAVR, where a MULT instruction isn't already part of the architecture's instruction-set. If you're using other libraries which use the mult8() function, (e.g. by using a*b), mult8() *will* be included, regardless of whether you figure out other means using e.g. << throughout your own code.
There comes a point where even using << may result in *more* instructions than a call to the mult8() function which has already been included by other libraries.
(e.g. <<7 might be seven instructions, but if the mult8() function has already been included, then you only need to load two registers, and jump/call, which is only something like 3 instructions...)
There are lots of caveats, here... It will definitely take *longer* to execute mult8(), but it will take *fewer* (additional) instructions, in the program-memory to call it. Again, that is, assuming mult8() is compiled into your project, via another call from elsewhere.
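The bookkeeping above can be written down as a tiny decision helper. This is only a sketch of the reasoning: the instruction counts are the rough guesses from the text (one shift instruction per bit, about three instructions per call-site), not measurements, and the names are hypothetical.

```c
#include <stdbool.h>

/* Assumed, not measured: load two registers + call. */
enum { CALL_SITE_INSTRUCTIONS = 3 };

/* Assumed cost of multiplying an 8-bit value by 2^n with shifts: one per bit. */
unsigned shift_cost(unsigned n) { return n; }

/* True when calling the already-linked multiply routine is the smaller choice.
   If the routine is not linked in yet, the call would drag in the whole function. */
bool call_is_smaller(unsigned n, bool mult_already_linked)
{
    return mult_already_linked && CALL_SITE_INSTRUCTIONS < shift_cost(n);
}
```

So call_is_smaller(7, true) holds, but call_is_smaller(7, false) doesn't, and neither does call_is_smaller(2, true).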
-----------------------------------------------------------------------------------------------------------------------
TODO: This needs revision. Thank you @Radomir Dopieralski, for bringing it to my attention, in the comments below! As he pointed-out, the level of "micro-optimization" explained in this document can actually bite you in the butt if you're not careful. Optimizers generally know the most-efficient way to handle these sorts of things for the specific architecture, and often find ways that are way more efficient than we might think.
E.G. as explained earlier, (x*64) can be rewritten as (x<<6).
If your microcontroller has a MULT instruction, (x*64) may, in fact, require the fewest number of instructions.
If your microcontroller *doesn't* have MULT, then the optimizer (or you) might choose to replace it with (x<<6), which might result in six left-shift instructions. (or possibly a loop with one left-shift and a counter).
But there are many other cases we mere-mortals may never think of. E.G. some microcontrollers have a "nibble-swap" instruction, where the high-nibble and low-nibble quite literally swap places. So, the optimizer *might* see (x<<6) on an 8-bit value and instead replace it with, essentially, (nibbleSwap(x & 0x0f) << 2). That's four instructions (mask, swap, shift, shift), rather than six.
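Here's a quick sketch to convince yourself the nibble-swap trick really is equivalent for 8-bit values. nibble_swap() is a portable stand-in I wrote for the hardware instruction (SWAP on AVR); the other names are just for illustration.

```c
#include <stdint.h>

/* Portable stand-in for a nibble-swap instruction. */
uint8_t nibble_swap(uint8_t x) { return (uint8_t)((x << 4) | (x >> 4)); }

/* x << 6 the plain way: six single-bit shifts on an 8-bit machine. */
uint8_t shl6_plain(uint8_t x) { return (uint8_t)(x << 6); }

/* x << 6 as mask, swap, then two shifts. */
uint8_t shl6_swap(uint8_t x)
{
    return (uint8_t)(nibble_swap(x & 0x0f) << 2);
}
```

Masking first keeps only the low nibble, the swap moves it up four places, and the two remaining shifts finish the job.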
And then, as described earlier, there's the case where _mult8() is already in your code, and the optimizer (-Os for *size* not speed) might recognize that it only takes three instructions to call _mult8().
TODO: The point, which I completely forgot in writing this late "last night", wasn't to encourage you to replace your multiplications (e.g. x*64) with shift-operations (x<<6), but to be aware that code *can* be hand-tuned/optimized, when considered *carefully* (this takes a lot of experimentation, too!) and the results may not be ideal for all platforms/architectures or even for all devices in the same architecture! And, further, doing-so *may* bite you in the butt if done from the start... (e.g. you design around *not* using _mult8(), but then later down the road realize you *have to* for something else, now your code-size increases dramatically *and* your "micro-optimizations" are slightly less efficient than merely calling _mult8())
-------
E.G. consider (x*65)...Do you *need* that much precision? If not, you might be able to get away with thinking about how your architecture will handle the operation... If your architecture has a MULT instruction, then you probably don't need to worry about it, but if it *doesn't* x*65 may very well result in *quite a few operations* that you don't need... If x*64 is close-enough, then using that *might* be *significantly* smaller in code-size and execution-time.
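As a sketch of what's being traded: since 65 = 64 + 1, the exact product is one shift-by-six plus one add, and the "close-enough" version just drops the add. The function names here are mine, and whether either one beats what your compiler generates is something to check in the listing.

```c
#include <stdint.h>

/* Exact: x*65 == (x<<6) + x. */
uint16_t times65_exact(uint8_t x)
{
    return (uint16_t)(((uint16_t)x << 6) + x);
}

/* Approximate: x*64 only; the result is low by exactly x (roughly 1.5%). */
uint16_t times65_approx(uint8_t x)
{
    return (uint16_t)((uint16_t)x << 6);
}
```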
Note that this is a bit *in-depth* in that if somewhere else in your code (or libraries you've used) a similar operation is performed, then your compiled code will have a function like _mult8(a,b) linked-in... Calling that may only result in 3 additional instructions ( load registers a and b, call _mult8() ) whereas, again, remember that (x<<6) might result in *six* instructions. BUT: If you *know* that _mult8() is *not* used anywhere else, and you *know* that you don't absolutely need it, then you'll save *dozens* of instructions by making sure it's *never* used (and therefore not linked-in).
Think of this like the floating-point libraries... If you use floating-point, your code-size will likely grow by SEVERAL KB. If you throw usage of things like sin() or whatnot, that'll add significantly more. But if you *don't* use them, then they won't be added to your code-size. (This is similar to what happens with using global-variables which are initialized vs. those which aren't, described in a previous log). These aren't functions that *you've* written, they're essentially libraries that are automatically added whenever deemed-necessary.
Oy this is way too in-depth.
And, really, it requires quite a bit of experimentation.
TODO: A note on optimizers... -Os will most-likely consider other options such as the nibble-swap example given earlier, but some other optimization-levels will take your code word-for-word. Think you can outsmart it? :)
-------------------
Realistically, these techniques may only be useful if you've got complete control over all your code, and they're *considered* along-the-way, but only implemented *at the end* to squeeze out a few extra bytes...
-
AVR project doing nada = 58Bytes, and some experiments/results.
11/26/2016 at 04:46 • 0 comments
First, note: I'm using avr-gcc, directly, rather than going through e.g. WinAVR, or Arduino...
And be sure to check that previous log! I am *not* using stdio, as that's *huge*, but it takes some effort to make certain it's not included.
--------------
Here I've created an AVR project with nothing but the following, code-wise...
    #include <avr/io.h>
    #include <stdint.h>
    #include <inttypes.h>

    int main(void)
    {
        while(1)
        {}
    }
This project compiles with the following specs, output by 'avr-size'
    _BUILD/minStartingPoint.elf  :
    section     size      addr
    .text         58         0
    .data          0   8388704
    .stab       1200         0
    .stabstr    2993         0
    .comment      17         0
    Total       4268
As I understand the contest's requirements, this qualifies as 58 Bytes toward our 1kB limit.
-------
Now, what happens when we add a global-variable?
    #include <avr/io.h>
    #include <stdint.h>
    #include <inttypes.h>

    uint8_t globalVar; // = 0;

    int main(void)
    {
        while(1)
        {}
    }
Now we get:
    _BUILD/minStartingPoint.elf  :
    section     size      addr
    .text         74         0
    .data          0   8388704
    .bss           1   8388704
    .stab       1212         0
    .stabstr    3010         0
    .comment      17         0
    Total       4314

Toward the contest-requirements, I believe this qualifies as 74 Bytes toward our 1kB limit.
Note that I did not initialize the global variable... If I'd've initialized it to 0, we'd have *exactly* the same results. (Uninitialized global/static variables are always initialized to 0, per the C standard. THIS DIFFERS from *non-global*/*non-static* local-variables, which are *not* presumed to be 0 by default.)
-----------
But what happens when we initialize it to some other value?
    #include <avr/io.h>
    #include <stdint.h>
    #include <inttypes.h>

    uint8_t globalVar = 0x5a;

    int main(void)
    {
        while(1)
        {}
    }

    _BUILD/minStartingPoint.elf  :
    section     size      addr
    .text         80         0
    .data          2   8388704
    .stab       1212         0
    .stabstr    3010         0
    .comment      17         0
    Total       4321

NOW, note... our ".data" section has increased from 0 to 2. (and our .bss section has dropped from 1 to 0).
As I understand, the ".data" section counts toward both your RAM and ROM/Flash usage.
Why both? Because the global-variable is *initialized* to the value 0x5a. The variable itself sits in RAM, but flash-memory is necessary to store the initial-value so it can be written to the RAM at boot.
As I understand, there's a bit of code hidden from us that essentially iterates through a lookup-table writing these initial-values to sequential RAM locations, which will then become your memory-locations for your global/static variables.
Note, again, this didn't happen when the global-variable was uninitialized (or initialized to 0) because there's no need for a lookup-table to store a bunch of "0"s, sequentially. Instead, there's a separate piece of hidden-from-us code that loads '0' to each sequential RAM location used by "uninitialized" globals/statics.
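To picture those two hidden routines, here's a host-side sketch of what they do, conceptually. Everything in it is a stand-in I made up: on a real chip the linker script supplies the region addresses, the initial-values are read from flash, and the actual routines are hand-written assembly in the toolchain's startup code.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-ins for the memory regions involved. */
static const uint8_t flash_data_image[] = { 0x5a, 0xa5, 0xef }; /* initial values, kept in flash */
static uint8_t ram_data[sizeof flash_data_image];                /* .data: initialized globals */
static uint8_t ram_bss[4];                                       /* .bss: zero-initialized globals */

/* Copy each initial value from the flash image to its RAM location. */
static void copy_data(void)
{
    for (size_t i = 0; i < sizeof ram_data; i++)
        ram_data[i] = flash_data_image[i];
}

/* Clear every .bss byte; no table needed, since the value is always 0. */
static void clear_bss(void)
{
    for (size_t i = 0; i < sizeof ram_bss; i++)
        ram_bss[i] = 0;
}

/* What runs, conceptually, before main(). */
void startup_init(void)
{
    copy_data();
    clear_bss();
}
```

The copy loop needs a source table in flash, and that table is what makes ".data" count twice; the zeroing loop needs no table at all.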
SO...
As I understand, per the contest-requirements, the above example counts as 80+2 = 82 Bytes toward the 1kB limit.
-------
I'm just guessing, here, but I imagine it went to *2* rather than *1* because they indicate the end of the initialization/"lookup-table" with a "null"-character = 0... So, most-likely, if you add a second initialized global-variable the .data section will be 3 Bytes.
Let's Check:
    #include <avr/io.h>
    #include <stdint.h>
    #include <inttypes.h>

    uint8_t globalVar = 0x5a;
    uint8_t globalVar2 = 0xa5;

    int main(void)
    {
        while(1)
        {}
    }

Well, color-me-stupid...

    section     size      addr
    .text         80         0
    .data          2   8388704
    .stab       1224         0
    .stabstr    3028         0
    .comment      17         0
    Total       4351

.... and three?

    #include <avr/io.h>
    #include <stdint.h>
    #include <inttypes.h>

    uint8_t globalVar = 0x5a;
    uint8_t globalVar2 = 0xa5;
    uint8_t globalVar3 = 0xef;

    int main(void)
    {
        while(1)
        {}
    }

    section     size      addr
    .text         80         0
    .data          4   8388704
    .stab       1236         0
    .stabstr    3046         0
    .comment      17         0
    Total       4383

Uh Huh...!
So, maybe the init-routine handles 16-bit words at a time... might make sense, since 'int' is 16-bits.
Anyways, that's probably irrelevant.
But, do note that the ".text" section hasn't grown at all.
--------
So, again, this last example would most-likely count toward 84 Bytes of the 1kB limit.
......
The key, here, is that when you "Flash" your chip, it will flash ".text" + ".data" bytes to the flash/program memory...
So, regardless of this contest, the end-result is that even if your .text section is smaller than your flash-memory space (say .text is 8190 bytes on a chip with 8192 bytes of flash), your project still might not "fit" in your flash-memory, because the .data bytes have to fit in there, too.
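The arithmetic is simple enough to write down as a sanity-check. This is just a sketch with made-up function names: feed it the ".text" and ".data" sizes from avr-size and your chip's flash size.

```c
#include <stdint.h>
#include <stdbool.h>

/* Bytes written to flash: the code plus the initial values for .data. */
uint32_t flash_bytes(uint32_t text, uint32_t data) { return text + data; }

/* .bss is deliberately absent: it occupies RAM only. */
bool fits_in_flash(uint32_t text, uint32_t data, uint32_t flash_size)
{
    return flash_bytes(text, data) <= flash_size;
}
```

With the numbers from the three-variable example, flash_bytes(80, 4) is 84. And a .text of 8190 bytes plus those same 4 bytes of .data does *not* fit in 8192 bytes of flash, even though .text alone would.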
I remember this being *quite confusing* when I first ran into it... so maybe this'll help save you some trouble.
.......
As far as the other sections... The ones listed here, from avr-size, don't really count. They contain stuff like debugging information that's stored in your compiled binary-file, but not written to flash.
Oh, and if I wasn't clear, ".bss," it seems, tells you the amount of RAM used by *uninitialized* global/static variables... which doesn't count toward the amount of program-memory used.
.....
Side-Note: When working with projects with limited memory, it's probably wise to *not* use many large local variables... E.G. say you've a string...
    void printHello(void)
    {
        char string[80] = "Hello";
        char *charPtr = &(string[0]);

        while(*charPtr != '\0')
        {
            uart_putChar(*charPtr);
            charPtr++;
        }
    }

If you have several such functions, it might make more sense to have *one* *global* string array which can be (cautiously) reused between these functions... Otherwise, your stack can fill up quite-quickly, and stack-overflows are *really* confusing when they occur.
Similarly, wise not to use Malloc, etc.
And another plus-side of doing-so is that it shows up in your ".bss" section, so you have an idea of how much memory you're using, and how much stack is available.
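Here's a minimal sketch of that shared-buffer idea. The buffer name "scratch", the second function printBye(), and the UART stub (which just records characters so this builds off-target) are all hypothetical; on real hardware uart_putChar() would be your own transmit routine.

```c
#include <stdint.h>
#include <string.h>

/* One statically-allocated scratch buffer; it lands in .bss, so avr-size reports it. */
static char scratch[80];

/* Hypothetical UART stub: records output instead of transmitting it. */
static char uart_log[160];
static uint8_t uart_len;
static void uart_putChar(char c)
{
    if (uart_len < sizeof uart_log - 1)
        uart_log[uart_len++] = c;
}

/* Send whatever is currently in the shared buffer. */
static void print_scratch(void)
{
    for (const char *p = scratch; *p != '\0'; p++)
        uart_putChar(*p);
}

void printHello(void)
{
    strcpy(scratch, "Hello");   /* scratch is only ours until we return */
    print_scratch();
}

void printBye(void)
{
    strcpy(scratch, "Bye");     /* reuses the same 80 bytes */
    print_scratch();
}
```

The "cautiously" part matters: nothing stops an interrupt, or a function called mid-way, from scribbling over the buffer while you're still using it.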
------
Here's another interesting aside... I just noticed when rereading this:
Did you notice that the ".text" section increased by only 6 bytes when we changed our uninitialized global variable to an initialized one? Seems fishy... I highly doubt they can fit looping through a lookup-table in only six bytes' worth of instructions...
I wonder if they only include each of the two different initialization-routines *when needed*...
    #include <avr/io.h>
    #include <stdint.h>
    #include <inttypes.h>

    uint8_t globalVar = 0x5a;
    uint8_t globalVar2 = 0xa5;
    uint8_t globalVar3 = 0xef;
    uint8_t uninitializedGlobalVar;

    int main(void)
    {
        while(1)
        {}
    }

    section     size      addr
    .text         96         0
    .data          4   8388704
    .bss           1   8388708
    .stab       1248         0
    .stabstr    3076         0
    .comment      17         0
    Total       4442

Ah hah! The only change was adding of another "uninitialized" == (initialized-to-zero), global-variable, and now the ".text" size has jumped from 80 Bytes to 96 Bytes.
So it would seem that the "zeroing" routine for "uninitialized" global/static variables occupies 16 bytes, and the "lookup-table"-initialization routine occupies 22 Bytes.
Hey, wanna save a few bytes? Can you get away with converting all your global/static variables to either initialized or "uninitialized"? Might be something in there...
----------
Note: The above tests were performed on an ATmega8515...
The last-experiment shown was since run on an ATtiny861...
If anyone wonders about the differences in functionality of different systems, take a look here. Here's the result from the above test on a different processor of the same architecture:
    section     size      addr
    .text        100         0
    .data          4   8388704
    .bss           1   8388708
    .stab       1248         0
    .stabstr    3093         0
    .comment      17         0
    Total       4463

Check that... The Tiny861 requires 4 more bytes of code-space to do the exact same thing.
Is there a lesson, here? Nah... just, remember that the instruction-set may have something to do with code-space-usage. (And, maybe, if you've designed something on a Tiny AVR that *just* exceeds 1024 Bytes, then you might be able to recompile it for a Mega AVR and save a few bytes...)
-
AVRs, stdio, printf, etc...
11/25/2016 at 11:56 • 0 comments
Look here... There's no way you're going to fit printf, or any of its incarnations, in 1kB...
Whether you knew that already, or not, the problem is, the default options may include it, anyhow... Even if you don't explicitly "#include <stdio.h>" anywhere in your code.
So, you've gotta get rid of it!
Unfortunately, this ain't for the faint-of-heart...
----------
First things, first... You have to Make Certain you're not using stdio.h *anywhere* in your code...
This may not be easy to do... it's entirely plausible some of the libraries you use *might* make use of it... Or, it's equally possible that by #including something *else* that stdio.h will be inherently-included.
I mean, basically what it boils down to is this: if you use formatted-text, anywhere, then most-likely you're using printf, or one of its variants (sprintf, etc), plausibly indirectly...
Can't help yah, much, here... but here's what I do:
Try adding this to your makefile:
CFLAGS += -E -dM
(put it at the top, unless "CFLAGS =" is somewhere below... which'd overwrite it...)
Then, when you compile (e.g. run 'make'), you won't get an executable, but instead you'll get the Preprocessor's output. This'll be in your *.o files...
Most Likely the "build" will "fail"... because, it's trying to create an executable from .o files which don't contain executable code. Just ignore that, this is exactly what we want to happen.
Now, search for all the *.o files, and look for _STDIO_H_...
Here's one way to do that... In the project-folder execute the following (in BASH):
find ./ -name \*.o -exec grep -H STDIO {} \;
That'll search for all the files ending in .o, and for each one found, search within those files for "STDIO"
Here's a result indicating that my main.o file somehow makes use of stdio...

    $ find ./ -name \*.o -exec grep -H STDIO {} \;
    ./_BUILD/main.o:#define _STDIO_H_ 1
But, note, this does NOT indicate that main.c necessarily contains "#include <stdio.h>" Just that one of the files included within main.c (or one of the files included by one of those included files) contains "#include <stdio.h>". So you might have to dig through all those #includes... In my case, "#include <stdio.h>" is in main.h, and main.c contains '#include "main.h"'.
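Once you've cleaned things up, one way to keep stdio.h from sneaking back in is a compile-time tripwire. This is only a sketch, and it assumes the guard macro is _STDIO_H_, as seen in the grep output above; other C libraries use different guard names, so this check is avr-libc specific. Put it *after* all the #includes in a source-file:

```c
#include <stdint.h>
/* ... the project's other #includes go here ... */

/* Tripwire: fail the build if any header above pulled in <stdio.h>.
   _STDIO_H_ is the guard macro from the grep output (an avr-libc detail). */
#ifdef _STDIO_H_
#error "stdio.h was included, directly or indirectly"
#endif
```

This catches only the #include itself. It says nothing about what the linker-flags drag in, which is a separate problem.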
(There's an easier way to do this: I vaguely recalled a preprocessor-option (like the '-E -dM' added earlier) that shows a hierarchy of #includes. In gcc that's '-H', which prints the name of each header as it's included, indented by nesting-depth.)
----------
So... I lucked out, I don't actually make use of the functions provided by stdio.h (e.g. printf), AND the only location of "#include <stdio.h>" is within my own code which compiles into main.o, I just need to locate it (again, turns out it's in main.h), and remove that line...
But, again, this may occur in various other libraries, etc... And those libraries might require it... and therein I can't help you.
------
But here's the kicker:
Even though removing that line from main.h results in the above search resulting in *zero* matches... stdio *may very well* still be linked into your project, using up a LOT of program-space for something that's never even used.
This inclusion of the stdio-library occurs, most-likely, somewhere in your makefiles...
Here's a snippet I wrote up in my own custom makefiles... This is long-since forgotten-knowledge, so I hope it's pretty explanatory, or at least gives some idea of where to look for more info.
This is from my file avrCommon.mk:
    # From the avr-libc user-manual:
    # (http://www.nongnu.org/avr-libc/user-manual/group__avr__stdio.html)
    # " Since the full implementation of all the [features of vfprintf] becomes
    # " fairly large, three different flavours of vfprintf() can be selected
    # " using linker options. The default vfprintf() implements all the
    # " mentioned functionality except floating point conversions. A minimized
    # " version of vfprintf() is available that only implements the very basic
    # " integer and string conversion facilities, but only the '#' additional
    # " option can be specified using conversion flags (these flags are parsed
    # " correctly from the format specification, but then simply ignored).
    # " This version can be requested using the following compiler options:
    # " -Wl,-u,vfprintf -lprintf_min
    # " If the full functionality including the floating point conversions is
    # " required, the following options should be used:
    # " -Wl,-u,vfprintf -lprintf_flt -lm
    # So, if NONE of the following are defined:
    # AVR_MIN_PRINTF, AVR_FLOAT_PRINTF, nor AVR_NO_STDIO
    # Then we will get the default version of stdio/vfprintf, etc.
    # The Default is slightly larger than printf_min,
    # but also slightly smaller than printf_flt
    #
    # But, this can be problematic if you're not using *any* references to
    # functionality provided by stdio
    #
    # If Anywhere within your source-code "#include <stdio.h>" is mentioned,
    # EVEN IF YOU DON'T USE ITS FUNCTIONS
    # the default (and somewhat large) stdio-library WILL BE LINKED-IN.
    #
    # So, you might think choosing printf_min is the right way to go...
    # but, even then, if stdio's functionality is *unused*
    # you'll be linking-in a bunch of (albeit smaller) *unused* functions
    # So far, these options are only to be set in the project's makefile
    # And setting AVR_NO_STDIO is risky if stdio.h is actually included
    # Since it compiles the default version, which is larger than min...
    # Which also doesn't have floating-point support.

    ifneq ($(AVR_NO_STDIO), TRUE)
    ifeq ($(AVR_MIN_PRINTF), TRUE)
    LDFLAGS += -Wl,-u,vfprintf -lprintf_min
    else
    ifeq ($(AVR_FLOAT_PRINTF), TRUE)
    # Floating point printf version (requires -lm below)
    LDFLAGS += -Wl,-u,vfprintf -lprintf_flt
    endif
    endif
    endif

    # So put this in your makefile (assuming it references this one):
    # and uncomment one...
    #Only one is paid-attention-to, in decreasing priority...
    #AVR_NO_STDIO = TRUE
    #AVR_MIN_PRINTF = TRUE
    #AVR_FLOAT_PRINTF = TRUE

    # NOTE that if AVR_NO_STDIO is true AND you make no reference to stdio.h
    # anywhere in your code, or its dependencies...
    # Then code-size should be smaller than choosing MIN_PRINTF, because
    # vfprintf will not be linked in at all
    # HOWEVER: If AVR_NO_STDIO is TRUE AND stdio.h is referenced (maybe by
    # mistake?) then code-size will be *larger* than having chosen MIN_PRINTF
    # Because the Default vfprintf will be used.
    # It's confusing and hokey.

    # Example a/o LCDrevisited2012-27:
    # stdio.h is not referenced anywhere
    # with AVR_MIN_PRINTF=TRUE, codesize is 8020 Bytes
    # with AVR_NO_STDIO=TRUE, codesize is 7048 Bytes
    # That's nearly 1/8th of the codeSpace, or 1KB, taken up by functions
    # that're never used! (e.g. vfprintf was linking-in, but not used)
    # (Previously, LDFLAGS was set as in AVR_MIN_PRINTF as a default, in
    # this file, avrCommon.mk)
    # with AVR_FLOAT_PRINTF=TRUE, codesize overflows by 1664 Bytes.
    # so that's... 9856 Bytes
(Note that AVR_NO_STDIO, AVR_MIN_PRINTF, and AVR_FLOAT_PRINTF are my own options, and MOST LIKELY WILL NOT exist in your makefile)
So, if I'm parsing that long-forgotten-knowledge correctly, the key is to FIRST: Make CERTAIN you're not using stdio.h AT ALL (even in dependencies), as described, earlier... Then Make CERTAIN you *don't* have reference in your makefiles to anything like:
    LDFLAGS += -Wl,-u,vfprintf -lprintf_min
NOR
    LDFLAGS += -Wl,-u,vfprintf -lprintf_flt
AGAIN, it's a bit counter-intuitive, because if you're *NOT* using stdio, then the "printf_min" option will *increase* your code-size.
--------
And, of course, after you've done all this, make sure to remove (or comment-out) that line added to your makefile, earlier:
CFLAGS += -E -dM
-
Exact Delays
11/25/2016 at 07:33 • 0 comments
One way to create an exact delay is to use NOPs...
Say you're working with a processor/microcontroller which has an external memory bus used for both its RAM and its ROM... capable of addressing 64KB of external memory (16 address bits). E.G., maybe, the 8051.
Now, say your external RAM/ROM chips fit within only the first 32KB of address-space...
Now... say you've numerous cases where you need an *exact* delay of anywhere from 8 to 32768 instruction-cycles... (including intermediates, like 9, 35, or 32754).
----------
Now, let's say your processor's NOP is 0xff... And its "return" (from a call/jump) is 0x81...
So...
Tie some pull-up resistors to the data-bus... (creating a NOP whenever an address doesn't select the RAM/ROM).
Use some glue-logic to detect when address 0xffff is selected (a 16-input NAND gate?)
Tie that NAND gate's output to a 74244/245's /OE input (Output-Enable, active-low), and tie the '244's inputs to 0x81 (return).
-----
A tiny bit of precalculation would tell you *which* address to jump to, outside the RAM/ROM (in "pull-resistor-space"), to get your exact-delay...
Don't forget to keep in mind the number of instruction-cycles to calculate that jump-point and execute the jump (which is why I arbitrarily threw in '8' as a minimum) and the number of instruction-cycles it takes to return.
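That precalculation might look something like this sketch. The assumptions are mine, for illustration only: every NOP takes one instruction-cycle and occupies one address, the jump-in and the return together cost OVERHEAD_CYCLES, and the RAM/ROM occupies 0x0000-0x7fff so the NOP-space starts at 0x8000. The names are hypothetical, too.

```c
#include <stdint.h>

/* Assumed combined cost of computing/executing the jump and of the return. */
enum { OVERHEAD_CYCLES = 8 };

#define RETURN_ADDR    0xFFFFu  /* the one address that reads back as "return" */
#define FIRST_NOP_ADDR 0x8000u  /* first address above the 32KB of RAM/ROM */

/* Address to jump to for a delay of `cycles`; 0 means "not achievable". */
uint16_t delay_entry(uint32_t cycles)
{
    if (cycles < OVERHEAD_CYCLES)
        return 0;
    uint32_t nops = cycles - OVERHEAD_CYCLES;   /* NOPs to execute before the return */
    if (nops > RETURN_ADDR - FIRST_NOP_ADDR)
        return 0;                               /* would start inside the RAM/ROM */
    return (uint16_t)(RETURN_ADDR - nops);
}
```

So the minimum delay jumps straight to 0xffff (zero NOPs, then the return), and each additional cycle moves the entry-point one address lower. The real overhead has to come from your processor's datasheet.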
(inspired by: https://hackaday.io/contest/18215-the-1kb-challenge/discussion-70338)