-
Postmortem
02/09/2017 at 22:58 • 0 commentsI've been meaning to write an update here since the contest winners were announced.
Before I do, congratulations to the winners! Some truly great entries were submitted, and from my earlier peeks at what my competitors were working on, I have to say that I feel the winners are representative of the projects I thought were the most worthy of taking the prize.
In retrospect, I saw this challenge as an opportunity to take code I was already working with, and make something simplified with general utility for others. In this spirit, general utility does not imply a 1KB limit on the firmware size, and I'll be returning some of the functionality to the state it was in before trying to reduce the code to qualify for the challenge.
While there's still work to do in achieving general utility, these are the features that can be expected in these future releases:
- Baud rate support down to 78 baud
- Baud rate dividers which produce bitrates as close as possible to the requested baud rate
- Serial transmit that is non-blocking (one of the compromises to favor code size was to send bytes inside the USB transfer interrupt handler, and this byte transmission would block the handler until all bytes were sent)
- Parity support for transmit and receive
- Buffer overrun detection and reporting
- BREAK generation/detection and reporting
Admittedly, as this work does not have a hard deadline, it will be done in my leisure time where not spent on other projects. I encourage anyone interested in these features to post a comment as motivation, but bear in mind that this is a hobby project and I will not be able to commit to deadlines.
Thanks for reading!
-
Honorable Mention
01/05/2017 at 03:06 • 0 commentsI'd like to talk a little about some of the changes that cut this project down to almost 1/3rd of it's original size. In earlier logs I already mentioned the USB Device Firmware Upgrade (DFU) support, extra USB descriptors, accurate baud-rate calculation, and bootloader/application separation as features that got the axe.
Here follows the remaining bits on the editing room floor.
As the bootloader was meant to handle USB core functionality while not interfering with user firmware, any registers modified by the bootloader had to be backed up prior and restored after use. This included WREG, BSR, STATUS, FSR1, FSR2, and TBLPTR. Some of which are two or even three bytes, and TBLPTR also needed to have bootloader specific values cached in ram for descriptors larger than 1 packet.
Every backup and restore operation required 4 bytes, and removing this constraint reduced the flash size by 104 bytes of code.
Originally, USB descriptors were referenced by a series of tables to simplify the application code required in user application firmware. This table setup supported multiple configurations, strings and any other type of descriptor supported by the USB specification. In the end, this table was replaced with code to specify the descriptor length, and address in flash of the remaining device and configuration descriptors. The tables themselves would have occupied at least 28 bytes of flash, and an additional 84 bytes for the code to scan the tables when a descriptor is requested.
Instead, a SWITCH/CASE block reduced the total footprint of descriptor lookups by 86 bytes.
Initializing the PIC for USB and even general operation typically involves loading default values into a number of registers that control operation of the device. After reviewing the datasheet to determine the reset and power on states of those registers, I found several either didn't need to be specified in code, or could be initialized by a single bit operation. In 16-bit PICs, loading an immediate value to a register is a 2 instruction operation, while single bit flips can be done in a single instruction. By re-arranging code to take advantage of common immediate values, changing immediate constants to single bit-flips, and removing initialization of registers that are already in the correct state at power on/reset I managed to save an additional 50 bytes.
USB Stall and USB Error interrupts had empty handlers assigned that did nothing more than clear the interrupt flag and exit the interrupt handler. During development of the bootloader, they also served to send a notification on the serial port which I'll go into a little more detail shortly. However, in this project they served no purpose, and both these interrupts could be disabled.
Removing the code for these interrupts freed up an additional 16 bytes.
Handing over buffers of filled data to the USB interface engine on the PIC is something done repeatedly by this firmware. This was a very modest saving, but in each case 3 instructions could be replaced with 1 call instruction, and two separate functions were added to handle these cases. Two were required, as these instructions are responsible for setting the number of bytes to be transmitted and setting the USB data toggle bit while handing over ownership of the buffer. The USB data toggle bit tracks if an even or odd packet is being sent for a specific request. This is independent somewhat to the ping/pong even or odd buffering which also has to be managed.
While scheming for ways to reduce the code size I had the idea, which I never made good on, to analyze the binary looking for common sequences of words in the compiled flash. The idea being, that if enough common sequences were found that could be condensed into function calls, then identifying many of these cases could be automated. While I never did get around to doing this, this specific change is exactly the kind of case I expect this analysis to identify. It would likely obfuscate the source code, but this kind of automated analysis could prove useful in the future.
In the end, this minor change saved another 8 bytes.
Next, this minor bit of tuning also helped slightly. RAM accesses in 16-bit PICs are either done through indirect addressing, using the global 'access' page, or using banked ram accesses. Banked accesses are used in a few places, and must be used when building for the extended instruction mode on the PIC18F series. As it turns out opcodes that specify the lower portion of access ram, are instead treated as offsets from FSR2 in extended mode. This firmware actually takes advantage of this when parsing USB requests received from the host, as individual bytes in the received message can be retrieved in a single opcode instead of 3 or more.
Anyway, back to banked register access. By doing bank switching earlier in the USB transfer interrupt handler, I was able to remove 4 other cases. This saved a paltry 6 bytes total.
And finally, while it saved no code space what-so-ever, I'd like to mention my serial debug routines. I removed them primarily to simplify the code in the final submission. Removing this code saved no space, because, similar to how asserts are performed in C, these debug routines are macros which can be toggled to generate no code when not producing debug output. Specific points in code (see the stall and error interrupts above for example) would print a symbol to the serial interface, and optionally dump a series of register values.
By having an ifdef statement with null macros for a non-debug build, and serial print statements for debug builds, I could easily litter the code with printed markers and dumped registers which were helpful in early development. It also had the added benefit of allowing me to determine if the delays imposed by printing to the serial port were the cause of bugs. **spoiler** It was never the delays of pushing bytes through the serial interface. Turns out, USB is somewhat forgiving as long as it sees a response in less than 50ms, which at the baud rates I was using would allow for almost 500 characters sent per USB transaction.
This feature was largely unused by the time I started on the 1KB project, as serial communication was a core part of the end goal. Trying to get meaningful debug prints, while also sending/receiving data on the same serial interface simply wasn't worth the effort by that point. Instead, most of my debugging was done by using the usbmon Linux module, and Wireshark for tracing USB communication.
While not relevant to this project, eventually the goal is to offer a similar debug printout feature using a USB-CDC interface. This should offer a development experience similar to doing Arduino development.
This, combined with the baud rate divider changes and other alterations mentioned in days past, made up enough space to fit the core functionality into 1KB while also including RTS/CTS/DTR/DSR signals.
This post wraps up the software development on this project to this point.
Thanks for reading, and stay tuned for more info on setting up the hardware!
-
Submission
01/03/2017 at 03:15 • 0 commentsAfter spending a few hours reviewing the source code and commenting it so that it will hopefully make some sense, I'm happy to submit the finished source code.
Included in the submitted tgz is a Makefile, and all the source code for assembly. The assembler used is gpasm, which is available in GPUTILS. It also has a line for programming my MCU via the PICKIT2 programmer, using the pk2cmd tool.
To build, the ubiquitous "make all" should work as long as GPUTILS are installed.
Making this code work on any USB capable PIC18F chip should be a simple matter of modifying the Makefile, and adding the necessary block to usb-devices.inc. If you need any help with this, or if you've added a chip and would like to share your changes, please post your comments here on hackaday.io
Stay tuned, I'll be sharing some of the work involved in getting to this point in the days ahead.
-
All the Stops
01/01/2017 at 01:21 • 0 commentsWhile reviewing the code changes to write about all the work required for this project, I found myself fighting to include RTS/CTS support in the final submission. It's been challenging, but as of about five minutes ago I've succeeded in including this feature into the final 1KB hex file.
Accomplishing this took just about every trick I had up my sleeve, but it's done. The only features I would say are missing at this point are serial break generation/detection, parity bit support and framing error detection. While all are "nice to have" features, at this point I'm confident there isn't the remaining space available.
The current code is 7 instructions short of 1KB. Added to this are the configuration register values which take up 14 bytes. It's not clear to me presently if these 14 bytes will be counted in the judges' tally, though I suppose it's better safe than sorry.
If I do find out that the config registers are not counted by the judges, that opens up 7 instructions worth of available space. Probably not enough to add the remaining features, but maybe worth exploring should that situation arise.
With that, I'm off to have a celebratory beer and get ready to ring in the new year. Cheers all!
-
Case Optimization
12/31/2016 at 20:28 • 0 commentsWhile reviewing the new code for identifying which descriptor is being requested by the USB host, I noticed that instead of using the Switch/Case macros I had instead manually implemented the same mechanism of using XOR and conditional branches. "Why not be consistent?" I thought to myself, and set about converting it into the Switch/Case pattern.
While using the macros cleans up the code appearance, and gets rid of some labels... it also increases the code size by a few bytes. Reviewing the reason why, I found that the 'default' case at the end of the block imposes 2 extra branch instructions. And in this specific scenario, the code inside the default block in this example is a branch to the setup_error code. This means the default block is literally 3 branches, when ideally it could be 1.
I looked at the other switch blocks in the code, and found 2 other cases of a default block that contain a single branch. Crafting a specialized case macro that includes this branch, instead of requiring a separate default block, will shave 4 bytes off of each case. It's a little ham-fisted for general use, but for the purposes of the 1KB challenge, seems like a win.
So, instead of this:
CASE SOME_VALUE DO_SOME_STUFF DEFAULT bra somewhere_else ENDSW
I'll wind up with this:
CASEELSE SOME_VALUE, somewhere_else DO_SOME_STUFF ENDSW
And in the process, clean up the descriptor lookup code as well as save 4 bytes per altered statement.
-
Switch, Case and Hardware Call Stacks
12/31/2016 at 03:13 • 0 commentsSo... a lot of code is going to be coming up in the posts ahead and it'll probably easier to digest if I explain a couple facets of what the heck it is I'm doing.
First off, I'd like to dive into the DFU bootloader which has given rise to this project and hopefully many other projects to come. The basic idea, is to take what makes Arduino so great and do it better. That is, take a general purpose microcontroller and make it easy to prototype and deploy. While I started with Microchip's PIC before I'd ever heard the word Arduino, I'm among be the first to complain about their two tiered compiler offering. I chose to write my code in assembler, because it was much easier to write code that would work with both Microchip's and open source assemblers. I also had a fascination with USB, which offers to make devices that can really integrate with our existing computer systems, rather than COM ports requiring additional software.
Ultimately, the goal was a firmware platform that could handle virtually any USB standard, support open specifications for firmware updates, and stay the hell out of the programmer's way.
One of the quirks of PIC's 8-bit architecture is a hardware call stack. This means the devices can handle a depth of N function calls (in this particular case, 32) before the stack overruns and execution effectively halts. Earlier prototypes for the USB logic in the bootloader involved a few nested levels of function calls. This eats into the 32 call budget, and for the bootloader to stay out of the way a better means of segregating the logic was desirable.
The cleanest way to structure the code was to replicate the switch/case model often used in higher languages, and replicating it in assembler wasn't too much trouble. Some creative liberties are taken (break statements are implied) and implementation is tuned to the architecture.
The value being tested if first loaded into the working register, or WREG as it's called on PICS. The case statements flip bits in the working register using a bit-wise exclusive-or (XOR), testing for a zero. A conditional branch skips to the next case statement if the result doesn't match the current case statement, and the process repeats. However, using XOR makes it very intuitive. In a naive implementation, any changes made to WREG need to be undone before checking the next case statement. This would mean XORing the value we just tried to check, and then XORing the value we want to check for a match to. However, these two steps can be combined and these macros simplify the production of this mask by XORing two consecutive case values together. This produces the exact bit flips required to convert the previous case into the current case.
Little moment to blather here... I love XOR. I think mathematically it is one of the neatest transformations around. AND and OR are both very useful and have specific jobs to do, but XOR is very versatile. Taking a value and XORing it with a max value, will perform the bit-wise NOT. Any XOR operation is reversible, by repeating the same operation. And, as a consequence of this last fact, XORing doesn't destroy any data. Let me share a snippet of PIC assembly that sums up the utility of this operation. Lets say, we have a value in the working register, and we want to swap this value with a value in some memory location called regX.
xorwf regX, F ; WREG XOR regX => regX xorwf regX, W ; WREG XOR regX => WREG xorwf regX, F ; WREG XOR regX => regX
Maybe I'm going a little overboard, but this is poetry. Like moving two decks of cards through each other, having them come apart in the same order they went in. To describe the code above step by step, the first instruction takes regX and turns it into a map of the differences between regX and the working register. It then uses this map to transform the value in the working register into the value that was in regX. It then uses this copy of what was in regX (now in the working register, stay with me) and uses this value to transform the map in regX into the value that was originally in the working register. No data is lost in any of these operations, and this is as efficient in terms of instructions executed as using a temporary ram location.
So, every case statement is exactly one transformation instruction, followed by a conditional branch. Admittedly, it's not a jump table like the case statements of C, but 3-4 cycles per mismatched case is plenty fast for most purposes.
Using the swtich/case pattern in the bootloader allows calls into user code from the interrupt handler, with a worst case of 3 levels of stack occupied. Two of those levels are used by the interrupt handlers, and the remaining one for either a bootloader call or a call into user code. The rest of the stack is available to the developer of the supplemental firmware.
I'm uploading the macros for case/select to this project. They're not commented in any meaningful sense, and the notable complexity comes from supporting nested case statements. But the core concept being automated by the scripts is the application of XOR to input values being tested, and skipping non-matching blocks of logic.
-
Stress Testing
12/30/2016 at 07:58 • 0 commentsTook the time today to hook up two chips running the firmware and do some bandwidth testing. I have to say, I'm pretty impressed.
Testing consisted of sending a 3MB file of random data through 2 chips, and using WinMerge to compare the files. I've been testing on my machine using RealTerm to send and capture the data, but noticed an issue where even at higher bitrates the throughput wasn't increasing. Google to the rescue.
Turns out, the version of RealTerm I was using had a known issue where transfers would max out near 15KB/s, and all I had to do was download an updated version.
Transfers are very reliable up to 2 megabits per second with both chips on different USB endpoints. At 3 megabits however, I start to see what appears to be byte mangling. As the errors often appear to be bit errors localized to a few single bytes, this seems to indicate that the loss either due to noise on the serial line (about 12cm of breadboarding wire) or else a manifestation of the clocks of the two PICs being slightly out. I might be able to run both off of a single clock to test that theory, as well as pairing the serial connection with a twisted ground to try cut out noise.
The clocks are suspect, as both chips are using PLLs to kick up the frequency to what's required for USB full speed operation. One chip has a 24MHz oscillator, and the other has a 20MHz crystal resonator. The PLL in particular might be making the chips more sensitive to clock problems, but running both chips of the single 24MHz oscillator should be simple enough as long as the wires aren't too long. Call me lazy, but I don't think I'll relocate one of the PICs to shorten the clock traces to test a theory.
These speed tests have only been half duplex as well. At higher speeds, RealTerm seems to jam up on one side. That is, it seems like it's RealTerm jamming up based on the two instances I run becoming non-responsive after initiating the test. (It's not my code! I swear!) I might have to try sending between two PCs, as it seems to be related to running two instances of the program on the same PC.
Also: while it puts the firmware over 1KB, I'm actually testing with RTS/CTS currently. With the deadline looming, I've yet to decide if I'll be dropping that feature or not. Maybe I can squeeze another 50 bytes out. I think before I open that can of worms, I'll write a couple more logs going over what I've removed to make the firmware fit. The source is littered with code commented out, and once I pay tribute to the fallen opcodes, I can proceed to remove them and publish the code.
So, yeah! 2 megabits per second sustained transfer speed! not too shabby!
-
Flow Control
12/29/2016 at 01:25 • 0 commentsNext is exploring the possibility of adding RTS/CTS signalling to the demo.
Doing the outbound RTS/DTR signals were pretty easy, and only add an additional 22 bytes to the code. (22 bytes, I just happened to have made available) Just set up an IO port for the signals (3 instructions, 6 bytes) and add an additional case to the class request handler (8 instructions, 16 bytes). However, putting out these signals is only somewhat useful if we can't detect these signals from a remote device using the CTS&DSR inputs.
CASE CDC_SETCONTROLSTA read2w wValueL xorwf LATB, W, ACCESS andlw 0x03 xorwf LATB, F, ACCESS ;bit 0=DTR, 1=RTS bra setdone_xmitzero
Adding support for CTS/DSR is where it gets ugly. I've considered using interrupts for this, however, noisy inputs could easily generate unwanted state updates to the host. The next best solution is to check if the input pins for CTS/DSR change when handling the USB Start Of Frame (SOF) interrupt. This works, and guarantees we see at most 1 update per millisecond, however sending this SERIAL_STATUS message to the USB host requires not only this change detection, but also sending a 10-byte message to the host.
lfsr FSR1, 0x418 rcall fsr2_to_fsr1 movlw 0xA1 movwf POSTINC2, ACCESS ;bmRequestType movlw 0x20 movwf POSTINC2, ACCESS ;bRequest clrf POSTINC2, ACCESS ;wValue clrf POSTINC2, ACCESS movlw 0x01 movwf POSTINC2, ACCESS ;wIndex clrf POSTINC2, ACCESS movlw 0x02 movwf POSTINC2, ACCESS ;wLength clrf POSTINC2, ACCESS rrncf ctsdtr, W, BANKED andlw 0x03 movwf POSTINC2, ACCESS ;UART state clrf POSTINC2, ACCESS movlw 0x0A movwf PREINC1, ACCESS ; respond with 10 bytes decf FSR1L, F, ACCESS movlw 0x80 | USBPARITY movwf INDF1, ACCESS
My naive attempts to do this add at most 37 instructions, totaling 74 bytes! Even under ideal circumstances, like more USB ram for data buffering will still require at least 62 additional bytes of code.
Lets consider for a moment, instead of sending the SERIAL_STATE header inline as in the snippet above, what if I load the 8 byte-header into flash and send it from there? The 8 byte header is currently 12 instructions, that take 24 bytes of code space. Instead, it could be 8 bytes for the header itself, and the existing send_descriptor function used to send at a cost of 4 instructions for 8 bytes to setup and make the function call. That could reduce the size to a possible 54 additional bytes.
At this point, I doubt that kind of memory is hiding in the code base to be scavenged. Though, perhaps it would be possible if I scrap the baud-rate divider, and hard-code support for 2 or 3 baudrates.
Sorry flow control, looks like you won't make the cut.
-
Innocence lost
12/26/2016 at 08:01 • 0 commentsThis is the story of how I hacked my beautiful baud rate divider down to a measly 74 bytes. It wasn't without compromise, but as with any good task of engineering, it's all about trade-offs. So lets begin.
Start: 144 bytes
To start off, the PIC18F USART supports different divider ranges selected using the BRGH bit in the TXSTA register. Having BRGH clear has almost no redeeming value, except for supporting baud rates below 183 when running at 48MHz. Setting BRGH however, increases the resolution of the divider by exactly 4. What this implies, is that any baud rate producible with BRGH clear, can also be done with BRGH set with the exception of rates below 183. So, removing support for low baudrates has no other drawback. This shaved off 42 bytes.
Remaining: 102 bytes
Next up, after the division is completed, the PIC specifies the divider to be one less than the result of the division. But, of course, the division doesn't always result in a nice round number, so the code I had to refactor down would use the remainder to determine whether the answer would round up or down, and include the implied subtraction in that rounding operation. It was tight. It felt like an elegant solution. Alas, fudging the numerator, and doing a subtraction still results in the best divider for all of the standard rates above 300 baud which I've checked. I bumped up the numerator from 12,000,000 to 12,001,024, and subtract 1 from the result. Net savings for this compromise came in at another 28 bytes.
Final size: 74 bytes
On to the next one...
-
Baud rates & Ping-pong
12/26/2016 at 06:08 • 0 commentsAfter much pruning, it looks like the baud rate division calculation can be saved. I did remove support for very low baud rates (under 180 baud) which shaved off a few instructions. But I'll admit I'm rather proud of that feature, and am eager to include it in the final submission.
I also looked into removing ping-pong buffering which will reduce the code overhead slightly, however, supporting data from Serial to USB Host will require more code without ping-pong unless I allow for some occasional byte loss when the USB buffer is busy sending to the host. However, the MCU has a mode where all endpoints have ping-pong buffering enabled EXCEPT endpoint 0. This should reduce the code overhead for the more common control-transfer case, but still streamline serial data transfer. I'll explore this in the days to come, a quick look suggests that ping pong on EP0 is responsible for 38 bytes.
Also, the division function for baud rates is not original art. It started from a copy of a 24-bit division routine found online by Tony Nixon. In a later post, I'll also document the process I took in refactoring it.