-
APP CPU Startup weirdness
10/13/2020 at 21:52
APP CPU startup corrupts the heap
I have been doing some more testing with the unhacked SDK approach and detected a heap corruption issue. At first I thought I had made a mistake when initializing the stack (which I had as well...), but that was not what was causing the problems.
My second suspect was the CPU caching. I didn't turn the cache on, but maybe it's on by default. But no... cache wasn't the issue either.
I then tried to simplify the problem as much as possible and used a trivial infinite loop as the startup address. This doesn't even access the registers. But still, as soon as the APP CPU is started, there is damage on the heap.
It seems some initialization of the APP CPU writes to that memory, causing the corruption. It almost seems there is some code running before the APP CPU jumps to the entry vector specified in APPCPU_CTRL_D_REG. Or it's an initialization sequence by the hardware. It doesn't really matter in the end, since I don't have any influence over it anyway.
I found a comment in the SDK code indicating that there is in fact some kind of ROM code running before the CPU starts.
/* Initialize heap allocator. WARNING: This *needs* to happen *after* the app cpu has booted. If the heap allocator is initialized first, it will put free memory linked list items into memory also used by the ROM. Starting the app cpu will let its ROM initialize that memory, corrupting those linked lists. Initializing the allocator *after* the app cpu has booted works around this problem. ...
I didn't expect that, since writing the startup address into a register seems to be as low-level as it gets, but apparently there is code that is executed before that address is loaded. There is always something new to learn...
There is no way around SDK hacking
Since we now know that we have to initialize the APP CPU before the heap is initialized, and thus before any user code is run, we have to modify the SDK in order to get the APP CPU running without causing damage to the heap.
I decided to add a single function call in cpu_start() immediately before heap_caps_init() is called. That happens in cpu_start.c.
I also put the code on GitHub: https://github.com/Winkelkatze/ESP32-Bare-Metal-AppCPU
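The ordering constraint described above can be illustrated with a toy model that runs on a PC. This is only a sketch: the names (toy_heap_init, toy_rom_init, shared_region) are hypothetical, and modelling the ROM initialization as zeroing the memory is an assumption. All the SDK comment tells us is that the ROM code overwrites memory the allocator would otherwise use for its free lists.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define REGION_SIZE 64u
#define NODE_MAGIC  0xFEEDC0DEu

/* Bookkeeping node that the toy allocator stores inside the memory it manages. */
typedef struct {
    uint32_t magic;   /* marks a valid node */
    uint32_t size;    /* size of the free block in bytes */
} free_node_t;

/* Stand-in for the RAM that both the heap and the APP CPU ROM code touch. */
static union {
    free_node_t node;
    uint8_t     bytes[REGION_SIZE];
} shared_region;

/* Toy heap init: writes the free-list node into the managed memory. */
static void toy_heap_init(void)
{
    shared_region.node.magic = NODE_MAGIC;
    shared_region.node.size  = REGION_SIZE;
}

/* Toy APP CPU start: the "ROM code" overwrites the region (assumed: zeroes it). */
static void toy_rom_init(void)
{
    memset(shared_region.bytes, 0, sizeof shared_region.bytes);
}

/* Returns true if the free-list node still looks valid. */
static bool toy_heap_is_intact(void)
{
    return shared_region.node.magic == NODE_MAGIC &&
           shared_region.node.size == REGION_SIZE;
}
```

Calling toy_heap_init() and then toy_rom_init() leaves the node destroyed, while the reverse order leaves it intact. That is the same reason the real APP CPU start has to happen before heap_caps_init().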
-
Second thoughts about the SDK hacking approach
10/13/2020 at 11:06
Why didn't it work?
I've been wondering why my SDK hacking attempts worked so poorly. But after thinking about it a bit, it should have been obvious from the start:
Synchronization functions heavily depend on the scheduler!
I wanted my bare-metal main to behave like an ordinary FreeRTOS task, yet a lot of things an ordinary task is expected to do (like waiting) are features of the scheduler. But we explicitly don't want the scheduler to be running, so we can't expect these things to work. From the FreeRTOS point of view, it's a bit like an interrupt handler wanting to wait for a mutex (which is forbidden as far as I know).
Possible solutions
Sadly, there isn't much that can be done about that without digging deep into FreeRTOS and either teaching the synchronization functions to busy-wait while running on the second core or using a modified scheduler for the second core that only handles synchronization stuff but leaves the task alone otherwise.
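To make the busy-wait idea a bit more concrete, here is a minimal sketch of a lock that never asks a scheduler to block. It uses plain C11 atomics, not FreeRTOS or SDK code, and the names spin_lock_t, spin_try_lock, spin_lock and spin_unlock are made up for this example. I have not tried to wire something like this into the FreeRTOS synchronization functions.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* A lock that is acquired by spinning instead of blocking in a scheduler. */
typedef struct {
    atomic_flag locked;
} spin_lock_t;

#define SPIN_LOCK_INIT { ATOMIC_FLAG_INIT }

/* Try to take the lock once; returns true on success. */
static inline bool spin_try_lock(spin_lock_t *l)
{
    /* test_and_set returns the previous value: false means the lock was free */
    return !atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire);
}

/* Take the lock, busy-waiting until it becomes free. */
static inline void spin_lock(spin_lock_t *l)
{
    while (!spin_try_lock(l)) {
        /* spin; nothing here depends on a scheduler */
    }
}

/* Release the lock so that another core can take it. */
static inline void spin_unlock(spin_lock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

The price is that the waiting core burns cycles while it spins, which is acceptable on a core that runs nothing else, but it only helps if both sides of a shared resource use the same lock.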
Thou shall not cache!
I thought the SDK hacking approach was at least good for having the cache working and thus being able to execute code from flash. Turns out, it's not that simple. The cache loading functions also rely on the synchronization of FreeRTOS to prevent issues from both cores trying to access the SPI flash at the same time. It also prevents the other core from accessing the cache while it's being updated.
Since, as stated above, the synchronization is not working, cache loads are not possible for now either.
-
Hacking the SDK
10/10/2020 at 16:03
It's finally hacking time ;)
We try to patch the SDK, so we get control over the APP core, while still letting the SDK handle the initialization. Ideally, our main will behave similar to a FreeRTOS task while running bare metal.
Approach
First, we need to figure out what the CONFIG_FREERTOS_UNICORE definition does. After all, we want to replicate that behavior to some extent. So we search through the SDK code and look for code that is dependent on this definition. We can broadly sort the parts into the following categories:
- CPU starting / init code
- Memory regions
- Interrupt allocation / Synchronization / Crosscore
- FreeRTOS
Since we want to keep things simple, we try to not touch the FreeRTOS part and focus on the other three.
Now, we want everything else to think things are running in singlecore mode to avoid modules trying to create tasks on the APP core and to avoid unnecessary memory allocation by FreeRTOS. So we add our new define, which will 'counteract' the UNICORE define in certain places. I named this:
CONFIG_FREERTOS_BAREMETAL_APP_CPU
Therefore, our sdkconfig.h contains both
#define CONFIG_FREERTOS_UNICORE 1
#define CONFIG_FREERTOS_BAREMETAL_APP_CPU 1
Now, in order for our define to do something, we add some code to selected files, which will undefine UNICORE if our define is set.
#ifdef CONFIG_FREERTOS_BAREMETAL_APP_CPU
#undef CONFIG_FREERTOS_UNICORE
#endif
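The effect of this 'counteracting' can be checked in isolation on a PC. The following sketch is not SDK code, and the function dual_core_code_compiled() is made up for the demonstration. It relies on the fact that an undefined macro evaluates to 0 inside #if, so the dual-core branches guarded by #if !CONFIG_FREERTOS_UNICORE become active again in any file that contains the snippet.

```c
/* Both defines as they would come from sdkconfig.h */
#define CONFIG_FREERTOS_UNICORE 1
#define CONFIG_FREERTOS_BAREMETAL_APP_CPU 1

/* The snippet added to selected files */
#ifdef CONFIG_FREERTOS_BAREMETAL_APP_CPU
#undef CONFIG_FREERTOS_UNICORE
#endif

/* Hypothetical helper: reports which branch the preprocessor selected. */
static int dual_core_code_compiled(void)
{
#if !CONFIG_FREERTOS_UNICORE
    return 1;   /* UNICORE is undefined here, so it counts as 0 */
#else
    return 0;
#endif
}
```

Files without the snippet still see CONFIG_FREERTOS_UNICORE as 1 and keep behaving as in single core mode.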
What to change?
As I wrote earlier, we apply the patch to any file related to starting the CPU / memory map / interrupt allocation / synchronization code. The definition is only used in a handful of places, so we can easily check every file and decide whether that part is relevant for us.
Additionally, we need to modify the start code in cpu_start.c to start our main instead of the scheduler.
Doesn't work!
When I tried the approach described above, the system immediately crashed during the dport access init while starting the second CPU. That was when I realized that a lot of the 'basic' SDK code like interrupt allocation actually depends on FreeRTOS functions (like mutexes) and the portNUM_PROCESSORS definition instead of the UNICORE define. And sadly, portNUM_PROCESSORS is referenced quite often. So instead of continuing down that road, I decided to reduce the amount of initialization, so we won't have that problem.
Minimal Hack
Since I determined that getting the Interrupt allocation / Synchronization / Crosscore stuff working is a lot of work, I decided to ignore it for now.
Without that, there are only the following files where I changed something:
- cpu_start.c: CPU starting, initialization
- panic.c: Fatal error message / Core dump (not required, but nice to have)
- spiram.c: Cache flushing (only for external RAM)
- soc_memory_layout.c / .h: Memory map / region definitions
The most important change compared to our manual approach (the one without SDK hacking) is the change in the memory map definition. This way we get cache for our CPU and can execute from external flash. In cpu_start.c we add our code to undefine the UNICORE definition and modify the start_cpu1_default() function. app_cpu_bare_metal_main() is our own main function that is defined in the user code.
#if !CONFIG_FREERTOS_UNICORE
void start_cpu1_default(void)
{
#ifdef CONFIG_FREERTOS_BAREMETAL_APP_CPU
    esp_cache_err_int_init();
    ESP_EARLY_LOGI(TAG, "Starting bare metal main on APP CPU.");
    app_cpu_bare_metal_main();
#else
    // Wait for FreeRTOS initialization to finish on PRO CPU
    while (port_xSchedulerRunning[0] == 0) {
        ;
    }
    ...
Limitations
With this approach, we can execute from flash, so we are allowed to call SDK functions. However, we still need to be careful that the functions are not using mutexes / interrupts. DPort access is also limited, since we bypass the mutual access mitigation of the SDK. So whatever function we call, we first have to check whether it does anything 'forbidden'. Also, 'printf' doesn't work, so we have to fall back on the much more basic 'ets_printf' for debug output.
Conclusion
Sadly, it doesn't work as well as initially hoped for. Due to the complexity, I didn't get the Interrupt / Mutex stuff to work. However, the other fundamental things work. We have cache for the CPU, so it can execute from external flash and not from IRAM only as with the full manual approach. Also, the house-keeping interrupts work, so we detect exceptions and the call stack is no longer limited by the CPU register window.
SDK hacking is certainly less elegant than the fully manual approach, but to have the cache and basic interrupts working is a huge benefit.
EDIT: Don't enable the cache without synchronization working!
While fiddling around with this a bit more, I figured out that enabling the cache on the second CPU without working synchronization is a very bad idea! Cache loads are then not protected, and even if it worked once, two CPUs attempting to access the SPI flash at the same time WILL cause problems sooner or later. This issue may be fixable by modifying cache_utils.c, but I have not tested this.
-
Running a task at max priority
09/08/2020 at 11:34
Method
The idea is simple, just build the SDK in normal Dual Core configuration, but pin all tasks to the first core. We then can run a task on the second core which should (in theory) run uninterrupted by everything.
But: We can't be sure nothing is running on the second core!
The ESP32 ecosystem and firmware is complex. There are so many libraries, so we can't be sure that there is really nothing else running on that core.
Testing
The easiest way to test this would be to toggle an IO in an infinite loop and then look at the output with a scope or a logic analyzer. If it is in fact running uninterrupted, the signal should be a clean square wave. If something is interrupting it, it should get 'stuck' from time to time. Since I have neither at hand, I decided to measure the jitter in the execution time of a simple wait loop. With this method, I can't be sure it runs without interruptions, but I can at least detect some interruptions.
#include <stdio.h>
#include <stdint.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_timer.h"

#define NUM_ROUNDS 1000

void app_task(void *param)
{
    (void)param;
    uint64_t min = UINT64_MAX;
    uint64_t max = 0;
    uint64_t sum = 0;
    uint64_t last = 0;
    // static, since 8 KB of samples may not fit on the task stack
    static uint64_t times[NUM_ROUNDS];

    printf("app_task started on core %i\n", xPortGetCoreID());

    last = esp_timer_get_time(); // microseconds since boot
    for (uint32_t i = 0; i < NUM_ROUNDS; i++) {
        // Wait loop; volatile keeps the compiler from optimizing it away
        for (volatile uint32_t delay = 0; delay < 1000; delay++);

        uint64_t now = esp_timer_get_time();
        uint64_t diff = now - last;
        last = now;
        if (diff < min) {
            min = diff;
        }
        if (diff > max) {
            max = diff;
        }
        sum += diff;
        times[i] = diff;
    }

    printf("app_task finished\nmin=%llu\nmax=%llu\navg=%llu\n", min, max, sum / NUM_ROUNDS);
    for (uint32_t i = 0; i < NUM_ROUNDS; i++) {
        printf("%u=%llu\n", (unsigned)i, times[i]);
    }

    // A FreeRTOS task function must not return
    vTaskDelete(NULL);
}
Results
If the code ran (more or less) uninterrupted, the time values should be almost the same (+/- 1). Only the first may be a bit off, due to the beginning of the loop. But looking at the output, we get something like:
87=120
88=120
89=120
90=121
91=120
92=130
93=120
94=120
95=120
96=120
97=121
So it seems like it takes ~120 µs to execute one run (esp_timer_get_time() reports microseconds). However, run 92 took 130 µs. So unless esp_timer_get_time() itself takes longer from time to time, we got an interruption here! Sadly, I have no way to verify this. But since just a few runs take longer (always by about 10 µs), I guess it's some interrupt handling.
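Picking the slow runs out of 1000 lines of output by eye is tedious, so the check can be automated. The following is a small sketch in plain C that can run on a PC over the recorded values. The names jitter_stats_t and jitter_analyze are made up for this example, and using 'more than a given tolerance above the fastest run' as the criterion for an interruption is my assumption, matching the +/- 1 expectation above.

```c
#include <stddef.h>
#include <stdint.h>

/* Summary of a series of loop timings. */
typedef struct {
    uint64_t min;        /* fastest run */
    uint64_t max;        /* slowest run */
    size_t   outliers;   /* runs more than `tolerance` slower than the fastest */
} jitter_stats_t;

/* Scan the samples and count the runs that look interrupted. */
static jitter_stats_t jitter_analyze(const uint64_t *samples, size_t n,
                                     uint64_t tolerance)
{
    jitter_stats_t s = { UINT64_MAX, 0, 0 };

    /* First pass: find the fastest and the slowest run */
    for (size_t i = 0; i < n; i++) {
        if (samples[i] < s.min) {
            s.min = samples[i];
        }
        if (samples[i] > s.max) {
            s.max = samples[i];
        }
    }

    /* Second pass: count everything that is clearly slower than the fastest run */
    for (size_t i = 0; i < n; i++) {
        if (samples[i] - s.min > tolerance) {
            s.outliers++;
        }
    }
    return s;
}
```

Fed with the eleven values from the excerpt above and a tolerance of 1, this reports a minimum of 120, a maximum of 130 and exactly one outlier.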
Conclusion
FAIL
The simplest, straight-forward approach with just running a task at max prio does not seem to work. I don't know what causes the problems, but it seems something (maybe interrupt handling) is ruining my timing here. Also, we can't be sure the core is actually free, so I can't recommend this method.
-
Build the SDK in single core mode and launch the second core by hand
08/28/2020 at 15:53
Method
At first, we try to not hack the SDK and to avoid any side effects due to the SDK or any library trying to launch a task on the second core. So we want the running system to be completely unaware of the second core running.
So, to start with, we build everything with
CONFIG_FREERTOS_UNICORE 1
Since I'm using MicroPython as a platform, I need to make some further changes to assign everything to the PRO core. If everything is working, you should get a boot message telling you the ESP is running in single core mode.
I (534) cpu_start: Pro cpu up.
I (534) cpu_start: Application information:
I (534) cpu_start: Compile time: Aug 25 2020 15:32:36
I (538) cpu_start: ELF file SHA256: 0000000000000000...
I (543) cpu_start: ESP-IDF: v4.0.1
I (548) cpu_start: Single core mode
I (553) heap_init: Initializing. RAM available for dynamic allocation:
I (560) heap_init: At 3FFAFF10 len 000000F0 (0 KiB): DRAM
I (566) heap_init: At 3FFB6388 len 00001C78 (7 KiB): DRAM
I (572) heap_init: At 3FFB9A20 len 00004108 (16 KiB): DRAM
I (578) heap_init: At 3FFBDB5C len 00000004 (0 KiB): DRAM
I (584) heap_init: At 3FFCA270 len 00015D90 (87 KiB): DRAM
I (590) heap_init: At 3FFE0440 len 0001FBC0 (126 KiB): D/IRAM
I (597) heap_init: At 40078000 len 00008000 (32 KiB): IRAM
I (603) heap_init: At 4009DFE4 len 0000201C (8 KiB): IRAM
I (609) cpu_start: Pro cpu start user code
Looking into the datasheet, we see that the configuration of the second core is straightforward through the DPORT_APPCPU_CTRL_* registers.
To launch the CPU, we first ensure it's not already running by checking the CLKGATE register. If it is disabled, we reset the CPU, load the entry vector and start the CPU by enabling the CLKGATE. We also have to allocate a stack for the second core.
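// Refuse to start if the APP CPU clock is already enabled
if (DPORT_REG_GET_BIT(DPORT_APPCPU_CTRL_B_REG, DPORT_APPCPU_CLKGATE_EN)) {
    printf("APP CPU is already running!\n");
    return;
}

// Allocate the stack for the APP CPU once.
// Note: the stack grows downward, so the value that ends up in a1
// has to be the end of this buffer, not its start.
if (!app_cpu_stack_ptr) {
    app_cpu_stack_ptr = heap_caps_malloc(1024, MALLOC_CAP_DMA);
}

// Reset the APP CPU
DPORT_REG_SET_BIT(DPORT_APPCPU_CTRL_A_REG, DPORT_APPCPU_RESETTING);
DPORT_REG_CLR_BIT(DPORT_APPCPU_CTRL_A_REG, DPORT_APPCPU_RESETTING);

// Load the entry vector, then enable the clock to start the core
printf("Start APP CPU at %08X\n", (uint32_t)&app_cpu_init);
ets_set_appcpu_boot_addr((uint32_t)&app_cpu_init);
DPORT_REG_SET_BIT(DPORT_APPCPU_CTRL_B_REG, DPORT_APPCPU_CLKGATE_EN);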
According to the Tensilica spec, the first thing to do after start is to reset the 'Window' registers. This is a special feature of these CPUs: the general purpose registers are banked, and the banks can be switched with the 'Window' registers. We also need to initialize the stack pointer, which is in the register A1 by convention. We then call our main for the APP CPU. After the main finishes, we let the second core turn off its own clock to halt it.
static void IRAM_ATTR app_cpu_init()
{
    // Reset the reg window. This will shift the A* registers around,
    // so we must do this in a separate ASM block.
    // Otherwise the addresses for the stack pointer and main function will be invalid.
    asm volatile (                  \
        "movi a0, 0\n"              \
        "wsr  a0, WindowStart\n"    \
        "movi a0, 0\n"              \
        "wsr  a0, WindowBase\n"     \
        );

    // init the stack pointer and jump to main function
    asm volatile (                  \
        "l32i a1, %0, 0\n"          \
        "callx4 %1\n"               \
        ::"r"(&app_cpu_stack_ptr),"r"(app_cpu_main));

    DPORT_REG_CLR_BIT(DPORT_APPCPU_CTRL_B_REG, DPORT_APPCPU_CLKGATE_EN);
}
And that's it! We can now have an app_cpu_main() function that runs completely independently of the rest of the system. I verified this works by incrementing a counter in the main function. Every time I start the APP core, I can check whether this counter has been incremented.
Limitations
No cache
The APP CPU cache for external flash access is fixed at address 0x40078000, which is part of the allocatable memory as you can see in the boot log. So we must not enable the CPU's cache. Therefore any code running on that core must be run from the IRAM. That's not very nice, since you have to be very careful when calling any functions. If you only plan to run something rather primitive on that core, it shouldn't be a problem. After all, disabling that core freed 32K IRAM in the first place.
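The conflict can be expressed as a simple range check against the heap regions from the boot log. This is a sketch for illustration only: overlaps_app_cache() and the two constants are names I made up, and the 32 KiB size is taken from the matching line in the boot log.

```c
#include <stdbool.h>
#include <stdint.h>

/* APP CPU cache memory: address from above, size from the boot log (32 KiB) */
#define APP_CACHE_BASE 0x40078000u
#define APP_CACHE_SIZE 0x00008000u

/* Returns true if the region [start, start + len) touches the APP CPU cache memory. */
static bool overlaps_app_cache(uint32_t start, uint32_t len)
{
    /* 64 bit arithmetic, so that start + len cannot wrap around */
    uint64_t end       = (uint64_t)start + len;
    uint64_t cache_end = (uint64_t)APP_CACHE_BASE + APP_CACHE_SIZE;

    return start < cache_end && end > APP_CACHE_BASE;
}
```

Of the regions handed to the allocator in single core mode, only the one at 40078000 (len 00008000) overlaps. The neighbouring IRAM region at 4009DFE4 and the D/IRAM region at 3FFE0440 do not.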
No exception handler
That's not a real limitation, since I simply didn't bother to set / write an exception handler. But be aware that, due to the special architecture of the CPUs, the maximum call stack depth is limited by the register window if you don't have an exception handler. But with the IRAM limitation, a large program on that core is not a good idea anyway.
Conclusion
Yes, it works! But this way is only useful for not-too-complex tasks. Good thing is, no other part of the system knows the core is running so you can be certain this does not have any undesired side-effects.