Evaluate ways to run code on one of the ESP32's cores without the scheduler interfering
I have been doing some more testing with the unhacked SDK approach and detected a heap corruption issue. At first I thought I had made a mistake when initializing the stack (which I had as well...), but that was not what was causing the problems.
My second suspect was the CPU caching. I didn't turn the cache on, but maybe it's on by default. But no... cache wasn't the issue either.
I then tried to simplify the problem as much as possible and used a trivial infinite loop as the startup address. This doesn't even access the registers. But still, as soon as the APP CPU is started, there is damage on the heap.
It seems some initialization of the APP CPU writes to that memory, causing the corruption. It almost seems as if some code runs before the APP CPU jumps to the entry vector specified in APPCPU_CTRL_D_REG. Or it's an initialization sequence performed by the hardware. It doesn't really matter in the end, since I don't have any influence over it anyway.
I found a comment in the SDK code indicating that there is in fact some kind of ROM code running before the CPU starts.
/* Initialize heap allocator. WARNING: This *needs* to happen *after* the app cpu has booted.
If the heap allocator is initialized first, it will put free memory linked list items into
memory also used by the ROM. Starting the app cpu will let its ROM initialize that memory,
corrupting those linked lists. Initializing the allocator *after* the app cpu has booted
works around this problem.
...
I didn't expect that, since writing the startup address into a register seems to be as low-level as it gets, but apparently there is code that is executed before that address is loaded. There is always something new to learn...
Since we now know that we have to start the APP CPU before the heap is initialized, and thus before any user code runs, we have to modify the SDK in order to get the APP CPU running without causing damage to the heap.
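The ordering problem can be illustrated with a small host-side toy model. This is only a sketch and not the ESP-IDF allocator: region, LIST_MAGIC and all toy_* functions are hypothetical stand-ins, with toy_app_cpu_boot() playing the role of the ROM code that clears the shared memory when the APP CPU is released.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define REGION_WORDS 16          /* hypothetical size of the shared region */
#define LIST_MAGIC   0xF5EEB10Cu /* marker standing in for a free-list item */

static uint32_t region[REGION_WORDS];

/* Stand-in for heap_caps_init(): puts a free-list item into the region. */
static void toy_heap_init(void) { region[0] = LIST_MAGIC; }

/* Stand-in for the ROM code that runs when the APP CPU is started. */
static void toy_app_cpu_boot(void) { memset(region, 0, sizeof(region)); }

static bool toy_heap_intact(void) { return region[0] == LIST_MAGIC; }

/* Returns true if the free-list item survives the chosen start-up order. */
bool boot_order_keeps_heap_intact(bool start_app_cpu_first)
{
    memset(region, 0xFF, sizeof(region));
    assert(!toy_heap_intact()); /* the region starts without a list item */

    if (start_app_cpu_first) {
        toy_app_cpu_boot();
        toy_heap_init();
    } else {
        toy_heap_init();
        toy_app_cpu_boot(); /* overwrites the list item */
    }
    return toy_heap_intact();
}
```

In this model, only the order "APP CPU first, heap second" leaves the list item untouched, which is the same reasoning the SDK comment gives.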
I decided to add a single function call in cpu_start() immediately before heap_caps_init() is called. That happens in cpu_start.c.
I also put the code on GitHub: https://github.com/Winkelkatze/ESP32-Bare-Metal-AppCPU
I've been wondering why my SDK hacking attempts worked so poorly. But after thinking about it a bit, it should have been obvious from the start:
I wanted my bare-metal main to behave like an ordinary FreeRTOS task, yet a lot of the things an ordinary task is expected to do (like waiting) are features of the scheduler. But we explicitly don't want the scheduler to be running, so we can't expect these things to work. From the FreeRTOS point of view, it's a bit like an interrupt that wants to wait for a mutex (which is forbidden, as far as I know).
Sadly, there isn't much that can be done about that without digging deep into FreeRTOS and either teaching the synchronization functions to busy-wait while running on the second core, or using a modified scheduler for the second core that only handles the synchronization and otherwise leaves the task alone.
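To make the busy-wait idea concrete, here is a minimal host-side sketch of a lock that spins instead of asking a scheduler to block. It uses standard C11 atomics; bare_spinlock_t and the bare_spin_* functions are hypothetical names and not part of FreeRTOS or ESP-IDF, and this is not how the project itself solves the problem.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock type, for illustration only. */
typedef struct {
    atomic_flag locked;
} bare_spinlock_t;

#define BARE_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Try once; returns true if the lock was taken by this call. */
bool bare_spin_try_lock(bare_spinlock_t *lock)
{
    assert(lock != NULL);
    return !atomic_flag_test_and_set_explicit(&lock->locked,
                                              memory_order_acquire);
}

/* Busy-wait until the lock is free. Never yields to a scheduler. */
void bare_spin_lock(bare_spinlock_t *lock)
{
    while (!bare_spin_try_lock(lock)) {
        /* spin */
    }
}

/* Release the lock so that another core can take it. */
void bare_spin_unlock(bare_spinlock_t *lock)
{
    assert(lock != NULL);
    atomic_flag_clear_explicit(&lock->locked, memory_order_release);
}
```

A lock like this only makes sense when the holder runs on another core and releases it quickly; spinning on a lock held by a task on the same core would never finish.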
I thought the SDK hacking approach was at least good for getting the cache working and thus being able to execute code from flash. Turns out, it's not that simple. The cache loading functions also rely on the FreeRTOS synchronization to prevent issues when both cores try to access the SPI flash at the same time. It also prevents the other core from accessing the cache while it's being updated.
Since, as stated above, the synchronization is not working, cache loads are not possible either for now.
It's finally hacking time ;)
We try to patch the SDK so that we get control over the APP core while still letting the SDK handle the initialization. Ideally, our main will behave similarly to a FreeRTOS task while running bare metal.
First, we need to figure out what the CONFIG_FREERTOS_UNICORE definition does. After all, we want to replicate that behavior to some extent. So we search through the SDK code and look for code that depends on this definition. We can broadly sort the parts into the following categories:
Since we want to keep things simple, we try to not touch the FreeRTOS part and focus on the other three.
Now, we want everything else to think things are running in single-core mode, to avoid modules trying to create tasks on the APP core and to avoid unnecessary memory allocation by FreeRTOS. So we add our new define, which will 'counteract' the UNICORE define in certain places. I named this:
CONFIG_FREERTOS_BAREMETAL_APP_CPU
Therefore, our sdkconfig.h contains both
#define CONFIG_FREERTOS_UNICORE 1
#define CONFIG_FREERTOS_BAREMETAL_APP_CPU 1
Now, in order for our define to do something, we add some code to selected files, which will undefine UNICORE if our define is set.
#ifdef CONFIG_FREERTOS_BAREMETAL_APP_CPU
#undef CONFIG_FREERTOS_UNICORE
#endif
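The effect of this override on a single source file can be checked with a small stand-alone sketch. The two config macros are the ones named above; file_sees_dual_core() is a hypothetical helper that only reports which branch of an SDK-style #if the file was compiled with.

```c
#include <assert.h>

/* As in sdkconfig.h: both options are set. */
#define CONFIG_FREERTOS_UNICORE 1
#define CONFIG_FREERTOS_BAREMETAL_APP_CPU 1

/* The patch added at the top of a selected source file. */
#ifdef CONFIG_FREERTOS_BAREMETAL_APP_CPU
#undef CONFIG_FREERTOS_UNICORE
#endif

/* The override must not disturb our own option. */
static_assert(CONFIG_FREERTOS_BAREMETAL_APP_CPU == 1,
              "bare-metal option must stay set");

/* Hypothetical helper: returns 1 if the dual-core branch was compiled in. */
int file_sees_dual_core(void)
{
#if !CONFIG_FREERTOS_UNICORE /* an undefined macro evaluates to 0 here */
    return 1;
#else
    return 0;
#endif
}
```

Files without the patch still see CONFIG_FREERTOS_UNICORE as 1 and keep compiling their single-core branches, which is why the override has to be applied file by file.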
As I wrote earlier, we apply the patch to any file related to starting the CPU / memory map / interrupt allocation / synchronization code. The definition is only used in a handful of places, so we can easily check every file and decide whether that part is relevant for us.
Additionally, we need to modify the start code in cpu_start.c to start our main instead of the scheduler.
When I tried this as described above, the system immediately crashed during the DPORT access init while starting the second CPU. That was when I realized that a lot of the 'basic' SDK code, like interrupt allocation, actually depends on FreeRTOS functions (like mutexes) and on the portNUM_PROCESSORS definition instead of the UNICORE define. And sadly, portNUM_PROCESSORS is referenced quite often. So instead of continuing down that road, I decided to reduce the amount of initialization so we won't have that problem.
Since I determined that getting the interrupt allocation / synchronization / cross-core stuff working is a lot of work, I decided to ignore it for now.
Without that, there are only the following files where I changed something:
The most important change compared to our manual approach (the one without SDK hacking) is the change in the memory map definition. This way we get cache for our CPU and can execute from external flash. In cpu_start.c we add our code to undefine the UNICORE definition and modify the start_cpu1_default() function. app_cpu_bare_metal_main() is our own main function that is defined in the user code.
#if !CONFIG_FREERTOS_UNICORE
void start_cpu1_default(void)
{
#ifdef CONFIG_FREERTOS_BAREMETAL_APP_CPU
    esp_cache_err_int_init();
    ESP_EARLY_LOGI(TAG, "Starting bare metal main on APP CPU.");
    app_cpu_bare_metal_main();
#else
    // Wait for FreeRTOS initialization to finish on PRO CPU
    while (port_xSchedulerRunning[0] == 0) {
        ;
    }
...
With this approach, we can execute from flash, so we are allowed to call SDK functions. However, we still need to be careful that the functions are not using mutexes / interrupts. DPort access is also limited, since we bypass the mutual access mitigation of the SDK. So whatever function we call, we have to check first, if this does anything 'forbidden'. Also, 'printf' doesn't work, so we have to fall back on the much more basic 'ets_printf'...
The idea is simple: just build the SDK in the normal dual-core configuration, but pin all tasks to the first core. We can then run a task on the second core, which should (in theory) run uninterrupted by anything else.
The ESP32 ecosystem and firmware are complex. There are so many libraries that we can't be sure there is really nothing else running on that core.
The easiest way to test this would be to toggle an IO in an infinite loop and then look at the output with a scope or a logic analyzer. If it is in fact running uninterrupted, the signal should be a clean square wave. If something is interrupting it, it should get 'stuck' from time to time. Since I have neither at hand, I decided to measure the jitter in the execution time of a simple wait loop. With this method, I can't be sure it runs without interruptions, but I can at least detect some interruptions.
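#include <stdint.h>
#include <stdio.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_timer.h"

#define NUM_ROUNDS 1000

void app_task(void *param)
{
    (void)param;
    uint64_t min = UINT64_MAX;
    uint64_t max = 0;
    uint64_t sum = 0;
    uint64_t last = 0;
    // 8 KB of samples live on this task's stack, so the task needs a large enough stack.
    uint64_t times[NUM_ROUNDS];

    printf("app_task started on core %i\n", xPortGetCoreID());
    last = esp_timer_get_time(); // microseconds since boot
    for (volatile uint32_t i = 0; i < NUM_ROUNDS; i++)
    {
        // Fixed amount of busy work; 'volatile' keeps the compiler from removing the loop.
        for (volatile uint32_t delay = 0; delay < 1000; delay++);
        uint64_t now = esp_timer_get_time();
        uint64_t diff = now - last;
        last = now;
        if (diff < min)
        {
            min = diff;
        }
        if (diff > max)
        {
            max = diff;
        }
        sum += diff;
        times[i] = diff;
    }

    printf("app_task finished\nmin=%llu\nmax=%llu\navg=%llu\n", min, max, sum / NUM_ROUNDS);
    for (uint32_t i = 0; i < NUM_ROUNDS; i++)
    {
        printf("%u=%llu\n", (unsigned)i, times[i]);
    }

    // A FreeRTOS task function must never return, so the task deletes itself.
    vTaskDelete(NULL);
}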
If the code ran (more or less) uninterrupted, the values in times should be almost the same (+/- 1). Only the first may be a bit off, due to the beginning of the loop. But looking at the output, we get something like:
87=120
88=120
89=120
90=121
91=120
92=130
93=120
94=120
95=120
96=120
97=121
So it seems to take ~120 µs to execute one run (esp_timer_get_time() reports microseconds). However, the 92nd run took 130 µs. So, unless esp_timer_get_time() itself takes longer from time to time, we got an interruption here! Sadly, I have no way to verify this. But since just a few runs take longer (always by about 10 µs), I guess it's some interrupt handling.
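Spotting such runs by eye gets tedious with 1000 samples. As a rough sketch, the slow rounds could be counted automatically by comparing every sample against the fastest one. count_slow_rounds() and its tolerance parameter are hypothetical and not part of the original test code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count samples that exceed the fastest sample by more than `tolerance`. */
size_t count_slow_rounds(const uint64_t *times, size_t n, uint64_t tolerance)
{
    assert(times != NULL && n > 0);

    /* The fastest round serves as the baseline for an undisturbed run. */
    uint64_t min = times[0];
    for (size_t i = 1; i < n; i++) {
        if (times[i] < min) {
            min = times[i];
        }
    }

    size_t slow = 0;
    for (size_t i = 0; i < n; i++) {
        if (times[i] - min > tolerance) {
            slow++;
        }
    }
    return slow;
}
```

Applied to the eleven samples shown above with a tolerance of 1, only the run that took 130 is counted, while the two runs at 121 stay within the expected +/- 1.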
FAIL
The simplest, straight-forward approach with just running a task at max prio does not seem to work. I don't know what causes the problems, but it seems something (maybe interrupt handling) is ruining my timing here. Also, we can't be sure the core is actually free, so I can't recommend this method.
At first, we try to not hack the SDK and to avoid any side effects due to the SDK or any library trying to launch a task on the second core. So we want the running system to be completely unaware of the second core running.
So, to start with, we build everything with
CONFIG_FREERTOS_UNICORE 1
Since I'm using MicroPython as a platform, I need to make some further changes to assign everything to the PRO core. If everything is working, you should get a boot message telling you the ESP is running in single core mode.
I (534) cpu_start: Pro cpu up.
I (534) cpu_start: Application information:
I (534) cpu_start: Compile time: Aug 25 2020 15:32:36
I (538) cpu_start: ELF file SHA256: 0000000000000000...
I (543) cpu_start: ESP-IDF: v4.0.1
I (548) cpu_start: Single core mode
I (553) heap_init: Initializing. RAM available for dynamic allocation:
I (560) heap_init: At 3FFAFF10 len 000000F0 (0 KiB): DRAM
I (566) heap_init: At 3FFB6388 len 00001C78 (7 KiB): DRAM
I (572) heap_init: At 3FFB9A20 len 00004108 (16 KiB): DRAM
I (578) heap_init: At 3FFBDB5C len 00000004 (0 KiB): DRAM
I (584) heap_init: At 3FFCA270 len 00015D90 (87 KiB): DRAM
I (590) heap_init: At 3FFE0440 len 0001FBC0 (126 KiB): D/IRAM
I (597) heap_init: At 40078000 len 00008000 (32 KiB): IRAM
I (603) heap_init: At 4009DFE4 len 0000201C (8 KiB): IRAM
I (609) cpu_start: Pro cpu start user code
Looking into the datasheet, we see that configuring the second core is straightforward through the DPORT_APPCPU_CTRL_* registers.
To launch the CPU, we first ensure it's not already running by checking the CLKGATE_EN bit. If it is disabled, we reset the CPU, load the entry vector and start the CPU by enabling the clock gate. We also have to allocate a stack for the second core.
// Nothing to do if the APP CPU clock is already enabled.
if (DPORT_REG_GET_BIT(DPORT_APPCPU_CTRL_B_REG, DPORT_APPCPU_CLKGATE_EN))
{
    printf("APP CPU is already running!\n");
    return;
}

// Allocate the stack for the APP CPU once.
if (!app_cpu_stack_ptr)
{
    app_cpu_stack_ptr = heap_caps_malloc(1024, MALLOC_CAP_DMA);
}

// Pulse the reset bit to reset the APP CPU.
DPORT_REG_SET_BIT(DPORT_APPCPU_CTRL_A_REG, DPORT_APPCPU_RESETTING);
DPORT_REG_CLR_BIT(DPORT_APPCPU_CTRL_A_REG, DPORT_APPCPU_RESETTING);

// Load the entry vector, then start the CPU by enabling its clock gate.
printf("Start APP CPU at %08X\n", (uint32_t)&app_cpu_init);
ets_set_appcpu_boot_addr((uint32_t)&app_cpu_init);
DPORT_REG_SET_BIT(DPORT_APPCPU_CTRL_B_REG, DPORT_APPCPU_CLKGATE_EN);
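One detail worth keeping in mind with the stack allocation: the stack grows downwards on Xtensa, so the initial stack pointer normally has to point at the end of the allocated block, not at its start. The helper below is a host-side sketch of that calculation. initial_stack_pointer() is a hypothetical name, and the 16-byte alignment is an assumption about the ABI that should be checked against the Xtensa documentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_ALIGN 16u /* assumed stack alignment requirement of the ABI */

/* Return the highest suitably aligned address inside (base, base + size]. */
uintptr_t initial_stack_pointer(const void *base, size_t size)
{
    assert(base != NULL && size >= STACK_ALIGN);

    /* One past the last byte of the block: the stack grows down from here. */
    uintptr_t top = (uintptr_t)base + size;

    /* Round down to the alignment boundary. */
    return top & ~(uintptr_t)(STACK_ALIGN - 1);
}
```

With a 1024-byte block as in the snippet above, the result lies at most 15 bytes below the end of the block and is what would be loaded into A1 instead of the block's start address.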
According to the Tensilica spec, the first thing to do after start is to reset the 'Window' registers. This is a special feature of these CPUs: the general-purpose registers are banked, and the banks can be switched with the 'Window' registers. We also need to initialize the stack pointer, which is in register A1 by convention. We then call our main for the APP CPU. After the main finishes, we let the second core turn off its own clock to halt it.
static void IRAM_ATTR app_cpu_init()
{
    // Reset the reg window. This will shift the A* registers around,
    // so we must do this in a separate ASM block.
    // Otherwise the addresses for the stack pointer and main function will be invalid.
    asm volatile ( \
        "movi a0, 0\n" \
        "wsr a0, WindowStart\n" \
        "movi a0, 0\n" \
        "wsr a0, WindowBase\n" \
    );
    // init the stack pointer and jump to main function
    asm volatile ( \
        "l32i a1, %0, 0\n" \
        "callx4 %1\n" \
        ::"r"(&app_cpu_stack_ptr),"r"(app_cpu_main));
    DPORT_REG_CLR_BIT(DPORT_APPCPU_CTRL_B_REG, DPORT_APPCPU_CLKGATE_EN);
}
And that's it! We can now have an app_cpu_main() function that runs completely independently of the rest of the system. I verified this works by incrementing a counter in the main function. Every time I start the APP core, I can check whether this counter has been incremented.
The APP CPU cache for external flash access is fixed at address 0x40078000, which is part of the allocatable memory, as you can see in the boot log. So we must not enable the CPU's cache. Therefore, any code running on that core must run from IRAM. That's not very nice, since you have to be very careful when calling any functions. If you only plan to...
Turns out it was simply the latest cpu_configure_region_protection stalling core 1 somehow. Reverting to the old one did the trick!
Updated my repo with an example running on ESP-IDF v4.2!
Very cool project Daniel. :)
I'm working on using the ESP32 for high speed FOC of a BLDC motor and CAN bus communication. The bare-metal approach is just what we need for high speed ISR's with low latency. I'm also working on a compact fast cooperative kernel executing hierarchical state-machines directly from a table, to handle more complex hard real-time tasks.
Freeing up the App core for the hard real-time stuff, while maintaining the Pro core for the slower and more complex FreeRTOS stuff, will give us the best of both worlds in one cheap module. :)
I replicated your work and added an example of using an interrupt on CORE1:
https://github.com/darthcloud/esp32_baremetal_core1_bitbang_test
This is all based on ESP-IDF v4.0.2.
The ESP32 init code has changed a bit since that version to accommodate the new chips, and with your current bare_metal_app_cpu.c, CORE1 does not start. So there's more to do than moving the hook to the new heap_caps_init location.
I guess something that used to be initialized for CORE1 regardless of CONFIG_FREERTOS_UNICORE is not done anymore. Investigating how to get it working on v4.2 ...