• Optimizing OTA Updates: From 228s to 26s

    Posted by Prince, an hour ago

    Our Hoopi Pedal uses a Daisy Seed (STM32H7) for audio DSP and an ESP32 for WiFi connectivity.

    We can change the pedal's audio effects by replacing the Daisy's firmware through over-the-air (OTA) updates.

    Over-the-air updates work by:

    1. Hoopi App checks cloud for updates and downloads firmware
    2. Hoopi App sends firmware to ESP32 via HTTP
    3. ESP32 sends firmware to Daisy via UART
    4. Daisy writes to QSPI flash staging area
    5. Daisy copies from staging to active area and reboots

    The initial implementation took 228 seconds for a 294KB firmware update. Users were waiting nearly 4 minutes, and the "critical window" (where power loss could soft-brick the device) was over 60 seconds.

    Note: you might be wondering why we do all this inside the application rather than in the Daisy's bootloader, which could eliminate the critical window by keeping staging and active partitions and swapping them with a flag. The Daisy's bootloader is not open source, and our attempt at a custom bootloader ran into timer/HAL/system initialization issues that locked up the app.

    To find out where the time was going, we added a test mode that times QSPI erase and write operations:

    // hoopi.cpp - Test mode for QSPI benchmarking
    #define DEBUG_PRINT 1
    #define TEST_QSPI_SPEED 1  // Enable timing tests
    
    #if TEST_QSPI_SPEED
        // Wait for USB serial connection
        hw.seed.StartLog(true);  // true = blocking wait for connection
    
        hw.PrintLine("=== QSPI Speed Test ===");
    
        uint32_t start, elapsed;
        constexpr uint32_t TEST_SIZE = 64 * 1024;  // 64KB test
    
        // Test 1: Erase timing
        hw.PrintLine("Erasing 64KB at staging area...");
        start = System::GetNow();
        hw.seed.qspi.Erase(OTA_QSPI_STAGING_ADDR, OTA_QSPI_STAGING_ADDR + TEST_SIZE);
        elapsed = System::GetNow() - start;
        hw.PrintLine("Erase 64KB: %lums", elapsed);
    
        // Test 2: Write timing with different block sizes
        uint8_t* test_buf = new uint8_t[32768];
        memset(test_buf, 0xAA, 32768);
    
        // 256-byte writes
        start = System::GetNow();
        for (uint32_t i = 0; i < TEST_SIZE; i += 256) {
            hw.seed.qspi.Write(OTA_QSPI_STAGING_ADDR + i, 256, test_buf);
        }
        elapsed = System::GetNow() - start;
        hw.PrintLine("Write 64KB (256B blocks): %lums", elapsed);
    
        // 32KB writes
        hw.seed.qspi.Erase(OTA_QSPI_STAGING_ADDR, OTA_QSPI_STAGING_ADDR + TEST_SIZE);
        start = System::GetNow();
        for (uint32_t i = 0; i < TEST_SIZE; i += 32768) {
            hw.seed.qspi.Write(OTA_QSPI_STAGING_ADDR + i, 32768, test_buf);
        }
        elapsed = System::GetNow() - start;
        hw.PrintLine("Write 64KB (32KB blocks): %lums", elapsed);
    
        delete[] test_buf;
        hw.PrintLine("=== Test Complete ===");
    #endif

    This gave us immediate feedback:

    === QSPI Speed Test ===
    Erase 64KB: 1408ms
    Write 64KB (256B blocks): 11023ms
    Write 64KB (32KB blocks): 137ms
    === Test Complete ===
    

    The 256B vs 32KB write difference (80x!) immediately showed us where to focus.
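
    To make that comparison concrete, here is a small sketch that turns the benchmark output into per-call cost and throughput. The helper names (WriteCalls, MsPerCall, ThroughputKBps) are ours for illustration and are not part of libDaisy; the only inputs are the figures printed above.

```cpp
#include <cassert>
#include <cstdint>

// Number of Write() calls needed to cover total_bytes in block_bytes pieces
// (rounds up so a short final block still counts as one call).
uint32_t WriteCalls(uint32_t total_bytes, uint32_t block_bytes) {
    assert(block_bytes > 0);
    return (total_bytes + block_bytes - 1) / block_bytes;
}

// Average time spent per Write() call, in milliseconds.
double MsPerCall(uint32_t total_ms, uint32_t calls) {
    assert(calls > 0);
    return static_cast<double>(total_ms) / calls;
}

// Effective throughput in KB/s for total_bytes written in total_ms.
double ThroughputKBps(uint32_t total_bytes, uint32_t total_ms) {
    assert(total_ms > 0);
    return (total_bytes / 1024.0) / (total_ms / 1000.0);
}
```

    With the numbers above, the 256-byte test issues 256 calls at roughly 43 ms each (about 5.8 KB/s), while the 32KB test issues 2 calls at 68.5 ms each (about 467 KB/s). That suggests most of the time in the small-block case is fixed per-call cost rather than data transfer.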

    Optimization 1: 32KB Write Chunks

    Problem: The original code wrote firmware in 256-byte pages during the final copy.

    // BEFORE: 256-byte page writes (SLOW!)
    for (uint32_t i = 0; i < ota_expected_size; i += 256) {
        hw.seed.qspi.Write(OTA_QSPI_ACTIVE_ADDR + i, 256,
                           (uint8_t*)(OTA_QSPI_STAGING_ADDR + i));
    }

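    Solution: Write in 32KB chunks instead. The benchmark shows that the cost depends far more on the number of Write() calls than on the number of bytes written, so fewer, larger writes are much faster.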

    // AFTER: 32KB chunk writes (5.5x faster!)
    // Needs <algorithm> for std::min and <cstring> for memcpy
    constexpr uint32_t CHUNK_SIZE = 32 * 1024;
    uint8_t* sram_buf = new uint8_t[CHUNK_SIZE];
    uint32_t bytes_copied = 0;
    
    while (bytes_copied < ota_expected_size) {
        // The last chunk may be shorter than CHUNK_SIZE
        uint32_t chunk_size = std::min(CHUNK_SIZE, ota_expected_size - bytes_copied);
    
        // Copy to SRAM buffer first (can't write directly from QSPI)
        memcpy(sram_buf, (uint8_t*)(OTA_QSPI_STAGING_ADDR + bytes_copied), chunk_size);
    
        // Write the whole chunk (up to 32KB) in one call
        hw.seed.qspi.Write(OTA_QSPI_ACTIVE_ADDR + bytes_copied, chunk_size, sram_buf);
        bytes_copied += chunk_size;
    }
    
    delete[] sram_buf;  // Release the bounce buffer once the copy is done

    Result: Critical window reduced from ~62s to ~11s.
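
    The chunking loop can be exercised on a desktop machine before it goes anywhere near real flash. The sketch below is a host-side test harness, not the pedal's code: MockFlash and CopyInChunks are hypothetical names, and the mock only mimics the (address, size, buffer) shape of the Write() call used above. It assumes a 294KB image means 294 * 1024 bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the flash driver: stores writes in a RAM image
// so the copy loop can be checked without hardware.
struct MockFlash {
    std::vector<uint8_t> mem;

    explicit MockFlash(std::size_t size) : mem(size, 0xFF) {}

    void Write(uint32_t address, uint32_t size, const uint8_t* buffer) {
        assert(static_cast<std::size_t>(address) + size <= mem.size());
        std::memcpy(mem.data() + address, buffer, size);
    }
};

// Copies `total` bytes from `src` to flash offset `dst_addr` in chunks of at
// most `chunk_size` bytes, bouncing each chunk through a RAM buffer.
// Returns the number of Write() calls issued.
uint32_t CopyInChunks(MockFlash& flash, uint32_t dst_addr, const uint8_t* src,
                      uint32_t total, uint32_t chunk_size) {
    assert(chunk_size > 0);
    std::vector<uint8_t> bounce(chunk_size);
    uint32_t copied = 0;
    uint32_t calls = 0;
    while (copied < total) {
        // The final chunk may be shorter than chunk_size.
        const uint32_t n = std::min(chunk_size, total - copied);
        std::memcpy(bounce.data(), src + copied, n);
        flash.Write(dst_addr + copied, n, bounce.data());
        copied += n;
        ++calls;
    }
    return calls;
}
```

    Under those assumptions, a 294KB image takes 10 Write() calls with 32KB chunks and 1,176 calls with 256-byte pages, and the mock lets you confirm that the short final chunk is copied correctly.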

    Optimization 2: 64KB Block Erase

    Problem: libDaisy's QSPI erase used 4KB sector erase commands (0xD7), requiring 256 erase operations for 1MB.
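
    The operation count above is plain division: 1MB / 4KB = 256 sector erases, versus 1MB / 64KB = 16 block erases. As a rough sketch of how a range could be split to prefer 64KB erases, the planner below uses a block erase wherever the address is 64KB-aligned and at least 64KB remains, and a 4KB sector erase otherwise. EraseOp and PlanErase are hypothetical names, and this only plans the operations; it does not show libDaisy's API or the change we actually made.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint32_t kSectorSize = 4 * 1024;   // 4KB sector erase granularity
constexpr uint32_t kBlockSize  = 64 * 1024;  // 64KB block erase granularity

// One planned erase: where it starts and how much it clears.
struct EraseOp {
    uint32_t address;
    uint32_t size;  // kSectorSize or kBlockSize
};

// Plans erases covering [start, end). Both bounds must be 4KB-aligned.
std::vector<EraseOp> PlanErase(uint32_t start, uint32_t end) {
    assert(start % kSectorSize == 0);
    assert(end % kSectorSize == 0);
    assert(start <= end);

    std::vector<EraseOp> ops;
    uint32_t addr = start;
    while (addr < end) {
        // Use a block erase only when it fits entirely inside the range.
        const bool use_block =
            (addr % kBlockSize == 0) && (end - addr >= kBlockSize);
        const uint32_t size = use_block ? kBlockSize : kSectorSize;
        ops.push_back({addr, size});
        addr += size;
    }
    return ops;
}
```

    For an aligned 1MB range this yields 16 operations instead of 256. Unaligned edges fall back to 4KB sectors, so the plan never erases outside the requested range.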

    Discovery: We benchmarked the erase operations:

    Erase 64KB (16x 4KB sectors): 80,000ms  // Calling EraseSector 16 times
    Erase...