Fixing the robots' marathon performance

It was always important for me to test if the robot can walk indefinitely - or at least for hours on end - in the simulation, so after I got the robot walking for a few seconds, I kept trying longer runs to see what failed and how it failed. At some point, I got it to walk for about an hour, but then it would suddenly fail in very weird ways:

What's more, it was failing at random points in time, sometimes after a few minutes, sometimes after over an hour... That lead me to believe it was either multithreading related (deadlocks etc.) or something at the OS level. After reverting to a number of commits and experimenting with wrapper functions to make everything fully threadsafe (which was long overdue anyway), I was quite certain threads were not the problem. So I started adding high_resolution_clocks to each code block on every thread and logged all that into the CSV files:

Maybe you already noticed: There's a clear spike of "previous_logging_time", meaning the write operation must have taken over 250ms! This made a lot of sense, and even though I haven't tracked down why that operation randomly took so much longer, I suspected it was just something on the hardware cache or OS level and moved on to the proper solution, which was, of course, asynchronous IO. The implementation is not very clean because I wasn't perfectly sure that was the actual issue but I used this in the end:

// From https://stackoverflow.com/questions/21126950/asynchronously-writing-to-a-file-in-c-unix#21127776
#ifndef ASYNC_LOGGER_H
#define ASYNC_LOGGER_H

#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <streambuf>
#include <string>
#include <thread>
#include <vector>

struct async_buf : std::streambuf
{
    std::ofstream                 out;
    std::mutex                    mutex;
    std::condition_variable       condition;
    std::queue<std::vector<char>> queue;
    std::vector<char>             buffer;
    bool                          done;
    std::thread                   thread;

    void worker() {
        bool local_done(false);
        std::vector<char> buf;
        while (!local_done) {
            {
                std::unique_lock<std::mutex> guard(this->mutex);
                this->condition.wait(guard,
                                     [this](){ return !this->queue.empty()
                                                   || this->done; });
                if (!this->queue.empty()) {
                    buf.swap(queue.front());
                    queue.pop();
                }
                local_done = this->queue.empty() && this->done;
            }
            if (!buf.empty()) {
                out.write(buf.data(), std::streamsize(buf.size()));
                buf.clear();
            }
        }
        out.flush();
    }

public:
    async_buf(std::string const& name)
        : out(name)
        , buffer(128)
        , done(false)
        , thread(&async_buf::worker, this) {
        this->setp(this->buffer.data(),
                   this->buffer.data() + this->buffer.size() - 1);
    }
    ~async_buf() {
        this->sync();
        std::cout << "Async logger destructor ran" << std::endl;
        std::unique_lock<std::mutex>(this->mutex), (this->done = true);
        this->condition.notify_one();
        this->thread.join();
    }
    int overflow(int c) {
        if (c != std::char_traits<char>::eof()) {
            *this->pptr() = std::char_traits<char>::to_char_type(c);
            this->pbump(1);
        }
        return this->sync() != -1
            ? std::char_traits<char>::not_eof(c): std::char_traits<char>::eof();
    }
    int sync() {
        if (this->pbase() != this->pptr()) {
            this->buffer.resize(std::size_t(this->pptr() - this->pbase()));
            {
                std::unique_lock<std::mutex> guard(this->mutex);
                this->queue.push(std::move(this->buffer));
            }
            this->condition.notify_one();
            this->buffer = std::vector<char>(128);
            this->setp(this->buffer.data(),
                       this->buffer.data() + this->buffer.size() - 1);
        }
        return 0;
    }
};

#endif

The file: https://github.com/LouKordos/walking_controller/blob/develop/src/include/async_logger.hpp

Many thanks to the Stackoverflow fellow :)

You can also read about this issue on my Github repo where you will see I also checked if the solver time was spiking instead of the loop around it, maybe because the cartesian position values became increasingly large, but that was not the case:

After fixing this issue with async IO, the robot was able to walk for over 4 hours, after which the limits of the simulation started to show... More on that in another log :)

The first major issue: Delay Compensation

Controlling the robot in real-time via a Web UI

Discussions

Become a Hackaday.io Member