Higher accuracy speech recognition in ROS with Vosk

Following the previous log entry Speech recognition in ROS with PocketSphinx the recognition of speech was okay (~90% of words correctly) but not good. Also PocketSphinx is a little dated and its developers are now working on Vosk instead, which itself uses Kaldi.

With the Vosk server there is an easy to use Websocket API. I use it with the language model vosk-model-small-en-us-0.15 which is optimized for embedded systems:

python3 ./asr_server.py /opt/vosk-model-small-en-us

As with PocketSphinx accuracy can be improved by using a fixed set of words to recognize, so I will whitelist words like "forward, backward, stop, left, right, ...". The API does not allow to set a complete grammar, so sentences like "left right" can theoretically be detected, but in contrast to PocketSphinx the accuracy of detected words with Vosk is good enough so a grammar is not really needed.

Since the Vosk Server is using websockets I skip GStreamer and pipe the 16 bit audio with a sample rate of 16000 Hz from the ReSpeaker Mic Array microphone with SoX directly in my ROS publisher script:

rec -q -t alsa -c 1 -b 16 -r 16000 -t wav - silence -l 1 0.1 0.3% -1 2.0 0.3% | ./asr_vosk.rb -

With the silence command SoX is told to filter periods of silences longer then 2 seconds, so the Vosk server does not have to process these.

The ruby script asr_vosk.rb is based on the PocketSphinx one in the previous log, but the GStreamer/PocketSphinx parts are replaced with websockets and Vosk. I use websocket-eventmachine-client library for websocket handling in ruby.

#!/usr/bin/ruby

require 'logger'
require 'websocket-eventmachine-client'
require 'json'
require 'ros'
require 'std_msgs/String'

KEYWORDS = ["wild thumper"]
CONFIG = {
    "config": {
        "phrase_list": ["angle", "backward", "by", "centimeter",
                        "compass", "current", "decrease", "default",
                        "degree", "down", "eight", "eighteen", "eighty",
                        "eleven", "fifteen", "fifty", "five", "forty",
                        "forward", "four", "fourteen", "get", "go",
                        "hundred", "increase", "left", "light", "lights",
                        "meter", "mic", "minus", "motion", "mute",
                        "nine", "nineteen", "ninety", "off", "on", "one",
                        "position", "pressure", "right", "secure", "set",
                        "seven", "seventeen", "seventy", "silence",
                        "six", "sixteen", "sixty", "speed", "stop",
                        "temp", "temperature", "ten", "thirteen",
                        "thirty", "three", "to", "turn", "twelve",
                        "twenty", "two", "up", "velocity", "voltage",
                        "volume", "wild thumper", "zero"],
        "sample_rate": 16000.0
    }
}

class Speak
    def initialize(node)
        @logger = Logger.new(STDOUT)
        @commands_enabled = false
        @publisher = node.advertise('asr_result', Std_msgs::String)

        # Websocket handling
        EM.run do
            Signal.trap("INT") { send_eof }
            @ws = WebSocket::EventMachine::Client.connect(:uri => 'ws://192.168.36.4:2700')

            def send_eof
                @ws.send '{"eof" : 1}'
            end

            # Loop over all input data
            def run
                while true do
                    data = ARGF.read(16000)
                    if data
                        @ws.send data, :type => :binary
                    else
                        send_eof
                        break
                    end
                end
            end

            @ws.onopen do
                @logger.info "Running.."
                @ws.send CONFIG.to_json

                Thread.new {
                    run
                }
            end

            @ws.onmessage do |msg, type|
                d = JSON.parse(msg)
                handle_result(d)
            end

            @ws.onclose do |code, reason|
                puts "Disconnected with status code: #{code}"
                exit
            end
        end
    end

    def handle_result(msg)
        if msg.has_key? "result"
            msg["result"].each do |result|
                @logger.debug "word=" + result["word"]
            end

            # check for keywords first
            text = msg["text"]
            @logger.debug "text=" + msg["text"]
            KEYWORDS.each do |keyword|
                if text.include? keyword
                    keyword_detected(keyword)
                    text = text.gsub(keyword, "").strip
                end
            end

            # not a keyword, handle command if enabled
            if @commands_enabled and text.length > 0
                final_result(text)
            end
        end
    end

    # Enables/Disables the speech command
    def enable_commands(bEnable)
        @commands_enabled = bEnable
    end

    # Resulting speech command
    def final_result(hyp)
        @logger.info "final: " + hyp
        enable_commands(false)

        # Publish vosk result as ros message
        msg = Std_msgs::String.new
        msg.data = hyp
        @publisher.publish(msg)
    end

    def keyword_detected(hyp)
        @logger.debug "Got keyword: " + hyp
        enable_commands(true)
    end
end

if __FILE__ == $0
    node = ROS::Node.new('asr_vosk')
    app = Speak.new(node)
    begin
        node.spin
    rescue Interrupt
    ensure
        node.shutdown
    end
end

It basically works like the GStreamer/PocketSphinx script. The command recognition needs to be enabled with the "wild thumper" keyword (enable_commands()), so only "wild thumper stop" is accepted, not just "stop". This has the advantage that the robot does not react when e.g. a movie is running in the background where someone just says "stop".

The result of the speech recognition is published to the ROS topic "asr_result" as message type string.

Adding a Manipulator

Get the robot to bring me beer

Discussions

Become a Hackaday.io Member