1. Background

The BW21-CBV-Kit, launched by Ai-Thinker, is an open-source hardware development platform for intelligent sensing applications. Its core features include:

Supports on-device AI image recognition algorithms, enabling scenarios such as face recognition, gesture detection, and object identification, with HD image capture via a 2MP camera.

Equipped with dual-band Wi-Fi (2.4GHz + 5GHz) and a high-performance RTL8735B processor (500MHz), meeting real-time image processing and wireless transmission requirements.

Built on the Arduino development environment, providing rich interfaces such as PWM control and sensor drivers, enabling users to develop custom network camera systems.

Compared with traditional closed-source smart cameras, the BW21-CBV-Kit provides unique advantages:

Developers can bypass vendor constraints to create local storage or private cloud solutions, avoiding cloud data leakage risks.

Supports peripherals such as DHT temperature/humidity sensors and ultrasonic ranging modules, allowing smart home security, environmental monitoring, and other multi-functional applications.

Provides HomeAssistant integration and pre-built AI model calls, significantly lowering the development barrier for home-grade smart cameras.

This device has strong potential applications in smart homes, industrial visual inspection, and new retail behavior analysis. Its open-source nature is particularly suitable for maker communities and SMEs seeking customized visual solutions.

Many manufacturers currently offer smart cameras, but these products are closed-source and tied to specific software and cloud services. The XiaoAnPai (BW21-CBV-Kit) board, by contrast, can be used to build a home network camera running a fully custom software stack.

2. Objectives

The core goal of this project is to develop an intelligent home security monitoring system based on the XiaoAnPai (BW21-CBV-Kit) hardware platform, focusing on three key functional modules:

1. 24/7 audio and video acquisition system

2. Tiered storage architecture

3. Cloud-based interactive system

In addition, a video retrieval module is developed for retrieving daily surveillance footage; its web interface should be as user-friendly as possible, allowing video files to be selected by date.

The overall system design adheres to a collaborative "end-edge-cloud" architecture, ultimately creating a complete home security solution while ensuring data security.

3. Design Approach

3.1 Hardware Design

Development Board: BW21-CBV-Kit connected to a GC2053 camera.

Camera Enclosure: a DIY casing made from the product's packaging box and assembled by hand.

3.2 Software Design

3.2.1 Microcontroller

On the microcontroller side, the board outputs an RTSP video stream while marking detected faces in the video:

#include "WiFi.h"
#include "StreamIO.h"
#include "VideoStream.h"
#include "RTSP.h"
#include "NNFaceDetection.h"
#include "VideoStreamOverlay.h"
#define CHANNEL   0
#define CHANNELNN 3
// Lower resolution for NN processing
#define NNWIDTH  576
#define NNHEIGHT 320
// Cap on the number of OSD feature-point draws in the callback below
// (assumed value; define it here if the SDK headers do not provide it)
#define MAX_FACE_DET 20
VideoSetting config(VIDEO_FHD, 30, VIDEO_H264, 0);
VideoSetting configNN(NNWIDTH, NNHEIGHT, 10, VIDEO_RGB, 0);
NNFaceDetection facedet;
RTSP rtsp;
StreamIO videoStreamer(1, 1);
StreamIO videoStreamerNN(1, 1);
char ssid[] = "Network_SSID";    // your network SSID (name)
char pass[] = "Password";        // your network password
int status = WL_IDLE_STATUS;
IPAddress ip;
int rtsp_portnum;
void setup()
{
    Serial.begin(115200);
    // attempt to connect to Wifi network:
    while (status != WL_CONNECTED) {
        Serial.print("Attempting to connect to WPA SSID: ");
        Serial.println(ssid);
        status = WiFi.begin(ssid, pass);
        // wait 2 seconds for connection:
        delay(2000);
    }
    ip = WiFi.localIP();
    // Configure camera video channels with video format information
    // Adjust the bitrate based on your WiFi network quality
    config.setBitrate(2 * 1024 * 1024);    // Recommend to use 2Mbps for RTSP streaming to prevent network congestion
    Camera.configVideoChannel(CHANNEL, config);
    Camera.configVideoChannel(CHANNELNN, configNN);
    Camera.videoInit();
    // Configure RTSP with corresponding video format information
    rtsp.configVideo(config);
    rtsp.begin();
    rtsp_portnum = rtsp.getPort();
    // Configure face detection with corresponding video format information
    // Select Neural Network(NN) task and models
    facedet.configVideo(configNN);
    facedet.setResultCallback(FDPostProcess);
    facedet.modelSelect(FACE_DETECTION, NA_MODEL, DEFAULT_SCRFD, NA_MODEL);
    facedet.begin();
    // Configure StreamIO object to stream data from video channel to RTSP
    videoStreamer.registerInput(Camera.getStream(CHANNEL));
    videoStreamer.registerOutput(rtsp);
    if (videoStreamer.begin() != 0) {
        Serial.println("StreamIO link start failed");
    }
    // Start data stream from video channel
    Camera.channelBegin(CHANNEL);
    // Configure StreamIO object to stream data from RGB video channel to face detection
    videoStreamerNN.registerInput(Camera.getStream(CHANNELNN));
    videoStreamerNN.setStackSize();
    videoStreamerNN.setTaskPriority();
    videoStreamerNN.registerOutput(facedet);
    if (videoStreamerNN.begin() != 0) {
        Serial.println("StreamIO link start failed");
    }
    // Start video channel for NN
    Camera.channelBegin(CHANNELNN);
    // Start OSD drawing on RTSP video channel
    OSD.configVideo(CHANNEL, config);
    OSD.begin();
}
void loop()
{
    // Do nothing
}
// User callback function for post processing of face detection results
void FDPostProcess(std::vector<FaceDetectionResult> results)
{
    int count = 0;
    uint16_t im_h = config.height();
    uint16_t im_w = config.width();
    Serial.print("Network URL for RTSP Streaming: ");
    Serial.print("rtsp://");
    Serial.print(ip);
    Serial.print(":");
    Serial.println(rtsp_portnum);
    Serial.println(" ");
    printf("Total number of faces detected = %d\r\n", facedet.getResultCount());
    OSD.createBitmap(CHANNEL);
    if (facedet.getResultCount() > 0) {
        for (int i = 0; i < facedet.getResultCount(); i++) {
            FaceDetectionResult item = results[i];
            // Result coordinates are floats ranging from 0.00 to 1.00
            // Multiply with RTSP resolution to get coordinates in pixels
            int xmin = (int)(item.xMin() * im_w);
            int xmax = (int)(item.xMax() * im_w);
            int ymin = (int)(item.yMin() * im_h);
            int ymax = (int)(item.yMax() * im_h);
            // Draw boundary box
            printf("Face %ld confidence %d:\t%d %d %d %d\n\r", i, item.score(), xmin, xmax, ymin, ymax);
            OSD.drawRect(CHANNEL, xmin, ymin, xmax, ymax, 3, OSD_COLOR_WHITE);
            // Print identification text above boundary box
            char text_str[40];
            snprintf(text_str, sizeof(text_str), "%s %d", item.name(), item.score());
            OSD.drawText(CHANNEL, xmin, ymin - OSD.getTextHeight(CHANNEL), text_str, OSD_COLOR_CYAN);
            // Draw facial feature points
            for (int j = 0; j < 5; j++) {
                int x = (int)(item.xFeature(j) * im_w);
                int y = (int)(item.yFeature(j) * im_h);
                OSD.drawPoint(CHANNEL, x, y, 8, OSD_COLOR_RED);
                count++;
                if (count == MAX_FACE_DET) {
                    goto OSDUpdate;
                }
            }
        }
    }
OSDUpdate:
    OSD.update(CHANNEL);
}
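
After flashing, the sketch prints the stream address (rtsp://&lt;board-ip&gt;:&lt;port&gt;) to the serial monitor; it can be opened in any RTSP-capable player, such as VLC, to verify the stream and the face-detection overlay.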

3.2.2 Centralized Storage

Use ffmpeg to pull the RTSP stream from the XiaoAnPai board and save it locally as 60-second HLS segments organized into per-date directories (-strftime_mkdir lets ffmpeg create the dated directories automatically):

ffmpeg -rtsp_transport tcp \
-protocol_whitelist file,rtsp,tcp \
-rw_timeout 5000000 \
-i rtsp://192.168.123.6:554/mystream \
-c copy \
-f hls \
-strftime 1 \
-strftime_mkdir 1 \
-hls_time 60 \
-hls_list_size 0 \
-hls_flags delete_segments+append_list \
-hls_segment_filename "./data/%Y%m%d/%Y%m%d_%H%M%S.ts" \
"./data/20250323/playlist.m3u8"

This produces segment files such as:

./data/20250323/20250323_120000.ts

./data/20250323/20250323_120100.ts
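
Because the playlist path embeds a fixed date, the ffmpeg process must be restarted with a fresh playlist path when the day rolls over, and it should also be respawned if the connection drops. Below is a minimal Python supervisor sketch; the script name, restart policy, and timing values are assumptions, not part of the current system:

# run_recorder.py -- hypothetical supervisor for the ffmpeg command above:
# restarts ffmpeg at midnight so the playlist path matches the current date,
# and respawns it if it exits (e.g. after a network drop).
import datetime
import pathlib
import subprocess
import time

RTSP_URL = "rtsp://192.168.123.6:554/mystream"
DATA_DIR = pathlib.Path("./data")

def build_cmd(day: str) -> list:
    day_dir = DATA_DIR / day
    day_dir.mkdir(parents=True, exist_ok=True)
    return [
        "ffmpeg", "-rtsp_transport", "tcp", "-rw_timeout", "5000000",
        "-i", RTSP_URL,
        "-c", "copy", "-f", "hls",
        "-strftime", "1", "-strftime_mkdir", "1",
        "-hls_time", "60", "-hls_list_size", "0",
        "-hls_flags", "delete_segments+append_list",
        "-hls_segment_filename", str(DATA_DIR / "%Y%m%d" / "%Y%m%d_%H%M%S.ts"),
        str(day_dir / "playlist.m3u8"),
    ]

while True:
    today = datetime.date.today()
    proc = subprocess.Popen(build_cmd(today.strftime("%Y%m%d")))
    while datetime.date.today() == today:
        if proc.poll() is not None:    # ffmpeg exited: back off, then respawn
            time.sleep(5)
            proc = subprocess.Popen(build_cmd(today.strftime("%Y%m%d")))
        time.sleep(30)
    proc.terminate()                   # day rolled over: restart with a new playlist
    proc.wait()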

3.2.3 Video Playback

So far we have only implemented video recording and storage. A corresponding front-end page is still needed to display the video, and serving the data requires some back-end software support.

3.2.3.1 Backend

The backend, built on the Flask framework, provides a structured HLS streaming service. Through automated directory scanning and dynamic routing, it manages and serves video resources organized by date, targeting scenarios that require time-based retrieval of footage. It consists of two layers:

1. Data Discovery Layer: scans the ffmpeg output directory for date-named folders containing a playlist.m3u8 and builds the list of dates with available footage.

2. Interface Service Layer: exposes Flask routes for the main page, the date-list API, and the HLS playlist and segment files.

from flask import Flask, render_template, jsonify, send_from_directory
import os
from datetime import datetime
app = Flask(__name__)
DATA_DIR = '../ffmpeg/data'
def get_available_dates():
    dates = []
    for dirname in os.listdir(DATA_DIR):
        dir_path = os.path.join(DATA_DIR, dirname)
        m3u8_path = os.path.join(dir_path, 'playlist.m3u8')
        if os.path.isdir(dir_path) and os.path.exists(m3u8_path):
            try:
                datetime.strptime(dirname, '%Y%m%d')
                dates.append(dirname)
            except ValueError:
                continue
    dates.sort(key=lambda x: datetime.strptime(x, '%Y%m%d'))
    return dates
@app.route('/')
def index():
    return render_template('index.html')
@app.route('/api/dates')
def api_dates():
    dates = get_available_dates()
    formatted_dates = [f"{d[:4]}-{d[4:6]}-{d[6:8]}" for d in dates]
    return jsonify({'dates': formatted_dates})
@app.route('/video/<filename>')
def serve_m3u8(filename):
    # HLS players resolve segment URIs relative to the playlist URL,
    # so requests for .ts segments also arrive on this route
    if filename.endswith(".ts"):
        return serve_ts_files(filename)
    else:
        # Otherwise the path component is a date directory (YYYYMMDD)
        dir_path = os.path.join(DATA_DIR, filename)
        return send_from_directory(dir_path, 'playlist.m3u8')
def serve_ts_files(filename):
    # Extract the date portion (first 8 digits) from a file name
    if len(filename) < 8:
        return "Invalid filename", 400
   
    date_str = filename[:8]
    try:
        # Validate date format
        datetime.strptime(date_str, '%Y%m%d')
    except ValueError:
        return "Invalid date format in filename", 400
    # Constructing file paths
    dir_path = os.path.join(DATA_DIR, date_str)
    print("dir_path is ", dir_path)
    if not os.path.isdir(dir_path):
        return "Date directory not found", 404
    file_path = os.path.join(dir_path, filename)
    print("file_path is ", file_path)
    if not os.path.exists(file_path):
        return "File not found", 404
    return send_from_directory(dir_path, filename)
if __name__ == '__main__':
    app.run(debug=True)
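
With the backend running (python app.py serves on http://127.0.0.1:5000 by default), the endpoints can be exercised directly. A small sketch using the requests library; the date value is purely illustrative:

# smoke_test.py -- assumes the Flask app above is running locally
import requests

BASE = "http://127.0.0.1:5000"

# List dates with recorded footage, e.g. {"dates": ["2025-03-23"]}
print(requests.get(f"{BASE}/api/dates").json())

# Fetch the HLS playlist for a date (storage format: YYYYMMDD)
playlist = requests.get(f"{BASE}/video/20250323")
print(playlist.text.splitlines()[:5])    # starts with #EXTM3U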


3.2.3.2 Front-end

The front end primarily handles human-computer interaction, using the HLS streaming protocol to enable cross-platform video playback. A visual user interface is built for date selection, playback control, and status feedback, forming a complete streaming solution together with the back-end services.

1. Presentation Layer Design

Desktop adaptation is achieved through a maximum container width and automatic margins.

The control bar uses a flex layout to maintain element spacing, and the video player has a full-width black background to enhance visual focus.

Playback progress and total duration are displayed dynamically for better feedback.

 

2. Streaming Adaptation Layer

The Hls.js library is preferred for its advanced feature support.

A native HTML5 player is used as a fallback to ensure usability in environments such as Safari.

The HLS instance is destroyed and recreated on each date change to avoid memory leaks across streams.

Loading order is enforced through the MEDIA_ATTACHED and MANIFEST_PARSED event chain.

 

3. Control Logic Layer

Select a date → generate a standardized request → initialize the playback engine → play automatically.

User actions (play/pause) are synchronized with the video element state in real time.

Display-format dates (YYYY-MM-DD) are converted automatically to storage-format parameters (YYYYMMDD).

 

<!-- templates/index.html -->
<!DOCTYPE html>
<html>
<head>
    <title>Video surveillance system</title>
    <!-- hls.js provides the Hls object used in the script below -->
    <script src="https://cdn.jsdelivr.net/npm/hls.js@latest"></script>

    <style>
        .container {
            max-width: 800px;
            margin: 20px auto;
            padding: 20px;
        }
        .controls {
            margin-bottom: 20px;
            display: flex;
            gap: 10px;
            align-items: center;
        }
        #videoPlayer {
            width: 100%;
            background: #000;
        }
        button {
            padding: 5px 15px;
            cursor: pointer;
        }
    </style>
</head>
<body>
    <div class="container">
        <div class="controls">
            <select id="dateSelect">
 <option value="">Select Date</option>
 <button id="playBtn">Play</button>
 <button id="pauseBtn">Pause</button>
 <button id="pauseBtn">Pause</button>
 <span id="status">Ready</span>
        </div>
        <video id="videoPlayer" controls></video>
    </div>
    <script>
        const video = document.getElementById('videoPlayer');
        let hls = null;
        let currentDate = '';
        // Initialize HLS support detection
        if (Hls.isSupported()) {
            hls = new Hls();
            hls.attachMedia(video);
        }
        // Loading available dates
        fetch('/api/dates')
            .then(res => res.json())
            .then(data => {
                const select = document.getElementById('dateSelect');
                data.dates.forEach(date => {
                    const option = document.createElement('option');
                    option.value = date;
                    option.textContent = date;
                    select.appendChild(option);
                });
            });
        // Date Selection Event
        document.getElementById('dateSelect').addEventListener('change', function() {
            const selectedDate = this.value;
            if (!selectedDate) return;
            currentDate = selectedDate.replace(/-/g, '');
            const m3u8Url = `/video/${currentDate}`;
            
            if (Hls.isSupported()) {
                if (hls) {
                    hls.destroy();
                }
                hls = new Hls();
                hls.attachMedia(video);
                hls.on(Hls.Events.MEDIA_ATTACHED, () => {
                    hls.loadSource(m3u8Url);
                    hls.on(Hls.Events.MANIFEST_PARSED, () => {
                        video.play();
                    });
                });
            } else if (video.canPlayType('application/vnd.apple.mpegurl')) {
                video.src = m3u8Url;
                video.addEventListener('loadedmetadata', () => {
                    video.play();
                });
            }
        });
        // Playback Controls
        document.getElementById('playBtn').addEventListener('click', () => {
            video.play();
        });
        document.getElementById('pauseBtn').addEventListener('click', () => {
            video.pause();
        });
        // Update status display
        video.addEventListener('timeupdate', () => {
            const status = document.getElementById('status');
            status.textContent = `Playing - ${formatTime(video.currentTime)}/${formatTime(video.duration)}`;
        });
        function formatTime(seconds) {
            if (!isFinite(seconds)) {
                return '--:--:--';    // live playlists report Infinity for duration
            }
            const date = new Date(0);
            date.setSeconds(seconds);
            return date.toISOString().slice(11, 19);
        }
    </script>
</body>
</html>


Demo video: https://www.bilibili.com/video/BV1zSXYYAEhJ/

4. Future Prospects

4.1 Event Detection and Quick Navigation

Currently, the front end only supports manually seeking through the video.

Future improvements: use face recognition to identify family members and detect strangers, and add event tracks to the playback timeline to allow fast navigation.
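
As a rough sketch of how an event track could be served, a future detection process might write an events.json into each date directory, which the Flask backend from 3.2.3.1 exposes to the front end; the endpoint and file format below are hypothetical:

# Hypothetical addition to the Flask app in 3.2.3.1. Assumes a detection
# process writes ../ffmpeg/data/<YYYYMMDD>/events.json as a list like
# [{"t": 4521.0, "label": "stranger"}], with t in seconds from day start.
import json

@app.route('/api/events/<date_str>')
def api_events(date_str):
    path = os.path.join(DATA_DIR, date_str, 'events.json')
    if not os.path.exists(path):
        return jsonify({'events': []})
    with open(path) as f:
        return jsonify({'events': json.load(f)})

The front end could then render these timestamps as markers on the timeline and jump to one with video.currentTime = event.t.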

 

4.2 Integration with HomeAssistant 

The current implementation is standalone, with a custom front end and back end.

Embedding the system into HomeAssistant (HASS) would broaden its application and accessibility.