
An open-source voice AI assistant

An AI voice assistant that provides continuous audio recording and allows you to integrate any model you want.

A hardware device capable of 24-hour continuous recording, transmitting data to a smartphone via BLE.
A companion app that receives data and uploads it to any large model interface, and accepts responses from the server.
A companion server program that performs speech recognition and summarization tasks.
All content is open-source.

Recently, more and more AI voice assistants have entered the public eye, with everyone praising the benefits AI can bring. However, this has led people to overlook the fact that the essence of these hardware products is just a recording device that can upload data in real-time!

The privacy of personal recordings is paramount, and it's hard not to be concerned about where the data goes. Are private conversations really safe? How do they store the original recordings? How do they analyze and use such private data?

Moreover, most of these products require manual triggering to start recording. If you tend to forget important information, you might also forget to trigger the recording. What I want is an assistant that can automatically record all information without any extra operations.

Therefore, I started this project with the hope of creating a device with the following features:
- A hardware device capable of 24-hour continuous recording, with data transmission to a smartphone via BLE.
- Possible features include:
  - 24-hour continuous recording capability
  - The ability to locally extract useful recording content (90% of the time we are not in conversation)
  - Uploading useful data to a smartphone via BLE
  - Firmware upgrade via BLE
  - As small and lightweight as possible to attach to clothing
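One simple way to approach the "extract useful recording content" item is an energy gate: split the audio into short frames and keep only those loud enough to plausibly contain speech. The sketch below is only an illustration of that idea, not the project's firmware. The 16 kHz sample rate, 30 ms frame length, RMS threshold, and hangover count are all assumptions chosen for the example, and a real device would need tuning or a proper voice activity detector.

```python
import math

SAMPLE_RATE = 16_000   # assumed; the project does not state a sample rate
FRAME_MS = 30          # assumed frame length
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000


def frame_rms(frame):
    """Root-mean-square amplitude of one frame of signed 16-bit samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def keep_active_frames(samples, threshold=500.0, hangover=3):
    """Return the frames worth uploading.

    A frame is kept when its RMS reaches `threshold` (hypothetical value),
    and the next `hangover` quiet frames are kept too, so word endings
    are not clipped.
    """
    kept = []
    remaining = 0
    for start in range(0, len(samples) - FRAME_SAMPLES + 1, FRAME_SAMPLES):
        frame = samples[start:start + FRAME_SAMPLES]
        if frame_rms(frame) >= threshold:
            remaining = hangover
            kept.append(frame)
        elif remaining > 0:
            remaining -= 1
            kept.append(frame)
    return kept
```

If most of the day really is silence, a gate like this would discard most frames before they ever reach the BLE link, which is where the power and bandwidth savings would come from.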

We also need to develop a mobile app to receive data from the device and allow us to send the recording data to any custom model interface we choose. Of course, it will include common public large model interfaces, but the key is that we decide where everything goes!
- The app may include local speech-to-text and voiceprint recognition capabilities to reduce data volume (lower priority).

Additionally, we need a server program, including a local model to process data, running on our own server. (I think summarizing doesn't require a very powerful large model, though I could be wrong)
- Possible features include:
  - Converting speech to text
  - Voiceprint recognition
  - Summarizing text content with a local model
  - Deployment via Docker on most servers

This is my vision for the device. I will implement these features step by step. The current plan includes:
- Validating the functionality using ESP32 (though its power consumption is high)
- Developing the accompanying app to forward recording data to the model and receive the model's feedback
- Developing the server software to support simple speech-to-text and text summarization
- Replacing the ESP32 with a lower-power BLE MCU
- Designing the final PCB and manufacturing the enclosure

This is the current plan. The project will be fully open-source, from hardware to software. I hope that through this project, we can provide everyone with a convenient and secure personal voice assistant, ensuring our privacy.

  • Open-source repository:

    铲屎将军 11/10/2024 at 02:51

    The hardware cost is approximately $6 (excluding battery and processing fees). Key components include an ESP32 ($2.50) and an INMP441 ($1.50).

  • Hardware verification for the proposed solution has been completed

    铲屎将军 11/09/2024 at 04:03

    The ESP32-based verification hardware has been validated. The verification board records audio while the button is held and stops when it is released. The recorded data is then sent to a TCP server.

    This verification board can now serve as a basic trigger-based voice assistant, although it is still far from our ultimate goal. Attached are photos of the verified hardware solution. The open-source link for the entire project design will be updated shortly, including the hardware design, preliminary verification code, and simple test server code (which merely packages the received audio data into a WAV file).
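For readers who want to try something similar before the repository link is published, here is a minimal sketch of a test server that accepts one TCP connection and wraps the received bytes in a WAV file. It is not the project's actual server code. It assumes the board sends raw 16-bit mono PCM at 16 kHz and closes the connection when the button is released; the port number, file name, and function names are likewise hypothetical.

```python
import socket
import wave

HOST, PORT = "0.0.0.0", 9000   # assumed listen address and port
SAMPLE_RATE = 16_000           # assumed sample rate in Hz
SAMPLE_WIDTH = 2               # assumed bytes per sample (16-bit PCM)
CHANNELS = 1                   # assumed mono


def receive_all(conn):
    """Read from the socket until the sender closes the connection."""
    chunks = []
    while True:
        data = conn.recv(4096)
        if not data:
            break
        chunks.append(data)
    return b"".join(chunks)


def save_wav(path, pcm):
    """Write raw PCM bytes to `path` with a WAV header."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)


def serve_once(host=HOST, port=PORT, path="recording.wav"):
    """Accept a single connection, save its audio, return the byte count."""
    with socket.create_server((host, port)) as server:
        conn, _addr = server.accept()
        with conn:
            pcm = receive_all(conn)
    save_wav(path, pcm)
    return len(pcm)
```

Calling `serve_once()` blocks until one recording has been received, then writes `recording.wav` in the current directory. If the board's actual sample format differs, only the three format constants should need to change.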

  • SMT is completed

    铲屎将军 08/26/2024 at 03:42

    SMT assembly is complete. Next, we will test the recording functionality.

  • The custom PCB has arrived and will be sent to the factory for soldering

    铲屎将军 08/15/2024 at 04:15

  • Preliminary hardware design of the ESP32 solution is complete

    铲屎将军 08/10/2024 at 09:32

    The design uses the ESP32-WROVER primarily because of its large memory capacity. 

    The MIC is an INMP441, and data is directly obtained via I2S. 

    There is a button for user operation and an RGB LED to indicate status. 

    There is a TF card slot, which is not part of the intended functionality because the TF card consumes a lot of power; it is included only for debugging audio data.

    The battery uses a PH2.0 connector to facilitate the use of different solutions. 

    The PCB has a diameter of 4 cm and includes a slot for the ESP32 antenna.

    The current hardware cost of the solution is approximately $6 (excluding the battery).

    Since the PCB has not been verified yet, the schematic has not been published. Once verification is complete, links to the schematic and PCB will be added to the project.





420pootang69 wrote 08/10/2024 at 10:25


I run all my home automation stuff in-house *except* for a few Google Homes scattered about the place.
It's annoying, as nothing connects to "the cloud" except for the Google assistants. The reason is that I like having voice control. The commands are as simple as "turn on the kitchen lights", which then comes back into Node-RED and gets processed into an MQTT request and sent out.
I'd love to be able to replace these with something processing voice on my local server, with the ability to figure out what the commands are saying and do the correct action.
I know you're more focused on LLM stuff at the moment, which is understandable, but have you given any thought to simpler actions?

Although saying that, teaching a local LLM to process the commands correctly might be the easiest option...


铲屎将军 wrote 08/11/2024 at 02:37

In my understanding, the key is to standardize various commands and then call different interfaces. Before the advent of large models, we could only enumerate all possible voice commands and then match them. Now, we can have the model organize the input text into a standard format before calling the interfaces. This can be achieved without fine-tuning the model, just by using appropriate prompts. Moreover, this functionality can be accomplished with a very small model.

You can build your own server to interface with the control interfaces of the devices, and then submit the voice received by the hardware to your server. This is exactly why I want this project to be able to interface with any system!
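As a rough illustration of the "standard format via prompt" idea, the sketch below asks a model to turn free-form text into a small JSON command, validates the result, and maps it to an MQTT-style topic and payload. Everything specific here is an assumption made for the example: the prompt wording, the `device`/`action` schema, the `home/<device>/set` topic layout, and the `ask_model` callable, which stands in for whatever local model interface you run. Nothing is actually published to a broker.

```python
import json

# Hypothetical prompt; any wording that yields the same JSON shape would do.
PROMPT_TEMPLATE = (
    "Convert the user's request into JSON with exactly these keys: "
    '"device" (string), "action" ("on" or "off"). '
    "Reply with JSON only.\n"
    "Request: {text}"
)

ALLOWED_ACTIONS = {"on", "off"}  # assumed command vocabulary


def normalize_command(text, ask_model):
    """Ask the model for a JSON command and validate it before use.

    `ask_model` is any callable that takes a prompt string and returns
    the model's reply as a string (a placeholder for a real model call).
    """
    reply = ask_model(PROMPT_TEMPLATE.format(text=text))
    command = json.loads(reply)
    if not isinstance(command, dict):
        raise ValueError("model reply is not a JSON object")
    device = command.get("device")
    action = command.get("action")
    if not isinstance(device, str) or not device:
        raise ValueError("missing or invalid device")
    if action not in ALLOWED_ACTIONS:
        raise ValueError("unsupported action")
    return {"device": device, "action": action}


def to_mqtt_message(command):
    """Map a validated command to a (topic, payload) pair (assumed layout)."""
    device = command["device"].strip().lower().replace(" ", "_")
    return "home/" + device + "/set", command["action"].upper()
```

Validating the reply before acting on it matters more than the prompt itself, since a small model will occasionally return something outside the expected format and that output should be rejected rather than forwarded to a device.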

