add some code

2025-09-05 13:25:11 +08:00
parent 9ff0a99e7a
commit 3cf1229a85
8911 changed files with 2535396 additions and 0 deletions
--- a/main/audio/README.md
+++ b/main/audio/README.md
@@ -0,0 +1,88 @@
+# Audio Service Architecture
+
+The audio service is a core component responsible for all audio functionality: capturing audio from the microphone, processing it, encoding and decoding it, and playing it back through the speaker. It is modular and runs its main operations in dedicated FreeRTOS tasks to maintain real-time performance.
+
+## Key Components
+
+-   **`AudioService`**: The central orchestrator. It initializes and manages all other audio components, tasks, and data queues.
+-   **`AudioCodec`**: A hardware abstraction layer (HAL) for the physical audio codec chip. It handles the raw I2S communication for audio input and output.
+-   **`AudioProcessor`**: Performs real-time audio processing on the microphone input stream. This typically includes Acoustic Echo Cancellation (AEC), noise suppression, and Voice Activity Detection (VAD). `AfeAudioProcessor` is the default implementation, utilizing the ESP-ADF Audio Front-End.
+-   **`WakeWord`**: Detects keywords (e.g., "你好，小智", "Hi, ESP") from the audio stream. It runs independently from the main audio processor until a wake word is detected.
+-   **`OpusEncoderWrapper` / `OpusDecoderWrapper`**: Manages the encoding of PCM audio to the Opus format and decoding Opus packets back to PCM. Opus is used for its high compression and low latency, making it ideal for voice streaming.
+-   **`OpusResampler`**: A utility that converts audio streams between sample rates (e.g., from the codec's native sample rate to the 16 kHz required for processing).
+
+## Threading Model
+
+The service operates on three primary tasks to handle the different stages of the audio pipeline concurrently:
+
+1.  **`AudioInputTask`**: Solely responsible for reading raw PCM data from the `AudioCodec`. It feeds this data to the `WakeWord` engine or the `AudioProcessor`, depending on which is currently running; in audio testing mode it pushes the data straight to the encode queue instead.
+2.  **`AudioOutputTask`**: Responsible for playing audio. It retrieves decoded PCM data from the `audio_playback_queue_` and sends it to the `AudioCodec` to be played on the speaker.
+3.  **`OpusCodecTask`**: A single worker task that handles both encoding and decoding. On each pass it fetches Opus packets from `audio_decode_queue_`, decodes them into PCM, and places the result in the `audio_playback_queue_`; it then fetches raw audio from `audio_encode_queue_`, encodes it into Opus packets, and places them in the `audio_send_queue_`. Because both directions share one task, they are interleaved rather than truly concurrent.
+
+## Data Flow
+
+There are two primary data flows: audio input (uplink) and audio output (downlink).
+
+### 1. Audio Input (Uplink) Flow
+
+This flow captures audio from the microphone, processes it, encodes it, and prepares it for sending to a server.
+
+```mermaid
+graph TD
+    subgraph Device
+        Mic[("Microphone")] -->|I2S| Codec(AudioCodec)
+        
+        subgraph AudioInputTask
+            Codec -->|Raw PCM| Read(ReadAudioData)
+            Read -->|16kHz PCM| Processor(AudioProcessor)
+        end
+
+        subgraph OpusCodecTask
+            Processor -->|Clean PCM| EncodeQueue(audio_encode_queue_)
+            EncodeQueue --> Encoder(OpusEncoder)
+            Encoder -->|Opus Packet| SendQueue(audio_send_queue_)
+        end
+
+        SendQueue --> |"PopPacketFromSendQueue()"| App(Application Layer)
+    end
+    
+    App -->|Network| Server((Cloud Server))
+```
+
+-   The `AudioInputTask` continuously reads raw PCM data from the `AudioCodec`.
+-   This data is fed into an `AudioProcessor` for processing (AEC, noise suppression, VAD).
+-   The processed PCM data is pushed into the `audio_encode_queue_`.
+-   The `OpusCodecTask` picks up the PCM data, encodes it into Opus format, and pushes the resulting packet to the `audio_send_queue_`.
+-   The application can then retrieve these Opus packets and send them over the network.
+
+### 2. Audio Output (Downlink) Flow
+
+This flow receives encoded audio data, decodes it, and plays it on the speaker.
+
+```mermaid
+graph TD
+    Server((Cloud Server)) -->|Network| App(Application Layer)
+
+    subgraph Device
+        App -->|"PushPacketToDecodeQueue()"| DecodeQueue(audio_decode_queue_)
+
+        subgraph OpusCodecTask
+            DecodeQueue -->|Opus Packet| Decoder(OpusDecoder)
+            Decoder -->|PCM| PlaybackQueue(audio_playback_queue_)
+        end
+
+        subgraph AudioOutputTask
+            PlaybackQueue -->|PCM| Codec(AudioCodec)
+        end
+
+        Codec -->|I2S| Speaker[("Speaker")]
+    end
+```
+
+-   The application receives Opus packets from the network and pushes them into the `audio_decode_queue_`.
+-   The `OpusCodecTask` retrieves these packets, decodes them back into PCM data, and pushes the data to the `audio_playback_queue_`.
+-   The `AudioOutputTask` takes the PCM data from the queue and sends it to the `AudioCodec` for playback.
+
+## Power Management
+
+To conserve energy, the audio codec's input (ADC) and output (DAC) channels are automatically disabled after a period of inactivity (`AUDIO_POWER_TIMEOUT_MS`). A timer (`audio_power_timer_`) periodically checks for activity and manages the power state. The channels are automatically re-enabled when new audio needs to be captured or played. 
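
Review note: the inactivity check described under Power Management can be exercised on a host without ESP-IDF. The following is a minimal sketch, not the code from this commit: `PowerState`, `UpdatePowerState`, and the 15-second `kPowerTimeout` are hypothetical names and values chosen for illustration, since the real `AUDIO_POWER_TIMEOUT_MS` and the body of `CheckAndUpdateAudioPowerState()` are not visible in this part of the diff.

```cpp
#include <chrono>

// Hypothetical stand-in for the codec's input/output enable flags.
struct PowerState {
    bool input_enabled = true;
    bool output_enabled = true;
};

using Clock = std::chrono::steady_clock;

// Assumed value for illustration only; the real constant is defined elsewhere.
constexpr std::chrono::milliseconds kPowerTimeout{15000};

// Disable each channel whose last activity is older than the timeout.
// Returns true if at least one channel was switched off by this call.
inline bool UpdatePowerState(PowerState& state,
                             Clock::time_point last_input,
                             Clock::time_point last_output,
                             Clock::time_point now) {
    bool changed = false;
    if (state.input_enabled && now - last_input > kPowerTimeout) {
        state.input_enabled = false;
        changed = true;
    }
    if (state.output_enabled && now - last_output > kPowerTimeout) {
        state.output_enabled = false;
        changed = true;
    }
    return changed;
}
```

Passing `now` as a parameter instead of reading the clock inside the function keeps the check deterministic, which makes it easy to test the timeout boundary without waiting.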
--- a/main/audio/audio_codec.cc
+++ b/main/audio/audio_codec.cc
@@ -0,0 +1,152 @@
+#include "audio_codec.h"
+#include "board.h"
+#include "settings.h"
+
+#include <esp_log.h>
+#include <cstring>
+#include <driver/i2s_common.h>
+
+#define TAG "AudioCodec"
+
+AudioCodec::AudioCodec(){
+}
+
+AudioCodec::~AudioCodec(){
+}
+
+void AudioCodec::OutputData(std::vector<int16_t> &data){
+    Write(data.data(), data.size());
+}
+
+bool AudioCodec::InputData(std::vector<int16_t> &data){
+    int samples = Read(data.data(), data.size());
+    if (samples > 0){
+        return true;
+    }
+    return false;
+}
+
+void AudioCodec::Start(){
+    Settings settings("audio", false);
+    output_volume_ = settings.GetInt("output_volume", output_volume_);
+    if (output_volume_ <= 0){
+        ESP_LOGW(TAG, "Output volume value (%d) is too small, setting to default (10)", output_volume_);
+        output_volume_ = 10;
+    }
+
+    // Save the original output sample rate
+    if (original_output_sample_rate_ == 0){
+        original_output_sample_rate_ = output_sample_rate_;
+        ESP_LOGI(TAG, "Saved original output sample rate: %d Hz", original_output_sample_rate_);
+    }
+
+    if (tx_handle_ != nullptr){
+        ESP_ERROR_CHECK(i2s_channel_enable(tx_handle_));
+    }
+
+    if (rx_handle_ != nullptr){
+        ESP_ERROR_CHECK(i2s_channel_enable(rx_handle_));
+    }
+
+    EnableInput(true);
+    EnableOutput(true);
+    ESP_LOGI(TAG, "Audio codec started");
+}
+
+void AudioCodec::SetOutputVolume(int volume){
+    output_volume_ = volume;
+    ESP_LOGI(TAG, "Set output volume to %d", output_volume_);
+
+    Settings settings("audio", true);
+    settings.SetInt("output_volume", output_volume_);
+}
+
+void AudioCodec::EnableInput(bool enable){
+    if (enable == input_enabled_){
+        return;
+    }
+    input_enabled_ = enable;
+    ESP_LOGI(TAG, "Set input enable to %s", enable ? "true" : "false");
+}
+
+void AudioCodec::EnableOutput(bool enable){
+    if (enable == output_enabled_){
+        return;
+    }
+    output_enabled_ = enable;
+    ESP_LOGI(TAG, "Set output enable to %s", enable ? "true" : "false");
+}
+
+bool AudioCodec::SetOutputSampleRate(int sample_rate){
+    // Special case: passing -1 resets to the original sample rate
+    if (sample_rate == -1){
+        if (original_output_sample_rate_ > 0){
+            sample_rate = original_output_sample_rate_;
+            ESP_LOGI(TAG, "Resetting to original output sample rate: %d Hz", sample_rate);
+        }else{
+            ESP_LOGW(TAG, "Original sample rate not available, cannot reset");
+            return false;
+        }
+    }
+
+    if (sample_rate <= 0 || sample_rate > 192000){
+        ESP_LOGE(TAG, "Invalid sample rate: %d", sample_rate);
+        return false;
+    }
+
+    if (output_sample_rate_ == sample_rate){
+        ESP_LOGI(TAG, "Sample rate already set to %d Hz", sample_rate);
+        return true;
+    }
+
+    if (tx_handle_ == nullptr){
+        ESP_LOGW(TAG, "TX handle is null, only updating sample rate variable");
+        output_sample_rate_ = sample_rate;
+        return true;
+    }
+
+    ESP_LOGI(TAG, "Changing output sample rate from %d to %d Hz", output_sample_rate_, sample_rate);
+
+    // First try to disable the I2S channel (if it is currently enabled)
+    bool was_enabled = false;
+    esp_err_t disable_ret = i2s_channel_disable(tx_handle_);
+    if (disable_ret == ESP_OK){
+        was_enabled = true;
+        ESP_LOGI(TAG, "Disabled I2S TX channel for reconfiguration");
+    }
+    else if (disable_ret == ESP_ERR_INVALID_STATE){
+        // The channel may already be disabled; this is expected
+        ESP_LOGI(TAG, "I2S TX channel was already disabled");
+    }else{
+        ESP_LOGW(TAG, "Failed to disable I2S TX channel: %s", esp_err_to_name(disable_ret));
+    }
+
+    // Reconfigure the I2S clock
+    i2s_std_clk_config_t clk_cfg = {
+        .sample_rate_hz = (uint32_t)sample_rate,
+        .clk_src = I2S_CLK_SRC_DEFAULT,
+        .mclk_multiple = I2S_MCLK_MULTIPLE_256,
+#ifdef I2S_HW_VERSION_2
+        .ext_clk_freq_hz = 0,
+#endif
+    };
+
+    esp_err_t ret = i2s_channel_reconfig_std_clock(tx_handle_, &clk_cfg);
+
+    // Re-enable the channel (regardless of its previous state, it must be enabled now so audio can play)
+    esp_err_t enable_ret = i2s_channel_enable(tx_handle_);
+    if (enable_ret != ESP_OK){
+        ESP_LOGE(TAG, "Failed to enable I2S TX channel: %s", esp_err_to_name(enable_ret));
+    }else{
+        ESP_LOGI(TAG, "Enabled I2S TX channel");
+    }
+
+    if (ret == ESP_OK){
+        output_sample_rate_ = sample_rate;
+        ESP_LOGI(TAG, "Successfully changed output sample rate to %d Hz", sample_rate);
+        return true;
+    }else{
+        ESP_LOGE(TAG, "Failed to change sample rate to %d Hz: %s", sample_rate, esp_err_to_name(ret));
+        return false;
+    }
+}
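
Review note: the argument handling at the top of `SetOutputSampleRate()` (the `-1` reset sentinel followed by the range check) can be separated from the I2S calls and tested on a host. The following is a minimal sketch under that assumption; `ResolveSampleRate` is a hypothetical helper that does not exist in this commit, and it only mirrors the checks visible in the function above.

```cpp
// Hypothetical helper mirroring the validation in SetOutputSampleRate().
// `requested == -1` means "reset to the original rate".
// On success, writes the rate to apply into `resolved` and returns true.
inline bool ResolveSampleRate(int requested, int original_rate, int& resolved) {
    if (requested == -1) {
        if (original_rate <= 0) {
            return false;  // No original rate was saved, so a reset is impossible.
        }
        requested = original_rate;
    }
    if (requested <= 0 || requested > 192000) {
        return false;  // Same bounds as the check in the diff above.
    }
    resolved = requested;
    return true;
}
```

Keeping this logic free of `i2s_channel_*` calls means the sentinel and bounds behavior can be verified without hardware, while the reconfiguration itself stays in the codec class.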
--- a/main/audio/audio_codec.h
+++ b/main/audio/audio_codec.h
@@ -0,0 +1,62 @@
+#ifndef _AUDIO_CODEC_H
+#define _AUDIO_CODEC_H
+
+#include <freertos/FreeRTOS.h>
+#include <freertos/event_groups.h>
+#include <driver/i2s_std.h>
+
+#include <vector>
+#include <string>
+#include <functional>
+
+#include "board.h"
+
+#define AUDIO_CODEC_DMA_DESC_NUM 6
+#define AUDIO_CODEC_DMA_FRAME_NUM 240
+#define AUDIO_CODEC_DEFAULT_MIC_GAIN 30.0
+
+class AudioCodec {
+public:
+    AudioCodec();
+    virtual ~AudioCodec();
+    
+    virtual void SetOutputVolume(int volume);
+    virtual void EnableInput(bool enable);
+    virtual void EnableOutput(bool enable);
+    virtual bool SetOutputSampleRate(int sample_rate);
+
+    virtual void OutputData(std::vector<int16_t>& data);
+    virtual bool InputData(std::vector<int16_t>& data);
+    virtual void Start();
+
+    inline bool duplex() const { return duplex_; }
+    inline bool input_reference() const { return input_reference_; }
+    inline int input_sample_rate() const { return input_sample_rate_; }
+    inline int output_sample_rate() const { return output_sample_rate_; }
+    inline int original_output_sample_rate() const { return original_output_sample_rate_; }
+    inline int input_channels() const { return input_channels_; }
+    inline int output_channels() const { return output_channels_; }
+    inline int output_volume() const { return output_volume_; }
+    inline bool input_enabled() const { return input_enabled_; }
+    inline bool output_enabled() const { return output_enabled_; }
+
+protected:
+    i2s_chan_handle_t tx_handle_ = nullptr;
+    i2s_chan_handle_t rx_handle_ = nullptr;
+
+    bool duplex_ = false;
+    bool input_reference_ = false;
+    bool input_enabled_ = false;
+    bool output_enabled_ = false;
+    int input_sample_rate_ = 0;
+    int output_sample_rate_ = 0;
+    int original_output_sample_rate_ = 0;
+    int input_channels_ = 1;
+    int output_channels_ = 1;
+    int output_volume_ = 70;
+
+    virtual int Read(int16_t* dest, int samples) = 0;
+    virtual int Write(const int16_t* data, int samples) = 0;
+};
+
+#endif // _AUDIO_CODEC_H
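
Review note: the `Read`/`Write` contract that board-specific codecs must implement can be illustrated with a host-only loopback. This is a sketch under stated assumptions: `CodecBase` is a trimmed, hypothetical copy of the interface (the real `AudioCodec` depends on ESP-IDF headers), and `LoopbackCodec` is an invented test double, not a class from this repository.

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>
#include <vector>

// Trimmed, host-only stand-in for AudioCodec: only the Read/Write contract
// and the two vector helpers from audio_codec.cc are reproduced here.
class CodecBase {
public:
    virtual ~CodecBase() = default;

    void OutputData(std::vector<int16_t>& data) {
        Write(data.data(), static_cast<int>(data.size()));
    }
    bool InputData(std::vector<int16_t>& data) {
        return Read(data.data(), static_cast<int>(data.size())) > 0;
    }

protected:
    virtual int Read(int16_t* dest, int samples) = 0;
    virtual int Write(const int16_t* data, int samples) = 0;
};

// Hypothetical loopback codec: every sample written becomes readable, in order.
class LoopbackCodec : public CodecBase {
protected:
    int Read(int16_t* dest, int samples) override {
        int n = std::min(samples, static_cast<int>(fifo_.size()));
        for (int i = 0; i < n; ++i) {
            dest[i] = fifo_.front();
            fifo_.pop_front();
        }
        return n;  // May be fewer than requested.
    }
    int Write(const int16_t* data, int samples) override {
        fifo_.insert(fifo_.end(), data, data + samples);
        return samples;
    }

private:
    std::deque<int16_t> fifo_;
};
```

Note that `InputData()` reports success for any positive sample count, so a short read is indistinguishable from a full one at the call site; implementations of `Read` are presumably expected to block until the whole buffer is filled.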
--- a/main/audio/audio_processor.h
+++ b/main/audio/audio_processor.h
@@ -0,0 +1,25 @@
+#ifndef AUDIO_PROCESSOR_H
+#define AUDIO_PROCESSOR_H
+
+#include <string>
+#include <vector>
+#include <functional>
+
+#include "audio_codec.h"
+
+class AudioProcessor {
+public:
+    virtual ~AudioProcessor() = default;
+    
+    virtual void Initialize(AudioCodec* codec, int frame_duration_ms) = 0;
+    virtual void Feed(std::vector<int16_t>&& data) = 0;
+    virtual void Start() = 0;
+    virtual void Stop() = 0;
+    virtual bool IsRunning() = 0;
+    virtual void OnOutput(std::function<void(std::vector<int16_t>&& data)> callback) = 0;
+    virtual void OnVadStateChange(std::function<void(bool speaking)> callback) = 0;
+    virtual size_t GetFeedSize() = 0;
+    virtual void EnableDeviceAec(bool enable) = 0;
+};
+
+#endif
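
Review note: the `Feed`/`OnOutput` callback shape of this interface can be shown with a pass-through implementation. The sketch below is hypothetical: `PassThroughProcessor` is an invented class that follows the same method names but does not derive from `AudioProcessor` or touch `AudioCodec`, so that it compiles on a host. The commit's own `NoAudioProcessor` is not visible in this part of the diff and may behave differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical pass-through processor following the Feed/OnOutput shape of
// the AudioProcessor interface, with no dependency on AudioCodec.
class PassThroughProcessor {
public:
    explicit PassThroughProcessor(std::size_t feed_size) : feed_size_(feed_size) {}

    void OnOutput(std::function<void(std::vector<int16_t>&&)> callback) {
        output_callback_ = std::move(callback);
    }
    void Start() { running_ = true; }
    void Stop() { running_ = false; }
    bool IsRunning() const { return running_; }
    std::size_t GetFeedSize() const { return feed_size_; }

    // Forward a frame unchanged; drop it when stopped or when no callback is set.
    void Feed(std::vector<int16_t>&& data) {
        if (!running_ || !output_callback_) {
            return;
        }
        output_callback_(std::move(data));
    }

private:
    std::size_t feed_size_;
    bool running_ = false;
    std::function<void(std::vector<int16_t>&&)> output_callback_;
};
```

Taking the frame by rvalue reference, as the interface does, lets the buffer travel from the input task to the encode queue without a copy.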
--- a/main/audio/audio_service.cc
+++ b/main/audio/audio_service.cc
@@ -0,0 +1,673 @@
+#include "audio_service.h"
+#include <esp_log.h>
+#include <cstring>
+
+#if CONFIG_USE_AUDIO_PROCESSOR
+#include "processors/afe_audio_processor.h"
+#else
+#include "processors/no_audio_processor.h"
+#endif
+
+#if CONFIG_USE_AFE_WAKE_WORD
+#include "wake_words/afe_wake_word.h"
+#elif CONFIG_USE_ESP_WAKE_WORD
+#include "wake_words/esp_wake_word.h"
+#elif CONFIG_USE_CUSTOM_WAKE_WORD
+#include "wake_words/custom_wake_word.h"
+#endif
+
+#define TAG "AudioService"
+
+
+AudioService::AudioService() {
+    event_group_ = xEventGroupCreate();
+}
+
+AudioService::~AudioService() {
+    if (event_group_ != nullptr) {
+        vEventGroupDelete(event_group_);
+    }
+}
+
+
+void AudioService::Initialize(AudioCodec* codec) {
+    codec_ = codec;
+    codec_->Start();
+
+    /* Set up the Opus decoder and encoder */
+    opus_decoder_ = std::make_unique<OpusDecoderWrapper>(codec->output_sample_rate(), 1, OPUS_FRAME_DURATION_MS);
+    opus_encoder_ = std::make_unique<OpusEncoderWrapper>(16000, 1, OPUS_FRAME_DURATION_MS);
+    opus_encoder_->SetComplexity(0);
+
+    if (codec->input_sample_rate() != 16000) {
+        input_resampler_.Configure(codec->input_sample_rate(), 16000);
+        reference_resampler_.Configure(codec->input_sample_rate(), 16000);
+    }
+
+#if CONFIG_USE_AUDIO_PROCESSOR
+    audio_processor_ = std::make_unique<AfeAudioProcessor>();
+#else
+    audio_processor_ = std::make_unique<NoAudioProcessor>();
+#endif
+
+#if CONFIG_USE_AFE_WAKE_WORD
+    wake_word_ = std::make_unique<AfeWakeWord>();
+#elif CONFIG_USE_ESP_WAKE_WORD
+    wake_word_ = std::make_unique<EspWakeWord>();
+#elif CONFIG_USE_CUSTOM_WAKE_WORD
+    wake_word_ = std::make_unique<CustomWakeWord>();
+#else
+    wake_word_ = nullptr;
+#endif
+
+    audio_processor_->OnOutput([this](std::vector<int16_t>&& data) {
+        PushTaskToEncodeQueue(kAudioTaskTypeEncodeToSendQueue, std::move(data));
+    });
+
+    audio_processor_->OnVadStateChange([this](bool speaking) {
+        voice_detected_ = speaking;
+        if (callbacks_.on_vad_change) {
+            callbacks_.on_vad_change(speaking);
+        }
+    });
+
+    if (wake_word_) {
+        wake_word_->OnWakeWordDetected([this](const std::string& wake_word) {
+            if (callbacks_.on_wake_word_detected) {
+                callbacks_.on_wake_word_detected(wake_word);
+            }
+        });
+    }
+
+    esp_timer_create_args_t audio_power_timer_args = {
+        .callback = [](void* arg) {
+            AudioService* audio_service = (AudioService*)arg;
+            audio_service->CheckAndUpdateAudioPowerState();
+        },
+        .arg = this,
+        .dispatch_method = ESP_TIMER_TASK,
+        .name = "audio_power_timer",
+        .skip_unhandled_events = true,
+    };
+    esp_timer_create(&audio_power_timer_args, &audio_power_timer_);
+}
+
+void AudioService::Start() {
+    service_stopped_ = false;
+    xEventGroupClearBits(event_group_, AS_EVENT_AUDIO_TESTING_RUNNING | AS_EVENT_WAKE_WORD_RUNNING | AS_EVENT_AUDIO_PROCESSOR_RUNNING);
+
+    esp_timer_start_periodic(audio_power_timer_, 1000000);
+
+#if CONFIG_USE_AUDIO_PROCESSOR
+    /* Start the audio input task */
+    xTaskCreatePinnedToCore([](void* arg) {
+        AudioService* audio_service = (AudioService*)arg;
+        audio_service->AudioInputTask();
+        vTaskDelete(NULL);
+    }, "audio_input", 2048 * 3, this, 8, &audio_input_task_handle_, 1);
+
+    /* Start the audio output task */
+    xTaskCreate([](void* arg) {
+        AudioService* audio_service = (AudioService*)arg;
+        audio_service->AudioOutputTask();
+        vTaskDelete(NULL);
+    }, "audio_output", 2048 * 2, this, 3, &audio_output_task_handle_);
+#else
+    /* Start the audio input task */
+    xTaskCreate([](void* arg) {
+        AudioService* audio_service = (AudioService*)arg;
+        audio_service->AudioInputTask();
+        vTaskDelete(NULL);
+    }, "audio_input", 2048 * 2, this, 8, &audio_input_task_handle_);
+
+    /* Start the audio output task */
+    xTaskCreate([](void* arg) {
+        AudioService* audio_service = (AudioService*)arg;
+        audio_service->AudioOutputTask();
+        vTaskDelete(NULL);
+    }, "audio_output", 2048, this, 3, &audio_output_task_handle_);
+#endif
+
+    /* Start the opus codec task */
+    xTaskCreate([](void* arg) {
+        AudioService* audio_service = (AudioService*)arg;
+        audio_service->OpusCodecTask();
+        vTaskDelete(NULL);
+    }, "opus_codec", 2048 * 13, this, 2, &opus_codec_task_handle_);
+}
+
+void AudioService::Stop() {
+    esp_timer_stop(audio_power_timer_);
+    service_stopped_ = true;
+    xEventGroupSetBits(event_group_, AS_EVENT_AUDIO_TESTING_RUNNING |
+        AS_EVENT_WAKE_WORD_RUNNING |
+        AS_EVENT_AUDIO_PROCESSOR_RUNNING);
+
+    std::lock_guard<std::mutex> lock(audio_queue_mutex_);
+    audio_encode_queue_.clear();
+    audio_decode_queue_.clear();
+    audio_playback_queue_.clear();
+    audio_testing_queue_.clear();
+    audio_queue_cv_.notify_all();
+}
+
+bool AudioService::ReadAudioData(std::vector<int16_t>& data, int sample_rate, int samples) {
+    if (!codec_->input_enabled()) {
+        esp_timer_stop(audio_power_timer_);
+        esp_timer_start_periodic(audio_power_timer_, AUDIO_POWER_CHECK_INTERVAL_MS * 1000);
+        codec_->EnableInput(true);
+    }
+
+    if (codec_->input_sample_rate() != sample_rate) {
+        data.resize(samples * codec_->input_sample_rate() / sample_rate * codec_->input_channels());
+        if (!codec_->InputData(data)) {
+            return false;
+        }
+        if (codec_->input_channels() == 2) {
+            auto mic_channel = std::vector<int16_t>(data.size() / 2);
+            auto reference_channel = std::vector<int16_t>(data.size() / 2);
+            for (size_t i = 0, j = 0; i < mic_channel.size(); ++i, j += 2) {
+                mic_channel[i] = data[j];
+                reference_channel[i] = data[j + 1];
+            }
+            auto resampled_mic = std::vector<int16_t>(input_resampler_.GetOutputSamples(mic_channel.size()));
+            auto resampled_reference = std::vector<int16_t>(reference_resampler_.GetOutputSamples(reference_channel.size()));
+            input_resampler_.Process(mic_channel.data(), mic_channel.size(), resampled_mic.data());
+            reference_resampler_.Process(reference_channel.data(), reference_channel.size(), resampled_reference.data());
+            data.resize(resampled_mic.size() + resampled_reference.size());
+            for (size_t i = 0, j = 0; i < resampled_mic.size(); ++i, j += 2) {
+                data[j] = resampled_mic[i];
+                data[j + 1] = resampled_reference[i];
+            }
+        } else {
+            auto resampled = std::vector<int16_t>(input_resampler_.GetOutputSamples(data.size()));
+            input_resampler_.Process(data.data(), data.size(), resampled.data());
+            data = std::move(resampled);
+        }
+    } else {
+        data.resize(samples * codec_->input_channels());
+        if (!codec_->InputData(data)) {
+            return false;
+        }
+    }
+
+    /* Update the last input time */
+    last_input_time_ = std::chrono::steady_clock::now();
+    debug_statistics_.input_count++;
+
+#if CONFIG_USE_AUDIO_DEBUGGER
+    // Audio debugging: send the raw audio data
+    if (audio_debugger_ == nullptr) {
+        audio_debugger_ = std::make_unique<AudioDebugger>();
+    }
+    audio_debugger_->Feed(data);
+#endif
+
+    return true;
+}
+
+void AudioService::AudioInputTask() {
+    while (true) {
+        EventBits_t bits = xEventGroupWaitBits(event_group_, AS_EVENT_AUDIO_TESTING_RUNNING |
+            AS_EVENT_WAKE_WORD_RUNNING | AS_EVENT_AUDIO_PROCESSOR_RUNNING,
+            pdFALSE, pdFALSE, portMAX_DELAY);
+
+        if (service_stopped_) {
+            break;
+        }
+        if (audio_input_need_warmup_) {
+            audio_input_need_warmup_ = false;
+            vTaskDelay(pdMS_TO_TICKS(120));
+            continue;
+        }
+
+        /* Used for audio testing in NetworkConfiguring mode by clicking the BOOT button */
+        if (bits & AS_EVENT_AUDIO_TESTING_RUNNING) {
+            if (audio_testing_queue_.size() >= AUDIO_TESTING_MAX_DURATION_MS / OPUS_FRAME_DURATION_MS) {
+                ESP_LOGW(TAG, "Audio testing queue is full, stopping audio testing");
+                EnableAudioTesting(false);
+                continue;
+            }
+            std::vector<int16_t> data;
+            int samples = OPUS_FRAME_DURATION_MS * 16000 / 1000;
+            if (ReadAudioData(data, 16000, samples)) {
+                // With two input channels, keep only the first (microphone) channel
+                if (codec_->input_channels() == 2) {
+                    auto mono_data = std::vector<int16_t>(data.size() / 2);
+                    for (size_t i = 0, j = 0; i < mono_data.size(); ++i, j += 2) {
+                        mono_data[i] = data[j];
+                    }
+                    data = std::move(mono_data);
+                }
+                PushTaskToEncodeQueue(kAudioTaskTypeEncodeToTestingQueue, std::move(data));
+                continue;
+            }
+        }
+
+        /* Feed the wake word */
+        if (bits & AS_EVENT_WAKE_WORD_RUNNING) {
+            std::vector<int16_t> data;
+            int samples = wake_word_->GetFeedSize();
+            if (samples > 0) {
+                if (ReadAudioData(data, 16000, samples)) {
+                    wake_word_->Feed(data);
+                    continue;
+                }
+            }
+        }
+
+        /* Feed the audio processor */
+        if (bits & AS_EVENT_AUDIO_PROCESSOR_RUNNING) {
+            std::vector<int16_t> data;
+            int samples = audio_processor_->GetFeedSize();
+            if (samples > 0) {
+                if (ReadAudioData(data, 16000, samples)) {
+                    audio_processor_->Feed(std::move(data));
+                    continue;
+                }
+            }
+        }
+
+        ESP_LOGE(TAG, "Should not be here, bits: %lx", bits);
+        break;
+    }
+
+    ESP_LOGW(TAG, "Audio input task stopped");
+}
+
+void AudioService::AudioOutputTask() {
+    while (true) {
+        std::unique_lock<std::mutex> lock(audio_queue_mutex_);
+        audio_queue_cv_.wait(lock, [this]() { return !audio_playback_queue_.empty() || service_stopped_; });
+        if (service_stopped_) {
+            break;
+        }
+
+        auto task = std::move(audio_playback_queue_.front());
+        audio_playback_queue_.pop_front();
+        audio_queue_cv_.notify_all();
+        lock.unlock();
+
+        if (!codec_->output_enabled()) {
+            esp_timer_stop(audio_power_timer_);
+            esp_timer_start_periodic(audio_power_timer_, AUDIO_POWER_CHECK_INTERVAL_MS * 1000);
+            codec_->EnableOutput(true);
+        }
+        codec_->OutputData(task->pcm);
+
+        /* Update the last output time */
+        last_output_time_ = std::chrono::steady_clock::now();
+        debug_statistics_.playback_count++;
+
+#if CONFIG_USE_SERVER_AEC
+        /* Record the timestamp for server AEC */
+        if (task->timestamp > 0) {
+            lock.lock();
+            timestamp_queue_.push_back(task->timestamp);
+        }
+#endif
+    }
+
+    ESP_LOGW(TAG, "Audio output task stopped");
+}
+
+void AudioService::OpusCodecTask() {
+    while (true) {
+        std::unique_lock<std::mutex> lock(audio_queue_mutex_);
+        audio_queue_cv_.wait(lock, [this]() {
+            return service_stopped_ ||
+                (!audio_encode_queue_.empty() && audio_send_queue_.size() < MAX_SEND_PACKETS_IN_QUEUE) ||
+                (!audio_decode_queue_.empty() && audio_playback_queue_.size() < MAX_PLAYBACK_TASKS_IN_QUEUE);
+        });
+        if (service_stopped_) {
+            break;
+        }
+
+        /* Decode the audio from decode queue */
+        if (!audio_decode_queue_.empty() && audio_playback_queue_.size() < MAX_PLAYBACK_TASKS_IN_QUEUE) {
+            auto packet = std::move(audio_decode_queue_.front());
+            audio_decode_queue_.pop_front();
+            audio_queue_cv_.notify_all();
+            lock.unlock();
+
+            auto task = std::make_unique<AudioTask>();
+            task->type = kAudioTaskTypeDecodeToPlaybackQueue;
+            task->timestamp = packet->timestamp;
+
+            SetDecodeSampleRate(packet->sample_rate, packet->frame_duration);
+            if (opus_decoder_->Decode(std::move(packet->payload), task->pcm)) {
+                // Resample if the sample rate is different
+                if (opus_decoder_->sample_rate() != codec_->output_sample_rate()) {
+                    int target_size = output_resampler_.GetOutputSamples(task->pcm.size());
+                    std::vector<int16_t> resampled(target_size);
+                    output_resampler_.Process(task->pcm.data(), task->pcm.size(), resampled.data());
+                    task->pcm = std::move(resampled);
+                }
+
+                lock.lock();
+                audio_playback_queue_.push_back(std::move(task));
+                audio_queue_cv_.notify_all();
+            } else {
+                ESP_LOGE(TAG, "Failed to decode audio");
+                lock.lock();
+            }
+            debug_statistics_.decode_count++;
+        }
+        
+        /* Encode the audio to send queue */
+        if (!audio_encode_queue_.empty() && audio_send_queue_.size() < MAX_SEND_PACKETS_IN_QUEUE) {
+            auto task = std::move(audio_encode_queue_.front());
+            audio_encode_queue_.pop_front();
+            audio_queue_cv_.notify_all();
+            lock.unlock();
+
+            auto packet = std::make_unique<AudioStreamPacket>();
+            packet->frame_duration = OPUS_FRAME_DURATION_MS;
+            packet->sample_rate = 16000;
+            packet->timestamp = task->timestamp;
+            if (!opus_encoder_->Encode(std::move(task->pcm), packet->payload)) {
+                ESP_LOGE(TAG, "Failed to encode audio");
+                continue;
+            }
+
+            if (task->type == kAudioTaskTypeEncodeToSendQueue) {
+                {
+                    std::lock_guard<std::mutex> lock(audio_queue_mutex_);
+                    audio_send_queue_.push_back(std::move(packet));
+                }
+                if (callbacks_.on_send_queue_available) {
+                    callbacks_.on_send_queue_available();
+                }
+            } else if (task->type == kAudioTaskTypeEncodeToTestingQueue) {
+                std::lock_guard<std::mutex> lock(audio_queue_mutex_);
+                audio_testing_queue_.push_back(std::move(packet));
+            }
+            debug_statistics_.encode_count++;
+            lock.lock();
+        }
+    }
+
+    ESP_LOGW(TAG, "Opus codec task stopped");
+}
+
+void AudioService::SetDecodeSampleRate(int sample_rate, int frame_duration) {
+    if (opus_decoder_->sample_rate() == sample_rate && opus_decoder_->duration_ms() == frame_duration) {
+        return;
+    }
+
+    opus_decoder_.reset();
+    opus_decoder_ = std::make_unique<OpusDecoderWrapper>(sample_rate, 1, frame_duration);
+
+    auto codec = Board::GetInstance().GetAudioCodec();
+    if (opus_decoder_->sample_rate() != codec->output_sample_rate()) {
+        ESP_LOGI(TAG, "Resampling audio from %d to %d", opus_decoder_->sample_rate(), codec->output_sample_rate());
+        output_resampler_.Configure(opus_decoder_->sample_rate(), codec->output_sample_rate());
+    }
+}
+
+void AudioService::PushTaskToEncodeQueue(AudioTaskType type, std::vector<int16_t>&& pcm) {
+    auto task = std::make_unique<AudioTask>();
+    task->type = type;
+    task->pcm = std::move(pcm);
+    
+    /* Push the task to the encode queue */
+    std::unique_lock<std::mutex> lock(audio_queue_mutex_);
+
+    /* Tasks bound for the send queue carry the oldest pending playback timestamp */
+    if (type == kAudioTaskTypeEncodeToSendQueue && !timestamp_queue_.empty()) {
+        if (timestamp_queue_.size() <= MAX_TIMESTAMPS_IN_QUEUE) {
+            task->timestamp = timestamp_queue_.front();
+        } else {
+            ESP_LOGW(TAG, "Timestamp queue (%u) is full, dropping timestamp", timestamp_queue_.size());
+        }
+        timestamp_queue_.pop_front();
+    }
+
+    audio_queue_cv_.wait(lock, [this]() { return audio_encode_queue_.size() < MAX_ENCODE_TASKS_IN_QUEUE; });
+    audio_encode_queue_.push_back(std::move(task));
+    audio_queue_cv_.notify_all();
+}
+
+bool AudioService::PushPacketToDecodeQueue(std::unique_ptr<AudioStreamPacket> packet, bool wait) {
+    std::unique_lock<std::mutex> lock(audio_queue_mutex_);
+    if (audio_decode_queue_.size() >= MAX_DECODE_PACKETS_IN_QUEUE) {
+        if (wait) {
+            audio_queue_cv_.wait(lock, [this]() { return audio_decode_queue_.size() < MAX_DECODE_PACKETS_IN_QUEUE; });
+        } else {
+            return false;
+        }
+    }
+    audio_decode_queue_.push_back(std::move(packet));
+    audio_queue_cv_.notify_all();
+    return true;
+}
+
+std::unique_ptr<AudioStreamPacket> AudioService::PopPacketFromSendQueue() {
+    std::lock_guard<std::mutex> lock(audio_queue_mutex_);
+    if (audio_send_queue_.empty()) {
+        return nullptr;
+    }
+    auto packet = std::move(audio_send_queue_.front());
+    audio_send_queue_.pop_front();
+    audio_queue_cv_.notify_all();
+    return packet;
+}
+
+void AudioService::EncodeWakeWord() {
+    if (wake_word_) {
+        wake_word_->EncodeWakeWordData();
+    }
+}
+
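+const std::string& AudioService::GetLastWakeWord() const {
+    // wake_word_ is null when no wake word engine is configured
+    static const std::string kNoWakeWord;
+    if (!wake_word_) {
+        return kNoWakeWord;
+    }
+    return wake_word_->GetLastDetectedWakeWord();
+}
+
+std::unique_ptr<AudioStreamPacket> AudioService::PopWakeWordPacket() {
+    if (!wake_word_) {
+        return nullptr;
+    }
+    auto packet = std::make_unique<AudioStreamPacket>();
+    if (wake_word_->GetWakeWordOpus(packet->payload)) {
+        return packet;
+    }
+    return nullptr;
+}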
+
+void AudioService::EnableWakeWordDetection(bool enable) {
+    if (!wake_word_) {
+        return;
+    }
+
+    ESP_LOGD(TAG, "%s wake word detection", enable ? "Enabling" : "Disabling");
+    if (enable) {
+        if (!wake_word_initialized_) {
+            if (!wake_word_->Initialize(codec_)) {
+                ESP_LOGE(TAG, "Failed to initialize wake word");
+                return;
+            }
+            wake_word_initialized_ = true;
+        }
+        wake_word_->Start();
+        xEventGroupSetBits(event_group_, AS_EVENT_WAKE_WORD_RUNNING);
+    } else {
+        wake_word_->Stop();
+        xEventGroupClearBits(event_group_, AS_EVENT_WAKE_WORD_RUNNING);
+    }
+}
+
+void AudioService::EnableVoiceProcessing(bool enable) {
+    ESP_LOGD(TAG, "%s voice processing", enable ? "Enabling" : "Disabling");
+    if (enable) {
+        if (!audio_processor_initialized_) {
+            audio_processor_->Initialize(codec_, OPUS_FRAME_DURATION_MS);
+            audio_processor_initialized_ = true;
+        }
+
+        /* We should make sure no audio is playing */
+        ResetDecoder();
+        audio_input_need_warmup_ = true;
+        audio_processor_->Start();
+        xEventGroupSetBits(event_group_, AS_EVENT_AUDIO_PROCESSOR_RUNNING);
+    } else {
+        audio_processor_->Stop();
+        xEventGroupClearBits(event_group_, AS_EVENT_AUDIO_PROCESSOR_RUNNING);
+    }
+}
+
+void AudioService::EnableAudioTesting(bool enable) {
+    ESP_LOGI(TAG, "%s audio testing", enable ? "Enabling" : "Disabling");
+    if (enable) {
+        xEventGroupSetBits(event_group_, AS_EVENT_AUDIO_TESTING_RUNNING);
+    } else {
+        xEventGroupClearBits(event_group_, AS_EVENT_AUDIO_TESTING_RUNNING);
+        /* Move audio_testing_queue_ into audio_decode_queue_, replacing its previous contents */
+        std::lock_guard<std::mutex> lock(audio_queue_mutex_);
+        audio_decode_queue_ = std::move(audio_testing_queue_);
+        audio_queue_cv_.notify_all();
+    }
+}
+
+void AudioService::EnableDeviceAec(bool enable) {
+    ESP_LOGI(TAG, "%s device AEC", enable ? "Enabling" : "Disabling");
+    if (!audio_processor_initialized_) {
+        audio_processor_->Initialize(codec_, OPUS_FRAME_DURATION_MS);
+        audio_processor_initialized_ = true;
+    }
+
+    audio_processor_->EnableDeviceAec(enable);
+}
+
+void AudioService::SetCallbacks(AudioServiceCallbacks& callbacks) {
+    callbacks_ = callbacks;
+}
+
+void AudioService::PlaySound(const std::string_view& ogg) {
+    if (!codec_->output_enabled()) {
+        esp_timer_stop(audio_power_timer_);
+        esp_timer_start_periodic(audio_power_timer_, AUDIO_POWER_CHECK_INTERVAL_MS * 1000);
+        codec_->EnableOutput(true);
+    }
+
+    const uint8_t* buf = reinterpret_cast<const uint8_t*>(ogg.data());
+    size_t size = ogg.size();
+    size_t offset = 0;
+
+    auto find_page = [&](size_t start)->size_t {
+        for (size_t i = start; i + 4 <= size; ++i) {
+            if (buf[i] == 'O' && buf[i+1] == 'g' && buf[i+2] == 'g' && buf[i+3] == 'S') return i;
+        }
+        return static_cast<size_t>(-1);
+    };
+
+    bool seen_head = false;
+    bool seen_tags = false;
+    int sample_rate = 16000; // Default value
+
+    while (true) {
+        size_t pos = find_page(offset);
+        if (pos == static_cast<size_t>(-1)) break;
+        offset = pos;
+        if (offset + 27 > size) break;
+
+        const uint8_t* page = buf + offset;
+        uint8_t page_segments = page[26];
+        size_t seg_table_off = offset + 27;
+        if (seg_table_off + page_segments > size) break;
+
+        size_t body_size = 0;
+        for (size_t i = 0; i < page_segments; ++i) body_size += page[27 + i];
+
+        size_t body_off = seg_table_off + page_segments;
+        if (body_off + body_size > size) break;
+
+        // Parse packets using lacing
+        size_t cur = body_off;
+        size_t seg_idx = 0;
+        while (seg_idx < page_segments) {
+            size_t pkt_len = 0;
+            size_t pkt_start = cur;
+            bool continued = false;
+            do {
+                uint8_t l = page[27 + seg_idx++];
+                pkt_len += l;
+                cur += l;
+                continued = (l == 255);
+            } while (continued && seg_idx < page_segments);
+
+            if (pkt_len == 0) continue;
+            const uint8_t* pkt_ptr = buf + pkt_start;
+
+            if (!seen_head) {
+                // Parse the OpusHead packet
+                if (pkt_len >= 19 && std::memcmp(pkt_ptr, "OpusHead", 8) == 0) {
+                    seen_head = true;
+                    
+                    // OpusHead layout: [0-7] "OpusHead", [8] version, [9] channel_count, [10-11] pre_skip,
+                    // [12-15] input_sample_rate, [16-17] output_gain, [18] mapping_family
+                    if (pkt_len >= 12) {
+                        uint8_t version = pkt_ptr[8];
+                        uint8_t channel_count = pkt_ptr[9];
+                        
+                        if (pkt_len >= 16) {
+                            // Read the input sample rate (little-endian)
+                            sample_rate = pkt_ptr[12] | (pkt_ptr[13] << 8) | 
+                                        (pkt_ptr[14] << 16) | (pkt_ptr[15] << 24);
+                            ESP_LOGI(TAG, "OpusHead: version=%d, channels=%d, sample_rate=%d", 
+                                   version, channel_count, sample_rate);
+                        }
+                    }
+                }
+                continue;
+            }
+            if (!seen_tags) {
+                // Expect OpusTags in second packet
+                if (pkt_len >= 8 && std::memcmp(pkt_ptr, "OpusTags", 8) == 0) {
+                    seen_tags = true;
+                }
+                continue;
+            }
+
+            // Audio packet (Opus)
+            auto packet = std::make_unique<AudioStreamPacket>();
+            packet->sample_rate = sample_rate;
+            packet->frame_duration = 60;
+            packet->payload.resize(pkt_len);
+            std::memcpy(packet->payload.data(), pkt_ptr, pkt_len);
+            PushPacketToDecodeQueue(std::move(packet), true);
+        }
+
+        offset = body_off + body_size;
+    }
+}
+
+bool AudioService::IsIdle() {
+    std::lock_guard<std::mutex> lock(audio_queue_mutex_);
+    return audio_encode_queue_.empty() && audio_decode_queue_.empty() && audio_playback_queue_.empty() && audio_testing_queue_.empty();
+}
+
+void AudioService::ResetDecoder() {
+    std::lock_guard<std::mutex> lock(audio_queue_mutex_);
+    opus_decoder_->ResetState();
+    timestamp_queue_.clear();
+    audio_decode_queue_.clear();
+    audio_playback_queue_.clear();
+    audio_testing_queue_.clear();
+    audio_queue_cv_.notify_all();
+}
+
+void AudioService::CheckAndUpdateAudioPowerState() {
+    auto now = std::chrono::steady_clock::now();
+    auto input_elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(now - last_input_time_).count();
+    auto output_elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(now - last_output_time_).count();
+    if (input_elapsed > AUDIO_POWER_TIMEOUT_MS && codec_->input_enabled()) {
+        codec_->EnableInput(false);
+    }
+    if (output_elapsed > AUDIO_POWER_TIMEOUT_MS && codec_->output_enabled()) {
+        codec_->EnableOutput(false);
+    }
+    if (!codec_->input_enabled() && !codec_->output_enabled()) {
+        esp_timer_stop(audio_power_timer_);
+    }
+}
+
+void AudioService::UpdateOutputTimestamp() {
+    last_output_time_ = std::chrono::steady_clock::now();
+}
--- a/main/audio/audio_service.h
+++ b/main/audio/audio_service.h
@@ -0,0 +1,162 @@
+#ifndef AUDIO_SERVICE_H
+#define AUDIO_SERVICE_H
+
+#include <memory>
+#include <deque>
+#include <functional>
+#include <string>
+#include <vector>
+#include <condition_variable>
+#include <chrono>
+#include <mutex>
+
+#include <freertos/FreeRTOS.h>
+#include <freertos/task.h>
+#include <freertos/event_groups.h>
+#include <esp_timer.h>
+
+#include <opus_encoder.h>
+#include <opus_decoder.h>
+#include <opus_resampler.h>
+
+#include "audio_codec.h"
+#include "audio_processor.h"
+#include "processors/audio_debugger.h"
+#include "wake_word.h"
+#include "protocol.h"
+
+
+/*
+ * There are two audio data flows:
+ * 1. (MIC) -> [Processors] -> {Encode Queue} -> [Opus Encoder] -> {Send Queue} -> (Server)
+ * 2. (Server) -> {Decode Queue} -> [Opus Decoder] -> {Playback Queue} -> (Speaker)
+ *
+ * Three tasks run the pipeline: one for MIC input / Processors, one for Speaker output,
+ * and one shared by the Opus Encoder and Opus Decoder.
+ *
+ * Decode Queue and Send Queue are the main buffers, because Opus packets are much smaller than PCM frames.
+ */
+
+#define OPUS_FRAME_DURATION_MS 60
+#define MAX_ENCODE_TASKS_IN_QUEUE 2
+#define MAX_PLAYBACK_TASKS_IN_QUEUE 2
+#define MAX_DECODE_PACKETS_IN_QUEUE (2400 / OPUS_FRAME_DURATION_MS)
+#define MAX_SEND_PACKETS_IN_QUEUE (2400 / OPUS_FRAME_DURATION_MS)
+#define AUDIO_TESTING_MAX_DURATION_MS 10000
+#define MAX_TIMESTAMPS_IN_QUEUE 3
+
+#define AUDIO_POWER_TIMEOUT_MS 15000
+#define AUDIO_POWER_CHECK_INTERVAL_MS 1000
+
+
+#define AS_EVENT_AUDIO_TESTING_RUNNING      (1 << 0)
+#define AS_EVENT_WAKE_WORD_RUNNING          (1 << 1)
+#define AS_EVENT_AUDIO_PROCESSOR_RUNNING    (1 << 2)
+#define AS_EVENT_PLAYBACK_NOT_EMPTY         (1 << 3)
+
+struct AudioServiceCallbacks {
+    std::function<void(void)> on_send_queue_available;
+    std::function<void(const std::string&)> on_wake_word_detected;
+    std::function<void(bool)> on_vad_change;
+    std::function<void(void)> on_audio_testing_queue_full;
+};
+
+
+enum AudioTaskType {
+    kAudioTaskTypeEncodeToSendQueue,
+    kAudioTaskTypeEncodeToTestingQueue,
+    kAudioTaskTypeDecodeToPlaybackQueue,
+};
+
+struct AudioTask {
+    AudioTaskType type;
+    std::vector<int16_t> pcm;
+    uint32_t timestamp;
+};
+
+struct DebugStatistics {
+    uint32_t input_count = 0;
+    uint32_t decode_count = 0;
+    uint32_t encode_count = 0;
+    uint32_t playback_count = 0;
+};
+
+class AudioService {
+public:
+    AudioService();
+    ~AudioService();
+
+    void Initialize(AudioCodec* codec);
+    void Start();
+    void Stop();
+    void EncodeWakeWord();
+    std::unique_ptr<AudioStreamPacket> PopWakeWordPacket();
+    const std::string& GetLastWakeWord() const;
+    bool IsVoiceDetected() const { return voice_detected_; }
+    bool IsIdle();
+    bool IsWakeWordRunning() const { return xEventGroupGetBits(event_group_) & AS_EVENT_WAKE_WORD_RUNNING; }
+    bool IsAudioProcessorRunning() const { return xEventGroupGetBits(event_group_) & AS_EVENT_AUDIO_PROCESSOR_RUNNING; }
+
+    void EnableWakeWordDetection(bool enable);
+    void EnableVoiceProcessing(bool enable);
+    void EnableAudioTesting(bool enable);
+    void EnableDeviceAec(bool enable);
+
+    void SetCallbacks(AudioServiceCallbacks& callbacks);
+
+    bool PushPacketToDecodeQueue(std::unique_ptr<AudioStreamPacket> packet, bool wait = false);
+    std::unique_ptr<AudioStreamPacket> PopPacketFromSendQueue();
+    void PlaySound(const std::string_view& sound);
+    bool ReadAudioData(std::vector<int16_t>& data, int sample_rate, int samples);
+    void ResetDecoder();
+
+    void UpdateOutputTimestamp();
+
+private:
+    AudioCodec* codec_ = nullptr;
+    AudioServiceCallbacks callbacks_;
+    std::unique_ptr<AudioProcessor> audio_processor_;
+    std::unique_ptr<WakeWord> wake_word_;
+    std::unique_ptr<AudioDebugger> audio_debugger_;
+    std::unique_ptr<OpusEncoderWrapper> opus_encoder_;
+    std::unique_ptr<OpusDecoderWrapper> opus_decoder_;
+    OpusResampler input_resampler_;
+    OpusResampler reference_resampler_;
+    OpusResampler output_resampler_;
+    DebugStatistics debug_statistics_;
+
+    EventGroupHandle_t event_group_;
+
+    // Audio encode / decode
+    TaskHandle_t audio_input_task_handle_ = nullptr;
+    TaskHandle_t audio_output_task_handle_ = nullptr;
+    TaskHandle_t opus_codec_task_handle_ = nullptr;
+    std::mutex audio_queue_mutex_;
+    std::condition_variable audio_queue_cv_;
+    std::deque<std::unique_ptr<AudioStreamPacket>> audio_decode_queue_;
+    std::deque<std::unique_ptr<AudioStreamPacket>> audio_send_queue_;
+    std::deque<std::unique_ptr<AudioStreamPacket>> audio_testing_queue_;
+    std::deque<std::unique_ptr<AudioTask>> audio_encode_queue_;
+    std::deque<std::unique_ptr<AudioTask>> audio_playback_queue_;
+    // For server AEC
+    std::deque<uint32_t> timestamp_queue_;
+
+    bool wake_word_initialized_ = false;
+    bool audio_processor_initialized_ = false;
+    bool voice_detected_ = false;
+    bool service_stopped_ = true;
+    bool audio_input_need_warmup_ = false;
+
+    esp_timer_handle_t audio_power_timer_ = nullptr;
+    std::chrono::steady_clock::time_point last_input_time_;
+    std::chrono::steady_clock::time_point last_output_time_;
+
+    void AudioInputTask();
+    void AudioOutputTask();
+    void OpusCodecTask();
+    void PushTaskToEncodeQueue(AudioTaskType type, std::vector<int16_t>&& pcm);
+    void SetDecodeSampleRate(int sample_rate, int frame_duration);
+    void CheckAndUpdateAudioPowerState();
+};
+
+#endif
--- a/main/audio/codecs/box_audio_codec.cc
+++ b/main/audio/codecs/box_audio_codec.cc
@@ -0,0 +1,244 @@
+#include "box_audio_codec.h"
+
+#include <esp_log.h>
+#include <driver/i2c_master.h>
+#include <driver/i2s_tdm.h>
+
+#define TAG "BoxAudioCodec"
+
+BoxAudioCodec::BoxAudioCodec(void* i2c_master_handle, int input_sample_rate, int output_sample_rate,
+    gpio_num_t mclk, gpio_num_t bclk, gpio_num_t ws, gpio_num_t dout, gpio_num_t din,
+    gpio_num_t pa_pin, uint8_t es8311_addr, uint8_t es7210_addr, bool input_reference) {
+    duplex_ = true; // Full-duplex operation
+    input_reference_ = input_reference; // Whether to use a reference input for echo cancellation
+    input_channels_ = input_reference_ ? 2 : 1; // Number of input channels
+    input_sample_rate_ = input_sample_rate;
+    output_sample_rate_ = output_sample_rate;
+
+    CreateDuplexChannels(mclk, bclk, ws, dout, din);
+
+    // Initialize the related interfaces: data_if, ctrl_if and gpio_if
+    audio_codec_i2s_cfg_t i2s_cfg = {
+        .port = I2S_NUM_0,
+        .rx_handle = rx_handle_,
+        .tx_handle = tx_handle_,
+    };
+    data_if_ = audio_codec_new_i2s_data(&i2s_cfg);
+    assert(data_if_ != NULL);
+
+    // Output
+    audio_codec_i2c_cfg_t i2c_cfg = {
+        .port = (i2c_port_t)1,
+        .addr = es8311_addr,
+        .bus_handle = i2c_master_handle,
+    };
+    out_ctrl_if_ = audio_codec_new_i2c_ctrl(&i2c_cfg);
+    assert(out_ctrl_if_ != NULL);
+
+    gpio_if_ = audio_codec_new_gpio();
+    assert(gpio_if_ != NULL);
+
+    es8311_codec_cfg_t es8311_cfg = {};
+    es8311_cfg.ctrl_if = out_ctrl_if_;
+    es8311_cfg.gpio_if = gpio_if_;
+    es8311_cfg.codec_mode = ESP_CODEC_DEV_WORK_MODE_DAC;
+    es8311_cfg.pa_pin = pa_pin;
+    es8311_cfg.use_mclk = true;
+    es8311_cfg.hw_gain.pa_voltage = 5.0;
+    es8311_cfg.hw_gain.codec_dac_voltage = 3.3;
+    out_codec_if_ = es8311_codec_new(&es8311_cfg);
+    assert(out_codec_if_ != NULL);
+
+    esp_codec_dev_cfg_t dev_cfg = {
+        .dev_type = ESP_CODEC_DEV_TYPE_OUT,
+        .codec_if = out_codec_if_,
+        .data_if = data_if_,
+    };
+    output_dev_ = esp_codec_dev_new(&dev_cfg);
+    assert(output_dev_ != NULL);
+
+    // Input
+    i2c_cfg.addr = es7210_addr;
+    in_ctrl_if_ = audio_codec_new_i2c_ctrl(&i2c_cfg);
+    assert(in_ctrl_if_ != NULL);
+
+    es7210_codec_cfg_t es7210_cfg = {};
+    es7210_cfg.ctrl_if = in_ctrl_if_;
+    es7210_cfg.mic_selected = ES7210_SEL_MIC1 | ES7210_SEL_MIC2 | ES7210_SEL_MIC3 | ES7210_SEL_MIC4;
+    in_codec_if_ = es7210_codec_new(&es7210_cfg);
+    assert(in_codec_if_ != NULL);
+
+    dev_cfg.dev_type = ESP_CODEC_DEV_TYPE_IN;
+    dev_cfg.codec_if = in_codec_if_;
+    input_dev_ = esp_codec_dev_new(&dev_cfg);
+    assert(input_dev_ != NULL);
+
+    ESP_LOGI(TAG, "BoxAudioCodec initialized");
+}
+
+BoxAudioCodec::~BoxAudioCodec() {
+    ESP_ERROR_CHECK(esp_codec_dev_close(output_dev_));
+    esp_codec_dev_delete(output_dev_);
+    ESP_ERROR_CHECK(esp_codec_dev_close(input_dev_));
+    esp_codec_dev_delete(input_dev_);
+
+    audio_codec_delete_codec_if(in_codec_if_);
+    audio_codec_delete_ctrl_if(in_ctrl_if_);
+    audio_codec_delete_codec_if(out_codec_if_);
+    audio_codec_delete_ctrl_if(out_ctrl_if_);
+    audio_codec_delete_gpio_if(gpio_if_);
+    audio_codec_delete_data_if(data_if_);
+}
+
+void BoxAudioCodec::CreateDuplexChannels(gpio_num_t mclk, gpio_num_t bclk, gpio_num_t ws, gpio_num_t dout, gpio_num_t din) {
+    assert(input_sample_rate_ == output_sample_rate_);
+
+    i2s_chan_config_t chan_cfg = {
+        .id = I2S_NUM_0,
+        .role = I2S_ROLE_MASTER,
+        .dma_desc_num = AUDIO_CODEC_DMA_DESC_NUM,
+        .dma_frame_num = AUDIO_CODEC_DMA_FRAME_NUM,
+        .auto_clear_after_cb = true,
+        .auto_clear_before_cb = false,
+        .intr_priority = 0,
+    };
+    ESP_ERROR_CHECK(i2s_new_channel(&chan_cfg, &tx_handle_, &rx_handle_));
+
+    i2s_std_config_t std_cfg = {
+        .clk_cfg = {
+            .sample_rate_hz = (uint32_t)output_sample_rate_,
+            .clk_src = I2S_CLK_SRC_DEFAULT,
+            .ext_clk_freq_hz = 0,
+            .mclk_multiple = I2S_MCLK_MULTIPLE_256
+        },
+        .slot_cfg = {
+            .data_bit_width = I2S_DATA_BIT_WIDTH_16BIT,
+            .slot_bit_width = I2S_SLOT_BIT_WIDTH_AUTO,
+            .slot_mode = I2S_SLOT_MODE_STEREO,
+            .slot_mask = I2S_STD_SLOT_BOTH,
+            .ws_width = I2S_DATA_BIT_WIDTH_16BIT,
+            .ws_pol = false,
+            .bit_shift = true,
+            .left_align = true,
+            .big_endian = false,
+            .bit_order_lsb = false
+        },
+        .gpio_cfg = {
+            .mclk = mclk,
+            .bclk = bclk,
+            .ws = ws,
+            .dout = dout,
+            .din = I2S_GPIO_UNUSED,
+            .invert_flags = {
+                .mclk_inv = false,
+                .bclk_inv = false,
+                .ws_inv = false
+            }
+        }
+    };
+
+    i2s_tdm_config_t tdm_cfg = {
+        .clk_cfg = {
+            .sample_rate_hz = (uint32_t)input_sample_rate_,
+            .clk_src = I2S_CLK_SRC_DEFAULT,
+            .ext_clk_freq_hz = 0,
+            .mclk_multiple = I2S_MCLK_MULTIPLE_256,
+            .bclk_div = 8,
+        },
+        .slot_cfg = {
+            .data_bit_width = I2S_DATA_BIT_WIDTH_16BIT,
+            .slot_bit_width = I2S_SLOT_BIT_WIDTH_AUTO,
+            .slot_mode = I2S_SLOT_MODE_STEREO,
+            .slot_mask = i2s_tdm_slot_mask_t(I2S_TDM_SLOT0 | I2S_TDM_SLOT1 | I2S_TDM_SLOT2 | I2S_TDM_SLOT3),
+            .ws_width = I2S_TDM_AUTO_WS_WIDTH,
+            .ws_pol = false,
+            .bit_shift = true,
+            .left_align = false,
+            .big_endian = false,
+            .bit_order_lsb = false,
+            .skip_mask = false,
+            .total_slot = I2S_TDM_AUTO_SLOT_NUM
+        },
+        .gpio_cfg = {
+            .mclk = mclk,
+            .bclk = bclk,
+            .ws = ws,
+            .dout = I2S_GPIO_UNUSED,
+            .din = din,
+            .invert_flags = {
+                .mclk_inv = false,
+                .bclk_inv = false,
+                .ws_inv = false
+            }
+        }
+    };
+
+    ESP_ERROR_CHECK(i2s_channel_init_std_mode(tx_handle_, &std_cfg));
+    ESP_ERROR_CHECK(i2s_channel_init_tdm_mode(rx_handle_, &tdm_cfg));
+    ESP_LOGI(TAG, "Duplex channels created");
+}
+
+void BoxAudioCodec::SetOutputVolume(int volume) {
+    ESP_ERROR_CHECK(esp_codec_dev_set_out_vol(output_dev_, volume));
+    AudioCodec::SetOutputVolume(volume);
+}
+
+void BoxAudioCodec::EnableInput(bool enable) {
+    std::lock_guard<std::mutex> lock(data_if_mutex_);
+    if (enable == input_enabled_) {
+        return;
+    }
+    if (enable) {
+        esp_codec_dev_sample_info_t fs = {
+            .bits_per_sample = 16,
+            .channel = 4,
+            .channel_mask = ESP_CODEC_DEV_MAKE_CHANNEL_MASK(0),
+            .sample_rate = (uint32_t)input_sample_rate_,
+            .mclk_multiple = 0,
+        };
+        if (input_reference_) {
+            fs.channel_mask |= ESP_CODEC_DEV_MAKE_CHANNEL_MASK(1);
+        }
+        ESP_ERROR_CHECK(esp_codec_dev_open(input_dev_, &fs));
+        ESP_ERROR_CHECK(esp_codec_dev_set_in_channel_gain(input_dev_, ESP_CODEC_DEV_MAKE_CHANNEL_MASK(0), AUDIO_CODEC_DEFAULT_MIC_GAIN));
+    } else {
+        ESP_ERROR_CHECK(esp_codec_dev_close(input_dev_));
+    }
+    AudioCodec::EnableInput(enable);
+}
+
+void BoxAudioCodec::EnableOutput(bool enable) {
+    std::lock_guard<std::mutex> lock(data_if_mutex_);
+    if (enable == output_enabled_) {
+        return;
+    }
+    if (enable) {
+        // Play 16-bit, single-channel (mono) audio
+        esp_codec_dev_sample_info_t fs = {
+            .bits_per_sample = 16,
+            .channel = 1,
+            .channel_mask = 0,
+            .sample_rate = (uint32_t)output_sample_rate_,
+            .mclk_multiple = 0,
+        };
+        ESP_ERROR_CHECK(esp_codec_dev_open(output_dev_, &fs));
+        ESP_ERROR_CHECK(esp_codec_dev_set_out_vol(output_dev_, output_volume_));
+    } else {
+        ESP_ERROR_CHECK(esp_codec_dev_close(output_dev_));
+    }
+    AudioCodec::EnableOutput(enable);
+}
+
+int BoxAudioCodec::Read(int16_t* dest, int samples) {
+    if (input_enabled_) {
+        ESP_ERROR_CHECK_WITHOUT_ABORT(esp_codec_dev_read(input_dev_, (void*)dest, samples * sizeof(int16_t)));
+    }
+    return samples;
+}
+
+int BoxAudioCodec::Write(const int16_t* data, int samples) {
+    if (output_enabled_) {
+        ESP_ERROR_CHECK_WITHOUT_ABORT(esp_codec_dev_write(output_dev_, (void*)data, samples * sizeof(int16_t)));
+    }
+    return samples;
+}
--- a/main/audio/codecs/box_audio_codec.h
+++ b/main/audio/codecs/box_audio_codec.h
@@ -0,0 +1,40 @@
+#ifndef _BOX_AUDIO_CODEC_H
+#define _BOX_AUDIO_CODEC_H
+
+#include "audio_codec.h"
+
+#include <esp_codec_dev.h>
+#include <esp_codec_dev_defaults.h>
+#include <mutex>
+
+
+class BoxAudioCodec : public AudioCodec {
+private:
+    const audio_codec_data_if_t* data_if_ = nullptr;
+    const audio_codec_ctrl_if_t* out_ctrl_if_ = nullptr;
+    const audio_codec_if_t* out_codec_if_ = nullptr;
+    const audio_codec_ctrl_if_t* in_ctrl_if_ = nullptr;
+    const audio_codec_if_t* in_codec_if_ = nullptr;
+    const audio_codec_gpio_if_t* gpio_if_ = nullptr;
+
+    esp_codec_dev_handle_t output_dev_ = nullptr;
+    esp_codec_dev_handle_t input_dev_ = nullptr;
+    std::mutex data_if_mutex_;
+
+    void CreateDuplexChannels(gpio_num_t mclk, gpio_num_t bclk, gpio_num_t ws, gpio_num_t dout, gpio_num_t din);
+
+    virtual int Read(int16_t* dest, int samples) override;
+    virtual int Write(const int16_t* data, int samples) override;
+
+public:
+    BoxAudioCodec(void* i2c_master_handle, int input_sample_rate, int output_sample_rate,
+        gpio_num_t mclk, gpio_num_t bclk, gpio_num_t ws, gpio_num_t dout, gpio_num_t din,
+        gpio_num_t pa_pin, uint8_t es8311_addr, uint8_t es7210_addr, bool input_reference);
+    virtual ~BoxAudioCodec();
+
+    virtual void SetOutputVolume(int volume) override;
+    virtual void EnableInput(bool enable) override;
+    virtual void EnableOutput(bool enable) override;
+};
+
+#endif // _BOX_AUDIO_CODEC_H
--- a/main/audio/codecs/dummy_audio_codec.cc
+++ b/main/audio/codecs/dummy_audio_codec.cc
@@ -0,0 +1,20 @@
+#include "dummy_audio_codec.h"
+
+DummyAudioCodec::DummyAudioCodec(int input_sample_rate, int output_sample_rate) {
+    duplex_ = true;
+    input_reference_ = false;
+    input_channels_ = 1;
+    input_sample_rate_ = input_sample_rate;
+    output_sample_rate_ = output_sample_rate;
+}
+
+DummyAudioCodec::~DummyAudioCodec() {
+}
+
+int DummyAudioCodec::Read(int16_t* dest, int samples) {
+    return 0;
+}
+
+int DummyAudioCodec::Write(const int16_t* data, int samples) {
+    return 0;
+}
--- a/main/audio/codecs/dummy_audio_codec.h
+++ b/main/audio/codecs/dummy_audio_codec.h
@@ -0,0 +1,16 @@
+#ifndef _DUMMY_AUDIO_CODEC_H
+#define _DUMMY_AUDIO_CODEC_H
+
+#include "audio_codec.h"
+
+class DummyAudioCodec : public AudioCodec {
+private:
+    virtual int Read(int16_t* dest, int samples) override;
+    virtual int Write(const int16_t* data, int samples) override;
+
+public:
+    DummyAudioCodec(int input_sample_rate, int output_sample_rate);
+    virtual ~DummyAudioCodec();
+};
+
+#endif // _DUMMY_AUDIO_CODEC_H
--- a/main/audio/codecs/es8311_audio_codec.cc
+++ b/main/audio/codecs/es8311_audio_codec.cc
@@ -0,0 +1,187 @@
+#include "es8311_audio_codec.h"
+
+#include <esp_log.h>
+
+#define TAG "Es8311AudioCodec"
+
+Es8311AudioCodec::Es8311AudioCodec(void* i2c_master_handle, i2c_port_t i2c_port, int input_sample_rate, int output_sample_rate,
+    gpio_num_t mclk, gpio_num_t bclk, gpio_num_t ws, gpio_num_t dout, gpio_num_t din,
+    gpio_num_t pa_pin, uint8_t es8311_addr, bool use_mclk, bool pa_inverted) {
+    duplex_ = true; // Full-duplex operation
+    input_reference_ = false; // Whether to use a reference input for echo cancellation
+    input_channels_ = 1; // Number of input channels
+    input_sample_rate_ = input_sample_rate;
+    output_sample_rate_ = output_sample_rate;
+    pa_pin_ = pa_pin;
+    pa_inverted_ = pa_inverted;
+
+    assert(input_sample_rate_ == output_sample_rate_);
+    CreateDuplexChannels(mclk, bclk, ws, dout, din);
+
+    // Initialize the related interfaces: data_if, ctrl_if and gpio_if
+    audio_codec_i2s_cfg_t i2s_cfg = {
+        .port = I2S_NUM_0,
+        .rx_handle = rx_handle_,
+        .tx_handle = tx_handle_,
+    };
+    data_if_ = audio_codec_new_i2s_data(&i2s_cfg);
+    assert(data_if_ != NULL);
+
+    // Output
+    audio_codec_i2c_cfg_t i2c_cfg = {
+        .port = i2c_port,
+        .addr = es8311_addr,
+        .bus_handle = i2c_master_handle,
+    };
+    ctrl_if_ = audio_codec_new_i2c_ctrl(&i2c_cfg);
+    assert(ctrl_if_ != NULL);
+
+    gpio_if_ = audio_codec_new_gpio();
+    assert(gpio_if_ != NULL);
+
+    es8311_codec_cfg_t es8311_cfg = {};
+    es8311_cfg.ctrl_if = ctrl_if_;
+    es8311_cfg.gpio_if = gpio_if_;
+    es8311_cfg.codec_mode = ESP_CODEC_DEV_WORK_MODE_BOTH;
+    es8311_cfg.pa_pin = pa_pin;
+    es8311_cfg.use_mclk = use_mclk;
+    es8311_cfg.hw_gain.pa_voltage = 5.0;
+    es8311_cfg.hw_gain.codec_dac_voltage = 3.3;
+    es8311_cfg.pa_reverted = pa_inverted_;
+    codec_if_ = es8311_codec_new(&es8311_cfg);
+    assert(codec_if_ != NULL);
+
+    ESP_LOGI(TAG, "Es8311AudioCodec initialized");
+}
+
+Es8311AudioCodec::~Es8311AudioCodec() {
+    esp_codec_dev_delete(dev_);
+
+    audio_codec_delete_codec_if(codec_if_);
+    audio_codec_delete_ctrl_if(ctrl_if_);
+    audio_codec_delete_gpio_if(gpio_if_);
+    audio_codec_delete_data_if(data_if_);
+}
+
+void Es8311AudioCodec::UpdateDeviceState() {
+    if ((input_enabled_ || output_enabled_) && dev_ == nullptr) {
+        esp_codec_dev_cfg_t dev_cfg = {
+            .dev_type = ESP_CODEC_DEV_TYPE_IN_OUT,
+            .codec_if = codec_if_,
+            .data_if = data_if_,
+        };
+        dev_ = esp_codec_dev_new(&dev_cfg);
+        assert(dev_ != NULL);
+
+        esp_codec_dev_sample_info_t fs = {
+            .bits_per_sample = 16,
+            .channel = 1,
+            .channel_mask = 0,
+            .sample_rate = (uint32_t)input_sample_rate_,
+            .mclk_multiple = 0,
+        };
+        ESP_ERROR_CHECK(esp_codec_dev_open(dev_, &fs));
+        ESP_ERROR_CHECK(esp_codec_dev_set_in_gain(dev_, AUDIO_CODEC_DEFAULT_MIC_GAIN));
+        ESP_ERROR_CHECK(esp_codec_dev_set_out_vol(dev_, output_volume_));
+    } else if (!input_enabled_ && !output_enabled_ && dev_ != nullptr) {
+        esp_codec_dev_close(dev_);
+        dev_ = nullptr;
+    }
+    if (pa_pin_ != GPIO_NUM_NC) {
+        int level = output_enabled_ ? 1 : 0;
+        gpio_set_level(pa_pin_, pa_inverted_ ? !level : level);
+    }
+}
+
+void Es8311AudioCodec::CreateDuplexChannels(gpio_num_t mclk, gpio_num_t bclk, gpio_num_t ws, gpio_num_t dout, gpio_num_t din) {
+    assert(input_sample_rate_ == output_sample_rate_);
+
+    i2s_chan_config_t chan_cfg = {
+        .id = I2S_NUM_0,
+        .role = I2S_ROLE_MASTER,
+        .dma_desc_num = AUDIO_CODEC_DMA_DESC_NUM,
+        .dma_frame_num = AUDIO_CODEC_DMA_FRAME_NUM,
+        .auto_clear_after_cb = true,
+        .auto_clear_before_cb = false,
+        .intr_priority = 0,
+    };
+    ESP_ERROR_CHECK(i2s_new_channel(&chan_cfg, &tx_handle_, &rx_handle_));
+
+    i2s_std_config_t std_cfg = {
+        .clk_cfg = {
+            .sample_rate_hz = (uint32_t)output_sample_rate_,
+            .clk_src = I2S_CLK_SRC_DEFAULT,
+            .mclk_multiple = I2S_MCLK_MULTIPLE_256,
+            #ifdef I2S_HW_VERSION_2
+                .ext_clk_freq_hz = 0,
+            #endif
+        },
+        .slot_cfg = {
+            .data_bit_width = I2S_DATA_BIT_WIDTH_16BIT,
+            .slot_bit_width = I2S_SLOT_BIT_WIDTH_AUTO,
+            .slot_mode = I2S_SLOT_MODE_STEREO,
+            .slot_mask = I2S_STD_SLOT_BOTH,
+            .ws_width = I2S_DATA_BIT_WIDTH_16BIT,
+            .ws_pol = false,
+            .bit_shift = true,
+            #ifdef I2S_HW_VERSION_2
+                .left_align = true,
+                .big_endian = false,
+                .bit_order_lsb = false
+            #endif
+        },
+        .gpio_cfg = {
+            .mclk = mclk,
+            .bclk = bclk,
+            .ws = ws,
+            .dout = dout,
+            .din = din,
+            .invert_flags = {
+                .mclk_inv = false,
+                .bclk_inv = false,
+                .ws_inv = false
+            }
+        }
+    };
+
+    ESP_ERROR_CHECK(i2s_channel_init_std_mode(tx_handle_, &std_cfg));
+    ESP_ERROR_CHECK(i2s_channel_init_std_mode(rx_handle_, &std_cfg));
+    ESP_LOGI(TAG, "Duplex channels created");
+}
+
+void Es8311AudioCodec::SetOutputVolume(int volume) {
+    ESP_ERROR_CHECK(esp_codec_dev_set_out_vol(dev_, volume));
+    AudioCodec::SetOutputVolume(volume);
+}
+
+void Es8311AudioCodec::EnableInput(bool enable) {
+    std::lock_guard<std::mutex> lock(data_if_mutex_);
+    if (enable == input_enabled_) {
+        return;
+    }
+    AudioCodec::EnableInput(enable);
+    UpdateDeviceState();
+}
+
+void Es8311AudioCodec::EnableOutput(bool enable) {
+    std::lock_guard<std::mutex> lock(data_if_mutex_);
+    if (enable == output_enabled_) {
+        return;
+    }
+    AudioCodec::EnableOutput(enable);
+    UpdateDeviceState();
+}
+
+int Es8311AudioCodec::Read(int16_t* dest, int samples) {
+    if (input_enabled_) {
+        ESP_ERROR_CHECK_WITHOUT_ABORT(esp_codec_dev_read(dev_, (void*)dest, samples * sizeof(int16_t)));
+    }
+    return samples;
+}
+
+int Es8311AudioCodec::Write(const int16_t* data, int samples) {
+    if (output_enabled_) {
+        ESP_ERROR_CHECK_WITHOUT_ABORT(esp_codec_dev_write(dev_, (void*)data, samples * sizeof(int16_t)));
+    }
+    return samples;
+}
--- a/main/audio/codecs/es8311_audio_codec.h
+++ b/main/audio/codecs/es8311_audio_codec.h
@@ -0,0 +1,42 @@
+#ifndef _ES8311_AUDIO_CODEC_H
+#define _ES8311_AUDIO_CODEC_H
+
+#include "audio_codec.h"
+
+#include <driver/i2c_master.h>
+#include <driver/gpio.h>
+#include <esp_codec_dev.h>
+#include <esp_codec_dev_defaults.h>
+#include <mutex>
+
+
+class Es8311AudioCodec : public AudioCodec {
+private:
+    const audio_codec_data_if_t* data_if_ = nullptr;
+    const audio_codec_ctrl_if_t* ctrl_if_ = nullptr;
+    const audio_codec_if_t* codec_if_ = nullptr;
+    const audio_codec_gpio_if_t* gpio_if_ = nullptr;
+
+    esp_codec_dev_handle_t dev_ = nullptr;
+    gpio_num_t pa_pin_ = GPIO_NUM_NC;
+    bool pa_inverted_ = false;
+    std::mutex data_if_mutex_;
+
+    void CreateDuplexChannels(gpio_num_t mclk, gpio_num_t bclk, gpio_num_t ws, gpio_num_t dout, gpio_num_t din);
+    void UpdateDeviceState();
+
+    virtual int Read(int16_t* dest, int samples) override;
+    virtual int Write(const int16_t* data, int samples) override;
+
+public:
+    Es8311AudioCodec(void* i2c_master_handle, i2c_port_t i2c_port, int input_sample_rate, int output_sample_rate,
+        gpio_num_t mclk, gpio_num_t bclk, gpio_num_t ws, gpio_num_t dout, gpio_num_t din,
+        gpio_num_t pa_pin, uint8_t es8311_addr, bool use_mclk = true, bool pa_inverted = false);
+    virtual ~Es8311AudioCodec();
+
+    virtual void SetOutputVolume(int volume) override;
+    virtual void EnableInput(bool enable) override;
+    virtual void EnableOutput(bool enable) override;
+};
+
+#endif // _ES8311_AUDIO_CODEC_H
--- a/main/audio/codecs/es8374_audio_codec.cc
+++ b/main/audio/codecs/es8374_audio_codec.cc
@@ -0,0 +1,196 @@
+#include "es8374_audio_codec.h"
+
+#include <esp_log.h>
+
+#define TAG "Es8374AudioCodec"
+
+Es8374AudioCodec::Es8374AudioCodec(void* i2c_master_handle, i2c_port_t i2c_port, int input_sample_rate, int output_sample_rate,
+    gpio_num_t mclk, gpio_num_t bclk, gpio_num_t ws, gpio_num_t dout, gpio_num_t din,
+    gpio_num_t pa_pin, uint8_t es8374_addr, bool use_mclk) {
+    duplex_ = true; // Full-duplex operation
+    input_reference_ = false; // Whether to use a reference input for echo cancellation
+    input_channels_ = 1; // Number of input channels
+    input_sample_rate_ = input_sample_rate;
+    output_sample_rate_ = output_sample_rate;
+    pa_pin_ = pa_pin;
+    CreateDuplexChannels(mclk, bclk, ws, dout, din);
+
+    // Initialize the related interfaces: data_if, ctrl_if and gpio_if
+    audio_codec_i2s_cfg_t i2s_cfg = {
+        .port = I2S_NUM_0,
+        .rx_handle = rx_handle_,
+        .tx_handle = tx_handle_,
+    };
+    data_if_ = audio_codec_new_i2s_data(&i2s_cfg);
+    assert(data_if_ != NULL);
+
+    // Control interface (I2C)
+    audio_codec_i2c_cfg_t i2c_cfg = {
+        .port = i2c_port,
+        .addr = es8374_addr,
+        .bus_handle = i2c_master_handle,
+    };
+    ctrl_if_ = audio_codec_new_i2c_ctrl(&i2c_cfg);
+    assert(ctrl_if_ != NULL);
+
+    gpio_if_ = audio_codec_new_gpio();
+    assert(gpio_if_ != NULL);
+
+    es8374_codec_cfg_t es8374_cfg = {};
+    es8374_cfg.ctrl_if = ctrl_if_;
+    es8374_cfg.gpio_if = gpio_if_;
+    es8374_cfg.codec_mode = ESP_CODEC_DEV_WORK_MODE_BOTH;
+    es8374_cfg.pa_pin = pa_pin;
+    codec_if_ = es8374_codec_new(&es8374_cfg);
+    assert(codec_if_ != NULL);
+
+    esp_codec_dev_cfg_t dev_cfg = {
+        .dev_type = ESP_CODEC_DEV_TYPE_OUT,
+        .codec_if = codec_if_,
+        .data_if = data_if_,
+    };
+    output_dev_ = esp_codec_dev_new(&dev_cfg);
+    assert(output_dev_ != NULL);
+    dev_cfg.dev_type = ESP_CODEC_DEV_TYPE_IN;
+    input_dev_ = esp_codec_dev_new(&dev_cfg);
+    assert(input_dev_ != NULL);
+    esp_codec_set_disable_when_closed(output_dev_, false);
+    esp_codec_set_disable_when_closed(input_dev_, false);
+    ESP_LOGI(TAG, "Es8374AudioCodec initialized");
+}
+
+Es8374AudioCodec::~Es8374AudioCodec() {
+    ESP_ERROR_CHECK(esp_codec_dev_close(output_dev_));
+    esp_codec_dev_delete(output_dev_);
+    ESP_ERROR_CHECK(esp_codec_dev_close(input_dev_));
+    esp_codec_dev_delete(input_dev_);
+
+    audio_codec_delete_codec_if(codec_if_);
+    audio_codec_delete_ctrl_if(ctrl_if_);
+    audio_codec_delete_gpio_if(gpio_if_);
+    audio_codec_delete_data_if(data_if_);
+}
+
+void Es8374AudioCodec::CreateDuplexChannels(gpio_num_t mclk, gpio_num_t bclk, gpio_num_t ws, gpio_num_t dout, gpio_num_t din) {
+    assert(input_sample_rate_ == output_sample_rate_);
+
+    i2s_chan_config_t chan_cfg = {
+        .id = I2S_NUM_0,
+        .role = I2S_ROLE_MASTER,
+        .dma_desc_num = 6,
+        .dma_frame_num = 240,
+        .auto_clear_after_cb = true,
+        .auto_clear_before_cb = false,
+        .intr_priority = 0,
+    };
+    ESP_ERROR_CHECK(i2s_new_channel(&chan_cfg, &tx_handle_, &rx_handle_));
+
+    i2s_std_config_t std_cfg = {
+        .clk_cfg = {
+            .sample_rate_hz = (uint32_t)output_sample_rate_,
+            .clk_src = I2S_CLK_SRC_DEFAULT,
+            .mclk_multiple = I2S_MCLK_MULTIPLE_256,
+#ifdef I2S_HW_VERSION_2
+            .ext_clk_freq_hz = 0,
+#endif
+        },
+        .slot_cfg = {
+            .data_bit_width = I2S_DATA_BIT_WIDTH_16BIT,
+            .slot_bit_width = I2S_SLOT_BIT_WIDTH_AUTO,
+            .slot_mode = I2S_SLOT_MODE_STEREO,
+            .slot_mask = I2S_STD_SLOT_BOTH,
+            .ws_width = I2S_DATA_BIT_WIDTH_16BIT,
+            .ws_pol = false,
+            .bit_shift = true,
+#ifdef I2S_HW_VERSION_2
+            .left_align = true,
+            .big_endian = false,
+            .bit_order_lsb = false
+#endif
+        },
+        .gpio_cfg = {
+            .mclk = mclk,
+            .bclk = bclk,
+            .ws = ws,
+            .dout = dout,
+            .din = din,
+            .invert_flags = {
+                .mclk_inv = false,
+                .bclk_inv = false,
+                .ws_inv = false
+            }
+        }
+    };
+
+    ESP_ERROR_CHECK(i2s_channel_init_std_mode(tx_handle_, &std_cfg));
+    ESP_ERROR_CHECK(i2s_channel_init_std_mode(rx_handle_, &std_cfg));
+    ESP_LOGI(TAG, "Duplex channels created");
+}
+
+void Es8374AudioCodec::SetOutputVolume(int volume) {
+    ESP_ERROR_CHECK(esp_codec_dev_set_out_vol(output_dev_, volume));
+    AudioCodec::SetOutputVolume(volume);
+}
+
+void Es8374AudioCodec::EnableInput(bool enable) {
+    std::lock_guard<std::mutex> lock(data_if_mutex_);
+    if (enable == input_enabled_) {
+        return;
+    }
+    if (enable) {
+        esp_codec_dev_sample_info_t fs = {
+            .bits_per_sample = 16,
+            .channel = 1,
+            .channel_mask = 0,
+            .sample_rate = (uint32_t)input_sample_rate_,
+            .mclk_multiple = 0,
+        };
+        ESP_ERROR_CHECK(esp_codec_dev_open(input_dev_, &fs));
+        ESP_ERROR_CHECK(esp_codec_dev_set_in_gain(input_dev_, AUDIO_CODEC_DEFAULT_MIC_GAIN));
+    } else {
+        ESP_ERROR_CHECK(esp_codec_dev_close(input_dev_));
+    }
+    AudioCodec::EnableInput(enable);
+}
+
+void Es8374AudioCodec::EnableOutput(bool enable) {
+    std::lock_guard<std::mutex> lock(data_if_mutex_);
+    if (enable == output_enabled_) {
+        return;
+    }
+    if (enable) {
+        // Play 16-bit, 1-channel audio
+        esp_codec_dev_sample_info_t fs = {
+            .bits_per_sample = 16,
+            .channel = 1,
+            .channel_mask = 0,
+            .sample_rate = (uint32_t)output_sample_rate_,
+            .mclk_multiple = 0,
+        };
+        ESP_ERROR_CHECK(esp_codec_dev_open(output_dev_, &fs));
+        ESP_ERROR_CHECK(esp_codec_dev_set_out_vol(output_dev_, output_volume_));
+        if (pa_pin_ != GPIO_NUM_NC) {
+            gpio_set_level(pa_pin_, 1);
+        }
+    } else {
+        ESP_ERROR_CHECK(esp_codec_dev_close(output_dev_));
+        if (pa_pin_ != GPIO_NUM_NC) {
+            gpio_set_level(pa_pin_, 0);
+        }
+    }
+    AudioCodec::EnableOutput(enable);
+}
+
+int Es8374AudioCodec::Read(int16_t* dest, int samples) {
+    if (input_enabled_) {
+        ESP_ERROR_CHECK_WITHOUT_ABORT(esp_codec_dev_read(input_dev_, (void*)dest, samples * sizeof(int16_t)));
+    }
+    return samples;
+}
+
+int Es8374AudioCodec::Write(const int16_t* data, int samples) {
+    if (output_enabled_) {
+        ESP_ERROR_CHECK_WITHOUT_ABORT(esp_codec_dev_write(output_dev_, (void*)data, samples * sizeof(int16_t)));
+    }
+    return samples;
+}
--- a/main/audio/codecs/es8374_audio_codec.h
+++ b/main/audio/codecs/es8374_audio_codec.h
@@ -0,0 +1,41 @@
+#ifndef _ES8374_AUDIO_CODEC_H
+#define _ES8374_AUDIO_CODEC_H
+
+#include "audio_codec.h"
+
+#include <driver/i2c.h>
+#include <driver/gpio.h>
+#include <esp_codec_dev.h>
+#include <esp_codec_dev_defaults.h>
+#include <mutex>
+
+
+class Es8374AudioCodec : public AudioCodec {
+private:
+    const audio_codec_data_if_t* data_if_ = nullptr;
+    const audio_codec_ctrl_if_t* ctrl_if_ = nullptr;
+    const audio_codec_if_t* codec_if_ = nullptr;
+    const audio_codec_gpio_if_t* gpio_if_ = nullptr;
+
+    esp_codec_dev_handle_t output_dev_ = nullptr;
+    esp_codec_dev_handle_t input_dev_ = nullptr;
+    gpio_num_t pa_pin_ = GPIO_NUM_NC;
+    std::mutex data_if_mutex_;
+
+    void CreateDuplexChannels(gpio_num_t mclk, gpio_num_t bclk, gpio_num_t ws, gpio_num_t dout, gpio_num_t din);
+
+    virtual int Read(int16_t* dest, int samples) override;
+    virtual int Write(const int16_t* data, int samples) override;
+
+public:
+    Es8374AudioCodec(void* i2c_master_handle, i2c_port_t i2c_port, int input_sample_rate, int output_sample_rate,
+        gpio_num_t mclk, gpio_num_t bclk, gpio_num_t ws, gpio_num_t dout, gpio_num_t din,
+        gpio_num_t pa_pin, uint8_t es8374_addr, bool use_mclk = true);
+    virtual ~Es8374AudioCodec();
+
+    virtual void SetOutputVolume(int volume) override;
+    virtual void EnableInput(bool enable) override;
+    virtual void EnableOutput(bool enable) override;
+};
+
+#endif // _ES8374_AUDIO_CODEC_H
--- a/main/audio/codecs/es8388_audio_codec.cc
+++ b/main/audio/codecs/es8388_audio_codec.cc
@@ -0,0 +1,207 @@
+#include "es8388_audio_codec.h"
+
+#include <esp_log.h>
+
+#define TAG "Es8388AudioCodec"
+
+Es8388AudioCodec::Es8388AudioCodec(void* i2c_master_handle, i2c_port_t i2c_port, int input_sample_rate, int output_sample_rate,
+    gpio_num_t mclk, gpio_num_t bclk, gpio_num_t ws, gpio_num_t dout, gpio_num_t din,
+    gpio_num_t pa_pin, uint8_t es8388_addr) {
+    duplex_ = true; // Full-duplex mode
+    input_reference_ = false; // Whether to use a reference input for echo cancellation
+    input_channels_ = 1; // Number of input channels
+    input_sample_rate_ = input_sample_rate;
+    output_sample_rate_ = output_sample_rate;
+    pa_pin_ = pa_pin;
+    CreateDuplexChannels(mclk, bclk, ws, dout, din);
+
+    // Initialize the related interfaces: data_if, ctrl_if and gpio_if
+    audio_codec_i2s_cfg_t i2s_cfg = {
+        .port = I2S_NUM_0,
+        .rx_handle = rx_handle_,
+        .tx_handle = tx_handle_,
+    };
+    data_if_ = audio_codec_new_i2s_data(&i2s_cfg);
+    assert(data_if_ != NULL);
+
+    // Control interface (I2C)
+    audio_codec_i2c_cfg_t i2c_cfg = {
+        .port = i2c_port,
+        .addr = es8388_addr,
+        .bus_handle = i2c_master_handle,
+    };
+    ctrl_if_ = audio_codec_new_i2c_ctrl(&i2c_cfg);
+    assert(ctrl_if_ != NULL);
+
+    gpio_if_ = audio_codec_new_gpio();
+    assert(gpio_if_ != NULL);
+
+    es8388_codec_cfg_t es8388_cfg = {};
+    es8388_cfg.ctrl_if = ctrl_if_;
+    es8388_cfg.gpio_if = gpio_if_;
+    es8388_cfg.codec_mode = ESP_CODEC_DEV_WORK_MODE_BOTH;
+    es8388_cfg.master_mode = true;
+    es8388_cfg.pa_pin = pa_pin;
+    es8388_cfg.pa_reverted = false;
+    es8388_cfg.hw_gain.pa_voltage = 5.0;
+    es8388_cfg.hw_gain.codec_dac_voltage = 3.3;
+    codec_if_ = es8388_codec_new(&es8388_cfg);
+    assert(codec_if_ != NULL);
+
+    esp_codec_dev_cfg_t outdev_cfg = {
+        .dev_type = ESP_CODEC_DEV_TYPE_OUT,
+        .codec_if = codec_if_,
+        .data_if = data_if_,
+    };
+    output_dev_ = esp_codec_dev_new(&outdev_cfg);
+    assert(output_dev_ != NULL);
+
+    esp_codec_dev_cfg_t indev_cfg = {
+        .dev_type = ESP_CODEC_DEV_TYPE_IN,
+        .codec_if = codec_if_,
+        .data_if = data_if_,
+    };
+    input_dev_ = esp_codec_dev_new(&indev_cfg);
+    assert(input_dev_ != NULL);
+    esp_codec_set_disable_when_closed(output_dev_, false);
+    esp_codec_set_disable_when_closed(input_dev_, false);
+    ESP_LOGI(TAG, "Es8388AudioCodec initialized");
+}
+
+Es8388AudioCodec::~Es8388AudioCodec() {
+    ESP_ERROR_CHECK(esp_codec_dev_close(output_dev_));
+    esp_codec_dev_delete(output_dev_);
+    ESP_ERROR_CHECK(esp_codec_dev_close(input_dev_));
+    esp_codec_dev_delete(input_dev_);
+
+    audio_codec_delete_codec_if(codec_if_);
+    audio_codec_delete_ctrl_if(ctrl_if_);
+    audio_codec_delete_gpio_if(gpio_if_);
+    audio_codec_delete_data_if(data_if_);
+}
+
+void Es8388AudioCodec::CreateDuplexChannels(gpio_num_t mclk, gpio_num_t bclk, gpio_num_t ws, gpio_num_t dout, gpio_num_t din){
+    assert(input_sample_rate_ == output_sample_rate_);
+
+    i2s_chan_config_t chan_cfg = {
+        .id = I2S_NUM_0,
+        .role = I2S_ROLE_MASTER,
+        .dma_desc_num = AUDIO_CODEC_DMA_DESC_NUM,
+        .dma_frame_num = AUDIO_CODEC_DMA_FRAME_NUM,
+        .auto_clear_after_cb = true,
+        .auto_clear_before_cb = false,
+        .intr_priority = 0,
+    };
+    ESP_ERROR_CHECK(i2s_new_channel(&chan_cfg, &tx_handle_, &rx_handle_));
+
+    i2s_std_config_t std_cfg = {
+        .clk_cfg = {
+            .sample_rate_hz = (uint32_t)output_sample_rate_,
+            .clk_src = I2S_CLK_SRC_DEFAULT,
+            .ext_clk_freq_hz = 0,
+            .mclk_multiple = I2S_MCLK_MULTIPLE_256
+        },
+        .slot_cfg = {
+            .data_bit_width = I2S_DATA_BIT_WIDTH_16BIT,
+            .slot_bit_width = I2S_SLOT_BIT_WIDTH_AUTO,
+            .slot_mode = I2S_SLOT_MODE_STEREO,
+            .slot_mask = I2S_STD_SLOT_BOTH,
+            .ws_width = I2S_DATA_BIT_WIDTH_16BIT,
+            .ws_pol = false,
+            .bit_shift = true,
+            .left_align = true,
+            .big_endian = false,
+            .bit_order_lsb = false
+        },
+        .gpio_cfg = {
+            .mclk = mclk,
+            .bclk = bclk,
+            .ws = ws,
+            .dout = dout,
+            .din = din,
+            .invert_flags = {
+                .mclk_inv = false,
+                .bclk_inv = false,
+                .ws_inv = false
+            }
+        }
+    };
+
+    ESP_ERROR_CHECK(i2s_channel_init_std_mode(tx_handle_, &std_cfg));
+    ESP_ERROR_CHECK(i2s_channel_init_std_mode(rx_handle_, &std_cfg));
+    ESP_LOGI(TAG, "Duplex channels created");
+}
+
+void Es8388AudioCodec::SetOutputVolume(int volume) {
+    ESP_ERROR_CHECK(esp_codec_dev_set_out_vol(output_dev_, volume));
+    AudioCodec::SetOutputVolume(volume);
+}
+
+void Es8388AudioCodec::EnableInput(bool enable) {
+    std::lock_guard<std::mutex> lock(data_if_mutex_);
+    if (enable == input_enabled_) {
+        return;
+    }
+    if (enable) {
+        esp_codec_dev_sample_info_t fs = {
+            .bits_per_sample = 16,
+            .channel = 1,
+            .channel_mask = 0,
+            .sample_rate = (uint32_t)input_sample_rate_,
+            .mclk_multiple = 0,
+        };
+        ESP_ERROR_CHECK(esp_codec_dev_open(input_dev_, &fs));
+        ESP_ERROR_CHECK(esp_codec_dev_set_in_gain(input_dev_, 24.0));
+    } else {
+        ESP_ERROR_CHECK(esp_codec_dev_close(input_dev_));
+    }
+    AudioCodec::EnableInput(enable);
+}
+
+void Es8388AudioCodec::EnableOutput(bool enable) {
+    std::lock_guard<std::mutex> lock(data_if_mutex_);
+    if (enable == output_enabled_) {
+        return;
+    }
+    if (enable) {
+        esp_codec_dev_sample_info_t fs = {
+            .bits_per_sample = 16,
+            .channel = 1,
+            .channel_mask = 0,
+            .sample_rate = (uint32_t)output_sample_rate_,
+            .mclk_multiple = 0,
+        };
+        ESP_ERROR_CHECK(esp_codec_dev_open(output_dev_, &fs));
+        ESP_ERROR_CHECK(esp_codec_dev_set_out_vol(output_dev_, output_volume_));
+
+        // Set analog output volume to 0dB, default is -45dB
+        uint8_t reg_val = 30; // 0dB
+        uint8_t regs[] = { 46, 47, 48, 49 }; // HP_LVOL, HP_RVOL, SPK_LVOL, SPK_RVOL
+        for (uint8_t reg : regs) {
+            ctrl_if_->write_reg(ctrl_if_, reg, 1, &reg_val, 1);
+        }
+
+        if (pa_pin_ != GPIO_NUM_NC) {
+            gpio_set_level(pa_pin_, 1);
+        }
+    } else {
+        ESP_ERROR_CHECK(esp_codec_dev_close(output_dev_));
+        if (pa_pin_ != GPIO_NUM_NC) {
+            gpio_set_level(pa_pin_, 0);
+        }
+    }
+    AudioCodec::EnableOutput(enable);
+}
+
+int Es8388AudioCodec::Read(int16_t* dest, int samples) {
+    if (input_enabled_) {
+        ESP_ERROR_CHECK_WITHOUT_ABORT(esp_codec_dev_read(input_dev_, (void*)dest, samples * sizeof(int16_t)));
+    }
+    return samples;
+}
+
+int Es8388AudioCodec::Write(const int16_t* data, int samples) {
+    if (output_enabled_) {
+        ESP_ERROR_CHECK_WITHOUT_ABORT(esp_codec_dev_write(output_dev_, (void*)data, samples * sizeof(int16_t)));
+    }
+    return samples;
+}
--- a/main/audio/codecs/es8388_audio_codec.h
+++ b/main/audio/codecs/es8388_audio_codec.h
@@ -0,0 +1,40 @@
+#ifndef _ES8388_AUDIO_CODEC_H
+#define _ES8388_AUDIO_CODEC_H
+
+#include "audio_codec.h"
+
+#include <driver/i2c_master.h>
+#include <esp_codec_dev.h>
+#include <esp_codec_dev_defaults.h>
+#include <mutex>
+
+
+class Es8388AudioCodec : public AudioCodec {
+private:
+    const audio_codec_data_if_t* data_if_ = nullptr;
+    const audio_codec_ctrl_if_t* ctrl_if_ = nullptr;
+    const audio_codec_if_t* codec_if_ = nullptr;
+    const audio_codec_gpio_if_t* gpio_if_ = nullptr;
+
+    esp_codec_dev_handle_t output_dev_ = nullptr;
+    esp_codec_dev_handle_t input_dev_ = nullptr;
+    gpio_num_t pa_pin_ = GPIO_NUM_NC;
+    std::mutex data_if_mutex_;
+
+    void CreateDuplexChannels(gpio_num_t mclk, gpio_num_t bclk, gpio_num_t ws, gpio_num_t dout, gpio_num_t din);
+
+    virtual int Read(int16_t* dest, int samples) override;
+    virtual int Write(const int16_t* data, int samples) override;
+
+public:
+    Es8388AudioCodec(void* i2c_master_handle, i2c_port_t i2c_port, int input_sample_rate, int output_sample_rate,
+        gpio_num_t mclk, gpio_num_t bclk, gpio_num_t ws, gpio_num_t dout, gpio_num_t din,
+        gpio_num_t pa_pin, uint8_t es8388_addr);
+    virtual ~Es8388AudioCodec();
+
+    virtual void SetOutputVolume(int volume) override;
+    virtual void EnableInput(bool enable) override;
+    virtual void EnableOutput(bool enable) override;
+};
+
+#endif // _ES8388_AUDIO_CODEC_H
--- a/main/audio/codecs/es8389_audio_codec.cc
+++ b/main/audio/codecs/es8389_audio_codec.cc
@@ -0,0 +1,203 @@
+#include "es8389_audio_codec.h"
+
+#include <esp_log.h>
+
+static const char TAG[] = "Es8389AudioCodec";
+
+Es8389AudioCodec::Es8389AudioCodec(void* i2c_master_handle, i2c_port_t i2c_port, int input_sample_rate, int output_sample_rate,
+    gpio_num_t mclk, gpio_num_t bclk, gpio_num_t ws, gpio_num_t dout, gpio_num_t din,
+    gpio_num_t pa_pin, uint8_t es8389_addr, bool use_mclk) {
+    duplex_ = true; // Full-duplex mode
+    input_reference_ = false; // Whether to use a reference input for echo cancellation
+    input_channels_ = 1; // Number of input channels
+    input_sample_rate_ = input_sample_rate;
+    output_sample_rate_ = output_sample_rate;
+    pa_pin_ = pa_pin;
+    CreateDuplexChannels(mclk, bclk, ws, dout, din);
+
+    // Initialize the related interfaces: data_if, ctrl_if and gpio_if
+    audio_codec_i2s_cfg_t i2s_cfg = {
+        .port = I2S_NUM_0,
+        .rx_handle = rx_handle_,
+        .tx_handle = tx_handle_,
+    };
+    data_if_ = audio_codec_new_i2s_data(&i2s_cfg);
+    assert(data_if_ != NULL);
+
+    // Control interface (I2C)
+    audio_codec_i2c_cfg_t i2c_cfg = {
+        .port = i2c_port,
+        .addr = es8389_addr,
+        .bus_handle = i2c_master_handle,
+    };
+    ctrl_if_ = audio_codec_new_i2c_ctrl(&i2c_cfg);
+    assert(ctrl_if_ != NULL);
+
+    gpio_if_ = audio_codec_new_gpio();
+    assert(gpio_if_ != NULL);
+
+    es8389_codec_cfg_t es8389_cfg = {};
+    es8389_cfg.ctrl_if = ctrl_if_;
+    es8389_cfg.gpio_if = gpio_if_;
+    es8389_cfg.codec_mode = ESP_CODEC_DEV_WORK_MODE_BOTH;
+    es8389_cfg.pa_pin = pa_pin;
+    es8389_cfg.use_mclk = use_mclk;
+    es8389_cfg.hw_gain.pa_voltage = 5.0;
+    es8389_cfg.hw_gain.codec_dac_voltage = 3.3;
+    codec_if_ = es8389_codec_new(&es8389_cfg);
+
+    assert(codec_if_ != NULL);
+
+    esp_codec_dev_cfg_t outdev_cfg = {
+        .dev_type = ESP_CODEC_DEV_TYPE_OUT,
+        .codec_if = codec_if_,
+        .data_if = data_if_,
+    };
+    output_dev_ = esp_codec_dev_new(&outdev_cfg);
+    assert(output_dev_ != NULL);
+
+    esp_codec_dev_cfg_t indev_cfg = {
+        .dev_type = ESP_CODEC_DEV_TYPE_IN,
+        .codec_if = codec_if_,
+        .data_if = data_if_,
+    };
+    input_dev_ = esp_codec_dev_new(&indev_cfg);
+    assert(input_dev_ != NULL);
+    esp_codec_set_disable_when_closed(output_dev_, false);
+    esp_codec_set_disable_when_closed(input_dev_, false);
+    ESP_LOGI(TAG, "Es8389AudioCodec initialized");
+}
+
+Es8389AudioCodec::~Es8389AudioCodec() {
+    ESP_ERROR_CHECK(esp_codec_dev_close(output_dev_));
+    esp_codec_dev_delete(output_dev_);
+    ESP_ERROR_CHECK(esp_codec_dev_close(input_dev_));
+    esp_codec_dev_delete(input_dev_);
+
+    audio_codec_delete_codec_if(codec_if_);
+    audio_codec_delete_ctrl_if(ctrl_if_);
+    audio_codec_delete_gpio_if(gpio_if_);
+    audio_codec_delete_data_if(data_if_);
+}
+
+void Es8389AudioCodec::CreateDuplexChannels(gpio_num_t mclk, gpio_num_t bclk, gpio_num_t ws, gpio_num_t dout, gpio_num_t din) {
+    assert(input_sample_rate_ == output_sample_rate_);
+
+    i2s_chan_config_t chan_cfg = {
+        .id = I2S_NUM_0,
+        .role = I2S_ROLE_MASTER,
+        .dma_desc_num = 6,
+        .dma_frame_num = 240,
+        .auto_clear_after_cb = true,
+        .auto_clear_before_cb = false,
+        .intr_priority = 0,
+    };
+    ESP_ERROR_CHECK(i2s_new_channel(&chan_cfg, &tx_handle_, &rx_handle_));
+
+    i2s_std_config_t std_cfg = {
+        .clk_cfg = {
+            .sample_rate_hz = (uint32_t)output_sample_rate_,
+            .clk_src = I2S_CLK_SRC_DEFAULT,
+            .mclk_multiple = I2S_MCLK_MULTIPLE_256,
+#ifdef I2S_HW_VERSION_2
+            .ext_clk_freq_hz = 0,
+#endif
+        },
+        .slot_cfg = {
+            .data_bit_width = I2S_DATA_BIT_WIDTH_16BIT,
+            .slot_bit_width = I2S_SLOT_BIT_WIDTH_AUTO,
+            .slot_mode = I2S_SLOT_MODE_STEREO,
+            .slot_mask = I2S_STD_SLOT_BOTH,
+            .ws_width = I2S_DATA_BIT_WIDTH_16BIT,
+            .ws_pol = false,
+            .bit_shift = true,
+            .left_align = true,
+            .big_endian = false,
+            .bit_order_lsb = false
+        },
+        .gpio_cfg = {
+            .mclk = mclk,
+            .bclk = bclk,
+            .ws = ws,
+            .dout = dout,
+            .din = din,
+            .invert_flags = {
+                .mclk_inv = false,
+                .bclk_inv = false,
+                .ws_inv = false
+            }
+        }
+    };
+
+    ESP_ERROR_CHECK(i2s_channel_init_std_mode(tx_handle_, &std_cfg));
+    ESP_ERROR_CHECK(i2s_channel_init_std_mode(rx_handle_, &std_cfg));
+    ESP_LOGI(TAG, "Duplex channels created");
+}
+
+void Es8389AudioCodec::SetOutputVolume(int volume) {
+    ESP_ERROR_CHECK(esp_codec_dev_set_out_vol(output_dev_, volume));
+    AudioCodec::SetOutputVolume(volume);
+}
+
+void Es8389AudioCodec::EnableInput(bool enable) {
+    std::lock_guard<std::mutex> lock(data_if_mutex_);
+    if (enable == input_enabled_) {
+        return;
+    }
+    if (enable) {
+        esp_codec_dev_sample_info_t fs = {
+            .bits_per_sample = 16,
+            .channel = 1,
+            .channel_mask = 0,
+            .sample_rate = (uint32_t)input_sample_rate_,
+            .mclk_multiple = 0,
+        };
+        ESP_ERROR_CHECK(esp_codec_dev_open(input_dev_, &fs));
+        ESP_ERROR_CHECK(esp_codec_dev_set_in_gain(input_dev_, 40.0));
+    } else {
+        ESP_ERROR_CHECK(esp_codec_dev_close(input_dev_));
+    }
+    AudioCodec::EnableInput(enable);
+}
+
+void Es8389AudioCodec::EnableOutput(bool enable) {
+    std::lock_guard<std::mutex> lock(data_if_mutex_);
+    if (enable == output_enabled_) {
+        return;
+    }
+    if (enable) {
+        // Play 16-bit, 1-channel audio
+        esp_codec_dev_sample_info_t fs = {
+            .bits_per_sample = 16,
+            .channel = 1,
+            .channel_mask = 0,
+            .sample_rate = (uint32_t)output_sample_rate_,
+            .mclk_multiple = 0,
+        };
+        ESP_ERROR_CHECK(esp_codec_dev_open(output_dev_, &fs));
+        ESP_ERROR_CHECK(esp_codec_dev_set_out_vol(output_dev_, output_volume_));
+        if (pa_pin_ != GPIO_NUM_NC) {
+            gpio_set_level(pa_pin_, 1);
+        }
+    } else {
+        ESP_ERROR_CHECK(esp_codec_dev_close(output_dev_));
+        if (pa_pin_ != GPIO_NUM_NC) {
+            gpio_set_level(pa_pin_, 0);
+        }
+    }
+    AudioCodec::EnableOutput(enable);
+}
+
+int Es8389AudioCodec::Read(int16_t* dest, int samples) {
+    if (input_enabled_) {
+        ESP_ERROR_CHECK_WITHOUT_ABORT(esp_codec_dev_read(input_dev_, (void*)dest, samples * sizeof(int16_t)));
+    }
+    return samples;
+}
+
+int Es8389AudioCodec::Write(const int16_t* data, int samples) {
+    if (output_enabled_) {
+        ESP_ERROR_CHECK_WITHOUT_ABORT(esp_codec_dev_write(output_dev_, (void*)data, samples * sizeof(int16_t)));
+    }
+    return samples;
+}
--- a/main/audio/codecs/es8389_audio_codec.h
+++ b/main/audio/codecs/es8389_audio_codec.h
@@ -0,0 +1,40 @@
+#ifndef _ES8389_AUDIO_CODEC_H
+#define _ES8389_AUDIO_CODEC_H
+
+#include "audio_codec.h"
+
+#include <driver/i2c.h>
+#include <driver/gpio.h>
+#include <esp_codec_dev.h>
+#include <esp_codec_dev_defaults.h>
+#include <mutex>
+
+class Es8389AudioCodec : public AudioCodec {
+private:
+    const audio_codec_data_if_t* data_if_ = nullptr;
+    const audio_codec_ctrl_if_t* ctrl_if_ = nullptr;
+    const audio_codec_if_t* codec_if_ = nullptr;
+    const audio_codec_gpio_if_t* gpio_if_ = nullptr;
+
+    esp_codec_dev_handle_t output_dev_ = nullptr;
+    esp_codec_dev_handle_t input_dev_ = nullptr;
+    gpio_num_t pa_pin_ = GPIO_NUM_NC;
+    std::mutex data_if_mutex_;
+
+    void CreateDuplexChannels(gpio_num_t mclk, gpio_num_t bclk, gpio_num_t ws, gpio_num_t dout, gpio_num_t din);
+
+    virtual int Read(int16_t* dest, int samples) override;
+    virtual int Write(const int16_t* data, int samples) override;
+
+public:
+    Es8389AudioCodec(void* i2c_master_handle, i2c_port_t i2c_port, int input_sample_rate, int output_sample_rate,
+        gpio_num_t mclk, gpio_num_t bclk, gpio_num_t ws, gpio_num_t dout, gpio_num_t din,
+        gpio_num_t pa_pin, uint8_t es8389_addr, bool use_mclk = true);
+    virtual ~Es8389AudioCodec();
+
+    virtual void SetOutputVolume(int volume) override;
+    virtual void EnableInput(bool enable) override;
+    virtual void EnableOutput(bool enable) override;
+};
+
+#endif // _ES8389_AUDIO_CODEC_H
--- a/main/audio/codecs/no_audio_codec.cc
+++ b/main/audio/codecs/no_audio_codec.cc
@@ -0,0 +1,332 @@
+#include "no_audio_codec.h"
+
+#include <esp_log.h>
+#include <cmath>
+#include <cstring>
+
+#define TAG "NoAudioCodec"
+
+NoAudioCodec::~NoAudioCodec() {
+    if (rx_handle_ != nullptr) {
+        ESP_ERROR_CHECK(i2s_channel_disable(rx_handle_));
+    }
+    if (tx_handle_ != nullptr) {
+        ESP_ERROR_CHECK(i2s_channel_disable(tx_handle_));
+    }
+}
+
+NoAudioCodecDuplex::NoAudioCodecDuplex(int input_sample_rate, int output_sample_rate, gpio_num_t bclk, gpio_num_t ws, gpio_num_t dout, gpio_num_t din) {
+    duplex_ = true;
+    input_sample_rate_ = input_sample_rate;
+    output_sample_rate_ = output_sample_rate;
+
+    i2s_chan_config_t chan_cfg = {
+        .id = I2S_NUM_0,
+        .role = I2S_ROLE_MASTER,
+        .dma_desc_num = AUDIO_CODEC_DMA_DESC_NUM,
+        .dma_frame_num = AUDIO_CODEC_DMA_FRAME_NUM,
+        .auto_clear_after_cb = true,
+        .auto_clear_before_cb = false,
+        .intr_priority = 0,
+    };
+    ESP_ERROR_CHECK(i2s_new_channel(&chan_cfg, &tx_handle_, &rx_handle_));
+
+    i2s_std_config_t std_cfg = {
+        .clk_cfg = {
+            .sample_rate_hz = (uint32_t)output_sample_rate_,
+            .clk_src = I2S_CLK_SRC_DEFAULT,
+            .mclk_multiple = I2S_MCLK_MULTIPLE_256,
+#ifdef I2S_HW_VERSION_2
+            .ext_clk_freq_hz = 0,
+#endif
+
+        },
+        .slot_cfg = {
+            .data_bit_width = I2S_DATA_BIT_WIDTH_32BIT,
+            .slot_bit_width = I2S_SLOT_BIT_WIDTH_AUTO,
+            .slot_mode = I2S_SLOT_MODE_MONO,
+            .slot_mask = I2S_STD_SLOT_LEFT,
+            .ws_width = I2S_DATA_BIT_WIDTH_32BIT,
+            .ws_pol = false,
+            .bit_shift = true,
+#ifdef I2S_HW_VERSION_2
+            .left_align = true,
+            .big_endian = false,
+            .bit_order_lsb = false
+#endif
+
+        },
+        .gpio_cfg = {
+            .mclk = I2S_GPIO_UNUSED,
+            .bclk = bclk,
+            .ws = ws,
+            .dout = dout,
+            .din = din,
+            .invert_flags = {
+                .mclk_inv = false,
+                .bclk_inv = false,
+                .ws_inv = false
+            }
+        }
+    };
+    ESP_ERROR_CHECK(i2s_channel_init_std_mode(tx_handle_, &std_cfg));
+    ESP_ERROR_CHECK(i2s_channel_init_std_mode(rx_handle_, &std_cfg));
+    ESP_LOGI(TAG, "Duplex channels created");
+}
+
+
+NoAudioCodecSimplex::NoAudioCodecSimplex(int input_sample_rate, int output_sample_rate, gpio_num_t spk_bclk, gpio_num_t spk_ws, gpio_num_t spk_dout, gpio_num_t mic_sck, gpio_num_t mic_ws, gpio_num_t mic_din) {
+    duplex_ = false;
+    input_sample_rate_ = input_sample_rate;
+    output_sample_rate_ = output_sample_rate;
+
+    // Create a new channel for speaker
+    i2s_chan_config_t chan_cfg = {
+        .id = (i2s_port_t)0,
+        .role = I2S_ROLE_MASTER,
+        .dma_desc_num = AUDIO_CODEC_DMA_DESC_NUM,
+        .dma_frame_num = AUDIO_CODEC_DMA_FRAME_NUM,
+        .auto_clear_after_cb = true,
+        .auto_clear_before_cb = false,
+        .intr_priority = 0,
+    };
+    ESP_ERROR_CHECK(i2s_new_channel(&chan_cfg, &tx_handle_, nullptr));
+
+    i2s_std_config_t std_cfg = {
+        .clk_cfg = {
+            .sample_rate_hz = (uint32_t)output_sample_rate_,
+            .clk_src = I2S_CLK_SRC_DEFAULT,
+            .mclk_multiple = I2S_MCLK_MULTIPLE_256,
+#ifdef I2S_HW_VERSION_2
+            .ext_clk_freq_hz = 0,
+#endif
+
+        },
+        .slot_cfg = {
+            .data_bit_width = I2S_DATA_BIT_WIDTH_32BIT,
+            .slot_bit_width = I2S_SLOT_BIT_WIDTH_AUTO,
+            .slot_mode = I2S_SLOT_MODE_MONO,
+            .slot_mask = I2S_STD_SLOT_LEFT,
+            .ws_width = I2S_DATA_BIT_WIDTH_32BIT,
+            .ws_pol = false,
+            .bit_shift = true,
+#ifdef I2S_HW_VERSION_2
+            .left_align = true,
+            .big_endian = false,
+            .bit_order_lsb = false
+#endif
+
+        },
+        .gpio_cfg = {
+            .mclk = I2S_GPIO_UNUSED,
+            .bclk = spk_bclk,
+            .ws = spk_ws,
+            .dout = spk_dout,
+            .din = I2S_GPIO_UNUSED,
+            .invert_flags = {
+                .mclk_inv = false,
+                .bclk_inv = false,
+                .ws_inv = false
+            }
+        }
+    };
+    ESP_ERROR_CHECK(i2s_channel_init_std_mode(tx_handle_, &std_cfg));
+
+    // Create a new channel for MIC
+    chan_cfg.id = (i2s_port_t)1;
+    ESP_ERROR_CHECK(i2s_new_channel(&chan_cfg, nullptr, &rx_handle_));
+    std_cfg.clk_cfg.sample_rate_hz = (uint32_t)input_sample_rate_;
+    std_cfg.gpio_cfg.bclk = mic_sck;
+    std_cfg.gpio_cfg.ws = mic_ws;
+    std_cfg.gpio_cfg.dout = I2S_GPIO_UNUSED;
+    std_cfg.gpio_cfg.din = mic_din;
+    ESP_ERROR_CHECK(i2s_channel_init_std_mode(rx_handle_, &std_cfg));
+    ESP_LOGI(TAG, "Simplex channels created");
+}
+
+NoAudioCodecSimplex::NoAudioCodecSimplex(int input_sample_rate, int output_sample_rate, gpio_num_t spk_bclk, gpio_num_t spk_ws, gpio_num_t spk_dout, i2s_std_slot_mask_t spk_slot_mask, gpio_num_t mic_sck, gpio_num_t mic_ws, gpio_num_t mic_din, i2s_std_slot_mask_t mic_slot_mask){
+    duplex_ = false;
+    input_sample_rate_ = input_sample_rate;
+    output_sample_rate_ = output_sample_rate;
+
+    // Create a new channel for speaker
+    i2s_chan_config_t chan_cfg = {
+        .id = (i2s_port_t)0,
+        .role = I2S_ROLE_MASTER,
+        .dma_desc_num = AUDIO_CODEC_DMA_DESC_NUM,
+        .dma_frame_num = AUDIO_CODEC_DMA_FRAME_NUM,
+        .auto_clear_after_cb = true,
+        .auto_clear_before_cb = false,
+        .intr_priority = 0,
+    };
+    ESP_ERROR_CHECK(i2s_new_channel(&chan_cfg, &tx_handle_, nullptr));
+
+    i2s_std_config_t std_cfg = {
+        .clk_cfg = {
+            .sample_rate_hz = (uint32_t)output_sample_rate_,
+            .clk_src = I2S_CLK_SRC_DEFAULT,
+            .mclk_multiple = I2S_MCLK_MULTIPLE_256,
+#ifdef I2S_HW_VERSION_2
+            .ext_clk_freq_hz = 0,
+#endif
+
+        },
+        .slot_cfg = {
+            .data_bit_width = I2S_DATA_BIT_WIDTH_32BIT,
+            .slot_bit_width = I2S_SLOT_BIT_WIDTH_AUTO,
+            .slot_mode = I2S_SLOT_MODE_MONO,
+            .slot_mask = spk_slot_mask,
+            .ws_width = I2S_DATA_BIT_WIDTH_32BIT,
+            .ws_pol = false,
+            .bit_shift = true,
+#ifdef I2S_HW_VERSION_2
+            .left_align = true,
+            .big_endian = false,
+            .bit_order_lsb = false
+#endif
+
+        },
+        .gpio_cfg = {
+            .mclk = I2S_GPIO_UNUSED,
+            .bclk = spk_bclk,
+            .ws = spk_ws,
+            .dout = spk_dout,
+            .din = I2S_GPIO_UNUSED,
+            .invert_flags = {
+                .mclk_inv = false,
+                .bclk_inv = false,
+                .ws_inv = false
+            }
+        }
+    };
+    ESP_ERROR_CHECK(i2s_channel_init_std_mode(tx_handle_, &std_cfg));
+
+    // Create a new channel for MIC
+    chan_cfg.id = (i2s_port_t)1;
+    ESP_ERROR_CHECK(i2s_new_channel(&chan_cfg, nullptr, &rx_handle_));
+    std_cfg.clk_cfg.sample_rate_hz = (uint32_t)input_sample_rate_;
+    std_cfg.slot_cfg.slot_mask = mic_slot_mask;
+    std_cfg.gpio_cfg.bclk = mic_sck;
+    std_cfg.gpio_cfg.ws = mic_ws;
+    std_cfg.gpio_cfg.dout = I2S_GPIO_UNUSED;
+    std_cfg.gpio_cfg.din = mic_din;
+    ESP_ERROR_CHECK(i2s_channel_init_std_mode(rx_handle_, &std_cfg));
+    ESP_LOGI(TAG, "Simplex channels created");
+}
+
+NoAudioCodecSimplexPdm::NoAudioCodecSimplexPdm(int input_sample_rate, int output_sample_rate, gpio_num_t spk_bclk, gpio_num_t spk_ws, gpio_num_t spk_dout, gpio_num_t mic_sck, gpio_num_t mic_din) {
+    duplex_ = false;
+    input_sample_rate_ = input_sample_rate;
+    output_sample_rate_ = output_sample_rate;
+
+    // Create a new channel for speaker
+    i2s_chan_config_t tx_chan_cfg = I2S_CHANNEL_DEFAULT_CONFIG((i2s_port_t)1, I2S_ROLE_MASTER);
+    tx_chan_cfg.dma_desc_num = AUDIO_CODEC_DMA_DESC_NUM;
+    tx_chan_cfg.dma_frame_num = AUDIO_CODEC_DMA_FRAME_NUM;
+    tx_chan_cfg.auto_clear_after_cb = true;
+    tx_chan_cfg.auto_clear_before_cb = false;
+    tx_chan_cfg.intr_priority = 0;
+    ESP_ERROR_CHECK(i2s_new_channel(&tx_chan_cfg, &tx_handle_, NULL));
+
+
+    i2s_std_config_t tx_std_cfg = {
+        .clk_cfg = {
+            .sample_rate_hz = (uint32_t)output_sample_rate_,
+            .clk_src = I2S_CLK_SRC_DEFAULT,
+            .mclk_multiple = I2S_MCLK_MULTIPLE_256,
+            #ifdef I2S_HW_VERSION_2
+                .ext_clk_freq_hz = 0,
+            #endif
+
+        },
+        .slot_cfg = I2S_STD_MSB_SLOT_DEFAULT_CONFIG(I2S_DATA_BIT_WIDTH_32BIT, I2S_SLOT_MODE_MONO),
+        .gpio_cfg = {
+            .mclk = I2S_GPIO_UNUSED,
+            .bclk = spk_bclk,
+            .ws = spk_ws,
+            .dout = spk_dout,
+            .din = I2S_GPIO_UNUSED,
+            .invert_flags = {
+                .mclk_inv = false,
+                .bclk_inv = false,
+                .ws_inv   = false,
+            },
+        },
+    };
+    ESP_ERROR_CHECK(i2s_channel_init_std_mode(tx_handle_, &tx_std_cfg));
+#if SOC_I2S_SUPPORTS_PDM_RX
+    // Create a new channel for MIC in PDM mode
+    i2s_chan_config_t rx_chan_cfg = I2S_CHANNEL_DEFAULT_CONFIG((i2s_port_t)0, I2S_ROLE_MASTER);
+    ESP_ERROR_CHECK(i2s_new_channel(&rx_chan_cfg, NULL, &rx_handle_));
+    i2s_pdm_rx_config_t pdm_rx_cfg = {
+        .clk_cfg = I2S_PDM_RX_CLK_DEFAULT_CONFIG((uint32_t)input_sample_rate_),
+        /* The data bit-width of PDM mode is fixed to 16 */
+        .slot_cfg = I2S_PDM_RX_SLOT_DEFAULT_CONFIG(I2S_DATA_BIT_WIDTH_16BIT, I2S_SLOT_MODE_MONO),
+        .gpio_cfg = {
+            .clk = mic_sck,
+            .din = mic_din,
+
+            .invert_flags = {
+                .clk_inv = false,
+            },
+        },
+    };
+    ESP_ERROR_CHECK(i2s_channel_init_pdm_rx_mode(rx_handle_, &pdm_rx_cfg));
+#else
+    ESP_LOGE(TAG, "PDM is not supported");
+#endif
+    ESP_LOGI(TAG, "Simplex channels created");
+}
+
+int NoAudioCodec::Write(const int16_t* data, int samples) {
+    std::lock_guard<std::mutex> lock(data_if_mutex_);
+    std::vector<int32_t> buffer(samples);
+
+    // output_volume_: 0-100
+    // volume_factor: 0-65536 (squared volume curve; 65536 is full volume)
+    int32_t volume_factor = pow(double(output_volume_) / 100.0, 2) * 65536;
+    for (int i = 0; i < samples; i++) {
+        int64_t temp = int64_t(data[i]) * volume_factor; // Multiply in int64_t so the product cannot overflow
+        if (temp > INT32_MAX) {
+            buffer[i] = INT32_MAX;
+        } else if (temp < INT32_MIN) {
+            buffer[i] = INT32_MIN;
+        } else {
+            buffer[i] = static_cast<int32_t>(temp);
+        }
+    }
+
+    size_t bytes_written;
+    ESP_ERROR_CHECK(i2s_channel_write(tx_handle_, buffer.data(), samples * sizeof(int32_t), &bytes_written, portMAX_DELAY));
+    return bytes_written / sizeof(int32_t);
+}
+
+int NoAudioCodec::Read(int16_t* dest, int samples) {
+    size_t bytes_read;
+
+    std::vector<int32_t> bit32_buffer(samples);
+    if (i2s_channel_read(rx_handle_, bit32_buffer.data(), samples * sizeof(int32_t), &bytes_read, portMAX_DELAY) != ESP_OK) {
+        ESP_LOGE(TAG, "Read Failed!");
+        return 0;
+    }
+
+    samples = bytes_read / sizeof(int32_t);
+    for (int i = 0; i < samples; i++) {
+        int32_t value = bit32_buffer[i] >> 12;
+        dest[i] = (value > INT16_MAX) ? INT16_MAX : (value < -INT16_MAX) ? -INT16_MAX : (int16_t)value;
+    }
+    return samples;
+}
+
+int NoAudioCodecSimplexPdm::Read(int16_t* dest, int samples) {
+    size_t bytes_read;
+
+    // PDM data is 16 bits wide after demodulation, so read it straight into the destination buffer
+    if (i2s_channel_read(rx_handle_, dest, samples * sizeof(int16_t), &bytes_read, portMAX_DELAY) != ESP_OK) {
+        ESP_LOGE(TAG, "Read Failed!");
+        return 0;
+    }
+
+    // Number of samples actually read
+    return bytes_read / sizeof(int16_t);
+}
--- a/main/audio/codecs/no_audio_codec.h
+++ b/main/audio/codecs/no_audio_codec.h
@@ -0,0 +1,38 @@
+#ifndef _NO_AUDIO_CODEC_H
+#define _NO_AUDIO_CODEC_H
+
+#include "audio_codec.h"
+
+#include <driver/gpio.h>
+#include <driver/i2s_pdm.h>
+#include <mutex>
+
+class NoAudioCodec : public AudioCodec {
+protected:
+    std::mutex data_if_mutex_;
+
+    virtual int Write(const int16_t* data, int samples) override;
+    virtual int Read(int16_t* dest, int samples) override;
+
+public:
+    virtual ~NoAudioCodec();
+};
+
+class NoAudioCodecDuplex : public NoAudioCodec {
+public:
+    NoAudioCodecDuplex(int input_sample_rate, int output_sample_rate, gpio_num_t bclk, gpio_num_t ws, gpio_num_t dout, gpio_num_t din);
+};
+
+class NoAudioCodecSimplex : public NoAudioCodec {
+public:
+    NoAudioCodecSimplex(int input_sample_rate, int output_sample_rate, gpio_num_t spk_bclk, gpio_num_t spk_ws, gpio_num_t spk_dout, gpio_num_t mic_sck, gpio_num_t mic_ws, gpio_num_t mic_din);
+    NoAudioCodecSimplex(int input_sample_rate, int output_sample_rate, gpio_num_t spk_bclk, gpio_num_t spk_ws, gpio_num_t spk_dout, i2s_std_slot_mask_t spk_slot_mask, gpio_num_t mic_sck, gpio_num_t mic_ws, gpio_num_t mic_din, i2s_std_slot_mask_t mic_slot_mask);
+};
+
+class NoAudioCodecSimplexPdm : public NoAudioCodec {
+public:
+    NoAudioCodecSimplexPdm(int input_sample_rate, int output_sample_rate, gpio_num_t spk_bclk, gpio_num_t spk_ws, gpio_num_t spk_dout, gpio_num_t mic_sck,  gpio_num_t mic_din);
+    int Read(int16_t* dest, int samples);
+};
+
+#endif // _NO_AUDIO_CODEC_H
--- a/main/audio/processors/afe_audio_processor.cc
+++ b/main/audio/processors/afe_audio_processor.cc
@@ -0,0 +1,183 @@
+#include "afe_audio_processor.h"
+#include <esp_log.h>
+
+#define PROCESSOR_RUNNING 0x01
+
+#define TAG "AfeAudioProcessor"
+
+AfeAudioProcessor::AfeAudioProcessor()
+    : afe_data_(nullptr) {
+    event_group_ = xEventGroupCreate();
+}
+
+void AfeAudioProcessor::Initialize(AudioCodec* codec, int frame_duration_ms) {
+    codec_ = codec;
+    frame_samples_ = frame_duration_ms * 16000 / 1000;
+
+    // Pre-allocate output buffer capacity
+    output_buffer_.reserve(frame_samples_);
+
+    int ref_num = codec_->input_reference() ? 1 : 0;
+
+    std::string input_format;
+    for (int i = 0; i < codec_->input_channels() - ref_num; i++) {
+        input_format.push_back('M');
+    }
+    for (int i = 0; i < ref_num; i++) {
+        input_format.push_back('R');
+    }
+
+    srmodel_list_t *models = esp_srmodel_init("model");
+    char* ns_model_name = esp_srmodel_filter(models, ESP_NSNET_PREFIX, NULL);
+    char* vad_model_name = esp_srmodel_filter(models, ESP_VADN_PREFIX, NULL);
+    
+    afe_config_t* afe_config = afe_config_init(input_format.c_str(), NULL, AFE_TYPE_VC, AFE_MODE_HIGH_PERF);
+    afe_config->aec_mode = AEC_MODE_VOIP_HIGH_PERF;
+    afe_config->vad_mode = VAD_MODE_0;
+    afe_config->vad_min_noise_ms = 100;
+    if (vad_model_name != nullptr) {
+        afe_config->vad_model_name = vad_model_name;
+    }
+
+    if (ns_model_name != nullptr) {
+        afe_config->ns_init = true;
+        afe_config->ns_model_name = ns_model_name;
+        afe_config->afe_ns_mode = AFE_NS_MODE_NET;
+    } else {
+        afe_config->ns_init = false;
+    }
+
+    afe_config->afe_perferred_core = 1;
+    afe_config->afe_perferred_priority = 1;
+    afe_config->agc_init = false;
+    afe_config->memory_alloc_mode = AFE_MEMORY_ALLOC_MORE_PSRAM;
+
+#ifdef CONFIG_USE_DEVICE_AEC
+    afe_config->aec_init = true;
+    afe_config->vad_init = false;
+#else
+    afe_config->aec_init = false;
+    afe_config->vad_init = true;
+#endif
+
+    afe_iface_ = esp_afe_handle_from_config(afe_config);
+    afe_data_ = afe_iface_->create_from_config(afe_config);
+    
+    xTaskCreate([](void* arg) {
+        auto this_ = (AfeAudioProcessor*)arg;
+        this_->AudioProcessorTask();
+        vTaskDelete(NULL);
+    }, "audio_communication", 4096, this, 3, NULL);
+}
+
+AfeAudioProcessor::~AfeAudioProcessor() {
+    if (afe_data_ != nullptr) {
+        afe_iface_->destroy(afe_data_);
+    }
+    vEventGroupDelete(event_group_);
+}
+
+size_t AfeAudioProcessor::GetFeedSize() {
+    if (afe_data_ == nullptr) {
+        return 0;
+    }
+    return afe_iface_->get_feed_chunksize(afe_data_);
+}
+
+void AfeAudioProcessor::Feed(std::vector<int16_t>&& data) {
+    if (afe_data_ == nullptr) {
+        return;
+    }
+    afe_iface_->feed(afe_data_, data.data());
+}
+
+void AfeAudioProcessor::Start() {
+    xEventGroupSetBits(event_group_, PROCESSOR_RUNNING);
+}
+
+void AfeAudioProcessor::Stop() {
+    xEventGroupClearBits(event_group_, PROCESSOR_RUNNING);
+    if (afe_data_ != nullptr) {
+        afe_iface_->reset_buffer(afe_data_);
+    }
+}
+
+bool AfeAudioProcessor::IsRunning() {
+    return xEventGroupGetBits(event_group_) & PROCESSOR_RUNNING;
+}
+
+void AfeAudioProcessor::OnOutput(std::function<void(std::vector<int16_t>&& data)> callback) {
+    output_callback_ = callback;
+}
+
+void AfeAudioProcessor::OnVadStateChange(std::function<void(bool speaking)> callback) {
+    vad_state_change_callback_ = callback;
+}
+
+void AfeAudioProcessor::AudioProcessorTask() {
+    auto fetch_size = afe_iface_->get_fetch_chunksize(afe_data_);
+    auto feed_size = afe_iface_->get_feed_chunksize(afe_data_);
+    ESP_LOGI(TAG, "Audio communication task started, feed size: %d fetch size: %d",
+        feed_size, fetch_size);
+
+    while (true) {
+        xEventGroupWaitBits(event_group_, PROCESSOR_RUNNING, pdFALSE, pdTRUE, portMAX_DELAY);
+
+        auto res = afe_iface_->fetch_with_delay(afe_data_, portMAX_DELAY);
+        if ((xEventGroupGetBits(event_group_) & PROCESSOR_RUNNING) == 0) {
+            continue;
+        }
+        if (res == nullptr || res->ret_value == ESP_FAIL) {
+            if (res != nullptr) {
+                ESP_LOGI(TAG, "Error code: %d", res->ret_value);
+            }
+            continue;
+        }
+
+        // VAD state change
+        if (vad_state_change_callback_) {
+            if (res->vad_state == VAD_SPEECH && !is_speaking_) {
+                is_speaking_ = true;
+                vad_state_change_callback_(true);
+            } else if (res->vad_state == VAD_SILENCE && is_speaking_) {
+                is_speaking_ = false;
+                vad_state_change_callback_(false);
+            }
+        }
+
+        if (output_callback_) {
+            size_t samples = res->data_size / sizeof(int16_t);
+            
+            // Add data to buffer
+            output_buffer_.insert(output_buffer_.end(), res->data, res->data + samples);
+            
+            // Output complete frames when buffer has enough data
+            while (output_buffer_.size() >= frame_samples_) {
+                if (output_buffer_.size() == frame_samples_) {
+                    // If buffer size equals frame size, move the entire buffer
+                    output_callback_(std::move(output_buffer_));
+                    output_buffer_.clear();
+                    output_buffer_.reserve(frame_samples_);
+                } else {
+                    // If buffer size exceeds frame size, copy one frame and remove it
+                    output_callback_(std::vector<int16_t>(output_buffer_.begin(), output_buffer_.begin() + frame_samples_));
+                    output_buffer_.erase(output_buffer_.begin(), output_buffer_.begin() + frame_samples_);
+                }
+            }
+        }
+    }
+}
+
+void AfeAudioProcessor::EnableDeviceAec(bool enable) {
+    if (enable) {
+#if CONFIG_USE_DEVICE_AEC
+        afe_iface_->disable_vad(afe_data_);
+        afe_iface_->enable_aec(afe_data_);
+#else
+        ESP_LOGE(TAG, "Device AEC is not supported");
+#endif
+    } else {
+        afe_iface_->disable_aec(afe_data_);
+        afe_iface_->enable_vad(afe_data_);
+    }
+}
--- a/main/audio/processors/afe_audio_processor.h
+++ b/main/audio/processors/afe_audio_processor.h
@@ -0,0 +1,45 @@
+#ifndef AFE_AUDIO_PROCESSOR_H
+#define AFE_AUDIO_PROCESSOR_H
+
+#include <esp_afe_sr_models.h>
+#include <freertos/FreeRTOS.h>
+#include <freertos/task.h>
+#include <freertos/event_groups.h>
+
+#include <string>
+#include <vector>
+#include <functional>
+
+#include "audio_processor.h"
+#include "audio_codec.h"
+
+class AfeAudioProcessor : public AudioProcessor {
+public:
+    AfeAudioProcessor();
+    ~AfeAudioProcessor();
+
+    void Initialize(AudioCodec* codec, int frame_duration_ms) override;
+    void Feed(std::vector<int16_t>&& data) override;
+    void Start() override;
+    void Stop() override;
+    bool IsRunning() override;
+    void OnOutput(std::function<void(std::vector<int16_t>&& data)> callback) override;
+    void OnVadStateChange(std::function<void(bool speaking)> callback) override;
+    size_t GetFeedSize() override;
+    void EnableDeviceAec(bool enable) override;
+
+private:
+    EventGroupHandle_t event_group_ = nullptr;
+    esp_afe_sr_iface_t* afe_iface_ = nullptr;
+    esp_afe_sr_data_t* afe_data_ = nullptr;
+    std::function<void(std::vector<int16_t>&& data)> output_callback_;
+    std::function<void(bool speaking)> vad_state_change_callback_;
+    AudioCodec* codec_ = nullptr;
+    int frame_samples_ = 0;
+    bool is_speaking_ = false;
+    std::vector<int16_t> output_buffer_;
+
+    void AudioProcessorTask();
+};
+
+#endif 
--- a/main/audio/processors/audio_debugger.cc
+++ b/main/audio/processors/audio_debugger.cc
@@ -0,0 +1,68 @@
+#include "audio_debugger.h"
+#include "sdkconfig.h"
+
+#if CONFIG_USE_AUDIO_DEBUGGER
+#include <esp_log.h>
+#include <arpa/inet.h>
+#include <unistd.h>
+#include <errno.h>
+#include <cstring>
+#include <string>
+#endif
+
+#define TAG "AudioDebugger"
+
+
+AudioDebugger::AudioDebugger() {
+#if CONFIG_USE_AUDIO_DEBUGGER
+    udp_sockfd_ = socket(AF_INET, SOCK_DGRAM, 0);
+    if (udp_sockfd_ >= 0) {
+        // Parse the configured server address, "IP:PORT"
+        std::string server_addr = CONFIG_AUDIO_DEBUG_UDP_SERVER;
+        size_t colon_pos = server_addr.find(':');
+        
+        if (colon_pos != std::string::npos) {
+            std::string ip = server_addr.substr(0, colon_pos);
+            int port = std::stoi(server_addr.substr(colon_pos + 1));
+            
+            memset(&udp_server_addr_, 0, sizeof(udp_server_addr_));
+            udp_server_addr_.sin_family = AF_INET;
+            udp_server_addr_.sin_port = htons(port);
+            inet_pton(AF_INET, ip.c_str(), &udp_server_addr_.sin_addr);
+            
+            ESP_LOGI(TAG, "Initialized server address: %s", CONFIG_AUDIO_DEBUG_UDP_SERVER);
+        } else {
+            ESP_LOGW(TAG, "Invalid server address: %s, should be IP:PORT", CONFIG_AUDIO_DEBUG_UDP_SERVER);
+            close(udp_sockfd_);
+            udp_sockfd_ = -1;
+        }
+    } else {
+        ESP_LOGW(TAG, "Failed to create UDP socket: %d", errno);
+    }
+#endif
+}
+
+AudioDebugger::~AudioDebugger() {
+#if CONFIG_USE_AUDIO_DEBUGGER
+    if (udp_sockfd_ >= 0) {
+        close(udp_sockfd_);
+        ESP_LOGI(TAG, "Closed UDP socket");
+    }
+#endif
+}
+
+void AudioDebugger::Feed(const std::vector<int16_t>& data) {
+#if CONFIG_USE_AUDIO_DEBUGGER
+    if (udp_sockfd_ >= 0) {
+        ssize_t sent = sendto(udp_sockfd_, data.data(), data.size() * sizeof(int16_t), 0,
+                             (struct sockaddr*)&udp_server_addr_, sizeof(udp_server_addr_));
+        if (sent < 0) {
+            ESP_LOGW(TAG, "Failed to send audio data to %s: %d", CONFIG_AUDIO_DEBUG_UDP_SERVER, errno);
+        } else {
+            ESP_LOGD(TAG, "Sent %d bytes audio data to %s", sent, CONFIG_AUDIO_DEBUG_UDP_SERVER);
+        }
+    }
+#endif
+}
+
+ 
--- a/main/audio/processors/audio_debugger.h
+++ b/main/audio/processors/audio_debugger.h
@@ -0,0 +1,22 @@
+#ifndef AUDIO_DEBUGGER_H
+#define AUDIO_DEBUGGER_H
+
+#include <vector>
+#include <cstdint>
+
+#include <sys/socket.h>
+#include <netinet/in.h>
+
+class AudioDebugger {
+public:
+    AudioDebugger();
+    ~AudioDebugger();
+
+    void Feed(const std::vector<int16_t>& data);
+
+private:
+    int udp_sockfd_ = -1;
+    struct sockaddr_in udp_server_addr_;
+};
+
+#endif 
--- a/main/audio/processors/no_audio_processor.cc
+++ b/main/audio/processors/no_audio_processor.cc
@@ -0,0 +1,59 @@
+#include "no_audio_processor.h"
+#include <esp_log.h>
+
+#define TAG "NoAudioProcessor"
+
+void NoAudioProcessor::Initialize(AudioCodec* codec, int frame_duration_ms) {
+    codec_ = codec;
+    frame_samples_ = frame_duration_ms * 16000 / 1000;
+}
+
+void NoAudioProcessor::Feed(std::vector<int16_t>&& data) {
+    if (!is_running_ || !output_callback_) {
+        return;
+    }
+
+    if (codec_->input_channels() == 2) {
+        // With two input channels, keep only the left channel (even-indexed samples)
+        auto mono_data = std::vector<int16_t>(data.size() / 2);
+        for (size_t i = 0, j = 0; i < mono_data.size(); ++i, j += 2) {
+            mono_data[i] = data[j];
+        }
+        output_callback_(std::move(mono_data));
+    } else {
+        output_callback_(std::move(data));
+    }
+}
+
+void NoAudioProcessor::Start() {
+    is_running_ = true;
+}
+
+void NoAudioProcessor::Stop() {
+    is_running_ = false;
+}
+
+bool NoAudioProcessor::IsRunning() {
+    return is_running_;
+}
+
+void NoAudioProcessor::OnOutput(std::function<void(std::vector<int16_t>&& data)> callback) {
+    output_callback_ = callback;
+}
+
+void NoAudioProcessor::OnVadStateChange(std::function<void(bool speaking)> callback) {
+    vad_state_change_callback_ = callback;
+}
+
+size_t NoAudioProcessor::GetFeedSize() {
+    if (!codec_) {
+        return 0;
+    }
+    return frame_samples_;
+}
+
+void NoAudioProcessor::EnableDeviceAec(bool enable) {
+    if (enable) {
+        ESP_LOGE(TAG, "Device AEC is not supported");
+    }
+}
--- a/main/audio/processors/no_audio_processor.h
+++ b/main/audio/processors/no_audio_processor.h
@@ -0,0 +1,33 @@
+#ifndef DUMMY_AUDIO_PROCESSOR_H
+#define DUMMY_AUDIO_PROCESSOR_H
+
+#include <vector>
+#include <functional>
+
+#include "audio_processor.h"
+#include "audio_codec.h"
+
+class NoAudioProcessor : public AudioProcessor {
+public:
+    NoAudioProcessor() = default;
+    ~NoAudioProcessor() = default;
+
+    void Initialize(AudioCodec* codec, int frame_duration_ms) override;
+    void Feed(std::vector<int16_t>&& data) override;
+    void Start() override;
+    void Stop() override;
+    bool IsRunning() override;
+    void OnOutput(std::function<void(std::vector<int16_t>&& data)> callback) override;
+    void OnVadStateChange(std::function<void(bool speaking)> callback) override;
+    size_t GetFeedSize() override;
+    void EnableDeviceAec(bool enable) override;
+
+private:
+    AudioCodec* codec_ = nullptr;
+    int frame_samples_ = 0;
+    std::function<void(std::vector<int16_t>&& data)> output_callback_;
+    std::function<void(bool speaking)> vad_state_change_callback_;
+    bool is_running_ = false;
+};
+
+#endif 
--- a/main/audio/wake_word.h
+++ b/main/audio/wake_word.h
@@ -0,0 +1,25 @@
+#ifndef WAKE_WORD_H
+#define WAKE_WORD_H
+
+#include <string>
+#include <vector>
+#include <functional>
+
+#include "audio_codec.h"
+
+class WakeWord {
+public:
+    virtual ~WakeWord() = default;
+    
+    virtual bool Initialize(AudioCodec* codec) = 0;
+    virtual void Feed(const std::vector<int16_t>& data) = 0;
+    virtual void OnWakeWordDetected(std::function<void(const std::string& wake_word)> callback) = 0;
+    virtual void Start() = 0;
+    virtual void Stop() = 0;
+    virtual size_t GetFeedSize() = 0;
+    virtual void EncodeWakeWordData() = 0;
+    virtual bool GetWakeWordOpus(std::vector<uint8_t>& opus) = 0;
+    virtual const std::string& GetLastDetectedWakeWord() const = 0;
+};
+
+#endif
--- a/main/audio/wake_words/afe_wake_word.cc
+++ b/main/audio/wake_words/afe_wake_word.cc
@@ -0,0 +1,203 @@
+#include "afe_wake_word.h"
+#include "audio_service.h"
+
+#include <esp_log.h>
+#include <sstream>
+
+#define DETECTION_RUNNING_EVENT 1
+
+#define TAG "AfeWakeWord"
+
+AfeWakeWord::AfeWakeWord()
+    : afe_data_(nullptr),
+      wake_word_pcm_(),
+      wake_word_opus_() {
+
+    event_group_ = xEventGroupCreate();
+}
+
+AfeWakeWord::~AfeWakeWord() {
+    if (afe_data_ != nullptr) {
+        afe_iface_->destroy(afe_data_);
+    }
+
+    if (wake_word_encode_task_stack_ != nullptr) {
+        heap_caps_free(wake_word_encode_task_stack_);
+    }
+
+    if (wake_word_encode_task_buffer_ != nullptr) {
+        heap_caps_free(wake_word_encode_task_buffer_);
+    }
+
+    if (models_ != nullptr) {
+        esp_srmodel_deinit(models_);
+    }
+
+    vEventGroupDelete(event_group_);
+}
+
+bool AfeWakeWord::Initialize(AudioCodec* codec) {
+    codec_ = codec;
+    int ref_num = codec_->input_reference() ? 1 : 0;
+
+    models_ = esp_srmodel_init("model");
+    if (models_ == nullptr || models_->num == -1) {
+        ESP_LOGE(TAG, "Failed to initialize wakenet model");
+        return false;
+    }
+    for (int i = 0; i < models_->num; i++) {
+        ESP_LOGI(TAG, "Model %d: %s", i, models_->model_name[i]);
+        if (strstr(models_->model_name[i], ESP_WN_PREFIX) != NULL) {
+            wakenet_model_ = models_->model_name[i];
+            auto words = esp_srmodel_get_wake_words(models_, wakenet_model_);
+            // split by ";" to get all wake words
+            std::stringstream ss(words);
+            std::string word;
+            while (std::getline(ss, word, ';')) {
+                wake_words_.push_back(word);
+            }
+        }
+    }
+
+    std::string input_format;
+    for (int i = 0; i < codec_->input_channels() - ref_num; i++) {
+        input_format.push_back('M');
+    }
+    for (int i = 0; i < ref_num; i++) {
+        input_format.push_back('R');
+    }
+    afe_config_t* afe_config = afe_config_init(input_format.c_str(), models_, AFE_TYPE_SR, AFE_MODE_HIGH_PERF);
+    afe_config->aec_init = codec_->input_reference();
+    afe_config->aec_mode = AEC_MODE_SR_HIGH_PERF;
+    afe_config->afe_perferred_core = 1;
+    afe_config->afe_perferred_priority = 1;
+    afe_config->memory_alloc_mode = AFE_MEMORY_ALLOC_MORE_PSRAM;
+    
+    afe_iface_ = esp_afe_handle_from_config(afe_config);
+    afe_data_ = afe_iface_->create_from_config(afe_config);
+
+    xTaskCreate([](void* arg) {
+        auto this_ = (AfeWakeWord*)arg;
+        this_->AudioDetectionTask();
+        vTaskDelete(NULL);
+    }, "audio_detection", 4096, this, 3, nullptr);
+
+    return true;
+}
+
+void AfeWakeWord::OnWakeWordDetected(std::function<void(const std::string& wake_word)> callback) {
+    wake_word_detected_callback_ = callback;
+}
+
+void AfeWakeWord::Start() {
+    xEventGroupSetBits(event_group_, DETECTION_RUNNING_EVENT);
+}
+
+void AfeWakeWord::Stop() {
+    xEventGroupClearBits(event_group_, DETECTION_RUNNING_EVENT);
+    if (afe_data_ != nullptr) {
+        afe_iface_->reset_buffer(afe_data_);
+    }
+}
+
+void AfeWakeWord::Feed(const std::vector<int16_t>& data) {
+    if (afe_data_ == nullptr) {
+        return;
+    }
+    afe_iface_->feed(afe_data_, data.data());
+}
+
+size_t AfeWakeWord::GetFeedSize() {
+    if (afe_data_ == nullptr) {
+        return 0;
+    }
+    return afe_iface_->get_feed_chunksize(afe_data_);
+}
+
+void AfeWakeWord::AudioDetectionTask() {
+    auto fetch_size = afe_iface_->get_fetch_chunksize(afe_data_);
+    auto feed_size = afe_iface_->get_feed_chunksize(afe_data_);
+    ESP_LOGI(TAG, "Audio detection task started, feed size: %d fetch size: %d",
+        feed_size, fetch_size);
+
+    while (true) {
+        xEventGroupWaitBits(event_group_, DETECTION_RUNNING_EVENT, pdFALSE, pdTRUE, portMAX_DELAY);
+
+        auto res = afe_iface_->fetch_with_delay(afe_data_, portMAX_DELAY);
+        if (res == nullptr || res->ret_value == ESP_FAIL) {
+            continue;
+        }
+
+        // Keep the audio leading up to the wake word, e.g. for later speaker identification
+        StoreWakeWordData(res->data, res->data_size / sizeof(int16_t));
+
+        if (res->wakeup_state == WAKENET_DETECTED) {
+            Stop();
+            last_detected_wake_word_ = wake_words_[res->wakenet_model_index - 1];
+
+            if (wake_word_detected_callback_) {
+                wake_word_detected_callback_(last_detected_wake_word_);
+            }
+        }
+    }
+}
+
+void AfeWakeWord::StoreWakeWordData(const int16_t* data, size_t samples) {
+    // store audio data to wake_word_pcm_
+    wake_word_pcm_.emplace_back(std::vector<int16_t>(data, data + samples));
+    // keep about 2 seconds of data; each fetched chunk is roughly 30 ms (512 samples at 16 kHz is 32 ms)
+    while (wake_word_pcm_.size() > 2000 / 30) {
+        wake_word_pcm_.pop_front();
+    }
+}
+
+void AfeWakeWord::EncodeWakeWordData() {
+    const size_t stack_size = 4096 * 7;
+    wake_word_opus_.clear();
+    if (wake_word_encode_task_stack_ == nullptr) {
+        wake_word_encode_task_stack_ = (StackType_t*)heap_caps_malloc(stack_size, MALLOC_CAP_SPIRAM);
+        assert(wake_word_encode_task_stack_ != nullptr);
+    }
+    if (wake_word_encode_task_buffer_ == nullptr) {
+        wake_word_encode_task_buffer_ = (StaticTask_t*)heap_caps_malloc(sizeof(StaticTask_t), MALLOC_CAP_INTERNAL);
+        assert(wake_word_encode_task_buffer_ != nullptr);
+    }
+
+    wake_word_encode_task_ = xTaskCreateStatic([](void* arg) {
+        auto this_ = (AfeWakeWord*)arg;
+        {
+            auto start_time = esp_timer_get_time();
+            auto encoder = std::make_unique<OpusEncoderWrapper>(16000, 1, OPUS_FRAME_DURATION_MS);
+            encoder->SetComplexity(0); // 0 is the fastest
+
+            int packets = 0;
+            for (auto& pcm: this_->wake_word_pcm_) {
+                encoder->Encode(std::move(pcm), [this_](std::vector<uint8_t>&& opus) {
+                    std::lock_guard<std::mutex> lock(this_->wake_word_mutex_);
+                    this_->wake_word_opus_.emplace_back(std::move(opus));
+                    this_->wake_word_cv_.notify_all();
+                });
+                packets++;
+            }
+            this_->wake_word_pcm_.clear();
+
+            auto end_time = esp_timer_get_time();
+            ESP_LOGI(TAG, "Encode wake word opus %d packets in %ld ms", packets, (long)((end_time - start_time) / 1000));
+
+            std::lock_guard<std::mutex> lock(this_->wake_word_mutex_);
+            this_->wake_word_opus_.push_back(std::vector<uint8_t>());
+            this_->wake_word_cv_.notify_all();
+        }
+        vTaskDelete(NULL);
+    }, "encode_wake_word", stack_size, this, 2, wake_word_encode_task_stack_, wake_word_encode_task_buffer_);
+}
+
+bool AfeWakeWord::GetWakeWordOpus(std::vector<uint8_t>& opus) {
+    std::unique_lock<std::mutex> lock(wake_word_mutex_);
+    wake_word_cv_.wait(lock, [this]() {
+        return !wake_word_opus_.empty();
+    });
+    opus.swap(wake_word_opus_.front());
+    wake_word_opus_.pop_front();
+    return !opus.empty();
+}
--- a/main/audio/wake_words/afe_wake_word.h
+++ b/main/audio/wake_words/afe_wake_word.h
@@ -0,0 +1,60 @@
+#ifndef AFE_WAKE_WORD_H
+#define AFE_WAKE_WORD_H
+
+#include <freertos/FreeRTOS.h>
+#include <freertos/task.h>
+#include <freertos/event_groups.h>
+
+#include <esp_afe_sr_models.h>
+#include <esp_nsn_models.h>
+#include <model_path.h>
+
+#include <deque>
+#include <string>
+#include <vector>
+#include <functional>
+#include <mutex>
+#include <condition_variable>
+
+#include "audio_codec.h"
+#include "wake_word.h"
+
+class AfeWakeWord : public WakeWord {
+public:
+    AfeWakeWord();
+    ~AfeWakeWord();
+
+    bool Initialize(AudioCodec* codec);
+    void Feed(const std::vector<int16_t>& data);
+    void OnWakeWordDetected(std::function<void(const std::string& wake_word)> callback);
+    void Start();
+    void Stop();
+    size_t GetFeedSize();
+    void EncodeWakeWordData();
+    bool GetWakeWordOpus(std::vector<uint8_t>& opus);
+    const std::string& GetLastDetectedWakeWord() const { return last_detected_wake_word_; }
+
+private:
+    srmodel_list_t *models_ = nullptr;
+    esp_afe_sr_iface_t* afe_iface_ = nullptr;
+    esp_afe_sr_data_t* afe_data_ = nullptr;
+    char* wakenet_model_ = NULL;
+    std::vector<std::string> wake_words_;
+    EventGroupHandle_t event_group_;
+    std::function<void(const std::string& wake_word)> wake_word_detected_callback_;
+    AudioCodec* codec_ = nullptr;
+    std::string last_detected_wake_word_;
+
+    TaskHandle_t wake_word_encode_task_ = nullptr;
+    StaticTask_t* wake_word_encode_task_buffer_ = nullptr;
+    StackType_t* wake_word_encode_task_stack_ = nullptr;
+    std::deque<std::vector<int16_t>> wake_word_pcm_;
+    std::deque<std::vector<uint8_t>> wake_word_opus_;
+    std::mutex wake_word_mutex_;
+    std::condition_variable wake_word_cv_;
+
+    void StoreWakeWordData(const int16_t* data, size_t size);
+    void AudioDetectionTask();
+};
+
+#endif
--- a/main/audio/wake_words/custom_wake_word.cc
+++ b/main/audio/wake_words/custom_wake_word.cc
@@ -0,0 +1,185 @@
+#include "custom_wake_word.h"
+#include "audio_service.h"
+#include "system_info.h"
+
+#include <esp_log.h>
+#include "esp_mn_iface.h"
+#include "esp_mn_models.h"
+#include "esp_mn_speech_commands.h"
+
+
+#define TAG "CustomWakeWord"
+
+
+CustomWakeWord::CustomWakeWord()
+    : wake_word_pcm_(), wake_word_opus_() {
+}
+
+CustomWakeWord::~CustomWakeWord() {
+    if (multinet_model_data_ != nullptr && multinet_ != nullptr) {
+        multinet_->destroy(multinet_model_data_);
+        multinet_model_data_ = nullptr;
+    }
+
+    if (wake_word_encode_task_stack_ != nullptr) {
+        heap_caps_free(wake_word_encode_task_stack_);
+    }
+
+    if (wake_word_encode_task_buffer_ != nullptr) {
+        heap_caps_free(wake_word_encode_task_buffer_);
+    }
+
+    if (models_ != nullptr) {
+        esp_srmodel_deinit(models_);
+    }
+}
+
+bool CustomWakeWord::Initialize(AudioCodec* codec) {
+    codec_ = codec;
+
+    models_ = esp_srmodel_init("model");
+    if (models_ == nullptr || models_->num == -1) {
+        ESP_LOGE(TAG, "Failed to initialize wakenet model");
+        return false;
+    }
+
+    // Initialize multinet (command word recognition)
+    mn_name_ = esp_srmodel_filter(models_, ESP_MN_PREFIX, ESP_MN_CHINESE);
+    if (mn_name_ == nullptr) {
+        ESP_LOGE(TAG, "Failed to initialize multinet, mn_name is nullptr");
+        ESP_LOGI(TAG, "Please refer to https://pcn7cs20v8cr.feishu.cn/wiki/CpQjwQsCJiQSWSkYEvrcxcbVnwh to add custom wake word");
+        return false;
+    }
+
+    ESP_LOGI(TAG, "multinet: %s", mn_name_);
+    multinet_ = esp_mn_handle_from_name(mn_name_);
+    multinet_model_data_ = multinet_->create(mn_name_, 3000);  // 3-second timeout
+    multinet_->set_det_threshold(multinet_model_data_, CONFIG_CUSTOM_WAKE_WORD_THRESHOLD / 100.0f);
+    esp_mn_commands_clear();
+    esp_mn_commands_add(1, CONFIG_CUSTOM_WAKE_WORD);
+    esp_mn_commands_update();
+    
+    multinet_->print_active_speech_commands(multinet_model_data_);
+    return true;
+}
+
+void CustomWakeWord::OnWakeWordDetected(std::function<void(const std::string& wake_word)> callback) {
+    wake_word_detected_callback_ = callback;
+}
+
+void CustomWakeWord::Start() {
+    running_ = true;
+}
+
+void CustomWakeWord::Stop() {
+    running_ = false;
+}
+
+void CustomWakeWord::Feed(const std::vector<int16_t>& data) {
+    if (multinet_model_data_ == nullptr || !running_) {
+        return;
+    }
+
+    esp_mn_state_t mn_state;
+    // If the input has 2 interleaved channels, keep only the left channel
+    if (codec_->input_channels() == 2) {
+        auto mono_data = std::vector<int16_t>(data.size() / 2);
+        for (size_t i = 0, j = 0; i < mono_data.size(); ++i, j += 2) {
+            mono_data[i] = data[j];
+        }
+
+        StoreWakeWordData(mono_data);
+        mn_state = multinet_->detect(multinet_model_data_, const_cast<int16_t*>(mono_data.data()));
+    } else {
+        StoreWakeWordData(data);
+        mn_state = multinet_->detect(multinet_model_data_, const_cast<int16_t*>(data.data()));
+    }
+    
+    if (mn_state == ESP_MN_STATE_DETECTING) {
+        return;
+    } else if (mn_state == ESP_MN_STATE_DETECTED) {
+        esp_mn_results_t *mn_result = multinet_->get_results(multinet_model_data_);
+        ESP_LOGI(TAG, "Custom wake word detected: command_id=%d, string=%s, prob=%f", 
+                mn_result->command_id[0], mn_result->string, mn_result->prob[0]);
+        
+        if (mn_result->command_id[0] == 1) {
+            last_detected_wake_word_ = CONFIG_CUSTOM_WAKE_WORD_DISPLAY;
+        }
+        running_ = false;
+        
+        if (wake_word_detected_callback_) {
+            wake_word_detected_callback_(last_detected_wake_word_);
+        }
+        multinet_->clean(multinet_model_data_);
+    } else if (mn_state == ESP_MN_STATE_TIMEOUT) {
+        ESP_LOGD(TAG, "Command word detection timeout, cleaning state");
+        multinet_->clean(multinet_model_data_);
+    }
+}
+
+size_t CustomWakeWord::GetFeedSize() {
+    if (multinet_model_data_ == nullptr) {
+        return 0;
+    }
+    return multinet_->get_samp_chunksize(multinet_model_data_);
+}
+
+void CustomWakeWord::StoreWakeWordData(const std::vector<int16_t>& data) {
+    // store audio data to wake_word_pcm_
+    wake_word_pcm_.push_back(data);
+    // keep about 2 seconds of audio: each chunk is ~32 ms (512 samples at 16 kHz), so 2000 / 30 = 66 chunks is ~2.1 s
+    while (wake_word_pcm_.size() > 2000 / 30) {
+        wake_word_pcm_.pop_front();
+    }
+}
+
+void CustomWakeWord::EncodeWakeWordData() {
+    const size_t stack_size = 4096 * 7;
+    wake_word_opus_.clear();
+    if (wake_word_encode_task_stack_ == nullptr) {
+        wake_word_encode_task_stack_ = (StackType_t*)heap_caps_malloc(stack_size, MALLOC_CAP_SPIRAM);
+        assert(wake_word_encode_task_stack_ != nullptr);
+    }
+    if (wake_word_encode_task_buffer_ == nullptr) {
+        wake_word_encode_task_buffer_ = (StaticTask_t*)heap_caps_malloc(sizeof(StaticTask_t), MALLOC_CAP_INTERNAL);
+        assert(wake_word_encode_task_buffer_ != nullptr);
+    }
+
+    wake_word_encode_task_ = xTaskCreateStatic([](void* arg) {
+        auto this_ = (CustomWakeWord*)arg;
+        {
+            auto start_time = esp_timer_get_time();
+            auto encoder = std::make_unique<OpusEncoderWrapper>(16000, 1, OPUS_FRAME_DURATION_MS);
+            encoder->SetComplexity(0); // 0 is the fastest
+
+            int packets = 0;
+            for (auto& pcm: this_->wake_word_pcm_) {
+                encoder->Encode(std::move(pcm), [this_](std::vector<uint8_t>&& opus) {
+                    std::lock_guard<std::mutex> lock(this_->wake_word_mutex_);
+                    this_->wake_word_opus_.emplace_back(std::move(opus));
+                    this_->wake_word_cv_.notify_all();
+                });
+                packets++;
+            }
+            this_->wake_word_pcm_.clear();
+
+            auto end_time = esp_timer_get_time();
+            ESP_LOGI(TAG, "Encode wake word opus %d packets in %ld ms", packets, (long)((end_time - start_time) / 1000));
+
+            std::lock_guard<std::mutex> lock(this_->wake_word_mutex_);
+            this_->wake_word_opus_.push_back(std::vector<uint8_t>());
+            this_->wake_word_cv_.notify_all();
+        }
+        vTaskDelete(NULL);
+    }, "encode_wake_word", stack_size, this, 2, wake_word_encode_task_stack_, wake_word_encode_task_buffer_);
+}
+
+bool CustomWakeWord::GetWakeWordOpus(std::vector<uint8_t>& opus) {
+    std::unique_lock<std::mutex> lock(wake_word_mutex_);
+    wake_word_cv_.wait(lock, [this]() {
+        return !wake_word_opus_.empty();
+    });
+    opus.swap(wake_word_opus_.front());
+    wake_word_opus_.pop_front();
+    return !opus.empty();
+}
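
Review note: `EncodeWakeWordData` and `GetWakeWordOpus` above form a producer/consumer pair in which an empty packet marks the end of the stream. The following is a minimal, standalone sketch of that pattern, not the project's code: `PacketQueue`, `Push`, `Finish`, `Pop`, and `DrainAll` are hypothetical names introduced only for illustration, and the sketch uses plain standard C++ rather than FreeRTOS tasks.

```cpp
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for the wake_word_opus_ queue; not the project's class.
class PacketQueue {
public:
    // Producer side: append one encoded packet and wake any waiting consumer.
    void Push(std::vector<uint8_t> packet) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            packets_.emplace_back(std::move(packet));
        }
        cv_.notify_all();
    }

    // Producer side: an empty packet marks the end of the stream.
    void Finish() { Push({}); }

    // Consumer side: blocks until a packet is available.
    // Returns false once the empty end-of-stream marker is reached.
    bool Pop(std::vector<uint8_t>& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !packets_.empty(); });
        out.swap(packets_.front());
        packets_.pop_front();
        return !out.empty();
    }

private:
    std::deque<std::vector<uint8_t>> packets_;
    std::mutex mutex_;
    std::condition_variable cv_;
};

// Drains the queue until the end-of-stream marker; returns the number of
// non-empty packets that were read.
inline std::size_t DrainAll(PacketQueue& queue) {
    std::size_t count = 0;
    std::vector<uint8_t> packet;
    while (queue.Pop(packet)) {
        ++count;
    }
    return count;
}
```

One consequence of this design is that a consumer calling `Pop` again after the marker has been consumed would block until the producer pushes more data, so a caller is expected to stop at the first `false`.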
--- a/main/audio/wake_words/custom_wake_word.h
+++ b/main/audio/wake_words/custom_wake_word.h
@@ -0,0 +1,58 @@
+#ifndef CUSTOM_WAKE_WORD_H
+#define CUSTOM_WAKE_WORD_H
+
+#include <esp_attr.h>
+#include <esp_mn_iface.h>
+#include <esp_mn_models.h>
+#include <model_path.h>
+
+#include <deque>
+#include <string>
+#include <vector>
+#include <functional>
+#include <mutex>
+#include <condition_variable>
+#include <atomic>
+
+#include "audio_codec.h"
+#include "wake_word.h"
+
+class CustomWakeWord : public WakeWord {
+public:
+    CustomWakeWord();
+    ~CustomWakeWord();
+
+    bool Initialize(AudioCodec* codec);
+    void Feed(const std::vector<int16_t>& data);
+    void OnWakeWordDetected(std::function<void(const std::string& wake_word)> callback);
+    void Start();
+    void Stop();
+    size_t GetFeedSize();
+    void EncodeWakeWordData();
+    bool GetWakeWordOpus(std::vector<uint8_t>& opus);
+    const std::string& GetLastDetectedWakeWord() const { return last_detected_wake_word_; }
+
+private:
+    // multinet-related member variables
+    esp_mn_iface_t* multinet_ = nullptr;
+    model_iface_data_t* multinet_model_data_ = nullptr;
+    srmodel_list_t *models_ = nullptr;
+    char* mn_name_ = nullptr;
+ 
+    std::function<void(const std::string& wake_word)> wake_word_detected_callback_;
+    AudioCodec* codec_ = nullptr;
+    std::string last_detected_wake_word_;
+    std::atomic<bool> running_ = false;
+
+    TaskHandle_t wake_word_encode_task_ = nullptr;
+    StaticTask_t* wake_word_encode_task_buffer_ = nullptr;
+    StackType_t* wake_word_encode_task_stack_ = nullptr;
+    std::deque<std::vector<int16_t>> wake_word_pcm_;
+    std::deque<std::vector<uint8_t>> wake_word_opus_;
+    std::mutex wake_word_mutex_;
+    std::condition_variable wake_word_cv_;
+
+    void StoreWakeWordData(const std::vector<int16_t>& data);
+};
+
+#endif
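
Review note: the `wake_word_pcm_` deque declared in this header is filled by `Feed` and `StoreWakeWordData`, which keep only the left channel of stereo input and retain a bounded history of recent chunks. A minimal standalone sketch of those two steps follows; `ExtractLeftChannel` and `StoreBounded` are hypothetical helper names used for illustration, and the sketch assumes the samples are interleaved as L, R, L, R.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

// Keep only the left channel of interleaved stereo samples (L, R, L, R, ...).
// Assumption: the left sample comes first in each pair.
inline std::vector<int16_t> ExtractLeftChannel(const std::vector<int16_t>& interleaved) {
    std::vector<int16_t> mono(interleaved.size() / 2);
    for (std::size_t i = 0; i < mono.size(); ++i) {
        mono[i] = interleaved[2 * i];
    }
    return mono;
}

// Append a chunk and drop the oldest chunks once the history exceeds max_chunks.
inline void StoreBounded(std::deque<std::vector<int16_t>>& history,
                         std::vector<int16_t> chunk,
                         std::size_t max_chunks) {
    history.push_back(std::move(chunk));
    while (history.size() > max_chunks) {
        history.pop_front();
    }
}
```

With 512-sample chunks at 16 kHz, a limit of 66 chunks (the value of `2000 / 30` in the source) corresponds to roughly 2.1 seconds of retained audio.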
--- a/main/audio/wake_words/esp_wake_word.cc
+++ b/main/audio/wake_words/esp_wake_word.cc
@@ -0,0 +1,82 @@
+#include "esp_wake_word.h"
+#include <esp_log.h>
+
+
+#define TAG "EspWakeWord"
+
+EspWakeWord::EspWakeWord() {
+}
+
+EspWakeWord::~EspWakeWord() {
+    if (wakenet_data_ != nullptr) {
+        wakenet_iface_->destroy(wakenet_data_);
+        esp_srmodel_deinit(wakenet_model_);
+    }
+}
+
+bool EspWakeWord::Initialize(AudioCodec* codec) {
+    codec_ = codec;
+
+    wakenet_model_ = esp_srmodel_init("model");
+    if (wakenet_model_ == nullptr || wakenet_model_->num == -1) {
+        ESP_LOGE(TAG, "Failed to initialize wakenet model");
+        return false;
+    }
+    if (wakenet_model_->num > 1) {
+        ESP_LOGW(TAG, "More than one model found, using the first one");
+    } else if (wakenet_model_->num == 0) {
+        ESP_LOGE(TAG, "No model found");
+        return false;
+    }
+    char *model_name = wakenet_model_->model_name[0];
+    wakenet_iface_ = (esp_wn_iface_t*)esp_wn_handle_from_name(model_name);
+    wakenet_data_ = wakenet_iface_->create(model_name, DET_MODE_95);
+
+    int frequency = wakenet_iface_->get_samp_rate(wakenet_data_);
+    int audio_chunksize = wakenet_iface_->get_samp_chunksize(wakenet_data_);
+    ESP_LOGI(TAG, "Wake word(%s),freq: %d, chunksize: %d", model_name, frequency, audio_chunksize);
+
+    return true;
+}
+
+void EspWakeWord::OnWakeWordDetected(std::function<void(const std::string& wake_word)> callback) {
+    wake_word_detected_callback_ = callback;
+}
+
+void EspWakeWord::Start() {
+    running_ = true;
+}
+
+void EspWakeWord::Stop() {
+    running_ = false;
+}
+
+void EspWakeWord::Feed(const std::vector<int16_t>& data) {
+    if (wakenet_data_ == nullptr || !running_) {
+        return;
+    }
+
+    int res = wakenet_iface_->detect(wakenet_data_, const_cast<int16_t*>(data.data()));
+    if (res > 0) {
+        last_detected_wake_word_ = wakenet_iface_->get_word_name(wakenet_data_, res);
+        running_ = false;
+
+        if (wake_word_detected_callback_) {
+            wake_word_detected_callback_(last_detected_wake_word_);
+        }
+    }
+}
+
+size_t EspWakeWord::GetFeedSize() {
+    if (wakenet_data_ == nullptr) {
+        return 0;
+    }
+    return wakenet_iface_->get_samp_chunksize(wakenet_data_);
+}
+
+void EspWakeWord::EncodeWakeWordData() {
+}
+
+bool EspWakeWord::GetWakeWordOpus(std::vector<uint8_t>& opus) {
+    return false;
+}
--- a/main/audio/wake_words/esp_wake_word.h
+++ b/main/audio/wake_words/esp_wake_word.h
@@ -0,0 +1,42 @@
+#ifndef ESP_WAKE_WORD_H
+#define ESP_WAKE_WORD_H
+
+#include <esp_wn_iface.h>
+#include <esp_wn_models.h>
+#include <model_path.h>
+
+#include <string>
+#include <vector>
+#include <functional>
+#include <atomic>
+
+#include "audio_codec.h"
+#include "wake_word.h"
+
+class EspWakeWord : public WakeWord {
+public:
+    EspWakeWord();
+    ~EspWakeWord();
+
+    bool Initialize(AudioCodec* codec);
+    void Feed(const std::vector<int16_t>& data);
+    void OnWakeWordDetected(std::function<void(const std::string& wake_word)> callback);
+    void Start();
+    void Stop();
+    size_t GetFeedSize();
+    void EncodeWakeWordData();
+    bool GetWakeWordOpus(std::vector<uint8_t>& opus);
+    const std::string& GetLastDetectedWakeWord() const { return last_detected_wake_word_; }
+
+private:
+    esp_wn_iface_t *wakenet_iface_ = nullptr;
+    model_iface_data_t *wakenet_data_ = nullptr;
+    srmodel_list_t *wakenet_model_ = nullptr;
+    AudioCodec* codec_ = nullptr;
+    std::atomic<bool> running_ = false;
+
+    std::function<void(const std::string& wake_word)> wake_word_detected_callback_;
+    std::string last_detected_wake_word_;
+};
+
+#endif