Chapter 3 Comparing the Performance of Voice Keyword Detection

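The previous chapter introduced voice keyword detection for four words: right, left, forward, and stop. The detection results, however, were sometimes unsatisfactory. I later came across another piece of hardware, the Seeed Studio Wio Terminal, whose voice detection happens to use the Arduino IDE, the same application software as the Nano 33 BLE Sense. That prompted the idea of comparing the voice detection results of the two boards.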

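Board                      CPU                    CPU core         CPU speed  ROM                   RAM
Arduino Nano 33 BLE Sense  Nordic nRF52840        ARM Cortex-M4F*  64MHz      1MB                   256KB
Wio Terminal               Microchip ATSAMD51P19  ARM Cortex-M4F*  120MHz     4MB (external flash)  192KB

Remark * M4F includes an FPU (floating-point unit). The Cortex-M4 in the nRF52840 also has an FPU, so both boards are marked M4F.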

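Since the goal is to compare hardware performance, the same application should be used on both boards as far as possible. I therefore adopted the Wio Terminal workflow, which runs entirely through Edge Impulse. The Wio Terminal project provides a Colab notebook (.ipynb) that uploads the data to Edge Impulse, and after training, Edge Impulse produces a .zip file that the Arduino IDE can use directly.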

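Both the Arduino Nano 33 BLE Sense and the Wio Terminal run the same .zip file in the Arduino IDE. The only difference is that the microphone capture code has to be modified for the Wio Terminal. I used the official version provided for the Wio Terminal, whereas the Arduino Nano 33 BLE Sense uses the version provided by Edge Impulse. Besides the microphone capture code, the Wio Terminal version also appears to modify part of the main program, which leads to differences in the recognition results between the two boards.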

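To keep differences in pronunciation from one utterance to the next out of the results, the Arduino Nano 33 BLE Sense and the Wio Terminal were connected to two different laptops (a PC and a Mac) and tested at the same time. The laptops were then swapped to see whether a different pairing gave different results. Each word was tested 3 times to avoid relying on a single measurement.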

Combination  Board + laptop  Right           Left            Forward         Stop
1            Nano + Mac      0.93/0.83/0.87  0.86/0.77/0.97  0.81/0.91/0.93  0.99/0.99/0.99
1            Wio + PC        0.68/0.76/0.60  0.97/0.81/0.59  0.88/0.92/0.91  0.82/0.92/0.83
2            Nano + PC       0.03/0.50/0.37  0.82/0.86/0.74  0.91/0.91/0.71  0.44/0.99/0.66
2            Wio + Mac       0.16/0.58/0.63  0.77/0.66/0.79  0.96/0.89/0.89  0.97/0.97/0.77
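
With three trials per cell, the table is easier to compare once the scores are averaged. The following is a minimal Python sketch, not part of the original test setup, that pools the six scores each board produced for a keyword (both laptop pairings) and takes the mean. The scores are copied from the table; the names `results` and `mean_by_board` are made up for this illustration.

```python
from statistics import mean

# Scores copied from the table above: three trials per keyword and pairing.
results = {
    ("Nano", "Mac"): {"right": [0.93, 0.83, 0.87], "left": [0.86, 0.77, 0.97],
                      "forward": [0.81, 0.91, 0.93], "stop": [0.99, 0.99, 0.99]},
    ("Wio", "PC"): {"right": [0.68, 0.76, 0.60], "left": [0.97, 0.81, 0.59],
                    "forward": [0.88, 0.92, 0.91], "stop": [0.82, 0.92, 0.83]},
    ("Nano", "PC"): {"right": [0.03, 0.50, 0.37], "left": [0.82, 0.86, 0.74],
                     "forward": [0.91, 0.91, 0.71], "stop": [0.44, 0.99, 0.66]},
    ("Wio", "Mac"): {"right": [0.16, 0.58, 0.63], "left": [0.77, 0.66, 0.79],
                     "forward": [0.96, 0.89, 0.89], "stop": [0.97, 0.97, 0.77]},
}


def mean_by_board(table):
    """Average all trials of each keyword per board, across both laptops."""
    pooled = {}
    for (board, _laptop), scores in table.items():
        for word, trials in scores.items():
            pooled.setdefault(board, {}).setdefault(word, []).extend(trials)
    return {board: {word: round(mean(trials), 3) for word, trials in words.items()}
            for board, words in pooled.items()}


if __name__ == "__main__":
    for board, words in mean_by_board(results).items():
        print(board, words)
```

For "right", the pooled means come out at about 0.59 for the Nano and 0.57 for the Wio, the lowest of the four keywords on both boards. With only six samples per cell these averages are a rough summary and not a statistical result.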

The word "right" appears to be difficult for both the Nano and the Wio, although the Nano can still reach a score as high as 93%.

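The Wio Terminal does produce recognition results more frequently. The listings below show what each board recognized over the same time span: the Nano produced 9 results and the Wio Terminal produced 24. Even so, this does not support the conclusion that the Wio Terminal has a higher recognition rate. The model determines the recognition result, while CPU speed can increase how often recognition runs. For recognizing 1-second speech clips, the 64MHz Arm Cortex-M4 CPU of the Arduino Nano 33 BLE Sense is already sufficient.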
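
The counts of 9 and 24 can be checked directly from the serial monitor output instead of by hand. The sketch below is an assumption on my part about how one might do it in Python: it looks for the "Predictions" header line that starts each result, reads the Arduino IDE timestamp in front of it, and reports the number of results and the average gap between them. The function names and the short `SAMPLE` excerpt are only for illustration.

```python
import re
from datetime import datetime

# Header line printed once per inference by the serial monitor, e.g.
# "15:53:28.881 -> Predictions (DSP: 90 ms., Classification: 7 ms., ...".
HEADER = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}) -> Predictions")


def inference_times(log_text):
    """Return the timestamp of every 'Predictions' header in the log."""
    times = []
    for line in log_text.splitlines():
        match = HEADER.match(line.strip())
        if match:
            times.append(datetime.strptime(match.group(1), "%H:%M:%S.%f"))
    return times


def mean_interval_s(times):
    """Average number of seconds between consecutive inferences."""
    if len(times) < 2:
        return None
    return (times[-1] - times[0]).total_seconds() / (len(times) - 1)


# A few lines from the Nano listing, used here as a small sample.
SAMPLE = """\
15:53:28.881 -> Predictions (DSP: 90 ms., Classification: 7 ms., Anomaly: 0 ms.):
15:53:28.881 ->     right: 0.87109
15:53:29.893 -> Predictions (DSP: 90 ms., Classification: 6 ms., Anomaly: 0 ms.):
15:53:30.890 -> Predictions (DSP: 90 ms., Classification: 6 ms., Anomaly: 0 ms.):
"""

if __name__ == "__main__":
    times = inference_times(SAMPLE)
    print(len(times), "inferences, mean interval", mean_interval_s(times), "s")
```

Run on the two full listings, this gives 9 results about 1.0 s apart for the Nano and 24 results about 0.36 s apart for the Wio Terminal.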

3.0.1 Arduino Nano 33 BLE Sense Recognition Results - 3

15:53:28.881 -> Predictions (DSP: 90 ms., Classification: 7 ms., Anomaly: 0 ms.): 
15:53:28.881 ->     _noise: 0.00000
15:53:28.881 ->     _unknown: 0.07422
15:53:28.881 ->     forward: 0.00391
15:53:28.881 ->     left: 0.05078
15:53:28.881 ->     right: 0.87109
15:53:28.881 ->     stop: 0.00000
15:53:29.893 -> Predictions (DSP: 90 ms., Classification: 6 ms., Anomaly: 0 ms.): 
15:53:29.893 ->     _noise: 0.88281
15:53:29.893 ->     _unknown: 0.01562
15:53:29.893 ->     forward: 0.00000
15:53:29.893 ->     left: 0.00781
15:53:29.893 ->     right: 0.01562
15:53:29.893 ->     stop: 0.07422
15:53:30.890 -> Predictions (DSP: 90 ms., Classification: 6 ms., Anomaly: 0 ms.): 
15:53:30.890 ->     _noise: 0.71094
15:53:30.890 ->     _unknown: 0.06250
15:53:30.890 ->     forward: 0.00391
15:53:30.890 ->     left: 0.05078
15:53:30.890 ->     right: 0.04297
15:53:30.890 ->     stop: 0.12891
15:53:31.872 -> Predictions (DSP: 91 ms., Classification: 6 ms., Anomaly: 0 ms.): 
15:53:31.872 ->     _noise: 0.00000
15:53:31.872 ->     _unknown: 0.01172
15:53:31.872 ->     forward: 0.00000
15:53:31.872 ->     left: 0.97656
15:53:31.872 ->     right: 0.01172
15:53:31.872 ->     stop: 0.00000
15:53:32.880 -> Predictions (DSP: 91 ms., Classification: 6 ms., Anomaly: 0 ms.): 
15:53:32.880 ->     _noise: 0.78906
15:53:32.880 ->     _unknown: 0.06641
15:53:32.880 ->     forward: 0.01172
15:53:32.880 ->     left: 0.02734
15:53:32.880 ->     right: 0.00781
15:53:32.880 ->     stop: 0.09766
15:53:33.893 -> Predictions (DSP: 90 ms., Classification: 7 ms., Anomaly: 0 ms.): 
15:53:33.893 ->     _noise: 0.00000
15:53:33.893 ->     _unknown: 0.02734
15:53:33.893 ->     forward: 0.93359
15:53:33.893 ->     left: 0.01172
15:53:33.893 ->     right: 0.02734
15:53:33.893 ->     stop: 0.00000
15:53:34.866 -> Predictions (DSP: 90 ms., Classification: 6 ms., Anomaly: 0 ms.): 
15:53:34.866 ->     _noise: 0.11328
15:53:34.866 ->     _unknown: 0.29297
15:53:34.866 ->     forward: 0.00391
15:53:34.866 ->     left: 0.51172
15:53:34.866 ->     right: 0.01562
15:53:34.866 ->     stop: 0.06250
15:53:35.878 -> Predictions (DSP: 91 ms., Classification: 6 ms., Anomaly: 0 ms.): 
15:53:35.878 ->     _noise: 0.88672
15:53:35.878 ->     _unknown: 0.03125
15:53:35.878 ->     forward: 0.00000
15:53:35.878 ->     left: 0.02344
15:53:35.878 ->     right: 0.00781
15:53:35.878 ->     stop: 0.05078
15:53:36.887 -> Predictions (DSP: 91 ms., Classification: 6 ms., Anomaly: 0 ms.): 
15:53:36.887 ->     _noise: 0.00000
15:53:36.887 ->     _unknown: 0.00391
15:53:36.887 ->     forward: 0.00000
15:53:36.887 ->     left: 0.00000
15:53:36.887 ->     right: 0.00000
15:53:36.887 ->     stop: 0.99609

3.0.2 Wio Terminal Recognition Results - 3

15:53:28.722 -> Predictions (DSP: 206 ms, NN: 15 ms)
15:53:28.722 ->     _noise: 0.00000
15:53:28.722 ->     _unknown: 0.19531
15:53:28.722 ->     forward: 0.00781
15:53:28.722 ->     left: 0.60156
15:53:28.722 ->     right: 0.16016
15:53:28.722 ->     stop: 0.03516
15:53:29.117 -> Predictions (DSP: 206 ms, NN: 15 ms)
15:53:29.117 ->     _noise: 0.00000
15:53:29.117 ->     _unknown: 0.41016
15:53:29.117 ->     forward: 0.00000
15:53:29.117 ->     left: 0.33984
15:53:29.117 ->     right: 0.19141
15:53:29.117 ->     stop: 0.06250
15:53:29.515 -> Predictions (DSP: 196 ms, NN: 16 ms)
15:53:29.515 ->     _noise: 0.69141
15:53:29.515 ->     _unknown: 0.05078
15:53:29.515 ->     forward: 0.00391
15:53:29.515 ->     left: 0.02734
15:53:29.515 ->     right: 0.00781
15:53:29.515 ->     stop: 0.22266
15:53:29.911 -> Predictions (DSP: 207 ms, NN: 15 ms)
15:53:29.911 ->     _noise: 0.98828
15:53:29.911 ->     _unknown: 0.00000
15:53:29.911 ->     forward: 0.00000
15:53:29.911 ->     left: 0.00000
15:53:29.911 ->     right: 0.00000
15:53:29.911 ->     stop: 0.00781
15:53:30.198 -> Predictions (DSP: 206 ms, NN: 15 ms)
15:53:30.198 ->     _noise: 0.87109
15:53:30.198 ->     _unknown: 0.05078
15:53:30.198 ->     forward: 0.01172
15:53:30.198 ->     left: 0.00781
15:53:30.198 ->     right: 0.01172
15:53:30.198 ->     stop: 0.04297
15:53:30.485 -> Predictions (DSP: 196 ms, NN: 16 ms)
15:53:30.485 ->     _noise: 0.88672
15:53:30.485 ->     _unknown: 0.04297
15:53:30.485 ->     forward: 0.01562
15:53:30.485 ->     left: 0.02344
15:53:30.485 ->     right: 0.01172
15:53:30.485 ->     stop: 0.01953
15:53:30.917 -> Predictions (DSP: 207 ms, NN: 16 ms)
15:53:30.917 ->     _noise: 0.85547
15:53:30.917 ->     _unknown: 0.05078
15:53:30.917 ->     forward: 0.00391
15:53:30.917 ->     left: 0.01562
15:53:30.917 ->     right: 0.00000
15:53:30.917 ->     stop: 0.07422
15:53:31.209 -> Predictions (DSP: 207 ms, NN: 15 ms)
15:53:31.209 ->     _noise: 0.00000
15:53:31.209 ->     _unknown: 0.37500
15:53:31.209 ->     forward: 0.04688
15:53:31.209 ->     left: 0.37500
15:53:31.209 ->     right: 0.05859
15:53:31.209 ->     stop: 0.14453
15:53:31.603 -> Predictions (DSP: 196 ms, NN: 16 ms)
15:53:31.603 ->     _noise: 0.00000
15:53:31.603 ->     _unknown: 0.23047
15:53:31.603 ->     forward: 0.00391
15:53:31.603 ->     left: 0.59375
15:53:31.603 ->     right: 0.04297
15:53:31.603 ->     stop: 0.13281
15:53:32.107 -> Predictions (DSP: 207 ms, NN: 15 ms)
15:53:32.107 ->     _noise: 0.00000
15:53:32.107 ->     _unknown: 0.64453
15:53:32.107 ->     forward: 0.00000
15:53:32.107 ->     left: 0.06641
15:53:32.107 ->     right: 0.03906
15:53:32.107 ->     stop: 0.25000
15:53:32.395 -> Predictions (DSP: 206 ms, NN: 15 ms)
15:53:32.395 ->     _noise: 0.74219
15:53:32.395 ->     _unknown: 0.00781
15:53:32.395 ->     forward: 0.00000
15:53:32.395 ->     left: 0.00391
15:53:32.395 ->     right: 0.00391
15:53:32.395 ->     stop: 0.23828
15:53:32.683 -> Predictions (DSP: 196 ms, NN: 16 ms)
15:53:32.683 ->     _noise: 0.97656
15:53:32.683 ->     _unknown: 0.00781
15:53:32.683 ->     forward: 0.00000
15:53:32.683 ->     left: 0.00781
15:53:32.683 ->     right: 0.00391
15:53:32.683 ->     stop: 0.00781
15:53:33.114 -> Predictions (DSP: 207 ms, NN: 16 ms)
15:53:33.114 ->     _noise: 0.98828
15:53:33.114 ->     _unknown: 0.00391
15:53:33.114 ->     forward: 0.00000
15:53:33.114 ->     left: 0.00391
15:53:33.114 ->     right: 0.00391
15:53:33.114 ->     stop: 0.00391
15:53:33.401 -> Predictions (DSP: 206 ms, NN: 15 ms)
15:53:33.401 ->     _noise: 0.00781
15:53:33.401 ->     _unknown: 0.18750
15:53:33.401 ->     forward: 0.39844
15:53:33.401 ->     left: 0.18750
15:53:33.401 ->     right: 0.05859
15:53:33.401 ->     stop: 0.15625
15:53:33.835 -> Predictions (DSP: 196 ms, NN: 16 ms)
15:53:33.835 ->     _noise: 0.00000
15:53:33.835 ->     _unknown: 0.05469
15:53:33.835 ->     forward: 0.91797
15:53:33.835 ->     left: 0.01172
15:53:33.835 ->     right: 0.00781
15:53:33.835 ->     stop: 0.01172
15:53:34.194 -> Predictions (DSP: 207 ms, NN: 15 ms)
15:53:34.194 ->     _noise: 0.00000
15:53:34.194 ->     _unknown: 0.14844
15:53:34.194 ->     forward: 0.80859
15:53:34.194 ->     left: 0.00781
15:53:34.194 ->     right: 0.00781
15:53:34.194 ->     stop: 0.03125
15:53:34.487 -> Predictions (DSP: 206 ms, NN: 15 ms)
15:53:34.487 ->     _noise: 0.00000
15:53:34.487 ->     _unknown: 0.16406
15:53:34.487 ->     forward: 0.75000
15:53:34.487 ->     left: 0.00781
15:53:34.487 ->     right: 0.06250
15:53:34.487 ->     stop: 0.00781
15:53:34.884 -> Predictions (DSP: 195 ms, NN: 15 ms)
15:53:34.884 ->     _noise: 0.93750
15:53:34.884 ->     _unknown: 0.02734
15:53:34.884 ->     forward: 0.00391
15:53:34.884 ->     left: 0.01953
15:53:34.884 ->     right: 0.00391
15:53:34.884 ->     stop: 0.01172
15:53:35.316 -> Predictions (DSP: 207 ms, NN: 15 ms)
15:53:35.316 ->     _noise: 0.86328
15:53:35.316 ->     _unknown: 0.05078
15:53:35.316 ->     forward: 0.01172
15:53:35.316 ->     left: 0.01172
15:53:35.316 ->     right: 0.00781
15:53:35.316 ->     stop: 0.05078
15:53:35.606 -> Predictions (DSP: 206 ms, NN: 15 ms)
15:53:35.606 ->     _noise: 0.73828
15:53:35.606 ->     _unknown: 0.06250
15:53:35.606 ->     forward: 0.04297
15:53:35.606 ->     left: 0.03125
15:53:35.606 ->     right: 0.01562
15:53:35.606 ->     stop: 0.11328
15:53:35.894 -> Predictions (DSP: 195 ms, NN: 15 ms)
15:53:35.894 ->     _noise: 0.94141
15:53:35.894 ->     _unknown: 0.01172
15:53:35.894 ->     forward: 0.00000
15:53:35.894 ->     left: 0.01953
15:53:35.894 ->     right: 0.00781
15:53:35.894 ->     stop: 0.02344
15:53:36.290 -> Predictions (DSP: 207 ms, NN: 15 ms)
15:53:36.290 ->     _noise: 0.00000
15:53:36.290 ->     _unknown: 0.17188
15:53:36.290 ->     forward: 0.00391
15:53:36.290 ->     left: 0.04688
15:53:36.290 ->     right: 0.00391
15:53:36.290 ->     stop: 0.77344
15:53:36.722 -> Predictions (DSP: 206 ms, NN: 15 ms)
15:53:36.722 ->     _noise: 0.00000
15:53:36.722 ->     _unknown: 0.05078
15:53:36.722 ->     forward: 0.00000
15:53:36.722 ->     left: 0.10547
15:53:36.722 ->     right: 0.01172
15:53:36.722 ->     stop: 0.83594
15:53:36.974 -> Predictions (DSP: 195 ms, NN: 15 ms)
15:53:36.974 ->     _noise: 0.00000
15:53:36.974 ->     _unknown: 0.39844
15:53:36.974 ->     forward: 0.00000
15:53:36.974 ->     left: 0.05078
15:53:36.974 ->     right: 0.07422
15:53:36.974 ->     stop: 0.48047
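
To read the two listings side by side, it helps to reduce each result to its highest-scoring label. The following Python sketch is one possible way to do that and is not taken from the Edge Impulse or Wio Terminal code. It parses both log formats shown above, ignores results whose top label is `_noise` or `_unknown`, and keeps a keyword only when its score reaches a threshold. The threshold of 0.6 is an arbitrary value chosen for this example.

```python
import re

HEADER = re.compile(r"-> Predictions")
# A score line looks like "15:53:28.881 ->     right: 0.87109".
SCORE = re.compile(r"->\s+(\w+):\s+([0-9.]+)\s*$")


def parse_windows(log_text):
    """Return one {label: score} dict per 'Predictions' block in the log."""
    windows = []
    for line in log_text.splitlines():
        if HEADER.search(line):
            windows.append({})
            continue
        match = SCORE.search(line)
        if match and windows:
            windows[-1][match.group(1)] = float(match.group(2))
    return windows


def detected_keywords(windows, threshold=0.6):
    """Keep the top label of each block if it is a keyword above threshold."""
    hits = []
    for scores in windows:
        if not scores:
            continue
        label = max(scores, key=scores.get)
        # Labels starting with "_" (_noise, _unknown) are not keywords.
        if not label.startswith("_") and scores[label] >= threshold:
            hits.append((label, scores[label]))
    return hits


# The first two results from the Nano listing.
SAMPLE = """\
15:53:28.881 -> Predictions (DSP: 90 ms., Classification: 7 ms., Anomaly: 0 ms.):
15:53:28.881 ->     _noise: 0.00000
15:53:28.881 ->     _unknown: 0.07422
15:53:28.881 ->     forward: 0.00391
15:53:28.881 ->     left: 0.05078
15:53:28.881 ->     right: 0.87109
15:53:28.881 ->     stop: 0.00000
15:53:29.893 -> Predictions (DSP: 90 ms., Classification: 6 ms., Anomaly: 0 ms.):
15:53:29.893 ->     _noise: 0.88281
15:53:29.893 ->     _unknown: 0.01562
15:53:29.893 ->     forward: 0.00000
15:53:29.893 ->     left: 0.00781
15:53:29.893 ->     right: 0.01562
15:53:29.893 ->     stop: 0.07422
"""

if __name__ == "__main__":
    for label, score in detected_keywords(parse_windows(SAMPLE)):
        print(label, score)
```

Because the Wio Terminal reports results more often, one spoken word can show up in several consecutive results (for example "forward" at 15:53:33.835, 15:53:34.194, and 15:53:34.487), so repeated labels would need to be merged before counting words.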