# Final Report
## Tutorial on how to build a handheld Voice-User-Interface for real-time speech-controlled ML applications
*Jan von Loeper (jv222fq)*
# Project Overview
This project is a first exploration toward building a faceless (screenless) smartphone that relies on speech as its input. My aim is to explore this further after the course and assess the boundaries of Voice-User-Interfaces (VUIs). In this first trial I explore the basic components needed to transfer voice input from a microphone via a wireless connection to an online API that processes speech.
### Time Estimate
The time required is hard to estimate; it depends on how many dependencies are already installed and which type of microphone is chosen. In the ideal case this project should take less than 8 hours.
# Objective
### Ambition & Motivation
The ambition is not to find the best solution for every problem or to optimise every part with custom code. Instead, this is a start, where I seek one way to make everything work. My intent is to learn from this and create a foundation that I can build on and iterate on later. I expect to find out which components integrate well and which don’t.
### Approach
I approach this project as a sensor application. In 2020, Joshua Villamater built a voice-controlled lamp ([https://hackmd.io/@joshua01/SyeghoVJP](https://hackmd.io/@joshua01/SyeghoVJP)), which he approached as an actuating application. He relied on an Android smartphone and the Google speech assistant and put emphasis on the expression and control of the LED lights. By approaching this as a sensor application, I can break out of the form factor of existing devices which offer speech input, such as smartphones, tablets and laptops. On the other hand, I rely on ready-made APIs and applications, such as AssemblyAI and OpenAI.
### Vision for a Product & Purpose
If I were to develop this as a product, the form factor could become smaller than a smartphone, approximating the Apple Remote Gen 2 (2009). Instead of looking at VR or AR applications displayed on a smartphone screen, the world of IoT devices becomes the user’s screen or canvas. Imagine you visited the Louvre, the Mona Lisa were shown on a digital screen, and you could tell your screenless smartphone: “Can you redraw her in a dress by The Fabricant from the 2022 collection?” With the help of GPT and DALL-E it would change the artwork and redraw it according to your real-time speech input from your personal screenless smartphone. It would be a phone not primarily meant to communicate with people, but with networked computational devices with AI capabilities, which respond to speech input or act as conversational agents.
### Expected Learnings and Insights
I expect to gain insights on:
(1) How to sense, transmit and feed voices as audio data all the way from the user’s spoken word to an API which can process speech as an input for ML applications.
(2) How to program and initiate the transmission of audio data.
(3) How to establish connectivity from my self-assembled device to an API on the internet.
(4) How small such a device might become in comparison to a 2022 smartphone.
# Material
### Choice of Programming Language
In this project I have chosen to work with MicroPython. JavaScript and Toit might be alternatives worth exploring. I see the advantage of Python in the variety of machine-learning applications, but MicroPython is slower and less efficient than Toit for embedded applications. JS is particularly interesting for responsive web design (incl. libraries such as React), but also for ML applications such as TensorFlow.js. I have tried JS in combination with Arduino microcontrollers before (with a Node.js backend) and found JS more suitable for applications with a web UI than Arduino’s C/C++-based language (often loosely called Processing). The latter I consider less attractive, since it is more constrained in terms of ML and responsive-design libraries/APIs and is low-level.
I consider Python and JavaScript powerful when it comes to ML APIs and web applications. C and Toit are better at making efficient use of embedded electronics. Toit has supported I2S from the start ([https://libs.toit.io/i2s/library-summary](https://libs.toit.io/i2s/library-summary)), whereas MicroPython neglected this addition for a long time and still makes it hard to make use of microphones.
### BOM for Development
These materials were mostly chosen for development and partly for further use after this course. The Pycom development board comes with a variety of connectivity options (WiFi, Bluetooth, LoRa, Sigfox, LTE CAT M1 / NB1), which I can explore further after this course. It supports MicroPython, which I am using. Since I want to challenge the concept of a smartphone, I chose a board with many options to connect to things, incl. the possibility to use a SIM card. The expansion board is not strictly needed, but it is convenient for connecting the microcontroller for the first time via USB to the PC (in my case a Mac) and the IDE. The microphone has automatic gain control; I bought it in the hope that I would not have to take care of amplifier gain or level adjustment, depending on how loud the voice is and how far it is from the microphone.
- microcontroller: Pycom FiPy (used as development board) [https://pycom.io/product/fipy/](https://pycom.io/product/fipy/) 59,40€
- expansion board: Pycom Pysense 2.0 X (the regular expansion board was out of stock) [https://pycom.io/product/pysense-2-0-x/](https://pycom.io/product/pysense-2-0-x/) 29,65€
- micro USB cable: connect the expansion board to the Mac (I carried over the cable from an existing Arduino kit – since there are multiple kinds of Micro USB connectors, I had to pay close attention that I have the right form factor)
- battery: 3.7 V 2000 mAh (used for development to approximately match the capacity of a smartphone – iPhone 13 mini: 2406 mAh) 18.08€ [https://www.electrokit.com/produkt/batteri-lipo-3-7v-2000mah-2/](https://www.electrokit.com/produkt/batteri-lipo-3-7v-2000mah-2/)
- battery charger: Micro USB LIPO charger 7,15€ [https://www.electrokit.com/produkt/microlipo-usb-laddare/](https://www.electrokit.com/produkt/microlipo-usb-laddare/)
- microphone: Electret Microphone Amplifier with Auto Gain Control, 20 Hz–20 kHz. According to the spec sheet, the mic amplifier with automatic gain control is suitable for applications such as “personal digital assistant”, “Two-Way Communicators”, “high-quality portable recorders” and “IP phones/telephone conferencing”. [https://www.adafruit.com/product/1713](https://www.adafruit.com/product/1713) 15,28€ (on Amazon.nl)
- jump wires: connect the microphone to the microcontroller or expansion board
The BOM for Development adds up to ca. 130€. For the next iteration and further development toward a product, it could be reduced drastically. As a size reference for the vision, I added two remote controls.

# Platform
Before we dive into the setup, let’s get an overview of the platform and how all the components in the project are meant to connect and work together:
- Electret mic: sense sound from air
*–jump wire–*
- FiPy: receive sensor data, send audio signal as speech stream
*–wifi–*
- AssemblyAI: receive audio signal, transcribe the speech stream to a text stream, send text stream
*–API–*
- OpenAI: receive text stream, process it, respond with a generated image
*–API–*
- Streamlit: display the response from OpenAI on the web GUI
# Computer setup
### Install and test the microcontroller
During the installation you will use the VS Code IDE with the PyMakr extension ([https://github.com/pycom/pymakr-vsc/](https://github.com/pycom/pymakr-vsc/)) and Node.js. PyMakr is a plugin for uploading and running MicroPython code on ESP32-based boards. After the installation is completed, you will be able to run code on the microcontroller through the micro USB cable, or flash it and disconnect the microcontroller from the Mac.
1. Choose an IDE: I have chosen VS Code, because it offers PyMakr as an extension and provides me with useful prompts for commands. Additionally, we will use Xcode and the Terminal. Some packages will be installed with Homebrew: [https://brew.sh/](https://brew.sh/). More details on the combination of VS Code and PyMakr can be found here: [https://lemariva.com/blog/rss/micropython-visual-studio-code-as-ide](https://lemariva.com/blog/rss/micropython-visual-studio-code-as-ide)
2. Install Node.js: PyMakr requires Node.js. I already had it installed for a previous project where I controlled an Arduino with JS. Node.js takes up a lot of space on my disc.
3. Install the PyMakr extension in VS Code: (1) open VS Code, (2) press *Shift Cmd X* to open the Extensions menu, (3) search for ‘PyMakr’, (4) install it, (5) in case you have the ‘Python’ extension installed, disable it while using the Pycom board with PyMakr (it might interfere when you run MicroPython code).
4. Connect the microcontroller to the Mac (following this guide: [https://docs.pycom.io/gettingstarted/](https://docs.pycom.io/gettingstarted/)): (1) plug the FiPy into the expansion board, (2) connect the expansion board via micro USB to the Mac, (3) open the IDE, open the PyMakr extension, click the bolt icon to connect the board and look at the REPL. If it displays `>>>`, the board is ready for step 5. It might be necessary to update the Pycom firmware ([https://docs.pycom.io/advance/cli/](https://docs.pycom.io/advance/cli/)). For me that was not the case.
5. Do a trial (again following this guide, Step 3: [https://docs.pycom.io/gettingstarted/](https://docs.pycom.io/gettingstarted/)): The `>>>` means you can start typing commands! Type the following commands in the REPL terminal:
```python
>>> import pycom
>>> pycom.heartbeat(False)
>>> pycom.rgbled(0x330033)
```
Each command executes when you press Enter. This will turn the RGB LED on your device purple! Notice that the REPL does not give any feedback; only when you make a mistake or ask it to return something will it give you a response.
6. Instead of running the code from the Mac, it can be uploaded to the development board. This way the expansion board can be disconnected from the micro USB cable and the Mac. (1) Repeat step 5, but upload the code onto the microcontroller (see the example `main.py` below), (2) plug the battery’s JST-PH connector into the expansion board, (3) detach the micro USB cable from the expansion board. (4) The board now runs the code from its own memory, independent of the IDE and Mac. You can push the board’s reset button to start it over.
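To make step 6 concrete, here is a minimal `main.py` sketch (my own example, assuming the stock Pycom firmware, not taken from the guide) that can be uploaded so the board keeps signalling on battery power:

```python
# main.py -- minimal example (assuming the stock Pycom firmware) that runs
# from the board's own memory once uploaded, without the Mac attached.
import time

import pycom

pycom.heartbeat(False)         # stop the default blinking heartbeat
while True:
    pycom.rgbled(0x330033)     # purple
    time.sleep(1)
    pycom.rgbled(0x000000)     # off
    time.sleep(1)
```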

### Install and test the frontend and backend
For the backend we will use AssemblyAI’s ([https://www.assemblyai.com/](https://www.assemblyai.com/)) speech-to-text transcription and OpenAI’s ([https://openai.com/](https://openai.com/)) pre-trained GPT-3 language model to respond to speech input. After completing the following steps, you will have your own web app with a Graphical User Interface built with Streamlit ([https://streamlit.io/](https://streamlit.io/)), in which you can interact with the smart assistant in real-time using your voice. Streamlit integrates with a database called Snowflake ([https://docs.streamlit.io/knowledge-base/tutorials/databases/snowflake](https://docs.streamlit.io/knowledge-base/tutorials/databases/snowflake)), which I am not using here. I used this guide as orientation: [https://towardsdatascience.com/using-ai-to-make-my-own-smart-assistant-app-5ad015449447](https://towardsdatascience.com/using-ai-to-make-my-own-smart-assistant-app-5ad015449447)

⚠️ Dependencies depend on one another. As a foundation for Mac users, Xcode should be updated, the latest version of Python and PyMakr should be installed, and on top of that libraries such as PyAudio need to be installed with the correct requirements, such as `brew install portaudio --HEAD`. Now we will go through this in detail.
1. In this part we will install our web GUI Streamlit and its prerequisites (following this guide: [https://docs.streamlit.io/library/get-started/installation#prerequisites](https://docs.streamlit.io/library/get-started/installation#prerequisites)). A minimal test app is sketched after this list.
2. Now we will install AssemblyAI and its prerequisites for real-time speech recognition (a paid feature) with Python (following this guide: [https://www.assemblyai.com/blog/real-time-speech-recognition-with-python/](https://www.assemblyai.com/blog/real-time-speech-recognition-with-python/)). To transcribe the audio in real-time we’ll use the Python websockets library to connect to AssemblyAI’s streaming WebSocket endpoint. WebSocket connections let us send and receive data to and from the API server bi-directionally, unlike a plain HTTPS request, so the data can be transcribed live (documentation: [https://docs.assemblyai.com/walkthroughs#realtime-streaming-transcription](https://docs.assemblyai.com/walkthroughs#realtime-streaming-transcription)). A condensed sketch of the streaming client follows after this list.
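To verify the Streamlit installation from step 1, the following is a minimal sketch (my own placeholder app, not part of the linked guide):

```python
# hello_app.py -- minimal placeholder app to verify the Streamlit installation.
# Run with: streamlit run hello_app.py
import streamlit as st

st.title("Voice assistant sandbox")

# Placeholder button; later this will trigger the real-time transcription
if st.button("Say something"):
    st.write("Listening... (the transcript will appear here)")
```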
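For step 2, this is a condensed sketch of the real-time streaming client described in the AssemblyAI blog post, under the assumptions noted in the comments (a paid real-time plan, the v2 realtime endpoint as documented at the time of writing, and an API key exported under an environment variable name I chose for illustration):

```python
# Condensed sketch of the real-time transcription client from the AssemblyAI
# blog post linked above. Assumptions: a paid real-time plan, the v2 realtime
# endpoint, and your API key exported as ASSEMBLYAI_API_KEY (illustrative name).
import asyncio
import base64
import json
import os

import pyaudio
import websockets

SAMPLE_RATE = 16000
FRAMES_PER_BUFFER = 3200  # 200 ms of 16 kHz mono audio
URL = f"wss://api.assemblyai.com/v2/realtime/ws?sample_rate={SAMPLE_RATE}"

# Open the Mac's built-in microphone as a 16-bit mono PCM stream
stream = pyaudio.PyAudio().open(
    format=pyaudio.paInt16,
    channels=1,
    rate=SAMPLE_RATE,
    input=True,
    frames_per_buffer=FRAMES_PER_BUFFER,
)


async def send_receive():
    async with websockets.connect(
        URL,
        extra_headers={"Authorization": os.environ["ASSEMBLYAI_API_KEY"]},
        ping_interval=5,
        ping_timeout=20,
    ) as ws:
        await ws.recv()  # wait for the session-begins message

        async def send():
            # Stream base64-encoded audio chunks to the API
            while True:
                data = stream.read(FRAMES_PER_BUFFER, exception_on_overflow=False)
                await ws.send(json.dumps({"audio_data": base64.b64encode(data).decode()}))
                await asyncio.sleep(0.01)

        async def receive():
            # Print finalised transcripts as they come back
            while True:
                message = json.loads(await ws.recv())
                if message.get("message_type") == "FinalTranscript":
                    print(message.get("text", ""))

        await asyncio.gather(send(), receive())


asyncio.run(send_receive())
```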
# Presenting the data
1. Now we will try out OpenAI’s DALL-E backend and run it on a separate server. This allows us to generate artificial images based on text input. We will use a less resource-demanding version called DALL-E Mini. Here are the instructions: [https://github.com/saharmor/dalle-playground](https://github.com/saharmor/dalle-playground)

2. Next, we want to generate images with our voice. With AssemblyAI, OpenAI’s DALL-E and our web GUI Streamlit working independently, we need to connect them to one another. We will input voice through the Mac’s built-in mic, transcribe the speech into a string of text with AssemblyAI, interpret it and output an image with DALL-E, and run everything on Streamlit’s web GUI, as sketched below. This tutorial demonstrates how it can be done: [https://www.youtube.com/watch?v=fRa2rmDvOCY](https://www.youtube.com/watch?v=fRa2rmDvOCY)
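To illustrate the glue between the pieces, here is a hypothetical sketch that sends a prompt to a locally running dalle-playground backend and shows the result in Streamlit. The backend URL, request body and response format are assumptions based on the dalle-playground README; check the repository for the exact API of your version. In the full assistant, the prompt would come from the AssemblyAI transcript instead of a text field.

```python
# Hypothetical glue code: send a prompt to a locally running dalle-playground
# backend and show the result in Streamlit. The URL, request body and response
# format are assumptions; check the dalle-playground README for your version.
import base64
import io

import requests
import streamlit as st
from PIL import Image

DALLE_BACKEND = "http://localhost:8080/dalle"  # assumed local backend address


def generate_images(prompt: str, num_images: int = 2):
    """POST the prompt to the backend and return PIL images (assumed schema)."""
    response = requests.post(
        DALLE_BACKEND,
        json={"text": prompt, "num_images": num_images},
        timeout=300,  # image generation can take minutes on CPU
    )
    response.raise_for_status()
    return [
        Image.open(io.BytesIO(base64.b64decode(img)))
        for img in response.json()  # assumed: a list of base64-encoded images
    ]


st.title("Speak an image into existence")
prompt = st.text_input("Transcript", "an astronaut riding a horse")
if st.button("Generate"):
    for image in generate_images(prompt):
        st.image(image, caption=prompt)
```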


# Transmitting the data / connectivity
### How often is the data sent?
Data needs to be sent continuously in order to transcribe speech in real-time. The higher the sampling rate and bit depth, the higher the audio quality, and the more bandwidth is required.
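As a rough, illustrative calculation (assuming the 16 kHz, 16-bit mono format used in the AssemblyAI real-time tutorial), the raw data rate works out as follows:

```python
# Rough bandwidth estimate for a raw 16 kHz, 16-bit mono PCM stream
sample_rate_hz = 16_000
bytes_per_sample = 2      # 16-bit samples
channels = 1

bytes_per_second = sample_rate_hz * bytes_per_sample * channels
print(f"{bytes_per_second / 1000:.0f} kB/s "
      f"({bytes_per_second * 8 / 1000:.0f} kbit/s)")   # 32 kB/s, 256 kbit/s
```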
### How often is the data saved in the database?
There is no need for a separate database. Each push of the button “Say something” triggers a new request to listen, transcribe and interpret the audio signal. The request ends with the output of the transcribed voice sample into the text field and the display of generated images. Our frontend Streamlit integrates into a database called Snowflake. We do not use it here, but that could be an option if a database is needed.
### Which transport protocols were used?
For real-time audio and text transmission, we use WebSocket. There is a good explanation here: [https://pyshine.com/Socket-Programming-send-receive-live-audio/](https://pyshine.com/Socket-Programming-send-receive-live-audio/)
# Putting everything together
### Choice of microphone
In this part I will give some insights into the issues I faced when researching how to connect a microphone to the Pycom FiPy.
When it comes to components for microcontrollers, there are three types of microphones that can be used with an ESP32. Note that most of them have mainly been used with Arduino in C, not MicroPython:
- analog (electret): an analog mic with an amplifier board, for example [https://www.adafruit.com/product/1713](https://www.adafruit.com/product/1713)
- I2S (by Philips Semiconductors): the Inter-IC Sound bus is a serial link especially for digital audio. This 3-line serial bus consists of serial data (SD), word select (WS), and continuous serial clock (SCK). The device generating SCK and WS is the master; either the transmitter, the receiver or a controller can be the master. This is one example ([https://www.amazon.de/TECNOIOT-Omnidirectional-Microphone-Interface-Precision/dp/B07YXF6ZV2](https://www.amazon.de/TECNOIOT-Omnidirectional-Microphone-Interface-Precision/dp/B07YXF6ZV2)), which was also recommended in this comparison: [https://www.atomic14.com/2020/09/12/esp32-audio-input.html](https://www.atomic14.com/2020/09/12/esp32-audio-input.html).
- digital mic – PDM (Pulse Density Modulation): PDM microphones sample a signal as a stream of single bits. This digital PDM signal is output from the microphone as a 1-bit data stream, where the density of ones and zeros represents the amplitude of the audio signal. An example is [https://www.adafruit.com/product/3492](https://www.adafruit.com/product/3492); this board was tested and not recommended by Mike Teachman ([https://github.com/miketeachman/micropython-esp32-i2s-examples](https://github.com/miketeachman/micropython-esp32-i2s-examples)).
As stated in the BOM, I decided on an analog microphone, because it seemed suitable for my application, where I want to cover a wide frequency range (20 Hz–20 kHz) and make use of automatic gain control. A wider frequency range means better sound quality. The mic consists of a capacitor with an electret film, which reacts to moving air (= sound) by moving closer to the backplate and hence increasing the capacitance. The transistor responds by varying the current between its drain and source pins, which we can detect with the microcontroller (see the electret mic teardown here: [https://www.youtube.com/watch?v=P7YQRhHPasM](https://www.youtube.com/watch?v=P7YQRhHPasM)). The auto-gain chip avoids clipping on loud peaks and enhances quiet sounds; overall the amplifier balances out the levels.
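For reference, reading the amplifier’s analog output on a Pycom board with MicroPython would look roughly like the toy sketch below (assuming the mic’s OUT pin is wired to pin P16; as the next section discusses, this polling approach is far too slow for continuous audio capture):

```python
# Toy sketch: read the MAX9814's analog output on a Pycom board (MicroPython).
# Assumes OUT is wired to pin P16; this polling loop illustrates the principle
# but is far too slow for continuous audio capture (see the next section).
import time
from machine import ADC

adc = ADC()
mic = adc.channel(pin='P16', attn=ADC.ATTN_11DB)   # full 0-3.3 V input range

while True:
    print(mic.voltage(), "mV")   # instantaneous voltage at the mic output
    time.sleep_ms(100)
```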
### Boundaries of MicroPython in relation to microphone application
With MicroPython and the Pycom firmware there are many unsolved issues, and explaining my findings in full would exceed this format. MicroPython, the Pycom firmware and continuous high-quality audio recording do not go together well, particularly for a beginner. Instead, an ESP32 flashed with Arduino firmware and programmed in C is more suitable for audio applications. There it would be possible to use the analog MAX9814 or the digital ICS-43434 ([https://invensense.tdk.com/products/ics-43434/](https://invensense.tdk.com/products/ics-43434/), [https://www.tindie.com/products/onehorse/ics43434-i2s-digital-microphone/](https://www.tindie.com/products/onehorse/ics43434-i2s-digital-microphone/)).
### Switch from MicroPython to Arduino
Facing these issues, I decided to abandon MicroPython for the firmware and programming of the microcontroller. However, Python is still relevant for the web-based ML applications.
### Flashing the FiPy with Arduino
Now we will flash the FiPy with Arduino. If you want to restore MicroPython on your board later, just follow the Pycom firmware upgrade guide.
(1) Install Arduino IDE [https://www.arduino.cc/en/software](https://www.arduino.cc/en/software)
(2) Enter bootloader mode [https://community.hiveeyes.org/t/i2s-mikrophon-mit-fipy-und-arduino-ide-zum-laufen-bringen/3452](https://community.hiveeyes.org/t/i2s-mikrophon-mit-fipy-und-arduino-ide-zum-laufen-bringen/3452)
(3) Flash Arduino following this guide: [https://medium.com/@pauljoegeorge/setup-arduino-ide-to-flash-a-project-to-esp32-34db014a7e65](https://medium.com/@pauljoegeorge/setup-arduino-ide-to-flash-a-project-to-esp32-34db014a7e65) and this one: [https://forum.pycom.io/topic/3134/using-pycom-boards-with-arduino-ide](https://forum.pycom.io/topic/3134/using-pycom-boards-with-arduino-ide)
(4) For wireless control of the board via wifi or BLE you can use [https://blynk.io/](https://blynk.io/)
### Connect to wifi
These guides show how it is done:
[https://www.youtube.com/watch?v=klIBePOzXpo](https://www.youtube.com/watch?v=klIBePOzXpo)
[https://techtutorialsx.com/2017/04/24/esp32-connecting-to-a-wifi-network/](https://techtutorialsx.com/2017/04/24/esp32-connecting-to-a-wifi-network/)
### Set up the ESP32 for capturing audio data
The ESP32 offers built-in Analog-to-Digital Converters (ADC) and I2S, which are difficult to use with MicroPython. For the analog microphone (MAX9814) you can follow these guides to make use of the internal Analog-to-Digital Conversion (ADC) with Direct Memory Access (DMA) and I2S (Inter-IC Sound):
[https://github.com/maspetsberger/esp32-i2s-mems](https://github.com/maspetsberger/esp32-i2s-mems)
[https://github.com/atomic14/esp32_audio](https://github.com/atomic14/esp32_audio)
[https://www.youtube.com/watch?time_continue=70&v=pPh3_ciEmzs&feature=emb_logo](https://www.youtube.com/watch?time_continue=70&v=pPh3_ciEmzs&feature=emb_logo)
### Intended Configuration with analog microphone (MAX9814)

### Alternative Configuration with digital microphone (ICS43434 or INMP441) and breakout board as I2S setup (not tried, for future experiments)

### Connect the FiPy to wifi (with Python)
[https://docs.pycom.io/tutorials/networks/wlan/](https://docs.pycom.io/tutorials/networks/wlan/)
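The snippet below is a minimal sketch based on the Pycom WLAN tutorial linked above, assuming the Pycom MicroPython firmware; the SSID and password are placeholders to replace with your own network credentials.

```python
# boot.py -- minimal sketch following the Pycom WLAN tutorial linked above.
# Assumes the Pycom MicroPython firmware; SSID/PASSWORD are placeholders.
import machine
from network import WLAN

SSID = "your-network-name"      # placeholder
PASSWORD = "your-password"      # placeholder

wlan = WLAN(mode=WLAN.STA)                            # station mode (join a router)
wlan.connect(ssid=SSID, auth=(WLAN.WPA2, PASSWORD))
while not wlan.isconnected():
    machine.idle()                                    # save power while waiting
print("Connected, IP configuration:", wlan.ifconfig())
```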
# The code
### Loop_sampling
This project shows how to use the Arduino `analogRead` function and the Espressif `adc1_get_raw` function. It also demonstrates how to get a calibrated value back from the ADC to give you the actual voltage at the input.
```arduino
#include <Arduino.h>
#include <WiFi.h>
#include "esp_adc_cal.h"
// calibration values for the adc
#define DEFAULT_VREF 1100
esp_adc_cal_characteristics_t *adc_chars;
void setup()
{
  Serial.begin(115200);
  Serial.println("Started up");
  // Range 0-4096
  adc1_config_width(ADC_WIDTH_BIT_12);
  // full voltage range
  adc1_config_channel_atten(ADC1_CHANNEL_7, ADC_ATTEN_DB_11);
  // check to see what calibration is available
  if (esp_adc_cal_check_efuse(ESP_ADC_CAL_VAL_EFUSE_VREF) == ESP_OK)
  {
    Serial.println("Using voltage ref stored in eFuse");
  }
  if (esp_adc_cal_check_efuse(ESP_ADC_CAL_VAL_EFUSE_TP) == ESP_OK)
  {
    Serial.println("Using two point values from eFuse");
  }
  if (esp_adc_cal_check_efuse(ESP_ADC_CAL_VAL_DEFAULT_VREF) == ESP_OK)
  {
    Serial.println("Using default VREF");
  }
  // Characterize ADC
  adc_chars = (esp_adc_cal_characteristics_t *)calloc(1, sizeof(esp_adc_cal_characteristics_t));
  esp_adc_cal_characterize(ADC_UNIT_1, ADC_ATTEN_DB_11, ADC_WIDTH_BIT_12, DEFAULT_VREF, adc_chars);
}

void loop()
{
  // for a more accurate reading you could read multiple samples here
  // read a sample from the adc using GPIO35
  int sample = adc1_get_raw(ADC1_CHANNEL_7);
  // get the calibrated value
  int milliVolts = esp_adc_cal_raw_to_voltage(sample, adc_chars);
  Serial.printf("Sample=%d, mV=%d\n", sample, milliVolts);
  delay(500);
}
```
[https://github.com/atomic14/esp32_audio/blob/master/loop_sampling/src/main.cpp](https://github.com/atomic14/esp32_audio/blob/master/loop_sampling/src/main.cpp)
Given the complexity I faced, I have not made any changes to this example.
# Finalising the design
It turned out to be much more complex than expected to connect a microphone to the FiPy. Therefore I cannot present a fully working prototype yet. The next step would be to bring it all together. I believe I have discovered many of the challenges that working with audio input brings.
When I set out, I expected to gain insights on:
(1) How to sense, transmit and feed voices as audio data all the way from the user’s spoken word to an API which can process speech as an input for ML applications: *I have managed to prototype and explore the frontend (Streamlit as web GUI) and the backend (AssemblyAI and OpenAI).*
(2) How to program and initiate the transmission of audio data: *I have found an approach to make the hardware work, but have not yet had the time to realise it.*
(3) How to establish connectivity from my self-assembled device to an API on the internet: *I have found all the components and identified wifi and WebSockets as the key building blocks for connectivity and transport protocol.*
(4) How small such a device might become in comparison to a 2022 smartphone: *The size reference of the Apple Remote Gen 2 seems suitable, if one can live with a small battery such as a coin cell.*
To conclude: this project has remained incomplete in terms of making everything work, but it gave me many insights into things to consider when working with microcontrollers and audio input.
