
Smart robotics: integrating computer vision and speech

Tuesday, June 11, 2024
Stefanos Peros
Software engineer

Incorporating AI services such as computer vision and speech into robotics is steadily transforming how robots perceive and interact with the world, and how we interact with them. For instance, in healthcare, robots can now visually detect changes in patients' conditions and verbally report these observations to medical staff, improving response times and patient care.

On a smaller scale, we have recently embarked on an exciting venture to develop Budd-E (Figure 1), a robot controlled remotely and powered by AI, which we invite you to see in action here. In this blog post, we'll explore how we leveraged AI services to enhance Budd-E with the skills to describe its visual observations and convert user text inputs into spoken language.

Figure 1: Meet Budd-E, our AI service robot.

Overview

At its heart, Budd-E is built around a Raspberry Pi 3B+, a small, affordable computer that connects with an array of hardware components, including four wheels, two servo motors, a buzzer, a camera, ultrasonic sensors, a speaker, and LEDs. This project was kickstarted using a smart car kit, which served as the initial building block from which we further developed Budd-E's distinctive features.

The system's software is split between a client, which can run on any mobile device or computer, and a TCP server hosted on the Pi, both developed in Python. The client features a straightforward interface that allows users to control the robot manually via a set of buttons, resembling the controller of a traditional remote-controlled car. Internally, pressing each button triggers a specific encoded command that is transmitted to the TCP server on the Pi for execution.
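To make this concrete, here is a minimal sketch of what the client-side command channel could look like in Python. The host, port, and command string are illustrative placeholders for this sketch, not Budd-E's actual protocol.

```python
# Minimal sketch of the client-to-robot command channel.
# The IP address, port, and command names are illustrative, not Budd-E's.
import socket

ROBOT_HOST = "192.168.1.42"  # example IP address of the Raspberry Pi
ROBOT_PORT = 5000            # example port the TCP server listens on

def send_command(command: str) -> str:
    """Open a TCP connection, send an encoded command, and return the reply."""
    with socket.create_connection((ROBOT_HOST, ROBOT_PORT)) as sock:
        sock.sendall(command.encode("utf-8"))
        return sock.recv(1024).decode("utf-8")

# Example: pressing the "forward" button in the UI could map to a command like:
send_command("MOVE_FORWARD")
```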

Describe a scene

Figure 2: Upon user request, the client instructs Budd-E to convert what it sees to audio and text.

Budd-E is capable of describing what it sees through its camera, as shown by the flows in Figure 2. When the user requests this through the UI, the client application instructs the robot to take a picture and stream it back.
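On the Pi side, capturing the photo and streaming it back could look roughly like the sketch below. It assumes the picamera library (the standard camera API on older Raspberry Pi OS releases) and simple length-prefixed framing; the handler name and temporary path are illustrative, not Budd-E's exact implementation.

```python
# Sketch of a server-side handler on the Pi: capture a JPEG and stream it
# back to the client, prefixed with its size. Framing and paths are assumptions.
import struct
from picamera import PiCamera

def capture_and_send(conn) -> None:
    camera = PiCamera()
    try:
        camera.capture("/tmp/snapshot.jpg")  # save a JPEG locally
    finally:
        camera.close()

    with open("/tmp/snapshot.jpg", "rb") as f:
        data = f.read()

    # Send the image size first so the client knows how many bytes to expect.
    conn.sendall(struct.pack("!I", len(data)))
    conn.sendall(data)
```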

Next, the client prompts GPT-4 Vision to convert the picture (.jpg) into text that describes its contents. More concretely, GPT-4 Vision integrates capabilities from vision and language models: because it is trained on both text and images, it converts pixels into embeddings, enabling the model to understand and generate content that reflects the context and relationships between visual and textual information.
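As a rough illustration, the client-side call to GPT-4 Vision could look like the following, using the OpenAI Python SDK. The model name, prompt, and file path are assumptions for the sketch rather than Budd-E's exact configuration.

```python
# Sketch of asking GPT-4 Vision to describe the snapshot.
# Model name, prompt, and path are illustrative choices.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def describe_image(path: str) -> str:
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what you see in this picture."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        max_tokens=300,
    )
    return response.choices[0].message.content

print(describe_image("snapshot.jpg"))
```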

The client also sends a request to Amazon Polly, an AWS text-to-speech AI service, to convert this text to an audio file (.mp3). Specifically, Amazon Polly uses advanced deep learning technologies to synthesize speech that sounds like a human voice. It processes the input text, analyzes the phonetics and syntax, and applies text normalization to convert written numbers or abbreviations into their spoken equivalents. Finally, Polly generates the audio output by converting the processed text into sound, producing a natural-sounding voice that can read the text aloud. The client plays back the description through the speakers while showing the text in the UI.
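A minimal sketch of that Polly request with boto3 might look like this; the voice, AWS region, and output path are illustrative choices, not Budd-E's settings.

```python
# Sketch of a text-to-speech call with Amazon Polly via boto3.
# Voice, region, and output path are illustrative.
import boto3

polly = boto3.client("polly", region_name="eu-west-1")

def synthesize(text: str, out_path: str = "description.mp3") -> str:
    response = polly.synthesize_speech(
        Text=text,
        OutputFormat="mp3",
        VoiceId="Joanna",
    )
    with open(out_path, "wb") as f:
        f.write(response["AudioStream"].read())
    return out_path
```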

Text input to spoken language

Figure 3: Upon user request, the client instructs Budd-E to convert input text to audio and play it back.

Budd-E is also capable of converting any human text into spoken language (Figure 3), which is particularly useful for communicating remotely with people near the robot. For this we again leveraged Amazon Polly, but this time the audio needs to be played through Budd-E's own speaker rather than the client's. As such, we extended Budd-E to support a new command. Upon receiving this command, which also contains the input text to be converted, Budd-E sends a request to Amazon Polly to convert it to an audio (.mp3) file, which it stores locally and plays back through its USB speaker.
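On Budd-E's side, the handler for such a command could be sketched as follows. The command semantics, Polly settings, and the use of the mpg123 player for playback are assumptions for this sketch, not the actual implementation.

```python
# Sketch of a hypothetical "speak" command handler on the Pi: synthesize the
# text with Polly, store the MP3 locally, and play it on the USB speaker.
import subprocess
import boto3

polly = boto3.client("polly", region_name="eu-west-1")

def handle_speak_command(text: str) -> None:
    response = polly.synthesize_speech(Text=text, OutputFormat="mp3", VoiceId="Joanna")
    with open("/tmp/speech.mp3", "wb") as f:
        f.write(response["AudioStream"].read())
    # Play through the USB speaker; assumes mpg123 is installed on the Pi.
    subprocess.run(["mpg123", "/tmp/speech.mp3"], check=True)
```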

Closing statement

In summary, the fusion of AI technology with robotics marks an intriguing development in both fields. Budd-E utilizes AI services to interpret visual scenes and convert written text into spoken words, illustrating how robots can interact with their surroundings and communicate in a human-like manner. The complexity required to significantly enhance the robot's capabilities is strikingly low, thanks to the integration with AI and cloud services. Keep an eye on our journey as we continue to explore and integrate new AI services to further expand Budd-E's abilities and open up even more possibilities.
