How we used OpenAI technology to build a real-time translation service

Jess O'Dwyer, general manager for Europe at Pocketalk, on how the hit app was developed

Let’s talk about language translation. If you’re reading this, you can understand English. But what if you couldn’t? You’d probably close this tab. Now imagine that English is an additional language for the audience you’re talking to.

Now put that into the context of your business. What if a talent pool of expertise doesn't speak the same language as you? What if your product or service could meet the needs of a country that doesn't speak the language that you speak?

Ignoring that language gap means you rule out a perfectly skilled and very able workforce, and you could lose out on potential market growth. That doesn't make any business sense, does it?

For the purpose of this article we’re going to focus on the UK, a country that is striving to meet the needs of a growing and increasingly diverse population. Here, language translation is a catalyst for productivity and inclusion across business sectors.

From the boardroom to the corridors of healthcare services and classrooms in schools, the impact of language translation is reverberating across private and public sectors, as we all endeavour to foster a more inclusive and interconnected UK. 

Looking ahead, the power of language translation in various sectors will become even greater than it is today, particularly in the context of workforce productivity, diversity and inclusion, healthcare, education, and childcare. And it’s this growing trend that led us to create Pocketalk for Business, language translation designed specifically for enterprises, businesses and organisations.

Rather than fear technological advancement and the growth of AI, we embraced it and integrated AI-powered language translation for virtual and in-person meetings, offering accurate, fast, seamless and secure translated conversations for any scenario.

The result is SENTIO, a GDPR-compliant, browser-based, live translation service that enables users to engage in conversations in real time, in their own language, during meetings or through any virtual meeting platform, including Zoom, Google Meet and Microsoft Teams.

As the global workplace expands, so does the need for real-time translation of conversations to deliver smooth business communications. What was important for us here was accessibility, speed and accuracy. Language translation needs to be available from anywhere, on any device, through an app and a web browser. It also needs to be reliable.

So how did we use AI to build a real-time, accurate translation service? Through hard work, precision and a lot of dedication. Oh, and plenty of GPUs - graphics processing units.

A GPU is a computer chip that renders graphics and images by performing rapid mathematical calculations. GPUs have recently been optimised for AI deep learning, which has increased demand at the same time as the GPU manufacturing supply chain has been disrupted. Despite the manufacturing challenges and the AI-driven increase in demand, we secured enough GPUs, based on our user growth estimates, and tuned our engine server and AI model to make them as fast as possible.

We integrated multiple AI models into our Voice Translation Pipeline. Whisper, OpenAI’s machine learning model for speech recognition, was one of them. We needed more than one model to take care of any misrecognition, known as ‘hallucination’ - it’s like having a filter to clean up the data.
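To make the "filter" idea concrete, here is a minimal Python sketch of one way a second recogniser could be used to cross-check the first. It is an illustration only, not Pocketalk's implementation: the function names, the word-level comparison and the 0.6 threshold are all assumptions made for this example.

```python
from difflib import SequenceMatcher
from typing import Optional


def agreement(a: str, b: str) -> float:
    """Word-level similarity between two transcripts, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def filter_transcript(primary: str, secondary: str,
                      threshold: float = 0.6) -> Optional[str]:
    """Keep the primary transcript only if a second recogniser broadly agrees.

    Hypothetical helper: `threshold` is an illustrative value, not a
    figure taken from the article.
    """
    if agreement(primary, secondary) >= threshold:
        return primary
    return None  # the two models disagree, so treat this as a possible hallucination
```

In this sketch a transcript that the second model does not corroborate is simply dropped; a production system would more likely retry or fall back to the other model's output.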

Next, we wanted to ensure we could deliver a seamless and natural interpretation experience, so we optimised real-time speech recognition to achieve millisecond response times. By default, Whisper processes audio in 30-second segments, which means it buffers 30 seconds of audio before transcribing it to text. That doesn't fit the concept of a real-time translation service, so we needed to modify it to digest audio (voice) in short segments.
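The idea of feeding a recogniser short segments instead of a 30-second buffer can be sketched in a few lines of Python. This is a simplified illustration, not the SENTIO code: the 2-second segment length and the 0.5-second overlap are assumed values chosen for the example, and the 16 kHz sample rate is the rate Whisper models expect.

```python
from typing import Iterator, Sequence

SAMPLE_RATE = 16_000  # Whisper models expect 16 kHz mono audio


def short_segments(samples: Sequence[float],
                   segment_s: float = 2.0,
                   overlap_s: float = 0.5,
                   sample_rate: int = SAMPLE_RATE) -> Iterator[Sequence[float]]:
    """Yield short, overlapping windows of audio.

    `segment_s` and `overlap_s` are illustrative assumptions. The overlap
    repeats a little audio at each boundary so that a word cut in half by
    one window appears whole in the next.
    """
    size = int(segment_s * sample_rate)
    step = size - int(overlap_s * sample_rate)
    if size <= 0 or step <= 0:
        raise ValueError("segment must be longer than the overlap")
    for start in range(0, len(samples), step):
        yield samples[start:start + size]
        if start + size >= len(samples):
            break  # the last window already reached the end of the audio
```

Each yielded window would then be passed to the speech recogniser as soon as it is available, rather than waiting for a full 30 seconds to accumulate.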

The challenge here is that short audio segments don't offer enough context about the conversation, which makes speech recognition less accurate. We fine-tuned the process to keep accuracy as high as we could. This involved quantising Whisper and other voice translation platforms, and incorporating diverse research findings into the speech recognition processing to guarantee precision not only in recognition but also in the translation process.
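Quantising a model means storing its numbers at lower precision so that it runs faster and uses less memory, at the cost of a small rounding error. The article does not say which scheme was used, so the Python sketch below shows one common textbook approach, symmetric 8-bit quantisation, purely as an illustration; the function names are made up for this example.

```python
from typing import List, Sequence, Tuple


def quantise_int8(weights: Sequence[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantised = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantised, scale


def dequantise(quantised: Sequence[int], scale: float) -> List[float]:
    """Recover approximate float values from the 8-bit integers."""
    return [q * scale for q in quantised]
```

Each recovered value differs from the original by at most half of `scale`, which is the trade-off between speed and accuracy that the fine-tuning described above has to manage.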

Of course, it takes time to develop this type of software. We repeatedly fine-tuned the system to improve the real-time performance of speech recognition and to return results in milliseconds. Various research findings were applied to the speech recognition output to ensure the highest accuracy, not only in recognition but also in the translation step used for simultaneous interpretation. Numerous innovations in acoustic signal processing were incorporated to suppress hallucinations.
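The article does not describe the signal processing techniques involved. One simple and widely used idea, shown here in Python only as an assumed example, is an energy gate: measure how loud a segment is and skip recognition when it is effectively silent, since speech recognisers are often reported to invent text when given silence or background noise. The 0.01 threshold is an illustrative value for audio scaled between -1.0 and 1.0.

```python
import math
from typing import Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of a segment (0.0 for an empty segment)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_speech(samples: Sequence[float], threshold: float = 0.01) -> bool:
    """Hypothetical gate: pass a segment to the recogniser only if it is loud enough."""
    return rms(samples) >= threshold
```

A real system would use a more robust voice activity detector, but the principle is the same: audio that contains no speech never reaches the model, so it cannot be mis-transcribed.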

The setup experience for users was also subject to repeated trial-and-error testing. It has been quite the project, but one that has the capability to fuel inclusivity and diversity in business.

Written by
Jess O'Dwyer
April 9, 2024