A US-based company specializing in conference services contacted AltexSoft to streamline the transcription of its online and offline meetings. Our team built a web app with automated speech recognition (ASR) that streams event audio, supports multiple speakers, and converts their words into text in real time.
The ASR-based system we created allows conference participants to tailor their speeches on the fly and get a written record of the event once it's over, with no need for manual transcribers. All in all, the software addresses several business challenges at once.
Before anything else, we conducted a thorough analysis of available ASR tools that could be integrated into the web app. After checking all significant players against basic criteria, we shortlisted six tools. Our team created a sandbox to run the candidates on several browsers and machines with different configurations, focusing on the parameters essential for our client: transcription accuracy, the number of sessions supported simultaneously, response speed, and more. After evaluating the tools across these dimensions, AltexSoft chose the most relevant option.
Since AI tools evolve dynamically, chances are that another ASR provider will soon be able to deliver better services. So it's critical to have the technical ability to switch quickly between third-party APIs. We built the needed flexibility into the app architecture by creating a separate API integration layer with adapters. As a result, we can swap vendors relatively quickly, with minimal to no impact on the core business logic.
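To illustrate the approach, here is a minimal TypeScript sketch of such an adapter layer. The interface, the vendor class, and the endpoint URL are hypothetical; the point is only to show how a common contract isolates the core logic from any particular ASR API.

```typescript
// Common contract every ASR vendor adapter implements.
// The core app talks only to this interface, never to a vendor directly.
interface AsrAdapter {
  startSession(sampleRateHz: number): Promise<void>;
  sendAudioChunk(chunk: Uint8Array): void;
  onTranscript(handler: (text: string, isFinal: boolean) => void): void;
  endSession(): Promise<void>;
}

// A vendor-specific adapter translates the contract into the provider's
// own wire format; swapping vendors means swapping this class only.
class VendorXAdapter implements AsrAdapter {
  private socket?: WebSocket;
  private transcriptHandler?: (text: string, isFinal: boolean) => void;

  async startSession(sampleRateHz: number): Promise<void> {
    this.socket = new WebSocket("wss://asr.vendor-x.example/stream"); // placeholder URL
    await new Promise<void>((resolve, reject) => {
      this.socket!.onopen = () => resolve();
      this.socket!.onerror = () => reject(new Error("connection failed"));
    });
    this.socket.send(JSON.stringify({ type: "start", sampleRateHz }));
    this.socket.onmessage = (event) => {
      const msg = JSON.parse(event.data as string);
      this.transcriptHandler?.(msg.text, msg.final === true);
    };
  }

  sendAudioChunk(chunk: Uint8Array): void {
    this.socket?.send(chunk);
  }

  onTranscript(handler: (text: string, isFinal: boolean) => void): void {
    this.transcriptHandler = handler;
  }

  async endSession(): Promise<void> {
    this.socket?.close();
  }
}
```

Because the business logic depends only on `AsrAdapter`, replacing Vendor X with another provider comes down to writing one new adapter class and changing a single construction site.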
The ASR API integration is based on the WebSockets protocol, which enables two-way real-time communication between a web server and a browser. Yet the technology is sensitive to network disruptions, which may occur for many reasons and at any stage of data transfer. We implemented buffering so that a speaker's local device accumulates chunks of audio data when the Internet connection drops, allowing the online audio transcription to resume once the connection is restored.
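Below is a simplified sketch of how such client-side buffering could look in TypeScript. The class name, the fixed reconnect delay, and the replay strategy are illustrative assumptions rather than the production code.

```typescript
// Audio chunks captured while the socket is down are queued locally
// and flushed in capture order after reconnection.
class BufferedAudioSender {
  private queue: Uint8Array[] = [];
  private socket?: WebSocket;

  constructor(private url: string) {
    this.connect();
  }

  private connect(): void {
    this.socket = new WebSocket(this.url);
    this.socket.onopen = () => this.flush();
    // On any drop, retry after a short delay; chunks keep accumulating meanwhile.
    this.socket.onclose = () => setTimeout(() => this.connect(), 1000);
  }

  send(chunk: Uint8Array): void {
    if (this.socket && this.socket.readyState === WebSocket.OPEN) {
      this.socket.send(chunk);
    } else {
      this.queue.push(chunk); // network is down: accumulate locally
    }
  }

  private flush(): void {
    // Replay everything buffered during the outage before resuming live sends.
    while (this.queue.length > 0) {
      this.socket!.send(this.queue.shift()!);
    }
  }
}
```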
One of the technical challenges was unifying clocks across all users during a live session. We set a server-side timer as the single source of truth and synchronized local clocks with it. As a result, all participants, no matter their location, see the same time on their screens, accurate to the millisecond.
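A common way to achieve this kind of synchronization is an NTP-style handshake over the existing WebSocket connection. The sketch below assumes hypothetical `time-request`/`time` message types and a roughly symmetric network delay; it is one possible technique, not necessarily the exact one used in the project.

```typescript
// Estimate the offset between the local clock and the server clock:
// timestamp a ping, read the server's clock from the reply, and treat
// half the round trip as the one-way delay.
function estimateClockOffset(socket: WebSocket): Promise<number> {
  return new Promise((resolve) => {
    const t0 = Date.now(); // client time at send
    socket.addEventListener("message", function handler(event) {
      const msg = JSON.parse(event.data as string);
      if (msg.type !== "time") return;
      socket.removeEventListener("message", handler);
      const t1 = Date.now();          // client time at receive
      const oneWay = (t1 - t0) / 2;   // assume symmetric latency
      // offset = server time adjusted for travel time, minus local time now
      resolve(msg.serverTimeMs + oneWay - t1);
    });
    socket.send(JSON.stringify({ type: "time-request" }));
  });
}
```

The UI can then render `Date.now() + offset` instead of the raw local clock, so every participant sees the same server-anchored time.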
Our team designed the app's interface with simplicity in mind. The transcription process is automated, with human intervention reduced to two main operations: starting and stopping the program. It takes conference organizers and participants only a short while to begin using the software in real settings.
Security and compliance with data protection regulations are our client's top priorities. To mitigate possible risks, we deployed the system in an Amazon virtual private cloud (VPC), a logically isolated environment that allows for strict access control and management. Security measures include a unique 16-character ID and passcode for each conference session, a strong authentication process, and encryption of Internet traffic. Only a limited number of people with the proper permission level can access and manually download conference data.
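As a rough illustration, per-session credentials of this kind can be derived from a cryptographically secure random source. The helper name, character set, and passcode length below are assumptions for the sketch, not the client's actual scheme.

```typescript
import { randomBytes } from "node:crypto";

// Generate an unpredictable identifier from a CSPRNG. The 32-character
// alphabet omits ambiguous glyphs (I, O, 0, 1) and divides 256 evenly,
// so mapping bytes with modulo introduces no bias.
function generateSessionId(length = 16): string {
  const alphabet = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789";
  const bytes = randomBytes(length);
  let id = "";
  for (let i = 0; i < length; i++) {
    id += alphabet[bytes[i] % alphabet.length];
  }
  return id;
}

const sessionId = generateSessionId();  // 16-character session ID
const passcode = generateSessionId(8);  // shorter passcode, same generator
```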
The project is still in progress, with two backend developers, a frontend developer, a project manager, a QA engineer, and a UI/UX designer involved.
The tech stack includes AWS, DynamoDB, Node.js, Vue.js, and WebSockets.