The core of all AI is mathematics, and to make the math work, we need to convert all of life's signals into measurable quantities. One of the ways we do that for videos is to extract the audio using our video parsing engine and then run it through a speech-to-text module.
Since we specialize in parsing videos that feature a human voice, this step is an integral ingredient of our pipeline. Speech is loaded with information that needs to be understood.
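To make that first step concrete, here's a minimal sketch: pull the audio track out of a video with ffmpeg, then hand it to a speech-to-text engine. This is not our production pipeline; the file names, the ffmpeg flags, and the `speech_recognition` library (using Google's free recognizer) are illustrative stand-ins.

```python
import subprocess
import speech_recognition as sr

# Step 1: extract a mono, 16 kHz WAV track from the video with ffmpeg.
# (File names and flags are illustrative, not production settings.)
subprocess.run(
    ["ffmpeg", "-i", "webinar.mp4", "-vn", "-ac", "1", "-ar", "16000", "audio.wav"],
    check=True,
)

# Step 2: run the extracted audio through a speech-to-text module.
recognizer = sr.Recognizer()
with sr.AudioFile("audio.wav") as source:
    audio = recognizer.record(source)

# Any speech-to-text engine could sit here; Google's recognizer is a stand-in.
text = recognizer.recognize_google(audio)
print(text)
```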
Converting speech into text is the first step of the journey under the hood. Text, by itself, isn't a mathematical representation, but it becomes the starting point for the next step in our engine, wherein we convert text into quantities that our algorithms can manipulate. This is one of our most significant research areas, and we've reached some remarkable milestones in what our engine can now do.
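Our actual representation is part of that research and isn't something we detail here, but as a generic illustration of turning text into quantities, here's what the conversion can look like with TF-IDF vectors from scikit-learn:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Transcript sentences produced by the speech-to-text step (made up here).
sentences = [
    "welcome to the webinar on video marketing",
    "today we will cover three key strategies",
    "let's start with audience engagement",
]

# TF-IDF turns each sentence into a numeric vector whose entries weight
# words by how distinctive they are across the transcript, giving
# downstream algorithms quantities they can manipulate.
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(sentences)

print(matrix.shape)  # (3, vocabulary_size): one vector per sentence
```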
One of the well-known challenges in speech-to-text is that machine-generated text is still not 100% accurate. After testing almost every major speech-to-text engine (proprietary and open-source), we've concluded that the best engines reach about 85% accuracy.
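A common way to quantify this is word error rate (WER): the number of word-level insertions, deletions, and substitutions needed to turn the machine transcript into a human reference, divided by the reference length. Accuracy is then roughly 1 minus WER. Here's a small self-contained sketch (the example sentences are invented, and this isn't necessarily how any particular vendor reports its numbers):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words, normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words
    # into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution
            )
    return dp[len(ref)][len(hyp)] / len(ref)

ref = "speech is loaded with information"
hyp = "speech is loud with information"  # one mis-heard word
print(f"accuracy ≈ {1 - word_error_rate(ref, hyp):.0%}")  # ≈ 80%
```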
The good news for our clients is that we built Parmonic with this tolerance baked in. When we were starting out, we knew there was no perfect speech-to-text engine, so our models needed to be tolerant of transcription errors.
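"Tolerant" can take many forms. One simple illustration (not our actual technique) is fuzzy matching: instead of requiring an exact keyword hit in a transcript, accept near matches, so a single mis-heard word doesn't derail downstream logic. The `fuzzy_contains` helper below is hypothetical:

```python
from difflib import SequenceMatcher

def fuzzy_contains(transcript: str, phrase: str, threshold: float = 0.8) -> bool:
    """Scan the transcript with a sliding window and accept near matches,
    so a mis-transcribed word doesn't break keyword-based logic."""
    words = transcript.lower().split()
    n = len(phrase.split())
    for i in range(len(words) - n + 1):
        window = " ".join(words[i:i + n])
        if SequenceMatcher(None, window, phrase.lower()).ratio() >= threshold:
            return True
    return False

# "pricing plans" survives being mis-transcribed as "pricing planes".
noisy = "now let us talk about our pricing planes for enterprise"
print(fuzzy_contains(noisy, "pricing plans"))  # True
```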
If you've ever used a voice assistant (Alexa, Siri, OK Google, Cortana, or others), you've likely experienced the gaps in speech-to-text, because those assistants rely on the same kind of engine to understand what you said. Interestingly, it's not always about accents: many other factors affect whether a speech-to-text engine can convert speech into text with 100% reliability, including background noise, ambient noise, speaker position, or even weather conditions.
Understanding speech gives our AI engine insight into some aspects of the video. There's a lot more that happens under the hood, but we'll discuss that later.