The Future of AI is Sparse
According to the World Health Organization (WHO), hearing loss affects an estimated 466 million people globally. However, only 17% of those afflicted adopt hearing aid treatment. Imagine if only one person out of every six people needing corrective lenses had them. Given the statistics, it is no surprise that WHO estimates untreated hearing loss costs a staggering $750B/year in lost productivity and downstream medical care. Social stigma, high cost, and poor performance in social settings are 3 major factors affecting hearing aid adoption rates. Limited adoption rates further limit the economies of scale that ultimately make treatment accessible to all.
Emerging AI processing technology developed by Femtosense will bring hearing aids to the level of performance, compact form factor, and ultimately, cost required for mass adoption of hearing aids. Femtosense’s efficient Sparse Processing Unit and proprietary speech algorithms deliver 100x more power efficient and 10x more memory efficient AI processing than competitor solutions, making Femtosense the sensible choice for next generation hearing aids.
The Silent Majority
One of the major issues with treating hearing loss is the stigma associated with wearing hearing aids; potential patients do not want to admit they have trouble hearing and they do not want to appear as if they are aging. In-the-canal (ITC) and completely-in-canal (CIC) hearing aids help mitigate this issue by offering a discreet, earbud-like, or near-invisible form factor compared to the behind-the-ear (BTE) models most people are familiar with. However, ITC and CIC hearing aids’ small form factor can come at the cost of reduced amplification or sound quality, reduced battery life, and reduced feature sets when compared to larger BTE models, making them suitable only for mild or moderate hearing loss. Attaining high performance, a small form factor, and a rich feature set is an elusive goal for manufacturers.
The high cost of premium hearing aids is another significant barrier to access. It is especially concerning for underserved populations with mild to moderate hearing loss who may decide they do not “need” hearing aids because the benefits do not offset the costs. For these reasons, it is quite common for those with mild/moderate hearing loss to settle for cheaper personal sound amplification products (PSAPs) or forgo hearing aids altogether. PSAPs generally do not come with the feature sets, optimizations, and overall quality that premium hearing aids do, but their lower price makes them attractive for those with mild/moderate hearing loss and a tight budget. Maintaining strong performance while keeping costs down is another difficult task for manufacturers to tackle.
Setting cost and stigma aside, those with treated and untreated hearing loss alike most often report speech in noise as the number one issue. Studies have shown that as the brain ages, one’s ability to isolate signals of interest in noise decreases. Furthermore, it’s estimated that at least 15-20% of people with normal hearing have speech-in-noise difficulty, as do many people with neurocognitive disorders, advanced age, and traumatic brain injury. These people may benefit from hearing assistive technology, even if the signal gain is set very low. Unfortunately, poor speech-in-noise performance is still the number one issue reported by hearing aid users. Being able to understand one’s friends and family in any setting is key to enjoying social situations and staying active outside of the home. If today’s premium digital hearing aids reduce noise with time-honored classical signal processing methods, why aren’t consumers satisfied with the experience?
The Status Quo
Today’s digital hearing aids mitigate the speech-in-noise problem by running classical noise filtering and speech enhancement algorithms, isolating and enhancing frequency bands correlated with human speech. An extension of this technique involves having the user manually choose between a few different presets that correspond to environmental noise classes like windy outdoors, restaurant, home, etc. These presets alter the gain and filter shapes for frequencies found in the types of noises they wish to filter out. Some hearing aids detect your environment automatically based on the frequencies picked up by the microphones and some require pairing to a smartphone app with GPS to track your location and “remember” which settings you prefer in certain places. In both cases, presets are switched for the user by the hearing aid or phone to make the user experience simpler.
Another approach to attacking the speech-in-noise problem is to employ narrow beamforming for on-board microphones, focusing the pick-up on the area directly in front of the listener, which in many situations is where the speaker of interest is. This can work well in controlled 1-on-1 situations, but often fails to provide adequate performance when there are multiple speakers or when speakers are not located directly in front of the user.
While classical and beamforming techniques have attempted to mitigate the speech-in-noise problem, many hearing aid users report that their hearing aids do not filter enough noise, degrade speech naturalness, or don’t perform well with varying and non-stationary noises and speakers. Thus, Femtosense has decided to take a novel, cutting-edge approach to canceling background noise.
Deep learning algorithms are fundamentally different than classical techniques in that the goal is to train an algorithm to learn what is speech and what is not, rather than rely on heuristics or hand-tuned filters that try to characterize speech frequencies. Because they are trained on a colossal amount and variety of noises and environments, deep learning approaches for speech enhancement can significantly outperform existing classical techniques for speech intelligibility in challenging places like restaurants, airports, and busy municipal areas. Deep learning algorithms are also much better at reducing the intensity of transient noises like glass breaking, silverware on plates, and loud noises that are often reported as uncomfortable by hearing aid wearers when amplified.
Deep learning techniques begin with a large dataset composed of samples of clean speech and the same samples with a wide variety of noise added. The filtering technique works by taking a digital audio signal and performing an STFT (Short-Time Fourier Transform) to transform it from a “sound intensity vs. time” representation to an “intensity by frequency over time” representation (also known as a spectrogram). The scale of the spectrogram is then transformed to approximate that of human hearing (producing a log-mel spectrogram). The neural network predicts a fine-grained, time-varying filter for each audio frame: a set of gains that passes the frequency content estimated to be speech and attenuates the content estimated to be noise. The filter is then multiplied by the noisy input spectrogram and the result is a new spectrogram that is an estimate of the original clean speech sample. The spectrogram is then inverted back to the time domain, and the resulting audio is played through the speaker of the hearing aid.
During the training phase, the neural network algorithm compares the ground truth clean speech with the speech that has been cleaned by the network and adjusts the weights of the neural network iteratively to minimize the “difference” between the two. What that “difference” is can be defined by metrics like signal-to-noise ratio (SNR) and other speech intelligibility/quality measures like PESQ, STOI, and MOS. This process happens not on the hearing aid itself but on large GPUs specialized for neural network training. Once the training process is complete, the network can then be deployed to another processor to continuously remove noise frame by frame for a given audio stream.
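To make the idea of iteratively minimizing a “difference” concrete, here is a toy sketch in Python. It is an assumption-laden illustration, not the training procedure used by Femtosense: the “network” is a single trainable gain, the difference is mean squared error, and SNR is used only to report the result.

```python
import numpy as np

def snr_db(clean, estimate, eps=1e-10):
    """Signal-to-noise ratio of `estimate` measured against ground-truth `clean`, in dB."""
    residual = estimate - clean
    return 10.0 * np.log10((np.sum(clean ** 2) + eps) / (np.sum(residual ** 2) + eps))

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 1000))
noisy = clean + 0.5 * rng.standard_normal(1000)

gain = 1.0            # the single trainable "weight" of this toy model
learning_rate = 0.1
for _ in range(200):
    estimate = gain * noisy
    # Gradient of the mean squared error with respect to the gain.
    grad = 2.0 * np.mean((estimate - clean) * noisy)
    gain -= learning_rate * grad   # adjust the weight to shrink the difference

print(f"SNR before: {snr_db(clean, noisy):.2f} dB")
print(f"SNR after:  {snr_db(clean, gain * noisy):.2f} dB (gain = {gain:.3f})")
```

A real speech enhancement network has many thousands of weights and a richer loss, but the loop has the same shape: run the model, measure the difference from the clean target, and nudge the weights in the direction that reduces it.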
Research and Reality
While these techniques have achieved good performance in experiments and research settings, there are several reasons why they haven’t been deployed effectively in hearing aids. Firstly, the computation required to run deep learning algorithms is quite intense, requiring many dense matrix multiplies and adds. Only recently have processors like CPUs and embedded GPUs advanced to the point of being able to run these algorithms in a reasonable time frame. Assuaging the computation issue often entails reducing neural network size and thus number of calculations per frame of audio. Naively shrinking deep learning algorithms by removing random connections and decreasing the number and size of layers usually results in significant performance drops like diminished noise-removal ability or degraded speech naturalness.
Reducing the precision of neural network weights and activations from 32-bit floats to lower bit-width integers also decreases neural network size and computational burden. Much like the situation described above, naively converting the weights and activations of a trained model results in performance drops. However, quantization error can be simulated throughout the training process and calibrated to relevant data, making the model robust to the reduction in precision. This is called Quantization Aware Training (QAT).
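A common way to simulate quantization error during training is “fake quantization”: weights are rounded to an integer grid in the forward pass while full-precision copies receive the gradient updates (a straight-through estimator). The sketch below shows this on a toy linear model. The symmetric per-tensor scheme, the 4-bit width, and all names are illustrative assumptions rather than a description of any particular toolchain.

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Round x to a symmetric signed integer grid, then map back to float."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    if scale == 0.0:
        return x.copy()
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 8))
true_w = rng.standard_normal(8)
y = X @ true_w

w = np.zeros(8)                                   # full-precision "shadow" weights
for _ in range(300):
    w_q = fake_quantize(w, num_bits=4)            # forward pass sees quantized weights
    grad = 2.0 * X.T @ (X @ w_q - y) / len(y)     # gradient w.r.t. the quantized weights
    w -= 0.05 * grad                              # straight-through: apply it to w

initial_loss = np.mean(y ** 2)
final_loss = np.mean((X @ fake_quantize(w, num_bits=4) - y) ** 2)
```

Because the model only ever sees quantized weights during training, the weights it settles on are ones that still work after rounding, which is the robustness that quantization-aware training is after.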
Compression techniques like sparsity (pruning connections in the neural network) show even greater promise in delivering on the complexity reduction challenge when introduced during the training process. Starting with a much larger, more powerful deep learning model and removing irrelevant connections gradually throughout the training process consistently yields better performance than a densely trained model of the same resulting size (this is true for many deep learning problems, not just speech enhancement). This is called sparsity-aware training (SAT).
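One simple way to remove connections gradually is magnitude pruning on a schedule: at each step the smallest-magnitude weights are zeroed, and the pruned fraction ramps up over training. The sketch below applies this to a toy linear model. The linear ramp, the 75% target, and the function names are assumptions chosen for illustration; they are not a description of Femtosense's training method.

```python
import numpy as np

def magnitude_mask(weights, sparsity):
    """Boolean mask that drops the smallest-magnitude `sparsity` fraction of weights."""
    n_prune = int(round(sparsity * weights.size))
    order = np.argsort(np.abs(weights), axis=None)   # smallest magnitudes first
    mask = np.ones(weights.size, dtype=bool)
    mask[order[:n_prune]] = False
    return mask.reshape(weights.shape)

rng = np.random.default_rng(0)
X = rng.standard_normal((512, 64))
w_true = np.where(rng.random(64) < 0.25, rng.standard_normal(64), 0.0)
y = X @ w_true

w = 0.01 * rng.standard_normal(64)
steps, final_sparsity = 400, 0.75
for step in range(steps):
    # Linear ramp to the target sparsity over the first 75% of training, then hold.
    sparsity = final_sparsity * min(1.0, step / (0.75 * steps))
    mask = magnitude_mask(w, sparsity)
    w *= mask                                  # prune: smallest weights become zero
    grad = 2.0 * X.T @ (X @ w - y) / len(y)
    w -= 0.05 * grad * mask                    # only surviving weights keep learning

print(f"nonzero weights: {np.count_nonzero(w)} of {w.size}")
```

Pruning gradually gives the remaining weights time to compensate for each removal, which is the intuition behind introducing sparsity during training instead of after it.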
Deciding how to introduce sparsity or how to structure it is an active area of research, though progress has been slow. This is due in part to the fact that legacy hardware platforms like CPUs and GPUs see little benefit from sparsity because their architectures and instruction sets were not designed for running sparse workloads. If hardware platforms reap few of the theoretical benefits of sparse algorithms, there is little incentive to research them further. Thus, there is a burning need to innovate speech enhancement algorithms, compression techniques, and hardware to reach an acceptable level and balance of noise removal, speech quality, and efficiency.
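A back-of-the-envelope calculation shows what those theoretical benefits look like. The layer sizes, frame rate, and 90% sparsity below are hypothetical values chosen only for illustration, and the estimate is an idealized upper bound: it ignores the index storage and control overhead that sparse formats need in practice.

```python
def dense_macs_per_second(layer_sizes, frames_per_second):
    """Multiply-accumulates per second for a stack of fully connected layers."""
    macs_per_frame = sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))
    return macs_per_frame * frames_per_second

layer_sizes = [257, 512, 512, 257]     # hypothetical fully connected network
frames_per_second = 16000 // 128       # 16 kHz audio, one frame every 128 samples
n_weights = sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))

dense_macs = dense_macs_per_second(layer_sizes, frames_per_second)
dense_bytes = n_weights * 4            # 32-bit float weights

sparsity = 0.9                         # assumed fraction of weights pruned away
sparse_macs = dense_macs * (1 - sparsity)        # ideal case: zero weights are skipped
sparse_int8_bytes = n_weights * (1 - sparsity)   # 8-bit weights, index overhead ignored

print(f"dense:  {dense_macs:,} MAC/s, {dense_bytes:,} bytes")
print(f"sparse: {sparse_macs:,.0f} MAC/s, {sparse_int8_bytes:,.0f} bytes (idealized)")
```

Hardware that cannot skip zero-valued weights still performs every one of the dense multiply-accumulates, which is why the savings stay theoretical on platforms not built for sparse workloads.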
The SWAP Balancing Act
Tradeoffs in the computational power problem boil down to a series of constraints, often referred to as “SWaP-C” constraints (Size, Weight, and Power, plus Cost). The area available for processors inside hearing aids is very limited, especially for CIC and ITC form factors. Processing power is often directly related to the size of the processor (assuming the same silicon process node), as more transistors or cores mean more processing capability. Power is also related to chip size; more processing usually requires more power. Increasing the performance of neural networks typically requires increasing their size (assuming no breakthroughs in architecture) and thus the memory required for model storage, ultimately affecting chip size and cost. Cost is often directly related to size, power, memory available, and processing capability.
Perhaps the most important constraint to add to the mix for speech enhancement is latency. The time it takes to process frames of audio data using a neural network is related to all of the factors mentioned above. For the user, it is very difficult to speak while hearing one’s own voice played back with too much delay. Moreover, there are hard constraints on total end-to-end audio path latency that affect sound quality, resulting in “tin-like” audio or robotic speech when delay is too high. The limit on total end-to-end latency accepted by many hearing aid engineers is 10 ms, though lower limits are often reported. Naively, one may think running chips at a higher clock rate solves this problem by increasing throughput, but this comes at the cost of power. Running algorithms for many hours per day on a hearing aid coin cell or rechargeable battery severely limits the processors and neural networks that can be deployed and how fast they can be run, not to mention the amount of heat dissipated by the processors in and around your ear.
So we need a processor and neural network small enough to fit inside of a hearing aid, powerful enough to run speech enhancement in under 10 ms per frame, and efficient enough to run all day on a hearing aid coin cell or rechargeable battery, without overheating and without completely destroying product margins. The solution must achieve all of this and, most importantly, perform well enough to significantly improve speech intelligibility for hearing aid users around the world.
Hear Here for Femtosense!
Femtosense has taken on this daunting, elusive problem to enable speech enhancement for hearing aids with a fully integrated solution. The hardware-algorithm problem is solved simultaneously by our Femtosense Sparse Processing Unit and proprietary speech enhancement network.
Our AI co-processor hardware comes equipped with a highly tuned, bespoke speech enhancement neural network leveraging high levels of sparsity to reduce neural network size and power by 10x and 100x respectively compared to today’s microprocessors and state of the art networks. It achieves best-in-class performance in speech enhancement, outperforming much larger networks in SNR improvement and objective intelligibility measures like PESQ and STOI.
The SPU runs our proprietary algorithm within the strict latency constraints required for hearing aids. At 8 ms latency it consumes close to 1 mW, allowing it to run continuously for hours at a time on a disposable or rechargeable hearing aid battery.
At 3 mm², the SPU fits in both behind-the-ear and in-canal form factors and is a highly efficient use of silicon real estate compared to solutions that do not support sparsity.
If you have your own speech enhancement model you would like to run on the SPU, we have a suite of PyTorch and TensorFlow-based optimization tools that make sparsity-aware and quantization-aware training easy to implement. After training and optimization, you can simulate the power, latency, and performance of your algorithm running on the SPU with our hardware simulation tools, all without leaving Python. This makes it easy for algorithm developers to know quickly whether their neural networks are going to run within spec on the hardware, making iteration a pain-free and rapid process. Deploying your algorithm to the SPU can be done in a few lines of code and we are happy to help you along the way.
Femtosense offers the only fully integrated speech enhancement solution suitable for premium digital hearing aids. The embedded SPU is a cost-effective, high performing, fully integrated solution to the ubiquitous speech-in-noise problem affecting millions of hearing-impaired people. By integrating a co-processor or integrated IP block into hearing aid SoCs, hearing aid manufacturers can be first to market with this feature for their users. Femtosense is excited and ready to partner with hearing aid manufacturers to usher in the future of digital hearing aids and significantly improve the lives of people around the globe.
Want to hear the solution live in action? Want to read more about our proprietary algorithm? Want to get hardware specifications, order an eval kit, or reserve test chips? Reach out to us at firstname.lastname@example.org to schedule an in-person demo, request documentation, or chat about other applications like keyword detection, sound and scene ID, neural beamforming, or biosignal classification.
The future of AI is sparse.