It’s May, and the spring semester is officially over. In true college fashion, it didn’t go out easily. My artificial intelligence course ended on a heavy note, another take-home exam that left everything to the imagination with a simple prompt: design a novel neural network with an architecture that minimizes CO2 emissions. We had 36 hours to turn in our solution, and I used almost every minute of that aside from sleeping and eating.

The core idea for my solution came to me pretty quickly thanks to a workshop I attended last year. I had come up with the idea of improving the user-friendliness of hearing aids, motivated by the fact that many seniors struggle with ever-changing technology. I proposed a hearing aid that would use machine learning both on-device and on a paired smartphone to enhance audio quality based on a user’s audiogram and stated preferences, and to make intelligent adjustments in real time, minimizing the need for user interaction for tasks such as changing volume or filtering different types of noise.

What I didn’t have at the time was an idea of how to architect such improvements, but this exam spurred me into action and I developed the following concept. In designing the solution, I leaned on my hobbyist understanding of audio processing to optimize latency and power efficiency.

Like other projects I worked on in this class, this probably has plenty of room for improvement. I also don’t know what grade I received for it (our professor didn’t give anyone their final exam grade), but I did manage to get an A+ for the class so it seems that the professor liked what I designed.


My novel neural network is an energy-efficient model for use in low-power medical devices such as hearing aids, as well as wireless earbuds whose microprocessors provide hearing assistance features (such as Apple’s AirPods). It is designed to work as part of an adaptive system that distinguishes and enhances relevant sounds, with an option to prioritize human speech using my neural network. It combines techniques aimed at reducing CO2 emissions through efficient training and minimal power consumption during inference.

System Design

Energy efficiency is baked into the design of not just the model itself, but input and output processing as well. Additionally, a hearing profile based on a user’s audiogram is expected to be applied to the model, which will assist in optimizing the model for further efficiency.

Input Preprocessing

Power efficiency begins with the input, before the model is activated. Since the range of speech typically peaks around 8kHz, the Nyquist-Shannon sampling theorem tells us that a 16kHz sampling rate is sufficient for quality audio processing of speech. However, hearing loss tends to occur in a much narrower range, which varies from one person to the next.

Preprocessor Architecture

Thus, the first step in the preprocessing pipeline is to apply one or more band-pass filters that reduce the audio fed to the model to only those frequencies a particular user has difficulty with, based on their hearing profile. Frequencies that fall outside the band-pass filters can be enhanced in some other way so those sounds are not lost, but that is outside the scope of this model’s design. By limiting processing to only the frequencies the user struggles with, the model requires far fewer computations and thus consumes far less battery power.
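To make the idea concrete, here is a minimal sketch of that band-pass stage, assuming SciPy is available and a single problem band (the 3.5-8kHz range from the worked example later in this post, with the upper cutoff nudged just under Nyquist):

```python
import numpy as np
from scipy.signal import butter, sosfilt

SAMPLE_RATE = 16_000  # Hz, per the Nyquist argument above

def bandpass(audio: np.ndarray, low_hz: float, high_hz: float,
             sample_rate: int = SAMPLE_RATE, order: int = 4) -> np.ndarray:
    """Keep only the band the user struggles with; other frequencies are
    handled outside the model."""
    sos = butter(order, [low_hz, high_hz], btype="bandpass",
                 fs=sample_rate, output="sos")
    return sosfilt(sos, audio)

# Example: keep the 3.5-8kHz band (upper cutoff just below Nyquist).
# filtered = bandpass(raw_audio, 3500, 7900)
```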

Next, during windowing, a short sample of audio is sliced into chunks called frames; for speech processing, those frames are typically about 20-40 milliseconds long. They are usually captured in overlapping intervals because speech is continuous. Each frame, once transformed, will serve as a single input to the neural network.
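A rough NumPy-only sketch of that framing step is below; the 25ms frame length and 10ms hop come from the worked example at the end of this post, and the Hamming window is an assumed (common) choice rather than part of the design:

```python
import numpy as np

def frame_audio(audio: np.ndarray, sample_rate: int = 16_000,
                frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    """Slice a filtered audio buffer into overlapping, windowed frames."""
    frame_len = int(sample_rate * frame_ms / 1000)  # 400 samples at 16kHz
    hop_len = int(sample_rate * hop_ms / 1000)      # 160 samples at 16kHz
    assert len(audio) >= frame_len, "need at least one full frame"
    n_frames = 1 + (len(audio) - frame_len) // hop_len
    window = np.hamming(frame_len)
    return np.stack([
        audio[i * hop_len : i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])  # shape: (n_frames, frame_len)
```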

Windowing is followed by an awareness function that determines whether the model should be activated for any given frame. There are various methods to achieve this, but regardless of the method chosen, the goal in this context is to reduce unnecessary activation for minimal energy usage.
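The design deliberately leaves the choice of awareness function open; one low-cost possibility is a short-term energy gate like the sketch below, where the threshold is an assumed tunable rather than a value from the design:

```python
import numpy as np

def should_activate(frame: np.ndarray, energy_threshold: float = 1e-4) -> bool:
    """Only wake the model for frames loud enough to plausibly contain speech,
    skipping inference (and its energy cost) for near-silence."""
    return float(np.mean(frame ** 2)) > energy_threshold
```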

If a given frame passes the awareness function, it is transformed into a set of numbers called Mel-Frequency Cepstral Coefficients (MFCCs), which represent the spectral fingerprint of the sound in the frame. The MFCC vectors are then stacked across frames (the time dimension) into a 2D tensor that the model accepts for inference.
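A sketch of that transformation using librosa (an assumed tooling choice; any MFCC implementation would do), keeping 13 coefficients and the 25ms/10ms framing from the worked example:

```python
import numpy as np
import librosa

def mfcc_tensor(audio: np.ndarray, sample_rate: int = 16_000,
                n_mfcc: int = 13) -> np.ndarray:
    """Return a (T, F) tensor: T frames over time, F = 13 MFCCs per frame."""
    mfcc = librosa.feature.mfcc(
        y=audio, sr=sample_rate, n_mfcc=n_mfcc,
        n_fft=int(0.025 * sample_rate),       # 25ms frames
        hop_length=int(0.010 * sample_rate))  # 10ms hop
    return mfcc.T
```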

Note that framing an audio sample implies that the model will use a stream of frames to infer whether someone is speaking based on how sounds change over time. The inputs for the model will be small groups of frames, and the output will be a prediction of whether there is speech present in the sample.

Model Structure

Model Architecture

The input layer for the model is simply structural, serving as an entry point for the preprocessor to interface with the model. It accepts the 2D tensor and passes it on to the next layer.

Next, the temporal feature extraction group contains one or more layers that apply depthwise separable convolutions (DSC) to the input frames, giving the model dynamic pruning potential. DSC reduces the number of parameters early by combining features from different channels into fewer, more complex features, cutting down the computations required later in the model. It also enables early detection of useful features, which can help the model avoid computing frames that are unlikely to contain speech.

DSC can significantly reduce computation, in many cases by more than 90%, with relatively little loss in accuracy.
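As a quick illustration of the parameter savings (an assumed Keras implementation; the exact reduction depends on kernel size and channel counts, and grows with wider layers), compare a standard 1D convolution with its depthwise separable counterpart on the (3, 13) input used in the worked example:

```python
import tensorflow as tf

# Standard convolution: 3*13*16 weights + 16 biases = 640 parameters.
standard = tf.keras.layers.Conv1D(16, kernel_size=3, padding="same")
standard.build((None, 3, 13))

# Depthwise separable: 3*13 depthwise + 13*16 pointwise + 16 biases = 263.
separable = tf.keras.layers.SeparableConv1D(16, kernel_size=3, padding="same")
separable.build((None, 3, 13))

print(standard.count_params(), separable.count_params())  # 640 vs 263
```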

This is followed by the temporal context layer, which contains a small unidirectional gated recurrent unit (GRU). The GRU helps the model distinguish between random noise and consistent patterns that may be speech. This layer is computationally lightweight, so it runs efficiently on very low-power processors. In this model, 32 units are selected for the GRU’s hidden state to balance efficiency and complex pattern recognition.

Next, the attention layer helps the model ignore non-vocal sounds by assigning an importance score to each frame. By weighting the frames by their scores and summing them into a single context vector, it also reduces the energy required for computations later in the network. The attention layer significantly improves the model’s ability to recognize vocal sounds accurately, especially in noisy scenarios.

Finally, the output group is composed of a dense layer and a sigmoid activation. The dense layer takes the context vector from the attention layer and maps it to a single logit representing the presence of speech. Then the sigmoid activation converts the logit into a probability between 0 and 1.
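Putting the whole stack together, here is one possible Keras sketch of the model under the assumptions used in the worked example below (3 frames of 13 MFCCs, 16 DSC channels, a 32-unit GRU); the attention scoring and context-vector sum are built from stock layers, and the layer count and activations are illustrative rather than fixed by the design:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_speech_detector(n_frames: int = 3, n_mfcc: int = 13) -> tf.keras.Model:
    inputs = tf.keras.Input(shape=(n_frames, n_mfcc))         # (T, F) from the preprocessor
    # Temporal feature extraction: depthwise separable convolution, 16 channels
    x = layers.SeparableConv1D(16, 3, padding="same", activation="relu")(inputs)
    # Temporal context: small unidirectional GRU with a 32-unit hidden state
    x = layers.GRU(32, return_sequences=True)(x)               # (T, 32)
    # Attention: score each frame, normalize over time, sum into a context vector
    scores = layers.Dense(1)(x)                                # (T, 1) importance scores
    weights = layers.Softmax(axis=1)(scores)
    context = layers.Flatten()(layers.Dot(axes=1)([weights, x]))  # length-32 context vector
    # Output: a single logit squashed to a speech probability
    output = layers.Dense(1, activation="sigmoid")(context)
    return tf.keras.Model(inputs, output)

model = build_speech_detector()
model.summary()  # roughly 5k parameters in this configuration
```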

Output Postprocessing

Postprocessor Architecture

Once the model’s analysis has completed, several processes are applied before the audio is output. The model’s prediction is compared to a threshold of 0.5; any value greater than that is classified as speech, otherwise it is classified as noise. Smart gain adjustments amplify selected frequency bands when speech is detected and attenuate those bands when noise is detected. The exact methods are outside the scope of this design, but they would be based on a user’s hearing profile and the degree of hearing loss in each band.

Additionally, transitions to and from the sample are smoothed to provide a more natural hearing experience, and finally the audio sample is passed to the device output.
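A rough sketch of those postprocessing steps, assuming a single enhanced band, fixed gains derived from the user’s hearing profile, and a short linear ramp for smoothing (the design leaves the specific methods open):

```python
import numpy as np

SPEECH_THRESHOLD = 0.5

def postprocess(band_audio: np.ndarray, speech_prob: float,
                speech_gain: float = 4.0, noise_gain: float = 0.5,
                prev_gain: float = 1.0, ramp_len: int = 160) -> np.ndarray:
    """Amplify the band on speech, attenuate it on noise, and ramp between
    gains so the transition sounds natural."""
    target = speech_gain if speech_prob > SPEECH_THRESHOLD else noise_gain
    gains = np.full(len(band_audio), target, dtype=float)
    ramp = np.linspace(prev_gain, target, num=min(ramp_len, len(band_audio)))
    gains[: len(ramp)] = ramp  # smooth the transition into this chunk
    return band_audio * gains
```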

Model Training

Efficient Dataset Design

Efficiency in training starts with the dataset itself. Audio samples must first be labeled. This can be done manually or with automated low-cost heuristics. For this model, the labeling would simply be 1 when speech is present in a sample, and 0 when speech is not.

To improve generalization, clean training samples should be preprocessed with various band-pass filters and background noise samples at varying signal-to-noise ratios (SNRs) to create additional training samples. Once processed into MFCCs, the training data requires no further computation and can be reused as much as needed, eliminating unnecessary energy consumption.
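The augmentation itself boils down to mixing clean speech with noise at a chosen SNR; a small sketch (function and variable names here are illustrative, not part of the design):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale the noise so the speech-to-noise power ratio equals snr_db, then mix."""
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# e.g. a low-SNR "quiet restaurant" style sample:
# noisy = mix_at_snr(clean_clip, restaurant_noise, snr_db=5)
```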

Training Algorithm

The model itself is very small; depending on the number of layers chosen, it can be as small as 6-10k parameters. By using quantization simulation during training, the model will learn to operate under low-bit constraints, and together these two properties contribute to energy savings in both training and inference.
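One way to simulate those low-bit constraints is quantization-aware training via the TensorFlow Model Optimization toolkit (an assumed tooling choice; the design only requires that quantization be simulated, and some layer types need extra annotation before they can be wrapped):

```python
import tensorflow_model_optimization as tfmot

# Wrap the layers with fake-quantization ops so training sees 8-bit effects.
q_model = tfmot.quantization.keras.quantize_model(model)
q_model.compile(optimizer="adam",
                loss="binary_crossentropy",   # see the training notes below
                metrics=["accuracy"])
```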

The training algorithm I have selected is a curriculum learning loop:

  1. Apply random feature dropout at the beginning of each epoch.
  2. Begin training with easy data, only selecting samples of clean speech and clean noise.
  3. Increase complexity as epochs advance. Introduce low SNR scenarios, such as conversations in quiet restaurants. Vary the voices and accents provided in samples, and gradually increase the noise complexity.
  4. Use a hard-negative mining pass at regular intervals, and focus subsequent training loops on samples where the model shows low confidence or makes wrong predictions. This helps the model converge in fewer epochs, reducing wasted computation on samples the model has already mastered and avoiding overfitting on them.

Small batch sizes can be used to reduce memory usage and spikes in computation depending on the hardware, which in turn minimizes energy consumption. Additionally, binary cross-entropy loss is used in training to push the model toward more confident binary predictions.
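A condensed sketch of this loop is below. The staged datasets, epoch counts, and mining interval are placeholders for whatever the data pipeline provides; only the control flow mirrors the curriculum described above:

```python
import numpy as np

BATCH_SIZE = 32          # small batches limit memory use and compute spikes
EPOCHS = 30              # assumed budget; early stopping may end training sooner
EPOCHS_PER_STAGE = 10    # easy -> harder -> hardest
MINING_INTERVAL = 5

def random_feature_dropout(x: np.ndarray, rate: float = 0.1) -> np.ndarray:
    """Step 1: zero out a random subset of MFCC features each epoch."""
    return x * (np.random.rand(*x.shape) > rate)

# `stages` is assumed to be a list of (x, y) arrays ordered from clean samples
# to low-SNR mixes; `model` is the compiled model from the sketches above.
for epoch in range(EPOCHS):
    x_train, y_train = stages[min(epoch // EPOCHS_PER_STAGE, len(stages) - 1)]
    x_epoch = random_feature_dropout(x_train)
    model.fit(x_epoch, y_train, batch_size=BATCH_SIZE, epochs=1, verbose=0)

    if (epoch + 1) % MINING_INTERVAL == 0:
        # Step 4: hard-negative mining - refit on low-confidence or wrong samples.
        probs = model.predict(x_train, verbose=0).ravel()
        hard = (np.abs(probs - 0.5) < 0.1) | ((probs > 0.5) != (y_train > 0.5))
        if hard.any():
            model.fit(x_train[hard], y_train[hard],
                      batch_size=BATCH_SIZE, epochs=1, verbose=0)
```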

Evaluation

Training incorporates several techniques to improve both accuracy and energy efficiency (a brief callback sketch follows the list):

  • Tracking accuracy and false positives/negatives on noisy samples to identify when the model is struggling with complex inputs.
  • Monitoring confidence threshold stability to ensure consistent decision-making, and stopping training early if confidence outputs become erratic (e.g. the model displays sudden jumps in confidence values for similar inputs), which could indicate overfitting.
  • Stopping training early if loss plateaus or accuracy peaks, to reduce unnecessary computation and energy consumption after the model has converged.
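The plateau-based early stopping and the noisy-sample error tracking map onto standard tooling fairly directly; here is a sketch assuming Keras callbacks and a held-out noisy validation set (the confidence-stability check would need a custom callback and is omitted):

```python
import numpy as np
import tensorflow as tf

callbacks = [
    # Stop once validation loss plateaus, saving energy on pointless epochs.
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                     restore_best_weights=True),
]
model.fit(x_train, y_train, validation_data=(x_noisy_val, y_noisy_val),
          batch_size=32, epochs=30, callbacks=callbacks)

# Track false positives/negatives on the noisy set to spot struggling inputs.
preds = model.predict(x_noisy_val, verbose=0).ravel() > 0.5
false_pos_rate = np.mean(preds[y_noisy_val == 0])
false_neg_rate = np.mean(~preds[y_noisy_val == 1])
```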

Post-Training Optimization

Efficiency gains are also found in optimizing the model after training:

  • 8-bit quantization is applied for significant computation savings during inference, which matters particularly on low-power target devices such as hearing aids and wireless earbuds (see the sketch after this list).
  • Fusing layers reduces computation and power consumption on low-power devices with digital signal processors, and in some cases may be required by the hardware architecture.
  • Pruning low-magnitude weights, which have little impact on accuracy, can provide a massive reduction in model size and in the number of computations required on any given activation.
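As a concrete (assumed) deployment path for the 8-bit quantization step, the TensorFlow Lite converter can perform full-integer post-training quantization; magnitude-based pruning would similarly be available through tfmot’s sparsity API:

```python
import tensorflow as tf

# `representative_audio_batches` is assumed to yield preprocessed (1, T, F)
# tensors from real recordings, used to calibrate the int8 ranges.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_audio_batches
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
open("speech_detector_int8.tflite", "wb").write(tflite_model)
```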

Tailoring to the End User

Finally, the model can be further optimized for efficiency by customizing it for an individual user. While the specifics of deployment may vary, the general idea is that an application would form a user profile based on key frequency bands identified in a provided audiogram and user-defined preferences. It would then use the profile to prune parameters within the model (trained but not yet quantized) that are related to bands for which the user does not require adjustment (this would have to be mapped during training).

The application would then quantize the pruned model before packaging it for installation on the appropriate device. During installation, customized settings would be applied to features such as band-pass filters in preprocessing and gain adjustments in post-processing.

Practical Application

Let’s look at how the model with its input and output pipelines might handle an audio sample.

Raw input: a 1-second audio sample of a person saying “Hello!”

Constraint: the user’s profile indicates moderate hearing loss in the 3.5-8kHz range

  1. Preprocessing
    1. As audio begins streaming in, a band-pass filter is applied for 3.5-8kHz.
    2. The filtered audio stream is divided into 25ms frames sampled every 10ms, resulting in a 15ms overlap. Every three frames, the preprocessor advances the frames to the next step.
    3. MFCCs are calculated for each frame, keeping the first 13 as they define the spectral “shape” of the frame.
    4. The MFCC vectors are combined into a 3x13 matrix, the 2D tensor of shape (T, F) to be used as input where T is the number of frames and F is the number of coefficients.
  2. Input: the 2D tensor is received and passed to the temporal feature extraction group.
  3. Temporal feature extraction: the DSC layers scan the sequence of frames and extract multi-resolution features from each frame and channel. They output a tensor of shape (T, C) where C is the number of output channels per frame. In this example, let’s say 16 channels are output, so the tensor is of shape (3,16).
  4. Temporal context: the unidirectional GRU layer processes the 3-frame sequence, maintaining contextual memory and identifying patterns over time. It outputs a tensor of shape (3,32).
  5. Attention: in the attention layer, each frame is given an importance score, and a context vector is generated for the entire sequence with a length of 32.
  6. Output: the dense layer maps the context vector to a single logit, and the sigmoid activation converts the logit to a probability between 0 and 1.
  7. Postprocessing
    1. The model’s output is compared to the threshold value of 0.5.
    2. Smart gain adjustments are applied based on the result of thresholding and the user’s hearing profile.
    3. Smoothing is applied to speech-positive sequences.