Researching Audio using MFCCs and HuBERT

MFCCs: Raw audio has too much data (16,000+ numbers/second). MFCCs compress this to about 13 meaningful numbers per short frame, capturing the characteristics of speech.

HuBERT: Learns powerful speech representations from unlabeled audio, reducing the need for expensive transcription while outperforming hand-crafted features like MFCCs on most speech tasks.

MFCCs compress raw audio into 13 numbers every ~25 milliseconds, capturing the essential "shape" of speech sounds while discarding irrelevant details like pitch and volume variations. They're hand-crafted features mimicking human hearing, perfect for traditional speech recognition and speaker identification when computational resources are limited.

HuBERT learns rich speech representations from unlabeled audio by predicting masked portions, similar to how humans learn language through exposure. It creates universal features that work across multiple tasks (recognition, emotion detection, speaker verification) and languages without needing transcribed data. It's computationally intensive but far more powerful and adaptable than MFCCs, representing modern deep learning's approach to speech understanding.

1. MFCCs (Mel-frequency Cepstral Coefficients)

What MFCCs Are

MFCCs are a way to represent audio (especially speech) as numbers that machine learning models can understand. Think of them as a "fingerprint" of sound that captures the most important characteristics while throwing away unnecessary details.

The Problem They Solve

Raw audio is just a long list of numbers representing air pressure changes over time - often 16,000+ numbers per second! This is:

  • Too much data for most ML models
  • Full of irrelevant information
  • Not organized in a way that highlights important speech features

MFCCs compress this into roughly 13 numbers for each 20-40 ms frame of audio (typically computed every 10 ms) - much more manageable!

How MFCCs Work (Simplified)

The process resembles how human hearing works (a short code sketch follows the steps below):

  1. Divide into short chunks: Split audio into tiny segments (usually 20-40ms), short enough that the sound is relatively stable

  2. Analyze frequencies: For each chunk, determine which frequencies (pitches) are present and how strong they are

  3. Apply the Mel scale: Emphasize frequencies the way humans hear them. We're better at distinguishing between low frequencies (100Hz vs 200Hz) than high ones (10,000Hz vs 10,100Hz)

  4. Take the logarithm: This mimics how we perceive loudness - the difference between quiet and moderate is more noticeable than between loud and very loud

  5. Apply DCT (Discrete Cosine Transform): This mathematical step decorrelates the features and compresses information, keeping only the most important patterns

  6. Keep first 13 coefficients: These contain most of the relevant information about the sound
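
To make the pipeline concrete, here is a minimal librosa sketch that produces 13 coefficients per 25 ms frame; the file name speech.wav and the 16 kHz sample rate are illustrative assumptions.

# Minimal MFCC extraction sketch with librosa ("speech.wav" is a placeholder)
import librosa

y, sr = librosa.load("speech.wav", sr=16000)   # raw waveform: 16,000 numbers per second
mfccs = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=13,        # keep the first 13 coefficients
    n_fft=400,        # 25 ms analysis window at 16 kHz
    hop_length=160,   # a new frame every 10 ms
)
print(mfccs.shape)    # (13, num_frames) - 13 numbers per frame instead of thousands per second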

What Each MFCC Represents

  • MFCC 0: Overall energy/loudness
  • MFCC 1-2: Basic spectral shape (bright vs dull)
  • MFCC 3-12: Finer details of spectral shape, capturing things like vowel sounds and consonant characteristics

Why They're Powerful for ML

MFCCs are excellent because they:

  • Capture the "shape" of the vocal tract that produced the sound
  • Are relatively robust to volume changes
  • Ignore pitch variations (mostly) - "hello" said high or low gives similar MFCCs
  • Compact representation - from thousands of numbers to just 13
  • Work well for speech recognition, speaker identification, and music analysis

Limitations

  • Lose some information (can't reconstruct original audio from MFCCs)
  • Don't capture pitch well (bad for tonal languages or music melody)
  • Assume audio is speech-like
  • Not ideal for environmental sounds or complex music

2. HuBERT (Hidden-Unit BERT)

What HuBERT Is

HuBERT is a self-supervised speech representation model developed by Facebook/Meta in 2021. It learns to understand speech by training on massive amounts of unlabeled audio - no transcriptions needed!

The Breakthrough Idea

Traditional speech systems needed paired audio-text data (expensive to create). HuBERT instead:

  1. Learns patterns from raw audio alone
  2. Creates its own "pseudo-labels" to train on
  3. Builds rich representations useful for many downstream tasks

How HuBERT Works

The training process is clever (a small clustering sketch follows these steps):

  1. Initial clustering: First, group similar-sounding audio frames together using basic features (like MFCCs). Think of this as creating a rough "alphabet" of sounds

  2. Masked prediction: Like BERT for text:

    • Take audio input
    • Randomly mask (hide) parts of it
    • Train the model to predict what was masked based on surrounding context
    • But instead of predicting raw audio, predict the cluster assignments
  3. Iterative refinement:

    • Use the trained model to create better clusters
    • Retrain on these improved clusters
    • Repeat 2-3 times
    • Each iteration discovers more sophisticated patterns
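
A rough sketch of step 1 (this is not Meta's training code; the MFCC features, 100 clusters, and scikit-learn k-means are illustrative choices) shows how unlabeled frames can be turned into pseudo-labels:

# Sketch: turning unlabeled audio frames into k-means pseudo-labels (the HuBERT iteration-1 idea)
import librosa
from sklearn.cluster import KMeans

y, sr = librosa.load("speech.wav", sr=16000)            # hypothetical unlabeled audio
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T   # (num_frames, 13)

kmeans = KMeans(n_clusters=100, random_state=0).fit(mfccs)  # a rough "alphabet" of 100 sound units
pseudo_labels = kmeans.labels_                              # one cluster ID per frame
# These IDs become the prediction targets for the masked frames in the first training iteration;
# later iterations re-cluster the model's own hidden representations instead of MFCCs.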

The Architecture

  • Input: Raw waveform (basic features such as MFCCs are used only to build the first-iteration cluster targets)
  • CNN encoder: Converts audio to initial representations
  • Transformer layers: 12 or 24 layers that process the masked sequence
  • Prediction head: Predicts cluster assignments for masked regions
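
For most practitioners the interesting part is using the pre-trained model rather than training it. A minimal sketch with the Hugging Face transformers library might look like the following; the facebook/hubert-base-ls960 checkpoint, the speech.wav path, and 16 kHz input are assumptions based on the published base model.

# Sketch: extracting HuBERT representations with Hugging Face transformers
import torch
import librosa
from transformers import HubertModel, Wav2Vec2FeatureExtractor

model = HubertModel.from_pretrained("facebook/hubert-base-ls960")
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")

y, sr = librosa.load("speech.wav", sr=16000)                 # hypothetical 16 kHz speech
inputs = extractor(y, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    features = model(inputs.input_values).last_hidden_state  # shape: (1, num_frames, 768)
# Each ~20 ms frame gets a 768-dimensional learned representation usable for downstream tasks.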

What Makes HuBERT Special

  1. No labels needed: Learns from any speech audio, even unknown languages

  2. Universal representations: The learned features work for many tasks:

    • Speech recognition
    • Speaker verification
    • Emotion recognition
    • Language identification
  3. Multilingual capability: Can handle multiple languages without being explicitly told which is which

  4. Robust features: Handles noisy audio better than traditional features

Key Innovations

  • Offline clustering: Unlike some competitors, HuBERT clusters the entire dataset at once, giving more consistent targets
  • Multiple iterations: The retraining process progressively discovers more complex patterns
  • Simplicity: Simpler than some alternatives while achieving better results

Practical Impact

HuBERT has enabled:

  • Speech recognition for low-resource languages (little transcribed data)
  • Better voice assistants that understand various accents
  • Improved performance when labeled data is scarce
  • Cross-lingual transfer (train on English, work on other languages)

Comparison to MFCCs

While MFCCs are:

  • Hand-crafted features based on human knowledge of hearing
  • Computed by a fixed, deterministic extraction process
  • Limited to capturing certain acoustic properties
  • Fast and simple to compute

HuBERT features are:

  • Learned automatically from data
  • Adaptive to the specific patterns in speech
  • Able to capture complex, abstract relationships
  • Computationally intensive but more powerful

How They Work Together

In modern systems, you might see:

  1. MFCCs used as initial features fed into HuBERT
  2. HuBERT replacing MFCCs entirely for downstream tasks
  3. Both used in ensemble systems for robustness

Think of MFCCs as a good "first-generation" solution based on human insight about audio, while HuBERT represents the modern "deep learning" approach that discovers patterns we might never have thought to look for.

Based on MFCCs and HuBERT, here are the 3 most essential related topics you should study to build a complete understanding:

1. Spectrograms and Fourier Transform

Why This Is Essential

This is THE foundation that everything else builds on. You can't truly understand MFCCs or any audio processing without grasping how we get from sound waves to frequency information.

What to Learn

Core Concepts:

  • Fourier Transform: How we decompose a signal into its frequency components - like breaking a chord into individual notes
  • STFT (Short-Time Fourier Transform): Applying Fourier Transform to overlapping windows of audio
  • Spectrograms: The visual representation showing frequency (y-axis) vs time (x-axis) vs intensity (color)
  • Mel-Spectrograms: Spectrograms with mel-scale frequency spacing

Key Understanding Points:

  • Time-frequency tradeoff (can't have perfect resolution in both)
  • Window size effects (larger = better frequency resolution, worse time resolution)
  • Phase vs magnitude (why we often discard phase)
  • The relationship between waveforms and spectrograms
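
A tiny calculation makes the tradeoff concrete (16 kHz audio is assumed; the window sizes are arbitrary examples):

# The time-frequency tradeoff in numbers (16 kHz audio assumed)
sr = 16000
for n_fft in (256, 1024, 4096):
    print(f"window = {1000 * n_fft / sr:.0f} ms, frequency bins every {sr / n_fft:.1f} Hz")
# Longer windows give finer frequency detail but blurrier timing, and vice versa.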

Practical Skills:

# You should be able to create and interpret these (a runnable sketch; "speech.wav" is a placeholder):
import librosa
y, sr = librosa.load("speech.wav", sr=16000)
S = librosa.stft(y, n_fft=1024, hop_length=256)            # basic (complex-valued) spectrogram
M = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)  # mel-scaled version
# Understand the parameters: n_fft, hop_length, n_mels

Connection to MFCCs and HuBERT

  • MFCCs are derived FROM mel-spectrograms (log compression plus DCT)
  • HuBERT's first-iteration training targets are built from MFCC-style features, even though its input is the raw waveform
  • Most audio models start with some form of time-frequency representation
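
The first bullet can be checked directly in librosa: computing MFCCs from a pre-computed mel-spectrogram (log compression, then DCT) is equivalent to the one-step call. The speech.wav path and 40 mel bands below are illustrative assumptions.

# Sketch: MFCCs really are log-mel-spectrogram + DCT
import librosa

y, sr = librosa.load("speech.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40)   # mel-spectrogram
log_mel = librosa.power_to_db(mel)                            # log compression
mfccs = librosa.feature.mfcc(S=log_mel, n_mfcc=13)            # DCT of the log-mel energies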

2. Transformer Architecture and Attention Mechanisms

Why This Is Essential

HuBERT is built on Transformers. Without understanding this architecture, HuBERT remains a black box. This is also the foundation for almost all modern speech/audio models.

What to Learn

Core Components:

  • Self-Attention: How positions in a sequence attend to each other
  • Multi-Head Attention: Parallel attention mechanisms capturing different relationships
  • Positional Encoding: How Transformers know sequence order
  • Feed-Forward Networks: The MLP layers between attention blocks
  • Layer Normalization: Stabilizing training

Key Concepts:

  • Query, Key, Value matrices and their roles
  • Attention weights visualization and interpretation
  • Masked attention (crucial for HuBERT's training)
  • Encoder vs Decoder architectures (HuBERT uses encoder-only like BERT)

Architecture Understanding:

# Understand this flow:
Input → Embedding → Add Position → [Transformer Block]×N → Output
                                    ↓
                    [Multi-Head Attention → Add&Norm → FFN → Add&Norm]
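
To make the Query/Key/Value roles concrete, here is a bare-bones NumPy sketch of scaled dot-product self-attention for a single head, without masking or the multi-head split; the sizes are arbitrary examples.

# Sketch: single-head scaled dot-product self-attention in NumPy
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_k) projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                      # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])               # how strongly each frame attends to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over the sequence
    return weights @ V                                    # context-mixed representations

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 64))                             # e.g. 50 audio frames, 64-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(64, 64)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                       # (50, 64)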

Connection to Audio Processing

  • How audio frames become "tokens" for Transformers
  • Why masking certain frames helps learning (HuBERT's core idea)
  • How attention captures long-range dependencies in speech
  • The role of context in understanding speech

3. wav2vec 2.0 and Contrastive Learning

Why This Is Essential

wav2vec 2.0 is HuBERT's "sibling" - understanding both gives you the complete picture of self-supervised speech learning. The contrastive learning approach is fundamentally different from HuBERT's clustering approach, and comparing them deepens understanding of both.

What to Learn

Contrastive Learning Framework:

  • Positive vs Negative samples: The true quantized representation of a masked frame vs distractors drawn from other frames
  • InfoNCE Loss: The contrastive loss function (see the sketch after this list)
  • Quantization: Converting continuous speech to discrete units
  • Codebook learning: Building a vocabulary of speech units
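
A rough illustration of the InfoNCE idea (not wav2vec 2.0's exact implementation; the cosine similarity and temperature of 0.1 are illustrative choices) scores one positive target against K distractors:

# Sketch: InfoNCE-style contrastive loss in PyTorch
import torch
import torch.nn.functional as F

def info_nce(context, positive, negatives, temperature=0.1):
    """context: (D,), positive: (D,), negatives: (K, D)."""
    candidates = torch.cat([positive.unsqueeze(0), negatives], dim=0)            # (K+1, D)
    sims = F.cosine_similarity(context.unsqueeze(0), candidates) / temperature   # (K+1,)
    # The positive sits at index 0, so this is just cross-entropy over similarities.
    return F.cross_entropy(sims.unsqueeze(0), torch.tensor([0]))

loss = info_nce(torch.randn(256), torch.randn(256), torch.randn(10, 256))  # random toy tensors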

wav2vec 2.0 Specifics:

  • CNN feature encoder (processes raw waveform)
  • Quantization module (creates discrete targets)
  • Transformer context network (builds representations)
  • Contrastive task: distinguish true quantized features from distractors

Key Differences from HuBERT:

  • Online vs Offline: wav2vec 2.0 quantizes on-the-fly, HuBERT pre-clusters
  • Training objective: Contrastive vs predictive
  • Discrete units: Learned during training vs pre-computed
  • Stability: HuBERT is generally more stable to train

Practical Understanding

# Conceptual pipeline comparison:
# wav2vec 2.0: Audio → CNN → Quantize (online) → Mask → Transformer → Contrastive Loss
# HuBERT:      Audio → CNN → Mask → Transformer → Predict offline cluster IDs

Why These Three Topics?

Building from Foundation to Frontier:

  1. Spectrograms = Understanding how audio becomes numbers
  2. Transformers = Understanding the modern architecture powering everything
  3. wav2vec 2.0 = Understanding the alternative approach to HuBERT

With these three additions to MFCCs and HuBERT, you'll have:

  • Classical approach (MFCCs)
  • Fundamental representation (Spectrograms)
  • Core architecture (Transformers)
  • Two competing self-supervised methods (HuBERT vs wav2vec 2.0)

Study Order Recommendation

  1. Start with Spectrograms/Fourier - This is prerequisite knowledge
  2. Study Transformers thoroughly - Take time here, it's reused everywhere
  3. Then tackle wav2vec 2.0 - Compare and contrast with HuBERT
  4. Finally, implement small projects using each concept

Quick Learning Resources

For Spectrograms:

  • "But what is the Fourier Transform?" by 3Blue1Brown (YouTube)
  • Librosa tutorials for hands-on practice

For Transformers:

  • "The Illustrated Transformer" by Jay Alammar
  • "Attention is All You Need" paper (after understanding basics)

For wav2vec 2.0:

  • Original Facebook AI blog post (very accessible)
  • Hugging Face's wav2vec2 documentation and tutorials

These three topics plus MFCCs and HuBERT will give you a solid foundation where you can understand most modern speech/audio papers and implement real systems!

Beyond those three topics, it helps to map out the broader landscape of audio and speech processing so you can see how MFCCs and HuBERT fit into the bigger picture.

The Big Picture: Audio/Speech Processing Hierarchy

1. Foundation: Signal Processing & Feature Extraction

Classic Features (MFCCs' relatives):

  • Spectrograms: Visual representation of frequencies over time - the "base map" of audio
  • Mel-Spectrograms: Spectrograms with mel-scale frequency bins
  • Chroma Features: Capture musical pitch classes (for music analysis)
  • Zero Crossing Rate: How often signal crosses zero (roughness indicator)
  • Spectral Centroid/Rolloff/Flux: Various measures of frequency distribution
  • PLP (Perceptual Linear Prediction): Alternative to MFCCs, models human hearing differently
  • Filter Banks: Raw mel-scale energies before the DCT step in MFCCs

Key Concepts to Understand:

  • Fourier Transform: The mathematical foundation for frequency analysis
  • Window Functions: Hamming, Hann windows for analyzing audio chunks
  • Nyquist Theorem: Why we sample at 16kHz or 44.1kHz
  • Pre-emphasis: Boosting high frequencies before processing
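
Two of these are one-liners in code; this sketch shows pre-emphasis and a Hamming window applied to a single frame (the 0.97 coefficient and 400-sample frame length are common conventions, assumed here):

# Sketch: pre-emphasis filter and a Hamming window on one audio frame
import numpy as np

frame = np.random.randn(400)                         # stand-in for one 25 ms frame at 16 kHz
emphasized = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])   # y[n] = x[n] - 0.97 * x[n-1]
windowed = emphasized * np.hamming(400)              # taper the frame edges before the FFT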

2. Self-Supervised Learning Family (HuBERT's relatives)

Contemporary Models:

  • wav2vec 2.0: Facebook's predecessor to HuBERT, uses contrastive learning
  • WavLM: Microsoft's model, extends HuBERT with denoising objectives
  • data2vec: Meta's unified approach for speech, vision, and text
  • W2V-BERT: Combines wav2vec 2.0 and HuBERT approaches
  • XLSR: Cross-lingual speech representations
  • UniSpeech: Unified pre-training for various speech tasks

Earlier Approaches:

  • CPC (Contrastive Predictive Coding): Pioneer in self-supervised audio
  • APC (Autoregressive Predictive Coding): Predicts future frames
  • Mockingjay: BERT-style masking for spectrograms
  • VQ-VAE: Vector quantization for discrete audio representations

3. End-to-End Speech Recognition Systems

Modern Architectures:

  • Whisper: OpenAI's robust multilingual ASR system
  • Conformer: Combines CNNs and Transformers (Google)
  • Transformer Transducer: Streaming speech recognition
  • LAS (Listen, Attend, Spell): Attention-based sequence-to-sequence
  • RNN-T (RNN Transducer): For real-time streaming ASR
  • CTC (Connectionist Temporal Classification): Alignment-free training

Classic Approaches (important historically):

  • HMM-GMM: Hidden Markov Models with Gaussian Mixture Models
  • HMM-DNN: HMMs with Deep Neural Networks
  • Kaldi: Popular toolkit combining many classical techniques

4. Speech Synthesis (Opposite Direction)

Neural Vocoders:

  • WaveNet: DeepMind's breakthrough in audio generation
  • WaveGlow/WaveRNN: Faster alternatives
  • HiFi-GAN: High-fidelity audio generation
  • Parallel WaveGAN: Non-autoregressive generation

Text-to-Speech Models:

  • Tacotron 1/2: Google's TTS systems
  • FastSpeech 1/2: Non-autoregressive TTS
  • VITS: Variational Inference TTS
  • Tortoise TTS: High-quality but slow
  • Bark: Suno's model for speech with emotions/sound effects

5. Audio Understanding Beyond Speech

General Audio Analysis:

  • PANNs: Pre-trained Audio Neural Networks for sound classification
  • VGGish: Audio classification based on VGG architecture
  • YAMNet: Google's audio event detection
  • OpenL3: Look, Listen, and Learn embeddings
  • CLAP: Contrastive Language-Audio Pre-training (like CLIP for audio)

Music-Specific:

  • Music Information Retrieval (MIR): Entire field for music analysis
  • Beat tracking/Tempo estimation
  • Chord recognition
  • Music source separation: Spleeter, Demucs, Open-Unmix
  • Singing voice synthesis: VOCALOID, DiffSinger

6. Multimodal and Advanced Topics

Audio-Visual:

  • AV-HuBERT: Learning from audio and lip movements together
  • Audio-Visual Speech Recognition (AVSR)
  • Speech-driven facial animation

Voice Conversion & Cloning:

  • VQVC: Voice conversion with vector quantization
  • YourTTS: Zero-shot voice cloning
  • RVC: Retrieval-based Voice Conversion

Neural Audio Codecs:

  • SoundStream: Google's neural audio codec
  • EnCodec: Meta's neural codec used in MusicGen
  • Descript Audio Codec (DAC): High-fidelity compression

7. Key Concepts You Should Understand

Fundamental Theory:

  • Phonetics vs Phonology: Physical sounds vs sound systems
  • Formants: Resonant frequencies that define vowels
  • Prosody: Rhythm, stress, and intonation
  • Coarticulation: How adjacent sounds affect each other

Machine Learning Concepts:

  • Attention Mechanisms: Core to modern models
  • Transformer Architecture: Foundation of HuBERT and others
  • Contrastive Learning: Learning by comparing examples
  • Knowledge Distillation: Training smaller models from larger ones
  • Fine-tuning vs Feature Extraction: How to use pre-trained models
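
The last distinction is easiest to see in code. The sketch below (the HuBERT backbone, the 8-class head, and the learning rates are illustrative assumptions) contrasts freezing a pre-trained model for feature extraction with fine-tuning it end to end:

# Sketch: feature extraction vs fine-tuning with a pre-trained backbone
import torch
from transformers import HubertModel

backbone = HubertModel.from_pretrained("facebook/hubert-base-ls960")
classifier = torch.nn.Linear(768, 8)        # e.g. an 8-class emotion head (illustrative)

# Feature extraction: freeze the backbone, train only the new head
for p in backbone.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)

# Fine-tuning: leave the backbone trainable too, usually with a smaller learning rate
# optimizer = torch.optim.Adam(list(backbone.parameters()) + list(classifier.parameters()), lr=1e-5)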

Evaluation Metrics:

  • WER (Word Error Rate): Primary ASR metric (a small computation sketch follows this list)
  • PESQ/STOI: Speech quality metrics
  • MOS (Mean Opinion Score): Human evaluation
  • RTF (Real-Time Factor): Processing speed
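
WER is simple enough to compute by hand with a word-level edit distance; the reference and hypothesis strings below are made-up examples:

# Sketch: Word Error Rate = (substitutions + deletions + insertions) / number of reference words
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution (or match)
    return dp[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion / 6 words ≈ 0.167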

8. Practical Applications & Current Challenges

Hot Topics:

  • Streaming/Online Processing: Real-time with minimal latency
  • Multilingual and Code-Switching: Handling multiple languages
  • Robust ASR: Noisy environments, far-field microphones
  • Efficient Models: Running on edge devices
  • Privacy-Preserving Speech Processing: Federated learning, encryption
  • Emotional/Paralinguistic: Beyond words to emotion, age, health
  • Speaker Diarization: Who spoke when
  • Speech Enhancement: Noise removal, dereverberation

Learning Path Recommendations

To not miss the big picture, I'd suggest understanding topics in this order:

  1. Basics: Spectrograms, Fourier transforms, fundamental audio properties
  2. Classical Features: MFCCs, filter banks, basic feature extraction
  3. Deep Learning Basics: CNNs, RNNs, Attention, Transformers
  4. Modern Speech Recognition: CTC, attention-based models, end-to-end systems
  5. Self-Supervised Learning: wav2vec 2.0, HuBERT, the motivation behind them
  6. Application-Specific: Choose based on interest (TTS, music, enhancement, etc.)

Key Papers/Resources to Explore

  • "Speech and Language Processing" by Jurafsky & Martin (textbook)
  • "Deep Learning" by Goodfellow, Bengio, Courville (for ML foundations)
  • Original papers: "Attention is All You Need", wav2vec 2.0, HuBERT
  • Kaldi documentation (even if using modern tools, great for concepts)
  • SpeechBrain or ESPnet tutorials (modern frameworks)

The field is moving rapidly - what's cutting-edge today might be standard tomorrow. Understanding both classical (MFCCs) and modern (HuBERT) approaches gives you the flexibility to choose the right tool for each problem!

