Intermediate Speech Representations for Low-Resource Speech Models
Feature extraction and intermediate speech representations are important components of speech processing systems. Many different approaches exist, e.g. wav2vec2, Mockingjay, TERA, and autoregressive methods. Beyond taking the raw input waveform, basic feature extraction methods (e.g. MFCCs, log mel spectrograms) are widely used in models. Some methods also create intermediate representations that support self-supervised learning, allowing base models trained without labels to be fine-tuned and applied to a variety of different prediction tasks and target outputs. Understanding and explaining why these methods work well on some downstream tasks but not others has not been well studied across speech objectives such as phoneme recognition, speaker identification, speech recognition, language identification, spoken language understanding, speech translation, emotion recognition, voice conversion, and speech synthesis.
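As a concrete illustration, the sketch below computes two of these classical features (a log mel spectrogram and MFCCs) and extracts learned intermediate representations from a pretrained wav2vec2 model for a single utterance. It is a minimal example assuming torchaudio and its pretrained WAV2VEC2_BASE pipeline are available; the file name and frame parameters are placeholders rather than settings chosen by this project.

```python
import torch
import torchaudio

# Load a mono waveform; the file name is a placeholder.
waveform, sample_rate = torchaudio.load("utterance.wav")

# Log mel spectrogram: 80 mel bins, 25 ms window and 10 ms hop at 16 kHz.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=400, hop_length=160, n_mels=80
)
log_mel = torchaudio.transforms.AmplitudeToDB()(mel_transform(waveform))  # (1, 80, frames)

# MFCCs: 13 cepstral coefficients over the same mel filter bank.
mfcc = torchaudio.transforms.MFCC(
    sample_rate=sample_rate, n_mfcc=13,
    melkwargs={"n_fft": 400, "hop_length": 160, "n_mels": 80},
)(waveform)  # (1, 13, frames)

# Learned intermediate representations from a pretrained wav2vec2 model:
# one (1, frames, 768) tensor per transformer layer.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()
if sample_rate != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)
with torch.inference_mode():
    layer_features, _ = model.extract_features(waveform)
```

Any of these representations can then be fed to a downstream model, which is the axis of comparison the project is concerned with.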
This project will adapt state-of-the-art deep learning architectures to improve existing speech representation methods. New methods emerging from computer vision (CV) and natural language processing (NLP) will be reviewed for cross-domain inspiration. Datasets will be sourced to fine-tune models with varying amounts of labelled data. This will characterise the relationship between fine-tuning dataset size and the chosen representation, highlighting the potential for application to low-resource speech tasks. A better understanding of how the representations relate to the training data, for both the initial frozen model and the fine-tuned models, will help inform model and data choices across different classes of speech models.
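The fine-tuning setups described above could, for example, take the form sketched below: a frozen pretrained encoder with a small trainable classification head, trained on whatever labelled subset is available. This is a hypothetical sketch assuming a torchaudio wav2vec2 encoder; the label set, pooling strategy, and hyperparameters are placeholder choices, not the project's final design.

```python
import torch
import torchaudio

# Frozen pretrained encoder; only the small linear head is trained on the
# labelled subset. n_classes, the data, and hyperparameters are placeholders.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
encoder = bundle.get_model().eval()
for p in encoder.parameters():
    p.requires_grad = False

n_classes = 10                                  # e.g. speaker or language labels
head = torch.nn.Linear(768, n_classes)          # 768 = WAV2VEC2_BASE hidden size
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

def train_step(waveforms, labels):
    """One update on a batch of equal-length 16 kHz waveforms (batch, time)
    with integer class labels of shape (batch,)."""
    with torch.no_grad():                        # keep the encoder frozen
        layer_features, _ = encoder.extract_features(waveforms)
    pooled = layer_features[-1].mean(dim=1)      # average over time -> (batch, 768)
    loss = loss_fn(head(pooled), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Repeating such a run while varying the size of the labelled subset (and the layer from which representations are taken) is one straightforward way to probe the relationship between fine-tuning dataset size and the chosen representation described above.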