The goal of this project is to allow higher-level control of a sound synthesis model. We will have a corpus of sounds. At the lowest level, each sound is represented as a time series of the sampled waveform. Each sound can be analysed to give mid-level parameters, e.g. a log-mel-spectrogram or another time-frequency representation, and we can also collect labels such as roughness or happiness on a scale of 1-10 (call these “high-level parameters”).
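As a rough illustration of the mid-level representation, a log-mel-spectrogram can be computed with NumPy/SciPy along these lines (function names, FFT size, hop size, and mel count are our own illustrative choices, not a fixed API):

```python
import numpy as np
from scipy.signal import stft

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, fs):
    """Triangular filters spaced evenly on the mel scale."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_mels + 2)
    bins = np.floor((n_fft // 2) * mel_to_hz(mels) / (fs / 2.0)).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, ctr, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, ctr):
            fb[i, k] = (k - lo) / max(ctr - lo, 1)   # rising slope
        for k in range(ctr, hi):
            fb[i, k] = (hi - k) / max(hi - ctr, 1)   # falling slope
    return fb

def log_mel_spectrogram(x, fs, n_fft=512, hop=128, n_mels=40):
    _, _, S = stft(x, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    power = np.abs(S) ** 2                           # (n_fft//2 + 1, n_frames)
    mel = mel_filterbank(n_mels, n_fft, fs) @ power  # pool FFT bins into mel bands
    return np.log(mel + 1e-10)                       # (n_mels, n_frames)
```

Each column of the result is one analysis window's mid-level parameter vector; in practice a library implementation (e.g. librosa) would normally be used instead.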
We have a synthesis model that is controlled by “mid-level parameters”. Currently many machine-learning-based synthesis techniques use log mel-spectrograms to represent audio; this is a perceptually informed time-frequency representation. For synthesis it poses the challenge of reconstructing the phase. This can be done using existing signal-processing methods such as the Griffin-Lim algorithm (alternatively, a neural vocoder such as WaveNet can convert the spectrogram into a time-domain waveform). We will explore other representations of sound which may be more suitable for flexible single-source sound synthesis. Our synthesis model will generate phase in a deterministic manner.
We hypothesise that the synthesis model will reproduce transients and other fine temporal detail better than spectrogram inversion, giving better-sounding results.
Our goal is to control the synthesis model via high-level parameters: the user will directly manipulate high-level parameters, the UI will map these to mid-level parameters, and the synthesis model will then output audio.
The UI can be driven by a neural network whose input layer takes high-level parameters and whose output layer produces mid-level parameters. Initially it will be an “instantaneous” model, i.e. for each analysis window it will take the current values of the high-level parameters and output mid-level parameters for the same window. Later steps in the research could make it non-instantaneous but still causal, e.g. a convolutional network taking in multiple recent time steps, or a recurrent neural network such as an LSTM.
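The instantaneous mapping is just a function from one window's high-level vector to that window's mid-level vector. A forward pass of a small untrained MLP sketches the shapes involved (all dimensions, names, and the two example labels are hypothetical, and the weights here are random rather than learned):

```python
import numpy as np

N_HIGH, N_MEL, HIDDEN = 2, 40, 32  # e.g. (roughness, happiness) -> 40 mel bands

rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.1, (HIDDEN, N_HIGH))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(0.0, 0.1, (N_MEL, HIDDEN))
b2 = np.zeros(N_MEL)

def map_high_to_mid(high):
    """Instantaneous mapping: one window of high-level parameters in,
    one window of mid-level (mel) parameters out."""
    h = np.tanh(W1 @ high + b1)
    return W2 @ h + b2

high = np.array([7.0, 3.0])    # roughness = 7, happiness = 3 (hypothetical scales)
mid = map_high_to_mid(high)    # shape (N_MEL,): one spectrogram column
```

A causal, non-instantaneous variant would simply widen the input to the concatenation of the k most recent high-level windows (or replace the MLP with a causal convolution or LSTM).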
Variational autoencoders (VAEs), generative adversarial networks (GANs), and hybrids of the two are probably the state of the art in unsupervised learning. Google’s Magenta project has used such models to create new audio synthesiser spaces. WaveNet is an autoregressive convolutional generative model for audio, and Google’s NSynth is a WaveNet-style autoencoder which gives the user a control space. In NSynth the control dimensions are defined by interpolation between embeddings of multiple real instruments. The proposed research differs in that it chooses fixed high-level features as the control variables.