Aaron Dees

Student

Project Title

Neural Audio Synthesis with Meaningful Human Control for Creative Expression

Project Description

Machine learning has recently found its way into signal processing and sound synthesis, and is fast becoming a powerful tool there. Neural Audio Synthesis is the name given to efforts to apply neural networks to sound synthesis. Sound synthesis traditionally takes one of two approaches: physical modelling or signal modelling. Signal models are general, in that one model can be applied to a wide array of sounds. Their disadvantage is that convincing synthesis requires many parameters, along with expert knowledge of how to tune them, which makes controlling the sound in an intuitive way difficult. Recent research has shown that neural networks can provide appropriate solutions to some of these issues with traditional signal models.
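To make the parameter problem concrete, below is a minimal sketch of harmonic additive synthesis, one of the simplest signal models. The function name and the values used are illustrative assumptions, not taken from any particular system.

import numpy as np

def additive_synth(f0, harmonic_amps, sr=16000, duration=1.0):
    # Harmonic additive synthesis: a classic signal model.
    # f0            -- fundamental frequency in Hz
    # harmonic_amps -- one amplitude per harmonic
    t = np.arange(int(sr * duration)) / sr
    out = np.zeros_like(t)
    for k, amp in enumerate(harmonic_amps, start=1):
        out += amp * np.sin(2 * np.pi * k * f0 * t)
    return out / max(np.abs(out).max(), 1e-9)  # normalise to avoid clipping

# Even this static model needs one amplitude per harmonic.
tone = additive_synth(220.0, harmonic_amps=np.ones(40) / 40)

Even a static 40-harmonic tone needs 40 amplitude parameters; making each amplitude time-varying, as a convincing note requires, multiplies the parameter count and is what pushes signal models towards the expert tuning described above.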

Neural networks are often considered black boxes: they learn the details of a dataset but can be difficult to interpret. The proposed project is an investigation into generative models for Neural Audio Synthesis with meaningful human control for creative expression. Meaningful human control is characterised in terms of intention and perceptual effect; creative expression refers to the capacity for performative interaction. In short, we want to leverage the expressiveness of neural networks while introducing a level of interpretability and controllability that allows meaningful interaction.

The main parameters of musical sound are pitch, intensity and timbre. Pitch correlates with fundamental frequency and intensity with amplitude, but timbre has no such correlate. The word timbre refers to the perceived quality of a sound: the attribute that distinguishes sounds of similar pitch and intensity, often described as multidimensional. The timbre dimensions identified in previous research are useful, but they describe sounds rather than define them. We propose a data-driven approach to create a space that includes these perceptually important dimensions, while also being able to synthesise sounds from a reduced space.

While the space will include intuitive dimensions, its dimensionality will still be too high for uninformed interactive use. To counter this, we propose placing landmarks in the space, i.e. sounds for which some familiarity can be expected (such as natural sounds), together with the ability to audition sounds anywhere in the space. Moreover, we want the synthesis to be interactive: rather than producing a sound of fixed pitch, loudness and timbre for a given point, the model should offer continuous temporal control over these attributes. This is a challenging problem, as timbre is not independent of pitch and intensity. A starting point will be an investigation into the synthesis of perceptual spaces [1]. Additionally, we will investigate how interpretability can be extracted from models to allow more meaningful control and understanding of learned features.
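As a rough illustration of the kind of perceptual regularisation explored in [1], the sketch below adds a term to a standard VAE loss that pulls pairwise latent distances towards pairwise perceptual dissimilarity ratings. This is a sketch under stated assumptions, not the exact method of [1]: the weights beta and gamma, and the availability of a precomputed dissimilarity matrix, are illustrative.

import torch
import torch.nn.functional as F

def perceptual_regularizer(z, perceptual_dist):
    # Pull pairwise latent distances towards a target (n, n) matrix
    # of perceptual dissimilarity ratings (in the spirit of [1]).
    latent_dist = torch.cdist(z, z)
    return ((latent_dist - perceptual_dist) ** 2).mean()

def vae_loss(recon, x, mu, logvar, z, perceptual_dist, beta=1.0, gamma=0.1):
    # Standard VAE objective plus the perceptual term;
    # beta and gamma are illustrative weights, not values from [1].
    recon_term = F.mse_loss(recon, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_term + beta * kl + gamma * perceptual_regularizer(z, perceptual_dist)

Regularising in this way encourages the learned space to align with the perceptual dimensions discussed above, so that movement through the space corresponds to a predictable change in timbre.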

The overall aim is to develop methods for making Neural Audio Synthesis models more expressive and controllable. The project will leverage state-of-the-art generative modelling techniques, e.g. autoencoders and GANs, as well as interpretability techniques such as model conditioning and regularisation, among others.
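As one sketch of what conditioning could look like in this setting, the decoder below takes pitch and loudness as explicit per-frame control inputs alongside a latent timbre code, in the spirit of DDSP [2]. All layer sizes, names and shapes are illustrative assumptions rather than any published architecture.

import torch
import torch.nn as nn

class ControllableDecoder(nn.Module):
    # A decoder conditioned on explicit controls: pitch (f0) and
    # loudness are inputs the user can drive continuously, while a
    # latent code z carries timbre. The output is a distribution
    # over harmonic amplitudes that could drive a harmonic synthesiser.
    def __init__(self, latent_dim=16, hidden=256, n_harmonics=60):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 2, hidden),  # z + (f0, loudness)
            nn.ReLU(),
            nn.Linear(hidden, n_harmonics),
        )

    def forward(self, z, f0, loudness):
        ctrl = torch.cat([z, f0, loudness], dim=-1)
        return torch.softmax(self.net(ctrl), dim=-1)

decoder = ControllableDecoder()
z = torch.randn(1, 16)  # timbre code, e.g. from an encoder
amps = decoder(z, torch.tensor([[220.0]]), torch.tensor([[0.5]]))

Because pitch and loudness enter as explicit inputs rather than being entangled in the latent code, they remain continuously controllable at synthesis time, which is exactly the kind of meaningful interaction the project targets.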

[1] Esling, P., Chemla-Romeu-Santos, A., Bitton, A.: Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv:1805.08501 [cs, eess] (2018). URL http://arxiv.org/abs/1805.08501

[2] Engel, J., Hantrakul, L.H., Gu, C., Roberts, A.: DDSP: Differentiable Digital Signal Processing (2019). URL https://openreview.net/forum?id=B1x1ma4tDr