Sheldon (Sicong) Huang has completed his third year of undergraduate study at the University of Toronto and is currently spending a year as a research intern at the Vector Institute and Borealis AI, after which he will return for his fourth year. He joined the Vector Institute after his second year to work with Professor Roger Grosse, and joined Borealis AI as a research intern during his third year. His past work includes quantitative evaluation of generative models, CipherGAN (ICLR 2018), and musical style transfer (TimbreTron). His research interests span information theory, generative models, network compression, reinforcement learning, and cognitive science. He is also the founder and former president of the University of Toronto Machine Intelligence Student Team (UTMIST), where he currently serves as Scientific Advisor.
TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) Pipeline for Musical Timbre Transfer
In this work, we address the problem of musical timbre transfer, where the goal is to manipulate the timbre of a sound sample from one instrument so that it matches another instrument while preserving other musical content, such as pitch, rhythm, and loudness. In principle, one could apply image-based style transfer techniques to a time-frequency representation of an audio signal, but this depends on having a representation that allows independent manipulation of timbre as well as high-quality waveform generation. We introduce TimbreTron, an audio processing pipeline that combines three powerful ideas from different domains: the Constant Q Transform (CQT) spectrogram for audio representation, a variant of CycleGAN for timbre transfer, and a WaveNet-Synthesizer for high-quality audio generation. We verified that, both in principle and in practice, the CQT-based TimbreTron is better suited to this task than its STFT counterpart, even though STFT is the more commonly used audio representation. Based on human perceptual evaluations, we confirmed that TimbreTron transferred timbre recognizably while preserving the musical content.
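The abstract does not spell out why a CQT representation might suit this task better than an STFT. One commonly cited property of log-frequency representations is that a pitch change becomes an approximate translation along the frequency axis, which convolutional models tend to handle well. The sketch below only illustrates that property and is not the authors' implementation; the minimum frequency, the bins-per-octave setting, the linear bin width, and all function names are illustrative assumptions.

```python
import math


def cqt_center_frequencies(f_min, n_bins, bins_per_octave):
    """Geometrically spaced center frequencies: f_k = f_min * 2**(k / B)."""
    return [f_min * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]


def nearest_bin(freq, f_min, bins_per_octave):
    """Index of the log-spaced bin whose center is closest to freq."""
    return round(bins_per_octave * math.log2(freq / f_min))


F_MIN = 32.703        # roughly C1 in Hz (assumed, for illustration only)
BINS_PER_OCTAVE = 12  # one bin per semitone (assumed, for illustration only)
HZ_PER_BIN = 10.0     # hypothetical bin width of a linear, STFT-like axis

# First four harmonics of a 220 Hz note and of the same note one octave up.
harmonics_low = [220.0 * h for h in (1, 2, 3, 4)]
harmonics_high = [440.0 * h for h in (1, 2, 3, 4)]

bins_low = [nearest_bin(f, F_MIN, BINS_PER_OCTAVE) for f in harmonics_low]
bins_high = [nearest_bin(f, F_MIN, BINS_PER_OCTAVE) for f in harmonics_high]

# Log-frequency axis: every harmonic moves by the same number of bins.
log_shifts = [hi - lo for lo, hi in zip(bins_low, bins_high)]

# Linear axis: each harmonic moves by a different number of bins.
linear_shifts = [
    round((hi - lo) / HZ_PER_BIN)
    for lo, hi in zip(harmonics_low, harmonics_high)
]

print("log-frequency shifts:", log_shifts)
print("linear-frequency shifts:", linear_shifts)
```

On the log-spaced axis the octave jump moves the whole harmonic pattern by a constant offset, whereas on the linear axis the pattern stretches, so each harmonic shifts by a different amount.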