Paper Session 8: Signal Processing II
Alexandre Francois: “Real-Time, Low-Latency, High Resolution Audio Spectral Analysis: Phase Matters”
Robert Esler: “Pd++: A C++ Library of Pure Data’s DSP Objects”
Pd++ is a real-time C++ audio synthesis library that implements Pure Data’s DSP (digital signal processing) objects as C++ classes, making it usable with object-oriented programming languages like C++, Java, or C#. The library has been designed to follow logic and naming conventions similar to those of Pure Data. It includes bindings for Java, which allow the library to work with the Processing development environment, and for C#, which provide a native code interface to the Unity game engine. Pd++ has also been extensively tested on all major operating systems, including iOS and Android, on single-board computers like the Raspberry Pi, and with C++-based Application Programming Interfaces (APIs) such as Unreal Engine, Wwise, JUCE and FMOD. In this article the author presents how the library works in design, practice and philosophy, its perceived workflow as a design and educational tool, and future developments for Pd++.
Tian Cheng, Tomoyasu Nakano and Masataka Goto: “Exploring Masked CE Losses to Enhance Word Offset Estimation in CTC-based Lyrics-to-Audio Alignment”
Lyrics-to-audio alignment is an important task for real-world applications such as karaoke systems. Although alignment performance has improved with the release of large datasets and the use of advanced deep learning models, accurate word offset estimation remains challenging. To address this problem, we extend our previously proposed masked cross-entropy (CE) loss by proposing new masks that enforce model predictions at masked frames, using frame-wise phoneme labels derived from word-level annotations. We train a Convolutional Recurrent Neural Network (CRNN) using both the masked CE loss and the Connectionist Temporal Classification (CTC) loss. By comparing the results obtained with different masks in the masked CE loss, we find that word offset estimation performance is improved by masks that cover all silent frames. In addition, we find that masks on word onset frames are essential for improving word onset estimation performance. We achieve comparable word onset estimation results and provide benchmark word offset estimation results for future research.
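The idea of a masked CE loss can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 0/1 mask encoding, averaging over the selected frames, and the weighted sum used to combine the result with a CTC loss value are all assumptions made for illustration, since the abstract does not specify them. The sketch computes cross-entropy only at frames selected by a mask (for example, silent frames or word onset frames) and leaves the remaining frames to be supervised by CTC alone.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one frame's class logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def masked_ce_loss(frame_logits, frame_labels, mask):
    """Mean cross-entropy over the frames where mask is truthy.

    frame_logits: list of per-frame logit lists (T x C)
    frame_labels: list of T integer class labels (frame-wise phoneme labels)
    mask:         list of T flags; 1 = frame contributes, 0 = frame ignored
    Returns 0.0 when no frame is selected (an assumption of this sketch).
    """
    total = 0.0
    count = 0
    for logits, label, keep in zip(frame_logits, frame_labels, mask):
        if keep:
            total += -log_softmax(logits)[label]
            count += 1
    return total / count if count else 0.0


def combined_loss(ctc_loss, ce_loss, ce_weight=1.0):
    """Hypothetical weighted sum of a precomputed CTC loss and the masked CE loss."""
    return ctc_loss + ce_weight * ce_loss


# Toy example: 3 frames, 2 classes; the middle frame is excluded by the mask.
logits = [[0.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
labels = [0, 0, 1]
mask = [1, 0, 1]
ce = masked_ce_loss(logits, labels, mask)
total = combined_loss(ctc_loss=1.5, ce_loss=ce, ce_weight=0.5)
```

Changing which frames the mask selects changes which frame-wise labels are enforced, which is the variable the abstract describes comparing.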
