MM’06 Half-Day Tutorial
Computer Audition: An introduction and research survey
Music and Computing in the Arts
Department of Music / CALIT2
Computer audition can be considered the general field of audio understanding by machine. It differs from audio engineering in that it focuses on understanding audio rather than processing it, and it differs from related work on speech analysis in that it deals with general audio signals, such as natural sounds and music. The purpose of the proposed tutorial is to introduce the audience to research problems of computer audition, specifically addressing the existing semantic gap between human and machine levels of music understanding.
The formidable question of music understanding has been addressed over the years from many angles, some analytic or critical, others more creative. Musicians might be interested in enhancing the expressive possibilities of their art by looking into the content, description and structure of audio, from individual sounds to larger-scale forms to principles of organization in various musical styles. Researchers might be interested in using music as a case study for machine intelligence, in developing better tools for dealing with complex temporal signals, or in better understanding human cognition and intelligence. Recently, problems of music information retrieval have spurred new interest in musical information processing, offering concrete problems and practical algorithms. Applications such as music information retrieval, machine improvisation, automatic accompaniment, score following and automatic music analysis are different examples of computer audition that combine advanced signal and semantic processing.
Questions relating to human cognitive processing of music, such as modeling of emotions, musical memory and familiarity, and possibly additional factors such as aesthetic aspects and musical complexity, may be the most formidable aspects of computer audition research. Music, in its most general contemporary definition, spanning from highly stylized concert performances to audio art, vocalization, Foley effects and sound ecology, offers a unique glimpse into the intricate relations between auditory scenes and processes, their signal realizations and our brain.
The three-hour tutorial, although too short to cover the topics in great detail, will attempt to achieve two primary and complementary goals: first, to introduce practical tools for carrying out audio and music analysis research, including a survey of basic languages, toolboxes and software for handling audio and MIDI and an introduction to the research methods and basic underlying algorithms; and second, to survey current research and suggest further topics in research and creative applications that might be of interest to the tutorial audience. It should be noted that the field of computer audition, unlike the computer vision or speech fields mentioned earlier, suffers from a great lack of textbooks and tutorials, possibly due to the unbalanced and highly varying interests and technical backgrounds of its practitioners. One of the didactic purposes of the tutorial is to bridge this gap, assuming a generally knowledgeable audience with a background in multimedia signals and systems, but not much beyond that. The advanced topics should be of interest to more specialized participants as well.
One of the unique properties of musical signals is that they offer different types of representations, from notated scores, to performance actions in MIDI files, to audio recordings and human annotations. Accordingly, one of the central chapters of the tutorial will be devoted to techniques that match different types of representations, specifically methods for score alignment and annotation. The tutorial is thus divided into three main parts, progressively covering the different aspects of computer audition: from low-level signal and symbolic representations, through methods and techniques for finding correspondences between different representation schemes, up to new research methods and algorithms for analyzing and modeling the relations between feature statistics and human cognitive responses in music audition.
In the first part of the tutorial we will present an overview of different audio and music representations, considering parametric and non-parametric sound representations (filter banks, ARMA models, sinusoidal and residual models, and SDIF, the Sound Description Interchange Format), symbolic representations (MIDI and structured audio), and common features for audio classification purposes (Mel cepstrum, audio basis representations based on PCA and ICA, and chromagram and beat spectrum representations).
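The tutorial's worked examples use Matlab; as a self-contained illustration of one of the features named above, the following NumPy sketch computes a single chromagram frame by folding an FFT magnitude spectrum onto the 12 pitch classes. The function name and windowing choices are illustrative, not taken from any toolbox.

```python
import numpy as np

def chromagram_frame(frame, sr):
    """Fold the magnitude spectrum of one frame onto 12 pitch classes (C=0 ... B=11)."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    chroma = np.zeros(12)
    for f, m in zip(freqs[1:], spec[1:]):          # skip the DC bin
        midi = 69 + 12 * np.log2(f / 440.0)        # frequency -> MIDI pitch number
        chroma[int(round(midi)) % 12] += m         # fold into one octave
    return chroma / (chroma.sum() + 1e-12)

sr = 22050
t = np.arange(2048) / sr
a4 = np.sin(2 * np.pi * 440.0 * t)                 # a pure A4 tone
print(np.argmax(chromagram_frame(a4, sr)))         # pitch class 9 = A
```

Stacking such frames over time yields the chromagram used later for score alignment.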
The second part of the tutorial will discuss methods that combine symbolic and signal representations and processing, such as score alignment (matching MIDI to audio) using dynamic programming, acoustic likelihood and distance measures, and multi-pitch detection and transcription. New applications that use a combination of signal processing with prior symbolic information will be discussed, such as source separation, score following and hybrid signal-and-symbolic coding.
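The dynamic-programming alignment mentioned above is typically a dynamic time warping (DTW) recursion over frame-to-frame distances. A minimal NumPy sketch (illustrative only; real score alignment would use chroma or likelihood features rather than raw values):

```python
import numpy as np

def dtw(X, Y):
    """Dynamic time warping cost between two feature sequences (rows = frames).
    Backtracking through D would recover the alignment path itself."""
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])   # local frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

X = np.array([[0.], [1.], [2.], [3.]])
print(dtw(X, X))             # identical sequences align with zero cost
Y = np.array([[0.], [0.], [1.], [2.], [2.], [3.]])
print(dtw(X, Y))             # warping absorbs the repeated frames: still zero
```

Replacing the Euclidean local cost with an acoustic likelihood of the audio frame given the MIDI event gives the score-alignment variant discussed in this part.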
The last part of the tutorial will discuss recent work by the instructor and colleagues on analyzing the correspondence between statistical properties of audio features and human cognitive responses when listening to music, such as emotional force and familiarity. This includes investigation of signal prediction properties, generalization of signal complexity measures such as spectral flatness to the case of feature vectors, and introduction of novel features that describe the structure of audio signals in terms of recurrence patterns and clustering/segmentation properties of the signal's spectral similarity matrix. Applications of the methods to audio monitoring, segmentation and computer music listening will be presented.
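As a baseline for the complexity measures discussed here, the classical (scalar) spectral flatness is the ratio of the geometric to the arithmetic mean of the power spectrum: it approaches 1 for white noise and 0 for a pure tone. A minimal NumPy sketch (the generalization to feature vectors and non-Gaussian processes covered in the tutorial goes beyond this):

```python
import numpy as np

def spectral_flatness(power_spectrum):
    """Geometric mean / arithmetic mean of a power spectrum, in (0, 1]."""
    p = np.asarray(power_spectrum) + 1e-12       # guard against log(0)
    return np.exp(np.mean(np.log(p))) / np.mean(p)

rng = np.random.default_rng(0)
noise = np.abs(np.fft.rfft(rng.standard_normal(4096))) ** 2
tone = np.abs(np.fft.rfft(np.sin(2 * np.pi * 440 * np.arange(4096) / 22050))) ** 2
print(spectral_flatness(noise) > spectral_flatness(tone))   # True: noise is flatter
```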
Topics to be covered (by the hour):
• What is computer audition
• Survey of music perception and cognition
• Audio Representation basics
Digital audio, sampling, bit depth
MP3 reading in Matlab
• Fourier Analysis
Filter Banks and STFT analysis
Conditions for perfect reconstruction
• Pattern Playback
Signal reconstruction from partial information
Short time amplitude modification
• Sound Description:
Sinusoidal and Sinusoidal + Noise Models
Source-Filter models, Speech example
Pitch detection and Voicing
• MIDI definitions, tools for reading and writing MIDI in Matlab
• Simple synthesis methods (Unit generators, Wavetable, FM instruments)
• Distance measures for audio and MIDI:
Comparison between signals
Likelihood of signal given MIDI information
Score based signal processing
• Signal Alignment Methods:
Dynamic Time Warping between audio signals
Hidden Markov Models (HMM)
• Signal Features:
Spectral Envelope and Pitch
Cepstral analysis, Auditory representations, Mel Cepstrum
• Musical Content Features:
Chromagram, Pitch Histograms, Beat Spectrum
Use of Chromagram in score alignment
• Latent Semantic Analysis
Principal Component Analysis (PCA)
Independent Component Analysis (ICA)
• Recurrence Analysis (also known as Self-Similarity):
Audio segmentation and Spectral Clustering
Audio Texture synthesis
• Anticipation and Musical Form
Spectrum and Entropy
Information rate and predictive information
Appraisal model of emotions
• Style Modeling
Dictionary based prediction
String Matching and Factor Oracle
Examples of Machine Improvisation
• Conclusions and Future research
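The recurrence (self-similarity) analysis listed above starts from the pairwise similarity matrix of a feature sequence, whose block structure reveals repeated sections for segmentation and texture synthesis. A minimal cosine-similarity sketch with synthetic "sections" (names and data are illustrative):

```python
import numpy as np

def self_similarity(F):
    """Cosine self-similarity matrix of a feature sequence (rows = frames)."""
    Fn = F / (np.linalg.norm(F, axis=1, keepdims=True) + 1e-12)
    return Fn @ Fn.T

# two alternating "sections" produce a visible block / checkerboard structure
A = np.tile([1.0, 0.0], (4, 1))        # frames of section A
B = np.tile([0.0, 1.0], (4, 1))        # frames of section B
S = self_similarity(np.vstack([A, B, A]))
print(S[0, 0], S[0, 4], S[0, 8])       # high, low, high: A recurs after B
```

Clustering or singular-value analysis of S then yields the segmentation methods covered in the last hour.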
Each part of the tutorial is planned for 50 minutes, with a 10-minute break between the three parts.
Examples will be demonstrated using Matlab, and a CD-ROM with programs will be handed out to tutorial participants.
Tutorial at ISMIR 2002: Modern Methods for Statistical Audio Signal Processing and Characterization, http://music.ucsd.edu/~sdubnov/ISMIRv2b.htm
Course “Music in Matlab” http://music.ucsd.edu/~sdubnov/Mu176/
Relevant publications by the instructor
1. P. Herrera-Boyer, G. Peeters, S. Dubnov, "Automatic Classification of Musical Instrument Sounds", Journal of New Music Research 2003, Vol. 32, No. 1, pp. 3–21.
2. S. Dubnov, "Generalization of Spectral Flatness Measure for Non-Gaussian Processes", IEEE Signal Processing Letters, 11 (8), pp. 698 - 701, Aug. 2004.
3. J. Tabrikian, S. Dubnov, E. Fisher, "Generalized Likelihood Ratio Test for Voiced-Unvoiced Decision in Noisy Speech Using the Harmonic Model", IEEE Transactions on Audio, Speech and Language Processing, Vol. 14, No. 2, March 2006, pp. 502–510.
4. Ben-Shalom, S. Dubnov, “Optimal Filtering of an Instrument Sound in a Mixed Recording Using Harmonic Model and Score Alignment”, Proceedings of International Computer Music Conference, November 2004, Miami.
5. S. Shalev-Shwartz, S. Dubnov, N. Friedman and Y. Singer, "Robust Temporal and Spectral Modeling for Query by Melody", Proceedings of the 25th Conference on Research and Development in Information Retrieval (SIGIR), 2002.
6. S. Dubnov, T. Apel “Audio Segmentation by Singular Value Clustering”, in Proceedings of International Computer Music Conference, November 2004, Miami.
7. S. Dubnov, S. McAdams, R. Reynolds, “Structural and Affective Aspects of Music from Statistical Audio Signal Analysis”, to appear in Journal of the American Society for Information Science and Technology, Special Issue on Style, 2006.
8. S. Dubnov, “Spectral Anticipations”, Computer Music Journal, MIT Press, Summer 2006.
Other relevant publications
1. Dannenberg, "Listening to 'Naima': An Automated Structural Analysis of Music from Recorded Audio", in Proceedings of the 2002 International Computer Music Conference, San Francisco: International Computer Music Association, 2002.
2. Hu, Dannenberg and Tzanetakis, "Polyphonic Audio Matching and Alignment for Music Retrieval", in 2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New York: IEEE, 2003, pp. 185–188.
3. G. Tzanetakis and P. Cook, "Musical Genre Classification of Audio Signals", IEEE Transactions on Speech and Audio Processing, 10(5), July 2002.
4. Toiviainen, P. (2005). Visualization of tonal content with self-organizing maps and self-similarity matrices. ACM Computers in Entertainment, 3(4)
5. X. Rodet, "Musical Sound Signal Analysis/Synthesis: Sinusoidal + Residual and Elementary Waveform Models", in IEEE Time-Frequency and Time-Scale Workshop, Coventry, UK, 1997.
6. M. Casey, "MPEG-7 Sound Recognition", IEEE Transactions on Circuits and Systems for Video Technology, special issue on MPEG-7, May/June 2001.
Shlomo Dubnov is an Associate Professor of music technology at UCSD. Prior to this he was a researcher at the Institute for Research and Coordination of Acoustics and Music (IRCAM) in Paris and a senior lecturer in the Department of Communication Systems Engineering at Ben-Gurion University in Israel. He holds a PhD in Computer Science from the Hebrew University of Jerusalem. His work on polyspectral analysis of musical timbre and his research on machine learning of musical style are widely acknowledged in the computer music community. He has served as co-PI on several projects dealing with semantic analysis of audio, such as the recent EU-sponsored "Semantic HiFi" project. He is currently co-editing a book, "The Structure of Style: Algorithmic Approaches to Understanding Manner and Meaning", and working on a textbook on semantic audio processing.