MM’06 Half-Day Tutorial
Computer Audition: An introduction and research survey
Music and Computing in the Arts
Department of Music / CALIT2
Computer audition can be considered the general field of audio understanding by machine. It differs from audio engineering in that it focuses on understanding audio rather than processing it, and it differs from related work on speech analysis in that it deals with general audio signals, such as natural sounds and music. The purpose of the proposed tutorial is to introduce the audience to research problems of computer audition, specifically addressing the existing semantic gap between human and machine levels of music understanding.
The formidable question of music understanding has been addressed over the years from many angles, some analytic or critical, others more creative. Musicians might be interested in enhancing the expressive possibilities of their art by looking into the content, description and structure of audio, from individual sounds to larger-scale forms to principles of organization in various musical styles. Researchers might be interested in using music as a case study for machine intelligence, in developing better tools for dealing with complex temporal signals, or in better understanding human cognition and intelligence. Recently, problems of music information retrieval have spurred new interest in musical information processing, offering concrete problems and practical algorithms. Applications such as music information retrieval, machine improvisation, automatic accompaniment, score following and automatic music analysis are different examples of computer audition that combine advanced signal and semantic processing.
Questions relating to human cognitive processing of music, such as modeling of emotions, musical memory and familiarity, and possibly additional factors such as aesthetic aspects and musical complexity, may be the most formidable aspects of computer audition research. Music, in its most general contemporary definition, spanning from highly stylized concert performances to audio art, vocalization, Foley effects and sound ecology, offers a unique glimpse into the intricate relations between auditory scenes and processes, their signal realizations and our brain.
The three-hour tutorial, although too short to cover the topics in great detail, will attempt to achieve two primary and complementary goals: first, to introduce practical tools for carrying out audio and music analysis research, including a survey of basic languages, toolboxes and software for handling audio and MIDI and an introduction to the research methods and basic underlying algorithms; and second, to survey current research and suggest further topics in research and creative applications that might be of interest to the tutorial audience. It should be noted that the field of computer audition, unlike the computer vision or speech fields mentioned earlier, suffers from a great lack of textbooks and tutorials, possibly due to the unbalanced and highly varying interests and technical backgrounds of its practitioners. One of the didactic purposes of the tutorial is to bridge this gap, assuming a generally knowledgeable audience with a background in multimedia signals and systems, but not much beyond that. The advanced topics should be of interest to more specialized participants as well.
One of the unique properties of musical signals is that they offer different types of representations, from notated scores, to performance actions in MIDI files, to audio recordings and human annotations. Accordingly, one of the central chapters of the tutorial will be devoted to techniques that match different types of representations, specifically methods for score alignment and annotation. The tutorial is thus divided into three main parts, progressively covering the different aspects of computer audition: from low-level signal and symbolic representations, through methods and techniques for finding correspondences between different representation schemes, up to new research methods and algorithms for analyzing and modeling the relations between feature statistics and human cognitive responses in music audition.
In the first part of the tutorial we will present an overview of different audio and music representations, considering parametric and non-parametric sound representations (filter banks, ARMA models, sinusoidal and residual models, and SDIF, the Sound Description Interchange Format), symbolic representations (MIDI and structured audio), and common features for audio classification purposes (Mel cepstrum, audio basis representations based on PCA and ICA, and chromagram and beat spectrum representations).
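The tutorial's worked examples use Matlab; as a self-contained illustration of one of the features named above, the following NumPy sketch computes a single chromagram frame by folding an FFT magnitude spectrum onto the 12 pitch classes. The function name and windowing choices are illustrative, not taken from any toolbox.

```python
import numpy as np

def chromagram_frame(frame, sr):
    """Fold the magnitude spectrum of one frame onto 12 pitch classes (C=0 ... B=11)."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    chroma = np.zeros(12)
    for f, m in zip(freqs[1:], spec[1:]):          # skip the DC bin
        midi = 69 + 12 * np.log2(f / 440.0)        # frequency -> MIDI pitch number
        chroma[int(round(midi)) % 12] += m         # fold into one octave
    return chroma / (chroma.sum() + 1e-12)

sr = 22050
t = np.arange(2048) / sr
a4 = np.sin(2 * np.pi * 440.0 * t)                 # a pure A4 tone
print(np.argmax(chromagram_frame(a4, sr)))         # pitch class 9 = A
```

Stacking such frames over time yields the chromagram used later for score alignment.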
The second part of the tutorial will discuss methods that combine symbolic and signal representations and processing, such as score alignment (matching MIDI to audio) using dynamic programming, acoustic likelihood and distance measures, and multi-pitch detection and transcription. New applications that use a combination of signal processing with prior symbolic information will be discussed, such as source separation, score following and hybrid signal-and-symbolic coding.
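The dynamic-programming alignment mentioned above is typically a dynamic time warping (DTW) recursion over frame-to-frame distances. A minimal NumPy sketch (illustrative only; real score alignment would use chroma or likelihood features rather than raw values):

```python
import numpy as np

def dtw(X, Y):
    """Dynamic time warping cost between two feature sequences (rows = frames).
    Backtracking through D would recover the alignment path itself."""
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])   # local frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

X = np.array([[0.], [1.], [2.], [3.]])
print(dtw(X, X))             # identical sequences align with zero cost
Y = np.array([[0.], [0.], [1.], [2.], [2.], [3.]])
print(dtw(X, Y))             # warping absorbs the repeated frames: still zero
```

Replacing the Euclidean local cost with an acoustic likelihood of the audio frame given the MIDI event gives the score-alignment variant discussed in this part.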
The last part of the tutorial will discuss recent work by the instructor and colleagues on analyzing the correspondence between statistical properties of audio features and human cognitive responses when listening to music, such as emotional force and familiarity. This includes investigation of signal prediction properties, generalization of signal complexity measures such as spectral flatness to the case of feature vectors, and introduction of novel features that describe the structure of audio signals in terms of recurrence patterns and clustering/segmentation properties of the signal's spectral similarity matrix. Applications of the methods to audio monitoring, segmentation and computer music listening will be presented.
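As a baseline for the complexity measures discussed here, the classical (scalar) spectral flatness is the ratio of the geometric to the arithmetic mean of the power spectrum: it approaches 1 for white noise and 0 for a pure tone. A minimal NumPy sketch (the generalization to feature vectors and non-Gaussian processes covered in the tutorial goes beyond this):

```python
import numpy as np

def spectral_flatness(power_spectrum):
    """Geometric mean / arithmetic mean of a power spectrum, in (0, 1]."""
    p = np.asarray(power_spectrum) + 1e-12       # guard against log(0)
    return np.exp(np.mean(np.log(p))) / np.mean(p)

rng = np.random.default_rng(0)
noise = np.abs(np.fft.rfft(rng.standard_normal(4096))) ** 2
tone = np.abs(np.fft.rfft(np.sin(2 * np.pi * 440 * np.arange(4096) / 22050))) ** 2
print(spectral_flatness(noise) > spectral_flatness(tone))   # True: noise is flatter
```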
Topics to be covered (by the hour):
• What is computer audition
• Survey of music perception and cognition
• Audio Representation basics
Digital audio, sampling, bit depth
MP3 reading in Matlab
• Fourier Analysis
Filter Banks and STFT analysis
Conditions for perfect reconstruction
• Pattern Playback
Signal reconstruction from partial information
Short time amplitude modification
• Sound Description:
Sinusoidal and Sinusoidal + Noise Models
Source-Filter models, Speech example
Pitch detection and Voicing
• MIDI definitions, tools for reading and writing MIDI in Matlab
• Simple synthesis methods (Unit generators, Wavetable, FM instruments)
• Distance measures for audio and MIDI:
Comparison between signals
Likelihood of signal given MIDI information
Score based signal processing
• Signal Alignment Methods:
Dynamic Time Warping between audio signals
Hidden Markov Models (HMM)
• Signal Features:
Spectral Envelope and Pitch
Cepstral analysis, Auditory representations, Mel Cepstrum
• Musical Content Features:
Chromagram, Pitch Histograms, Beat Spectrum
Use of Chromagram in score alignment
• Latent Semantic Analysis
Principal Component Analysis (PCA)
Independent Component Analysis (ICA)
• Recurrence Analysis (also known as Self-Similarity):
Audio segmentation and Spectral Clustering
Audio Texture synthesis
• Anticipation and Musical Form
Spectrum and Entropy
Information rate and predictive information
Appraisal model of emotions
• Style Modeling
Dictionary based prediction
String Matching and Factor Oracle
Examples of Machine Improvisation
• Conclusions and Future research
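The recurrence (self-similarity) analysis listed above starts from the pairwise similarity matrix of a feature sequence, whose block structure reveals repeated sections for segmentation and texture synthesis. A minimal cosine-similarity sketch with synthetic "sections" (names and data are illustrative):

```python
import numpy as np

def self_similarity(F):
    """Cosine self-similarity matrix of a feature sequence (rows = frames)."""
    Fn = F / (np.linalg.norm(F, axis=1, keepdims=True) + 1e-12)
    return Fn @ Fn.T

# two alternating "sections" produce a visible block / checkerboard structure
A = np.tile([1.0, 0.0], (4, 1))        # frames of section A
B = np.tile([0.0, 1.0], (4, 1))        # frames of section B
S = self_similarity(np.vstack([A, B, A]))
print(S[0, 0], S[0, 4], S[0, 8])       # high, low, high: A recurs after B
```

Clustering or singular-value analysis of S then yields the segmentation methods covered in the last hour.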
Each part of the tutorial is planned for 50 minutes, with a 10-minute break between the three parts.
Examples will be demonstrated using Matlab, and a CD-ROM with programs will be handed out to tutorial participants.
Tutorial at ISMIR 2002: Modern Methods for Statistical Audio Signal Processing and Characterization, http://music.ucsd.edu/~sdubnov/ISMIRv2b.htm
Course “Music in Matlab” http://music.ucsd.edu/~sdubnov/Mu176/
Relevant publications by the instructor
1. P. Herrera-Boyer, G. Peeters, S. Dubnov, "Automatic Classification of Musical Instrument Sounds", Journal of New Music Research 2003, Vol. 32, No. 1, pp. 3–21.
2. S. Dubnov, "Generalization of Spectral Flatness Measure for Non-Gaussian Processes", IEEE Signal Processing Letters, 11 (8), pp. 698 - 701, Aug. 2004.
3. J. Tabrikian, S. Dubnov, E. Fisher, "Generalized Likelihood Ratio Test for Voiced-Unvoiced Decision in Noisy Speech Using the Harmonic Model", IEEE Transactions on Audio, Speech and Language Processing, Vol. 14, No. 2, March 2006, pp. 502–510.
4. Ben-Shalom, S. Dubnov, “Optimal Filtering of an Instrument Sound in a Mixed Recording Using Harmonic Model and Score Alignment”, Proceedings of International Computer Music Conference, November 2004, Miami.
5. S. Shalev-Shwartz, S. Dubnov, N. Friedman and Y. Singer, "Robust Temporal and Spectral Modeling for Query by Melody", Proceedings of the 25th Conference on Research and Development in Information Retrieval (SIGIR), 2002.
6. S. Dubnov, T. Apel “Audio Segmentation by Singular Value Clustering”, in Proceedings of International Computer Music Conference, November 2004, Miami.
7. S. Dubnov, S. McAdams, R. Reynolds, “Structural and Affective Aspects of Music from Statistical Audio Signal Analysis”, to appear in Journal of the American Society for Information Science and Technology, Special Issue on Style, 2006.
8. S. Dubnov, “Spectral Anticipations”, Computer Music Journal, MIT Press, Summer 2006.
Other relevant publications
1. Dannenberg, "Listening to 'Naima': An Automated Structural Analysis of Music from Recorded Audio", in Proceedings of the 2002 International Computer Music Conference, San Francisco: International Computer Music Association, 2002.
2. Hu, Dannenberg and Tzanetakis, "Polyphonic Audio Matching and Alignment for Music Retrieval", in 2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New York: IEEE, 2003, pp. 185–188.
3. G. Tzanetakis and P. Cook, "Musical Genre Classification of Audio Signals", IEEE Transactions on Speech and Audio Processing, 10(5), July 2002.
4. Toiviainen, P. (2005). Visualization of tonal content with self-organizing maps and self-similarity matrices. ACM Computers in Entertainment, 3(4)
5. X. Rodet, "Musical Sound Signal Analysis/Synthesis: Sinusoidal + Residual and Elementary Waveform Models", in IEEE Time-Frequency and Time-Scale Workshop, Coventry, UK, 1997.
6. M. Casey, "MPEG-7 Sound Recognition", IEEE Transactions on Circuits and Systems for Video Technology, special issue on MPEG-7, May/June 2001.
Shlomo Dubnov is an Associate Professor of music technology at UCSD. Prior to this he was a researcher at the Institute for Research and Coordination of Acoustics and Music (IRCAM) in Paris and a senior lecturer in the Department of Communication Systems Engineering at Ben-Gurion University in Israel. He holds a PhD in Computer Science from the Hebrew University of Jerusalem. His work on polyspectral analysis of musical timbre and his research on machine learning of musical style are widely acknowledged in the computer music community. He has served as co-PI on several projects dealing with semantic analysis of audio, such as the recent EU-sponsored "Semantic HiFi" project. He is currently co-editing a book, "The Structure of Style: Algorithmic Approaches to Understanding Manner and Meaning", and working on a textbook on semantic audio processing.