Research Projects
Tunebot 
Bryan Pardo, David Little, Hagai Livni, Rui Jiang, Anda Bereczky, Jeff Hentschel
What is Tunebot?
Tunebot is a search engine that lets you find the music you're looking for by singing a bit of it (or entering music notation). In response to your query it returns a ranked list of songs you can play. These songs are linked to www.amazon.com, where you can purchase the desired music.
How is this new?
Most commercial music search engines, such as Amazon's and Apple's iTunes, index their music by metadata, such as song title, composer name or performer name. What happens if you can't recall this information but still want to find a piece of music? If you already have the recording and just want to know its name, you can play the recording to www.shazam.com. Without the metadata or an example of the recording, none of these services can identify the music. With our search engine, you can find the music you seek, as long as you can sing a bit of it. Our search engine compares what you sing to a database of melodies and returns the melodies that best match your example. You don't need to know the lyrics. You don't have to have a copy of the recording. You just have to have a voice and a microphone.
iPhone Beta Release
We are also currently developing an iPhone version of Tunebot. If you would like to become a beta tester and have iPhone 3.0 software, submit a request to pardo [at] cs.northwestern.edu with the subject "Tunebot iPhone Betatester Request" and include your iPhone or iPod touch unique identifier (NOT the serial number). To find it, go to the device section in iTunes and click on "serial number" to switch it to "Identifier (UDID)". We will then send you instructions on how to install the application on your device. Act quickly, as space is limited! Please note that you must have iPhone 3.0 software and an iPhone or 2nd generation iPod touch with an external microphone to test the software.
Why contribute a song?
Our search engine doesn't compare your singing to the original recording you seek. This is because computer systems are confused by comparing an unaccompanied voice to a recording with multiple concurrent voices and instruments. An unaccompanied solo voice is so different from the typical commercial recording that even a really good rendition won't seem similar to a computer. The solution is to compare unaccompanied voices to other unaccompanied voices. This is where you come in. When you add a song to the Tunebot database, you provide us with an unaccompanied solo version that is linked to the appropriate recording on www.amazon.com. Then, when someone else wants to find this song, we compare their singing to your performance. The more performances people contribute, the more songs become searchable with our system.
How does the system work?
The diagram below shows the workflow of the system. A person hums a tune (1), and the search engine returns a list of songs, ranked by similarity to the hummed tune (2). The user can then choose the song that is most similar to the one hummed (3). This provides feedback to the system that pairs the desired song with the sung example. These pairings of recorded queries and the correct corresponding song are stored in a database (4). We can use those pairings in a genetic algorithm that optimizes the search engine's parameters (5), improving system performance automatically.
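The ranking step (2) can be illustrated with a minimal sketch. The description above does not say which similarity measure the search engine uses, so this example simply assumes that melodies are stored as sequences of MIDI pitch numbers and compared by edit distance over their pitch intervals, which makes the comparison insensitive to the key the singer chose. The function names and the tiny database are hypothetical.

```python
def intervals(pitches):
    """Convert a sequence of MIDI pitches into successive pitch intervals."""
    return [b - a for a, b in zip(pitches, pitches[1:])]


def edit_distance(a, b):
    """Levenshtein distance between two sequences (assumed similarity measure)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def rank_songs(query, database):
    """Return song titles sorted from most to least similar to the query."""
    q = intervals(query)
    scored = [
        (edit_distance(q, intervals(melody)), title)
        for title, melody in database.items()
    ]
    return [title for _, title in sorted(scored)]


# Hypothetical database of contributed sung examples, keyed by song title.
database = {
    "Twinkle Twinkle": [60, 60, 67, 67, 69, 69, 67],
    "Ode to Joy": [64, 64, 65, 67, 67, 65, 64],
    "Frere Jacques": [60, 62, 64, 60, 60, 62, 64],
}

# A query sung a whole step higher than the stored "Twinkle Twinkle" example.
query = [62, 62, 69, 69, 71, 71, 69]
ranking = rank_songs(query, database)
```

Because the comparison uses intervals rather than absolute pitches, the transposed query still matches the stored example exactly and that song is ranked first.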
Related Papers
D. Little, D. Raffensperger, B. Pardo. A Query by Humming System that Learns from Experience. Proceedings of the 8th International Conference on Music Information Retrieval, Vienna, Austria, September 23-27, 2007 (PDF)
D. Little, D. Raffensperger and B. Pardo. Online Training of a Music Search Engine. Northwestern University, Evanston, IL, NWU-EECS-07-03, 2007 (PDF)
Funding
This work was funded in part by National Science Foundation Grant number IIS-0812314.
Karaoke Callout 
Bryan Pardo, Jeff Hentschel, Anda Bereczky, David A. Shamma
What is it?
Karaoke Callout is a web-based karaoke game that allows users to "call out" or challenge each other to a duel of singing. A person first selects a song from a growing list of music, and then sings their own rendition of the song. The system gives them a score for their performance. The person may then choose to call out a friend (who will then record his or her own rendition of the song), share their rendition with the world, try again, or choose a new song.
How do I play?
Go here. Pick a song and press select to bring up the karaoke interface. Press record and begin singing. This version of karaoke is a cappella. When you are finished, press stop. You can then choose from the following options: review your recording, callout a friend, save your recording, and start a new song.
What songs does KaraokeCallout know?
The Karaoke database contains a wide variety of songs. If you can't find what you're looking for, contribute your own.
What's Next?
We are currently developing an iPhone version of KaraokeCallout. The callout functionality in the web-based version is also in progress.
Who created KaraokeCallout?
The game is a research project in Bryan Pardo's lab in the Electrical Engineering and Computer Science Department of Northwestern University.
Project members include Bryan Pardo, Jeff Hentschel, and Anda Bereczky.
Related Papers
B. Pardo and David A. Shamma. Teaching a Music Search Engine Through Play. In Proceedings of CHI 2007 Workshop on Vocal Interaction in Assistive Technologies and Games (CHI 2007), San Jose, CA, USA, April 29 - May 3, 2007. (PDF)
D. Shamma and B. Pardo. Karaoke Callout: using social and collaborative cell phone networking for new entertainment modalities and data collection. In Proceedings of ACM Multimedia Workshop on Audio and Music Computing for Multimedia (AMCMM 2006), Santa Barbara, CA, USA, October 23-27, 2006. (PDF)
Funding
This work was funded in part by National Science Foundation Grant number IIS-0812314.
Audio Source Separation
Jinyu Han, Zafar Rafii and Bryan Pardo
What is it?
Source separation is the process of determining individual source signals, given only mixtures of the source signals. When prior analysis of the individual sound sources is not possible, the problem is considered blind source separation (BSS). In this work, we focus on the BSS problem as it relates to recordings of music.
What is it good for?
A tool that can accomplish blind separation of musical mixtures would be of use to recording engineers, composers, multimedia producers and the general public, as well as in hearing aids. Imagine a hearing aid that lets you choose which person in the cocktail party is amplified. A live recording of a flute recital could be remixed after recording to eliminate the sound of a ringing cell phone. Other applications abound.
Our Approaches
Recent work in blind source separation applied to anechoic mixtures of speech
allows for improved reconstruction of sources that rarely overlap in a time-frequency representation.
While the assumption that speech mixtures do not overlap significantly in time-frequency is reasonable,
music mixtures rarely meet this constraint, requiring new approaches.
Spatial Cues Re-estimation:
We introduce a method that uses spatial cues from anechoic, stereo music recordings and assumptions
regarding the structure of musical source signals to effectively separate mixtures of tonal music.
We use existing techniques to create partial source signal estimates from regions of the mixture
where source signals do not overlap significantly.
We use these partial signals within a new demixing framework, in which we estimate harmonic masks for each source,
allowing the determination of the number of active sources in important time-frequency frames of the mixture.
We then propose a method for distributing energy from time-frequency frames of the mixture to multiple source signals.
This allows dealing with mixtures that contain time-frequency frames
in which multiple harmonic sources are active without requiring knowledge of source characteristics.
Constant Q Transform:
A parallel study on blind source separation based on time-frequency masking also shows that sound separation,
especially for music, may be improved by using a log-frequency scaled spectrogram built from the Constant Q Transform (CQT).
The first results show that CQT-based DUET can separate more sources than a standard DUET and with more fidelity to the original sources.
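To make "log-frequency scaled" concrete, the sketch below computes the center frequencies of constant-Q bins. They are geometrically spaced, so the ratio of center frequency to bandwidth (the quality factor Q) is the same for every bin. The minimum frequency and the number of bins per octave are illustrative assumptions, not necessarily the settings used in this work, and the function names are hypothetical.

```python
def cqt_center_frequencies(f_min, bins_per_octave, n_bins):
    """Geometrically spaced center frequencies: f_k = f_min * 2^(k / b)."""
    return [f_min * 2 ** (k / bins_per_octave) for k in range(n_bins)]


def quality_factor(bins_per_octave):
    """Q = f_k / (f_{k+1} - f_k), which is identical for every bin k."""
    return 1.0 / (2 ** (1.0 / bins_per_octave) - 1)


# Assumed settings: start at 55 Hz (A1), 12 bins per octave, two octaves.
freqs = cqt_center_frequencies(55.0, 12, 25)
q = quality_factor(12)
```

With 12 bins per octave each bin corresponds to one semitone, so musical notes fall on evenly spaced bins, unlike in a linear-frequency spectrogram where low notes are crowded into a few bins.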
Spatial Cues Re-estimation
As sources increasingly overlap in the time-frequency domain or the angle between sources decreases, the spatial cues used in our methods become unreliable. We also introduce a method to re-estimate the spatial cues for mixtures of harmonic sources. The newly estimated spatial cues are fed to the system to update each source estimate and the pitch estimate of each source. This iterative procedure is repeated until the difference between the current estimate of the spatial cues and the previous one is under a pre-set threshold. Results on a set of three-source mixtures of musical instruments show this approach significantly improves separation performance of two existing time-frequency masking systems.
Five systems are compared in our experiments.
They are DUET with ground truth pitch information for spatial cues refinement (IDEAL), unmodified DUET, DUET with spatial cues iterative refinement (DUET+ITER),
unmodified ASE and ASE with spatial cues iterative refinement (ASE+ITER).
Ground truth pitches are the pitches estimated from the original recordings of the isolated sources, prior to mixing.
Performance results when the angle between instruments is 20 degrees are shown in the figure above.
The median value of separation performance for each method is labeled with arrow text.
Iterative spatial cues refinement improves DUET's median performance by 5.2 dB and ASE's median performance by 2.5 dB.
Furthermore, our proposed system with iterative spatial cues estimation performs as well as the system using ground truth pitches.
The audio examples listed below are from the experiments on mixtures of horn, oboe and bass clarinet. The initial estimates are achieved using a method only based on spatial cues and the final estimates are achieved using our new re-estimation method.
Re-estimation at different mixing angles
Performance results from mixtures created using different mixing angles are shown in the figure below.
In this figure, each data point indicates an average result for 30 mixtures.
The proposed system (DUET+ITER) consistently outperformed the existing systems (excluding the system using ground truth pitches)
for nearly all mixing angles above 18 degrees. The iterative spatial cues estimation improves DUET or ASE when the sources are close to each other
(in the figure, this is the case when the mixing angle is between 18 and 30 degrees).
Constant Q Transform
We have developed and evaluated a CQT-based DUET algorithm on different mixtures of speech and music shown below.
While the classic DUET failed in detecting all the sources in the 2D histogram
built from the amplitude and phase differences in the classic spectrogram,
the CQT-based DUET succeeded in detecting and separating all the sources in every mixture,
and with high fidelity to the original sources.
Mixture of 4 female voices:

2D histograms of amplitude and phase delays of the mixture of 4 female voices.
On the left, classic DUET has detected only 3 sources,
while on the right, the CQT-based DUET has clearly detected the 4 sources.
estimated female voice 1
estimated female voice 2
estimated female voice 3
estimated female voice 4
Mixture of 4 male voices:

2D histograms of amplitude and phase delays of the mixture of 4 male voices.
On the left, classic DUET has detected only 3 sources,
while on the right, the CQT-based DUET has clearly detected the 4 sources.
estimated male voice 1
estimated male voice 2
estimated male voice 3
estimated male voice 4
Mixture of drums, vocal & piano:

2D histograms of amplitude and phase delays of the mixture of drums, vocal & piano.
On the left, classic DUET has detected 2 dubious peaks,
while on the right, the CQT-based DUET has clearly detected the 3 sources.
estimated drums
estimated vocal
estimated piano
Mixture of keyboard, guitar, bass & 2 drums:

2D histograms of amplitude and phase delays of the mixture of keyboard, guitar, bass & 2 drums.
On the left, classic DUET has detected 1 gross peak,
while on the right, the CQT-based DUET has clearly detected the 5 sources.
estimated keyboard
estimated guitar
estimated bass
estimated drums 1
estimated drums 2
Related Papers
J. Han and B. Pardo. Improving Separation of Harmonic Sources with Iterative Estimation of Spatial Cues.
Accepted to IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA 2009), 2009. (PDF)
J. Woodruff and B. Pardo. Using Pitch, Amplitude Modulation and Spatial Cues for Separation of Harmonic Instruments from Stereo Music Recordings. EURASIP Journal on Advances in Signal Processing, vol. 2007, Article ID 86369, 2007. (PDF)
J. Woodruff, B. Pardo and R. Dannenberg. Remixing Stereo Music with Score-informed Source Separation. In Proceedings of the 7th International Conference on Music Information Retrieval, Victoria, Canada, October 8-12, 2006. (PDF)
Funding
This work was funded in part by National Science Foundation Grant number IIS-0643752.
Automatic Audio Tagging
David Little and Bryan Pardo
Goal: Labeling Sound Sources in Mixtures
Our goal is to develop a system that can identify a given target sound source when presented in mixtures of other sources. An example would be a system that learns to identify the sound of a clarinet when trained on recordings of musical groups containing clarinet.
Weak Labels Only
The region containing energy from a target source is not specifically labeled. We refer to this as weak labeling.
For instance, the sounds below were used to help train the system to learn "clarinet." Each audio file was created using sampled instruments provided with Apple's GarageBand version 4.1.2. The first song has a clarinet in it; the other does not. When learning clarinet, the program is presented with each sound, and told only that the first sound has a clarinet and that the second one does not.
Novel Mixtures
Furthermore, the system must be able to identify sound objects in novel mixtures. For example, if we teach the system a clarinet, presenting the system with clarinet and oboe mixtures as positive examples, and violin and cello mixtures as negative examples, our system should recognize that a clarinet and violin recording includes a clarinet, even though the system has never seen this combination of instruments before.
What's it good for?
A system capable of learning to identify sources from weakly labeled acoustic mixtures would have applications in source separation, music transcription and music/audio information retrieval ("find me songs that have a violin, find me all the explosions in this movie"). More generally a system capable of learning and identifying sounds in an environment could serve as an aid to the deaf, provide context for a digital assistant, or facilitate ecological and biodiversity monitoring.
Related Papers
D. Little and B. Pardo. Learning musical instruments from mixtures of audio with weak labels. In ISMIR, 9th International Conference on Music Information Retrieval, 2008. (PDF)
Funding
This work was funded in part by National Science Foundation Grant number IIS-0643752.
Adaptive User Interfaces
Bryan Pardo, Andy Sabin, Darren Gergle, Colby Leider, Zafar Rafii
What is it?
Many musicians think about sound in individualistic terms that may not have known mappings onto the controls of existing audio production tools. For example, a violinist may want to make the recording of her violin sound "shimmery." While she has a clear concept of what a "shimmery" sound is, she may not know how to articulate it in terms that map "shimmery" onto the available audio tools (such as reverberation and equalization).

A typical parametric equalization plug-in. Too difficult!
What is it good for?
Imagine a computational tool that works alongside the musician to quickly learn how acoustic features map onto an audio concept, and creates a simple controller to manipulate audio in terms of that concept. In the case of the violin player, the tool would learn what "shimmery" means to her, and then create a knob that would let her make a sound more or less shimmery. This allows the creator to quickly realize a concept, bypassing the bottleneck of technical knowledge. We therefore propose here a user-centered design approach to the development of audio production tools that automatically adapt to the user's work style, rather than forcing the user to adapt to the tools. The result will be new technologies that support and enhance human creativity.

The learning process: a sound is modified by a series of equalization curves or reverberation settings and the listener rates each example as to how well it exemplifies a target audio concept (e.g., "boomy").
Our Approach
In our initial work, we focused on improving equalization and reverberation tools. We have tested algorithms to rapidly learn a listener's desired equalization curve or reverberation settings. First, a sound is modified by a series of equalization curves or reverberation settings. After each modification, the listener indicates how well the current sound exemplifies a target sound descriptor (e.g., "warm"). After rating, a function is computed that models the user's desired equalization curve or reverberation settings. Listeners report that sounds generated using this function capture their intended meaning of the descriptor. Machine ratings generated by computing the similarity of a given curve to the weighting function are highly correlated with listener responses, and asymptotic performance is reached after only ~25 listener ratings.
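One way to picture the learning step for equalization is sketched below. It assumes each probe is an equalization curve given as per-band gains in dB, and it models the listener's concept by regressing the ratings on the gain within each frequency band. The per-band slopes form the weighting function, and a "machine rating" is the Pearson correlation between a new curve and that function. These modeling choices, the function names and the simulated listener are illustrative assumptions rather than a description of the published algorithm.

```python
def mean(xs):
    return sum(xs) / len(xs)


def slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var


def pearson(xs, ys):
    """Pearson correlation (undefined for a completely flat curve)."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den


def learn_weighting(curves, ratings):
    """One weight per band: how strongly gain in that band drives the rating."""
    n_bands = len(curves[0])
    return [slope([c[b] for c in curves], ratings) for b in range(n_bands)]


def machine_rating(curve, weighting):
    """Similarity of a candidate curve to the learned weighting function."""
    return pearson(curve, weighting)


# Simulated listener (an assumption): likes low-band boosts, dislikes high-band boosts.
target = [1.0, 0.5, 0.0, -0.5, -1.0]

# Toy probe set: boost or cut one band at a time by 6 dB.
curves = []
for band in range(len(target)):
    for gain in (-6.0, 6.0):
        curve = [0.0] * len(target)
        curve[band] = gain
        curves.append(curve)

ratings = [sum(g * t for g, t in zip(c, target)) for c in curves]
weighting = learn_weighting(curves, ratings)
```

In this toy setting the learned weighting recovers the simulated preference exactly, so a curve shaped like the preference receives a machine rating near 1 and its mirror image a rating near -1.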
Current work includes the study of alternate controller paradigms (knobs, sliders, pressure sensors, etc.), other audio tools (compression) or graphic tools, and user feedback along multiple dimensions.

A personalized slider for a personal audio concept.
Related Papers
A. Sabin and B. Pardo. Rapid learning of Subjective Preference in Equalization. 125th meeting of the Audio Engineering Society, San Francisco, CA, October 2-5, 2008
Z. Rafii and B. Pardo. Learning to Control a Reverberator using Subjective Perceptual Descriptors. 10th International Society for Music Information Retrieval Conference, Kobe, Japan, October 26-30, 2009. (PDF)
Z. Rafii and B. Pardo. A Digital Reverberator controlled through Measures of the Reverberation. Northwestern University, EECS Department, Technical Report NWU-EECS-09-08, 2009. (PDF)
A. Sabin and B. Pardo. 2DEQ: An Intuitive Audio Equalizer. Proceedings of ACM Creativity and Cognition 2009, Berkeley, CA, October 27-29, 2009 (PDF)
A. Sabin and B. Pardo. A method for rapid personalization of audio equalization parameters. Proceedings of ACM Multimedia 2009, Beijing, China, October 19 - 24, 2009 (PDF)
Funding
This work was supported by NSF Grant number IIS-0757544.
Jazz Score / Performance Database 
Bryan Pardo, Josh Moshier
What if a program could locate a jazz musician's place on a lead sheet based on what they are playing?
Existing score following databases assume faithful performance of fully notated music. While current models may be effective for following notated "classical" music, they have little application for improvised music. Score following in jazz requires alignment of an improvised performance to a lead sheet, the basic template providing a song's melody, harmony and structural information. When playing standard jazz repertoire, it is not uncommon for jazz musicians to spontaneously alter aspects of the piece's pre-determined structure. With the best live jazz musicians, this spontaneous structural change would be seamless. A program that follows the form of a jazz performance would depend on a flexible model that parses musical content in anticipation of possible structural change.
We have created a database of MIDI recordings performed by professional Chicago jazz pianists using lead sheets.
These performers marked their performances with measure, beat and structural branch point information, encoded as MIDI data. A structural analysis of each performance has been created by a conservatory-trained professional jazz pianist. This database will be useful as training and validation data for a jazz score following program.
For more information:
http://music.cs.northwestern.edu/jazzstudy/index.html
Funding
This work is funded in part by National Science Foundation Grant number IIS-0643752.
Score Alignment
Bryan Pardo and Sara Laupp
What is it?
Score alignment involves finding the best alignment between an audio performance and the events in a machine-readable music score. We propose advancing the state-of-the-art to include partially specified scores, such as Jazz lead sheets, that require significant inference about likely surface structures (such as note sequences in a solo) from deeper structural descriptions in the score. This requires fundamentally new ways of approaching the problem.
What is it good for?
Many tasks in music analysis and production would be greatly aided by reliable alignment of a score to the acoustic recording. Editing of audio tracks often requires the location of a particular section or phrase in the music to be tweaked in some way (tuning, equalization, adding an audio effect). Score alignment would allow selection of the desired audio by selecting the measure or passage on the musical score. The audio would be searchable by melodic riff or hook. In entertainment, on-line score alignment allows for interactive automated musical accompaniment (Karaoke that follows you, instead of the other way around). For music education, a system able to align a score to an acoustic performance could indicate to a music student where the performance deviates from indicated score markings ("the score says to play softer here" or "you sped up when it said to slow down").
Our Approach
This combination of surface-level variation (block chords vs. arpeggios) and structural variation (play the interlude vs. skip the interlude) presents a problem for the traditional score-matching paradigm: the score elements will be far fewer, and perhaps significantly different, than the transcribed performance elements. Therefore, a score may not be represented as a single sequence of notes and a (nearly) one-to-one mapping between a score sequence and performance sequence is not possible, violating a basic assumption of existing score alignment techniques.
A Markov model describes a process that goes through a sequence of discrete states, such as notes or chords in a lead sheet. Markov models are generative. A generative model describes an underlying structure able to emit a sequence of observed events. A musical score may be represented as a (hidden) Markov model. The directed graph in the figure below shows a Markov model created from the notes in the written score above it.
We use Markov models, automatically generated from scores, whose emission probabilities can be used to represent chordal or melodic material. Directed edges (arrows) represent transitions. Transition probabilities are indicated by a value associated with each edge. Score following is typically done with a Markov model by first training the model on a set of performances, to tune the transition and emission probabilities. Then, when a new sequence of events (a performance transcription) is presented, the Viterbi algorithm is used to determine the most likely sequence of states in the model to generate the performance. This sequence of states is deemed to be the path through the model and thus the path through the score used to generate the model.
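The decoding step can be sketched with a small, self-contained Viterbi implementation. The three-note score model below, its state names and its hand-set probabilities are hypothetical; in practice the transition and emission probabilities would be tuned on training performances as described above.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for a sequence of observations."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    first = observations[0]
    best = [{s: (start_p.get(s, 0.0) * emit_p[s].get(first, 0.0), None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prob, prev = max(
                (best[-1][r][0] * trans_p[r].get(s, 0.0), r) for r in states
            )
            column[s] = (prob * emit_p[s].get(obs, 0.0), prev)
        best.append(column)

    # Trace back from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Hypothetical left-to-right model for a three-note score: C, D, E.
states = ["n1", "n2", "n3"]
start_p = {"n1": 1.0}
trans_p = {
    "n1": {"n1": 0.3, "n2": 0.7},  # stay on the note or move to the next one
    "n2": {"n2": 0.3, "n3": 0.7},
    "n3": {"n3": 1.0},
}
emit_p = {
    "n1": {"C": 0.8, "D": 0.1, "E": 0.1},  # small chance of a wrong pitch
    "n2": {"C": 0.1, "D": 0.8, "E": 0.1},
    "n3": {"C": 0.1, "D": 0.1, "E": 0.8},
}

# A transcribed performance that lingers on the first and last notes.
performance = ["C", "C", "D", "E", "E"]
path = viterbi(performance, states, start_p, trans_p, emit_p)
```

The returned path assigns each performed event to a score state, which is the alignment between performance and score. For long performances an implementation would normally work with log probabilities to avoid numerical underflow.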
Related Papers
B. Pardo and W. Birmingham. Modeling Form for On-line Following of Musical Performances. In Proceedings of the Twentieth National Conference on Artificial Intelligence, Pittsburgh, Pennsylvania, July 9-13, 2005. (PDF)
B. Pardo and W. Birmingham. Improved Score Following for Acoustic Performances. In Proceedings of the International Computer Music Conference, Gothenburg, Sweden, September 16-20, 2002. (PDF)
Funding
This work is funded in part by National Science Foundation Grant number IIS-0643752.
Music Story
David Ayman Shamma, Bryan Pardo, Kristian Hammond
What is it?
MusicStory listens to music. As with anyone else, the lyrics it hears bring to mind images that resonate with it. The difference is that its mind is not that of a human. It is instead defined by the set of links between ideas and images that is the World Wide Web.
It listens while it searches online for images linked to the words that are being sung and the connections that exist between those images and the song. As the images are found, it presents them to the audience, creating an 'on the fly' music video, heightening, clarifying, and exposing the connections between words, ideas and images that we often do not even notice.
How does it function?
While listening to the music, MusicStory finds and presents word/image associations. It takes the emotional experience of listening to music, amplifies it and heightens its visceral appeal by externalizing concrete and visual imagery intrinsic in the music. The retrieved images vary in their association - some semantically on point and some distant.
MusicStory uses online indexes to retrieve images which have popular relevance. This hands back pop-culture meanings of terms and images. In some cases, the image becomes more authoritative and concrete. The balance of pop-culture and authoritative associations expands the emotional experience. The flow of imagery moves with the pace of the song, providing quick transitions through fast songs and leisurely transitions through slower songs.
How do I get it?
MusicStory was created for personal media devices and is available for download. We are currently expanding MusicStory for use in large-scale, concert installations.
You will also need the latest free versions of Windows Media Player version 10 and Windows Media Encoder. A fast PC with Windows XP and the .NET framework 1.1 or better is also required.
It's still a work in progress; please email ayman-(at)-cs-northwestern-edu if you have any problems, questions or feature requests.
Related Papers
D. Shamma, B. Pardo and J. Woodruff. MusicStory: An Autonomous, Personalized Music Video Creator. Intelligent Music Information Systems-Tools and Methodologies, Idea Group, 2007.
D. Shamma, B. Pardo, and K. Hammond. MusicStory: a Personalized Music Video Creator. In Proceedings of ACM Multimedia 2005, Singapore, November 6-11, 2005. (PDF)
Multi-Pitch Estimation
Zhiyao Duan and Bryan Pardo
What is it?
Multi-pitch estimation is the task of estimating, at a given time, the pitches and the number of pitches (polyphony) in a polyphonic audio signal. In Music Information Retrieval (MIR), people usually do not distinguish between "pitch" and "fundamental frequency (F0)", so multi-pitch estimation is also called multiple F0 estimation.
What is it good for?
Multi-pitch estimation is of great interest to researchers working in music audio and is useful for many applications, including automatic music transcription, source separation, score following, content-based music search, etc. The task, however, remains challenging and existing methods do not match human ability in either accuracy or flexibility.
Our Approach
For multi-pitch estimation, we propose a maximum likelihood approach, where the amplitude spectrum of a time frame is the observation and the F0s are the parameters to be estimated. When defining the likelihood model, previous methods only model spectral peaks, while we also model non-peak regions (frequencies further than a musical quarter-step from an observed peak). It is shown that the peak likelihood and the non-peak region likelihood act as a complementary pair. The former helps find F0s that have harmonics that explain peaks, while the latter helps avoid F0s that have harmonics in non-peak regions. Parameters of these models are learned from monophonic and polyphonic training data. We propose an iterative greedy search strategy to estimate F0s one by one, to avoid the combinatorial problem of concurrent F0 estimation. We also propose a polyphony estimation method to terminate the iterative process. Finally, we propose a post-processing method to refine F0 estimates using neighboring frames. It is shown that this refinement method eliminates many inconsistent estimation errors. Evaluations are done on ten recorded four-part J. S. Bach chorales. Results show that the proposed method achieves better F0 estimation and polyphony estimation than a state-of-the-art algorithm.
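The greedy search can be illustrated with a toy sketch. It replaces the learned likelihood models with a simple hand-made score: one point for each observed peak that lies within a quarter-step of a harmonic of a chosen F0 (the "peak" part), minus half a point for each harmonic that falls in a non-peak region. The stopping threshold stands in for the polyphony estimation step. The score, its weights, the candidate list and all function names are assumptions made for illustration only.

```python
QUARTER_STEP = 2 ** (1 / 24)  # frequency ratio of a musical quarter-step


def near(f, g):
    """True if frequencies f and g are within a quarter-step of each other."""
    return 1 / QUARTER_STEP <= f / g <= QUARTER_STEP


def score(f0s, peaks, n_harmonics=5):
    """Toy stand-in for the likelihood of a set of F0s given observed peaks."""
    harmonics = [h * f0 for f0 in f0s for h in range(1, n_harmonics + 1)]
    explained = sum(any(near(p, h) for h in harmonics) for p in peaks)
    stray = sum(not any(near(h, p) for p in peaks) for h in harmonics)
    return explained - 0.5 * stray


def greedy_f0s(peaks, candidates, min_gain=1.0):
    """Add one F0 at a time, stopping when the best addition gains too little."""
    chosen = []
    current = score(chosen, peaks)
    while True:
        remaining = [c for c in candidates if c not in chosen]
        if not remaining:
            break
        best = max(remaining, key=lambda c: score(chosen + [c], peaks))
        gain = score(chosen + [best], peaks) - current
        if gain < min_gain:  # assumed stand-in for polyphony estimation
            break
        chosen.append(best)
        current += gain
    return sorted(chosen)


# Spectral peaks (Hz) of a hypothetical frame containing two notes, 220 Hz and 330 Hz.
peaks = [220, 330, 440, 660, 880, 990, 1100, 1320, 1650]
candidates = [110, 165, 220, 330, 440]
estimated = greedy_f0s(peaks, candidates)
```

Here the penalty for harmonics in non-peak regions is what rules out the sub-harmonic candidates 110 Hz and 165 Hz, even though some of their harmonics coincide with observed peaks. Each greedy step evaluates every remaining candidate once, rather than scoring every possible combination of F0s.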
A sound example of our existing dataset:
composed by J. S. Bach, played by violin, clarinet, tenor saxophone and bassoon.
The figure above shows the pitch estimation result on the four-part music recording above. The horizontal axis is time and the vertical axis is pitch in MIDI number. Colored lines are the ground truth pitches. Black dots are pitch estimates. It can be seen that most estimates are correct, while there are still some insertion, deletion and substitution errors.
The figure above shows boxplots of comparison results with Klapuri's system (A. Klapuri, "Multiple fundamental frequency estimation by summing harmonic amplitudes," in Proc. International Conference on Music Information Retrieval (ISMIR), 2006, pp. 216-221.) on 10 pieces of four-part music recordings. Each box represents 330 data points and each point corresponds to 1 second of audio. The gray boxes are Klapuri06's results and the white ones are ours. The lower and upper lines of each box show 25th and 75th percentiles of the sample. Higher values are better. The line in the middle of each box is the sample median, which is also presented as the number below each box. The lines extending above and below each box show the extent of the rest of the samples, excluding outliers. "Mul-F0" measures the overall accuracy of all pitches. "Pre-F0" measures the accuracy of the first pitch found.
Related Papers
Z. Duan, J. Han and B. Pardo. Harmonically Informed Multi-pitch Tracking. Accepted by 10th International Society for Music Information Retrieval Conference (ISMIR 2009), Kobe, Japan, October 26-30, 2009. (PDF)
Funding
This work is funded in part by National Science Foundation Grant number IIS-0643752.