Detecting duplicate content in American talk radio

Allison King
Published in Cortico · Feb 21, 2019 · 9 min read

At Cortico, we’ve made talk radio searchable in order to help surface underrepresented voices across the country. Using talk radio, we can get a sense of what people are talking about at the local level. However, it turns out there is a lot of duplicate content in talk radio, from syndicated content (for example, an NPR segment airing across all of its member stations) to commercials. This not only clutters up our search space and our top-terms results, but also makes it harder to hear local voices underneath nationally broadcast content. So we embarked on a journey to see how we could automatically detect duplicate audio content.

Looking for a Solution

Since we are transcribing all of our radio data, we already had some notion of duplicate detection based on text. However, transcript-based duplicate detection varies with transcription accuracy, which in turn varies with background noise. Furthermore, transcription is limited to English and sometimes has difficulty with accents. So we wanted to see if there was a solution that did not rely on text and could complement our transcript-based duplicate detection.

Not only did we want to increase our duplicate detection accuracy, but we also wanted to better understand the landscape of talk radio. How much of talk radio is syndicated content? Are there clusters of radio stations that air the same content?

Audio Fingerprinting

The solution we decided to try was audio fingerprinting. Audio fingerprinting gives us a means of identifying portions of an audio file via a hash (the “fingerprint”). These hashes can then be compared to the hashes of other portions of other audio files. The technique was published by Shazam, and it looks something like this:

Audio fingerprinting overview
  1. Transform audio from the time domain to the frequency domain using a fast Fourier transform
  2. Find peaks in resultant spectrogram, indexed by time and frequency
  3. Compute a hash function on the relative peak positions in order to get a sequence of hash values (hash(frequencies of peaks, time diff between peaks))
  4. Identify sequence of common hashes between audio files

Luckily for us, an open source Python library for this technique exists called Dejavu.
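Dejavu handles all of this for us, but to make the steps above concrete, here is a rough from-scratch sketch in Python (not Dejavu’s actual code; the window size, peak threshold, and fan-out value are illustrative assumptions):

```python
import hashlib

import numpy as np
from scipy import signal
from scipy.ndimage import maximum_filter

FAN_OUT = 5  # pair each peak with up to this many later peaks

def fingerprint(samples, sample_rate=44100):
    """Toy Shazam-style pipeline: spectrogram -> peaks -> hashes."""
    # 1. Time domain -> frequency domain via windowed FFTs (a spectrogram)
    _freqs, _times, spec = signal.spectrogram(samples, fs=sample_rate, nperseg=4096)
    spec = 10 * np.log10(spec + 1e-10)

    # 2. Keep points that are the loudest in their neighborhood and above a threshold
    local_max = maximum_filter(spec, size=20) == spec
    peaks = np.argwhere(local_max & (spec > spec.mean() + 10))  # rows of (freq_idx, time_idx)
    peaks = peaks[np.argsort(peaks[:, 1])]  # order peaks by time

    # 3. Hash pairs of nearby peaks: hash(freq of peak 1, freq of peak 2, time delta),
    #    keeping the anchor peak's time so we know where in the clip the hash occurred
    hashes = []
    for i, (f1, t1) in enumerate(peaks):
        for f2, t2 in peaks[i + 1 : i + 1 + FAN_OUT]:
            h = hashlib.sha1(f"{f1}|{f2}|{t2 - t1}".encode()).hexdigest()[:20]
            hashes.append((h, int(t1)))
    return hashes
```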

Consider, as one does, a list of Britney Spears songs. After going through the FFT and hash calculation process, each song has a set of hashes along with the position at which each hash occurs. In the original Dejavu library, these are stored away in a MySQL database.

Mock hashes for various Britney songs

Now, let’s say you hear a beautiful voice singing on the radio and a wave of nostalgia hits you. You simply must know what this song is! Using an app like Shazam, your phone can fingerprint the song as it comes in (create a sequence of hashes), then compare that sequence to the hash sequences it has in its database. Using this sequence matching method, your phone can tell you what song is playing, no matter at what point in the song it started listening.
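As a toy illustration of that lookup (not Shazam’s or Dejavu’s actual code, and using the same assumed (hash, offset) format as the sketch above), you can build an inverted index over every known song’s hashes and let each matching hash vote on a (song, offset difference) pair; a real match piles its votes onto a single offset:

```python
from collections import Counter, defaultdict

def build_index(songs):
    """songs: {song_name: [(hash, offset), ...]} -> inverted index hash -> [(song_name, offset)]."""
    index = defaultdict(list)
    for name, hashes in songs.items():
        for h, offset in hashes:
            index[h].append((name, offset))
    return index

def identify(sample_hashes, index):
    """Return the (song, offset difference, vote count) with the most aligned hash matches."""
    votes = Counter()
    for h, sample_offset in sample_hashes:
        for name, song_offset in index.get(h, []):
            votes[(name, song_offset - sample_offset)] += 1
    if not votes:
        return None
    (song, offset_diff), count = votes.most_common(1)[0]
    return song, offset_diff, count
```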

Matching streaming song to songs in a database

In the picture above, it looks like the incoming song was Britney’s hit song, Toxic! Speaking of Toxic, have you read our post on Visualizing Toxicity in Twitter Conversations? :)

With an algorithm and Python implementation in hand, we set out to apply audio fingerprinting to our specific problem. Our goal is to be able to say, “from 1:10 to 2:10 in clip A, we have the same content as from 3:25 to 4:25 in clip B”. A talk radio clip may start with original content, then play some duplicate commercials, then go back to original content. Unlike the Shazam problem, we do not want to just return the top ‘song’ that matched, but rather, all matches in a given clip. Furthermore, we needed to add logic in order to determine the time frame of each match.
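To give a flavor of that extra logic (a simplified sketch under the same assumed (hash, offset) representation, not our production code), hashes that match between two clips with the same offset difference can be grouped together, and the span of their anchor times becomes the matched time range:

```python
from collections import defaultdict

SECONDS_PER_STEP = 0.046  # hypothetical: seconds of audio per spectrogram time step

def match_ranges(hashes_a, hashes_b, min_matches=20):
    """Return (start_a, end_a, start_b, end_b) ranges, in seconds, where two clips overlap."""
    by_hash = defaultdict(list)
    for h, offset in hashes_b:
        by_hash[h].append(offset)

    # Group matching hashes by the offset difference between the two clips;
    # a sustained duplicate keeps the same difference for its whole duration.
    by_delta = defaultdict(list)
    for h, offset_a in hashes_a:
        for offset_b in by_hash.get(h, []):
            by_delta[offset_b - offset_a].append((offset_a, offset_b))

    ranges = []
    for pairs in by_delta.values():
        if len(pairs) < min_matches:
            continue  # too few aligned hashes to call it a duplicate
        starts_a = [a for a, _ in pairs]
        starts_b = [b for _, b in pairs]
        ranges.append((min(starts_a) * SECONDS_PER_STEP, max(starts_a) * SECONDS_PER_STEP,
                       min(starts_b) * SECONDS_PER_STEP, max(starts_b) * SECONDS_PER_STEP))
    return ranges
```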

Besides this segmentation problem, we also had another problem — a really big one! Here are some of our statistics:

  • 163 radio stations in 37 states
  • 970 million words per month
  • 3.5 TB of audio ingested per month

This would amount to:

  • 50,000 fingerprint files a day
  • ~10,000 fingerprints per file

Unlike music streaming matching, we are not comparing one incoming song against a database of known songs. Instead, we are looking to compare every clip to every other clip. This gives us an n² problem, where n is every possible sequence in a given audio file, multiplied across 50,000 files a day (which, at ~10,000 fingerprints per file, is roughly 500 million hashes a day). Our biggest problem quickly became how to scale this process.

Our Solution

We can split our scale problem up into two distinct parts:

  1. Fingerprinting (generating sequences of hashes for each radio clip)
  2. Matching

Scaling the fingerprinting process itself was relatively simple. Since we already had a bunch of machines transcribing the audio data as it came in, we added a step that fingerprints each file on arrival, then tosses the fingerprints into S3 for later use.

Adding a step to fingerprint an audio file to our existing ingest pipeline
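A minimal sketch of that extra ingest step (the bucket name, key layout, and JSON format are assumptions for illustration, not our actual pipeline):

```python
import json

import boto3

s3 = boto3.client("s3")
FINGERPRINT_BUCKET = "example-radio-fingerprints"  # hypothetical bucket name

def store_fingerprints(hashes, station, clip_id):
    """Park a clip's (hash, offset) pairs in S3 so the nightly matching job can pick them up."""
    key = f"fingerprints/{station}/{clip_id}.json"  # hypothetical key layout
    s3.put_object(Bucket=FINGERPRINT_BUCKET, Key=key,
                  Body=json.dumps(hashes).encode("utf-8"))
    return key
```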

Scaling the matching process was not so simple, though, since the work cannot be split into fully independent pieces: every radio clip has to be compared to every other radio clip. We quickly saw that MySQL, the default database for the Dejavu library, would not cut it, so we ventured into the realm of Spark. With Spark, we can hold all of the fingerprint values in memory across many machines rather than in a gigantic database, then do the comparisons in memory. The original Dejavu code was not written in a functional style, but the duplicate detection algorithm (perhaps most similar to gene sequence alignment) lends itself reasonably well to map and reduce functions.

For an overview of how this matching algorithm works, check out this Observable notebook which lets you play with parameters and goes through the algorithm step by step. Note that this notebook is written in JavaScript for easy visualizing and sharing, but the actual implementation was entirely in Python. Overall, the algorithm looks like:

  1. Map → Download each fingerprint file
  2. Reduce → Group by hash value
  3. Map → Calculate offsets between matching hashes
  4. Reduce → Align matching hash/offset pairs
  5. Map → Get time range matches
  6. Reduce → Group by the radio clip name
  7. Map → Save results back to S3
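A condensed, self-contained PySpark sketch of those steps is below (toy in-memory data stands in for the fingerprint files on S3, so this shows the map/reduce shape rather than our exact job):

```python
from itertools import combinations

from pyspark import SparkContext

sc = SparkContext("local[*]", "duplicate-detection")  # local master just for this sketch

# Toy input: (clip_name, [(hash, offset), ...]) pairs that would normally be
# downloaded from the fingerprint files in S3.
clips = sc.parallelize([
    ("WXYZ_2019-02-21_06", [("a1", 10), ("b2", 11), ("c3", 12)]),
    ("WABC_2019-02-21_06", [("a1", 50), ("b2", 51), ("c3", 52)]),
])

# Group by hash value: hash -> [(clip, offset), ...]
by_hash = (clips
           .flatMap(lambda kv: [(h, (kv[0], off)) for h, off in kv[1]])
           .groupByKey())

def offset_deltas(occurrences):
    """For hashes seen in more than one clip, emit the offset delta for each clip pair."""
    for (clip_a, off_a), (clip_b, off_b) in combinations(sorted(occurrences), 2):
        if clip_a != clip_b:
            yield ((clip_a, clip_b, off_b - off_a), (off_a, off_b))

# Align matching hash/offset pairs: group by (clip pair, offset delta)
deltas = by_hash.flatMap(lambda kv: offset_deltas(kv[1])).groupByKey()

def to_range(kv):
    """Turn an aligned run of hashes into a matched time range for a clip pair."""
    (clip_a, clip_b, _delta), pairs = kv
    pairs = list(pairs)
    starts_a = [a for a, _ in pairs]
    starts_b = [b for _, b in pairs]
    return (clip_a, (clip_b, min(starts_a), max(starts_a), min(starts_b), max(starts_b)))

# Get time range matches, then group them by clip name
matches = deltas.map(to_range).groupByKey()

# In production the results go back to S3; here we just collect and print them.
for clip, overlaps in matches.collect():
    print(clip, list(overlaps))
```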

Infrastructure

Although we thought rewriting the algorithm in a functional style would be the most difficult part, it was relatively easy compared to deploying the Spark process onto our infrastructure. At Cortico, we use Kubernetes through Amazon EKS. We wanted our duplicate detection to run once a day, over the last 24 hours of data. In our ideal world, when the job starts, we would spin up machines for the cluster in Amazon EC2, and when the job is finished, the machines would come down.

We achieved this using EC2 auto scaling groups attached to a Kubernetes cluster autoscaler (more info in this AWS tutorial). By configuring the EC2 auto scaling groups to use Spot instances (spare EC2 machines offered at a discount by AWS), and by having the Kubernetes cluster autoscaler tell AWS when we need these machines and when we no longer do, we can deploy our job at minimal cost. The workflow looks like:

  1. Kubernetes cron job creates X number of Spark workers, all with resource requests:

spec:
  replicas: 10
  ...
  containers:
    - name: spark-worker
      ...
      resources:
        requests:
          cpu: "5"
          memory: "17Gi"

2. The Kubernetes cluster autoscaler detects that it does not have enough resources to create all of these Spark workers, so it alerts the EC2 autoscaling group

3. The EC2 autoscaling group bids for the number of spot instances it needs and adds those as new nodes to the Kubernetes cluster.

4. Once the new machines are up, the duplicate detection job is deployed and runs on these new machines

5. When the job completes, the Kubernetes cluster autoscaler detects it has unused nodes, and asks the EC2 autoscaling group to relinquish them

6. Repeat, every day!

Results

Stats

Using audio fingerprinting for duplicate detection seems to complement our existing transcript-based method well. The fingerprinting method finds most of the same clips that our transcript-based method finds, as well as many others. In total, the fingerprinting method finds 74% of the duplicates flagged by either method, versus 47% flagged by the transcript-based method. Listening to the clips, we found that the fingerprinting method has nearly perfect precision, suggesting that its impact is strictly additive.

The two approaches are not apples-to-apples, however, since our transcript-based method identifies individual sentences that overlap and can identify when the same piece of text is spoken by different speakers.

A more rigorous evaluation of both methods’ precision and recall requires quite a bit of hand-labeling that we haven’t yet undertaken, but early signs are promising!

Once our first few days of audio fingerprinting data rolled in, we were excited to see what our data looked like.

Madison, Wisconsin

Because of our efforts launching the Local Voices Network in Madison, Wisconsin, we took a look at radio stations that can be heard in Madison.

Duplicate content on Madison radio stations

In this picture, blue lines represent content that is also aired on at least one other radio station we ingest nationally. We can see very clearly that WORT has very little duplicate content, and reading its Wikipedia page confirms it has a high ratio of locally produced content. The picture above happens to show a Monday, but we can toggle through different days of the week to see how duplicate content varies between weekdays and weekends.

Play with the observable notebook here!

Radio Station Clusters

In order to visualize “clusters” of radio stations, where stations that air a lot of the same content are grouped together, we created a force-directed graph. Each node in the graph is a radio station, and each link represents the number of seconds of duplicate content between two stations, with thicker links meaning more duplicate content. Because so much duplicate content is detected overall, for visualization purposes only station pairs above a certain threshold of duplicate seconds are shown as linked.

Force-directed graph of talk radio station duplicates

Looking at this graph, we see right away two obvious clusters: the red cluster is Public Radio stations, and the blue cluster is News/Talk stations. These categories were derived from Radio Locator independently from the fingerprinting process.

It was exciting to see that our fingerprinting results clustered largely in the same way as externally identified categories. Also of note are the few brown nodes, which are College radio stations. These tend to stay unlinked even as the threshold for whether or not a station is linked is lowered, indicating a high volume of unique content.

Play with the observable notebook here!

Conclusion

Overall, audio fingerprinting has proven capable of raising our duplicate detection accuracy, and we have had it running successfully in our infrastructure since October. In the future, we would like to look at commercial detection (frequently repeated 30-second clips), and at removing syndicated content in order to train and test on accents from different parts of the country.

If you’re interested in building neat processes like this, now is your chance — we’re hiring and would love to hear from you!

Much thanks to Jordan Isler for initial investigations into audio fingerprinting, Madeleine Li for tirelessly listening to audio, and Brandon Roy for determined debugging and guidance throughout the whole process!

For a slightly more in-depth video version of this blog post, see my talk at DataEngConf in New York City last fall:
