
Parsing the UrbanSound8K Dataset with TensorFlow


From raw audio to TFRecord to padding batches

There are multiple ways of feeding data to a neural network. Most often, NumPy arrays and CSV files loaded with pandas are used. Additionally, TensorFlow offers a custom storage format, TFRecord. It’s handy but not very beginner-friendly.

Photo by Richard Horvath on Unsplash

Though many datasets are readily available, such as MNIST or CIFAR, you sometimes have to prepare custom ones yourself. Thankfully, getting other datasets into the TFRecord format is not that hard once you’ve done it a few times. Here, we’ll parse the UrbanSound8K dataset, store it in the TFRecord format, and, finally, iterate over its samples.

Dataset description

The UrbanSound8K dataset (available free of charge for non-commercial use only; CC BY-NC 3.0) contains 8732 audio files of varying duration. Each sample is at most 4 seconds long and belongs to one of 10 classes. The files come pre-arranged in ten folds and are stored in WAVE format. Each file might have a different sampling rate, bit depth, and number of channels. At ~7 GB, it’s a medium-sized dataset. If you’re looking for a smaller, more accessible audio dataset, have a look at ESC-50.

Downloading and extracting

Go to the dataset’s web page and click the download link at the bottom. As mentioned, the data is free of charge for non-commercial projects only; for this explanatory blog post, which merely showcases data parsing and related concepts, that’s fine.

Depending on your connection, the download might take a few minutes. After the download, unpack the archive. You now have two folders, audio and metadata. We only need the first one.

Within this folder, the files are pre-arranged into ten folders, fold1 to fold10. All files are named according to the following scheme:

{SOURCE_ID}-{LABEL}-{TAKE}-{SLICE}.wav

SOURCE_ID identifies the original source file, LABEL is the class label, and TAKE and SLICE distinguish between multiple samples cut from the same original recording.

Parsing audio data into TFRecords

A word beforehand: if this is your first time working with TFRecords (TensorFlow’s native format for efficient data storage), or you need a refresher, have a look at this Google Colab notebook and the accompanying description here.

We do not need the metadata file because each audio sample’s label is encoded in its filename. We can thus start directly with writing the data to TFRecord files. To get this done, we first import some packages and define a couple of helper functions:

Afterwards, we define a function that loads the audio from disk. For this, we use the librosa library. Though there are alternatives, such as scipy, I found librosa the most convenient. Furthermore, we can quickly extract the label by splitting the filename first on “/” and then on “-”. In the end, we return the audio data, its sampling rate, the label, and the filename:

Next, we define the function that makes the extracted data ready to be stored. Here, we use the helper functions implemented previously. As I describe in an introductory, hands-on guide to the TFRecord format, these functions make integers, floats, strings, and byte data ready for writing to disk. Giving the features appropriate names helps us extract them later on. Lastly, we wrap everything into an Example object, which is like a box with some content and properties. The following image visualizes this concept:

The TFRecord format stores the data as Example objects, which can be thought of as above. Image by the author.
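In sketch form, the serialization function might look like the following; the feature names y, sr, len, label, and filename follow the ones referenced later in the post, and I construct the tf.train.Feature objects inline rather than through the (hypothetical) helpers:

```python
import tensorflow as tf


def to_example(y, sr, label, filename):
    # Naming each feature lets us look it up again when reading the data back.
    feature = {
        "y": tf.train.Feature(float_list=tf.train.FloatList(value=y)),
        "len": tf.train.Feature(int64_list=tf.train.Int64List(value=[len(y)])),
        "sr": tf.train.Feature(int64_list=tf.train.Int64List(value=[sr])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
        "filename": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[filename.encode("utf-8")])
        ),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))
```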

We now have two core functionalities defined: loading the audio data from disk and parsing it into a TFRecord-compatible format. We can therefore implement the method that combines both. To do so, we create a TFRecordWriter object, which is responsible for writing the data to disk, and iterate over all audio files we’ve found. Each audio file is parsed, packed into an Example object, and written to a TFRecord file.
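A sketch of this step, assuming the hypothetical load_audio and to_example functions from above:

```python
import glob
import os

import tensorflow as tf


def write_fold(fold_dir, output_path):
    # One TFRecord file holds all samples of one fold.
    files = sorted(glob.glob(os.path.join(fold_dir, "*.wav")))
    with tf.io.TFRecordWriter(output_path) as writer:
        for file_path in files:
            y, sr, label, filename = load_audio(file_path)
            example = to_example(y, sr, label, filename)
            writer.write(example.SerializeToString())
```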

Even though the dataset is relatively large, an individual TFRecord file storing one complete fold is of a manageable size, around 550 MB. For larger datasets, you’ll want to split the data across more files; for such cases, the documentation has some tips.

Lastly, we implement the following main function to iterate over all folds:
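Such a main function might be sketched as follows, assuming the write_fold helper above; the file naming is an assumption:

```python
import os


def main(audio_dir, output_dir):
    # The dataset ships with ten pre-defined folds: fold1 ... fold10.
    for i in range(1, 11):
        fold_dir = os.path.join(audio_dir, f"fold{i}")
        output_path = os.path.join(output_dir, f"fold{i}.tfrecord")
        write_fold(fold_dir, output_path)
```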

To round off our small script, we define an argument parser. It takes the path to the audio directory mentioned at the beginning and the output directory. If the output directory does not already exist, it will be created:
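A sketch of the parser; the flag names are assumptions, so adapt them to your script:

```python
import argparse
import os


def build_parser():
    parser = argparse.ArgumentParser(
        description="Convert the UrbanSound8K folds to TFRecord files."
    )
    parser.add_argument(
        "--audio_dir", required=True,
        help="path to the extracted 'audio' directory",
    )
    parser.add_argument(
        "--output_dir", required=True,
        help="directory the TFRecord files are written to",
    )
    return parser


# The script itself would finish with something like:
# args = build_parser().parse_args()
# os.makedirs(args.output_dir, exist_ok=True)  # create it if missing
# main(args.audio_dir, args.output_dir)
```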

The script is then run by calling python /path/to/script.py together with the two directory arguments.

With the script for creating the TFRecord files defined, we’ll want to read them back later. So that’s what we’ll implement next.

Reading audio data from TFRecords

Once you’ve done it a couple of times, getting the data out of the TFRecord files is actually very straightforward. We simply invert the storing procedure: previously, we put the features, named sr, len, y, and so on, into the box; we now use the same names to get the data back out.

The only caveat is the audio data. Because it is an array, we have to restore its shape; that’s why we stored the len property. Similarly, the filename has to be decoded back to a string:
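A sketch of such a parsing function; the feature names mirror the ones used when writing, and VarLenFeature plus the stored length restores the audio array’s shape:

```python
import tensorflow as tf


def parse_example(serialized):
    # Feature names and types must match those used when writing.
    feature_description = {
        "y": tf.io.VarLenFeature(tf.float32),
        "len": tf.io.FixedLenFeature([], tf.int64),
        "sr": tf.io.FixedLenFeature([], tf.int64),
        "label": tf.io.FixedLenFeature([], tf.int64),
        "filename": tf.io.FixedLenFeature([], tf.string),
    }
    parsed = tf.io.parse_single_example(serialized, feature_description)
    # The audio comes back as a sparse tensor; densify it and restore
    # its shape from the stored length.
    y = tf.sparse.to_dense(parsed["y"])
    y = tf.reshape(y, [parsed["len"]])
    return y, parsed["sr"], parsed["label"], parsed["filename"]
```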

We use the following function to read the content of one or more TFRecord files. It returns a dataset object:
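A sketch, assuming the parse_example function above:

```python
import tensorflow as tf


def get_dataset(filenames):
    # `filenames` can be a single path or a list of TFRecord paths.
    dataset = tf.data.TFRecordDataset(filenames)
    # Deserialize every record back into (audio, sr, label, filename) tuples.
    return dataset.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
```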

We can iterate over the first few elements using a for loop, inspecting them for any errors. Each sample consists of four components: the actual audio data, the label, the sampling rate, and the original filename:
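As a sketch, with `dataset` being the object returned above (the function name is an assumption):

```python
import tensorflow as tf


def inspect_samples(dataset, n=3):
    # Print one line per sample: filename, label, sampling rate, audio shape.
    for y, sr, label, filename in dataset.take(n):
        print(
            filename.numpy().decode("utf-8"),
            int(label),
            int(sr),
            tuple(y.shape),
        )
```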

One caveat is the varying clip duration: a regular batch operation would fail on tensors of different lengths. Therefore, we have to pad the batches. We can do this in the following way:
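A sketch of the padding step; the shapes follow the four-tuple order (audio, sampling rate, label, filename) from the parsing function, and the batch size is an arbitrary choice:

```python
import tensorflow as tf

MAX_LEN = 178017  # length of the longest clip in the dataset


def make_batches(dataset, batch_size=16):
    # Only the audio tensor is variable-length and gets padded up to
    # MAX_LEN; the three scalars need no padding, hence the empty shapes.
    return dataset.padded_batch(
        batch_size,
        padded_shapes=([MAX_LEN], [], [], []),
        drop_remainder=True,
    )
```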

Note: We are padding a sample up to 178017 entries because the longest audio file is slightly over 4 seconds and we don’t want to truncate this data. All other files max out at 4 seconds * 44100 Hz = 176400 samples.

That is it! Now it’s up to you.

Summary

In this post, we parsed the UrbanSound8K dataset into the TFRecord format. To do so, we used the librosa Python library and a couple of helper functions. Then, we iterated over the pre-defined folds and parsed each one separately.

In the end, we extracted the data and also covered padding a batch. This gave us a dataset we can iterate over.


Parsing the UrbanSound8K Dataset with TensorFlow was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.


