
Spoken Language Recognition on Mozilla Common Voice — Audio Transformations | by Sergey Vilov | Aug, 2023


Photo by Kelly Sikkema on Unsplash

This is the third article on spoken language recognition based on the Mozilla Common Voice dataset. In Part I, we discussed data selection and preprocessing, and in Part II we analysed the performance of several neural network classifiers.

The final model achieved 92% accuracy and 97% pairwise accuracy. Since this model suffers from somewhat high variance, the accuracy could probably be improved by adding more data. One very common way to get additional data is to synthesize it by performing various transformations on the available dataset.

In this article, we will consider five popular transformations for audio data augmentation: adding noise, changing speed, changing pitch, time masking, and cut & splice.

The tutorial notebook can be found here.

For illustration purposes, we will use the sample common_voice_en_100040 from the Mozilla Common Voice (MCV) dataset. It is the sentence "The burning fire had been extinguished."

import librosa as lr
import IPython

signal, sr = lr.load('./transformed/common_voice_en_100040.wav', res_type='kaiser_fast') # load signal

IPython.display.Audio(signal, rate=sr)

Original sample common_voice_en_100040 from MCV.
Original signal waveform (image by the author)

Adding noise is the simplest audio augmentation. The amount of noise is characterised by the signal-to-noise ratio (SNR), here the ratio between the maximal signal amplitude and the standard deviation of the noise. We will generate several noise levels, defined via SNR, and see how they change the signal.

import numpy as np

SNRs = (5, 10, 100, 1000) # signal-to-noise ratio: max amplitude over noise std

noisy_signal = {}

for snr in SNRs:
    noise_std = max(abs(signal))/snr # noise standard deviation for this SNR
    noise = noise_std*np.random.randn(len(signal)) # generate Gaussian noise with the given std
    noisy_signal[snr] = signal + noise

IPython.display.display(IPython.display.Audio(noisy_signal[5], rate=sr))
IPython.display.display(IPython.display.Audio(noisy_signal[1000], rate=sr))

Signals obtained by superimposing noise with SNR=5 and SNR=1000 on the original MCV sample common_voice_en_100040 (generated by the author).
Signal waveform for several noise levels (image by the author)

So, SNR=1000 sounds almost like the unperturbed audio, while at SNR=5 one can only distinguish the strongest parts of the signal. In practice, the SNR level is a hyperparameter that depends on the dataset and the chosen classifier.
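For reuse across a whole dataset, the snippet above can be wrapped in a small helper (a sketch; the function name and the optional seeded generator are our additions, not part of the original notebook):

```python
import numpy as np

def add_noise(signal, snr, rng=None):
    """Superimpose Gaussian noise at a given SNR (max amplitude / noise std)."""
    if rng is None:
        rng = np.random.default_rng()
    noise_std = np.max(np.abs(signal)) / snr  # noise std implied by the SNR
    return signal + noise_std * rng.standard_normal(len(signal))
```

Passing a seeded `np.random.default_rng` makes the augmentation reproducible between runs.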

The simplest way to change the speed is just to pretend that the signal has a different sample rate. However, this will also change the pitch (how low/high in frequency the audio sounds). Increasing the sampling rate makes the voice sound higher. To illustrate this, we can "increase" the sampling rate for our example by a factor of 1.5:

IPython.display.Audio(signal, rate=sr*1.5)
Signal obtained by using a false sampling rate for the original MCV sample common_voice_en_100040 (generated by the author).

Changing the speed without affecting the pitch is harder. One needs to use the phase vocoder (PV) algorithm. In brief, the input signal is first split into overlapping frames. Then, the spectrum within each frame is computed by applying the fast Fourier transform (FFT). The playing speed is then modified by resynthesizing the frames at a different rate. Since the frequency content of each frame is not affected, the pitch stays the same. The PV interpolates between the frames and uses the phase information to achieve smoothness.
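To make these steps concrete, here is a minimal NumPy sketch of a phase-vocoder time stretch (an illustrative toy, not the stretch_wo_loop implementation used below; it assumes a mono signal longer than one FFT frame):

```python
import numpy as np

def pv_time_stretch(x, rate, n_fft=1024, hop=256):
    """Toy phase-vocoder time stretch: rate > 1 speeds up, rate < 1 slows down."""
    window = np.hanning(n_fft)
    # 1. Split into overlapping frames and take the FFT of each (the STFT).
    n_frames = 1 + (len(x) - n_fft) // hop
    stft = np.array([np.fft.rfft(window * x[i*hop:i*hop + n_fft])
                     for i in range(n_frames)])
    # 2. Read frames at a different rate (fractional frame positions).
    time_steps = np.arange(0, n_frames - 1, rate)
    # Expected phase advance per hop for each frequency bin.
    bin_phase_adv = 2*np.pi*hop*np.arange(n_fft//2 + 1)/n_fft
    out = np.zeros(len(time_steps)*hop + n_fft)
    phase = np.angle(stft[0])
    for k, t in enumerate(time_steps):
        i = int(t)
        frac = t - i
        # Interpolate magnitude between neighbouring frames.
        mag = (1 - frac)*np.abs(stft[i]) + frac*np.abs(stft[i + 1])
        # 3. Resynthesize with accumulated phase so frames stay coherent.
        frame = np.fft.irfft(mag*np.exp(1j*phase), n_fft)
        out[k*hop:k*hop + n_fft] += window*frame
        # Phase deviation from the expected advance, wrapped to [-pi, pi].
        dphi = np.angle(stft[i + 1]) - np.angle(stft[i]) - bin_phase_adv
        dphi -= 2*np.pi*np.round(dphi/(2*np.pi))
        phase += bin_phase_adv + dphi
    return out
```

The output length scales as 1/rate, while per-bin frequency content, and hence pitch, is preserved.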

For our experiments, we will use the stretch_wo_loop time-stretching function from this PV implementation.

stretching_factor = 1.3

signal_stretched = stretch_wo_loop(signal, stretching_factor)
IPython.display.Audio(signal_stretched, rate=sr)

Signal obtained by varying the speed of the original MCV sample common_voice_en_100040 (generated by the author).
Signal waveform after speed increase (image by the author)

So, the duration of the signal decreased since we increased the speed. However, one can hear that the pitch has not changed. Note that when the stretching factor is substantial, the phase interpolation between frames might not work well. As a result, echo artefacts may appear in the transformed audio.

To change the pitch without affecting the speed, we can use the same PV time stretch but pretend that the signal has a different sampling rate, such that the total duration of the signal stays the same:

IPython.display.Audio(signal_stretched, rate=sr/stretching_factor)
Signal obtained by varying the pitch of the original MCV sample common_voice_en_100040 (generated by the author).

Why do we even bother with this PV when librosa already has time_stretch and pitch_shift functions? Well, those functions transform the signal back to the time domain. If you need to compute embeddings afterwards, you will lose time on redundant Fourier transforms. On the other hand, it is easy to modify the stretch_wo_loop function so that it yields Fourier output without taking the inverse transform. One could probably also dig into the librosa code to achieve similar results.

These two transformations were originally proposed in the frequency domain (Park et al. 2019). The idea was to save time on FFT by using precomputed spectra for audio augmentations. For simplicity, we will demonstrate how these transformations work in the time domain. The listed operations can easily be transferred to the frequency domain by replacing the time axis with frame indices.

Time masking

The idea of time masking is to cover up a random region of the signal. The neural network then has fewer chances to learn signal-specific temporal variations that do not generalize.

max_mask_length = 0.3 # maximum mask duration, fraction of signal length

L = len(signal)

mask_length = int(L*np.random.rand()*max_mask_length) # randomly choose mask length
mask_start = int((L-mask_length)*np.random.rand()) # randomly choose mask position

masked_signal = signal.copy()
masked_signal[mask_start:mask_start+mask_length] = 0

IPython.display.Audio(masked_signal, rate=sr)

Signal obtained by applying the time mask transformation to the original MCV sample common_voice_en_100040 (generated by the author).
Signal waveform after time masking (the masked region is indicated in orange) (image by the author)
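As noted earlier, the same operation transfers to the frequency domain by masking frame indices of a precomputed spectrogram instead of samples. A minimal sketch (the function name is our own; spec is assumed to have shape (n_freq_bins, n_frames)):

```python
import numpy as np

def mask_spectrogram_time(spec, max_mask_frac=0.3, rng=None):
    """Time masking on a precomputed spectrogram: zero a random run of frames."""
    if rng is None:
        rng = np.random.default_rng()
    n_frames = spec.shape[1]
    mask_length = int(n_frames * rng.random() * max_mask_frac)  # mask length in frames
    mask_start = int((n_frames - mask_length) * rng.random())   # mask position
    out = spec.copy()
    out[:, mask_start:mask_start + mask_length] = 0
    return out
```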

Cut & splice

The idea is to replace a randomly chosen region of the signal with a random fragment from another signal that has the same label. The implementation is almost the same as for time masking, except that a piece of another signal is placed instead of the mask.

other_signal, sr = lr.load('./common_voice_en_100038.wav', res_type='kaiser_fast') # load second signal

max_fragment_length = 0.3 # maximum fragment duration, fraction of signal length

L = min(len(signal), len(other_signal))

mask_length = int(L*np.random.rand()*max_fragment_length) # randomly choose fragment length
mask_start = int((L-mask_length)*np.random.rand()) # randomly choose fragment position

synth_signal = signal.copy()
synth_signal[mask_start:mask_start+mask_length] = other_signal[mask_start:mask_start+mask_length]

IPython.display.Audio(synth_signal, rate=sr)

Synthetic signal obtained by applying the cut & splice transformation to the original MCV sample common_voice_en_100040 (generated by the author).
Signal waveform after the cut & splice transformation (the inserted fragment from the other signal is indicated in orange) (image by the author)
