
Goal: A loss function that favours what sounds better to the human ear¶

  • Depending on the temporal/spectral shape of the noise, the ear favours one signal over another, despite the same noise power.

Problem: Popular loss functions that are effective for gradient descent, such as MSE, don’t reflect these ear preferences¶

  • Examples:
  • A sound with spectrally flat noise from PCM quantization.
  • A sound with psycho-acoustically spectrally shaped noise, even with higher noise power.
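The first example can be sketched in a few lines of plain numpy. This is a hypothetical test signal (not the musical excerpt used below), and the bit depth is chosen arbitrarily; it only illustrates that uniform PCM quantization produces an approximately spectrally flat error whose power is close to the classic Δ²/12 estimate:

```python
import numpy as np

# A hypothetical test signal (not the musical excerpt used below):
fs = 8000
t = np.arange(fs) / fs
x = 0.5 * np.sin(2 * np.pi * 440 * t)

# Uniform (PCM-style) quantization: the error is roughly white,
# i.e. spectrally flat, with power close to delta^2 / 12.
bits = 8
delta = 2.0 / 2**bits            # step size for signals in [-1, 1)
x_q = delta * np.round(x / delta)
noise = x_q - x

print('noise power:', np.mean(noise**2))
print('delta^2/12 :', delta**2 / 12)
```

The psycho-acoustically shaped noise of the second example would instead concentrate its power in frequency regions where the signal masks it.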

Evaluation with MSE¶

  • Evaluating these examples with the Mean Squared Error (MSE) loss wrongly favours the first, worse-sounding, example!

Perceptual Loss Function using a Psycho-Acoustic Prefilter¶

  • A psycho-acoustic prefilter is a linear, time-varying filter that normalizes an audio signal to its psycho-acoustic masking threshold.
  • The masking threshold is generated by a psycho-acoustic model, similar to those used in audio coders.
  • After this prefilter we obtain a new signal domain, and we apply the MSE loss function there.
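The idea can be sketched as follows. `prefiltered_mse` is a hypothetical helper, not the notebook's actual prefilter, and the per-frame masking thresholds are assumed to be already given (in practice a psycho-acoustic model computes them):

```python
import numpy as np

def prefiltered_mse(x, x_hat, masking_threshold):
    """MSE after a per-bin psycho-acoustic prefilter (a sketch).

    x, x_hat:          arrays of shape (frames, frame_len)
    masking_threshold: magnitudes of shape (frames, frame_len//2 + 1),
                       assumed to come from a psycho-acoustic model.

    Dividing each spectrum by the threshold normalizes the signals to
    the masking threshold, so errors below the threshold shrink and
    equally loud-sounding errors produce similar loss values.
    """
    X = np.fft.rfft(x, axis=-1) / masking_threshold
    X_hat = np.fft.rfft(x_hat, axis=-1) / masking_threshold
    return np.mean(np.abs(X - X_hat) ** 2)
```

Doubling the threshold everywhere scales the loss down by a factor of four, which is exactly the point: error energy sitting under a high masking threshold is penalized less.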

Psycho-Acoustic Prefilter Example¶

Musical excerpt: Slash - Anastasia, Released: 2012, Album: Apocalyptic Love

In [1]:
# Imports
import torch
import torchaudio
import IPython.display as ipd
In [2]:
# Load audio files
audio_wav, sr_wav = torchaudio.load('audio_original.wav')
audio_mp3, sr_mp3 = torchaudio.load('audio_mp3_128k.wav')
audio_quantized, sr_wav = torchaudio.load('audio_quantized.wav')
In [3]:
# Playback
print('Example Signal Original (PCM 16-bit 44.1kHz)')
display(ipd.Audio(audio_wav,rate=sr_wav))
print('Example Signal MP3 128k')
display(ipd.Audio(audio_mp3,rate=sr_mp3))
print('Example Signal Quantized (chosen quantization factor)')
display(ipd.Audio(audio_quantized,rate=sr_wav))
Example Signal Original (PCM 16-bit 44.1kHz)
Example Signal MP3 128k
Example Signal Quantized (chosen quantization factor)

Mean Squared Error (MSE) Loss¶

  • One of the most common loss functions, widely used in many different applications.
  • It assesses the average squared difference between the observed and predicted values.
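As a quick illustration of the definition, in plain numpy (equivalent to `torch.nn.MSELoss` with its default `'mean'` reduction), with made-up values:

```python
import numpy as np

y     = np.array([1.0, 2.0, 3.0])   # observed values
y_hat = np.array([1.5, 2.0, 2.0])   # predicted values

# mean of the element-wise squared differences:
mse = np.mean((y - y_hat) ** 2)     # (0.25 + 0.0 + 1.0) / 3
print(mse)                          # 0.4166666666666667
```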

Reference:
https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html

In [4]:
# MSE Loss
loss_mse = torch.nn.MSELoss()
mse_mp3_original = loss_mse(audio_mp3,audio_wav)
print('MSE Loss (mp3 and original):', mse_mp3_original*100)
mse_quant_original = loss_mse(audio_quantized,audio_wav)
print('MSE Loss (quantized and original):', mse_quant_original*100)
MSE Loss (mp3 and original): tensor(5.8766)
MSE Loss (quantized and original): tensor(1.3066)

Observe:

  • The MSE loss of the mp3 is significantly greater than the MSE loss of the quantized audio, even though the perceived quality of the mp3 is significantly superior.

Psycho-Acoustic Pre-Filtering + MSE¶

  • A Psycho-Acoustic Model is used to generate filters that are applied to each block in the time-frequency domain.
  • Computationally expensive.

Reference:
Schuller, G. (2020). Filter Banks and Audio Coding. Springer International Publishing. https://doi.org/10.1007/978-3-030-51249-1

In [5]:
# Load pre-filtered audio files
audio_wav_pref, sr_wav = torchaudio.load('audio_originalpref.wav')
audio_mp3_pref, sr_mp3 = torchaudio.load('audio_mp3_128kpref.wav')
audio_quantized_pref, sr_wav = torchaudio.load('audio_quantizedpref.wav')
In [6]:
# Pre-Filtering + MSE Loss
loss_mse = torch.nn.MSELoss()
mse_mp3_original = loss_mse(audio_mp3_pref[0,:],audio_wav_pref[0,:])
print('Pre-Filtering + MSE Loss mp3:', mse_mp3_original.numpy()*10000)
mse_quant_original = loss_mse(audio_quantized_pref[0,:],audio_wav_pref[0,:])
print('Pre-Filtering + MSE Loss Quantized:', mse_quant_original.numpy()*10000)
Pre-Filtering + MSE Loss mp3: 1.00347948318813
Pre-Filtering + MSE Loss Quantized: 1.4562977594323456

Observe:

  • Now, calculating the same MSE loss in the psycho-acoustically pre-filtered domain, the MSE loss for the mp3 audio is smaller than the MSE loss of the quantized audio.

Log Spectral Difference¶

  • A distance measure (in dB) between log magnitudes of the spectra.
  • Much less computationally expensive.
  • Can work well in certain applications.

Reference:
Rabiner, L. and Juang, B., 1993. Fundamentals of speech recognition. Englewood Cliffs, N.J.: PTR Prentice Hall.
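The `lsd_loss` module imported below is local to this notebook. A minimal numpy sketch of one common LSD definition (RMS of the dB difference per frame, averaged over frames; non-overlapping frames, frame length, and `eps` are arbitrary choices here) might look like:

```python
import numpy as np

def lsd(x, x_hat, n_fft=512, eps=1e-8):
    """Log-spectral distance (in dB) between two 1-D signals,
    using non-overlapping frames for simplicity."""
    n = (len(x) // n_fft) * n_fft
    X = np.fft.rfft(x[:n].reshape(-1, n_fft), axis=-1)
    X_hat = np.fft.rfft(x_hat[:n].reshape(-1, n_fft), axis=-1)
    # dB difference of the magnitude spectra, per frame and bin:
    log_diff = 20 * np.log10(np.abs(X) + eps) - 20 * np.log10(np.abs(X_hat) + eps)
    # RMS over bins, then mean over frames:
    return np.mean(np.sqrt(np.mean(log_diff ** 2, axis=-1)))
```

Identical signals give an LSD of zero; any spectral deviation gives a positive distance.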

In [7]:
from lsd_loss import LSDLoss
loss_lsd = LSDLoss()
lsd_mp3_original = loss_lsd(audio_mp3[0,:],audio_wav[0,:])
print('LSD Loss mp3:', lsd_mp3_original)
lsd_quant_original = loss_lsd(audio_quantized[0,:],audio_wav[0,:])
print('LSD Loss Quantized:', lsd_quant_original)
LSD Loss mp3: tensor(0.9744)
LSD Loss Quantized: tensor(1.9903)

Observe:

  • The LSD loss of the quantized audio is greater than that of the mp3 audio, favouring the better-sounding audio.

Multi Scale Spectral Loss¶

  • More computationally expensive than the LSD, but less than the psycho-acoustic pre-filtering.
  • Given two audio files, we compute their magnitude spectrograms $S_i$ and $\hat{S}_i$, respectively, with a given FFT size $i$, and define the loss as the sum of the L1 difference between $S_i$ and $\hat{S}_i$ as well as the L1 difference between $\log S_i$ and $\log \hat{S}_i$. The total reconstruction loss is then the sum of these spectral losses over different FFT sizes.

Reference:
Jesse Engel, Lamtharn (Hanoi) Hantrakul, Chenjie Gu, & Adam Roberts (2020). DDSP: Differentiable Digital Signal Processing. In International Conference on Learning Representations.
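Asteroid's `SingleSrcMultiScaleSpectral`, used below, implements a version of this loss. A stripped-down numpy sketch of the same idea (non-overlapping frames; the FFT sizes and `eps` are arbitrary choices here) could be:

```python
import numpy as np

def multiscale_spectral_loss(x, x_hat, fft_sizes=(2048, 1024, 512), eps=1e-7):
    """Sum over FFT sizes of L1(|S| - |S_hat|) + L1(log|S| - log|S_hat|),
    with simple non-overlapping frames (DDSP uses overlapping STFTs)."""
    total = 0.0
    for n_fft in fft_sizes:
        n = (len(x) // n_fft) * n_fft
        S = np.abs(np.fft.rfft(x[:n].reshape(-1, n_fft), axis=-1))
        S_hat = np.abs(np.fft.rfft(x_hat[:n].reshape(-1, n_fft), axis=-1))
        total += np.mean(np.abs(S - S_hat))                          # linear term
        total += np.mean(np.abs(np.log(S + eps) - np.log(S_hat + eps)))  # log term
    return total
```

Combining several FFT sizes trades off time and frequency resolution, and the log term emphasizes low-energy spectral details that the linear term would ignore.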

In [8]:
from asteroid.losses import SingleSrcMultiScaleSpectral
loss_multiScaleSpectral = SingleSrcMultiScaleSpectral()
multiScale_mp3_original = loss_multiScaleSpectral(audio_mp3_pref,audio_wav_pref)
print('Multi Scale Spectral Loss mp3:', multiScale_mp3_original.numpy()/1000000)
multiScale_quant_original = loss_multiScaleSpectral(audio_quantized_pref,audio_wav_pref)
print('Multi Scale Spectral Loss Quantized:', multiScale_quant_original.numpy()/1000000)
Multi Scale Spectral Loss mp3: [1.14763687]
Multi Scale Spectral Loss Quantized: [2.28411725]

Observe:

  • The multi-scale spectral loss of the quantized audio is also greater than that of the mp3 audio, favouring the better-sounding audio.

Results¶

  • Some losses favor the better sounding audio, while others don't.
  • The psycho-acoustic prefilter exploits masking effects of the human hearing system and can be combined with a loss function to design a psycho-acoustically motivated perceptual loss function.
  • We can also transform the audio to a 'psycho-acoustic pre-filter domain' and perform different processing in this domain.