librosa 语音库(三) librosa.feature. 中的 spectrogram 与 melspectrogram

时间: 2023-07-11 admin 互联网

librosa 语音库(三) librosa.feature. 中的 spectrogram 与 melspectrogram

librosa 语音库(三) librosa.feature. 中的 spectrogram 与 melspectrogram

窗口的长度与 n_fft 需要匹配大小长度;

1. Mel 语谱图的函数定义

librosa.feature.melspectrogram()函数在spectral.py 中,实现过程为:

def melspectrogram(y=None, sr=22050, S=None, n_fft=2048, hop_length=512,power=2.0, **kwargs):S, n_fft = _spectrogram(y=y, S=S, n_fft=n_fft, hop_length=hop_length, power=power)# Build a Mel filtermel_basis = filters.mel(sr, n_fft, **kwargs)return np.dot(mel_basis, S)

可以看出 Mel_ 语谱图的计算主要有两个函数构成

  1. 计算出信号的语谱图(功率谱形式构成的), 由 _spectrogram() 函数实现
  2. 构造Mel 滤波器, 由filters.mel 函数实现;
  3. 将Mel 滤波器组与语谱图做矩阵乘法, 得到 mel 语谱图;

1.1 _spectrogram() 函数实现

def _spectrogram(y=None, S=None, n_fft=2048, hop_length=512, power=1):if S is not None:# Infer n_fft from spectrogram shapen_fft = 2 * (S.shape[0] - 1)else:# Otherwise, compute a magnitude spectrogram from inputS = np.abs(stft(y, n_fft=n_fft, hop_length=hop_length))**powerreturn S, n_fft

其中的核心代码部分为stft()函数,短时傅里叶变化stft函数的实现过程为:

def stft(y, n_fft=2048, hop_length=None, win_length=None, window='hann',center=True, dtype=np.complex64, pad_mode='reflect'):# By default, use the entire frameif win_length is None:win_length = n_fft# Set the default hop, if it's not already specifiedif hop_length is None:hop_length = int(win_length // 4)fft_window = get_window(window, win_length, fftbins=True)# Pad the window out to n_fft sizefft_window = util.pad_center(fft_window, n_fft)# Reshape so that the window can be broadcastfft_window = fft_window.reshape((-1, 1))# Check audio is validutil.valid_audio(y)# Pad the time series so that frames are centeredif center:y = np.pad(y, int(n_fft // 2), mode=pad_mode)# Window the time series.y_frames = util.frame(y, frame_length=n_fft, hop_length=hop_length)# Pre-allocate the STFT matrixstft_matrix = np.empty((int(1 + n_fft // 2), y_frames.shape[1]),dtype=dtype,order='F')# how many columns can we fit within MAX_MEM_BLOCK?n_columns = int(util.MAX_MEM_BLOCK / (stft_matrix.shape[0] *stft_matrix.itemsize))for bl_s in range(0, stft_matrix.shape[1], n_columns):bl_t = min(bl_s + n_columns, stft_matrix.shape[1])stft_matrix[:, bl_s:bl_t] = fft.fft(fft_window *y_frames[:, bl_s:bl_t],axis=0)[:stft_matrix.shape[0]]return stft_matrix

1.2 filters.mel 函数实现

输出是 [ n_mels, 1 + (n_fft / 2) ]

这其中需要的注意的是,不同的librosa版本的mel()函数的归一化参数传递的也是不一样的,比如python3.6中的norm可以传递1 或是 None,但是python3.7中就需要传递norm=‘slaney’.

产生一个线性变换矩阵, 将FFT 频率下的bin 的 转化到 Mel 尺度下的bin

@cache(level=10)
def mel(sr,n_fft,n_mels=128,fmin=0.0,fmax=None,htk=False,norm="slaney",dtype=np.float32,
):"""Create a Mel filter-bank.This produces a linear transformation matrix to projectFFT bins onto Mel-frequency bins.Parameters----------sr        : number > 0 [scalar]sampling rate of the incoming signaln_fft     : int > 0 [scalar]number of FFT componentsn_mels    : int > 0 [scalar]number of Mel bands to generatefmin      : float >= 0 [scalar]lowest frequency (in Hz)fmax      : float >= 0 [scalar]highest frequency (in Hz).If `None`, use ``fmax = sr / 2.0``htk       : bool [scalar]use HTK formula instead of Slaneynorm : {None, 'slaney', or number} [scalar]If 'slaney', divide the triangular mel weights by the width of the mel band(area normalization).If numeric, use `librosa.util.normalize` to normalize each filter by to unit l_p norm.See `librosa.util.normalize` for a full description of supported norm values(including `+-np.inf`).Otherwise, leave all the triangles aiming for a peak value of 1.0dtype : np.dtypeThe data type of the output basis.By default, uses 32-bit (single-precision) floating point.Returns-------M         : np.ndarray [shape=(n_mels, 1 + n_fft/2)]Mel transform matrixSee also--------librosa.util.normalizeNotes-----This function caches at level 10.Examples-------->>> melfb = librosa.filters.mel(22050, 2048)>>> melfbarray([[ 0.   ,  0.016, ...,  0.   ,  0.   ],[ 0.   ,  0.   , ...,  0.   ,  0.   ],...,[ 0.   ,  0.   , ...,  0.   ,  0.   ],[ 0.   ,  0.   , ...,  0.   ,  0.   ]])Clip the maximum frequency to 8KHz>>> librosa.filters.mel(22050, 2048, fmax=8000)array([[ 0.  ,  0.02, ...,  0.  ,  0.  ],[ 0.  ,  0.  , ...,  0.  ,  0.  ],...,[ 0.  ,  0.  , ...,  0.  ,  0.  ],[ 0.  ,  0.  , ...,  0.  ,  0.  ]])>>> import matplotlib.pyplot as plt>>> fig, ax = plt.subplots()>>> img = librosa.display.specshow(melfb, x_axis='linear', ax=ax)>>> ax.set(ylabel='Mel filter', title='Mel filter bank')>>> fig.colorbar(img, ax=ax)"""if fmax is None:fmax = float(sr) / 2# Initialize the weightsn_mels = int(n_mels)weights = np.zeros((n_mels, int(1 + n_fft // 2)), dtype=dtype)# Center freqs of each FFT binfftfreqs = fft_frequencies(sr=sr, n_fft=n_fft)# 'Center freqs' of mel bands - uniformly spaced between limitsmel_f = mel_frequencies(n_mels + 2, fmin=fmin, fmax=fmax, htk=htk)fdiff = np.diff(mel_f)ramps = np.subtract.outer(mel_f, fftfreqs)for i in range(n_mels):# lower and upper slopes for all binslower = -ramps[i] / fdiff[i]upper = ramps[i + 2] / fdiff[i + 1]# .. then intersect them with each other and zeroweights[i] = np.maximum(0, np.minimum(lower, upper))if norm == "slaney":# Slaney-style mel is scaled to be approx constant energy per channelenorm = 2.0 / (mel_f[2 : n_mels + 2] - mel_f[:n_mels])weights *= enorm[:, np.newaxis]else:weights = util.normalize(weights, norm=norm, axis=-1)# Only check weights if f_mel[0] is positiveif not np.all((mel_f[:-2] == 0) | (weights.max(axis=1) > 0)):# This means we have an empty channel somewherewarnings.warn("Empty filters detected in mel frequency basis. ""Some channels will produce empty responses. ""Try increasing your sampling rate (and fmax) or ""reducing n_mels.")return weights

2. 实例分析

假设 调用形式如下 :

librosa.feature.melspectrogram(y=current_window ,sr=sample_rate, n_mels=n_mels, fmin=f_min, fmax=f_max, n_fft=nfft, hop_length=hop)

def melspectrogram(y=None,sr=22050,S=None,n_fft=2048,hop_length=512,win_length=None,window="hann",center=True,pad_mode="reflect",power=2.0,**kwargs,
):S, n_fft = _spectrogram(y=y,S=S,n_fft=n_fft,hop_length=hop_length,power=power,win_length=win_length,window=window,center=center,pad_mode=pad_mode,)# Build a Mel filtermel_basis = filters.mel(sr, n_fft, **kwargs)return np.dot(mel_basis, S)

输入参数:
y = 32000 个点的音频数据,
sr = 4000;
n_mels = 256;
fmin = 50 , fmax = 2000
n_fft: fft 点数800;
hop_length = win_length / 4 = n_fft / 4

返回:
mel 语谱图的输出大小为 [n_mels, n_frames ],

Mel 语谱图的输出结果是, Mel 滤波器组 和 功率谱格式的语谱图 两者共同作用的结果。

[ n_mels, 1 + n f f t 2 \frac{n_{fft}}{2} 2nfft​​ ] ×[ 1 + n f f t 2 \frac{n_{fft}}{2} 2nfft​​, n_frames ]

= [n_mels, n_frames ]

注意:

Mel 语谱图函数的输入, 其中的S 这个参数量, 代表了输入中是否提供了语谱图的输入。

  1. S 如果提供了语谱图输入, 此时 S 直接使用输入的语谱图;
  2. S 如果没有提供语谱图的输入, 需要根据输入的一维数据y, 先经过STFT 得到频谱, 然后在取模值的平方, 得到频谱的功率谱形式, 此时的功率谱便作为语谱图。

最终 mel 语谱图的输出大小为 [n_mels, frames ], 由 Mel 滤波器组和短时傅里叶变换STFT 两个 输出矩阵,相乘共同作用到的结果。

该函数中调用了两个函数实现:

2.1 STFT 的输出帧数

根据以上参数

输入参数:
y = 32000 个点的音频数据,
sr = 4000;
n_mels = 256;
fmin = 50 , fmax = 2000
n_fft: fft 点数800;
hop_length = win_length / 4 = n_fft / 4

调用 STFT ,得到的输出 [ 1 + n f f t 2 \frac{n_{fft}}{2} 2nfft​​, n_frames ]

  1. 行数: 1 + n f f t 2 \frac{n_{fft}}{2} 2nfft​​ = 1+ (800/2) = 401

  2. 列数:等于帧数, 帧数 n_frames 求法:
    n f r a m e s = a u d i o l e n g t h / / h o p l e n g t h + 1 , n_{frames} = audio_{length} // hop_{length} + 1, nframes​=audiolength​//hoplength​+1,
    带入上式中的, 可以求得:
    n_frams = (32000 // 200) + 1 = 161,

故此时的STFT 输出 [ 401, 161]

2.2 filter.mels(sr, n_fft, **kwargs)

调用 filter.mels(sr, n_fft, n_mels=128, fmin=0.0, fmax=None, htk=False, norm=1 ) 函数, 生成一个 Mel 滤波器组;

输入参数:
y = 32000 个点的音频数据,
sr = 4000;
n_mels = 256;
fmin = 50 , fmax = 2000
n_fft: fft 点数800;
hop_length = win_length / 4 = n_fft / 4

其中有几个重要的参数,需要关注以下

  1. n f f t n_{fft} nfft​ 的点数选取, n f f t n_{fft} nfft​ 的点数越多, CPU需要的运算量也增加,可以分别的频率也精细;

该参数与与频率分辨率有关:
sr 采样率为 16 KHZ;
假设将音频的频率分辨率设置为 10 HZ,

频率分辨率 bin = F s N f f t \frac{F_{s}}{N_{fft}} Nfft​Fs​​ = 采 样 率 N f f t 的 点 数 \frac{采样率}{N_{fft}的点数} Nfft​的点数采样率​

N f f t N_{fft} Nfft​ = F s 10 \frac{F_{s}}{10} 10Fs​​ = 16000 10 \frac{16000}{10} 1016000​ =1600 点;

2.2.1 Mel 频率