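I am currently working on a deep learning project for speech recognition.
I need to augment my current data by shifting or stretching the audio files,
but the problem is that the shape changes during augmentation.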

  y, sr = librosa.load(os.path.join(train_data_path, label, fname))
  librosa.output.write_wav('./input/train_test2/'+label+'/10000'+fname  ,y,sr)

Even though I did not change anything, the shape changes.
Say my original shape was (99, 81, 1); after I rewrite the file it becomes (77, 81, 1) or something else.

The problem shows up when I use Keras for classification:

inp = Input(shape=input_shape)
norm_inp = BatchNormalization()(inp)
img_1 = Convolution2D(8, kernel_size=2, activation=activations.relu)(norm_inp)
img_1 = Convolution2D(8, kernel_size=2, activation=activations.relu)(img_1)
img_1 = MaxPooling2D(pool_size=(2, 2))(img_1)
img_1 = Dropout(rate=0.2)(img_1)

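Keras does not accept inputs whose shape differs from input_shape. I am not even sure whether the original shape can be preserved after modifying the wav file.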

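  1. Is it possible to keep the original shape?
  2. If not, is it possible to convert the result back to the original file's shape?
  3. What other solutions would you suggest?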
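
On questions 2 and 3, one common workaround is to force every clip to the same number of samples before computing features, so that the spectrogram always has the same number of frames. Below is a minimal sketch, assuming 1-second clips at 16000 Hz. The helper name `fix_length` and the target length are illustrative and do not come from the code in this post (librosa ships a similar utility, `librosa.util.fix_length`).

```python
import numpy as np

def fix_length(samples, target_len=16000):
    """Zero-pad or truncate a 1-D signal to exactly target_len samples."""
    samples = np.asarray(samples)
    if len(samples) >= target_len:
        # Too long: keep only the first target_len samples.
        return samples[:target_len]
    # Too short: append zeros (silence) at the end.
    pad = target_len - len(samples)
    return np.pad(samples, (0, pad), mode="constant")

# Hypothetical clips of different lengths end up with the same shape.
short_clip = np.ones(11000, dtype=np.float32)
long_clip = np.ones(22050, dtype=np.float32)
print(fix_length(short_clip).shape, fix_length(long_clip).shape)
```

Applying this right after loading (and after any shift or stretch) should keep the network input shape constant, at the cost of trailing silence or a clipped tail.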

============================================ The shape changes after I compute the log spectrogram

import numpy as np
from scipy import signal

def log_specgram(audio, sample_rate, window_size=20,
                 step_size=10, eps=1e-10):
    # window_size and step_size are given in milliseconds
    nperseg = int(round(window_size * sample_rate / 1e3))
    noverlap = int(round(step_size * sample_rate / 1e3))
    freqs, times, spec = signal.spectrogram(audio,
                                    fs=sample_rate,
                                    window='hann',
                                    nperseg=nperseg,
                                    noverlap=noverlap,
                                    detrend=False)
    # transpose to (time frames, frequency bins); eps avoids log(0)
    return freqs, times, np.log(spec.T.astype(np.float32) + eps)

The shape of np.log(spec.T.astype(np.float32) + eps) is what differs.
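
The number of rows in that array follows directly from the number of input samples. As a rough sketch, assuming `signal.spectrogram` is left at its default of not padding the signal edges, the frame count is `(n_samples - noverlap) // (nperseg - noverlap)` and the number of frequency bins is `nperseg // 2 + 1`. The helper `specgram_shape` is hypothetical and only illustrates the arithmetic:

```python
def specgram_shape(n_samples, sample_rate, window_size=20, step_size=10):
    """Predict the (frames, freq_bins) shape returned by log_specgram."""
    nperseg = int(round(window_size * sample_rate / 1e3))
    noverlap = int(round(step_size * sample_rate / 1e3))
    hop = nperseg - noverlap              # samples between frame starts
    n_frames = (n_samples - noverlap) // hop
    n_freqs = nperseg // 2 + 1            # one-sided spectrum
    return n_frames, n_freqs

print(specgram_shape(8000, 8000))  # (99, 81)
print(specgram_shape(5804, 8000))  # (71, 81)
```

These two results match the shapes printed elsewhere in this post, which suggests the frame count changes only because the sample count fed into `log_specgram` changes.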

=============================================================== Original file

sample_rate, samples = wavfile.read('./input/train/audio/eight/012c8314_nohash_1.wav')
print(sample_rate , sample_rate_test)
new_sample_rate = 8000
resampled = signal.resample(samples, int(new_sample_rate / sample_rate * samples.shape[0]))
print(resampled2.shape)
_, _, specgram = log_specgram(resampled, sample_rate=new_sample_rate)
print("specgramshape->", specgram.shape)
S = librosa.feature.melspectrogram(y=samples, sr=sample_rate, n_mels=128, fmax=8000)
print("S->", S.shape)

librosa.display.specshow(librosa.power_to_db(S, ref=np.max), y_axis = 'mel' , fmax = 8000, x_axis='time')


16000 22050
(5804,)
specgramshape-> (99, 81)
S-> (128, 32)

===============================================================

After running:

y = librosa.resample(y,sr,16000)
librosa.output.write_wav('./input/train_test/'+label+'/10000'+fname  ,y,sr)

(16000,) (16000,)
(5804,)
specgramshape-> (71, 81)
S-> (128, 32)
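
One reading that is consistent with the numbers printed above (an assumption, not something checked against the actual files): `librosa.load` resamples to 22050 Hz unless `sr=None` is passed, and the signal resampled to 16000 Hz is then written with the old `sr` value, so the file header claims 22050 Hz while the data holds 16000 samples. Resampling that file down to 8000 Hz then yields fewer samples than the original clip. The sketch below repeats the length expression from the `signal.resample` call, assuming a 16000-sample clip; `resampled_length` is a hypothetical helper.

```python
def resampled_length(n_samples, file_rate, new_rate=8000):
    """Length passed to signal.resample for a file read at file_rate."""
    return int(new_rate / file_rate * n_samples)

# Original file: 16000 samples, header says 16000 Hz.
original = resampled_length(16000, 16000)
# Rewritten file: 16000 samples, but header says 22050 Hz.
rewritten = resampled_length(16000, 22050)
print(original, rewritten)  # 8000 5804
```

If this is what happened, writing the file with the rate the samples actually have (16000 here), or loading with `sr=None` so that no resampling takes place, should keep the length at 8000 samples and the spectrogram at (99, 81).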

===============================================================
