For my bachelor's degree final project I want to build a neural network that takes the first 13 MFCC coefficients of a WAV file and returns which speaker, out of a bank of speakers, is talking in the audio file.

Please note that I define:

X = mfcc(sound_voice)
Y = zero_array + 1 in the i-th place (where the i-th place is 0 for the first speaker, 1 for the second, 2 for the third...)

Then I train the machine, and then check its output for some files...

That's what I did... but unfortunately the results look completely random...

Can you help me understand why?

Here is my Python code -
from sklearn.neural_network import MLPClassifier
import python_speech_features
import scipy.io.wavfile as wav
import numpy as np
from os import listdir
from os.path import isfile, join
from random import shuffle
import matplotlib.pyplot as plt
from tqdm import tqdm
winner = []  # this array counts how many correct guesses ("bingos") we get when testing the NN
for TestNum in tqdm(range(5)):  # in every round we build a NN from X, Y and hold out 50 samples to test it
    X = []
    Y = []
    onlyfiles = [f for f in listdir("FinalAudios/") if isfile(join("FinalAudios/", f))]  # files in dir
    names = []  # names of the speakers
    for file in onlyfiles:  # for each wav sound
        # UNNECESSARY TO UNDERSTAND THE CODE
        if " " not in file.split("_")[0]:
            names.append(file.split("_")[0])
        else:
            names.append(file.split("_")[0].split(" ")[0])
    names = list(dict.fromkeys(names))  # names of speakers
    vector_names = []  # one-hot vector for each name
    i = 0
    vector_for_each_name = [0] * len(names)
    for name in names:
        vector_for_each_name[i] += 1
        vector_names.append(np.array(vector_for_each_name))
        vector_for_each_name[i] -= 1
        i += 1
    for f in onlyfiles:
        if " " not in f.split("_")[0]:
            f_speaker = f.split("_")[0]
        else:
            f_speaker = f.split("_")[0].split(" ")[0]
        (rate, sig) = wav.read("FinalAudios/" + f)  # read the file
        try:
            mfcc_feat = python_speech_features.mfcc(sig, rate, winlen=0.2, nfft=512)  # mfcc coeffs
            for index in range(len(mfcc_feat)):  # adding each mfcc coeff to X, meaning if there are 50000 coeffs then
                # X will be [first coeff, second, ..., 50000th coeff] and Y will be [f_speaker_vector] * 50000
                X.append(np.array(mfcc_feat[index]))
                Y.append(np.array(vector_names[names.index(f_speaker)]))
        except IndexError:
            pass
    Z = list(zip(X, Y))
    shuffle(Z)  # WE SHUFFLE X, Y TO RANDOMIZE THE TRAIN/TEST SPLIT
    X, Y = zip(*Z)
    X = list(X)
    Y = list(Y)
    X = np.asarray(X)
    Y = np.asarray(Y)
    Y_test = Y[:50]  # CHOOSE 50 FOR TEST, OTHERS FOR TRAIN
    X_test = X[:50]
    X = X[50:]
    Y = Y[50:]
    clf = MLPClassifier(solver='lbfgs', alpha=1e-2, hidden_layer_sizes=(5, 3), random_state=2)  # create the NN
    clf.fit(X, Y)  # train it
    for sample in range(len(X_test)):  # append 1 to winner if the prediction is correct, 0 if not; plotted at the end
        if list(clf.predict([X_test[sample]])[0]) == list(Y_test[sample]):
            winner.append(1)
        else:
            winner.append(0)

# plot winner
plot_x = []
plot_y = []
for i in range(1, len(winner)):
    plot_y.append(sum(winner[0:i]) * 1.0 / len(winner[0:i]))
    plot_x.append(i)
plt.plot(plot_x, plot_y)
plt.xlabel('x - axis')
# naming the y axis
plt.ylabel('y - axis')
# giving a title to my graph
plt.title('My first graph!')
# function to show the plot
plt.show()
Here is a zip file with my code and the audio files: https://ufile.io/eggjm1gw
Best answer
There are a number of issues in your code, and it is close to impossible to get it right in one go, but let's give it a try. There are two fundamental problems, and the key one is this: the network effectively sees only the very beginning of each recording (the frame length is set by the winlen parameter of python_speech_features), and in every recording the first 25 ms are close to identical. Even if you had 10k recordings per speaker, with this approach you would not get anywhere. I will give you concrete advice, but I won't do all the coding - it is your homework, after all.
Use all the MFCCs available to you and treat every MFCC vector as a separate sample; stacked together they form a training array of shape, say, (50000, 13). The corresponding label array then has 50000 entries, each a single constant value (an id) corresponding to the speaker (e.g. 0 for omersk, 1 for lucas, and so on). I would consider using a longer window (maybe 200 ms; experiment!) to reduce the variance. Don't forget to split your data into training, validation and test sets; you will have more than enough data. Also, for this exercise I would watch out for feeding in too much data for any single speaker, or take steps to make sure the algorithm is not biased.
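A minimal sketch of that data layout (the file list and the speaker-to-id mapping here are made up for illustration; FinalAudios/ is the directory from the question):

import numpy as np
import scipy.io.wavfile as wav
import python_speech_features

X, y = [], []
speaker_ids = {'omersk': 0, 'lucas': 1}  # hypothetical speaker -> id mapping

for path, speaker in [('FinalAudios/omersk_1.wav', 'omersk'),
                      ('FinalAudios/lucas_1.wav', 'lucas')]:  # hypothetical files
    rate, sig = wav.read(path)
    # nfft must cover winlen * rate samples, hence 4096 for 200 ms at 16 kHz
    mfcc_feat = python_speech_features.mfcc(sig, rate, winlen=0.2, winstep=0.1, nfft=4096)
    X.append(mfcc_feat)                                    # (frames, 13) per file
    y.extend([speaker_ids[speaker]] * mfcc_feat.shape[0])  # one constant id per frame

X = np.concatenate(X, axis=0)  # e.g. (50000, 13) once you have enough audio
y = np.asarray(y)              # (50000,)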
Later, when you make a prediction, you compute MFCCs for the speaker again. With a 10-second recording, a 200 ms window and 100 ms overlap, you get 99 MFCC vectors of shape (99, 13). Run the model on each of the 99 vectors, each run producing a probability per speaker. When you sum these (and normalise, to make it nicer) and take the top value, you get the most likely speaker.
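A sketch of that aggregation step, assuming a fitted scikit-learn classifier clf (anything with predict_proba) and the (99, 13) array mfcc_feat computed from the recording to identify:

import numpy as np

proba = clf.predict_proba(mfcc_feat)  # (99, n_speakers): one probability row per vector
scores = proba.sum(axis=0)            # sum the per-frame probabilities per speaker
scores /= scores.sum()                # normalise so the scores sum to 1
speaker_id = int(np.argmax(scores))   # index of the most likely speaker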
There are usually plenty of other things to take into account, but in this case (homework) I would focus on getting the basics right.

EDIT: I decided to take a stab at creating a model based on your idea, but with the basics fixed. It is not exactly clean Python, because it is adapted from a Jupyter Notebook I was running.
import python_speech_features
import scipy.io.wavfile as wav
import numpy as np
import glob
import os
from collections import defaultdict
from sklearn.neural_network import MLPClassifier
from sklearn import preprocessing
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier
audio_files_path = glob.glob('audio/*.wav')
win_len = 0.04 # in seconds
step = win_len / 2
nfft = 2048
mfccs_all_speakers = []
names = []
data = []
for path in audio_files_path:
    fs, audio = wav.read(path)
    if audio.size > 0:
        mfcc = python_speech_features.mfcc(audio, samplerate=fs, winlen=win_len,
                                           winstep=step, nfft=nfft, appendEnergy=False)
        filename = os.path.splitext(os.path.basename(path))[0]
        speaker = filename[:filename.find('_')]
        data.append({'filename': filename,
                     'speaker': speaker,
                     'samples': mfcc.shape[0],
                     'mfcc': mfcc})
    else:
        print(f'Skipping {path} due to 0 file size')

speaker_sample_size = defaultdict(int)
for entry in data:
    speaker_sample_size[entry['speaker']] += entry['samples']

person_with_fewest_samples = min(speaker_sample_size, key=speaker_sample_size.get)
print(person_with_fewest_samples)

max_accepted_samples = int(speaker_sample_size[person_with_fewest_samples] * 0.8)
print(max_accepted_samples)

training_idx = []
test_idx = []
accumulated_size = defaultdict(int)
for entry in data:
    if entry['speaker'] not in accumulated_size:
        training_idx.append(entry['filename'])
        accumulated_size[entry['speaker']] += entry['samples']
    elif accumulated_size[entry['speaker']] < max_accepted_samples:
        accumulated_size[entry['speaker']] += entry['samples']
        training_idx.append(entry['filename'])

X_train = []
label_train = []
X_test = []
label_test = []
for entry in data:
    if entry['filename'] in training_idx:
        X_train.append(entry['mfcc'])
        label_train.extend([entry['speaker']] * entry['mfcc'].shape[0])
    else:
        X_test.append(entry['mfcc'])
        label_test.extend([entry['speaker']] * entry['mfcc'].shape[0])
X_train = np.concatenate(X_train, axis=0)
X_test = np.concatenate(X_test, axis=0)
assert (X_train.shape[0] == len(label_train))
assert (X_test.shape[0] == len(label_test))
print(f'Training: {X_train.shape}')
print(f'Testing: {X_test.shape}')
le = preprocessing.LabelEncoder()
y_train = le.fit_transform(label_train)
y_test = le.transform(label_test)
clf = MLPClassifier(solver='lbfgs', alpha=1e-2, hidden_layer_sizes=(5, 3), random_state=42, max_iter=1000)
cv_results = cross_validate(clf, X_train, y_train, cv=4)
print(cv_results)
{'fit_time': array([3.33842635, 4.25872731, 4.73704267, 5.9454329 ]),
'score_time': array([0.00125694, 0.00073504, 0.00074005, 0.00078583]),
'test_score': array([0.40380048, 0.52969121, 0.48448687, 0.46043165])}
The test_score is not stellar. There is a lot to improve (for starters, the choice of algorithm), but the basics are there. Notice first of all how I pick the training samples: it is not random, and I only ever take whole recordings. You cannot put samples from a given recording into both training and test, because the test set should be novel.
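If you would rather not hand-roll that bookkeeping, scikit-learn's GroupShuffleSplit gives the same guarantee when you use the source filename as the group key. A sketch, where X_all, y_all and groups are hypothetical stand-ins for the stacked frames, the per-frame labels, and the filename each frame came from:

from sklearn.model_selection import GroupShuffleSplit

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X_all, y_all, groups=groups))  # split by recording
X_train, X_test = X_all[train_idx], X_all[test_idx]
y_train, y_test = y_all[train_idx], y_all[test_idx]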
What was not working in your code? I'd say a lot. You were taking 200 ms samples, yet the fft was very short; python_speech_features may well have complained to you that the fft should be longer than the frame you are processing.
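A simple guard against that mismatch (my own suggestion, not part of the original answer) is to derive nfft from the frame length instead of hard-coding it:

win_len = 0.2                        # analysis window in seconds
rate = 16000                         # sample rate of the recording (assumed)
frame_samples = int(win_len * rate)  # 3200 samples per frame here

nfft = 1
while nfft < frame_samples:          # next power of two >= frame length
    nfft *= 2
# nfft is now 4096, long enough that the frame is not truncated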
I leave testing the model to you. It won't be great, but it is a start.
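As a starting point, you could fit on the full training set and report frame-level accuracy on the held-out recordings (a minimal sketch reusing the variables from the code above):

clf.fit(X_train, y_train)  # final fit on all training frames
print(f'Held-out frame accuracy: {clf.score(X_test, y_test):.3f}')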
On machine-learning - my speaker recognition neural network doesn't work well, see the original question on Stack Overflow: https://stackoverflow.com/questions/58890335/