Supervised Machine Learning for Speaker Diarization by PNCC with LPCC Audio Coefficients
Abstract
Speaker diarization is a speech signal processing technique that separates a single observed multi-speaker recording into the individual speech of its n speakers. Each separated signal belongs to one speaker, plus a small residual error consisting of speech from the other speakers. The input has the form of a dialog, since the speakers talk non-simultaneously. In the training stage of the machine-learning pipeline, speaker diarization algorithms extract audio features from speech in the standard TIMIT database; the subsequent classification stage then decides how to divide these features into n groups. Linear Prediction Cepstral Coefficients (LPCC) and Power-Normalized Cepstral Coefficients (PNCC) are first used independently to generate features. In this paper, the researchers combine the LPCC and PNCC features to form a new feature mixture. A modified Euclidean distance is used to measure distances and identify the nearest label. Because the PNCC transformation is non-invertible, a small frame at the center of a larger windowed frame, where the window weight is substantial, is used to recover the original speech signals. The procedure efficiently clustered a mixture of two speaker signals, one female and one male, from the standard TIMIT audio library, i.e., it successfully recovered each speaker's individual speech. In objective tests, the average Diarization Error Rate (DER) of the recovered speech was 1.8% for the female speakers, 2.9% for the male speakers, and 2.5% overall. Compared with other published studies, the improvements were 6.5% for the female speakers, 10% for the male speakers, and 8.8% overall.
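The two steps named in the abstract, concatenating per-frame LPCC and PNCC vectors into one mixed feature and assigning each frame to the nearest speaker label by distance, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature dimensions, the toy data, and the use of a plain Euclidean distance (the paper's modification is not specified in the abstract) are all assumptions.

```python
import numpy as np

def combine_features(lpcc, pncc):
    """Concatenate per-frame LPCC and PNCC vectors into one mixed feature.
    lpcc: (frames, d1) array; pncc: (frames, d2) array -> (frames, d1+d2)."""
    return np.hstack([lpcc, pncc])

def nearest_label(frames, centroids):
    """Assign each frame to the closest speaker centroid.
    Plain Euclidean distance stands in for the paper's modified variant."""
    # dists[i, k] = distance from frame i to centroid k
    dists = np.linalg.norm(frames[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

# Toy usage: two synthetic "speakers" whose feature means differ clearly.
rng = np.random.default_rng(0)
lpcc = np.vstack([rng.normal(0, 0.1, (5, 12)), rng.normal(3, 0.1, (5, 12))])
pncc = np.vstack([rng.normal(0, 0.1, (5, 13)), rng.normal(3, 0.1, (5, 13))])
feats = combine_features(lpcc, pncc)
centroids = np.vstack([feats[:5].mean(axis=0), feats[5:].mean(axis=0)])
labels = nearest_label(feats, centroids)
print(labels)  # first 5 frames -> speaker 0, last 5 -> speaker 1
```

In a real diarization pipeline the centroids would come from the training/clustering stage rather than being computed from labeled frames as in this toy example.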