This paper focuses on the performance of a Bangla Automatic Speech Recognition (ASR) system implemented with a 52-dimensional feature vector set (static, delta, acceleration, and third differential coefficients) of Mel Frequency Cepstral Coefficients (MFCC) and a triphone-based Hidden Markov Model (HMM). Since the effect of appending third differential coefficients to the 39-dimensional vector set (static, delta, and acceleration coefficients) has not yet been explored for Bangla ASR, we have also conducted experiments with MFCC39 + triphone HMM for comparison. The HTK toolkit is used to conduct the experiments. A speech corpus of 100 sentences has been developed, uttered by different male and female speakers, to build the training and test corpora. We focus on the Sentence Correction Rate (SCR) and set the sampling frequency at 16 kHz to observe the performance. We obtained an average SCR of about 99.07% with MFCC52 + triphone HMM when the system was trained with the voice samples of 5 males, while MFCC39 + triphone HMM showed an average SCR of 98.8% in this case.
Development of Automatic Speech Recognizers (ASR) began in the mid-20th century, with English as the main language of interest. In this era of artificial intelligence, we have therefore witnessed advanced ASR-based systems such as Siri and Cortana, developed by Apple Inc. and Microsoft respectively.
However, despite being the seventh most spoken language in the world [1], Bangla has seen comparatively few such developments. To contribute to this area, we have taken an initiative to improve Bangla ASR. We are motivated by the observation that adding time derivatives to the basic static parameters can greatly enhance the performance of an ASR [2]. We have developed a Bangla ASR that uses Mel Frequency Cepstral Coefficients (MFCCs) to extract features from speech waveforms and an HMM-based classifier. HMM is a statistical approach that builds stochastic models from known utterances; an unknown utterance is then compared against the probability that each model generated it [3]. To achieve higher accuracy in HMM-based phonetic segmentation, an acoustic model based on tied-state triphones is used here, as it is the most effective model for capturing the co-articulation effect, since the immediate left and right phonetic contexts are considered. The HTK toolkit, developed by Cambridge University and Microsoft, is used to extract features from the speech waveforms and to build a five-state, left-to-right HMM-based tied-state triphone model.
This paper is organized into five sections, including this introduction. The following sections discuss previous work, methodology, experimental setup, and results, respectively. The paper ends with a brief overview of this work and highlights our future work.
Researchers have explored many approaches to obtain optimum accuracy with ASR. Among them, the approaches that use MFCC to extract feature vectors from raw input signals for Bangla ASR are given priority in this section.
The authors of [4] used an MFCC39-based system along with a triphone HMM on Bangla sentences and achieved a Word Correct Rate of about 90.33%. Their work included only male speakers, and the voice files were recorded at a sampling frequency of 16 kHz.
The authors of [5] developed a gender-independent Bangla ASR using an MFCC39-based system along with HMM components and reported an SCR of 87.30% with their proposed model. They also used a sampling frequency of 16 kHz.
The authors of [6] found that a Gaussian Mixture Model (GMM) with 39-dimensional MFCC achieved an accuracy of around 100% for Bangladeshi dialect recognition. They also used a Support Vector Machine (SVM) for comparison. For the experiments, the voice files were sampled at 16 kHz.
The authors of [7] used a 39-dimensional MFCC vector set with Dynamic Time Warping and the K-nearest neighbors algorithm. The accuracy reported in that paper was almost 90%.
A. Speech Corpus
The very first step in building an ASR is to create a speech corpus of the language to be recognized. Since there is no well-developed Bangla speech corpus comparable to the British ones [4], we built a speech corpus of 100 sentences by picking random words and joining them together to form sentences (e.g. "AALOCHONA AACHHE"); these sentences were then uttered by a number of speakers.
B. Feature Extraction
To extract feature vectors, we computed Mel Frequency Cepstral Coefficients (MFCC), which is the most prevalent and widely accepted method. The overall process is shown in Fig. 1.
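As a quick end-to-end reference for the pipeline of Fig. 1, the sketch below uses the librosa library rather than the paper's HTK setup; the 25 ms window and 10 ms step are assumed values, while the 26 Mel channels and 13 cepstra per frame match the figures quoted later in this section.

```python
import librosa

# Minimal end-to-end MFCC extraction sketch (librosa, not the paper's HTK pipeline).
# "utterance.wav" is an illustrative path; 400/160 samples = 25 ms / 10 ms at 16 kHz.
y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160, n_mels=26)
print(mfcc.shape)  # (13, number_of_frames)
```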
In this method, the speech signal is first filtered using a high-pass pre-emphasis filter whose coefficient typically ranges from 0.97 to 1.0 [8]. Mathematically, this filter can be represented by the following equation [9]:
s_2(n) = s(n) - a \, s(n-1) \qquad (1)
where s_2(n) is the output signal, s(n) is the input signal, and a is the pre-emphasis coefficient, whose value lies between 0.91 and 1.0.
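A minimal Python sketch of this pre-emphasis step, assuming a = 0.97 (a commonly used value rather than a setting taken from the paper):

```python
import numpy as np

def pre_emphasis(signal, a=0.97):
    """Apply the high-pass pre-emphasis filter of Eq. (1): s2(n) = s(n) - a*s(n-1).
    a = 0.97 is an assumed, commonly used value; the first sample is kept as-is."""
    return np.append(signal[0], signal[1:] - a * signal[:-1])
```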
These pre-emphasized signals are split into frames for analysis over short periods of time. A Hamming window is then multiplied with each frame to reduce the discontinuities that would otherwise appear at both ends after performing the Fourier Transform [6]. The Hamming window can be represented by the following equation [9]:
w(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right) \qquad (2)
where 0 ≤ n ≤ N − 1, N is the frame length in samples, and w(n) is the Hamming window.
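The framing and Hamming windowing can be sketched as follows; the 25 ms frame length and 10 ms step (400 and 160 samples at 16 kHz) are assumptions, not values stated in the paper, and the signal is assumed to be at least one frame long:

```python
import numpy as np

def frame_and_window(signal, frame_len=400, frame_step=160):
    """Split the signal into overlapping frames and apply the Hamming window
    of Eq. (2): w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))."""
    num_frames = 1 + max(0, (len(signal) - frame_len) // frame_step)
    n = np.arange(frame_len)
    window = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))
    frames = np.empty((num_frames, frame_len))
    for i in range(num_frames):
        start = i * frame_step
        frames[i] = signal[start:start + frame_len] * window
    return frames
```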
The framed signals are then transformed from the time domain to the frequency domain using the Fast Fourier Transform (FFT) and converted to the Mel scale using a Mel-scaled filter bank. The Mel scale can be represented by the following equation [10]:
\mathrm{Mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right) \qquad (3)
Our filter bank consists of 26 triangular bandpass filters, which are multiplied with the spectrum of each frame; the weighted coefficients are then summed within each filter [11].
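A sketch of the FFT and Mel filter-bank stage is given below; the 512-point FFT and the use of the power spectrum are assumed choices, while the 26 triangular filters and the 16 kHz sampling rate come from the paper.

```python
import numpy as np

def hz_to_mel(f):
    """Mel scale of Eq. (3): Mel(f) = 2595*log10(1 + f/700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_energies(frames, sr=16000, nfft=512, nfilt=26):
    """Power spectrum of each windowed frame, weighted by 26 triangular
    Mel-spaced filters and summed per filter; the log is taken so that the
    DCT in the next step yields cepstral coefficients."""
    power = (np.abs(np.fft.rfft(frames, nfft)) ** 2) / nfft
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), nfilt + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((nfilt, nfft // 2 + 1))
    for m in range(1, nfilt + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):
            fbank[m - 1, k] = (k - left) / (centre - left)
        for k in range(centre, right):
            fbank[m - 1, k] = (right - k) / (right - centre)
    return np.log(power @ fbank.T + np.finfo(float).eps)
```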
After that, the Discrete Cosine Transformation (DCT) is applied to the log filter-bank amplitudes according to the following equation [2], transforming them into cepstral coefficients:
c_i = \sqrt{\frac{2}{N}} \sum_{j=1}^{N} m_j \cos\!\left(\frac{\pi i (j - 0.5)}{N}\right) \qquad (4)
Here, m_j represents the log filter-bank amplitudes and N is the number of filter-bank channels. Thus we obtain the MFCCs. To achieve better performance, we append time derivatives to these basic static parameters.
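The DCT step of Eq. (4) and the resulting 12 static cepstral coefficients per frame can be sketched as follows, taking the log filter-bank amplitudes produced by the previous stage as input:

```python
import numpy as np

def cepstra_from_log_mel(log_mel, num_ceps=12):
    """DCT of the log filter-bank amplitudes, Eq. (4):
    c_i = sqrt(2/N) * sum_{j=1..N} m_j * cos(pi*i*(j - 0.5)/N),
    keeping the first 12 coefficients as the static MFCCs."""
    N = log_mel.shape[1]               # number of filter-bank channels (26 here)
    j = np.arange(1, N + 1)
    ceps = np.empty((log_mel.shape[0], num_ceps))
    for i in range(1, num_ceps + 1):
        ceps[:, i - 1] = np.sqrt(2.0 / N) * np.sum(
            log_mel * np.cos(np.pi * i * (j - 0.5) / N), axis=1)
    return ceps
```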
C. HMM-based Modelling
We used an HMM-based tied-state triphone model, which builds models from known utterances; an unknown utterance is then scored against each model, and the model with the maximum observation-sequence probability is chosen to recognize the word [12]. The following figure shows the overall concept of an ASR.

With the speech corpus we developed, we conducted the experiment in two phases, as shown in Table 1. We built two training corpora, and the test corpus was built with voice samples of 5 males and 5 females. The voice files were recorded using "Easy Voice Recorder" (an Android app) at 44.1 kHz and then downsampled to 16 kHz using the audio processing application "Audacity".
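As a minimal illustration of the recognition idea described above (one model per training unit, recognition by maximum observation-sequence likelihood), the following sketch uses the hmmlearn library instead of HTK's tied-state triphones, with an ergodic rather than left-to-right topology; all names and parameter values are illustrative assumptions.

```python
import numpy as np
from hmmlearn import hmm

def train_models(training_data, n_states=5):
    """training_data maps a label to a list of (T_i, D) MFCC feature arrays;
    one GaussianHMM is trained per label."""
    models = {}
    for label, sequences in training_data.items():
        X = np.vstack(sequences)
        lengths = [len(s) for s in sequences]
        model = hmm.GaussianHMM(n_components=n_states,
                                covariance_type="diag", n_iter=20)
        model.fit(X, lengths)
        models[label] = model
    return models

def recognise(models, features):
    """Pick the label whose model assigns the highest log-likelihood
    to the unknown observation sequence."""
    return max(models, key=lambda label: models[label].score(features))
```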
The 39-dimensional vector set (MFCC39) consists of 12 MFCCs, 12 ΔMFCCs, 12 ΔΔMFCCs, E, ΔE, and ΔΔE, where E stands for log energy [2], computed every 10 ms. Therefore, to build the 52-dimensional vector set (MFCC52), we append 12 third differential coefficients (12 ΔΔΔMFCC) together with the third differential of the log energy (ΔΔΔE), i.e. 13 new coefficients, to the previous 39 coefficients. Table 2 shows the HTK coding parameters required to execute the above operations.
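The following sketch shows how the 52-dimensional vectors can be assembled from the 13 static coefficients (12 MFCCs plus log energy) using the HTK-style regression formula for time derivatives; the regression window of 2 frames is an assumed value (HTK's default DELTAWINDOW), and the static matrix here is illustrative data.

```python
import numpy as np

def deltas(feat, window=2):
    """HTK-style regression deltas over a (T, D) feature matrix:
    d_t = sum_{theta=1..window} theta*(c_{t+theta} - c_{t-theta}) / (2*sum theta^2),
    with edge frames repeated at the boundaries."""
    T, D = feat.shape
    denom = 2 * sum(t * t for t in range(1, window + 1))
    padded = np.pad(feat, ((window, window), (0, 0)), mode="edge")
    out = np.zeros_like(feat)
    for t in range(T):
        acc = np.zeros(D)
        for theta in range(1, window + 1):
            acc += theta * (padded[t + window + theta] - padded[t + window - theta])
        out[t] = acc / denom
    return out

# static: (T, 13) matrix of 12 MFCCs + log energy per 10 ms frame (illustrative data)
static = np.random.randn(200, 13)
d1 = deltas(static)                       # delta coefficients
d2 = deltas(d1)                           # acceleration coefficients
d3 = deltas(d2)                           # third differential coefficients
mfcc52 = np.hstack([static, d1, d2, d3])  # (T, 52) MFCC52 vector set
```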
We gratefully acknowledge the enormous support of Md. Jakaria Rahimi, Assistant Professor at Ahsanullah University of Science & Technology. We would also like to give special thanks to all of his previous thesis students for their contributions to the development of this work.