Keywords: speaker identification, speech recognition, support vector machine, mel-frequency cepstrum coefficient.
Biometric System
A biometric system can be either an identification system or a verification system. There are many types of biometric system, such as face recognition, fingerprint recognition, and voice recognition. In my project I will use voice recognition to identify and verify the speaker. For identification, the speaker may provide the system with any word, and the system determines who is speaking. For verification, the speaker must provide the system with a password so that the system can verify the claimed identity and grant access. Many organizations want to use these services, for example in phone banking and ATM services.
Speaker Identification using MFCC-Domain Support Vector Machine
Abstract
Speaker identification and speech recognition are very important for authentication and verification in security applications, but they are hard to achieve. Speaker identification methods can be divided into text-dependent and text-independent. This paper presents a technique for text-dependent speaker identification using an MFCC-domain support vector machine (SVM). In this work, mel-frequency cepstrum coefficients (MFCCs) and their statistical distribution properties are used as features, which serve as inputs to the neural network.
Introduction
Speaker identification has been the subject of active research for many years and has many potential applications wherever the protection of information is a concern. Speaker identification is the process of automatically recognizing a speaker by machine using the speaker’s voice. The most common application of speaker identification systems is access control, for example access to a room or to privileged information over the phone. It is also very useful for speaker adaptation in automatic speech recognition systems.
Speaker recognition can be divided into two tasks: verification and identification. Speaker identification is the process of determining which registered speaker produced a given utterance. Speaker verification is the process of accepting or rejecting the identity claim of a speaker. Speaker recognition techniques can also be split into text-dependent and text-independent techniques.
Description Of Technique
Speaker Recognition
Voice processing
MFCC is probably the best-known and most popular representation of the speech signal for the speaker recognition task. MFCCs are based on the known variation of the human ear’s critical bandwidths with frequency.
The MFCC processor
Figure 1: Block diagram of the MFCC processor[1].
Figure 1 shows the structure of the MFCC processor. The speech input is recorded at a sampling rate of 22050 Hz. This sampling frequency is chosen to reduce the effects of aliasing in the analog-to-digital conversion process.
Mel-frequency wrapping
The speech signal is composed of tones at different frequencies. For each tone with
an actual frequency f, measured in Hz, a subjective pitch is measured on the ‘Mel’
scale[1].
The following formula computes the mel value for a given frequency f in Hz [1]:
mel(f) = 2595 * log10(1 + f/700)    (1)
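As a quick check of equation (1), the conversion can be sketched in a few lines of Python. The function names `hz_to_mel` and `mel_to_hz` are illustrative choices, not taken from [1]; the inverse is obtained simply by solving equation (1) for f.

```python
import math

def hz_to_mel(f_hz):
    """Map a frequency in Hz to mels: mel(f) = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping, obtained by solving the formula above for f."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Print a few sample points: the scale is compressed at high frequencies.
for f in (100, 500, 1000, 4000, 8000):
    print(f, "Hz ->", round(hz_to_mel(f), 1), "mel")
```

With these constants, 1000 Hz maps to roughly 1000 mel, which is the usual anchor point of the scale.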
Cepstrum
In the final step, the log mel spectrum is converted back to the time domain; the result is called the mel-frequency cepstral coefficients (MFCCs). Because the mel spectrum coefficients are real numbers, they can be transformed to the time domain using the discrete cosine transform (DCT).
The MFCCs may be calculated using the following equation:
\tilde{c}_n = \sum_{k=1}^{K} (\log \tilde{S}_k) \cos\left[ n \left( k - \frac{1}{2} \right) \frac{\pi}{K} \right]
where \tilde{S}_k, k = 1, 2, ..., K, are the mel power spectrum coefficients and n = 1, 2, ..., K. The number of mel cepstrum coefficients, K, is typically chosen as 20. The first component, \tilde{c}_0, is excluded from the DCT since it represents the mean value of the input signal, which carries little speaker-specific information [1].
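The DCT step can be illustrated with a minimal sketch. It assumes the common form c_n = sum over k of log_mel[k] * cos(n (k - 1/2) pi / K) for n = 1..K, and takes the log mel spectrum as a plain Python list; the function name `mfcc_from_log_mel` is hypothetical.

```python
import math

def mfcc_from_log_mel(log_mel):
    """DCT of a log mel spectrum (assumed form, see lead-in).

    log_mel[k-1] holds the log of the k-th mel spectrum coefficient,
    k = 1..K. Returns [c_1, ..., c_K]; c_0 (the mean term) is skipped.
    """
    K = len(log_mel)
    coeffs = []
    for n in range(1, K + 1):
        c_n = sum(log_mel[k - 1] * math.cos(n * (k - 0.5) * math.pi / K)
                  for k in range(1, K + 1))
        coeffs.append(c_n)
    return coeffs

# A flat log spectrum has no shape, so every coefficient is (numerically) zero.
flat = mfcc_from_log_mel([2.0] * 20)
print(max(abs(c) for c in flat))
```

A flat spectrum only contributes to the excluded c_0 term, which matches the remark that c_0 carries the mean of the input.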
Support vector machine
Three alternative methods for training SVMs:
1. Chunking:
The chunking algorithm uses the fact that the value of the quadratic form is unchanged if the rows and columns of the matrix that correspond to zero Lagrange multipliers are removed. The large quadratic optimization problem can therefore be broken down into a series of smaller quadratic problems, whose ultimate goal is to identify all of the non-zero Lagrange multipliers and discard all of the zero Lagrange multipliers [1].
2. Osuna’s algorithm:
In 1997, Osuna proved a theorem that suggests a whole new set of quadratic algorithms for SVMs. The theorem proves that the large quadratic optimization problem can be broken down into a series of smaller quadratic sub-problems [1].
3. SMO:
Sequential minimal optimization (SMO) is a simple algorithm that can quickly solve the SVM problem without any extra matrix storage and without using numerical quadratic optimization steps at all. SMO decomposes the overall quadratic problem into quadratic sub-problems, using Osuna’s theorem to ensure convergence[1].
For each method, three steps are illustrated in Figure 2. The thin horizontal line at every step represents the training set, while the thick boxes represent the Lagrange multipliers being optimized at that step. For chunking, a fixed number of examples is added at every step, while the zero Lagrange multipliers are discarded at every step, so the number of examples trained per step tends to grow. For Osuna’s algorithm, a fixed number of examples is optimized at every step: the same number of examples is added to and discarded from the problem at every step. For SMO, only two examples are analytically optimized at every step, so each step is very fast [1].
Figure 2: Three alternative methods for training SVMs: chunking, Osuna’s algorithm, and SMO [1].
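The idea of optimizing only two Lagrange multipliers per step can be sketched as follows. This is a simplified SMO variant written for illustration, not the exact algorithm of [1]: it uses a linear kernel, recomputes all errors at every step, and picks the second multiplier by trying partners in order of decreasing |E_i - E_j|. The names `smo_train` and `decision` and the toy data in the usage note are assumptions.

```python
def linear_kernel(a, b):
    return sum(p * q for p, q in zip(a, b))

def smo_train(X, y, C=1.0, tol=1e-3, max_sweeps=200):
    """Simplified SMO sketch: analytically update one pair (alpha_i, alpha_j) per step."""
    n = len(X)
    K = [[linear_kernel(X[i], X[j]) for j in range(n)] for i in range(n)]
    alpha, b = [0.0] * n, 0.0

    def errors():
        # E_k = f(x_k) - y_k for every training example
        return [sum(alpha[m] * y[m] * K[m][k] for m in range(n)) + b - y[k]
                for k in range(n)]

    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            E = errors()
            r = y[i] * E[i]
            if not ((r < -tol and alpha[i] < C) or (r > tol and alpha[i] > 0)):
                continue  # example i satisfies the KKT conditions within tol
            partners = sorted((j for j in range(n) if j != i),
                              key=lambda j: -abs(E[i] - E[j]))
            for j in partners:
                a_i, a_j = alpha[i], alpha[j]
                # Box constraints that keep sum(alpha * y) unchanged
                if y[i] != y[j]:
                    lo, hi = max(0.0, a_j - a_i), min(C, C + a_j - a_i)
                else:
                    lo, hi = max(0.0, a_i + a_j - C), min(C, a_i + a_j)
                eta = 2 * K[i][j] - K[i][i] - K[j][j]
                if lo >= hi or eta >= 0:
                    continue
                # Closed-form optimum for alpha_j, clipped to [lo, hi]
                new_j = min(hi, max(lo, a_j - y[j] * (E[i] - E[j]) / eta))
                if abs(new_j - a_j) < 1e-7:
                    continue
                new_i = a_i + y[i] * y[j] * (a_j - new_j)
                b1 = b - E[i] - y[i] * (new_i - a_i) * K[i][i] - y[j] * (new_j - a_j) * K[i][j]
                b2 = b - E[j] - y[i] * (new_i - a_i) * K[i][j] - y[j] * (new_j - a_j) * K[j][j]
                if 0 < new_i < C:
                    b = b1
                elif 0 < new_j < C:
                    b = b2
                else:
                    b = (b1 + b2) / 2
                alpha[i], alpha[j] = new_i, new_j
                changed = True
                break
        if not changed:
            break  # a full sweep made no progress
    return alpha, b

def decision(x, X, y, alpha, b):
    """SVM output f(x) = sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(a * t * linear_kernel(xi, x) for a, t, xi in zip(alpha, y, X)) + b
```

For example, training on `X = [(2, 2), (3, 3), (3, 2), (0, 0), (1, 0), (0, 1)]` with labels `y = [1, 1, 1, -1, -1, -1]` and evaluating `decision` on the training points gives the sign of each label. No matrix of the full quadratic problem is ever factorized, which is the property the comparison above attributes to SMO.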
Comparative analysis
Method | Strengths | Weaknesses
SMO | Training time is faster than the other methods. SMO is superior to Osuna’s algorithm and chunking in computation time [1]. | -
Chunking | Solves a quadratic optimization problem. | Cannot handle large-scale training problems, since even the reduced matrix cannot fit into memory.
Osuna’s algorithm | Suggests keeping a constant-size matrix for every quadratic sub-problem, which implies adding and deleting the same number of examples at every step. | Inefficient.
Table 1: Comparison of the SVM training methods.
This comparison shows that, based on the reported results, SMO is the best of the three methods.
Conclusions
This paper presents an approach based on neural networks for voice recognition and user identification. The approach relies on an MFCC-domain support vector machine trained with the SMO learning method. Testing the voice-dependent system gave good results: a 91.88% success rate with chunking SVM training and 95% with SMO SVM training. We can therefore conclude that the approach identifies and verifies the human voice better than earlier approaches.
The MFCC technique has been applied for speaker identification. VQ is used to reduce the amount of data in the extracted features. The study shows that the identification rate of the system rises as the number of centroids rises. The combination of mel frequency and a Hamming window was found to give better performance.
Speaker Recognition using Support Vector Machine
Abstract
Voice recognition is the process of identifying the speaker according to characteristics of the speech wave such as pitch and tone. Background noise strongly affects the overall efficiency of a voice recognition system and is a major challenge for speaker recognition. MFCC features are extracted from the input speech, and vector quantization of the extracted MFCC features is then performed using the VQLBG algorithm, which reduces the dimensionality of the input vectors. These MFCCs are used as the speaker features for matching with the support vector machine (SVM) method.
Introduction
Speaker identification is the process of automatically identifying a speaker by machine using properties of the speaker’s voice. Voice recognition can be classified into verification and identification. Speaker identification is the process of determining the speaker from a database, while speaker verification is the process of accepting or rejecting the identity claim of a speaker. The problem of determining the speaker has been an active research topic for more than fifty years. Many methods [3], such as simple template matching and dynamic time warping, have also been used for speaker recognition.
Description Of Technique
The following steps are adopted to design the speaker recognition system:
• Data acquisition through a microphone [2].
• Feature extraction [2].
• Data compression using the VQ LBG algorithm [2].
• Feature matching [2].
FEATURE EXTRACTION
The objective of the feature extraction module is to derive the acoustic feature vectors that describe the spectral characteristics of the time-varying speech signal. These feature vectors are used to recognize the speaker. Many techniques are available for parametrically representing the speech signal for speaker recognition. Linear prediction coding (LPC) and mel-frequency cepstrum coefficients (MFCC) are the most commonly used techniques [2]. MFCCs are based on the known variation of the human ear’s critical bandwidths with frequency. The MFCC technique makes use of two types of filter, i.e. linearly spaced filters and logarithmically spaced filters [2]. MFCCs are less sensitive to variations in the speech waveform caused by the physical condition of the speaker’s vocal cords.
The MFCC processor
Mel-frequency cepstral coefficients (MFCC) are the best-known and most frequently used features for both speech recognition and speaker recognition. A mel is a unit of measure based on the frequency perceived by the human ear. The mel scale has approximately linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. The approximation of mel from frequency can be expressed as
mel(f) = 2595 * log10(1 + f/700)    (1)
where f denotes the real frequency and mel(f) denotes the perceived frequency. The block diagram of the MFCC computation is shown in Figure 3 [2].
Figure 3: MFCC Extraction[2].
In the first phase, the speech signal is divided into frames with a length of 20 to 40 ms and an overlap of 50% to 75%. In the second phase, each frame is windowed with a window function to reduce the discontinuity of the signal by tapering the beginning and end of each frame towards zero. In the time domain, windowing is the pointwise multiplication of the framed signal and the window function. A good window function has a narrow main lobe and low side-lobe levels in its transfer function. In our work a Hamming window is used to perform the windowing. In the third phase, the DFT block transforms each frame from the time domain to the frequency domain [4]. The Hamming window is given by
w(n) = 0.54 - 0.46 \cos\left( \frac{2 \pi n}{N - 1} \right), \quad 0 \le n \le N - 1
where N represents the width, in samples, of a discrete-time, symmetrical window function.
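The windowing step can be sketched as below, assuming the standard Hamming definition w(n) = 0.54 - 0.46 cos(2 pi n / (N - 1)). The helper names `hamming` and `window_frame` are illustrative.

```python
import math

def hamming(N):
    """Symmetric Hamming window of N samples (standard 0.54 / 0.46 form)."""
    if N == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def window_frame(frame):
    """Pointwise multiplication of one frame with the window."""
    return [s * w for s, w in zip(frame, hamming(len(frame)))]

# The window peaks in the middle and falls to 0.08 (not exactly 0) at the edges.
w = hamming(5)
print([round(v, 2) for v in w])
```

Note that the Hamming window attenuates the frame edges to 0.08 rather than exactly zero, which is still enough to reduce the discontinuity between frames.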
The aim of VQ is to compress the data: the most effective features are selected instead of using the entire set of feature vectors. The speaker models are formed by clustering the speaker’s feature vectors into a known number of clusters. Each cluster is represented by its centroid, called a code vector, and the code vectors constitute a codebook. Each feature vector of the input is then compared with all the codebooks, and the codebook that gives the minimum distance is selected.
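Codebook training and matching can be sketched as follows. This is a minimal LBG-style sketch under stated assumptions: squared Euclidean distance, a multiplicative split factor `eps`, a fixed number of refinement iterations, and a codebook size that is a power of two. All function names are hypothetical.

```python
def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def mean(vectors):
    d = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(d)]

def nearest(v, codebook):
    return min(range(len(codebook)), key=lambda k: sq_dist(v, codebook[k]))

def lbg(features, size, eps=0.01, iters=20):
    """Grow a codebook by repeated splitting (size should be a power of two)."""
    codebook = [mean(features)]
    while len(codebook) < size:
        # Split every code vector into two slightly perturbed copies.
        # (This multiplicative split does not separate an all-zero vector.)
        codebook = [[c_i * (1 + s) for c_i in c] for c in codebook for s in (eps, -eps)]
        for _ in range(iters):
            cells = [[] for _ in codebook]
            for v in features:
                cells[nearest(v, codebook)].append(v)
            # Move each code vector to the centroid of its cell
            codebook = [mean(cell) if cell else c for cell, c in zip(cells, codebook)]
    return codebook

def avg_distortion(features, codebook):
    """Average distance from each feature vector to its nearest code vector."""
    return sum(min(sq_dist(v, c) for c in codebook) for v in features) / len(features)

def identify(features, codebooks):
    """codebooks: dict mapping speaker name -> codebook; returns the best match."""
    return min(codebooks, key=lambda s: avg_distortion(features, codebooks[s]))
```

In use, one codebook is trained per enrolled speaker, and an unknown utterance is assigned to the speaker whose codebook yields the smallest average distortion.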
FEATURE MATCHING
Support vector machine
The support vector machine (SVM) was developed by Vapnik (1998). It is one of the most important developments in pattern recognition in the last 10 years. Other techniques used for feature matching, such as hidden Markov models (HMM) and Gaussian mixture models (GMM), are prone to overfitting and do not directly optimize discrimination [5].
An SVM is a linear classifier. Given a set of training examples, an SVM training algorithm constructs a model that assigns new examples to one category or the other. It is very important to choose the right separating boundary to avoid misclassification, which requires maximizing the margin between the two groups [15]. The optimal hyperplane is calculated using kernel functions. The samples closest to the separating hyperplane are known as support vectors.
To find the optimal hyperplane, we have to solve the following optimization problem:
When the data are not linearly separable, no hyperplane exists for which all points satisfy the inequality. Slack variables ξ_i are therefore introduced into the inequalities. This relaxes the inequalities and allows certain points to be misclassified. The objective function becomes:
The second term of formula 6 is the empirical risk associated with the misclassified points, L is the loss function (cost function), and C is a hyperparameter that trades off reducing the empirical risk against increasing the margin.
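For orientation, the usual textbook form of the two optimization problems discussed above is sketched below. The notation (w, b, x_i, y_i, N) and the choice of a linear penalty L(ξ_i) = ξ_i are assumptions made for this sketch; the formulation and numbering used in [2] may differ.

```latex
% Hard margin: data assumed linearly separable
\min_{w,\, b} \ \frac{1}{2} \lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1, \qquad i = 1, \dots, N

% Soft margin: slack variables \xi_i allow some points to violate the margin
\min_{w,\, b,\, \xi} \ \frac{1}{2} \lVert w \rVert^{2} + C \sum_{i=1}^{N} \xi_i
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0
```

In this form the first term controls the margin and the second term is the penalty on misclassified points, so a larger C puts more weight on fitting the training data.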
Kernels are used to non-linearly map the input data to a high dimensional space (feature space). The new mapping is then linearly separable.
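A short sketch can make the kernel idea concrete. It assumes two common kernels, a polynomial kernel with offset 1 and an RBF kernel with an illustrative `gamma`, and shows for 2-D inputs that the degree-2 polynomial kernel equals an ordinary inner product after an explicit mapping `phi` into a 6-dimensional feature space. The parameter values are arbitrary.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def polynomial_kernel(x, z, degree=2, coef0=1.0):
    return (dot(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=0.5):
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(x, z)))

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D inputs."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (x1 * x1, x2 * x2, r2 * x1 * x2, r2 * x1, r2 * x2, 1.0)

x, z = (1.0, 2.0), (3.0, -1.0)
# Both numbers agree: the kernel computes the feature-space inner product
# without ever building phi(x) explicitly.
print(polynomial_kernel(x, z), dot(phi(x), phi(z)))
```

The RBF kernel corresponds to an infinite-dimensional feature space, so it can only be evaluated through the kernel function itself.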
RESULT
A database of eight speakers was established. Feature extraction was done using MFCC (mel-frequency cepstral coefficients). The speakers were modeled using vector quantization (VQ). A VQ codebook is created by clustering the training feature vectors of each speaker and is then stored in the speaker database. The LBG algorithm was used for clustering. All input speech signals are sampled at 8000 Hz with 16-bit resolution. A speaker identification system consists of a training stage and a test stage. In the training stage, SVM models are generated for every speaker. In the testing stage, the stored data are compared with the claimed SVM model and a decision is taken. The equal error rate (EER) is used to measure system performance. Two types of kernel function were compared in the SVM implementation: RBF and polynomial (degree 2 and 3). The results are given in Table 2. The best results are obtained with the RBF kernel function.
Table 2: Mean EER for different kernels and coefficients[2].
Table 3 illustrates the impact of varying the number of centroids on the identification rate of the system. An accuracy rate of 95.0% is obtained for MFCCs of order 24.
Table 3: Comparison of SVM based text-dependent speaker identification system with different MFCC orders[2].
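The equal error rate used as the performance measure can be sketched as follows. The sketch assumes that a higher score means a better match, sweeps the threshold over the observed scores, and reports the point where the false acceptance rate and false rejection rate are closest. The function names and the toy scores are illustrative and unrelated to the results in [2].

```python
def error_rates(genuine, impostor, threshold):
    """False acceptance and false rejection rates when accepting score >= threshold."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr

def equal_error_rate(genuine, impostor):
    """Return (eer, threshold) at the threshold where FAR and FRR are closest."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = error_rates(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]

genuine = [0.9, 0.8, 0.7, 0.4]    # scores of true-speaker trials (toy data)
impostor = [0.6, 0.3, 0.2, 0.1]   # scores of impostor trials (toy data)
print(equal_error_rate(genuine, impostor))
```

With real data the two error curves rarely cross exactly at an observed score, so practical implementations usually interpolate between neighbouring thresholds.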
Comparative analysis
Method | Strengths | Weaknesses
HMM | - | Does not directly optimize discrimination.
GMM | - | Does not directly optimize discrimination.
SVM | High accuracy. Directly optimizes discrimination. | -
Table 4: Comparison of the methods.
Conclusions
This paper presented an SVM-based approach to speaker identification for determining the identity of the speaker. The MFCC approach was applied for feature extraction, and VQ was used to reduce the amount of data in the extracted features. The results show that the identification rate of the system increases as the number of centroids increases, but at the cost of increased computation time. The combination of mel frequency and a Hamming window also gives better performance. The results indicate that the proposed model can be used as an effective means of speaker recognition.
Speech Recognition: Increasing Efficiency of Support Vector Machines
Abstract
With the progress of communication and security technologies, robust embedded biometric systems have become important. This paper presents the realization of such technologies, which demand reliable and error-free biometric identity verification systems. High-dimensional patterns are not permitted because of eigen-decomposition in a high-dimensional feature space and degeneration of the scatter matrices for small sample sizes. Generalization, dimensionality reduction, and margin maximization are controlled by minimizing the weight vectors. The results show good performance for the multimodal biometric system proposed in this paper. The paper aims to study a biometric identification system using support vector machines (SVMs) and linear discriminant analysis (LDA) with MFCCs, and to implement such a system in real time using SignalWAVE.
Introduction
In many applications, the performance of a speech recognition system is strongly affected by the choice of features. It is therefore important to find a suitable representation of continuous speech. Raw data obtained from speech recording cannot be used directly to train the recognizer, because the same phonemes do not necessarily have the same sample values. Features based on the signal spectrum are the best choice. The representation must be compact, and the features should add information to the recognition process and be independent of each other. Here a system is proposed that increases the efficiency of support vector machines by applying them after LDA.
Description Of Technique
MFCC
Figure 4 illustrates the MFCC processor model. The speech input is recorded, and a sampling frequency of 10000 Hz is chosen to reduce the effects of aliasing in the analog-to-digital conversion. In addition, the mel-frequency cepstrum coefficients are known to be less susceptible to the variations mentioned above than the speech waveforms themselves.
Figure 4: Block diagram of the MFCC processor [3].
Frame blocking is the process of segmenting the speech samples obtained from analog-to-digital conversion (ADC) into small frames with a length in the range of 20 to 40 ms. Hamming windowing is applied to each individual frame to reduce the signal discontinuities at the start and end of the frame. The principle is to reduce spectral distortion by using the window to taper the signal to zero at the start and end of each frame. The fast Fourier transform (FFT) converts each frame of N samples from the time domain into the frequency domain.
For each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on a scale called the “mel” scale. The mel-frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. One way to simulate the subjective spectrum is to use a filter bank spaced uniformly on the mel scale, as illustrated in Figure 5 [7]. The log mel spectrum is then transformed back to the time domain, and the output is called the mel-frequency cepstrum coefficients (MFCC).
Figure 5: An example of mel-spaced filterbank[7].
LDA
Linear discriminant analysis (LDA) is a statistical method that reduces the dimension of the features while maximizing the information preserved in the reduced feature space. Applying LDA after MFCC drastically reduces the dimension of the features, because LDA finds an optimal transformation matrix that preserves most of the information, which can then be used to discriminate between the various classes.
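The dimension reduction performed by LDA can be illustrated with the simplest case. The sketch below assumes two classes and 2-D features and computes the Fisher direction w = Sw^{-1} (mean_a - mean_b), where Sw is the within-class scatter matrix; projecting onto w reduces each feature vector to a single number. The function names and toy data are hypothetical, and the multi-class LDA used in the paper is not reproduced here.

```python
def mean2(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def fisher_direction(class_a, class_b):
    """Two-class Fisher LDA for 2-D features: w = Sw^{-1} (mean_a - mean_b)."""
    ma, mb = mean2(class_a), mean2(class_b)
    s11 = s12 = s22 = 0.0
    # Within-class scatter: sum of outer products of deviations from class means
    for pts, m in ((class_a, ma), (class_b, mb)):
        for x in pts:
            d0, d1 = x[0] - m[0], x[1] - m[1]
            s11 += d0 * d0
            s12 += d0 * d1
            s22 += d1 * d1
    det = s11 * s22 - s12 * s12  # assumed non-zero for this sketch
    d0, d1 = ma[0] - mb[0], ma[1] - mb[1]
    # Multiply the 2x2 inverse of Sw by the mean difference
    return ((s22 * d0 - s12 * d1) / det, (-s12 * d0 + s11 * d1) / det)

def project(x, w):
    """Reduce a 2-D feature vector to one dimension."""
    return x[0] * w[0] + x[1] * w[1]

class_a = [(1, 2), (2, 3), (3, 3), (2, 1)]
class_b = [(6, 5), (7, 8), (8, 7), (7, 6)]
w = fisher_direction(class_a, class_b)
print([round(project(x, w), 2) for x in class_a + class_b])
```

On this toy data the projected values of the two classes do not overlap, so a single threshold on the 1-D projection separates them.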
SVM
The SVM relies on prior training with input examples, using supervised learning techniques for data classification. SVMs have been tested for handwriting and face recognition, and in general pattern classification and regression applications. The SVM gives better results than neural networks despite the complexity of its structure and design.
Description
This paper proposes a three-phase design. The first phase takes the recorded data and passes it through the MFCC processor. Frame blocking divides the speech signal into frames of N samples, with adjacent frames separated by M samples (M < N). The first frame consists of the first N samples. The second frame starts M samples after the first frame and overlaps it by N - M samples, and so on [8]. This process continues until all the speech is accounted for within one or more frames. Typical values are N = 256 (which is equivalent to about 30 ms of windowing and facilitates the fast radix-2 FFT) and M = 100 [7].
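Frame blocking with these values can be sketched as below. The sketch assumes that the last frame is zero-padded so that every sample falls into at least one frame; the padding choice and the function names are assumptions, since the text does not say how the tail of the signal is handled.

```python
import math

def frame_blocking(signal, N=256, M=100):
    """Split a signal into frames of N samples whose starts are M samples apart."""
    n_frames = 1 + max(0, math.ceil((len(signal) - N) / M))
    # Zero-pad so that the last frame is complete (assumption, see lead-in)
    padded = list(signal) + [0.0] * ((n_frames - 1) * M + N - len(signal))
    return [padded[i * M: i * M + N] for i in range(n_frames)]

def frame_duration_ms(N, sample_rate_hz):
    return 1000.0 * N / sample_rate_hz

frames = frame_blocking(list(range(1000)))
print(len(frames), "frames; consecutive frames share", 256 - 100, "samples")
print(frame_duration_ms(256, 8000), "ms at 8000 Hz")
print(frame_duration_ms(256, 10000), "ms at 10000 Hz")
```

A frame of N = 256 samples lasts 32 ms at 8000 Hz and 25.6 ms at 10000 Hz, so the quoted figure of about 30 ms depends on the sampling rate in use.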
The outcome of windowing is the signal.
Comparative analysis
Method | Strengths | Weaknesses
LDA | Reduces the dimension of the features while maximizing the information preserved in the reduced feature space. Finds an optimal transformation matrix that preserves most of the information and can be used to discriminate between the various classes. | -
SVM | Easy to train, and scales to complicated high-dimensional data better than neural networks. | -
Table 5: Comparison of LDA and SVM.
Conclusions
The results show that SVM alone is insufficient for classification, and poor results are observed. However, when SVM is used together with LDA, the statistical tool that reduces the dimension of the feature data received from the MFCC processor, efficiency increases: the system proves to be 100% accurate and the prediction rate rises by more than 40%. The results also show that SVM efficiency falls as the dimensionality of the training data grows. The paper also identifies a practical solution for implementing the proposed system on an FPGA for a large increase in speed. Efficiency increases considerably when SVM is applied on top of LDA. The proposed speech recognition system, which is based on real tests, gives accurate results and greatly reduces complexity. Throughout the study, using SVM and LDA together was observed to give accurate results.
Speaker Verification Using MFCC and Support Vector Machine
Abstract
This paper presents a study on the use of support vector machines (SVM) and mel-frequency cepstral coefficients (MFCC) for text-dependent speaker verification. The MFCCs used in this paper are extracted from the voiced password spoken by the user. These MFCCs are normalized and then used as the speaker features for training a claimed speaker model via SVM. The claimed speaker model can then be used to distinguish between the speaker and impostors. Experiments were performed on the Aurora-2 database with different orders of MFCCs. The experimental results show that the proposed text-dependent speaker verification system based on 22nd-order MFCCs and SVM gives an equal error rate (EER) of 0.0% and an average accuracy rate of 95.1%.
Introduction
Speaker verification is a subclass of automatic speaker recognition (ASR) and is used to decide whether a person is who he or she claims to be. The speaker verification problem is therefore a true-false (accept-reject) question [8]. Speaker verification is widely needed in speech-related applications such as banking services by phone, voice calling, and biometric security systems. Depending on the recognition goal, speaker verification systems are divided into two types: text-independent and text-dependent. This paper focuses on text-dependent speaker verification for security reasons.
Description Of Technique
A speaker verification system consists of two tasks, enrollment and verification, as illustrated in Figure 6. Enrollment is the task of building a speaker model. This step extracts the speaker’s properties or features. Many present-day systems use MFCCs or linear prediction coefficients (LPCs) as speaker features. These speaker features are then used to construct a model that can authenticate the speaker during the verification phase [8].
Figure 6: The typical speaker verification system[8].
In the speaker verification task, the speaker features of the input speech from the test subject are extracted and matched against the speaker model. A likelihood ratio evaluates the similarity between the model and the measured observations [8]. The objective of this paper is to develop a more efficient approach to text-dependent speaker verification using SVM and MFCC. Different orders of MFCCs are used as speaker features to conduct speaker verification. First, the user gives a voiced password and the corresponding MFCCs are derived from this spoken password. The proposed text-dependent speaker verification system then uses SVM to train on the speaker features from these MFCCs and creates a speaker model that distinguishes between the speaker and impostors. Using speech signals selected from the Aurora-2 database, the experimental results show that the proposed speaker verification algorithm yields an equal error rate (EER) of 0% and an average accuracy rate of 95.1% with 22nd-order MFCCs [8].
MFCC
The MFCC can capture the acoustic properties for speech recognition. According to psychophysical studies, human perception of the frequency content of sounds follows a subjectively defined nonlinear scale called the “mel” scale [9], defined as
mel(f) = 2595 * log10(1 + f/700)    (1)
where f is the actual frequency in Hz. This leads to the definition of the MFCC, whose computation is as follows.
Figure 7 shows the mel-space filter bank with M = 40 [10].
Figure 7: Mel-space filter bank (M = 40) [10].
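To show what spacing filters uniformly on the mel scale means, the sketch below computes the centre frequencies of M = 40 filters. It assumes the mel mapping mel(f) = 2595 * log10(1 + f/700) and a band from 0 to 4000 Hz; both the band edges and the function names are assumptions, not values taken from [10].

```python
import math

def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_centers(num_filters=40, f_low=0.0, f_high=4000.0):
    """Centre frequencies (Hz) of filters spaced uniformly on the mel scale."""
    m_low, m_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (m_high - m_low) / (num_filters + 1)
    return [mel_to_hz(m_low + step * (i + 1)) for i in range(num_filters)]

centers = mel_filter_centers()
# Equal steps in mel become wider and wider steps in Hz.
print([round(c) for c in centers[:3]], "...", [round(c) for c in centers[-3:]])
```

The centres are close together at low frequencies and spread out at high frequencies, which mirrors the resolution of human hearing described above.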
Finally, the discrete cosine transform (DCT) is applied to the log filter bank energies, log[e(l)], and the MFCC coefficients Cm can be written as [8],
where 0 ≤ m ≤ M-1. Figure 11 summarizes the MFCC computation.
SUPPORT VECTOR MACHINE
An SVM is a two-class classifier built from sums of a known kernel function K(·, ·) to define a hyperplane,
where y_i ∈ {1, -1} are the target values and the vectors x_i ∈ R^n are the support vectors obtained from training [8]. This hyperplane separates given points into two predefined categories. Assume that a training set S and a kernel function K(x, z) = <φ(x), φ(z)> are given, where <·, ·> denotes the inner product and φ maps the input space X to another, high-dimensional feature space F. With a suitably chosen φ, the given samples S that are not linearly separable may be linearly separated in F, as illustrated in Figure 8. An improved SVM known as the soft-margin SVM can tolerate minor misclassifications.
Figure 8: A feature map simplifies the classification task[8].
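A toy example can show how a feature map simplifies classification. In the sketch below, 1-D points with large magnitude belong to class +1 and points near zero to class -1, so no single threshold on x separates them; after the mapping x -> x^2 (one coordinate of an assumed feature map phi(x) = (x, x^2)) a single threshold does. The data and the helper name are illustrative only.

```python
def separable_by_threshold(values, labels):
    """True if some threshold t and orientation s classify all 1-D points correctly."""
    pts = sorted(set(values))
    candidates = [pts[0] - 1] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1]
    for t in candidates:
        for s in (1, -1):
            if all((1 if s * (v - t) > 0 else -1) == y for v, y in zip(values, labels)):
                return True
    return False

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [1, 1, -1, -1, -1, 1, 1]

print(separable_by_threshold(xs, ys))                    # input space
print(separable_by_threshold([x * x for x in xs], ys))   # after mapping x -> x^2
```

In the mapped space the rule x^2 > 2.5 is a linear decision boundary, even though it corresponds to a non-linear rule in the original input space.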
THE PROPOSED SPEAKER VERIFICATION SYSTEM
Figure 9 illustrates the model of the text-dependent speaker verification system proposed in this paper. Before performing speaker verification, a claimed speaker model and an imposter model have to be created via SVM training.
Figure 9: The block diagram of the proposed text-dependent speaker verification system [8].
Comparative analysis
Method | Strengths | Weaknesses
SVM | Easy to train. Distinguishes well between the speaker and impostors. | -
MFCC | Captures the acoustic properties for speech recognition. The algorithm gives an equal error rate (EER) of 0.0% and an average accuracy rate of 95.1% with 22nd-order MFCCs. | Other MFCC orders do not give good accuracy.
Table 6: Comparison of SVM and MFCC.
CONCLUSIONS
Mel-frequency cepstral coefficients (MFCC) and the support vector machine (SVM) are applied to the task of text-dependent speaker verification. First, the MFCCs are derived from the voiced password submitted by the user. The proposed algorithm then uses SVM to train a speaker model from these MFCCs, resulting in a claimed speaker model that can distinguish between the speaker and impostors. Various experiments were carried out on the Aurora-2 database and showed that the proposed algorithm gives an equal error rate (EER) of 0.0% and an average accuracy rate of 95.1% with 22nd-order MFCCs.
An Overview of Automatic Speaker Recognition Technology
Abstract
This paper discusses the field of speaker recognition and the techniques used in it, and explains the main strengths and weaknesses of speaker recognition.
Introduction
The speech signal conveys several levels of information to the listener. At the primary level, speech conveys a message through words. At other levels, speech conveys information about the language being spoken, the emotion and gender of the speaker and, in general, the speaker’s identity. Whereas speech recognition aims to identify the words spoken, the main aim of automatic speaker recognition systems is to extract, characterize, and recognize the information in the speech signal that conveys speaker identity. The general area of speaker recognition includes two fundamental tasks. Speaker identification is the task of determining who is speaking from a group of known speakers or voices. Speaker verification (also called speaker authentication or detection) is the task of determining whether a person is who he or she claims to be (a yes/no decision).
Depending on the level of user cooperation and control in an application, the speech used for these tasks can be either text-dependent or text-independent. In a text-dependent application, the recognition system has prior knowledge of the text to be spoken, and the user is expected to speak that text. Prior knowledge and constraint of the text can significantly improve the performance of recognition systems. In a text-independent application, the system has no prior knowledge of the text. Text-independent recognition is more flexible but more difficult. The telephone system offers a familiar network of sensors for obtaining and delivering the speech signal.
Description Of Technique
APPLICATIONS
Speaker recognition applications are diverse and growing, and are used in areas such as access control, law enforcement, and transaction authentication. In access control, the biometric is added to the password to control access to a website. In law enforcement, the recognition system is used for home monitoring of inmates. In transaction authentication, the recognition system is used to control access to a customer account through telephone banking services.
VERIFICATION TECHNOLOGY
Figure 10 illustrates the basic structure of most speaker verification systems. Speaker verification systems apply a likelihood ratio test to decide whether the speech signal comes from the claimed speaker or from an imposter. The features extracted from the speech signal in front-end processing are compared with both the speaker model and the imposter model. The difference between the match scores of the speaker and the imposter is the likelihood ratio statistic (Λ), which is compared with a threshold (θ) in order to accept or reject the speaker.
Figure 10: Basic components of modern speaker verification systems[12]
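The accept/reject decision can be sketched with a deliberately simple model. The sketch assumes 1-D features, a single Gaussian (mean, variance) as the speaker model and another as the imposter model, and log-domain match scores so that the likelihood ratio statistic becomes a difference. Real systems use far richer models; every name and number here is illustrative.

```python
import math

def gaussian_log_pdf(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def avg_log_likelihood(features, model):
    """Average log-likelihood of the feature sequence under a (mean, var) model."""
    mean, var = model
    return sum(gaussian_log_pdf(x, mean, var) for x in features) / len(features)

def verify(features, speaker_model, imposter_model, theta=0.0):
    """Return (accepted, ratio): accept when the log likelihood ratio >= theta."""
    ratio = (avg_log_likelihood(features, speaker_model)
             - avg_log_likelihood(features, imposter_model))
    return ratio >= theta, ratio

speaker_model = (1.0, 0.25)    # hypothetical model of the claimed speaker
imposter_model = (0.0, 1.0)    # hypothetical model of everyone else

print(verify([0.9, 1.1, 1.0], speaker_model, imposter_model))
print(verify([-1.0, -0.5, 0.0], speaker_model, imposter_model))
```

Raising the threshold makes the system stricter (fewer false acceptances, more false rejections), which is the trade-off summarized by the equal error rate.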
Front-end processing is usually composed of three sub-processes. First, some form of speech detection is applied to remove the non-speech portions of the signal. After that, features conveying speaker information are extracted from the speech.
Speaker Modeling
Several modeling techniques have been used in speaker verification systems that offer some or all of the desirable qualities, such as a theoretical underpinning and a parsimonious representation in both size and computation. The choice of modeling technique depends largely on the kind of speech to be used, the expected performance, the ease of training, and storage considerations.
Some of the more common modeling techniques are:
In the Template Matching , the model is composed of a template, which is a series of feature vectors of fixed word. During the verification a match degree is produced by using dynamic time warping (DTW) to measure the similarity between the test words and the speaker template .This approach used in the test-dependent systems . In the Nearest Neighbor technique, all the features of the vectors from the enrollment speech are retained to represent the speaker. During the verification, the match degree is calculated as the cumulated distance of every test feature vector to its k nearest neighbors in the speaker’s training vectors.
In the Neural Networks technique, models are explicitly trained to discriminate between the speaker being modeled and alternative speakers. Training can be computationally expensive, and the models are not always generalizable.
The Hidden Markov Models technique uses HMMs, which encode the temporal evolution of the features and efficiently model their statistical variation, to provide a statistical representation of how a speaker produces sounds. During verification, the likelihood of the test feature sequence is computed against the speaker's HMMs. HMM-based systems generally produce the best performance.
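The two distance-based scores mentioned above, DTW for template matching and the cumulated k-nearest-neighbor distance, can be sketched as follows. This is a minimal version that assumes Euclidean distance between feature vectors and an unconstrained DTW step pattern. The one-dimensional "feature vectors" are invented for illustration, and lower values mean a better match.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(template, test):
    """Cumulative cost of the best monotonic alignment between two sequences."""
    n, m = len(template), len(test)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(template[i - 1], test[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def knn_score(train_vectors, test_vectors, k=1):
    """Cumulated distance from each test vector to its k nearest training vectors."""
    total = 0.0
    for t in test_vectors:
        dists = sorted(euclidean(t, v) for v in train_vectors)
        total += sum(dists[:k])
    return total

# Hypothetical one-dimensional feature sequences.
template = [[0.0], [1.0], [2.0], [1.0]]
same_word_slower = [[0.0], [0.0], [1.0], [2.0], [2.0], [1.0]]
other_word = [[2.0], [2.0], [0.0], [0.0]]

print(dtw_distance(template, same_word_slower))
print(dtw_distance(template, other_word))
print(knn_score(template, [[0.1], [1.9]], k=1))
```

The slower repetition of the same pattern aligns with the template at zero cost, which is the point of time warping: differences in speaking rate are absorbed by the alignment. The nearest-neighbor score ignores frame order entirely, so it does not depend on a fixed word.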
Imposter Modeling
The imposter model helps minimize non-speaker-related variability, such as noise, in the likelihood ratio score. There are two basic approaches to imposter modeling. The first approach, likelihood sets, uses a collection of other-speaker models to compute the imposter match score, usually as a function, such as the maximum, of the match scores against those models. The second approach, known as general background modeling, uses a single speaker-independent model trained on speech from a large number of speakers to represent speaker-independent speech. Comparing the performance of speaker verification applications is very difficult because each system uses a different enrollment approach.
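The difference between the two imposter-modeling approaches can be illustrated with a small sketch. The scores below are hypothetical log-likelihoods for one test utterance, and the function names are invented for this example. Only the structure of the computation is meant to reflect the text.

```python
def cohort_imposter_score(cohort_scores, combine=max):
    """Likelihood-set approach: combine match scores from other speakers' models."""
    return combine(cohort_scores)

def likelihood_ratio(speaker_score, imposter_score):
    """Log-domain likelihood ratio: speaker score minus imposter score."""
    return speaker_score - imposter_score

# Hypothetical log-likelihood scores for a single test utterance.
speaker_score = -41.0                    # against the claimed speaker's model
cohort_scores = [-47.5, -44.0, -52.3]    # against a set of other speakers' models
background_score = -46.2                 # against one speaker-independent model

llr_cohort = likelihood_ratio(speaker_score, cohort_imposter_score(cohort_scores))
llr_background = likelihood_ratio(speaker_score, background_score)
print(llr_cohort, llr_background)
```

With the likelihood-set approach, every cohort model must be scored for each test utterance, while the background-model approach needs only one extra model evaluation. Passing a different `combine` function changes how the cohort scores are summarized.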
STRENGTHS AND WEAKNESSES
The main strength of speaker verification is that it can be used over the telephone network with no additional user equipment or training, and that it is based on a signal that is natural to produce. In addition, speaker verification is simple to use, has low computational requirements, and achieves high accuracy. However, the flexibility of speech also leads to its vulnerabilities. Speech is a behavioral signal that speakers may not reproduce consistently and that can be affected by the speaker's health, for example by a cold or a sore throat. The variety of microphones and channels that people use can also cause difficulties, because most speaker verification systems rely on low-level spectral features that are susceptible to transducer and channel effects.
Comparative Analysis
| Method | Strengths | Weaknesses |
|---|---|---|
| Template Matching | Used for text-dependent applications. | Not used for text-independent applications. |
| Nearest Neighbor | Feature vector pruning techniques are applied to limit storage and computation. | - |
| Neural Networks | Models are explicitly trained. | Not always generalizable. |
| Hidden Markov Models | HMM systems generally produce the best performance. | - |
Table 7: Comparisons between different methods.
Conclusion
This paper described the field of speaker recognition and the techniques used in it, and explained the main strengths and weaknesses of speaker recognition.
References
[1] S. M. Kamruzzaman, A. N. M. Rezaul Karim, Md. Saiful Islam, and Md. Emdadul Haque, "Speaker Identification using MFCC-Domain Support Vector Machine".
[2] Geeta Nijhawan and M. K. Soni, "Speaker Recognition using Support Vector Machine", International Journal of Computer Applications (0975-8887), Volume 87, No. 2, February 2014.
[3] Amruta Anantrao Malode and Shashikant Sahare, "Advanced Speaker Recognition", International Journal of Advances in Engineering & Technology, Vol. 4, Issue 1, pp. 443-455, 2012.
[4] D. A. Reynolds, "Experimental evaluation of features for robust speaker identification", IEEE Trans. Speech Audio Process., vol. 2(4), pp. 639-643, Oct. 1994.
[5] Shi-Huang Chen and Yu-Ren Luo, "Speaker Verification Using MFCC and Support Vector Machine", Proceedings of the International MultiConference of Engineers and Computer Scientists 2009, Vol. I, IMECS 2009, March 18-20, 2009, Hong Kong.
[6] http://research.cs.tamu.edu/prism/lectures/sp/l16.pdf
[7] Aamir Khan, Muhammad Farhan, and Asar Ali, "Speech Recognition: Increasing Efficiency of Support Vector Machines", International Journal of Computer Applications (0975-8887), Volume 35, No. 7, December 2011.
[8] Shi-Huang Chen and Yu-Ren Luo, "Speaker Verification Using MFCC and Support Vector Machine", Proceedings of the International MultiConference of Engineers and Computer Scientists 2009, Vol. I, IMECS 2009, March 18-20, 2009, Hong Kong.
[9] S. B. Davis and P. Mermelstein, "Comparison of Parametric Representation for Monosyllabic Word Recognition in Continuously Spoken Sentences", IEEE Trans. on ASSP, vol. ASSP-28, no. 4, pp. 357-365, Aug. 1980.
[10] http://www.softwarepractice.org/wiki/Team_D_Speaker_Recognition (MediaWiki, Team D Speaker Recognition)
[11] http://www.elda.org/article52.html (Aurora Database 2.0)
[12] Douglas A. Reynolds, "An Overview of Automatic Speaker Recognition Technology", 0-7803-7402-9/02/$17.00 © 2002 IEEE.