ABSTRACT
The main task of automatic Language Identification (LID) is to quickly and accurately discriminate the language of a spoken utterance (e.g. English, Spanish, etc.). Speech perception in humans is a multimodal process: the visual signal (lip movements), as well as the auditory signal, is a robust and rich source of information. Observers can extract different components of the visual signal and use them to judge the source and content of the signal. This paper discusses a new approach to language identification using visual speech recognition, i.e. lip reading; the related technology is called Visual-Only Language Identification (VLID). Research in this field can also benefit automatic lip reading.
Keywords: Language Identification (LID); Visual speech; Lip Reading; VLID
I. Introduction
Automatic Language Identification (LID) is the process of recognizing the language of a given spoken utterance by a computer. Language identification has numerous applications across a wide range of multilingual services. An example is a language identification system used to route an incoming telephone call to a human switchboard operator fluent in the corresponding language.
Automatic visual language identification (VLID) is the technology that makes use of visual cues derived from the movement of the speech articulators (lip movements) to identify the language of a spoken utterance, without using any audio information [1]. This technique is useful in situations where conventional audio processing is ineffective due to very noisy environments, or when no audio signal is available. In our paper [2], an overview of spoken language identification, the various language identification cues, and the basic framework of visual language identification is given. According to the proposed framework, the first task is feature extraction from videos of the speech articulators, i.e. the movements of the lips. For this it is necessary to detect the lips in the frames of videos containing face images.
This paper discusses lip detection using a Constrained Local Model (CLM). In Section II, various methods of lip detection are discussed. In Section III, CLM model building and the search process are described. Results are shown in Section IV, and Section V presents the conclusion and the future scope of this study.
II. Methods of Lip Detection
Many techniques for lip detection/localization in digital images have been reported in the literature; they can be categorized into two main types of solutions:
1. Model-based lip detection methods. This approach depends on building lip model(s), with or without training face images, and subsequently using the defined model to search for the lips in any freshly input image. The best fit to the model, with respect to some prescribed criteria, is declared to be the location of the detected lips. Such models include spline-based deformable contours called "Snakes", Active Shape Models (ASM), Active Appearance Models (AAM), and deformable templates.
i) Snakes
Snakes are spline-based deformable contours; the idea is to iteratively minimise the energies of the spline so that it settles into a local minimum on the target boundary. The optimal fit between a snake and the detected shape is found by minimising the classical snake energy

E_snake = ∫ [E_internal(v(s)) + E_image(v(s)) + E_constraint(v(s))] ds

where v(s) is the parameterised contour, E_internal penalises stretching and bending of the spline, E_image attracts the contour to image features such as edges, and E_constraint encodes any external constraints.
Fig.1 Lip Detection using Snakes
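As an illustration only (not part of the original work), the following sketch fits a snake around an assumed mouth region using scikit-image's active_contour; the input filename, the initial ellipse coordinates and the energy weights are all hypothetical values chosen for demonstration.

```python
# A minimal snake fit, assuming scikit-image is available.
import numpy as np
from skimage import io
from skimage.color import rgb2gray
from skimage.filters import gaussian
from skimage.segmentation import active_contour

image = io.imread("face_frame.png")        # hypothetical input frame
gray = gaussian(rgb2gray(image), sigma=3)  # smoothing widens the energy basin

# Initial contour: an ellipse roughly centred on the mouth (assumed coordinates).
s = np.linspace(0, 2 * np.pi, 100)
init = np.column_stack([120 + 25 * np.sin(s),   # row coordinates
                        160 + 45 * np.cos(s)])  # column coordinates

# alpha and beta weight the internal (spline) energies; w_edge the image energy.
snake = active_contour(gray, init, alpha=0.015, beta=10.0,
                       w_line=0, w_edge=1.0, gamma=0.001)
print(snake.shape)  # (100, 2): the fitted contour points
```

Such a fit exhibits exactly the problems listed below: a poor initial ellipse can lock onto the chin or nose, and the smooth spline struggles to reach the sharp mouth corners.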
Problems using snakes are as follows:
1. They can fit to the wrong feature, such as the nose or the chin, especially if the initial position was far from the lip edges.
2. Snakes, like the splines they are built from, do not bend sharply with ease, so it is difficult for them to locate sharp curves such as the corners of the mouth.
3. They can easily be fooled by facial hair (moustache and beard).
4. Sometimes (depending on the detected object) the tuning of the snake parameters is very difficult and time-consuming (several seconds).
ii) Active Shape Models
ASMs are statistical models of the shapes of objects, which are iteratively adjusted to fit the detected object in a digital image. A shape is represented by a set of labelled "landmarks", where each point is given by its x and y coordinates:

X = {(x1, y1), (x2, y2), …, (xn, yn)}
Fig.2 ASM landmark points
Using a training set of landmarked objects in images, ASMs build a statistical model of the global shape variation in the training set. Principal component analysis (PCA) is used to construct such a model. In earlier work, artificial markers or coloured lipstick were applied to the speaker's mouth to extract shape-based features consisting of the lip contours. The use of artificial markers is not suitable for practical applications.
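A minimal sketch of how such a statistical shape model can be built with PCA, assuming the landmark shapes have already been aligned (numpy only; the function name and the 95% variance threshold are illustrative choices, not from the paper):

```python
# Build an ASM-style shape model from aligned training shapes.
# `shapes` is a hypothetical (num_images, 2n) array of flattened landmarks.
import numpy as np

def build_shape_model(shapes, variance_kept=0.95):
    """Return the mean shape and the principal modes of shape variation."""
    mean = shapes.mean(axis=0)
    centered = shapes - mean
    # Eigen-decomposition of the landmark covariance matrix.
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]          # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Keep enough modes to explain e.g. 95% of the total variance.
    k = np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), variance_kept) + 1
    return mean, eigvecs[:, :k], eigvals[:k]

# Any plausible shape x can then be approximated as x ≈ mean + P @ b,
# where b holds the shape parameters, usually limited to ±3*sqrt(eigvals).
```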
iii) Active Appearance Models
Active Appearance Models (AAM) are similar to ASMs, but instead of using only the edge profile along each landmark, AAMs are extended to include the grey-scale information of the whole region along with the shape information. The ASM uses models of the image texture around the landmark points, whereas the AAM uses a model of the image texture of the whole region. The ASM searches around the boundary, whereas the AAM only samples the image under the current position. The ASM seeks to minimize the distance between the model and the image points, whereas the AAM seeks to minimize the difference between the synthesized model image and the target image.
2. Image-based lip detection methods. These include the use of spatial information, pixel colour and intensity, lines, corners, edges, and motion. Since the colour of the lips differs from the colour of the facial region around them, detecting lips using colour information has recently attracted researchers' interest. Its appeal lies in its simplicity: it is not time-consuming and uses few resources, e.g. little memory. The most important cues that researchers focus on are: the red and green components of the RGB colour system, the hue of the HSV colour system, and the red- and blue-difference components of the YCbCr colour system.
i) RGB approach
Colours are seen as variable combinations of the so-called primary colours red (R), green (G) and blue (B). The primary colours can be added to produce the secondary colours of light: magenta, cyan, and yellow. The RGB colour system can be represented by a cube, where the pure R, G and B values lie at three corners and cyan, magenta and yellow at the other three; black is at the origin, and white is at the corner farthest from the origin. In RGB space, skin and lip pixels have different components: red is dominant for both, green exceeds blue in skin colour, and skin appears more yellow than the lips. In a typical RGB-based approach, the image is transformed by a linear combination of the red, green and blue components; a high-pass filter is then applied to highlight the details of the lips in the transformed image, after which the generated images are converted to a binary image. The largest area in the binary image is taken to be the lip area.
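A hedged sketch of this kind of RGB scheme using OpenCV; the channel weights (2R − G − B) and the Otsu threshold are illustrative assumptions, since the exact linear combination used by the approaches described above is not specified here:

```python
# Illustrative RGB-based lip detection: emphasise red, binarise,
# keep the largest connected region as the mouth.
import cv2
import numpy as np

img = cv2.imread("face_frame.png")                 # hypothetical input
b, g, r = cv2.split(img.astype(np.float32))

# Lips are redder and less green than skin: weight R up, G and B down (assumed weights).
lip_map = 2.0 * r - g - b
lip_map = cv2.normalize(lip_map, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

# Binarise (Otsu) and take the largest contour as the lip area.
_, binary = cv2.threshold(lip_map, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
lip_contour = max(contours, key=cv2.contourArea)
x, y, w, h = cv2.boundingRect(lip_contour)
print("lip bounding box:", x, y, w, h)
```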
ii) HSV approach
The RGB colour system is most commonly used in practice, while HSV is closer to how humans describe and interpret colour. Hue (H) represents the dominant colour as perceived by an observer; when someone calls an object green, blue, orange, or any other named colour, he/she is specifying its hue. Saturation (S) refers to the amount of white light mixed with a hue, and brightness (or intensity) is represented by the value (V). There are specific equations to convert from RGB to HSV and vice versa. Hue is the key feature for detecting lips in the HSV colour system, since the hue value of lip pixels is smaller than that of face pixels.
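A minimal hue-thresholding sketch in OpenCV (which stores H in [0, 180)); the hue and saturation bounds are assumed values for illustration and would need tuning per dataset:

```python
# Illustrative HSV-based lip detection: lip hues cluster near red,
# i.e. near 0 and near 180 on OpenCV's hue scale.
import cv2
import numpy as np

img = cv2.imread("face_frame.png")                 # hypothetical input
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

# Lip pixels: low (reddish) hue with reasonable saturation (assumed bounds).
lower_red = cv2.inRange(hsv, (0, 60, 40), (12, 255, 255))
upper_red = cv2.inRange(hsv, (168, 60, 40), (180, 255, 255))
mask = cv2.bitwise_or(lower_red, upper_red)

# Remove small speckles before taking the largest region as the mouth.
mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
```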
iii) YCbCr approach
This colour space is used in digital video, where Y is the luma component, and Cb and Cr are the blue-difference and red-difference chroma components respectively. Using the YCbCr colour space for detecting lips is based on the fact that lips are redder than the rest of the face, so the lip region has high Cr and low Cb values. The Cr component is therefore maximised using a specific equation while Cb is minimised. After using edge detection as a mask to remove unnecessary information, the output image of the equation is thresholded, and the largest area is taken to be the mouth area.

The drawback of the model-based lip detection methods described earlier is that they need a significant amount of processing time, plus training time for those which require a training step; this makes them difficult to apply in online systems or on low-resource machines, such as the PDAs in which this study is particularly interested. Implementations of ASM and AAM are always difficult to run in real time, and AAM is slower than ASM. The drawback of the colour-based approach is that it is vulnerable to variations in lighting conditions.
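A sketch of a Cr-emphasising lip map in OpenCV. The map Cr²·(Cr² − η·Cr/Cb)² used here is one common formulation from the face-analysis literature; the specific equation referred to above may differ, and the filenames are hypothetical:

```python
# Illustrative YCbCr-based lip map: high Cr and low Cb mark the lips.
import cv2
import numpy as np

img = cv2.imread("face_frame.png")                 # hypothetical input
ycrcb = cv2.cvtColor(img, cv2.COLOR_BGR2YCrCb).astype(np.float32)
y, cr, cb = cv2.split(ycrcb)                       # note the Y, Cr, Cb order

cr_n = cr / 255.0
ratio = cr / np.maximum(cb, 1.0)                   # avoid division by zero
eta = 0.95 * (cr_n ** 2).mean() / ratio.mean()     # data-driven weight (assumed)
lip_map = (cr_n ** 2) * (cr_n ** 2 - eta * ratio) ** 2

# Threshold the map; the largest remaining area would be the mouth region.
lip_map = cv2.normalize(lip_map, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
_, mask = cv2.threshold(lip_map, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
```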
III. CLM: Constrained Local Model
Detecting specific points, such as the eyes, mouth, lips and nose, in a given face image is a difficult task. A CLM can be used to find these feature points. Given a rectangular patch of image that contains a face (and often also its background), how does a person find the mouth, nose, etc.? There are two things to note in this process:
a. We already know what eyes/noses generally look like; otherwise we could not find them even if they were in the patch of image. Technically speaking, we have a model of the appearance of the eye/nose, and we use this model to search for them in a given patch of image.
b. We also know the arrangement of the eyes/nose on a face: eyes on top, nose in the centre, etc. We can think of this as a constraint. This constraint is useful because, when asked to search for the nose in a face image, we do not need to search the whole image, only the centre or a close neighbourhood of the centre (a local region).
In summary, we use models to search the local region around where the corresponding item might appear, and we use knowledge of the shape of a face to constrain the search. This is the rationale behind CLM: we use local models to search for individual items, and use knowledge of the shape to constrain the search, hence the name Constrained Local Model [3].
Fig.3 CLM shape and patch model
So conceptually, the implementation of CLM consists of two steps:
a. Building a CLM model from a set of training images.
b. Using the CLM to search new images.
1. CLM Model-building
Model building can also be considered the training phase, in which we build a model from training images. As previously mentioned, a CLM consists of two parts: the shape model, describing the shape variation of the feature points, and the patch model, describing what each patch of image around a feature point might look like. The shape model describes how the face shape can vary, and is often built with PCA. Before applying PCA, preprocessing is done using Procrustes analysis to remove translation, scale and rotation.
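A minimal sketch of this Procrustes preprocessing step, aligning one training shape to a reference shape by removing translation, scale and rotation (numpy only; shapes are assumed to be (n, 2) landmark arrays):

```python
# Orthogonal Procrustes alignment of one landmark shape to a reference.
import numpy as np

def procrustes_align(shape, reference):
    """Align `shape` to `reference`, removing translation, scale and rotation."""
    s = shape - shape.mean(axis=0)           # remove translation
    r = reference - reference.mean(axis=0)
    s = s / np.linalg.norm(s)                # remove scale
    r = r / np.linalg.norm(r)
    # Optimal rotation via SVD of the cross-covariance matrix.
    u, _, vt = np.linalg.svd(s.T @ r)
    return s @ (u @ vt)
```

Each training shape is aligned to a common reference (typically the evolving mean shape) before the PCA of the previous subsection is applied.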
The patch model describes what the image around each feature point should look like. Patch models can be built in several ways. In this paper, we build a template model using a linear Support Vector Machine (SVM). For each feature point, we train a linear SVM to recognize the local patch around that feature point, and later use it in the search process.
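A hedged sketch of training one such patch model with a linear SVM, assuming scikit-learn is available; how positive and negative patches are sampled (and the patch size) is an illustrative choice, not the paper's exact procedure:

```python
# Train one per-landmark patch model as a linear SVM.
import numpy as np
from sklearn.svm import LinearSVC

def train_patch_model(pos_patches, neg_patches):
    """pos/neg_patches: lists of equal-size grayscale patches (2-D arrays).
    Positives are centred on the annotated landmark; negatives are
    displaced patches sampled nearby."""
    X = np.array([p.ravel() for p in pos_patches + neg_patches], dtype=np.float64)
    # Per-patch normalisation for some robustness to lighting.
    X = (X - X.mean(axis=1, keepdims=True)) / (X.std(axis=1, keepdims=True) + 1e-8)
    y = np.array([1] * len(pos_patches) + [0] * len(neg_patches))
    return LinearSVC(C=1.0).fit(X, y)

# At search time the learned weights act as a correlation filter:
# model.coef_.reshape(patch_h, patch_w) is slid over the local region
# to produce the response image used in the search process below.
```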
2. Search Process
After building a CLM model, we can use it to find the positions of the nose, eyes and mouth in a rectangular image; we call this the search process. Its steps are as follows (a simplified code sketch is given after the list):
1. Make an initial guess of the feature point positions.
2. For each feature point, use its SVM to search the local region around the current position, obtaining an SVM response image.
3. Fit each response image with a quadratic function.
4. Find the best feature point positions by optimising the quadratic functions subject to the shape constraints.
5. Repeat steps 2-4 until convergence.
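The following heavily simplified sketch ties the steps together. For brevity it replaces the quadratic fit of step 3 with a plain argmax over each response image, and approximates the constrained optimisation of step 4 by projecting the updated shape onto the PCA subspace; `gray`, the per-landmark filters and the shape model (mean, P) are assumed inputs built as in the previous subsections:

```python
# A condensed CLM search loop, assuming numpy and scipy are available.
import numpy as np
from scipy.ndimage import correlate

def clm_search(gray, init_points, filters, mean, P, n_iters=10, half=10):
    """gray: 2-D image; init_points: (n, 2) as (row, col); filters: list of
    per-landmark SVM weight arrays; mean, P: the PCA shape model.
    Assumes all points stay at least `half` pixels inside the image."""
    points = init_points.astype(float)
    for _ in range(n_iters):
        for i, (r, c) in enumerate(points.astype(int)):
            # Step 2: response image = SVM filter correlated over the local region.
            region = gray[r - half:r + half + 1, c - half:c + half + 1]
            response = correlate(region, filters[i])
            # Steps 3-4 (simplified): take the response peak directly.
            dr, dc = np.unravel_index(response.argmax(), response.shape)
            points[i] = (r - half + dr, c - half + dc)
        # Step 4: enforce the shape constraint by projecting onto the PCA subspace.
        b = P.T @ (points.ravel() - mean)
        points = (mean + P @ b).reshape(-1, 2)
    return points  # step 5: in practice, loop until movement falls below a tolerance
```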
IV. Results
The implementation of the Constrained Local Model for face alignment [4] was modified for the detection of lips. It was tested using the FGnet talking-face video database. Fig. 4 shows the initial position during the search and Fig. 5 shows the position of the lips after convergence. Fig. 6 shows the images for one complete iteration cycle until convergence.
Fig.4 Initial Position
Fig.5 After Convergence
V. Conclusion and Future Scope
Visual speech recognition is the process of automating the human ability to lip read. Visual speech is transcribed using visemes; a viseme, i.e. a visual phoneme, is the visual appearance of the speech articulators (lips) involved in speech. So the first step towards spoken language identification using VSR is the detection of the lips. The next step is feature extraction, to infer the utterance by recognizing the corresponding visemes. Viseme recognition uses the shape information of the lips, for which it is necessary to define feature points on the lips. The Constrained Local Model uses an Active Shape Model together with a patch model. The ASM requires labelled landmark points during model building, and these are further useful for extracting features such as the height, width and opening of the lips, to be used for visual feature extraction. The next step in this research on spoken language identification using VSR is to track the movement of the lips by observing the differences between the various feature points defined on the lips, and thereby to infer the possible utterance.
In order to use this for videos of a talking person, a person-specific training dataset is needed. Developing this dataset requires manually labelling the landmark points on each image frame, so this research will further concentrate on developing such a dataset.
Fig.6 Images for one cycle of iterations until convergence
References
[1] Jacob L. Newman and Stephen J. Cox, "Language Identification Using Visual Features," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 7, September 2012, pp. 1936-1937.
[2] N. Kale and U. S. Bhadade, "An overview of spoken language identification using visual cues from speech," cjtim.
[3] Xiaoguang Yan, "Constrained Local Model for Face Alignment, a Tutorial."
[4] http://sites.google.com/site/xgyanhome/home/projects/clm-implementation