Audio-Visual Speech Recognition Using Lip Information Extracted from Side-Face Images

This paper proposes an audio-visual speech recognition method using lip information extracted from side-face images as an attempt to increase noise robustness in mobile environments. Our proposed method assumes that lip images can be captured using a small camera installed in a handset. Two different kinds of lip features, lip-contour geometric features and lip-motion velocity features, are used individually or jointly, in combination with audio features. Phoneme HMMs modeling the audio and visual features are built based on the multistream HMM technique. Experiments conducted using Japanese connected-digit speech contaminated with white noise in various SNR conditions show the effectiveness of the proposed method. Recognition accuracy is improved by using the visual information in all SNR conditions. These visual features were confirmed to be effective even when the audio HMM was adapted to noise by the MLLR method.


INTRODUCTION
In the current environment of mobile technology, the demand for noise-robust speech recognition is growing rapidly. Audio-visual (bimodal) speech recognition techniques using face information in addition to acoustic information are a promising direction for increasing the robustness of speech recognition, and many audio-visual methods have been proposed thus far [1][2][3][4][5][6][7][8][9][10][11]. Most use lip information extracted from frontal images of the face. However, when using these methods in mobile environments, users need to hold a handset with a camera in front of their mouth at some distance, which is not only unnatural but also inconvenient for conversation. Moreover, since the increased distance between the mouth and the handset decreases the SNR, recognition accuracy may worsen. If the lip information could be captured by a handset held in the usual way for telephone conversations, the usefulness of the system would be greatly improved.
From this point of view, we propose an audio-visual speech recognition method using side-face images, assuming that a small camera can be installed near the microphone of the mobile device in the future. This method captures the images of lips located at a small distance from the microphone. Many geometric features, mouth width and height [3,11], teeth information [11], and information about points located on a lip contour [6,7], have already been used for bimodal speech recognition based on frontal-face images. However, since these features were extracted based on "oval" mouth shape models, they are not suitable for side-face images. To effectively extract geometric information from side-face images, this paper proposes using lip-contour geometric features (LCGFs) based on a time series of estimated angles between upper and lower lips [12]. In our previous work on audio-visual speech recognition using frontal-face images [9,10], we used lip-motion velocity features (LMVFs) derived by optical-flow analysis. In this paper, LCGFs and LMVFs are used individually and jointly [12,13]. (Preliminary versions of this paper have been presented at workshops [12,13].) Since LCGFs use lip-shape information, they are expected to be effective in discriminating phonemes. On the other hand, since LMVFs are based on lip-movement information, they are expected to be effective in detecting voice activity. In order to integrate the audio and visual features, a multistream HMM technique is used.
In Section 2, we explain the method for extracting the LCGFs. Section 3 describes the extraction method of the LMVFs based on optical-flow analysis. Section 4 explains our audio-visual recognition method. Experimental results are reported in Section 5, and Section 6 concludes this paper.

EXTRACTION OF LIP-CONTOUR GEOMETRIC FEATURES
Upper and lower lips in side-face images are modeled by two-line components. An angle between the two lines is used as the lip-contour geometric features (LCGFs). The angle is hereafter referred to as "lip-angle." The lip-angle extraction process consists of three components: (1) detecting a lip area, (2) extracting a center point of lips, and (3) determining lip-lines and a lip-angle. Details are explained in the following subsections.

Detecting a lip area
In the side-view video data, the speaker's lips are detected using a rectangular window. An example of a detected rectangular area is shown in Figure 1.
To detect a rectangular lip area in an image frame, two kinds of image processing methods are used: edge detection by Sobel filtering and binarization using hue values. Examples of the edge image and the binary image are shown in Figures 2(a) and 2(b), respectively. As shown in Figure 2(a), the edge image is effective in detecting the horizontal positions of the nose, mouth, lips, and jaw. Therefore, the edge image is used for the horizontal search of the lip area: the number of edge points on every vertical line in the image is counted, and the image area whose edge-point count exceeds a preset threshold is found. The area (1) in Figure 2(a) indicates the area detected by the horizontal search.
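The horizontal search described above can be sketched as follows. This is a hypothetical helper, not the paper's implementation; the threshold value and image representation are assumptions.

```python
def horizontal_lip_search(edge_image, threshold):
    """Find the widest horizontal run of columns whose edge-point
    count exceeds `threshold`.

    edge_image: 2-D list of 0/1 edge flags (rows x columns), e.g. the
    output of Sobel filtering followed by thresholding.
    Returns (left, right) column indices, or None if no column qualifies.
    """
    if not edge_image or not edge_image[0]:
        return None
    n_cols = len(edge_image[0])
    # Count edge points on every vertical line (column) of the image.
    counts = [sum(row[c] for row in edge_image) for c in range(n_cols)]
    runs, start = [], None
    for c, n in enumerate(counts):
        if n > threshold and start is None:
            start = c                    # a qualifying run begins
        elif n <= threshold and start is not None:
            runs.append((start, c - 1))  # the run ended at column c - 1
            start = None
    if start is not None:
        runs.append((start, n_cols - 1))
    return max(runs, key=lambda r: r[1] - r[0]) if runs else None
```

Keeping the widest qualifying run is one plausible way to pick a single area when several column ranges exceed the threshold.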
Since the lip, cheek, and chin areas have hue values within 1.5π ∼ 2.0π, these areas are detected by thresholding the hue values within the area detected above. The region-labeling technique [14] is applied to the binary image generated by the thresholding process to detect connected regions. The largest connected region in area (1), indicated by (2) in Figure 2(b), is extracted as the lip area.
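The largest-connected-region step can be sketched with a 4-connected flood fill; the paper's actual labeling algorithm [14] may differ, and the function name is hypothetical.

```python
from collections import deque

def largest_connected_region(binary):
    """Return the set of (row, col) pixels of the largest 4-connected
    region of 1-pixels in `binary` (a 2-D list of 0/1 values)."""
    rows, cols = len(binary), len(binary[0])
    seen, best = set(), set()
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] == 1 and (r, c) not in seen:
                # Flood-fill one connected region starting from (r, c).
                region, queue = set(), deque([(r, c)])
                seen.add((r, c))
                while queue:
                    y, x = queue.popleft()
                    region.add((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] == 1
                                and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            queue.append((ny, nx))
                if len(region) > len(best):
                    best = region
    return best
```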
To determine a final square area (3), horizontal search on an edge image and vertical search on a binary image are sequentially conducted to cover the largest connected region. Since these two searches are independently conducted, the aspect ratio of the square is variable. The original image of the square area shown in Figure 2(c) is extracted for use in the following process.

Extracting the center point of lips
The center point of the lips is defined as the intersection of the upper and lower lips, as shown in Figure 1. To find the center point, a dark area considered to be the inside of the mouth is first extracted from the rectangular lip area. The dark area is defined as the set of pixels having brightness values lower than a preset threshold. In our experiments, the threshold was manually set to 15 after preliminary experiments using a small dataset. The leftmost point of the dark area is extracted as the center point.
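A minimal sketch of the center-point extraction, assuming grayscale brightness values and the manually tuned threshold of 15 reported above (the function name is hypothetical):

```python
def lip_center_point(gray, threshold=15):
    """Return the leftmost pixel (row, col) of the dark area (pixels
    with brightness below `threshold`) inside a rectangular lip area,
    or None if no pixel is dark enough.

    gray: 2-D list of brightness values.
    """
    dark = [(r, c) for r, row in enumerate(gray)
            for c, v in enumerate(row) if v < threshold]
    if not dark:
        return None
    # Leftmost dark pixel = smallest column index (ties broken by row).
    return min(dark, key=lambda p: (p[1], p[0]))
```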

Determining lip-lines and a lip-angle
Finally, the two lines modeling the upper and lower lips are determined in the lip area. These lines are referred to as "lip-lines." The detection process is as follows. First, an AND (overlapped) image of the edge and binary images is created. Lip-lines for the upper and lower lips are then fitted to the lip contours in this AND image, passing through the center point of the lips. An example of a sequence of extracted lip-lines is shown in Figure 4. Finally, the lip-angle between the upper and lower lip-lines is measured.

Building LCGF vectors
The LCGF vectors, consisting of a lip-angle and its derivative (delta), are calculated for each frame and normalized by the maximum values in each utterance. Figure 5(a) shows an example of a time function of the normalized lip-angle for a Japanese digit utterance, "7102, 9134," along with the period of each digit. The features are almost constant in pause/silence periods and take large values when the speaker's mouth is opened widely. As the figure indicates, the speaker's mouth starts moving approximately 300 milliseconds before the sound is acoustically emitted. The normalized lip-angle values between 2.8 ∼ 3.5 seconds indicate that the speaker's mouth is not immediately closed after uttering "2" (/ni/). A sequence of large lip-angle values appearing after 7.0 seconds in Figure 5(a) is attributed to lip-line determination errors.
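The LCGF construction can be sketched as follows, assuming per-utterance normalization by the maximum absolute value; the delta padding and the function name are illustrative assumptions, not the paper's exact recipe.

```python
def build_lcgf_vectors(lip_angles):
    """Build LCGF vectors (normalized lip-angle, normalized delta) for
    one utterance, given a list of per-frame lip-angle values.
    Both components are normalized by their maximum absolute value over
    the utterance."""
    # Delta: frame-to-frame difference, padded so lengths match.
    deltas = [0.0] + [b - a for a, b in zip(lip_angles, lip_angles[1:])]
    max_a = max(abs(a) for a in lip_angles) or 1.0  # avoid divide-by-zero
    max_d = max(abs(d) for d in deltas) or 1.0
    return [(a / max_a, d / max_d) for a, d in zip(lip_angles, deltas)]
```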

EXTRACTION OF LIP-MOTION VELOCITY FEATURES
Our previous research [9,10] shows that visual information of lip movements extracted by optical-flow analysis based on the Horn-Schunck optical-flow technique [15] is effective for bimodal speech recognition using frontal-face (lip) images.

EURASIP Journal on Audio, Speech, and Music Processing

Thus, the same feature extraction method [9] is applied to a bimodal speech recognition method using side-face images. The following subsections explain the Horn-Schunck optical-flow analysis technique [15] and our feature extraction method [9], respectively.

Optical-flow analysis
To apply the Horn-Schunck optical-flow analysis technique [15], image brightness at a point (x, y) in an image plane at time t is denoted by E(x, y, t). Assuming that the brightness of each point is constant during a movement over a very short period, the following equation is obtained:

dE/dt = 0.

If we let u = dx/dt and v = dy/dt, then a single linear equation

E_x u + E_y v + E_t = 0    (1)

is obtained, where E_x, E_y, and E_t denote the partial derivatives of E with respect to x, y, and t. Since this single equation cannot determine both velocity components, an additional constraint is imposed that minimizes the square of the magnitude of the gradient of the optical-flow velocity:

(∂u/∂x)^2 + (∂u/∂y)^2 + (∂v/∂x)^2 + (∂v/∂y)^2.

This is called the "smoothness constraint." As a result, an optical-flow pattern is obtained under the condition that the apparent velocity of the brightness pattern varies smoothly in the image. The flow velocity of each point is practically computed by an iterative scheme using the average of flow velocities estimated from neighboring pixels.
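The iterative scheme can be illustrated with a minimal pure-Python sketch. This is a simplified rendition of Horn-Schunck: the derivative estimates and boundary handling differ from the original formulation [15], and all names are hypothetical.

```python
def horn_schunck(img1, img2, alpha=1.0, iterations=5):
    """Minimal Horn-Schunck optical flow for two same-sized 2-D lists
    of gray values.  Returns (u, v) flow fields.  Derivatives use simple
    forward differences (the original averages over a 2x2x2 cube)."""
    rows, cols = len(img1), len(img1[0])
    zeros = lambda: [[0.0] * cols for _ in range(rows)]
    Ex, Ey, Et = zeros(), zeros(), zeros()
    for r in range(rows):
        for c in range(cols):
            Ex[r][c] = img1[r][min(c + 1, cols - 1)] - img1[r][c]
            Ey[r][c] = img1[min(r + 1, rows - 1)][c] - img1[r][c]
            Et[r][c] = img2[r][c] - img1[r][c]
    u, v = zeros(), zeros()

    def local_avg(f, r, c):
        # Average of the 4-neighborhood, clipped at the image border.
        pts = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
        vals = [f[y][x] for y, x in pts if 0 <= y < rows and 0 <= x < cols]
        return sum(vals) / len(vals)

    for _ in range(iterations):
        u_new, v_new = zeros(), zeros()
        for r in range(rows):
            for c in range(cols):
                ub, vb = local_avg(u, r, c), local_avg(v, r, c)
                ex, ey, et = Ex[r][c], Ey[r][c], Et[r][c]
                # Per-pixel Horn-Schunck update rule.
                k = (ex * ub + ey * vb + et) / (alpha ** 2 + ex ** 2 + ey ** 2)
                u_new[r][c] = ub - ex * k
                v_new[r][c] = vb - ey * k
        u, v = u_new, v_new
    return u, v
```

Five iterations matches the setting reported in the next section; real implementations typically run many more and work on smoothed images.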

Building LMVF vectors
Since (1) assumes that the image plane has a spatial gradient, correct optical-flow vectors cannot be computed at points without a spatial gradient. Therefore, the visual signal is passed through a lowpass filter, and low-level random noise is added to the filtered signal. Optical-flow velocities are calculated from each pair of consecutive images, using five iterations. An example of two consecutive lip images is shown in Figures 6(a) and 6(b). Figure 6(c) shows the corresponding optical-flow analysis result, indicating the lip-image changes from (a) to (b). Next, two LMVFs, the horizontal and vertical variances of the flow-vector components, are calculated for each frame and normalized by the maximum values in each utterance. Since these features indicate whether the speaker's mouth is moving or not, they are especially useful for detecting the onset of speaking periods. Figure 5(b) shows an example of a time function of the normalized vertical variance for the utterance appearing in Section 2.4. The features are almost 0 in pause/silence periods and have large values in speaking periods. Similar to Figure 5(a), Figure 5(b) shows that the speaker's mouth starts moving approximately 300 milliseconds before the sound is acoustically emitted. Time functions of the horizontal variance were found to be similar to those of the vertical variance.
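The per-frame variance features can be sketched as follows, assuming max-based per-utterance normalization as described above; the function name and input layout are illustrative assumptions.

```python
def build_lmvf_vectors(flow_u_frames, flow_v_frames):
    """Build LMVF vectors (normalized horizontal and vertical variances
    of the flow-vector components) for one utterance.

    flow_u_frames / flow_v_frames: per-frame flat lists of the per-pixel
    u (horizontal) and v (vertical) flow components."""
    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    var_h = [variance(f) for f in flow_u_frames]
    var_v = [variance(f) for f in flow_v_frames]
    max_h = max(var_h) or 1.0  # guard against all-zero utterances
    max_v = max(var_v) or 1.0
    return [(h / max_h, v / max_v) for h, v in zip(var_h, var_v)]
```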
Finally, the two-dimensional LMVF vectors consisting of the normalized horizontal and vertical variances of flow-vector components are built.

AUDIO-VISUAL SPEECH RECOGNITION METHOD

Overview

Figure 7 shows our bimodal speech recognition system using side-face images. Both speech and lip images of the side view are synchronously recorded. Audio signals are sampled at 16 kHz with 16-bit resolution. Each speech frame is converted into
38 acoustic parameters: 12 MFCCs, 12 ΔMFCCs, 12 ΔΔMFCCs, Δ log energy, and ΔΔ log energy. The window length is 25 milliseconds. Cepstral mean subtraction (CMS) is applied to each utterance. The acoustic features are computed at a frame rate of 100 frames/s. Visual signals are represented by RGB video captured at a frame rate of 30 frames/s with 720 × 480 pixel resolution. Before computing the feature vectors, the image size is reduced to 180 × 120. To reduce the computational cost of optical-flow analysis, we reduce the frame rate to 15 frames/s and convert the images to grayscale before computing the LMVFs.
In order to cope with the frame-rate differences, the normalized lip-angle values and the LMVFs (the normalized horizontal and vertical variances of optical-flow vector components) are interpolated from 30/15 Hz to 100 Hz by a third-degree spline function. The delta lip-angle values are computed as differences between the interpolated values of adjacent frames. The final visual feature vectors consist of both or either of the two features (LCGFs and LMVFs). When the two features are used jointly, a 42-dimensional audio-visual feature vector is built by combining the acoustic and visual feature vectors for each frame. When either LCGFs or LMVFs are used as visual feature vectors, a 40-dimensional audio-visual feature vector is built.
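The rate conversion can be illustrated with simple linear resampling. Note the paper uses a third-degree (cubic) spline; linear interpolation is used here only to keep the sketch short, and the function name is hypothetical.

```python
def resample(values, src_rate, dst_rate):
    """Resample a uniformly sampled sequence from src_rate (Hz) to
    dst_rate (Hz) by linear interpolation between adjacent samples."""
    duration = (len(values) - 1) / src_rate
    n_out = int(duration * dst_rate) + 1
    out = []
    for i in range(n_out):
        t = i / dst_rate * src_rate         # position in source samples
        k = min(int(t), len(values) - 2)    # left neighbor index
        frac = t - k
        out.append(values[k] * (1 - frac) + values[k + 1] * frac)
    return out
```

For the 30 Hz to 100 Hz case described above, `resample(angles, 30, 100)` would yield the interpolated per-frame values from which the deltas are then computed.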
Triphone HMMs are constructed with the structure of multistream HMMs. In recognition, the probabilistic score b_j(o_av) of generating the audio-visual observation o_av in state j is calculated by

b_j(o_av) = b_j^a(o_a)^{λ_a} · b_j^v(o_v)^{λ_v},

where b_j^a(o_a) is the probability of generating the acoustic observation o_a, and b_j^v(o_v) is the probability of generating the visual observation o_v. λ_a and λ_v are weighting factors for the audio and visual streams, respectively. They are constrained by λ_a + λ_v = 1 (λ_a, λ_v ≥ 0).
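In practice the weighted product is evaluated in the log domain, where it becomes a weighted sum of the stream log-likelihoods. A minimal sketch (hypothetical function name):

```python
import math

def multistream_log_score(log_b_audio, log_b_visual, lambda_a):
    """Log-domain multistream state score: the weighted product
    b(o_av) = b_a(o_a)^lambda_a * b_v(o_v)^lambda_v
    becomes a weighted sum of log-likelihoods, with lambda_v = 1 - lambda_a."""
    lambda_v = 1.0 - lambda_a
    return lambda_a * log_b_audio + lambda_v * log_b_visual
```

Setting `lambda_a = 1.0` recovers the audio-only score, which is why the weight sweep in the experiments below includes the audio-only method as a limiting case.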

Building multistream HMMs
Since audio HMMs are much more reliable than visual HMMs at segmenting feature sequences into phonemes, the audio and visual HMMs are trained separately and then combined using a mixture-tying technique as follows.
(1) The audio triphone HMMs are trained using the 38-dimensional acoustic (audio) feature vectors. Each audio HMM has 3 states, except for the "sp (short pause)" model, which has a single state.
(2) Training utterances are segmented into phonemes by forced alignment using the audio HMMs, and time-aligned triphone labels are obtained.
(3) The visual HMMs are trained for each triphone on the four-dimensional visual feature vectors, using the triphone labels obtained in step (2). Each visual HMM has 3 states, except for the "sp" and "sil (silence)" models, which have a single state.
(4) The audio and visual HMMs are integrated into audio-visual HMMs by tying their mixture components state by state.
Figure 8 shows an example of the integration process. In this example, an audio-visual HMM for the triphone /n-a+n/ is built. The mixtures for the audio-visual HMM "n-a+n,AV" are tied with those of the audio HMM "n-a+n,A" and the visual HMM "n-a+n,V."

EXPERIMENTS

Database
An audio-visual speech database was collected from 38 male speakers in a clean and quiet condition; the signal-to-noise ratio (SNR) was therefore higher than 30 dB. Each speaker uttered 50 sequences of four connected digits in Japanese, with short pauses inserted between the sequences. To avoid contaminating the visual data with visual noise, a gray monotone board was used as the background, and the speakers' side-face images were captured under constant illumination conditions. The speakers' ages ranged from 21 to 30, and two speakers had facial hair. To simulate the situation in which speakers use a mobile device with a small camera installed near the microphone, speech and lip images were recorded by a microphone and a DV camera located approximately 10 cm from each speaker's right cheek. The speakers were requested to keep head movement to a minimum.

Training and recognition
The HMMs were trained using clean audio-visual data, and audio data for testing were contaminated with white noise at four SNR levels: 5, 10, 15, and 20 dB. The total number of states in the audio-visual HMMs was 91. In all the HMMs, the number of mixture components per state was set to two, and each component was modeled by a diagonal-covariance Gaussian distribution. Experiments were conducted using the leave-one-out method: data from one speaker were used for testing, while data from the remaining 37 speakers were used for training. Accordingly, 38 speaker-independent experiments were conducted, and the mean word accuracy was calculated as the measure of recognition performance. The recognition grammar allowed all digits to be connected without restrictions.
Table 1 shows digit recognition accuracies obtained by the audio-only and audio-visual methods under various SNR conditions. Accuracies using only LCGFs or only LMVFs as visual information are also shown in the table for comparison. "LCGF + LMVF" indicates the results using the combined four-dimensional visual feature vectors. The audio and visual stream weights used in the audio-visual methods were optimized a posteriori for each noise condition: multiple experiments were conducted while changing the stream weights, and the weights that maximized the mean accuracy over all 38 speakers were selected. The optimized audio stream weights (λ_a) are shown next to the audio-visual recognition accuracies in the table. Insertion penalties were also optimized for each noise condition. In all SNR conditions, digit accuracies were improved by using LCGFs or LMVFs compared with the audio-only method. Combining LCGFs and LMVFs improved digit accuracies more than using either feature alone, in all SNR conditions. The largest improvement over the baseline (audio-only) results, 10.9% absolute, was obtained in the 5 dB SNR condition.
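The a posteriori weight optimization described above amounts to a grid search over the audio stream weight. A minimal sketch, where `recognize` is a hypothetical callback standing in for the full recognition experiment (it is assumed to return the mean word accuracy over all speakers for a given weight):

```python
def optimize_stream_weight(recognize, weights=None):
    """A posteriori stream-weight search: run the recognizer for each
    candidate audio weight lambda_a and keep the one with the highest
    mean accuracy.  Returns (best_weight, best_accuracy)."""
    if weights is None:
        weights = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    best = max(weights, key=recognize)
    return best, recognize(best)
```

Because the weights are tuned on the test condition, the reported accuracies are an upper bound relative to what an automatic weight-selection method would achieve.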

Comparison of various visual feature vectors
Digit accuracies obtained by the visual-only method using LCGFs, LMVFs, and the combined features "LCGF + LMVF" were 24.0%, 21.9%, and 26.0%, respectively. Figure 9 shows the digit recognition accuracy as a function of the audio stream weight (λ_a) in the 5 dB SNR condition. The horizontal and vertical axes indicate the audio stream weight (λ_a) and the digit recognition accuracy, respectively. The dotted straight line indicates the baseline (audio-only) result, and the other curves indicate the results obtained by the audio-visual methods. For all the visual feature conditions, improvements over the baseline are observed across a wide range of stream weights. The range over which accuracy improves is largest when the combined visual features are used. The relationship between accuracies and stream weights in other SNR conditions was found to be similar to that in the 5 dB SNR condition. This means that the method using the combined visual features is less sensitive to stream-weight variation than the methods using either LCGFs or LMVFs alone.

Combination with audio-HMM adaptation
It is well known that noisy-speech recognition performance can be greatly improved by adapting the audio HMM to noisy speech. To confirm that our audio-visual speech recognition method remains effective even after audio-HMM adaptation, a supplementary experiment was performed. Unsupervised noise adaptation by the MLLR (maximum likelihood linear regression) method [16] was applied to the audio HMM, with the number of regression classes set to 8. The audio-visual HMM was constructed by integrating the adapted audio HMM and the nonadapted visual HMM. Table 2 shows the results when using the adapted audio-visual HMM. Compared with the results of the baseline (audio-only) method in Table 1, accuracies are largely improved by MLLR adaptation. It can also be observed that the visual features further improve the performance. Consequently, the largest improvement over the nonadapted audio-only result, 30.0% (= 58.4% − 28.4%) absolute in the 5 dB SNR condition, was observed when using the adapted audio-visual HMM with the combined features.
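The core of MLLR mean adaptation is a per-regression-class affine transform of the Gaussian means. A minimal sketch of applying one such transform (the name is hypothetical; estimating A and b from adaptation data is the substantial part of MLLR and is not shown):

```python
def mllr_adapt_mean(mu, A, b):
    """Apply an MLLR mean transform: mu' = A @ mu + b, the standard
    affine form of the regression-class transform used in MLLR.

    mu: mean vector (list of floats)
    A:  transform matrix (list of rows)
    b:  bias vector (list of floats)
    """
    return [sum(a_ij * m_j for a_ij, m_j in zip(row, mu)) + b_i
            for row, b_i in zip(A, b)]
```

With 8 regression classes, as above, one (A, b) pair is shared by all Gaussians assigned to each class.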

Performance of onset detection for speaking periods
As another supplementary experiment, we compared audio-visual HMMs and audio HMMs in terms of their onset detection capability for speaking periods in noisy environments. Noise-added utterances and clean utterances were segmented by each of these models using the forced-alignment technique, and the detected boundaries between silence and the beginning of each digit sequence were used to evaluate onset detection performance. The onset detection error (in milliseconds) was measured by averaging the differences between the onset locations detected for noise-added utterances and those detected for the corresponding clean utterances. Table 3 shows the onset detection errors in various SNR conditions. MLLR adaptation was not applied in this experiment. The optimized audio and visual stream weights determined by the experiments in Section 5.3.1 were used. Comparing the results under the audio-only and audio-visual conditions, the LMVFs, which yield significantly smaller detection errors than the audio-only condition, are effective in improving onset detection. Therefore, the recognition error reduction obtained by using the LMVFs can be attributed to precise onset prediction. On the other hand, the LCGFs do not yield significant improvement in onset detection in most SNR conditions. Since the LCGFs nevertheless effectively increase recognition accuracies, they are considered to increase the capacity to discriminate between phonemes. The increased noise robustness of audio-visual speech recognition obtained by combining LCGFs and LMVFs is therefore attributed to the integration of these two different effects.
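The error measure described above can be sketched as a mean absolute deviation between onset times; this is one reading of the averaging described in the text, and the function name is hypothetical.

```python
def mean_onset_error_ms(noisy_onsets, clean_onsets):
    """Mean absolute difference (ms) between onset times detected on
    noise-added utterances and on the corresponding clean utterances.

    noisy_onsets, clean_onsets: parallel lists of onset times in ms.
    """
    diffs = [abs(n - c) for n, c in zip(noisy_onsets, clean_onsets)]
    return sum(diffs) / len(diffs)
```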

Performance comparison of audio-visual speech recognition methods using frontal-face and side-face images
In our previous research on audio-visual speech recognition using frontal-face images [9], LMVFs were used as visual features, and experiments were conducted under conditions similar to those in this paper; Japanese connected-digit speech contaminated with white noise was used for evaluation. Reference [9] reported that the error reduction rates achieved using LMVFs were 9% and 29.5% in the 10 and 20 dB SNR conditions, respectively. Since the error reduction rates achieved using LMVFs from side-face images were 8.8% (5 dB SNR) and 10% (10 dB SNR), the effectiveness of LMVFs obtained from side-face images appears to be lower than that obtained from frontal-face images, although the two cannot be strictly compared because the sets of speakers differed between the experiments. Lucey and Potamianos compared audio-visual speech recognition results using profile and frontal views in their framework [17] and showed that visual features from profile views were less effective than those from frontal views. It is necessary to evaluate the side-face-based and frontal-face-based methods from the human-interface point of view, to clarify how far the ease-of-use advantages of the side-face-based method described in the introduction can compensate for its performance inferiority to frontal-face-based approaches.

CONCLUSIONS
This paper has proposed audio-visual speech recognition methods using lip information extracted from side-face images, focusing on mobile environments. The methods individually or jointly use lip-contour geometric features (LCGFs) and lip-motion velocity features (LMVFs) as visual information. To our knowledge, this paper is the first to propose LCGFs based on the angle between the upper and lower lips for characterizing side-face images. Experimental results for small-vocabulary speech recognition show that noise robustness is increased by combining this information with audio information. The improvement was maintained even when MLLR-based noise adaptation was applied to the audio HMM. Through the analysis of onset detection, it was found that LMVFs are effective for onset prediction and LCGFs are effective for increasing phoneme discrimination capacity. Noise robustness may be further increased by combining these two complementary features.
In this paper, all evaluations were conducted without considering the effects of visual noise. It is necessary to evaluate the effectiveness and robustness of our recognition method on a real-world database containing visual noise. Our previous research on frontal-face images [11] showed that lip-motion features based on optical-flow analysis improved the performance of bimodal speech recognition in actual running cars. The lip-angle extraction method investigated in this paper might be more sensitive to illumination conditions, speaker variation, and visual noise; therefore, this method also needs to be evaluated on a real-world database. Feature normalization techniques, beyond the maximum-based method used in this paper, also need to be investigated in real-world environments. Developing an automatic stream-weight optimization method is another important issue. For frontal images, several weight optimization methods have been proposed [8,18,19]. We have also proposed weight optimization methods and confirmed their effectiveness in experiments using frontal images [20,21]. It is necessary to apply these weight optimization methods to the side-face method and evaluate their effectiveness. Future work also includes (1) evaluating the lip-angle estimation process using manually labeled data, (2) evaluating recognition performance on more general tasks, and (3) improving the combination method for LCGFs and LMVFs.