Performance Enhancement of Automatic Speech Recognition (ASR) Using Robust Wavelet-Based Feature Extraction Techniques

In this era of smart applications, Automatic Speech Recognition (ASR) has established itself as an emerging technology whose popularity grows day by day. However, the accuracy and reliability of ASR systems are restricted by acoustic conditions such as background noise and channel noise. This lack of robustness in complex auditory scenes leaves a considerable gap in human-machine communication. The objective of this thesis is to enhance the robustness of the system in complex auditory environments by developing new front-end acoustic feature extraction techniques; the pros and cons of the different techniques are also highlighted.

In recent years, wavelet-based acoustic features have become popular for speech recognition applications. The wavelet transform is an excellent tool for time-frequency analysis with good signal-denoising properties. New auditory-based Wavelet Packet (WP) features are proposed to enhance system performance across different types of noisy conditions. The proposed technique is designed and developed so that it mimics the frequency response of the human ear according to the Equivalent Rectangular Bandwidth (ERB) scale. In subsequent chapters, further developments of the proposed technique are discussed using Sub-band based Periodicity and Aperiodicity Decomposition (SPADE) and harmonic analysis. The TIMIT (English) and CSIR-TIFR (Hindi) phoneme recognition tasks are carried out to evaluate the performance of the proposed techniques, and the simulation results demonstrate their potential to enhance system accuracy over a wide range of SNRs.

Further, the visual modality plays a vital role in computer vision systems when the acoustic modality is disturbed by background noise. However, most systems rarely address the visual-domain problems that must be solved for them to work in real-world conditions.
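To make the ERB-scale design concrete, the standard Glasberg and Moore approximation of the auditory filter bandwidth can be sketched as follows. This is a minimal illustration of the ERB formula itself, not the thesis's wavelet-packet tree design; the function names are chosen here for illustration.

```python
import math

def erb(f_hz):
    # Glasberg & Moore approximation of the Equivalent Rectangular
    # Bandwidth (in Hz) of the human auditory filter centred at f_hz.
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def hz_to_erb_rate(f_hz):
    # ERB-rate scale: approximate number of ERBs below frequency f_hz.
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)

# At 1 kHz the auditory filter is roughly 133 Hz wide, which is the
# kind of frequency resolution an ERB-matched wavelet-packet
# decomposition would try to approximate with its sub-band tree.
```

A wavelet-packet front end would choose its tree depth per frequency region so that the resulting sub-band widths track `erb(f)` across the speech band.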
A multiple-camera protocol gives the system more flexibility by allowing speakers to move freely. In the last chapter, consideration is given to Audio-Visual Speech Recognition (AVSR) in vehicular environments, which resulted in one novel contribution: a one-way Analysis of Variance (ANOVA)-based camera fusion strategy. Multiple-camera fusion is an imperative part of multi-camera computer vision applications. The ANOVA-based approach is proposed to study the relative contribution of each camera in in-vehicle AVSR experiments. A four-camera automotive audio-visual corpus is used to investigate the performance of the proposed technique.

Speech is a primary medium of communication for humans, and various speech-based applications can work reliably only by improving ASR performance across different environments. In the modern era, there is vast potential for using speech effectively as a communication medium between human and machine. Robust and reliable speech technology enables people to experience the full benefits of Information and Communication Technology (ICT).
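As a sketch of the statistic underlying such a camera-fusion analysis, the one-way ANOVA F-ratio compares between-camera variance to within-camera variance of a performance score. The grouping of scores by camera below is hypothetical; the thesis's exact fusion rule may differ.

```python
def one_way_anova_f(groups):
    # One-way ANOVA F-statistic for a list of groups, e.g. one list of
    # recognition scores per camera. F >> 1 suggests the cameras
    # contribute significantly differently to AVSR performance.
    k = len(groups)                      # number of groups (cameras)
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares: spread of camera means around grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    # Within-group sum of squares: spread of scores around each camera mean.
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
                    for g in groups)
    ms_between = ss_between / (k - 1)    # mean square between groups
    ms_within = ss_within / (n - k)      # mean square within groups
    return ms_between / ms_within

# Hypothetical per-camera score lists (not from the thesis corpus):
# f = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])  # F = 13.0
```

The resulting F-ratio (compared against the F-distribution with k-1 and n-k degrees of freedom) indicates whether any camera's contribution differs significantly, which can then guide how streams are weighted in fusion.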