Speech Recognition: Fundamentals and Applications
About this ebook

What Is Speech Recognition


Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies enabling computers to recognize spoken language and translate it into text. It is also known as automatic speech recognition (ASR), computer speech recognition (CSR), and speech to text (STT). It draws on knowledge and research from computer science, linguistics, and computer engineering. The reverse process is speech synthesis.


How You Will Benefit


(I) Insights and validations about the following topics:


Chapter 1: Speech recognition


Chapter 2: Computational linguistics


Chapter 3: Natural language processing


Chapter 4: Speech processing


Chapter 5: Pattern recognition


Chapter 6: Language model


Chapter 7: Deep learning


Chapter 8: Recurrent neural network


Chapter 9: Long short-term memory


Chapter 10: Voice computing


(II) Answers to the public's top questions about speech recognition.


(III) Real-world examples of the use of speech recognition in many fields.


(IV) 17 appendices that briefly explain 266 emerging technologies in each industry, for a 360-degree understanding of speech recognition technologies.


Who This Book Is For


Professionals, undergraduate and graduate students, enthusiasts, hobbyists, and anyone who wants to go beyond basic knowledge of speech recognition.

Language: English
Release date: Jul 5, 2023


    Book preview

    Speech Recognition - Fouad Sabry

    Chapter 1: Speech recognition

    Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies enabling computers to recognize spoken language and translate it into text, with the key benefit that the resulting text can then be searched. It is also known as automatic speech recognition (ASR), computer speech recognition (CSR), and speech to text (STT). It draws on knowledge and research from computer science, linguistics, and computer engineering. The reverse process is speech synthesis.

    Some speech recognition systems require training (also called enrollment), in which an individual speaker reads text or isolated vocabulary into the system. The system analyzes the person's specific voice and uses it to fine-tune the recognition of that person's speech, improving accuracy. Systems that do not use training are called speaker-independent; systems that require training are called speaker-dependent.

    Speech recognition applications include voice user interfaces such as voice dialing (for example, "call home"), call routing (for example, "I would like to make a collect call"), domotic appliance control, keyword search (for example, finding a podcast where particular words were spoken), simple data entry (for example, entering a credit card number), preparation of structured documents (for example, a radiology report), determining speaker characteristics, and speech-to-text processing (for example, word processors), usually termed direct voice input.

    Voice recognition is concerned more with identifying who is speaking than with understanding what is being said. Recognizing the speaker can simplify the task of translating speech in systems that have been trained on a specific person's voice, or it can be used to authenticate or verify a speaker's identity as part of a security process for protecting sensitive information.

    Speech recognition has a long history marked by several waves of major technological advances. Most recently, the field has benefited from advances in deep learning and big data. These advances are evidenced not only by the surge of academic papers published in the field, but more importantly by the worldwide industry adoption of a variety of deep learning methods in designing and deploying speech recognition systems.

    The key areas of improvement were vocabulary size, speaker independence, and processing speed.

    In 1952, three Bell Labs researchers, Stephen Balashek, R. Biddulph, and K. H. Davis, built a system for single-speaker digit recognition. In 1960, Gunnar Fant developed and published the source-filter model of speech production.

    At the 1962 World's Fair, IBM demonstrated its Shoebox system, which could recognize up to 16 spoken words.

    In 1966, while working on speech recognition, Fumitada Itakura of Nagoya University and Shuzo Saito of Nippon Telegraph and Telephone (NTT) conceived the linear predictive coding (LPC) method of speech coding; a simplified sketch follows below.
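
    As a rough illustration of what LPC computes (not Itakura and Saito's original formulation), here is a minimal Python sketch that fits an all-pole predictor to a frame of samples using the autocorrelation (Yule-Walker) method; the synthetic frame and the model order are assumptions made up for the example.

    import numpy as np

    def lpc_coefficients(frame, order=4):
        # Autocorrelation of the frame; index k is lag k.
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        # Yule-Walker normal equations: R a = r[1..order], where R is the
        # Toeplitz matrix of autocorrelations r[|i - j|].
        R = np.array([[r[abs(i - j)] for j in range(order)]
                      for i in range(order)])
        return np.linalg.solve(R, r[1:order + 1])

    # Hypothetical example: a synthetic vowel-like frame. Each sample is
    # modeled as a weighted sum of the previous `order` samples.
    t = np.arange(400)
    frame = np.sin(2 * np.pi * 0.01 * t) + 0.5 * np.sin(2 * np.pi * 0.03 * t)
    print(lpc_coefficients(frame, order=4))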

    In 1969, the influential John Pierce wrote an open letter critical of speech recognition research, and as a result funding for such research at Bell Labs dried up for years. The funding lapse lasted until Pierce left the company and James L. Flanagan took over.

    Raj Reddy was the first person to take on continuous speech recognition, as a graduate student at Stanford University in the late 1960s. Previous systems required users to pause after each word. Reddy's system accepted spoken commands for playing chess.

    Around this time, researchers from the Soviet Union invented the dynamic time warping (DTW) algorithm and used it to build a recognizer capable of operating on a 200-word vocabulary. DTW processed speech by dividing it into short frames, each lasting about 10 milliseconds, and treating each frame as a single unit; a simplified sketch of the idea follows below. Although DTW would eventually be superseded by more advanced algorithms, the technique lived on. Achieving speaker independence remained an unsolved problem at this time.
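
    Here is a minimal sketch of the DTW idea in Python, assuming per-frame feature vectors (e.g. short-time spectra) as input; the distance measure and the template-matching usage are illustrative assumptions, not the Soviet group's original implementation.

    import numpy as np

    def dtw_cost(a, b):
        # a, b: arrays of shape (frames, features), one row per ~10 ms frame.
        n, m = len(a), len(b)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = np.linalg.norm(a[i - 1] - b[j - 1])  # frame-to-frame distance
                # Best alignment so far: match, insertion, or deletion.
                cost[i, j] = d + min(cost[i - 1, j - 1],
                                     cost[i - 1, j],
                                     cost[i, j - 1])
        return cost[n, m]

    # An isolated-word recognizer in this style keeps one stored template
    # per vocabulary word and picks the template with the lowest DTW cost.
    def recognize(utterance, templates):
        return min(templates, key=lambda w: dtw_cost(utterance, templates[w]))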

    In 1971, DARPA funded five years of speech understanding research, seeking a minimum vocabulary size of 1,000 words. It was believed that speech understanding would be the key to making progress in speech recognition, but this later proved untrue. Nevertheless, the program revived speech recognition research after John Pierce's letter had curtailed it.

    In 1972, the IEEE Acoustics, Speech, and Signal Processing group held a conference in Newton, Massachusetts.

    Since its inception in 1976, the International Conference on Acoustics, Speech, and Signal Processing (ICASSP) has been the preeminent forum for the presentation and publication of speech recognition research. During this period, hidden Markov models (HMMs) allowed researchers to combine different sources of knowledge, such as acoustics, language, and syntax, in a unified probabilistic model; a toy decoding sketch follows below.
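
    To make the idea of a unified probabilistic model concrete, here is a toy Viterbi decoding sketch for a two-state HMM in Python; the states, probabilities, and observation symbols are invented for illustration and are not from the book.

    import numpy as np

    # Toy HMM: hidden phone-like states emit discrete acoustic symbols.
    # All numbers below are invented for illustration.
    states = ["s1", "s2"]
    start_p = np.log([0.6, 0.4])          # initial state probabilities
    trans_p = np.log([[0.7, 0.3],         # state transition probabilities
                      [0.4, 0.6]])
    emit_p = np.log([[0.5, 0.4, 0.1],     # emission probabilities per symbol
                     [0.1, 0.3, 0.6]])

    def viterbi(obs):
        # v[t, j]: log-probability of the best path ending in state j at time t.
        T, N = len(obs), len(states)
        v = np.full((T, N), -np.inf)
        back = np.zeros((T, N), dtype=int)
        v[0] = start_p + emit_p[:, obs[0]]
        for t in range(1, T):
            for j in range(N):
                scores = v[t - 1] + trans_p[:, j]
                back[t, j] = int(np.argmax(scores))
                v[t, j] = scores[back[t, j]] + emit_p[j, obs[t]]
        # Trace the best path backwards from the most likely final state.
        path = [int(np.argmax(v[-1]))]
        for t in range(T - 1, 0, -1):
            path.append(back[t, path[-1]])
        return [states[i] for i in reversed(path)]

    print(viterbi([0, 1, 2]))  # -> ['s1', 's1', 's2']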

    In the mid-1980s, Fred Jelinek's team at IBM built a voice-activated typewriter named Tangora, which could handle a vocabulary of 20,000 words.

    In addition, the n-gram language model was developed and put into use throughout the 1980s.

    1987 saw the introduction of the back-off model, which allowed language models to use n-grams of multiple lengths; a simplified scoring sketch follows below. At the same time, CSELT began using HMMs to recognize languages (both in software and in specialized hardware processors, e.g. RIPAC).
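
    As a simplified illustration of the back-off idea (using the later "stupid backoff" style of scoring rather than the original 1987 formulation), here is a Python sketch of an n-gram scorer that falls back to shorter contexts when a longer n-gram is unseen; the toy corpus and the discount factor are assumptions for the example.

    from collections import Counter

    def ngram_counts(tokens, max_n=3):
        # Count all n-grams up to length max_n.
        counts = Counter()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
        return counts

    def backoff_score(counts, context, word, total, alpha=0.4):
        # Score `word` given a tuple of context words, backing off to
        # shorter contexts (discounted by alpha) when the n-gram is unseen.
        if not context:
            return counts[(word,)] / total            # unigram base case
        ngram = context + (word,)
        if counts[ngram] > 0:
            return counts[ngram] / counts[context]
        return alpha * backoff_score(counts, context[1:], word, total, alpha)

    tokens = "the cat sat on the mat".split()
    counts = ngram_counts(tokens)
    print(backoff_score(counts, ("the",), "cat", len(tokens)))  # seen bigram: 0.5
    print(backoff_score(counts, ("mat",), "cat", len(tokens)))  # backs off to the unigram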

    Much of the progress in this area owes to the rapidly growing capabilities of computers. When the DARPA program ended in 1976, the best computer available to researchers was the PDP-10 with 4 MB of RAM.

    Two practical products also emerged:

    1984 – the Apricot Portable, which supported up to 4,096 words, of which only 64 could be held in RAM at any one time.

    1987 – a recognizer from Kurzweil Applied Intelligence

    1990 – Dragon Dictate, a consumer product. Xuedong Huang, a former student of Raj Reddy, developed the Sphinx-II system at CMU. Sphinx-II was the first system to achieve speaker-independent, large-vocabulary, continuous speech recognition, and it had the best performance in DARPA's 1992 evaluation. Handling continuous speech with a large vocabulary was a major milestone in the history of speech recognition. Huang went on to found the speech recognition group at Microsoft in 1993. Kai-Fu Lee, another student of Raj Reddy, went to work at Apple, where in 1992 he helped develop a speech interface prototype for the Apple computer known as Casper.

    Lernout & Hauspie, a Belgium-based speech recognition company, acquired several other companies over the years, including Kurzweil Applied Intelligence in 1997 and Dragon Systems in 2000. The L&H speech technology was used in the Windows XP operating system. L&H was an industry leader until an accounting scandal brought the company to an end in 2001. Its speech technology was bought by ScanSoft, which renamed itself Nuance in 2005. Apple originally licensed software from Nuance to provide speech recognition capability to its digital assistant Siri.

    In the 2000s, DARPA sponsored two speech recognition programs: Effective Affordable Reusable Speech-to-Text (EARS) in 2002 and Global Autonomous Language Exploitation (GALE). Four teams took part in the EARS program: IBM, a team led by BBN with LIMSI and the University of Pittsburgh, Cambridge University, and
