An Urdu speech recognition program has become more likely now that the most fundamental tool needed for it has been developed at Lahore’s Information Technology University.
Linguistic technology expert Dr. Agha Ali Raza and his team at ITU’s Center for Speech and Language Technologies (CSaLT) laboratory have released for public use a corpus of Urdu sentences that covers all possible distinct sounds, called phonemes by linguists, used in everyday speech. This corpus, comprising 708 sentences that cover all 63 phonemes, will soon be available for download at the CSaLT website.
Those interested in developing Urdu speech recognition software will now have access to the most basic ingredient needed for the purpose. They will just need a repository of words used in everyday speech to proceed with developing the application, says Dr. Raza.
“Speech recognition is a two-step process. The corpus will give the computer application access to all possible phonemes used in formation of meaningful Urdu words from everyday speech,” he says. Though there are 63 distinct phonemes in Urdu, in everyday speech these don’t correspond to just 63 distinct sounds. Dr. Raza explains that the sound made for a phoneme may vary from one utterance to another depending on the phonemes used before and after it in a word. Thus, he says, for every phoneme x there are 63 × 63 possible (tri-phoneme) sounds, one for each pairing of a preceding and a following phoneme. The corpus of sentences covers all these possible sounds.
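The tri-phoneme arithmetic can be illustrated with a short Python sketch. It is not the CSaLT team’s code: the function names, the toy three-symbol inventory and the sample sequences are assumptions made purely for illustration, and the sketch ignores word boundaries, which the article does not describe in detail.

```python
NUM_PHONEMES = 63  # phoneme count reported in the article


def context_count(num_phonemes):
    """Number of (preceding, following) pairings around a single phoneme."""
    return num_phonemes * num_phonemes


def triphones(phonemes):
    """Set of (previous, current, next) triples found in one phoneme sequence."""
    return {tuple(phonemes[i - 1:i + 2]) for i in range(1, len(phonemes) - 1)}


def coverage(sentences, inventory):
    """Return (triples seen in the sentences, triples theoretically possible)."""
    seen = set()
    for sentence in sentences:
        seen |= triphones(sentence)
    return len(seen), len(inventory) ** 3


if __name__ == "__main__":
    # Contexts for one phoneme x when the inventory has 63 phonemes.
    print(context_count(NUM_PHONEMES))

    # Hypothetical toy inventory and sentences, not real Urdu phonemes.
    toy_inventory = ["a", "b", "c"]
    toy_sentences = [["a", "b", "c"], ["b", "c", "a", "b"]]
    print(coverage(toy_sentences, toy_inventory))
```

A check of this kind, run over phoneme transcriptions of candidate sentences, is one plausible way to see how many contexts a sentence list still leaves uncovered. Whether every one of the theoretical combinations actually occurs in Urdu is not stated in the article.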
In the first step, words from the corpus will allow the application to train itself in the sounds of various Urdu words. The separate repository of words will come into play in the second stage, allowing the application to choose the most appropriate words for the output sentences. “This will enhance accuracy of the software,” Dr. Raza says.
Thus, the accuracy of speech recognition software depends on the written or oral sources from which words and sentences are generated for the corpus, and on the repository maintained separately for ruling out meaningless words.
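The second stage can be pictured with a minimal Python sketch. It assumes, hypothetically, that the first stage returns candidate words with a match score where higher is better; the function name `pick_word`, the scores and the toy strings are illustrative and do not come from the CSaLT software.

```python
def pick_word(candidates, repository):
    """Return the best-scoring candidate that appears in the word repository.

    candidates: list of (word, score) pairs from a hypothetical first stage.
    repository: set of words known to be used in everyday speech.
    Returns None when no candidate is a known word.
    """
    known = [(word, score) for word, score in candidates if word in repository]
    if not known:
        return None
    return max(known, key=lambda pair: pair[1])[0]


if __name__ == "__main__":
    # Toy data: the top-scoring candidate is not a known word, so it is ruled out.
    repository = {"kitab", "kitna"}
    candidates = [("kitap", 0.9), ("kitab", 0.8), ("kitna", 0.3)]
    print(pick_word(candidates, repository))
```

Here the repository plays the role Dr. Raza describes: it rules out a sound sequence that scores well but is not a meaningful word.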
Dr. Raza and his team have relied upon written material obtained from newspaper and magazine articles in Urdu to generate this corpus. In another corpus under development at the CSaLT lab, he is using oral material recorded with permission from several thousand telephonic conversations. “This will provide us with a more comprehensive corpus of words used in everyday speech,” he says. He expects that work on this bigger project will be completed this year.
One way speech recognition software available for international languages like English keeps enhancing its accuracy is by adding words spoken by users to the repository for future use.
Dr. Raza says he started work on the corpus under the supervision of Dr. Sarmad Hussain as part of his master’s thesis at the Lahore-based National University of Computer and Emerging Sciences (FAST). He then proceeded to Carnegie Mellon University, where he completed his doctorate (PhD) in Language Technologies. He and Dr. Hussain were assisted by a colleague, Huda Sarfaraz, and two linguists, Inamullah and Zahid Sarfaraz, in compiling the list of 708 sentences for the corpus.
“We hope that release of this corpus will also prove beneficial for regional languages in the country and languages lacking ample linguistic resources all over the world. Those interested in working on those languages can follow our technique to develop similar corpora of sentences in those languages,” he says. “The technique used in development of this corpus will work for any language for which written material is available.”
The other corpus under development at the lab will pave the way for working on regional languages with little or no written material. “Recorded speech in those languages will be sufficient to provide material for speech recognition,” he says.
Linguist Tariq Rahman welcomes the development and says that such initiatives for the promotion and preservation of the country’s languages have long been overdue. “These resources have the potential of raising the profile of a language. The closer a language is to domains of power the higher the probability of its survival,” he says. He adds that people tend to prioritize learning languages that offer material rewards like better employment opportunities. He hopes similar work will soon be started for ‘other national languages of the country’.
Rahman highlights that there may be a need for maintaining multiple databases of words and sentences to account for different varieties of Urdu spoken in the country. There are subtle differences in Urdu in different regions of the country. Urdu of Punjab is different from that of Sind and both have features that distinguish them from its variety in other regions like Khyber-Pakhtunkhwa, he adds.
Dr. Raza says this variety has been taken into consideration in the compilation of the database of oral communications. He says the software being used to collect data has been made available all over the country. Initially, users were concentrated in Punjab, but the team has now reached out to all regions of the country, he adds.
Dr. Raza says speech recognition programmes facilitate access to smartphone- and internet-based services. Those unable to use these services because of a disability such as visual impairment, amputation or paralysis, or because of illiteracy, can do so with the help of a speech recognition feature, he says. It also allows the use of smartphones and the internet in situations where it may otherwise be inconvenient, like when you’re driving a car. “Now, all of this can be done in our local languages as well,” he says.
The corpus is available for download at: http://csalt.itu.edu.pk/