UDC 81'42=162.1

V. S. Rogozhina
Postgraduate Student, Department of Applied and Experimental Linguistics, Institute of Applied and Mathematical Linguistics, Faculty of the Humanities and Applied Sciences, MSLU; e-mail: mslu.italiano@gmail.com

PHONETIC DATABASE OF SPOKEN DISCOURSE (REGARDING POLISH SPEECH)1

The development of speech corpora and acoustic-phonetic databases is indispensable for any research and development work on spoken language systems. This paper describes the main areas of application of speech corpora and demonstrates the process of creating a Polish speech database. For this purpose, 40 female and 40 male native speakers of Polish were recorded. The 21 hours of direct face-to-face conversations, interviews and discussions, as well as audiobooks and audio tapes of Polish textbooks, were analyzed, segmented and transcribed. Two types of analysis were carried out during the research: acoustic and perceptual. As a result, an annotated speech database was created, along with a set of transcription rules.

Key words: speech corpora; acoustic-phonetic databases; spoken language systems; Polish speech database; annotated speech database; database management system.

V. S. Rogozhina
Postgraduate student, Department of Applied and Experimental Linguistics, Institute of Applied and Mathematical Linguistics, Faculty of the Humanities and Applied Sciences, MSLU

PHONETIC DATABASE OF SPOKEN DISCOURSE (the case of Polish)

Modern speech technologies for automatic speech recognition and synthesis are inconceivable without corpus linguistics and speech databases. This study addresses a problem of modern corpus linguistics, taking the creation of a Polish speech database as an example. For this purpose, the voices of 43 men and 40 women were recorded, all native speakers of, for the most part, the standard variety of Polish. In total, 21 hours of monologues, dialogues, audiobooks, and spontaneous and quasi-spontaneous speech were analyzed, segmented and transcribed. Two types of analysis were carried out during the research: acoustic and perceptual. As a result, an annotated speech database and a set of transcription rules were created.

Key words: corpus linguistics; speech database; language processing systems; Polish speech database; annotated database; database management systems.

1 The research was carried out under Assignment No. 34.1254.2014К of the Ministry of Education and Science of Russia. Academic supervisor: R. K. Potapova.

Vestnik MGLU. Issue 13 (699) / 2014

Spoken language is central to human communication and has significant links to both national identity and individual existence (http://www.ldc.upenn.edu/annotation/). In speech and language technology, which includes speech synthesis and recognition, speaker identification, language identification and message understanding, the common basic need is speech corpora (http://cslu.cse.ogi.edu/HLTsurvey/ch12node5).

Speech is produced differently by each speaker. Each utterance is produced by a unique vocal tract, which leaves its traces on the signal (http://www.ldc.upenn.edu/annotation/). Speech corpora vary in features such as recording conditions, environments, speaker age groups, recording media, sampling rates, data collection protocols, annotation levels and tags. For this reason, a general-purpose speech corpus aims to cover as much variability as possible so that it can be used in a range of applications [1; 2].

The creation of large speech databases is one of the important conditions for solving the problems of speech recognition and speech synthesis [3]. This issue has been examined and extensively covered by R. K. Potapova and the Department of Applied and Experimental Linguistics of Moscow State Linguistic University. In the article “The Main Tendencies of Multilingual Corpus Linguistics” [1], R. K. Potapova describes the stages of creating speech databases for French and Arabic.
She also points out the challenges that one may face while developing a speech database.

As for Polish, no audio corpus of acceptable quality existed until 1998. In 1998, the Polish SpeechDat(E) database was created within the SpeechDat(E) project, one of a series of European projects aimed at creating large telephone speech databases. It contains recordings of 1,000 Polish speakers (488 male, 512 female) made over the Polish fixed telephone network [4]. This database, however, is designed primarily to meet the needs of telecommunication services.

The main aim of the research is to create a speech database suitable for speech recognition and speaker identification. Accordingly, the immediate goal of the present paper is confined to giving an overall description of the main stages in the development of a Polish speech database.

The creation of a speech database falls into the following stages [5]:
• accumulating enough audio material
• analyzing the recordings obtained
• segmenting the recordings into chunks
• orthographic transcription
• developing transcription rules for the language in question
• phonemic transcription of each segment using the transcription rules
• saving all files on a data carrier (CD-RW or DVD+R)

In the research carried out by the Department of Applied Linguistics of Moscow State Linguistic University, 21 hours of recordings were obtained in a variety of ways. Eleven hours are direct face-to-face conversations, interviews and discussions taken from the Polish broadcasting website www.polskieradio.pl; the remainder consists of audiobooks and audio tapes of Polish textbooks. To make segmentation easier, a unique ID was assigned to each speaker. Table 1 gives information on the recordings made by female speakers (speaker, source, ID, duration).
Table 1 “Female speakers”

Speaker                Source                                                       ID   Duration (min)
Blanca Kutyłowska      Audiobook: Opowiedzcie, jak tam żyjecie                      F1   9
Barbara Utlinska       Audiobook: Pod sloncem Toskanii                              F2   8
Hanna Kaminska         Audiobook: Bella Toskania                                    F3   7
Katarzyna Maternowska  Audiobook: Weisberger Lauren - Diabeł ubiera się u Prady     F4   9
Elzbieta Kijowska      Audiobook: Kossak Zofia - Bursztyny                          F5   7
Klaudia Binkowska      Audiobook: Zapolska Gabriela - Z pamietnikow mlodej mezatki  F6   9

The research involves 40 female and 40 male native speakers. All recordings are digitized. Recording is done in 16-bit PCM (*.wav) mono with a sampling frequency of 16 kHz. A verbatim transcript is made of every recording. To facilitate transcription, the interactive signal processing tool PRAAT1 was used. PRAAT makes it possible to visualize the speech signal while creating and viewing an orthographic transcription. During the transcription process, the audio files were segmented using Adobe Audition 1.5 and Sound Forge 7.0 by inserting time markers in unfilled pauses between words. At a later stage these markers are used as anchor points for the automatic alignment of the transcript and the speech file. For the broad phonetic transcription of the data, the SAMPA1 set was used.

1 For more information on PRAAT see http://www.fon.hum.uva.nl/praat/

The speech database of Polish made for the “Foresight” project is being developed with two major objectives: first, to support fundamental research into acoustic-phonetic, lexical, semantic and syntactic phenomena in the language; and second, to capture the variability that arises from differences among speakers, sex, speaking environments, recording conditions, transmission channels, etc., which is essential for solving the tasks of speech recognition and speaker identification.
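The recording format stated above (16-bit PCM, mono, 16 kHz) lends itself to an automated check before segmentation. The following is a minimal sketch in Python, not the procedure used in the study. It relies only on the standard-library wave module, and the function names describe_wav and matches_target_format are hypothetical, introduced here purely for illustration.

```python
import wave

# Target format taken from the text: 16-bit PCM, mono, 16 kHz.
EXPECTED_CHANNELS = 1
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits per sample
EXPECTED_SAMPLE_RATE_HZ = 16000


def describe_wav(path):
    """Return (channels, sample width in bytes, sample rate in Hz) of a WAV file."""
    with wave.open(path, "rb") as wav_file:
        return (
            wav_file.getnchannels(),
            wav_file.getsampwidth(),
            wav_file.getframerate(),
        )


def matches_target_format(path):
    """Return True if the file is mono, 16-bit, 16 kHz; False otherwise."""
    return describe_wav(path) == (
        EXPECTED_CHANNELS,
        EXPECTED_SAMPLE_WIDTH_BYTES,
        EXPECTED_SAMPLE_RATE_HZ,
    )
```

Run over every file in a speaker's folder, a check of this kind would flag recordings that still need resampling or conversion to mono before they are segmented and transcribed.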
The created Polish speech database includes 21 hours of direct face-to-face conversations, interviews and discussions, as well as audiobooks in Polish. After acoustic and perceptual analysis, the audio files were segmented into 2256 relatively short chunks (of approximately 20 to 30 seconds each). Figure 1 shows a segmented audio file that was afterwards cut into chunks.

Figure 1. An example of a segmented audio file (f7)

The speech database consists of separate folders named after the speaker's ID (e.g. f1, f2, m1, m2, etc.). Each prompt utterance is stored in a separate file. Each folder contains all the chunks of the audio files made by one speaker, as well as an orthographic text with a phonemic transcription in SAMPA, saved in *.txt format.

1 SAMPA is an ASCII encoding system for various languages, including Dutch, based on the International Phonetic Alphabet (IPA).

The created database comprises 83 Polish speakers (40 female, 43 male). The whole speech database is split across 2 CDs, each of which comprises 40 speaker sessions. Along with the database, a number of phonetic rules were developed (Appendix 1). Microsoft Office Access 2007 was chosen as the database management system (DBMS) (Figure 2).

Figure 2. An example of the DBMS

The creation of a high-quality speech corpus is a rather complicated technological task. To solve the problems of speech corpus development, special coordinating centers have been set up for recording, storing, distributing and creating public, standardized language resources, including speech resources (http://cslu.cse.ogi.edu/HLTsurvey/ch12node5). Among them are:
– LDC (Linguistic Data Consortium, http://www.ldc.upenn.edu)
– CSLU (Center for Spoken Language Understanding, Oregon Graduate Institute, http://www.CSLU.ogi.edu)
– ELRA (European Language Resources Association, http://www.elra.info).

Although the collection of speech corpora offered by these centers grows every year, only three Polish speech databases are worth mentioning: the Polish SpeechDat(E) database, PELCRA (Polish and English Language Corpora for Research and Applications), and Korpus Języka Polskiego Wydawnictwa Naukowego PWN. These corpora contain nearly 2 million words, but they cannot be used without purchasing them. Only trial versions are available on the Internet. To use the full version one has to be a member of the relevant coordinating center, which is also costly. That is why it is important for the development of corpus linguistics in Russia to create its own collection of speech databases for different languages. A speech database of Polish could make a valuable contribution to this collection.

Speech corpora can be classified by their characteristics and the purpose for which they were created: task-specific corpora, general-purpose corpora, lexical, morphological, syntactic and semantic corpora, acoustic-phonetic databases, databases of suprasegmental features, databases of source and tract parameters, etc. The speech corpus created here is intended for general-purpose use.

Although extensive work has been carried out, much remains to be done. Recordings need to be made in a greater number of environments and with a variety of devices, for example over the telephone, on mobile phones, in hands-free settings, in different types of transport such as private cars and public vehicles, and in regions with differing geographical conditions. In addition to recordings in different conditions, a studio recording by a professional male or female speaker must be made. The question of which type of database management system (DBMS) to use remains open.

REFERENCES
1. Potapova R. K. The Main Tendencies of Multilingual Corpus Linguistics // Rechevye Tekhnologii [Speech Technology]. – 2009. – No. 2. – P. 92–112. (In Russian.)
2. Potapova R. K. The Significance of Polylingualism in Applied Speech Science // Current Issues in Applied Speech Science. – M. : IPK MGLU “Rema”, 2010. – P. 9–19. – (Vestnik of Moscow State Linguistic University; issue 592. Linguistics series). (In Russian.)
3. Church K. W., Mercer R. L. Special Issue on Computational Linguistics Using Large Corpora. – Vol. 19(1, 2). MIT Press, 1993.
4. Kibkalo A. A., Lotkov M. M. Choice of Phonetic Alphabet for Russian LVCSR System // Proceedings of the International Workshop “Speech and Computer” SPECOM’2003, Moscow, 27–29 October, 2003. – M. : MSLU, 2003. – P. 102–105.
5. Potapova R. Acoustic-phonetic Speech Databases for Speech / Speaker Recognition Systems / R. Potapova, N. Bobrov, E. Bogacheva, D. Dorofeev, A. Kozlova // Proceedings of the International Workshop “Speech and Computer” SPECOM’2005, Patras, Greece, 17–19 October, 2005. – Patras–M. : MSLU, 2005. – 590 p.