Вопросно-ответные системы Актуальные проблемы прикладной лингвистики, Е.Ю. Калинина, МГУ, 2007-2008 Что такое воспросно-ответные системы? • Q&A: • Inputs: a question in English; a set of text and database resources • Output: a set of possible answers drawn from the resources • When is the next train to Dublin? --> Q&A System --> Text Corpora & RelDBases --> 8.35, track 9 Основные компоненты вопросноответной системы Generic architecture • • Query processing • • Paragraph indexing • • Answer resolution • • Open-domain vs. domain-specific systems Given a NL query, extract its possible answer(s) from real-world NL text documents. Типы информационных потребностей • Ad hoc retrieval: find me documents “like this” Identify positive accomplishments of the Hubble telescope since it was launched in 1991. Compile a list of mammals that are considered to be endangered, identify their habitat and, if possible, specify what threatens them. Типы информационных потребностей • Question answering “Factoid” “List” “Definition” Who discovered Oxygen? When did Hawaii become a state? Where is Ayer’s Rock located? What team won the World Series in 1992? What countries export oil? Name U.S. cities that have a “Shubert” theater. Who is Aaron Copland? What is a quasar? Предшественники вопросноответных систем • Information Retrieval – Retrieve relevant documents from a set of keywords; search engines • Information Extraction – Template filling from text (e.g. event detection); e.g. TIPSTER, MUC • Relational QA – Translate question to relational DB query; e.g. LUNAR, FRED Information Extraction (IE) IE systems: • Identify documents of a specific type • Extract information according to predefined templates • Place the information into frame-like database records Weather disaster: Type Date Location Damage Deaths ... Information Extraction (IE) • Templates = pre-defined questions • Extracted information = answers Limitations • Templates are domain dependent and not easily portable • One size does not fit all! Typical TREC QA Pipeline “A simple factoid question” Questi on Query Extract Keywords Answe rs Docs Search Engine Passage Extractor Answer Selector Answe r Corpu s “A 50-byte passage likely to contain the desired answer” (TREC QA track) Основная идея Factoid Q&A – Determine the semantic type of the expected answer “Who won the Nobel Peace Prize in 1991?” is looking for a PERSON – Retrieve documents that have keywords from the question Retrieve documents that have the keywords “won”, “Nobel Peace Prize”, and “1991” – Look for named-entities of the proper type near keywords Look for a PERSON near the keywords “won”, “Nobel Peace Prize”, and “1991” Пример Who won the Nobel Peace Prize in 1991? But many foreign investors remain sceptical, and western governments are withholding aid because of the Slorc's dismal human rights record and the continued detention of Ms Aung San Suu Kyi, the opposition leader who won the Nobel Peace Prize in 1991. Пример (продолжение) The military junta took power in 1988 as prodemocracy demonstrations were sweeping the country. It held elections in 1990, but has ignored their result. It has kept the 1991 Nobel peace prize winner, Aung San Suu Kyi - leader of the opposition party which won a landslide victory in the poll under house arrest since July 1989. Пример (продолжение) The regime, which is also engaged in a battle with insurgents near its eastern border with Thailand, ignored a 1990 election victory by an opposition party and is detaining its leader, Ms Aung San Suu Kyi, who was awarded the 1991 Nobel Peace Prize. According to the British Red Cross, 5,000 or more refugees, mainly the elderly and women and children, are crossing into Bangladesh each day. Типовая архитектура вопросноответной системы Generic QA Architecture NL question Question Analyzer IR Query Document Retriever Answer Type Documents Passage Retriever Passages Answer Extractor Answers Анализ вопроса Question word cues • Who person, organization, location (e.g., city) • When date • Where location • What/Why/How ?? Head noun cues • What city, which country, what year... • Which astronaut, what blues band, ... Scalar adjective cues • How long, how fast, how far, how old, ... Использование WordNet Using WordNet What is the service ceiling of an U-2? length wingspan diameter radius NUMBER altitude ceiling NE (Named Entities): примеры • Person: Mr. Hubert J. Smith, Adm. McInnes, Grace Chan • Title: Chairman, Vice President of Technology, Secretary of State • Country: USSR, France, Haiti, Haitian Republic • City: New York, Rome, Paris, Birmingham, Seneca Falls • Province: Kansas, Yorkshire, Uttar Pradesh • Business: GTE Corporation, FreeMarkets Inc., Acme • University: Bryn Mawr College, University of Iowa • Organization: Red Cross, Boys and Girls Club NE (Named Entities): примеры Currency: 400 yen, $100, DM 450,000 Linear: 10 feet, 100 miles, 15 centimeters Area: a square foot, 15 acres Volume: 6 cubic feet, 100 gallons Weight: 10 pounds, half a ton, 100 kilos Duration: 10 day, five minutes, 3 years, a millennium Frequency: daily, biannually, 5 times, 3 times a day Speed: 6 miles per hour, 15 feet per second, 5 kph Age: 3 weeks old, 10-year-old, 50 years of age Способы извлечения NE • Heuristics and patterns • Fixed-lists (gazetteers) • Machine learning approaches Иерархия типов ответов Результат??? • Where do lobsters like to live? on a Canadian airline • Where do hyenas live? in Saudi Arabia in the back of pick-up trucks • Where are zebras most likely found? near dumps in the dictionary • Why can't ostriches fly? Because of American economic sanctions • What’s the population of Maryland? three Эволюция вопросно-ответных систем • Traditional QA Systems (TREC) – Question treated like keyword query – Single answers, no understanding Q: Who is prime minister of India? <find a person name close to prime, minister, India (within 50 bytes)> A: John Smith is not prime minister of India Вопросно-ответные системы будущего – System understands questions – System understands answers and interprets which are most useful – System produces sophisticated answers (list, summarize, evaluate) What other airports are near Niletown? Where can helicopters land close to the embassy? Основные проблемы исследований • Acquiring high-quality, high-coverage lexical resources • Improving document retrieval • Improving document understanding • Expanding to multi-lingual corpora • Flexible control structure – “beyond the pipeline” • Answer Justification – Why should the user trust the answer? – Is there a better answer out there? Почему необходимы технологии обработки естественного языка? • Question: “When was Wendy’s founded?” • Passage candidate: – “The renowned Murano glassmaking industry, on an island in the Venetian lagoon, has gone through several reincarnations since it was founded in 1291. Three exhibitions of 20thcentury Murano glass are coming up in New York. By Wendy Moonan.” • Answer: 20th Century Необходимость анализа предикатноаргументной структуры • Q336: When was Microsoft established? • Difficult because Microsoft tends to establish lots of things… • Microsoft plans to establish manufacturing partnerships in Brazil and Mexico in May. • Need to be able to detect sentences in which `Microsoft’ is object of `establish’ or close synonym. • Matching sentence: • Microsoft Corp was founded in the US in 1975, incorporated in 1981, and established in the UK in 1982. Планирование стратегии поиска ответа • Question: What is the occupation of Bill Clinton’s wife? – No documents contain these keywords plus the answer • Strategy: decompose into two questions: – Who is Bill Clinton’s wife? = X – What is the occupation of X? Что еще необходимо? • Taxonomy of question-answer types and type-specific constraints Q-Types: • Express relationships between events, entities and attributes • Influence Planner strategy A-Types • Express semantic type of valid answers Примеры When did the Titanic sink ? Q-Type A-Type eventcompletion time-point Who was Darth Vader's conceptson? completion personname What is thalassemia ? definition definition Пример иерархии типов вопросов Пример • Q: What year did the Titanic sink? • A: 1912 • Supporting evidence: • It was the worst peacetime disaster involving a British ship since the Titanic sank on the 14th of April, 1912. • The Titanic sank after striking an iceberg in the North Atlantic on April 14th, 1912. • The Herald of Free Enterprise capsized off the Belgian port of Zeebrugge on March 6, 1987, in the worst peacetime disaster involving a British ship since the Titanic sank in 1912. Поиск ответа • Different formats for answer candidates detected, normalized and combined: – `April 14th, 1912’ – `14th of April, 1912’ • Supporting evidence detected and combined: – `1912’ supports `April 14th, 1912’ • Structure of date expressions understood and correct piece output: – `1912’ rather than `April 14th, 1912’ • Most frequent answer candidate found and output: – `April 14th, 1912’ rather than something else. • It is a misconception the Titanic sank on April the 15th,1912 … Уровни сложности вопросов и ответов SOPHISTICATION LEVELS OF QUESTIONERS Level 1 Level 2 Level 3 "Casual "Template "Cub Questioner" Reporter" COMPLEXITY OF QUESTIONS & ANSWERS RANGES: FROM: TO: Questions: Questions: Simple Facts Answers: Simple Answers found in Single Document Level 4 "ProfessionalQuestioner" Information Analyst" Complex; Uses Judgement Terms Knowledge of User Context Needed; Broad Scope Answers: Search Multiple Sources (in multiple Media/languages); Fusion of information; Resolution of conflicting data; Multiple Alternatives; Adding Interpretation; Drawing Conclusions Уровень 1 • Level 1 “Casual Questioner” • Q: Why did Elian Gonzales leave the U.S.? • Focus: the departure of Elian Gonzales. Уровень 2 • Level 2 “Template Questioner” • Q: What was the position of the U.S. Government regarding the immigration of Elian Gonzales in the U.S.? • Focus: set of templates that are generated to extract information about (1) INS statements and actions regarding the immigration of Elian Gonzales; (2) the actions and statements of the Attorney General with respect to the immigration of Elian Gonzales; (3) actions and statements of other members of the administration regarding the immigration of Elian Gonzales; etc Уровень 3 • Level 3 “Cub reporter” • Q: How did Elian Gonzales come to be considered for immigration in the U.S.?— • translated into a set of simpler questions: • Q1: How did Elian Gonzales enter the U.S.? • Q2: What is the nationality of Elian Gonzales? • Q3: How old is Elian Gonzales? • Q4: What are the provisions in the Immigration Law for Cuban refugees? • Q5: Does Elian Gonzales have any immediate relatives? Уровень 3 • Focus: composed of the question foci of all the simpler questions in which the original question is translated. • Focus Q1: the arrival of Elian Gonzales in the U.S. • Focus Q2: the nationality of Elian Gonzales. • Focus Q3: the age of Elian Gonzales. • Focus Q4: immigration law. • Focus Q5: immediate relatives of Elian Gonzales. Уровень 4 • Level 4 “Professional Information Analyst” • Q: What was the reaction of the Cuban community in the U.S. to the decision regarding Elian Gonzales? • Focus: every action and statement, present or future, taken by any American-Cuban, and especially by Cuban anti-Castro leaders, related to the presence and departure of Elian Gonzales from the U.S. Any action, statements or plans involving Elian’s Miami relatives or their lawyers. Система 09 • Диалог оператора справочной системы 09 с пользователем, запрашивающим информацию – номер телефона референта, находящегося в базе оператора • Задача: по данным, предоставляемым пользователем, найти в базе референт и выдать нужную информацию Модель системы знаний оператора • Фрейм знаний оператора о референте: • Номер слота • Имя слота (признак) • Пропозициональное заполнение слота (значение признака) Слоты фрейма • Дескрипция референта (слоты 1-8) • 1. Гиперкатегоризатор (предприятие общественного питания, трест, объединение, институт) • 2. Категоризатор родовой (кафе, ресторан, чайхана, рюмочная) • 3. Категоризаторы видовые: 3.1., 3.2., etc Слоты фрейма • 4. Идентификаторы • • • • 4.1. Район 4.2. Улица 4.3. Номер дома 4.4. etc Слоты фрейма • • • • • • • • • 5. Идентификатор «номер» 6. Идентификатор «имя» 7. Детализатор 7.1. (дефолт) справочная, секретарь 7.2. Директор 7.3. Отдел работы с клиентами 7.4. Регистратура 7.5. Учительская 7.6. … Слоты фрейма • 8. Номер телефона референта • 9. Характерная функция референта • 10. Прагматическая цель пользователя, достижение которой он связывает с данным референтом • 11. Прагматическая ситуация П, в которой у него возникает информационная потребность (№ телефона референта) Динамическая модель • А-фрейм: фрейм знаний Оператора о референте (рабочий фрейм) • В-фрейм: Фрейм знаний Базы Данных о референте • Задача оператора: получить такое заполнение А-фрейма, чтобы отождествить его с соответствующим В-фреймом Примеры: гиперкатегоризатор • П.: Будьте любезны, телефон Трансэлектромонтаж • О.: Что это такое? • П.: Это трест Примеры: гиперкатегоризатор • П.: В Тушинском районе есть, кажется, Сходненская улица, дом 8, там предприятие 4, Теплоэнергии, а? Номер телефона. • О.: Что за предприятие 4 Теплоэнергии? • П.: Что за предприятие 4 Теплоэнергии? Как это понять? • О.: А что это такое? Что мне смотреть? Примеры: детализатор • П.: Меня интересует номер телефона железнодорожной больницы, 3-я железнодорожная больница, Это на улице 3-я Часовая, дом 20. • О.: № телефона… • П.: Это что Вы мне дали? • О.: Справочную я Вам дала. • П.: Приемная Склифософского, будьте любезны • О.: Есть только регистратура, справочная, директор, главврач Примеры: соотношение фреймов • П.: Алло! Детский сад 11/33 Перовского района. • О.: Это не детский сад, это ясли-сад. • П.: Девушка, мне нужна закусочная в Серебряном бору. • О.: У нас есть пельменная. • П.: Да-да. Примеры: дополнительная информация • П.: Скажите пожалуйста номер парикмахерской при Ханое. • О.: Парикмахерская? Какой адрес? • П.: Это где-то недалеко, это на Профсоюзной находится, недалеко метро Академическая. • Ну а какая там проходит улица? • П.: Будьте любезны, подскажите мне пожалуйста телефон мехового ателье, которое находится на Таганке. • О.: Это не Динамовская улица? • П.: Нет! • О.: Ульяновская?