Вопросно-ответные системы

реклама
Вопросно-ответные
системы
Актуальные проблемы
прикладной лингвистики, Е.Ю.
Калинина, МГУ, 2007-2008
Что такое воспросно-ответные
системы?
• Q&A:
• Inputs: a question in English; a set of text
and database resources
• Output: a set of possible answers drawn
from the resources
• When is the next train to Dublin? --> Q&A
System --> Text Corpora & RelDBases -->
8.35, track 9
Основные компоненты вопросноответной системы
Generic architecture
• • Query processing
• • Paragraph indexing
• • Answer resolution
• • Open-domain vs. domain-specific systems
Given a NL query, extract its possible answer(s)
from real-world NL text documents.
Типы информационных
потребностей
• Ad hoc retrieval: find me documents “like
this”
Identify positive accomplishments of
the Hubble telescope since it was
launched in 1991.
Compile a list of mammals that are
considered to be endangered, identify
their habitat and, if possible, specify
what threatens them.
Типы информационных
потребностей
• Question answering
“Factoid”
“List”
“Definition”
Who discovered Oxygen?
When did Hawaii become a state?
Where is Ayer’s Rock located?
What team won the World Series in 1992?
What countries export oil?
Name U.S. cities that have a “Shubert” theater.
Who is Aaron Copland?
What is a quasar?
Предшественники вопросноответных систем
• Information Retrieval
– Retrieve relevant documents from a set of
keywords; search engines
• Information Extraction
– Template filling from text (e.g. event
detection); e.g. TIPSTER, MUC
• Relational QA
– Translate question to relational DB query; e.g.
LUNAR, FRED
Information Extraction (IE)
IE systems:
• Identify documents of a specific type
• Extract information according to predefined templates
• Place the information into frame-like
database records
Weather disaster:
Type
Date
Location
Damage
Deaths
...
Information Extraction (IE)
• Templates = pre-defined questions
• Extracted information = answers
Limitations
• Templates are domain dependent and not
easily portable
• One size does not fit all!
Typical TREC QA Pipeline
“A simple
factoid
question”
Questi
on
Query
Extract
Keywords
Answe
rs
Docs
Search
Engine
Passage
Extractor
Answer
Selector
Answe
r
Corpu
s
“A 50-byte passage likely
to contain the desired
answer” (TREC QA track)
Основная идея Factoid Q&A
– Determine the semantic type of the expected
answer “Who won the Nobel Peace Prize in 1991?”
is looking for a PERSON
– Retrieve documents that have keywords from
the question Retrieve documents that have the
keywords “won”, “Nobel Peace
Prize”, and “1991”
– Look for named-entities of the proper type
near keywords Look for a PERSON near the
keywords “won”, “Nobel Peace
Prize”, and “1991”
Пример
Who won the Nobel Peace Prize in 1991?
But many foreign investors remain sceptical,
and western governments are withholding aid
because of the Slorc's dismal human rights
record and the continued detention of Ms Aung
San Suu Kyi, the opposition leader who won
the Nobel Peace Prize in 1991.
Пример (продолжение)
The military junta took power in 1988 as prodemocracy demonstrations were sweeping
the country. It held elections in 1990, but
has ignored their result. It has kept the
1991 Nobel peace prize winner, Aung San
Suu Kyi - leader of the opposition party
which won a landslide victory in the poll under house arrest since July 1989.
Пример (продолжение)
The regime, which is also engaged in a battle with
insurgents near its eastern border with Thailand,
ignored a 1990 election victory by an opposition
party and is detaining its leader, Ms Aung San Suu
Kyi, who was awarded the 1991 Nobel Peace
Prize. According to the British Red Cross, 5,000 or
more refugees, mainly the elderly and women and
children, are crossing into Bangladesh each day.
Типовая архитектура вопросноответной системы
Generic QA Architecture
NL question
Question Analyzer
IR Query
Document Retriever
Answer Type
Documents
Passage Retriever
Passages
Answer Extractor
Answers
Анализ вопроса
Question word cues
• Who  person, organization, location (e.g., city)
• When  date
• Where  location
• What/Why/How  ??
Head noun cues
• What city, which country, what year...
• Which astronaut, what blues band, ...
Scalar adjective cues
• How long, how fast, how far, how old, ...
Использование WordNet
Using WordNet
What is the service ceiling of an U-2?
length
wingspan
diameter
radius
NUMBER
altitude
ceiling
NE (Named Entities): примеры
• Person: Mr. Hubert J. Smith, Adm. McInnes, Grace
Chan
• Title: Chairman, Vice President of Technology,
Secretary of State
• Country: USSR, France, Haiti, Haitian Republic
• City: New York, Rome, Paris, Birmingham, Seneca Falls
• Province: Kansas, Yorkshire, Uttar Pradesh
• Business: GTE Corporation, FreeMarkets Inc., Acme
• University: Bryn Mawr College, University of Iowa
• Organization: Red Cross, Boys and Girls Club
NE (Named Entities): примеры
Currency: 400 yen, $100, DM 450,000
Linear: 10 feet, 100 miles, 15 centimeters
Area: a square foot, 15 acres
Volume: 6 cubic feet, 100 gallons
Weight: 10 pounds, half a ton, 100 kilos
Duration: 10 day, five minutes, 3 years, a millennium
Frequency: daily, biannually, 5 times, 3 times a day
Speed: 6 miles per hour, 15 feet per second, 5 kph
Age: 3 weeks old, 10-year-old, 50 years of age
Способы извлечения NE
• Heuristics and patterns
• Fixed-lists (gazetteers)
• Machine learning approaches
Иерархия типов ответов
Результат???
• Where do lobsters like to live?
on a Canadian airline
• Where do hyenas live?
in Saudi Arabia
in the back of pick-up trucks
• Where are zebras most likely found?
near dumps
in the dictionary
• Why can't ostriches fly?
Because of American economic sanctions
• What’s the population of Maryland?
three
Эволюция вопросно-ответных
систем
• Traditional QA Systems (TREC)
– Question treated like keyword query
– Single answers, no understanding
Q: Who is prime minister of India?
<find a person name close to prime,
minister, India (within 50 bytes)>
A: John Smith is not prime minister
of India
Вопросно-ответные системы
будущего
– System understands questions
– System understands answers and interprets
which are most useful
– System produces sophisticated answers (list,
summarize, evaluate)
What other airports are near Niletown?
Where can helicopters land close to the
embassy?
Основные проблемы
исследований
• Acquiring high-quality, high-coverage lexical
resources
• Improving document retrieval
• Improving document understanding
• Expanding to multi-lingual corpora
• Flexible control structure
– “beyond the pipeline”
• Answer Justification
– Why should the user trust the answer?
– Is there a better answer out there?
Почему необходимы технологии
обработки естественного языка?
• Question: “When was Wendy’s founded?”
• Passage candidate:
– “The renowned Murano glassmaking industry,
on an island in the Venetian lagoon, has gone
through several reincarnations since it was
founded in 1291. Three exhibitions of 20thcentury Murano glass are coming up in New
York. By Wendy Moonan.”
• Answer: 20th Century
Необходимость анализа предикатноаргументной структуры
• Q336: When was Microsoft established?
• Difficult because Microsoft tends to establish lots of
things…
•
Microsoft plans to establish manufacturing
partnerships in Brazil and Mexico in May.
• Need to be able to detect sentences in which `Microsoft’
is object of `establish’ or close synonym.
• Matching sentence:
•
Microsoft Corp was founded in the US in 1975,
incorporated in 1981, and established in the UK in 1982.
Планирование стратегии поиска
ответа
• Question: What is the occupation of Bill
Clinton’s wife?
– No documents contain these keywords plus
the answer
• Strategy: decompose into two questions:
– Who is Bill Clinton’s wife? = X
– What is the occupation of X?
Что еще необходимо?
• Taxonomy of question-answer types and
type-specific constraints
Q-Types:
• Express relationships between events,
entities and attributes
• Influence Planner strategy
A-Types
• Express semantic type of valid answers
Примеры
When did the Titanic
sink ?
Q-Type
A-Type
eventcompletion
time-point
Who was Darth Vader's conceptson?
completion
personname
What is thalassemia ?
definition
definition
Пример иерархии типов вопросов
Пример
• Q: What year did the Titanic sink?
•
A: 1912
•
Supporting evidence:
•
It was the worst peacetime disaster involving a
British ship since the Titanic sank on the 14th of April,
1912.
•
The Titanic sank after striking an iceberg in the
North Atlantic on April 14th, 1912.
•
The Herald of Free Enterprise capsized off the
Belgian port of Zeebrugge on March 6, 1987, in the
worst peacetime disaster involving a British ship since
the Titanic sank in 1912.
Поиск ответа
• Different formats for answer candidates detected,
normalized and combined:
– `April 14th, 1912’
– `14th of April, 1912’
• Supporting evidence detected and combined:
– `1912’ supports `April 14th, 1912’
• Structure of date expressions understood and correct
piece output:
– `1912’ rather than `April 14th, 1912’
• Most frequent answer candidate found and output:
– `April 14th, 1912’ rather than something else.
• It is a misconception the Titanic sank on April the
15th,1912 …
Уровни сложности вопросов и
ответов
SOPHISTICATION LEVELS OF QUESTIONERS
Level 1
Level 2
Level 3
"Casual
"Template
"Cub
Questioner"
Reporter"
COMPLEXITY OF QUESTIONS & ANSWERS RANGES:
FROM:
TO:
Questions:
Questions:
Simple Facts
Answers: Simple Answers found in
Single Document
Level 4
"ProfessionalQuestioner"
Information
Analyst"
Complex;
Uses Judgement Terms
Knowledge of User Context
Needed; Broad Scope
Answers: Search Multiple Sources (in multiple
Media/languages); Fusion of information; Resolution of
conflicting data; Multiple Alternatives; Adding
Interpretation; Drawing Conclusions
Уровень 1
• Level 1 “Casual Questioner”
• Q: Why did Elian Gonzales leave the
U.S.?
• Focus: the departure of Elian Gonzales.
Уровень 2
• Level 2 “Template Questioner”
• Q: What was the position of the U.S. Government
regarding the immigration of Elian Gonzales in the U.S.?
• Focus: set of templates that are generated to extract
information about (1) INS statements and actions
regarding the immigration of Elian Gonzales; (2) the
actions and statements of the Attorney General with
respect to the immigration of Elian Gonzales; (3) actions
and statements of other members of the administration
regarding the immigration of Elian Gonzales; etc
Уровень 3
• Level
3 “Cub reporter”
• Q: How did Elian Gonzales come to be
considered for immigration in the U.S.?—
• translated into a set of simpler questions:
• Q1: How did Elian Gonzales enter the U.S.?
• Q2: What is the nationality of Elian Gonzales?
• Q3: How old is Elian Gonzales?
• Q4: What are the provisions in the Immigration
Law for Cuban refugees?
• Q5: Does Elian Gonzales have any immediate
relatives?
Уровень 3
• Focus: composed of the question foci of all the
simpler questions in which the original question
is translated.
• Focus Q1: the arrival of Elian Gonzales in the
U.S.
• Focus Q2: the nationality of Elian Gonzales.
• Focus Q3: the age of Elian Gonzales.
• Focus Q4: immigration law.
• Focus Q5: immediate relatives of Elian
Gonzales.
Уровень 4
• Level 4 “Professional Information Analyst”
• Q: What was the reaction of the Cuban
community in the U.S. to the decision regarding
Elian Gonzales?
• Focus: every action and statement, present or
future, taken by any American-Cuban, and
especially by Cuban anti-Castro leaders, related
to the presence and departure of Elian Gonzales
from the U.S. Any action, statements or plans
involving Elian’s Miami relatives or their lawyers.
Система 09
• Диалог оператора справочной системы
09 с пользователем, запрашивающим
информацию – номер телефона
референта, находящегося в базе
оператора
• Задача: по данным, предоставляемым
пользователем, найти в базе референт
и выдать нужную информацию
Модель системы знаний
оператора
• Фрейм знаний оператора о референте:
• Номер слота
• Имя слота (признак)
• Пропозициональное заполнение слота
(значение признака)
Слоты фрейма
• Дескрипция референта (слоты 1-8)
• 1. Гиперкатегоризатор (предприятие
общественного питания, трест,
объединение, институт)
• 2. Категоризатор родовой (кафе,
ресторан, чайхана, рюмочная)
• 3. Категоризаторы видовые: 3.1., 3.2.,
etc
Слоты фрейма
• 4. Идентификаторы
•
•
•
•
4.1. Район
4.2. Улица
4.3. Номер дома
4.4. etc
Слоты фрейма
•
•
•
•
•
•
•
•
•
5. Идентификатор «номер»
6. Идентификатор «имя»
7. Детализатор
7.1. (дефолт) справочная, секретарь
7.2. Директор
7.3. Отдел работы с клиентами
7.4. Регистратура
7.5. Учительская
7.6. …
Слоты фрейма
• 8. Номер телефона референта
• 9. Характерная функция референта
• 10. Прагматическая цель пользователя,
достижение которой он связывает с
данным референтом
• 11. Прагматическая ситуация П, в
которой у него возникает
информационная потребность (№
телефона референта)
Динамическая модель
• А-фрейм: фрейм знаний Оператора о
референте (рабочий фрейм)
• В-фрейм: Фрейм знаний Базы Данных о
референте
• Задача оператора: получить такое
заполнение А-фрейма, чтобы отождествить
его с соответствующим В-фреймом
Примеры: гиперкатегоризатор
• П.: Будьте любезны, телефон
Трансэлектромонтаж
• О.: Что это такое?
• П.: Это трест
Примеры: гиперкатегоризатор
• П.: В Тушинском районе есть, кажется,
Сходненская улица, дом 8, там
предприятие 4, Теплоэнергии, а? Номер
телефона.
• О.: Что за предприятие 4 Теплоэнергии?
• П.: Что за предприятие 4 Теплоэнергии?
Как это понять?
• О.: А что это такое? Что мне смотреть?
Примеры: детализатор
• П.: Меня интересует номер телефона
железнодорожной больницы, 3-я
железнодорожная больница, Это на улице 3-я
Часовая, дом 20.
• О.: № телефона…
• П.: Это что Вы мне дали?
• О.: Справочную я Вам дала.
• П.: Приемная Склифософского, будьте
любезны
• О.: Есть только регистратура, справочная,
директор, главврач
Примеры: соотношение фреймов
• П.: Алло! Детский сад 11/33 Перовского
района.
• О.: Это не детский сад, это ясли-сад.
• П.: Девушка, мне нужна закусочная в
Серебряном бору.
• О.: У нас есть пельменная.
• П.: Да-да.
Примеры: дополнительная
информация
• П.: Скажите пожалуйста номер парикмахерской при
Ханое.
• О.: Парикмахерская? Какой адрес?
• П.: Это где-то недалеко, это на Профсоюзной
находится, недалеко метро Академическая.
• Ну а какая там проходит улица?
• П.: Будьте любезны, подскажите мне пожалуйста
телефон мехового ателье, которое находится на
Таганке.
• О.: Это не Динамовская улица?
• П.: Нет!
• О.: Ульяновская?
Скачать