The PERTOMed Project:
Exploiting and validating terminological resources
of comparable Russian-French-English corpora
within Pharmacovigilance
Cedric BOUSQUET
INSERM U729 (Faculté de Médecine - Paris 5)
cedric.bousquet@spim.jussieu.fr
Maria ZIMINA
EA2290 SYLED (Paris 3) / CRIM-INaLCO (Paris)
zimina@msh-paris.fr
International Conference on Terminology
“Terminology and Society”.
Outline

Introduction. The PERTOMed Project:
 Background
 Objectives
 Material: parallel French/English vs. comparable Russian corpora

Research methodology:
 SYNTEX
 Repeated Segments extraction
 Multiple co-occurrences
 Collaboration domain expert / corpus linguist

Discussion:
 Positive results
 Limits

Conclusions and future work
The PERTOMed Project:

Project Directors:
 Marie-Christine Jaulent (INSERM, Paris).
 Jean Charlet (INSERM, Paris).

Partners:
 INSERM U729, Faculté de Médecine - Paris 5 (France).
 ERSS: Equipe de Recherche en Syntaxe et Sémantique, UMR
5610 CNRS and Toulouse le Mirail University (France).
 CRIM: Centre de Recherche en Ingénierie Multilingue, INaLCO
(Paris, France).
The PERTOMed Project:

Developing terminological resources in medicine is essential for
collecting data and browsing knowledge bases.

The objective of the PERTOMed project (Production et évaluation
de ressources terminologiques et ontologiques dans le domaine de
la médecine) was to build terminological or ontological resources
from texts in the medical domain.

Potential applications concern several fields:
 Pharmacovigilance
 Pneumology
 Drug-drug interactions
 Multilingual terminologies
Pharmacy-related issues in Russia

Several pharmaceutical companies are present in Russia: medicines
produced in the EU or the USA are also marketed in Russia.

Required qualities of translation of product information: Precision,
Reproducibility, Exactness…

High quality of drug product information translations is vital for
 Pharmaceutical companies
 Russian regulatory authorities
 Medical doctors
 Pharmacists
 Consumers
Pharmacovigilance
According to the World Health Organization (WHO),
pharmacovigilance is “the science and activities
relating to the detection, assessment, understanding and
prevention of adverse effects or any other drug-related
problems.”
Available international terminologies in
Pharmacovigilance


 WHO-ART (World Health Organization – Adverse Reaction
Terminology) was developed in English, with translations into French,
German, Spanish, Portuguese and Italian.
 MedDRA (Medical Dictionary for Regulatory Activities) defines
fully equivalent medical terms in different languages, including
English, French, German, Japanese and Spanish.
Objectives

To propose methods for creating terminological resources
from comparable French-English-Russian corpora on
adverse drug reactions.

To build a trilingual French-English-Russian terminological
resource describing adverse drug reactions.
Available resources:

Parallel French-English medical text corpora:
Summaries of Product Characteristics (SPC).

Comparable medical corpora on Russian Web sites.
SPC: Summary of Product Characteristics

The European Medicines Agency (EMEA) is a decentralised EU body
with headquarters in London.

Companies submit a single marketing authorisation application to
the EMEA.

If the Committee for Medicinal Products for Human Use (CHMP)
gives its approval, applicants receive a single marketing
authorisation valid for the entire EU.

The SPCs are provided in all EU languages (undesirable effects are
described in Section 4.8).
French-English corpus from the
PERTOMed Project (C. Bousquet)

 156 SPCs in French and English downloaded as PDF files.
 NLP processing by SYNTEX (French/English parser and term extractor).

Aligned sentences (French / English):

FR: La Lamivudine peut inhiber la phosphorylation intracellulaire de la zalcitabine lorsque ces deux produits sont administrés de manière concomitante.
EN: Lamivudine may inhibit the intracellular phosphorylation of zalcitabine when the two medicinal products are used concurrently.

FR: Par conséquent, il n'est pas recommandé d'utiliser Zeffix en association avec la zalcitabine.
EN: Zeffix is therefore not recommended to be used in combination with zalcitabine.

FR: La Lamivudine a été bien tolérée au cours des essais cliniques réalisés chez des patients atteints d'hépatite B chronique.
EN: In clinical studies of patients with chronic hepatitis B, Lamivudine was well tolerated.

FR: Les effets indésirables le plus souvent rapportés étaient : malaise et fatigue, infections respiratoires, gêne au niveau de la gorge et des amygdales, céphalées, douleur ou gêne abdominale, nausées, vomissements et diarrhée.
EN: The most common adverse events reported were malaise and fatigue, respiratory tract infections, throat and tonsil discomfort, headache, abdominal discomfort and pain, nausea, vomiting and diarrhoea.

FR: Chez les patients traités pour une hépatite B chronique, l'incidence des anomalies biologiques a été similaire dans les groupes Lamivudine et placebo, à l'exception : - d'augmentations du taux des CPK non associées à des symptômes ou à des signes cliniques ,
EN: The incidence of laboratory abnormalities in chronic hepatitis B patients were similar in the Lamivudine and placebo treated groups, with the exception of CPK elevations (which were not associated with clinical signs or symptoms), and ALT elevations post-treatment, which were more common in Lamivudine treated patients.
SYNTEX (D. Bourigault, S. Ozdowska)

Step 1: Sentence alignment (JAPA)
Step 2: Part-of-Speech tagging (TreeTagger)
Step 3: Parsing (SYNTEX):
 syntactic dependencies are identified (subjects, direct and
indirect objects of verbs…)
Step 4: Identification of anchor pairs:
 cognates, translation equivalents within aligned sentences…
Step 5: Alignment by syntactic propagation:
 example: the subject relation is identified in both aligned sentences
EN: …the two medicinal products are used concurrently.
FR: …ces deux produits sont administrés de manière concomitante.
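The propagation step (Step 5) can be illustrated with a minimal sketch. It is not the SYNTEX implementation: the `Dep` triple, the `propagate` function and the toy dependencies are assumptions made for illustration. The idea shown is only that, when two words are already paired (an anchor pair) and both take part in the same dependency relation, the words at the other end of that relation become candidates for pairing.

```python
from typing import NamedTuple


class Dep(NamedTuple):
    """One syntactic dependency (hypothetical representation)."""
    relation: str   # e.g. "subject"
    governor: str   # head word
    dependent: str  # dependent word


def propagate(src_deps, tgt_deps, anchors):
    """Extend (source, target) word pairs along matching dependency relations."""
    pairs = set(anchors)
    changed = True
    while changed:  # repeat until no new pair can be added
        changed = False
        for s in src_deps:
            for t in tgt_deps:
                if s.relation != t.relation:
                    continue
                # Dependents already paired: pair the governors.
                if (s.dependent, t.dependent) in pairs and (s.governor, t.governor) not in pairs:
                    pairs.add((s.governor, t.governor))
                    changed = True
                # Governors already paired: pair the dependents.
                if (s.governor, t.governor) in pairs and (s.dependent, t.dependent) not in pairs:
                    pairs.add((s.dependent, t.dependent))
                    changed = True
    return pairs


# Toy dependencies for the aligned sentence pair shown on this slide.
en = [Dep("subject", "used", "products")]
fr = [Dep("subject", "administrés", "produits")]
aligned = propagate(en, fr, {("products", "produits")})
print(sorted(aligned))
```

Starting from the anchor pair products / produits, the shared subject relation yields the candidate pair used / administrés. A real system would also have to handle competing candidates and diverging syntactic structures.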
Comparable resources on medicinal
products in Russia (J. Ivanova and I. Nuk)

Russian Websites selected for the Project:
 RECIPE: http://www.recipe.ru
 RLS: http://www.rlsnet.ru
 Russian Vidal: http://www.vidal.ru

Criteria for comparability with SPC:
 Degree of specialisation
 Clarity and precision
 Recognition by domain experts in Russia
 Information granularity
 Style (summarization)
 Possible text to text alignment: direct search in Russian by active
component or medicinal product
The RECIPE Website:
The site of legal pharmacological documentation; Medline user manual, index
of Russian bio-medical Websites, several criteria to search for medical
products (including ICD-10): http://www.recipe.ru
The RLS Website:
The RLS site (RLS is the acronym of Регистр Лекарственных Средств России
[Register of Medical Substances of Russia]): http://www.rlsnet.ru/
Encyclopaedia of medical products and product descriptions.
The Russian Vidal Website:
Russian Vidal: http://www.vidal.ru is edited and regularly updated by the private
company AstraPharmService in accordance with the Industrial Standard of
the Russian Federation.
Russian corpus from the PERTOMed
Project: results of Correspondence Analysis
(Lexico3)
Regardless of their various origins (different
Websites used to collect information), the
descriptions of medical products in
Russian gathered within the corpus tend to
share common lexical characteristics…
Material: parallel vs. comparable corpora

Difficulties:
 Corpus size differences…
 Information coverage?
 Degree of comparability?
 NLP tools/methods for comparable multilingual text processing?
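Corpus    Type         Nb occ    Nb forms   F max        Hapax
Russian   comparable    15 465      3 034     461 (в)    1 483
French    parallel     161 995      7 280   8 022 (de)   2 389
English   parallel     133 984      5 957   4 936 (of)   1 836
Nb occ = number of occurrences; Nb forms = number of distinct forms;
F max = frequency of the most frequent form; Hapax = forms occurring once.
Delimiting characters: .,:;!?/_-\"'()[]{}§$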
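The four measures in the table can be reproduced on any text with a few lines of code. The following is a minimal sketch, not the Lexico3 implementation: the function name `corpus_stats`, the lowercasing step and the toy sentence are assumptions, and the delimiter list is copied from the slide.

```python
import re
from collections import Counter

# Delimiting characters as listed on the slide.
DELIMITERS = ".,:;!?/_-\"'()[]{}§$"


def corpus_stats(text):
    """Return occurrences, distinct forms, most frequent form and hapax count."""
    # Split on whitespace and on any delimiting character.
    pattern = "[" + re.escape(DELIMITERS) + r"\s]+"
    # Lowercasing is an assumption of this sketch.
    forms = [tok for tok in re.split(pattern, text.lower()) if tok]
    counts = Counter(forms)
    top_form, top_freq = counts.most_common(1)[0]  # assumes a non-empty text
    return {
        "occurrences": len(forms),
        "forms": len(counts),
        "f_max": (top_form, top_freq),
        "hapax": sum(1 for n in counts.values() if n == 1),
    }


# Toy text (not taken from the corpus).
sample = "со стороны нервной системы: головокружение, головная боль. со стороны крови: анемия."
stats = corpus_stats(sample)
print(stats)
```

On the toy sentence this counts 11 occurrences and 9 distinct forms, 7 of which are hapax.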
Methods for building terminologies from
comparable corpora (1/2)

If two words are mutual translations, their collocates are
likely to correspond as well…
 Collocation is defined as a co-occurrence relation.
 Domain specific words co-occur with general words
(possibility to use general bilingual dictionaries).

Mapping through bilingual dictionary:
 Build context vectors for source and target words.
 Translate context vectors.
 Compute similarity between source and target context vectors.
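The three mapping steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the project's code: the window size, the cosine measure, the function names and the toy dictionary are all hypothetical choices.

```python
import math
from collections import Counter


def context_vector(tokens, word, window=2):
    """Count the forms co-occurring with `word` within +/- `window` positions."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == word:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            vec.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return vec


def translate_vector(vec, dictionary):
    """Map each context form through a bilingual dictionary; unknown forms are dropped."""
    out = Counter()
    for form, n in vec.items():
        for translation in dictionary.get(form, []):
            out[translation] += n
    return out


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Toy data (hypothetical, for illustration only).
ru = "гриппоподобный синдром лихорадка озноб".split()
fr = "syndrome pseudo grippal fièvre frissons".split()
ru_fr = {"лихорадка": ["fièvre"], "озноб": ["frissons"]}

source = translate_vector(context_vector(ru, "синдром", window=2), ru_fr)
target = context_vector(fr, "syndrome", window=4)
score = cosine(source, target)
print(round(score, 3))
```

A general dictionary covers only the general words of each context, which is why the approach relies on domain-specific words co-occurring with general ones.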
Methods for building terminologies from
comparable corpora (2/2)

Statistical Machine Translation:
 A translation model is learned from existing translations
(parallel corpora).
 Alignment probabilities are introduced to refine the model.
 Limits: considerable amounts of training data, several
heuristics possible.

Mixed approaches:
 Syntactic relations transfer, co-occurrence relations, dictionary
mapping, alignment probabilities …

Problems:
 Lack of equivalence between tools performing similar tasks on
different languages.
 Term extraction from comparable corpora is not yet satisfactory.
Repeated Segments extraction
Repeated Segments (SALEM 1987): series of consecutive forms whose
frequency is greater than or equal to 2 in the corpus:

FR
syndrome de stevens johnson                    27
syndrome pseudo grippal                        23
de syndrome de                                 13
syndrome de lyell                              11
syndrome de détresse respiratoire               8
syndrome de stevens johnson et                  8
de syndrome de stevens johnson                  7
un syndrome de                                  7
syndrome d’hyperstimulation ovarienne           7
un syndrome grippal                             5
un syndrome pseudo grippal                      5
syndrome de turner                              5

EN
stevens johnson syndrome                       25
respiratory distress syndrome                   8
ovarian hyperstimulation syndrome               7
flu like syndrome                               6
multiforme stevens johnson syndrome             6
adult respiratory distress syndrome             5
erythema multiforme stevens johnson syndrome    5

RU
гриппоподобный синдром                         13
синдром стивенса джонсона                       7
или синдром                                     2
синдром гиперстимуляции яичников                3
синдром и интерстициальная пневмония            2
синдром лайелла                                 2
синдром лизиса опухоли                          2
синдром высвобождения цитокинов                 2
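The definition of a repeated segment translates directly into code. The following is a minimal sketch, not the Lexico3 implementation: the function name, the length bounds `min_len` and `max_len`, and the toy input are assumptions. Unlike a real textometric tool, it does not stop segments at delimiting characters.

```python
from collections import Counter


def repeated_segments(forms, min_len=2, max_len=6, min_freq=2):
    """Return every run of consecutive forms that occurs at least `min_freq` times."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(forms) - n + 1):
            counts[tuple(forms[i:i + n])] += 1
    return {" ".join(seg): freq for seg, freq in counts.items() if freq >= min_freq}


# Toy input (hypothetical sequence of forms, already split and lowercased).
forms = "un syndrome de lyell ou un syndrome de stevens johnson".split()
segments = repeated_segments(forms)
for segment, freq in sorted(segments.items()):
    print(freq, segment)
```

On this input the repeated segments are "un syndrome", "syndrome de" and "un syndrome de", each with frequency 2. As in the table above, segments nested inside longer ones are listed alongside them.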
Multiple co-occurrences
MARTINEZ (2003): The method is based on iterative calculation of lexical
attractions. Filtering techniques reduce the number of contextual explorations:
[Figure: graph of co-occurrence paths between forms A to I]
Only non-inclusive paths are selected:
ABEG
ABEH
ABFI
Choosing comparable textual units as
starting points for exploration…
со стороны (F=216)
French / English / Russian
troubles (F=922) / disorders (F=835) / нарушения ?? (Dictionary)
troubles du système nerveux / nervous system disorders / со стороны нервной системы
troubles respiratoires / respiratory system disorders / со стороны респираторной системы
troubles musculo-squelettiques / musculoskeletal system disorders / со стороны опорно-двигательного аппарата
CONTEXT:
FR: troubles du système nerveux : insomnie ; hypoesthésie ; paresthésies.
EN: nervous system disorders: dizziness, paraesthesia, hyperaesthesia.
RU: со стороны нервной системы: головокружение, головная боль, тревога, депрессия,
парестезии, гиперстезии, возбуждение, нарушения сна, нервозность, астения.
Exploring collocation networks : French
Contexte n°1292
(15 formes dont 3 vedettes) Densité info.=0.20
troubles du métabolisme et de la nutrition : augmentation des triglycérides sériques,
augmentation du cholestérol sérique.
Contexte n°1414
(12 formes dont 3 vedettes) Densité info.=0.25
troubles du métabolisme et de la nutrition : augmentation de la créatinine, hypokaliémie.
Contexte n°1425
(12 formes dont 3 vedettes) Densité info.=0.25
troubles du métabolisme et de la nutrition : élévation de l'urée sanguine.
Contexte n°3180
(10 formes dont 3 vedettes) Densité info.=0.30
troubles du métabolisme et de la nutrition : oedèmes, oedèmes périphériques
Contexte n°4667
(12 formes dont 3 vedettes) Densité info.=0.25
troubles du métabolisme et de la nutrition : élévation de l'urée sanguine.
Contexte n°6157
(10 formes dont 3 vedettes) Densité info.=0.30
troubles du métabolisme et de la nutrition : fréquent : hypertriglycéridémie, hyperglycémie
Contexte n°8334
(13 formes dont 3 vedettes) Densité info.=0.23
troubles du métabolisme et de la nutrition : prise de poids ou amaigrissement, oedèmes.
Contexte n°10151
(19 formes dont 3 vedettes) Densité info.=0.16
troubles du métabolisme et de la nutrition très fréquents perte de poids fréquents perte d'appétit, prise de poids
Legend: f [s] c
f = co-frequency
s = specificity
c = number of contexts
Exploring collocation networks : English
Contexte n°32
(4 formes dont 3 vedettes) Densité info.=0.75
metabolism and nutrition disorders
Contexte n°3715
(7 formes dont 3 vedettes) Densité info.=0.43
metabolism and nutrition disorders : oedema, peripheral oedema
Contexte n°3763
(5 formes dont 3 vedettes) Densité info.=0.60
metabolism and nutrition disorders : hypokalaemia.
Contexte n°5392
(23 formes dont 3 vedettes) Densité info.=0.13
metabolism and nutrition disorders : very common : hypercholesterolemia, hypertriglyceridemia
(hyperlipemia) ; hypokalaemia ; increased lactic dehydrogenase (ldh) common : liver function tests
abnormal ; increased sgot, increased sgpt.
Contexte n°7698
(7 formes dont 3 vedettes) Densité info.=0.43
metabolism and nutrition disorders : common : hypertriglyceridaemia, hyperglycaemia
Contexte n°8856
(11 formes dont 3 vedettes) Densité info.=0.27
metabolism and nutrition disorders : abnormal renal function tests (increased creatinine, bun)
Contexte n°9771
(9 formes dont 3 vedettes) Densité info.=0.33
metabolism and nutrition disorders : weight gain or loss, oedema.
Contexte n°11578
(13 formes dont 3 vedettes) Densité info.=0.23
metabolism and nutrition disorders very common weight loss common decreased appetite, weight increase
Legend: f [s] c
f = co-frequency
s = specificity
c = number of contexts
Exploring collocation networks : Russian
Contexte n°150
(10 formes dont 4 vedettes) Densité info.=0.40
со стороны обмена веществ : обострение сахарного диабета, отеки или обезвоживание.
Contexte n°832
(37 formes dont 4 vedettes) Densité info.=0.11
c=10500 побочное действие : со стороны обмена веществ : после назначения натеглинида, как и при применении других гипогликемических препаратов, были отмечены симптомы, предположительно свидетельствующие о развитии гипогликемии, такие как повышенная потливость, тремор, головокружение, повышенный аппетит, сердцебиение, тошнота, слабость, недомогание.
Contexte n°647
(16 formes dont 4 vedettes) Densité info.=0.25
со стороны пищеварительной системы : возможны тошнота, рвота, повышение активности act, алт и ггт ; описаны случаи гепатита.
Contexte n°930
(18 formes dont 4 vedettes) Densité info.=0.22
со стороны обмена веществ : гиперпролактинемия, увеличение (редко уменьшение) массы тела, сахарный диабет, гипергликемия, диабетический кетоацидоз, диабетическая кома, зоб.
Contexte n°718
(23 formes dont 4 vedettes) Densité info.=0.17
c=752 побочное действие : со стороны пищеварительной системы : возможны боли и дискомфорт в эпигастральной области, тошнота, рвота, диарея, снижение аппетита, повышение активности печеночных трансаминаз.
Contexte n°1439
(10 formes dont 4 vedettes) Densité info.=0.40
со стороны обмена веществ : гипергликемия, сахарный диабет, кетоацидоз ожирение, дегидратация.
Contexte n°873
(31 formes dont 4 vedettes) Densité info.=0.13
со стороны пищеварительной системы : часто-повышение активности ггт ; возможны-повышение активности алт, аст, щф и уровня общего билирубина, тошнота, рвота, диарея, боли в животе ; в единичных случаях-желтуха, тяжелые гепатотоксические реакции.
Contexte n°1459
(12 formes dont 4 vedettes) Densité info.=0.33
со стороны обмена веществ : обезвоживание, обострение сахарного диабета, ожирение по центральному типу.
Legend: f [s] c
f = co-frequency
s = specificity
c = number of contexts
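The "Densité info." values displayed with each context are consistent with the ratio of pole forms ("vedettes") to the total number of forms in the context. This reading is inferred from the figures on the slides, not from the tool's documentation. The sketch below checks it on Contexte n°150, splitting forms on the delimiting characters given earlier; the function name `count_forms` is hypothetical.

```python
import re

# Delimiting characters as listed on the corpus statistics slide.
DELIMITERS = ".,:;!?/_-\"'()[]{}§$"


def count_forms(context):
    """Count the forms of a context, splitting on whitespace and delimiters."""
    pattern = "[" + re.escape(DELIMITERS) + r"\s]+"
    return len([tok for tok in re.split(pattern, context) if tok])


# Contexte n°150, as shown on the slide.
context_150 = "со стороны обмена веществ : обострение сахарного диабета, отеки или обезвоживание."
n_forms = count_forms(context_150)
n_poles = 4  # the pole "со стороны обмена веществ"
density = round(n_poles / n_forms, 2)
print(n_forms, density)
```

The count of 10 forms and the ratio 4/10 match the values displayed for this context (10 formes, Densité info.=0.40).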
Combining with French-English lexicon
extracted from parallel corpus…

Automatic segmentation into textual units (Lexico3):
 Forms, Repeated Segments (Russian-French-English)

Identification of anchor pairs (starting points):
 Frequency counts, cognates, general words, French/English
terminology…
(syndrome / syndrome / синдром)

Trilingual collocation networks (COOCS):
 Identification of similar context vectors.
 Semi-automatic segmentation into terminological units.
 Cross-language check.
 Expert validation.
Building trilingual terminology:
collaboration domain expert/corpus linguist
Two different kinds of knowledge / skills:
 From corpus linguist:
    Methodological knowledge: tools and methods for text exploration.
    Quantitative results on corpora.
 From domain expert:
    Domain specific knowledge on ADRs.
    Choice of relevant terms/contexts when several variants are attested in texts.
Results: trilingual lexicon on the Web
PERTOMed Server : http://baneyx.net/SPIP/
Each trilingual entry comprises the
following fields:
- Simple term (with possible variants)
- Abbreviation (if applicable)
- Related composed term(s)
- Domain(s)
- Medical product(s) concerned
Results: choosing terms…
Russian segments
гриппоподобный синдром (13 occurrences)
гриппоподобные симптомы (2 occurrences)
симптомы гриппоподобного синдрома (1 occurrence)
гриппоподобная симптоматика (1 occurrence)

French segments
syndrome pseudo-grippal (23 occurrences)
syndrome pseudogrippal (2 occurrences)
symptômes pseudo-grippaux (7 occurrences)

English segments
influenza-like symptoms (9 occurrences)
influenza-like illness (6 occurrences)
flu-like symptoms (8 occurrences)
flu-like symptom complex (4 occurrences)
flu-like syndrome (6 occurrences)
flu-like illness (5 occurrences)

Retained entry:
гриппоподобный синдром / syndrome pseudo-grippal / influenza-like symptom, flu-like symptom
Results: choosing domains…
Russian / French / English
Аллергические реакции / Réactions allergiques / Allergic reactions
Дерматология / Dermatologie / Dermatology
Дыхательная система / Appareil respiratoire / Respiratory system disorders
Желудочно-кишечный тракт / Système gastro-intestinal / Gastro-intestinal system disorders
Костно-мышечная система / Système ostéo-musculaire / Musculo-skeletal system disorders
Кроветворение и кровь / Hématopoïèse et sang / Hematopoiesis and blood
Местные реакции / Réactions locales au traitement / Application site disorders
Мочеполовая система / Appareil uro-génital / Urogenital disorders
Нервная система / Système nerveux / Nervous system
Обмен веществ / Métabolisme / Metabolic disorders
Организм в целом / Etat général / Body as a whole-general disorders
Органы чувств / Troubles des organes sensoriels / Sensory system disorders
Печеночно-билиарная система / Foie et voies biliaires / Liver and biliary system disorders
Прочие / Divers / Other
Психические нарушения / Troubles psychiatriques / Psychiatric disorders
Сердечно-сосудистая система / Appareil cardio-vasculaire / Cardiovascular disorders
Discussion: positive results

430 validated trilingual terminological entries in XML format
 2002 simple terminological records (single word terms)
 1006 complex terminological records (~50%) (multiword terms)
 Co-occurrence relations:
Simple term: диабет / diabète / diabetes
Context: диабетическая кома / coma diabétique / diabetic coma
Domain: Обмен веществ / Troubles du métabolisme / Metabolism disorders
Medical product: RLS Olanzapine

Simple term: кома / coma / Coma
Context: диабетическая кома / coma diabétique / diabetic coma
Domain: Обмен веществ / Troubles du métabolisme / Metabolism disorders
Medical product: RLS Olanzapine
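A record of this kind could be serialised to XML along the following lines. The slides state only that the entries are in XML format, so every element and attribute name below (`entry`, `term`, `context`, `domain`, `variant`, `lang`, `product`) is a hypothetical choice made for illustration, not the project's schema.

```python
import xml.etree.ElementTree as ET


def build_entry(term, context, domain, product):
    """Build one trilingual entry; element and attribute names are hypothetical."""
    entry = ET.Element("entry")
    for tag, values in (("term", term), ("context", context), ("domain", domain)):
        field = ET.SubElement(entry, tag)
        for lang, text in values.items():
            # One <variant> per language, identified by a lang attribute.
            ET.SubElement(field, "variant", lang=lang).text = text
    ET.SubElement(entry, "product").text = product
    return entry


# Values taken from the first record shown on this slide.
entry = build_entry(
    term={"ru": "диабет", "fr": "diabète", "en": "diabetes"},
    context={"ru": "диабетическая кома", "fr": "coma diabétique", "en": "diabetic coma"},
    domain={"ru": "Обмен веществ", "fr": "Troubles du métabolisme", "en": "Metabolism disorders"},
    product="RLS Olanzapine",
)
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
```

Keeping the three languages as sibling variants under each field makes it straightforward to query one language at a time, for example with `entry.find("term/variant[@lang='fr']")`.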
Discussion: limits




Lexical coverage.
Contextual access.
Presentation (no visual aids for navigation yet…).
Evaluation difficulties:
 Choice of criteria.
 Comparable resources needed.
Conclusions

Creating terminological resources from comparable corpora
has to cope with the intrinsic heterogeneity of the texts.

The challenge of exploring texts coming from different
cultural and linguistic sources should be taken into account
in the terminology project feasibility study.

Creating a Russian Internet corpus in the field of
Pharmacovigilance is pioneering work.

The textometric approach to comparable corpora
exploration gives encouraging results.

Our methods should be improved by taking into account the
availability of new tools / resources for processing Russian
texts.
Future work…

Intertextual exploration on the document level based on visual aids:
DISTRIBUTION INVENTORY OF REPEATED SEGMENTS
216  со стороны
  2  со стороны центральной и периферической нервной системы
 16  со стороны цнс
  3  со стороны цнс головная боль головокружение
  3  со стороны цнс и периферической нервной системы
  4  со стороны цнс возможны
  2  со стороны цнс возможны повышенная утомляемость головная боль
  5  со стороны дыхательной системы
  2  со стороны дыхательной системы возможны
 10  со стороны кожных покровов
  2  со стороны кожных покровов алопеция
  2  со стороны кожных покровов сыпь
  4  со стороны костно мышечной системы
  2  со стороны крови
  2  со стороны лабораторных показателей
  4  со стороны мочевыделительной системы
Publications on the PERTOMed Project:

Baneyx A., Charlet J., Jaulent M.-C. (2005) "Building medical ontologies based on
terminology extraction from texts: methodological propositions". In Proceedings of the
10th Conference on Artificial Intelligence in Medicine in Europe, Lecture Notes in
Computer Science, Aberdeen, GB, July 2005. Springer.

Jaulent M.-C., Charlet J. (2006) "PERTOMed : Production et évaluation de
ressources terminologiques et ontologiques dans le domaine de la médecine".
PERTOMed : Rapport de fin de projet, INSERM U729.

Nuk I., Ivanova J. (2005) "Création d’une terminologie français/russe dans le
domaine de la pharmacovigilance". Mémoire de DESS (dir. Monique Slodzian),
Centre de Recherche en Ingénierie Multilingue, INaLCO.

Ozdowska S., Névéol A., Thirion B. (2005) "Traduction compositionnelle
automatique de bitermes dans des corpus anglais/français alignés". Actes de la
Conférence Terminologie et Intelligence Artificielle, TIA'05, Rouen, France.