The PERTOMed Project: Exploiting and validating terminological resources of comparable Russian-French-English corpora within Pharmacovigilance

Cedric BOUSQUET, INSERM U729 (Faculté de Médecine - Paris 5), cedric.bousquet@spim.jussieu.fr
Maria ZIMINA, EA2290 SYLED (Paris 3) / CRIM-INaLCO (Paris), zimina@msh-paris.fr

International Conference on Terminology "Terminology and Society".

Outline

- Introduction. The PERTOMed Project:
  - Background
  - Objectives
  - Material: parallel French/English vs. comparable Russian corpora
- Research methodology:
  - SYNTEX
  - Repeated Segments extraction
  - Multiple co-occurrences
  - Collaboration domain expert / corpus linguist
- Discussion:
  - Positive results
  - Limits
- Conclusions and future work

The PERTOMed Project

Project Directors:
- Marie-Christine Jaulent (INSERM, Paris).
- Jean Charlet (INSERM, Paris).

Partners:
- INSERM U729, Faculté de Médecine - Paris 5 (France).
- ERSS: Equipe de Recherche en Syntaxe et Sémantique, UMR 5610 CNRS and Toulouse le Mirail University (France).
- CRIM: Centre de Recherche en Ingénierie Multilingue, INaLCO (Paris, France).

The PERTOMed Project

Developing terminological resources in medicine is a major issue for collecting data and browsing knowledge databases.

The objective of the PERTOMed project (Production et évaluation de ressources terminologiques et ontologiques dans le domaine de la médecine) was to build terminological or ontological resources from texts in the medical domain.

Potential applications concern several fields:
- Pharmacovigilance
- Pneumology
- Drug-drug interactions
- Multilingual terminologies

Pharmacy-related issues in Russia

Several pharmaceutical companies are present in Russia: medicines produced in the EU or the USA are also commercialised in Russia.
Required qualities of translation of product information: Precision, Reproducibility, Exactness…

High-quality translation of drug product information is vital for:
- Pharmaceutical companies
- Russian regulatory authorities
- Medical doctors
- Pharmacists
- Consumers

Pharmacovigilance

According to the World Health Organization (WHO), pharmacovigilance is "the science and activities relating to the detection, assessment, understanding and prevention of adverse effects or any other drug-related problems."

Available international terminologies in Pharmacovigilance

- WHO-ART (World Health Organization – Adverse Reaction Terminology) was developed in English with translations into French, German, Spanish, Portuguese and Italian.
- MedDRA (Medical Dictionary for Regulatory Activities) defines fully equivalent medical terms in different languages, including English, French, German, Japanese and Spanish.

Objectives

- To propose methods for creating terminological resources from comparable French-English-Russian corpora on adverse drug reactions.
- To build a trilingual French-English-Russian terminological resource describing adverse drug reactions.

Available resources

- Parallel French-English medical text corpora: Summaries of Product Characteristics (SPC).
- Comparable medical corpora on Russian Web sites.

SPC: Summary of Product Characteristics

The European Medicines Agency (EMEA) is a decentralised EU body with headquarters in London.
Companies submit a single marketing authorisation application to the EMEA.
If the Committee for Medicinal Products for Human Use (CHMP) gives its approval, applicants receive a single marketing authorisation valid for the entire EU.
The SPCs are provided in all EU languages (undesirable effects are described in Section 4.8).

French-English corpus from the PERTOMed Project (C. Bousquet)

- 156 SPCs in French and English downloaded as PDF files.
- NLP processing by SYNTEX (French/English parser and term extractor).

Aligned sample (French / English):

FR: La Lamivudine peut inhiber la phosphorylation intracellulaire de la zalcitabine lorsque ces deux produits sont administrés de manière concomitante.
EN: Lamivudine may inhibit the intracellular phosphorylation of zalcitabine when the two medicinal products are used concurrently.

FR: Par conséquent, il n'est pas recommandé d'utiliser Zeffix en association avec la zalcitabine.
EN: Zeffix is therefore not recommended to be used in combination with zalcitabine.

FR: La Lamivudine a été bien tolérée au cours des essais cliniques réalisés chez des patients atteints d'hépatite B chronique.
EN: In clinical studies of patients with chronic hepatitis B, Lamivudine was well tolerated.

FR: Les effets indésirables le plus souvent rapportés étaient : malaise et fatigue, infections respiratoires, gêne au niveau de la gorge et des amygdales, céphalées, douleur ou gêne abdominale, nausées, vomissements et diarrhée.
EN: The most common adverse events reported were malaise and fatigue, respiratory tract infections, throat and tonsil discomfort, headache, abdominal discomfort and pain, nausea, vomiting and diarrhoea.
FR: Chez les patients traités pour une hépatite B chronique, l'incidence des anomalies biologiques a été similaire dans les groupes Lamivudine et placebo, à l'exception : - d'augmentations du taux des CPK non associées à des symptômes ou à des signes cliniques ,
EN: The incidence of laboratory abnormalities in chronic hepatitis B patients were similar in the Lamivudine and placebo treated groups, with the exception of CPK elevations (which were not associated with clinical signs or symptoms), and ALT elevations post-treatment, which were more common in Lamivudine treated patients.

SYNTEX (D. Bourigault, S. Ozdowska)

- Step 1: Sentence alignment (JAPA).
- Step 2: Part-of-Speech tagging (TreeTagger).
- Step 3: Parsing (SYNTEX): syntactic dependencies are identified (subjects, direct and indirect objects of verbs…).
- Step 4: Identification of anchor pairs: cognates, translation equivalents within aligned sentences…
- Step 5: Alignment by syntactic propagation. Example, with the Subject relation marked in both languages:
  EN (Subject): …the two medicinal products are used concurrently.
  FR (Subject): …ces deux produits sont administrés de manière concomitante.

Comparable resources on medicinal products in Russia (J. Ivanova and I. Nuk)

Russian Websites selected for the Project:
- RECIPE: http://www.recipe.ru
- RLS: http://www.rlsnet.ru
- Russian Vidal: http://www.vidal.ru

Criteria for comparability with SPC:
- Degree of specialisation
- Clarity and precision
- Recognition by domain experts in Russia
- Information granularity
- Style (summarization)

Possible text-to-text alignment: direct search in Russian by active component or medicinal product.
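Step 4 of the SYNTEX pipeline above relies on anchor pairs such as cognates. The deck does not say how SYNTEX detects them, so the following is only a minimal sketch under an assumption: cognates are approximated by surface string similarity within one aligned sentence pair. The function name `cognate_anchor_pairs`, the similarity threshold and the minimum length are all hypothetical choices, not part of SYNTEX.

```python
from difflib import SequenceMatcher


def cognate_anchor_pairs(src_tokens, tgt_tokens, threshold=0.8, min_len=4):
    """Pair each source token with its most similar target token.

    Assumption (not from SYNTEX): two forms count as cognates when their
    character-level similarity ratio reaches `threshold`.
    """
    pairs = []
    for src in src_tokens:
        if len(src) < min_len:
            continue  # very short forms (articles, prepositions) are skipped
        best = None
        for tgt in tgt_tokens:
            if len(tgt) < min_len:
                continue
            score = SequenceMatcher(None, src.lower(), tgt.lower()).ratio()
            if score >= threshold and (best is None or score > best[1]):
                best = (tgt, score)
        if best is not None:
            pairs.append((src, best[0]))
    return pairs


# One aligned sentence pair, shortened from the SPC sample in the corpus.
fr = "La Lamivudine peut inhiber la phosphorylation intracellulaire de la zalcitabine".split()
en = "Lamivudine may inhibit the intracellular phosphorylation of zalcitabine".split()

anchors = cognate_anchor_pairs(fr, en)
for src, tgt in anchors:
    print(src, "<->", tgt)
```

Pairs found this way would only serve as starting points; the syntactic propagation of Step 5 needs the parser output and is not sketched here.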
The RECIPE Website

The site of legal pharmacological documentation: Medline user manual, index of Russian bio-medical Websites, several criteria to search for medical products (including ICD-10): http://www.recipe.ru

The RLS Website

RLS, an acronym of Регистр Лекарственных Средств России [Register of Medical Substances of Russia]: http://www.rlsnet.ru/
Encyclopaedia of medical products and product descriptions.

The Russian Vidal Website

Russian Vidal (http://www.vidal.ru) is edited and regularly updated by the private company AstraPharmService in accordance with the Industrial Standard of the Russian Federation.

Russian corpus from the PERTOMed Project: results of Correspondence Analysis (Lexico3)

Regardless of their various origins (different Websites used to collect information), the descriptions of medical products in Russian gathered within the corpus tend to share common lexical characteristics…

Material: parallel vs. comparable corpora

Difficulties:
- Corpus size differences…
- Information coverage?
- Degree of comparability?
- NLP tools/methods for comparable multilingual text processing?

Corpus  | Type       | Nb occ  | Nb forms | F max      | Hapax
Russian | comparable | 15 465  | 3 034    | 461 (в)    | 1 483
French  | parallel   | 161 995 | 7 280    | 8 022 (de) | 2 389
English | parallel   | 133 984 | 5 957    | 4 936 (of) | 1 836

Delimiting characters: .,:;!?/_-\"'()[]{}§$

Methods for building terminologies from comparable corpora (1/2)

- If two words are mutual translations, their collocates are likely to correspond as well…
- Collocation is defined as a co-occurrence relation.
- Domain-specific words co-occur with general words (so general bilingual dictionaries can be used).
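The premise above (translations tend to share translated collocates) can be illustrated with a toy comparison of context vectors. This is a minimal sketch, not the project's implementation: the contexts, the small French-English dictionary, the choice of cosine similarity and all function names are assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def context_vector(contexts, word):
    """Count the forms that co-occur with `word` in the same context."""
    vec = Counter()
    for ctx in contexts:
        if word in ctx:
            vec.update(t for t in ctx if t != word)
    return vec


def translate_vector(vec, dictionary):
    """Map collocates to the target language; unknown forms are dropped."""
    out = Counter()
    for form, count in vec.items():
        if form in dictionary:
            out[dictionary[form]] += count
    return out


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_translation(word, src_contexts, tgt_contexts, dictionary):
    """Return the target form whose context vector is closest to `word`'s."""
    src_vec = translate_vector(context_vector(src_contexts, word), dictionary)
    vocabulary = sorted({t for ctx in tgt_contexts for t in ctx})
    return max(vocabulary,
               key=lambda t: cosine(src_vec, context_vector(tgt_contexts, t)))


# Toy, non-parallel contexts and a hypothetical general dictionary.
fr_contexts = [["troubles", "du", "système", "nerveux", "insomnie"],
               ["troubles", "respiratoires", "toux"],
               ["insomnie", "fréquente"]]
en_contexts = [["nervous", "system", "disorders", "insomnia"],
               ["respiratory", "disorders", "cough"],
               ["insomnia", "common"]]
general_dictionary = {"système": "system", "nerveux": "nervous",
                      "respiratoires": "respiratory", "toux": "cough",
                      "fréquente": "common"}

print(best_translation("troubles", fr_contexts, en_contexts, general_dictionary))
print(best_translation("insomnie", fr_contexts, en_contexts, general_dictionary))
```

Neither "troubles" nor "insomnie" appears in the toy dictionary; only their general-language collocates do, which is the situation the last bullet describes.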
Mapping through bilingual dictionary:
- Build context vectors for source and target words.
- Translate context vectors.
- Compute similarity between source and target context vectors.

Methods for building terminologies from comparable corpora (2/2)

Statistical Machine Translation:
- A translation model is learned from existing translations (parallel corpora).
- Alignment probabilities are introduced to refine the model.
- Limits: considerable amounts of training data, several heuristics possible.

Mixed approaches:
- Syntactic relations transfer, co-occurrence relations, dictionary mapping, alignment probabilities…

Problems:
- Lack of equivalence between tools performing similar tasks on different languages.
- Term extraction from comparable corpora is not yet satisfactory.

Repeated Segments extraction

Repeated Segments (SALEM 1987): series of consecutive forms whose frequency is greater than or equal to 2 in the corpus.

FR
syndrome de stevens johnson | 27
syndrome pseudo grippal | 23
de syndrome de | 13
syndrome de lyell | 11
syndrome de détresse respiratoire | 8
syndrome de stevens johnson et | 8
de syndrome de stevens johnson | 7
un syndrome de | 7
syndrome d'hyperstimulation ovarienne | 7
un syndrome grippal | 5
un syndrome pseudo grippal | 5
syndrome de turner | 5

EN
stevens johnson syndrome | 25
respiratory distress syndrome | 8
ovarian hyperstimulation syndrome | 7
flu like syndrome | 6
multiforme stevens johnson syndrome | 6
adult respiratory distress syndrome | 5
erythema multiforme stevens johnson syndrome | 5

RU
гриппоподобный синдром | 13
синдром стивенса джонсона | 7
или синдром | 2
синдром гиперстимуляции яичников | 3
синдром и интерстициальная пневмония | 2
синдром лайелла | 2
синдром лизиса опухоли | 2
синдром высвобождения цитокинов | 2

Multiple co-occurrences

MARTINEZ (2003): the method is based on iterative calculation of lexical attractions.
Filtering techniques reduce the number of contextual explorations.

[Figure: tree of co-occurrence paths starting from A: A B E G, A B E H, A B F I, A C, A D]

Only non-inclusive paths are selected: ABEG, ABEH, ABFI.

Choosing comparable textual units as starting points for exploration…

French | English | Russian
troubles (F=922) | disorders (F=835) | нарушения ?? (Dictionary); со стороны (F=216)
troubles du système nerveux | nervous system disorders | со стороны нервной системы
troubles respiratoires | respiratory system disorders | со стороны респираторной системы
troubles musculo-squelettiques | musculoskeletal system disorders | со стороны опорно-двигательного аппарата

CONTEXT:
FR: troubles du système nerveux : insomnie ; hypoesthésie ; paresthésies.
EN: nervous system disorders: dizziness, paraesthesia, hyperaesthesia.
RU: со стороны нервной системы: головокружение, головная боль, тревога, депрессия, парестезии, гиперстезии, возбуждение, нарушения сна, нервозность, астения.

Exploring collocation networks: French

Contexte n°1292 (15 formes dont 3 vedettes), Densité info.=0.20
troubles du métabolisme et de la nutrition : augmentation des triglycérides sériques, augmentation du cholestérol sérique.

Contexte n°1414 (12 formes dont 3 vedettes), Densité info.=0.25
troubles du métabolisme et de la nutrition : augmentation de la créatinine, hypokaliémie.

Contexte n°1425 (12 formes dont 3 vedettes), Densité info.=0.25
troubles du métabolisme et de la nutrition : élévation de l'urée sanguine.

Contexte n°3180 (10 formes dont 3 vedettes), Densité info.=0.30
troubles du métabolisme et de la nutrition : oedèmes, oedèmes périphériques

Contexte n°4667 (12 formes dont 3 vedettes), Densité info.=0.25
troubles du métabolisme et de la nutrition : élévation de l'urée sanguine.
Contexte n°6157 (10 formes dont 3 vedettes), Densité info.=0.30
troubles du métabolisme et de la nutrition : fréquent : hypertriglycéridémie, hyperglycémie

Contexte n°8334 (13 formes dont 3 vedettes), Densité info.=0.23
troubles du métabolisme et de la nutrition : prise de poids ou amaigrissement, oedèmes.

Contexte n°10151 (19 formes dont 3 vedettes), Densité info.=0.16
troubles du métabolisme et de la nutrition très fréquents perte de poids fréquents perte d'appétit, prise de poids

Legend: f [s] c
f = co-frequency
s = specificity
c = number of contexts

Exploring collocation networks: English

Contexte n°32 (4 formes dont 3 vedettes), Densité info.=0.75
metabolism and nutrition disorders

Contexte n°3715 (7 formes dont 3 vedettes), Densité info.=0.43
metabolism and nutrition disorders : oedema, peripheral oedema

Contexte n°3763 (5 formes dont 3 vedettes), Densité info.=0.60
metabolism and nutrition disorders : hypokalaemia.

Contexte n°5392 (23 formes dont 3 vedettes), Densité info.=0.13
metabolism and nutrition disorders : very common : hypercholesterolemia, hypertriglyceridemia (hyperlipemia) ; hypokalaemia ; increased lactic dehydrogenase (ldh) common : liver function tests abnormal ; increased sgot, increased sgpt.

Contexte n°7698 (7 formes dont 3 vedettes), Densité info.=0.43
metabolism and nutrition disorders : common : hypertriglyceridaemia, hyperglycaemia

Contexte n°8856 (11 formes dont 3 vedettes), Densité info.=0.27
metabolism and nutrition disorders : abnormal renal function tests (increased creatinine, bun)

Contexte n°9771 (9 formes dont 3 vedettes), Densité info.=0.33
metabolism and nutrition disorders : weight gain or loss, oedema.
Contexte n°11578 (13 formes dont 3 vedettes), Densité info.=0.23
metabolism and nutrition disorders very common weight loss common decreased appetite, weight increase

Legend: f [s] c
f = co-frequency
s = specificity
c = number of contexts

Exploring collocation networks: Russian

[Figure: collocation network; recoverable edge labels: c=10500, c=752]

Contexte n°150 (10 formes dont 4 vedettes), Densité info.=0.40
со стороны обмена веществ : обострение сахарного диабета, отеки или обезвоживание.

Contexte n°832 (37 formes dont 4 vedettes), Densité info.=0.11
побочное действие : со стороны обмена веществ : после назначения натеглинида, как и при применении других гипогликемических препаратов, были отмечены симптомы, предположительно свидетельствующие о развитии гипогликемии, такие как повышенная потливость, тремор, головокружение, повышенный аппетит, сердцебиение, тошнота, слабость, недомогание.

Contexte n°647 (16 formes dont 4 vedettes), Densité info.=0.25
со стороны пищеварительной системы : возможны тошнота, рвота, повышение активности аст, алт и ггт ; описаны случаи гепатита.

Contexte n°930 (18 formes dont 4 vedettes), Densité info.=0.22
со стороны обмена веществ : гиперпролактинемия, увеличение (редко уменьшение) массы тела, сахарный диабет, гипергликемия, диабетический кетоацидоз, диабетическая кома, зоб.

Contexte n°718 (23 formes dont 4 vedettes), Densité info.=0.17
побочное действие : со стороны пищеварительной системы : возможны боли и дискомфорт в эпигастральной области, тошнота, рвота, диарея, снижение аппетита, повышение активности печеночных трансаминаз.

Contexte n°1439 (10 formes dont 4 vedettes), Densité info.=0.40
со стороны обмена веществ : гипергликемия, сахарный диабет, кетоацидоз, ожирение, дегидратация.
Contexte n°873 (31 formes dont 4 vedettes), Densité info.=0.13
со стороны пищеварительной системы : часто-повышение активности ггт ; возможны-повышение активности алт, аст, щф и уровня общего билирубина, тошнота, рвота, диарея, боли в животе ; в единичных случаях-желтуха, тяжелые гепатотоксические реакции.

Contexte n°1459 (12 formes dont 4 vedettes), Densité info.=0.33
со стороны обмена веществ : обезвоживание, обострение сахарного диабета, ожирение по центральному типу.

Legend: f [s] c
f = co-frequency
s = specificity
c = number of contexts

Combining with French-English lexicon extracted from parallel corpus…

- Automatic segmentation into textual units (Lexico3): Forms, Repeated Segments (Russian-French-English).
- Identification of anchor pairs (starting points): frequency counts, cognates, general words, French/English terminology… (syndrome / syndrome / синдром).
- Trilingual collocation networks (COOCS): identification of similar context vectors.
- Semi-automatic segmentation into terminological units.
- Cross-language check.
- Expert validation.

Building trilingual terminology: collaboration domain expert/corpus linguist

Two different kinds of knowledge / skills:

From corpus linguist:
- Methodological knowledge: tools and methods for text exploration.
- Quantitative results on corpora.

From domain expert:
- Domain-specific knowledge on ADRs.
- Choice of relevant terms/contexts when several variants are attested in texts.

Results: trilingual lexicon on the Web

PERTOMed Server: http://baneyx.net/SPIP/

Each trilingual entry comprises the following fields:
- Simple term (with possible variants)
- Abbreviation (if applicable)
- Related composed term(s)
- Domain(s)
- Medical product(s) concerned
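The entry fields listed above can be pictured as a small record type. The deck does not give the actual schema used on the PERTOMed server, so this is only a sketch: the class name, the element names (`entry`, `simpleTerm`, `form`, `variant`, `composedTerm`, `domain`, `product`) and the sample values are hypothetical.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class TrilingualEntry:
    simple_term: dict                                    # language code -> term
    variants: dict = field(default_factory=dict)         # language code -> list of variants
    abbreviation: dict = field(default_factory=dict)     # language code -> abbreviation, if any
    composed_terms: list = field(default_factory=list)   # each item: language code -> multiword term
    domains: list = field(default_factory=list)          # each item: language code -> domain label
    products: list = field(default_factory=list)         # medicinal product names


def _multilingual(parent, tag, values):
    """Add one element holding a <form lang="..."> child per language."""
    node = ET.SubElement(parent, tag)
    for lang, text in values.items():
        ET.SubElement(node, "form", lang=lang).text = text
    return node


def to_xml(entry):
    """Serialise an entry to an XML string (hypothetical element names)."""
    root = ET.Element("entry")
    simple = _multilingual(root, "simpleTerm", entry.simple_term)
    for lang, forms in entry.variants.items():
        for form in forms:
            ET.SubElement(simple, "variant", lang=lang).text = form
    if entry.abbreviation:
        _multilingual(root, "abbreviation", entry.abbreviation)
    for term in entry.composed_terms:
        _multilingual(root, "composedTerm", term)
    for domain in entry.domains:
        _multilingual(root, "domain", domain)
    for product in entry.products:
        ET.SubElement(root, "product").text = product
    return ET.tostring(root, encoding="unicode")


# Illustrative values only; domains and products are left empty on purpose.
example = TrilingualEntry(
    simple_term={"ru": "синдром", "fr": "syndrome", "en": "syndrome"},
    composed_terms=[{"ru": "гриппоподобный синдром",
                     "fr": "syndrome pseudo-grippal",
                     "en": "flu-like syndrome"}],
)
xml_text = to_xml(example)
print(xml_text)
```

Keeping every field keyed by language code makes it straightforward to add or drop a language without changing the record layout.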
Results: choosing terms…

Russian segments:
- гриппоподобный синдром (13 occurrences)
- гриппоподобные симптомы (2 occurrences)
- симптомы гриппоподобного синдрома (1 occurrence)
- гриппоподобная симптоматика (1 occurrence)

French segments:
- syndrome pseudo-grippal (23 occurrences)
- syndrome pseudogrippal (2 occurrences)
- symptômes pseudo-grippaux (7 occurrences)

English segments:
- influenza-like symptoms (9 occurrences)
- influenza-like illness (6 occurrences)
- flu-like symptoms (8 occurrences)
- flu-like symptom complex (4 occurrences)
- flu-like syndrome (6 occurrences)
- flu-like illness (5 occurrences)

Selected entry: гриппоподобный синдром / syndrome pseudo-grippal / influenza-like symptom, flu-like symptom

Results: choosing domains…

Russian | French | English
Аллергические реакции | Réactions allergiques | Allergic reactions
Дерматология | Dermatologie | Dermatology
Дыхательная система | Appareil respiratoire | Respiratory system disorders
Желудочно-кишечный тракт | Système gastro-intestinal | Gastro-intestinal system disorders
Костно-мышечная система | Système ostéo-musculaire | Musculo-skeletal system disorders
Кроветворение и кровь | Hématopoïèse et sang | Hematopoiesis and blood
Местные реакции | Réactions locales au traitement | Application site disorders
Мочеполовая система | Appareil uro-génital | Urogenital disorders
Нервная система | Système nerveux | Nervous system
Обмен веществ | Métabolisme | Metabolic disorders
Организм в целом | Etat général | Body as a whole-general disorders
Органы чувств | Troubles des organes sensoriels | Sensory system disorders
Печеночно-билиарная система | Foie et voies biliaires | Liver and biliary system disorders
Прочие | Divers | Other
Психические нарушения | Troubles psychiatriques | Psychiatric disorders
Сердечно-сосудистая система | Appareil cardio-vasculaire | Cardiovascular disorders

Discussion: positive results

- 430 validated trilingual terminological entries in XML format
- 2002 simple terminological records (single word terms)
- 1006 complex terminological records (~50%) (multiword terms)

Co-occurrence relations:

Simple term: диабет / diabète / diabetes
Context: диабетическая кома / coma diabétique / diabetic coma
Domain: Обмен веществ / Troubles du métabolisme / Metabolism disorders
Medical product: RLS Olanzapine

Simple term: кома / coma / coma
Context: диабетическая кома / coma diabétique / diabetic coma
Domain: Обмен веществ / Troubles du métabolisme / Metabolism disorders
Medical product: RLS Olanzapine

Discussion: limits

- Lexical coverage.
- Contextual access.
- Presentation (no visual aids for navigation yet…).
- Evaluation difficulties: choice of criteria; comparable resources needed.

Conclusions

- Creating terminological resources from comparable corpora runs up against the intrinsic heterogeneity of the texts.
- The challenge of exploring texts coming from different cultural and linguistic sources should be taken into account in the feasibility study of a terminology project.
- Creating a Russian Internet corpus in the field of Pharmacovigilance is pioneering work.
- The textometric approach to exploring comparable corpora gives encouraging results.
- Our methods should be improved as new tools and resources for processing Russian texts become available.
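The textometric approach mentioned in the conclusions rests on repeated segments, defined earlier as series of consecutive forms with a corpus frequency of at least 2. The sketch below counts such segments and records how they are distributed over documents. It is a simplification under stated assumptions, not the Lexico3 algorithm: the delimiting characters only separate forms, segments may run across them, and the function names, the `max_len` bound and the two sample documents are invented for illustration.

```python
from collections import Counter, defaultdict

# Delimiting characters as listed for the corpus statistics.
DELIMS = ".,:;!?/_-\\\"'()[]{}§$"
_TABLE = str.maketrans({c: " " for c in DELIMS})


def forms(text):
    """Lower-case the text and split it into forms on delimiters and spaces."""
    return text.lower().translate(_TABLE).split()


def repeated_segments(docs, max_len=5, min_freq=2):
    """Return {segment: (total frequency, {doc_id: frequency})}.

    Only segments of 2..max_len consecutive forms whose total frequency
    reaches `min_freq` are kept.
    """
    total = Counter()
    per_doc = defaultdict(Counter)
    for doc_id, text in docs.items():
        toks = forms(text)
        for n in range(2, max_len + 1):
            for i in range(len(toks) - n + 1):
                seg = " ".join(toks[i:i + n])
                total[seg] += 1
                per_doc[seg][doc_id] += 1
    return {seg: (freq, dict(per_doc[seg]))
            for seg, freq in total.items() if freq >= min_freq}


# Two invented product descriptions in the style of the Russian corpus.
docs = {
    "d1": "Со стороны ЦНС: головная боль. Со стороны кожных покровов: сыпь.",
    "d2": "Со стороны ЦНС: головокружение.",
}
segments = repeated_segments(docs)
for seg, (freq, dist) in sorted(segments.items(), key=lambda kv: -kv[1][0]):
    print(freq, seg, dist)
```

The per-document counts are what a distribution inventory would display, one column per document.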
Future work…

Intertextual exploration on the document level based on visual aids:

[Figure: DISTRIBUTION INVENTORY OF REPEATED SEGMENTS, showing per-document frequencies of segments beginning with "со стороны" (total frequency 216)]

Segments listed in the inventory:
- со стороны
- со стороны центральной и периферической нервной системы
- со стороны цнс
- со стороны цнс головная боль головокружение
- со стороны цнс и периферической нервной системы
- со стороны цнс возможны
- со стороны цнс возможны повышенная утомляемость головная боль
- со стороны дыхательной системы
- со стороны дыхательной системы возможны
- со стороны кожных покровов
- со стороны кожных покровов алопеция
- со стороны кожных покровов сыпь
- со стороны костно мышечной системы
- со стороны крови
- со стороны лабораторных показателей
- со стороны мочевыделительной системы

Publications on the PERTOMed Project

- Baneyx A., Charlet J., Jaulent M.-C. (2005) "Building medical ontologies based on terminology extraction from texts: methodological propositions". In Proceedings of the 10th Conference on Artificial Intelligence in Medicine in Europe, Lecture Notes in Computer Science, Aberdeen, GB, July 2005. Springer.
- Jaulent M.-C., Charlet J. (2006) "PERTOMed : Production et évaluation de ressources terminologiques et ontologiques dans le domaine de la médecine". PERTOMed : Rapport de fin de projet, INSERM U729.
- Nuk I., Ivanova J. (2005) "Création d'une terminologie français/russe dans le domaine de la pharmacovigilance". Mémoire de DESS (dir. Monique Slodzian), Centre de Recherche en Ingénierie Multilingue, INaLCO.
- Ozdowska S., Névéol A., Thirion B. (2005) "Traduction compositionnelle automatique de bitermes dans des corpus anglais/français alignés". Actes de la Conférence Terminologie et Intelligence Artificielle, TIA'05, Rouen, France.