Phrase-Based SMT with Shallow Tree-Phrases
Philippe Langlais and Fabrizio Gotti
RALI, Université de Montréal, Canada
http://rali.iro.umontreal.ca
June 8th

Context

- Recently, many syntax-aware SMT systems have been proposed (Chiang, 2005; Ding and Palmer, 2005; Quirk et al., 2005; Galley et al., 2004; Lin, 2004; Melamed, 2004; ...)
- Phrases with gaps (Simard et al., 2005)
- Middle ground between (Quirk et al., 2005) and (Simard et al., 2005): the Tree-Phrase.

[Figure: a tree-phrase pairing a French treelet headed by "président" (dependents include "le" and "suppléant") with the English words "the", "acting", "speaker".]

1 Motivation
2 Training
3 Decoding
4 Experiments
5 Conclusions

Syntex

Syntex (Bourigault and Fabre, 2000): a robust syntactic parser available for French and English.

Source sentence:
    Request for federal funding

Syntex output:
    NOUN?S|request|Request|1|0|PREP;2
    PREP|for|for|2|PREP;1|NOUNPREP;4
    ADJ|federal|federal|3|ADJ;4|0
    NOUN?S|funding|funding|4|NOUNPREP;2|ADJ;3

Syntex

We used Syntex to parse the French part of a Hansard bitext (≈ 1.7 M pairs of E-F sentences).

[Figure: dependency tree of "On a demandé des crédits fédéraux" (Request for federal funding). The head "a demandé" has a SUB dependent "on" and an OBJ dependent "crédits"; "crédits" has a DET dependent "des" and an ADJ dependent "fédéraux".]

Building the Tree-Phrase Table

1. Token alignment: a demandé ≡ request for, fédéraux ≡ federal, crédits ≡ funding
2. Treelets acquisition (by parsing the source material with Syntex)
   [Figure: two treelets, "a demandé" with dependents "on" and "crédits", and "crédits" with dependents "des" and "fédéraux".]
3. Tree-phrases extraction (by simple token-alignment projection)
    treelet         {{on@-1} a demandé {crédits@2}}
    elastic phrase  |request@0| |for@1| |funding@3|

    treelet         {{des@-1} crédits {fédéraux@1}}
    elastic phrase  |federal@0| |funding@1|

The Tree-Phrase Table

    s      |treelet|    %-c      |elastic phrase|   %-c
    2        936 180    48.2%        3 857 234      55.9%
    3      1 399 288    39.2%        2 588 553      49.2%
    4        464 062    35.6%          621 742      39.7%
    5         45 234    21.1%           48 509      24.6%
    6          1 898    10.7%            1 925      14.4%
    7             52     7.7%               52       7.7%
    8              4     0.0%                4       0.0%
    all    2 846 718    41.3%        7 118 019      51.8%

Example of a long Tree-Phrase

    tl   {{par@-3} {un@-2} {excellent@-1} Chili {con@1} {carne@2} {servi@3} {Léger@6}}
    ep   |culmination@0| |a@2| |Chili@3| |con@4| |carne@5| |feast@6| |provided@7| |Leger@11|
    src  La soirée s’ est terminée en beauté par un excellent Chili con carne servi par Mme Léger .
    trg  The culmination was a Chili con carne feast provided by Ms . Leger .

The 5 Most Frequent Tree-Phrases

    freq      tree-phrase
    75 051    tl  {{monsieur@-2} {Le@-1} président}
              ep  |Mr@0| |.@1| |Speaker@2|
    32 601    tl  {{Le@-1} gouvernement}
              ep  |the@0| |Government@1|
    26 347    tl  {{de@-2} {les@-1} voix}
              ep  |Some@0| |Honourable@1| |Members@2|
    14 515    tl  {{Le@-1} ministre}
              ep  |the@0| |Minister@1|
    13 043    tl  {{Madame@-2} {la@-1} Présidente}
              ep  |Madam@0| |Speaker@1|

3 Decoding

A Standard Phrase-Based decoder

Phrase table:
    the black cat ||| le chat noir
    NULL ||| a
    again, ||| encore une fois
    crossed the street ||| traverser la rue
SRC: again , the black cat crossed the street

Successive states of the target hypothesis, using the phrase table above:
    TRG: le chat noir
    TRG: le chat noir a
    TRG: le chat noir a encore une fois
    TRG: le chat noir a encore une fois traversé la rue

[On the last slide of this sequence, the final hypothesis carries the label "generation".]
A Tree-Phrase Based Decoder

Phrase-Phrase units:
    mr le président ||| the president
    président a_demandé hier ||| president requested yesterday
    des crédits fédéraux ||| federal funding
    hier matin ||| yesterday morning

Tree-Phrase units:
    {des@−1} crédits {fédéraux@1} ||| federal@0 funding@1
    {mr@−2} {le@−1} président ||| the@0 president@1
    {président@−1} a_demandé {crédits@3} ||| president@0 requested@1 funding@3

SRC: mr le président a_demandé hier matin des crédits fédéraux

Successive states of the target hypothesis:
    TRG: the president
    TRG: the president requested GAP funding
    TRG: the president requested federal funding
    TRG: the president requested federal funding yesterday morning

Embedded Components

Log-linear combination of 9 components:
- Phrase-Phrase table: relative frequency, IBM-1 like score
- Tree-Phrase table: relative frequency, IBM-1 like score
- Language model (Kneser-Ney implementation in SRILM)
- Distortion (adapted from Pharaoh)
- Word and Phrase penalty
- Future cost (adapted from Pharaoh)

Coefficients tuned by grid search.

Updating the Language Model Score

[Figure: a partial hypothesis h and its extension h'; each target word carries a label U, B or T.]
    h   (S[3])  target: request (U), for (B), funding (U)
                TL: {on@−1} a_demandé {crédits@2}
                EP: request@0 for@1 funding@3
    h'  (S[5])  target: request (U), for (B), federal (T), funding (T)
                TL: {des@−1} crédits {fédéraux@1}
                EP: federal@0 funding@1

Need for a “definition” of two compatible treelets

Either no token in common, or only one that must be the head in one treelet, and a dependent in the other.

Compatible:
    [Figure: treelet "a demandé" with dependents "on" and "crédits"; treelet "crédits" with dependents "des" and "fédéraux".]

Not compatible:
    [Figure: two treelets both headed by "président", with dependents "mr.", "le" and "suppléant".]

Some (other) heuristics embedded

Tree-Phrase and Phrase-Phrase units compete.

    SRC: mr le président a_demandé hier matin des crédits fédéraux
    {président@−1} a_demandé {crédits@4} ||| president@0 requested@1 funding@3

    SRC: mr le président a_demandé hier matin des crédits fédéraux
    a_demandé hier matin ||| requested yesterday morning

4 Experiments

Corpora (In-house Canadian Hansards)

                   train             dev              test
    sentences      1 699 592         500              8 000
    e-toks         27 717 389        8 160            129 601
    f-toks         30 425 066        8 946            143 237
    e-toks/sent    16.3 (± 9.0)      16.3 (± 9.1)     16.2 (± 9.1)
    f-toks/sent    17.9 (± 9.5)      17.9 (± 9.5)     17.9 (± 9.4)
    e-types        164 255           2 224            12 143
    f-types        210 085           2 481            14 484
    e-hapax        68 506            1 469            6 673
    f-hapax        90 747            1 704            8 381

Test set: 16 (disjoint) slices of 500 sentences each. Results are averaged over the slices.
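The slice-based protocol (16 disjoint slices of 500 sentences, scores averaged over the slices) could look roughly like the sketch below. It is only an illustration: the helper names `make_slices`, `average_over_slices` and `sentence_error_rate` are hypothetical, the data is a toy set, and the ± figure is assumed here to be a sample standard deviation across slices, which the slides do not state. The SER definition used (share of hypotheses that differ from their reference) is also an assumption.

```python
from statistics import mean, stdev


def make_slices(items, slice_size=500):
    """Split items into disjoint consecutive slices (any remainder is dropped)."""
    n_slices = len(items) // slice_size
    return [items[i * slice_size:(i + 1) * slice_size] for i in range(n_slices)]


def average_over_slices(items, metric, slice_size=500):
    """Score each slice with metric; return (mean, sample std dev, number of slices)."""
    scores = [metric(s) for s in make_slices(items, slice_size)]
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread, len(scores)


def sentence_error_rate(pairs):
    """Assumed SER definition: % of (hypothesis, reference) pairs that differ."""
    wrong = sum(1 for hyp, ref in pairs if hyp != ref)
    return 100.0 * wrong / len(pairs)


# Toy data only: 8 000 (hypothesis, reference) pairs, every 10th one correct.
toy = [("a", "a") if i % 10 == 0 else ("a", "b") for i in range(8000)]
avg, sd, n = average_over_slices(toy, sentence_error_rate)
print(n, round(avg, 2), round(sd, 2))
```

Corpus-level metrics such as bleu would be computed once per slice by an actual evaluation tool and then fed to the same averaging step.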
Results

    MT engine    wer (%)         ser (%)         bleu (%)
    pp           52.58 ± 1.2     94.20 ± 0.8     30.43 ± 1.4
    tp 1         52.23 ± 1.1     93.98 ± 0.9     30.70 ± 1.2

The differences are significant at the 95% level for bleu and ser, and at the 99% level for wer (Wilcoxon signed-rank paired test).

Extending to depth-2 Tree-Phrases

[Figure: the dependency tree of "on a demandé des crédits fédéraux" ("a demandé" with SUB "on" and OBJ "crédits"; "crédits" with DET "des" and ADJ "fédéraux"), and below it a single depth-2 treelet in which "a demandé" covers "on", "des", "crédits" and "fédéraux".]

Results

    MT engine    wer (%)         ser (%)         bleu (%)
    pp           52.58 ± 1.2     94.20 ± 0.8     30.43 ± 1.4
    tp 1         52.23 ± 1.1     93.98 ± 0.9     30.70 ± 1.2
    tp 2         51.44 ± 1.2     92.55 ± 1.3     31.07 ± 1.3

5 Conclusions

Conclusions & Future Work

- A viable approach to ad hoc syntax-based SMT
- As lazy as phrase-based SMT
- Statistically significant bleu-like improvements
- Can be seen as:
    - a variant of (Simard et al., 2005)
    - a simplification of (Quirk et al., 2005)

Future Work

Easy things that we must investigate:
- We do not use parsing when translating
- We do not encode the word alignment in a Tree-Phrase
- Why not depth-n Tree-Phrases?

More thorough investigations:
- We did not consider elastic “gaps” in this study
- Unclear at this stage why we observe improvements
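The training steps (treelet acquisition, then projection through the token alignment) and the compatibility rule for two treelets could be sketched as below. This is a minimal illustration, not the authors' implementation. It rests on several assumptions: the parse is given as a hypothetical head-index array, unaligned source tokens are simply skipped, and elastic-phrase offsets are counted from the first projected target token (which matches the examples on the slides). The function names `extract_tree_phrases` and `compatible` are made up for this sketch.

```python
def extract_tree_phrases(src, heads, alignment, trg):
    """Build depth-1 treelets and project them onto the target side.

    src, trg: token lists; heads[i]: position of the head of src[i] (-1 for the root);
    alignment: dict mapping a source position to a set of target positions.
    Returns a list of (treelet, elastic_phrase) strings.
    """
    pairs = []
    for h in range(len(src)):
        deps = [d for d in range(len(src)) if heads[d] == h]
        if not deps:
            continue  # a treelet needs a head and at least one dependent
        nodes = sorted(deps + [h])
        # Head written bare, dependents as {token@offset-from-head}.
        inner = " ".join(
            src[i] if i == h else "{%s@%d}" % (src[i], i - h) for i in nodes
        )
        # Project every treelet token through the alignment (assumption: skip unaligned).
        tpos = sorted({t for i in nodes for t in alignment.get(i, ())})
        if not tpos:
            continue
        first = tpos[0]
        elastic = " ".join("|%s@%d|" % (trg[t], t - first) for t in tpos)
        pairs.append(("{" + inner + "}", elastic))
    return pairs


def compatible(t1, t2):
    """Treelets as (head_position, frozenset_of_dependent_positions).

    Compatible if they share no token, or exactly one token that is the head
    of one treelet and a dependent in the other.
    """
    (h1, d1), (h2, d2) = t1, t2
    shared = ({h1} | d1) & ({h2} | d2)
    if not shared:
        return True
    if len(shared) > 1:
        return False
    (tok,) = shared
    return (tok == h1 and tok in d2) or (tok == h2 and tok in d1)


# The running example: "on a_demandé des crédits fédéraux" / "request for federal funding".
src = ["on", "a_demandé", "des", "crédits", "fédéraux"]
heads = [1, -1, 3, 1, 3]
trg = ["request", "for", "federal", "funding"]
alignment = {1: {0, 1}, 3: {3}, 4: {2}}
pairs = extract_tree_phrases(src, heads, alignment, trg)
for treelet, elastic in pairs:
    print(treelet, "|||", elastic)
```

On the running example, the two treelets share only "crédits", which is a dependent in the first and the head of the second, so `compatible((1, frozenset({0, 3})), (3, frozenset({2, 4})))` holds. Two treelets that share the same head do not pass the test.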