Phrase-Based SMT with Shallow Tree-Phrases

Philippe Langlais and Fabrizio Gotti
RALI
Université de Montréal, Canada
http://rali.iro.umontreal.ca
June 8th
P. Langlais and F. Gotti (RALI)
SMT with Shallow Tree-Phrases
June 8th
1 / 34
Context
Recently, many syntax-aware SMT systems have been proposed
(Chiang, 2005; Ding and Palmer, 2005; Quirk et al., 2005; Galley
et al., 2004; Lin, 2004; Melamed, 2004; ...)
Phrases with gaps (Simard et al., 2005)
Middle ground between (Quirk et al., 2005) and (Simard et al.,
2005): the Tree-Phrase.
[Figure: example Tree-Phrase. Treelet headed by "président" with dependents "le" and "suppléant", paired with the English words "the", "acting", "speaker".]
1. Motivation
2. Training
3. Decoding
4. Experiments
5. Conclusions
Syntex
Syntex (Bourigault and Fabre, 2000): a robust syntactic parser available
for French and English
Source sentence
Request for federal funding
Syntex output
NOUN?S|request|Request|1|0|PREP;2
PREP|for|for|2|PREP;1|NOUNPREP;4
ADJ|federal|federal|3|ADJ;4|0
NOUN?S|funding|funding|4|NOUNPREP;2|ADJ;3
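The output above appears to hold one `|`-separated record per token. The following is a minimal Python sketch of reading it, under assumptions inferred from this example only (not from Syntex documentation): the fields are category, lemma, form, token id, governor and dependents; a relation is written `LABEL;id`; several relations in one field would be comma-separated; `0` means "none"; and the last record carries id 4, as the cross-references in the second and third records imply. The names `Token`, `parse_relations` and `parse_line` are hypothetical.

```python
from typing import List, NamedTuple, Tuple

Relation = Tuple[str, int]  # (label, id of the related token)

class Token(NamedTuple):
    cat: str
    lemma: str
    form: str
    idx: int
    governor: List[Relation]
    dependents: List[Relation]

def parse_relations(field: str) -> List[Relation]:
    """Parse a field such as 'PREP;2'; '0' is assumed to mean 'no relation'."""
    if field == "0":
        return []
    relations = []
    for item in field.split(","):  # assumed separator for multiple relations
        label, target = item.split(";")
        relations.append((label, int(target)))
    return relations

def parse_line(line: str) -> Token:
    """Split one record into its six assumed fields."""
    cat, lemma, form, idx, gov, deps = line.strip().split("|")
    return Token(cat, lemma, form, int(idx),
                 parse_relations(gov), parse_relations(deps))

SAMPLE = """\
NOUN?S|request|Request|1|0|PREP;2
PREP|for|for|2|PREP;1|NOUNPREP;4
ADJ|federal|federal|3|ADJ;4|0
NOUN?S|funding|funding|4|NOUNPREP;2|ADJ;3"""

tokens = [parse_line(line) for line in SAMPLE.splitlines()]
# Map each token id to the id of its governor (0 for the root).
heads = {t.idx: (t.governor[0][1] if t.governor else 0) for t in tokens}
print(heads)
```

The `heads` mapping is the only structure the later treelet-extraction step needs: each token id paired with the id of its governor.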
Syntex
We used Syntex to parse the French part of a Hansard bitext (≈ 1.7 M
pairs of English-French sentences)
[Figure: dependency tree produced by Syntex]
a demandé
    SUB: on
    OBJ: crédits
        DET: des
        ADJ: fédéraux
On a demandé des crédits fédéraux
(Request for federal funding)
Building the Tree-Phrase Table
1. Token alignment: a demandé ≡ request for, fédéraux ≡ federal,
   crédits ≡ funding
2. Treelet acquisition (by parsing the source material with Syntex)
   [Figure: two depth-1 treelets. "a demandé" with dependents "on" and "crédits"; "crédits" with dependents "des" and "fédéraux".]
3. Tree-phrase extraction (by simple token-alignment projection)
   treelet:        {{on@-1} a demandé {crédits@2}}
   elastic phrase: |request@0| |for@1| |funding@3|
   treelet:        {{des@-1} crédits {fédéraux@1}}
   elastic phrase: |federal@0| |funding@1|
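The three steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the sentence is given as a token list with one governor index per token (`None` for the root), the alignment maps source positions to target positions, source offsets are relative to the treelet head, and target offsets are relative to the first projected target word. The function name `extract_tree_phrases` and the single-token form `a_demandé` are choices made for this example.

```python
from typing import Dict, List, Optional, Tuple

def extract_tree_phrases(
    src: List[str],
    heads: List[Optional[int]],
    tgt: List[str],
    align: Dict[int, List[int]],
) -> List[Tuple[str, str]]:
    """Return (treelet, elastic phrase) pairs for every depth-1 treelet."""
    # Group the dependents of each head: one depth-1 treelet per head.
    deps: Dict[int, List[int]] = {}
    for i, h in enumerate(heads):
        if h is not None:
            deps.setdefault(h, []).append(i)

    pairs = []
    for head, children in sorted(deps.items()):
        nodes = sorted(children + [head])
        # Dependents are written {word@offset}, offset relative to the head.
        body = " ".join(
            src[i] if i == head else "{%s@%d}" % (src[i], i - head)
            for i in nodes
        )
        treelet = "{" + body + "}"
        # Project the treelet tokens through the word alignment.
        tpos = sorted({j for i in nodes for j in align.get(i, [])})
        if not tpos:
            continue  # nothing aligned: no tree-phrase for this treelet
        elastic = " ".join("|%s@%d|" % (tgt[j], j - tpos[0]) for j in tpos)
        pairs.append((treelet, elastic))
    return pairs

src = ["on", "a_demandé", "des", "crédits", "fédéraux"]
heads = [1, None, 3, 1, 3]           # governor index of each source token
tgt = ["request", "for", "federal", "funding"]
align = {1: [0, 1], 4: [2], 3: [3]}  # a_demandé=request for, fédéraux=federal, crédits=funding

for treelet, elastic in extract_tree_phrases(src, heads, tgt, align):
    print(treelet, "|||", elastic)
```

Unaligned source tokens ("on", "des") stay in the treelet but contribute nothing to the elastic phrase, which is why the first elastic phrase has a hole at offset 2.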
The Tree-Phrase Table
s      |treelet|    %-c      |elastic phrase|   %-c
2        936 180    48.2%         3 857 234     55.9%
3      1 399 288    39.2%         2 588 553     49.2%
4        464 062    35.6%           621 742     39.7%
5         45 234    21.1%            48 509     24.6%
6          1 898    10.7%             1 925     14.4%
7             52     7.7%                52      7.7%
8              4     0.0%                 4      0.0%
all    2 846 718    41.3%         7 118 019     51.8%
Example of a long Tree-Phrase
tl:  {{par@-3} {un@-2} {excellent@-1} Chili {con@1} {carne@2} {servi@3} {Léger@6}}
ep:  |culmination@0| |a@2| |Chili@3| |con@4| |carne@5| |feast@6| |provided@7| |Leger@11|
src: La soirée s’ est terminée en beauté par un excellent Chili con carne servi par Mme Léger .
trg: The culmination was a Chili con carne feast provided by Ms . Leger .
The 5 Most Frequent Tree-Phrases
freq      tree-phrase
75 051    tl: {{monsieur@-2} {Le@-1} président}
          ep: |Mr@0| |.@1| |Speaker@2|
32 601    tl: {{Le@-1} gouvernement}
          ep: |the@0| |Government@1|
26 347    tl: {{de@-2} {les@-1} voix}
          ep: |Some@0| |Honourable@1| |Members@2|
14 515    tl: {{Le@-1} ministre}
          ep: |the@0| |Minister@1|
13 043    tl: {{Madame@-2} {la@-1} Présidente}
          ep: |Madam@0| |Speaker@1|
A Standard Phrase-Based decoder
Phrase table:
    the black cat ||| le chat noir
    NULL ||| a
    again, ||| encore une fois
    crossed the street ||| traverser la rue
SRC: again , the black cat crossed the street
TRG fragments, in the order they appear on the successive builds of this slide:
    1. le chat noir
    2. a
    3. encore une fois
    4. traversé la rue
    5. "generation" (label added on the last build)
A Tree-Phrase Based Decoder
Phrase-Phrase units:
    mr le président ||| the president
    président a_demandé hier ||| president requested yesterday
    des crédits fédéraux ||| federal funding
    hier matin ||| yesterday morning
Tree-Phrase units:
    {des@−1} crédits {fédéraux@1} ||| federal@0 funding@1
    {mr@−2} {le@−1} président ||| the@0 president@1
    {président@−1} a_demandé {crédits@4} ||| president@0 requested@1 funding@3
SRC: mr le président a_demandé hier matin des crédits fédéraux
TRG, over the successive builds of this slide:
    1. the president
    2. the president requested GAP funding
    3. the president requested federal funding
    4. "yesterday morning" is added
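The way an elastic phrase leaves a GAP that a later unit fills can be sketched as follows. This is a hypothetical illustration of the bookkeeping only (no search, no scoring): the target hypothesis is a list in which `None` marks a gap, an elastic phrase is a list of (word, offset) pairs, and overlaying a unit fails if it would overwrite a different word. The names `place` and `render`, and the position at which "yesterday morning" is appended, are assumptions made for this example.

```python
from typing import List, Optional, Tuple

GAP = None
Hyp = List[Optional[str]]

def place(hyp: Hyp, elastic: List[Tuple[str, int]], start: int) -> Optional[Hyp]:
    """Overlay an elastic phrase on hyp, its offset 0 landing at index start.

    Returns the extended hypothesis, or None if a slot already holds a
    different word.
    """
    new = list(hyp)
    for word, offset in elastic:
        pos = start + offset
        while len(new) <= pos:          # grow the hypothesis with gaps
            new.append(GAP)
        if new[pos] is not GAP and new[pos] != word:
            return None                 # conflict with an existing word
        new[pos] = word
    return new

def render(hyp: Hyp) -> str:
    return " ".join("GAP" if w is GAP else w for w in hyp)

h0 = ["the", "president"]
# president@0 requested@1 funding@3 overlaps "president" and opens a gap.
h1 = place(h0, [("president", 0), ("requested", 1), ("funding", 3)], 1)
# federal@0 funding@1 fills the gap and re-uses "funding".
h2 = place(h1, [("federal", 0), ("funding", 1)], 3)
# Appending at the end is an assumption; the slide does not fix the position.
h3 = place(h2, [("yesterday", 0), ("morning", 1)], len(h2))

print(render(h1))
print(render(h2))
print(render(h3))
```

Allowing an overlay to match an existing word is what lets a head token produced by one unit ("president", "funding") be shared with the next.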
Embedded Components
Log-linear combination of 9 components:
    Phrase-Phrase table: relative frequency, IBM-1-like score
    Tree-Phrase table: relative frequency, IBM-1-like score
    Language model (Kneser-Ney implementation in SRILM)
    Distortion (adapted from Pharaoh)
    Word and Phrase penalties
    Future cost (adapted from Pharaoh)
Coefficients tuned by grid search
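A log-linear combination scores a hypothesis as a weighted sum of its component scores, and a grid search tries every combination of candidate weights. The sketch below illustrates both under stated assumptions: the nine feature names are hypothetical labels for the components listed above, the feature values are made up for the example, and `objective` stands for whatever development-set measure is being maximised (the slides do not say which).

```python
import itertools
from typing import Callable, Dict, List, Tuple

# Hypothetical names for the 9 components listed on the slide.
FEATURES = [
    "pp_relfreq", "pp_ibm1",      # Phrase-Phrase table
    "tp_relfreq", "tp_ibm1",      # Tree-Phrase table
    "lm",                         # language model
    "distortion",
    "word_penalty", "phrase_penalty",
    "future_cost",
]

def loglinear_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of the component scores (log domain)."""
    return sum(weights[name] * features[name] for name in FEATURES)

def grid_search(
    candidates: Dict[str, List[float]],
    objective: Callable[[Dict[str, float]], float],
) -> Tuple[Dict[str, float], float]:
    """Try every combination of candidate weights; keep the best one."""
    names = list(candidates)
    best_weights, best_value = {}, float("-inf")
    for combo in itertools.product(*(candidates[n] for n in names)):
        weights = dict(zip(names, combo))
        value = objective(weights)
        if value > best_value:
            best_weights, best_value = weights, value
    return best_weights, best_value

# Made-up feature values for one hypothesis, for illustration only.
feats = {name: -1.0 for name in FEATURES}
feats["lm"] = -4.0
uniform = {name: 1.0 for name in FEATURES}
print(loglinear_score(feats, uniform))

# Toy objective standing in for a development-set metric.
grid = {name: [1.0] for name in FEATURES}
grid["lm"] = [0.0, 0.5, 1.0]
best, value = grid_search(grid, lambda w: -(w["lm"] - 0.5) ** 2)
print(best["lm"], value)
```

The number of combinations grows exponentially with the number of coefficients, so in practice the grid has to stay coarse.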
Updating the Language Model Score
[Figure: two successive hypotheses over the source "on a_demandé des crédits fédéraux".
 h:  TL: {on@−1} a_demandé {crédits@2}, EP: request@0 for@1 funding@3;
     target words with their labels: request (U), for (B), funding (U); S[3].
 h’: TL: {des@−1} crédits {fédéraux@1}, EP: federal@0 funding@1;
     target words with their labels: request (U), for (B), federal (T), funding (T); S[5].]
Need for a “definition” of two compatible treelets
Two treelets are compatible if they have no token in common, or exactly
one, which must be the head of one treelet and a dependent in the other.
Compatible:
    [Figure: treelet "a demandé" (dependents "on", "crédits") and treelet "crédits" (dependents "des", "fédéraux"); the shared token "crédits" is a dependent in the first and the head of the second.]
Not compatible:
    [Figure: two treelets both headed by "président", with dependents "mr.", "le" and "suppléant" between them.]
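The compatibility rule can be written down directly. In this minimal sketch a treelet is represented by the source position of its head and the set of positions of its dependents; this representation and the names `Treelet` and `compatible` are assumptions made for the example.

```python
from typing import FrozenSet, NamedTuple

class Treelet(NamedTuple):
    head: int                 # source position of the head token
    deps: FrozenSet[int]      # source positions of the dependents

    @property
    def nodes(self) -> FrozenSet[int]:
        return self.deps | {self.head}

def compatible(a: Treelet, b: Treelet) -> bool:
    """No shared token, or exactly one that is head in one and dependent in the other."""
    common = a.nodes & b.nodes
    if not common:
        return True
    if len(common) != 1:
        return False
    (tok,) = common
    return (tok == a.head and tok in b.deps) or (tok == b.head and tok in a.deps)

# on(0) a_demandé(1) des(2) crédits(3) fédéraux(4)
t_demande = Treelet(head=1, deps=frozenset({0, 3}))
t_credits = Treelet(head=3, deps=frozenset({2, 4}))
print(compatible(t_demande, t_credits))

# mr(0) le(1) président(2) suppléant(3): both treelets are headed by "président".
t_pres_a = Treelet(head=2, deps=frozenset({0, 1}))
t_pres_b = Treelet(head=2, deps=frozenset({3}))
print(compatible(t_pres_a, t_pres_b))
```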
Some (other) heuristics embedded
Tree-Phrase and Phrase-Phrase units compete.
SRC: mr le président a_demandé hier matin des crédits fédéraux
    {président@−1} a_demandé {crédits@4} ||| president@0 requested@1 funding@3
SRC: mr le président a_demandé hier matin des crédits fédéraux
    a_demandé hier matin ||| requested yesterday morning
Corpora (In-house Canadian Hansards)
               train           dev             test
sentences      1 699 592       500             8 000
e-toks         27 717 389      8 160           129 601
f-toks         30 425 066      8 946           143 237
e-toks/sent    16.3 (± 9.0)    16.3 (± 9.1)    16.2 (± 9.1)
f-toks/sent    17.9 (± 9.5)    17.9 (± 9.5)    17.9 (± 9.4)
e-types        164 255         2 224           12 143
f-types        210 085         2 481           14 484
e-hapax        68 506          1 469           6 673
f-hapax        90 747          1 704           8 381
Test set: 16 (disjoint) slices of 500 sentences each.
Results are averaged over the slices.
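Averaging a metric over disjoint slices can be sketched as follows. The slides do not define the metrics, so this example assumes the usual definitions: word error rate (wer) as word-level edit distance divided by reference length, and sentence error rate (ser) as the share of sentences that differ from the reference. The function names and the toy sentences are made up for the illustration.

```python
from statistics import mean, pstdev
from typing import Callable, List, Sequence, Tuple

def edit_distance(hyp: Sequence[str], ref: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,              # extra hypothesis word
                           cur[j - 1] + 1,           # missing reference word
                           prev[j - 1] + (h != r)))  # substitution or match
        prev = cur
    return prev[-1]

def wer(hyps: List[str], refs: List[str]) -> float:
    errors = sum(edit_distance(h.split(), r.split()) for h, r in zip(hyps, refs))
    return 100.0 * errors / sum(len(r.split()) for r in refs)

def ser(hyps: List[str], refs: List[str]) -> float:
    return 100.0 * sum(h != r for h, r in zip(hyps, refs)) / len(refs)

def slices(items: List[str], size: int) -> List[List[str]]:
    return [items[i:i + size] for i in range(0, len(items), size)]

def averaged(metric: Callable[[List[str], List[str]], float],
             hyps: List[str], refs: List[str], size: int = 500) -> Tuple[float, float]:
    """Mean and (population) standard deviation of a metric over slices."""
    scores = [metric(h, r) for h, r in zip(slices(hyps, size), slices(refs, size))]
    return mean(scores), pstdev(scores)

# Toy data; slice size 1 only to keep the example small.
hyps = ["the cat sat", "a dog"]
refs = ["the cat sat", "the dog"]
print(wer(hyps, refs), ser(hyps, refs))
print(averaged(wer, hyps, refs, size=1))
```

With the 8 000-sentence test set and the default `size=500`, `slices` yields the 16 slices mentioned above.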
Results
MT engine    wer (%)         ser (%)         bleu (%)
pp           52.58 ± 1.2     94.20 ± 0.8     30.43 ± 1.4
tp 1         52.23 ± 1.1     93.98 ± 0.9     30.70 ± 1.2
Differences are significant at the 95% level for bleu and ser, and at the
99% level for wer (Wilcoxon signed-rank paired test)
Extending to depth-2 Tree-Phrases
[Figure: the dependency tree of "on a demandé des crédits fédéraux" (a demandé: SUB on, OBJ crédits; crédits: DET des, ADJ fédéraux) and, below it, a single depth-2 treelet covering on, a demandé, des, crédits and fédéraux.]
Results
MT engine    wer (%)         ser (%)         bleu (%)
pp           52.58 ± 1.2     94.20 ± 0.8     30.43 ± 1.4
tp 1         52.23 ± 1.1     93.98 ± 0.9     30.70 ± 1.2
tp 2         51.44 ± 1.2     92.55 ± 1.3     31.07 ± 1.3
Conclusions & Future Work
A viable approach to ad hoc syntax-based SMT
As lazy as phrase-based SMT
Statistically significant improvements on bleu-like metrics
Can be seen as:
    a variant of (Simard et al., 2005)
    a simplification of (Quirk et al., 2005)
Future Work
Easy things that we must investigate:
    We do not use parsing when translating
    We do not encode the word alignment in a Tree-Phrase
    Why not depth-n Tree-Phrases?
More thorough investigations:
    We did not consider elastic “gaps” in this study
    Unclear at this stage why we observe improvements