Synthesis Experiments | Benchmark Pt.9
11 January 2021

Using Moses
The monolingual datasets and terminologies have been tokenized with the Moses tokenizer, and HTML special characters have been replaced with their literal equivalents. A new experiment is now being run with the pre-tokenized datasets.
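As a concrete illustration, here is a minimal sketch of the pre-tokenization step, assuming the sacremoses Python port of the Moses tokenizer; the file names are placeholders, and the actual pipeline may differ:

```python
# Minimal sketch: unescape HTML entities, then Moses-tokenize line by line.
# Assumes the sacremoses package; "monolingual.fr" is a placeholder path.
import html
from sacremoses import MosesTokenizer

tokenizer = MosesTokenizer(lang="fr")

with open("monolingual.fr", encoding="utf-8") as src, \
        open("monolingual.tok.fr", "w", encoding="utf-8") as out:
    for line in src:
        # Replace HTML special characters (&amp;, &quot;, ...) with
        # their literal equivalents before tokenizing.
        line = html.unescape(line)
        out.write(tokenizer.tokenize(line, return_str=True) + "\n")
```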
Results
Embeddings are now generated for terms such as covid19 and H1N1, so the new tokenizer seems to work well. However, some terms still have no embeddings (see the vocabulary-check sketch after the table):
| FRENCH | ENGLISH |
| --- | --- |
| c19 (472) | c19 (5199) |
| cv19 (11) | cv19 (34) |
| cov19 (4) | cov19 (20) |
| CV-19 (0) | CV-19 (0) |
| Isopropanol (62) | Isopropanol (208) |
| vivid-19 (0) | vivid-19 (0) |
| Télétravail (311) | WFH (270) |
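A hedged sketch of how such gaps can be surfaced, assuming the embeddings are loadable with gensim 4.x KeyedVectors; the path and term list are placeholders:

```python
# List terminology entries that have no embedding in the trained vectors.
# Assumes gensim 4.x; "embeddings.fr.vec" is a placeholder path.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("embeddings.fr.vec")

terms = ["covid19", "H1N1", "c19", "cv19", "cov19", "CV-19", "vivid-19"]
missing = [t for t in terms if t not in vectors.key_to_index]
print("terms without embeddings:", missing)
```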
There are also three term pairs, corona/corona, germicide/germicide, and Wuhan/Wuhan, for which no similar terms could be found in the parallel corpora provided by EMEA; these pairs therefore have no replacements.
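For illustration, a minimal sketch of how such a similarity lookup might work, assuming a nearest-neighbour search over embeddings trained on the EMEA corpora with gensim 4.x; the path is a placeholder and the actual replacement procedure may differ:

```python
# Query nearest neighbours by cosine similarity as replacement candidates.
# Assumes gensim 4.x; "embeddings.emea.vec" is a placeholder path.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("embeddings.emea.vec")

for term in ["corona", "germicide", "Wuhan"]:
    if term in vectors.key_to_index:
        print(term, vectors.most_similar(term, topn=5))
    else:
        print(term, "-> no embedding, so no replacement candidate")
```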