Synthesis Experiments | Benchmark Pt.9

Using Moses

The monolingual datasets and terminologies have been tokenized with the Moses tokenizer, and HTML special characters (entities) have been replaced. A new experiment is now running on the pre-tokenized datasets.
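For reference, a minimal sketch of this preprocessing step, assuming the Python sacremoses port of the Moses tokenizer; the file names are placeholders, not the actual dataset paths:

    import html
    from sacremoses import MosesTokenizer

    def preprocess(lines, lang):
        """Unescape HTML entities, then apply the Moses tokenizer."""
        tok = MosesTokenizer(lang=lang)
        for line in lines:
            clean = html.unescape(line)  # replaces entities such as &amp; and &#39;
            yield tok.tokenize(clean, return_str=True)

    # hypothetical corpus files, one sentence per line
    with open("corpus.fr") as src, open("corpus.tok.fr", "w") as out:
        for sent in preprocess(src, lang="fr"):
            out.write(sent + "\n")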

Results

Embeddings are now generated for terms such as covid19 and H1N1, so the new tokenizer seems to work well. However, some terms still have no embeddings (a quick vocabulary check is sketched after the table):

FRENCH (count)       ENGLISH (count)
c19 (472)            c19 (5199)
cv19 (11)            cv19 (34)
cov19 (4)            cov19 (20)
CV-19 (0)            CV-19 (0)
Isopropanol (62)     Isopropanol (208)
vivid-19 (0)         vivid-19 (0)
Télétravail (311)    WFH (270)
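A coverage audit like the one above can be reproduced with a short check, assuming the embeddings are stored in word2vec text format and loaded with gensim; the file name and term list below are placeholders:

    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format("embeddings.fr.vec", binary=False)

    # terms from the table; "MISSING" means no embedding was generated
    terms = ["covid19", "c19", "cv19", "cov19", "CV-19", "vivid-19"]
    for term in terms:
        status = "ok" if term in vectors.key_to_index else "MISSING"
        print(f"{term}: {status}")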

There are also three term pairs, corona/corona, germicide/germicide, and Wuhan/Wuhan, for which no similar terms could be found in the EMEA parallel corpora; these pairs therefore have no replacements (see the similarity-lookup sketch below).
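As a rough illustration of how replacement candidates could be looked up by embedding similarity, here is a hedged sketch using gensim's nearest-neighbour query; this is an assumption about the approach, not the pipeline's actual method, and the file name is a placeholder:

    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format("embeddings.en.vec", binary=False)

    for term in ["corona", "germicide", "Wuhan"]:
        if term in vectors.key_to_index:
            # top-3 in-vocabulary candidates ranked by cosine similarity
            print(term, "->", vectors.most_similar(term, topn=3))
        else:
            print(term, "-> no embedding, so no replacement candidate")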