Synthesis Experiments | Benchmark Pt.9
11 January 2021

Using Moses
The monolingual datasets and terminologies have been tokenized with the Moses tokenizer, and HTML special characters have been replaced with their literal equivalents. A new experiment is now being run with the pre-tokenized datasets.
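As a concrete illustration, here is a minimal sketch of the pre-tokenization step, assuming the sacremoses Python port of the Moses tokenizer; the file names are placeholders, and the actual pipeline may differ:

```python
# Minimal sketch: unescape HTML entities, then Moses-tokenize line by line.
# Assumes the sacremoses package; "monolingual.fr" is a placeholder path.
import html
from sacremoses import MosesTokenizer

tokenizer = MosesTokenizer(lang="fr")

with open("monolingual.fr", encoding="utf-8") as src, \
        open("monolingual.tok.fr", "w", encoding="utf-8") as out:
    for line in src:
        # Replace HTML special characters (&amp;, &quot;, ...) with
        # their literal equivalents before tokenizing.
        line = html.unescape(line)
        out.write(tokenizer.tokenize(line, return_str=True) + "\n")
```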
Results
Embeddings are now generated for terms such as covid19 and H1N1, so the new tokenizer seems to work well. However, some terms still have no embeddings (see the vocabulary-check sketch after the table):
| FRENCH | ENGLISH |
| --- | --- |
| c19 (472) | c19 (5199) |
| cv19 (11) | cv19 (34) |
| cov19 (4) | cov19 (20) |
| CV-19 (0) | CV-19 (0) |
| Isopropanol (62) | Isopropanol (208) |
| vivid-19 (0) | vivid-19 (0) |
| Télétravail (311) | WFH (270) |
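A hedged sketch of how such gaps can be surfaced, assuming the embeddings are loadable with gensim 4.x KeyedVectors; the path and term list are placeholders:

```python
# List terminology entries that have no embedding in the trained vectors.
# Assumes gensim 4.x; "embeddings.fr.vec" is a placeholder path.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("embeddings.fr.vec")

terms = ["covid19", "H1N1", "c19", "cv19", "cov19", "CV-19", "vivid-19"]
missing = [t for t in terms if t not in vectors.key_to_index]
print("terms without embeddings:", missing)
```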
There are also three term pairs, corona/corona, germicide/germicide, and Wuhan/Wuhan, for which no similar terms could be found in the parallel corpora provided by EMEA; these pairs therefore have no replacements.
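For illustration, a minimal sketch of how such a similarity lookup might work, assuming a nearest-neighbour search over embeddings trained on the EMEA corpora with gensim 4.x; the path is a placeholder and the actual replacement procedure may differ:

```python
# Query nearest neighbours by cosine similarity as replacement candidates.
# Assumes gensim 4.x; "embeddings.emea.vec" is a placeholder path.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("embeddings.emea.vec")

for term in ["corona", "germicide", "Wuhan"]:
    if term in vectors.key_to_index:
        print(term, vectors.most_similar(term, topn=5))
    else:
        print(term, "-> no embedding, so no replacement candidate")
```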