Synthesis Experiments | Benchmark Pt.1

Selecting training datasets

Since we will mainly be working with medical terminology, I searched OPUS for datasets better suited to this purpose than the EN-DE toy dataset. I came across EMEA, “a parallel corpus made out of PDF documents from the European Medicines Agency”. I chose the EN-FR data (conveniently available in Moses format) to experiment with, partly because I am familiar with both languages.
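
For reference, a Moses-format download is a pair of plain-text files, one per language, with one sentence per line, where line i of one file is the translation of line i of the other. Below is a minimal sketch of how such a pair could be read into sentence pairs. The function name `read_moses_pairs` is hypothetical, and the file names in the usage note are an assumption about how the download is named, not something confirmed here.

```python
from itertools import zip_longest


def read_moses_pairs(src_path, tgt_path, encoding="utf-8"):
    """Yield (source, target) sentence pairs from two line-aligned files.

    Assumes Moses-style plain text: one sentence per line, with line i of
    the source file aligned to line i of the target file.
    """
    with open(src_path, encoding=encoding) as src, \
            open(tgt_path, encoding=encoding) as tgt:
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            # zip_longest pads the shorter file with None, which would mean
            # the two files are no longer aligned line by line.
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {line_no}")
            s, t = s.strip(), t.strip()
            # Skip pairs where either side is empty after stripping.
            if s and t:
                yield s, t
```

Assuming the extracted files are called something like `EMEA.en-fr.en` and `EMEA.en-fr.fr`, usage would be `pairs = list(read_moses_pairs("EMEA.en-fr.en", "EMEA.en-fr.fr"))`. Raising on a length mismatch is a deliberate choice: a silently truncated file would otherwise misalign every following pair.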

TODO