Synthesis Experiments | Benchmark Pt.1
21 October 2020
Selecting training datasets
Since we will mainly be working with medical terminology, I searched OPUS for a dataset better suited to this purpose than the EN-DE toy dataset. I came across EMEA, "a parallel corpus made out of PDF documents from the European Medicines Agency". I chose the EN-FR dataset (in nice MOSES format) to experiment with, partly because I am familiar with both languages.
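For reference, a MOSES-format download is a pair of plain-text files in which line N of one file is the translation of line N of the other. Below is a minimal sketch of how such a pair could be read into sentence pairs. The function name `read_parallel` is hypothetical and is not part of synthesis.py; the file paths are whatever the downloaded EN and FR files happen to be called.

```python
from itertools import zip_longest
from typing import Iterator, Tuple


def read_parallel(src_path: str, tgt_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned files.

    Hypothetical helper: assumes one sentence per line and UTF-8 encoding.
    """
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        # zip_longest pads the shorter file with None, which lets us
        # detect a length mismatch instead of silently truncating.
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                raise ValueError(
                    f"files are not line-aligned: one ends at line {line_no}"
                )
            yield s.rstrip("\n"), t.rstrip("\n")
```

Calling `list(read_parallel(en_file, fr_file))` would give the full list of pairs; iterating lazily is probably preferable for a corpus of this size.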
TODO
- Modify synthesis.py to take in two files for terminologies provided by Facebook
- Run experiment to get preliminary results