Synthesis Experiments | Benchmark Pt.3
24 October 2020

Questions that remain
From the benchmark:
- What could be the cause of the low similarity level in generated replacement pairs?
- I used the EMEA dataset as both the parallel corpus and the monolingual corpora input to the program, but it may not contain all the glossary terms and their translations; could that be the problem?
- Question for Shuhao: do the TICO-19 datasets contain all the glossary terms? Perhaps they would be a more appropriate set of monolingual corpora to use?
- How to get the model to work with multi-word glossary terms?
- The embeddings currently generated by training on the corpus only seem to cover single words; perhaps modify the process to account for multi-word terms?
- Match the glossary terms (with the ECO method) to the newly generated embeddings? (One possible way is sketched after this list.)
- Other approaches?
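
One possible way to make the current single-word embeddings usable for multi-word glossary terms, purely as a sketch and not what the program does today, is to average the component word vectors and then match source and target terms by cosine similarity. The `embeddings` dict and the `phrase_vector`/`best_match` helpers below are hypothetical names; the vectors are assumed to be whatever the model trained on the corpus produces.

```python
import numpy as np

def phrase_vector(phrase, embeddings):
    """Represent a (possibly multi-word) term as the mean of its word vectors.

    `embeddings` is assumed to map a word to a 1-D numpy array
    (e.g. vectors exported from the model trained on the corpus);
    out-of-vocabulary words are simply skipped.
    """
    vecs = [embeddings[w] for w in phrase.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else None

def best_match(term, candidates, embeddings):
    """Return the candidate phrase whose vector is closest to the term's
    vector by cosine similarity, together with that similarity."""
    tv = phrase_vector(term, embeddings)
    if tv is None:
        return None, 0.0
    best, best_sim = None, -1.0
    for cand in candidates:
        cv = phrase_vector(cand, embeddings)
        if cv is None:
            continue
        sim = float(np.dot(tv, cv) / (np.linalg.norm(tv) * np.linalg.norm(cv)))
        if sim > best_sim:
            best, best_sim = cand, sim
    return best, best_sim
```

An alternative would be to pre-process the corpus so that known multi-word terms are joined into a single token (e.g. "adverse_reaction") before training, so each term gets its own embedding directly; averaging is just the cheaper option that needs no retraining.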
From the paper:
- How are the edge cases handled when generating embeddings? For a fixed window size $c$, how are the embeddings for, let's say, the first and last word of the phrase computed? (My current guess is sketched after this list.)
- How to integrate this into the current program? Perhaps generate single embeddings for the terminologies and find a match? Still not sure exactly how to handle this…
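
I have not checked this against the paper yet, but the usual convention for a fixed window size $c$ (as in skip-gram/CBOW-style training) is simply to clip the window at the boundary, so the first word of a phrase only contributes right context and the last word only left context. A minimal illustration with a made-up phrase:

```python
def context_window(tokens, i, c):
    """Up to `c` context words on each side of position `i`,
    clipped at the phrase boundaries."""
    return tokens[max(0, i - c):i] + tokens[i + 1:i + 1 + c]

# Example with window size c = 2: edge words simply get fewer context words.
tokens = ["adverse", "drug", "reaction", "reported"]
for i, word in enumerate(tokens):
    print(word, "->", context_window(tokens, i, 2))
```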