Synthesis Experiments | Benchmark Pt.2
22 October 2020

Modifying script
I started by creating a new script, benchmark.py (based on synthesis.py), which can handle two parallel files via the --monolingual-glossary flag, provided that they are aligned line by line (in this case, specifically the Facebook datasets).
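As a rough sketch of what the line-aligned input handling could look like (the function name and interface here are illustrative, not the actual benchmark.py code):

```python
def read_aligned_glossary(src_path, tgt_path):
    """Read two line-aligned glossary files into (source, target) term pairs.

    Hypothetical helper: assumes line i of src_path corresponds to
    line i of tgt_path, as in the Facebook datasets described above.
    """
    pairs = []
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        # zip stops at the shorter file; a real script might warn on length mismatch
        for src_line, tgt_line in zip(src, tgt):
            pairs.append((src_line.strip(), tgt_line.strip()))
    return pairs
```

Note that zip silently truncates to the shorter file, so a production version would likely check that both files have the same number of lines.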
Initial run failure
After the initial run of the script, I remembered that the current script only works on glossaries with single words, which is why so few embeddings were generated. For example, here are a few lines from the output:
As you can see, the original English terms are bigrams, and no word embeddings are generated for either pair. There are rare instances where embeddings do get generated, but not only are they all unigrams, they also have abysmal similarity scores (calculated as $sim(f_{gloss}, f_{match}) \cdot sim(e_{gloss}, e_{match})$).
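The combined score above is just a product of two cosine similarities; a minimal NumPy sketch, assuming the four embedding vectors have already been computed:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def combined_score(f_gloss, f_match, e_gloss, e_match):
    # sim(f_gloss, f_match) * sim(e_gloss, e_match), as in the formula above
    return cosine(f_gloss, f_match) * cosine(e_gloss, e_match)
```

Since each factor lies in [-1, 1], a low score on either side drags the whole product down, which matches the behavior seen in the output.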
Perhaps these scores are expected to be low - after all, we are looking at a glossary of terms never seen before, even in a medicine-related parallel corpus for a high-resource language pair.

TODO
- Figure out a way to create embeddings for multi-word phrases:
  - Even when phrases have different lengths ($n \in [1, 7]$ for an $n$-gram in Facebook's TICO-19 glossary), choosing the maximum $n$ should still let us match shorter phrases
  - Huda's suggestion: average the embeddings - "this work from a few years ago looks at phrase embeddings but there is probably something newer, and I would start with just averaging"
- Read the suggested paper on phrase embeddings
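The averaging baseline suggested above could be sketched as follows (hypothetical helper; assumes a word-to-vector dict, e.g. pretrained embeddings loaded into memory, and a fixed dimensionality):

```python
import numpy as np

def phrase_embedding(phrase, word_vectors, dim=300):
    """Embed a multi-word phrase by averaging its word embeddings.

    word_vectors: dict mapping word -> np.ndarray of shape (dim,).
    Out-of-vocabulary words are skipped; a phrase with no known
    words falls back to a zero vector.
    """
    vecs = [word_vectors[w] for w in phrase.split() if w in word_vectors]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)
```

This handles $n$-grams of any length with one function, so unigram terms and 7-gram terms go through the same code path; how well plain averaging preserves phrase meaning is exactly what the suggested paper should help evaluate.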