Synthesis Experiments | Benchmark Pt.2

Modifying script

I started by creating a new script, benchmark.py (based on synthesis.py), which can handle two parallel files via the --monolingual-glossary flag, provided that they are aligned line by line (in this case, specifically the Facebook datasets).
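For reference, here is a minimal sketch of what the aligned-file handling could look like. The helper name load_monolingual_glossary and the exact shape of the flag (taking two paths) are my assumptions based on the description above, not the actual benchmark.py code.

```python
import argparse


def load_monolingual_glossary(src_path, tgt_path):
    """Read two line-aligned files and pair them as (source term, target term).

    Hypothetical helper: the real benchmark.py may structure this differently.
    """
    with open(src_path, encoding="utf-8") as f_src, \
         open(tgt_path, encoding="utf-8") as f_tgt:
        src_lines = [line.strip() for line in f_src]
        tgt_lines = [line.strip() for line in f_tgt]
    if len(src_lines) != len(tgt_lines):
        raise ValueError("glossary files must be aligned line by line")
    return list(zip(src_lines, tgt_lines))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # Assumed interface: the flag takes the two aligned glossary files as a pair of paths.
    parser.add_argument("--monolingual-glossary", nargs=2,
                        metavar=("SRC", "TGT"), required=True)
    args = parser.parse_args()
    glossary = load_monolingual_glossary(*args.monolingual_glossary)
    print(f"loaded {len(glossary)} glossary pairs")
```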

Initial run failure

After the initial training run of the script, I remembered that the current script only works on glossaries of single words, which is why few embeddings were generated. For example, here are a few lines of the output:

Grippe de 1918 1918 flu 	 no embeddings
bronchite aiguë acute bronchitis 	 no embeddings

As you can see, the English terms here are bigrams (and the French terms are multi-word as well), so no word embeddings are generated for either pair. There are some rare instances where embeddings do get generated, but not only are they just unigrams, they also have abysmal similarity ratings (calculated as $\mathrm{sim}(f_{gloss}, f_{match}) \cdot \mathrm{sim}(e_{gloss}, e_{match})$).
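A small sketch of how that combined score can be computed. Cosine similarity over word-level embedding lookups is my assumption here (the actual script may define sim differently), and the function and dictionary names are hypothetical; the point is that a word-level lookup has no entry for a multi-word term, which is exactly the "no embeddings" case above.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def pair_similarity(f_gloss, e_gloss, f_match, e_match, f_vecs, e_vecs):
    """Combined score sim(f_gloss, f_match) * sim(e_gloss, e_match).

    f_vecs / e_vecs are word-level embedding tables (dict of word -> vector),
    so a multi-word glossary term such as "Grippe de 1918" has no entry and
    yields no embedding at all, matching the behaviour described above.
    """
    if f_gloss not in f_vecs or e_gloss not in e_vecs:
        return None  # the "no embeddings" case
    sim_f = cosine(f_vecs[f_gloss], f_vecs[f_match])
    sim_e = cosine(e_vecs[e_gloss], e_vecs[e_match])
    return sim_f * sim_e
```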

Sample output:
SIDA AIDS 	 pruned 	 1280 	 337 	 [{'f': 'maladie', 'e': 'disease', 'similarity': 0.10634502472876672, 'count': 20}, {'f': 'familiaux', 'e': 'family', 'similarity': 0.09439994727364365, 'count': 27}, {'f': 'sévère', 'e': 'severe', 'similarity': 0.06758738560884492, 'count': 22}, {'f': 'antécédents', 'e': 'history', 'similarity': 0.061721948986114494, 'count': 18}, {'f': 'patients', 'e': 'patients', 'similarity': 0.05183887041854973, 'count': 250}]

aérien airway 	 pruned 	 5120 	 224 	 [{'f': 'trouble', 'e': 'disorder', 'similarity': 0.06756156498017774, 'count': 18}, {'f': 'symptômes', 'e': 'symptoms', 'similarity': 0.055993081384110965, 'count': 97}, {'f': 'raideur', 'e': 'stiffness', 'similarity': 0.05032308793326834, 'count': 27}, {'f': 'psychotiques', 'e': 'psychotic', 'similarity': 0.04466918184504509, 'count': 19}, {'f': 'cérébrovasculaires', 'e': 'cerebrovascular', 'similarity': 0.04236828493397127, 'count': 36}, {'f': 'cardiovasculaire', 'e': 'cardiovascular', 'similarity': 0.040491460778032895, 'count': 27}]

Perhaps these ratings are simply expected to be low; after all, we are looking at a glossary of terms never seen before, even in a medicine-related parallel corpus with a high-resource language pair.

TODO