Synthesis Experiments | Benchmark Pt.2

Modifying script

I started by creating a new script, benchmark.py (based on synthesis.py), which can handle two parallel files via the --monolingual-glossary flag, provided that they are aligned line by line (in this case, specifically the Facebook datasets).
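For reference, here is a minimal sketch of what the aligned-file handling could look like. The helper name load_monolingual_glossary and the exact shape of the flag (taking two paths) are my assumptions based on the description above, not the actual benchmark.py code.

```python
import argparse


def load_monolingual_glossary(src_path, tgt_path):
    """Read two line-aligned files and pair them as (source term, target term).

    Hypothetical helper: the real benchmark.py may structure this differently.
    """
    with open(src_path, encoding="utf-8") as f_src, \
         open(tgt_path, encoding="utf-8") as f_tgt:
        src_lines = [line.strip() for line in f_src]
        tgt_lines = [line.strip() for line in f_tgt]
    if len(src_lines) != len(tgt_lines):
        raise ValueError("glossary files must be aligned line by line")
    return list(zip(src_lines, tgt_lines))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # Assumed interface: the flag takes the two aligned glossary files as a pair of paths.
    parser.add_argument("--monolingual-glossary", nargs=2,
                        metavar=("SRC", "TGT"), required=True)
    args = parser.parse_args()
    glossary = load_monolingual_glossary(*args.monolingual_glossary)
    print(f"loaded {len(glossary)} glossary pairs")
```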

Initial run failure

After the initial training run of the script, I remembered that the current script only works on glossaries of single words, which is why few embeddings were generated. For example, here are a few lines of the output:

Grippe de 1918 1918 flu 	 no embeddings
bronchite aiguë acute bronchitis 	 no embeddings

As you can see, the English terms here are bigrams (and the French terms are multi-word as well), so no word embeddings are generated for either pair. There are some rare instances where embeddings do get generated, but not only are they just unigrams, they also have abysmal similarity ratings (calculated as $\mathrm{sim}(f_{gloss}, f_{match}) \cdot \mathrm{sim}(e_{gloss}, e_{match})$).
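A small sketch of how that combined score can be computed. Cosine similarity over word-level embedding lookups is my assumption here (the actual script may define sim differently), and the function and dictionary names are hypothetical; the point is that a word-level lookup has no entry for a multi-word term, which is exactly the "no embeddings" case above.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def pair_similarity(f_gloss, e_gloss, f_match, e_match, f_vecs, e_vecs):
    """Combined score sim(f_gloss, f_match) * sim(e_gloss, e_match).

    f_vecs / e_vecs are word-level embedding tables (dict of word -> vector),
    so a multi-word glossary term such as "Grippe de 1918" has no entry and
    yields no embedding at all, matching the behaviour described above.
    """
    if f_gloss not in f_vecs or e_gloss not in e_vecs:
        return None  # the "no embeddings" case
    sim_f = cosine(f_vecs[f_gloss], f_vecs[f_match])
    sim_e = cosine(e_vecs[e_gloss], e_vecs[e_match])
    return sim_f * sim_e
```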

Sample output:
SIDA AIDS 	 pruned 	 1280 	 337 	 [{'f': 'maladie', 'e': 'disease', 'similarity': 0.10634502472876672, 'count': 20}, {'f': 'familiaux', 'e': 'family', 'similarity': 0.09439994727364365, 'count': 27}, {'f': 'sévère', 'e': 'severe', 'similarity': 0.06758738560884492, 'count': 22}, {'f': 'antécédents', 'e': 'history', 'similarity': 0.061721948986114494, 'count': 18}, {'f': 'patients', 'e': 'patients', 'similarity': 0.05183887041854973, 'count': 250}]

aérien airway 	 pruned 	 5120 	 224 	 [{'f': 'trouble', 'e': 'disorder', 'similarity': 0.06756156498017774, 'count': 18}, {'f': 'symptômes', 'e': 'symptoms', 'similarity': 0.055993081384110965, 'count': 97}, {'f': 'raideur', 'e': 'stiffness', 'similarity': 0.05032308793326834, 'count': 27}, {'f': 'psychotiques', 'e': 'psychotic', 'similarity': 0.04466918184504509, 'count': 19}, {'f': 'cérébrovasculaires', 'e': 'cerebrovascular', 'similarity': 0.04236828493397127, 'count': 36}, {'f': 'cardiovasculaire', 'e': 'cardiovascular', 'similarity': 0.040491460778032895, 'count': 27}]

Perhaps these ratings are simply expected to be low; after all, we are looking at a glossary of terms never seen before, even in a medicine-related parallel corpus with a high-resource language pair.

TODO