Synthesis Experiments | Benchmark Pt.4

26 October 2020

Tips for moving forward

Things to keep in mind after meeting with the mentors today:

The current script only handles single words; keep it that way for now
- Filter out multi-word terms and keep only the single-word ones
For the monolingual data fed to the script, look at CommonCrawl (more abundant) or Wikipedia (choose a recent dataset so that it includes COVID-19 terminology)
For future reference, possible ways to find replacements for phrases:
- Basic approach: average the embeddings of the words in each term, then try to replace single words in the parallel corpus with these phrases; the logic is reasonable enough
- More sophisticated approach: align the terms with the corpus and somehow extract phrases to replace; this will be a lot more involved and complicated
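
To make the first and third tips concrete, here is a minimal sketch of the single-word filter and of the "basic approach" of averaging embeddings. Everything in it is an assumption for illustration: the function names are hypothetical, the glossary is taken to be a list of (source, target) string pairs, a term counts as single-word when it has no internal whitespace, and the embeddings are a plain dict from word to list of floats. It is not the actual script.

```python
def is_single_word(term):
    # Assumption: a term is "single-word" if splitting on whitespace gives one token.
    return len(term.split()) == 1


def filter_single_word(glossary):
    # Keep only glossary entries whose source AND target sides are single words.
    # (Filtering on both sides is an assumption; the notes only say to drop
    # multi-word terms.)
    return [
        (src, tgt)
        for src, tgt in glossary
        if is_single_word(src) and is_single_word(tgt)
    ]


def average_embedding(phrase, embeddings):
    # Basic approach for phrases: average the vectors of the words in the phrase.
    # Words missing from the embedding table are skipped; returns None if no
    # word of the phrase has a vector.
    vectors = [embeddings[w] for w in phrase.split() if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy data, made up for the example.
glossary = [
    ("virus", "Virus"),
    ("social distancing", "Abstandhalten"),
    ("mask", "Maske"),
]
toy_embeddings = {"social": [1.0, 0.0], "distancing": [0.0, 1.0]}

single_word_terms = filter_single_word(glossary)
phrase_vector = average_embedding("social distancing", toy_embeddings)
```

The averaged vector can then be compared against single-word vectors in the same way as any other embedding, which is what makes this the simpler of the two options.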

Points of confusion

I had a misconception about how the embeddings are generated. The embeddings for the terms come from the monolingual corpus given to the script, not from the parallel corpus. We look for pairs similar to the glossary terms among the monolingual embeddings, and if these similar pairs also exist in the parallel data, we compute the similarity between them and the original glossary terms.
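
As I now understand it, the flow can be sketched roughly as below. This is only an illustration under stated assumptions: the embeddings are a dict from word to vector that was trained on the monolingual corpus (the training step is not shown), each "similar pair" is simplified to a single candidate word, similarity is cosine similarity, and `replacement_candidates`, `parallel_vocab`, and `top_k` are hypothetical names, not part of the real script.

```python
import math


def cosine(u, v):
    # Cosine similarity between two equal-length vectors; 0.0 if either is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def replacement_candidates(term, embeddings, parallel_vocab, top_k=5):
    # 1. Look the glossary term up in the monolingual embeddings.
    # 2. Keep only other words that also occur in the parallel data.
    # 3. Score each remaining word by its similarity to the glossary term.
    if term not in embeddings:
        return []
    scored = [
        (word, cosine(embeddings[term], vector))
        for word, vector in embeddings.items()
        if word != term and word in parallel_vocab
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors, made up for the example.
mono_embeddings = {
    "virus": [1.0, 0.0],
    "pathogen": [0.9, 0.1],
    "germ": [0.8, 0.2],
    "banana": [0.0, 1.0],
}
parallel_vocab = {"pathogen", "banana"}  # words seen in the parallel corpus

candidates = replacement_candidates("virus", mono_embeddings, parallel_vocab)
```

In this toy run "germ" is close to "virus" but is dropped because it never appears in the parallel data, which is the point I had originally missed: the parallel corpus only decides which candidates are usable, it does not produce the embeddings.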

Lisa Z.

