Synthesis Experiments | Benchmark Pt.4
26 October 2020
Tips for moving forward
Things to keep in mind after meeting with the mentors today:
- The current script only handles single words; keep it this way for now
- Filter out multi-word terms from the glossary and keep only the single-word ones
- For the monolingual input to the script, look at CommonCrawl (more data available) or Wikipedia (choose a recent dump so that COVID-19 terminology is included)
- For future reference, possible ways to find replacements for phrases:
  - Basic approach: average the word embeddings of each multi-word term and try to replace single words in the parallel corpus with these phrases; the logic seems reasonable enough
  - More sophisticated approach: align the terms with the corpus and extract phrases to replace; this will be a lot more involved
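The single-word filter and the basic averaging idea can be sketched roughly as follows. This is only an illustration, not the actual script: the function names `split_glossary` and `average_embedding` are hypothetical, terms are assumed to be whitespace-separated, and the embeddings are assumed to be a plain dictionary from word to vector.

```python
from typing import Dict, List, Optional, Tuple


def split_glossary(terms: List[str]) -> Tuple[List[str], List[str]]:
    """Separate single-word terms from multi-word ones (whitespace-based)."""
    single, multi = [], []
    for term in terms:
        cleaned = " ".join(term.split())  # collapse stray whitespace
        if not cleaned:
            continue  # skip empty glossary entries
        if " " in cleaned:
            multi.append(cleaned)
        else:
            single.append(cleaned)
    return single, multi


def average_embedding(
    phrase: str, vectors: Dict[str, List[float]]
) -> Optional[List[float]]:
    """Mean of the word vectors of a phrase.

    Assumption: words missing from `vectors` are skipped, and None is
    returned when no word of the phrase has an embedding.
    """
    found = [vectors[w] for w in phrase.split() if w in vectors]
    if not found:
        return None
    dim = len(found[0])
    return [sum(vec[i] for vec in found) / len(found) for i in range(dim)]


if __name__ == "__main__":
    # Toy glossary and toy 2-dimensional vectors, made up for illustration.
    single, multi = split_glossary(["virus", "social distancing", "  mask "])
    print(single, multi)
    toy_vectors = {"social": [1.0, 0.0], "distancing": [0.0, 1.0]}
    print(average_embedding("social distancing", toy_vectors))
```

Skipping out-of-vocabulary words is one possible choice; dropping the whole phrase when any word is missing would be a stricter alternative.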
Points of confusion
I had a misconception about how the embeddings are generated. They are generated for the terminologies, not for the parallel corpus, and they are learned from the monolingual corpus given to the script as input. We find the words most similar to each term in the monolingual embeddings, and if those similar words also exist in the parallel data, we compute the similarity between them and the original glossary terms.
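To make the flow concrete for myself, here is a minimal sketch of that lookup. It is not the actual script: `replacement_candidates`, `mono_vectors` and `parallel_vocab` are hypothetical names, cosine similarity is assumed as the similarity measure, and the toy vectors are made up. For simplicity it restricts the candidates to words present in the parallel data first and then ranks them, which yields the best-scoring neighbours that are usable in the parallel corpus.

```python
import math
from typing import Dict, List, Set, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def replacement_candidates(
    term: str,
    mono_vectors: Dict[str, List[float]],  # embeddings learned from monolingual data
    parallel_vocab: Set[str],              # words that occur in the parallel corpus
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Words similar to a glossary term that also occur in the parallel data."""
    if term not in mono_vectors:
        return []  # the term has no embedding, so nothing can be compared
    term_vec = mono_vectors[term]
    scored = [
        (word, cosine(term_vec, vec))
        for word, vec in mono_vectors.items()
        if word != term and word in parallel_vocab
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    toy_vectors = {
        "virus": [1.0, 0.0],
        "pathogen": [0.9, 0.1],
        "germ": [0.8, 0.2],
        "banana": [0.0, 1.0],
    }
    # "germ" is close to "virus" but absent from the parallel data, so it is skipped.
    print(replacement_candidates("virus", toy_vectors, {"pathogen", "banana"}))
```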