Synthesis Experiments | Benchmark Pt.7

Running with Wikipedia Datasets

Running with the Wikipedia datasets hit some issues, and the output was still a bit unsatisfactory. For example, no embeddings were generated for critical terms such as COVID and cov19 (along with other variants), even though the Wikipedia dumps are recent enough (11/20/2020) that they should contain these terms. As my advisors suggested, a good next step is to count how frequently these terms actually occur in the monolingual corpora.
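A minimal sketch of that frequency check, assuming the monolingual corpora are plain text files; the file names and the list of term variants below are placeholders, not the actual data:

```python
from collections import Counter
import re

# Hypothetical file names; substitute the actual monolingual corpora.
CORPUS_FILES = ["mono.en.txt", "mono.fr.txt"]
# Terms whose presence we want to verify, including spelling variants.
TERMS = ["covid", "covid-19", "cov19", "coronavirus", "sars-cov-2"]

def term_frequencies(path, terms):
    """Count case-insensitive whole-word occurrences of each term in a plain-text corpus."""
    patterns = {t: re.compile(r"\b" + re.escape(t) + r"\b", re.IGNORECASE) for t in terms}
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            for term, pattern in patterns.items():
                counts[term] += len(pattern.findall(line))
    return counts

for path in CORPUS_FILES:
    print(path, dict(term_frequencies(path, TERMS)))
```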


Synthesis Experiments | Benchmark Pt.6

Wikipedia Dumps

Of course, Wikipedia always has some of the most up-to-date information on major world events, including the current pandemic. I had been hesitant to use it because of how the dumps are structured - they come in XML or SQL formats, and I did not know how to extract the text from them and save it as plain text files. Luckily, Matt showed me a helpful repo that builds a tool for parsing these dumps. However, its documentation says the script is for parsing Wiktionaries and for English only, and I needed one for general articles in different languages. I looked around and found the Wikipedia Extractor (WikiExtractor) script, which should do just that. This gave me some hope - I downloaded the gigantic .bz2 dumps for English and French and ran the script to strip the formatting and generate plain text files.
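For reference, WikiExtractor's default (non-JSON) output is a tree of subdirectories containing files in which each article is wrapped in <doc ...> ... </doc> tags. A small sketch of flattening that output into a single plain-text file per language - the paths here are placeholders:

```python
import os

# Hypothetical paths; WikiExtractor's default output is a tree of
# subdirectories (AA, AB, ...) containing files named wiki_00, wiki_01, ...
EXTRACTED_DIR = "extracted/en"
OUTPUT_FILE = "wiki.en.txt"

with open(OUTPUT_FILE, "w", encoding="utf-8") as out:
    for root, _dirs, files in os.walk(EXTRACTED_DIR):
        for name in sorted(files):
            with open(os.path.join(root, name), encoding="utf-8") as f:
                for line in f:
                    # Drop the <doc ...> / </doc> wrappers and blank lines,
                    # keeping only the article text.
                    if line.startswith("<doc") or line.startswith("</doc>") or not line.strip():
                        continue
                    out.write(line)
```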

Synthesis Experiments | Benchmark Pt.5

Using Other Datasets

I soon ran into the problem that the available CommonCrawl datasets were a bit too messy and not yet processed, while the clean data were not recent enough to contain the terminologies. With the hunt for good datasets still on, I tried NeuLab's covid-datashare repository. The datasets there were not quite large enough and did not contain all of the single-word terminologies as I had hoped, which caused the program to crash because certain embeddings could not be generated.
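One way to keep missing terms from crashing the program is to check each glossary term against the trained embedding vocabulary up front and skip (or at least log) the ones that were never seen. A sketch assuming word2vec-format vectors loaded with gensim; the file name and term list are placeholders:

```python
from gensim.models import KeyedVectors

# Hypothetical paths; substitute the actual embedding file and glossary terms.
embeddings = KeyedVectors.load_word2vec_format("mono.en.vec")
glossary_terms = ["covid", "quarantine", "ventilator"]

covered, missing = [], []
for term in glossary_terms:
    (covered if term in embeddings else missing).append(term)

print(f"{len(covered)} terms have embeddings, {len(missing)} do not")
if missing:
    # Skip these rather than letting a lookup raise a KeyError downstream.
    print("Missing:", missing)
```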

Synthesis Experiments | Benchmark Pt.4

Tips for moving forward

Things to keep in mind after meeting with the mentors today:

  • The current script only handles single words; keep it this way for now
    • Filter out the multi-word terminologies and keep only the single-word ones
  • For the monolingual data input to the script, look to CommonCrawl (more abundant) or Wikipedia (choose dumps recent enough to include the COVID-19 terminologies)
  • For future reference - possible ways to find replacements for phrases:
    • Basic approach: average the embeddings of the words in each terminology and attempt to replace single words with these phrases in the parallel corpus; the logic is reasonable enough (see the sketch after this list)
    • More sophisticated approach: align the terminologies with the corpus and extract the phrases to replace from the alignments; this will be a lot more involved and complicated
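A rough sketch of the first and the basic-approach points above: filtering the glossary down to single-word terms for now, and (for later) averaging word vectors to get one embedding for a multi-word terminology. The embedding lookup here is just a placeholder dict, not the actual model:

```python
import numpy as np

def split_glossary(terms):
    """Separate single-word terminologies from multi-word ones."""
    single = [t for t in terms if len(t.split()) == 1]
    multi = [t for t in terms if len(t.split()) > 1]
    return single, multi

def phrase_embedding(phrase, word_vectors):
    """Basic approach: average the embeddings of the words in a phrase.

    `word_vectors` is assumed to be a dict-like mapping word -> np.ndarray;
    words without an embedding are skipped, and None is returned if no
    word in the phrase is covered.
    """
    vecs = [word_vectors[w] for w in phrase.split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else None

# Tiny demo with made-up 3-dimensional vectors.
demo_vectors = {"contact": np.array([1.0, 0.0, 0.0]),
                "tracing": np.array([0.0, 1.0, 0.0])}
print(split_glossary(["quarantine", "contact tracing", "herd immunity"]))
print(phrase_embedding("contact tracing", demo_vectors))  # -> [0.5 0.5 0. ]
```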

Synthesis Experiments | Benchmark Pt.3

Questions that remain

From the benchmark:

  • What could be the cause of the low similarity scores of the generated replacement pairs?
    • I used the EMEA dataset as both the parallel corpus and the monolingual corpora inputs to the program, yet it may not contain all the glossary terms and their translations - could that be the problem? (A quick coverage check, sketched after this list, could help answer this.)
    • Question for Shuhao: do the TICO-19 datasets contain all the glossary terms? Perhaps they would be a more appropriate set of monolingual corpora to use?
  • How to get the model to work with multi-word glossary terms?
    • The embeddings currently generated by training on the corpus seem to work only for single words - perhaps modify the approach to account for multiple words?
    • Match the glossary terms (with the ECO method) to newly generated embeddings?
    • Other approaches?
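As a follow-up to the first question, a quick coverage check could tell how many glossary pairs actually occur in the parallel corpus. A sketch assuming a tab-separated glossary and one plain-text file per language side; all file names are hypothetical:

```python
def load_pairs(path):
    """Load (source term, target term) pairs from a tab-separated glossary file."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            if len(cols) >= 2:
                pairs.append((cols[0], cols[1]))
    return pairs

def coverage(glossary_path, src_path, tgt_path):
    """Count how many glossary pairs occur in the parallel corpus (simple substring match)."""
    pairs = load_pairs(glossary_path)
    with open(src_path, encoding="utf-8") as f:
        src_text = f.read().lower()
    with open(tgt_path, encoding="utf-8") as f:
        tgt_text = f.read().lower()
    hits = [(s, t) for s, t in pairs if s.lower() in src_text and t.lower() in tgt_text]
    print(f"{len(hits)}/{len(pairs)} glossary pairs occur in the parallel corpus")
    return hits

# Example call with hypothetical file names for the EMEA English-French data:
# coverage("glossary.en-fr.tsv", "emea.en-fr.en", "emea.en-fr.fr")
```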