Synthesis Experiments | Benchmark Pt.5

Using Other Datasets

I soon ran into a problem: the CommonCrawl datasets available were too messy and not yet processed, while the clean datasets were not recent enough to contain the terms I needed. While the hunt for good datasets is still on, I tried NeuLab’s covid-datashare repository. The datasets there were not long enough and did not contain all of the single-word terms I had hoped for, which caused the program to crash because certain embeddings could not be generated.
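
One way to guard against this kind of crash is to check which terms actually appear in a dataset before trying to generate embeddings for them. The snippet below is only a minimal sketch: `find_missing_terms`, the sample corpus, and the term list are hypothetical names and data chosen for illustration, not the code used in these experiments.

```python
import re


def find_missing_terms(corpus_lines, terms):
    """Return the terms (lowercased, sorted) that never appear as a whole token."""
    wanted = {t.lower() for t in terms}
    seen = set()
    for line in corpus_lines:
        # Simple tokenizer (an assumption): runs of word characters and hyphens.
        for token in re.findall(r"[\w-]+", line.lower()):
            if token in wanted:
                seen.add(token)
    return sorted(wanted - seen)


# Hypothetical sample data, for illustration only.
corpus = [
    "The coronavirus outbreak led to a lockdown.",
    "Quarantine rules were updated.",
]
terms = ["coronavirus", "lockdown", "quarantine", "telehealth"]

missing = find_missing_terms(corpus, terms)
if missing:
    print("Terms with no occurrences in the corpus:", missing)
```

With a check like this, terms that are absent from a dataset can be skipped or reported up front rather than causing a failure midway through a run.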