Synthesis Experiments | Benchmark Pt.5
22 November 2020
Using Other Datasets
I soon ran into a problem: the CommonCrawl datasets available to me were too messy and not yet processed, while the clean ones were not recent enough to contain the terminology. While the hunt for good datasets is still on, I tried NeuLab's covid-datashare repository. The datasets there were not quite long enough and did not contain all the single-word terms I had hoped for, which caused the program to crash because certain embeddings could not be generated.
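One way to avoid this kind of crash is to check which terms actually appear in a corpus before trying to generate embeddings for them. The snippet below is only a minimal sketch of that idea, not the code used in this experiment: the function name `term_coverage`, the simple regex tokenizer, and the sample corpus and term list are all hypothetical and chosen for illustration.

```python
import re
from collections import Counter


def term_coverage(corpus_lines, terms, min_count=1):
    """Split single-word terms into those the corpus covers and those it lacks.

    Hypothetical helper: tokenization is a plain lowercase regex, which is an
    assumption and may differ from whatever the embedding pipeline uses.
    """
    counts = Counter()
    for line in corpus_lines:
        # Keep hyphenated tokens such as "covid-19" together.
        counts.update(re.findall(r"[a-z0-9][a-z0-9-]*", line.lower()))

    covered = [t for t in terms if counts[t.lower()] >= min_count]
    missing = [t for t in terms if counts[t.lower()] < min_count]
    return covered, missing


# Made-up sample data, only to show the call.
corpus = [
    "Patients with COVID-19 often report anosmia.",
    "Remdesivir was tested in several trials.",
]
terms = ["covid-19", "anosmia", "remdesivir", "hydroxychloroquine"]

covered, missing = term_coverage(corpus, terms)
print("covered:", covered)
print("missing:", missing)
```

With a check like this, terms in `missing` can be skipped or logged up front instead of failing partway through the run. Raising `min_count` would also filter out terms that occur too rarely to give a meaningful embedding.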