Synthesis Experiments | Benchmark Pt.7

Running with Wikipedia Datasets

There were some issues when running with the Wikipedia datasets, as the output was still somewhat unsatisfactory. For example, no embeddings were generated for critical terms such as COVID and cov19 (along with other variants). The Wikipedia dumps, however, should have contained these terms given how recent they were (11/20/2020). As suggested by advisors, a good next step would be to count the frequency of these terms in the monolingual corpora.
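
To verify which terms actually received embeddings, one can scan the vocabulary of the generated embedding file. Below is a minimal sketch assuming a word2vec/fastText-style text format (a header line followed by one "word v1 v2 ..." entry per line); the file path and the term list are hypothetical, not the exact ones used here.

```python
# Minimal sketch (hypothetical paths/terms): check whether specific terms
# received embeddings by scanning the vocabulary of a .vec text file.

TERMS_TO_CHECK = {"covid", "cov19", "covid19", "pandemic"}

def load_vocab(vec_path):
    """Return the set of words that have an embedding in a .vec text file."""
    vocab = set()
    with open(vec_path, encoding="utf-8") as f:
        next(f)  # skip the "num_words dim" header line
        for line in f:
            word = line.split(" ", 1)[0]
            vocab.add(word.lower())
    return vocab

if __name__ == "__main__":
    vocab = load_vocab("embeddings.en.vec")  # hypothetical embedding file
    for term in sorted(TERMS_TO_CHECK):
        status = "present" if term in vocab else "MISSING"
        print(f"{term}: {status}")
```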

Counting Frequency

Using the Facebook dataset in English and French and a Python script, I counted the occurrences of each term (both single- and multi-word) in the respective Wikipedia datasets. 27 terms (0.75%) in the English dataset and 45 (12.5%) in the French dataset never appeared in their respective monolingual corpora. Curiously, while an embedding was generated for the pair pandemic (50843)/pandémie (9326), none was generated for the pair covid19 (55440)/covid19 (10175), even though the latter pair occurred more frequently in both languages.
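
The counting itself can be done with a short script along the following lines. This is a sketch, not the exact script used: it assumes the term list is a plain-text file with one term per line and the Wikipedia corpus is a plain-text file with one sentence or article per line; both file names are hypothetical.

```python
# Minimal sketch (hypothetical file names): count how often each single- or
# multi-word term occurs in a monolingual Wikipedia corpus.

import re

def load_terms(path):
    """Read one term per line, lowercased, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip().lower() for line in f if line.strip()]

def count_terms(corpus_path, terms):
    """Count occurrences of each term in the corpus, one line at a time."""
    patterns = {t: re.compile(r"\b" + re.escape(t) + r"\b") for t in terms}
    counts = dict.fromkeys(terms, 0)
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            text = line.lower()
            for term, pattern in patterns.items():
                counts[term] += len(pattern.findall(text))
    return counts

if __name__ == "__main__":
    terms = load_terms("terms.en.txt")          # hypothetical term list
    counts = count_terms("wiki.en.txt", terms)  # hypothetical Wikipedia dump
    missing = [t for t, c in counts.items() if c == 0]
    print(f"{len(missing)} of {len(terms)} terms never appear in the corpus")
```

Streaming the corpus line by line keeps memory use flat even for full Wikipedia dumps, and the word-boundary regex handles multi-word terms without tokenizing the whole corpus.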