Synthesis Experiments | Benchmark Pt.7

Running with Wikipedia Datasets

There were some issues when running with the Wikipedia datasets, as the output was still somewhat unsatisfactory. For example, no embeddings were generated for critical terms such as COVID and cov19 (along with other variants). The Wikipedia dumps, however, should have contained these terms given how recent they were (11/20/2020). As suggested by advisors, a good next step would be to count the frequency of these terms in the monolingual corpora.
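
To verify which terms actually received embeddings, one can scan the vocabulary of the generated embedding file. Below is a minimal sketch assuming a word2vec/fastText-style text format (a header line followed by one "word v1 v2 ..." entry per line); the file path and the term list are hypothetical, not the exact ones used here.

```python
# Minimal sketch (hypothetical paths/terms): check whether specific terms
# received embeddings by scanning the vocabulary of a .vec text file.

TERMS_TO_CHECK = {"covid", "cov19", "covid19", "pandemic"}

def load_vocab(vec_path):
    """Return the set of words that have an embedding in a .vec text file."""
    vocab = set()
    with open(vec_path, encoding="utf-8") as f:
        next(f)  # skip the "num_words dim" header line
        for line in f:
            word = line.split(" ", 1)[0]
            vocab.add(word.lower())
    return vocab

if __name__ == "__main__":
    vocab = load_vocab("embeddings.en.vec")  # hypothetical embedding file
    for term in sorted(TERMS_TO_CHECK):
        status = "present" if term in vocab else "MISSING"
        print(f"{term}: {status}")
```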

Counting Frequency

Using the Facebook dataset in English and French and a Python script, I counted the occurrences of each term (both single- and multi-word) in the respective Wikipedia datasets. 27 terms (0.75%) in the English dataset and 45 (12.5%) in the French dataset never appeared in their respective monolingual corpora. Curiously, while an embedding was generated for the pair pandemic (50843)/pandémie (9326), none was generated for the pair covid19 (55440)/covid19 (10175), even though the latter pair occurred more frequently in both languages.
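
The counting itself can be done with a short script along the following lines. This is a sketch, not the exact script used: it assumes the term list is a plain-text file with one term per line and the Wikipedia corpus is a plain-text file with one sentence or article per line; both file names are hypothetical.

```python
# Minimal sketch (hypothetical file names): count how often each single- or
# multi-word term occurs in a monolingual Wikipedia corpus.

import re

def load_terms(path):
    """Read one term per line, lowercased, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip().lower() for line in f if line.strip()]

def count_terms(corpus_path, terms):
    """Count occurrences of each term in the corpus, one line at a time."""
    patterns = {t: re.compile(r"\b" + re.escape(t) + r"\b") for t in terms}
    counts = dict.fromkeys(terms, 0)
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            text = line.lower()
            for term, pattern in patterns.items():
                counts[term] += len(pattern.findall(text))
    return counts

if __name__ == "__main__":
    terms = load_terms("terms.en.txt")          # hypothetical term list
    counts = count_terms("wiki.en.txt", terms)  # hypothetical Wikipedia dump
    missing = [t for t, c in counts.items() if c == 0]
    print(f"{len(missing)} of {len(terms)} terms never appear in the corpus")
```

Streaming the corpus line by line keeps memory use flat even for full Wikipedia dumps, and the word-boundary regex handles multi-word terms without tokenizing the whole corpus.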