Synthesis Experiments | Benchmark Pt.8

Follow-up on Embedding Anomaly

As mentioned in the previous post, there seems to be an issue with embedding generation: some terms get no embedding even though they occurred many times in both monolingual corpora. To investigate further, I plan to run the script on the term covid19 only.

Eureka

The results still indicated that no embeddings were generated for covid19. After a meeting with Huda, we figured out that the tokenizer was the culprit. The one originally used was gensim.utils.tokenize(), which removes numbers - not hard to see why that would cause a problem: covid19 gets stripped down to covid, so the term itself never shows up in the tokenized corpus.
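
To make the failure mode concrete, here is a minimal sketch using only Python's re module. It does not call gensim itself: ALPHA_ONLY is a hypothetical stand-in that approximates an alphabetic-only tokenizer like the one described above, and tokenize_keep_digits is just one possible replacement, not necessarily the one we ended up using. The example sentence is made up.

```python
import re

# Hypothetical stand-in for an alphabetic-only tokenizer:
# matches runs of word characters that are not digits.
ALPHA_ONLY = re.compile(r"(?:(?!\d)\w)+")

# Possible alternative: keep digits as part of the token.
ALNUM = re.compile(r"\w+")


def tokenize_alpha_only(text):
    """Lowercased tokens with all digits dropped."""
    return [m.group().lower() for m in ALPHA_ONLY.finditer(text)]


def tokenize_keep_digits(text):
    """Lowercased tokens with digits preserved."""
    return [m.group().lower() for m in ALNUM.finditer(text)]


sentence = "Cases of covid19 rose sharply"

# Digits are stripped, so the target term never appears as a token.
print(tokenize_alpha_only(sentence))
# → ['cases', 'of', 'covid', 'rose', 'sharply']

# Digits are kept, so covid19 survives as a single token.
print(tokenize_keep_digits(sentence))
# → ['cases', 'of', 'covid19', 'rose', 'sharply']
```

With the first tokenizer, a lookup for covid19 in the resulting vocabulary can never succeed, no matter how often the term appears in the raw corpora.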