Synthesis Experiments | Benchmark Pt.8
4 January 2021
Follow-up on Embedding Anomaly
As mentioned in the previous post, there seems to be an issue where no embedding is generated for a term even though it occurred many times in both monolingual corpora. To investigate further, I plan to run the script on the term covid19 only.
Eureka
The results still indicated that no embeddings were generated for covid19. After a meeting with Huda, we figured out that the tokenizer was causing the problem. The one originally used was gensim.utils.tokenize(), which removes numbers, so the token covid19 never survives tokenization intact. Not hard to see why that would cause a problem.
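To make the failure concrete, here is a minimal sketch using only the standard library. The two regex patterns and the function names are my own stand-ins for illustration: the first approximates an alphabetic-only tokenizer (the behaviour described above), the second keeps digits inside tokens. Neither is the actual gensim code, and the example sentence is made up.

```python
import re

# Hypothetical stand-in for an alphabetic-only tokenizer:
# matches runs of word characters that are not digits.
ALPHA_ONLY = re.compile(r"(?:(?!\d)\w)+")

# Hypothetical alternative that keeps digits inside tokens.
ALNUM = re.compile(r"\w+")


def tokenize_alpha_only(text):
    """Lowercased tokens with all digits dropped."""
    return [m.group().lower() for m in ALPHA_ONLY.finditer(text)]


def tokenize_keep_digits(text):
    """Lowercased tokens with digits preserved."""
    return [m.group().lower() for m in ALNUM.finditer(text)]


sentence = "Cases of covid19 rose in 2020"  # made-up example sentence
alpha_tokens = tokenize_alpha_only(sentence)
alnum_tokens = tokenize_keep_digits(sentence)

print(alpha_tokens)
print(alnum_tokens)
```

With the alphabetic-only pattern, covid19 is cut down to covid, so the vocabulary never contains the term being looked up. A digit-preserving pattern keeps it whole.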