Synthesis Experiments | Benchmark Pt.8

4 January 2021

Follow-up on Embedding Anomaly

As mentioned in the previous post, there seems to be an issue where no embedding is generated for a term even though it occurred many times in both monolingual corpora. To investigate further, I plan to run the script on the term covid19 only.

Eureka

The results still showed that no embeddings were generated for covid19. After a meeting with Huda, we figured out that the tokenizer was the culprit. The one originally used was gensim.utils.tokenize(), which strips digits, so a term like covid19 never reaches the embedding step intact.
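To make the failure mode concrete, here is a minimal sketch using only Python's re module. The two patterns and the function names are my own stand-ins for illustration: the first imitates a tokenizer that keeps alphabetic characters only, and the second keeps digits. Neither is the code from the actual experiment script, and the sample sentence is made up.

```python
import re

# Stand-in for an alphabetic-only tokenizer (assumption: digits are never
# allowed inside a token, which is the behaviour that caused the problem).
ALPHA_ONLY = re.compile(r"(?:(?!\d)\w)+", re.UNICODE)

# Stand-in for a digit-preserving alternative (assumption: plain \w+ is
# good enough for this illustration).
ALNUM = re.compile(r"\w+", re.UNICODE)


def alpha_only_tokens(text):
    """Lowercase the text and keep runs of non-digit word characters."""
    return ALPHA_ONLY.findall(text.lower())


def alnum_tokens(text):
    """Lowercase the text and keep runs of word characters, digits included."""
    return ALNUM.findall(text.lower())


sentence = "Cases of covid19 rose"  # hypothetical example sentence

lossy = alpha_only_tokens(sentence)
kept = alnum_tokens(sentence)

print(lossy)
print(kept)
print("covid19" in lossy, "covid19" in kept)
```

With the alphabetic-only pattern the token covid19 never shows up in the output, so nothing downstream can learn a vector for it. Any tokenizer that keeps digits avoids this.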

Lisa Z.
