Synthesis Experiments | Benchmark Pt.11

BPE

Based on previous outputs, there seemed to be a problem with a large number of unknown tokens in the output translations (see the last line of synthesis/train_model/pass-1/fs_generate*.out, where BLEU = 1.87, for example). It was then recommended that the entire corpus be tokenized with BPE before training. The results did improve significantly: on the third pass (synthesis/train_model/pass-3), I got BLEU = 10.92. Still not great, but at least greatly improved.
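For context, here is a minimal sketch of how BPE could be applied to a corpus before training, using the sentencepiece library (the actual scripts under synthesis/train_model may differ; the file names and vocabulary size below are hypothetical):

```python
import sentencepiece as spm

# Train a BPE model on the training corpus (hypothetical file names).
spm.SentencePieceTrainer.train(
    input="corpus.train.txt",   # one sentence per line
    model_prefix="bpe",         # writes bpe.model and bpe.vocab
    vocab_size=16000,           # illustrative size, not the value used in the experiments
    model_type="bpe",
)

# Encode the corpus into subword units so rare words no longer
# surface as unknown tokens at translation time.
sp = spm.SentencePieceProcessor(model_file="bpe.model")
with open("corpus.train.txt") as fin, open("corpus.train.bpe", "w") as fout:
    for line in fin:
        fout.write(" ".join(sp.encode(line.strip(), out_type=str)) + "\n")
```

Rare words get split into frequent subword pieces, which is why the unknown-token rate (and with it BLEU) improves.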

I'm Back!

So it's been a bit since I made my last post… The past two months have been kind of insane with classes, schoolwork, and perhaps most importantly, the much-dreaded summer internship search 👻. But I finally got an offer, and I am now ready to get back on track with blogging!


Synthesis Experiments | Benchmark Pt.10

Generating a Synthetic Corpus

Following from the last post, where it was discussed that unaligned words are inevitable given noisy data, I edited the script to skip words that have no alignment. In addition, I ran the Moses tokenizer (tokenize_parallel.sh) on the parallel corpus as well, just in case (even though the website said it is already tokenized). While I succeeded in generating a synthetic corpus after these modifications, the output was strangely repetitive. Note that for each term in the terminology, we keep a list of words from the parallel corpus that it can replace. However, it looks like the script loops over one specific word repeatedly instead of iterating over all of the words in that list.
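To make the intended behavior concrete, here is a minimal sketch of the substitution step, assuming a hypothetical alignment mapping and terminology list (the real script and its data structures are not shown in the post, so all names here are invented):

```python
def substitute(src, tgt, alignment, word, term_src, term_tgt):
    """Replace `word` in the source sentence and its aligned target word
    with the terminology pair, skipping unaligned occurrences."""
    out_src, out_tgt = list(src), list(tgt)
    for i, token in enumerate(src):
        if token != word:
            continue
        if i not in alignment:       # the fix: skip words with no alignment
            continue
        out_src[i] = term_src
        out_tgt[alignment[i]] = term_tgt
    return out_src, out_tgt

def generate(term_src, term_tgt, replaceable, corpus):
    """Loop over ALL of the term's replaceable words; the repetitive output
    suggests the script was looping over one word several times instead."""
    for word in replaceable:
        for src, tgt, alignment in corpus:
            if word in src:
                yield substitute(src, tgt, alignment, word, term_src, term_tgt)
```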


Synthesis Experiments | Benchmark Pt.9

Using Moses

The monolingual datasets and the terminologies have been tokenized using the Moses tokenizer, and the special characters left over from HTML (escaped entities such as &amp;) have been replaced. A new experiment is now running with the pre-tokenized datasets.
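The post runs the Moses tokenizer via a shell script; a minimal sketch of the same two steps in Python, using the sacremoses port of the Moses tokenizer and the standard library's html module (file names and language code are hypothetical), might look like:

```python
import html
from sacremoses import MosesTokenizer

mt = MosesTokenizer(lang="en")  # illustrative language code

with open("mono.en.txt") as fin, open("mono.en.tok", "w") as fout:
    for line in fin:
        # Replace HTML escape sequences (&amp;, &quot;, ...) with real
        # characters, then apply Moses tokenization.
        clean = html.unescape(line.strip())
        fout.write(mt.tokenize(clean, return_str=True) + "\n")
```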


Synthesis Experiments | Benchmark Pt.8

Follow-up on Embedding Anomaly

As mentioned in the previous post, there seems to be an issue when generating embeddings, even for terms that occurred many times in both monolingual corpora. To investigate further, I plan to run the script on the term covid19 only.
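As a starting point for that investigation, here is a minimal sketch of how one could check a single term such as covid19: count its occurrences in each corpus and verify that it actually lands in the embedding vocabulary (using gensim's word2vec purely for illustration; the post's own embedding script is not shown, and the file names are hypothetical):

```python
from gensim.models import Word2Vec

def occurrences(path, term):
    """Count how often `term` appears as a token in a tokenized corpus."""
    count = 0
    with open(path) as f:
        for line in f:
            count += line.split().count(term)
    return count

term = "covid19"
for path in ("mono.en.tok", "mono.fr.tok"):   # hypothetical file names
    print(path, occurrences(path, term))

# Train a small word2vec model and check whether the term made it into
# the vocabulary; a min_count threshold is one common reason an
# apparently frequent term can still be dropped.
sentences = [line.split() for line in open("mono.en.tok")]
model = Word2Vec(sentences, vector_size=100, min_count=5)
print(term in model.wv.key_to_index)
```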
