Synthesis Experiments | Benchmark Pt.11

BPE

Based on previous outputs, there seemed to be a problem with a large number of unknown tokens in the output translations (see the last line of synthesis/train_model/pass-1/fs_generate*.out, where BLEU = 1.87, for example). It was then recommended that the entire corpus be tokenized with BPE before training. The results did improve significantly: on the third pass (synthesis/train_model/pass-3), I got BLEU = 10.92. Still not great, but at least greatly improved.
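For context, here is a minimal sketch of how BPE could be applied to a corpus before training, using the sentencepiece library (the actual scripts under synthesis/train_model may differ; the file names and vocabulary size below are hypothetical):

```python
import sentencepiece as spm

# Train a BPE model on the training corpus (hypothetical file names).
spm.SentencePieceTrainer.train(
    input="corpus.train.txt",   # one sentence per line
    model_prefix="bpe",         # writes bpe.model and bpe.vocab
    vocab_size=16000,           # illustrative size, not the value used in the experiments
    model_type="bpe",
)

# Encode the corpus into subword units so rare words no longer
# surface as unknown tokens at translation time.
sp = spm.SentencePieceProcessor(model_file="bpe.model")
with open("corpus.train.txt") as fin, open("corpus.train.bpe", "w") as fout:
    for line in fin:
        fout.write(" ".join(sp.encode(line.strip(), out_type=str)) + "\n")
```

Rare words get split into frequent subword pieces, which is why the unknown-token rate (and with it BLEU) improves.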

I'm Back!

So it's been a bit since I made my last post… The past two months have been kind of insane with classes, schoolwork, and perhaps most importantly, the much-dreaded summer internship search 👻. But I finally got an offer, and I am now ready to get back on track with blogging!


Synthesis Experiments | Benchmark Pt.10

Generating a Synthetic Corpus

Following from the last post, where it was discussed that unaligned words are inevitable given noisy data, I edited the script to skip words that have no alignment. In addition, I ran the Moses tokenizer (tokenize_parallel.sh) on the parallel corpus as well, just in case (even though the website said it is already tokenized). While I succeeded in generating a synthetic corpus after these modifications, the output was strangely repetitive. Note that for each term in the terminology, we keep a list of words from the parallel corpus that it can replace. However, it looks like the script loops over one specific word repeatedly instead of iterating over all of the words in that list.
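To make the intended behavior concrete, here is a minimal sketch of the substitution step, assuming a hypothetical alignment mapping and terminology list (the real script and its data structures are not shown in the post, so all names here are invented):

```python
def substitute(src, tgt, alignment, word, term_src, term_tgt):
    """Replace `word` in the source sentence and its aligned target word
    with the terminology pair, skipping unaligned occurrences."""
    out_src, out_tgt = list(src), list(tgt)
    for i, token in enumerate(src):
        if token != word:
            continue
        if i not in alignment:       # the fix: skip words with no alignment
            continue
        out_src[i] = term_src
        out_tgt[alignment[i]] = term_tgt
    return out_src, out_tgt

def generate(term_src, term_tgt, replaceable, corpus):
    """Loop over ALL of the term's replaceable words; the repetitive output
    suggests the script was looping over one word several times instead."""
    for word in replaceable:
        for src, tgt, alignment in corpus:
            if word in src:
                yield substitute(src, tgt, alignment, word, term_src, term_tgt)
```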


Synthesis Experiments | Benchmark Pt.9

Using Moses

The monolingual datasets and the terminologies have been tokenized using the Moses tokenizer, and the special characters left over from HTML (escaped entities such as &amp;) have been replaced. A new experiment is now running with the pre-tokenized datasets.
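The post runs the Moses tokenizer via a shell script; a minimal sketch of the same two steps in Python, using the sacremoses port of the Moses tokenizer and the standard library's html module (file names and language code are hypothetical), might look like:

```python
import html
from sacremoses import MosesTokenizer

mt = MosesTokenizer(lang="en")  # illustrative language code

with open("mono.en.txt") as fin, open("mono.en.tok", "w") as fout:
    for line in fin:
        # Replace HTML escape sequences (&amp;, &quot;, ...) with real
        # characters, then apply Moses tokenization.
        clean = html.unescape(line.strip())
        fout.write(mt.tokenize(clean, return_str=True) + "\n")
```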


Synthesis Experiments | Benchmark Pt.8

Follow-up on Embedding Anomaly

As mentioned in the previous post, there seems to be an issue when generating embeddings, even for terms that occurred many times in both monolingual corpora. To investigate further, I plan to run the script on the term covid19 only.
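As a starting point for that investigation, here is a minimal sketch of how one could check a single term such as covid19: count its occurrences in each corpus and verify that it actually lands in the embedding vocabulary (using gensim's word2vec purely for illustration; the post's own embedding script is not shown, and the file names are hypothetical):

```python
from gensim.models import Word2Vec

def occurrences(path, term):
    """Count how often `term` appears as a token in a tokenized corpus."""
    count = 0
    with open(path) as f:
        for line in f:
            count += line.split().count(term)
    return count

term = "covid19"
for path in ("mono.en.tok", "mono.fr.tok"):   # hypothetical file names
    print(path, occurrences(path, term))

# Train a small word2vec model and check whether the term made it into
# the vocabulary; a min_count threshold is one common reason an
# apparently frequent term can still be dropped.
sentences = [line.split() for line in open("mono.en.tok")]
model = Word2Vec(sentences, vector_size=100, min_count=5)
print(term in model.wv.key_to_index)
```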
