11 Apr 2021
BPE
Based on previous outputs, there seemed to be a problem where a lot of unknown tokens appeared in the output translation (see, for example, the last line of synthesis/train_model/pass-1/fs_generate*.out, where BLEU = 1.87). It was then recommended that the whole corpus be tokenized before training. The results did improve significantly: in the third pass (synthesis/train_model/pass-3), I got BLEU = 10.92. Still not great, but greatly improved.
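Given the post title, I read "tokenized" here as BPE segmentation. Below is a minimal sketch of that preprocessing step, assuming subword-nmt as the BPE implementation; the file names and the number of merges are placeholders, not the actual setup under synthesis/train_model/.

```python
# Hypothetical sketch: learn BPE codes on the tokenized training corpus and
# apply them before training, so rare words are split into known subwords
# instead of surfacing as unknown tokens at translation time.
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Placeholder paths; the real files live under synthesis/train_model/.
with open("train.tok.src", encoding="utf-8") as infile, \
     open("bpe.codes", "w", encoding="utf-8") as codes_out:
    learn_bpe(infile, codes_out, num_symbols=10000)   # 10k merges is an assumption

with open("bpe.codes", encoding="utf-8") as codes_in:
    bpe = BPE(codes_in)

with open("train.tok.src", encoding="utf-8") as src, \
     open("train.bpe.src", "w", encoding="utf-8") as out:
    for line in src:
        out.write(bpe.process_line(line))
```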
20 Mar 2021
So it's been a bit since I made my last post… The past two months have been kind of insane with classes, schoolwork, and perhaps most importantly, the much dreaded summer internship search. But I finally got an offer, and I am now ready to get back on track with blogging!
read more...
25 Jan 2021
Generating Synthetic Corpus
Following from the last post, it was discussed that situations where a word has no alignment are inevitable with noisy data. I therefore edited the script to skip words that are not aligned. In addition, I ran the moses tokenizer (tokenize_parallel.sh) on the parallel corpus as well, just in case (even though the website said it is already tokenized). While I was successful in generating a synthetic corpus after this modification, it was strangely repetitive. Note that for each term in the terminology, we keep a list of words from the parallel corpus that it can replace. However, it looks like the script loops over one specific word a number of times instead of over all of the words, as sketched below.
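Here is a minimal sketch of the substitution loop as I understand it; the data structures and names (term_to_candidates, alignment, and so on) are my own placeholders, not the actual script, and the bug would be reusing one candidate word instead of iterating over all of them.

```python
# Hypothetical sketch of the intended substitution loop: for every term we keep
# a list of corpus words it may replace, iterate over ALL of those candidates
# (not just one), and skip any occurrence that has no alignment (noisy data).
term_to_candidates = {
    "covid19": ["coronavirus", "virus", "pandemic"],   # placeholder data
}

def make_synthetic_pairs(src_tokens, tgt_tokens, alignment, term, tgt_term):
    """Yield synthetic sentence pairs by substituting the term for each
    aligned candidate word; alignment maps source index -> target index."""
    pairs = []
    for candidate in term_to_candidates[term]:          # loop over all candidates
        for i, word in enumerate(src_tokens):
            if word != candidate:
                continue
            if i not in alignment:                      # unaligned word: skip it
                continue
            j = alignment[i]
            new_src = src_tokens[:i] + [term] + src_tokens[i + 1:]
            new_tgt = tgt_tokens[:j] + [tgt_term] + tgt_tokens[j + 1:]
            pairs.append((new_src, new_tgt))
    return pairs
```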
read more...
11 Jan 2021
Using moses
The monolingual datasets and terminologies have been tokenized using the moses tokenizer, and the special characters from HTML have been replaced. A new experiment is now being run with the pre-tokenized datasets.
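A quick sketch of that preprocessing, assuming the sacremoses port of the moses tokenizer and reading "special characters from HTML" as entities like &amp;amp;. The real pipeline uses the moses scripts, so this is only illustrative, and the language code and file names are placeholders.

```python
# Hypothetical sketch: replace HTML entities, then tokenize line by line with a
# moses-style tokenizer (sacremoses).
import html
from sacremoses import MosesTokenizer

tokenizer = MosesTokenizer(lang="en")   # language code is an assumption

def preprocess(line: str) -> str:
    line = html.unescape(line)          # e.g. "&amp;" -> "&", "&quot;" -> '"'
    return tokenizer.tokenize(line, return_str=True, escape=False)

with open("mono.txt", encoding="utf-8") as src, \
     open("mono.tok.txt", "w", encoding="utf-8") as out:
    for line in src:
        out.write(preprocess(line.rstrip("\n")) + "\n")
```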
read more...
4 Jan 2021
Follow-up on Embedding Anomaly
As mentioned in the previous post, there seems to be an issue when generating embeddings even if the term occurred many times in both monolingual corpora. To investigate further, I plan to run the script on the term covid19 only.
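Before that run, one quick sanity check is to confirm how often the term actually appears in each tokenized monolingual corpus. This is purely a sketch; the file paths are placeholders and this is not the embedding script itself.

```python
# Hypothetical sketch: count occurrences of a single term (here "covid19") in
# each monolingual corpus, to confirm it really is frequent before debugging
# the embedding step.
def count_term(path: str, term: str) -> int:
    total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            total += line.lower().split().count(term)
    return total

for corpus in ("mono.src.tok", "mono.tgt.tok"):     # placeholder paths
    print(corpus, count_term(corpus, "covid19"))
```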
read more...