Synthesis Experiments | Benchmark Pt.10

Generating Synthetic Corpus

Following up on the last post, where it was discussed that no-alignment cases are inevitable with noisy data, I edited the script to skip words that have no alignment. In addition, I ran the Moses tokenizer (tokenize_parallel.sh) on the parallel corpus just in case, even though the website said it is already tokenized. After these modifications the script did produce a synthetic corpus, but the output was strangely repetitive. Recall that for each terminology entry we keep a list of words from the parallel corpus that it can replace; however, the script appears to loop over one specific word repeatedly instead of iterating over all of the words.
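To make the "skip unaligned words" change concrete, here is a minimal sketch of what the candidate-collection step could look like. The function name build_candidates, the (src, tgt, alignment) corpus shape, and the can_replace predicate are assumptions for illustration only; the actual criteria live in the script.

```python
from collections import defaultdict

def build_candidates(terms, corpus, can_replace):
    """
    terms: terminology entries.
    corpus: list of (src_tokens, tgt_tokens, alignment) triples, where alignment is
            a set of (src_idx, tgt_idx) pairs produced by the word aligner.
    can_replace: callable(term, word) -> bool; stands in for the script's real test.
    Returns: term -> list of (sentence_id, src_index) occurrences it could replace.
    """
    candidates = defaultdict(list)
    for sent_id, (src, _tgt, align) in enumerate(corpus):
        aligned = {i for i, _ in align}   # source positions that have at least one alignment
        for i, word in enumerate(src):
            if i not in aligned:
                continue                  # unaligned word (noisy data): skip instead of failing
            for term in terms:
                if can_replace(term, word):
                    candidates[term].append((sent_id, i))
    return candidates
```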

Finding Bugs

The issue occurs in the generate_new_sentence_pairs function: the index is never advanced, so the word at index 0 is returned on every iteration of the loop. There are other concerns as well. Since the goal is to generate a new parallel corpus with the terminology included, the function needs a larger overhaul so that it emits complete sentence pairs rather than only spitting out individual sentences. I am still working on this new function.
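For reference, here is a rough sketch of the direction I have in mind for the rework. The signature, the Term tuple, and the data shapes are assumed for illustration and match the sketch above; the point is simply that the loop should advance over every candidate occurrence (rather than always reading index 0) and yield complete source/target pairs.

```python
from collections import namedtuple

Term = namedtuple("Term", ["src", "tgt"])  # a terminology entry: source and target form

def generate_new_sentence_pairs(corpus, candidates):
    """
    corpus: list of (src_tokens, tgt_tokens, alignment) triples.
    candidates: Term -> list of (sentence_id, src_index) occurrences it can replace.
    Yields new (src_sentence, tgt_sentence) strings with the term substituted on both sides.
    """
    for term, occurrences in candidates.items():
        for sent_id, src_idx in occurrences:    # advance over *all* occurrences,
            src, tgt, align = corpus[sent_id]   # not just the one at index 0
            tgt_idx = next((j for i, j in align if i == src_idx), None)
            if tgt_idx is None:
                continue                        # unaligned word: skip, as discussed above
            new_src = src[:src_idx] + [term.src] + src[src_idx + 1:]
            new_tgt = tgt[:tgt_idx] + [term.tgt] + tgt[tgt_idx + 1:]
            yield " ".join(new_src), " ".join(new_tgt)
```

Writing it as a generator keeps the corpus-writing step separate, so the same function could feed either a file writer or any further filtering.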