Literature Review Notes | ECO

As mentioned in the previous post, I did not know how to generate embeddings for $n$-gram glossary terms. This paper can hopefully shed light on how to approach this problem.

Efficient, Compositional, Order-Sensitive $n$-gram Embeddings

The ECO method creates decompositional embeddings for words offline and combines them to create new embeddings for phrases in real time. Upside: ECO can create embeddings for phrases not seen during training.
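To make the idea concrete, here is a toy sketch of "precompute per-word vectors offline, combine them at query time". It is not the paper's actual decomposition or combination rule: the position-specific lookup table, the random vectors, and the names `make_table` and `embed_phrase` are all assumptions made for illustration. It only shows how keeping one vector per (word, position) pair makes the combined phrase embedding depend on word order, and how any phrase built from known words can be embedded even if it was never seen as a whole.

```python
import random

DIM = 8  # toy embedding size (assumption)


def make_table(vocab, max_len, dim=DIM, seed=0):
    """Offline step: one vector per (word, position) pair.

    Random vectors stand in for trained ones in this sketch.
    """
    rng = random.Random(seed)
    return {
        (word, pos): [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        for word in vocab
        for pos in range(max_len)
    }


def embed_phrase(tokens, table, dim=DIM):
    """Online step: average the position-specific vectors of the tokens."""
    total = [0.0] * dim
    for pos, tok in enumerate(tokens):
        vec = table[(tok, pos)]
        total = [a + b for a, b in zip(total, vec)]
    return [x / len(tokens) for x in total]


table = make_table(["heart", "attack", "rate"], max_len=2)
forward = embed_phrase(["heart", "attack"], table)
reverse = embed_phrase(["attack", "heart"], table)
```

Because each word has a different vector at each position, `forward` and `reverse` differ, which a plain bag-of-words average would not capture.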


Synthesis Experiments | Benchmark Pt.2

Modifying script

I started by creating a new script, benchmark.py (based on synthesis.py), which can handle two parallel files with the --monolingual-glossary flag, provided that they are aligned line by line (in this case, specifically for the Facebook datasets).
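The line-by-line alignment requirement can be sketched as follows. This is not the code in benchmark.py; the function name `read_parallel` and its arguments are hypothetical, and the sketch only shows one way to pair up two files while failing loudly if they are not the same length.

```python
from itertools import zip_longest


def read_parallel(src_path, tgt_path):
    """Yield (source, target) line pairs from two line-aligned files.

    Raises ValueError if one file runs out of lines before the other.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(tgt_path, encoding="utf-8") as tgt:
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {line_no}")
            yield s.rstrip("\n"), t.rstrip("\n")
```

Using `zip_longest` instead of `zip` matters here: plain `zip` would silently stop at the shorter file and hide a misalignment.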


Synthesis Experiments | Benchmark Pt.1

Selecting training datasets

Considering the fact that we will mainly be working with medical-related terminologies, I went on OPUS to search for datasets that are more appropriate for this purpose than the EN-DE toy dataset. I came across EMEA - “a parallel corpus made out of PDF documents from the European Medicines Agency”. I chose the EN-FR datasets (in nice MOSES format) to experiment with, partly because I am familiar with both languages.


Literature Review Notes | TICO-19

Translation Initiative for COvid-19

With a focus on low-resource languages, TICO-19 aims to make data available to researchers in 35 different languages (26 lesser-resourced languages in addition to the 9 “pivot” languages) and to assist the development of tools that effectively disseminate critical information in disaster situations such as the current global pandemic.


CLSP Cluster Notes | Day 6

fairseq-generate

Other than the fact that I did not request a GPU in the beginning (I did not think the job needed one), the script ran smoothly - it only took me two tries of qsub! Here are some outputs (DE -> EN):

GERMAN | ENGLISH
darf ich ehrlich sein ? | can i be honest ?
mache einfach deinen job . | just do your job .
die ganze umwelt@@ diskussion . | the whole environmental discussion .
oh mein gott . | oh , my god .
die sonne geht auf | the sun is ris@@ ing .

:)
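
The "@@" markers in the outputs above are subword (BPE) continuation markers: a token ending in "@@" is joined to the token that follows it. A minimal sketch for merging them back into whole words, assuming the usual "@@ " convention; the helper name `remove_bpe` is my own and not part of any tool:

```python
import re


def remove_bpe(line, marker="@@"):
    """Join subword pieces by deleting the marker and the space after it."""
    return re.sub(re.escape(marker) + r"( |$)", "", line)


merged_de = remove_bpe("die ganze umwelt@@ diskussion .")
merged_en = remove_bpe("the sun is ris@@ ing .")
```

Lines without a marker pass through unchanged, so the helper can be applied to every output line.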