23 Oct 2020
As mentioned in the previous post, I did not know how to generate embeddings for $n$-gram glossary terms. This paper can hopefully shed light on how to approach this problem.
Efficient, Compositional, Order-Sensitive $n$-gram Embeddings
The ECO method creates decompositional embeddings for words offline and combines them to create new embeddings for phrases in real time. Upside: ECO can create embeddings for phrases not seen during training.
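To make the offline/online split concrete for myself, here is a loose sketch of the general idea: vectors for (word, position) pairs are prepared ahead of time, and a phrase embedding is assembled on demand by averaging them. The random vectors, the toy dimensionality, the averaging rule and the function names are all my own illustrative assumptions, not the paper's exact formulation.

```python
import random

DIM = 4  # toy dimensionality, chosen only for illustration


def build_position_tables(vocab, max_len, seed=0):
    """Offline step: one vector per (word, position) pair.

    Random vectors stand in for trained ones (an assumption of this sketch).
    """
    rng = random.Random(seed)
    return {
        (word, pos): [rng.uniform(-1.0, 1.0) for _ in range(DIM)]
        for word in vocab
        for pos in range(max_len)
    }


def embed_phrase(tokens, tables):
    """Online step: average the position-specific vectors of the tokens."""
    vectors = [tables[(tok, pos)] for pos, tok in enumerate(tokens)]
    return [sum(component) / len(vectors) for component in zip(*vectors)]


tables = build_position_tables(["heart", "attack", "rate"], max_len=3)

# Neither phrase was "seen" as a unit; both are composed from word-level vectors.
heart_attack = embed_phrase(["heart", "attack"], tables)
attack_heart = embed_phrase(["attack", "heart"], tables)
```

Because each word's vector depends on its position, the two word orders get different embeddings, which is the order-sensitive part.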
read more...
22 Oct 2020
Modifying script
I started by creating a new script, benchmark.py (based on synthesis.py), which can handle two parallel files with the --monolingual-glossary flag, provided that they are aligned line by line (in this case, specifically for the Facebook datasets).
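The line-by-line alignment requirement can be sketched roughly as follows. This is not the actual code in benchmark.py; the function name read_parallel and the strict length check are assumptions made for illustration.

```python
def read_parallel(src_path, tgt_path):
    """Return (source, target) pairs from two line-aligned text files.

    Hypothetical helper: line i of one file is assumed to be the
    translation of line i of the other.
    """
    with open(src_path, encoding="utf-8") as src_file:
        src_lines = [line.rstrip("\n") for line in src_file]
    with open(tgt_path, encoding="utf-8") as tgt_file:
        tgt_lines = [line.rstrip("\n") for line in tgt_file]

    # A length mismatch means the files cannot be aligned line by line.
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"files are not line-aligned: {len(src_lines)} vs {len(tgt_lines)} lines"
        )
    return list(zip(src_lines, tgt_lines))
```

Failing loudly on a length mismatch seemed safer to me than silently truncating to the shorter file, since a single missing line shifts every pair after it.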
read more...
21 Oct 2020
Selecting training datasets
Since we will mainly be working with medical terminology, I went on OPUS to search for datasets that are more appropriate for this purpose than the EN-DE toy dataset. I came across EMEA, "a parallel corpus made out of PDF documents from the European Medicines Agency". I chose the EN-FR datasets (in nice MOSES format) to experiment with, partly because I am familiar with both languages.
read more...
20 Sep 2020
Translation Initiative for COvid-19
With a focus on low-resource languages, TICO-19 aims to make data available in 35 different languages (26 lesser-resourced languages in addition to the 9 "pivot" languages) for researchers, and to assist the development of tools that effectively disseminate critical information in disaster situations such as the current global pandemic.
read more...
3 Sep 2020
fairseq-generate
Other than the fact that I did not request a GPU in the beginning (I did not think the script needed one), it ran smoothly and only took me two tries of qsub! Here are some outputs (DE -> EN):
| GERMAN | ENGLISH |
| --- | --- |
| darf ich ehrlich sein ? | can i be honest ? |
| mache einfach deinen job . | just do your job . |
| die ganze umwelt@@ diskussion . | the whole environmental discussion . |
| oh mein gott . | oh , my god . |
| die sonne geht auf | the sun is ris@@ ing . |
:)
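The @@ markers in the outputs above look like subword (BPE) continuation markers, where a token ending in @@ is joined to the token that follows it. A minimal sketch for merging them back into whole words, assuming that convention (the helper name remove_bpe is mine):

```python
import re


def remove_bpe(line):
    """Join subword pieces: drop each '@@' marker and the space after it."""
    return re.sub(r"@@( |$)", "", line)


print(remove_bpe("the sun is ris@@ ing ."))          # the sun is rising .
print(remove_bpe("die ganze umwelt@@ diskussion ."))  # die ganze umweltdiskussion .
```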