Literature Review Notes | ECO

As mentioned in the previous post, I did not know how to generate embeddings for $n$-gram glossary terms. This paper can hopefully shed light on how to approach this problem.

Efficient, Compositional, Order-Sensitive $n$-gram Embeddings

The ECO method creates decompositional embeddings for words offline and combines them to create new embeddings for phrases in real time. Upside: ECO can create embeddings for phrases not seen during training.

Word2Vec model

Word2Vec comes in four variants: CBOW (Continuous Bag-of-Words) with either hierarchical softmax or negative sampling, and SG (Skip-Gram) with the same two choices. The former predicts a single word $w$ from its surrounding context; the latter predicts the context words given $w$. Given a token $s_j$ in a sentence $s$, we calculate the log-probability (a softmax over inner products of the embeddings of the two tokens) of each context word $s_k$, $k \in [j - c, j) \cup (j, j + c]$, for a window size $c$ (half of the context size). We average across all $s_j \in s$, and finally average again across all $s \in S$, which gives us:

\[\frac{1}{|S|}\sum_{s \in S}\frac{1}{|s|}\sum_j\sum_k\log(p(s_k|s_j))\]

with probability distribution:

\[p(s_k | s_j) = \frac{\exp(\langle v_{s_k}^{out}, v_{s_j}\rangle)}{\sum_{w \in W}\exp(\langle v_w^{out}, v_{s_j} \rangle)}\]

where $v$ and $v^{out}$ denote the input and output representations of a word. The concept of input vs. output representations confused me, but this StackExchange thread cleared it up a bit: the input matrix maps a one-hot vector to its word embedding (the hidden layer), and the output matrix maps that hidden layer to a vocabulary-sized vector of scores, one per candidate context word.
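To make the input/output distinction concrete, here is a minimal NumPy sketch of the softmax above. The toy sizes and the names `V_in`, `V_out`, and `softmax_prob` are mine, not from the paper; in practice this full-vocabulary softmax is what hierarchical softmax or negative sampling approximates.

```python
import numpy as np

# Toy sizes (mine, not the paper's): vocabulary of 10 words, embedding dimension d = 6.
W, d = 10, 6
rng = np.random.default_rng(0)
V_in = rng.normal(size=(W, d))   # input representations v_w: one-hot -> hidden layer
V_out = rng.normal(size=(W, d))  # output representations v_w^out: hidden layer -> vocab scores

def softmax_prob(k, j):
    """p(s_k | s_j): softmax over inner products <v_w^out, v_{s_j}> across the vocabulary."""
    scores = V_out @ V_in[j]   # one score per vocabulary word
    scores -= scores.max()     # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[k] / exp_scores.sum()

# Example: probability that word 3 appears in the context of word 7.
print(softmax_prob(3, 7))
```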

Rudimentary approaches

ECO

Building on SG, we are going to define SEs (Skip-Embeddings), which split up the context around a word into multiple categories. Parametrize each word $w$ with $2c$ embeddings, each with its own objective function:

\[\frac{1}{|S|}\sum_{s \in S}\frac{1}{|s|}\sum_j\log(p(s_k|s_j))\]

where $s_k$ is the word $i$ positions away from $s_j$ in $s$, with $i \in [-c, 0) \cup (0, c]$; a negative $i$ means to the left of $s_j$, a positive $i$ to the right. The new probability distribution is:

\[p(s_k | s_j) = \frac{\exp(\langle v_{s_k}^{i, out}, v_{s_j}^i\rangle)}{\sum_{w \in W}\exp(\langle v_w^{i, out}, v_{s_j}^i \rangle)}\]

All $2c$ of these decompositional embeddings created per word are the SEs. Since a skip-embedding only accounts for a single token at distance $i$ from $w$, the dimensionality of $v_w^i$ is kept to $d/2c$ (as opposed to the $d$ used by Word2Vec), so each SE is trained with only $d/2c$ parameters per word.
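To see how the $2c$ tables fit together, here is a small sketch of the SE parametrization and the per-offset softmax. The sizes and the dict-of-tables layout are my own illustration; the paper does not prescribe a particular data structure.

```python
import numpy as np

# Toy sizes (mine): |W| = 10 words, target dimension d = 8, window c = 2,
# so each skip-embedding gets d / (2c) = 2 dimensions.
W, d, c = 10, 8, 2
d_se = d // (2 * c)
offsets = [i for i in range(-c, c + 1) if i != 0]   # i in [-c, 0) U (0, c]

rng = np.random.default_rng(0)
V_in = {i: rng.normal(size=(W, d_se)) for i in offsets}    # v_w^i
V_out = {i: rng.normal(size=(W, d_se)) for i in offsets}   # v_w^{i,out}

def se_prob(k, j, i):
    """p(s_k | s_j) under the offset-i skip-embedding."""
    scores = V_out[i] @ V_in[i][j]
    scores -= scores.max()
    exp_scores = np.exp(scores)
    return exp_scores[k] / exp_scores.sum()

# Example: how likely word 4 is to sit two positions to the left of word 7.
print(se_prob(4, 7, i=-2))
```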

To actually embed an $n$-gram once these SEs are trained, we average the relevant SEs into two vectors, $v^{L}$ and $v^{R}$, which summarize the left and right contexts of the $n$-gram independently, computed as follows:

\[v^{L}_{[w_1, \ldots, w_n]} = \frac{v_{w_1}^{-1} + v_{w_2}^{-2} + \ldots + v_{w_n}^{-n}}{n}\] \[v^{R}_{[w_1, \ldots, w_n]} = \frac{v_{w_1}^{n} + v_{w_2}^{n-1} + \ldots + v_{w_n}^{1}}{n}\]

Every term in $v^{L}$ is the SE of $w_m$ aimed at the word immediately preceding the $n$-gram (which sits $m$ positions to the left of $w_m$), and every term in $v^{R}$ is aimed at the word immediately following it.

Concatenating the two vectors gives a single embedding of the entire $n$-gram. Note that after concatenation we have a dimensionality of $d/c$.
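Putting it together, here is a sketch of the composition step, assuming the SEs have already been trained and $n \le c$ so the needed offsets exist. The function `eco_embed` and the toy tables are my own illustration, reusing the shape conventions from the previous sketch.

```python
import numpy as np

def eco_embed(ngram_ids, V_in):
    """Compose an n-gram embedding from skip-embeddings via the averaging above."""
    n = len(ngram_ids)
    # Left half: the m-th word (1-indexed) contributes v_{w_m}^{-m}, its SE aimed at
    # the word immediately left of the n-gram (m positions left of w_m).
    v_left = np.mean([V_in[-(k + 1)][w] for k, w in enumerate(ngram_ids)], axis=0)
    # Right half: w_m contributes v_{w_m}^{n-m+1}, aimed at the word immediately
    # right of the n-gram (n - m + 1 positions right of w_m).
    v_right = np.mean([V_in[n - k][w] for k, w in enumerate(ngram_ids)], axis=0)
    return np.concatenate([v_left, v_right])   # dimensionality 2 * (d/2c) = d/c

# Toy tables with the same shapes as the previous sketch: |W| = 10, c = 2, d/2c = 2.
rng = np.random.default_rng(0)
V_in = {i: rng.normal(size=(10, 2)) for i in (-2, -1, 1, 2)}
print(eco_embed([3, 7], V_in))   # bigram (w_1 = 3, w_2 = 7) -> a 4-dimensional vector
```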