Literature Review Notes | ECO
23 October 2020

As mentioned in the previous post, I did not know how to generate embeddings for $n$-gram glossary terms. This paper can hopefully shed light on how to approach this problem.
Efficient, Compositional, Order-Sensitive $n$-gram Embeddings
The ECO method creates decompositional embeddings for words offline and combines them in real time to embed new phrases. Upside: ECO can create embeddings for phrases not seen during training.
Word2Vec model
Word2Vec consists of four possible models: CBOW (Continuous Bag-of-Words) with hierarchical softmax or negative sampling, and SG (Skip-Gram) with the same choices. The former predicts a single word $w$ from its surrounding context; the latter predicts the context words given $w$. Given a token $s_j$, we calculate the log-probability (a softmax over inner products of the embeddings of the two tokens) of each context word $s_k$, $k \in [j - c, j) \cup (j, j + c]$ for a window size $c$ (half of the context size), average across all $s_j \in s$, and finally average again across all $s \in S$, which gives us:
\[\frac{1}{|S|}\sum_{s \in S}\frac{1}{|s|}\sum_j\sum_k\log(p(s_k|s_j))\]with probability distribution:
\[p(s_k | s_j) = \frac{\exp(\langle v_{s_k}^{out}, v_{s_j}\rangle)}{\sum_{w \in W}\exp(\langle v_w^{out}, v_{s_j} \rangle)}\]where $v$ and $v^{out}$ denote the input and output representations of a word. The concept of input vs. output confused me, but this StackExchange thread cleared it up a bit. First we map the input one-hot vector to a word embedding (the input representation), and then map the hidden layer to a vocabulary-sized output vector (using the output representations).
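To make the input/output distinction concrete, here is a minimal NumPy sketch of the SG softmax (my own illustration, not code from the paper); the vocabulary size, dimension, and function name are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 1000, 100                              # toy vocabulary size and embedding dimension (assumed)
V_in = rng.normal(scale=0.1, size=(V, d))     # input embeddings: one row per center word
V_out = rng.normal(scale=0.1, size=(V, d))    # output embeddings: one row per context word

def p_context_given_center(k: int, j: int) -> float:
    """p(s_k | s_j): a softmax over the inner products <v_w^out, v_{s_j}> for all w in W."""
    scores = V_out @ V_in[j]                  # one score per vocabulary word
    scores -= scores.max()                    # shift for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[k]
```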
Rudimentary approaches
- Treat an $n$-gram as a single word: just preprocess the text and run Word2Vec accordingly; cannot embed unknown $n$-grams even if each of their words appeared in the training corpus
- Simple average over embeddings: compute an embedding for each word and average them; does not take the ordering of the words in the $n$-gram into account (both baselines are sketched after this list)
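A quick sketch of the two baselines, assuming pretrained word vectors are available through a dict-like `wv` that maps each token to a NumPy array (e.g. gensim's `model.wv`); the function names and the underscore-joining convention are my own.

```python
import numpy as np

def ngram_as_token(wv, ngram):
    """Baseline 1: look up the n-gram as a single token. Requires that the n-gram
    was joined (here with '_') during preprocessing and seen in training;
    raises KeyError for unseen n-grams."""
    return wv["_".join(ngram)]

def average_embedding(wv, ngram):
    """Baseline 2: average the word vectors. Order-insensitive, so
    'dog bites man' and 'man bites dog' receive the same embedding."""
    return np.mean([wv[w] for w in ngram], axis=0)
```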
ECO
Building on SG, we define SEs (Skip-Embeddings), which split the context around a word into multiple categories. Parametrize each word $w$ with $2c$ embeddings, each with its own objective function:
\[\frac{1}{|S|}\sum_{s \in S}\frac{1}{|s|}\sum_j\log(p(s_k|s_j))\]where $s_k$ is the word $i$ positions away from $s_j$ in $s$ for a fixed offset $i \in [-c, 0) \cup (0, c]$; a negative $i$ means to the left of $s_j$, a positive $i$ to the right. The new probability distribution is:
\[p(s_k | s_j) = \frac{\exp(\langle v_{s_k}^{i, out}, v_{s_j}^i\rangle)}{\sum_{w \in W}\exp(\langle v_w^{i, out}, v_{s_j}^i \rangle)}\]All $2c$ of these decompositional embeddings created per word are SEs. Since a skip-embedding only accounts for a single token at distance $i$ from $w$, the dimensionality of $v_w^i$ is kept at $d/2c$ (as opposed to the $d$ used by Word2Vec), so each SE is trained with only $d/2c$ parameters.
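Here is a minimal sketch (my own, not the authors' code) of this parametrization: one input and one output table of skip-embeddings per offset, each of dimension $d/2c$. The toy sizes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, c = 1000, 120, 3                        # toy vocabulary size, total dimension, window size (assumed)
se_dim = d // (2 * c)                         # each SE gets d/2c dimensions
offsets = [i for i in range(-c, c + 1) if i != 0]

# one input and one output skip-embedding table per offset i
SE_in = {i: rng.normal(scale=0.1, size=(V, se_dim)) for i in offsets}
SE_out = {i: rng.normal(scale=0.1, size=(V, se_dim)) for i in offsets}

def p_offset(k: int, j: int, i: int) -> float:
    """p(s_k | s_j) for offset i: a softmax over <v_w^{i,out}, v_{s_j}^i>."""
    scores = SE_out[i] @ SE_in[i][j]
    scores -= scores.max()                    # shift for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[k]
```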
To actually embed an $n$-gram once these SEs exist, we average subsets of them into two vectors, $v^{L}$ and $v^{R}$, which summarize the left and right contexts of the $n$-gram independently, computed as follows:
\[v^{L}_{[w_1, \ldots, w_n]} = \frac{v_{w_1}^{-1} + \ldots + v_{w_n}^{-n}}{n}\] \[v^{R}_{[w_1, \ldots, w_n]} = \frac{v_{w_1}^{n} + \ldots + v_{w_n}^{1}}{n}\]Concatenate the two vectors to create a single embedding of the entire $n$-gram. The superscripts run in opposite directions because the word just before the phrase sits $i$ positions to the left of $w_i$, while the word just after it sits $n - i + 1$ positions to its right. Note that after concatenation we have a dimensionality of $d/c$.
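A sketch of this composition step, assuming the trained SEs are stored in a dict `se` mapping `(word, offset)` to a $d/2c$-dimensional NumPy array; the function name and storage layout are my own.

```python
import numpy as np

def eco_embed(se, ngram):
    """Embed [w_1, ..., w_n] by concatenating the averaged left- and right-context SEs.
    Assumes n <= c so that offsets up to ±n were trained."""
    n = len(ngram)
    # the word just before the phrase is (i + 1) positions left of ngram[i] ...
    v_left = np.mean([se[(w, -(i + 1))] for i, w in enumerate(ngram)], axis=0)
    # ... and the word just after the phrase is (n - i) positions to its right
    v_right = np.mean([se[(w, n - i)] for i, w in enumerate(ngram)], axis=0)
    return np.concatenate([v_left, v_right])  # final dimensionality: d/c
```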