CLSP Cluster Notes | Day 5

fairseq-train

To train the model, I had to specify an optimizer on top of what the tutorial on fairseq’s webpage gives you. I went with SGD:

It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (calculated from the entire data set) by an estimate thereof (calculated from a randomly selected subset of the data). Especially in high-dimensional optimization problems this reduces the computational burden, achieving faster iterations in trade for a lower convergence rate.
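In update-rule terms (my paraphrase of the above): instead of stepping with the full-dataset gradient, θ ← θ − η · ∇L(θ), each SGD step uses the gradient of the loss on one randomly drawn batch B, θ ← θ − η · ∇L_B(θ), so an iteration only has to touch |B| examples instead of the whole training set.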

I probably should have chosen Adam, which seems to be more popular now. Perhaps I will train the model with Adam as well and compare the output.

I set --max-tokens to 500 (this flag caps the batch size in tokens rather than in sentences); I was not sure whether 4000 would be too much to ask of the GPU.
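Putting it together, the training command looked roughly like this. The data directory, architecture, learning rate, and save directory below are placeholders taken from the tutorial setup, not my exact values:

    # sketch of the fairseq-train call, not my exact invocation
    fairseq-train data-bin/iwslt14.tokenized.de-en \
        --arch fconv_iwslt_de_en \
        --optimizer sgd --lr 0.25 --clip-norm 0.1 \
        --dropout 0.2 --max-tokens 500 \
        --save-dir checkpoints/sgd

If I do the Adam comparison later, it should mostly be a matter of swapping --optimizer sgd for --optimizer adam (fairseq also takes --adam-betas), assuming the rest of the setup stays the same.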

Speaking of the GPU, I also included Guanghui’s script to acquire one.
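I won’t reproduce the script here, but conceptually it comes down to asking the scheduler for a GPU and pointing the job at whatever device it gets assigned. A hypothetical sketch, not the actual script (the resource name and device index are guesses):

    # hypothetical: request one GPU from the grid scheduler
    qsub -l gpu=1 train.sh
    # inside the job, make sure fairseq only sees the assigned device
    export CUDA_VISIBLE_DEVICES=0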

The job stopped on the 18th epoch. I will go do my homework now; fairseq-generate is coming up next.
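For next time, the generation step should look roughly like this, reusing the same placeholder paths as above and assuming the best checkpoint landed in the save directory:

    # sketch of the follow-up generation call
    fairseq-generate data-bin/iwslt14.tokenized.de-en \
        --path checkpoints/sgd/checkpoint_best.pt \
        --beam 5 --remove-bpe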