This is currently #1 on the front page, but I think HN is celebrating the defeat of GPUs prematurely. The architectures used in this paper have just one hidden layer. It remains to be seen whether the ideas in SLIDE are applicable to general deep learning tasks and architectures.
From the previous paper:
"We choose the standard fully connected neural network with one hidden layer of size 128."
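For concreteness, here's a minimal sketch of that kind of architecture: a fully connected net with a single hidden layer of 128 units. The input/output sizes and the ReLU nonlinearity are my assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes: hidden = 128 per the quote; input/output dims are assumptions
n_in, n_hidden, n_out = 784, 128, 10

W1 = rng.standard_normal((n_in, n_hidden)) * 0.01
b1 = np.zeros(n_hidden)
W2 = rng.standard_normal((n_hidden, n_out)) * 0.01
b2 = np.zeros(n_out)

def forward(x):
    h = np.maximum(x @ W1 + b1, 0.0)  # ReLU hidden layer of 128 units
    return h @ W2 + b2                # linear output (logits)

x = rng.standard_normal((32, n_in))   # a batch of 32 fake inputs
logits = forward(x)
print(logits.shape)
```

At this scale the whole model is a couple hundred KB of weights, which is part of why the CPU-vs-GPU comparison looks the way it does.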
Yep, I feel like anyone who first starts playing around with NNs in a Jupyter notebook realizes that training is faster on the CPU until you level up to bigger networks.
That's normally due to the cost of copying the data into GPU memory. Here, they've reformulated the training algorithm into a less computationally complex problem. It's a huge difference.
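To illustrate the kind of reformulation being discussed (this is a toy, not SLIDE's actual implementation): instead of evaluating every hidden neuron, you can use random-hyperplane hashing to retrieve only the neurons whose weight vectors are likely to have a large inner product with the input, and compute just those. All sizes and the specific hash scheme below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

d, n_neurons, n_bits = 64, 1024, 8
W = rng.standard_normal((n_neurons, d))   # one weight vector per hidden neuron
planes = rng.standard_normal((n_bits, d))  # shared random hyperplanes

def signature(v):
    # Sign pattern against the hyperplanes, packed into an integer bucket id
    bits = (planes @ v > 0).astype(int)
    return int("".join(map(str, bits)), 2)

# Preprocessing: hash every neuron's weight vector into a bucket
table = {}
for i in range(n_neurons):
    table.setdefault(signature(W[i]), []).append(i)

def sparse_forward(x):
    # Evaluate only the neurons that share the input's bucket
    active = table.get(signature(x), [])
    acts = W[active] @ x if active else np.array([])
    return active, acts

x = rng.standard_normal(d)
active, acts = sparse_forward(x)
print(len(active), "of", n_neurons, "neurons evaluated")
```

The point is that the per-example work drops from O(n_neurons) dot products to roughly the bucket size, which is the sense in which the problem becomes less computationally complex rather than just faster on different hardware.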