Indeed, in this case ESRI is talking about deep-learning-based single-image super-resolution. With multiple images, even something as simple as shift-and-add can recover detail (and lower the noise in the picture), but you do need multiple images. Video, being a sequence of images with small or large movement between frames, can be an ideal source for robust SR algorithms. To complicate things further, there are deep-learning-based multi-image SR algorithms too.
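
For anyone curious what shift-and-add looks like in practice, here is a minimal sketch in Python/NumPy. It assumes the per-frame shifts are already known and are whole pixels on the upsampled grid; real pipelines have to estimate sub-pixel shifts by registration first and usually interpolate any holes left over. The function name `shift_and_add` and its parameters are my own, purely for illustration.

```python
import numpy as np

def shift_and_add(frames, shifts, scale):
    """Fuse low-res frames onto a grid `scale` times finer.

    frames: list of 2-D arrays, all with the same shape (h, w)
    shifts: list of (dy, dx) integer offsets on the fine grid, 0 <= d < scale
            (assumed known here; normally they come from image registration)
    """
    h, w = frames[0].shape
    acc = np.zeros((h * scale, w * scale), dtype=np.float64)  # sum of samples
    cnt = np.zeros_like(acc)                                  # samples per pixel
    for frame, (dy, dx) in zip(frames, shifts):
        # Each low-res pixel lands on every `scale`-th fine-grid position,
        # offset by that frame's shift.
        acc[dy::scale, dx::scale] += frame
        cnt[dy::scale, dx::scale] += 1
    out = np.full_like(acc, np.nan)  # NaN marks fine-grid pixels no frame covered
    covered = cnt > 0
    out[covered] = acc[covered] / cnt[covered]  # averaging also reduces noise
    return out
```

If the frames between them cover every offset on the fine grid, the output has no holes; frames that share an offset are simply averaged, which is where the noise reduction comes from.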