I'm Francesc Lluis, one of the coauthors of the paper.
The reason we don't use the audio stems as training data is that, during preparation of the MUSDB dataset, conversion to WAV can sometimes halt: the ffmpeg process that the musdb Python package uses internally to identify the dataset's MP4 audio streams occasionally freezes. The error appears to occur in the subprocess.Popen() call deep within the stempeg library. Because it happens randomly, we do not currently know how to fix it.
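For anyone hitting the same freeze, one possible mitigation (not something we used in the paper, and the command, timeout, and retry values below are purely illustrative) is to wrap the external call in a timeout-and-retry loop so a hung ffmpeg process gets killed instead of blocking the whole conversion:

```python
import subprocess

def run_with_timeout(cmd, timeout_s=60, retries=3):
    """Run an external command, killing and retrying it if it hangs.

    Sketch of a possible workaround for a randomly freezing subprocess;
    the timeout and retry counts are illustrative defaults, not values
    from the musdb/stempeg packages themselves.
    """
    for attempt in range(retries):
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # if it does not finish within timeout_s seconds.
            return subprocess.run(
                cmd, capture_output=True, timeout=timeout_s, check=True
            )
        except subprocess.TimeoutExpired:
            continue  # the hung process was killed; try again
    raise RuntimeError(f"command hung {retries} times: {cmd}")
```

A caller could apply this per-track, so a single frozen conversion only costs one timeout instead of stalling the whole dataset preparation.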