
Yes, this is true: you do lose some information between the layers, and that extra expressivity is the big benefit of using ML instead of classic feature engineering. However, I think the trade-off would be worth it for some use cases. You could, for instance, take an existing image, run it through a semantic segmentation model, and then edit the underlying image description. You could add a yellow hat to a person without regenerating any other part of the image, edit existing text, change a person's pose, and probably convert images to 3D more easily, etc.
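To make the "edit one region, leave the rest untouched" idea concrete, here's a toy sketch (my own illustration, not any real model's pipeline) where a segmentation mask scopes an edit to just the masked pixels:

```python
import numpy as np

def edit_region(image, mask, edit_fn):
    """Apply edit_fn only to pixels where mask is True; leave the rest untouched."""
    out = image.copy()
    out[mask] = edit_fn(image[mask])
    return out

# toy 4x4 RGB "image", all black; mask covers the top-left 2x2 region
img = np.zeros((4, 4, 3), dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True

# hypothetical edit: tint the masked region yellow (max R and G)
edited = edit_region(img, mask,
                     lambda px: px + np.array([255, 255, 0], dtype=np.uint8))

assert (edited[~mask] == 0).all()  # everything outside the mask is unchanged
```

In a real system the `edit_fn` would be a generative model conditioned on the edited description, but the point stands: an explicit segmentation interface gives you a handle to localize changes instead of regenerating the whole image.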

It's probably not a viable idea; I just wish for more composable modules that let us understand the models' representations better and change certain aspects of them, instead of these massive black boxes that mix all of these tasks into one.

I would also like to add that text2image models already have multiple interfaces between different parts. There's the text encoder, the latent-to-pixel-space VAE decoder, ControlNets, and sometimes a separate img2img style transfer step at the end. Transformers already process images patchwise, so why do those patches have to be uniform squares instead of semantically coherent regions?
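For contrast, here's a minimal sketch (my own toy illustration, with a made-up segmentation map) of standard square patching next to a hypothetical "semantic patching" that groups pixels by segment label instead:

```python
import numpy as np

def square_patches(image, p):
    """Standard ViT-style patching: non-overlapping p x p squares."""
    h, w = image.shape[:2]
    return [image[i:i + p, j:j + p]
            for i in range(0, h, p)
            for j in range(0, w, p)]

def semantic_patches(image, labels):
    """Hypothetical alternative: one pixel group per segment label."""
    return {k: image[labels == k] for k in np.unique(labels)}

img = np.arange(4 * 4 * 3, dtype=np.float32).reshape(4, 4, 3)

squares = square_patches(img, 2)          # four 2x2 patches
seg = np.array([[0, 0, 1, 1]] * 4)        # toy segmentation: left half vs right half
groups = semantic_patches(img, seg)       # two variable-size pixel groups
```

The catch, of course, is that semantic regions are variable-size and irregular, so they'd need per-region pooling or padding before a transformer could consume them as tokens, whereas square patches flatten into fixed-length vectors for free.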


