I also think this is the route we are heading, a few 1-7B or 14B param models that are very good at their tasks, stitched together with a model that's very good at delegating. Huggingface has Transformers Agents which "provides a natural language API on top of transformers: we define a set of curated tools and design an agent to interpret natural language and to use these tools"
Some of the tools it already has are:
Document question answering: given a document (such as a PDF) in image format, answer a question on this document (Donut)
Text question answering: given a long text and a question, answer the question in the text (Flan-T5)
Unconditional image captioning: Caption the image! (BLIP)
Image question answering: given an image, answer a question on this image (VILT)
Image segmentation: given an image and a prompt, output the segmentation mask of that prompt (CLIPSeg)
Speech to text: given an audio recording of a person talking, transcribe the speech into text (Whisper)
Text to speech: convert text to speech (SpeechT5)
Zero-shot text classification: given a text and a list of labels, identify to which label the text corresponds the most (BART)
Text summarization: summarize a long text in one or a few sentences (BART)
Translation: translate the text into a given language (NLLB)
Text downloader: to download a text from a web URL
Text to image: generate an image according to a prompt, leveraging stable diffusion
Image transformation: modify an image given an initial image and a prompt, leveraging instruct pix2pix stable diffusion
Text to video: generate a small video according to a prompt, leveraging damo-vilab
It's written in a way that allows the addition of custom tools so you can add use cases or swap models in and out.
I like the analogy to a router and local Mixture of Experts; that's basically how I see things going, as well. (Also, agreed that Huggingface has really gone far in making it possible to build such systems across many models.)
There's also another related sense for which we want routing across models for efficiency reasons in the local setting, even for tasks for the same input modalities:
First, attempt prediction on small(er) models, and if the constrained output is not sufficiently high probability (with highest calibration reliability), route to progressively larger models. If the process is exhausted, kick it to a human for further adjudication/checking.
Some of the tools it already has are:
Document question answering: given a document (such as a PDF) in image format, answer a question on this document (Donut)
Text question answering: given a long text and a question, answer the question in the text (Flan-T5)
Unconditional image captioning: Caption the image! (BLIP)
Image question answering: given an image, answer a question on this image (VILT)
Image segmentation: given an image and a prompt, output the segmentation mask of that prompt (CLIPSeg)
Speech to text: given an audio recording of a person talking, transcribe the speech into text (Whisper)
Text to speech: convert text to speech (SpeechT5)
Zero-shot text classification: given a text and a list of labels, identify to which label the text corresponds the most (BART)
Text summarization: summarize a long text in one or a few sentences (BART)
Translation: translate the text into a given language (NLLB)
Text downloader: to download a text from a web URL
Text to image: generate an image according to a prompt, leveraging stable diffusion
Image transformation: modify an image given an initial image and a prompt, leveraging instruct pix2pix stable diffusion
Text to video: generate a small video according to a prompt, leveraging damo-vilab
It's written in a way that allows the addition of custom tools so you can add use cases or swap models in and out.
https://huggingface.co/docs/transformers/transformers_agents