Description
See #159. For now, Google is the only provider whose multimodal output models we have implemented support for; the other multimodal output models are for audio generation, which we generally don't support yet.
In general, we need to figure out the best approach for implementing classes for those models. Technically, they use pretty much the same implementation that our current text generation model classes rely on. But because they support both text generation and image generation, we need a class that implements both TextGenerationModelInterface and ImageGenerationModelInterface.
Looking at the Google implementation, we could combine the GoogleTextGenerationModel and GoogleImageGenerationModel classes into one, but IMO that would make things very messy and crowded. So my initial suggestion is to leave those classes as they are and instead introduce another class, such as GoogleTextAndImageGenerationModel, which simply forwards each call to a new instance of the respective existing class, depending on which method is called. This would be similar to how GoogleImageGenerationModel currently instantiates a GoogleTextGenerationModel for Gemini multimodal image output models.
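To make the forwarding idea more concrete, here is a rough sketch of what such a class could look like. It is written in TypeScript purely for illustration and is not the project's actual code: the method names (generateTextResult, generateImageResult), the string-based signatures, the modelId constructor argument, and the stubbed return values are all assumptions, since the real interfaces are not shown in this issue. Only the class and interface names come from the description above.

```typescript
// Hypothetical, simplified versions of the two interfaces.
// The real method names and signatures are assumptions.
interface TextGenerationModelInterface {
  generateTextResult(prompt: string): string;
}

interface ImageGenerationModelInterface {
  generateImageResult(prompt: string): string;
}

// Stand-ins for the two existing classes, which stay untouched.
// They return tagged strings instead of calling a real API.
class GoogleTextGenerationModel implements TextGenerationModelInterface {
  private readonly modelId: string;

  constructor(modelId: string) {
    this.modelId = modelId;
  }

  generateTextResult(prompt: string): string {
    return `[text:${this.modelId}] ${prompt}`;
  }
}

class GoogleImageGenerationModel implements ImageGenerationModelInterface {
  private readonly modelId: string;

  constructor(modelId: string) {
    this.modelId = modelId;
  }

  generateImageResult(prompt: string): string {
    return `[image:${this.modelId}] ${prompt}`;
  }
}

// The proposed combined class: it implements both interfaces but holds no
// generation logic of its own. Each method creates a new instance of the
// respective existing class and forwards the call to it.
class GoogleTextAndImageGenerationModel
  implements TextGenerationModelInterface, ImageGenerationModelInterface
{
  private readonly modelId: string;

  constructor(modelId: string) {
    this.modelId = modelId;
  }

  generateTextResult(prompt: string): string {
    return new GoogleTextGenerationModel(this.modelId).generateTextResult(prompt);
  }

  generateImageResult(prompt: string): string {
    return new GoogleImageGenerationModel(this.modelId).generateImageResult(prompt);
  }
}

// Usage: one object satisfies both interfaces.
// "example-multimodal-model" is a placeholder ID, not a real model name.
const model = new GoogleTextAndImageGenerationModel("example-multimodal-model");
const textOut = model.generateTextResult("Describe a sunset");
const imageOut = model.generateImageResult("Draw a sunset");
```

The upside of this shape is that the two existing classes remain the single place where request building and response parsing live. The trade-off is one more class per provider that has such models, plus a fresh delegate instance on every call, which could be cached if that ever turns out to matter.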