Implement proper multimodal output model classes

See #159. For now, Google is the only provider whose multimodal output models we have implemented support for. The other multimodal output models are for audio generation, which we generally don't support yet.

More generally, we need to figure out the best approach for implementing classes for these models. Technically, they use almost exactly the same implementation that our current text generation model classes rely on. But because they support both text generation and image generation, we need a class that implements both `TextGenerationModelInterface` and `ImageGenerationModelInterface`.

Looking at the Google implementation, we could combine the `GoogleTextGenerationModel` and `GoogleImageGenerationModel` classes into one, but IMO that would make things very messy and crowded. So my initial suggestion is to leave those classes as they are and instead introduce another class, such as `GoogleTextAndImageGenerationModel`, which simply forwards to a new instance of the respective other class, depending on which method is called. This would be similar to how `GoogleImageGenerationModel` currently instantiates a `GoogleTextGenerationModel` for Gemini multimodal image output models.
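To make the forwarding idea concrete, here is a rough, language-neutral sketch written in TypeScript. It is not the project's actual code: the interfaces are simplified stand-ins, and the method names `generateTextResult` and `generateImageResult`, the string-based signatures, and the `modelId` constructor argument are all assumptions made purely for illustration.

```typescript
// Simplified, hypothetical stand-ins for the real interfaces. The method
// names and signatures are assumptions, not the project's actual API.
interface TextGenerationModelInterface {
  generateTextResult(prompt: string): string;
}

interface ImageGenerationModelInterface {
  generateImageResult(prompt: string): string;
}

// Stub for the existing text model class (real logic omitted).
class GoogleTextGenerationModel implements TextGenerationModelInterface {
  private readonly modelId: string;

  constructor(modelId: string) {
    this.modelId = modelId;
  }

  generateTextResult(prompt: string): string {
    return `[text:${this.modelId}] ${prompt}`;
  }
}

// Stub for the existing image model class (real logic omitted).
class GoogleImageGenerationModel implements ImageGenerationModelInterface {
  private readonly modelId: string;

  constructor(modelId: string) {
    this.modelId = modelId;
  }

  generateImageResult(prompt: string): string {
    return `[image:${this.modelId}] ${prompt}`;
  }
}

// The proposed combined class: it implements both interfaces and forwards
// each call to a new instance of the matching single-purpose class.
class GoogleTextAndImageGenerationModel
  implements TextGenerationModelInterface, ImageGenerationModelInterface
{
  private readonly modelId: string;

  constructor(modelId: string) {
    this.modelId = modelId;
  }

  generateTextResult(prompt: string): string {
    return new GoogleTextGenerationModel(this.modelId).generateTextResult(prompt);
  }

  generateImageResult(prompt: string): string {
    return new GoogleImageGenerationModel(this.modelId).generateImageResult(prompt);
  }
}
```

With this shape, the two existing classes stay untouched and the combined class carries no generation logic of its own, so it can satisfy code that expects either interface.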

Implement proper multimodal output model classes #160

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Implement proper multimodal output model classes #160

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions