Implement proper multimodal output model classes #160

@felixarntz

Description

See #159: For now, Google is the only provider whose multimodal output models we have implemented support for (the other multimodal output models are for audio generation, which we generally don't support yet).

In general, we need to figure out the best approach for implementing classes for those models. Technically, they use pretty much the exact same implementation that our current text generation model classes rely on. But because they support both text generation and image generation, we need a class that implements both TextGenerationModelInterface and ImageGenerationModelInterface.

Looking at the Google implementation, we could combine both classes GoogleTextGenerationModel and GoogleImageGenerationModel into one, but IMO that would make things very messy and crowded. So my initial suggestion would be to leave those classes as they are and instead introduce another class like GoogleTextAndImageGenerationModel, which simply forwards to a new instance of the respective other class, depending on which method is called. This would be similar to how GoogleImageGenerationModel currently instantiates a GoogleTextGenerationModel for Gemini multimodal image output models.
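
To make the forwarding idea concrete, here is a rough sketch in TypeScript, used purely for illustration. Only the class and interface names come from this issue. The method names (generateText, generateImage), their signatures, the constructor argument, and the placeholder return values are all assumptions and do not reflect the actual codebase. The sketch also creates each delegate lazily and reuses it, which is one possible reading of "forwards to a new instance".

```typescript
// Hypothetical stand-ins for the two interfaces named in this issue.
// Method names and signatures are assumptions for illustration only.
interface TextGenerationModelInterface {
  generateText(prompt: string): string;
}

interface ImageGenerationModelInterface {
  generateImage(prompt: string): string;
}

// Simplified placeholders for the existing classes, which stay untouched.
class GoogleTextGenerationModel implements TextGenerationModelInterface {
  private readonly modelSlug: string;

  constructor(modelSlug: string) {
    this.modelSlug = modelSlug;
  }

  generateText(prompt: string): string {
    // A real implementation would call the provider API here.
    return `text(${this.modelSlug}): ${prompt}`;
  }
}

class GoogleImageGenerationModel implements ImageGenerationModelInterface {
  private readonly modelSlug: string;

  constructor(modelSlug: string) {
    this.modelSlug = modelSlug;
  }

  generateImage(prompt: string): string {
    // A real implementation would call the provider API here.
    return `image(${this.modelSlug}): ${prompt}`;
  }
}

// The proposed class: implements both interfaces and only forwards calls.
class GoogleTextAndImageGenerationModel
  implements TextGenerationModelInterface, ImageGenerationModelInterface
{
  private readonly modelSlug: string;
  private textModel: GoogleTextGenerationModel | null = null;
  private imageModel: GoogleImageGenerationModel | null = null;

  constructor(modelSlug: string) {
    this.modelSlug = modelSlug;
  }

  generateText(prompt: string): string {
    // Create the text delegate on first use, then forward the call.
    if (this.textModel === null) {
      this.textModel = new GoogleTextGenerationModel(this.modelSlug);
    }
    return this.textModel.generateText(prompt);
  }

  generateImage(prompt: string): string {
    // Create the image delegate on first use, then forward the call.
    if (this.imageModel === null) {
      this.imageModel = new GoogleImageGenerationModel(this.modelSlug);
    }
    return this.imageModel.generateImage(prompt);
  }
}

// Usage: callers see one model object that satisfies both interfaces.
// The slug below is a made-up placeholder, not a real model name.
const model = new GoogleTextAndImageGenerationModel("example-multimodal-model");
const textResult = model.generateText("Describe a sunset");
const imageResult = model.generateImage("A sunset over the sea");
```

The combined class holds no generation logic of its own, so the two existing classes remain the single place where each capability is implemented.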
