Yes, the Google Gemini API natively supports both text and image inputs within a single API call.
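Because a single model processes both modalities together, this native multimodal capability can reduce latency and improve spatial and contextual reasoning compared with pipelines that first convert the image to text.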
Supported image formats include PNG, JPEG, WebP, HEIC, and HEIF. The inline limit is 7 MB per image; images referenced from Google Cloud Storage (GCS) can be up to 30 MB.
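To make these limits concrete, here is a minimal sketch of a pre-flight check that decides whether an image can be sent inline or should be referenced from GCS. The helper name choose_transport is hypothetical, the size limits are the figures quoted above, and treating 1 MB as 1024 * 1024 bytes is an assumption; the MIME type strings are the standard ones for the listed formats.

```python
# Hypothetical pre-flight check based on the formats and limits quoted above.
SUPPORTED_MIME_TYPES = {
    "image/png",
    "image/jpeg",
    "image/webp",
    "image/heic",
    "image/heif",
}

MB = 1024 * 1024  # assumption: the limits are counted in binary megabytes
INLINE_LIMIT_BYTES = 7 * MB   # per-image inline limit quoted above
GCS_LIMIT_BYTES = 30 * MB     # per-image GCS limit quoted above


def choose_transport(mime_type: str, size_bytes: int) -> str:
    """Return "inline" or "gcs" for an image, or raise ValueError if unusable."""
    if mime_type not in SUPPORTED_MIME_TYPES:
        raise ValueError(f"unsupported image type: {mime_type}")
    if size_bytes <= INLINE_LIMIT_BYTES:
        return "inline"
    if size_bytes <= GCS_LIMIT_BYTES:
        return "gcs"
    raise ValueError(f"image too large: {size_bytes} bytes")
```

A caller would run this before building the request, embedding small images directly and uploading larger ones to a bucket first.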
The request body contains a 'contents' array; each entry in it has a 'parts' array whose elements can each be a different modality, such as a text part followed by an image part.
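The request structure described above can be sketched as follows. This example only builds the JSON body and does not call the API. The helper name build_request_body is hypothetical, and the 'inline_data', 'mime_type', and 'data' field names follow the REST convention as I understand it, so check them against the current API reference before relying on them.

```python
import base64
import json


def build_request_body(prompt: str, image_bytes: bytes, mime_type: str) -> dict:
    """Build a request body with one text part and one inline image part."""
    # Inline image data is sent as a base64-encoded string.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "contents": [
            {
                "parts": [
                    {"text": prompt},
                    # Assumed field names for an inline image part.
                    {"inline_data": {"mime_type": mime_type, "data": encoded}},
                ]
            }
        ]
    }


if __name__ == "__main__":
    # Placeholder bytes standing in for a real image file.
    placeholder_image = b"\x89PNG\r\n\x1a\n"
    body = build_request_body(
        "Identify the load-bearing components in this diagram.",
        placeholder_image,
        "image/png",
    )
    print(json.dumps(body, indent=2))
```

Putting the text and the image in the same 'parts' array is what lets the model answer the question with the image as context in a single call.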
For example, a user could upload an image of a complex architectural diagram and, in the same request, ask the model to identify specific components.
In conclusion, combined text and image input in a single call is a foundational feature of the Gemini API rather than an add-on.
Knowledge provided by Answers.org.