Additionally, when you feed an image into a prompt UI, the UI simply runs image recognition on it, generates a text description, and feeds that description into the LLM.
All the LLM receives is “Picture containing a slice of pizza”. It has no control over the granularity of the image recognition software, and that software isn't designed to provide anything more than OCR and a rough, pattern-matched description of the image.
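
To make that concrete, here is a rough sketch of the pipeline as described. The names `describe_image` and `build_prompt` are made up for illustration, and the captioner is a stub that returns a fixed string rather than calling any real recognition model.

```python
def describe_image(image_bytes: bytes) -> str:
    """Hypothetical stand-in for the image recognition step.

    A real system would run OCR / captioning here; this stub returns a
    fixed caption so the example runs on its own.
    """
    return "Picture containing a slice of pizza"


def build_prompt(user_text: str, image_bytes: bytes) -> str:
    """Assemble the text the LLM would see under this assumed design.

    The image bytes are used only to produce the caption; they are never
    placed in the prompt itself.
    """
    caption = describe_image(image_bytes)
    return f"[Image: {caption}]\n{user_text}"


if __name__ == "__main__":
    fake_image = b"\x89PNG placeholder bytes"  # not a real image
    print(build_prompt("What toppings are on this?", fake_image))
```

In this sketch the only thing that reaches the model is the caption string, so any detail the captioner leaves out is simply not there for the LLM to reason about.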