LLM and VLM performance

By Jason Tzu-Cheng Chuang, 2024-01-19

AI models include Large Language Models (LLMs) and Vision-Language Models (VLMs). OpenAI's ChatGPT became popular in December 2022. LLaVA (Large Language and Vision Assistant) was introduced in 2023. In this comparison, I put the description shown on the Unsplash website for each photo next to the description generated by llava:13b.

More advanced question answering about the images is not covered here. Even so, the descriptions below show how much computer vision has improved.
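A description like the ones below can be requested from a locally running Ollama server. The following is a minimal sketch, not the exact script used for this post: it assumes Ollama is listening on its default local port (11434) and that the llava:13b model has already been pulled. The prompt text and the file name `sample1.jpg` are illustrative placeholders.

```python
import base64
import json
import urllib.request

# Default address of a local Ollama server (assumption: default install).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(image_bytes, prompt="Describe this image.", model="llava:13b"):
    """Build the JSON body for Ollama's /api/generate endpoint.

    The prompt is a placeholder; the post does not state which prompt was used.
    """
    return {
        "model": model,
        "prompt": prompt,
        # Ollama expects images as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        # Ask for a single JSON object instead of a stream of chunks.
        "stream": False,
    }


def describe_image(path, url=OLLAMA_URL):
    """Send one image file to the model and return its text description."""
    with open(path, "rb") as f:
        payload = build_payload(f.read())
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


if __name__ == "__main__":
    # Hypothetical file name; replace with a photo downloaded from Unsplash.
    print(describe_image("sample1.jpg"))
```

Because the model samples its output, running this twice on the same photo can give differently worded descriptions.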

Sample Photos from Unsplash

Sample1

Photo by Shelby Murphy Figueroa from Unsplash

Description from the Unsplash website:

a group of women standing next to each other holding flowers

Description from llava:13b:

The image features a group of seven women standing together 
and holding flowers in their hands. They are positioned close 
to each other, with some of the women wearing shorts as part 
of their outfits. The women appear to be posing for a picture, 
showcasing their colorful bouquets of flowers.
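One rough way to put a number on how well the two descriptions agree is to check what fraction of the words in the Unsplash caption also appear in the LLaVA output. The sketch below is only an illustration, not a method used in this post: the function names are made up, and plain word overlap ignores synonyms and counts common words such as "a" and "of".

```python
import re


def words(text):
    """Lowercase the text and return its set of distinct words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def caption_coverage(caption, description):
    """Fraction of distinct caption words that also occur in the description."""
    caption_words = words(caption)
    if not caption_words:
        return 0.0
    return len(caption_words & words(description)) / len(caption_words)


if __name__ == "__main__":
    unsplash = "a group of women standing next to each other holding flowers"
    llava = (
        "The image features a group of seven women standing together "
        "and holding flowers in their hands. They are positioned close "
        "to each other, with some of the women wearing shorts as part "
        "of their outfits. The women appear to be posing for a picture, "
        "showcasing their colorful bouquets of flowers."
    )
    print(f"caption coverage: {caption_coverage(unsplash, llava):.2f}")
```

A high score only says that the model mentions what the caption mentions. It does not check the extra details the model adds, such as the number of people.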

Sample2

Photo by Pipcke from Unsplash

Description from the Unsplash website:

modern living room with white furniture and classy lights

Description from llava:13b:

The image shows a modern and stylish living room filled with 
various white furniture. There is a couch situated in the middle 
of the room, accompanied by several chairs arranged throughout the space. 
A couple of tables can be seen, one near the seating area and 
another closer to the edge of the frame.

Additionally, there are two vases placed within the room, with one positioned
more towards the left side and the other on the right side. The living room also
features a large window that adds natural light to the space, creating an 
inviting atmosphere for relaxation and socialization.

Sample3

Photo by Nick Karvounis from Unsplash

Description from the Unsplash website:

a boat is in the water near a restaurant

Description from llava:13b:

The scene depicts a beautiful beach with a large boat sailing near 
the shore. The boat is docked close to the shoreline, creating an 
inviting and picturesque atmosphere. Numerous people can be seen 
scattered around the beach, enjoying the view and the sunny day. 
Some of them are standing near the water's edge, while others are
situated further away in various spots across the scene.

In addition to the boat and people, there is a chair positioned 
towards the middle of the image. The combination of the beach setting, 
the large boat, and the presence of multiple individuals contributes 
to creating a lively and captivating environment for both locals and visitors.

Sample4

Photo by Kristjan Kotar from Unsplash

Description from the Unsplash website:

four person on mountain during daytime

Description from llava:13b:

The image depicts a group of five people standing on top of a rocky hill 
or mountain, enjoying the panoramic view. They are all equipped with backpacks, 
suggesting that they may be hikers or travelers exploring the area together.

The scene captures the essence of adventure and camaraderie among the group 
members. The people seem to be taking a break from their journey or activity, 
perhaps admiring the stunning surroundings or simply appreciating the moment.

Sample5

Photo by Patrick Amoy from Unsplash

Description from the Unsplash website:

person holding a computer mouse in a room

Description from llava:13b:

The image depicts a room full of computers and people. There is a young man 
sitting at a desk, working on a laptop. Another person can be seen sitting 
nearby, also using a computer. Several other individuals are present in the 
room, some of them engaged with their laptops as well.

Various elements fill the room, including multiple cups scattered around the 
space and several chairs placed both close to and farther from the people. 
Additionally, there is a clock on the wall above the desks where the 
individuals are working. The scene suggests an environment like an office 
or study area, with many people collaborating or focusing on their tasks.

References

LLaVA: Large Language and Vision Assistant https://llava-vl.github.io/

Ollama: Run LLaVA:13b locally https://ollama.ai/