We introduce VideoGameBunny, a LLaVA-style model designed specifically for understanding video game images. We present a comprehensive dataset of game images and instruction pairs, demonstrating that our model can outperform larger state-of-the-art models in game-related tasks, paving the way for advanced AI assistants in video game understanding, playing, commentary, and debugging.
WACV 2025
Large multimodal models (LMMs) hold substantial promise across various domains, from personal assistance in daily tasks to sophisticated applications like medical diagnostics. However, their capabilities have limitations in the video game domain, such as challenges with scene understanding, hallucinations, and inaccurate descriptions of video game content, especially in open-source models. This paper describes the development of VideoGameBunny, a LLaVA-style model based on Bunny, specifically tailored for understanding images from video games. We release intermediate checkpoints, training logs, and an extensive dataset comprising 185,259 video game images from 413 titles, along with 389,565 image-instruction pairs that include image captions, question-answer pairs, and a JSON representation of 16 elements for 136,974 images. Our experiments show that our high-quality game-related data has the potential to make a relatively small model outperform the much larger state-of-the-art model LLaVA-1.6-34b (which has more than 4x the number of parameters). Our study paves the way for future research in video game understanding on tasks such as playing, commentary, and debugging.
VideoGameBunny is based on Bunny, an efficient and lightweight large multimodal model.
We collect a diverse dataset of 185,259 high-resolution images from 413 video games sourced from YouTube videos to address the lack of game-specific instruction-following data. We generate various types of instructions for these images using different large multimodal models: short captions, long captions, image-to-JSON conversions, and image-based question-answering pairs. The image-to-JSON format provides structured, detailed descriptions of game elements, while the question-answering data is generated using both text-based models (Llama-3) and image-based models (GPT-4o), where we feed an image to a model and ask a question about its contents. This comprehensive dataset aims to improve the ability of open-source models to understand and respond to video game content.
Task | Generator | Samples |
---|---|---|
Short Captions | Gemini-1.0-Pro-Vision | 70,673 |
Long Captions | GPT-4V | 70,799 |
Image-to-JSON | Gemini-1.5-Pro | 136,974 |
Question Answering | Llama-3, GPT-4o | 81,122 |
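To make the annotation pipeline concrete, below is a minimal sketch of how a single frame could be turned into a structured JSON description with Gemini-1.5-Pro via the google-generativeai SDK. The prompt wording, the requested keys, and the `annotate` helper are illustrative assumptions, not the exact prompt or element list used to build the dataset.

```python
# Minimal sketch of an image-to-JSON style annotation step using the
# google-generativeai SDK. The prompt and the requested keys are placeholders,
# not the exact instructions used to build the released dataset.
import json
import PIL.Image
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # assumption: key supplied by the user
model = genai.GenerativeModel("gemini-1.5-pro")

PROMPT = (
    "Describe this video game screenshot as a JSON object with keys such as "
    "'characters', 'objects', 'environment', 'on_screen_text', and 'actions'. "
    "Return only valid JSON."
)

def annotate(image_path: str) -> dict:
    """Ask the model for a structured JSON description of one game frame."""
    image = PIL.Image.open(image_path)
    response = model.generate_content([PROMPT, image])
    # The reply may be wrapped in a markdown code fence; keep only the JSON body.
    text = response.text.strip().strip("`")
    text = text.removeprefix("json").strip()
    return json.loads(text)

# Example: annotate("frames/game_frame_000123.jpg")
```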
We also use Gemini-1.5-Pro to create an evaluation set of 3,375 samples containing multiple-choice questions about the images in 10 different categories.
Category | Description | Sample Question | Count |
---|---|---|---|
Action Understanding | Recognizing and describing the actions taking place within the image. | What action is the character in the foreground performing? | 356 |
Anomalies and Glitches | Identifying errors, bugs, glitches, or placeholder elements within the game environment. | Describe any anomalies or glitches present in the image. | 223 |
Character Analysis | Recognizing characters, understanding their roles, and interpreting their expressions and poses. | What is Aloy's emotional state based on her facial expression? | 312 |
Common Sense Reasoning | Understanding the image using general knowledge and everyday logic. | Based on the score and time remaining, which team is likely to win the match? | 430 |
Gameplay Mechanics | Understanding the rules and mechanics that govern the game. | What game mechanic is most likely being utilized by the player character? | 273 |
OCR and UI | Reading and interpreting on-screen text and user interface elements. | What is written in the caption box at the bottom of the image? | 334 |
Miscellaneous | Any other type of question that does not fit into the previous categories. | What material are the containers in the image primarily made of? | 239 |
Scene Understanding | Recognizing and interpreting the overall environment or setting in the image. | The racetrack depicted in the image is set in what type of environment? | 566 |
Small Details | Identifying and interpreting small but significant details within the image. | What color is the jacket worn by the character in the foreground? | 356 |
Spatial Reasoning | Testing the ability to understand spatial relationships of objects present in the image. | What is the spatial relationship between the two red markers visible in the image? | 286 |
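As a rough illustration of how this evaluation set can be scored, the sketch below formats each multiple-choice question, extracts the model's answer letter, and reports per-category accuracy. The field names (`question`, `options`, `answer`, `category`, `image`), the four-option A–D format, and the `ask_model` callable are assumptions about the benchmark schema rather than the released format.

```python
# Sketch of scoring a model on the multiple-choice evaluation set. Field names
# and the A-D answer format are assumed; `ask_model(image, prompt)` stands in
# for whatever LMM inference call is used.
import re
from collections import defaultdict

def format_prompt(sample: dict) -> str:
    letters = "ABCD"  # assumes at most four options per question
    options = "\n".join(f"{letters[i]}. {opt}" for i, opt in enumerate(sample["options"]))
    return (f"{sample['question']}\n{options}\n"
            "Answer with the letter of the correct option only.")

def extract_letter(reply: str) -> str:
    match = re.search(r"\b([A-D])\b", reply.upper())
    return match.group(1) if match else ""

def evaluate(samples: list[dict], ask_model) -> dict[str, float]:
    stats = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for sample in samples:
        predicted = extract_letter(ask_model(sample["image"], format_prompt(sample)))
        cat = stats[sample["category"]]
        cat[0] += int(predicted == sample["answer"])
        cat[1] += 1
    return {name: 100.0 * correct / total for name, (correct, total) in stats.items()}
```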
We instruction tune Bunny using image-instruction pairs that contain image captions, image-based question answering, and image-to-JSON conversion.
We investigate which type of data has the potential to improve the model's performance. To do this, we fine-tune the Bunny model using a single dataset at a time, varying the subset size from 2K to 60K samples. We perform this fine-tuning only once for each dataset and subset size combination. Our goal is to observe overall performance trends rather than optimizing for the best performance. We continue increasing the subset size until we observe a sharp decline in performance, at which point we stop the experiment for that particular dataset. This approach allows us to efficiently explore the impact of different data types on the model's performance and identify which datasets are most promising for further investigation.
The table below reports the change in benchmark accuracy (in percentage points) relative to the base Bunny model; blank cells mark runs stopped after a sharp decline, and N/A marks subset sizes larger than the dataset itself.

Dataset | 2K | 5K | 10K | 20K | 40K | 50K | 60K |
---|---|---|---|---|---|---|---|
Short Caps. | -0.3 | +0.8 | -35.5 | -30.0 | | | |
Long Caps. | +3.6 | +3.8 | +4.6 | +6.8 | +6.4 | +6.3 | +4.2 |
IM-to-JS | +3.8 | +5.6 | +7.6 | +8.8 | +9.8 | +8.9 | +11.7 |
Llama-3 QA | +1.5 | +1.8 | +2.5 | +3.0 | +6.1 | +6.2 | +2.6 |
GPT-4o QA | +4.5 | +7.3 | +6.1 | N/A | N/A | N/A | N/A |
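The single-dataset sweep described above can be summarized with the following sketch: sample an increasingly large subset, fine-tune once per point, and stop after a sharp drop in accuracy. Here `finetune` and `evaluate` are placeholders for the actual training and benchmark-evaluation pipelines, and `DROP_THRESHOLD` is an illustrative stopping value, not the criterion used in the paper.

```python
# Sketch of the per-dataset ablation: fine-tune Bunny on growing random subsets
# of one instruction dataset and record the accuracy change over the base model.
# `finetune` and `evaluate` are placeholders for the real training/eval code.
import random
from typing import Callable

SUBSET_SIZES = [2_000, 5_000, 10_000, 20_000, 40_000, 50_000, 60_000]
DROP_THRESHOLD = 10.0  # illustrative: stop once accuracy falls this far below the best run

def sweep(dataset: list, base_accuracy: float,
          finetune: Callable, evaluate: Callable) -> dict[int, float]:
    results, best = {}, base_accuracy
    for size in SUBSET_SIZES:
        if size > len(dataset):
            break                                 # dataset exhausted (e.g. GPT-4o QA)
        subset = random.sample(dataset, size)     # one subset, one fine-tuning run
        accuracy = evaluate(finetune(subset))
        results[size] = accuracy - base_accuracy  # improvement over the base model
        if accuracy < best - DROP_THRESHOLD:
            break                                 # sharp decline -> stop this dataset
        best = max(best, accuracy)
    return results
```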
We explore different data mixing strategies to determine which approach improves the model's performance the most. We evaluate four strategies: Random, Equal, Stratified, and Weighted. The Random strategy involves sampling without replacement from all datasets combined. The Equal strategy selects an equal number of samples from each dataset. The Stratified approach mixes datasets based on video games to ensure a balanced representation of different games. The Weighted strategy prioritizes the three most effective datasets identified in our previous experiment: image-based question-answering (GPT-4o), long captions, and image-to-JSON. We fine-tune the Bunny model on these mixture strategies with dataset sizes ranging from 2K to 30K samples. Each experiment is repeated three times with different samples to report mean performance and standard deviation. We stop at 30K samples due to limitations in our smallest dataset (GPT-4o with 10K samples), which would be exhausted for the Equal and Weighted strategies at this point.
Size | Random | Equal | Stratified | Weighted |
---|---|---|---|---|
2K | 76.7 ± 0.9 | 77.8 ± 0.8 | 78.0 ± 0.2 | 79.0 ± 0.6 |
5K | 79.2 ± 0.4 | 79.9 ± 0.4 | 80.0 ± 0.5 | 79.8 ± 0.6 |
10K | 79.8 ± 0.8 | 80.8 ± 0.6 | 80.8 ± 0.1 | 81.4 ± 0.5 |
20K | 81.5 ± 0.1 | 81.3 ± 0.7 | 81.8 ± 0.8 | 82.3 ± 0.9 |
30K | 81.8 ± 0.4 | 81.2 ± 1.1 | 81.6 ± 0.7 | 82.6 ± 0.3 |
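A compact sketch of the four mixing strategies is given below. Each dataset is assumed to be a list of samples, with a `game` field available for the Stratified variant; the weights shown for the Weighted mixture are illustrative, not the exact ratios used in our experiments.

```python
# Sketch of the four data-mixing strategies (Random, Equal, Stratified, Weighted).
# `datasets` maps a dataset name to its list of samples; for the stratified mix,
# each sample is assumed to carry a "game" field. The weights below are illustrative.
import random
from collections import defaultdict

def mix_random(datasets: dict[str, list], n: int) -> list:
    pool = [s for ds in datasets.values() for s in ds]
    return random.sample(pool, n)                          # one combined pool

def mix_equal(datasets: dict[str, list], n: int) -> list:
    per = n // len(datasets)                               # same count per dataset
    return [s for ds in datasets.values()
            for s in random.sample(ds, min(per, len(ds)))]

def mix_stratified(datasets: dict[str, list], n: int) -> list:
    by_game = defaultdict(list)                            # balance across video games
    for ds in datasets.values():
        for s in ds:
            by_game[s["game"]].append(s)
    per = n // len(by_game)
    return [s for group in by_game.values()
            for s in random.sample(group, min(per, len(group)))]

def mix_weighted(datasets: dict[str, list], n: int,
                 weights: dict[str, float]) -> list:
    total = sum(weights.values())                          # favour the strongest datasets
    return [s for name, w in weights.items()
            for s in random.sample(datasets[name],
                                   min(int(n * w / total), len(datasets[name])))]

# Illustrative weights favouring GPT-4o QA, long captions, and image-to-JSON:
# mixture = mix_weighted(datasets, 20_000,
#                        {"gpt4o_qa": 0.4, "long_captions": 0.3, "image_to_json": 0.3})
```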
Breaking down the average improvement by category reveals that most of the gains come from game-specific categories (Anomalies and Glitches, and OCR and UI).
Category/Dataset Size | 2K | 5K | 10K | 20K | 30K |
---|---|---|---|---|---|
Action Understanding | 1.6 | 2.5 | 2.5 | 3.7 | 3.9 |
Anomalies and Glitches | 23.4 | 33.0 | 33.2 | 34.0 | 32.0 |
Character Analysis | 2.6 | 3.9 | 4.2 | 4.7 | 4.4 |
Common Sense Reasoning | 3.7 | 4.2 | 3.8 | 4.3 | 4.0 |
Gameplay Mechanics | 4.2 | 5.0 | 6.4 | 8.2 | 8.9 |
OCR and UI | 9.3 | 12.9 | 16.5 | 18.9 | 21.0 |
Miscellaneous | 7.2 | 7.9 | 9.6 | 9.9 | 9.8 |
Scene Understanding | -0.2 | 0.6 | 1.3 | 2.0 | 2.0 |
Small Details | 0.3 | 1.2 | 2.4 | 3.4 | 3.0 |
Spatial Reasoning | 5.3 | 6.2 | 7.1 | 7.8 | 7.4 |
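The per-category deltas in a table like the one above can be derived by comparing per-question correctness of the fine-tuned and base models; a minimal sketch follows, assuming each benchmark sample carries `id` and `category` fields.

```python
# Sketch of per-category improvement: fine-tuned accuracy minus base accuracy,
# grouped by question category. `base_correct`/`tuned_correct` map question ids
# to booleans; the field names are assumptions about the benchmark format.
from collections import defaultdict

def per_category_improvement(samples: list[dict],
                             base_correct: dict, tuned_correct: dict) -> dict:
    totals = defaultdict(lambda: [0, 0, 0])       # category -> [base hits, tuned hits, n]
    for s in samples:
        t = totals[s["category"]]
        t[0] += int(base_correct[s["id"]])
        t[1] += int(tuned_correct[s["id"]])
        t[2] += 1
    return {cat: 100.0 * (tuned - base) / n       # improvement in percentage points
            for cat, (base, tuned, n) in totals.items()}
```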
We assess the effectiveness of fine-tuning a smaller model on game-specific data by comparing our method to state-of-the-art open-source models. To do this, we fine-tune our model on a dataset of 50K image-instruction samples compiled from all previously introduced datasets. We then evaluate our model's performance against LLaVA-1.6, a state-of-the-art open-source model with 4.2 times more parameters. This comparison allows us to determine how well our approach of fine-tuning a smaller model on game-specific data performs compared to larger, more general models on game understanding tasks.
Model | Accuracy |
---|---|
LLaVA-v1.5-7b | 61.3 |
LLaVA-v1.5-13b | 64.6 |
LLaVA-v1.6-vicuna-13b | 71.7 |
LLaVA-v1.6-34b | 83.9 |
Bunny-1.1-Llama-3-8B (base) | 73.3 |
VideoGameBunny (ours) | 85.1 |
We introduce a new instruction-following dataset with 389,565 image-instruction pairs, specifically designed for video game understanding. We investigate the effectiveness of fine-tuning LMMs on different types of instruction-following data and on mixtures of them, and finally introduce VideoGameBunny, an 8B-parameter model that outperforms the state-of-the-art LLaVA-1.6-34B on a game-related question-answering benchmark.
Our dataset contains 389,565 image-instruction pairs that include image captions, question-answer pairs, and a JSON representation of 16 elements for 136,974 images.