Set-of-Mark Prompting Unleashes
Extraordinary Visual Grounding in GPT-4V

Jianwei Yang1*, Hao Zhang2*, Feng Li2*, Xueyan Zou3*, Chunyuan Li1, Jianfeng Gao1    1 Microsoft Research, Redmond   2 HKUST   3 University of Wisconsin-Madison    *Core Contributor   Project Lead
Abstract

We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks e.g., alphanumerics, masks, boxes. Using the marked image as input, GPT-4V can answer the questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks. For example, our experiments show that GPT-4V with SoM outperforms the state-of-the-art fully-finetuned referring segmentation model on RefCOCOg in a zero-shot setting.

Refer to caption
Refer to caption
Figure 1: Comparisons of GPT-4V prompting techniques: (left) standard prompting and (right) the proposed Set-of-Mark Prompting . Simply overlaying ids on image regions unleashes visual grounding and corrects answers for GPT-4V. Note that no marks are leaked to user text inputs..

1 Introduction

In the past few years, we have witnessed a significant advancement in large language models (LLMs) [2, 3, 10, 44, 60, 35]. In particular, Generative Pre-trained Transformers or GPTs [2, 35] have led a number of breakthroughs in the industry and academia. Since the release of GPT-4, large multimodal models (LMMs) have gained a growing interest in the research community. A number of works strive to build multimodal GPT-4 based on open-sourced models [28, 63, 57, 12, 48, 51, 20]. Recently, GPT-4V(ision) is released and attracts immediate attention from the community for its outstanding multimodal perception and reasoning capability. Its superiority and generality are showcased in [50]. Despite the unprecedented, strong, vision-language understanding capabilities, GPT-4V’s fine-grained visual grounding ability is relatively weak, or yet to be unleashed. For example, GPT-4V can hardly produce a sequence of accurate coordinates for a dog in the image111https://blog.roboflow.com/gpt-4-vision/, or a few traffic lights [54].

Our study is motivated by addressing the failed cases of GPT-4V on fine-grained vision tasks, such as object referring. In principle, these tasks require the model to have rich semantic understanding and accurate spatial understanding of visual contents. However, prompting GPT-4V to produce a sequence of tokens that contains textual description and numeric coordinates not only hurts the natural fluency in LLMs but also dismisses the spatial understanding ability in vision models used by LMMs, as pointed out by [7]. Therefore, in this study we focus on improving visual inputs by proposing a new visual prompting method to unleash the visual grounding capability of LMMs. Visual prompting has been explored for various vision and multimodal tasks [9, 19, 65, 46, 42]. These works can be categorized into two categories. One is to encode visual prompts such as point, box, and stroke into latent features, which are then used to prompt the vision models [65, 19]. The other is to overlay visual marks directly onto the input images. The overlaid marks could be a red circle [42], a highlighted region [49], or a few circles with arrows [50]. Although these studies demonstrate the promise of pixel-level visual prompting, they are limited to visually referring to one or a few objects. Moreover, all these marks are not easily “speakable” by LLMs, thus impeding the grounding capability for GPT-4V.

In this paper we present a new prompting mechanism called Set-of-Mark (SoM) prompting , i.e. simply adding a set of visual marks on top of image regions. We first partition an input image into a set of semantically meaningful regions. Then, we overlay each region with a mark of various formats such as numbers, alphabets, masks, or boxes. We perform an empirical study to validate whether GPT-4V could better ground the regions given these interpretable and “spreadable” visual marks. Our experiments show that SoM can drastically unleash the grounding capability of GPT-4V. As shown in Fig. 1, compared to the standard prompting, SoM helps GPT-4V accurately ground its answers in corresponding regions, e.g., the right laptop is “9” and the lamp is “12”. Likewise, it can also associate the chair on the left with “7”. Note that no information about the marks is leaked to the conversation with GPT-4V, which indicates that the model can not only understand the semantics of the visual marks but also know how and when to associate visual contents with marks. To the best of our knowledge, this is the first study to demonstrate that the emergent visual grounding ability of GPT-4V can be unleashed by visual prompting, SoM. To summarize, our main contributions are:

  • We propose a simple yet effective visual prompting, Set-of-Mark (SoM), for LMMs, and show empirically that SoM effectively unleashes the grounding ability of GPT-4V.

  • We have developed a new suite of evaluation benchmarks to examine the grounding ability of GPT-4V and other LMMs. For each image in the dataset, we employ off-the-shelf segmentation models to segment an image into regions, and overlay each region with visual marks, such as numeric numbers.

  • We have conducted quantitative and qualitative analysis to validate the effectiveness of SoM on a wide range of vision tasks. Our experiments show that SoM significantly improves GPT-4V’s performance on complex visual tasks that require grounding. For example, GPT-4V with SoM outperforms the state-of-the-art fully-finetuned referring segmentation model on RefCOCOg in a zero-shot setting.

2 Set-of-Mark Prompting

Refer to caption
Figure 2: Comparing standard GPT-4V and its combination with Set-of-Mark (SoM) Prompting. it clearly shows that our proposed prompting method helps GPT-4V to see more precisely and finally induce the correct answer. We highlight the differences between our method and the standard one. (The image is generated by Dalle-3 and is better viewed in color.)

This section introduces Set-of-Mark Prompting and explains how it can be applied to prompt the LMMs, GPT-4V in particular.

2.1 Problem Definition

Typically, LMMs take an image IH×W×3 and a text query of length of li, Ti=[t1i,,tlii] as input, and generate a sequence of textual output with length lo, To=[t1o,,tloo], formulated as:

To=(I,Ti) (1)

Given the versatility of current LLMs used in LMMs, the input and output texts can be comprised of different types of characters (e.g., both alphabetic and numeric), and also multilingual. A large number of works have attempted to improve prompt engineering on the input text Ti to endow more reasoning capability in LLMs [47, 53].

In contrast to prompt engineering for LLMs, the goal of this study is to develop a new prompting method for input images to unleash visual grounding ability of LMMs. In other words, we strive to equip a LMM with the ability to see location-by-location. This necessitates two essential properties for the prompting strategy:

  • The method should be able to partition an image into a set of semantically meaningful regions to align with the textual outputs, an ability known as grounding.

  • The auxiliary information cast to the input image should be both interpretable and speakable by the LMM, so that it can be described in its textual outputs.

With this in mind, we develop Set-of-Mark Prompting, a simple prompting method of overlaying a number of marks to the meaningful regions in an image. This operation augments the input image I to a marked image Im while keeping the other inputs to LMMs unchanged, as shown in Fig. 2. Mathematically, Eq. (1) becomes

To=(𝑆𝑜𝑀(I)Im,Ti). (2)

Although it is straightforward to apply SoM to all LMMs, we find that not all LMMs have the ability to “speak out” about the marks. Actually, we find that only GPT-4V, when equiped with SoM, shows the emergent grounding ability and significantly outperforms the other LMMs. In what follows, we will explain how we partition an image into regions and mark the image regions in SoM.

2.2 Image Partition

Given an input image, we need to extract meaningful and semantically aligned regions. Ideally, the extraction should be automatic or semi-automatic to avoid extra burden on users. To achieve this, we employ a suite of image segmentation tools. To support different use cases, the segmentation tools need to possess the following properties:

  • Strong performance: the region partition should be precise enough to convey fine-grained spatial layout information to GPT-4V. In this sense, we choose one of the state-of-the-art image segmentation models MaskDINO [24].

  • Open vocabulary: the segmentation models should be open-vocabulary so that they can recognize objects out of predefined vocabulary. As such, we opt for the advanced models like SEEM [65].

  • Rich granularity: Finally, the region of interest might be not only a full object but also a part of it. Therefore, we also employ SAM [19] and Semantic-SAM [21].

We have collected a suite of image segmentation models that offers a comprehensive toolbox for a user to partition arbitrary images. Most of models are interactive and promotable such that users can customize the SoM prompts interactively.

Based on our image partition toolkit, we divide an input image I of size H×W into K regions R=[r1,,rK]{0,1}K×H×W, which are represented by K binary masks. We show some SoM examples in Fig. 3 generated using our toolbox.

Refer to caption
Figure 3: We compile different segmentation models including SEEM [65], Semantic-SAM [21] and SAM [19] as the image partition tools used in our work. Users can select which granularity of masks to generate, and which mode to use between automatic (top) and interactive (bottom). A higher alpha blending value (0.4) is used for better visualization.

2.3 Set-of-Mark Generation

Once we obtain image partition M, we need to generate for each region a mark that is useful for grounding by GPT-4V. We consider two factors, the types and locations of marks.

Mark Type.

As we discussed earlier, the mark type depends on whether they can be interpreted by GPT-4V. In this work, we consider alphanumeric in that it is not only compact enough to not occupy much image space but recognizable by GPT-4V (using its OCR capability). Besides, we also consider boxes and mask boundaries as auxiliary marks. In addition, we note that the mark types should be image-dependent to avoid any conflicts with the original image contents. For example, given an arithmetic image full of numbers, the numeric marks should be avoided, while the alphabetic characters should be not used for a screenshot of a document. We leave the automatically determining which mark types to use and how to combine them to future work. Fig. 4 shows different mark types and the interpreted outputs from GPT-4V.

Mark Location.

Given an image of size H×W, we need to find good mark locations for all K regions. A straightforward way is to derive the center coordinate ck=(xkc,ykc) of k-th mask, and overlay the marks there. However, this inevitably introduces some overlaps or conflicts that may confuse GPT-4V, especially for images with dense objects, e.g., two objects centered around similar locations or concave regions, as shown in Fig. 5.

Refer to caption
Figure 4: Different types of marks can be used in our Set-of-Mark.
Refer to caption
Refer to caption
Figure 5: Left: some conflicts caused by putting all marks in the center. Right: our proposed mark allocation algorithm to address the conflicts.

Refer to caption User: What is in the image? GPT-4V: In the image, I observe the following items: Loaf or Cake (1) : A rectangular-shaped baked item, possibly a loaf of bread or a cake. Creamy Fruit Mixture (2) : A bowl containing a creamy mixture with various berries and possibly other fruits. It might be a fruit salad with a creamy dressing, yogurt with fruits, or some kind of dessert. Sliced Fruits (3): A plate with slices of various fruits, including what looks like pineapples and possibly slices of watermelon or another type. Table Surface (4): A portion of the table or countertop with a bluish hue. (con’t) Additionally, there are various decorative elements, plates, and bowls. User: What is in 3? GPT-4V: In the image, the item labeled "3" appears to be a plate of sliced fruits. It looks like there’s pineapple and possibly another type of melon or fruit, such as watermelon or pink grapefruit. User: Can I put the fruits in 3 into 2? GPT-4V: Yes, based on the image, the fruits in 3 appear to be sliced fresh fruits, and they can be mixed or poured into the creamy fruit mixture in 2 to enhance flavor and texture. It’s a common practice in desserts to combine fresh fruits with creamy mixtures. User: Which can be toasted, 1 or 2? GPT-4V: Based on the image, item 1 (which appears to be a loaf of bread or cake) is the one that can be toasted. Item 2 is a creamy fruit mixture and is not suitable for toasting. User: How long it usually takes to toast a slice of loaf in 1? GPT-4V: Typically, to toast a slice of a loaf like the one labeled as "1" in the image, it would take between 2 to 5 minutes in a toaster on a medium setting.
Figure 6: We can use two types of text prompts to LLMs: Left shows a plain text prompt, which does not include any clues about the marks; Right is an interleaved text prompt, which uses the marks to ground image regions.

To mitigate the problem, we propose a mark allocation algorithm as illustrated in the algorithm on the right side of Fig. 5. Given the set of masks M, we first calculate the areas for all and sort them in an ascending order (line 6). This strategy ensures that smaller regions are considered before large regions. To further avoid the potential overlaps, for k-th mask, we exclude the region that is covered by any k1 masks (line 8). The resulting mask is then fed to a distance transform algorithm, which helps to find the location inside the mask where the minimal distance to all boundary points is maximal. In practice, however, a region may be so small that the mark could cover the (almost) whole region. In this case, we move the marks off the region slightly. We find that GPT-4V can still build a decent association between the marks and regions.

Once we determine the mark type and locations for all regions, we overlay mark mk to region rk at location ck. We make sure that each mark is unique so that the full set M={m1,mk} are distinguishable and speakable by LLMs.

2.4 Interleaved Prompt

Thus far we have obtained a new image Im with overlaid marks. Additionally, we have a set of K region-mark pairs {r1,m1,,rK,mK}. Given the additional cues in Im, we can use either a plain text prompt or interleaved ones to prompt LMMs:

  • Plain Text Prompt. We can use plain text prompts as usual without any explicit reference to the marks/regions in the image. As shown in Fig. 1, even without any special textual prompt, GPT-4V can automatically ground itself in the regions and corresponding marks. It can be used in a wide range of scenarios where users do not have specific regions of interest.

  • Interleaved Text Prompt. We can use interleaved ones by injecting the marks into the text directly. Since the marks are interpretable to LMMs, we can seamlessly blend them into the original text prompt to make a symbolic reference.

Examples of applying these two types of text prompts based on SoM are demonstrated in Fig. 6. Note that for each question, we use a new chat window to avoid context leakage during the conversation. In real-world scenarios, the above two text prompting strategies can be combined and used in multi-round conversations with GPT-4V to significantly enrich human-AI interaction. Moreover, users can also opt to draw the marks by themselves or revise the marks generated using the toolbox.

3 SoM Prompting for Vision

Refer to caption User: I have labeled a bright numeric ID at the center for each visual object in the image. Please enumerate their names. You must answer by selecting from the following names: [Person, …, curtain] (ignore the 80 categories in COCO) GPT-4V 1. Person 2. Person 3. Person 4. Surfboard 5. Handbag
Refer to caption User: I have labeled a bright numeric ID at the center for each visual object in the image. Please tell me the IDs for: The laptop behind the beer bottle; Laptop turned on. GPT-4V The IDs for the items you’ve mentioned: The laptop behind the beer bottle: 6 Laptop turned on: 2
Refer to caption User: The 2 images are from the same video, where the first image is the first frame and the second image is a later frame. In the first image, there are 2 objects labeled with 1,2. Can you track these 2 objects in the second image? GPT-4V 1. The object labeled with 1 (a boxer in red headgear) is most similar to the object labeled with 2 (another boxer in red headgear). 2. The object labeled with 2 (a boxer in blue) is most similar to the object labeled with 1 (another boxer in blue).
Refer to caption User: I have labeled a bright numeric ID at the center for each visual object in the image. Given the image showing a man in glasses holding a piece of paper, find the corresponding regions for a man in glasses, a piece of paper. GPT-4V The "man in glasses" corresponds to the region labeled with "2". The "piece of paper" corresponds to the region labeled "5".
Figure 7: Examples of some vision tasks we benchmark on GPT-4V.

We highlight that the unique merit of using SoM to prompt GPT-4V is that it can produce outputs beyond texts. Since each mark is exclusively associated with an image region represented by a mask, we can trace back the masks for any mentioned marks in the text outputs. Consider the example in Fig. 7, the GPT-4V response on the left side contains the names and details of four regions. It can induce the one-to-one mappings between marks and text description, i.e., mktextk. Given that rkmk, we can further associate textual description with the masks for all regions. Finally, we can bridge the triplets rk,mk,textk for all regions, i.e., rkmktextk. The ability to produce paired texts and masks allows the SoM prompted GPT-4V to produce plausible visually grounded texts and more importantly support a variety of fine-grained vision tasks, which are challenging for the vanilla GPT-4V model.

Vision Tasks.

We quantitatively examine the performance of SoM prompted GPT-4V. With simple prompt engineering, it can be readily used in, but not limited to, a wide range of vision tasks such as

  • Open-vocabulary Image Segmentation: We ask GPT-4V to exhaustively tell the categories for all marked regions and the categories that are selected from a predetermined pool.

  • Referring Segmentation: Given a referring expression, the task for GPT-4V is selecting the top-matched region from the candidates produced by our image partition toolbox.

  • Phrase Grounding: Slightly different from referring segmentation, phrase grounding uses a complete sentence consisting of multiple noun phrases. We ask GPT-4V to allocate the corresponding regions for all labeled phrases.

  • Video Object Segmentation: It takes two images as input. The first image is the query image which contains a few objects of interest to identify in the second image. Given that GPT-4V supports multiple images as input, our prompting method can also be applied to ground visual objects across frames in a video.

We show how we prompt GPT-4V using SoM for the above vision tasks in Fig. 7. It is worth noting that SoM can be applied to broader tasks, such as region captioning, keypoint localization, and part segmentation, etc. Due to the limited access (quota) to GPT-4V, we focus on the aforementioned vision tasks and leave the explorations to other tasks for future work. We present the detailed empirical study in the next section.

4 Experiments

4.1 Experimental Setup

Implementation.

We do not need to train any models for our method. However, due to the limited quota and absence of GPT-4V API, we have to exhaustively send the SoM augmented images to the ChatGPT interface. The authors in this work used a divide-and-conquer strategy to run the experiments and evaluations. For each instance, we use a new chat window so that there is no context leakage during the evaluation. In this sense, all the reported results for our method are zero-shot. Unless otherwise mentioned, we always use the numeric numbers as the marks for reporting the main results. The usage of other mark types is shown in our ablation and qualitative studies.

Benchmarks.

Given the limited quota of GPT-4V, it is not possible for now to fully evaluate the validation set in each individual vision task as we listed above. Consequently, we select a small subset of validation data from each dataset for our study. For each image in the datasets, we overlay a set of marks on the regions extracted using our image partition toolbox. Depending on the specific task, we exploit different partition tools to propose regions. In Table 1, we list the setup for each task.

Comparisons.

We compare our method with:

  • GPT-4V baseline predicting coordinates. We use this as our baseline model. By default, GPT-4V can neither predict boxes nor generate masks. Following prior practice, we prompt the model to predict box coordinates. Comparing with the default GPT-4V baseline helps to inspect the benefit of our proposed Set-of-Mark Prompting.

  • State-of-the-art specialist models. For each of the vision tasks, a number of methods have been proposed. We choose state-of-the-art and representative methods for comparison. More specifically, MaskDINO [24] for generic segmentation on COCO, OpenSeeD [55] for zeroshot segmentation on ADE20K, Grounding DINO [31] and GLIPv2 [56] for phrase grounding on Flick30K, Grounding DINO and PolyFormer [29] for referring expression comprehension, PolyFormer and SEEM [65] for referring expression segmentation and SegGPT [46] for video object segmentation. We attempt to shed light on the gap between the strongest generalist vision model GPT-4V and specialist models that are sophisticatedly designed and trained with the take-specific data.

  • Open-sourced LMMs. We quantitatively evaluate one of the state-of-the-art open-sourced LMMs, LLaVa-1.5 [27], and qualitatively compare with MiniGPT-v2 [5] in our study. Note that both models are trained with a good amount of data from the targeted vision tasks. We are the first to compare closed- and open-sourced LMMs on vision benchmarks.

Vision Task Dataset #Images #Instances Marks Metric
Generic Segmentation COCO [26] 100 567 Number & Mask Precision
OV Segmentation ADE20K [62] 100 488 Number & Mask Precision
Phrase Grounding Flickr30K [38] 100 274 Number & Box Recall@1
Referring Expression Segmentation RefCOCO [18] 100 177 Number & Mask mIoU
Referring Expression Comprehension RefCOCO [18] 100 177 Number & Mask ACC@0.5
Video Object Segmentation DAVIS [37] 71 157 Number & Mask J&F
Table 1: Benchmarks used to evaluate the visual grounding capability of GPT-4V. Note that we select a small subset of images or videos from the corresponding datasets.

4.2 Quantitative Results

We present the experimental results in Table 2.

Model LMM Zero-shot OV Segmentation RefCOCOg Phrase Grounding VOS
COCO ADE20K REC RES Flickr30K DAVIS2017
MaskDINO [23] 80.7 - n/a n/a n/a n/a
OpenSeeD [55] - 23.4 n/a n/a n/a n/a
GLIPv2 [56] - - - - 87.7* n/a
GDINO [31] n/a n/a 86.1* n/a 90.5 n/a
X-Decoder [64] - - - 64.6* n/a 62.8
PolyFormer [29] n/a n/a 85.8* 67.2 n/a n/a
SegGPT [46] n/a n/a n/a n/a n/a 75.6
SEEM [65] - - - 65.7* - 62.8
RedCircle [42] n/a n/a 59.4* n/a n/a n/a
FGVP [49] n/a n/a 63.3* n/a n/a n/a
Shikra [6] n/a n/a 82.6* n/a 77.4 n/a
LLaVA-1.5 [27] n/a n/a 63.3 n/a n/a n/a
MiniGPT-v2 [5] n/a n/a 84.4* n/a n/a n/a
Ferret [54] n/a n/a 85.8* n/a 81.1 n/a
GPT-4V [36] n/a n/a 25.7 n/a n/a n/a
GPT-4V [36] + SoM (Ours) 75.7 63.4 86.4 75.6 89.2 78.8
Table 2: Main quantitative results. * denotes that the number is reported in the original papers which are evaluated in the full validation datasets. Other numbers are evaluated in our sub-sampled validation datasets constraint by GPT-4V interface.
Table 3: The performance on Flickr30K with different mark types on a subset of our dataset.
Mark type Flickr (R@1) Number & Mask 80.7 Number & Mask & Box 85.0
Mask type Refcocog (mIoU) MaskDINO 75.6 GT mask 90.1
Table 3: The performance on Flickr30K with different mark types on a subset of our dataset.
Table 4: The performance on Flickr30K with different mask types.

Image Segmentation

On image segmentation tasks, we evaluate the classification precision of GPT-4V + SoM and compare it with the strong segmentation model MaskDINO on COCO Panoptic segmentation dataset and OpenSeeD on ADE20K Panoptic segmentation dataset. For MaskDINO and OpenSeeD, we give them GT boxes and evaluate the output class. For GPT-4V + SoM, we overlay GT masks with alpha=0.4 and 0.2 for COCO and OpenSeeD, respectively, and add the ID number on each mask. We provide GPT-4V with the vocabulary of the datasets and ask it select a class label for each region. The results show that zero-shot performance of GPT-4V + SoM is close to the performance of fine-tuned MaskDINO and is much higher than the zero-shot performance of OpenSeeD on ADE20K. The similar performance on COCO and ADE20K for GPT-4V indicates its strong generalization ability to a wide of visual and semantic domains.

Referring

For referring tasks, we evaluate RES and REC on RefCOCOg. We use MaskDINO to propose masks and overlay the masks and numbers on the images. We use mIoU as the evaluation metric and compare it with state-of-the-art specialist PolyFormer [29] and SEEM [65]. Accordingly, GPT-4V + SoM outperforms PolyFormer by a large margin. Note that the performance of PolyFormer is evaluated on our dataset for apple-to-apple comparison. For REC, we convert the masks into boxes for our method. When prompting GPT-4V to directly output the coordinates, we attain significantly poor performance (25.7), which verifies our earlier hypothesis. Once augmented with SoM, GPT-4V beats both specialists such as Grounding DINO and Polyformer and recent open-source LMM including Shikra, LLaVA-1.5, MiniGPT-v2, and Ferret.

Phrase Grounding

For phrase grounding on Flickr30K, we use Grounding DINO to generate box proposals for each image. We use a threshold of 0.27 to filter redundant boxes, then we use SAM to predict a mask for each box. We draw boxes, masks, and numbers on the images. We give GPT-4V a caption and the noun phrases of the caption for each image and let GPT-4V to ground each noun phrase to a region. Our zero-shot performance is comparable with SOTA models GLIPv2 and Grounding DINO.

Video Object Segmentation

We evaluate DAVIS2017 [39] for the video segmentation task. To get the mask proposals of each video frame, we use MaskDINO [24]. The predicted masks are then overlayed on the corresponding frames with numeric labels following our SoM. We follow the semi-supervised setting to use the first frame masks as the reference and segment all the other frames of the same video. As GPT-4V can take in multiple images, we prompt GPT-4V with the first frame and the current frame to do segmentation by comparing similar objects across images. As shown in the last column of Table 2, GPT-4V achieves the best tracking performance (78.8 J&F) compared with other generalist vision models.

4.3 Ablation Study

Mark type.

We study how the choice of mark types affects the final performance for the phrase grounding task on Flickr30k. We compare two types of marks. The first is numbers and masks and the second is numbers, masks, and boxes. As shown in Table 4, the scheme of adding extra boxes can improve the performance significantly.

Golden mark location.

The extracted regions from our toolbox may often come with some errors. In this study, we investigate how GPT-4V behaves when we generate the marks with ground-truth annotations. Specifically, we replace the predicted segmentation masks with ground-truth ones in our RefCOCOg validation set. This means GPT-4V only needs to select one from regions for annotated phrases. As expected, the performance for referring segmentation can be further improved, especially when the segmentation model has some missed regions. The results are shown in Table 4, which shows using ground-truth masks in our SoM improves the RefCOCOg performance by 14.5 mIoU. We also observe that most failure cases when using GT masks are not the GPT-4V’s problem, but the annotation itself is ambiguous or incorrect.

4.4 Qualitative Observations

When running through this work, we observed several intriguing qualitative findings.

Refer to caption
Figure 8: (a, b, c) Examples of some wrong annotations for referring segmentation (RefCOCOg) and grounding (Flikr30k). (e, f) Examples of confusing annotated labels.

The “golden” annotations are not always golden.

When evaluating SoM, we find that a few human annotations in datasets are incorrect. Some examples are shown in Fig. 8 (a,b,c). For example, human users annotate several people while the referring expression is “the legs of a person sitting on the couch” or only one truck is annotated for referring to “truck race”. In contrast, GPT-4V with SoM yields correct answers. This implies that the “golden” labels in vision datasets may be subject to some noises while GPT-4V could be a good judge to help find the ambiguous annotations for further cleaning.

The centered marks are not always the best.

By default, our mark allocation algorithm finds the center for each region. However, we observe that putting marks at the central locations does not necessarily bring the best performance. For example in Fig. 8 (e), there are two marks, one is for the person and the other one for the backpack. When being asked which one is the person, GPT-4V gives the mark for the backpack which is located at the upper part of the human body. A similar situation also happens in the second image where there is a background spanning the whole image. An example is shown in 8 (f), where the center of the grassland (labeled 2) is on the person (labeled 1). These results indicate that the focus of GPT-4V to understand visual contents is not necessarily at the center. Developing a better mark allocation algorithm should be considered.

Dynamitc selection of mark types.

In realistic scenarios, we find dynamically determining which type of mark to use is important. For example, an image of arithmetic problems usually contains many numbers. In this case, overlaying numeric marks on to will confuse the GPT-4V. Likewise, for a screenshot of a document, it might be not rational to overlay more alphabets on the image. As a result, to fully leverage the visual capability of GPT-4V, users may need to carefully design the SoM prompts before sending the image.

5 Related Work

We discuss related work from the perspective of prompting.

LLMs and Prompting.

We have witnessed significant progress in large language models (LLMs) [2, 3, 10, 44, 60, 35]. Of particular, Generative Pre-trained Transformers, commonly known as GPTs [2, 35], have led to a breakthrough in the realm of natural language processing. Despite the size of LLMs growing dramatically, eliciting reasoning capabilities still requires more sophisticatedly designed queries, i.e., prompting. In the past, a number of works attempted to do prompt engineering to endow more capability for LLMs. In-context learning is a main-stream way to teach LLMs to follow specific instructions as instantiated by a few examples [2, 15] As a follow-up, some other techniques such as chain-of-thought and tree-of-thought [47, 53] are proposed to enhance the arithmetic, commonsense, and symbolic reasoning tasks. Analogous to these works, our Set-of-Mark Prompting can be considered as a way of prompting the model to look at the image regions location-by-location. However, it also differs in that no in-context examples are needed in our method.

Visual and Multimodal Prompting.

In earlier works on interactive image segmentation [34, 8, 9], the spatial prompting is used so that the model can take multiple user inputs (e.g., clicks) to gradually refine the mask. Recently, the pioneering work SAM [19] and its variants [43, 11] have proposed a unified model to support different types of prompting modes including points, boxes, and texts. In SEEM [65], the authors proposed a novel visual sampler to expand to visual prompting across images. Likewise, PerSAM [58] and SAM-PT [41] finetune SAM to support personalized segmentation and point tracking. Another line of work directly generates the prompts on the input images. In [1], image inpainting is used as the prompting to teach the model to predict dense outputs given example images and corresponding outputs, both in pixel space. Painter [45] and SegGPT [46] share similar spirits while using masked image modeling for decoding pixels in continuous space. Visual prompting can be also applied to multimodal models like CLIP [40]. RedCircle [42] draws a red circle on an image to focus the vision model on the enclosed region. In [49], the authors enhance the prompts by specifically segmenting and highlighting a target object in the image. Both methods then ask CLIP model to measure the similarity between the prompted image and a number of textual concepts.

LMMs and Prompting.

In a short time, large multimodal models (LMMs) have emerged as a new line of research in the community. The goal is to build a generalist multimodal system that behaves as LLMs that can perceive and reason with multimodal inputs [28, 63, 57, 12, 59, 48, 51, 6, 20, 5, 54]. Earlier works like MiniGPT-4 [5] and LLaVa [28] proposed a simple yet effective way to connect vision and LLMs followed by an instruction tuning. Later, a similar training strategy is applied to video tasks [33, 4] and fine-grained vision tasks [6, 59, 20, 5, 54]. Please refer to [20] for a more comprehensive review of LMMs. Thus far, prompting LMMs is rarely explored in academia, partially because most of the recent open-sourced models are of limited capacity and thus incapable of such emerging capability [47]. Most recently, GPT-4V [36] was released followed by a comprehensive qualitative study on it [50]. The authors in [50] used a similar prompting strategy as in RedCircle [42] to prompt GPT-4V. However, it only shows some referring capability of GPT-4V with hand-drawn circles or arrows, let alone a comprehensive quantitative study.

6 Discussion

The mystery in GPT-4V.

It is still mysterious why the proposed Set-of-Mark Prompting could work so well with GPT-4V. Out of curiosity, we also ran some examples on other open-sourced LMMs such as LLaVa-1.5 and MiniGPT-v2. However, both models can hardly interpret the marks and ground themselves on those marks. We hypothesize a few reasons for the extraordinary visual grounding capability exhibited in GPT-4V. First, scale matters. We believe the scale of model and training data used in GPT-4V is several orders of magnitude than the aforementioned open-sourced LMMs. Second, the data curation strategy is probably another secret sauce for GPT-4V. GPT-4V could automatically associate the image regions and marks without any explicit prompt in texts. Such kind of data could be probably from literature figures, charts, etc, which are usually clearly labeled or marked [1]. We doubt that GPT-4V specifically employs fine-grained vision data as covered in this work. Note that the amount is extremely limited, and annotating more such kinds of data is costly, not to mention the quality is hard to control. In the end, we note that all the above suspicions are not grounded on the facts but on some empirical studies as we conducted above.

Connecting visual and LLMs prompting.

Despite the unknowns behind GPT-4V. Our work does take one of the first steps to connect visual prompting and LLMs prompting. In the past, many works have studied how to make vision models more promptable, which is isolated from the text promptings to LLMs. The barrier is mainly due to the language bottleneck in that we can hardly express the visual promptings precisely in language. For example, the shape/location/color of a randomly drawn stroke on the image can be hardly described verbally, unless we can encode the visual prompts and finetune the whole model [54]. However, given the limited fine-grained training data and inferior open-sourced large vision and language models. We still see a clear gap. At the current stage, our proposed Set-of-Mark Prompting demonstrates a simple yet only feasible way to inherit all existing capabilities of the strongest LMM while unleashing its strongly aspired grounding capability. We hope this work as the first to seamlessly connect visual and language prompting, could help to pave the road towards more capable LMMs.

Scaling data via Set-of-Mark Prompting with GPT-4V.

In the past, the whole community has strived to build fine-grained, open-vocabulary vision systems, spanning from detection [17, 61, 25, 56, 52, 31] to segmentation [16, 64, 14], and further expand to 3D [32, 13, 30]. Though tremendous progress, we have been struggling with how to map the rich semantic knowledge from CLIP to the fine-grained domains. This is mainly due to the extremely limited number of fine-grained semantic annotations, which is still a considerable challenge to the current LMMs. As we already see, the feasibility of scaling data with fine-grained spatial has been demonstrated in SAM [19] and Semantic-SAM [22], but how to further annotate these regions with semantic labels is still an open problem. In the light of the study in this work, we are envisioning the potential of using GPT-4V plus our Set-of-Mark Prompting to intensively scale up the multimodal data which has both fine-grained spatial and detailed language description.

7 Conclusion

We have presented Set-of-Mark Prompting, a simple yet effective visual prompting mechanism for LMMs, particularly GPT-4V. We show that simply overlaying a number of symbolic marks on a set of regions of an input image can unleash the visual grounding ability of GPT-4V. We present a comprehensive empirical study on a wide range of fine-grained vision tasks to demonstrate that SoM-prompted GPT-4V is superior to fully-finetuned specialist models and other open-sourced LMMs. Moreover, our qualitative results show that GPT-4V with SoM possesses extraordinary fine-grained multimodal perception, cognition, and reasoning capabilities across the board. We hope that SoM would inspire future works on multimodal prompting for LMMs and pave the road towards multimodal AGI.

Acknowledgement

We thank Fangrui Zhu for the thoughtful discussions.

References

  • [1] Amir Bar, Yossi Gandelsman, Trevor Darrell, Amir Globerson, and Alexei Efros. Visual prompting via image inpainting. Advances in Neural Information Processing Systems, 35:25005–25017, 2022.
  • [2] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.
  • [3] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023.
  • [4] Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei Huang, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, and Limin Wang. Videollm: Modeling video sequence with large language models, 2023.
  • [5] Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechu Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: Large language model as a unified interface for vision-language multi-task learning. github, 2023.
  • [6] Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic, 2023.
  • [7] Ting Chen, Saurabh Saxena, Lala Li, David J. Fleet, and Geoffrey Hinton. Pix2seq: A language modeling framework for object detection, 2022.
  • [8] Xi Chen, Zhiyan Zhao, Feiwu Yu, Yilei Zhang, and Manni Duan. Conditional diffusion for interactive segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7345–7354, 2021.
  • [9] Xi Chen, Zhiyan Zhao, Yilei Zhang, Manni Duan, Donglian Qi, and Hengshuang Zhao. Focalclick: Towards practical interactive image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1300–1309, 2022.
  • [10] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  • [11] Haixing Dai, Chong Ma, Zhengliang Liu, Yiwei Li, Peng Shu, Xiaozheng Wei, Lin Zhao, Zihao Wu, Dajiang Zhu, Wei Liu, et al. Samaug: Point prompt augmentation for segment anything model. arXiv preprint arXiv:2307.01187, 2023.
  • [12] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.
  • [13] Runyu Ding, Jihan Yang, Chuhui Xue, Wenqing Zhang, Song Bai, and Xiaojuan Qi. Pla: Language-driven open-vocabulary 3d scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7010–7019, 2023.
  • [14] Zheng Ding, Jieke Wang, and Zhuowen Tu. Open-vocabulary panoptic segmentation with maskclip. arXiv preprint arXiv:2208.08984, 2022.
  • [15] Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. A survey for in-context learning. arXiv preprint arXiv:2301.00234, 2022.
  • [16] Golnaz Ghiasi, Xiuye Gu, Yin Cui, and Tsung-Yi Lin. Scaling open-vocabulary image segmentation with image-level labels. In European Conference on Computer Vision, pages 540–557. Springer, 2022.
  • [17] Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921, 2021.
  • [18] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014.
  • [19] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything, 2023.
  • [20] Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, and Jianfeng Gao. Multimodal foundation models: From specialists to general-purpose assistants. arXiv preprint arXiv:2309.10020, 2023.
  • [21] Feng Li, Hao Zhang, Peize Sun, Xueyan Zou, Shilong Liu, Jianwei Yang, Chunyuan Li, Lei Zhang, and Jianfeng Gao. Semantic-sam: Segment and recognize anything at any granularity, 2023.
  • [22] Feng Li, Hao Zhang, Peize Sun, Xueyan Zou, Shilong Liu, Jianwei Yang, Chunyuan Li, Lei Zhang, and Jianfeng Gao. Semantic-sam: Segment and recognize anything at any granularity. arXiv preprint arXiv:2307.04767, 2023.
  • [23] Feng Li, Hao Zhang, Huaizhe xu, Shilong Liu, Lei Zhang, Lionel M. Ni, and Heung-Yeung Shum. Mask dino: Towards a unified transformer-based framework for object detection and segmentation, 2022.
  • [24] Feng Li, Hao Zhang, Huaizhe Xu, Shilong Liu, Lei Zhang, Lionel M Ni, and Heung-Yeung Shum. Mask dino: Towards a unified transformer-based framework for object detection and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3041–3050, 2023.
  • [25] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975, 2022.
  • [26] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2015.
  • [27] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023.
  • [28] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
  • [29] Jiang Liu, Hui Ding, Zhaowei Cai, Yuting Zhang, Ravi Kumar Satzoda, Vijay Mahadevan, and R. Manmatha. Polyformer: Referring image segmentation as sequential polygon generation, 2023.
  • [30] Kunhao Liu, Fangneng Zhan, Jiahui Zhang, Muyu Xu, Yingchen Yu, Abdulmotaleb El Saddik, Christian Theobalt, Eric Xing, and Shijian Lu. 3d open-vocabulary segmentation with foundation models. arXiv preprint arXiv:2305.14093, 2023.
  • [31] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding dino: Marrying dino with grounded pre-training for open-set object detection, 2023.
  • [32] Yuheng Lu, Chenfeng Xu, Xiaobao Wei, Xiaodong Xie, Masayoshi Tomizuka, Kurt Keutzer, and Shanghang Zhang. Open-vocabulary 3d detection via image-level class and debiased cross-modal contrastive learning. arXiv preprint arXiv:2207.01987, 2022.
  • [33] Chenyang Lyu, Minghao Wu, Longyue Wang, Xinting Huang, Bingshuai Liu, Zefeng Du, Shuming Shi, and Zhaopeng Tu. Macaw-llm: Multi-modal language modeling with image, audio, video, and text integration, 2023.
  • [34] Kevin McGuinness and Noel E O’connor. A comparative evaluation of interactive segmentation algorithms. Pattern Recognition, 43(2):434–444, 2010.
  • [35] OpenAI. Gpt-4 technical report, 2023.
  • [36] OpenAI. Gpt-4v(ision) system card, 2023.
  • [37] Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 724–732, 2016.
  • [38] Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models, 2016.
  • [39] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.
  • [40] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021.
  • [41] Frano Rajič, Lei Ke, Yu-Wing Tai, Chi-Keung Tang, Martin Danelljan, and Fisher Yu. Segment anything meets point tracking. arXiv preprint arXiv:2307.01197, 2023.
  • [42] Aleksandar Shtedritski, Christian Rupprecht, and Andrea Vedaldi. What does clip know about a red circle? visual prompt engineering for vlms, 2023.
  • [43] Lv Tang, Haoke Xiao, and Bo Li. Can sam segment anything? when sam meets camouflaged object detection. arXiv preprint arXiv:2304.04709, 2023.
  • [44] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • [45] Xinlong Wang, Wen Wang, Yue Cao, Chunhua Shen, and Tiejun Huang. Images speak in images: A generalist painter for in-context visual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6830–6839, 2023.
  • [46] Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, and Tiejun Huang. Seggpt: Segmenting everything in context, 2023.
  • [47] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
  • [48] Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023.
  • [49] Lingfeng Yang, Yueze Wang, Xiang Li, Xinlong Wang, and Jian Yang. Fine-grained visual prompting, 2023.
  • [50] Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The dawn of lmms: Preliminary explorations with gpt-4v(ision), 2023.
  • [51] Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023.
  • [52] Lewei Yao, Jianhua Han, Youpeng Wen, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, Chunjing Xu, and Hang Xu. Detclip: Dictionary-enriched visual-concept paralleled pre-training for open-world detection. Advances in Neural Information Processing Systems, 35:9125–9138, 2022.
  • [53] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023.
  • [54] Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. Ferret: Refer and ground anything anywhere at any granularity, 2023.
  • [55] Hao Zhang, Feng Li, Xueyan Zou, Shilong Liu, Chunyuan Li, Jianfeng Gao, Jianwei Yang, and Lei Zhang. A simple framework for open-vocabulary segmentation and detection, 2023.
  • [56] Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Harold Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, and Jianfeng Gao. Glipv2: Unifying localization and vision-language understanding, 2022.
  • [57] Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023.
  • [58] Renrui Zhang, Zhengkai Jiang, Ziyu Guo, Shilin Yan, Junting Pan, Hao Dong, Peng Gao, and Hongsheng Li. Personalize segment anything model with one shot. arXiv preprint arXiv:2305.03048, 2023.
  • [59] Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Yu Liu, Kai Chen, and Ping Luo. Gpt4roi: Instruction tuning large language model on region-of-interest, 2023.
  • [60] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
  • [61] Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, et al. Regionclip: Region-based language-image pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16793–16803, 2022.
  • [62] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [63] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
  • [64] Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, Nanyun Peng, Lijuan Wang, Yong Jae Lee, and Jianfeng Gao. Generalized decoding for pixel, image, and language, 2022.
  • [65] Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Gao, and Yong Jae Lee. Segment everything everywhere all at once. arXiv preprint arXiv:2304.06718, 2023.

Appendix A Use Cases

Combining GPT-4V with the proposed SoM prompting, we run many realistic use cases to check its generality and merit in comparison with the original GPT-4V.


Refer to caption User: Can you point me what is the most unusual thing in this image? GPT-4V: Certainly! The most unusual thing in the image is a man ironing clothes on an ironing board that is placed atop a moving yellow taxi cab in a city street. It’s not a common sight to see someone ironing clothes on top of a taxi, especially while it’s on the road! User: Can you point me what is the most unusual thing in this image? GPT-4V (SoM): The most unusual thing in this image is the man (elements 8, 17, 19) standing on an ironing board (element 12) that’s placed atop a moving yellow taxi (elements 2, 6, 11). He’s ironing clothes in the middle of a city street, which is highly atypical and not a safe behavior. This situation stands out as the most bizarre aspect of the scene. Refer to caption User: What are the common objects in the two images? GPT-4V (SoM): The common objects in the two images are: A man (In the first image, elements 8, 17, 19; In the second image, element a) An ironing board (In the first image, element 12; In the second image, element b) An iron (Not distinctly labeled in the first image but the man is using it; In the second image, element f) Clothes/shirt being ironed (Not distinctly labeled in the first image but the man is ironing it; In the second image, element c)
Figure 9: In comparison to GPT-4V without SoM, adding marks enables GPT-4V to ground the reasoning on detailed contents of the image (Left). Clear object cross-image references are observed on the right.


User: Please solve this Refer to caption GPT-4V: The squares with traffic lights in the provided CAPTCHA image are: The first square in the first row. The second square in the first row. The first square in the second row. The second square in the second row. Please select these squares to solve the CAPTCHA. Remember, when you encounter CAPTCHAs in real-world scenarios, it’s important to solve them yourself to ensure the security and authenticity of your online actions. GPT-4V (SoM): The squares containing traffic lights are: 2, 7, and 11.
Figure 10: Case study on solving CAPTCHA. GPT-4V gives the wrong answer with a wrong number of squares while finding the correct squares with corresponding marks after SoM prompting.


User: What is in this dish? Refer to caption GPT-4V: The dish appears to feature: A grilled or pan-seared fish fillet, which looks like it could be a type of white fish, possibly sea bass, cod, or halibut. Asparagus spears. Potatoes, possibly roasted or boiled. A wedge of lemon for squeezing over the dish. Garnished with fresh parsley or cilantro leaves. There might be a light sauce or seasoning drizzled over the fish and vegetables, but it’s hard to specify its ingredients from the image alone. This looks like a delicious and healthy meal! GPT-4V (SoM): Here’s a description of the items in the dish: 1. Fish Fillet: The main component of the dish appears to be a fillet of fish, likely grilled or pan-seared. 2. Lemon Wedge: A common accompaniment with fish dishes to enhance flavor when squeezed over the top. 3. Parsley or Cilantro: This green herb is used as a garnish and can add a fresh flavor to dishes. 4. Asparagus: These green stalks are often steamed, sautéed, or grilled and served as a side. 5. Sauce or Broth: There is a yellowish sauce or broth on the plate. 6. Potato: It looks like a roasted or boiled potato. 7. Garnish: These could be green onions or chives scattered on the plate to add color and flavor. 8. Sauce or Oil Drizzle: There’s a darker drizzle around the potato, which could be a type of sauce or perhaps a balsamic reduction. The dish seems to be a well-balanced meal with a protein (fish), vegetables (asparagus and potato), and various flavor enhancers (sauces, lemon, and herbs).
Figure 11: Case study on an image of dish for GPT-4V. GPT-4V does not produce a grounded answer with the original image. Based on SoM prompting, GPT-4V not only speaks out the ingredients but also corresponds them to the regions.


User: I am on a diet, what are the top things I should avoid eating? Refer to caption GPT-4V: If you’re on a diet, it would depend on the specific type of diet you’re following. However, based on the image you provided, here are some common items that many diets might limit or avoid: Burgers: Especially if made with high-fat meats or served with cheese and mayonnaise. Fries: Deep-fried foods are typically high in calories and unhealthy fats. Chips: Can be high in salt and fats, especially if they’re commercially produced. Onion rings: Another deep-fried item. Sandwiches with white bread: Many diets recommend whole grains over refined grains. Processed meats: Like the ham in the sandwich, as they can be high in sodium and preservatives. Sauces: Some can be high in sugar, sodium, or unhealthy fats. Soft drinks or sweetened beverages: High in sugars and empty calories. Always consult with a nutritionist or dietician to get guidance tailored to your specific needs and health goals. They can provide recommendations based on your diet and personal health objectives. GPT-4V (SoM): Based on the typical diet considerations, here are the top items from the provided image that you should most likely avoid or limit: 1. Onion rings (1 and 11) - Fried foods like these are often high in unhealthy fats and calories. 2. Fries (8) - Another fried item that can be calorie-dense and contain unhealthy fats. 3. Burgers (3 and 5) - Depending on their composition, they can be high in calories, unhealthy fats, and may contain processed meats. 4. Processed meats (2 and 15) - These can be high in sodium, unhealthy fats, and other additives. 5. Chips (16) - Typically high in unhealthy fats and can be calorie-dense. While indulging occasionally is okay, it’s best to consume these items in moderation if you’re trying to maintain a healthy diet.
Figure 12: SoM-pormpted GPT-4V gives very precise suggestions while the original one fails, even with hallucinated foods, e.g., soft drinks.


User: How to use this controller? Refer to caption GPT-4V (SoM): This appears to be an Xbox One or Xbox Series X|S controller. Here’s a basic rundown on how to use it: Thumbsticks (2 & 3): These are primarily used for movement and aiming. Left Thumbstick (2): Controls the movement of the character. Right Thumbstick (3): Controls the camera or aiming reticle. Directional Pad (D-Pad) (1): Used for various functions such as navigating menus, selecting weapons, or other in-game actions. Face Buttons (4, 5, 6, & 7): A (4): Typically the main action or "accept" button. B (7): Often the "back" or "cancel" button. X (5): Varies by game, frequently used for interactions. Y (6): Like X, its function varies but is commonly used for secondary actions or interactions. Bumpers and Triggers (8 & 9): Left and Right Bumpers (8): Used for quick actions or toggling between items. Left and Right Triggers: Used for primary actions like shooting or accelerating in racing games. Menu and View Buttons: Located in the center, used to bring up in-game menus or other overlays. Xbox Button: Press to go to the Xbox home or turn the controller on/off. Long press brings up a menu to turn off the controller or console. Pairing Button: Located on the top edge, used for pairing the controller. 3.5mm Audio Jack: On the bottom, plug in headphones or a headset. USB Port: On the top edge, used for charging or wired play. Refer to the game’s manual or in-game control settings for exact functions for each game.
Figure 13: Likewise, GPT4-V with SoM can help to provide thorough tool usage instruction, teaching users the function of each button on a controller. Note that this image is not fully labeled, while GPT-4V can also provide information about the non-labeled buttons.


User: How can I get the coins on top? Refer to caption GPT-4V (SoM): It looks like you’re playing a platformer game reminiscent of Super Mario Bros. Here’s a general strategy based on the layout: Elevated Platforms: The platforms labeled as ’4’ and ’7’ are the key. You need to first get onto these platforms to reach greater heights. Depending on the game mechanics, you may need to find a run-up or utilize a jump-boosting power-up if available. Bricks and Blocks: The bricks labeled as ’9’ might contain some power-ups or they might be breakable. In classic platformers like Super Mario, hitting such blocks from below can reveal power-ups or even create platforms. The same goes for the block labeled as ’1’. Vertical Jump: Once on the elevated platform ’4’, try to perform a vertical jump to reach the coins labeled ’11’. Ensure that you time your jump correctly and use any speed or jump boosts, if available. Momentum: Some games also allow for gaining extra height in jumps by building momentum. If this game works similarly, try running from a distance and jumping at the last moment to grab the coins. Look for Hidden Power-ups: Some games hide power-ups in the level that can assist players in reaching difficult places. These might be hidden in unmarked bricks or in other places. If you have abilities like a double jump, use them. Game Mechanics: Every game can have unique mechanics. If there are any other controls or mechanics you haven’t tried, consider experimenting with them. Maybe there’s a way to bounce, double jump, or use an enemy to boost your jump height. If you’ve tried all the general strategies and still can’t reach the coins, it might be beneficial to look up specific strategies or walkthroughs for the particular game you’re playing. Good luck!
Figure 14: GPT-4V with SoM gives a reasonable suggestion on how to achieve a goal in a gaming scenario.

Refer to caption User: I am a bit hungry, where can I find some food? GPT-4V (SoM): Based on the image provided, it looks like you’re viewing a floor plan of an apartment. For food, you should head to the kitchen area which is labeled with the number "10". That’s where you’re most likely to find food or kitchen appliances to prepare a meal. If you’re in this apartment, head to that section! If you’re not, consider checking your own kitchen or a nearby restaurant. Enjoy your meal! User: cool, where should I sit to eat if I find some food from 10? GPT-4V (SoM): Based on the floor plan, the most suitable place to sit and eat would be at the dining area labeled with the number "14". It appears to have a dining table and chairs, making it a comfortable spot for enjoying a meal. User: ok, i got a dirty hand, what should I do? GPT-4V (SoM): Based on the floor plan, you should head to the bathroom areas to wash your hands. You can go to the space labeled with the number "13" which appears to be a bathroom. It’s likely to have a sink and soap where you can clean your hands. Always remember to wash your hands thoroughly, especially before eating! User: now I need to get some sleep. GPT-4V (SoM): For sleep, you should head to the area labeled with the number "19". It appears to be a bedroom with a bed, which would be the most suitable place for rest. Have a good sleep! User: where is the TV, I want to turn it on. GPT-4V (SoM): The TV is located in the area labeled with the number "6". You can head over there to turn it on and enjoy some entertainment. User: Enough to watch a TV, I want to take a shower. GPT-4V (SoM): To take a shower, you should head to the space labeled with the number "17". It appears to be a shower area in the bathroom. Make sure you have your toiletries and towel ready, and enjoy your shower! User: okay, need to wash some clothes, put them where? GPT-4V (SoM): For washing clothes, you should head to the area labeled with the number "8". It appears to be a location designated for a washing machine or laundry area. You can place your clothes there and start the washing process.
Figure 15: GPT-4V with SoM has the potential to help embodied navigation in that it can quickly help to find the regions/locations of particular functionality.