GPT-4 with Vision (GPT-4V) enables users to instruct GPT-4 to analyze image inputs provided by the user, and is the latest capability we are making broadly available. Incorporating additional modalities (such as image inputs) into large language models (LLMs) is viewed by some as a key frontier in artificial intelligence research and development. Multimodal LLMs offer the possibility of expanding the impact of language-only systems with novel interfaces and capabilities, enabling them to solve new tasks and provide novel experiences for their users. This system card analyzes the safety properties of GPT-4V. Our safety work for GPT-4V builds on the work done for GPT-4, and here we provide further detail on the evaluation, preparation, and mitigation work done specifically for image inputs.