
arXiv:2305.15328 [cs.CV]

Visual Programming for Text-to-Image Generation and Evaluation

Published 2023-05-24, Version 1

As large language models have demonstrated impressive performance in many domains, recent works have adopted language models (LMs) as controllers of visual modules for vision-and-language tasks. While existing work focuses on equipping LMs with visual understanding, we propose two novel interpretable/explainable visual programming frameworks for text-to-image (T2I) generation and evaluation. First, we introduce VPGen, an interpretable step-by-step T2I generation framework that decomposes T2I generation into three steps: object/count generation, layout generation, and image generation. We employ an LM to handle the first two steps (object/count generation and layout generation) by finetuning it on text-layout pairs. Our step-by-step T2I generation framework provides stronger spatial control than end-to-end models, the dominant approach for this task. Furthermore, we leverage the world knowledge of pretrained LMs, overcoming the limitation of previous layout-guided T2I works that can only handle predefined object classes. We demonstrate that VPGen offers better control over the counts, spatial relations, and scales of objects than state-of-the-art T2I generation models. Second, we introduce VPEval, an interpretable and explainable evaluation framework for T2I generation based on visual programming. Unlike previous T2I evaluations that rely on a single scoring model, which is accurate in some skills but unreliable in others, VPEval produces evaluation programs that invoke a set of visual modules that are experts in different skills, and also provides visual+textual explanations of the evaluation results. Our analysis shows that VPEval provides a more human-correlated evaluation for skill-specific and open-ended prompts than widely used single-model-based evaluation. We hope our work encourages future progress on interpretable/explainable generation and evaluation for T2I models. Website: https://vp-t2i.github.io
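To make the idea of an "evaluation program" concrete, the following is a minimal sketch of how a program might invoke skill-specific modules and return both a score and a textual explanation. It is not the authors' implementation: the abstract does not specify the module interfaces or the program format, so the names `Box`, `count_eval`, `spatial_eval`, and `run_program` are hypothetical, and a hard-coded list of boxes stands in for the output of a real visual module such as an object detector.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Box:
    """Hypothetical detector output: a label plus (x1, y1, x2, y2) in pixels."""
    label: str
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def center_x(self) -> float:
        return (self.x1 + self.x2) / 2


# Each module returns (score in [0, 1], textual explanation).
Result = Tuple[float, str]


def count_eval(boxes: List[Box], label: str, expected: int) -> Result:
    """Assumed 'count' module: checks how many objects of a class were detected."""
    found = sum(1 for b in boxes if b.label == label)
    score = 1.0 if found == expected else 0.0
    return score, f"expected {expected} '{label}', detected {found}"


def spatial_eval(boxes: List[Box], subject: str, obj: str, relation: str) -> Result:
    """Assumed 'spatial' module: compares box centers of the first match per class."""
    subjects = [b for b in boxes if b.label == subject]
    objects = [b for b in boxes if b.label == obj]
    if not subjects or not objects:
        return 0.0, f"missing '{subject}' or '{obj}'"
    s, o = subjects[0], objects[0]
    if relation == "left of":
        ok = s.center_x < o.center_x
    elif relation == "right of":
        ok = s.center_x > o.center_x
    else:
        raise ValueError(f"unsupported relation: {relation}")
    verdict = "is" if ok else "is not"
    return (1.0 if ok else 0.0), f"'{subject}' {verdict} {relation} '{obj}'"


def run_program(
    program: List[Callable[[List[Box]], Result]], boxes: List[Box]
) -> Tuple[float, List[str]]:
    """Runs every module call and averages the scores (an assumed aggregation)."""
    results = [step(boxes) for step in program]
    score = sum(s for s, _ in results) / len(results)
    return score, [explanation for _, explanation in results]


if __name__ == "__main__":
    # Stand-in detections for an image generated from "two dogs left of a cat".
    detections = [
        Box("dog", 10, 40, 60, 90),
        Box("dog", 70, 40, 120, 90),
        Box("cat", 150, 40, 200, 90),
    ]
    program = [
        lambda b: count_eval(b, "dog", 2),
        lambda b: spatial_eval(b, "dog", "cat", "left of"),
    ]
    score, explanations = run_program(program, detections)
    print(score, explanations)
```

The point of the sketch is the shape of the output: each module call contributes its own explanation, so a low overall score can be traced back to the specific skill that failed, which a single scalar from one scoring model cannot offer.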

Comments: 18 pages; Project website: https://vp-t2i.github.io

Categories: cs.CV, cs.AI, cs.CL, cs.LG

Keywords: text-to-image generation, interpretable/explainable visual programming frameworks, interpretable step-by-step t2i generation framework, visual modules, vpeval produces evaluation programs

Tags: github project

Related articles:

arXiv:2307.05134 [cs.CV] (Published 2023-07-11)

TIAM -- A Metric for Evaluating Alignment in Text-to-Image Generation

Paul Grimal, Hervé Le Borgne, Olivier Ferret, Julien Tourille

arXiv:2307.02971 [cs.CV] (Published 2023-07-06)

On the Cultural Gap in Text-to-Image Generation

Bingshuai Liu, Longyue Wang, Chenyang Lyu, Yong Zhang, Jinsong Su, Shuming Shi, Zhaopeng Tu

arXiv:2403.07605 [cs.CV] (Published 2024-03-12)