arXiv:2305.15328 [cs.CV]

Visual Programming for Text-to-Image Generation and Evaluation

Jaemin Cho, Abhay Zala, Mohit Bansal

Published 2023-05-24, Version 1

As large language models have demonstrated impressive performance in many domains, recent works have adopted language models (LMs) as controllers of visual modules for vision-and-language tasks. While existing work focuses on equipping LMs with visual understanding, we propose two novel interpretable/explainable visual programming frameworks for text-to-image (T2I) generation and evaluation. First, we introduce VPGen, an interpretable step-by-step T2I generation framework that decomposes T2I generation into three steps: object/count generation, layout generation, and image generation. We employ an LM to handle the first two steps (object/count generation and layout generation) by finetuning it on text-layout pairs. Our step-by-step T2I generation framework provides stronger spatial control than end-to-end models, the dominant approach for this task. Furthermore, we leverage the world knowledge of pretrained LMs, overcoming the limitation of previous layout-guided T2I works that can only handle predefined object classes. We demonstrate that VPGen offers better control over the counts, spatial relations, and scales of objects than state-of-the-art T2I generation models. Second, we introduce VPEval, an interpretable and explainable evaluation framework for T2I generation based on visual programming. Unlike previous T2I evaluations that rely on a single scoring model, which is accurate in some skills but unreliable in others, VPEval produces evaluation programs that invoke a set of visual modules that are experts in different skills, and also provides visual+textual explanations of the evaluation results. Our analysis shows that VPEval provides a more human-correlated evaluation for skill-specific and open-ended prompts than widely used single-model-based evaluation. We hope our work encourages future progress on interpretable/explainable generation and evaluation for T2I models. Website: https://vp-t2i.github.io
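To make the idea of an evaluation program concrete, here is a minimal Python sketch of how a program might dispatch to skill-specific modules and collect a score plus textual explanations. It is an illustration only, not the authors' implementation: the module names (`object`, `count`, `spatial`), the `Detection` type, the program format, and the averaging of step results are all assumptions made for this example, and the "image" is stood in for by a hand-written list of detections rather than real model output.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Detection:
    """Stand-in for the output of an object detector (assumed, not from the paper)."""
    label: str
    box: Box


@dataclass
class StepResult:
    passed: bool
    explanation: str  # textual explanation for this evaluation step


def object_eval(dets: List[Detection], label: str) -> StepResult:
    """Check that at least one object with the given label is present."""
    found = any(d.label == label for d in dets)
    return StepResult(found, f"'{label}' was {'found' if found else 'not found'}")


def count_eval(dets: List[Detection], label: str, expected: int) -> StepResult:
    """Check that the number of detected objects matches the expected count."""
    n = sum(1 for d in dets if d.label == label)
    return StepResult(n == expected, f"expected {expected} '{label}', detected {n}")


def spatial_eval(dets: List[Detection], subject: str, reference: str,
                 relation: str) -> StepResult:
    """Compare horizontal box centers of the first detection of each label."""
    centers: Dict[str, float] = {}
    for d in dets:
        centers.setdefault(d.label, (d.box[0] + d.box[2]) / 2)
    if subject not in centers or reference not in centers:
        return StepResult(False, f"missing '{subject}' or '{reference}'")
    if relation == "left of":
        ok = centers[subject] < centers[reference]
    elif relation == "right of":
        ok = centers[subject] > centers[reference]
    else:
        raise ValueError(f"unsupported relation: {relation}")
    return StepResult(ok, f"'{subject}' is {'' if ok else 'not '}{relation} '{reference}'")


# Registry mapping hypothetical module names to their implementations.
MODULES: Dict[str, Callable[..., StepResult]] = {
    "object": object_eval,
    "count": count_eval,
    "spatial": spatial_eval,
}


def run_program(dets: List[Detection],
                program: Sequence[Tuple[str, tuple]]) -> Tuple[float, List[str]]:
    """Run each (module_name, args) step; return the pass fraction and explanations."""
    if not program:
        raise ValueError("evaluation program is empty")
    results = [MODULES[name](dets, *args) for name, args in program]
    score = sum(r.passed for r in results) / len(results)
    return score, [r.explanation for r in results]


# Hand-written detections standing in for an image generated from
# a prompt such as "a dog to the left of two cats".
detections = [
    Detection("dog", (0, 0, 10, 10)),
    Detection("cat", (20, 0, 30, 10)),
    Detection("cat", (40, 0, 50, 10)),
]
program = [
    ("object", ("dog",)),
    ("count", ("cat", 2)),
    ("spatial", ("dog", "cat", "left of")),
]
score, explanations = run_program(detections, program)
print(score)
for line in explanations:
    print(line)
```

Keeping each skill in its own function means a failed step can be traced to one module and one explanation string, which is the interpretability property the abstract emphasizes. In a real system the detections would come from vision models, and the program would be produced by an LM rather than written by hand.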

Related articles:
arXiv:2307.05134 [cs.CV] (Published 2023-07-11)
TIAM -- A Metric for Evaluating Alignment in Text-to-Image Generation
arXiv:2307.02971 [cs.CV] (Published 2023-07-06)
On the Cultural Gap in Text-to-Image Generation
arXiv:2403.07605 [cs.CV] (Published 2024-03-12)
Optimizing Negative Prompts for Enhanced Aesthetics and Fidelity in Text-To-Image Generation