PromptEnhancer: A Simple Approach to Enhance Text-to-Image Models via Chain-of-Thought Prompt Rewriting

Project Lead: Chunyu Wang    *Corresponding Author: Qinglin Lu
Tencent Hunyuan

PromptEnhancer enables high-fidelity and stylistically diverse image generation from user prompts. Using HunyuanImage 2.1 as the base T2I model, our method demonstrates its versatility across various domains, including photorealism, digital art, abstract geometry, and multilingual text-in-image generation. The examples showcase how minimal user inputs are transformed into rich, detailed prompts that yield high-quality visual outputs, bridging the gap between user intent and model execution.

Abstract

Recent advancements in text-to-image (T2I) diffusion models have demonstrated remarkable capabilities in generating high-fidelity images. However, these models often struggle to faithfully render complex user prompts, particularly in aspects like attribute binding, negation, and compositional relationships. This leads to a significant mismatch between user intent and the generated output. To address this challenge, we introduce PromptEnhancer, a novel and universal prompt rewriting framework that enhances any pretrained T2I model without requiring modifications to its weights. Unlike prior methods that rely on model-specific fine-tuning or implicit reward signals like CLIP scores, our framework decouples the rewriter from the generator. We achieve this by training a Chain-of-Thought (CoT) rewriter through reinforcement learning, guided by a dedicated reward model we term the AlignEvaluator. The AlignEvaluator is trained to provide explicit and fine-grained feedback based on a systematic taxonomy of 24 key points, which are derived from a comprehensive analysis of common T2I failure modes. By optimizing the CoT rewriter to maximize the reward from our AlignEvaluator, our framework learns to generate prompts that are more precisely interpreted by T2I models. Extensive experiments on the HunyuanImage 2.1 model demonstrate that PromptEnhancer significantly improves image-text alignment across a wide range of semantic and compositional challenges. Furthermore, we introduce a new, high-quality human preference benchmark to facilitate future research in this direction.


Methodology


Stage 1: SFT for Rewriter Initialization (Sec 2.2). The CoT Rewriter is first initialized through Supervised Fine-Tuning (SFT). In this stage, the model learns to generate structured, chain-of-thought style responses by training on (user prompt, reprompt) pairs using a standard next-token prediction loss. This provides a strong starting point for the subsequent alignment phase.
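The SFT step above can be sketched as data preparation for next-token prediction. This is a minimal illustrative sketch, not the released training code: the prompt template, the toy tokenizer, and the function names are all assumptions. The key idea shown is that the loss is applied only to the rewritten-prompt span, with the user-prompt positions masked out.

```python
# Illustrative sketch of Stage-1 SFT data preparation (hypothetical names,
# not the paper's actual code). Each (user prompt, reprompt) pair becomes one
# training sequence; label positions covering the prompt are masked so the
# next-token loss only trains the model to produce the rewritten prompt.

IGNORE_INDEX = -100  # conventional "ignore" label for cross-entropy losses


def toy_tokenize(text):
    """Stand-in tokenizer: one token per whitespace-separated word."""
    return text.split()


def build_sft_example(user_prompt, reprompt):
    """Return (tokens, labels) for supervised fine-tuning.

    Labels over the instruction/prompt span are IGNORE_INDEX, so gradients
    flow only through the CoT-style response tokens.
    """
    prompt_tokens = toy_tokenize(f"Rewrite: {user_prompt} ->")
    response_tokens = toy_tokenize(reprompt)
    tokens = prompt_tokens + response_tokens
    labels = [IGNORE_INDEX] * len(prompt_tokens) + list(response_tokens)
    return tokens, labels


tokens, labels = build_sft_example(
    "a cat", "a fluffy orange tabby cat, studio lighting")
```

In a real setup the same masking pattern is applied to subword IDs rather than words, and the (tokens, labels) pairs feed a standard causal-LM trainer.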
Stage 2: Policy Alignment with GRPO (Sec 2.3). The initialized rewriter is then refined using a reinforcement learning loop based on Group Relative Policy Optimization (GRPO). For a given prompt, the CoT Rewriter generates multiple candidate reprompts. These are fed into a frozen T2I model to produce images. The AlignEvaluator (Sec 2.4) then assesses each (image, prompt) pair and provides a scalar reward. This reward signal optimizes the rewriter's policy, steering it toward generating prompts that maximize the alignment between the image and the user's intent.
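The heart of the GRPO update is a group-relative advantage: each candidate reprompt's reward is normalized against the mean and standard deviation of its own group. The sketch below shows only that computation; the T2I model and AlignEvaluator are stubbed out, and the reward values are placeholders.

```python
# Hedged sketch of the Stage-2 reward step: group-relative advantages as in
# GRPO. Candidates scoring above their group's mean get positive advantages
# (reinforced); below-average candidates get negative ones (penalized).

from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the group mean and (population) std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four candidate reprompts for one user prompt, scored by a
# (hypothetical) AlignEvaluator on [0, 1].
rewards = [0.9, 0.5, 0.5, 0.1]
advantages = group_relative_advantages(rewards)
```

Because the normalization is per group, no learned value function is needed: the other candidates for the same user prompt serve as the baseline.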

An overview of the two-stage training framework for PromptEnhancer. Our framework trains a universal Rewriter to enhance a pretrained Text-to-Image (T2I) model without altering its weights. This is achieved through a two-stage process guided by a specialized reward model.

Data Pipeline


Overview of the construction and filtering pipeline for the Rewriter training data. The process involves user prompt simulation, Gemini-based generation, human-in-the-loop selection, and automated filtering to ensure high quality.
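The automated filtering step in the pipeline above can be illustrated with simple heuristics. The concrete rules used in the paper are not specified here, so the length bounds and the intent-preservation check below are assumptions chosen for illustration only.

```python
# Illustrative sketch of automated filtering for (user prompt, reprompt)
# pairs. Heuristics are assumptions: length bounds plus a crude check that
# the rewrite keeps every content word of the original user prompt.


def keeps_user_intent(user_prompt, reprompt):
    """Require every content word of the user prompt to survive rewriting."""
    stop = {"a", "an", "the", "of", "in", "on", "with"}
    content = {w.lower().strip(".,") for w in user_prompt.split()} - stop
    rewritten = reprompt.lower()
    return all(w in rewritten for w in content)


def filter_pairs(pairs, min_len=8, max_len=200):
    """Keep pairs whose reprompt is well-sized and faithful to the input."""
    kept = []
    for user_prompt, reprompt in pairs:
        n = len(reprompt.split())
        if min_len <= n <= max_len and keeps_user_intent(user_prompt, reprompt):
            kept.append((user_prompt, reprompt))
    return kept


pairs = [
    ("a red car",
     "a glossy red sports car parked on a rain-slicked city street at dusk"),
    ("a red car",
     "a blue truck driving through snowy mountains at dawn today"),
]
kept = filter_pairs(pairs)
```

A production pipeline would use stronger checks (e.g., an NLI or VQA model) rather than substring matching, but the gating structure is the same.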

Data Analysis


Distribution of evaluation dimensions in our dataset. (a) The detailed percentage of each of the 24 fine-grained KeyPoints, sorted in descending order. (b) The aggregated percentage for each of the six main Super-Categories, calculated by summing the percentages of their constituent KeyPoints. In both charts, colors represent the Super-Category, visually linking the detailed points to their broader classification.
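The aggregation described in panel (b) is a straightforward sum of KeyPoint percentages within each Super-Category. The sketch below uses made-up category names and numbers purely to show the computation; they are not the paper's taxonomy or results.

```python
# Sketch of how panel (b) is derived from panel (a): each fine-grained
# KeyPoint percentage is summed into its Super-Category. Names and numbers
# below are illustrative placeholders, not the paper's data.

from collections import defaultdict


def aggregate(keypoint_pct, keypoint_to_super):
    """Sum per-KeyPoint percentages into per-Super-Category totals."""
    totals = defaultdict(float)
    for kp, pct in keypoint_pct.items():
        totals[keypoint_to_super[kp]] += pct
    return dict(totals)


keypoint_pct = {"color binding": 6.0, "count": 4.0, "negation": 3.0}
keypoint_to_super = {
    "color binding": "Attributes",
    "count": "Attributes",
    "negation": "Logic",
}
totals = aggregate(keypoint_pct, keypoint_to_super)
```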

Experimental Results


Qualitative Comparison of Prompt Rewriting. This figure demonstrates the effectiveness of the PromptEnhancer prompt rewriter. Each comparison pair shows an image generated from a simple “Raw Prompt” alongside an image generated from the detailed prompt created by PromptEnhancer. As illustrated, the enriched prompts, which add specific details like character identity (“Tom cat from Tom & Jerry”) and artistic style (“oil painting style, heavy brushstrokes”), guide the model to produce images with significantly greater detail, stylistic accuracy, and fidelity to the user’s intent.

Quantitative Evaluation


Quantitative Evaluation of PromptEnhancer’s Impact on Prompt Following Accuracy. The figure presents a comparative analysis of text-to-image generation accuracy with and without the PromptEnhancer framework across 24 distinct semantic categories. The left panel illustrates the percentage point (pp) improvement for each category, highlighting significant gains (blue) in areas like grammatical understanding and compositional reasoning, as well as regressions (red) in others. The right panel provides a direct comparison of the absolute accuracy scores, showing the performance of the baseline model (“w/o Ours”) versus the enhanced model (“w/ Ours”).
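The left-panel metric is simply the per-category difference in accuracy, expressed in percentage points. The numbers in the sketch below are placeholders, not results from the paper; only the computation is being shown.

```python
# Minimal sketch of the left-panel computation: pp improvement is
# accuracy_with minus accuracy_without, per category. Positive values are
# gains (blue in the figure), negative values are regressions (red).
# All numbers here are illustrative placeholders.


def pp_improvement(acc_without, acc_with):
    """Return {category: percentage-point delta} over shared categories."""
    return {cat: round(acc_with[cat] - acc_without[cat], 2)
            for cat in acc_without}


baseline = {"negation": 40.0, "counting": 55.0}
enhanced = {"negation": 62.5, "counting": 53.0}
deltas = pp_improvement(baseline, enhanced)
```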

Citation


@article{promptenhancer2025,
  title={PromptEnhancer: A Simple Approach to Enhance Text-to-Image Models via Chain-of-Thought Prompt Rewriting},
  author={Linqing Wang and Ximing Xing and Yiji Cheng and Zhiyuan Zhao and Jiale Tao and Qixun Wang and Ruihuang Li and Comi Chen and Xin Li and Mingrui Wu and Xinchi Deng and Chunyu Wang and Qinglin Lu},
  journal={arXiv preprint arXiv:2509.04545},
  year={2025}
}


Acknowledgements


We would like to thank Kai Song and ZhengKai Jiang for their valuable inputs and suggestions.

We are open-sourcing the PromptEnhancer-7B version, which includes a larger model, the dataset, the benchmark, and the AlignEvaluator reward model; these will be released progressively. Additionally, the PromptEnhancer series for multimodal applications, such as image-to-image, text-to-video, and image-to-video, is coming soon.