Version: 1.0.5
License: GPL-3.0
Advanced multimodal prompt generation nodes for ComfyUI with local GGUF models (Qwen-VL) and cloud API support.
Based on extensive testing, Wan2.2 and Qwen-Image-Edit respond significantly better to Chinese prompts than English prompts.
Recommendation: set `target_language` to `zh` (Chinese) for best results with these models, even if your input is in English; they produce noticeably more coherent, instruction-following output.
Vision input support varies by model and llama-cpp-python version. See Installation section for detailed compatibility information. Results may vary based on your specific environment.
- Local GGUF support: Run Qwen2.5-VL and Qwen3-VL models locally
- Multi-image input: Support batch image input via ComfyUI's batch nodes (e.g., Images Batch Multiple)
- Flexible prompting styles:
  - `raw`: Direct LLM response without a system prompt
  - `default`: Balanced prompt enhancement
  - `detailed`: Rich visual details (colors, textures, lighting, atmosphere)
  - `concise`: Minimal keywords, focused on core elements
  - `creative`: Artistic interpretation with unique perspectives
- Device selection: Simple CPU/GPU dropdown for hardware control
- Auto-detect mmproj: Automatic detection or manual selection for Qwen3-VL
- Dynamic model selection: Auto-detect local GGUF models and cloud API models
- Image editing prompts: Specialized for Qwen-Image-Edit tasks
- Manual mmproj selection: Choose specific mmproj files or use auto-detect
- Multi-image support: Up to 3 images via optional inputs (image2/image3)
- Unified interface: Consistent parameter ordering and naming
- API key management: Centralized configuration via `api_key.txt`
- Device control: CPU/GPU selection for local models
- Video generation prompts: Optimized for Wan2.2 text-to-video and image-to-video
- Local Qwen3-VL integration: Use local models for prompt enhancement
- Task-specific optimization: Separate prompts for T2V and I2V workflows
- Extended token limit: 2048 tokens to support longer Chinese prompts (600+ characters)
- Device selection: CPU/GPU dropdown for local model execution
- Optimized for Chinese: Better performance with Chinese language prompts
Clone this repository into your ComfyUI `custom_nodes` folder and install the requirements:

```
cd ComfyUI/custom_nodes
git clone https://github.com/yourusername/ComfyUI-MultiModal-Prompt-Nodes.git
cd ComfyUI-MultiModal-Prompt-Nodes
pip install -r requirements.txt
```

Alternative manual installation:

```
pip install dashscope pillow numpy
```

Important: Model compatibility varies by llama-cpp-python version. Based on my testing environment:
| Version | Qwen2.5-VL (Text) | Qwen2.5-VL (Vision) | Qwen3-VL |
|---|---|---|---|
| 0.3.16 (official) | ✅ | ❌ | ❌ |
| 0.3.21+ (JamePeng fork) | ✅ | ❌* | ✅ |
*Note: Vision input support may vary depending on your environment and configuration. In my setup, I have not been able to get vision input working with Qwen2.5-VL even with the JamePeng fork.
Recommended Installation (JamePeng fork for Qwen3-VL support):
```
pip install llama-cpp-python==0.3.21 --break-system-packages
```

Source: https://github.com/JamePeng/llama-cpp-python
My Environment Results:
- Official llama-cpp-python 0.3.16: Qwen2.5-VL text-only, no vision input, Qwen3-VL fails to load
- JamePeng fork 0.3.21+: Qwen3-VL works with vision input, Qwen2.5-VL text works but vision input still unavailable
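If you want to confirm that your llama-cpp-python build can load a model at all before wiring it into ComfyUI, a minimal text-only check looks like the sketch below. The model path is illustrative; vision input additionally requires an mmproj-aware chat handler, whose class name varies between the official package and the JamePeng fork.

```python
# Minimal load-and-generate sanity check (text only, no mmproj).
# Assumptions: the model path is illustrative; adjust to your setup.
from llama_cpp import Llama

llm = Llama(
    model_path="ComfyUI/models/LLM/Qwen3VL-4B-Q4_K_M.gguf",
    n_ctx=4096,        # context window
    n_gpu_layers=0,    # 0 = pure CPU; -1 offloads all layers to GPU
    verbose=False,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Describe a cat on a windowsill."}]
)
print(out["choices"][0]["message"]["content"])
```

If this raises an error on a Qwen3-VL GGUF, you are most likely on a build without Qwen3-VL support (see the table above).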
Place your GGUF models in `ComfyUI/models/LLM/`:

```
ComfyUI/models/LLM/
├── Qwen3VL-4B-Q4_K_M.gguf
├── Qwen3VL-4B-Q8_0.gguf
├── mmproj-qwen3vl-4b-f16.gguf
└── ...
```
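The `(Auto-detect)` option pairs a model with an mmproj file by filename. A rough sketch of that matching logic (a hypothetical helper, not the node's exact code):

```python
# Hypothetical sketch of mmproj auto-detection: pick the mmproj-*.gguf file
# in the same folder that shares a name token with the model's filename.
from pathlib import Path

def find_mmproj(model_path: str) -> str | None:
    model = Path(model_path)
    stem = model.stem.lower()  # e.g. "qwen3vl-4b-q4_k_m"
    for cand in model.parent.glob("mmproj-*.gguf"):
        # crude match: any hyphen-separated token of the model name
        # appearing in the mmproj filename (e.g. "qwen3vl")
        if any(tok and tok in cand.stem.lower() for tok in stem.split("-")):
            return str(cand)
    return None

print(find_mmproj("ComfyUI/models/LLM/Qwen3VL-4B-Q4_K_M.gguf"))
# -> ComfyUI/models/LLM/mmproj-qwen3vl-4b-f16.gguf (if present)
```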
For cloud API usage, create `api_key.txt` in the node folder:
`ComfyUI/custom_nodes/ComfyUI-MultiModal-Prompt-Nodes/api_key.txt`
Add your Alibaba Cloud Dashscope API key to this file.
Inputs:
- `prompt`: Text prompt to rewrite/enhance
- `style`: Prompt rewriting style
  - `raw`: Direct LLM response without a system prompt (useful for custom prompting)
  - `default`: Balanced prompt enhancement
  - `detailed`: Rich visual details
  - `concise`: Minimal, focused keywords
  - `creative`: Artistic interpretation
- `target_language`: Output language (auto/en/zh)
- `model`: Select from auto-detected local GGUF models
- `mmproj`: mmproj file selection
  - `(Auto-detect)`: Automatically search for a matching mmproj
  - `(Not required)`: For Qwen2.5-VL or text-only mode
  - Specific file: Manually select an mmproj file
- `max_tokens`: Maximum tokens to generate (default: 512)
- `temperature`: Sampling temperature (0.0-2.0, default: 0.7)
- `device`: CPU or GPU execution
- `image` (optional): Input image for vision-language processing
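Internally, each `style` maps to a system prompt (or none, for `raw`). A simplified sketch of the idea; the actual prompt wording in the node differs:

```python
# Illustrative style-to-system-prompt mapping; the node's real prompts are
# more elaborate. "raw" sends the user text with no system prompt at all.
STYLE_PROMPTS = {
    "default":  "Rewrite the prompt with balanced visual detail.",
    "detailed": "Rewrite the prompt with rich colors, textures, lighting and atmosphere.",
    "concise":  "Reduce the prompt to its core visual keywords.",
    "creative": "Reinterpret the prompt with an artistic, unusual perspective.",
}

def build_messages(prompt: str, style: str) -> list[dict]:
    messages = []
    if style != "raw":
        messages.append({"role": "system", "content": STYLE_PROMPTS[style]})
    messages.append({"role": "user", "content": prompt})
    return messages
```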
Example workflow:
- Load Vision LLM Node
- Enter basic prompt: "a cat sitting on a windowsill"
- Attach image via batch node (optional)
- Select Qwen3-VL model
- Choose `(Auto-detect)` for mmproj, or select a specific file
- Select style: `default`
- Set device: `CPU` or `GPU`
- Run to get the enhanced prompt
Inputs:
- `image`: Primary input image (required)
- `prompt`: Edit instruction or image description
- `prompt_style`:
  - `Qwen-Image-Edit`: For image editing tasks
  - `Qwen-Image`: For general image understanding
- `target_language`: Output language (auto/zh/en)
- `llm_model`: Model selection
  - `Local: xxx`: Local GGUF models (auto-detected)
  - API models: qwen-vl-max, qwen-plus, etc.
- `mmproj`: mmproj file (required for local Qwen3-VL)
  - `(Auto-detect)`: Automatic detection
  - `(Not required)`: For API models or Qwen2.5-VL
  - Specific file: Manual selection
- `max_retries`: Retry attempts for API calls (default: 3)
- `device`: CPU/GPU selection for local models
- `save_tokens`: Compress images to save API tokens
- `image2`/`image3` (optional): Additional context images
Use cases:
- Image editing prompt generation
- Multi-image context prompts
- Style transfer descriptions
- Visual question answering
Recommended settings:
- For best results: set `target_language` to `zh` (Chinese)
- Use local models for privacy, API models for quality
- Enable `save_tokens` when using API models
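When an API model is selected, the node calls Dashscope under the hood. A rough equivalent of a single call, assuming the key is already in `api_key.txt` (paths are illustrative, and response-field access may differ slightly across dashscope versions):

```python
# Rough equivalent of one API-backed image-prompt call.
import dashscope

dashscope.api_key = open("api_key.txt").read().strip()

response = dashscope.MultiModalConversation.call(
    model="qwen-vl-max",
    messages=[{
        "role": "user",
        "content": [
            {"image": "file:///path/to/input.png"},
            {"text": "Describe how to change the sky to a sunset."},
        ],
    }],
)
print(response.output.choices[0].message.content)
```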
Inputs:
- `prompt`: Video scene description
- `task_type`:
  - `Text-to-Video`: Generate a video from a text description
  - `Image-to-Video`: Generate a video from an image + text
- `target_language`: Output language (auto/zh/en)
- `llm_model`: Model selection
  - `Local: xxx`: Local GGUF models
  - API models: qwen-vl-max (for I2V), qwen-plus, etc.
- `mmproj`: mmproj selection (same as the other nodes)
- `max_retries`: API retry attempts
- `device`: CPU/GPU for local models
- `save_tokens`: Image compression for API calls (see the sketch below)
- `image` (optional): Reference frame for I2V tasks
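`save_tokens` downsizes and re-encodes the image before upload so fewer vision tokens are billed. Conceptually it works like the sketch below; the exact size and quality thresholds in the node may differ:

```python
# Conceptual save_tokens behavior: shrink and re-encode the image before
# sending it to the API. The thresholds here are illustrative.
import io
from PIL import Image

def compress_for_api(img: Image.Image, max_side: int = 1024, quality: int = 85) -> bytes:
    img = img.convert("RGB")
    img.thumbnail((max_side, max_side))  # in-place, preserves aspect ratio
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return buf.getvalue()
```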
Optimized for:
- Wan2.2 video generation
- Temporal coherence descriptions
- Camera movement instructions
- Scene transitions
Important notes:
- Use Chinese prompts (`target_language: zh`) for best results
- Supports 600+ Chinese characters (2048 tokens)
- For I2V tasks, use `qwen-vl-*` models
Example T2V workflow:
- Enter prompt: "一只猫在窗台上看风景" (A cat looking at scenery on a windowsill)
- Set `task_type`: `Text-to-Video`
- Set `target_language`: `zh`
- Select a model (local or API)
- Run to get the optimized video prompt
Example I2V workflow:
- Attach input image
- Enter motion description: "镜头慢慢推进" (Camera slowly zooms in)
- Set `task_type`: `Image-to-Video`
- Set `target_language`: `zh`
- Ensure the model supports vision (qwen-vl-*)
- Run to get the I2V prompt
- ✅ Qwen2.5-VL-2B: Text-only in my environment
- ✅ Qwen2.5-VL-7B: Text-only in my environment
  - ⚠️ mmproj integrated, but vision input unavailable in my setup
- ✅ Qwen3-VL-4B: Full vision support with the JamePeng fork
- ✅ Qwen3-VL-7B: Full vision support with the JamePeng fork
  - Requires a matching mmproj file
- Q4_K_M: Balanced quality/size (recommended for most users)
- Q5_K_M: Higher quality, larger size
- Q8_0: Maximum quality, largest size
- Qwen models: https://huggingface.co/Qwen
- GGUF conversions: https://huggingface.co/models?search=qwen+gguf
- mmproj files: Usually bundled with GGUF conversions
- RAM: 8GB+ (16GB recommended for 7B models)
- Storage: 3-8GB per model (depending on quantization)
- GPU: Optional (CPU execution supported)
- NVIDIA GPU: CUDA support via llama-cpp-python
- AMD GPU: ROCm support (requires specific build)
- Intel Arc: Limited support, CPU recommended
- Use Q4_K_M quantization for faster inference and lower memory usage
- Reduce max_tokens if hitting memory limits
- Enable GPU if you have compatible hardware (select `GPU` in the device dropdown)
- Use CPU for stability if you encounter GPU issues
- Batch multiple requests when possible for efficiency
- Close other applications to free up RAM during inference
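The CPU/GPU dropdown plausibly maps onto llama-cpp-python's layer-offload setting, roughly as in this sketch (the model path and helper name are illustrative):

```python
# Sketch of how the device dropdown maps to llama.cpp layer offloading:
# n_gpu_layers=0 keeps everything on the CPU, -1 offloads all layers to GPU.
from llama_cpp import Llama

def load_model(model_path: str, device: str) -> Llama:
    return Llama(
        model_path=model_path,
        n_gpu_layers=-1 if device == "GPU" else 0,
        n_ctx=4096,
    )
```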
| Model | Quantization | RAM Usage |
|---|---|---|
| Qwen3-VL-4B | Q4_K_M | ~4-5GB |
| Qwen3-VL-4B | Q8_0 | ~7-8GB |
| Qwen3-VL-7B | Q4_K_M | ~6-7GB |
| Qwen3-VL-7B | Q8_0 | ~12-14GB |
Q: "No module named 'llama_cpp'" error
A: Install llama-cpp-python: `pip install llama-cpp-python==0.3.21 --break-system-packages`
Q: pip install fails with "externally-managed-environment"
A: Use the `--break-system-packages` flag or create a virtual environment
Q: "Failed to load model" with Qwen3-VL
A: Ensure you're using llama-cpp-python 0.3.21+ (JamePeng fork). Version 0.3.16 doesn't support Qwen3-VL.
Q: "mmproj not specified" error
A: Select an mmproj file (or choose (Auto-detect)) in the mmproj dropdown for Qwen3-VL models
Q: "No models found" in model dropdown
A:
- Place GGUF models in `ComfyUI/models/LLM/`
- Restart ComfyUI
- Verify file extensions are `.gguf`
Q: Vision input not working with Qwen2.5-VL
A: This is a known issue in my environment. Qwen2.5-VL currently only supports text input. Use Qwen3-VL for vision capabilities.
Q: Out of memory errors
A:
- Use smaller quantization (Q4_K_M instead of Q8_0)
- Reduce the `max_tokens` parameter
- Close other applications
- Use a smaller model (4B instead of 7B)
Q: Slow inference on CPU
A: Normal for large models. Consider:
- Q4_K_M quantization (faster than Q8_0)
- Smaller models (4B faster than 7B)
- GPU acceleration if available
Q: "API_KEY is not set" error with local models
A: This error should only appear when using API models. If using local models (starting with "Local:"), this is a bug - please report it.
Q: Wan2.2 output is incoherent or doesn't follow instructions
A: Set target_language to zh (Chinese). Wan2.2 performs significantly better with Chinese prompts, even if your input is in English.
Q: Qwen-Image-Edit not understanding my edits
A:
- Use `target_language: zh` for better results
- Be specific in edit instructions
- Try using reference examples in your prompt
Q: Output is cut off or incomplete
A: Increase the `max_tokens` parameter (Vision LLM Node). The other nodes have fixed limits (512 for Qwen, 2048 for Wan).
Q: How to choose between CPU and GPU?
A:
- GPU: Faster inference, requires compatible hardware (NVIDIA with CUDA)
- CPU: Universal compatibility, slower but stable
- Recommendation: Start with CPU, switch to GPU if available and working
Q: GPU selected but still using CPU
A: Your GPU may not be compatible with llama-cpp-python. Check:
- NVIDIA GPU with CUDA support
- llama-cpp-python built with CUDA support
- Driver installation
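Prebuilt wheels are typically CPU-only; GPU offloading usually requires a CUDA-enabled build. For the official package this is the documented rebuild pattern (the JamePeng fork may have its own build instructions):

```
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir
```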
- Create `api_key.txt` in the node directory:
  `ComfyUI/custom_nodes/ComfyUI-MultiModal-Prompt-Nodes/api_key.txt`
- Add your Alibaba Cloud Dashscope API key (single line, no quotes)
- The key is loaded automatically by the Qwen and Wan nodes when a cloud API model is selected

Security notes:

- Never commit `api_key.txt` to version control
- The file is listed in `.gitignore` by default
- API keys are only loaded when using cloud API models
- Local models don't require API keys
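Loading amounts to reading one line from the file next to the node code, roughly as below (a hypothetical helper, not the node's exact function):

```python
# Hypothetical sketch of how the key is read; only invoked for API models,
# which is why local models never trigger the "API_KEY is not set" error.
from pathlib import Path

def load_api_key() -> str:
    key_file = Path(__file__).parent / "api_key.txt"
    if not key_file.exists():
        raise RuntimeError("API_KEY is not set: create api_key.txt next to the nodes")
    return key_file.read_text(encoding="utf-8").strip()
```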
See the examples/ directory for:
- Basic prompt enhancement workflows
- Multi-image vision processing
- Image editing prompt generation
- Video prompt generation (T2V and I2V)
- Style-specific optimizations
This project is licensed under the GNU General Public License v3.0.
Copyright (C) 2026 kantan-kanto
GitHub: https://github.com/kantan-kanto
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.
Note: GPL-3.0 is required due to llama-cpp-python dependency.
For full details, see the LICENSE file and AUTHORS.md.
- llama-cpp-python: Andrei Betlen
- Qwen3-VL support: JamePeng's fork
- Qwen models: Alibaba Cloud Qwen Team
- Dashscope API: Alibaba Cloud
For full attribution, see AUTHORS.md
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
Areas needing help:
- Testing on different hardware configurations
- Documenting vision input compatibility across environments
- Additional workflow examples
- Performance optimizations
- Issues: Report bugs or request features via GitHub Issues
- Documentation: See CHANGELOG.md for version history
- Examples: Check examples/ for workflow templates
See CHANGELOG.md for detailed version history.
- Device selection: CPU/GPU dropdown
- Raw style for Vision LLM Node
- Unified interface across all nodes
- Extended token limit for Wan (2048)
- API key management via api_key.txt only
- mmproj auto-detect improvements