Updating the GenAI comparison website is starting to feel a bit Sisyphean with all the new models coming out lately, but the results are in for the Flux 2 Pro Editing model!
Note: It should be called out that BFL seems to support a more formalized JSON structure for more granular edits so I'm wondering if accuracy would improve using it.
Great, especially that they still have an open-weight variant of this new model too.
But what happened to their work on their unreleased SOTA video model? Did it stop being SOTA, did others get ahead, did they fold the project, or what?
YT video about it: https://youtu.be/svIHNnM1Pa0?t=208
They even removed the page of that: https://bfl.ai/up-next/
As a startup, they pivoted and focused on image models (they are model providers, and image models often have more use cases than video models, not to mention they continue to have bigger image dataset moat, not video).
Makes no sense since they should have checkpoints earlier in the run that they could restart from and they should have regular checks that keep track if a model has exploded etc.
I didn't read "major failed training run" as in "the process crashed and we lost all data" but more like "After spending N weeks on training, we still didn't achieve our target(s)", which could be considered "failing" as well.
Image models are more fundamentally important at this stage than video models.
Almost all of the control in image-to-video comes through an image. And image models still need a lot of work and innovation.
On a real physical movie set, think about all of the work that goes into setting the stage. The set dec, the makeup, the lighting, the framing, the blocking. All the work before calling "action". That's what image models do and must do in the starting frame.
We can get way more influence out of manipulating images than video. There are lots of great video models and it's highly competitive. We still have so much need on the image side.
When you do image-to-video, yes you control evolution over time. But the direction is actually lower in terms of degrees of freedom. You expect your actors or explosions to do certain reasonable things. But those 1024x1024xRGB pixels (or higher) have way more degrees of freedom.
Image models have more control surface area. You exercise control over more parameters. In video, staying on rails or certain evolutionary paths is fine. Mistakes can not just be okay, they can be welcome.
It also makes sense that most of the work and iteration goes into generating images. It's a faster workflow with more immediate feedback and productivity. Video is expensive and takes much longer. Images are where the designer or director can influence more of the outcomes with rapidity.
Image models still need way more stylistic control, pose control (not just ControlNets for limbs, but facial expressions, eyebrows, hair - everything), sets, props, consistent characters and locations and outfits. Text layout, fonts, kerning, logos, design elements, ...
We still don't have models that look as good as Midjourney. Midjourney is 100x more beautiful than anything else - it's like a magazine photoshoot or dreamy Instagram feed. But it has the most lackluster and awful control of any model. It's a 2021-era model with 2030-level aesthetics. You can't place anything where you want it, you can't reuse elements, you can't have consistent sets... But it looks amazing. Flux looks like plastic, Imagen looks cartoony, and OpenAI GPT Image looks sepia and stuck in the 90's. These models need to compete on aesthetics and control and reproducibility.
That's a lot of work. Video is a distraction from this work.
Hot take: text-to-image models should be biased toward photorealism. This is because if I type in "a cat playing piano", I want to see something that looks like a 100% real cat playing a 100% real piano. Because, unless specified otherwise, a "cat" is trivially something that looks like an actual cat. And a real cat looks photorealistic. Not like a painting, or cartoon, or 3D render, or some fake almost-realistic-but-clearly-wrong "AI style".
FYI: photorealism is art that imitates photos, and I see the term misused a lot both in comments and prompts (where you'll actually get subideal results if you say "photorealism" instead of describing the camera that "shot" it!)
> Run FLUX.2 [dev] on GeForce RTX GPUs for local experimentation with an optimized fp8 reference implementation of FLUX.2 [dev], created in collaboration with NVIDIA and ComfyUI.
Glad to see that they're sticking with open weights.
That said, Flux 1.x was 12B params, right? So this is about 3x as large plus a 24B text encoder (unless I'm misunderstanding), so it might be a significant challenge for local use. I'll be looking forward to the distill version.
Looking at the file sizes on the open weights version (https://huggingface.co/black-forest-labs/FLUX.2-dev/tree/mai...), the 24B text encoder is 48GB, the generation model itself is 64GB, which roughly tracks with it being the 32B parameters mentioned.
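A quick sanity check of that arithmetic, as a minimal sketch. It assumes dense weights stored at 16 bits (2 bytes) per parameter, which is what the 48GB and 64GB figures imply but which the repository listing itself is not quoted as stating; the helper name is made up for illustration.

```python
def weights_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size in GB (10^9 bytes) of a dense model's weights."""
    # params_billion * 1e9 params * (bits / 8) bytes, divided by 1e9 bytes per GB
    return params_billion * bits_per_weight / 8


# Assumption: both checkpoints are stored at 16 bits per weight.
text_encoder_gb = weights_size_gb(24, 16)
generator_gb = weights_size_gb(32, 16)

print(f"text encoder: {text_encoder_gb:.0f} GB")
print(f"generator:    {generator_gb:.0f} GB")
print(f"total:        {text_encoder_gb + generator_gb:.0f} GB")
```

Under that assumption the two files come to 48 GB and 64 GB, matching the listing.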
Downloading over 100GB of model weights is a tough sell for the local-only hobbyists.
100 GB is less than a game download, it's actually running it that's a tough sell. That said, the linked blog post seems to say the optimized model is both smaller and greatly improved the streaming approach from system RAM, so maybe it is actually reasonably usable on a single 4090/5090 type setup (I'm not at home to test).
As far as I know, no open-weights image gen tech supports multi-GPU workflows except in the trivial sense that you can generate two images in parallel. The model either fits into the VRAM of a single card or it doesn’t. A 5ish-bit quantization of a 32Gw model would be usable by owners of 24GB cards, and very likely someone will create one.
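The 24GB claim can be checked with rough numbers. This is only a sketch: it counts the generator's weights alone and ignores activations, the text encoder, the VAE, and any per-block quantization metadata, all of which would eat into the headroom. The function names and the `overhead_gb` parameter are hypothetical.

```python
def quantized_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only footprint in GB (10^9 bytes) at a given bit width."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 0.0) -> bool:
    """True if the weights plus an assumed fixed overhead fit in VRAM."""
    needed = quantized_weights_gb(params_billion, bits_per_weight) + overhead_gb
    return needed <= vram_gb


# 32B-parameter generator on a 24 GB card at several bit widths.
for bits in (16, 8, 5, 4):
    size = quantized_weights_gb(32, bits)
    verdict = "fits" if fits_in_vram(32, bits, vram_gb=24) else "does not fit"
    print(f"{bits:>2}-bit: {size:5.1f} GB -> {verdict}")
```

At 5 bits the weights come to about 20 GB, which leaves roughly 4 GB on a 24 GB card for everything else.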
Genuine question: does anyone use any of these text-to-image models regularly for nontrivial tasks? I am curious to know how they get used. It literally seems like there is a new model reaching the top 3 every week.
I just finished my Flux 2 testing (focusing on the Pro variant here: https://replicate.com/black-forest-labs/flux-2-pro). Overall, it's a tough sell to use Flux 2 over Nano Banana for the same use cases, but even if Nano Banana didn't exist it's only an iterative improvement over Flux 1.1 Pro.
Some notes:
- Running my nuanced Nano Banana prompts through Flux 2, Flux 2 definitely has better prompt adherence than Flux 1.1, but in all cases the image quality was worse/more obviously AI generated.
- The prompting guide for Flux 2 (https://docs.bfl.ai/guides/prompting_guide_flux2) encourages JSON prompting by default, which is new for an image generation model that has the text encoder to support it. It also encourages hex color prompting, which I've verified works.
- The Flux 2 API will flag anything tangentially related to IP as sensitive even at its lowest sensitivity level, which is different from the Flux 1.1 API. If you enable prompt upsampling, it won't get flagged, but the results are...unexpected. https://x.com/minimaxir/status/1993365968605864010
- Costwise and generation-speed-wise, Flux 2 Pro is on par with Nano Banana, and adding an image as an input pushes the cost of Flux 2 Pro higher than Nano Banana. The cost discrepancy increases if you try to utilize the advertised multi-image reference feature.
- Testing Flux 1.1 vs. Flux 2 generations does not result in objective winners, particularly around more abstract generations.
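For anyone who wants to try the JSON and hex color prompting mentioned in the notes above, here is a rough sketch of building such a prompt. The field names (`scene`, `subject`, `color_palette`) are illustrative assumptions, not the schema from the BFL prompting guide, so check the guide for the keys it actually recommends. The only thing the code enforces is that colors are well-formed 6-digit hex strings.

```python
import json
import re

# Matches strings like "#FF5733"; 3-digit shorthand is deliberately rejected.
HEX_COLOR = re.compile(r"^#[0-9A-Fa-f]{6}$")


def build_prompt(scene: str, subject: str, palette: list[str]) -> str:
    """Serialize a structured prompt to JSON after validating hex colors."""
    bad = [c for c in palette if not HEX_COLOR.match(c)]
    if bad:
        raise ValueError(f"not 6-digit hex colors: {bad}")
    # Hypothetical field names, chosen for illustration only.
    prompt = {"scene": scene, "subject": subject, "color_palette": palette}
    return json.dumps(prompt, indent=2)


print(build_prompt(
    scene="rainy neon street at night",
    subject="a cat playing piano",
    palette=["#FF5733", "#1A1A2E"],
))
```

The resulting string would be sent as the prompt text; how the API expects it to be passed is not covered here.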
I've re-run my benchmark with the Flux 2 Pro model and found that in some cases the higher resolution models (I believe Flux 2 Pro handles 4k) can actually backfire on some of the tests, because they introduce the equivalent of an almost ESRGAN-style upscale, which may add unwanted details. (See the Constanza test in particular.)
Text encoder is Mistral-Small-3.2-24B-Instruct-2506 (which is multimodal) as opposed to the weird choice to use CLIP and T5 in the original FLUX, so that's a good start albeit kinda big for a model intended to be open weight. BFL likely should have held off the release until their Apache 2.0 distilled model was released in order to better differentiate from Nano Banana/Nano Banana Pro.
The pricing structure on the Pro variant is...weird:
> Input: We charge $0.015 for each megapixel on the input (i.e. reference images for editing)
> Output: The first megapixel is charged $0.03 and then each subsequent MP will be charged $0.015
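To make the quoted scheme concrete, here is a small calculator. It is a sketch under assumptions the quote does not settle: sizes are taken as whole megapixels and a 1 MP output is treated as the minimum. How BFL rounds fractional megapixels is not stated above. The function name is made up.

```python
from decimal import Decimal

# Prices as quoted above, in USD.
INPUT_PER_MP = Decimal("0.015")
OUTPUT_FIRST_MP = Decimal("0.03")
OUTPUT_EXTRA_MP = Decimal("0.015")


def flux2_pro_cost(output_mp: int, input_mps: tuple[int, ...] = ()) -> Decimal:
    """Cost of one generation; sizes in whole megapixels (an assumption)."""
    if output_mp < 1:
        raise ValueError("output_mp must be at least 1")
    output_cost = OUTPUT_FIRST_MP + OUTPUT_EXTRA_MP * (output_mp - 1)
    input_cost = INPUT_PER_MP * sum(input_mps)
    return output_cost + input_cost


print(flux2_pro_cost(1))            # text-to-image, 1 MP output
print(flux2_pro_cost(1, (1,)))      # edit with one 1 MP reference image
print(flux2_pro_cost(4, (1, 1)))    # 4 MP output with two 1 MP references
```

Under these assumptions a single 1 MP reference image raises a 1 MP generation from $0.03 to $0.045.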
> BFL likely should have held off the release until their Apache 2.0 distilled model was released in order to better differentiate from Nano Banana/Nano Banana Pro.
Qwen-Image-Edit-2511 is going to be released next week. And it will be Apache 2.0 licensed. I suspect that was one of the factors in the decision to release FLUX.2 this week.
> as opposed to the weird choice to use CLIP and T5 in the original FLUX
Yeah, CLIP here was essentially useless. You can even completely zero the weights through which the CLIP input is ingested by the model and it barely changes anything.
Nice catch. Looks like engineers tried to take care of the GTM part as well and (surprise!) messed it up. In any case, the biggest loser here is Europe once again.
The model looks good for an open source model. I want to see how these models are trained. Maybe they start from a base model trained on academic datasets and quickly fine-tune it with outputs from models like Nano Banana Pro or something? That could be the game for such models. But it's great to see an open source model competing with the big players.
Great, this goes more into the technical details. It would be great to see the data, though. I know they will not expose such information, but it would be good to have some visibility into the datasets and how the data was sourced.
Their published benchmarks leave a lot to be desired. I would be interested in seeing their multi-image performance vs. Nano Banana. I just finished up benchmarking image editing models, and while Nano Banana is the clear winner for one-shot editing, it's not great at few-shot.
The issue with testing multi-image with Flux is that it's expensive due to its pricing scheme ($0.015 per input image for Flux 2 Pro, $0.06 per input image for Flux 2 Flex: https://bfl.ai/pricing?category=flux.2) while the cost of adding additional images is negligible in Nano Banana ($0.000387 per image).
In the case of Flux 2 Pro, adding just one image increases the total cost to be greater than a Nano Banana generation.
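For a sense of scale, the per-image figures quoted above can be multiplied out for a multi-image test. This is a sketch that only covers the extra cost of reference images, not the base generation price, and it takes the quoted per-image numbers at face value.

```python
from decimal import Decimal

# Per-reference-image prices in USD, as quoted in this thread.
PER_REFERENCE_IMAGE = {
    "Flux 2 Pro": Decimal("0.015"),
    "Flux 2 Flex": Decimal("0.06"),
    "Nano Banana": Decimal("0.000387"),
}


def reference_cost(model: str, n_images: int) -> Decimal:
    """Extra cost of attaching n_images reference images to one request."""
    return PER_REFERENCE_IMAGE[model] * n_images


for model in PER_REFERENCE_IMAGE:
    print(f"{model:12s} 4 refs: ${reference_cost(model, 4):.6f}")

ratio = PER_REFERENCE_IMAGE["Flux 2 Pro"] / PER_REFERENCE_IMAGE["Nano Banana"]
print(f"Flux 2 Pro vs. Nano Banana per reference image: about {ratio:.0f}x")
```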
I ran "family guy themed cyberpunk 2077 ingame screenshot, peter griffin as main character, third person view, view of character from the back" on both nano banana pro and bfl flux 2 pro. The results were staggering. The google model aligned better with the cyberpunk ingame scene, flux was too "realistic"
Wow, the Krea relationship soured? These are both a16z companies and they've worked on private model development before. Krea.1 was supposed to be something to compete with Midjourney aesthetics and get away from the plastic-y Flux models with artificial skin tones, weird chins, etc.
This list of partners includes all of Krea's competitors: HiggsField (current aggregator leader), Freepik, "Open"Art, ElevenLabs (which now has an aggregator product), Leonardo.ai, Lightricks, etc. but Krea is absent. Really strange omission.
Oh, looks like someone had to release something very quickly after Google came for their lunch. It seems BFL's 15 minutes are over already.
yeah except I can download this and run it on my computer, whereas Nano Banana is a service that Google will suddenly discontinue the instant they get bored with it
https://genai-showdown.specr.net/image-editing
It scored slightly higher than BFL's Kontext model, coming in around the middle of the pack at 6 / 12 points.
I’ll also be introducing an additional numerical metric soon, so we can add more nuance to how we evaluate model quality as they continue to improve.
If you're solely interested in seeing how Flux 2 Pro stacks up against the Nano Banana Pro, and another Black Forest model (Kontext), see here:
https://genai-showdown.specr.net/image-editing?models=km,nbp...
See my third comparison in Nano Banana blog post: https://quesma.com/blog/nano-banana-pro-intelligence-with-to...
So the only option will be [klein] on a single GPU... maybe? Since we don't have much information.
- Prompt upsampling is an option, but it's one that's pushed in the documentation (https://github.com/black-forest-labs/flux2/blob/main/docs/fl...). This does allow the model to deductively reason, e.g. if asked to generate an image of a Fibonacci implementation in Python it will fail hilariously if prompt upsampling is disabled, but get somewhere if it's enabled: https://x.com/minimaxir/status/1993361220595044793
This method was used in tons of image generation models. Not saying it's superior or even a good idea, but it definitely wasn't "weird".
[1] https://raywang4.github.io/equilibrium_matching/
anyone found this? To me the link doesn't lead to the model
I wonder what happened.
But can it still turn my screen orange?
it's pointless to compare in pure output when one is set in stone and the other can be built upon.