Personally I find it works better as a refiner model downstream of Qwen-Image 20b which has significantly better prompt understanding but has an unnatural "smoothness" to its generated images.
I heard last year that the potential future of gaming is not rendering but fully AI-generated frames. At 3 seconds per 'frame' now, it's not hard to believe it could do 60fps in a few short years. It makes it seem more likely such a game could exist. I'm not sure I like the idea, but it seems like it could happen.
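For a sense of scale, here is a rough back-of-the-envelope sketch of that gap. It assumes the 3 seconds per image quoted above and a 60fps target; the function name is made up for illustration, and it ignores latency, resolution, and temporal consistency entirely.

```python
import math


def required_speedup(seconds_per_frame: float, target_fps: float) -> float:
    """How many times faster generation must get to hit the target frame rate."""
    # Each frame has a budget of 1 / target_fps seconds, so the speedup is
    # seconds_per_frame divided by that budget, i.e. seconds_per_frame * target_fps.
    return seconds_per_frame * target_fps


speedup = required_speedup(3.0, 60.0)
# Number of times throughput would have to double to close the gap.
doublings = math.log2(speedup)

print(f"speedup needed: {speedup:.0f}x (~{doublings:.1f} doublings)")
```

So "a few short years" would mean roughly seven to eight doublings of throughput, from some mix of faster hardware, smaller models, and fewer sampling steps.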
Couple that with the LoRA, in about 3 seconds you can generate completely personalized images.
The speed alone is a big factor, but if you put the model side by side with Seedream, Nano Banana, and other models, it's definitely in the top 5, and that's a killer combo imho.
I don't know anything about paying for these services, and as a beginner, I worry about running up a huge bill. Do they let you set a limit on how much you pay? I see their pricing examples, but I've never tried one of these.
That's 2/4? The kitkat bars look nothing like kitkat bars for the most part (logo? splits? white cream filling?). The DNA armor is made from normal metal links.
Fair. Nobody said it was going to surpass Flux.1 Dev (a 12B parameter model) or Qwen-Image (a 20B parameter model) where prompt adherence is strictly concerned.
It's the reason I'm holding off until the Z-Image Base version is released before adding to the official GenAI model comparisons.
But for a 6B model that can generate an image in under 5 seconds, it punches far above its weight class.
As to the passing images, there is white chocolate kit-kat (I know, blasphemy, right?).
Yeah, I've definitely switched largely away from Flux. Much as I do like Flux (for prompt adherence), BFL's baffling licensing structure along with its excessive censorship makes it a noop.
For ref, the Porcupine-cone creature that ZiT couldn't handle by itself in my aforementioned test was easily handled using a Qwen20b + ZiT refiner workflow and even with two separate models STILL runs faster than Flux2 [dev].
Most of the people I know doing local AI prefer SDXL to Flux. Lots of people are still using SDXL, even today.
Flux has largely been met with a collective yawn.
The only thing Flux had going for it was photorealism and prompt adherence. But the skin and jaws of the humans it generated looked weird, it was difficult to fine tune, and the licensing was weird. Furthermore, Flux never had good aesthetics. It always felt plain.
Nobody doing anime or cartoons used Flux. SDXL continues to shine here. People doing photoreal kept using Midjourney.
Yep. It's pretty difficult to fine tune, mostly because it's a distilled model. You can fine tune it a little bit, but it will quickly collapse and start producing garbage, even though fundamentally it should have been an easier architecture to fine-tune compared to SDXL (since it uses the much more modern flow matching paradigm).
I think that's probably the reason why we never really got any good anime Flux models (at least not as good as they were for SDXL). You just don't have enough leeway to be able to train the model for long enough to make the model great for a domain it's currently suboptimal for without completely collapsing it.
We've come a long way with these image models, and the things you can do with a paltry 6B are super impressive. The community has adopted this model wholesale, and left Flux(2) by the wayside. It helps that Z-Image isn't censored, whereas BFL (makers of Flux 2) dedicated like a fifth of their press release talking about how "safe" (read: censored and lobotomized) their model is.
> whereas BFL (makers of Flux 2) dedicated like a fifth of their press release talking about how "safe" (read: censored and lobotomized) their model is.
Agreed, but let’s not confuse what it is. Talking about safety is just “WE WON'T EMBARRASS YOU IF YOU INVEST IN US”.
It will generate anything. Xi/Pooh porn, Taylor Swift getting squashed by a tank at Tiananmen Square, whatever, no censorship at all.
With simplistic prompts, you quickly conclude that the small model size is the only limitation. Once you realize how good it is with detailed prompts, though, you find that you can get a lot more diversity out of it than you initially thought you could.
Absolute game-changer of a model IMO. It is competitive with Nano Banana Pro in some respects, and that's saying something.
I could imagine the Chinese government is not terribly interested in enforcing its censorship laws when this would conflict with boosting Chinese AI. Overregulation can be a significant inhibitor to innovation and competitiveness, as we often see in Europe.
Z-Image seems to be the first successor to Stable Diffusion 1.5 that delivers better quality, capability, and extensibility across the board in an open model that can feasibly run locally. Excitement is high and an ecosystem is forming fast.
> It's incredibly clear who the devs assume the target market is.
Not "assume". That's what the target market is. Take a look at civitai and see what kind of images people generate and what LoRAs they train (just be sure to be logged in and disable all of the NSFW filters in the options).
They maybe have an RLHF phase, but I mean there is also just the shape of the distribution of images on the internet and, since this is from Alibaba, their part of the internet/social media (Weibo) to consider.
With today's remote social validation for women and all time low value of men due to lower death rates and the disconnect from where food and shelter come from, lonely men make up a huge portion of the population.
I'm still not following. Ads for a pickup truck are probably more likely to feature towing a boat than ads for a hatchback even if they're both capable of towing boats. Because buyers of the former are more likely to use the vehicle for that purpose.
If a disproportionate share of users are using image generation for generating attractive women, why is it out of place to put commensurate focus on that use case in demos and other promotional material?
I mean things that take hard physical labor are typically self limiting...
I do nerdy computer things and I actually build things too; for example, I busted up the limestone in my backyard and put in a patio and raised garden. Working 16 hours a day coding or otherwise computering isn't that hard, even if your brain is melted at the end of the day. With 8-10 hours of physically hard labor, your body starts taking damage if you keep it up too long.
And really building houses is a terrible example! In the US we've been chronically behind on building millions of units of houses. People complain the processes are terribly slow and there is tons of downtime.
Considering how gaga r/stablediffusion is about it, they weren’t wrong. Apparently Flux 2 is dead in the water even though the knowledge it has contained in the model is way, way higher than Z-Image (unsurprisingly).
Z-Image is getting traction because it fits on their tiny GPUs and does porn sure, but even with more compute Flux 2[dev] has no place.
Weak world knowledge, worse licensing, and it ruins the #1 benefit of a larger LLM backbone with post-training for JSON prompts.
LLMs already understand JSON, so additional training for JSON feels like a cheaper way to juice prompt adherence than more robust post-training.
And honestly even "full fat" Flux 2 has no great spot: Nano Banana Pro is better if you need strong editing, Seedream 4.5 is better if you need strong generation.
i have been testing this on my Framework Desktop. ComfyUI generally causes an amdgpu kernel fault after about 40 steps (across multiple prompts), so i spent a few hours building a workaround here https://github.com/comfyanonymous/ComfyUI/pull/11143
overall it's fun and impressive. decent results using LoRA. you can achieve good looking results with as few as 8 inference steps, which takes 15-20 seconds on a Strix Halo. i also created a llama.cpp inference custom node for prompt enhancement which has been helping with overall output quality.
- Uses existing model backbones for text encoding & semantic tokens (why reinvent the wheel if you don't need to?)
- Trains on a whole lot of synthetic captions of different lengths, ostensibly generated using some existing vision LLM
- Solid text generation support is facilitated by training on all OCR'd text from the ground truth image. This seems to match how Nano Banana Pro got so good as well; I've seen its thinking tokens sketch out exactly what text to say in the image before it renders.
As an AI outsider with a recent 24GB macbook, can I follow the quick start[1] steps from the repo and expect decent results? How much time would it take to generate a single medium quality image?
I have a 24GB M5 macbook pro. In ComfyUI using default z-image workflow, generating a single image just took me 399 seconds, during which the computer froze and my airpods lost audio.
On replicate.com a single image takes 1.5s at a price of 1000 images per $1. Would be interesting to see how quick it is on ComfyUI Cloud.
Overall, running generative models locally on Macs seems like a very poor time investment.
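To put those numbers side by side, here is a quick sketch using only the figures quoted above (399 seconds per image locally, 1.5 seconds and 1000 images per $1 hosted). The constant and function names are mine, and the hosted price is just the one I saw, so treat it as an assumption that may change.

```python
LOCAL_SECONDS_PER_IMAGE = 399.0   # my 24GB MacBook Pro run in ComfyUI
HOSTED_SECONDS_PER_IMAGE = 1.5    # hosted run quoted above
HOSTED_IMAGES_PER_DOLLAR = 1000   # hosted price quoted above


def hosted_cost_usd(n_images: int) -> float:
    """Cost in dollars of generating n_images at the quoted hosted price."""
    return n_images / HOSTED_IMAGES_PER_DOLLAR


# How many hosted images fit in the time the Mac needs for one local image.
images_per_local_run = LOCAL_SECONDS_PER_IMAGE / HOSTED_SECONDS_PER_IMAGE
cost_of_that_batch = hosted_cost_usd(round(images_per_local_run))

print(f"hosted images per local image: {images_per_local_run:.0f}")
print(f"cost of that many hosted images: ${cost_of_that_batch:.3f}")
```

In other words, the time my laptop spent on one image buys a couple hundred hosted images for well under a dollar.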
If you don't know anything about AI in terms of how these models are run, ComfyUI's macOS version is probably the easiest to use. There is already a Z-Image workflow that you can get, and ComfyUI will fetch all the models you need and get them working together. You can expect decent speed.
I would say there isn't an equivalent. Some people will probably tell you ComfyUI - you can expose workflows via API endpoints and parameterize them. This is how e.g. Krita AI Diffusion uses a ComfyUI backend.
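A minimal sketch of what "parameterize a workflow" can look like, assuming a local ComfyUI server on its default address and a workflow exported in API format (a dict of node id to `class_type` and `inputs`). The node ids, the trimmed example workflow, and the helper names are hypothetical; a real export has more nodes and more inputs per node.

```python
import copy
import json
import urllib.request

# Assumption: ComfyUI running locally on its default port.
COMFYUI_URL = "http://127.0.0.1:8188/prompt"


def parameterize(workflow: dict, prompt_node_id: str, sampler_node_id: str,
                 text: str, seed: int) -> dict:
    """Return a copy of an API-format workflow with new prompt text and seed."""
    patched = copy.deepcopy(workflow)  # leave the template untouched
    patched[prompt_node_id]["inputs"]["text"] = text
    patched[sampler_node_id]["inputs"]["seed"] = seed
    return patched


def build_request(workflow: dict, url: str = COMFYUI_URL) -> urllib.request.Request:
    """Build the POST request that queues a workflow on the server."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        url, data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical, heavily trimmed template; node ids "6" and "3" are made up.
example_workflow = {
    "6": {"class_type": "CLIPTextEncode", "inputs": {"text": "placeholder"}},
    "3": {"class_type": "KSampler", "inputs": {"seed": 0, "steps": 9}},
}

patched = parameterize(example_workflow, "6", "3", "a rusty bridge at dusk", 42)
request = build_request(patched)
# urllib.request.urlopen(request) would queue the job on a running server.
```

The template stays on disk as plain JSON, and the calling application only swaps the handful of inputs it cares about.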
For various reasons, I doubt there are any large scale SaaS-style providers operating this in production today.
Unfortunately, another China censored model.
Simply ask it to generate "Tank Man" or "Lady Liberty Hong Kong" and the model returns a blackboard with text saying "Maybe Not Safe".
I follow an author who publishes online on places like Scribblehub and has a modestly successful Patreon. Over the years he has spent probably tens of thousands of dollars on commissioned art for his stories, and he's still spending heavily on that. But as image models have gotten better this has increasingly been supplemented with AI-images for things that are worth a couple dollars to get right with AI, but not a couple hundred to get a human artist to do them
Roughly speaking the art seems to have three main functions:
1. promote the story to outsiders: this only works with human-made art
2. enhance the story for existing readers: AI helps here, but is contentious
3. motivate and inspire the author: works great with AI. The ease of exploration and pseudo-random permutations in the results are very useful properties here that you don't get from regular art
By now the author even has an agreement with an artist he frequently commissions that he can use his style in AI art in return for a small "royalty" payment for every such image that gets published in one of his stories. A solution driven both by the author's conscience and by the demands of the readers
>A creativity tool for kids (and adults; consider memes).
Fixed that for you: (and adults; consider porn).
I don't think you realize the extent of the “underground” nsfw genai community, which has to rely on open-weight models since API models all have prude filters.
Except for gaming, that doesn't sound like a huge market worthy of pouring millions into training these high-quality models. And there is a lot of competition too. I suspect there are some other deep-pocketed customers for these images. Probably animations? movies? TV ads?
I'd say that picture ad market alone would suffice.
OTOH these are open-weight models released to the public. We don't get to use more advanced models for free; the free models are likely a byproduct of producing more advanced models anyway. These models can be the freemium tier, or gateway drugs, or a way of torpedoing the competition, if you don't want to believe in the goodwill of their producers.
Dying businesses like newspapers and local banks, who use it to save the money they used to spend on shutterstock images? That’s where I’ve seen it at least. Replacing one useless filler with another.
I've messed with this a bit and the distill is incredibly overbaked. Curious to see the capabilities of the full model but I suspect even the base model is quite collapsed.
I have had good textual results with the Turbo version so far. Sometimes it drops a letter in the output, but most of the time it adheres well to both the text requested and the style.
I tried this prompt on my username: "A painted UFO abducts the graffiti text "Accrual" painted on the side of a rusty bridge."
My issue with this model is it keeps producing Chinese people and Chinese text. I have to very specifically go out of my way to say what kind of race they are.
If I say “A man”, it’s fine. A black man, no problem. It’s when I add context and instructions that it just seems to want to go with some Chinese man. Which is fine, but I would like to see more variety in the people it’s trained on, to create more diverse images. For non-people it’s amazingly good.
All modern models have their default looks. Meaningful variety of outputs for the same inputs in finetuned models is still an open technical problem. It's not impossible, but not solved either.
It means it respects nationality choices. If you don’t mention one, that is bad prompting on your part, not a failure of the model to default to the nationality you would prefer.
Supports MPS (Metal Performance Shaders). Using something that skips Python entirely along with an MLX- or GGUF-converted model file (if one exists) will likely be even faster.
Incredibly fast. On my 5090 with CUDA 13 (& the latest diffusers, xformers, transformers, etc...), 9 sampling steps and the "Tongyi-MAI/Z-Image-Turbo" model I get:
Did you use PyTorch Native or Diffusers Inference? I couldn't get the former working yet so I used Diffusers, but it's terribly slow on my 4080 (4 min/image). Trying again with PyTorch now, seems like Diffusers is expected to be slow.
Uh, not sure? I downloaded the portable build of ComfyUI and ran the CUDA-specific batch file it comes with.
(I'm not used to using Windows and I don't know how to do anything complicated on that OS. Unfortunately, the computer with the big GPU also runs Windows.)
I'm particularly impressed by the fact that they seem to aim for photorealism rather than the semi-realistic AI-look that is common in many text-to-image models.
Thoughts
- It's fast (~3 seconds on my RTX 4090)
- Surprisingly capable of maintaining image integrity even at high resolutions (1536x1024, sometimes 2048x2048)
- The adherence is impressive for a 6B parameter model
Some tests (2 / 4 passed):
https://imgpb.com/exMoQ