Hi there,
I'm a complete amateur in design and painting, but I got kinda hooked on Stable Diffusion, because it (theoretically) lets me create images out of my fantasy without needing the digital painting skills.
I poured a couple of weekends' worth of free time into learning how to use SD, and by now I'm somewhat familiar with how to write useful prompts and how to use ControlNet, inpainting and upscaling.
But now I'm a bit at a loss on how to further perfect my workflow. As of right now I either get really good images that only kinda resemble the scene I was going for (letting the model / LoRAs do the heavy lifting), or I get an image that is composed exactly as I want (using ControlNet heavily) but is very poorly executed in the details, with all sorts of distorted faces, ugly hands and so on.
Basically, if I give a vaguer prompt the image comes out great, but the more specific I want to be, the more the image generation feels "strangled" by the prompt and ControlNet, and it doesn't seem to result in usable images ...
How do you approach this? Do you generate hundreds of images or more in the hope that one of them gets your envisioned scene right? Or do you make heavy use of Photoshop/Gimp for postprocessing (<- I want to avoid this), or do you painstakingly inpaint all the small details until it fits?
Edit: Just to add a thought here: I just started to realise how limited most of the models are in what they "recognise". Common everyday items are covered pretty well, e.g. prompting "smartphone" or "coffeemachine" will produce very good results, but things like "screwdriver" are already getting dicey, and with specialised terms like "halberd" it is completely hopeless. Seems I will need to go through with making my own LoRA as discussed in the other thread ....
Well I'm a noob too, using SD for a month now. There's a lot to tackle, but let me go through some points of my workflow, as far as I understand it.
So first off, you've got to ask yourself how complex the scene is going to be. One person alone is usually no problem. Multiple people are tricky.
Then you have the LoRA vs no LoRA approach.
I try to use as few LoRAs as possible, as they add an additional layer of weights to balance on top of the already necessary balancing of prompts. LoRAs deform your composition a lot, so you want to avoid them where you can. Then there are a ton of LoRAs that only work well with certain checkpoints, or that were trained on the opposite of the style you want to gen (anime vs realistic vs CGI, for example). If you have to use a LoRA from a different style than the one you want, you need to set the weight in the LoRA <tag> low, like 0.3 or even lower, to keep its style from bleeding into your composition. At the same time, increase the weight of the prompt keyword that the LoRA uses. I try to avoid going above 1.9, as that seems to cause artifacts, and I do better by removing keywords, adding keywords or reordering them. Sometimes the most important keyword isn't far enough up in the list. Using BREAK to separate certain elements might help too.
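To make the weighting advice above a bit more concrete, here is a minimal Python sketch that assembles a prompt string in the AUTOMATIC1111 WebUI syntax: `<lora:name:weight>` for a LoRA tag, `(keyword:weight)` for an emphasised keyword, and `BREAK` between groups. The helper names, the LoRA name `exampleStyle` and the weight 1.4 are made up for illustration; only the 0.3 LoRA weight comes from the post.

```python
def lora_tag(name: str, weight: float) -> str:
    """Format a LoRA tag in the WebUI's <lora:name:weight> syntax."""
    return f"<lora:{name}:{weight:g}>"


def weighted(keyword: str, weight: float = 1.0) -> str:
    """Format a keyword with an attention weight, e.g. (halberd:1.4)."""
    if weight == 1.0:
        return keyword  # no parentheses needed at the default weight
    return f"({keyword}:{weight:g})"


def build_prompt(groups: list[list[str]]) -> str:
    """Join keywords with commas and separate keyword groups with BREAK."""
    return " BREAK ".join(", ".join(group) for group in groups)


# Hypothetical example: a low-weight style LoRA plus a boosted keyword.
prompt = build_prompt([
    [weighted("knight holding a halberd", 1.4), "castle courtyard",
     lora_tag("exampleStyle", 0.3)],
    ["overcast sky", "realistic photo"],
])
print(prompt)
```

Keeping the important keyword first in its group matches the point about order in the list; the exact weights still need trial and error per checkpoint.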
So far I've found that the "latent couple" and "composable lora" extensions give me good results with multiple people and multiple LoRAs. You can enable and add ControlNet as well. There's even a latent couple helper tool that makes it easier to select the parts of an image you want to be person A and person B. I haven't tried more than 4 people yet, but I guess there's almost no limit.
You are generally on a good track (meaning you picked the right balance of weights and prompts) when faces get fixed automatically by hires fix (2x resolution), meaning without enabling the restore faces option. Some checkpoints, or some combinations of checkpoint and LoRA, are bad at faces, so it's a bit of a pain searching for a different one and testing it. I have like 40 now and I seem to download more instead of fewer. Haha. But maybe learning to use one and sticking to it is smarter, as some checkpoints like or dislike certain prompts (usually described on the checkpoint's civitai page).
Increasing CFG or adding more prompt keywords can help. If you use a LoRA, you can look into its details. I use the civitai helper extension and click on the small exclamation mark (!) to see the trigger words of the LoRA (even more than are used in the example images or in the LoRA's description on civitai). There you often find words that trigger the LoRA and give it more weight in the result. For example, if the training data had more images of, say, a woman with black hair, then it's easier to generate a woman with black hair than to force blond hair.
I usually generate 4 images at once at low steps and low resolution, like 512x512, first. Once at least 1 in 4 images is similar to what I want, I start with hires fix and later img2img ultimate SD upscaling + ControlNet. (I'm not a fan of extra upscaling)
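The draft-then-refine loop above could be sketched like this in Python, as two settings dicts shaped like the request body of the AUTOMATIC1111 WebUI's `/sdapi/v1/txt2img` endpoint (available when the WebUI is started with `--api`). The step counts and the denoising strength are placeholder assumptions, not values from the post; the 512x512 size, the batch of 4 and the 2x hires fix are. Nothing is sent over the network here, the functions only build the settings.

```python
import copy


def draft_payload(prompt: str, negative_prompt: str = "") -> dict:
    """Cheap first pass: 4 small images at a low step count."""
    return {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "width": 512,
        "height": 512,
        "steps": 15,       # assumed "low steps" value
        "batch_size": 4,
        "seed": -1,        # -1 asks the WebUI for a random seed
    }


def hires_payload(draft: dict, seed: int, steps: int = 30) -> dict:
    """Second pass: one image from a chosen seed, with hires fix at 2x."""
    payload = copy.deepcopy(draft)  # leave the draft settings untouched
    payload.update({
        "seed": seed,               # same seed means same starting noise
        "batch_size": 1,
        "steps": steps,             # assumed value for the final render
        "enable_hr": True,
        "hr_scale": 2,
        "denoising_strength": 0.5,  # assumed; tune per checkpoint
    })
    return payload


draft = draft_payload("knight in a castle courtyard")
final = hires_payload(draft, seed=1234)
```

Each dict could then be posted as JSON to the endpoint, or the same values can simply be entered by hand in the UI.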
In the end I do generate like 20 to 100 images until my composition is nearly where I want it to be. And I prefer not to use ControlNet, so I can easily reuse the prompts from earlier images. Nothing is worse than needing to search for the right ControlNet (depth, canny, reference and more) and the weights to get close again.
If it's about getting a certain position, you can combine multiple ControlNets, each at a lower weight. That's usually smart if you want an exact pose.
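As a rough illustration of stacking several ControlNets at reduced weight, here is a small Python sketch that builds the kind of nested settings the ControlNet extension accepts through the WebUI API under `alwayson_scripts`. The field names (`enabled`, `module`, `model`, `weight`) reflect my understanding of that extension's API and should be checked against its documentation; the model names and the weights 0.6 and 0.4 are hypothetical placeholders.

```python
def controlnet_unit(module: str, model: str, weight: float) -> dict:
    """Describe one ControlNet unit; weight scales its influence."""
    if weight <= 0:
        raise ValueError("weight must be positive")
    return {
        "enabled": True,
        "module": module,  # preprocessor, e.g. openpose or canny
        "model": model,    # placeholder name of the ControlNet model file
        "weight": weight,
    }


def with_controlnets(payload: dict, units: list[dict]) -> dict:
    """Return a copy of a txt2img payload with ControlNet units attached."""
    combined = dict(payload)
    combined["alwayson_scripts"] = {"controlnet": {"args": units}}
    return combined


# Two units at reduced weight instead of one unit at full weight.
units = [
    controlnet_unit("openpose", "openpose_model_placeholder", 0.6),
    controlnet_unit("canny", "canny_model_placeholder", 0.4),
]
request = with_controlnets({"prompt": "two knights sparring"}, units)
```

Writing the units down like this also helps with the reuse problem mentioned earlier, since the ControlNet types and weights are saved next to the prompt.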
Well, I hope this helped a little. Cheers!
Very solid rundown, and I agree with most of this.
In particular - Latent couple and composable LORA are amazing tools and the OP should definitely look into them.