This guide is meant for beginners who want to try generating their own anime pictures or modifying existing pictures, e.g. KK screenshots. It will guide you through installing a local copy of WebUI and some SD (Stable Diffusion) models.
Word of caution: you will need* a beefy NVIDIA GPU, details below.
Example of what can be accomplished with this guide (% refers to how much the AI was allowed to redraw, roughly speaking).
First, some basics:
You can get basic image generation set up pretty easily, but beware, the rabbit hole is very deep. There are new models and scripts released every day, each possibly an improvement. If you want to have the best available, prepare for a lot of reading and a new hobby. This guide is not that.
Running Stable Diffusion on other ecosystems is possible but limited. Performance will be far worse and some things that require CUDA won't work at all.
NVIDIA provides programmers with an API called Compute Unified Device Architecture (CUDA) which gives access to a big set of general-purpose cores, which act similarly to CPU cores but are a lot faster at parallel computing tasks. Most of today's machine learning ecosystems are built upon CUDA-accelerated libraries and can only leverage their full potential when CUDA is available.
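If you want to check whether your setup can actually use CUDA, a quick test from Python (for example inside the webui's venv once it is installed) looks like this; the device name printed will of course depend on your GPU:

```python
# Quick sanity check that PyTorch can see a CUDA-capable GPU.
# Run this inside the Python environment the webui uses (its "venv" folder).
import torch

print(torch.cuda.is_available())           # True means CUDA acceleration will work
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # e.g. "NVIDIA GeForce RTX 3060"
```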
Stable Diffusion used to require large amounts of VRAM, which can only be found in very expensive GPUs such as the RTX 3090/4090 or NVIDIA's workstation cards.
Since the release in September 2022, VRAM usage for inference has improved tremendously and you can easily get away with 16, 10 or even 8 GB of VRAM. Even less VRAM is possible, but then you start running into similar issues as with non-NVIDIA cards.
Only advanced tasks such as training still require bigger amounts of VRAM, but even that has improved by a lot.
Nevertheless, the more VRAM the better, as you can go for higher resolutions and higher batch sizes (how many images are generated simultaneously).
When downloading models, prefer .safetensor over .ckpt files.
Install the webui by running git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git. Afterwards you can auto-update by running the git pull command.
Place the model files (.safetensor/.ckpt files, should be a few GB big) you have in the models\Stable-diffusion directory (it should already exist and contain a "Put Stable Diffusion checkpoints here.txt" file in it). Tip: If you want to use other UIs or programs that use Stable Diffusion, which have their own models folder, you can store your models in a central folder and symlink it to the models folder for each UI. This way you don't have to keep multiple copies of the big checkpoint files.
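As a rough sketch of that symlink idea (the folder paths below are made-up examples, adjust them to wherever you actually keep your models; on Windows you can equally use mklink /D in a command prompt):

```python
# Minimal sketch: link a central model folder into the webui's models directory.
# The paths are examples only - change them to match your own setup.
# On Windows, creating symlinks may require Developer Mode or an elevated prompt.
from pathlib import Path

central = Path(r"D:\SD\models")                                     # where you keep all checkpoints
webui   = Path(r"C:\stable-diffusion-webui\models\Stable-diffusion")

if not webui.is_symlink():
    if webui.exists():
        webui.rmdir()          # only works if the folder is empty - move files out first
    webui.symlink_to(central, target_is_directory=True)
```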
Note: If you have a 16xx series GPU (e.g. GTX 1660), you will most likely get a black screen when generating images. To fix this, edit webui-user.bat and add --precision full --no-half --medvram to the COMMANDLINE_ARGS line (i.e. set COMMANDLINE_ARGS=--precision full --no-half --medvram).
On all cards: if you get black images, try adding --no-half-vae.
Run webui-user.bat (as normal user, not administrator). See if there are any errors in the console window that opened. It should download and install a bunch of other requirements, and once ready say: Running on local URL: http://127.0.0.1:7860. You should now see something like this:
Generated images are automatically saved to the outputs folders. If you use the batch feature, the -grids folders will also contain a copy. Note: If you have issues getting things to start/run properly, check this page for troubleshooting steps.
Write a rough description of what you want in the "Prompt" field. You can use natural language, but for anime models you'll most likely have to write in danbooru tags (e.g. 1girl, solo, dress, long hair, blue eyes). It's very common to prefix your prompts with Masterpiece, best quality, and to put a bunch of bad tags in the negative prompt (e.g. lowres, blurry, bad anatomy, bad hands, worst quality, jpeg artifacts).
Hitting Generate will yield a picture that at least mostly fits the description. Now starts the process of "prompt engineering", which is basically the art of writing prompts in a way that makes the AI generate what you want. A big part of that is emphasis. You can wrap a tag in () to increase its emphasis by a factor of 1.1. If you want more control over the emphasis, use this syntax: (<tag>:<factor>) (e.g. (big eyes:1.2) or (long hair:0.8)). It should be noted that too much emphasis can quickly make things worse than using no emphasis, so don't overdo it. If something does not work out, try prompting differently (use a synonym or describe the feature) and consider negative prompts.
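If you are wondering how the parentheses stack: each extra pair multiplies the attention by another 1.1, while the (tag:factor) form sets the factor directly. A quick back-of-the-envelope check:

```python
# How attention weights stack in the webui's prompt syntax (as described above):
# each ( ) layer multiplies by 1.1; an explicit (tag:1.2) sets the factor directly.
for layers in range(1, 4):
    print(f"{'(' * layers}tag{')' * layers} -> weight {1.1 ** layers:.2f}")
# (tag)     -> weight 1.10
# ((tag))   -> weight 1.21
# (((tag))) -> weight 1.33
```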
Powerful anime models such as Anything v3/4 can generate acceptable images almost every try. But still, expect a lot of trial and error before getting something that really suits you.
Note: You can hover over some of the buttons and labels to see a popup with more info.
Next, there's the "Sampling Steps" setting. It controls how many times the AI refines the image. Euler a with 20-40 steps is a good starting point. Higher values require more time to process but do not increase VRAM usage. 30-40 steps is usually best for most models, and it's generally accepted that anything over 50 is a waste of time and energy. It's worth experimenting with other samplers, but for anime, Euler a and Euler are generally the best.
The default resolution of Stable Diffusion is 512x512, but depending on the model you are using you might get better results with higher resolutions and different aspect ratios. Bigger images will take more time and VRAM though.
The batch options let you generate multiple images at once. Batch Size indicates how many images to generate simultaneously (taking more time and VRAM), while Batch Count refers to how many batches in a row it should generate with the same settings. There really isn't much reason to increase the batch count, as it has no benefit over hitting the generate button again when it finishes.
"CFG Scale" specifies how strongly your prompt should be followed. Too high value might cause disturbing effects, like a barefoot
prompt generating a picture full of detached and mangled feet. Too low scale will result in the AI doing its own thing and not caring too much about what you want. The optimal values seem to be in the 7-13 range.
Seed is the starting point of the image. The AI will try to turn noise made out of this seed into something that fulfills your prompt.
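All of the settings above can also be driven from a script instead of the UI: if you add --api to COMMANDLINE_ARGS, the webui exposes a small HTTP API. The sketch below is only a rough example; the endpoint and field names are my best understanding of that API, and the prompt and values are placeholders.

```python
# Rough sketch of driving txt2img over the webui API (start the UI with --api).
# Field names follow the /sdapi/v1/txt2img endpoint as I understand it; adjust
# the prompt, resolution etc. to taste.
import base64, requests

payload = {
    "prompt": "masterpiece, best quality, 1girl, solo, dress, long hair, blue eyes",
    "negative_prompt": "lowres, blurry, bad anatomy, bad hands, worst quality",
    "sampler_name": "Euler a",
    "steps": 28,
    "cfg_scale": 9,
    "width": 512,
    "height": 512,
    "seed": -1,          # -1 = random seed, same as in the UI
    "batch_size": 1,
}

r = requests.post("http://127.0.0.1:7860/sdapi/v1/txt2img", json=payload, timeout=300)
r.raise_for_status()
with open("api_output.png", "wb") as f:
    f.write(base64.b64decode(r.json()["images"][0]))
```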
The AI doesn't actually "draw" a picture like you would. Instead, it tries to remove noise from an image. This approach is called a "Diffusion Model". Those models have proven to work much better than previous more straight-forward approaches. You can read more about it here.
At first, the AI is given your prompt and a random noise texture generated from the Seed. Then, it tries to remove the noise, but going from 0 to finished in one go does not work very well (you can see it for yourself by setting "Sampling Steps" to 1). To get better results, the now-denoised image is given a smaller amount of new noise and fed back into the AI so that it can be denoised again. This is repeated with less and less added noise for the "Sampling Steps" amount of times. Try increasing the steps 1 by 1 to see the increase in quality.
This is an animation showing a sweep of Sampling Steps from 1 to 21. Each picture was given 1 more step. All pictures use the same prompt and seed.
The more steps you have, the more detailed and sharp the image tends to be, but the scaling is logarithmic, which means it gets much better at lower values, but has very little improvement above a certain point (around 20 is optimal for quality vs render time, 30 for better quality).
More steps doesn't mean better content though, especially with complicated things like hands - you can often get better results with fewer steps here. This is caused by the AI "overshooting" its target and trying to denoise the shapes that it already fleshed out to a good degree.
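To make the loop described above a bit more concrete, here is a toy illustration of the "denoise, add a little less noise, repeat" idea. This is emphatically not the real sampler math (real samplers subtract noise predicted by the model according to a schedule); it only shows why more steps converge on a cleaner result and why the gains flatten out.

```python
# Toy version of the iterative refinement loop described above - NOT real
# Stable Diffusion sampling. The "denoiser" here just pulls the image toward a
# fixed target; the point is only to show the effect of the step count.
import numpy as np

rng = np.random.default_rng(42)          # plays the role of the "Seed"
target = rng.random((64, 64))            # stand-in for "what the prompt wants"

def run(steps: int) -> float:
    image = rng.random((64, 64))         # step 0: pure noise
    for step in range(steps):
        image = 0.7 * image + 0.3 * target               # crude "denoising" pull
        fresh_noise = 1.0 - (step + 1) / steps           # add less noise each step
        image += 0.1 * fresh_noise * rng.random((64, 64))
    return float(np.abs(image - target).mean())          # distance from the target

for steps in (1, 5, 20, 50):
    print(steps, "steps ->", round(run(steps), 3))       # improvements flatten out
```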
This mode is similar to txt2img, except instead of random noise the AI is given an existing image and the generation process starts midway-through. It can be used to apply a prompt to an existing image to modify it, or to completely replace parts of the image. It's very useful for tweaking images generated by txt2img.
The point at which the AI starts redrawing your image is controlled by the "Denoising strength". What this does is specify at what step the AI is supposed to start (as opposed to step 0, pure noise).
If for example you select 0.50 Denoising strength, your picture will be given half of the noise and the AI will start at the halfway point, so if you set the Sampling Steps to 20, only 20 * 0.50 = 10 sampling steps will be performed. This means that higher values will take longer to process.
The optimal values for "Denoising strength" are different depending on what you are looking for:
Example of different denoising strengths for a Koikatsu screenshot and the prompt 2d, anime, angry (the actual prompt was more detailed but you get the idea).
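If you prefer scripting, the same img2img settings (source image, prompt, denoising strength) can be sent over the webui API mentioned earlier (requires the --api launch flag). As before, the field names are my best understanding and the file paths are placeholders:

```python
# Rough sketch of img2img over the webui API (start the UI with --api).
# "my_screenshot.png" is a placeholder - point it at your own Koikatsu screenshot.
import base64, requests

with open("my_screenshot.png", "rb") as f:
    source = base64.b64encode(f.read()).decode()

payload = {
    "init_images": [source],
    "prompt": "masterpiece, best quality, 2d, anime, 1girl",
    "negative_prompt": "lowres, blurry, bad anatomy, worst quality",
    "denoising_strength": 0.4,   # how much the AI may redraw, as explained above
    "steps": 30,                 # at 0.4 strength roughly 30 * 0.4 = 12 steps actually run
    "cfg_scale": 9,
    "width": 512,
    "height": 768,               # match the source aspect ratio where possible
    "resize_mode": 0,            # 0 should correspond to "Just resize"
}

r = requests.post("http://127.0.0.1:7860/sdapi/v1/img2img", json=payload, timeout=300)
r.raise_for_status()
with open("img2img_output.png", "wb") as f:
    f.write(base64.b64decode(r.json()["images"][0]))
```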
Image size (Width/Height) should be close to or preferably in the same aspect ratio as the input image, or the image will be stretched to fit and the output might become deformed. Either adjust the long edge to fit, or turn on something other than the "Just resize" option:
You can make the AI redraw only a small part of an image by using the "Inpaint" mode. You have to draw over the area you want to redraw with your cursor. The options work mostly the same as in img2img mode.
This mode can be used to fix places that the AI messed up, e.g. to remove extra limbs, change how joints bend, or do targeted changes like changing only the hair color.
To use it you can draw a mask on your image directly in the webui or upload a mask (a black-and-white image with black for the masked area). When in inpaint mode there are a few additional sliders and settings:
This will take any image and attempt to upscale it better than a simple resize (similarly to waifu2x, but it's not exactly the same).
All upscalers other than Lanczos use neural networks, which will have to be downloaded during the first time you use them. This process is fully automated, you only need to wait a bit for the download to finish. You can see the download progress in the console window.
These networks are easier and faster to run than the SD Upscale, but also less powerful.
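If you want to batch-upscale from a script, the Extras upscalers can also be reached over the webui API (with --api). Treat this as a sketch: the endpoint and field names are my best understanding, and the upscaler name must match one of the entries in your dropdown.

```python
# Rough sketch of the Extras upscaler over the webui API (start with --api).
# "render.png" is a placeholder path; pick any upscaler installed in your UI.
import base64, requests

with open("render.png", "rb") as f:
    source = base64.b64encode(f.read()).decode()

payload = {
    "image": source,
    "upscaling_resize": 2,                     # 2x the original resolution
    "upscaler_1": "R-ESRGAN 4x+ Anime6B",      # example name - must exist in the dropdown
}

r = requests.post("http://127.0.0.1:7860/sdapi/v1/extra-single-image", json=payload, timeout=300)
r.raise_for_status()
with open("render_2x.png", "wb") as f:
    f.write(base64.b64decode(r.json()["image"]))
```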
This will not only upscale, but can also help with "fixing" some weirdness in the image, because it will do the same denoising as normal img2img.
You usually want to keep the image mostly the way it is, so using a denoising strength of 0.2 to 0.3 is highly recommended. With very low denoising values it will only do a few steps, therefore I recommend bumping up the steps to at least 35-45.
Because of the way the SD upscale script works, you should always leave the batch size at 1, otherwise you'll just waste time and energy.
You can see what settings and prompt an image was generated with in the "PNG Info" tab. The image file must be unmodified after it was generated, or the metadata might be lost.
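If you ever want to read that metadata outside the webui, the parameters are stored as a text chunk inside the PNG, so a couple of lines of Pillow are enough (the file name below is just an example):

```python
# Read the generation parameters the webui embeds in its PNGs.
# "00001-1234567890.png" is a placeholder - use one of your own outputs.
from PIL import Image

img = Image.open("00001-1234567890.png")
print(img.info.get("parameters", "no generation metadata found"))
# Prints the prompt, negative prompt, steps, sampler, CFG scale, seed, etc.
```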
The rest of the features are more advanced and require separate guides to use optimally. The rabbit hole goes deep. Very deep.
I want you to redraw this picture with legs, feet… no, no, not like this, the opposite… I didn't mean it that way!
Usually you want to start with some simple set of generic tags like 2d, high quality, highly detailed for the prompt and 3d, low quality, watermark for the negative prompt. Add more specific tags as necessary depending on what output you get.
In img2img, generally speaking, the higher the denoising strength is, the better the description has to be or you'll start losing important features of your image. If you want your image to look anything like the source, it's not uncommon to hit the 75 token limit when going with more than 0.3 denoising (in the latest update to Automatic1111's WebUI this limit was apparently increased).
"Prompt Engineering" refers to the process of finding and combining tags that make the AI work better to get the best output possible. You can read about it in detail here and here.
Please note that the following is entirely based on my (Njaecha's) experience and may only apply to the WD 1.3 Float 32 EMA Pruned model!
Here's a collection of useful tags with preview pictures.
One thing you will certainly notice when playing around with txt2img is the AI's bias. Certain tags will often bring concepts or things with them that don't necessarily relate to the tag itself. The cute tag, for example, in combination with girl will often generate very young looking characters. To prevent the AI from doing that you can write the bias into the negative prompt or write the opposite as an additional tag into the main prompt. You may want to write a cute girl with a mature body or put young in the negative prompt, for example.
Secondly, here are some settings I can recommend for starting out with a new prompt:
Last but not least, a few tips for writing prompts:
- focus on upper body to get fewer chopped-off heads while keeping the upper body
- full body image if you want legs and torso, especially good for standing poses
- worm's eye view or bird's eye view for low or high camera angles
- instead of catgirl use girl with cat ears and a tail
- for hairstyles, try ponytail, long hair or over the eye bangs
- describe the clothing, e.g. wearing a red dress, white leotard or grey hoodie and adidas leggings
- fantasy for anything with armor or medieval weapons
- cyberpunk for futuristic stuff like cyborgs or androids
- classroom or at school for a school setting
- in a forest or mountains in background for fantasy
- at the beach or in a river for something with swimsuits
- fist, open hand or similar to improve how hands are drawn
- a simple background (or even a solid white background) will improve results
First of all, there is a really useful button in the img2img mode: "Interrogate". When you click it, the AI will have a look at your source image and try to describe it. It does that in a way that is easy for it to understand, so you can take that as a reference when writing your own prompt. I usually let the AI interrogate my image once and then change the prompt to better fit it. It will often misunderstand certain parts or find things that are not in your image at all.
The interrogate button (marked in yellow). Image on the right is the source image again, so that you can see it better.
When interrogating this image the AI returned "a girl with a sword and a cat on her shoulder is posing for a picture with her cat ears, by Toei Animations", which is obviously not quite what the image shows. I would change this to something like "a girl with red hair and cat ears is holding a sword and is doing a defensive pose in front of the camera, pink top, blue skirt, focus on upper body".
Fun fact: almost every Koikatsu image will be interrogated as "by Toei Animations" because that's more or less the only "artist" that Google's BLIP model (which is used for this feature) knows for Koikatsu's style. Sometimes it will also say by sailor moon though.
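If you find yourself interrogating lots of images, the same thing can be done over the webui API (with the --api launch flag). The endpoint and field names below are my best understanding; "clip" gives the BLIP-style caption used by the Interrogate button, and depending on your install "deepdanbooru" may be available for tag-style output.

```python
# Sketch: ask the webui to interrogate an image over the API (launch with --api).
# "my_screenshot.png" is a placeholder path.
import base64, requests

with open("my_screenshot.png", "rb") as f:
    source = base64.b64encode(f.read()).decode()

r = requests.post(
    "http://127.0.0.1:7860/sdapi/v1/interrogate",
    json={"image": source, "model": "clip"},   # try "deepdanbooru" for danbooru-style tags
    timeout=120,
)
r.raise_for_status()
print(r.json()["caption"])                      # e.g. "a girl with a sword ..."
```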
In the screenshot above you can also see my recommended base settings for img2img with a Koikatsu source image:
After you have run with those base settings, you can adjust them:
In case you roll a really good image but there is this one thing bothering you, instead of going into inpaint to try to fix it, you can also copy the seed, change the settings slightly and regenerate. Stable Diffusion is a "frozen" model by default, so generating with the same settings on the same seed will result in the same image.
In img2img it is especially useful to describe the clothing your character is wearing. The color will usually stay the same, but the type of clothing might differ heavily from the source image if you don't.
While the AI is impressively good at understanding the images, there might be parts of the source image that look unnatural (for example skin clipping through clothing). This can confuse the AI and make it try to generate some kind of object from it, which we don't want. A quick and easy solution is to hop into Photoshop and simply edit those things away. It doesn't have to be a good edit, just enough that Waifu Diffusion won't get confused.
All in all, Photoshop (or GIMP) is very useful for removing any small mistakes the AI made. Or you can combine two or more good images to get one great image, for example by taking the face from image A and the body from image B.
Furthermore, most things I said in the txt2img section also apply to img2img. If you skipped to this part right away consider giving it a read.
Automatic1111's webui has support for extensions now, and there is a very useful extension for tagging called stable-diffusion-webui-wd14-tagger. It can analyse any image using an image recognition AI called deepdanbooru that will basically tag the image for you. You can then just copy-paste these tags to be used with Waifu Diffusion (remember: WD is trained on Danbooru). Note: This also works quite well for NovelAI, they seem to use a similar tagging system.
Installing it is quite simple:
The "WD 1.4 Tagger" extension is towards the bottom of the list.
To use the extension open the "Tagger" tab and choose either "wd14" or "deepdanbooru" from the Interrogator dropdown. If the dropdown is empty you did not install the additional models correctly. Read the above and make sure you put the downloaded files in the correct folders.
Then just choose or drag'n'drop any image as the Source and it will spit out a bunch of tags in the top right. I recommend setting the Threshold slider to something above 0.5 so that it only spits out tags with a confidence score of more than 50%.
wd14 and deepdanbooru "find" different tags, so it's worth trying both and looking at the differences and the confidence ratings.
Now you can copy the tags to txt2img or img2img, or use them as inspiration for which tags to put in a prompt of your own. Depending on how many tags you use and how confident the interrogation was, you can generate images that are quite similar to the one you entered.
"Artists" (the by [artist name]
or in the style of [name of work]
tags) are bascially a way to tell the AI what style it is supposed to mimic. If you ask it to generate a Picture of Hatsune Miku in the style of HR Giger
for example you can get some really freaky results:
As Waifu Diffusion is trained on Danbooru, you can try some of your favourite doujin artists, but often the amount of images in the training data is too small for it to "know" them. As a rule of thumb, the more famous an artist is (on a global scale), the higher the chance that WD knows their style.
I personally don't use artists for anime images and Koikatsu img2img as it's not really necessary, but if your source image already has some kind of style you might want to specify it. If you made a Jojo character in Koikatsu for example, writing in the style of Jojo's bizarre adventures is probably a nice addition. It's also a lot of fun to try out what your character would look like in certain styles:
original image [...] in the style of dragon ball (here I had to use a denoising strength of 0.5 because I wanted the image to change a lot).