How to take the first steps in creating your own LoRA
Imagine a small file that, when included, can reproduce a character of any complexity or imitate almost any style. Well, you no longer need to imagine it - this technology has existed for a while and is called LoRA, or Low-Rank Adaptation. Just go through this guide step by step, and you will easily make the best LoRAs there are.
Notice: You DO need a video card with at least 8 GB of VRAM, and probably around 32 GB of RAM. The best training results will be achieved with 24 GB of VRAM.
Step 1: Downloading and preparing the training script pack
First of all, we need to download the trainer. Honestly, there are a whole lot of them, with and without a GUI, using Diffusers-format models or not. I am fairly experienced with the Linux console, so I will be using the kohya-ss script pack, available here: https://github.com/kohya-ss/sd-scripts
Note that everything I do can be replicated in most modern shells, including Google Cloud Compute and pretty much every rentable GPU server. If you are using Windows 7 or lower, or don't want to learn the basics of console commands, maybe training isn't really for you... but you can find another tutorial.
On that note, all actions/commands here are done on Linux, Ubuntu 20.04. Lines starting with # are comments; you don't need to paste them into the console.
# First, this command will install the necessary programs on your server.
# This step must be done only once per server.
# On Ubuntu 20.04 you may need the deadsnakes PPA for the python3.10 packages.
sudo apt install git python3.10 python3.10-venv python3-pip
# Then, I create a folder where I will store my trainer files, and fix its access rights for my user.
# This step must be done once.
sudo mkdir /opt/kohyass
sudo chown $(id -u):$(id -g) /opt/kohyass
# I change into my newly created folder as my working directory.
cd /opt/kohyass
# If I have never installed the trainer before, this command will download all the necessary training scripts into my folder.
# Again, this step needs to be done once.
git clone https://github.com/kohya-ss/sd-scripts.git .
# This command will ensure that the scripts are at their latest version.
# Use this command every time before launching the scripts.
git pull
# This command will create a virtual environment.
# You don't need to run it more than once.
python3.10 -m venv .venv
# With this command, I activate my newly created virtual environment.
# It must always be used when I want to train something.
source .venv/bin/activate
# Finally, this batch of commands will ensure that I have the freshest dependencies.
# Again, always run them before training.
pip install --upgrade pip
pip install -r requirements.txt
pip install lycoris_lora
pip install bitsandbytes
pip install xformers
These commands will install kohya-ss sd-scripts onto your server and prepare it for use.
Notice: You can actually run them all just once. Keeping your software at the latest version is always recommended, but not required.
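To make sure everything installed correctly, you can ask the training script to print its help text. This is just an optional sanity check - if the dependencies are in place, it prints the full list of available arguments instead of import errors.
# Run from /opt/kohyass with the virtual environment activated.
python train_network.py --help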
Step 2: Configuring your dataset
First of all, identify what you want to train.
- If you want to train a character, then every single image caption must contain this character's name. Alternatively, it must contain some kind of trigger token, such as sks. Also, make sure this token is at the front of the tag line.
- If you want to train a style, you would logically think that you don't need any token - after all, you probably think of it as 'if you activated this LoRA, then you want this style'. A-a-a-a-and you would be dead wrong about that. Use something simple, like the sks token at the front of the caption, or else you will suffer the same fate I did with my Experiment.
- Finally, if you want to train a concept, then you need a name or a token for this concept as well.
Whether you train a style or a character, I always recommend starting with the most necessary tag - the character's species. Then, possibly, two tags for its type (anthro, feral, kemono...) and sex (male, female, ambiguous gender, herm, etc.). If there is no character, then start with the 'detailed background' tag and gradually name what is displayed in that background. A finished caption might then look like the example below.
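Here the sks token and every tag after it are purely illustrative placeholders - substitute whatever fits your actual character or style:
sks, tiger, anthro, male, looking at viewer, smiling, forest, detailed background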
In the end, you must have a folder full of your training pictures and captions:
ls -lah /sd/datasets
drwxr-xr-x 3 kohyass kohyass 4.0K Oct 4 16:19 .
drwxr-xr-x 10 kohyass kohyass 4.0K Oct 4 16:19 ..
-rw-r--r-- 1 kohyass kohyass 6.6K Oct 4 16:19 image1.jpg
-rw-r--r-- 1 kohyass kohyass 6.6K Oct 4 16:19 image1.txt
-rw-r--r-- 1 kohyass kohyass 6.6K Oct 4 16:19 image2.jpg
-rw-r--r-- 1 kohyass kohyass 6.6K Oct 4 16:19 image2.txt
-rw-r--r-- 1 kohyass kohyass 6.6K Oct 4 16:19 image3.jpg
-rw-r--r-- 1 kohyass kohyass 6.6K Oct 4 16:19 image3.txt
...
Notice - you store captions for images in text files named the same as the images. So for an image named image1.jpg, you store its description/tags in image1.txt.
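If you want to double-check that every image actually has a caption before training, a small shell loop like this will report any image missing its .txt file (the path and the .jpg extension here are just an example - adjust them to your dataset):
for f in /sd/datasets/*.jpg; do
  # Print the image name if the matching .txt caption does not exist.
  [ -f "${f%.jpg}.txt" ] || echo "Missing caption for: $f"
done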
Then you must create a file describing your dataset, called dataset_config.toml. You can look at examples in the repository's documentation. You can keep most of the file as is, but take note of the following items:
[general]
# Resolution: how large the images in your training dataset should be.
# If you train something SD1.5-based, you might want to set it to 512
# If you train some furry-like network, you might want to set it to 640 or 768
# And, if you train SDXL, then set it to 1024
resolution = 512
# If set to true, your caption will be split on commas (,) and
# randomly shuffled, so your "tiger, female, looking at viewer" will
# become "female, looking at viewer, tiger". Because networks try to match exact
# phrasing to the final picture, shuffling lets multiple variants of
# phrasing lead to the desired picture instead of just one strict order.
# Recommended to never disable.
shuffle_caption = true
# Number of tokens that must be kept in place at any cost.
# If set to 1, then the ONE leftmost token will never be shuffled or dropped.
# Very, very useful setting. Recommended to always keep it at '1'.
keep_tokens = 1
# Caption tag dropout rate. Ranges from 0 (0%) to 1 (100%).
# Represents the chance of removing a tag from the caption during training.
# Extremely useful; like caption shuffling, it allows multiple phrasing
# variants with some words absent. Recommended to keep at '0.1' (10%).
# You MIGHT WANT to increase it to 0.2 or 0.25 when training STYLES.
caption_tag_dropout_rate = 0.1
# Caption dropout rate. Instead of dropping single tags, this is
# the rate of dropping the whole caption and training the image without
# any tags whatsoever. Useless when training concepts or characters,
# but might help when teaching styles. Keep at 0 to 0.02.
caption_dropout_rate = 0
# Caption global dropout. Instead of dropping all tags on one image,
# it drops all captions for a whole epoch (1000-2000 images) at once.
# Rarely used. Keep at 0.
caption_dropout_every_n_epochs = 0
# Flip augmentation and color augmentation.
# Randomly flips the image horizontally or shifts its color
# back and forth on the HSL scale. First, activating either
# of these requires you to disable the 'cache_latents' setting,
# which speeds up training. Second, either of these will generally
# start to generate images with wrong colors or mirrored details.
# Not recommended!
flip_aug = false
color_aug = false
# These settings allow you to train different image sizes in one go,
# so instead of 'just square images' they let you train
# landscapes and portraits at the same time.
# Keep it on 'true'.
enable_bucket = true
# Extension of the files that should be treated as captions. Keep at '.txt'.
caption_extension = ".txt"
# Bucket settings. Just know that you want to keep bucket_reso_steps at
# a multiple of 64. Just keep it at 64.
bucket_reso_steps = 64
bucket_no_upscale = false
# These are the minimum and maximum image sizes that can be trained on.
# If you switch the resolution to a number higher than 640, then
# change these values to 320 and 1280.
min_bucket_reso = 256
max_bucket_reso = 1024
# And finally, how to treat each dataset.
[[datasets]]
# Subsets. You may have more than one of these sections.
# Each one describes a single folder on your computer.
[[datasets.subsets]]
# Path to your directory full of images.
image_dir = "/sdssd/datasets/dataset/v1"
# Class tokens. May be skipped if you have a bunch of caption files.
class_tokens = "sks"
# Number of repeats. If your dataset has 30 images, then 10 repeats
# will turn them into 300 images.
# For a good result, try to have ~2000 images per epoch.
# (You might have fewer images per epoch; it will just take more epochs.)
num_repeats = 10
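As a quick sanity check of those numbers: 30 images with num_repeats = 10 gives 30 x 10 = 300 images per epoch, so to get near the recommended ~2000 you would either add more images or raise num_repeats to roughly 60-70. And with a train_batch_size of 8 (see Step 3), 300 images per epoch works out to about 38 optimizer steps per epoch.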
Step 3: Configuring your training parameters
So, once you have your dataset prepared and ready - you probably only need to do that once per dataset - you need to tell the trainer HOW EXACTLY you want it to train a network using this dataset. You will also specify how fast the training will go, and other parameters...
To understand how training_config.toml works, you need to understand how argument parsing works. Simply put, look at the script's list of command-line arguments (the --help output): remove the dashes (--) from the left side, and what is left is what you can use in the configuration file. You can create as many sub-sections as needed; their names don't matter in the end.
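For example (the values here are arbitrary, just to illustrate the mapping):
# The command-line argument --network_dim 16 becomes the config line: network_dim = 16
# And --output_dir "/tmp/training" becomes the config line: output_dir = "/tmp/training"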
# The first main section is the main arguments.
# These change pretty much every time you start a new training.
# Remember that you can override any of them with command-line arguments
# like --output_dir "/tmp/training"
[main_arguments]
# The first two items are the path to a CKPT or SAFETENSORS model
# and its VAE. I recommend using the VAE that is recommended for the
# model itself, like vae-ft-mse-840000-ema-pruned.safetensors, but feel free
# to remove this line entirely to use the VAE built into the model.
pretrained_model_name_or_path = "/sdssd/resources/checkpoints/baseline/fluffyrock_3m_offsetnoise_e68.safetensors"
vae = "/sdssd/resources/checkpoints/vaes/vae-ft-mse-840000-ema-pruned.safetensors"
# Clip skip. The magic setting that some models expect to be 2, like the NovelAI one.
# For most models, though, keep this setting at 1.
clip_skip = 1
# Output control - both where the output will be placed and how it is named.
# I prefer to keep outputs in a separate, dedicated directory and
# follow a consistent naming system:
# Begin with a word describing the model - OBJECT, STYLE or CONCEPT.
# Follow with the shortest possible description - if it is an OBJECT, then
# which one? If a STYLE, then whose? And so on...
# Finish with a version identifier. It helps to distinguish
# between the datasets and settings used to make this LoRA.
output_dir = "/sdssd/training_artifacts/"
output_name = "STYLE_SOUL_AB12_V1"
# Next are the auxiliary arguments.
# These change from time to time, depending on what is being trained.
# Carefully inspect them and tweak to your liking.
[aux_arguments]
# network_module controls which module will be used to train the network.
# There are pre-installed modules like "networks.lora", but
# if you ran every command from this guide, "lycoris.kohya" will be available,
# which is somewhat better than 'just LoRA'.
network_module = "lycoris.kohya"
# The next three items describe what kind of network you are trying to make.
# Network dimension. Think of it as 'memory cells'. The more
# memory cells you give the LoRA, the more detail it will be able
# to ingest. But give it too much - and the LoRA will simply overfit
# and produce only the pictures it has memorized.
network_dim = 16
# Network alpha. In short, a number that keeps the weights from
# slipping to zero (and therefore not learning at all).
# Long story short, simply keep it in the range from 1 to
# half of the network_dim you use and everything will be pretty fine.
network_alpha = 8
# Arguments for the network module.
# If you set "algo=lora", you will activate LoCon training.
# LoCon is pretty good and basic, and can be used for characters.
# If you set "algo=loha", you will activate LoHa training.
# LoHa requires far fewer dimensions, since it is capable
# of learning much more information per dimension. It is good for styles.
# The convolutional dimension should generally be around half of network_dim,
# and its alpha should generally be 1.
network_args = [ "algo=lora", "conv_dim=8", "conv_alpha=1"]
# Next, the epochs.
# The number of epochs you want your model to train for.
# Generally, if your dataset consists of 2000 images, even 8 is more than enough.
# But it always depends on how large your learning rates are.
max_train_epochs = 12
# How often you want to save the model. If you set this to 2, then every
# 2nd epoch will be saved. Set it to 4, then every 4th, and so on.
# Because 12 epochs is a very small number, and 2000 images per epoch is
# a really huge amount, I recommend saving every single epoch trained.
save_every_n_epochs = 1
# A batch of settings that control how fast and smoothly your training runs.
# Rule of thumb - go higher until your card gives up, then go back to the last working values.
# A larger batch size also gives a noticeably more coherent result.
# At a resolution of 640 on an RTX 3090 (24 GB VRAM), you can set vae_batch_size=4 and train_batch_size=8 with no problems.
vae_batch_size = 4
train_batch_size = 8
max_data_loader_n_workers = 8
persistent_data_loader_workers = true
# If your video card can't handle a larger train_batch_size, try gradient_checkpointing instead!
# It is like a 'poor man's train_batch_size'! It gives near-identical results (but costs training speed).
# In this example, I disabled it.
gradient_checkpointing = false
gradient_accumulation_steps = 1
# You might want to keep this active to achieve deeper darks and brighter lights.
noise_offset = 0.1
# Finally, the other arguments.
# These pretty much never change, so you don't really need to touch them.
[other_arguments]
# Global learning rate. The speed at which your network learns.
# You might want to tweak it until you find your 'sweet spot'.
# 0.0001 (1e-4) is a good place to start for a 512x512 image dataset.
learning_rate = 1e-4
# If for some reason you want to tweak the LRs of the UNet and the text encoder
# separately, you can do it here. It is recommended to keep
# the TE rate at half of the UNet rate.
unet_lr = 1e-4
text_encoder_lr = 5e-5
# Optimizer. Comes in multiple variations, like
# AdamW8bit or Lion8bit.
# Depending on the optimizer you may get vastly different results.
# In my experience, Lion requires your learning rate to be significantly lower,
# as it is quite an efficient algorithm. AdamW is fine for starting out.
optimizer_type = "AdamW8bit"
# How the learning rate should change over time.
# Most of the time, simply keeping it constant at the LR itself is
# the best choice for training.
lr_scheduler = "constant"
lr_warmup_steps = 0
# Seed. Usually better if it is set explicitly.
# It also makes your training reproducible.
seed = 6
# V-parameterization is usually reserved for SD 2.x 768 models,
# but it can be enabled if the model already uses it! Ignore the warnings.
v_parameterization = false
# Adjusts the noise schedule 'betas' so your pictures achieve
# deeper darks and brighter lights. However, it must be used together with v_parameterization.
zero_terminal_snr = false
# A couple of parameters that allow you
# to train only one part of the network and not the other.
# You don't need them, believe me.
network_train_unet_only = false
network_train_text_encoder_only = false
# Set to FALSE if you have a lot of fast RAM, like 32-64 GB of DDR4.
# It will give your video card more room to increase the batch size.
lowram = true
# This speeds up the process of working with latents
# and therefore increases training speed. Incompatible
# with flip_aug and color_aug, though.
cache_latents = true
# Always use XFormers to save on memory
xformers = true
# Allows you to train with longer captions.
# No need to limit ourselves to 75 tokens.
max_token_length = 225
# Keep metadata in the output files! Helpful, as it shows
# information inside AUTOMATIC1111 (no_metadata = false means metadata is saved).
no_metadata = false
# Always save as safetensors to keep your models from executing arbitrary code
save_model_as = "safetensors"
# Always save as FP16. It really saves space AND gives approximately the same results.
save_precision = "fp16"
# Train with mixed FP16 precision for the same reasons.
mixed_precision = "fp16"
# Set to TRUE only if you use SDv2 based model.
v2 = false
# Where to write TensorBoard-style logs. Allows you to
# see the progress of training.
logging_dir = "/sdssd/trainings/logs"
# Sometimes you need to add a prefix to your log files.
log_prefix = ""
# And a couple of simply recommended settings, without explanation.
max_grad_norm = 1.0
prior_loss_weight = 1.0
There are a couple of presets below. For more, check the Algo Details page of the LyCORIS repository. In short, here are some starting settings and advice.
# The one that gets better characters (globally recommended place to start)
network_module = "lycoris.kohya"
network_args = [ "conv_dim=8", "conv_alpha=1", "algo=lora" ]
network_dim = 16
network_alpha = 8
# The one that gets better styles (globally recommended place to start)
network_module = "lycoris.kohya"
network_args = [ "conv_dim=4", "conv_alpha=1", "algo=loha" ]
network_dim = 8
network_alpha = 4
# My personal starting point for the one that gets better styles
network_module = "lycoris.kohya"
network_args = [ "conv_dim=4", "conv_alpha=1", "algo=loha" ]
network_dim = 8
network_alpha = 4
learning_rate = 1e-5
unet_lr = 1e-5
text_encoder_lr = 5e-6
optimizer_type = "AdamW8bit"
max_train_epochs = 25
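Before launching, it can save you a failed run to confirm that both TOML files at least parse. A minimal sketch, assuming the toml package pulled in by requirements.txt is available in the virtual environment:
# Each command should print nothing if the file is valid TOML.
python -c "import toml; toml.load('/path/to/dataset_config.toml')"
python -c "import toml; toml.load('/path/to/training_config.toml')"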
Step 4: Running your training!
All you need is one command to rule them all. Personally, I prefer this one:
/opt/kohyass/.venv/bin/accelerate launch --num_cpu_threads_per_process 8 /opt/kohyass/train_network.py --dataset_config /path/to/dataset_config.toml --config_file /path/to/training_config.toml
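While the training runs, you can follow the loss curves with TensorBoard, assuming it is installed in the same virtual environment (pip install tensorboard) and pointed at the logging_dir from your config:
# Run in a second terminal, then open http://localhost:6006 in a browser.
/opt/kohyass/.venv/bin/tensorboard --logdir /sdssd/trainings/logs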