To address the time and cost constraints faced by anime illustrators and manga artists in creating customized characters, and to give anime fans a way to chat with a virtual anime companion, we developed this all-in-one anime character generator. The application lets users generate personalized anime characters from various input methods such as text, images, scribbles and lineart. Hugging Face packages, CivitAI model weights and a Gradio interface power the app, and AMD AI hardware accelerates the character generation process. Features such as a chatbot, video generation, clothes variation, pose variation, background variation and a chibi generator are implemented.
Instructions
Please follow the instructions in the README.md file at https://github.com/ShellCodingGH/AniPack (attached to this project) to start using this application.
Features
The project includes various features for manipulating and enhancing character images, videos and dialogue. Users can create a prototype character and then vary its clothing, poses, background, and more. One feature lets users enter a text prompt and generate a character image from that text. Another transforms portrait images into anime characters. Users can also upload scribbles or sketch directly on a sketchpad to create a character. Additionally, there is a feature to convert professional lineart into a colorized anime character.
Users can also chat with the virtual anime companion they created, giving it a customized personality such as "a lively boy" or "an elegant lady".
Furthermore, users can modify the clothing and accessories of the input image with a prompt or by painting over the outlines of the parts they wish to change. Videos of the character can be generated from prompt and image inputs, and backgrounds can be varied by entering a text prompt. Users can also generate a cute chibi version of their character for a more cartoonified look.
Project Planning
This section describes how the "AniPack" project was planned.
Brainstorming
To brainstorm ideas for this project, we looked at current gaps in the image generation industry and found a lack of software for generating consistent characters, especially in the Japanese anime style. Manga and anime artists in particular need software that can generate a consistent-looking anime character for their artworks, and many anime lovers are looking for opportunities to talk to an anime character with a personality of their liking, so we built software that benefits both artists and general anime fans.
To suit the needs of anime artists, we included a prototype generation procedure that accepts various forms of input such as text, images, scribbles and lineart. A variation procedure that can vary the generated prototype's pose, background, clothes and so on is included to suit different occasions according to the user's needs. The inpainting feature helps remove defects and achieve better consistency in the character's appearance. A short video generation feature supports the user's creative work. The chibi generation feature lets users create a cute avatar, which is common in manga and anime artwork. Finally, a background removal feature helps manga artists crop out the generated character and add a customized background.
Apart from the features above, to benefit general anime lovers we included a chatbot and a voice synthesizer. Users can customize a personality for the character they will be chatting with, then choose a voice characteristic (e.g. cute, elegant) and generate a voice for their anime companion, whose image is shown on the left-hand side of the chatbot tab. This makes the chatbot feature more realistic, as if the user were talking with a real anime character.
Model Selection
We searched for and tried various models on civitai.com and found that Anime God v1.0 is particularly good at the tasks we brainstormed for this project. Other models on the same website, such as Counterfeit-V2.0 and MeinaMix, were also tested, but their results seemed more limited for specific features such as pose variation, where dense dots appeared in the generated images more easily than with the chosen model.
We further searched for and tested a few LoRA and textual inversion models, and selected those with good results.
To use this app, the user first generates a character prototype from text, an image, scribbles or lineart. They can then change the clothes, poses and backgrounds, as well as generate a customized short video of their character. Lastly, they can create a chibi version of the generated character and remove the background of the generated anime images. A congratulatory audio message with an anime-style voice is also played to encourage the user after each generation.
Project Development
This section describes how this project was created.
Installing and Importing Libraries
First, install the necessary libraries. The list is provided in the attachment; install them with pip by copying and pasting the whole file into the terminal and pressing "Enter".
After installing the libraries, log in to Hugging Face with a token and import the libraries. To get a token, go to huggingface.co/settings/tokens, then create and copy a token:
HF_TOKEN = "your_huggingface_token"  # paste your Hugging Face access token here
import torch
import numpy as np
import cv2
import os
from moviepy.editor import *
import torchvision
from torchvision.io import write_video
import moviepy.editor as mp
from random import randrange
from transformers import AutoImageProcessor, SegformerForSemanticSegmentation
from stable_diffusion_reference import StableDiffusionReferencePipeline
import gradio as gr
from PIL import Image
from diffusers import StableDiffusionPipeline, StableDiffusionControlNetPipeline, ControlNetModel, DPMSolverMultistepScheduler, StableDiffusionImg2ImgPipeline, StableDiffusionControlNetInpaintPipeline, StableDiffusionControlNetImg2ImgPipeline, StableDiffusionInpaintPipeline
from controlnet_aux import HEDdetector, LineartDetector, OpenposeDetector, LineartAnimeDetector
from diffusers.utils import load_image
from diffusers import DiffusionPipeline
from diffusers import I2VGenXLPipeline
from diffusers.utils import export_to_gif, load_image
import concurrent.futures
from compel import Compel
import onnxruntime as rt
import imageio
import torchvision.transforms as transforms
import transformers
from transformers import pipeline, CLIPTextModel
from huggingface_hub import login
from PIL import ImageDraw
import huggingface_hub
import argparse
import json
import os
import re
import tempfile
import logging
logging.getLogger('numba').setLevel(logging.WARNING)
import librosa
from torch import no_grad, LongTensor
import commons
import utils
import gradio.utils as gr_utils
import gradio.processing_utils as gr_processing_utils
import ONNXVITS_infer
import models
from text import text_to_sequence, _clean_text
from text.symbols import symbols
from mel_processing import spectrogram_torch
import psutil
from datetime import datetime
Main Feature Implementation
This section shows how the major features are implemented, including text-to-image, image-to-image, scribble-to-image, lineart-to-image, the chatbot, the voice synthesizer, pose variation, inpainting, video generation, background variation and the chibi generator.
The model Anime God v1.0 on Civitai is used as a backbone model for all features. An exemplar usage is as follows:
pipe_txt2img = DiffusionPipeline.from_pretrained("username/repo_id", torch_dtype=torch.float16, token=HF_TOKEN)
To use a specific model, download its weights in safetensors form. Websites such as civitai.com have good models to search through.
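For example, a checkpoint downloaded as a single .safetensors file can be loaded directly. The sketch below is illustrative and assumes a recent diffusers version; the file name is a placeholder for whichever checkpoint you downloaded:
import torch
from diffusers import StableDiffusionPipeline

# load a locally downloaded checkpoint file (illustrative path)
pipe = StableDiffusionPipeline.from_single_file(
    "./animeGod_v10.safetensors", torch_dtype=torch.float16
)
pipe.to("cuda")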
LoRAs and Textual Inversions
On top of the base model Anime God v1.0, the model alone falls short in certain situations, such as generating good-looking eyes in long shots, so specifically designed LoRA weights were applied to improve image quality. Even with LoRA and the base model, the generated images can still contain frequent defects such as distorted hands, so several textual inversions were introduced to further refine these defects. An exemplar usage of LoRA and textual inversion is as follows:
pipe_txt2img.load_lora_weights("username/repo_id")
pipe_txt2img.fuse_lora(lora_scale=0.5)
pipe_txt2img.load_textual_inversion("username/repo_id")
To use a specific LoRA or textual inversion, download its weights in safetensors form. Websites such as civitai.com have good LoRAs and textual inversions to search through.
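Note that the negative textual inversions loaded above only take effect when their trigger tokens appear in the negative prompt, which is why the hidden negative prompt shown later includes tokens such as badhandv4 and EasyNegative. A minimal sketch (the token names are assumed to match the embeddings used in this project):
# reference the embeddings' trigger tokens in the negative prompt
negative_prompt = "badhandv4, EasyNegative, ng_deepnegative_v1_75t, lowres, bad anatomy"
image = pipe_txt2img(
    prompt="1girl, anime style, best quality",
    negative_prompt=negative_prompt,
    num_inference_steps=40,
).images[0]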
Strengthen/weaken Keywords of a Prompt - Compel
Compel allows users to strengthen or weaken the influence of particular keywords in the prompt with "+" (strengthen) and "-" (weaken). Note that excessive strengthening or weakening with many "+" or "-" characters (e.g. ++++++++ or --------) can produce defects in the generated images.
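An exemplar usage is as follows (a minimal sketch, assuming pipe_txt2img has been loaded as shown above):
# build a Compel processor from the pipeline's tokenizer and text encoder
compel_proc = Compel(tokenizer=pipe_txt2img.tokenizer, text_encoder=pipe_txt2img.text_encoder)
# "++" strengthens "sparkling eyes"; "--" weakens "blurry background"
prompt_embeds = compel_proc("1girl, sparkling eyes++, masterpiece")
negative_prompt_embeds = compel_proc("blurry background--, lowres")
image = pipe_txt2img(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_prompt_embeds,
    num_inference_steps=40,
).images[0]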
Character Prototype Generation
The full script for character prototype generation is as follows:
def text_to_anime(prompt, negative_prompt, height, width, num_images=4):
global pipe_txt2img
# load model
if (pipe_txt2img == None):
gr.Info("Loading model")
# load pipe
pipe_txt2img = DiffusionPipeline.from_pretrained(
"shellypeng/animever10-god-model",
torch_dtype=torch.float16, token=HF_TOKEN
)
# load LoRAs
pipe_txt2img.load_lora_weights("shellypeng/detail-tweaker")
pipe_txt2img.fuse_lora(lora_scale=0.5)
pipe_txt2img.load_lora_weights("shellypeng/lora-eyes")
pipe_txt2img.fuse_lora(lora_scale=0.3)
# load textual inversions
pipe_txt2img.load_textual_inversion("shellypeng/badhandv4")
pipe_txt2img.load_textual_inversion("shellypeng/bad-prompt")
pipe_txt2img.load_textual_inversion("./EasyNegative.pt")
pipe_txt2img.load_textual_inversion("shellypeng/deepnegative")
pipe_txt2img.load_textual_inversion("shellypeng/verybadimagenegative")
# load scheduler
pipe_txt2img.scheduler = DPMSolverMultistepScheduler.from_config(pipe_txt2img.scheduler.config, use_karras_sigmas=True)
# send to gpu
pipe_txt2img.to(device)
gr.Info("Finished loading model")
if height % 8:
raise gr.Error("Please input a height that is a multiple of 8 on the slider.")
if width % 8:
raise gr.Error("Please input a width that is a multiple of 8 on the slider.")
# prompt weighter to add weights to prompts
compel_proc = Compel(tokenizer=pipe_txt2img.tokenizer, text_encoder=pipe_txt2img.text_encoder)
# positive prompt with weights
prompt = prompt + hidden_booster_text
prompt_embeds = compel_proc(prompt)
# negative prompt with weights
negative_prompt = negative_prompt + hidden_negative
negative_prompt_embeds = compel_proc(negative_prompt)
# generate result image(s)
res = gen_batch_img(pipe_txt2img, height, width, prompt_embeds, negative_prompt_embeds, 40, num_images=num_images)
return res
def scribble_to_image(prompt, negative_prompt, input_img, height, width):
global pipe_scribble, hed, controlnet_scribble
# load model
if (pipe_scribble == None):
print_msg(LOAD_MODEL_OPTION)
# load HED processor for preprocessing of the ControlNet
hed = HEDdetector.from_pretrained('lllyasviel/Annotators')
controlnet_scribble = ControlNetModel.from_pretrained(
"lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16, use_safetensors=True)
# load pipe
pipe_scribble = StableDiffusionControlNetPipeline.from_pretrained("shellypeng/animever10-god-model", controlnet=controlnet_scribble, torch_dtype=torch.float16)
# load LoRAs
pipe_scribble.load_lora_weights("shellypeng/detail-tweaker")
pipe_scribble.fuse_lora(lora_scale=0.5)
pipe_scribble.load_lora_weights("shellypeng/lora-eyes")
pipe_scribble.fuse_lora(lora_scale=0.6)
# load textual inversions
pipe_scribble.load_textual_inversion("shellypeng/badhandv4")
pipe_scribble.load_textual_inversion("shellypeng/bad-prompt")
pipe_scribble.load_textual_inversion("shellypeng/deepnegative")
pipe_scribble.load_textual_inversion("shellypeng/verybadimagenegative")
pipe_scribble.load_textual_inversion("./EasyNegative.pt")
# load scheduler
pipe_scribble.scheduler = DPMSolverMultistepScheduler.from_config(pipe_scribble.scheduler.config, use_karras_sigmas=True)
# send to gpu
pipe_scribble.to(device)
print_msg(FINISH_MODEL_LOADING_OPTION)
if input_img is None:
raise gr.Error("Please provide an input image.")
if height % 8:
raise gr.Error("Please input a height that is a multiple of 8 on the slider.")
if width % 8:
raise gr.Error("Please input a width that is a multiple of 8 on the slider.")
# preprocessing input image
input_img = np.array(input_img)
input_img[input_img > 100] = 255
input_img = Image.fromarray(input_img)
input_img = hed(input_img, scribble=True)
input_img.save("hed_img.png")
input_img = load_image(input_img)
# prompt weighter to add weights to prompts
compel_proc = Compel(tokenizer=pipe_scribble.tokenizer, text_encoder=pipe_scribble.text_encoder)
# positive prompt with weights
prompt = prompt + hidden_booster_text
prompt_embeds = compel_proc(prompt)
# negative prompt with weights
negative_prompt = negative_prompt + hidden_negative
negative_prompt_embeds = compel_proc(negative_prompt)
# generating 4 result images
res = gen_batch_img(pipe_scribble, height, width, prompt_embeds, negative_prompt_embeds, 40, input_img, 4)
return res
def real_img2img_to_anime(prompt, negative_prompt, input_img, height, width):
global pipe_img2img
if input_img is None:
raise gr.Error("Please provide an input image.")
if height % 8:
raise gr.Error("Please input a height that is a multiple of 8 on the slider.")
if width % 8:
raise gr.Error("Please input a width that is a multiple of 8 on the slider.")
# preprocessing input image
input_img = load_image(input_img)
# load model
if (pipe_img2img == None):
print_msg(LOAD_MODEL_OPTION)
# load pipe
pipe_img2img = StableDiffusionImg2ImgPipeline.from_pretrained("shellypeng/animever10-god-model",
torch_dtype=torch.float16)
# load LoRAs
pipe_img2img.load_lora_weights("shellypeng/detail-tweaker")
pipe_img2img.fuse_lora(lora_scale=0.5)
pipe_img2img.load_lora_weights("shellypeng/lora-eyes")
pipe_img2img.fuse_lora(lora_scale=0.6)
# load textual inversions
pipe_img2img.load_textual_inversion("shellypeng/badhandv4")
pipe_img2img.load_textual_inversion("shellypeng/bad-prompt")
pipe_img2img.load_textual_inversion("shellypeng/deepnegative")
pipe_img2img.load_textual_inversion("shellypeng/verybadimagenegative")
pipe_img2img.load_textual_inversion("./EasyNegative.pt")
# load scheduler
pipe_img2img.scheduler = DPMSolverMultistepScheduler.from_config(pipe_img2img.scheduler.config, use_karras_sigmas=True)
# send to gpu
pipe_img2img.to(device)
print_msg(FINISH_MODEL_LOADING_OPTION)
# prompt weighter to add weights to prompts
compel_proc = Compel(tokenizer=pipe_img2img.tokenizer, text_encoder=pipe_img2img.text_encoder)
# positive prompt with weights
prompt = prompt + hidden_booster_text
prompt_embeds = compel_proc(prompt)
# negative prompt with weights
negative_prompt = negative_prompt + hidden_negative
negative_prompt_embeds = compel_proc(negative_prompt)
# generating 4 result images
res = gen_batch_img(pipe_img2img, height, width, prompt_embeds, negative_prompt_embeds, 40, input_img, 4)
return res
def lineart(prompt, negative_prompt, lineart_image):
global pipe_lineart, lineart_processor
if lineart_image is None:
raise gr.Error("Please provide a lineart image.")
# load model
if pipe_lineart == None:
print_msg(LOAD_MODEL_OPTION)
# load lineart processor for preprocessing of the ControlNet
lineart_processor = LineartAnimeDetector.from_pretrained("lllyasviel/Annotators")
controlnet_lineart = ControlNetModel.from_pretrained("lllyasviel/control_v11p_sd15s2_lineart_anime", torch_dtype=torch.float16, token=HF_TOKEN)
# load pipe
pipe_lineart = StableDiffusionControlNetPipeline.from_pretrained(
"shellypeng/animever10-god-model", controlnet=controlnet_lineart,
torch_dtype=torch.float16, token=HF_TOKEN
)
# load LoRAs
pipe_lineart.load_lora_weights("shellypeng/detail-tweaker")
pipe_lineart.fuse_lora(lora_scale=0.5)
pipe_lineart.load_lora_weights("shellypeng/lora-eyes")
pipe_lineart.fuse_lora(lora_scale=0.3)
# load textual inversions
pipe_lineart.load_textual_inversion("shellypeng/badhandv4")
pipe_lineart.load_textual_inversion("shellypeng/bad-prompt")
pipe_lineart.load_textual_inversion("shellypeng/deepnegative")
pipe_lineart.load_textual_inversion("shellypeng/verybadimagenegative")
pipe_lineart.load_textual_inversion("./EasyNegative.pt")
pipe_lineart.scheduler = DPMSolverMultistepScheduler.from_config(pipe_lineart.scheduler.config, use_karras_sigmas=True)
# send to gpu
pipe_lineart.to(device)
print_msg(FINISH_MODEL_LOADING_OPTION)
# preprocessing input image
lineart_image = load_image(lineart_image)
width, height = lineart_image.size
lineart_image = lineart_processor(lineart_image)
# prompt weighter to add weights to prompts
compel_proc = Compel(tokenizer=pipe_lineart.tokenizer, text_encoder=pipe_lineart.text_encoder)
# positive prompt with weights
prompt = prompt + hidden_booster_text
prompt_embeds = compel_proc(prompt)
# negative prompt with weights
negative_prompt = negative_prompt + hidden_negative
negative_prompt_embeds = compel_proc(negative_prompt)
# generating 4 result images
res = gen_batch_img(pipe_lineart, height, width, prompt_embeds, negative_prompt_embeds, 40, lineart_image, num_images=4)
return res
The gen_batch_img function, which generates a batch of images and retries a pipeline call (up to 10 times) whenever it returns a completely blank image, is as follows:
def gen_batch_img(pipe, height, width, prompt_embeds, negative_prompt_embeds, num_inference_steps, input_img=None, strength=1.0, control_image=None, mask_image=None, num_images=4):
res = []
for x in range(num_images):
i = 0
res_image = Image.fromarray(np.zeros((64, 64)))
while not res_image.getbbox() and i < 10:
if isinstance(pipe, StableDiffusionReferencePipeline):
res_image = pipe(ref_image=input_img,
prompt_embeds=prompt_embeds,
negative_prompt_embeds=negative_prompt_embeds,
num_inference_steps=num_inference_steps,
reference_attn=True,
reference_adain=True).images[0]
elif isinstance(pipe, StableDiffusionControlNetInpaintPipeline):
res_image = pipe(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_prompt_embeds, image=input_img, strength=strength, num_inference_steps=num_inference_steps, control_image=control_image, mask_image=mask_image).images[0]
elif input_img is not None:
res_image = pipe(height=height, width=width, image=input_img, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_prompt_embeds, num_inference_steps=40).images[0]
else:
res_image = pipe(height=height, width=width, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_prompt_embeds, num_inference_steps=num_inference_steps).images[0]
i+=1
res.append(res_image)
return res
Chatbot and Voice Synthesizer
The full script for the chatbot and voice synthesizer is as follows:
language_marks = {
"Japanese": "",
"日本語": "[JA]",
"简体中文": "[ZH]",
"English": "[EN]",
"Mix": "",
}
# limit text and audio length in huggingface spaces
limitation = os.getenv("SYSTEM") == "spaces"
# set gpu device
device = "cuda"
# some hidden text to enhance quality and reduce defects
hidden_booster_text = ", masterpiece-anatomy-perfect, dynamic, dynamic colors, bright colors, high contrast, excellent work, extremely elaborate picture description, 8k, obvious light and shadow effects, ray tracing, obvious layers, depth of field, best quality, RAW photo, best quality, highly detailed, intricate details, HD, 4k, 8k, high quality, beautiful eyes, sparkling eyes, beautiful face, masterpiece,best quality,ultimate details,highres,8k,wallpaper,extremely clear,"
hidden_negative = ", boobs++, internal-organs-outside-the-body, internal-organs-visible, anatomy-description, unprompted-nsfw ,worst-human-external-anatomy, worst-human-hands-anatomy, worst-human-fingers-anatomy, worst-detailed-eyes, worst-detailed-fingers, worst-human-feet-anatomy, worst-human-toes-anatomy, worst-detailed-feet, worst-detailed-toes, camera, smartphone, worst-facial-details, ugly-detailed-fingers, ugly-detailed-toes,fingers-in-worst-possible-shape, worst-detailed-eyes, undetailed-eyes, undetailed-fingers, undetailed-toes, boobs, big boobs, huge boobs, sexy, dirty, d cup, e cup, g cup, slutty, bad-image-v2-39000,badhandv4,ng_deepnegative_v1_75t,EasyNegative,bad_prompt_version2, worst quality, low quality, extra digits, text, signature, bad anatomy, mutated hand, error, missing finger, cropped, worse quality, bad quality, lowres, floating limbs, bad hands, anatomical nonsense, Distorted, discontinuous, Ugly, blurry, low resolution, motionless, static, disfigured, disconnected limbs, Ugly faces, incomplete arms"
hidden_vary_negative = ""
# login to huggingface for use of models
login(HF_TOKEN)
# list of congratulation voice messages
audio_list = ["congrats-audios/audio1.wav", "congrats-audios/audio2.wav", "congrats-audios/audio3.wav", "congrats-audios/audio4.wav",
"congrats-audios/audio5.wav", "congrats-audios/audio6.wav", "congrats-audios/audio7.wav", "congrats-audios/audio8.wav",
"congrats-audios/audio9.wav", "congrats-audios/audio10.wav", "congrats-audios/audio11.wav", "congrats-audios/audio12.wav",
]
# plays congratulation voice messages
def play_audio():
audio = audio_list[randrange(len(audio_list))]
return audio
def mult_thread_audio():
# conduct multi-threading
with concurrent.futures.ThreadPoolExecutor(max_workers=12000) as executor:
future = executor.submit(play_audio)
audio = future.result()
return audio
# print some messages on screen when certain features are triggered
def print_msg(option):
if option == "load model":
gr.Info("Loading model")
elif option == "finish model loading":
gr.Info("Finished loading model")
# options for input to print_msg function to decide what to print on screen
LOAD_MODEL_OPTION = "load model"
FINISH_MODEL_LOADING_OPTION = "finish model loading"
# number of models used in this app
num_models = 10
pipe_txt2img, pipe_img2img, pipe_scribble, pipe_lineart, pipe_pose, pipe_video, pipe_chibi, pipe_inpaint, pipe_chatbot, rmbg_model = [None for _ in range(num_models)]
models_tts = []
models_vc = []
models_info = [
{
"title": "Trilingual",
"languages": ['日本語', '简体中文', 'English', 'Mix'],
"description": """
This model is trained on a mix up of Umamusume, Genshin Impact, Sanoba Witch & VCTK voice data to learn multilanguage.
All characters can speak English, Chinese & Japanese.\n\n
To mix multiple languages in a single sentence, wrap the corresponding part with language tokens
([JA] for Japanese, [ZH] for Chinese, [EN] for English), as shown in the examples.\n\n
这个模型在赛马娘,原神,魔女的夜宴以及VCTK数据集上混合训练以学习多种语言。
所有角色均可说中日英三语。\n\n
若需要在同一个句子中混合多种语言,使用相应的语言标记包裹句子。
(日语用[JA], 中文用[ZH], 英文用[EN]),参考Examples中的示例。
""",
"model_path": "./pretrained_models/G_trilingual.pth",
"config_path": "./configs/uma_trilingual.json",
"examples": [['你好,训练员先生,很高兴见到你。', '草上飞 Grass Wonder (Umamusume Pretty Derby)', '简体中文', 1, False],
['To be honest, I have no idea what to say as examples.', '派蒙 Paimon (Genshin Impact)', 'English',
1, False],
['授業中に出しだら,学校生活終わるですわ。', '綾地 寧々 Ayachi Nene (Sanoba Witch)', '日本語', 1, False],
['[JA]こんにちわ。[JA][ZH]你好![ZH][EN]Hello![EN]', '綾地 寧々 Ayachi Nene (Sanoba Witch)', 'Mix', 1, False]],
"onnx_dir": "./ONNX_net/G_trilingual/"
},
{
"title": "Japanese",
"languages": ["Japanese"],
"description": """
This model contains 87 characters from Umamusume: Pretty Derby, Japanese only.\n\n
这个模型包含赛马娘的所有87名角色,只能合成日语。
""",
"model_path": "./pretrained_models/G_jp.pth",
"config_path": "./configs/uma87.json",
"examples": [['お疲れ様です,トレーナーさん。', '无声铃鹿 Silence Suzuka (Umamusume Pretty Derby)', 'Japanese', 1, False],
['張り切っていこう!', '北部玄驹 Kitasan Black (Umamusume Pretty Derby)', 'Japanese', 1, False],
['何でこんなに慣れでんのよ,私のほが先に好きだっだのに。', '草上飞 Grass Wonder (Umamusume Pretty Derby)', 'Japanese', 1, False],
['授業中に出しだら,学校生活終わるですわ。', '目白麦昆 Mejiro Mcqueen (Umamusume Pretty Derby)', 'Japanese', 1, False],
['お帰りなさい,お兄様!', '米浴 Rice Shower (Umamusume Pretty Derby)', 'Japanese', 1, False],
['私の処女をもらっでください!', '米浴 Rice Shower (Umamusume Pretty Derby)', 'Japanese', 1, False]],
"onnx_dir": "./ONNX_net/G_jp/"
},
]
# synthesize voice function
def create_tts_fn(model, hps, speaker_ids):
global tts_fn
def tts_fn(text, speaker, language, speed, is_symbol):
if text == None:
raise gr.Error("Please generate a chat response in the chatbot first to generate audio.")
if limitation:
text_len = len(re.sub("\[([A-Z]{2})\]", "", text))
max_len = 150
if is_symbol:
max_len *= 3
if text_len > max_len:
return "Error: Text is too long", None
if language is not None:
text = language_marks[language] + text + language_marks[language]
speaker_id = speaker_ids[speaker]
stn_tst = get_text(text, hps, is_symbol)
with no_grad():
x_tst = stn_tst.unsqueeze(0).to(device)
x_tst_lengths = LongTensor([stn_tst.size(0)]).to(device)
sid = LongTensor([speaker_id]).to(device)
audio = model.infer(x_tst, x_tst_lengths, sid=sid, noise_scale=.667, noise_scale_w=0.8,
length_scale=1.0 / speed)[0][0, 0].data.cpu().float().numpy()
del stn_tst, x_tst, x_tst_lengths, sid
return "Success", (hps.data.sampling_rate, audio)
return tts_fn
def create_vc_fn(model, hps, speaker_ids):
def vc_fn(original_speaker, target_speaker, input_audio):
if input_audio is None:
return "You need to upload an audio", None
sampling_rate, audio = input_audio
duration = audio.shape[0] / sampling_rate
if limitation and duration > 30:
return "Error: Audio is too long", None
original_speaker_id = speaker_ids[original_speaker]
target_speaker_id = speaker_ids[target_speaker]
audio = (audio / np.iinfo(audio.dtype).max).astype(np.float32)
if len(audio.shape) > 1:
audio = librosa.to_mono(audio.transpose(1, 0))
if sampling_rate != hps.data.sampling_rate:
audio = librosa.resample(audio, orig_sr=sampling_rate, target_sr=hps.data.sampling_rate)
with no_grad():
y = torch.FloatTensor(audio)
y = y.unsqueeze(0)
spec = spectrogram_torch(y, hps.data.filter_length,
hps.data.sampling_rate, hps.data.hop_length, hps.data.win_length,
center=False)
spec_lengths = LongTensor([spec.size(-1)])
sid_src = LongTensor([original_speaker_id])
sid_tgt = LongTensor([target_speaker_id])
audio = model.voice_conversion(spec, spec_lengths, sid_src=sid_src, sid_tgt=sid_tgt)[0][
0, 0].data.cpu().float().numpy()
del y, spec, spec_lengths, sid_src, sid_tgt
return (hps.data.sampling_rate, audio)
return vc_fn
def get_text(text, hps, is_symbol):
text_norm = text_to_sequence(text, hps.symbols, [] if is_symbol else hps.data.text_cleaners)
if hps.data.add_blank:
text_norm = commons.intersperse(text_norm, 0)
text_norm = LongTensor(text_norm)
return text_norm
def create_to_symbol_fn(hps):
def to_symbol_fn(is_symbol_input, input_text, temp_text):
return (_clean_text(input_text, hps.data.text_cleaners), input_text) if is_symbol_input \
else (temp_text, temp_text)
return to_symbol_fn
parser = argparse.ArgumentParser()
parser.add_argument("--share", action="store_true", default=False, help="share gradio app")
args = parser.parse_args()
for info in models_info:
name = info['title']
lang = info['languages']
examples = info['examples']
config_path = info['config_path']
model_path = info['model_path']
description = info['description']
onnx_dir = info["onnx_dir"]
hps = utils.get_hparams_from_file(config_path)
model = ONNXVITS_infer.SynthesizerTrn(
len(hps.symbols),
hps.data.filter_length // 2 + 1,
hps.train.segment_size // hps.data.hop_length,
n_speakers=hps.data.n_speakers,
ONNX_dir=onnx_dir,
**hps.model)
utils.load_checkpoint(model_path, model, None)
model.eval()
model.to(device)
speaker_ids = hps.speakers
speakers = list(hps.speakers.keys())
models_tts.append((name, description, speakers, lang, examples,
hps.symbols, create_tts_fn(model, hps, speaker_ids),
create_to_symbol_fn(hps)))
models_vc.append((name, description, speakers, create_vc_fn(model, hps, speaker_ids)))
# load chatbot pipe and send to gpu
pipe_chatbot = pipeline("text-generation", model="vicgalle/Roleplay-Llama-3-8B", device=device)
# chatbot function
def chatbot_infer(prompt, chat_history, role):
global pipe_chatbot
if pipe_chatbot == None:
pipe_chatbot = pipeline("text-generation", model="vicgalle/Roleplay-Llama-3-8B", device=device)
messages = [
{"role": role, "content": prompt},
]
response = pipe_chatbot(messages, max_new_tokens=250, do_sample=True)[0]["generated_text"][1]["content"]
return response
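The personality text entered in the UI is passed to chatbot_infer as the role field of the chat message. An illustrative call (a sketch, assuming the chatbot pipeline above has finished loading) might look like this:
# the third argument is the companion's personality entered in the "Personality" box
reply = chatbot_infer("Good morning! What are you up to today?", [], "a lively boy")
print(reply)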
# remove action text (wrapped in asterisks) from the generated chat response before generating audio
def remove_between_asterisks(text):
while True:
start_index = text.find('*')
if start_index == -1:
break
end_index = text.find('*', start_index + 1)
if end_index == -1:
break
text = text[:start_index] + text[end_index + 1:]
return text
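For illustration, roleplay actions wrapped in asterisks are stripped before the remaining text is sent to the voice synthesizer (a small hypothetical example):
print(remove_between_asterisks("*waves happily* Hello, trainer!"))  # -> " Hello, trainer!"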
Character Variation (Pose Variation)
This part is crucial to consistent-looking character generation. We considered techniques like DreamBooth and inpainting, but the former requires training by the user, which is a rather complex process and takes a long time before the user can run inference, and the latter can be tedious because the user needs to paint over specific parts of the image. We therefore used the Stable Diffusion Reference community pipeline, which removes the training required by techniques like DreamBooth and quickly generates a consistent-looking image without extensive input from the user such as painting. This method only requires an input image and a text prompt.
To accomplish this feature, we first put the script for the Stable Diffusion Reference community pipeline into the workspace folder (provided in the attachment), then import it as a library; this import is included in the Installing and Importing Libraries section.
The full script for character variation is as follows:
def pose_to_anime(prompt, negative_prompt, image):
global pipe_pose
if image is None:
raise gr.Error("Please provide an input image.")
# load image
image = load_image(image)
# load model
if (pipe_pose == None):
print_msg(LOAD_MODEL_OPTION)
# load pipe
pipe_pose = StableDiffusionReferencePipeline.from_pretrained(
"shellypeng/animever10-god-model",
torch_dtype=torch.float16, token=HF_TOKEN
)
# load LoRAs
pipe_pose.load_lora_weights("shellypeng/detail-tweaker")
pipe_pose.fuse_lora(lora_scale=0.5)
pipe_pose.load_lora_weights("shellypeng/lora-eyes")
pipe_pose.fuse_lora(lora_scale=0.6)
# load textual inversions
pipe_pose.load_textual_inversion("shellypeng/badhandv4")
pipe_pose.load_textual_inversion("shellypeng/bad-prompt")
pipe_pose.load_textual_inversion("shellypeng/deepnegative")
pipe_pose.load_textual_inversion("shellypeng/verybadimagenegative")
pipe_pose.load_textual_inversion("./EasyNegative.pt")
pipe_pose.scheduler = DPMSolverMultistepScheduler.from_config(pipe_pose.scheduler.config, use_karras_sigmas=True)
# send to gpu
pipe_pose.to(device)
print_msg(FINISH_MODEL_LOADING_OPTION)
# prompt weighter to add weights to prompts
compel_proc = Compel(tokenizer=pipe_pose.tokenizer, text_encoder=pipe_pose.text_encoder)
# positive prompt with weights
prompt = prompt + hidden_booster_text
prompt_embeds = compel_proc(prompt)
# negative prompt with weights
negative_prompt = negative_prompt + hidden_vary_negative +hidden_negative
negative_prompt_embeds = compel_proc(negative_prompt)
# generating 4 result images
res = gen_batch_img(pipe_pose, height, width, prompt_embeds, negative_prompt_embeds, 40, image, 4)
return res
Inpainting
The full script for inpainting is as follows:
# preprocess for inpainting
def make_inpaint_condition(image, image_mask):
image = np.array(image.convert("RGB")).astype(np.float32) / 255.0
image_mask = np.array(image_mask.convert("L")).astype(np.float32) / 255.0
assert image.shape[0:1] == image_mask.shape[0:1], "image and image_mask must have the same image size"
image[image_mask > 0.5] = -1.0 # set as masked pixel
image = np.expand_dims(image, 0).transpose(0, 3, 1, 2)
image = torch.from_numpy(image)
return image
# fills the inpainted outline
def fill_mask(mask_image):
# convert mask to numpy array
mask_image = np.array(mask_image)
# obtain contours of enclosed shapes
contours, _ = cv2.findContours(mask_image, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
# fill the enclosed contours with white color
cv2.drawContours(mask_image, contours, -1, color=(255, 255, 255), thickness=cv2.FILLED)
mask_image = Image.fromarray(mask_image)
return mask_image
def inpaint(prompt, negative_prompt, image, btn):
global pipe_inpaint
# load model
if (pipe_inpaint == None):
print_msg(LOAD_MODEL_OPTION)
# load ControlNet for inpainting
controlnet_inpaint = ControlNetModel.from_pretrained(
"lllyasviel/control_v11p_sd15_inpaint", torch_dtype=torch.float16, variant="fp16")
# load pipe
pipe_inpaint = StableDiffusionControlNetInpaintPipeline.from_pretrained(
"shellypeng/animever10-god-model", controlnet=controlnet_inpaint, torch_dtype=torch.float16
)
# load LoRAs
pipe_inpaint.load_lora_weights("shellypeng/detail-tweaker")
pipe_inpaint.fuse_lora(lora_scale=0.5)
pipe_inpaint.load_lora_weights("shellypeng/lora-eyes")
pipe_inpaint.fuse_lora(lora_scale=0.6)
# load textual inversions
pipe_inpaint.load_textual_inversion("shellypeng/badhandv4")
pipe_inpaint.load_textual_inversion("shellypeng/bad-prompt")
pipe_inpaint.load_textual_inversion("shellypeng/deepnegative")
pipe_inpaint.load_textual_inversion("shellypeng/verybadimagenegative")
pipe_inpaint.load_textual_inversion("./EasyNegative.pt")
pipe_inpaint.scheduler = DPMSolverMultistepScheduler.from_config(pipe_inpaint.scheduler.config, use_karras_sigmas=True)
# send to gpu
pipe_inpaint.to(device)
print_msg(FINISH_MODEL_LOADING_OPTION)
compel_proc = Compel(tokenizer=pipe_inpaint.tokenizer, text_encoder=pipe_inpaint.text_encoder)
# positive prompt with weights
prompt = prompt + hidden_booster_text
prompt_embeds = compel_proc(prompt)
# negative prompt with weights
negative_prompt = negative_prompt + hidden_negative
negative_prompt_embeds = compel_proc(negative_prompt)
# set strength - how much the generated images may deviate from the reference image
strength = 0.9
if image is None:
raise gr.Error("Please provide an input image.")
# load the mask layer of input image
mask_image = load_image(image["layers"][0])
# convert to numpy for preprocessing
mask_image = np.array(mask_image)
# convert user's mask outline to white for ControlNet Inpaint Pipeline to recognize(it can only recognize monochrome images)
mask_image[np.all(mask_image == [93, 63, 211], axis=-1)] = [255, 255, 255]
mask_image = Image.fromarray(mask_image).convert('L')
# fill the enclosed outline the user has drawn
mask_image = fill_mask(mask_image)
image = load_image(image["background"])
image_shape = image.size
mask_image = mask_image.resize(image_shape)
# make the control image
control_image = make_inpaint_condition(image, mask_image)
# generating 4 result images
res = gen_batch_img(pipe_inpaint, height, width, prompt_embeds, negative_prompt_embeds, 40, image, strength, control_image, mask_image, 4)
return res
Video Generation
For video generation, we used the I2VGenXL pipeline from Hugging Face, feeding it character images generated with the Anime God v1.0 model.
The full script for video generation is as follows:
def I2VGen_video(prompt, negative_prompt, height, width, num_inference_steps, num_frames, video_image, fps=30):
global pipe_video
if video_image is None:
raise gr.Error("Please provide an input image.")
if height % 8:
raise gr.Error("Please input a height that is a multiple of 8 on the slider.")
if width % 8:
raise gr.Error("Please input a width that is a multiple of 8 on the slider.")
# get attributes - height, size, width - from input image
size = video_image.size
height = video_image.size[1]
width = video_image.size[0]
# load model
if (pipe_video == None):
print_msg(LOAD_MODEL_OPTION)
# load pipe
pipe_video = I2VGenXLPipeline.from_pretrained("ali-vilab/i2vgen-xl", torch_dtype=torch.float16)
pipe_video.to(device)
print_msg(FINISH_MODEL_LOADING_OPTION)
# generate result video
output = pipe_video(
prompt=prompt+hidden_booster_text,
negative_prompt=negative_prompt+hidden_negative,
image=video_image,
target_fps=fps,
num_frames=num_frames,
num_inference_steps=num_inference_steps,
height=size[1],
width=size[0]
).frames[0]
return output
Chibi Generation
We used the Chibi-Artstyle LoRA available on civitai.com to accomplish this feature. Increasing the LoRA weight and using Compel to emphasize the word chibi (e.g. chibi++) strengthens the chibi effect in the generated images.
The full script for chibi generation is as follows:
def chibi(prompt, negative_prompt, image, height, width):
global pipe_chibi
if image is None:
raise gr.Error("Please provide an input image.")
if height % 8:
raise gr.Error("Please input a height that is a multiple of 8 on the slider.")
if width % 8:
raise gr.Error("Please input a width that is a multiple of 8 on the slider.")
# preprocessing input image
image = load_image(image)
# load model
if pipe_chibi == None:
print_msg(LOAD_MODEL_OPTION)
# load pipe
pipe_chibi = StableDiffusionImg2ImgPipeline.from_pretrained("shellypeng/animever10-god-model",
torch_dtype=torch.float16, token=HF_TOKEN)
# load LoRAs
pipe_chibi.load_lora_weights("shellypeng/detail-tweaker")
pipe_chibi.fuse_lora(lora_scale=0.5)
pipe_chibi.load_lora_weights("shellypeng/chibi-artstyle")
pipe_chibi.fuse_lora(lora_scale=2.0)
# load textual inversions
pipe_chibi.load_textual_inversion("shellypeng/badhandv4")
pipe_chibi.load_textual_inversion("shellypeng/bad-prompt")
pipe_chibi.load_textual_inversion("shellypeng/deepnegative")
pipe_chibi.load_textual_inversion("shellypeng/verybadimagenegative")
pipe_chibi.load_textual_inversion("./EasyNegative.pt")
pipe_chibi.scheduler = DPMSolverMultistepScheduler.from_config(pipe_chibi.scheduler.config, use_karras_sigmas=True)
# send to gpu
pipe_chibi.to(device)
print_msg(FINISH_MODEL_LOADING_OPTION)
# prompt weighter to add weights to prompts
compel_proc = Compel(tokenizer=pipe_chibi.tokenizer, text_encoder=pipe_chibi.text_encoder)
# positive prompt with weights
prompt = "chibi+++" + prompt + hidden_booster_text
prompt_embeds = compel_proc(prompt)
# negative prompt with weights
negative_prompt = negative_prompt + hidden_negative
negative_prompt_embeds = compel_proc(negative_prompt)
# generating 4 result images
res = gen_batch_img(pipe_chibi, height, width, prompt_embeds, negative_prompt_embeds, 40, image, strength=0.5, num_images=4)
return res
Background Removal
For the background removal feature, we used a model that segments anime characters out of backgrounds.
The full script for background removal is as follows:
# get a mask for background removal
def get_mask(img, s=1024):
img = (img / 255).astype(np.float32)
h, w = h0, w0 = img.shape[:-1]
h, w = (s, int(s * w / h)) if h > w else (int(s * h / w), s)
ph, pw = s - h, s - w
img_input = np.zeros([s, s, 3], dtype=np.float32)
img_input[ph // 2:ph // 2 + h, pw // 2:pw // 2 + w] = cv2.resize(img, (w, h))
img_input = np.transpose(img_input, (2, 0, 1))
img_input = img_input[np.newaxis, :]
mask = rmbg_model.run(None, {'img': img_input})[0][0]
mask = np.transpose(mask, (1, 2, 0))
mask = mask[ph // 2:ph // 2 + h, pw // 2:pw // 2 + w]
mask = cv2.resize(mask, (w0, h0))[:, :, np.newaxis]
return mask
def rmbg_fn(img):
global rmbg_model
if img is None:
raise gr.Error("Please provide an input image.")
# load model
if rmbg_model == None:
print_msg(LOAD_MODEL_OPTION)
providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']
model_path = huggingface_hub.hf_hub_download("skytnt/anime-seg", "isnetis.onnx")
rmbg_model = rt.InferenceSession(model_path, providers=providers)
print_msg(FINISH_MODEL_LOADING_OPTION)
# remove background
img = np.array(img)
mask = get_mask(img)
img = (mask * img + 255 * (1 - mask)).astype(np.uint8)
mask = (mask * 255).astype(np.uint8)
img = np.concatenate([img, mask], axis=2, dtype=np.uint8)
mask = mask.repeat(3, axis=2)
img = Image.fromarray(img)
return img
Prompt Engineering
Some positive and negative prompts that fix frequently appearing defects in generated images are used. Upon research, the negative prompts collected in the hidden_negative string shown earlier have a good effect on correcting hand defects, facial defects and other issues.
Generation Speed Optimization
Multi-threading was used to increase generation speed up to four-fold. During research we found that other techniques, such as DeepSpeed, using a specialized model like SDXL Turbo, reducing data transfer between CPU and GPU (e.g. avoiding saving an image on the CPU and then running inference on the GPU), or using faster GPU hardware, can also increase generation speed. For this project, multi-threading was mainly used. An exemplar usage is as follows:
def mult_thread_example(function, input1, input2):
# conduct multi-threading: run the given function (a placeholder here) in a worker thread
with concurrent.futures.ThreadPoolExecutor(max_workers=12000) as executor:
future = executor.submit(function, input1, input2)
result_image = future.result()
return result_image
The full script for multi-threading is as follows:
def mult_thread_chatbot_infer(prompt, chat_history, role):
# conduct multi-threading
with concurrent.futures.ThreadPoolExecutor(max_workers=12000) as executor:
future = executor.submit(chatbot_infer, prompt, chat_history, role)
response = future.result()
return response
def mult_thread_voice_infer(chatbot, speaker, language="English", speed=1.0, is_symbol=False):
# conduct multi-threading
with concurrent.futures.ThreadPoolExecutor(max_workers=12000) as executor:
tts_fn = models_tts[0][6]
text = chatbot[-1][-1]
text = remove_between_asterisks(text)
future = executor.submit(tts_fn, text, speaker, language, speed, is_symbol)
_, audio = future.result()
return audio
def mult_thread_txt2img(prompt_box, negative_prompt_box, height, width, num_image=4):
# conduct multi-threading
with concurrent.futures.ThreadPoolExecutor(max_workers=12000) as executor:
future = executor.submit(text_to_anime, prompt_box, negative_prompt_box, height, width, num_image)
# case for input image for video generation
if(num_image == 1):
image1 = future.result()
return image1
# case for character prototype generation
elif num_image == 4:
image1, image2, image3, image4 = future.result()
return image1, image2, image3, image4
def mult_thread_scribble(prompt_box, negative_prompt_box, image_box, height, width):
# conduct multi-threading
with concurrent.futures.ThreadPoolExecutor(max_workers=12000) as executor:
future = executor.submit(scribble_to_image, prompt_box, negative_prompt_box, image_box, height=height, width=width)
image1, image2, image3, image4 = future.result()
return [image1, image2, image3, image4]
def mult_thread_live_scribble(prompt_box, negative_prompt_box, image_box, height, width):
# get the scribbled layer of the input image
image_box = image_box["composite"]
# conduct multi-threading
with concurrent.futures.ThreadPoolExecutor(max_workers=12000) as executor:
future = executor.submit(scribble_to_image, prompt_box, negative_prompt_box, image_box, height=height, width=width)
image1, image2, image3, image4 = future.result()
return [image1, image2, image3, image4]
def mult_thread_img2img(prompt_box, negative_prompt_box, image_box, height, width):
# conduct multi-threading
with concurrent.futures.ThreadPoolExecutor(max_workers=12000) as executor:
future = executor.submit(real_img2img_to_anime, prompt_box, negative_prompt_box, image_box, height=height, width=width)
image1, image2, image3, image4 = future.result()
return [image1, image2, image3, image4]
def mult_thread_pose2img(prompt_box, negative_prompt_box, image_box):
# conduct multi-threading
with concurrent.futures.ThreadPoolExecutor(max_workers=12000) as executor:
future = executor.submit(pose_to_anime, prompt_box, negative_prompt_box, image_box)
image1, image2, image3, image4 = future.result()
return [image1, image2, image3, image4]
def mult_thread_inpaint(prompt_box, negative_prompt_box, image, btn):
with concurrent.futures.ThreadPoolExecutor(max_workers=12000) as executor:
future = executor.submit(inpaint, prompt_box, negative_prompt_box, image, btn)
image1, image2, image3, image4 = future.result()
return [image1, image2, image3, image4]
def mult_thread_chibi(prompt_box, negative_prompt_box, image_box, height, width):
# conduct multi-threading
with concurrent.futures.ThreadPoolExecutor(max_workers=12000) as executor:
future = executor.submit(chibi, prompt_box, negative_prompt_box, image_box, height=height, width=width)
image1, image2, image3, image4 = future.result()
return [image1, image2, image3, image4]
def mult_thread_lineart(prompt_box, negative_prompt_box, image_box):
# conduct multi-threading
with concurrent.futures.ThreadPoolExecutor(max_workers=12000) as executor:
future = executor.submit(lineart, prompt_box, negative_prompt_box, image_box)
image1, image2, image3, image4 = future.result()
return [image1, image2, image3, image4]
def mult_thread_remove_bg(image_box):
# conduct multi-threading
with concurrent.futures.ThreadPoolExecutor(max_workers=12000) as executor:
future = executor.submit(rmbg_fn, image_box)
image = future.result()
return image
def mult_thread_I2VGen_video(prompt_box, negative_prompt_box, height, width, num_inference_steps, num_frames, video_image, fps):
# conduct multi-threading
with concurrent.futures.ThreadPoolExecutor(max_workers=12000) as executor:
future = executor.submit(I2VGen_video, prompt_box, negative_prompt_box, height, width, num_inference_steps, num_frames, video_image, fps)
video1 = future.result()
# save video for preview on Gradio app interface
export_to_gif(video1, "I2VGen_video1.gif")
clip = mp.VideoFileClip("I2VGen_video1.gif")
out_file_name = "out.mp4"
clip.write_videofile(out_file_name)
return out_file_name
User Interface - Gradio
The Gradio package is used to build the UI. It integrates seamlessly with Hugging Face models, turning the inference process into a visual interface. The full script for the UI is as follows:
# transport selected image in gallery to other tabs in app
def get_select_image(prompt, negative_prompt, evt: gr.SelectData):
return evt.value["image"]["path"], prompt, negative_prompt
# Build the app UI interface
theme = gr.themes.Soft()
with gr.Blocks(theme=theme, css="""footer {visibility: hidden}""", title="AniPack") as iface:
# audio player for playing congratulation voice messages
audio = gr.Audio(autoplay=True, visible=False)
# tab for creating prototype
with gr.Tab("Create Prototype"):
gr.Markdown(
"""
# <b>AniPack
Welcome to AniPack – your personalized anime companion creator - with dialogue, images, and videos.</b>
AniPack is a collection of tools for creating a personalized anime character as your friendly companion, which enables chatting, image generation and video instantiation. Switch to the corresponding tabs to explore.
Tip: Click on a generated image to conveniently send it to the next step; use ++ and -- to emphasize or weaken a prompt word and make it more or less influential in the generation.
Note: First generations of each tab may be slow. Subsequent generations will be faster.
"""
)
gr.Markdown(
"""
# Build your companion's prototype
"""
)
# get inputs for creating character prototype
with gr.Row():
prompt_box = gr.Textbox(label="Prompt", placeholder="Enter a prompt", lines=3)
height = gr.Slider(512, 1024, label="Height", step=8, value=800)
with gr.Row():
neg_prompt_box = gr.Textbox(label="Negative Prompt", placeholder="Enter a negative prompt(things you don't want to include in the generated image)", lines=3)
width = gr.Slider(512, 1024, label="Width", step=8, value=640)
txt2img_gen_btn = gr.Button(value="Generate With Text")
# sub-tab for image-to-image generation
with gr.Tab("Upload Image"):
# get inputs
img2img_image_box = gr.Image(label="Input Image", height=500, type='pil')
img2img_gen_btn = gr.Button(value="Generate With Image")
# sub-tab for scribble-to-image generation
with gr.Tab("Upload Scribble Image"):
gr.Markdown(
"""
Please upload an image drawn on a digital board (e.g. laptop, drawing pad) with black scribbles on a white background, using only black and white.
PS: Scribbles with a brush size of 2px come out with the best effect.
"""
)
scribble_bg = Image.open("sketch_bg.png")
# get inputs
scribble2img_image_box = gr.Image(value=scribble_bg, label="Input Image", height=500, type='pil')
scribble2img_gen_btn = gr.Button(value="Generate With Scribbles")
# sub-tab for live scribble-to-image generation
with gr.Tab("Draw Scribbles"):
# get inputs
live_scribble2img_image_box = gr.Sketchpad(value=scribble_bg, label="Draw Scribbles", type='pil', brush=gr.Brush(default_size="2", color_mode="fixed", colors=["#000000"]))
live_scribble2img_gen_btn = gr.Button(value="Generate With Scribbles")
# sub-tab for lineart-to-image generation
with gr.Tab("Colorize lineart"):
gr.Markdown(
"""
Color your anime lineart with text prompts.
"""
)
# get inputs
lineart_image_box = gr.Image(label="Lineart Image", height=512, type='pil')
lineart_gen_btn = gr.Button(value="Colorize Lineart")
# gallery to show generated images
gallery = gr.Gallery(
label="Generated images", show_label=True, elem_id="gallery"
, columns=[4], rows=[1], object_fit="contain", height=512, allow_preview=True)
# handle on-click events - generate images
txt2img_gen_btn.click(mult_thread_audio, [], [audio])
txt2img_gen_btn.click(fn=mult_thread_txt2img, inputs=[prompt_box, neg_prompt_box, height, width], outputs=[gallery])
img2img_gen_btn.click(mult_thread_audio, [], [audio])
img2img_gen_btn.click(fn=mult_thread_img2img, inputs=[prompt_box, neg_prompt_box, img2img_image_box, height, width], outputs=[gallery])
scribble2img_gen_btn.click(mult_thread_audio, [], [audio])
scribble2img_gen_btn.click(fn=mult_thread_scribble, inputs=[prompt_box, neg_prompt_box, scribble2img_image_box, height, width], outputs=[gallery])
live_scribble2img_gen_btn.click(mult_thread_audio, [], [audio])
live_scribble2img_gen_btn.click(fn=mult_thread_live_scribble, inputs=[prompt_box, neg_prompt_box, live_scribble2img_image_box, height, width], outputs=[gallery])
lineart_gen_btn.click(mult_thread_audio, [], [audio])
lineart_gen_btn.click(fn=mult_thread_lineart, inputs=[prompt_box, neg_prompt_box, lineart_image_box], outputs=[gallery])
# tab for chatbot
with gr.Tab("Chat"):
gr.Markdown(
"""
# Chat with a personalized anime companion with a personality and look of your choice.
In "Personality", enter the companion's characteristics, e.g. a tsundere girl, a lively boy, an elegant lady. Then upload a preferred image of your companion and can then start chatting!
"""
)
# get inputs for chatbot
with gr.Row(equal_height=True):
chatbot_input_img = gr.Image(label="Companion Image", interactive=True, type='pil')
with gr.Column():
placeholder = gr.Textbox(visible=False)
chatbot_personality_box = gr.Textbox(label="Personality", placeholder="Enter the personality of your companion", lines=1)
char_dropdown = gr.Dropdown(choices=speakers, value=speakers[0], label="Select Companion's Voice")
chatbot_ = gr.Chatbot(height=350, render=False)
chatbot = gr.ChatInterface(
chatbot_infer,
chatbot=chatbot_,
additional_inputs=[chatbot_personality_box]
)
with gr.Row(equal_height=True):
audio_output = gr.Audio(label="Output Audio", autoplay=True)
audio_gen_btn = gr.Button("Generate Audio")
audio_gen_btn.click(mult_thread_voice_infer,
inputs=[chatbot_, char_dropdown],
outputs=[audio_output])
# transport selected image from prototype creation to chatbot tab
gallery.select(get_select_image, [prompt_box, neg_prompt_box], [chatbot_input_img, placeholder, placeholder])
# tab for pose variation / consistent looking character with variation
with gr.Tab("Vary Poses"):
gr.Markdown(
"""
# Vary your character's poses
The best results come when your prompt closely matches the one that generated the uploaded image.
"""
)
# get inputs for variation creation
with gr.Column():
pose_prompt_box = gr.Textbox(label="Prompt", placeholder="Enter a prompt", lines=3, scale=1)
pose_neg_prompt_box = gr.Textbox(label="Negative Prompt", placeholder="Enter a negative prompt(things you don't want to include in the generated image)", lines=3, scale=1)
pose_input_img = gr.Image(label="Current Image", height=350, type='pil')
pose_gen_btn = gr.Button(value="Vary Pose")
# gallery to show generated images
pose_gallery = gr.Gallery(
label="Generated images", show_label=True, elem_id="gallery"
, columns=[4], rows=[1], object_fit="contain", height=512, allow_preview=True)
# handle on-click events - generate pose
pose_gen_btn.click(mult_thread_audio, [], [audio])
pose_gen_btn.click(fn=mult_thread_pose2img, inputs=[pose_prompt_box, pose_neg_prompt_box, pose_input_img], outputs=[pose_gallery])
# transport selected image from prototype creation to pose variation tab
gallery.select(get_select_image, [prompt_box, neg_prompt_box], [pose_input_img, pose_prompt_box, pose_neg_prompt_box])
# tab for inpainting
with gr.Tab("Inpainting"):
gr.Markdown(
"""
# (Optional) Refine your companion with inpainting
Paint the outlines of the area that you wish to modify. This keeps other parts of the image perfectly consistent.
"""
)
# get inputs for inpainting
inpaint_prompt_box = gr.Textbox(label="Prompt", placeholder="Enter what you wish to replace the inpainted area with", lines=3)
inpaint_neg_prompt_box = gr.Textbox(label="Negative Prompt", placeholder="Enter a negative prompt(things you don't want to include in the generated image)", lines=3)
inpaint_image_box = gr.ImageEditor(interactive=True, type="pil", height=500, brush=gr.Brush(default_size="10"))
# transport selected image from prototype creation to inpainting tab
gallery.select(get_select_image, [prompt_box, neg_prompt_box], [inpaint_image_box, inpaint_prompt_box, inpaint_neg_prompt_box])
inpaint_btn = gr.Button(value="Inpaint")
# gallery to show generated images
inpaint_gallery = gr.Gallery(
label="Inpainted Images", show_label=True, elem_id="gallery"
, columns=[4], rows=[1], object_fit="contain", height=512)
# handle on-click events - generate images
inpaint_btn.click(mult_thread_audio, [], [audio])
inpaint_btn.click(fn=mult_thread_inpaint, inputs=[inpaint_prompt_box, inpaint_neg_prompt_box, inpaint_image_box], outputs=[inpaint_gallery])
# tab for video generation
with gr.Tab("Generate Video"):
gr.Markdown(
"""
# Generate Video
Create a video of your companion.
Larger fps = smaller motion and smoother video.
Lower fps = larger motion and more intermittent video.
"""
)
# get inputs for video generation
with gr.Row():
vid_prompt_box = gr.Textbox(label="Prompt", placeholder="Enter a prompt", lines=3)
height = gr.Slider(512, 1024, label="Height", step=8, value=880)
with gr.Row():
vid_neg_prompt_box = gr.Textbox(label="Negative Prompt", placeholder="Enter a negative prompt(things you don't want to include in the generated image)", lines=3)
width = gr.Slider(512, 1024, label="Width", step=8, value=640)
# video_image_box = gr.Image(label="Reference Image", type='pil', height=512)
video_steps = gr.Slider(1, 500, label="Inference Steps", value=100)
with gr.Row():
num_frames = gr.Slider(10, 40, label="Number of Frames", value=10)
fps_slider = gr.Slider(20, 50, label="FPS", value=40)
video_gen_btn = gr.Button(value="Generate Video")
with gr.Row():
video_image = gr.Image(label="Video Image", type='pil')
video1_box = gr.Video(label="Video", height=512)
# handle on-click events - generate videos
video_gen_btn.click(mult_thread_audio, [], [audio])
video_gen_btn.click(fn=mult_thread_I2VGen_video, inputs=[vid_prompt_box, vid_neg_prompt_box, height, width, video_steps, num_frames, video_image, fps_slider],
outputs=[video1_box])
# transport selected image from pose variation to video generation tab
pose_gallery.select(get_select_image, [pose_prompt_box, pose_neg_prompt_box], [video_image, vid_prompt_box, vid_neg_prompt_box]) # try replace None by gallery
# tab for chibi generation
with gr.Tab("Generate Chibi"):
gr.Markdown(
"""
# Make a cute Chibi for your companion.
"""
)
# get inputs for chibi generation
avatar_prompt_box = gr.Textbox(label="Prompt", placeholder="Enter a prompt", lines=3)
avatar_height = gr.Slider(512, 1024, label="Height", step=8, visible=False)
avatar_neg_prompt_box = gr.Textbox(label="Negative Prompt", placeholder="Enter a negative prompt(things you don't want to include in the generated image)", lines=3)
avatar_width = gr.Slider(512, 1024, label="Width", step=8, visible=False)
with gr.Row():
avatar_ref_image = gr.Image(label="Reference Image", height=512, type='pil')
# gallery to show generated images
avatar_gallery = gr.Gallery(
label="Generated images", show_label=True, elem_id="gallery"
, columns=[4], rows=[1], object_fit="contain", height=512, allow_preview=True)
avatar_gen_btn = gr.Button(value="Generate Chibi")
# handle on-click events - generate images
avatar_gen_btn.click(mult_thread_audio, [], [audio])
avatar_gen_btn.click(fn=mult_thread_chibi, inputs=[avatar_prompt_box, avatar_neg_prompt_box, avatar_ref_image, avatar_height, avatar_width], outputs=[avatar_gallery])
pose_gallery.select(get_select_image, [pose_prompt_box, pose_neg_prompt_box], [avatar_ref_image, avatar_prompt_box, avatar_neg_prompt_box]) # try replace None by gallery
# tab for background removal
with gr.Tab("Remove Background"):
gr.Markdown(
"""
# Remove background for your companion.
"""
)
# get inputs for background removal
with gr.Row():
anime_char_image = gr.Image(label="Input Image", height=512, type='pil')
anime_char_remove_bg_image = gr.Image(label="Generated Image", height=512, type='pil')
remove_bg_btn = gr.Button(value="Remove Background")
# handle on-click events - generate images
remove_bg_btn.click(mult_thread_audio, [], [audio])
remove_bg_btn.click(fn=mult_thread_remove_bg, inputs=[anime_char_image], outputs=[anime_char_remove_bg_image])
# tab for memory release
with gr.Tab("Release Memory"):
gr.Markdown(
"""
# Please delete some models when encountering the "Out of Memory" error.
The percentage after each button's label indicates how much memory will be freed by the deletion. Beware that a deleted model has to be reloaded, so the first generation of the corresponding feature will take longer afterwards.
"""
)
# buttons to release models
with gr.Column():
chatbot_release_btn = gr.Button(value="Chatbot - 50%")
scribble2img_release_btn = gr.Button(value="Scribble to Image Generator - 7%")
video_release_btn = gr.Button(value="Video Generator - 7%")
txt2img_release_btn = gr.Button(value="Text to Image Generator - 5%")
img2img_release_btn = gr.Button(value="Image to Image Generator - 5%")
lineart2img_release_btn = gr.Button(value="Lineart to Image Generator - 5%")
pose2img_release_btn = gr.Button(value="Pose Variation Generator - 5%")
inpaint_release_btn = gr.Button(value="Inpainting Generator - 5%")
chibi_release_btn = gr.Button(value="Chibi Generator - 5%")
rmbg_release_btn = gr.Button(value="Remove Background Generator - 5%")
# handle on-click events - model deletion
chatbot_release_btn.click(fn=delete_models, inputs=[chatbot_release_btn], outputs=[])
scribble2img_release_btn.click(fn=delete_models, inputs=[scribble2img_release_btn], outputs=[])
video_release_btn.click(fn=delete_models, inputs=[video_release_btn], outputs=[])
txt2img_release_btn.click(fn=delete_models, inputs=[txt2img_release_btn], outputs=[])
img2img_release_btn.click(fn=delete_models, inputs=[img2img_release_btn], outputs=[])
lineart2img_release_btn.click(fn=delete_models, inputs=[lineart2img_release_btn], outputs=[])
pose2img_release_btn.click(fn=delete_models, inputs=[pose2img_release_btn], outputs=[])
inpaint_release_btn.click(fn=delete_models, inputs=[inpaint_release_btn], outputs=[])
chibi_release_btn.click(fn=delete_models, inputs=[chibi_release_btn], outputs=[])
rmbg_release_btn.click(fn=delete_models, inputs=[rmbg_release_btn], outputs=[])
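The delete_models handler used above is defined in the attached full script; the sketch below only illustrates the idea, and the label-to-pipeline mapping is an assumption based on the button labels:
def delete_models(btn_label):
    # release the pipeline that corresponds to the clicked button, then free GPU memory
    global pipe_chatbot, pipe_txt2img, pipe_img2img, pipe_scribble, pipe_lineart, \
        pipe_pose, pipe_inpaint, pipe_video, pipe_chibi, rmbg_model
    if btn_label.startswith("Chatbot"):
        pipe_chatbot = None
    elif btn_label.startswith("Text to Image"):
        pipe_txt2img = None
    elif btn_label.startswith("Scribble to Image"):
        pipe_scribble = None
    # ...the remaining buttons map to their pipelines in the same way
    torch.cuda.empty_cache()
    gr.Info("Model released")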
Full Python Script
The full script for this project is in the attachment.