# Webaverse AI Architecture
The following covers all of the AI systems Webaverse runs and requires to operate properly.
<br />
# Domains
- Text Generation
  - Dialog
  - Quests
  - Descriptions
  - Lore
  - Ongoing Story
- Image Generation
  - Image previews
- Voice Generation
  - Character voices
- Animation Generation
  - Character animations
  - Mob animations
  - Pet animations
- Model Generation
  - Weapons
  - Consumables
  - Wearables
  - Mobs
  - Pets
  - Vehicles
  - Characters
- Music Generation
  - In-game Music
  - Character theme songs
  - Musical objects
- Sound Generation
  - Weapon sound FX
  - Mob sound FX
  - Pet sound FX
  - Vehicle sound FX
<br />
# Models
## Search
Used for searching and doing approximate nearest neighbor (ANN) matching on anything, especially searching Wikipedia or searching through a corpus for the closest match to a query
**STATUS: DEPLOYED**
REPO: https://github.com/webaverse/weaviate-server
### Current Implementation
Weaviate - https://weaviate.io/
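For illustration, a minimal Python sketch of an ANN query against a Weaviate instance, using the official `weaviate-client` library (v3 API). The `Lore` class, its `body` field, and the localhost URL are assumptions rather than the deployed schema, and `nearText` requires a text2vec vectorizer module to be enabled on the server:

```python
# A minimal sketch, assuming a "Lore" class with a "body" text field exists
# and a text2vec vectorizer module is enabled on the server.
import weaviate

client = weaviate.Client("http://localhost:8080")

# nearText performs the ANN match against the vectorized corpus.
result = (
    client.query
    .get("Lore", ["body"])
    .with_near_text({"concepts": ["a sword forged in dragon fire"]})
    .with_limit(5)
    .do()
)
print(result["data"]["Get"]["Lore"])
```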
### Future Implementation Candidates
- https://qdrant.tech/benchmarks/single-node-speed-benchmark/ -- much faster than Weaviate
- https://github.com/facebookresearch/faiss -- in development at Facebook for a long time, has many compelling features, and is being baked into Blenderbot
## Text Generation
Used for generating all text and completions. Needs to be general purpose and fine-tunable, and able to handle long bodies of text and complex situations. Fast response time is also helpful.
**STATUS: DEPLOYED**
### Current Implementation
GPT-3
- Great in every way, but expensive and not open source; there are many things we can't do with it
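For reference, a hedged sketch of a completion call through the `openai` Python library; the engine name, prompt, and parameters are illustrative, not Webaverse's actual text pipeline:

```python
# A minimal sketch, not the production dialog pipeline.
import openai

openai.api_key = "sk-..."  # supplied via an environment variable in practice

completion = openai.Completion.create(
    engine="text-davinci-002",  # illustrative engine choice
    prompt="An NPC blacksmith greets the player arriving at the forge:",
    max_tokens=64,
    temperature=0.7,
)
print(completion.choices[0].text)
```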
### Future Implementation Candidates
OPT-175b
- https://opt.alpa.ai/ / https://alpa.ai/tutorials/opt_serving.html
- Comparable to early pre-instruct GPT-3, with good quality when fine-tuned
- Used in Blenderbot -- https://blenderbot.ai/
Unified Language Model
- https://github.com/microsoft/unilm
## Emotion Recognition and Toxicity Detection
Used for rapidly determining sentiment, emotion, and hate speech in text
**STATUS: NOT DEPLOYED**
### Future Implementation Candidates
XtremeDistil trained on the GoEmotions dataset
- https://huggingface.co/bergum/xtremedistil-l6-h384-go-emotion
- Fast enough to run in the browser
DistilBERT toxicity detection
- https://huggingface.co/dapang/distilbert-base-uncased-finetuned-toxicity
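Both candidates can be exercised with the Hugging Face `transformers` pipeline API; a minimal sketch using the model IDs linked above, with illustrative inputs:

```python
# A minimal sketch; both models are text-classification heads on Hugging Face.
from transformers import pipeline

emotion = pipeline("text-classification",
                   model="bergum/xtremedistil-l6-h384-go-emotion")
toxicity = pipeline("text-classification",
                    model="dapang/distilbert-base-uncased-finetuned-toxicity")

print(emotion("I can't believe you found my lost pet!"))
print(toxicity("You are the worst player I have ever met."))
```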
## Voice Generation
Used for all character voice generation. Needs to be super fast and human-sounding
**STATUS: DEPLOYED**
REPO: https://github.com/webaverse/tiktalknet
### Current Implementation
TikTalkNet - https://github.com/webaverse/tiktalknet
- Super fast and lightweight but not SOTA
### Future Implementation Candidates
- https://github.com/NATSpeech/NATSpeech
- https://github.com/Rongjiehuang/ProDiff
- https://github.com/keonlee9420/DiffGAN-TTS
## Image Generation
Used for generating character portraits, backgrounds, objects, textures and in-game artwork
**STATUS: DEPLOYED**
REPO: https://github.com/webaverse/stable-diffusion-webui
DEPRECATED: https://github.com/webaverse/stable-diffusion
### Future Implementation Candidates
- Kali Yuga is making great stuff, especially for 8-bit and pixel art, but it's all based on k-diffusion and Disco Diffusion
  - https://github.com/KaliYuga-ai/Pixel-Art-Diffusion/blob/main/Pixel_Art_Diffusion_v3_0_(With_Disco_Symmetry).ipynb
  - https://colab.research.google.com/drive/1ANvbcAI20-B-HXk5I0JwpRQvXPALBqtJ
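For the deployed webui fork, generation can be driven over HTTP; a hedged sketch, assuming the upstream webui's `/sdapi/v1/txt2img` REST endpoint (enabled with the `--api` flag) is retained in the fork, with an illustrative port and payload:

```python
# A minimal sketch, assuming the webui is running locally with --api enabled.
import base64
import requests

payload = {"prompt": "pixel art portrait of a goblin merchant", "steps": 20}
response = requests.post("http://127.0.0.1:7860/sdapi/v1/txt2img", json=payload)
response.raise_for_status()

# The endpoint returns generated images as base64-encoded PNGs.
with open("portrait.png", "wb") as f:
    f.write(base64.b64decode(response.json()["images"][0]))
```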
## Sound Generation
Used for ambient sounds in the world, as well as sound effects attached to objects and mobs
**STATUS: DEPLOYED**
REPO: https://github.com/webaverse/diffsound
### Current Implementation
DiffSound
- https://github.com/yangdongchao/Text-to-sound-Synthesis
### Future Implementation Candidates
Audio Diffusion -- similar to DiffSound, but the samples are much better, probably due to the datasets
- https://github.com/archinetai/audio-diffusion-pytorch#text-conditional-generation
- https://felixkreuk.github.io/text2audio_arxiv_samples/
- https://github.com/lucidrains/audiolm-pytorch (not yet working)
## Music Generation (Audio)
Used for all music generated in Webaverse
**STATUS: NOT DEPLOYED**
### Future Implementation Candidates
- https://github.com/archinetai/audio-diffusion-pytorch
This version of Audio Diffusion features models fine-tuned on specific pieces:
- https://github.com/teticio/audio-diffusion
- https://huggingface.co/spaces/teticio/audio-diffusion
## Music Generation (MIDI)
Used for ambient audio generation in Webaverse. May be much faster to generate and process consistent long pieces than audio-only methods.
**STATUS: NOT DEPLOYED**
### Future Implementation Candidates
- https://github.com/sjhan91/Mixture2Music_Official
- https://salu133445.github.io/musegan
## Text to 3D
Used for generation of all 3D objects and features in the world, based on descriptions, images or general class types
**STATUS: DEPLOYED**
REPO: https://github.com/webaverse/stable-dreamfusion
### Current Implementation
Stable Dreamfusion - https://github.com/ashawkey/stable-dreamfusion
- SOTA, uses SD
- Slow (50 minutes/generation) and quality isn't great
GET3D - https://github.com/nv-tlabs/GET3D
- Relatively fast, ~ 1 min / model
- Needs research on conditional generation
### Future Implementation Candidates
- https://nv-tlabs.github.io/LION/ -- not released yet
## Text to Motion
Used for generation of humanoid animations
**STATUS: DEPLOYED**
REPO: https://github.com/webaverse/motion-diffusion-model
### Current Implementation
Motion Diffusion
- https://github.com/webaverse/motion-diffusion-model
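Sampling from the deployed model is driven by the repo's CLI; a hedged sketch of invoking it from Python, where the flag names follow the upstream README and the checkpoint path is hypothetical (it depends on which pretrained model has been downloaded):

```python
# A minimal sketch; the checkpoint path below is hypothetical.
import subprocess

subprocess.run(
    [
        "python", "-m", "sample.generate",
        "--model_path", "./save/humanml_trans_enc_512/model000200000.pt",
        "--text_prompt", "a person jumps over an obstacle",
    ],
    cwd="motion-diffusion-model",  # run from the repo checkout
    check=True,
)
```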
### Other Implementations
- https://github.com/mingyuan-zhang/MotionDiffuse -- seems very similar
## Image to Text & Visual Question Answering
Used for describing images so that the game can incorporate user images into the story, analyze screenshots, generate labels for training data or prompts for inverted generation
**STATUS: NOT DEPLOYED**
### Future Implementation Candidates
- https://github.com/salesforce/BLIP - works really well for prompt-like captions, also does visual question answering
- https://github.com/webaverse/CLIP-Caption-Reward - detailed descriptions
- https://github.com/pharmapsychotic/clip-interrogator - does a really good job giving back prompts
- https://huggingface.co/dandelin/vilt-b32-finetuned-vqa
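The ViLT candidate above can be exercised through the `transformers` visual-question-answering pipeline; a minimal sketch with an illustrative image path and question:

```python
# A minimal sketch; "screenshot.png" is an illustrative local image path.
from transformers import pipeline

vqa = pipeline("visual-question-answering",
               model="dandelin/vilt-b32-finetuned-vqa")

answers = vqa(image="screenshot.png",
              question="How many characters are in the scene?")
print(answers)  # candidate answers with confidence scores
```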
## Audio to Text
Used for captioning or describing audio or sounds
**STATUS: NOT DEPLOYED**
### Future Implementation Candidates
- https://github.com/TheoCoombes/ClipCap -- uses CLAP from LAION to do many things, including captioning audio and audio2img
## 2D Image Animation
Used for generating animation from 2D images, especially synced with audio or text for characters and portraits
**STATUS: NOT DEPLOYED**
### Future Implementation Candidates
- https://github.com/yoyo-nb/Thin-Plate-Spline-Motion-Model
## Model Rigging
Used for adding bones to objects, characters, mobs, and pets that don't have a rig
**STATUS: NOT DEPLOYED**
### Future Implementation Candidates
- https://github.com/zhan-xu/RigNet
<br />
# Datasets
### 3D Models
- http://yulanguo.me/dataset.html
- https://shapenet.org/
### Human Models
- https://github.com/open-mmlab/mmhuman3d/tree/main/configs/gta_human