My name is Pibo (@memex_pibo).
We are "memex", a two-person music unit active in VR.
Around the end of May, I built a VR live system called "Omnipresence Live" and used it for a live show titled "#Interpretation disagreement".
I believe it turned out to be a rare kind of VR live, combining live performance with audio-reactive spatial staging and no limit on the number of attendees.
This entry describes its implementation and operation.
Archive
The spatial archive of "#interpretation disagreement" is published as a VRChat world.
You can experience it here; it takes about 50 minutes.
https://vrchat.com/home/world/wrld_21a48553-fd25-40d0-8ff0-b4402b36172a
There is also an archive of 360° stereo video on YouTube.
The rest of this article is much easier to follow if you have seen it, so I recommend watching it once.
System diagram
This is the system diagram for performing "#interpretation disagreement".
The two members of memex, in separate locations, share voice and motion over the network, convert them into a video once, and deliver that video to a VRChat world.
Process description
First I will explain the implementation of "Omnipresence Live", the technical core that makes the live possible, and then introduce "#interpretation disagreement" as a concrete example of operating it.
"Omnipresence Live" commentary topic
•Convert audio volume to OSC
•Convert MIDI signal to OSC
•Receive OSC value with Unity and write to RenderTexture
•Video encoding measures when you want to send information as video
•Visualize the vertex animation texture of the avatar in real time
•Reconstruct avatar motion from texture with embedded vertex information
•Use Quaternion to express the rotation of the object only with Shader
Explanation topic of "#interpretation disagreement"
•Remote session with live song and live guitar
•Para audio reactive space production that responds to each track that makes up a song
•Directed event management that fires at a specific timing of music including live music
•Remote motion capture
•Both live guitar performance and guitar motion recording
•Real-time display of tweets in VR space
•Return monitor for remote motion recording
•Build a dedicated video streaming server with AWS
•Implementation of Omnipresence Live reconstructor in VRChat
•Spatial editing function on VRChat
Omnipresence Live implementation
What is Omnipresence Live
It is a system that converts the elements that make up a VR live (avatar motion, staging, and audio information) into a video and delivers it, then reconstructs the VR live from the received video.
Features
•There is no limit on the number of participants
•However, not every spectator can see every other spectator
•Live vocals and live instrumental performance are possible
•Motion can be captured live during the performance
•Per-track ("para") audio-reactive spatial staging (effects that react to each track of the music) is possible
Problem to solve
There was no existing way (short of building one) to hold a VR live where audio-reactive spatial staging stays in time with a live performance.
A "pure" solution would apparently require building a server that can deliver motion and effects sharing a timecode with the audio, plus a corresponding client.
Approach
By converting motion and effect information into video, they are delivered already synchronized with the audio.
How it works
The system can be divided into the following two parts.
Event visualizer: exports avatar motion and VR staging parameters as frame images
Reconstructor: reconstructs the avatar movement and the VR staging from the images generated by the event visualizer
Event visualizer
Every frame, the event visualizer converts the avatar's motion information and the spatial staging information into an image like the one below.
The area outlined in light blue is the staging information, and the area outlined in green is the motion information.
Event visualizer system configuration
Ableton Live Suite
-Overview
•A DAW (software used for music composition and production)
-Role
•Managing the live's timeline
•Bridging to Max for Live
•Processing the live audio
Max for Live
-Overview
•A plugin that lets you run Max/MSP (a visual programming environment) on each Ableton Live track
-Role
•Sending the volume of each live audio track (vocals, guitar, drum kick, drum snare, etc.) via OSC
•Sending MIDI signals via OSC
Unity (Direction Signal Visualizer)
-Role
•Writing the volume/MIDI signals received via OSC to a 1920px × 1080px texture
Unity (Vertex Visualizer)
-Role
•Writing the position of each vertex of the VRM avatar, one by one, to a 1920px × 1080px texture
Unity (Position/Posture Visualizer)
-Role
•Writing the Transform of arbitrary objects to a 1920px × 1080px texture
Ableton Live Suite: sending audio to Max for Live
By inserting the purple Max for Live plug-in described below into each audio-reactive track, the live vocal track, and the live guitar track, the volume and pitch of each part can be individually converted to OSC.
Max for Live plug-in
Max for Live is a visual programming environment that can process audio signals and integrates very easily with Ableton Live.
It computes the volume and pitch (note) of the audio on each Ableton Live Suite track and sends them via OSC.
Unity (Direction Signal Visualizer)
Staging signals matching the volume, pitch, and song structure are converted into a texture.
It receives the OSC signals from Max for Live and renders them into an image.
We use https://github.com/hecomi/uOSC to receive OSC.
The approach differs slightly between values that do not need much precision and values that need as much precision as possible.
Values that do not require precision
Rough values such as loudness are converted into an equally rough representation: color brightness.
Specifically, this method is used when 8 bits (0 to 255) of precision is sufficient.
For each track that sends OSC, a GameObject with the following OSCVolumeAndPitchVisualizer.cs attached is prepared.
OSCVolumeAndPitchVisualizer.cs writes a color derived from the track's audio to the texture coordinates assigned to that track.
- Volume = Brightness (HSV V)
- Pitch = Hue (HSV H)
Here I dispatch a compute shader with a single thread to write the color, but Texture2D.SetPixels() would probably work just as well.
///OSCVolumeAndPitchVisualizer.cs
///Simplified
using UnityEngine;
using uOSC;
public class OSCVolumeAndPitchVisualizer : MonoBehaviour
{
/// Compute shader that writes a float value between 0 and 1 to a pixel with a monochrome value between 0 and 255
public ComputeShader NormalizedRGBValueTo64pxRGBBrightness;
/// Pixel position to write
public int row, column;
/// Texture to write
[SerializeField]
private RenderTexture output;
/// OSC receiving server
[SerializeField] uOscServer server;
/// OSC address to receive
[SerializeField] string address;
/// volume: 0~1
/// pitch: integer from 0 to 11, where 0 = A (la)
[SerializeField] float volume = 0, pitch = -2;
/// hsv (color) values: 0 to 1
[SerializeField] float h, s = 1.0f, v;
void Start()
{
server.onDataReceived.AddListener(OnDataReceived);
}
void OnDataReceived(Message message)
{
if (message.address == (address + "/volume"))
{
float.TryParse(message.values[0].GetString(), out volume);
}
else if (message.address == (address + "/pitch"))
{
// 0 = A, 1 = A#, ...; -1 is returned when no pitch from 0 to 11 is detected
float.TryParse(message.values[0].GetString(), out pitch);
}
}
private void Update()
{
var dt = Time.deltaTime;
/// Volume is used directly as brightness
v = volume;
/// Pitch to hue (in practice this was interpolated with LerpAngle so it would not jump)
h = pitch / 12.0f;
/// Convert HSV color to RGB
var rgb = Color.HSVToRGB(h, s, v * v);
SetBlockRGB(row, column, rgb);
}
/// <summary>
/// Paint the specified block (8px x 8px) with the specified color
/// </summary>
/// <param name="rowInMethod">Row of blocks</param>
/// <param name="columnInMethod">Row of blocks</param>
/// <param name="rgbInMethod">color</param>
private void SetBlockRGB(int rowInMethod, int columnInMethod, Color rgbInMethod)
{
// Decide which kernel (process) you want to call
var kernel = NormalizedRGBValueTo64pxRGBBrightness.FindKernel("CSMain");
// Pass necessary data, reference, etc.
NormalizedRGBValueTo64pxRGBBrightness.SetInt("row", rowInMethod);
NormalizedRGBValueTo64pxRGBBrightness.SetInt("column", columnInMethod);
NormalizedRGBValueTo64pxRGBBrightness.SetFloat("normalizedRed", rgbInMethod.r);
NormalizedRGBValueTo64pxRGBBrightness.SetFloat("normalizedGreen", rgbInMethod.g);
NormalizedRGBValueTo64pxRGBBrightness.SetFloat("normalizedBlue", rgbInMethod.b);
NormalizedRGBValueTo64pxRGBBrightness.SetTexture(kernel, "OutPosition", output);
// Run Compute Shader
NormalizedRGBValueTo64pxRGBBrightness.Dispatch(kernel, 1, 1, 1);
}
}
///Normalized8bitRGBValueTo64pxRGB.c
// the value passed from cs
RWTexture2D<float4> OutPosition;
int row,column;
float normalizedRed, normalizedGreen, normalizedBlue;
[numthreads(1,1,1)]
void CSMain (uint3 id : SV_DispatchThreadID)
{
for(uint x=0; x < 8; x++){
for(uint y=0; y < 8; y++){
OutPosition[uint2((row) * 8 + x, (column) * 8 + y)]
= float4(normalizedRed, normalizedGreen, normalizedBlue, 1.0);
}
}
}
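If you do not need the GPU path, a CPU-side version of the same block write using Texture2D.SetPixels(), as mentioned above, might look roughly like this (a minimal sketch, not from the actual project; the texture field and its format are assumptions):
//BlockColorWriter.cs (sketch)
using UnityEngine;
public class BlockColorWriter : MonoBehaviour
{
    /// Texture to write to (assumed: 1920x1080, RGBA32, uncompressed)
    [SerializeField] private Texture2D valueTex;
    /// <summary>
    /// Paint the 8px x 8px block at (row, column) with a single color
    /// </summary>
    public void SetBlockRGB(int row, int column, Color rgb)
    {
        var pixels = new Color[8 * 8];
        for (var i = 0; i < pixels.Length; i++) pixels[i] = rgb;
        valueTex.SetPixels(row * 8, column * 8, 8, 8, pixels);
        valueTex.Apply(); // upload the change to the GPU
    }
}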
Values that require as much precision as possible
Limits of the information
For example, if an object's x coordinate is expressed with an 8-bit value (0 to 255) at a scale of 100 = 1 m, it can only move within a 2.56 m range, in steps of 1 cm.
Likewise, if you only have 8 bits for something you want to fade slowly over 25 seconds, the value can change only about 10 times per second, which looks choppy. (It actually gets even more jarring after encoding.)
If you want more precision, you have to be careful to write the values to pixels in a way that encoding does not strip the information away.
The video encoding problem
Video encoding becomes a problem when parameters are converted to images, distributed as video, and then restored from the received video.
Encoding compresses the color data in ways that are hard for humans to notice, which makes the video much easier to distribute and receive.
However, it becomes a shackle when you want to send exact values.
Digital colors are generally represented by the intensity of the three channels red, green, and blue (RGB).
In image and video data, one pixel is often represented by 24 bits in total: 8 bits (0 to 255) each for red, green, and blue.
For uncompressed 1920px × 1080px video at 30 fps with 24 bits per pixel, the required bitrate is
24 × 1920 × 1080 × 30 / 1024^2 ≈ 1423 Mbps. (The number is so large that I worried whether my calculation was right.)
However, general video streaming sites deliver 1920px × 1080px video at around 3 to 6 Mbps.
This is thanks to the encoder working hard at compression.
Encoding countermeasures: blocks and 8-bit brightness
After some trial and error, I settled on the following approach for sending information in a way that is somewhat robust to encoding. (There is probably a better way if you read the H.264 specification properly...)
- Write each piece of information into a block of pixels, such as 4px × 4px or 8px × 8px, whenever possible
- Express values using brightness only, without using color, i.e. 8 bits from 0 to 255
Splitting a value larger than 8 bits across two blocks: the rejected plan
If you want to handle a value larger than 8 bits, you have to use two or more blocks to represent one value. But...
Even with 8 bits (256 levels) of brightness, encoding adds noise to the value.
To express a 16-bit value, I first tried using two blocks, with block 1 holding the upper 8 bits and block 2 holding the lower 8 bits.
However, the error became very large.
For example, suppose the value you want to send is 21855, as shown below.
If encoding changes the value of block 1 by 3, the result becomes 21087. Block 1 only changed from 85 to 82, a shift of 3 out of 255, but the combined value changes by 3 × 256 = 768.
If the x coordinate of an object is expressed with the 16-bit value (0 to 65535) held by these two blocks at a scale of 10000 = 1 m, a brightness error of 3 in a single block caused by encoding easily produces a shift of about 8 cm.
Embedding a value larger than 8 bits across two blocks: the adopted plan
So instead, the bits are assigned to the two blocks alternately. 21855 can be expressed like this.
Now when the value of block 1 changes by 3, from 3 to 6, the decoded value becomes 21885, an error of only 30. At 10000 = 1 m, that is 3 mm.
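To make the difference concrete, here is a small C# sketch of the same interleaved packing and its inverse (the helper names are mine, not from the project). Because each block only carries every other bit, a brightness error in one block can no longer shift the whole upper byte:
//InterleavedPack16.cs (sketch)
public static class InterleavedPack16
{
    /// Bits 1, 3, ..., 15 of the value -> the first 8-bit block (brightness)
    public static byte OddBits(ushort v)
    {
        var b = 0;
        for (var i = 0; i < 8; i++) b |= ((v >> (2 * i + 1)) & 1) << i;
        return (byte)b;
    }
    /// Bits 0, 2, ..., 14 of the value -> the second 8-bit block
    public static byte EvenBits(ushort v)
    {
        var b = 0;
        for (var i = 0; i < 8; i++) b |= ((v >> (2 * i)) & 1) << i;
        return (byte)b;
    }
    /// Reassemble the 16-bit value from the two decoded block brightnesses
    public static ushort Unpack(byte odd, byte even)
    {
        var v = 0;
        for (var i = 0; i < 8; i++)
        {
            v |= ((odd >> i) & 1) << (2 * i + 1);
            v |= ((even >> i) & 1) << (2 * i);
        }
        return (ushort)v;
    }
}
For 21855, OddBits() gives 3 and EvenBits() gives 255; perturbing the odd block from 3 to 6 decodes to 21885, which matches the small error described above.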
Implementation
The implementation looks something like this (it is just bit manipulation).
(The file extension is actually .compute, but Qiita does not highlight it, so it is labeled .c here for readability.)
//Writer8x16px16bitUnsignedInt.c
//the value passed from CS
RWTexture2D<float4> OutPosition;
float val;
int row;
int column;
float4 oddBitOfUintTo8bitBrightness(uint posMillimeter){
// 0bo-p-q-r-s-t-u-v- -> 0b00000000opqrstuv
uint goal = 0;
goal = (posMillimeter & 2) == 2 ? 1 : goal;
goal = (posMillimeter & 8) == 8 ? goal | 2 : goal;
goal = (posMillimeter & 32) == 32 ? goal | 4 : goal;
goal = (posMillimeter & 128) == 128 ? goal | 8 : goal;
goal = (posMillimeter & 512) == 512 ? goal | 16 : goal;
goal = (posMillimeter & 2048) == 2048 ? goal | 32 : goal;
goal = (posMillimeter & 8192) == 8192 ? goal | 64 : goal;
goal = (posMillimeter & 32768) == 32768 ? goal | 128 : goal;
return float4(
goal & 0xff,
goal & 0xff,
goal & 0xff,
0) / 255.0;
}
float4 evenBitOfUintTo8bitBrightness(uint posMillimeter){
// 0b-o-p-q-r-s-t-u-v -> 0b00000000opqrstuv
uint goal = 0;
goal = (posMillimeter & 1) == 1 ? 1 : goal;
goal = (posMillimeter & 4) == 4 ? goal | 2 : goal;
goal = (posMillimeter & 16) == 16 ? goal | 4 : goal;
goal = (posMillimeter & 64) == 64 ? goal | 8 : goal;
goal = (posMillimeter & 256) == 256 ? goal | 16 : goal;
goal = (posMillimeter & 1024) == 1024 ? goal | 32 : goal;
goal = (posMillimeter & 4096) == 4096 ? goal | 64 : goal;
goal = (posMillimeter & 16384) == 16384 ? goal | 128 : goal;
return float4(
goal & 0xff,
goal & 0xff,
goal & 0xff,
0) / 255.0;
}
[numthreads(1,1,1)]
void CSMainFHD (uint3 id : SV_DispatchThreadID)
{
uint uintValue = val * 65535.0f;
for(uint x=0; x < 8; x++){
for(uint y=0; y < 8; y++){
OutPosition[uint2((row) * 8 + x, (column) * 8 + y)] = oddBitOfUintTo8bitBrightness(uintValue);
OutPosition[uint2((row + 1) * 8 + x, (column) * 8 + y)] = evenBitOfUintTo8bitBrightness(uintValue);
}
}
}
Unity (Vertex Visualizer)
Vertex animation
As described above, the position of each of the avatar's vertices is written to the texture one vertex at a time.
This is based on Vertex Animation Texture (VAT), a technique that writes the vertex positions of each animation keyframe to a texture and reconstructs the mesh from that image.
For the implementation I referred to sugi-cho's repository "Animation-Texture-Baker".
https://github.com/sugi-cho/Animation-Texture-Baker
Constraints
In Unity, vertex positions are 32-bit floats, but for the reasons described above it is hard to guarantee that much precision, so the position is limited in range and expressed as a 16-bit value.
Specifically, I expressed it as follows (a small sketch of the conversion follows this list).
- Vertex positions can only move within the range -3.2767 m to 3.2767 m
- Vertices outside this range are not drawn
- A 3.2767 m offset is added to the position, which is then expressed as an unsigned 16-bit value from 0 to 65535
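A tiny sketch of this range-limited conversion (scale and offset as listed above):
//PositionRange16.cs (sketch)
public static class PositionRange16
{
    /// Meters (must lie within -3.2767 to +3.2767) -> unsigned 16-bit value
    public static ushort Encode(float meters)
    {
        return (ushort)((meters + 3.2767f) * 10000f);
    }
    /// Unsigned 16-bit value -> meters (0.1 mm steps)
    public static float Decode(ushort encoded)
    {
        return encoded / 10000f - 3.2767f;
    }
}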
Output image
The white horizontal band at the bottom of the texture below is the vertex information for one avatar.
This is a close-up of part of the area where the vertex information is drawn (the green region at the lower left).
The x, y, and z coordinates of each vertex are each written into two 4px × 4px blocks.
Implementation
The implementation looks something like this.
//RealtimeVertexBaker16bitUnsignedInt.cs
using System.Collections.Generic;
using UnityEngine;
public class RealtimeVertexBaker16bitUnsignedInt : MonoBehaviour
{
public ComputeShader infoTexGen;
public Material material;
// offset of where in the texture to write
public int columnOffset=0;
private SkinnedMeshRenderer _skin;
private int vertexCount;
private const int TEX_WIDTH = 1920, TEX_HEIGHT = 1080;
[SerializeField]
private RenderTexture pRt;
private Mesh mesh;
private List<Vector3> posList;
private ComputeBuffer posBuffer;
private void Start()
{
// Get the Skinned Mesh Renderer for the avatar
_skin = GetComponent<SkinnedMeshRenderer>();
vertexCount = _skin.sharedMesh.vertexCount;
mesh = new Mesh();
// Make the render texture writable
pRt.enableRandomWrite = true;
}
void Update()
{
// Bake a mesh for the current frame from the SkinnedMeshRenderer
_skin.BakeMesh(mesh);
// Create a container to pass the value to the compute shader
// Create a buffer with the number of vertices * Vector3, like a C dynamic memory allocation
posBuffer = new ComputeBuffer(vertexCount, System.Runtime.InteropServices.Marshal.SizeOf(typeof(Vector3)));
// Set the vertex position information of mesh
posBuffer.SetData(mesh.vertices);
var kernel = infoTexGen.FindKernel("CSMainFHD");
// Pass required data and references
infoTexGen.SetInt("VertCount", vertexCount);
infoTexGen.SetInt("ColumnOffset", columnOffset);
infoTexGen.SetBuffer(kernel, "Pos", posBuffer);
infoTexGen.SetTexture(kernel, "OutPosition", pRt);
// run the compute shader
// The arguments are the number of thread groups: vertexCount x 1 x 1 (one per vertex)
infoTexGen.Dispatch(kernel, vertexCount, 1, 1);
posBuffer.Release();
}
}
//VertexWriter16bitUnsignedIntFHD.c
// the value passed from CS
RWTexture2D<float4> OutPosition;
StructuredBuffer<float3> Pos;
int VertCount;
int ColumnOffset;
[numthreads(1,1,1)]
void CSMainFHD (uint3 id : SV_DispatchThreadID)
{
// id.x is the vertex ID itself
// row = id.x % (TEX_WIDTH / 4)    // 4 is the number of pixels per vertex in the x direction
// column = id.x / (TEX_WIDTH / 4)
uint index = id.x;
float3 pos = Pos[index];
int TEX_WIDTH = 1920;
uint row = index % (TEX_WIDTH / 4);
uint column = index / (TEX_WIDTH/ 4) + ColumnOffset;
uint posXMillimeter = (pos.x + 3.2767f) * 10000.0f;
uint posYMillimeter = (pos.y + 3.2767f) * 10000.0f;
uint posZMillimeter = (pos.z + 3.2767f) * 10000.0f;
//pos.x1
OutPosition[uint2(row * 4 + 0, column * 6 + 0)] = oddBitOfUintTo8bitBrightness(posXMillimeter);
OutPosition[uint2(row * 4 + 0, column * 6 + 1)] = oddBitOfUintTo8bitBrightness(posXMillimeter);
OutPosition[uint2(row * 4 + 1, column * 6 + 0)] = oddBitOfUintTo8bitBrightness(posXMillimeter);
OutPosition[uint2(row * 4 + 1, column * 6 + 1)] = oddBitOfUintTo8bitBrightness(posXMillimeter);
//pos.x2
OutPosition[uint2(row * 4 + 2, column * 6 + 0)] = evenBitOfUintTo8bitBrightness(posXMillimeter);
OutPosition[uint2(row * 4 + 2, column * 6 + 1)] = evenBitOfUintTo8bitBrightness(posXMillimeter);
OutPosition[uint2(row * 4 + 3, column * 6 + 0)] = evenBitOfUintTo8bitBrightness(posXMillimeter);
OutPosition[uint2(row * 4 + 3, column * 6 + 1)] = evenBitOfUintTo8bitBrightness(posXMillimeter);
//pos.y1
// omitted (same pattern as pos.x)
//pos.y2
// omitted
//pos.z1
// omitted
//pos.z2
// omitted
}
Unity (Position/Posture Visualizer)
For objects with a SkinnedMeshRenderer I wrote out the vertex positions, but for objects that do not change shape I simply wrote the object's position and rotation.
Rotation is expressed with the x, y, z, w components of a quaternion, each in the range -1 to 1, so each component is normalized to 0 to 1 and written as an 8-bit value.
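A minimal sketch of that normalization step (the actual pixel write is the same block writer as before and is omitted here):
//RotationNormalizer.cs (sketch)
using UnityEngine;
public class RotationNormalizer : MonoBehaviour
{
    [SerializeField] private Transform target;
    /// Remap the rotation's (x, y, z, w) from -1..1 to 0..1 so each component
    /// can be written as an 8-bit brightness value
    public Vector4 NormalizedRotation()
    {
        var q = target.rotation;
        return new Vector4(q.x, q.y, q.z, q.w) * 0.5f + Vector4.one * 0.5f;
    }
}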
Reconstructor
The reconstructor rebuilds the staging and the motion from the images generated by the event visualizer.
Everything is written as shaders so that the reconstructor can run inside VRChat, but if you do not need to do everything on the GPU, reading the pixels back in C# (for example with Texture2D.ReadPixels()) is probably more straightforward.
Staging is expressed by driving arbitrary shader parameters.
Motion is restored in the vertex shader of the material attached to the same model whose vertex information was written to the texture; it reads each vertex position back from the image.
Reconstructing the staging
Shader parameters are controlled by sampling the color of specific pixels in the image.
When reading colors from the encoded video, I found that discarding the outer ring of each 8px × 8px block and reading only the central 6px × 6px reduces the error.
ReadBlock.c
float texelSizeX = (1.0 / 1920.0);
float texelSizeY = (1.0 / 1080.0);
float4 color = float4(0, 0, 0, 0);
// Read the 8x8 block specified by _row and _column
// Sum up only the central 6x6 pixels
for(uint x = 0; x < 6; x++){
for(uint y = 0; y < 6; y++){
float2 address = float2(
// Each block is 8 px wide; skipping the outermost ring of pixels (hence the +1 offset) improves accuracy
( (_row) * 8 + 1 + x ) * texelSizeX,
( (_column) * 8 + 1 + y ) * texelSizeY );
color += tex2Dlod(_valueTex, float4(address.x, address.y, 0, 0));
}
}
// Divide total by 36 and average
color = color / 36.0;
To read a 16-bit value from two blocks, the following code is used (again, just bit manipulation).
Unpack.c
float unpackUintFromDoubleFloat4(float4 oneSecond, float4 twoSecond){
// Combine 8bit values oneSecond = 0bxxxxxxxx and twoSecond = 0byyyyyyyy to 16bit goal = 0bxyxyxyxyxyxyxyxy
uint4 oS = uint4(oneSecond * 255.0 + 0.5);
uint4 tS = uint4(twoSecond * 255.0 + 0.5);
uint firstGoal = (oS.x & 1) == 1 ? 2 : 0;
firstGoal = (oS.x & 2) == 2 ? firstGoal | 8 : firstGoal;
firstGoal = (oS.x & 4) == 4 ? firstGoal | 32 : firstGoal;
firstGoal = (oS.x & 8) == 8 ? firstGoal | 128 : firstGoal;
firstGoal = (oS.x & 16) == 16 ? firstGoal | 512 : firstGoal;
firstGoal = (oS.x & 32) == 32 ? firstGoal | 2048 : firstGoal;
firstGoal = (oS.x & 64) == 64 ? firstGoal | 8192 : firstGoal;
firstGoal = (oS.x & 128) == 128 ? firstGoal | 32768 : firstGoal;
uint secondGoal = (tS.x & 1) == 1 ? 1 : 0;
secondGoal = (tS.x & 2) == 2 ? secondGoal | 4 : secondGoal;
secondGoal = (tS.x & 4) == 4 ? secondGoal | 16 : secondGoal;
secondGoal = (tS.x & 8) == 8 ? secondGoal | 64 : secondGoal;
secondGoal = (tS.x & 16) == 16 ? secondGoal | 256 : secondGoal;
secondGoal = (tS.x & 32) == 32 ? secondGoal | 1024 : secondGoal;
secondGoal = (tS.x & 64) == 64 ? secondGoal | 4096 : secondGoal;
secondGoal = (tS.x & 128) == 128 ? secondGoal | 16384 : secondGoal;
uint goal = firstGoal | secondGoal;
float value = goal;
return value;
}
Here is an example that moves an object around a circle using the hue value read from the texture.
//vert.c
// in vert shader
//------RGB to HSV -------
float3 hsv = rgb2hsv(color);
//------------------------
float rad = radians(hsv.x*360.0);
v.vertex.x += cos(rad);
v.vertex.y += sin(rad);
//------------------------
v2f o;
o.vertex = UnityObjectToClipPos(v.vertex);
o.uv = TRANSFORM_TEX(v.uv, _MainTex);
UNITY_TRANSFER_FOG(o,o.vertex);
return o;
Reconstructing the motion
Each vertex position is replaced with the position read from the texture.
//vert.c
appdata vert (appdata v, uint vid : SV_VertexID)
{
// Fetching the values from the texture is omitted; it is the same as for the staging
// Convert the 16-bit value (0 to 65535) back to the original position in the range -3.2767 m to 3.2767 m
float posX = unpackUintFromDoubleFloat4(oneSecondPosX, twoSecondPosX) / 10000.0f - 3.2767f;
float posY = unpackUintFromDoubleFloat4(oneSecondPosY, twoSecondPosY) / 10000.0f - 3.2767f;
float posZ = unpackUintFromDoubleFloat4(oneSecondPosZ, twoSecondPosZ) / 10000.0f - 3.2767f;
float3 pos = float3(posX, posY, posZ);
appdata o;
o.vertex = v.vertex;
// Replace the vertex position with the one read from the texture
o.vertex.xyz = pos;
o.uv = v.uv;
return o;
}
The damage done by encoding noise
Without degradation from encoding, the code above restores the mesh cleanly.
(Left: the mesh restored from the texture; right: the original avatar)
However, the information is badly degraded by encoding, so reading the encoded video as-is looks like this.
Some vertices fly far away, so the polygons containing them stretch to cover the whole view.
Not every vertex is corrupted, though; there are still normal polygons inside.
Countermeasures against encoding noise
So I filter out apparently outlying vertices with a geometry shader. (Not a very elegant solution...)
Triangles are filtered by the following three conditions.
- The ratio between the triangle's side lengths is extreme
- The distance between vertices is too large
- A vertex sits in a position that should never be used
The actual code is as follows (yes, the thresholds should not be hard-coded...).
//Filter.c
[maxvertexcount(3)]
void geom(triangle appdata IN[3], inout TriangleStream<g2f> triStream)
{
// Flag to omit this polygon
bool isBug = false;
// Get the length of each side of the triangle you are looking at
float sideLength0to1 = length(IN[0].vertex - IN[1].vertex);
float sideLength1to2 = length(IN[1].vertex - IN[2].vertex);
float sideLength2to0 = length(IN[2].vertex - IN[0].vertex);
float rateThreshold = 5.0;
// Filter: Erase if the ratio of sides is wrong
isBug =
sideLength0to1 > sideLength1to2 * rateThreshold ||
sideLength1to2 > sideLength2to0 * rateThreshold ||
sideLength2to0 > sideLength0to1 * rateThreshold
? true : isBug;
// Filter: If the distance between certain vertices is x[m] or more
float threshold = 0.4;
isBug =
sideLength0to1 > threshold ||
sideLength1to2 > threshold ||
sideLength2to0 > threshold
? true : isBug;
// Filter: If the vertex is out of range
for (int i = 0; i < 3; i++)
{
appdata v = IN[i];
isBug =
v.vertex.x > 1.0 ||
v.vertex.y > 2.0 ||
v.vertex.z > 1.0 ||
v.vertex.x < -1.0 ||
v.vertex.y < -1.0 ||
v.vertex.z < -1.0
? true : isBug;
}
[unroll]
for (int i = 0; i < 3; i++)
{
// Projectively transform each of the three vertices received from the vertex shader to determine the polygon position as in normal rendering.
appdata v = IN[i];
g2f o;
// If isBug is set, collapse the vertex to the origin (discarding would probably be cleaner)
o.vertex = isBug ? float4(0,0,0,0) : UnityObjectToClipPos(v.vertex);
o.uv = v.uv;
o.normal = UnityObjectToWorldNormal(v.normal);
triStream.Append(o);
}
}
After filtering, the mesh retains at least some of its original shape.
Reconstructing position and posture
This is used to reconstruct the position and orientation of objects that are not SkinnedMeshRenderers.
The object is rotated in the shader using the quaternion value read from the image.
The rotation is expressed by rotating the vector from the origin to each vertex, in the object's local coordinate system, with the quaternion.
//rotateWithQuaternion.c
float4 quatenionAxQuaternionB(float4 qa, float4 qb)
{
return float4(
qa.w * qb.x + qa.x * qb.w + qa.y * qb.z - qa.z * qb.y,
qa.w * qb.y - qa.x * qb.z + qa.y * qb.w + qa.z * qb.x,
qa.w * qb.z + qa.x * qb.y - qa.y * qb.x + qa.z * qb.w,
qa.w * qb.w - qa.x * qb.x - qa.y * qb.y - qa.z * qb.z
);
}
v2f vert (appdata v, uint vid : SV_VertexID)
{
// ----------- reading qx, qy, qz, qw from the texture is omitted -------------
float4 quaternion = float4(qx,qy,qz, qw);
float4 conjugateQ = float4(-qx, -qy, -qz, qw); // conjugate
float4 vertAsQ = float4(v.vertex.x, v.vertex.y, v.vertex.z, 0);
float4 rotatedPos = quatenionAxQuaternionB(quatenionAxQuaternionB(quaternion, vertAsQ), conjugateQ);
v2f o;
o.vertex = UnityObjectToClipPos(rotatedPos);
o.uv = v.uv;
return o;
}
Operating "#interpretation disagreement"
From here, I will describe how the live "#interpretation disagreement" was run using Omnipresence Live.
Overview
A VR live by "memex", the unit the author belongs to
Production team:
memex
Alan (@memex_aran): Vocal
Pibo (author) (@memex_pibo): Guitar
World effect design
Mikipom (@cakemas0227)
The two members performed a real-time session from their own homes, in separate locations
They also captured their own motion while performing
The live ran on multiple VRChat instances at the same time
An instance is a unit of space: a user can normally join only one instance at a time, and each instance has a cap on the number of people (roughly 60), so running the live across multiple instances means there is effectively no limit on the audience size.
Topics covered for "#interpretation disagreement"
•Remote session with live vocals and live guitar
•Per-track ("para") audio-reactive spatial staging that responds to each track making up a song
•Managing staging events fired at specific points in the music, including the live performance
•Remote motion capture
•Playing the guitar live while recording guitar motion
•Real-time display of tweets in the VR space
•A return monitor for remote motion recording
•Building a dedicated video streaming server on AWS
•Implementing the Omnipresence Live reconstructor in VRChat
•A spatial editing feature inside VRChat
System Configuration
About the delivered audio
The audio of the remote session was delivered using NETDUETTO, which enables low-latency sessions between performers in remote locations.
The guitarist plays guitar along with the backing track in their DAW and sends that audio to the vocalist via NETDUETTO.
The vocalist sings while listening to the backing track and guitar, and sends their voice back to the guitarist via NETDUETTO.
NETDUETTO can expose the mixed result of the session as a virtual audio input device, which is fed into the streaming software OBS Studio.
Audio-reactive staging
The video below shows the spatial effects when sounds are played in the order drum kick (twice), drum snare, then guitar.
To create effects that respond to the sound of each part making up a song, the volume of each part was expressed as color brightness using the event visualizer described above and distributed as video.
From the left, the blocks respond to the following tracks.
1. Kick
2. Snare
3. Hi-hat
4. Guitar (live): including pitch
5. Bass: including pitch
6. A prominent sound that can be swapped per song
7. A subdued sound that can be swapped per song
8. Vocals (live): including pitch
9. Harmony / chorus
In addition, from the top, four variants are output for the following purposes (see the sketch below).
1. The volume value as-is
2. The volume value with a gentle decay
3. An accumulation of the volume value (it gets brighter with each sound and restarts from black when it reaches maximum)
4. A fixed, maximum-brightness display so that the pitch can be read easily
By deciding in advance how many tracks would be sent and what role each would play, it was easier to coordinate with Mikipom, who designed the world effects.
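As a rough sketch of how outputs 1 to 3 could be derived from the raw volume each frame (the decay and accumulation rates here are assumptions, not the values used at the live; output 4 is simply a constant maximum):
//VolumeOutputBank.cs (sketch)
using UnityEngine;
public class VolumeOutputBank
{
    public float Raw;          // output 1: the volume as received over OSC (0 to 1)
    public float Decayed;      // output 2: the volume with a gentle decay
    public float Accumulated;  // output 3: builds up with each sound and wraps back to 0
    public void Step(float volume, float deltaTime)
    {
        Raw = volume;
        Decayed = Mathf.Max(volume, Decayed - 0.8f * deltaTime);          // assumed decay rate
        Accumulated = Mathf.Repeat(Accumulated + volume * deltaTime, 1f); // assumed gain, wrap at 1
    }
}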
Prerequisite
To create staging that responds to each part, you need the separated (para) data for each part.
This time we could easily prepare per-part stems because the songs are our own, but when that is not possible, I think you could still prepare roughly separated parts using tools such as iZotope RX7.
Routing
NETDUETTO's VST plug-in can output each performer's voice separately, so it is used to extract the remote vocal track on its own.
Settings on Ableton Live
The stem for each part is loaded into its own Ableton Live Suite track with its audio output muted.
Audio is actually output only from the backing track (a mix of the separately prepared parts) and from the vocal and guitar tracks that are input in real time.
By inserting the Max for Live plug-in that converts volume to OSC into each track, every track sends its volume value via OSC.
Managing staging events
The video below shows the staging where the live's opening SE ends, the intro of the first song begins, and the world itself appears.
In addition to the audio-reactive elements, to fire staging at specific points in a song (including the live performance), we placed triggers for staging events on the Ableton Live timeline.
A trigger converts a MIDI note / MIDI pitch bend into OSC, and the event visualizer places that OSC value on the image to fire the staging event.
Flow from design to actual playback
1. Mikipom, who is in charge of design, creates the staging elements using shaders only
2. Mikipom builds a demo in Unity's Timeline that drives the shader parameters in time with the song
3. Based on that timeline, we create a progress table of which parameter should move to which value at which timing
4. Each parameter is arranged so it can be expressed in the 0 to 1 range
5. Each parameter to be moved is assigned to a MIDI note number (do, re, mi, fa, sol, la, si...)
6. Based on the progress table, MIDI notes for driving the parameters are placed in Ableton Live
7. The 0 to 1 parameter value is expressed with a MIDI pitch bend on each MIDI note
8. The note number and pitch bend are sent via OSC
9. The event visualizer paints each parameter onto its own pixels
10. The shaders placed in the world read the colors of specific pixels and play the effects
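As a rough illustration of the receiving side of this note/pitch-bend scheme, something like the following could map the pair to a 0 to 1 parameter slot (the OSC addresses and the 0 to 16383 bend range are assumptions; in the actual system these values are painted onto the video by the event visualizer rather than applied directly):
//OSCStageEventReceiver.cs (sketch)
using UnityEngine;
using uOSC;
public class OSCStageEventReceiver : MonoBehaviour
{
    [SerializeField] private uOscServer server;
    /// parameters[note] holds the current 0..1 value of the parameter assigned to that note
    public float[] parameters = new float[128];
    private int lastNote;
    void Start()
    {
        server.onDataReceived.AddListener(OnDataReceived);
    }
    void OnDataReceived(Message message)
    {
        if (message.address == "/stage/note") // assumed address
        {
            int.TryParse(message.values[0].GetString(), out lastNote);
        }
        else if (message.address == "/stage/pitchbend") // assumed address
        {
            if (float.TryParse(message.values[0].GetString(), out var bend))
            {
                parameters[lastNote] = Mathf.Clamp01(bend / 16383f);
            }
        }
    }
}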
Progress table
This is the progress table of which material's parameters are moved, when, and to what values.
Staging event management
We prepared as many tracks as there are parameters to move.
MIDI notes are placed at the specific timings in the song.
MIDI notes for staging control
A MIDI note that moves a parameter representing an object's width from 0 to 1.
The note names listed in the Notes column correspond to material parameters.
For example:
- La: flag that displays the artists
- Si: hue of the entire world
- Do: floor height
Motion recording
Remote recording of vocal motion
Vocal motion recording was done using Virtual Motion Capture and EVMC4U.
Virtual Motion Capture is software that makes it easy to capture the motion of a VRM model with SteamVR-compatible devices and send that motion out via OSC.
EVMC4U is a set of scripts that applies the motion information sent by Virtual Motion Capture to a VRM model in Unity in real time.
For tracking devices we used an HTC VIVE CE and VIVE Trackers.
Virtual Motion Capture on the vocalist's PC sends the motion via OSC to the global IP of the author's home network, where it is received by EVMC4U on the author's PC.
With this setup, the motion plays back with almost no gap relative to the NETDUETTO audio.
Recording guitar motion while playing the guitar live
Two M5StickC units, one on the guitar and one on the right hand, plus an iPhone 11 are used for simple tracking of the guitar performance.
The M5StickC is a small, inexpensive microcontroller with a display and many built-in features such as an accelerometer and Wi-Fi.
The values sensed by the M5StickC are sent to the PC over Wi-Fi and used to drive the avatar's motion.
In the image, the left unit is the guitar-side M5StickC and the right unit is the right-hand-side M5StickC.
In the left photo the author is fretting a low position while looking down to the right; in the right photo the author is fretting a high position while looking up to the left.
This rig was originally built for an acoustic cover live stream we did in the past.
You can see the actual movement in the archive of that broadcast below.
【memex】アコースティック歌生放送!#めめなま【3000人記念】
Guitar
It senses the position of the left hand on the guitar's fingerboard and the posture of the guitar.
An ultrasonic sensor (HC-SR04) is used for the left-hand position, and the M5StickC's built-in accelerometer is used for sensing the guitar posture.
An ultrasonic sensor emits an ultrasonic pulse and determines the distance to a target from the time the echo takes to come back and the speed of sound.
By attaching it behind the guitar's headstock, the position of the left hand can be measured roughly.
As anyone who actually plays guitar knows, the wrist is not always directly behind the neck at the fretting position, so exact tracking is not possible, but it is enough to show that the left hand is moving.
Also, as you can see in the image above, my avatar's hands are abstracted, so an avatar with fingers would probably need additional work.
Final IK is used to pin the avatar's left hand to the guitar fingerboard and move it along it.
Final IK is an asset that uses inverse kinematics to make an avatar's hands reach target positions naturally.
The two white spheres in the foreground of the image mark where the avatar's left hand should be when fretting at the 1st fret and at the 12th fret.
The left hand moves along the straight line connecting the two spheres according to the distance value from the ultrasonic sensor.
The implementation looks like this:
//ultrasonicAndIMUWifi.c
#include <M5StickC.h>
#include <WiFi.h>
#include <WiFiUDP.h>
#include <OSCMessage.h>
/**
* For WiFi access
*/
const char ssid[] = "[SSID of your router]";
const char pass[] = "[password]";
static WiFiUDP wifiUdp;
static const char *kRemoteIpadr = "[Private IP of PC you want to receive]";
static const int kRmoteUdpPort = 8001; //Destination port
static const int kLocalPort = 7000; //Own port
boolean connected = false;
/**
* HCSR-04
*/
int Trig = 26;
int Echo = 36;
int Duration;
float Distance;
/**
* IMU
*/
float pitch = 0.0F;
float roll = 0.0F;
float yaw = 0.0F;
float temp = 0;
/**
* Setup
*/
static void WiFi_setup()
{
WiFi.begin(ssid, pass);
while( WiFi.status() != WL_CONNECTED) {
delay(500);
}
}
static void Serial_setup()
{
Serial.begin(115200);
Serial.println(""); // to separate line
}
static void Hcsr04_setup()
{
pinMode(Trig,OUTPUT);
pinMode(Echo,INPUT);
}
void setup() {
Hcsr04_setup();
Serial_setup();
WiFi_setup();
M5.begin();
M5.IMU.Init();
}
void loop() {
/**
* Distance measurement
*/
digitalWrite(Trig,LOW);
delayMicroseconds(1);
digitalWrite(Trig,HIGH);
delayMicroseconds(11);
digitalWrite(Trig,LOW);
Duration = pulseIn(Echo,HIGH);
if (Duration>0) {
Distance = Duration/2;
Distance = Distance*340*100/1000000; // ultrasonic speed is 340m/s = 34000cm/s = 0.034cm/us
OSCMessage msgDistance("/leftHand/distance");
msgDistance.add(Distance);
wifiUdp.beginPacket(kRemoteIpadr, kRmoteUdpPort);
msgDistance.send(wifiUdp);
wifiUdp.endPacket();
}
/**
* IMU measurement
*/
M5.IMU.getAhrsData(&pitch,&roll,&yaw);
M5.IMU.getTempData(&temp);
/**
* OSCSend
*/
OSCMessage msgPitch("/guitar/pitch");
msgPitch.add(pitch);
OSCMessage msgRoll("/guitar/roll");
msgRoll.add(roll);
OSCMessage msgYaw("/guitar/yaw");
msgYaw.add(yaw);
wifiUdp.beginPacket(kRemoteIpadr, kRmoteUdpPort);
msgPitch.send(wifiUdp);
wifiUdp.endPacket();
wifiUdp.beginPacket(kRemoteIpadr, kRmoteUdpPort);
msgRoll.send(wifiUdp);
wifiUdp.endPacket();
wifiUdp.beginPacket(kRemoteIpadr, kRmoteUdpPort);
msgYaw.send(wifiUdp);
wifiUdp.endPacket();
delay(33);
}
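On the PC side, the receiving end might look like the sketch below: the /leftHand/distance value is mapped onto the line between the 1F and 12F markers, and the Final IK effector is set up to follow leftHandTarget (the calibration constants are assumptions):
//LeftHandPositioner.cs (sketch)
using UnityEngine;
using uOSC;
public class LeftHandPositioner : MonoBehaviour
{
    [SerializeField] private uOscServer server;
    [SerializeField] private Transform fret1Marker, fret12Marker; // the two white spheres
    [SerializeField] private Transform leftHandTarget;            // followed by the Final IK effector
    [SerializeField] private float minDistanceCm = 5f, maxDistanceCm = 45f; // assumed calibration
    private float distanceCm;
    void Start()
    {
        server.onDataReceived.AddListener(OnDataReceived);
    }
    void OnDataReceived(Message message)
    {
        if (message.address == "/leftHand/distance")
        {
            float.TryParse(message.values[0].GetString(), out distanceCm);
        }
    }
    void Update()
    {
        // Place the IK target on the line between the two fret markers
        var t = Mathf.InverseLerp(minDistanceCm, maxDistanceCm, distanceCm);
        leftHandTarget.position = Vector3.Lerp(fret1Marker.position, fret12Marker.position, t);
    }
}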
Guitar posture
Using the M5StickC's accelerometer, the guitar's posture is captured, limited to specific axes.
An accelerometer is the kind of sensor used to detect whether a smartphone is held vertically or horizontally, and it can measure the device's posture with reasonable accuracy. (Rotation around the gravity axis cannot be measured.)
I measure how tilted the guitar is relative to the ground and apply that angle to the rotation of the guitar model.
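A simplified receiver for this could look like the sketch below, using the /guitar/pitch and /guitar/roll values sent by the M5StickC code above (the smoothing rate and axis mapping are assumptions that depend on how the unit is mounted):
//GuitarPostureReceiver.cs (sketch)
using UnityEngine;
using uOSC;
public class GuitarPostureReceiver : MonoBehaviour
{
    [SerializeField] private uOscServer server;
    [SerializeField] private Transform guitar;
    [SerializeField] private float slerpRate = 8f; // assumed smoothing
    private float pitch, roll;
    void Start()
    {
        server.onDataReceived.AddListener(OnDataReceived);
    }
    void OnDataReceived(Message message)
    {
        if (message.address == "/guitar/pitch")
        {
            float.TryParse(message.values[0].GetString(), out pitch);
        }
        else if (message.address == "/guitar/roll")
        {
            float.TryParse(message.values[0].GetString(), out roll);
        }
    }
    void Update()
    {
        // Yaw (rotation around the gravity axis) cannot be measured, so it is ignored
        var target = Quaternion.Euler(pitch, 0f, roll);
        guitar.localRotation = Quaternion.Slerp(guitar.localRotation, target, Time.deltaTime * slerpRate);
    }
}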
Right hand side
Using the M5StickC's accelerometer, the picking motion of the right hand is expressed.
The mechanism is the same as for the guitar posture.
With Final IK, the avatar's right hand is pinned near the guitar's pickups and only its rotation is driven.
Head
Face tracking on an iPhone 11 is used to move the avatar's head and hips.
The face tracking parameters are sent to the PC via OSC using the iPhone application "ZIG SIM Pro".
ZIG-Project https://zig-project.com/
The face rotation, which represents the posture of the face, is extracted from the face tracking parameters and applied to the avatar's Head bone and Spine bone.
Moving only the head with the base of the neck fixed looks a little unnatural, so the waist is moved at the same time.
The implementation looks something like this.
//OSCHeadAndSpineRotator.cs
using UnityEngine;
using uOSC;
public class OSCHeadAndSpineRotator : MonoBehaviour
{
float pitch, roll, yaw;
const string uuid = "[Device ID that can be confirmed in ZIG SIM]";
private Animator animator;
private Transform head, spine;
private Quaternion headInitialLocalRotation, spineInitialLocalRotation, preHeadLocalRotation, preSpineLocalRotation;
[SerializeField] Vector3 eularRotationOffset;
[SerializeField] float slerpRate = 10f;
[SerializeField] uOscServer server;
void Start()
{
server.onDataReceived.AddListener(OnDataReceived);
animator = GetComponent<Animator>();
head = animator.GetBoneTransform(HumanBodyBones.Head);
spine = animator.GetBoneTransform(HumanBodyBones.Spine);
headInitialLocalRotation = head.localRotation;
spineInitialLocalRotation = spine.localRotation;
}
void OnDataReceived(Message message)
{
if (message.address == "/ZIGSIM/" + uuid + "/facerotation")
{
Quaternion q = new Quaternion(
float.Parse(message.values[0].GetString()),
float.Parse(message.values[1].GetString()),
float.Parse(message.values[2].GetString()),
float.Parse(message.values[3].GetString())
);
var thisFrameHeadLocalRotation = Quaternion.Slerp(preHeadLocalRotation, headInitialLocalRotation * q * Quaternion.Euler(eularRotationOffset), Time.deltaTime * slerpRate);
var thisFrameSpineLocalRotation = Quaternion.Slerp(preSpineLocalRotation, spineInitialLocalRotation * q * Quaternion.Euler(eularRotationOffset), Time.deltaTime * slerpRate);
// Rotate the head about 80% of the obtained rotation and rotate the hips about 40% (this value is preferred)
head.localRotation = Quaternion.Lerp(headInitialLocalRotation, thisFrameHeadLocalRotation, 0.8f);
spine.localRotation = Quaternion.Lerp(spineInitialLocalRotation, thisFrameSpineLocalRotation, 0.4f);
preHeadLocalRotation = thisFrameHeadLocalRotation;
preSpineLocalRotation = thisFrameSpineLocalRotation;
}
}
}
Real-time display of tweets in the VR space
The tweet text is drawn directly onto the distribution video, and on the reconstructor side that region is displayed in the space as a transparent image.
Hashtag tweets were refreshed roughly every 10 seconds using Twity, a library for using the Twitter API from Unity.
GitHub-toofusan/Twity: Twitter API Client for Unity C# (ex-name: twitter-for-unity) https://github.com/toofusan/Twity
The text is displayed on a Canvas that is rendered directly into the output image.
The appearance animation could have been added on the reconstructor side, but I did not want the reconstructor to hold any state, so the animation is applied before the text is composited onto the distribution video.
I used a plugin called Text Juicer for the character animation.
GitHub-badawe/Text-Juicer: Simple tool to create awesome text animations https://github.com/badawe/Text-Juicer
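The polling loop has roughly this shape (FetchLatestTweets() here is a placeholder for the actual Twity search call; see the Twity repository for its exact API):
//TweetPoller.cs (sketch)
using System.Collections;
using UnityEngine;
public class TweetPoller : MonoBehaviour
{
    [SerializeField] private float intervalSeconds = 10f;
    private IEnumerator Start()
    {
        while (true)
        {
            // Placeholder: search the hashtag via Twity and push the resulting
            // texts to the Canvas that is rendered onto the distribution video
            yield return FetchLatestTweets();
            yield return new WaitForSeconds(intervalSeconds);
        }
    }
    private IEnumerator FetchLatestTweets()
    {
        yield break; // replace with the Twity search coroutine
    }
}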
Return monitor
A return monitor was prepared so we could check how the avatars looked and see Unity's console output.
The return monitor was needed mainly for the following two purposes.
- To confirm that the intended motion is being reflected
- To read tweets aloud during the MC segments of the live
Using Unity's multi-display feature, it is shown on screen as a window separate from the one displaying the distribution texture.
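For reference, activating the extra display uses Unity's Display API; which index corresponds to the return monitor screen is an assumption here:
//ReturnMonitorDisplay.cs (sketch)
using UnityEngine;
public class ReturnMonitorDisplay : MonoBehaviour
{
    void Start()
    {
        // Activate a second physical display; a camera with Target Display 2
        // then renders the return monitor there
        if (Display.displays.Length > 1)
        {
            Display.displays[1].Activate();
        }
    }
}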
The console output is overlaid on a Canvas on top of the avatar view, so that during the live we could see at a glance how our avatars were moving and which tweets were being shown to the audience.
For the console output display, I referred to this article.
【Unity】 I want to display Debug.Log on the game screen! -Back dried fish https://www.urablog.xyz/entry/2017/04/25/195351
The same return monitor is shared with the vocalist by sending its window to the vocalist's PC via Discord screen sharing.
HLS server
For various reasons, the video was delivered from a dedicated streaming server built on AWS.
The configuration is as follows.
Following the articles below, I proceeded with almost the same steps and had it working in about four hours. (AWS is great.)
I tried live distribution with OBS and AWS Elemental MediaLive | Developers.IO https://dev.classmethod.jp/articles/live-aws-elemental-medialive-with-obs/
Using Amazon CloudFront with MediaPackage-AWS Elemental MediaPackage https://docs.aws.amazon.com/ja_jp/mediapackage/latest/ug/cdns-cf.html
Making it playable in VRChat
1. Place the live objects represented by the reconstructor shaders in the world (artist models, stage, GPU particles, and so on)
2. Use the VRChat SDK component VRC_SyncVideoStream to play the HLS stream URL served via CloudFront
3. Capture the played video with a camera placed in the world and write it to a 1920 × 1080 RenderTexture
4. Assign the RenderTexture from step 3 as the texture of each object's material
Enabling spatial editing in VRChat
As shown in the image below, almost every object can be picked up and repositioned by hand.
This is reflected only for spectators in the same instance.
- Attach the VRChat SDK component VRC_Pickup to the objects
What was tough
That is all for the explanation. Here are some notes on the parts that were hard.
Shaders are hard
I had heard that motion could be expressed with images by using VAT, which animates meshes in a shader, so I started studying shaders, but at first I could not understand them at all.
- I would not have been able to read a single line without this:
- Unity Shader Programming Vol.01 (v.2.2.1) [PDF] - XJINE's - BOOTH https://booth.pm/ja/items/931290
- I could not find a good reference
- I still do not know the best way to look things up: HLSL? CG?
When you want to keep something secret, you cannot ask anyone
I had been building this since around December, and there were things that probably would have been solved in an instant if I could have asked someone, so I kept getting stuck.
It is hard to gather information when you want to keep what you are making a secret...
Encoding is scary
I intended to dig deeper into H.264 encoding, but I gave up.
There are too many variants, and the documentation is too long.
All that stuck with me is that pixels seem to be grouped into blocks to some extent and that saturation tends to get attenuated.
Pixel shift in Unity Capture?
The image fed into OBS via Unity Capture was shifted by a few pixels and could not be used for this purpose, so I built the Unity project and captured its window with OBS window capture instead.
It was nerve-racking whenever a problem came up, since a single failure would have meant restarting everything...
In conclusion
In the not-so-distant future, I want it to be possible for a single user to hold a VR live in real time with members in remote locations, in a venue with audio-reactive staging designed however they like, and with any number of spectators attending at the same time.
As of July 2020 there are still many problems, such as the difficulty of capturing motion while playing an instrument live and of keeping the audio and the spatial staging on the same timecode, so holding this kind of VR live is not easy.
For me as an engineer, "#interpretation disagreement" was a challenge to realize the future of VR lives that I want, using technology available to an individual user.
I hope the day will come when people can say, "It used to take this much trouble to do a live like that."
References
GitHub - sugi-cho/Animation-Texture-Baker: https://github.com/sugi-cho/Animation-Texture-Baker
Calculating fluids without scripts in Unity - EL-EMENT blog: http://el-ement.com/blog/2018/12/13/unity-fluid-with-shader/
Bit operation summary - Qiita: https://qiita.com/qiita_kuru/items/3a6ab432ffb6ae506758