Leo's Home page -- Github Page -- License: CC BY-SA 4.0
This document contains a complete description step by step of the process and ideas behind the proposed encodings, it might be long.
The ideas here were inspired by the work I was doing to help with Few Shot Music Generation project. My fork is located here
The encoding instead of being a huge one-hot vector (dimension 4708) it can be separated in several inputs each of them a one-hot vector of lower dimensionality and a time encoding given by a sum of sinusoids with different period.
The source code is provided to automatically do the encoding as a python package with numpy.
This current analysis tries to find another alternative encoding to try in conjunction with the previous one.
Current encoding based in the Magenta project has the following properties:
The idea of the current analysis is to discuss the following points:
The current encoding separates a TIME_SHIFT event as a separate event, in my opinion the time of an event is an intrinsic property of it, so separating it might make more difficult for the network to learn what the time means.
The other issue is that similar structures (in music as well as in other kind of languages) might repeat at different time scales, the fact that the time event is treated as one separate event with it's own duration (actually 100 different types of TIME_SHIFT event) might make it harder to the network to detect patterns. This added to the high number of input features (4708) makes me think that the training time and number of samples needed for the training might be big.
# current input one-hot vector dimension
16*128*2+32*16+100
4708
In this study I propose another way of encoding that I would like to try reducing the number of features between 1 and 2 orders of magnitude and providing a Fourier series approximation inspired time and velocity approximation.
I know that is not a traditional encoding, but I think might be worth trying (specially as I don't have so many computational resources)
The basic idea is that a feature vector will be composed of several different features (7) encoded in different ways
From a midi file extract:
Event = [instrument, octave, note(s), action (on|off), velocity, time]
As there might be more than 1 note, octave or instrument event at the same this could be dealt as:
As MIDI events are independent (each note on|off event of a single instrument is a different event) and for the purpose of this encoding proposal the solution would be to do actual one-hot encoding and separate all the events. A single event will correspond to a single instrument and note event with the corresponding velocity and time encoding.
# computation of the number of octaves needed for the encoding
import numpy as np
np.ceil(128/12.)
11.0
# dimensionality analysis -> basic (one)-hot encoding
16 + 11 + 12 + 2
41
The basic coding of the event will contain a minimum of 41 dimensions to which must be added the encoding of the velocity and time encoding
Both solutions are presented here, One-Hot Encoding and Fourier Inspired Encoding.
In this case the same number of velocities will be used as in the current approach, this means that the velocity vector will have a dimension of 32
The resolution that is looked at is to be able to differentiate at least 32 different velocities.
As a description later on time encoding analysis will show, we can deal with a resolution of 256 velocities with $ 256 = 2^8 $ sinusoidals which would take the dimensionality of the input to $ 41 + 8 = 49 $
For a resolution of 32 velocities we need only $ 32 = 2^5 $ for a the dimensionality of the input to $ 41 + 5 = 46 $
2**8, 2**5
(256, 32)
Let's do a simple linear function, this can not hold as it diverges
Let's imaging an encoding where the time is cutted in 3 different parts:
the functions will be linear in a range, but will have undefined elements at the borders, the derivatives are undefined there and the NNs are shown that they do not behave nice with these kind of functions
Then we can think of a continuous function, sinusoidal ones are really good for this, well studied and implemented in every standard library.
Now let's think for a moment from the point of view of listening to music (instead of time passing) for this we'd like a time resolution:
Note: There are studies on real-time latency issue but is not of our concern here.
For a perfect reconstruction the sampling frequency to discretize a function is given by the Nyquist theorem:
If a function x(t) contains no frequencies higher than B hertz,
it is completely determined by giving its ordinates at a series
of points spaced 1/(2B) seconds apart.
Which means that for a perfect reconstruction for a given sample rate $fs$ is guaranteed for a Bandlimit $B < fs/2$
Let's imagine that we want to approximate the time for a minimum and maximum time interval, we could use sum of sinusoids to approximate it.
In our case $B = 100Hz $ for the maximum frequency (100 times 10 ms for 1 sec), hence the sampling rate would be 5ms, taking in account that also there will be reconstruction errors and some other issues we could argue that $ fs<5ms $ let's say $ 4ms $ to be sure
And for the case of the 10 minutes one, this gives us $ 10*60 = 600 s $ so $ 1/600 ~ 0.00167Hz => fs < 0.000833Hz $
As a reference the human hearing resolution is about $ 20KHz $ and the CD sampling is $ 44.1KHz $
With a sampling in each resolution level we should be able to reconstruct the patterns at different time-scales.
f = (1./600)
fs = f/2
f,fs
(0.0016666666666666668, 0.0008333333333333334)
We are using time as reference, Not all the existing time but just a part of it, a range that is big enough to accomodate our work. In the case of music a few minutes (~10?) would suffice. In the case of lyrics they are even shorter if we count the number of letters than the number of milliseconds in a music piece.
Periodic functions should be easier for a NN to learn to include periodic elements (languages have periodic elements) than a linear one.
The Fourier Series can represent periodic functions with sums of sinusoidal functions.
To be able to represent a linear function with a Fourier Series, we can think of a sawtooth wave and just take only one period.
(TODO graph here)
Euler's Formula \begin{equation} e^{ix} = cos(x) + i\sin(x) \end{equation}
\begin{equation} cos(x) = \frac{e^{ix} + e^{-ix} } {2} \end{equation}\begin{equation} sin(x) = \frac{e^{ix} - e^{-ix} } {2 i } \end{equation}Fourier Series
\begin{equation} f(x) = \sum_{n=-\infty}^\infty(c_n(f) e^{i2\pi {n/T} x} ) \end{equation}where \begin{equation} c_n(f) = \frac{1}{T} \int_{-T/2}^{T/2} ( f(t) e^{-i2\pi {n/T} x} ) \end{equation}
In our case there is no need to go so far as using the Fourier Series but this gives an inspiration on how to encode the values. Also we don't want (or even need):
We could use either the Complex Domain values or go for the Discrete Cosine Transform but we'll do something simpler that is choosing sinusoidals
Let's come back to time and think how we can encode the time, which is a linear function that just grows.
For this analysis we constrain the time to be only a segment between $ 0s $ and a max value $ Ms $ in seconds
To be able to recreate the original function we would need a series of functions that can represent elements that go from the maximum to the minimum resolution.
Let the first sinusoidal represent the values up to the period $ T = 5ms = 0.005s $ and the last one $ Ms $
Now let's work directly in milliseconds and forget about the unit, which for the purpose of the experiment does not affect the result as long as we can reconstruct the input correctly.
The main idea:
To mix Fourier Series with the Shannon sampling, trying to reconstruct not only the input signal patterns at each time scale, but being able to reconstruct the time signal (a linear function) restrained to a period interpreting it as a SawTooth-Wave, which means that we should be able not only to give different time scales to the network but also be able to reconstruct the (locally) linear signal in an easy way, which should be needed to be able to re-encode a MIDI or text file from the output of the network.
We can encode any value between $ [0,5]ms $ , for the next we can duplicate the period $ T $ and we'll get the next resolution level. Advancing like this we need at least $ n $ values such as $ 2^n >= M $ and for the minimum resolution we need at most $ n $ values such as $ 2^n < 5 $
Finding the exponents can be quickly achieved with a bit shifting trick:
I've found an error in the reasoning (to verify) I need to do one approximation PER SCALE not for all together ....
def get_upper_pow2(M):
i = 1
n = 0
while (i <= M):
i = i << 1
n+=1
return n,i
def get_lower_pow2(M):
i = 1
n = -1
while (i <= M):
i = i << 1
n+=1
return n,i
for m in range (1,11): # iterating to show the vector dimension needed for each minute increment resolution
M = m * 60 * 1000 # 10 minutes, 60 seconds per minute 1000 ms -> putting all into
n,i = get_upper_pow2(M)
print( m, n, i)
1 16 65536 2 17 131072 3 18 262144 4 18 262144 5 19 524288 6 19 524288 7 19 524288 8 19 524288 9 20 1048576 10 20 1048576
# and the smallest resolution:
get_lower_pow2(5)
(2, 8)
Which means that we can encode 10 minutes worth of music with a $ 20 - 2 = 18 $ dimensional vector, $ 36 $ if we count a complex-value vector.
For a total of $ 46 + 36 = 82 $ dimensions instead of $ 4708 $
Let's think as an encoding as the $sin$ and $cos$ values, and the network works directly on this domain giving only the $sin$ and $cos$ values as results, this allows the network to work in sinusoidal periodic functions instead of the time domain.
The decoding will take place just later to be able to do a reconstruction.
The series needed then is one representing a SawTooth Wave but restrained only to the time period we need. This allows us to represent periodic elements inside a linear function.
The other advantage of this function is that we can approximate it only with a real valued function, avoiding the complex domain and restraining the number of coefficients (and so dimensions) needed to only the real valued ones.
This gives is a total of $ 46 + 18 = 64 $ dimensions for a complete representation of the input vector
Which should be also great about this encoding is that if we create hierarchical representations, and each layer in the hierarchy corresponds to a different base resolution (ms, s, minutes, ...) then with limited amount of extra dimensions we could encode increasingly longer time periods.
def get_upper_pow2(M):
i = 1
n = 0
while (i <= M):
i = i << 1
n+=1
return n
def get_lower_pow2(m):
i = 1
n = -1
while (i <= m):
i = i << 1
n+=1
return n
class Lin2WaveEncoder():
"""
Fourier Series Inspired Approximation of a linear function encoding.
Takes the linear function as one period of a SawTooth-Wave and encodes
the period with enough elements to approximate the function wiht the
resolution needed for a (close to) perfect reconstruction of the input
"""
def __init__(self, min_val, max_val, neg_allowed=True):
"""
@param min_val: minimum value to encode
@param max_val: maximum value to encode
@param neg_allowed=True : allow negative values in the encoded vecto. If False
will move the sinusoids to the range [0,1]
"""
self.min = min_val
self.max = max_val
self.neg_allowed = neg_allowed
self.n_low = np.max(1, get_lower_pow2(min_val))
self.n_high = get_upper_pow2(max_val)
self.n = self.n_high - self.n_low
# Each period indicates a resolution level, this allows for different (time) scales
self.periods = [2**i for i in range(self.n_low, self.n_high)] # choose periods acording to sampling -> maybe we should add 3rds and 5ths of each to allow for more complex time pattern compositions
# Fourier divisor coefficients of the Series
self.coefficients = [2*i for i in range(self.n_low, self.n_high)] # just counting the elements
print(self.coefficients, self.periods)
def encode(self, x):
"""
@param x: input vector to encode
@return vector encoding the input x in the value points of x for the sinusoidal encoders
"""
vec = []
for n,T in zip(self.coefficients, self.periods):
val = np.stack( np.sin(n * x / T) ) # base term of a SawTooth Wave of period T Fourier Series
#val = np.stack( np.sin( x / T) ) # it seems to work better for the mix without the constant
vec.append(val)
ret = np.stack(vec)
if not self.neg_allowed:
ret = (ret + 1. ) / 2. # pull all the encoded values in range [0,1]
return ret
def decode(self, vec):
"""
@param vec: multi-dimensional input vector to decode into the original signal
"""
x = 0
scale_factor = []
for n,T in zip(self.coefficients, self.periods):
scale_factor.append((1./n)*T) # Scale factor of each term
#scale_factor.append(T) # Scale factor of each term
scale_factor = np.array(scale_factor)
tx = vec
tx = tx * scale_factor[:,None]
x = np.sum(tx, axis=0)
if not self.neg_allowed: # here as is commutative to make less computations
x = (x * 2.) -1 # go back to the original range
return x
we = Lin2WaveEncoder(1,1000000) # 10 seconds max -> 13 parameters
[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38] [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144, 524288]
l_short = np.linspace(0,np.pi,10000) # for short periods the reconstruction works great -> this is already usable if the input is scaled to these kind of elements
l_long = np.linspace(0,10000,1000) # for long periods this starts to work less and less well
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(l_short)
[<matplotlib.lines.Line2D at 0x7f47ffe61940>]
plt.plot(l_long)
[<matplotlib.lines.Line2D at 0x7f480802bf28>]
el_short = we.encode(l_short)
el_long = we.encode(l_long)
el_short.shape, el_long.shape
((13, 10000), (13, 1000))
dl_short = we.decode(l_short)
dl_long = we.decode(l_long)
dl_short.shape, dl_long.shape
((10000,), (1000,))
plt.plot(dl_short)
[<matplotlib.lines.Line2D at 0x7f4808024278>]
plt.clf()
plt.plot(dl_long)
[<matplotlib.lines.Line2D at 0x7f4807f87978>]
WARNING : there is stil something wrong with the scale of the reconstruction. Also, there is an issue that now I have not managed to reproduce and is the deformation that occurred before with the samples .... I have to find out what is there
def fourier10000(x):
ret = -10000*np.sin(np.pi*x/5000)/np.pi - 5000*np.sin(np.pi*x/2500)/np.pi - 10000*np.sin(3*np.pi*x/5000)/(3*np.pi) - 2500*np.sin(np.pi*x/1250)/np.pi - 2000*np.sin(np.pi*x/1000)/np.pi - 5000*np.sin(3*np.pi*x/2500)/(3*np.pi) - 10000*np.sin(7*np.pi*x/5000)/(7*np.pi) - 1250*np.sin(np.pi*x/625)/np.pi - 10000*np.sin(9*np.pi*x/5000)/(9*np.pi) - 1000*np.sin(np.pi*x/500)/np.pi
return ret
fel_short = fourier10000(l_short)
fel_long = fourier10000(l_long)
plt.plot(fel_short)
[<matplotlib.lines.Line2D at 0x7f4807efb668>]
plt.plot(fel_long)
[<matplotlib.lines.Line2D at 0x7f4807ee05c0>]
There are already available libraries that can compute the fourier series approximation for a function, so this would make it easier to compute.
Here there are some tests with sympy to compute it.
import sympy
from sympy import fourier_series, pi
from sympy import abc
from sympy.plotting import plot as spplot
M = 1 # seconds
sm = fourier_series(abc.x, (abc.x, 0, M))
sm10k = fourier_series(abc.x, (abc.x, 0, 10000))
sm.truncate(10)
-sin(2*pi*x)/pi - sin(4*pi*x)/(2*pi) - sin(6*pi*x)/(3*pi) - sin(8*pi*x)/(4*pi) - sin(10*pi*x)/(5*pi) - sin(12*pi*x)/(6*pi) - sin(14*pi*x)/(7*pi) - sin(16*pi*x)/(8*pi) - sin(18*pi*x)/(9*pi) - sin(20*pi*x)/(10*pi)
from sympy import lambdify
fsm = lambdify(abc.x, sm.truncate(10), "numpy")
fsm(1)
7.796343665038751e-16
sm.truncate(10).args
(-sin(2*pi*x)/pi, -sin(4*pi*x)/(2*pi), -sin(6*pi*x)/(3*pi), -sin(8*pi*x)/(4*pi), -sin(10*pi*x)/(5*pi), -sin(12*pi*x)/(6*pi), -sin(14*pi*x)/(7*pi), -sin(16*pi*x)/(8*pi), -sin(18*pi*x)/(9*pi), -sin(20*pi*x)/(10*pi))
sm.truncate(10).args[0].args
(-1, 1/pi, sin(2*pi*x))
sm10k.truncate(10).args
(-10000*sin(pi*x/5000)/pi, -5000*sin(pi*x/2500)/pi, -2500*sin(pi*x/1250)/pi, -2000*sin(pi*x/1000)/pi, -1250*sin(pi*x/625)/pi, -1000*sin(pi*x/500)/pi, -10000*sin(3*pi*x/5000)/(3*pi), -10000*sin(7*pi*x/5000)/(7*pi), -10000*sin(9*pi*x/5000)/(9*pi), -5000*sin(3*pi*x/2500)/(3*pi))
f1 = sm10k.truncate(10).args[0]
np.product(f1.args[:-1])
-10000/pi
coef_fact = [ (np.product(f.args[:-1]), lambdify(abc.x, f.args[-1], "numpy") ) for f in sm10k.truncate(10).args]
z1,z2 = zip(*coef_fact)
z1
(-10000/pi, -5000/pi, -2500/pi, -2000/pi, -1250/pi, -1000/pi, -10000/(3*pi), -10000/(7*pi), -10000/(9*pi), -5000/(3*pi))
lambdify(abc.x, z1[0], "numpy")(0)
-3183.098861837907
pi.evalf()
3.14159265358979
sm10k.truncate(10)
-10000*sin(pi*x/5000)/pi - 5000*sin(pi*x/2500)/pi - 10000*sin(3*pi*x/5000)/(3*pi) - 2500*sin(pi*x/1250)/pi - 2000*sin(pi*x/1000)/pi - 5000*sin(3*pi*x/2500)/(3*pi) - 10000*sin(7*pi*x/5000)/(7*pi) - 1250*sin(pi*x/625)/pi - 10000*sin(9*pi*x/5000)/(9*pi) - 1000*sin(pi*x/500)/pi
sm.scale(10000).truncate(10)
-10000*sin(2*pi*x)/pi - 5000*sin(4*pi*x)/pi - 10000*sin(6*pi*x)/(3*pi) - 2500*sin(8*pi*x)/pi - 2000*sin(10*pi*x)/pi - 5000*sin(12*pi*x)/(3*pi) - 10000*sin(14*pi*x)/(7*pi) - 1250*sin(16*pi*x)/pi - 10000*sin(18*pi*x)/(9*pi) - 1000*sin(20*pi*x)/pi
sm.truncate(10).evalf(1)
-0.3*sin(2*pi*x) - 0.2*sin(4*pi*x) - 0.1*sin(6*pi*x) - 0.08*sin(8*pi*x) - 0.06*sin(10*pi*x) - 0.05*sin(12*pi*x) - 0.05*sin(14*pi*x) - 0.04*sin(16*pi*x) - 0.04*sin(18*pi*x) - 0.03*sin(20*pi*x)
s = fourier_series(abc.x, (abc.x, 0, pi))
s.scale(10).truncate(10)
-10*sin(2*x) - 5*sin(4*x) - 10*sin(6*x)/3 - 5*sin(8*x)/2 - 2*sin(10*x) - 5*sin(12*x)/3 - 10*sin(14*x)/7 - 5*sin(16*x)/4 - 10*sin(18*x)/9 - sin(20*x)
s.truncate(20)
-sin(2*x) - sin(4*x)/2 - sin(6*x)/3 - sin(8*x)/4 - sin(10*x)/5 - sin(12*x)/6 - sin(14*x)/7 - sin(16*x)/8 - sin(18*x)/9 - sin(20*x)/10 - sin(22*x)/11 - sin(24*x)/12 - sin(26*x)/13 - sin(28*x)/14 - sin(30*x)/15 - sin(32*x)/16 - sin(34*x)/17 - sin(36*x)/18 - sin(38*x)/19 - sin(40*x)/20