Leo's Home page -- Github Page -- License: CC BY-SA 4.0
I've been meaning to finish (for some definition of finish) and publish this for a couple of years now, but I think it's better to publish it now than to keep waiting, as I've moved on to many other projects and also have a busy life.
This needs more sorting, and the results should be written down, re-run and updated with the latest technologies to make it complete, but the ideas and main conclusions do still hold.
Many other references exist and the subject has advanced with models containing several billion parameters. The current publication and its ideas remain relevant, as they concern pre-processing stages before the network input layer.
Currently NLP deals with Out of Vocabulary (OOV) input in different ways; this leads to several, not necessarily efficient, ways of pre-processing NLP datasets to handle those input vocabularies, and to the need to rethink the issue for every new application.
In this work we present different techniques to deterministically encode character- and token-level input that deal with OOV and allow encoding all possible symbols, covering all (or part) of the UTF-8 domain in a fixed-size vector in a computationally efficient manner. This input can be used as-is, or it can be compressed and efficiently pre-trained over a large vocabulary with an autoencoder.
These techniques eliminate the need for complex and compute-intensive pre-processing to find the existing tokens in the domain, replacing it with a much simpler one that works for any dataset within the initially defined character subset (here we study UTF-8 and some subsets of it). This work focuses on being able to encode a symbol; once it is encoded, the network that takes it as input can be fine-tuned without further modification.
We are not looking to win the SoTA race to the bottom; we are looking for alternatives that expand knowledge and model usability under restricted resource usage, and for more task variability in a single trained model.
Note: there is a great read about SoTA and Leaderboards here, by Anna Rogers.
Currently, for NLP tasks there is the need to first analyze the input domain, extract the tokens and train an embedding network, while also dealing with Out of Vocabulary (OOV) words or symbols and with polysemy.
It is important to separate (and we do in this work) the encoding part (being able to represent the symbol) from learning to use those symbols (the network being able to do something useful with them), as this work focuses solely on being able to encode all feasible symbols in a defined text-encoding domain. In this case the work is done for UTF-8, which is the most used text encoding on the web (94.6% according to w3techs).
Diverse techniques deal differently with OOV, ranging from techniques that can not deal with it at all, like GloVe - Pennington et al. 2014 or Word2Vec - Mikolov et al. 2013, to others such as Universal Sentence Encoder - Cer et al. 2018 or FastText, which can encode OOV words with subword embeddings. One of the most used techniques is Byte Pair Encoding from Neural Machine Translation of Rare Words with Subword Units - Sennrich et al. 2015, which has the advantage of compressing the input vector size compared to character-based models, hence accelerating training compared to a full character-level input on the current SoTA, while at the same time having fewer tokens than word-level models, a good compromise on both sides.
All current methods deal with subdomains of the possible inputs, which for most tasks is enough; nevertheless the weakness is that they can not deal with all possible input symbols, which for the current study means UTF-8. This work is inspired by a previous work by the same author, ROCHA 2018 MIDI encoding analysis.
In the case of continual learning, the need to add new symbols is a given, be it due to adding a new domain in the same language, new languages, or previously non-existent symbols that get added to the UTF-8 encoding.
It is also worth noting that we are looking for over-represented vector spaces with redundant information expressed in different ways, which the network could exploit differently on a per-case basis.
This work analyzes the UTF-8 encoding and presents a technique to encode all or part of the UTF-8 domain in a computationally efficient way. The same technique can be used for other text encodings without any modification, and as UTF-8 is a superset of other encodings (ASCII, ...), the same matrix encoding can be applied without modification to those datasets.
Notes:
Note 1:
It is worth noting here that the author also has a background in communications, which allowed, during the course of this research, the analysis of multiple Error Correcting Codes (ECCs) and different kinds of encoding (for example, encoding as a Fourier series). The conclusion is that even if one-hot is the best in terms of distance, other codes can be used, and a sparse multi-hot is the simplest to implement (and the fastest to encode, although no comparison is presented here). As a note, one pending task is to analyze ECCs in an end-to-end manner for a neural network. Some of these analyses (without proper formatting) can be found in the notebooks folders under the text subfolder of the minibrain project, where most of the experimental code is located.
Note 2:
There is an interesting pre-print, Controllable Variational Autoencoder, that uses Control Theory to improve Variational Autoencoders, and it seems to use one of the things I didn't know how to add to these codes.
One-hot encoding is one of the most used ways to encode categorical variables; in State of the Art NLP tasks it is used to encode the input symbols. This is computationally expensive, and the goal here is to reduce this complexity, leaving memory and computational space for other, more complex tasks in the network.
As this work tries to encode all the characters representable in UTF-8, we have to check the feasible number:
From Wikipedia utf-8
UTF-8 is a variable-width character encoding capable of encoding all 1,112,064 valid code points:
$$ 17 \times 2^{16} = 1114112 $$
code points minus 2,048 technically-invalid surrogate code points.
That is, if encoding with one-hot we would need 1.1M parameters per neuron in the input layer, which is expensive. The goal is to reduce this complexity (which we argue is unnecessary) by orders of magnitude.
The entire UTF-8 universe is NOT the entire $2^{32}$ domain; there are limitations explained in the UTF-8 description:
Number of bytes | Bits for code point | First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|---|---|
1 | 7 | U+0000 | U+007F | 0xxxxxxx | |||
2 | 11 | U+0080 | U+07FF | 110xxxxx | 10xxxxxx | ||
3 | 16 | U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
4 | 21 | U+10000 | U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
The UTF-8 code is formed by 4 segments (the rows of the table above); we will refer to this often throughout the current work.
The point is that the number of elements in the table is at most $2^{21}$; we only need an index that can handle the 4 cases, which can be done with 4 different conversion tables.
In fact it is possible to just cut the UTF-8 value into chunks and do a one-hot per part:
The prefix of each byte can be stripped and replaced by the initial segment one-hot.
So a complete code would be: $ 4 + 256 + 64 + 64 + 64 = 452 $
Instead of having a vector of dimension 1,112,064 to encode any utf-8 value, one with dimension 452 would be able to encode everything in the utf-8 domain.
This embedding can still be reduced, but it should already be sparse enough to make a good input. The goal here is to have a sparse vector that keeps each vector far enough from the others, at least by one dimension. Adding the redundancy code (the first 4 dimensions) makes the distance even bigger for vectors that should be further apart, while taking into account the locality of the UTF-8 encoding, since each character set is close to the ones used with it.
For example:
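A minimal sketch of this segmented multi-hot, assuming NumPy (an illustration of the idea, not the repository's `sparse_encoders` implementation):

```python
# Illustrative sketch only: 4 (segment indicator) + 256 (byte 1) + 3 * 64 (continuation
# bytes without their '10' prefix) = 452 dimensions.
import numpy as np

def utf8_multihot(ch: str) -> np.ndarray:
    raw = ch.encode("utf-8")
    vec = np.zeros(4 + 256 + 3 * 64, dtype=np.uint8)
    vec[len(raw) - 1] = 1                 # which of the 4 segments is used (redundancy)
    vec[4 + raw[0]] = 1                   # first byte: one-hot over the full 256 values
    for i, byte in enumerate(raw[1:]):    # continuation bytes are 10xxxxxx
        vec[4 + 256 + i * 64 + (byte & 0b00111111)] = 1
    return vec

print(utf8_multihot("é").shape, utf8_multihot("é").sum())   # (452,) 3
```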
To cut memory consumption even more, the table can be generated for 1-4 segments of the UTF-8 code, taking into account that the 4th segment is mostly composed of:
We can safely ignore this 4th segment (for the purpose of this article and most usages) even though it accounts for most of the code points.
In this particular analysis (due to resource consumption) CJK, Indic and some Miscellaneous Symbols are not (and will not be) needed, so the 3rd segment can safely be ignored too, reducing the memory consumption of the application even more.
So the result would be:
Segment | # of code points | First index | Last index | Vector Size | # code exceptions | Size on Disk (MB) | Matrix Size (MB) | Sparse Size (MB) |
---|---|---|---|---|---|---|---|---|
4 | 1107904 | 61440 | 1107904 | 452 | 790656 | 11538.59 | 3820.59 | 83.59 |
3 | 59328 | 2112 | 61439 | 388 | 4224 | 530.71 | 175.62 | 3.59 |
2 | 1984 | 128 | 2111 | 324 | 128 | 14.84 | 4.90 | 0.09 |
1 | 128 | 0 | 127 | 260 | 0 | 0.77 | 0.25 | 0.005 |
Where:
Notice that the code for 1 segment corresponds to a one-hot encoding of ASCII (plus the 4-dimensional segment indicator, which we keep in every case).
Instead of starting from the characters, this study starts from the binary encoding and then asks Python to generate the character from it. UTF-8 defines many of the possible codes in the code space as invalid, unused or reserved, which means that going from the binary number to the character will raise an execution exception for those codes.
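A minimal sketch of this enumeration idea for the 2-byte segment, assuming plain Python (the repository's generation code differs):

```python
# Illustration only: enumerate candidate 2-byte UTF-8 sequences and let Python's decoder
# reject the invalid ones (e.g. sequences with the invalid leading bytes 0xC0/0xC1).
valid, exceptions = [], 0
for b1 in range(0xC0, 0xE0):          # 110xxxxx leading bytes of the 2-byte segment
    for b2 in range(0x80, 0xC0):      # 10xxxxxx continuation bytes
        try:
            valid.append(bytes((b1, b2)).decode("utf-8"))
        except UnicodeDecodeError:    # invalid sequence: counted as a code exception
            exceptions += 1
print(len(valid), exceptions)
```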
Checking the matrix memory consumption, if using sparse matrices for the encodings, any recent mobile or desktop device should be able to deal with the complete code. This holds even if we keep ALL the code elements in memory, without taking into account that many of them are invalid.
Considering also the invalid codes, $\frac{790656}{1107904} = 0.71$, so only $29\%$ of this code is valid. The complete code in dense mode would then be around $3821 \times 0.29 = 1108$ MB and in sparse mode around $25.8$ MB, which makes the memory requirements much smaller.
For special tokens we can use the codes available in UTF-8 and add the mapping to the encoding and decoding dictionaries (not the matrix). The first 0x20 codes in UTF-8 signal different communication and control codes; we can reuse those for our NLP purposes, or we could choose the invalid codes at 0xC0 and 0xC1, or codes larger than U+10FFFF (at the end of segment 4 in this paper's vocabulary).
To avoid any issues, and since we don't plan to use the current models for communication protocols but only for NLP purposes (although this encoding could also let a network deal directly with network communication protocols), we decide to use the control codes at the beginning of the first UTF-8 segment.
To this end we re-map some chosen printable characters to control codes. We choose printable characters that are rarely used (except in some old terminal applications and perhaps a few others) in order to allow for better visual debugging and easier text manipulation.
This is arbitrary and can be changed, but we start by selecting interesting characters for use, ▲△▴▵▶▷▸▹►▻▼▽▾▿◀◁◂◃◄◅◆◇◈◉◊○◌◍◎●◐◑◒◓█▒, and from these we choose a few.
Box Codes for the delimitation to avoid any encoding collision:
We set a reserved code space, larger than the control codes, to be used for special purposes: RESERVED_CODE_SPACE is set to 33, index 32 being the space character. As \t, \r, \n and ' ' (space) are inside this space, we insist: DO NOT override these codes.
Then we define the following codes:
The <unk> (Unknown) element should not be needed since, by design, there should be no unrepresentable symbols in the code; it is added for completeness and in case of future use.
Note that the mapping is created in a way to be as close as possible to the original meaning of the control symbols.
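As a purely hypothetical illustration of such a re-mapping (the actual table lives in the repository and is not reproduced here), assuming Python:

```python
# Hypothetical example: map a few rarely-used printable glyphs onto control-code slots
# inside the reserved code space, keeping special tokens visible when debugging text.
RESERVED_CODE_SPACE = 33            # indices 0..32; index 32 is the space character
PROTECTED = {9, 10, 13, 32}         # \t, \n, \r and ' ' must never be overridden

special_to_control = {              # glyph -> control-code index (choices are illustrative)
    "►": 0x01,                      # e.g. a start-of-sequence marker
    "◄": 0x02,                      # e.g. an end-of-sequence marker
    "◇": 0x03,                      # e.g. <unk>, kept only for completeness
}
assert all(i < RESERVED_CODE_SPACE and i not in PROTECTED
           for i in special_to_control.values())
```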
Here we just name some ideas; they will be explored further in work that is already ongoing and will be presented in the future, so no more detail is given here.
While a character-level encoding might be faster at pre-processing time, it might not be the fastest at training or execution time (it needs more inputs).
The character level encoding can thus be made into any type of subword embedding with a few different techniques:
This section is dedicated to code execution and measurements to fill the table above
from sparse_encoders import create_measure_tables
create_measure_tables()
number of codes = 128      number of code_exceptions = 0
number of codes = 1984     number of code_exceptions = 128
number of codes = 59328    number of code_exceptions = 4224
number of codes = 1107904  number of code_exceptions = 790656

| Segments | exec_time (sec) | matrix_shape | Size on Disk (MB) | Matrix Size on Disk (MB) | Sparse Matrix Size on Disk (MB) | code path |
---|---|---|---|---|---|---|
| 1 | 0.004 | (128, 260) | 0.77 | 0.25 | 0.00 | codes/utf8_codes-1seg.pkl |
| 2 | 0.094 | (1984, 324) | 14.83 | 4.90 | 0.09 | codes/utf8_codes-2seg.pkl |
| 3 | 2.051 | (59328, 388) | 530.26 | 175.62 | 3.59 | codes/utf8_codes-3seg.pkl |
| 4 | 45.606 | (1107904, 452) | 11530.14 | 3820.59 | 83.59 | codes/utf8_codes-4seg.pkl |
The execution of all the previous code is done on a single thread of an Intel i7-7700.
The size of the embedding matrix grows with the number of code points and the embedding size; observing the size of the matrices, their sparse representation is negligible compared with current models.
NVidia provides support for sparse operations with cuSPARSE (documentation) which means we can use these matrices.
Nevertheless, many applications work in dense mode, and in that case working with fewer than 4 segments is advisable. The 3-segment embedding provides enough representation for all languages on earth, and the 2-segment one already supports most languages, as stated in a previous section.
As seen in the code execution above, even if the vectors are big to store and download, their creation is deterministic and can be reproduced in less than a minute on a single thread of an off-the-shelf CPU.
In the literature overfitting is an evil creature, but in this case, as we know the entire domain, we are going to use it to our advantage: we overfit the sparse input (the multi-hot encoded vector) into an embedding vector smaller than the input. The goal here is lossless compression.
This is done to be able to make a more informed decision at the end of the study and show comparative results. We are able to show that overfitting underperforms manual encoding (which comes as no surprise here): where overfitting fails to decode vectors, manually created codes are smaller and repeatable.
Once the network (basically an overfitted autoencoder) is trained, a new encoding matrix is generated by passing each element of the input domain through the autoencoder and taking the latent vector, which is used to build a matrix that can be given directly to the network as an Embedding.
As the UTF-8 coding uses at most 4 bytes for the code representation, there is at least the need to use vectors of size 32. The smallest code would be directly using the UTF-8 bytes as the embedding (which we should also test as input to be able to compare the results).
Any vector representation that handles the complete domain must therefore be at least of dimension 32. Here we'll test several dimensions for each number of codes: the representation of the 1-segment coding is done just for completeness, and the 2-segment coding might be useful but, with only 1984 elements, a one-hot encoding does not pose a big problem with current resources; the code starts to be more interesting for 3-segment and 4-segment coding.
The training is done with the following configuration:
Batch Size: size of the whole symbol set; each batch contains every symbol in the domain exactly once
Network Configuration: Autoencoder
Loss: CrossEntropy
And we measure:
Output Vector Embedding Size
Execution Time
Matrix Size on Disk (here only the dense matrix is taken into account as there should be close to no sparsity)
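Putting the configuration above together, a minimal sketch assuming PyTorch (layer sizes are illustrative, and BCE-with-logits is used as a per-dimension cross entropy over the multi-hot target, which is my interpretation of the "CrossEntropy" loss above):

```python
# Sketch of the intentionally overfitted autoencoder, not the repository's code.
import torch
import torch.nn as nn

class CodeAutoencoder(nn.Module):
    def __init__(self, code_dim=452, emb_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(),
                                     nn.Linear(128, emb_dim))
        self.decoder = nn.Sequential(nn.Linear(emb_dim, 128), nn.ReLU(),
                                     nn.Linear(128, code_dim))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

codes = torch.eye(452)[:128]                       # stand-in for the real code matrix
model = CodeAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()                   # per-dimension cross entropy (assumption)
for step in range(1000):                           # one batch = the whole domain
    logits, _ = model(codes)
    loss = loss_fn(logits, codes)
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():                              # latent vectors become the embedding matrix
    embedding_matrix = model.encoder(codes)        # shape: (n_symbols, emb_dim)
```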
The experiments on overfitting were run on different vector sizes from 32 to 128 for encodings using 3 and 4 segments
TODO do the runs again (with a better and more clean script) and put here the results
Although I should present the results, they were not conclusive at the time of the experiments. The codes were learnt and overfit and could be used, but at the moment there seemed to be no extra benefit in doing it this way, as it requires learning, while the other codes presented here are deterministic.
In this section we develop a methodology and different codes with different vector sizes and distances. All the generated codes are saved and later used to test in NLP tasks.
Two different ideas are used to create the sparse codes: the first is to choose $k$ active dimensions out of $N$; the second is to generate a multi-hot code as the combination of several smaller one-hot codes of co-prime sizes, which leads to longer cycles in the combination of the codes.
Although any code order might be enough, we look for a repeatable, predictable and deterministic process that reaches the same values each time we recompute the code.
For completeness, this study also deals with different sparse coding techniques. The basic idea is choosing $k$ active dimensions out of a vector of dimension $N$, such that ${N\choose k}$ is at least the number of code points to represent,
where $32 \leq N \leq M$, with $M$ defined as the maximum desired vector dimension,
and $k$ should be minimized to increase the sparsity of the vector as much as possible.
We can also again add some redundancy as in the previous multi-hot code, i.e. the first 4 elements indicate which of the UTF-8 segments is being used for the code point.
${N\choose k} = \dfrac{N!}{k!(N-k)!}$
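A minimal sketch of one deterministic way to build such a codebook, assuming NumPy (the repository's `create_sparse_Nk_codes` may differ):

```python
# Enumerate k-out-of-N combinations in a fixed (lexicographic) order so the codebook is
# identical on every run.
from itertools import combinations
import numpy as np

def nk_codebook(n_symbols: int, N: int, k: int) -> np.ndarray:
    codes = np.zeros((n_symbols, N), dtype=bool)
    for row, active in zip(codes, combinations(range(N), k)):
        row[list(active)] = True          # exactly k ones out of N dimensions
    assert codes[-1].sum() == k           # enough combinations for every symbol
    return codes

book = nk_codebook(1984, N=24, k=3)       # C(24, 3) = 2024 >= 1984 code points
print(book.shape, int(book.sum(axis=1).max()))   # (1984, 24) 3
```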
TODO: there might be an issue (to explore in the validation part) with this method: we can not use a multi-softmax decoding method; instead, classic multi-label classification (a Sigmoid layer) must be used, followed by vector similarity. When mixing the ${N\choose k}$ and co-prime methods, the decoding part of the network architecture gets a bit more complex.
The vector sizes are explored here:
from sparse_encoders import sparse_Nk_dimension_analysis
%%time
results = sparse_Nk_dimension_analysis()
len(results)
CPU times: user 12.9 ms, sys: 358 µs, total: 13.2 ms Wall time: 13.1 ms
18
# results: (code points needed, possible code points, vector size, number of ones in code, sparsity ratio)
results
[(128, 136, 17, 2, '0.118'), (128, 165, 11, 3, '0.273'), (128, 210, 10, 4, '0.400'), (128, 252, 10, 5, '0.500'), (128, 210, 10, 6, '0.600'), (1984, 2016, 64, 2, '0.031'), (1984, 2024, 24, 3, '0.125'), (1984, 2380, 17, 4, '0.235'), (1984, 2002, 14, 5, '0.357'), (1984, 3003, 14, 6, '0.429'), (59328, 59640, 72, 3, '0.042'), (59328, 66045, 37, 4, '0.108'), (59328, 65780, 26, 5, '0.192'), (59328, 74613, 22, 6, '0.273'), (1107904, 1125180, 190, 3, '0.016'), (1107904, 1150626, 74, 4, '0.054'), (1107904, 1221759, 45, 5, '0.111'), (1107904, 1344904, 34, 6, '0.176')]
We can observe that the vector sizes in these representations are much smaller than in the manual one-hot-by-segment code created before. We can again use the same technique as in the first part, adding 4 leading vector elements that represent the segment to which the symbol belongs.
There is another point to take into account: the size of the vector matters, as hardware is better adapted to certain sizes, and it also has consequences on the kinds of techniques that can be applied, for example grouped convolutions.
It is convenient then to have vector sizes that are powers of two and also multiples of 96 (tensor operation sizes in NVidia Tensor Cores are of size 96... something to understand better here: https://devblogs.nvidia.com/nvidia-turing-architecture-in-depth/).
The idea here is to create multiple one-hot codes, each of a prime size (which is the simplest way to choose co-primes), and then join them into one bigger code.
To maximize the cycle we choose the dimensions of the one-hot blocks to be co-prime to each other; the simplest solution is to choose prime numbers as the sizes of the sub-codes.
Again we look for combinations that minimize the vector size.
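A minimal sketch of this co-prime multi-hot, assuming NumPy; the mapping of symbol index $i$ to position $i \bmod p$ inside each prime-sized block is my assumption, chosen because it yields a cycle equal to the product of the primes, matching the numbers reported below:

```python
# Concatenate one one-hot block per prime; the combined pattern only repeats after
# prod(primes) symbols, e.g. 3*5*11*13 = 2145 >= 1984 for the 2-segment case.
import numpy as np

def coprime_codebook(n_symbols: int, primes=(3, 5, 11, 13)) -> np.ndarray:
    dim = sum(primes)                               # e.g. 3+5+11+13 = 32
    codes = np.zeros((n_symbols, dim), dtype=bool)
    for i in range(n_symbols):
        offset = 0
        for p in primes:
            codes[i, offset + i % p] = True         # one-hot inside each block
            offset += p
    return codes

book = coprime_codebook(1984)
print(book.shape, int(book.sum(axis=1)[0]))         # (1984, 32) 4
```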
%%time
from sparse_encoders import multihot_primes_conf_finder
_, codes_1seg, codes_2seg, codes_3seg, codes_4seg= multihot_primes_conf_finder()
CPU times: user 332 ms, sys: 8.77 ms, total: 341 ms Wall time: 340 ms
The shortest codes that can handle the needed codebook sizes are:
codes_1seg[0], codes_2seg[0], codes_3seg[0], codes_4seg[0]
(((2, 3, 5, 7), 4, 0.235, 17, 210), ((3, 5, 11, 13), 4, 0.125, 32, 2145), ((11, 13, 19, 23), 4, 0.061, 66, 62491), ((23, 31, 37, 43), 4, 0.03, 134, 1134383))
To build the codes, for each one we pad with redundancy such that we get an embedding vector of a size in $[32, 48, 64, 96, 128, 192, 256, 384]$.
But we can also aim for a defined sparsity in the final vector embeddings. As an extra property, all the vectors have the same parity (number of active bits), which is a nice property to check during decoding.
To take those two elements into account we'll select from the smallest vector sizes; we can see some of them here:
codes_1seg[:5], codes_2seg[:5], codes_3seg[:5], codes_4seg[:5]
([((2, 3, 5, 7), 4, 0.235, 17, 210), ((3, 5, 11), 3, 0.158, 19, 165), ((2, 5, 13), 3, 0.15, 20, 130), ((2, 7, 11), 3, 0.15, 20, 154), ((3, 5, 13), 3, 0.143, 21, 195)], [((3, 5, 11, 13), 4, 0.125, 32, 2145), ((2, 7, 11, 13), 4, 0.121, 33, 2002), ((3, 5, 7, 19), 4, 0.118, 34, 1995), ((3, 7, 11, 13), 4, 0.118, 34, 3003), ((3, 5, 11, 17), 4, 0.111, 36, 2805)], [((11, 13, 19, 23), 4, 0.061, 66, 62491), ((11, 13, 17, 29), 4, 0.057, 70, 70499), ((11, 17, 19, 23), 4, 0.057, 70, 81719), ((7, 13, 23, 29), 4, 0.056, 72, 60697), ((7, 17, 19, 29), 4, 0.056, 72, 65569)], [((23, 31, 37, 43), 4, 0.03, 134, 1134383), ((23, 29, 37, 47), 4, 0.029, 136, 1159913), ((23, 29, 41, 43), 4, 0.029, 136, 1175921), ((17, 37, 41, 43), 4, 0.029, 138, 1108927), ((19, 29, 43, 47), 4, 0.029, 138, 1113571)])
Also, to fill the remaining space we can choose to combine co-prime codes with ${N\choose k}$ codes. This has the advantage of generating different patterns with redundant information that the network can exploit later.
First we generate the codes for the sparse and co-prime methods without padding to the hardware-friendly sizes, then we do the padding. All codes are saved so we can experiment with them later and compare results.
from sparse_encoders import create_sparse_Nk_codes, all_multihot_primes
%%time
codes = create_sparse_Nk_codes()
| Segments | code size | Vector Size | N | k | exec_time (sec) | Matrix Size on Disk (MB) | Sparse Matrix Size on Disk (MB) | code path |
---|---|---|---|---|---|---|---|---|
| 1 | 128 | (128, 17) | 17 | 2 | 0.001 | 0.00 | 0.00 | codes/utf8_sparse_codes-1_N-17_k-2_seg |
| 2 | 1984 | (1984, 24) | 24 | 3 | 0.003 | 0.05 | 0.05 | codes/utf8_sparse_codes-2_N-24_k-3_seg |
| 3 | 59328 | (59328, 37) | 37 | 4 | 0.059 | 2.09 | 2.04 | codes/utf8_sparse_codes-3_N-37_k-4_seg |
| 4 | 1107904 | (1107904, 45) | 45 | 5 | 1.024 | 47.55 | 47.55 | codes/utf8_sparse_codes-4_N-45_k-5_seg |

CPU times: user 1.01 s, sys: 75.7 ms, total: 1.09 s
Wall time: 1.09 s
%%time
cp_codes = all_multihot_primes()
| Segments | code size | Vector Size | primes | exec_time (sec) | Matrix Size on Disk (MB) | Sparse Matrix Size on Disk (MB) | code path |
---|---|---|---|---|---|---|---|
| 1 | 128 | (128, 19) | (3, 5, 11) | 0.001 | 0.00 | 0.00 | codes/utf8_coprime_codes-128_primes-(3, 5, 11)_1_seg |
| 2 | 1984 | (1984, 32) | (3, 5, 11, 13) | 0.001 | 0.06 | 0.07 | codes/utf8_coprime_codes-1984_primes-(3, 5, 11, 13)_2_seg |
| 3 | 59328 | (59328, 66) | (11, 13, 19, 23) | 0.034 | 3.73 | 2.04 | codes/utf8_coprime_codes-59328_primes-(11, 13, 19, 23)_3_seg |
| 4 | 1107904 | (1107904, 134) | (23, 31, 37, 43) | 0.637 | 141.58 | 38.04 | codes/utf8_coprime_codes-1107904_primes-(23, 31, 37, 43)_4_seg |

CPU times: user 549 ms, sys: 127 ms, total: 675 ms
Wall time: 673 ms
As we can observe, these codes are light on memory (at least up to 3 segments, which should be enough for most NLP tasks in most languages); with the sparse matrix representation, all of them are low on memory consumption.
To take advantage of hardware-friendly vector sizes it might be more advantageous to combine different codes, as they cycle differently. Two representations will be built: one starting from the complete ${N\choose k}$ code and completing with co-prime coding, and the other starting from the co-prime code and completing with ${N\choose k}$.
Starting from the ${N\choose k}$ codes we construct smaller codes than when starting from the co-prime technique; this has the consequence of containing a complete duplicated code, which gives more redundancy than the smaller codes.
The configuration decision is somewhat arbitrary, just trying to get a good trade-off while keeping the vector dimension fixed to the sizes named above. The configuration is the following:
In the following table NCODES[X] represents a list of codes created for the particular test-case.
# N choose k + coprime multihot
# code dim, N,k,target dim, prime dim, primes
Nk_coprimes = [(NCODES[0], 17, 2, 32, 15, (3, 5, 7)),
(NCODES[1], 24, 3, 48, 24, (5, 8, 11)),
(NCODES[2], 37, 4, 64, 27, (3, 5, 8, 11)),
(NCODES[3], 45, 5, 96, 51, (3, 7, 11, 13, 17))]
# coprime multihot + N choose k
# code dim, primes, (N, k)
code_config = [(NCODES[0], (3, 5, 11), (13, 3)),
(NCODES[1], (3, 5, 11, 13), (32, 3)),
(NCODES[2], (11, 13, 19, 23), (30, 4)),
(NCODES[3], (23, 31, 37, 43), (58, 5))]
To add redundancy to the codes we can try several methods: basically, mixing methods or adding a new one.
While linear transformations can achieve multiple reorderings of the input vectors, having different and redundant representations will make it easier for the learning to find the parts that are best to interpret or pay attention to for the task at hand.
Basically this method assigns, for each part of the code, a set of dimensions of the vector representation. It goes as follows:
Let $C$ be a code of vector dimension $d$, such that $dim(C)=d$, where each $c \in C$ is formed only by binary elements $\{0,1\}$.
Let $s$ be a defined subset of the dimensions of $C$ with $dim(s)=d_s < dim(C)=d$ and elements $s_i$ for $i \in [1, d_s]$.
Let $N$ be a set composed of (distinct) codewords of dimension $dim(C)$ from code $C$ (note: $N$ does not have to be complete in the sense of representing all possible elements of $C$), where $dim(N)=n$.
$$N[s_i]_j = int \left( \dfrac{j}{n / d_s} \right)$$ where $j$ is the index of the codeword in $N$, $s_i$ is the $i^{th}$ dimension of the set of dimensions $s$ of $N_j$, and $int(x)$ is the integer part of $x$.
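One possible reading of this formula, sketched below under the assumption that the $d_s$ reserved dimensions one-hot encode which of the $d_s$ contiguous chunks of the codebook the index $j$ falls into (this interpretation is mine, not spelled out above):

```python
# Hypothetical sketch of the segmentation redundancy: split the n codewords into ds
# contiguous chunks and one-hot encode the chunk index int(j / (n / ds)).
import numpy as np

def segment_redundancy(n: int, ds: int) -> np.ndarray:
    extra = np.zeros((n, ds), dtype=bool)
    for j in range(n):
        extra[j, int(j / (n / ds))] = True    # which chunk the j-th codeword belongs to
    return extra

print(segment_redundancy(n=1984, ds=4).sum(axis=0))   # 496 codewords per chunk
```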
Normally a network will create its own redundancy depending on the needs of the application; the goal of this study is to create a code that can also be used by applications that continually learn different tasks on different domains and multiple languages.
While the codebooks generated by this method will be mostly ignored if the application is single-language (English, for example), later adding parallel columns (another study in progress) will make use of different parts of the code, and the embedded redundancies might (to try/prove in future work) allow for easier learning.
from sparse_encoders import create_choose_Nk_coprimes_codes, create_coprimes_choose_Nk_codes
%%time
nkcp = create_choose_Nk_coprimes_codes()
| Segments | code size | Vector Size | N | k | primes | exec_time (sec) | Matrix Size on Disk (MB) | Sparse Matrix Size on Disk (MB) | code path |
---|---|---|---|---|---|---|---|---|---|
| 1 | 128 | (128, 32) | 17 | 2 | (3, 5, 7) | 0.001 | 0.00 | 0.01 | codes/utf8_N-17k-2-coprime_codes-128_primes-(3, 5, 7)_1_seg |
| 2 | 1984 | (1984, 48) | 24 | 3 | (5, 8, 11) | 0.002 | 0.09 | 0.10 | codes/utf8_N-24k-3-coprime_codes-1984_primes-(5, 8, 11)_2_seg |
| 3 | 59328 | (59328, 64) | 37 | 4 | (3, 5, 8, 11) | 0.061 | 3.62 | 4.07 | codes/utf8_N-37k-4-coprime_codes-59328_primes-(3, 5, 8, 11)_3_seg |
| 4 | 1107904 | (1107904, 96) | 45 | 5 | (3, 7, 11, 13, 17) | 1.381 | 101.43 | 95.09 | codes/utf8_N-45k-5-coprime_codes-1107904_primes-(3, 7, 11, 13, 17)_4_seg |

CPU times: user 1.31 s, sys: 140 ms, total: 1.45 s
Wall time: 1.44 s
%%time
cpnk = create_coprimes_choose_Nk_codes()
| Segments | code size | Vector Size | N | k | primes | exec_time (sec) | Matrix Size on Disk (MB) | Sparse Matrix Size on Disk (MB) | code path |
---|---|---|---|---|---|---|---|---|---|
| 1 | 128 | (128, 32) | 13 | 3 | (3, 5, 11) | 0.002 | 0.00 | 0.01 | codes/utf8_coprime_codes-128_primes-(3, 5, 11)_N-13k-3_1-seg |
| 2 | 1984 | (1984, 64) | 32 | 3 | (3, 5, 11, 13) | 0.007 | 0.12 | 0.12 | codes/utf8_coprime_codes-1984_primes-(3, 5, 11, 13)_N-32k-3_2-seg |
| 3 | 59328 | (59328, 96) | 30 | 4 | (11, 13, 19, 23) | 0.057 | 5.43 | 2.98 | codes/utf8_coprime_codes-59328_primes-(11, 13, 19, 23)_N-30k-4_3-seg |
| 4 | 1107904 | (1107904, 192) | 58 | 5 | (23, 31, 37, 43) | 3.375 | 202.86 | 85.58 | codes/utf8_coprime_codes-1107904_primes-(23, 31, 37, 43)_N-58k-5_4-seg |

CPU times: user 3.09 s, sys: 328 ms, total: 3.41 s
Wall time: 3.44 s
As the code that seems the most useful for most languages is the one derived from 2 and 3 UTF-8 segments, we dedicate special attention to it and generate an extra code here in two versions, with two- and three-fold redundancy. The two-fold redundancy will have a non-compliant vector size (not in the vector size list specified in the section above); the other will be $2^n$.
For completeness, and to have extra redundancy, we also create a code that contains both the complete co-prime and the complete ${N \choose k}$ codes for the 2-segment case (the other cases are complete already), that is, a vector of size 128 where:
coprimes | N | k | Size | Target Size | remaining |
---|---|---|---|---|---|
(3, 5, 11, 13) | 24 | 3 | 56 | 64 | 8 |
We need to use the remaining elements to create a good enough repetition that does not share its period/cycle with the other existing parts of the code (which is the reason the co-prime code is created instead of using another method). The idea is to maximize the information obtained by looking at only part of the code.
Some ideas on how to use the remaining 8 dimensions are:
TODO: for completeness we can also create a vector of size 64 with different co-prime techniques, using some of the codes sorted as in point 2 above. This has the advantage of being able to use Softmax in the decoding layer.
For completeness, and to have extra redundancy, we also create a code that contains both the complete co-prime and the complete ${N \choose k}$ codes for the 3-segment case (the other cases are complete already), that is, a vector of size 128 where:
coprimes | N | k | Size | Target Size | remaining |
---|---|---|---|---|---|
(11, 13, 19, 23) | 37 | 4 | 103 | 128 | 25 |
The remaining 25 elements needed to complete dimension 128 are then treated again as an extra redundancy element. We can add several types, but the idea is to keep not only redundancy but also sparsity: doing ${25 \choose x}$ for $x \in [2,3,4,5,6]$ works correctly at sparsities $[0.078, 0.086, 0.093, 0.10, 0.11]$ respectively. If $x=6$ the code completes again, giving a triplicate with different patterns that should allow the network to use the information differently.
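A quick check of these figures, under the assumption that the sparsity counts all active bits of the 128-dim vector (4 from the 4-prime co-prime part, 4 from the 37-choose-4 part, plus $x$ from the filler):

```python
# Verify the filler capacity and the (approximate) sparsity values quoted above.
from math import comb

for x in range(2, 7):
    print(x, comb(25, x), round((4 + 4 + x) / 128, 3))
# x = 6 is the first value with C(25, x) >= 59328, so the filler replicates the full codebook.
```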
As in the case of the 2-segment code we can treat the remaining dimensions to fill our needs; there are mostly two ideas to work on:
TODO: for completeness we can also create vectors of size 96 and 128 with different co-prime techniques, using some of the codes sorted as in point 2 above. This has the advantage of being able to use Softmax in the decoding layer.
Due to time and resource restrictions the first set of experiments will only deal with this setting; the decision (after some thought) is the following:
* The code will be set as a coprime based method (easier to apply multiple Softmax)
* The redundancy will be set as a coprime based method, but segmented by the total number of elements in the code divided by the cycle of the prime (Single Cycle Segmentation Method above).
* The decoder will finish (optionally and by configuration) by a multi-softmax layer (multiple softmax, one for each part of the code that contains a one-hot)
Now that we have a methodology to encode UTF-8 by segments, or any list of elements, we need to decide what to encode.
Natural languages, as well as computer languages, contain a certain amount of symbols, including emojis; these, for example, are defined in the 4th segment (in this paper's vocabulary) of UTF-8, which means that a per-segment block coding will not suffice. There might also be a few other symbols that appear on other occasions and might be useful to encode (for example for some programming languages or symbolic representations such as mathematics).
So what we'll do is create a new list of characters and symbols (Unicode code points) to encode. For this we download some of the Wikipedia pages that contain the Unicode block definitions and then clean and sort them.
Even though phonetic symbols are present in every dictionary and we plan to use them in the future, they consist of 341 code points, which we want to avoid.
For the purpose of this work we'll avoid most CJK and Indic scripts, as they have too many symbols for the current work, while keeping as much language variety as we can with a limited number of characters. Concerning the African scripts, there are two reasons not to include them here either: the first is the number of training samples, which is not sufficient for the current work; the second is the number of symbols that would need to be added to the encoding.
Ancient and historic scripts are skipped for the same reasons as the CJK and African scripts.
Nevertheless, the way the codes are generated here, they are extensible with new symbols without changing the previously defined elements.
The scripts we target are then: languages mostly based on the Greek, Roman and Cyrillic alphabets (most European languages).
If the number of code points and of training samples allows, we'll also add Semitic scripts such as Hebrew (not too many code points) and Arabic.
We add Emoticons, as they are used in current social networks, and symbols, as they might be of use and don't have too many code points.
Concerning Box Drawing, Block Elements and Geometric Shapes: they'll be excluded, as they will be used as special characters for field separation (as in CSV or TSV, but we can't use tabs or commas for the current purpose) and because U+2581 (▁) is used as a special character in some tokenization algorithms. This might pose a problem if trying to play terminal-based games with the current methodology.
import ntpath
import os
unicode_wiki_sites = [
"https://en.wikipedia.org/wiki/List_of_Unicode_characters",
"https://en.wikipedia.org/wiki/Armenian_(Unicode_block)",
"https://en.wikipedia.org/wiki/Glagolitic_(Unicode_block)",
"https://en.wikipedia.org/wiki/Cyrillic_(Unicode_block)",
"https://en.wikipedia.org/wiki/Cyrillic_Supplement",
"https://en.wikipedia.org/wiki/Cyrillic_Extended-A",
"https://en.wikipedia.org/wiki/Cyrillic_Extended-B",
"https://en.wikipedia.org/wiki/Cyrillic_Extended-C",
"https://en.wikipedia.org/wiki/Greek_and_Coptic",
"https://en.wikipedia.org/wiki/Phonetic_symbols_in_Unicode",
"https://en.wikipedia.org/wiki/Coptic_Epact_Numbers",
"https://en.wikipedia.org/wiki/Coptic_(Unicode_block)",
"https://en.wikipedia.org/wiki/Emoticons_(Unicode_block)",
"https://en.wikipedia.org/wiki/Unicode_and_HTML_for_the_Hebrew_alphabet",
"https://en.wikipedia.org/wiki/Thaana_(Unicode_block)",
"https://en.wikipedia.org/wiki/Arabic_script_in_Unicode",
]
cmd = "lynx -dump -nolist '{}' > ./text/wiki-unicode/{}.txt"
for site in unicode_wiki_sites:
head, tail = ntpath.split(site)
tail = tail.replace('(','').replace(')','')
print(tail)
command = cmd.format(site, tail)
print(command)
os.system(command)
List_of_Unicode_characters
lynx -dump -nolist 'https://en.wikipedia.org/wiki/List_of_Unicode_characters' > ./text/wiki-unicode/List_of_Unicode_characters.txt
Armenian_Unicode_block
lynx -dump -nolist 'https://en.wikipedia.org/wiki/Armenian_(Unicode_block)' > ./text/wiki-unicode/Armenian_Unicode_block.txt
Glagolitic_Unicode_block
lynx -dump -nolist 'https://en.wikipedia.org/wiki/Glagolitic_(Unicode_block)' > ./text/wiki-unicode/Glagolitic_Unicode_block.txt
Cyrillic_Unicode_block
lynx -dump -nolist 'https://en.wikipedia.org/wiki/Cyrillic_(Unicode_block)' > ./text/wiki-unicode/Cyrillic_Unicode_block.txt
Cyrillic_Supplement
lynx -dump -nolist 'https://en.wikipedia.org/wiki/Cyrillic_Supplement' > ./text/wiki-unicode/Cyrillic_Supplement.txt
Cyrillic_Extended-A
lynx -dump -nolist 'https://en.wikipedia.org/wiki/Cyrillic_Extended-A' > ./text/wiki-unicode/Cyrillic_Extended-A.txt
Cyrillic_Extended-B
lynx -dump -nolist 'https://en.wikipedia.org/wiki/Cyrillic_Extended-B' > ./text/wiki-unicode/Cyrillic_Extended-B.txt
Cyrillic_Extended-C
lynx -dump -nolist 'https://en.wikipedia.org/wiki/Cyrillic_Extended-C' > ./text/wiki-unicode/Cyrillic_Extended-C.txt
Greek_and_Coptic
lynx -dump -nolist 'https://en.wikipedia.org/wiki/Greek_and_Coptic' > ./text/wiki-unicode/Greek_and_Coptic.txt
Phonetic_symbols_in_Unicode
lynx -dump -nolist 'https://en.wikipedia.org/wiki/Phonetic_symbols_in_Unicode' > ./text/wiki-unicode/Phonetic_symbols_in_Unicode.txt
Coptic_Epact_Numbers
lynx -dump -nolist 'https://en.wikipedia.org/wiki/Coptic_Epact_Numbers' > ./text/wiki-unicode/Coptic_Epact_Numbers.txt
Coptic_Unicode_block
lynx -dump -nolist 'https://en.wikipedia.org/wiki/Coptic_(Unicode_block)' > ./text/wiki-unicode/Coptic_Unicode_block.txt
Emoticons_Unicode_block
lynx -dump -nolist 'https://en.wikipedia.org/wiki/Emoticons_(Unicode_block)' > ./text/wiki-unicode/Emoticons_Unicode_block.txt
Unicode_and_HTML_for_the_Hebrew_alphabet
lynx -dump -nolist 'https://en.wikipedia.org/wiki/Unicode_and_HTML_for_the_Hebrew_alphabet' > ./text/wiki-unicode/Unicode_and_HTML_for_the_Hebrew_alphabet.txt
Thaana_Unicode_block
lynx -dump -nolist 'https://en.wikipedia.org/wiki/Thaana_(Unicode_block)' > ./text/wiki-unicode/Thaana_Unicode_block.txt
Arabic_script_in_Unicode
lynx -dump -nolist 'https://en.wikipedia.org/wiki/Arabic_script_in_Unicode' > ./text/wiki-unicode/Arabic_script_in_Unicode.txt
from preprocessors.symbol_extraction import *
selected_path = 'selected_sources_small'
%%time
extract_all_chars(selected_path)
CPU times: user 27.4 ms, sys: 0 ns, total: 27.4 ms Wall time: 24.2 ms
sclist = char_stats(selected_path)
2899 new chars on file selected_sources_small/List_of_Unicode_characters.txt
48 new chars on file selected_sources_small/Cyrillic_Supplement.txt
3 new chars on file selected_sources_small/Greek_and_Coptic.txt
90 new chars on file selected_sources_small/Unicode_and_HTML_for_the_Hebrew_alphabet.txt
0 new chars on file selected_sources_small/Cyrillic_Unicode_block.txt
Total number of symbols: 3040
Number of Selected Symbols: 1569
Excluding math and drawing symbols, there is a significant reduction in the number of symbols to represent.
We'll also add the first control-code code points from UTF-8.
%%time
extract_all_chars(selected_path)
sclist = char_stats(selected_path)
2899 new chars on file selected_sources_small/List_of_Unicode_characters.txt
48 new chars on file selected_sources_small/Cyrillic_Supplement.txt
3 new chars on file selected_sources_small/Greek_and_Coptic.txt
90 new chars on file selected_sources_small/Unicode_and_HTML_for_the_Hebrew_alphabet.txt
0 new chars on file selected_sources_small/Cyrillic_Unicode_block.txt
Total number of symbols: 3040
Number of Selected Symbols: 1569
CPU times: user 19.7 ms, sys: 3.91 ms, total: 23.6 ms
Wall time: 22.6 ms
import pickle
path = '../..//minibrain/predictors/sequence/text/utf8-codes/txt2num_2seg.pkl'
txt2num = pickle.load(open(path, 'rb'))
utf8chars_2seg = txt2num.keys()
ccodes = set(list(utf8chars_2seg)[:32])
sclist = char_stats(selected_path)
2899 new chars on file selected_sources_small/List_of_Unicode_characters.txt
48 new chars on file selected_sources_small/Cyrillic_Supplement.txt
3 new chars on file selected_sources_small/Greek_and_Coptic.txt
90 new chars on file selected_sources_small/Unicode_and_HTML_for_the_Hebrew_alphabet.txt
0 new chars on file selected_sources_small/Cyrillic_Unicode_block.txt
Total number of symbols: 3040
Number of Selected Symbols: 1569
all_codes = set(sclist)
all_codes.update(ccodes)
all_codes = sorted(list(all_codes))
len(all_codes)
1601
from sparse_encoders import create_codematrix_from_conf
# generating codeblock and code dict
# config = [(2, 1916, (24, 3), (3, 5, 11, 13), (6, 2), 64, 9 / 64)]
# config = [(2, 2112, (24, 3), (3, 5, 11, 13), (6, 2), 64, 9 / 64)]
config = [(2, 2112, (24, 3), (3, 5, 11, 13), (4, 6, 8, 10, 12), 96, 13 / 96)]
create_codematrix_from_conf(config)
| Segments | code size | Vector Size | N | k | primes | cycles | exec_time (sec) | Matrix Size on Disk (MB) | Sparse Matrix Size on Disk (MB) | code path |
---|---|---|---|---|---|---|---|---|---|---|
| 2 | 2112 | (2112, 96) | 24 | 3 | (3, 5, 11, 13) | (4, 6, 8, 10, 12) | 0.012 | 0.19 | 0.14 | codes/utf8_2-seg_2112-codepoints_96-dim_N-24-k3_coprimes-(3, 5, 11, 13)_cycles-(4, 6, 8, 10, 12)_dense |
[array([[ True, False, False, ..., True, True, True], [False, True, False, ..., True, True, True], [False, False, True, ..., True, True, True], ..., [ True, False, False, ..., False, False, False], [False, True, False, ..., False, False, False], [False, False, True, ..., False, False, False]])]
from collections import OrderedDict
codes_ids = OrderedDict(list(enumerate(all_codes)))
import pickle
fname = 'codes/adhoc-code-1916-codepoints.pkl'
with open(fname, 'wb') as f:
pickle.dump(codes_ids, f, pickle.HIGHEST_PROTOCOL)
from sparse_encoders import create_codebook
from preprocessors.symbol_extraction import *
charset = char_stats(selected_path)
2899 new chars on file selected_sources_small/List_of_Unicode_characters.txt
48 new chars on file selected_sources_small/Cyrillic_Supplement.txt
3 new chars on file selected_sources_small/Greek_and_Coptic.txt
90 new chars on file selected_sources_small/Unicode_and_HTML_for_the_Hebrew_alphabet.txt
0 new chars on file selected_sources_small/Cyrillic_Unicode_block.txt
Total number of symbols: 3040
Number of Selected Symbols: 1569
We exclude more characters to make the charset smaller and thus the Softmax faster (due to a failure with FAISS on GPU at the moment, I need to cut the number of chars).
%%time
# all codebook and dictionaries are created with this function:
# CHARSET_PATH = "codes/all_chars.chars"
config = (2, 1838+33, (24, 3), (3, 5, 11, 13), (4, 6, 8, 10, 12), 96, 13 / 96)
ofname = "codes/adhoc-codebook-1871.pkl"
codebook = create_codebook(charset, config, ofname, reserved_spaces=33)
| Segments | code size | Vector Size | N | k | primes | cycles | exec_time (sec) | Matrix Size on Disk (MB) | Sparse Matrix Size on Disk (MB) | code path |
---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1871 | (1871, 96) | 24 | 3 | (3, 5, 11, 13) | (4, 6, 8, 10, 12) | 0.004 | 0.17 | 0.12 | codes/utf8_2-seg_1871-codepoints_96-dim_N-24-k3_coprimes-(3, 5, 11, 13)_cycles-(4, 6, 8, 10, 12)_dense |

saving file codes/adhoc-codebook-1871.pkl with codes.shape (1871, 96) | char2int 1609 | int2char 1602
CPU times: user 6.72 ms, sys: 599 µs, total: 7.32 ms
Wall time: 6.24 ms
code, char2int, int2char = codebook
Here we manually verify some code properties and make sure that everything was built correctly.
code.shape
(1871, 96)
len(char2int.keys()), len(int2char.keys())
(1609, 1602)
set(char2int.keys()).difference(set(int2char.values()))
{'\x00', '\x01', '\x02', '\x03', '\x04', '\x15', '\x1a'}
set(int2char.keys()).difference(set(char2int.values())), set(char2int.values()).difference(set(int2char.keys()))
(set(), set())
int2char[10], int2char[9], int2char[33], int2char[32],
('\n', '\t', '!', ' ')
char2int['\n'], char2int['\t']
(10, 9)
max(int2char.keys()), max(char2int.values())
(1601, 1601)
int2char[32]
' '
char2int[' ']
32
Decoding the vectors can be done by cosine similarity (or any other vector similarity method); in this case we use the faiss library from Facebook AI Research.
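A minimal sketch of this decoding step, assuming the faiss Python API (the codebook below is a random stand-in for the real one):

```python
# Nearest-neighbour decoding: map a real-valued predicted code back to a symbol id.
import faiss
import numpy as np

codebook = np.random.rand(1871, 96).round().astype("float32")   # stand-in binary codebook
faiss.normalize_L2(codebook)                   # inner product on unit vectors = cosine
index = faiss.IndexFlatIP(codebook.shape[1])
index.add(codebook)

pred = np.random.rand(4, 96).astype("float32") # e.g. sigmoid outputs of the network decoder
faiss.normalize_L2(pred)
_, symbol_ids = index.search(pred, 1)          # closest codeword id for each prediction
```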
There are other ways of dealing with decoding, since it is a binary vector and we know its parity (number of active bits) beforehand.
For these character-level encodings, one trick is, at the last layer, to use multiple small Softmax layers instead of a single big one (Multi-Softmax), one per part of the code, depending on the code configuration. It is well known in the domain that big softmax layers are a performance issue, and several studies try to limit the cost they generate. (TODO: add more references; I remember a Google paper that cut the number of trainable parameters in an NLP network by sharing all the transformer blocks and adding a technique at the input and output level to keep the input small.)
Small softmax layers are much faster to process than big ones, although this remains to be measured as future work, depending on the actual softmax size (vectors that are too small might not take real advantage of GPGPUs). For really small vectors it might be good enough to do some clipping and binarization, or just set the biggest element to 1 and the rest to 0, which has the issue of not being differentiable.
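A minimal sketch of such a multi-softmax head, assuming PyTorch (the block sizes match the 2-segment co-prime code and are otherwise illustrative):

```python
# One small (log-)softmax per one-hot block of the code instead of one big softmax.
import torch
import torch.nn as nn

class MultiSoftmaxHead(nn.Module):
    def __init__(self, hidden=512, blocks=(3, 5, 11, 13)):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(hidden, b) for b in blocks)

    def forward(self, h):
        return [torch.log_softmax(head(h), dim=-1) for head in self.heads]

head = MultiSoftmaxHead()
outs = head(torch.randn(8, 512))                          # 8 hidden states in a batch
targets = [torch.zeros(8, dtype=torch.long)] * len(outs)  # dummy per-block targets
loss = sum(nn.functional.nll_loss(o, t) for o, t in zip(outs, targets))
```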
Ideas (TODO): some different techniques should be tested, mixing the following:
With:
Note: before the last decoding layer we can use Transformer blocks with as many heads as there are ones in the vector; each head should be able to attend to a different important vector dimension.
The other advantage of small softmax layers is that they can be implemented in hardware (ASIC or FPGA, for which work is already available), so a specific text encoder and decoder can be implemented for the given encoding techniques and configurations.
There are also techniques available to do similarity search in hardware (a need especially in biological applications); the fact that the vector sizes and number of elements are fixed makes it feasible to create and keep the vectors in memory, and it should be studied whether applying error correcting codes adds any advantage.
To validate this method it is sufficient to show that the performance of a network does not decay compared to one-hot across several tasks. This article deals with this in a restricted environment.
There are a few key points to measure:
To this end, NLP tasks simple enough that training and testing time is not excessive will be tackled (running on a single consumer GPU, in this case an RTX 2080 Ti).
The tasks will be evaluated on the same networks (except for the first embedding layer) with different encodings; to be able to compare networks, all will be done at the character level. For completeness some other methods will also be evaluated, mainly BPE, which is currently used in most SoTA papers.
While the approach up to here shows that the network can learn a representation given the proposed encodings, it has some difficulties due to the dataset size and the network's number of parameters.
It starts to learn correctly and shows a quick improvement in the loss, although it shows errors in the decoding. The errors are:
After simplifying the task, going only for PoS tagging instead of all of them, the errors (manually checked) are still mainly a mismatch between the number of tags produced and expected.
From this we can infer that the dataset is not big enough AND the network is not big enough AND we don't have enough resources. While this shows some promise, the expectations placed on a network without much a-priori external knowledge seem too high.
Several studies show that adding external a-priori information about the task can not only reduce the number of parameters needed but also improve the resulting metrics.
So the current focus is on augmenting the information given to the network, not only in the encoding but also in the way the network can deal with externally given or added information.
For this purpose we start studying a multi-step encoding in the NewCompositionalCodebook notebook; its main idea is:
For the encoding, the desired set of tokens can then be composed by operating in the same manner. All elements are first normalized with the NFKC method; during the compositional code generation the normalization is NFKD, to separate composed symbols into their basic unitary components.
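A small illustration of those two normalization steps, using Python's standard unicodedata module (not the notebook's own helpers):

```python
# NFKC composes compatibility characters; NFKD decomposes them into basic components.
import unicodedata

s = "ﬁancé"                                    # contains the 'fi' ligature and a composed 'é'
nfkc = unicodedata.normalize("NFKC", s)        # first normalization step
nfkd = unicodedata.normalize("NFKD", s)        # decomposition used for compositional codes
print(nfkc)                                    # 'fiancé' (ligature expanded, 'é' stays composed)
print([unicodedata.name(c) for c in nfkd][-2:])   # ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']
```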
During training of the different networks, one recurring issue was NaNs appearing during training on several occasions. This led to a quite frustrating journey where many times 12+ hours of training were stalled, followed by restarting training from different checkpoints and with different seeds to make things work. After many frustrating hours, sometimes the networks converged and sometimes they diverged. This seems to be a known issue, and on 10/03/2020 (dd/mm/yyyy) a pre-print was published, ReZero is All You Need: Fast Convergence at Large Depth, that explains the issue and gives a solution, which of course I SHOULD BE USING.
This is quite an important issue. While having more tokens facilitates the encoding and the information given, it also makes the encoder and decoder bigger in memory.
For the current status of the encoder, it does not make training slower, as the values are fixed, but it does make the decoder bigger, as we are currently using a one-hot vector for the decoder. So the first step would be to replace the one-hot decoder step with faiss similarity search, which has its own disadvantage of being memory hungry.
The current NLP state of the art takes multiple steps to make things easier for the network to learn. This includes lowercasing everything, normalizing text, eliminating diacritics, and so on (as a few examples). Also, the current multi-language approaches are networks that train mainly on English and are later "specialized" in other languages, which creates issues; works such as How Multilingual is Multilingual BERT? show some of these problems. These language specializations come in different files, so a network loads one of the models, and multiple models are needed for multiple languages. This leads to another issue: when more than one language is mixed in the same context (as would happen in a real multi-lingual scenario), these networks can not necessarily understand the mix. Even if it is not a really big issue, it is something that happens often to multi-lingual people (like the author of this text) and frustrates them, so handling it would be a really nice to have.
So for the encoding scheme we would like to make it not only sparse, easy and deterministic to generate, but also to include redundancies and, in a way, different normalization states in the code. This is all explored and done in the NewCompositionalCodebook_exploration.ipynb source code.
The codes generated there:
Tokenization can take different forms (BPE, word level, sentencepiece, character level, etc) and is an active field of research.
In this work we are looking for a tokenization that allows the following features:
Many of these points are already covered by the character-level encodings, especially the NewCompositionalCodebook (which, while more complex than previous attempts, is more expressive).
The current encoding also has the advantage of capturing similarities in the case of misspellings (or different spellings, such as with/without diacritics and accidentals) without changing the entire codeword, which is only altered in certain positions. If we want to correct the inputs, we can do a similarity search before encoding and sending the vector to the neural network, even if this means more pre-processing.
The next step is deciding whether word- or subword-level encodings make sense to add. While this would make the sequences shorter and more expressive, it would make the encoder and decoder dictionaries bigger. The issue with the current pytorch and faiss setup is that this dictionary would need to be duplicated, which makes memory consumption an issue.
The other issue here is that the multi-language approach from the beginning makes this memory requirement even more intensive. A compromise decision and measurements need to be made for this.
While the network can be trained (like LISA) to do POS tagging, and other information can also be given to the network, this work also wants to focus on what kind of external information can be given to the network. In LISA, one head is trained to attend to the syntactic parent of the token. A more comprehensive approach could be taken, adding an external source where a head can attend to the resulting syntactic/semantic role of the token as well as many other references (tense, gender, number, declension, etc.). This could be an interesting approach: as the representations of the different elements can be learnt, these heads can (and will) give external information, quantized by an internal representation in a memory holding the different possibilities. Some extensions of this are already in place in Attending to Entities for Better Text Understanding.
This external information can take different forms like:
The challenges here are multiple:
One approach to having extra information would be to create a comprehensive dictionary from the beginning (like downloading the entire wikimatrix database) and adding the information from there (gender, tense, etc.) to the sentence.
This is another open point in the internal discussion in my mind that I still have to settle.
Thinking about some ideas while reading the following extract, I think a graph representation of the meta-concepts (POS tag, syntactic/grammatical function, etc.) could be useful:
From Deep Learning Based Text Classification: A Comprehensive Review (April 2020), p. 15:
"Yao et al. [103] used a similar Graph CNN (GCNN) model for text classification. They built a single text graph
for a corpus based on word co-occurrence and document word relations, then learned a Text Graph Convolutional
Network (Text GCN) for the corpus, as shown in Fig. 14. The Text GCN is initialized with one-hot representation
for word and document, and then jointly learns the embeddings for both words and documents, as supervised by
the known class labels for documents."
For the moment I've worked on compositionality for the token representation. There is also the similarity and inclusion search that needs to be addressed; for this I might have to look into Bloom Filters (noted here so I don't forget about it).
While most models concentrate on first pre-training a Language Model (LM) and then fine-tuning, I want to concentrate on a meta-model that can deal with different tasks.
For this I want to work in a constructive manner, where simpler support tasks are trained first, then the complexity of the network is increased to take advantage of the previously trained tasks.
Also, instead of trying to train on long stretches of text, I'll start with simpler and shorter phrases, doing POS, parent tagging and other tasks available in the CONLLU dataset simultaneously with some basic LM tasks, as in Unified Language Model Pre-training. This should give a base from which to build a more comprehensive Language Modeling task.
Another direction is adding Unified Language Model pre-training combined with the approach of Backpropamine: Neuromodulated Plasticity and the Neuromodulated Meta-Learning Algorithm (ANML) from Uber AI Labs.
This work shows a different take on the current approach to input coding for Natural Language Processing tasks with Deep Learning networks.
Several different ways of encoding, as well as the code to build them, are available in the repository.
This partial study is part of a deeper study on how to make networks train faster and run on consumer-grade GPUs in a competitive way (even if they are not SoTA).
Next steps are: