Leo's Home page -- Github Page -- License: CC BY-SA 4.0

DRAFT - V3¶

I've been meaning to finish (for a definition of finish) and publish this for a couple of years now, but I think it's better to just publish it now than to keep waiting, as I've moved on to many other projects and also have a busy life.

This needs more sorting and a proper write-up of the results, which I should re-run and update with the latest technologies to make it complete, but the ideas and main conclusions do still hold.

Flexible Universal Character Level Encoding methods for Text in NLP (FlexCodes)¶

A Study of Multiple Encoding Techniques for Low Resource Consumption Character Level Embeddings for NLP Tasks¶

  • Leonardo M. Rocha
  • Contact Me

Note [Nov 2022]¶

Many other references exist and the field has advanced with models containing several billion parameters. The current publication and its ideas remain the same, as they concern pre-processing stages before the network input layer.

Abstract¶

Currently NLP deals with Out of Vocabulary (OOV) symbols in different ways; this leads to several, not necessarily efficient, ways of pre-processing NLP datasets to be able to deal with those input vocabularies, and to the need to rethink the issue for every new application.

In this work we present different techniques to deterministically encode character- and token-level input that deal with OOV and allow encoding all possible symbols in a computationally efficient manner, representing all (or part) of the UTF-8 domain in a fixed-size vector. This input can be used as-is, or it can be compressed and efficiently pre-trained on a large vocabulary with an autoencoder.

These techniques eliminate the need for complex and compute-intensive pre-processing to find the existing tokens in the domain, replacing it with a much simpler one that works for any dataset within the initially defined character subset (here we study UTF-8 and some subsets of it). This work focuses on being able to encode a symbol; once it is encoded, the network that takes it as input can be fine-tuned without further modification.

We are not looking to win the SoTA race to the bottom; we are looking for alternatives that expand knowledge and model usability under restricted resource usage, and that allow more task variability in a single trained model.

Note: There is a great read about SoTA and leaderboards here by Anna Rogers

Contributions¶

  1. We explore the following deterministic encoding techniques:
    1. A Segment-Multihot-Encoding (SME) technique for UTF-8, which we call SME-UTF8
    2. Choosing k of N ${N\choose k}$
    3. Multiple joint co-prime codes
  2. We present a Python library that can create the necessary character-level codes and compute token-level codes from them
  3. All the source code is made available under the MIT license in the project's repository
  4. We also present, as a curiosity added to this study, Overfitting Compression: using overfitting to compress an entire known domain into a smaller vector.

Notes¶

  • An extra advantage of this methodology is that a deterministic, well-defined process with a small representation size can be implemented efficiently in hardware.
  • This is also presented directly as an IPython notebook so it can be executed where needed. We call this the Executable Paper, and the idea is to improve reproducibility.

Introduction and Related Work¶

Currently, NLP tasks require first analyzing the input domain, extracting the tokens and training an embedding network, while also dealing with Out of Vocabulary (OOV) words or symbols and polysemy.

It is important to separate (and we do in this work) the encoding part (being able to represent the symbol) from learning to use those symbols (the network being able to do something useful with them), as this work focuses solely on being able to encode all feasible symbols in a defined text encoding domain. In this case the work is done for UTF-8, which is the most used text encoding on the web (94.6% according to w3techs).

Diverse techniques deal differently with OOV, ranging from techniques that cannot handle it, like GloVe - Pennington et al. 2014 or Word2Vec - Mikolov et al. 2013, to others such as the Universal Sentence Encoder - Cer et al. 2018 or FastText, which can encode OOV words with subword embeddings. One of the most used techniques is Byte Pair Encoding from Neural Machine Translation of Rare Words with Subword Units - Sennrich et al. 2015, which has the advantage of compressing the input sequence length compared to character-based models, hence accelerating training compared to a full character-level input on the current SoTA, while at the same time having fewer tokens than word level, which makes it a good compromise on both sides.

Some important references are:

  • ELMo - Peters et al. 2018
  • ULM-FiT - Howard, Ruder 2018
  • BERT - Devlin et al. 2018
  • AlBERT - Lan et al. 2020
  • CamemBERT - Martin et al. 2019
  • GPT2 - Radford et al. 2019 and GPT3 - Brown et al. 2020

All current methods deal with subdomains of the possible inputs available, which for most tasks is enough; nevertheless, the weakness is that they cannot deal with all possible input symbols, which in the current study means all of UTF-8. This work is inspired by a previous work by the same author: ROCHA 2018 MIDI encoding analysis.

In the case of continual learning, the need to add new symbols is a given, be it due to adding a new domain in the same language, new languages, or new symbols that were previously non-existent and get added to the UTF-8 encoding.

It is also important to note that we are looking for over-represented vector spaces with redundant information in different forms, which could be exploited differently by the network on a per-case basis.

This work analyzes the UTF-8 encoding and presents a technique to encode all or part of the UTF-8 domain in a computationally efficient way. The same technique can be used for other text encodings without any modification, and as UTF-8 is a superset of other encodings (such as ASCII), the same matrix encoding can be applied without any modification to those datasets.

Notes:

  • All the current SoTA methods are trained on clusters (and at costs) that are unavailable to most users.
  • The current work is part of a larger effort to get enough performance out of commercially available (and relatively accessible) single GPUs for end users (the NVidia RTX 2080 Ti being one of the computationally strongest cards on the market at the start of this work).

Character Level NLP¶

Character Level NLP does come with some benefits, but also with its own drawbacks like:

  • more computation needed due to longer input sequences
  • issues and errors on character level

To read more about this, the company LightTag has a nice blog post about it.


Notes¶

Note 1:

It is worth noting here that the author also has experience in communications, which allowed, during the course of this research, the analysis of multiple Error Correcting Codes (ECCs) and different kinds of encoding (for example, encoding as a Fourier series). The conclusion is that even if one-hot is the best in distance, other codes can be used, and a sparse multi-hot is the simplest to implement (and the fastest to encode, although no comparison is presented here). As a note, one pending task is to analyze ECCs in an end-to-end manner for a neural network. Some of these analyses (without proper formatting) can be found in the notebooks folders, text subfolder, of the minibrain project, where most of the experimental code is located.

Note 2:

There is an interesting pre-print, Controllable Variational Autoencoder, that uses Control Theory to improve Variational Autoencoders, and it seems to use one of the things I didn't know how to add to the codes.


Sections¶

  1. This paper first deals with an analysis of the UTF-8 encoding
  2. Then it deals with the construction of the proposed multi-hot code
  3. After that it works on compressing the multi-hot code with Overfitting (yes, Overfitting)
  4. Then it goes on to the evaluation of the codes and compares the results with other encoding methods.
  5. Conclusion
  6. Future Work
  7. Appendices

UTF-8 Analysis¶

One-Hot encoding¶

One-hot encoding is one of the most used ways to encode categorical variables; in State-of-the-Art NLP tasks it is used to encode the input symbols. This is computationally expensive, and the goal here is to reduce this complexity, leaving memory and computational space for other, more complex tasks in the network.

Number of code-points¶

As this work tries to encode all the characters representable in UTF-8, we have to check the feasible number, so:

From Wikipedia utf-8

UTF-8 is a variable-width character encoding capable of encoding all 1,112,064 valid Unicode code points:

$$ 17 \times 2^{16} = 1114112 $$

code points, minus the 2,048 technically-invalid surrogate code points.

That is, if encoding with one-hot we would need about 1.1M parameters per neuron in the input layer, which is expensive. The goal is to reduce this complexity (which we argue is unnecessary) by orders of magnitude.

UTF-8 structure and Encoding Details¶

The entire UTF-8 universe is NOT the entire $2^{32}$ domain; there are limitations, explained in the UTF-8 description:

| Number of bytes | Bits for code point | First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
| 1 | 7 | U+0000 | U+007F | 0xxxxxxx | | | |
| 2 | 11 | U+0080 | U+07FF | 110xxxxx | 10xxxxxx | | |
| 3 | 16 | U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| 4 | 21 | U+10000 | U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |

The UTF-8 code is thus formed by 4 segments; we will refer to this often during the current work.

The point is that the number of elements in the table should be at most $2^{21}$. We only need to create an index that can handle the 4 cases, which can be done with 4 different conversion tables.

In fact it is possible to just cut the UTF-8 value into chunks and do a one-hot per part:

  • there are only 4 segment ranges, which can be encoded as a one-hot to add redundancy; a Hamming or other ECC could also be added there
  • the largest value is for 8 bits -> 256 values
  • the others contain 6 bits -> 64 values

The prefix of each byte can be stripped and replaced by the initial segment one-hot.

So a complete code would be: $ 4 + 256 + 64 + 64 + 64 = 452 $

Instead of having a vector of dimension 1,112,064 to encode any UTF-8 value, one of dimension 452 is able to encode everything in the UTF-8 domain (a small sketch of this construction follows the example below).

This embedding can still be reduced, but it should already be sparse enough to make a good input. The goal here is to have a sparse vector that keeps each vector far enough from the others, at least by one dimension. Adding the redundancy code (the first 4 dimensions) makes the distance even bigger for vectors that should be further apart, while taking the locality of the UTF-8 encoding into account, since each character set is close to the ones used with it.

For example:

  • segment 1 corresponds to ASCII
  • segments 2 and 3 encode many (if not most) of the alphabets of living languages
  • segment 4 mostly encodes supplementary code points, private use areas and invalid codes.
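As announced above, here is a minimal sketch of this 452-dimensional segment multi-hot encoding (illustrative only; the function name is ours and this is not the sparse_encoders implementation, which also builds the conversion dictionaries):

import numpy as np

def sme_utf8(char):
    # Sketch of the segment multi-hot: 4 segment-indicator dimensions,
    # one block of 256 for the first byte and three blocks of 64 for the
    # 6 payload bits of each continuation byte (4 + 256 + 3*64 = 452).
    raw = char.encode("utf-8")            # 1 to 4 bytes
    vec = np.zeros(452, dtype=np.uint8)
    vec[len(raw) - 1] = 1                 # which of the 4 segments (redundancy)
    vec[4 + raw[0]] = 1                   # first byte, full 256 range
    for i, b in enumerate(raw[1:]):       # continuation bytes: keep the 6 payload bits
        vec[4 + 256 + 64 * i + (b & 0x3F)] = 1
    return vec

v = sme_utf8('é')                         # 2-byte character -> 3 active dimensions
print(v.sum(), np.nonzero(v)[0])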

Encoding details¶

UTF-8 Segments¶

To cut memory consumption even more, the table can be generated for 1 to 4 segments of the UTF-8 code, taking into account that the 4th segment is mostly composed of:

  • Supplementary Multilingual Plane (SMP) of historic scripts
  • Supplementary Special-purpose Plane (SSP)
  • Private Use Areas (PUA)
  • Invalid Codes

We can safely ignore this 4th segment (for the purposes of this article and most usages), which accounts for most of the code points.

In this particular analysis (due to resource consumption) CJK, Indic and some miscellaneous symbols are not (and will not be) needed, so the 3rd segment can be safely ignored too, reducing the memory consumption of the application even more.

So the result would be:

| Segment | # of code points | First index | Last index | Vector Size | # code exceptions | Size (MB) | Matrix Size (MB) | Sparse Size (MB) |
| 4 | 1107904 | 61440 | 1107904 | 452 | 790656 | 11538.59 | 3820.59 | 83.59 |
| 3 | 59328 | 2112 | 61439 | 388 | 4224 | 530.71 | 175.62 | 3.59 |
| 2 | 1984 | 128 | 2111 | 324 | 128 | 14.84 | 4.90 | 0.09 |
| 1 | 128 | 0 | 127 | 260 | 0 | 0.77 | 0.25 | 0.005 |

Where:

  • Segment: number of segments used from utf-8 to generate the code
  • # of Code Points: The total encoded code points generated
  • Vector Size: The embedding vector size
  • First / Last Index: corresponds to the segment first and last index in the embedding matrix
  • # Code Exceptions: Number of code exceptions raised during encoding with Python; notice that we use the standard library for this.
  • Size (MB): Size of the embedding matrix and conversion dictionaries (from-to code) once saved in disk in Dense mode
  • Matrix Size (MB): Size of the embedding matrix in disk in Dense Mode
  • Sparse Size (MB): Size of the embedding matrix in disk in Sparse mode

Notice that the code for 1 segment corresponds to a one-hot encoding of ASCII (plus the vector of size 4 that we keep unchanged in every case).

Encoding Size Notes and Partial Conclusions¶

  • Encoding Exceptions:

Instead of starting from the characters, this study starts from the binary encoding and then asks Python to generate the character from it. UTF-8 defines many of the possible codes in the code space as invalid, unused or reserved, which means that this generates an execution exception during the process of going from the binary number to the character (a small sketch follows these notes).

  • Memory Consumption:

Checking the matrix memory consumption: if a sparse matrix is used for the encodings, any new mobile or desktop device should be able to deal with the complete code. This holds even if we keep ALL the code elements in memory, without taking into account that many of them are invalid.

Considering the invalid codes as well: $\frac{790656}{1107904} = 0.71$, so only $29\%$ of this code is valid. The complete code in dense mode would then be around $3821 \times 0.29 = 1108$ MB, and in sparse mode around $25.8$ MB, which makes the memory requirements much smaller.
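Returning to the encoding exceptions above: a minimal sketch of how such exception counts arise for the 2-byte patterns (assuming the same strategy of building the raw byte pattern and letting Python's strict UTF-8 decoder reject it; the library's exact bookkeeping may differ slightly):

def count_2byte_codes():
    # Enumerate every 110xxxxx 10xxxxxx bit pattern and let Python's
    # UTF-8 decoder reject the invalid (overlong) sequences.
    valid, exceptions = 0, 0
    for payload in range(2 ** 11):                  # 11 payload bits
        b1 = 0b11000000 | (payload >> 6)            # 110xxxxx
        b2 = 0b10000000 | (payload & 0b00111111)    # 10xxxxxx
        try:
            bytes([b1, b2]).decode("utf-8")
            valid += 1
        except UnicodeDecodeError:
            exceptions += 1
    return valid, exceptions

print(count_2byte_codes())    # (1920, 128): the 128 overlong patterns raise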

Signaling the Start and End of a Sequence¶

To this end we can use the codes available in UTF-8 and add the mapping to the encoding and decoding dictionaries (not the matrix). The first 0x20 codes in UTF-8 signal different communication and control codes; we can use those for our NLP purposes, or we could choose the invalid codes at 0xC0 and 0xC1, or codes larger than U+10FFFF (at the end of segment 4, as per this paper's vocabulary).

In order to avoid any issues, and as we don't plan on using the current models for communication protocols but only for NLP purposes (this encoding could also be used to allow the network to deal directly with network communication protocols), we decide to use the control codes at the beginning of the first UTF-8 segment.

To this end we re-map some chosen printable characters to control codes; we choose printable characters that are not used much (except for some old terminal applications and maybe a few others) in order to allow for better visual debugging and easier text manipulation.

This is arbitrary and can be changed, but we start by selecting interesting characters to use: ▲△▴▵▶▷▸▹►▻▼▽▾▿◀◁◂◃◄◅◆◇◈◉◊○◌◍◎●◐◑◒◓█▒, and from these we choose a few.

Block codes used for delimitation, chosen to avoid any encoding collision:

  • SEPARATOR = '█' # to use as separator in CSV or TSV files
  • CHAR_MASK = '▒' # To use as a mask for a character

We set a reserved code space bigger than the control codes, to be used for special purposes: RESERVED_CODE_SPACE is set to 33, index 32 being the space character. \t, \r, \n and ' ' (space) live in this space; we insist: DO NOT OVERRIDE these codes.

Then we define the following codes:

  • NUL = ('◌', 0x00, '◁NUL▷') # NUL control code -> for Padding for example
  • SOH = ('◀', 0x01, '◁SOH▷') # SOH control code (Start of Heading) -> example: to indicate a task description or tgt lang
  • STX = ('◂', 0x02, '◁STX▷') # STX control code (Start of Text) -> start of text
  • ETX = ('▸', 0x03, '◁ETX▷') # ETX control code (End of Text) -> end of text
  • EOT = ('▶', 0x04, '◁EOT▷') # EOT control code (End of Transmission) -> end of document
  • UNK = ('◍', 0x15, '◁UNK▷') # NAK control code (Negative Acknowledge) -> Unknown value
  • SUB = ('◁SUB▷', 0x1A) # SUB control code (Substitute) -> Garbled or Invalid Characters
  • MSK = ('▒', 0x1A, '◁MSK▷') # Use this instead as mask for a single character

The <unk> (Unknown) element should not be used, as by design there should be no unrepresentable symbols in the code; it is added for completeness and in case of future use.

Note that the mapping is created in a way that stays as close as possible to the original meaning of the control symbols.
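A minimal sketch of how such a re-mapping could be applied to a text before encoding (the glyphs and code points mirror the constants above; the helper name is ours, and the inverse table, used for visual debugging, is just the reversed dictionary):

# map the chosen printable glyphs onto the corresponding control code points
CONTROL_REMAP = {
    '◌': '\x00',   # NUL -> padding
    '◀': '\x01',   # SOH -> start of heading / task description
    '◂': '\x02',   # STX -> start of text
    '▸': '\x03',   # ETX -> end of text
    '▶': '\x04',   # EOT -> end of document
    '◍': '\x15',   # NAK -> unknown value
    '▒': '\x1a',   # SUB -> mask for a single character
}
_TO_CONTROL = str.maketrans(CONTROL_REMAP)
_TO_GLYPH = str.maketrans({v: k for k, v in CONTROL_REMAP.items()})

def to_control_codes(text):
    # replace the visual glyphs by the control code points before encoding
    return text.translate(_TO_CONTROL)

print(to_control_codes('◀task:lm◂Bonjour▸▶').encode('unicode_escape'))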


Encoding and Decoding Subwords composed by character level combinations¶

Here we just name some ideas; this line of work will be explored further in work that is already ongoing and will be presented in the future, so no more detail is given here.

Encoding subwords and other character level combination types (like BPE)¶

While a character-level encoding might be faster at pre-processing time, it might not be the fastest at training or execution time (it needs more input steps).

The character-level encoding can thus be turned into any type of subword embedding with a few different techniques (see the sketch after this list):

  • Sum of the encoding vectors: advantage: easiest; disadvantage: does not preserve the order of the characters
  • Convolution: advantage: convolution is order dependent; disadvantages: the convolution of 2 vectors of dimension d has dimension 2d-1, and it is more costly at pre-processing time
  • Circular Convolution: advantage: order dependent while keeping the dimension fixed; disadvantage: more costly at pre-processing time
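As referenced above, a minimal sketch of the three combination options on top of any character-level code matrix (random binary vectors stand in for the real codes here; the circular convolution is done through the FFT):

import numpy as np

rng = np.random.default_rng(0)
d = 96
char_codes = {c: (rng.random(d) < 0.1).astype(float) for c in "abcdefgh"}

def by_sum(token):
    # order-insensitive bag of characters
    return np.sum([char_codes[c] for c in token], axis=0)

def by_convolution(token):
    # binds characters by linear convolution; the dimension grows with each character
    vec = char_codes[token[0]]
    for c in token[1:]:
        vec = np.convolve(vec, char_codes[c])
    return vec

def by_circular_convolution(token):
    # binds by circular convolution (via FFT); keeps the dimension fixed at d
    vec = char_codes[token[0]]
    for c in token[1:]:
        vec = np.real(np.fft.ifft(np.fft.fft(vec) * np.fft.fft(char_codes[c])))
    return vec

print(by_sum("abc").shape, by_convolution("abc").shape, by_circular_convolution("abc").shape)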

Decoding subwords and other character level combination types (like BPE)¶

  • In the case of decoding these subword-level encodings, the decoding can again be done by similarity search, only that the search space of the decoder will be bigger than at character level.
  • It can be done with one-hot decoding, which is still worthwhile because of the savings during encoding; the decoding domain can be a restricted one, for example only one output language or a classification task.
  • It can also be done at character level by deconvolving a known last element with the previous vector and iterating from there, but this requires learning and needs to be studied.

Codes Creation¶

This section is dedicated to code execution and measurements to fill the table above

In [1]:
from sparse_encoders import create_measure_tables
create_measure_tables()
number of codes =  128
number of code_exceptions =  0
number of codes =  1984
number of code_exceptions =  128
number of codes =  59328
number of code_exceptions =  4224
number of codes =  1107904
number of code_exceptions =  790656
| Segments | exec_time (sec) |  matrix_shape | Size in Disk (MB): | Matrix Size in Disk (MB):            | Sparse Matrix Size in Disk (MB): |code path
| 1 | 0.004 | (128, 260) | 0.77 | 0.25 | 0.00 | codes/utf8_codes-1seg.pkl |
| 2 | 0.094 | (1984, 324) | 14.83 | 4.90 | 0.09 | codes/utf8_codes-2seg.pkl |
| 3 | 2.051 | (59328, 388) | 530.26 | 175.62 | 3.59 | codes/utf8_codes-3seg.pkl |
| 4 | 45.606 | (1107904, 452) | 11530.14 | 3820.59 | 83.59 | codes/utf8_codes-4seg.pkl |

Observations¶

Execution of previous results¶

The execution of all the previous code is done in a single thread of an Intel i7 7700.

Embedding Sizes¶

The size of the embedding matrix grows with the number of code points and the embedding size; observing the size of the matrices, their sparse representation is negligible in comparison with current models.

NVidia provides support for sparse operations with cuSPARSE (documentation) which means we can use these matrices.

Nevertheless, many applications work in dense mode, and in that case working with fewer than the 4 segments is advisable. The 3-segment embedding provides enough representation for all languages on Earth, and the 2-segment one already supports most languages, as stated in a previous section.

Execution Time¶

As seen in the code execution above, even if the vectors are big to keep saved and to download, their creation is deterministic and they can be recreated in less than a minute of execution on a single thread of an off-the-shelf CPU.

PC Configuration:¶

Hardware¶

intel i7 7700
64GB of RAM
GPU-0 GTX1080 -> runs the GUI and other tasks, sometimes used for train and testing
GPU-1 RTX2080ti -> only used for computation

Software¶

Cuda V10.1
Pytorch V1.3
Python 3.7

Overfitting Compression¶

In the literature overfitting is an evil creature, but in this case, as we know the entire domain, we are going to use it to our advantage by overfitting the sparse input (the multi-hot encoded vector) into an embedding vector smaller than the input; the goal here is lossless compression.

This is done to be able to make a more informed decision at the end of the study and to show comparative results. We are able to show that overfitting underperforms manual encoding (which comes as no surprise here): where overfitting fails to decode vectors, the manually created codes are smaller and repeatable.

Once the network is trained (basically an overfitted autoencoder), a new encoding matrix is generated by passing each element of the input domain through the autoencoder and taking the latent vector; this is used to build a matrix that can be given directly as an Embedding to the network.

As the UTF-8 coding uses a maximum of 4 bytes for the code representation, vectors of size at least 32 are needed. The smallest code for this would be to use the UTF-8 code directly as the embedding (which we should also test as input to be able to compare the results).

Any vector representation that handles the complete domain must then be at least of dimension 32. Here we'll test several dimensions for each number of codes; the representation for 1-segment coding is done just for completion, and the one for 2-segment coding might be useful, but with only 1984 elements a one-hot encoding does not pose a big problem with current resources. The code starts to be more interesting for 3-segment and 4-segment coding.

The training is done with the following configuration:

Batch Size: size of the whole symbol set; each batch contains every symbol in the domain once
Network Configuration: Autoencoder
Loss: CrossEntropy

And we measure:

Output Vector Embedding Size
Execution Time
Matrix Size on Disk (here only the dense matrix is taken into account as there should be close to no sparsity)
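A minimal PyTorch sketch of this overfitted autoencoder setup for the 2-segment code (the architecture, sizes and the BCE-with-logits loss used here for the multi-hot targets are assumptions for illustration, not the exact training script):

import torch
import torch.nn as nn

class CodeAutoencoder(nn.Module):
    # Overfit the full multi-hot code matrix into a smaller latent vector
    def __init__(self, in_dim=324, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

# the whole domain in every batch: a random stand-in for the real (1984, 324) code matrix
codes = torch.rand(1984, 324).round()
model = CodeAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()              # one sigmoid per dimension of the multi-hot target

for step in range(200):                       # overfitting is the goal, not generalization
    reconstruction, _ = model(codes)
    loss = loss_fn(reconstruction, codes)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

with torch.no_grad():                         # the latent vectors become the compressed embedding matrix
    embedding_matrix = model.encoder(codes)   # shape: (1984, 64)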


The experiments on overfitting were run with different vector sizes, from 32 to 128, for encodings using 3 and 4 segments.

TODO: do the runs again (with a better and cleaner script) and put the results here.

Although I should present the results, these were not conclusive at the time of the experiments. The codes were learnt by overfitting and could be used, but at the moment there seemed to be no extra benefit in doing it this way, as it needs learning while the other codes presented here are deterministic.

Sparse Codes¶

In this section we develop a methodology and different codes with different vector sizes and distances. All the generated codes are saved and later used to test in NLP tasks.

Two different ideas are used to create the sparse codes: the first one is choosing k from N; the second is to generate a multi-hot from the combination of different smaller codes of co-prime sizes, which leads to longer cycles in the combination of the codes.

Although any code order might be enough, we look for a repeatable, predictable and deterministic process that reaches the same values each time we recompute the code.

Sparse Codes, Choosing k of N ${N\choose k}$¶

For completion, this study also deals with different sparse coding techniques; the basic idea is choosing $k$ active positions out of $N$.

We need to basically do the following: ${N\choose k}$

Where $32 \le N \le M$, defining $M$ as the maximum value of the desired vector dimension,

and $k$ should be minimized to increase the sparsity of the vector as much as possible.

We can also again add some redundancy as in the previous multi-hot code, i.e. the first 4 elements should indicate which UTF-8 segment is being used for the code point.

$${N\choose k} = \frac{N!}{k!(N-k)!}$$
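A minimal sketch of such a deterministic ${N\choose k}$ code (illustrative only; the library's create_sparse_Nk_codes does the real bookkeeping): symbol $j$ simply gets the $j$-th $k$-combination of positions in lexicographic order, so re-running the generation always yields the same matrix.

from itertools import combinations, islice
import numpy as np

def choose_nk_codes(n_symbols, N, k):
    # deterministic N-choose-k multi-hot: symbol j gets the j-th k-combination
    codes = np.zeros((n_symbols, N), dtype=bool)
    for j, positions in enumerate(islice(combinations(range(N), k), n_symbols)):
        codes[j, list(positions)] = True
    return codes

codes = choose_nk_codes(128, N=17, k=2)            # C(17, 2) = 136 >= 128, as in the 1-segment row below
print(codes.shape, int(codes.sum(axis=1).max()))   # (128, 17) 2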

TODO: there might be an issue (to explore in the validation part) with this method: we cannot use a multi-softmax decoding method; instead, classic multi-label classification methods need to be used (a Sigmoid layer), followed by vector similarity. In the case where both the ${N\choose k}$ and co-prime methods are mixed, the decoding part of the network architecture gets a bit more complex.

The vector sizes are explored here:

In [2]:
from sparse_encoders import sparse_Nk_dimension_analysis
In [3]:
%%time
results = sparse_Nk_dimension_analysis()
len(results)
CPU times: user 12.9 ms, sys: 358 µs, total: 13.2 ms
Wall time: 13.1 ms
Out[3]:
18
In [4]:
# results: (code points needed, possible code points, vector size, number of  ones in code, sparsity ratio)
results
Out[4]:
[(128, 136, 17, 2, '0.118'),
 (128, 165, 11, 3, '0.273'),
 (128, 210, 10, 4, '0.400'),
 (128, 252, 10, 5, '0.500'),
 (128, 210, 10, 6, '0.600'),
 (1984, 2016, 64, 2, '0.031'),
 (1984, 2024, 24, 3, '0.125'),
 (1984, 2380, 17, 4, '0.235'),
 (1984, 2002, 14, 5, '0.357'),
 (1984, 3003, 14, 6, '0.429'),
 (59328, 59640, 72, 3, '0.042'),
 (59328, 66045, 37, 4, '0.108'),
 (59328, 65780, 26, 5, '0.192'),
 (59328, 74613, 22, 6, '0.273'),
 (1107904, 1125180, 190, 3, '0.016'),
 (1107904, 1150626, 74, 4, '0.054'),
 (1107904, 1221759, 45, 5, '0.111'),
 (1107904, 1344904, 34, 6, '0.176')]

We can observe that the sizes of the vectors in these representations are much smaller than in the manual one-hot-by-segment code created before. We can again use the same technique as in the first part, adding 4 initial vector elements that represent the segment to which the symbol belongs.

There is another point to take into account: the size of the vector matters, as hardware is better adapted to certain sizes; it also has consequences for the kinds of techniques that can be applied, for example grouped convolutions.

It is convenient then to have vector sizes that are powers of two and also multiples of 96 (tensor operation sizes in NVidia Tensor Cores are of size 96... something to understand better here: https://devblogs.nvidia.com/nvidia-turing-architecture-in-depth/)

Sparse Codes, multiple joint co-prime codes¶

The idea here is to create multiple one-hot codes, each of a prime size (which is the simplest way to choose co-primes), and then join the codes into one bigger code.

To maximize the cycle we choose the dimensions of the one-hot codes to be co-prime to each other; the simplest solution is to choose prime numbers as the sizes of the sub-codes.

Again we look for combinations that minimize the vector size.
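A minimal sketch of this joint co-prime construction (illustrative, not the sparse_encoders implementation): symbol $j$ activates position $j \bmod p$ inside the one-hot block of each prime $p$, so the joint pattern only repeats after $\prod_i p_i$ symbols.

import numpy as np

def coprime_multihot(n_symbols, primes=(3, 5, 11, 13)):
    # one one-hot block per prime; symbol j activates position (j mod p) in each block
    dim = sum(primes)
    codes = np.zeros((n_symbols, dim), dtype=bool)
    for j in range(n_symbols):
        offset = 0
        for p in primes:
            codes[j, offset + (j % p)] = True
            offset += p
    return codes

codes = coprime_multihot(1984)                 # 2-segment case: 1984 < 3*5*11*13 = 2145 distinct patterns
print(codes.shape, int(codes.sum(axis=1)[0]))  # (1984, 32), 4 active dimensions per symbol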

In [5]:
%%time
from sparse_encoders import multihot_primes_conf_finder
_, codes_1seg, codes_2seg, codes_3seg, codes_4seg= multihot_primes_conf_finder()
CPU times: user 332 ms, sys: 8.77 ms, total: 341 ms
Wall time: 340 ms

The shortest codes that can handle the needed codebook sizes are:

In [6]:
codes_1seg[0], codes_2seg[0], codes_3seg[0], codes_4seg[0]
Out[6]:
(((2, 3, 5, 7), 4, 0.235, 17, 210),
 ((3, 5, 11, 13), 4, 0.125, 32, 2145),
 ((11, 13, 19, 23), 4, 0.061, 66, 62491),
 ((23, 31, 37, 43), 4, 0.03, 134, 1134383))

To build the codes, for each code we pad with redundancy so that we get an embedding vector with a size in $[32, 48, 64, 96, 128, 192, 256, 384]$.

But we can also look for a defined sparsity in the final vector embeddings; as an extra property, all the vectors have the same parity, which is nice to have to be able to check during decoding.

To take those two elements into account we'll select from the smallest vector sizes; we can see some of these here:

In [7]:
codes_1seg[:5], codes_2seg[:5], codes_3seg[:5], codes_4seg[:5]
Out[7]:
([((2, 3, 5, 7), 4, 0.235, 17, 210),
  ((3, 5, 11), 3, 0.158, 19, 165),
  ((2, 5, 13), 3, 0.15, 20, 130),
  ((2, 7, 11), 3, 0.15, 20, 154),
  ((3, 5, 13), 3, 0.143, 21, 195)],
 [((3, 5, 11, 13), 4, 0.125, 32, 2145),
  ((2, 7, 11, 13), 4, 0.121, 33, 2002),
  ((3, 5, 7, 19), 4, 0.118, 34, 1995),
  ((3, 7, 11, 13), 4, 0.118, 34, 3003),
  ((3, 5, 11, 17), 4, 0.111, 36, 2805)],
 [((11, 13, 19, 23), 4, 0.061, 66, 62491),
  ((11, 13, 17, 29), 4, 0.057, 70, 70499),
  ((11, 17, 19, 23), 4, 0.057, 70, 81719),
  ((7, 13, 23, 29), 4, 0.056, 72, 60697),
  ((7, 17, 19, 29), 4, 0.056, 72, 65569)],
 [((23, 31, 37, 43), 4, 0.03, 134, 1134383),
  ((23, 29, 37, 47), 4, 0.029, 136, 1159913),
  ((23, 29, 41, 43), 4, 0.029, 136, 1175921),
  ((17, 37, 41, 43), 4, 0.029, 138, 1108927),
  ((19, 29, 43, 47), 4, 0.029, 138, 1113571)])

Also, to fill the remaining space we can choose to combine co-prime codes with N choose k. This has the advantage of generating different patterns with redundant information that can be exploited by the network later.

Code Generation¶

First we generate the codes for the sparse and co-prime methods without padding to the HW-optimized size, then we do the padding. All codes are saved to be able to experiment with them later and compare results.

In [8]:
from sparse_encoders import create_sparse_Nk_codes, all_multihot_primes
In [9]:
%%time 
codes = create_sparse_Nk_codes()
| Segments | code size | Vector Size | N | k |exec_time (sec) |  Matrix Size in Disk (MB):                | Sparse Matrix Size in Disk (MB): |code path
| 1 | 128 | (128, 17) | 17 | 2 | 0.001 | 0.00 | 0.00 | codes/utf8_sparse_codes-1_N-17_k-2_seg |
| 2 | 1984 | (1984, 24) | 24 | 3 | 0.003 | 0.05 | 0.05 | codes/utf8_sparse_codes-2_N-24_k-3_seg |
| 3 | 59328 | (59328, 37) | 37 | 4 | 0.059 | 2.09 | 2.04 | codes/utf8_sparse_codes-3_N-37_k-4_seg |
| 4 | 1107904 | (1107904, 45) | 45 | 5 | 1.024 | 47.55 | 47.55 | codes/utf8_sparse_codes-4_N-45_k-5_seg |
CPU times: user 1.01 s, sys: 75.7 ms, total: 1.09 s
Wall time: 1.09 s
In [10]:
%%time 
cp_codes = all_multihot_primes()
| Segments | code size | Vector Size | primes | exec_time (sec) | Matrix Size in Disk (MB): | Sparse Matrix Size in Disk (MB): | code path |
| 1 | 128 | (128, 19) | (3, 5, 11) | 0.001 | 0.00 | 0.00 | codes/utf8_coprime_codes-128_primes-(3, 5, 11)_1_seg |
| 2 | 1984 | (1984, 32) | (3, 5, 11, 13) | 0.001 | 0.06 | 0.07 | codes/utf8_coprime_codes-1984_primes-(3, 5, 11, 13)_2_seg |
| 3 | 59328 | (59328, 66) | (11, 13, 19, 23) | 0.034 | 3.73 | 2.04 | codes/utf8_coprime_codes-59328_primes-(11, 13, 19, 23)_3_seg |
| 4 | 1107904 | (1107904, 134) | (23, 31, 37, 43) | 0.637 | 141.58 | 38.04 | codes/utf8_coprime_codes-1107904_primes-(23, 31, 37, 43)_4_seg |
CPU times: user 549 ms, sys: 127 ms, total: 675 ms
Wall time: 673 ms

As we can observe, these codes are light on memory (at least up to 3 segments, which should be enough for most NLP tasks in most languages); in the case of the sparse matrix representation, all of them are low on memory consumption.

Combining sparse codes¶

To take advantage of hardware vector representations it might be more advantageous to combine different codes, as they cycle differently. Two representations will be built: one starting from the complete ${N\choose k}$ code and completing with the co-prime coding, and the other starting from the co-prime code and completing with ${N\choose k}$.

Starting from the ${N\choose k}$ codes we construct smaller codes than when starting from the co-prime technique; this has the consequence of having a completely duplicated code, which gives more redundancy than the smaller codes.

The configuration decision is made somewhat arbitrarily, just trying to get a good choice while keeping the vector dimension fixed to the sizes named above. The configuration is the following:

In the following table NCODES[X] represents a list of codes created for the particular test-case.

# N choose k + coprime multihot
# code dim, N,k,target dim, prime dim, primes
Nk_coprimes = [(NCODES[0], 17, 2, 32, 15, (3, 5, 7)),
               (NCODES[1], 24, 3, 48, 24, (5, 8, 11)),
               (NCODES[2], 37, 4, 64, 27, (3, 5, 8, 11)),
               (NCODES[3], 45, 5, 96, 51, (3, 7, 11, 13, 17))]

# coprime multihot + N choose k
# code dim, primes, (N, k)
code_config = [(NCODES[0], (3, 5, 11), (13, 3)),
               (NCODES[1], (3, 5, 11, 13), (32, 3)),
               (NCODES[2], (11, 13, 19, 23), (30, 4)),
               (NCODES[3], (23, 31, 37, 43), (58, 5))]

Redundant Codes Methods¶

To add redundancy to the codes we can try several methods: basically mixing methods, or adding a new one.

While linear transformations can achieve multiple reorderings of the input vectors, having different and redundant representations should make it easier for the learning process to find the parts that are best to interpret, or pay attention to, for the task at hand.

Single Cycle multi-one-hot Segmentation Method¶

Basically this method assigns a dimension of the vector representation to each part of the code in the segment. It goes as follows:

Let $C$ be a code of vector dimension $d$, such that $dim(C)=d$, where each $c \in C$ is formed only of binary elements $\{0,1\}$.

Let $s$ be a defined subset of the dimensions of $C$, with $dim(s)=d_s < dim(C)=d$ and dimensions $s_i$ for $i=1,\dots,d_s$.

Let $N$ be a set composed of (different) elements of dimension $dim(C)$ in code $C$ (note: $N$ does not have to be complete in the sense of representing all the elements possible in $C$), where $dim(N)=n$.

$N[s_i]_j = \operatorname{int}\left(\dfrac{j}{n/d_s}\right)$, where $j$ is the index of the vector in $N$, $s_i$ is the $i^{th}$ dimension of the set of dimensions $s$ of $N_j$, and $\operatorname{int}(x)$ is the integer part of the value $x$.

Extra Notes on Mixed and redundant Codebooks¶

Normally a network will create its own redundancy depending on the needs of the application; the goal of this study here is to create a code that can also be used by applications that continually learn different tasks, on different domains and in multiple languages.

While the codebooks generated by this method will be mostly ignored if the application is single-language only (English for example), later adding parallel columns (another study in progress) will make use of different parts of the code; the embedded redundancies might (to try/prove/future work) allow for easier learning.

In [11]:
from sparse_encoders import create_choose_Nk_coprimes_codes, create_coprimes_choose_Nk_codes
In [12]:
%%time
nkcp = create_choose_Nk_coprimes_codes()
| Segments | code size | Vector Size | N | k | primes |exec_time (sec) |  Matrix Size in Disk (MB):                        | Sparse Matrix Size in Disk (MB): |code path
| 1 | 128 | (128, 32) | 17 | 2 | (3, 5, 7) | 0.001 | 0.00 | 0.01 | codes/utf8_N-17k-2-coprime_codes-128_primes-(3, 5, 7)_1_seg |
| 2 | 1984 | (1984, 48) | 24 | 3 | (5, 8, 11) | 0.002 | 0.09 | 0.10 | codes/utf8_N-24k-3-coprime_codes-1984_primes-(5, 8, 11)_2_seg |
| 3 | 59328 | (59328, 64) | 37 | 4 | (3, 5, 8, 11) | 0.061 | 3.62 | 4.07 | codes/utf8_N-37k-4-coprime_codes-59328_primes-(3, 5, 8, 11)_3_seg |
| 4 | 1107904 | (1107904, 96) | 45 | 5 | (3, 7, 11, 13, 17) | 1.381 | 101.43 | 95.09 | codes/utf8_N-45k-5-coprime_codes-1107904_primes-(3, 7, 11, 13, 17)_4_seg |
CPU times: user 1.31 s, sys: 140 ms, total: 1.45 s
Wall time: 1.44 s
In [13]:
%%time 
cpnk = create_coprimes_choose_Nk_codes()
| Segments | code size | Vector Size | N | k | primes |exec_time (sec) |  Matrix Size in Disk (MB):                        | Sparse Matrix Size in Disk (MB): |code path
| 1 | 128 | (128, 32) | 13 | 3 | (3, 5, 11) | 0.002 | 0.00 | 0.01 | codes/utf8_coprime_codes-128_primes-(3, 5, 11)_N-13k-3_1-seg |
| 2 | 1984 | (1984, 64) | 32 | 3 | (3, 5, 11, 13) | 0.007 | 0.12 | 0.12 | codes/utf8_coprime_codes-1984_primes-(3, 5, 11, 13)_N-32k-3_2-seg |
| 3 | 59328 | (59328, 96) | 30 | 4 | (11, 13, 19, 23) | 0.057 | 5.43 | 2.98 | codes/utf8_coprime_codes-59328_primes-(11, 13, 19, 23)_N-30k-4_3-seg |
| 4 | 1107904 | (1107904, 192) | 58 | 5 | (23, 31, 37, 43) | 3.375 | 202.86 | 85.58 | codes/utf8_coprime_codes-1107904_primes-(23, 31, 37, 43)_N-58k-5_4-seg |
CPU times: user 3.09 s, sys: 328 ms, total: 3.41 s
Wall time: 3.44 s

As the code that seems the most useful for most languages is the one derived from 2 and 3 segments of UTF-8, we dedicate special attention to it and generate an extra code here, in two versions with two- and three-fold redundancy. The two-fold redundant one will have a non-compliant vector size (not in the vector size list specified in the section above); the other will be $2^n$.

2 Segments Code¶

For completion, and to have extra redundancy, we also create a code that contains both the complete co-prime and ${N \choose k}$ codes for the 2-segment case (the other cases are already complete), that is, a vector of size 64 where:

| coprimes | N | k | Size | Target Size | remaining |
| (3, 5, 11, 13) | 24 | 3 | 56 | 64 | 8 |

We need to use the remaining elements to create a good enough repetition that is not in the same period/cycle as the other existing parts of the code (this is the reason the co-prime code is created instead of using another method). The idea is to maximize the information obtained by looking at only part of the code.

Some ideas on how to use the remaining 8 dimensions are:

  1. One piece of information we can add is the segment to which the code belongs, using 2 dimensions and keeping 6 for another coding.
  2. Using a co-prime coding of 3 and 5 -> issue: already used (maybe creating and sorting it?)
  3. Using an ${8 \choose 2}$ coding, but this is a sub-period of ${24 \choose 3}$; the same applies to ${6 \choose 2}$
  4. We can do ${5 \choose 2}$ and ${3 \choose 1}$; this might just work (selected)

TODO: for completion we can also create a vector of size 64 with different co-prime techniques, using some of the codes sorted as in point 2 above. This has the advantage of being able to use Softmax in the decoding layer.

3 Segments Code¶

For completion, and to have extra redundancy, we also create a code that contains both the complete co-prime and ${N \choose k}$ codes for the 3-segment case (the other cases are already complete), that is, a vector of size 128 where:

| coprimes | N | k | Size | Target Size | remaining |
| (11, 13, 19, 23) | 37 | 4 | 103 | 128 | 25 |

The remaining 25 elements needed to complete dimension 128 are then treated again as an extra redundancy element. We can add several types, but the idea is to keep not only redundancy but also sparsity; doing a ${25 \choose x}$ for $x \in [2,3,4,5,6]$ works correctly at sparsities $[0.078, 0.086, 0.093, 0.10, 0.11]$ respectively. If $x=6$ the code completes again, giving a triplicate with different patterns that should allow the network to use the information differently.

As in the case of the 2-segment code we can treat the remainder to fill our needs; there are mostly two ideas to work on:

  • Do a complete coding with ${25 \choose x}$ (selected): the simplest option; we choose $x=4$ somewhat arbitrarily as it keeps sparsity below $10\%$ and gives a big enough cycle of ${25 \choose 4} = 12650$
  • Do a partial coding, with 3 dimensions encoding the segment the symbol belongs to, plus a ${22 \choose x}$: this method also makes a complete code if $x=5$ and keeps the same sparsity as the previous option, but has specific redundant information about the UTF-8 code block.

TODO: for completion we can also create vectors of size 96 and 128 with different co-prime techniques, using some of the codes sorted as in point 2 above. This has the advantage of being able to use Softmax in the decoding layer.

Preliminary Testing Decisions¶

Due to time and resource restrictions, the first set of experiments will only deal with the following setting; the decision (after some thought) is the following:

Code:¶

  • The code will be a multi-hot sparse coding (already decided beforehand)
  • The code will be a co-prime + $N \choose k$ method due to vector size

  • The code will be based on the co-prime method (easier to apply multiple Softmax layers)
  • The redundancy will also be based on the co-prime method, but segmented by the total number of elements in the code divided by the cycle of the prime (the Single Cycle Segmentation Method above).

  • The code will be built to comprise 2 and 3 segments of the UTF-8 code only, the 3-segment one being the most significant for a complete multilingual setup and the 2-segment one much smaller, and thus faster to encode and decode for simpler testing purposes.

Character Level Encoder:¶

  • The encoder will first pass the input through a few (maybe 2 or 3) linear layers on the vector (spatial) dimension before the temporal part of the NN

Character Level Decoder¶

  • The decoder will be composed first of a transformer layer with the same number of heads as the number of ones in each vector of the codebook

  • The decoder will finish (optionally and by configuration) with a multi-softmax layer (multiple softmaxes, one for each part of the code that contains a one-hot)
  • The decoder will work directly on the final vector dimension
  • The decoder will work on the spatial dimension only
  • The final decoding will be done with vector similarity (the faiss library)

Rethinking What to Encode¶

While we now have a methodology to encode UTF-8 by segments, and any list of elements, we still need to decide what to encode.

Natural language, as well as computer languages, contains a certain amount of symbols, including emojis; these, for example, are defined in the 4th segment (in this paper's vocabulary) of UTF-8, which means that a block coding per segment will not suffice. There might also be a few other symbols that appear on other occasions and that might be useful to encode (for example for some programming languages or symbolic representations such as mathematics).

So what we'll do is create a new list of characters and symbols (Unicode code points) to encode. For this we download some of the Wikipedia pages that contain the Unicode block definitions, and then we'll clean and sort them.

Even though phonetic symbols are present in every dictionary and we plan to use them in the future, they consist of 341 code points, which we want to avoid.

For the purpose of this work we'll avoid using most CJK and Indic scripts, as they have too many symbols for the current work, while keeping as much language variety as we can with a limited number of characters. Concerning the African scripts, there are two reasons not to include them here either: the first is the number of training samples, which is not sufficient for the current work; the second is the number of symbols that would need to be added to the encoding.

Ancient and historic scripts are skipped for the same reasons as the CJK and African scripts.

Nevertheless, the way the codes are generated here makes them extendable with new symbols without changing the previously defined elements.

The codes we are after are then: languages mostly based on the Greek, Roman and Cyrillic alphabets (most European languages).

If the number of code points and the number of training samples allow it, we'll also add Semitic languages such as Hebrew (not too many code points) and Arabic.

We add emoticons, as they are used in current social networks, and symbols, as they might be of use and don't have too many code points.

Concerning the Box Drawing, Block Elements and Geometric Shapes blocks: they'll be excluded, as they will be used as special characters for file separation (as in CSV or TSV, since we can't use tabs or commas for the current purpose) and as U+2581 (▁) is used as a special character in some tokenization algorithms. Although this might pose a problem when trying to play terminal-based games with the current methodology.

In [32]:
import ntpath
import os

unicode_wiki_sites = [
    "https://en.wikipedia.org/wiki/List_of_Unicode_characters",
    "https://en.wikipedia.org/wiki/Armenian_(Unicode_block)",
    "https://en.wikipedia.org/wiki/Glagolitic_(Unicode_block)",
    "https://en.wikipedia.org/wiki/Cyrillic_(Unicode_block)",
    "https://en.wikipedia.org/wiki/Cyrillic_Supplement",
    "https://en.wikipedia.org/wiki/Cyrillic_Extended-A",
    "https://en.wikipedia.org/wiki/Cyrillic_Extended-B",
    "https://en.wikipedia.org/wiki/Cyrillic_Extended-C",
    "https://en.wikipedia.org/wiki/Greek_and_Coptic",
    "https://en.wikipedia.org/wiki/Phonetic_symbols_in_Unicode",
    "https://en.wikipedia.org/wiki/Coptic_Epact_Numbers",
    "https://en.wikipedia.org/wiki/Coptic_(Unicode_block)",
    "https://en.wikipedia.org/wiki/Emoticons_(Unicode_block)",
    "https://en.wikipedia.org/wiki/Unicode_and_HTML_for_the_Hebrew_alphabet",
    "https://en.wikipedia.org/wiki/Thaana_(Unicode_block)",
    "https://en.wikipedia.org/wiki/Arabic_script_in_Unicode",
]
cmd = "lynx -dump -nolist '{}' > ./text/wiki-unicode/{}.txt"

for site in unicode_wiki_sites:
    head, tail = ntpath.split(site)
    tail = tail.replace('(','').replace(')','')
    print(tail)
    command = cmd.format(site, tail)
    print(command)
    os.system(command)
List_of_Unicode_characters
lynx -dump -nolist 'https://en.wikipedia.org/wiki/List_of_Unicode_characters' > ./text/wiki-unicode/List_of_Unicode_characters.txt
Armenian_Unicode_block
lynx -dump -nolist 'https://en.wikipedia.org/wiki/Armenian_(Unicode_block)' > ./text/wiki-unicode/Armenian_Unicode_block.txt
Glagolitic_Unicode_block
lynx -dump -nolist 'https://en.wikipedia.org/wiki/Glagolitic_(Unicode_block)' > ./text/wiki-unicode/Glagolitic_Unicode_block.txt
Cyrillic_Unicode_block
lynx -dump -nolist 'https://en.wikipedia.org/wiki/Cyrillic_(Unicode_block)' > ./text/wiki-unicode/Cyrillic_Unicode_block.txt
Cyrillic_Supplement
lynx -dump -nolist 'https://en.wikipedia.org/wiki/Cyrillic_Supplement' > ./text/wiki-unicode/Cyrillic_Supplement.txt
Cyrillic_Extended-A
lynx -dump -nolist 'https://en.wikipedia.org/wiki/Cyrillic_Extended-A' > ./text/wiki-unicode/Cyrillic_Extended-A.txt
Cyrillic_Extended-B
lynx -dump -nolist 'https://en.wikipedia.org/wiki/Cyrillic_Extended-B' > ./text/wiki-unicode/Cyrillic_Extended-B.txt
Cyrillic_Extended-C
lynx -dump -nolist 'https://en.wikipedia.org/wiki/Cyrillic_Extended-C' > ./text/wiki-unicode/Cyrillic_Extended-C.txt
Greek_and_Coptic
lynx -dump -nolist 'https://en.wikipedia.org/wiki/Greek_and_Coptic' > ./text/wiki-unicode/Greek_and_Coptic.txt
Phonetic_symbols_in_Unicode
lynx -dump -nolist 'https://en.wikipedia.org/wiki/Phonetic_symbols_in_Unicode' > ./text/wiki-unicode/Phonetic_symbols_in_Unicode.txt
Coptic_Epact_Numbers
lynx -dump -nolist 'https://en.wikipedia.org/wiki/Coptic_Epact_Numbers' > ./text/wiki-unicode/Coptic_Epact_Numbers.txt
Coptic_Unicode_block
lynx -dump -nolist 'https://en.wikipedia.org/wiki/Coptic_(Unicode_block)' > ./text/wiki-unicode/Coptic_Unicode_block.txt
Emoticons_Unicode_block
lynx -dump -nolist 'https://en.wikipedia.org/wiki/Emoticons_(Unicode_block)' > ./text/wiki-unicode/Emoticons_Unicode_block.txt
Unicode_and_HTML_for_the_Hebrew_alphabet
lynx -dump -nolist 'https://en.wikipedia.org/wiki/Unicode_and_HTML_for_the_Hebrew_alphabet' > ./text/wiki-unicode/Unicode_and_HTML_for_the_Hebrew_alphabet.txt
Thaana_Unicode_block
lynx -dump -nolist 'https://en.wikipedia.org/wiki/Thaana_(Unicode_block)' > ./text/wiki-unicode/Thaana_Unicode_block.txt
Arabic_script_in_Unicode
lynx -dump -nolist 'https://en.wikipedia.org/wiki/Arabic_script_in_Unicode' > ./text/wiki-unicode/Arabic_script_in_Unicode.txt
In [33]:
from preprocessors.symbol_extraction import *
In [34]:
selected_path = 'selected_sources_small'
In [35]:
%%time
extract_all_chars(selected_path)
CPU times: user 27.4 ms, sys: 0 ns, total: 27.4 ms
Wall time: 24.2 ms
In [36]:
sclist = char_stats(selected_path)
2899 new chars on file selected_sources_small/List_of_Unicode_characters.txt 
48 new chars on file selected_sources_small/Cyrillic_Supplement.txt 
3 new chars on file selected_sources_small/Greek_and_Coptic.txt 
90 new chars on file selected_sources_small/Unicode_and_HTML_for_the_Hebrew_alphabet.txt 
0 new chars on file selected_sources_small/Cyrillic_Unicode_block.txt 
Total number of symbols: 3040
Number of Selected Symbols: 1569

Excluding math and drawing symbols, there is a significant reduction in the number of symbols to represent.

We'll also add the first control-code code points from UTF-8.

In [37]:
%%time
extract_all_chars(selected_path)
sclist = char_stats(selected_path)
2899 new chars on file selected_sources_small/List_of_Unicode_characters.txt 
48 new chars on file selected_sources_small/Cyrillic_Supplement.txt 
3 new chars on file selected_sources_small/Greek_and_Coptic.txt 
90 new chars on file selected_sources_small/Unicode_and_HTML_for_the_Hebrew_alphabet.txt 
0 new chars on file selected_sources_small/Cyrillic_Unicode_block.txt 
Total number of symbols: 3040
Number of Selected Symbols: 1569
CPU times: user 19.7 ms, sys: 3.91 ms, total: 23.6 ms
Wall time: 22.6 ms
In [38]:
import pickle

path = '../..//minibrain/predictors/sequence/text/utf8-codes/txt2num_2seg.pkl'

txt2num = pickle.load(open(path, 'rb'))
In [39]:
utf8chars_2seg = txt2num.keys()
In [40]:
ccodes = set(list(utf8chars_2seg)[:32])
In [41]:
sclist = char_stats(selected_path)
2899 new chars on file selected_sources_small/List_of_Unicode_characters.txt 
48 new chars on file selected_sources_small/Cyrillic_Supplement.txt 
3 new chars on file selected_sources_small/Greek_and_Coptic.txt 
90 new chars on file selected_sources_small/Unicode_and_HTML_for_the_Hebrew_alphabet.txt 
0 new chars on file selected_sources_small/Cyrillic_Unicode_block.txt 
Total number of symbols: 3040
Number of Selected Symbols: 1569
In [42]:
all_codes = set(sclist)
all_codes.update(ccodes)
In [43]:
all_codes = sorted(list(all_codes))
len(all_codes)
Out[43]:
1601
In [44]:
from sparse_encoders import create_codematrix_from_conf
# generating codeblock and code dict 
# config = [(2, 1916, (24, 3), (3, 5, 11, 13), (6, 2), 64, 9 / 64)]
# config = [(2, 2112, (24, 3), (3, 5, 11, 13), (6, 2), 64, 9 / 64)]
config = [(2, 2112, (24, 3), (3, 5, 11, 13), (4, 6, 8, 10, 12), 96, 13 / 96)]
create_codematrix_from_conf(config)
| Segments | code size | Vector Size | N | k | primes | cycles | exec_time (sec) |  Matrix Size in Disk (MB):                            | Sparse Matrix Size in Disk (MB): |code path
| 2 | 2112 | (2112, 96) | 24 | 3 | (3, 5, 11, 13) | (4, 6, 8, 10, 12) | 0.012 | 0.19 | 0.14 | codes/utf8_2-seg_2112-codepoints_96-dim_N-24-k3_coprimes-(3, 5, 11, 13)_cycles-(4, 6, 8, 10, 12)_dense |
Out[44]:
[array([[ True, False, False, ...,  True,  True,  True],
        [False,  True, False, ...,  True,  True,  True],
        [False, False,  True, ...,  True,  True,  True],
        ...,
        [ True, False, False, ..., False, False, False],
        [False,  True, False, ..., False, False, False],
        [False, False,  True, ..., False, False, False]])]
In [45]:
from collections import OrderedDict
codes_ids = OrderedDict(list(enumerate(all_codes)))
In [46]:
import pickle
fname = 'codes/adhoc-code-1916-codepoints.pkl'
with open(fname, 'wb') as f:
    pickle.dump(codes_ids, f, pickle.HIGHEST_PROTOCOL)
In [47]:
from sparse_encoders import create_codebook
from preprocessors.symbol_extraction import *
In [49]:
charset = char_stats(selected_path)
2899 new chars on file selected_sources_small/List_of_Unicode_characters.txt 
48 new chars on file selected_sources_small/Cyrillic_Supplement.txt 
3 new chars on file selected_sources_small/Greek_and_Coptic.txt 
90 new chars on file selected_sources_small/Unicode_and_HTML_for_the_Hebrew_alphabet.txt 
0 new chars on file selected_sources_small/Cyrillic_Unicode_block.txt 
Total number of symbols: 3040
Number of Selected Symbols: 1569

Excluding more characters to make the charset smaller and thus the Softmax faster (due to a failure with FAISS on GPU at the moment, I need to cut the number of chars).

In [50]:
%%time
# all codebook and dictionaries are created with this function:
# CHARSET_PATH = "codes/all_chars.chars"
config = (2, 1838+33, (24, 3), (3, 5, 11, 13), (4, 6, 8, 10, 12), 96, 13 / 96)
ofname = "codes/adhoc-codebook-1871.pkl"
codebook = create_codebook(charset, config, ofname, reserved_spaces=33)
| Segments | code size | Vector Size | N | k | primes | cycles | exec_time (sec) |  Matrix Size in Disk (MB):                            | Sparse Matrix Size in Disk (MB): |code path
| 2 | 1871 | (1871, 96) | 24 | 3 | (3, 5, 11, 13) | (4, 6, 8, 10, 12) | 0.004 | 0.17 | 0.12 | codes/utf8_2-seg_1871-codepoints_96-dim_N-24-k3_coprimes-(3, 5, 11, 13)_cycles-(4, 6, 8, 10, 12)_dense |
saving file codes/adhoc-codebook-1871.pkl with codes.shape (1871, 96) | char2int 1609 | int2char 1602
CPU times: user 6.72 ms, sys: 599 µs, total: 7.32 ms
Wall time: 6.24 ms
In [51]:
code, char2int, int2char = codebook

Here we manually verify some code properties and make sure that everything was built correctly.

In [52]:
code.shape
Out[52]:
(1871, 96)
In [53]:
len(char2int.keys()), len(int2char.keys())
Out[53]:
(1609, 1602)
In [54]:
set(char2int.keys()).difference(set(int2char.values()))
Out[54]:
{'\x00', '\x01', '\x02', '\x03', '\x04', '\x15', '\x1a'}
In [55]:
set(int2char.keys()).difference(set(char2int.values())), set(char2int.values()).difference(set(int2char.keys()))
Out[55]:
(set(), set())
In [56]:
int2char[10], int2char[9], int2char[33], int2char[32], 
Out[56]:
('\n', '\t', '!', ' ')
In [57]:
char2int['\n'], char2int['\t']
Out[57]:
(10, 9)
In [58]:
max(int2char.keys()), max(char2int.values())
Out[58]:
(1601, 1601)
In [59]:
int2char[32]
Out[59]:
' '
In [60]:
char2int[' ']
Out[60]:
32

Decoding¶

Decoding the vectors can be done by cosine similarity (or any other vector similarity method); in this case we use the faiss library from Facebook AI Research.
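A minimal sketch of this similarity-search decoding with faiss (a random stand-in replaces the real codebook matrix; L2-normalizing before an inner-product index turns inner product into cosine similarity):

import numpy as np
import faiss

# stand-in for the dense codebook matrix (n_symbols x d)
codebook = (np.random.rand(1871, 96) < 0.14).astype('float32')

index = faiss.IndexFlatIP(codebook.shape[1])    # inner-product index
normed = codebook.copy()
faiss.normalize_L2(normed)                      # so inner product == cosine similarity
index.add(normed)

# a noisy network output that should decode back to symbol 42
query = codebook[42:43] + np.random.normal(0, 0.1, (1, 96)).astype('float32')
faiss.normalize_L2(query)
distances, ids = index.search(query, 1)
print(ids[0, 0])                                # expected: 42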

There are other ways of dealing with decoding, as the code is a binary vector and we know the parity beforehand.

For these character-level encodings one trick is, at the last layer, to use multiple softmax layers (Multi-Softmax) instead of a single big one, depending on the code configuration: one per part of the code, as defined by the code configuration. It is known in the domain that big softmax layers are one of the performance issues, so multiple studies try to limit the performance cost they generate. (TODO: add more references; I remember a Google paper where they cut the number of trainable parameters in an NLP network by repeating all the transformer blocks and adding a technique at the input and output level to make the input smaller)

Small softmax layers are much faster to process than big ones, although this remains to be measured as future work, depending on the actual size of the softmax (vectors that are too small might not take real advantage of GPGPUs); maybe for really small vectors it would be good enough to do some clipping and binarization, or to just set the biggest element to 1 and the rest to 0, which would have the issue of not being differentiable.

Ideas (TODO): some different techniques should be tested, as a mixture of:

  • Last layer as Softmax -> for small final layers this can also be applied and checked for performance. This is useful also because I can't currently run faiss on GPU (something fails there)
  • Last layer as Sigmoid -> this can be used by all the codebooks
  • Last layer as Multi-Softmax, one per part of the code (dependent on the input codebook but pre-defined); this can only be done for the SME-UTF8 and co-prime methods.

With:

  • Vector Clipping
  • Binarization

Note: before the last decoding layer we can use Transformer blocks with as many heads as there are ones in the vector; each head should be able to direct attention to a different important vector dimension.
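A minimal PyTorch sketch of the Multi-Softmax idea for a co-prime code (the layout, one softmax per prime-sized block predicting $j \bmod p$, is an assumption for illustration; it is not the evaluation network used later):

import torch
import torch.nn as nn

class MultiSoftmaxHead(nn.Module):
    # one small softmax per part of the code instead of a single big one
    def __init__(self, hidden_dim, primes=(3, 5, 11, 13)):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(hidden_dim, p) for p in primes])

    def forward(self, h):
        # one log-probability block per prime, concatenated along the code axis
        return torch.cat([torch.log_softmax(head(h), dim=-1) for head in self.heads], dim=-1)

head = MultiSoftmaxHead(hidden_dim=512)
h = torch.randn(8, 512)          # a batch of decoder states
log_probs = head(h)              # shape: (8, 3 + 5 + 11 + 13) = (8, 32)
# training would use one NLLLoss term per block, with target (j mod p) for each prime p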

Decoding on Dedicated Hardware¶

The other advantage of small softmax layers is that they can be implemented in hardware (ASIC or FPGA, for which available work exists), so a specific text encoder and decoder can be implemented for the given encoding techniques and configurations.

There are also techniques available for doing similarity search in hardware (which is a need especially in biological applications); the fact that the vector sizes and the number of elements are fixed makes it feasible to create and maintain the vectors in memory, and whether applying error correcting codes adds some (if any) advantage should be studied.

Method Validation¶

To validate this method it is sufficient to show that the performance of a network does not decay, compared to one-hot, on several tasks. This article deals with this in a restricted environment.

There are a few key points to measure:

  • Pre-processing time (dataset) for each method
  • Network Performance
  • Network Size (total and trainable parameters)
  • Network Memory Consumption
  • Network Training Time

To this end, simple enough NLP tasks will be tackled such that the training and testing time is not excessive (running on a single end-user GPU card, in this case an RTX 2080 Ti).

The tasks will be evaluated on the same networks (except for the first embedding layer) with different encodings; to be able to compare networks, everything will be done at character level. For completeness some other methods will also be evaluated, mainly BPE, which is currently used in most SoTA papers.

Preliminary conclusions of the first iteration on this work¶

While the approach up to here shows that the network can learn a representation given the proposed encodings, it has some difficulties due to the dataset size and the number of network parameters.

It starts to learn correctly and shows a quick improvement in the loss, although it makes errors in the decoding. The errors are:

  • on the language detection task, mostly the orthography is a bit off, but the main idea works correctly.
  • on the different tasks, the results are not good at all

After simplifying the tasks, going only to PoS tagging instead of all of them, the errors (manually checked) are still mainly a mismatch between the number of tags produced and the number expected.

From this we can infer that the dataset is not big enough AND the network is not big enough AND we don't have enough resources. But while this shows some promise, the expectations placed on a network without much a-priori external knowledge seem too high.

Several studies show that adding external a-priori information about the task can not only reduce the number of parameters needed but also improve the resulting metrics.

  • Token-level Dynamic Self-Attention Network for Multi-Passage Reading Comprehension
  • Attending to Entities for Better Text Understanding
  • Linguistically-Informed Self-Attention for Semantic Role Labeling

So the current focus is on augmenting the information given to the network, not only in the encoding but also in how the network can deal with externally given or added information.

For this purpose we start studying a multi-step encoding in the NewCompositionalCodebook notebook; its main idea is:

  • we create a basic code for the main base symbols from which all other symbols will be composed
  • now we create a new composed codebook of symbols, including the ones that created the base codebook (so everything has the same format). The composed symbols (like á à ô ü, etc.) are built by different operations on the basic symbols; these operations are convolution and sum of the vectors, plus some other analysis (upper/lower case, is-num, is-alpha, etc.). The final composed code is a concatenation of some of the created fields (choosing all of them might be too big, so this is a compromise choice)
  • words and other more complex composed symbols are created in the same way as in the previous step

For the encoding, the wanted set of tokens can then be composed by operating in the same manner. All elements are normalized with NFKC for the first normalization; during compositional code generation the normalization is NFKD, to separate composed symbols into their basic unitary components (a small sketch follows).
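A sketch of the normalization and composition steps (the 8-bit base codes and the exact mix of operations are placeholders of mine; the real codebooks are built in the NewCompositionalCodebook notebook):

```python
import unicodedata
import numpy as np

# Hypothetical base codes for the unit symbols (values are placeholders).
base_codes = {
    "a":      np.array([1, 0, 0, 1, 0, 0, 1, 0]),
    "\u0301": np.array([0, 1, 0, 0, 1, 0, 0, 1]),  # combining acute accent
}

token = unicodedata.normalize("NFKC", "á")    # first normalization pass
parts = unicodedata.normalize("NFKD", token)  # decompose into base symbol + accent

# Compose the code for "á" from its unit codes: a sum and a convolution of the
# parts, concatenated with simple analysis flags (upper/lower, is-num, is-alpha).
vecs = [base_codes[p] for p in parts]
summed = np.clip(np.sum(vecs, axis=0), 0, 1)
conv = np.clip(np.convolve(vecs[0], vecs[-1], mode="same"), 0, 1)
flags = np.array([token.isupper(), token.isnumeric(), token.isalpha()], dtype=int)
composed = np.concatenate([summed, conv, flags])
```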

During training of the different networks one of the issues was NaNs appearing on several occasions. This led to a quite frustrating journey where 12+ hours of training were stalled many times, followed by restarting training from different checkpoints and with different seeds to make things work. After many frustrating hours, sometimes the networks converged and sometimes they diverged. This seems to be a known issue: on 10/03/2020 (dd/mm/yyyy) a pre-print was published, ReZero is All You Need: Fast Convergence at Large Depth, that explains the issue and gives a solution, which of course I SHOULD BE USING.

Kind of tokens to use - Ideas to explore¶

This is quite an important issue. While having more tokens facilitates the encoding and the given information, it also makes the encoder and decoder bigger in memory.

In the current status of the encoder this does not make training slower, as the values are fixed, but it does make the decoder bigger, as we are currently using a one-hot vector for the decoder. So the first step would be to replace the one-hot decoder step with a faiss similarity search. This has its own disadvantage, as it is memory hungry.

Discussion on the current status of NLP and ideas to take into account for the encoder¶

The current NLP state of the art takes multiple steps to make things easier for the network to learn. This includes lowercasing everything, normalizing text, eliminating diacritics, and so on (to name a few examples). Also, the current multi-language approaches are networks trained mainly in English and later "specialized" in other languages, which creates issues; works like How Multilingual is Multilingual BERT? show some of these problems. These language specializations come in different files, so a network loads one of the models, and multiple models are needed for multiple languages. This leads to another issue: when more than one language is mixed in the same context (as would happen in a real multi-lingual scenario), these networks cannot necessarily understand the mix. Even if it is not a really big issue, it is something that happens often in the reality of multi-lingual people (like the author of this text) and is frustrating, so handling it would be a really nice to have.

So for the encoding scheme we would like to not only make it sparse, easy, and deterministic to generate, but also to include redundancies and, in a way, different normalization statuses in the code. This is all explored and done in the NewCompositionalCodebook_exploration.ipynb source code.

The codes generated there:

  • can capture order of the symbols in a token as well as non-order (bag-of-words like representation)
  • have variants with and without diacritics and accidentals
  • work with symbol normalization to cut the number of equivalent symbols to a few
  • represent upper/lower case and other casing variants while maintaining a lowercase representation
  • can be created in different dimensions so as to fulfill different requirements.

Tokenization discussion¶

Tokenization can take different forms (BPE, word level, sentencepiece, character level, etc) and is an active field of research.

In this work we are looking for a tokenization that allows the following features:

  • is deterministic (unlike sentencepiece, for example) and always the same for the same input
  • is representative AND extensible
  • allows for infinite vocabulary (unlike sentencepiece)
  • can generate new tokens from existing ones
  • does not take too much memory
  • we can use with a similarity search database
  • manages to show different characteristics of the encoded token (like the NewCompositionalCodebook Encoding)

Many of these points are already fulfilled by all the character-level encodings, especially the NewCompositionalCodebook (which, while more complex than previous attempts, is more expressive).

The current encoding also has the advantage of being able to capture similarities in the case of misspellings (or different spellings, like with/without diacritics and accidentals) without changing the entire codeword; it is only altered in certain positions. If we want to correct the inputs we can do a similarity search before encoding and sending the vector to the neural network, even if this means more pre-processing.

The next step is deciding whether word or sub-word level encodings make sense to add. While this would make the sequences to represent shorter and more expressive, it would also make the encoder and decoder dictionaries bigger. The issue with the current pytorch and faiss setup is that this dictionary would need to be duplicated, which makes memory consumption an issue.

The other issue here is that taking the multi-lang approach from the beginning makes this memory requirement even more intensive. For this, a compromise decision and measurements are needed.

Possible approaches for tokenization and extra information¶

  • A sub-word tokenization with the most common M values, using a sentencepiece-like approach
  • A possible approach would be to make a list of the most common N (with N > 1000, for example) words in each included language
  • Another (complementary) approach would be to load into GPU memory only the tokenizations of the current target language for the decoder, and only those of the source language for the encoder. This approach is not feasible if a batch contains many languages and can be slow due to memory transfers (although this is a compromise to take into account). It has the advantage of supporting many languages; the disadvantage is that words foreign to the source language would be encoded/decoded at character level.
  • We could add the source and target language as extra information instead of making the network guess it. This would be a big extra help. This language information could also serve as the key source for a memory in the network that allows for different mappings.
  • The encoding step will work greedily, encoding with the longest possible sequence found; this makes the tokenization always the same. This can be done with an encoder represented as a tree (a minimal sketch follows this list).
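A minimal sketch of this greedy, longest-match-first encoding over a tree (trie); the vocabulary and function names are illustrative only, not the library's API:

```python
def build_trie(vocab):
    """Build a character trie; a node stores '_end' when it completes a token."""
    root = {}
    for token in vocab:
        node = root
        for ch in token:
            node = node.setdefault(ch, {})
        node["_end"] = token
    return root

def greedy_encode(text, trie):
    """Always emit the longest token found at the current position (falling back
    to single characters), so the tokenization is deterministic."""
    tokens, i = [], 0
    while i < len(text):
        node, j, longest = trie, i, None
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if "_end" in node:
                longest = (node["_end"], j)
        if longest is None:
            tokens.append(text[i]); i += 1
        else:
            tokens.append(longest[0]); i = longest[1]
    return tokens

print(greedy_encode("the theory", build_trie({"the", "theory", " "})))
# -> ['the', ' ', 'theory']
```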

While the network can be trained (like LISA) to do POS tagging, and other information can also be given to the network, this work also wants to focus on what kind of external information can be provided. In LISA, one head is trained to attend to the syntactic parent of the token. A more comprehensive approach could be taken, adding an external source where a head can attend to the resulting syntactic/semantic role of the token as well as many other references (tense, gender, number, declension, etc.). This could be an interesting approach: as the representation of the different elements can be learnt, these heads can (and will) provide external information, quantized by an internal representation in a memory holding the different possibilities. Some extensions to this are already in place in Attending to Entities for Better Text Understanding.

This external information can take different forms like:

  • giving the parent dependency tree/graph as input
  • giving some possible pos tags or another kind of syntactic and or semantic information
  • having a way of giving examples (as in Matching Networks for One Shot Learning) that allow the network to consult examples while learning, or Amazon's Empirical Bayes Transductive Meta-Learning with Synthetic Gradients meta-learning approach
  • Having a way of encoding external examples to add them to a NeuralDB that is later used as a base for each layer, each layer having different entries in the DB

The challenges here are multiple:

  • selecting what kind of information can be given
  • selecting one or multiple ways of adding that external knowledge
  • encoding that information in a "good" way (for whatever good can mean here)
  • training different networks to do that job?
  • Having specialized sub-modules for each part?

One approach to having extra information would be to create a comprehensive dictionary from the beginning (like downloading the entire wikimatrix database) and adding the information from there (gender, tense, etc.) to the sentence.

This is another open point in my internal discussion that I still have to settle.

Ideas¶

Thinking about some ideas while reading the following extract, I think that a graph representation of the meta-concepts (POS tag, syntactic/grammatical function, etc.) could be useful:

From Deep Learning Based Text Classification: A Comprehensive Review (April 2020), p. 15:

"Yao et al. [103] used a similar Graph CNN (GCNN) model for text classification. They built a single text graph
for a corpus based on word co-occurrence and document word relations, then learned a Text Graph Convolutional
Network (Text GCN) for the corpus, as shown in Fig. 14. The Text GCN is initialized with one-hot representation
for word and document, and then jointly learns the embeddings for both words and documents, as supervised by
the known class labels for documents."


For the moment I've worked on compositionality for the token representation. There is also the similarity and inclusion search that needs to be addressed; for this I might have to look into Bloom Filters, so I note it here to not forget about it (a toy sketch follows).
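As a reminder of what that inclusion search could look like, here is a toy Bloom-filter membership check (sizes and hash choice are arbitrary assumptions; nothing in the notebooks implements this yet):

```python
import hashlib

class BloomFilter:
    """Probabilistic set membership: no false negatives, rare false positives."""
    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, item):
        for i in range(self.n_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

bf = BloomFilter()
bf.add("token")
assert "token" in bf  # items that were added are always reported as present
```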


While most models concentrate on first pre-training a Language Model (LM) and then fine-tuning, I want to concentrate on a meta-model that can deal with different tasks.

For this I want to work in a constructive manner, where simpler support tasks are trained first and the complexity of the network is then increased to take advantage of the previously trained tasks.

Also, instead of trying to train on long stretches of text, I'll start with simpler and shorter phrases, doing POS, parent tagging and other tasks available in the CONLLU dataset simultaneously with some basic LM tasks as in Unified Language Model Pre-training; this should give a base from which to build a more comprehensive Language Modeling task.

Then, adding the Unified Language Model pre-training together with the approach of Backpropamine: Neuromodulated Plasticity and the Neuromodulated Meta-Learning Algorithm (ANML) from Uber labs.

Conclusion¶

This work shows a different take on the current approach to input coding for Natural Language Processing tasks on Deep Learning networks.

Several different ways of encoding, as well as the code to generate them, are available in the repository.

Future Work¶

This partial study is part of a deeper study on how to make networks train faster and run on consumer-grade GPUs in a competitive way (even if they are not SoTA).

Next steps are:

  • creating an encoding mapping for a dataset
  • testing the performance of a set of NNs, including the size of the networks