Introduction

A few years ago I contacted some researchers through AI-ON and started helping with the Few Shot Music Generation project. My fork is located here.

That led me to think about how to cut down the resource usage of the input encodings, which led to this idea. Afterwards I thought that it could be generalized to the entire domain of any encoding type, which sent me down another rabbit hole where I started working on and testing ideas about overfitting encodings, error correction codes and some other things, but the main work centered on the input encodings themselves.

The main problem with any current encoding is that the one-hot vector at the input layer is still too big, and that we now use a learnt encoding instead of one that can be pre-computed in a deterministic manner.
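To make the size difference concrete, here is a minimal sketch (with hypothetical vocabulary and hidden sizes, not taken from the analysis page) comparing the parameter count of a learnt embedding over a one-hot input with that of a small deterministic code feeding the same hidden layer:

```python
import numpy as np

# Hypothetical sizes, for illustration only.
vocab_size = 50_000
hidden = 512

# One-hot input with a learnt embedding: the first layer needs
# vocab_size * hidden learnt parameters (25.6M here).
learnt_embedding_params = vocab_size * hidden

# A deterministic code, e.g. a fixed binary code of ceil(log2(vocab_size))
# bits, shrinks the input to a handful of dimensions; the first layer then
# only needs code_bits * hidden parameters (~8K here), and nothing has to
# be learnt or stored per token.
code_bits = int(np.ceil(np.log2(vocab_size)))  # 16 bits
deterministic_input_params = code_bits * hidden

print(learnt_embedding_params, deterministic_input_params)
```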

Source Code and Analysis

Note: the page does not yet include all the conclusions, but it does provide source code, some examples, a dimensionality analysis, and theoretical pointers for developing the subject further if wanted or needed.

A page with the analysis is available.

The notebook and source code for the different tests are located in the mix_nlp repository.

Conclusion

It is possible to choose smaller deterministic input encodings that can be further compressed by a fully connected input layer (as is current practice for input layers); providing a smaller input instead of a one-hot vector yields an important reduction in resource usage.
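As a rough sketch of what this could look like, the snippet below uses a fixed-width binary code as the deterministic encoding (one possible choice among those discussed on the analysis page) and compresses it with a small fully connected layer; the sizes and the random weights are placeholders for illustration:

```python
import numpy as np

def binary_encode(token_ids, code_bits):
    # Deterministic fixed-width binary code for each token id.
    ids = np.asarray(token_ids)[:, None]
    bits = (ids >> np.arange(code_bits)) & 1
    return bits.astype(np.float32)

# Hypothetical sizes, for illustration only.
vocab_size, code_bits, hidden = 50_000, 16, 512
rng = np.random.default_rng(0)
W = rng.standard_normal((code_bits, hidden)).astype(np.float32) * 0.02

tokens = [3, 1017, 42_123]
x = binary_encode(tokens, code_bits)  # shape (3, 16) instead of (3, 50000)
h = x @ W                             # compressed representation, shape (3, 512)
print(x.shape, h.shape)
```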

Such encodings can be decoded in two ways: either with a fully connected layer and a softmax over the vocabulary (as is current practice), or by a similarity search in a vector database.
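A minimal sketch of the two decoding paths is shown below; the weights and code vectors are random placeholders, and a brute-force cosine search stands in for a real vector database:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden = 50_000, 512
h = rng.standard_normal(hidden).astype(np.float32)  # hidden state to decode

# Option 1: fully connected layer + softmax over the whole vocabulary
# (current practice); requires a hidden x vocab_size weight matrix.
W_out = rng.standard_normal((hidden, vocab_size)).astype(np.float32) * 0.02
logits = h @ W_out
probs = np.exp(logits - logits.max())
probs /= probs.sum()
token_from_softmax = int(np.argmax(probs))

# Option 2: similarity search against pre-computed code vectors,
# i.e. find the stored vector closest to the hidden state.
codes = rng.standard_normal((vocab_size, hidden)).astype(np.float32)
codes /= np.linalg.norm(codes, axis=1, keepdims=True)
query = h / np.linalg.norm(h)
token_from_similarity = int(np.argmax(codes @ query))

print(token_from_softmax, token_from_similarity)
```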