While much work is currently being done on NLP and NLU, there is little work describing why a certain sequence length was chosen for training a transformer (or, in other architectures, the number of LSTM time steps); the choice is mostly arbitrary and depends on the goal of the work and the resources available (mainly hardware). These decisions are hard to revisit once the model has been trained: nothing can be done to extend the length of a transformer (for example) without retraining the entire network. There are, however, some works that tackle variable-length sequences.
This work presents the first complete analysis (to our knowledge) of the Universal Dependencies v2.6 dataset, reporting global results as well as individual results by language.
This work does not intend to be a conference-level paper (which is why there are no references to papers on the subject), but an informational technical report that might help you select the most effective compromise on text or token length for your particular NLP application.
The number of analyzed languages is 91. Token length is measured by counting UPOS tags in the dataset, while character length is simply the number of characters. There is no analysis of what constitutes a word or not; this means that a token includes punctuation and other symbols present in the text samples.
The process is the same for any UD version.
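As an illustration, here is a minimal sketch of how per-sentence lengths can be extracted from a CoNLL-U file. The function name and the reliance on the `# text =` comment for character length are assumptions for illustration, not the report's exact code:

```python
def sentence_lengths(conllu_path):
    """Yield (token_length, char_length) for each sentence in a .conllu file."""
    tokens, text = 0, ""
    with open(conllu_path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("# text ="):
                text = line.split("=", 1)[1].strip()
            elif line.startswith("#"):
                continue
            elif not line:  # a blank line ends the sentence
                if tokens:
                    yield tokens, len(text)
                tokens, text = 0, ""
            else:
                cols = line.split("\t")
                # Skip multi-word ranges ("3-4") and empty nodes ("5.1")
                # so each UPOS-tagged token is counted exactly once.
                if "-" not in cols[0] and "." not in cols[0]:
                    tokens += 1
    if tokens:
        yield tokens, len(text)
```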
The data is 353 MB compressed in gzip format and 2.1 GB as uncompressed raw text.
The entire pipeline takes less than 20 minutes to process the uncompressed UD dataset and generate all graphs and the HTML output on one CPU core of an Intel 7700.
The analysis was done with the following software stack:
Some code details can be seen in the Appendix of this document; the complete code is available in the repository.
The histograms show skewed distributions; this could take the form of a skewed Gaussian, a generalized Gaussian, or a beta distribution. Because of this, we test different distribution fits with the Kolmogorov-Smirnov test.
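A minimal sketch of such a fit-and-test loop with SciPy follows; the candidate list and the function name are assumptions for illustration, not necessarily the exact code used:

```python
from scipy import stats

def best_fit(lengths, candidates=("skewnorm", "gennorm", "beta")):
    """Fit each candidate distribution and rank them by KS statistic."""
    results = {}
    for name in candidates:
        dist = getattr(stats, name)
        params = dist.fit(lengths)  # maximum-likelihood parameter estimates
        ks_stat, p_value = stats.kstest(lengths, name, args=params)
        results[name] = (ks_stat, p_value, params)
    # A lower KS statistic means the fitted CDF is closer to the empirical one.
    return min(results.items(), key=lambda kv: kv[1][0])
```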
There are many languages that do not have enough samples, so the distribution fit will not be good and the errors will be large. This is not an issue from the code point of view; the important thing, if this data is used, is to take the number of available samples into account.
While doing this work we found it quite interesting that there are languages whose token or character counts avoid certain bins in the histogram (Bulgarian, Breton, Welsh, Danish, Slovak, Tamil, and Thai are a few examples of this). This can mean either that the language structure supports only those lengths, or that the analyzed dataset only contains samples that avoid some sentence lengths.
For some languages the number of samples is too small to draw any solid conclusions from the data.
There is one missing language in the analysis: the language code is qhe, corresponding to the UD_Hindi_English-HIENCS folder. This is because the files contain only underscore and space characters.
This work presents a sample length analysis by language on the Universal Dependencies v2.6 dataset, presenting statistics for 91 of the 92 represented languages. The analysis then shows the length histograms by character and token length.
The best compromise when choosing a sequence length for an NLP architecture will depend mostly on the requirements of the application; nevertheless, with the numbers presented here you should be able to make an informed guess about what might work best for your case.
We can see that a multi-lingual approach will necessarily make the needed sequences longer, as there is large variability in sequence length across languages, while restricting to a single language might allow you to optimize your neural architecture; a sketch of this comparison follows.
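As an illustration of why that is, here is a hedged sketch (the function name and the 99% coverage value are assumptions) comparing per-language cutoffs with the single cutoff a multi-lingual model would need:

```python
import numpy as np

def coverage_cutoffs(lengths_by_lang, coverage=0.99):
    """Per-language cutoffs versus the single cutoff a multi-lingual model needs."""
    per_lang = {lang: int(np.quantile(lengths, coverage))
                for lang, lengths in lengths_by_lang.items()}
    # One model serving all languages must cover the longest-tailed language.
    multilingual = max(per_lang.values())
    return per_lang, multilingual
```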
We are currently working on a more in-depth analysis of the complete Project Gutenberg dataset (~60K books in several languages) that will discriminate several other text characteristics.
You can select the language you want to explore from the dropdown menu.
The language graphs are interactive: you can turn the histogram, PDF, and CDF on and off by clicking on the legend. You can also zoom and pan each graph.
All tables can be reordered by clicking on the column titles.
The graphs show (from top to bottom) the UPOS count, which corresponds to the token length, and the character length of the sentences.
The interval column tells what percentage (quantile) of the data from the dataset will be captured by choosing the value in that row. The only displayed value is the max (right) one, which is the most interesting; the low (left) values are not shown.
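A minimal sketch of how such an interval could be computed, assuming a symmetric two-sided interval of which only the upper bound is displayed (the function name and this assumption are illustrative, not the report's exact method):

```python
import numpy as np

def central_interval(lengths, coverage=0.95):
    """Two-sided interval capturing `coverage` of the samples."""
    tail = (1.0 - coverage) / 2.0
    low, high = np.quantile(lengths, [tail, 1.0 - tail])
    return int(low), int(high)  # only `high` is displayed in the tables
```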
Select the language you want to explore:
The following tables contain the list and statistics for all languages. These tables are also interactive; you can reorder them by clicking on the column titles.
This work is licensed under CC BY-SA 4.0.