A Statistical Exploration of Universal Dependencies CoNLL-U Files

Leonardo M. Rocha


Introduction

While much work is currently being done on NLP and NLU, there is little work describing why a certain sequence length was chosen for a transformer architecture (or, for other architectures such as LSTMs, the number of time steps) when training: the choice is mostly arbitrary and depends on the goal of the work and the resources available (mainly hardware). These decisions are hard to revisit once the model has been trained, since there is no way to extend the length of a transformer (for example) without retraining the entire network. There are, however, some works that tackle variable-length sequences.

This work presents the first complete analysis (to our knowledge) of the Universal Dependencies v2.6 dataset and presents the global results as well as the individual results by language.

This work does not intend to be a conference-level paper (which is why there are no references to papers on the subject), but an informational technical report that might help you select the most effective compromise for text or token length in your particular NLP application.

The number of analyzed languages is 91. The token length is measured as the number of UPOS tags in the dataset, while the character length is simply the length of the sentence text. There is no analysis of what constitutes a word, which means that tokens include the punctuation and other symbols present in the text samples.
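As a concrete illustration, the sketch below shows how both lengths can be obtained for a single sentence with pyconll. The file path is only an example, and counting only tokens that carry a UPOS tag is our interpretation of the rule above, not necessarily the exact rule used in the repository.

```python
# Minimal sketch: token (UPOS) length and character length of one sentence.
# The file path is illustrative; any .conllu file from the UD release works.
import pyconll

conll = pyconll.load_from_file("UD_English-EWT/en_ewt-ud-train.conllu")

first = next(iter(conll))
upos_len = sum(1 for token in first if token.upos is not None)  # tokens with a UPOS tag
char_len = len(first.text)  # raw sentence text, punctuation and symbols included
print(first.id, upos_len, char_len)
```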

Notes:

  • The graphs and tables in this post are best viewed on a computer
  • This is not a tutorial

Dataset

Universal Dependencies v2.6

The process is the same for any UD version.

The data is 353 MB compressed in gzip format and 2.1 GB as uncompressed raw text.

The entire pipeline takes less than 20 minutes to process the uncompressed UD dataset and generate all graphs and HTML output on a single CPU core of an Intel 7700.

Technology Stack

The analysis was done with the following software stack:

  • Ubuntu 20.04
  • Python 3.8 with all dependencies installed in a virtualenv
  • CoNLL-U files are parsed with pyconll
  • pycountry v19.8.18 is used to convert between language codes and language names
  • NumPy v1.18.4 is used for numerical computation
  • Pandas v1.0.4 is used for DataFrame generation and to simplify some operations
  • SciPy v1.4.1 is used for statistics
  • Matplotlib v3.2.1 was used for the first graphing iterations during the exploration stage
  • Bokeh v2.0.2 is used for the final graph and table generation, making the presentation interactive
  • Jupyter Lab v2.1.2 was used during the entire process, from the initial exploration to the writing of this report

Method - Steps

Some code details can be seen in the Appendix of this document; the complete code is available in the repository.

  1. A single zipped file containing the dataset was downloaded locally and uncompressed
  2. The entire file tree is explored and a list of all the files is returned. All splits (train, test and dev) are used for this analysis to have the maximum amount of data available.
  3. Undesired files are filtered out (used only during exploration, to select different languages to explore and to check errors)
  4. For each file:
    1. Open the file
    2. The language code (2 or 3 letters) is extracted from the file name
    3. The sentence text, form and UPOS fields are extracted
  5. The separate sentence, UPOS and form data from each file are merged into a single list per type and enriched with language information
  6. The data is then placed in a Pandas DataFrame to simplify handling (selection by language in this case). For development or other needs, a list of languages can be given to speed up processing (for example, for exploration purposes). A condensed sketch of steps 2 to 6 is shown after this list.
  7. Every language is processed independently and several candidate distributions are tested; only the one with the best Kolmogorov-Smirnov test result is kept. This is the most time-consuming step in the pipeline.
  8. Statistics are computed for every language
  9. Results are converted to JSON and saved in a single gz file (available for download in the GitHub repository)
  10. Statistics tables and plots are generated per language; the graphs can be explored here and are available on the project's GitHub page
  11. Tables are created for UPOS and text (character) length for all languages together. These tables are available on the project's GitHub page
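The following is a condensed, illustrative sketch of steps 2 to 6, not the repository's actual code: the root directory name, column names and the UPOS-counting rule are assumptions based on the description above.

```python
# Condensed sketch of steps 2-6: walk the UD tree, parse every .conllu file
# with pyconll, and collect per-sentence lengths in a Pandas DataFrame.
# The root directory, column names and counting rule are illustrative.
import glob
import os

import pandas as pd
import pyconll

UD_ROOT = "ud-treebanks-v2.6"  # uncompressed dataset directory

records = []
for path in glob.glob(os.path.join(UD_ROOT, "**", "*.conllu"), recursive=True):
    # File names look like "en_ewt-ud-train.conllu": the language code
    # (2 or 3 letters) is the part before the first underscore.
    lang_code = os.path.basename(path).split("_")[0]
    for sentence in pyconll.load_from_file(path):
        records.append({
            "lang": lang_code,
            "upos_len": sum(1 for tok in sentence if tok.upos is not None),
            "char_len": len(sentence.text) if sentence.text else 0,
        })

df = pd.DataFrame(records)
print(df.groupby("lang").size().sort_values(ascending=False).head())
```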

Observations

The histograms show a skew in the distribution; this could correspond to a skewed Gaussian, a generalized Gaussian or a beta distribution. Because of this, we fit several candidate distributions and compare them with the Kolmogorov-Smirnov test.
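A minimal sketch of this fit-and-select step, assuming SciPy's skewnorm, gennorm and beta as the candidate set (the actual pipeline may test a different or larger set of distributions):

```python
# Fit several candidate distributions and keep the one with the lowest
# Kolmogorov-Smirnov statistic. The candidate list mirrors the shapes
# mentioned above; the real pipeline may test more distributions.
import numpy as np
from scipy import stats

def best_fit(samples, candidates=(stats.skewnorm, stats.gennorm, stats.beta)):
    samples = np.asarray(samples, dtype=float)
    best = None
    for dist in candidates:
        params = dist.fit(samples)
        ks_stat, p_value = stats.kstest(samples, dist.name, args=params)
        if best is None or ks_stat < best[1]:
            best = (dist.name, ks_stat, p_value, params)
    return best  # (name, KS statistic, p-value, fitted parameters)

# Example, reusing the DataFrame from the previous sketch:
# best_fit(df.loc[df["lang"] == "en", "upos_len"])
```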

Many languages do not have enough samples, so the distribution fit will not be good and the errors will be large. This is not an issue from the code's point of view, but if this data is used, the number of available samples should be taken into account.

While doing this work we found it quite interesting that there are languages whose token or character counts avoid certain bins in the histogram (Bulgarian, Breton, Welsh, Danish, Slovak, Tamil and Thai are a few examples). This can mean either that the language structure supports only those lengths, or that the analyzed dataset only contains samples that avoid some sentence lengths.
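One simple way to check for this pattern (a sketch, reusing the hypothetical df from the pipeline sketch above) is to build a histogram with one bin per integer length and list the empty bins inside the observed range:

```python
# List integer lengths that never occur between the minimum and maximum
# observed length, i.e. the "skipped" histogram bins.
import numpy as np

def empty_length_bins(lengths):
    lengths = np.asarray(lengths, dtype=int)
    counts = np.bincount(lengths)
    inner = np.arange(lengths.min(), lengths.max() + 1)
    return inner[counts[inner] == 0]

# e.g. empty_length_bins(df.loc[df["lang"] == "bg", "upos_len"])
```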

For some languages the number of samples is too small to draw any solid conclusions from the data.

There is one missing language in the analysis: the language code qhe, corresponding to the UD_Hindi_English-HIENCS folder. This is because its files contain only underscore and space characters.

Conclusion

This work presents a sample-length analysis by language of the Universal Dependencies v2.6 dataset, presenting statistics for 91 of the 92 represented languages. The analysis shows the length histograms by character and by token count.

The best compromise when choosing a sequence length for an NLP architecture will depend mostly on the requirements of the application; nevertheless, with the numbers presented here you should be able to make an informed guess about what might work better for your case.

We can see that a multi-lingual approach will necessarily require longer sequences, as there is large variability in sequence length across languages, while restricting the model to a single language might allow you to optimize your neural architecture.

Future Work

We are currently working on a more in-depth analysis of the complete Project Gutenberg dataset (~60K books in several languages) that will discriminate several other text characteristics.

Appendix - Language Graphs

You can select the language you want to explore from the dropdown menu.

The language graphs are interactive: you can turn the histogram, PDF and CDF on and off by clicking on the legend. You can also zoom and pan each graph.

All tables can be reordered by clicking on the column titles.

The graphs show (from top to bottom) the UPOS length, which equals the token length, and the character length of the sentences.

The interval column indicates what percentage (quantile) of the data from the dataset will be captured by choosing the value in that row. Only the maximum (right) value is displayed, since it is the most interesting one; the low (left) values are not shown.
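If you want to reproduce this kind of value from raw data rather than read it off the tables, an empirical quantile gives a comparable number. This is only a sketch: the published values may come from the fitted distributions rather than the empirical quantile used here, and the column names follow the earlier pipeline sketch.

```python
# Empirical version of the interval column: the smallest length that
# captures a given fraction of the sentences for one language.
import numpy as np

def max_length_for_coverage(lengths, coverage=0.95):
    lengths = np.asarray(lengths, dtype=float)
    return int(np.ceil(np.quantile(lengths, coverage)))

# e.g. keep 99% of English sentences:
# max_length_for_coverage(df.loc[df["lang"] == "en", "upos_len"], 0.99)
```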

Select the language you want to explore:

Appendix - Tables

The following tables contain the list of all languages and their statistics. The tables are also interactive; you can reorder them by clicking on the column titles.

Sentence length by UPOS (token) count - All languages

Sentence length by character count - All languages

This work is under License: CC BY-SA 4.0