Dataset Explorer
Links:
It is important for our community to be able to easily explore the variety of datasets. There is nothing more frustrating than downloading a dataset, only to discover after a few hours it isn’t clean enough or that the topics are different from what you expected. Using our newly developed dataset explorer, you can easily review all of the data available on the Hebrew NLP Resources index and select the one that best suits your needs.
The Explorer includes: comparison of all the corpora based on their size under the Datasets Overview. It is possible to obtain comparative information about two corpora using the Comparative Analysis feature. You can zoom in on selected corpora under Focused Look at a Dataset The data displayed in the Explorer is the output of 2 pre-processing stages: we removed stop words using our list, and we performed lemmatization using Trankit. There are checkboxes to enable this if you desire. As a result of the nature of the project, you may find noise in the result that we do not remove. It may use you as a flag to indicate what steps need to be taken.
Use case: comparison between Knesset and FoodWalla corpus
The first thing we observe comparing datasets size by word count is. For example, the Knesset contains approximately 8 million words, while FoodWalla contains only 0.5 million.
In the “Comparative Analysis” section, we can select Knesset from the dropdown list and get information about this dataset as well as a link to download it. We will use it as the main dataset, and FoodWalla as the secondary dataset. In addition, you may choose to perform the analysis without stop-words and with lemmatization according to your preference. More information about how it works can be found in our documentation. There are also explanations of all the plots available there.
Here is a quick overview of the two datasets
The Knesset has more unique words, which makes sense since it is longer. FoodWalla has a slightly higher percentage of unique words, which is unexpected considering recipes should contain fewer words that don’t repeat, but it may be balanced with the length of the data. While the Knesset has fewer lines than FoodWalla, despite having many more words, we can understand that the basic unit in the Knesset is much more dense. Knesset jargon is expected to be more sophisticated and longer, but the average word count is 4.5 in both cases
Underneath, we can examine the ngram. You can adjust the number of N by using the slider. The decision we made to clean stop words has resulted in distinct list of word between the datasets; if we return stop words, we can see that words like לא, של ואת are very common in both datasets
Under this we can find topics that can give us some sense of what the text is about. The following is an example of an Arutz 7 article:
Also, we can find a degree of similarity with other texts. There is a lot of similarity between Knesset and Arutz 7 Articles, as well as a lot of difference between FoodWalla and Knesset
More tools and analyses are included in the explorer, including the Gini index, Zipf law, and others. Feel free to create Pull Requests for new features to be implemented on the Explorer for the benefit of the community at large