Intro to YAP The building block for Hebrew NLP

Intro to YAP The building block for Hebrew NLP

Credits and Full Disclosure

YAP, also known as Yet Another (natural language) Parser, is an automatic natural language processing tool tailored for Modern Hebrew texts. YAP can automatically annotate Hebrew texts with different kinds of information, including Lemma, Part of Speech tags (Verb, Noun, etc), Morphological Features (gender, number, etc), as well as Syntactic Relations (Subject, Object, Modifier). YAP has been developed at the ONLP research lab led by Prof. Reut Tsarfaty. The original development of YAP was done by Amir More, and later developments were undertaken by Amit Seker. An overview of YAP, its best practices, and the main ways it is used can be found in this document. Texts and images on this page are adopted and adapted from the full documentation provided by the ONLP lab website and its associated github page here. A YAP online demo by the ONLP lab can be found here. While YAP is open-source, any use of YAP in an academic publication or otherwise should cite this article.

Synopsis

YAP is a natural language parser that is designed to automatically process Hebrew texts. It is an extremely powerful tool which allows users to automatically analyze textual data in Hebrew and annotate it with different levels of analysis.
Specifically, YAP provides the following levels of analysis: Morphological Segmentation, Part of Speech Tagging, Morphological Analysis, and syntactic analysis in the form of Dependency Parse Trees.

History

YAP was implemented for testing the hypothesis of Joint Morpho-Syntactic Processing of Morphologically Rich Languages (MRLs) in a Transition Based Framework. The hypothesis, in a nutshell, is that processing both the morphological and syntactic levels in a joint model would be superior to a pipeline with two independent components. The parser was developed as part of Amir More’s MSc thesis at IDC Herzliya under the supervision of Prof. Reut Tsarfaty, now a professor at Bar-Ilan University and a visiting Professor at Google. The models and training regimes have been tuned and improved by Amit Seker from the Open University and the Bar-Ilan research team. YAP is currently maintained and supported as part of the ONLP lab (The NLP research lab at the Bar Ilan University) toolkit.

A paper on the morphological analysis and disambiguation aspect for Modern Hebrew and Universal Dependencies was accepted to COLING 2016. A follow up paper on the complete morpho-syntactic framework was published at TACL, titled Joint Transition-Based Models for Morpho-Syntactic Parsing: Parsing Strategies for MRLs and a Case Study from Modern Hebrew. Yap contains an implementation of the framework and parser of zpar from Z&N 2011 (Transition-based Dependency Parsing with Rich Non-local Features by Zhang and Nivre, 2011) with flags for precise output parity (i.e. bug replication), trained on the morphologically disambiguated Modern Hebrew treebank.

YAP Overview

A Hebrew sentence can be analyzed using YAP in the following ways:

Morphological Analysis

Morphological Analysis shows all the options of interpretations for each word in the sentence. YAPExample For example the sentence “קניתי ארוחה טעימה אתמול”:

0 1 קניתי קנה VB VB gen=F|gen=M|num=S|per=1|tense=PAST 1
0 1 קניתי קנייה NN NN gen=F|num=S|suf_gen=F|suf_gen=M|suf_num=S|suf_per=1 1
1 2 ארוחה ארוחה NN NN gen=F|num=S 2
2 3 טעימה טעימה NN NN gen=F|num=S 3
2 3 טעימה טעים JJ JJ gen=F|num=S 3
3 4 מ מ PREPOSITION PREPOSITION _ 4
3 5 מאוד מאוד RB RB _ 4
4 5 אוד אוד NNT NNT gen=M|num=S 4
4 5 אוד אוד NN NN gen=M|num=S 4

The columns are represented as follows:

  • Column 1-2: From/To - The start and end nodes of the morpheme. The numbers are with respect to the maximal length route.
  • Column 3: Form - the surface form of the morphological segment.
  • Column 4: Lemma.
  • Column 5-6: Part of Speech.
  • Column 7: Morphological Features.
  • Column 8: Token Number - represents the index of the raw (space-delimited) token in the input before segmentation

You can find more information about Morphological Analysis by visiting:

Morphological Disambiguation

Morphological Disambiguation is the next step after a full Morphological Analysis.
Morphological Disambiguation narrows down the options for each word in the sentence based on the Morphological Analysis.
For example the sentence “קניתי ארוחה טעימה אתמול”

0 1 קניתי קנה VB VB gen=F|gen=M|num=S|per=1|tense=PAST 1
1 2 ארוחה ארוחה NN NN gen=F|num=S 2
2 3 טעימה טעים JJ JJ gen=F|num=S 3
3 4 מאוד מאוד RB RB _ 4

The columns are represented as follows:

  • Column 1: Morpheme Index in the Sentence.
  • Column 2: Form of the Morpheme.
  • Column 3: Lemma of the Morpheme.
  • Column 4: Coarse Part of Speech Tag.
  • Column 5: Fine Part of Speech Tag.
  • Column 6: Morphological Features.
  • Column 7: Head Index Pointer.
  • Column 8: Dependency relation to the HEAD.
  • Column 9: Projective Head Unused.
  • Column 10: Dependency relation to the PHEAD Unused.

Useful Links

At the stage of morphological analysis, there were two possible interpretations of the form ״קניתי״: Line 1 - A verb, inflected for first person past tense, lemma=״קנה״ (translated as “I bought”);
Line 2 - A noun with a possessive suffix, lemma=״קנייה״ (translated as ‘my purchase’).
There were also two possible interpretations of the form ״טעימה״: Line 4 - A feminine noun, lemma=״טעימה״ (translated for example as ‘a bite’), Line 5 - An adjective, inflected for feminine gender, lemma=״טעים״ (translated as ‘tasty’).
At the stage of morphological disambiguation, in the case of ״קניתי״ the interpretation presented in line 1 is selected, and in the case of ״טעימה״ the interpretation presented in line 5 is selected

The columns are represented as follows:

  • Column 1-2: From/To - The start and end nodes of the morpheme. The numbers are with respect to the maximal length route.
  • Column 3: Form of the Morpheme - the surface form of the morphological segment.
  • Column 4: Lemma of the Morpheme.
  • Column 5: Coarse Part of Speech Tag.
  • Column 6: Fine Part of Speech Tag.
  • Column 7: Morphological Features.
  • Column 8: Source Token Index.

To find more on Morphological Disambiguation:

  • Morphological Disambiguation of Hebrew.
  • An Unsupervised Morpheme-Based HMM for Hebrew Morphological Disambiguation.

Dependency Parsing

Dependency Parsing is the process of analyzing the grammatical structure of a sentence to identify related words and their relationship type. YAP returns a dependency tree of the inputted sentence. To find more information on Hebrew Dependency Parsing:

The Dependency labels of used for YAP originate from this paper:

For example the sentence “אכלתי ארוחה טעימה אתמול”

1 קניתי קנה VB VB gen=F|gen=M|num=S|per=1|tense=PAST 0 ROOT _ _
2 ארוחה ארוחה NN NN gen=F|num=S 1 obj _ _
3 טעימה טעים JJ JJ gen=F|num=S 2 amod _ _
4 מאוד מאוד RB RB 3 mod _ _

YAP via Docker

YAP is currently available as an API service. It can be implemented as a code-based service and accessed as an internet service. For simplicity, you may also use our Docker image. Use the following command to run YAP via Docker:

  $ docker run --name yap --restart always -d -p 8000:8000 public.ecr.aws/nnlp-il/nlp_yap_api:latest

Using YAP

YAP’s API exposes an API service. The following wrappers simplify using YAP by exposing simple functions which call the API service.

Python Wrapper - credit to Amit Shkolnik

The wrapper sends the API a joint command, parses the response, and outputs the following:

  • Tokenized Text.
  • Segmented Text.
  • Lemmas.
  • Dependency tree.
  • md command result.
  • ma command result.

To Run the python wrapper use the following code:

  text1 = 'היום הם הולכים לקניון'
  ip='127.0.0.1:8000'
  yap=YapApi()
  tokenized_text, segmented_text, lemmas, dep_tree, md_lattice, ma_lattice=yap.run(text1, ip)

The output:

  tokenized_text
  היום הם הולכים לקניון
  segmented_text
  היום הם הולכים ל ה קניון
  lemmas
  היום הוא הלך ל ה קניון
  dep_tree
   num  word lemma     pos    pos_2       empty dependency_arc dependency_part dependency_arc_2 dependency_part_2
  0  1  היום  היום      RB      RB                 3      tmod        _         _
  1  2   הם  הוא     PRP     PRP gen=M|num=P|per=3       3      subj        _         _
  2  3 הולכים  הלך      BN      BN gen=M|num=P|per=A       0      ROOT        _         _
  3  4    ל   ל PREPOSITION PREPOSITION                 3     prepmod        _         _
  4  5    ה   ה     DEF     DEF                 6       def        _         _
  5  6  קניון קניון      NN      NN    gen=M|num=S       4      pobj        _         _
  md_lattice
   num num_2  word empty     pos ... num_s_p per tense num_last lemma
  0  0   1  היום   -1      RB ...   -1 -1  -1    1  היום
  1  1   2   הם   -1     PRP ...    P  3  -1    2  הוא
  2  2   3 הולכים   -1      BN ...    P  A  -1    3  הלך
  3  3   4    ל   -1 PREPOSITION ...   -1 -1  -1    4   ל
  4  4   5    ה   -1     DEF ...   -1 -1  -1    4   ה
  5  5   6  קניון   -1      NN ...    S -1  -1    4 קניון
 
  [6 rows x 12 columns]
  ma_lattice
    num num_2  word  lemma ... num_last suf_gen suf_num suf_per
  0  0   1    ה    ה ...    -1   -1    -1   -1
  1  0   2    ה    ה ...    -1   -1    -1   -1
  2  0   3  היום  היום ...    -1   -1    -1   -1
  3  1   3   יום   יום ...    -1   -1    -1   -1
  4  1   3   יום   יום ...    -1   -1    -1   -1
  5  2   3   יום   יום ...    -1   -1    -1   -1
  6  2   3   יום   יום ...    -1   -1    -1   -1
  7  3   4   הם   הוא ...    -1   -1    -1   -1
  8  3   4   הם   הוא ...    -1   -1    -1   -1
  9  4   5 הולכים   הלך ...    -1   -1    -1   -1
  10  4   5 הולכים   הלך ...    -1   -1    -1   -1
  11  5   6    ל    ל ...    -1   -1    -1   -1
  12  5   8 לקניון לקניון ...    -1   -1    -1   -1
  13  5   8 לקניון לקניון ...    -1   -1    -1   -1
  14  5   8 לקניון לקניון ...    -1   -1    -1   -1
  15  5   8 לקניון לקניון ...    -1   -1    -1   -1
  16  5   8 לקניון לקניון ...    -1   -1    -1   -1
  17  5   8 לקניון לקניון ...    -1   -1    -1   -1
  18  5   8 לקניון לקניון ...    -1   -1    -1   -1
  19  5   8 לקניון לקניון ...    -1   -1    -1   -1
  20  5   8 לקניון לקניון ...    -1   -1    -1   -1
  21  6   7    ה    ה ...    -1   -1    -1   -1
  22  6   8  קניון  קניון ...    -1   -1    -1   -1
  23  6   8  קניון  קניון ...    -1   -1    -1   -1
  24  7   8  קניון  קניון ...    -1   -1    -1   -1
  25  7   8  קניון  קניון ...    -1   -1    -1   -1
 
  [26 rows x 15 columns]

R Wrapper - credit to Almog Simchon

The wrapper sends the API a joint command, parses the response, and outputs the following:

  • Lemmas.
  • CoNLL style dependency tree.
  • Lattice style dependency tree. To Run the R wrapper use the following code: ```console text <- “גנן גידל דגן בגן” token <- “YourAPIToken” yap_list <- yap(text, token)

yap_lemmas(yap_list) #get lemmas dep_conll(yap_list) #get CoNLL style dependency tree dep_lattice(yap_list) #get Lattice style dependency tree

The results should be the same as the python wrapper’s results.

The API url can be changed inside the file ```console/yappr/R/yap.R ```

```console
url <- paste0('https://www.langndata.com/api/heb_parser?token=',Token)

The url needs to be the url of the YAP API you wish to use

JavaScript/TypeScript Client -

An NPM library can be used to see both the Morphological Analysis The Morphological Disambiguation and the Dependency Tree into an easy-to-use data structure. To install the Client we need to install it via npm:

npm i @nnlp-il/yap-js-client

Usage of the NPM library in an example:

const { YapClient } = require("@nnlp-il/yap-js-client");
const test = async () => {
const client = new YapClient('http://localhost:8000');
const result = await client.jointAnalysis("גנן גידל גנן בגן");
console.log(result);
}
test();

YAP as a Web Service

YAP API exists as a web service. The service requires registration and has the following limitations:

  • 1 call per 3 seconds.
  • 250 words max per call.
  • Long text requires longer processing time, Set the HTTP call timeout accordingly.

Running YAP Commands Manually

We can run the commands from the terminal/cmd using a local text file and analyze it. Using the command line, you can process one input file at a time. Note it is possible to have multiple sentences in a single file. The original YAP repository contains a manual on how to run YAP in the command line manually. Please ensure that you have Go and Git installed and on your command PATH.
If you are new to GOlang: Set up your environment with one of these guides:

If the setup is not working it is usually either 1 of these 3 things:

  • Go version is above 1.15.
  • GOPATH is not set up correctly.
  • bzip2 is not installed on your pc. If bunzip does not work the extraction of the bzip files can be done manually which allows you to delete the console bunzip2 data/*.bz2 line.

    Running YAP API Manually

    YAP can run as a server listening on port 8000:

    $ ./yap api
    

    To change the port from 8000 we need to yap/webapi/webapi.go and change the 8000 to the wanted port (line 12 here):

    func StartAPIServer(cmd *commander.Command, args []string) error {
    HebrewMorphAnalyazerInitialize(cmd, args)
    MorphDisambiguatorInitialize(cmd, args)
    DepParserInitialize(cmd, args)
    JointParserInitialize()
    router = mux.NewRouter()
    router.HandleFunc("/yap/heb/ma", HebrewMorphAnalyzerHandler)
    router.HandleFunc("/yap/heb/md", MorphDisambiguatorHandler)
    router.HandleFunc("/yap/heb/dep", DepParserHandler)
    router.HandleFunc("/yap/heb/pipeline", HebrewPipelineHandler)
    router.HandleFunc("/yap/heb/joint", HebrewJointHandler)
    log.Fatal(http.ListenAndServe(":8000", router))
    return nil
    

    Using Yap Summary

    To summarize, YAP can be used YAP in a variety of ways:

  • Install YAP and run it manually via command line.
  • Install YAP and run it via API.
  • Sign Up and use the YAP API here

Here is a comparison for using YAP:

Category/YAP Usage Manually Local API Remote API
Need Installation Yes Yes No
RAM Usage 2.5GB 5GB 1GB
Changeable Code Yes Yes No
Can use Different Models Yes Yes No
Can be used with current Wrappers No Yes Yes
Speed Fast Fast Slow
Share: Twitter Facebook
NNLP-IL's Picture

About NNLP-IL

NNLP-IL Blog - a national initiative for the creation of infrastructure, research and development of advanced capabilities for the advancement of the field of NLP in Hebrew and Arabic.

Israel https://nnlp-il.mafat.ai/
-->