What do spaCy's part-of-speech and dependency tags mean?

Tags: Python, Nlp, Spacy

Python Problem Overview


spaCy tags up each of the Tokens in a Document with a part of speech (in two different formats, one stored in the pos and pos_ properties of the Token and the other stored in the tag and tag_ properties) and a syntactic dependency to its .head token (stored in the dep and dep_ properties).

Some of these tags are self-explanatory, even to somebody like me without a linguistics background:

>>> import spacy
>>> en_nlp = spacy.load('en')
>>> document = en_nlp("I shot a man in Reno just to watch him die.")
>>> document[1]
shot
>>> document[1].pos_
'VERB'

Others... are not:

>>> document[1].tag_
'VBD'
>>> document[2].pos_
'DET'
>>> document[3].dep_
'dobj'

Worse, the official docs don't even contain a list of the possible values for most of these properties, let alone their meanings. They sometimes mention which tokenization standard they use, but these claims aren't currently entirely accurate, and on top of that the standards themselves are tricky to track down.

What are the possible values of the tag_, pos_, and dep_ properties, and what do they mean?

Python Solutions


Solution 1 - Python

tl;dr answer

Just expand the lists at:

Longer answer

The docs have greatly improved since I first asked this question, and spaCy now documents this much better.

Part-of-speech tags

The pos and tag attributes are tabulated at https://spacy.io/api/annotation#pos-tagging, and the origin of those lists of values is described. At the time of this (January 2020) edit, the docs say of the pos attribute that:

> spaCy maps all language-specific part-of-speech tags to a small, fixed set of word type tags following the Universal Dependencies scheme. The universal tags don’t code for any morphological features and only cover the word type. They’re available as the Token.pos and Token.pos_ attributes.

As for the tag attribute, the docs say:

> The English part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank tag set. We also map the tags to the simpler Universal Dependencies v2 POS tag set.

and

> The German part-of-speech tagger uses the TIGER Treebank annotation scheme. We also map the tags to the simpler Universal Dependencies v2 POS tag set.

You thus have a choice between using a coarse-grained tag set that is consistent across languages (.pos), or a fine-grained tag set (.tag) that is specific to a particular treebank, and hence a particular language.

.pos_ tag list

The docs list the following coarse-grained tags used for the pos and pos_ attributes:

  • ADJ: adjective, e.g. big, old, green, incomprehensible, first
  • ADP: adposition, e.g. in, to, during
  • ADV: adverb, e.g. very, tomorrow, down, where, there
  • AUX: auxiliary, e.g. is, has (done), will (do), should (do)
  • CONJ: conjunction, e.g. and, or, but
  • CCONJ: coordinating conjunction, e.g. and, or, but
  • DET: determiner, e.g. a, an, the
  • INTJ: interjection, e.g. psst, ouch, bravo, hello
  • NOUN: noun, e.g. girl, cat, tree, air, beauty
  • NUM: numeral, e.g. 1, 2017, one, seventy-seven, IV, MMXIV
  • PART: particle, e.g. ’s, not
  • PRON: pronoun, e.g. I, you, he, she, myself, themselves, somebody
  • PROPN: proper noun, e.g. Mary, John, London, NATO, HBO
  • PUNCT: punctuation, e.g. ., (, ), ?
  • SCONJ: subordinating conjunction, e.g. if, while, that
  • SYM: symbol, e.g. $, %, §, ©, +, −, ×, ÷, =, :)
  • VERB: verb, e.g. run, runs, running, eat, ate, eating
  • X: other, e.g. sfpksdpsxmsa
  • SPACE: space, e.g.

Note that the docs are lying slightly when they say that this list follows the Universal Dependencies Scheme; there are two tags listed above that aren't part of that scheme.

One of those is CONJ, which used to exist in the Universal POS Tags scheme but has been split into CCONJ and SCONJ since spaCy was first written. Based on the mappings of tag->pos in the docs, it would seem that spaCy's current models don't actually use CONJ, but it still exists in spaCy's code and docs for some reason - perhaps backwards compatibility with old models.

The second is SPACE, which isn't part of the Universal POS Tags scheme (and never has been, as far as I know) and is used by spaCy for any spacing besides single normal ASCII spaces (which don't get their own token):

>>> document = en_nlp("This\nsentence\thas      some weird spaces in\n\n\n\n\t\t   it.")
>>> for token in document:
...   print('%r (%s)' % (str(token), token.pos_))
... 
'This' (DET)
'\n' (SPACE)
'sentence' (NOUN)
'\t' (SPACE)
'has' (VERB)
'     ' (SPACE)
'some' (DET)
'weird' (ADJ)
'spaces' (NOUN)
'in' (ADP)
'\n\n\n\n\t\t   ' (SPACE)
'it' (PRON)
'.' (PUNCT)

I'll omit the full list of .tag_ tags (the finer-grained ones) from this answer, since they're numerous, well-documented now, different for English and German, and probably more likely to change between releases. Instead, look at the list in the docs (e.g. https://spacy.io/api/annotation#pos-en for English) which lists every possible tag, the .pos_ value it maps to, and a description of what it means.

Dependency tokens

There are now three different schemes that spaCy uses for dependency tagging: one for English, one for German, and one for everything else. Once again, the list of values is huge and I won't reproduce it in full here. Every dependency has a brief definition next to it, but unfortunately, many of them - like "appositional modifier" or "clausal complement" - are terms of art that are rather alien to an everyday programmer like me. If you're not a linguist, you'll simply have to research the meanings of those terms of art to make sense of them.

I can at least provide a starting point for that research for people working with English text, though. If you'd like to see some examples of the CLEAR dependencies (used by the English model) in real sentences, check out the 2012 work of Jinho D. Choi: either his Optimization of Natural Language Processing Components for Robustness and Scalability or his Guidelines for the CLEAR Style Constituent to Dependency Conversion (which seems to just be a subsection of the former paper). Both list all the CLEAR dependency labels that existed in 2012 along with definitions and example sentences. (Unfortunately, the set of CLEAR dependency labels has changed a little since 2012, so some of the modern labels are not listed or exemplified in Choi's work - but it remains a useful resource despite being slightly outdated.)

Solution 2 - Python

Just a quick tip about getting the detailed meaning of the short forms: you can use the explain method like so:

spacy.explain('pobj')

which will give you output like:

'object of preposition'
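explain() works for coarse POS tags, fine-grained tags, and dependency labels alike, and it doesn't require loading a model; it is just a lookup into spaCy's built-in glossary:

```python
import spacy

# spacy.explain() is a plain glossary lookup, so no language
# model needs to be downloaded or loaded to use it.
for label in ["DET", "VBD", "dobj", "pobj"]:
    print(label, "->", spacy.explain(label))
# DET -> determiner
# VBD -> verb, past tense
# dobj -> direct object
# pobj -> object of preposition
```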

Solution 3 - Python

The official documentation now provides much more details for all those annotations at https://spacy.io/api/annotation (and the list of other attributes for tokens can be found at https://spacy.io/api/token).

As the documentation shows, spaCy's part-of-speech (POS) and dependency tags have both Universal and language-specific variants, and the explain() function is a very useful shortcut for getting a better description of a tag's meaning without consulting the documentation, e.g.

spacy.explain("VBD")

which gives "verb, past tense".

Solution 4 - Python

Direct links (if you don't feel like going through endless spacy documentation to get the full tables):

Solution 5 - Python

After the recent update of spaCy to v3, the above links no longer work. You can visit https://v2.spacy.io/api/annotation to get the complete list.

Both the Universal POS tags and the English POS tags are tabulated there.

Solution 6 - Python

At present, dependency parsing and tagging in spaCy appear to be implemented only at the word level, and not at the phrase (other than noun phrase) or clause level. This means spaCy can be used to identify things like nouns (NN, NNS), adjectives (JJ, JJR, JJS), and verbs (VB, VBD, VBG, etc.), but not adjective phrases (ADJP), adverbial phrases (ADVP), or questions (SBARQ, SQ).

For illustration, when you use spaCy to parse the sentence "Which way is the bus going?", you get the following tree.

By contrast, if you use the Stanford parser you get a much more deeply structured syntax tree.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Mark Amery | View Question on Stackoverflow
Solution 1 - Python | Mark Amery | View Answer on Stackoverflow
Solution 2 - Python | Nuhil Mehdy | View Answer on Stackoverflow
Solution 3 - Python | Silveri | View Answer on Stackoverflow
Solution 4 - Python | Ovidiu Oprea | View Answer on Stackoverflow
Solution 5 - Python | Samual | View Answer on Stackoverflow
Solution 6 - Python | rebeccabilbro | View Answer on Stackoverflow