mirror of https://github.com/explosion/spaCy.git
synced 2025-11-04 09:57:26 +03:00

* Add spec.jade

parent b57a3ddd7e
commit ba00c72505

docs/redesign/spec.jade (123 lines added, normal file)

@ -0,0 +1,123 @@
extends ./outline.jade

mixin columns(...names)
  tr
    each name in names
      th= name


mixin row(...cells)
  tr
    each cell in cells
      td= cell


block body_block
  article(class="page docs-page")
    p.
      This document describes the target annotations spaCy is trained to predict.
      This is currently a work in progress. Please ask questions on the issue tracker,
      so that the answers can be integrated here to improve the documentation.

    h2 Tokenization

    p Tokenization standards are based on the OntoNotes 5 corpus.

    p.
      The tokenizer differs from most by including tokens for significant
      whitespace. Any sequence of whitespace characters beyond a single space
      (' ') is included as a token. For instance:

    pre.language-python
      code
        | from spacy.en import English
        | nlp = English(parse=False)
        | tokens = nlp('Some\nspaces  and\ttab characters')
        | print([t.orth_ for t in tokens])

    p Which produces:

    pre.language-python
      code
        | ['Some', '\n', 'spaces', ' ', 'and', '\t', 'tab', 'characters']

    p.
      Whitespace tokens are useful for much the same reason punctuation
      tokens are: whitespace is often an important delimiter in the text.
      By preserving it in the token output, we are able to maintain a simple
      alignment between the tokens and the original string, and we ensure
      that no information is lost during processing.
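
    p.
      As a rough illustration of that alignment, the sketch below recovers
      character offsets for the tokens listed above using plain Python string
      search. The helper align_tokens is hypothetical and not part of spaCy;
      it only assumes that every token occurs in the text, in order. Where two
      spaces are adjacent, the sketch attributes the first one to the
      whitespace token, which may differ from the offsets spaCy itself reports.

```python
def align_tokens(text, tokens):
    """Return (start, end) character offsets for each token, scanning left to right."""
    offsets = []
    cursor = 0
    for token in tokens:
        # Search from the end of the previous token so repeated strings stay in order.
        start = text.index(token, cursor)
        end = start + len(token)
        offsets.append((start, end))
        cursor = end
    return offsets


text = 'Some\nspaces  and\ttab characters'
tokens = ['Some', '\n', 'spaces', ' ', 'and', '\t', 'tab', 'characters']
offsets = align_tokens(text, tokens)

# Every token maps back to an exact slice of the original string.
assert all(text[s:e] == t for (s, e), t in zip(offsets, tokens))

# The text skipped between consecutive tokens is either empty or a single space.
gaps = [text[prev_end:start]
        for (_, prev_end), (start, _) in zip(offsets, offsets[1:])]
assert set(gaps) <= {'', ' '}
```

    p.
      Because the only text skipped between tokens is a single space, the
      original string can be rebuilt from the tokens and their offsets.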

    h3 Sentence Boundary Detection

    p.
      Sentence boundaries are calculated from the syntactic parse tree, so
      features such as punctuation and capitalisation play an important but
      non-decisive role in determining the sentence boundaries. Usually this
      means that the sentence boundaries will at least coincide with clause
      boundaries, even given poorly punctuated text.
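
    p.
      The idea of reading sentence boundaries off a parse can be sketched in
      plain Python. This is not spaCy's implementation: the function
      sentence_spans, the convention that a root token is its own head, and
      the hand-written head indices are all assumptions made for illustration.

```python
def sentence_spans(heads):
    """Group token indices into sentences, given each token's head index.

    A token whose head is itself is treated as a sentence root (an assumption
    of this sketch). Consecutive tokens that share a root form one sentence.
    """
    def root_of(i):
        # Follow head links upward until a self-headed token is reached.
        while heads[i] != i:
            i = heads[i]
        return i

    spans = []
    for i in range(len(heads)):
        root = root_of(i)
        if spans and spans[-1][0] == root:
            spans[-1][1].append(i)
        else:
            spans.append((root, [i]))
    return [indices for _, indices in spans]


# Hand-written parse of unpunctuated, uncapitalised text (not real parser output).
words = ['i', 'like', 'cats', 'they', 'ignore', 'me']
heads = [1, 1, 1, 4, 4, 4]

sentences = [[words[i] for i in span] for span in sentence_spans(heads)]
print(sentences)
```

    p.
      In this toy input the two clauses are separated purely by the tree
      structure, with no punctuation or capitalisation to rely on.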

    h3 Part-of-Speech Tagging

    p.
      The part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank
      tag set. We also map the tags to the simpler Google Universal POS tag set.

      Details here: https://github.com/honnibal/spaCy/blob/master/spacy/en/pos.pyx#L124
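
    p.
      To give a feel for the mapping, here is a small sketch covering a
      handful of Penn Treebank tags. The table is an illustrative subset
      written for this example, and the names PTB_TO_UNIVERSAL and
      to_universal are hypothetical; the linked source file is the
      authoritative mapping. Falling back to X for unknown tags is an
      assumption of the sketch.

```python
# Illustrative subset of a Penn Treebank -> Universal POS mapping.
PTB_TO_UNIVERSAL = {
    'NN': 'NOUN', 'NNS': 'NOUN', 'NNP': 'NOUN', 'NNPS': 'NOUN',
    'VB': 'VERB', 'VBD': 'VERB', 'VBG': 'VERB',
    'VBN': 'VERB', 'VBP': 'VERB', 'VBZ': 'VERB',
    'JJ': 'ADJ', 'JJR': 'ADJ', 'JJS': 'ADJ',
    'RB': 'ADV', 'RBR': 'ADV', 'RBS': 'ADV',
    'DT': 'DET', 'IN': 'ADP', 'PRP': 'PRON',
    'CD': 'NUM', 'CC': 'CONJ',
}


def to_universal(tagged):
    """Attach a coarse tag to each (word, fine_tag) pair; unknown tags become 'X'."""
    return [(word, tag, PTB_TO_UNIVERSAL.get(tag, 'X')) for word, tag in tagged]


tagged = [('The', 'DT'), ('dogs', 'NNS'), ('barked', 'VBD'), ('loudly', 'RB')]
coarse = to_universal(tagged)
for word, fine, universal in coarse:
    print(word, fine, universal)
```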

    h3 Lemmatization

    p.
      A "lemma" is the uninflected form of a word. In English, this means:

    ul
      li Adjectives: The form like "happy", not "happier" or "happiest"
      li Adverbs: The form like "badly", not "worse" or "worst"
      li Nouns: The form like "dog", not "dogs"; like "child", not "children"
      li Verbs: The form like "write", not "writes", "writing", "wrote" or "written"

    p.
      The lemmatization data is taken from WordNet. However, we also add a
      special case for pronouns: all pronouns are lemmatized to the special
      token -PRON-.
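
    p.
      The rules above can be sketched as a table lookup with a pronoun special
      case. This is a toy, not spaCy's lemmatizer: the EXCEPTIONS table is a
      hand-written stand-in for WordNet data, and the lemmatize function and
      its lowercase part-of-speech labels are hypothetical.

```python
# Hand-written stand-in for WordNet data, covering only the examples listed above.
EXCEPTIONS = {
    ('adj', 'happier'): 'happy',
    ('adj', 'happiest'): 'happy',
    ('adv', 'worse'): 'badly',
    ('adv', 'worst'): 'badly',
    ('noun', 'dogs'): 'dog',
    ('noun', 'children'): 'child',
    ('verb', 'writes'): 'write',
    ('verb', 'writing'): 'write',
    ('verb', 'wrote'): 'write',
    ('verb', 'written'): 'write',
}


def lemmatize(word, pos):
    """Return the lemma for a word, given a coarse part-of-speech label."""
    if pos == 'pron':
        # Special case: every pronoun shares one placeholder lemma.
        return '-PRON-'
    key = word.lower()
    # Words missing from the table are assumed to be uninflected already.
    return EXCEPTIONS.get((pos, key), key)


for word, pos in [('They', 'pron'), ('children', 'noun'), ('written', 'verb'), ('dog', 'noun')]:
    print(word, '->', lemmatize(word, pos))
```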


    h3 Syntactic Dependency Parsing

    p.
      The parser is trained on data produced by the ClearNLP converter. Details
      of the annotation scheme can be found here: http://www.mathcs.emory.edu/~choi/doc/clear-dependency-2012.pdf

    h3 Named Entity Recognition

    table
      thead
        +columns("Entity Type", "Description")

      tbody
        +row("PERSON", "People, including fictional.")
        +row("NORP", "Nationalities or religious or political groups.")
        +row("FACILITY", "Buildings, airports, highways, bridges, etc.")
        +row("ORG", "Companies, agencies, institutions, etc.")
        +row("GPE", "Countries, cities, states.")
        +row("LOC", "Non-GPE locations, mountain ranges, bodies of water.")
        +row("PRODUCT", "Vehicles, weapons, foods, etc. (Not services.)")
        +row("EVENT", "Named hurricanes, battles, wars, sports events, etc.")
        +row("WORK_OF_ART", "Titles of books, songs, etc.")
        +row("LAW", "Named documents made into laws")
        +row("LANGUAGE", "Any named language")

    p The following values are also annotated in a style similar to names:

    table
      thead
        +columns("Entity Type", "Description")

      tbody
        +row("DATE", "Absolute or relative dates or periods")
        +row("TIME", "Times smaller than a day")
        +row("PERCENT", 'Percentage (including “%”)')
        +row("MONEY", "Monetary values, including unit")
        +row("QUANTITY", "Measurements, as of weight or distance")
        +row("ORDINAL", '"first", "second"')
        +row("CARDINAL", "Numerals that do not fall under another type")