kengz
73a38bd4d1
Merge remote-tracking branch 'upstream/master'
2016-12-30 12:19:59 -05:00
kengz
da44183ae1
move parse_tree logic to a new tokens/printers.py file
2016-12-30 12:19:18 -05:00
Pokey Rule
3e3bda142d
Add noun_chunks to Span
2016-11-24 10:47:20 +00:00
Matthew Honnibal
1fb09c3dc1
Fix morphology tagger
2016-11-04 19:19:09 +01:00
Matthew Honnibal
f292f7f0e6
Fix Issue #599, by considering empty documents to be parsed and tagged. Implementation is a bit dodgy.
2016-11-02 23:48:43 +01:00
Matthew Honnibal
e7af6b937f
Fix syntax error while fixing doc strings
2016-11-01 13:27:32 +01:00
Matthew Honnibal
b86f8af0c1
Fix doc strings
2016-11-01 12:25:36 +01:00
Matthew Honnibal
4ca31b4d87
Fix clobbering of 'missing' named ent values after assigning ents.
2016-10-26 13:13:56 +02:00
Matthew Honnibal
15c9b59f0e
Fix Issue #461: O tag was being clobbered by doc.ents.__set__
2016-10-23 15:50:26 +02:00
Matthew Honnibal
2c3a67b693
Fix calculation of vector norm, re Issue #522. Need to consolidate the calculations into a helper function.
2016-10-23 14:49:31 +02:00
Matthew Honnibal
3588a18fb8
Fix hook names in doc
2016-10-19 21:15:16 +02:00
Matthew Honnibal
5d5742b773
Add sentiment field to doc, rename getters_for_tokens and getters_for_spans, add user_hooks field to Doc.
2016-10-19 20:54:22 +02:00
Matthew Honnibal
9b60186266
Fix doc class
2016-10-17 15:23:47 +02:00
Matthew Honnibal
b67697a97b
Improve API for doc.merge() and span.merge(), to use keyword arguments.
2016-10-17 14:02:13 +02:00
Matthew Honnibal
fbb7f3f15c
Add user_data attribute to Doc object.
2016-10-17 11:43:22 +02:00
Matthew Honnibal
62230dd13a
Add getters_for_spans and getters_for_tokens attributes to Doc. Fix docstring
2016-10-17 02:42:51 +02:00
Matthew Honnibal
311a985fe0
Add input error handling in Doc
2016-10-16 18:16:42 +02:00
Matthew Honnibal
06322ba99d
Add words and spaces keyword arguments to Doc.
2016-10-16 18:13:03 +02:00
Matthew Honnibal
6736977d82
Revert "Changes to Doc and Token for new string store scheme"
This reverts commit 99de44d864.
2016-09-30 20:11:15 +02:00
Matthew Honnibal
99de44d864
Changes to Doc and Token for new string store scheme
2016-09-30 20:00:21 +02:00
Matthew Honnibal
d3dc5718b2
Fix syntax error in Doc
2016-09-28 11:39:49 +02:00
Matthew Honnibal
1b520e7bab
Improve docstrings for Doc object
2016-09-28 11:15:13 +02:00
Matthew Honnibal
fc4a7ad794
Test and fix Issue #411: IndexError when .sents property is used on empty string.
2016-09-27 18:49:14 +02:00
Matthew Honnibal
15e42a1ba9
Allow entities to be set by Span, or by 4-tuple (with entity ID)
2016-09-24 01:17:43 +02:00
Matthew Honnibal
2735b6247b
Fix orths_and_spaces in Doc.__init__
2016-09-21 14:52:05 +02:00
Matthew Honnibal
cdc10e9a1c
* Fix Issue #375: noun phrase iteration results in index error if noun phrases are merged during the loop. Fix by accumulating the spans inside the noun_chunks property, allowing the Span index tricks to work.
2016-05-20 10:14:06 +02:00
Matthew Honnibal
5d86c30f0b
* Fix Issue #367: Missing has_vector property on Doc and Span objects
2016-05-09 12:36:14 +02:00
Matthew Honnibal
76021cb853
* Fix bug in Doc.text, introduced by a862edc
2016-05-04 11:02:16 +02:00
Matthew Honnibal
29a114e645
* Don't assign 0-valued tags in Doc.from_array
2016-05-02 16:07:50 +02:00
Matthew Honnibal
508fd1f6dc
* Refactor noun chunk iterators, so that they're simple functions. Install the iterator when the Doc is created, but allow users to write to the noun_chunk_iterator attribute. The iterator functions accept an object and yield (int start, int end, int label) triples.
2016-05-02 14:25:10 +02:00
Matthew Honnibal
872695759d
Merge pull request #306 from wbwseeker/german_noun_chunks
add German noun chunk functionality
2016-04-08 00:54:24 +10:00
Matthew Honnibal
ad119c074f
* Fix incorrect whitespacing in Doc.text. This change is potentially breaking, to anyone who was relying on the previous incorrect semantics.
2016-03-29 13:02:42 +11:00
Wolfgang Seeker
d65ef41d08
make error messages language independent
2016-03-24 11:47:09 +01:00
Wolfgang Seeker
5e2e8e951a
add baseclass DocIterator for iterators over documents
add classes for English and German noun chunks
the respective iterators are set for the document when created by the parser
as they depend on the annotation scheme of the parsing model
2016-03-16 15:53:35 +01:00
Wolfgang Seeker
03fb498dbe
introduce lang field for LexemeC to hold language id
put noun_chunk logic into iterators.py for each language separately
2016-03-10 13:01:34 +01:00
Wolfgang Seeker
d9312bc9ea
add new files npchunks.{pyx,pxd} to hold noun phrase chunk generators
2016-03-09 16:18:48 +01:00
Matthew Honnibal
af8514cb0c
* Refine the way the is_parsed attribute is set by from_array
2016-02-06 14:44:35 +01:00
Matthew Honnibal
6bb007d16e
* Make set_parse nogil
2016-01-30 20:27:52 +01:00
Matthew Honnibal
f24833d607
* Fix merge for coordinations
2016-01-18 16:03:19 +01:00
Matthew Honnibal
fc8f26584a
* Don't consider NPs connected to parse via conj relation as noun chunks. Change motivated by the nested noun chunks identified in Issue #203, but might be problematic. Also allow root NPs to be considered noun chunks.
2016-01-16 17:52:40 +01:00
Matthew Honnibal
54a98eaf19
* Fix typo text_wth_ws --> text_with_ws. Reroute .string attribute to text_with_ws, to deprecate .string in future
2016-01-16 17:13:50 +01:00
Matthew Honnibal
a9b612abdf
* Rework the Span-merge patch, to avoid extending the interface of Doc, and avoid virtualizing the Span.start and Span.end indices, to keep Span usage efficient
2015-11-07 09:01:12 +11:00
Matthew Honnibal
56499d89ef
* Rework the Span-merge patch, to avoid extending the interface of Doc, and avoid virtualizing the Span.start and Span.end indices, to keep Span usage efficient
2015-11-07 08:55:34 +11:00
Andreas Grivas
562db6d2d0
* merge add lex last - add index finder funcs
2015-11-07 07:57:04 +11:00
Matthew Honnibal
68f479e821
* Rename Doc.data to Doc.c
2015-11-04 00:15:14 +11:00
Matthew Honnibal
9482d616bc
* Rename spans.pyx to span.pyx
2015-11-03 23:51:05 +11:00
Matthew Honnibal
116da5990a
* Clean up setting of tag in doc.from_bytes
2015-11-03 23:48:57 +11:00
Matthew Honnibal
1e99fcd413
* Rename .repvec to .vector in C API
2015-11-03 23:47:59 +11:00
Matthew Honnibal
9e37437ba8
* Fix assign_tag in doc.merge
2015-11-03 19:07:02 +11:00
Matthew Honnibal
833eb35c57
* Fix tag assignment in doc.from_array
2015-11-03 18:45:54 +11:00
Matthew Honnibal
09664177d7
* Fix tag handling in doc.merge, and assign sent_start when setting heads.
2015-11-03 18:15:52 +11:00
Matthew Honnibal
604ceac4c6
* Fix morphological assignment in doc.merge()
2015-11-03 17:57:51 +11:00
Matthew Honnibal
5e040855a5
* Ensure morphological features and lemmas are loaded in from_array, re Issue #152
2015-11-03 17:56:50 +11:00
Andreas Grivas
d418f00eb1
fixed error when printing unicode
2015-11-02 20:23:18 +02:00
Matthew Honnibal
52fc338001
* Set is_parsed and is_tagged attrs when loading annotations into Doc, re Issue #152
2015-10-28 10:43:22 +11:00
Andreas Grivas
93ada458e2
added __repr__ that prints text in ipython for doc, token, and span objects
2015-10-21 14:11:46 +03:00
Matthew Honnibal
135062d23c
* Fix error with merged text when merged region did not have trailing whitespace
2015-10-19 15:47:04 +11:00
Matthew Honnibal
a7e6c5ac8f
* Fix Issue #122: Incorrect calculation of children after Doc.merge()
2015-10-18 17:17:27 +11:00
Matthew Honnibal
94bafc1417
* Rename ATTR_IDS to attrs.IDS. Rename ATTR_NAMES to attrs.NAMES. Rename UNIV_POS_IDS to parts_of_speech.IDS
2015-10-10 17:57:29 +11:00
Yubing (Tom) Dong
0f601b8b75
Update docstring of Doc.__getitem__
2015-10-07 01:27:28 -07:00
Yubing (Tom) Dong
3fd3bc79aa
Refactor to remove duplicate slicing logic
2015-10-07 01:25:35 -07:00
Yubing (Tom) Dong
2fc33e8024
Allow step=1 when slicing a Doc
2015-10-06 00:57:05 -07:00
Matthew Honnibal
ab694b0364
* Fix open-bounded slice indices.
2015-09-29 23:03:09 +10:00
Matthew Honnibal
f7283a5067
* Fix vectors bugs for OOV words
2015-09-22 02:10:25 +02:00
Matthew Honnibal
f32927efbf
* Raise exceptions if attempt to access parse, but data is not installed. This partly but not fully addresses Issue #97. Still need exceptions on the various Token attributes that access the parse tree, e.g. token.head, token.lefts, token.rights, etc. Exceptions should be centralized, too.
2015-09-21 18:35:40 +10:00
Matthew Honnibal
77856c4fcd
* Try giving Doc and Span objects vector and vector_norm attributes, and .similarity functions. Turns out to be bad idea.
2015-09-17 11:50:11 +10:00
Matthew Honnibal
60c26b2dfa
* Fix slicing when start or stop is None
2015-09-15 14:43:10 +10:00
Matthew Honnibal
65dc0d1dfb
* Extend word vectors support, with .similarity() function, vector_norm property, and rename repvec to vector. Keep repvec name as well for now for backwards compatibility.
2015-09-14 17:49:58 +10:00
Matthew Honnibal
c08f10083c
* Add test and test_with_ws attributes.
2015-09-13 10:27:42 +10:00
Matthew Honnibal
9e7bfe8449
* Fix space at end of merged token
2015-09-10 14:45:17 +02:00
Matthew Honnibal
31ccf494e6
Merge branch 'develop' of https://github.com/honnibal/spaCy into develop
2015-09-09 14:33:38 +02:00
Matthew Honnibal
07686470a9
* Don't consider a coordinated NP a base chunk
2015-09-09 14:32:28 +02:00
Matthew Honnibal
0e24d099a1
* Fix L/R edge bug, by ensuring l_edge and r_edge are preset, and fixing the way the edge update in del_arc. Bugs keep arising here because the edges are absolute positions, where everything else is relative. I'm also not 100% convinced that del_arc is handled correctly. Do we need to update the parents?
2015-09-09 03:40:44 +02:00
Matthew Honnibal
86c888667f
* Merge in changes from de branch
2015-09-06 19:49:28 +02:00
Matthew Honnibal
d2fc104a26
* Begin merge of Gazetteer and DE branches
2015-09-06 19:45:15 +02:00
Matthew Honnibal
fd1eeb3102
* Add POS attribute support in get_attr
2015-09-06 04:13:03 +02:00
Matthew Honnibal
c2307fa9ee
* More work on language-generic parsing
2015-08-28 02:02:33 +02:00
Matthew Honnibal
6f1743692a
* Work on language-independent refactoring
2015-08-23 20:49:18 +02:00
Matthew Honnibal
b0f5c39084
* Fix handling of exclusion entities
2015-08-06 17:28:43 +02:00
Matthew Honnibal
10d869d102
* Don't allow conjunction between NPs in base NP chunks
2015-08-06 16:31:53 +02:00
Matthew Honnibal
9c1724ecae
* Gazetteer stuff working, now need to wire up to API
2015-08-06 00:35:40 +02:00
Matthew Honnibal
eb7138c761
* Add attr relation in base NP detection
2015-08-01 00:34:40 +02:00
Matthew Honnibal
4988356cf0
* Fix dependency type bug from merged tokens
2015-08-01 00:33:24 +02:00
Matthew Honnibal
78a9068319
* Fix spacy attr on merged tokens
2015-07-30 04:25:58 +02:00
Matthew Honnibal
430e2edb96
* Fix noun_chunks issue
2015-07-30 03:51:50 +02:00
Matthew Honnibal
74d8cb3980
* Add noun_chunks iterator, and fix left/right child setting in Doc.merge
2015-07-30 02:29:49 +02:00
Matthew Honnibal
b5132bed7d
* Set left and right children when loading parse from byte string
2015-07-28 21:03:18 +02:00
Matthew Honnibal
aa7a964a4f
* Add a type declaration for doc.from_array
2015-07-27 22:57:22 +02:00
Matthew Honnibal
2060935cdb
* Remove explicit bytes type in doc.from_bytes, to accept bytearray
2015-07-24 04:54:13 +02:00
Matthew Honnibal
0bb839d299
* Fix string coercion for Python 3
2015-07-24 03:49:30 +02:00
Matthew Honnibal
a0e36e8efc
* Add working to/from bytes API to Doc
2015-07-23 01:14:45 +02:00
Matthew Honnibal
4d61239eac
* Reorganize the serialization functions on Doc
2015-07-22 04:53:01 +02:00
Matthew Honnibal
8743a8c084
* Update Doc serialization for new Packer interface
2015-07-20 01:38:04 +02:00
Matthew Honnibal
317cbbc015
* Serialization round trip now working with decent API, but with rough spots in the organisation and requiring vocabulary to be fixed ahead of time.
2015-07-19 15:18:17 +02:00
Matthew Honnibal
6b13e7227c
* Remove duplicate get_lex_attr method from doc.pyx
2015-07-18 22:46:07 +02:00
Matthew Honnibal
ced59ab9ea
* Make minor efficiency improvement in Doc.__iter__
2015-07-18 04:10:53 +02:00
Matthew Honnibal
cf0c788892
* Tests passing on round-trip pack/unpack on basic example
2015-07-17 21:20:48 +02:00
Matthew Honnibal
dfdf19f6a9
* Draft a from_orth method for Doc
2015-07-17 16:39:54 +02:00
Matthew Honnibal
db9dfd2e23
* Major refactor of serialization. Nearly complete now.
2015-07-17 01:27:54 +02:00
Matthew Honnibal
a6f401580d
* Add from_array function to Doc.
2015-07-16 17:46:11 +02:00
Matthew Honnibal
e2133d990e
* Move serialization functionality out into a Serializer object
2015-07-16 11:21:44 +02:00
Matthew Honnibal
01fab6bb90
* Improve de/serialize functions
2015-07-16 01:26:35 +02:00
Matthew Honnibal
0e07c1ed2a
* draft de/serialization functions in doc.pyx
2015-07-16 01:16:33 +02:00
Matthew Honnibal
9d956b07e9
* Fix import of attrs in doc.pyx, and update the get_token_attr function.
2015-07-16 01:15:34 +02:00
Matthew Honnibal
935ac53ee3
* Extend count_by method
2015-07-14 03:20:09 +02:00
Matthew Honnibal
81aa4e6dcc
* Go back to having token reference doc, instead of complicated gymnastics. Rename the attr 'doc', to expose it in the API
2015-07-14 00:10:11 +02:00
Matthew Honnibal
8214b74eec
* Restore _py_tokens cache, to handle orphan tokens.
2015-07-13 22:28:10 +02:00
Matthew Honnibal
67641f3b58
* Refactor tokenizer, to set the 'spacy' field on TokenC instead of passing a string
2015-07-13 21:46:02 +02:00
Matthew Honnibal
3ea8756c24
* Add spacy/tokens/doc.pyx, for Doc class in its own file
2015-07-13 19:58:26 +02:00