6840 Commits

Author SHA1 Message Date
Matthew Honnibal
30e67fa808 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-24 16:08:23 +02:00
Matthew Honnibal
b0f6fd3f1d Disable tokenizer cache for special-cases. Fixes #1250 2017-10-24 16:08:05 +02:00
Matthew Honnibal
63f0bde749 Add test for #1250: Tokenizer cache clobbered special-case attrs 2017-10-24 16:07:18 +02:00
ines
8492d5be6d Always make lemmatizer return a list of lemmas, not a set 2017-10-24 16:00:56 +02:00
ines
95f866f99f Add lookup argument to Lemmatizer.load 2017-10-24 16:00:56 +02:00
ines
95f6174516 Remove tensorizer from model pipeline example in spacy package 2017-10-24 16:00:56 +02:00
ines
6686e53530 Allow GitHub embeds to specify optional language 2017-10-24 16:00:56 +02:00
ines
56a47f137f Add title description for tokenizer 2017-10-24 16:00:56 +02:00
ines
3944c1d6e7 Document lemmatizer 2017-10-24 16:00:56 +02:00
ines
c9dc88ddfc Document current JSON format for training 2017-10-24 16:00:56 +02:00
ines
2b8e7c45e0 Use better training data JSON example 2017-10-24 16:00:56 +02:00
ines
090aed940a Add test for currently failing span.as_doc case 2017-10-24 16:00:56 +02:00
ines
4ef81a9ebc Fix whitespace 2017-10-24 16:00:56 +02:00
Matthew Honnibal
18f1c1d0ba Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-24 14:29:43 +02:00
Matthew Honnibal
4bea65a1a8 Fix Issue #1450: Off-by-1 in * and ? matches
Patterns that end in variable-length operators e.g. * and ? now end on
the correct token. Previously, they were off by 1: the next token was
pulled into the match, even if that's where the pattern failed.
2017-10-24 14:26:27 +02:00
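The off-by-1 described in this commit message can be illustrated with a toy matcher. This is only a sketch under stated assumptions, not spaCy's Matcher: `match_end`, the `(word, op)` pattern format and the `"1"` operator label are all hypothetical names invented for this example. It shows the intended behavior, where a pattern ending in `?` or `*` stops at the last token that matched, and the token where the pattern fails is left out.

```python
def match_end(tokens, start, pattern):
    """Toy greedy matcher (hypothetical, not spaCy's implementation).

    pattern is a list of (word, op) pairs, where op is "1" (exactly one),
    "?" (zero or one) or "*" (zero or more). Returns the exclusive end
    index of the match beginning at `start`, or None if there is no match.
    """
    i = start
    for word, op in pattern:
        if op == "1":
            if i < len(tokens) and tokens[i] == word:
                i += 1
            else:
                return None
        elif op == "?":
            # Consume at most one token, and only if it really matches.
            if i < len(tokens) and tokens[i] == word:
                i += 1
        elif op == "*":
            # Consume as many matching tokens as possible.
            while i < len(tokens) and tokens[i] == word:
                i += 1
        else:
            raise ValueError("Unknown operator: %r" % op)
    return i


tokens = ["very", "very", "good", "film"]
pattern = [("very", "*"), ("good", "1"), ("great", "?")]
end = match_end(tokens, 0, pattern)
matched = tokens[0:end]  # "film" is where the pattern fails, so it is excluded
```

The toy matcher does no backtracking, so it only demonstrates where a match should end, not the full matching semantics.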
Matthew Honnibal
391d5ef0d1 Normalize imports in regression test 2017-10-24 14:25:49 +02:00
ines
c55db0a4a1 Add example sentences for Japanese and Chinese (see #1107) 2017-10-24 13:02:24 +02:00
ines
66f8f9d4a0 Fix Japanese tokenizer
JapaneseTokenizer now returns a Doc, not individual words
2017-10-24 13:02:19 +02:00
Matthew Honnibal
5ae0b8613a Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-24 12:41:07 +02:00
Matthew Honnibal
dd5b2d8fa3 Check for out-of-memory when calling calloc. Closes #1446 2017-10-24 12:40:47 +02:00
ines
9bf5751064 Pretty-print JSON 2017-10-24 12:22:17 +02:00
Matthew Honnibal
0f9d966317 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-24 12:10:58 +02:00
Matthew Honnibal
b66b8f028b Fix #1375 -- out-of-bounds on token.nbor() 2017-10-24 12:10:39 +02:00
Matthew Honnibal
a68d89a4f3 Add failing test for bug #1375 -- no out-of-bounds error for token.nbor() 2017-10-24 12:05:25 +02:00
ines
6675755005 Add training data JSON example 2017-10-24 12:05:10 +02:00
Matthew Honnibal
ccd2ab1a62 Merge pull request #1443 from ramananbalakrishnan/develop-get-lca-matrix
Add LCA matrix for spans and docs
2017-10-24 11:22:46 +02:00
Matthew Honnibal
ef3e5a361b Merge pull request #1442 from explosion/feature/fix-sp
💫Fix SP tag, tweak Vectors.__init__, fix Morphology
2017-10-24 10:24:07 +02:00
Matthew Honnibal
fdf25d10ba Merge pull request #1440 from ramananbalakrishnan/develop
Support single value for attribute list in doc.to_array
2017-10-24 10:23:12 +02:00
ines
7701984f13 Document Span.as_doc 2017-10-23 10:38:27 +02:00
ines
db15902e84 Tidy up 2017-10-23 10:38:21 +02:00
ines
3f0a157b33 Fix typo 2017-10-23 10:38:13 +02:00
ines
a31f048b4d Fix formatting 2017-10-23 10:38:06 +02:00
Ines Montani
0ed0c41bad Merge pull request #1448 from jerbob92/feature/fix-training-new-entity-type-example
Fix #1444: fix training new entity type example
2017-10-22 15:43:33 +02:00
Jeroen Bobbeldijk
84c6c20d1c Fix #1444: fix pipeline logic and wrong parameter in update call 2017-10-22 15:18:36 +02:00
Matthew Honnibal
490ad3eaf0 Check that empty strings are handled. Closes #1242 2017-10-21 00:52:14 +02:00
Matthew Honnibal
8f8bccecb9 Patch deserialisation for invalid loads, to avoid model failure 2017-10-21 00:51:42 +02:00
Ramanan Balakrishnan
d2fe56a577 Add LCA matrix for spans and docs 2017-10-20 23:58:00 +05:30
Matthew Honnibal
d8391b1c4d Fix #1434: Matcher failed on ending ? if no token 2017-10-20 16:49:36 +02:00
Matthew Honnibal
fec53f09f7 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-20 16:28:34 +02:00
Matthew Honnibal
f111b228e0 Fix re-parsing of previously parsed text
If a Doc object had been previously parsed, it was possible for
invalid parses to be added. There were two problems:

1) The parse was only being partially erased
2) The RightArc action was able to create a 1-cycle.

This patch fixes both errors, and avoids resetting the parse if one is
present. In theory this might allow a better parse to be predicted by
running the parser twice.

Closes #1253.
2017-10-20 16:27:36 +02:00
Matthew Honnibal
9010a1a060 Create vectors correctly 2017-10-20 14:19:46 +02:00
Matthew Honnibal
33229b1c9e Remove print statement 2017-10-20 14:19:29 +02:00
Matthew Honnibal
cfae54c507 Make change to Vectors.__init__ 2017-10-20 14:19:04 +02:00
Matthew Honnibal
ebecaddb76 Make 'data_or_width' two keyword args in Vectors.__init__
Previously the data and width options were one argument in Vectors,
which meant you couldn't say vectors = Vectors(strings, width=300).
It's better to have two keywords.
2017-10-20 14:17:15 +02:00
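The design point in this commit message, splitting one overloaded argument into two keywords, can be sketched as follows. This is a minimal illustration and not spaCy's `Vectors` class: `ToyVectors` and its attributes are hypothetical, and only the `strings`, `data` and `width` names come from the commit message.

```python
class ToyVectors:
    """Hypothetical stand-in that takes `data` and `width` as separate keywords."""

    def __init__(self, strings, data=None, width=None):
        if data is not None and width is not None:
            raise ValueError("Pass either data or width, not both")
        self.strings = list(strings)
        if data is None:
            # Only a width was given: allocate one zero row per string.
            self.width = width or 0
            self.data = [[0.0] * self.width for _ in self.strings]
        else:
            # Existing rows were given: infer the width from the first row.
            self.data = [list(row) for row in data]
            self.width = len(self.data[0]) if self.data else 0


# The call form the commit message says was not possible before:
empty = ToyVectors(["apple", "pear"], width=300)
# Passing existing rows instead of a width:
loaded = ToyVectors(["apple"], data=[[0.1, 0.2, 0.3]])
```

With a single combined argument the constructor would have to guess from the value's type whether it received rows or a size; separate keywords make the caller's intent explicit.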
Matthew Honnibal
49895fbef6 Rename 'SP' special tag to '_SP'
Renaming the tag with an underscore lets us add it to the tag map
without worrying that we'll change the sequence of tags, which throws
off the tag-to-ID mapping. For instance, if we inserted a 'SP' tag,
the "VERB" tag is pushed to a different class ID, and the model is all
messed up.
2017-10-20 14:01:12 +02:00
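The tag-to-ID problem described in this commit message can be reproduced with a small sketch. It assumes class IDs are assigned by position in a sorted tag list, which the commit message does not state; `tag_ids` and the three-tag inventory are hypothetical. Under that assumption, `'_'` sorts after the uppercase ASCII letters, so `_SP` lands at the end and leaves the existing IDs alone, while `SP` is inserted before `VERB`.

```python
def tag_ids(tags):
    """Assign class IDs by sorted position (an assumption for illustration)."""
    return {tag: i for i, tag in enumerate(sorted(tags))}


base = tag_ids(["ADJ", "NOUN", "VERB"])
with_sp = tag_ids(["ADJ", "NOUN", "VERB", "SP"])
with_underscore = tag_ids(["ADJ", "NOUN", "VERB", "_SP"])

# Adding "SP" pushes "VERB" to a different class ID ...
verb_moved = base["VERB"] != with_sp["VERB"]
# ... while adding "_SP" leaves every existing ID unchanged.
ids_stable = all(with_underscore[tag] == base[tag] for tag in base)
```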
Matthew Honnibal
506cf2eb13 Remove cpdef enum, to avoid too much code generation 2017-10-20 14:00:23 +02:00
Matthew Honnibal
6218af0105 Remove cpdef enum, to avoid too much code generation 2017-10-20 13:59:57 +02:00
Matthew Honnibal
92ac9316b5 Fix initialization of vectors, to address serialization problem 2017-10-20 13:59:24 +02:00
Ramanan Balakrishnan
0726946563 cleanup to_array implementation using fixes on master 2017-10-20 17:09:37 +05:30
ines
108f1f786e Update symbols and document missing token attributes (see #1439) 2017-10-20 13:08:44 +02:00