Commit Graph

4160 Commits

Author SHA1 Message Date
Matthew Honnibal
f111b228e0 Fix re-parsing of previously parsed text
If a Doc object had been previously parsed, it was possible for
invalid parses to be added. There were two problems:

1) The parse was only being partially erased
2) The RightArc action was able to create a 1-cycle.

This patch fixes both errors, and avoids resetting the parse if one is
present. In theory this might allow a better parse to be predicted by
running the parser twice.

Closes #1253.
2017-10-20 16:27:36 +02:00
Matthew Honnibal
9010a1a060 Create vectors correctly 2017-10-20 14:19:46 +02:00
Matthew Honnibal
33229b1c9e Remove print statement 2017-10-20 14:19:29 +02:00
Matthew Honnibal
cfae54c507 Make change to Vectors.__init__ 2017-10-20 14:19:04 +02:00
Matthew Honnibal
ebecaddb76 Make 'data_or_width' two keyword args in Vectors.__init__
Previously the data and width options were one argument in Vectors,
which meant you couldn't say vectors = Vectors(strings, width=300).
It's better to have two keywords.
2017-10-20 14:17:15 +02:00
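A minimal sketch of the interface change described in this commit, assuming a stand-in class. `VectorsSketch` is hypothetical and is not spaCy's `Vectors` implementation; only the idea of separate `data` and `width` keywords comes from the commit message.

```python
class VectorsSketch:
    """Hypothetical stand-in used only to illustrate the two-keyword signature."""

    def __init__(self, strings, data=None, width=0):
        self.strings = strings
        if data is not None:
            # Explicit data wins; the width is inferred from the first row.
            self.data = [list(row) for row in data]
            self.width = len(self.data[0]) if self.data else width
        else:
            # No data yet: remember the requested width for later allocation.
            self.data = []
            self.width = width


# With separate keywords, a width-only construction reads naturally.
empty = VectorsSketch(["apple"], width=300)
filled = VectorsSketch(["apple"], data=[[0.1, 0.2, 0.3]])
```

Keeping the two options separate avoids having to inspect the type of a single overloaded argument to decide what the caller meant.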
Matthew Honnibal
49895fbef6 Rename 'SP' special tag to '_SP'
Renaming the tag with an underscore lets us add it to the tag map
without worrying that we'll change the sequence of tags, which throws
off the tag-to-ID mapping. For instance, if we inserted a 'SP' tag,
the "VERB" tag is pushed to a different class ID, and the model is all
messed up.
2017-10-20 14:01:12 +02:00
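The effect described above can be sketched as follows, assuming class IDs are assigned by position in the sorted tag list. That assignment rule and the `tag_ids` helper are assumptions made for illustration, not spaCy's actual code.

```python
def tag_ids(tags):
    """Hypothetical ID assignment: position in the sorted list of tags."""
    return {tag: i for i, tag in enumerate(sorted(tags))}


base = ["ADJ", "NOUN", "VERB"]

before = tag_ids(base)
with_sp = tag_ids(base + ["SP"])            # 'SP' sorts between NOUN and VERB
with_underscore = tag_ids(base + ["_SP"])   # '_' sorts after uppercase letters

print(before["VERB"], with_sp["VERB"], with_underscore["VERB"])  # prints: 2 3 2
```

Because the underscore sorts after the uppercase ASCII letters, `_SP` lands at the end of the sequence and the existing uppercase tags keep their IDs.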
Matthew Honnibal
506cf2eb13 Remove cpdef enum, to avoid too much code generation 2017-10-20 14:00:23 +02:00
Matthew Honnibal
6218af0105 Remove cpdef enum, to avoid too much code generation 2017-10-20 13:59:57 +02:00
Matthew Honnibal
92ac9316b5 Fix initialization of vectors, to address serialization problem 2017-10-20 13:59:24 +02:00
Ramanan Balakrishnan
0726946563 cleanup to_array implementation using fixes on master 2017-10-20 17:09:37 +05:30
ines
108f1f786e Update symbols and document missing token attributes (see #1439) 2017-10-20 13:08:44 +02:00
ines
4acab77a8a Add missing symbol for LAW entities (resolves #1427) 2017-10-20 13:07:57 +02:00
Ramanan Balakrishnan
b3ab124fc5 Support strings for attribute list in doc.to_array 2017-10-20 11:46:57 +05:30
Ramanan Balakrishnan
7b9b1be44c Support single value for attribute list in doc.to_array 2017-10-19 17:00:41 +05:30
Matthew Honnibal
61bc203f3f Merge pull request #1438 from explosion/feature/fast-parser
💫 Improve runtime CPU efficiency of parser/NER
2017-10-19 02:42:21 +02:00
Matthew Honnibal
15e5a04a8d Clean up more depth=0 conditional code 2017-10-19 01:48:43 +02:00
Matthew Honnibal
906c50ac59 Fix loop typing, that caused error on windows 2017-10-19 01:48:39 +02:00
ines
24512420b1 Show error if data_path does not exist or is None (see #1102) 2017-10-19 00:53:49 +02:00
ines
bf415fd778 Add test for serializing extension attrs (see #1085) 2017-10-19 00:53:08 +02:00
Matthew Honnibal
960788aaa2 Eliminate dead code in parser, and raise errors for obsolete options 2017-10-19 00:42:34 +02:00
Matthew Honnibal
bbfd7d8d5d Clean up parser multi-threading 2017-10-19 00:25:21 +02:00
Matthew Honnibal
f018f2030c Try optimized parser forward loop 2017-10-18 21:48:00 +02:00
Matthew Honnibal
65bf5e85bd Improve piping in language.pipe 2017-10-18 21:46:12 +02:00
Matthew Honnibal
633a75c7e0 Break parser batches into sub-batches, sorted by length. 2017-10-18 21:45:01 +02:00
Ines Montani
f0d577e460 Merge pull request #1425 from explosion/feature/hindi-tokenizer
💫 Basic Hindi tokenization support
2017-10-18 13:34:52 +02:00
Matthew Honnibal
394633efce Make doc pickling support hooks 2017-10-17 19:44:09 +02:00
Matthew Honnibal
fe844148f6 Test pickling hooks 2017-10-17 19:43:52 +02:00
Matthew Honnibal
cdb0c426d8 Improve deserialization of user_data, esp. for Underscore 2017-10-17 19:29:20 +02:00
Matthew Honnibal
374819edf8 Test user_data deserialization, re #1085 2017-10-17 19:28:54 +02:00
Matthew Honnibal
e35a83d142 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-17 18:22:06 +02:00
Matthew Honnibal
f45973848c Rename 'tokens' variable 'doc' in tokenizer 2017-10-17 18:21:41 +02:00
Matthew Honnibal
839de87ca9 Make lambda func a named function, for pickling 2017-10-17 18:21:20 +02:00
Matthew Honnibal
9baa8fe7ec Convert closure to functools.partial, to promote pickling 2017-10-17 18:20:52 +02:00
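A short illustration of why replacing a closure with `functools.partial` helps pickling, using only the standard library. The `make_scaler_closure` and `try_pickle` helpers are hypothetical and unrelated to the code changed in this commit.

```python
import functools
import operator
import pickle


def try_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


def make_scaler_closure(factor):
    # The inner function is a local object, so pickle cannot look it up by name.
    def scale(x):
        return operator.mul(factor, x)
    return scale


closure = make_scaler_closure(3)
partial = functools.partial(operator.mul, 3)

closure_ok = try_pickle(closure)   # False: local functions are not picklable
partial_ok = try_pickle(partial)   # True: partial stores an importable function plus args
restored = pickle.loads(pickle.dumps(partial))
```

The same reasoning applies to lambdas: pickle serializes functions by reference to an importable name, which anonymous and nested functions do not have.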
Matthew Honnibal
32a8564c79 Fix doc pickling 2017-10-17 18:20:24 +02:00
Matthew Honnibal
8ca97f32a3 Fix doc pickling test 2017-10-17 18:19:57 +02:00
Matthew Honnibal
9ce7d6af87 Make lex attr functions top-level functions, to promote pickling 2017-10-17 18:19:18 +02:00
Matthew Honnibal
1cc85a89ef Allow reasonably efficient pickling of Language class, using to_bytes() and from_bytes(). 2017-10-17 18:18:49 +02:00
Matthew Honnibal
0d57b9748a Serialize lex_attr_getters with dill, for better pickle support 2017-10-17 18:17:45 +02:00
Matthew Honnibal
45d1dd90b1 Add tests for pickling doc 2017-10-17 17:20:58 +02:00
Ines Montani
afa67de7ee Merge pull request #1428 from roanuz/develop
Fix trailing whitespace and Language.from_disk overwrites
2017-10-17 16:29:15 +02:00
Matthew Honnibal
92c1eb2d6f Fix Doc pickling. This also removes need for Binder class 2017-10-17 16:11:13 +02:00
Matthew Honnibal
ed8da9b11f Add missing return statement in SentenceSegmenter 2017-10-17 15:32:56 +02:00
Ines Montani
aab299c8ae Merge pull request #1429 from vishnunekkanti/develop
fix syntax error in zh
2017-10-17 14:45:02 +02:00
Anto Binish Kaspar
534240648e Fix trailing whitespace on morphology features 2017-10-17 17:15:58 +05:30
Anto Binish Kaspar
8f5b60c168 Fix Language.from_disk overwrites the meta.json file. 2017-10-17 17:15:32 +05:30
ines
8ca344712d Add Language.has_pipe method 2017-10-17 11:20:07 +02:00
ines
485c4f6df5 Add Hungarian examples (see #1107) 2017-10-17 02:37:45 +02:00
Matthew Honnibal
19531bad4c Merge branch 'develop' into feature/streaming-data-memory-growth 2017-10-16 21:44:11 +02:00
Matthew Honnibal
df488274b1 Fix deserialization of vectors 2017-10-16 20:55:00 +02:00
Matthew Honnibal
4018486d31 Merge remote-tracking branch 'origin/develop' into feature/streaming-data-memory-growth 2017-10-16 20:49:48 +02:00
Matthew Honnibal
4174477161 Fix equality check in test 2017-10-16 19:50:35 +02:00
Matthew Honnibal
2bc06e4b22 Bump rolling buffer size to 10k 2017-10-16 19:38:29 +02:00
Matthew Honnibal
66e2eb8f39 Clean up remnant of frozen in StringStore 2017-10-16 19:34:41 +02:00
Matthew Honnibal
a002264fec Remove caching of Token in Doc, as it caused a cycle. 2017-10-16 19:34:21 +02:00
Matthew Honnibal
3e037054c8 Remove obsolete is_frozen functionality from StringStore 2017-10-16 19:23:10 +02:00
Matthew Honnibal
5c14f3f033 Create a rolling buffer for the StringStore in Language.pipe() 2017-10-16 19:22:40 +02:00
Matthew Honnibal
59c216196c Allow weakrefs on Doc objects 2017-10-16 19:22:11 +02:00
ines
d5418553eb Fix whitespace 2017-10-16 18:30:04 +02:00
ines
6ceadcdb5c Make sure from_disk passes string to numpy (see #1421)
If path is a WindowsPath, numpy does not recognise it as a path and as
a result, doesn't open the file.
https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L369
2017-10-16 18:29:56 +02:00
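The fix described above amounts to converting path objects to plain strings before handing them to a loader. A minimal sketch, assuming a hypothetical `path_to_str` helper; the real change lives in spaCy's `from_disk` code, which is not reproduced here.

```python
from pathlib import PurePath, PureWindowsPath


def path_to_str(path):
    """Hypothetical helper: turn pathlib paths into strings, pass others through."""
    if isinstance(path, PurePath):
        return str(path)
    return path


# PureWindowsPath can be constructed on any platform, which makes the
# conversion easy to check without a Windows machine.
windows_path = PureWindowsPath("C:/models/vectors.npy")
as_string = path_to_str(windows_path)
```

The resulting string could then be passed to a loader such as `numpy.load`, which is not called in this sketch.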
Matthew Honnibal
010a7309ff Merge pull request #1402 from explosion/feature/fix-matcher-operators
💫 Fix Matcher variable-length operators
2017-10-16 17:53:19 +02:00
Matthew Honnibal
c29927d2e7 Fix matcher test 2017-10-16 17:22:18 +02:00
Vishnu Kumar Nekkanti
d3c54cf39a fixed SyntaxError while checking for jieba 2017-10-16 18:51:33 +05:30
Matthew Honnibal
a928ae2f35 Merge branch 'develop' into feature/fix-matcher-operators 2017-10-16 13:38:36 +02:00
Matthew Honnibal
56aa42cc5d Fix and document matcher operator 'shadowing' behaviour 2017-10-16 13:38:20 +02:00
Matthew Honnibal
748d525801 Add more matcher operator tests 2017-10-16 13:38:01 +02:00
Matthew Honnibal
0433181658 Document operator semantics in Matcher docstring 2017-10-16 12:06:33 +02:00
ines
266e7180a7 Add Language class, stop words and basic stemmer that sets NORM 2017-10-14 14:59:52 +02:00
ines
e85e1d571b Update base punctuation 2017-10-14 14:59:23 +02:00
ines
9d6c8eaa49 Update base norm exceptions with more unicode characters
e.g. unicode variations of punctuation used in Chinese
2017-10-14 14:58:52 +02:00
ines
3516aa0cea Port over changes from #1389 2017-10-14 13:32:55 +02:00
ines
cd6a29dce7 Port over changes from #1294 2017-10-14 13:28:46 +02:00
ines
38c756fd85 Port over changes from #1287 2017-10-14 13:16:21 +02:00
ines
612224c10d Port over changes from #1157 2017-10-14 13:11:39 +02:00
ines
9b3f8f9ec3 Fix formatting and add comment on languages 2017-10-14 13:11:18 +02:00
ines
a4d974d97b Port over URL pattern changes from #1411 2017-10-14 12:58:07 +02:00
ines
09aed58140 Port over changes from #1333 and add comments 2017-10-14 12:52:59 +02:00
Matthew Honnibal
cf6da9301a Update lemmatizer test 2017-10-12 22:50:52 +02:00
Matthew Honnibal
9b90d235d1 Fix tag check in lemmatizer 2017-10-12 22:50:43 +02:00
Matthew Honnibal
dc01acd821 Escape encoding in validate function 2017-10-12 22:23:21 +02:00
Matthew Honnibal
27b927259a Add locale_escape compat function 2017-10-12 22:22:04 +02:00
ines
9c6de3dcfa Merge branch 'develop' into feature/cli-validate 2017-10-12 21:44:28 +02:00
Matthew Honnibal
462caf835a Fix SBD test 2017-10-12 21:18:22 +02:00
ines
fff1028391 Add validate CLI command 2017-10-12 20:05:06 +02:00
Matthew Honnibal
908f44c3fe Disable history features by default 2017-10-12 14:56:11 +02:00
Matthew Honnibal
a955843684 Increase default number of epochs 2017-10-12 13:13:01 +02:00
Matthew Honnibal
cecfcc7711 Set default hyper params back to 'slow' settings 2017-10-12 13:12:26 +02:00
Ines Montani
37aa523a8e Merge pull request #1408 from explosion/feature/dot-underscore
💫 Custom attributes via Doc._, Token._ and Span._
2017-10-11 18:35:56 +02:00
ines
8ce6f96180 Don't make copies of language data components 2017-10-11 15:34:55 +02:00
ines
51519251c2 Fix underscore method test 2017-10-11 13:34:19 +02:00
ines
c6ae49e8bf Fix formatting 2017-10-11 13:34:11 +02:00
ines
453c47ca24 Add German lemmatizer tests 2017-10-11 13:27:26 +02:00
ines
15fe0fd82d Fix tests 2017-10-11 13:27:18 +02:00
ines
6dd14dc342 Add lookup lemmas to tokens without POS tags 2017-10-11 13:27:10 +02:00
ines
9620c1a640 Add lemma_lookup to Language defaults 2017-10-11 13:26:05 +02:00
ines
9fd471372a Add lookup lemmatizer to lemmatizer as lookup() method 2017-10-11 13:25:51 +02:00
ines
e0ff145a8b Merge branch 'develop' into feature/dot-underscore 2017-10-11 11:57:05 +02:00
ines
c1d6d43c83 Merge branch 'develop' into feature/lemmatizer 2017-10-11 11:56:35 +02:00
Matthew Honnibal
17c467e0ab Avoid clobbering existing lemmas 2017-10-11 03:33:06 -05:00
Matthew Honnibal
807e109f2b Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-10-11 02:47:59 -05:00
Matthew Honnibal
6e552c9d83 Prune number of non-projective labels more aggressively 2017-10-11 02:46:44 -05:00