1933 Commits

Author SHA1 Message Date
Matthew Honnibal
014b6936ac Fix #608 -- __version__ should be available at the base of the package. 2016-11-04 21:21:02 +01:00
Matthew Honnibal
42b0736db7 Increment version 2016-11-04 20:04:21 +01:00
Matthew Honnibal
9f93386994 Update version 2016-11-04 19:28:16 +01:00
Matthew Honnibal
1fb09c3dc1 Fix morphology tagger 2016-11-04 19:19:09 +01:00
Matthew Honnibal
a36353df47 Temporarily put back the tokenize_from_strings method, while tests aren't updated yet. 2016-11-04 19:18:07 +01:00
Matthew Honnibal
f0917b6808 Fix Issue #376: and/or was tagged as a noun. 2016-11-04 15:21:28 +01:00
Matthew Honnibal
737816e86e Fix #368: Tokenizer handled pattern 'unicode close quote, period' incorrectly. 2016-11-04 15:16:20 +01:00
Matthew Honnibal
ab952b4756 Fix #578 -- Sputnik had been purging all files on --force, not just the relevant one. 2016-11-04 10:44:11 +01:00
Matthew Honnibal
6e37ba1d82 Fix #602, #603 --- Broken build 2016-11-04 09:54:24 +01:00
Matthew Honnibal
293c79c09a Fix #595: Lemmatization was incorrect for base forms, because morphological analyser wasn't adding morphology properly. 2016-11-04 00:29:07 +01:00
Matthew Honnibal
e30348b331 Prefer to import from symbols instead of parts_of_speech 2016-11-04 00:27:55 +01:00
Matthew Honnibal
4a8a2b6001 Test #595 -- Bug in lemmatization of base forms. 2016-11-04 00:27:32 +01:00
Matthew Honnibal
f1605df2ec Fix #588: Matcher should reject empty pattern. 2016-11-03 00:16:44 +01:00
Matthew Honnibal
72b9bd57ec Test Issue #588: Matcher accepts invalid, empty patterns. 2016-11-03 00:09:35 +01:00
Matthew Honnibal
41a90a7fbb Add tokenizer exception for 'Ph.D.', to fix #592. 2016-11-03 00:03:34 +01:00
Matthew Honnibal
532318e80b Import Jieba inside zh.make_doc 2016-11-02 23:49:19 +01:00
Matthew Honnibal
f292f7f0e6 Fix Issue #599, by considering empty documents to be parsed and tagged. Implementation is a bit dodgy. 2016-11-02 23:48:43 +01:00
Matthew Honnibal
b6b01d4680 Remove deprecated tokens_from_list test. 2016-11-02 23:47:21 +01:00
Matthew Honnibal
3d6c79e595 Test Issue #599: .is_tagged and .is_parsed attributes not reflected after deserialization for empty documents. 2016-11-02 23:40:11 +01:00
Matthew Honnibal
05a8b752a2 Fix Issue #600: Missing setters for Token attribute. 2016-11-02 23:28:59 +01:00
Matthew Honnibal
125c910a8d Test Issue #600 2016-11-02 23:24:13 +01:00
Matthew Honnibal
e0c9695615 Fix doc strings for tokenizer 2016-11-02 23:15:39 +01:00
Matthew Honnibal
80824f6d29 Fix test 2016-11-02 20:48:40 +01:00
Matthew Honnibal
dbe47902bc Add import fr 2016-11-02 20:48:29 +01:00
Matthew Honnibal
8f24dc1982 Fix infixes in Italian 2016-11-02 20:43:52 +01:00
Matthew Honnibal
41a4766c1c Fix infixes in Spanish and Portuguese 2016-11-02 20:43:12 +01:00
Matthew Honnibal
3d4bd96e8a Fix infixes in French 2016-11-02 20:41:43 +01:00
Matthew Honnibal
c09a8ce5bb Add test for French tokenizer 2016-11-02 20:40:31 +01:00
Matthew Honnibal
b012ae3044 Add test for loading languages 2016-11-02 20:38:48 +01:00
Matthew Honnibal
ad1c747c6b Fix stray POS in language stubs 2016-11-02 20:37:55 +01:00
Matthew Honnibal
e9e6fce576 Handle null prefix/suffix/infix search in tokenizer 2016-11-02 20:35:48 +01:00
Matthew Honnibal
22647c2423 Check that patterns aren't null before compiling regex for tokenizer 2016-11-02 20:35:29 +01:00
Matthew Honnibal
5ac735df33 Link languages in __init__.py 2016-11-02 20:05:14 +01:00
Matthew Honnibal
c68dfe2965 Stub out support for Italian 2016-11-02 20:03:24 +01:00
Matthew Honnibal
6dbf4f7ad7 Stub out support for French, Spanish, Italian and Portuguese 2016-11-02 20:02:41 +01:00
Matthew Honnibal
6b8b05ef83 Specify that spacy.util is encoded in utf8 2016-11-02 19:58:00 +01:00
Matthew Honnibal
5363224395 Add draft Jieba tokenizer for Chinese 2016-11-02 19:57:38 +01:00
Matthew Honnibal
f7fee6c24b Check for class-defined make_docs method before assigning one provided as an argument 2016-11-02 19:57:13 +01:00
Matthew Honnibal
19c1e83d3d Work on draft Italian tokenizer 2016-11-02 19:56:32 +01:00
Matthew Honnibal
9efe568177 Add missing unicode_literals to spacy.util. I think this was messing up the tokenizer regex for non-ascii characters in Python 2. Re Issue #596 2016-11-02 12:31:34 +01:00
Matthew Honnibal
d8db648ebf Add __init__.py file for regression tests 2016-11-01 13:45:06 +01:00
Matthew Honnibal
11664b9f20 Fix variable error in token 2016-11-01 13:28:00 +01:00
Matthew Honnibal
8c4d1b46ce Fix variable error in Span 2016-11-01 13:27:44 +01:00
Matthew Honnibal
e7af6b937f Fix syntax error while fixing doc strings 2016-11-01 13:27:32 +01:00
Matthew Honnibal
62fc6b1afa Use 32 bit hashes for OOV, re Issue #589, Issue #285 2016-11-01 13:27:13 +01:00
Matthew Honnibal
6977a2b8cd Add test for Issue #589 2016-11-01 12:33:36 +01:00
Matthew Honnibal
b86f8af0c1 Fix doc strings 2016-11-01 12:25:36 +01:00
Matthew Honnibal
d563f1eadb Fix Issue #587: Segfault in Matcher, due to simple error in the state machine. 2016-10-28 17:42:00 +02:00
Matthew Honnibal
7e5f63a595 Improve test slightly 2016-10-28 17:41:16 +02:00
Matthew Honnibal
782e4814f4 Test Issue #587: Matcher segfaults on particular input 2016-10-28 16:38:32 +02:00
Matthew Honnibal
708ea22208 Infer types in transition_system.pyx 2016-10-27 18:08:13 +02:00
Matthew Honnibal
18590eba94 Fix training evaluate method 2016-10-27 18:02:19 +02:00
Matthew Honnibal
301f3cc898 Fix Issue #429. Add an initialize_state method to the named entity recogniser that adds missing entity types. This is a messy place to add this, because it's strange to have the method mutate state. A better home for this logic could be found. 2016-10-27 18:01:55 +02:00
Matthew Honnibal
afea6505f3 Test Issue #429: No valid actions for NER after matcher adds a new entity label. 2016-10-27 18:01:34 +02:00
Matthew Honnibal
03a520ec4f Change signature of Parser.parseC, so that nr_class is read from the transition system. This allows the transition system to modify the number of actions in initialize_state. 2016-10-27 17:58:56 +02:00
Matthew Honnibal
6c47048912 Fix test, after IOB tweak. 2016-10-26 17:22:03 +02:00
Matthew Honnibal
4ca31b4d87 Fix clobbering of 'missing' named ent values after assigning ents. 2016-10-26 13:13:56 +02:00
Matthew Honnibal
cb49189477 Remove dead code 2016-10-26 13:11:07 +02:00
Matthew Honnibal
a209b10579 Improve error message when oracle fails for non-projective trees, re Issue #571. 2016-10-24 20:31:30 +02:00
Matthew Honnibal
b2d43b93d2 Fix Python 3 basestring error 2016-10-24 14:22:51 +02:00
Matthew Honnibal
276478fe0f Update strings.pxd 2016-10-24 14:00:35 +02:00
Matthew Honnibal
d8134817ff Workaround Issue #285: Allow the StringStore to be 'frozen', in which case strings will be pushed into an OOV map. We can then flush this OOV map, freeing all of the OOV strings. 2016-10-24 13:49:03 +02:00
Matthew Honnibal
d3a617aa99 Test workaround for Issue #285: Streaming data memory growth 2016-10-24 13:48:06 +02:00
Matthew Honnibal
64e5f02cf7 Update test 2016-10-23 21:08:07 +02:00
Matthew Honnibal
66d7a6eca2 Update test 2016-10-23 21:02:05 +02:00
Matthew Honnibal
90bf797125 Update test 2016-10-23 20:54:17 +02:00
Matthew Honnibal
5e76320ffe Update test 2016-10-23 20:44:54 +02:00
Matthew Honnibal
aa105927f3 Update test 2016-10-23 20:31:25 +02:00
Matthew Honnibal
6b9237aa83 Increment version 2016-10-23 20:22:53 +02:00
Matthew Honnibal
150e02d72e Fix Issue #566 2016-10-23 20:19:01 +02:00
Matthew Honnibal
e120561294 Fix vector_norm test. 2016-10-23 19:56:16 +02:00
Matthew Honnibal
fefde8aef8 Make installation print data path. 2016-10-23 19:46:44 +02:00
Matthew Honnibal
e7414cd064 Try to fix weird install glitch. 2016-10-23 19:46:28 +02:00
Matthew Honnibal
90f7544edd Increment version 2016-10-23 19:43:06 +02:00
Matthew Honnibal
6036ec7c77 Fix vector norm when loading lexemes. 2016-10-23 19:40:18 +02:00
Matthew Honnibal
c05cd2356e Fix similarity test for Python 3 2016-10-23 18:16:56 +02:00
Matthew Honnibal
3e688e6d4b Fix issue #514 -- serializer fails when new entity type has been added. The fix here is quite ugly. It's best to add the entities ASAP after loading the NLP pipeline, to mitigate the brittleness. 2016-10-23 17:45:44 +02:00
Matthew Honnibal
79aa03fe98 Test Issue #514: Serializer fails when new entity type has been added. 2016-10-23 17:41:44 +02:00
Matthew Honnibal
f97548c6f1 Fix broken test, re Issue #461 2016-10-23 17:02:23 +02:00
Matthew Honnibal
4de30a8e38 Test Issue #514: Serialization fails after adding a new entity label. 2016-10-23 16:40:27 +02:00
Matthew Honnibal
936e6246aa Fix Issue #459 -- failed to deserialize empty doc. 2016-10-23 16:31:05 +02:00
Matthew Honnibal
e99b3f5322 Test Issue #459: Fail to deserialize empty doc 2016-10-23 16:30:22 +02:00
Matthew Honnibal
49c117960c Fix bug where huffman codec died if given empty freqs dict. 2016-10-23 16:28:05 +02:00
Matthew Honnibal
99ff8b902f Test that huffman codec works with empty freqs dict 2016-10-23 16:27:45 +02:00
Matthew Honnibal
15c9b59f0e Fix Issue #461: O tag was being clobbered by doc.ents.__set__ 2016-10-23 15:50:26 +02:00
Matthew Honnibal
e5627134d9 Test Issue #461: ent_iob tag incorrect after setting entities. 2016-10-23 15:50:04 +02:00
Matthew Honnibal
f62088d646 Fix compile error 2016-10-23 14:50:50 +02:00
Matthew Honnibal
2c3a67b693 Fix calculation of vector norm, re Issue #522. Need to consolidate the calculations into a helper function. 2016-10-23 14:49:31 +02:00
Matthew Honnibal
a0a4ada42a Fix calculation of L2-norm for Lexeme 2016-10-23 14:44:45 +02:00
Matthew Honnibal
2989072aac Add tests to verify that Issue #442 is fixed in 1.1 2016-10-23 14:33:13 +02:00
Matthew Honnibal
739213a8af Fix create_pipeline keyword argument. 2016-10-23 14:24:16 +02:00
Matthew Honnibal
bea44bd3c4 Fix vector_norm when vector is assigned to Lexeme. 2016-10-23 14:23:56 +02:00
Matthew Honnibal
e838b6d53f Add tests for using the new Entity ID tracking in the rule matcher 2016-10-23 14:04:01 +02:00
Matthew Honnibal
e7af75e0a9 Add test for vector resizing, re Issue #544 2016-10-21 17:07:21 +02:00
Matthew Honnibal
ca8ea33abc Bump version to 1.1.0 2016-10-21 16:30:57 +02:00
Matthew Honnibal
7ab03050d4 Add resize_vectors method to Vocab 2016-10-21 01:44:50 +02:00
Matthew Honnibal
8ce8803824 Fix JSON in tokenizer 2016-10-21 01:44:20 +02:00
Matthew Honnibal
6eb73a095f Fix JSON in tagger 2016-10-21 01:44:10 +02:00
Matthew Honnibal
e16e78a737 Merge branch 'master' of ssh://github.com/explosion/spaCy 2016-10-21 00:00:15 +02:00
Matthew Honnibal
147373c807 Increment version 2016-10-21 00:00:03 +02:00