Commit Graph

2450 Commits

Author SHA1 Message Date
Matthew Honnibal
3980f1b0cb Ignore more morphology attributes in deprecated mode of intify_attrs 2016-12-18 17:33:46 +01:00
Matthew Honnibal
7a98ee5e5a Merge language data change 2016-12-18 17:03:52 +01:00
Matthew Honnibal
e4c951c153 Merge branch 'organize-language-data' of ssh://github.com/explosion/spaCy into organize-language-data 2016-12-18 17:01:08 +01:00
Ines Montani
b99d683a93 Fix formatting 2016-12-18 16:58:28 +01:00
Ines Montani
b11d8cd3db Merge remote-tracking branch 'origin/organize-language-data' into organize-language-data 2016-12-18 16:57:12 +01:00
Ines Montani
d1c1d3f9cd Fix tokenizer test 2016-12-18 16:55:32 +01:00
Ines Montani
753068f1d5 Use base language data as default 2016-12-18 16:55:25 +01:00
Ines Montani
bcc1d50d09 Remove trailing whitespace 2016-12-18 16:54:52 +01:00
Ines Montani
4e95737c6c Add base tag map 2016-12-18 16:54:28 +01:00
Ines Montani
2b2ea8ca11 Reorganise language data 2016-12-18 16:54:19 +01:00
Matthew Honnibal
1b31c05bf8 Whitespace 2016-12-18 16:51:40 +01:00
Matthew Honnibal
bdcecb3c96 Add import in regression test 2016-12-18 16:51:31 +01:00
Matthew Honnibal
6ee1df93c5 Set tag_map to None if it's not seen in the data by vocab 2016-12-18 16:51:10 +01:00
Matthew Honnibal
33996e770b Update header for morphology class 2016-12-18 16:50:42 +01:00
Matthew Honnibal
d58187ffa7 Filter out morphology keys in deprecated attrs 2016-12-18 16:50:26 +01:00
Matthew Honnibal
837a5d4100 Update morphology class so that exceptions can be added one-by-one, and so that arbitrary attributes can be referenced. 2016-12-18 16:49:46 +01:00
Matthew Honnibal
44f4f008bd Wire up lemmatizer rules for English 2016-12-18 15:50:09 +01:00
Matthew Honnibal
e6fc4afb04 Whitespace 2016-12-18 15:48:00 +01:00
Ines Montani
32b36c3882 Break language data components into their own files 2016-12-18 15:40:22 +01:00
Ines Montani
1bff59a8db Update English language data 2016-12-18 15:36:53 +01:00
Ines Montani
2eb163c5dd Add lemma rules 2016-12-18 15:36:53 +01:00
Ines Montani
29ad8143d8 Add morph rules 2016-12-18 15:36:53 +01:00
Ines Montani
bc40dad7d9 Add entity rules 2016-12-18 15:36:53 +01:00
Ines Montani
eaa3b1319d Fix formatting 2016-12-18 15:36:53 +01:00
Ines Montani
704c7442e0 Break language data components into their own files 2016-12-18 15:36:53 +01:00
Ines Montani
62655fd36f Add ENT_ID constant 2016-12-18 15:36:53 +01:00
Matthew Honnibal
fa272fdf12 Merge branch 'organize-language-data' of ssh://github.com/explosion/spaCy into organize-language-data 2016-12-18 15:00:21 +01:00
Matthew Honnibal
57c4341453 Refactor loading of morphology exceptions, adding a method add_special_case. 2016-12-18 14:59:44 +01:00
Ines Montani
77cf2fb0f6 Remove unnecessary argument in test 2016-12-18 14:06:27 +01:00
Ines Montani
121c310566 Remove trailing whitespace 2016-12-18 14:06:27 +01:00
Ines Montani
0fc4e45cb3 Fix tag map for German 2016-12-18 13:30:03 +01:00
Ines Montani
28326649f3 Fix typo 2016-12-18 13:30:03 +01:00
Matthew Honnibal
0595cc0635 Change test595 to mock data, instead of requiring model. 2016-12-18 13:28:51 +01:00
Matthew Honnibal
a4eb5c2bff Check POS key in lemmatizer, to update it for new data format 2016-12-18 13:28:20 +01:00
Matthew Honnibal
28d63ec58e Restore missing '' character in tokenizer exceptions. 2016-12-18 05:34:51 +01:00
Ines Montani
a9421652c9 Remove duplicates in tag map 2016-12-17 22:44:31 +01:00
Ines Montani
69baf1c9a8 Fix tag map 2016-12-17 22:44:22 +01:00
Ines Montani
577adad945 Fix formatting 2016-12-17 14:00:52 +01:00
Ines Montani
fc4ad17136 Fix typo 2016-12-17 14:00:47 +01:00
Ines Montani
bb94e784dc Fix typo 2016-12-17 13:59:30 +01:00
Ines Montani
afda532595 Use symbols in tag map 2016-12-17 13:56:24 +01:00
Ines Montani
07249145c9 Fix formatting 2016-12-17 13:34:46 +01:00
Ines Montani
dd55d085b6 Reformat dutch language data to match new style 2016-12-17 13:26:01 +01:00
Ines Montani
f2c48ef504 Resolve stopwords conflict to merge Dutch 2016-12-17 13:08:16 +01:00
Matthew Honnibal
ff03ade08f Merge pull request #688 from nlesc-sherlock/dutch
Support for Dutch in SpaCy
2016-12-17 22:44:58 +11:00
Ines Montani
a22322187f Add missing lemmas to tokenizer exceptions (fixes #674) 2016-12-17 12:42:41 +01:00
Ines Montani
5445074cbd Expand tokenizer exceptions with unicode apostrophe (fixes #685) 2016-12-17 12:34:08 +01:00
Ines Montani
e0a7b5c612 Fix formatting 2016-12-17 12:33:09 +01:00
Ines Montani
08162dce67 Move shared functions and constants to global language data 2016-12-17 12:32:48 +01:00
Ines Montani
6a60a61086 Move update_exc to global language data utils 2016-12-17 12:29:02 +01:00
Ines Montani
f324311249 Add global language data utils 2016-12-17 12:27:41 +01:00
Ines Montani
487ce1e20a Add encoding declaration 2016-12-17 12:25:44 +01:00
Ines Montani
d8d50a0334 Add tokenizer exception for "gonna" (fixes #691) 2016-12-17 11:59:28 +01:00
Ines Montani
c69b77d8aa Revert "Add exception for "gonna""
This reverts commit 280c03f67b.
2016-12-17 11:56:44 +01:00
Ines Montani
280c03f67b Add exception for "gonna" 2016-12-17 11:54:59 +01:00
Ines Montani
5031a015e2 Fix typo in stopwords (fixes #689) 2016-12-15 17:57:06 +01:00
Janneke van der Zwaan
4a3fdcce8a Merge github.com:explosion/spaCy into dutch 2016-12-13 09:25:23 +01:00
Matthew Honnibal
5965d3c2a7 Revert "Add acl to symbols.pyx" 2016-12-12 10:10:28 +11:00
Matthew Honnibal
6dee76dfed Update symbols.pxd 2016-12-12 10:09:58 +11:00
Pokey Rule
18a15c0777 Add acl to symbols.pyx 2016-12-11 20:00:07 +00:00
Gyorgy Orosz
0cf2144d24 Adding partial hyphen and quote handling support. 2016-12-11 00:14:36 +01:00
Gyorgy Orosz
2051726fd3 Passing Hungatian abbrev tests. 2016-12-10 23:37:58 +01:00
Ines Montani
63024466a9 Add Portuguese stopwords 2016-12-08 20:45:07 +01:00
Ines Montani
7bfe2d4abc Update Portuguese language data 2016-12-08 20:41:41 +01:00
Ines Montani
c0c5f31950 Remove unused data and download script 2016-12-08 20:39:49 +01:00
Ines Montani
0a6d529104 Remove unused data 2016-12-08 20:36:56 +01:00
Ines Montani
1b3b043660 Add French stopwords 2016-12-08 20:12:43 +01:00
Ines Montani
8863e504eb Update French language data 2016-12-08 20:07:14 +01:00
Ines Montani
7cb9f51be6 Add Italian stopwords 2016-12-08 20:05:25 +01:00
Ines Montani
470a0e0bea Update Italian language data 2016-12-08 19:52:18 +01:00
Ines Montani
1a284d342e Add Spanish language data 2016-12-08 19:47:03 +01:00
Ines Montani
0c39654786 Remove unused import 2016-12-08 19:46:53 +01:00
Ines Montani
e47ee94761 Split punctuation into its own file 2016-12-08 19:46:43 +01:00
Ines Montani
70b51ed7c8 Remove time from German language data 2016-12-08 19:45:50 +01:00
Ines Montani
e8ae588be9 Add emoticons 2016-12-08 19:45:18 +01:00
Ines Montani
5908c0ed9f Fix formatting 2016-12-08 19:45:11 +01:00
Ines Montani
311b30ab35 Reorganize exceptions for English and German 2016-12-08 13:58:32 +01:00
Ines Montani
66c7348cda Add update_exc util function 2016-12-08 13:58:12 +01:00
Ines Montani
1256232fad Fix formatting 2016-12-08 13:56:40 +01:00
Ines Montani
8e977cc71c Fix formatting 2016-12-08 13:56:17 +01:00
Ines Montani
0176b99004 Fix formatting 2016-12-08 12:48:02 +01:00
Ines Montani
877f09218b Add more custom rules for abbreviations 2016-12-08 12:47:01 +01:00
Gyorgy Orosz
0289b8ceaa Additional abbreviation tests. 2016-12-08 12:17:44 +01:00
Gyorgy Orosz
90d22db023 Added Hungarian resource files. 2016-12-08 12:06:36 +01:00
Ines Montani
bfaa42636c Update language data for German 2016-12-08 12:01:09 +01:00
Ines Montani
ec44bee321 Fix capitalization on morphological features 2016-12-08 12:00:54 +01:00
Gyorgy Orosz
5b00039955 First steps towards the Hungarian tokenizer code. 2016-12-07 23:07:43 +01:00
Ines Montani
ce979553df Resolve conflict 2016-12-07 21:16:52 +01:00
Ines Montani
8350d65695 Change morphology and lemmatizer API
Take morphology features as object instead of keyword arguments
2016-12-07 21:12:49 +01:00
Ines Montani
52e7d634df Remove trailing whitespace 2016-12-07 21:12:19 +01:00
Ines Montani
0d07d7fc80 Apply emoticon exceptions to tokenizer 2016-12-07 21:11:59 +01:00
Ines Montani
71f0f34cb3 Fix formatting 2016-12-07 21:11:29 +01:00
Ines Montani
9413bcd9ee Declare encoding and unicode literals 2016-12-07 21:10:34 +01:00
Ines Montani
a280ff2657 Fix __all__ 2016-12-07 21:10:12 +01:00
Ines Montani
ba8721953c Add missing emoticons 2016-12-07 21:09:44 +01:00
Ines Montani
1285c4ba93 Update English language data 2016-12-07 20:33:28 +01:00
Ines Montani
79dce0aabe Add emoticons 2016-12-07 20:33:28 +01:00
Ines Montani
a662a95294 Add line breaks 2016-12-07 20:33:28 +01:00
Ines Montani
07f0efb102 Add test for tokenizer regular expressions 2016-12-07 20:33:28 +01:00
Ines Montani
e0712d1b32 Reformat language data 2016-12-07 20:33:28 +01:00
Matthew Honnibal
0c0f4c965d Increment version 2016-12-03 11:16:52 +01:00
Matthew Honnibal
f6e356aada Add (and test) Span.sentiment attribute. By default we average token.span, but can override with custom hook. Re Issue #667 2016-12-02 11:05:50 +01:00
Janneke van der Zwaan
88869e0e07 Merge github.com:explosion/spaCy into dutch 2016-11-30 17:13:39 +01:00
Janneke van der Zwaan
51ade86b86 Update language data with tag map from UD_Dutch 2016-11-30 14:41:23 +01:00
Janneke van der Zwaan
90f6ff12c9 Update Dutch language data
- Use Dutch tag map
- remove tokenizer exceptions
2016-11-30 11:59:39 +01:00
dafnevk
7b8f4c49f2 Added language Dutch to init file 2016-11-29 16:42:05 +01:00
Matthew Honnibal
296d33a4fc Merge branch 'master' of ssh://github.com/explosion/spaCy 2016-11-26 12:36:18 +01:00
Matthew Honnibal
1f6c37c6f5 Fix create_tokenizer when nlp is None 2016-11-26 12:36:04 +01:00
Matthew Honnibal
c7889492f9 Fix model saving error for Python 3 2016-11-25 18:04:30 -06:00
Matthew Honnibal
bc0a202c9c Fix unicode problem in nonproj module 2016-11-25 17:29:17 -06:00
Matthew Honnibal
6dd3b94fa6 Filter out deprecated attributes when reading special-case tokenization rules. 2016-11-25 09:57:18 -06:00
Matthew Honnibal
e879c79b8c Merge branch 'master' of https://github.com/explosion/spaCy 2016-11-25 09:18:28 -06:00
Matthew Honnibal
a335c6dcc2 Exclude morphs from deprecated token attributes for now 2016-11-25 16:17:32 +01:00
Matthew Honnibal
f799a07f25 Merge branch 'master' of https://github.com/explosion/spaCy 2016-11-25 09:16:43 -06:00
Matthew Honnibal
159e8c46e1 Merge old training fixes with newer state 2016-11-25 09:16:36 -06:00
Matthew Honnibal
846e80f2f4 Exclude morphs from deprecated token attributes for now 2016-11-25 16:14:54 +01:00
Matthew Honnibal
664f2dd1c0 Allow dep to be None in scorer, for missing labels. 2016-11-25 09:02:49 -06:00
Matthew Honnibal
39341598bb Fix NER label calculation 2016-11-25 09:02:22 -06:00
Matthew Honnibal
ca773a1f53 Tweak arc_eager n_gold to deal with negative costs, and improve error message. 2016-11-25 09:01:52 -06:00
Matthew Honnibal
a2f55e7015 Pass cfg through loading, for training. 2016-11-25 09:01:20 -06:00
Matthew Honnibal
608d8f5421 Pass cfg through parser, and have is_valid default to 1, not 0 when resetting state 2016-11-25 09:00:21 -06:00
Matthew Honnibal
cc7e607a8a Fix gold.pyx for 1.0 2016-11-25 08:57:59 -06:00
root
080d29e092 Fix train.py for 1.0 2016-11-25 08:55:33 -06:00
Matthew Honnibal
6652f2a135 Test #656, #624: special case rules for tokenizer with attributes. 2016-11-25 12:44:13 +01:00
Matthew Honnibal
1e0f566d95 Fix #656, #624: Support arbitrary token attributes when adding special-case rules. 2016-11-25 12:43:24 +01:00
Matthew Honnibal
87613edf8f Add set_struct_attr staticmethod to token 2016-11-25 12:41:47 +01:00
Matthew Honnibal
fb69aa648f Merge branch 'master' of ssh://github.com/explosion/spaCy 2016-11-25 11:35:44 +01:00
Matthew Honnibal
9a03a3f85e Add get_struct_attr staticmethod to Token, to match Lexeme.get_struct_attr. 2016-11-25 11:35:17 +01:00
Matthew Honnibal
53d8ca8f51 Add spacy.attrs.intify_attrs function, to normalize strings in token attribute dictionaries. 2016-11-25 11:34:30 +01:00
Ines Montani
d21ad01840 Add emoticons 2016-11-24 19:13:00 +01:00
dafnevk
d8c7ac203a Added nl module for dutch 2016-11-24 16:39:49 +01:00
dafnevk
3db8b0d322 Added language class and some language data (with some TODOs) for Dutch 2016-11-24 15:56:38 +01:00
Ines Montani
4dcfafde02 Add line breaks 2016-11-24 14:57:37 +01:00
Ines Montani
6247c005a2 Add test for tokenizer regular expressions 2016-11-24 13:51:59 +01:00
Ines Montani
de747e39e7 Reformat language data 2016-11-24 13:51:32 +01:00
Matthew Honnibal
b8c4f5ea76 Allow German noun chunks to work on Span
Update the German noun chunks iterator, so that it also works on Span objects.
2016-11-24 23:30:15 +11:00
Pokey Rule
3e3bda142d Add noun_chunks to Span 2016-11-24 10:47:20 +00:00
Janneke van der Zwaan
83daade0e4 Add directory and initial (empty) files for language Dutch 2016-11-24 09:45:41 +01:00
Matthew Honnibal
09f68bc641 Fix Issue #639: stop words in language class not used. This patch is messy, but it's better not to change too much until the language data loading can be properly refactored. 2016-11-24 00:13:55 +01:00
Matthew Honnibal
48e1dc29d4 Fix default path loading. 2016-11-23 23:48:55 +01:00
Matthew Honnibal
e01c1875ee Work on test for #615 2016-11-23 23:48:41 +01:00
ExplodingCabbage
6c4f488e89 Fix syntax mistake 2016-11-23 15:12:45 +00:00
Matthew Honnibal
60eb2343ce Only try to load vectors if they exist. 2016-11-23 13:50:24 +01:00
Matthew Honnibal
618ac36093 Fix use of path argument in Language.__init__. Needs to be keyword arg, not positional. 2016-11-23 13:26:34 +01:00
Mark Amery
fbe19680a6 Fix another bug related to Language.__init__'s path parameter 2016-11-20 20:31:34 +00:00
Mark Amery
b0a07c21a0 Fix path param of Language.__init__ always being ignored
There was an explicitly-declared `path` keyword argument, so 'path'
would never be present in `**overrides`. This line just overwrote
any manually-specified value the user might've passed to the `path`
parameter.
2016-11-20 16:29:57 +00:00
Mark Amery
1988fce389 Merge remote-tracking branch 'origin/master' into specify-data-path 2016-11-20 16:07:14 +00:00
Mark Amery
3871007c72 Let --data-path be specified when running download.py scripts
Resolves https://github.com/explosion/spaCy/issues/637
2016-11-20 15:48:04 +00:00
Ines Montani
dad2c6cae9 Strip trailing whitespace 2016-11-20 16:45:51 +01:00
Ines Montani
3082e49326 Update and reformat German stopwords 2016-11-20 16:45:26 +01:00
Sourav Singh
6745eac309 Update language_data.py 2016-11-20 19:52:02 +05:30
Sourav Singh
4d9aae7d6a Add German Stopwords 2016-11-19 22:47:53 +05:30
Matthew Honnibal
7afb2544a7 Merge pull request #627 from sadovnychyi/patch-1
Remove duplicated line of vocab declaration
2016-11-16 06:09:18 +11:00
Yanhao
762169da29 Fixed bug: eg.guess is a tag id, rather than tag 2016-11-15 14:11:22 +08:00
Dmytro Sadovnychyi
e70a7050e1 Remove duplicated line of vocab declaration
As already declared on line 211.
2016-11-13 18:52:49 +08:00
Matthew Honnibal
f123f92e0c Fix #617: Vocab.load() required Path. Should work with string as well. 2016-11-10 22:48:48 +01:00
Matthew Honnibal
e86f440ca6 Fix test for issue 617 2016-11-10 22:48:10 +01:00
Matthew Honnibal
faa7610c56 Merge branch 'master' of ssh://github.com/explosion/spaCy 2016-11-10 22:46:38 +01:00
Matthew Honnibal
a2c7de8329 spacy/tests/regression/test_issue617.py
Test Issue #617
2016-11-10 22:46:23 +01:00
tiago
2a3e342c1f Added a test case to cover the span.merge returning values 2016-11-09 18:57:50 +00:00
tiago
b38cfd0ef9 now span.merge returns token like it says on documentation 2016-11-09 14:58:19 +00:00
Dmitry Sadovnychyi
9488222e79 Fix PhraseMatcher to work with updated Matcher
#613
2016-11-09 00:14:26 +08:00
Dmitry Sadovnychyi
86c056ba64 Add basic test for PhraseMatcher
#613
2016-11-09 00:10:32 +08:00
Matthew Honnibal
3ea15b257f Fix test for 605 2016-11-06 11:59:26 +01:00
Matthew Honnibal
efe7790439 Test #590: Order dependence in Matcher rules. 2016-11-06 11:21:36 +01:00
Matthew Honnibal
5cd3acb265 Fix #605: Acceptor now rejects matches as expected. 2016-11-06 10:50:42 +01:00
Matthew Honnibal
75805397dd Test Issue #605 2016-11-06 10:42:32 +01:00
Matthew Honnibal
014b6936ac Fix #608 -- __version__ should be available at the base of the package. 2016-11-04 21:21:02 +01:00
Matthew Honnibal
42b0736db7 Increment version 2016-11-04 20:04:21 +01:00
Matthew Honnibal
9f93386994 Update version 2016-11-04 19:28:16 +01:00
Matthew Honnibal
1fb09c3dc1 Fix morphology tagger 2016-11-04 19:19:09 +01:00
Matthew Honnibal
a36353df47 Temporarily put back the tokenize_from_strings method, while tests aren't updated yet. 2016-11-04 19:18:07 +01:00
Matthew Honnibal
f0917b6808 Fix Issue #376: and/or was tagged as a noun. 2016-11-04 15:21:28 +01:00
Matthew Honnibal
737816e86e Fix #368: Tokenizer handled pattern 'unicode close quote, period' incorrectly. 2016-11-04 15:16:20 +01:00
Matthew Honnibal
ab952b4756 Fix #578 -- Sputnik had been purging all files on --force, not just the relevant one. 2016-11-04 10:44:11 +01:00
Matthew Honnibal
6e37ba1d82 Fix #602, #603 --- Broken build 2016-11-04 09:54:24 +01:00
Matthew Honnibal
293c79c09a Fix #595: Lemmatization was incorrect for base forms, because morphological analyser wasn't adding morphology properly. 2016-11-04 00:29:07 +01:00
Matthew Honnibal
e30348b331 Prefer to import from symbols instead of parts_of_speech 2016-11-04 00:27:55 +01:00
Matthew Honnibal
4a8a2b6001 Test #595 -- Bug in lemmatization of base forms. 2016-11-04 00:27:32 +01:00
Matthew Honnibal
f1605df2ec Fix #588: Matcher should reject empty pattern. 2016-11-03 00:16:44 +01:00
Matthew Honnibal
72b9bd57ec Test Issue #588: Matcher accepts invalid, empty patterns. 2016-11-03 00:09:35 +01:00
Matthew Honnibal
41a90a7fbb Add tokenizer exception for 'Ph.D.', to fix 592. 2016-11-03 00:03:34 +01:00
Matthew Honnibal
532318e80b Import Jieba inside zh.make_doc 2016-11-02 23:49:19 +01:00
Matthew Honnibal
f292f7f0e6 Fix Issue #599, by considering empty documents to be parsed and tagged. Implementation is a bit dodgy. 2016-11-02 23:48:43 +01:00
Matthew Honnibal
b6b01d4680 Remove deprecated tokens_from_list test. 2016-11-02 23:47:21 +01:00
Matthew Honnibal
3d6c79e595 Test Issue #599: .is_tagged and .is_parsed attributes not reflected after deserialization for empty documents. 2016-11-02 23:40:11 +01:00
Matthew Honnibal
05a8b752a2 Fix Issue #600: Missing setters for Token attribute. 2016-11-02 23:28:59 +01:00
Matthew Honnibal
125c910a8d Test Issue #600 2016-11-02 23:24:13 +01:00
Matthew Honnibal
e0c9695615 Fix doc strings for tokenizer 2016-11-02 23:15:39 +01:00
Matthew Honnibal
80824f6d29 Fix test 2016-11-02 20:48:40 +01:00
Matthew Honnibal
dbe47902bc Add import fr 2016-11-02 20:48:29 +01:00
Matthew Honnibal
8f24dc1982 Fix infixes in Italian 2016-11-02 20:43:52 +01:00
Matthew Honnibal
41a4766c1c Fix infixes in spanish and portuguese 2016-11-02 20:43:12 +01:00
Matthew Honnibal
3d4bd96e8a Fix infixes in french 2016-11-02 20:41:43 +01:00
Matthew Honnibal
c09a8ce5bb Add test for french tokenizer 2016-11-02 20:40:31 +01:00
Matthew Honnibal
b012ae3044 Add test for loading languages 2016-11-02 20:38:48 +01:00
Matthew Honnibal
ad1c747c6b Fix stray POS in language stubs 2016-11-02 20:37:55 +01:00
Matthew Honnibal
e9e6fce576 Handle null prefix/suffix/infix search in tokenizer 2016-11-02 20:35:48 +01:00
Matthew Honnibal
22647c2423 Check that patterns aren't null before compiling regex for tokenizer 2016-11-02 20:35:29 +01:00
Matthew Honnibal
5ac735df33 Link languages in __init__.py 2016-11-02 20:05:14 +01:00
Matthew Honnibal
c68dfe2965 Stub out support for Italian 2016-11-02 20:03:24 +01:00
Matthew Honnibal
6dbf4f7ad7 Stub out support for French, Spanish, Italian and Portuguese 2016-11-02 20:02:41 +01:00
Matthew Honnibal
6b8b05ef83 Specify that spacy.util is encoded in utf8 2016-11-02 19:58:00 +01:00
Matthew Honnibal
5363224395 Add draft Jieba tokenizer for Chinese 2016-11-02 19:57:38 +01:00
Matthew Honnibal
f7fee6c24b Check for class-defined make_docs method before assigning one provided as an argument 2016-11-02 19:57:13 +01:00
Matthew Honnibal
19c1e83d3d Work on draft Italian tokenizer 2016-11-02 19:56:32 +01:00
Matthew Honnibal
9efe568177 Add missing unicode_literals to spacy.util. I think this was messing up the tokenizer regex for non-ascii characters in Python 2. Re Issue #596 2016-11-02 12:31:34 +01:00
Matthew Honnibal
d8db648ebf Add __init__.py file for regression tests 2016-11-01 13:45:06 +01:00
Matthew Honnibal
11664b9f20 Fix variable error in token 2016-11-01 13:28:00 +01:00
Matthew Honnibal
8c4d1b46ce Fix variable error in Span 2016-11-01 13:27:44 +01:00
Matthew Honnibal
e7af6b937f Fix syntax error while fixing doc strings 2016-11-01 13:27:32 +01:00
Matthew Honnibal
62fc6b1afa Use 32 bit hashes for OOV, re Issue #589, Issue #285 2016-11-01 13:27:13 +01:00
Matthew Honnibal
6977a2b8cd Add test for Issue #589 2016-11-01 12:33:36 +01:00
Matthew Honnibal
b86f8af0c1 Fix doc strings 2016-11-01 12:25:36 +01:00
Matthew Honnibal
d563f1eadb Fix Issue #587: Segfault in Matcher, due to simple error in the state machine. 2016-10-28 17:42:00 +02:00
Matthew Honnibal
7e5f63a595 Improve test slightly 2016-10-28 17:41:16 +02:00
Matthew Honnibal
782e4814f4 Test Issue #587: Matcher segfaults on particular input 2016-10-28 16:38:32 +02:00
Matthew Honnibal
708ea22208 Infer types in transition_system.pyx 2016-10-27 18:08:13 +02:00
Matthew Honnibal
18590eba94 Fix training evaluate method 2016-10-27 18:02:19 +02:00
Matthew Honnibal
301f3cc898 Fix Issue #429. Add an initialize_state method to the named entity recogniser that adds missing entity types. This is a messy place to add this, because it's strange to have the method mutate state. A better home for this logic could be found. 2016-10-27 18:01:55 +02:00
Matthew Honnibal
afea6505f3 Test Issue 429: No valid actions for NER after matcher adds a new entity label. 2016-10-27 18:01:34 +02:00
Matthew Honnibal
03a520ec4f Change signature of Parser.parseC, so that nr_class is read from the transition system. This allows the transition system to modify the number of actions in initialize_state. 2016-10-27 17:58:56 +02:00
Matthew Honnibal
6c47048912 Fix test, after IOB tweak. 2016-10-26 17:22:03 +02:00
Matthew Honnibal
4ca31b4d87 Fix clobbering of 'missing' named ent values after assigning ents. 2016-10-26 13:13:56 +02:00
Matthew Honnibal
cb49189477 Remove dead code 2016-10-26 13:11:07 +02:00
Matthew Honnibal
a209b10579 Improve error message when oracle fails for non-projective trees, re Issue #571. 2016-10-24 20:31:30 +02:00
Matthew Honnibal
b2d43b93d2 Fix Python 3 basestring error 2016-10-24 14:22:51 +02:00
Matthew Honnibal
276478fe0f Update strings.pxd 2016-10-24 14:00:35 +02:00
Matthew Honnibal
d8134817ff Workaround Issue #285: Allow the StringStore to be 'frozen', in which case strings will be pushed into an OOV map. We can then flush this OOV map, freeing all of the OOV strings. 2016-10-24 13:49:03 +02:00
Matthew Honnibal
d3a617aa99 Test workaround for Issue #285: Streaming data memory growth 2016-10-24 13:48:06 +02:00
Matthew Honnibal
64e5f02cf7 Update test 2016-10-23 21:08:07 +02:00
Matthew Honnibal
66d7a6eca2 Update test 2016-10-23 21:02:05 +02:00
Matthew Honnibal
90bf797125 Update test 2016-10-23 20:54:17 +02:00
Matthew Honnibal
5e76320ffe Update test 2016-10-23 20:44:54 +02:00
Matthew Honnibal
aa105927f3 Update test 2016-10-23 20:31:25 +02:00
Matthew Honnibal
6b9237aa83 Increment version 2016-10-23 20:22:53 +02:00
Matthew Honnibal
150e02d72e Fix Issue #566 2016-10-23 20:19:01 +02:00
Matthew Honnibal
e120561294 Fix vector_norm test. 2016-10-23 19:56:16 +02:00
Matthew Honnibal
fefde8aef8 Make installation print data path. 2016-10-23 19:46:44 +02:00
Matthew Honnibal
e7414cd064 Try to fix weird install glitch. 2016-10-23 19:46:28 +02:00
Matthew Honnibal
90f7544edd Increment version 2016-10-23 19:43:06 +02:00
Matthew Honnibal
6036ec7c77 Fix vector norm when loading lexemes. 2016-10-23 19:40:18 +02:00
Matthew Honnibal
c05cd2356e Fix similarity test for Python 3 2016-10-23 18:16:56 +02:00
Matthew Honnibal
3e688e6d4b Fix issue #514 -- serializer fails when new entity type has been added. The fix here is quite ugly. It's best to add the entities ASAP after loading the NLP pipeline, to mitigate the brittleness. 2016-10-23 17:45:44 +02:00
Matthew Honnibal
79aa03fe98 Test Issue #514: Serializer fails when new entity type has been added. 2016-10-23 17:41:44 +02:00
Matthew Honnibal
f97548c6f1 Fix broken test, re Issue #461 2016-10-23 17:02:23 +02:00
Matthew Honnibal
4de30a8e38 Test Issue #514: Serialization fails after adding a new entity label. 2016-10-23 16:40:27 +02:00
Matthew Honnibal
936e6246aa Fix Issue #459 -- failed to deserialize empty doc. 2016-10-23 16:31:05 +02:00
Matthew Honnibal
e99b3f5322 Test Issue #459: Fail to deserialize empty doc 2016-10-23 16:30:22 +02:00
Matthew Honnibal
49c117960c Fix bug where huffman codec died if given empty freqs dict. 2016-10-23 16:28:05 +02:00
Matthew Honnibal
99ff8b902f Test that huffman codec works with empty freqs dict 2016-10-23 16:27:45 +02:00
Matthew Honnibal
15c9b59f0e Fix Issue #461: O tag was being clobbered by doc.ents.__set__ 2016-10-23 15:50:26 +02:00
Matthew Honnibal
e5627134d9 Test Issue #461: ent_iob tag incorrect after setting entities. 2016-10-23 15:50:04 +02:00
Matthew Honnibal
f62088d646 Fix compile error 2016-10-23 14:50:50 +02:00
Matthew Honnibal
2c3a67b693 Fix calculation of vector norm, re Issue #522. Need to consolidate the calculations into a helper function. 2016-10-23 14:49:31 +02:00
Matthew Honnibal
a0a4ada42a Fix calculation of L2-norm for Lexeme 2016-10-23 14:44:45 +02:00
Matthew Honnibal
2989072aac Add tests to verify that Issue #442 is fixed in 1.1 2016-10-23 14:33:13 +02:00
Matthew Honnibal
739213a8af Fix create_pipeline keyword argument. 2016-10-23 14:24:16 +02:00
Matthew Honnibal
bea44bd3c4 Fix vector_norm when vector is assigned to Lexeme. 2016-10-23 14:23:56 +02:00
Matthew Honnibal
e838b6d53f Add tests for using the new Entity ID tracking in the rule matcher 2016-10-23 14:04:01 +02:00
Matthew Honnibal
e7af75e0a9 Add test for vector resizing, re Issue #544 2016-10-21 17:07:21 +02:00
Matthew Honnibal
ca8ea33abc Bump version to 1.1.0 2016-10-21 16:30:57 +02:00
Matthew Honnibal
7ab03050d4 Add resize_vectors method to Vocab 2016-10-21 01:44:50 +02:00
Matthew Honnibal
8ce8803824 Fix JSON in tokenizer 2016-10-21 01:44:20 +02:00
Matthew Honnibal
6eb73a095f Fix JSON in tagger 2016-10-21 01:44:10 +02:00
Matthew Honnibal
e16e78a737 Merge branch 'master' of ssh://github.com/explosion/spaCy 2016-10-21 00:00:15 +02:00
Matthew Honnibal
147373c807 Increment version 2016-10-21 00:00:03 +02:00
Matthew Honnibal
e80944276f Fix Span.vector_norm 2016-10-20 21:58:56 +02:00
Matthew Honnibal
f5fe4f595b Fix json loading, for Python 3. 2016-10-20 21:23:26 +02:00
Matthew Honnibal
2e92c6fb3a Fix JSON encoding issue on load 2016-10-20 21:06:48 +02:00
Matthew Honnibal
4ad7bb96c9 Increment version. 2016-10-20 20:48:30 +02:00
Matthew Honnibal
5ec32f5d97 Fix loading of GloVe vectors, to address Issue #541 2016-10-20 18:27:48 +02:00
Matthew Honnibal
ddeabd76c4 Fix mistake loading GloVe vectors. GloVe vectors now loaded by default if present, as promised. 2016-10-20 16:57:53 +02:00
Matthew Honnibal
bfe5cb1244 Increment version. 2016-10-20 14:52:00 +02:00
Matthew Honnibal
f189a3cb00 Fix encoding when opening files in Python 2.7, re Issue #539 2016-10-20 14:42:56 +02:00
Matthew Honnibal
c353a5214d Increment version 2016-10-19 23:51:01 +02:00
Matthew Honnibal
d10c17f2a4 Fix Issue #536: oov_prob was 0 for OOV words. 2016-10-19 23:38:47 +02:00
Matthew Honnibal
dfa752d064 Increment version 2016-10-19 23:19:13 +02:00
Matthew Honnibal
3588a18fb8 Fix hook names in doc 2016-10-19 21:15:16 +02:00
Matthew Honnibal
5d5742b773 Add sentiment field to doc, rename getters_for_tokens and getters_for_spans, add user_hooks field to Doc. 2016-10-19 20:54:22 +02:00
Matthew Honnibal
ed5e178817 Add sentiment property on lexeme object 2016-10-19 20:52:52 +02:00
Matthew Honnibal
d4aaf2752c Fix issue #535: Pipeline elements added even when data not installed. 2016-10-19 19:55:19 +02:00
Matthew Honnibal
04d1c959da Fix version 2016-10-19 03:45:37 +02:00
Matthew Honnibal
d35aa7344e Change version ID to make PyPi happy 2016-10-19 03:24:39 +02:00
Matthew Honnibal
89d2a5c8b3 Increment build version. 2016-10-19 03:05:17 +02:00
Matthew Honnibal
622b0a9674 Tweak download script 2016-10-19 00:52:16 +02:00
Matthew Honnibal
5a5c7192a5 Fix download.py for GloVe vectors. 2016-10-19 00:47:44 +02:00
Matthew Honnibal
edc45c19d6 Update download script 2016-10-19 00:41:14 +02:00
Matthew Honnibal
2bbb050500 Fix default of serializer_freqs 2016-10-18 19:55:41 +02:00
Matthew Honnibal
1b651db9c5 Fix parser creation in Language class. 2016-10-18 19:36:44 +02:00
Matthew Honnibal
45a6f9b9c7 Fix loading of tagger. 2016-10-18 19:33:04 +02:00
Matthew Honnibal
76c815f40d Fix spacy.load 2016-10-18 19:23:31 +02:00
Matthew Honnibal
8c8f5c62c6 Add LANG attribute to English and German 2016-10-18 18:52:48 +02:00
Matthew Honnibal
05e2a589a4 Fix None label in matcher 2016-10-18 18:05:21 +02:00
Matthew Honnibal
c3a8a1cf51 Update serializer test. 2016-10-18 16:18:46 +02:00
Matthew Honnibal
7d5212f131 Refactor defaults 2016-10-18 16:18:25 +02:00
Matthew Honnibal
a45a9d5092 Remove stray .tensor attribute from Lexeme 2016-10-18 01:16:32 +02:00
Matthew Honnibal
9258db788a Revert "Have the matcher return character offsets, to handle the match better."
This reverts commit 049c937540.
2016-10-17 16:49:51 +02:00
Matthew Honnibal
7d446e5094 Revert "Update matcher test, to reflect character offset return instead of token offset."
This reverts commit f8d3e3bcfe.
2016-10-17 16:49:49 +02:00
Matthew Honnibal
4bf2c53c13 Revert "Hack on matcher tests, for new implementation."
This reverts commit dbe60644ab.
2016-10-17 16:49:48 +02:00
Matthew Honnibal
2fd97c71cc Revert "Don't try to pickle matcher."
This reverts commit 97bd0c9d00.
2016-10-17 16:49:43 +02:00
Matthew Honnibal
97bd0c9d00 Don't try to pickle matcher. 2016-10-17 16:38:40 +02:00
Matthew Honnibal
dbe60644ab Hack on matcher tests, for new implementation. 2016-10-17 16:12:22 +02:00
Matthew Honnibal
f8d3e3bcfe Update matcher test, to reflect character offset return instead of token offset. 2016-10-17 16:00:10 +02:00
Matthew Honnibal
049c937540 Have the matcher return character offsets, to handle the match better. 2016-10-17 15:58:57 +02:00
Matthew Honnibal
9b60186266 Fix doc class 2016-10-17 15:23:47 +02:00
Matthew Honnibal
6cbdc94959 Lots of updates to Matcher, to make entity handling sane. 2016-10-17 15:23:31 +02:00
Matthew Honnibal
7fd98fc91c Remove deprecation shim around str/bytes in Token. 2016-10-17 14:02:47 +02:00
Matthew Honnibal
b67697a97b Improve API for doc.merge() and span.merge(), to use keyword arguments. 2016-10-17 14:02:13 +02:00
Matthew Honnibal
fbb7f3f15c Add user_data attribute to Doc object. 2016-10-17 11:43:22 +02:00
Matthew Honnibal
c1abc8f6ed Fix deprecation stuff in Token: Remove the shim for the str/unicode semantics, and raise for has_repvec and repvec 2016-10-17 11:18:41 +02:00
Matthew Honnibal
4ba9eadf3d Merge branch 'v1.0.0-rc1' of ssh://github.com/explosion/spaCy into v1.0.0-rc1 2016-10-17 02:45:44 +02:00
Matthew Honnibal
09ab447a18 Remove tensor property from token. 2016-10-17 02:45:09 +02:00
Matthew Honnibal
5d10e2005c Defer some attributes to Doc, via getters_for_tokens attribute. 2016-10-17 02:44:49 +02:00
Matthew Honnibal
8829984efb Remove tensor attribute from Span and Token. 2016-10-17 02:44:04 +02:00
Matthew Honnibal
d15a88c66a Defer some attributes to Doc via getters_for_spans 2016-10-17 02:43:35 +02:00
Matthew Honnibal
62230dd13a Add getters_for_spans and getters_for_tokens attributes to Doc. Fix docstring 2016-10-17 02:42:51 +02:00
Matthew Honnibal
ae11ea8240 Add getters_for_tokens and getters_for_spans attributes to Doc object. 2016-10-17 02:42:05 +02:00
Matthew Honnibal
be48a7b4f3 Fix conftest for website tests. 2016-10-17 01:54:26 +02:00
Matthew Honnibal
8951bf6989 Update matcher tests 2016-10-17 01:53:24 +02:00
Matthew Honnibal
0cf4aff470 Set default path in EN/DE tests. 2016-10-17 01:52:49 +02:00
Matthew Honnibal
cd71b6b0a9 Remove test of parser pickle 2016-10-17 01:52:10 +02:00
Matthew Honnibal
5bc101006e Add cfg field to Tagger 2016-10-17 01:03:41 +02:00
Matthew Honnibal
517f090cbf Use GoldParse in tagger.update 2016-10-17 00:55:15 +02:00
Matthew Honnibal
59038f7efa Restore support for prior data format -- specifically, the labels field of the config. 2016-10-17 00:53:26 +02:00
Matthew Honnibal
7887ab3b36 Fix default use of feature_templates in parser 2016-10-16 21:41:56 +02:00
Matthew Honnibal
f787cd29fe Refactor the pipeline classes to make them more consistent, and remove the redundant blank() constructor. 2016-10-16 21:34:57 +02:00
Matthew Honnibal
311a985fe0 Add input error handling in Doc 2016-10-16 18:16:42 +02:00
Matthew Honnibal
06322ba99d Add words and spaces keyword arguments to Doc. 2016-10-16 18:13:03 +02:00
Matthew Honnibal
ca51f3b77e Use DependencyParser and EntityRecognizer in the Language class. 2016-10-16 17:58:12 +02:00
Matthew Honnibal
195d998a12 Fix GoldParse argument to tagger.update 2016-10-16 17:05:09 +02:00
Matthew Honnibal
274a4d4272 Fix queue Python property in StateClass 2016-10-16 17:04:41 +02:00
Matthew Honnibal
e8c8aa08ce Make action_name optional in StepwiseState 2016-10-16 17:04:16 +02:00
Matthew Honnibal
4bb73b1a93 Fix parser labels in pipeline 2016-10-16 17:03:22 +02:00
Matthew Honnibal
a81c5a7abf Fix name of labels keyword to 'actions'. 2016-10-16 12:00:27 +02:00
Matthew Honnibal
a079677984 Fix omission of O action when creating blank entity recognizer 2016-10-16 11:43:25 +02:00
Matthew Honnibal
5444d38cc6 Update test for biluo tags 2016-10-16 11:42:45 +02:00
Matthew Honnibal
4fc56d4a31 Rename 'labels' to 'actions' in parser options 2016-10-16 11:42:26 +02:00
Matthew Honnibal
8a6b35d266 Delay binding in MakeDoc 2016-10-16 11:41:55 +02:00
Matthew Honnibal
52b48b415e Fix GoldParse class 2016-10-16 11:41:36 +02:00
Matthew Honnibal
3259a63779 Whitespace 2016-10-16 01:47:28 +02:00
Matthew Honnibal
509b30834f Add a pipeline module, to collect and wrap processes for annotation 2016-10-16 01:47:12 +02:00
Matthew Honnibal
0317cea0ad Fix GoldParse 2016-10-15 23:55:07 +02:00
Matthew Honnibal
1c62573a41 Fix spacy.train 2016-10-15 23:53:46 +02:00
Matthew Honnibal
a48aa15384 Improve the API for the GoldParse class. 2016-10-15 23:53:29 +02:00
Matthew Honnibal
e07fe92b27 Draft a refactored init for the GoldParse class 2016-10-15 22:09:52 +02:00
Matthew Honnibal
47afef7d6b Add init.py for gold tests 2016-10-15 21:51:28 +02:00
Matthew Honnibal
86ae665c78 Add function for entity->biluo transformation 2016-10-15 21:51:04 +02:00
Matthew Honnibal
2163fd238f Add tests for entity->biluo transformation 2016-10-15 21:50:43 +02:00
Matthew Honnibal
5e923b9bfa Return None in match_best_version if not path exists. 2016-10-15 14:47:29 +02:00
Matthew Honnibal
2516382106 Fix loading of English in span test 2016-10-15 14:44:37 +02:00
Matthew Honnibal
dda2fc6bef Add empty data directory 2016-10-15 14:25:25 +02:00
Matthew Honnibal
049197e0ae Update tests, somewhat messily. 2016-10-15 14:14:04 +02:00
Matthew Honnibal
1e1a1d9517 Update matcher test 2016-10-15 14:13:41 +02:00
Matthew Honnibal
9cc9ce0f14 Load with default path=False in tests. 2016-10-15 14:13:23 +02:00
Matthew Honnibal
08e9134760 Change default value of path to True 2016-10-15 14:12:54 +02:00
Matthew Honnibal
788657f062 Ensure words are added to vocab before test, so that the lexicon is updated correctly. 2016-10-15 14:12:18 +02:00
Matthew Honnibal
4a1a2bce68 Update version in about.py 2016-10-15 13:44:27 +02:00
Matthew Honnibal
6d8cb515ac Break the tokenization stage out of the pipeline into a function 'make_doc'. This allows all pipeline methods to have the same signature. 2016-10-14 17:38:29 +02:00
Matthew Honnibal
2cc515b2ed Add add_flag method to Vocab, re Issue #504. 2016-10-14 12:15:38 +02:00
Matthew Honnibal
f3be9d0a9a Add tensor field to Lexeme, Token, Doc and Span, so that users have a place to hang neural network outputs 2016-10-14 03:24:13 +02:00
Matthew Honnibal
9b55d97a8f Update train method 2016-10-13 03:24:53 +02:00
Matthew Honnibal
645d99523a Move merge_sents method into spacy.gold 2016-10-13 03:24:29 +02:00
Matthew Honnibal
41f88ce938 Fix dep model loading in parser 2016-10-12 20:26:38 +02:00
Matthew Honnibal
d9ae2d68af Load features by string-name for backwards compatibility. 2016-10-12 20:15:11 +02:00
Matthew Honnibal
a42fbcf946 Require model for test_is_properties 2016-10-12 19:35:18 +02:00
Matthew Honnibal
20c948361b Use local path in test_lemmatizer 2016-10-12 19:35:00 +02:00
Matthew Honnibal
1318d0bc65 Test with the non-loaded versions of the English and German pipelines. 2016-10-12 19:13:31 +02:00
Matthew Honnibal
0e2bedc373 Fix default labels for parser and NER 2016-10-12 19:12:40 +02:00
Matthew Honnibal
3a03c668c3 Fix message in ParserStateError 2016-10-12 14:44:31 +02:00
Matthew Honnibal
6bf505e865 Fix error on ParserStateError 2016-10-12 14:35:55 +02:00
Matthew Honnibal
ba5e048502 Add docstring for Trainer class. 2016-10-12 14:26:02 +02:00
Matthew Honnibal
847a4a4182 Refactor Language, dropping Language.blank() method. 2016-10-12 13:45:58 +02:00
Matthew Honnibal
ea23b64cc8 Refactor training, with new spacy.train module. Defaults still a little awkward. 2016-10-09 12:24:24 +02:00
Matthew Honnibal
ca32a1ab01 Revert "Work on Issue #285: intern strings into document-specific pools, to address streaming data memory growth. StringStore.__getitem__ now raises KeyError when it can't find the string. Use StringStore.intern() to get the old behaviour. Still need to hunt down all uses of StringStore.__getitem__ in library and do testing, but logic looks good."
This reverts commit 8423e8627f.
2016-09-30 20:20:22 +02:00
Matthew Honnibal
90baa9c7e6 Revert "Changes to matcher.pyx for new StringStore scheme"
This reverts commit 3ff09614e0.
2016-09-30 20:20:13 +02:00
Matthew Honnibal
1b6b129c04 Revert "Changes to morphology.pyx for new StringStore scheme"
This reverts commit 95f8cfd745.
2016-09-30 20:20:02 +02:00
Matthew Honnibal
1d70db58aa Revert "Changes to iterators.pyx for new StringStore scheme"
This reverts commit 4f794b215a.
2016-09-30 20:19:53 +02:00
Matthew Honnibal
de01e427fd Revert "Changes to strings.pyx for new StringStore scheme"
This reverts commit 22d4752d64.
2016-09-30 20:19:42 +02:00
Matthew Honnibal
9e09b39b9f Revert "Changes to transition systems for new StringStore scheme"
This reverts commit 0442e0ab1e.
2016-09-30 20:11:49 +02:00
Matthew Honnibal
e3285f6f30 Revert "Fix report of ParserStateError"
This reverts commit 78f19baafa.
2016-09-30 20:11:33 +02:00
Matthew Honnibal
6736977d82 Revert "Changes to Doc and Token for new string store scheme"
This reverts commit 99de44d864.
2016-09-30 20:11:15 +02:00
Matthew Honnibal
bd7fe6420c Revert "Changes to test for new string-store"
This reverts commit 21e90d7d0b.
2016-09-30 20:11:01 +02:00
Matthew Honnibal
1f1cd5013f Revert "Changes to vocab for new stringstore scheme"
This reverts commit a51149a717.
2016-09-30 20:10:30 +02:00
Matthew Honnibal
1e7d0af127 Revert "Changes to Lexeme for new string store scheme"
This reverts commit 717741b6cf.
2016-09-30 20:10:13 +02:00
Matthew Honnibal
ba51cb8325 Revert "Changes to tagger for new string store scheme"
This reverts commit f5a6aac906.
2016-09-30 20:09:53 +02:00
Matthew Honnibal
23b7244842 Make sure symbols are unicode strings 2016-09-30 20:02:19 +02:00
Matthew Honnibal
f5a6aac906 Changes to tagger for new string store scheme 2016-09-30 20:01:51 +02:00
Matthew Honnibal
717741b6cf Changes to Lexeme for new string store scheme 2016-09-30 20:01:36 +02:00
Matthew Honnibal
a51149a717 Changes to vocab for new stringstore scheme 2016-09-30 20:01:19 +02:00
Matthew Honnibal
21e90d7d0b Changes to test for new string-store 2016-09-30 20:00:58 +02:00
Matthew Honnibal
99de44d864 Changes to Doc and Token for new string store scheme 2016-09-30 20:00:21 +02:00
Matthew Honnibal
78f19baafa Fix report of ParserStateError 2016-09-30 19:59:22 +02:00
Matthew Honnibal
0442e0ab1e Changes to transition systems for new StringStore scheme 2016-09-30 19:58:51 +02:00
Matthew Honnibal
22d4752d64 Changes to strings.pyx for new StringStore scheme 2016-09-30 19:58:09 +02:00
Matthew Honnibal
4f794b215a Changes to iterators.pyx for new StringStore scheme 2016-09-30 19:57:49 +02:00
Matthew Honnibal
95f8cfd745 Changes to morphology.pyx for new StringStore scheme 2016-09-30 19:57:10 +02:00
Matthew Honnibal
3ff09614e0 Changes to matcher.pyx for new StringStore scheme 2016-09-30 19:56:48 +02:00
Matthew Honnibal
eceeaefe53 Fix defaults for Parser and Entity, adding a blank= argument. 2016-09-30 19:56:06 +02:00
Matthew Honnibal
8423e8627f Work on Issue #285: intern strings into document-specific pools, to address streaming data memory growth. StringStore.__getitem__ now raises KeyError when it can't find the string. Use StringStore.intern() to get the old behaviour. Still need to hunt down all uses of StringStore.__getitem__ in library and do testing, but logic looks good. 2016-09-30 10:14:47 +02:00
Matthew Honnibal
d3dc5718b2 Fix syntax error in Doc 2016-09-28 11:39:49 +02:00
Matthew Honnibal
1b520e7bab Improve docstrings for Doc object 2016-09-28 11:15:13 +02:00
Matthew Honnibal
81a47c01d8 Fix test for empty sentence string. 2016-09-27 19:21:22 +02:00
Matthew Honnibal
4cbf0d3bb6 Handle errors when no valid actions are available, pointing users to the issue tracker. 2016-09-27 19:19:53 +02:00
Matthew Honnibal
430473bd98 Raise errors when no actions are available, re Issue #429 2016-09-27 19:09:37 +02:00
Matthew Honnibal
fc4a7ad794 Test and fix Issue #411: IndexError when .sents property is used on empty string. 2016-09-27 18:49:14 +02:00
Matthew Honnibal
3d370b7d45 Add test for Issue #445, fixed in 3cb4d455d, with improved lemmatizer logic 2016-09-27 18:39:46 +02:00
Matthew Honnibal
a2f3510d6d Fix lemmatizer 2016-09-27 17:47:05 +02:00
Matthew Honnibal
07776d8096 Fix pos name conflict in lemmatize 2016-09-27 17:35:58 +02:00
Matthew Honnibal
35cd953f9e Fix pos name conflict with morphology 2016-09-27 14:16:22 +02:00
Matthew Honnibal
8e7df3c4ca Expect the parser data, if parser.load() is called. 2016-09-27 14:02:12 +02:00
Matthew Honnibal
bb4f201ad2 Pass morphological features from tag map into the lemmatizer. 2016-09-27 14:01:43 +02:00
Matthew Honnibal
40509e8bca Tweak the new is_base_form logic, because we can expect the 'pos' key in the morphology we're passed. 2016-09-27 14:01:16 +02:00
Matthew Honnibal
9c8ac91d72 Add test for Issue #435 2016-09-27 13:52:38 +02:00
Matthew Honnibal
3cb4d455d2 Pass lemmatizer morphological features, so that rules are sensitive to base/inflected distinction, which is how the WordNet data is designed. See Issue #435 2016-09-27 13:52:11 +02:00
Matthew Honnibal
e233328d38 Fix Issue #371: Lexeme objects were unhashable. 2016-09-27 13:22:30 +02:00
Matthew Honnibal
e382e48d9f Temporarily patch handling of defaul templates for tagger. Need to move these to language_data. 2016-09-27 13:21:28 +02:00
Matthew Honnibal
a44763af0e Fix Issue #469: Incorrectly cased root label in noun chunk iterator 2016-09-27 13:13:01 +02:00
Matthew Honnibal
b14b9b096b Return None if /deps directory not present, instead of trying to load the parser. 2016-09-26 18:48:03 +02:00
Matthew Honnibal
e07b9665f7 Don't expect parser model 2016-09-26 18:09:33 +02:00
Matthew Honnibal
ee6fa106da Fix parser features 2016-09-26 17:57:32 +02:00
Matthew Honnibal
e607e4b598 Fix parser loading 2016-09-26 17:51:11 +02:00
Matthew Honnibal
0b2d7ae9d6 Fix Entity creation 2016-09-26 15:41:22 +02:00
Matthew Honnibal
2debc4e0a2 Add .blank() method to Parser. Start housing default dep labels and entity types within the Defaults class. 2016-09-26 11:57:54 +02:00
Matthew Honnibal
722199acb8 Add spacy.blank() method, that doesn't load data. Don't try to load data if path is falsey 2016-09-26 11:07:46 +02:00
Matthew Honnibal
e56653f848 Add language data for German 2016-09-25 15:44:45 +02:00
Matthew Honnibal
7db956133e Move tokenizer data for German into spacy.de.language_data 2016-09-25 15:37:33 +02:00
Matthew Honnibal
95aaea0d3f Refactor so that the tokenizer data is read from Python data, rather than from disk 2016-09-25 14:49:53 +02:00
Matthew Honnibal
d7e9acdcdf Add English language data, so that the tokenizer doesn't require the data download 2016-09-25 14:49:00 +02:00
Matthew Honnibal
82b8cc5efb Whitespace 2016-09-24 22:17:01 +02:00
Matthew Honnibal
fd58f7655a Python 3 compatible basestring 2016-09-24 22:16:43 +02:00
Matthew Honnibal
082e95b19e Python 3 compatible basestring 2016-09-24 22:09:21 +02:00
Matthew Honnibal
f19af6cb2c Python 3 compatible basestring 2016-09-24 22:08:43 +02:00
Matthew Honnibal
3ed4cdfe32 Handle pathlib.Path objects in CFile 2016-09-24 22:01:46 +02:00
Matthew Honnibal
df88690177 Fix encoding of path variable 2016-09-24 21:13:15 +02:00
Matthew Honnibal
af847e07fc Fix usage of pathlib for Python3 -- turning paths to strings. 2016-09-24 21:05:27 +02:00
Matthew Honnibal
453683aaf0 Fix spacy/vocab.pyx 2016-09-24 20:50:31 +02:00
Matthew Honnibal
fd65cf6cbb Finish refactoring data loading 2016-09-24 20:26:17 +02:00
Matthew Honnibal
83e364188c Mostly finished loading refactoring. Design is in place, but doesn't work yet. 2016-09-24 15:42:01 +02:00
Matthew Honnibal
9dc8043a7e Refactor Language to use new Defaults class, and work on revised data loading. We're getting rid of sputnik's weird file-system wrapper, and using pathlib. 2016-09-24 14:08:53 +02:00
Matthew Honnibal
b00f683a0c Fix matcher test 2016-09-24 11:20:58 +02:00
Matthew Honnibal
eaf4065480 Expose the _patterns private member 2016-09-24 11:20:42 +02:00
Matthew Honnibal
15e42a1ba9 Allow entities to be set by Span, or by 4-tuple (with entity ID) 2016-09-24 01:17:43 +02:00
Matthew Honnibal
60fdf4d5f1 Remove commented out debuggng code 2016-09-24 01:17:18 +02:00
Matthew Honnibal
939a791a52 Update tests 2016-09-24 01:17:03 +02:00
Matthew Honnibal
55f1f7edaf Don't automatically write new entities into the Doc in the Matcher. This fixes a long-standing wart, but introduces a *backwards incompatibility.* 2016-09-24 01:16:45 +02:00
Matthew Honnibal
e48df859b5 Fix typedef import in span.pyx 2016-09-23 16:02:28 +02:00
Matthew Honnibal
4de13606fd Fix token.pyx 2016-09-23 15:07:07 +02:00
Matthew Honnibal
b4de419e19 Import hash_t typedef in token.pyx 2016-09-23 14:22:06 +02:00
Matthew Honnibal
c1a2e96604 Clean up notes at end of token.pyx 2016-09-21 20:45:51 +02:00