Matthew Honnibal
e4c951c153
Merge branch 'organize-language-data' of ssh://github.com/explosion/spaCy into organize-language-data
2016-12-18 17:01:08 +01:00
Ines Montani
b99d683a93
Fix formatting
2016-12-18 16:58:28 +01:00
Ines Montani
b11d8cd3db
Merge remote-tracking branch 'origin/organize-language-data' into organize-language-data
2016-12-18 16:57:12 +01:00
Ines Montani
d1c1d3f9cd
Fix tokenizer test
2016-12-18 16:55:32 +01:00
Ines Montani
753068f1d5
Use base language data as default
2016-12-18 16:55:25 +01:00
Ines Montani
bcc1d50d09
Remove trailing whitespace
2016-12-18 16:54:52 +01:00
Ines Montani
4e95737c6c
Add base tag map
2016-12-18 16:54:28 +01:00
Ines Montani
2b2ea8ca11
Reorganise language data
2016-12-18 16:54:19 +01:00
Matthew Honnibal
1b31c05bf8
Whitespace
2016-12-18 16:51:40 +01:00
Matthew Honnibal
bdcecb3c96
Add import in regression test
2016-12-18 16:51:31 +01:00
Matthew Honnibal
6ee1df93c5
Set tag_map to None if it's not seen in the data by vocab
2016-12-18 16:51:10 +01:00
Matthew Honnibal
33996e770b
Update header for morphology class
2016-12-18 16:50:42 +01:00
Matthew Honnibal
d58187ffa7
Filter out morphology keys in deprecated attrs
2016-12-18 16:50:26 +01:00
Matthew Honnibal
837a5d4100
Update morphology class so that exceptions can be added one-by-one, and so that arbitrary attributes can be referenced.
2016-12-18 16:49:46 +01:00
Matthew Honnibal
44f4f008bd
Wire up lemmatizer rules for English
2016-12-18 15:50:09 +01:00
Matthew Honnibal
e6fc4afb04
Whitespace
2016-12-18 15:48:00 +01:00
Ines Montani
32b36c3882
Break language data components into their own files
2016-12-18 15:40:22 +01:00
Ines Montani
1bff59a8db
Update English language data
2016-12-18 15:36:53 +01:00
Ines Montani
2eb163c5dd
Add lemma rules
2016-12-18 15:36:53 +01:00
Ines Montani
29ad8143d8
Add morph rules
2016-12-18 15:36:53 +01:00
Ines Montani
bc40dad7d9
Add entity rules
2016-12-18 15:36:53 +01:00
Ines Montani
eaa3b1319d
Fix formatting
2016-12-18 15:36:53 +01:00
Ines Montani
704c7442e0
Break language data components into their own files
2016-12-18 15:36:53 +01:00
Ines Montani
62655fd36f
Add ENT_ID constant
2016-12-18 15:36:53 +01:00
Matthew Honnibal
fa272fdf12
Merge branch 'organize-language-data' of ssh://github.com/explosion/spaCy into organize-language-data
2016-12-18 15:00:21 +01:00
Matthew Honnibal
57c4341453
Refactor loading of morphology exceptions, adding a method add_special_case.
2016-12-18 14:59:44 +01:00
Ines Montani
77cf2fb0f6
Remove unnecessary argument in test
2016-12-18 14:06:27 +01:00
Ines Montani
121c310566
Remove trailing whitespace
2016-12-18 14:06:27 +01:00
Ines Montani
0fc4e45cb3
Fix tag map for German
2016-12-18 13:30:03 +01:00
Ines Montani
28326649f3
Fix typo
2016-12-18 13:30:03 +01:00
Matthew Honnibal
0595cc0635
Change test595 to mock data, instead of requiring model.
2016-12-18 13:28:51 +01:00
Matthew Honnibal
a4eb5c2bff
Check POS key in lemmatizer, to update it for new data format
2016-12-18 13:28:20 +01:00
Matthew Honnibal
28d63ec58e
Restore missing '' character in tokenizer exceptions.
2016-12-18 05:34:51 +01:00
Ines Montani
a9421652c9
Remove duplicates in tag map
2016-12-17 22:44:31 +01:00
Ines Montani
69baf1c9a8
Fix tag map
2016-12-17 22:44:22 +01:00
Ines Montani
577adad945
Fix formatting
2016-12-17 14:00:52 +01:00
Ines Montani
fc4ad17136
Fix typo
2016-12-17 14:00:47 +01:00
Ines Montani
bb94e784dc
Fix typo
2016-12-17 13:59:30 +01:00
Ines Montani
afda532595
Use symbols in tag map
2016-12-17 13:56:24 +01:00
Ines Montani
07249145c9
Fix formatting
2016-12-17 13:34:46 +01:00
Ines Montani
dd55d085b6
Reformat dutch language data to match new style
2016-12-17 13:26:01 +01:00
Ines Montani
f2c48ef504
Resolve stopwords conflict to merge Dutch
2016-12-17 13:08:16 +01:00
Matthew Honnibal
ff03ade08f
Merge pull request #688 from nlesc-sherlock/dutch
...
Support for Dutch in SpaCy
2016-12-17 22:44:58 +11:00
Ines Montani
a22322187f
Add missing lemmas to tokenizer exceptions (fixes #674)
2016-12-17 12:42:41 +01:00
Ines Montani
5445074cbd
Expand tokenizer exceptions with unicode apostrophe (fixes #685)
2016-12-17 12:34:08 +01:00
Ines Montani
e0a7b5c612
Fix formatting
2016-12-17 12:33:09 +01:00
Ines Montani
08162dce67
Move shared functions and constants to global language data
2016-12-17 12:32:48 +01:00
Ines Montani
6a60a61086
Move update_exc to global language data utils
2016-12-17 12:29:02 +01:00
Ines Montani
f324311249
Add global language data utils
2016-12-17 12:27:41 +01:00
Ines Montani
487ce1e20a
Add encoding declaration
2016-12-17 12:25:44 +01:00
Ines Montani
d8d50a0334
Add tokenizer exception for "gonna" (fixes #691)
2016-12-17 11:59:28 +01:00
Ines Montani
c69b77d8aa
Revert "Add exception for "gonna""
...
This reverts commit 280c03f67b.
2016-12-17 11:56:44 +01:00
Ines Montani
280c03f67b
Add exception for "gonna"
2016-12-17 11:54:59 +01:00
Ines Montani
5031a015e2
Fix typo in stopwords (fixes #689)
2016-12-15 17:57:06 +01:00
Janneke van der Zwaan
4a3fdcce8a
Merge github.com:explosion/spaCy into dutch
2016-12-13 09:25:23 +01:00
Matthew Honnibal
5965d3c2a7
Revert "Add acl to symbols.pyx"
2016-12-12 10:10:28 +11:00
Matthew Honnibal
6dee76dfed
Update symbols.pxd
2016-12-12 10:09:58 +11:00
Pokey Rule
18a15c0777
Add acl to symbols.pyx
2016-12-11 20:00:07 +00:00
Gyorgy Orosz
0cf2144d24
Adding partial hyphen and quote handling support.
2016-12-11 00:14:36 +01:00
Gyorgy Orosz
2051726fd3
Passing Hungarian abbrev tests.
2016-12-10 23:37:58 +01:00
Ines Montani
63024466a9
Add Portuguese stopwords
2016-12-08 20:45:07 +01:00
Ines Montani
7bfe2d4abc
Update Portuguese language data
2016-12-08 20:41:41 +01:00
Ines Montani
c0c5f31950
Remove unused data and download script
2016-12-08 20:39:49 +01:00
Ines Montani
0a6d529104
Remove unused data
2016-12-08 20:36:56 +01:00
Ines Montani
1b3b043660
Add French stopwords
2016-12-08 20:12:43 +01:00
Ines Montani
8863e504eb
Update French language data
2016-12-08 20:07:14 +01:00
Ines Montani
7cb9f51be6
Add Italian stopwords
2016-12-08 20:05:25 +01:00
Ines Montani
470a0e0bea
Update Italian language data
2016-12-08 19:52:18 +01:00
Ines Montani
1a284d342e
Add Spanish language data
2016-12-08 19:47:03 +01:00
Ines Montani
0c39654786
Remove unused import
2016-12-08 19:46:53 +01:00
Ines Montani
e47ee94761
Split punctuation into its own file
2016-12-08 19:46:43 +01:00
Ines Montani
70b51ed7c8
Remove time from German language data
2016-12-08 19:45:50 +01:00
Ines Montani
e8ae588be9
Add emoticons
2016-12-08 19:45:18 +01:00
Ines Montani
5908c0ed9f
Fix formatting
2016-12-08 19:45:11 +01:00
Ines Montani
311b30ab35
Reorganize exceptions for English and German
2016-12-08 13:58:32 +01:00
Ines Montani
66c7348cda
Add update_exc util function
2016-12-08 13:58:12 +01:00
Ines Montani
1256232fad
Fix formatting
2016-12-08 13:56:40 +01:00
Ines Montani
8e977cc71c
Fix formatting
2016-12-08 13:56:17 +01:00
Ines Montani
0176b99004
Fix formatting
2016-12-08 12:48:02 +01:00
Ines Montani
877f09218b
Add more custom rules for abbreviations
2016-12-08 12:47:01 +01:00
Gyorgy Orosz
0289b8ceaa
Additional abbreviation tests.
2016-12-08 12:17:44 +01:00
Gyorgy Orosz
90d22db023
Added Hungarian resource files.
2016-12-08 12:06:36 +01:00
Ines Montani
bfaa42636c
Update language data for German
2016-12-08 12:01:09 +01:00
Ines Montani
ec44bee321
Fix capitalization on morphological features
2016-12-08 12:00:54 +01:00
Gyorgy Orosz
5b00039955
First steps towards the Hungarian tokenizer code.
2016-12-07 23:07:43 +01:00
Ines Montani
ce979553df
Resolve conflict
2016-12-07 21:16:52 +01:00
Ines Montani
8350d65695
Change morphology and lemmatizer API
...
Take morphology features as object instead of keyword arguments
2016-12-07 21:12:49 +01:00
Ines Montani
52e7d634df
Remove trailing whitespace
2016-12-07 21:12:19 +01:00
Ines Montani
0d07d7fc80
Apply emoticon exceptions to tokenizer
2016-12-07 21:11:59 +01:00
Ines Montani
71f0f34cb3
Fix formatting
2016-12-07 21:11:29 +01:00
Ines Montani
9413bcd9ee
Declare encoding and unicode literals
2016-12-07 21:10:34 +01:00
Ines Montani
a280ff2657
Fix __all__
2016-12-07 21:10:12 +01:00
Ines Montani
ba8721953c
Add missing emoticons
2016-12-07 21:09:44 +01:00
Ines Montani
1285c4ba93
Update English language data
2016-12-07 20:33:28 +01:00
Ines Montani
79dce0aabe
Add emoticons
2016-12-07 20:33:28 +01:00
Ines Montani
a662a95294
Add line breaks
2016-12-07 20:33:28 +01:00
Ines Montani
07f0efb102
Add test for tokenizer regular expressions
2016-12-07 20:33:28 +01:00
Ines Montani
e0712d1b32
Reformat language data
2016-12-07 20:33:28 +01:00
Matthew Honnibal
0c0f4c965d
Increment version
2016-12-03 11:16:52 +01:00
Matthew Honnibal
f6e356aada
Add (and test) Span.sentiment attribute. By default we average token.sentiment, but can override with custom hook. Re Issue #667
2016-12-02 11:05:50 +01:00
Janneke van der Zwaan
88869e0e07
Merge github.com:explosion/spaCy into dutch
2016-11-30 17:13:39 +01:00
Janneke van der Zwaan
51ade86b86
Update language data with tag map from UD_Dutch
2016-11-30 14:41:23 +01:00
Janneke van der Zwaan
90f6ff12c9
Update Dutch language data
...
- Use Dutch tag map
- remove tokenizer exceptions
2016-11-30 11:59:39 +01:00
dafnevk
7b8f4c49f2
Added language Dutch to init file
2016-11-29 16:42:05 +01:00
Matthew Honnibal
296d33a4fc
Merge branch 'master' of ssh://github.com/explosion/spaCy
2016-11-26 12:36:18 +01:00
Matthew Honnibal
1f6c37c6f5
Fix create_tokenizer when nlp is None
2016-11-26 12:36:04 +01:00
Matthew Honnibal
c7889492f9
Fix model saving error for Python 3
2016-11-25 18:04:30 -06:00
Matthew Honnibal
bc0a202c9c
Fix unicode problem in nonproj module
2016-11-25 17:29:17 -06:00
Matthew Honnibal
6dd3b94fa6
Filter out deprecated attributes when reading special-case tokenization rules.
2016-11-25 09:57:18 -06:00
Matthew Honnibal
e879c79b8c
Merge branch 'master' of https://github.com/explosion/spaCy
2016-11-25 09:18:28 -06:00
Matthew Honnibal
a335c6dcc2
Exclude morphs from deprecated token attributes for now
2016-11-25 16:17:32 +01:00
Matthew Honnibal
f799a07f25
Merge branch 'master' of https://github.com/explosion/spaCy
2016-11-25 09:16:43 -06:00
Matthew Honnibal
159e8c46e1
Merge old training fixes with newer state
2016-11-25 09:16:36 -06:00
Matthew Honnibal
846e80f2f4
Exclude morphs from deprecated token attributes for now
2016-11-25 16:14:54 +01:00
Matthew Honnibal
664f2dd1c0
Allow dep to be None in scorer, for missing labels.
2016-11-25 09:02:49 -06:00
Matthew Honnibal
39341598bb
Fix NER label calculation
2016-11-25 09:02:22 -06:00
Matthew Honnibal
ca773a1f53
Tweak arc_eager n_gold to deal with negative costs, and improve error message.
2016-11-25 09:01:52 -06:00
Matthew Honnibal
a2f55e7015
Pass cfg through loading, for training.
2016-11-25 09:01:20 -06:00
Matthew Honnibal
608d8f5421
Pass cfg through parser, and have is_valid default to 1, not 0 when resetting state
2016-11-25 09:00:21 -06:00
Matthew Honnibal
cc7e607a8a
Fix gold.pyx for 1.0
2016-11-25 08:57:59 -06:00
root
080d29e092
Fix train.py for 1.0
2016-11-25 08:55:33 -06:00
Matthew Honnibal
6652f2a135
Test #656, #624: special case rules for tokenizer with attributes.
2016-11-25 12:44:13 +01:00
Matthew Honnibal
1e0f566d95
Fix #656, #624: Support arbitrary token attributes when adding special-case rules.
2016-11-25 12:43:24 +01:00
Matthew Honnibal
87613edf8f
Add set_struct_attr staticmethod to token
2016-11-25 12:41:47 +01:00
Matthew Honnibal
fb69aa648f
Merge branch 'master' of ssh://github.com/explosion/spaCy
2016-11-25 11:35:44 +01:00
Matthew Honnibal
9a03a3f85e
Add get_struct_attr staticmethod to Token, to match Lexeme.get_struct_attr.
2016-11-25 11:35:17 +01:00
Matthew Honnibal
53d8ca8f51
Add spacy.attrs.intify_attrs function, to normalize strings in token attribute dictionaries.
2016-11-25 11:34:30 +01:00
Ines Montani
d21ad01840
Add emoticons
2016-11-24 19:13:00 +01:00
dafnevk
d8c7ac203a
Added nl module for dutch
2016-11-24 16:39:49 +01:00
dafnevk
3db8b0d322
Added language class and some language data (with some TODOs) for Dutch
2016-11-24 15:56:38 +01:00
Ines Montani
4dcfafde02
Add line breaks
2016-11-24 14:57:37 +01:00
Ines Montani
6247c005a2
Add test for tokenizer regular expressions
2016-11-24 13:51:59 +01:00
Ines Montani
de747e39e7
Reformat language data
2016-11-24 13:51:32 +01:00
Matthew Honnibal
b8c4f5ea76
Allow German noun chunks to work on Span
...
Update the German noun chunks iterator, so that it also works on Span objects.
2016-11-24 23:30:15 +11:00
Pokey Rule
3e3bda142d
Add noun_chunks to Span
2016-11-24 10:47:20 +00:00
Janneke van der Zwaan
83daade0e4
Add directory and initial (empty) files for language Dutch
2016-11-24 09:45:41 +01:00
Matthew Honnibal
09f68bc641
Fix Issue #639 : stop words in language class not used. This patch is messy, but it's better not to change too much until the language data loading can be properly refactored.
2016-11-24 00:13:55 +01:00
Matthew Honnibal
48e1dc29d4
Fix default path loading.
2016-11-23 23:48:55 +01:00
Matthew Honnibal
e01c1875ee
Work on test for #615
2016-11-23 23:48:41 +01:00
ExplodingCabbage
6c4f488e89
Fix syntax mistake
2016-11-23 15:12:45 +00:00
Matthew Honnibal
60eb2343ce
Only try to load vectors if they exist.
2016-11-23 13:50:24 +01:00
Matthew Honnibal
618ac36093
Fix use of path argument in Language.__init__. Needs to be keyword arg, not positional.
2016-11-23 13:26:34 +01:00
Mark Amery
fbe19680a6
Fix another bug related to Language.__init__'s path parameter
2016-11-20 20:31:34 +00:00
Mark Amery
b0a07c21a0
Fix `path` param of Language.__init__ always being ignored
...
There was an explicitly-declared `path` keyword argument, so 'path'
would never be present in `**overrides`. This line just overwrote
any manually-specified value the user might've passed to the `path`
parameter.
2016-11-20 16:29:57 +00:00
Mark Amery
1988fce389
Merge remote-tracking branch 'origin/master' into specify-data-path
2016-11-20 16:07:14 +00:00
Mark Amery
3871007c72
Let --data-path be specified when running download.py scripts
...
Resolves https://github.com/explosion/spaCy/issues/637
2016-11-20 15:48:04 +00:00
Ines Montani
dad2c6cae9
Strip trailing whitespace
2016-11-20 16:45:51 +01:00
Ines Montani
3082e49326
Update and reformat German stopwords
2016-11-20 16:45:26 +01:00
Sourav Singh
6745eac309
Update language_data.py
2016-11-20 19:52:02 +05:30
Sourav Singh
4d9aae7d6a
Add German Stopwords
2016-11-19 22:47:53 +05:30
Matthew Honnibal
7afb2544a7
Merge pull request #627 from sadovnychyi/patch-1
...
Remove duplicated line of vocab declaration
2016-11-16 06:09:18 +11:00
Yanhao
762169da29
Fixed bug: eg.guess is a tag id, rather than tag
2016-11-15 14:11:22 +08:00
Dmytro Sadovnychyi
e70a7050e1
Remove duplicated line of vocab declaration
...
As already declared on line 211.
2016-11-13 18:52:49 +08:00
Matthew Honnibal
f123f92e0c
Fix #617: Vocab.load() required Path. Should work with string as well.
2016-11-10 22:48:48 +01:00
Matthew Honnibal
e86f440ca6
Fix test for issue 617
2016-11-10 22:48:10 +01:00
Matthew Honnibal
faa7610c56
Merge branch 'master' of ssh://github.com/explosion/spaCy
2016-11-10 22:46:38 +01:00
Matthew Honnibal
a2c7de8329
spacy/tests/regression/test_issue617.py
...
Test Issue #617
2016-11-10 22:46:23 +01:00
tiago
2a3e342c1f
Added a test case to cover the span.merge returning values
2016-11-09 18:57:50 +00:00
tiago
b38cfd0ef9
now span.merge returns token like it says on documentation
2016-11-09 14:58:19 +00:00
Dmitry Sadovnychyi
9488222e79
Fix PhraseMatcher to work with updated Matcher
...
#613
2016-11-09 00:14:26 +08:00
Dmitry Sadovnychyi
86c056ba64
Add basic test for PhraseMatcher
...
#613
2016-11-09 00:10:32 +08:00
Matthew Honnibal
3ea15b257f
Fix test for 605
2016-11-06 11:59:26 +01:00
Matthew Honnibal
efe7790439
Test #590: Order dependence in Matcher rules.
2016-11-06 11:21:36 +01:00
Matthew Honnibal
5cd3acb265
Fix #605: Acceptor now rejects matches as expected.
2016-11-06 10:50:42 +01:00
Matthew Honnibal
75805397dd
Test Issue #605
2016-11-06 10:42:32 +01:00
Matthew Honnibal
014b6936ac
Fix #608 -- __version__ should be available at the base of the package.
2016-11-04 21:21:02 +01:00
Matthew Honnibal
42b0736db7
Increment version
2016-11-04 20:04:21 +01:00
Matthew Honnibal
9f93386994
Update version
2016-11-04 19:28:16 +01:00
Matthew Honnibal
1fb09c3dc1
Fix morphology tagger
2016-11-04 19:19:09 +01:00
Matthew Honnibal
a36353df47
Temporarily put back the tokenize_from_strings method, while tests aren't updated yet.
2016-11-04 19:18:07 +01:00
Matthew Honnibal
f0917b6808
Fix Issue #376: and/or was tagged as a noun.
2016-11-04 15:21:28 +01:00
Matthew Honnibal
737816e86e
Fix #368: Tokenizer handled pattern 'unicode close quote, period' incorrectly.
2016-11-04 15:16:20 +01:00
Matthew Honnibal
ab952b4756
Fix #578 -- Sputnik had been purging all files on --force, not just the relevant one.
2016-11-04 10:44:11 +01:00
Matthew Honnibal
6e37ba1d82
Fix #602, #603 --- Broken build
2016-11-04 09:54:24 +01:00
Matthew Honnibal
293c79c09a
Fix #595: Lemmatization was incorrect for base forms, because morphological analyser wasn't adding morphology properly.
2016-11-04 00:29:07 +01:00
Matthew Honnibal
e30348b331
Prefer to import from symbols instead of parts_of_speech
2016-11-04 00:27:55 +01:00
Matthew Honnibal
4a8a2b6001
Test #595 -- Bug in lemmatization of base forms.
2016-11-04 00:27:32 +01:00
Matthew Honnibal
f1605df2ec
Fix #588: Matcher should reject empty pattern.
2016-11-03 00:16:44 +01:00
Matthew Honnibal
72b9bd57ec
Test Issue #588: Matcher accepts invalid, empty patterns.
2016-11-03 00:09:35 +01:00
Matthew Honnibal
41a90a7fbb
Add tokenizer exception for 'Ph.D.', to fix 592.
2016-11-03 00:03:34 +01:00
Matthew Honnibal
532318e80b
Import Jieba inside zh.make_doc
2016-11-02 23:49:19 +01:00
Matthew Honnibal
f292f7f0e6
Fix Issue #599, by considering empty documents to be parsed and tagged. Implementation is a bit dodgy.
2016-11-02 23:48:43 +01:00
Matthew Honnibal
b6b01d4680
Remove deprecated tokens_from_list test.
2016-11-02 23:47:21 +01:00
Matthew Honnibal
3d6c79e595
Test Issue #599: .is_tagged and .is_parsed attributes not reflected after deserialization for empty documents.
2016-11-02 23:40:11 +01:00
Matthew Honnibal
05a8b752a2
Fix Issue #600: Missing setters for Token attribute.
2016-11-02 23:28:59 +01:00
Matthew Honnibal
125c910a8d
Test Issue #600
2016-11-02 23:24:13 +01:00
Matthew Honnibal
e0c9695615
Fix doc strings for tokenizer
2016-11-02 23:15:39 +01:00
Matthew Honnibal
80824f6d29
Fix test
2016-11-02 20:48:40 +01:00
Matthew Honnibal
dbe47902bc
Add import fr
2016-11-02 20:48:29 +01:00
Matthew Honnibal
8f24dc1982
Fix infixes in Italian
2016-11-02 20:43:52 +01:00
Matthew Honnibal
41a4766c1c
Fix infixes in spanish and portuguese
2016-11-02 20:43:12 +01:00
Matthew Honnibal
3d4bd96e8a
Fix infixes in french
2016-11-02 20:41:43 +01:00
Matthew Honnibal
c09a8ce5bb
Add test for french tokenizer
2016-11-02 20:40:31 +01:00
Matthew Honnibal
b012ae3044
Add test for loading languages
2016-11-02 20:38:48 +01:00
Matthew Honnibal
ad1c747c6b
Fix stray POS in language stubs
2016-11-02 20:37:55 +01:00
Matthew Honnibal
e9e6fce576
Handle null prefix/suffix/infix search in tokenizer
2016-11-02 20:35:48 +01:00
Matthew Honnibal
22647c2423
Check that patterns aren't null before compiling regex for tokenizer
2016-11-02 20:35:29 +01:00
Matthew Honnibal
5ac735df33
Link languages in __init__.py
2016-11-02 20:05:14 +01:00
Matthew Honnibal
c68dfe2965
Stub out support for Italian
2016-11-02 20:03:24 +01:00
Matthew Honnibal
6dbf4f7ad7
Stub out support for French, Spanish, Italian and Portuguese
2016-11-02 20:02:41 +01:00
Matthew Honnibal
6b8b05ef83
Specify that spacy.util is encoded in utf8
2016-11-02 19:58:00 +01:00
Matthew Honnibal
5363224395
Add draft Jieba tokenizer for Chinese
2016-11-02 19:57:38 +01:00
Matthew Honnibal
f7fee6c24b
Check for class-defined make_docs method before assigning one provided as an argument
2016-11-02 19:57:13 +01:00
Matthew Honnibal
19c1e83d3d
Work on draft Italian tokenizer
2016-11-02 19:56:32 +01:00
Matthew Honnibal
9efe568177
Add missing unicode_literals to spacy.util. I think this was messing up the tokenizer regex for non-ascii characters in Python 2. Re Issue #596
2016-11-02 12:31:34 +01:00
Matthew Honnibal
d8db648ebf
Add __init__.py file for regression tests
2016-11-01 13:45:06 +01:00
Matthew Honnibal
11664b9f20
Fix variable error in token
2016-11-01 13:28:00 +01:00
Matthew Honnibal
8c4d1b46ce
Fix variable error in Span
2016-11-01 13:27:44 +01:00
Matthew Honnibal
e7af6b937f
Fix syntax error while fixing doc strings
2016-11-01 13:27:32 +01:00
Matthew Honnibal
62fc6b1afa
Use 32 bit hashes for OOV, re Issue #589, Issue #285
2016-11-01 13:27:13 +01:00
Matthew Honnibal
6977a2b8cd
Add test for Issue #589
2016-11-01 12:33:36 +01:00
Matthew Honnibal
b86f8af0c1
Fix doc strings
2016-11-01 12:25:36 +01:00
Matthew Honnibal
d563f1eadb
Fix Issue #587: Segfault in Matcher, due to simple error in the state machine.
2016-10-28 17:42:00 +02:00
Matthew Honnibal
7e5f63a595
Improve test slightly
2016-10-28 17:41:16 +02:00
Matthew Honnibal
782e4814f4
Test Issue #587: Matcher segfaults on particular input
2016-10-28 16:38:32 +02:00
Matthew Honnibal
708ea22208
Infer types in transition_system.pyx
2016-10-27 18:08:13 +02:00
Matthew Honnibal
18590eba94
Fix training evaluate method
2016-10-27 18:02:19 +02:00
Matthew Honnibal
301f3cc898
Fix Issue #429. Add an initialize_state method to the named entity recogniser that adds missing entity types. This is a messy place to add this, because it's strange to have the method mutate state. A better home for this logic could be found.
2016-10-27 18:01:55 +02:00
Matthew Honnibal
afea6505f3
Test Issue 429: No valid actions for NER after matcher adds a new entity label.
2016-10-27 18:01:34 +02:00
Matthew Honnibal
03a520ec4f
Change signature of Parser.parseC, so that nr_class is read from the transition system. This allows the transition system to modify the number of actions in initialize_state.
2016-10-27 17:58:56 +02:00
Matthew Honnibal
6c47048912
Fix test, after IOB tweak.
2016-10-26 17:22:03 +02:00
Matthew Honnibal
4ca31b4d87
Fix clobbering of 'missing' named ent values after assigning ents.
2016-10-26 13:13:56 +02:00
Matthew Honnibal
cb49189477
Remove dead code
2016-10-26 13:11:07 +02:00
Matthew Honnibal
a209b10579
Improve error message when oracle fails for non-projective trees, re Issue #571.
2016-10-24 20:31:30 +02:00
Matthew Honnibal
b2d43b93d2
Fix Python 3 basestring error
2016-10-24 14:22:51 +02:00
Matthew Honnibal
276478fe0f
Update strings.pxd
2016-10-24 14:00:35 +02:00
Matthew Honnibal
d8134817ff
Workaround Issue #285: Allow the StringStore to be 'frozen', in which case strings will be pushed into an OOV map. We can then flush this OOV map, freeing all of the OOV strings.
2016-10-24 13:49:03 +02:00
Matthew Honnibal
d3a617aa99
Test workaround for Issue #285: Streaming data memory growth
2016-10-24 13:48:06 +02:00
Matthew Honnibal
64e5f02cf7
Update test
2016-10-23 21:08:07 +02:00
Matthew Honnibal
66d7a6eca2
Update test
2016-10-23 21:02:05 +02:00
Matthew Honnibal
90bf797125
Update test
2016-10-23 20:54:17 +02:00
Matthew Honnibal
5e76320ffe
Update test
2016-10-23 20:44:54 +02:00
Matthew Honnibal
aa105927f3
Update test
2016-10-23 20:31:25 +02:00
Matthew Honnibal
6b9237aa83
Increment version
2016-10-23 20:22:53 +02:00
Matthew Honnibal
150e02d72e
Fix Issue #566
2016-10-23 20:19:01 +02:00
Matthew Honnibal
e120561294
Fix vector_norm test.
2016-10-23 19:56:16 +02:00
Matthew Honnibal
fefde8aef8
Make installation print data path.
2016-10-23 19:46:44 +02:00
Matthew Honnibal
e7414cd064
Try to fix weird install glitch.
2016-10-23 19:46:28 +02:00
Matthew Honnibal
90f7544edd
Increment version
2016-10-23 19:43:06 +02:00
Matthew Honnibal
6036ec7c77
Fix vector norm when loading lexemes.
2016-10-23 19:40:18 +02:00
Matthew Honnibal
c05cd2356e
Fix similarity test for Python 3
2016-10-23 18:16:56 +02:00
Matthew Honnibal
3e688e6d4b
Fix issue #514 -- serializer fails when new entity type has been added. The fix here is quite ugly. It's best to add the entities ASAP after loading the NLP pipeline, to mitigate the brittleness.
2016-10-23 17:45:44 +02:00
Matthew Honnibal
79aa03fe98
Test Issue #514: Serializer fails when new entity type has been added.
2016-10-23 17:41:44 +02:00
Matthew Honnibal
f97548c6f1
Fix broken test, re Issue #461
2016-10-23 17:02:23 +02:00
Matthew Honnibal
4de30a8e38
Test Issue #514: Serialization fails after adding a new entity label.
2016-10-23 16:40:27 +02:00
Matthew Honnibal
936e6246aa
Fix Issue #459 -- failed to deserialize empty doc.
2016-10-23 16:31:05 +02:00
Matthew Honnibal
e99b3f5322
Test Issue #459: Fail to deserialize empty doc
2016-10-23 16:30:22 +02:00
Matthew Honnibal
49c117960c
Fix bug where huffman codec died if given empty freqs dict.
2016-10-23 16:28:05 +02:00
Matthew Honnibal
99ff8b902f
Test that huffman codec works with empty freqs dict
2016-10-23 16:27:45 +02:00
Matthew Honnibal
15c9b59f0e
Fix Issue #461: O tag was being clobbered by doc.ents.__set__
2016-10-23 15:50:26 +02:00
Matthew Honnibal
e5627134d9
Test Issue #461: ent_iob tag incorrect after setting entities.
2016-10-23 15:50:04 +02:00
Matthew Honnibal
f62088d646
Fix compile error
2016-10-23 14:50:50 +02:00
Matthew Honnibal
2c3a67b693
Fix calculation of vector norm, re Issue #522. Need to consolidate the calculations into a helper function.
2016-10-23 14:49:31 +02:00
Matthew Honnibal
a0a4ada42a
Fix calculation of L2-norm for Lexeme
2016-10-23 14:44:45 +02:00
Matthew Honnibal
2989072aac
Add tests to verify that Issue #442 is fixed in 1.1
2016-10-23 14:33:13 +02:00
Matthew Honnibal
739213a8af
Fix create_pipeline keyword argument.
2016-10-23 14:24:16 +02:00
Matthew Honnibal
bea44bd3c4
Fix vector_norm when vector is assigned to Lexeme.
2016-10-23 14:23:56 +02:00
Matthew Honnibal
e838b6d53f
Add tests for using the new Entity ID tracking in the rule matcher
2016-10-23 14:04:01 +02:00
Matthew Honnibal
e7af75e0a9
Add test for vector resizing, re Issue #544
2016-10-21 17:07:21 +02:00
Matthew Honnibal
ca8ea33abc
Bump version to 1.1.0
2016-10-21 16:30:57 +02:00
Matthew Honnibal
7ab03050d4
Add resize_vectors method to Vocab
2016-10-21 01:44:50 +02:00
Matthew Honnibal
8ce8803824
Fix JSON in tokenizer
2016-10-21 01:44:20 +02:00
Matthew Honnibal
6eb73a095f
Fix JSON in tagger
2016-10-21 01:44:10 +02:00
Matthew Honnibal
e16e78a737
Merge branch 'master' of ssh://github.com/explosion/spaCy
2016-10-21 00:00:15 +02:00
Matthew Honnibal
147373c807
Increment version
2016-10-21 00:00:03 +02:00
Matthew Honnibal
e80944276f
Fix Span.vector_norm
2016-10-20 21:58:56 +02:00
Matthew Honnibal
f5fe4f595b
Fix json loading, for Python 3.
2016-10-20 21:23:26 +02:00
Matthew Honnibal
2e92c6fb3a
Fix JSON encoding issue on load
2016-10-20 21:06:48 +02:00
Matthew Honnibal
4ad7bb96c9
Increment version.
2016-10-20 20:48:30 +02:00
Matthew Honnibal
5ec32f5d97
Fix loading of GloVe vectors, to address Issue #541
2016-10-20 18:27:48 +02:00
Matthew Honnibal
ddeabd76c4
Fix mistake loading GloVe vectors. GloVe vectors now loaded by default if present, as promised.
2016-10-20 16:57:53 +02:00
Matthew Honnibal
bfe5cb1244
Increment version.
2016-10-20 14:52:00 +02:00
Matthew Honnibal
f189a3cb00
Fix encoding when opening files in Python 2.7, re Issue #539
2016-10-20 14:42:56 +02:00
Matthew Honnibal
c353a5214d
Increment version
2016-10-19 23:51:01 +02:00
Matthew Honnibal
d10c17f2a4
Fix Issue #536: oov_prob was 0 for OOV words.
2016-10-19 23:38:47 +02:00
Matthew Honnibal
dfa752d064
Increment version
2016-10-19 23:19:13 +02:00
Matthew Honnibal
3588a18fb8
Fix hook names in doc
2016-10-19 21:15:16 +02:00
Matthew Honnibal
5d5742b773
Add sentiment field to doc, rename getters_for_tokens and getters_for_spans, add user_hooks field to Doc.
2016-10-19 20:54:22 +02:00
Matthew Honnibal
ed5e178817
Add sentiment property on lexeme object
2016-10-19 20:52:52 +02:00
Matthew Honnibal
d4aaf2752c
Fix issue #535: Pipeline elements added even when data not installed.
2016-10-19 19:55:19 +02:00
Matthew Honnibal
04d1c959da
Fix version
2016-10-19 03:45:37 +02:00
Matthew Honnibal
d35aa7344e
Change version ID to make PyPi happy
2016-10-19 03:24:39 +02:00
Matthew Honnibal
89d2a5c8b3
Increment build version.
2016-10-19 03:05:17 +02:00
Matthew Honnibal
622b0a9674
Tweak download script
2016-10-19 00:52:16 +02:00
Matthew Honnibal
5a5c7192a5
Fix download.py for GloVe vectors.
2016-10-19 00:47:44 +02:00
Matthew Honnibal
edc45c19d6
Update download script
2016-10-19 00:41:14 +02:00
Matthew Honnibal
2bbb050500
Fix default of serializer_freqs
2016-10-18 19:55:41 +02:00
Matthew Honnibal
1b651db9c5
Fix parser creation in Language class.
2016-10-18 19:36:44 +02:00
Matthew Honnibal
45a6f9b9c7
Fix loading of tagger.
2016-10-18 19:33:04 +02:00
Matthew Honnibal
76c815f40d
Fix spacy.load
2016-10-18 19:23:31 +02:00
Matthew Honnibal
8c8f5c62c6
Add LANG attribute to English and German
2016-10-18 18:52:48 +02:00
Matthew Honnibal
05e2a589a4
Fix None label in matcher
2016-10-18 18:05:21 +02:00
Matthew Honnibal
c3a8a1cf51
Update serializer test.
2016-10-18 16:18:46 +02:00
Matthew Honnibal
7d5212f131
Refactor defaults
2016-10-18 16:18:25 +02:00
Matthew Honnibal
a45a9d5092
Remove stray .tensor attribute from Lexeme
2016-10-18 01:16:32 +02:00
Matthew Honnibal
9258db788a
Revert "Have the matcher return character offsets, to handle the match better."
...
This reverts commit 049c937540.
2016-10-17 16:49:51 +02:00
Matthew Honnibal
7d446e5094
Revert "Update matcher test, to reflect character offset return instead of token offset."
...
This reverts commit f8d3e3bcfe.
2016-10-17 16:49:49 +02:00
Matthew Honnibal
4bf2c53c13
Revert "Hack on matcher tests, for new implementation."
...
This reverts commit dbe60644ab.
2016-10-17 16:49:48 +02:00
Matthew Honnibal
2fd97c71cc
Revert "Don't try to pickle matcher."
...
This reverts commit 97bd0c9d00.
2016-10-17 16:49:43 +02:00
Matthew Honnibal
97bd0c9d00
Don't try to pickle matcher.
2016-10-17 16:38:40 +02:00
Matthew Honnibal
dbe60644ab
Hack on matcher tests, for new implementation.
2016-10-17 16:12:22 +02:00
Matthew Honnibal
f8d3e3bcfe
Update matcher test, to reflect character offset return instead of token offset.
2016-10-17 16:00:10 +02:00
Matthew Honnibal
049c937540
Have the matcher return character offsets, to handle the match better.
2016-10-17 15:58:57 +02:00
Matthew Honnibal
9b60186266
Fix doc class
2016-10-17 15:23:47 +02:00
Matthew Honnibal
6cbdc94959
Lots of updates to Matcher, to make entity handling sane.
2016-10-17 15:23:31 +02:00
Matthew Honnibal
7fd98fc91c
Remove deprecation shim around str/bytes in Token.
2016-10-17 14:02:47 +02:00
Matthew Honnibal
b67697a97b
Improve API for doc.merge() and span.merge(), to use keyword arguments.
2016-10-17 14:02:13 +02:00
Matthew Honnibal
fbb7f3f15c
Add user_data attribute to Doc object.
2016-10-17 11:43:22 +02:00
Matthew Honnibal
c1abc8f6ed
Fix deprecation stuff in Token: Remove the shim for the str/unicode semantics, and raise for has_repvec and repvec
2016-10-17 11:18:41 +02:00
Matthew Honnibal
4ba9eadf3d
Merge branch 'v1.0.0-rc1' of ssh://github.com/explosion/spaCy into v1.0.0-rc1
2016-10-17 02:45:44 +02:00
Matthew Honnibal
09ab447a18
Remove tensor property from token.
2016-10-17 02:45:09 +02:00
Matthew Honnibal
5d10e2005c
Defer some attributes to Doc, via getters_for_tokens attribute.
2016-10-17 02:44:49 +02:00
Matthew Honnibal
8829984efb
Remove tensor attribute from Span and Token.
2016-10-17 02:44:04 +02:00
Matthew Honnibal
d15a88c66a
Defer some attributes to Doc via getters_for_spans
2016-10-17 02:43:35 +02:00
Matthew Honnibal
62230dd13a
Add getters_for_spans and getters_for_tokens attributes to Doc. Fix docstring
2016-10-17 02:42:51 +02:00
Matthew Honnibal
ae11ea8240
Add getters_for_tokens and getters_for_spans attributes to Doc object.
2016-10-17 02:42:05 +02:00
Matthew Honnibal
be48a7b4f3
Fix conftest for website tests.
2016-10-17 01:54:26 +02:00
Matthew Honnibal
8951bf6989
Update matcher tests
2016-10-17 01:53:24 +02:00
Matthew Honnibal
0cf4aff470
Set default path in EN/DE tests.
2016-10-17 01:52:49 +02:00
Matthew Honnibal
cd71b6b0a9
Remove test of parser pickle
2016-10-17 01:52:10 +02:00
Matthew Honnibal
5bc101006e
Add cfg field to Tagger
2016-10-17 01:03:41 +02:00
Matthew Honnibal
517f090cbf
Use GoldParse in tagger.update
2016-10-17 00:55:15 +02:00
Matthew Honnibal
59038f7efa
Restore support for prior data format -- specifically, the labels field of the config.
2016-10-17 00:53:26 +02:00
Matthew Honnibal
7887ab3b36
Fix default use of feature_templates in parser
2016-10-16 21:41:56 +02:00
Matthew Honnibal
f787cd29fe
Refactor the pipeline classes to make them more consistent, and remove the redundant blank() constructor.
2016-10-16 21:34:57 +02:00
Matthew Honnibal
311a985fe0
Add input error handling in Doc
2016-10-16 18:16:42 +02:00
Matthew Honnibal
06322ba99d
Add words and spaces keyword arguments to Doc.
2016-10-16 18:13:03 +02:00
Matthew Honnibal
ca51f3b77e
Use DependencyParser and EntityRecognizer in the Language class.
2016-10-16 17:58:12 +02:00
Matthew Honnibal
195d998a12
Fix GoldParse argument to tagger.update
2016-10-16 17:05:09 +02:00
Matthew Honnibal
274a4d4272
Fix queue Python property in StateClass
2016-10-16 17:04:41 +02:00
Matthew Honnibal
e8c8aa08ce
Make action_name optional in StepwiseState
2016-10-16 17:04:16 +02:00
Matthew Honnibal
4bb73b1a93
Fix parser labels in pipeline
2016-10-16 17:03:22 +02:00
Matthew Honnibal
a81c5a7abf
Fix name of labels keyword to 'actions'.
2016-10-16 12:00:27 +02:00
Matthew Honnibal
a079677984
Fix omission of O action when creating blank entity recognizer
2016-10-16 11:43:25 +02:00
Matthew Honnibal
5444d38cc6
Update test for biluo tags
2016-10-16 11:42:45 +02:00
Matthew Honnibal
4fc56d4a31
Rename 'labels' to 'actions' in parser options
2016-10-16 11:42:26 +02:00
Matthew Honnibal
8a6b35d266
Delay binding in MakeDoc
2016-10-16 11:41:55 +02:00
Matthew Honnibal
52b48b415e
Fix GoldParse class
2016-10-16 11:41:36 +02:00
Matthew Honnibal
3259a63779
Whitespace
2016-10-16 01:47:28 +02:00
Matthew Honnibal
509b30834f
Add a pipeline module, to collect and wrap processes for annotation
2016-10-16 01:47:12 +02:00
Matthew Honnibal
0317cea0ad
Fix GoldParse
2016-10-15 23:55:07 +02:00
Matthew Honnibal
1c62573a41
Fix spacy.train
2016-10-15 23:53:46 +02:00
Matthew Honnibal
a48aa15384
Improve the API for the GoldParse class.
2016-10-15 23:53:29 +02:00
Matthew Honnibal
e07fe92b27
Draft a refactored init for the GoldParse class
2016-10-15 22:09:52 +02:00
Matthew Honnibal
47afef7d6b
Add init.py for gold tests
2016-10-15 21:51:28 +02:00
Matthew Honnibal
86ae665c78
Add function for entity->biluo transformation
2016-10-15 21:51:04 +02:00
Matthew Honnibal
2163fd238f
Add tests for entity->biluo transformation
2016-10-15 21:50:43 +02:00
Matthew Honnibal
5e923b9bfa
Return None in match_best_version if not path exists.
2016-10-15 14:47:29 +02:00
Matthew Honnibal
2516382106
Fix loading of English in span test
2016-10-15 14:44:37 +02:00
Matthew Honnibal
dda2fc6bef
Add empty data directory
2016-10-15 14:25:25 +02:00
Matthew Honnibal
049197e0ae
Update tests, somewhat messily.
2016-10-15 14:14:04 +02:00
Matthew Honnibal
1e1a1d9517
Update matcher test
2016-10-15 14:13:41 +02:00
Matthew Honnibal
9cc9ce0f14
Load with default path=False in tests.
2016-10-15 14:13:23 +02:00
Matthew Honnibal
08e9134760
Change default value of path to True
2016-10-15 14:12:54 +02:00
Matthew Honnibal
788657f062
Ensure words are added to vocab before test, so that the lexicon is updated correctly.
2016-10-15 14:12:18 +02:00
Matthew Honnibal
4a1a2bce68
Update version in about.py
2016-10-15 13:44:27 +02:00
Matthew Honnibal
6d8cb515ac
Break the tokenization stage out of the pipeline into a function 'make_doc'. This allows all pipeline methods to have the same signature.
2016-10-14 17:38:29 +02:00
Matthew Honnibal
2cc515b2ed
Add add_flag method to Vocab, re Issue #504.
2016-10-14 12:15:38 +02:00
Matthew Honnibal
f3be9d0a9a
Add tensor field to Lexeme, Token, Doc and Span, so that users have a place to hang neural network outputs
2016-10-14 03:24:13 +02:00
Matthew Honnibal
9b55d97a8f
Update train method
2016-10-13 03:24:53 +02:00
Matthew Honnibal
645d99523a
Move merge_sents method into spacy.gold
2016-10-13 03:24:29 +02:00
Matthew Honnibal
41f88ce938
Fix dep model loading in parser
2016-10-12 20:26:38 +02:00
Matthew Honnibal
d9ae2d68af
Load features by string-name for backwards compatibility.
2016-10-12 20:15:11 +02:00
Matthew Honnibal
a42fbcf946
Require model for test_is_properties
2016-10-12 19:35:18 +02:00
Matthew Honnibal
20c948361b
Use local path in test_lemmatizer
2016-10-12 19:35:00 +02:00
Matthew Honnibal
1318d0bc65
Test with the non-loaded versions of the English and German pipelines.
2016-10-12 19:13:31 +02:00
Matthew Honnibal
0e2bedc373
Fix default labels for parser and NER
2016-10-12 19:12:40 +02:00
Matthew Honnibal
3a03c668c3
Fix message in ParserStateError
2016-10-12 14:44:31 +02:00
Matthew Honnibal
6bf505e865
Fix error on ParserStateError
2016-10-12 14:35:55 +02:00
Matthew Honnibal
ba5e048502
Add docstring for Trainer class.
2016-10-12 14:26:02 +02:00
Matthew Honnibal
847a4a4182
Refactor Language, dropping Language.blank() method.
2016-10-12 13:45:58 +02:00
Matthew Honnibal
ea23b64cc8
Refactor training, with new spacy.train module. Defaults still a little awkward.
2016-10-09 12:24:24 +02:00
Matthew Honnibal
ca32a1ab01
Revert "Work on Issue #285: intern strings into document-specific pools, to address streaming data memory growth. StringStore.__getitem__ now raises KeyError when it can't find the string. Use StringStore.intern() to get the old behaviour. Still need to hunt down all uses of StringStore.__getitem__ in library and do testing, but logic looks good."
This reverts commit 8423e8627f.
2016-09-30 20:20:22 +02:00
Matthew Honnibal
90baa9c7e6
Revert "Changes to matcher.pyx for new StringStore scheme"
This reverts commit 3ff09614e0.
2016-09-30 20:20:13 +02:00
Matthew Honnibal
1b6b129c04
Revert "Changes to morphology.pyx for new StringStore scheme"
This reverts commit 95f8cfd745.
2016-09-30 20:20:02 +02:00
Matthew Honnibal
1d70db58aa
Revert "Changes to iterators.pyx for new StringStore scheme"
This reverts commit 4f794b215a.
2016-09-30 20:19:53 +02:00
Matthew Honnibal
de01e427fd
Revert "Changes to strings.pyx for new StringStore scheme"
This reverts commit 22d4752d64.
2016-09-30 20:19:42 +02:00
Matthew Honnibal
9e09b39b9f
Revert "Changes to transition systems for new StringStore scheme"
This reverts commit 0442e0ab1e.
2016-09-30 20:11:49 +02:00
Matthew Honnibal
e3285f6f30
Revert "Fix report of ParserStateError"
This reverts commit 78f19baafa.
2016-09-30 20:11:33 +02:00
Matthew Honnibal
6736977d82
Revert "Changes to Doc and Token for new string store scheme"
This reverts commit 99de44d864.
2016-09-30 20:11:15 +02:00
Matthew Honnibal
bd7fe6420c
Revert "Changes to test for new string-store"
This reverts commit 21e90d7d0b.
2016-09-30 20:11:01 +02:00
Matthew Honnibal
1f1cd5013f
Revert "Changes to vocab for new stringstore scheme"
This reverts commit a51149a717.
2016-09-30 20:10:30 +02:00
Matthew Honnibal
1e7d0af127
Revert "Changes to Lexeme for new string store scheme"
This reverts commit 717741b6cf.
2016-09-30 20:10:13 +02:00
Matthew Honnibal
ba51cb8325
Revert "Changes to tagger for new string store scheme"
This reverts commit f5a6aac906.
2016-09-30 20:09:53 +02:00
Matthew Honnibal
23b7244842
Make sure symbols are unicode strings
2016-09-30 20:02:19 +02:00
Matthew Honnibal
f5a6aac906
Changes to tagger for new string store scheme
2016-09-30 20:01:51 +02:00
Matthew Honnibal
717741b6cf
Changes to Lexeme for new string store scheme
2016-09-30 20:01:36 +02:00
Matthew Honnibal
a51149a717
Changes to vocab for new stringstore scheme
2016-09-30 20:01:19 +02:00
Matthew Honnibal
21e90d7d0b
Changes to test for new string-store
2016-09-30 20:00:58 +02:00
Matthew Honnibal
99de44d864
Changes to Doc and Token for new string store scheme
2016-09-30 20:00:21 +02:00
Matthew Honnibal
78f19baafa
Fix report of ParserStateError
2016-09-30 19:59:22 +02:00
Matthew Honnibal
0442e0ab1e
Changes to transition systems for new StringStore scheme
2016-09-30 19:58:51 +02:00
Matthew Honnibal
22d4752d64
Changes to strings.pyx for new StringStore scheme
2016-09-30 19:58:09 +02:00
Matthew Honnibal
4f794b215a
Changes to iterators.pyx for new StringStore scheme
2016-09-30 19:57:49 +02:00
Matthew Honnibal
95f8cfd745
Changes to morphology.pyx for new StringStore scheme
2016-09-30 19:57:10 +02:00
Matthew Honnibal
3ff09614e0
Changes to matcher.pyx for new StringStore scheme
2016-09-30 19:56:48 +02:00
Matthew Honnibal
eceeaefe53
Fix defaults for Parser and Entity, adding a blank= argument.
2016-09-30 19:56:06 +02:00
Matthew Honnibal
8423e8627f
Work on Issue #285: intern strings into document-specific pools, to address streaming data memory growth. StringStore.__getitem__ now raises KeyError when it can't find the string. Use StringStore.intern() to get the old behaviour. Still need to hunt down all uses of StringStore.__getitem__ in library and do testing, but logic looks good.
2016-09-30 10:14:47 +02:00
Matthew Honnibal
d3dc5718b2
Fix syntax error in Doc
2016-09-28 11:39:49 +02:00
Matthew Honnibal
1b520e7bab
Improve docstrings for Doc object
2016-09-28 11:15:13 +02:00
Matthew Honnibal
81a47c01d8
Fix test for empty sentence string.
2016-09-27 19:21:22 +02:00
Matthew Honnibal
4cbf0d3bb6
Handle errors when no valid actions are available, pointing users to the issue tracker.
2016-09-27 19:19:53 +02:00
Matthew Honnibal
430473bd98
Raise errors when no actions are available, re Issue #429
2016-09-27 19:09:37 +02:00
Matthew Honnibal
fc4a7ad794
Test and fix Issue #411: IndexError when .sents property is used on empty string.
2016-09-27 18:49:14 +02:00
Matthew Honnibal
3d370b7d45
Add test for Issue #445, fixed in 3cb4d455d, with improved lemmatizer logic
2016-09-27 18:39:46 +02:00
Matthew Honnibal
a2f3510d6d
Fix lemmatizer
2016-09-27 17:47:05 +02:00
Matthew Honnibal
07776d8096
Fix pos name conflict in lemmatize
2016-09-27 17:35:58 +02:00
Matthew Honnibal
35cd953f9e
Fix pos name conflict with morphology
2016-09-27 14:16:22 +02:00
Matthew Honnibal
8e7df3c4ca
Expect the parser data, if parser.load() is called.
2016-09-27 14:02:12 +02:00
Matthew Honnibal
bb4f201ad2
Pass morphological features from tag map into the lemmatizer.
2016-09-27 14:01:43 +02:00
Matthew Honnibal
40509e8bca
Tweak the new is_base_form logic, because we can expect the 'pos' key in the morphology we're passed.
2016-09-27 14:01:16 +02:00
Matthew Honnibal
9c8ac91d72
Add test for Issue #435
2016-09-27 13:52:38 +02:00
Matthew Honnibal
3cb4d455d2
Pass lemmatizer morphological features, so that rules are sensitive to base/inflected distinction, which is how the WordNet data is designed. See Issue #435
2016-09-27 13:52:11 +02:00
Matthew Honnibal
e233328d38
Fix Issue #371: Lexeme objects were unhashable.
2016-09-27 13:22:30 +02:00
Matthew Honnibal
e382e48d9f
Temporarily patch handling of default templates for tagger. Need to move these to language_data.
2016-09-27 13:21:28 +02:00
Matthew Honnibal
a44763af0e
Fix Issue #469: Incorrectly cased root label in noun chunk iterator
2016-09-27 13:13:01 +02:00
Matthew Honnibal
b14b9b096b
Return None if /deps directory not present, instead of trying to load the parser.
2016-09-26 18:48:03 +02:00
Matthew Honnibal
e07b9665f7
Don't expect parser model
2016-09-26 18:09:33 +02:00
Matthew Honnibal
ee6fa106da
Fix parser features
2016-09-26 17:57:32 +02:00
Matthew Honnibal
e607e4b598
Fix parser loading
2016-09-26 17:51:11 +02:00
Matthew Honnibal
0b2d7ae9d6
Fix Entity creation
2016-09-26 15:41:22 +02:00
Matthew Honnibal
2debc4e0a2
Add .blank() method to Parser. Start housing default dep labels and entity types within the Defaults class.
2016-09-26 11:57:54 +02:00
Matthew Honnibal
722199acb8
Add spacy.blank() method, that doesn't load data. Don't try to load data if path is falsey
2016-09-26 11:07:46 +02:00
Matthew Honnibal
e56653f848
Add language data for German
2016-09-25 15:44:45 +02:00
Matthew Honnibal
7db956133e
Move tokenizer data for German into spacy.de.language_data
2016-09-25 15:37:33 +02:00
Matthew Honnibal
95aaea0d3f
Refactor so that the tokenizer data is read from Python data, rather than from disk
2016-09-25 14:49:53 +02:00
Matthew Honnibal
d7e9acdcdf
Add English language data, so that the tokenizer doesn't require the data download
2016-09-25 14:49:00 +02:00
Matthew Honnibal
82b8cc5efb
Whitespace
2016-09-24 22:17:01 +02:00
Matthew Honnibal
fd58f7655a
Python 3 compatible basestring
2016-09-24 22:16:43 +02:00
Matthew Honnibal
082e95b19e
Python 3 compatible basestring
2016-09-24 22:09:21 +02:00
Matthew Honnibal
f19af6cb2c
Python 3 compatible basestring
2016-09-24 22:08:43 +02:00
Matthew Honnibal
3ed4cdfe32
Handle pathlib.Path objects in CFile
2016-09-24 22:01:46 +02:00
Matthew Honnibal
df88690177
Fix encoding of path variable
2016-09-24 21:13:15 +02:00
Matthew Honnibal
af847e07fc
Fix usage of pathlib for Python3 -- turning paths to strings.
2016-09-24 21:05:27 +02:00
Matthew Honnibal
453683aaf0
Fix spacy/vocab.pyx
2016-09-24 20:50:31 +02:00
Matthew Honnibal
fd65cf6cbb
Finish refactoring data loading
2016-09-24 20:26:17 +02:00
Matthew Honnibal
83e364188c
Mostly finished loading refactoring. Design is in place, but doesn't work yet.
2016-09-24 15:42:01 +02:00
Matthew Honnibal
9dc8043a7e
Refactor Language to use new Defaults class, and work on revised data loading. We're getting rid of sputnik's weird file-system wrapper, and using pathlib.
2016-09-24 14:08:53 +02:00
Matthew Honnibal
b00f683a0c
Fix matcher test
2016-09-24 11:20:58 +02:00
Matthew Honnibal
eaf4065480
Expose the _patterns private member
2016-09-24 11:20:42 +02:00
Matthew Honnibal
15e42a1ba9
Allow entities to be set by Span, or by 4-tuple (with entity ID)
2016-09-24 01:17:43 +02:00
Matthew Honnibal
60fdf4d5f1
Remove commented out debugging code
2016-09-24 01:17:18 +02:00
Matthew Honnibal
939a791a52
Update tests
2016-09-24 01:17:03 +02:00
Matthew Honnibal
55f1f7edaf
Don't automatically write new entities into the Doc in the Matcher. This fixes a long-standing wart, but introduces a *backwards incompatibility.*
2016-09-24 01:16:45 +02:00
Matthew Honnibal
e48df859b5
Fix typedef import in span.pyx
2016-09-23 16:02:28 +02:00
Matthew Honnibal
4de13606fd
Fix token.pyx
2016-09-23 15:07:07 +02:00
Matthew Honnibal
b4de419e19
Import hash_t typedef in token.pyx
2016-09-23 14:22:06 +02:00
Matthew Honnibal
c1a2e96604
Clean up notes at end of token.pyx
2016-09-21 20:45:51 +02:00
Matthew Honnibal
f6e587b1c7
Fix matcher tests
2016-09-21 20:45:20 +02:00
Matthew Honnibal
58e83fe34b
Initial, limited support for quantified patterns in Matcher, and tracking of ent_id attribute in Token and Span. The quantifiers need a lot more testing, and there are some known problems. The main known problem is that the zero-plus and one-plus quantifiers won't work if a token can match both the quantified pattern expression AND the tail of the match.
2016-09-21 14:54:55 +02:00