Ines Montani
52e7d634df
Remove trailing whitespace
2016-12-07 21:12:19 +01:00
Ines Montani
0d07d7fc80
Apply emoticon exceptions to tokenizer
2016-12-07 21:11:59 +01:00
Ines Montani
71f0f34cb3
Fix formatting
2016-12-07 21:11:29 +01:00
Ines Montani
9413bcd9ee
Declare encoding and unicode literals
2016-12-07 21:10:34 +01:00
Ines Montani
a280ff2657
Fix __all__
2016-12-07 21:10:12 +01:00
Ines Montani
ba8721953c
Add missing emoticons
2016-12-07 21:09:44 +01:00
Ines Montani
1285c4ba93
Update English language data
2016-12-07 20:33:28 +01:00
Ines Montani
79dce0aabe
Add emoticons
2016-12-07 20:33:28 +01:00
Ines Montani
a662a95294
Add line breaks
2016-12-07 20:33:28 +01:00
Ines Montani
07f0efb102
Add test for tokenizer regular expressions
2016-12-07 20:33:28 +01:00
Ines Montani
e0712d1b32
Reformat language data
2016-12-07 20:33:28 +01:00
Matthew Honnibal
0c0f4c965d
Increment version
2016-12-03 11:16:52 +01:00
Matthew Honnibal
f6e356aada
Add (and test) Span.sentiment attribute. By default we average token.sentiment, but can override with custom hook. Re Issue #667
2016-12-02 11:05:50 +01:00
Janneke van der Zwaan
88869e0e07
Merge github.com:explosion/spaCy into dutch
2016-11-30 17:13:39 +01:00
Janneke van der Zwaan
51ade86b86
Update language data with tag map from UD_Dutch
2016-11-30 14:41:23 +01:00
Janneke van der Zwaan
90f6ff12c9
Update Dutch language data
- Use Dutch tag map
- remove tokenizer exceptions
2016-11-30 11:59:39 +01:00
dafnevk
7b8f4c49f2
Added language Dutch to init file
2016-11-29 16:42:05 +01:00
Matthew Honnibal
296d33a4fc
Merge branch 'master' of ssh://github.com/explosion/spaCy
2016-11-26 12:36:18 +01:00
Matthew Honnibal
1f6c37c6f5
Fix create_tokenizer when nlp is None
2016-11-26 12:36:04 +01:00
Matthew Honnibal
c7889492f9
Fix model saving error for Python 3
2016-11-25 18:04:30 -06:00
Matthew Honnibal
bc0a202c9c
Fix unicode problem in nonproj module
2016-11-25 17:29:17 -06:00
Matthew Honnibal
6dd3b94fa6
Filter out deprecated attributes when reading special-case tokenization rules.
2016-11-25 09:57:18 -06:00
Matthew Honnibal
e879c79b8c
Merge branch 'master' of https://github.com/explosion/spaCy
2016-11-25 09:18:28 -06:00
Matthew Honnibal
a335c6dcc2
Exclude morphs from deprecated token attributes for now
2016-11-25 16:17:32 +01:00
Matthew Honnibal
f799a07f25
Merge branch 'master' of https://github.com/explosion/spaCy
2016-11-25 09:16:43 -06:00
Matthew Honnibal
159e8c46e1
Merge old training fixes with newer state
2016-11-25 09:16:36 -06:00
Matthew Honnibal
846e80f2f4
Exclude morphs from deprecated token attributes for now
2016-11-25 16:14:54 +01:00
Matthew Honnibal
664f2dd1c0
Allow dep to be None in scorer, for missing labels.
2016-11-25 09:02:49 -06:00
Matthew Honnibal
39341598bb
Fix NER label calculation
2016-11-25 09:02:22 -06:00
Matthew Honnibal
ca773a1f53
Tweak arc_eager n_gold to deal with negative costs, and improve error message.
2016-11-25 09:01:52 -06:00
Matthew Honnibal
a2f55e7015
Pass cfg through loading, for training.
2016-11-25 09:01:20 -06:00
Matthew Honnibal
608d8f5421
Pass cfg through parser, and have is_valid default to 1, not 0 when resetting state
2016-11-25 09:00:21 -06:00
Matthew Honnibal
cc7e607a8a
Fix gold.pyx for 1.0
2016-11-25 08:57:59 -06:00
root
080d29e092
Fix train.py for 1.0
2016-11-25 08:55:33 -06:00
Matthew Honnibal
6652f2a135
Test #656, #624: special case rules for tokenizer with attributes.
2016-11-25 12:44:13 +01:00
Matthew Honnibal
1e0f566d95
Fix #656, #624: Support arbitrary token attributes when adding special-case rules.
2016-11-25 12:43:24 +01:00
Matthew Honnibal
87613edf8f
Add set_struct_attr staticmethod to token
2016-11-25 12:41:47 +01:00
Matthew Honnibal
fb69aa648f
Merge branch 'master' of ssh://github.com/explosion/spaCy
2016-11-25 11:35:44 +01:00
Matthew Honnibal
9a03a3f85e
Add get_struct_attr staticmethod to Token, to match Lexeme.get_struct_attr.
2016-11-25 11:35:17 +01:00
Matthew Honnibal
53d8ca8f51
Add spacy.attrs.intify_attrs function, to normalize strings in token attribute dictionaries.
2016-11-25 11:34:30 +01:00
Ines Montani
d21ad01840
Add emoticons
2016-11-24 19:13:00 +01:00
dafnevk
d8c7ac203a
Added nl module for Dutch
2016-11-24 16:39:49 +01:00
dafnevk
3db8b0d322
Added language class and some language data (with some TODOs) for Dutch
2016-11-24 15:56:38 +01:00
Ines Montani
4dcfafde02
Add line breaks
2016-11-24 14:57:37 +01:00
Ines Montani
6247c005a2
Add test for tokenizer regular expressions
2016-11-24 13:51:59 +01:00
Ines Montani
de747e39e7
Reformat language data
2016-11-24 13:51:32 +01:00
Matthew Honnibal
b8c4f5ea76
Allow German noun chunks to work on Span
Update the German noun chunks iterator, so that it also works on Span objects.
2016-11-24 23:30:15 +11:00
Pokey Rule
3e3bda142d
Add noun_chunks to Span
2016-11-24 10:47:20 +00:00
Janneke van der Zwaan
83daade0e4
Add directory and initial (empty) files for language Dutch
2016-11-24 09:45:41 +01:00
Matthew Honnibal
09f68bc641
Fix Issue #639: stop words in language class not used. This patch is messy, but it's better not to change too much until the language data loading can be properly refactored.
2016-11-24 00:13:55 +01:00
Matthew Honnibal
48e1dc29d4
Fix default path loading.
2016-11-23 23:48:55 +01:00
Matthew Honnibal
e01c1875ee
Work on test for #615
2016-11-23 23:48:41 +01:00
ExplodingCabbage
6c4f488e89
Fix syntax mistake
2016-11-23 15:12:45 +00:00
Matthew Honnibal
60eb2343ce
Only try to load vectors if they exist.
2016-11-23 13:50:24 +01:00
Matthew Honnibal
618ac36093
Fix use of path argument in Language.__init__. Needs to be keyword arg, not positional.
2016-11-23 13:26:34 +01:00
Mark Amery
fbe19680a6
Fix another bug related to Language.__init__'s path parameter
2016-11-20 20:31:34 +00:00
Mark Amery
b0a07c21a0
Fix path param of Language.__init__ always being ignored
There was an explicitly-declared `path` keyword argument, so 'path'
would never be present in `**overrides`. This line just overwrote
any manually-specified value the user might've passed to the `path`
parameter.
2016-11-20 16:29:57 +00:00
Mark Amery
1988fce389
Merge remote-tracking branch 'origin/master' into specify-data-path
2016-11-20 16:07:14 +00:00
Mark Amery
3871007c72
Let --data-path be specified when running download.py scripts
Resolves https://github.com/explosion/spaCy/issues/637
2016-11-20 15:48:04 +00:00
Ines Montani
dad2c6cae9
Strip trailing whitespace
2016-11-20 16:45:51 +01:00
Ines Montani
3082e49326
Update and reformat German stopwords
2016-11-20 16:45:26 +01:00
Sourav Singh
6745eac309
Update language_data.py
2016-11-20 19:52:02 +05:30
Sourav Singh
4d9aae7d6a
Add German Stopwords
2016-11-19 22:47:53 +05:30
Matthew Honnibal
7afb2544a7
Merge pull request #627 from sadovnychyi/patch-1
Remove duplicated line of vocab declaration
2016-11-16 06:09:18 +11:00
Yanhao
762169da29
Fixed bug: eg.guess is a tag id, rather than tag
2016-11-15 14:11:22 +08:00
Dmytro Sadovnychyi
e70a7050e1
Remove duplicated line of vocab declaration
As already declared on line 211.
2016-11-13 18:52:49 +08:00
Matthew Honnibal
f123f92e0c
Fix #617: Vocab.load() required Path. Should work with string as well.
2016-11-10 22:48:48 +01:00
Matthew Honnibal
e86f440ca6
Fix test for issue 617
2016-11-10 22:48:10 +01:00
Matthew Honnibal
faa7610c56
Merge branch 'master' of ssh://github.com/explosion/spaCy
2016-11-10 22:46:38 +01:00
Matthew Honnibal
a2c7de8329
spacy/tests/regression/test_issue617.py
Test Issue #617
2016-11-10 22:46:23 +01:00
tiago
2a3e342c1f
Added a test case to cover the span.merge returning values
2016-11-09 18:57:50 +00:00
tiago
b38cfd0ef9
Now span.merge returns a token, as the documentation says
2016-11-09 14:58:19 +00:00
Dmitry Sadovnychyi
9488222e79
Fix PhraseMatcher to work with updated Matcher
#613
2016-11-09 00:14:26 +08:00
Dmitry Sadovnychyi
86c056ba64
Add basic test for PhraseMatcher
#613
2016-11-09 00:10:32 +08:00
Matthew Honnibal
3ea15b257f
Fix test for 605
2016-11-06 11:59:26 +01:00
Matthew Honnibal
efe7790439
Test #590: Order dependence in Matcher rules.
2016-11-06 11:21:36 +01:00
Matthew Honnibal
5cd3acb265
Fix #605: Acceptor now rejects matches as expected.
2016-11-06 10:50:42 +01:00
Matthew Honnibal
75805397dd
Test Issue #605
2016-11-06 10:42:32 +01:00
Matthew Honnibal
014b6936ac
Fix #608 -- __version__ should be available at the base of the package.
2016-11-04 21:21:02 +01:00
Matthew Honnibal
42b0736db7
Increment version
2016-11-04 20:04:21 +01:00
Matthew Honnibal
9f93386994
Update version
2016-11-04 19:28:16 +01:00
Matthew Honnibal
1fb09c3dc1
Fix morphology tagger
2016-11-04 19:19:09 +01:00
Matthew Honnibal
a36353df47
Temporarily put back the tokenize_from_strings method, while tests aren't updated yet.
2016-11-04 19:18:07 +01:00
Matthew Honnibal
f0917b6808
Fix Issue #376: and/or was tagged as a noun.
2016-11-04 15:21:28 +01:00
Matthew Honnibal
737816e86e
Fix #368: Tokenizer handled pattern 'unicode close quote, period' incorrectly.
2016-11-04 15:16:20 +01:00
Matthew Honnibal
ab952b4756
Fix #578 -- Sputnik had been purging all files on --force, not just the relevant one.
2016-11-04 10:44:11 +01:00
Matthew Honnibal
6e37ba1d82
Fix #602, #603 --- Broken build
2016-11-04 09:54:24 +01:00
Matthew Honnibal
293c79c09a
Fix #595: Lemmatization was incorrect for base forms, because morphological analyser wasn't adding morphology properly.
2016-11-04 00:29:07 +01:00
Matthew Honnibal
e30348b331
Prefer to import from symbols instead of parts_of_speech
2016-11-04 00:27:55 +01:00
Matthew Honnibal
4a8a2b6001
Test #595 -- Bug in lemmatization of base forms.
2016-11-04 00:27:32 +01:00
Matthew Honnibal
f1605df2ec
Fix #588: Matcher should reject empty pattern.
2016-11-03 00:16:44 +01:00
Matthew Honnibal
72b9bd57ec
Test Issue #588: Matcher accepts invalid, empty patterns.
2016-11-03 00:09:35 +01:00
Matthew Honnibal
41a90a7fbb
Add tokenizer exception for 'Ph.D.', to fix 592.
2016-11-03 00:03:34 +01:00
Matthew Honnibal
532318e80b
Import Jieba inside zh.make_doc
2016-11-02 23:49:19 +01:00
Matthew Honnibal
f292f7f0e6
Fix Issue #599, by considering empty documents to be parsed and tagged. Implementation is a bit dodgy.
2016-11-02 23:48:43 +01:00
Matthew Honnibal
b6b01d4680
Remove deprecated tokens_from_list test.
2016-11-02 23:47:21 +01:00
Matthew Honnibal
3d6c79e595
Test Issue #599: .is_tagged and .is_parsed attributes not reflected after deserialization for empty documents.
2016-11-02 23:40:11 +01:00
Matthew Honnibal
05a8b752a2
Fix Issue #600: Missing setters for Token attribute.
2016-11-02 23:28:59 +01:00
Matthew Honnibal
125c910a8d
Test Issue #600
2016-11-02 23:24:13 +01:00
Matthew Honnibal
e0c9695615
Fix doc strings for tokenizer
2016-11-02 23:15:39 +01:00
Matthew Honnibal
80824f6d29
Fix test
2016-11-02 20:48:40 +01:00
Matthew Honnibal
dbe47902bc
Add import fr
2016-11-02 20:48:29 +01:00
Matthew Honnibal
8f24dc1982
Fix infixes in Italian
2016-11-02 20:43:52 +01:00
Matthew Honnibal
41a4766c1c
Fix infixes in Spanish and Portuguese
2016-11-02 20:43:12 +01:00
Matthew Honnibal
3d4bd96e8a
Fix infixes in French
2016-11-02 20:41:43 +01:00
Matthew Honnibal
c09a8ce5bb
Add test for French tokenizer
2016-11-02 20:40:31 +01:00
Matthew Honnibal
b012ae3044
Add test for loading languages
2016-11-02 20:38:48 +01:00
Matthew Honnibal
ad1c747c6b
Fix stray POS in language stubs
2016-11-02 20:37:55 +01:00
Matthew Honnibal
e9e6fce576
Handle null prefix/suffix/infix search in tokenizer
2016-11-02 20:35:48 +01:00
Matthew Honnibal
22647c2423
Check that patterns aren't null before compiling regex for tokenizer
2016-11-02 20:35:29 +01:00
Matthew Honnibal
5ac735df33
Link languages in __init__.py
2016-11-02 20:05:14 +01:00
Matthew Honnibal
c68dfe2965
Stub out support for Italian
2016-11-02 20:03:24 +01:00
Matthew Honnibal
6dbf4f7ad7
Stub out support for French, Spanish, Italian and Portuguese
2016-11-02 20:02:41 +01:00
Matthew Honnibal
6b8b05ef83
Specify that spacy.util is encoded in utf8
2016-11-02 19:58:00 +01:00
Matthew Honnibal
5363224395
Add draft Jieba tokenizer for Chinese
2016-11-02 19:57:38 +01:00
Matthew Honnibal
f7fee6c24b
Check for class-defined make_docs method before assigning one provided as an argument
2016-11-02 19:57:13 +01:00
Matthew Honnibal
19c1e83d3d
Work on draft Italian tokenizer
2016-11-02 19:56:32 +01:00
Matthew Honnibal
9efe568177
Add missing unicode_literals to spacy.util. I think this was messing up the tokenizer regex for non-ascii characters in Python 2. Re Issue #596
2016-11-02 12:31:34 +01:00
Matthew Honnibal
d8db648ebf
Add __init__.py file for regression tests
2016-11-01 13:45:06 +01:00
Matthew Honnibal
11664b9f20
Fix variable error in token
2016-11-01 13:28:00 +01:00
Matthew Honnibal
8c4d1b46ce
Fix variable error in Span
2016-11-01 13:27:44 +01:00
Matthew Honnibal
e7af6b937f
Fix syntax error while fixing doc strings
2016-11-01 13:27:32 +01:00
Matthew Honnibal
62fc6b1afa
Use 32 bit hashes for OOV, re Issue #589, Issue #285
2016-11-01 13:27:13 +01:00
Matthew Honnibal
6977a2b8cd
Add test for Issue #589
2016-11-01 12:33:36 +01:00
Matthew Honnibal
b86f8af0c1
Fix doc strings
2016-11-01 12:25:36 +01:00
Matthew Honnibal
d563f1eadb
Fix Issue #587: Segfault in Matcher, due to simple error in the state machine.
2016-10-28 17:42:00 +02:00
Matthew Honnibal
7e5f63a595
Improve test slightly
2016-10-28 17:41:16 +02:00
Matthew Honnibal
782e4814f4
Test Issue #587: Matcher segfaults on particular input
2016-10-28 16:38:32 +02:00
Matthew Honnibal
708ea22208
Infer types in transition_system.pyx
2016-10-27 18:08:13 +02:00
Matthew Honnibal
18590eba94
Fix training evaluate method
2016-10-27 18:02:19 +02:00
Matthew Honnibal
301f3cc898
Fix Issue #429. Add an initialize_state method to the named entity recogniser that adds missing entity types. This is a messy place to add this, because it's strange to have the method mutate state. A better home for this logic could be found.
2016-10-27 18:01:55 +02:00
Matthew Honnibal
afea6505f3
Test Issue 429: No valid actions for NER after matcher adds a new entity label.
2016-10-27 18:01:34 +02:00
Matthew Honnibal
03a520ec4f
Change signature of Parser.parseC, so that nr_class is read from the transition system. This allows the transition system to modify the number of actions in initialize_state.
2016-10-27 17:58:56 +02:00
Matthew Honnibal
6c47048912
Fix test, after IOB tweak.
2016-10-26 17:22:03 +02:00
Matthew Honnibal
4ca31b4d87
Fix clobbering of 'missing' named ent values after assigning ents.
2016-10-26 13:13:56 +02:00
Matthew Honnibal
cb49189477
Remove dead code
2016-10-26 13:11:07 +02:00
Matthew Honnibal
a209b10579
Improve error message when oracle fails for non-projective trees, re Issue #571.
2016-10-24 20:31:30 +02:00
Matthew Honnibal
b2d43b93d2
Fix Python 3 basestring error
2016-10-24 14:22:51 +02:00
Matthew Honnibal
276478fe0f
Update strings.pxd
2016-10-24 14:00:35 +02:00
Matthew Honnibal
d8134817ff
Workaround Issue #285: Allow the StringStore to be 'frozen', in which case strings will be pushed into an OOV map. We can then flush this OOV map, freeing all of the OOV strings.
2016-10-24 13:49:03 +02:00
Matthew Honnibal
d3a617aa99
Test workaround for Issue #285: Streaming data memory growth
2016-10-24 13:48:06 +02:00
Matthew Honnibal
64e5f02cf7
Update test
2016-10-23 21:08:07 +02:00
Matthew Honnibal
66d7a6eca2
Update test
2016-10-23 21:02:05 +02:00
Matthew Honnibal
90bf797125
Update test
2016-10-23 20:54:17 +02:00
Matthew Honnibal
5e76320ffe
Update test
2016-10-23 20:44:54 +02:00
Matthew Honnibal
aa105927f3
Update test
2016-10-23 20:31:25 +02:00
Matthew Honnibal
6b9237aa83
Increment version
2016-10-23 20:22:53 +02:00
Matthew Honnibal
150e02d72e
Fix Issue #566
2016-10-23 20:19:01 +02:00
Matthew Honnibal
e120561294
Fix vector_norm test.
2016-10-23 19:56:16 +02:00
Matthew Honnibal
fefde8aef8
Make installation print data path.
2016-10-23 19:46:44 +02:00