Commit Graph

1940 Commits

Author SHA1 Message Date
Matthew Honnibal
c1abc8f6ed Fix deprecation stuff in Token: Remove the shim for the str/unicode semantics, and raise for has_repvec and repvec 2016-10-17 11:18:41 +02:00
Matthew Honnibal
4ba9eadf3d Merge branch 'v1.0.0-rc1' of ssh://github.com/explosion/spaCy into v1.0.0-rc1 2016-10-17 02:45:44 +02:00
Matthew Honnibal
09ab447a18 Remove tensor property from token. 2016-10-17 02:45:09 +02:00
Matthew Honnibal
5d10e2005c Defer some attributes to Doc, via getters_for_tokens attribute. 2016-10-17 02:44:49 +02:00
Matthew Honnibal
8829984efb Remove tensor attribute from Span and Token. 2016-10-17 02:44:04 +02:00
Matthew Honnibal
d15a88c66a Defer some attributes to Doc via getters_for_spans 2016-10-17 02:43:35 +02:00
Matthew Honnibal
62230dd13a Add getters_for_spans and getters_for_tokens attributes to Doc. Fix docstring 2016-10-17 02:42:51 +02:00
Matthew Honnibal
ae11ea8240 Add getters_for_tokens and getters_for_spans attributes to Doc object. 2016-10-17 02:42:05 +02:00
Matthew Honnibal
be48a7b4f3 Fix conftest for website tests. 2016-10-17 01:54:26 +02:00
Matthew Honnibal
8951bf6989 Update matcher tests 2016-10-17 01:53:24 +02:00
Matthew Honnibal
0cf4aff470 Set default path in EN/DE tests. 2016-10-17 01:52:49 +02:00
Matthew Honnibal
cd71b6b0a9 Remove test of parser pickle 2016-10-17 01:52:10 +02:00
Matthew Honnibal
5bc101006e Add cfg field to Tagger 2016-10-17 01:03:41 +02:00
Matthew Honnibal
517f090cbf Use GoldParse in tagger.update 2016-10-17 00:55:15 +02:00
Matthew Honnibal
59038f7efa Restore support for prior data format -- specifically, the labels field of the config. 2016-10-17 00:53:26 +02:00
Matthew Honnibal
7887ab3b36 Fix default use of feature_templates in parser 2016-10-16 21:41:56 +02:00
Matthew Honnibal
f787cd29fe Refactor the pipeline classes to make them more consistent, and remove the redundant blank() constructor. 2016-10-16 21:34:57 +02:00
Matthew Honnibal
311a985fe0 Add input error handling in Doc 2016-10-16 18:16:42 +02:00
Matthew Honnibal
06322ba99d Add words and spaces keyword arguments to Doc. 2016-10-16 18:13:03 +02:00
Matthew Honnibal
ca51f3b77e Use DependencyParser and EntityRecognizer in the Language class. 2016-10-16 17:58:12 +02:00
Matthew Honnibal
195d998a12 Fix GoldParse argument to tagger.update 2016-10-16 17:05:09 +02:00
Matthew Honnibal
274a4d4272 Fix queue Python property in StateClass 2016-10-16 17:04:41 +02:00
Matthew Honnibal
e8c8aa08ce Make action_name optional in StepwiseState 2016-10-16 17:04:16 +02:00
Matthew Honnibal
4bb73b1a93 Fix parser labels in pipeline 2016-10-16 17:03:22 +02:00
Matthew Honnibal
a81c5a7abf Fix name of labels keyword to 'actions'. 2016-10-16 12:00:27 +02:00
Matthew Honnibal
a079677984 Fix omission of O action when creating blank entity recognizer 2016-10-16 11:43:25 +02:00
Matthew Honnibal
5444d38cc6 Update test for biluo tags 2016-10-16 11:42:45 +02:00
Matthew Honnibal
4fc56d4a31 Rename 'labels' to 'actions' in parser options 2016-10-16 11:42:26 +02:00
Matthew Honnibal
8a6b35d266 Delay binding in MakeDoc 2016-10-16 11:41:55 +02:00
Matthew Honnibal
52b48b415e Fix GoldParse class 2016-10-16 11:41:36 +02:00
Matthew Honnibal
3259a63779 Whitespace 2016-10-16 01:47:28 +02:00
Matthew Honnibal
509b30834f Add a pipeline module, to collect and wrap processes for annotation 2016-10-16 01:47:12 +02:00
Matthew Honnibal
0317cea0ad Fix GoldParse 2016-10-15 23:55:07 +02:00
Matthew Honnibal
1c62573a41 Fix spacy.train 2016-10-15 23:53:46 +02:00
Matthew Honnibal
a48aa15384 Improve the API for the GoldParse class. 2016-10-15 23:53:29 +02:00
Matthew Honnibal
e07fe92b27 Draft a refactored init for the GoldParse class 2016-10-15 22:09:52 +02:00
Matthew Honnibal
47afef7d6b Add init.py for gold tests 2016-10-15 21:51:28 +02:00
Matthew Honnibal
86ae665c78 Add function for entity->biluo transformation 2016-10-15 21:51:04 +02:00
Matthew Honnibal
2163fd238f Add tests for entity->biluo transformation 2016-10-15 21:50:43 +02:00
Matthew Honnibal
5e923b9bfa Return None in match_best_version if not path exists. 2016-10-15 14:47:29 +02:00
Matthew Honnibal
2516382106 Fix loading of English in span test 2016-10-15 14:44:37 +02:00
Matthew Honnibal
dda2fc6bef Add empty data directory 2016-10-15 14:25:25 +02:00
Matthew Honnibal
049197e0ae Update tests, somewhat messily. 2016-10-15 14:14:04 +02:00
Matthew Honnibal
1e1a1d9517 Update matcher test 2016-10-15 14:13:41 +02:00
Matthew Honnibal
9cc9ce0f14 Load with default path=False in tests. 2016-10-15 14:13:23 +02:00
Matthew Honnibal
08e9134760 Change default value of path to True 2016-10-15 14:12:54 +02:00
Matthew Honnibal
788657f062 Ensure words are added to vocab before test, so that the lexicon is updated correctly. 2016-10-15 14:12:18 +02:00
Matthew Honnibal
4a1a2bce68 Update version in about.py 2016-10-15 13:44:27 +02:00
Matthew Honnibal
6d8cb515ac Break the tokenization stage out of the pipeline into a function 'make_doc'. This allows all pipeline methods to have the same signature. 2016-10-14 17:38:29 +02:00
Matthew Honnibal
2cc515b2ed Add add_flag method to Vocab, re Issue #504. 2016-10-14 12:15:38 +02:00
Matthew Honnibal
f3be9d0a9a Add tensor field to Lexeme, Token, Doc and Span, so that users have a place to hang neural network outputs 2016-10-14 03:24:13 +02:00
Matthew Honnibal
9b55d97a8f Update train method 2016-10-13 03:24:53 +02:00
Matthew Honnibal
645d99523a Move merge_sents method into spacy.gold 2016-10-13 03:24:29 +02:00
Matthew Honnibal
41f88ce938 Fix dep model loading in parser 2016-10-12 20:26:38 +02:00
Matthew Honnibal
d9ae2d68af Load features by string-name for backwards compatibility. 2016-10-12 20:15:11 +02:00
Matthew Honnibal
a42fbcf946 Require model for test_is_properties 2016-10-12 19:35:18 +02:00
Matthew Honnibal
20c948361b Use local path in test_lemmatizer 2016-10-12 19:35:00 +02:00
Matthew Honnibal
1318d0bc65 Test with the non-loaded versions of the English and German pipelines. 2016-10-12 19:13:31 +02:00
Matthew Honnibal
0e2bedc373 Fix default labels for parser and NER 2016-10-12 19:12:40 +02:00
Matthew Honnibal
3a03c668c3 Fix message in ParserStateError 2016-10-12 14:44:31 +02:00
Matthew Honnibal
6bf505e865 Fix error on ParserStateError 2016-10-12 14:35:55 +02:00
Matthew Honnibal
ba5e048502 Add docstring for Trainer class. 2016-10-12 14:26:02 +02:00
Matthew Honnibal
847a4a4182 Refactor Language, dropping Language.blank() method. 2016-10-12 13:45:58 +02:00
Matthew Honnibal
ea23b64cc8 Refactor training, with new spacy.train module. Defaults still a little awkward. 2016-10-09 12:24:24 +02:00
Matthew Honnibal
ca32a1ab01 Revert "Work on Issue #285: intern strings into document-specific pools, to address streaming data memory growth. StringStore.__getitem__ now raises KeyError when it can't find the string. Use StringStore.intern() to get the old behaviour. Still need to hunt down all uses of StringStore.__getitem__ in library and do testing, but logic looks good."
This reverts commit 8423e8627f.
2016-09-30 20:20:22 +02:00
Matthew Honnibal
90baa9c7e6 Revert "Changes to matcher.pyx for new StringStore scheme"
This reverts commit 3ff09614e0.
2016-09-30 20:20:13 +02:00
Matthew Honnibal
1b6b129c04 Revert "Changes to morphology.pyx for new StringStore scheme"
This reverts commit 95f8cfd745.
2016-09-30 20:20:02 +02:00
Matthew Honnibal
1d70db58aa Revert "Changes to iterators.pyx for new StringStore scheme"
This reverts commit 4f794b215a.
2016-09-30 20:19:53 +02:00
Matthew Honnibal
de01e427fd Revert "Changes to strings.pyx for new StringStore scheme"
This reverts commit 22d4752d64.
2016-09-30 20:19:42 +02:00
Matthew Honnibal
9e09b39b9f Revert "Changes to transition systems for new StringStore scheme"
This reverts commit 0442e0ab1e.
2016-09-30 20:11:49 +02:00
Matthew Honnibal
e3285f6f30 Revert "Fix report of ParserStateError"
This reverts commit 78f19baafa.
2016-09-30 20:11:33 +02:00
Matthew Honnibal
6736977d82 Revert "Changes to Doc and Token for new string store scheme"
This reverts commit 99de44d864.
2016-09-30 20:11:15 +02:00
Matthew Honnibal
bd7fe6420c Revert "Changes to test for new string-store"
This reverts commit 21e90d7d0b.
2016-09-30 20:11:01 +02:00
Matthew Honnibal
1f1cd5013f Revert "Changes to vocab for new stringstore scheme"
This reverts commit a51149a717.
2016-09-30 20:10:30 +02:00
Matthew Honnibal
1e7d0af127 Revert "Changes to Lexeme for new string store scheme"
This reverts commit 717741b6cf.
2016-09-30 20:10:13 +02:00
Matthew Honnibal
ba51cb8325 Revert "Changes to tagger for new string store scheme"
This reverts commit f5a6aac906.
2016-09-30 20:09:53 +02:00
Matthew Honnibal
23b7244842 Make sure symbols are unicode strings 2016-09-30 20:02:19 +02:00
Matthew Honnibal
f5a6aac906 Changes to tagger for new string store scheme 2016-09-30 20:01:51 +02:00
Matthew Honnibal
717741b6cf Changes to Lexeme for new string store scheme 2016-09-30 20:01:36 +02:00
Matthew Honnibal
a51149a717 Changes to vocab for new stringstore scheme 2016-09-30 20:01:19 +02:00
Matthew Honnibal
21e90d7d0b Changes to test for new string-store 2016-09-30 20:00:58 +02:00
Matthew Honnibal
99de44d864 Changes to Doc and Token for new string store scheme 2016-09-30 20:00:21 +02:00
Matthew Honnibal
78f19baafa Fix report of ParserStateError 2016-09-30 19:59:22 +02:00
Matthew Honnibal
0442e0ab1e Changes to transition systems for new StringStore scheme 2016-09-30 19:58:51 +02:00
Matthew Honnibal
22d4752d64 Changes to strings.pyx for new StringStore scheme 2016-09-30 19:58:09 +02:00
Matthew Honnibal
4f794b215a Changes to iterators.pyx for new StringStore scheme 2016-09-30 19:57:49 +02:00
Matthew Honnibal
95f8cfd745 Changes to morphology.pyx for new StringStore scheme 2016-09-30 19:57:10 +02:00
Matthew Honnibal
3ff09614e0 Changes to matcher.pyx for new StringStore scheme 2016-09-30 19:56:48 +02:00
Matthew Honnibal
eceeaefe53 Fix defaults for Parser and Entity, adding a blank= argument. 2016-09-30 19:56:06 +02:00
Matthew Honnibal
8423e8627f Work on Issue #285: intern strings into document-specific pools, to address streaming data memory growth. StringStore.__getitem__ now raises KeyError when it can't find the string. Use StringStore.intern() to get the old behaviour. Still need to hunt down all uses of StringStore.__getitem__ in library and do testing, but logic looks good. 2016-09-30 10:14:47 +02:00
Matthew Honnibal
d3dc5718b2 Fix syntax error in Doc 2016-09-28 11:39:49 +02:00
Matthew Honnibal
1b520e7bab Improve docstrings for Doc object 2016-09-28 11:15:13 +02:00
Matthew Honnibal
81a47c01d8 Fix test for empty sentence string. 2016-09-27 19:21:22 +02:00
Matthew Honnibal
4cbf0d3bb6 Handle errors when no valid actions are available, pointing users to the issue tracker. 2016-09-27 19:19:53 +02:00
Matthew Honnibal
430473bd98 Raise errors when no actions are available, re Issue #429 2016-09-27 19:09:37 +02:00
Matthew Honnibal
fc4a7ad794 Test and fix Issue #411: IndexError when .sents property is used on empty string. 2016-09-27 18:49:14 +02:00
Matthew Honnibal
3d370b7d45 Add test for Issue #445, fixed in 3cb4d455d, with improved lemmatizer logic 2016-09-27 18:39:46 +02:00
Matthew Honnibal
a2f3510d6d Fix lemmatizer 2016-09-27 17:47:05 +02:00
Matthew Honnibal
07776d8096 Fix pos name conflict in lemmatize 2016-09-27 17:35:58 +02:00
Matthew Honnibal
35cd953f9e Fix pos name conflict with morphology 2016-09-27 14:16:22 +02:00
Matthew Honnibal
8e7df3c4ca Expect the parser data, if parser.load() is called. 2016-09-27 14:02:12 +02:00
Matthew Honnibal
bb4f201ad2 Pass morphological features from tag map into the lemmatizer. 2016-09-27 14:01:43 +02:00
Matthew Honnibal
40509e8bca Tweak the new is_base_form logic, because we can expect the 'pos' key in the morphology we're passed. 2016-09-27 14:01:16 +02:00
Matthew Honnibal
9c8ac91d72 Add test for Issue #435 2016-09-27 13:52:38 +02:00
Matthew Honnibal
3cb4d455d2 Pass lemmatizer morphological features, so that rules are sensitive to base/inflected distinction, which is how the WordNet data is designed. See Issue #435 2016-09-27 13:52:11 +02:00
Matthew Honnibal
e233328d38 Fix Issue #371: Lexeme objects were unhashable. 2016-09-27 13:22:30 +02:00
Matthew Honnibal
e382e48d9f Temporarily patch handling of defaul templates for tagger. Need to move these to language_data. 2016-09-27 13:21:28 +02:00
Matthew Honnibal
a44763af0e Fix Issue #469: Incorrectly cased root label in noun chunk iterator 2016-09-27 13:13:01 +02:00
Matthew Honnibal
b14b9b096b Return None if /deps directory not present, instead of trying to load the parser. 2016-09-26 18:48:03 +02:00
Matthew Honnibal
e07b9665f7 Don't expect parser model 2016-09-26 18:09:33 +02:00
Matthew Honnibal
ee6fa106da Fix parser features 2016-09-26 17:57:32 +02:00
Matthew Honnibal
e607e4b598 Fix parser loading 2016-09-26 17:51:11 +02:00
Matthew Honnibal
0b2d7ae9d6 Fix Entity creation 2016-09-26 15:41:22 +02:00
Matthew Honnibal
2debc4e0a2 Add .blank() method to Parser. Start housing default dep labels and entity types within the Defaults class. 2016-09-26 11:57:54 +02:00
Matthew Honnibal
722199acb8 Add spacy.blank() method, that doesn't load data. Don't try to load data if path is falsey 2016-09-26 11:07:46 +02:00
Matthew Honnibal
e56653f848 Add language data for German 2016-09-25 15:44:45 +02:00
Matthew Honnibal
7db956133e Move tokenizer data for German into spacy.de.language_data 2016-09-25 15:37:33 +02:00
Matthew Honnibal
95aaea0d3f Refactor so that the tokenizer data is read from Python data, rather than from disk 2016-09-25 14:49:53 +02:00
Matthew Honnibal
d7e9acdcdf Add English language data, so that the tokenizer doesn't require the data download 2016-09-25 14:49:00 +02:00
Matthew Honnibal
82b8cc5efb Whitespace 2016-09-24 22:17:01 +02:00
Matthew Honnibal
fd58f7655a Python 3 compatible basestring 2016-09-24 22:16:43 +02:00
Matthew Honnibal
082e95b19e Python 3 compatible basestring 2016-09-24 22:09:21 +02:00
Matthew Honnibal
f19af6cb2c Python 3 compatible basestring 2016-09-24 22:08:43 +02:00
Matthew Honnibal
3ed4cdfe32 Handle pathlib.Path objects in CFile 2016-09-24 22:01:46 +02:00
Matthew Honnibal
df88690177 Fix encoding of path variable 2016-09-24 21:13:15 +02:00
Matthew Honnibal
af847e07fc Fix usage of pathlib for Python3 -- turning paths to strings. 2016-09-24 21:05:27 +02:00
Matthew Honnibal
453683aaf0 Fix spacy/vocab.pyx 2016-09-24 20:50:31 +02:00
Matthew Honnibal
fd65cf6cbb Finish refactoring data loading 2016-09-24 20:26:17 +02:00
Matthew Honnibal
83e364188c Mostly finished loading refactoring. Design is in place, but doesn't work yet. 2016-09-24 15:42:01 +02:00
Matthew Honnibal
9dc8043a7e Refactor Language to use new Defaults class, and work on revised data loading. We're getting rid of sputnik's weird file-system wrapper, and using pathlib. 2016-09-24 14:08:53 +02:00
Matthew Honnibal
b00f683a0c Fix matcher test 2016-09-24 11:20:58 +02:00
Matthew Honnibal
eaf4065480 Expose the _patterns private member 2016-09-24 11:20:42 +02:00
Matthew Honnibal
15e42a1ba9 Allow entities to be set by Span, or by 4-tuple (with entity ID) 2016-09-24 01:17:43 +02:00
Matthew Honnibal
60fdf4d5f1 Remove commented out debuggng code 2016-09-24 01:17:18 +02:00
Matthew Honnibal
939a791a52 Update tests 2016-09-24 01:17:03 +02:00
Matthew Honnibal
55f1f7edaf Don't automatically write new entities into the Doc in the Matcher. This fixes a long-standing wart, but introduces a *backwards incompatibility.* 2016-09-24 01:16:45 +02:00
Matthew Honnibal
e48df859b5 Fix typedef import in span.pyx 2016-09-23 16:02:28 +02:00
Matthew Honnibal
4de13606fd Fix token.pyx 2016-09-23 15:07:07 +02:00
Matthew Honnibal
b4de419e19 Import hash_t typedef in token.pyx 2016-09-23 14:22:06 +02:00
Matthew Honnibal
c1a2e96604 Clean up notes at end of token.pyx 2016-09-21 20:45:51 +02:00
Matthew Honnibal
f6e587b1c7 Fix matcher tests 2016-09-21 20:45:20 +02:00
Matthew Honnibal
58e83fe34b Initial, limited support for quantified patterns in Matcher, and tracking of ent_id attribute in Token and Span. The quantifiers need a lot more testing, and there are some known problems. The main known problem is that the zero-plus and one-plus quantifiers won't work if a token can match both the quantified pattern expression AND the tail of the match. 2016-09-21 14:54:55 +02:00
Matthew Honnibal
2735b6247b Fix orths_and_spaces in Doc.__init__ 2016-09-21 14:52:05 +02:00
Matthew Honnibal
070af4af9d Revert "* Working neural net, but features hacky. Switching to extractor."
This reverts commit 7c2f1a673b.
2016-09-21 12:26:14 +02:00
Matthew Honnibal
6b202ec43f Merge branch 'master' of ssh://github.com/spacy-io/spaCy 2016-09-21 12:08:25 +02:00
Mahmoud Lababidi
4c9ccc3b8b Add parameter to download() for application to not exit if a Model exists. The default behavior is unchanged. 2016-09-14 10:04:09 -04:00
Adam Ever Hadani
f1c0762443 exit code 0 for when downloading a model that already was downloaded 2016-07-13 16:22:14 -07:00
Matthew Honnibal
7c2f1a673b * Working neural net, but features hacky. Switching to extractor. 2016-05-26 19:06:10 +02:00
Matthew Honnibal
cdc10e9a1c * Fix Issue #375: noun phrase iteration results in index error if noun phrases are merged during the loop. Fix by accumulating the spans inside the noun_chunks property, allowing the Span index tricks to work. 2016-05-20 10:14:06 +02:00
Matthew Honnibal
13fad36e49 * Cosmetic change to english noun chunks iterator -- use enumerate instead of range loop 2016-05-20 10:11:05 +02:00
Matthew Honnibal
02276cc444 Merge branch 'master' of ssh://github.com/spacy-io/spaCy 2016-05-17 16:56:22 +02:00
Matthew Honnibal
4d7f5468bb * Change Language class to use a .pipeline attribute, instead of having the pipeline hard coded 2016-05-17 16:55:42 +02:00
Daylen Yang
5405e7dd73 Fix get_lang_class parsing (take 2) 2016-05-16 16:40:31 -07:00
Matthew Honnibal
b240104f40 Revert "Fix get_lang_class parsing" 2016-05-17 08:04:26 +10:00
Daylen Yang
1692c2df3c Fix get_lang_class parsing
We want the get_lang_class to return "en" for both "en" and "en_glove_cc_300_1m_vectors". Changed the split rule to "_" so that this happens.
2016-05-16 14:38:20 -07:00
Matthew Honnibal
17137f5c0c * Fix issue #372: mistake in Lexeme rich comparison 2016-05-12 12:58:57 +02:00
Matthew Honnibal
cc8bf62208 * Fix Issue #360: Tokenizer failed when the infix regex matched the start of the string while trying to tokenize multi-infix tokens. 2016-05-09 13:23:47 +02:00
Matthew Honnibal
c61ee8f9fa * Increment version 2016-05-09 13:20:00 +02:00
Matthew Honnibal
5d86c30f0b * Fix Issue #367: Missing has_vector property on Doc and Span objects 2016-05-09 12:36:14 +02:00
Wolfgang Seeker
7b78239436 add fix for German noun chunk iterator (issue #365) 2016-05-06 01:41:26 +02:00
Matthew Honnibal
8c0888d6cb * Fix error in span.sent 2016-05-06 00:28:05 +02:00
Matthew Honnibal
bb94022975 * Fix Issue #365: Error introduced during noun phrase chunking, due to use of corrected PRON/PROPN/etc tags. 2016-05-06 00:21:05 +02:00
Matthew Honnibal
41342ca79b Merge branch 'master' of ssh://github.com/spacy-io/spaCy 2016-05-06 00:17:58 +02:00
Matthew Honnibal
26095f9722 * Add span.sent property, re Issue #366 2016-05-06 00:17:38 +02:00
Wolfgang Seeker
dbf8f5f3ec fix bug in StateC.set_break() 2016-05-05 15:15:34 +02:00
Wolfgang Seeker
3c44b5dc1a call deprojectivization after parsing 2016-05-05 15:10:36 +02:00
Matthew Honnibal
472f576b82 * Deprojectivize German parses 2016-05-05 15:01:10 +02:00
Matthew Honnibal
9bbd6cf031 * Work on Chinese support 2016-05-05 11:39:12 +02:00
Matthew Honnibal
a6a25166ba * Remove print from test 2016-05-05 11:10:59 +02:00
Matthew Honnibal
e31df66d26 * Fix Issue #361: Lexemes didn't have rich comparison. 2016-05-05 01:32:26 +02:00
Matthew Honnibal
7441ca30ee * Add tests for Issue #361: Lexeme rich comparison 2016-05-05 01:31:58 +02:00
Matthew Honnibal
72564213e3 * Add test for Issue #309 2016-05-04 16:00:28 +02:00
Matthew Honnibal
76f1d871da Merge branch 'master' of ssh://github.com/spacy-io/spaCy 2016-05-04 15:54:00 +02:00
Matthew Honnibal
519366f677 * Fix Issue #351: Indices off when leading whitespace 2016-05-04 15:53:36 +02:00
Matthew Honnibal
b4bfc6ae55 * Add test for Issue #351: Indices off when leading whitespace 2016-05-04 15:53:17 +02:00
Matthew Honnibal
76021cb853 * Fix bug in Doc.text, introduced by a862edc 2016-05-04 11:02:16 +02:00
Wolfgang Seeker
e4ea2bea01 fix whitespace 2016-05-04 07:40:38 +02:00
Wolfgang Seeker
5bf2fd1f78 make the code less cryptic 2016-05-03 17:19:05 +02:00
Wolfgang Seeker
a06fca9fdf German noun chunk iterator now doesn't return tokens more than once 2016-05-03 16:58:59 +02:00
Wolfgang Seeker
7825b75548 add tests for German noun chunker 2016-05-03 15:01:28 +02:00
Wolfgang Seeker
7b246c13cb reformulate noun chunk tests for English 2016-05-03 14:24:35 +02:00
Wolfgang Seeker
1786331cd8 add model sanity test 2016-05-03 12:51:47 +02:00
Matthew Honnibal
1f1532142f * Fix cost calculation on non-monotonic oracle 2016-05-03 00:21:08 +02:00
Matthew Honnibal
377a624046 Merge pull request #358 from wbwseeker/german_lemmatizer_dummy
German lemmatizer dummy
2016-05-03 07:38:26 +10:00
Wolfgang Seeker
92bfbebeec remove unnecessary imports 2016-05-02 17:33:22 +02:00
Wolfgang Seeker
857454ffa0 fix indentation -.- 2016-05-02 17:10:41 +02:00
Matthew Honnibal
308a28c26c * Whitespace 2016-05-02 16:08:11 +02:00
Matthew Honnibal
29a114e645 * Don't assign 0-valued tags in Doc.from_array 2016-05-02 16:07:50 +02:00
Matthew Honnibal
c1c11a8ae0 * Fix formatting on serializer tests 2016-05-02 16:07:21 +02:00
Wolfgang Seeker
dae6bc05eb define German dummy lemmatizer until morphology is done 2016-05-02 16:04:53 +02:00
Matthew Honnibal
6e1f1c4b9e Merge pull request #357 from wbwseeker/german_ner
German ner
2016-05-02 23:39:34 +10:00
Wolfgang Seeker
b6b96b233c don't require read_json_file to expect particular annotations 2016-05-02 15:29:30 +02:00
Matthew Honnibal
902a389d85 * Fix merge conflict in test_parse 2016-05-02 15:28:07 +02:00
Matthew Honnibal
276fbe9996 * Fix assignment of iterator on Doc object 2016-05-02 15:26:24 +02:00
Matthew Honnibal
02c23cc1d0 * Fix sentence boundary test 2016-05-02 15:26:07 +02:00
Matthew Honnibal
d2f469b809 * Fix parsing tests, so that labels are added if they're missing, and so that the branching test values are correct 2016-05-02 15:25:27 +02:00
Wolfgang Seeker
b11cbb06c6 remove old tests for sentence boundary detection 2016-05-02 14:36:35 +02:00
Matthew Honnibal
508fd1f6dc * Refactor noun chunk iterators, so that they're simple functions. Install the iterator when the Doc is created, but allow users to write to the noun_chunk_iterator attribute. The iterator functions accept an object and yield (int start, int end, int label) triples. 2016-05-02 14:25:10 +02:00
Matthew Honnibal
e526be5602 Merge branch 'master' of ssh://github.com/spacy-io/spaCy 2016-05-02 13:08:08 +02:00
Wolfgang Seeker
fa961ea694 add tests for serialization bug 2016-05-02 11:01:56 +02:00
Matthew Honnibal
97b2bba249 * Merge updated/simplified Break approach 2016-04-25 19:44:42 +00:00
Matthew Honnibal
77609588b6 * Fix assignment of root label to words left as root implicitly, after parsing ends. 2016-04-25 19:41:59 +00:00
Matthew Honnibal
7c2d2deaa7 * Revise transition system so that the Break transition retains sole responsibility for setting sentence boundaries. Re Issue #322 2016-04-25 19:41:59 +00:00
Wolfgang Seeker
c2f76a4024 Merge branch 'master' into german_ner 2016-04-25 13:21:23 +02:00
Wolfgang Seeker
1003e7ccec remove debug output from tests 2016-04-25 12:12:40 +02:00
Wolfgang Seeker
f57f843e85 fix bug in updating tree structure when introducing additional roots 2016-04-25 12:01:19 +02:00
Matthew Honnibal
478a8d1829 * Register Chinese language in spacy/__init__.py 2016-04-24 18:45:16 +02:00
Matthew Honnibal
8569dbc2d0 * Add initial stuff for Chinese parsing 2016-04-24 18:44:24 +02:00
Wolfgang Seeker
4d7f393fae don't require json-files to have syntactic annotation 2016-04-22 16:32:27 +02:00
Wolfgang Seeker
b6477fc4f4 adjusted tests to Travis Setup 2016-04-21 17:15:10 +02:00
Wolfgang Seeker
736ffcb9a2 remove whitespace 2016-04-21 16:55:55 +02:00
Wolfgang Seeker
6c7301cc6d the parser now introduces sentence boundaries properly when predicting dependents with root labels 2016-04-21 16:50:53 +02:00
Wolfgang Seeker
12024b0b0a bugfix: introducing multiple roots now updates original head's properties
adjust tests to rely less on statistical model
2016-04-20 16:42:41 +02:00
Matthew Honnibal
67ce96c9c9 * Make patterns argument to Matcher class optional 2016-04-17 21:32:24 +02:00
Matthew Honnibal
8b4677d34d * Add missing keyword arguments to spacy.load() function 2016-04-17 21:31:50 +02:00
Matthew Honnibal
2add5206aa * Fix description of matcher test 2016-04-17 15:40:21 +02:00
Matthew Honnibal
2b419d5b8c * Update test for Issue #242 2016-04-17 15:34:23 +02:00
Matthew Honnibal
f12b043308 * Add test for Issue #242: Overlapping matches not well recognised. 2016-04-17 15:19:17 +02:00
Wolfgang Seeker
b98cc3266d bugfix: iterators now reset properly when called a second time 2016-04-15 17:49:16 +02:00
Wolfgang Seeker
e6945c4d0e bugfix: uppercase attr values before looking them up 2016-04-15 15:46:31 +02:00
Matthew Honnibal
c0909afe22 Merge pull request #312 from wbwseeker/space_head_bug
add restrictions to L-arc and R-arc to prevent space heads
2016-04-15 20:36:03 +10:00
Wolfgang Seeker
289b10f441 remove some comments 2016-04-14 15:37:51 +02:00
Matthew Honnibal
6f82065761 * Fix infixed commas in tokenizer, re Issue #326. Need to benchmark on empirical data, to make sure this doesn't break other cases. 2016-04-14 11:36:03 +02:00
Matthew Honnibal
0f957dd586 Merge branch 'master' of ssh://github.com/honnibal/spaCy 2016-04-14 10:37:56 +02:00
Matthew Honnibal
108aca0e50 * Make Matcher use attrs from the attrs.pyx file, rather than having an incomplete function doing the mapping. 2016-04-14 10:37:39 +02:00
Matthew Honnibal
61d20de35d * Fix language.py docstring 2016-04-14 10:36:57 +02:00
Wolfgang Seeker
d99a9cbce9 different handling of space tokens
space tokens are now always attached to the previous non-space token
there are two exceptions:
leading space tokens are attached to the first following non-space token
in input that consists exclusively of space tokens, the last space token
is the head of all others.
2016-04-13 15:28:28 +02:00
Matthew Honnibal
04d0209be9 * Recognise multiple infixes in a token. 2016-04-13 18:38:26 +10:00
Henning Peters
a473d6e937 fix tests (use english model) 2016-04-12 16:41:57 +02:00
Henning Peters
f2d011c034 avoid polluting spacy namespace with lang classes 2016-04-12 16:31:16 +02:00
Henning Peters
ff690f76ba fix loading non-german models 2016-04-12 16:00:56 +02:00
Henning Peters
6215272786 remove ujson as default non-dev dependency (still works as fallback if installed), because ujson doesn't ship wheels 2016-04-12 11:28:07 +02:00
Matthew Honnibal
6df3858dbc * Fix Issue #323: Incorrect semantics of Token.__str__ built-in. Add flag to allow users to switch the old semantics back on, to ease transition. 2016-04-12 13:17:59 +10:00
Wolfgang Seeker
d328e0b4a8 Merge branch 'master' into space_head_bug 2016-04-11 12:11:01 +02:00
Wolfgang Seeker
80bea62842 bugfix in unit test 2016-04-08 16:46:44 +02:00
Wolfgang Seeker
be4903a1b2 update version numbers 2016-04-08 13:54:05 +02:00
Wolfgang Seeker
1fe911cdb0 bigfix 2016-04-07 18:19:51 +02:00
Matthew Honnibal
872695759d Merge pull request #306 from wbwseeker/german_noun_chunks
add German noun chunk functionality
2016-04-08 00:54:24 +10:00
Henning Peters
470cdf5bf9 remove deprecated LOCAL_DATA_DIR 2016-04-05 11:25:54 +02:00
Matthew Honnibal
26622f0ffc Merge branch 'master' of ssh://github.com/honnibal/spaCy 2016-03-29 14:31:52 +11:00
Matthew Honnibal
b1fe41b45d * Extend infix test, commenting on limitation of tokenizer w.r.t. infixes at the moment. 2016-03-29 14:31:05 +11:00
Matthew Honnibal
9c73983bdd * Add test for hyphenation problem in Issue #302 2016-03-29 14:27:13 +11:00
Matthew Honnibal
ad119c074f * Fix incorrect whitespacing in Doc.text. This change is potentially breaking, to anyone who was relying on the previous incorrect semantics. 2016-03-29 13:02:42 +11:00
Matthew Honnibal
8c7a1908ee Merge pull request #307 from scoder/faster_string_store
remove internal redundancy and overhead from StringStore
2016-03-29 12:59:52 +11:00
Wolfgang Seeker
7195b6742d add restrictions to L-arc and R-arc to prevent space heads 2016-03-28 10:40:52 +02:00
Matthew Honnibal
8c77a994c6 Merge pull request #305 from henningpeters/master
multiple langs in download script
2016-03-26 21:54:59 +11:00
Henning Peters
c90d4a6f17 relative imports in __init__.py 2016-03-26 11:44:53 +01:00
Henning Peters
db095a162c fix 2016-03-25 18:59:47 +01:00
Henning Peters
b8f63071eb add lang registration facility 2016-03-25 18:54:45 +01:00
Matthew Honnibal
4a37fdcee1 Merge pull request #287 from wbwseeker/deproj_sentbnd_bug
add function to Token for setting head and dep (and dep_)
2016-03-25 09:47:45 +11:00