Wolfgang Seeker
5080077097
revert init_model.py back to pre-german state (because it makes more sense)
...
simplify token.n_rights and token.n_lefts
2016-03-21 16:10:25 +01:00
Wolfgang Seeker
5e2e8e951a
add baseclass DocIterator for iterators over documents
...
add classes for English and German noun chunks
the respective iterators are set for the document when created by the parser
as they depend on the annotation scheme of the parsing model
2016-03-16 15:53:35 +01:00
Matthew Honnibal
80134eb12d
Merge branch 'master' of https://github.com/spacy-io/spaCy
2016-03-15 19:14:50 +00:00
Wolfgang Seeker
2ae253ef5b
changed head.__set__ to make it simpler
2016-03-14 13:43:48 +01:00
Henning Peters
c12d3dd200
add __init__.py to empty package dirs
2016-03-14 11:28:03 +01:00
Henning Peters
54f3447b5f
cleanup
2016-03-14 01:46:33 +01:00
Wolfgang Seeker
46e3f979f1
add function for setting head and label to token
...
change PseudoProjectivity.deprojectivize to use these functions
2016-03-11 17:31:06 +01:00
Wolfgang Seeker
03fb498dbe
introduce lang field for LexemeC to hold language id
...
put noun_chunk logic into iterators.py for each language separately
2016-03-10 13:01:34 +01:00
Wolfgang Seeker
bc9c62e279
replace Language functions with corresponding orth functions
...
implement punctuation functions in orth
2016-03-09 18:07:37 +01:00
Wolfgang Seeker
d9312bc9ea
add new files npchunks.{pyx,pxd} to hold noun phrase chunk generators
2016-03-09 16:18:48 +01:00
Matthew Honnibal
1508528c8c
* Increment version
2016-03-08 15:58:45 +00:00
Matthew Honnibal
963fe5258e
* Add missing __contains__ method to vocab
2016-03-08 15:49:10 +00:00
Matthew Honnibal
478aa21cb0
* Remove broken __reduce__ method on vocab
2016-03-08 15:48:21 +00:00
Matthew Honnibal
20235bde00
Merge pull request #282 from henningpeters/switch_vectors
...
initial proposal for ability to switch vectors
2016-03-09 01:39:41 +11:00
Henning Peters
eb7ae61b1c
cleanup api
2016-03-08 12:59:18 +01:00
Henning Peters
b740f20191
hash_string() should not depend on python's internal unicode representation, also fixes https://github.com/spacy-io/sense2vec/issues/5 for py2
2016-03-06 09:19:27 +01:00
Henning Peters
aa4d964c14
cleanup api
2016-03-05 17:51:32 +01:00
Henning Peters
931c07a609
initial proposal for separate vector package
2016-03-04 11:09:06 +01:00
Wolfgang Seeker
7adbd7a785
replace Counter with normal dict
2016-03-03 21:36:27 +01:00
Wolfgang Seeker
1ae487a4f6
add backwards compatibility with python 2.6
2016-03-03 21:18:12 +01:00
Wolfgang Seeker
9d1e6de4a0
make a proper list from zip iterator
2016-03-03 19:51:01 +01:00
Wolfgang Seeker
49f9d1c085
change test_nonproj.py to not use zip inside numpy.asarray
2016-03-03 19:42:09 +01:00
Wolfgang Seeker
72b8df0684
turned PseudoProjectivity into a normal python class
2016-03-03 19:05:08 +01:00
Matthew Honnibal
fcaa0ad7ce
Merge pull request #280 from wbwseeker/german_parser
...
German parser
2016-03-04 03:27:42 +11:00
Wolfgang Seeker
690c5acabf
adjust train.py to train both english and german models
2016-03-03 15:21:00 +01:00
Wolfgang Seeker
3448cb40a4
integrated pseudo-projective parsing into parser
...
- nonproj.pyx holds a class PseudoProjectivity which currently holds
all functionality to implement Nivre & Nilsson 2005's pseudo-projective
parsing using the HEAD decoration scheme
- changed lefts/rights in Token to account for possible non-projective
structures
2016-03-01 10:09:08 +01:00
Wolfgang Seeker
56b7210e82
moved nonproj.py to syntax/nonproj.pyx
2016-02-25 15:08:49 +01:00
Henning Peters
f3df736e0a
remove unidecode-related test
2016-02-24 18:22:22 +01:00
Wolfgang Seeker
4b2297d5d4
add class PseudoProjective for pseudo-projective parsing
...
PseudoProjective() implements the algorithm from Nivre & Nilsson 2005
using their HEAD decoration scheme.
2016-02-24 11:26:25 +01:00
Henning Peters
12d58a7099
remove text-unidecode dependency
2016-02-24 08:01:59 +01:00
Wolfgang Seeker
8d531c958b
replace tests for non-projectivity
...
- add functions to find non-projective edges
- add test file for non-projectivity functions
2016-02-22 14:40:40 +01:00
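For readers unfamiliar with the term in the commit above: an arc is non-projective when some token lying between the head and the dependent is not a descendant of that head. The sketch below is only an illustration of that definition, not the code from this commit. The function name `is_nonproj_arc` and the convention that `heads[i]` holds the absolute index of token i's head (with the root pointing at itself) are assumptions made for the example.

```python
def is_nonproj_arc(dep, heads):
    """Return True if the arc heads[dep] -> dep is non-projective.

    Assumed convention (hypothetical): heads[i] is the index of token i's
    head, and the root token is its own head.
    """
    head = heads[dep]
    if head == dep:  # the root has no incoming arc to check
        return False
    lo, hi = min(head, dep), max(head, dep)
    for k in range(lo + 1, hi):
        # Walk up from k; the arc is projective only if we reach `head`.
        anc = k
        steps = 0
        while anc != head and heads[anc] != anc and steps < len(heads):
            anc = heads[anc]
            steps += 1
        if anc != head:
            return True
    return False


def nonproj_arcs(heads):
    """List the dependents whose incoming arc is non-projective."""
    return [dep for dep in range(len(heads)) if is_nonproj_arc(dep, heads)]
```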
Matthew Honnibal
141639ea3a
* Fix bug in tokenizer that caused new tokens to be added for affixes
2016-02-21 23:17:47 +00:00
Wolfgang Seeker
eae35e9b27
add tokenizer files for German, add/change code to train German pos tagger
...
- add files to specify rules for German tokenization
- change generate_specials.py to generate from an external file (abbrev.de.tab)
- copy gazetteer.json from lang_data/en/
- init_model.py
- change doc freq threshold to 0
- add train_german_tagger.py
- expects conll09-formatted input
2016-02-18 13:24:20 +01:00
Henning Peters
9cc4f8d5b3
avoid shadowing __name__
2016-02-15 01:33:39 +01:00
Henning Peters
4c9e3c7911
upgrade sputnik, enforce data api via model version constraints
2016-02-14 16:03:17 +01:00
Henning Peters
9d8966a2c0
Update test_tokenizer.py
2016-02-10 19:24:37 +01:00
Henning Peters
3b5f1e753b
py26 compatibility
2016-02-10 14:32:54 +01:00
Henning Peters
ee1f1ac300
mark test_sentence_space() as model test
2016-02-10 07:49:11 +01:00
Matthew Honnibal
5d96b3ef4f
* Increment version
2016-02-07 13:48:58 +01:00
Matthew Honnibal
1b83cb9dfa
* Fix Issue #251 : Incorrect right edge calculation on left-clobber low in the tree
2016-02-07 00:00:42 +01:00
Matthew Honnibal
c6623889c1
* Add test for Issue #251 : Incorrect right edges, caused by bad update to r_edge in del_arc, triggered from non-monotonic left-arc
2016-02-06 23:47:51 +01:00
Matthew Honnibal
a95974ad3f
* Fix oov probability
2016-02-06 15:13:55 +01:00
Matthew Honnibal
af8514cb0c
* Refine the way the is_parsed attribute is set by from_array
2016-02-06 14:44:35 +01:00
Matthew Honnibal
161b01d4c0
* Tweak usage example for multi-processing
2016-02-06 14:44:11 +01:00
Matthew Honnibal
7f24229f10
* Don't try to pickle the tokenizer
2016-02-06 14:09:05 +01:00
Matthew Honnibal
dcb401f3e1
* Remove broken Vocab pickling
2016-02-06 14:08:47 +01:00
Matthew Honnibal
e66d45bf66
* Restore previous patch to Span.root, as it seems it wasn't the cause of the problem.
2016-02-06 13:37:41 +01:00
Matthew Honnibal
4412a70dc5
* Initialize StateC._empty_token to 0, to avoid undefined behaviour.
2016-02-06 13:34:38 +01:00
Matthew Honnibal
1b41f868d2
* Check for errors in parser, and parallelise the left-over batch
2016-02-06 10:06:30 +01:00
Matthew Honnibal
031b00cb91
* Fix Span.root calculation
2016-02-05 20:12:09 +01:00
Matthew Honnibal
165ca28b80
* Set is_parsed flag in Parser.pipe
2016-02-05 19:51:44 +01:00
Matthew Honnibal
bdd579db0a
* Set is_parsed flag in Parser.pipe
2016-02-05 19:50:11 +01:00
Matthew Honnibal
7119e77fb6
* Fix Matcher.pipe
2016-02-05 19:46:02 +01:00
Matthew Honnibal
1cf0100bf6
* Add test for multithreading
2016-02-05 19:38:22 +01:00
Matthew Honnibal
b04c9aad71
* Fix off-by-one in Parser.pipe
2016-02-05 19:37:50 +01:00
Matthew Honnibal
e5c447e237
* Questionable fix to problem in Span.root
2016-02-05 19:18:35 +01:00
Matthew Honnibal
1ef84a0557
* Merge master into rethinc2
2016-02-05 12:55:59 +01:00
Matthew Honnibal
4cf34fc170
Merge branch 'rethinc2' of ssh://github.com/honnibal/spaCy into rethinc2
2016-02-05 12:48:28 +01:00
Matthew Honnibal
249dccbe95
* Fix Language.pipe
2016-02-05 12:47:57 +01:00
Matthew Honnibal
c0e63feccc
* xfail pickle tests
2016-02-05 12:46:58 +01:00
Matthew Honnibal
6aa92b70f1
* Fix merge problem in span
2016-02-05 12:46:11 +01:00
Matthew Honnibal
048dfe35aa
* cimport cython.parallel
2016-02-05 12:20:42 +01:00
Matthew Honnibal
af58f273b3
* Fix spacy.language.pipe
2016-02-05 12:20:29 +01:00
Matthew Honnibal
8a13cebdcc
* Update for modified thinc interface
2016-02-05 11:44:39 +01:00
Matthew Honnibal
48ce09687d
* Skip pickling the vocab in the tests
2016-02-04 15:51:19 +01:00
Matthew Honnibal
419edfab50
* Use generic flags for the new attributes until they're added
2016-02-04 15:50:54 +01:00
Matthew Honnibal
c4017a06d9
* Add placeholders for the new flags in attrs and symbols
2016-02-04 15:49:45 +01:00
Matthew Honnibal
e5c96c969f
* Wire up new attributes
2016-02-04 13:04:58 +01:00
Matthew Honnibal
9703ccc3de
* Remove unused import
2016-02-04 13:04:33 +01:00
Matthew Honnibal
11810be33e
* Add Python hooks for is_bracket/is_quote/is_left_punct/is_right_punct
2016-02-04 13:04:16 +01:00
Matthew Honnibal
fe611132f0
* Add stubs for is_bracket/is_quote/is_left_punct/is_right_punct functions
2016-02-04 13:03:04 +01:00
Matthew Honnibal
ee975d36d0
* Add stubs to test is_bracket/is_quote/is_left_punct/is_right_punct functions
2016-02-04 13:02:25 +01:00
Matthew Honnibal
f9e765cae7
* Add pipe() method to tokenizer
2016-02-03 02:32:37 +01:00
Matthew Honnibal
4cbad510ff
* Fix calculation of head for spans with punctuation.
2016-02-03 02:32:21 +01:00
Matthew Honnibal
84b247ef83
* Add a .pipe method, that takes a stream of input, operates on it, and streams the output. Internally, the stream may be buffered, to allow multi-threading.
2016-02-03 02:10:58 +01:00
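The streaming behaviour described in the commit above can be pictured as a generator that buffers its input into batches before yielding results. This is a minimal sketch of that idea only. The free function `pipe`, its `func` argument, and the `batch_size` parameter are hypothetical names for illustration and are not the signature introduced by this commit.

```python
from itertools import islice


def pipe(stream, func, batch_size=4):
    """Consume `stream` lazily, process it in buffered batches, yield results in order.

    Hypothetical stand-in: `func` plays the role of the per-item processing step.
    """
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        # A multi-threaded implementation could hand the whole batch
        # to worker threads at this point; here it is processed serially.
        for item in batch:
            yield func(item)
```

Because the function is a generator, nothing is read from the input until the caller starts iterating, and at most `batch_size` items are held in the buffer at a time.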
Matthew Honnibal
fcfc17a164
Merge branch 'master' into rethinc2
2016-02-02 23:05:34 +01:00
Matthew Honnibal
f204daf27b
* Add error warning that a gold tag is unrecognised
2016-02-02 22:59:59 +01:00
Matthew Honnibal
99b8906100
* Accept punct_labels as an argument to the scorer
2016-02-02 22:59:06 +01:00
Matthew Honnibal
59123443e2
* Check for presence/absence of the different models in Language.end_training
2016-02-02 22:49:55 +01:00
Matthew Honnibal
9e9d4c8706
* Fix stupid error in Language.batch
2016-02-01 09:49:32 +01:00
Matthew Honnibal
e3db39dd21
* Fix compiler warning about signed/unsigned comparison
2016-02-01 09:08:07 +01:00
Matthew Honnibal
98fbdf2856
* Add Language.batch() method, to support multi-threaded jobs
2016-02-01 09:01:13 +01:00
Matthew Honnibal
b3802562d6
Merge branch 'rethinc2' of https://github.com/honnibal/spaCy into rethinc2
2016-02-01 08:59:24 +01:00
Matthew Honnibal
4b08a3fafd
* Fix merge conflict
2016-02-01 08:58:18 +01:00
Matthew Honnibal
5188f6d9d8
* Fix parseC function
2016-02-01 08:48:48 +01:00
Matthew Honnibal
bcf8f7ba40
* Add a parse_batch method to Parser, that releases the GIL around a batch of documents.
2016-02-01 08:34:55 +01:00
Matthew Honnibal
d5579cd0d8
Merge branch 'rethinc2' of https://github.com/honnibal/spaCy into rethinc2
2016-02-01 03:08:49 +01:00
Matthew Honnibal
490ba65398
* Use openmp in parser
2016-02-01 03:08:42 +01:00
Matthew Honnibal
cb78d91ec5
* Fix ArcEager.set_valid
2016-02-01 03:07:37 +01:00
Matthew Honnibal
28e5ad62bc
* Pass a StateC pointer into the transition and validation methods in the parser, so that the GIL can be released over a batch of documents
2016-02-01 03:00:15 +01:00
Matthew Honnibal
a47f00901b
* Pass a StateC pointer into the transition and validation methods in the parser, so that the GIL can be released over a batch of documents
2016-02-01 02:58:14 +01:00
Matthew Honnibal
daaad66448
* Now fully proxied
2016-02-01 02:37:08 +01:00
Matthew Honnibal
7a0e3bb9c1
* Continue proxying. Some problem currently
2016-02-01 02:22:21 +01:00
Matthew Honnibal
2169bbb7ea
* Shadow StateClass with StateC, to start proxying
2016-02-01 01:16:14 +01:00
Matthew Honnibal
2fa228458e
* Add _state file, which StateClass will proxy to
2016-02-01 01:09:21 +01:00
Matthew Honnibal
6bb007d16e
* Make set_parse nogil
2016-01-30 20:27:52 +01:00
Matthew Honnibal
9410e74c92
* Switch parser to use nogil functions
2016-01-30 20:27:07 +01:00
Matthew Honnibal
10877a7791
* Update for thinc 5.0, including changing cost from int to weight_t, and updating the tagger and parser
2016-01-30 14:31:36 +01:00
Matthew Honnibal
ea4ff94cde
* Whitespace
2016-01-29 03:59:22 +01:00
Matthew Honnibal
b0718b6ee1
* Move to thinc 5.0
2016-01-29 03:58:55 +01:00
Matthew Honnibal
9721502c81
* Update version
2016-01-25 15:52:59 +01:00
Matthew Honnibal
907e8cf07d
* Add u prefix to string in web example
2016-01-25 15:51:38 +01:00
Matthew Honnibal
eba03695ef
* Comment out pickle tests
2016-01-25 15:51:13 +01:00
Matthew Honnibal
de94e6c525
* Mark pickle tests as xfail, due to temp files problem
2016-01-25 15:24:17 +01:00
Matthew Honnibal
87172a15c6
* Fix runtime error bug that arose from updated Span.root function.
2016-01-25 15:22:42 +01:00
Matthew Honnibal
2c8dd91785
* Fix first code example on the website
2016-01-23 18:09:19 +01:00
Matthew Honnibal
3af84cfd6e
* Increment version
2016-01-21 17:49:27 +01:00
Henning Peters
65aeac24cb
remove package version constraint
2016-01-21 17:40:51 +01:00
Matthew Honnibal
792c98a438
* Increment version for OSX-fixed release of v0.100
2016-01-21 00:23:04 +01:00
Matthew Honnibal
82d011ac43
* Fix test for whitespace
2016-01-19 20:38:26 +01:00
Matthew Honnibal
e89069dcae
* Fix matcher test
2016-01-19 20:24:01 +01:00
Matthew Honnibal
63e3d4e27f
* Add comment on Vocab.__reduce__
2016-01-19 20:11:25 +01:00
Matthew Honnibal
e1282b7f2f
* Require user-custom NER classes to work without adding the label.
2016-01-19 20:11:03 +01:00
Matthew Honnibal
84c5dfbfc3
* Clean up debugging python list
2016-01-19 20:10:32 +01:00
Matthew Honnibal
04d0686b26
* Make TransitionSystem.add_action idempotent, i.e. ignore duplicate added actions.
2016-01-19 20:10:04 +01:00
Matthew Honnibal
c4a89d56bd
* Automatically register any entity types pre-set on the tokens, so that the NER works with user-given entity types.
2016-01-19 20:09:26 +01:00
Matthew Honnibal
f0f92793f6
* Add test for user NER classes in matcher blocking the NER model. Re Issue #178 and Issue #217
2016-01-19 19:23:16 +01:00
Matthew Honnibal
65c5bc4988
* Add add_label method, to allow users to register new entity types and dependency labels.
2016-01-19 19:11:02 +01:00
Matthew Honnibal
151aa0b0e2
* Allow users to add_label, in order to extend the entity recogniser to new classes. Does not by itself add a class to the model
2016-01-19 19:09:33 +01:00
Matthew Honnibal
c8e0011ebc
* Add iterators to the NER and parser transition systems, to get the action types
2016-01-19 19:07:43 +01:00
Matthew Honnibal
515493c675
* Add xfail test for Issue #225 : tokenization with non-whitespace delimiters
2016-01-19 13:20:14 +01:00
Matthew Honnibal
7abe653223
* Fix imports
2016-01-19 03:36:51 +01:00
Matthew Honnibal
590f38bdb2
* Add hacky solution to Issue #220 . Currently specials.json only supports literal patterns, which doesn't allow us to pre-tag whitespace with the correct token, SP, as a rule. The data-driven approach should be easy but for some reason fails here. Adding a hard code in Morphology isn't a good solution, but we do want to fix the behaviour right away, and don't want to wait for an architecturally better solution.
2016-01-19 03:35:20 +01:00
Matthew Honnibal
445164d5b4
* Restore the LOCAL_DATA_DIR global in spacy/en/__init__.py, although this is now deprecated
2016-01-19 02:54:56 +01:00
Matthew Honnibal
04177debd0
* Unwind limit to sentence boundary detection that prevents it from inserting boundaries on whitespace. Replace it with a check for whitespace in StateClass.fast_forward, so that whitespace is LeftArced when it's on the stack. This should prevent the previous problem of whitespace-only sentences. Should fix Issue #184 , but may cause further problems. Needs testing.
2016-01-19 02:54:15 +01:00
Matthew Honnibal
7893de3203
* Add test for Issue #184 : Whitespace at sentence boundary causes sentence boundary error.
2016-01-18 23:04:38 +01:00
Matthew Honnibal
bba0a5e078
* Handle string paths in default_vocab, default_parser, default_entity in Language class
2016-01-18 22:37:24 +01:00
Matthew Honnibal
e825fd9554
* Make some of the website tests work without models
2016-01-18 18:14:44 +01:00
Matthew Honnibal
334c4b2b57
* Disprefer punctuation and spaces as heads of spans
2016-01-18 18:14:09 +01:00
Matthew Honnibal
bed36ab0ff
* Fix import of HEAD attribute
2016-01-18 17:34:43 +01:00
Matthew Honnibal
28c659c1fe
* Fix import for numpy
2016-01-18 17:25:04 +01:00
Matthew Honnibal
fc36bcf458
* Fix import for English
2016-01-18 17:14:40 +01:00
Matthew Honnibal
cc4c335e14
* Set heads for test_merge_tokens, to make the test run without models
2016-01-18 17:00:11 +01:00
Matthew Honnibal
c107da9738
* Bug fix to _count_words_to_root
2016-01-18 16:59:38 +01:00
Matthew Honnibal
f24833d607
* Fix merge for coordinations
2016-01-18 16:03:19 +01:00
Matthew Honnibal
14534958a9
* Fix bug in Span.root
2016-01-18 15:40:28 +01:00
Matthew Honnibal
714cbc03d5
* Add test for Issue #203 : nested noun chunks.
2016-01-16 18:02:30 +01:00
Matthew Honnibal
4e2253170c
* Move test for doc.merge to tokens_api file, to avoid name conflicts which upset pytest
2016-01-16 18:01:36 +01:00
Matthew Honnibal
34a157511f
* Move test_merge_hang to test_tokens_api
2016-01-16 18:00:26 +01:00
Matthew Honnibal
fc8f26584a
* Don't consider NPs connected to parse via conj relation as noun chunks. Change motivated by the nested noun chunks identified in Issue #203 , but might be problematic. Also allow root NPs to be considered noun chunks.
2016-01-16 17:52:40 +01:00
Matthew Honnibal
4a16dbfeca
* Add test for Issue #203 : noun chunks should be flat, but sometimes are nested
2016-01-16 17:41:25 +01:00
Matthew Honnibal
995b2d18fd
* Route token.string via token.text_with_ws, to deprecate token.string in future
2016-01-16 17:14:34 +01:00
Matthew Honnibal
54a98eaf19
* Fix typo text_wth_ws --> text_with_ws. Reroute .string attribute to text_with_ws, to deprecate .string in future
2016-01-16 17:13:50 +01:00
Matthew Honnibal
3e9961d2c4
* If final token is whitespace, don't mark it as owning a trailing space. Fixes Issue #154
2016-01-16 17:08:59 +01:00
Matthew Honnibal
223d2b3484
* Add test for Issue #154 : Additional whitespace introduced when string ends with a whitespace token.
2016-01-16 17:08:07 +01:00
Matthew Honnibal
3dc398b727
* Fix merge conflict in requirements.txt
2016-01-16 16:20:49 +01:00
Matthew Honnibal
fc5962a77d
* Improve test for root token in Span
2016-01-16 16:19:09 +01:00
Matthew Honnibal
c025a0c64b
* Check for KeyboardInterrupt in parser.__call__
2016-01-16 16:18:44 +01:00
Matthew Honnibal
03e8a4293d
* Add loop guard to Token.lefts and Token.rights properties
2016-01-16 16:18:17 +01:00
Matthew Honnibal
304339985e
* Add a linear scan to Span.root method, to help with long sentences
2016-01-16 16:17:28 +01:00
Matthew Honnibal
aa0dd79f52
* Delete test_token_references, which checked a flakey strategy for preventing orphan tokens from a while ago. Now orphan tokens simply hold a reference to Pool, preventing the memory from being freed underneath them. This means that we don't need to run this slow test.
2016-01-16 16:03:35 +01:00
Matthew Honnibal
8cbcc3a799
* Fix calculation of root token in Span. Now take root to be word with shortest tree path. Avoids parse trees ending up in inconsistent state, as had occurred in Issue #214 .
2016-01-16 15:38:50 +01:00
Matthew Honnibal
c1039fa4b4
* Add test for Issue #214 . Resolved in change to Span.root
2016-01-16 15:37:47 +01:00
Henning Peters
41ea14a56f
fix pickling
2016-01-16 13:23:11 +01:00
Henning Peters
5551052840
fix py2/3 issue
2016-01-16 12:44:53 +01:00
Henning Peters
235f094534
untangle data_path/via
2016-01-16 12:23:45 +01:00
Matthew Honnibal
42a9f29b40
* Add loop guard in Span.root, to raise errors if there is a cycle in the dependency parse, instead of entering an infinite loop. Re Issue #214
2016-01-16 11:53:37 +01:00
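A loop guard of the kind described in the commit above can be sketched as a bounded walk up the head pointers: a valid path to the root visits at most as many tokens as the document contains, so running past that bound means the parse has a cycle. The name `find_root` and the convention that `heads[i]` is the absolute index of token i's head (the root being its own head) are assumptions for this illustration, not the code from the commit.

```python
def find_root(heads, start):
    """Follow head pointers from `start` to the root, raising on a cycle.

    Assumed convention (hypothetical): heads[i] is the index of token i's
    head, and the root token is its own head.
    """
    current = start
    # Any acyclic path to the root needs at most len(heads) checks.
    for _ in range(len(heads)):
        if heads[current] == current:
            return current
        current = heads[current]
    raise ValueError("cycle detected in dependency parse")
```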
Henning Peters
6d1a3af343
cleanup unused
2016-01-16 10:05:04 +01:00
Henning Peters
846fa49b2a
distinct load() and from_package() methods
2016-01-16 10:00:57 +01:00
Henning Peters
211913d689
add about.py, adapt setup.py
2016-01-15 18:57:01 +01:00
Henning Peters
f8a8f97d25
cleanup
2016-01-15 18:13:37 +01:00
Henning Peters
780cb847c9
add default_model to about
2016-01-15 18:07:15 +01:00
Henning Peters
788f734513
refactored data_dir->via, add zip_safe, add spacy.load()
2016-01-15 18:01:02 +01:00
Matthew Honnibal
478a79a3d5
* Add test for Issue #220 : Whitespace being tagged as noun
2016-01-15 16:17:07 +01:00
Henning Peters
d9471f684f
fix typo
2016-01-14 12:14:12 +01:00
Henning Peters
9b75d872b0
fix model download
2016-01-14 12:02:56 +01:00
Henning Peters
bc229790ac
integrate with sputnik
2016-01-13 19:46:17 +01:00
Matthew Honnibal
3fbfba575a
* xfail the contractions test
2015-12-31 13:16:28 +01:00
Matthew Honnibal
3bd910ccad
* Merge therell test
2015-12-31 11:55:18 +01:00
Matthew Honnibal
eaf2ad59f1
* Fix use of mock Package object
2015-12-31 04:13:15 +01:00
Matthew Honnibal
029136a007
* Fix resource loading for Matcher
2015-12-31 02:45:12 +01:00
Matthew Honnibal
55bcdf8bdd
* Fix errors
2015-12-29 22:32:03 +01:00
Matthew Honnibal
a6ba43ecaf
* Fix errors in packaging revision
2015-12-29 18:37:26 +01:00
Matthew Honnibal
4b4eec8b47
* Fix Issue #201 : Tokenization of there'll
2015-12-29 18:09:09 +01:00
Matthew Honnibal
86ee9d046d
* Remove test that belongs to a change for master
2015-12-29 18:07:23 +01:00
Matthew Honnibal
a2dfdec85d
* Clean up spacy.util
2015-12-29 18:06:09 +01:00
Matthew Honnibal
aec130af56
Use util.Package class for io
...
Previous Sputnik integration caused API change: Vocab, Tagger, etc
were loaded via a from_package classmethod, that required a
sputnik.Package instance. This forced users to first create a
sputnik.Sputnik() instance, in order to acquire a Package via
sp.pool().
Instead I've created a small file-system shim, util.Package, which
allows classes to have a .load() classmethod, that accepts either
util.Package objects, or strings. We can later gut the internals
of this and make it a proxy for Sputnik if we need more functionality
that should live in the Sputnik library.
Sputnik is now only used to download and install the data, in
spacy.en.download
2015-12-29 18:00:48 +01:00
Matthew Honnibal
0e2498da00
* Replace from_package with load() classmethod in Vocab
2015-12-29 16:56:51 +01:00
Matthew Honnibal
c5902f2b4b
* Upd Lemmatizer to use MockPackage. Replace from_package with load() classmethod
2015-12-29 16:56:02 +01:00
Matthew Honnibal
4131e45543
* Add MockPackage class, to see whether we can proxy for Sputnik in a lightweight way
2015-12-29 16:55:03 +01:00
Matthew Honnibal
f5dea1406d
* Fix silly mistake in Language.__init__
2015-12-28 18:48:57 +01:00
Matthew Honnibal
187960606f
* Fix pickle problems
2015-12-28 16:54:03 +01:00
Matthew Honnibal
8c7e149ec9
* Replace kwargs argument of Language.__init__ with explicit arguments, to fix pickle bug
2015-12-28 15:56:27 +01:00
Henning Peters
32d655b6e1
bump version
2015-12-28 09:34:39 +01:00
Matthew Honnibal
8b61d45ed0
* Fix merge conflicts for headers branch
2015-12-27 17:46:25 +01:00
Matthew Honnibal
6bb9c7f311
Merge pull request #202 from henningpeters/sputnik
...
access model via sputnik
2015-12-28 03:29:53 +11:00
Henning Peters
0e321a7105
get mingw32 to work
2015-12-22 23:25:38 +01:00
Henning Peters
d8d348bb55
allow specifying a version constraint within the model name
2015-12-18 19:12:08 +01:00
Henning Peters
7f7299cafb
Merge branch 'tmpdir' into headers
2015-12-18 12:25:25 +01:00
Henning Peters
cfa187aaf0
fix tests
2015-12-18 10:58:02 +01:00
Henning Peters
8359bd4d93
strip data/ from package, friendlier Language invocation, make data_dir backward/forward-compatible
2015-12-18 09:52:55 +01:00
Henning Peters
970278a3d6
no need to link data dir anymore
2015-12-18 09:49:45 +01:00
Henning Peters
4f3efb8eaf
avoid writing to /tmp (not cross-platform compatible)
2015-12-16 19:56:40 +01:00
Henning Peters
4ada39f472
avoid writing to /tmp (not cross-platform compatible)
2015-12-16 19:53:06 +01:00
Henning Peters
2d4efe40f9
fix sputnik call
2015-12-13 14:46:08 +01:00
Henning Peters
ac318b568c
new approach to dependency headers
2015-12-13 11:49:17 +01:00
Henning Peters
345dda6f53
small fixes, add package build step
2015-12-07 06:50:26 +01:00
Henning Peters
9027cef3bc
access model via sputnik
2015-12-07 06:01:28 +01:00
Henning Peters
73e5650be5
change index server
2015-11-18 18:09:46 +01:00
Henning Peters
50d15ea5d2
fix
2015-11-18 17:35:21 +01:00
Henning Peters
02a1dcec76
add data dir
2015-11-18 11:48:55 +01:00
Henning Peters
919a4f0b04
change data path, add repository
2015-11-18 11:40:46 +01:00
Henning Peters
12de895e60
fix version
2015-11-15 16:38:16 +01:00
Henning Peters
03d2f98cd5
add sputnik
2015-11-15 15:58:21 +01:00
Matthew Honnibal
ec7d36c3a4
* Add test for matcher end-point problem
2015-11-12 05:00:40 +11:00
Matthew Honnibal
d309622a27
* Add test for matcher end-point problem
2015-11-12 04:59:11 +11:00
Matthew Honnibal
56ea20a886
* Add test for matcher end-point problem
2015-11-12 04:58:53 +11:00
Matthew Honnibal
cfa4062147
* Add test for matcher end-point problem
2015-11-12 04:56:07 +11:00
Matthew Honnibal
5623242b3e
* Adjust NER rules, so that U entries in gazetteer don't become B moves to the model
2015-11-12 04:48:23 +11:00
Matthew Honnibal
d67d7d5a86
* Add test for NER inconsistency bug
2015-11-08 16:19:33 +01:00
Matthew Honnibal
44fbdc7260
* Fix bug in NER transition system, that sometimes left no valid moves
2015-11-08 16:19:12 +01:00
Matthew Honnibal
ab5aac5b2f
* Add .rank property to Token and Lexeme, for frequency rank
2015-11-08 16:18:25 +01:00
Matthew Honnibal
fde9a22ec2
* Add new test for ner
2015-11-08 13:57:15 +01:00
Matthew Honnibal
e92371bb54
* Fix rule that made Last action invalid if there was a preset of O, since if the entity is already open, that ship has sailed.
2015-11-08 22:17:51 +11:00
Matthew Honnibal
3b74739c3e
* Download updated data
2015-11-08 21:24:25 +11:00
Matthew Honnibal
31da42eb27
* Mark tests that require models
2015-11-07 19:27:38 +11:00
Matthew Honnibal
8e26a28616
* Mark tests that require models
2015-11-07 19:10:56 +11:00
Matthew Honnibal
15eab7354f
* Remove extraneous test files
2015-11-07 18:45:13 +11:00
Matthew Honnibal
6f47074214
* Make constructor of ParserModel and TaggerModel the same as AveragedPerceptron, for easier pickling.
2015-11-07 18:25:17 +11:00
Matthew Honnibal
1cfa20fb17
* Fix sentence-final whitespace issue
2015-11-07 17:34:46 +11:00
Matthew Honnibal
7663970d5f
* Removed unused i variable from Span, and set attributes to read-only
2015-11-07 17:06:15 +11:00
Matthew Honnibal
4b3c96d76d
* Fix zero-length spans
2015-11-07 17:05:16 +11:00
Matthew Honnibal
888c05a7fa
* Fix variable naming in StepwiseState, for thinc 4.0
2015-11-07 11:02:44 +11:00
Matthew Honnibal
fc2185bfe3
* Fix variable naming in StepwiseState, for thinc 4.0
2015-11-07 10:48:31 +11:00
Matthew Honnibal
954442a807
* Fix variable naming in StepwiseState, for thinc 4.0
2015-11-07 10:30:45 +11:00
Matthew Honnibal
06f26d258e
* Fix test_basic_create
2015-11-07 10:04:37 +11:00
Matthew Honnibal
1d3884c46d
* Fix test_basic_create
2015-11-07 10:03:56 +11:00
Matthew Honnibal
cc8febcbe1
* Fix Span comparison
2015-11-07 09:54:14 +11:00
Matthew Honnibal
af70dc166a
* Fix Last restriction, that was supposed to prevent conflicts with presets, but was incorrect.
2015-11-07 09:52:00 +11:00
Matthew Honnibal
a9b612abdf
* Rework the Span-merge patch, to avoid extending the interface of Doc, and avoid virtualizing the Span.start and Span.end indices, to keep Span usage efficient
2015-11-07 09:01:12 +11:00
Matthew Honnibal
56499d89ef
* Rework the Span-merge patch, to avoid extending the interface of Doc, and avoid virtualizing the Span.start and Span.end indices, to keep Span usage efficient
2015-11-07 08:55:34 +11:00
Andreas Grivas
83ca4e0b93
* use old merge tests - add more
2015-11-07 07:57:04 +11:00
Andreas Grivas
4be7fda453
* span start, end -> properties. autoupdate after merge
2015-11-07 07:57:04 +11:00
Andreas Grivas
562db6d2d0
* merge add lex last - add index finder funcs
2015-11-07 07:57:04 +11:00
Matthew Honnibal
a06e3c8963
* Fix bone-headed mistake in StateClass.E
2015-11-07 07:35:28 +11:00
Matthew Honnibal
d24b8509e4
* Correct screw ups from the previous commits
2015-11-07 06:51:41 +11:00
Matthew Honnibal
5efad178b5
* Set ent tag when close entity
2015-11-07 06:09:25 +11:00
Matthew Honnibal
9285f01d26
* Fix broken StateClass.E tracking
2015-11-07 06:06:39 +11:00
Matthew Honnibal
19136b0e7d
* Add better debug message for illegal move
2015-11-07 05:34:37 +11:00
Matthew Honnibal
2733816b7b
* Fix whitespace
2015-11-07 05:31:06 +11:00
Matthew Honnibal
01ab464383
* Prevent Begin and In moves from applying in NER if we're at the last token of a sentence, as this would mean the entity would span over a sentence boundary. Re Issue #169
2015-11-07 05:30:44 +11:00
Matthew Honnibal
b65633f270
* Fix function that returns nth entity in StateClass. Was only returning the first.
2015-11-07 05:29:11 +11:00
Matthew Honnibal
410b6f9ec1
* Remove deprecated _ml.pyx. We now use the nicer APIs provided by thinc 4.0, and subclass the AveragedPerceptron class.
2015-11-07 05:13:10 +11:00
Matthew Honnibal
3c162dcac3
* Refactor away from the _ml module, to use thinc 4.0. Still some work needs to be done, e.g. to add __reduce__ to the models, more testing, etc.
2015-11-07 03:24:30 +11:00
Matthew Honnibal
9d1b2a103a
* Fix capitalization in lemmatizer
2015-11-06 05:44:35 +11:00
Matthew Honnibal
6ed3aedf79
* Merge vocab changes
2015-11-06 00:48:08 +11:00
Matthew Honnibal
72abbb43fb
* Add type declarations in strings.pyx
2015-11-06 00:47:26 +11:00
Matthew Honnibal
5b2af4864f
* When lemmatizing non-noun, non-verb, non-adj words, output lower-case
2015-11-06 00:45:09 +11:00
Matthew Honnibal
754bf04162
* Remove declaration of Model.update
2015-11-06 00:31:15 +11:00
Matthew Honnibal
e18bdff23a
Merge branch 'master' of ssh://github.com/honnibal/spaCy
2015-11-06 00:26:15 +11:00
Matthew Honnibal
b9991fbd20
* Update to use thinc 3.0
2015-11-06 00:25:59 +11:00
Matthew Honnibal
864a8f45d8
* Use unicode in StringStore.intern, instead of unreliably casting to bytes.
2015-11-05 11:32:19 +00:00
Matthew Honnibal
b18204cd52
* Fix StringStore._realloc, re Issue #155
2015-11-05 11:28:26 +00:00
Matthew Honnibal
f8004c5f65
* Begin upgrading to improved thinc API
2015-11-05 03:53:03 +11:00
Matthew Honnibal
adc7bbd6cf
* Fix name of like_num in default_lex_attrs
2015-11-04 22:02:47 +11:00
Matthew Honnibal
e96faf29e7
* Rename like_number to like_num, to fix inconsistency re Issue #166
2015-11-04 22:01:44 +11:00
Matthew Honnibal
65934b7cd4
* Enforce import of ujson in strings.pyx, because otherwise it's too slow
2015-11-04 00:32:02 +11:00
Matthew Honnibal
1ce5d5602d
* Rename Doc.data to Doc.c
2015-11-04 00:17:13 +11:00
Matthew Honnibal
68f479e821
* Rename Doc.data to Doc.c
2015-11-04 00:15:14 +11:00
Matthew Honnibal
3ddea19b2b
* Rename spans.pyx to span.pyx
2015-11-04 00:14:40 +11:00
Matthew Honnibal
9482d616bc
* Rename spans.pyx to span.pyx
2015-11-03 23:51:05 +11:00
Matthew Honnibal
116da5990a
* Clean up setting of tag in doc.from_bytes
2015-11-03 23:48:57 +11:00
Matthew Honnibal
9ec7b9c454
* Clean up unused Constituent struct.
2015-11-03 23:48:21 +11:00
Matthew Honnibal
1e99fcd413
* Rename .repvec to .vector in C API
2015-11-03 23:47:59 +11:00
Matthew Honnibal
ee3f9ba581
* Fix test of serializer
2015-11-03 19:45:16 +11:00
Matthew Honnibal
d06ba26371
* Fix test of serializer
2015-11-03 19:43:27 +11:00
Matthew Honnibal
4083059650
Merge branch 'master' of https://github.com/honnibal/spaCy
2015-11-03 09:07:19 +01:00
Matthew Honnibal
9e37437ba8
* Fix assign_tag in doc.merge
2015-11-03 19:07:02 +11:00
Matthew Honnibal
dde9e1357c
* Add todo to morphology.lemmatize
2015-11-03 18:54:35 +11:00
Matthew Honnibal
ffedff9e6c
* Remove the archive after download, to save disk space
2015-11-03 18:54:05 +11:00
Matthew Honnibal
85372468e3
* Fix serialize test
2015-11-03 08:51:33 +01:00
Matthew Honnibal
833eb35c57
* Fix tag assignment in doc.from_array
2015-11-03 18:45:54 +11:00
Matthew Honnibal
09664177d7
* Fix tag handling in doc.merge, and assign sent_start when setting heads.
2015-11-03 18:15:52 +11:00
Matthew Honnibal
389a373807
Merge branch 'master' of ssh://github.com/honnibal/spaCy
2015-11-03 18:07:25 +11:00
Matthew Honnibal
3f44b3e43f
* Mark serializer test as requiring models
2015-11-03 18:07:08 +11:00
Matthew Honnibal
25ed7be8f8
Merge branch 'master' of https://github.com/honnibal/spaCy
2015-11-03 07:58:17 +01:00
Matthew Honnibal
604ceac4c6
* Fix morphological assignment in doc.merge()
2015-11-03 17:57:51 +11:00
Matthew Honnibal
5e040855a5
* Ensure morphological features and lemmas are loaded in from_array, re Issue #152
2015-11-03 17:56:50 +11:00
Matthew Honnibal
5668feb235
* Fix pickle test for python3
2015-11-03 04:57:02 +01:00
Matthew Honnibal
6161d2529a
Merge branch 'master' of ssh://github.com/honnibal/spaCy
2015-11-03 13:36:30 +11:00
Matthew Honnibal
5887506f5d
* Don't expect lexemes.bin in Vocab
2015-11-03 13:23:39 +11:00
Matthew Honnibal
f7dd377575
* Adjust conjuncts iterator in Token
2015-11-03 13:23:22 +11:00
Andreas Grivas
d418f00eb1
fixed error when printing unicode
2015-11-02 20:23:18 +02:00
Matthew Honnibal
52fc338001
* Set is_parsed and is_tagged attrs when loading annotations into Doc, re Issue #152
2015-10-28 10:43:22 +11:00
Matthew Honnibal
1c0356e4c2
* Set test file mode to w+t
2015-10-26 22:40:48 +11:00
Matthew Honnibal
0fe98f358b
* Fix mode on text file for Python3 in strings test
2015-10-26 22:25:16 +11:00
Matthew Honnibal
8ba9cf905e
* Fix mode on text file for Python3 in strings test
2015-10-26 21:44:34 +11:00
Matthew Honnibal
a0730699b1
* Fix mode on text file for Python3 in strings test
2015-10-26 21:25:56 +11:00
Matthew Honnibal
725344d349
* Fix tempfile in test
2015-10-26 21:08:18 +11:00
Matthew Honnibal
f11030aadc
* Remove outdated TODO comment
2015-10-26 12:33:38 +11:00
Matthew Honnibal
a371a1071d
* Save and load word vectors during pickling, re Issue #125
2015-10-26 12:33:04 +11:00
Matthew Honnibal
a824a98312
* Add tests for pickling vectors, re: Issue #125
2015-10-26 12:31:05 +11:00
Matthew Honnibal
314090cc78
* Set vectors length when unpickling vocab, re Issue #125
2015-10-26 12:05:08 +11:00
Matthew Honnibal
4e16f9e435
* Move tests underneath spacy/
2015-10-26 00:07:31 +11:00
Matthew Honnibal
3a6e48e814
Merge pull request #149 from chrisdubois/pickle-patch
...
Add __reduce__ to Tokenizer so that English pickles.
2015-10-25 15:30:31 +11:00
Chris DuBois
dac8fe7bdb
Add __reduce__ to Tokenizer so that English pickles.
...
- Add tests to test_pickle and test_tokenizer that save to tempfiles.
2015-10-23 22:24:03 -07:00
Matthew Honnibal
ff4fe524ee
* Fix exception for python 2
2015-10-23 01:56:13 +02:00
Matthew Honnibal
341a3e85cd
* Update downloaded data version
2015-10-23 00:56:57 +02:00
Matthew Honnibal
f18fd8c659
* Fix language.py for change in StringStore load API
2015-10-23 03:48:12 +11:00
Matthew Honnibal
23855db3ca
Merge branch 'master' of ssh://github.com/honnibal/spaCy into develop
2015-10-23 03:46:09 +11:00
Matthew Honnibal
4f13849065
Merge pull request #145 from henningpeters/master
...
better error reporting, cleanup
2015-10-23 03:45:47 +11:00
Matthew Honnibal
3be94be0c0
Merge pull request #148 from maxirmx/master
...
Utf8 encoding for lemma_rules.json
2015-10-22 21:46:28 +11:00
Matthew Honnibal
c86bda8d1a
* Fix import of uget
2015-10-22 21:13:56 +11:00
Matthew Honnibal
2348a08481
* Load/dump strings with a json file, instead of the hacky strings file we were using.
2015-10-22 21:13:03 +11:00
Matthew Honnibal
9baf0abd59
* Save vocab after training.
2015-10-22 21:09:14 +11:00
maxirmx
f07e4accd7
Fixing encoding issue #4
2015-10-21 20:45:56 +03:00
maxirmx
fcbfff043f
Fixing encoding issue #3
2015-10-21 15:52:34 +03:00
maxirmx
fe9d2e2c4e
Fixing encoding issue #2
2015-10-21 15:36:21 +03:00
maxirmx
e4a1726f77
Fixing encoding issue
...
UTF-8
2015-10-21 14:16:37 +03:00
Andreas Grivas
93ada458e2
added __repr__ that prints text in ipython for doc, token, and span objects
2015-10-21 14:11:46 +03:00
Henning Peters
ccffd2ef53
fixed extract directory
2015-10-21 07:59:34 +02:00
Henning Peters
da4c9cee06
assert filename match
2015-10-20 19:33:59 +02:00
Henning Peters
4f703f0cb4
better error reporting, cleanup
2015-10-20 19:11:29 +02:00
Matthew Honnibal
9cdea6e450
* Import uget correctly
2015-10-19 08:32:41 +02:00
Matthew Honnibal
6727a46bb5
* Fix Issue #118: Matcher behaves unpredictably when matches overlap.
2015-10-19 16:45:32 +11:00
Matthew Honnibal
135062d23c
* Fix error with merged text when merged region did not have trailing whitespace
2015-10-19 15:47:04 +11:00
Henning Peters
bfde91fa49
add custom download tool (uget), replace wget with uget
2015-10-18 12:35:04 +02:00
Matthew Honnibal
9839cd2c0b
* Fix whitespace_ calculation in Token
2015-10-18 17:21:11 +11:00
Matthew Honnibal
c99285b8b9
* Clean up C++ usage in spacy/matcher.pyx
2015-10-18 17:20:50 +11:00
Matthew Honnibal
a7e6c5ac8f
* Fix Issue #122: Incorrect calculation of children after Doc.merge()
2015-10-18 17:17:27 +11:00
Matthew Honnibal
3ba66f2dc7
* Add string length cap in Tokenizer.__call__
2015-10-16 04:54:16 +11:00
Matthew Honnibal
6e0f985afc
* Fix token.conjuncts
2015-10-15 03:49:45 +11:00
Matthew Honnibal
2e0104ac81
* Fix token.conjuncts
2015-10-15 03:47:45 +11:00
Matthew Honnibal
b8f3345a82
* Fix token.conjuncts method
2015-10-15 03:36:01 +11:00
Matthew Honnibal
23818f89b8
* Fix token.conjuncts method
2015-10-15 03:34:57 +11:00
Matthew Honnibal
7a15d1b60c
* Add Python 2/3 compatibility fix for copy_reg
2015-10-13 20:04:40 +11:00
Matthew Honnibal
329ae57520
* Fix whitespace attachment thing
2015-10-13 09:46:38 +02:00
Matthew Honnibal
37919eac82
* Fix whitespace attachment in simpler way. Leaves problem with setting left/right children.
2015-10-13 18:23:24 +11:00
Matthew Honnibal
c70eb776ae
* Fix whitespace attachment, so that left/right children are consistent with head.
2015-10-13 15:58:22 +11:00
Matthew Honnibal
531182f937
* Fix Model.__reduce__
2015-10-13 15:14:38 +11:00
Matthew Honnibal
6c227a6c1f
* Fix Model.__reduce__
2015-10-13 15:10:04 +11:00
Matthew Honnibal
358c82595c
* Fix NAMES list in spacy/parts_of_speech.pyx
2015-10-13 14:18:45 +11:00
Matthew Honnibal
c1fdc487bc
Merge branch 'attrs'
2015-10-13 14:03:41 +11:00
Matthew Honnibal
e886e6a406
* Inc version
2015-10-13 13:46:17 +11:00
Matthew Honnibal
20fd36a0f7
* Very scrappy, likely buggy first-cut pickle implementation, to work on Issue #125: allow pickle for Apache Spark. The current implementation sends stuff to temp files, and does almost nothing to ensure all modifiable state is actually preserved. The Language() instance is a deep tree of extension objects, and if pickling during training, some of the C-data state is hard to preserve.
2015-10-13 13:44:41 +11:00
Matthew Honnibal
f8de403483
* Work on pickling Vocab instances. The current implementation is not correct, but it may serve to see whether this approach is workable. Pickling is necessary to address Issue #125
2015-10-13 13:44:41 +11:00
Matthew Honnibal
85e7944572
* Start trying to pickle Vocab
2015-10-13 13:44:41 +11:00
Matthew Honnibal
5ca57bd859
* Ensure Morphology can be pickled, to address Issue #125.
2015-10-13 13:44:41 +11:00
Matthew Honnibal
0cee928467
* Allow StringStore to be pickled, to start addressing Issue #125
2015-10-13 13:44:41 +11:00
Matthew Honnibal
41012907a8
* Fix variable name
2015-10-13 13:44:40 +11:00
Matthew Honnibal
e70368d157
* Use lower case strings for dependency label names in symbols enum
2015-10-13 13:44:40 +11:00
Matthew Honnibal
7b4af3d1e7
* Fix parts_of_speech now that symbols list has been reformed
2015-10-13 13:44:40 +11:00
Matthew Honnibal
37b909b6b6
* Use the symbols file in vocab instead of the symbols subfiles like attrs.pxd
2015-10-13 13:44:40 +11:00
Matthew Honnibal
ce65ec698c
* Remove qualified naming in symbols
2015-10-13 13:44:40 +11:00
Matthew Honnibal
9f4be0adcd
* Map NO_TAG to NIL in parts_of_speech.pxd
2015-10-13 13:44:40 +11:00
Matthew Honnibal
278e12f7e8
* Add morphology symbols to morphology. May need to remove these as an enum.
2015-10-13 13:44:40 +11:00
Matthew Honnibal
d80067eda1
* Map empty string to NULL_ATTR in attrs
2015-10-13 13:44:40 +11:00
Matthew Honnibal
d70e8cac2c
* Fix empty values in attributes and parts of speech, so symbols align correctly with the StringStore
2015-10-13 13:44:40 +11:00
Matthew Honnibal
a29c8ee23d
* Add symbols to the vocab before reading the strings, so that they line up correctly
2015-10-13 13:44:39 +11:00
Matthew Honnibal
74c0853471
* Rename ATTR_IDS to attrs.IDS. Rename ATTR_NAMES to attrs.NAMES. Rename UNIV_POS_IDS to parts_of_speech.IDS
2015-10-13 13:44:39 +11:00