Wolfgang Seeker
e4ea2bea01
fix whitespace
2016-05-04 07:40:38 +02:00
Wolfgang Seeker
5bf2fd1f78
make the code less cryptic
2016-05-03 17:19:05 +02:00
Wolfgang Seeker
a06fca9fdf
German noun chunk iterator now doesn't return tokens more than once
2016-05-03 16:58:59 +02:00
Wolfgang Seeker
7825b75548
add tests for German noun chunker
2016-05-03 15:01:28 +02:00
Wolfgang Seeker
7b246c13cb
reformulate noun chunk tests for English
2016-05-03 14:24:35 +02:00
Wolfgang Seeker
1786331cd8
add model sanity test
2016-05-03 12:51:47 +02:00
Matthew Honnibal
1f1532142f
* Fix cost calculation on non-monotonic oracle
2016-05-03 00:21:08 +02:00
Matthew Honnibal
377a624046
Merge pull request #358 from wbwseeker/german_lemmatizer_dummy
...
German lemmatizer dummy
2016-05-03 07:38:26 +10:00
Wolfgang Seeker
92bfbebeec
remove unnecessary imports
2016-05-02 17:33:22 +02:00
Wolfgang Seeker
857454ffa0
fix indentation -.-
2016-05-02 17:10:41 +02:00
Matthew Honnibal
308a28c26c
* Whitespace
2016-05-02 16:08:11 +02:00
Matthew Honnibal
29a114e645
* Don't assign 0-valued tags in Doc.from_array
2016-05-02 16:07:50 +02:00
Matthew Honnibal
c1c11a8ae0
* Fix formatting on serializer tests
2016-05-02 16:07:21 +02:00
Wolfgang Seeker
dae6bc05eb
define German dummy lemmatizer until morphology is done
2016-05-02 16:04:53 +02:00
Matthew Honnibal
6e1f1c4b9e
Merge pull request #357 from wbwseeker/german_ner
...
German ner
2016-05-02 23:39:34 +10:00
Wolfgang Seeker
b6b96b233c
don't require read_json_file to expect particular annotations
2016-05-02 15:29:30 +02:00
Matthew Honnibal
902a389d85
* Fix merge conflict in test_parse
2016-05-02 15:28:07 +02:00
Matthew Honnibal
276fbe9996
* Fix assignment of iterator on Doc object
2016-05-02 15:26:24 +02:00
Matthew Honnibal
02c23cc1d0
* Fix sentence boundary test
2016-05-02 15:26:07 +02:00
Matthew Honnibal
d2f469b809
* Fix parsing tests, so that labels are added if they're missing, and so that the branching test values are correct
2016-05-02 15:25:27 +02:00
Wolfgang Seeker
b11cbb06c6
remove old tests for sentence boundary detection
2016-05-02 14:36:35 +02:00
Matthew Honnibal
508fd1f6dc
* Refactor noun chunk iterators, so that they're simple functions. Install the iterator when the Doc is created, but allow users to write to the noun_chunk_iterator attribute. The iterator functions accept an object and yield (int start, int end, int label) triples.
2016-05-02 14:25:10 +02:00
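As a rough illustration of the iterator contract described in commit 508fd1f6dc (a plain function that accepts an object and yields `(start, end, label)` integer triples, installed on the document but overwritable by the user), here is a minimal sketch. It is not spaCy's implementation: `ToyDoc`, `toy_noun_chunks`, `NP_LABEL`, and the tag set are hypothetical stand-ins, and treating `end` as exclusive is an assumption.

```python
# Hypothetical sketch of a "noun chunk iterator as a simple function".
# None of these names come from spaCy; they only mirror the contract in the
# commit message: accept an object, yield (int start, int end, int label).

NP_LABEL = 1  # assumed integer id standing in for a noun-phrase label
CHUNK_TAGS = {"DET", "ADJ", "NOUN"}  # assumed coarse tags allowed in a chunk


def toy_noun_chunks(tagged):
    """Yield (start, end, label) for tag runs that end in a NOUN.

    `tagged` is a list of (text, tag) pairs; `end` is exclusive (assumption).
    """
    start = None
    for i, (_, tag) in enumerate(tagged):
        if tag in CHUNK_TAGS:
            if start is None:
                start = i  # open a new candidate chunk
        else:
            # Close the candidate only if its last token was a noun.
            if start is not None and tagged[i - 1][1] == "NOUN":
                yield start, i, NP_LABEL
            start = None
    if start is not None and tagged[-1][1] == "NOUN":
        yield start, len(tagged), NP_LABEL


class ToyDoc:
    """Stand-in document: the iterator is set at creation, but is writable."""

    def __init__(self, tagged, noun_chunk_iterator=toy_noun_chunks):
        self.tagged = tagged
        self.noun_chunk_iterator = noun_chunk_iterator

    @property
    def noun_chunks(self):
        # Calling the function each time gives a fresh generator.
        for start, end, _label in self.noun_chunk_iterator(self.tagged):
            yield [text for text, _ in self.tagged[start:end]]


doc = ToyDoc([("The", "DET"), ("quick", "ADJ"), ("fox", "NOUN"),
              ("jumps", "VERB"), ("over", "ADP"),
              ("the", "DET"), ("dog", "NOUN")])
triples = list(doc.noun_chunk_iterator(doc.tagged))
chunks = list(doc.noun_chunks)
```

Because the iterator is an ordinary attribute holding a function, a caller can swap in a language-specific function by assigning to `doc.noun_chunk_iterator`, without subclassing the document.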
Matthew Honnibal
e526be5602
Merge branch 'master' of ssh://github.com/spacy-io/spaCy
2016-05-02 13:08:08 +02:00
Wolfgang Seeker
fa961ea694
add tests for serialization bug
2016-05-02 11:01:56 +02:00
Matthew Honnibal
97b2bba249
* Merge updated/simplified Break approach
2016-04-25 19:44:42 +00:00
Matthew Honnibal
77609588b6
* Fix assignment of root label to words left as root implicitly, after parsing ends.
2016-04-25 19:41:59 +00:00
Matthew Honnibal
7c2d2deaa7
* Revise transition system so that the Break transition retains sole responsibility for setting sentence boundaries. Re Issue #322
2016-04-25 19:41:59 +00:00
Wolfgang Seeker
c2f76a4024
Merge branch 'master' into german_ner
2016-04-25 13:21:23 +02:00
Wolfgang Seeker
1003e7ccec
remove debug output from tests
2016-04-25 12:12:40 +02:00
Wolfgang Seeker
f57f843e85
fix bug in updating tree structure when introducing additional roots
2016-04-25 12:01:19 +02:00
Matthew Honnibal
478a8d1829
* Register Chinese language in spacy/__init__.py
2016-04-24 18:45:16 +02:00
Matthew Honnibal
8569dbc2d0
* Add initial stuff for Chinese parsing
2016-04-24 18:44:24 +02:00
Wolfgang Seeker
4d7f393fae
don't require json-files to have syntactic annotation
2016-04-22 16:32:27 +02:00
Wolfgang Seeker
b6477fc4f4
adjusted tests to Travis Setup
2016-04-21 17:15:10 +02:00
Wolfgang Seeker
736ffcb9a2
remove whitespace
2016-04-21 16:55:55 +02:00
Wolfgang Seeker
6c7301cc6d
the parser now introduces sentence boundaries properly when predicting dependents with root labels
2016-04-21 16:50:53 +02:00
Wolfgang Seeker
12024b0b0a
bugfix: introducing multiple roots now updates original head's properties
...
adjust tests to rely less on statistical model
2016-04-20 16:42:41 +02:00
Matthew Honnibal
67ce96c9c9
* Make patterns argument to Matcher class optional
2016-04-17 21:32:24 +02:00
Matthew Honnibal
8b4677d34d
* Add missing keyword arguments to spacy.load() function
2016-04-17 21:31:50 +02:00
Matthew Honnibal
2add5206aa
* Fix description of matcher test
2016-04-17 15:40:21 +02:00
Matthew Honnibal
2b419d5b8c
* Update test for Issue #242
2016-04-17 15:34:23 +02:00
Matthew Honnibal
f12b043308
* Add test for Issue #242 : Overlapping matches not well recognised.
2016-04-17 15:19:17 +02:00
Wolfgang Seeker
b98cc3266d
bugfix: iterators now reset properly when called a second time
2016-04-15 17:49:16 +02:00
Wolfgang Seeker
e6945c4d0e
bugfix: uppercase attr values before looking them up
2016-04-15 15:46:31 +02:00
Matthew Honnibal
c0909afe22
Merge pull request #312 from wbwseeker/space_head_bug
...
add restrictions to L-arc and R-arc to prevent space heads
2016-04-15 20:36:03 +10:00
Wolfgang Seeker
289b10f441
remove some comments
2016-04-14 15:37:51 +02:00
Matthew Honnibal
6f82065761
* Fix infixed commas in tokenizer, re Issue #326 . Need to benchmark on empirical data, to make sure this doesn't break other cases.
2016-04-14 11:36:03 +02:00
Matthew Honnibal
0f957dd586
Merge branch 'master' of ssh://github.com/honnibal/spaCy
2016-04-14 10:37:56 +02:00
Matthew Honnibal
108aca0e50
* Make Matcher use attrs from the attrs.pyx file, rather than having an incomplete function doing the mapping.
2016-04-14 10:37:39 +02:00
Matthew Honnibal
61d20de35d
* Fix language.py docstring
2016-04-14 10:36:57 +02:00
Wolfgang Seeker
d99a9cbce9
different handling of space tokens
...
space tokens are now always attached to the previous non-space token
there are two exceptions:
leading space tokens are attached to the first following non-space token
in input that consists exclusively of space tokens, the last space token
is the head of all others.
2016-04-13 15:28:28 +02:00
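The attachment rule stated in commit d99a9cbce9 can be sketched in plain Python as follows. This is only an illustration of the three cases in the message, not the parser's actual code: the function name `space_heads` is hypothetical, tokens are modelled as plain strings, and a token counts as a space token when `str.isspace()` is true.

```python
def space_heads(tokens):
    """Map each space token's index to the index of its head.

    Cases, following the commit message:
      1. a space token attaches to the previous non-space token;
      2. leading space tokens attach to the first following non-space token;
      3. if every token is a space token, the last one heads all the others.
    Non-space tokens (and the last token in case 3) get no entry.
    """
    non_space = [i for i, tok in enumerate(tokens) if not tok.isspace()]
    heads = {}

    if not non_space:
        # Case 3: input consists exclusively of space tokens.
        last = len(tokens) - 1
        for i in range(last):
            heads[i] = last
        return heads

    first_non_space = non_space[0]
    previous_non_space = None
    for i, tok in enumerate(tokens):
        if not tok.isspace():
            previous_non_space = i
        elif previous_non_space is None:
            heads[i] = first_non_space      # Case 2: leading space
        else:
            heads[i] = previous_non_space   # Case 1: default rule
    return heads


mixed = space_heads(["  ", "Hello", " ", "world", "\n"])
only_spaces = space_heads([" ", "\n", " "])
```

Here `mixed` attaches the leading space to "Hello" (index 1), the inner space to "Hello" as well, and the trailing newline to "world" (index 3); `only_spaces` attaches the first two tokens to the last one.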
Matthew Honnibal
04d0209be9
* Recognise multiple infixes in a token.
2016-04-13 18:38:26 +10:00
Henning Peters
a473d6e937
fix tests (use english model)
2016-04-12 16:41:57 +02:00
Henning Peters
f2d011c034
avoid polluting spacy namespace with lang classes
2016-04-12 16:31:16 +02:00
Henning Peters
ff690f76ba
fix loading non-german models
2016-04-12 16:00:56 +02:00
Henning Peters
6215272786
remove ujson as default non-dev dependency (still works as fallback if installed), because ujson doesn't ship wheels
2016-04-12 11:28:07 +02:00
Matthew Honnibal
6df3858dbc
* Fix Issue #323 : Incorrect semantics of Token.__str__ built-in. Add flag to allow users to switch the old semantics back on, to ease transition.
2016-04-12 13:17:59 +10:00
Wolfgang Seeker
d328e0b4a8
Merge branch 'master' into space_head_bug
2016-04-11 12:11:01 +02:00
Wolfgang Seeker
80bea62842
bugfix in unit test
2016-04-08 16:46:44 +02:00
Wolfgang Seeker
be4903a1b2
update version numbers
2016-04-08 13:54:05 +02:00
Wolfgang Seeker
1fe911cdb0
bugfix
2016-04-07 18:19:51 +02:00
Matthew Honnibal
872695759d
Merge pull request #306 from wbwseeker/german_noun_chunks
...
add German noun chunk functionality
2016-04-08 00:54:24 +10:00
Henning Peters
470cdf5bf9
remove deprecated LOCAL_DATA_DIR
2016-04-05 11:25:54 +02:00
Matthew Honnibal
26622f0ffc
Merge branch 'master' of ssh://github.com/honnibal/spaCy
2016-03-29 14:31:52 +11:00
Matthew Honnibal
b1fe41b45d
* Extend infix test, commenting on limitation of tokenizer w.r.t. infixes at the moment.
2016-03-29 14:31:05 +11:00
Matthew Honnibal
9c73983bdd
* Add test for hyphenation problem in Issue #302
2016-03-29 14:27:13 +11:00
Matthew Honnibal
ad119c074f
* Fix incorrect whitespacing in Doc.text. This change is potentially breaking, to anyone who was relying on the previous incorrect semantics.
2016-03-29 13:02:42 +11:00
Matthew Honnibal
8c7a1908ee
Merge pull request #307 from scoder/faster_string_store
...
remove internal redundancy and overhead from StringStore
2016-03-29 12:59:52 +11:00
Wolfgang Seeker
7195b6742d
add restrictions to L-arc and R-arc to prevent space heads
2016-03-28 10:40:52 +02:00
Matthew Honnibal
8c77a994c6
Merge pull request #305 from henningpeters/master
...
multiple langs in download script
2016-03-26 21:54:59 +11:00
Henning Peters
c90d4a6f17
relative imports in __init__.py
2016-03-26 11:44:53 +01:00
Henning Peters
db095a162c
fix
2016-03-25 18:59:47 +01:00
Henning Peters
b8f63071eb
add lang registration facility
2016-03-25 18:54:45 +01:00
Matthew Honnibal
4a37fdcee1
Merge pull request #287 from wbwseeker/deproj_sentbnd_bug
...
add function to Token for setting head and dep (and dep_)
2016-03-25 09:47:45 +11:00
Stefan Behnel
f18805ee1c
make StringStore.__contains__() return True for the empty string (which is also contained in iteration)
2016-03-24 15:42:12 +01:00
Stefan Behnel
f2cfbfc412
remove internal redundancy and overhead from StringStore
2016-03-24 15:25:27 +01:00
Wolfgang Seeker
d65ef41d08
make error messages language independent
2016-03-24 11:47:09 +01:00
Henning Peters
963570aa49
Merge branch 'master' of github.com:spacy-io/spaCy
2016-03-24 11:19:47 +01:00
Henning Peters
a7d7ea3afa
first idea for supporting multiple langs in download script
2016-03-24 11:19:43 +01:00
Wolfgang Seeker
5080077097
revert init_model.py back to pre-german state (because it makes more sense)
...
simplify token.n_rights and token.n_lefts
2016-03-21 16:10:25 +01:00
Wolfgang Seeker
5e2e8e951a
add baseclass DocIterator for iterators over documents
...
add classes for English and German noun chunks
the respective iterators are set for the document when created by the parser
as they depend on the annotation scheme of the parsing model
2016-03-16 15:53:35 +01:00
Matthew Honnibal
80134eb12d
Merge branch 'master' of https://github.com/spacy-io/spaCy
2016-03-15 19:14:50 +00:00
Wolfgang Seeker
2ae253ef5b
changed head.__set__ to make it simpler
2016-03-14 13:43:48 +01:00
Henning Peters
c12d3dd200
add __init__.py to empty package dirs
2016-03-14 11:28:03 +01:00
Henning Peters
54f3447b5f
cleanup
2016-03-14 01:46:33 +01:00
Wolfgang Seeker
46e3f979f1
add function for setting head and label to token
...
change PseudoProjectivity.deprojectivize to use these functions
2016-03-11 17:31:06 +01:00
Wolfgang Seeker
03fb498dbe
introduce lang field for LexemeC to hold language id
...
put noun_chunk logic into iterators.py for each language separately
2016-03-10 13:01:34 +01:00
Wolfgang Seeker
bc9c62e279
replace Language functions with corresponding orth functions
...
implement punctuation functions in orth
2016-03-09 18:07:37 +01:00
Wolfgang Seeker
d9312bc9ea
add new files npchunks.{pyx,pxd} to hold noun phrase chunk generators
2016-03-09 16:18:48 +01:00
Matthew Honnibal
1508528c8c
* Increment version
2016-03-08 15:58:45 +00:00
Matthew Honnibal
963fe5258e
* Add missing __contains__ method to vocab
2016-03-08 15:49:10 +00:00
Matthew Honnibal
478aa21cb0
* Remove broken __reduce__ method on vocab
2016-03-08 15:48:21 +00:00
Matthew Honnibal
20235bde00
Merge pull request #282 from henningpeters/switch_vectors
...
initial proposal for ability to switch vectors
2016-03-09 01:39:41 +11:00
Henning Peters
eb7ae61b1c
cleanup api
2016-03-08 12:59:18 +01:00
Henning Peters
b740f20191
hash_string() should not depend on python's internal unicode representation, also fixes https://github.com/spacy-io/sense2vec/issues/5 for py2
2016-03-06 09:19:27 +01:00
Henning Peters
aa4d964c14
cleanup api
2016-03-05 17:51:32 +01:00
Henning Peters
931c07a609
initial proposal for separate vector package
2016-03-04 11:09:06 +01:00
Wolfgang Seeker
7adbd7a785
replace Counter with normal dict
2016-03-03 21:36:27 +01:00
Wolfgang Seeker
1ae487a4f6
add backwards compatibility with python 2.6
2016-03-03 21:18:12 +01:00
Wolfgang Seeker
9d1e6de4a0
make a proper list from zip iterator
2016-03-03 19:51:01 +01:00