Andreas Grivas
|
4be7fda453
|
* span start, end -> properties. autoupdate after merge
|
2015-11-07 07:57:04 +11:00 |
|
Andreas Grivas
|
562db6d2d0
|
* merge add lex last - add index finder funcs
|
2015-11-07 07:57:04 +11:00 |
|
Matthew Honnibal
|
a06e3c8963
|
* Fix bone-headed mistake in StateClass.E
|
2015-11-07 07:35:28 +11:00 |
|
Matthew Honnibal
|
d24b8509e4
|
* Correct screw ups from the previous commits
|
2015-11-07 06:51:41 +11:00 |
|
Matthew Honnibal
|
5efad178b5
|
* Set ent tag when close entity
|
2015-11-07 06:09:25 +11:00 |
|
Matthew Honnibal
|
9285f01d26
|
* Fix broken StateClass.E tracking
|
2015-11-07 06:06:39 +11:00 |
|
Matthew Honnibal
|
19136b0e7d
|
* Add better debug message for illegal move
|
2015-11-07 05:34:37 +11:00 |
|
Matthew Honnibal
|
2733816b7b
|
* Fix whitespace
|
2015-11-07 05:31:06 +11:00 |
|
Matthew Honnibal
|
01ab464383
|
* Prevent Begin and In moves from applying in NER if we're at the last token of a sentence, as this would mean the entity would span over a sentence boundary. Re Issue #169
|
2015-11-07 05:30:44 +11:00 |
|
Matthew Honnibal
|
b65633f270
|
* Fix function that returns nth entity in StateClass. Was only returning the first.
|
2015-11-07 05:29:11 +11:00 |
|
Matthew Honnibal
|
410b6f9ec1
|
* Remove deprecated _ml.pyx. We now use the nicer APIs provided by thinc 4.0, and subclass the AveragedPerceptron class.
|
2015-11-07 05:13:10 +11:00 |
|
Matthew Honnibal
|
3c162dcac3
|
* Refactor away from the _ml module, to use thinc 4.0. Still some work needs to be done, e.g. to add __reduce__ to the models, more testing, etc.
|
2015-11-07 03:24:30 +11:00 |
|
Matthew Honnibal
|
9d1b2a103a
|
* Fix capitalization in lemmatizer
|
2015-11-06 05:44:35 +11:00 |
|
Matthew Honnibal
|
6ed3aedf79
|
* Merge vocab changes
|
2015-11-06 00:48:08 +11:00 |
|
Matthew Honnibal
|
72abbb43fb
|
* Add type declarations in strings.pyx
|
2015-11-06 00:47:26 +11:00 |
|
Matthew Honnibal
|
5b2af4864f
|
* When lemmatizing non-noun, non-verb, non-adj words, output lower-case
|
2015-11-06 00:45:09 +11:00 |
|
Matthew Honnibal
|
754bf04162
|
* Remove declaration of Model.update
|
2015-11-06 00:31:15 +11:00 |
|
Matthew Honnibal
|
e18bdff23a
|
Merge branch 'master' of ssh://github.com/honnibal/spaCy
|
2015-11-06 00:26:15 +11:00 |
|
Matthew Honnibal
|
b9991fbd20
|
* Update to use thinc 3.0
|
2015-11-06 00:25:59 +11:00 |
|
Matthew Honnibal
|
864a8f45d8
|
* Use unicode in StringStore.intern, instead of unreliably casting to bytes.
|
2015-11-05 11:32:19 +00:00 |
|
Matthew Honnibal
|
b18204cd52
|
* Fix StringStore._realloc, re Issue #155
|
2015-11-05 11:28:26 +00:00 |
|
Matthew Honnibal
|
f8004c5f65
|
* Begin upgrading to improved thinc API
|
2015-11-05 03:53:03 +11:00 |
|
Matthew Honnibal
|
adc7bbd6cf
|
* Fix name of like_num in default_lex_attrs
|
2015-11-04 22:02:47 +11:00 |
|
Matthew Honnibal
|
e96faf29e7
|
* Rename like_number to like_num, to fix inconsistency re Issue #166
|
2015-11-04 22:01:44 +11:00 |
|
Matthew Honnibal
|
65934b7cd4
|
* Enforce import of ujson in strings.pyx, because otherwise it's too slow
|
2015-11-04 00:32:02 +11:00 |
|
Matthew Honnibal
|
1ce5d5602d
|
* Rename Doc.data to Doc.c
|
2015-11-04 00:17:13 +11:00 |
|
Matthew Honnibal
|
68f479e821
|
* Rename Doc.data to Doc.c
|
2015-11-04 00:15:14 +11:00 |
|
Matthew Honnibal
|
3ddea19b2b
|
* Rename spans.pyx to span.pyx
|
2015-11-04 00:14:40 +11:00 |
|
Matthew Honnibal
|
9482d616bc
|
* Rename spans.pyx to span.pyx
|
2015-11-03 23:51:05 +11:00 |
|
Matthew Honnibal
|
116da5990a
|
* Clean up setting of tag in doc.from_bytes
|
2015-11-03 23:48:57 +11:00 |
|
Matthew Honnibal
|
9ec7b9c454
|
* Clean up unused Constituent struct.
|
2015-11-03 23:48:21 +11:00 |
|
Matthew Honnibal
|
1e99fcd413
|
* Rename .repvec to .vector in C API
|
2015-11-03 23:47:59 +11:00 |
|
Matthew Honnibal
|
ee3f9ba581
|
* Fix test of serializer
|
2015-11-03 19:45:16 +11:00 |
|
Matthew Honnibal
|
d06ba26371
|
* Fix test of serializer
|
2015-11-03 19:43:27 +11:00 |
|
Matthew Honnibal
|
4083059650
|
Merge branch 'master' of https://github.com/honnibal/spaCy
|
2015-11-03 09:07:19 +01:00 |
|
Matthew Honnibal
|
9e37437ba8
|
* Fix assign_tag in doc.merge
|
2015-11-03 19:07:02 +11:00 |
|
Matthew Honnibal
|
dde9e1357c
|
* Add todo to morphology.lemmatize
|
2015-11-03 18:54:35 +11:00 |
|
Matthew Honnibal
|
ffedff9e6c
|
* Remove the archive after download, to save disk space
|
2015-11-03 18:54:05 +11:00 |
|
Matthew Honnibal
|
85372468e3
|
* Fix serialize test
|
2015-11-03 08:51:33 +01:00 |
|
Matthew Honnibal
|
833eb35c57
|
* Fix tag assignment in doc.from_array
|
2015-11-03 18:45:54 +11:00 |
|
Matthew Honnibal
|
09664177d7
|
* Fix tag handling in doc.merge, and assign sent_start when setting heads.
|
2015-11-03 18:15:52 +11:00 |
|
Matthew Honnibal
|
389a373807
|
Merge branch 'master' of ssh://github.com/honnibal/spaCy
|
2015-11-03 18:07:25 +11:00 |
|
Matthew Honnibal
|
3f44b3e43f
|
* Mark serializer test as requiring models
|
2015-11-03 18:07:08 +11:00 |
|
Matthew Honnibal
|
25ed7be8f8
|
Merge branch 'master' of https://github.com/honnibal/spaCy
|
2015-11-03 07:58:17 +01:00 |
|
Matthew Honnibal
|
604ceac4c6
|
* Fix morphological assignment in doc.merge()
|
2015-11-03 17:57:51 +11:00 |
|
Matthew Honnibal
|
5e040855a5
|
* Ensure morphological features and lemmas are loaded in from_array, re Issue #152
|
2015-11-03 17:56:50 +11:00 |
|
Matthew Honnibal
|
5668feb235
|
* Fix pickle test for python3
|
2015-11-03 04:57:02 +01:00 |
|
Matthew Honnibal
|
6161d2529a
|
Merge branch 'master' of ssh://github.com/honnibal/spaCy
|
2015-11-03 13:36:30 +11:00 |
|
Matthew Honnibal
|
5887506f5d
|
* Don't expect lexemes.bin in Vocab
|
2015-11-03 13:23:39 +11:00 |
|
Matthew Honnibal
|
f7dd377575
|
* Adjust conjuncts iterator in Token
|
2015-11-03 13:23:22 +11:00 |
|
Andreas Grivas
|
d418f00eb1
|
fixed error when printing unicode
|
2015-11-02 20:23:18 +02:00 |
|
Matthew Honnibal
|
52fc338001
|
* Set is_parsed and is_tagged attrs when loading annotations into Doc, re Issue #152
|
2015-10-28 10:43:22 +11:00 |
|
Matthew Honnibal
|
1c0356e4c2
|
* Set test file mode to w+t
|
2015-10-26 22:40:48 +11:00 |
|
Matthew Honnibal
|
0fe98f358b
|
* Fix mode on text file for Python3 in strings test
|
2015-10-26 22:25:16 +11:00 |
|
Matthew Honnibal
|
8ba9cf905e
|
* Fix mode on text file for Python3 in strings test
|
2015-10-26 21:44:34 +11:00 |
|
Matthew Honnibal
|
a0730699b1
|
* Fix mode on text file for Python3 in strings test
|
2015-10-26 21:25:56 +11:00 |
|
Matthew Honnibal
|
725344d349
|
* Fix tempfile in test
|
2015-10-26 21:08:18 +11:00 |
|
Matthew Honnibal
|
f11030aadc
|
* Remove out-dated TODO comment
|
2015-10-26 12:33:38 +11:00 |
|
Matthew Honnibal
|
a371a1071d
|
* Save and load word vectors during pickling, re Issue #125
|
2015-10-26 12:33:04 +11:00 |
|
Matthew Honnibal
|
a824a98312
|
* Add tests for pickling vectors, re: Issue #125
|
2015-10-26 12:31:05 +11:00 |
|
Matthew Honnibal
|
314090cc78
|
* Set vectors length when unpickling vocab, re Issue #125
|
2015-10-26 12:05:08 +11:00 |
|
Matthew Honnibal
|
4e16f9e435
|
* Move tests underneath spacy/
|
2015-10-26 00:07:31 +11:00 |
|
Matthew Honnibal
|
3a6e48e814
|
Merge pull request #149 from chrisdubois/pickle-patch
Add __reduce__ to Tokenizer so that English pickles.
|
2015-10-25 15:30:31 +11:00 |
|
Chris DuBois
|
dac8fe7bdb
|
Add __reduce__ to Tokenizer so that English pickles.
- Add tests to test_pickle and test_tokenizer that save to tempfiles.
|
2015-10-23 22:24:03 -07:00 |
|
Matthew Honnibal
|
ff4fe524ee
|
* Fix exception for python 2
|
2015-10-23 01:56:13 +02:00 |
|
Matthew Honnibal
|
341a3e85cd
|
* Upd downloaded data version
|
2015-10-23 00:56:57 +02:00 |
|
Matthew Honnibal
|
f18fd8c659
|
* Fix language.py for change in StringStore load API
|
2015-10-23 03:48:12 +11:00 |
|
Matthew Honnibal
|
23855db3ca
|
Merge branch 'master' of ssh://github.com/honnibal/spaCy into develop
|
2015-10-23 03:46:09 +11:00 |
|
Matthew Honnibal
|
4f13849065
|
Merge pull request #145 from henningpeters/master
better error reporting, cleanup
|
2015-10-23 03:45:47 +11:00 |
|
Matthew Honnibal
|
3be94be0c0
|
Merge pull request #148 from maxirmx/master
Utf8 encoding for lemma_rules.json
|
2015-10-22 21:46:28 +11:00 |
|
Matthew Honnibal
|
c86bda8d1a
|
* Fix import of uget
|
2015-10-22 21:13:56 +11:00 |
|
Matthew Honnibal
|
2348a08481
|
* Load/dump strings with a json file, instead of the hacky strings file we were using.
|
2015-10-22 21:13:03 +11:00 |
|
Matthew Honnibal
|
9baf0abd59
|
* Save vocab after training.
|
2015-10-22 21:09:14 +11:00 |
|
maxirmx
|
f07e4accd7
|
Fixing encoding issue #4
|
2015-10-21 20:45:56 +03:00 |
|
maxirmx
|
fcbfff043f
|
Fixing encoding issue #3
|
2015-10-21 15:52:34 +03:00 |
|
maxirmx
|
fe9d2e2c4e
|
Fixing encode issue #2
|
2015-10-21 15:36:21 +03:00 |
|
maxirmx
|
e4a1726f77
|
Fixing encoding issue
UTF-8
|
2015-10-21 14:16:37 +03:00 |
|
Andreas Grivas
|
93ada458e2
|
added __repr__ that prints text in ipython for doc, token, and span objects
|
2015-10-21 14:11:46 +03:00 |
|
Henning Peters
|
ccffd2ef53
|
fixed extract directory
|
2015-10-21 07:59:34 +02:00 |
|
Henning Peters
|
da4c9cee06
|
assert filename match
|
2015-10-20 19:33:59 +02:00 |
|
Henning Peters
|
4f703f0cb4
|
better error reporting, cleanup
|
2015-10-20 19:11:29 +02:00 |
|
Matthew Honnibal
|
9cdea6e450
|
* Import uget correctly
|
2015-10-19 08:32:41 +02:00 |
|
Matthew Honnibal
|
6727a46bb5
|
* Fix Issue #118: Matcher behaves unpredictably when matches overlap.
|
2015-10-19 16:45:32 +11:00 |
|
Matthew Honnibal
|
135062d23c
|
* Fix error with merged text when merged region did not have trailing whitespace
|
2015-10-19 15:47:04 +11:00 |
|
Henning Peters
|
bfde91fa49
|
add custom download tool (uget), replace wget with uget
|
2015-10-18 12:35:04 +02:00 |
|
Matthew Honnibal
|
9839cd2c0b
|
* Fix whitespace_ calculation in Token
|
2015-10-18 17:21:11 +11:00 |
|
Matthew Honnibal
|
c99285b8b9
|
* Clean up C++ usage in spacy/matcher.pyx
|
2015-10-18 17:20:50 +11:00 |
|
Matthew Honnibal
|
a7e6c5ac8f
|
* Fix Issue #122: Incorrect calculation of children after Doc.merge()
|
2015-10-18 17:17:27 +11:00 |
|
Matthew Honnibal
|
3ba66f2dc7
|
* Add string length cap in Tokenizer.__call__
|
2015-10-16 04:54:16 +11:00 |
|
Matthew Honnibal
|
6e0f985afc
|
* Fix token.conjuncts
|
2015-10-15 03:49:45 +11:00 |
|
Matthew Honnibal
|
2e0104ac81
|
* Fix token.conjuncts
|
2015-10-15 03:47:45 +11:00 |
|
Matthew Honnibal
|
b8f3345a82
|
* Fix token.conjuncts method
|
2015-10-15 03:36:01 +11:00 |
|
Matthew Honnibal
|
23818f89b8
|
* Fix token.conjuncts method
|
2015-10-15 03:34:57 +11:00 |
|
Matthew Honnibal
|
7a15d1b60c
|
* Add Python 2/3 compatibility fix for copy_reg
|
2015-10-13 20:04:40 +11:00 |
|
Matthew Honnibal
|
329ae57520
|
* Fix whitespace attachment thing
|
2015-10-13 09:46:38 +02:00 |
|
Matthew Honnibal
|
37919eac82
|
* Fix whitespace attachment in simpler way. Leaves problem with setting left/right children.
|
2015-10-13 18:23:24 +11:00 |
|
Matthew Honnibal
|
c70eb776ae
|
* Fix whitespace attachment, so that left/right children are consistent with head.
|
2015-10-13 15:58:22 +11:00 |
|
Matthew Honnibal
|
531182f937
|
* Fix Model.__reduce__
|
2015-10-13 15:14:38 +11:00 |
|
Matthew Honnibal
|
6c227a6c1f
|
* Fix Model.__reduce__
|
2015-10-13 15:10:04 +11:00 |
|
Matthew Honnibal
|
358c82595c
|
* Fix NAMES list in spacy/parts_of_speech.pyx
|
2015-10-13 14:18:45 +11:00 |
|
Matthew Honnibal
|
c1fdc487bc
|
Merge branch 'attrs'
|
2015-10-13 14:03:41 +11:00 |
|
Matthew Honnibal
|
e886e6a406
|
* Inc version
|
2015-10-13 13:46:17 +11:00 |
|
Matthew Honnibal
|
20fd36a0f7
|
* Very scrappy, likely buggy first-cut pickle implementation, to work on Issue #125: allow pickle for Apache Spark. The current implementation sends stuff to temp files, and does almost nothing to ensure all modifiable state is actually preserved. The Language() instance is a deep tree of extension objects, and if pickling during training, some of the C-data state is hard to preserve.
|
2015-10-13 13:44:41 +11:00 |
|
Matthew Honnibal
|
f8de403483
|
* Work on pickling Vocab instances. The current implementation is not correct, but it may serve to see whether this approach is workable. Pickling is necessary to address Issue #125
|
2015-10-13 13:44:41 +11:00 |
|
Matthew Honnibal
|
85e7944572
|
* Start trying to pickle Vocab
|
2015-10-13 13:44:41 +11:00 |
|
Matthew Honnibal
|
5ca57bd859
|
* Ensure Morphology can be pickled, to address Issue #125.
|
2015-10-13 13:44:41 +11:00 |
|
Matthew Honnibal
|
0cee928467
|
* Allow StringStore to be pickled, to start addressing Issue #125
|
2015-10-13 13:44:41 +11:00 |
|
Matthew Honnibal
|
41012907a8
|
* Fix variable name
|
2015-10-13 13:44:40 +11:00 |
|
Matthew Honnibal
|
e70368d157
|
* Use lower case strings for dependency label names in symbols enum
|
2015-10-13 13:44:40 +11:00 |
|
Matthew Honnibal
|
7b4af3d1e7
|
* Fix parts_of_speech now that symbols list has been reformed
|
2015-10-13 13:44:40 +11:00 |
|
Matthew Honnibal
|
37b909b6b6
|
* Use the symbols file in vocab instead of the symbols subfiles like attrs.pxd
|
2015-10-13 13:44:40 +11:00 |
|
Matthew Honnibal
|
ce65ec698c
|
* Remove qualified naming in symbols
|
2015-10-13 13:44:40 +11:00 |
|
Matthew Honnibal
|
9f4be0adcd
|
* Map NO_TAG to NIL in parts_of_speech.pxd
|
2015-10-13 13:44:40 +11:00 |
|
Matthew Honnibal
|
278e12f7e8
|
* Addmorphology symbols to morphology. May need to remove these as an enum.
|
2015-10-13 13:44:40 +11:00 |
|
Matthew Honnibal
|
d80067eda1
|
* Map empty string to NULL_ATTR in attrs
|
2015-10-13 13:44:40 +11:00 |
|
Matthew Honnibal
|
d70e8cac2c
|
* Fix empty values in attributes and parts of speech, so symbols align correctly with the StringStore
|
2015-10-13 13:44:40 +11:00 |
|
Matthew Honnibal
|
a29c8ee23d
|
* Add symbols to the vocab before reading the strings, so that they line up correctly
|
2015-10-13 13:44:39 +11:00 |
|
Matthew Honnibal
|
74c0853471
|
* Rename ATTR_IDS to attrs.IDS. Rename ATTR_NAMES to attrs.NAMES. Rename UNIV_POS_IDS to parts_of_speech.IDS
|
2015-10-13 13:44:39 +11:00 |
|
Matthew Honnibal
|
10a4a843ea
|
* Enumerate all symbols in one file
|
2015-10-13 13:44:39 +11:00 |
|
Matthew Honnibal
|
85ce36ab11
|
* Refactor symbols, so that frequency rank can be derived from the orth id of a word.
|
2015-10-13 13:44:39 +11:00 |
|
Matthew Honnibal
|
dfbcff2ff1
|
* Revert codecs/io change to strings.pyx, as it seemed to cause an error? Will investigate.
|
2015-10-10 15:54:55 +11:00 |
|
Matthew Honnibal
|
9dd2f25c74
|
* Fix Issue #131: Force whitespace characters to attach syntactically to previous token, and ensure they cannot serve as stand-alone 'sentence' units.
|
2015-10-10 15:53:30 +11:00 |
|
Matthew Honnibal
|
8b39feefbe
|
* Add dependency post-process rule to ensure spaces are attached to neighbouring tokens, so that they can't be sentence boundaries
|
2015-10-10 15:32:13 +11:00 |
|
Matthew Honnibal
|
2153067958
|
* Fix use of io in strings.pyx
|
2015-10-10 15:03:12 +11:00 |
|
Matthew Honnibal
|
ec874247b5
|
Merge branch 'master' of ssh://github.com/honnibal/spaCy
|
2015-10-10 14:23:51 +11:00 |
|
Matthew Honnibal
|
30de4135c9
|
* Fix merge problem
|
2015-10-10 14:22:32 +11:00 |
|
Matthew Honnibal
|
dc393a5f1d
|
Merge pull request #126 from tomtung/master
Improve slicing support for both Doc and Span
|
2015-10-10 14:14:57 +11:00 |
|
Matthew Honnibal
|
83dccf0fd7
|
* Use io module insteads of deprecated codecs module
|
2015-10-10 14:13:01 +11:00 |
|
Matthew Honnibal
|
a3dfe2b901
|
* Increment data version
|
2015-10-09 13:26:17 +02:00 |
|
Matthew Honnibal
|
2d9e5bf566
|
* Allow punctuation to be lemmatized
|
2015-10-09 19:02:42 +11:00 |
|
Matthew Honnibal
|
5332c0b697
|
* Add support for punctuation lemmatization, to handle unicode characters. This should help in addressing Issue #130
|
2015-10-09 18:54:40 +11:00 |
|
Yubing (Tom) Dong
|
9a6811acc4
|
Merge remote-tracking branch 'upstream/master'
|
2015-10-08 22:53:02 -07:00 |
|
Matthew Honnibal
|
b125289f30
|
* Fix type declaration in asciied function
|
2015-10-09 13:46:57 +11:00 |
|
Matthew Honnibal
|
801d55a6d9
|
* Fix phrase matcher
|
2015-10-09 02:00:45 +11:00 |
|
Matthew Honnibal
|
b3a70e6375
|
* Clean up unnecessary try/except block
|
2015-10-08 14:34:11 +11:00 |
|
Yubing (Tom) Dong
|
0f601b8b75
|
Update docstring of Doc.__getitem__
|
2015-10-07 01:27:28 -07:00 |
|
Yubing (Tom) Dong
|
3fd3bc79aa
|
Refactor to remove duplicate slicing logic
|
2015-10-07 01:25:35 -07:00 |
|
Yubing (Tom) Dong
|
97685aecb7
|
Add slicing support to Span
|
2015-10-06 02:45:49 -07:00 |
|
Yubing (Tom) Dong
|
ef2af20cd3
|
Make Doc's slicing behavior conform to Python conventions
|
2015-10-06 02:41:28 -07:00 |
|
Yubing (Tom) Dong
|
2fc33e8024
|
Allow step=1 when slicing a Doc
|
2015-10-06 00:57:05 -07:00 |
|
Matthew Honnibal
|
b228a8f4a6
|
* Remove spacy/en/attrs
|
2015-10-06 16:20:46 +11:00 |
|
Matthew Honnibal
|
693677fd8d
|
* Prepare to remove en/attrx file, now that moving to symbols.pyx
|
2015-10-06 16:20:13 +11:00 |
|
Matthew Honnibal
|
3d9f41c2c9
|
* Add LookupError for better error reporting in Vocab
|
2015-10-06 10:34:59 +11:00 |
|
Matthew Honnibal
|
ecc5281b36
|
* Remove en/pos.pyx, as the tagger code now lives in spacy/tagger.pyx
|
2015-10-06 10:12:08 +11:00 |
|
alvations
|
8caedba42a
|
caught more codecs.open -> io.open
|
2015-09-30 20:20:09 +02:00 |
|
alvations
|
8199012d26
|
changing deprecated codecs.open to io.open =)
|
2015-09-30 20:10:15 +02:00 |
|
Matthew Honnibal
|
87e6186828
|
* Rename _seq to doc attribute in Span
|
2015-09-29 23:03:55 +10:00 |
|
Matthew Honnibal
|
ab694b0364
|
* Fix open-bounded slice indices.
|
2015-09-29 23:03:09 +10:00 |
|
Matthew Honnibal
|
a6ced80c0c
|
* Fix Issue #116: Misleading handling of True value in Language.__init__.
|
2015-09-29 20:54:12 +10:00 |
|
Matthew Honnibal
|
f9d2a5b651
|
* Fix issue #112: Replace unidecode with text-unidecode, to avoid license problems.
|
2015-09-28 23:40:18 +10:00 |
|
Matthew Honnibal
|
2c33a96ac3
|
Merge pull request #99 from rw/patch-1
Force SSL for downloading English language data.
|
2015-09-28 17:46:26 +10:00 |
|
Matthew Honnibal
|
abf0d930af
|
* Fix API for loading word vectors from a file.
|
2015-09-23 23:51:08 +10:00 |
|
Matthew Honnibal
|
f5c256745b
|
Merge branch 'master' of ssh://github.com/honnibal/spaCy
|
2015-09-22 12:26:24 +10:00 |
|
Matthew Honnibal
|
528e26a506
|
* Add rule to ensure ordinals are preserved as single tokens
|
2015-09-22 12:26:05 +10:00 |
|
Robert
|
8711b64860
|
Force SSL for downloading English language data.
It would also be nice to have a checksum for this.
|
2015-09-21 17:26:01 -07:00 |
|
Matthew Honnibal
|
f7283a5067
|
* Fix vectors bugs for OOV words
|
2015-09-22 02:10:25 +02:00 |
|
Matthew Honnibal
|
44aecba701
|
* Fix Token.has_vector and Lexeme.has_vector
|
2015-09-22 01:43:16 +02:00 |
|
Matthew Honnibal
|
596fde8daa
|
* Add has_vector attribute to Token and Lexeme
|
2015-09-21 19:52:43 +10:00 |
|
Matthew Honnibal
|
f32927efbf
|
* Raise exceptions if attempt to access parse, but data is not installed. This partly but not fully addresses Issue #97. Still need exceptions on the various Token attributes that access the parse tree, e.g. token.head, token.lefts, token.rights, etc. Exceptions should be centralized, too.
|
2015-09-21 18:35:40 +10:00 |
|
Matthew Honnibal
|
388062ae01
|
* Fix repvec_length problem
|
2015-09-21 18:10:51 +10:00 |
|
Matthew Honnibal
|
ac459278d1
|
* Fix vector length error reporting, and ensure vec_len is returned
|
2015-09-21 18:08:32 +10:00 |
|
Matthew Honnibal
|
ba4e563701
|
* Ensure vectors are same length, and return vector length in load_vectors_bz2
|
2015-09-21 18:03:08 +10:00 |
|
Matthew Honnibal
|
d00fe2bbc6
|
* Don't allow Span objects to be written to, as it introduces subtle bugs because they're created afresh from Doc.sents, Doc.ents etc.
|
2015-09-21 17:59:39 +10:00 |
|
Matthew Honnibal
|
d6945bf880
|
* Add way to load vectors from bz2 file to vocab
|
2015-09-17 12:58:23 +10:00 |
|
Matthew Honnibal
|
77856c4fcd
|
* Try giving Doc and Span objects vector and vector_norm attributes, and .similarity functions. Turns out to be bad idea.
|
2015-09-17 11:50:11 +10:00 |
|
Matthew Honnibal
|
191d593e03
|
* Fix vectors bug in lexeme
|
2015-09-15 19:05:11 +10:00 |
|
Matthew Honnibal
|
3d87519f64
|
* Remove vectors argument from Vocab object
|
2015-09-15 14:47:14 +10:00 |
|
Matthew Honnibal
|
362526b592
|
* Rename vectors_length attribute
|
2015-09-15 14:43:31 +10:00 |
|
Matthew Honnibal
|
60c26b2dfa
|
* Fix slicing when start or stop is None
|
2015-09-15 14:43:10 +10:00 |
|
Matthew Honnibal
|
7ac6cacc26
|
* Remove const qualifier on LexemeC.repvec
|
2015-09-15 14:42:51 +10:00 |
|
Matthew Honnibal
|
dd4d64b235
|
* Support setting of word vectors on Lexeme object.
|
2015-09-15 14:42:27 +10:00 |
|
Matthew Honnibal
|
27f988b167
|
* Remove the vectors option to Vocab, preferring to either load vectors from disk, or set them on the Lexeme objects.
|
2015-09-15 14:41:48 +10:00 |
|
Matthew Honnibal
|
193f127f81
|
* Fix ugly py_check_flag and py_set_flag functions in Lexeme
|
2015-09-15 13:06:18 +10:00 |
|
Matthew Honnibal
|
9561d88529
|
* Add is_stop to Python API
|
2015-09-14 18:25:40 +10:00 |
|
Matthew Honnibal
|
65dc0d1dfb
|
* Extend word vectors support, with .similarity() function, vector_norm property, and rename repvec to vector. Keep repvec name as well for now for backwards compatibility.
|
2015-09-14 17:49:58 +10:00 |
|
Matthew Honnibal
|
e13e47e9e5
|
* Add English stop words
|
2015-09-14 17:48:51 +10:00 |
|
Matthew Honnibal
|
24ed3fc25c
|
* Check file existance before opening in lemmatizer
|
2015-09-13 10:45:21 +10:00 |
|
Matthew Honnibal
|
dbb48ce49e
|
* Delete extra wordnets
|
2015-09-13 10:31:37 +10:00 |
|
Matthew Honnibal
|
e9c59693ea
|
* Remove assertion from vocab.pyx
|
2015-09-13 10:30:08 +10:00 |
|
Matthew Honnibal
|
c08f10083c
|
* Add test and test_with_ws attributes.
|
2015-09-13 10:27:42 +10:00 |
|
Matthew Honnibal
|
0b7d2a6c62
|
* Inc version
|
2015-09-13 01:26:29 +02:00 |
|
Matthew Honnibal
|
e1dfaeed8a
|
* Check serializer freqs exist before loading
|
2015-09-12 23:49:38 +02:00 |
|
Matthew Honnibal
|
a412c66c8c
|
* Check serializer freqs exist before loading
|
2015-09-12 23:40:01 +02:00 |
|
Matthew Honnibal
|
631c843ed1
|
* Don't look for index.adv in le,matizer
|
2015-09-12 06:03:44 +02:00 |
|
Matthew Honnibal
|
dfdd4f2d60
|
Merge branch 'develop' of https://github.com/honnibal/spaCy into develop
|
2015-09-10 15:23:06 +02:00 |
|
Matthew Honnibal
|
e285ca7d6c
|
* Load serializer freqs in vocab
|
2015-09-10 15:22:48 +02:00 |
|
Matthew Honnibal
|
f7fdcce1f9
|
Merge branch 'develop' of https://github.com/honnibal/spaCy into develop
|
2015-09-10 14:52:47 +02:00 |
|
Matthew Honnibal
|
85c3fec1d1
|
* Fix morphology loading
|
2015-09-10 14:52:23 +02:00 |
|
Matthew Honnibal
|
7c660c5efc
|
* Use dict.get in lemmatizer
|
2015-09-10 14:51:39 +02:00 |
|
Matthew Honnibal
|
094440f9f5
|
Merge branch 'develop' of ssh://github.com/honnibal/spaCy into develop
|
2015-09-10 14:51:17 +02:00 |
|
Matthew Honnibal
|
c3f773cd63
|
* Fix Lexeme.check_flag
|
2015-09-10 14:51:05 +02:00 |
|
Matthew Honnibal
|
90da3a695d
|
* Load lemmatizer from disk in Vocab.from_dir
|
2015-09-10 14:49:10 +02:00 |
|
Matthew Honnibal
|
e7e529edf4
|
* Fix Lexeme.check_flag
|
2015-09-10 14:45:43 +02:00 |
|
Matthew Honnibal
|
9e7bfe8449
|
* Fix space at end of merged token
|
2015-09-10 14:45:17 +02:00 |
|
Matthew Honnibal
|
f634191e27
|
* Fix vocab read/write
|
2015-09-10 14:44:38 +02:00 |
|
Matthew Honnibal
|
31ccf494e6
|
Merge branch 'develop' of https://github.com/honnibal/spaCy into develop
|
2015-09-09 14:33:38 +02:00 |
|
Matthew Honnibal
|
a7f4b26c8c
|
* Tmp
|
2015-09-09 14:33:26 +02:00 |
|
Matthew Honnibal
|
07686470a9
|
* Don't consider a coordinated NP a base chunk
|
2015-09-09 14:32:28 +02:00 |
|
Matthew Honnibal
|
d9f1fc2112
|
* Add deprecation warning for unused load_vectors argument.
|
2015-09-09 14:31:09 +02:00 |
|
Matthew Honnibal
|
0b527fbdc8
|
* Set POS tag in morphology
|
2015-09-09 14:30:24 +02:00 |
|
Matthew Honnibal
|
07c09a0e1b
|
* Fix attribute getters and setters in Lexeme
|
2015-09-09 14:29:22 +02:00 |
|
Matthew Honnibal
|
d6561988cf
|
* Fix lexemes.bin
|
2015-09-09 11:49:51 +02:00 |
|
Matthew Honnibal
|
c301bebd33
|
Merge branch 'master' of https://github.com/honnibal/spaCy into develop
|
2015-09-09 10:55:39 +02:00 |
|
Matthew Honnibal
|
0e24d099a1
|
* Fix L/R edge bug, by ensuring l_edge and r_edge are preset, and fixing the way the edge update in del_arc. Bugs keep arising here because the edges are absolute positions, where everything else is relative. I'm also not 100% convinced that del_arc is handled correctly. Do we need to update the parents?
|
2015-09-09 03:40:44 +02:00 |
|
Matthew Honnibal
|
2be3620333
|
* Save morphological analyses in a cache
|
2015-09-08 15:39:24 +02:00 |
|
Matthew Honnibal
|
1def5a6cbe
|
* Fix print statements in matcher
|
2015-09-08 15:38:19 +02:00 |
|
Matthew Honnibal
|
64d71f8893
|
* Fix lemmatizer
|
2015-09-08 15:38:03 +02:00 |
|
Matthew Honnibal
|
623329b19a
|
Merge branch 'master' of ssh://github.com/honnibal/spaCy into develop
|
2015-09-08 14:27:01 +02:00 |
|
Matthew Honnibal
|
62a01dd41d
|
* Fix issue #92: lexemes.bin read error on 32-bit platforms.
|
2015-09-08 14:23:58 +02:00 |
|
Matthew Honnibal
|
ef58607a99
|
* Add spacy.it
|
2015-09-06 22:10:37 +02:00 |
|
Matthew Honnibal
|
2154a54f6b
|
* Add spacy.de
|
2015-09-06 21:56:47 +02:00 |
|
Matthew Honnibal
|
f6ec5bf1b0
|
* Use empty tag map in vocab if none supplied
|
2015-09-06 20:19:27 +02:00 |
|
Matthew Honnibal
|
4f8e38271d
|
* Fix merge errors in lexeme.pxd
|
2015-09-06 20:19:08 +02:00 |
|
Matthew Honnibal
|
86c888667f
|
* Merge in changes from de branch
|
2015-09-06 19:49:28 +02:00 |
|
Matthew Honnibal
|
d2fc104a26
|
* Begin merge of Gazetteer and DE branches
|
2015-09-06 19:45:15 +02:00 |
|
Matthew Honnibal
|
dbf8dce109
|
Merge branch 'gaz' of ssh://github.com/honnibal/spaCy into gaz
|
2015-09-06 18:44:14 +02:00 |
|
Matthew Honnibal
|
9eae9837c4
|
* Fix morphology look up
|
2015-09-06 17:53:39 +02:00 |
|
Matthew Honnibal
|
6427a3fcac
|
* Temporarily import flag attributes in matcher
|
2015-09-06 17:53:12 +02:00 |
|
Matthew Honnibal
|
7cc56ada6e
|
* Temporarily add py_set_flag attribute in Lexeme
|
2015-09-06 17:52:51 +02:00 |
|
Matthew Honnibal
|
e35bb36be7
|
* Ensure Lexeme.check_flag returns a boolean value
|
2015-09-06 17:52:32 +02:00 |
|
Matthew Honnibal
|
7e4fea67d3
|
* Fix bug in token subtree, introduced by duplication of L/R code in Stateclass. Need to consolidate the two methods.
|
2015-09-06 10:48:36 +02:00 |
|
Matthew Honnibal
|
5edac11225
|
* Wrap self.parse in nogil, and break if an invalid move is predicted. The invalid break is a work-around that papers over likely bugs, but we can't easily break in the nogil block, and otherwise we'll get an infinite loop. Need to set this as an error flag.
|
2015-09-06 04:15:00 +02:00 |
|
Matthew Honnibal
|
fd1eeb3102
|
* Add POS attribute support in get_attr
|
2015-09-06 04:13:03 +02:00 |
|
Matthew Honnibal
|
534e3dda3c
|
* More work on language independent parsing
|
2015-08-28 03:44:54 +02:00 |
|
Matthew Honnibal
|
c2307fa9ee
|
* More work on language-generic parsing
|
2015-08-28 02:02:33 +02:00 |
|
Matthew Honnibal
|
86c4a8e3e2
|
* Work on new morphology organization
|
2015-08-27 23:11:51 +02:00 |
|
Matthew Honnibal
|
5b89e2454c
|
* Improve error-reporting in tagger
|
2015-08-27 10:26:36 +02:00 |
|
Matthew Honnibal
|
f0a7c99554
|
* Relax rule-requirement in lemmatizer
|
2015-08-27 10:26:19 +02:00 |
|
Matthew Honnibal
|
0af139e183
|
* Tagger training now working. Still need to test load/save of model. Morphology still broken.
|
2015-08-27 09:16:11 +02:00 |
|
Matthew Honnibal
|
1302d35dff
|
* Rework interfaces in vocab
|
2015-08-26 19:21:46 +02:00 |
|
Matthew Honnibal
|
2d521768a3
|
* Store Morphology class in Vocab
|
2015-08-26 19:21:03 +02:00 |
|
Matthew Honnibal
|
d30029979e
|
* Avoid import of morphology in spans
|
2015-08-26 19:20:46 +02:00 |
|
Matthew Honnibal
|
119c0f8c3f
|
* Hack out morphology stuff from tokenizer, while morphology being reimplemented.
|
2015-08-26 19:20:11 +02:00 |
|
Matthew Honnibal
|
b4faf551f5
|
* Refactor language-independent tagger class
|
2015-08-26 19:19:21 +02:00 |
|
Matthew Honnibal
|
a3d5e6c0dd
|
* Reform constructor and save/load workflow in parser model
|
2015-08-26 19:19:01 +02:00 |
|
Matthew Honnibal
|
1d7f2d3abc
|
* Hack on morphology structs
|
2015-08-26 19:18:36 +02:00 |
|
Matthew Honnibal
|
f8f2f4e545
|
* Temporarily add PUNC name to parts_of_specch dictionary, until better solution
|
2015-08-26 19:18:19 +02:00 |
|
Matthew Honnibal
|
008b02b035
|
* Comment out enums in Morpohlogy for now
|
2015-08-26 19:17:35 +02:00 |
|
Matthew Honnibal
|
378729f81a
|
* Hack Morphology class towards usability
|
2015-08-26 19:17:21 +02:00 |
|
Matthew Honnibal
|
430affc347
|
* Fix missing n_patterns property in Matcher class. Fix from_dir method
|
2015-08-26 19:17:02 +02:00 |
|
Matthew Honnibal
|
3acf60df06
|
* Add missing properties in Lexeme class
|
2015-08-26 19:16:28 +02:00 |
|
Matthew Honnibal
|
76996f4145
|
* Hack on generic Language class. Still needs work for morphology, defaults, etc
|
2015-08-26 19:16:09 +02:00 |
|
Matthew Honnibal
|
e2ef78b29c
|
* Gut pos.pyx module, since functionality moved to spacy/tagger.pyx
|
2015-08-26 19:15:42 +02:00 |
|
Matthew Honnibal
|
c4d8754385
|
* Specify LOCAL_DATA_DIR global in spacy.en.__init__.py
|
2015-08-26 19:15:07 +02:00 |
|
Matthew Honnibal
|
c2d8edd0bd
|
* Add PROB attribute in attrs.pxd
|
2015-08-26 19:14:19 +02:00 |
|
Matthew Honnibal
|
c5a27d1821
|
* Move lemmatizer to spacy
|
2015-08-25 15:47:08 +02:00 |
|
Matthew Honnibal
|
82217c6ec6
|
* Generalize lemmatizer
|
2015-08-25 15:46:19 +02:00 |
|
Matthew Honnibal
|
8083a07c3e
|
* Use language base class
|
2015-08-25 15:37:30 +02:00 |
|
Matthew Honnibal
|
f2f699ac18
|
* Add language base class
|
2015-08-25 15:37:17 +02:00 |
|
Matthew Honnibal
|
5dd76be446
|
* Split EnPosTagger up into base class and subclass
|
2015-08-24 05:25:55 +02:00 |
|