Henning Peters
41ea14a56f
fix pickling
2016-01-16 13:23:11 +01:00
Henning Peters
5551052840
fix py2/3 issue
2016-01-16 12:44:53 +01:00
Henning Peters
235f094534
untangle data_path/via
2016-01-16 12:23:45 +01:00
Matthew Honnibal
42a9f29b40
* Add loop guard in Span.root, to raise errors if there is a cycle in the dependency parse, instead of entering an infinite loop. Re Issue #214
2016-01-16 11:53:37 +01:00
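The loop guard described in the commit above can be sketched roughly as follows. This is a minimal illustration and not the actual Span.root code: the find_root function and the list-of-head-indices representation are assumptions made for the example. A well-formed tree over n tokens reaches its root within n steps, so running past that bound indicates a cycle.

```python
def find_root(heads, start):
    """Follow head pointers from `start` until a root is reached.

    `heads[i]` is the index of token i's head; a root token is its own head.
    (Hypothetical representation, chosen only for this sketch.)
    """
    current = start
    # In a well-formed tree the walk ends within len(heads) steps,
    # so the bounded loop doubles as the cycle guard.
    for _ in range(len(heads) + 1):
        if heads[current] == current:
            return current
        current = heads[current]
    raise RuntimeError("Cycle detected in dependency parse")
```

Raising an error here trades a silent hang for an explicit failure, which is the behavior the commit message asks for.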
Henning Peters
6d1a3af343
cleanup unused
2016-01-16 10:05:04 +01:00
Henning Peters
846fa49b2a
distinct load() and from_package() methods
2016-01-16 10:00:57 +01:00
Henning Peters
211913d689
add about.py, adapt setup.py
2016-01-15 18:57:01 +01:00
Henning Peters
f8a8f97d25
cleanup
2016-01-15 18:13:37 +01:00
Henning Peters
780cb847c9
add default_model to about
2016-01-15 18:07:15 +01:00
Henning Peters
788f734513
refactored data_dir->via, add zip_safe, add spacy.load()
2016-01-15 18:01:02 +01:00
Matthew Honnibal
478a79a3d5
* Add test for Issue #220 : Whitespace being tagged as noun
2016-01-15 16:17:07 +01:00
Henning Peters
d9471f684f
fix typo
2016-01-14 12:14:12 +01:00
Henning Peters
9b75d872b0
fix model download
2016-01-14 12:02:56 +01:00
Henning Peters
bc229790ac
integrate with sputnik
2016-01-13 19:46:17 +01:00
Matthew Honnibal
3fbfba575a
* xfail the contractions test
2015-12-31 13:16:28 +01:00
Matthew Honnibal
3bd910ccad
* Merge therell test
2015-12-31 11:55:18 +01:00
Matthew Honnibal
eaf2ad59f1
* Fix use of mock Package object
2015-12-31 04:13:15 +01:00
Matthew Honnibal
029136a007
* Fix resource loading for Matcher
2015-12-31 02:45:12 +01:00
Matthew Honnibal
55bcdf8bdd
* Fix errors
2015-12-29 22:32:03 +01:00
Matthew Honnibal
a6ba43ecaf
* Fix errors in packaging revision
2015-12-29 18:37:26 +01:00
Matthew Honnibal
4b4eec8b47
* Fix Issue #201 : Tokenization of there'll
2015-12-29 18:09:09 +01:00
Matthew Honnibal
86ee9d046d
* Remove test that belongs to a change for master
2015-12-29 18:07:23 +01:00
Matthew Honnibal
a2dfdec85d
* Clean up spacy.util
2015-12-29 18:06:09 +01:00
Matthew Honnibal
aec130af56
Use util.Package class for io
...
The previous Sputnik integration caused an API change: Vocab, Tagger, etc.
were loaded via a from_package classmethod that required a
sputnik.Package instance. This forced users to first create a
sputnik.Sputnik() instance in order to acquire a Package via
sp.pool().
Instead, I've created a small file-system shim, util.Package, which
allows classes to have a .load() classmethod that accepts either
util.Package objects or strings. We can later gut the internals
of this and make it a proxy for Sputnik if we need more functionality
that should live in the Sputnik library.
Sputnik is now only used to download and install the data, in
spacy.en.download.
2015-12-29 18:00:48 +01:00
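The shim pattern described in the commit body above might look roughly like this. It is a sketch under stated assumptions, not the real util.Package: the method names (create_or_return, file_path, has_file), the stand-in Vocab class, and the vocab/strings.txt layout are all hypothetical and exist only to show a .load() classmethod that accepts either a package object or a plain string path.

```python
import os
import tempfile


class Package(object):
    """Hypothetical file-system shim that resolves files under a data directory."""

    def __init__(self, data_path):
        self.data_path = data_path

    @classmethod
    def create_or_return(cls, package_or_path):
        # Accept either an existing Package or a plain string path.
        if isinstance(package_or_path, cls):
            return package_or_path
        return cls(package_or_path)

    def file_path(self, *parts):
        return os.path.join(self.data_path, *parts)

    def has_file(self, *parts):
        return os.path.isfile(self.file_path(*parts))


class Vocab(object):
    """Stand-in for a class that loads its data through the shim."""

    def __init__(self, strings):
        self.strings = strings

    @classmethod
    def load(cls, package_or_path):
        package = Package.create_or_return(package_or_path)
        strings = []
        # Assumed file layout, for illustration only.
        if package.has_file("vocab", "strings.txt"):
            with open(package.file_path("vocab", "strings.txt")) as f:
                strings = [line.strip() for line in f if line.strip()]
        return cls(strings)


def demo():
    # Build a throwaway data directory, then load it both ways.
    data_dir = tempfile.mkdtemp()
    os.mkdir(os.path.join(data_dir, "vocab"))
    with open(os.path.join(data_dir, "vocab", "strings.txt"), "w") as f:
        f.write("hello\nworld\n")
    from_string = Vocab.load(data_dir)
    from_package = Vocab.load(Package(data_dir))
    return from_string.strings, from_package.strings
```

Because callers only ever touch the shim, its internals could later be swapped for a proxy to another backend without changing the .load() signature, which is the motivation the commit message gives.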
Matthew Honnibal
0e2498da00
* Replace from_package with load() classmethod in Vocab
2015-12-29 16:56:51 +01:00
Matthew Honnibal
c5902f2b4b
* Upd Lemmatizer to use MockPackage. Replace from_package with load() classmethod
2015-12-29 16:56:02 +01:00
Matthew Honnibal
4131e45543
* Add MockPackage class, to see whether we can proxy for Sputnik in a lightweight way
2015-12-29 16:55:03 +01:00
Matthew Honnibal
f5dea1406d
* Fix silly mistake in Language.__init__
2015-12-28 18:48:57 +01:00
Matthew Honnibal
187960606f
* Fix pickle problems
2015-12-28 16:54:03 +01:00
Matthew Honnibal
8c7e149ec9
* Replace kwargs argument of Language.__init__ with explicit arguments, to fix pickle bug
2015-12-28 15:56:27 +01:00
Henning Peters
32d655b6e1
bump version
2015-12-28 09:34:39 +01:00
Matthew Honnibal
8b61d45ed0
* Fix merge conflicts for headers branch
2015-12-27 17:46:25 +01:00
Matthew Honnibal
6bb9c7f311
Merge pull request #202 from henningpeters/sputnik
...
access model via sputnik
2015-12-28 03:29:53 +11:00
Henning Peters
0e321a7105
get mingw32 to work
2015-12-22 23:25:38 +01:00
Henning Peters
d8d348bb55
allow to specify version constraint within model name
2015-12-18 19:12:08 +01:00
Henning Peters
7f7299cafb
Merge branch 'tmpdir' into headers
2015-12-18 12:25:25 +01:00
Henning Peters
cfa187aaf0
fix tests
2015-12-18 10:58:02 +01:00
Henning Peters
8359bd4d93
strip data/ from package, friendlier Language invocation, make data_dir backward/forward-compatible
2015-12-18 09:52:55 +01:00
Henning Peters
970278a3d6
no need to link data dir anymore
2015-12-18 09:49:45 +01:00
Henning Peters
4f3efb8eaf
avoid writing to /tmp (not cross-platform compatible)
2015-12-16 19:56:40 +01:00
Henning Peters
4ada39f472
avoid writing to /tmp (not cross-platform compatible)
2015-12-16 19:53:06 +01:00
Henning Peters
2d4efe40f9
fix sputnik call
2015-12-13 14:46:08 +01:00
Henning Peters
ac318b568c
new approach to dependency headers
2015-12-13 11:49:17 +01:00
Henning Peters
345dda6f53
small fixes, add package build step
2015-12-07 06:50:26 +01:00
Henning Peters
9027cef3bc
access model via sputnik
2015-12-07 06:01:28 +01:00
Henning Peters
73e5650be5
change index server
2015-11-18 18:09:46 +01:00
Henning Peters
50d15ea5d2
fix
2015-11-18 17:35:21 +01:00
Henning Peters
02a1dcec76
add data dir
2015-11-18 11:48:55 +01:00
Henning Peters
919a4f0b04
change data path, add repository
2015-11-18 11:40:46 +01:00
Henning Peters
12de895e60
fix version
2015-11-15 16:38:16 +01:00
Henning Peters
03d2f98cd5
add sputnik
2015-11-15 15:58:21 +01:00
Matthew Honnibal
ec7d36c3a4
* Add test for matcher end-point problem
2015-11-12 05:00:40 +11:00
Matthew Honnibal
d309622a27
* Add test for matcher end-point problem
2015-11-12 04:59:11 +11:00
Matthew Honnibal
56ea20a886
* Add test for matcher end-point problem
2015-11-12 04:58:53 +11:00
Matthew Honnibal
cfa4062147
* Add test for matcher end-point problem
2015-11-12 04:56:07 +11:00
Matthew Honnibal
5623242b3e
* Adjust NER rules, so that U entries in gazetteer don't become B moves to the model
2015-11-12 04:48:23 +11:00
Matthew Honnibal
d67d7d5a86
* Add test for NER inconsistency bug
2015-11-08 16:19:33 +01:00
Matthew Honnibal
44fbdc7260
* Fix bug in NER transition system, that sometimes left no valid moves
2015-11-08 16:19:12 +01:00
Matthew Honnibal
ab5aac5b2f
* Add .rank property to Token and Lexeme, for frequency rank
2015-11-08 16:18:25 +01:00
Matthew Honnibal
fde9a22ec2
* Add new test for ner
2015-11-08 13:57:15 +01:00
Matthew Honnibal
e92371bb54
* Fix rule that made Last action invalid if there was a preset of O, since if the entity is already open, that ship has sailed.
2015-11-08 22:17:51 +11:00
Matthew Honnibal
3b74739c3e
* Download updated data
2015-11-08 21:24:25 +11:00
Matthew Honnibal
31da42eb27
* Mark tests that require models
2015-11-07 19:27:38 +11:00
Matthew Honnibal
8e26a28616
* Mark tests that require models
2015-11-07 19:10:56 +11:00
Matthew Honnibal
15eab7354f
* Remove extraneous test files
2015-11-07 18:45:13 +11:00
Matthew Honnibal
6f47074214
* Make constructor of ParserModel and TaggerModel the same as AveragedPerceptron, for each pickling.
2015-11-07 18:25:17 +11:00
Matthew Honnibal
1cfa20fb17
* Fix sentence-final whitespace issue
2015-11-07 17:34:46 +11:00
Matthew Honnibal
7663970d5f
* Removed unused i variable from Span, and set attributes to read-only
2015-11-07 17:06:15 +11:00
Matthew Honnibal
4b3c96d76d
* Fix zero-length spans
2015-11-07 17:05:16 +11:00
Matthew Honnibal
888c05a7fa
* Fix variable naming in StepwiseState, for thinc 4.0
2015-11-07 11:02:44 +11:00
Matthew Honnibal
fc2185bfe3
* Fix variable naming in StepwiseState, for thinc 4.0
2015-11-07 10:48:31 +11:00
Matthew Honnibal
954442a807
* Fix variable naming in StepwiseState, for thinc 4.0
2015-11-07 10:30:45 +11:00
Matthew Honnibal
06f26d258e
* Fix test_basic_create
2015-11-07 10:04:37 +11:00
Matthew Honnibal
1d3884c46d
* Fix test_basic_create
2015-11-07 10:03:56 +11:00
Matthew Honnibal
cc8febcbe1
* Fix Span comparison
2015-11-07 09:54:14 +11:00
Matthew Honnibal
af70dc166a
* Fix Last restriction, that was supposed to prevent conflicts with presets, but was incorrect.
2015-11-07 09:52:00 +11:00
Matthew Honnibal
a9b612abdf
* Rework the Span-merge patch, to avoid extending the interface of Doc, and avoid virtualizing the Span.start and Span.end indices, to keep Span usage efficient
2015-11-07 09:01:12 +11:00
Matthew Honnibal
56499d89ef
* Rework the Span-merge patch, to avoid extending the interface of Doc, and avoid virtualizing the Span.start and Span.end indices, to keep Span usage efficient
2015-11-07 08:55:34 +11:00
Andreas Grivas
83ca4e0b93
* use old merge tests - add more
2015-11-07 07:57:04 +11:00
Andreas Grivas
4be7fda453
* span start, end -> properties. autoupdate after merge
2015-11-07 07:57:04 +11:00
Andreas Grivas
562db6d2d0
* merge add lex last - add index finder funcs
2015-11-07 07:57:04 +11:00
Matthew Honnibal
a06e3c8963
* Fix bone-headed mistake in StateClass.E
2015-11-07 07:35:28 +11:00
Matthew Honnibal
d24b8509e4
* Correct screw ups from the previous commits
2015-11-07 06:51:41 +11:00
Matthew Honnibal
5efad178b5
* Set ent tag when closing entity
2015-11-07 06:09:25 +11:00
Matthew Honnibal
9285f01d26
* Fix broken StateClass.E tracking
2015-11-07 06:06:39 +11:00
Matthew Honnibal
19136b0e7d
* Add better debug message for illegal move
2015-11-07 05:34:37 +11:00
Matthew Honnibal
2733816b7b
* Fix whitespace
2015-11-07 05:31:06 +11:00
Matthew Honnibal
01ab464383
* Prevent Begin and In moves from applying in NER if we're at the last token of a sentence, as this would mean the entity would span over a sentence boundary. Re Issue #169
2015-11-07 05:30:44 +11:00
Matthew Honnibal
b65633f270
* Fix function that returns nth entity in StateClass. Was only returning the first.
2015-11-07 05:29:11 +11:00
Matthew Honnibal
410b6f9ec1
* Remove deprecated _ml.pyx. We now use the nicer APIs provided by thinc 4.0, and subclass the AveragedPerceptron class.
2015-11-07 05:13:10 +11:00
Matthew Honnibal
3c162dcac3
* Refactor away from the _ml module, to use thinc 4.0. Still some work needs to be done, e.g. to add __reduce__ to the models, more testing, etc.
2015-11-07 03:24:30 +11:00
Matthew Honnibal
9d1b2a103a
* Fix capitalization in lemmatizer
2015-11-06 05:44:35 +11:00
Matthew Honnibal
6ed3aedf79
* Merge vocab changes
2015-11-06 00:48:08 +11:00
Matthew Honnibal
72abbb43fb
* Add type declarations in strings.pyx
2015-11-06 00:47:26 +11:00
Matthew Honnibal
5b2af4864f
* When lemmatizing non-noun, non-verb, non-adj words, output lower-case
2015-11-06 00:45:09 +11:00
Matthew Honnibal
754bf04162
* Remove declaration of Model.update
2015-11-06 00:31:15 +11:00
Matthew Honnibal
e18bdff23a
Merge branch 'master' of ssh://github.com/honnibal/spaCy
2015-11-06 00:26:15 +11:00
Matthew Honnibal
b9991fbd20
* Update to use thinc 3.0
2015-11-06 00:25:59 +11:00
Matthew Honnibal
864a8f45d8
* Use unicode in StringStore.intern, instead of unreliably casting to bytes.
2015-11-05 11:32:19 +00:00
Matthew Honnibal
b18204cd52
* Fix StringStore._realloc, re Issue #155
2015-11-05 11:28:26 +00:00
Matthew Honnibal
f8004c5f65
* Begin upgrading to improved thinc API
2015-11-05 03:53:03 +11:00
Matthew Honnibal
adc7bbd6cf
* Fix name of like_num in default_lex_attrs
2015-11-04 22:02:47 +11:00
Matthew Honnibal
e96faf29e7
* Rename like_number to like_num, to fix inconsistency re Issue #166
2015-11-04 22:01:44 +11:00
Matthew Honnibal
65934b7cd4
* Enforce import of ujson in strings.pyx, because otherwise it's too slow
2015-11-04 00:32:02 +11:00
Matthew Honnibal
1ce5d5602d
* Rename Doc.data to Doc.c
2015-11-04 00:17:13 +11:00
Matthew Honnibal
68f479e821
* Rename Doc.data to Doc.c
2015-11-04 00:15:14 +11:00
Matthew Honnibal
3ddea19b2b
* Rename spans.pyx to span.pyx
2015-11-04 00:14:40 +11:00
Matthew Honnibal
9482d616bc
* Rename spans.pyx to span.pyx
2015-11-03 23:51:05 +11:00
Matthew Honnibal
116da5990a
* Clean up setting of tag in doc.from_bytes
2015-11-03 23:48:57 +11:00
Matthew Honnibal
9ec7b9c454
* Clean up unused Constituent struct.
2015-11-03 23:48:21 +11:00
Matthew Honnibal
1e99fcd413
* Rename .repvec to .vector in C API
2015-11-03 23:47:59 +11:00
Matthew Honnibal
ee3f9ba581
* Fix test of serializer
2015-11-03 19:45:16 +11:00
Matthew Honnibal
d06ba26371
* Fix test of serializer
2015-11-03 19:43:27 +11:00
Matthew Honnibal
4083059650
Merge branch 'master' of https://github.com/honnibal/spaCy
2015-11-03 09:07:19 +01:00
Matthew Honnibal
9e37437ba8
* Fix assign_tag in doc.merge
2015-11-03 19:07:02 +11:00
Matthew Honnibal
dde9e1357c
* Add todo to morphology.lemmatize
2015-11-03 18:54:35 +11:00
Matthew Honnibal
ffedff9e6c
* Remove the archive after download, to save disk space
2015-11-03 18:54:05 +11:00
Matthew Honnibal
85372468e3
* Fix serialize test
2015-11-03 08:51:33 +01:00
Matthew Honnibal
833eb35c57
* Fix tag assignment in doc.from_array
2015-11-03 18:45:54 +11:00
Matthew Honnibal
09664177d7
* Fix tag handling in doc.merge, and assign sent_start when setting heads.
2015-11-03 18:15:52 +11:00
Matthew Honnibal
389a373807
Merge branch 'master' of ssh://github.com/honnibal/spaCy
2015-11-03 18:07:25 +11:00
Matthew Honnibal
3f44b3e43f
* Mark serializer test as requiring models
2015-11-03 18:07:08 +11:00
Matthew Honnibal
25ed7be8f8
Merge branch 'master' of https://github.com/honnibal/spaCy
2015-11-03 07:58:17 +01:00
Matthew Honnibal
604ceac4c6
* Fix morphological assignment in doc.merge()
2015-11-03 17:57:51 +11:00
Matthew Honnibal
5e040855a5
* Ensure morphological features and lemmas are loaded in from_array, re Issue #152
2015-11-03 17:56:50 +11:00
Matthew Honnibal
5668feb235
* Fix pickle test for python3
2015-11-03 04:57:02 +01:00
Matthew Honnibal
6161d2529a
Merge branch 'master' of ssh://github.com/honnibal/spaCy
2015-11-03 13:36:30 +11:00
Matthew Honnibal
5887506f5d
* Don't expect lexemes.bin in Vocab
2015-11-03 13:23:39 +11:00
Matthew Honnibal
f7dd377575
* Adjust conjuncts iterator in Token
2015-11-03 13:23:22 +11:00
Andreas Grivas
d418f00eb1
fixed error when printing unicode
2015-11-02 20:23:18 +02:00
Matthew Honnibal
52fc338001
* Set is_parsed and is_tagged attrs when loading annotations into Doc, re Issue #152
2015-10-28 10:43:22 +11:00
Matthew Honnibal
1c0356e4c2
* Set test file mode to w+t
2015-10-26 22:40:48 +11:00
Matthew Honnibal
0fe98f358b
* Fix mode on text file for Python3 in strings test
2015-10-26 22:25:16 +11:00
Matthew Honnibal
8ba9cf905e
* Fix mode on text file for Python3 in strings test
2015-10-26 21:44:34 +11:00
Matthew Honnibal
a0730699b1
* Fix mode on text file for Python3 in strings test
2015-10-26 21:25:56 +11:00
Matthew Honnibal
725344d349
* Fix tempfile in test
2015-10-26 21:08:18 +11:00
Matthew Honnibal
f11030aadc
* Remove out-dated TODO comment
2015-10-26 12:33:38 +11:00
Matthew Honnibal
a371a1071d
* Save and load word vectors during pickling, re Issue #125
2015-10-26 12:33:04 +11:00
Matthew Honnibal
a824a98312
* Add tests for pickling vectors, re: Issue #125
2015-10-26 12:31:05 +11:00
Matthew Honnibal
314090cc78
* Set vectors length when unpickling vocab, re Issue #125
2015-10-26 12:05:08 +11:00
Matthew Honnibal
4e16f9e435
* Move tests underneath spacy/
2015-10-26 00:07:31 +11:00
Matthew Honnibal
3a6e48e814
Merge pull request #149 from chrisdubois/pickle-patch
...
Add __reduce__ to Tokenizer so that English pickles.
2015-10-25 15:30:31 +11:00
Chris DuBois
dac8fe7bdb
Add __reduce__ to Tokenizer so that English pickles.
...
- Add tests to test_pickle and test_tokenizer that save to tempfiles.
2015-10-23 22:24:03 -07:00
Matthew Honnibal
ff4fe524ee
* Fix exception for python 2
2015-10-23 01:56:13 +02:00
Matthew Honnibal
341a3e85cd
* Upd downloaded data version
2015-10-23 00:56:57 +02:00
Matthew Honnibal
f18fd8c659
* Fix language.py for change in StringStore load API
2015-10-23 03:48:12 +11:00
Matthew Honnibal
23855db3ca
Merge branch 'master' of ssh://github.com/honnibal/spaCy into develop
2015-10-23 03:46:09 +11:00
Matthew Honnibal
4f13849065
Merge pull request #145 from henningpeters/master
...
better error reporting, cleanup
2015-10-23 03:45:47 +11:00
Matthew Honnibal
3be94be0c0
Merge pull request #148 from maxirmx/master
...
Utf8 encoding for lemma_rules.json
2015-10-22 21:46:28 +11:00
Matthew Honnibal
c86bda8d1a
* Fix import of uget
2015-10-22 21:13:56 +11:00
Matthew Honnibal
2348a08481
* Load/dump strings with a json file, instead of the hacky strings file we were using.
2015-10-22 21:13:03 +11:00
Matthew Honnibal
9baf0abd59
* Save vocab after training.
2015-10-22 21:09:14 +11:00
maxirmx
f07e4accd7
Fixing encoding issue #4
2015-10-21 20:45:56 +03:00
maxirmx
fcbfff043f
Fixing encoding issue #3
2015-10-21 15:52:34 +03:00
maxirmx
fe9d2e2c4e
Fixing encode issue #2
2015-10-21 15:36:21 +03:00
maxirmx
e4a1726f77
Fixing encoding issue
...
UTF-8
2015-10-21 14:16:37 +03:00
Andreas Grivas
93ada458e2
added __repr__ that prints text in ipython for doc, token, and span objects
2015-10-21 14:11:46 +03:00
Henning Peters
ccffd2ef53
fixed extract directory
2015-10-21 07:59:34 +02:00
Henning Peters
da4c9cee06
assert filename match
2015-10-20 19:33:59 +02:00
Henning Peters
4f703f0cb4
better error reporting, cleanup
2015-10-20 19:11:29 +02:00
Matthew Honnibal
9cdea6e450
* Import uget correctly
2015-10-19 08:32:41 +02:00
Matthew Honnibal
6727a46bb5
* Fix Issue #118 : Matcher behaves unpredictably when matches overlap.
2015-10-19 16:45:32 +11:00
Matthew Honnibal
135062d23c
* Fix error with merged text when merged region did not have trailing whitespace
2015-10-19 15:47:04 +11:00
Henning Peters
bfde91fa49
add custom download tool (uget), replace wget with uget
2015-10-18 12:35:04 +02:00
Matthew Honnibal
9839cd2c0b
* Fix whitespace_ calculation in Token
2015-10-18 17:21:11 +11:00
Matthew Honnibal
c99285b8b9
* Clean up C++ usage in spacy/matcher.pyx
2015-10-18 17:20:50 +11:00
Matthew Honnibal
a7e6c5ac8f
* Fix Issue #122 : Incorrect calculation of children after Doc.merge()
2015-10-18 17:17:27 +11:00
Matthew Honnibal
3ba66f2dc7
* Add string length cap in Tokenizer.__call__
2015-10-16 04:54:16 +11:00
Matthew Honnibal
6e0f985afc
* Fix token.conjuncts
2015-10-15 03:49:45 +11:00
Matthew Honnibal
2e0104ac81
* Fix token.conjuncts
2015-10-15 03:47:45 +11:00
Matthew Honnibal
b8f3345a82
* Fix token.conjuncts method
2015-10-15 03:36:01 +11:00
Matthew Honnibal
23818f89b8
* Fix token.conjuncts method
2015-10-15 03:34:57 +11:00
Matthew Honnibal
7a15d1b60c
* Add Python 2/3 compatibility fix for copy_reg
2015-10-13 20:04:40 +11:00
Matthew Honnibal
329ae57520
* Fix whitespace attachment thing
2015-10-13 09:46:38 +02:00
Matthew Honnibal
37919eac82
* Fix whitespace attachment in simpler way. Leaves problem with setting left/right children.
2015-10-13 18:23:24 +11:00
Matthew Honnibal
c70eb776ae
* Fix whitespace attachment, so that left/right children are consistent with head.
2015-10-13 15:58:22 +11:00
Matthew Honnibal
531182f937
* Fix Model.__reduce__
2015-10-13 15:14:38 +11:00
Matthew Honnibal
6c227a6c1f
* Fix Model.__reduce__
2015-10-13 15:10:04 +11:00
Matthew Honnibal
358c82595c
* Fix NAMES list in spacy/parts_of_speech.pyx
2015-10-13 14:18:45 +11:00
Matthew Honnibal
c1fdc487bc
Merge branch 'attrs'
2015-10-13 14:03:41 +11:00
Matthew Honnibal
e886e6a406
* Inc version
2015-10-13 13:46:17 +11:00
Matthew Honnibal
20fd36a0f7
* Very scrappy, likely buggy first-cut pickle implementation, to work on Issue #125 : allow pickle for Apache Spark. The current implementation sends stuff to temp files, and does almost nothing to ensure all modifiable state is actually preserved. The Language() instance is a deep tree of extension objects, and if pickling during training, some of the C-data state is hard to preserve.
2015-10-13 13:44:41 +11:00
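The temp-file approach to pickling mentioned in the commit above can be illustrated with a small sketch. Everything here is assumed for the example: StringTable, its dump/from_file methods, and the _load_from_bytes helper are hypothetical names, not spaCy classes. The idea is that __reduce__ serializes through the object's existing file-based dump, and unpickling writes those bytes back to a temp file and reuses the file-based loader.

```python
import io
import os
import pickle
import tempfile


def _load_from_bytes(data):
    # Unpickle helper: write the bytes to a temp file, then reuse the file loader.
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        return StringTable.from_file(path)
    finally:
        os.remove(path)


class StringTable(object):
    """Hypothetical object that only knows how to save and load via files."""

    def __init__(self, strings=()):
        self.strings = list(strings)

    def dump(self, path):
        with io.open(path, "w", encoding="utf8") as f:
            f.write(u"\n".join(self.strings))

    @classmethod
    def from_file(cls, path):
        with io.open(path, "r", encoding="utf8") as f:
            text = f.read()
        return cls(text.split(u"\n") if text else [])

    def __reduce__(self):
        # Serialize through a temp file, then hand the raw bytes to pickle.
        fd, path = tempfile.mkstemp()
        os.close(fd)
        try:
            self.dump(path)
            with open(path, "rb") as f:
                data = f.read()
        finally:
            os.remove(path)
        return (_load_from_bytes, (data,))


def roundtrip(table):
    return pickle.loads(pickle.dumps(table))
```

As the commit message itself warns, an approach like this only preserves whatever state the file format captures, so any in-memory state outside the dumped file would be lost on a round trip.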
Matthew Honnibal
f8de403483
* Work on pickling Vocab instances. The current implementation is not correct, but it may serve to see whether this approach is workable. Pickling is necessary to address Issue #125
2015-10-13 13:44:41 +11:00
Matthew Honnibal
85e7944572
* Start trying to pickle Vocab
2015-10-13 13:44:41 +11:00
Matthew Honnibal
5ca57bd859
* Ensure Morphology can be pickled, to address Issue #125 .
2015-10-13 13:44:41 +11:00
Matthew Honnibal
0cee928467
* Allow StringStore to be pickled, to start addressing Issue #125
2015-10-13 13:44:41 +11:00
Matthew Honnibal
41012907a8
* Fix variable name
2015-10-13 13:44:40 +11:00
Matthew Honnibal
e70368d157
* Use lower case strings for dependency label names in symbols enum
2015-10-13 13:44:40 +11:00
Matthew Honnibal
7b4af3d1e7
* Fix parts_of_speech now that symbols list has been reformed
2015-10-13 13:44:40 +11:00
Matthew Honnibal
37b909b6b6
* Use the symbols file in vocab instead of the symbols subfiles like attrs.pxd
2015-10-13 13:44:40 +11:00
Matthew Honnibal
ce65ec698c
* Remove qualified naming in symbols
2015-10-13 13:44:40 +11:00
Matthew Honnibal
9f4be0adcd
* Map NO_TAG to NIL in parts_of_speech.pxd
2015-10-13 13:44:40 +11:00
Matthew Honnibal
278e12f7e8
* Add morphology symbols to morphology. May need to remove these as an enum.
2015-10-13 13:44:40 +11:00
Matthew Honnibal
d80067eda1
* Map empty string to NULL_ATTR in attrs
2015-10-13 13:44:40 +11:00
Matthew Honnibal
d70e8cac2c
* Fix empty values in attributes and parts of speech, so symbols align correctly with the StringStore
2015-10-13 13:44:40 +11:00
Matthew Honnibal
a29c8ee23d
* Add symbols to the vocab before reading the strings, so that they line up correctly
2015-10-13 13:44:39 +11:00
Matthew Honnibal
74c0853471
* Rename ATTR_IDS to attrs.IDS. Rename ATTR_NAMES to attrs.NAMES. Rename UNIV_POS_IDS to parts_of_speech.IDS
2015-10-13 13:44:39 +11:00
Matthew Honnibal
10a4a843ea
* Enumerate all symbols in one file
2015-10-13 13:44:39 +11:00
Matthew Honnibal
85ce36ab11
* Refactor symbols, so that frequency rank can be derived from the orth id of a word.
2015-10-13 13:44:39 +11:00
Matthew Honnibal
dfbcff2ff1
* Revert codecs/io change to strings.pyx, as it seemed to cause an error? Will investigate.
2015-10-10 15:54:55 +11:00
Matthew Honnibal
9dd2f25c74
* Fix Issue #131 : Force whitespace characters to attach syntactically to previous token, and ensure they cannot serve as stand-alone 'sentence' units.
2015-10-10 15:53:30 +11:00
Matthew Honnibal
8b39feefbe
* Add dependency post-process rule to ensure spaces are attached to neighbouring tokens, so that they can't be sentence boundaries
2015-10-10 15:32:13 +11:00
Matthew Honnibal
2153067958
* Fix use of io in strings.pyx
2015-10-10 15:03:12 +11:00
Matthew Honnibal
ec874247b5
Merge branch 'master' of ssh://github.com/honnibal/spaCy
2015-10-10 14:23:51 +11:00
Matthew Honnibal
30de4135c9
* Fix merge problem
2015-10-10 14:22:32 +11:00
Matthew Honnibal
dc393a5f1d
Merge pull request #126 from tomtung/master
...
Improve slicing support for both Doc and Span
2015-10-10 14:14:57 +11:00
Matthew Honnibal
83dccf0fd7
* Use io module instead of deprecated codecs module
2015-10-10 14:13:01 +11:00
Matthew Honnibal
a3dfe2b901
* Increment data version
2015-10-09 13:26:17 +02:00
Matthew Honnibal
2d9e5bf566
* Allow punctuation to be lemmatized
2015-10-09 19:02:42 +11:00
Matthew Honnibal
5332c0b697
* Add support for punctuation lemmatization, to handle unicode characters. This should help in addressing Issue #130
2015-10-09 18:54:40 +11:00
Yubing (Tom) Dong
9a6811acc4
Merge remote-tracking branch 'upstream/master'
2015-10-08 22:53:02 -07:00
Matthew Honnibal
b125289f30
* Fix type declaration in asciied function
2015-10-09 13:46:57 +11:00
Matthew Honnibal
801d55a6d9
* Fix phrase matcher
2015-10-09 02:00:45 +11:00
Matthew Honnibal
b3a70e6375
* Clean up unnecessary try/except block
2015-10-08 14:34:11 +11:00
Yubing (Tom) Dong
0f601b8b75
Update docstring of Doc.__getitem__
2015-10-07 01:27:28 -07:00
Yubing (Tom) Dong
3fd3bc79aa
Refactor to remove duplicate slicing logic
2015-10-07 01:25:35 -07:00
Yubing (Tom) Dong
97685aecb7
Add slicing support to Span
2015-10-06 02:45:49 -07:00
Yubing (Tom) Dong
ef2af20cd3
Make Doc's slicing behavior conform to Python conventions
2015-10-06 02:41:28 -07:00
Yubing (Tom) Dong
2fc33e8024
Allow step=1 when slicing a Doc
2015-10-06 00:57:05 -07:00
Matthew Honnibal
b228a8f4a6
* Remove spacy/en/attrs
2015-10-06 16:20:46 +11:00
Matthew Honnibal
693677fd8d
* Prepare to remove en/attrx file, now that moving to symbols.pyx
2015-10-06 16:20:13 +11:00
Matthew Honnibal
3d9f41c2c9
* Add LookupError for better error reporting in Vocab
2015-10-06 10:34:59 +11:00
Matthew Honnibal
ecc5281b36
* Remove en/pos.pyx, as the tagger code now lives in spacy/tagger.pyx
2015-10-06 10:12:08 +11:00
alvations
8caedba42a
caught more codecs.open -> io.open
2015-09-30 20:20:09 +02:00
alvations
8199012d26
changing deprecated codecs.open to io.open =)
2015-09-30 20:10:15 +02:00
Matthew Honnibal
87e6186828
* Rename _seq to doc attribute in Span
2015-09-29 23:03:55 +10:00
Matthew Honnibal
ab694b0364
* Fix open-bounded slice indices.
2015-09-29 23:03:09 +10:00
Matthew Honnibal
a6ced80c0c
* Fix Issue #116 : Misleading handling of True value in Language.__init__.
2015-09-29 20:54:12 +10:00
Matthew Honnibal
f9d2a5b651
* Fix issue #112 : Replace unidecode with text-unidecode, to avoid license problems.
2015-09-28 23:40:18 +10:00
Matthew Honnibal
2c33a96ac3
Merge pull request #99 from rw/patch-1
...
Force SSL for downloading English language data.
2015-09-28 17:46:26 +10:00
Matthew Honnibal
abf0d930af
* Fix API for loading word vectors from a file.
2015-09-23 23:51:08 +10:00
Matthew Honnibal
f5c256745b
Merge branch 'master' of ssh://github.com/honnibal/spaCy
2015-09-22 12:26:24 +10:00
Matthew Honnibal
528e26a506
* Add rule to ensure ordinals are preserved as single tokens
2015-09-22 12:26:05 +10:00
Robert
8711b64860
Force SSL for downloading English language data.
...
It would also be nice to have a checksum for this.
2015-09-21 17:26:01 -07:00
Matthew Honnibal
f7283a5067
* Fix vectors bugs for OOV words
2015-09-22 02:10:25 +02:00
Matthew Honnibal
44aecba701
* Fix Token.has_vector and Lexeme.has_vector
2015-09-22 01:43:16 +02:00
Matthew Honnibal
596fde8daa
* Add has_vector attribute to Token and Lexeme
2015-09-21 19:52:43 +10:00
Matthew Honnibal
f32927efbf
* Raise exceptions if attempt to access parse, but data is not installed. This partly but not fully addresses Issue #97 . Still need exceptions on the various Token attributes that access the parse tree, e.g. token.head, token.lefts, token.rights, etc. Exceptions should be centralized, too.
2015-09-21 18:35:40 +10:00
Matthew Honnibal
388062ae01
* Fix repvec_length problem
2015-09-21 18:10:51 +10:00
Matthew Honnibal
ac459278d1
* Fix vector length error reporting, and ensure vec_len is returned
2015-09-21 18:08:32 +10:00
Matthew Honnibal
ba4e563701
* Ensure vectors are same length, and return vector length in load_vectors_bz2
2015-09-21 18:03:08 +10:00
Matthew Honnibal
d00fe2bbc6
* Don't allow Span objects to be written to, as it introduces subtle bugs because they're created afresh from Doc.sents, Doc.ents etc.
2015-09-21 17:59:39 +10:00
Matthew Honnibal
d6945bf880
* Add way to load vectors from bz2 file to vocab
2015-09-17 12:58:23 +10:00
Matthew Honnibal
77856c4fcd
* Try giving Doc and Span objects vector and vector_norm attributes, and .similarity functions. Turns out to be a bad idea.
2015-09-17 11:50:11 +10:00
Matthew Honnibal
191d593e03
* Fix vectors bug in lexeme
2015-09-15 19:05:11 +10:00
Matthew Honnibal
3d87519f64
* Remove vectors argument from Vocab object
2015-09-15 14:47:14 +10:00
Matthew Honnibal
362526b592
* Rename vectors_length attribute
2015-09-15 14:43:31 +10:00
Matthew Honnibal
60c26b2dfa
* Fix slicing when start or stop is None
2015-09-15 14:43:10 +10:00
Matthew Honnibal
7ac6cacc26
* Remove const qualifier on LexemeC.repvec
2015-09-15 14:42:51 +10:00
Matthew Honnibal
dd4d64b235
* Support setting of word vectors on Lexeme object.
2015-09-15 14:42:27 +10:00
Matthew Honnibal
27f988b167
* Remove the vectors option to Vocab, preferring to either load vectors from disk, or set them on the Lexeme objects.
2015-09-15 14:41:48 +10:00
Matthew Honnibal
193f127f81
* Fix ugly py_check_flag and py_set_flag functions in Lexeme
2015-09-15 13:06:18 +10:00
Matthew Honnibal
9561d88529
* Add is_stop to Python API
2015-09-14 18:25:40 +10:00
Matthew Honnibal
65dc0d1dfb
* Extend word vectors support, with .similarity() function, vector_norm property, and rename repvec to vector. Keep repvec name as well for now for backwards compatibility.
2015-09-14 17:49:58 +10:00
Matthew Honnibal
e13e47e9e5
* Add English stop words
2015-09-14 17:48:51 +10:00
Matthew Honnibal
24ed3fc25c
* Check file existence before opening in lemmatizer
2015-09-13 10:45:21 +10:00
Matthew Honnibal
dbb48ce49e
* Delete extra wordnets
2015-09-13 10:31:37 +10:00
Matthew Honnibal
e9c59693ea
* Remove assertion from vocab.pyx
2015-09-13 10:30:08 +10:00
Matthew Honnibal
c08f10083c
* Add test and test_with_ws attributes.
2015-09-13 10:27:42 +10:00
Matthew Honnibal
0b7d2a6c62
* Inc version
2015-09-13 01:26:29 +02:00
Matthew Honnibal
e1dfaeed8a
* Check serializer freqs exist before loading
2015-09-12 23:49:38 +02:00
Matthew Honnibal
a412c66c8c
* Check serializer freqs exist before loading
2015-09-12 23:40:01 +02:00
Matthew Honnibal
631c843ed1
* Don't look for index.adv in lemmatizer
2015-09-12 06:03:44 +02:00
Matthew Honnibal
dfdd4f2d60
Merge branch 'develop' of https://github.com/honnibal/spaCy into develop
2015-09-10 15:23:06 +02:00
Matthew Honnibal
e285ca7d6c
* Load serializer freqs in vocab
2015-09-10 15:22:48 +02:00
Matthew Honnibal
f7fdcce1f9
Merge branch 'develop' of https://github.com/honnibal/spaCy into develop
2015-09-10 14:52:47 +02:00
Matthew Honnibal
85c3fec1d1
* Fix morphology loading
2015-09-10 14:52:23 +02:00
Matthew Honnibal
7c660c5efc
* Use dict.get in lemmatizer
2015-09-10 14:51:39 +02:00
Matthew Honnibal
094440f9f5
Merge branch 'develop' of ssh://github.com/honnibal/spaCy into develop
2015-09-10 14:51:17 +02:00
Matthew Honnibal
c3f773cd63
* Fix Lexeme.check_flag
2015-09-10 14:51:05 +02:00
Matthew Honnibal
90da3a695d
* Load lemmatizer from disk in Vocab.from_dir
2015-09-10 14:49:10 +02:00
Matthew Honnibal
e7e529edf4
* Fix Lexeme.check_flag
2015-09-10 14:45:43 +02:00
Matthew Honnibal
9e7bfe8449
* Fix space at end of merged token
2015-09-10 14:45:17 +02:00
Matthew Honnibal
f634191e27
* Fix vocab read/write
2015-09-10 14:44:38 +02:00
Matthew Honnibal
31ccf494e6
Merge branch 'develop' of https://github.com/honnibal/spaCy into develop
2015-09-09 14:33:38 +02:00
Matthew Honnibal
a7f4b26c8c
* Tmp
2015-09-09 14:33:26 +02:00
Matthew Honnibal
07686470a9
* Don't consider a coordinated NP a base chunk
2015-09-09 14:32:28 +02:00
Matthew Honnibal
d9f1fc2112
* Add deprecation warning for unused load_vectors argument.
2015-09-09 14:31:09 +02:00
Matthew Honnibal
0b527fbdc8
* Set POS tag in morphology
2015-09-09 14:30:24 +02:00
Matthew Honnibal
07c09a0e1b
* Fix attribute getters and setters in Lexeme
2015-09-09 14:29:22 +02:00
Matthew Honnibal
d6561988cf
* Fix lexemes.bin
2015-09-09 11:49:51 +02:00
Matthew Honnibal
c301bebd33
Merge branch 'master' of https://github.com/honnibal/spaCy into develop
2015-09-09 10:55:39 +02:00
Matthew Honnibal
0e24d099a1
* Fix L/R edge bug, by ensuring l_edge and r_edge are preset, and fixing the way the edges update in del_arc. Bugs keep arising here because the edges are absolute positions, where everything else is relative. I'm also not 100% convinced that del_arc is handled correctly. Do we need to update the parents?
2015-09-09 03:40:44 +02:00
Matthew Honnibal
2be3620333
* Save morphological analyses in a cache
2015-09-08 15:39:24 +02:00
Matthew Honnibal
1def5a6cbe
* Fix print statements in matcher
2015-09-08 15:38:19 +02:00
Matthew Honnibal
64d71f8893
* Fix lemmatizer
2015-09-08 15:38:03 +02:00
Matthew Honnibal
623329b19a
Merge branch 'master' of ssh://github.com/honnibal/spaCy into develop
2015-09-08 14:27:01 +02:00
Matthew Honnibal
62a01dd41d
* Fix issue #92 : lexemes.bin read error on 32-bit platforms.
2015-09-08 14:23:58 +02:00
Matthew Honnibal
ef58607a99
* Add spacy.it
2015-09-06 22:10:37 +02:00
Matthew Honnibal
2154a54f6b
* Add spacy.de
2015-09-06 21:56:47 +02:00
Matthew Honnibal
f6ec5bf1b0
* Use empty tag map in vocab if none supplied
2015-09-06 20:19:27 +02:00
Matthew Honnibal
4f8e38271d
* Fix merge errors in lexeme.pxd
2015-09-06 20:19:08 +02:00
Matthew Honnibal
86c888667f
* Merge in changes from de branch
2015-09-06 19:49:28 +02:00
Matthew Honnibal
d2fc104a26
* Begin merge of Gazetteer and DE branches
2015-09-06 19:45:15 +02:00
Matthew Honnibal
dbf8dce109
Merge branch 'gaz' of ssh://github.com/honnibal/spaCy into gaz
2015-09-06 18:44:14 +02:00
Matthew Honnibal
9eae9837c4
* Fix morphology look up
2015-09-06 17:53:39 +02:00
Matthew Honnibal
6427a3fcac
* Temporarily import flag attributes in matcher
2015-09-06 17:53:12 +02:00
Matthew Honnibal
7cc56ada6e
* Temporarily add py_set_flag attribute in Lexeme
2015-09-06 17:52:51 +02:00
Matthew Honnibal
e35bb36be7
* Ensure Lexeme.check_flag returns a boolean value
2015-09-06 17:52:32 +02:00
Matthew Honnibal
7e4fea67d3
* Fix bug in token subtree, introduced by duplication of L/R code in Stateclass. Need to consolidate the two methods.
2015-09-06 10:48:36 +02:00