Commit Graph

1480 Commits

Author SHA1 Message Date
Matthew Honnibal
2c8dd91785 * Fix first code example on the website 2016-01-23 18:09:19 +01:00
Matthew Honnibal
3af84cfd6e * Increment version 2016-01-21 17:49:27 +01:00
Henning Peters
65aeac24cb remove package version constraint 2016-01-21 17:40:51 +01:00
Matthew Honnibal
792c98a438 * Increment version for OSX-fixed release of v0.100 2016-01-21 00:23:04 +01:00
Matthew Honnibal
82d011ac43 * Fix test for whitespace 2016-01-19 20:38:26 +01:00
Matthew Honnibal
e89069dcae * Fix matcher test 2016-01-19 20:24:01 +01:00
Matthew Honnibal
63e3d4e27f * Add comment on Vocab.__reduce__ 2016-01-19 20:11:25 +01:00
Matthew Honnibal
e1282b7f2f * Require user-custom NER classes to work without adding the label. 2016-01-19 20:11:03 +01:00
Matthew Honnibal
84c5dfbfc3 * Clean up debugging python list 2016-01-19 20:10:32 +01:00
Matthew Honnibal
04d0686b26 * Make TransitionSystem.add_action idempotent, i.e. ignore duplicate added actions. 2016-01-19 20:10:04 +01:00
Matthew Honnibal
c4a89d56bd * Automatically register any entity types pre-set on the tokens, so that the NER works with user-given entity types. 2016-01-19 20:09:26 +01:00
Matthew Honnibal
f0f92793f6 * Add test for user NER classes in matcher blocking the NER model. Re Issue #178 and Issue #217 2016-01-19 19:23:16 +01:00
Matthew Honnibal
65c5bc4988 * Add add_label method, to allow users to register new entity types and dependency labels. 2016-01-19 19:11:02 +01:00
Matthew Honnibal
151aa0b0e2 * Allow users to add_label, in order to extend the entity recogniser to new classes. Does not by itself add a class to the model 2016-01-19 19:09:33 +01:00
Matthew Honnibal
c8e0011ebc * Add iterators to the NER and parser transition systems, to get the action types 2016-01-19 19:07:43 +01:00
Matthew Honnibal
515493c675 * Add xfail test for Issue #225: tokenization with non-whitespace delimiters 2016-01-19 13:20:14 +01:00
Matthew Honnibal
7abe653223 * Fix imports 2016-01-19 03:36:51 +01:00
Matthew Honnibal
590f38bdb2 * Add hacky solution to Issue #220. Currently specials.json only supports literal patterns, which doesn't allow us to pre-tag whitespace with the correct token, SP, as a rule. The data-driven approach should be easy but for some reason fails here. Adding a hard code in Morphology isn't a good solution, but we do want to fix the behaviour right away, and don't want to wait for an architecturally better solution. 2016-01-19 03:35:20 +01:00
Matthew Honnibal
445164d5b4 * Restore the LOCAL_DATA_DIR global in spacy/en/__init__.py, although this is now deprecated 2016-01-19 02:54:56 +01:00
Matthew Honnibal
04177debd0 * Unwind limit to sentence boundary detection that prevents it from inserting boundaries on whitespace. Replace it with a check for whitespace in StateClass.fast_forward, so that whitespace is LeftArced when it's on the stack. This should prevent the previous problem of whitespace-only sentences. Should fix Issue #184, but may cause further problems. Needs testing. 2016-01-19 02:54:15 +01:00
Matthew Honnibal
7893de3203 * Add test for Issue #184: Whitespace at sentence boundary causes sentence boundary error. 2016-01-18 23:04:38 +01:00
Matthew Honnibal
bba0a5e078 * Handle string paths in default_vocab, default_parser, default_entity in Language class 2016-01-18 22:37:24 +01:00
Matthew Honnibal
e825fd9554 * Make some of the website tests work without models 2016-01-18 18:14:44 +01:00
Matthew Honnibal
334c4b2b57 * Disprefer punctuation and spaces as heads of spans 2016-01-18 18:14:09 +01:00
Matthew Honnibal
bed36ab0ff * Fix import of HEAD attribute 2016-01-18 17:34:43 +01:00
Matthew Honnibal
28c659c1fe * Fix import for numpy 2016-01-18 17:25:04 +01:00
Matthew Honnibal
fc36bcf458 * Fix import for English 2016-01-18 17:14:40 +01:00
Matthew Honnibal
cc4c335e14 * Set heads for test_merge_tokens, to make the test run without models 2016-01-18 17:00:11 +01:00
Matthew Honnibal
c107da9738 * Bug fix to _count_words_to_root 2016-01-18 16:59:38 +01:00
Matthew Honnibal
f24833d607 * Fix merge for coordinations 2016-01-18 16:03:19 +01:00
Matthew Honnibal
14534958a9 * Fix bug in Span.root 2016-01-18 15:40:28 +01:00
Matthew Honnibal
714cbc03d5 * Add test for Issue #203: nested noun chunks. 2016-01-16 18:02:30 +01:00
Matthew Honnibal
4e2253170c * Move test for doc.merge to tokens_api file, to avoid name conflicts which upset pytest 2016-01-16 18:01:36 +01:00
Matthew Honnibal
34a157511f * Move test_merge_hang to test_tokens_api 2016-01-16 18:00:26 +01:00
Matthew Honnibal
fc8f26584a * Don't consider NPs connected to parse via conj relation as noun chunks. Change motivated by the nested noun chunks identified in Issue #203, but might be problematic. Also allow root NPs to be considered noun chunks. 2016-01-16 17:52:40 +01:00
Matthew Honnibal
4a16dbfeca * Add test for Issue #203: noun chunks should be flat, but sometimes are nested 2016-01-16 17:41:25 +01:00
Matthew Honnibal
995b2d18fd * Route token.string via token.txt_with_ws, to deprecate token.string in future 2016-01-16 17:14:34 +01:00
Matthew Honnibal
54a98eaf19 * Fix typo text_wth_ws --> text_with_ws. Reroute .string attribute to text_with_ws, to deprecate .string in future 2016-01-16 17:13:50 +01:00
Matthew Honnibal
3e9961d2c4 * If final token is whitespace, don't mark it as owning a trailing space. Fixes Issue #154 2016-01-16 17:08:59 +01:00
Matthew Honnibal
223d2b3484 * Add test for Issue #154: Additional whitespace introduced when string ends with a whitespace token. 2016-01-16 17:08:07 +01:00
Matthew Honnibal
3dc398b727 * Fix merge conflict in requirements.txt 2016-01-16 16:20:49 +01:00
Matthew Honnibal
fc5962a77d * Improve test for root token in Span 2016-01-16 16:19:09 +01:00
Matthew Honnibal
c025a0c64b * Check for KeyboardInerrupt in parser.__call__ 2016-01-16 16:18:44 +01:00
Matthew Honnibal
03e8a4293d * Add loop guard to Token.lefts and Token.rights properties 2016-01-16 16:18:17 +01:00
Matthew Honnibal
304339985e * Add a linear scan to Span.root method, to help with long sentences 2016-01-16 16:17:28 +01:00
Matthew Honnibal
aa0dd79f52 * Delete test_token_references, which checked a flakey strategy for preventing orphan tokens from a while ago. Now orphan tokens simply hold a reference to Pool, preventing the memory from being freed underneath them. This means that we don't need to run this slow test. 2016-01-16 16:03:35 +01:00
Matthew Honnibal
8cbcc3a799 * Fix calculation of root token in Span. Now take root to be word with shortest tree path. Avoids parse trees ending up in inconsistent state, as had occurred in Issue #214. 2016-01-16 15:38:50 +01:00
Matthew Honnibal
c1039fa4b4 * Add test for Issue #214. Resolved in change to Span.root 2016-01-16 15:37:47 +01:00
Henning Peters
41ea14a56f fix pickling 2016-01-16 13:23:11 +01:00
Henning Peters
5551052840 fix py2/3 issue 2016-01-16 12:44:53 +01:00
Henning Peters
235f094534 untangle data_path/via 2016-01-16 12:23:45 +01:00
Matthew Honnibal
42a9f29b40 * Add loop guard in Span.root, to raise errors if there is a cycle in the dependency parse, instead of entering an infinite loop. Re Issue #214 2016-01-16 11:53:37 +01:00
Henning Peters
6d1a3af343 cleanup unused 2016-01-16 10:05:04 +01:00
Henning Peters
846fa49b2a distinct load() and from_package() methods 2016-01-16 10:00:57 +01:00
Henning Peters
211913d689 add about.py, adapt setup.py 2016-01-15 18:57:01 +01:00
Henning Peters
f8a8f97d25 cleanup 2016-01-15 18:13:37 +01:00
Henning Peters
780cb847c9 add default_model to about 2016-01-15 18:07:15 +01:00
Henning Peters
788f734513 refactored data_dir->via, add zip_safe, add spacy.load() 2016-01-15 18:01:02 +01:00
Matthew Honnibal
478a79a3d5 * Add test for Issue #220: Whitespace being tagged as noun 2016-01-15 16:17:07 +01:00
Henning Peters
d9471f684f fix typo 2016-01-14 12:14:12 +01:00
Henning Peters
9b75d872b0 fix model download 2016-01-14 12:02:56 +01:00
Henning Peters
bc229790ac integrate with sputnik 2016-01-13 19:46:17 +01:00
Matthew Honnibal
3fbfba575a * xfail the contractions test 2015-12-31 13:16:28 +01:00
Matthew Honnibal
3bd910ccad * Merge therell test 2015-12-31 11:55:18 +01:00
Matthew Honnibal
eaf2ad59f1 * Fix use of mock Package object 2015-12-31 04:13:15 +01:00
Matthew Honnibal
029136a007 * Fix resource loading for Matcher 2015-12-31 02:45:12 +01:00
Matthew Honnibal
55bcdf8bdd * Fix errors 2015-12-29 22:32:03 +01:00
Matthew Honnibal
a6ba43ecaf * Fix errors in packaging revision 2015-12-29 18:37:26 +01:00
Matthew Honnibal
4b4eec8b47 * Fix Issue #201: Tokenization of there'll 2015-12-29 18:09:09 +01:00
Matthew Honnibal
86ee9d046d * Remove test that belongs to a change for master 2015-12-29 18:07:23 +01:00
Matthew Honnibal
a2dfdec85d * Clean up spacy.util 2015-12-29 18:06:09 +01:00
Matthew Honnibal
aec130af56 Use util.Package class for io
Previous Sputnik integration caused API change: Vocab, Tagger, etc
were loaded via a from_package classmethod, that required a
sputnik.Package instance. This forced users to first create a
sputnik.Sputnik() instance, in order to acquire a Package via
sp.pool().

Instead I've created a small file-system shim, util.Package, which
allows classes to have a .load() classmethod, that accepts either
util.Package objects, or strings. We can later gut the internals
of this and make it a proxy for Sputnik if we need more functionality
that should live in the Sputnik library.

Sputnik is now only used to download and install the data, in
spacy.en.download
2015-12-29 18:00:48 +01:00
Matthew Honnibal
0e2498da00 * Replace from_package with load() classmethod in Vocab 2015-12-29 16:56:51 +01:00
Matthew Honnibal
c5902f2b4b * Upd Lemmatizer to use MockPackage. Replace from_package with load() classmethod 2015-12-29 16:56:02 +01:00
Matthew Honnibal
4131e45543 * Add MockPackage class, to see whether we can proxy for Sputnik in a lightweight way 2015-12-29 16:55:03 +01:00
Matthew Honnibal
f5dea1406d * Fix silly mistake in Language.__init__ 2015-12-28 18:48:57 +01:00
Matthew Honnibal
187960606f * Fix pickle problems 2015-12-28 16:54:03 +01:00
Matthew Honnibal
8c7e149ec9 * Replace kwargs argument of Language.__init__ with explicit arguments, to fix pickle bug 2015-12-28 15:56:27 +01:00
Henning Peters
32d655b6e1 bump version 2015-12-28 09:34:39 +01:00
Matthew Honnibal
8b61d45ed0 * Fix merge conflicts for headers branch 2015-12-27 17:46:25 +01:00
Matthew Honnibal
6bb9c7f311 Merge pull request #202 from henningpeters/sputnik
access model via sputnik
2015-12-28 03:29:53 +11:00
Henning Peters
0e321a7105 get mingw32 to work 2015-12-22 23:25:38 +01:00
Henning Peters
d8d348bb55 allow to specify version constraint within model name 2015-12-18 19:12:08 +01:00
Henning Peters
7f7299cafb Merge branch 'tmpdir' into headers 2015-12-18 12:25:25 +01:00
Henning Peters
cfa187aaf0 fix tests 2015-12-18 10:58:02 +01:00
Henning Peters
8359bd4d93 strip data/ from package, friendlier Language invocation, make data_dir backward/forward-compatible 2015-12-18 09:52:55 +01:00
Henning Peters
970278a3d6 no need to link data dir anymore 2015-12-18 09:49:45 +01:00
Henning Peters
4f3efb8eaf avoid writing to /tmp (not cross-platform compatible) 2015-12-16 19:56:40 +01:00
Henning Peters
4ada39f472 avoid writing to /tmp (not cross-platform compatible) 2015-12-16 19:53:06 +01:00
Henning Peters
2d4efe40f9 fix sputnik call 2015-12-13 14:46:08 +01:00
Henning Peters
ac318b568c new approach to dependency headers 2015-12-13 11:49:17 +01:00
Henning Peters
345dda6f53 small fixes, add package build step 2015-12-07 06:50:26 +01:00
Henning Peters
9027cef3bc access model via sputnik 2015-12-07 06:01:28 +01:00
Henning Peters
73e5650be5 change index server 2015-11-18 18:09:46 +01:00
Henning Peters
50d15ea5d2 fix 2015-11-18 17:35:21 +01:00
Henning Peters
02a1dcec76 add data dir 2015-11-18 11:48:55 +01:00
Henning Peters
919a4f0b04 change data path, add repository 2015-11-18 11:40:46 +01:00
Henning Peters
12de895e60 fix version 2015-11-15 16:38:16 +01:00
Henning Peters
03d2f98cd5 add sputnik 2015-11-15 15:58:21 +01:00
Matthew Honnibal
ec7d36c3a4 * Add test for matcher end-point problem 2015-11-12 05:00:40 +11:00
Matthew Honnibal
d309622a27 * Add test for matcher end-point problem 2015-11-12 04:59:11 +11:00
Matthew Honnibal
56ea20a886 * Add test for matcher end-point problem 2015-11-12 04:58:53 +11:00
Matthew Honnibal
cfa4062147 * Add test for matcher end-point problem 2015-11-12 04:56:07 +11:00
Matthew Honnibal
5623242b3e * Adjust NER rules, so that U entries in gazetteer don't become B moves to the model 2015-11-12 04:48:23 +11:00
Matthew Honnibal
d67d7d5a86 * Add test for NER inconsistency bug 2015-11-08 16:19:33 +01:00
Matthew Honnibal
44fbdc7260 * Fix bug in NER transition system, that sometimes left no valid moves 2015-11-08 16:19:12 +01:00
Matthew Honnibal
ab5aac5b2f * Add .rank property to Token and Lexeme, for frequency rank 2015-11-08 16:18:25 +01:00
Matthew Honnibal
fde9a22ec2 * Add new test for ner 2015-11-08 13:57:15 +01:00
Matthew Honnibal
e92371bb54 * Fix rule that made Last action invalid if there was a preset of O, since if the entity is already open, that ship has sailed. 2015-11-08 22:17:51 +11:00
Matthew Honnibal
3b74739c3e * Download updated data 2015-11-08 21:24:25 +11:00
Matthew Honnibal
31da42eb27 * Mark tests that require models 2015-11-07 19:27:38 +11:00
Matthew Honnibal
8e26a28616 * Mark tests that require models 2015-11-07 19:10:56 +11:00
Matthew Honnibal
15eab7354f * Remove extraneous test files 2015-11-07 18:45:13 +11:00
Matthew Honnibal
6f47074214 * Make constructor of ParserModel and TaggerModel the same as AveragedPerceptron, for each pickling. 2015-11-07 18:25:17 +11:00
Matthew Honnibal
1cfa20fb17 * Fix sentence-final whitespace issue 2015-11-07 17:34:46 +11:00
Matthew Honnibal
7663970d5f * Removed unused i variable from Span, and set attributes to read-only 2015-11-07 17:06:15 +11:00
Matthew Honnibal
4b3c96d76d * Fix zero-length spans 2015-11-07 17:05:16 +11:00
Matthew Honnibal
888c05a7fa * Fix variable naming in StepwiseState, for thinc 4.0 2015-11-07 11:02:44 +11:00
Matthew Honnibal
fc2185bfe3 * Fix variable naming in StepwiseState, for thinc 4.0 2015-11-07 10:48:31 +11:00
Matthew Honnibal
954442a807 * Fix variable naming in StepwiseState, for thinc 4.0 2015-11-07 10:30:45 +11:00
Matthew Honnibal
06f26d258e * Fix test_basic_create 2015-11-07 10:04:37 +11:00
Matthew Honnibal
1d3884c46d * Fix test_basic_create 2015-11-07 10:03:56 +11:00
Matthew Honnibal
cc8febcbe1 * Fix Span comparison 2015-11-07 09:54:14 +11:00
Matthew Honnibal
af70dc166a * Fix Last restriction, that was supposed to prevent conflicts with presets, but was incorrect. 2015-11-07 09:52:00 +11:00
Matthew Honnibal
a9b612abdf * Rework the Span-merge patch, to avoid extending the interface of Doc, and avoid virtualizing the Span.start and Span.end indices, to keep Span usage efficient 2015-11-07 09:01:12 +11:00
Matthew Honnibal
56499d89ef * Rework the Span-merge patch, to avoid extending the interface of Doc, and avoid virtualizing the Span.start and Span.end indices, to keep Span usage efficient 2015-11-07 08:55:34 +11:00
Andreas Grivas
83ca4e0b93 * use old merge tests - add more 2015-11-07 07:57:04 +11:00
Andreas Grivas
4be7fda453 * span start, end -> properties. autoupdate after merge 2015-11-07 07:57:04 +11:00
Andreas Grivas
562db6d2d0 * merge add lex last - add index finder funcs 2015-11-07 07:57:04 +11:00
Matthew Honnibal
a06e3c8963 * Fix bone-headed mistake in StateClass.E 2015-11-07 07:35:28 +11:00
Matthew Honnibal
d24b8509e4 * Correct screw ups from the previous commits 2015-11-07 06:51:41 +11:00
Matthew Honnibal
5efad178b5 * Set ent tag when close entity 2015-11-07 06:09:25 +11:00
Matthew Honnibal
9285f01d26 * Fix broken StateClass.E tracking 2015-11-07 06:06:39 +11:00
Matthew Honnibal
19136b0e7d * Add better debug message for illegal move 2015-11-07 05:34:37 +11:00
Matthew Honnibal
2733816b7b * Fix whitespace 2015-11-07 05:31:06 +11:00
Matthew Honnibal
01ab464383 * Prevent Begin and In moves from applying in NER if we're at the last token of a sentence, as this would mean the entity would span over a sentence boundary. Re Issue #169 2015-11-07 05:30:44 +11:00
Matthew Honnibal
b65633f270 * Fix function that returns nth entity in StateClass. Was only returning the first. 2015-11-07 05:29:11 +11:00
Matthew Honnibal
410b6f9ec1 * Remove deprecated _ml.pyx. We now use the nicer APIs provided by thinc 4.0, and subclass the AveragedPerceptron class. 2015-11-07 05:13:10 +11:00
Matthew Honnibal
3c162dcac3 * Refactor away from the _ml module, to use thinc 4.0. Still some work needs to be done, e.g. to add __reduce__ to the models, more testing, etc. 2015-11-07 03:24:30 +11:00
Matthew Honnibal
9d1b2a103a * Fix capitalization in lemmatizer 2015-11-06 05:44:35 +11:00
Matthew Honnibal
6ed3aedf79 * Merge vocab changes 2015-11-06 00:48:08 +11:00
Matthew Honnibal
72abbb43fb * Add type declarations in strings.pyx 2015-11-06 00:47:26 +11:00
Matthew Honnibal
5b2af4864f * When lemmatizing non-noun, non-verb, non-adj words, output lower-case 2015-11-06 00:45:09 +11:00
Matthew Honnibal
754bf04162 * Remove declaration of Model.update 2015-11-06 00:31:15 +11:00
Matthew Honnibal
e18bdff23a Merge branch 'master' of ssh://github.com/honnibal/spaCy 2015-11-06 00:26:15 +11:00
Matthew Honnibal
b9991fbd20 * Update to use thinc 3.0 2015-11-06 00:25:59 +11:00
Matthew Honnibal
864a8f45d8 * Use unicode in StringStore.intern, instead of unreliably casting to bytes. 2015-11-05 11:32:19 +00:00
Matthew Honnibal
b18204cd52 * Fix StringStore._realloc, re Issue #155 2015-11-05 11:28:26 +00:00
Matthew Honnibal
f8004c5f65 * Begin upgrading to improved thinc API 2015-11-05 03:53:03 +11:00
Matthew Honnibal
adc7bbd6cf * Fix name of like_num in default_lex_attrs 2015-11-04 22:02:47 +11:00