Commit Graph

3713 Commits

Author SHA1 Message Date
Matthew Honnibal
583628c350 Import metadata into __init__ 2017-03-18 19:30:03 +01:00
Matthew Honnibal
1754e0db9b Call pip via subprocess, to make it use virtualenv 2017-03-18 19:29:36 +01:00
ines
1277abcde2 Remove print statement 2017-03-18 19:14:58 +01:00
Matthew Honnibal
dcec104643 Remove unused import 2017-03-18 18:57:45 +01:00
Matthew Honnibal
703eb7bdbd Fix link module 2017-03-18 18:57:31 +01:00
Matthew Honnibal
f6c6c89546 Add empty data directory 2017-03-18 18:32:29 +01:00
ines
7d33104180 Use distutils.sysconfig.get_python_lib
site.getsitepackages seems to not work as expected in Python 2
2017-03-18 18:20:40 +01:00
Matthew Honnibal
1a53fcc685 Fix CLI for Python 2 2017-03-18 18:14:03 +01:00
ines
aefb898e37 Add title-case version of morph rules (resolves #686) 2017-03-18 17:27:11 +01:00
ines
64ec17abc1 Pass xpassing tests and add xfails for failures 2017-03-18 17:20:46 +01:00
ines
d0b85faf69 Pass regression test for #401 (resolves #401)
Fixed in new English models.
2017-03-18 17:06:49 +01:00
ines
be9daefbdd Remove actual model downloading from tests 2017-03-18 17:01:10 +01:00
ines
850650221a Use correct command in deprecated download command message 2017-03-18 17:01:01 +01:00
ines
0dd7710556 Make sure paths are paths 2017-03-18 16:48:52 +01:00
Matthew Honnibal
de0e6385b4 Merge branch 'master' of https://github.com/explosion/spaCy 2017-03-18 16:17:28 +01:00
Matthew Honnibal
fe442cac53 Fix #717: Set correct lemma for contracted verbs 2017-03-18 16:16:10 +01:00
ines
ad934a9abd Add regression test for #693 2017-03-18 16:12:30 +01:00
ines
f57c616830 Add regression test for #704 and test new model (resolves #704)
(using new English model)
2017-03-18 16:04:14 +01:00
Matthew Honnibal
413138de79 Fix #719: Lemmatizer can no longer output empty string 2017-03-18 16:02:06 +01:00
ines
ab1451f997 Don't mark compatibility test as slow 2017-03-18 15:17:39 +01:00
ines
ec3e810662 Add directory cli and set up command line interface 2017-03-18 15:14:48 +01:00
ines
cd94ea1095 Use info module for spacy.info() 2017-03-18 13:01:26 +01:00
ines
e3e25c0a33 Add spacy.info module
Print info about spaCy installation, local setup and models. Allow
export in Markdown format to copy-paste into GitHub issues.
2017-03-18 13:01:16 +01:00
ines
0eafc0f2c6 Add util functions to print data as table or markdown list 2017-03-18 13:00:14 +01:00
ines
6b9b444065 Fix imports 2017-03-18 12:59:41 +01:00
ines
a035ebd32a Use pathlib.Path instead of os.path 2017-03-18 12:59:21 +01:00
ines
9605cf39cc Handle default path in Language classes 2017-03-18 12:58:45 +01:00
Matthew Honnibal
ac4b88cce9 Fix auto-linking in download command 2017-03-17 21:36:13 +01:00
ines
8a34c3e666 Fix shortcut name 2017-03-17 20:07:34 +01:00
Matthew Honnibal
6420f86f02 Merge changes to __init__.py 2017-03-17 19:51:45 +01:00
ines
e01fbacf81 Update resolve_model_name 2017-03-17 19:26:28 +01:00
ines
aedefef49d Add function to resolve model names and link them 2017-03-17 18:47:05 +01:00
Matthew Honnibal
d013aba7b5 Merge branch 'master' of https://github.com/explosion/spaCy 2017-03-17 18:30:53 +01:00
Matthew Honnibal
854cfce7cf Make vocabs more compatible across versions
Previously, symbols were inserted into the string-store
before strings were loaded. This meant that adding a symbol
would invalidate saved models. We now make sure that strings
are loaded faithfully, so that compatibility is maintained.
2017-03-17 18:29:04 +01:00
Matthew Honnibal
1cc841e600 Merge branch 'master' of https://github.com/explosion/spaCy 2017-03-17 08:18:11 -05:00
Matthew Honnibal
4bfc55b532 Auto-add words to vocab when loading vectors
When calling vocab.load_vectors_from_bin_loc, ensure that missing
entries are added to the vocab. Otherwise, loading vectors into an
empty vocab object resulted in no vectors being added.
2017-03-17 08:15:59 -05:00
ines
0e533ad0cc Mark compatibility table test as slow (temporary)
Prevent Travis from running test test until models repo is published
2017-03-17 13:11:36 +01:00
ines
279b1d1965 Update version 2017-03-17 12:43:08 +01:00
ines
8af4b9e4df Fix compatibility.json link 2017-03-17 12:43:03 +01:00
Matthew Honnibal
a630726b13 Fix typo in tests 2017-03-16 20:50:36 -05:00
Matthew Honnibal
f98b30583f Fix tests 2017-03-16 19:48:00 -05:00
Matthew Honnibal
db51abf685 Fix tests 2017-03-16 18:53:47 -05:00
Matthew Honnibal
adb0b7e43b Fix loading when no package found 2017-03-16 18:30:23 -05:00
Matthew Honnibal
5c66cffafd Add tag map for Spanish 2017-03-16 18:05:15 -05:00
Matthew Honnibal
c4351e1165 Update base-form check in lemmatizer, for UD 2.0 morphology 2017-03-16 17:59:31 -05:00
Matthew Honnibal
1e10383e1b Merge branch 'master' of https://github.com/explosion/spaCy 2017-03-16 17:41:13 -05:00
Matthew Honnibal
859315863a Merge branch 'master' of https://github.com/explosion/spaCy 2017-03-16 17:40:07 -05:00
Matthew Honnibal
fea9fe08af Merge pull request #866 from juanmirocks/master
Fix lemmatization of OOV words
2017-03-16 23:37:36 +01:00
Matthew Honnibal
ffd4a19383 Increment version 2017-03-16 17:35:57 -05:00
Matthew Honnibal
28bb546939 Merge pull request #883 from ericzhao28/master
Add `lower_` and `upper_` properties to `Span` class
2017-03-16 23:35:47 +01:00
ines
fd60961825 Fix spacing 2017-03-16 23:23:26 +01:00
Matthew Honnibal
890747d8ff Fix trailing whitespace on morphology features 2017-03-16 17:07:37 -05:00
Matthew Honnibal
af41a9790c Merge remote-tracking branch 'origin/develop-downloads' 2017-03-16 20:41:37 +01:00
Matthew Honnibal
303a56f173 Get absolute path for linking 2017-03-16 20:41:23 +01:00
ines
3d484c3faf Don't print in parse_package_meta and accept on_erro callback instead
TODO: log warning for missing meta data in spacy.link, as this affects
the Language class returned by spacy.load()
2017-03-16 20:34:50 +01:00
ines
d8c984b65e Don't exit if no model meta data is present 2017-03-16 20:33:33 +01:00
Matthew Honnibal
2524efc0ac Merge remote-tracking branch 'origin/develop-downloads' 2017-03-16 20:20:41 +01:00
ines
8253581057 Link model automatically if not direct download 2017-03-16 19:54:51 +01:00
Matthew Honnibal
8843b84bd1 Merge remote-tracking branch 'origin/develop-downloads' 2017-03-16 12:00:42 -05:00
Matthew Honnibal
55f813bfbb Don't reapply the model during training 2017-03-16 11:59:43 -05:00
Matthew Honnibal
c90dc7ac29 Clean up state initiatisation in transition system 2017-03-16 11:59:11 -05:00
Matthew Honnibal
a46933a8fe Clean up FTRL parsing stuff. 2017-03-16 11:58:20 -05:00
ines
618ce3b425 Add .meta to Language object
Allows getting the current model's meta data, e.g.:
nlp = spacy.load('my-model')
print(nlp.meta)
2017-03-16 17:14:56 +01:00
ines
e348d4434c Add spacy.info(model_name) to show model meta
Allows "previewing" model before loading and making sure it's linked
correctly.
2017-03-16 17:13:40 +01:00
ines
eea3b35e3f Update model loading to support links
Remove match_best_version check, fetch model language from meta instead
of directory name, and don't make too many assumptions – if model is
downloaded via downloader, version should match anyway. (Otherwise,
users should be free to add and load whichever models they want.)
2017-03-16 17:13:08 +01:00
ines
5f3f04bd0a Add util function to load and parse package meta.json 2017-03-16 17:10:05 +01:00
ines
7f920c2f75 Don't break text in when rendering print_msg 2017-03-16 17:09:50 +01:00
ines
16a63d9676 Add docstring 2017-03-16 17:09:11 +01:00
ines
68c04fa897 Move sys_exit() function to util 2017-03-16 17:08:58 +01:00
ines
ccd1a79988 Add spacy.link module to link model directories to shortcuts 2017-03-16 17:01:51 +01:00
Matthew Honnibal
2611ac2a89 Fix scorer bug for NER, related to ambiguity between missing annotations and misaligned tokens 2017-03-16 09:38:28 -05:00
ines
595d89698a Add basestring 2017-03-16 10:01:14 +01:00
ines
7b2eca36e4 Revert "Fix formatting and remove unused code"
This reverts commit d7898d586f.
2017-03-16 09:58:41 +01:00
ines
2f0db1dd36 Use small English model as default 2017-03-16 09:54:40 +01:00
Matthew Honnibal
3d0833c3df Fix off-by-1 in parse features fill_context 2017-03-15 19:55:35 -05:00
Matthew Honnibal
4ef68c413f Approximate cost in Break transition, to speed things up a bit. 2017-03-15 16:40:27 -05:00
Matthew Honnibal
8543db8a5b Use ftrl optimizer in parser 2017-03-15 11:56:37 -05:00
ines
4cfc8ffbd2 Reformat pickle tests 2017-03-15 17:39:54 +01:00
ines
2a0fcf1354 Add tests for new download module 2017-03-15 17:39:43 +01:00
ines
71956c94db Handle deprecated language-specific model downloading 2017-03-15 17:37:55 +01:00
ines
58b884b6d4 Refactor download script and about.py to use new download method 2017-03-15 17:37:18 +01:00
ines
f5d1a39a5b Add util functions for printing and wrapping messages 2017-03-15 17:35:57 +01:00
ines
d7898d586f Fix formatting and remove unused code 2017-03-15 17:35:41 +01:00
ines
b672e95045 Fix formatting 2017-03-15 17:35:04 +01:00
ines
0474e706a0 Remove unused deprecated functions for sputnik 2017-03-15 17:34:54 +01:00
ines
b13e7f79b4 Fix formatting and remove unused imports 2017-03-15 17:33:57 +01:00
ines
1101fd3855 Fix formatting and remove unused imports 2017-03-15 17:33:39 +01:00
ines
842782c128 Move fix_deprecated_glove_vectors_loading to deprecated.py 2017-03-15 17:33:29 +01:00
Matthew Honnibal
4cab8ac136 Update morph exceptions test 2017-03-15 09:31:34 -05:00
Matthew Honnibal
d719f8e77e Use nogil in parser, and set L1 to 0.0 by default 2017-03-15 09:31:01 -05:00
Matthew Honnibal
c61c501406 Update beam-parser to allow parser to maintain nogil 2017-03-15 09:30:22 -05:00
Matthew Honnibal
3d4e389d23 Whitespace 2017-03-15 09:29:42 -05:00
Matthew Honnibal
7769bc31e3 Add beam-search classes 2017-03-15 09:27:41 -05:00
Matthew Honnibal
c79b3129e3 Fix setting of empty lexeme in initial parse state 2017-03-15 09:26:53 -05:00
Matthew Honnibal
d864708072 Add more morphology names in attrs.pyx 2017-03-15 09:26:16 -05:00
Matthew Honnibal
b382dc902c Add morph rules in Language 2017-03-15 09:24:40 -05:00
Matthew Honnibal
8dbff4f5f4 Wire up English lemma and morph rules. 2017-03-15 09:23:22 -05:00
Matthew Honnibal
f70be44746 Use lemmatizer in code, not from downloaded model. 2017-03-15 04:52:50 -05:00
ines
42ba740dde Revert "Merge branch 'debug'"
This reverts commit 89b79d1178, reversing
changes made to 02bdf490a1.
2017-03-13 20:11:52 +01:00
ines
4c5f51e49e Update regression test 2017-03-13 15:16:11 +01:00
ines
02bdf490a1 Remove regression test to see if it caused pytest Travis error 2017-03-13 13:00:22 +01:00
ines
17018750ac Add regression test for #717 2017-03-13 12:58:22 +01:00
ines
2883ebfca2 Remove print statement 2017-03-13 12:30:42 +01:00
ines
98c13d8aa9 Add regression test for #401 2017-03-13 12:28:41 +01:00
ines
444d665f9d Add regression test for #686 2017-03-13 12:23:35 +01:00
ines
46b17e5b51 Add regression test for #719 2017-03-13 12:17:35 +01:00
ines
c8ae682ff9 Add regression test for #636 2017-03-13 12:08:31 +01:00
ines
337f9601f2 Add missing unicode declaration 2017-03-13 12:08:19 +01:00
ines
d70386ec6e Update docstring in #886 regression test 2017-03-13 12:00:38 +01:00
ines
51ba3ef0a8 Add regression test for #886 2017-03-13 11:44:58 +01:00
ines
eec3f21c50 Add WordNet license 2017-03-12 13:58:24 +01:00
ines
f9e603903b Rename stop_words.py to word_sets.py and include more sets
NUM_WORDS and ORDINAL_WORDS are currently not used, but the hard-coded
list should be removed from orth.pyx and replaced to use
language-specific functions. This will later allow other languages to
use their own functions to set those flags. (In English, this is easier
because it only needs to be checked against a set – in German for
example, this requires a more complex function, as most number words
are one word.)
2017-03-12 13:58:22 +01:00
ines
f24f9b4b7b Remove unused code 2017-03-12 13:58:22 +01:00
ines
1da29a7146 Use new Lemmatizer data and remove file import
Since there's currently only an English lemmatizer, the global
Lemmatizer imports from spacy.en. This is unideal and still needs to be
fixed.
2017-03-12 13:58:22 +01:00
ines
0957737ee8 Add Python-formatted lemmatizer data and rules 2017-03-12 13:58:22 +01:00
ines
c89e30d1a3 Add test for English time exceptions ("1a.m." etc.) 2017-03-12 13:58:22 +01:00
ines
ce9568af84 Move English time exceptions ("1a.m." etc.) and refactor 2017-03-12 13:58:22 +01:00
ines
6b30541774 Fix formatting 2017-03-12 13:58:22 +01:00
Ines Montani
e97a30b99a Merge pull request #885 from PySUST/master
[Bengali] 	Spell checked and add new stop words
2017-03-12 13:20:59 +01:00
ines
66c1f194f9 Use consistent unicode declarations 2017-03-12 13:07:28 +01:00
shuvanon
91cb4cdb2b Sort stop_words 2017-03-12 17:55:51 +06:00
shuvanon
784f6cfa49 Update stop_words 2017-03-12 17:41:01 +06:00
shuvanon
73cc17078e Merge branch 'master' of https://github.com/PySUST/spaCy 2017-03-12 14:52:17 +06:00
shuvanon
35ec7135bb Spell checked and add new stop words 2017-03-12 14:51:34 +06:00
Em
9c809efc25 Removed mapStr 2017-03-11 16:23:26 -08:00
Matthew Honnibal
fa23278ee3 Add classes for beam parser and beam NER 2017-03-11 12:45:37 -06:00
Matthew Honnibal
6c4108c073 Add header for beam parser 2017-03-11 12:45:12 -06:00
Matthew Honnibal
4382f175b3 Squelch compiler warnings 2017-03-11 12:44:43 -06:00
Matthew Honnibal
ea2592879f Merge branch 'master' of https://github.com/explosion/spaCy 2017-03-11 11:13:37 -06:00
Matthew Honnibal
1224c4d3c6 Improve output on trainer 2017-03-11 11:12:48 -06:00
Matthew Honnibal
b438dfd3f3 Add itn argument to tagger.update 2017-03-11 11:12:21 -06:00
Matthew Honnibal
931feb3360 Allow beam parsing for NER 2017-03-11 11:12:01 -06:00
Matthew Honnibal
f77a5bb60a Switch back to greedy parser 2017-03-11 11:11:30 -06:00
Matthew Honnibal
ca9c8c57c0 Add iteration argument to parser.update 2017-03-11 07:00:47 -06:00
Matthew Honnibal
dcce9ca3f3 Use beam parser 2017-03-11 07:00:20 -06:00
Matthew Honnibal
e30ffdd003 Use ftrl optimizer in tagger 2017-03-11 06:59:13 -06:00
Matthew Honnibal
d59c6926c1 I think this fixes the segfault 2017-03-11 06:58:34 -06:00
Matthew Honnibal
318b9e32ff WIP on beam parser. Currently segfaults. 2017-03-11 06:19:52 -06:00
Em
426d17167f Added string manipulation for spans 2017-03-10 16:50:02 -08:00
Matthew Honnibal
b0d80dc9ae Update name of 'train' function in BeamParser 2017-03-10 14:35:43 -06:00
Matthew Honnibal
d11f1a4ddf Record negative costs in non-monotonic arc eager oracle 2017-03-10 11:22:04 -06:00
Matthew Honnibal
ecf91a2dbb Support beam parser 2017-03-10 11:21:21 -06:00
Ines Montani
a16aff17aa Merge pull request #876 from PySUST/master
[Bangla] Update "tokenizer_exceptions.py"
2017-03-10 14:46:00 +01:00
ines
10e29189ac Adjust URL testcases and xfail problems (instead of comment) 2017-03-10 14:22:50 +01:00
ines
b04893a059 Make regex locale-independent for Python 2 2017-03-10 14:21:57 +01:00
Matthew Honnibal
ea53647362 Merge branch 'develop' 2017-03-10 02:49:39 -06:00
Ines Montani
1c40890321 Add missing comma
Should fix Travis build error
2017-03-10 09:34:54 +01:00
Shuvanon Razik
c251703428 Update abbreviations 2017-03-10 10:45:01 +06:00
Matthew Honnibal
b5247c49eb Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-03-09 18:45:43 -06:00
Matthew Honnibal
798450136d Set L1 penalty to 0 in tagger. 2017-03-09 18:43:47 -06:00
Matthew Honnibal
c62da02344 Use ftrl training, to learn compressed model. 2017-03-09 18:43:21 -06:00
Matthew Honnibal
f71eeef9bb Pass path argument to end_training 2017-03-09 18:42:40 -06:00
Dan Rapp
123d3f2d38 Fix error in test case parameterization 2017-03-09 12:18:21 -07:00
Dan Rapp
b9307dfcd7 Merge branch 'master' into rappdw/tokenizer_exceptions_url_fix 2017-03-09 11:42:14 -07:00
Dan Rapp
3b1df3808d Issue #840 - URL pattenr too broad 2017-03-09 11:39:39 -07:00
Matthew Honnibal
5b0b968d13 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-03-08 15:03:10 +01:00
Matthew Honnibal
0ac3d27689 Fix handling of trailing whitespace
Fix off-by-one error that meant trailing spaces were being dropped.
Closes #792
2017-03-08 15:01:40 +01:00
ines
c2e3e651b8 Re-add regression test for #859 2017-03-08 14:36:09 +01:00
Matthew Honnibal
0a6d7ca200 Fix spacing after token_match
The boolean flag indicating a space after the token was
being set incorrectly after the token_match regex was applied.
Fixes #859.
2017-03-08 14:33:32 +01:00
shuvanon
85438aee1b update tokenizertokenizer 2017-03-08 17:29:39 +06:00
shuvanon
45bc78461c update tokenizertokenizer 2017-03-08 17:27:12 +06:00
Matthew Honnibal
cd33b39a04 Fix 2/3 problem for json save/load 2017-03-08 01:39:13 +01:00
Matthew Honnibal
40703988bc Use FTRL training in parser 2017-03-08 01:38:51 +01:00
Matthew Honnibal
d108534dc2 Fix 2/3 problems for training 2017-03-08 01:37:52 +01:00
Matthew Honnibal
d03d6a13f1 Merge branch 'rominf-ud20' into develop 2017-03-07 21:48:56 +01:00
Matthew Honnibal
f7374d0b86 Merge branch 'ud20' of https://github.com/rominf/spaCy into rominf-ud20 2017-03-07 21:48:37 +01:00
Matthew Honnibal
16670d3251 Xfail the vocab pickling for now 2017-03-07 21:43:28 +01:00
Matthew Honnibal
a89c3500f6 Fixes to hacky vocab pickling 2017-03-07 20:58:55 +01:00
Matthew Honnibal
d814892805 Hackish pickle support for Vocab. 2017-03-07 20:25:12 +01:00
Matthew Honnibal
26614e028f Add hacky support for StringCFile, to make pickling easier. 2017-03-07 20:24:37 +01:00
Matthew Honnibal
3edb8ae207 Whitespace 2017-03-07 17:16:26 +01:00
Matthew Honnibal
5de7e712b7 Add support for pickling StringStore. 2017-03-07 17:15:18 +01:00
Matthew Honnibal
4e75e74247 Update regression test for variable-length pattern problem in the matcher. 2017-03-07 16:08:32 +01:00
Matthew Honnibal
6d67213b80 Add test for 850: Matcher fails on zero-or-more. 2017-03-07 15:55:28 +01:00
Aniruddha Adhikary
696215a3fb add tests for Bengali 2017-03-05 11:25:12 +06:00
Aniruddha Adhikary
8f3bfe9bfc [Bengali] basic tag map, morph, lemma rules and exceptions 2017-03-04 12:36:59 +06:00
Roman Inflianskas
66e1109b53 Add support for Universal Dependencies v2.0 2017-03-03 13:17:34 +01:00
ines
8dff040032 Revert "Add regression test for #859"
This reverts commit c4f16c66d1.
2017-03-01 21:56:20 +01:00
Juan Miguel Cejuela
25c29f072d apply patch 2017-03-01 21:44:17 +01:00
Juan Miguel Cejuela
a8cfde46d3 #781 Fix test — colocalizes is lemmatized to colocaliz and colicalize 2017-03-01 21:43:08 +01:00
Juan Miguel Cejuela
a471114eb2 #781 add regression test, failing previous bug fix 2017-03-01 21:30:51 +01:00
ines
c4f16c66d1 Add regression test for #859 2017-03-01 16:07:27 +01:00
Aniruddha Adhikary
d91be7aed4 add punctuations for Bengali 2017-02-28 21:07:14 +06:00
Aniruddha Adhikary
5a4fc09576 add basic Bengali support 2017-02-28 07:48:37 +06:00
Matthew Honnibal
cc9b2b74e3 Merge branch 'french-tokenizer-exceptions' 2017-02-27 11:44:39 +01:00
Matthew Honnibal
bd4375a2e6 Remove comment 2017-02-27 11:44:26 +01:00
Matthew Honnibal
e7e22d8be6 Move import within get_exceptions() function, to speed import 2017-02-27 11:34:48 +01:00
Matthew Honnibal
34bcc8706d Merge branch 'french-tokenizer-exceptions' 2017-02-27 11:21:21 +01:00
Matthew Honnibal
0aaa546435 Fix test after updating the French tokenizer stuff 2017-02-27 11:20:47 +01:00
Matthew Honnibal
26446aa728 Avoid loading all French exceptions on import
Move exceptions loading behind a get_tokenizer_exceptions() function
for French, instead of loading into the top-level namespace. This
cuts import times from 0.6s to 0.2s, at the expense of making the
French data a little different from the others (there's no top-level
TOKENIZER_EXCEPTIONS variable.) The current solution feels somewhat
unsatisfying.
2017-02-25 11:55:00 +01:00
ines
376c5813a7 Remove print statements from test 2017-02-24 18:26:32 +01:00
ines
7c1260e98c Add regression test 2017-02-24 18:22:49 +01:00
ines
0e2e331b58 Convert exceptions to Python list 2017-02-24 18:22:40 +01:00
ines
51eb190ef4 Remove print statements from test 2017-02-24 17:41:12 +01:00
Matthew Honnibal
db5ada3995 Merge branch 'master' of https://github.com/explosion/spaCy 2017-02-24 14:28:12 +01:00
Matthew Honnibal
8f94897d07 Add 1 operator to matcher, and make sure open patterns are closed at end of document. Closes Issue #766 2017-02-24 14:27:02 +01:00
ines
67991b6e5f Add more test cases to #775 regression test to cover #847 2017-02-18 14:10:44 +01:00
ines
30ce2a6793 Exclude "shed" and "Shed" from tokenizer exceptions (see #847) 2017-02-18 14:10:44 +01:00
Ines Montani
de997c1a33 Merge pull request #842 from magnusburton/master
Added regular verb rules for Swedish
2017-02-17 11:18:20 +01:00
Magnus Burton
41fcfd06b8 Added regular verb rules for Swedish 2017-02-17 10:04:04 +01:00
ines
aa92d4e9b5 Fix unicode regex for Python 2 (see #834) 2017-02-16 23:49:54 +01:00
ines
44de3c7642 Reformat test and use text_file fixture 2017-02-16 23:49:19 +01:00
ines
3dd22e9c88 Mark vectors test as xfail (temporary) 2017-02-16 23:28:51 +01:00
ines
85d249d451 Revert "Revert "Merge pull request #836 from raphael0202/load_vectors (closes #834)""
This reverts commit ea05f78660.
2017-02-16 23:26:25 +01:00
ines
ea05f78660 Revert "Merge pull request #836 from raphael0202/load_vectors (closes #834)"
This reverts commit 7d8c9eee7f, reversing
changes made to f6b69babcc.
2017-02-16 15:27:12 +01:00
Raphaël Bournhonesque
06a71d22df Fix test failure by using unicode literals 2017-02-16 14:48:00 +01:00
Raphaël Bournhonesque
3ba109622c Add regression test with non ' ' space character as token 2017-02-16 12:23:27 +01:00
Raphaël Bournhonesque
e17dc2db75 Remove useless import 2017-02-16 12:10:24 +01:00
Raphaël Bournhonesque
3fd2742649 load_vectors should accept arbitrary space characters as word tokens
Fix bug  #834
2017-02-16 12:08:30 +01:00
ines
f08e180a47 Make groups non-capturing
Prevents hitting the 100 named groups limit in Python
2017-02-10 13:35:02 +01:00
ines
fa3b8512da Use consistent imports and exports
Bundle everything in language_data to keep it consistent with other
languages and make TOKENIZER_EXCEPTIONS importable from there.
2017-02-10 13:34:09 +01:00
ines
21f09d10d7 Revert "Revert "Merge pull request #818 from raphael0202/tokenizer_exceptions""
This reverts commit f02a2f9322.
2017-02-10 13:17:05 +01:00
ines
f02a2f9322 Revert "Merge pull request #818 from raphael0202/tokenizer_exceptions"
This reverts commit b95afdf39c, reversing
changes made to b0ccf32378.
2017-02-09 17:07:21 +01:00
Raphaël Bournhonesque
309da78bf0 Merge branch 'master' into tokenizer_exceptions 2017-02-09 16:32:12 +01:00
Raphaël Bournhonesque
4ce0bbc6b6 Update unit tests 2017-02-09 16:30:43 +01:00
Raphaël Bournhonesque
5d706ab95d Merge tokenizer exceptions from PR #802 2017-02-09 16:30:28 +01:00
ines
654fe447b1 Add Swedish tokenizer tests (see #807) 2017-02-05 11:47:07 +01:00
ines
6715615d55 Add missing EXC variable and combine tokenizer exceptions 2017-02-05 11:42:52 +01:00
Ines Montani
30a52d576b Merge pull request #807 from magnusburton/master
Added swedish lemma rules and more verb contractions
2017-02-05 11:34:19 +01:00
Magnus Burton
19c0ce745a Added swedish lemma rules 2017-02-04 17:53:32 +01:00
Michael Wallin
d25556bf80 [issue 805] Fix issue 2017-02-04 16:22:21 +02:00
Michael Wallin
35100c8bdd [issue 805] Add regression test and the required fixture 2017-02-04 16:21:34 +02:00
ines
0ab353b0ca Add line breaks to Finnish stop words for better readability 2017-02-04 13:40:25 +01:00
Michael Wallin
1a1952afa5 [finnish] Add initial tests for tokenizer 2017-02-04 13:54:10 +02:00
Michael Wallin
f9bb25d1cf [finnish] Reformat and correct stop words 2017-02-04 13:54:10 +02:00
Michael Wallin
73f66ec570 Add preliminary support for Finnish 2017-02-04 13:54:10 +02:00
Ines Montani
65d6202107 Merge pull request #802 from Tpt/fr-tokenizer
Adds more French tokenizer exceptions
2017-02-03 10:52:20 +01:00
Tpt
75a74857bb Adds more French tokenizer exceptions 2017-02-03 13:45:18 +04:00
Ines Montani
afc6365388 Update regression test for #801 to match current expected behaviour 2017-02-02 16:23:05 +01:00
Ines Montani
012f4820cb Keep infixes of punctuation + hyphens as one token (see #801) 2017-02-02 16:22:40 +01:00
Ines Montani
1219a5f513 Add = to tokenizer prefixes 2017-02-02 16:21:11 +01:00
Ines Montani
ff04748eb6 Add missing emoticon 2017-02-02 16:21:00 +01:00
Ines Montani
13a4ab37e0 Add regression test for #801 2017-02-02 15:33:52 +01:00
Raphaël Bournhonesque
85f951ca99 Add tokenizer exceptions for French 2017-02-02 08:36:16 +01:00
Matvey Ezhov
32a22291bc Small Doc.count_by documentation update
Current example doesn't work
2017-01-31 19:18:45 +03:00
Ines Montani
e4875834fe Fix formatting 2017-01-31 15:19:33 +01:00
Ines Montani
c304834e45 Add missing import 2017-01-31 15:18:30 +01:00
Ines Montani
e6465b9ca3 Parametrize test cases and mark as xfail 2017-01-31 15:14:42 +01:00
latkins
e4c84321a5 Added regression test for Issue #792. 2017-01-31 13:47:42 +00:00
Matthew Honnibal
6c665b81df Fix redundant == TAG in from_array conditional 2017-01-31 00:46:21 +11:00
Ines Montani
19501f3340 Add regression test for #775 2017-01-25 13:16:52 +01:00
Ines Montani
209c37bbcf Exclude "shell" and "Shell" from English tokenizer exceptions (resolves #775) 2017-01-25 13:15:02 +01:00
Raphaël Bournhonesque
1be9c0e724 Add fr tokenization unit tests 2017-01-24 10:57:37 +01:00
Raphaël Bournhonesque
1faaf698ca Add infixes and abbreviation exceptions (fr) 2017-01-24 10:57:37 +01:00
Raphaël Bournhonesque
cf8474401b Remove unused import statement 2017-01-24 10:57:37 +01:00
Raphaël Bournhonesque
902f136f18 Add support for elision in French 2017-01-24 10:57:37 +01:00
Ines Montani
55c9c62abc Use relative import 2017-01-23 21:27:49 +01:00
Ines Montani
0967eb07be Add regression test for #768 2017-01-23 21:25:46 +01:00
Ines Montani
6baa98f774 Merge pull request #769 from raphael0202/spacy-768
Allow zero-width 'infix' token
2017-01-23 21:24:33 +01:00
Raphaël Bournhonesque
dce8f5515e Allow zero-width 'infix' token 2017-01-23 18:28:01 +01:00
Ines Montani
5f6f48e734 Add regression test for #759 2017-01-20 15:11:48 +01:00
Ines Montani
09ecc39b4e Fix multi-line string of NUM_WORDS (resolves #759) 2017-01-20 15:11:48 +01:00
Magnus Burton
69eab727d7 Added loops to handle contractions with verbs 2017-01-19 14:08:52 +01:00
Matthew Honnibal
be26085277 Fix missing import
Closes #755
2017-01-19 22:03:52 +11:00
Ines Montani
7e36568d5b Fix title to accommodate sputnik 2017-01-17 00:51:09 +01:00
Ines Montani
d704cfa60d Fix typo 2017-01-16 21:30:33 +01:00
Ines Montani
64e142f460 Update about.py 2017-01-16 14:23:08 +01:00
Matthew Honnibal
e889cd698e Increment version 2017-01-16 14:01:35 +01:00
Matthew Honnibal
e7f8e13cf3 Make Token hashable. Fixes #743 2017-01-16 13:27:57 +01:00
Matthew Honnibal
2c60d0cb1e Test #743: Tokens unhashable. 2017-01-16 13:27:26 +01:00
Matthew Honnibal
48c712f1c1 Merge branch 'master' of ssh://github.com/explosion/spaCy 2017-01-16 13:18:06 +01:00
Matthew Honnibal
7ccf490c73 Increment version 2017-01-16 13:17:58 +01:00
Ines Montani
50878ef598 Exclude "were" and "Were" from tokenizer exceptions and add regression test (resolves #744) 2017-01-16 13:10:38 +01:00
Ines Montani
e053c7693b Fix formatting 2017-01-16 13:09:52 +01:00
Ines Montani
116c675c3c Merge pull request #742 from oroszgy/hu_tokenizer_fix
Improved Hungarian tokenizer
2017-01-14 23:52:44 +01:00
Gyorgy Orosz
92345b6a41 Further numeric test. 2017-01-14 22:44:19 +01:00
Gyorgy Orosz
b4df202bfa Better error handling 2017-01-14 22:24:58 +01:00
Gyorgy Orosz
b03a46792c Better error handling 2017-01-14 22:09:29 +01:00
Gyorgy Orosz
a45f22913f Added further abbreviations present in the Szeged corpus 2017-01-14 22:08:55 +01:00
Ines Montani
332ce2d758 Update README.md 2017-01-14 21:12:11 +01:00
Gyorgy Orosz
9505c6a72b Passing all old tests. 2017-01-14 20:39:21 +01:00
Gyorgy Orosz
63037e79af Fixed hyphen handling in the Hungarian tokenizer. 2017-01-14 16:30:11 +01:00
Gyorgy Orosz
f77c0284d6 Maintaining compatibility with other spacy tokenizers. 2017-01-14 16:19:15 +01:00
Gyorgy Orosz
be7a7aeb1a Reversed accidental changes. 2017-01-14 15:59:36 +01:00
Gyorgy Orosz
1be5da1ac6 Fixed Hungarian tokenizer for numbers 2017-01-14 15:51:59 +01:00
Ines Montani
a89e269a5a Fix test formatting and consistency 2017-01-14 13:41:19 +01:00
Ines Montani
3424e3a7e5 Update README.md 2017-01-13 15:54:54 +01:00
Ines Montani
49186b34a1 Mark lemmatizer tests as models since they use installed data 2017-01-13 15:12:07 +01:00
Ines Montani
138deb80a1 Modernise vector tests, use add_vecs_to_vocab and don't depend on models 2017-01-13 15:12:07 +01:00
Ines Montani
96f0caa28a Fix test name for consistency 2017-01-13 15:12:07 +01:00
Ines Montani
dc2bb1259f Add util function to add vectors to vocab 2017-01-13 15:12:07 +01:00
Ines Montani
db9b25663d Reformat add_docs_equal and add docstring 2017-01-13 15:12:07 +01:00
Ines Montani
62ce0a0073 Add README.md to tests to explain organisation and conventions 2017-01-13 15:11:18 +01:00
Ines Montani
38d60f6b90 Modernise serializer I/O tests and don't depend on models where possible 2017-01-13 02:24:56 +01:00
Ines Montani
4bb5b89ee4 Add text_file_b fixture using BytesIO 2017-01-13 02:23:50 +01:00
Ines Montani
49febd8c62 Modernise noun chunks tests and don't depend on models 2017-01-13 02:01:00 +01:00
Ines Montani
3ee97b5686 Rename test_parser to test_noun_chunks 2017-01-13 01:36:33 +01:00
Ines Montani
a308703f47 Remove old tests 2017-01-13 01:34:48 +01:00
Ines Montani
12eb8edf26 Move parser tests from unit to parser 2017-01-13 01:34:38 +01:00
Ines Montani
138c53ff2e Merge tokenizer tests 2017-01-13 01:34:14 +01:00
Ines Montani
01f36ca3ff Move attrs tests from unit to root and modernise 2017-01-13 01:33:50 +01:00
Ines Montani
3610d27967 Move alignment tests from munge to gold and modernise 2017-01-13 01:33:31 +01:00
Ines Montani
094ff7396a Reformat and rename Pragmatic Segmenter tests and mark xfails 2017-01-13 01:30:20 +01:00
Ines Montani
affcf1b19d Modernise lemmatizer tests 2017-01-12 23:41:17 +01:00
Ines Montani
33d9cf87f9 Modernise tagger tests and fix xpassing test 2017-01-12 23:40:52 +01:00
Ines Montani
33e5f8dc2e Create basic and extended test set for URLs 2017-01-12 23:40:02 +01:00
Ines Montani
5e4f5ebfc8 Modernise BILUO tests 2017-01-12 23:39:18 +01:00
Ines Montani
09acfbca01 Add Lemmatizer fixture 2017-01-12 23:38:55 +01:00
Ines Montani
514bfa2597 Add path fixture for spaCy data path 2017-01-12 23:38:47 +01:00
Ines Montani
0894b8c0ef Don't split tokens with digits and "/" infixes (resolves #740) 2017-01-12 22:58:26 +01:00
Ines Montani
e9e99a5670 Add regression test for #740 2017-01-12 22:57:38 +01:00
Ines Montani
6935d55409 Fix formatting 2017-01-12 22:56:20 +01:00
Ines Montani
5f0d196a31 Modernise and merge matcher tests 2017-01-12 22:23:11 +01:00
Ines Montani
d5d774413a Update comments on EN and DE fixtures 2017-01-12 22:03:07 +01:00
Ines Montani
9b4bea1df9 Tidy up and rename regression tests and remove unnecessary imports 2017-01-12 22:00:37 +01:00
Ines Montani
5e1b6178e3 Fix formatting and consistency 2017-01-12 22:00:06 +01:00
Ines Montani
a3fd32455e Remove redundant language loading integration tests 2017-01-12 21:59:48 +01:00
Ines Montani
61f1ca09c2 Modernise serializer codecs tests 2017-01-12 21:58:55 +01:00
Ines Montani
5dbc6e59f6 Modernise Huffman tests 2017-01-12 21:58:40 +01:00
Ines Montani
edeeeccea5 Modernise packer tests and don't depend on models where possible 2017-01-12 21:58:07 +01:00
Ines Montani
d084676cd0 Modernise and merge serialization tests 2017-01-12 21:57:19 +01:00
Ines Montani
442237787c Add assert_docs_equal util to compare two docs 2017-01-12 21:56:52 +01:00
Ines Montani
eac3f700fb Add fixture for entity recognizer 2017-01-12 21:56:32 +01:00
Ines Montani
b438cfddbc Modernise matcher tests and split into two files 2017-01-12 17:51:46 +01:00
Ines Montani
27482ebed8 Move matcher tests for #188 and #242 to regression tests
Modernise tests and remove unnecessary imports
2017-01-12 17:33:57 +01:00
Ines Montani
0a4dc632bd Update test to not create redundant Doc object 2017-01-12 17:33:18 +01:00
Ines Montani
a2526e66d8 Fix formatting, naming and unicode declaration 2017-01-12 16:51:13 +01:00
Ines Montani
052cdff07d Modernise vector similarity tests 2017-01-12 16:51:13 +01:00
Ines Montani
bd20ec0a6a Add get_cosine util function 2017-01-12 16:51:13 +01:00
Ines Montani
51ef75f629 Fix regression test for #615 and remove unnecessary imports 2017-01-12 16:51:12 +01:00
Ines Montani
aeb747e10c Adjust formatting 2017-01-12 16:51:12 +01:00
Ines Montani
8e3e58a7e6 Modernise and merge lexeme vocab tests 2017-01-12 16:51:12 +01:00
Ines Montani
c3d4516fc2 Move test for #361 to regression tests 2017-01-12 16:51:12 +01:00
Daniel Hershcovich
99eb494a82 Fix #737: support loading word vectors with " " as a word 2017-01-12 17:00:14 +02:00
Ines Montani
7cb3d74426 Modernise span tests and don't depend on models 2017-01-12 15:30:49 +01:00
Ines Montani
92e3d8b3ee Modernise vocab API tests and remove old xfailing tests 2017-01-12 15:27:46 +01:00
Ines Montani
7ea87684cd Rename test_vocab.py to test_vocab_api.py 2017-01-12 15:12:21 +01:00
Ines Montani
0da2ee5c68 Merge flag features tests into orth tests in tests root 2017-01-12 15:12:00 +01:00
Ines Montani
03c136cfd3 Remove StringStore tests from vocab tests 2017-01-12 15:11:15 +01:00
Ines Montani
d7bd57abdf Modernise add vectors vocab test 2017-01-12 15:09:49 +01:00
Ines Montani
89525ef345 Use consistent test names 2017-01-12 15:09:21 +01:00
Ines Montani
f8803808ce Remove old unused tests and conftest files 2017-01-12 15:09:05 +01:00
Ines Montani
4d0bfebcd9 Move Pragmatic Segmenter test cases (currently unused) to parser tests 2017-01-12 15:08:02 +01:00
Ines Montani
26d018d874 Add tests for StringStore 2017-01-12 15:07:31 +01:00
Ines Montani
9b6784bab5 Add fixture for StringStore 2017-01-12 15:05:40 +01:00
Ines Montani
99d66d613a Modernise tests for merging spans and don't depend on models 2017-01-12 12:26:26 +01:00
Ines Montani
fa8f67596d Remove unused old test 2017-01-12 12:26:08 +01:00
Ines Montani
359f73a96b Move test for #54 to regression tests 2017-01-12 12:25:51 +01:00
Ines Montani
3f3a46722c Remove unused conftest 2017-01-12 12:25:24 +01:00
Ines Montani
c2406e92bc Allow setting ents in get_doc 2017-01-12 12:25:10 +01:00
Ines Montani
c5914c6fe5 Fix and pass regression test for #736 2017-01-12 11:48:56 +01:00
Matthew Honnibal
4e48862fa8 Remove print statement 2017-01-12 11:25:39 +01:00
Matthew Honnibal
d1d8214767 Increment version 2017-01-12 11:21:57 +01:00
Matthew Honnibal
fba67fa342 Fix Issue #736: Times were being tokenized with incorrect string values. 2017-01-12 11:21:01 +01:00
Ines Montani
a6790b6694 Rename tags to pos in get_doc and allow adding tags to tokens 2017-01-12 11:18:36 +01:00
Ines Montani
1add8ace67 Merge lemmatizer tests 2017-01-12 11:16:53 +01:00
Ines Montani
3bc082abdf Modernise morph exceptions test and don't depend on models 2017-01-12 11:14:29 +01:00
Ines Montani
ec7739b76e Add regression test for #736 2017-01-12 11:12:44 +01:00
Ines Montani
6c1c564891 Move language-specific tests out of redundant tokenizer directories 2017-01-12 02:17:18 +01:00
Ines Montani
8fecedac3a Tidy up 2017-01-12 02:16:37 +01:00
Ines Montani
ae7edd30e7 Move text file back to tokenizer tests directory 2017-01-12 02:10:23 +01:00
Ines Montani
ffcaba9017 Remove old and/or redundant tests 2017-01-12 02:10:18 +01:00
Ines Montani
19c4132097 Modernise space attachment parser tests and don't depend on models 2017-01-12 01:54:44 +01:00
Ines Montani
69778924c8 Modernise and merge parser tests and don't depend on models 2017-01-12 01:07:29 +01:00
Ines Montani
178c147612 Modernise nonprojectivity tests and don't depend on models 2017-01-12 01:06:36 +01:00
Ines Montani
1a3984742c Modernise sentence boundary detection tests and don't depend on models (where possible) 2017-01-11 23:53:08 +01:00
Ines Montani
0cdb6ea61d Remove old unused pickle test 2017-01-11 23:52:28 +01:00
Ines Montani
c9671329dc Move test for #309 to regression tests 2017-01-11 23:52:13 +01:00
Ines Montani
d0e37b5670 Modernise parser tests and don't depend on models 2017-01-11 21:30:27 +01:00
Ines Montani
342cb41782 Add apply_transition_sequence util function to utils 2017-01-11 21:30:14 +01:00
Ines Montani
09807addff Add en_parser fixture 2017-01-11 21:29:59 +01:00
Ines Montani
55d151aa61 Modernise Doc parse tree navigation tests and don't depend on models 2017-01-11 21:14:15 +01:00
Ines Montani
7262421bb2 Use consistent test names 2017-01-11 19:00:52 +01:00
Ines Montani
33800c9367 Rename "tokens" tests to "doc" 2017-01-11 18:59:01 +01:00
Ines Montani
3a9c6a9563 Remove old unused files 2017-01-11 18:58:38 +01:00
Ines Montani
8e962de39f Remove old word vector tests 2017-01-11 18:55:08 +01:00
Ines Montani
e027936920 Modernise Doc noun chunks tests 2017-01-11 18:54:56 +01:00
Ines Montani
439f396acd Modernise Doc array tests and don't depend on models 2017-01-11 18:54:46 +01:00
Ines Montani
05447be884 Modernise test for adding entities 2017-01-11 18:54:24 +01:00
Ines Montani
6e883f4c00 Modernise Doc API tests and don't depend on models 2017-01-11 18:05:36 +01:00
Ines Montani
8bf3bb5c44 Make words optional for get_doc 2017-01-11 18:05:10 +01:00
Ines Montani
928db7e419 Fix StringIO import for Python 3 2017-01-11 14:07:48 +01:00
Ines Montani
69998f216b Rename test_tokens_api.py to test_doc_api.py 2017-01-11 13:58:56 +01:00
Ines Montani
d94dea1b18 Merge token tests into token API tests 2017-01-11 13:57:02 +01:00
Ines Montani
eb23424ab0 Modernise token API tests and don't depend on loading models 2017-01-11 13:56:54 +01:00
Ines Montani
c682b8ca90 Merge conftests into one cohesive file 2017-01-11 13:56:32 +01:00
Ines Montani
909f24d7df Add test utils and get_doc helper function
Create Doc object from given vocab, words and annotations to allow
tests not to depend on loading the models.
2017-01-11 13:55:33 +01:00
Matthew Honnibal
e12c90e03f Merge branch 'master' of ssh://github.com/explosion/spaCy 2017-01-11 13:03:51 +01:00
Matthew Honnibal
12cd27b821 Amend 8ae8b443f: Handle comparison with None tokens. 2017-01-11 13:03:32 +01:00
Daniel Hershcovich
8e603cc917 Avoid "True if ... else False" 2017-01-11 11:18:22 +02:00
Matthew Honnibal
44e2b0100d Support TAG attribute in doc.from_array 2017-01-10 22:47:07 +01:00
Ines Montani
3e6e1f0251 Tidy up regression tests 2017-01-10 19:24:10 +01:00
Magnus Burton
aad23ab0b4 Supplemented with capitalized Swedish exceptions 2017-01-10 16:07:20 +01:00
Ines Montani
869963c3c4 Mark extensive prefix/suffix tests as slow 2017-01-10 15:57:35 +01:00
Ines Montani
487e020ebe Add simple test for surrounding brackets 2017-01-10 15:57:26 +01:00
Ines Montani
0ba5cf51d2 Assert length first 2017-01-10 15:57:00 +01:00
Ines Montani
2185d31907 Adjust names and formatting 2017-01-10 15:56:35 +01:00
Ines Montani
e10d4ca964 Remove semi-redundant URLs and punctuation for faster testing 2017-01-10 15:54:25 +01:00
Ines Montani
3a3cb2c90c Add unicode declaration 2017-01-10 15:53:15 +01:00
Matthew Honnibal
0f9b8a00a5 Unbreak data download 2017-01-09 23:40:26 +01:00
Matthew Honnibal
8ae8b443f1 Add richcmp method to Token. Closes #631 2017-01-09 19:30:31 +01:00
Matthew Honnibal
64f747cb65 Token comparison test 2017-01-09 19:12:00 +01:00
Matthew Honnibal
18c3c2d05c Add tests for token comparison, re Issue #631 2017-01-09 19:09:59 +01:00
Matthew Honnibal
97a1286129 Revert changes to tagger and parser for thinc 6 2017-01-09 10:08:34 -06:00
Matthew Honnibal
95a52005df Revert "Fix Issue #683: Add 'SP' to tag_map, if it's not there already, within the Morphology class."
This reverts commit 40e71586d6.
2017-01-09 09:55:55 -06:00
Ines Montani
363f09e68c Merge pull request #726 from magnusburton/master
Added Swedish abbreviations as token exceptions
2017-01-09 14:58:15 +01:00
Matthew Honnibal
42cd598f57 Use correct fixtures in URL tokenizer 2017-01-09 14:10:40 +01:00
Matthew Honnibal
d9a77ddf14 Return None for data path if it doesn't exist 2017-01-09 14:10:05 +01:00
Matthew Honnibal
e4862d1dab Merge branch 'develop' 2017-01-09 13:36:01 +01:00
Ines Montani
aa876884f0 Revert "Revert "Merge remote-tracking branch 'origin/master'""
This reverts commit fb9d3bb022.
2017-01-09 13:28:13 +01:00
Ines Montani
d5c72c40eb Remove old tests for old website example code 2017-01-08 22:28:53 +01:00
Ines Montani
eef94e3ee2 Split off period after two or more uppercase letters (fixes #483) 2017-01-08 22:28:25 +01:00
Ines Montani
a89a6000e5 Remove unused import 2017-01-08 22:17:37 +01:00
Ines Montani
5d28664fc5 Don't test Hungarian for numbers and hyphens for now
Reinvestigate behaviour of case affixes given reorganised tokenizer
patterns.
2017-01-08 20:45:40 +01:00
Ines Montani
53362b6b93 Reorganise Hungarian prefixes/suffixes/infixes
Use global prefixes and suffixes for non-language-specific rules,
import list of alpha unicode characters and adjust regexes.
2017-01-08 20:40:33 +01:00
Ines Montani
347c4a2d06 Reorganise and reformat global tokenizer prefixes, suffixes and infixes 2017-01-08 20:37:39 +01:00
Ines Montani
0dec90e9f7 Use global abbreviation data languages and remove duplicates 2017-01-08 20:36:00 +01:00
Ines Montani
7c3cb2a652 Add global abbreviations data 2017-01-08 20:34:03 +01:00
Ines Montani
de5aa92bc2 Handle deprecated tokenizer prefix data 2017-01-08 20:33:28 +01:00
Ines Montani
abb09782f9 Move sun.txt to original location and fix path to not break parser tests 2017-01-08 20:32:54 +01:00
Ines Montani
cab39c59c5 Add missing contractions to English tokenizer exceptions
Inspired by
https://github.com/kootenpv/contractions/blob/master/contractions/__init
__.py
2017-01-05 19:59:06 +01:00
Ines Montani
a23504fe07 Move abbreviations below other exceptions 2017-01-05 19:58:07 +01:00
Ines Montani
7d2cf934b9 Generate he/she/it correctly with 's instead of 've 2017-01-05 19:57:00 +01:00
Ines Montani
8328925e1f Add newlines to long German text 2017-01-05 18:13:30 +01:00
Ines Montani
55b46d7cf6 Add tokenizer tests for German 2017-01-05 18:11:25 +01:00
Ines Montani
5bb4081f52 Remove redundant test_tokenizer.py for English 2017-01-05 18:11:11 +01:00
Ines Montani
8216ba599b Add tests for longer and mixed English texts 2017-01-05 18:11:04 +01:00
Ines Montani
65f937d5c6 Move basic contraction tests to test_contractions.py 2017-01-05 18:09:53 +01:00
Ines Montani
bbe7cab3a1 Move non-English-specific tests back to general tokenizer tests 2017-01-05 18:09:29 +01:00
Ines Montani
038002d616 Reformat HU tokenizer tests and adapt to general style
Improve readability of test cases and add conftest.py with fixture
2017-01-05 18:06:44 +01:00
Ines Montani
bc911322b3 Move ") to emoticons (see Tweebo challenge test) 2017-01-05 18:05:38 +01:00
Ines Montani
637f785036 Add general sanity tests for all tokenizers 2017-01-05 16:25:38 +01:00
Ines Montani
c5f2dc15de Move English tokenizer tests to directory /en 2017-01-05 16:25:04 +01:00
Ines Montani
8b45363b4d Modernize and merge general tokenizer tests 2017-01-05 13:17:05 +01:00
Ines Montani
02cfda48c9 Modernize and merge tokenizer tests for string loading 2017-01-05 13:16:55 +01:00
Ines Montani
a11f684822 Modernize and merge tokenizer tests for whitespace 2017-01-05 13:16:33 +01:00
Ines Montani
8b284fc6f1 Modernize and merge tokenizer tests for text from file 2017-01-05 13:15:52 +01:00
Ines Montani
2c2e878653 Modernize and merge tokenizer tests for punctuation 2017-01-05 13:14:16 +01:00
Ines Montani
8a74129cdf Modernize and merge tokenizer tests for prefixes/suffixes/infixes 2017-01-05 13:13:12 +01:00
Ines Montani
0e65dca9a5 Modernize and merge tokenizer tests for exception and emoticons 2017-01-05 13:11:31 +01:00
Ines Montani
34c47bb20d Fix formatting 2017-01-05 13:10:51 +01:00
Ines Montani
2e72683baa Add missing docstrings 2017-01-05 13:10:21 +01:00
Ines Montani
da10a049a6 Add unicode declarations 2017-01-05 13:09:48 +01:00
Ines Montani
58adae8774 Remove unused file 2017-01-05 13:09:22 +01:00
Ines Montani
c6e5a5349d Move regression test for #360 into own file 2017-01-04 00:49:31 +01:00
Ines Montani
8279993a6f Modernize and merge tokenizer tests for punctuation 2017-01-04 00:49:20 +01:00
Ines Montani
550630df73 Update tokenizer tests for contractions 2017-01-04 00:48:42 +01:00
Ines Montani
109f202e8f Update conftest fixture 2017-01-04 00:48:21 +01:00
Ines Montani
ee6b49b293 Modernize tokenizer tests for emoticons 2017-01-04 00:47:59 +01:00
Ines Montani
f09b5a5dfd Modernize tokenizer tests for infixes 2017-01-04 00:47:42 +01:00
Ines Montani
59059fed27 Move regression test for #351 to own file 2017-01-04 00:47:11 +01:00
Ines Montani
667051375d Modernize tokenizer tests for whitespace 2017-01-04 00:46:35 +01:00
Ines Montani
aafc894285 Modernize tokenizer tests for contractions
Use @pytest.mark.parametrize.
2017-01-03 23:02:21 +01:00
Ines Montani
1d237664af Add lowercase lemma to tokenizer exceptions 2017-01-03 23:02:21 +01:00
Ines Montani
84a87951eb Fix typos 2017-01-03 18:27:43 +01:00
Ines Montani
35b39f53c3 Reorganise English tokenizer exceptions (as discussed in #718)
Add logic to generate exceptions that follow a consistent pattern (like
verbs and pronouns) and allow certain tokens to be excluded explicitly.
2017-01-03 18:26:09 +01:00
Ines Montani
fb9d3bb022 Revert "Merge remote-tracking branch 'origin/master'"
This reverts commit d3b181cdf1, reversing
changes made to b19cfcc144.
2017-01-03 18:21:36 +01:00
Ines Montani
461cbb99d8 Revert "Reorganise English tokenizer exceptions (as discussed in #718)"
This reverts commit b19cfcc144.
2017-01-03 18:21:29 +01:00
Ines Montani
d3b181cdf1 Merge remote-tracking branch 'origin/master'
# Conflicts:
#	spacy/en/tokenizer_exceptions.py
2017-01-03 18:20:01 +01:00
Ines Montani
b19cfcc144 Reorganise English tokenizer exceptions (as discussed in #718)
Add logic to generate exceptions that follow a consistent pattern (like
verbs and pronouns) and allow certain tokens to be excluded explicitly.
2017-01-03 18:17:57 +01:00
Ines Montani
1bd53bbf89 Fix typos (resolves #718) 2017-01-03 11:26:21 +01:00
Matthew Honnibal
fde53be3b4 Move whole token mach inside _split_affixes. 2016-12-30 17:11:50 -06:00
Matthew Honnibal
3ba7c167a8 Fix URL tests 2016-12-30 17:10:08 -06:00
Matthew Honnibal
9936a1b9b5 Merge branch 'tokenization_w_exception_patterns' of https://github.com/oroszgy/spaCy.hu into oroszgy-tokenization_w_exception_patterns 2016-12-30 14:53:40 -06:00
Magnus Burton
56e2219b65 Added Swedish city abbreviations 2016-12-30 21:17:34 +01:00
Magnus Burton
e935c950d8 Added months and days as abbreviations for Swedish 2016-12-30 21:08:44 +01:00
kengz
73a38bd4d1 Merge remote-tracking branch 'upstream/master' 2016-12-30 12:19:59 -05:00
kengz
da44183ae1 move parse_tree logic to a new tokens/printers.py file 2016-12-30 12:19:18 -05:00
Matthew Honnibal
3e8d9c772e Test interaction of token_match and punctuation
Check that the new token_match function applies after punctuation is split off.
2016-12-31 00:52:17 +11:00
Matthew Honnibal
74b921f394 Merge branch 'master' of ssh://github.com/explosion/spaCy into develop 2016-12-30 14:38:27 +01:00
Matthew Honnibal
623d94e14f Whitespace 2016-12-31 00:30:28 +11:00
Matthew Honnibal
af81ac8bb0 Use thinc 6.0 2016-12-29 11:58:42 +01:00
Petter Hohle
f112e7754e Add PART to tag map
16 of the 17 PoS tags in the UD tag set is added; PART is missing.
2016-12-28 18:39:01 +01:00
Matthew Honnibal
f62db78dc3 Increment version 2016-12-27 21:11:22 +01:00
Matthew Honnibal
cade536d1e Merge branch 'master' of ssh://github.com/explosion/spaCy 2016-12-27 21:04:10 +01:00
Matthew Honnibal
ce4539dafd Allow the vocabulary to grow to 10,000, to prevent cold-start problem. 2016-12-27 21:03:45 +01:00
Ines Montani
ad3669cef5 Merge pull request #703 from magnusburton/master
Added Swedish abbreviations
2016-12-27 01:01:49 +01:00
Ines Montani
78f754dd9a Merge pull request #705 from oroszgy/hu_tokenizer
Initial support for Hungarian
2016-12-27 00:48:13 +01:00
Ines Montani
8785706039 Reformat stop words for better readability 2016-12-24 00:58:40 +01:00
Gyorgy Orosz
45e045a87b Unicode/UTF8 compatibility for Python2 2016-12-24 00:21:00 +01:00
Gyorgy Orosz
72b61b6d03 Typo fix. 2016-12-24 00:10:29 +01:00
Gyorgy Orosz
3a9be4d485 Updated token exception handling mechanism to allow the usage of arbitrary functions as token exception matchers. 2016-12-23 23:49:34 +01:00
Ines Montani
1436b9f15a Fix formatting and consistency 2016-12-23 21:36:01 +01:00
Ines Montani
1d64527727 Update Spanish tokenizer
Remove reflexive pronouns as they're part of an open class, fix
mistakes and add exceptions
2016-12-23 21:36:01 +01:00
Ines Montani
7f411fd01c Remove exceptions containing whitespace / no special chars 2016-12-23 14:30:06 +01:00
Magnus Burton
fdf4776262 Added Swedish abbreviations 2016-12-22 22:45:18 +01:00
Gyorgy Orosz
d9c59c4751 Maintaining backward compatibility. 2016-12-21 23:30:49 +01:00
Gyorgy Orosz
1748549aeb Added exception pattern mechanism to the tokenizer. 2016-12-21 23:16:19 +01:00
Gyorgy Orosz
35aa54765d Hungarian module is exposed in spacy. 2016-12-21 20:45:36 +01:00
Gyorgy Orosz
ab2f6ea46c Removed data files from tests.. 2016-12-21 20:22:09 +01:00
Ines Montani
3c87c71d43 Add tokenizer exceptions for a.m. and p.m. in Spanish 2016-12-21 18:19:10 +01:00
Ines Montani
78e63dc7d0 Update tokenizer exceptions for English 2016-12-21 18:06:34 +01:00
Ines Montani
702d1eed93 Update tokenizer exceptions for German 2016-12-21 18:06:27 +01:00
Ines Montani
d60380418e Update tokenizer exceptions for Spanish 2016-12-21 18:06:17 +01:00
Ines Montani
920fa0fed2 Add DET_LEMMA constant 2016-12-21 18:05:41 +01:00
Ines Montani
8978806ea6 Allow Vocab to load without serializer_freqs 2016-12-21 18:05:23 +01:00
Ines Montani
be8ed811f6 Remove trailing whitespace 2016-12-21 18:04:41 +01:00
Ines Montani
926e19184a Merge pull request #695 from magnusburton/master
Added Swedish morph rules
2016-12-21 01:06:00 +01:00
Gyorgy Orosz
3d5306acb9 Added further testcases. 2016-12-20 23:49:35 +01:00
Gyorgy Orosz
23956e72ff Improved partial support for tokenzing Hungarian numbers 2016-12-20 23:36:59 +01:00
Gyorgy Orosz
6add156075 Refactored language data structure 2016-12-20 22:28:20 +01:00
Gyorgy Orosz
366b3f8685 Merge branch 'master' into hu_tokenizer 2016-12-20 20:53:31 +01:00
Gyorgy Orosz
c035928156 Partial Hungarian number tokenization is added. 2016-12-20 20:46:20 +01:00
JM
70ff0639b5 Fixed missing vec_path declaration that was failing if 'add_vectors' was set
Added vec_path variable declaration to avoid accessing it before assignment in case 'add_vectors' is in overrides.
2016-12-20 18:21:05 +01:00
Magnus Burton
48dcc9f647 Added morph rules 2016-12-20 13:18:41 +01:00
Magnus Burton
db5a077d2b Initial commit for Swedish 2016-12-20 11:05:06 +01:00
Matthew Honnibal
3f5747a9b2 Merge branch 'master' of ssh://github.com/explosion/spaCy 2016-12-18 23:44:22 +01:00
Matthew Honnibal
40e71586d6 Fix Issue #683: Add 'SP' to tag_map, if it's not there already, within the Morphology class. 2016-12-18 23:44:05 +01:00
Matthew Honnibal
fa1d23e10d Merge branch 'master' of https://github.com/explosion/spaCy 2016-12-18 23:32:03 +01:00
Matthew Honnibal
f38eb25fe1 Fix test for word vector 2016-12-18 23:31:55 +01:00
Matthew Honnibal
4e68abebc4 Merge branch 'master' of ssh://github.com/explosion/spaCy 2016-12-18 23:19:45 +01:00
Matthew Honnibal
5a6328a5a4 Increment version 2016-12-18 23:19:19 +01:00
Matthew Honnibal
13a0b31279 Another tweak to GloVe path hackery. 2016-12-18 23:12:49 +01:00
Matthew Honnibal
2c6228565e Fix vector loading re glove hack 2016-12-18 23:06:44 +01:00
Matthew Honnibal
618b50a064 Fix issue #684: GloVe vectors not loaded in spacy.en.English. 2016-12-18 22:46:31 +01:00
Matthew Honnibal
404019ad2f Fix issue #672: ent_iob_ was a string, not unicode, due to missing unicode_literals statement. 2016-12-18 22:33:53 +01:00
Matthew Honnibal
2ef9d53117 Untested fix for issue #684: GloVe vectors hack should be inserted in English, not in spacy.load. 2016-12-18 22:29:31 +01:00
Matthew Honnibal
c065359459 Fix path-override bug in spacy.load 2016-12-18 22:15:29 +01:00
Matthew Honnibal
813249f826 Work on morphology class. Still not fully consistent with rest of library. 2016-12-18 17:35:22 +01:00
Matthew Honnibal
3679fb43a3 Fix loading of lemmatizer 2016-12-18 17:34:09 +01:00
Matthew Honnibal
3980f1b0cb Ignore more morphology attributes in deprecated mode of intify_attrs 2016-12-18 17:33:46 +01:00
Matthew Honnibal
7a98ee5e5a Merge language data change 2016-12-18 17:03:52 +01:00
Matthew Honnibal
e4c951c153 Merge branch 'organize-language-data' of ssh://github.com/explosion/spaCy into organize-language-data 2016-12-18 17:01:08 +01:00
Ines Montani
b99d683a93 Fix formatting 2016-12-18 16:58:28 +01:00
Ines Montani
b11d8cd3db Merge remote-tracking branch 'origin/organize-language-data' into organize-language-data 2016-12-18 16:57:12 +01:00
Ines Montani
d1c1d3f9cd Fix tokenizer test 2016-12-18 16:55:32 +01:00
Ines Montani
753068f1d5 Use base language data as default 2016-12-18 16:55:25 +01:00
Ines Montani
bcc1d50d09 Remove trailing whitespace 2016-12-18 16:54:52 +01:00
Ines Montani
4e95737c6c Add base tag map 2016-12-18 16:54:28 +01:00
Ines Montani
2b2ea8ca11 Reorganise language data 2016-12-18 16:54:19 +01:00
Matthew Honnibal
1b31c05bf8 Whitespace 2016-12-18 16:51:40 +01:00
Matthew Honnibal
bdcecb3c96 Add import in regression test 2016-12-18 16:51:31 +01:00
Matthew Honnibal
6ee1df93c5 Set tag_map to None if it's not seen in the data by vocab 2016-12-18 16:51:10 +01:00
Matthew Honnibal
33996e770b Update header for morphology class 2016-12-18 16:50:42 +01:00
Matthew Honnibal
d58187ffa7 Filter out morphology keys in deprecated attrs 2016-12-18 16:50:26 +01:00
Matthew Honnibal
837a5d4100 Update morphology class so that exceptions can be added one-by-one, and so that arbitrary attributes can be referenced. 2016-12-18 16:49:46 +01:00
Matthew Honnibal
44f4f008bd Wire up lemmatizer rules for English 2016-12-18 15:50:09 +01:00
Matthew Honnibal
e6fc4afb04 Whitespace 2016-12-18 15:48:00 +01:00
Ines Montani
32b36c3882 Break language data components into their own files 2016-12-18 15:40:22 +01:00
Ines Montani
1bff59a8db Update English language data 2016-12-18 15:36:53 +01:00
Ines Montani
2eb163c5dd Add lemma rules 2016-12-18 15:36:53 +01:00
Ines Montani
29ad8143d8 Add morph rules 2016-12-18 15:36:53 +01:00
Ines Montani
bc40dad7d9 Add entity rules 2016-12-18 15:36:53 +01:00
Ines Montani
eaa3b1319d Fix formatting 2016-12-18 15:36:53 +01:00
Ines Montani
704c7442e0 Break language data components into their own files 2016-12-18 15:36:53 +01:00
Ines Montani
62655fd36f Add ENT_ID constant 2016-12-18 15:36:53 +01:00
Matthew Honnibal
fa272fdf12 Merge branch 'organize-language-data' of ssh://github.com/explosion/spaCy into organize-language-data 2016-12-18 15:00:21 +01:00
Matthew Honnibal
57c4341453 Refactor loading of morphology exceptions, adding a method add_special_case. 2016-12-18 14:59:44 +01:00
Ines Montani
77cf2fb0f6 Remove unnecessary argument in test 2016-12-18 14:06:27 +01:00
Ines Montani
121c310566 Remove trailing whitespace 2016-12-18 14:06:27 +01:00
Ines Montani
0fc4e45cb3 Fix tag map for German 2016-12-18 13:30:03 +01:00
Ines Montani
28326649f3 Fix typo 2016-12-18 13:30:03 +01:00
Matthew Honnibal
0595cc0635 Change test595 to mock data, instead of requiring model. 2016-12-18 13:28:51 +01:00
Matthew Honnibal
a4eb5c2bff Check POS key in lemmatizer, to update it for new data format 2016-12-18 13:28:20 +01:00
Matthew Honnibal
28d63ec58e Restore missing '' character in tokenizer exceptions. 2016-12-18 05:34:51 +01:00
Ines Montani
a9421652c9 Remove duplicates in tag map 2016-12-17 22:44:31 +01:00
Ines Montani
69baf1c9a8 Fix tag map 2016-12-17 22:44:22 +01:00
Ines Montani
577adad945 Fix formatting 2016-12-17 14:00:52 +01:00
Ines Montani
fc4ad17136 Fix typo 2016-12-17 14:00:47 +01:00
Ines Montani
bb94e784dc Fix typo 2016-12-17 13:59:30 +01:00
Ines Montani
afda532595 Use symbols in tag map 2016-12-17 13:56:24 +01:00
Ines Montani
07249145c9 Fix formatting 2016-12-17 13:34:46 +01:00
Ines Montani
dd55d085b6 Reformat dutch language data to match new style 2016-12-17 13:26:01 +01:00
Ines Montani
f2c48ef504 Resolve stopwords conflict to merge Dutch 2016-12-17 13:08:16 +01:00
Matthew Honnibal
ff03ade08f Merge pull request #688 from nlesc-sherlock/dutch
Support for Dutch in SpaCy
2016-12-17 22:44:58 +11:00
Ines Montani
a22322187f Add missing lemmas to tokenizer exceptions (fixes #674) 2016-12-17 12:42:41 +01:00
Ines Montani
5445074cbd Expand tokenizer exceptions with unicode apostrophe (fixes #685) 2016-12-17 12:34:08 +01:00
Ines Montani
e0a7b5c612 Fix formatting 2016-12-17 12:33:09 +01:00
Ines Montani
08162dce67 Move shared functions and constants to global language data 2016-12-17 12:32:48 +01:00
Ines Montani
6a60a61086 Move update_exc to global language data utils 2016-12-17 12:29:02 +01:00
Ines Montani
f324311249 Add global language data utils 2016-12-17 12:27:41 +01:00
Ines Montani
487ce1e20a Add encoding declaration 2016-12-17 12:25:44 +01:00
Ines Montani
d8d50a0334 Add tokenizer exception for "gonna" (fixes #691) 2016-12-17 11:59:28 +01:00
Ines Montani
c69b77d8aa Revert "Add exception for "gonna""
This reverts commit 280c03f67b.
2016-12-17 11:56:44 +01:00
Ines Montani
280c03f67b Add exception for "gonna" 2016-12-17 11:54:59 +01:00
Ines Montani
5031a015e2 Fix typo in stopwords (fixes #689) 2016-12-15 17:57:06 +01:00
Janneke van der Zwaan
4a3fdcce8a Merge github.com:explosion/spaCy into dutch 2016-12-13 09:25:23 +01:00
Matthew Honnibal
5965d3c2a7 Revert "Add acl to symbols.pyx" 2016-12-12 10:10:28 +11:00
Matthew Honnibal
6dee76dfed Update symbols.pxd 2016-12-12 10:09:58 +11:00
Pokey Rule
18a15c0777 Add acl to symbols.pyx 2016-12-11 20:00:07 +00:00
Gyorgy Orosz
0cf2144d24 Adding partial hyphen and quote handling support. 2016-12-11 00:14:36 +01:00
Gyorgy Orosz
2051726fd3 Passing Hungatian abbrev tests. 2016-12-10 23:37:58 +01:00
Ines Montani
63024466a9 Add Portuguese stopwords 2016-12-08 20:45:07 +01:00
Ines Montani
7bfe2d4abc Update Portuguese language data 2016-12-08 20:41:41 +01:00
Ines Montani
c0c5f31950 Remove unused data and download script 2016-12-08 20:39:49 +01:00
Ines Montani
0a6d529104 Remove unused data 2016-12-08 20:36:56 +01:00
Ines Montani
1b3b043660 Add French stopwords 2016-12-08 20:12:43 +01:00
Ines Montani
8863e504eb Update French language data 2016-12-08 20:07:14 +01:00
Ines Montani
7cb9f51be6 Add Italian stopwords 2016-12-08 20:05:25 +01:00
Ines Montani
470a0e0bea Update Italian language data 2016-12-08 19:52:18 +01:00
Ines Montani
1a284d342e Add Spanish language data 2016-12-08 19:47:03 +01:00
Ines Montani
0c39654786 Remove unused import 2016-12-08 19:46:53 +01:00
Ines Montani
e47ee94761 Split punctuation into its own file 2016-12-08 19:46:43 +01:00
Ines Montani
70b51ed7c8 Remove time from German language data 2016-12-08 19:45:50 +01:00
Ines Montani
e8ae588be9 Add emoticons 2016-12-08 19:45:18 +01:00
Ines Montani
5908c0ed9f Fix formatting 2016-12-08 19:45:11 +01:00
Ines Montani
311b30ab35 Reorganize exceptions for English and German 2016-12-08 13:58:32 +01:00
Ines Montani
66c7348cda Add update_exc util function 2016-12-08 13:58:12 +01:00
Ines Montani
1256232fad Fix formatting 2016-12-08 13:56:40 +01:00
Ines Montani
8e977cc71c Fix formatting 2016-12-08 13:56:17 +01:00
Ines Montani
0176b99004 Fix formatting 2016-12-08 12:48:02 +01:00
Ines Montani
877f09218b Add more custom rules for abbreviations 2016-12-08 12:47:01 +01:00
Gyorgy Orosz
0289b8ceaa Additional abbreviation tests. 2016-12-08 12:17:44 +01:00
Gyorgy Orosz
90d22db023 Added Hungarian resource files. 2016-12-08 12:06:36 +01:00
Ines Montani
bfaa42636c Update language data for German 2016-12-08 12:01:09 +01:00
Ines Montani
ec44bee321 Fix capitalization on morphological features 2016-12-08 12:00:54 +01:00
Gyorgy Orosz
5b00039955 First steps towards the Hungarian tokenizer code. 2016-12-07 23:07:43 +01:00
Ines Montani
ce979553df Resolve conflict 2016-12-07 21:16:52 +01:00
Ines Montani
8350d65695 Change morphology and lemmatizer API
Take morphology features as object instead of keyword arguments
2016-12-07 21:12:49 +01:00
Ines Montani
52e7d634df Remove trailing whitespace 2016-12-07 21:12:19 +01:00
Ines Montani
0d07d7fc80 Apply emoticon exceptions to tokenizer 2016-12-07 21:11:59 +01:00
Ines Montani
71f0f34cb3 Fix formatting 2016-12-07 21:11:29 +01:00
Ines Montani
9413bcd9ee Declare encoding and unicode literals 2016-12-07 21:10:34 +01:00
Ines Montani
a280ff2657 Fix __all__ 2016-12-07 21:10:12 +01:00
Ines Montani
ba8721953c Add missing emoticons 2016-12-07 21:09:44 +01:00
Ines Montani
1285c4ba93 Update English language data 2016-12-07 20:33:28 +01:00
Ines Montani
79dce0aabe Add emoticons 2016-12-07 20:33:28 +01:00
Ines Montani
a662a95294 Add line breaks 2016-12-07 20:33:28 +01:00
Ines Montani
07f0efb102 Add test for tokenizer regular expressions 2016-12-07 20:33:28 +01:00
Ines Montani
e0712d1b32 Reformat language data 2016-12-07 20:33:28 +01:00
Matthew Honnibal
0c0f4c965d Increment version 2016-12-03 11:16:52 +01:00
Matthew Honnibal
f6e356aada Add (and test) Span.sentiment attribute. By default we average token.span, but can override with custom hook. Re Issue #667 2016-12-02 11:05:50 +01:00
Janneke van der Zwaan
88869e0e07 Merge github.com:explosion/spaCy into dutch 2016-11-30 17:13:39 +01:00
Janneke van der Zwaan
51ade86b86 Update language data with tag map from UD_Dutch 2016-11-30 14:41:23 +01:00
Janneke van der Zwaan
90f6ff12c9 Update Dutch language data
- Use Dutch tag map
- remove tokenizer exceptions
2016-11-30 11:59:39 +01:00
dafnevk
7b8f4c49f2 Added language Dutch to init file 2016-11-29 16:42:05 +01:00
Matthew Honnibal
296d33a4fc Merge branch 'master' of ssh://github.com/explosion/spaCy 2016-11-26 12:36:18 +01:00
Matthew Honnibal
1f6c37c6f5 Fix create_tokenizer when nlp is None 2016-11-26 12:36:04 +01:00
Matthew Honnibal
c7889492f9 Fix model saving error for Python 3 2016-11-25 18:04:30 -06:00
Matthew Honnibal
bc0a202c9c Fix unicode problem in nonproj module 2016-11-25 17:29:17 -06:00
Matthew Honnibal
6dd3b94fa6 Filter out deprecated attributes when reading special-case tokenization rules. 2016-11-25 09:57:18 -06:00
Matthew Honnibal
e879c79b8c Merge branch 'master' of https://github.com/explosion/spaCy 2016-11-25 09:18:28 -06:00
Matthew Honnibal
a335c6dcc2 Exclude morphs from deprecated token attributes for now 2016-11-25 16:17:32 +01:00
Matthew Honnibal
f799a07f25 Merge branch 'master' of https://github.com/explosion/spaCy 2016-11-25 09:16:43 -06:00
Matthew Honnibal
159e8c46e1 Merge old training fixes with newer state 2016-11-25 09:16:36 -06:00
Matthew Honnibal
846e80f2f4 Exclude morphs from deprecated token attributes for now 2016-11-25 16:14:54 +01:00
Matthew Honnibal
664f2dd1c0 Allow dep to be None in scorer, for missing labels. 2016-11-25 09:02:49 -06:00
Matthew Honnibal
39341598bb Fix NER label calculation 2016-11-25 09:02:22 -06:00
Matthew Honnibal
ca773a1f53 Tweak arc_eager n_gold to deal with negative costs, and improve error message. 2016-11-25 09:01:52 -06:00
Matthew Honnibal
a2f55e7015 Pass cfg through loading, for training. 2016-11-25 09:01:20 -06:00
Matthew Honnibal
608d8f5421 Pass cfg through parser, and have is_valid default to 1, not 0 when resetting state 2016-11-25 09:00:21 -06:00
Matthew Honnibal
cc7e607a8a Fix gold.pyx for 1.0 2016-11-25 08:57:59 -06:00
root
080d29e092 Fix train.py for 1.0 2016-11-25 08:55:33 -06:00
Matthew Honnibal
6652f2a135 Test #656, #624: special case rules for tokenizer with attributes. 2016-11-25 12:44:13 +01:00
Matthew Honnibal
1e0f566d95 Fix #656, #624: Support arbitrary token attributes when adding special-case rules. 2016-11-25 12:43:24 +01:00
Matthew Honnibal
87613edf8f Add set_struct_attr staticmethod to token 2016-11-25 12:41:47 +01:00
Matthew Honnibal
fb69aa648f Merge branch 'master' of ssh://github.com/explosion/spaCy 2016-11-25 11:35:44 +01:00
Matthew Honnibal
9a03a3f85e Add get_struct_attr staticmethod to Token, to match Lexeme.get_struct_attr. 2016-11-25 11:35:17 +01:00
Matthew Honnibal
53d8ca8f51 Add spacy.attrs.intify_attrs function, to normalize strings in token attribute dictionaries. 2016-11-25 11:34:30 +01:00
Ines Montani
d21ad01840 Add emoticons 2016-11-24 19:13:00 +01:00
dafnevk
d8c7ac203a Added nl module for dutch 2016-11-24 16:39:49 +01:00
dafnevk
3db8b0d322 Added language class and some language data (with some TODOs) for Dutch 2016-11-24 15:56:38 +01:00
Ines Montani
4dcfafde02 Add line breaks 2016-11-24 14:57:37 +01:00
Ines Montani
6247c005a2 Add test for tokenizer regular expressions 2016-11-24 13:51:59 +01:00
Ines Montani
de747e39e7 Reformat language data 2016-11-24 13:51:32 +01:00
Matthew Honnibal
b8c4f5ea76 Allow German noun chunks to work on Span
Update the German noun chunks iterator, so that it also works on Span objects.
2016-11-24 23:30:15 +11:00
Pokey Rule
3e3bda142d Add noun_chunks to Span 2016-11-24 10:47:20 +00:00
Janneke van der Zwaan
83daade0e4 Add directory and initial (empty) files for language Dutch 2016-11-24 09:45:41 +01:00
Matthew Honnibal
09f68bc641 Fix Issue #639: stop words in language class not used. This patch is messy, but it's better not to change too much until the language data loading can be properly refactored. 2016-11-24 00:13:55 +01:00
Matthew Honnibal
48e1dc29d4 Fix default path loading. 2016-11-23 23:48:55 +01:00
Matthew Honnibal
e01c1875ee Work on test for #615 2016-11-23 23:48:41 +01:00
ExplodingCabbage
6c4f488e89 Fix syntax mistake 2016-11-23 15:12:45 +00:00
Matthew Honnibal
60eb2343ce Only try to load vectors if they exist. 2016-11-23 13:50:24 +01:00
Matthew Honnibal
618ac36093 Fix use of path argument in Language.__init__. Needs to be keyword arg, not positional. 2016-11-23 13:26:34 +01:00
Mark Amery
fbe19680a6 Fix another bug related to Language.__init__'s path parameter 2016-11-20 20:31:34 +00:00
Mark Amery
b0a07c21a0 Fix path param of Language.__init__ always being ignored
There was an explicitly-declared `path` keyword argument, so 'path'
would never be present in `**overrides`. This line just overwrote
any manually-specified value the user might've passed to the `path`
parameter.
2016-11-20 16:29:57 +00:00
Mark Amery
1988fce389 Merge remote-tracking branch 'origin/master' into specify-data-path 2016-11-20 16:07:14 +00:00
Mark Amery
3871007c72 Let --data-path be specified when running download.py scripts
Resolves https://github.com/explosion/spaCy/issues/637
2016-11-20 15:48:04 +00:00
Ines Montani
dad2c6cae9 Strip trailing whitespace 2016-11-20 16:45:51 +01:00
Ines Montani
3082e49326 Update and reformat German stopwords 2016-11-20 16:45:26 +01:00
Sourav Singh
6745eac309 Update language_data.py 2016-11-20 19:52:02 +05:30
Sourav Singh
4d9aae7d6a Add German Stopwords 2016-11-19 22:47:53 +05:30
Matthew Honnibal
7afb2544a7 Merge pull request #627 from sadovnychyi/patch-1
Remove duplicated line of vocab declaration
2016-11-16 06:09:18 +11:00
Yanhao
762169da29 Fixed bug: eg.guess is a tag id, rather than tag 2016-11-15 14:11:22 +08:00
Dmytro Sadovnychyi
e70a7050e1 Remove duplicated line of vocab declaration
As already declared on line 211.
2016-11-13 18:52:49 +08:00
Matthew Honnibal
f123f92e0c Fix #617: Vocab.load() required Path. Should work with string as well. 2016-11-10 22:48:48 +01:00
Matthew Honnibal
e86f440ca6 Fix test for issue 617 2016-11-10 22:48:10 +01:00
Matthew Honnibal
faa7610c56 Merge branch 'master' of ssh://github.com/explosion/spaCy 2016-11-10 22:46:38 +01:00
Matthew Honnibal
a2c7de8329 spacy/tests/regression/test_issue617.py
Test Issue #617
2016-11-10 22:46:23 +01:00
tiago
2a3e342c1f Added a test case to cover the span.merge returning values 2016-11-09 18:57:50 +00:00
tiago
b38cfd0ef9 now span.merge returns token like it says on documentation 2016-11-09 14:58:19 +00:00
Dmitry Sadovnychyi
9488222e79 Fix PhraseMatcher to work with updated Matcher
#613
2016-11-09 00:14:26 +08:00
Dmitry Sadovnychyi
86c056ba64 Add basic test for PhraseMatcher
#613
2016-11-09 00:10:32 +08:00
Matthew Honnibal
3ea15b257f Fix test for 605 2016-11-06 11:59:26 +01:00
Matthew Honnibal
efe7790439 Test #590: Order dependence in Matcher rules. 2016-11-06 11:21:36 +01:00
Matthew Honnibal
5cd3acb265 Fix #605: Acceptor now rejects matches as expected. 2016-11-06 10:50:42 +01:00
Matthew Honnibal
75805397dd Test Issue #605 2016-11-06 10:42:32 +01:00
Matthew Honnibal
014b6936ac Fix #608 -- __version__ should be available at the base of the package. 2016-11-04 21:21:02 +01:00
Matthew Honnibal
42b0736db7 Increment version 2016-11-04 20:04:21 +01:00
Matthew Honnibal
9f93386994 Update version 2016-11-04 19:28:16 +01:00
Matthew Honnibal
1fb09c3dc1 Fix morphology tagger 2016-11-04 19:19:09 +01:00
Matthew Honnibal
a36353df47 Temporarily put back the tokenize_from_strings method, while tests aren't updated yet. 2016-11-04 19:18:07 +01:00
Matthew Honnibal
f0917b6808 Fix Issue #376: and/or was tagged as a noun. 2016-11-04 15:21:28 +01:00
Matthew Honnibal
737816e86e Fix #368: Tokenizer handled pattern 'unicode close quote, period' incorrectly. 2016-11-04 15:16:20 +01:00
Matthew Honnibal
ab952b4756 Fix #578 -- Sputnik had been purging all files on --force, not just the relevant one. 2016-11-04 10:44:11 +01:00
Matthew Honnibal
6e37ba1d82 Fix #602, #603 --- Broken build 2016-11-04 09:54:24 +01:00
Matthew Honnibal
293c79c09a Fix #595: Lemmatization was incorrect for base forms, because morphological analyser wasn't adding morphology properly. 2016-11-04 00:29:07 +01:00
Matthew Honnibal
e30348b331 Prefer to import from symbols instead of parts_of_speech 2016-11-04 00:27:55 +01:00
Matthew Honnibal
4a8a2b6001 Test #595 -- Bug in lemmatization of base forms. 2016-11-04 00:27:32 +01:00
Matthew Honnibal
f1605df2ec Fix #588: Matcher should reject empty pattern. 2016-11-03 00:16:44 +01:00
Matthew Honnibal
72b9bd57ec Test Issue #588: Matcher accepts invalid, empty patterns. 2016-11-03 00:09:35 +01:00
Matthew Honnibal
41a90a7fbb Add tokenizer exception for 'Ph.D.', to fix 592. 2016-11-03 00:03:34 +01:00
Matthew Honnibal
532318e80b Import Jieba inside zh.make_doc 2016-11-02 23:49:19 +01:00
Matthew Honnibal
f292f7f0e6 Fix Issue #599, by considering empty documents to be parsed and tagged. Implementation is a bit dodgy. 2016-11-02 23:48:43 +01:00
Matthew Honnibal
b6b01d4680 Remove deprecated tokens_from_list test. 2016-11-02 23:47:21 +01:00
Matthew Honnibal
3d6c79e595 Test Issue #599: .is_tagged and .is_parsed attributes not reflected after deserialization for empty documents. 2016-11-02 23:40:11 +01:00
Matthew Honnibal
05a8b752a2 Fix Issue #600: Missing setters for Token attribute. 2016-11-02 23:28:59 +01:00
Matthew Honnibal
125c910a8d Test Issue #600 2016-11-02 23:24:13 +01:00
Matthew Honnibal
e0c9695615 Fix doc strings for tokenizer 2016-11-02 23:15:39 +01:00
Matthew Honnibal
80824f6d29 Fix test 2016-11-02 20:48:40 +01:00
Matthew Honnibal
dbe47902bc Add import fr 2016-11-02 20:48:29 +01:00
Matthew Honnibal
8f24dc1982 Fix infixes in Italian 2016-11-02 20:43:52 +01:00
Matthew Honnibal
41a4766c1c Fix infixes in spanish and portuguese 2016-11-02 20:43:12 +01:00
Matthew Honnibal
3d4bd96e8a Fix infixes in french 2016-11-02 20:41:43 +01:00
Matthew Honnibal
c09a8ce5bb Add test for french tokenizer 2016-11-02 20:40:31 +01:00
Matthew Honnibal
b012ae3044 Add test for loading languages 2016-11-02 20:38:48 +01:00
Matthew Honnibal
ad1c747c6b Fix stray POS in language stubs 2016-11-02 20:37:55 +01:00
Matthew Honnibal
e9e6fce576 Handle null prefix/suffix/infix search in tokenizer 2016-11-02 20:35:48 +01:00
Matthew Honnibal
22647c2423 Check that patterns aren't null before compiling regex for tokenizer 2016-11-02 20:35:29 +01:00
Matthew Honnibal
5ac735df33 Link languages in __init__.py 2016-11-02 20:05:14 +01:00
Matthew Honnibal
c68dfe2965 Stub out support for Italian 2016-11-02 20:03:24 +01:00
Matthew Honnibal
6dbf4f7ad7 Stub out support for French, Spanish, Italian and Portuguese 2016-11-02 20:02:41 +01:00
Matthew Honnibal
6b8b05ef83 Specify that spacy.util is encoded in utf8 2016-11-02 19:58:00 +01:00
Matthew Honnibal
5363224395 Add draft Jieba tokenizer for Chinese 2016-11-02 19:57:38 +01:00
Matthew Honnibal
f7fee6c24b Check for class-defined make_docs method before assigning one provided as an argument 2016-11-02 19:57:13 +01:00
Matthew Honnibal
19c1e83d3d Work on draft Italian tokenizer 2016-11-02 19:56:32 +01:00
Matthew Honnibal
9efe568177 Add missing unicode_literals to spacy.util. I think this was messing up the tokenizer regex for non-ascii characters in Python 2. Re Issue #596 2016-11-02 12:31:34 +01:00
Matthew Honnibal
d8db648ebf Add __init__.py file for regression tests 2016-11-01 13:45:06 +01:00
Matthew Honnibal
11664b9f20 Fix variable error in token 2016-11-01 13:28:00 +01:00
Matthew Honnibal
8c4d1b46ce Fix variable error in Span 2016-11-01 13:27:44 +01:00
Matthew Honnibal
e7af6b937f Fix syntax error while fixing doc strings 2016-11-01 13:27:32 +01:00
Matthew Honnibal
62fc6b1afa Use 32 bit hashes for OOV, re Issue #589, Issue #285 2016-11-01 13:27:13 +01:00
Matthew Honnibal
6977a2b8cd Add test for Issue #589 2016-11-01 12:33:36 +01:00
Matthew Honnibal
b86f8af0c1 Fix doc strings 2016-11-01 12:25:36 +01:00
Matthew Honnibal
d563f1eadb Fix Issue #587: Segfault in Matcher, due to simple error in the state machine. 2016-10-28 17:42:00 +02:00
Matthew Honnibal
7e5f63a595 Improve test slightly 2016-10-28 17:41:16 +02:00
Matthew Honnibal
782e4814f4 Test Issue #587: Matcher segfaults on particular input 2016-10-28 16:38:32 +02:00
Matthew Honnibal
708ea22208 Infer types in transition_system.pyx 2016-10-27 18:08:13 +02:00
Matthew Honnibal
18590eba94 Fix training evaluate method 2016-10-27 18:02:19 +02:00
Matthew Honnibal
301f3cc898 Fix Issue #429. Add an initialize_state method to the named entity recogniser that adds missing entity types. This is a messy place to add this, because it's strange to have the method mutate state. A better home for this logic could be found. 2016-10-27 18:01:55 +02:00
Matthew Honnibal
afea6505f3 Test Issue 429: No valid actions for NER after matcher adds a new entity label. 2016-10-27 18:01:34 +02:00
Matthew Honnibal
03a520ec4f Change signature of Parser.parseC, so that nr_class is read from the transition system. This allows the transition system to modify the number of actions in initialize_state. 2016-10-27 17:58:56 +02:00
Matthew Honnibal
6c47048912 Fix test, after IOB tweak. 2016-10-26 17:22:03 +02:00
Matthew Honnibal
4ca31b4d87 Fix clobbering of 'missing' named ent values after assigning ents. 2016-10-26 13:13:56 +02:00
Matthew Honnibal
cb49189477 Remove dead code 2016-10-26 13:11:07 +02:00
Matthew Honnibal
a209b10579 Improve error message when oracle fails for non-projective trees, re Issue #571. 2016-10-24 20:31:30 +02:00
Matthew Honnibal
b2d43b93d2 Fix Python 3 basestring error 2016-10-24 14:22:51 +02:00
Matthew Honnibal
276478fe0f Update strings.pxd 2016-10-24 14:00:35 +02:00
Matthew Honnibal
d8134817ff Workaround Issue #285: Allow the StringStore to be 'frozen', in which case strings will be pushed into an OOV map. We can then flush this OOV map, freeing all of the OOV strings. 2016-10-24 13:49:03 +02:00
Matthew Honnibal
d3a617aa99 Test workaround for Issue #285: Streaming data memory growth 2016-10-24 13:48:06 +02:00
Matthew Honnibal
64e5f02cf7 Update test 2016-10-23 21:08:07 +02:00
Matthew Honnibal
66d7a6eca2 Update test 2016-10-23 21:02:05 +02:00
Matthew Honnibal
90bf797125 Update test 2016-10-23 20:54:17 +02:00
Matthew Honnibal
5e76320ffe Update test 2016-10-23 20:44:54 +02:00
Matthew Honnibal
aa105927f3 Update test 2016-10-23 20:31:25 +02:00
Matthew Honnibal
6b9237aa83 Increment version 2016-10-23 20:22:53 +02:00
Matthew Honnibal
150e02d72e Fix Issue #566 2016-10-23 20:19:01 +02:00
Matthew Honnibal
e120561294 Fix vector_norm test. 2016-10-23 19:56:16 +02:00
Matthew Honnibal
fefde8aef8 Make installation print data path. 2016-10-23 19:46:44 +02:00
Matthew Honnibal
e7414cd064 Try to fix weird install glitch. 2016-10-23 19:46:28 +02:00
Matthew Honnibal
90f7544edd Increment version 2016-10-23 19:43:06 +02:00
Matthew Honnibal
6036ec7c77 Fix vector norm when loading lexemes. 2016-10-23 19:40:18 +02:00
Matthew Honnibal
c05cd2356e Fix similarity test for Python 3 2016-10-23 18:16:56 +02:00
Matthew Honnibal
3e688e6d4b Fix issue #514 -- serializer fails when new entity type has been added. The fix here is quite ugly. It's best to add the entities ASAP after loading the NLP pipeline, to mitigate the brittleness. 2016-10-23 17:45:44 +02:00
Matthew Honnibal
79aa03fe98 Test Issue #514: Serializer fails when new entity type has been added. 2016-10-23 17:41:44 +02:00
Matthew Honnibal
f97548c6f1 Fix broken test, re Issue #461 2016-10-23 17:02:23 +02:00
Matthew Honnibal
4de30a8e38 Test Issue #514: Serialization fails after adding a new entity label. 2016-10-23 16:40:27 +02:00
Matthew Honnibal
936e6246aa Fix Issue #459 -- failed to deserialize empty doc. 2016-10-23 16:31:05 +02:00
Matthew Honnibal
e99b3f5322 Test Issue #459: Fail to deserialize empty doc 2016-10-23 16:30:22 +02:00
Matthew Honnibal
49c117960c Fix bug where huffman codec died if given empty freqs dict. 2016-10-23 16:28:05 +02:00
Matthew Honnibal
99ff8b902f Test that huffman codec works with empty freqs dict 2016-10-23 16:27:45 +02:00
Matthew Honnibal
15c9b59f0e Fix Issue #461: O tag was being clobbered by doc.ents.__set__ 2016-10-23 15:50:26 +02:00
Matthew Honnibal
e5627134d9 Test Issue #461: ent_iob tag incorrect after setting entities. 2016-10-23 15:50:04 +02:00
Matthew Honnibal
f62088d646 Fix compile error 2016-10-23 14:50:50 +02:00
Matthew Honnibal
2c3a67b693 Fix calculation of vector norm, re Issue #522. Need to consolidate the calculations into a helper function. 2016-10-23 14:49:31 +02:00
Matthew Honnibal
a0a4ada42a Fix calculation of L2-norm for Lexeme 2016-10-23 14:44:45 +02:00
Matthew Honnibal
2989072aac Add tests to verify that Issue #442 is fixed in 1.1 2016-10-23 14:33:13 +02:00
Matthew Honnibal
739213a8af Fix create_pipeline keyword argument. 2016-10-23 14:24:16 +02:00
Matthew Honnibal
bea44bd3c4 Fix vector_norm when vector is assigned to Lexeme. 2016-10-23 14:23:56 +02:00
Matthew Honnibal
e838b6d53f Add tests for using the new Entity ID tracking in the rule matcher 2016-10-23 14:04:01 +02:00
Matthew Honnibal
e7af75e0a9 Add test for vector resizing, re Issue #544 2016-10-21 17:07:21 +02:00
Matthew Honnibal
ca8ea33abc Bump version to 1.1.0 2016-10-21 16:30:57 +02:00
Matthew Honnibal
7ab03050d4 Add resize_vectors method to Vocab 2016-10-21 01:44:50 +02:00
Matthew Honnibal
8ce8803824 Fix JSON in tokenizer 2016-10-21 01:44:20 +02:00
Matthew Honnibal
6eb73a095f Fix JSON in tagger 2016-10-21 01:44:10 +02:00
Matthew Honnibal
e16e78a737 Merge branch 'master' of ssh://github.com/explosion/spaCy 2016-10-21 00:00:15 +02:00
Matthew Honnibal
147373c807 Increment version 2016-10-21 00:00:03 +02:00
Matthew Honnibal
e80944276f Fix Span.vector_norm 2016-10-20 21:58:56 +02:00
Matthew Honnibal
f5fe4f595b Fix json loading, for Python 3. 2016-10-20 21:23:26 +02:00
Matthew Honnibal
2e92c6fb3a Fix JSON encoding issue on load 2016-10-20 21:06:48 +02:00
Matthew Honnibal
4ad7bb96c9 Increment version. 2016-10-20 20:48:30 +02:00
Matthew Honnibal
5ec32f5d97 Fix loading of GloVe vectors, to address Issue #541 2016-10-20 18:27:48 +02:00
Matthew Honnibal
ddeabd76c4 Fix mistake loading GloVe vectors. GloVe vectors now loaded by default if present, as promised. 2016-10-20 16:57:53 +02:00
Matthew Honnibal
bfe5cb1244 Increment version. 2016-10-20 14:52:00 +02:00
Matthew Honnibal
f189a3cb00 Fix encoding when opening files in Python 2.7, re Issue #539 2016-10-20 14:42:56 +02:00
Matthew Honnibal
c353a5214d Increment version 2016-10-19 23:51:01 +02:00
Matthew Honnibal
d10c17f2a4 Fix Issue #536: oov_prob was 0 for OOV words. 2016-10-19 23:38:47 +02:00
Matthew Honnibal
dfa752d064 Increment version 2016-10-19 23:19:13 +02:00
Matthew Honnibal
3588a18fb8 Fix hook names in doc 2016-10-19 21:15:16 +02:00
Matthew Honnibal
5d5742b773 Add sentiment field to doc, rename getters_for_tokens and getters_for_spans, add user_hooks field to Doc. 2016-10-19 20:54:22 +02:00
Matthew Honnibal
ed5e178817 Add sentiment property on lexeme object 2016-10-19 20:52:52 +02:00
Matthew Honnibal
d4aaf2752c Fix issue #535: Pipeline elements added even when data not installed. 2016-10-19 19:55:19 +02:00
Matthew Honnibal
04d1c959da Fix version 2016-10-19 03:45:37 +02:00
Matthew Honnibal
d35aa7344e Change version ID to make PyPi happy 2016-10-19 03:24:39 +02:00
Matthew Honnibal
89d2a5c8b3 Increment build version. 2016-10-19 03:05:17 +02:00
Matthew Honnibal
622b0a9674 Tweak download script 2016-10-19 00:52:16 +02:00
Matthew Honnibal
5a5c7192a5 Fix download.py for GloVe vectors. 2016-10-19 00:47:44 +02:00
Matthew Honnibal
edc45c19d6 Update download script 2016-10-19 00:41:14 +02:00
Matthew Honnibal
2bbb050500 Fix default of serializer_freqs 2016-10-18 19:55:41 +02:00
Matthew Honnibal
1b651db9c5 Fix parser creation in Language class. 2016-10-18 19:36:44 +02:00
Matthew Honnibal
45a6f9b9c7 Fix loading of tagger. 2016-10-18 19:33:04 +02:00
Matthew Honnibal
76c815f40d Fix spacy.load 2016-10-18 19:23:31 +02:00
Matthew Honnibal
8c8f5c62c6 Add LANG attribute to English and German 2016-10-18 18:52:48 +02:00
Matthew Honnibal
05e2a589a4 Fix None label in matcher 2016-10-18 18:05:21 +02:00
Matthew Honnibal
c3a8a1cf51 Update serializer test. 2016-10-18 16:18:46 +02:00
Matthew Honnibal
7d5212f131 Refactor defaults 2016-10-18 16:18:25 +02:00
Matthew Honnibal
a45a9d5092 Remove stray .tensor attribute from Lexeme 2016-10-18 01:16:32 +02:00
Matthew Honnibal
9258db788a Revert "Have the matcher return character offsets, to handle the match better."
This reverts commit 049c937540.
2016-10-17 16:49:51 +02:00
Matthew Honnibal
7d446e5094 Revert "Update matcher test, to reflect character offset return instead of token offset."
This reverts commit f8d3e3bcfe.
2016-10-17 16:49:49 +02:00
Matthew Honnibal
4bf2c53c13 Revert "Hack on matcher tests, for new implementation."
This reverts commit dbe60644ab.
2016-10-17 16:49:48 +02:00
Matthew Honnibal
2fd97c71cc Revert "Don't try to pickle matcher."
This reverts commit 97bd0c9d00.
2016-10-17 16:49:43 +02:00
Matthew Honnibal
97bd0c9d00 Don't try to pickle matcher. 2016-10-17 16:38:40 +02:00
Matthew Honnibal
dbe60644ab Hack on matcher tests, for new implementation. 2016-10-17 16:12:22 +02:00
Matthew Honnibal
f8d3e3bcfe Update matcher test, to reflect character offset return instead of token offset. 2016-10-17 16:00:10 +02:00
Matthew Honnibal
049c937540 Have the matcher return character offsets, to handle the match better. 2016-10-17 15:58:57 +02:00
Matthew Honnibal
9b60186266 Fix doc class 2016-10-17 15:23:47 +02:00
Matthew Honnibal
6cbdc94959 Lots of updates to Matcher, to make entity handling sane. 2016-10-17 15:23:31 +02:00
Matthew Honnibal
7fd98fc91c Remove deprecation shim around str/bytes in Token. 2016-10-17 14:02:47 +02:00
Matthew Honnibal
b67697a97b Improve API for doc.merge() and span.merge(), to use keyword arguments. 2016-10-17 14:02:13 +02:00
Matthew Honnibal
fbb7f3f15c Add user_data attribute to Doc object. 2016-10-17 11:43:22 +02:00
Matthew Honnibal
c1abc8f6ed Fix deprecation stuff in Token: Remove the shim for the str/unicode semantics, and raise for has_repvec and repvec 2016-10-17 11:18:41 +02:00
Matthew Honnibal
4ba9eadf3d Merge branch 'v1.0.0-rc1' of ssh://github.com/explosion/spaCy into v1.0.0-rc1 2016-10-17 02:45:44 +02:00
Matthew Honnibal
09ab447a18 Remove tensor property from token. 2016-10-17 02:45:09 +02:00
Matthew Honnibal
5d10e2005c Defer some attributes to Doc, via getters_for_tokens attribute. 2016-10-17 02:44:49 +02:00
Matthew Honnibal
8829984efb Remove tensor attribute from Span and Token. 2016-10-17 02:44:04 +02:00
Matthew Honnibal
d15a88c66a Defer some attributes to Doc via getters_for_spans 2016-10-17 02:43:35 +02:00
Matthew Honnibal
62230dd13a Add getters_for_spans and getters_for_tokens attributes to Doc. Fix docstring 2016-10-17 02:42:51 +02:00
Matthew Honnibal
ae11ea8240 Add getters_for_tokens and getters_for_spans attributes to Doc object. 2016-10-17 02:42:05 +02:00
Matthew Honnibal
be48a7b4f3 Fix conftest for website tests. 2016-10-17 01:54:26 +02:00
Matthew Honnibal
8951bf6989 Update matcher tests 2016-10-17 01:53:24 +02:00
Matthew Honnibal
0cf4aff470 Set default path in EN/DE tests. 2016-10-17 01:52:49 +02:00
Matthew Honnibal
cd71b6b0a9 Remove test of parser pickle 2016-10-17 01:52:10 +02:00
Matthew Honnibal
5bc101006e Add cfg field to Tagger 2016-10-17 01:03:41 +02:00
Matthew Honnibal
517f090cbf Use GoldParse in tagger.update 2016-10-17 00:55:15 +02:00
Matthew Honnibal
59038f7efa Restore support for prior data format -- specifically, the labels field of the config. 2016-10-17 00:53:26 +02:00
Matthew Honnibal
7887ab3b36 Fix default use of feature_templates in parser 2016-10-16 21:41:56 +02:00
Matthew Honnibal
f787cd29fe Refactor the pipeline classes to make them more consistent, and remove the redundant blank() constructor. 2016-10-16 21:34:57 +02:00
kengz
fb92e2d061 activate parse_tree test, use from_array, test for root correctness 2016-10-16 15:12:08 -04:00
kengz
17b7832419 mark test as needing models 2016-10-16 14:39:07 -04:00
kengz
f046e0d7c8 add parse_tree method to language, separate from __call__ for efficiency, but will use __call__ to get the doc 2016-10-16 14:20:23 -04:00
Matthew Honnibal
311a985fe0 Add input error handling in Doc 2016-10-16 18:16:42 +02:00
Matthew Honnibal
06322ba99d Add words and spaces keyword arguments to Doc. 2016-10-16 18:13:03 +02:00
Matthew Honnibal
ca51f3b77e Use DependencyParser and EntityRecognizer in the Language class. 2016-10-16 17:58:12 +02:00
Matthew Honnibal
195d998a12 Fix GoldParse argument to tagger.update 2016-10-16 17:05:09 +02:00
Matthew Honnibal
274a4d4272 Fix queue Python property in StateClass 2016-10-16 17:04:41 +02:00
Matthew Honnibal
e8c8aa08ce Make action_name optional in StepwiseState 2016-10-16 17:04:16 +02:00
Matthew Honnibal
4bb73b1a93 Fix parser labels in pipeline 2016-10-16 17:03:22 +02:00
Matthew Honnibal
a81c5a7abf Fix name of labels keyword to 'actions'. 2016-10-16 12:00:27 +02:00
Matthew Honnibal
a079677984 Fix omission of O action when creating blank entity recognizer 2016-10-16 11:43:25 +02:00
Matthew Honnibal
5444d38cc6 Update test for biluo tags 2016-10-16 11:42:45 +02:00
Matthew Honnibal
4fc56d4a31 Rename 'labels' to 'actions' in parser options 2016-10-16 11:42:26 +02:00
Matthew Honnibal
8a6b35d266 Delay binding in MakeDoc 2016-10-16 11:41:55 +02:00
Matthew Honnibal
52b48b415e Fix GoldParse class 2016-10-16 11:41:36 +02:00
Matthew Honnibal
3259a63779 Whitespace 2016-10-16 01:47:28 +02:00
Matthew Honnibal
509b30834f Add a pipeline module, to collect and wrap processes for annotation 2016-10-16 01:47:12 +02:00
Matthew Honnibal
0317cea0ad Fix GoldParse 2016-10-15 23:55:07 +02:00
Matthew Honnibal
1c62573a41 Fix spacy.train 2016-10-15 23:53:46 +02:00
Matthew Honnibal
a48aa15384 Improve the API for the GoldParse class. 2016-10-15 23:53:29 +02:00
Matthew Honnibal
e07fe92b27 Draft a refactored init for the GoldParse class 2016-10-15 22:09:52 +02:00
Matthew Honnibal
47afef7d6b Add init.py for gold tests 2016-10-15 21:51:28 +02:00
Matthew Honnibal
86ae665c78 Add function for entity->biluo transformation 2016-10-15 21:51:04 +02:00
Matthew Honnibal
2163fd238f Add tests for entity->biluo transformation 2016-10-15 21:50:43 +02:00
Matthew Honnibal
5e923b9bfa Return None in match_best_version if not path exists. 2016-10-15 14:47:29 +02:00
Matthew Honnibal
2516382106 Fix loading of English in span test 2016-10-15 14:44:37 +02:00
Matthew Honnibal
dda2fc6bef Add empty data directory 2016-10-15 14:25:25 +02:00
Matthew Honnibal
049197e0ae Update tests, somewhat messily. 2016-10-15 14:14:04 +02:00
Matthew Honnibal
1e1a1d9517 Update matcher test 2016-10-15 14:13:41 +02:00
Matthew Honnibal
9cc9ce0f14 Load with default path=False in tests. 2016-10-15 14:13:23 +02:00
Matthew Honnibal
08e9134760 Change default value of path to True 2016-10-15 14:12:54 +02:00
Matthew Honnibal
788657f062 Ensure words are added to vocab before test, so that the lexicon is updated correctly. 2016-10-15 14:12:18 +02:00
Matthew Honnibal
4a1a2bce68 Update version in about.py 2016-10-15 13:44:27 +02:00
Matthew Honnibal
6d8cb515ac Break the tokenization stage out of the pipeline into a function 'make_doc'. This allows all pipeline methods to have the same signature. 2016-10-14 17:38:29 +02:00
Matthew Honnibal
2cc515b2ed Add add_flag method to Vocab, re Issue #504. 2016-10-14 12:15:38 +02:00
Matthew Honnibal
f3be9d0a9a Add tensor field to Lexeme, Token, Doc and Span, so that users have a place to hang neural network outputs 2016-10-14 03:24:13 +02:00
Matthew Honnibal
9b55d97a8f Update train method 2016-10-13 03:24:53 +02:00
Matthew Honnibal
645d99523a Move merge_sents method into spacy.gold 2016-10-13 03:24:29 +02:00
Matthew Honnibal
41f88ce938 Fix dep model loading in parser 2016-10-12 20:26:38 +02:00
Matthew Honnibal
d9ae2d68af Load features by string-name for backwards compatibility. 2016-10-12 20:15:11 +02:00
Matthew Honnibal
a42fbcf946 Require model for test_is_properties 2016-10-12 19:35:18 +02:00
Matthew Honnibal
20c948361b Use local path in test_lemmatizer 2016-10-12 19:35:00 +02:00
Matthew Honnibal
1318d0bc65 Test with the non-loaded versions of the English and German pipelines. 2016-10-12 19:13:31 +02:00
Matthew Honnibal
0e2bedc373 Fix default labels for parser and NER 2016-10-12 19:12:40 +02:00
Matthew Honnibal
3a03c668c3 Fix message in ParserStateError 2016-10-12 14:44:31 +02:00
Matthew Honnibal
6bf505e865 Fix error on ParserStateError 2016-10-12 14:35:55 +02:00
Matthew Honnibal
ba5e048502 Add docstring for Trainer class. 2016-10-12 14:26:02 +02:00
Matthew Honnibal
847a4a4182 Refactor Language, dropping Language.blank() method. 2016-10-12 13:45:58 +02:00
Matthew Honnibal
ea23b64cc8 Refactor training, with new spacy.train module. Defaults still a little awkward. 2016-10-09 12:24:24 +02:00
Matthew Honnibal
ca32a1ab01 Revert "Work on Issue #285: intern strings into document-specific pools, to address streaming data memory growth. StringStore.__getitem__ now raises KeyError when it can't find the string. Use StringStore.intern() to get the old behaviour. Still need to hunt down all uses of StringStore.__getitem__ in library and do testing, but logic looks good."
This reverts commit 8423e8627f.
2016-09-30 20:20:22 +02:00
Matthew Honnibal
90baa9c7e6 Revert "Changes to matcher.pyx for new StringStore scheme"
This reverts commit 3ff09614e0.
2016-09-30 20:20:13 +02:00
Matthew Honnibal
1b6b129c04 Revert "Changes to morphology.pyx for new StringStore scheme"
This reverts commit 95f8cfd745.
2016-09-30 20:20:02 +02:00
Matthew Honnibal
1d70db58aa Revert "Changes to iterators.pyx for new StringStore scheme"
This reverts commit 4f794b215a.
2016-09-30 20:19:53 +02:00
Matthew Honnibal
de01e427fd Revert "Changes to strings.pyx for new StringStore scheme"
This reverts commit 22d4752d64.
2016-09-30 20:19:42 +02:00
Matthew Honnibal
9e09b39b9f Revert "Changes to transition systems for new StringStore scheme"
This reverts commit 0442e0ab1e.
2016-09-30 20:11:49 +02:00
Matthew Honnibal
e3285f6f30 Revert "Fix report of ParserStateError"
This reverts commit 78f19baafa.
2016-09-30 20:11:33 +02:00
Matthew Honnibal
6736977d82 Revert "Changes to Doc and Token for new string store scheme"
This reverts commit 99de44d864.
2016-09-30 20:11:15 +02:00
Matthew Honnibal
bd7fe6420c Revert "Changes to test for new string-store"
This reverts commit 21e90d7d0b.
2016-09-30 20:11:01 +02:00
Matthew Honnibal
1f1cd5013f Revert "Changes to vocab for new stringstore scheme"
This reverts commit a51149a717.
2016-09-30 20:10:30 +02:00
Matthew Honnibal
1e7d0af127 Revert "Changes to Lexeme for new string store scheme"
This reverts commit 717741b6cf.
2016-09-30 20:10:13 +02:00
Matthew Honnibal
ba51cb8325 Revert "Changes to tagger for new string store scheme"
This reverts commit f5a6aac906.
2016-09-30 20:09:53 +02:00
Matthew Honnibal
23b7244842 Make sure symbols are unicode strings 2016-09-30 20:02:19 +02:00
Matthew Honnibal
f5a6aac906 Changes to tagger for new string store scheme 2016-09-30 20:01:51 +02:00
Matthew Honnibal
717741b6cf Changes to Lexeme for new string store scheme 2016-09-30 20:01:36 +02:00
Matthew Honnibal
a51149a717 Changes to vocab for new stringstore scheme 2016-09-30 20:01:19 +02:00
Matthew Honnibal
21e90d7d0b Changes to test for new string-store 2016-09-30 20:00:58 +02:00
Matthew Honnibal
99de44d864 Changes to Doc and Token for new string store scheme 2016-09-30 20:00:21 +02:00
Matthew Honnibal
78f19baafa Fix report of ParserStateError 2016-09-30 19:59:22 +02:00
Matthew Honnibal
0442e0ab1e Changes to transition systems for new StringStore scheme 2016-09-30 19:58:51 +02:00
Matthew Honnibal
22d4752d64 Changes to strings.pyx for new StringStore scheme 2016-09-30 19:58:09 +02:00
Matthew Honnibal
4f794b215a Changes to iterators.pyx for new StringStore scheme 2016-09-30 19:57:49 +02:00
Matthew Honnibal
95f8cfd745 Changes to morphology.pyx for new StringStore scheme 2016-09-30 19:57:10 +02:00
Matthew Honnibal
3ff09614e0 Changes to matcher.pyx for new StringStore scheme 2016-09-30 19:56:48 +02:00
Matthew Honnibal
eceeaefe53 Fix defaults for Parser and Entity, adding a blank= argument. 2016-09-30 19:56:06 +02:00
Matthew Honnibal
8423e8627f Work on Issue #285: intern strings into document-specific pools, to address streaming data memory growth. StringStore.__getitem__ now raises KeyError when it can't find the string. Use StringStore.intern() to get the old behaviour. Still need to hunt down all uses of StringStore.__getitem__ in library and do testing, but logic looks good. 2016-09-30 10:14:47 +02:00
Matthew Honnibal
d3dc5718b2 Fix syntax error in Doc 2016-09-28 11:39:49 +02:00
Matthew Honnibal
1b520e7bab Improve docstrings for Doc object 2016-09-28 11:15:13 +02:00
Matthew Honnibal
81a47c01d8 Fix test for empty sentence string. 2016-09-27 19:21:22 +02:00
Matthew Honnibal
4cbf0d3bb6 Handle errors when no valid actions are available, pointing users to the issue tracker. 2016-09-27 19:19:53 +02:00
Matthew Honnibal
430473bd98 Raise errors when no actions are available, re Issue #429 2016-09-27 19:09:37 +02:00
Matthew Honnibal
fc4a7ad794 Test and fix Issue #411: IndexError when .sents property is used on empty string. 2016-09-27 18:49:14 +02:00
Matthew Honnibal
3d370b7d45 Add test for Issue #445, fixed in 3cb4d455d, with improved lemmatizer logic 2016-09-27 18:39:46 +02:00
Matthew Honnibal
a2f3510d6d Fix lemmatizer 2016-09-27 17:47:05 +02:00
Matthew Honnibal
07776d8096 Fix pos name conflict in lemmatize 2016-09-27 17:35:58 +02:00
Matthew Honnibal
35cd953f9e Fix pos name conflict with morphology 2016-09-27 14:16:22 +02:00
Matthew Honnibal
8e7df3c4ca Expect the parser data, if parser.load() is called. 2016-09-27 14:02:12 +02:00
Matthew Honnibal
bb4f201ad2 Pass morphological features from tag map into the lemmatizer. 2016-09-27 14:01:43 +02:00
Matthew Honnibal
40509e8bca Tweak the new is_base_form logic, because we can expect the 'pos' key in the morphology we're passed. 2016-09-27 14:01:16 +02:00
Matthew Honnibal
9c8ac91d72 Add test for Issue #435 2016-09-27 13:52:38 +02:00
Matthew Honnibal
3cb4d455d2 Pass lemmatizer morphological features, so that rules are sensitive to base/inflected distinction, which is how the WordNet data is designed. See Issue #435 2016-09-27 13:52:11 +02:00
Matthew Honnibal
e233328d38 Fix Issue #371: Lexeme objects were unhashable. 2016-09-27 13:22:30 +02:00
Matthew Honnibal
e382e48d9f Temporarily patch handling of defaul templates for tagger. Need to move these to language_data. 2016-09-27 13:21:28 +02:00
Matthew Honnibal
a44763af0e Fix Issue #469: Incorrectly cased root label in noun chunk iterator 2016-09-27 13:13:01 +02:00
Matthew Honnibal
b14b9b096b Return None if /deps directory not present, instead of trying to load the parser. 2016-09-26 18:48:03 +02:00
Matthew Honnibal
e07b9665f7 Don't expect parser model 2016-09-26 18:09:33 +02:00
Matthew Honnibal
ee6fa106da Fix parser features 2016-09-26 17:57:32 +02:00
Matthew Honnibal
e607e4b598 Fix parser loading 2016-09-26 17:51:11 +02:00
Matthew Honnibal
0b2d7ae9d6 Fix Entity creation 2016-09-26 15:41:22 +02:00
Matthew Honnibal
2debc4e0a2 Add .blank() method to Parser. Start housing default dep labels and entity types within the Defaults class. 2016-09-26 11:57:54 +02:00
Matthew Honnibal
722199acb8 Add spacy.blank() method, that doesn't load data. Don't try to load data if path is falsey 2016-09-26 11:07:46 +02:00
Matthew Honnibal
e56653f848 Add language data for German 2016-09-25 15:44:45 +02:00
Matthew Honnibal
7db956133e Move tokenizer data for German into spacy.de.language_data 2016-09-25 15:37:33 +02:00
Matthew Honnibal
95aaea0d3f Refactor so that the tokenizer data is read from Python data, rather than from disk 2016-09-25 14:49:53 +02:00
Matthew Honnibal
d7e9acdcdf Add English language data, so that the tokenizer doesn't require the data download 2016-09-25 14:49:00 +02:00
Matthew Honnibal
82b8cc5efb Whitespace 2016-09-24 22:17:01 +02:00
Matthew Honnibal
fd58f7655a Python 3 compatible basestring 2016-09-24 22:16:43 +02:00
Matthew Honnibal
082e95b19e Python 3 compatible basestring 2016-09-24 22:09:21 +02:00
Matthew Honnibal
f19af6cb2c Python 3 compatible basestring 2016-09-24 22:08:43 +02:00
Matthew Honnibal
3ed4cdfe32 Handle pathlib.Path objects in CFile 2016-09-24 22:01:46 +02:00
Matthew Honnibal
df88690177 Fix encoding of path variable 2016-09-24 21:13:15 +02:00
Matthew Honnibal
af847e07fc Fix usage of pathlib for Python3 -- turning paths to strings. 2016-09-24 21:05:27 +02:00
Matthew Honnibal
453683aaf0 Fix spacy/vocab.pyx 2016-09-24 20:50:31 +02:00
Matthew Honnibal
fd65cf6cbb Finish refactoring data loading 2016-09-24 20:26:17 +02:00
Matthew Honnibal
83e364188c Mostly finished loading refactoring. Design is in place, but doesn't work yet. 2016-09-24 15:42:01 +02:00
Matthew Honnibal
9dc8043a7e Refactor Language to use new Defaults class, and work on revised data loading. We're getting rid of sputnik's weird file-system wrapper, and using pathlib. 2016-09-24 14:08:53 +02:00
Matthew Honnibal
b00f683a0c Fix matcher test 2016-09-24 11:20:58 +02:00
Matthew Honnibal
eaf4065480 Expose the _patterns private member 2016-09-24 11:20:42 +02:00
Matthew Honnibal
15e42a1ba9 Allow entities to be set by Span, or by 4-tuple (with entity ID) 2016-09-24 01:17:43 +02:00
Matthew Honnibal
60fdf4d5f1 Remove commented out debuggng code 2016-09-24 01:17:18 +02:00
Matthew Honnibal
939a791a52 Update tests 2016-09-24 01:17:03 +02:00
Matthew Honnibal
55f1f7edaf Don't automatically write new entities into the Doc in the Matcher. This fixes a long-standing wart, but introduces a *backwards incompatibility.* 2016-09-24 01:16:45 +02:00
Matthew Honnibal
e48df859b5 Fix typedef import in span.pyx 2016-09-23 16:02:28 +02:00
Matthew Honnibal
4de13606fd Fix token.pyx 2016-09-23 15:07:07 +02:00
Matthew Honnibal
b4de419e19 Import hash_t typedef in token.pyx 2016-09-23 14:22:06 +02:00
Matthew Honnibal
c1a2e96604 Clean up notes at end of token.pyx 2016-09-21 20:45:51 +02:00
Matthew Honnibal
f6e587b1c7 Fix matcher tests 2016-09-21 20:45:20 +02:00
Matthew Honnibal
58e83fe34b Initial, limited support for quantified patterns in Matcher, and tracking of ent_id attribute in Token and Span. The quantifiers need a lot more testing, and there are some known problems. The main known problem is that the zero-plus and one-plus quantifiers won't work if a token can match both the quantified pattern expression AND the tail of the match. 2016-09-21 14:54:55 +02:00
Matthew Honnibal
2735b6247b Fix orths_and_spaces in Doc.__init__ 2016-09-21 14:52:05 +02:00
Matthew Honnibal
070af4af9d Revert "* Working neural net, but features hacky. Switching to extractor."
This reverts commit 7c2f1a673b.
2016-09-21 12:26:14 +02:00
Matthew Honnibal
6b202ec43f Merge branch 'master' of ssh://github.com/spacy-io/spaCy 2016-09-21 12:08:25 +02:00
Mahmoud Lababidi
4c9ccc3b8b Add parameter to download() for application to not exit if a Model exists. The default behavior is unchanged. 2016-09-14 10:04:09 -04:00
Adam Ever Hadani
f1c0762443 exit code 0 for when downloading a model that already was downloaded 2016-07-13 16:22:14 -07:00
Matthew Honnibal
7c2f1a673b * Working neural net, but features hacky. Switching to extractor. 2016-05-26 19:06:10 +02:00
Matthew Honnibal
cdc10e9a1c * Fix Issue #375: noun phrase iteration results in index error if noun phrases are merged during the loop. Fix by accumulating the spans inside the noun_chunks property, allowing the Span index tricks to work. 2016-05-20 10:14:06 +02:00
Matthew Honnibal
13fad36e49 * Cosmetic change to english noun chunks iterator -- use enumerate instead of range loop 2016-05-20 10:11:05 +02:00
Matthew Honnibal
02276cc444 Merge branch 'master' of ssh://github.com/spacy-io/spaCy 2016-05-17 16:56:22 +02:00
Matthew Honnibal
4d7f5468bb * Change Language class to use a .pipeline attribute, instead of having the pipeline hard coded 2016-05-17 16:55:42 +02:00
Daylen Yang
5405e7dd73 Fix get_lang_class parsing (take 2) 2016-05-16 16:40:31 -07:00
Matthew Honnibal
b240104f40 Revert "Fix get_lang_class parsing" 2016-05-17 08:04:26 +10:00
Daylen Yang
1692c2df3c Fix get_lang_class parsing
We want the get_lang_class to return "en" for both "en" and "en_glove_cc_300_1m_vectors". Changed the split rule to "_" so that this happens.
2016-05-16 14:38:20 -07:00
Matthew Honnibal
17137f5c0c * Fix issue #372: mistake in Lexeme rich comparison 2016-05-12 12:58:57 +02:00
Matthew Honnibal
cc8bf62208 * Fix Issue #360: Tokenizer failed when the infix regex matched the start of the string while trying to tokenize multi-infix tokens. 2016-05-09 13:23:47 +02:00
Matthew Honnibal
c61ee8f9fa * Increment version 2016-05-09 13:20:00 +02:00
Matthew Honnibal
5d86c30f0b * Fix Issue #367: Missing has_vector property on Doc and Span objects 2016-05-09 12:36:14 +02:00
Wolfgang Seeker
7b78239436 add fix for German noun chunk iterator (issue #365) 2016-05-06 01:41:26 +02:00
Matthew Honnibal
8c0888d6cb * Fix error in span.sent 2016-05-06 00:28:05 +02:00
Matthew Honnibal
bb94022975 * Fix Issue #365: Error introduced during noun phrase chunking, due to use of corrected PRON/PROPN/etc tags. 2016-05-06 00:21:05 +02:00
Matthew Honnibal
41342ca79b Merge branch 'master' of ssh://github.com/spacy-io/spaCy 2016-05-06 00:17:58 +02:00
Matthew Honnibal
26095f9722 * Add span.sent property, re Issue #366 2016-05-06 00:17:38 +02:00
Wolfgang Seeker
dbf8f5f3ec fix bug in StateC.set_break() 2016-05-05 15:15:34 +02:00
Wolfgang Seeker
3c44b5dc1a call deprojectivization after parsing 2016-05-05 15:10:36 +02:00
Matthew Honnibal
472f576b82 * Deprojectivize German parses 2016-05-05 15:01:10 +02:00
Matthew Honnibal
9bbd6cf031 * Work on Chinese support 2016-05-05 11:39:12 +02:00
Matthew Honnibal
a6a25166ba * Remove print from test 2016-05-05 11:10:59 +02:00
Matthew Honnibal
e31df66d26 * Fix Issue #361: Lexemes didn't have rich comparison. 2016-05-05 01:32:26 +02:00
Matthew Honnibal
7441ca30ee * Add tests for Issue #361: Lexeme rich comparison 2016-05-05 01:31:58 +02:00
Matthew Honnibal
72564213e3 * Add test for Issue #309 2016-05-04 16:00:28 +02:00
Matthew Honnibal
76f1d871da Merge branch 'master' of ssh://github.com/spacy-io/spaCy 2016-05-04 15:54:00 +02:00
Matthew Honnibal
519366f677 * Fix Issue #351: Indices off when leading whitespace 2016-05-04 15:53:36 +02:00
Matthew Honnibal
b4bfc6ae55 * Add test for Issue #351: Indices off when leading whitespace 2016-05-04 15:53:17 +02:00
Matthew Honnibal
76021cb853 * Fix bug in Doc.text, introduced by a862edc 2016-05-04 11:02:16 +02:00
Wolfgang Seeker
e4ea2bea01 fix whitespace 2016-05-04 07:40:38 +02:00
Wolfgang Seeker
5bf2fd1f78 make the code less cryptic 2016-05-03 17:19:05 +02:00
Wolfgang Seeker
a06fca9fdf German noun chunk iterator now doesn't return tokens more than once 2016-05-03 16:58:59 +02:00
Wolfgang Seeker
7825b75548 add tests for German noun chunker 2016-05-03 15:01:28 +02:00
Wolfgang Seeker
7b246c13cb reformulate noun chunk tests for English 2016-05-03 14:24:35 +02:00
Wolfgang Seeker
1786331cd8 add model sanity test 2016-05-03 12:51:47 +02:00
Matthew Honnibal
1f1532142f * Fix cost calculation on non-monotonic oracle 2016-05-03 00:21:08 +02:00
Matthew Honnibal
377a624046 Merge pull request #358 from wbwseeker/german_lemmatizer_dummy
German lemmatizer dummy
2016-05-03 07:38:26 +10:00
Wolfgang Seeker
92bfbebeec remove unnecessary imports 2016-05-02 17:33:22 +02:00
Wolfgang Seeker
857454ffa0 fix indentation -.- 2016-05-02 17:10:41 +02:00
Matthew Honnibal
308a28c26c * Whitespace 2016-05-02 16:08:11 +02:00
Matthew Honnibal
29a114e645 * Don't assign 0-valued tags in Doc.from_array 2016-05-02 16:07:50 +02:00
Matthew Honnibal
c1c11a8ae0 * Fix formatting on serializer tests 2016-05-02 16:07:21 +02:00
Wolfgang Seeker
dae6bc05eb define German dummy lemmatizer until morphology is done 2016-05-02 16:04:53 +02:00
Matthew Honnibal
6e1f1c4b9e Merge pull request #357 from wbwseeker/german_ner
German ner
2016-05-02 23:39:34 +10:00
Wolfgang Seeker
b6b96b233c don't require read_json_file to expect particular annotations 2016-05-02 15:29:30 +02:00
Matthew Honnibal
902a389d85 * Fix merge conflict in test_parse 2016-05-02 15:28:07 +02:00
Matthew Honnibal
276fbe9996 * Fix assignment of iterator on Doc object 2016-05-02 15:26:24 +02:00
Matthew Honnibal
02c23cc1d0 * Fix sentence boundary test 2016-05-02 15:26:07 +02:00
Matthew Honnibal
d2f469b809 * Fix parsing tests, so that labels are added if they're missing, and so that the branching test values are correct 2016-05-02 15:25:27 +02:00
Wolfgang Seeker
b11cbb06c6 remove old tests for sentence boundary detection 2016-05-02 14:36:35 +02:00
Matthew Honnibal
508fd1f6dc * Refactor noun chunk iterators, so that they're simple functions. Install the iterator when the Doc is created, but allow users to write to the noun_chunk_iterator attribute. The iterator functions accept an object and yield (int start, int end, int label) triples. 2016-05-02 14:25:10 +02:00
Matthew Honnibal
e526be5602 Merge branch 'master' of ssh://github.com/spacy-io/spaCy 2016-05-02 13:08:08 +02:00
Wolfgang Seeker
fa961ea694 add tests for serialization bug 2016-05-02 11:01:56 +02:00
Matthew Honnibal
97b2bba249 * Merge updated/simplified Break approach 2016-04-25 19:44:42 +00:00
Matthew Honnibal
77609588b6 * Fix assignment of root label to words left as root implicitly, after parsing ends. 2016-04-25 19:41:59 +00:00
Matthew Honnibal
7c2d2deaa7 * Revise transition system so that the Break transition retains sole responsibility for setting sentence boundaries. Re Issue #322 2016-04-25 19:41:59 +00:00
Wolfgang Seeker
c2f76a4024 Merge branch 'master' into german_ner 2016-04-25 13:21:23 +02:00
Wolfgang Seeker
1003e7ccec remove debug output from tests 2016-04-25 12:12:40 +02:00
Wolfgang Seeker
f57f843e85 fix bug in updating tree structure when introducing additional roots 2016-04-25 12:01:19 +02:00
Matthew Honnibal
478a8d1829 * Register Chinese language in spacy/__init__.py 2016-04-24 18:45:16 +02:00
Matthew Honnibal
8569dbc2d0 * Add initial stuff for Chinese parsing 2016-04-24 18:44:24 +02:00
Wolfgang Seeker
4d7f393fae don't require json-files to have syntactic annotation 2016-04-22 16:32:27 +02:00
Wolfgang Seeker
b6477fc4f4 adjusted tests to Travis Setup 2016-04-21 17:15:10 +02:00
Wolfgang Seeker
736ffcb9a2 remove whitespace 2016-04-21 16:55:55 +02:00
Wolfgang Seeker
6c7301cc6d the parser now introduces sentence boundaries properly when predicting dependents with root labels 2016-04-21 16:50:53 +02:00
Wolfgang Seeker
12024b0b0a bugfix: introducing multiple roots now updates original head's properties
adjust tests to rely less on statistical model
2016-04-20 16:42:41 +02:00
Matthew Honnibal
67ce96c9c9 * Make patterns argument to Matcher class optional 2016-04-17 21:32:24 +02:00
Matthew Honnibal
8b4677d34d * Add missing keyword arguments to spacy.load() function 2016-04-17 21:31:50 +02:00
Matthew Honnibal
2add5206aa * Fix description of matcher test 2016-04-17 15:40:21 +02:00
Matthew Honnibal
2b419d5b8c * Update test for Issue #242 2016-04-17 15:34:23 +02:00
Matthew Honnibal
f12b043308 * Add test for Issue #242: Overlapping matches not well recognised. 2016-04-17 15:19:17 +02:00
Wolfgang Seeker
b98cc3266d bugfix: iterators now reset properly when called a second time 2016-04-15 17:49:16 +02:00
Wolfgang Seeker
e6945c4d0e bugfix: uppercase attr values before looking them up 2016-04-15 15:46:31 +02:00
Matthew Honnibal
c0909afe22 Merge pull request #312 from wbwseeker/space_head_bug
add restrictions to L-arc and R-arc to prevent space heads
2016-04-15 20:36:03 +10:00
Wolfgang Seeker
289b10f441 remove some comments 2016-04-14 15:37:51 +02:00
Matthew Honnibal
6f82065761 * Fix infixed commas in tokenizer, re Issue #326. Need to benchmark on empirical data, to make sure this doesn't break other cases. 2016-04-14 11:36:03 +02:00
Matthew Honnibal
0f957dd586 Merge branch 'master' of ssh://github.com/honnibal/spaCy 2016-04-14 10:37:56 +02:00
Matthew Honnibal
108aca0e50 * Make Matcher use attrs from the attrs.pyx file, rather than having an incomplete function doing the mapping. 2016-04-14 10:37:39 +02:00
Matthew Honnibal
61d20de35d * Fix language.py docstring 2016-04-14 10:36:57 +02:00
Wolfgang Seeker
d99a9cbce9 different handling of space tokens
space tokens are now always attached to the previous non-space token
there are two exceptions:
leading space tokens are attached to the first following non-space token
in input that consists exclusively of space tokens, the last space token
is the head of all others.
2016-04-13 15:28:28 +02:00
Matthew Honnibal
04d0209be9 * Recognise multiple infixes in a token. 2016-04-13 18:38:26 +10:00
Henning Peters
a473d6e937 fix tests (use english model) 2016-04-12 16:41:57 +02:00
Henning Peters
f2d011c034 avoid polluting spacy namespace with lang classes 2016-04-12 16:31:16 +02:00
Henning Peters
ff690f76ba fix loading non-german models 2016-04-12 16:00:56 +02:00
Henning Peters
6215272786 remove ujson as default non-dev dependency (still works as fallback if installed), because ujson doesn't ship wheels 2016-04-12 11:28:07 +02:00
Matthew Honnibal
6df3858dbc * Fix Issue #323: Incorrect semantics of Token.__str__ built-in. Add flag to allow users to switch the old semantics back on, to ease transition. 2016-04-12 13:17:59 +10:00
Wolfgang Seeker
d328e0b4a8 Merge branch 'master' into space_head_bug 2016-04-11 12:11:01 +02:00
Wolfgang Seeker
80bea62842 bugfix in unit test 2016-04-08 16:46:44 +02:00
Wolfgang Seeker
be4903a1b2 update version numbers 2016-04-08 13:54:05 +02:00
Wolfgang Seeker
1fe911cdb0 bigfix 2016-04-07 18:19:51 +02:00
Matthew Honnibal
872695759d Merge pull request #306 from wbwseeker/german_noun_chunks
add German noun chunk functionality
2016-04-08 00:54:24 +10:00
Henning Peters
470cdf5bf9 remove deprecated LOCAL_DATA_DIR 2016-04-05 11:25:54 +02:00
Matthew Honnibal
26622f0ffc Merge branch 'master' of ssh://github.com/honnibal/spaCy 2016-03-29 14:31:52 +11:00
Matthew Honnibal
b1fe41b45d * Extend infix test, commenting on limitation of tokenizer w.r.t. infixes at the moment. 2016-03-29 14:31:05 +11:00
Matthew Honnibal
9c73983bdd * Add test for hyphenation problem in Issue #302 2016-03-29 14:27:13 +11:00
Matthew Honnibal
ad119c074f * Fix incorrect whitespacing in Doc.text. This change is potentially breaking, to anyone who was relying on the previous incorrect semantics. 2016-03-29 13:02:42 +11:00
Matthew Honnibal
8c7a1908ee Merge pull request #307 from scoder/faster_string_store
remove internal redundancy and overhead from StringStore
2016-03-29 12:59:52 +11:00
Wolfgang Seeker
7195b6742d add restrictions to L-arc and R-arc to prevent space heads 2016-03-28 10:40:52 +02:00
Matthew Honnibal
8c77a994c6 Merge pull request #305 from henningpeters/master
multiple langs in download script
2016-03-26 21:54:59 +11:00
Henning Peters
c90d4a6f17 relative imports in __init__.py 2016-03-26 11:44:53 +01:00
Henning Peters
db095a162c fix 2016-03-25 18:59:47 +01:00
Henning Peters
b8f63071eb add lang registration facility 2016-03-25 18:54:45 +01:00
Matthew Honnibal
4a37fdcee1 Merge pull request #287 from wbwseeker/deproj_sentbnd_bug
add function to Token for setting head and dep (and dep_)
2016-03-25 09:47:45 +11:00
Stefan Behnel
f18805ee1c make StringStore.__contains__() return True for the empty string (which is also contained in iteration) 2016-03-24 15:42:12 +01:00
Stefan Behnel
f2cfbfc412 remove internal redundancy and overhead from StringStore 2016-03-24 15:25:27 +01:00
Wolfgang Seeker
d65ef41d08 make error messages language independent 2016-03-24 11:47:09 +01:00
Henning Peters
963570aa49 Merge branch 'master' of github.com:spacy-io/spaCy 2016-03-24 11:19:47 +01:00
Henning Peters
a7d7ea3afa first idea for supporting multiple langs in download script 2016-03-24 11:19:43 +01:00
Wolfgang Seeker
5080077097 revert init_model.py back to pre-german state (because it makes more sense)
simplify token.n_rights and token.n_lefts
2016-03-21 16:10:25 +01:00
Wolfgang Seeker
5e2e8e951a add baseclass DocIterator for iterators over documents
add classes for English and German noun chunks

the respective iterators are set for the document when created by the parser
as they depend on the annotation scheme of the parsing model
2016-03-16 15:53:35 +01:00
Matthew Honnibal
80134eb12d Merge branch 'master' of https://github.com/spacy-io/spaCy 2016-03-15 19:14:50 +00:00
Wolfgang Seeker
2ae253ef5b changed head.__set__ to make it simpler 2016-03-14 13:43:48 +01:00
Henning Peters
c12d3dd200 add __init__.py to empty package dirs 2016-03-14 11:28:03 +01:00
Henning Peters
54f3447b5f cleanup 2016-03-14 01:46:33 +01:00
Wolfgang Seeker
46e3f979f1 add function for setting head and label to token
change PseudoProjectivity.deprojectivize to use these functions
2016-03-11 17:31:06 +01:00
Wolfgang Seeker
03fb498dbe introduce lang field for LexemeC to hold language id
put noun_chunk logic into iterators.py for each language separately
2016-03-10 13:01:34 +01:00
Wolfgang Seeker
bc9c62e279 replace Language functions with corresponding orth functions
implement punctuation functions in orth
2016-03-09 18:07:37 +01:00
Wolfgang Seeker
d9312bc9ea add new files npchunks.{pyx,pxd} to hold noun phrase chunk generators 2016-03-09 16:18:48 +01:00
Matthew Honnibal
1508528c8c * Increment version 2016-03-08 15:58:45 +00:00
Matthew Honnibal
963fe5258e * Add missing __contains__ method to vocab 2016-03-08 15:49:10 +00:00
Matthew Honnibal
478aa21cb0 * Remove broken __reduce__ method on vocab 2016-03-08 15:48:21 +00:00
Matthew Honnibal
20235bde00 Merge pull request #282 from henningpeters/switch_vectors
initial proposal for ability to switch vectors
2016-03-09 01:39:41 +11:00
Henning Peters
eb7ae61b1c cleanup api 2016-03-08 12:59:18 +01:00
Henning Peters
b740f20191 hash_string() should not depend on python's internal unicode representation, also fixes https://github.com/spacy-io/sense2vec/issues/5 for py2 2016-03-06 09:19:27 +01:00
Henning Peters
aa4d964c14 cleanup api 2016-03-05 17:51:32 +01:00
Henning Peters
931c07a609 initial proposal for separate vector package 2016-03-04 11:09:06 +01:00
Wolfgang Seeker
7adbd7a785 replace Counter with normal dict 2016-03-03 21:36:27 +01:00
Wolfgang Seeker
1ae487a4f6 add backwards compatibility with python 2.6 2016-03-03 21:18:12 +01:00
Wolfgang Seeker
9d1e6de4a0 make a proper list from zip iterator 2016-03-03 19:51:01 +01:00
Wolfgang Seeker
49f9d1c085 change test_nonproj.py to not use zip inside numpy.asarray 2016-03-03 19:42:09 +01:00
Wolfgang Seeker
72b8df0684 turned PseudoProjectivity into a normal python class 2016-03-03 19:05:08 +01:00
Matthew Honnibal
fcaa0ad7ce Merge pull request #280 from wbwseeker/german_parser
German parser
2016-03-04 03:27:42 +11:00
Wolfgang Seeker
690c5acabf adjust train.py to train both english and german models 2016-03-03 15:21:00 +01:00
Wolfgang Seeker
3448cb40a4 integrated pseudo-projective parsing into parser
- nonproj.pyx holds a class PseudoProjectivity which currently holds
  all functionality to implement Nivre & Nilsson 2005's pseudo-projective
  parsing using the HEAD decoration scheme
- changed lefts/rights in Token to account for possible non-projective
  structures
2016-03-01 10:09:08 +01:00
Wolfgang Seeker
56b7210e82 moved nonproj.py to syntax/nonproj.pyx 2016-02-25 15:08:49 +01:00
Henning Peters
f3df736e0a remove unidecode-related test 2016-02-24 18:22:22 +01:00
Wolfgang Seeker
4b2297d5d4 add class PseudoProjective for pseudo-projective parsing
PseudoProjective() implements the algorithm from Nivre & Nilsson 2005
using their HEAD decoration scheme.
2016-02-24 11:26:25 +01:00
Henning Peters
12d58a7099 remove text-unidecode dependency 2016-02-24 08:01:59 +01:00
Wolfgang Seeker
8d531c958b replace tests for non-projectivity
- add functions to find non-projective edges
- add test file for non-projectivity functions
2016-02-22 14:40:40 +01:00
Matthew Honnibal
141639ea3a * Fix bug in tokenizer that caused new tokens to be added for affixes 2016-02-21 23:17:47 +00:00
Wolfgang Seeker
eae35e9b27 add tokenizer files for German, add/change code to train German pos tagger
- add files to specify rules for German tokenization
- change generate_specials.py to generate from an external file (abbrev.de.tab)
- copy gazetteer.json from lang_data/en/

- init_model.py
	- change doc freq threshold to 0
- add train_german_tagger.py
	- expects conll09-formatted input
2016-02-18 13:24:20 +01:00
Henning Peters
9cc4f8d5b3 avoid shadowing __name__ 2016-02-15 01:33:39 +01:00
Henning Peters
4c9e3c7911 upgrade spuntik, enforce data api via model version constraints 2016-02-14 16:03:17 +01:00
Henning Peters
9d8966a2c0 Update test_tokenizer.py 2016-02-10 19:24:37 +01:00
Henning Peters
3b5f1e753b py26 compatibility 2016-02-10 14:32:54 +01:00
Henning Peters
ee1f1ac300 mark test_sentence_space() as model test 2016-02-10 07:49:11 +01:00
Matthew Honnibal
5d96b3ef4f * Increment version 2016-02-07 13:48:58 +01:00
Matthew Honnibal
1b83cb9dfa * Fix Issue #251: Incorrect right edge calculation on left-clobber low in the tree 2016-02-07 00:00:42 +01:00
Matthew Honnibal
c6623889c1 * Add test for Issue #251: Incorrect right edges, caused by bad update to r_edge in del_arc, triggered from non-monotonic left-arc 2016-02-06 23:47:51 +01:00
Matthew Honnibal
a95974ad3f * Fix oov probability 2016-02-06 15:13:55 +01:00
Matthew Honnibal
af8514cb0c * Refine the way the is_parsed attribute is set by from_array 2016-02-06 14:44:35 +01:00
Matthew Honnibal
161b01d4c0 * Tweak usage example for multi-processing 2016-02-06 14:44:11 +01:00
Matthew Honnibal
7f24229f10 * Don't try to pickle the tokenizer 2016-02-06 14:09:05 +01:00
Matthew Honnibal
dcb401f3e1 * Remove broken Vocab pickling 2016-02-06 14:08:47 +01:00
Matthew Honnibal
e66d45bf66 * Restore previous patch to Span.root, as it seems it wasn't the cause of the problem. 2016-02-06 13:37:41 +01:00
Matthew Honnibal
4412a70dc5 * Initialize StateC._empty_token to 0, to avoid undefined behaviour. 2016-02-06 13:34:38 +01:00
Matthew Honnibal
1b41f868d2 * Check for errors in parser, and parallelise the left-over batch 2016-02-06 10:06:30 +01:00
Matthew Honnibal
031b00cb91 * Fix Span.root calculation 2016-02-05 20:12:09 +01:00
Matthew Honnibal
165ca28b80 * Set is_parsed flag in Parser.pipe 2016-02-05 19:51:44 +01:00
Matthew Honnibal
bdd579db0a * Set is_parsed flag in Parser.pipe 2016-02-05 19:50:11 +01:00
Matthew Honnibal
7119e77fb6 * Fix Matcher.pipe 2016-02-05 19:46:02 +01:00
Matthew Honnibal
1cf0100bf6 * Add test for multithreading 2016-02-05 19:38:22 +01:00
Matthew Honnibal
b04c9aad71 * Fix off-by-one in Parser.pipe 2016-02-05 19:37:50 +01:00
Matthew Honnibal
e5c447e237 * Questionable fix to problem in Span.root 2016-02-05 19:18:35 +01:00
Matthew Honnibal
1ef84a0557 * Merge master into rethinc2 2016-02-05 12:55:59 +01:00
Matthew Honnibal
4cf34fc170 Merge branch 'rethinc2' of ssh://github.com/honnibal/spaCy into rethinc2 2016-02-05 12:48:28 +01:00
Matthew Honnibal
249dccbe95 * Fix Language.pipe 2016-02-05 12:47:57 +01:00
Matthew Honnibal
c0e63feccc * xfail pickle tests 2016-02-05 12:46:58 +01:00
Matthew Honnibal
6aa92b70f1 * Fix merge problem in span 2016-02-05 12:46:11 +01:00
Matthew Honnibal
048dfe35aa * cimport cython.parallel 2016-02-05 12:20:42 +01:00
Matthew Honnibal
af58f273b3 * Fix spacy.language.pipe 2016-02-05 12:20:29 +01:00
Matthew Honnibal
8a13cebdcc * Update for modified thinc interface 2016-02-05 11:44:39 +01:00
Matthew Honnibal
48ce09687d * Skip pickling the vocab in the tests 2016-02-04 15:51:19 +01:00
Matthew Honnibal
419edfab50 * Use generic flags for the new attributes until they're added 2016-02-04 15:50:54 +01:00
Matthew Honnibal
c4017a06d9 * Add placeholders for the new flags in attrs and symbols 2016-02-04 15:49:45 +01:00
Matthew Honnibal
e5c96c969f * Wire up new attributes 2016-02-04 13:04:58 +01:00
Matthew Honnibal
9703ccc3de * Remove unused import 2016-02-04 13:04:33 +01:00
Matthew Honnibal
11810be33e * Add Python hooks for is_bracket/is_quote/is_left_punct/is_right_punct 2016-02-04 13:04:16 +01:00
Matthew Honnibal
fe611132f0 * Add stubs for is_bracket/is_quote/is_left_punct/is_right_punct functions 2016-02-04 13:03:04 +01:00
Matthew Honnibal
ee975d36d0 * Add stubs to test is_bracket/is_quote/is_left_punct/is_right_punct functions 2016-02-04 13:02:25 +01:00
Matthew Honnibal
f9e765cae7 * Add pipe() method to tokenizer 2016-02-03 02:32:37 +01:00
Matthew Honnibal
4cbad510ff * Fix calculation of head for spans with punctuation. 2016-02-03 02:32:21 +01:00
Matthew Honnibal
84b247ef83 * Add a .pipe method, that takes a stream of input, operates on it, and streams the output. Internally, the stream may be buffered, to allow multi-threading. 2016-02-03 02:10:58 +01:00
Matthew Honnibal
fcfc17a164 Merge branch 'master' into rethinc2 2016-02-02 23:05:34 +01:00
Matthew Honnibal
f204daf27b * Add error warning that a gold tag is unrecognised 2016-02-02 22:59:59 +01:00
Matthew Honnibal
99b8906100 * Accept punct_labels as an argument to the scorer 2016-02-02 22:59:06 +01:00
Matthew Honnibal
59123443e2 * Check for presence/absence of the different models in Language.end_training 2016-02-02 22:49:55 +01:00
Matthew Honnibal
9e9d4c8706 * Fix stupid error in Language.batch 2016-02-01 09:49:32 +01:00
Matthew Honnibal
e3db39dd21 * Fix compiler warning about signed/unsigned comparison 2016-02-01 09:08:07 +01:00
Matthew Honnibal
98fbdf2856 * Add Language.batch() method, to support multi-threaded jobs 2016-02-01 09:01:13 +01:00
Matthew Honnibal
b3802562d6 Merge branch 'rethinc2' of https://github.com/honnibal/spaCy into rethinc2 2016-02-01 08:59:24 +01:00
Matthew Honnibal
4b08a3fafd * Fix merge conflict 2016-02-01 08:58:18 +01:00
Matthew Honnibal
5188f6d9d8 * Fix parseC function 2016-02-01 08:48:48 +01:00
Matthew Honnibal
bcf8f7ba40 * Add a parse_batch method to Parser, that releases the GIL around a batch of documents. 2016-02-01 08:34:55 +01:00
Matthew Honnibal
d5579cd0d8 Merge branch 'rethinc2' of https://github.com/honnibal/spaCy into rethinc2 2016-02-01 03:08:49 +01:00
Matthew Honnibal
490ba65398 * Use openmp in parser 2016-02-01 03:08:42 +01:00
Matthew Honnibal
cb78d91ec5 * Fix ArcEager.set_valid 2016-02-01 03:07:37 +01:00
Matthew Honnibal
28e5ad62bc * Pass a StateC pointer into the transition and validation methods in the parser, so that the GIL can be released over a batch of documents 2016-02-01 03:00:15 +01:00
Matthew Honnibal
a47f00901b * Pass a StateC pointer into the transition and validation methods in the parser, so that the GIL can be released over a batch of documents 2016-02-01 02:58:14 +01:00
Matthew Honnibal
daaad66448 * Now fully proxied 2016-02-01 02:37:08 +01:00
Matthew Honnibal
7a0e3bb9c1 * Continue proxying. Some problem currently 2016-02-01 02:22:21 +01:00
Matthew Honnibal
2169bbb7ea * Shadow StateClass with StateC, to start proxying 2016-02-01 01:16:14 +01:00
Matthew Honnibal
2fa228458e * Add _state file, which StateClass will proxy to 2016-02-01 01:09:21 +01:00
Matthew Honnibal
6bb007d16e * Make set_parse nogil 2016-01-30 20:27:52 +01:00
Matthew Honnibal
9410e74c92 * Switch parser to use nogil functions 2016-01-30 20:27:07 +01:00
Matthew Honnibal
10877a7791 * Update for thinc 5.0, including changing cost from int to weight_t, and updating the tagger and parser 2016-01-30 14:31:36 +01:00
Matthew Honnibal
ea4ff94cde * Whitespace 2016-01-29 03:59:22 +01:00
Matthew Honnibal
b0718b6ee1 * Move to thinc 5.0 2016-01-29 03:58:55 +01:00
Matthew Honnibal
9721502c81 * Update version 2016-01-25 15:52:59 +01:00
Matthew Honnibal
907e8cf07d * Add u prefix to string in web example 2016-01-25 15:51:38 +01:00
Matthew Honnibal
eba03695ef * Comment out pickle tests 2016-01-25 15:51:13 +01:00
Matthew Honnibal
de94e6c525 * Mark pickle tests as xfail, due to temp files problem 2016-01-25 15:24:17 +01:00
Matthew Honnibal
87172a15c6 * Fix runtime error bug that arose from updated Span.root function. 2016-01-25 15:22:42 +01:00
Matthew Honnibal
2c8dd91785 * Fix first code example on the website 2016-01-23 18:09:19 +01:00
Matthew Honnibal
3af84cfd6e * Increment version 2016-01-21 17:49:27 +01:00
Henning Peters
65aeac24cb remove package version constraint 2016-01-21 17:40:51 +01:00
Matthew Honnibal
792c98a438 * Increment version for OSX-fixed release of v0.100 2016-01-21 00:23:04 +01:00
Matthew Honnibal
82d011ac43 * Fix test for whitespace 2016-01-19 20:38:26 +01:00
Matthew Honnibal
e89069dcae * Fix matcher test 2016-01-19 20:24:01 +01:00
Matthew Honnibal
63e3d4e27f * Add comment on Vocab.__reduce__ 2016-01-19 20:11:25 +01:00
Matthew Honnibal
e1282b7f2f * Require user-custom NER classes to work without adding the label. 2016-01-19 20:11:03 +01:00
Matthew Honnibal
84c5dfbfc3 * Clean up debugging python list 2016-01-19 20:10:32 +01:00
Matthew Honnibal
04d0686b26 * Make TransitionSystem.add_action idempotent, i.e. ignore duplicate added actions. 2016-01-19 20:10:04 +01:00
Matthew Honnibal
c4a89d56bd * Automatically register any entity types pre-set on the tokens, so that the NER works with user-given entity types. 2016-01-19 20:09:26 +01:00
Matthew Honnibal
f0f92793f6 * Add test for user NER classes in matcher blocking the NER model. Re Issue #178 and Issue #217 2016-01-19 19:23:16 +01:00
Matthew Honnibal
65c5bc4988 * Add add_label method, to allow users to register new entity types and dependency labels. 2016-01-19 19:11:02 +01:00
Matthew Honnibal
151aa0b0e2 * Allow users to add_label, in order to extend the entity recogniser to new classes. Does not by itself add a class to the model 2016-01-19 19:09:33 +01:00
Matthew Honnibal
c8e0011ebc * Add iterators to the NER and parser transition systems, to get the action types 2016-01-19 19:07:43 +01:00
Matthew Honnibal
515493c675 * Add xfail test for Issue #225: tokenization with non-whitespace delimiters 2016-01-19 13:20:14 +01:00
Matthew Honnibal
7abe653223 * Fix imports 2016-01-19 03:36:51 +01:00