Matthew Honnibal
5c66cffafd
Add tag map for Spanish
2017-03-16 18:05:15 -05:00
Matthew Honnibal
c4351e1165
Update base-form check in lemmatizer, for UD 2.0 morphology
2017-03-16 17:59:31 -05:00
Matthew Honnibal
1e10383e1b
Merge branch 'master' of https://github.com/explosion/spaCy
2017-03-16 17:41:13 -05:00
Matthew Honnibal
859315863a
Merge branch 'master' of https://github.com/explosion/spaCy
2017-03-16 17:40:07 -05:00
Matthew Honnibal
fea9fe08af
Merge pull request #866 from juanmirocks/master
...
Fix lemmatization of OOV words
2017-03-16 23:37:36 +01:00
Matthew Honnibal
ffd4a19383
Increment version
2017-03-16 17:35:57 -05:00
Matthew Honnibal
28bb546939
Merge pull request #883 from ericzhao28/master
...
Add `lower_` and `upper_` properties to `Span` class
2017-03-16 23:35:47 +01:00
ines
fd60961825
Fix spacing
2017-03-16 23:23:26 +01:00
Matthew Honnibal
890747d8ff
Fix trailing whitespace on morphology features
2017-03-16 17:07:37 -05:00
Matthew Honnibal
af41a9790c
Merge remote-tracking branch 'origin/develop-downloads'
2017-03-16 20:41:37 +01:00
Matthew Honnibal
303a56f173
Get absolute path for linking
2017-03-16 20:41:23 +01:00
ines
3d484c3faf
Don't print in parse_package_meta and accept on_error callback instead
...
TODO: log warning for missing meta data in spacy.link, as this affects
the Language class returned by spacy.load()
2017-03-16 20:34:50 +01:00
ines
d8c984b65e
Don't exit if no model meta data is present
2017-03-16 20:33:33 +01:00
Matthew Honnibal
2524efc0ac
Merge remote-tracking branch 'origin/develop-downloads'
2017-03-16 20:20:41 +01:00
ines
8253581057
Link model automatically if not direct download
2017-03-16 19:54:51 +01:00
Matthew Honnibal
8843b84bd1
Merge remote-tracking branch 'origin/develop-downloads'
2017-03-16 12:00:42 -05:00
Matthew Honnibal
55f813bfbb
Don't reapply the model during training
2017-03-16 11:59:43 -05:00
Matthew Honnibal
c90dc7ac29
Clean up state initialisation in transition system
2017-03-16 11:59:11 -05:00
Matthew Honnibal
a46933a8fe
Clean up FTRL parsing stuff.
2017-03-16 11:58:20 -05:00
ines
618ce3b425
Add .meta to Language object
...
Allows getting the current model's meta data, e.g.:
nlp = spacy.load('my-model')
print(nlp.meta)
2017-03-16 17:14:56 +01:00
ines
e348d4434c
Add spacy.info(model_name) to show model meta
...
Allows "previewing" model before loading and making sure it's linked
correctly.
2017-03-16 17:13:40 +01:00
ines
eea3b35e3f
Update model loading to support links
...
Remove match_best_version check, fetch model language from meta instead
of directory name, and don't make too many assumptions – if model is
downloaded via downloader, version should match anyway. (Otherwise,
users should be free to add and load whichever models they want.)
2017-03-16 17:13:08 +01:00
ines
5f3f04bd0a
Add util function to load and parse package meta.json
2017-03-16 17:10:05 +01:00
ines
7f920c2f75
Don't break text when rendering print_msg
2017-03-16 17:09:50 +01:00
ines
16a63d9676
Add docstring
2017-03-16 17:09:11 +01:00
ines
68c04fa897
Move sys_exit() function to util
2017-03-16 17:08:58 +01:00
ines
ccd1a79988
Add spacy.link module to link model directories to shortcuts
2017-03-16 17:01:51 +01:00
Matthew Honnibal
2611ac2a89
Fix scorer bug for NER, related to ambiguity between missing annotations and misaligned tokens
2017-03-16 09:38:28 -05:00
ines
595d89698a
Add basestring
2017-03-16 10:01:14 +01:00
ines
7b2eca36e4
Revert "Fix formatting and remove unused code"
...
This reverts commit d7898d586f.
2017-03-16 09:58:41 +01:00
ines
2f0db1dd36
Use small English model as default
2017-03-16 09:54:40 +01:00
Matthew Honnibal
3d0833c3df
Fix off-by-1 in parse features fill_context
2017-03-15 19:55:35 -05:00
Matthew Honnibal
4ef68c413f
Approximate cost in Break transition, to speed things up a bit.
2017-03-15 16:40:27 -05:00
Matthew Honnibal
8543db8a5b
Use ftrl optimizer in parser
2017-03-15 11:56:37 -05:00
ines
4cfc8ffbd2
Reformat pickle tests
2017-03-15 17:39:54 +01:00
ines
2a0fcf1354
Add tests for new download module
2017-03-15 17:39:43 +01:00
ines
71956c94db
Handle deprecated language-specific model downloading
2017-03-15 17:37:55 +01:00
ines
58b884b6d4
Refactor download script and about.py to use new download method
2017-03-15 17:37:18 +01:00
ines
f5d1a39a5b
Add util functions for printing and wrapping messages
2017-03-15 17:35:57 +01:00
ines
d7898d586f
Fix formatting and remove unused code
2017-03-15 17:35:41 +01:00
ines
b672e95045
Fix formatting
2017-03-15 17:35:04 +01:00
ines
0474e706a0
Remove unused deprecated functions for sputnik
2017-03-15 17:34:54 +01:00
ines
b13e7f79b4
Fix formatting and remove unused imports
2017-03-15 17:33:57 +01:00
ines
1101fd3855
Fix formatting and remove unused imports
2017-03-15 17:33:39 +01:00
ines
842782c128
Move fix_deprecated_glove_vectors_loading to deprecated.py
2017-03-15 17:33:29 +01:00
Matthew Honnibal
4cab8ac136
Update morph exceptions test
2017-03-15 09:31:34 -05:00
Matthew Honnibal
d719f8e77e
Use nogil in parser, and set L1 to 0.0 by default
2017-03-15 09:31:01 -05:00
Matthew Honnibal
c61c501406
Update beam-parser to allow parser to maintain nogil
2017-03-15 09:30:22 -05:00
Matthew Honnibal
3d4e389d23
Whitespace
2017-03-15 09:29:42 -05:00
Matthew Honnibal
7769bc31e3
Add beam-search classes
2017-03-15 09:27:41 -05:00
Matthew Honnibal
c79b3129e3
Fix setting of empty lexeme in initial parse state
2017-03-15 09:26:53 -05:00
Matthew Honnibal
d864708072
Add more morphology names in attrs.pyx
2017-03-15 09:26:16 -05:00
Matthew Honnibal
b382dc902c
Add morph rules in Language
2017-03-15 09:24:40 -05:00
Matthew Honnibal
8dbff4f5f4
Wire up English lemma and morph rules.
2017-03-15 09:23:22 -05:00
Matthew Honnibal
f70be44746
Use lemmatizer in code, not from downloaded model.
2017-03-15 04:52:50 -05:00
ines
42ba740dde
Revert "Merge branch 'debug'"
...
This reverts commit 89b79d1178, reversing
changes made to 02bdf490a1.
2017-03-13 20:11:52 +01:00
ines
4c5f51e49e
Update regression test
2017-03-13 15:16:11 +01:00
ines
02bdf490a1
Remove regression test to see if it caused pytest Travis error
2017-03-13 13:00:22 +01:00
ines
17018750ac
Add regression test for #717
2017-03-13 12:58:22 +01:00
ines
2883ebfca2
Remove print statement
2017-03-13 12:30:42 +01:00
ines
98c13d8aa9
Add regression test for #401
2017-03-13 12:28:41 +01:00
ines
444d665f9d
Add regression test for #686
2017-03-13 12:23:35 +01:00
ines
46b17e5b51
Add regression test for #719
2017-03-13 12:17:35 +01:00
ines
c8ae682ff9
Add regression test for #636
2017-03-13 12:08:31 +01:00
ines
337f9601f2
Add missing unicode declaration
2017-03-13 12:08:19 +01:00
ines
d70386ec6e
Update docstring in #886 regression test
2017-03-13 12:00:38 +01:00
ines
51ba3ef0a8
Add regression test for #886
2017-03-13 11:44:58 +01:00
ines
eec3f21c50
Add WordNet license
2017-03-12 13:58:24 +01:00
ines
f9e603903b
Rename stop_words.py to word_sets.py and include more sets
...
NUM_WORDS and ORDINAL_WORDS are currently not used, but the hard-coded
list should be removed from orth.pyx and replaced with
language-specific functions. This will later allow other languages to
use their own functions to set those flags. (In English, this is easier
because it only needs to be checked against a set – in German, for
example, this requires a more complex function, as most number words
are written as one word.)
2017-03-12 13:58:22 +01:00
ines
f24f9b4b7b
Remove unused code
2017-03-12 13:58:22 +01:00
ines
1da29a7146
Use new Lemmatizer data and remove file import
...
Since there's currently only an English lemmatizer, the global
Lemmatizer imports from spacy.en. This is not ideal and still needs to
be fixed.
2017-03-12 13:58:22 +01:00
ines
0957737ee8
Add Python-formatted lemmatizer data and rules
2017-03-12 13:58:22 +01:00
ines
c89e30d1a3
Add test for English time exceptions ("1a.m." etc.)
2017-03-12 13:58:22 +01:00
ines
ce9568af84
Move English time exceptions ("1a.m." etc.) and refactor
2017-03-12 13:58:22 +01:00
ines
6b30541774
Fix formatting
2017-03-12 13:58:22 +01:00
Ines Montani
e97a30b99a
Merge pull request #885 from PySUST/master
...
[Bengali] Spell checked and add new stop words
2017-03-12 13:20:59 +01:00
ines
66c1f194f9
Use consistent unicode declarations
2017-03-12 13:07:28 +01:00
shuvanon
91cb4cdb2b
Sort stop_words
2017-03-12 17:55:51 +06:00
shuvanon
784f6cfa49
Update stop_words
2017-03-12 17:41:01 +06:00
shuvanon
73cc17078e
Merge branch 'master' of https://github.com/PySUST/spaCy
2017-03-12 14:52:17 +06:00
shuvanon
35ec7135bb
Spell checked and add new stop words
2017-03-12 14:51:34 +06:00
Em
9c809efc25
Removed mapStr
2017-03-11 16:23:26 -08:00
Matthew Honnibal
fa23278ee3
Add classes for beam parser and beam NER
2017-03-11 12:45:37 -06:00
Matthew Honnibal
6c4108c073
Add header for beam parser
2017-03-11 12:45:12 -06:00
Matthew Honnibal
4382f175b3
Squelch compiler warnings
2017-03-11 12:44:43 -06:00
Matthew Honnibal
ea2592879f
Merge branch 'master' of https://github.com/explosion/spaCy
2017-03-11 11:13:37 -06:00
Matthew Honnibal
1224c4d3c6
Improve output on trainer
2017-03-11 11:12:48 -06:00
Matthew Honnibal
b438dfd3f3
Add itn argument to tagger.update
2017-03-11 11:12:21 -06:00
Matthew Honnibal
931feb3360
Allow beam parsing for NER
2017-03-11 11:12:01 -06:00
Matthew Honnibal
f77a5bb60a
Switch back to greedy parser
2017-03-11 11:11:30 -06:00
Matthew Honnibal
ca9c8c57c0
Add iteration argument to parser.update
2017-03-11 07:00:47 -06:00
Matthew Honnibal
dcce9ca3f3
Use beam parser
2017-03-11 07:00:20 -06:00
Matthew Honnibal
e30ffdd003
Use ftrl optimizer in tagger
2017-03-11 06:59:13 -06:00
Matthew Honnibal
d59c6926c1
I think this fixes the segfault
2017-03-11 06:58:34 -06:00
Matthew Honnibal
318b9e32ff
WIP on beam parser. Currently segfaults.
2017-03-11 06:19:52 -06:00
Em
426d17167f
Added string manipulation for spans
2017-03-10 16:50:02 -08:00
Matthew Honnibal
b0d80dc9ae
Update name of 'train' function in BeamParser
2017-03-10 14:35:43 -06:00
Matthew Honnibal
d11f1a4ddf
Record negative costs in non-monotonic arc eager oracle
2017-03-10 11:22:04 -06:00
Matthew Honnibal
ecf91a2dbb
Support beam parser
2017-03-10 11:21:21 -06:00
Ines Montani
a16aff17aa
Merge pull request #876 from PySUST/master
...
[Bangla] Update "tokenizer_exceptions.py"
2017-03-10 14:46:00 +01:00
ines
10e29189ac
Adjust URL testcases and xfail problems (instead of comment)
2017-03-10 14:22:50 +01:00
ines
b04893a059
Make regex locale-independent for Python 2
2017-03-10 14:21:57 +01:00
Matthew Honnibal
ea53647362
Merge branch 'develop'
2017-03-10 02:49:39 -06:00
Ines Montani
1c40890321
Add missing comma
...
Should fix Travis build error
2017-03-10 09:34:54 +01:00
Shuvanon Razik
c251703428
Update abbreviations
2017-03-10 10:45:01 +06:00
Matthew Honnibal
b5247c49eb
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-03-09 18:45:43 -06:00
Matthew Honnibal
798450136d
Set L1 penalty to 0 in tagger.
2017-03-09 18:43:47 -06:00
Matthew Honnibal
c62da02344
Use ftrl training, to learn compressed model.
2017-03-09 18:43:21 -06:00
Matthew Honnibal
f71eeef9bb
Pass path argument to end_training
2017-03-09 18:42:40 -06:00
Dan Rapp
123d3f2d38
Fix error in test case parameterization
2017-03-09 12:18:21 -07:00
Dan Rapp
b9307dfcd7
Merge branch 'master' into rappdw/tokenizer_exceptions_url_fix
2017-03-09 11:42:14 -07:00
Dan Rapp
3b1df3808d
Issue #840 - URL pattern too broad
2017-03-09 11:39:39 -07:00
Matthew Honnibal
5b0b968d13
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-03-08 15:03:10 +01:00
Matthew Honnibal
0ac3d27689
Fix handling of trailing whitespace
...
Fix off-by-one error that meant trailing spaces were being dropped.
Closes #792
2017-03-08 15:01:40 +01:00
ines
c2e3e651b8
Re-add regression test for #859
2017-03-08 14:36:09 +01:00
Matthew Honnibal
0a6d7ca200
Fix spacing after token_match
...
The boolean flag indicating a space after the token was
being set incorrectly after the token_match regex was applied.
Fixes #859.
2017-03-08 14:33:32 +01:00
shuvanon
85438aee1b
update tokenizer
2017-03-08 17:29:39 +06:00
shuvanon
45bc78461c
update tokenizer
2017-03-08 17:27:12 +06:00
Matthew Honnibal
cd33b39a04
Fix 2/3 problem for json save/load
2017-03-08 01:39:13 +01:00
Matthew Honnibal
40703988bc
Use FTRL training in parser
2017-03-08 01:38:51 +01:00
Matthew Honnibal
d108534dc2
Fix 2/3 problems for training
2017-03-08 01:37:52 +01:00
Matthew Honnibal
d03d6a13f1
Merge branch 'rominf-ud20' into develop
2017-03-07 21:48:56 +01:00
Matthew Honnibal
f7374d0b86
Merge branch 'ud20' of https://github.com/rominf/spaCy into rominf-ud20
2017-03-07 21:48:37 +01:00
Matthew Honnibal
16670d3251
Xfail the vocab pickling for now
2017-03-07 21:43:28 +01:00
Matthew Honnibal
a89c3500f6
Fixes to hacky vocab pickling
2017-03-07 20:58:55 +01:00
Matthew Honnibal
d814892805
Hackish pickle support for Vocab.
2017-03-07 20:25:12 +01:00
Matthew Honnibal
26614e028f
Add hacky support for StringCFile, to make pickling easier.
2017-03-07 20:24:37 +01:00
Matthew Honnibal
3edb8ae207
Whitespace
2017-03-07 17:16:26 +01:00
Matthew Honnibal
5de7e712b7
Add support for pickling StringStore.
2017-03-07 17:15:18 +01:00
Matthew Honnibal
4e75e74247
Update regression test for variable-length pattern problem in the matcher.
2017-03-07 16:08:32 +01:00
Matthew Honnibal
6d67213b80
Add test for 850: Matcher fails on zero-or-more.
2017-03-07 15:55:28 +01:00
Aniruddha Adhikary
696215a3fb
add tests for Bengali
2017-03-05 11:25:12 +06:00
Aniruddha Adhikary
8f3bfe9bfc
[Bengali] basic tag map, morph, lemma rules and exceptions
2017-03-04 12:36:59 +06:00
Roman Inflianskas
66e1109b53
Add support for Universal Dependencies v2.0
2017-03-03 13:17:34 +01:00
ines
8dff040032
Revert "Add regression test for #859"
...
This reverts commit c4f16c66d1.
2017-03-01 21:56:20 +01:00
Juan Miguel Cejuela
25c29f072d
apply patch
2017-03-01 21:44:17 +01:00
Juan Miguel Cejuela
a8cfde46d3
#781 Fix test: colocalizes is lemmatized to colocaliz and colocalize
2017-03-01 21:43:08 +01:00
Juan Miguel Cejuela
a471114eb2
#781 add regression test, failing previous bug fix
2017-03-01 21:30:51 +01:00
ines
c4f16c66d1
Add regression test for #859
2017-03-01 16:07:27 +01:00
Aniruddha Adhikary
d91be7aed4
add punctuations for Bengali
2017-02-28 21:07:14 +06:00
Aniruddha Adhikary
5a4fc09576
add basic Bengali support
2017-02-28 07:48:37 +06:00
Matthew Honnibal
cc9b2b74e3
Merge branch 'french-tokenizer-exceptions'
2017-02-27 11:44:39 +01:00
Matthew Honnibal
bd4375a2e6
Remove comment
2017-02-27 11:44:26 +01:00
Matthew Honnibal
e7e22d8be6
Move import within get_exceptions() function, to speed import
2017-02-27 11:34:48 +01:00
Matthew Honnibal
34bcc8706d
Merge branch 'french-tokenizer-exceptions'
2017-02-27 11:21:21 +01:00
Matthew Honnibal
0aaa546435
Fix test after updating the French tokenizer stuff
2017-02-27 11:20:47 +01:00
Matthew Honnibal
26446aa728
Avoid loading all French exceptions on import
...
Move exceptions loading behind a get_tokenizer_exceptions() function
for French, instead of loading into the top-level namespace. This
cuts import times from 0.6s to 0.2s, at the expense of making the
French data a little different from the others (there's no top-level
TOKENIZER_EXCEPTIONS variable.) The current solution feels somewhat
unsatisfying.
2017-02-25 11:55:00 +01:00
ines
376c5813a7
Remove print statements from test
2017-02-24 18:26:32 +01:00
ines
7c1260e98c
Add regression test
2017-02-24 18:22:49 +01:00
ines
0e2e331b58
Convert exceptions to Python list
2017-02-24 18:22:40 +01:00
ines
51eb190ef4
Remove print statements from test
2017-02-24 17:41:12 +01:00
Matthew Honnibal
db5ada3995
Merge branch 'master' of https://github.com/explosion/spaCy
2017-02-24 14:28:12 +01:00
Matthew Honnibal
8f94897d07
Add 1 operator to matcher, and make sure open patterns are closed at end of document. Closes Issue #766
2017-02-24 14:27:02 +01:00
ines
67991b6e5f
Add more test cases to #775 regression test to cover #847
2017-02-18 14:10:44 +01:00
ines
30ce2a6793
Exclude "shed" and "Shed" from tokenizer exceptions (see #847)
2017-02-18 14:10:44 +01:00
Ines Montani
de997c1a33
Merge pull request #842 from magnusburton/master
...
Added regular verb rules for Swedish
2017-02-17 11:18:20 +01:00
Magnus Burton
41fcfd06b8
Added regular verb rules for Swedish
2017-02-17 10:04:04 +01:00
ines
aa92d4e9b5
Fix unicode regex for Python 2 (see #834)
2017-02-16 23:49:54 +01:00
ines
44de3c7642
Reformat test and use text_file fixture
2017-02-16 23:49:19 +01:00
ines
3dd22e9c88
Mark vectors test as xfail (temporary)
2017-02-16 23:28:51 +01:00
ines
85d249d451
Revert "Revert "Merge pull request #836 from raphael0202/load_vectors (closes #834)""
...
This reverts commit ea05f78660.
2017-02-16 23:26:25 +01:00
ines
ea05f78660
Revert "Merge pull request #836 from raphael0202/load_vectors (closes #834)"
...
This reverts commit 7d8c9eee7f, reversing
changes made to f6b69babcc.
2017-02-16 15:27:12 +01:00
Raphaël Bournhonesque
06a71d22df
Fix test failure by using unicode literals
2017-02-16 14:48:00 +01:00
Raphaël Bournhonesque
3ba109622c
Add regression test with non ' ' space character as token
2017-02-16 12:23:27 +01:00
Raphaël Bournhonesque
e17dc2db75
Remove useless import
2017-02-16 12:10:24 +01:00
Raphaël Bournhonesque
3fd2742649
load_vectors should accept arbitrary space characters as word tokens
...
Fix bug #834
2017-02-16 12:08:30 +01:00
ines
f08e180a47
Make groups non-capturing
...
Prevents hitting the 100 named groups limit in Python
2017-02-10 13:35:02 +01:00
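The effect of commit f08e180a47 can be illustrated with Python's standard re module. This is a sketch under assumptions: the pieces list is a hypothetical stand-in for the real tokenizer patterns, which are not shown in this log. It demonstrates that groups written as (?:...) do not add to a compiled pattern's group count, while plain (...) groups do.

```python
import re

# Hypothetical alternatives; the real patterns live in spaCy's language data.
pieces = ["a+", "b+", "c+"]

# Each alternative wrapped in a capturing group adds one to the group count.
capturing = re.compile("|".join("({})".format(p) for p in pieces))

# Non-capturing groups give the same grouping without being counted.
non_capturing = re.compile("|".join("(?:{})".format(p) for p in pieces))

if __name__ == "__main__":
    print(capturing.groups, non_capturing.groups)
    print(non_capturing.match("bbb").group(0))
```

Both patterns match the same strings; only the bookkeeping of captured groups differs, which is what matters when many alternatives are joined into one large expression.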
ines
fa3b8512da
Use consistent imports and exports
...
Bundle everything in language_data to keep it consistent with other
languages and make TOKENIZER_EXCEPTIONS importable from there.
2017-02-10 13:34:09 +01:00
ines
21f09d10d7
Revert "Revert "Merge pull request #818 from raphael0202/tokenizer_exceptions""
...
This reverts commit f02a2f9322.
2017-02-10 13:17:05 +01:00
ines
f02a2f9322
Revert "Merge pull request #818 from raphael0202/tokenizer_exceptions"
...
This reverts commit b95afdf39c, reversing
changes made to b0ccf32378.
2017-02-09 17:07:21 +01:00
Raphaël Bournhonesque
309da78bf0
Merge branch 'master' into tokenizer_exceptions
2017-02-09 16:32:12 +01:00
Raphaël Bournhonesque
4ce0bbc6b6
Update unit tests
2017-02-09 16:30:43 +01:00
Raphaël Bournhonesque
5d706ab95d
Merge tokenizer exceptions from PR #802
2017-02-09 16:30:28 +01:00
ines
654fe447b1
Add Swedish tokenizer tests (see #807)
2017-02-05 11:47:07 +01:00
ines
6715615d55
Add missing EXC variable and combine tokenizer exceptions
2017-02-05 11:42:52 +01:00
Ines Montani
30a52d576b
Merge pull request #807 from magnusburton/master
...
Added Swedish lemma rules and more verb contractions
2017-02-05 11:34:19 +01:00
Magnus Burton
19c0ce745a
Added Swedish lemma rules
2017-02-04 17:53:32 +01:00
Michael Wallin
d25556bf80
[issue 805] Fix issue
2017-02-04 16:22:21 +02:00
Michael Wallin
35100c8bdd
[issue 805] Add regression test and the required fixture
2017-02-04 16:21:34 +02:00
ines
0ab353b0ca
Add line breaks to Finnish stop words for better readability
2017-02-04 13:40:25 +01:00
Michael Wallin
1a1952afa5
[finnish] Add initial tests for tokenizer
2017-02-04 13:54:10 +02:00
Michael Wallin
f9bb25d1cf
[finnish] Reformat and correct stop words
2017-02-04 13:54:10 +02:00
Michael Wallin
73f66ec570
Add preliminary support for Finnish
2017-02-04 13:54:10 +02:00
Ines Montani
65d6202107
Merge pull request #802 from Tpt/fr-tokenizer
...
Adds more French tokenizer exceptions
2017-02-03 10:52:20 +01:00
Tpt
75a74857bb
Adds more French tokenizer exceptions
2017-02-03 13:45:18 +04:00
Ines Montani
afc6365388
Update regression test for #801 to match current expected behaviour
2017-02-02 16:23:05 +01:00
Ines Montani
012f4820cb
Keep infixes of punctuation + hyphens as one token (see #801)
2017-02-02 16:22:40 +01:00
Ines Montani
1219a5f513
Add = to tokenizer prefixes
2017-02-02 16:21:11 +01:00
Ines Montani
ff04748eb6
Add missing emoticon
2017-02-02 16:21:00 +01:00
Ines Montani
13a4ab37e0
Add regression test for #801
2017-02-02 15:33:52 +01:00
Raphaël Bournhonesque
85f951ca99
Add tokenizer exceptions for French
2017-02-02 08:36:16 +01:00
Matvey Ezhov
32a22291bc
Small Doc.count_by documentation update
...
Current example doesn't work
2017-01-31 19:18:45 +03:00
Ines Montani
e4875834fe
Fix formatting
2017-01-31 15:19:33 +01:00
Ines Montani
c304834e45
Add missing import
2017-01-31 15:18:30 +01:00
Ines Montani
e6465b9ca3
Parametrize test cases and mark as xfail
2017-01-31 15:14:42 +01:00
latkins
e4c84321a5
Added regression test for Issue #792.
2017-01-31 13:47:42 +00:00
Matthew Honnibal
6c665b81df
Fix redundant == TAG in from_array conditional
2017-01-31 00:46:21 +11:00
Ines Montani
19501f3340
Add regression test for #775
2017-01-25 13:16:52 +01:00
Ines Montani
209c37bbcf
Exclude "shell" and "Shell" from English tokenizer exceptions (resolves #775)
2017-01-25 13:15:02 +01:00
Raphaël Bournhonesque
1be9c0e724
Add fr tokenization unit tests
2017-01-24 10:57:37 +01:00
Raphaël Bournhonesque
1faaf698ca
Add infixes and abbreviation exceptions (fr)
2017-01-24 10:57:37 +01:00
Raphaël Bournhonesque
cf8474401b
Remove unused import statement
2017-01-24 10:57:37 +01:00
Raphaël Bournhonesque
902f136f18
Add support for elision in French
2017-01-24 10:57:37 +01:00
Ines Montani
55c9c62abc
Use relative import
2017-01-23 21:27:49 +01:00
Ines Montani
0967eb07be
Add regression test for #768
2017-01-23 21:25:46 +01:00
Ines Montani
6baa98f774
Merge pull request #769 from raphael0202/spacy-768
...
Allow zero-width 'infix' token
2017-01-23 21:24:33 +01:00
Raphaël Bournhonesque
dce8f5515e
Allow zero-width 'infix' token
2017-01-23 18:28:01 +01:00
Ines Montani
5f6f48e734
Add regression test for #759
2017-01-20 15:11:48 +01:00
Ines Montani
09ecc39b4e
Fix multi-line string of NUM_WORDS (resolves #759)
2017-01-20 15:11:48 +01:00
Magnus Burton
69eab727d7
Added loops to handle contractions with verbs
2017-01-19 14:08:52 +01:00
Matthew Honnibal
be26085277
Fix missing import
...
Closes #755
2017-01-19 22:03:52 +11:00
Ines Montani
7e36568d5b
Fix title to accommodate sputnik
2017-01-17 00:51:09 +01:00
Ines Montani
d704cfa60d
Fix typo
2017-01-16 21:30:33 +01:00
Ines Montani
64e142f460
Update about.py
2017-01-16 14:23:08 +01:00
Matthew Honnibal
e889cd698e
Increment version
2017-01-16 14:01:35 +01:00
Matthew Honnibal
e7f8e13cf3
Make Token hashable. Fixes #743
2017-01-16 13:27:57 +01:00
Matthew Honnibal
2c60d0cb1e
Test #743: Tokens unhashable.
2017-01-16 13:27:26 +01:00
Matthew Honnibal
48c712f1c1
Merge branch 'master' of ssh://github.com/explosion/spaCy
2017-01-16 13:18:06 +01:00
Matthew Honnibal
7ccf490c73
Increment version
2017-01-16 13:17:58 +01:00
Ines Montani
50878ef598
Exclude "were" and "Were" from tokenizer exceptions and add regression test (resolves #744)
2017-01-16 13:10:38 +01:00
Ines Montani
e053c7693b
Fix formatting
2017-01-16 13:09:52 +01:00
Ines Montani
116c675c3c
Merge pull request #742 from oroszgy/hu_tokenizer_fix
...
Improved Hungarian tokenizer
2017-01-14 23:52:44 +01:00
Gyorgy Orosz
92345b6a41
Further numeric test.
2017-01-14 22:44:19 +01:00
Gyorgy Orosz
b4df202bfa
Better error handling
2017-01-14 22:24:58 +01:00
Gyorgy Orosz
b03a46792c
Better error handling
2017-01-14 22:09:29 +01:00
Gyorgy Orosz
a45f22913f
Added further abbreviations present in the Szeged corpus
2017-01-14 22:08:55 +01:00
Ines Montani
332ce2d758
Update README.md
2017-01-14 21:12:11 +01:00
Gyorgy Orosz
9505c6a72b
Passing all old tests.
2017-01-14 20:39:21 +01:00
Gyorgy Orosz
63037e79af
Fixed hyphen handling in the Hungarian tokenizer.
2017-01-14 16:30:11 +01:00
Gyorgy Orosz
f77c0284d6
Maintaining compatibility with other spacy tokenizers.
2017-01-14 16:19:15 +01:00
Gyorgy Orosz
be7a7aeb1a
Reversed accidental changes.
2017-01-14 15:59:36 +01:00
Gyorgy Orosz
1be5da1ac6
Fixed Hungarian tokenizer for numbers
2017-01-14 15:51:59 +01:00
Ines Montani
a89e269a5a
Fix test formatting and consistency
2017-01-14 13:41:19 +01:00
Ines Montani
3424e3a7e5
Update README.md
2017-01-13 15:54:54 +01:00
Ines Montani
49186b34a1
Mark lemmatizer tests as models since they use installed data
2017-01-13 15:12:07 +01:00
Ines Montani
138deb80a1
Modernise vector tests, use add_vecs_to_vocab and don't depend on models
2017-01-13 15:12:07 +01:00
Ines Montani
96f0caa28a
Fix test name for consistency
2017-01-13 15:12:07 +01:00
Ines Montani
dc2bb1259f
Add util function to add vectors to vocab
2017-01-13 15:12:07 +01:00
Ines Montani
db9b25663d
Reformat add_docs_equal and add docstring
2017-01-13 15:12:07 +01:00
Ines Montani
62ce0a0073
Add README.md to tests to explain organisation and conventions
2017-01-13 15:11:18 +01:00
Ines Montani
38d60f6b90
Modernise serializer I/O tests and don't depend on models where possible
2017-01-13 02:24:56 +01:00
Ines Montani
4bb5b89ee4
Add text_file_b fixture using BytesIO
2017-01-13 02:23:50 +01:00
Ines Montani
49febd8c62
Modernise noun chunks tests and don't depend on models
2017-01-13 02:01:00 +01:00
Ines Montani
3ee97b5686
Rename test_parser to test_noun_chunks
2017-01-13 01:36:33 +01:00
Ines Montani
a308703f47
Remove old tests
2017-01-13 01:34:48 +01:00
Ines Montani
12eb8edf26
Move parser tests from unit to parser
2017-01-13 01:34:38 +01:00
Ines Montani
138c53ff2e
Merge tokenizer tests
2017-01-13 01:34:14 +01:00
Ines Montani
01f36ca3ff
Move attrs tests from unit to root and modernise
2017-01-13 01:33:50 +01:00
Ines Montani
3610d27967
Move alignment tests from munge to gold and modernise
2017-01-13 01:33:31 +01:00
Ines Montani
094ff7396a
Reformat and rename Pragmatic Segmenter tests and mark xfails
2017-01-13 01:30:20 +01:00
Ines Montani
affcf1b19d
Modernise lemmatizer tests
2017-01-12 23:41:17 +01:00
Ines Montani
33d9cf87f9
Modernise tagger tests and fix xpassing test
2017-01-12 23:40:52 +01:00
Ines Montani
33e5f8dc2e
Create basic and extended test set for URLs
2017-01-12 23:40:02 +01:00
Ines Montani
5e4f5ebfc8
Modernise BILUO tests
2017-01-12 23:39:18 +01:00
Ines Montani
09acfbca01
Add Lemmatizer fixture
2017-01-12 23:38:55 +01:00
Ines Montani
514bfa2597
Add path fixture for spaCy data path
2017-01-12 23:38:47 +01:00
Ines Montani
0894b8c0ef
Don't split tokens with digits and "/" infixes (resolves #740)
2017-01-12 22:58:26 +01:00
Ines Montani
e9e99a5670
Add regression test for #740
2017-01-12 22:57:38 +01:00
Ines Montani
6935d55409
Fix formatting
2017-01-12 22:56:20 +01:00
Ines Montani
5f0d196a31
Modernise and merge matcher tests
2017-01-12 22:23:11 +01:00
Ines Montani
d5d774413a
Update comments on EN and DE fixtures
2017-01-12 22:03:07 +01:00
Ines Montani
9b4bea1df9
Tidy up and rename regression tests and remove unnecessary imports
2017-01-12 22:00:37 +01:00
Ines Montani
5e1b6178e3
Fix formatting and consistency
2017-01-12 22:00:06 +01:00
Ines Montani
a3fd32455e
Remove redundant language loading integration tests
2017-01-12 21:59:48 +01:00
Ines Montani
61f1ca09c2
Modernise serializer codecs tests
2017-01-12 21:58:55 +01:00
Ines Montani
5dbc6e59f6
Modernise Huffman tests
2017-01-12 21:58:40 +01:00
Ines Montani
edeeeccea5
Modernise packer tests and don't depend on models where possible
2017-01-12 21:58:07 +01:00
Ines Montani
d084676cd0
Modernise and merge serialization tests
2017-01-12 21:57:19 +01:00
Ines Montani
442237787c
Add assert_docs_equal util to compare two docs
2017-01-12 21:56:52 +01:00
Ines Montani
eac3f700fb
Add fixture for entity recognizer
2017-01-12 21:56:32 +01:00
Ines Montani
b438cfddbc
Modernise matcher tests and split into two files
2017-01-12 17:51:46 +01:00
Ines Montani
27482ebed8
Move matcher tests for #188 and #242 to regression tests
...
Modernise tests and remove unnecessary imports
2017-01-12 17:33:57 +01:00
Ines Montani
0a4dc632bd
Update test to not create redundant Doc object
2017-01-12 17:33:18 +01:00
Ines Montani
a2526e66d8
Fix formatting, naming and unicode declaration
2017-01-12 16:51:13 +01:00
Ines Montani
052cdff07d
Modernise vector similarity tests
2017-01-12 16:51:13 +01:00
Ines Montani
bd20ec0a6a
Add get_cosine util function
2017-01-12 16:51:13 +01:00
Ines Montani
51ef75f629
Fix regression test for #615 and remove unnecessary imports
2017-01-12 16:51:12 +01:00
Ines Montani
aeb747e10c
Adjust formatting
2017-01-12 16:51:12 +01:00
Ines Montani
8e3e58a7e6
Modernise and merge lexeme vocab tests
2017-01-12 16:51:12 +01:00
Ines Montani
c3d4516fc2
Move test for #361 to regression tests
2017-01-12 16:51:12 +01:00
Daniel Hershcovich
99eb494a82
Fix #737: support loading word vectors with " " as a word
2017-01-12 17:00:14 +02:00
Ines Montani
7cb3d74426
Modernise span tests and don't depend on models
2017-01-12 15:30:49 +01:00
Ines Montani
92e3d8b3ee
Modernise vocab API tests and remove old xfailing tests
2017-01-12 15:27:46 +01:00
Ines Montani
7ea87684cd
Rename test_vocab.py to test_vocab_api.py
2017-01-12 15:12:21 +01:00
Ines Montani
0da2ee5c68
Merge flag features tests into orth tests in tests root
2017-01-12 15:12:00 +01:00
Ines Montani
03c136cfd3
Remove StringStore tests from vocab tests
2017-01-12 15:11:15 +01:00
Ines Montani
d7bd57abdf
Modernise add vectors vocab test
2017-01-12 15:09:49 +01:00
Ines Montani
89525ef345
Use consistent test names
2017-01-12 15:09:21 +01:00
Ines Montani
f8803808ce
Remove old unused tests and conftest files
2017-01-12 15:09:05 +01:00
Ines Montani
4d0bfebcd9
Move Pragmatic Segmenter test cases (currently unused) to parser tests
2017-01-12 15:08:02 +01:00
Ines Montani
26d018d874
Add tests for StringStore
2017-01-12 15:07:31 +01:00
Ines Montani
9b6784bab5
Add fixture for StringStore
2017-01-12 15:05:40 +01:00
Ines Montani
99d66d613a
Modernise tests for merging spans and don't depend on models
2017-01-12 12:26:26 +01:00
Ines Montani
fa8f67596d
Remove unused old test
2017-01-12 12:26:08 +01:00
Ines Montani
359f73a96b
Move test for #54 to regression tests
2017-01-12 12:25:51 +01:00
Ines Montani
3f3a46722c
Remove unused conftest
2017-01-12 12:25:24 +01:00
Ines Montani
c2406e92bc
Allow setting ents in get_doc
2017-01-12 12:25:10 +01:00
Ines Montani
c5914c6fe5
Fix and pass regression test for #736
2017-01-12 11:48:56 +01:00
Matthew Honnibal
4e48862fa8
Remove print statement
2017-01-12 11:25:39 +01:00
Matthew Honnibal
d1d8214767
Increment version
2017-01-12 11:21:57 +01:00
Matthew Honnibal
fba67fa342
Fix Issue #736: Times were being tokenized with incorrect string values.
2017-01-12 11:21:01 +01:00
Ines Montani
a6790b6694
Rename tags to pos in get_doc and allow adding tags to tokens
2017-01-12 11:18:36 +01:00
Ines Montani
1add8ace67
Merge lemmatizer tests
2017-01-12 11:16:53 +01:00
Ines Montani
3bc082abdf
Modernise morph exceptions test and don't depend on models
2017-01-12 11:14:29 +01:00
Ines Montani
ec7739b76e
Add regression test for #736
2017-01-12 11:12:44 +01:00
Ines Montani
6c1c564891
Move language-specific tests out of redundant tokenizer directories
2017-01-12 02:17:18 +01:00
Ines Montani
8fecedac3a
Tidy up
2017-01-12 02:16:37 +01:00
Ines Montani
ae7edd30e7
Move text file back to tokenizer tests directory
2017-01-12 02:10:23 +01:00
Ines Montani
ffcaba9017
Remove old and/or redundant tests
2017-01-12 02:10:18 +01:00
Ines Montani
19c4132097
Modernise space attachment parser tests and don't depend on models
2017-01-12 01:54:44 +01:00
Ines Montani
69778924c8
Modernise and merge parser tests and don't depend on models
2017-01-12 01:07:29 +01:00
Ines Montani
178c147612
Modernise nonprojectivity tests and don't depend on models
2017-01-12 01:06:36 +01:00
Ines Montani
1a3984742c
Modernise sentence boundary detection tests and don't depend on models (where possible)
2017-01-11 23:53:08 +01:00
Ines Montani
0cdb6ea61d
Remove old unused pickle test
2017-01-11 23:52:28 +01:00
Ines Montani
c9671329dc
Move test for #309 to regression tests
2017-01-11 23:52:13 +01:00
Ines Montani
d0e37b5670
Modernise parser tests and don't depend on models
2017-01-11 21:30:27 +01:00
Ines Montani
342cb41782
Add apply_transition_sequence util function to utils
2017-01-11 21:30:14 +01:00
Ines Montani
09807addff
Add en_parser fixture
2017-01-11 21:29:59 +01:00
Ines Montani
55d151aa61
Modernise Doc parse tree navigation tests and don't depend on models
2017-01-11 21:14:15 +01:00
Ines Montani
7262421bb2
Use consistent test names
2017-01-11 19:00:52 +01:00
Ines Montani
33800c9367
Rename "tokens" tests to "doc"
2017-01-11 18:59:01 +01:00
Ines Montani
3a9c6a9563
Remove old unused files
2017-01-11 18:58:38 +01:00
Ines Montani
8e962de39f
Remove old word vector tests
2017-01-11 18:55:08 +01:00
Ines Montani
e027936920
Modernise Doc noun chunks tests
2017-01-11 18:54:56 +01:00
Ines Montani
439f396acd
Modernise Doc array tests and don't depend on models
2017-01-11 18:54:46 +01:00
Ines Montani
05447be884
Modernise test for adding entities
2017-01-11 18:54:24 +01:00
Ines Montani
6e883f4c00
Modernise Doc API tests and don't depend on models
2017-01-11 18:05:36 +01:00
Ines Montani
8bf3bb5c44
Make words optional for get_doc
2017-01-11 18:05:10 +01:00
Ines Montani
928db7e419
Fix StringIO import for Python 3
2017-01-11 14:07:48 +01:00
Ines Montani
69998f216b
Rename test_tokens_api.py to test_doc_api.py
2017-01-11 13:58:56 +01:00
Ines Montani
d94dea1b18
Merge token tests into token API tests
2017-01-11 13:57:02 +01:00
Ines Montani
eb23424ab0
Modernise token API tests and don't depend on loading models
2017-01-11 13:56:54 +01:00
Ines Montani
c682b8ca90
Merge conftests into one cohesive file
2017-01-11 13:56:32 +01:00
Ines Montani
909f24d7df
Add test utils and get_doc helper function
...
Create Doc object from given vocab, words and annotations to allow
tests not to depend on loading the models.
2017-01-11 13:55:33 +01:00
Matthew Honnibal
e12c90e03f
Merge branch 'master' of ssh://github.com/explosion/spaCy
2017-01-11 13:03:51 +01:00
Matthew Honnibal
12cd27b821
Amend 8ae8b443f: Handle comparison with None tokens.
2017-01-11 13:03:32 +01:00
Daniel Hershcovich
8e603cc917
Avoid "True if ... else False"
2017-01-11 11:18:22 +02:00
Matthew Honnibal
44e2b0100d
Support TAG attribute in doc.from_array
2017-01-10 22:47:07 +01:00
Ines Montani
3e6e1f0251
Tidy up regression tests
2017-01-10 19:24:10 +01:00
Magnus Burton
aad23ab0b4
Supplemented with capitalized Swedish exceptions
2017-01-10 16:07:20 +01:00
Ines Montani
869963c3c4
Mark extensive prefix/suffix tests as slow
2017-01-10 15:57:35 +01:00
Ines Montani
487e020ebe
Add simple test for surrounding brackets
2017-01-10 15:57:26 +01:00
Ines Montani
0ba5cf51d2
Assert length first
2017-01-10 15:57:00 +01:00
Ines Montani
2185d31907
Adjust names and formatting
2017-01-10 15:56:35 +01:00
Ines Montani
e10d4ca964
Remove semi-redundant URLs and punctuation for faster testing
2017-01-10 15:54:25 +01:00
Ines Montani
3a3cb2c90c
Add unicode declaration
2017-01-10 15:53:15 +01:00
Matthew Honnibal
0f9b8a00a5
Unbreak data download
2017-01-09 23:40:26 +01:00
Matthew Honnibal
8ae8b443f1
Add richcmp method to Token. Closes #631
2017-01-09 19:30:31 +01:00
Matthew Honnibal
64f747cb65
Token comparison test
2017-01-09 19:12:00 +01:00
Matthew Honnibal
18c3c2d05c
Add tests for token comparison, re Issue #631
2017-01-09 19:09:59 +01:00
Matthew Honnibal
97a1286129
Revert changes to tagger and parser for thinc 6
2017-01-09 10:08:34 -06:00
Matthew Honnibal
95a52005df
Revert "Fix Issue #683: Add 'SP' to tag_map, if it's not there already, within the Morphology class."
...
This reverts commit 40e71586d6.
2017-01-09 09:55:55 -06:00
Ines Montani
363f09e68c
Merge pull request #726 from magnusburton/master
...
Added Swedish abbreviations as token exceptions
2017-01-09 14:58:15 +01:00
Matthew Honnibal
42cd598f57
Use correct fixtures in URL tokenizer
2017-01-09 14:10:40 +01:00
Matthew Honnibal
d9a77ddf14
Return None for data path if it doesn't exist
2017-01-09 14:10:05 +01:00
Matthew Honnibal
e4862d1dab
Merge branch 'develop'
2017-01-09 13:36:01 +01:00
Ines Montani
aa876884f0
Revert "Revert "Merge remote-tracking branch 'origin/master'""
...
This reverts commit fb9d3bb022.
2017-01-09 13:28:13 +01:00
Ines Montani
d5c72c40eb
Remove old tests for old website example code
2017-01-08 22:28:53 +01:00
Ines Montani
eef94e3ee2
Split off period after two or more uppercase letters (fixes #483)
2017-01-08 22:28:25 +01:00
Ines Montani
a89a6000e5
Remove unused import
2017-01-08 22:17:37 +01:00
Ines Montani
5d28664fc5
Don't test Hungarian for numbers and hyphens for now
...
Reinvestigate behaviour of case affixes given reorganised tokenizer
patterns.
2017-01-08 20:45:40 +01:00
Ines Montani
53362b6b93
Reorganise Hungarian prefixes/suffixes/infixes
...
Use global prefixes and suffixes for non-language-specific rules,
import list of alpha unicode characters and adjust regexes.
2017-01-08 20:40:33 +01:00
Ines Montani
347c4a2d06
Reorganise and reformat global tokenizer prefixes, suffixes and infixes
2017-01-08 20:37:39 +01:00
Ines Montani
0dec90e9f7
Use global abbreviation data languages and remove duplicates
2017-01-08 20:36:00 +01:00
Ines Montani
7c3cb2a652
Add global abbreviations data
2017-01-08 20:34:03 +01:00
Ines Montani
de5aa92bc2
Handle deprecated tokenizer prefix data
2017-01-08 20:33:28 +01:00
Ines Montani
abb09782f9
Move sun.txt to original location and fix path to not break parser tests
2017-01-08 20:32:54 +01:00
Ines Montani
cab39c59c5
Add missing contractions to English tokenizer exceptions
...
Inspired by
https://github.com/kootenpv/contractions/blob/master/contractions/__init__.py
2017-01-05 19:59:06 +01:00
Ines Montani
a23504fe07
Move abbreviations below other exceptions
2017-01-05 19:58:07 +01:00
Ines Montani
7d2cf934b9
Generate he/she/it correctly with 's instead of 've
2017-01-05 19:57:00 +01:00
Ines Montani
8328925e1f
Add newlines to long German text
2017-01-05 18:13:30 +01:00
Ines Montani
55b46d7cf6
Add tokenizer tests for German
2017-01-05 18:11:25 +01:00
Ines Montani
5bb4081f52
Remove redundant test_tokenizer.py for English
2017-01-05 18:11:11 +01:00
Ines Montani
8216ba599b
Add tests for longer and mixed English texts
2017-01-05 18:11:04 +01:00
Ines Montani
65f937d5c6
Move basic contraction tests to test_contractions.py
2017-01-05 18:09:53 +01:00
Ines Montani
bbe7cab3a1
Move non-English-specific tests back to general tokenizer tests
2017-01-05 18:09:29 +01:00
Ines Montani
038002d616
Reformat HU tokenizer tests and adapt to general style
...
Improve readability of test cases and add conftest.py with fixture
2017-01-05 18:06:44 +01:00
Ines Montani
bc911322b3
Move ") to emoticons (see Tweebo challenge test)
2017-01-05 18:05:38 +01:00
Ines Montani
637f785036
Add general sanity tests for all tokenizers
2017-01-05 16:25:38 +01:00
Ines Montani
c5f2dc15de
Move English tokenizer tests to directory /en
2017-01-05 16:25:04 +01:00
Ines Montani
8b45363b4d
Modernize and merge general tokenizer tests
2017-01-05 13:17:05 +01:00
Ines Montani
02cfda48c9
Modernize and merge tokenizer tests for string loading
2017-01-05 13:16:55 +01:00
Ines Montani
a11f684822
Modernize and merge tokenizer tests for whitespace
2017-01-05 13:16:33 +01:00
Ines Montani
8b284fc6f1
Modernize and merge tokenizer tests for text from file
2017-01-05 13:15:52 +01:00
Ines Montani
2c2e878653
Modernize and merge tokenizer tests for punctuation
2017-01-05 13:14:16 +01:00
Ines Montani
8a74129cdf
Modernize and merge tokenizer tests for prefixes/suffixes/infixes
2017-01-05 13:13:12 +01:00
Ines Montani
0e65dca9a5
Modernize and merge tokenizer tests for exception and emoticons
2017-01-05 13:11:31 +01:00
Ines Montani
34c47bb20d
Fix formatting
2017-01-05 13:10:51 +01:00
Ines Montani
2e72683baa
Add missing docstrings
2017-01-05 13:10:21 +01:00
Ines Montani
da10a049a6
Add unicode declarations
2017-01-05 13:09:48 +01:00
Ines Montani
58adae8774
Remove unused file
2017-01-05 13:09:22 +01:00
Ines Montani
c6e5a5349d
Move regression test for #360 into own file
2017-01-04 00:49:31 +01:00
Ines Montani
8279993a6f
Modernize and merge tokenizer tests for punctuation
2017-01-04 00:49:20 +01:00
Ines Montani
550630df73
Update tokenizer tests for contractions
2017-01-04 00:48:42 +01:00
Ines Montani
109f202e8f
Update conftest fixture
2017-01-04 00:48:21 +01:00
Ines Montani
ee6b49b293
Modernize tokenizer tests for emoticons
2017-01-04 00:47:59 +01:00
Ines Montani
f09b5a5dfd
Modernize tokenizer tests for infixes
2017-01-04 00:47:42 +01:00
Ines Montani
59059fed27
Move regression test for #351 to own file
2017-01-04 00:47:11 +01:00
Ines Montani
667051375d
Modernize tokenizer tests for whitespace
2017-01-04 00:46:35 +01:00
Ines Montani
aafc894285
Modernize tokenizer tests for contractions
...
Use @pytest.mark.parametrize.
2017-01-03 23:02:21 +01:00
Ines Montani
1d237664af
Add lowercase lemma to tokenizer exceptions
2017-01-03 23:02:21 +01:00
Ines Montani
84a87951eb
Fix typos
2017-01-03 18:27:43 +01:00
Ines Montani
35b39f53c3
Reorganise English tokenizer exceptions (as discussed in #718)
...
Add logic to generate exceptions that follow a consistent pattern (like
verbs and pronouns) and allow certain tokens to be excluded explicitly.
2017-01-03 18:26:09 +01:00
Ines Montani
fb9d3bb022
Revert "Merge remote-tracking branch 'origin/master'"
...
This reverts commit d3b181cdf1, reversing
changes made to b19cfcc144.
2017-01-03 18:21:36 +01:00
Ines Montani
461cbb99d8
Revert "Reorganise English tokenizer exceptions (as discussed in #718)"
...
This reverts commit b19cfcc144.
2017-01-03 18:21:29 +01:00
Ines Montani
d3b181cdf1
Merge remote-tracking branch 'origin/master'
...
# Conflicts:
# spacy/en/tokenizer_exceptions.py
2017-01-03 18:20:01 +01:00
Ines Montani
b19cfcc144
Reorganise English tokenizer exceptions (as discussed in #718)
...
Add logic to generate exceptions that follow a consistent pattern (like
verbs and pronouns) and allow certain tokens to be excluded explicitly.
2017-01-03 18:17:57 +01:00
Ines Montani
1bd53bbf89
Fix typos (resolves #718)
2017-01-03 11:26:21 +01:00
Matthew Honnibal
fde53be3b4
Move whole token match inside _split_affixes.
2016-12-30 17:11:50 -06:00
Matthew Honnibal
3ba7c167a8
Fix URL tests
2016-12-30 17:10:08 -06:00
Matthew Honnibal
9936a1b9b5
Merge branch 'tokenization_w_exception_patterns' of https://github.com/oroszgy/spaCy.hu into oroszgy-tokenization_w_exception_patterns
2016-12-30 14:53:40 -06:00
Magnus Burton
56e2219b65
Added Swedish city abbreviations
2016-12-30 21:17:34 +01:00
Magnus Burton
e935c950d8
Added months and days as abbreviations for Swedish
2016-12-30 21:08:44 +01:00
kengz
73a38bd4d1
Merge remote-tracking branch 'upstream/master'
2016-12-30 12:19:59 -05:00
kengz
da44183ae1
move parse_tree logic to a new tokens/printers.py file
2016-12-30 12:19:18 -05:00
Matthew Honnibal
3e8d9c772e
Test interaction of token_match and punctuation
...
Check that the new token_match function applies after punctuation is split off.
2016-12-31 00:52:17 +11:00
Matthew Honnibal
74b921f394
Merge branch 'master' of ssh://github.com/explosion/spaCy into develop
2016-12-30 14:38:27 +01:00
Matthew Honnibal
623d94e14f
Whitespace
2016-12-31 00:30:28 +11:00
Matthew Honnibal
af81ac8bb0
Use thinc 6.0
2016-12-29 11:58:42 +01:00
Petter Hohle
f112e7754e
Add PART to tag map
...
16 of the 17 PoS tags in the UD tag set are added; PART is missing.
2016-12-28 18:39:01 +01:00
Matthew Honnibal
f62db78dc3
Increment version
2016-12-27 21:11:22 +01:00
Matthew Honnibal
cade536d1e
Merge branch 'master' of ssh://github.com/explosion/spaCy
2016-12-27 21:04:10 +01:00
Matthew Honnibal
ce4539dafd
Allow the vocabulary to grow to 10,000, to prevent cold-start problem.
2016-12-27 21:03:45 +01:00
Ines Montani
ad3669cef5
Merge pull request #703 from magnusburton/master
...
Added Swedish abbreviations
2016-12-27 01:01:49 +01:00
Ines Montani
78f754dd9a
Merge pull request #705 from oroszgy/hu_tokenizer
...
Initial support for Hungarian
2016-12-27 00:48:13 +01:00
Ines Montani
8785706039
Reformat stop words for better readability
2016-12-24 00:58:40 +01:00
Gyorgy Orosz
45e045a87b
Unicode/UTF8 compatibility for Python2
2016-12-24 00:21:00 +01:00
Gyorgy Orosz
72b61b6d03
Typo fix.
2016-12-24 00:10:29 +01:00
Gyorgy Orosz
3a9be4d485
Updated token exception handling mechanism to allow the usage of arbitrary functions as token exception matchers.
2016-12-23 23:49:34 +01:00
Ines Montani
1436b9f15a
Fix formatting and consistency
2016-12-23 21:36:01 +01:00
Ines Montani
1d64527727
Update Spanish tokenizer
...
Remove reflexive pronouns as they're part of an open class, fix
mistakes and add exceptions
2016-12-23 21:36:01 +01:00
Ines Montani
7f411fd01c
Remove exceptions containing whitespace / no special chars
2016-12-23 14:30:06 +01:00
Magnus Burton
fdf4776262
Added Swedish abbreviations
2016-12-22 22:45:18 +01:00
Gyorgy Orosz
d9c59c4751
Maintaining backward compatibility.
2016-12-21 23:30:49 +01:00
Gyorgy Orosz
1748549aeb
Added exception pattern mechanism to the tokenizer.
2016-12-21 23:16:19 +01:00
Gyorgy Orosz
35aa54765d
Hungarian module is exposed in spacy.
2016-12-21 20:45:36 +01:00
Gyorgy Orosz
ab2f6ea46c
Removed data files from tests.
2016-12-21 20:22:09 +01:00
Ines Montani
3c87c71d43
Add tokenizer exceptions for a.m. and p.m. in Spanish
2016-12-21 18:19:10 +01:00
Ines Montani
78e63dc7d0
Update tokenizer exceptions for English
2016-12-21 18:06:34 +01:00
Ines Montani
702d1eed93
Update tokenizer exceptions for German
2016-12-21 18:06:27 +01:00
Ines Montani
d60380418e
Update tokenizer exceptions for Spanish
2016-12-21 18:06:17 +01:00
Ines Montani
920fa0fed2
Add DET_LEMMA constant
2016-12-21 18:05:41 +01:00
Ines Montani
8978806ea6
Allow Vocab to load without serializer_freqs
2016-12-21 18:05:23 +01:00
Ines Montani
be8ed811f6
Remove trailing whitespace
2016-12-21 18:04:41 +01:00
Ines Montani
926e19184a
Merge pull request #695 from magnusburton/master
...
Added Swedish morph rules
2016-12-21 01:06:00 +01:00
Gyorgy Orosz
3d5306acb9
Added further testcases.
2016-12-20 23:49:35 +01:00
Gyorgy Orosz
23956e72ff
Improved partial support for tokenizing Hungarian numbers
2016-12-20 23:36:59 +01:00
Gyorgy Orosz
6add156075
Refactored language data structure
2016-12-20 22:28:20 +01:00
Gyorgy Orosz
366b3f8685
Merge branch 'master' into hu_tokenizer
2016-12-20 20:53:31 +01:00
Gyorgy Orosz
c035928156
Partial Hungarian number tokenization is added.
2016-12-20 20:46:20 +01:00