Ines Montani
1c40890321
Add missing comma
...
Should fix Travis build error
2017-03-10 09:34:54 +01:00
Shuvanon Razik
c251703428
Update abbreviations
2017-03-10 10:45:01 +06:00
Matthew Honnibal
b5247c49eb
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-03-09 18:45:43 -06:00
Matthew Honnibal
798450136d
Set L1 penalty to 0 in tagger.
2017-03-09 18:43:47 -06:00
Matthew Honnibal
c62da02344
Use ftrl training, to learn compressed model.
2017-03-09 18:43:21 -06:00
Matthew Honnibal
f71eeef9bb
Pass path argument to end_training
2017-03-09 18:42:40 -06:00
Dan Rapp
123d3f2d38
Fix error in test case parameterization
2017-03-09 12:18:21 -07:00
Dan Rapp
b9307dfcd7
Merge branch 'master' into rappdw/tokenizer_exceptions_url_fix
2017-03-09 11:42:14 -07:00
Dan Rapp
3b1df3808d
Issue #840 - URL pattern too broad
2017-03-09 11:39:39 -07:00
Matthew Honnibal
5b0b968d13
Merge branch 'develop' of https://github.com/explosion/spaCy into develop
2017-03-08 15:03:10 +01:00
Matthew Honnibal
0ac3d27689
Fix handling of trailing whitespace
...
Fix off-by-one error that meant trailing spaces were being dropped.
Closes #792
2017-03-08 15:01:40 +01:00
ines
c2e3e651b8
Re-add regression test for #859
2017-03-08 14:36:09 +01:00
Matthew Honnibal
0a6d7ca200
Fix spacing after token_match
...
The boolean flag indicating a space after the token was
being set incorrectly after the token_match regex was applied.
Fixes #859.
2017-03-08 14:33:32 +01:00
shuvanon
85438aee1b
update tokenizer
2017-03-08 17:29:39 +06:00
shuvanon
45bc78461c
update tokenizer
2017-03-08 17:27:12 +06:00
Matthew Honnibal
cd33b39a04
Fix 2/3 problem for json save/load
2017-03-08 01:39:13 +01:00
Matthew Honnibal
40703988bc
Use FTRL training in parser
2017-03-08 01:38:51 +01:00
Matthew Honnibal
d108534dc2
Fix 2/3 problems for training
2017-03-08 01:37:52 +01:00
Matthew Honnibal
d03d6a13f1
Merge branch 'rominf-ud20' into develop
2017-03-07 21:48:56 +01:00
Matthew Honnibal
f7374d0b86
Merge branch 'ud20' of https://github.com/rominf/spaCy into rominf-ud20
2017-03-07 21:48:37 +01:00
Matthew Honnibal
16670d3251
Xfail the vocab pickling for now
2017-03-07 21:43:28 +01:00
Matthew Honnibal
a89c3500f6
Fixes to hacky vocab pickling
2017-03-07 20:58:55 +01:00
Matthew Honnibal
d814892805
Hackish pickle support for Vocab.
2017-03-07 20:25:12 +01:00
Matthew Honnibal
26614e028f
Add hacky support for StringCFile, to make pickling easier.
2017-03-07 20:24:37 +01:00
Matthew Honnibal
3edb8ae207
Whitespace
2017-03-07 17:16:26 +01:00
Matthew Honnibal
5de7e712b7
Add support for pickling StringStore.
2017-03-07 17:15:18 +01:00
Matthew Honnibal
4e75e74247
Update regression test for variable-length pattern problem in the matcher.
2017-03-07 16:08:32 +01:00
Matthew Honnibal
6d67213b80
Add test for 850: Matcher fails on zero-or-more.
2017-03-07 15:55:28 +01:00
Aniruddha Adhikary
696215a3fb
add tests for Bengali
2017-03-05 11:25:12 +06:00
Aniruddha Adhikary
8f3bfe9bfc
[Bengali] basic tag map, morph, lemma rules and exceptions
2017-03-04 12:36:59 +06:00
Roman Inflianskas
66e1109b53
Add support for Universal Dependencies v2.0
2017-03-03 13:17:34 +01:00
ines
8dff040032
Revert "Add regression test for #859"
...
This reverts commit c4f16c66d1.
2017-03-01 21:56:20 +01:00
Juan Miguel Cejuela
25c29f072d
apply patch
2017-03-01 21:44:17 +01:00
Juan Miguel Cejuela
a8cfde46d3
#781 Fix test: colocalizes is lemmatized to colocaliz and colocalize
2017-03-01 21:43:08 +01:00
Juan Miguel Cejuela
a471114eb2
#781 add regression test, failing previous bug fix
2017-03-01 21:30:51 +01:00
ines
c4f16c66d1
Add regression test for #859
2017-03-01 16:07:27 +01:00
Aniruddha Adhikary
d91be7aed4
add punctuations for Bengali
2017-02-28 21:07:14 +06:00
Aniruddha Adhikary
5a4fc09576
add basic Bengali support
2017-02-28 07:48:37 +06:00
Matthew Honnibal
cc9b2b74e3
Merge branch 'french-tokenizer-exceptions'
2017-02-27 11:44:39 +01:00
Matthew Honnibal
bd4375a2e6
Remove comment
2017-02-27 11:44:26 +01:00
Matthew Honnibal
e7e22d8be6
Move import within get_exceptions() function, to speed import
2017-02-27 11:34:48 +01:00
Matthew Honnibal
34bcc8706d
Merge branch 'french-tokenizer-exceptions'
2017-02-27 11:21:21 +01:00
Matthew Honnibal
0aaa546435
Fix test after updating the French tokenizer stuff
2017-02-27 11:20:47 +01:00
Matthew Honnibal
26446aa728
Avoid loading all French exceptions on import
...
Move exceptions loading behind a get_tokenizer_exceptions() function
for French, instead of loading into the top-level namespace. This
cuts import times from 0.6s to 0.2s, at the expense of making the
French data a little different from the others (there's no top-level
TOKENIZER_EXCEPTIONS variable.) The current solution feels somewhat
unsatisfying.
2017-02-25 11:55:00 +01:00
ines
376c5813a7
Remove print statements from test
2017-02-24 18:26:32 +01:00
ines
7c1260e98c
Add regression test
2017-02-24 18:22:49 +01:00
ines
0e2e331b58
Convert exceptions to Python list
2017-02-24 18:22:40 +01:00
ines
51eb190ef4
Remove print statements from test
2017-02-24 17:41:12 +01:00
Matthew Honnibal
db5ada3995
Merge branch 'master' of https://github.com/explosion/spaCy
2017-02-24 14:28:12 +01:00
Matthew Honnibal
8f94897d07
Add 1 operator to matcher, and make sure open patterns are closed at end of document. Closes Issue #766
2017-02-24 14:27:02 +01:00
ines
67991b6e5f
Add more test cases to #775 regression test to cover #847
2017-02-18 14:10:44 +01:00
ines
30ce2a6793
Exclude "shed" and "Shed" from tokenizer exceptions (see #847)
2017-02-18 14:10:44 +01:00
Ines Montani
de997c1a33
Merge pull request #842 from magnusburton/master
...
Added regular verb rules for Swedish
2017-02-17 11:18:20 +01:00
Magnus Burton
41fcfd06b8
Added regular verb rules for Swedish
2017-02-17 10:04:04 +01:00
ines
aa92d4e9b5
Fix unicode regex for Python 2 (see #834)
2017-02-16 23:49:54 +01:00
ines
44de3c7642
Reformat test and use text_file fixture
2017-02-16 23:49:19 +01:00
ines
3dd22e9c88
Mark vectors test as xfail (temporary)
2017-02-16 23:28:51 +01:00
ines
85d249d451
Revert "Revert "Merge pull request #836 from raphael0202/load_vectors (closes #834)""
...
This reverts commit ea05f78660.
2017-02-16 23:26:25 +01:00
ines
ea05f78660
Revert "Merge pull request #836 from raphael0202/load_vectors (closes #834)"
...
This reverts commit 7d8c9eee7f, reversing
changes made to f6b69babcc.
2017-02-16 15:27:12 +01:00
Raphaël Bournhonesque
06a71d22df
Fix test failure by using unicode literals
2017-02-16 14:48:00 +01:00
Raphaël Bournhonesque
3ba109622c
Add regression test with non ' ' space character as token
2017-02-16 12:23:27 +01:00
Raphaël Bournhonesque
e17dc2db75
Remove useless import
2017-02-16 12:10:24 +01:00
Raphaël Bournhonesque
3fd2742649
load_vectors should accept arbitrary space characters as word tokens
...
Fix bug #834
2017-02-16 12:08:30 +01:00
ines
f08e180a47
Make groups non-capturing
...
Prevents hitting the 100 named groups limit in Python
2017-02-10 13:35:02 +01:00
ines
fa3b8512da
Use consistent imports and exports
...
Bundle everything in language_data to keep it consistent with other
languages and make TOKENIZER_EXCEPTIONS importable from there.
2017-02-10 13:34:09 +01:00
ines
21f09d10d7
Revert "Revert "Merge pull request #818 from raphael0202/tokenizer_exceptions""
...
This reverts commit f02a2f9322.
2017-02-10 13:17:05 +01:00
ines
f02a2f9322
Revert "Merge pull request #818 from raphael0202/tokenizer_exceptions"
...
This reverts commit b95afdf39c, reversing
changes made to b0ccf32378.
2017-02-09 17:07:21 +01:00
Raphaël Bournhonesque
309da78bf0
Merge branch 'master' into tokenizer_exceptions
2017-02-09 16:32:12 +01:00
Raphaël Bournhonesque
4ce0bbc6b6
Update unit tests
2017-02-09 16:30:43 +01:00
Raphaël Bournhonesque
5d706ab95d
Merge tokenizer exceptions from PR #802
2017-02-09 16:30:28 +01:00
ines
654fe447b1
Add Swedish tokenizer tests (see #807)
2017-02-05 11:47:07 +01:00
ines
6715615d55
Add missing EXC variable and combine tokenizer exceptions
2017-02-05 11:42:52 +01:00
Ines Montani
30a52d576b
Merge pull request #807 from magnusburton/master
...
Added Swedish lemma rules and more verb contractions
2017-02-05 11:34:19 +01:00
Magnus Burton
19c0ce745a
Added Swedish lemma rules
2017-02-04 17:53:32 +01:00
Michael Wallin
d25556bf80
[issue 805] Fix issue
2017-02-04 16:22:21 +02:00
Michael Wallin
35100c8bdd
[issue 805] Add regression test and the required fixture
2017-02-04 16:21:34 +02:00
ines
0ab353b0ca
Add line breaks to Finnish stop words for better readability
2017-02-04 13:40:25 +01:00
Michael Wallin
1a1952afa5
[finnish] Add initial tests for tokenizer
2017-02-04 13:54:10 +02:00
Michael Wallin
f9bb25d1cf
[finnish] Reformat and correct stop words
2017-02-04 13:54:10 +02:00
Michael Wallin
73f66ec570
Add preliminary support for Finnish
2017-02-04 13:54:10 +02:00
Ines Montani
65d6202107
Merge pull request #802 from Tpt/fr-tokenizer
...
Adds more French tokenizer exceptions
2017-02-03 10:52:20 +01:00
Tpt
75a74857bb
Adds more French tokenizer exceptions
2017-02-03 13:45:18 +04:00
Ines Montani
afc6365388
Update regression test for #801 to match current expected behaviour
2017-02-02 16:23:05 +01:00
Ines Montani
012f4820cb
Keep infixes of punctuation + hyphens as one token (see #801)
2017-02-02 16:22:40 +01:00
Ines Montani
1219a5f513
Add = to tokenizer prefixes
2017-02-02 16:21:11 +01:00
Ines Montani
ff04748eb6
Add missing emoticon
2017-02-02 16:21:00 +01:00
Ines Montani
13a4ab37e0
Add regression test for #801
2017-02-02 15:33:52 +01:00
Raphaël Bournhonesque
85f951ca99
Add tokenizer exceptions for French
2017-02-02 08:36:16 +01:00
Matvey Ezhov
32a22291bc
Small Doc.count_by documentation update
...
Current example doesn't work
2017-01-31 19:18:45 +03:00
Ines Montani
e4875834fe
Fix formatting
2017-01-31 15:19:33 +01:00
Ines Montani
c304834e45
Add missing import
2017-01-31 15:18:30 +01:00
Ines Montani
e6465b9ca3
Parametrize test cases and mark as xfail
2017-01-31 15:14:42 +01:00
latkins
e4c84321a5
Added regression test for Issue #792.
2017-01-31 13:47:42 +00:00
Matthew Honnibal
6c665b81df
Fix redundant == TAG in from_array conditional
2017-01-31 00:46:21 +11:00
Ines Montani
19501f3340
Add regression test for #775
2017-01-25 13:16:52 +01:00
Ines Montani
209c37bbcf
Exclude "shell" and "Shell" from English tokenizer exceptions (resolves #775)
2017-01-25 13:15:02 +01:00
Raphaël Bournhonesque
1be9c0e724
Add fr tokenization unit tests
2017-01-24 10:57:37 +01:00
Raphaël Bournhonesque
1faaf698ca
Add infixes and abbreviation exceptions (fr)
2017-01-24 10:57:37 +01:00
Raphaël Bournhonesque
cf8474401b
Remove unused import statement
2017-01-24 10:57:37 +01:00
Raphaël Bournhonesque
902f136f18
Add support for elision in French
2017-01-24 10:57:37 +01:00
Ines Montani
55c9c62abc
Use relative import
2017-01-23 21:27:49 +01:00
Ines Montani
0967eb07be
Add regression test for #768
2017-01-23 21:25:46 +01:00
Ines Montani
6baa98f774
Merge pull request #769 from raphael0202/spacy-768
...
Allow zero-width 'infix' token
2017-01-23 21:24:33 +01:00
Raphaël Bournhonesque
dce8f5515e
Allow zero-width 'infix' token
2017-01-23 18:28:01 +01:00
Ines Montani
5f6f48e734
Add regression test for #759
2017-01-20 15:11:48 +01:00
Ines Montani
09ecc39b4e
Fix multi-line string of NUM_WORDS (resolves #759)
2017-01-20 15:11:48 +01:00
Magnus Burton
69eab727d7
Added loops to handle contractions with verbs
2017-01-19 14:08:52 +01:00
Matthew Honnibal
be26085277
Fix missing import
...
Closes #755
2017-01-19 22:03:52 +11:00
Ines Montani
7e36568d5b
Fix title to accommodate sputnik
2017-01-17 00:51:09 +01:00
Ines Montani
d704cfa60d
Fix typo
2017-01-16 21:30:33 +01:00
Ines Montani
64e142f460
Update about.py
2017-01-16 14:23:08 +01:00
Matthew Honnibal
e889cd698e
Increment version
2017-01-16 14:01:35 +01:00
Matthew Honnibal
e7f8e13cf3
Make Token hashable. Fixes #743
2017-01-16 13:27:57 +01:00
Matthew Honnibal
2c60d0cb1e
Test #743: Tokens unhashable.
2017-01-16 13:27:26 +01:00
Matthew Honnibal
48c712f1c1
Merge branch 'master' of ssh://github.com/explosion/spaCy
2017-01-16 13:18:06 +01:00
Matthew Honnibal
7ccf490c73
Increment version
2017-01-16 13:17:58 +01:00
Ines Montani
50878ef598
Exclude "were" and "Were" from tokenizer exceptions and add regression test (resolves #744)
2017-01-16 13:10:38 +01:00
Ines Montani
e053c7693b
Fix formatting
2017-01-16 13:09:52 +01:00
Ines Montani
116c675c3c
Merge pull request #742 from oroszgy/hu_tokenizer_fix
...
Improved Hungarian tokenizer
2017-01-14 23:52:44 +01:00
Gyorgy Orosz
92345b6a41
Further numeric test.
2017-01-14 22:44:19 +01:00
Gyorgy Orosz
b4df202bfa
Better error handling
2017-01-14 22:24:58 +01:00
Gyorgy Orosz
b03a46792c
Better error handling
2017-01-14 22:09:29 +01:00
Gyorgy Orosz
a45f22913f
Added further abbreviations present in the Szeged corpus
2017-01-14 22:08:55 +01:00
Ines Montani
332ce2d758
Update README.md
2017-01-14 21:12:11 +01:00
Gyorgy Orosz
9505c6a72b
Passing all old tests.
2017-01-14 20:39:21 +01:00
Gyorgy Orosz
63037e79af
Fixed hyphen handling in the Hungarian tokenizer.
2017-01-14 16:30:11 +01:00
Gyorgy Orosz
f77c0284d6
Maintaining compatibility with other spaCy tokenizers.
2017-01-14 16:19:15 +01:00
Gyorgy Orosz
be7a7aeb1a
Reversed accidental changes.
2017-01-14 15:59:36 +01:00
Gyorgy Orosz
1be5da1ac6
Fixed Hungarian tokenizer for numbers
2017-01-14 15:51:59 +01:00
Ines Montani
a89e269a5a
Fix test formatting and consistency
2017-01-14 13:41:19 +01:00
Ines Montani
3424e3a7e5
Update README.md
2017-01-13 15:54:54 +01:00
Ines Montani
49186b34a1
Mark lemmatizer tests as models since they use installed data
2017-01-13 15:12:07 +01:00
Ines Montani
138deb80a1
Modernise vector tests, use add_vecs_to_vocab and don't depend on models
2017-01-13 15:12:07 +01:00
Ines Montani
96f0caa28a
Fix test name for consistency
2017-01-13 15:12:07 +01:00
Ines Montani
dc2bb1259f
Add util function to add vectors to vocab
2017-01-13 15:12:07 +01:00
Ines Montani
db9b25663d
Reformat add_docs_equal and add docstring
2017-01-13 15:12:07 +01:00
Ines Montani
62ce0a0073
Add README.md to tests to explain organisation and conventions
2017-01-13 15:11:18 +01:00
Ines Montani
38d60f6b90
Modernise serializer I/O tests and don't depend on models where possible
2017-01-13 02:24:56 +01:00
Ines Montani
4bb5b89ee4
Add text_file_b fixture using BytesIO
2017-01-13 02:23:50 +01:00
Ines Montani
49febd8c62
Modernise noun chunks tests and don't depend on models
2017-01-13 02:01:00 +01:00
Ines Montani
3ee97b5686
Rename test_parser to test_noun_chunks
2017-01-13 01:36:33 +01:00
Ines Montani
a308703f47
Remove old tests
2017-01-13 01:34:48 +01:00
Ines Montani
12eb8edf26
Move parser tests from unit to parser
2017-01-13 01:34:38 +01:00
Ines Montani
138c53ff2e
Merge tokenizer tests
2017-01-13 01:34:14 +01:00
Ines Montani
01f36ca3ff
Move attrs tests from unit to root and modernise
2017-01-13 01:33:50 +01:00
Ines Montani
3610d27967
Move alignment tests from munge to gold and modernise
2017-01-13 01:33:31 +01:00
Ines Montani
094ff7396a
Reformat and rename Pragmatic Segmenter tests and mark xfails
2017-01-13 01:30:20 +01:00
Ines Montani
affcf1b19d
Modernise lemmatizer tests
2017-01-12 23:41:17 +01:00
Ines Montani
33d9cf87f9
Modernise tagger tests and fix xpassing test
2017-01-12 23:40:52 +01:00
Ines Montani
33e5f8dc2e
Create basic and extended test set for URLs
2017-01-12 23:40:02 +01:00
Ines Montani
5e4f5ebfc8
Modernise BILUO tests
2017-01-12 23:39:18 +01:00
Ines Montani
09acfbca01
Add Lemmatizer fixture
2017-01-12 23:38:55 +01:00
Ines Montani
514bfa2597
Add path fixture for spaCy data path
2017-01-12 23:38:47 +01:00
Ines Montani
0894b8c0ef
Don't split tokens with digits and "/" infixes (resolves #740)
2017-01-12 22:58:26 +01:00
Ines Montani
e9e99a5670
Add regression test for #740
2017-01-12 22:57:38 +01:00
Ines Montani
6935d55409
Fix formatting
2017-01-12 22:56:20 +01:00
Ines Montani
5f0d196a31
Modernise and merge matcher tests
2017-01-12 22:23:11 +01:00
Ines Montani
d5d774413a
Update comments on EN and DE fixtures
2017-01-12 22:03:07 +01:00
Ines Montani
9b4bea1df9
Tidy up and rename regression tests and remove unnecessary imports
2017-01-12 22:00:37 +01:00
Ines Montani
5e1b6178e3
Fix formatting and consistency
2017-01-12 22:00:06 +01:00
Ines Montani
a3fd32455e
Remove redundant language loading integration tests
2017-01-12 21:59:48 +01:00
Ines Montani
61f1ca09c2
Modernise serializer codecs tests
2017-01-12 21:58:55 +01:00
Ines Montani
5dbc6e59f6
Modernise Huffman tests
2017-01-12 21:58:40 +01:00
Ines Montani
edeeeccea5
Modernise packer tests and don't depend on models where possible
2017-01-12 21:58:07 +01:00
Ines Montani
d084676cd0
Modernise and merge serialization tests
2017-01-12 21:57:19 +01:00
Ines Montani
442237787c
Add assert_docs_equal util to compare two docs
2017-01-12 21:56:52 +01:00
Ines Montani
eac3f700fb
Add fixture for entity recognizer
2017-01-12 21:56:32 +01:00
Ines Montani
b438cfddbc
Modernise matcher tests and split into two files
2017-01-12 17:51:46 +01:00
Ines Montani
27482ebed8
Move matcher tests for #188 and #242 to regression tests
...
Modernise tests and remove unnecessary imports
2017-01-12 17:33:57 +01:00
Ines Montani
0a4dc632bd
Update test to not create redundant Doc object
2017-01-12 17:33:18 +01:00
Ines Montani
a2526e66d8
Fix formatting, naming and unicode declaration
2017-01-12 16:51:13 +01:00
Ines Montani
052cdff07d
Modernise vector similarity tests
2017-01-12 16:51:13 +01:00
Ines Montani
bd20ec0a6a
Add get_cosine util function
2017-01-12 16:51:13 +01:00
Ines Montani
51ef75f629
Fix regression test for #615 and remove unnecessary imports
2017-01-12 16:51:12 +01:00
Ines Montani
aeb747e10c
Adjust formatting
2017-01-12 16:51:12 +01:00
Ines Montani
8e3e58a7e6
Modernise and merge lexeme vocab tests
2017-01-12 16:51:12 +01:00
Ines Montani
c3d4516fc2
Move test for #361 to regression tests
2017-01-12 16:51:12 +01:00
Daniel Hershcovich
99eb494a82
Fix #737: support loading word vectors with " " as a word
2017-01-12 17:00:14 +02:00
Ines Montani
7cb3d74426
Modernise span tests and don't depend on models
2017-01-12 15:30:49 +01:00
Ines Montani
92e3d8b3ee
Modernise vocab API tests and remove old xfailing tests
2017-01-12 15:27:46 +01:00
Ines Montani
7ea87684cd
Rename test_vocab.py to test_vocab_api.py
2017-01-12 15:12:21 +01:00
Ines Montani
0da2ee5c68
Merge flag features tests into orth tests in tests root
2017-01-12 15:12:00 +01:00
Ines Montani
03c136cfd3
Remove StringStore tests from vocab tests
2017-01-12 15:11:15 +01:00
Ines Montani
d7bd57abdf
Modernise add vectors vocab test
2017-01-12 15:09:49 +01:00
Ines Montani
89525ef345
Use consistent test names
2017-01-12 15:09:21 +01:00
Ines Montani
f8803808ce
Remove old unused tests and conftest files
2017-01-12 15:09:05 +01:00
Ines Montani
4d0bfebcd9
Move Pragmatic Segmenter test cases (currently unused) to parser tests
2017-01-12 15:08:02 +01:00
Ines Montani
26d018d874
Add tests for StringStore
2017-01-12 15:07:31 +01:00
Ines Montani
9b6784bab5
Add fixture for StringStore
2017-01-12 15:05:40 +01:00
Ines Montani
99d66d613a
Modernise tests for merging spans and don't depend on models
2017-01-12 12:26:26 +01:00
Ines Montani
fa8f67596d
Remove unused old test
2017-01-12 12:26:08 +01:00
Ines Montani
359f73a96b
Move test for #54 to regression tests
2017-01-12 12:25:51 +01:00
Ines Montani
3f3a46722c
Remove unused conftest
2017-01-12 12:25:24 +01:00
Ines Montani
c2406e92bc
Allow setting ents in get_doc
2017-01-12 12:25:10 +01:00
Ines Montani
c5914c6fe5
Fix and pass regression test for #736
2017-01-12 11:48:56 +01:00
Matthew Honnibal
4e48862fa8
Remove print statement
2017-01-12 11:25:39 +01:00
Matthew Honnibal
d1d8214767
Increment version
2017-01-12 11:21:57 +01:00
Matthew Honnibal
fba67fa342
Fix Issue #736: Times were being tokenized with incorrect string values.
2017-01-12 11:21:01 +01:00
Ines Montani
a6790b6694
Rename tags to pos in get_doc and allow adding tags to tokens
2017-01-12 11:18:36 +01:00
Ines Montani
1add8ace67
Merge lemmatizer tests
2017-01-12 11:16:53 +01:00
Ines Montani
3bc082abdf
Modernise morph exceptions test and don't depend on models
2017-01-12 11:14:29 +01:00
Ines Montani
ec7739b76e
Add regression test for #736
2017-01-12 11:12:44 +01:00
Ines Montani
6c1c564891
Move language-specific tests out of redundant tokenizer directories
2017-01-12 02:17:18 +01:00
Ines Montani
8fecedac3a
Tidy up
2017-01-12 02:16:37 +01:00
Ines Montani
ae7edd30e7
Move text file back to tokenizer tests directory
2017-01-12 02:10:23 +01:00
Ines Montani
ffcaba9017
Remove old and/or redundant tests
2017-01-12 02:10:18 +01:00
Ines Montani
19c4132097
Modernise space attachment parser tests and don't depend on models
2017-01-12 01:54:44 +01:00
Ines Montani
69778924c8
Modernise and merge parser tests and don't depend on models
2017-01-12 01:07:29 +01:00
Ines Montani
178c147612
Modernise nonprojectivity tests and don't depend on models
2017-01-12 01:06:36 +01:00
Ines Montani
1a3984742c
Modernise sentence boundary detection tests and don't depend on models (where possible)
2017-01-11 23:53:08 +01:00
Ines Montani
0cdb6ea61d
Remove old unused pickle test
2017-01-11 23:52:28 +01:00
Ines Montani
c9671329dc
Move test for #309 to regression tests
2017-01-11 23:52:13 +01:00
Ines Montani
d0e37b5670
Modernise parser tests and don't depend on models
2017-01-11 21:30:27 +01:00
Ines Montani
342cb41782
Add apply_transition_sequence util function to utils
2017-01-11 21:30:14 +01:00
Ines Montani
09807addff
Add en_parser fixture
2017-01-11 21:29:59 +01:00
Ines Montani
55d151aa61
Modernise Doc parse tree navigation tests and don't depend on models
2017-01-11 21:14:15 +01:00
Ines Montani
7262421bb2
Use consistent test names
2017-01-11 19:00:52 +01:00
Ines Montani
33800c9367
Rename "tokens" tests to "doc"
2017-01-11 18:59:01 +01:00
Ines Montani
3a9c6a9563
Remove old unused files
2017-01-11 18:58:38 +01:00
Ines Montani
8e962de39f
Remove old word vector tests
2017-01-11 18:55:08 +01:00
Ines Montani
e027936920
Modernise Doc noun chunks tests
2017-01-11 18:54:56 +01:00
Ines Montani
439f396acd
Modernise Doc array tests and don't depend on models
2017-01-11 18:54:46 +01:00
Ines Montani
05447be884
Modernise test for adding entities
2017-01-11 18:54:24 +01:00
Ines Montani
6e883f4c00
Modernise Doc API tests and don't depend on models
2017-01-11 18:05:36 +01:00
Ines Montani
8bf3bb5c44
Make words optional for get_doc
2017-01-11 18:05:10 +01:00
Ines Montani
928db7e419
Fix StringIO import for Python 3
2017-01-11 14:07:48 +01:00
Ines Montani
69998f216b
Rename test_tokens_api.py to test_doc_api.py
2017-01-11 13:58:56 +01:00
Ines Montani
d94dea1b18
Merge token tests into token API tests
2017-01-11 13:57:02 +01:00
Ines Montani
eb23424ab0
Modernise token API tests and don't depend on loading models
2017-01-11 13:56:54 +01:00
Ines Montani
c682b8ca90
Merge conftests into one cohesive file
2017-01-11 13:56:32 +01:00
Ines Montani
909f24d7df
Add test utils and get_doc helper function
...
Create Doc object from given vocab, words and annotations to allow
tests not to depend on loading the models.
2017-01-11 13:55:33 +01:00
Matthew Honnibal
e12c90e03f
Merge branch 'master' of ssh://github.com/explosion/spaCy
2017-01-11 13:03:51 +01:00
Matthew Honnibal
12cd27b821
Amend 8ae8b443f: Handle comparison with None tokens.
2017-01-11 13:03:32 +01:00
Daniel Hershcovich
8e603cc917
Avoid "True if ... else False"
2017-01-11 11:18:22 +02:00
Matthew Honnibal
44e2b0100d
Support TAG attribute in doc.from_array
2017-01-10 22:47:07 +01:00
Ines Montani
3e6e1f0251
Tidy up regression tests
2017-01-10 19:24:10 +01:00
Magnus Burton
aad23ab0b4
Supplemented with capitalized Swedish exceptions
2017-01-10 16:07:20 +01:00
Ines Montani
869963c3c4
Mark extensive prefix/suffix tests as slow
2017-01-10 15:57:35 +01:00
Ines Montani
487e020ebe
Add simple test for surrounding brackets
2017-01-10 15:57:26 +01:00
Ines Montani
0ba5cf51d2
Assert length first
2017-01-10 15:57:00 +01:00
Ines Montani
2185d31907
Adjust names and formatting
2017-01-10 15:56:35 +01:00
Ines Montani
e10d4ca964
Remove semi-redundant URLs and punctuation for faster testing
2017-01-10 15:54:25 +01:00
Ines Montani
3a3cb2c90c
Add unicode declaration
2017-01-10 15:53:15 +01:00
Matthew Honnibal
0f9b8a00a5
Unbreak data download
2017-01-09 23:40:26 +01:00
Matthew Honnibal
8ae8b443f1
Add richcmp method to Token. Closes #631
2017-01-09 19:30:31 +01:00
Matthew Honnibal
64f747cb65
Token comparison test
2017-01-09 19:12:00 +01:00
Matthew Honnibal
18c3c2d05c
Add tests for token comparison, re Issue #631
2017-01-09 19:09:59 +01:00
Matthew Honnibal
97a1286129
Revert changes to tagger and parser for thinc 6
2017-01-09 10:08:34 -06:00
Matthew Honnibal
95a52005df
Revert "Fix Issue #683: Add 'SP' to tag_map, if it's not there already, within the Morphology class."
...
This reverts commit 40e71586d6.
2017-01-09 09:55:55 -06:00
Ines Montani
363f09e68c
Merge pull request #726 from magnusburton/master
...
Added Swedish abbreviations as token exceptions
2017-01-09 14:58:15 +01:00
Matthew Honnibal
42cd598f57
Use correct fixtures in URL tokenizer
2017-01-09 14:10:40 +01:00
Matthew Honnibal
d9a77ddf14
Return None for data path if it doesn't exist
2017-01-09 14:10:05 +01:00
Matthew Honnibal
e4862d1dab
Merge branch 'develop'
2017-01-09 13:36:01 +01:00
Ines Montani
aa876884f0
Revert "Revert "Merge remote-tracking branch 'origin/master'""
...
This reverts commit fb9d3bb022.
2017-01-09 13:28:13 +01:00
Ines Montani
d5c72c40eb
Remove old tests for old website example code
2017-01-08 22:28:53 +01:00
Ines Montani
eef94e3ee2
Split off period after two or more uppercase letters (fixes #483)
2017-01-08 22:28:25 +01:00
Ines Montani
a89a6000e5
Remove unused import
2017-01-08 22:17:37 +01:00
Ines Montani
5d28664fc5
Don't test Hungarian for numbers and hyphens for now
...
Reinvestigate behaviour of case affixes given reorganised tokenizer
patterns.
2017-01-08 20:45:40 +01:00
Ines Montani
53362b6b93
Reorganise Hungarian prefixes/suffixes/infixes
...
Use global prefixes and suffixes for non-language-specific rules,
import list of alpha unicode characters and adjust regexes.
2017-01-08 20:40:33 +01:00
Ines Montani
347c4a2d06
Reorganise and reformat global tokenizer prefixes, suffixes and infixes
2017-01-08 20:37:39 +01:00
Ines Montani
0dec90e9f7
Use global abbreviation data in languages and remove duplicates
2017-01-08 20:36:00 +01:00
Ines Montani
7c3cb2a652
Add global abbreviations data
2017-01-08 20:34:03 +01:00
Ines Montani
de5aa92bc2
Handle deprecated tokenizer prefix data
2017-01-08 20:33:28 +01:00
Ines Montani
abb09782f9
Move sun.txt to original location and fix path to not break parser tests
2017-01-08 20:32:54 +01:00
Ines Montani
cab39c59c5
Add missing contractions to English tokenizer exceptions
...
Inspired by
https://github.com/kootenpv/contractions/blob/master/contractions/__init__.py
2017-01-05 19:59:06 +01:00
Ines Montani
a23504fe07
Move abbreviations below other exceptions
2017-01-05 19:58:07 +01:00
Ines Montani
7d2cf934b9
Generate he/she/it correctly with 's instead of 've
2017-01-05 19:57:00 +01:00
Ines Montani
8328925e1f
Add newlines to long German text
2017-01-05 18:13:30 +01:00
Ines Montani
55b46d7cf6
Add tokenizer tests for German
2017-01-05 18:11:25 +01:00
Ines Montani
5bb4081f52
Remove redundant test_tokenizer.py for English
2017-01-05 18:11:11 +01:00
Ines Montani
8216ba599b
Add tests for longer and mixed English texts
2017-01-05 18:11:04 +01:00
Ines Montani
65f937d5c6
Move basic contraction tests to test_contractions.py
2017-01-05 18:09:53 +01:00
Ines Montani
bbe7cab3a1
Move non-English-specific tests back to general tokenizer tests
2017-01-05 18:09:29 +01:00
Ines Montani
038002d616
Reformat HU tokenizer tests and adapt to general style
...
Improve readability of test cases and add conftest.py with fixture
2017-01-05 18:06:44 +01:00
Ines Montani
bc911322b3
Move ") to emoticons (see Tweebo challenge test)
2017-01-05 18:05:38 +01:00
Ines Montani
637f785036
Add general sanity tests for all tokenizers
2017-01-05 16:25:38 +01:00
Ines Montani
c5f2dc15de
Move English tokenizer tests to directory /en
2017-01-05 16:25:04 +01:00
Ines Montani
8b45363b4d
Modernize and merge general tokenizer tests
2017-01-05 13:17:05 +01:00
Ines Montani
02cfda48c9
Modernize and merge tokenizer tests for string loading
2017-01-05 13:16:55 +01:00
Ines Montani
a11f684822
Modernize and merge tokenizer tests for whitespace
2017-01-05 13:16:33 +01:00
Ines Montani
8b284fc6f1
Modernize and merge tokenizer tests for text from file
2017-01-05 13:15:52 +01:00
Ines Montani
2c2e878653
Modernize and merge tokenizer tests for punctuation
2017-01-05 13:14:16 +01:00
Ines Montani
8a74129cdf
Modernize and merge tokenizer tests for prefixes/suffixes/infixes
2017-01-05 13:13:12 +01:00
Ines Montani
0e65dca9a5
Modernize and merge tokenizer tests for exceptions and emoticons
2017-01-05 13:11:31 +01:00
Ines Montani
34c47bb20d
Fix formatting
2017-01-05 13:10:51 +01:00
Ines Montani
2e72683baa
Add missing docstrings
2017-01-05 13:10:21 +01:00
Ines Montani
da10a049a6
Add unicode declarations
2017-01-05 13:09:48 +01:00
Ines Montani
58adae8774
Remove unused file
2017-01-05 13:09:22 +01:00
Ines Montani
c6e5a5349d
Move regression test for #360 into own file
2017-01-04 00:49:31 +01:00
Ines Montani
8279993a6f
Modernize and merge tokenizer tests for punctuation
2017-01-04 00:49:20 +01:00
Ines Montani
550630df73
Update tokenizer tests for contractions
2017-01-04 00:48:42 +01:00
Ines Montani
109f202e8f
Update conftest fixture
2017-01-04 00:48:21 +01:00
Ines Montani
ee6b49b293
Modernize tokenizer tests for emoticons
2017-01-04 00:47:59 +01:00
Ines Montani
f09b5a5dfd
Modernize tokenizer tests for infixes
2017-01-04 00:47:42 +01:00
Ines Montani
59059fed27
Move regression test for #351 to own file
2017-01-04 00:47:11 +01:00
Ines Montani
667051375d
Modernize tokenizer tests for whitespace
2017-01-04 00:46:35 +01:00
Ines Montani
aafc894285
Modernize tokenizer tests for contractions
...
Use @pytest.mark.parametrize.
2017-01-03 23:02:21 +01:00
Ines Montani
1d237664af
Add lowercase lemma to tokenizer exceptions
2017-01-03 23:02:21 +01:00
Ines Montani
84a87951eb
Fix typos
2017-01-03 18:27:43 +01:00
Ines Montani
35b39f53c3
Reorganise English tokenizer exceptions (as discussed in #718)
...
Add logic to generate exceptions that follow a consistent pattern (like
verbs and pronouns) and allow certain tokens to be excluded explicitly.
2017-01-03 18:26:09 +01:00