Raphaël Bournhonesque
|
3fd2742649
|
load_vectors should accept arbitrary space characters as word tokens
Fix bug #834
|
2017-02-16 12:08:30 +01:00 |
|
ines
|
f08e180a47
|
Make groups non-capturing
Prevents hitting the 100 named groups limit in Python
|
2017-02-10 13:35:02 +01:00 |
|
ines
|
fa3b8512da
|
Use consistent imports and exports
Bundle everything in language_data to keep it consistent with other
languages and make TOKENIZER_EXCEPTIONS importable from there.
|
2017-02-10 13:34:09 +01:00 |
|
ines
|
21f09d10d7
|
Revert "Revert "Merge pull request #818 from raphael0202/tokenizer_exceptions""
This reverts commit f02a2f9322.
|
2017-02-10 13:17:05 +01:00 |
|
ines
|
f02a2f9322
|
Revert "Merge pull request #818 from raphael0202/tokenizer_exceptions"
This reverts commit b95afdf39c, reversing
changes made to b0ccf32378.
|
2017-02-09 17:07:21 +01:00 |
|
Raphaël Bournhonesque
|
309da78bf0
|
Merge branch 'master' into tokenizer_exceptions
|
2017-02-09 16:32:12 +01:00 |
|
Raphaël Bournhonesque
|
4ce0bbc6b6
|
Update unit tests
|
2017-02-09 16:30:43 +01:00 |
|
Raphaël Bournhonesque
|
5d706ab95d
|
Merge tokenizer exceptions from PR #802
|
2017-02-09 16:30:28 +01:00 |
|
ines
|
654fe447b1
|
Add Swedish tokenizer tests (see #807)
|
2017-02-05 11:47:07 +01:00 |
|
ines
|
6715615d55
|
Add missing EXC variable and combine tokenizer exceptions
|
2017-02-05 11:42:52 +01:00 |
|
Ines Montani
|
30a52d576b
|
Merge pull request #807 from magnusburton/master
Added swedish lemma rules and more verb contractions
|
2017-02-05 11:34:19 +01:00 |
|
Magnus Burton
|
19c0ce745a
|
Added swedish lemma rules
|
2017-02-04 17:53:32 +01:00 |
|
Michael Wallin
|
d25556bf80
|
[issue 805] Fix issue
|
2017-02-04 16:22:21 +02:00 |
|
Michael Wallin
|
35100c8bdd
|
[issue 805] Add regression test and the required fixture
|
2017-02-04 16:21:34 +02:00 |
|
ines
|
0ab353b0ca
|
Add line breaks to Finnish stop words for better readability
|
2017-02-04 13:40:25 +01:00 |
|
Michael Wallin
|
1a1952afa5
|
[finnish] Add initial tests for tokenizer
|
2017-02-04 13:54:10 +02:00 |
|
Michael Wallin
|
f9bb25d1cf
|
[finnish] Reformat and correct stop words
|
2017-02-04 13:54:10 +02:00 |
|
Michael Wallin
|
73f66ec570
|
Add preliminary support for Finnish
|
2017-02-04 13:54:10 +02:00 |
|
Ines Montani
|
65d6202107
|
Merge pull request #802 from Tpt/fr-tokenizer
Adds more French tokenizer exceptions
|
2017-02-03 10:52:20 +01:00 |
|
Tpt
|
75a74857bb
|
Adds more French tokenizer exceptions
|
2017-02-03 13:45:18 +04:00 |
|
Ines Montani
|
afc6365388
|
Update regression test for #801 to match current expected behaviour
|
2017-02-02 16:23:05 +01:00 |
|
Ines Montani
|
012f4820cb
|
Keep infixes of punctuation + hyphens as one token (see #801)
|
2017-02-02 16:22:40 +01:00 |
|
Ines Montani
|
1219a5f513
|
Add = to tokenizer prefixes
|
2017-02-02 16:21:11 +01:00 |
|
Ines Montani
|
ff04748eb6
|
Add missing emoticon
|
2017-02-02 16:21:00 +01:00 |
|
Ines Montani
|
13a4ab37e0
|
Add regression test for #801
|
2017-02-02 15:33:52 +01:00 |
|
Raphaël Bournhonesque
|
85f951ca99
|
Add tokenizer exceptions for French
|
2017-02-02 08:36:16 +01:00 |
|
Matvey Ezhov
|
32a22291bc
|
Small Doc.count_by documentation update
Current example doesn't work
|
2017-01-31 19:18:45 +03:00 |
|
Ines Montani
|
e4875834fe
|
Fix formatting
|
2017-01-31 15:19:33 +01:00 |
|
Ines Montani
|
c304834e45
|
Add missing import
|
2017-01-31 15:18:30 +01:00 |
|
Ines Montani
|
e6465b9ca3
|
Parametrize test cases and mark as xfail
|
2017-01-31 15:14:42 +01:00 |
|
latkins
|
e4c84321a5
|
Added regression test for Issue #792.
|
2017-01-31 13:47:42 +00:00 |
|
Matthew Honnibal
|
6c665b81df
|
Fix redundant == TAG in from_array conditional
|
2017-01-31 00:46:21 +11:00 |
|
Ines Montani
|
19501f3340
|
Add regression test for #775
|
2017-01-25 13:16:52 +01:00 |
|
Ines Montani
|
209c37bbcf
|
Exclude "shell" and "Shell" from English tokenizer exceptions (resolves #775)
|
2017-01-25 13:15:02 +01:00 |
|
Raphaël Bournhonesque
|
1be9c0e724
|
Add fr tokenization unit tests
|
2017-01-24 10:57:37 +01:00 |
|
Raphaël Bournhonesque
|
1faaf698ca
|
Add infixes and abbreviation exceptions (fr)
|
2017-01-24 10:57:37 +01:00 |
|
Raphaël Bournhonesque
|
cf8474401b
|
Remove unused import statement
|
2017-01-24 10:57:37 +01:00 |
|
Raphaël Bournhonesque
|
902f136f18
|
Add support for elision in French
|
2017-01-24 10:57:37 +01:00 |
|
Ines Montani
|
55c9c62abc
|
Use relative import
|
2017-01-23 21:27:49 +01:00 |
|
Ines Montani
|
0967eb07be
|
Add regression test for #768
|
2017-01-23 21:25:46 +01:00 |
|
Ines Montani
|
6baa98f774
|
Merge pull request #769 from raphael0202/spacy-768
Allow zero-width 'infix' token
|
2017-01-23 21:24:33 +01:00 |
|
Raphaël Bournhonesque
|
dce8f5515e
|
Allow zero-width 'infix' token
|
2017-01-23 18:28:01 +01:00 |
|
Ines Montani
|
5f6f48e734
|
Add regression test for #759
|
2017-01-20 15:11:48 +01:00 |
|
Ines Montani
|
09ecc39b4e
|
Fix multi-line string of NUM_WORDS (resolves #759)
|
2017-01-20 15:11:48 +01:00 |
|
Magnus Burton
|
69eab727d7
|
Added loops to handle contractions with verbs
|
2017-01-19 14:08:52 +01:00 |
|
Matthew Honnibal
|
be26085277
|
Fix missing import
Closes #755
|
2017-01-19 22:03:52 +11:00 |
|
Ines Montani
|
7e36568d5b
|
Fix title to accommodate sputnik
|
2017-01-17 00:51:09 +01:00 |
|
Ines Montani
|
d704cfa60d
|
Fix typo
|
2017-01-16 21:30:33 +01:00 |
|
Ines Montani
|
64e142f460
|
Update about.py
|
2017-01-16 14:23:08 +01:00 |
|
Matthew Honnibal
|
e889cd698e
|
Increment version
|
2017-01-16 14:01:35 +01:00 |
|
Matthew Honnibal
|
e7f8e13cf3
|
Make Token hashable. Fixes #743
|
2017-01-16 13:27:57 +01:00 |
|
Matthew Honnibal
|
2c60d0cb1e
|
Test #743: Tokens unhashable.
|
2017-01-16 13:27:26 +01:00 |
|
Matthew Honnibal
|
48c712f1c1
|
Merge branch 'master' of ssh://github.com/explosion/spaCy
|
2017-01-16 13:18:06 +01:00 |
|
Matthew Honnibal
|
7ccf490c73
|
Increment version
|
2017-01-16 13:17:58 +01:00 |
|
Ines Montani
|
50878ef598
|
Exclude "were" and "Were" from tokenizer exceptions and add regression test (resolves #744)
|
2017-01-16 13:10:38 +01:00 |
|
Ines Montani
|
e053c7693b
|
Fix formatting
|
2017-01-16 13:09:52 +01:00 |
|
Ines Montani
|
116c675c3c
|
Merge pull request #742 from oroszgy/hu_tokenizer_fix
Improved Hungarian tokenizer
|
2017-01-14 23:52:44 +01:00 |
|
Gyorgy Orosz
|
92345b6a41
|
Further numeric test.
|
2017-01-14 22:44:19 +01:00 |
|
Gyorgy Orosz
|
b4df202bfa
|
Better error handling
|
2017-01-14 22:24:58 +01:00 |
|
Gyorgy Orosz
|
b03a46792c
|
Better error handling
|
2017-01-14 22:09:29 +01:00 |
|
Gyorgy Orosz
|
a45f22913f
|
Added further abbreviations present in the Szeged corpus
|
2017-01-14 22:08:55 +01:00 |
|
Ines Montani
|
332ce2d758
|
Update README.md
|
2017-01-14 21:12:11 +01:00 |
|
Gyorgy Orosz
|
9505c6a72b
|
Passing all old tests.
|
2017-01-14 20:39:21 +01:00 |
|
Gyorgy Orosz
|
63037e79af
|
Fixed hyphen handling in the Hungarian tokenizer.
|
2017-01-14 16:30:11 +01:00 |
|
Gyorgy Orosz
|
f77c0284d6
|
Maintaining compatibility with other spacy tokenizers.
|
2017-01-14 16:19:15 +01:00 |
|
Gyorgy Orosz
|
be7a7aeb1a
|
Reversed accidental changes.
|
2017-01-14 15:59:36 +01:00 |
|
Gyorgy Orosz
|
1be5da1ac6
|
Fixed Hungarian tokenizer for numbers
|
2017-01-14 15:51:59 +01:00 |
|
Ines Montani
|
a89e269a5a
|
Fix test formatting and consistency
|
2017-01-14 13:41:19 +01:00 |
|
Ines Montani
|
3424e3a7e5
|
Update README.md
|
2017-01-13 15:54:54 +01:00 |
|
Ines Montani
|
49186b34a1
|
Mark lemmatizer tests as models since they use installed data
|
2017-01-13 15:12:07 +01:00 |
|
Ines Montani
|
138deb80a1
|
Modernise vector tests, use add_vecs_to_vocab and don't depend on models
|
2017-01-13 15:12:07 +01:00 |
|
Ines Montani
|
96f0caa28a
|
Fix test name for consistency
|
2017-01-13 15:12:07 +01:00 |
|
Ines Montani
|
dc2bb1259f
|
Add util function to add vectors to vocab
|
2017-01-13 15:12:07 +01:00 |
|
Ines Montani
|
db9b25663d
|
Reformat add_docs_equal and add docstring
|
2017-01-13 15:12:07 +01:00 |
|
Ines Montani
|
62ce0a0073
|
Add README.md to tests to explain organisation and conventions
|
2017-01-13 15:11:18 +01:00 |
|
Ines Montani
|
38d60f6b90
|
Modernise serializer I/O tests and don't depend on models where possible
|
2017-01-13 02:24:56 +01:00 |
|
Ines Montani
|
4bb5b89ee4
|
Add text_file_b fixture using BytesIO
|
2017-01-13 02:23:50 +01:00 |
|
Ines Montani
|
49febd8c62
|
Modernise noun chunks tests and don't depend on models
|
2017-01-13 02:01:00 +01:00 |
|
Ines Montani
|
3ee97b5686
|
Rename test_parser to test_noun_chunks
|
2017-01-13 01:36:33 +01:00 |
|
Ines Montani
|
a308703f47
|
Remove old tests
|
2017-01-13 01:34:48 +01:00 |
|
Ines Montani
|
12eb8edf26
|
Move parser tests from unit to parser
|
2017-01-13 01:34:38 +01:00 |
|
Ines Montani
|
138c53ff2e
|
Merge tokenizer tests
|
2017-01-13 01:34:14 +01:00 |
|
Ines Montani
|
01f36ca3ff
|
Move attrs tests from unit to root and modernise
|
2017-01-13 01:33:50 +01:00 |
|
Ines Montani
|
3610d27967
|
Move alignment tests from munge to gold and modernise
|
2017-01-13 01:33:31 +01:00 |
|
Ines Montani
|
094ff7396a
|
Reformat and rename Pragmatic Segmenter tests and mark xfails
|
2017-01-13 01:30:20 +01:00 |
|
Ines Montani
|
affcf1b19d
|
Modernise lemmatizer tests
|
2017-01-12 23:41:17 +01:00 |
|
Ines Montani
|
33d9cf87f9
|
Modernise tagger tests and fix xpassing test
|
2017-01-12 23:40:52 +01:00 |
|
Ines Montani
|
33e5f8dc2e
|
Create basic and extended test set for URLs
|
2017-01-12 23:40:02 +01:00 |
|
Ines Montani
|
5e4f5ebfc8
|
Modernise BILUO tests
|
2017-01-12 23:39:18 +01:00 |
|
Ines Montani
|
09acfbca01
|
Add Lemmatizer fixture
|
2017-01-12 23:38:55 +01:00 |
|
Ines Montani
|
514bfa2597
|
Add path fixture for spaCy data path
|
2017-01-12 23:38:47 +01:00 |
|
Ines Montani
|
0894b8c0ef
|
Don't split tokens with digits and "/" infixes (resolves #740)
|
2017-01-12 22:58:26 +01:00 |
|
Ines Montani
|
e9e99a5670
|
Add regression test for #740
|
2017-01-12 22:57:38 +01:00 |
|
Ines Montani
|
6935d55409
|
Fix formatting
|
2017-01-12 22:56:20 +01:00 |
|
Ines Montani
|
5f0d196a31
|
Modernise and merge matcher tests
|
2017-01-12 22:23:11 +01:00 |
|
Ines Montani
|
d5d774413a
|
Update comments on EN and DE fixtures
|
2017-01-12 22:03:07 +01:00 |
|
Ines Montani
|
9b4bea1df9
|
Tidy up and rename regression tests and remove unnecessary imports
|
2017-01-12 22:00:37 +01:00 |
|
Ines Montani
|
5e1b6178e3
|
Fix formatting and consistency
|
2017-01-12 22:00:06 +01:00 |
|
Ines Montani
|
a3fd32455e
|
Remove redundant language loading integration tests
|
2017-01-12 21:59:48 +01:00 |
|
Ines Montani
|
61f1ca09c2
|
Modernise serializer codecs tests
|
2017-01-12 21:58:55 +01:00 |
|
Ines Montani
|
5dbc6e59f6
|
Modernise Huffman tests
|
2017-01-12 21:58:40 +01:00 |
|
Ines Montani
|
edeeeccea5
|
Modernise packer tests and don't depend on models where possible
|
2017-01-12 21:58:07 +01:00 |
|
Ines Montani
|
d084676cd0
|
Modernise and merge serialization tests
|
2017-01-12 21:57:19 +01:00 |
|
Ines Montani
|
442237787c
|
Add assert_docs_equal util to compare two docs
|
2017-01-12 21:56:52 +01:00 |
|
Ines Montani
|
eac3f700fb
|
Add fixture for entity recognizer
|
2017-01-12 21:56:32 +01:00 |
|
Ines Montani
|
b438cfddbc
|
Modernise matcher tests and split into two files
|
2017-01-12 17:51:46 +01:00 |
|
Ines Montani
|
27482ebed8
|
Move matcher tests for #188 and #242 to regression tests
Modernise tests and remove unnecessary imports
|
2017-01-12 17:33:57 +01:00 |
|
Ines Montani
|
0a4dc632bd
|
Update test to not create redundant Doc object
|
2017-01-12 17:33:18 +01:00 |
|
Ines Montani
|
a2526e66d8
|
Fix formatting, naming and unicode declaration
|
2017-01-12 16:51:13 +01:00 |
|
Ines Montani
|
052cdff07d
|
Modernise vector similarity tests
|
2017-01-12 16:51:13 +01:00 |
|
Ines Montani
|
bd20ec0a6a
|
Add get_cosine util function
|
2017-01-12 16:51:13 +01:00 |
|
Ines Montani
|
51ef75f629
|
Fix regression test for #615 and remove unnecessary imports
|
2017-01-12 16:51:12 +01:00 |
|
Ines Montani
|
aeb747e10c
|
Adjust formatting
|
2017-01-12 16:51:12 +01:00 |
|
Ines Montani
|
8e3e58a7e6
|
Modernise and merge lexeme vocab tests
|
2017-01-12 16:51:12 +01:00 |
|
Ines Montani
|
c3d4516fc2
|
Move test for #361 to regression tests
|
2017-01-12 16:51:12 +01:00 |
|
Daniel Hershcovich
|
99eb494a82
|
Fix #737: support loading word vectors with " " as a word
|
2017-01-12 17:00:14 +02:00 |
|
Ines Montani
|
7cb3d74426
|
Modernise span tests and don't depend on models
|
2017-01-12 15:30:49 +01:00 |
|
Ines Montani
|
92e3d8b3ee
|
Modernise vocab API tests and remove old xfailing tests
|
2017-01-12 15:27:46 +01:00 |
|
Ines Montani
|
7ea87684cd
|
Rename test_vocab.py to test_vocab_api.py
|
2017-01-12 15:12:21 +01:00 |
|
Ines Montani
|
0da2ee5c68
|
Merge flag features tests into orth tests in tests root
|
2017-01-12 15:12:00 +01:00 |
|
Ines Montani
|
03c136cfd3
|
Remove StringStore tests from vocab tests
|
2017-01-12 15:11:15 +01:00 |
|
Ines Montani
|
d7bd57abdf
|
Modernise add vectors vocab test
|
2017-01-12 15:09:49 +01:00 |
|
Ines Montani
|
89525ef345
|
Use consistent test names
|
2017-01-12 15:09:21 +01:00 |
|
Ines Montani
|
f8803808ce
|
Remove old unused tests and conftest files
|
2017-01-12 15:09:05 +01:00 |
|
Ines Montani
|
4d0bfebcd9
|
Move Pragmatic Segmenter test cases (currently unused) to parser tests
|
2017-01-12 15:08:02 +01:00 |
|
Ines Montani
|
26d018d874
|
Add tests for StringStore
|
2017-01-12 15:07:31 +01:00 |
|
Ines Montani
|
9b6784bab5
|
Add fixture for StringStore
|
2017-01-12 15:05:40 +01:00 |
|
Ines Montani
|
99d66d613a
|
Modernise tests for merging spans and don't depend on models
|
2017-01-12 12:26:26 +01:00 |
|
Ines Montani
|
fa8f67596d
|
Remove unused old test
|
2017-01-12 12:26:08 +01:00 |
|
Ines Montani
|
359f73a96b
|
Move test for #54 to regression tests
|
2017-01-12 12:25:51 +01:00 |
|
Ines Montani
|
3f3a46722c
|
Remove unused conftest
|
2017-01-12 12:25:24 +01:00 |
|
Ines Montani
|
c2406e92bc
|
Allow setting ents in get_doc
|
2017-01-12 12:25:10 +01:00 |
|
Ines Montani
|
c5914c6fe5
|
Fix and pass regression test for #736
|
2017-01-12 11:48:56 +01:00 |
|
Matthew Honnibal
|
4e48862fa8
|
Remove print statement
|
2017-01-12 11:25:39 +01:00 |
|
Matthew Honnibal
|
d1d8214767
|
Increment version
|
2017-01-12 11:21:57 +01:00 |
|
Matthew Honnibal
|
fba67fa342
|
Fix Issue #736: Times were being tokenized with incorrect string values.
|
2017-01-12 11:21:01 +01:00 |
|
Ines Montani
|
a6790b6694
|
Rename tags to pos in get_doc and allow adding tags to tokens
|
2017-01-12 11:18:36 +01:00 |
|
Ines Montani
|
1add8ace67
|
Merge lemmatizer tests
|
2017-01-12 11:16:53 +01:00 |
|
Ines Montani
|
3bc082abdf
|
Modernise morph exceptions test and don't depend on models
|
2017-01-12 11:14:29 +01:00 |
|
Ines Montani
|
ec7739b76e
|
Add regression test for #736
|
2017-01-12 11:12:44 +01:00 |
|
Ines Montani
|
6c1c564891
|
Move language-specific tests out of redundant tokenizer directories
|
2017-01-12 02:17:18 +01:00 |
|
Ines Montani
|
8fecedac3a
|
Tidy up
|
2017-01-12 02:16:37 +01:00 |
|
Ines Montani
|
ae7edd30e7
|
Move text file back to tokenizer tests directory
|
2017-01-12 02:10:23 +01:00 |
|
Ines Montani
|
ffcaba9017
|
Remove old and/or redundant tests
|
2017-01-12 02:10:18 +01:00 |
|
Ines Montani
|
19c4132097
|
Modernise space attachment parser tests and don't depend on models
|
2017-01-12 01:54:44 +01:00 |
|
Ines Montani
|
69778924c8
|
Modernise and merge parser tests and don't depend on models
|
2017-01-12 01:07:29 +01:00 |
|
Ines Montani
|
178c147612
|
Modernise nonprojectivity tests and don't depend on models
|
2017-01-12 01:06:36 +01:00 |
|
Ines Montani
|
1a3984742c
|
Modernise sentence boundary detection tests and don't depend on models (where possible)
|
2017-01-11 23:53:08 +01:00 |
|
Ines Montani
|
0cdb6ea61d
|
Remove old unused pickle test
|
2017-01-11 23:52:28 +01:00 |
|
Ines Montani
|
c9671329dc
|
Move test for #309 to regression tests
|
2017-01-11 23:52:13 +01:00 |
|