Ines Montani
50878ef598
Exclude "were" and "Were" from tokenizer exceptions and add regression test ( resolves #744 )
2017-01-16 13:10:38 +01:00
Ines Montani
e053c7693b
Fix formatting
2017-01-16 13:09:52 +01:00
Ines Montani
116c675c3c
Merge pull request #742 from oroszgy/hu_tokenizer_fix
...
Improved Hungarian tokenizer
2017-01-14 23:52:44 +01:00
Gyorgy Orosz
92345b6a41
Further numeric test.
2017-01-14 22:44:19 +01:00
Gyorgy Orosz
b4df202bfa
Better error handling
2017-01-14 22:24:58 +01:00
Gyorgy Orosz
b03a46792c
Better error handling
2017-01-14 22:09:29 +01:00
Gyorgy Orosz
a45f22913f
Added further abbreviations present in the Szeged corpus
2017-01-14 22:08:55 +01:00
Ines Montani
332ce2d758
Update README.md
2017-01-14 21:12:11 +01:00
Gyorgy Orosz
9505c6a72b
Passing all old tests.
2017-01-14 20:39:21 +01:00
Gyorgy Orosz
63037e79af
Fixed hyphen handling in the Hungarian tokenizer.
2017-01-14 16:30:11 +01:00
Gyorgy Orosz
f77c0284d6
Maintaining compatibility with other spacy tokenizers.
2017-01-14 16:19:15 +01:00
Gyorgy Orosz
be7a7aeb1a
Reversed accidental changes.
2017-01-14 15:59:36 +01:00
Gyorgy Orosz
1be5da1ac6
Fixed Hungarian tokenizer for numbers
2017-01-14 15:51:59 +01:00
Ines Montani
a89e269a5a
Fix test formatting and consistency
2017-01-14 13:41:19 +01:00
Ines Montani
3424e3a7e5
Update README.md
2017-01-13 15:54:54 +01:00
Ines Montani
49186b34a1
Mark lemmatizer tests as models since they use installed data
2017-01-13 15:12:07 +01:00
Ines Montani
138deb80a1
Modernise vector tests, use add_vecs_to_vocab and don't depend on models
2017-01-13 15:12:07 +01:00
Ines Montani
96f0caa28a
Fix test name for consistency
2017-01-13 15:12:07 +01:00
Ines Montani
dc2bb1259f
Add util function to add vectors to vocab
2017-01-13 15:12:07 +01:00
Ines Montani
db9b25663d
Reformat add_docs_equal and add docstring
2017-01-13 15:12:07 +01:00
Ines Montani
62ce0a0073
Add README.md to tests to explain organisation and conventions
2017-01-13 15:11:18 +01:00
Ines Montani
38d60f6b90
Modernise serializer I/O tests and don't depend on models where possible
2017-01-13 02:24:56 +01:00
Ines Montani
4bb5b89ee4
Add text_file_b fixture using BytesIO
2017-01-13 02:23:50 +01:00
Ines Montani
49febd8c62
Modernise noun chunks tests and don't depend on models
2017-01-13 02:01:00 +01:00
Ines Montani
3ee97b5686
Rename test_parser to test_noun_chunks
2017-01-13 01:36:33 +01:00
Ines Montani
a308703f47
Remove old tests
2017-01-13 01:34:48 +01:00
Ines Montani
12eb8edf26
Move parser tests from unit to parser
2017-01-13 01:34:38 +01:00
Ines Montani
138c53ff2e
Merge tokenizer tests
2017-01-13 01:34:14 +01:00
Ines Montani
01f36ca3ff
Move attrs tests from unit to root and modernise
2017-01-13 01:33:50 +01:00
Ines Montani
3610d27967
Move alignment tests from munge to gold and modernise
2017-01-13 01:33:31 +01:00
Ines Montani
094ff7396a
Reformat and rename Pragmatic Segmenter tests and mark xfails
2017-01-13 01:30:20 +01:00
Ines Montani
affcf1b19d
Modernise lemmatizer tests
2017-01-12 23:41:17 +01:00
Ines Montani
33d9cf87f9
Modernise tagger tests and fix xpassing test
2017-01-12 23:40:52 +01:00
Ines Montani
33e5f8dc2e
Create basic and extended test set for URLs
2017-01-12 23:40:02 +01:00
Ines Montani
5e4f5ebfc8
Modernise BILUO tests
2017-01-12 23:39:18 +01:00
Ines Montani
09acfbca01
Add Lemmatizer fixture
2017-01-12 23:38:55 +01:00
Ines Montani
514bfa2597
Add path fixture for spaCy data path
2017-01-12 23:38:47 +01:00
Ines Montani
0894b8c0ef
Don't split tokens with digits and "/" infixes ( resolves #740 )
2017-01-12 22:58:26 +01:00
Ines Montani
e9e99a5670
Add regression test for #740
2017-01-12 22:57:38 +01:00
Ines Montani
6935d55409
Fix formatting
2017-01-12 22:56:20 +01:00
Ines Montani
5f0d196a31
Modernise and merge matcher tests
2017-01-12 22:23:11 +01:00
Ines Montani
d5d774413a
Update comments on EN and DE fixtures
2017-01-12 22:03:07 +01:00
Ines Montani
9b4bea1df9
Tidy up and rename regression tests and remove unnecessary imports
2017-01-12 22:00:37 +01:00
Ines Montani
5e1b6178e3
Fix formatting and consistency
2017-01-12 22:00:06 +01:00
Ines Montani
a3fd32455e
Remove redundant language loading integration tests
2017-01-12 21:59:48 +01:00
Ines Montani
61f1ca09c2
Modernise serializer codecs tests
2017-01-12 21:58:55 +01:00
Ines Montani
5dbc6e59f6
Modernise Huffman tests
2017-01-12 21:58:40 +01:00
Ines Montani
edeeeccea5
Modernise packer tests and don't depend on models where possible
2017-01-12 21:58:07 +01:00
Ines Montani
d084676cd0
Modernise and merge serialization tests
2017-01-12 21:57:19 +01:00
Ines Montani
442237787c
Add assert_docs_equal util to compare two docs
2017-01-12 21:56:52 +01:00
Ines Montani
eac3f700fb
Add fixture for entity recognizer
2017-01-12 21:56:32 +01:00
Ines Montani
b438cfddbc
Modernise matcher tests and split into two files
2017-01-12 17:51:46 +01:00
Ines Montani
27482ebed8
Move matcher tests for #188 and #242 to regression tests
...
Modernise tests and remove unnecessary imports
2017-01-12 17:33:57 +01:00
Ines Montani
0a4dc632bd
Update test to not create redundant Doc object
2017-01-12 17:33:18 +01:00
Ines Montani
a2526e66d8
Fix formatting, naming and unicode declaration
2017-01-12 16:51:13 +01:00
Ines Montani
052cdff07d
Modernise vector similarity tests
2017-01-12 16:51:13 +01:00
Ines Montani
bd20ec0a6a
Add get_cosine util function
2017-01-12 16:51:13 +01:00
Ines Montani
51ef75f629
Fix regression test for #615 and remove unnecessary imports
2017-01-12 16:51:12 +01:00
Ines Montani
aeb747e10c
Adjust formatting
2017-01-12 16:51:12 +01:00
Ines Montani
8e3e58a7e6
Modernise and merge lexeme vocab tests
2017-01-12 16:51:12 +01:00
Ines Montani
c3d4516fc2
Move test for #361 to regression tests
2017-01-12 16:51:12 +01:00
Daniel Hershcovich
99eb494a82
Fix #737 : support loading word vectors with " " as a word
2017-01-12 17:00:14 +02:00
Ines Montani
7cb3d74426
Modernise span tests and don't depend on models
2017-01-12 15:30:49 +01:00
Ines Montani
92e3d8b3ee
Modernise vocab API tests and remove old xfailing tests
2017-01-12 15:27:46 +01:00
Ines Montani
7ea87684cd
Rename test_vocab.py to test_vocab_api.py
2017-01-12 15:12:21 +01:00
Ines Montani
0da2ee5c68
Merge flag features tests into orth tests in tests root
2017-01-12 15:12:00 +01:00
Ines Montani
03c136cfd3
Remove StringStore tests from vocab tests
2017-01-12 15:11:15 +01:00
Ines Montani
d7bd57abdf
Modernise add vectors vocab test
2017-01-12 15:09:49 +01:00
Ines Montani
89525ef345
Use consistent test names
2017-01-12 15:09:21 +01:00
Ines Montani
f8803808ce
Remove old unused tests and conftest files
2017-01-12 15:09:05 +01:00
Ines Montani
4d0bfebcd9
Move Pragmatic Segmenter test cases (currently unused) to parser tests
2017-01-12 15:08:02 +01:00
Ines Montani
26d018d874
Add tests for StringStore
2017-01-12 15:07:31 +01:00
Ines Montani
9b6784bab5
Add fixture for StringStore
2017-01-12 15:05:40 +01:00
Ines Montani
99d66d613a
Modernise tests for merging spans and don't depend on models
2017-01-12 12:26:26 +01:00
Ines Montani
fa8f67596d
Remove unused old test
2017-01-12 12:26:08 +01:00
Ines Montani
359f73a96b
Move test for #54 to regression tests
2017-01-12 12:25:51 +01:00
Ines Montani
3f3a46722c
Remove unused conftest
2017-01-12 12:25:24 +01:00
Ines Montani
c2406e92bc
Allow setting ents in get_doc
2017-01-12 12:25:10 +01:00
Ines Montani
c5914c6fe5
Fix and pass regression test for #736
2017-01-12 11:48:56 +01:00
Matthew Honnibal
4e48862fa8
Remove print statement
2017-01-12 11:25:39 +01:00
Matthew Honnibal
d1d8214767
Increment version
2017-01-12 11:21:57 +01:00
Matthew Honnibal
fba67fa342
Fix Issue #736 : Times were being tokenized with incorrect string values.
2017-01-12 11:21:01 +01:00
Ines Montani
a6790b6694
Rename tags to pos in get_doc and allow adding tags to tokens
2017-01-12 11:18:36 +01:00
Ines Montani
1add8ace67
Merge lemmatizer tests
2017-01-12 11:16:53 +01:00
Ines Montani
3bc082abdf
Modernise morph exceptions test and don't depend on models
2017-01-12 11:14:29 +01:00
Ines Montani
ec7739b76e
Add regression test for #736
2017-01-12 11:12:44 +01:00
Ines Montani
6c1c564891
Move language-specific tests out of redundant tokenizer directories
2017-01-12 02:17:18 +01:00
Ines Montani
8fecedac3a
Tidy up
2017-01-12 02:16:37 +01:00
Ines Montani
ae7edd30e7
Move text file back to tokenizer tests directory
2017-01-12 02:10:23 +01:00
Ines Montani
ffcaba9017
Remove old and/or redundant tests
2017-01-12 02:10:18 +01:00
Ines Montani
19c4132097
Modernise space attachment parser tests and don't depend on models
2017-01-12 01:54:44 +01:00
Ines Montani
69778924c8
Modernise and merge parser tests and don't depend on models
2017-01-12 01:07:29 +01:00
Ines Montani
178c147612
Modernise nonprojectivity tests and don't depend on models
2017-01-12 01:06:36 +01:00
Ines Montani
1a3984742c
Modernise sentence boundary detection tests and don't depend on models (where possible)
2017-01-11 23:53:08 +01:00
Ines Montani
0cdb6ea61d
Remove old unused pickle test
2017-01-11 23:52:28 +01:00
Ines Montani
c9671329dc
Move test for #309 to regression tests
2017-01-11 23:52:13 +01:00
Ines Montani
d0e37b5670
Modernise parser tests and don't depend on models
2017-01-11 21:30:27 +01:00
Ines Montani
342cb41782
Add apply_transition_sequence util function to utils
2017-01-11 21:30:14 +01:00
Ines Montani
09807addff
Add en_parser fixture
2017-01-11 21:29:59 +01:00
Ines Montani
55d151aa61
Modernise Doc parse tree navigation tests and don't depend on models
2017-01-11 21:14:15 +01:00
Ines Montani
7262421bb2
Use consistent test names
2017-01-11 19:00:52 +01:00
Ines Montani
33800c9367
Rename "tokens" tests to "doc"
2017-01-11 18:59:01 +01:00
Ines Montani
3a9c6a9563
Remove old unused files
2017-01-11 18:58:38 +01:00
Ines Montani
8e962de39f
Remove old word vector tests
2017-01-11 18:55:08 +01:00
Ines Montani
e027936920
Modernise Doc noun chunks tests
2017-01-11 18:54:56 +01:00
Ines Montani
439f396acd
Modernise Doc array tests and don't depend on models
2017-01-11 18:54:46 +01:00
Ines Montani
05447be884
Modernise test for adding entities
2017-01-11 18:54:24 +01:00
Ines Montani
6e883f4c00
Modernise Doc API tests and don't depend on models
2017-01-11 18:05:36 +01:00
Ines Montani
8bf3bb5c44
Make words optional for get_doc
2017-01-11 18:05:10 +01:00
Ines Montani
928db7e419
Fix StringIO import for Python 3
2017-01-11 14:07:48 +01:00
Ines Montani
69998f216b
Rename test_tokens_api.py to test_doc_api.py
2017-01-11 13:58:56 +01:00
Ines Montani
d94dea1b18
Merge token tests into token API tests
2017-01-11 13:57:02 +01:00
Ines Montani
eb23424ab0
Modernise token API tests and don't depend on loading models
2017-01-11 13:56:54 +01:00
Ines Montani
c682b8ca90
Merge conftests into one cohesive file
2017-01-11 13:56:32 +01:00
Ines Montani
909f24d7df
Add test utils and get_doc helper function
...
Create Doc object from given vocab, words and annotations to allow
tests not to depend on loading the models.
2017-01-11 13:55:33 +01:00
Matthew Honnibal
e12c90e03f
Merge branch 'master' of ssh://github.com/explosion/spaCy
2017-01-11 13:03:51 +01:00
Matthew Honnibal
12cd27b821
Amend 8ae8b443f: Handle comparison with None tokens.
2017-01-11 13:03:32 +01:00
Daniel Hershcovich
8e603cc917
Avoid "True if ... else False"
2017-01-11 11:18:22 +02:00
Matthew Honnibal
44e2b0100d
Support TAG attribute in doc.from_array
2017-01-10 22:47:07 +01:00
Ines Montani
3e6e1f0251
Tidy up regression tests
2017-01-10 19:24:10 +01:00
Magnus Burton
aad23ab0b4
Supplemented with capitalized Swedish exceptions
2017-01-10 16:07:20 +01:00
Ines Montani
869963c3c4
Mark extensive prefix/suffix tests as slow
2017-01-10 15:57:35 +01:00
Ines Montani
487e020ebe
Add simple test for surrounding brackets
2017-01-10 15:57:26 +01:00
Ines Montani
0ba5cf51d2
Assert length first
2017-01-10 15:57:00 +01:00
Ines Montani
2185d31907
Adjust names and formatting
2017-01-10 15:56:35 +01:00
Ines Montani
e10d4ca964
Remove semi-redundant URLs and punctuation for faster testing
2017-01-10 15:54:25 +01:00
Ines Montani
3a3cb2c90c
Add unicode declaration
2017-01-10 15:53:15 +01:00
Matthew Honnibal
0f9b8a00a5
Unbreak data download
2017-01-09 23:40:26 +01:00
Matthew Honnibal
8ae8b443f1
Add richcmp method to Token. Closes #631
2017-01-09 19:30:31 +01:00
Matthew Honnibal
64f747cb65
Token comparison test
2017-01-09 19:12:00 +01:00
Matthew Honnibal
18c3c2d05c
Add tests for token comparison, re Issue #631
2017-01-09 19:09:59 +01:00
Matthew Honnibal
97a1286129
Revert changes to tagger and parser for thinc 6
2017-01-09 10:08:34 -06:00
Matthew Honnibal
95a52005df
Revert "Fix Issue #683 : Add 'SP' to tag_map, if it's not there already, within the Morphology class."
...
This reverts commit 40e71586d6.
2017-01-09 09:55:55 -06:00
Ines Montani
363f09e68c
Merge pull request #726 from magnusburton/master
...
Added Swedish abbreviations as token exceptions
2017-01-09 14:58:15 +01:00
Matthew Honnibal
42cd598f57
Use correct fixtures in URL tokenizer
2017-01-09 14:10:40 +01:00
Matthew Honnibal
d9a77ddf14
Return None for data path if it doesn't exist
2017-01-09 14:10:05 +01:00
Matthew Honnibal
e4862d1dab
Merge branch 'develop'
2017-01-09 13:36:01 +01:00
Ines Montani
aa876884f0
Revert "Revert "Merge remote-tracking branch 'origin/master'""
...
This reverts commit fb9d3bb022.
2017-01-09 13:28:13 +01:00
Ines Montani
d5c72c40eb
Remove old tests for old website example code
2017-01-08 22:28:53 +01:00
Ines Montani
eef94e3ee2
Split off period after two or more uppercase letters ( fixes #483 )
2017-01-08 22:28:25 +01:00
Ines Montani
a89a6000e5
Remove unused import
2017-01-08 22:17:37 +01:00
Ines Montani
5d28664fc5
Don't test Hungarian for numbers and hyphens for now
...
Reinvestigate behaviour of case affixes given reorganised tokenizer
patterns.
2017-01-08 20:45:40 +01:00
Ines Montani
53362b6b93
Reorganise Hungarian prefixes/suffixes/infixes
...
Use global prefixes and suffixes for non-language-specific rules,
import list of alpha unicode characters and adjust regexes.
2017-01-08 20:40:33 +01:00
Ines Montani
347c4a2d06
Reorganise and reformat global tokenizer prefixes, suffixes and infixes
2017-01-08 20:37:39 +01:00
Ines Montani
0dec90e9f7
Use global abbreviation data languages and remove duplicates
2017-01-08 20:36:00 +01:00
Ines Montani
7c3cb2a652
Add global abbreviations data
2017-01-08 20:34:03 +01:00
Ines Montani
de5aa92bc2
Handle deprecated tokenizer prefix data
2017-01-08 20:33:28 +01:00
Ines Montani
abb09782f9
Move sun.txt to original location and fix path to not break parser tests
2017-01-08 20:32:54 +01:00
Ines Montani
cab39c59c5
Add missing contractions to English tokenizer exceptions
...
Inspired by
https://github.com/kootenpv/contractions/blob/master/contractions/__init__.py
2017-01-05 19:59:06 +01:00
Ines Montani
a23504fe07
Move abbreviations below other exceptions
2017-01-05 19:58:07 +01:00
Ines Montani
7d2cf934b9
Generate he/she/it correctly with 's instead of 've
2017-01-05 19:57:00 +01:00
Ines Montani
8328925e1f
Add newlines to long German text
2017-01-05 18:13:30 +01:00
Ines Montani
55b46d7cf6
Add tokenizer tests for German
2017-01-05 18:11:25 +01:00
Ines Montani
5bb4081f52
Remove redundant test_tokenizer.py for English
2017-01-05 18:11:11 +01:00
Ines Montani
8216ba599b
Add tests for longer and mixed English texts
2017-01-05 18:11:04 +01:00
Ines Montani
65f937d5c6
Move basic contraction tests to test_contractions.py
2017-01-05 18:09:53 +01:00
Ines Montani
bbe7cab3a1
Move non-English-specific tests back to general tokenizer tests
2017-01-05 18:09:29 +01:00
Ines Montani
038002d616
Reformat HU tokenizer tests and adapt to general style
...
Improve readability of test cases and add conftest.py with fixture
2017-01-05 18:06:44 +01:00
Ines Montani
bc911322b3
Move ") to emoticons (see Tweebo challenge test)
2017-01-05 18:05:38 +01:00
Ines Montani
637f785036
Add general sanity tests for all tokenizers
2017-01-05 16:25:38 +01:00
Ines Montani
c5f2dc15de
Move English tokenizer tests to directory /en
2017-01-05 16:25:04 +01:00
Ines Montani
8b45363b4d
Modernize and merge general tokenizer tests
2017-01-05 13:17:05 +01:00
Ines Montani
02cfda48c9
Modernize and merge tokenizer tests for string loading
2017-01-05 13:16:55 +01:00
Ines Montani
a11f684822
Modernize and merge tokenizer tests for whitespace
2017-01-05 13:16:33 +01:00
Ines Montani
8b284fc6f1
Modernize and merge tokenizer tests for text from file
2017-01-05 13:15:52 +01:00
Ines Montani
2c2e878653
Modernize and merge tokenizer tests for punctuation
2017-01-05 13:14:16 +01:00
Ines Montani
8a74129cdf
Modernize and merge tokenizer tests for prefixes/suffixes/infixes
2017-01-05 13:13:12 +01:00
Ines Montani
0e65dca9a5
Modernize and merge tokenizer tests for exception and emoticons
2017-01-05 13:11:31 +01:00
Ines Montani
34c47bb20d
Fix formatting
2017-01-05 13:10:51 +01:00
Ines Montani
2e72683baa
Add missing docstrings
2017-01-05 13:10:21 +01:00
Ines Montani
da10a049a6
Add unicode declarations
2017-01-05 13:09:48 +01:00
Ines Montani
58adae8774
Remove unused file
2017-01-05 13:09:22 +01:00
Ines Montani
c6e5a5349d
Move regression test for #360 into own file
2017-01-04 00:49:31 +01:00
Ines Montani
8279993a6f
Modernize and merge tokenizer tests for punctuation
2017-01-04 00:49:20 +01:00
Ines Montani
550630df73
Update tokenizer tests for contractions
2017-01-04 00:48:42 +01:00
Ines Montani
109f202e8f
Update conftest fixture
2017-01-04 00:48:21 +01:00
Ines Montani
ee6b49b293
Modernize tokenizer tests for emoticons
2017-01-04 00:47:59 +01:00
Ines Montani
f09b5a5dfd
Modernize tokenizer tests for infixes
2017-01-04 00:47:42 +01:00
Ines Montani
59059fed27
Move regression test for #351 to own file
2017-01-04 00:47:11 +01:00
Ines Montani
667051375d
Modernize tokenizer tests for whitespace
2017-01-04 00:46:35 +01:00
Ines Montani
aafc894285
Modernize tokenizer tests for contractions
...
Use @pytest.mark.parametrize.
2017-01-03 23:02:21 +01:00
Ines Montani
1d237664af
Add lowercase lemma to tokenizer exceptions
2017-01-03 23:02:21 +01:00
Ines Montani
84a87951eb
Fix typos
2017-01-03 18:27:43 +01:00
Ines Montani
35b39f53c3
Reorganise English tokenizer exceptions (as discussed in #718 )
...
Add logic to generate exceptions that follow a consistent pattern (like
verbs and pronouns) and allow certain tokens to be excluded explicitly.
2017-01-03 18:26:09 +01:00
Ines Montani
fb9d3bb022
Revert "Merge remote-tracking branch 'origin/master'"
...
This reverts commit d3b181cdf1, reversing changes made to b19cfcc144.
2017-01-03 18:21:36 +01:00
Ines Montani
461cbb99d8
Revert "Reorganise English tokenizer exceptions (as discussed in #718 )"
...
This reverts commit b19cfcc144.
2017-01-03 18:21:29 +01:00
Ines Montani
d3b181cdf1
Merge remote-tracking branch 'origin/master'
...
# Conflicts:
# spacy/en/tokenizer_exceptions.py
2017-01-03 18:20:01 +01:00
Ines Montani
b19cfcc144
Reorganise English tokenizer exceptions (as discussed in #718 )
...
Add logic to generate exceptions that follow a consistent pattern (like
verbs and pronouns) and allow certain tokens to be excluded explicitly.
2017-01-03 18:17:57 +01:00
Ines Montani
1bd53bbf89
Fix typos ( resolves #718 )
2017-01-03 11:26:21 +01:00
Matthew Honnibal
fde53be3b4
Move whole token match inside _split_affixes.
2016-12-30 17:11:50 -06:00
Matthew Honnibal
3ba7c167a8
Fix URL tests
2016-12-30 17:10:08 -06:00
Matthew Honnibal
9936a1b9b5
Merge branch 'tokenization_w_exception_patterns' of https://github.com/oroszgy/spaCy.hu into oroszgy-tokenization_w_exception_patterns
2016-12-30 14:53:40 -06:00
Magnus Burton
56e2219b65
Added Swedish city abbreviations
2016-12-30 21:17:34 +01:00
Magnus Burton
e935c950d8
Added months and days as abbreviations for Swedish
2016-12-30 21:08:44 +01:00
Matthew Honnibal
3e8d9c772e
Test interaction of token_match and punctuation
...
Check that the new token_match function applies after punctuation is split off.
2016-12-31 00:52:17 +11:00
Matthew Honnibal
74b921f394
Merge branch 'master' of ssh://github.com/explosion/spaCy into develop
2016-12-30 14:38:27 +01:00
Matthew Honnibal
623d94e14f
Whitespace
2016-12-31 00:30:28 +11:00
Matthew Honnibal
af81ac8bb0
Use thinc 6.0
2016-12-29 11:58:42 +01:00
Petter Hohle
f112e7754e
Add PART to tag map
...
16 of the 17 PoS tags in the UD tag set are added; PART is missing.
2016-12-28 18:39:01 +01:00
Matthew Honnibal
f62db78dc3
Increment version
2016-12-27 21:11:22 +01:00
Matthew Honnibal
cade536d1e
Merge branch 'master' of ssh://github.com/explosion/spaCy
2016-12-27 21:04:10 +01:00
Matthew Honnibal
ce4539dafd
Allow the vocabulary to grow to 10,000, to prevent cold-start problem.
2016-12-27 21:03:45 +01:00
Ines Montani
ad3669cef5
Merge pull request #703 from magnusburton/master
...
Added Swedish abbreviations
2016-12-27 01:01:49 +01:00
Ines Montani
78f754dd9a
Merge pull request #705 from oroszgy/hu_tokenizer
...
Initial support for Hungarian
2016-12-27 00:48:13 +01:00
Ines Montani
8785706039
Reformat stop words for better readability
2016-12-24 00:58:40 +01:00
Gyorgy Orosz
45e045a87b
Unicode/UTF8 compatibility for Python2
2016-12-24 00:21:00 +01:00
Gyorgy Orosz
72b61b6d03
Typo fix.
2016-12-24 00:10:29 +01:00
Gyorgy Orosz
3a9be4d485
Updated token exception handling mechanism to allow the usage of arbitrary functions as token exception matchers.
2016-12-23 23:49:34 +01:00
Ines Montani
1436b9f15a
Fix formatting and consistency
2016-12-23 21:36:01 +01:00
Ines Montani
1d64527727
Update Spanish tokenizer
...
Remove reflexive pronouns as they're part of an open class, fix
mistakes and add exceptions
2016-12-23 21:36:01 +01:00
Ines Montani
7f411fd01c
Remove exceptions containing whitespace / no special chars
2016-12-23 14:30:06 +01:00
Magnus Burton
fdf4776262
Added Swedish abbreviations
2016-12-22 22:45:18 +01:00
Gyorgy Orosz
d9c59c4751
Maintaining backward compatibility.
2016-12-21 23:30:49 +01:00
Gyorgy Orosz
1748549aeb
Added exception pattern mechanism to the tokenizer.
2016-12-21 23:16:19 +01:00
Gyorgy Orosz
35aa54765d
Hungarian module is exposed in spacy.
2016-12-21 20:45:36 +01:00
Gyorgy Orosz
ab2f6ea46c
Removed data files from tests.
2016-12-21 20:22:09 +01:00
Ines Montani
3c87c71d43
Add tokenizer exceptions for a.m. and p.m. in Spanish
2016-12-21 18:19:10 +01:00
Ines Montani
78e63dc7d0
Update tokenizer exceptions for English
2016-12-21 18:06:34 +01:00
Ines Montani
702d1eed93
Update tokenizer exceptions for German
2016-12-21 18:06:27 +01:00
Ines Montani
d60380418e
Update tokenizer exceptions for Spanish
2016-12-21 18:06:17 +01:00
Ines Montani
920fa0fed2
Add DET_LEMMA constant
2016-12-21 18:05:41 +01:00
Ines Montani
8978806ea6
Allow Vocab to load without serializer_freqs
2016-12-21 18:05:23 +01:00
Ines Montani
be8ed811f6
Remove trailing whitespace
2016-12-21 18:04:41 +01:00
Ines Montani
926e19184a
Merge pull request #695 from magnusburton/master
...
Added Swedish morph rules
2016-12-21 01:06:00 +01:00
Gyorgy Orosz
3d5306acb9
Added further testcases.
2016-12-20 23:49:35 +01:00
Gyorgy Orosz
23956e72ff
Improved partial support for tokenizing Hungarian numbers
2016-12-20 23:36:59 +01:00
Gyorgy Orosz
6add156075
Refactored language data structure
2016-12-20 22:28:20 +01:00
Gyorgy Orosz
366b3f8685
Merge branch 'master' into hu_tokenizer
2016-12-20 20:53:31 +01:00
Gyorgy Orosz
c035928156
Partial Hungarian number tokenization is added.
2016-12-20 20:46:20 +01:00
JM
70ff0639b5
Fixed missing vec_path declaration that was failing if 'add_vectors' was set
...
Added vec_path variable declaration to avoid accessing it before assignment in case 'add_vectors' is in overrides.
2016-12-20 18:21:05 +01:00
Magnus Burton
48dcc9f647
Added morph rules
2016-12-20 13:18:41 +01:00
Magnus Burton
db5a077d2b
Initial commit for Swedish
2016-12-20 11:05:06 +01:00
Matthew Honnibal
3f5747a9b2
Merge branch 'master' of ssh://github.com/explosion/spaCy
2016-12-18 23:44:22 +01:00
Matthew Honnibal
40e71586d6
Fix Issue #683 : Add 'SP' to tag_map, if it's not there already, within the Morphology class.
2016-12-18 23:44:05 +01:00
Matthew Honnibal
fa1d23e10d
Merge branch 'master' of https://github.com/explosion/spaCy
2016-12-18 23:32:03 +01:00
Matthew Honnibal
f38eb25fe1
Fix test for word vector
2016-12-18 23:31:55 +01:00
Matthew Honnibal
4e68abebc4
Merge branch 'master' of ssh://github.com/explosion/spaCy
2016-12-18 23:19:45 +01:00
Matthew Honnibal
5a6328a5a4
Increment version
2016-12-18 23:19:19 +01:00
Matthew Honnibal
13a0b31279
Another tweak to GloVe path hackery.
2016-12-18 23:12:49 +01:00
Matthew Honnibal
2c6228565e
Fix vector loading re glove hack
2016-12-18 23:06:44 +01:00
Matthew Honnibal
618b50a064
Fix issue #684 : GloVe vectors not loaded in spacy.en.English.
2016-12-18 22:46:31 +01:00
Matthew Honnibal
404019ad2f
Fix issue #672 : ent_iob_ was a string, not unicode, due to missing unicode_literals statement.
2016-12-18 22:33:53 +01:00
Matthew Honnibal
2ef9d53117
Untested fix for issue #684 : GloVe vectors hack should be inserted in English, not in spacy.load.
2016-12-18 22:29:31 +01:00
Matthew Honnibal
c065359459
Fix path-override bug in spacy.load
2016-12-18 22:15:29 +01:00
Matthew Honnibal
813249f826
Work on morphology class. Still not fully consistent with rest of library.
2016-12-18 17:35:22 +01:00
Matthew Honnibal
3679fb43a3
Fix loading of lemmatizer
2016-12-18 17:34:09 +01:00
Matthew Honnibal
3980f1b0cb
Ignore more morphology attributes in deprecated mode of intify_attrs
2016-12-18 17:33:46 +01:00
Matthew Honnibal
7a98ee5e5a
Merge language data change
2016-12-18 17:03:52 +01:00
Matthew Honnibal
e4c951c153
Merge branch 'organize-language-data' of ssh://github.com/explosion/spaCy into organize-language-data
2016-12-18 17:01:08 +01:00
Ines Montani
b99d683a93
Fix formatting
2016-12-18 16:58:28 +01:00
Ines Montani
b11d8cd3db
Merge remote-tracking branch 'origin/organize-language-data' into organize-language-data
2016-12-18 16:57:12 +01:00
Ines Montani
d1c1d3f9cd
Fix tokenizer test
2016-12-18 16:55:32 +01:00
Ines Montani
753068f1d5
Use base language data as default
2016-12-18 16:55:25 +01:00
Ines Montani
bcc1d50d09
Remove trailing whitespace
2016-12-18 16:54:52 +01:00
Ines Montani
4e95737c6c
Add base tag map
2016-12-18 16:54:28 +01:00
Ines Montani
2b2ea8ca11
Reorganise language data
2016-12-18 16:54:19 +01:00
Matthew Honnibal
1b31c05bf8
Whitespace
2016-12-18 16:51:40 +01:00
Matthew Honnibal
bdcecb3c96
Add import in regression test
2016-12-18 16:51:31 +01:00
Matthew Honnibal
6ee1df93c5
Set tag_map to None if it's not seen in the data by vocab
2016-12-18 16:51:10 +01:00
Matthew Honnibal
33996e770b
Update header for morphology class
2016-12-18 16:50:42 +01:00
Matthew Honnibal
d58187ffa7
Filter out morphology keys in deprecated attrs
2016-12-18 16:50:26 +01:00
Matthew Honnibal
837a5d4100
Update morphology class so that exceptions can be added one-by-one, and so that arbitrary attributes can be referenced.
2016-12-18 16:49:46 +01:00
Matthew Honnibal
44f4f008bd
Wire up lemmatizer rules for English
2016-12-18 15:50:09 +01:00
Matthew Honnibal
e6fc4afb04
Whitespace
2016-12-18 15:48:00 +01:00
Ines Montani
32b36c3882
Break language data components into their own files
2016-12-18 15:40:22 +01:00
Ines Montani
1bff59a8db
Update English language data
2016-12-18 15:36:53 +01:00
Ines Montani
2eb163c5dd
Add lemma rules
2016-12-18 15:36:53 +01:00
Ines Montani
29ad8143d8
Add morph rules
2016-12-18 15:36:53 +01:00
Ines Montani
bc40dad7d9
Add entity rules
2016-12-18 15:36:53 +01:00
Ines Montani
eaa3b1319d
Fix formatting
2016-12-18 15:36:53 +01:00
Ines Montani
704c7442e0
Break language data components into their own files
2016-12-18 15:36:53 +01:00
Ines Montani
62655fd36f
Add ENT_ID constant
2016-12-18 15:36:53 +01:00
Matthew Honnibal
fa272fdf12
Merge branch 'organize-language-data' of ssh://github.com/explosion/spaCy into organize-language-data
2016-12-18 15:00:21 +01:00
Matthew Honnibal
57c4341453
Refactor loading of morphology exceptions, adding a method add_special_case.
2016-12-18 14:59:44 +01:00
Ines Montani
77cf2fb0f6
Remove unnecessary argument in test
2016-12-18 14:06:27 +01:00
Ines Montani
121c310566
Remove trailing whitespace
2016-12-18 14:06:27 +01:00
Ines Montani
0fc4e45cb3
Fix tag map for German
2016-12-18 13:30:03 +01:00
Ines Montani
28326649f3
Fix typo
2016-12-18 13:30:03 +01:00
Matthew Honnibal
0595cc0635
Change test595 to mock data, instead of requiring model.
2016-12-18 13:28:51 +01:00
Matthew Honnibal
a4eb5c2bff
Check POS key in lemmatizer, to update it for new data format
2016-12-18 13:28:20 +01:00
Matthew Honnibal
28d63ec58e
Restore missing '' character in tokenizer exceptions.
2016-12-18 05:34:51 +01:00
Ines Montani
a9421652c9
Remove duplicates in tag map
2016-12-17 22:44:31 +01:00
Ines Montani
69baf1c9a8
Fix tag map
2016-12-17 22:44:22 +01:00
Ines Montani
577adad945
Fix formatting
2016-12-17 14:00:52 +01:00
Ines Montani
fc4ad17136
Fix typo
2016-12-17 14:00:47 +01:00
Ines Montani
bb94e784dc
Fix typo
2016-12-17 13:59:30 +01:00
Ines Montani
afda532595
Use symbols in tag map
2016-12-17 13:56:24 +01:00
Ines Montani
07249145c9
Fix formatting
2016-12-17 13:34:46 +01:00
Ines Montani
dd55d085b6
Reformat dutch language data to match new style
2016-12-17 13:26:01 +01:00
Ines Montani
f2c48ef504
Resolve stopwords conflict to merge Dutch
2016-12-17 13:08:16 +01:00
Matthew Honnibal
ff03ade08f
Merge pull request #688 from nlesc-sherlock/dutch
...
Support for Dutch in SpaCy
2016-12-17 22:44:58 +11:00
Ines Montani
a22322187f
Add missing lemmas to tokenizer exceptions ( fixes #674 )
2016-12-17 12:42:41 +01:00
Ines Montani
5445074cbd
Expand tokenizer exceptions with unicode apostrophe ( fixes #685 )
2016-12-17 12:34:08 +01:00
Ines Montani
e0a7b5c612
Fix formatting
2016-12-17 12:33:09 +01:00
Ines Montani
08162dce67
Move shared functions and constants to global language data
2016-12-17 12:32:48 +01:00
Ines Montani
6a60a61086
Move update_exc to global language data utils
2016-12-17 12:29:02 +01:00
Ines Montani
f324311249
Add global language data utils
2016-12-17 12:27:41 +01:00
Ines Montani
487ce1e20a
Add encoding declaration
2016-12-17 12:25:44 +01:00
Ines Montani
d8d50a0334
Add tokenizer exception for "gonna" ( fixes #691 )
2016-12-17 11:59:28 +01:00
Ines Montani
c69b77d8aa
Revert "Add exception for "gonna""
...
This reverts commit 280c03f67b.
2016-12-17 11:56:44 +01:00
Ines Montani
280c03f67b
Add exception for "gonna"
2016-12-17 11:54:59 +01:00
Ines Montani
5031a015e2
Fix typo in stopwords ( fixes #689 )
2016-12-15 17:57:06 +01:00
Janneke van der Zwaan
4a3fdcce8a
Merge github.com:explosion/spaCy into dutch
2016-12-13 09:25:23 +01:00
Matthew Honnibal
5965d3c2a7
Revert "Add acl to symbols.pyx"
2016-12-12 10:10:28 +11:00
Matthew Honnibal
6dee76dfed
Update symbols.pxd
2016-12-12 10:09:58 +11:00
Pokey Rule
18a15c0777
Add acl to symbols.pyx
2016-12-11 20:00:07 +00:00
Gyorgy Orosz
0cf2144d24
Adding partial hyphen and quote handling support.
2016-12-11 00:14:36 +01:00
Gyorgy Orosz
2051726fd3
Passing Hungarian abbrev tests.
2016-12-10 23:37:58 +01:00
Ines Montani
63024466a9
Add Portuguese stopwords
2016-12-08 20:45:07 +01:00
Ines Montani
7bfe2d4abc
Update Portuguese language data
2016-12-08 20:41:41 +01:00
Ines Montani
c0c5f31950
Remove unused data and download script
2016-12-08 20:39:49 +01:00
Ines Montani
0a6d529104
Remove unused data
2016-12-08 20:36:56 +01:00
Ines Montani
1b3b043660
Add French stopwords
2016-12-08 20:12:43 +01:00
Ines Montani
8863e504eb
Update French language data
2016-12-08 20:07:14 +01:00
Ines Montani
7cb9f51be6
Add Italian stopwords
2016-12-08 20:05:25 +01:00
Ines Montani
470a0e0bea
Update Italian language data
2016-12-08 19:52:18 +01:00
Ines Montani
1a284d342e
Add Spanish language data
2016-12-08 19:47:03 +01:00
Ines Montani
0c39654786
Remove unused import
2016-12-08 19:46:53 +01:00
Ines Montani
e47ee94761
Split punctuation into its own file
2016-12-08 19:46:43 +01:00
Ines Montani
70b51ed7c8
Remove time from German language data
2016-12-08 19:45:50 +01:00
Ines Montani
e8ae588be9
Add emoticons
2016-12-08 19:45:18 +01:00
Ines Montani
5908c0ed9f
Fix formatting
2016-12-08 19:45:11 +01:00
Ines Montani
311b30ab35
Reorganize exceptions for English and German
2016-12-08 13:58:32 +01:00
Ines Montani
66c7348cda
Add update_exc util function
2016-12-08 13:58:12 +01:00
Ines Montani
1256232fad
Fix formatting
2016-12-08 13:56:40 +01:00
Ines Montani
8e977cc71c
Fix formatting
2016-12-08 13:56:17 +01:00
Ines Montani
0176b99004
Fix formatting
2016-12-08 12:48:02 +01:00
Ines Montani
877f09218b
Add more custom rules for abbreviations
2016-12-08 12:47:01 +01:00
Gyorgy Orosz
0289b8ceaa
Additional abbreviation tests.
2016-12-08 12:17:44 +01:00
Gyorgy Orosz
90d22db023
Added Hungarian resource files.
2016-12-08 12:06:36 +01:00
Ines Montani
bfaa42636c
Update language data for German
2016-12-08 12:01:09 +01:00
Ines Montani
ec44bee321
Fix capitalization on morphological features
2016-12-08 12:00:54 +01:00
Gyorgy Orosz
5b00039955
First steps towards the Hungarian tokenizer code.
2016-12-07 23:07:43 +01:00
Ines Montani
ce979553df
Resolve conflict
2016-12-07 21:16:52 +01:00
Ines Montani
8350d65695
Change morphology and lemmatizer API
...
Take morphology features as object instead of keyword arguments
2016-12-07 21:12:49 +01:00
Ines Montani
52e7d634df
Remove trailing whitespace
2016-12-07 21:12:19 +01:00
Ines Montani
0d07d7fc80
Apply emoticon exceptions to tokenizer
2016-12-07 21:11:59 +01:00
Ines Montani
71f0f34cb3
Fix formatting
2016-12-07 21:11:29 +01:00
Ines Montani
9413bcd9ee
Declare encoding and unicode literals
2016-12-07 21:10:34 +01:00
Ines Montani
a280ff2657
Fix __all__
2016-12-07 21:10:12 +01:00
Ines Montani
ba8721953c
Add missing emoticons
2016-12-07 21:09:44 +01:00
Ines Montani
1285c4ba93
Update English language data
2016-12-07 20:33:28 +01:00
Ines Montani
79dce0aabe
Add emoticons
2016-12-07 20:33:28 +01:00
Ines Montani
a662a95294
Add line breaks
2016-12-07 20:33:28 +01:00
Ines Montani
07f0efb102
Add test for tokenizer regular expressions
2016-12-07 20:33:28 +01:00
Ines Montani
e0712d1b32
Reformat language data
2016-12-07 20:33:28 +01:00
Matthew Honnibal
0c0f4c965d
Increment version
2016-12-03 11:16:52 +01:00
Matthew Honnibal
f6e356aada
Add (and test) Span.sentiment attribute. By default we average token.span, but can override with custom hook. Re Issue #667
2016-12-02 11:05:50 +01:00
Janneke van der Zwaan
88869e0e07
Merge github.com:explosion/spaCy into dutch
2016-11-30 17:13:39 +01:00
Janneke van der Zwaan
51ade86b86
Update language data with tag map from UD_Dutch
2016-11-30 14:41:23 +01:00
Janneke van der Zwaan
90f6ff12c9
Update Dutch language data
...
- Use Dutch tag map
- remove tokenizer exceptions
2016-11-30 11:59:39 +01:00
dafnevk
7b8f4c49f2
Added language Dutch to init file
2016-11-29 16:42:05 +01:00
Matthew Honnibal
296d33a4fc
Merge branch 'master' of ssh://github.com/explosion/spaCy
2016-11-26 12:36:18 +01:00
Matthew Honnibal
1f6c37c6f5
Fix create_tokenizer when nlp is None
2016-11-26 12:36:04 +01:00
Matthew Honnibal
c7889492f9
Fix model saving error for Python 3
2016-11-25 18:04:30 -06:00
Matthew Honnibal
bc0a202c9c
Fix unicode problem in nonproj module
2016-11-25 17:29:17 -06:00
Matthew Honnibal
6dd3b94fa6
Filter out deprecated attributes when reading special-case tokenization rules.
2016-11-25 09:57:18 -06:00
Matthew Honnibal
e879c79b8c
Merge branch 'master' of https://github.com/explosion/spaCy
2016-11-25 09:18:28 -06:00
Matthew Honnibal
a335c6dcc2
Exclude morphs from deprecated token attributes for now
2016-11-25 16:17:32 +01:00
Matthew Honnibal
f799a07f25
Merge branch 'master' of https://github.com/explosion/spaCy
2016-11-25 09:16:43 -06:00
Matthew Honnibal
159e8c46e1
Merge old training fixes with newer state
2016-11-25 09:16:36 -06:00
Matthew Honnibal
846e80f2f4
Exclude morphs from deprecated token attributes for now
2016-11-25 16:14:54 +01:00
Matthew Honnibal
664f2dd1c0
Allow dep to be None in scorer, for missing labels.
2016-11-25 09:02:49 -06:00
Matthew Honnibal
39341598bb
Fix NER label calculation
2016-11-25 09:02:22 -06:00
Matthew Honnibal
ca773a1f53
Tweak arc_eager n_gold to deal with negative costs, and improve error message.
2016-11-25 09:01:52 -06:00
Matthew Honnibal
a2f55e7015
Pass cfg through loading, for training.
2016-11-25 09:01:20 -06:00
Matthew Honnibal
608d8f5421
Pass cfg through parser, and have is_valid default to 1, not 0 when resetting state
2016-11-25 09:00:21 -06:00
Matthew Honnibal
cc7e607a8a
Fix gold.pyx for 1.0
2016-11-25 08:57:59 -06:00
root
080d29e092
Fix train.py for 1.0
2016-11-25 08:55:33 -06:00
Matthew Honnibal
6652f2a135
Test #656 , #624 : special case rules for tokenizer with attributes.
2016-11-25 12:44:13 +01:00
Matthew Honnibal
1e0f566d95
Fix #656 , #624 : Support arbitrary token attributes when adding special-case rules.
2016-11-25 12:43:24 +01:00
Matthew Honnibal
87613edf8f
Add set_struct_attr staticmethod to token
2016-11-25 12:41:47 +01:00
Matthew Honnibal
fb69aa648f
Merge branch 'master' of ssh://github.com/explosion/spaCy
2016-11-25 11:35:44 +01:00
Matthew Honnibal
9a03a3f85e
Add get_struct_attr staticmethod to Token, to match Lexeme.get_struct_attr.
2016-11-25 11:35:17 +01:00
Matthew Honnibal
53d8ca8f51
Add spacy.attrs.intify_attrs function, to normalize strings in token attribute dictionaries.
2016-11-25 11:34:30 +01:00
Ines Montani
d21ad01840
Add emoticons
2016-11-24 19:13:00 +01:00
dafnevk
d8c7ac203a
Added nl module for dutch
2016-11-24 16:39:49 +01:00
dafnevk
3db8b0d322
Added language class and some language data (with some TODOs) for Dutch
2016-11-24 15:56:38 +01:00
Ines Montani
4dcfafde02
Add line breaks
2016-11-24 14:57:37 +01:00
Ines Montani
6247c005a2
Add test for tokenizer regular expressions
2016-11-24 13:51:59 +01:00
Ines Montani
de747e39e7
Reformat language data
2016-11-24 13:51:32 +01:00
Matthew Honnibal
b8c4f5ea76
Allow German noun chunks to work on Span
...
Update the German noun chunks iterator, so that it also works on Span objects.
2016-11-24 23:30:15 +11:00
Pokey Rule
3e3bda142d
Add noun_chunks to Span
2016-11-24 10:47:20 +00:00
Janneke van der Zwaan
83daade0e4
Add directory and initial (empty) files for language Dutch
2016-11-24 09:45:41 +01:00
Matthew Honnibal
09f68bc641
Fix Issue #639 : stop words in language class not used. This patch is messy, but it's better not to change too much until the language data loading can be properly refactored.
2016-11-24 00:13:55 +01:00
Matthew Honnibal
48e1dc29d4
Fix default path loading.
2016-11-23 23:48:55 +01:00
Matthew Honnibal
e01c1875ee
Work on test for #615
2016-11-23 23:48:41 +01:00
ExplodingCabbage
6c4f488e89
Fix syntax mistake
2016-11-23 15:12:45 +00:00
Matthew Honnibal
60eb2343ce
Only try to load vectors if they exist.
2016-11-23 13:50:24 +01:00
Matthew Honnibal
618ac36093
Fix use of path argument in Language.__init__. Needs to be keyword arg, not positional.
2016-11-23 13:26:34 +01:00
Mark Amery
fbe19680a6
Fix another bug related to Language.__init__'s path parameter
2016-11-20 20:31:34 +00:00
Mark Amery
b0a07c21a0
Fix `path` param of `Language.__init__` always being ignored
...
There was an explicitly-declared `path` keyword argument, so 'path'
would never be present in `**overrides`. This line just overwrote
any manually-specified value the user might've passed to the `path`
parameter.
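The behaviour described above can be illustrated with a minimal sketch. The two functions are hypothetical stand-ins, not spaCy's actual `Language.__init__`; they only assume the signature shape the message describes (an explicit `path` keyword plus `**overrides`).

```python
def init_sketch(path=None, **overrides):
    # Hypothetical stand-in for a constructor with an explicit `path` keyword.
    # Python binds `path` by name, so it never lands in **overrides.
    return path, 'path' in overrides


def buggy_init_sketch(path=None, **overrides):
    # Assumed shape of the bug: re-reading 'path' from overrides always
    # yields the fallback and discards the value the caller passed.
    path = overrides.get('path', None)
    return path


explicit_path, found_in_overrides = init_sketch(path='/data')
overwritten_path = buggy_init_sketch(path='/data')
```

Here `found_in_overrides` is `False` and `overwritten_path` is `None`, which is why the caller's value was lost.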
2016-11-20 16:29:57 +00:00
Mark Amery
1988fce389
Merge remote-tracking branch 'origin/master' into specify-data-path
2016-11-20 16:07:14 +00:00
Mark Amery
3871007c72
Let --data-path be specified when running download.py scripts
...
Resolves https://github.com/explosion/spaCy/issues/637
2016-11-20 15:48:04 +00:00
Ines Montani
dad2c6cae9
Strip trailing whitespace
2016-11-20 16:45:51 +01:00
Ines Montani
3082e49326
Update and reformat German stopwords
2016-11-20 16:45:26 +01:00
Sourav Singh
6745eac309
Update language_data.py
2016-11-20 19:52:02 +05:30
Sourav Singh
4d9aae7d6a
Add German Stopwords
2016-11-19 22:47:53 +05:30
Matthew Honnibal
7afb2544a7
Merge pull request #627 from sadovnychyi/patch-1
...
Remove duplicated line of vocab declaration
2016-11-16 06:09:18 +11:00
Yanhao
762169da29
Fixed bug: eg.guess is a tag id, rather than tag
2016-11-15 14:11:22 +08:00
Dmytro Sadovnychyi
e70a7050e1
Remove duplicated line of vocab declaration
...
As already declared on line 211.
2016-11-13 18:52:49 +08:00
Matthew Honnibal
f123f92e0c
Fix #617 : Vocab.load() required Path. Should work with string as well.
2016-11-10 22:48:48 +01:00
Matthew Honnibal
e86f440ca6
Fix test for issue 617
2016-11-10 22:48:10 +01:00
Matthew Honnibal
faa7610c56
Merge branch 'master' of ssh://github.com/explosion/spaCy
2016-11-10 22:46:38 +01:00
Matthew Honnibal
a2c7de8329
spacy/tests/regression/test_issue617.py
...
Test Issue #617
2016-11-10 22:46:23 +01:00
tiago
2a3e342c1f
Added a test case to cover the span.merge returning values
2016-11-09 18:57:50 +00:00
tiago
b38cfd0ef9
now span.merge returns token like it says on documentation
2016-11-09 14:58:19 +00:00
Dmitry Sadovnychyi
9488222e79
Fix PhraseMatcher to work with updated Matcher
...
#613
2016-11-09 00:14:26 +08:00
Dmitry Sadovnychyi
86c056ba64
Add basic test for PhraseMatcher
...
#613
2016-11-09 00:10:32 +08:00
Matthew Honnibal
3ea15b257f
Fix test for 605
2016-11-06 11:59:26 +01:00
Matthew Honnibal
efe7790439
Test #590 : Order dependence in Matcher rules.
2016-11-06 11:21:36 +01:00
Matthew Honnibal
5cd3acb265
Fix #605 : Acceptor now rejects matches as expected.
2016-11-06 10:50:42 +01:00
Matthew Honnibal
75805397dd
Test Issue #605
2016-11-06 10:42:32 +01:00
Matthew Honnibal
014b6936ac
Fix #608 -- __version__ should be available at the base of the package.
2016-11-04 21:21:02 +01:00
Matthew Honnibal
42b0736db7
Increment version
2016-11-04 20:04:21 +01:00
Matthew Honnibal
9f93386994
Update version
2016-11-04 19:28:16 +01:00
Matthew Honnibal
1fb09c3dc1
Fix morphology tagger
2016-11-04 19:19:09 +01:00
Matthew Honnibal
a36353df47
Temporarily put back the tokenize_from_strings method, while tests aren't updated yet.
2016-11-04 19:18:07 +01:00
Matthew Honnibal
f0917b6808
Fix Issue #376 : and/or was tagged as a noun.
2016-11-04 15:21:28 +01:00
Matthew Honnibal
737816e86e
Fix #368 : Tokenizer handled pattern 'unicode close quote, period' incorrectly.
2016-11-04 15:16:20 +01:00
Matthew Honnibal
ab952b4756
Fix #578 -- Sputnik had been purging all files on --force, not just the relevant one.
2016-11-04 10:44:11 +01:00
Matthew Honnibal
6e37ba1d82
Fix #602 , #603 --- Broken build
2016-11-04 09:54:24 +01:00
Matthew Honnibal
293c79c09a
Fix #595 : Lemmatization was incorrect for base forms, because morphological analyser wasn't adding morphology properly.
2016-11-04 00:29:07 +01:00
Matthew Honnibal
e30348b331
Prefer to import from symbols instead of parts_of_speech
2016-11-04 00:27:55 +01:00
Matthew Honnibal
4a8a2b6001
Test #595 -- Bug in lemmatization of base forms.
2016-11-04 00:27:32 +01:00
Matthew Honnibal
f1605df2ec
Fix #588 : Matcher should reject empty pattern.
2016-11-03 00:16:44 +01:00
Matthew Honnibal
72b9bd57ec
Test Issue #588 : Matcher accepts invalid, empty patterns.
2016-11-03 00:09:35 +01:00
Matthew Honnibal
41a90a7fbb
Add tokenizer exception for 'Ph.D.', to fix 592.
2016-11-03 00:03:34 +01:00
Matthew Honnibal
532318e80b
Import Jieba inside zh.make_doc
2016-11-02 23:49:19 +01:00
Matthew Honnibal
f292f7f0e6
Fix Issue #599 , by considering empty documents to be parsed and tagged. Implementation is a bit dodgy.
2016-11-02 23:48:43 +01:00
Matthew Honnibal
b6b01d4680
Remove deprecated tokens_from_list test.
2016-11-02 23:47:21 +01:00
Matthew Honnibal
3d6c79e595
Test Issue #599 : .is_tagged and .is_parsed attributes not reflected after deserialization for empty documents.
2016-11-02 23:40:11 +01:00
Matthew Honnibal
05a8b752a2
Fix Issue #600 : Missing setters for Token attribute.
2016-11-02 23:28:59 +01:00
Matthew Honnibal
125c910a8d
Test Issue #600
2016-11-02 23:24:13 +01:00
Matthew Honnibal
e0c9695615
Fix doc strings for tokenizer
2016-11-02 23:15:39 +01:00
Matthew Honnibal
80824f6d29
Fix test
2016-11-02 20:48:40 +01:00
Matthew Honnibal
dbe47902bc
Add import fr
2016-11-02 20:48:29 +01:00
Matthew Honnibal
8f24dc1982
Fix infixes in Italian
2016-11-02 20:43:52 +01:00
Matthew Honnibal
41a4766c1c
Fix infixes in spanish and portuguese
2016-11-02 20:43:12 +01:00
Matthew Honnibal
3d4bd96e8a
Fix infixes in french
2016-11-02 20:41:43 +01:00
Matthew Honnibal
c09a8ce5bb
Add test for french tokenizer
2016-11-02 20:40:31 +01:00
Matthew Honnibal
b012ae3044
Add test for loading languages
2016-11-02 20:38:48 +01:00
Matthew Honnibal
ad1c747c6b
Fix stray POS in language stubs
2016-11-02 20:37:55 +01:00
Matthew Honnibal
e9e6fce576
Handle null prefix/suffix/infix search in tokenizer
2016-11-02 20:35:48 +01:00
Matthew Honnibal
22647c2423
Check that patterns aren't null before compiling regex for tokenizer
2016-11-02 20:35:29 +01:00
Matthew Honnibal
5ac735df33
Link languages in __init__.py
2016-11-02 20:05:14 +01:00
Matthew Honnibal
c68dfe2965
Stub out support for Italian
2016-11-02 20:03:24 +01:00
Matthew Honnibal
6dbf4f7ad7
Stub out support for French, Spanish, Italian and Portuguese
2016-11-02 20:02:41 +01:00
Matthew Honnibal
6b8b05ef83
Specify that spacy.util is encoded in utf8
2016-11-02 19:58:00 +01:00
Matthew Honnibal
5363224395
Add draft Jieba tokenizer for Chinese
2016-11-02 19:57:38 +01:00
Matthew Honnibal
f7fee6c24b
Check for class-defined make_docs method before assigning one provided as an argument
2016-11-02 19:57:13 +01:00
Matthew Honnibal
19c1e83d3d
Work on draft Italian tokenizer
2016-11-02 19:56:32 +01:00
Matthew Honnibal
9efe568177
Add missing unicode_literals to spacy.util. I think this was messing up the tokenizer regex for non-ascii characters in Python 2. Re Issue #596
2016-11-02 12:31:34 +01:00
Matthew Honnibal
d8db648ebf
Add __init__.py file for regression tests
2016-11-01 13:45:06 +01:00
Matthew Honnibal
11664b9f20
Fix variable error in token
2016-11-01 13:28:00 +01:00
Matthew Honnibal
8c4d1b46ce
Fix variable error in Span
2016-11-01 13:27:44 +01:00
Matthew Honnibal
e7af6b937f
Fix syntax error while fixing doc strings
2016-11-01 13:27:32 +01:00
Matthew Honnibal
62fc6b1afa
Use 32 bit hashes for OOV, re Issue #589 , Issue #285
2016-11-01 13:27:13 +01:00
Matthew Honnibal
6977a2b8cd
Add test for Issue #589
2016-11-01 12:33:36 +01:00
Matthew Honnibal
b86f8af0c1
Fix doc strings
2016-11-01 12:25:36 +01:00
Matthew Honnibal
d563f1eadb
Fix Issue #587 : Segfault in Matcher, due to simple error in the state machine.
2016-10-28 17:42:00 +02:00
Matthew Honnibal
7e5f63a595
Improve test slightly
2016-10-28 17:41:16 +02:00
Matthew Honnibal
782e4814f4
Test Issue #587 : Matcher segfaults on particular input
2016-10-28 16:38:32 +02:00
Matthew Honnibal
708ea22208
Infer types in transition_system.pyx
2016-10-27 18:08:13 +02:00
Matthew Honnibal
18590eba94
Fix training evaluate method
2016-10-27 18:02:19 +02:00
Matthew Honnibal
301f3cc898
Fix Issue #429 . Add an initialize_state method to the named entity recogniser that adds missing entity types. This is a messy place to add this, because it's strange to have the method mutate state. A better home for this logic could be found.
2016-10-27 18:01:55 +02:00
Matthew Honnibal
afea6505f3
Test Issue 429: No valid actions for NER after matcher adds a new entity label.
2016-10-27 18:01:34 +02:00
Matthew Honnibal
03a520ec4f
Change signature of Parser.parseC, so that nr_class is read from the transition system. This allows the transition system to modify the number of actions in initialize_state.
2016-10-27 17:58:56 +02:00
Matthew Honnibal
6c47048912
Fix test, after IOB tweak.
2016-10-26 17:22:03 +02:00
Matthew Honnibal
4ca31b4d87
Fix clobbering of 'missing' named ent values after assigning ents.
2016-10-26 13:13:56 +02:00
Matthew Honnibal
cb49189477
Remove dead code
2016-10-26 13:11:07 +02:00
Matthew Honnibal
a209b10579
Improve error message when oracle fails for non-projective trees, re Issue #571 .
2016-10-24 20:31:30 +02:00
Matthew Honnibal
b2d43b93d2
Fix Python 3 basestring error
2016-10-24 14:22:51 +02:00
Matthew Honnibal
276478fe0f
Update strings.pxd
2016-10-24 14:00:35 +02:00
Matthew Honnibal
d8134817ff
Workaround Issue #285 : Allow the StringStore to be 'frozen', in which case strings will be pushed into an OOV map. We can then flush this OOV map, freeing all of the OOV strings.
2016-10-24 13:49:03 +02:00
Matthew Honnibal
d3a617aa99
Test workaround for Issue #285 : Streaming data memory growth
2016-10-24 13:48:06 +02:00
Matthew Honnibal
64e5f02cf7
Update test
2016-10-23 21:08:07 +02:00
Matthew Honnibal
66d7a6eca2
Update test
2016-10-23 21:02:05 +02:00
Matthew Honnibal
90bf797125
Update test
2016-10-23 20:54:17 +02:00
Matthew Honnibal
5e76320ffe
Update test
2016-10-23 20:44:54 +02:00
Matthew Honnibal
aa105927f3
Update test
2016-10-23 20:31:25 +02:00
Matthew Honnibal
6b9237aa83
Increment version
2016-10-23 20:22:53 +02:00
Matthew Honnibal
150e02d72e
Fix Issue #566
2016-10-23 20:19:01 +02:00
Matthew Honnibal
e120561294
Fix vector_norm test.
2016-10-23 19:56:16 +02:00
Matthew Honnibal
fefde8aef8
Make installation print data path.
2016-10-23 19:46:44 +02:00
Matthew Honnibal
e7414cd064
Try to fix weird install glitch.
2016-10-23 19:46:28 +02:00
Matthew Honnibal
90f7544edd
Increment version
2016-10-23 19:43:06 +02:00
Matthew Honnibal
6036ec7c77
Fix vector norm when loading lexemes.
2016-10-23 19:40:18 +02:00
Matthew Honnibal
c05cd2356e
Fix similarity test for Python 3
2016-10-23 18:16:56 +02:00
Matthew Honnibal
3e688e6d4b
Fix issue #514 -- serializer fails when new entity type has been added. The fix here is quite ugly. It's best to add the entities ASAP after loading the NLP pipeline, to mitigate the brittleness.
2016-10-23 17:45:44 +02:00
Matthew Honnibal
79aa03fe98
Test Issue #514 : Serializer fails when new entity type has been added.
2016-10-23 17:41:44 +02:00
Matthew Honnibal
f97548c6f1
Fix broken test, re Issue #461
2016-10-23 17:02:23 +02:00
Matthew Honnibal
4de30a8e38
Test Issue #514 : Serialization fails after adding a new entity label.
2016-10-23 16:40:27 +02:00
Matthew Honnibal
936e6246aa
Fix Issue #459 -- failed to deserialize empty doc.
2016-10-23 16:31:05 +02:00
Matthew Honnibal
e99b3f5322
Test Issue #459 : Fail to deserialize empty doc
2016-10-23 16:30:22 +02:00
Matthew Honnibal
49c117960c
Fix bug where huffman codec died if given empty freqs dict.
2016-10-23 16:28:05 +02:00
Matthew Honnibal
99ff8b902f
Test that huffman codec works with empty freqs dict
2016-10-23 16:27:45 +02:00
Matthew Honnibal
15c9b59f0e
Fix Issue #461 : O tag was being clobbered by doc.ents.__set__
2016-10-23 15:50:26 +02:00
Matthew Honnibal
e5627134d9
Test Issue #461 : ent_iob tag incorrect after setting entities.
2016-10-23 15:50:04 +02:00
Matthew Honnibal
f62088d646
Fix compile error
2016-10-23 14:50:50 +02:00
Matthew Honnibal
2c3a67b693
Fix calculation of vector norm, re Issue #522 . Need to consolidate the calculations into a helper function.
2016-10-23 14:49:31 +02:00
Matthew Honnibal
a0a4ada42a
Fix calculation of L2-norm for Lexeme
2016-10-23 14:44:45 +02:00
Matthew Honnibal
2989072aac
Add tests to verify that Issue #442 is fixed in 1.1
2016-10-23 14:33:13 +02:00
Matthew Honnibal
739213a8af
Fix create_pipeline keyword argument.
2016-10-23 14:24:16 +02:00
Matthew Honnibal
bea44bd3c4
Fix vector_norm when vector is assigned to Lexeme.
2016-10-23 14:23:56 +02:00
Matthew Honnibal
e838b6d53f
Add tests for using the new Entity ID tracking in the rule matcher
2016-10-23 14:04:01 +02:00
Matthew Honnibal
e7af75e0a9
Add test for vector resizing, re Issue #544
2016-10-21 17:07:21 +02:00
Matthew Honnibal
ca8ea33abc
Bump version to 1.1.0
2016-10-21 16:30:57 +02:00
Matthew Honnibal
7ab03050d4
Add resize_vectors method to Vocab
2016-10-21 01:44:50 +02:00
Matthew Honnibal
8ce8803824
Fix JSON in tokenizer
2016-10-21 01:44:20 +02:00
Matthew Honnibal
6eb73a095f
Fix JSON in tagger
2016-10-21 01:44:10 +02:00
Matthew Honnibal
e16e78a737
Merge branch 'master' of ssh://github.com/explosion/spaCy
2016-10-21 00:00:15 +02:00
Matthew Honnibal
147373c807
Increment version
2016-10-21 00:00:03 +02:00
Matthew Honnibal
e80944276f
Fix Span.vector_norm
2016-10-20 21:58:56 +02:00
Matthew Honnibal
f5fe4f595b
Fix json loading, for Python 3.
2016-10-20 21:23:26 +02:00
Matthew Honnibal
2e92c6fb3a
Fix JSON encoding issue on load
2016-10-20 21:06:48 +02:00
Matthew Honnibal
4ad7bb96c9
Increment version.
2016-10-20 20:48:30 +02:00
Matthew Honnibal
5ec32f5d97
Fix loading of GloVe vectors, to address Issue #541
2016-10-20 18:27:48 +02:00
Matthew Honnibal
ddeabd76c4
Fix mistake loading GloVe vectors. GloVe vectors now loaded by default if present, as promised.
2016-10-20 16:57:53 +02:00
Matthew Honnibal
bfe5cb1244
Increment version.
2016-10-20 14:52:00 +02:00
Matthew Honnibal
f189a3cb00
Fix encoding when opening files in Python 2.7, re Issue #539
2016-10-20 14:42:56 +02:00
Matthew Honnibal
c353a5214d
Increment version
2016-10-19 23:51:01 +02:00
Matthew Honnibal
d10c17f2a4
Fix Issue #536 : oov_prob was 0 for OOV words.
2016-10-19 23:38:47 +02:00
Matthew Honnibal
dfa752d064
Increment version
2016-10-19 23:19:13 +02:00
Matthew Honnibal
3588a18fb8
Fix hook names in doc
2016-10-19 21:15:16 +02:00
Matthew Honnibal
5d5742b773
Add sentiment field to doc, rename getters_for_tokens and getters_for_spans, add user_hooks field to Doc.
2016-10-19 20:54:22 +02:00
Matthew Honnibal
ed5e178817
Add sentiment property on lexeme object
2016-10-19 20:52:52 +02:00
Matthew Honnibal
d4aaf2752c
Fix issue #535 : Pipeline elements added even when data not installed.
2016-10-19 19:55:19 +02:00
Matthew Honnibal
04d1c959da
Fix version
2016-10-19 03:45:37 +02:00
Matthew Honnibal
d35aa7344e
Change version ID to make PyPi happy
2016-10-19 03:24:39 +02:00
Matthew Honnibal
89d2a5c8b3
Increment build version.
2016-10-19 03:05:17 +02:00
Matthew Honnibal
622b0a9674
Tweak download script
2016-10-19 00:52:16 +02:00
Matthew Honnibal
5a5c7192a5
Fix download.py for GloVe vectors.
2016-10-19 00:47:44 +02:00
Matthew Honnibal
edc45c19d6
Update download script
2016-10-19 00:41:14 +02:00
Matthew Honnibal
2bbb050500
Fix default of serializer_freqs
2016-10-18 19:55:41 +02:00
Matthew Honnibal
1b651db9c5
Fix parser creation in Language class.
2016-10-18 19:36:44 +02:00
Matthew Honnibal
45a6f9b9c7
Fix loading of tagger.
2016-10-18 19:33:04 +02:00
Matthew Honnibal
76c815f40d
Fix spacy.load
2016-10-18 19:23:31 +02:00
Matthew Honnibal
8c8f5c62c6
Add LANG attribute to English and German
2016-10-18 18:52:48 +02:00
Matthew Honnibal
05e2a589a4
Fix None label in matcher
2016-10-18 18:05:21 +02:00
Matthew Honnibal
c3a8a1cf51
Update serializer test.
2016-10-18 16:18:46 +02:00
Matthew Honnibal
7d5212f131
Refactor defaults
2016-10-18 16:18:25 +02:00
Matthew Honnibal
a45a9d5092
Remove stray .tensor attribute from Lexeme
2016-10-18 01:16:32 +02:00
Matthew Honnibal
9258db788a
Revert "Have the matcher return character offsets, to handle the match better."
...
This reverts commit 049c937540.
2016-10-17 16:49:51 +02:00
Matthew Honnibal
7d446e5094
Revert "Update matcher test, to reflect character offset return instead of token offset."
...
This reverts commit f8d3e3bcfe.
2016-10-17 16:49:49 +02:00
Matthew Honnibal
4bf2c53c13
Revert "Hack on matcher tests, for new implementation."
...
This reverts commit dbe60644ab.
2016-10-17 16:49:48 +02:00
Matthew Honnibal
2fd97c71cc
Revert "Don't try to pickle matcher."
...
This reverts commit 97bd0c9d00.
2016-10-17 16:49:43 +02:00
Matthew Honnibal
97bd0c9d00
Don't try to pickle matcher.
2016-10-17 16:38:40 +02:00
Matthew Honnibal
dbe60644ab
Hack on matcher tests, for new implementation.
2016-10-17 16:12:22 +02:00
Matthew Honnibal
f8d3e3bcfe
Update matcher test, to reflect character offset return instead of token offset.
2016-10-17 16:00:10 +02:00