Ines Montani
|
d5c72c40eb
|
Remove old tests for old website example code
|
2017-01-08 22:28:53 +01:00 |
|
Ines Montani
|
eef94e3ee2
|
Split off period after two or more uppercase letters (fixes #483)
|
2017-01-08 22:28:25 +01:00 |
|
Ines Montani
|
a89a6000e5
|
Remove unused import
|
2017-01-08 22:17:37 +01:00 |
|
Ines Montani
|
5d28664fc5
|
Don't test Hungarian for numbers and hyphens for now
Reinvestigate behaviour of case affixes given reorganised tokenizer
patterns.
|
2017-01-08 20:45:40 +01:00 |
|
Ines Montani
|
53362b6b93
|
Reorganise Hungarian prefixes/suffixes/infixes
Use global prefixes and suffixes for non-language-specific rules,
import list of alpha unicode characters and adjust regexes.
|
2017-01-08 20:40:33 +01:00 |
|
Ines Montani
|
347c4a2d06
|
Reorganise and reformat global tokenizer prefixes, suffixes and infixes
|
2017-01-08 20:37:39 +01:00 |
|
Ines Montani
|
0dec90e9f7
|
Use global abbreviation data languages and remove duplicates
|
2017-01-08 20:36:00 +01:00 |
|
Ines Montani
|
7c3cb2a652
|
Add global abbreviations data
|
2017-01-08 20:34:03 +01:00 |
|
Ines Montani
|
de5aa92bc2
|
Handle deprecated tokenizer prefix data
|
2017-01-08 20:33:28 +01:00 |
|
Ines Montani
|
abb09782f9
|
Move sun.txt to original location and fix path to not break parser tests
|
2017-01-08 20:32:54 +01:00 |
|
Ines Montani
|
cab39c59c5
|
Add missing contractions to English tokenizer exceptions
Inspired by
https://github.com/kootenpv/contractions/blob/master/contractions/__init__.py
|
2017-01-05 19:59:06 +01:00 |
|
Ines Montani
|
a23504fe07
|
Move abbreviations below other exceptions
|
2017-01-05 19:58:07 +01:00 |
|
Ines Montani
|
7d2cf934b9
|
Generate he/she/it correctly with 's instead of 've
|
2017-01-05 19:57:00 +01:00 |
|
Ines Montani
|
8328925e1f
|
Add newlines to long German text
|
2017-01-05 18:13:30 +01:00 |
|
Ines Montani
|
55b46d7cf6
|
Add tokenizer tests for German
|
2017-01-05 18:11:25 +01:00 |
|
Ines Montani
|
5bb4081f52
|
Remove redundant test_tokenizer.py for English
|
2017-01-05 18:11:11 +01:00 |
|
Ines Montani
|
8216ba599b
|
Add tests for longer and mixed English texts
|
2017-01-05 18:11:04 +01:00 |
|
Ines Montani
|
65f937d5c6
|
Move basic contraction tests to test_contractions.py
|
2017-01-05 18:09:53 +01:00 |
|
Ines Montani
|
bbe7cab3a1
|
Move non-English-specific tests back to general tokenizer tests
|
2017-01-05 18:09:29 +01:00 |
|
Ines Montani
|
038002d616
|
Reformat HU tokenizer tests and adapt to general style
Improve readability of test cases and add conftest.py with fixture
|
2017-01-05 18:06:44 +01:00 |
|
Ines Montani
|
bc911322b3
|
Move ") to emoticons (see Tweebo challenge test)
|
2017-01-05 18:05:38 +01:00 |
|
Ines Montani
|
637f785036
|
Add general sanity tests for all tokenizers
|
2017-01-05 16:25:38 +01:00 |
|
Ines Montani
|
c5f2dc15de
|
Move English tokenizer tests to directory /en
|
2017-01-05 16:25:04 +01:00 |
|
Ines Montani
|
8b45363b4d
|
Modernize and merge general tokenizer tests
|
2017-01-05 13:17:05 +01:00 |
|
Ines Montani
|
02cfda48c9
|
Modernize and merge tokenizer tests for string loading
|
2017-01-05 13:16:55 +01:00 |
|
Ines Montani
|
a11f684822
|
Modernize and merge tokenizer tests for whitespace
|
2017-01-05 13:16:33 +01:00 |
|
Ines Montani
|
8b284fc6f1
|
Modernize and merge tokenizer tests for text from file
|
2017-01-05 13:15:52 +01:00 |
|
Ines Montani
|
2c2e878653
|
Modernize and merge tokenizer tests for punctuation
|
2017-01-05 13:14:16 +01:00 |
|
Ines Montani
|
8a74129cdf
|
Modernize and merge tokenizer tests for prefixes/suffixes/infixes
|
2017-01-05 13:13:12 +01:00 |
|
Ines Montani
|
0e65dca9a5
|
Modernize and merge tokenizer tests for exception and emoticons
|
2017-01-05 13:11:31 +01:00 |
|
Ines Montani
|
34c47bb20d
|
Fix formatting
|
2017-01-05 13:10:51 +01:00 |
|
Ines Montani
|
2e72683baa
|
Add missing docstrings
|
2017-01-05 13:10:21 +01:00 |
|
Ines Montani
|
da10a049a6
|
Add unicode declarations
|
2017-01-05 13:09:48 +01:00 |
|
Ines Montani
|
58adae8774
|
Remove unused file
|
2017-01-05 13:09:22 +01:00 |
|
Ines Montani
|
c6e5a5349d
|
Move regression test for #360 into own file
|
2017-01-04 00:49:31 +01:00 |
|
Ines Montani
|
8279993a6f
|
Modernize and merge tokenizer tests for punctuation
|
2017-01-04 00:49:20 +01:00 |
|
Ines Montani
|
550630df73
|
Update tokenizer tests for contractions
|
2017-01-04 00:48:42 +01:00 |
|
Ines Montani
|
109f202e8f
|
Update conftest fixture
|
2017-01-04 00:48:21 +01:00 |
|
Ines Montani
|
ee6b49b293
|
Modernize tokenizer tests for emoticons
|
2017-01-04 00:47:59 +01:00 |
|
Ines Montani
|
f09b5a5dfd
|
Modernize tokenizer tests for infixes
|
2017-01-04 00:47:42 +01:00 |
|
Ines Montani
|
59059fed27
|
Move regression test for #351 to own file
|
2017-01-04 00:47:11 +01:00 |
|
Ines Montani
|
667051375d
|
Modernize tokenizer tests for whitespace
|
2017-01-04 00:46:35 +01:00 |
|
Ines Montani
|
aafc894285
|
Modernize tokenizer tests for contractions
Use @pytest.mark.parametrize.
|
2017-01-03 23:02:21 +01:00 |
|
Ines Montani
|
1d237664af
|
Add lowercase lemma to tokenizer exceptions
|
2017-01-03 23:02:21 +01:00 |
|
Ines Montani
|
84a87951eb
|
Fix typos
|
2017-01-03 18:27:43 +01:00 |
|
Ines Montani
|
35b39f53c3
|
Reorganise English tokenizer exceptions (as discussed in #718)
Add logic to generate exceptions that follow a consistent pattern (like
verbs and pronouns) and allow certain tokens to be excluded explicitly.
|
2017-01-03 18:26:09 +01:00 |
|
Ines Montani
|
fb9d3bb022
|
Revert "Merge remote-tracking branch 'origin/master'"
This reverts commit d3b181cdf1, reversing
changes made to b19cfcc144.
|
2017-01-03 18:21:36 +01:00 |
|
Ines Montani
|
461cbb99d8
|
Revert "Reorganise English tokenizer exceptions (as discussed in #718)"
This reverts commit b19cfcc144.
|
2017-01-03 18:21:29 +01:00 |
|
Ines Montani
|
d3b181cdf1
|
Merge remote-tracking branch 'origin/master'
# Conflicts:
# spacy/en/tokenizer_exceptions.py
|
2017-01-03 18:20:01 +01:00 |
|
Ines Montani
|
b19cfcc144
|
Reorganise English tokenizer exceptions (as discussed in #718)
Add logic to generate exceptions that follow a consistent pattern (like
verbs and pronouns) and allow certain tokens to be excluded explicitly.
|
2017-01-03 18:17:57 +01:00 |
|
Ines Montani
|
1bd53bbf89
|
Fix typos (resolves #718)
|
2017-01-03 11:26:21 +01:00 |
|
Matthew Honnibal
|
fde53be3b4
|
Move whole token match inside _split_affixes.
|
2016-12-30 17:11:50 -06:00 |
|
Matthew Honnibal
|
3ba7c167a8
|
Fix URL tests
|
2016-12-30 17:10:08 -06:00 |
|
Matthew Honnibal
|
9936a1b9b5
|
Merge branch 'tokenization_w_exception_patterns' of https://github.com/oroszgy/spaCy.hu into oroszgy-tokenization_w_exception_patterns
|
2016-12-30 14:53:40 -06:00 |
|
Magnus Burton
|
56e2219b65
|
Added Swedish city abbreviations
|
2016-12-30 21:17:34 +01:00 |
|
Magnus Burton
|
e935c950d8
|
Added months and days as abbreviations for Swedish
|
2016-12-30 21:08:44 +01:00 |
|
Matthew Honnibal
|
3e8d9c772e
|
Test interaction of token_match and punctuation
Check that the new token_match function applies after punctuation is split off.
|
2016-12-31 00:52:17 +11:00 |
|
Matthew Honnibal
|
74b921f394
|
Merge branch 'master' of ssh://github.com/explosion/spaCy into develop
|
2016-12-30 14:38:27 +01:00 |
|
Matthew Honnibal
|
623d94e14f
|
Whitespace
|
2016-12-31 00:30:28 +11:00 |
|
Matthew Honnibal
|
af81ac8bb0
|
Use thinc 6.0
|
2016-12-29 11:58:42 +01:00 |
|
Petter Hohle
|
f112e7754e
|
Add PART to tag map
16 of the 17 PoS tags in the UD tag set are added; PART is missing.
|
2016-12-28 18:39:01 +01:00 |
|
Matthew Honnibal
|
f62db78dc3
|
Increment version
|
2016-12-27 21:11:22 +01:00 |
|
Matthew Honnibal
|
cade536d1e
|
Merge branch 'master' of ssh://github.com/explosion/spaCy
|
2016-12-27 21:04:10 +01:00 |
|
Matthew Honnibal
|
ce4539dafd
|
Allow the vocabulary to grow to 10,000, to prevent cold-start problem.
|
2016-12-27 21:03:45 +01:00 |
|
Ines Montani
|
ad3669cef5
|
Merge pull request #703 from magnusburton/master
Added Swedish abbreviations
|
2016-12-27 01:01:49 +01:00 |
|
Ines Montani
|
78f754dd9a
|
Merge pull request #705 from oroszgy/hu_tokenizer
Initial support for Hungarian
|
2016-12-27 00:48:13 +01:00 |
|
Ines Montani
|
8785706039
|
Reformat stop words for better readability
|
2016-12-24 00:58:40 +01:00 |
|
Gyorgy Orosz
|
45e045a87b
|
Unicode/UTF8 compatibility for Python2
|
2016-12-24 00:21:00 +01:00 |
|
Gyorgy Orosz
|
72b61b6d03
|
Typo fix.
|
2016-12-24 00:10:29 +01:00 |
|
Gyorgy Orosz
|
3a9be4d485
|
Updated token exception handling mechanism to allow the usage of arbitrary functions as token exception matchers.
|
2016-12-23 23:49:34 +01:00 |
|
Ines Montani
|
1436b9f15a
|
Fix formatting and consistency
|
2016-12-23 21:36:01 +01:00 |
|
Ines Montani
|
1d64527727
|
Update Spanish tokenizer
Remove reflexive pronouns as they're part of an open class, fix
mistakes and add exceptions
|
2016-12-23 21:36:01 +01:00 |
|
Ines Montani
|
7f411fd01c
|
Remove exceptions containing whitespace / no special chars
|
2016-12-23 14:30:06 +01:00 |
|
Magnus Burton
|
fdf4776262
|
Added Swedish abbreviations
|
2016-12-22 22:45:18 +01:00 |
|
Gyorgy Orosz
|
d9c59c4751
|
Maintaining backward compatibility.
|
2016-12-21 23:30:49 +01:00 |
|
Gyorgy Orosz
|
1748549aeb
|
Added exception pattern mechanism to the tokenizer.
|
2016-12-21 23:16:19 +01:00 |
|
Gyorgy Orosz
|
35aa54765d
|
Hungarian module is exposed in spacy.
|
2016-12-21 20:45:36 +01:00 |
|
Gyorgy Orosz
|
ab2f6ea46c
|
Removed data files from tests.
|
2016-12-21 20:22:09 +01:00 |
|
Ines Montani
|
3c87c71d43
|
Add tokenizer exceptions for a.m. and p.m. in Spanish
|
2016-12-21 18:19:10 +01:00 |
|
Ines Montani
|
78e63dc7d0
|
Update tokenizer exceptions for English
|
2016-12-21 18:06:34 +01:00 |
|
Ines Montani
|
702d1eed93
|
Update tokenizer exceptions for German
|
2016-12-21 18:06:27 +01:00 |
|
Ines Montani
|
d60380418e
|
Update tokenizer exceptions for Spanish
|
2016-12-21 18:06:17 +01:00 |
|
Ines Montani
|
920fa0fed2
|
Add DET_LEMMA constant
|
2016-12-21 18:05:41 +01:00 |
|
Ines Montani
|
8978806ea6
|
Allow Vocab to load without serializer_freqs
|
2016-12-21 18:05:23 +01:00 |
|
Ines Montani
|
be8ed811f6
|
Remove trailing whitespace
|
2016-12-21 18:04:41 +01:00 |
|
Ines Montani
|
926e19184a
|
Merge pull request #695 from magnusburton/master
Added Swedish morph rules
|
2016-12-21 01:06:00 +01:00 |
|
Gyorgy Orosz
|
3d5306acb9
|
Added further testcases.
|
2016-12-20 23:49:35 +01:00 |
|
Gyorgy Orosz
|
23956e72ff
|
Improved partial support for tokenizing Hungarian numbers
|
2016-12-20 23:36:59 +01:00 |
|
Gyorgy Orosz
|
6add156075
|
Refactored language data structure
|
2016-12-20 22:28:20 +01:00 |
|
Gyorgy Orosz
|
366b3f8685
|
Merge branch 'master' into hu_tokenizer
|
2016-12-20 20:53:31 +01:00 |
|
Gyorgy Orosz
|
c035928156
|
Partial Hungarian number tokenization is added.
|
2016-12-20 20:46:20 +01:00 |
|
JM
|
70ff0639b5
|
Fixed missing vec_path declaration that was failing if 'add_vectors' was set
Added vec_path variable declaration to avoid accessing it before assignment in case 'add_vectors' is in overrides.
|
2016-12-20 18:21:05 +01:00 |
|
Magnus Burton
|
48dcc9f647
|
Added morph rules
|
2016-12-20 13:18:41 +01:00 |
|
Magnus Burton
|
db5a077d2b
|
Initial commit for Swedish
|
2016-12-20 11:05:06 +01:00 |
|
Matthew Honnibal
|
3f5747a9b2
|
Merge branch 'master' of ssh://github.com/explosion/spaCy
|
2016-12-18 23:44:22 +01:00 |
|
Matthew Honnibal
|
40e71586d6
|
Fix Issue #683: Add 'SP' to tag_map, if it's not there already, within the Morphology class.
|
2016-12-18 23:44:05 +01:00 |
|
Matthew Honnibal
|
fa1d23e10d
|
Merge branch 'master' of https://github.com/explosion/spaCy
|
2016-12-18 23:32:03 +01:00 |
|
Matthew Honnibal
|
f38eb25fe1
|
Fix test for word vector
|
2016-12-18 23:31:55 +01:00 |
|
Matthew Honnibal
|
4e68abebc4
|
Merge branch 'master' of ssh://github.com/explosion/spaCy
|
2016-12-18 23:19:45 +01:00 |
|
Matthew Honnibal
|
5a6328a5a4
|
Increment version
|
2016-12-18 23:19:19 +01:00 |
|
Matthew Honnibal
|
13a0b31279
|
Another tweak to GloVe path hackery.
|
2016-12-18 23:12:49 +01:00 |
|
Matthew Honnibal
|
2c6228565e
|
Fix vector loading re glove hack
|
2016-12-18 23:06:44 +01:00 |
|
Matthew Honnibal
|
618b50a064
|
Fix issue #684: GloVe vectors not loaded in spacy.en.English.
|
2016-12-18 22:46:31 +01:00 |
|
Matthew Honnibal
|
404019ad2f
|
Fix issue #672: ent_iob_ was a string, not unicode, due to missing unicode_literals statement.
|
2016-12-18 22:33:53 +01:00 |
|
Matthew Honnibal
|
2ef9d53117
|
Untested fix for issue #684: GloVe vectors hack should be inserted in English, not in spacy.load.
|
2016-12-18 22:29:31 +01:00 |
|
Matthew Honnibal
|
c065359459
|
Fix path-override bug in spacy.load
|
2016-12-18 22:15:29 +01:00 |
|
Matthew Honnibal
|
813249f826
|
Work on morphology class. Still not fully consistent with rest of library.
|
2016-12-18 17:35:22 +01:00 |
|
Matthew Honnibal
|
3679fb43a3
|
Fix loading of lemmatizer
|
2016-12-18 17:34:09 +01:00 |
|
Matthew Honnibal
|
3980f1b0cb
|
Ignore more morphology attributes in deprecated mode of intify_attrs
|
2016-12-18 17:33:46 +01:00 |
|
Matthew Honnibal
|
7a98ee5e5a
|
Merge language data change
|
2016-12-18 17:03:52 +01:00 |
|
Matthew Honnibal
|
e4c951c153
|
Merge branch 'organize-language-data' of ssh://github.com/explosion/spaCy into organize-language-data
|
2016-12-18 17:01:08 +01:00 |
|
Ines Montani
|
b99d683a93
|
Fix formatting
|
2016-12-18 16:58:28 +01:00 |
|
Ines Montani
|
b11d8cd3db
|
Merge remote-tracking branch 'origin/organize-language-data' into organize-language-data
|
2016-12-18 16:57:12 +01:00 |
|
Ines Montani
|
d1c1d3f9cd
|
Fix tokenizer test
|
2016-12-18 16:55:32 +01:00 |
|
Ines Montani
|
753068f1d5
|
Use base language data as default
|
2016-12-18 16:55:25 +01:00 |
|
Ines Montani
|
bcc1d50d09
|
Remove trailing whitespace
|
2016-12-18 16:54:52 +01:00 |
|
Ines Montani
|
4e95737c6c
|
Add base tag map
|
2016-12-18 16:54:28 +01:00 |
|
Ines Montani
|
2b2ea8ca11
|
Reorganise language data
|
2016-12-18 16:54:19 +01:00 |
|
Matthew Honnibal
|
1b31c05bf8
|
Whitespace
|
2016-12-18 16:51:40 +01:00 |
|
Matthew Honnibal
|
bdcecb3c96
|
Add import in regression test
|
2016-12-18 16:51:31 +01:00 |
|
Matthew Honnibal
|
6ee1df93c5
|
Set tag_map to None if it's not seen in the data by vocab
|
2016-12-18 16:51:10 +01:00 |
|
Matthew Honnibal
|
33996e770b
|
Update header for morphology class
|
2016-12-18 16:50:42 +01:00 |
|
Matthew Honnibal
|
d58187ffa7
|
Filter out morphology keys in deprecated attrs
|
2016-12-18 16:50:26 +01:00 |
|
Matthew Honnibal
|
837a5d4100
|
Update morphology class so that exceptions can be added one-by-one, and so that arbitrary attributes can be referenced.
|
2016-12-18 16:49:46 +01:00 |
|
Matthew Honnibal
|
44f4f008bd
|
Wire up lemmatizer rules for English
|
2016-12-18 15:50:09 +01:00 |
|
Matthew Honnibal
|
e6fc4afb04
|
Whitespace
|
2016-12-18 15:48:00 +01:00 |
|
Ines Montani
|
32b36c3882
|
Break language data components into their own files
|
2016-12-18 15:40:22 +01:00 |
|
Ines Montani
|
1bff59a8db
|
Update English language data
|
2016-12-18 15:36:53 +01:00 |
|
Ines Montani
|
2eb163c5dd
|
Add lemma rules
|
2016-12-18 15:36:53 +01:00 |
|
Ines Montani
|
29ad8143d8
|
Add morph rules
|
2016-12-18 15:36:53 +01:00 |
|
Ines Montani
|
bc40dad7d9
|
Add entity rules
|
2016-12-18 15:36:53 +01:00 |
|
Ines Montani
|
eaa3b1319d
|
Fix formatting
|
2016-12-18 15:36:53 +01:00 |
|
Ines Montani
|
704c7442e0
|
Break language data components into their own files
|
2016-12-18 15:36:53 +01:00 |
|
Ines Montani
|
62655fd36f
|
Add ENT_ID constant
|
2016-12-18 15:36:53 +01:00 |
|
Matthew Honnibal
|
fa272fdf12
|
Merge branch 'organize-language-data' of ssh://github.com/explosion/spaCy into organize-language-data
|
2016-12-18 15:00:21 +01:00 |
|
Matthew Honnibal
|
57c4341453
|
Refactor loading of morphology exceptions, adding a method add_special_case.
|
2016-12-18 14:59:44 +01:00 |
|
Ines Montani
|
77cf2fb0f6
|
Remove unnecessary argument in test
|
2016-12-18 14:06:27 +01:00 |
|
Ines Montani
|
121c310566
|
Remove trailing whitespace
|
2016-12-18 14:06:27 +01:00 |
|
Ines Montani
|
0fc4e45cb3
|
Fix tag map for German
|
2016-12-18 13:30:03 +01:00 |
|
Ines Montani
|
28326649f3
|
Fix typo
|
2016-12-18 13:30:03 +01:00 |
|
Matthew Honnibal
|
0595cc0635
|
Change test595 to mock data, instead of requiring model.
|
2016-12-18 13:28:51 +01:00 |
|
Matthew Honnibal
|
a4eb5c2bff
|
Check POS key in lemmatizer, to update it for new data format
|
2016-12-18 13:28:20 +01:00 |
|
Matthew Honnibal
|
28d63ec58e
|
Restore missing '' character in tokenizer exceptions.
|
2016-12-18 05:34:51 +01:00 |
|
Ines Montani
|
a9421652c9
|
Remove duplicates in tag map
|
2016-12-17 22:44:31 +01:00 |
|
Ines Montani
|
69baf1c9a8
|
Fix tag map
|
2016-12-17 22:44:22 +01:00 |
|
Ines Montani
|
577adad945
|
Fix formatting
|
2016-12-17 14:00:52 +01:00 |
|
Ines Montani
|
fc4ad17136
|
Fix typo
|
2016-12-17 14:00:47 +01:00 |
|
Ines Montani
|
bb94e784dc
|
Fix typo
|
2016-12-17 13:59:30 +01:00 |
|
Ines Montani
|
afda532595
|
Use symbols in tag map
|
2016-12-17 13:56:24 +01:00 |
|
Ines Montani
|
07249145c9
|
Fix formatting
|
2016-12-17 13:34:46 +01:00 |
|
Ines Montani
|
dd55d085b6
|
Reformat Dutch language data to match new style
|
2016-12-17 13:26:01 +01:00 |
|
Ines Montani
|
f2c48ef504
|
Resolve stopwords conflict to merge Dutch
|
2016-12-17 13:08:16 +01:00 |
|
Matthew Honnibal
|
ff03ade08f
|
Merge pull request #688 from nlesc-sherlock/dutch
Support for Dutch in SpaCy
|
2016-12-17 22:44:58 +11:00 |
|
Ines Montani
|
a22322187f
|
Add missing lemmas to tokenizer exceptions (fixes #674)
|
2016-12-17 12:42:41 +01:00 |
|
Ines Montani
|
5445074cbd
|
Expand tokenizer exceptions with unicode apostrophe (fixes #685)
|
2016-12-17 12:34:08 +01:00 |
|
Ines Montani
|
e0a7b5c612
|
Fix formatting
|
2016-12-17 12:33:09 +01:00 |
|
Ines Montani
|
08162dce67
|
Move shared functions and constants to global language data
|
2016-12-17 12:32:48 +01:00 |
|
Ines Montani
|
6a60a61086
|
Move update_exc to global language data utils
|
2016-12-17 12:29:02 +01:00 |
|
Ines Montani
|
f324311249
|
Add global language data utils
|
2016-12-17 12:27:41 +01:00 |
|
Ines Montani
|
487ce1e20a
|
Add encoding declaration
|
2016-12-17 12:25:44 +01:00 |
|
Ines Montani
|
d8d50a0334
|
Add tokenizer exception for "gonna" (fixes #691)
|
2016-12-17 11:59:28 +01:00 |
|
Ines Montani
|
c69b77d8aa
|
Revert "Add exception for "gonna""
This reverts commit 280c03f67b.
|
2016-12-17 11:56:44 +01:00 |
|
Ines Montani
|
280c03f67b
|
Add exception for "gonna"
|
2016-12-17 11:54:59 +01:00 |
|
Ines Montani
|
5031a015e2
|
Fix typo in stopwords (fixes #689)
|
2016-12-15 17:57:06 +01:00 |
|
Janneke van der Zwaan
|
4a3fdcce8a
|
Merge github.com:explosion/spaCy into dutch
|
2016-12-13 09:25:23 +01:00 |
|
Matthew Honnibal
|
5965d3c2a7
|
Revert "Add acl to symbols.pyx"
|
2016-12-12 10:10:28 +11:00 |
|
Matthew Honnibal
|
6dee76dfed
|
Update symbols.pxd
|
2016-12-12 10:09:58 +11:00 |
|
Pokey Rule
|
18a15c0777
|
Add acl to symbols.pyx
|
2016-12-11 20:00:07 +00:00 |
|
Gyorgy Orosz
|
0cf2144d24
|
Adding partial hyphen and quote handling support.
|
2016-12-11 00:14:36 +01:00 |
|
Gyorgy Orosz
|
2051726fd3
|
Passing Hungarian abbrev tests.
|
2016-12-10 23:37:58 +01:00 |
|
Ines Montani
|
63024466a9
|
Add Portuguese stopwords
|
2016-12-08 20:45:07 +01:00 |
|
Ines Montani
|
7bfe2d4abc
|
Update Portuguese language data
|
2016-12-08 20:41:41 +01:00 |
|
Ines Montani
|
c0c5f31950
|
Remove unused data and download script
|
2016-12-08 20:39:49 +01:00 |
|
Ines Montani
|
0a6d529104
|
Remove unused data
|
2016-12-08 20:36:56 +01:00 |
|
Ines Montani
|
1b3b043660
|
Add French stopwords
|
2016-12-08 20:12:43 +01:00 |
|
Ines Montani
|
8863e504eb
|
Update French language data
|
2016-12-08 20:07:14 +01:00 |
|
Ines Montani
|
7cb9f51be6
|
Add Italian stopwords
|
2016-12-08 20:05:25 +01:00 |
|
Ines Montani
|
470a0e0bea
|
Update Italian language data
|
2016-12-08 19:52:18 +01:00 |
|
Ines Montani
|
1a284d342e
|
Add Spanish language data
|
2016-12-08 19:47:03 +01:00 |
|
Ines Montani
|
0c39654786
|
Remove unused import
|
2016-12-08 19:46:53 +01:00 |
|
Ines Montani
|
e47ee94761
|
Split punctuation into its own file
|
2016-12-08 19:46:43 +01:00 |
|
Ines Montani
|
70b51ed7c8
|
Remove time from German language data
|
2016-12-08 19:45:50 +01:00 |
|
Ines Montani
|
e8ae588be9
|
Add emoticons
|
2016-12-08 19:45:18 +01:00 |
|
Ines Montani
|
5908c0ed9f
|
Fix formatting
|
2016-12-08 19:45:11 +01:00 |
|
Ines Montani
|
311b30ab35
|
Reorganize exceptions for English and German
|
2016-12-08 13:58:32 +01:00 |
|
Ines Montani
|
66c7348cda
|
Add update_exc util function
|
2016-12-08 13:58:12 +01:00 |
|
Ines Montani
|
1256232fad
|
Fix formatting
|
2016-12-08 13:56:40 +01:00 |
|
Ines Montani
|
8e977cc71c
|
Fix formatting
|
2016-12-08 13:56:17 +01:00 |
|
Ines Montani
|
0176b99004
|
Fix formatting
|
2016-12-08 12:48:02 +01:00 |
|
Ines Montani
|
877f09218b
|
Add more custom rules for abbreviations
|
2016-12-08 12:47:01 +01:00 |
|
Gyorgy Orosz
|
0289b8ceaa
|
Additional abbreviation tests.
|
2016-12-08 12:17:44 +01:00 |
|
Gyorgy Orosz
|
90d22db023
|
Added Hungarian resource files.
|
2016-12-08 12:06:36 +01:00 |
|
Ines Montani
|
bfaa42636c
|
Update language data for German
|
2016-12-08 12:01:09 +01:00 |
|
Ines Montani
|
ec44bee321
|
Fix capitalization on morphological features
|
2016-12-08 12:00:54 +01:00 |
|
Gyorgy Orosz
|
5b00039955
|
First steps towards the Hungarian tokenizer code.
|
2016-12-07 23:07:43 +01:00 |
|
Ines Montani
|
ce979553df
|
Resolve conflict
|
2016-12-07 21:16:52 +01:00 |
|
Ines Montani
|
8350d65695
|
Change morphology and lemmatizer API
Take morphology features as object instead of keyword arguments
|
2016-12-07 21:12:49 +01:00 |
|
Ines Montani
|
52e7d634df
|
Remove trailing whitespace
|
2016-12-07 21:12:19 +01:00 |
|
Ines Montani
|
0d07d7fc80
|
Apply emoticon exceptions to tokenizer
|
2016-12-07 21:11:59 +01:00 |
|
Ines Montani
|
71f0f34cb3
|
Fix formatting
|
2016-12-07 21:11:29 +01:00 |
|
Ines Montani
|
9413bcd9ee
|
Declare encoding and unicode literals
|
2016-12-07 21:10:34 +01:00 |
|
Ines Montani
|
a280ff2657
|
Fix __all__
|
2016-12-07 21:10:12 +01:00 |
|
Ines Montani
|
ba8721953c
|
Add missing emoticons
|
2016-12-07 21:09:44 +01:00 |
|
Ines Montani
|
1285c4ba93
|
Update English language data
|
2016-12-07 20:33:28 +01:00 |
|
Ines Montani
|
79dce0aabe
|
Add emoticons
|
2016-12-07 20:33:28 +01:00 |
|
Ines Montani
|
a662a95294
|
Add line breaks
|
2016-12-07 20:33:28 +01:00 |
|
Ines Montani
|
07f0efb102
|
Add test for tokenizer regular expressions
|
2016-12-07 20:33:28 +01:00 |
|
Ines Montani
|
e0712d1b32
|
Reformat language data
|
2016-12-07 20:33:28 +01:00 |
|
Matthew Honnibal
|
0c0f4c965d
|
Increment version
|
2016-12-03 11:16:52 +01:00 |
|
Matthew Honnibal
|
f6e356aada
|
Add (and test) Span.sentiment attribute. By default we average token.span, but can override with custom hook. Re Issue #667
|
2016-12-02 11:05:50 +01:00 |
|
Janneke van der Zwaan
|
88869e0e07
|
Merge github.com:explosion/spaCy into dutch
|
2016-11-30 17:13:39 +01:00 |
|
Janneke van der Zwaan
|
51ade86b86
|
Update language data with tag map from UD_Dutch
|
2016-11-30 14:41:23 +01:00 |
|
Janneke van der Zwaan
|
90f6ff12c9
|
Update Dutch language data
- Use Dutch tag map
- remove tokenizer exceptions
|
2016-11-30 11:59:39 +01:00 |
|
dafnevk
|
7b8f4c49f2
|
Added language Dutch to init file
|
2016-11-29 16:42:05 +01:00 |
|
Matthew Honnibal
|
296d33a4fc
|
Merge branch 'master' of ssh://github.com/explosion/spaCy
|
2016-11-26 12:36:18 +01:00 |
|
Matthew Honnibal
|
1f6c37c6f5
|
Fix create_tokenizer when nlp is None
|
2016-11-26 12:36:04 +01:00 |
|
Matthew Honnibal
|
c7889492f9
|
Fix model saving error for Python 3
|
2016-11-25 18:04:30 -06:00 |
|
Matthew Honnibal
|
bc0a202c9c
|
Fix unicode problem in nonproj module
|
2016-11-25 17:29:17 -06:00 |
|
Matthew Honnibal
|
6dd3b94fa6
|
Filter out deprecated attributes when reading special-case tokenization rules.
|
2016-11-25 09:57:18 -06:00 |
|
Matthew Honnibal
|
e879c79b8c
|
Merge branch 'master' of https://github.com/explosion/spaCy
|
2016-11-25 09:18:28 -06:00 |
|
Matthew Honnibal
|
a335c6dcc2
|
Exclude morphs from deprecated token attributes for now
|
2016-11-25 16:17:32 +01:00 |
|
Matthew Honnibal
|
f799a07f25
|
Merge branch 'master' of https://github.com/explosion/spaCy
|
2016-11-25 09:16:43 -06:00 |
|
Matthew Honnibal
|
159e8c46e1
|
Merge old training fixes with newer state
|
2016-11-25 09:16:36 -06:00 |
|
Matthew Honnibal
|
846e80f2f4
|
Exclude morphs from deprecated token attributes for now
|
2016-11-25 16:14:54 +01:00 |
|
Matthew Honnibal
|
664f2dd1c0
|
Allow dep to be None in scorer, for missing labels.
|
2016-11-25 09:02:49 -06:00 |
|
Matthew Honnibal
|
39341598bb
|
Fix NER label calculation
|
2016-11-25 09:02:22 -06:00 |
|
Matthew Honnibal
|
ca773a1f53
|
Tweak arc_eager n_gold to deal with negative costs, and improve error message.
|
2016-11-25 09:01:52 -06:00 |
|
Matthew Honnibal
|
a2f55e7015
|
Pass cfg through loading, for training.
|
2016-11-25 09:01:20 -06:00 |
|
Matthew Honnibal
|
608d8f5421
|
Pass cfg through parser, and have is_valid default to 1, not 0 when resetting state
|
2016-11-25 09:00:21 -06:00 |
|
Matthew Honnibal
|
cc7e607a8a
|
Fix gold.pyx for 1.0
|
2016-11-25 08:57:59 -06:00 |
|
root
|
080d29e092
|
Fix train.py for 1.0
|
2016-11-25 08:55:33 -06:00 |
|
Matthew Honnibal
|
6652f2a135
|
Test #656, #624: special case rules for tokenizer with attributes.
|
2016-11-25 12:44:13 +01:00 |
|
Matthew Honnibal
|
1e0f566d95
|
Fix #656, #624: Support arbitrary token attributes when adding special-case rules.
|
2016-11-25 12:43:24 +01:00 |
|
Matthew Honnibal
|
87613edf8f
|
Add set_struct_attr staticmethod to token
|
2016-11-25 12:41:47 +01:00 |
|
Matthew Honnibal
|
fb69aa648f
|
Merge branch 'master' of ssh://github.com/explosion/spaCy
|
2016-11-25 11:35:44 +01:00 |
|
Matthew Honnibal
|
9a03a3f85e
|
Add get_struct_attr staticmethod to Token, to match Lexeme.get_struct_attr.
|
2016-11-25 11:35:17 +01:00 |
|
Matthew Honnibal
|
53d8ca8f51
|
Add spacy.attrs.intify_attrs function, to normalize strings in token attribute dictionaries.
|
2016-11-25 11:34:30 +01:00 |
|
Ines Montani
|
d21ad01840
|
Add emoticons
|
2016-11-24 19:13:00 +01:00 |
|
dafnevk
|
d8c7ac203a
|
Added nl module for Dutch
|
2016-11-24 16:39:49 +01:00 |
|
dafnevk
|
3db8b0d322
|
Added language class and some language data (with some TODOs) for Dutch
|
2016-11-24 15:56:38 +01:00 |
|
Ines Montani
|
4dcfafde02
|
Add line breaks
|
2016-11-24 14:57:37 +01:00 |
|
Ines Montani
|
6247c005a2
|
Add test for tokenizer regular expressions
|
2016-11-24 13:51:59 +01:00 |
|
Ines Montani
|
de747e39e7
|
Reformat language data
|
2016-11-24 13:51:32 +01:00 |
|
Matthew Honnibal
|
b8c4f5ea76
|
Allow German noun chunks to work on Span
Update the German noun chunks iterator, so that it also works on Span objects.
|
2016-11-24 23:30:15 +11:00 |
|
Pokey Rule
|
3e3bda142d
|
Add noun_chunks to Span
|
2016-11-24 10:47:20 +00:00 |
|
Janneke van der Zwaan
|
83daade0e4
|
Add directory and initial (empty) files for language Dutch
|
2016-11-24 09:45:41 +01:00 |
|
Matthew Honnibal
|
09f68bc641
|
Fix Issue #639: stop words in language class not used. This patch is messy, but it's better not to change too much until the language data loading can be properly refactored.
|
2016-11-24 00:13:55 +01:00 |
|
Matthew Honnibal
|
48e1dc29d4
|
Fix default path loading.
|
2016-11-23 23:48:55 +01:00 |
|
Matthew Honnibal
|
e01c1875ee
|
Work on test for #615
|
2016-11-23 23:48:41 +01:00 |
|
ExplodingCabbage
|
6c4f488e89
|
Fix syntax mistake
|
2016-11-23 15:12:45 +00:00 |
|