2186 Commits

Author SHA1 Message Date
Ines Montani
7f411fd01c Remove exceptions containing whitespace / no special chars 2016-12-23 14:30:06 +01:00
Magnus Burton
fdf4776262 Added Swedish abbreviations 2016-12-22 22:45:18 +01:00
Gyorgy Orosz
d9c59c4751 Maintaining backward compatibility. 2016-12-21 23:30:49 +01:00
Gyorgy Orosz
1748549aeb Added exception pattern mechanism to the tokenizer. 2016-12-21 23:16:19 +01:00
Gyorgy Orosz
35aa54765d Hungarian module is exposed in spacy. 2016-12-21 20:45:36 +01:00
Gyorgy Orosz
ab2f6ea46c Removed data files from tests. 2016-12-21 20:22:09 +01:00
Ines Montani
3c87c71d43 Add tokenizer exceptions for a.m. and p.m. in Spanish 2016-12-21 18:19:10 +01:00
Ines Montani
78e63dc7d0 Update tokenizer exceptions for English 2016-12-21 18:06:34 +01:00
Ines Montani
702d1eed93 Update tokenizer exceptions for German 2016-12-21 18:06:27 +01:00
Ines Montani
d60380418e Update tokenizer exceptions for Spanish 2016-12-21 18:06:17 +01:00
Ines Montani
920fa0fed2 Add DET_LEMMA constant 2016-12-21 18:05:41 +01:00
Ines Montani
8978806ea6 Allow Vocab to load without serializer_freqs 2016-12-21 18:05:23 +01:00
Ines Montani
be8ed811f6 Remove trailing whitespace 2016-12-21 18:04:41 +01:00
Ines Montani
926e19184a Merge pull request #695 from magnusburton/master
Added Swedish morph rules
2016-12-21 01:06:00 +01:00
Gyorgy Orosz
3d5306acb9 Added further testcases. 2016-12-20 23:49:35 +01:00
Gyorgy Orosz
23956e72ff Improved partial support for tokenizing Hungarian numbers 2016-12-20 23:36:59 +01:00
Gyorgy Orosz
6add156075 Refactored language data structure 2016-12-20 22:28:20 +01:00
Gyorgy Orosz
366b3f8685 Merge branch 'master' into hu_tokenizer 2016-12-20 20:53:31 +01:00
Gyorgy Orosz
c035928156 Partial Hungarian number tokenization is added. 2016-12-20 20:46:20 +01:00
JM
70ff0639b5 Fixed missing vec_path declaration that was failing if 'add_vectors' was set
Added vec_path variable declaration to avoid accessing it before assignment in case 'add_vectors' is in overrides.
2016-12-20 18:21:05 +01:00
Magnus Burton
48dcc9f647 Added morph rules 2016-12-20 13:18:41 +01:00
Magnus Burton
db5a077d2b Initial commit for Swedish 2016-12-20 11:05:06 +01:00
Matthew Honnibal
3f5747a9b2 Merge branch 'master' of ssh://github.com/explosion/spaCy 2016-12-18 23:44:22 +01:00
Matthew Honnibal
40e71586d6 Fix Issue #683: Add 'SP' to tag_map, if it's not there already, within the Morphology class. 2016-12-18 23:44:05 +01:00
Matthew Honnibal
fa1d23e10d Merge branch 'master' of https://github.com/explosion/spaCy 2016-12-18 23:32:03 +01:00
Matthew Honnibal
f38eb25fe1 Fix test for word vector 2016-12-18 23:31:55 +01:00
Matthew Honnibal
4e68abebc4 Merge branch 'master' of ssh://github.com/explosion/spaCy 2016-12-18 23:19:45 +01:00
Matthew Honnibal
5a6328a5a4 Increment version 2016-12-18 23:19:19 +01:00
Matthew Honnibal
13a0b31279 Another tweak to GloVe path hackery. 2016-12-18 23:12:49 +01:00
Matthew Honnibal
2c6228565e Fix vector loading re glove hack 2016-12-18 23:06:44 +01:00
Matthew Honnibal
618b50a064 Fix issue #684: GloVe vectors not loaded in spacy.en.English. 2016-12-18 22:46:31 +01:00
Matthew Honnibal
404019ad2f Fix issue #672: ent_iob_ was a string, not unicode, due to missing unicode_literals statement. 2016-12-18 22:33:53 +01:00
Matthew Honnibal
2ef9d53117 Untested fix for issue #684: GloVe vectors hack should be inserted in English, not in spacy.load. 2016-12-18 22:29:31 +01:00
Matthew Honnibal
c065359459 Fix path-override bug in spacy.load 2016-12-18 22:15:29 +01:00
Matthew Honnibal
813249f826 Work on morphology class. Still not fully consistent with rest of library. 2016-12-18 17:35:22 +01:00
Matthew Honnibal
3679fb43a3 Fix loading of lemmatizer 2016-12-18 17:34:09 +01:00
Matthew Honnibal
3980f1b0cb Ignore more morphology attributes in deprecated mode of intify_attrs 2016-12-18 17:33:46 +01:00
Matthew Honnibal
7a98ee5e5a Merge language data change 2016-12-18 17:03:52 +01:00
Matthew Honnibal
e4c951c153 Merge branch 'organize-language-data' of ssh://github.com/explosion/spaCy into organize-language-data 2016-12-18 17:01:08 +01:00
Ines Montani
b99d683a93 Fix formatting 2016-12-18 16:58:28 +01:00
Ines Montani
b11d8cd3db Merge remote-tracking branch 'origin/organize-language-data' into organize-language-data 2016-12-18 16:57:12 +01:00
Ines Montani
d1c1d3f9cd Fix tokenizer test 2016-12-18 16:55:32 +01:00
Ines Montani
753068f1d5 Use base language data as default 2016-12-18 16:55:25 +01:00
Ines Montani
bcc1d50d09 Remove trailing whitespace 2016-12-18 16:54:52 +01:00
Ines Montani
4e95737c6c Add base tag map 2016-12-18 16:54:28 +01:00
Ines Montani
2b2ea8ca11 Reorganise language data 2016-12-18 16:54:19 +01:00
Matthew Honnibal
1b31c05bf8 Whitespace 2016-12-18 16:51:40 +01:00
Matthew Honnibal
bdcecb3c96 Add import in regression test 2016-12-18 16:51:31 +01:00
Matthew Honnibal
6ee1df93c5 Set tag_map to None if it's not seen in the data by vocab 2016-12-18 16:51:10 +01:00
Matthew Honnibal
33996e770b Update header for morphology class 2016-12-18 16:50:42 +01:00
Matthew Honnibal
d58187ffa7 Filter out morphology keys in deprecated attrs 2016-12-18 16:50:26 +01:00
Matthew Honnibal
837a5d4100 Update morphology class so that exceptions can be added one-by-one, and so that arbitrary attributes can be referenced. 2016-12-18 16:49:46 +01:00
Matthew Honnibal
44f4f008bd Wire up lemmatizer rules for English 2016-12-18 15:50:09 +01:00
Matthew Honnibal
e6fc4afb04 Whitespace 2016-12-18 15:48:00 +01:00
Ines Montani
32b36c3882 Break language data components into their own files 2016-12-18 15:40:22 +01:00
Ines Montani
1bff59a8db Update English language data 2016-12-18 15:36:53 +01:00
Ines Montani
2eb163c5dd Add lemma rules 2016-12-18 15:36:53 +01:00
Ines Montani
29ad8143d8 Add morph rules 2016-12-18 15:36:53 +01:00
Ines Montani
bc40dad7d9 Add entity rules 2016-12-18 15:36:53 +01:00
Ines Montani
eaa3b1319d Fix formatting 2016-12-18 15:36:53 +01:00
Ines Montani
704c7442e0 Break language data components into their own files 2016-12-18 15:36:53 +01:00
Ines Montani
62655fd36f Add ENT_ID constant 2016-12-18 15:36:53 +01:00
Matthew Honnibal
fa272fdf12 Merge branch 'organize-language-data' of ssh://github.com/explosion/spaCy into organize-language-data 2016-12-18 15:00:21 +01:00
Matthew Honnibal
57c4341453 Refactor loading of morphology exceptions, adding a method add_special_case. 2016-12-18 14:59:44 +01:00
Ines Montani
77cf2fb0f6 Remove unnecessary argument in test 2016-12-18 14:06:27 +01:00
Ines Montani
121c310566 Remove trailing whitespace 2016-12-18 14:06:27 +01:00
Ines Montani
0fc4e45cb3 Fix tag map for German 2016-12-18 13:30:03 +01:00
Ines Montani
28326649f3 Fix typo 2016-12-18 13:30:03 +01:00
Matthew Honnibal
0595cc0635 Change test595 to mock data, instead of requiring model. 2016-12-18 13:28:51 +01:00
Matthew Honnibal
a4eb5c2bff Check POS key in lemmatizer, to update it for new data format 2016-12-18 13:28:20 +01:00
Matthew Honnibal
28d63ec58e Restore missing '' character in tokenizer exceptions. 2016-12-18 05:34:51 +01:00
Ines Montani
a9421652c9 Remove duplicates in tag map 2016-12-17 22:44:31 +01:00
Ines Montani
69baf1c9a8 Fix tag map 2016-12-17 22:44:22 +01:00
Ines Montani
577adad945 Fix formatting 2016-12-17 14:00:52 +01:00
Ines Montani
fc4ad17136 Fix typo 2016-12-17 14:00:47 +01:00
Ines Montani
bb94e784dc Fix typo 2016-12-17 13:59:30 +01:00
Ines Montani
afda532595 Use symbols in tag map 2016-12-17 13:56:24 +01:00
Ines Montani
07249145c9 Fix formatting 2016-12-17 13:34:46 +01:00
Ines Montani
dd55d085b6 Reformat dutch language data to match new style 2016-12-17 13:26:01 +01:00
Ines Montani
f2c48ef504 Resolve stopwords conflict to merge Dutch 2016-12-17 13:08:16 +01:00
Matthew Honnibal
ff03ade08f Merge pull request #688 from nlesc-sherlock/dutch
Support for Dutch in spaCy
2016-12-17 22:44:58 +11:00
Ines Montani
a22322187f Add missing lemmas to tokenizer exceptions (fixes #674) 2016-12-17 12:42:41 +01:00
Ines Montani
5445074cbd Expand tokenizer exceptions with unicode apostrophe (fixes #685) 2016-12-17 12:34:08 +01:00
Ines Montani
e0a7b5c612 Fix formatting 2016-12-17 12:33:09 +01:00
Ines Montani
08162dce67 Move shared functions and constants to global language data 2016-12-17 12:32:48 +01:00
Ines Montani
6a60a61086 Move update_exc to global language data utils 2016-12-17 12:29:02 +01:00
Ines Montani
f324311249 Add global language data utils 2016-12-17 12:27:41 +01:00
Ines Montani
487ce1e20a Add encoding declaration 2016-12-17 12:25:44 +01:00
Ines Montani
d8d50a0334 Add tokenizer exception for "gonna" (fixes #691) 2016-12-17 11:59:28 +01:00
Ines Montani
c69b77d8aa Revert "Add exception for "gonna""
This reverts commit 280c03f67b.
2016-12-17 11:56:44 +01:00
Ines Montani
280c03f67b Add exception for "gonna" 2016-12-17 11:54:59 +01:00
Ines Montani
5031a015e2 Fix typo in stopwords (fixes #689) 2016-12-15 17:57:06 +01:00
Janneke van der Zwaan
4a3fdcce8a Merge github.com:explosion/spaCy into dutch 2016-12-13 09:25:23 +01:00
Matthew Honnibal
5965d3c2a7 Revert "Add acl to symbols.pyx" 2016-12-12 10:10:28 +11:00
Matthew Honnibal
6dee76dfed Update symbols.pxd 2016-12-12 10:09:58 +11:00
Pokey Rule
18a15c0777 Add acl to symbols.pyx 2016-12-11 20:00:07 +00:00
Gyorgy Orosz
0cf2144d24 Adding partial hyphen and quote handling support. 2016-12-11 00:14:36 +01:00
Gyorgy Orosz
2051726fd3 Passing Hungarian abbrev tests. 2016-12-10 23:37:58 +01:00
Ines Montani
63024466a9 Add Portuguese stopwords 2016-12-08 20:45:07 +01:00
Ines Montani
7bfe2d4abc Update Portuguese language data 2016-12-08 20:41:41 +01:00
Ines Montani
c0c5f31950 Remove unused data and download script 2016-12-08 20:39:49 +01:00
Ines Montani
0a6d529104 Remove unused data 2016-12-08 20:36:56 +01:00
Ines Montani
1b3b043660 Add French stopwords 2016-12-08 20:12:43 +01:00
Ines Montani
8863e504eb Update French language data 2016-12-08 20:07:14 +01:00
Ines Montani
7cb9f51be6 Add Italian stopwords 2016-12-08 20:05:25 +01:00
Ines Montani
470a0e0bea Update Italian language data 2016-12-08 19:52:18 +01:00
Ines Montani
1a284d342e Add Spanish language data 2016-12-08 19:47:03 +01:00
Ines Montani
0c39654786 Remove unused import 2016-12-08 19:46:53 +01:00
Ines Montani
e47ee94761 Split punctuation into its own file 2016-12-08 19:46:43 +01:00
Ines Montani
70b51ed7c8 Remove time from German language data 2016-12-08 19:45:50 +01:00
Ines Montani
e8ae588be9 Add emoticons 2016-12-08 19:45:18 +01:00
Ines Montani
5908c0ed9f Fix formatting 2016-12-08 19:45:11 +01:00
Ines Montani
311b30ab35 Reorganize exceptions for English and German 2016-12-08 13:58:32 +01:00
Ines Montani
66c7348cda Add update_exc util function 2016-12-08 13:58:12 +01:00
Ines Montani
1256232fad Fix formatting 2016-12-08 13:56:40 +01:00
Ines Montani
8e977cc71c Fix formatting 2016-12-08 13:56:17 +01:00
Ines Montani
0176b99004 Fix formatting 2016-12-08 12:48:02 +01:00
Ines Montani
877f09218b Add more custom rules for abbreviations 2016-12-08 12:47:01 +01:00
Gyorgy Orosz
0289b8ceaa Additional abbreviation tests. 2016-12-08 12:17:44 +01:00
Gyorgy Orosz
90d22db023 Added Hungarian resource files. 2016-12-08 12:06:36 +01:00
Ines Montani
bfaa42636c Update language data for German 2016-12-08 12:01:09 +01:00
Ines Montani
ec44bee321 Fix capitalization on morphological features 2016-12-08 12:00:54 +01:00
Gyorgy Orosz
5b00039955 First steps towards the Hungarian tokenizer code. 2016-12-07 23:07:43 +01:00
Ines Montani
ce979553df Resolve conflict 2016-12-07 21:16:52 +01:00
Ines Montani
8350d65695 Change morphology and lemmatizer API
Take morphology features as object instead of keyword arguments
2016-12-07 21:12:49 +01:00
Ines Montani
52e7d634df Remove trailing whitespace 2016-12-07 21:12:19 +01:00
Ines Montani
0d07d7fc80 Apply emoticon exceptions to tokenizer 2016-12-07 21:11:59 +01:00
Ines Montani
71f0f34cb3 Fix formatting 2016-12-07 21:11:29 +01:00
Ines Montani
9413bcd9ee Declare encoding and unicode literals 2016-12-07 21:10:34 +01:00
Ines Montani
a280ff2657 Fix __all__ 2016-12-07 21:10:12 +01:00
Ines Montani
ba8721953c Add missing emoticons 2016-12-07 21:09:44 +01:00
Ines Montani
1285c4ba93 Update English language data 2016-12-07 20:33:28 +01:00
Ines Montani
79dce0aabe Add emoticons 2016-12-07 20:33:28 +01:00
Ines Montani
a662a95294 Add line breaks 2016-12-07 20:33:28 +01:00
Ines Montani
07f0efb102 Add test for tokenizer regular expressions 2016-12-07 20:33:28 +01:00
Ines Montani
e0712d1b32 Reformat language data 2016-12-07 20:33:28 +01:00
Matthew Honnibal
0c0f4c965d Increment version 2016-12-03 11:16:52 +01:00
Matthew Honnibal
f6e356aada Add (and test) Span.sentiment attribute. By default we average token.span, but can override with custom hook. Re Issue #667 2016-12-02 11:05:50 +01:00
Janneke van der Zwaan
88869e0e07 Merge github.com:explosion/spaCy into dutch 2016-11-30 17:13:39 +01:00
Janneke van der Zwaan
51ade86b86 Update language data with tag map from UD_Dutch 2016-11-30 14:41:23 +01:00
Janneke van der Zwaan
90f6ff12c9 Update Dutch language data
- Use Dutch tag map
- remove tokenizer exceptions
2016-11-30 11:59:39 +01:00
dafnevk
7b8f4c49f2 Added language Dutch to init file 2016-11-29 16:42:05 +01:00
Matthew Honnibal
296d33a4fc Merge branch 'master' of ssh://github.com/explosion/spaCy 2016-11-26 12:36:18 +01:00
Matthew Honnibal
1f6c37c6f5 Fix create_tokenizer when nlp is None 2016-11-26 12:36:04 +01:00
Matthew Honnibal
c7889492f9 Fix model saving error for Python 3 2016-11-25 18:04:30 -06:00
Matthew Honnibal
bc0a202c9c Fix unicode problem in nonproj module 2016-11-25 17:29:17 -06:00
Matthew Honnibal
6dd3b94fa6 Filter out deprecated attributes when reading special-case tokenization rules. 2016-11-25 09:57:18 -06:00
Matthew Honnibal
e879c79b8c Merge branch 'master' of https://github.com/explosion/spaCy 2016-11-25 09:18:28 -06:00
Matthew Honnibal
a335c6dcc2 Exclude morphs from deprecated token attributes for now 2016-11-25 16:17:32 +01:00
Matthew Honnibal
f799a07f25 Merge branch 'master' of https://github.com/explosion/spaCy 2016-11-25 09:16:43 -06:00