Commit Graph

255 Commits

Author SHA1 Message Date
Ines Montani
ce979553df Resolve conflict 2016-12-07 21:16:52 +01:00
Ines Montani
0d07d7fc80 Apply emoticon exceptions to tokenizer 2016-12-07 21:11:59 +01:00
Ines Montani
71f0f34cb3 Fix formatting 2016-12-07 21:11:29 +01:00
Ines Montani
1285c4ba93 Update English language data 2016-12-07 20:33:28 +01:00
Ines Montani
a662a95294 Add line breaks 2016-12-07 20:33:28 +01:00
Ines Montani
e0712d1b32 Reformat language data 2016-12-07 20:33:28 +01:00
Ines Montani
4dcfafde02 Add line breaks 2016-11-24 14:57:37 +01:00
Ines Montani
de747e39e7 Reformat language data 2016-11-24 13:51:32 +01:00
Mark Amery
1988fce389 Merge remote-tracking branch 'origin/master' into specify-data-path 2016-11-20 16:07:14 +00:00
Mark Amery
3871007c72 Let --data-path be specified when running download.py scripts
Resolves https://github.com/explosion/spaCy/issues/637
2016-11-20 15:48:04 +00:00
Ines Montani
dad2c6cae9 Strip trailing whitespace 2016-11-20 16:45:51 +01:00
Matthew Honnibal
f0917b6808 Fix Issue #376: and/or was tagged as a noun. 2016-11-04 15:21:28 +01:00
Matthew Honnibal
737816e86e Fix #368: Tokenizer handled pattern 'unicode close quote, period' incorrectly. 2016-11-04 15:16:20 +01:00
Matthew Honnibal
41a90a7fbb Add tokenizer exception for 'Ph.D.', to fix 592. 2016-11-03 00:03:34 +01:00
Matthew Honnibal
e7414cd064 Try to fix weird install glitch. 2016-10-23 19:46:28 +02:00
Matthew Honnibal
622b0a9674 Tweak download script 2016-10-19 00:52:16 +02:00
Matthew Honnibal
edc45c19d6 Update download script 2016-10-19 00:41:14 +02:00
Matthew Honnibal
8c8f5c62c6 Add LANG attribute to English and German 2016-10-18 18:52:48 +02:00
Matthew Honnibal
ea23b64cc8 Refactor training, with new spacy.train module. Defaults still a little awkward. 2016-10-09 12:24:24 +02:00
Matthew Honnibal
7db956133e Move tokenizer data for German into spacy.de.language_data 2016-09-25 15:37:33 +02:00
Matthew Honnibal
95aaea0d3f Refactor so that the tokenizer data is read from Python data, rather than from disk 2016-09-25 14:49:53 +02:00
Matthew Honnibal
d7e9acdcdf Add English language data, so that the tokenizer doesn't require the data download 2016-09-25 14:49:00 +02:00
Matthew Honnibal
fd65cf6cbb Finish refactoring data loading 2016-09-24 20:26:17 +02:00
Henning Peters
470cdf5bf9 remove deprecated LOCAL_DATA_DIR 2016-04-05 11:25:54 +02:00
Henning Peters
a7d7ea3afa first idea for supporting multiple langs in download script 2016-03-24 11:19:43 +01:00
Henning Peters
9cc4f8d5b3 avoid shadowing __name__ 2016-02-15 01:33:39 +01:00
Matthew Honnibal
445164d5b4 * Restore the LOCAL_DATA_DIR global in spacy/en/__init__.py, although this is now deprecated 2016-01-19 02:54:56 +01:00
Henning Peters
5551052840 fix py2/3 issue 2016-01-16 12:44:53 +01:00
Henning Peters
235f094534 untangle data_path/via 2016-01-16 12:23:45 +01:00
Henning Peters
211913d689 add about.py, adapt setup.py 2016-01-15 18:57:01 +01:00
Henning Peters
780cb847c9 add default_model to about 2016-01-15 18:07:15 +01:00
Henning Peters
788f734513 refactored data_dir->via, add zip_safe, add spacy.load() 2016-01-15 18:01:02 +01:00
Henning Peters
9b75d872b0 fix model download 2016-01-14 12:02:56 +01:00
Matthew Honnibal
187960606f * Fix pickle problems 2015-12-28 16:54:03 +01:00
Henning Peters
32d655b6e1 bump version 2015-12-28 09:34:39 +01:00
Matthew Honnibal
8b61d45ed0 * Fix merge conflicts for headers branch 2015-12-27 17:46:25 +01:00
Henning Peters
0e321a7105 get mingw32 to work 2015-12-22 23:25:38 +01:00
Henning Peters
8359bd4d93 strip data/ from package, friendlier Language invocation, make data_dir backward/forward-compatible 2015-12-18 09:52:55 +01:00
Henning Peters
970278a3d6 no need to link data dir anymore 2015-12-18 09:49:45 +01:00
Henning Peters
2d4efe40f9 fix sputnik call 2015-12-13 14:46:08 +01:00
Henning Peters
ac318b568c new approach to dependency headers 2015-12-13 11:49:17 +01:00
Henning Peters
9027cef3bc access model via sputnik 2015-12-07 06:01:28 +01:00
Henning Peters
73e5650be5 change index server 2015-11-18 18:09:46 +01:00
Henning Peters
50d15ea5d2 fix 2015-11-18 17:35:21 +01:00
Henning Peters
919a4f0b04 change data path, add repository 2015-11-18 11:40:46 +01:00
Henning Peters
12de895e60 fix version 2015-11-15 16:38:16 +01:00
Henning Peters
03d2f98cd5 add sputnik 2015-11-15 15:58:21 +01:00
Matthew Honnibal
3b74739c3e * Download updated data 2015-11-08 21:24:25 +11:00
Matthew Honnibal
ffedff9e6c * Remove the archive after download, to save disk space 2015-11-03 18:54:05 +11:00
Matthew Honnibal
ff4fe524ee * Fix exception for python 2 2015-10-23 01:56:13 +02:00
Matthew Honnibal
341a3e85cd * Upd downloaded data version 2015-10-23 00:56:57 +02:00
Henning Peters
ccffd2ef53 fixed extract directory 2015-10-21 07:59:34 +02:00
Henning Peters
da4c9cee06 assert filename match 2015-10-20 19:33:59 +02:00
Henning Peters
4f703f0cb4 better error reporting, cleanup 2015-10-20 19:11:29 +02:00
Matthew Honnibal
9cdea6e450 * Import uget correctly 2015-10-19 08:32:41 +02:00
Henning Peters
bfde91fa49 add custom download tool (uget), replace wget with uget 2015-10-18 12:35:04 +02:00
Matthew Honnibal
e886e6a406 * Inc version 2015-10-13 13:46:17 +11:00
Matthew Honnibal
a3dfe2b901 * Increment data version 2015-10-09 13:26:17 +02:00
Matthew Honnibal
b228a8f4a6 * Remove spacy/en/attrs 2015-10-06 16:20:46 +11:00
Matthew Honnibal
693677fd8d * Prepare to remove en/attrx file, now that moving to symbols.pyx 2015-10-06 16:20:13 +11:00
Matthew Honnibal
ecc5281b36 * Remove en/pos.pyx, as the tagger code now lives in spacy/tagger.pyx 2015-10-06 10:12:08 +11:00
Robert
8711b64860 Force SSL for downloading English language data.
It would also be nice to have a checksum for this.
2015-09-21 17:26:01 -07:00
Matthew Honnibal
e13e47e9e5 * Add English stop words 2015-09-14 17:48:51 +10:00
Matthew Honnibal
0b7d2a6c62 * Inc version 2015-09-13 01:26:29 +02:00
Matthew Honnibal
e2ef78b29c * Gut pos.pyx module, since functionality moved to spacy/tagger.pyx 2015-08-26 19:15:42 +02:00
Matthew Honnibal
c4d8754385 * Specify LOCAL_DATA_DIR global in spacy.en.__init__.py 2015-08-26 19:15:07 +02:00
Matthew Honnibal
c5a27d1821 * Move lemmatizer to spacy 2015-08-25 15:47:08 +02:00
Matthew Honnibal
82217c6ec6 * Generalize lemmatizer 2015-08-25 15:46:19 +02:00
Matthew Honnibal
8083a07c3e * Use language base class 2015-08-25 15:37:30 +02:00
Matthew Honnibal
5dd76be446 * Split EnPosTagger up into base class and subclass 2015-08-24 05:25:55 +02:00
Matthew Honnibal
6f1743692a * Work on language-independent refactoring 2015-08-23 20:49:18 +02:00
Matthew Honnibal
cad0cca4e3 * Tmp 2015-08-22 22:04:34 +02:00
Matthew Honnibal
5737115e1e * Work on gazetteer matching 2015-08-06 14:33:21 +02:00
Matthew Honnibal
c609ea18f0 * Increment version in download script 2015-07-28 15:22:17 +02:00
Matthew Honnibal
ddc1a5cfe5 * Fix training under python3 2015-07-28 14:09:30 +02:00
Matthew Honnibal
a296d72b54 * Fix en/attrs 2015-07-27 21:16:33 +02:00
Matthew Honnibal
8535d872e8 * Set is_oov property in get_flags 2015-07-27 01:51:24 +02:00
Matthew Honnibal
8e4c69ee8c * Add is_oov property, and fix up handling of attributes 2015-07-27 01:50:06 +02:00
Matthew Honnibal
6bb96c122d * Host IS_ flags in attrs.pxd, and add properties for them on Token and Lexeme objects 2015-07-26 16:37:16 +02:00
Matthew Honnibal
eeaea25f0c * Check oov_prob file is present 2015-07-26 16:36:38 +02:00
Matthew Honnibal
1b5d1da2a7 * Allow an OOV probability to be specified in get_lex_props 2015-07-26 00:03:43 +02:00
Matthew Honnibal
cd6e25132b * Allow an OOV probability to be specified in get_lex_props 2015-07-26 00:01:46 +02:00
Matthew Honnibal
5b41744270 * Check for directory presence before loading annotators 2015-07-23 09:27:37 +02:00
Matthew Honnibal
12699a1152 * Set initial freqs, to avoid missing values in serializer 2015-07-23 01:16:27 +02:00
Matthew Honnibal
680bb47b55 * Write serializer freqs to single file, vocab/serializer.json 2015-07-23 01:15:25 +02:00
Matthew Honnibal
38ef986b29 * Update spacy/en/attrs.pxd 2015-07-23 01:10:58 +02:00
Matthew Honnibal
c86dbe4944 * Update English.save_models for new Packer save/load stuff 2015-07-22 13:40:23 +02:00
Matthew Honnibal
317cbbc015 * Serialization round trip now working with decent API, but with rough spots in the organisation and requiring vocabulary to be fixed ahead of time. 2015-07-19 15:18:17 +02:00
Matthew Honnibal
4dddc8a69b * Fix type declarations for attr_t. Remove unused id_t. 2015-07-18 22:39:57 +02:00
Matthew Honnibal
95e57c2780 * Remove unnecessary key and id properties from Utf8String. 2015-07-17 01:40:18 +02:00
Matthew Honnibal
db9dfd2e23 * Major refactor of serialization. Nearly complete now. 2015-07-17 01:27:54 +02:00
Matthew Honnibal
897de2d438 * Add 'bitter' property for serializer in English class 2015-07-16 17:47:53 +02:00
Matthew Honnibal
6eef0bf9ab * Break up tokens.pyx into tokens/doc.pyx, tokens/token.pyx, tokens/spans.pyx 2015-07-13 20:20:58 +02:00
Matthew Honnibal
ff9ff6f3fa * Ensure unseen words are given low log probability 2015-07-12 01:31:09 +02:00
Matthew Honnibal
89a91ad726 * Add SPACE part-of-speech tag, and train tagger to assign it. Also train tagger not to make whitespace an entity 2015-07-09 13:30:41 +02:00
Matthew Honnibal
6ddb2f5e45 * Restore merge_mwe in English class 2015-07-08 19:35:30 +02:00
Matthew Honnibal
6859f6adac * Restore merge_mwe in English class 2015-07-08 19:34:55 +02:00
Matthew Honnibal
e3c53f5ecd * Fix mention of Tokens in docstring 2015-07-08 18:56:27 +02:00
Matthew Honnibal
bb522496dd * Rename Tokens to Doc 2015-07-08 18:53:00 +02:00
Matthew Honnibal
4e4fac452b * Refactor __init__ for simplicity. Allow parse=True, tag=True etc flags to be passed at top-level. Do not lazy-load parser. 2015-07-08 12:35:29 +02:00