Matthew Honnibal
|
341a3e85cd
|
* Upd downloaded data version
|
2015-10-23 00:56:57 +02:00 |
|
Henning Peters
|
ccffd2ef53
|
fixed extract directory
|
2015-10-21 07:59:34 +02:00 |
|
Henning Peters
|
da4c9cee06
|
assert filename match
|
2015-10-20 19:33:59 +02:00 |
|
Henning Peters
|
4f703f0cb4
|
better error reporting, cleanup
|
2015-10-20 19:11:29 +02:00 |
|
Matthew Honnibal
|
9cdea6e450
|
* Import uget correctly
|
2015-10-19 08:32:41 +02:00 |
|
Henning Peters
|
bfde91fa49
|
add custom download tool (uget), replace wget with uget
|
2015-10-18 12:35:04 +02:00 |
|
Matthew Honnibal
|
e886e6a406
|
* Inc version
|
2015-10-13 13:46:17 +11:00 |
|
Matthew Honnibal
|
a3dfe2b901
|
* Increment data version
|
2015-10-09 13:26:17 +02:00 |
|
Matthew Honnibal
|
b228a8f4a6
|
* Remove spacy/en/attrs
|
2015-10-06 16:20:46 +11:00 |
|
Matthew Honnibal
|
693677fd8d
|
* Prepare to remove en/attrx file, now that moving to symbols.pyx
|
2015-10-06 16:20:13 +11:00 |
|
Matthew Honnibal
|
ecc5281b36
|
* Remove en/pos.pyx, as the tagger code now lives in spacy/tagger.pyx
|
2015-10-06 10:12:08 +11:00 |
|
Robert
|
8711b64860
|
Force SSL for downloading English language data.
It would also be nice to have a checksum for this.
|
2015-09-21 17:26:01 -07:00 |
|
Matthew Honnibal
|
e13e47e9e5
|
* Add English stop words
|
2015-09-14 17:48:51 +10:00 |
|
Matthew Honnibal
|
0b7d2a6c62
|
* Inc version
|
2015-09-13 01:26:29 +02:00 |
|
Matthew Honnibal
|
e2ef78b29c
|
* Gut pos.pyx module, since functionality moved to spacy/tagger.pyx
|
2015-08-26 19:15:42 +02:00 |
|
Matthew Honnibal
|
c4d8754385
|
* Specify LOCAL_DATA_DIR global in spacy.en.__init__.py
|
2015-08-26 19:15:07 +02:00 |
|
Matthew Honnibal
|
c5a27d1821
|
* Move lemmatizer to spacy
|
2015-08-25 15:47:08 +02:00 |
|
Matthew Honnibal
|
82217c6ec6
|
* Generalize lemmatizer
|
2015-08-25 15:46:19 +02:00 |
|
Matthew Honnibal
|
8083a07c3e
|
* Use language base class
|
2015-08-25 15:37:30 +02:00 |
|
Matthew Honnibal
|
5dd76be446
|
* Split EnPosTagger up into base class and subclass
|
2015-08-24 05:25:55 +02:00 |
|
Matthew Honnibal
|
6f1743692a
|
* Work on language-independent refactoring
|
2015-08-23 20:49:18 +02:00 |
|
Matthew Honnibal
|
cad0cca4e3
|
* Tmp
|
2015-08-22 22:04:34 +02:00 |
|
Matthew Honnibal
|
5737115e1e
|
* Work on gazetteer matching
|
2015-08-06 14:33:21 +02:00 |
|
Matthew Honnibal
|
c609ea18f0
|
* Increment version in download script
|
2015-07-28 15:22:17 +02:00 |
|
Matthew Honnibal
|
ddc1a5cfe5
|
* Fix training under python3
|
2015-07-28 14:09:30 +02:00 |
|
Matthew Honnibal
|
a296d72b54
|
* Fix en/attrs
|
2015-07-27 21:16:33 +02:00 |
|
Matthew Honnibal
|
8535d872e8
|
* Set is_oov property in get_flags
|
2015-07-27 01:51:24 +02:00 |
|
Matthew Honnibal
|
8e4c69ee8c
|
* Add is_oov property, and fix up handling of attributes
|
2015-07-27 01:50:06 +02:00 |
|
Matthew Honnibal
|
6bb96c122d
|
* Host IS_ flags in attrs.pxd, and add properties for them on Token and Lexeme objects
|
2015-07-26 16:37:16 +02:00 |
|
Matthew Honnibal
|
eeaea25f0c
|
* Check oov_prob file is present
|
2015-07-26 16:36:38 +02:00 |
|
Matthew Honnibal
|
1b5d1da2a7
|
* Allow an OOV probability to be specified in get_lex_props
|
2015-07-26 00:03:43 +02:00 |
|
Matthew Honnibal
|
cd6e25132b
|
* Allow an OOV probability to be specified in get_lex_props
|
2015-07-26 00:01:46 +02:00 |
|
Matthew Honnibal
|
5b41744270
|
* Check for directory presence before loading annotators
|
2015-07-23 09:27:37 +02:00 |
|
Matthew Honnibal
|
12699a1152
|
* Set initial freqs, to avoid missing values in serializer
|
2015-07-23 01:16:27 +02:00 |
|
Matthew Honnibal
|
680bb47b55
|
* Write serializer freqs to single file, vocab/serializer.json
|
2015-07-23 01:15:25 +02:00 |
|
Matthew Honnibal
|
38ef986b29
|
* Update spacy/en/attrs.pxd
|
2015-07-23 01:10:58 +02:00 |
|
Matthew Honnibal
|
c86dbe4944
|
* Update English.save_models for new Packer save/load stuff
|
2015-07-22 13:40:23 +02:00 |
|
Matthew Honnibal
|
317cbbc015
|
* Serialization round trip now working with decent API, but with rough spots in the organisation and requiring vocabulary to be fixed ahead of time.
|
2015-07-19 15:18:17 +02:00 |
|
Matthew Honnibal
|
4dddc8a69b
|
* Fix type declarations for attr_t. Remove unused id_t.
|
2015-07-18 22:39:57 +02:00 |
|
Matthew Honnibal
|
95e57c2780
|
* Remove unnecessary key and id properties from Utf8String.
|
2015-07-17 01:40:18 +02:00 |
|
Matthew Honnibal
|
db9dfd2e23
|
* Major refactor of serialization. Nearly complete now.
|
2015-07-17 01:27:54 +02:00 |
|
Matthew Honnibal
|
897de2d438
|
* Add 'bitter' property for serializer in English class
|
2015-07-16 17:47:53 +02:00 |
|
Matthew Honnibal
|
6eef0bf9ab
|
* Break up tokens.pyx into tokens/doc.pyx, tokens/token.pyx, tokens/spans.pyx
|
2015-07-13 20:20:58 +02:00 |
|
Matthew Honnibal
|
ff9ff6f3fa
|
* Ensure unseen words are given low log probability
|
2015-07-12 01:31:09 +02:00 |
|
Matthew Honnibal
|
89a91ad726
|
* Add SPACE part-of-speech tag, and train tagger to assign it. Also train tagger not to make whitespace an entity
|
2015-07-09 13:30:41 +02:00 |
|
Matthew Honnibal
|
6ddb2f5e45
|
* Restore merge_mwe in English class
|
2015-07-08 19:35:30 +02:00 |
|
Matthew Honnibal
|
6859f6adac
|
* Restore merge_mwe in English class
|
2015-07-08 19:34:55 +02:00 |
|
Matthew Honnibal
|
e3c53f5ecd
|
* Fix mention of Tokens in docstring
|
2015-07-08 18:56:27 +02:00 |
|
Matthew Honnibal
|
bb522496dd
|
* Rename Tokens to Doc
|
2015-07-08 18:53:00 +02:00 |
|
Matthew Honnibal
|
4e4fac452b
|
* Refactor __init__ for simplicity. Allow parse=True, tag=True etc flags to be passed at top-level. Do not lazy-load parser.
|
2015-07-08 12:35:29 +02:00 |
|
Matthew Honnibal
|
1d2deb4616
|
* Work on refactoring default arguments to English.__init__
|
2015-07-07 15:53:25 +02:00 |
|
Matthew Honnibal
|
6788c86b2f
|
* Begin refactor
|
2015-07-07 14:00:07 +02:00 |
|
Matthew Honnibal
|
9af86b0b0b
|
* Fix attrs.pxd
|
2015-06-30 18:16:30 +02:00 |
|
Matthew Honnibal
|
5d595b5a8c
|
* Inc versions
|
2015-06-30 18:11:06 +02:00 |
|
Matthew Honnibal
|
d2eeba6667
|
* Start wiring up color and emotion lexicons. Hopefully we get to use them.
|
2015-06-30 16:22:23 +02:00 |
|
Matthew Honnibal
|
b266a63f2c
|
* Inc version of downloadable data
|
2015-06-24 04:53:08 +02:00 |
|
Matthew Honnibal
|
7d265a9c62
|
* Revert to wget in spacy.en.download
|
2015-06-08 00:48:56 +02:00 |
|
Matthew Honnibal
|
1515862861
|
* Fix download.py
|
2015-06-08 00:08:05 +02:00 |
|
Matthew Honnibal
|
7e9e8f654a
|
* Use urllib in spacy.en.download
|
2015-06-07 23:51:38 +02:00 |
|
Matthew Honnibal
|
80cff41a9c
|
* Upd download.py
|
2015-06-07 19:13:28 +02:00 |
|
Matthew Honnibal
|
58d5ac0944
|
* Add beam search capabilities to Parser. Rename GreedyParser to Parser.
|
2015-06-02 00:28:02 +02:00 |
|
Matthew Honnibal
|
62424e6c76
|
* Remove unused regularize argument from _ml.Model
|
2015-06-02 00:27:07 +02:00 |
|
Matthew Honnibal
|
04bda8648d
|
* Pass parameter for regularization to model
|
2015-05-27 03:16:58 +02:00 |
|
Matthew Honnibal
|
eba7b34f66
|
* Add flag to disable loading of word vectors
|
2015-05-25 01:02:42 +02:00 |
|
Matthew Honnibal
|
03ebf70a66
|
* Inc version to 0.84
|
2015-05-12 02:38:51 +02:00 |
|
Matthew Honnibal
|
fb8d50b3d5
|
Merge branch 'master' of ssh://github.com/honnibal/spaCy
|
2015-04-30 12:45:15 +02:00 |
|
Matthew Honnibal
|
378c2a6435
|
* Fix POS model: make it use tag instead of pos in history features
|
2015-04-29 00:02:53 +02:00 |
|
Jordan Suchow
|
3a8d9b37a6
|
Remove trailing whitespace
|
2015-04-19 13:01:38 -07:00 |
|
Matthew Honnibal
|
cc4e395927
|
* Add some ad hoc regexes, for multi-word location prepositions
|
2015-04-17 04:44:24 +02:00 |
|
Matthew Honnibal
|
684d0e5e85
|
* Download updated data
|
2015-04-16 04:29:15 +02:00 |
|
Matthew Honnibal
|
42617548af
|
* Disable merge_mwes by default
|
2015-04-16 04:20:31 +02:00 |
|
Matthew Honnibal
|
77d0700caf
|
* Add on X way regexes
|
2015-04-16 01:35:46 +02:00 |
|
Matthew Honnibal
|
c6707778dd
|
* Fix Issue #51: Handle non-ascii lemmas correctly
|
2015-04-13 22:28:59 +02:00 |
|
Matthew Honnibal
|
761a19113a
|
* Fix /tmp moving thing in download.py
|
2015-04-12 07:04:10 +02:00 |
|
Matthew Honnibal
|
b64b2bd910
|
* Fix Issue #43: TAG attr not supported. Also add DEP attr, while I'm at it. Need better way of ensuring future changes don't break in similar way.
|
2015-04-07 06:00:30 +02:00 |
|
Matthew Honnibal
|
b8d34531c4
|
* Add support for units to English.__init__, by loading and applying regular expressions
|
2015-04-07 04:02:32 +02:00 |
|
Matthew Honnibal
|
2fee67cfa3
|
* Add regular expressions for English multi-word expressions
|
2015-04-07 03:45:18 +02:00 |
|
Matthew Honnibal
|
567388e38d
|
* Use values encoded by StringStore in POS tagging, rather than indices into a list of tags
|
2015-03-26 16:44:45 +01:00 |
|
Matthew Honnibal
|
801bf14f4f
|
* Clean up handling of dep_strings and ent_strings, using StringStore to encode the label names.
|
2015-03-26 16:44:45 +01:00 |
|
Matthew Honnibal
|
f21ab2d7fb
|
* Fix bug in ugly ent_strings hack on English class
|
2015-03-26 16:44:45 +01:00 |
|
Matthew Honnibal
|
8057a95f20
|
* NER seems to be working, scoring 69 F. Need to add decision-history features --- currently only use current word, 2 words context. Need refactoring.
|
2015-03-26 16:44:44 +01:00 |
|
Matthew Honnibal
|
220ce8bfed
|
* Prepare English class for NER
|
2015-03-26 16:44:44 +01:00 |
|
Matthew Honnibal
|
179b7eb0a7
|
* Specify parser transition system in language
|
2015-03-26 16:44:43 +01:00 |
|
Matthew Honnibal
|
8cc3524dc9
|
* Ws
|
2015-03-26 16:44:41 +01:00 |
|
Matthew Honnibal
|
2e8d0e5d45
|
* Upd download script
|
2015-03-03 05:47:16 -05:00 |
|
Matthew Honnibal
|
caf046b220
|
* Hastily add method to apply tags from a list of strings, instead of predicting the tags.
|
2015-02-23 15:40:17 -05:00 |
|
Matthew Honnibal
|
64645a1c2f
|
* Improve docstring on English
|
2015-02-11 15:13:20 -05:00 |
|
Matthew Honnibal
|
594e50bd45
|
* Add option to download speech-parsing data set.
|
2015-02-11 14:20:29 -05:00 |
|
Matthew Honnibal
|
0b7e769211
|
* Add POS tags to support SWBD tag set
|
2015-02-11 14:08:28 -05:00 |
|
Matthew Honnibal
|
312b3a45f3
|
* Fix issue #19: Allow parsing/pos tagging of empty strings
|
2015-02-10 10:15:58 -05:00 |
|
Matthew Honnibal
|
2a0615104b
|
* Upd download script
|
2015-02-09 10:22:59 -05:00 |
|
Matthew Honnibal
|
5c3513583d
|
* Clear buffered python tokens when modifying the Tokens object. Need to clean this up, and modify via a method on Tokens.
|
2015-02-09 03:57:10 -05:00 |
|
Matthew Honnibal
|
be5536d239
|
* Fix Issue #22: PRP and PRP$ were mapped to NOUN. Should be PRON.
|
2015-02-08 18:36:18 -05:00 |
|
Matthew Honnibal
|
44c7eafe44
|
* Fix download.py
|
2015-02-07 12:00:36 -05:00 |
|
Matthew Honnibal
|
6ca7f2eedc
|
* Upd download script
|
2015-02-07 11:32:33 -05:00 |
|
Matthew Honnibal
|
56c2ef2982
|
* Tweak POS features for web text
|
2015-02-02 11:59:36 +11:00 |
|
Matthew Honnibal
|
a20fdbd8ee
|
* Upd download script
|
2015-02-01 13:22:23 +11:00 |
|
Matthew Honnibal
|
63abdf154c
|
* Hastily hack download file
|
2015-01-31 22:48:32 +11:00 |
|
Matthew Honnibal
|
a1ed574b7b
|
* Fix default model path for English
|
2015-01-31 16:38:27 +11:00 |
|
Matthew Honnibal
|
e013555b25
|
* Add option to download script
|
2015-01-31 13:51:56 +11:00 |
|
Matthew Honnibal
|
024cfd485c
|
* Pass tag_strings as a tuple, to support new Tokens API
|
2015-01-31 13:43:37 +11:00 |
|
Matthew Honnibal
|
83a4df5a1a
|
* Fix download script
|
2015-01-30 20:40:42 +11:00 |
|
Matthew Honnibal
|
6f9ebc2f34
|
* Fix download script
|
2015-01-30 20:33:19 +11:00 |
|
Matthew Honnibal
|
8b85d0bb8a
|
* Only download small data if no data dir exists
|
2015-01-30 20:27:14 +11:00 |
|
Matthew Honnibal
|
cb95ef6934
|
* Fix download script
|
2015-01-30 19:28:43 +11:00 |
|
Matthew Honnibal
|
e578bd37bd
|
* Fix download script
|
2015-01-30 18:59:31 +11:00 |
|
Matthew Honnibal
|
df52014d12
|
* Fix download script
|
2015-01-30 18:36:24 +11:00 |
|
Matthew Honnibal
|
998b607f65
|
* Upd download script, having it download all data if there's no data/ directory, allowing easier compilation from source
|
2015-01-30 18:04:01 +11:00 |
|
Matthew Honnibal
|
67d6e53a69
|
* Ensure parser and tagger function correctly when training from missing values, indicated by -1
|
2015-01-30 14:08:56 +11:00 |
|
Matthew Honnibal
|
c38c62d4a3
|
* Add docstring to English class
|
2015-01-27 02:45:21 +11:00 |
|
Matthew Honnibal
|
7f87716cf7
|
* Fix download script
|
2015-01-25 23:01:10 +11:00 |
|
Matthew Honnibal
|
12b034e3ef
|
* Move POS tag definitions to parts_of_speech.pxd
|
2015-01-25 16:31:07 +11:00 |
|
Matthew Honnibal
|
7431c133d8
|
* Add error if try to access head and not is_parsed
|
2015-01-25 15:33:54 +11:00 |
|
Matthew Honnibal
|
951d06c824
|
* Silently don't parse if data is not present
|
2015-01-25 14:47:38 +11:00 |
|
Matthew Honnibal
|
4e857ab7a6
|
* Fix bug in POS tagger feature
|
2015-01-25 02:20:15 +11:00 |
|
Matthew Honnibal
|
dd56e298e2
|
* Ensure tagging is applied if parse=True
|
2015-01-25 02:19:44 +11:00 |
|
Matthew Honnibal
|
94750819cd
|
* Set parse=True by default --- i.e. parse unless told not to.
|
2015-01-25 01:28:28 +11:00 |
|
Matthew Honnibal
|
a97bed9359
|
* Fix POS and dependency label tag names. Add parse and string navigation functions.
|
2015-01-24 17:29:04 +11:00 |
|
Matthew Honnibal
|
fda94271af
|
* Rename NORM1 and NORM2 attrs to lower and norm
|
2015-01-24 06:17:03 +11:00 |
|
Matthew Honnibal
|
5ed8b2b98f
|
* Rename sic to orth
|
2015-01-23 02:08:25 +11:00 |
|
Matthew Honnibal
|
f2a229136c
|
* Fix data_dir=None argument to English class
|
2015-01-21 18:27:31 +11:00 |
|
Matthew Honnibal
|
ef49b8c179
|
* Add stop-word flag
|
2015-01-21 18:22:31 +11:00 |
|
Matthew Honnibal
|
6646bfc5df
|
* Add LOWER attr
|
2015-01-21 18:19:08 +11:00 |
|
Matthew Honnibal
|
6c7e44140b
|
* Work on word vectors, and other stuff
|
2015-01-17 16:21:17 +11:00 |
|
Matthew Honnibal
|
7d3c40de7d
|
* Tests passing after refactor. API has obvious warts, particularly in Token and Lexeme
|
2015-01-15 00:33:16 +11:00 |
|
Matthew Honnibal
|
0930892fc1
|
* Tmp. Working on refactor. Compiles, must hook up lexical feats.
|
2015-01-14 00:03:48 +11:00 |
|
Matthew Honnibal
|
46da3d74d2
|
* Tmp. Refactoring, introducing a Lexeme PyObject.
|
2015-01-12 11:23:44 +11:00 |
|
Matthew Honnibal
|
ce2edd6312
|
* Tmp commit. Refactoring to create a Python Lexeme class.
|
2015-01-12 10:26:22 +11:00 |
|
Matthew Honnibal
|
7689dccd0f
|
* Remove unused import
|
2015-01-05 18:48:48 +11:00 |
|
Matthew Honnibal
|
3f1944d688
|
* Make PyPy work
|
2015-01-05 17:54:38 +11:00 |
|
Matthew Honnibal
|
a510d9f677
|
* Another assertion removed
|
2015-01-05 13:01:40 +11:00 |
|
Matthew Honnibal
|
2856946a66
|
* Remove assertion that doesn't work on Python 3
|
2015-01-05 12:51:16 +11:00 |
|
Matthew Honnibal
|
94034f1112
|
* Fix encoding in lemmatization
|
2015-01-05 11:54:29 +11:00 |
|
Matthew Honnibal
|
b132b3caa6
|
* Fix unicode error in lemmatizer
|
2015-01-05 11:53:54 +11:00 |
|
Matthew Honnibal
|
477e7fbffe
|
* Fix data reading for lemmatizer
|
2015-01-05 06:01:32 +11:00 |
|
Matthew Honnibal
|
4e085d5166
|
* Fix lemmatizer for Python3
|
2015-01-05 05:51:26 +11:00 |
|
Matthew Honnibal
|
0e4c2ba036
|
* Fix loading of special morph words
|
2015-01-03 23:13:00 +11:00 |
|
Matthew Honnibal
|
f5d41028b5
|
* Move around data files for test release
|
2015-01-03 01:59:22 +11:00 |
|
Matthew Honnibal
|
a24321b63a
|
* Add downloader
|
2015-01-02 21:44:41 +11:00 |
|
Matthew Honnibal
|
5d9a096e2f
|
* Some minor clean-up after HastyModel
|
2014-12-31 19:46:04 +11:00 |
|
Matthew Honnibal
|
aafaf58cbe
|
* Refactor _ml.Model, and finish implementing HastyModel. So far not worthwhile.
|
2014-12-31 19:40:59 +11:00 |
|
Matthew Honnibal
|
1a075f77ff
|
* Don't over-ride pre-loaded POS tags, if set by special-cases
|
2014-12-30 23:26:32 +11:00 |
|
Matthew Honnibal
|
785c7ba76a
|
* Embed signature on attrs
|
2014-12-30 23:25:31 +11:00 |
|
Matthew Honnibal
|
30e5805656
|
* Lazy-load tagger and parser
|
2014-12-30 23:25:09 +11:00 |
|
Matthew Honnibal
|
bb0b00f819
|
* Repurpose the Tagger class as a generic Model, wrapping thinc's interface
|
2014-12-30 21:20:15 +11:00 |
|
Matthew Honnibal
|
bb80937544
|
* Upd docstrings
|
2014-12-27 18:45:16 +11:00 |
|
Matthew Honnibal
|
b8b65903fc
|
* Tmp
|
2014-12-24 17:42:00 +11:00 |
|
Matthew Honnibal
|
7708d0e24a
|
* Move lemmatizer to en dir
|
2014-12-23 15:16:57 +11:00 |
|
Matthew Honnibal
|
98eb4c0426
|
* Fix path to parser model
|
2014-12-23 15:09:09 +11:00 |
|
Matthew Honnibal
|
b00bc01d8c
|
* All tests now passing for reorg
|
2014-12-23 13:18:59 +11:00 |
|