ines
e1efd589c3
Fix json imports and use ujson
2017-04-15 12:13:34 +02:00
ines
c05ec4b89a
Add compat functions and remove old workarounds
...
Add ensure_path util function to handle checking instance of path
2017-04-15 12:11:16 +02:00
ines
d24589aa72
Clean up imports, unused code, whitespace, docstrings
2017-04-15 12:05:47 +02:00
ines
561f2a3eb4
Use consistent formatting for docstrings
2017-04-15 11:59:21 +02:00
Matthew Honnibal
d013aba7b5
Merge branch 'master' of https://github.com/explosion/spaCy
2017-03-17 18:30:53 +01:00
Matthew Honnibal
854cfce7cf
Make vocabs more compatible across versions
...
Previously, symbols were inserted into the string-store
before strings were loaded. This meant that adding a symbol
would invalidate saved models. We now make sure that strings
are loaded faithfully, so that compatibility is maintained.
2017-03-17 18:29:04 +01:00
Matthew Honnibal
1cc841e600
Merge branch 'master' of https://github.com/explosion/spaCy
2017-03-17 08:18:11 -05:00
Matthew Honnibal
4bfc55b532
Auto-add words to vocab when loading vectors
...
When calling vocab.load_vectors_from_bin_loc, ensure that missing
entries are added to the vocab. Otherwise, loading vectors into an
empty vocab object resulted in no vectors being added.
2017-03-17 08:15:59 -05:00
Matthew Honnibal
4382f175b3
Squelch compiler warnings
2017-03-11 12:44:43 -06:00
Matthew Honnibal
d814892805
Hackish pickle support for Vocab.
2017-03-07 20:25:12 +01:00
ines
aa92d4e9b5
Fix unicode regex for Python 2 (see #834 )
2017-02-16 23:49:54 +01:00
ines
85d249d451
Revert "Revert "Merge pull request #836 from raphael0202/load_vectors ( closes #834 )""
...
This reverts commit ea05f78660
.
2017-02-16 23:26:25 +01:00
ines
ea05f78660
Revert "Merge pull request #836 from raphael0202/load_vectors ( closes #834 )"
...
This reverts commit 7d8c9eee7f
, reversing
changes made to f6b69babcc
.
2017-02-16 15:27:12 +01:00
Raphaël Bournhonesque
e17dc2db75
Remove useless import
2017-02-16 12:10:24 +01:00
Raphaël Bournhonesque
3fd2742649
load_vectors should accept arbitrary space characters as word tokens
...
Fix bug #834
2017-02-16 12:08:30 +01:00
Daniel Hershcovich
99eb494a82
Fix #737 : support loading word vectors with " " as a word
2017-01-12 17:00:14 +02:00
Daniel Hershcovich
8e603cc917
Avoid "True if ... else False"
2017-01-11 11:18:22 +02:00
Matthew Honnibal
cade536d1e
Merge branch 'master' of ssh://github.com/explosion/spaCy
2016-12-27 21:04:10 +01:00
Matthew Honnibal
ce4539dafd
Allow the vocabulary to grow to 10,000, to prevent cold-start problem.
2016-12-27 21:03:45 +01:00
Ines Montani
8978806ea6
Allow Vocab to load without serializer_freqs
2016-12-21 18:05:23 +01:00
Ines Montani
be8ed811f6
Remove trailing whitespace
2016-12-21 18:04:41 +01:00
Matthew Honnibal
6ee1df93c5
Set tag_map to None if it's not seen in the data by vocab
2016-12-18 16:51:10 +01:00
Matthew Honnibal
1e0f566d95
Fix #656 , #624 : Support arbitrary token attributes when adding special-case rules.
2016-11-25 12:43:24 +01:00
Matthew Honnibal
f123f92e0c
Fix #617 : Vocab.load() required Path. Should work with string as well.
2016-11-10 22:48:48 +01:00
Matthew Honnibal
b86f8af0c1
Fix doc strings
2016-11-01 12:25:36 +01:00
Matthew Honnibal
6036ec7c77
Fix vector norm when loading lexemes.
2016-10-23 19:40:18 +02:00
Matthew Honnibal
3e688e6d4b
Fix issue #514 -- serializer fails when new entity type has been added. The fix here is quite ugly. It's best to add the entities ASAP after loading the NLP pipeline, to mitigate the brittleness.
2016-10-23 17:45:44 +02:00
Matthew Honnibal
f62088d646
Fix compile error
2016-10-23 14:50:50 +02:00
Matthew Honnibal
a0a4ada42a
Fix calculation of L2-norm for Lexeme
2016-10-23 14:44:45 +02:00
Matthew Honnibal
7ab03050d4
Add resize_vectors method to Vocab
2016-10-21 01:44:50 +02:00
Matthew Honnibal
f5fe4f595b
Fix json loading, for Python 3.
2016-10-20 21:23:26 +02:00
Matthew Honnibal
5ec32f5d97
Fix loading of GloVe vectors, to address Issue #541
2016-10-20 18:27:48 +02:00
Matthew Honnibal
d10c17f2a4
Fix Issue #536 : oov_prob was 0 for OOV words.
2016-10-19 23:38:47 +02:00
Matthew Honnibal
2bbb050500
Fix default of serializer_freqs
2016-10-18 19:55:41 +02:00
Matthew Honnibal
2cc515b2ed
Add add_flag method to Vocab, re Issue #504 .
2016-10-14 12:15:38 +02:00
Matthew Honnibal
ea23b64cc8
Refactor training, with new spacy.train module. Defaults still a little awkward.
2016-10-09 12:24:24 +02:00
Matthew Honnibal
ca32a1ab01
Revert "Work on Issue #285 : intern strings into document-specific pools, to address streaming data memory growth. StringStore.__getitem__ now raises KeyError when it can't find the string. Use StringStore.intern() to get the old behaviour. Still need to hunt down all uses of StringStore.__getitem__ in library and do testing, but logic looks good."
...
This reverts commit 8423e8627f
.
2016-09-30 20:20:22 +02:00
Matthew Honnibal
1f1cd5013f
Revert "Changes to vocab for new stringstore scheme"
...
This reverts commit a51149a717
.
2016-09-30 20:10:30 +02:00
Matthew Honnibal
a51149a717
Changes to vocab for new stringstore scheme
2016-09-30 20:01:19 +02:00
Matthew Honnibal
8423e8627f
Work on Issue #285 : intern strings into document-specific pools, to address streaming data memory growth. StringStore.__getitem__ now raises KeyError when it can't find the string. Use StringStore.intern() to get the old behaviour. Still need to hunt down all uses of StringStore.__getitem__ in library and do testing, but logic looks good.
2016-09-30 10:14:47 +02:00
Matthew Honnibal
95aaea0d3f
Refactor so that the tokenizer data is read from Python data, rather than from disk
2016-09-25 14:49:53 +02:00
Matthew Honnibal
df88690177
Fix encoding of path variable
2016-09-24 21:13:15 +02:00
Matthew Honnibal
af847e07fc
Fix usage of pathlib for Python3 -- turning paths to strings.
2016-09-24 21:05:27 +02:00
Matthew Honnibal
453683aaf0
Fix spacy/vocab.pyx
2016-09-24 20:50:31 +02:00
Matthew Honnibal
fd65cf6cbb
Finish refactoring data loading
2016-09-24 20:26:17 +02:00
Matthew Honnibal
83e364188c
Mostly finished loading refactoring. Design is in place, but doesn't work yet.
2016-09-24 15:42:01 +02:00
Matthew Honnibal
872695759d
Merge pull request #306 from wbwseeker/german_noun_chunks
...
add German noun chunk functionality
2016-04-08 00:54:24 +10:00
Henning Peters
b8f63071eb
add lang registration facility
2016-03-25 18:54:45 +01:00
Wolfgang Seeker
5e2e8e951a
add baseclass DocIterator for iterators over documents
...
add classes for English and German noun chunks
the respective iterators are set for the document when created by the parser
as they depend on the annotation scheme of the parsing model
2016-03-16 15:53:35 +01:00
Wolfgang Seeker
03fb498dbe
introduce lang field for LexemeC to hold language id
...
put noun_chunk logic into iterators.py for each language separately
2016-03-10 13:01:34 +01:00