Matthew Honnibal
191d593e03
* Fix vectors bug in lexeme
2015-09-15 19:05:11 +10:00
Matthew Honnibal
3d87519f64
* Remove vectors argument from Vocab object
2015-09-15 14:47:14 +10:00
Matthew Honnibal
362526b592
* Rename vectors_length attribute
2015-09-15 14:43:31 +10:00
Matthew Honnibal
60c26b2dfa
* Fix slicing when start or stop is None
2015-09-15 14:43:10 +10:00
Matthew Honnibal
7ac6cacc26
* Remove const qualifier on LexemeC.repvec
2015-09-15 14:42:51 +10:00
Matthew Honnibal
dd4d64b235
* Support setting of word vectors on Lexeme object.
2015-09-15 14:42:27 +10:00
Matthew Honnibal
27f988b167
* Remove the vectors option to Vocab, preferring to either load vectors from disk, or set them on the Lexeme objects.
2015-09-15 14:41:48 +10:00
Matthew Honnibal
193f127f81
* Fix ugly py_check_flag and py_set_flag functions in Lexeme
2015-09-15 13:06:18 +10:00
Matthew Honnibal
9561d88529
* Add is_stop to Python API
2015-09-14 18:25:40 +10:00
Matthew Honnibal
65dc0d1dfb
* Extend word vectors support, with .similarity() function, vector_norm property, and rename repvec to vector. Keep repvec name as well for now for backwards compatibility.
2015-09-14 17:49:58 +10:00
Matthew Honnibal
e13e47e9e5
* Add English stop words
2015-09-14 17:48:51 +10:00
Matthew Honnibal
24ed3fc25c
* Check file existence before opening in lemmatizer
2015-09-13 10:45:21 +10:00
Matthew Honnibal
dbb48ce49e
* Delete extra wordnets
2015-09-13 10:31:37 +10:00
Matthew Honnibal
e9c59693ea
* Remove assertion from vocab.pyx
2015-09-13 10:30:08 +10:00
Matthew Honnibal
c08f10083c
* Add text and text_with_ws attributes.
2015-09-13 10:27:42 +10:00
Matthew Honnibal
0b7d2a6c62
* Inc version
2015-09-13 01:26:29 +02:00
Matthew Honnibal
e1dfaeed8a
* Check serializer freqs exist before loading
2015-09-12 23:49:38 +02:00
Matthew Honnibal
a412c66c8c
* Check serializer freqs exist before loading
2015-09-12 23:40:01 +02:00
Matthew Honnibal
631c843ed1
* Don't look for index.adv in lemmatizer
2015-09-12 06:03:44 +02:00
Matthew Honnibal
dfdd4f2d60
Merge branch 'develop' of https://github.com/honnibal/spaCy into develop
2015-09-10 15:23:06 +02:00
Matthew Honnibal
e285ca7d6c
* Load serializer freqs in vocab
2015-09-10 15:22:48 +02:00
Matthew Honnibal
f7fdcce1f9
Merge branch 'develop' of https://github.com/honnibal/spaCy into develop
2015-09-10 14:52:47 +02:00
Matthew Honnibal
85c3fec1d1
* Fix morphology loading
2015-09-10 14:52:23 +02:00
Matthew Honnibal
7c660c5efc
* Use dict.get in lemmatizer
2015-09-10 14:51:39 +02:00
Matthew Honnibal
094440f9f5
Merge branch 'develop' of ssh://github.com/honnibal/spaCy into develop
2015-09-10 14:51:17 +02:00
Matthew Honnibal
c3f773cd63
* Fix Lexeme.check_flag
2015-09-10 14:51:05 +02:00
Matthew Honnibal
90da3a695d
* Load lemmatizer from disk in Vocab.from_dir
2015-09-10 14:49:10 +02:00
Matthew Honnibal
e7e529edf4
* Fix Lexeme.check_flag
2015-09-10 14:45:43 +02:00
Matthew Honnibal
9e7bfe8449
* Fix space at end of merged token
2015-09-10 14:45:17 +02:00
Matthew Honnibal
f634191e27
* Fix vocab read/write
2015-09-10 14:44:38 +02:00
Matthew Honnibal
31ccf494e6
Merge branch 'develop' of https://github.com/honnibal/spaCy into develop
2015-09-09 14:33:38 +02:00
Matthew Honnibal
a7f4b26c8c
* Tmp
2015-09-09 14:33:26 +02:00
Matthew Honnibal
07686470a9
* Don't consider a coordinated NP a base chunk
2015-09-09 14:32:28 +02:00
Matthew Honnibal
d9f1fc2112
* Add deprecation warning for unused load_vectors argument.
2015-09-09 14:31:09 +02:00
Matthew Honnibal
0b527fbdc8
* Set POS tag in morphology
2015-09-09 14:30:24 +02:00
Matthew Honnibal
07c09a0e1b
* Fix attribute getters and setters in Lexeme
2015-09-09 14:29:22 +02:00
Matthew Honnibal
d6561988cf
* Fix lexemes.bin
2015-09-09 11:49:51 +02:00
Matthew Honnibal
c301bebd33
Merge branch 'master' of https://github.com/honnibal/spaCy into develop
2015-09-09 10:55:39 +02:00
Matthew Honnibal
0e24d099a1
* Fix L/R edge bug, by ensuring l_edge and r_edge are preset, and fixing the way the edges are updated in del_arc. Bugs keep arising here because the edges are absolute positions, where everything else is relative. I'm also not 100% convinced that del_arc is handled correctly. Do we need to update the parents?
2015-09-09 03:40:44 +02:00
Matthew Honnibal
2be3620333
* Save morphological analyses in a cache
2015-09-08 15:39:24 +02:00
Matthew Honnibal
1def5a6cbe
* Fix print statements in matcher
2015-09-08 15:38:19 +02:00
Matthew Honnibal
64d71f8893
* Fix lemmatizer
2015-09-08 15:38:03 +02:00
Matthew Honnibal
623329b19a
Merge branch 'master' of ssh://github.com/honnibal/spaCy into develop
2015-09-08 14:27:01 +02:00
Matthew Honnibal
62a01dd41d
* Fix issue #92 : lexemes.bin read error on 32-bit platforms.
2015-09-08 14:23:58 +02:00
Matthew Honnibal
ef58607a99
* Add spacy.it
2015-09-06 22:10:37 +02:00
Matthew Honnibal
2154a54f6b
* Add spacy.de
2015-09-06 21:56:47 +02:00
Matthew Honnibal
f6ec5bf1b0
* Use empty tag map in vocab if none supplied
2015-09-06 20:19:27 +02:00
Matthew Honnibal
4f8e38271d
* Fix merge errors in lexeme.pxd
2015-09-06 20:19:08 +02:00
Matthew Honnibal
86c888667f
* Merge in changes from de branch
2015-09-06 19:49:28 +02:00
Matthew Honnibal
d2fc104a26
* Begin merge of Gazetteer and DE branches
2015-09-06 19:45:15 +02:00
Matthew Honnibal
dbf8dce109
Merge branch 'gaz' of ssh://github.com/honnibal/spaCy into gaz
2015-09-06 18:44:14 +02:00
Matthew Honnibal
9eae9837c4
* Fix morphology look up
2015-09-06 17:53:39 +02:00
Matthew Honnibal
6427a3fcac
* Temporarily import flag attributes in matcher
2015-09-06 17:53:12 +02:00
Matthew Honnibal
7cc56ada6e
* Temporarily add py_set_flag attribute in Lexeme
2015-09-06 17:52:51 +02:00
Matthew Honnibal
e35bb36be7
* Ensure Lexeme.check_flag returns a boolean value
2015-09-06 17:52:32 +02:00
Matthew Honnibal
7e4fea67d3
* Fix bug in token subtree, introduced by duplication of L/R code in Stateclass. Need to consolidate the two methods.
2015-09-06 10:48:36 +02:00
Matthew Honnibal
5edac11225
* Wrap self.parse in nogil, and break if an invalid move is predicted. The invalid break is a work-around that papers over likely bugs, but we can't easily break in the nogil block, and otherwise we'll get an infinite loop. Need to set this as an error flag.
2015-09-06 04:15:00 +02:00
Matthew Honnibal
fd1eeb3102
* Add POS attribute support in get_attr
2015-09-06 04:13:03 +02:00
Matthew Honnibal
534e3dda3c
* More work on language independent parsing
2015-08-28 03:44:54 +02:00
Matthew Honnibal
c2307fa9ee
* More work on language-generic parsing
2015-08-28 02:02:33 +02:00
Matthew Honnibal
86c4a8e3e2
* Work on new morphology organization
2015-08-27 23:11:51 +02:00
Matthew Honnibal
5b89e2454c
* Improve error-reporting in tagger
2015-08-27 10:26:36 +02:00
Matthew Honnibal
f0a7c99554
* Relax rule-requirement in lemmatizer
2015-08-27 10:26:19 +02:00
Matthew Honnibal
0af139e183
* Tagger training now working. Still need to test load/save of model. Morphology still broken.
2015-08-27 09:16:11 +02:00
Matthew Honnibal
1302d35dff
* Rework interfaces in vocab
2015-08-26 19:21:46 +02:00
Matthew Honnibal
2d521768a3
* Store Morphology class in Vocab
2015-08-26 19:21:03 +02:00
Matthew Honnibal
d30029979e
* Avoid import of morphology in spans
2015-08-26 19:20:46 +02:00
Matthew Honnibal
119c0f8c3f
* Hack out morphology stuff from tokenizer, while morphology is being reimplemented.
2015-08-26 19:20:11 +02:00
Matthew Honnibal
b4faf551f5
* Refactor language-independent tagger class
2015-08-26 19:19:21 +02:00
Matthew Honnibal
a3d5e6c0dd
* Reform constructor and save/load workflow in parser model
2015-08-26 19:19:01 +02:00
Matthew Honnibal
1d7f2d3abc
* Hack on morphology structs
2015-08-26 19:18:36 +02:00
Matthew Honnibal
f8f2f4e545
* Temporarily add PUNC name to parts_of_speech dictionary, until there is a better solution
2015-08-26 19:18:19 +02:00
Matthew Honnibal
008b02b035
* Comment out enums in Morphology for now
2015-08-26 19:17:35 +02:00
Matthew Honnibal
378729f81a
* Hack Morphology class towards usability
2015-08-26 19:17:21 +02:00
Matthew Honnibal
430affc347
* Fix missing n_patterns property in Matcher class. Fix from_dir method
2015-08-26 19:17:02 +02:00
Matthew Honnibal
3acf60df06
* Add missing properties in Lexeme class
2015-08-26 19:16:28 +02:00
Matthew Honnibal
76996f4145
* Hack on generic Language class. Still needs work for morphology, defaults, etc
2015-08-26 19:16:09 +02:00
Matthew Honnibal
e2ef78b29c
* Gut pos.pyx module, since functionality moved to spacy/tagger.pyx
2015-08-26 19:15:42 +02:00
Matthew Honnibal
c4d8754385
* Specify LOCAL_DATA_DIR global in spacy.en.__init__.py
2015-08-26 19:15:07 +02:00
Matthew Honnibal
c2d8edd0bd
* Add PROB attribute in attrs.pxd
2015-08-26 19:14:19 +02:00
Matthew Honnibal
c5a27d1821
* Move lemmatizer to spacy
2015-08-25 15:47:08 +02:00
Matthew Honnibal
82217c6ec6
* Generalize lemmatizer
2015-08-25 15:46:19 +02:00
Matthew Honnibal
8083a07c3e
* Use language base class
2015-08-25 15:37:30 +02:00
Matthew Honnibal
f2f699ac18
* Add language base class
2015-08-25 15:37:17 +02:00
Matthew Honnibal
5dd76be446
* Split EnPosTagger up into base class and subclass
2015-08-24 05:25:55 +02:00
Matthew Honnibal
5d5922dbfa
* Begin laying out morphological features
2015-08-24 01:04:30 +02:00
Matthew Honnibal
6f1743692a
* Work on language-independent refactoring
2015-08-23 20:49:18 +02:00
Matthew Honnibal
3879d28457
* Fix https for url detection
2015-08-23 02:40:35 +02:00
Matthew Honnibal
cad0cca4e3
* Tmp
2015-08-22 22:04:34 +02:00
Matthew Honnibal
bf38b3b883
* Hack on l/r reversal bug
2015-08-10 05:58:43 +02:00
Matthew Honnibal
6116413b47
* Fix label prediction in StepwiseState
2015-08-10 05:05:31 +02:00
Matthew Honnibal
2c9753eff2
* Whitespace
2015-08-10 00:09:02 +02:00
Matthew Honnibal
9de98f5a6f
* Add Parser.stepthrough method, with context manager
2015-08-10 00:08:46 +02:00
Matthew Honnibal
fe43f8cf39
* Whitespace
2015-08-09 02:31:53 +02:00
Matthew Honnibal
9c090945e0
* Add Parser.predict method, and clean up Parser.get_state
2015-08-09 02:29:58 +02:00
Matthew Honnibal
04fccfb984
* Fix get_state for parser prediction
2015-08-09 02:11:22 +02:00
Matthew Honnibal
55fde0e240
* Fix get_state
2015-08-09 01:45:30 +02:00
Matthew Honnibal
f0f4fa9838
* Fix Parser.get_state
2015-08-09 01:40:13 +02:00
Matthew Honnibal
18331dca89
* Add continue_for argument to parser 'partial' function, which is now renamed to get_state
2015-08-09 01:31:54 +02:00
Matthew Honnibal
0653288fa5
* Fix stateclass.queue
2015-08-09 00:39:02 +02:00
Matthew Honnibal
9de218b7ba
* Fix Parser.partial function
2015-08-08 23:45:18 +02:00
Matthew Honnibal
01be34d55a
* Whitespace
2015-08-08 23:37:44 +02:00
Matthew Honnibal
cc9deae960
* Add is_valid method to transition_system
2015-08-08 23:36:18 +02:00
Matthew Honnibal
2a46c77324
* Whitespace
2015-08-08 23:35:59 +02:00
Matthew Honnibal
7bafc789e7
* Add stack and queue properties to stateclass, for python access
2015-08-08 23:32:42 +02:00
Matthew Honnibal
3af938365f
* Add function partial to Parser
2015-08-08 23:32:15 +02:00
Matthew Honnibal
76a1f0481a
* Whitespace
2015-08-08 23:31:54 +02:00
Matthew Honnibal
b0f5c39084
* Fix handling of exclusion entities
2015-08-06 17:28:43 +02:00
Matthew Honnibal
9f65879991
* Fix shape attr bug, and fix handling of false positive matches
2015-08-06 17:28:14 +02:00
Matthew Honnibal
10d869d102
* Don't allow conjunction between NPs in base NP chunks
2015-08-06 16:31:53 +02:00
Matthew Honnibal
383dfabd67
* Fix matcher setting of entities
2015-08-06 16:27:01 +02:00
Matthew Honnibal
59c3bf60a6
* Ensure entity recognizer doesn't over-write preset types
2015-08-06 16:09:08 +02:00
Matthew Honnibal
cd7d1682cd
* Fix loading of gazetteer.json file
2015-08-06 16:08:25 +02:00
Matthew Honnibal
9c667b7f15
* Set a value in attrs.pxd on the first flag, to reduce bugs
2015-08-06 16:08:04 +02:00
Matthew Honnibal
c263577424
* Fix lower attribute in lexeme.pxd
2015-08-06 16:07:41 +02:00
Matthew Honnibal
5737115e1e
* Work on gazetteer matching
2015-08-06 14:33:21 +02:00
Matthew Honnibal
9c1724ecae
* Gazetteer stuff working, now need to wire up to API
2015-08-06 00:35:40 +02:00
Matthew Honnibal
5bc0e83f9a
* Reimplement matching in Cython, instead of Python.
2015-08-05 01:05:54 +02:00
Matthew Honnibal
4c87a696b3
* Add draft dfa matcher, in Python. Passing tests.
2015-08-04 15:55:28 +02:00
Matthew Honnibal
eb7138c761
* Add attr relation in base NP detection
2015-08-01 00:34:40 +02:00
Matthew Honnibal
4988356cf0
* Fix dependency type bug from merged tokens
2015-08-01 00:33:24 +02:00
Matthew Honnibal
78a9068319
* Fix spacy attr on merged tokens
2015-07-30 04:25:58 +02:00
Matthew Honnibal
430e2edb96
* Fix noun_chunks issue
2015-07-30 03:51:50 +02:00
Matthew Honnibal
9590968fc1
* Fix negative indices in Span
2015-07-30 02:30:24 +02:00
Matthew Honnibal
74d8cb3980
* Add noun_chunks iterator, and fix left/right child setting in Doc.merge
2015-07-30 02:29:49 +02:00
Matthew Honnibal
d153f18969
* Fix negative indices on spans
2015-07-29 22:36:03 +02:00
Matthew Honnibal
b5132bed7d
* Set left and right children when loading parse from byte string
2015-07-28 21:03:18 +02:00
Matthew Honnibal
6609fcf4b2
* Make mem and vocab python-visible in Doc
2015-07-28 20:46:59 +02:00
Matthew Honnibal
d42fe2e694
* Add unicode_literals to strings.pyx
2015-07-28 16:15:53 +02:00
Matthew Honnibal
bb910cff92
* Fix Python3 problem in align_raw
2015-07-28 16:06:53 +02:00
Matthew Honnibal
dcafb181b9
* Fix Python3 problem in align_raw
2015-07-28 15:52:10 +02:00
Matthew Honnibal
c609ea18f0
* Increment version in download script
2015-07-28 15:22:17 +02:00
Matthew Honnibal
9c4d0aae62
* Switch to better Python2/3 compatible unicode handling
2015-07-28 14:45:37 +02:00
Matthew Honnibal
7606d9936f
* Python3 correction for GoldParse
2015-07-28 14:44:53 +02:00
Matthew Honnibal
ddc1a5cfe5
* Fix training under python3
2015-07-28 14:09:30 +02:00
Matthew Honnibal
a8bbd7312c
* Hackishly patch long dependencies problem
2015-07-28 00:14:29 +02:00
Matthew Honnibal
bb583f7f09
* Hackishly patch long dependencies problem
2015-07-27 23:14:33 +02:00
Matthew Honnibal
aa7a964a4f
* Add a type declaration for doc.from_array
2015-07-27 22:57:22 +02:00
Matthew Honnibal
25a8774f42
* Fix regression in packer
2015-07-27 21:53:38 +02:00
Matthew Honnibal
1601e488ee
* Fix bug in decoding non-ascii characters
2015-07-27 21:43:58 +02:00
Matthew Honnibal
6a95409cd2
* Fix type on bits
2015-07-27 21:16:49 +02:00
Matthew Honnibal
a296d72b54
* Fix en/attrs
2015-07-27 21:16:33 +02:00
Matthew Honnibal
45460f505c
* Fix data type on read32 in BitArray
2015-07-27 21:12:13 +02:00
Matthew Honnibal
3d43f49f69
* Revert prev change
2015-07-27 10:58:15 +02:00
Matthew Honnibal
6b586cdad4
* Change lexemes.bin format. Add a header specifying size of LexemeC and number of lexemes, and don't have the redundant orth information.
2015-07-27 08:31:51 +02:00
Matthew Honnibal
af6ed18f2a
* Ensure we don't use orth_encode on OOV words.
2015-07-27 02:12:01 +02:00
Matthew Honnibal
8535d872e8
* Set is_oov property in get_flags
2015-07-27 01:51:24 +02:00
Matthew Honnibal
8e4c69ee8c
* Add is_oov property, and fix up handling of attributes
2015-07-27 01:50:06 +02:00
Matthew Honnibal
fc268f03eb
* Assert against null pointer exceptions in vocab
2015-07-27 01:00:10 +02:00
Matthew Honnibal
0f093fdb30
* Fix get_by_orth for py3
2015-07-26 19:26:41 +02:00
Matthew Honnibal
ceeda5a739
* Fix get_by_orth for py3
2015-07-26 18:39:27 +02:00
Matthew Honnibal
6bb96c122d
* Host IS_ flags in attrs.pxd, and add properties for them on Token and Lexeme objects
2015-07-26 16:37:16 +02:00
Matthew Honnibal
eeaea25f0c
* Check oov_prob file is present
2015-07-26 16:36:38 +02:00
Matthew Honnibal
7eb2446082
* Return empty lexeme on empty string
2015-07-26 00:18:30 +02:00
Matthew Honnibal
1b5d1da2a7
* Allow an OOV probability to be specified in get_lex_props
2015-07-26 00:03:43 +02:00
Matthew Honnibal
cd6e25132b
* Allow an OOV probability to be specified in get_lex_props
2015-07-26 00:01:46 +02:00
Matthew Honnibal
fd525f0675
* Pass OOV probability around
2015-07-25 23:29:51 +02:00
Matthew Honnibal
3fe14b8ed6
* Fix CFile for Python2
2015-07-25 22:55:53 +02:00
Matthew Honnibal
823ef4a00b
* Remove profile declarations
2015-07-25 18:13:06 +02:00
Matthew Honnibal
f4809e562f
* Allow json to be used as a fallback if ujson is not available
2015-07-25 18:11:36 +02:00
Matthew Honnibal
9da06671cf
* Remove unused import
2015-07-25 18:11:16 +02:00
Matthew Honnibal
2060935cdb
* Remove explicit bytes type in doc.from_bytes, to accept bytearray
2015-07-24 04:54:13 +02:00
Matthew Honnibal
aa28e2e01d
* Release the GIL around parse function
2015-07-24 04:53:27 +02:00
Matthew Honnibal
d62eb34b76
* More Py 2/3 compatibility in bit strings
2015-07-24 04:52:06 +02:00
Matthew Honnibal
0bb839d299
* Fix string coercion for Python 3
2015-07-24 03:49:30 +02:00
Matthew Honnibal
c4ff410fdb
* Fix bytes problems for Python3
2015-07-24 03:48:23 +02:00
Matthew Honnibal
1ab25e4dad
* Fix python3 type error
2015-07-24 02:45:34 +02:00
Matthew Honnibal
f35ff173b0
* Fix bits.pyx unicode error
2015-07-23 20:37:57 +02:00
Matthew Honnibal
1406e24327
* Fix unicode error for Python3
2015-07-23 19:36:21 +02:00
Matthew Honnibal
dbda6c27fa
* Fix python3 error
2015-07-23 14:52:30 +02:00
Matthew Honnibal
99387f9572
* Fix python3 error
2015-07-23 14:30:29 +02:00
Matthew Honnibal
b81ffe9032
* Fix typing on mode string in CFile
2015-07-23 13:24:43 +02:00
Matthew Honnibal
22028602a9
* Add unicode_literals declaration in vocab.pyx
2015-07-23 13:24:20 +02:00
Matthew Honnibal
5b41744270
* Check for directory presence before loading annotators
2015-07-23 09:27:37 +02:00
Matthew Honnibal
df01a88763
Merge branch 'refactor' (and serialization)
Add Huffman-code serialization, and do a lot of
refactoring. Highlights include:
* Much more efficient StringStore
* Vocab maintains a by-orth mapping of Lexemes
* Avoid manually slicing Py_UNICODE buffers,
simplifying tokenizer and vocab C APIs
* Remove various bits of dead code
* Work on removing GIL around parser
* Work on bridge to Theano
Conflicts:
spacy/strings.pxd
spacy/strings.pyx
spacy/structs.pxd
2015-07-23 02:18:35 +02:00
Matthew Honnibal
a7c4d72e83
* Add serializer property to Vocab, and lazy-load it. Add get_by_orth method.
2015-07-23 01:18:19 +02:00
Matthew Honnibal
6ab1696b15
* Remove read_encoding_freqs from util.py
2015-07-23 01:17:32 +02:00
Matthew Honnibal
d5255aad77
* Update freqs for missing tags in ner, for serializer
2015-07-23 01:17:11 +02:00
Matthew Honnibal
12699a1152
* Set initial freqs, to avoid missing values in serializer
2015-07-23 01:16:27 +02:00
Matthew Honnibal
680bb47b55
* Write serializer freqs to single file, vocab/serializer.json
2015-07-23 01:15:25 +02:00
Matthew Honnibal
a0e36e8efc
* Add working to/from bytes API to Doc
2015-07-23 01:14:45 +02:00
Matthew Honnibal
1f31d96bf9
* Fix Packer API, so that it reads and writes bytes strings, instead of BitArray. Docs are always byte aligned anyway.
2015-07-23 01:13:02 +02:00
Matthew Honnibal
38ef986b29
* Update spacy/en/attrs.pxd
2015-07-23 01:10:58 +02:00
Matthew Honnibal
06eac32610
* Add cfile.pyx
2015-07-23 01:10:36 +02:00
Matthew Honnibal
0c507bd80a
* Fix tokenizer
2015-07-22 14:10:30 +02:00
Matthew Honnibal
c86dbe4944
* Update English.save_models for new Packer save/load stuff
2015-07-22 13:40:23 +02:00
Matthew Honnibal
bf77bcd6b9
* Add comment explaining hash_string
2015-07-22 13:39:42 +02:00
Matthew Honnibal
815bda201d
* Remove UniStr struct
2015-07-22 13:39:17 +02:00
Matthew Honnibal
2fc66e3723
* Use Py_UNICODE in tokenizer for now, while sorting out Py_UCS4 stuff
2015-07-22 13:38:45 +02:00
Matthew Honnibal
4d61239eac
* Reorganize the serialization functions on Doc
2015-07-22 04:53:01 +02:00
Matthew Honnibal
109106a949
* Replace UniStr, using unicode objects instead
2015-07-22 04:52:05 +02:00
Matthew Honnibal
424854028f
* Fix decode_int32
2015-07-21 20:09:59 +00:00
Matthew Honnibal
304d0e2633
* Use decode_int32 in _orth_decode
2015-07-21 20:40:55 +02:00
Matthew Honnibal
9cfa59ec33
* Optimistically try orth encoding, with char as a back-off
2015-07-21 20:22:45 +02:00
Matthew Honnibal
c8b89e37a5
* Bug fix to faster huffman decoding
2015-07-21 20:05:53 +02:00
Matthew Honnibal
b166d1d2a2
* Use encode32 and decode32
2015-07-21 19:59:06 +02:00
Matthew Honnibal
c6cd0ddce8
* Add faster encode_int32 and decode_int32 methods
2015-07-21 19:58:45 +02:00
Matthew Honnibal
dd60594f41
* Fix double encoding error in strings.pyx
2015-07-20 13:52:56 +02:00
Matthew Honnibal
06639dc497
* Add length cap to word shape feature
2015-07-20 12:06:59 +02:00
Matthew Honnibal
128b6d9714
* Move Utf8Str struct to strings module, as that's the only place it's relevant
2015-07-20 12:06:41 +02:00
Matthew Honnibal
01a97b90f3
* Fix header for string store
2015-07-20 12:06:10 +02:00
Matthew Honnibal
52d538ea42
* Fix short string optimization in strings.pyx. StringStore tests now all pass.
2015-07-20 12:05:23 +02:00
Matthew Honnibal
09a3055630
* Work on short string optimization in Utf8Str
2015-07-20 11:26:46 +02:00
Matthew Honnibal
bb0ba1f0cd
* Improve serialization speed
2015-07-20 03:27:59 +02:00
Matthew Honnibal
8743a8c084
* Update Doc serialization for new Packer interface
2015-07-20 01:38:04 +02:00
Matthew Honnibal
1f7170e0e1
* Reinstate the fixed vocabulary --- words are only added to the lexicon in init_model, after that we create LexemeC structs with the Pool given to us.
2015-07-20 01:37:34 +02:00
Matthew Honnibal
5a7d060d9c
* Switch between the orth and char codecs depending on which is shorter for that message. Mostly orth is shorter, except if there are OOV words.
2015-07-20 01:36:22 +02:00
Matthew Honnibal
5a042ee0d3
* Add function to predict number of bits needed to encode message
2015-07-20 01:35:11 +02:00
Matthew Honnibal
b89b489bb4
* Implement both character and orth encoding in Packer, so that we can decide which to use per-text
2015-07-19 22:39:45 +02:00
Matthew Honnibal
ae78c9e3ce
* Implement character-based codec, so that we can do word/char backoff
2015-07-19 22:03:39 +02:00
Matthew Honnibal
cd1d047cb8
* Delete out-dated HuffmanCodec comment
2015-07-19 18:28:14 +02:00
Matthew Honnibal
b8086067d5
* Build Huffman codec from unsorted inputs
2015-07-19 17:58:44 +02:00
Matthew Honnibal
317cbbc015
* Serialization round trip now working with decent API, but with rough spots in the organisation and requiring vocabulary to be fixed ahead of time.
2015-07-19 15:18:17 +02:00
Matthew Honnibal
6b13e7227c
* Remove duplicate get_lex_attr method from doc.pyx
2015-07-18 22:46:07 +02:00
Matthew Honnibal
e49c7f1478
* Update oov check in tokenizer
2015-07-18 22:45:28 +02:00
Matthew Honnibal
cfd842769e
* Allow infix tokens to be variable length
2015-07-18 22:45:00 +02:00
Matthew Honnibal
5b4c78bbb2
* Use an AttributeCodec based on orth for words. Still no oov handling mechanism.
2015-07-18 22:43:18 +02:00
Matthew Honnibal
82d84b0f2b
* Index lexemes by orth, instead of a lexemes vector. Breaks the mechanism for deciding not to own LexemeC structs during parsing. Need to reinstate this.
2015-07-18 22:42:15 +02:00
Matthew Honnibal
4dddc8a69b
* Fix type declarations for attr_t. Remove unused id_t.
2015-07-18 22:39:57 +02:00
Matthew Honnibal
ced59ab9ea
* Make minor efficiency improvement in Doc.__iter__
2015-07-18 04:10:53 +02:00
Matthew Honnibal
cd91914dd8
* Fix hard-coded length
2015-07-18 04:09:56 +02:00
Matthew Honnibal
b1d74ce60d
* Remove unused joint.pyx and joint.pxd files
2015-07-17 23:31:44 +02:00
Matthew Honnibal
c27514512b
* Remove cruft ner/ directory
2015-07-17 23:24:32 +02:00
Matthew Honnibal
f8d6d319f4
* Remove cruft module
2015-07-17 23:23:05 +02:00
Matthew Honnibal
fb0a641a2d
* Don't release the gil around Parser.parse. Does this indicate thread problems?
2015-07-17 23:07:37 +02:00
Matthew Honnibal
e29daea85f
* Fix bint/int typing problem in TransitionSystem. In C++ bint* means bool*, but in C it means int*. So, type-casting to bint* is unsafe.
2015-07-17 22:37:24 +02:00
Matthew Honnibal
cf0c788892
* Tests passing on round-trip pack/unpack on basic example
2015-07-17 21:20:48 +02:00
Matthew Honnibal
44f39a876f
* Add a blank attrs.pyx
2015-07-17 16:40:42 +02:00
Matthew Honnibal
c2c83120d4
* Remove codec property from Vocab
2015-07-17 16:40:11 +02:00
Matthew Honnibal
dfdf19f6a9
* Draft a from_orth method for Doc
2015-07-17 16:39:54 +02:00
Matthew Honnibal
9e3f17051b
* Move to ORTH instead of ID for encoding lexemes. Basic tests of the codec wrappers now passing
2015-07-17 16:38:29 +02:00
Matthew Honnibal
15ff739996
* Fix passing of ID attribute in string store
2015-07-17 14:49:42 +02:00
Matthew Honnibal
95e57c2780
* Remove unnecessary key and id properties from Utf8String.
2015-07-17 01:40:18 +02:00
Matthew Honnibal
234c7e440a
* Add spacy/serialize/__init__ files
2015-07-17 01:37:33 +02:00
Matthew Honnibal
db9dfd2e23
* Major refactor of serialization. Nearly complete now.
2015-07-17 01:27:54 +02:00
Matthew Honnibal
c8282f9934
* Work on serialization. Needs more reorganisation
2015-07-16 19:56:02 +02:00
Matthew Honnibal
d8458d6a25
* Fix attr_id_t import in Spans
2015-07-16 19:55:21 +02:00
Matthew Honnibal
d1cb30dbc4
* Remove unnecessary key and id properties from Utf8String.
2015-07-16 19:29:02 +02:00
Matthew Honnibal
897de2d438
* Add 'bitter' property for serializer in English class
2015-07-16 17:47:53 +02:00
Matthew Honnibal
fb54052ae0
* Work on serializer design
2015-07-16 17:46:46 +02:00
Matthew Honnibal
a6f401580d
* Add from_array function to Doc.
2015-07-16 17:46:11 +02:00
Matthew Honnibal
2a5d050134
* Give codec loading back to Vocab.
2015-07-16 17:45:42 +02:00
Matthew Honnibal
8bf0f65f1c
* Remove dead code in strings.pyx
2015-07-16 17:35:53 +02:00
Matthew Honnibal
a9c3863665
* Fix inefficiency in StringStore.dump function
2015-07-16 17:34:32 +02:00
Matthew Honnibal
b59d271510
* Move serialization functionality into Serializer class
2015-07-16 11:23:48 +02:00
Matthew Honnibal
30be4f15da
* Import attrs from spacy.attrs, not spacy.typedefs
2015-07-16 11:23:25 +02:00
Matthew Honnibal
6c99e5f4aa
* Move serialization into Serializer class, with __call__ and train() api
2015-07-16 11:22:35 +02:00
Matthew Honnibal
e2133d990e
* Move serialization functionality out into a Serializer object
2015-07-16 11:21:44 +02:00
Matthew Honnibal
a6d040bd11
* Import Lexeme attrs from spacy.attrs, not spacy.typedefs
2015-07-16 11:20:08 +02:00
Matthew Honnibal
45ae1ce428
* Remove unused declaration in parser
2015-07-16 01:27:11 +02:00