8565 Commits

Author SHA1 Message Date
Matthew Honnibal
0d57b9748a Serialize lex_attr_getters with dill, for better pickle support 2017-10-17 18:17:45 +02:00
Matthew Honnibal
45d1dd90b1 Add tests for pickling doc 2017-10-17 17:20:58 +02:00
Ines Montani
afa67de7ee Merge pull request #1428 from roanuz/develop
Fix trailing whitespace and Language.from_disk overwrites
2017-10-17 16:29:15 +02:00
ines
a74cba2ffa Remove Binder from docs (now covered by Doc API) 2017-10-17 16:27:19 +02:00
Matthew Honnibal
92c1eb2d6f Fix Doc pickling. This also removes need for Binder class 2017-10-17 16:11:13 +02:00
Matthew Honnibal
ed8da9b11f Add missing return statement in SentenceSegmenter 2017-10-17 15:32:56 +02:00
Ines Montani
aab299c8ae Merge pull request #1429 from vishnunekkanti/develop
fix syntax error in zh
2017-10-17 14:45:02 +02:00
Anto Binish Kaspar
534240648e Fix trailing whitespace on morphology features 2017-10-17 17:15:58 +05:30
Anto Binish Kaspar
8f5b60c168 Fix Language.from_disk overwrites the meta.json file. 2017-10-17 17:15:32 +05:30
ines
8ca344712d Add Language.has_pipe method 2017-10-17 11:20:07 +02:00
ines
485c4f6df5 Add Hungarian examples (see #1107) 2017-10-17 02:37:45 +02:00
Matthew Honnibal
fc797a58de Merge pull request #1424 from explosion/feature/streaming-data-memory-growth
💫 Fix streaming data memory growth (!!)
2017-10-16 23:08:18 +02:00
Matthew Honnibal
19531bad4c Merge branch 'develop' into feature/streaming-data-memory-growth 2017-10-16 21:44:11 +02:00
Matthew Honnibal
df488274b1 Fix deserialization of vectors 2017-10-16 20:55:00 +02:00
Matthew Honnibal
4018486d31 Merge remote-tracking branch 'origin/develop' into feature/streaming-data-memory-growth 2017-10-16 20:49:48 +02:00
ines
4cfe259266 Fix formatting 2017-10-16 20:36:41 +02:00
ines
18793efef1 Remove Russian from v2.0 docs for now 2017-10-16 20:36:36 +02:00
ines
d383612225 Add note about word vectors in example (see #1117) 2017-10-16 20:31:58 +02:00
Matthew Honnibal
4174477161 Fix equality check in test 2017-10-16 19:50:35 +02:00
Matthew Honnibal
2bc06e4b22 Bump rolling buffer size to 10k 2017-10-16 19:38:29 +02:00
Matthew Honnibal
66e2eb8f39 Clean up remnant of frozen in StringStore 2017-10-16 19:34:41 +02:00
Matthew Honnibal
a002264fec Remove caching of Token in Doc, as caused cycle. 2017-10-16 19:34:21 +02:00
Matthew Honnibal
3e037054c8 Remove obsolete is_frozen functionality from StringStore 2017-10-16 19:23:10 +02:00
Matthew Honnibal
5c14f3f033 Create a rolling buffer for the StringStore in Language.pipe() 2017-10-16 19:22:40 +02:00
Matthew Honnibal
59c216196c Allow weakrefs on Doc objects 2017-10-16 19:22:11 +02:00
ines
d5418553eb Fix whitespace 2017-10-16 18:30:04 +02:00
ines
6ceadcdb5c Make sure from_disk passes string to numpy (see #1421)
If path is a WindowsPath, numpy does not recognise it as a path and as
a result, doesn't open the file.
https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L369
2017-10-16 18:29:56 +02:00
Matthew Honnibal
010a7309ff Merge pull request #1402 from explosion/feature/fix-matcher-operators
💫 Fix Matcher variable-length operators
2017-10-16 17:53:19 +02:00
Matthew Honnibal
c29927d2e7 Fix matcher test 2017-10-16 17:22:18 +02:00
Vishnu Kumar Nekkanti
d3c54cf39a fixed SyntaxError while checking for jieba 2017-10-16 18:51:33 +05:30
Vishnu Kumar Nekkanti
18ec6610dd Merge pull request #1 from explosion/develop
Develop
2017-10-16 18:34:13 +05:30
ines
63393b4e0d Update matcher docs to reflect operator changes 2017-10-16 13:44:12 +02:00
Matthew Honnibal
a928ae2f35 Merge branch 'develop' into feature/fix-matcher-operators 2017-10-16 13:38:36 +02:00
Matthew Honnibal
56aa42cc5d Fix and document matcher operator 'shadowing' behaviour 2017-10-16 13:38:20 +02:00
Matthew Honnibal
748d525801 Add more matcher operator tests 2017-10-16 13:38:01 +02:00
Matthew Honnibal
0433181658 Document operator semantics in Matcher docstring 2017-10-16 12:06:33 +02:00
Matthew Honnibal
cd9378c8f1 Merge pull request #1423 from yuukos/master
Fixed Russian tokenizer
2017-10-16 11:45:53 +02:00
Matthew Honnibal
6b0121091c Merge pull request #1420 from polm/master
[ja] Stash tokenizer output for speed
2017-10-16 10:28:22 +02:00
yuukos
34e9c6ddc0 Merge remote-tracking branch 'origin/master' 2017-10-16 13:48:10 +07:00
yuukos
92931a2efd Merge branch 'russian_language' 2017-10-16 13:46:28 +07:00
yuukos
241d19a3e6 fixed Russian Tokenizer
- added trailing space flags for tokens
2017-10-16 13:37:05 +07:00
Paul O'Leary McCann
71ae8013ec [ja] Use user_details instead of a wrapper class
Instead of using a JapaneseDoc wrapper class to store Mecab output,
stash it in `user_data`. -POLM
2017-10-16 00:24:34 +09:00
Paul O'Leary McCann
43eedf73f2 [ja] Stash tokenizer output for speed
Before this commit, the Mecab tokenizer had to be called twice when
creating a Doc- once during tokenization and once during tagging. This
creates a JapaneseDoc wrapper class for Doc that stashes the parsed
tokenizer output to remove redundant processing. -POLM
2017-10-15 23:33:25 +09:00
ines
15514dc333 Add section on upgrading 2017-10-14 22:14:47 +02:00
ines
c0aceb9fbe Add Hindi to supported languages 2017-10-14 15:16:41 +02:00
Ines Montani
e00a6c08cf Merge pull request #1418 from polm/master
Contributor agreement
2017-10-14 15:10:58 +02:00
ines
266e7180a7 Add Language class, stop words and basic stemmer that sets NORM 2017-10-14 14:59:52 +02:00
ines
e85e1d571b Update base punctuation 2017-10-14 14:59:23 +02:00
ines
9d6c8eaa49 Update base norm exceptions with more unicode characters
e.g. unicode variations of punctuation used in Chinese
2017-10-14 14:58:52 +02:00
ines
3516aa0cea Port over changes from #1389 2017-10-14 13:32:55 +02:00