Commit Graph

5252 Commits

Author SHA1 Message Date
Matthew Honnibal
4b2e5e59ed Add flush_cache method to tokenizer, to fix #1061
The tokenizer caches output for common chunks, for efficiency. This
cache must be invalidated when the tokenizer rules change, e.g. when a new
special-case rule is introduced. That's what was causing #1061.

When the cache is flushed, we free the intermediate token chunks.
I *think* this is safe --- but if we start getting segfaults, this patch
is to blame. The resolution would be to simply not free those bits of
memory. They'll be freed when the tokenizer exits anyway.
2017-07-22 15:06:50 +02:00
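The cache invalidation described in the commit above can be illustrated with a minimal sketch. This is not spaCy's actual tokenizer (which is implemented in Cython and manages memory manually); `CachingTokenizer`, `add_special_case`, and `_tokenize_chunk` are hypothetical names used only for illustration, and only `flush_cache` is taken from the commit message.

```python
class CachingTokenizer:
    """Toy whitespace tokenizer that caches per-chunk output.

    Hypothetical illustration only, not spaCy's implementation.
    """

    def __init__(self):
        self._special_cases = {}  # chunk -> list of tokens
        self._cache = {}          # chunk -> tuple of tokens

    def add_special_case(self, chunk, tokens):
        self._special_cases[chunk] = list(tokens)
        # The rules changed, so previously cached output may be stale.
        self.flush_cache()

    def flush_cache(self):
        # Drop every cached chunk; entries are rebuilt lazily on next use.
        self._cache.clear()

    def _tokenize_chunk(self, chunk):
        if chunk in self._special_cases:
            return tuple(self._special_cases[chunk])
        return (chunk,)

    def __call__(self, text):
        tokens = []
        for chunk in text.split():
            if chunk not in self._cache:
                self._cache[chunk] = self._tokenize_chunk(chunk)
            tokens.extend(self._cache[chunk])
        return tokens


tokenizer = CachingTokenizer()
before = tokenizer("I dont know")          # caches "dont" as a single token
tokenizer.add_special_case("dont", ["do", "nt"])
after = tokenizer("I dont know")           # cache was flushed, new rule applies
```

Without the call to `flush_cache` inside `add_special_case`, the second call would return the stale cached entry for "dont".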
Ines Montani
96df9c7154 Update CONTRIBUTORS.md 2017-07-22 15:05:46 +02:00
ines
b22b18a019 Add notes on spacy.explain() to annotation docs 2017-07-22 15:02:15 +02:00
ines
e3f23f9d91 Use latest available version in examples 2017-07-22 14:57:51 +02:00
Matthew Honnibal
23a55b40ca Default to English noun chunks iterator if no lang set 2017-07-22 14:15:25 +02:00
Matthew Honnibal
9750a0128c Fix Span.noun_chunks. Closes #1207 2017-07-22 14:14:57 +02:00
Matthew Honnibal
d9b85675d7 Rename regression test 2017-07-22 14:14:35 +02:00
Matthew Honnibal
dfbc7e49de Add test for Issue #1207 2017-07-22 14:14:01 +02:00
Matthew Honnibal
0ae3807d7d Fix gaps in Lexeme API. Closes #1031 2017-07-22 13:53:48 +02:00
Matthew Honnibal
83e1b5f1e3 Merge branch 'master' of https://github.com/explosion/spaCy 2017-07-22 13:45:35 +02:00
Matthew Honnibal
45f6961ae0 Add __version__ symbol in __init__.py 2017-07-22 13:45:21 +02:00
Matthew Honnibal
8b9c4c5e1c Add missing SP symbol to tag map, re #1052 2017-07-22 13:44:17 +02:00
Ines Montani
69396dcfd3 Update CONTRIBUTORS.md 2017-07-22 13:43:15 +02:00
Ines Montani
9af04ea11f Merge pull request #1161 from AlexisEidelman/patch-1
French NUM_WORDS and ORDINAL_WORDS
2017-07-22 13:40:46 +02:00
Matthew Honnibal
8b581fdac5 Remove unused example 2017-07-22 13:36:54 +02:00
Matthew Honnibal
44dd247e73 Merge branch 'master' of https://github.com/explosion/spaCy 2017-07-22 13:35:30 +02:00
Matthew Honnibal
94267ec50f Fix merge conflict in printer 2017-07-22 13:35:15 +02:00
Ines Montani
c7708dc736 Merge pull request #1177 from swierh/master
Dutch NUM_WORDS and ORDINAL_WORDS
2017-07-22 13:35:08 +02:00
Matthew Honnibal
5916d46ba8 Avoid use of deepcopy in printer 2017-07-22 13:34:01 +02:00
Matthew Honnibal
a405660068 Add commit to tagger example 2017-07-22 13:32:48 +02:00
Matthew Honnibal
3fef5f642b Rename tagger training example 2017-07-22 13:29:15 +02:00
Matthew Honnibal
8bb443be4f Add standalone tagger training example 2017-07-22 13:28:51 +02:00
Ines Montani
7c66691790 Merge pull request #1197 from jsparedes/patch-1
Fix url broken
2017-07-21 14:05:26 +02:00
Jorge Paredes
fadacd0d47 Fix url broken
The related URL for **custom named entities** was broken
2017-07-16 10:06:32 -05:00
Ines Montani
2d22b63e09 Merge pull request #1186 from lgenerknol/master
.../cli/#foo is 404
2017-07-13 17:33:55 +02:00
lgenerknol
2b219caf0d .../cli/#foo is 404
https://spacy.io/docs/usage/cli/#package is a 404.
Changed to https://spacy.io/docs/usage/cli#package

A larger fix to deal with trailing slashes is definitely possible.
2017-07-12 13:12:24 -04:00
Ines Montani
d79fa8743a Merge pull request #1185 from lgenerknol/master
Missing markup char
2017-07-12 17:27:42 +02:00
lgenerknol
6cf2690943 Missing markup char
Frontend displayed: 
```
 If start_idx and do not mark[...]
```
Note the missing "end_idx" after 'and'.
2017-07-12 11:06:16 -04:00
Ines Montani
9eca6503c1 Merge pull request #1157 from polm/master
Add basic Japanese Tokenizer Test
2017-07-10 13:07:11 +02:00
Paul O'Leary McCann
bc87b815cc Add comment clarifying what LANGUAGES does 2017-07-09 16:28:55 +09:00
Paul O'Leary McCann
04e6a65188 Remove Japanese from LANGUAGES
LANGUAGES is a list of languages whose tokenizers get run through a
variety of generic tests. Since the generic tests don't check the JA
fixture, it blows up when it can't find janome. -POLM
2017-07-09 16:23:26 +09:00
Ines Montani
2b9411bb54 Merge pull request #1181 from val314159/patch-1
make this work in python2.7
2017-07-08 00:15:47 +02:00
val314159
19d4706f69 make this work in python2.7 2017-07-07 13:18:17 -07:00
Swier
29720150f9 fix import of stop words in language data 2017-07-05 14:08:04 +02:00
Swier
f377c9c952 Rename stop_words.py to word_sets.py 2017-07-05 14:06:28 +02:00
Swier
5357874bf7 add Dutch numbers and ordinals 2017-07-05 14:03:30 +02:00
Ines Montani
84eb9d6bd3 Merge pull request #1167 from callumkift/fix/docs-ner-training
Fixed error training NER documentation and example
2017-07-01 11:46:31 +02:00
Ines Montani
0c7f5af5ee Merge pull request #1168 from gispk47/master
Update zh language error
2017-07-01 11:43:12 +02:00
gispk47
669bd14213 Update __init__.py
Remove the empty strings returned by jieba.cut; they trigger an assertion error because the list of tokens can't be pushed.
2017-07-01 13:12:00 +08:00
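The fix described in the commit above amounts to filtering empty strings out of the segmenter's output before tokens are built. A minimal sketch follows, assuming a stand-in generator `fake_cut` in place of `jieba.cut` so that the example runs without jieba installed; `fake_cut` and `segment` are hypothetical names and are not part of spaCy or jieba.

```python
def fake_cut(text):
    """Hypothetical stand-in for a segmenter that may yield empty strings."""
    for piece in text.split(" "):
        yield piece


def segment(text):
    # Keep only non-empty pieces so downstream code never sees "" as a token.
    return [word for word in fake_cut(text) if word]


# The double space makes the stand-in segmenter emit one empty string.
raw = list(fake_cut("spaCy  supports Chinese"))
words = segment("spaCy  supports Chinese")
```

Here `raw` still contains the empty string, while `words` holds only the three real tokens.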
Callum Kift
dfaeee1f37 fixed bug in training ner documentation and example 2017-06-30 09:56:33 +02:00
Paul O'Leary McCann
c336193392 Parametrize and extend Japanese tokenizer tests 2017-06-29 00:09:40 +09:00
Paul O'Leary McCann
30a34ebb6e Add importorskip for janome 2017-06-29 00:09:20 +09:00
Alexis
1b3a5d87ba French NUM_WORDS and ORDINAL_WORDS 2017-06-28 14:11:20 +02:00
Paul O'Leary McCann
e56fea14eb Add basic Japanese tokenizer test 2017-06-28 01:24:25 +09:00
Paul O'Leary McCann
84041a2bb5 Make create_tokenizer work with Japanese 2017-06-28 01:18:05 +09:00
Ines Montani
f69ff15089 Update CONTRIBUTORS.md 2017-06-27 14:49:02 +02:00
Ines Montani
d6e08f2bf6 Merge pull request #1142 from garfieldnate/patch-1
fix confusing typo
2017-06-26 10:41:47 +02:00
Nathan Glenn
81166c3d56 fix confusing typo
This document describes the `Vocab` class, not the `Span` class.
2017-06-21 19:22:30 +02:00
Ines Montani
9335736c20 Merge pull request #1127 from bartbroere/master
Fixed a minor typo in the documentation
2017-06-13 13:15:20 +02:00
Bart Broere
e3be243e06 Merge pull request #1 from explosion/master
Update
2017-06-12 22:06:59 +02:00