Jeffrey Gerard
b6ebedd09c
Document Tokenizer(token_match) and clarify tokenizer_pseudo_code
...
Closes #835
In the `tokenizer_pseudo_code` I put the `special_cases` kwarg
before `find_prefix`, because this matches the order in which the args
are used in the pseudocode, and it also matches spaCy's actual code.
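That ordering can be seen in a minimal sketch of the documented algorithm. The `tokenize` function and the `special_cases`/`find_prefix`/`find_suffix` helpers below are illustrative stand-ins for the pseudocode, not spaCy's actual API:

```python
def tokenize(text, special_cases, find_prefix, find_suffix):
    """Sketch of the documented tokenizer loop: split on whitespace,
    then peel off special cases, prefixes, and suffixes per chunk."""
    tokens = []
    for substring in text.split():
        suffixes = []
        while substring:
            # Special cases (e.g. contractions) take precedence.
            if substring in special_cases:
                tokens.extend(special_cases[substring])
                substring = ""
                continue
            pre = find_prefix(substring)  # returns matched prefix length, or 0
            if pre:
                tokens.append(substring[:pre])
                substring = substring[pre:]
                continue
            suf = find_suffix(substring)  # returns matched suffix length, or 0
            if suf:
                # Suffixes are collected nearest-first, emitted after the stem.
                suffixes.insert(0, substring[len(substring) - suf:])
                substring = substring[:len(substring) - suf]
                continue
            tokens.append(substring)
            substring = ""
        tokens.extend(suffixes)
    return tokens
```

For example, with a special case mapping `don't` to `do` + `n't` and a suffix rule matching trailing punctuation, `tokenize("don't stop!", ...)` yields `do`, `n't`, `stop`, `!`.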
2017-09-25 13:13:25 -07:00
Matthew Honnibal
2f8d535f65
Merge pull request #1351 from hscspring/patch-4
...
Update punctuation.py
2017-09-24 12:16:39 +02:00
Matthew Honnibal
9177313063
Merge pull request #1352 from hscspring/patch-5
...
Update customizing-tokenizer.jade
2017-09-22 16:11:49 +02:00
Matthew Honnibal
1dbc2285b8
Merge pull request #1350 from hscspring/patch-3
...
Update word-vectors-similarities.jade
2017-09-22 16:11:05 +02:00
Yam
54855f0eee
Update customizing-tokenizer.jade
2017-09-22 12:15:48 +08:00
Yam
6f450306c3
Update customizing-tokenizer.jade
...
Update some code:
- `me` -> `-PRON`
- `TAG` -> `POS`
- `create_tokenizer` function
2017-09-22 10:53:22 +08:00
Yam
923c4c2fb2
Update punctuation.py
...
add `……`
2017-09-22 09:50:46 +08:00
Yam
425c09488d
Update word-vectors-similarities.jade
...
add
```
import spacy
nlp = spacy.load('en')
```
2017-09-22 08:56:34 +08:00
Matthew Honnibal
ea2732469b
Merge pull request #1340 from hscspring/patch-1
...
Update punctuation.py
2017-09-20 23:57:00 +02:00
Yam
978b24ccd4
Update punctuation.py
...
In Chinese, `~` and `——` are hyphens,
and `·` is a separator mark (interpunct)
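An illustrative sketch (not the actual `punctuation.py` contents) of how characters like these get folded into the lists from which the tokenizer's infix regexes are built:

```python
import re

# Hypothetical hyphen list: the full-width tilde and the double em-dash
# behave like hyphens in Chinese text, per this commit.
HYPHENS = ["-", "~", "——"]

# punctuation.py-style construction: join the escaped characters into
# one alternation pattern usable as an infix rule.
infix_re = re.compile("|".join(re.escape(h) for h in HYPHENS))
```

Splitting on this pattern then separates the joined tokens, e.g. `東京——大阪` into `東京` and `大阪`.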
2017-09-20 23:02:22 +08:00
Matthew Honnibal
aa728b33ca
Merge pull request #1333 from galaxyh/master
...
Add Chinese punctuation
2017-09-19 15:09:30 +02:00
Yu-chun Huang
188b439b25
Add Chinese punctuation
...
Add Chinese punctuation.
2017-09-19 16:58:42 +08:00
Yu-chun Huang
1f1f35dcd0
Add Chinese punctuation
...
Add Chinese punctuation.
2017-09-19 16:57:24 +08:00
Ines Montani
4bee26188d
Merge pull request #1323 from galaxyh/master
...
Set the "cut_all" parameter in jieba.cut() to False, or jieba will return ALL POSSIBLE word segmentations.
2017-09-14 15:23:41 +02:00
Yu-chun Huang
7692b8c071
Update __init__.py
...
Set the "cut_all" parameter to False, or jieba will return ALL POSSIBLE word segmentations.
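A toy illustration of why full mode is unsuitable here — this is a stand-in segmenter, not jieba itself: with `cut_all=True` every dictionary word found anywhere in the text is emitted, so the segments overlap, while accurate mode returns one non-overlapping segmentation (the greedy longest-match below stands in for jieba's DAG-based search):

```python
def cut(text, words, cut_all=False):
    """Toy segmenter contrasting jieba's two modes."""
    if cut_all:
        # Full mode: emit every dictionary word occurring in the text,
        # overlaps included.
        return [text[i:j] for i in range(len(text))
                for j in range(i + 1, len(text) + 1) if text[i:j] in words]
    # Accurate mode: one non-overlapping segmentation via greedy
    # longest-match; unknown characters pass through as single tokens.
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in words:
                out.append(text[i:j])
                i = j
                break
        else:
            out.append(text[i])
            i += 1
    return out
```

With a dictionary containing `new`, `york`, and `newyork`, full mode returns all three overlapping words for the input `newyork`, while accurate mode returns just `newyork` — which is why the tokenizer needs `cut_all=False`.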
2017-09-12 16:23:47 +08:00
Matthew Honnibal
ddaff6ca56
Merge pull request #1287 from IamJeffG/feature/1226-more-complete-noun-chunks
...
Capture more noun chunks
2017-09-08 07:59:10 +02:00
Matthew Honnibal
45029a550e
Fix customized-tokenizer tests
2017-09-04 20:13:13 +02:00
Matthew Honnibal
34c585396a
Merge pull request #1294 from Vimos/master
...
Fix issue #1292 and add test case for the Assertion Error
2017-09-04 19:20:40 +02:00
Matthew Honnibal
c68f188eb0
Fix error on test
2017-09-04 18:59:36 +02:00
Matthew Honnibal
33313c01ad
Merge pull request #1298 from ericzhao28/master
...
Lowest common ancestor matrix for spans and docs
2017-09-04 18:57:54 +02:00
Matthew Honnibal
e8a26ebfab
Add efficiency note to new get_lca_matrix() method
2017-09-04 15:43:52 +02:00
Eric Zhao
d61c117081
Lowest common ancestor matrix for spans and docs
...
Added functionality for spans and docs to get lowest common ancestor
matrix by simply calling: doc.get_lca_matrix() or
doc[:3].get_lca_matrix().
Corresponding unit tests were also added under spacy/tests/doc and
spacy/tests/spans.
Designed to address: https://github.com/explosion/spaCy/issues/969 .
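The idea can be sketched in pure Python — this mirrors the behaviour described above, not spaCy's implementation. Given each token's head index (with the root pointing at itself), `lca[i][j]` is the index of the lowest common ancestor of tokens `i` and `j`, or -1 if there is none:

```python
def get_lca_matrix(heads):
    """Sketch of a lowest-common-ancestor matrix over a dependency tree.

    heads[i] is the index of token i's head; the root has heads[r] == r.
    """
    n = len(heads)

    def ancestors(i):
        # Chain from the token up to the root, nearest-first
        # (a token counts as its own ancestor).
        chain = [i]
        while heads[i] != i:
            i = heads[i]
            chain.append(i)
        return chain

    lca = [[-1] * n for _ in range(n)]
    for i in range(n):
        anc_i = ancestors(i)
        for j in range(n):
            anc_j = set(ancestors(j))
            # The first shared ancestor in nearest-first order is the LCA.
            for a in anc_i:
                if a in anc_j:
                    lca[i][j] = a
                    break
    return lca
```

For a three-token sentence whose root is token 1 with tokens 0 and 2 attached to it (`heads = [1, 1, 1]`), the matrix has 1 everywhere except the diagonal, where each token is its own LCA.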
2017-09-03 12:22:19 -07:00
Matthew Honnibal
9bffcaa73d
Update test to make it slightly more direct
...
The `nlp` container should be unnecessary here. If so, we can test the tokenizer class just a little more directly.
2017-09-01 21:16:56 +02:00
Vimos Tan
a6d9fb5bb6
fix issue #1292
2017-08-30 14:49:14 +08:00
Jeffrey Gerard
884ba168a8
Capture more noun chunks
2017-08-23 21:18:53 -07:00
ines
dcff10abe9
Add regression test for #1281
2017-08-21 16:11:47 +02:00
ines
edc596d9a7
Add missing tokenizer exceptions (resolves #1281)
2017-08-21 16:11:36 +02:00
ines
c5c3f4c7d9
Use more generous .env ignore rule
2017-08-21 16:08:40 +02:00
Ines Montani
dca026124f
Merge pull request #1262 from kevinmarsh/patch-1
...
Fix broken tutorial link on website
2017-08-16 09:58:07 +02:00
Kevin Marsh
e3738aba0d
Fix broken tutorial link on website
2017-08-15 21:50:09 +01:00
Ines Montani
a9465271a7
Merge pull request #1245 from delirious-lettuce/fix_typos
...
Fix typos
2017-08-07 23:11:20 +02:00
Delirious Lettuce
d3b03f0544
Fix typos:
...
* `auxillary` -> `auxiliary`
* `consistute` -> `constitute`
* `earlist` -> `earliest`
* `prefered` -> `preferred`
* `direcory` -> `directory`
* `reuseable` -> `reusable`
* `idiosyncracies` -> `idiosyncrasies`
* `enviroment` -> `environment`
* `unecessary` -> `unnecessary`
* `yesteday` -> `yesterday`
* `resouces` -> `resources`
2017-08-06 21:31:39 -06:00
Matthew Honnibal
b7b121103f
Merge pull request #1244 from gideonite/patch-1
...
improve pipe, tee, izip explanation
2017-08-06 14:34:07 +02:00
Gideon Dresdner
7e98a3613c
improve pipe, tee, izip explanation
...
Use an example from an old issue https://github.com/explosion/spaCy/issues/172#issuecomment-183963403 .
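The gist of the pattern the explanation covers: a generator can only be consumed once, so `itertools.tee` splits the stream into independent copies before one of them is fed to a pipeline. The variable names below are illustrative:

```python
from itertools import tee

# A one-shot stream of texts, as would be fed to nlp.pipe().
texts = (f"sentence {i}" for i in range(3))

# tee() yields two independent iterators over the same stream;
# consuming one buffers items so the other still sees everything.
texts, texts_copy = tee(texts)

lengths = [len(t) for t in texts_copy]  # consume the copy
originals = list(texts)                 # the original branch is intact
```

Without `tee`, iterating `texts` once for the lengths would leave nothing for the second pass.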
2017-08-06 13:21:45 +02:00
ines
864cefd3b2
Update README.rst
2017-07-22 18:29:55 +02:00
ines
e349271506
Increment version
2017-07-22 18:29:30 +02:00
Ines Montani
570964e67f
Update README.rst
2017-07-22 16:20:19 +02:00
Matthew Honnibal
5494605689
Fiddle with regex pin
2017-07-22 16:09:50 +02:00
Matthew Honnibal
78fcf56dd5
Update version pin for regex library
2017-07-22 15:57:58 +02:00
Matthew Honnibal
d51d55bba6
Increment version
2017-07-22 15:43:16 +02:00
Matthew Honnibal
8ccf154413
Merge branch 'master' of https://github.com/explosion/spaCy
2017-07-22 15:42:44 +02:00
Matthew Honnibal
796b2f4c1b
Remove print statements in tests
2017-07-22 15:42:38 +02:00
ines
7c4bf9994d
Add note on requirements and preventing model re-downloads (closes #1143)
2017-07-22 15:40:12 +02:00
ines
de25bad036
Use lower min version for requests dependency (fixes #1137)
...
Ensure compatibility with docker-compose and other packages
2017-07-22 15:29:10 +02:00
ines
d7560047c5
Fix version
2017-07-22 15:24:33 +02:00
Matthew Honnibal
af945ea8e2
Merge branch 'master' of https://github.com/explosion/spaCy
2017-07-22 15:09:59 +02:00
Matthew Honnibal
4b2e5e59ed
Add flush_cache method to tokenizer, to fix #1061
...
The tokenizer caches output for common chunks, for efficiency. This
cache must be invalidated when the tokenizer rules change, e.g. when a new
special-case rule is introduced. That's what was causing #1061 .
When the cache is flushed, we free the intermediate token chunks.
I *think* this is safe --- but if we start getting segfaults, this patch
is to blame. The resolution would be to simply not free those bits of
memory. They'll be freed when the tokenizer exits anyway.
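A minimal sketch of the caching behaviour described above — a plain-Python analogue, not spaCy's Cython implementation, with illustrative names:

```python
class CachingTokenizer:
    """Caches tokenizations of common strings; stale entries must be
    flushed when a new special-case rule is added (the bug in #1061)."""

    def __init__(self):
        self.special_cases = {}
        self._cache = {}

    def __call__(self, text):
        if text not in self._cache:
            self._cache[text] = self.special_cases.get(text, text.split())
        return self._cache[text]

    def add_special_case(self, string, tokens):
        self.special_cases[string] = tokens
        # Without this flush, a previously cached tokenization of
        # `string` would keep shadowing the new rule.
        self.flush_cache()

    def flush_cache(self):
        self._cache.clear()
```

Calling the tokenizer once, then adding a special case for the same string, now returns the new tokenization instead of the stale cached one.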
2017-07-22 15:06:50 +02:00
Ines Montani
96df9c7154
Update CONTRIBUTORS.md
2017-07-22 15:05:46 +02:00
ines
b22b18a019
Add notes on spacy.explain() to annotation docs
2017-07-22 15:02:15 +02:00
ines
e3f23f9d91
Use latest available version in examples
2017-07-22 14:57:51 +02:00