spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-30 20:06:30 +03:00

Author	SHA1	Message	Date
Wannaphong Phatthiyaphaibun	1abf472068	add th test	2017-09-21 12:56:58 +07:00
Matthew Honnibal	ea2732469b	Merge pull request #1340 from hscspring/patch-1 Update punctuation.py	2017-09-20 23:57:00 +02:00
Wannaphong Phatthiyaphaibun	39bb5690f0	update th	2017-09-21 00:36:02 +07:00
Wannaphong Phatthiyaphaibun	44291f6697	add thai	2017-09-20 23:26:34 +07:00
Yam	978b24ccd4	Update punctuation.py In Chinese, `~` and `——` are hyphens, and `·` is an intermittent symbol	2017-09-20 23:02:22 +08:00
Matthew Honnibal	aa728b33ca	Merge pull request #1333 from galaxyh/master Add Chinese punctuation	2017-09-19 15:09:30 +02:00
Yu-chun Huang	188b439b25	Add Chinese punctuation Add Chinese punctuation.	2017-09-19 16:58:42 +08:00
Yu-chun Huang	1f1f35dcd0	Add Chinese punctuation Add Chinese punctuation.	2017-09-19 16:57:24 +08:00
Ines Montani	4bee26188d	Merge pull request #1323 from galaxyh/master Set the "cut_all" parameter in jieba.cut() to False, or jieba will return ALL POSSIBLE word segmentations.	2017-09-14 15:23:41 +02:00
Yu-chun Huang	7692b8c071	Update __init__.py Set the "cut_all" parameter to False, or jieba will return ALL POSSIBLE word segmentations.	2017-09-12 16:23:47 +08:00
Matthew Honnibal	ddaff6ca56	Merge pull request #1287 from IamJeffG/feature/1226-more-complete-noun-chunks Capture more noun chunks	2017-09-08 07:59:10 +02:00
Matthew Honnibal	45029a550e	Fix customized-tokenizer tests	2017-09-04 20:13:13 +02:00
Matthew Honnibal	34c585396a	Merge pull request #1294 from Vimos/master Fix issue #1292 and add test case for the Assertion Error	2017-09-04 19:20:40 +02:00
Matthew Honnibal	c68f188eb0	Fix error on test	2017-09-04 18:59:36 +02:00
Matthew Honnibal	33313c01ad	Merge pull request #1298 from ericzhao28/master Lowest common ancestor matrix for spans and docs	2017-09-04 18:57:54 +02:00
Matthew Honnibal	e8a26ebfab	Add efficiency note to new get_lca_matrix() method	2017-09-04 15:43:52 +02:00
Eric Zhao	d61c117081	Lowest common ancestor matrix for spans and docs Added functionality for spans and docs to get lowest common ancestor matrix by simply calling: doc.get_lca_matrix() or doc[:3].get_lca_matrix(). Corresponding unit tests were also added under spacy/tests/doc and spacy/tests/spans. Designed to address: https://github.com/explosion/spaCy/issues/969.	2017-09-03 12:22:19 -07:00
Matthew Honnibal	9bffcaa73d	Update test to make it slightly more direct The `nlp` container should be unnecessary here. If so, we can test the tokenizer class just a little more directly.	2017-09-01 21:16:56 +02:00
Vimos Tan	a6d9fb5bb6	fix issue #1292	2017-08-30 14:49:14 +08:00
Paul O'Leary McCann	8b3e1f7b5b	Handle out-of-vocab words Wasn't handling words out of the tokenizer dictionary vocabulary properly. This adds a fix and test for that. -POLM	2017-08-29 23:58:42 +09:00
Jeffrey Gerard	884ba168a8	Capture more noun chunks	2017-08-23 21:18:53 -07:00
Paul O'Leary McCann	95050201ce	Add importorskip for Japanese fixture	2017-08-22 21:30:59 +09:00
Paul O'Leary McCann	bcf2b9b4f5	Update tagger & tokenizer tests Tagger is now parametrized and has two sentences with more tag coverage. The tokenizer tests are updated to reflect differences in tokenization between IPAdic and Unidic. -POLM	2017-08-22 00:03:11 +09:00
Paul O'Leary McCann	adfd987316	Update the TAG_MAP	2017-08-22 00:02:55 +09:00
Paul O'Leary McCann	53e17296e9	Fix pronoun handling Missed this case earlier. 連体詞 have three classes for UD purposes: - その -> DET - それ -> PRON - 同じ -> ADJ -POLM	2017-08-22 00:01:49 +09:00
Paul O'Leary McCann	c435f748d7	Put Mecab import in utility function	2017-08-22 00:01:28 +09:00
ines	dcff10abe9	Add regression test for #1281	2017-08-21 16:11:47 +02:00
ines	edc596d9a7	Add missing tokenizer exceptions (resolves #1281 )	2017-08-21 16:11:36 +02:00
ines	c5c3f4c7d9	Use more generous .env ignore rule	2017-08-21 16:08:40 +02:00
Paul O'Leary McCann	234a8a7591	Change default tag for 動詞,非自立可能 Example of this is いる in these sentences: 彼はそこにいる。# should be VERB 彼は底に立っている。# should be AUX Unclear which case is more numerous - need to check a large corpus - but in keeping with the other ambiguous tags, this is mapped to the "dominant" or first part of the tag. -POLM	2017-08-21 00:21:45 +09:00
Ines Montani	dca026124f	Merge pull request #1262 from kevinmarsh/patch-1 Fix broken tutorial link on website	2017-08-16 09:58:07 +02:00
Kevin Marsh	e3738aba0d	Fix broken tutorial link on website	2017-08-15 21:50:09 +01:00
Ines Montani	a9465271a7	Merge pull request #1245 from delirious-lettuce/fix_typos Fix typos	2017-08-07 23:11:20 +02:00
Paul O'Leary McCann	6e9e686568	Sample implementation of Japanese Tagger (ref #1214 ) This is far from complete but it should be enough to check some things. 1. Mecab transition. Janome doesn't support Unidic, only IPAdic, but UD tag mappings are based on Unidic. This swaps Janome out for Mecab to get around that. 2. Raw tag extension. A simple tag map can't meet the specifications for UD tag mappings, so this adds an extra field to ambiguous cases. For this demo it just deals with the simplest case, which only needs to look at the literal token. (In reality it may be necessary to look at the whole sentence, but that's another issue.) 3. General code structure. Seems nobody else has implemented a custom Tagger yet, so still not sure this is the correct way to pass the vocabulary around, for example. Any feedback would be greatly appreciated. -POLM	2017-08-08 01:27:15 +09:00
Delirious Lettuce	d3b03f0544	Fix typos: * `auxillary` -> `auxiliary` * `consistute` -> `constitute` * `earlist` -> `earliest` * `prefered` -> `preferred` * `direcory` -> `directory` * `reuseable` -> `reusable` * `idiosyncracies` -> `idiosyncrasies` * `enviroment` -> `environment` * `unecessary` -> `unnecessary` * `yesteday` -> `yesterday` * `resouces` -> `resources`	2017-08-06 21:31:39 -06:00
Matthew Honnibal	b7b121103f	Merge pull request #1244 from gideonite/patch-1 improve pipe, tee, izip explanation	2017-08-06 14:34:07 +02:00
Gideon Dresdner	7e98a3613c	improve pipe, tee, izip explanation Use an example from an old issue https://github.com/explosion/spaCy/issues/172#issuecomment-183963403.	2017-08-06 13:21:45 +02:00
ines	864cefd3b2	Update README.rst	2017-07-22 18:29:55 +02:00
ines	e349271506	Increment version	2017-07-22 18:29:30 +02:00
Ines Montani	570964e67f	Update README.rst	2017-07-22 16:20:19 +02:00
Matthew Honnibal	5494605689	Fiddle with regex pin	2017-07-22 16:09:50 +02:00
Matthew Honnibal	78fcf56dd5	Update version pin for regex library	2017-07-22 15:57:58 +02:00
Matthew Honnibal	d51d55bba6	Increment version	2017-07-22 15:43:16 +02:00
Matthew Honnibal	8ccf154413	Merge branch 'master' of https://github.com/explosion/spaCy	2017-07-22 15:42:44 +02:00
Matthew Honnibal	796b2f4c1b	Remove print statements in tests	2017-07-22 15:42:38 +02:00
ines	7c4bf9994d	Add note on requirements and preventing model re-downloads (closes #1143 )	2017-07-22 15:40:12 +02:00
ines	de25bad036	Use lower min version for requests dependency (fixes #1137 ) Ensure compatibility with docker-compose and other packages	2017-07-22 15:29:10 +02:00
ines	d7560047c5	Fix version	2017-07-22 15:24:33 +02:00
Matthew Honnibal	af945ea8e2	Merge branch 'master' of https://github.com/explosion/spaCy	2017-07-22 15:09:59 +02:00
Matthew Honnibal	4b2e5e59ed	Add flush_cache method to tokenizer, to fix #1061 The tokenizer caches output for common chunks, for efficiency. This cache must be invalidated when the tokenizer rules change, e.g. when a new special-case rule is introduced. That's what was causing #1061. When the cache is flushed, we free the intermediate token chunks. I think this is safe --- but if we start getting segfaults, this patch is to blame. The resolution would be to simply not free those bits of memory. They'll be freed when the tokenizer exits anyway.	2017-07-22 15:06:50 +02:00
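
Note on commit 4b2e5e59ed: the message describes a per-chunk cache that goes stale when a special-case rule is added. The following is a minimal pure-Python sketch of that idea, not spaCy's actual implementation (which is written in Cython and manages memory directly). The class `CachingTokenizer` and its methods are hypothetical names chosen for illustration; only the name `flush_cache` is taken from the commit message.

```python
class CachingTokenizer:
    """Toy whitespace tokenizer with a per-chunk cache (hypothetical, not spaCy's)."""

    def __init__(self):
        self._special_cases = {}  # chunk -> list of tokens
        self._cache = {}          # chunk -> previously computed tokens

    def add_special_case(self, chunk, tokens):
        self._special_cases[chunk] = list(tokens)
        # Cached results may predate the new rule, so drop them all.
        self.flush_cache()

    def flush_cache(self):
        self._cache.clear()

    def _tokenize_chunk(self, chunk):
        # Special cases win; otherwise the chunk is a single token.
        if chunk in self._special_cases:
            return list(self._special_cases[chunk])
        return [chunk]

    def __call__(self, text):
        tokens = []
        for chunk in text.split():
            if chunk not in self._cache:
                self._cache[chunk] = self._tokenize_chunk(chunk)
            tokens.extend(self._cache[chunk])
        return tokens


tok = CachingTokenizer()
before = tok("don't panic")                    # caches "don't" as one token
tok.add_special_case("don't", ["do", "n't"])   # rule change flushes the cache
after = tok("don't panic")
```

If `add_special_case` did not call `flush_cache`, the second call would return the stale cached result for "don't", which is the kind of behaviour the commit message attributes to #1061.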

5251 Commits
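
Note on commit d61c117081: the message mentions a lowest common ancestor matrix obtained via doc.get_lca_matrix(). As a rough illustration of what such a matrix contains, here is a self-contained sketch that works on a plain list of head indices rather than on a spaCy Doc. The function `lca_matrix`, the `heads` encoding (a root points to itself) and the use of -1 for "no common ancestor" are assumptions of this sketch, not a description of the library's implementation. The input is assumed to be a well-formed tree or forest with no cycles.

```python
def lca_matrix(heads):
    """Return m where m[i][j] is the lowest common ancestor of tokens i and j.

    heads[i] is the index of token i's head; a root has heads[i] == i.
    Entries are -1 when two tokens share no ancestor (different trees).
    """
    def path_to_root(i):
        # Token i followed by its ancestors, nearest first.
        path = [i]
        while heads[path[-1]] != path[-1]:
            path.append(heads[path[-1]])
        return path

    n = len(heads)
    paths = [path_to_root(i) for i in range(n)]
    matrix = [[-1] * n for _ in range(n)]
    for i in range(n):
        on_path = set(paths[i])
        for j in range(n):
            # The first node on j's path that also lies on i's path is the LCA.
            for node in paths[j]:
                if node in on_path:
                    matrix[i][j] = node
                    break
    return matrix


# "She likes green apples": She -> likes, likes is the root,
# green -> apples, apples -> likes.
heads = [1, 1, 3, 1]
result = lca_matrix(heads)
```

This sketch is quadratic in the number of tokens, times the tree depth, so it is only meant to make the definition concrete.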