spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-09-18 10:02:40 +03:00

Author	SHA1	Message	Date
Paul O'Leary McCann	c435f748d7	Put Mecab import in utility function	2017-08-22 00:01:28 +09:00
ines	dcff10abe9	Add regression test for #1281	2017-08-21 16:11:47 +02:00
ines	edc596d9a7	Add missing tokenizer exceptions (resolves #1281 )	2017-08-21 16:11:36 +02:00
Paul O'Leary McCann	234a8a7591	Change default tag for 動詞,非自立可能 Example of this is いる in these sentences: 彼はそこにいる。# should be VERB 彼は底に立っている。# should be AUX Unclear which case is more numerous - need to check a large corpus - but in keeping with the other ambiguous tags, this is mapped to the "dominant" or first part of the tag. -POLM	2017-08-21 00:21:45 +09:00
Paul O'Leary McCann	6e9e686568	Sample implementation of Japanese Tagger (ref #1214 ) This is far from complete but it should be enough to check some things. 1. Mecab transition. Janome doesn't support Unidic, only IPAdic, but UD tag mappings are based on Unidic. This replaces Janome with Mecab to get around that. 2. Raw tag extension. A simple tag map can't meet the specifications for UD tag mappings, so this adds an extra field to ambiguous cases. For this demo it just deals with the simplest case, which only needs to look at the literal token. (In reality it may be necessary to look at the whole sentence, but that's another issue.) 3. General code structure. Seems nobody else has implemented a custom Tagger yet, so still not sure this is the correct way to pass the vocabulary around, for example. Any feedback would be greatly appreciated. -POLM	2017-08-08 01:27:15 +09:00
Delirious Lettuce	d3b03f0544	Fix typos: * `auxillary` -> `auxiliary` * `consistute` -> `constitute` * `earlist` -> `earliest` * `prefered` -> `preferred` * `direcory` -> `directory` * `reuseable` -> `reusable` * `idiosyncracies` -> `idiosyncrasies` * `enviroment` -> `environment` * `unecessary` -> `unnecessary` * `yesteday` -> `yesterday` * `resouces` -> `resources`	2017-08-06 21:31:39 -06:00
Matthew Honnibal	d51d55bba6	Increment version	2017-07-22 15:43:16 +02:00
Matthew Honnibal	796b2f4c1b	Remove print statements in tests	2017-07-22 15:42:38 +02:00
Matthew Honnibal	4b2e5e59ed	Add flush_cache method to tokenizer, to fix #1061 The tokenizer caches output for common chunks, for efficiency. This cache must be invalidated when the tokenizer rules change, e.g. when a new special-case rule is introduced. That's what was causing #1061. When the cache is flushed, we free the intermediate token chunks. I think this is safe --- but if we start getting segfaults, this patch is to blame. The resolution would be to simply not free those bits of memory. They'll be freed when the tokenizer exits anyway.	2017-07-22 15:06:50 +02:00
Matthew Honnibal	23a55b40ca	Default to English noun chunks iterator if no lang set	2017-07-22 14:15:25 +02:00
Matthew Honnibal	9750a0128c	Fix Span.noun_chunks. Closes #1207	2017-07-22 14:14:57 +02:00
Matthew Honnibal	d9b85675d7	Rename regression test	2017-07-22 14:14:35 +02:00
Matthew Honnibal	dfbc7e49de	Add test for Issue #1207	2017-07-22 14:14:01 +02:00
Matthew Honnibal	0ae3807d7d	Fix gaps in Lexeme API. Closes #1031	2017-07-22 13:53:48 +02:00
Matthew Honnibal	83e1b5f1e3	Merge branch 'master' of https://github.com/explosion/spaCy	2017-07-22 13:45:35 +02:00
Matthew Honnibal	45f6961ae0	Add __version__ symbol in __init__.py	2017-07-22 13:45:21 +02:00
Matthew Honnibal	8b9c4c5e1c	Add missing SP symbol to tag map, re #1052	2017-07-22 13:44:17 +02:00
Ines Montani	9af04ea11f	Merge pull request #1161 from AlexisEidelman/patch-1 French NUM_WORDS and ORDINAL_WORDS	2017-07-22 13:40:46 +02:00
Matthew Honnibal	44dd247e73	Merge branch 'master' of https://github.com/explosion/spaCy	2017-07-22 13:35:30 +02:00
Matthew Honnibal	94267ec50f	Fix merge conflict in printer	2017-07-22 13:35:15 +02:00
Ines Montani	c7708dc736	Merge pull request #1177 from swierh/master Dutch NUM_WORDS and ORDINAL_WORDS	2017-07-22 13:35:08 +02:00
Matthew Honnibal	5916d46ba8	Avoid use of deepcopy in printer	2017-07-22 13:34:01 +02:00
Ines Montani	9eca6503c1	Merge pull request #1157 from polm/master Add basic Japanese Tokenizer Test	2017-07-10 13:07:11 +02:00
Paul O'Leary McCann	bc87b815cc	Add comment clarifying what LANGUAGES does	2017-07-09 16:28:55 +09:00
Paul O'Leary McCann	04e6a65188	Remove Japanese from LANGUAGES LANGUAGES is a list of languages whose tokenizers get run through a variety of generic tests. Since the generic tests don't check the JA fixture, it blows up when it can't find janome. -POLM	2017-07-09 16:23:26 +09:00
Swier	29720150f9	fix import of stop words in language data	2017-07-05 14:08:04 +02:00
Swier	f377c9c952	Rename stop_words.py to word_sets.py	2017-07-05 14:06:28 +02:00
Swier	5357874bf7	add Dutch numbers and ordinals	2017-07-05 14:03:30 +02:00
Raphaël Bournhonesque	8592f3de47	Fix fuzzy unit tests	2017-07-01 15:03:32 +02:00
Raphaël Bournhonesque	f4748834d9	Use spacy hash_string function instead of md5	2017-07-01 13:17:26 +02:00
Raphaël Bournhonesque	c3d722d66f	Add a disclaimer about classes copied from the Jinja2 project	2017-07-01 13:09:56 +02:00
gispk47	669bd14213	Update __init__.py Remove the empty strings returned by jieba.cut, which cause an assertion error when the list of tokens is pushed	2017-07-01 13:12:00 +08:00
Paul O'Leary McCann	c336193392	Parametrize and extend Japanese tokenizer tests	2017-06-29 00:09:40 +09:00
Paul O'Leary McCann	30a34ebb6e	Add importorskip for janome	2017-06-29 00:09:20 +09:00
Alexis	1b3a5d87ba	French NUM_WORDS and ORDINAL_WORDS	2017-06-28 14:11:20 +02:00
Paul O'Leary McCann	e56fea14eb	Add basic Japanese tokenizer test	2017-06-28 01:24:25 +09:00
Paul O'Leary McCann	84041a2bb5	Make create_tokenizer work with Japanese	2017-06-28 01:18:05 +09:00
Raphaël Bournhonesque	46637369aa	Add basic unit tests for Pattern	2017-06-11 18:34:38 +02:00
Raphaël Bournhonesque	1849a110e3	Improve logging	2017-06-11 18:31:19 +02:00
Raphaël Bournhonesque	4289a21703	Add 'ent' to node matching key	2017-06-11 18:30:53 +02:00
Raphaël Bournhonesque	d010f5a123	Fix node matching bug caused by lower function	2017-06-11 18:30:28 +02:00
Raphaël Bournhonesque	4ca8a396a2	Do not add the root token to the adjacency map	2017-06-11 18:30:01 +02:00
Raphaël Bournhonesque	d9c567371f	Move add_node and add_edge methods to the Tree base class	2017-06-11 18:29:28 +02:00
Raphaël Bournhonesque	8ff4f512a2	Check in PatternParser that the generated Pattern is valid	2017-06-11 18:28:36 +02:00
Raphaël Bournhonesque	e55199d454	Implementation of Pattern	2017-06-11 01:06:24 +02:00
György Orosz	fa26041da6	Fixed typo in cli/package.py	2017-06-07 16:19:08 +02:00
Ines Montani	e7ef51b382	Update tokenizer_exceptions.py	2017-06-02 19:00:01 +02:00
Ines Montani	81918155ef	Merge pull request #1096 from recognai/master Spanish model features	2017-06-02 11:07:27 +02:00
Francisco Aranda	70a2180199	fix(spanish sentence segmentation): remove tokenizer exceptions that break sentence segmentation. Aligned with training corpus	2017-06-02 08:19:57 +02:00
Francisco Aranda	5b385e7d78	feat(spanish model): add the spanish noun chunker	2017-06-02 08:14:06 +02:00
Ines Montani	7f6be41f21	Fix typo in English tokenizer exceptions (resolves #1071 )	2017-05-23 12:18:00 +02:00
Raphaël Bournhonesque	6381ebfb14	Use yield from syntax	2017-05-18 10:42:35 +02:00
Raphaël Bournhonesque	f37d078d6a	Fix issue #1069 with custom hook `Doc.sents` definition	2017-05-18 09:59:38 +02:00
ines	9003fd25e5	Fix error messages if model is required (resolves #1051 ) Rename about.__docs__ to about.__docs_models__.	2017-05-13 13:14:02 +02:00
ines	24e973b17f	Rename about.__docs__ to about.__docs_models__	2017-05-13 13:09:00 +02:00
ines	6e1dbc608e	Fix parse_tree test	2017-05-13 12:34:20 +02:00
ines	573f0ba867	Replace deepcopy	2017-05-13 12:34:14 +02:00
ines	bd428c0a70	Set defaults for light and flat kwargs	2017-05-13 12:34:05 +02:00
ines	c5669450a0	Fix formatting	2017-05-13 12:33:57 +02:00
Matthew Honnibal	ad590feaa8	Fix test, which imported English incorrectly	2017-05-13 11:36:19 +02:00
Ines Montani	8d742ac8ff	Merge pull request #1055 from recognai/master Enable pruning out rare words from clusters data	2017-05-13 03:22:56 +02:00
Matthew Honnibal	b2540d2379	Merge Kengz's tree_print patch	2017-05-13 03:18:49 +02:00
oeg	cdaefae60a	feature(populate_vocab): Enable pruning out rare words from clusters data	2017-05-12 16:15:19 +02:00
ines	b1f22c5a10	Fix formatting	2017-05-03 20:11:02 +02:00
ines	a04b5be1b2	Add glossary for annotation scheme (closes #1034 ) Can be imported as explain from spacy.glossary, or called as spacy.explain(term)	2017-05-03 17:02:17 +02:00
Ines Montani	3ea23a3f4d	Fix formatting	2017-05-03 09:44:38 +02:00
Ines Montani	d730eb0c0d	Raise custom ImportError if importing janome fails	2017-05-03 09:43:29 +02:00
Ines Montani	949ad6594b	Add newline	2017-05-03 09:38:43 +02:00
Ines Montani	d12ca587ea	Add newline	2017-05-03 09:38:29 +02:00
Ines Montani	8676cd0135	Add newline	2017-05-03 09:38:07 +02:00
Yasuaki Uechi	c8f83aeb87	Add basic Japanese support	2017-05-03 13:56:21 +09:00
Matthew Honnibal	31ec9e1371	Merge branch 'master' of https://github.com/explosion/spaCy	2017-04-27 13:21:39 +02:00
Matthew Honnibal	2da16adcc2	Add dropout option for parser and NER Dropout can now be specified in the `Parser.update()` method via the `drop` keyword argument, e.g. nlp.entity.update(doc, gold, drop=0.4) This will randomly drop 40% of features, and multiply the value of the others by 1. / 0.4. This may be useful for generalising from small data sets. This commit also patches the examples/training/train_new_entity_type.py example, to use dropout and fix the output (previously it did not output the learned entity).	2017-04-27 13:18:39 +02:00
Ines Montani	7da9cefd25	Merge pull request #1022 from luvogels/master Initial support for Norwegian Bokmål	2017-04-27 11:16:06 +02:00
Ines Montani	c9e592ae6c	Add newline	2017-04-27 11:15:41 +02:00
Ines Montani	5942adccc2	Add newline	2017-04-27 11:15:19 +02:00
Ines Montani	4cd9269aef	Add newline	2017-04-27 11:15:04 +02:00
Ines Montani	ccf13ecc21	Add newline	2017-04-27 11:14:42 +02:00
Ines Montani	03d2b0cc05	Add newline	2017-04-27 11:14:26 +02:00
luvogels	d12a0b6431	Hooked up tokenizer tests	2017-04-26 23:21:41 +02:00
Matthew Honnibal	f0e1606d27	Increment version	2017-04-26 20:25:41 +02:00
luvogels	b331929a7e	Merge branch 'master' of https://github.com/luvogels/spaCy	2017-04-26 19:15:48 +02:00
luvogels	8de59ce3b9	Added tokenizer tests	2017-04-26 19:10:18 +02:00
Matthew Honnibal	4d98511db7	Make Span hashable. Closes #1019	2017-04-26 19:01:05 +02:00
Matthew Honnibal	24c4c51f13	Try to make test999 less flakey	2017-04-26 18:42:06 +02:00
Leif Uwe Vogelsang	460094bf09	Update __init__.py	2017-04-26 18:27:55 +02:00
ines	527d51ac9a	Fetch shortcuts from GitHub and improve error handling	2017-04-26 18:00:28 +02:00
Matthew Honnibal	c4be9c36fe	Fix unicode header in tests	2017-04-24 10:09:01 +02:00
Matthew Honnibal	65f10b53e5	Fix test	2017-04-24 00:25:55 +02:00
Matthew Honnibal	70a43858e1	Fix flakey test	2017-04-24 00:06:30 +02:00
Matthew Honnibal	3973af2d15	Make training test less flakey	2017-04-23 22:59:34 +02:00
Matthew Honnibal	4f9657b42b	Fix reporting if no dev data with train	2017-04-23 22:27:10 +02:00
Matthew Honnibal	df2ac8b843	Merge branch 'master' of https://github.com/explosion/spaCy	2017-04-23 21:25:07 +02:00
Matthew Honnibal	d0e19267e8	Create directory if missing in save_to_directory	2017-04-23 21:24:43 +02:00
ines	42305bc519	Remove unnecessary test	2017-04-23 21:21:41 +02:00
ines	012ea594d1	Add file for misc tests	2017-04-23 21:06:51 +02:00
ines	83f66947dc	Rename test_download to test_cli	2017-04-23 21:06:50 +02:00
ines	401045433c	Simplify compat.fix_text	2017-04-23 21:06:50 +02:00
Matthew Honnibal	e033c86a64	Increment version	2017-04-23 21:03:43 +02:00
Matthew Honnibal	d2436dc17b	Update fix for Issue #999	2017-04-23 18:14:37 +02:00

2922 Commits