spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-04-18 08:01:58 +03:00

Author	SHA1	Message	Date
Ines Montani	3fe5811fa7	Only link model after download if shortcut link (#3378)	2019-03-10 13:02:24 +01:00
Matthew Honnibal	231bc7bb7b	Add xfailing test for #3345	2019-03-10 13:00:15 +01:00
Matthew Honnibal	bdc77848f5	Add helper method to apply a transition in parser/NER	2019-03-10 13:00:00 +01:00
Matthew Honnibal	ce1fe8a510	Add comment	2019-03-09 17:51:17 +00:00
Matthew Honnibal	28c26e212d	Fix textcat model for GPU	2019-03-09 17:50:08 +00:00
Ines Montani	610fb306bd	Revert hyphens	2019-03-09 12:51:53 +01:00
Ines Montani	bbabb6aaae	Escape more hyphens	2019-03-09 12:41:05 +01:00
Ines Montani	b8db219850	Auto-format	2019-03-09 12:40:58 +01:00
Ines Montani	a145bfe627	Try escaping hyphens again	2019-03-09 03:06:50 +01:00
Ines Montani	b9c71fc0f0	Fix flags	2019-03-09 02:46:04 +01:00
Ines Montani	ae09b6a6cf	Try fixing unicode inconsistencies on Python 2	2019-03-09 02:37:50 +01:00
Ines Montani	d957d7a697	Auto-format	2019-03-09 02:37:41 +01:00
Ines Montani	65402c3d02	Revert "Experiment with escaping hyphens" This reverts commit `9b42e2d5dd`.	2019-03-09 02:13:00 +01:00
Ines Montani	9b42e2d5dd	Experiment with escaping hyphens	2019-03-09 02:05:26 +01:00
Ines Montani	76764fcf59	💫 Improve converters and training data file formats (#3374) * Populate converter argument info automatically * Add conversion option for msgpack * Update docs * Allow reading training data from JSONL	2019-03-08 23:15:23 +01:00
Ines Montani	296446a1c8	Tidy up and improve docs and docstrings (#3370) ## Description * tidy up and adjust Cython code to code style * improve docstrings and make calling `help()` nicer * add URLs to new docs pages to docstrings wherever possible, mostly to user-facing objects * fix various typos and inconsistencies in docs ### Types of change enhancement, docs ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-03-08 11:42:26 +01:00
Ines Montani	daaeeb7a2b	Merge branch 'master' into develop	2019-03-07 22:07:31 +01:00
Adrien Ball	88909a9adb	Fix egg fragments in direct download (#3369) ## Description The egg fragment in the URL must be of the form `#egg=package_name==version` instead of `#egg=package_name-version`. One of the consequences of specifying wrong egg fragments is that `pip` does not recognize the package and its version properly, and thus it re-downloads the package systematically. I'm not sure how this should be tested properly. Here is what I had before the fix when running the same direct download twice: ``` $ python -m spacy download en_core_web_sm-2.0.0 --direct Looking in indexes: https://pypi.python.org/simple/ Collecting en_core_web_sm-2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm-2.0.0 Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB) 100% |████████████████████████████████| 37.4MB 1.6MB/s Generating metadata for package en-core-web-sm-2.0.0 produced metadata for project name en-core-web-sm. Fix your #egg=en-core-web-sm-2.0.0 fragments. Installing collected packages: en-core-web-sm Running setup.py install for en-core-web-sm ... done Successfully installed en-core-web-sm-2.0.0 $ python -m spacy download en_core_web_sm-2.0.0 --direct Looking in indexes: https://pypi.python.org/simple/ Collecting en_core_web_sm-2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm-2.0.0 Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB) 100% |████████████████████████████████| 37.4MB 919kB/s Generating metadata for package en-core-web-sm-2.0.0 produced metadata for project name en-core-web-sm. Fix your #egg=en-core-web-sm-2.0.0 fragments. Requirement already satisfied (use --upgrade to upgrade): en-core-web-sm from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm-2.0.0 in ./venv3/lib/python3.6/site-packages ``` And after the fix: ``` $ python -m spacy download en_core_web_sm-2.0.0 --direct Looking in indexes: https://pypi.python.org/simple/ Collecting en_core_web_sm==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm==2.0.0 Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB) 100% |████████████████████████████████| 37.4MB 1.1MB/s Installing collected packages: en-core-web-sm Running setup.py install for en-core-web-sm ... done Successfully installed en-core-web-sm-2.0.0 $ python -m spacy download en_core_web_sm-2.0.0 --direct Looking in indexes: https://pypi.python.org/simple/ Requirement already satisfied: en_core_web_sm==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm==2.0.0 in ./venv3/lib/python3.6/site-packages (2.0.0) ``` ### Types of change This is an enhancement as it avoids unnecessary downloads of (potentially big) spacy models, when they have already been downloaded. ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-03-07 21:07:19 +01:00
Ines Montani	96b91a8898	Fix noqa [ci skip]	2019-03-07 12:25:00 +01:00
Ines Montani	9d6ca18a10	Tidy up and only use self.vector once	2019-03-07 01:06:12 +01:00
Ines Montani	a8f1efd2f5	Merge branch 'master' into develop	2019-03-07 00:56:31 +01:00
Daniel King	5f40229397	Don't use numpy directly for similarity (#3362) * Don't use numpy directly for similarity * Contributor agreement	2019-03-06 22:58:38 +00:00
Ines Montani	6bd34e9d54	Expose Japanese stop words (closes #3346)	2019-03-06 14:21:15 +01:00
Ines Montani	85deb96278	Fix whitespace	2019-03-06 14:20:34 +01:00
Ines Montani	23f6ebf0f3	Add missing " (closes #3343)	2019-02-27 16:37:03 +01:00
Ines Montani	533b580c19	Add test for stray print statements in languages (see #3342)	2019-02-27 16:04:30 +01:00
Ines Montani	48a2046d1c	Remove stray print statement (closes #3342)	2019-02-27 15:35:04 +01:00
Ines Montani	07d7c0a1af	Fix whitespace	2019-02-27 15:34:21 +01:00
Ines Montani	9b62639d19	Auto-format [ci skip]	2019-02-27 14:24:55 +01:00
Matthew Honnibal	656edcb984	Set version to v2.1.0a10	2019-02-27 12:26:13 +01:00
Matthew Honnibal	f1d77eb140	💫 Improve handling of missing NER tags (closes #2603) (#3341) * Improve handling of missing NER tags GoldParse can accept missing NER tags, if entities is provided in BILUO format (rather than as spans). Missing tags can be provided as None values. Fix bug that occurred when first tag was a None value. Closes #2603. * Document specification of missing NER tags.	2019-02-27 12:06:32 +01:00
Ines Montani	e359bdd0e3	Auto-format	2019-02-27 11:56:45 +01:00
Matthew Honnibal	4a3371acd5	Make doc[0].is_sent_start == True (closes #2869) (#3340) * Make doc[0] have sent_start True. Closes #2869 * Document that doc[0].is_sent_start defaults True.	2019-02-27 11:17:17 +01:00
Matthew Honnibal	2d3ce89b78	Improve matcher tests re issue #3328	2019-02-27 10:25:56 +01:00
Matthew Honnibal	8d6954e0e7	Fix matcher bug #3328	2019-02-27 10:25:39 +01:00
Ines Montani	aadf586789	Add xfailing test for #3331	2019-02-25 22:33:30 +01:00
Matthew Honnibal	3cdd3eb518	Set version to v2.1.0a9	2019-02-25 21:55:19 +01:00
Matthew Honnibal	b449be0f04	Add comment re issue #3170	2019-02-25 21:24:03 +01:00
Matthew Honnibal	9ccd6a3062	Fix head-outside-sentence bug. Fixes #3170	2019-02-25 21:21:44 +01:00
Matthew Honnibal	f2fae1f186	Add batch size argument to Language.evaluate(). Closes #3263	2019-02-25 19:30:33 +01:00
Ines Montani	f135d663f7	Update conftest.py	2019-02-25 15:55:29 +01:00
Ines Montani	76ce8b2662	Merge branch 'master' into develop	2019-02-25 15:54:55 +01:00
Julia Makogon	f1c3108d52	Fixing pymorphy2 dependency issue (#3329) (closes #3327) * Classes for Ukrainian; small fix in Russian. * Contributor agreement * pymorphy2 initialization split for ru and uk (#3327) * stop-words fixed * Unit-tests updated	2019-02-25 15:48:17 +01:00
Ines Montani	1a735e0f1f	Add regression test for #3328	2019-02-25 10:12:58 +01:00
Ines Montani	dfbed07d3b	Remove unused temp errors	2019-02-24 22:26:08 +01:00
Ines Montani	62b558ab72	💫 Support lexical attributes in retokenizer attrs (closes #2390) (#3325) * Fix formatting and whitespace * Add support for lexical attributes (closes #2390) * Document lexical attribute setting during retokenization * Assign variable outside of nested loop	2019-02-24 21:13:51 +01:00
Ines Montani	a48deb4081	Merge regression tests	2019-02-24 21:03:39 +01:00
Ines Montani	8f6c193a4d	Delete _test_issue1622.py	2019-02-24 20:33:31 +01:00
Ines Montani	c8e967c78d	Try include previously segfaulting test	2019-02-24 20:32:46 +01:00
Ines Montani	328b589deb	Merge regression tests	2019-02-24 20:31:38 +01:00
Ines Montani	3bc53905cc	Remove print statements from test	2019-02-24 20:31:15 +01:00
Ines Montani	1ae0df3da9	Un-x-fail passing test	2019-02-24 20:24:15 +01:00
Ines Montani	399a5803d0	Tidy up tests [ci skip]	2019-02-24 19:02:16 +01:00
Ines Montani	2011563c51	Update docstrings [ci skip]	2019-02-24 18:39:59 +01:00
Ines Montani	df19e2bff6	💫 Allow setting of custom attributes during retokenization (closes #3314) (#3324) ## Description This PR adds the ability to override custom extension attributes during merging. This will only work for attributes that are writable, i.e. attributes registered with a default value like `default=False` or attributes that have both a getter and a setter implemented. ```python Token.set_extension('is_musician', default=False) doc = nlp("I like David Bowie.") with doc.retokenize() as retokenizer: attrs = {"LEMMA": "David Bowie", "_": {"is_musician": True}} retokenizer.merge(doc[2:4], attrs=attrs) assert doc[2].text == "David Bowie" assert doc[2].lemma_ == "David Bowie" assert doc[2]._.is_musician ``` ### Types of change enhancement ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-02-24 18:38:47 +01:00
Ines Montani	1ea1bc98e7	Document regex utilities [ci skip]	2019-02-24 18:34:10 +01:00
Matthew Honnibal	1f7c56cd93	Fix parser.add_label()	2019-02-24 16:53:22 +01:00
Matthew Honnibal	893aa40d73	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2019-02-24 16:43:01 +01:00
Matthew Honnibal	5882d82915	Set version to v2.1.0a9.dev2	2019-02-24 16:42:06 +01:00
Matthew Honnibal	0367f864fe	Fix handling of added labels. Resolves #3189	2019-02-24 16:41:41 +01:00
Matthew Honnibal	d74dbde828	Fix order of actions when labels added to parser When labels were added to the parser or NER, we weren't loading back the classes in the correct order. Re issue #3189	2019-02-24 16:36:29 +01:00
Ines Montani	6de81ae310	Fix formatting of errors	2019-02-24 15:11:28 +01:00
Ines Montani	d8f69d592f	Tidy up retokenizer tests	2019-02-24 14:14:11 +01:00
Ines Montani	723e27cb8c	Tidy up tests	2019-02-24 14:11:23 +01:00
Ines Montani	2982f82934	Auto-format	2019-02-24 14:09:15 +01:00
Matthew Honnibal	909a9d9932	Set version to v2.1.0a9.dev1	2019-02-23 13:10:42 +01:00
Matthew Honnibal	6b0008afc6	Clean up TextCategorizer slightly	2019-02-23 12:28:06 +01:00
Matthew Honnibal	d13b9373bf	Improve initialization for mutually exclusive textcat	2019-02-23 12:27:45 +01:00
Matthew Honnibal	e9dd5943b9	Support exclusive_classes setting for textcat models	2019-02-23 11:57:16 +01:00
Matthew Honnibal	ce1e4eace2	Default to former TextCategorizer model * Keep TextCategorizer default model same as v2.0 * Add option 'architecture' that allows "simple_cnn" to switch to simpler model. * Add option exclusive_classes, defaulting to False. If set to True, the model treats classes as mutually exclusive, i.e. only one class can be true per instance.	2019-02-23 11:55:16 +01:00
Matthew Honnibal	829c9091a4	Set version to v2.1.0a9.dev0	2019-02-21 17:13:34 +01:00
Matthew Honnibal	d396a69c7b	More fixes for issue #3112	2019-02-21 17:12:23 +01:00
Ines Montani	80bdcb99c5	Fix escaping of HTML in displacy ENT (closes #2728)	2019-02-21 14:30:39 +01:00
Matthew Honnibal	7d529ebdfb	Set version to v2.1.0a8	2019-02-21 12:09:34 +01:00
Matthew Honnibal	f75be6e7be	Set version to v2.1.0a8.dev1	2019-02-21 11:57:06 +01:00
Matthew Honnibal	c5f947f194	Fix regex deprecation warnings	2019-02-21 11:56:47 +01:00
Matthew Honnibal	7f02464494	Set version to v2.1.0a8.dev0	2019-02-21 11:42:23 +01:00
Matthew Honnibal	f31dbec528	More fixes for #3112	2019-02-21 11:10:10 +01:00
Matthew Honnibal	80195bc2d1	Fix issue #3288 (#3308)	2019-02-21 09:48:53 +01:00
Matthew Honnibal	a137e8b418	Fix Pipe.to_bytes() when model uninitialized Closes #3289	2019-02-21 09:42:02 +01:00
Matthew Honnibal	6574e4f2d3	Fix issue #3112 part 1	2019-02-21 09:27:38 +01:00
Matthew Honnibal	b21481eeca	Load token_match regex with .match, not .search	2019-02-21 09:09:03 +01:00
Sofie	9a478b6db8	Clean up of char classes, few tokenizer fixes and faster default French tokenizer (#3293) * splitting up latin unicode interval * removing hyphen as infix for French * adding failing test for issue 1235 * test for issue #3002 which now works * partial fix for issue #2070 * keep the hyphen as infix for French (as it was) * restore french expressions with hyphen as infix (as it was) * added succeeding unit test for Issue #2656 * Fix issue #2822 with custom Italian exception * Fix issue #2926 by allowing numbers right before infix / * remove duplicate * remove xfail for Issue #2179 fixed by Matt * adjust documentation and remove reference to regex lib	2019-02-20 22:10:13 +01:00
Matthew Honnibal	0d1ca15b13	💫 Fix bugs in matcher extensions. Closes #1971 (#3301) * Fix matching on extension attrs and predicates * Fix detection of match_id when using extension attributes. The match ID is stored as the last entry in the pattern. We were checking for this with nr_attr == 0, which didn't account for extension attributes. * Fix handling of predicates. The wrong count was being passed through, so even patterns that didn't have a predicate were being checked. * Fix regex pattern * Fix matcher set value test	2019-02-20 21:30:39 +01:00
Ines Montani	3b667787a9	Add xfailing test for #3289	2019-02-18 16:45:04 +01:00
Ines Montani	91f260f2c4	Add another test for #1971	2019-02-18 13:36:20 +01:00
Ines Montani	f30aac324c	Update test_issue1971.py	2019-02-18 13:36:15 +01:00
Ines Montani	8fa26ca97e	Fix tensor shape in test for #3288	2019-02-18 11:01:54 +01:00
Ines Montani	c32290557f	Add xfailing test for #3288	2019-02-18 10:59:31 +01:00
Ines Montani	3fdcdec6a0	Merge branch 'master' into develop	2019-02-18 10:03:32 +01:00
Roshni Biswas	e09f1347fa	updates for Bengali language (#3286) * Update morph_rules.py * contributor agreement for roshni-b * created example sentences	2019-02-18 10:02:28 +01:00
Ines Montani	043e8186f3	Merge branch 'master' into develop	2019-02-17 17:51:17 +01:00
Marc Puig	51268e9f21	Typo error fixed (#3284)	2019-02-17 17:51:02 +01:00
Ines Montani	3af0b2dd1c	Add xfailing test for #1971 [ci skip]	2019-02-17 13:04:47 +01:00
Ines Montani	19a002bfd3	Merge branch 'master' into develop	2019-02-17 12:22:54 +01:00
Ines Montani	1e252b129c	Auto-format	2019-02-17 12:22:07 +01:00
Roshni Biswas	e26d923726	Update morph_rules.py (#3283)	2019-02-17 12:21:47 +01:00
Matthew Honnibal	7d4a52a4d0	Set version to v2.1.0a7	2019-02-16 17:48:34 +01:00
Matthew Honnibal	07617b6b7f	Set version to v2.1.0a7.dev12	2019-02-16 17:30:29 +01:00
Matthew Honnibal	1dc314bada	Set version to v2.1.0a7.dev11	2019-02-16 17:02:49 +01:00
Matthew Honnibal	2ef227c313	Set version to v2.1.0a7.dev1	2019-02-16 16:22:46 +01:00
Matthew Honnibal	22923b9cb1	Set version to v2.1.0a7.dev9	2019-02-16 15:47:19 +01:00
Matthew Honnibal	e0c91a4c8d	Set version to 2.1.0a7	2019-02-16 14:43:38 +01:00
Matthew Honnibal	92b6bd2977	Refinements to retokenize.split() function (#3282) * Change retokenize.split() API for heads * Pass lists as values for attrs in split * Fix test_doc_split filename * Add error for mismatched tokens after split * Raise error if new tokens don't match text * Fix doc test * Fix error * Move deps under attrs * Fix split tests * Fix retokenize.split	2019-02-15 17:32:31 +01:00
Matthew Honnibal	2dbc61bc26	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2019-02-15 14:03:54 +01:00
Ines Montani	1aa57690dc	Add xfailing test for orth mismatch in retokenizer.split	2019-02-15 13:55:04 +01:00
Ines Montani	819768483f	Add xfailing test for out-of-bounds heads	2019-02-15 13:09:07 +01:00
Ines Montani	d8051e89ca	Tidy up tests	2019-02-15 12:56:51 +01:00
Matthew Honnibal	58aac58631	Set version to v2.1.0a7.dev8	2019-02-15 12:39:26 +01:00
Matthew Honnibal	5f1abe2cc7	Set version to v2.1.0a7.dev7	2019-02-15 10:30:53 +01:00
Matthew Honnibal	a66e8e0c8a	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2019-02-15 10:30:22 +01:00
Ines Montani	c31a9dabd5	💫 Add en/em dash to prefixes and suffixes (#3281) * Auto-format * Add en/em dash to prefixes and suffixes	2019-02-15 10:29:59 +01:00
Ines Montani	5651a0d052	💫 Replace {Doc,Span}.merge with Doc.retokenize (#3280) * Add deprecation warning to Doc.merge and Span.merge * Replace {Doc,Span}.merge with Doc.retokenize	2019-02-15 10:29:44 +01:00
Matthew Honnibal	dcf79c5ef3	Set version to v2.1.0a7.dev6	2019-02-14 20:12:02 +01:00
Matthew Honnibal	0371ac23e7	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2019-02-14 20:09:10 +01:00
Ines Montani	f146121092	💫 Make handling of [Pipe].labels consistent (#3273) * Make handling of [Pipe].labels consistent * Un-xfail passing test * Update spacy/pipeline/pipes.pyx Co-Authored-By: ines <ines@ines.io> * Update spacy/pipeline/pipes.pyx Co-Authored-By: ines <ines@ines.io> * Update spacy/tests/pipeline/test_pipe_methods.py Co-Authored-By: ines <ines@ines.io> * Update spacy/pipeline/pipes.pyx Co-Authored-By: ines <ines@ines.io> * Move error message to spacy.errors * Fix textcat labels and test * Make EntityRuler.labels return tuple as well	2019-02-15 06:03:19 +11:00
Ines Montani	3d577b77c6	Auto-formatting	2019-02-14 19:56:38 +01:00
Ines Montani	2569339a98	Formatting and whitespace [ci skip]	2019-02-14 18:05:07 +01:00
Matthew Honnibal	aebf71bc72	Set version to v2.1.0a7.dev5	2019-02-14 15:51:42 +01:00
Matthew Honnibal	6ccd67c682	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2019-02-14 15:51:12 +01:00
Ines Montani	e104e47c21	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2019-02-14 15:35:34 +01:00
Ines Montani	0cd01a8c5e	Merge branch 'master' into develop	2019-02-14 15:35:20 +01:00
Ines Montani	2e31921d0a	💫 Add base Language classes for more languages (#3276) * Add base classes for more languages * Add test for language class initialization Make sure language can be initialized – otherwise, it's difficult to catch serious errors in the test suite, because languages are lazy-loaded	2019-02-15 01:31:19 +11:00
Grivaz	39815513e2	Add split one token into several (resolves #2838) (#3253) * Add split one token into several (resolves #2838) * Improve error message for token splitting * Make retokenizer.split() tests use a Token object Change retokenizer.split() to use a Token object, instead of an index. * Pass Token into retokenize.split() Tweak retokenize.split() API so that we pass the `Token` object, not the index. * Fix token.idx in retokenize.split() * Test that token.idx is correct after split * Fix token.idx for split tokens * Fix retokenize.split() * Fix retokenize.split * Fix retokenize.split() test	2019-02-15 01:27:13 +11:00
Ines Montani	743ecf728c	Tidy up conftest	2019-02-14 13:27:13 +01:00
Ines Montani	106d95b01a	Fix typo	2019-02-14 12:26:56 +01:00
Ines Montani	11d6b874db	Update stop_words.py	2019-02-14 12:25:19 +01:00
Ines Montani	60c2a3bb65	Also raise original error message in util.get_lang_class Otherwise, the true error that happens within a Language subclass is swallowed, because if it's imported lazily like that, it'll always be an ImportError	2019-02-13 16:52:25 +01:00
Ines Montani	4d2438f985	Tidy up and auto-format	2019-02-13 15:29:08 +01:00
Ines Montani	fbf9f1edf1	Also raise error in Span.__reduce__	2019-02-13 13:22:05 +01:00
Matthew Honnibal	1831e1423d	Set version to v2.1.0a7.dev4	2019-02-13 23:08:40 +11:00
Matthew Honnibal	63dc4234a3	Set version to v2.1.0a7.dev3	2019-02-13 22:53:10 +11:00
Matthew Honnibal	b7ea39564f	Set version to v2.1.0a7.dev2	2019-02-13 22:52:43 +11:00
Ines Montani	2d0c3c73f4	Raise better error if token is pickled (resolves #2833) (#3267)	2019-02-13 11:27:04 +01:00
Ines Montani	2f45bd94c0	Auto-formatting	2019-02-12 18:30:11 +01:00
Ines Montani	0184a95340	Merge branch 'master' into develop	2019-02-12 18:29:24 +01:00
Akhilesh	a78db10941	add kannada support (#3264) * add kannada support * add few more stop words * add support for Kannada Language	2019-02-12 18:28:39 +01:00
Ines Montani	b589b945db	Fix PhraseMatcher pickling and length (resolves #3248) (#3252)	2019-02-12 18:27:54 +01:00
Ines Montani	483dddc9bc	💫 Add token match pattern validation via JSON schemas (#3244) * Add custom MatchPatternError * Improve validators and add validation option to Matcher * Adjust formatting * Never validate in Matcher within PhraseMatcher If we do decide to make validate default to True, the PhraseMatcher's Matcher shouldn't ever validate. Here, we create the patterns automatically anyways (and it's currently unclear whether the validation has performance impacts at a very large scale).	2019-02-13 01:47:26 +11:00
Ines Montani	ad2a514cdf	Show warning if phrase pattern Doc was overprocessed (#3255) In most cases, the PhraseMatcher will match on the verbatim token text or as of v2.1, sometimes the lowercase text. This means that we only need a tokenized Doc, without any other attributes. If phrase patterns are created by processing large terminology lists with the full `nlp` object, this can easily make things a lot slower, because all components will be applied, even if we don't actually need the attributes they set (like part-of-speech tags, dependency labels). The warning message also includes a suggestion to use nlp.make_doc or nlp.tokenizer.pipe for even faster processing. For now, the validation has to be enabled explicitly by setting validate=True.	2019-02-13 01:45:31 +11:00
Matthew Honnibal	6ec834dc72	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2019-02-13 01:14:44 +11:00
Matthew Honnibal	43fa039d96	xfail regression test for model labels	2019-02-13 01:14:26 +11:00
Matthew Honnibal	bc300d4e31	Add test for issue 3209	2019-02-13 01:13:01 +11:00
Ines Montani	34a3cc26a9	Add xfailing test for reverse pattern (see #1971)	2019-02-12 14:49:59 +01:00
Ines Montani	fe39fd4d13	Make warning tests more explicit	2019-02-10 14:02:19 +01:00
Ines Montani	a9bf5d9fd8	💫 Break up large pipeline.pyx (#3246) * Break up large pipeline.pyx * Merge some components back together * Fix typo	2019-02-10 12:14:51 +01:00
Ines Montani	e7593b791e	Fix import	2019-02-08 20:50:52 +01:00
Ines Montani	0754b848fe	Actually xfail test for #1971	2019-02-08 20:50:35 +01:00
Ines Montani	414a69b736	Add xfailing test (see #1971, #2675, #2671)	2019-02-08 20:50:01 +01:00
Ines Montani	ea07f3022e	Only run noun chunks iterator in Span if available (closes #3199)	2019-02-08 18:33:16 +01:00
Ines Montani	ff36b14cb2	Fix whitespace	2019-02-08 18:31:31 +01:00
Ines Montani	f4ce7bb7e9	Fix typo and deprecation message (resolves #3195) [ci skip]	2019-02-08 18:09:23 +01:00
Ines Montani	694139aad3	Fix formatting [ci skip]	2019-02-08 16:32:36 +01:00
Ines Montani	2898768757	Remove unused attribute [ci skip]	2019-02-08 16:31:30 +01:00
Ines Montani	586c56fc6c	Tidy up regression tests	2019-02-08 15:51:13 +01:00
Ines Montani	25602c794c	Tidy up and fix small bugs and typos	2019-02-08 14:14:49 +01:00
Ines Montani	9e652afa4b	Merge branch 'master' into develop	2019-02-08 13:28:09 +01:00
Björn Lennartsson	647f0140c7	Fixed tag map for Swedish Talbanken (#3186)	2019-02-08 14:28:59 +11:00
Stanisław Giziński	1448ad100c	Improved Polish tokenizer and stop words. (#2974) * Improved stop words list * Removed some wrong stop words from list * Improved Polish Tokenizer (#38) * Add tests for Polish tokenizer * Add Polish tokenizer exceptions * Don't split any words containing hyphens * Fix test case with wrong model answer * Remove commented out line of code until better solution is found * Add source srx' license * Rename exception_list.py to match spaCy conventions * Add a brief explanation of where the exception list comes from * Add newline after each exception * Rename COPYING.txt to LICENSE * Delete old files * Add header to the license * Agreements signed * Stanisław Giziński agreement * Krzysztof Kowalczyk - signed agreement * Mateusz Olko agreement * Add DoomCoder's contributor agreement * Improve like number checking in Polish lang * like num tests added * all from SI system added * Final licence and removed splitting exceptions * Added Polish stop words to LEX_ATTRA * Add encoding info to pl tokenizer exceptions	2019-02-08 14:27:21 +11:00
Ines Montani	402d133c90	Add Ukrainian unicode	2019-02-07 21:11:58 +01:00
Ines Montani	e2d93e4852	Merge branch 'master' into develop	2019-02-07 21:10:08 +01:00
Ines Montani	2499da97e8	Format	2019-02-07 21:07:02 +01:00
Julia Makogon	b41d64825a	Ukrainian language added. Small fixes in Russian (#3241) * Classes for Ukrainian; small fix in Russian. * Contributor agreement	2019-02-07 21:05:11 +01:00
Ines Montani	77efee0295	Auto-format	2019-02-07 21:00:04 +01:00
Ines Montani	5d0b60999d	Merge branch 'master' into develop	2019-02-07 20:54:07 +01:00
Matthew Honnibal	dbeebfa3a2	Set version to v2.1.0a7.dev1	2019-02-08 01:54:01 +11:00
Ines Montani	338d659bd0	Store JSON schemas in Python and tidy up (#3235)	2019-02-07 19:44:31 +11:00
Ines Montani	1ea4df459d	💫 Break up large matcher.pyx (#3236) * Break up large matcher.pyx * Remove unused function	2019-02-07 19:42:25 +11:00
Ines Montani	a9bf5d9fd8	Add xfailing test for set value with operator [ci skip]	2019-02-06 13:40:11 +01:00
Ines Montani	e51a238b3f	Auto-format	2019-02-06 13:32:18 +01:00
Ines Montani	f25bd9f5e4	Add gold.spans_from_biluo_tags helper (#3227)	2019-02-06 21:50:26 +11:00
Ines Montani	5e16490d9d	Fix default argument in TextCategorizer.Model (resolves #3221)	2019-02-05 12:33:47 +01:00
Ines Montani	89ad095900	Fix whitespace	2019-02-05 12:32:20 +01:00
Sofie	9745b0d523	Improve Italian & Urdu tokenization accuracy (#3228) ## Description 1. Added the same infix rule as in French (`d'une`, `j'ai`) for Italian (`c'è`, `l'ha`), bringing F-score on `it_isdt-ud-train.txt` from 96% to 99%. Added unit test to check this behaviour. 2. Added specific Urdu punctuation character as suffix, improving F-score on `ur_udtb-ud-train.txt` from 94% to 100%. Added unit test to check this behaviour. ### Types of change Enhancement of Italian & Urdu tokenization ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-02-04 22:39:25 +01:00
Sofie	a3efa3e8d9	Improve Catalan tokenization accuracy (#3225) * small hyphen clean up for French * catalan infix similar to french	2019-02-04 20:37:19 +11:00
Ines Montani	e00680a33a	Remove unused outdated file	2019-02-01 11:39:48 +01:00
Matthew Honnibal	27e3f98cae	Set version to v2.1.0a7.dev0	2019-02-01 18:06:34 +11:00
Sofie	46dfe773e1	Replacing regex library with re to increase tokenization speed (#3218) * replace unicode categories with raw list of code points * simplifying ranges * fixing variable length quotes * removing redundant regular expression * small cleanup of regexp notations * quotes and alpha as ranges instead of alternations * removed most regexp dependencies and features * exponential backtracking - unit tests * rewrote expression with pathological backtracking * disabling double hyphen tests for now * test additional variants of repeating punctuation * remove regex and redundant backslashes from load_reddit script * small typo fixes * disable double punctuation test for russian * clean up old comments * format block code * final cleanup * naming consistency * french strings as unicode for python 2 support * french regular expression case insensitive	2019-02-01 18:05:22 +11:00
Amandine Périnet	d570e75dbb	Improving the French lookup dictionary for ambiguous words (#3185) * modifying FR lookup to remove ambiguity and adding lookup vocab to FR files * updating the contributor agreement for amperinet	2019-01-31 23:53:45 +01:00
Ines Montani	e9a6dbe4f3	Don't check for Jupyter in global scope and fix check (#3213) Resolves #3208. Prevent interactions with other libraries (pandas) that also access `get_ipython().config` and its parameters. See #3208 for details. I don't fully understand why this happens, but in spaCy, we can at least make sure we avoid calling into this method. ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-01-31 23:49:13 +01:00
Amandine Périnet	b34bc9d2e9	add small fix for French lemmatizer (#3206)	2019-01-31 23:44:10 +01:00
Loghi	5ca8e2b269	Tamil (#3194) * Tamil language support stop words, examples and numerical attribute supports added Contributor agreement signed * Create Loghijiaha.md Added contributor agreement * Update CONTRIBUTOR_AGREEMENT.md Adjusted contributor_agreement.md * Norm exceptions added	2019-01-27 06:02:04 +01:00
foufaster	8bd85fd9d5	Fix french lemmatization (#3180)	2019-01-27 06:01:30 +01:00
Sofie	66016ac289	Batch UD evaluation script (#3174) * running UD eval * printing timing of tokenizer: tokens per second * timing of default English model * structured output and parameterization to compare different runs * additional flag to allow evaluation without parsing info * printing verbose log of errors for manual inspection * printing over- and undersegmented cases (and combos) * add under and oversegmented numbers to Score and structured output * print high-freq over/under segmented words and word shapes * printing examples as part of the structured output * print the results to file * batch run of different models and treebanks per language * cleaning up code * commandline script to process all languages in spaCy & UD * heuristic to remove blinded corpora and option to run one single best per language * pathlib instead of os for file paths	2019-01-27 06:01:02 +01:00
Matthew Honnibal	5a4737df09	Set version to 2.1.0a6	2019-01-21 18:32:34 +01:00
Matthew Honnibal	246538be2e	Set version to 2.1.0a6.dev1	2019-01-21 15:12:17 +01:00
Matthew Honnibal	77ddcf7381	💫 Update matcher engine for regex and extensions (#3173 ) * Update matcher engine for regex and extensions Add support for matching over arbitrary Python predicate functions, and arbitrary Python attribute getters. This will allow matching over regex patterns, and allow supporting extension attributes. The results of the Python predicate functions are cached, so that we don't call the same predicate function twice for the same token. The extension attributes are fetched into an array for each token in the doc. This should minimise the performance impact of the new features. We still need to wire up these features to the patterns, and test it all. * Work on wiring up extra attributes in matcher * Work on tests for extra matcher attrs * Add support for extension attrs to matcher * Test extension attribute matching * Work on implementing predicate-based match patterns * Get predicates working for set membership * Add test for set membership * Make extensions+predicates work * Test matcher extensions * Cache predicate results better in Matcher * Remove print statement in matcher test * Use srsly to get key for predicates	2019-01-21 13:23:15 +01:00
Björn Lennartsson	b892b446cc	Updates to Swedish Language (#3164 ) * Added the same punctuation rules as Danish language. * Added abbreviations and also the possibility to have capitalized abbreviations on some. Added a few specific cases too * Added test for long texts in Swedish * Added morph rules, infixes and suffixes to __init__.py for Swedish * Added some tests for prefixes, infixes and suffixes * Added tests for lemma * Renamed files to follow convention * [sv] Removed ambiguous abbreviations * Added more tests for tokenizer exceptions * Added test for problem with punctuation in issue #2578 * Contributor agreement * Removed faulty lemmatization of 'jag' ('I') as it was lemmatized to 'jaga' ('hunt')	2019-01-16 13:45:50 +01:00
Gavriel Loria	9a5003d5c8	iob converter: add 'exception' for error 'too many values' (#3159 ) * added contributor agreement * issue #3128 throw exception on bad IOB/2 formatting * Update spacy/cli/converters/iob2json.py with ValueError Co-Authored-By: gavrieltal <gtloria@protonmail.com>	2019-01-16 13:44:16 +01:00
Mark Neumann	e599ed9ef8	Allow vectors to be optional in init-model, more robust string counting (#3155 ) * more robust init-model * key not word * add license agreement	2019-01-14 23:48:30 +01:00
mauryaland	214c2ec263	check if argument flat is true or not (#3156 )	2019-01-14 23:47:05 +01:00
Loghi	d97661d18b	Tamil language support (#3154 ) Tamil language support to spaCy Description Hereby, creating new PR to add support for Tamil language in spaCy added stop words, examples and numerical attributes <--Working on other language data--> Types of change Enhancement Checklist [ x] I have submitted the spaCy Contributor Agreement. [x ] I ran the tests, and all new and existing tests passed. [ x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-01-14 15:32:30 +01:00
Amandine Périnet	ee24e2534d	French lemmatization: adding lemmas for adverbs and irregular lemmas for function words (#3131 ) * adding adverbs and irregular cases for empty words * adding adverbs and irregular cases for empty words * adding adverbs and irregular cases for empty words * updating contributor agreement for amperinet	2019-01-10 15:41:15 +01:00
Kirill Bulygin	7b064542f7	Making `lang/th/test_tokenizer.py` pass by creating `ThaiTokenizer` (#3078 )	2019-01-10 15:40:37 +01:00
Álvaro Abella Bascarán	1cd8f9823f	Correct docs of `Token.subtree` and `Span.subtree` (issue #3122 ) (#3124 ) * solve inconsistency between docs and Span.subtree (issue #3122) * solve inconsistency between docs and Token.subtree (issue #3122)	2019-01-09 03:11:15 +01:00
Amandine Périnet	eef11a7a2c	French lemmatization: correcting wrong lemmas in the lookup dictionary (#3104 ) * modifying French lookup that contained wrong lemmas * correcting wrong line breaks on hyphen * adding contributor agreement for amperinet@ * correcting a typo	2019-01-07 14:15:19 +01:00
Álvaro Abella Bascarán	e03e1eee92	Bugfix/get lca matrix (#3110 ) This PR adds a test for an untested case of `Span.get_lca_matrix`, and fixes a bug for that scenario, which I introduced in [this PR](https://github.com/explosion/spaCy/pull/3089) (sorry!). ## Description The previous implementation of get_lca_matrix was failing for the case `doc[j:k].get_lca_matrix()` where `j > 0`. A test has been added for this case and the bug has been fixed. ### Types of change Bug fix ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-01-06 19:07:50 +01:00
Matthew Honnibal	fe4e68cb71	Set version to v2.1.0a6.dev0	2019-01-05 14:44:42 +01:00
Matthew Honnibal	3c09d3d986	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-30 15:49:57 +01:00
Matthew Honnibal	d8d0ce081b	Fix clobber of doc.is_tagged in doc.from_array() If doc.from_array() was called with say, only entity information, this would cause doc.is_tagged to be set to False, even if tags were set. This caused tags to be dropped from serialisation. The same was true for doc.is_parsed. Closes #3012.	2018-12-30 15:48:10 +01:00

5879 Commits