spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-01-30 11:14:08 +03:00

Author	SHA1	Message	Date
svlandeg	19e8f339cb	deduce entity freq from WP corpus and serialize vocab in WP test	2019-04-29 17:37:29 +02:00
svlandeg	387263d618	simplify chains	2019-04-29 13:58:07 +02:00
svlandeg	54d0cea062	unit test for KB serialization	2019-04-24 23:52:34 +02:00
svlandeg	9a7d534b1b	enable nogil for cython functions in kb.pxd	2019-04-10 17:25:10 +02:00
Ines Montani	4faf62d515	Merge pull request #3530 from svlandeg/fix/issue_3521 Allow English stopwords with any type of apostrophe	2019-04-03 14:14:03 +02:00
Yves Peirsman	951825532c	Improved Dutch language resources and Dutch lemmatization (#3409) * Improved Dutch language resources and Dutch lemmatization * Fix conftest * Update punctuation.py * Auto-format * Format and fix tests * Remove unused test file * Re-add deleted test * removed redundant infix regex pattern for ','; note: brackets + simple hyphen remains * Cleaner lemmatization files	2019-04-03 14:13:26 +02:00
svlandeg	4ff786e113	addressed all comments by Ines	2019-04-03 13:50:33 +02:00
Ines Montani	6a4575a56c	Don't make "settings" or "title" required in displaCy data (closes #3531)	2019-04-03 10:13:16 +02:00
svlandeg	85b4319f33	specify encoding in files	2019-04-02 15:05:31 +02:00
svlandeg	673c81bbb4	unicode string for python 2.7	2019-04-02 13:52:07 +02:00
svlandeg	eca9cc5417	fixing Issue #3521 by adding all hyphen variants for each stopword	2019-04-02 13:24:59 +02:00
svlandeg	e7062cf699	failing test for Issue #3521	2019-04-02 13:15:35 +02:00
svlandeg	1424b12b09	failing test for Issue #3449	2019-04-02 13:06:37 +02:00
Ines Montani	c23e234d65	Auto-format	2019-04-01 12:11:27 +02:00
Ines Montani	68900066e0	Merge pull request #3459 from svlandeg/feature/el-framework Basic framework and APIs for entity linker	2019-03-29 14:02:22 +01:00
Hiromu Hota	914b9ff3d2	Tags are joined with a comma and padded with asterisks (#3491) ## Description Fix a bug in the test of JapaneseTokenizer. This PR may require @polm's review. ### Types of change Bug fix ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-03-28 16:17:31 +01:00
Samuel Kane	06a1846379	fix(util): fix decaying function output (#3495) * fix(util): fix decaying function output * fix(util): better test and adhere to code standards * fix(util): correct variable name, pytestify test, update website text	2019-03-28 13:24:47 +01:00
Duygu Altinok	5a7bc6b39d	Fix/irreg adverbs extension (#3499) * extended list of irreg adverbs * added test to exceptions * fixed typo	2019-03-28 13:23:33 +01:00
Sofie	a4a6bfa4e1	Merge branch 'master' into feature/el-framework	2019-03-26 11:00:02 +01:00
svlandeg	8814b9010d	entity as one field instead of both ID and name	2019-03-25 18:10:41 +01:00
Ines Montani	06bf130890	💫 Add better and serializable sentencizer (#3471) * Add better serializable sentencizer component * Replace default factory * Add tests * Tidy up * Pass test * Update docs	2019-03-23 15:45:02 +01:00
Matthew Honnibal	d9a07a7f6e	💫 Fix class mismap on parser deserializing (closes #3433) (#3470) v2.1 introduced a regression when deserializing the parser after parser.add_label() had been called. The code around the class mapping is pretty confusing currently, as it was written to accommodate backwards model compatibility. It needs to be revised when the models are next retrained. Closes #3433	2019-03-23 13:46:25 +01:00
Matthew Honnibal	444a3abfe5	Add xfail test for #3433. Improve test for add label.	2019-03-23 12:36:00 +01:00
Ines Montani	6b6e9b638e	Fix test for #3468	2019-03-23 11:24:29 +01:00
Ines Montani	fbec72b4c3	Slightly modify test for #3468 Check for Token.is_sent_start first (which is serialized/deserialized correctly)	2019-03-23 11:22:44 +01:00
Ines Montani	02d9378d8c	Add xfailing test for #3468	2019-03-23 11:19:11 +01:00
svlandeg	9de9900510	adding future import unicode literals to .py files	2019-03-22 16:18:04 +01:00
svlandeg	9751312aff	specify unicode strings for python 2.7	2019-03-22 14:15:18 +01:00
svlandeg	ec3e860b44	Merge remote-tracking branch 'upstream/master' into feature/el-framework	2019-03-22 13:47:08 +01:00
svlandeg	12d4caf341	Merge remote-tracking branch 'upstream/master' into feature/el-framework	2019-03-22 13:44:36 +01:00
Matthew Honnibal	e65b5bb9a0	Fix tokenizer on Python2.7 (#3460) spaCy v2.1 switched to the built-in re module, where v2.0 had been using the third-party regex library. When the tokenizer was deserialized on Python2.7, the `re.compile()` function was called with expressions that featured escaped unicode codepoints that were not in Python2.7's unicode database. Problems occurred when we had a range between two of these unknown codepoints, like this: ``` '[\\uAA77-\\uAA79]' ``` On Python2.7, the unknown codepoints are not unescaped correctly, resulting in arbitrary out-of-range characters being matched by the expression. This problem does not occur if we instead have a range between two unicode literals, rather than the escape sequences. To fix the bug, we therefore add a new compat function that unescapes unicode sequences using the `ast.literal_eval()` function. Care is taken to ensure we do not also escape non-unicode sequences. Closes #3356. - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-03-22 13:42:47 +01:00
Ines Montani	188ccd5750	Fix xfail marker	2019-03-22 12:54:14 +01:00
svlandeg	5b1cd49222	error msg and unit tests for setting kb_id on span	2019-03-22 12:05:35 +01:00
svlandeg	a48241e9a2	use nlp's vocab for stringstore	2019-03-22 11:36:45 +01:00
svlandeg	c71123dd0c	ensure no candidates are returned for unknown aliases	2019-03-22 11:36:45 +01:00
svlandeg	98ae77a682	unit test on number of candidates generated	2019-03-22 11:36:45 +01:00
svlandeg	a9074e0886	check the length of entities and probabilities vector + unit test	2019-03-22 11:36:45 +01:00
svlandeg	d133ffaff9	correct size, not counting dummy elements in the vector	2019-03-22 11:36:45 +01:00
svlandeg	33f8a0fe2e	check and unit test in case prior probs exceed 1	2019-03-22 11:36:45 +01:00
svlandeg	20a7b7b1c0	raising error when adding alias for unknown entity + unit test	2019-03-22 11:36:45 +01:00
Matthew Honnibal	d811c97da1	Fix test that caused pytest to choke on Python3	2019-03-22 10:28:51 +01:00
Matthew Honnibal	a2ad9832e5	Add failing test for #3356	2019-03-22 02:42:37 +01:00
Ryan Ford	00842d7f1b	Merging conversion scripts for conll formats (#3405) * merging conllu/conll and conllubio scripts * tabs to spaces * removing conllubio2json from converters/__init__.py * Move not-really-CLI tests to misc * Add converter test using no-ud data * Fix test I broke * removing include_biluo parameter * fixing read_conllx * remove include_biluo from convert.py	2019-03-15 18:14:46 +01:00
Ines Montani	bec8db91e6	Add actual deprecation warning for n_threads (resolves #3410 )	2019-03-15 16:38:44 +01:00
Sofie	c45ed32c74	label in span not writable anymore (#3408) * label in span not writable anymore * more explicit unit test and error message for readonly label * bit more explanation (view) * error msg tailored to specific case * fix None case	2019-03-15 00:46:45 +01:00
Ines Montani	479b5cff43	Auto-format [ci skip]	2019-03-12 13:35:34 +01:00
Ines Montani	886e5966c0	Update test_displacy.py	2019-03-11 19:03:52 +01:00
Ines Montani	4bd2688eac	💫 Fix displaCy support for RTL languages (#3393) Closes #2091. ## Description With the new `vocab.writing_system` property introduced in #3390 (exposed via the language defaults), I was able to finally fix this (I think!). Based on the `Doc`, displaCy now detects whether it's a RTL or LTR language and adjusts the visualization accordingly. Wherever possible, I've also added `direction` and `lang` attributes. Entity visualization now looks like this: <img width="318" alt="Screenshot 2019-03-11 at 16 06 51" src="https://user-images.githubusercontent.com/13643239/54136866-d97afd80-441c-11e9-8c27-3d46994cc833.png"> And dependencies like this (ignore the most likely incorrect tags and dependencies): <img width="621" alt="Screenshot 2019-03-11 at 16 51 59" src="https://user-images.githubusercontent.com/13643239/54137771-8b66f980-441e-11e9-8460-0682b95eef2a.png"> ### Types of change enhancement, bug fix ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-03-11 18:52:50 +01:00
Matthew Honnibal	b0b990e405	Fix token.conjuncts (closes #795) (#3392) * Implement conjuncts method * Add span.conjuncts property * Un-xfail token.conjuncts tests * Update docs for token.conjuncts and span.conjuncts * Fix merge error in token.conjuncts	2019-03-11 17:05:45 +01:00
Matthew Honnibal	e2b9b523ce	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2019-03-11 15:59:28 +01:00
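
The approach described in commit e65b5bb9a0 (unescaping unicode sequences with `ast.literal_eval()` before compiling a pattern) can be illustrated with a minimal Python 3 sketch. This is not spaCy's actual compat function: the helper name `unescape_unicode_sketch` and the regular expression it uses are assumptions made for illustration. It converts only `\uXXXX` and `\UXXXXXXXX` escapes into literal characters and leaves every other escape sequence as it was.

```python
import ast
import re

# Matches a backslash followed by either a unicode escape body or any single
# character. Consuming escapes pair by pair means an escaped backslash such
# as "\\\\u0041" is never mistaken for a unicode escape.
_ESCAPE = re.compile(r"\\(u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8}|.)", re.DOTALL)


def unescape_unicode_sketch(pattern):
    """Hypothetical helper: replace \\uXXXX / \\UXXXXXXXX escapes in a regex
    pattern string with the literal characters they denote."""

    def _replace(match):
        body = match.group(1)
        if len(body) > 1:
            # Build a tiny string literal such as u"\uAA77" and evaluate it
            # safely; literal_eval only accepts Python literals.
            return ast.literal_eval('u"\\' + body + '"')
        # Any other escape (\d, \., \\, ...) is returned unchanged.
        return match.group(0)

    return _ESCAPE.sub(_replace, pattern)


# The range from the commit message, first as escape sequences, then unescaped.
escaped = "[\\uAA77-\\uAA79]"
compiled = re.compile(unescape_unicode_sketch(escaped))
```

After unescaping, the character class holds a range between two literal characters rather than two escape sequences, which is the form the commit message reports as unaffected by the Python 2.7 problem.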

1237 Commits