spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-01-08 17:51:16 +03:00

Author	SHA1	Message	Date
Ines Montani	5d9212c44c	Remove unused imports	2019-04-01 11:46:25 +02:00
Ines Montani	8d6b544632	Auto-format	2019-04-01 11:45:43 +02:00
jeannefukumaru	6567f27849	added missing SCONJ symbol	2019-04-01 17:02:53 +08:00
jeannefukumaru	082a0a2232	added utf8 encoding flag	2019-04-01 16:37:11 +08:00
jeannefukumaru	a741bed7a7	added symbols import	2019-04-01 16:21:06 +08:00
jeannefukumaru	745cf0c914	changed tag map from .py to .txt to see if tests pass	2019-04-01 07:04:50 +08:00
jeannefukumaru	3cc897102f	added tag_map for indonesian	2019-04-01 00:00:08 +08:00
Matthew Honnibal	e64b241f9c	Merge branch 'master' of https://github.com/explosion/spaCy	2019-03-31 13:58:38 +02:00
Ines Montani	68900066e0	Merge pull request #3459 from svlandeg/feature/el-framework Basic framework and APIs for entity linker	2019-03-29 14:02:22 +01:00
Hiromu Hota	914b9ff3d2	Tags are joined with a comma and padded with asterisks (#3491) ## Description Fix a bug in the test of JapaneseTokenizer. This PR may require @polm's review. ### Types of change Bug fix ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-03-28 16:17:31 +01:00
Samuel Kane	06a1846379	fix(util): fix decaying function output (#3495) * fix(util): fix decaying function output * fix(util): better test and adhere to code standards * fix(util): correct variable name, pytestify test, update website text	2019-03-28 13:24:47 +01:00
Duygu Altinok	5a7bc6b39d	Fix/irreg adverbs extension (#3499) * extended list of irreg adverbs * added test to exceptions * fixed typo	2019-03-28 13:23:33 +01:00
Bharat Raghunathan	1db3e47509	DOC: Update tokenizer docs to include default value for batch_size in pipe (#3492)	2019-03-28 12:48:02 +01:00
Matthew Honnibal	f77bf2bdb1	Fix GPU training for textcat. Closes #3473	2019-03-26 13:36:11 +01:00
Sofie	a4a6bfa4e1	Merge branch 'master' into feature/el-framework	2019-03-26 11:00:02 +01:00
svlandeg	8814b9010d	entity as one field instead of both ID and name	2019-03-25 18:10:41 +01:00
Wannaphong Phatthiyaphaibun	297a051992	Update Thai tag map (#3480) * Update Thai tag map Update Thai tag map * Create wannaphongcom.md	2019-03-25 16:53:26 +01:00
Matthew Honnibal	85dcd9477e	Set version to v2.1.3	2019-03-23 16:47:57 +01:00
Matthew Honnibal	f436efd8a4	Small tweak to ensemble textcat model	2019-03-23 16:47:26 +01:00
Matthew Honnibal	6c783f8045	Bug fixes and options for TextCategorizer (#3472) * Fix code for bag-of-words feature extraction The _ml.py module had a redundant copy of a function to extract unigram bag-of-words features, except one had a bug that set values to 0. Another function allowed extraction of bigram features. Replace all three with a new function that supports arbitrary ngram sizes and also allows control of which attribute is used (e.g. ORTH, LOWER, etc). * Support 'bow' architecture for TextCategorizer This allows efficient ngram bag-of-words models, which are better when the classifier needs to run quickly, especially when the texts are long. Pass architecture="bow" to use it. The extra arguments ngram_size and attr are also available, e.g. ngram_size=2 means unigram and bigram features will be extracted. * Fix size limits in train_textcat example * Explain architectures better in docs	2019-03-23 16:44:44 +01:00
Ines Montani	06bf130890	💫 Add better and serializable sentencizer (#3471) * Add better serializable sentencizer component * Replace default factory * Add tests * Tidy up * Pass test * Update docs	2019-03-23 15:45:02 +01:00
Matthew Honnibal	d9a07a7f6e	💫 Fix class mismap on parser deserializing (closes #3433) (#3470) v2.1 introduced a regression when deserializing the parser after parser.add_label() had been called. The code around the class mapping is pretty confusing currently, as it was written to accommodate backwards model compatibility. It needs to be revised when the models are next retrained. Closes #3433	2019-03-23 13:46:25 +01:00
Matthew Honnibal	444a3abfe5	Add xfail test for #3433. Improve test for add label.	2019-03-23 12:36:00 +01:00
Ines Montani	6b6e9b638e	Fix test for #3468	2019-03-23 11:24:29 +01:00
Ines Montani	fbec72b4c3	Slightly modify test for #3468 Check for Token.is_sent_start first (which is serialized/deserialized correctly)	2019-03-23 11:22:44 +01:00
Ines Montani	02d9378d8c	Add xfailing test for #3468	2019-03-23 11:19:11 +01:00
svlandeg	46f4eb5db3	error and warning messages	2019-03-22 16:55:05 +01:00
svlandeg	9de9900510	adding future import unicode literals to .py files	2019-03-22 16:18:04 +01:00
svlandeg	b4cd5d5ee9	property annotations for fields with only a getter	2019-03-22 16:10:49 +01:00
svlandeg	9751312aff	specify unicode strings for python 2.7	2019-03-22 14:15:18 +01:00
svlandeg	5318ce88fa	'entity_linker' instead of 'el'	2019-03-22 13:55:10 +01:00
svlandeg	ec3e860b44	Merge remote-tracking branch 'upstream/master' into feature/el-framework	2019-03-22 13:47:08 +01:00
Ines Montani	c9bd0e5a96	Set version to 2.1.2	2019-03-22 13:44:47 +01:00
svlandeg	12d4caf341	Merge remote-tracking branch 'upstream/master' into feature/el-framework	2019-03-22 13:44:36 +01:00
Matthew Honnibal	e65b5bb9a0	Fix tokenizer on Python2.7 (#3460) spaCy v2.1 switched to the built-in re module, where v2.0 had been using the third-party regex library. When the tokenizer was deserialized on Python2.7, the `re.compile()` function was called with expressions that featured escaped unicode codepoints that were not in Python2.7's unicode database. Problems occurred when we had a range between two of these unknown codepoints, like this: ``` '[\\uAA77-\\uAA79]' ``` On Python2.7, the unknown codepoints are not unescaped correctly, resulting in arbitrary out-of-range characters being matched by the expression. This problem does not occur if we instead have a range between two unicode literals, rather than the escape sequences. To fix the bug, we therefore add a new compat function that unescapes unicode sequences using the `ast.literal_eval()` function. Care is taken to ensure we do not also escape non-unicode sequences. Closes #3356. - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-03-22 13:42:47 +01:00
Ines Montani	188ccd5750	Fix xfail marker	2019-03-22 12:54:14 +01:00
svlandeg	7cf0bc9a8c	delete sandbox folder	2019-03-22 12:25:11 +01:00
svlandeg	5b1cd49222	error msg and unit tests for setting kb_id on span	2019-03-22 12:05:35 +01:00
svlandeg	a48241e9a2	use nlp's vocab for stringstore	2019-03-22 11:36:45 +01:00
svlandeg	1ee0e78fd7	select candidate with highest prior probability	2019-03-22 11:36:45 +01:00
svlandeg	7b708ab8a4	name per entity	2019-03-22 11:36:45 +01:00
svlandeg	c593607ce2	minimal EL pipe	2019-03-22 11:36:45 +01:00
svlandeg	c71123dd0c	ensure no candidates are returned for unknown aliases	2019-03-22 11:36:45 +01:00
svlandeg	b6c3255a9f	Entity class	2019-03-22 11:36:45 +01:00
svlandeg	1289cd6e8f	property getters and keep track of KB internally	2019-03-22 11:36:45 +01:00
svlandeg	98ae77a682	unit test on number of candidates generated	2019-03-22 11:36:45 +01:00
svlandeg	9a46c431c3	store entity hash instead of pointer	2019-03-22 11:36:45 +01:00
svlandeg	9819dca80e	create candidate object from entry pointer (not fully functional yet)	2019-03-22 11:36:45 +01:00
svlandeg	a9074e0886	check the length of entities and probabilities vector + unit test	2019-03-22 11:36:45 +01:00
svlandeg	d133ffaff9	correct size, not counting dummy elements in the vector	2019-03-22 11:36:45 +01:00
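
Note on e65b5bb9a0 (Fix tokenizer on Python2.7): the commit message describes unescaping unicode sequences with `ast.literal_eval()` while leaving other escape sequences alone. What follows is a minimal Python 3 sketch of that idea, not the code from the commit. The function name `unescape_unicode_escapes` and the helper regex `_UNICODE_ESCAPE` are hypothetical, and the sketch assumes only four-digit `\u` escapes need handling.

```python
import ast
import re

# Hypothetical helper: matches a literal backslash, "u", and exactly four hex digits.
_UNICODE_ESCAPE = re.compile(r"\\u[0-9a-fA-F]{4}")


def unescape_unicode_escapes(pattern):
    """Replace \\uXXXX escape sequences in `pattern` with the characters they name.

    Other backslash sequences (such as \\d) are left untouched, so the result
    is still a valid regular expression source string.
    """

    def _to_literal(match):
        # Evaluate only the escape itself as a quoted string literal.
        # ast.literal_eval parses literals and cannot run arbitrary code.
        return ast.literal_eval('"' + match.group(0) + '"')

    return _UNICODE_ESCAPE.sub(_to_literal, pattern)


# The range from the commit message, followed by a non-unicode escape.
escaped = "[\\uAA77-\\uAA79]\\d+"
unescaped = unescape_unicode_escapes(escaped)
compiled = re.compile(unescaped)
```

After the call, the character class holds two literal characters instead of two escape sequences, and `\d+` is unchanged. A limitation of this sketch is that it does not distinguish an escaped backslash followed by `u` from a real unicode escape.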

5884 Commits
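
Note on 6c783f8045 (Bug fixes and options for TextCategorizer): the message states that ngram_size=2 extracts both unigram and bigram features, and that the token attribute (e.g. ORTH, LOWER) is configurable. The counting scheme can be sketched in plain Python, assuming the text is already split into token strings. `extract_ngrams` and its `lowercase` flag are hypothetical stand-ins for illustration. They are not spaCy's implementation or API.

```python
from collections import Counter


def extract_ngrams(tokens, ngram_size=1, lowercase=False):
    """Count all n-grams of length 1..ngram_size in a list of token strings.

    `lowercase=True` loosely mirrors choosing a LOWER-style attribute
    instead of the verbatim text (an assumption made for this sketch).
    """
    if ngram_size < 1:
        raise ValueError("ngram_size must be at least 1")
    if lowercase:
        tokens = [token.lower() for token in tokens]
    counts = Counter()
    for n in range(1, ngram_size + 1):
        # Slide a window of width n across the token list.
        for start in range(len(tokens) - n + 1):
            counts[tuple(tokens[start:start + n])] += 1
    return counts


features = extract_ngrams(
    ["The", "cat", "saw", "the", "cat"], ngram_size=2, lowercase=True
)
```

With `ngram_size=2` the counter holds both one-token and two-token keys, which matches the "unigram and bigram" behaviour described in the message.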