spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-01-10 02:31:16 +03:00

Author	SHA1	Message	Date
Ines Montani	200d8bdb3c	Merge branch 'spacy.io' [ci skip]	2019-03-23 16:46:34 +01:00
Ines Montani	1e5b917d75	Fix formatting [ci skip]	2019-03-23 16:45:50 +01:00
Matthew Honnibal	6c783f8045	Bug fixes and options for TextCategorizer (#3472) * Fix code for bag-of-words feature extraction The _ml.py module had a redundant copy of a function to extract unigram bag-of-words features, except one had a bug that set values to 0. Another function allowed extraction of bigram features. Replace all three with a new function that supports arbitrary ngram sizes and also allows control of which attribute is used (e.g. ORTH, LOWER, etc). * Support 'bow' architecture for TextCategorizer This allows efficient ngram bag-of-words models, which are better when the classifier needs to run quickly, especially when the texts are long. Pass architecture="bow" to use it. The extra arguments ngram_size and attr are also available, e.g. ngram_size=2 means unigram and bigram features will be extracted. * Fix size limits in train_textcat example * Explain architectures better in docs	2019-03-23 16:44:44 +01:00
Ines Montani	06bf130890	💫 Add better and serializable sentencizer (#3471) * Add better serializable sentencizer component * Replace default factory * Add tests * Tidy up * Pass test * Update docs	2019-03-23 15:45:02 +01:00
Matthew Honnibal	d9a07a7f6e	💫 Fix class mismap on parser deserializing (closes #3433) (#3470) v2.1 introduced a regression when deserializing the parser after parser.add_label() had been called. The code around the class mapping is pretty confusing currently, as it was written to accommodate backwards model compatibility. It needs to be revised when the models are next retrained. Closes #3433	2019-03-23 13:46:25 +01:00
Matthew Honnibal	444a3abfe5	Add xfail test for #3433. Improve test for add label.	2019-03-23 12:36:00 +01:00
Ines Montani	6b6e9b638e	Fix test for #3468	2019-03-23 11:24:29 +01:00
Ines Montani	fbec72b4c3	Slightly modify test for #3468 Check for Token.is_sent_start first (which is serialized/deserialized correctly)	2019-03-23 11:22:44 +01:00
Ines Montani	02d9378d8c	Add xfailing test for #3468	2019-03-23 11:19:11 +01:00
Ines Montani	ed91592726	Merge branch 'master' into spacy.io	2019-03-22 19:02:26 +01:00
Ines Montani	dcd6e06c47	Improve landing example [ci skip]	2019-03-22 19:02:15 +01:00
Ines Montani	c2bb39dcb4	Merge branch 'master' into spacy.io	2019-03-22 18:50:16 +01:00
Ines Montani	a841324034	Update landing example [ci skip]	2019-03-22 18:50:00 +01:00
Ines Montani	a9ad735241	Merge branch 'master' into spacy.io	2019-03-22 18:36:28 +01:00
Ines Montani	b532386a60	Fix typo [ci skip]	2019-03-22 18:36:17 +01:00
Ines Montani	7b5496027b	Merge branch 'master' into spacy.io	2019-03-22 18:21:16 +01:00
Ines Montani	d8533f0149	Update Binder [ci skip]	2019-03-22 18:16:46 +01:00
Matthew Honnibal	4c5f265884	Fix train loop for train_textcat example	2019-03-22 16:10:11 +01:00
Ines Montani	680eafab94	Merge branch 'master' into spacy.io	2019-03-22 15:17:51 +01:00
Christos Aridas	9cee3f702a	Add missing space in landing page (#3462) [ci skip]	2019-03-22 15:17:35 +01:00
Ines Montani	5073ce63fd	Merge branch 'spacy.io' [ci skip]	2019-03-22 15:17:11 +01:00
Ines Montani	c9bd0e5a96	Set version to 2.1.2	2019-03-22 13:44:47 +01:00
Matthew Honnibal	e65b5bb9a0	Fix tokenizer on Python2.7 (#3460) spaCy v2.1 switched to the built-in re module, where v2.0 had been using the third-party regex library. When the tokenizer was deserialized on Python2.7, the `re.compile()` function was called with expressions that featured escaped unicode codepoints that were not in Python2.7's unicode database. Problems occurred when we had a range between two of these unknown codepoints, like this: ``` '[\\uAA77-\\uAA79]' ``` On Python2.7, the unknown codepoints are not unescaped correctly, resulting in arbitrary out-of-range characters being matched by the expression. This problem does not occur if we instead have a range between two unicode literals, rather than the escape sequences. To fix the bug, we therefore add a new compat function that unescapes unicode sequences using the `ast.literal_eval()` function. Care is taken to ensure we do not also escape non-unicode sequences. Closes #3356. - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-03-22 13:42:47 +01:00
Ines Montani	c81923ee30	Update wasabi pin	2019-03-22 13:31:58 +01:00
Ines Montani	188ccd5750	Fix xfail marker	2019-03-22 12:54:14 +01:00
Ines Montani	7dd5e2f564	Update v2-1.md	2019-03-22 12:43:23 +01:00
Matthew Honnibal	d811c97da1	Fix test that caused pytest to choke on Python3	2019-03-22 10:28:51 +01:00
Matthew Honnibal	a2ad9832e5	Add failing test for #3356	2019-03-22 02:42:37 +01:00
Matthew Honnibal	7ec64a36fd	Merge pull request #3455 from explosion/bugfix/fix-en-tag-map 💫 Bring English tag_map in line with UD Treebank	2019-03-21 21:19:30 +01:00
Matthew Honnibal	c66bd61e88	Fix lemmas	2019-03-21 14:22:12 +01:00
Matthew Honnibal	04395ffa49	Bring English tag_map in line with UD Treebank I wrote a small script to read the UD English training data and check that our tag map and morph rules were resulting in the best POS map. This hadn't been done for some time, and there have been various changes to the UD schema since it has been done. After these changes we should see much better agreement between our POS assignments and the UD POS tags.	2019-03-21 13:53:44 +01:00
Ines Montani	375fbf3586	Update v2-1.md	2019-03-21 12:29:08 +01:00
Ines Montani	9394ca1f29	Update index.md	2019-03-21 10:24:55 +01:00
Ines Montani	0c82a5ddb2	Merge branch 'master' of https://github.com/explosion/spaCy	2019-03-21 10:23:56 +01:00
Ines Montani	0712efc6b3	Update version requirements [ci skip]	2019-03-21 10:23:54 +01:00
Matthew Honnibal	4e3ed2ea88	Add -t2v argument to train_textcat script	2019-03-20 23:05:42 +01:00
Ines Montani	764359c952	Merge branch 'master' into spacy.io	2019-03-20 17:24:28 +01:00
Ines Montani	dac8f8ff99	Update Span.__init__ docs (see #3445) [ci skip]	2019-03-20 17:24:17 +01:00
Matthew Honnibal	c7f26abe5f	Merge pull request #3434 from Bharat123rox/narrow-unicode Raise Error for a narrow unicode build of Python	2019-03-20 12:19:52 +01:00
Matthew Honnibal	1c8ff59185	Merge pull request #3441 from explosion/fix/cli-ud-scripts 💫 Move UD scripts to bin	2019-03-20 12:19:15 +01:00
Matthew Honnibal	72889a16d5	Fix similarity calculation if vectors are on GPU (#3440)	2019-03-20 12:09:59 +01:00
Matthew Honnibal	1612990e88	Implement cosine loss for spacy pretrain. Make default	2019-03-20 11:06:58 +00:00
Ines Montani	ae5b4d0e84	Fix formatting (hopefully also restarts build properly)	2019-03-20 09:55:45 +01:00
Ines Montani	6abc1ddb26	Update __main__.py	2019-03-20 09:43:26 +01:00
Bharat123Rox	f2547f02d6	Made changes suggested by @ines	2019-03-20 07:43:19 +05:30
Ines Montani	7400c7f8a7	Move UD scripts to bin	2019-03-20 01:19:34 +01:00
Ines Montani	685fff40cf	Revert "Add --always-link flag to cli.download (see #3435)" This reverts commit `583a566843`.	2019-03-20 01:03:40 +01:00
Matthew Honnibal	6cfbb2d34e	Merge branch 'master' of https://github.com/explosion/spaCy	2019-03-20 00:59:54 +01:00
Matthew Honnibal	5a53e9358a	Set version to 2.1.1	2019-03-20 00:59:45 +01:00
Matthew Honnibal	02d7b41893	Fix GPU installation. Closes #3437	2019-03-20 00:59:27 +01:00
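
Note on commit 6c783f8045: its message says that ngram_size=2 extracts both unigram and bigram features, and that the attribute used (e.g. ORTH, LOWER) can be chosen. The snippet below is only a rough, standalone sketch of that idea in plain Python. It is not the code from spaCy's _ml.py; the function name `extract_ngrams` and the `lower` flag are hypothetical stand-ins, with `lower=True` loosely playing the role of selecting LOWER instead of ORTH.

```python
from collections import Counter


def extract_ngrams(tokens, ngram_size=1, lower=False):
    """Count every n-gram of length 1..ngram_size in a token sequence.

    Hypothetical helper for illustration only; not part of spaCy.
    """
    if ngram_size < 1:
        raise ValueError("ngram_size must be at least 1")
    if lower:
        # Rough analogue of using the LOWER attribute instead of ORTH
        tokens = [token.lower() for token in tokens]
    counts = Counter()
    for n in range(1, ngram_size + 1):
        # Slide a window of width n over the tokens
        for start in range(len(tokens) - n + 1):
            counts[tuple(tokens[start:start + n])] += 1
    return counts


# ngram_size=2 yields unigram and bigram features together
features = extract_ngrams(
    ["The", "cat", "sat", "on", "the", "mat"], ngram_size=2, lower=True
)
```

Keys are tuples of token strings, so unigrams and bigrams share one feature table without colliding. A real model would probably hash these keys into a fixed-size table rather than keep them in a dictionary, but that detail is an assumption and is not stated in the commit message.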

9879 Commits