spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-07-16 11:12:25 +03:00

Author	SHA1	Message	Date
Adriane Boyd	3f727bc539	Switch to v2.3.0.dev0	2020-05-25 12:57:20 +02:00
Adriane Boyd	736f3cb5af	Bump version and deps for v2.3.0 * spacy to v2.3.0 * thinc to v7.4.1 * spacy-lookups-data to v0.3.2	2020-05-25 12:03:49 +02:00
Adriane Boyd	e06ca7ea24	Switch to new add API in PhraseMatcher unpickle	2020-05-25 11:22:47 +02:00
Ines Montani	6728747f71	Merge pull request #5486 from explosion/fix/compat-py2	2020-05-22 15:47:21 +02:00
Matthew Honnibal	f6078d866a	Merge pull request #5121 from adrianeboyd/bugfix/revert-token-match Revert token_match priority changes from #4374 and extend token match options	2020-05-22 14:42:51 +02:00
Ines Montani	c685ee734a	Fix compat for v2.x branch	2020-05-22 14:22:36 +02:00
Adriane Boyd	e4a1b5dab1	Rename to url_match Rename to `url_match` and update docs.	2020-05-22 12:41:03 +02:00
Adriane Boyd	730fa493a4	Merge remote-tracking branch 'upstream/master' into bugfix/revert-token-match	2020-05-22 12:18:00 +02:00
Adriane Boyd	71fe61fdcd	Disallow merging 0-length spans	2020-05-22 10:14:34 +02:00
Matthew Honnibal	93c4d13588	Merge pull request #5264 from lfiedler/issue-5230 Fix ResourceWarnings during unittest	2020-05-22 00:31:07 +02:00
Matthew Honnibal	e1cb7e838b	Merge pull request #5481 from explosion/feature/blank-shortcut-v2 Add blank:{lang} shortcut support to util.load_model	2020-05-22 00:08:23 +02:00
Ines Montani	2250380816	Merge pull request #5482 from explosion/fix/backwards-compat-super	2020-05-21 21:51:46 +02:00
Ines Montani	891fa59009	Use backwards-compatible super()	2020-05-21 20:52:48 +02:00
Matthew Honnibal	5ce02c1b17	Merge pull request #5470 from svlandeg/bugfix/noun-chunks Bugfix in noun chunks	2020-05-21 20:51:31 +02:00
Ines Montani	cb02bff0eb	Add blank:{lang} shortcut to util.load_model	2020-05-21 20:24:07 +02:00
Ines Montani	0f1beb5ff2	Tidy up and avoid absolute spacy imports in core	2020-05-21 20:05:03 +02:00
svlandeg	51715b9f72	span / noun chunk has +1 because end is exclusive	2020-05-21 19:56:56 +02:00
svlandeg	84d5b7ad0a	Merge remote-tracking branch 'upstream/master' into bugfix/noun-chunks # Conflicts: # spacy/lang/el/syntax_iterators.py # spacy/lang/en/syntax_iterators.py # spacy/lang/fa/syntax_iterators.py # spacy/lang/fr/syntax_iterators.py # spacy/lang/id/syntax_iterators.py # spacy/lang/nb/syntax_iterators.py # spacy/lang/sv/syntax_iterators.py	2020-05-21 19:19:50 +02:00
svlandeg	f7d10da555	avoid unnecessary loop to check overlapping noun chunks	2020-05-21 19:15:57 +02:00
Ines Montani	c6ec19c844	Add missing declaration	2020-05-21 17:30:05 +02:00
Matthew Honnibal	884d9b060d	Merge pull request #5466 from adrianeboyd/feature/omit-extra-lexeme-info Add option to omit extra lexeme tables in CLI	2020-05-21 16:40:02 +02:00
Matthew Honnibal	26cd6a0229	Merge pull request #5462 from adrianeboyd/feature/lemmatizer-all-upos Extend lemmatizer rules for all UPOS tags	2020-05-21 16:05:31 +02:00
Matthew Honnibal	cad9b290a2	Merge branch 'master' into feature/omit-extra-lexeme-info	2020-05-21 16:04:24 +02:00
Matthew Honnibal	1f572ce89b	Merge pull request #5473 from explosion/fix/travis-tests Fix Python 2.7 compat	2020-05-21 15:56:16 +02:00
Ines Montani	a9cb2882cb	Rename argument: doc_or_span/obj -> doclike (#5463 ) * doc_or_span -> obj * Revert "doc_or_span -> obj" This reverts commit `78bb9ff5e0`. * obj -> doclike * Refer to correct object	2020-05-21 15:17:39 +02:00
Ines Montani	bea863acd2	Fix naming conflict and formatting	2020-05-21 14:24:38 +02:00
Ines Montani	bd6353715a	Merge branch 'master' into fix/travis-tests	2020-05-21 14:23:04 +02:00
Ines Montani	d8f3190c0a	Tidy up and auto-format	2020-05-21 14:14:01 +02:00
Ines Montani	56de520afd	Try to fix tests on Travis (2.7)	2020-05-21 14:04:57 +02:00
adrianeboyd	d45602bc11	Merge branch 'master' into feature/omit-extra-lexeme-info	2020-05-21 10:26:01 +02:00
svlandeg	b221bcf1ba	fixing all languages	2020-05-21 00:17:28 +02:00
svlandeg	b509a3e7fc	fix: use actual range in 'seen' instead of subtree	2020-05-20 23:06:39 +02:00
svlandeg	36a94c409a	failing test to reproduce overlapping spans problem	2020-05-20 23:06:03 +02:00
adrianeboyd	49ef06d793	Add option for base model in init-model CLI (#5467 ) Intended for languages like Chinese with a custom tokenizer.	2020-05-20 18:49:11 +02:00
Adriane Boyd	daaa7bf451	Add option to omit extra lexeme tables in CLI	2020-05-20 15:51:44 +02:00
Adriane Boyd	8cba0e41d8	Return lowercase form as default except for PROPN	2020-05-20 15:35:08 +02:00
adrianeboyd	9393253b66	Remove peeking from Parser.begin_training (#5456 ) Inspect all instances in `Parser.begin_training` rather than only the first 1000.	2020-05-20 15:18:06 +02:00
Adriane Boyd	4fa9670537	Extend lemmatizer rules for all UPOS tags	2020-05-20 10:15:43 +02:00
adrianeboyd	40e65d6f63	Fix most_similar for vectors with unused rows (#5348 ) * Fix most_similar for vectors with unused rows Address issues related to the unused rows in the vector table and `most_similar`: * Update `most_similar()` to search only through rows that are in use according to `key2row`. * Raise an error when `most_similar(n=n)` is larger than the number of vectors in the table. * Set and restore `_unset` correctly when vectors are added or deserialized so that new vectors are added in the correct row. * Set data and keys to the same length in `Vocab.prune_vectors()` to avoid spurious entries in `key2row`. * Fix regression test using `most_similar` Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-05-19 16:41:26 +02:00
adrianeboyd	70da1fd2d6	Add warning for misaligned character offset spans (#5007 ) * Add warning for misaligned character offset spans * Resolve conflict * Filter warnings in example scripts Filter warnings in example scripts to show warnings once, in particular warnings about misaligned entities. Co-authored-by: Ines Montani <ines@ines.io>	2020-05-19 16:01:18 +02:00
adrianeboyd	0061992d95	Update Polish tokenizer for UD_Polish-PDB (#5432 ) Update Polish tokenizer for UD_Polish-PDB, which is a relatively major change from the existing tokenizer. Unused exceptions files and conflicting test cases removed. Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-05-19 15:59:55 +02:00
adrianeboyd	a5cd203284	Reduce stored lexemes data, move feats to lookups (#5238 ) * Reduce stored lexemes data, move feats to lookups * Move non-derivable lexemes features (`norm / cluster / prob`) to `spacy-lookups-data` as lookups * Get/set `norm` in both lookups and `LexemeC`, serialize in lookups * Remove `cluster` and `prob` from `LexemesC`, get/set/serialize in lookups only * Remove serialization of lexemes data as `vocab/lexemes.bin` * Remove `SerializedLexemeC` * Remove `Lexeme.to_bytes/from_bytes` * Modify normalization exception loading: * Always create `Vocab.lookups` table `lexeme_norm` for normalization exceptions * Load base exceptions from `lang.norm_exceptions`, but load language-specific exceptions from lookups * Set `lex_attr_getter[NORM]` including new lookups table in `BaseDefaults.create_vocab()` and when deserializing `Vocab` * Remove all cached lexemes when deserializing vocab to override existing normalizations with the new normalizations (as a replacement for the previous step that replaced all lexemes data with the deserialized data) * Skip English normalization test Skip English normalization test because the data is now in `spacy-lookups-data`. * Remove norm exceptions Moved to spacy-lookups-data. * Move norm exceptions test to spacy-lookups-data * Load extra lookups from spacy-lookups-data lazily Load extra lookups (currently for cluster and prob) lazily from the entry point `lg_extra` as `Vocab.lookups_extra`. * Skip creating lexeme cache on load To improve model loading times, do not create the full lexeme cache when loading. The lexemes will be created on demand when processing. * Identify numeric values in Lexeme.set_attrs() With the removal of a special case for `PROB`, also identify `float` to avoid trying to convert it with the `StringStore`. * Skip lexeme cache init in from_bytes * Unskip and update lookups tests for python3.6+ * Update vocab pickle to include lookups_extra * Update vocab serialization tests Check strings rather than lexemes since lexemes aren't initialized automatically, account for addition of "_SP". * Re-skip lookups test because of python3.5 * Skip PROB/float values in Lexeme.set_attrs * Convert is_oov from lexeme flag to lex in vectors Instead of storing `is_oov` as a lexeme flag, `is_oov` reports whether the lexeme has a vector. Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-05-19 15:59:14 +02:00
Ines Montani	a41e28ceba	Merge pull request #5436 from ilivans/fix_errors_with_codes	2020-05-18 10:45:56 +02:00
Ilkyu Ju	72a25c9cef	Very minor issues in Korean example sentences (#5446 ) * Add contributor agreement * Improve ko translation of example sentences I fixed unnatural translations and word spacing errors. * Update osori.md	2020-05-17 13:43:34 +02:00
adrianeboyd	908dea3939	Skip duplicate lexeme rank setting (#5401 ) Skip duplicate lexeme rank setting within `_fix_pretrained_vectors_name()`.	2020-05-14 18:26:12 +02:00
adrianeboyd	f49e2810e6	Add Polish lemmatizer (#5413 ) * Add Polish lemmatizer Contributed by @ryszardtuora * Add missing import	2020-05-14 18:23:19 +02:00
adrianeboyd	e63880e081	Use Token.sent_start for Span.sent (#5439 ) Use `Token.sent_start` for sentence boundaries in `Span.sent` so that `Doc.sents` and `Span.sent` return the same sentence boundaries.	2020-05-14 18:22:51 +02:00
adrianeboyd	780b869345	Fix syntax iterators for Persian (#5437 )	2020-05-14 16:51:03 +02:00
Ilia Ivanov	712d9d4820	fixup! Fix ErrorsWithCodes().__class__ return value	2020-05-14 15:45:58 +02:00
Ilia Ivanov	a987e9e45d	Fix ErrorsWithCodes().__class__ return value	2020-05-14 14:14:15 +02:00
Vishnu Priya VR	9ce059dd06	Limiting noun_chunks for specific languages (#5396 ) * Limiting noun_chunks for specific languages * Limiting noun_chunks for specific languages Contributor Agreement * Addressing review comments * Removed unused fixtures and imports * Add fa_tokenizer in test suite * Use fa_tokenizer in test * Undo extraneous reformatting Co-authored-by: adrianeboyd <adrianeboyd@gmail.com>	2020-05-14 12:58:06 +02:00
Sofie Van Landeghem	b04738903e	prevent None in gold fields (#5425 ) * set gold fields to empty list instead of keeping them as None * add unit test	2020-05-13 22:08:50 +02:00
adrianeboyd	113e7981d0	Check that row is within bounds when adding vector (#5430 ) Check that row is within bounds for the vector data array when adding a vector. Don't add vectors with rank OOV_RANK in `init-model` (change is due to shift from OOV as 0 to OOV as OOV_RANK).	2020-05-13 22:08:28 +02:00
adrianeboyd	07639dd6ac	Remove TAG from da/sv tokenizer exceptions (#5428 ) Remove `TAG` value from Danish and Swedish tokenizer exceptions because it may not be included in a tag map (and these settings are problematic as tokenizer exceptions anyway).	2020-05-13 10:25:54 +02:00
adrianeboyd	24e7108f80	Modify array type to accommodate OOV_RANK (#5429 ) Modify indices array type in `Vocab.prune_vectors` to accommodate OOV_RANK index as max(uint64).	2020-05-13 10:25:05 +02:00
adrianeboyd	440b81bddc	Improve exceptions for 'd (would/had) in English (#5379 ) Instead of treating `'d` in contractions like `I'd` as `would` in all cases in the tokenizer exceptions, leave the tagging and lemmatization up to later components.	2020-05-08 15:10:57 +02:00
adrianeboyd	c963e269ba	Add method to update / reset pkuseg user dict (#5404 )	2020-05-08 11:21:46 +02:00
Samuel Rodríguez Medina	5e55bfa821	Fixed tests for Swedish that were written in Danish. (#5395 )	2020-05-05 14:06:27 +02:00
Adriane Boyd	565e0eef73	Add tokenizer option for token match with affixes To fix the slow tokenizer URL (#4374) and allow `token_match` to take priority over prefixes and suffixes by default, introduce a new tokenizer option for a token match pattern that's applied after prefixes and suffixes but before infixes.	2020-05-05 10:35:33 +02:00
Adriane Boyd	792c8af8cf	Merge remote-tracking branch 'upstream/master' into bugfix/revert-token-match	2020-05-05 09:25:57 +02:00
adrianeboyd	c045a9c7f6	Fix logic in train CLI timing eval on GPU (#5387 ) Run CPU timing in first iteration only	2020-05-01 12:05:33 +02:00
Samuel Rodríguez Medina	148b036e0c	Spanish like num improvement (#5381 ) * Add tests for Spanish like_num. * Add missing numbers in Spanish lexical attributes for like_num. * Modify Spanish test function name. * Add contributor agreement.	2020-04-30 11:13:23 +02:00
Samuel Rodríguez Medina	8602daba85	Swedish like_num (#5371 ) * Sign contributor agreement. * Add like_num functionality to Swedish. * Update spacy/tests/lang/sv/test_lex_attrs.py Co-Authored-By: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update contributor agreement Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2020-04-29 21:25:22 +02:00
adrianeboyd	74da669326	Fix problems with lower and whitespace in variants (#5361 ) * Initialize lower flag explicitly * Handle whitespace words from GoldParse correctly when creating raw text with orth variants * Return the text with original casing if anything goes wrong	2020-04-29 13:01:25 +02:00
adrianeboyd	3f43c73d37	Normalize TokenC.sent_start values for Matcher (#5346 ) Normalize TokenC.sent_start values to booleans for the `Matcher`.	2020-04-29 12:57:30 +02:00
adrianeboyd	bdff76dede	Various updates/additions to CLI scripts (#5362 ) * `debug-data`: determine coverage of provided vectors * `evaluate`: support `blank:lg` model to make it possible to just evaluate tokenization * `init-model`: add option to truncate vectors to N most frequent vectors from word2vec file * `train`: * if training on GPU, only run evaluation/timing on CPU in the first iteration * if training is aborted, exit with a non-0 exit status	2020-04-29 12:56:46 +02:00
Sofie Van Landeghem	cfdaf99b80	Fix passing of component configuration (#5374 ) * add kwargs to to_disk methods in docs - otherwise crashes on 'exclude' argument * add fix and test for Issue 5137	2020-04-29 12:56:17 +02:00
Ines Montani	efec28ce70	Merge pull request #5367 from adrianeboyd/feature/simplify-warnings-v2	2020-04-29 12:55:37 +02:00
Sofie Van Landeghem	f67343295d	Update NEL examples and documentation (#5370 ) * simplify creation of KB by skipping dim reduction * small fixes to train EL example script * add KB creation and NEL training example scripts to example section * update descriptions of example scripts in the documentation * moving wiki_entity_linking folder from bin to projects * remove test for wiki NEL functionality that is being moved	2020-04-29 12:53:53 +02:00
adrianeboyd	a6e521cd79	Add is_sent_end token property (#5375 ) Reconstruction of the original PR #4697 by @MiniLau. Removes unused `SENT_END` symbol and `IS_SENT_END` from `Matcher` schema because the Matcher is only going to be able to support `IS_SENT_START`.	2020-04-29 12:53:16 +02:00
Ines Montani	eac47971f1	Merge pull request #5258 from mirfan899/master	2020-04-29 12:51:55 +02:00
adrianeboyd	d5f18f8307	Add missing import	2020-04-28 14:01:29 +02:00
adrianeboyd	ac40a8f7a5	Add missing import	2020-04-28 14:00:11 +02:00
Adriane Boyd	3a045572ed	Add missing import	2020-04-28 13:48:37 +02:00
Adriane Boyd	bc39f97e11	Simplify warnings	2020-04-28 13:37:37 +02:00
adrianeboyd	f8ac5b9f56	bugfix in span similarity (#5155 ) (#5358 ) * bugfix in span similarity * also rewrite doc.pyx for clarity * formatting Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2020-04-27 16:51:27 +02:00
Sofie Van Landeghem	9203d821ae	Add 2 ini files in tests/lang (#5359 )	2020-04-27 13:01:54 +02:00
Punitvara	b2b7e1f37a	This PR adds Gujarati Language class along with (#5355 ) * This PR adds Gujarati Language class along with - stop words * Add test for gu tokenizer	2020-04-27 11:07:37 +02:00
sabiqueqb	fc91660aa2	Gh 5339 language class for malayalam (#5342 ) * Initialize Malayalam Language class * Add lex_attrs and examples for Malayalam * Add spaCy Contributor Agreement * Add test for ml tokenizer	2020-04-27 09:45:08 +02:00
adrianeboyd	84e06f9fb7	Improve GoldParse NER alignment (#5335 ) Improve GoldParse NER alignment by including all cases where the start and end of the NER span can be aligned, regardless of internal tokenization differences. To do this, convert BILUO tags to character offsets, check start/end alignment with `doc.char_span()`, and assign the BILUO tags for the aligned spans. Alignment for `O/-` tags is handled through the one-to-one and multi alignments.	2020-04-23 16:58:23 +02:00
adrianeboyd	521f361052	Switch to new gold.align method (#5334 ) * Switch from original `_align` to new simpler alignment algorithm from #4526 * Remove alignment normalizations beyond whitespace and lowercasing	2020-04-21 19:31:03 +02:00
adrianeboyd	bf5c13d170	Modify jieba install message (#5328 ) Modify jieba install message to instruct the user to use `ChineseDefaults.use_jieba = False` so that it's possible to load pkuseg-only models without jieba installed.	2020-04-20 22:06:53 +02:00
adrianeboyd	f7471abd82	Add pkuseg and serialization support for Chinese (#5308 ) * Add pkuseg and serialization support for Chinese Add support for pkuseg alongside jieba * Specify model through `Language` meta: * split on characters (if no word segmentation packages are installed) ``` Chinese(meta={"tokenizer": {"config": {"use_jieba": False, "use_pkuseg": False}}}) ``` * jieba (remains the default tokenizer if installed) ``` Chinese() Chinese(meta={"tokenizer": {"config": {"use_jieba": True}}}) # explicit ``` * pkuseg ``` Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "use_jieba": False, "use_pkuseg": True}}}) ``` * The new tokenizer setting `require_pkuseg` is used to override `use_jieba` default, which is intended for models that provide a pkuseg model: ``` nlp_pkuseg = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "require_pkuseg": True}}}) nlp = Chinese() # has `use_jieba` as `True` by default nlp.from_bytes(nlp_pkuseg.to_bytes()) # `require_pkuseg` overrides `use_jieba` when calling the tokenizer ``` Add support for serialization of tokenizer settings and pkuseg model, if loaded * Add sorting for `Language.to_bytes()` serialization of `Language.meta` so that the (emptied, but still present) tokenizer metadata is in a consistent position in the serialized data Extend tests to cover all three tokenizer configurations and serialization * Fix from_disk and tests without jieba or pkuseg * Load cfg first and only show error if `use_pkuseg` * Fix blank/default initialization in serialization tests * Explicitly initialize jieba's cache on init * Add serialization for pkuseg pre/postprocessors * Reformat pkuseg install message	2020-04-18 17:01:53 +02:00
Jakob Jul Elben	663333c3b2	Fixes #5413 (#5315 ) * Fix 5314 * Add contributor * Resolve requested changes Co-authored-by: Jakob Jul Elben <jakob@datamaga.com>	2020-04-16 13:29:02 +02:00
Leander Fiedler	a3401b1194	issue5230 changed reference to function to anonymous function	2020-04-15 21:52:52 +02:00
Leander Fiedler	cef0c909b9	issue5230 changed reference to function to anonymous function	2020-04-15 19:28:33 +02:00
Paolo Arduin	1ca32d8f9c	Matcher support for Span as well as Doc (#5113 ) * Matcher support for Span, as well as Doc #5056 * Removes an import unused * Signed contributors agreement * Code optimization and better test * Add error message for bad Matcher call argument * Fix merging	2020-04-15 13:51:33 +02:00
adrianeboyd	98c59027ed	Use max(uint64) for OOV lexeme rank (#5303 ) * Use max(uint64) for OOV lexeme rank * Add test for default OOV rank * Revert back to thinc==7.4.0 Requiring the updated version of thinc was unnecessary. * Define OOV_RANK in one place Define OOV_RANK in one place in `util`. * Fix formatting [ci skip] * Switch to external definitions of max(uint64) Switch to external definitions of max(uint64) and confirm that they are equal.	2020-04-15 13:49:47 +02:00
adrianeboyd	3d2c308906	Add Doc init from list of words and text (#5251 ) * Add Doc init from list of words and text Add an option to initialize a `Doc` from a text and list of words where the words may or may not include all whitespace tokens. If the text and words are mismatched, raise an error. * Fix error code * Remove all whitespace before aligning words/text * Move words/text init to util function * Update error message * Rename to get_words_and_spaces * Fix formatting	2020-04-14 19:15:52 +02:00
Paolo Arduin	8ce408d2e1	Comparison predicate handling for `!=` (#5282 ) * Fix #5281 * Optim test	2020-04-14 19:14:15 +02:00
Leander Fiedler	6700006830	issue5230 attempted fix of pytest segfault for python3.5	2020-04-12 09:34:54 +02:00
Leander Fiedler	d60e2d3ebf	issue5230 added unit test for dumping and loading knowledgebase	2020-04-12 09:08:41 +02:00
Leander Fiedler	d2bb649227	issue5230 filter warnings in addition to filterwarnings to prevent deprecation warnings in python35(win) setup to pop up	2020-04-10 23:21:13 +02:00
Leander Fiedler	ca2a7a44db	issue5230 store string values of warnings to remotely debug failing python35(win) setup	2020-04-10 22:26:55 +02:00
Leander Fiedler	88ca40a15d	issue5230 raise warnings as errors to remotely debug failing python35(win) setup	2020-04-10 21:45:53 +02:00
Leander Fiedler	a7bdfe42e1	issue5230 added print statement to warnings filter to remotely debug failing python35(win) setup	2020-04-10 21:14:33 +02:00
Leander Fiedler	8c1d0d628f	issue5230 writer now checks instance of loc parameter before trying to operate on it	2020-04-10 20:35:52 +02:00
Umar Butler	8952effcc4	Fixed Typo in Warning (#5284 ) * Fixed typo in cli warning Fixed a typo in the warning for the provision of exactly two labels, which have not been designated as binary, to textcat. * Create and signed contributor form	2020-04-09 15:46:15 +02:00
adrianeboyd	cf579a398d	Add __init__.py to eu and hy tests (#5278 )	2020-04-08 20:03:06 +02:00
adrianeboyd	ae4af52ce7	Add ideographic stops to sentencizer (#5263 ) Add ideographic half- and fullwidth full stops to default sentencizer punctuation.	2020-04-08 12:58:39 +02:00


6851 Commits