spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-03-03 19:08:06 +03:00

Author	SHA1	Message	Date
Ines Montani	bd6353715a	Merge branch 'master' into fix/travis-tests	2020-05-21 14:23:04 +02:00
Ines Montani	d8f3190c0a	Tidy up and auto-format	2020-05-21 14:14:01 +02:00
Ines Montani	56de520afd	Try to fix tests on Travis (2.7)	2020-05-21 14:04:57 +02:00
adrianeboyd	40e65d6f63	Fix most_similar for vectors with unused rows (#5348 ) * Fix most_similar for vectors with unused rows Address issues related to the unused rows in the vector table and `most_similar`: * Update `most_similar()` to search only through rows that are in use according to `key2row`. * Raise an error when `most_similar(n=n)` is larger than the number of vectors in the table. * Set and restore `_unset` correctly when vectors are added or deserialized so that new vectors are added in the correct row. * Set data and keys to the same length in `Vocab.prune_vectors()` to avoid spurious entries in `key2row`. * Fix regression test using `most_similar` Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-05-19 16:41:26 +02:00
adrianeboyd	70da1fd2d6	Add warning for misaligned character offset spans (#5007 ) * Add warning for misaligned character offset spans * Resolve conflict * Filter warnings in example scripts Filter warnings in example scripts to show warnings once, in particular warnings about misaligned entities. Co-authored-by: Ines Montani <ines@ines.io>	2020-05-19 16:01:18 +02:00
adrianeboyd	0061992d95	Update Polish tokenizer for UD_Polish-PDB (#5432 ) Update Polish tokenizer for UD_Polish-PDB, which is a relatively major change from the existing tokenizer. Unused exceptions files and conflicting test cases removed. Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-05-19 15:59:55 +02:00
adrianeboyd	a5cd203284	Reduce stored lexemes data, move feats to lookups (#5238 ) * Reduce stored lexemes data, move feats to lookups * Move non-derivable lexemes features (`norm / cluster / prob`) to `spacy-lookups-data` as lookups * Get/set `norm` in both lookups and `LexemeC`, serialize in lookups * Remove `cluster` and `prob` from `LexemesC`, get/set/serialize in lookups only * Remove serialization of lexemes data as `vocab/lexemes.bin` * Remove `SerializedLexemeC` * Remove `Lexeme.to_bytes/from_bytes` * Modify normalization exception loading: * Always create `Vocab.lookups` table `lexeme_norm` for normalization exceptions * Load base exceptions from `lang.norm_exceptions`, but load language-specific exceptions from lookups * Set `lex_attr_getter[NORM]` including new lookups table in `BaseDefaults.create_vocab()` and when deserializing `Vocab` * Remove all cached lexemes when deserializing vocab to override existing normalizations with the new normalizations (as a replacement for the previous step that replaced all lexemes data with the deserialized data) * Skip English normalization test Skip English normalization test because the data is now in `spacy-lookups-data`. * Remove norm exceptions Moved to spacy-lookups-data. * Move norm exceptions test to spacy-lookups-data * Load extra lookups from spacy-lookups-data lazily Load extra lookups (currently for cluster and prob) lazily from the entry point `lg_extra` as `Vocab.lookups_extra`. * Skip creating lexeme cache on load To improve model loading times, do not create the full lexeme cache when loading. The lexemes will be created on demand when processing. * Identify numeric values in Lexeme.set_attrs() With the removal of a special case for `PROB`, also identify `float` to avoid trying to convert it with the `StringStore`. * Skip lexeme cache init in from_bytes * Unskip and update lookups tests for python3.6+ * Update vocab pickle to include lookups_extra * Update vocab serialization tests Check strings rather than lexemes since lexemes aren't initialized automatically, account for addition of "_SP". * Re-skip lookups test because of python3.5 * Skip PROB/float values in Lexeme.set_attrs * Convert is_oov from lexeme flag to lex in vectors Instead of storing `is_oov` as a lexeme flag, `is_oov` reports whether the lexeme has a vector. Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-05-19 15:59:14 +02:00
Ines Montani	a41e28ceba	Merge pull request #5436 from ilivans/fix_errors_with_codes	2020-05-18 10:45:56 +02:00
adrianeboyd	780b869345	Fix syntax iterators for Persian (#5437 )	2020-05-14 16:51:03 +02:00
Ilia Ivanov	712d9d4820	fixup! Fix ErrorsWithCodes().__class__ return value	2020-05-14 15:45:58 +02:00
Ilia Ivanov	a987e9e45d	Fix ErrorsWithCodes().__class__ return value	2020-05-14 14:14:15 +02:00
Vishnu Priya VR	9ce059dd06	Limiting noun_chunks for specific languages (#5396 ) * Limiting noun_chunks for specific langauges * Limiting noun_chunks for specific languages Contributor Agreement * Addressing review comments * Removed unused fixtures and imports * Add fa_tokenizer in test suite * Use fa_tokenizer in test * Undo extraneous reformatting Co-authored-by: adrianeboyd <adrianeboyd@gmail.com>	2020-05-14 12:58:06 +02:00
Sofie Van Landeghem	b04738903e	prevent None in gold fields (#5425 ) * set gold fields to empty list instead of keeping them as None * add unit test	2020-05-13 22:08:50 +02:00
adrianeboyd	113e7981d0	Check that row is within bounds when adding vector (#5430 ) Check that row is within bounds for the vector data array when adding a vector. Don't add vectors with rank OOV_RANK in `init-model` (change is due to shift from OOV as 0 to OOV as OOV_RANK).	2020-05-13 22:08:28 +02:00
adrianeboyd	c963e269ba	Add method to update / reset pkuseg user dict (#5404 )	2020-05-08 11:21:46 +02:00
Samuel Rodríguez Medina	5e55bfa821	Fixed tests for Swedish that were written in Danish. (#5395 )	2020-05-05 14:06:27 +02:00
Samuel Rodríguez Medina	148b036e0c	Spanish like num improvement (#5381 ) * Add tests for Spanish like_num. * Add missing numbers in Spanish lexical attributes for like_num. * Modify Spanish test function name. * Add contributor agreement.	2020-04-30 11:13:23 +02:00
Samuel Rodríguez Medina	8602daba85	Swedish like_num (#5371 ) * Sign contributor agreement. * Add like_num functionality to Swedish. * Update spacy/tests/lang/sv/test_lex_attrs.py Co-Authored-By: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Update contributor agreement Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2020-04-29 21:25:22 +02:00
Sofie Van Landeghem	cfdaf99b80	Fix passing of component configuration (#5374 ) * add kwargs to to_disk methods in docs - otherwise crashes on 'exclude' argument * add fix and test for Issue 5137	2020-04-29 12:56:17 +02:00
Ines Montani	efec28ce70	Merge pull request #5367 from adrianeboyd/feature/simplify-warnings-v2	2020-04-29 12:55:37 +02:00
Sofie Van Landeghem	f67343295d	Update NEL examples and documentation (#5370 ) * simplify creation of KB by skipping dim reduction * small fixes to train EL example script * add KB creation and NEL training example scripts to example section * update descriptions of example scripts in the documentation * moving wiki_entity_linking folder from bin to projects * remove test for wiki NEL functionality that is being moved	2020-04-29 12:53:53 +02:00
adrianeboyd	a6e521cd79	Add is_sent_end token property (#5375 ) Reconstruction of the original PR #4697 by @MiniLau. Removes unused `SENT_END` symbol and `IS_SENT_END` from `Matcher` schema because the Matcher is only going to be able to support `IS_SENT_START`.	2020-04-29 12:53:16 +02:00
Adriane Boyd	bc39f97e11	Simplify warnings	2020-04-28 13:37:37 +02:00
adrianeboyd	f8ac5b9f56	bugfix in span similarity (#5155 ) (#5358 ) * bugfix in span similarity * also rewrite doc.pyx for clarity * formatting Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2020-04-27 16:51:27 +02:00
Sofie Van Landeghem	9203d821ae	Add 2 ini files in tests/lang (#5359 )	2020-04-27 13:01:54 +02:00
Punitvara	b2b7e1f37a	This PR adds Gujarati Language class along with (#5355 ) * This PR adds Gujarati Language class along with - stop words * Add test for gu tokenizer	2020-04-27 11:07:37 +02:00
sabiqueqb	fc91660aa2	Gh 5339 language class for malayalam (#5342 ) * Initialize Malayalam Language class * Add lex_attrs and examples for Malayalam * Add spaCy Contributor Agreement * Add test for ml tokenizer	2020-04-27 09:45:08 +02:00
adrianeboyd	84e06f9fb7	Improve GoldParse NER alignment (#5335 ) Improve GoldParse NER alignment by including all cases where the start and end of the NER span can be aligned, regardless of internal tokenization differences. To do this, convert BILUO tags to character offsets, check start/end alignment with `doc.char_span()`, and assign the BILUO tags for the aligned spans. Alignment for `O/-` tags is handled through the one-to-one and multi alignments.	2020-04-23 16:58:23 +02:00
adrianeboyd	521f361052	Switch to new gold.align method (#5334 ) * Switch from original `_align` to new simpler alignment algorithm from #4526 * Remove alignment normalizations beyond whitespace and lowercasing	2020-04-21 19:31:03 +02:00
adrianeboyd	f7471abd82	Add pkuseg and serialization support for Chinese (#5308 ) * Add pkuseg and serialization support for Chinese Add support for pkuseg alongside jieba * Specify model through `Language` meta: * split on characters (if no word segmentation packages are installed) ``` Chinese(meta={"tokenizer": {"config": {"use_jieba": False, "use_pkuseg": False}}}) ``` * jieba (remains the default tokenizer if installed) ``` Chinese() Chinese(meta={"tokenizer": {"config": {"use_jieba": True}}}) # explicit ``` * pkuseg ``` Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "use_jieba": False, "use_pkuseg": True}}}) ``` * The new tokenizer setting `require_pkuseg` is used to override `use_jieba` default, which is intended for models that provide a pkuseg model: ``` nlp_pkuseg = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "default", "require_pkuseg": True}}}) nlp = Chinese() # has `use_jieba` as `True` by default nlp.from_bytes(nlp_pkuseg.to_bytes()) # `require_pkuseg` overrides `use_jieba` when calling the tokenizer ``` Add support for serialization of tokenizer settings and pkuseg model, if loaded * Add sorting for `Language.to_bytes()` serialization of `Language.meta` so that the (emptied, but still present) tokenizer metadata is in a consistent position in the serialized data Extend tests to cover all three tokenizer configurations and serialization * Fix from_disk and tests without jieba or pkuseg * Load cfg first and only show error if `use_pkuseg` * Fix blank/default initialization in serialization tests * Explicitly initialize jieba's cache on init * Add serialization for pkuseg pre/postprocessors * Reformat pkuseg install message	2020-04-18 17:01:53 +02:00
Jakob Jul Elben	663333c3b2	Fixes #5413 (#5315 ) * Fix 5314 * Add contributor * Resolve requested changes Co-authored-by: Jakob Jul Elben <jakob@datamaga.com>	2020-04-16 13:29:02 +02:00
Paolo Arduin	1ca32d8f9c	Matcher support for Span as well as Doc (#5113 ) * Matcher support for Span, as well as Doc #5056 * Removes an import unused * Signed contributors agreement * Code optimization and better test * Add error message for bad Matcher call argument * Fix merging	2020-04-15 13:51:33 +02:00
adrianeboyd	98c59027ed	Use max(uint64) for OOV lexeme rank (#5303 ) * Use max(uint64) for OOV lexeme rank * Add test for default OOV rank * Revert back to thinc==7.4.0 Requiring the updated version of thinc was unnecessary. * Define OOV_RANK in one place Define OOV_RANK in one place in `util`. * Fix formatting [ci skip] * Switch to external definitions of max(uint64) Switch to external defintions of max(uint64) and confirm that they are equal.	2020-04-15 13:49:47 +02:00
adrianeboyd	3d2c308906	Add Doc init from list of words and text (#5251 ) * Add Doc init from list of words and text Add an option to initialize a `Doc` from a text and list of words where the words may or may not include all whitespace tokens. If the text and words are mismatched, raise an error. * Fix error code * Remove all whitespace before aligning words/text * Move words/text init to util function * Update error message * Rename to get_words_and_spaces * Fix formatting	2020-04-14 19:15:52 +02:00
Paolo Arduin	8ce408d2e1	Comparison predicate handling for `!=` (#5282 ) * Fix #5281 * Optim test	2020-04-14 19:14:15 +02:00
adrianeboyd	cf579a398d	Add __init__.py to eu and hy tests (#5278 )	2020-04-08 20:03:06 +02:00
YohannesDatasci	beef184e53	Armenian language support (#5246 ) * add Armenian language and test cases * agreement submission	2020-04-03 13:02:18 +02:00
adrianeboyd	963bd890c1	Modify Vector.resize to work with cupy and improve resizing (#5216 ) * Modify Vector.resize to work with cupy Modify `Vectors.resize` to work with cupy. Modify behavior when resizing to a different vector dimension so that individual vectors are truncated or extended with zeros instead of having the original values filled into the new shape without regard for the original axes. * Update spacy/tests/vocab_vectors/test_vectors.py Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com> Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2020-03-29 13:51:20 +02:00
Ines Montani	828acffc12	Tidy up and auto-format	2020-03-25 12:28:12 +01:00
adrianeboyd	86c43e55fa	Improve Lithuanian tokenization (#5205 ) * Improve Lithuanian tokenization Modify Lithuanian tokenization to improve performance for UD_Lithuanian-ALKSNIS. * Update Lithuanian tokenizer tests	2020-03-25 11:28:12 +01:00
Adriane Boyd	09d442f5ad	Merge remote-tracking branch 'upstream/master' into feature/ud-tokenization-da	2020-03-25 09:41:52 +01:00
Adriane Boyd	cba2d1d972	Disable failing abbreviation test UD_Danish-DDT has (as far as I can tell) hallucinated periods after abbreviations, so the changes are an artifact of the corpus and not due to anything meaningful about Danish tokenization.	2020-03-25 09:39:26 +01:00
Adriane Boyd	423849f94a	Fix sents comparison in test util Due to changes to `Span` (#5005), spans from different documents are now never equal. Check `Token.is_sent_start` values instead.	2020-03-13 09:25:23 +01:00
Sofie Van Landeghem	1a2b8fc264	set vector of merged entity (#5085 ) * merge_entities sets the vector in the vocab for the merged token * add unit test * import unicode_literals * move code to _merge function * only set vector if vocab has non-zero vectors	2020-03-06 14:45:28 +01:00
Muhammad Irfan	03376c9d9b	Basque language added and tested.	2020-03-04 11:58:56 +05:00
adrianeboyd	9be90dbca3	Improve token head verification (#5079 ) * Improve token head verification Improve the verification for valid token heads when heads are set: * in `Token.head`: heads come from the same document * in `Doc.from_array()`: head indices are within the bounds of the document * Improve error message	2020-03-03 21:44:51 +01:00
Sofie Van Landeghem	d307e9ca58	take care of global vectors in multiprocessing (#5081 ) * restore load_nlp.VECTORS in the child process * add unit test * fix test * remove unnecessary import * add utf8 encoding * import unicode_literals	2020-03-03 13:58:22 +01:00
adrianeboyd	697bec764d	Normalize IS_SENT_START to SENT_START for Matcher (#5080 )	2020-03-03 12:22:39 +01:00
adrianeboyd	2281c4708c	Restore empty tokenizer properties (#5026 ) * Restore empty tokenizer properties * Check for types in tokenizer.from_bytes() * Add test for setting empty tokenizer rules	2020-03-02 11:55:02 +01:00
Sofie Van Landeghem	c6b12ab02a	Bugfix/get doc (#5049 ) * new (broken) unit test * fixing get_doc method	2020-03-02 11:49:28 +01:00

1 2 3 4 5 ...

1546 Commits