spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-04-16 23:21:58 +03:00

Author	SHA1	Message	Date
Ines Montani	de11ea753a	Merge branch 'master' into develop	2020-02-18 14:47:23 +01:00
adrianeboyd	3b22eb651b	Sync Span __eq__ and __hash__ (#5005 ) * Sync Span __eq__ and __hash__ Use the same tuple for `__eq__` and `__hash__`, including all attributes except `vector` and `vector_norm`. * Update entity comparison in tests Update `assert_docs_equal()` test util to compare `Span` properties for ents rather than `Span` objects.	2020-02-16 17:20:36 +01:00
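The idea in the commit above, using one tuple for both `__eq__` and `__hash__`, can be sketched in plain Python. This is a minimal illustration, not spaCy's `Span`: the class `SpanLike`, its fields, and the helper `_key()` are hypothetical names chosen for the example.

```python
class SpanLike:
    """Hypothetical stand-in for a span; not spaCy's Span class."""

    def __init__(self, doc_id, start, end, label="", kb_id=""):
        self.doc_id = doc_id
        self.start = start
        self.end = end
        self.label = label
        self.kb_id = kb_id

    def _key(self):
        # A single tuple drives both comparison and hashing,
        # so the two methods cannot drift apart.
        return (self.doc_id, self.start, self.end, self.label, self.kb_id)

    def __eq__(self, other):
        if not isinstance(other, SpanLike):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        return hash(self._key())


a = SpanLike(doc_id=1, start=2, end=4, label="PERSON")
b = SpanLike(doc_id=1, start=2, end=4, label="PERSON")
c = SpanLike(doc_id=1, start=2, end=4, label="ORG")
unique = {a, b, c}  # a and b collapse into one set entry
```

Because equal objects always share a hash here, instances behave predictably as dict keys and set members.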
adrianeboyd	5b102963bf	Require HEAD for is_parsed in Doc.from_array() (#5011 ) Modify flag settings so that `DEP` is not sufficient to set `is_parsed` and only run `set_children_from_heads()` if `HEAD` is provided. Then the combination `[SENT_START, DEP]` will set deps and not clobber sent starts with a lot of one-word sentences.	2020-02-16 17:17:09 +01:00
adrianeboyd	5ee9d8c9b8	Add MORPH attr, add support in retokenizer (#4947 ) * Add MORPH attr / symbol for token attrs * Update retokenizer for MORPH	2020-01-29 17:45:46 +01:00
adrianeboyd	adc9745718	Modify morphology to support arbitrary features (#4932 ) * Restructure tag maps for MorphAnalysis changes Prepare tag maps for upcoming MorphAnalysis changes that allow arbitrary features. * Use default tag map rather than duplicating for ca / uk / vi * Import tag map into defaults for ga * Modify tag maps so all morphological fields and features are strings * Move features from `"Other"` to the top level * Rewrite tuples as strings separated by `","` * Rewrite morph symbols for fr lemmatizer as strings * Export MorphAnalysis under spacy.tokens * Modify morphology to support arbitrary features Modify `Morphology` and `MorphAnalysis` so that arbitrary features are supported. * Modify `MorphAnalysisC` so that it can support arbitrary features and multiple values per field. `MorphAnalysisC` is redesigned to contain: * key: hash of UD FEATS string of morphological features * array of `MorphFeatureC` structs that each contain a hash of `Field` and `Field=Value` for a given morphological feature, which makes it possible to: * find features by field * represent multiple values for a given field * `get_field()` is renamed to `get_by_field()` and is no longer `nogil`. Instead a new helper function `get_n_by_field()` is `nogil` and returns `n` features by field. * `MorphAnalysis.get()` returns all possible values for a field as a list of individual features such as `["Tense=Pres", "Tense=Past"]`. * `MorphAnalysis`'s `str()` and `repr()` are the UD FEATS string. * `Morphology.feats_to_dict()` converts a UD FEATS string to a dict where: * Each field has one entry in the dict * Multiple values remain separated by a separator in the value string * `Token.morph_` returns the UD FEATS string and you can set `Token.morph_` with a UD FEATS string or with a tag map dict. * Modify get_by_field to use np.ndarray Modify `get_by_field()` to use np.ndarray. Remove `max_results` from `get_n_by_field()` and always iterate over all the fields. 
* Rewrite without MorphFeatureC * Add shortcut for existing feats strings as keys Add shortcut for existing feats strings as keys in `Morphology.add()`. * Check for '_' as empty analysis when adding morphs * Extend helper converters in Morphology Add and extend helper converters that convert and normalize between: * UD FEATS strings (`"Case=dat,gen\|Number=sing"`) * per-field dict of feats (`{"Case": "dat,gen", "Number": "sing"}`) * list of individual features (`["Case=dat", "Case=gen", "Number=sing"]`) All converters sort fields and values where applicable.	2020-01-23 22:01:54 +01:00
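The three representations named in the commit above (UD FEATS string, per-field dict, list of individual features) can be converted between each other as in the sketch below. These are standalone illustrative functions written for this example, not the `Morphology` methods themselves. The plain `sorted()` ordering and the empty-string result for an empty dict are assumptions.

```python
def feats_to_dict(feats):
    """UD FEATS string -> per-field dict, with fields and values sorted."""
    if not feats or feats == "_":  # "_" is treated as the empty analysis
        return {}
    result = {}
    for pair in feats.split("|"):
        field, values = pair.split("=", 1)
        result[field] = ",".join(sorted(values.split(",")))
    return dict(sorted(result.items()))


def dict_to_feats(feats_dict):
    """Per-field dict -> normalized UD FEATS string."""
    parts = []
    for field in sorted(feats_dict):
        values = ",".join(sorted(feats_dict[field].split(",")))
        parts.append(f"{field}={values}")
    return "|".join(parts)


def feats_to_list(feats):
    """UD FEATS string -> list of individual Field=Value features."""
    return [
        f"{field}={value}"
        for field, values in feats_to_dict(feats).items()
        for value in values.split(",")
    ]


def list_to_feats(features):
    """List of individual features -> normalized UD FEATS string."""
    grouped = {}
    for feature in features:
        field, value = feature.split("=", 1)
        grouped.setdefault(field, set()).add(value)
    return dict_to_feats({f: ",".join(v) for f, v in grouped.items()})
```

For example, `feats_to_dict("Number=sing|Case=gen,dat")` gives `{"Case": "dat,gen", "Number": "sing"}`, and converting that dict back yields the sorted string `"Case=dat,gen|Number=sing"`.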
Ines Montani	33a2682d60	Add better schemas and validation using Pydantic (#4831 ) * Remove unicode declarations * Remove Python 3.5 and 2.7 from CI * Don't require pathlib * Replace compat helpers * Remove OrderedDict * Use f-strings * Set Cython compiler language level * Fix typo * Re-add OrderedDict for Table * Update setup.cfg * Revert CONTRIBUTING.md * Add better schemas and validation using Pydantic * Revert lookups.md * Remove unused import * Update spacy/schemas.py Co-Authored-By: Sebastián Ramírez <tiangolo@gmail.com> * Various small fixes * Fix docstring Co-authored-by: Sebastián Ramírez <tiangolo@gmail.com>	2019-12-25 12:39:49 +01:00
Ines Montani	db55577c45	Drop Python 2.7 and 3.5 (#4828 ) * Remove unicode declarations * Remove Python 3.5 and 2.7 from CI * Don't require pathlib * Replace compat helpers * Remove OrderedDict * Use f-strings * Set Cython compiler language level * Fix typo * Re-add OrderedDict for Table * Update setup.cfg * Revert CONTRIBUTING.md * Revert lookups.md * Revert top-level.md * Small adjustments and docs [ci skip]	2019-12-22 01:53:56 +01:00
tamuhey	1707e77c5e	add char_span to Span (#4793 )	2019-12-13 15:54:58 +01:00
adrianeboyd	91f89f9693	Fix realloc in retokenizer.split() (#4606 ) Always realloc to a size larger than `doc.max_length` in `retokenizer.split()` (or cymem will throw errors).	2019-11-11 16:26:46 +01:00
adrianeboyd	6f54e59fe7	Fix util.filter_spans() to prefer first span in overlapping sam… (#4414 ) * Update util.filter_spans() to prefer earlier spans * Add filter_spans test for first same-length span * Update entity relation example to refer to util.filter_spans()	2019-10-10 17:00:03 +02:00
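The tie-breaking rule described in the commit above (when overlapping spans have the same length, prefer the earlier one) can be sketched on plain `(start, end)` token-offset tuples. The function `filter_overlapping` is a hypothetical name for this illustration, and the longest-first greedy strategy is an assumption about the general approach, not a copy of `util.filter_spans()`.

```python
def filter_overlapping(spans):
    """Keep non-overlapping (start, end) spans, end exclusive.

    Longer spans win; among spans of equal length, the earlier start wins.
    """
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    seen_tokens = set()
    for start, end in ordered:
        # Accept a span only if none of its tokens is already covered.
        if seen_tokens.isdisjoint(range(start, end)):
            kept.append((start, end))
            seen_tokens.update(range(start, end))
    return sorted(kept)
```

With this ordering, `filter_overlapping([(1, 3), (0, 2)])` keeps `(0, 2)`, because both spans cover two tokens and `(0, 2)` starts first.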
Ines Montani	cf65a80f36	Refactor lemmatizer and data table integration (#4353 ) * Move test * Allow default in Lookups.get_table * Start with blank tables in Lookups.from_bytes * Refactor lemmatizer to hold instance of Lookups * Get lookups table within the lemmatization methods to make sure it references the correct table (even if the table was replaced or modified, e.g. when loading a model from disk) * Deprecate other arguments on Lemmatizer.__init__ and expect Lookups for consistency * Remove old and unsupported Lemmatizer.load classmethod * Refactor language-specific lemmatizers to inherit as much as possible from base class and override only what they need * Update tests and docs * Fix more tests * Fix lemmatizer * Upgrade pytest to try and fix weird CI errors * Try pytest 4.6.5	2019-10-01 21:36:03 +02:00
Ines Montani	f7d1736241	Skip duplicate spans in Doc.retokenize (#4339 )	2019-09-30 12:43:48 +02:00
Ines Montani	0226b3bf0e	Fix test imports	2019-09-29 17:34:56 +02:00
Ines Montani	3d8fd4b461	Revert #4334	2019-09-29 17:32:12 +02:00
Ines Montani	c9cd516d96	Move tests out of package (#4334 ) * Move tests out of package * Fix typo	2019-09-28 18:05:00 +02:00
Matthew Honnibal	46c02d25b1	Merge changes to test_ner	2019-09-18 21:41:24 +02:00
Sofie Van Landeghem	de5a9ecdf3	Distinction between outside, missing and blocked NER annotations (#4307 ) * remove duplicate unit test * unit test (currently failing) for issue 4267 * bugfix: ensure doc.ents preserves kb_id annotations * fix in setting doc.ents with empty label * rename * test for presetting an entity to a certain type * allow overwriting Outside + blocking presets * fix actions when previous label needs to be kept * fix default ent_iob in set entities * cleaner solution with U- action * remove debugging print statements * unit tests with explicit transitions and is_valid testing * remove U- from move_names explicitly * remove unit tests with pre-trained models that don't work * remove (working) unit tests with pre-trained models * clean up unit tests * move unit tests * small fixes * remove two TODO's from doc.ents comments	2019-09-18 21:37:17 +02:00
Ines Montani	3c3658ef9f	Merge branch 'master' into develop	2019-09-12 18:03:01 +02:00
Paul O'Leary McCann	7d8df69158	Bloom-filter backed Lookup Tables (#4268 ) * Improve load_language_data helper * WIP: Add Lookups implementation * Start moving lemma data over to JSON * WIP: move data over for more languages * Convert more languages * Fix lemmatizer fixtures in tests * Finish conversion * Auto-format JSON files * Fix test for now * Make sure tables are stored on instance * Update docstrings * Update docstrings and errors * Update test * Add Lookups.__len__ * Add serialization methods * Add Lookups.remove_table * Use msgpack for serialization to disk * Fix file exists check * Try using OrderedDict for everything * Update .flake8 [ci skip] * Try fixing serialization * Update test_lookups.py * Update test_serialize_vocab_strings.py * Lookups / Tables now work This implements the stubs in the Lookups/Table classes. Currently this is in Cython but with no type declarations, so that could be improved. * Add lookups to setup.py * Actually add lookups pyx The previous commit added the old py file... * Lookups work-in-progress * Move from pyx back to py * Add string based lookups, fix serialization * Update tests, language/lemmatizer to work with string lookups There are some outstanding issues here: - a pickling-related test fails due to the bloom filter - some custom lemmatizers (fr/nl at least) have issues More generally, there's a question of how to deal with the case where you have a string but want to use the lookup table. Currently the table allows access by string or id, but that's getting pretty awkward. * Change lemmatizer lookup method to pass (orth, string) * Fix token lookup * Fix French lookup * Fix lt lemmatizer test * Fix Dutch lemmatizer * Fix lemmatizer lookup test This was using a normal dict instead of a Table, so checks for the string instead of an integer key failed. * Make uk/nl/ru lemmatizer lookup methods consistent The mentioned tokenizers all have their own implementation of the `lookup` method, which accesses a `Lookups` table. 
The way that was called in `token.pyx` was changed so this should be updated to have the same arguments as `lookup` in `lemmatizer.py` (specifically (orth/id, string)). Prior to this change tests weren't failing, but there would probably be issues with normal use of a model. More tests should probably be added. Additionally, the language-specific `lookup` implementations seem like they might not be needed, since they handle things like lower-casing that aren't actually language specific. * Make recently added Greek method compatible * Remove redundant class/method Leftovers from a merge not cleaned up adequately.	2019-09-12 17:26:11 +02:00
Sofie Van Landeghem	9be4d1c105	Allow copying of user_data in as_doc (#4282 ) * Allow copying the user_data with as_doc + unit test * add option to docs * add typing * import fix * workaround to avoid bool clashing ... * bint instead of bool	2019-09-12 17:08:14 +02:00
Ines Montani	e82a8d0d7a	Merge branch 'master' into develop	2019-09-11 11:52:38 +02:00
Ines Montani	6279d74c65	Tidy up and auto-format	2019-09-11 11:38:22 +02:00
Matthew Honnibal	1a65c5b7af	Update develop from master	2019-09-08 18:21:41 +02:00
adrianeboyd	aec755d3a3	Modify retokenizer to use span root attributes (#4219 ) * Modify retokenizer to use span root attributes * tag/pos/morph are set to root tag/pos/morph * lemma and norm are reset and end up as orth (not ideal, but better than orth of first token) * Also handle individual merge case * Add test * Attempt to handle ent_iob and ent_type in merges * Fix check for whether B-ENT should become I-ENT * Move IOB consistency check to after attrs Move all IOB consistency checks after attrs are set and simplify to check entire document, modifying I to B at the beginning of the document or if the entity type of the previous token isn't the same. * Move IOB consistency check for single merge Move IOB consistency check after the token array is compressed for the single merge case. * Update spacy/tokens/_retokenize.pyx Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com> * Remove single vs. multiple merge distinction Remove original single-instance `_merge()` and use `_bulk_merge()` (now renamed `_merge()`) for all merges. * Add out-of-bound check in previous entity check	2019-09-08 13:04:49 +02:00
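The IOB consistency check described in the commit above (change `I` to `B` at the beginning of the document, or when the previous token's entity type differs) can be sketched over a list of `(iob, ent_type)` pairs. The function `fix_iob` and the tuple representation are hypothetical; the actual check operates on the token array in `_retokenize.pyx`.

```python
def fix_iob(tags):
    """Return a copy of (iob, ent_type) pairs with inconsistent I tags turned into B."""
    fixed = []
    for i, (iob, ent_type) in enumerate(tags):
        if iob == "I":
            at_doc_start = i == 0
            # Compare against the previous token's entity type ("" for O tokens).
            prev_type_differs = i > 0 and fixed[i - 1][1] != ent_type
            if at_doc_start or prev_type_differs:
                iob = "B"
        fixed.append((iob, ent_type))
    return fixed


example = [("I", "PERSON"), ("I", "PERSON"), ("O", ""), ("I", "ORG"), ("I", "GPE")]
result = fix_iob(example)
```

The `i > 0` guard plays the role of the out-of-bounds check on the previous entity that the commit message mentions.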
Matthew Honnibal	bcd08f20af	Merge changes from master	2019-08-21 14:18:52 +02:00
Ines Montani	8baff1c7c0	💫 Improve introspection of custom extension attributes (#3729 ) * Add custom __dir__ to Underscore (see #3707) * Make sure custom extension methods keep their docstrings (see #3707) * Improve tests * Prepend note on partial to docstring (see #3707) * Remove print statement * Handle cases where docstring is None	2019-05-12 00:53:11 +02:00
Ines Montani	505c9e0e19	Add util.filter_spans helper (#3686 )	2019-05-08 02:33:40 +02:00
svlandeg	5b1cd49222	error msg and unit tests for setting kb_id on span	2019-03-22 12:05:35 +01:00
Ines Montani	278e9d2eb0	Merge branch 'master' into feature/lemmatizer	2019-03-16 13:44:22 +01:00
Sofie	c45ed32c74	label in span not writable anymore (#3408 ) * label in span not writable anymore * more explicit unit test and error message for readonly label * bit more explanation (view) * error msg tailored to specific case * fix None case	2019-03-15 00:46:45 +01:00
Matthew Honnibal	b0b990e405	Fix token.conjuncts (closes #795 ) (#3392 ) * Implement conjuncts method * Add span.conjuncts property * Un-xfail token.conjuncts tests * Update docs for token.conjuncts and span.conjuncts * Fix merge error in token.conjuncts	2019-03-11 17:05:45 +01:00
Matthew Honnibal	db79a704bf	Add xfail tests for token.conjuncts	2019-03-11 15:46:52 +01:00
Ines Montani	ebcf2bb1c3	Add Doc.lang and Doc.lang_	2019-03-11 14:21:40 +01:00
Ines Montani	7c05ca01e8	💫 Support mutable default values for extension attributes (#3389 ) * Support mutable default values in extensions * Update documentation	2019-03-11 12:50:44 +01:00
Ines Montani	7ba3a5d95c	💫 Make serialization methods consistent (#3385 ) * Make serialization methods consistent exclude keyword argument instead of random named keyword arguments and deprecation handling * Update docs and add section on serialization fields	2019-03-10 19:16:45 +01:00
Ines Montani	67e38690d4	Un-xfail passing tests and tidy up	2019-03-10 18:42:16 +01:00
Matthew Honnibal	8a6272f842	Un-xfail test	2019-03-10 15:51:15 +01:00
Ines Montani	0426689db8	💫 Improve Doc.to_json and add Doc.is_nered (#3381 ) * Use default return instead of else * Add Doc.is_nered to indicate if entities have been set * Add properties in Doc.to_json if they were set, not if they're available This way, if a processed Doc exports "pos": None, it means that the tag was explicitly unset. If it exports "ents": [], it means that entity annotations are available but that this document doesn't contain any entities. Before, this would have been unclear and problematic for training.	2019-03-10 15:24:34 +01:00
Ines Montani	7984543953	Add xfailing test for to_array/from_array string attrs	2019-03-10 15:08:15 +01:00
Ines Montani	6bbf4ea309	Simplify tests and avoid tokenizing	2019-03-10 15:05:56 +01:00
Ines Montani	ad834be494	Tidy up and auto-format	2019-03-08 13:28:53 +01:00
Matthew Honnibal	19e6b39786	Test morphological features	2019-03-08 01:38:54 +01:00
Matthew Honnibal	3c32590243	Add test for morph analysis	2019-03-08 00:10:07 +01:00
Matthew Honnibal	fed0371db7	Remove enums from morphology	2019-03-07 17:14:57 +01:00
Ines Montani	e359bdd0e3	Auto-format	2019-02-27 11:56:45 +01:00
Matthew Honnibal	4a3371acd5	Make doc[0].is_sent_start == True (closes #2869 ) (#3340 ) * Make doc[0] have sent_start True. Closes #2869 * Document that doc[0].is_sent_start defaults True.	2019-02-27 11:17:17 +01:00
Ines Montani	62b558ab72	💫 Support lexical attributes in retokenizer attrs (closes #2390 ) (#3325 ) * Fix formatting and whitespace * Add support for lexical attributes (closes #2390) * Document lexical attribute setting during retokenization * Assign variable outside of nested loop	2019-02-24 21:13:51 +01:00
Ines Montani	3bc53905cc	Remove print statements from test	2019-02-24 20:31:15 +01:00
Ines Montani	399a5803d0	Tidy up tests [ci skip]	2019-02-24 19:02:16 +01:00
Ines Montani	df19e2bff6	💫 Allow setting of custom attributes during retokenization (closes #3314 ) (#3324 ) ## Description This PR adds the ability to override custom extension attributes during merging. This will only work for attributes that are writable, i.e. attributes registered with a default value like `default=False` or attributes that have both a getter and a setter implemented.
```python
# Assumed setup (not in the original snippet): a blank English pipeline
import spacy
from spacy.tokens import Token

nlp = spacy.blank("en")

# Register a writable custom attribute with a default value
Token.set_extension("is_musician", default=False)
doc = nlp("I like David Bowie.")
with doc.retokenize() as retokenizer:
    # The "_" key carries custom extension attributes for the merged token
    attrs = {"LEMMA": "David Bowie", "_": {"is_musician": True}}
    retokenizer.merge(doc[2:4], attrs=attrs)
assert doc[2].text == "David Bowie"
assert doc[2].lemma_ == "David Bowie"
assert doc[2]._.is_musician
```
### Types of change enhancement ## Checklist - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-02-24 18:38:47 +01:00


126 Commits