spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-11-10 19:57:17 +03:00

Author	SHA1	Message	Date
Adriane Boyd	c9f0f75778	Update get_loss for senter and morphologizer (#5724 ) * Update get_loss for senter Update `SentenceRecognizer.get_loss` to keep it similar to `Tagger`. * Update get_loss for morphologizer Update `Morphologizer.get_loss` to keep it similar to `Tagger`.	2020-07-08 13:59:28 +02:00
Sofie Van Landeghem	8d3c0306e1	refactor fixes (#5664 ) * fixes in ud_train, UX for morphs * update pyproject with new version of thinc * fixes in debug_data script * cleanup of old unused error messages * remove obsolete TempErrors * move error messages to errors.py * add ENT_KB_ID to default DocBin serialization * few fixes to simple_ner * fix tags	2020-06-29 14:33:00 +02:00
Ines Montani	52728d8fa3	Merge branch 'develop' into master-tmp	2020-06-20 15:52:00 +02:00
Ines Montani	a8875d4a4b	Fix typo	2020-06-03 14:42:39 +02:00
Ines Montani	4e0610d0d4	Update warning codes	2020-06-03 14:37:09 +02:00
Ines Montani	810fce3bb1	Merge branch 'develop' into master-tmp	2020-06-03 14:36:59 +02:00
Adriane Boyd	b6b5908f5e	Prefer _SP over SP for default tag map space attrs If `_SP` is already in the tag map, use the mapping from `_SP` instead of `SP` so that `SP` can be a valid non-space tag. (Chinese has a non-space tag `SP` which was overriding the mapping of `_SP` to `SPACE`.)	2020-05-26 14:57:13 +02:00
Ines Montani	5d3806e059	unicode -> str consistency	2020-05-24 17:20:58 +02:00
Ines Montani	f44897e4c6	Update warning IDs	2020-05-21 18:39:11 +02:00
Ines Montani	70ee4ef4fd	Fix small errors	2020-03-26 13:47:31 +01:00
Ines Montani	b0cfab317f	Merge branch 'develop' into refactor/simplify-warnings	2020-03-04 16:38:55 +01:00
Ines Montani	648f61d077	Tidy up compiler flags and imports (#5071 )	2020-03-02 11:48:10 +01:00
Ines Montani	37691e6d5d	Simplify warnings	2020-02-28 12:20:23 +01:00
adrianeboyd	adc9745718	Modify morphology to support arbitrary features (#4932 ) * Restructure tag maps for MorphAnalysis changes Prepare tag maps for upcoming MorphAnalysis changes that allow arbitrary features. * Use default tag map rather than duplicating for ca / uk / vi * Import tag map into defaults for ga * Modify tag maps so all morphological fields and features are strings * Move features from `"Other"` to the top level * Rewrite tuples as strings separated by `","` * Rewrite morph symbols for fr lemmatizer as strings * Export MorphAnalysis under spacy.tokens * Modify morphology to support arbitrary features Modify `Morphology` and `MorphAnalysis` so that arbitrary features are supported. * Modify `MorphAnalysisC` so that it can support arbitrary features and multiple values per field. `MorphAnalysisC` is redesigned to contain: * key: hash of UD FEATS string of morphological features * array of `MorphFeatureC` structs that each contain a hash of `Field` and `Field=Value` for a given morphological feature, which makes it possible to: * find features by field * represent multiple values for a given field * `get_field()` is renamed to `get_by_field()` and is no longer `nogil`. Instead a new helper function `get_n_by_field()` is `nogil` and returns `n` features by field. * `MorphAnalysis.get()` returns all possible values for a field as a list of individual features such as `["Tense=Pres", "Tense=Past"]`. * `MorphAnalysis`'s `str()` and `repr()` are the UD FEATS string. * `Morphology.feats_to_dict()` converts a UD FEATS string to a dict where: * Each field has one entry in the dict * Multiple values remain separated by a separator in the value string * `Token.morph_` returns the UD FEATS string and you can set `Token.morph_` with a UD FEATS string or with a tag map dict. * Modify get_by_field to use np.ndarray Modify `get_by_field()` to use np.ndarray. Remove `max_results` from `get_n_by_field()` and always iterate over all the fields.
* Rewrite without MorphFeatureC * Add shortcut for existing feats strings as keys Add shortcut for existing feats strings as keys in `Morphology.add()`. * Check for '_' as empty analysis when adding morphs * Extend helper converters in Morphology Add and extend helper converters that convert and normalize between: * UD FEATS strings (`"Case=dat,gen|Number=sing"`) * per-field dict of feats (`{"Case": "dat,gen", "Number": "sing"}`) * list of individual features (`["Case=dat", "Case=gen", "Number=sing"]`) All converters sort fields and values where applicable.	2020-01-23 22:01:54 +01:00
Ines Montani	a892821c51	More formatting changes	2019-12-25 17:59:52 +01:00
Ines Montani	db55577c45	Drop Python 2.7 and 3.5 (#4828 ) * Remove unicode declarations * Remove Python 3.5 and 2.7 from CI * Don't require pathlib * Replace compat helpers * Remove OrderedDict * Use f-strings * Set Cython compiler language level * Fix typo * Re-add OrderedDict for Table * Update setup.cfg * Revert CONTRIBUTING.md * Revert lookups.md * Revert top-level.md * Small adjustments and docs [ci skip]	2019-12-22 01:53:56 +01:00
Ines Montani	16aa092fb5	Improve Morphology errors (#4314 ) * Improve Morphology errors * Also clean up some other errors * Update errors.py	2019-09-21 14:37:06 +02:00
Ines Montani	bab9976d9a	💫 Adjust Table API and add docs (#4289 ) * Adjust Table API and add docs * Add attributes and update description [ci skip] * Use strings.get_string_id instead of hash_string * Fix table method calls * Make orth arg in Lemmatizer.lookup optional Fall back to string, which is now handled by Table.__contains__ out-of-the-box * Fix method name * Auto-format	2019-09-15 22:08:13 +02:00
Paul O'Leary McCann	7d8df69158	Bloom-filter backed Lookup Tables (#4268 ) * Improve load_language_data helper * WIP: Add Lookups implementation * Start moving lemma data over to JSON * WIP: move data over for more languages * Convert more languages * Fix lemmatizer fixtures in tests * Finish conversion * Auto-format JSON files * Fix test for now * Make sure tables are stored on instance * Update docstrings * Update docstrings and errors * Update test * Add Lookups.__len__ * Add serialization methods * Add Lookups.remove_table * Use msgpack for serialization to disk * Fix file exists check * Try using OrderedDict for everything * Update .flake8 [ci skip] * Try fixing serialization * Update test_lookups.py * Update test_serialize_vocab_strings.py * Lookups / Tables now work This implements the stubs in the Lookups/Table classes. Currently this is in Cython but with no type declarations, so that could be improved. * Add lookups to setup.py * Actually add lookups pyx The previous commit added the old py file... * Lookups work-in-progress * Move from pyx back to py * Add string based lookups, fix serialization * Update tests, language/lemmatizer to work with string lookups There are some outstanding issues here: - a pickling-related test fails due to the bloom filter - some custom lemmatizers (fr/nl at least) have issues More generally, there's a question of how to deal with the case where you have a string but want to use the lookup table. Currently the table allows access by string or id, but that's getting pretty awkward. * Change lemmatizer lookup method to pass (orth, string) * Fix token lookup * Fix French lookup * Fix lt lemmatizer test * Fix Dutch lemmatizer * Fix lemmatizer lookup test This was using a normal dict instead of a Table, so checks for the string instead of an integer key failed. * Make uk/nl/ru lemmatizer lookup methods consistent The mentioned tokenizers all have their own implementation of the `lookup` method, which accesses a `Lookups` table. 
The way that was called in `token.pyx` was changed so this should be updated to have the same arguments as `lookup` in `lemmatizer.py` (specifically (orth/id, string)). Prior to this change tests weren't failing, but there would probably be issues with normal use of a model. More tests should probably be added. Additionally, the language-specific `lookup` implementations seem like they might not be needed, since they handle things like lower-casing that aren't actually language specific. * Make recently added Greek method compatible * Remove redundant class/method Leftovers from a merge not cleaned up adequately.	2019-09-12 17:26:11 +02:00
Matthew Honnibal	f7a096b462	Update morphology	2019-09-11 18:06:43 +02:00
Matthew Honnibal	c47c0269b1	Update morphology features	2019-09-11 15:16:53 +02:00
Matthew Honnibal	67c3d03905	Revert morphology serialisation	2019-08-30 13:13:07 +02:00
Adriane Boyd	893f11a9e3	Serialize tag_map directly Fix Aspect_prof typo	2019-08-30 11:30:03 +02:00
Matthew Honnibal	fc0a3c8c38	Add morphology serialization	2019-08-29 21:17:34 +02:00
Matthew Honnibal	188a1cf297	Fix morphology for | features	2019-08-25 21:57:02 +02:00
Ines Montani	278e9d2eb0	Merge branch 'master' into feature/lemmatizer	2019-03-16 13:44:22 +01:00
Matthew Honnibal	80b94313b6	💫 Fix interaction of lemmatizer and tokenizer exceptions (#3388 ) Closes #2203. Closes #3268. Lemmas set from outside the `Morphology` class were being overwritten. The result was especially confusing when deserialising, as it meant some lemmas could change when storing and retrieving a `Doc` object. This PR applies two fixes: 1) When we go to set the lemma in the `Morphology` class, first check whether a lemma is already set. If so, don't overwrite. 2) When we load with `doc.from_array()`, take care to apply the `TAG` field first. This allows other fields to overwrite the `TAG` implied properties, if they're provided explicitly (e.g. the `LEMMA`). ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-03-11 01:31:21 +01:00
Matthew Honnibal	5431c47b91	Refactor morphology slightly	2019-03-10 00:59:51 +00:00
Matthew Honnibal	0f12082465	Refactor morphologizer	2019-03-09 22:54:59 +00:00
Matthew Honnibal	41a3016019	Refactor morphologizer class map	2019-03-09 20:55:33 +01:00
Matthew Honnibal	eae384ebb2	Add POS to morphological fields	2019-03-09 11:49:44 +00:00
Matthew Honnibal	42bc3ad73b	Fix class mapping for morphologizer	2019-03-09 00:20:29 +00:00
Matthew Honnibal	09b26f5e2e	Fix compile error	2019-03-08 18:58:26 +01:00
Matthew Honnibal	d7ec1d62cb	Fix Morphologizer	2019-03-08 18:54:25 +01:00
Matthew Honnibal	322b64dca0	Allow lookup of morphology by attribute name	2019-03-08 01:38:15 +01:00
Matthew Honnibal	b5f2b7b454	Add list_features() helper, clean up	2019-03-08 00:08:35 +01:00
Matthew Honnibal	987ee6e884	Fix data reading in morphology	2019-03-07 21:58:43 +01:00
Matthew Honnibal	2669190b85	Normalize props for morph exceptions	2019-03-07 18:32:36 +01:00
Matthew Honnibal	fed0371db7	Remove enums from morphology	2019-03-07 17:14:57 +01:00
Matthew Honnibal	b9ade7d4e0	Add MorphAnalysisC struct	2019-03-07 14:03:07 +01:00
Matthew Honnibal	b69013e2d7	Fix passing of morphological features to lemmatizer	2019-03-07 13:11:38 +01:00
Matthew Honnibal	6734cfec88	Add comment	2019-03-07 12:14:37 +01:00
Matthew Honnibal	ae7c728c5f	Fix json dependency	2019-03-07 01:17:19 +01:00
Matthew Honnibal	2b8a53ebdc	Fix morphology functions	2018-09-26 21:03:57 +02:00
Matthew Honnibal	2be15fa7d2	Fix Python feature enum in morphology	2018-09-25 23:03:43 +02:00
Matthew Honnibal	a4fc397880	Add helper to parse features into field and column IDs	2018-09-25 22:13:10 +02:00
Matthew Honnibal	51a297f934	Fix morphology add and update	2018-09-25 21:07:08 +02:00
Matthew Honnibal	34cab8cc49	Update morphology API	2018-09-25 20:53:24 +02:00
Matthew Honnibal	4b7e772f5d	Implement the is_animacy_feature etc functions	2018-09-25 17:28:34 +02:00
Matthew Honnibal	8308c1525e	Fix exception loading	2018-09-25 15:18:21 +02:00

126 Commits
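
Commit adc9745718 above describes helper converters between three representations of morphological features: UD FEATS strings, per-field dicts, and lists of individual features. The following is a minimal standalone sketch of that idea, not spaCy's implementation. All four function names are hypothetical, plain `sorted()` ordering is an assumption, and returning `"_"` for an empty analysis is an assumption based on the commit's mention of `'_'` as the empty analysis.

```python
from typing import Dict, List


def feats_to_dict(feats: str) -> Dict[str, str]:
    """UD FEATS string -> per-field dict; multiple values stay comma-joined."""
    if not feats or feats == "_":  # assumption: "_" marks an empty analysis
        return {}
    result = {}
    for feat in feats.split("|"):
        field, values = feat.split("=", 1)
        # Sort the values within each field.
        result[field] = ",".join(sorted(values.split(",")))
    # Sort the fields as well (plain string order, an assumption).
    return dict(sorted(result.items()))


def dict_to_feats(feats_dict: Dict[str, str]) -> str:
    """Per-field dict -> normalized UD FEATS string."""
    if not feats_dict:
        return "_"
    parts = []
    for field, values in sorted(feats_dict.items()):
        parts.append(f"{field}={','.join(sorted(values.split(',')))}")
    return "|".join(parts)


def feats_to_list(feats: str) -> List[str]:
    """UD FEATS string -> list of individual Field=Value features."""
    return [
        f"{field}={value}"
        for field, values in feats_to_dict(feats).items()
        for value in values.split(",")
    ]


def list_to_feats(features: List[str]) -> str:
    """List of individual Field=Value features -> normalized UD FEATS string."""
    by_field: Dict[str, set] = {}
    for feat in features:
        field, value = feat.split("=", 1)
        by_field.setdefault(field, set()).add(value)
    return dict_to_feats({f: ",".join(sorted(v)) for f, v in by_field.items()})


feats = "Number=sing|Case=gen,dat"
print(feats_to_dict(feats))
print(feats_to_list(feats))
print(dict_to_feats(feats_to_dict(feats)))
```

Because every converter sorts fields and values, two FEATS strings that differ only in ordering normalize to the same string, which is what would make the string usable as a stable key.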