Rule-based French Lemmatizer (#2818)
<!--- Provide a general summary of your changes in the title. -->
## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->
Add a rule-based French Lemmatizer following the english one and the excellent PR for [greek language optimizations](https://github.com/explosion/spaCy/pull/2558) to adapt the Lemmatizer class.
### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->
- Lemma dictionary used can be found [here](http://infolingu.univ-mlv.fr/DonneesLinguistiques/Dictionnaires/telechargement.html), I used the XML version.
- Add several files containing exhaustive list of words for each part of speech
- Add some lemma rules
- Add POS that are not checked in the standard Lemmatizer, i.e PRON, DET, ADV and AUX
- Modify the Lemmatizer class to check in lookup table as a last resort if POS not mentionned
- Modify the lemmatize function to check in lookup table as a last resort
- Init files are updated so the model can support all the functionalities mentioned above
- Add words to tokenizer_exceptions_list.py in respect to regex used in tokenizer_exceptions.py
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [X] I have submitted the spaCy Contributor Agreement.
- [X] I ran the tests, and all new and existing tests passed.
- [X] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-10-13 17:38:21 +03:00
|
|
|
# coding: utf8
|
|
|
|
from __future__ import unicode_literals
|
|
|
|
|
Reduce size of language data (#4141)
* Move Turkish lemmas to a json file
Rather than a large dict in Python source, the data is now a big json
file. This includes a method for loading the json file, falling back to
a compressed file, and an update to MANIFEST.in that excludes json in
the spacy/lang directory.
This focuses on Turkish specifically because it has the most language
data in core.
* Transition all lemmatizer.py files to json
This covers all lemmatizer.py files of a significant size (>500k or so).
Small files were left alone.
None of the affected files have logic, so this was pretty
straightforward.
One unusual thing is that the lemma data for Urdu doesn't seem to be
used anywhere. That may require further investigation.
* Move large lang data to json for fr/nb/nl/sv
These are the languages that use a lemmatizer directory (rather than a
single file) and are larger than English.
For most of these languages there were many language data files, in
which case only the large ones (>500k or so) were converted to json. It
may or may not be a good idea to migrate the remaining Python files to
json in the future.
* Fix id lemmas.json
The contents of this file were originally just copied from the Python
source, but that used single quotes, so it had to be properly converted
to json first.
* Add .json.gz to gitignore
This covers the json.gz files built as part of distribution.
* Add language data gzip to build process
Currently this gzip data on every build; it works, but it should be
changed to only gzip when the source file has been updated.
* Remove Danish lemmatizer.py
Missed this when I added the json.
* Update to match latest explosion/srsly#9
The way gzipped json is loaded/saved in srsly changed a bit.
* Only compress language data if necessary
If a .json.gz file exists and is newer than the corresponding json file,
it's not recompressed.
* Move en/el language data to json
This only affected files >500kb, which was nouns for both languages and
the generic lookup table for English.
* Remove empty files in Norwegian tokenizer
It's unclear why, but the Norwegian (nb) tokenizer had empty files for
adj/adv/noun/verb lemmas. This may have been a result of copying the
structure of the English lemmatizer.
This removed the files, but still creates the empty sets in the
lemmatizer. That may not actually be necessary.
* Remove dubious entries in English lookup.json
" furthest" and " skilled" - both prefixed with a space - were in the
English lookup table. That seems obviously wrong so I have removed them.
* Fix small issues with en/fr lemmatizers
The en tokenizer was including the removed _nouns.py file, so that's
removed.
The fr tokenizer is unusual in that it has a lemmatizer directory with
both __init__.py and lemmatizer.py. lemmatizer.py had not been converted
to load the json language data, so that was fixed.
* Auto-format
* Auto-format
* Update srsly pin
* Consistently use pathlib paths
2019-08-20 15:54:11 +03:00
|
|
|
from pathlib import Path
|
|
|
|
|
2019-01-10 17:41:15 +03:00
|
|
|
from ....symbols import POS, NOUN, VERB, ADJ, ADV, PRON, DET, AUX, PUNCT, ADP, SCONJ, CCONJ
|
Rule-based French Lemmatizer (#2818)
<!--- Provide a general summary of your changes in the title. -->
## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->
Add a rule-based French Lemmatizer following the english one and the excellent PR for [greek language optimizations](https://github.com/explosion/spaCy/pull/2558) to adapt the Lemmatizer class.
### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->
- Lemma dictionary used can be found [here](http://infolingu.univ-mlv.fr/DonneesLinguistiques/Dictionnaires/telechargement.html), I used the XML version.
- Add several files containing exhaustive list of words for each part of speech
- Add some lemma rules
- Add POS that are not checked in the standard Lemmatizer, i.e PRON, DET, ADV and AUX
- Modify the Lemmatizer class to check in lookup table as a last resort if POS not mentionned
- Modify the lemmatize function to check in lookup table as a last resort
- Init files are updated so the model can support all the functionalities mentioned above
- Add words to tokenizer_exceptions_list.py in respect to regex used in tokenizer_exceptions.py
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [X] I have submitted the spaCy Contributor Agreement.
- [X] I ran the tests, and all new and existing tests passed.
- [X] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-10-13 17:38:21 +03:00
|
|
|
from ....symbols import VerbForm_inf, VerbForm_none, Number_sing, Degree_pos
|
Reduce size of language data (#4141)
* Move Turkish lemmas to a json file
Rather than a large dict in Python source, the data is now a big json
file. This includes a method for loading the json file, falling back to
a compressed file, and an update to MANIFEST.in that excludes json in
the spacy/lang directory.
This focuses on Turkish specifically because it has the most language
data in core.
* Transition all lemmatizer.py files to json
This covers all lemmatizer.py files of a significant size (>500k or so).
Small files were left alone.
None of the affected files have logic, so this was pretty
straightforward.
One unusual thing is that the lemma data for Urdu doesn't seem to be
used anywhere. That may require further investigation.
* Move large lang data to json for fr/nb/nl/sv
These are the languages that use a lemmatizer directory (rather than a
single file) and are larger than English.
For most of these languages there were many language data files, in
which case only the large ones (>500k or so) were converted to json. It
may or may not be a good idea to migrate the remaining Python files to
json in the future.
* Fix id lemmas.json
The contents of this file were originally just copied from the Python
source, but that used single quotes, so it had to be properly converted
to json first.
* Add .json.gz to gitignore
This covers the json.gz files built as part of distribution.
* Add language data gzip to build process
Currently this gzip data on every build; it works, but it should be
changed to only gzip when the source file has been updated.
* Remove Danish lemmatizer.py
Missed this when I added the json.
* Update to match latest explosion/srsly#9
The way gzipped json is loaded/saved in srsly changed a bit.
* Only compress language data if necessary
If a .json.gz file exists and is newer than the corresponding json file,
it's not recompressed.
* Move en/el language data to json
This only affected files >500kb, which was nouns for both languages and
the generic lookup table for English.
* Remove empty files in Norwegian tokenizer
It's unclear why, but the Norwegian (nb) tokenizer had empty files for
adj/adv/noun/verb lemmas. This may have been a result of copying the
structure of the English lemmatizer.
This removed the files, but still creates the empty sets in the
lemmatizer. That may not actually be necessary.
* Remove dubious entries in English lookup.json
" furthest" and " skilled" - both prefixed with a space - were in the
English lookup table. That seems obviously wrong so I have removed them.
* Fix small issues with en/fr lemmatizers
The en tokenizer was including the removed _nouns.py file, so that's
removed.
The fr tokenizer is unusual in that it has a lemmatizer directory with
both __init__.py and lemmatizer.py. lemmatizer.py had not been converted
to load the json language data, so that was fixed.
* Auto-format
* Auto-format
* Update srsly pin
* Consistently use pathlib paths
2019-08-20 15:54:11 +03:00
|
|
|
from ....util import load_language_data
|
|
|
|
|
|
|
|
LOOKUP = load_language_data(Path(__file__).parent / 'lookup.json')
|
Rule-based French Lemmatizer (#2818)
<!--- Provide a general summary of your changes in the title. -->
## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->
Add a rule-based French Lemmatizer following the english one and the excellent PR for [greek language optimizations](https://github.com/explosion/spaCy/pull/2558) to adapt the Lemmatizer class.
### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->
- Lemma dictionary used can be found [here](http://infolingu.univ-mlv.fr/DonneesLinguistiques/Dictionnaires/telechargement.html), I used the XML version.
- Add several files containing exhaustive list of words for each part of speech
- Add some lemma rules
- Add POS that are not checked in the standard Lemmatizer, i.e PRON, DET, ADV and AUX
- Modify the Lemmatizer class to check in lookup table as a last resort if POS not mentionned
- Modify the lemmatize function to check in lookup table as a last resort
- Init files are updated so the model can support all the functionalities mentioned above
- Add words to tokenizer_exceptions_list.py in respect to regex used in tokenizer_exceptions.py
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [X] I have submitted the spaCy Contributor Agreement.
- [X] I ran the tests, and all new and existing tests passed.
- [X] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-10-13 17:38:21 +03:00
|
|
|
|
|
|
|
'''
|
|
|
|
French language lemmatizer applies the default rule based lemmatization
|
|
|
|
procedure with some modifications for better French language support.
|
|
|
|
|
2019-01-10 17:41:15 +03:00
|
|
|
The parts of speech 'ADV', 'PRON', 'DET', 'ADP' and 'AUX' are added to use the
|
Rule-based French Lemmatizer (#2818)
<!--- Provide a general summary of your changes in the title. -->
## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->
Add a rule-based French Lemmatizer following the english one and the excellent PR for [greek language optimizations](https://github.com/explosion/spaCy/pull/2558) to adapt the Lemmatizer class.
### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->
- Lemma dictionary used can be found [here](http://infolingu.univ-mlv.fr/DonneesLinguistiques/Dictionnaires/telechargement.html), I used the XML version.
- Add several files containing exhaustive list of words for each part of speech
- Add some lemma rules
- Add POS that are not checked in the standard Lemmatizer, i.e PRON, DET, ADV and AUX
- Modify the Lemmatizer class to check in lookup table as a last resort if POS not mentionned
- Modify the lemmatize function to check in lookup table as a last resort
- Init files are updated so the model can support all the functionalities mentioned above
- Add words to tokenizer_exceptions_list.py in respect to regex used in tokenizer_exceptions.py
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [X] I have submitted the spaCy Contributor Agreement.
- [X] I ran the tests, and all new and existing tests passed.
- [X] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-10-13 17:38:21 +03:00
|
|
|
rule-based lemmatization. As a last resort, the lemmatizer checks in
|
|
|
|
the lookup table.
|
|
|
|
'''
|
|
|
|
|
|
|
|
class FrenchLemmatizer(object):
|
|
|
|
@classmethod
|
|
|
|
def load(cls, path, index=None, exc=None, rules=None, lookup=None):
|
|
|
|
return cls(index, exc, rules, lookup)
|
|
|
|
|
|
|
|
def __init__(self, index=None, exceptions=None, rules=None, lookup=None):
|
|
|
|
self.index = index
|
|
|
|
self.exc = exceptions
|
|
|
|
self.rules = rules
|
|
|
|
self.lookup_table = lookup if lookup is not None else {}
|
|
|
|
|
|
|
|
def __call__(self, string, univ_pos, morphology=None):
|
|
|
|
if not self.rules:
|
|
|
|
return [self.lookup_table.get(string, string)]
|
|
|
|
if univ_pos in (NOUN, 'NOUN', 'noun'):
|
|
|
|
univ_pos = 'noun'
|
|
|
|
elif univ_pos in (VERB, 'VERB', 'verb'):
|
|
|
|
univ_pos = 'verb'
|
|
|
|
elif univ_pos in (ADJ, 'ADJ', 'adj'):
|
|
|
|
univ_pos = 'adj'
|
2019-01-10 17:41:15 +03:00
|
|
|
elif univ_pos in (ADP, 'ADP', 'adp'):
|
|
|
|
univ_pos = 'adp'
|
Rule-based French Lemmatizer (#2818)
<!--- Provide a general summary of your changes in the title. -->
## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->
Add a rule-based French Lemmatizer following the english one and the excellent PR for [greek language optimizations](https://github.com/explosion/spaCy/pull/2558) to adapt the Lemmatizer class.
### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->
- Lemma dictionary used can be found [here](http://infolingu.univ-mlv.fr/DonneesLinguistiques/Dictionnaires/telechargement.html), I used the XML version.
- Add several files containing exhaustive list of words for each part of speech
- Add some lemma rules
- Add POS that are not checked in the standard Lemmatizer, i.e PRON, DET, ADV and AUX
- Modify the Lemmatizer class to check in lookup table as a last resort if POS not mentionned
- Modify the lemmatize function to check in lookup table as a last resort
- Init files are updated so the model can support all the functionalities mentioned above
- Add words to tokenizer_exceptions_list.py in respect to regex used in tokenizer_exceptions.py
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [X] I have submitted the spaCy Contributor Agreement.
- [X] I ran the tests, and all new and existing tests passed.
- [X] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-10-13 17:38:21 +03:00
|
|
|
elif univ_pos in (ADV, 'ADV', 'adv'):
|
|
|
|
univ_pos = 'adv'
|
|
|
|
elif univ_pos in (AUX, 'AUX', 'aux'):
|
|
|
|
univ_pos = 'aux'
|
2019-01-10 17:41:15 +03:00
|
|
|
elif univ_pos in (CCONJ, 'CCONJ', 'cconj'):
|
|
|
|
univ_pos = 'cconj'
|
|
|
|
elif univ_pos in (DET, 'DET', 'det'):
|
|
|
|
univ_pos = 'det'
|
|
|
|
elif univ_pos in (PRON, 'PRON', 'pron'):
|
|
|
|
univ_pos = 'pron'
|
Rule-based French Lemmatizer (#2818)
<!--- Provide a general summary of your changes in the title. -->
## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->
Add a rule-based French Lemmatizer following the english one and the excellent PR for [greek language optimizations](https://github.com/explosion/spaCy/pull/2558) to adapt the Lemmatizer class.
### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->
- Lemma dictionary used can be found [here](http://infolingu.univ-mlv.fr/DonneesLinguistiques/Dictionnaires/telechargement.html), I used the XML version.
- Add several files containing exhaustive list of words for each part of speech
- Add some lemma rules
- Add POS that are not checked in the standard Lemmatizer, i.e PRON, DET, ADV and AUX
- Modify the Lemmatizer class to check in lookup table as a last resort if POS not mentionned
- Modify the lemmatize function to check in lookup table as a last resort
- Init files are updated so the model can support all the functionalities mentioned above
- Add words to tokenizer_exceptions_list.py in respect to regex used in tokenizer_exceptions.py
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [X] I have submitted the spaCy Contributor Agreement.
- [X] I ran the tests, and all new and existing tests passed.
- [X] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-10-13 17:38:21 +03:00
|
|
|
elif univ_pos in (PUNCT, 'PUNCT', 'punct'):
|
|
|
|
univ_pos = 'punct'
|
2019-01-10 17:41:15 +03:00
|
|
|
elif univ_pos in (SCONJ, 'SCONJ', 'sconj'):
|
|
|
|
univ_pos = 'sconj'
|
Rule-based French Lemmatizer (#2818)
<!--- Provide a general summary of your changes in the title. -->
## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->
Add a rule-based French Lemmatizer following the english one and the excellent PR for [greek language optimizations](https://github.com/explosion/spaCy/pull/2558) to adapt the Lemmatizer class.
### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->
- Lemma dictionary used can be found [here](http://infolingu.univ-mlv.fr/DonneesLinguistiques/Dictionnaires/telechargement.html), I used the XML version.
- Add several files containing exhaustive list of words for each part of speech
- Add some lemma rules
- Add POS that are not checked in the standard Lemmatizer, i.e PRON, DET, ADV and AUX
- Modify the Lemmatizer class to check in lookup table as a last resort if POS not mentionned
- Modify the lemmatize function to check in lookup table as a last resort
- Init files are updated so the model can support all the functionalities mentioned above
- Add words to tokenizer_exceptions_list.py in respect to regex used in tokenizer_exceptions.py
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [X] I have submitted the spaCy Contributor Agreement.
- [X] I ran the tests, and all new and existing tests passed.
- [X] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-10-13 17:38:21 +03:00
|
|
|
else:
|
|
|
|
return [self.lookup(string)]
|
|
|
|
# See Issue #435 for example of where this logic is requied.
|
|
|
|
if self.is_base_form(univ_pos, morphology):
|
|
|
|
return list(set([string.lower()]))
|
|
|
|
lemmas = lemmatize(string, self.index.get(univ_pos, {}),
|
|
|
|
self.exc.get(univ_pos, {}),
|
|
|
|
self.rules.get(univ_pos, []))
|
|
|
|
return lemmas
|
|
|
|
|
|
|
|
def is_base_form(self, univ_pos, morphology=None):
|
|
|
|
"""
|
|
|
|
Check whether we're dealing with an uninflected paradigm, so we can
|
|
|
|
avoid lemmatization entirely.
|
|
|
|
"""
|
|
|
|
morphology = {} if morphology is None else morphology
|
|
|
|
others = [key for key in morphology
|
|
|
|
if key not in (POS, 'Number', 'POS', 'VerbForm', 'Tense')]
|
|
|
|
if univ_pos == 'noun' and morphology.get('Number') == 'sing':
|
|
|
|
return True
|
|
|
|
elif univ_pos == 'verb' and morphology.get('VerbForm') == 'inf':
|
|
|
|
return True
|
|
|
|
# This maps 'VBP' to base form -- probably just need 'IS_BASE'
|
|
|
|
# morphology
|
|
|
|
elif univ_pos == 'verb' and (morphology.get('VerbForm') == 'fin' and
|
|
|
|
morphology.get('Tense') == 'pres' and
|
|
|
|
morphology.get('Number') is None and
|
|
|
|
not others):
|
|
|
|
return True
|
|
|
|
elif univ_pos == 'adj' and morphology.get('Degree') == 'pos':
|
|
|
|
return True
|
|
|
|
elif VerbForm_inf in morphology:
|
|
|
|
return True
|
|
|
|
elif VerbForm_none in morphology:
|
|
|
|
return True
|
|
|
|
elif Number_sing in morphology:
|
|
|
|
return True
|
|
|
|
elif Degree_pos in morphology:
|
|
|
|
return True
|
|
|
|
else:
|
|
|
|
return False
|
|
|
|
|
|
|
|
def noun(self, string, morphology=None):
|
|
|
|
return self(string, 'noun', morphology)
|
|
|
|
|
|
|
|
def verb(self, string, morphology=None):
|
|
|
|
return self(string, 'verb', morphology)
|
|
|
|
|
|
|
|
def adj(self, string, morphology=None):
|
|
|
|
return self(string, 'adj', morphology)
|
|
|
|
|
|
|
|
def punct(self, string, morphology=None):
|
|
|
|
return self(string, 'punct', morphology)
|
|
|
|
|
|
|
|
def lookup(self, string):
|
|
|
|
if string in self.lookup_table:
|
2019-02-01 01:44:10 +03:00
|
|
|
return self.lookup_table[string][0]
|
Rule-based French Lemmatizer (#2818)
<!--- Provide a general summary of your changes in the title. -->
## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->
Add a rule-based French Lemmatizer following the english one and the excellent PR for [greek language optimizations](https://github.com/explosion/spaCy/pull/2558) to adapt the Lemmatizer class.
### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->
- Lemma dictionary used can be found [here](http://infolingu.univ-mlv.fr/DonneesLinguistiques/Dictionnaires/telechargement.html), I used the XML version.
- Add several files containing exhaustive list of words for each part of speech
- Add some lemma rules
- Add POS that are not checked in the standard Lemmatizer, i.e PRON, DET, ADV and AUX
- Modify the Lemmatizer class to check in lookup table as a last resort if POS not mentionned
- Modify the lemmatize function to check in lookup table as a last resort
- Init files are updated so the model can support all the functionalities mentioned above
- Add words to tokenizer_exceptions_list.py in respect to regex used in tokenizer_exceptions.py
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [X] I have submitted the spaCy Contributor Agreement.
- [X] I ran the tests, and all new and existing tests passed.
- [X] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-10-13 17:38:21 +03:00
|
|
|
return string
|
|
|
|
|
|
|
|
|
|
|
|
def lemmatize(string, index, exceptions, rules):
|
|
|
|
string = string.lower()
|
|
|
|
forms = []
|
2018-11-14 17:58:43 +03:00
|
|
|
if (string in index):
|
|
|
|
forms.append(string)
|
|
|
|
return forms
|
Rule-based French Lemmatizer (#2818)
<!--- Provide a general summary of your changes in the title. -->
## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->
Add a rule-based French Lemmatizer following the english one and the excellent PR for [greek language optimizations](https://github.com/explosion/spaCy/pull/2558) to adapt the Lemmatizer class.
### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->
- Lemma dictionary used can be found [here](http://infolingu.univ-mlv.fr/DonneesLinguistiques/Dictionnaires/telechargement.html), I used the XML version.
- Add several files containing exhaustive list of words for each part of speech
- Add some lemma rules
- Add POS that are not checked in the standard Lemmatizer, i.e PRON, DET, ADV and AUX
- Modify the Lemmatizer class to check in lookup table as a last resort if POS not mentionned
- Modify the lemmatize function to check in lookup table as a last resort
- Init files are updated so the model can support all the functionalities mentioned above
- Add words to tokenizer_exceptions_list.py in respect to regex used in tokenizer_exceptions.py
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [X] I have submitted the spaCy Contributor Agreement.
- [X] I ran the tests, and all new and existing tests passed.
- [X] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-10-13 17:38:21 +03:00
|
|
|
forms.extend(exceptions.get(string, []))
|
|
|
|
oov_forms = []
|
|
|
|
if not forms:
|
|
|
|
for old, new in rules:
|
|
|
|
if string.endswith(old):
|
|
|
|
form = string[:len(string) - len(old)] + new
|
|
|
|
if not form:
|
|
|
|
pass
|
|
|
|
elif form in index or not form.isalpha():
|
|
|
|
forms.append(form)
|
|
|
|
else:
|
|
|
|
oov_forms.append(form)
|
|
|
|
if not forms:
|
|
|
|
forms.extend(oov_forms)
|
|
|
|
if not forms and string in LOOKUP.keys():
|
2019-01-27 08:01:30 +03:00
|
|
|
forms.append(LOOKUP[string][0])
|
Rule-based French Lemmatizer (#2818)
<!--- Provide a general summary of your changes in the title. -->
## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->
Add a rule-based French Lemmatizer following the english one and the excellent PR for [greek language optimizations](https://github.com/explosion/spaCy/pull/2558) to adapt the Lemmatizer class.
### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->
- Lemma dictionary used can be found [here](http://infolingu.univ-mlv.fr/DonneesLinguistiques/Dictionnaires/telechargement.html), I used the XML version.
- Add several files containing exhaustive list of words for each part of speech
- Add some lemma rules
- Add POS that are not checked in the standard Lemmatizer, i.e PRON, DET, ADV and AUX
- Modify the Lemmatizer class to check in lookup table as a last resort if POS not mentionned
- Modify the lemmatize function to check in lookup table as a last resort
- Init files are updated so the model can support all the functionalities mentioned above
- Add words to tokenizer_exceptions_list.py in respect to regex used in tokenizer_exceptions.py
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [X] I have submitted the spaCy Contributor Agreement.
- [X] I ran the tests, and all new and existing tests passed.
- [X] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-10-13 17:38:21 +03:00
|
|
|
if not forms:
|
|
|
|
forms.append(string)
|
|
|
|
return list(set(forms))
|