mirror of https://github.com/explosion/spaCy.git
synced 2024-12-24 17:06:29 +03:00
Improved Polish tokenizer and stop words. (#2974)
* Improved stop words list
* Removed some wrong stop words from list
* Improved stop words list
* Removed some wrong stop words from list
* Improved Polish Tokenizer (#38)
* Add tests for Polish tokenizer
* Add Polish tokenizer exceptions
* Don't split any words containing hyphens
* Fix test case with wrong model answer
* Remove commented out line of code until better solution is found
* Add source SRX license
* Rename exception_list.py to match spaCy conventions
* Add a brief explanation of where the exception list comes from
* Add newline after each exception
* Rename COPYING.txt to LICENSE
* Delete old files
* Add header to the license
* Agreements signed
* Stanisław Giziński agreement
* Krzysztof Kowalczyk - signed agreement
* Mateusz Olko agreement
* Add DoomCoder's contributor agreement
* Improve like number checking in Polish lang
* like num tests added
* all from SI system added
* Final licence and removed splitting exceptions
* Added Polish stop words to LEX_ATTRS
* Add encoding info to pl tokenizer exceptions
parent b41d64825a
commit 1448ad100c
106
.github/contributors/DoomCoder.md
vendored
Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Piotr Książek        |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 22.11.2018           |
| GitHub username                | DoomCoder            |
| Website (optional)             |                      |
106
.github/contributors/Gizzio.md
vendored
Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:

    * [X] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Stanisław Giziński   |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 21.11.2018           |
| GitHub username                | Gizzio               |
| Website (optional)             |                      |
106
.github/contributors/MateuszOlko.md
vendored
Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           | Mateusz Olko         |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 22.11.2018           |
| GitHub username                | MateuszOlko          |
| Website (optional)             |                      |
106
.github/contributors/kowaalczyk.md
vendored
Normal file
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|------------------------------- | -------------------- |
| Name                           |Krzysztof Kowalczyk   |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           |22.11.2018            |
| GitHub username                |kowaalczyk            |
| Website (optional)             |kowaalczyk.pl         |
@@ -3,6 +3,8 @@ from __future__ import unicode_literals

 from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
 from .stop_words import STOP_WORDS
+from .lex_attrs import LEX_ATTRS
+from .punctuation import TOKENIZER_INFIXES

 from ..tokenizer_exceptions import BASE_EXCEPTIONS
 from ..norm_exceptions import BASE_NORMS
@@ -13,9 +15,11 @@ from ...util import update_exc, add_lookups

 class PolishDefaults(Language.Defaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
+    lex_attr_getters.update(LEX_ATTRS)
     lex_attr_getters[LANG] = lambda text: 'pl'
     lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], BASE_NORMS)
     tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
+    infixes = tuple(TOKENIZER_INFIXES)
     stop_words = STOP_WORDS

1441
spacy/lang/pl/_tokenizer_exceptions_list.py
Normal file
File diff suppressed because it is too large
@@ -10,8 +10,10 @@ _num_words = ['zero', 'jeden', 'dwa', 'trzy', 'cztery', 'pięć', 'sześć',
               'pietnaście', 'szesnaście', 'siedemnaście', 'osiemnaście',
               'dziewiętnaście', 'dwadzieścia', 'trzydzieści', 'czterdzieści',
               'pięćdziesiąt', 'szcześćdziesiąt', 'siedemdziesiąt',
-              'osiemdziesiąt', 'dziewięćdziesiąt', 'sto', 'tysiąc', 'milion',
-              'miliard', 'bilion', 'trylion']
+              'osiemdziesiąt', 'dziewięćdziesiąt', 'sto',
+              'dwieście', 'trzysta', 'czterysta', 'pięćset', 'sześćset',
+              'siedemset', 'osiemset', 'dziewięćset', 'tysiąc', 'milion',
+              'miliard', 'bilion', 'biliard', 'trylion', 'tryliard', 'kwadrylion']


 def like_num(text):
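The hunk above only extends the number-word list; the body of `like_num` is not shown. A minimal sketch of how such a check might use the list is given below. The shortened `NUM_WORDS` set, the separator handling, and the lowercasing are assumptions for illustration, not the committed implementation.

```python
# -*- coding: utf-8 -*-
# Hypothetical, shortened word list; the real module defines many more entries.
NUM_WORDS = {'zero', 'jeden', 'dwa', 'sto', 'dwieście', 'tysiąc', 'milion',
             'kwadrylion'}


def like_num(text):
    """Return True if `text` looks like a number (sketch, not the committed code)."""
    # Drop thousands and decimal separators so '10,000' and '10,00' reduce to digits.
    digits = text.replace(',', '').replace('.', '')
    if digits.isdigit():
        return True
    # Accept simple fractions such as '1/2'.
    if text.count('/') == 1:
        num, denom = text.split('/')
        if num.isdigit() and denom.isdigit():
            return True
    # Fall back to a lookup in the number-word list (lowercasing is an assumption).
    return text.lower() in NUM_WORDS
```

Under these assumptions, `like_num('10,000')` and `like_num('dwieście')` both return `True`, while an ordinary word such as `'pies'` returns `False`.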
23
spacy/lang/pl/polish_srx_rules_LICENSE.txt
Normal file
@@ -0,0 +1,23 @@

Copyright (c) 2019, Marcin Miłkowski
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
   list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
   this list of conditions and the following disclaimer in the documentation
   and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
14
spacy/lang/pl/punctuation.py
Normal file
@@ -0,0 +1,14 @@
# coding: utf8
from __future__ import unicode_literals
from ..char_classes import LIST_ELLIPSES, LIST_ICONS
from ..char_classes import QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER
_quotes = QUOTES.replace("'", '')
_infixes = (LIST_ELLIPSES + LIST_ICONS +
            [r'(?<=[{}])\.(?=[{}])'.format(ALPHA_LOWER, ALPHA_UPPER),
             r'(?<=[{a}])[,!?](?=[{a}])'.format(a=ALPHA),
             r'(?<=[{a}"])[:<>=](?=[{a}])'.format(a=ALPHA),
             r'(?<=[{a}]),(?=[{a}])'.format(a=ALPHA),
             r'(?<=[{a}])([{q}\)\]\(\[])(?=[\{a}])'.format(a=ALPHA, q=_quotes),
             r'(?<=[{a}])--(?=[{a}])'.format(a=ALPHA)])

TOKENIZER_INFIXES = _infixes
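The infix list above contains no pattern for a single hyphen, which matches the commit note about not splitting words containing hyphens. The sketch below tries three of the patterns with the standard `re` module. The `ALPHA_LOWER`, `ALPHA_UPPER` and `ALPHA` strings here are simplified stand-ins chosen for illustration, not the values from spaCy's `char_classes`, and `find_infixes` is a hypothetical helper.

```python
# -*- coding: utf-8 -*-
import re

# Simplified stand-ins for spaCy's character classes (assumption).
ALPHA_LOWER = "a-ząćęłńóśźż"
ALPHA_UPPER = "A-ZĄĆĘŁŃÓŚŹŻ"
ALPHA = ALPHA_LOWER + ALPHA_UPPER

infix_patterns = [
    # A period between a lowercase and an uppercase letter.
    r"(?<=[{}])\.(?=[{}])".format(ALPHA_LOWER, ALPHA_UPPER),
    # A comma, exclamation mark or question mark between two letters.
    r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
    # A double hyphen between two letters.
    r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA),
]
infix_re = re.compile("|".join(infix_patterns))


def find_infixes(text):
    """Return the substrings of `text` that would be split off as infixes."""
    return [m.group(0) for m in infix_re.finditer(text)]
```

With these stand-in classes, `find_infixes("koniec.Początek")` finds the period, while `find_infixes("biało-czerwony")` returns an empty list because a single hyphen matches none of the patterns.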
|
@ -1,46 +1,80 @@
|
|||
# encoding: utf8
|
||||
|
||||
from __future__ import unicode_literals
|
||||
|
||||
|
||||
# Source: http://www.ranks.nl/stopwords/polish
|
||||
#sources: https://github.com/bieli/stopwords/blob/master/polish.stopwords.txt and https://github.com/stopwords-iso/stopwords-pl
|
||||
|
||||
STOP_WORDS = set("""
|
||||
ach aj albo
|
||||
a aby ach acz aczkolwiek aj albo ale alez
|
||||
ależ ani az aż
|
||||
|
||||
bardzo bez bo być
|
||||
bardziej bardzo beda bede bedzie bez bo bowiem by
|
||||
byc byl byla byli bylo byly bym bynajmniej być był
|
||||
była było były będzie będą będę
|
||||
|
||||
ci cię ciebie co czy
|
||||
cala cali caly cała cały chce choć ci cie
|
||||
ciebie cię co cokolwiek coraz cos coś czasami czasem czemu
|
||||
czy czyli często
|
||||
|
||||
daleko dla dlaczego dlatego do dobrze dokąd dość dużo dwa dwaj dwie dwoje dziś
|
||||
dzisiaj
|
||||
daleko dla dlaczego dlatego do dobrze dokad dokąd
|
||||
dosc dość duzo dużo dwa dwaj dwie dwoje dzis
|
||||
dzisiaj dziś
|
||||
|
||||
gdyby gdzie
|
||||
gdy gdyby gdyz gdyż gdzie gdziekolwiek gdzies gdzieś go
godz

go
i ich ile im inna inne inny
innych iv ix iz iż

ich ile im inny
ja jak jakas jakaś jakby jaki jakichs jakichś jakie
jakis jakiz jakiś jakiż jakkolwiek jako jakos jakoś je jeden
jedna jednak jednakze jednakże jedno jednym jedynie jego jej jemu
jesli jest jestem jeszcze jezeli jeśli jeżeli juz już ją

ja ją jak jakby jaki je jeden jedna jedno jego jej jemu jeśli jest jestem
jeżeli już
kazdy każdy kiedy kierunku kilka kilku kims kimś kto
ktokolwiek ktora ktore ktorego ktorej ktory ktorych ktorym ktorzy ktos
ktoś która które którego której który których którym którzy ku

każdy kiedy kierunku kto ku
lecz lub

lub
ma mają mam mamy mało mi miał miedzy
mimo między mna mnie mną moga mogą moi moim moj
moja moje moze mozliwe mozna może możliwe można mu musi
my mój

ma mają mam mi mną mnie moi mój moja moje może mu my
na nad nam nami nas nasi nasz nasza nasze
naszego naszych natomiast natychmiast nawet nia nic nich nie niech
niego niej niemu nigdy nim nimi niz nią niż no

na nam nami nas nasi nasz nasza nasze natychmiast nią nic nich nie niego niej
niemu nigdy nim nimi niż
o obok od ok około on ona one
oni ono oraz oto owszem

obok od około on ona one oni ono owszem
pan pana pani po pod podczas pomimo ponad
poniewaz ponieważ powinien powinna powinni powinno poza prawie przeciez
przecież przed przede przedtem przez przy

po pod ponieważ przed przedtem
raz razie roku rowniez również

są sam sama się skąd
sam sama sie się skad skąd soba sobie sobą
sposob sposób swoje są

tak taki tam ten to tobą tobie tu tutaj twoi twój twoja twoje ty
ta tak taka taki takich takie takze także tam
te tego tej tel temu ten teraz też to toba
tobie tobą totez toteż totobą trzeba tu tutaj twoi twoim
twoj twoja twoje twym twój ty tych tylko tym tys
tzw tę

wam wami was wasi wasz wasza wasze we więc wszystko wtedy wy
u

żaden zawsze że
""".split())
vi vii viii

w wam wami was wasi wasz wasza wasze we
według wie wiele wielu więc więcej wlasnie wszyscy wszystkich wszystkie
wszystkim wszystko wtedy wy właśnie wśród

xi xii xiii xiv xv

z za zaden zadna zadne zadnych zapewne zawsze zaś
ze zeby znow znowu znów zostal został

żaden żadna żadne żadnych że żeby""".split())
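The stop-word hunk above builds its list by splitting a triple-quoted, whitespace-separated string, and it carries both diacritic and ASCII-folded spellings (for example `gdyż` and `gdyz`). A minimal sketch of that pattern follows, assuming a short excerpt of words taken from the diff; the names `STOP_WORDS_SAMPLE` and `is_stop` are hypothetical, and the lowercasing step is an assumption rather than something the diff shows.

```python
# Hypothetical excerpt; the real list in the diff is much longer.
STOP_WORDS_SAMPLE = set("""
gdy gdyby gdyz gdyż gdzie

lecz lub

żaden żadna żadne żadnych że żeby""".split())


def is_stop(word):
    """Case-insensitive membership test (lowercasing is an assumption)."""
    return word.lower() in STOP_WORDS_SAMPLE
```

Because `str.split()` with no arguments splits on any run of whitespace, the blank lines that group words by initial letter do not produce empty entries in the set.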
@@ -2,7 +2,7 @@
from __future__ import unicode_literals

from ...symbols import ORTH, LEMMA, POS, ADV, ADJ, NOUN, ADP

from ._tokenizer_exceptions_list import PL_BASE_EXCEPTIONS

_exc = {}

@@ -32,5 +32,7 @@ for orth in [
        "wyj.", "xx.", "ks.", "x.", "wyd.", "wsch.", "o.o."]:
    _exc[orth] = [{ORTH: orth}]

for orth in PL_BASE_EXCEPTIONS:
    _exc[orth] = [{ORTH: orth}]

TOKENIZER_EXCEPTIONS = _exc
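The hunk above registers every entry of `PL_BASE_EXCEPTIONS` as a single-token exception keyed by its own surface form. The sketch below mirrors that loop without importing spaCy: `ORTH` is replaced by a plain string stand-in, and `PL_BASE_EXCEPTIONS_SAMPLE` is a hypothetical three-item excerpt, since the real list is not shown in this part of the diff.

```python
ORTH = "orth"  # stand-in for spacy.symbols.ORTH so the sketch runs without spaCy

# Hypothetical excerpt standing in for PL_BASE_EXCEPTIONS.
PL_BASE_EXCEPTIONS_SAMPLE = ["godz.", "m.in.", "n.p.m."]

_exc = {}

# Hand-written abbreviations, as in the existing loop.
for orth in ["wyd.", "wsch.", "o.o."]:
    _exc[orth] = [{ORTH: orth}]

# Imported base exceptions: each string maps to exactly one token.
for orth in PL_BASE_EXCEPTIONS_SAMPLE:
    _exc[orth] = [{ORTH: orth}]

TOKENIZER_EXCEPTIONS = _exc
```

Each value is a one-element list, which tells the tokenizer to keep the string whole rather than split off the trailing period.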
@@ -189,6 +189,9 @@ def uk_tokenizer():
def ca_tokenizer():
    return util.get_lang_class('ca').Defaults.create_tokenizer()

@pytest.fixture(scope='session')
def pl_tokenizer():
    return util.get_lang_class('pl').Defaults.create_tokenizer()

@pytest.fixture
def stringstore():
17
spacy/tests/lang/pl/test_text.py
Normal file
@@ -0,0 +1,17 @@
# coding: utf-8
"""Words like numbers are recognized correctly."""


from __future__ import unicode_literals

import pytest


@pytest.mark.parametrize('text,match', [
    ('10', True), ('1', True), ('10,000', True), ('10,00', True),
    ('jeden', True), ('dwa', True), ('milion', True),
    ('pies', False), (',', False), ('1/2', True)])
def test_lex_attrs_like_number(pl_tokenizer, text, match):
    tokens = pl_tokenizer(text)
    assert len(tokens) == 1
    assert tokens[0].like_num == match
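The test above expects `like_num` to accept plain digits, digits with comma separators, simple fractions, and Polish number words, while rejecting ordinary words and bare punctuation. One way such a check could be written is sketched below; it is an illustration under assumptions, not the implementation from this commit. `_NUM_WORDS` is a hypothetical three-word subset, and the separator-stripping approach is assumed.

```python
# Hypothetical subset of Polish number words; the real list is not shown here.
_NUM_WORDS = {"jeden", "dwa", "milion"}


def like_num(text):
    # Drop thousands/decimal separators before the digit check.
    stripped = text.replace(",", "").replace(".", "")
    if stripped.isdigit():
        return True
    # Simple fractions such as "1/2".
    if stripped.count("/") == 1:
        num, denom = stripped.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    # Fall back to a lookup of spelled-out numbers.
    return text.lower() in _NUM_WORDS
```

A bare comma is rejected because stripping separators leaves an empty string, and `"".isdigit()` is `False`.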
60
spacy/tests/lang/pl/test_tokenizer.py
Normal file
@@ -0,0 +1,60 @@
# coding: utf8
from __future__ import unicode_literals

import pytest

DOT_TESTS = [
    ('tel.', ['tel.']),
    ('np.', ['np.']),
    ('godz. 21:37', ['godz.', '21:37']),
    ('inż.', ['inż.']),
    ('gosp.-polit.', ['gosp.-polit.']),
    ('ppoż', ['ppoż']),
    ('płn', ['płn']),
    ('ul.', ['ul.']),
    ('jw.', ['jw.']),
    ('itd.', ['itd.']),
    ('cdn.', ['cdn.']),
    ('itp.', ['itp.']),
    ('10,- zł', ['10,-', 'zł']),
    ('0 zł 99 gr', ['0', 'zł', '99', 'gr']),
    ('0,99 rub.', ['0,99', 'rub.']),
    ('dol.', ['dol.']),
    ('1000 m n.p.m.', ['1000', 'm', 'n.p.m.']),
    ('m.in.', ['m.in.']),
    ('p.n.e.', ['p.n.e.']),
    ('Sz.P.', ['Sz.P.']),
    ('p.o.', ['p.o.']),
    ('k.o.', ['k.o.']),
    ('m.st.', ['m.st.']),
    ('dra.', ['dra', '.']),
    ('pp.', ['pp.']),
    ('oo.', ['oo.'])
]

HYPHEN_TESTS = [
    ('5-fluoropentylo-3-pirydynyloindol', ['5-fluoropentylo-3-pirydynyloindol']),
    ('NESS-040C5', ['NESS-040C5']),
    ('JTE-7-31', ['JTE-7-31']),
    ('BAY-59-3074', ['BAY-59-3074']),
    ('BAY-38-7271', ['BAY-38-7271']),
    ('STS-135', ['STS-135']),
    ('5F-PB-22', ['5F-PB-22']),
    ('cztero-', ['cztero-']),
    ('jedno-', ['jedno-']),
    ('dwu-', ['dwu-']),
    ('trzy-', ['trzy-']),
    ('b-adoratorzy', ['b-adoratorzy']),
    ('2-3-4 drzewa', ['2-3-4', 'drzewa']),
    ('b-drzewa', ['b-drzewa'])
]


TESTCASES = DOT_TESTS + HYPHEN_TESTS


@pytest.mark.parametrize('text,expected_tokens', TESTCASES)
def test_tokenizer_handles_testcases(pl_tokenizer, text, expected_tokens):
    tokens = pl_tokenizer(text)
    token_list = [token.text for token in tokens if not token.is_space]
    assert expected_tokens == token_list
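The test cases above encode two behaviours: a trailing period stays attached when the string is a known abbreviation (`godz.`) and is split off otherwise (`dra.`), and hyphenated strings are never split. The toy function below illustrates only that exception-lookup idea. It is not spaCy's tokenizer algorithm, which applies prefix, suffix, and infix rules; `toy_tokenize` and `EXCEPTIONS_SAMPLE` are hypothetical names introduced for this sketch.

```python
# Hypothetical excerpt of known abbreviations.
EXCEPTIONS_SAMPLE = {"godz.", "rub.", "n.p.m.", "tel."}


def toy_tokenize(text):
    tokens = []
    # Split on whitespace only, so hyphenated strings stay whole.
    for chunk in text.split():
        keep_whole = (
            chunk in EXCEPTIONS_SAMPLE
            or not chunk.endswith(".")
            or chunk == "."
        )
        if keep_whole:
            tokens.append(chunk)
        else:
            # Unknown abbreviation: split off the trailing period.
            tokens.extend([chunk[:-1], "."])
    return tokens
```

The exception set does the work that `TOKENIZER_EXCEPTIONS` does in the real tokenizer: anything listed there bypasses the punctuation-splitting step.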