Merge branch 'master' into hu_tokenizer

This commit is contained in:
Gyorgy Orosz 2016-12-20 20:53:31 +01:00
commit 366b3f8685
146 changed files with 5925 additions and 3047727 deletions

107
.github/contributors/RvanNieuwpoort.md vendored Executable file
View File

@ -0,0 +1,107 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [ ] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect my
contributions.
* [x] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------------------- |
| Name | Rob van Nieuwpoort |
| Signing on behalf of | Dafne van Kuppevelt, Janneke van der Zwaan, Willem van Hage |
| Company name (if applicable) | Netherlands eScience center |
| Title or role (if applicable) | Director of technology |
| Date | 14-12-2016 |
| GitHub username | RvanNieuwpoort |
| Website (optional) | https://www.esciencecenter.nl/ |

106
.github/contributors/magnusburton.md vendored Normal file
View File

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------------------- |
| Name | Magnus Burton |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 17-12-2016 |
| GitHub username | magnusburton |
| Website (optional) | |

View File

@ -21,7 +21,7 @@ install:
script:
- "pip install pytest"
- if [[ "${VIA}" == "compile" ]]; then SPACY_DATA=models/en python -m pytest spacy; fi
- if [[ "${VIA}" == "compile" ]]; then python -m pytest spacy; fi
- if [[ "${VIA}" == "pypi" ]]; then python -m pytest `python -c "import pathlib; import spacy; print(pathlib.Path(spacy.__file__).parent.resolve())"`; fi
- if [[ "${VIA}" == "sdist" ]]; then python -m pytest `python -c "import pathlib; import spacy; print(pathlib.Path(spacy.__file__).parent.resolve())"`; fi

View File

@ -53,6 +53,10 @@ Coming soon.
Coming soon.
### Developer resources
The [spaCy developer resources](https://github.com/explosion/spacy-dev-resources) repo contains useful scripts, tools and templates for developing spaCy, adding new languages and training new models. If you've written a script that might help others, feel free to contribute it to that repository.
### Contributor agreement
If you've made a substantial contribution to spaCy, you should fill in the [spaCy contributor agreement](.github/CONTRIBUTOR_AGREEMENT.md) to ensure that your contribution can be used across the project. If you agree to be bound by the terms of the agreement, fill in the [template]((.github/CONTRIBUTOR_AGREEMENT.md)) and include it with your pull request, or sumit it separately to [`.github/contributors/`](/.github/contributors). The name of the file should be your GitHub username, with the extension `.md`. For example, the user

View File

@ -6,24 +6,29 @@ This is a list of everyone who has made significant contributions to spaCy, in a
* Andreas Grivas, [@andreasgrv](https://github.com/andreasgrv)
* Chris DuBois, [@chrisdubois](https://github.com/chrisdubois)
* Christoph Schwienheer, [@chssch](https://github.com/chssch)
* Dafne van Kuppevelt, [@dafnevk](https://github.com/dafnevk)
* Dmytro Sadovnychyi, [@sadovnychyi](https://github.com/sadovnychyi)
* Henning Peters, [@henningpeters](https://github.com/henningpeters)
* Ines Montani, [@ines](https://github.com/ines)
* J Nicolas Schrading, [@NSchrading](https://github.com/NSchrading)
* Janneke van der Zwaan, [@jvdzwaan](https://github.com/jvdzwaan)
* Jordan Suchow, [@suchow](https://github.com/suchow)
* Kendrick Tan, [@kendricktan](https://github.com/kendricktan)
* Kyle P. Johnson, [@kylepjohnson](https://github.com/kylepjohnson)
* Liling Tan, [@alvations](https://github.com/alvations)
* Magnus Burton, [@magnusburton](https://github.com/magnusburton)
* Mark Amery, [@ExplodingCabbage](https://github.com/ExplodingCabbage)
* Matthew Honnibal, [@honnibal](https://github.com/honnibal)
* Maxim Samsonov, [@maxirmx](https://github.com/maxirmx)
* Oleg Zd, [@olegzd](https://github.com/olegzd)
* Pokey Rule, [@pokey](https://github.com/pokey)
* Rob van Nieuwpoort, [@RvanNieuwpoort](https://github.com/RvanNieuwpoort)
* Sam Bozek, [@sambozek](https://github.com/sambozek)
* Sasho Savkov [@savkov](https://github.com/savkov)
* Tiago Rodrigues, [@TiagoMRodrigues](https://github.com/TiagoMRodrigues)
* Vsevolod Solovyov, [@vsolovyov](https://github.com/vsolovyov)
* Wah Loon Keng, [@kengz](https://github.com/kengz)
* Willem van Hage, [@wrvhage](https://github.com/wrvhage)
* Wolfgang Seeker, [@wbwseeker](https://github.com/wbwseeker)
* Yanhao Yang, [@YanhaoYang](https://github.com/YanhaoYang)
* Yubing Dong, [@tomtung](https://github.com/tomtung)

View File

@ -6,7 +6,7 @@ Cython. spaCy is built on the very latest research, but it isn't researchware.
It was designed from day 1 to be used in real products. It's commercial
open-source software, released under the MIT license.
💫 **Version 1.3 out now!** `Read the release notes here. <https://github.com/explosion/spaCy/releases/>`_
💫 **Version 1.4 out now!** `Read the release notes here. <https://github.com/explosion/spaCy/releases/>`_
.. image:: http://i.imgur.com/wFvLZyJ.png
:target: https://travis-ci.org/explosion/spaCy
@ -241,8 +241,45 @@ calling ``spacy.load()``, or by passing a ``path`` argument to the ``spacy.en.En
Changelog
=========
2016-12-03 `v1.3.0 <https://github.com/explosion/spaCy/releases>`_: *Improve API consistency*
---------------------------------------------------------------------------------------------
2016-12-18 `v1.4.0 <https://github.com/explosion/spaCy/releases>`_: *Improved language data and alpha Dutch support*
--------------------------------------------------------------------------------------------------------------------
**✨ Major features and improvements**
* **NEW:** Alpha support for Dutch tokenization.
* Reorganise and improve format for language data.
* Add shared tag map, entity rules, emoticons and punctuation to language data.
* Convert entity rules, morphological rules and lemmatization rules from JSON to Python.
* Update language data for English, German, Spanish, French, Italian and Portuguese.
**🔴 Bug fixes**
* Fix issue `#649 <https://github.com/explosion/spaCy/issues/649>`_: Update and reorganise stop lists.
* Fix issue `#672 <https://github.com/explosion/spaCy/issues/672>`_: Make ``token.ent_iob_`` return unicode.
* Fix issue `#674 <https://github.com/explosion/spaCy/issues/674>`_: Add missing lemmas for contracted forms of "be" to ``TOKENIZER_EXCEPTIONS``.
* Fix issue `#683 <https://github.com/explosion/spaCy/issues/683>`_ ``Morphology`` class now supplies tag map value for the special space tag if it's missing.
* Fix issue `#684 <https://github.com/explosion/spaCy/issues/684>`_: Ensure ``spacy.en.English()`` loads the Glove vector data if available. Previously was inconsistent with behaviour of ``spacy.load('en')``.
* Fix issue `#685 <https://github.com/explosion/spaCy/issues/685>`_: Expand ``TOKENIZER_EXCEPTIONS`` with unicode apostrophe (````).
* Fix issue `#689 <https://github.com/explosion/spaCy/issues/689>`_: Correct typo in ``STOP_WORDS``.
* Fix issue `#691 <https://github.com/explosion/spaCy/issues/691>`_: Add tokenizer exceptions for "gonna" and "Gonna".
**⚠️ Backwards incompatibilities**
No changes to the public, documented API, but the previously undocumented language data and model initialisation processes have been refactored and reorganised. If you were relying on the ``bin/init_model.py`` script, see the new `spaCy Developer Resources <https://github.com/explosion/spacy-dev-resources>`_ repo. Code that references internals of the ``spacy.en`` or ``spacy.de`` packages should also be reviewed before updating to this version.
**📖 Documentation and examples**
* **NEW:** `"Adding languages" <https://spacy.io/docs/usage/adding-languages>`_ workflow.
* **NEW:** `"Part-of-speech tagging" <https://spacy.io/docs/usage/pos-tagging>`_ workflow.
* **NEW:** `spaCy Developer Resources <https://github.com/explosion/spacy-dev-resources>`_ repo scripts, tools and resources for developing spaCy.
* Fix various typos and inconsistencies.
**👥 Contributors**
Thanks to `@dafnevk <https://github.com/dafnevk>`_, `@jvdzwaan <https://github.com/jvdzwaan>`_, `@RvanNieuwpoort <https://github.com/RvanNieuwpoort>`_, `@wrvhage <https://github.com/wrvhage>`_, `@jaspb <https://github.com/jaspb>`_, `@savvopoulos <https://github.com/savvopoulos>`_ and `@davedwards <https://github.com/davedwards>`_ for the pull requests!
2016-12-03 `v1.3.0 <https://github.com/explosion/spaCy/releases/tag/v1.3.0>`_: *Improve API consistency*
--------------------------------------------------------------------------------------------------------
**✨ API improvements**

View File

@ -1,229 +0,0 @@
"""Set up a model directory.
Requires:
lang_data --- Rules for the tokenizer
* prefix.txt
* suffix.txt
* infix.txt
* morphs.json
* specials.json
corpora --- Data files
* WordNet
* words.sgt.prob --- Smoothed unigram probabilities
* clusters.txt --- Output of hierarchical clustering, e.g. Brown clusters
* vectors.bz2 --- output of something like word2vec, compressed with bzip
"""
from __future__ import unicode_literals
from ast import literal_eval
import math
import gzip
import json
import plac
from pathlib import Path
from shutil import copyfile
from shutil import copytree
from collections import defaultdict
import io
from spacy.vocab import Vocab
from spacy.vocab import write_binary_vectors
from spacy.strings import hash_string
from preshed.counter import PreshCounter
from spacy.parts_of_speech import NOUN, VERB, ADJ
from spacy.util import get_lang_class
try:
unicode
except NameError:
unicode = str
def setup_tokenizer(lang_data_dir, tok_dir):
if not tok_dir.exists():
tok_dir.mkdir()
for filename in ('infix.txt', 'morphs.json', 'prefix.txt', 'specials.json',
'suffix.txt'):
src = lang_data_dir / filename
dst = tok_dir / filename
copyfile(str(src), str(dst))
def _read_clusters(loc):
if not loc.exists():
print("Warning: Clusters file not found")
return {}
clusters = {}
for line in io.open(str(loc), 'r', encoding='utf8'):
try:
cluster, word, freq = line.split()
except ValueError:
continue
# If the clusterer has only seen the word a few times, its cluster is
# unreliable.
if int(freq) >= 3:
clusters[word] = cluster
else:
clusters[word] = '0'
# Expand clusters with re-casing
for word, cluster in list(clusters.items()):
if word.lower() not in clusters:
clusters[word.lower()] = cluster
if word.title() not in clusters:
clusters[word.title()] = cluster
if word.upper() not in clusters:
clusters[word.upper()] = cluster
return clusters
def _read_probs(loc):
if not loc.exists():
print("Probabilities file not found. Trying freqs.")
return {}, 0.0
probs = {}
for i, line in enumerate(io.open(str(loc), 'r', encoding='utf8')):
prob, word = line.split()
prob = float(prob)
probs[word] = prob
return probs, probs['-OOV-']
def _read_freqs(loc, max_length=100, min_doc_freq=5, min_freq=200):
if not loc.exists():
print("Warning: Frequencies file not found")
return {}, 0.0
counts = PreshCounter()
total = 0
if str(loc).endswith('gz'):
file_ = gzip.open(str(loc))
else:
file_ = loc.open()
for i, line in enumerate(file_):
freq, doc_freq, key = line.rstrip().split('\t', 2)
freq = int(freq)
counts.inc(i+1, freq)
total += freq
counts.smooth()
log_total = math.log(total)
if str(loc).endswith('gz'):
file_ = gzip.open(str(loc))
else:
file_ = loc.open()
probs = {}
for line in file_:
freq, doc_freq, key = line.rstrip().split('\t', 2)
doc_freq = int(doc_freq)
freq = int(freq)
if doc_freq >= min_doc_freq and freq >= min_freq and len(key) < max_length:
word = literal_eval(key)
smooth_count = counts.smoother(int(freq))
probs[word] = math.log(smooth_count) - log_total
oov_prob = math.log(counts.smoother(0)) - log_total
return probs, oov_prob
def _read_senses(loc):
lexicon = defaultdict(lambda: defaultdict(list))
if not loc.exists():
print("Warning: WordNet senses not found")
return lexicon
sense_names = dict((s, i) for i, s in enumerate(spacy.senses.STRINGS))
pos_ids = {'noun': NOUN, 'verb': VERB, 'adjective': ADJ}
for line in codecs.open(str(loc), 'r', 'utf8'):
sense_strings = line.split()
word = sense_strings.pop(0)
for sense in sense_strings:
pos, sense = sense[3:].split('.')
sense_name = '%s_%s' % (pos[0].upper(), sense.lower())
if sense_name != 'N_tops':
sense_id = sense_names[sense_name]
lexicon[word][pos_ids[pos]].append(sense_id)
return lexicon
def setup_vocab(lex_attr_getters, tag_map, src_dir, dst_dir):
if not dst_dir.exists():
dst_dir.mkdir()
vectors_src = src_dir / 'vectors.bz2'
if vectors_src.exists():
write_binary_vectors(vectors_src.as_posix, (dst_dir / 'vec.bin').as_posix())
else:
print("Warning: Word vectors file not found")
vocab = Vocab(lex_attr_getters=lex_attr_getters, tag_map=tag_map)
clusters = _read_clusters(src_dir / 'clusters.txt')
probs, oov_prob = _read_probs(src_dir / 'words.sgt.prob')
if not probs:
probs, oov_prob = _read_freqs(src_dir / 'freqs.txt.gz')
if not probs:
oov_prob = -20
else:
oov_prob = min(probs.values())
for word in clusters:
if word not in probs:
probs[word] = oov_prob
lexicon = []
for word, prob in reversed(sorted(list(probs.items()), key=lambda item: item[1])):
# First encode the strings into the StringStore. This way, we can map
# the orth IDs to frequency ranks
orth = vocab.strings[word]
# Now actually load the vocab
for word, prob in reversed(sorted(list(probs.items()), key=lambda item: item[1])):
lexeme = vocab[word]
lexeme.prob = prob
lexeme.is_oov = False
# Decode as a little-endian string, so that we can do & 15 to get
# the first 4 bits. See _parse_features.pyx
if word in clusters:
lexeme.cluster = int(clusters[word][::-1], 2)
else:
lexeme.cluster = 0
vocab.dump((dst_dir / 'lexemes.bin').as_posix())
with (dst_dir / 'strings.json').open('w') as file_:
vocab.strings.dump(file_)
with (dst_dir / 'oov_prob').open('w') as file_:
file_.write('%f' % oov_prob)
def main(lang_id, lang_data_dir, corpora_dir, model_dir):
model_dir = Path(model_dir)
lang_data_dir = Path(lang_data_dir) / lang_id
corpora_dir = Path(corpora_dir) / lang_id
assert corpora_dir.exists()
assert lang_data_dir.exists()
if not model_dir.exists():
model_dir.mkdir()
tag_map = json.load((lang_data_dir / 'tag_map.json').open())
setup_tokenizer(lang_data_dir, model_dir / 'tokenizer')
setup_vocab(get_lang_class(lang_id).Defaults.lex_attr_getters, tag_map, corpora_dir,
model_dir / 'vocab')
if (lang_data_dir / 'gazetteer.json').exists():
copyfile((lang_data_dir / 'gazetteer.json').as_posix(),
(model_dir / 'vocab' / 'gazetteer.json').as_posix())
copyfile((lang_data_dir / 'tag_map.json').as_posix(),
(model_dir / 'vocab' / 'tag_map.json').as_posix())
if (lang_data_dir / 'lemma_rules.json').exists():
copyfile((lang_data_dir / 'lemma_rules.json').as_posix(),
(model_dir / 'vocab' / 'lemma_rules.json').as_posix())
if not (model_dir / 'wordnet').exists() and (corpora_dir / 'wordnet').exists():
copytree((corpora_dir / 'wordnet' / 'dict').as_posix(),
(model_dir / 'wordnet').as_posix())
if __name__ == '__main__':
plac.call(main)

View File

@ -0,0 +1,22 @@
# Load NER
from __future__ import unicode_literals
import spacy
import pathlib
from spacy.pipeline import EntityRecognizer
from spacy.vocab import Vocab
def load_model(model_dir):
model_dir = pathlib.Path(model_dir)
nlp = spacy.load('en', parser=False, entity=False, add_vectors=False)
with (model_dir / 'vocab' / 'strings.json').open('r', encoding='utf8') as file_:
nlp.vocab.strings.load(file_)
nlp.vocab.load_lexemes(model_dir / 'vocab' / 'lexemes.bin')
ner = EntityRecognizer.load(model_dir, nlp.vocab, require=True)
return (nlp, ner)
(nlp, ner) = load_model('ner')
doc = nlp.make_doc('Who is Shaka Khan?')
nlp.tagger(doc)
ner(doc)
for word in doc:
print(word.text, word.orth, word.lower, word.tag_, word.ent_type_, word.ent_iob)

View File

@ -10,6 +10,13 @@ from spacy.tagger import Tagger
def train_ner(nlp, train_data, entity_types):
# Add new words to vocab.
for raw_text, _ in train_data:
doc = nlp.make_doc(raw_text)
for word in doc:
_ = nlp.vocab[word.orth]
# Train NER.
ner = EntityRecognizer(nlp.vocab, entity_types=entity_types)
for itn in range(5):
random.shuffle(train_data)
@ -20,21 +27,30 @@ def train_ner(nlp, train_data, entity_types):
ner.model.end_training()
return ner
def save_model(ner, model_dir):
model_dir = pathlib.Path(model_dir)
if not model_dir.exists():
model_dir.mkdir()
assert model_dir.is_dir()
with (model_dir / 'config.json').open('w') as file_:
json.dump(ner.cfg, file_)
ner.model.dump(str(model_dir / 'model'))
if not (model_dir / 'vocab').exists():
(model_dir / 'vocab').mkdir()
ner.vocab.dump(str(model_dir / 'vocab' / 'lexemes.bin'))
with (model_dir / 'vocab' / 'strings.json').open('w', encoding='utf8') as file_:
ner.vocab.strings.dump(file_)
def main(model_dir=None):
if model_dir is not None:
model_dir = pathlib.Path(model_dir)
if not model_dir.exists():
model_dir.mkdir()
assert model_dir.is_dir()
nlp = spacy.load('en', parser=False, entity=False, add_vectors=False)
# v1.1.2 onwards
if nlp.tagger is None:
print('---- WARNING ----')
print('Data directory not found')
print('please run: `python -m spacy.en.download force all` for better performance')
print('please run: `python -m spacy.en.download --force all` for better performance')
print('Using feature templates for tagging')
print('-----------------')
nlp.tagger = Tagger(nlp.vocab, features=Tagger.feature_templates)
@ -56,16 +72,17 @@ def main(model_dir=None):
nlp.tagger(doc)
ner(doc)
for word in doc:
print(word.text, word.tag_, word.ent_type_, word.ent_iob)
print(word.text, word.orth, word.lower, word.tag_, word.ent_type_, word.ent_iob)
if model_dir is not None:
with (model_dir / 'config.json').open('w') as file_:
json.dump(ner.cfg, file_)
ner.model.dump(str(model_dir / 'model'))
save_model(ner, model_dir)
if __name__ == '__main__':
main()
main('ner')
# Who "" 2
# is "" 2
# Shaka "" PERSON 3

View File

@ -69,7 +69,7 @@ def main(output_dir=None):
print(word.text, word.tag_, word.pos_)
if output_dir is not None:
tagger.model.dump(str(output_dir / 'pos' / 'model'))
with (output_dir / 'vocab' / 'strings.json').open('wb') as file_:
with (output_dir / 'vocab' / 'strings.json').open('w') as file_:
tagger.vocab.strings.dump(file_)

163
fabfile.py vendored
View File

@ -13,134 +13,6 @@ PWD = path.dirname(__file__)
VENV_DIR = path.join(PWD, '.env')
def counts():
pass
# Tokenize the corpus
# tokenize()
# get_freqs()
# Collate the counts
# cat freqs | sort -k2 | gather_freqs()
# gather_freqs()
# smooth()
# clean, make, sdist
# cd to new env, install from sdist,
# Push changes to server
# Pull changes on server
# clean make init model
# test --vectors --slow
# train
# test --vectors --slow --models
# sdist
# upload data to server
# change to clean venv
# py2: install from sdist, test --slow, download data, test --models --vectors
# py3: install from sdist, test --slow, download data, test --models --vectors
def prebuild(build_dir='/tmp/build_spacy'):
if file_exists(build_dir):
shutil.rmtree(build_dir)
os.mkdir(build_dir)
spacy_dir = path.dirname(__file__)
wn_url = 'http://wordnetcode.princeton.edu/3.0/WordNet-3.0.tar.gz'
build_venv = path.join(build_dir, '.env')
with lcd(build_dir):
local('git clone %s .' % spacy_dir)
local('virtualenv ' + build_venv)
with prefix('cd %s && PYTHONPATH=`pwd` && . %s/bin/activate' % (build_dir, build_venv)):
local('pip install cython fabric fabtools pytest')
local('pip install --no-cache-dir -r requirements.txt')
local('fab clean make')
local('cp -r %s/corpora/en/wordnet corpora/en/' % spacy_dir)
local('PYTHONPATH=`pwd` python bin/init_model.py en lang_data corpora spacy/en/data')
local('PYTHONPATH=`pwd` fab test')
local('PYTHONPATH=`pwd` python -m spacy.en.download --force all')
local('PYTHONPATH=`pwd` py.test --models spacy/tests/')
def web():
def jade(source_name, out_dir):
pwd = path.join(path.dirname(__file__), 'website')
jade_loc = path.join(pwd, 'src', 'jade', source_name)
out_loc = path.join(pwd, 'site', out_dir)
local('jade -P %s --out %s' % (jade_loc, out_loc))
with virtualenv(VENV_DIR):
local('./website/create_code_samples spacy/tests/website/ website/src/code/')
jade('404.jade', '')
jade('home/index.jade', '')
jade('docs/index.jade', 'docs/')
jade('blog/index.jade', 'blog/')
for collection in ('blog', 'tutorials'):
for post_dir in (Path(__file__).parent / 'website' / 'src' / 'jade' / collection).iterdir():
if post_dir.is_dir() \
and (post_dir / 'index.jade').exists() \
and (post_dir / 'meta.jade').exists():
jade(str(post_dir / 'index.jade'), path.join(collection, post_dir.parts[-1]))
def web_publish(assets_path):
from boto.s3.connection import S3Connection, OrdinaryCallingFormat
site_path = 'website/site'
os.environ['S3_USE_SIGV4'] = 'True'
conn = S3Connection(host='s3.eu-central-1.amazonaws.com',
calling_format=OrdinaryCallingFormat())
bucket = conn.get_bucket('spacy.io', validate=False)
keys_left = set([k.name for k in bucket.list()
if not k.name.startswith('resources')])
for root, dirnames, filenames in os.walk(site_path):
for dirname in dirnames:
target = os.path.relpath(os.path.join(root, dirname), site_path)
source = os.path.join(target, 'index.html')
if os.path.exists(os.path.join(root, dirname, 'index.html')):
key = bucket.new_key(source)
key.set_redirect('//%s/%s' % (bucket.name, target))
print('adding redirect for %s' % target)
keys_left.remove(source)
for filename in filenames:
source = os.path.join(root, filename)
target = os.path.relpath(root, site_path)
if target == '.':
target = filename
elif filename != 'index.html':
target = os.path.join(target, filename)
key = bucket.new_key(target)
key.set_metadata('Content-Type', 'text/html')
key.set_contents_from_filename(source)
print('uploading %s' % target)
keys_left.remove(target)
for key_name in keys_left:
print('deleting %s' % key_name)
bucket.delete_key(key_name)
local('aws s3 sync --delete %s s3://spacy.io/resources' % assets_path)
def publish(version):
with virtualenv(VENV_DIR):
local('git push origin master')
local('git tag -a %s' % version)
local('git push origin %s' % version)
local('python setup.py sdist')
local('python setup.py register')
local('twine upload dist/spacy-%s.tar.gz' % version)
def env(lang="python2.7"):
if file_exists('.env'):
local('rm -rf .env')
@ -172,38 +44,3 @@ def test():
with virtualenv(VENV_DIR):
with lcd(path.dirname(__file__)):
local('py.test -x spacy/tests')
def train(json_dir=None, dev_loc=None, model_dir=None):
if json_dir is None:
json_dir = 'corpora/en/json'
if model_dir is None:
model_dir = 'models/en/'
with virtualenv(VENV_DIR):
with lcd(path.dirname(__file__)):
local('python bin/init_model.py en lang_data/ corpora/ ' + model_dir)
local('python bin/parser/train.py -p en %s/train/ %s/development %s' % (json_dir, json_dir, model_dir))
def travis():
local('open https://travis-ci.org/honnibal/thinc')
def pos():
with virtualenv(VENV_DIR):
local('python tools/train.py ~/work_data/docparse/wsj02-21.conll ~/work_data/docparse/wsj22.conll spacy/en/data')
local('python tools/tag.py ~/work_data/docparse/wsj22.raw /tmp/tmp')
local('python tools/eval_pos.py ~/work_data/docparse/wsj22.conll /tmp/tmp')
def ner():
local('rm -rf data/en/ner')
local('python tools/train_ner.py ~/work_data/docparse/wsj02-21.conll data/en/ner')
local('python tools/tag_ner.py ~/work_data/docparse/wsj22.raw /tmp/tmp')
local('python tools/eval_ner.py ~/work_data/docparse/wsj22.conll /tmp/tmp | tail')
def conll():
local('rm -rf data/en/ner')
local('python tools/conll03_train.py ~/work_data/ner/conll2003/eng.train data/en/ner/')
local('python tools/conll03_eval.py ~/work_data/ner/conll2003/eng.testa')

View File

@ -1,319 +0,0 @@
# surface form lemma pos
# multiple values are separated by |
# empty lines and lines starting with # are being ignored
'' ''
\") \")
\n \n <nl> SP
\t \t <tab> SP
<space> SP
# example: Wie geht's?
's 's es
'S 'S es
# example: Haste mal 'nen Euro?
'n 'n ein
'ne 'ne eine
'nen 'nen einen
# example: Kommen S nur herein!
s' s' sie
S' S' sie
# example: Da haben wir's!
ich's ich|'s ich|es
du's du|'s du|es
er's er|'s er|es
sie's sie|'s sie|es
wir's wir|'s wir|es
ihr's ihr|'s ihr|es
# example: Die katze auf'm dach.
auf'm auf|'m auf|dem
unter'm unter|'m unter|dem
über'm über|'m über|dem
vor'm vor|'m vor|dem
hinter'm hinter|'m hinter|dem
# persons
B.A. B.A.
B.Sc. B.Sc.
Dipl. Dipl.
Dipl.-Ing. Dipl.-Ing.
Dr. Dr.
Fr. Fr.
Frl. Frl.
Hr. Hr.
Hrn. Hrn.
Frl. Frl.
Prof. Prof.
St. St.
Hrgs. Hrgs.
Hg. Hg.
a.Z. a.Z.
a.D. a.D.
h.c. h.c.
Jr. Jr.
jr. jr.
jun. jun.
sen. sen.
rer. rer.
Ing. Ing.
M.A. M.A.
Mr. Mr.
M.Sc. M.Sc.
nat. nat.
phil. phil.
# companies
Co. Co.
co. co.
Cie. Cie.
A.G. A.G.
G.m.b.H. G.m.b.H.
i.G. i.G.
e.V. e.V.
# popular german abbreviations
Abb. Abb.
Abk. Abk.
Abs. Abs.
Abt. Abt.
abzgl. abzgl.
allg. allg.
a.M. a.M.
Bd. Bd.
betr. betr.
Betr. Betr.
Biol. Biol.
biol. biol.
Bf. Bf.
Bhf. Bhf.
Bsp. Bsp.
bspw. bspw.
bzgl. bzgl.
bzw. bzw.
d.h. d.h.
dgl. dgl.
ebd. ebd.
ehem. ehem.
eigtl. eigtl.
entspr. entspr.
erm. erm.
ev. ev.
evtl. evtl.
Fa. Fa.
Fam. Fam.
geb. geb.
Gebr. Gebr.
gem. gem.
ggf. ggf.
ggü. ggü.
ggfs. ggfs.
gegr. gegr.
Hbf. Hbf.
Hrsg. Hrsg.
hrsg. hrsg.
i.A. i.A.
i.d.R. i.d.R.
inkl. inkl.
insb. insb.
i.O. i.O.
i.Tr. i.Tr.
i.V. i.V.
jur. jur.
kath. kath.
K.O. K.O.
lt. lt.
max. max.
m.E. m.E.
m.M. m.M.
mtl. mtl.
min. min.
mind. mind.
MwSt. MwSt.
Nr. Nr.
o.a. o.a.
o.ä. o.ä.
o.Ä. o.Ä.
o.g. o.g.
o.k. o.k.
O.K. O.K.
Orig. Orig.
orig. orig.
pers. pers.
Pkt. Pkt.
Red. Red.
röm. röm.
s.o. s.o.
sog. sog.
std. std.
stellv. stellv.
Str. Str.
tägl. tägl.
Tel. Tel.
u.a. u.a.
usf. usf.
u.s.w. u.s.w.
usw. usw.
u.U. u.U.
u.v.m. u.v.m.
uvm. uvm.
v.a. v.a.
vgl. vgl.
vllt. vllt.
v.l.n.r. v.l.n.r.
vlt. vlt.
Vol. Vol.
wiss. wiss.
Univ. Univ.
z.B. z.B.
z.b. z.b.
z.Bsp. z.Bsp.
z.T. z.T.
z.Z. z.Z.
zzgl. zzgl.
z.Zt. z.Zt.
# popular latin abbreviations
vs. vs.
adv. adv.
Chr. Chr.
A.C. A.C.
A.D. A.D.
e.g. e.g.
i.e. i.e.
al. al.
p.a. p.a.
P.S. P.S.
q.e.d. q.e.d.
R.I.P. R.I.P.
etc. etc.
incl. incl.
ca. ca.
n.Chr. n.Chr.
p.s. p.s.
v.Chr. v.Chr.
# popular english abbreviations
D.C. D.C.
N.Y. N.Y.
N.Y.C. N.Y.C.
U.S. U.S.
U.S.A. U.S.A.
L.A. L.A.
U.S.S. U.S.S.
# dates & time
Jan. Jan.
Feb. Feb.
Mrz. Mrz.
Mär. Mär.
Apr. Apr.
Jun. Jun.
Jul. Jul.
Aug. Aug.
Sep. Sep.
Sept. Sept.
Okt. Okt.
Nov. Nov.
Dez. Dez.
Mo. Mo.
Di. Di.
Mi. Mi.
Do. Do.
Fr. Fr.
Sa. Sa.
So. So.
Std. Std.
Jh. Jh.
Jhd. Jhd.
# numbers
Tsd. Tsd.
Mio. Mio.
Mrd. Mrd.
# countries & languages
engl. engl.
frz. frz.
lat. lat.
österr. österr.
# smileys
:) :)
<3 <3
;) ;)
(: (:
:( :(
-_- -_-
=) =)
:/ :/
:> :>
;-) ;-)
:Y :Y
:P :P
:-P :-P
:3 :3
=3 =3
xD xD
^_^ ^_^
=] =]
=D =D
<333 <333
:)) :))
:0 :0
-__- -__-
xDD xDD
o_o o_o
o_O o_O
V_V V_V
=[[ =[[
<33 <33
;p ;p
;D ;D
;-p ;-p
;( ;(
:p :p
:] :]
:O :O
:-/ :-/
:-) :-)
:((( :(((
:(( :((
:') :')
(^_^) (^_^)
(= (=
o.O o.O
# single letters
a. a.
b. b.
c. c.
d. d.
e. e.
f. f.
g. g.
h. h.
i. i.
j. j.
k. k.
l. l.
m. m.
n. n.
o. o.
p. p.
q. q.
r. r.
s. s.
t. t.
u. u.
v. v.
w. w.
x. x.
y. y.
z. z.
ä. ä.
ö. ö.
ü. ü.

View File

@ -1,194 +0,0 @@
{
"Reddit": [
"PRODUCT",
{},
[
[{"lower": "reddit"}]
]
],
"SeptemberElevenAttacks": [
"EVENT",
{},
[
[
{"orth": "9/11"}
],
[
{"lower": "september"},
{"orth": "11"}
]
]
],
"Linux": [
"PRODUCT",
{},
[
[{"lower": "linux"}]
]
],
"Haskell": [
"PRODUCT",
{},
[
[{"lower": "haskell"}]
]
],
"HaskellCurry": [
"PERSON",
{},
[
[
{"lower": "haskell"},
{"lower": "curry"}
]
]
],
"Javascript": [
"PRODUCT",
{},
[
[{"lower": "javascript"}]
]
],
"CSS": [
"PRODUCT",
{},
[
[{"lower": "css"}],
[{"lower": "css3"}]
]
],
"displaCy": [
"PRODUCT",
{},
[
[{"lower": "displacy"}]
]
],
"spaCy": [
"PRODUCT",
{},
[
[{"orth": "spaCy"}]
]
],
"HTML": [
"PRODUCT",
{},
[
[{"lower": "html"}],
[{"lower": "html5"}]
]
],
"Python": [
"PRODUCT",
{},
[
[{"orth": "Python"}]
]
],
"Ruby": [
"PRODUCT",
{},
[
[{"orth": "Ruby"}]
]
],
"Digg": [
"PRODUCT",
{},
[
[{"lower": "digg"}]
]
],
"FoxNews": [
"ORG",
{},
[
[{"orth": "Fox"}],
[{"orth": "News"}]
]
],
"Google": [
"ORG",
{},
[
[{"lower": "google"}]
]
],
"Mac": [
"PRODUCT",
{},
[
[{"lower": "mac"}]
]
],
"Wikipedia": [
"PRODUCT",
{},
[
[{"lower": "wikipedia"}]
]
],
"Windows": [
"PRODUCT",
{},
[
[{"orth": "Windows"}]
]
],
"Dell": [
"ORG",
{},
[
[{"lower": "dell"}]
]
],
"Facebook": [
"ORG",
{},
[
[{"lower": "facebook"}]
]
],
"Blizzard": [
"ORG",
{},
[
[{"orth": "Blizzard"}]
]
],
"Ubuntu": [
"ORG",
{},
[
[{"orth": "Ubuntu"}]
]
],
"Youtube": [
"PRODUCT",
{},
[
[{"lower": "youtube"}]
]
],
"false_positives": [
null,
{},
[
[{"orth": "Shit"}],
[{"orth": "Weed"}],
[{"orth": "Cool"}],
[{"orth": "Btw"}],
[{"orth": "Bah"}],
[{"orth": "Bullshit"}],
[{"orth": "Lol"}],
[{"orth": "Yo"}, {"lower": "dawg"}],
[{"orth": "Yay"}],
[{"orth": "Ahh"}],
[{"orth": "Yea"}],
[{"orth": "Bah"}]
]
]
}

View File

@ -1,334 +0,0 @@
# coding=utf8
import json
import io
import itertools
contractions = {}
# contains the lemmas, parts of speech, number, and tenspect of
# potential tokens generated after splitting contractions off
token_properties = {}
# contains starting tokens with their potential contractions
# each potential contraction has a list of exceptions
# lower - don't generate the lowercase version
# upper - don't generate the uppercase version
# contrLower - don't generate the lowercase version with apostrophe (') removed
# contrUpper - dont' generate the uppercase version with apostrophe (') removed
# for example, we don't want to create the word "hell" or "Hell" from "he" + "'ll" so
# we add "contrLower" and "contrUpper" to the exceptions list
starting_tokens = {}
# other specials that don't really have contractions
# so they are hardcoded
hardcoded_specials = {
"''": [{"F": "''"}],
"\")": [{"F": "\")"}],
"\n": [{"F": "\n", "pos": "SP"}],
"\t": [{"F": "\t", "pos": "SP"}],
" ": [{"F": " ", "pos": "SP"}],
# example: Wie geht's?
"'s": [{"F": "'s", "L": "es"}],
"'S": [{"F": "'S", "L": "es"}],
# example: Haste mal 'nen Euro?
"'n": [{"F": "'n", "L": "ein"}],
"'ne": [{"F": "'ne", "L": "eine"}],
"'nen": [{"F": "'nen", "L": "einen"}],
# example: Kommen S nur herein!
"s'": [{"F": "s'", "L": "sie"}],
"S'": [{"F": "S'", "L": "sie"}],
# example: Da haben wir's!
"ich's": [{"F": "ich"}, {"F": "'s", "L": "es"}],
"du's": [{"F": "du"}, {"F": "'s", "L": "es"}],
"er's": [{"F": "er"}, {"F": "'s", "L": "es"}],
"sie's": [{"F": "sie"}, {"F": "'s", "L": "es"}],
"wir's": [{"F": "wir"}, {"F": "'s", "L": "es"}],
"ihr's": [{"F": "ihr"}, {"F": "'s", "L": "es"}],
# example: Die katze auf'm dach.
"auf'm": [{"F": "auf"}, {"F": "'m", "L": "dem"}],
"unter'm": [{"F": "unter"}, {"F": "'m", "L": "dem"}],
"über'm": [{"F": "über"}, {"F": "'m", "L": "dem"}],
"vor'm": [{"F": "vor"}, {"F": "'m", "L": "dem"}],
"hinter'm": [{"F": "hinter"}, {"F": "'m", "L": "dem"}],
# persons
"Fr.": [{"F": "Fr."}],
"Hr.": [{"F": "Hr."}],
"Frl.": [{"F": "Frl."}],
"Prof.": [{"F": "Prof."}],
"Dr.": [{"F": "Dr."}],
"St.": [{"F": "St."}],
"Hrgs.": [{"F": "Hrgs."}],
"Hg.": [{"F": "Hg."}],
"a.Z.": [{"F": "a.Z."}],
"a.D.": [{"F": "a.D."}],
"A.D.": [{"F": "A.D."}],
"h.c.": [{"F": "h.c."}],
"jun.": [{"F": "jun."}],
"sen.": [{"F": "sen."}],
"rer.": [{"F": "rer."}],
"Dipl.": [{"F": "Dipl."}],
"Ing.": [{"F": "Ing."}],
"Dipl.-Ing.": [{"F": "Dipl.-Ing."}],
# companies
"Co.": [{"F": "Co."}],
"co.": [{"F": "co."}],
"Cie.": [{"F": "Cie."}],
"A.G.": [{"F": "A.G."}],
"G.m.b.H.": [{"F": "G.m.b.H."}],
"i.G.": [{"F": "i.G."}],
"e.V.": [{"F": "e.V."}],
# popular german abbreviations
"ggü.": [{"F": "ggü."}],
"ggf.": [{"F": "ggf."}],
"ggfs.": [{"F": "ggfs."}],
"Gebr.": [{"F": "Gebr."}],
"geb.": [{"F": "geb."}],
"gegr.": [{"F": "gegr."}],
"erm.": [{"F": "erm."}],
"engl.": [{"F": "engl."}],
"ehem.": [{"F": "ehem."}],
"Biol.": [{"F": "Biol."}],
"biol.": [{"F": "biol."}],
"Abk.": [{"F": "Abk."}],
"Abb.": [{"F": "Abb."}],
"abzgl.": [{"F": "abzgl."}],
"Hbf.": [{"F": "Hbf."}],
"Bhf.": [{"F": "Bhf."}],
"Bf.": [{"F": "Bf."}],
"i.V.": [{"F": "i.V."}],
"inkl.": [{"F": "inkl."}],
"insb.": [{"F": "insb."}],
"z.B.": [{"F": "z.B."}],
"i.Tr.": [{"F": "i.Tr."}],
"Jhd.": [{"F": "Jhd."}],
"jur.": [{"F": "jur."}],
"lt.": [{"F": "lt."}],
"nat.": [{"F": "nat."}],
"u.a.": [{"F": "u.a."}],
"u.s.w.": [{"F": "u.s.w."}],
"Nr.": [{"F": "Nr."}],
"Univ.": [{"F": "Univ."}],
"vgl.": [{"F": "vgl."}],
"zzgl.": [{"F": "zzgl."}],
"z.Z.": [{"F": "z.Z."}],
"betr.": [{"F": "betr."}],
"ehem.": [{"F": "ehem."}],
# popular latin abbreviations
"vs.": [{"F": "vs."}],
"adv.": [{"F": "adv."}],
"Chr.": [{"F": "Chr."}],
"A.C.": [{"F": "A.C."}],
"A.D.": [{"F": "A.D."}],
"e.g.": [{"F": "e.g."}],
"i.e.": [{"F": "i.e."}],
"al.": [{"F": "al."}],
"p.a.": [{"F": "p.a."}],
"P.S.": [{"F": "P.S."}],
"q.e.d.": [{"F": "q.e.d."}],
"R.I.P.": [{"F": "R.I.P."}],
"etc.": [{"F": "etc."}],
"incl.": [{"F": "incl."}],
# popular english abbreviations
"D.C.": [{"F": "D.C."}],
"N.Y.": [{"F": "N.Y."}],
"N.Y.C.": [{"F": "N.Y.C."}],
# dates
"Jan.": [{"F": "Jan."}],
"Feb.": [{"F": "Feb."}],
"Mrz.": [{"F": "Mrz."}],
"Mär.": [{"F": "Mär."}],
"Apr.": [{"F": "Apr."}],
"Jun.": [{"F": "Jun."}],
"Jul.": [{"F": "Jul."}],
"Aug.": [{"F": "Aug."}],
"Sep.": [{"F": "Sep."}],
"Sept.": [{"F": "Sept."}],
"Okt.": [{"F": "Okt."}],
"Nov.": [{"F": "Nov."}],
"Dez.": [{"F": "Dez."}],
"Mo.": [{"F": "Mo."}],
"Di.": [{"F": "Di."}],
"Mi.": [{"F": "Mi."}],
"Do.": [{"F": "Do."}],
"Fr.": [{"F": "Fr."}],
"Sa.": [{"F": "Sa."}],
"So.": [{"F": "So."}],
# smileys
":)": [{"F": ":)"}],
"<3": [{"F": "<3"}],
";)": [{"F": ";)"}],
"(:": [{"F": "(:"}],
":(": [{"F": ":("}],
"-_-": [{"F": "-_-"}],
"=)": [{"F": "=)"}],
":/": [{"F": ":/"}],
":>": [{"F": ":>"}],
";-)": [{"F": ";-)"}],
":Y": [{"F": ":Y"}],
":P": [{"F": ":P"}],
":-P": [{"F": ":-P"}],
":3": [{"F": ":3"}],
"=3": [{"F": "=3"}],
"xD": [{"F": "xD"}],
"^_^": [{"F": "^_^"}],
"=]": [{"F": "=]"}],
"=D": [{"F": "=D"}],
"<333": [{"F": "<333"}],
":))": [{"F": ":))"}],
":0": [{"F": ":0"}],
"-__-": [{"F": "-__-"}],
"xDD": [{"F": "xDD"}],
"o_o": [{"F": "o_o"}],
"o_O": [{"F": "o_O"}],
"V_V": [{"F": "V_V"}],
"=[[": [{"F": "=[["}],
"<33": [{"F": "<33"}],
";p": [{"F": ";p"}],
";D": [{"F": ";D"}],
";-p": [{"F": ";-p"}],
";(": [{"F": ";("}],
":p": [{"F": ":p"}],
":]": [{"F": ":]"}],
":O": [{"F": ":O"}],
":-/": [{"F": ":-/"}],
":-)": [{"F": ":-)"}],
":(((": [{"F": ":((("}],
":((": [{"F": ":(("}],
":')": [{"F": ":')"}],
"(^_^)": [{"F": "(^_^)"}],
"(=": [{"F": "(="}],
"o.O": [{"F": "o.O"}],
"a.": [{"F": "a."}],
"b.": [{"F": "b."}],
"c.": [{"F": "c."}],
"d.": [{"F": "d."}],
"e.": [{"F": "e."}],
"f.": [{"F": "f."}],
"g.": [{"F": "g."}],
"h.": [{"F": "h."}],
"i.": [{"F": "i."}],
"j.": [{"F": "j."}],
"k.": [{"F": "k."}],
"l.": [{"F": "l."}],
"m.": [{"F": "m."}],
"n.": [{"F": "n."}],
"o.": [{"F": "o."}],
"p.": [{"F": "p."}],
"q.": [{"F": "q."}],
"r.": [{"F": "r."}],
"s.": [{"F": "s."}],
"t.": [{"F": "t."}],
"u.": [{"F": "u."}],
"v.": [{"F": "v."}],
"w.": [{"F": "w."}],
"x.": [{"F": "x."}],
"y.": [{"F": "y."}],
"z.": [{"F": "z."}],
}
def get_double_contractions(ending):
endings = []
ends_with_contraction = any([ending.endswith(contraction) for contraction in contractions])
while ends_with_contraction:
for contraction in contractions:
if ending.endswith(contraction):
endings.append(contraction)
ending = ending.rstrip(contraction)
ends_with_contraction = any([ending.endswith(contraction) for contraction in contractions])
endings.reverse() # reverse because the last ending is put in the list first
return endings
def get_token_properties(token, capitalize=False, remove_contractions=False):
props = dict(token_properties.get(token)) # ensure we copy the dict so we can add the "F" prop
if capitalize:
token = token.capitalize()
if remove_contractions:
token = token.replace("'", "")
props["F"] = token
return props
def create_entry(token, endings, capitalize=False, remove_contractions=False):
properties = []
properties.append(get_token_properties(token, capitalize=capitalize, remove_contractions=remove_contractions))
for e in endings:
properties.append(get_token_properties(e, remove_contractions=remove_contractions))
return properties
FIELDNAMES = ['F','L','pos']
def read_hardcoded(stream):
hc_specials = {}
for line in stream:
line = line.strip()
if line.startswith('#') or not line:
continue
key,_,rest = line.partition('\t')
values = []
for annotation in zip(*[ e.split('|') for e in rest.split('\t') ]):
values.append({ k:v for k,v in itertools.izip_longest(FIELDNAMES,annotation) if v })
hc_specials[key] = values
return hc_specials
def generate_specials():
specials = {}
for token in starting_tokens:
possible_endings = starting_tokens[token]
for ending in possible_endings:
endings = []
if ending.count("'") > 1:
endings.extend(get_double_contractions(ending))
else:
endings.append(ending)
exceptions = possible_endings[ending]
if "lower" not in exceptions:
special = token + ending
specials[special] = create_entry(token, endings)
if "upper" not in exceptions:
special = token.capitalize() + ending
specials[special] = create_entry(token, endings, capitalize=True)
if "contrLower" not in exceptions:
special = token + ending.replace("'", "")
specials[special] = create_entry(token, endings, remove_contractions=True)
if "contrUpper" not in exceptions:
special = token.capitalize() + ending.replace("'", "")
specials[special] = create_entry(token, endings, capitalize=True, remove_contractions=True)
# add in hardcoded specials
# changed it so it generates them from a file
with io.open('abbrev.de.tab','r',encoding='utf8') as abbrev_:
hc_specials = read_hardcoded(abbrev_)
specials = dict(specials, **hc_specials)
return specials
if __name__ == "__main__":
specials = generate_specials()
with open("specials.json", "w") as f:
json.dump(specials, f, sort_keys=True, indent=4, separators=(',', ': '))

View File

@ -1,6 +0,0 @@
\.\.\.
(?<=[a-z])\.(?=[A-Z])
(?<=[a-zöäüßA-ZÖÄÜ"]):(?=[a-zöäüßA-ZÖÄÜ])
(?<=[a-zöäüßA-ZÖÄÜ"])>(?=[a-zöäüßA-ZÖÄÜ])
(?<=[a-zöäüßA-ZÖÄÜ"])<(?=[a-zöäüßA-ZÖÄÜ])
(?<=[a-zöäüßA-ZÖÄÜ"])=(?=[a-zöäüßA-ZÖÄÜ])

View File

@ -1 +0,0 @@
{}

View File

@ -1,71 +0,0 @@
{
"PRP": {
"ich": {"L": "-PRON-", "person": 1, "number": 1, "gender": 0, "case": 1},
"meiner": {"L": "-PRON-", "person": 1, "number": 1, "gender": 0, "case": 2},
"mir": {"L": "-PRON-", "person": 1, "number": 1, "gender": 0, "case": 3},
"mich": {"L": "-PRON-", "person": 1, "number": 1, "gender": 0, "case": 4},
"du": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 1},
"deiner": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 2},
"dir": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 3},
"dich": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 4},
"er": {"L": "-PRON-", "person": 3, "number": 0, "gender": 1, "case": 1},
"seiner": {"L": "-PRON-", "person": 3, "number": 0, "gender": 1, "case": 2},
"ihm": {"L": "-PRON-", "person": 3, "number": 0, "gender": 1, "case": 3},
"ihn": {"L": "-PRON-", "person": 3, "number": 0, "gender": 1, "case": 4},
"sie": {"L": "-PRON-", "person": 3, "number": 0, "gender": 2, "case": 1},
"ihrer": {"L": "-PRON-", "person": 3, "number": 0, "gender": 2, "case": 2},
"ihr": {"L": "-PRON-", "person": 3, "number": 0, "gender": 2, "case": 3},
"sie": {"L": "-PRON-", "person": 3, "number": 0, "gender": 2, "case": 4},
"es": {"L": "-PRON-", "person": 3, "number": 0, "gender": 3, "case": 1},
"seiner": {"L": "-PRON-", "person": 3, "number": 0, "gender": 3, "case": 2},
"ihm": {"L": "-PRON-", "person": 3, "number": 0, "gender": 3, "case": 3},
"es": {"L": "-PRON-", "person": 3, "number": 0, "gender": 3, "case": 4},
"wir": {"L": "-PRON-", "person": 1, "number": 0, "gender": 0, "case": 1},
"unser": {"L": "-PRON-", "person": 1, "number": 0, "gender": 0, "case": 2},
"uns": {"L": "-PRON-", "person": 1, "number": 0, "gender": 0, "case": 3},
"uns": {"L": "-PRON-", "person": 1, "number": 0, "gender": 0, "case": 4},
"ihr": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 1},
"euer": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 2},
"euch": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 3},
"euch": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 4},
"sie": {"L": "-PRON-", "person": 3, "number": 0, "gender": 0, "case": 1},
"ihrer": {"L": "-PRON-", "person": 3, "number": 0, "gender": 0, "case": 2},
"ihnen": {"L": "-PRON-", "person": 3, "number": 0, "gender": 0, "case": 3},
"sie": {"L": "-PRON-", "person": 3, "number": 0, "gender": 0, "case": 4}
},
"PRP$": {
"mein": {"L": "-PRON-", "person": 1, "number": 0, "gender": 0, "case": 1},
"meines": {"L": "-PRON-", "person": 1, "number": 0, "gender": 0, "case": 2},
"meinem": {"L": "-PRON-", "person": 1, "number": 0, "gender": 0, "case": 3},
"meinen": {"L": "-PRON-", "person": 1, "number": 0, "gender": 0, "case": 4},
"dein": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 1},
"deines": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 2},
"deinem": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 3},
"deinen": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 4},
"sein": {"L": "-PRON-", "person": 3, "number": 0, "gender": 1, "case": 1},
"seines": {"L": "-PRON-", "person": 3, "number": 0, "gender": 1, "case": 2},
"seinem": {"L": "-PRON-", "person": 3, "number": 0, "gender": 1, "case": 3},
"seinen": {"L": "-PRON-", "person": 3, "number": 0, "gender": 1, "case": 4},
"ihr": {"L": "-PRON-", "person": 3, "number": 0, "gender": 2, "case": 1},
"ihrer": {"L": "-PRON-", "person": 3, "number": 0, "gender": 2, "case": 2},
"ihrem": {"L": "-PRON-", "person": 3, "number": 0, "gender": 2, "case": 3},
"ihren": {"L": "-PRON-", "person": 3, "number": 0, "gender": 2, "case": 4},
"sein": {"L": "-PRON-", "person": 3, "number": 0, "gender": 3, "case": 1},
"seines": {"L": "-PRON-", "person": 3, "number": 0, "gender": 3, "case": 2},
"seinem": {"L": "-PRON-", "person": 3, "number": 0, "gender": 3, "case": 3},
"seinen": {"L": "-PRON-", "person": 3, "number": 0, "gender": 3, "case": 4},
"unser": {"L": "-PRON-", "person": 1, "number": 0, "gender": 0, "case": 1},
"unseres": {"L": "-PRON-", "person": 1, "number": 0, "gender": 0, "case": 2},
"unserem": {"L": "-PRON-", "person": 1, "number": 0, "gender": 0, "case": 3},
"unseren": {"L": "-PRON-", "person": 1, "number": 0, "gender": 0, "case": 4},
"euer": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 1},
"eures": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 2},
"eurem": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 3},
"euren": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 4},
"ihr": {"L": "-PRON-", "person": 3, "number": 0, "gender": 0, "case": 1},
"ihres": {"L": "-PRON-", "person": 3, "number": 0, "gender": 0, "case": 2},
"ihrem": {"L": "-PRON-", "person": 3, "number": 0, "gender": 0, "case": 3},
"ihren": {"L": "-PRON-", "person": 3, "number": 0, "gender": 0, "case": 4}
}
}

View File

@ -1,27 +0,0 @@
,
"
(
[
{
*
<
>
$
£
'
``
`
#
US$
C$
A$
a-
....
...
»
_
§

View File

@ -1,3 +0,0 @@
Biografie: Ein Spiel ist ein Theaterstück des Schweizer Schriftstellers Max Frisch, das 1967 entstand und am 1. Februar 1968 im Schauspielhaus Zürich uraufgeführt wurde. 1984 legte Frisch eine überarbeitete Neufassung vor. Das von Frisch als Komödie bezeichnete Stück greift eines seiner zentralen Themen auf: die Möglichkeit oder Unmöglichkeit des Menschen, seine Identität zu verändern.
Mit Biografie: Ein Spiel wandte sich Frisch von der Parabelform seiner Erfolgsstücke Biedermann und die Brandstifter und Andorra ab und postulierte eine „Dramaturgie der Permutation“. Darin sollte nicht, wie im klassischen Theater, Sinn und Schicksal im Mittelpunkt stehen, sondern die Zufälligkeit von Ereignissen und die Möglichkeit ihrer Variation. Dennoch handelt Biografie: Ein Spiel gerade von der Unmöglichkeit seines Protagonisten, seinen Lebenslauf grundlegend zu verändern. Frisch empfand die Wirkung des Stücks im Nachhinein als zu fatalistisch und die Umsetzung seiner theoretischen Absichten als nicht geglückt. Obwohl das Stück 1968 als unpolitisch und nicht zeitgemäß kritisiert wurde und auch später eine geteilte Rezeption erfuhr, gehört es an deutschsprachigen Bühnen zu den häufiger aufgeführten Stücken Frischs.

File diff suppressed because it is too large Load Diff

View File

@ -1,73 +0,0 @@
,
\"
\)
\]
\}
\*
\!
\?
%
\$
>
:
;
'
«
_
''
's
'S
s
S
°
\.\.
\.\.\.
\.\.\.\.
(?<=[a-zäöüßÖÄÜ)\]"'´«‘’%\)²“”])\.
\-\-
´
(?<=[0-9])km²
(?<=[0-9])m²
(?<=[0-9])cm²
(?<=[0-9])mm²
(?<=[0-9])km³
(?<=[0-9])m³
(?<=[0-9])cm³
(?<=[0-9])mm³
(?<=[0-9])ha
(?<=[0-9])km
(?<=[0-9])m
(?<=[0-9])cm
(?<=[0-9])mm
(?<=[0-9])µm
(?<=[0-9])nm
(?<=[0-9])yd
(?<=[0-9])in
(?<=[0-9])ft
(?<=[0-9])kg
(?<=[0-9])g
(?<=[0-9])mg
(?<=[0-9])µg
(?<=[0-9])t
(?<=[0-9])lb
(?<=[0-9])oz
(?<=[0-9])m/s
(?<=[0-9])km/h
(?<=[0-9])mph
(?<=[0-9])°C
(?<=[0-9])°K
(?<=[0-9])°F
(?<=[0-9])hPa
(?<=[0-9])Pa
(?<=[0-9])mbar
(?<=[0-9])mb
(?<=[0-9])T
(?<=[0-9])G
(?<=[0-9])M
(?<=[0-9])K
(?<=[0-9])kb

View File

@ -1,59 +0,0 @@
{
"$(": {"pos": "PUNCT", "PunctType": "Brck"},
"$,": {"pos": "PUNCT", "PunctType": "Comm"},
"$.": {"pos": "PUNCT", "PunctType": "Peri"},
"ADJA": {"pos": "ADJ"},
"ADJD": {"pos": "ADJ", "Variant": "Short"},
"ADV": {"pos": "ADV"},
"APPO": {"pos": "ADP", "AdpType": "Post"},
"APPR": {"pos": "ADP", "AdpType": "Prep"},
"APPRART": {"pos": "ADP", "AdpType": "Prep", "PronType": "Art"},
"APZR": {"pos": "ADP", "AdpType": "Circ"},
"ART": {"pos": "DET", "PronType": "Art"},
"CARD": {"pos": "NUM", "NumType": "Card"},
"FM": {"pos": "X", "Foreign": "Yes"},
"ITJ": {"pos": "INTJ"},
"KOKOM": {"pos": "CONJ", "ConjType": "Comp"},
"KON": {"pos": "CONJ"},
"KOUI": {"pos": "SCONJ"},
"KOUS": {"pos": "SCONJ"},
"NE": {"pos": "PROPN"},
"NNE": {"pos": "PROPN"},
"NN": {"pos": "NOUN"},
"PAV": {"pos": "ADV", "PronType": "Dem"},
"PROAV": {"pos": "ADV", "PronType": "Dem"},
"PDAT": {"pos": "DET", "PronType": "Dem"},
"PDS": {"pos": "PRON", "PronType": "Dem"},
"PIAT": {"pos": "DET", "PronType": "Ind,Neg,Tot"},
"PIDAT": {"pos": "DET", "AdjType": "Pdt", "PronType": "Ind,Neg,Tot"},
"PIS": {"pos": "PRON", "PronType": "Ind,Neg,Tot"},
"PPER": {"pos": "PRON", "PronType": "Prs"},
"PPOSAT": {"pos": "DET", "Poss": "Yes", "PronType": "Prs"},
"PPOSS": {"pos": "PRON", "Poss": "Yes", "PronType": "Prs"},
"PRELAT": {"pos": "DET", "PronType": "Rel"},
"PRELS": {"pos": "PRON", "PronType": "Rel"},
"PRF": {"pos": "PRON", "PronType": "Prs", "Reflex": "Yes"},
"PTKA": {"pos": "PART"},
"PTKANT": {"pos": "PART", "PartType": "Res"},
"PTKNEG": {"pos": "PART", "Negative": "Neg"},
"PTKVZ": {"pos": "PART", "PartType": "Vbp"},
"PTKZU": {"pos": "PART", "PartType": "Inf"},
"PWAT": {"pos": "DET", "PronType": "Int"},
"PWAV": {"pos": "ADV", "PronType": "Int"},
"PWS": {"pos": "PRON", "PronType": "Int"},
"TRUNC": {"pos": "X", "Hyph": "Yes"},
"VAFIN": {"pos": "AUX", "Mood": "Ind", "VerbForm": "Fin"},
"VAIMP": {"pos": "AUX", "Mood": "Imp", "VerbForm": "Fin"},
"VAINF": {"pos": "AUX", "VerbForm": "Inf"},
"VAPP": {"pos": "AUX", "Aspect": "Perf", "VerbForm": "Part"},
"VMFIN": {"pos": "VERB", "Mood": "Ind", "VerbForm": "Fin", "VerbType": "Mod"},
"VMINF": {"pos": "VERB", "VerbForm": "Inf", "VerbType": "Mod"},
"VMPP": {"pos": "VERB", "Aspect": "Perf", "VerbForm": "Part", "VerbType": "Mod"},
"VVFIN": {"pos": "VERB", "Mood": "Ind", "VerbForm": "Fin"},
"VVIMP": {"pos": "VERB", "Mood": "Imp", "VerbForm": "Fin"},
"VVINF": {"pos": "VERB", "VerbForm": "Inf"},
"VVIZU": {"pos": "VERB", "VerbForm": "Inf"},
"VVPP": {"pos": "VERB", "Aspect": "Perf", "VerbForm": "Part"},
"XY": {"pos": "X"},
"SP": {"pos": "SPACE"}
}

View File

@ -1,20 +0,0 @@
WordNet Release 3.0 This software and database is being provided to you, the
LICENSEE, by Princeton University under the following license. By obtaining,
using and/or copying this software and database, you agree that you have read,
understood, and will comply with these terms and conditions.: Permission to
use, copy, modify and distribute this software and database and its
documentation for any purpose and without fee or royalty is hereby granted,
provided that you agree to comply with the following copyright notice and
statements, including the disclaimer, and that the same appear on ALL copies of
the software, database and documentation, including modifications that you make for internal use or for distribution. WordNet 3.0 Copyright 2006 by Princeton
University. All rights reserved. THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS"
AND PRINCETON UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON UNIVERSITY MAKES NO
REPRESENTATIONS OR WARRANTIES OF MERCHANT- ABILITY OR FITNESS FOR ANY
PARTICULAR PURPOSE OR THAT THE USE OF THE LICENSED SOFTWARE, DATABASE OR
DOCUMENTATION WILL NOT INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS
OR OTHER RIGHTS. The name of Princeton University or Princeton may not be used
in advertising or publicity pertaining to distribution of the software and/or
database. Title to copyright in this software, database and any associated
documentation shall at all times remain with Princeton University and LICENSEE
agrees to preserve same.

View File

@ -1,194 +0,0 @@
{
"Reddit": [
"PRODUCT",
{},
[
[{"lower": "reddit"}]
]
],
"SeptemberElevenAttacks": [
"EVENT",
{},
[
[
{"orth": "9/11"}
],
[
{"lower": "september"},
{"orth": "11"}
]
]
],
"Linux": [
"PRODUCT",
{},
[
[{"lower": "linux"}]
]
],
"Haskell": [
"PRODUCT",
{},
[
[{"lower": "haskell"}]
]
],
"HaskellCurry": [
"PERSON",
{},
[
[
{"lower": "haskell"},
{"lower": "curry"}
]
]
],
"Javascript": [
"PRODUCT",
{},
[
[{"lower": "javascript"}]
]
],
"CSS": [
"PRODUCT",
{},
[
[{"lower": "css"}],
[{"lower": "css3"}]
]
],
"displaCy": [
"PRODUCT",
{},
[
[{"lower": "displacy"}]
]
],
"spaCy": [
"PRODUCT",
{},
[
[{"orth": "spaCy"}]
]
],
"HTML": [
"PRODUCT",
{},
[
[{"lower": "html"}],
[{"lower": "html5"}]
]
],
"Python": [
"PRODUCT",
{},
[
[{"orth": "Python"}]
]
],
"Ruby": [
"PRODUCT",
{},
[
[{"orth": "Ruby"}]
]
],
"Digg": [
"PRODUCT",
{},
[
[{"lower": "digg"}]
]
],
"FoxNews": [
"ORG",
{},
[
[{"orth": "Fox"}],
[{"orth": "News"}]
]
],
"Google": [
"ORG",
{},
[
[{"lower": "google"}]
]
],
"Mac": [
"PRODUCT",
{},
[
[{"lower": "mac"}]
]
],
"Wikipedia": [
"PRODUCT",
{},
[
[{"lower": "wikipedia"}]
]
],
"Windows": [
"PRODUCT",
{},
[
[{"orth": "Windows"}]
]
],
"Dell": [
"ORG",
{},
[
[{"lower": "dell"}]
]
],
"Facebook": [
"ORG",
{},
[
[{"lower": "facebook"}]
]
],
"Blizzard": [
"ORG",
{},
[
[{"orth": "Blizzard"}]
]
],
"Ubuntu": [
"ORG",
{},
[
[{"orth": "Ubuntu"}]
]
],
"Youtube": [
"PRODUCT",
{},
[
[{"lower": "youtube"}]
]
],
"false_positives": [
null,
{},
[
[{"orth": "Shit"}],
[{"orth": "Weed"}],
[{"orth": "Cool"}],
[{"orth": "Btw"}],
[{"orth": "Bah"}],
[{"orth": "Bullshit"}],
[{"orth": "Lol"}],
[{"orth": "Yo"}, {"lower": "dawg"}],
[{"orth": "Yay"}],
[{"orth": "Ahh"}],
[{"orth": "Yea"}],
[{"orth": "Bah"}]
]
]
}

View File

@ -1,422 +0,0 @@
# -#- coding: utf-8 -*-
import json
contractions = {"n't", "'nt", "not", "'ve", "'d", "'ll", "'s", "'m", "'ma", "'re"}
# contains the lemmas, parts of speech, number, and tenspect of
# potential tokens generated after splitting contractions off
token_properties = {
"ai": {"L": "be", "pos": "VBP", "number": 2},
"are": {"L": "be", "pos": "VBP", "number": 2},
"ca": {"L": "can", "pos": "MD"},
"can": {"L": "can", "pos": "MD"},
"could": {"pos": "MD", "L": "could"},
"'d": {"L": "would", "pos": "MD"},
"did": {"L": "do", "pos": "VBD"},
"do": {"L": "do"},
"does": {"L": "do", "pos": "VBZ"},
"had": {"L": "have", "pos": "VBD"},
"has": {"L": "have", "pos": "VBZ"},
"have": {"pos": "VB"},
"he": {"L": "-PRON-", "pos": "PRP"},
"how": {},
"i": {"L": "-PRON-", "pos": "PRP"},
"is": {"L": "be", "pos": "VBZ"},
"it": {"L": "-PRON-", "pos": "PRP"},
"'ll": {"L": "will", "pos": "MD"},
"'m": {"L": "be", "pos": "VBP", "number": 1, "tenspect": 1},
"'ma": {},
"might": {},
"must": {},
"need": {},
"not": {"L": "not", "pos": "RB"},
"'nt": {"L": "not", "pos": "RB"},
"n't": {"L": "not", "pos": "RB"},
"'re": {"L": "be", "pos": "VBZ"},
"'s": {}, # no POS or lemma for s?
"sha": {"L": "shall", "pos": "MD"},
"she": {"L": "-PRON-", "pos": "PRP"},
"should": {},
"that": {},
"there": {},
"they": {"L": "-PRON-", "pos": "PRP"},
"was": {},
"we": {"L": "-PRON-", "pos": "PRP"},
"were": {},
"what": {},
"when": {},
"where": {},
"who": {},
"why": {},
"wo": {},
"would": {},
"you": {"L": "-PRON-", "pos": "PRP"},
"'ve": {"L": "have", "pos": "VB"}
}
# contains starting tokens with their potential contractions
# each potential contraction has a list of exceptions
# lower - don't generate the lowercase version
# upper - don't generate the uppercase version
# contrLower - don't generate the lowercase version with apostrophe (') removed
# contrUpper - dont' generate the uppercase version with apostrophe (') removed
# for example, we don't want to create the word "hell" or "Hell" from "he" + "'ll" so
# we add "contrLower" and "contrUpper" to the exceptions list
starting_tokens = {
"ai": {"n't": []},
"are": {"n't": []},
"ca": {"n't": []},
"can": {"not": []},
"could": {"'ve": [], "n't": [], "n't've": []},
"did": {"n't": []},
"does": {"n't": []},
"do": {"n't": []},
"had": {"n't": [], "n't've": []},
"has": {"n't": []},
"have": {"n't": []},
"he": {"'d": [], "'d've": [], "'ll": ["contrLower", "contrUpper"], "'s": []},
"how": {"'d": [], "'ll": [], "'s": []},
"i": {"'d": ["contrLower", "contrUpper"], "'d've": [], "'ll": ["contrLower", "contrUpper"], "'m": [], "'ma": [], "'ve": []},
"is": {"n't": []},
"it": {"'d": [], "'d've": [], "'ll": [], "'s": ["contrLower", "contrUpper"]},
"might": {"n't": [], "n't've": [], "'ve": []},
"must": {"n't": [], "'ve": []},
"need": {"n't": []},
"not": {"'ve": []},
"sha": {"n't": []},
"she": {"'d": ["contrLower", "contrUpper"], "'d've": [], "'ll": ["contrLower", "contrUpper"], "'s": []},
"should": {"'ve": [], "n't": [], "n't've": []},
"that": {"'s": []},
"there": {"'d": [], "'d've": [], "'s": ["contrLower", "contrUpper"], "'ll": []},
"they": {"'d": [], "'d've": [], "'ll": [], "'re": [], "'ve": []},
"was": {"n't": []},
"we": {"'d": ["contrLower", "contrUpper"], "'d've": [], "'ll": ["contrLower", "contrUpper"], "'re": ["contrLower", "contrUpper"], "'ve": []},
"were": {"n't": []},
"what": {"'ll": [], "'re": [], "'s": [], "'ve": []},
"when": {"'s": []},
"where": {"'d": [], "'s": [], "'ve": []},
"who": {"'d": [], "'ll": [], "'re": ["contrLower", "contrUpper"], "'s": [], "'ve": []},
"why": {"'ll": [], "'re": [], "'s": []},
"wo": {"n't": []},
"would": {"'ve": [], "n't": [], "n't've": []},
"you": {"'d": [], "'d've": [], "'ll": [], "'re": [], "'ve": []}
}
# other specials that don't really have contractions
# so they are hardcoded
hardcoded_specials = {
"let's": [{"F": "let"}, {"F": "'s", "L": "us"}],
"Let's": [{"F": "Let"}, {"F": "'s", "L": "us"}],
"'s": [{"F": "'s", "L": "'s"}],
"'S": [{"F": "'S", "L": "'s"}],
u"\u2018s": [{"F": u"\u2018s", "L": "'s"}],
u"\u2018S": [{"F": u"\u2018S", "L": "'s"}],
"'em": [{"F": "'em"}],
"'ol": [{"F": "'ol"}],
"vs.": [{"F": "vs."}],
"Ms.": [{"F": "Ms."}],
"Mr.": [{"F": "Mr."}],
"Dr.": [{"F": "Dr."}],
"Mrs.": [{"F": "Mrs."}],
"Messrs.": [{"F": "Messrs."}],
"Gov.": [{"F": "Gov."}],
"Gen.": [{"F": "Gen."}],
"Mt.": [{"F": "Mt.", "L": "Mount"}],
"''": [{"F": "''"}],
"": [{"F": "", "L": "--", "pos": ":"}],
"Corp.": [{"F": "Corp."}],
"Inc.": [{"F": "Inc."}],
"Co.": [{"F": "Co."}],
"co.": [{"F": "co."}],
"Ltd.": [{"F": "Ltd."}],
"Bros.": [{"F": "Bros."}],
"Rep.": [{"F": "Rep."}],
"Sen.": [{"F": "Sen."}],
"Jr.": [{"F": "Jr."}],
"Rev.": [{"F": "Rev."}],
"Adm.": [{"F": "Adm."}],
"St.": [{"F": "St."}],
"a.m.": [{"F": "a.m."}],
"p.m.": [{"F": "p.m."}],
"1a.m.": [{"F": "1"}, {"F": "a.m."}],
"2a.m.": [{"F": "2"}, {"F": "a.m."}],
"3a.m.": [{"F": "3"}, {"F": "a.m."}],
"4a.m.": [{"F": "4"}, {"F": "a.m."}],
"5a.m.": [{"F": "5"}, {"F": "a.m."}],
"6a.m.": [{"F": "6"}, {"F": "a.m."}],
"7a.m.": [{"F": "7"}, {"F": "a.m."}],
"8a.m.": [{"F": "8"}, {"F": "a.m."}],
"9a.m.": [{"F": "9"}, {"F": "a.m."}],
"10a.m.": [{"F": "10"}, {"F": "a.m."}],
"11a.m.": [{"F": "11"}, {"F": "a.m."}],
"12a.m.": [{"F": "12"}, {"F": "a.m."}],
"1am": [{"F": "1"}, {"F": "am", "L": "a.m."}],
"2am": [{"F": "2"}, {"F": "am", "L": "a.m."}],
"3am": [{"F": "3"}, {"F": "am", "L": "a.m."}],
"4am": [{"F": "4"}, {"F": "am", "L": "a.m."}],
"5am": [{"F": "5"}, {"F": "am", "L": "a.m."}],
"6am": [{"F": "6"}, {"F": "am", "L": "a.m."}],
"7am": [{"F": "7"}, {"F": "am", "L": "a.m."}],
"8am": [{"F": "8"}, {"F": "am", "L": "a.m."}],
"9am": [{"F": "9"}, {"F": "am", "L": "a.m."}],
"10am": [{"F": "10"}, {"F": "am", "L": "a.m."}],
"11am": [{"F": "11"}, {"F": "am", "L": "a.m."}],
"12am": [{"F": "12"}, {"F": "am", "L": "a.m."}],
"p.m.": [{"F": "p.m."}],
"1p.m.": [{"F": "1"}, {"F": "p.m."}],
"2p.m.": [{"F": "2"}, {"F": "p.m."}],
"3p.m.": [{"F": "3"}, {"F": "p.m."}],
"4p.m.": [{"F": "4"}, {"F": "p.m."}],
"5p.m.": [{"F": "5"}, {"F": "p.m."}],
"6p.m.": [{"F": "6"}, {"F": "p.m."}],
"7p.m.": [{"F": "7"}, {"F": "p.m."}],
"8p.m.": [{"F": "8"}, {"F": "p.m."}],
"9p.m.": [{"F": "9"}, {"F": "p.m."}],
"10p.m.": [{"F": "10"}, {"F": "p.m."}],
"11p.m.": [{"F": "11"}, {"F": "p.m."}],
"12p.m.": [{"F": "12"}, {"F": "p.m."}],
"1pm": [{"F": "1"}, {"F": "pm", "L": "p.m."}],
"2pm": [{"F": "2"}, {"F": "pm", "L": "p.m."}],
"3pm": [{"F": "3"}, {"F": "pm", "L": "p.m."}],
"4pm": [{"F": "4"}, {"F": "pm", "L": "p.m."}],
"5pm": [{"F": "5"}, {"F": "pm", "L": "p.m."}],
"6pm": [{"F": "6"}, {"F": "pm", "L": "p.m."}],
"7pm": [{"F": "7"}, {"F": "pm", "L": "p.m."}],
"8pm": [{"F": "8"}, {"F": "pm", "L": "p.m."}],
"9pm": [{"F": "9"}, {"F": "pm", "L": "p.m."}],
"10pm": [{"F": "10"}, {"F": "pm", "L": "p.m."}],
"11pm": [{"F": "11"}, {"F": "pm", "L": "p.m."}],
"12pm": [{"F": "12"}, {"F": "pm", "L": "p.m."}],
"Jan.": [{"F": "Jan."}],
"Feb.": [{"F": "Feb."}],
"Mar.": [{"F": "Mar."}],
"Apr.": [{"F": "Apr."}],
"May.": [{"F": "May."}],
"Jun.": [{"F": "Jun."}],
"Jul.": [{"F": "Jul."}],
"Aug.": [{"F": "Aug."}],
"Sep.": [{"F": "Sep."}],
"Sept.": [{"F": "Sept."}],
"Oct.": [{"F": "Oct."}],
"Nov.": [{"F": "Nov."}],
"Dec.": [{"F": "Dec."}],
"Ala.": [{"F": "Ala."}],
"Ariz.": [{"F": "Ariz."}],
"Ark.": [{"F": "Ark."}],
"Calif.": [{"F": "Calif."}],
"Colo.": [{"F": "Colo."}],
"Conn.": [{"F": "Conn."}],
"Del.": [{"F": "Del."}],
"D.C.": [{"F": "D.C."}],
"Fla.": [{"F": "Fla."}],
"Ga.": [{"F": "Ga."}],
"Ill.": [{"F": "Ill."}],
"Ind.": [{"F": "Ind."}],
"Kans.": [{"F": "Kans."}],
"Kan.": [{"F": "Kan."}],
"Ky.": [{"F": "Ky."}],
"La.": [{"F": "La."}],
"Md.": [{"F": "Md."}],
"Mass.": [{"F": "Mass."}],
"Mich.": [{"F": "Mich."}],
"Minn.": [{"F": "Minn."}],
"Miss.": [{"F": "Miss."}],
"Mo.": [{"F": "Mo."}],
"Mont.": [{"F": "Mont."}],
"Nebr.": [{"F": "Nebr."}],
"Neb.": [{"F": "Neb."}],
"Nev.": [{"F": "Nev."}],
"N.H.": [{"F": "N.H."}],
"N.J.": [{"F": "N.J."}],
"N.M.": [{"F": "N.M."}],
"N.Y.": [{"F": "N.Y."}],
"N.C.": [{"F": "N.C."}],
"N.D.": [{"F": "N.D."}],
"Okla.": [{"F": "Okla."}],
"Ore.": [{"F": "Ore."}],
"Pa.": [{"F": "Pa."}],
"Tenn.": [{"F": "Tenn."}],
"Va.": [{"F": "Va."}],
"Wash.": [{"F": "Wash."}],
"Wis.": [{"F": "Wis."}],
":)": [{"F": ":)"}],
"<3": [{"F": "<3"}],
";)": [{"F": ";)"}],
"(:": [{"F": "(:"}],
":(": [{"F": ":("}],
"-_-": [{"F": "-_-"}],
"=)": [{"F": "=)"}],
":/": [{"F": ":/"}],
":>": [{"F": ":>"}],
";-)": [{"F": ";-)"}],
":Y": [{"F": ":Y"}],
":P": [{"F": ":P"}],
":-P": [{"F": ":-P"}],
":3": [{"F": ":3"}],
"=3": [{"F": "=3"}],
"xD": [{"F": "xD"}],
"^_^": [{"F": "^_^"}],
"=]": [{"F": "=]"}],
"=D": [{"F": "=D"}],
"<333": [{"F": "<333"}],
":))": [{"F": ":))"}],
":0": [{"F": ":0"}],
"-__-": [{"F": "-__-"}],
"xDD": [{"F": "xDD"}],
"o_o": [{"F": "o_o"}],
"o_O": [{"F": "o_O"}],
"V_V": [{"F": "V_V"}],
"=[[": [{"F": "=[["}],
"<33": [{"F": "<33"}],
";p": [{"F": ";p"}],
";D": [{"F": ";D"}],
";-p": [{"F": ";-p"}],
";(": [{"F": ";("}],
":p": [{"F": ":p"}],
":]": [{"F": ":]"}],
":O": [{"F": ":O"}],
":-/": [{"F": ":-/"}],
":-)": [{"F": ":-)"}],
":(((": [{"F": ":((("}],
":((": [{"F": ":(("}],
":')": [{"F": ":')"}],
"(^_^)": [{"F": "(^_^)"}],
"(=": [{"F": "(="}],
"o.O": [{"F": "o.O"}],
"\")": [{"F": "\")"}],
"a.": [{"F": "a."}],
"b.": [{"F": "b."}],
"c.": [{"F": "c."}],
"d.": [{"F": "d."}],
"e.": [{"F": "e."}],
"f.": [{"F": "f."}],
"g.": [{"F": "g."}],
"h.": [{"F": "h."}],
"i.": [{"F": "i."}],
"j.": [{"F": "j."}],
"k.": [{"F": "k."}],
"l.": [{"F": "l."}],
"m.": [{"F": "m."}],
"n.": [{"F": "n."}],
"o.": [{"F": "o."}],
"p.": [{"F": "p."}],
"q.": [{"F": "q."}],
"r.": [{"F": "r."}],
"s.": [{"F": "s."}],
"t.": [{"F": "t."}],
"u.": [{"F": "u."}],
"v.": [{"F": "v."}],
"w.": [{"F": "w."}],
"x.": [{"F": "x."}],
"y.": [{"F": "y."}],
"z.": [{"F": "z."}],
"i.e.": [{"F": "i.e."}],
"I.e.": [{"F": "I.e."}],
"I.E.": [{"F": "I.E."}],
"e.g.": [{"F": "e.g."}],
"E.g.": [{"F": "E.g."}],
"E.G.": [{"F": "E.G."}],
"\n": [{"F": "\n", "pos": "SP"}],
"\t": [{"F": "\t", "pos": "SP"}],
" ": [{"F": " ", "pos": "SP"}],
u"\u00a0": [{"F": u"\u00a0", "pos": "SP", "L": " "}]
}
def get_double_contractions(ending):
endings = []
ends_with_contraction = any([ending.endswith(contraction) for contraction in contractions])
while ends_with_contraction:
for contraction in contractions:
if ending.endswith(contraction):
endings.append(contraction)
ending = ending.rstrip(contraction)
ends_with_contraction = any([ending.endswith(contraction) for contraction in contractions])
endings.reverse() # reverse because the last ending is put in the list first
return endings
def get_token_properties(token, capitalize=False, remove_contractions=False):
props = dict(token_properties.get(token)) # ensure we copy the dict so we can add the "F" prop
if capitalize:
token = token.capitalize()
if remove_contractions:
token = token.replace("'", "")
props["F"] = token
return props
def create_entry(token, endings, capitalize=False, remove_contractions=False):
properties = []
properties.append(get_token_properties(token, capitalize=capitalize, remove_contractions=remove_contractions))
for e in endings:
properties.append(get_token_properties(e, remove_contractions=remove_contractions))
return properties
def generate_specials():
specials = {}
for token in starting_tokens:
possible_endings = starting_tokens[token]
for ending in possible_endings:
endings = []
if ending.count("'") > 1:
endings.extend(get_double_contractions(ending))
else:
endings.append(ending)
exceptions = possible_endings[ending]
if "lower" not in exceptions:
special = token + ending
specials[special] = create_entry(token, endings)
if "upper" not in exceptions:
special = token.capitalize() + ending
specials[special] = create_entry(token, endings, capitalize=True)
if "contrLower" not in exceptions:
special = token + ending.replace("'", "")
specials[special] = create_entry(token, endings, remove_contractions=True)
if "contrUpper" not in exceptions:
special = token.capitalize() + ending.replace("'", "")
specials[special] = create_entry(token, endings, capitalize=True, remove_contractions=True)
# add in hardcoded specials
specials = dict(specials, **hardcoded_specials)
return specials
if __name__ == "__main__":
specials = generate_specials()
with open("specials.json", "w") as file_:
file_.write(json.dumps(specials, indent=2))

View File

@ -1,6 +0,0 @@
\.\.\.+
(?<=[a-z])\.(?=[A-Z])
(?<=[a-zA-Z])-(?=[a-zA-z])
(?<=[a-zA-Z])--(?=[a-zA-z])
(?<=[0-9])-(?=[0-9])
(?<=[A-Za-z]),(?=[A-Za-z])

View File

@ -1,59 +0,0 @@
{
"PRP": {
"I": {"L": "-PRON-", "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Nom"},
"me": {"L": "-PRON-", "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Acc"},
"you": {"L": "-PRON-", "PronType": "Prs", "Person": "Two"},
"he": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Case": "Nom"},
"him": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Case": "Acc"},
"she": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Fem", "Case": "Nom"},
"her": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Fem", "Case": "Acc"},
"it": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Neut"},
"we": {"L": "-PRON-", "PronType": "Prs", "Person": "One", "Number": "Plur", "Case": "Nom"},
"us": {"L": "-PRON-", "PronType": "Prs", "Person": "One", "Number": "Plur", "Case": "Acc"},
"they": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Nom"},
"them": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Acc"},
"mine": {"L": "-PRON-", "PronType": "Prs", "Person": "One", "Number": "Sing", "Poss": "Yes", "Reflex": "Yes"},
"yours": {"L": "-PRON-", "PronType": "Prs", "Person": "Two", "Poss": "Yes", "Reflex": "Yes"},
"his": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Poss": "Yes", "Reflex": "Yes"},
"hers": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Fem", "Poss": "Yes", "Reflex": "Yes"},
"its": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Neut", "Poss": "Yes", "Reflex": "Yes"},
"ours": {"L": "-PRON-", "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
"yours": {"L": "-PRON-", "PronType": "Prs", "Person": "Two", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
"theirs": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
"myself": {"L": "-PRON-", "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Acc", "Reflex": "Yes"},
"yourself": {"L": "-PRON-", "PronType": "Prs", "Person": "Two", "Case": "Acc", "Reflex": "Yes"},
"himself": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Sing", "Case": "Acc", "Gender": "Masc", "Reflex": "Yes"},
"herself": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Sing", "Case": "Acc", "Gender": "Fem", "Reflex": "Yes"},
"itself": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Sing", "Case": "Acc", "Gender": "Neut", "Reflex": "Yes"},
"themself": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Sing", "Case": "Acc", "Reflex": "Yes"},
"ourselves": {"L": "-PRON-", "PronType": "Prs", "Person": "One", "Number": "Plur", "Case": "Acc", "Reflex": "Yes"},
"yourselves": {"L": "-PRON-", "PronType": "Prs", "Person": "Two", "Case": "Acc", "Reflex": "Yes"},
"themselves": {"L": "-PRON-", "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Acc", "Reflex": "Yes"}
},
"PRP$": {
"my": {"L": "-PRON-", "Person": "One", "Number": "Sing", "PronType": "Prs", "Poss": "Yes"},
"your": {"L": "-PRON-", "Person": "Two", "PronType": "Prs", "Poss": "Yes"},
"his": {"L": "-PRON-", "Person": "Three", "Number": "Sing", "Gender": "Masc", "PronType": "Prs", "Poss": "Yes"},
"her": {"L": "-PRON-", "Person": "Three", "Number": "Sing", "Gender": "Fem", "PronType": "Prs", "Poss": "Yes"},
"its": {"L": "-PRON-", "Person": "Three", "Number": "Sing", "Gender": "Neut", "PronType": "Prs", "Poss": "Yes"},
"our": {"L": "-PRON-", "Person": "One", "Number": "Plur", "PronType": "Prs", "Poss": "Yes"},
"their": {"L": "-PRON-", "Person": "Three", "Number": "Plur", "PronType": "Prs", "Poss": "Yes"}
},
"VBZ": {
"am": {"L": "be", "VerbForm": "Fin", "Person": "One", "Tense": "Pres", "Mood": "Ind"},
"are": {"L": "be", "VerbForm": "Fin", "Person": "Two", "Tense": "Pres", "Mood": "Ind"},
"is": {"L": "be", "VerbForm": "Fin", "Person": "Three", "Tense": "Pres", "Mood": "Ind"},
},
"VBP": {
"are": {"L": "be", "VerbForm": "Fin", "Tense": "Pres", "Mood": "Ind"}
},
"VBD": {
"was": {"L": "be", "VerbForm": "Fin", "Tense": "Past", "Number": "Sing"},
"were": {"L": "be", "VerbForm": "Fin", "Tense": "Past", "Number": "Plur"}
}
}

View File

@ -1,21 +0,0 @@
,
"
(
[
{
*
<
$
£
'
``
`
#
US$
C$
A$
a-
....
...

File diff suppressed because it is too large Load Diff

View File

@ -1,26 +0,0 @@
,
\"
\)
\]
\}
\*
\!
\?
%
\$
>
:
;
'
''
's
'S
s
S
\.\.
\.\.\.
\.\.\.\.
(?<=[a-z0-9)\]"'%\)])\.
(?<=[0-9])km

View File

@ -1,60 +0,0 @@
{
".": {"pos": "punct", "puncttype": "peri"},
",": {"pos": "punct", "puncttype": "comm"},
"-LRB-": {"pos": "punct", "puncttype": "brck", "punctside": "ini"},
"-RRB-": {"pos": "punct", "puncttype": "brck", "punctside": "fin"},
"``": {"pos": "punct", "puncttype": "quot", "punctside": "ini"},
"\"\"": {"pos": "punct", "puncttype": "quot", "punctside": "fin"},
"''": {"pos": "punct", "puncttype": "quot", "punctside": "fin"},
":": {"pos": "punct"},
"$": {"pos": "sym", "other": {"symtype": "currency"}},
"#": {"pos": "sym", "other": {"symtype": "numbersign"}},
"AFX": {"pos": "adj", "hyph": "hyph"},
"CC": {"pos": "conj", "conjtype": "coor"},
"CD": {"pos": "num", "numtype": "card"},
"DT": {"pos": "det"},
"EX": {"pos": "adv", "advtype": "ex"},
"FW": {"pos": "x", "foreign": "foreign"},
"HYPH": {"pos": "punct", "puncttype": "dash"},
"IN": {"pos": "adp"},
"JJ": {"pos": "adj", "degree": "pos"},
"JJR": {"pos": "adj", "degree": "comp"},
"JJS": {"pos": "adj", "degree": "sup"},
"LS": {"pos": "punct", "numtype": "ord"},
"MD": {"pos": "verb", "verbtype": "mod"},
"NIL": {"pos": ""},
"NN": {"pos": "noun", "number": "sing"},
"NNP": {"pos": "propn", "nountype": "prop", "number": "sing"},
"NNPS": {"pos": "propn", "nountype": "prop", "number": "plur"},
"NNS": {"pos": "noun", "number": "plur"},
"PDT": {"pos": "adj", "adjtype": "pdt", "prontype": "prn"},
"POS": {"pos": "part", "poss": "poss"},
"PRP": {"pos": "pron", "prontype": "prs"},
"PRP$": {"pos": "adj", "prontype": "prs", "poss": "poss"},
"RB": {"pos": "adv", "degree": "pos"},
"RBR": {"pos": "adv", "degree": "comp"},
"RBS": {"pos": "adv", "degree": "sup"},
"RP": {"pos": "part"},
"SYM": {"pos": "sym"},
"TO": {"pos": "part", "parttype": "inf", "verbform": "inf"},
"UH": {"pos": "intJ"},
"VB": {"pos": "verb", "verbform": "inf"},
"VBD": {"pos": "verb", "verbform": "fin", "tense": "past"},
"VBG": {"pos": "verb", "verbform": "part", "tense": "pres", "aspect": "prog"},
"VBN": {"pos": "verb", "verbform": "part", "tense": "past", "aspect": "perf"},
"VBP": {"pos": "verb", "verbform": "fin", "tense": "pres"},
"VBZ": {"pos": "verb", "verbform": "fin", "tense": "pres", "number": "sing", "person": 3},
"WDT": {"pos": "adj", "prontype": "int|rel"},
"WP": {"pos": "noun", "prontype": "int|rel"},
"WP$": {"pos": "adj", "poss": "poss", "prontype": "int|rel"},
"WRB": {"pos": "adv", "prontype": "int|rel"},
"SP": {"pos": "space"},
"ADD": {"pos": "x"},
"NFP": {"pos": "punct"},
"GW": {"pos": "x"},
"AFX": {"pos": "x"},
"HYPH": {"pos": "punct"},
"XX": {"pos": "x"},
"BES": {"pos": "verb"},
"HVS": {"pos": "verb"}
}

View File

@ -1,3 +0,0 @@
\.\.\.
(?<=[a-z])\.(?=[A-Z])
(?<=[a-zA-Z])-(?=[a-zA-z])

View File

@ -1 +0,0 @@
{}

View File

@ -1,21 +0,0 @@
,
"
(
[
{
*
<
$
£
'
``
`
#
US$
C$
A$
a-
....
...

View File

@ -1,3 +0,0 @@
Biografie: Ein Spiel ist ein Theaterstück des Schweizer Schriftstellers Max Frisch, das 1967 entstand und am 1. Februar 1968 im Schauspielhaus Zürich uraufgeführt wurde. 1984 legte Frisch eine überarbeitete Neufassung vor. Das von Frisch als Komödie bezeichnete Stück greift eines seiner zentralen Themen auf: die Möglichkeit oder Unmöglichkeit des Menschen, seine Identität zu verändern.
Mit Biografie: Ein Spiel wandte sich Frisch von der Parabelform seiner Erfolgsstücke Biedermann und die Brandstifter und Andorra ab und postulierte eine „Dramaturgie der Permutation“. Darin sollte nicht, wie im klassischen Theater, Sinn und Schicksal im Mittelpunkt stehen, sondern die Zufälligkeit von Ereignissen und die Möglichkeit ihrer Variation. Dennoch handelt Biografie: Ein Spiel gerade von der Unmöglichkeit seines Protagonisten, seinen Lebenslauf grundlegend zu verändern. Frisch empfand die Wirkung des Stücks im Nachhinein als zu fatalistisch und die Umsetzung seiner theoretischen Absichten als nicht geglückt. Obwohl das Stück 1968 als unpolitisch und nicht zeitgemäß kritisiert wurde und auch später eine geteilte Rezeption erfuhr, gehört es an deutschsprachigen Bühnen zu den häufiger aufgeführten Stücken Frischs.

View File

@ -1,149 +0,0 @@
{
"a.m.": [{"F": "a.m."}],
"p.m.": [{"F": "p.m."}],
"1a.m.": [{"F": "1"}, {"F": "a.m."}],
"2a.m.": [{"F": "2"}, {"F": "a.m."}],
"3a.m.": [{"F": "3"}, {"F": "a.m."}],
"4a.m.": [{"F": "4"}, {"F": "a.m."}],
"5a.m.": [{"F": "5"}, {"F": "a.m."}],
"6a.m.": [{"F": "6"}, {"F": "a.m."}],
"7a.m.": [{"F": "7"}, {"F": "a.m."}],
"8a.m.": [{"F": "8"}, {"F": "a.m."}],
"9a.m.": [{"F": "9"}, {"F": "a.m."}],
"10a.m.": [{"F": "10"}, {"F": "a.m."}],
"11a.m.": [{"F": "11"}, {"F": "a.m."}],
"12a.m.": [{"F": "12"}, {"F": "a.m."}],
"1am": [{"F": "1"}, {"F": "am", "L": "a.m."}],
"2am": [{"F": "2"}, {"F": "am", "L": "a.m."}],
"3am": [{"F": "3"}, {"F": "am", "L": "a.m."}],
"4am": [{"F": "4"}, {"F": "am", "L": "a.m."}],
"5am": [{"F": "5"}, {"F": "am", "L": "a.m."}],
"6am": [{"F": "6"}, {"F": "am", "L": "a.m."}],
"7am": [{"F": "7"}, {"F": "am", "L": "a.m."}],
"8am": [{"F": "8"}, {"F": "am", "L": "a.m."}],
"9am": [{"F": "9"}, {"F": "am", "L": "a.m."}],
"10am": [{"F": "10"}, {"F": "am", "L": "a.m."}],
"11am": [{"F": "11"}, {"F": "am", "L": "a.m."}],
"12am": [{"F": "12"}, {"F": "am", "L": "a.m."}],
"1p.m.": [{"F": "1"}, {"F": "p.m."}],
"2p.m.": [{"F": "2"}, {"F": "p.m."}],
"3p.m.": [{"F": "3"}, {"F": "p.m."}],
"4p.m.": [{"F": "4"}, {"F": "p.m."}],
"5p.m.": [{"F": "5"}, {"F": "p.m."}],
"6p.m.": [{"F": "6"}, {"F": "p.m."}],
"7p.m.": [{"F": "7"}, {"F": "p.m."}],
"8p.m.": [{"F": "8"}, {"F": "p.m."}],
"9p.m.": [{"F": "9"}, {"F": "p.m."}],
"10p.m.": [{"F": "10"}, {"F": "p.m."}],
"11p.m.": [{"F": "11"}, {"F": "p.m."}],
"12p.m.": [{"F": "12"}, {"F": "p.m."}],
"1pm": [{"F": "1"}, {"F": "pm", "L": "p.m."}],
"2pm": [{"F": "2"}, {"F": "pm", "L": "p.m."}],
"3pm": [{"F": "3"}, {"F": "pm", "L": "p.m."}],
"4pm": [{"F": "4"}, {"F": "pm", "L": "p.m."}],
"5pm": [{"F": "5"}, {"F": "pm", "L": "p.m."}],
"6pm": [{"F": "6"}, {"F": "pm", "L": "p.m."}],
"7pm": [{"F": "7"}, {"F": "pm", "L": "p.m."}],
"8pm": [{"F": "8"}, {"F": "pm", "L": "p.m."}],
"9pm": [{"F": "9"}, {"F": "pm", "L": "p.m."}],
"10pm": [{"F": "10"}, {"F": "pm", "L": "p.m."}],
"11pm": [{"F": "11"}, {"F": "pm", "L": "p.m."}],
"12pm": [{"F": "12"}, {"F": "pm", "L": "p.m."}],
"Jan.": [{"F": "Jan.", "L": "Januar"}],
"Feb.": [{"F": "Feb.", "L": "Februar"}],
"Mär.": [{"F": "Mär.", "L": "März"}],
"Apr.": [{"F": "Apr.", "L": "April"}],
"Mai.": [{"F": "Mai.", "L": "Mai"}],
"Jun.": [{"F": "Jun.", "L": "Juni"}],
"Jul.": [{"F": "Jul.", "L": "Juli"}],
"Aug.": [{"F": "Aug.", "L": "August"}],
"Sep.": [{"F": "Sep.", "L": "September"}],
"Sept.": [{"F": "Sept.", "L": "September"}],
"Okt.": [{"F": "Okt.", "L": "Oktober"}],
"Nov.": [{"F": "Nov.", "L": "November"}],
"Dez.": [{"F": "Dez.", "L": "Dezember"}],
":)": [{"F": ":)"}],
"<3": [{"F": "<3"}],
";)": [{"F": ";)"}],
"(:": [{"F": "(:"}],
":(": [{"F": ":("}],
"-_-": [{"F": "-_-"}],
"=)": [{"F": "=)"}],
":/": [{"F": ":/"}],
":>": [{"F": ":>"}],
";-)": [{"F": ";-)"}],
":Y": [{"F": ":Y"}],
":P": [{"F": ":P"}],
":-P": [{"F": ":-P"}],
":3": [{"F": ":3"}],
"=3": [{"F": "=3"}],
"xD": [{"F": "xD"}],
"^_^": [{"F": "^_^"}],
"=]": [{"F": "=]"}],
"=D": [{"F": "=D"}],
"<333": [{"F": "<333"}],
":))": [{"F": ":))"}],
":0": [{"F": ":0"}],
"-__-": [{"F": "-__-"}],
"xDD": [{"F": "xDD"}],
"o_o": [{"F": "o_o"}],
"o_O": [{"F": "o_O"}],
"V_V": [{"F": "V_V"}],
"=[[": [{"F": "=[["}],
"<33": [{"F": "<33"}],
";p": [{"F": ";p"}],
";D": [{"F": ";D"}],
";-p": [{"F": ";-p"}],
";(": [{"F": ";("}],
":p": [{"F": ":p"}],
":]": [{"F": ":]"}],
":O": [{"F": ":O"}],
":-/": [{"F": ":-/"}],
":-)": [{"F": ":-)"}],
":(((": [{"F": ":((("}],
":((": [{"F": ":(("}],
":')": [{"F": ":')"}],
"(^_^)": [{"F": "(^_^)"}],
"(=": [{"F": "(="}],
"o.O": [{"F": "o.O"}],
"\")": [{"F": "\")"}],
"a.": [{"F": "a."}],
"b.": [{"F": "b."}],
"c.": [{"F": "c."}],
"d.": [{"F": "d."}],
"e.": [{"F": "e."}],
"f.": [{"F": "f."}],
"g.": [{"F": "g."}],
"h.": [{"F": "h."}],
"i.": [{"F": "i."}],
"j.": [{"F": "j."}],
"k.": [{"F": "k."}],
"l.": [{"F": "l."}],
"m.": [{"F": "m."}],
"n.": [{"F": "n."}],
"o.": [{"F": "o."}],
"p.": [{"F": "p."}],
"q.": [{"F": "q."}],
"s.": [{"F": "s."}],
"t.": [{"F": "t."}],
"u.": [{"F": "u."}],
"v.": [{"F": "v."}],
"w.": [{"F": "w."}],
"x.": [{"F": "x."}],
"y.": [{"F": "y."}],
"z.": [{"F": "z."}],
"z.b.": [{"F": "z.b."}],
"e.h.": [{"F": "I.e."}],
"o.ä.": [{"F": "I.E."}],
"bzw.": [{"F": "bzw."}],
"usw.": [{"F": "usw."}],
"\n": [{"F": "\n", "pos": "SP"}],
"\t": [{"F": "\t", "pos": "SP"}],
" ": [{"F": " ", "pos": "SP"}]
}

View File

@ -1,26 +0,0 @@
,
\"
\)
\]
\}
\*
\!
\?
%
\$
>
:
;
'
''
's
'S
s
S
\.\.
\.\.\.
\.\.\.\.
(?<=[a-z0-9)\]"'%\)])\.
(?<=[0-9])km

View File

@ -1,19 +0,0 @@
{
"NOUN": {"pos": "NOUN"},
"VERB": {"pos": "VERB"},
"PUNCT": {"pos": "PUNCT"},
"ADV": {"pos": "ADV"},
"ADJ": {"pos": "ADJ"},
"PRON": {"pos": "PRON"},
"PROPN": {"pos": "PROPN"},
"CONJ": {"pos": "CONJ"},
"NUM": {"pos": "NUM"},
"AUX": {"pos": "AUX"},
"SCONJ": {"pos": "SCONJ"},
"ADP": {"pos": "ADP"},
"SYM": {"pos": "SYM"},
"X": {"pos": "X"},
"INTJ": {"pos": "INTJ"},
"DET": {"pos": "DET"},
"PART": {"pos": "PART"}
}

View File

@ -1,3 +0,0 @@
\.\.\.
(?<=[a-z])\.(?=[A-Z])
(?<=[a-zA-Z])-(?=[a-zA-z])

View File

@ -1,21 +0,0 @@
,
"
(
[
{
*
<
$
£
'
``
`
#
US$
C$
A$
a-
....
...

View File

@ -1,149 +0,0 @@
{
"a.m.": [{"F": "a.m."}],
"p.m.": [{"F": "p.m."}],
"1a.m.": [{"F": "1"}, {"F": "a.m."}],
"2a.m.": [{"F": "2"}, {"F": "a.m."}],
"3a.m.": [{"F": "3"}, {"F": "a.m."}],
"4a.m.": [{"F": "4"}, {"F": "a.m."}],
"5a.m.": [{"F": "5"}, {"F": "a.m."}],
"6a.m.": [{"F": "6"}, {"F": "a.m."}],
"7a.m.": [{"F": "7"}, {"F": "a.m."}],
"8a.m.": [{"F": "8"}, {"F": "a.m."}],
"9a.m.": [{"F": "9"}, {"F": "a.m."}],
"10a.m.": [{"F": "10"}, {"F": "a.m."}],
"11a.m.": [{"F": "11"}, {"F": "a.m."}],
"12a.m.": [{"F": "12"}, {"F": "a.m."}],
"1am": [{"F": "1"}, {"F": "am", "L": "a.m."}],
"2am": [{"F": "2"}, {"F": "am", "L": "a.m."}],
"3am": [{"F": "3"}, {"F": "am", "L": "a.m."}],
"4am": [{"F": "4"}, {"F": "am", "L": "a.m."}],
"5am": [{"F": "5"}, {"F": "am", "L": "a.m."}],
"6am": [{"F": "6"}, {"F": "am", "L": "a.m."}],
"7am": [{"F": "7"}, {"F": "am", "L": "a.m."}],
"8am": [{"F": "8"}, {"F": "am", "L": "a.m."}],
"9am": [{"F": "9"}, {"F": "am", "L": "a.m."}],
"10am": [{"F": "10"}, {"F": "am", "L": "a.m."}],
"11am": [{"F": "11"}, {"F": "am", "L": "a.m."}],
"12am": [{"F": "12"}, {"F": "am", "L": "a.m."}],
"1p.m.": [{"F": "1"}, {"F": "p.m."}],
"2p.m.": [{"F": "2"}, {"F": "p.m."}],
"3p.m.": [{"F": "3"}, {"F": "p.m."}],
"4p.m.": [{"F": "4"}, {"F": "p.m."}],
"5p.m.": [{"F": "5"}, {"F": "p.m."}],
"6p.m.": [{"F": "6"}, {"F": "p.m."}],
"7p.m.": [{"F": "7"}, {"F": "p.m."}],
"8p.m.": [{"F": "8"}, {"F": "p.m."}],
"9p.m.": [{"F": "9"}, {"F": "p.m."}],
"10p.m.": [{"F": "10"}, {"F": "p.m."}],
"11p.m.": [{"F": "11"}, {"F": "p.m."}],
"12p.m.": [{"F": "12"}, {"F": "p.m."}],
"1pm": [{"F": "1"}, {"F": "pm", "L": "p.m."}],
"2pm": [{"F": "2"}, {"F": "pm", "L": "p.m."}],
"3pm": [{"F": "3"}, {"F": "pm", "L": "p.m."}],
"4pm": [{"F": "4"}, {"F": "pm", "L": "p.m."}],
"5pm": [{"F": "5"}, {"F": "pm", "L": "p.m."}],
"6pm": [{"F": "6"}, {"F": "pm", "L": "p.m."}],
"7pm": [{"F": "7"}, {"F": "pm", "L": "p.m."}],
"8pm": [{"F": "8"}, {"F": "pm", "L": "p.m."}],
"9pm": [{"F": "9"}, {"F": "pm", "L": "p.m."}],
"10pm": [{"F": "10"}, {"F": "pm", "L": "p.m."}],
"11pm": [{"F": "11"}, {"F": "pm", "L": "p.m."}],
"12pm": [{"F": "12"}, {"F": "pm", "L": "p.m."}],
"Jan.": [{"F": "Jan.", "L": "Januar"}],
"Feb.": [{"F": "Feb.", "L": "Februar"}],
"Mär.": [{"F": "Mär.", "L": "März"}],
"Apr.": [{"F": "Apr.", "L": "April"}],
"Mai.": [{"F": "Mai.", "L": "Mai"}],
"Jun.": [{"F": "Jun.", "L": "Juni"}],
"Jul.": [{"F": "Jul.", "L": "Juli"}],
"Aug.": [{"F": "Aug.", "L": "August"}],
"Sep.": [{"F": "Sep.", "L": "September"}],
"Sept.": [{"F": "Sept.", "L": "September"}],
"Okt.": [{"F": "Okt.", "L": "Oktober"}],
"Nov.": [{"F": "Nov.", "L": "November"}],
"Dez.": [{"F": "Dez.", "L": "Dezember"}],
":)": [{"F": ":)"}],
"<3": [{"F": "<3"}],
";)": [{"F": ";)"}],
"(:": [{"F": "(:"}],
":(": [{"F": ":("}],
"-_-": [{"F": "-_-"}],
"=)": [{"F": "=)"}],
":/": [{"F": ":/"}],
":>": [{"F": ":>"}],
";-)": [{"F": ";-)"}],
":Y": [{"F": ":Y"}],
":P": [{"F": ":P"}],
":-P": [{"F": ":-P"}],
":3": [{"F": ":3"}],
"=3": [{"F": "=3"}],
"xD": [{"F": "xD"}],
"^_^": [{"F": "^_^"}],
"=]": [{"F": "=]"}],
"=D": [{"F": "=D"}],
"<333": [{"F": "<333"}],
":))": [{"F": ":))"}],
":0": [{"F": ":0"}],
"-__-": [{"F": "-__-"}],
"xDD": [{"F": "xDD"}],
"o_o": [{"F": "o_o"}],
"o_O": [{"F": "o_O"}],
"V_V": [{"F": "V_V"}],
"=[[": [{"F": "=[["}],
"<33": [{"F": "<33"}],
";p": [{"F": ";p"}],
";D": [{"F": ";D"}],
";-p": [{"F": ";-p"}],
";(": [{"F": ";("}],
":p": [{"F": ":p"}],
":]": [{"F": ":]"}],
":O": [{"F": ":O"}],
":-/": [{"F": ":-/"}],
":-)": [{"F": ":-)"}],
":(((": [{"F": ":((("}],
":((": [{"F": ":(("}],
":')": [{"F": ":')"}],
"(^_^)": [{"F": "(^_^)"}],
"(=": [{"F": "(="}],
"o.O": [{"F": "o.O"}],
"\")": [{"F": "\")"}],
"a.": [{"F": "a."}],
"b.": [{"F": "b."}],
"c.": [{"F": "c."}],
"d.": [{"F": "d."}],
"e.": [{"F": "e."}],
"f.": [{"F": "f."}],
"g.": [{"F": "g."}],
"h.": [{"F": "h."}],
"i.": [{"F": "i."}],
"j.": [{"F": "j."}],
"k.": [{"F": "k."}],
"l.": [{"F": "l."}],
"m.": [{"F": "m."}],
"n.": [{"F": "n."}],
"o.": [{"F": "o."}],
"p.": [{"F": "p."}],
"q.": [{"F": "q."}],
"s.": [{"F": "s."}],
"t.": [{"F": "t."}],
"u.": [{"F": "u."}],
"v.": [{"F": "v."}],
"w.": [{"F": "w."}],
"x.": [{"F": "x."}],
"y.": [{"F": "y."}],
"z.": [{"F": "z."}],
"z.b.": [{"F": "z.b."}],
"e.h.": [{"F": "I.e."}],
"o.ä.": [{"F": "I.E."}],
"bzw.": [{"F": "bzw."}],
"usw.": [{"F": "usw."}],
"\n": [{"F": "\n", "pos": "SP"}],
"\t": [{"F": "\t", "pos": "SP"}],
" ": [{"F": " ", "pos": "SP"}]
}

View File

@ -1,26 +0,0 @@
,
\"
\)
\]
\}
\*
\!
\?
%
\$
>
:
;
'
''
's
'S
s
S
\.\.
\.\.\.
\.\.\.\.
(?<=[a-z0-9)\]"'%\)])\.
(?<=[0-9])km

View File

@ -1,44 +0,0 @@
{
"S": {"pos": "NOUN"},
"E": {"pos": "ADP"},
"RD": {"pos": "DET"},
"V": {"pos": "VERB"},
"_": {"pos": "NO_TAG"},
"A": {"pos": "ADJ"},
"SP": {"pos": "PROPN"},
"FF": {"pos": "PUNCT"},
"FS": {"pos": "PUNCT"},
"B": {"pos": "ADV"},
"CC": {"pos": "CONJ"},
"FB": {"pos": "PUNCT"},
"VA": {"pos": "AUX"},
"PC": {"pos": "PRON"},
"N": {"pos": "NUM"},
"RI": {"pos": "DET"},
"PR": {"pos": "PRON"},
"CS": {"pos": "SCONJ"},
"BN": {"pos": "ADV"},
"AP": {"pos": "DET"},
"VM": {"pos": "AUX"},
"DI": {"pos": "DET"},
"FC": {"pos": "PUNCT"},
"PI": {"pos": "PRON"},
"DD": {"pos": "DET"},
"DQ": {"pos": "DET"},
"PQ": {"pos": "PRON"},
"PD": {"pos": "PRON"},
"NO": {"pos": "ADJ"},
"PE": {"pos": "PRON"},
"T": {"pos": "DET"},
"X": {"pos": "SYM"},
"SW": {"pos": "X"},
"NO": {"pos": "PRON"},
"I": {"pos": "INTJ"},
"X": {"pos": "X"},
"DR": {"pos": "DET"},
"EA": {"pos": "ADP"},
"PP": {"pos": "PRON"},
"X": {"pos": "NUM"},
"DE": {"pos": "DET"},
"X": {"pos": "PART"}
}

View File

@ -1,194 +0,0 @@
{
"Reddit": [
"PRODUCT",
{},
[
[{"lower": "reddit"}]
]
],
"SeptemberElevenAttacks": [
"EVENT",
{},
[
[
{"orth": "9/11"}
],
[
{"lower": "september"},
{"orth": "11"}
]
]
],
"Linux": [
"PRODUCT",
{},
[
[{"lower": "linux"}]
]
],
"Haskell": [
"PRODUCT",
{},
[
[{"lower": "haskell"}]
]
],
"HaskellCurry": [
"PERSON",
{},
[
[
{"lower": "haskell"},
{"lower": "curry"}
]
]
],
"Javascript": [
"PRODUCT",
{},
[
[{"lower": "javascript"}]
]
],
"CSS": [
"PRODUCT",
{},
[
[{"lower": "css"}],
[{"lower": "css3"}]
]
],
"displaCy": [
"PRODUCT",
{},
[
[{"lower": "displacy"}]
]
],
"spaCy": [
"PRODUCT",
{},
[
[{"orth": "spaCy"}]
]
],
"HTML": [
"PRODUCT",
{},
[
[{"lower": "html"}],
[{"lower": "html5"}]
]
],
"Python": [
"PRODUCT",
{},
[
[{"orth": "Python"}]
]
],
"Ruby": [
"PRODUCT",
{},
[
[{"orth": "Ruby"}]
]
],
"Digg": [
"PRODUCT",
{},
[
[{"lower": "digg"}]
]
],
"FoxNews": [
"ORG",
{},
[
[{"orth": "Fox"}],
[{"orth": "News"}]
]
],
"Google": [
"ORG",
{},
[
[{"lower": "google"}]
]
],
"Mac": [
"PRODUCT",
{},
[
[{"lower": "mac"}]
]
],
"Wikipedia": [
"PRODUCT",
{},
[
[{"lower": "wikipedia"}]
]
],
"Windows": [
"PRODUCT",
{},
[
[{"orth": "Windows"}]
]
],
"Dell": [
"ORG",
{},
[
[{"lower": "dell"}]
]
],
"Facebook": [
"ORG",
{},
[
[{"lower": "facebook"}]
]
],
"Blizzard": [
"ORG",
{},
[
[{"orth": "Blizzard"}]
]
],
"Ubuntu": [
"ORG",
{},
[
[{"orth": "Ubuntu"}]
]
],
"Youtube": [
"PRODUCT",
{},
[
[{"lower": "youtube"}]
]
],
"false_positives": [
null,
{},
[
[{"orth": "Shit"}],
[{"orth": "Weed"}],
[{"orth": "Cool"}],
[{"orth": "Btw"}],
[{"orth": "Bah"}],
[{"orth": "Bullshit"}],
[{"orth": "Lol"}],
[{"orth": "Yo"}, {"lower": "dawg"}],
[{"orth": "Yay"}],
[{"orth": "Ahh"}],
[{"orth": "Yea"}],
[{"orth": "Bah"}]
]
]
}

View File

@ -1,6 +0,0 @@
\.\.\.
(?<=[a-z])\.(?=[A-Z])
(?<=[a-zA-Z])-(?=[a-zA-z])
(?<=[a-zA-Z])--(?=[a-zA-z])
(?<=[0-9])-(?=[0-9])
(?<=[A-Za-z]),(?=[A-Za-z])

View File

@ -1 +0,0 @@
{}

View File

@ -1,21 +0,0 @@
,
"
(
[
{
*
<
$
£
'
``
`
#
US$
C$
A$
a-
....
...

View File

@ -1 +0,0 @@
{}

View File

@ -1,26 +0,0 @@
,
\"
\)
\]
\}
\*
\!
\?
%
\$
>
:
;
'
''
's
'S
s
S
\.\.
\.\.\.
\.\.\.\.
(?<=[a-z0-9)\]"'%\)])\.
(?<=[0-9])km

View File

@ -1,43 +0,0 @@
{
"NR": {"pos": "PROPN"},
"AD": {"pos": "ADV"},
"NN": {"pos": "NOUN"},
"CD": {"pos": "NUM"},
"DEG": {"pos": "PART"},
"PN": {"pos": "PRON"},
"M": {"pos": "PART"},
"JJ": {"pos": "ADJ"},
"DEC": {"pos": "PART"},
"NT": {"pos": "NOUN"},
"DT": {"pos": "DET"},
"LC": {"pos": "PART"},
"CC": {"pos": "CONJ"},
"AS": {"pos": "PART"},
"SP": {"pos": "PART"},
"IJ": {"pos": "INTJ"},
"OD": {"pos": "NUM"},
"MSP": {"pos": "PART"},
"CS": {"pos": "SCONJ"},
"ETC": {"pos": "PART"},
"DEV": {"pos": "PART"},
"BA": {"pos": "AUX"},
"SB": {"pos": "AUX"},
"DER": {"pos": "PART"},
"LB": {"pos": "AUX"},
"P": {"pos": "ADP"},
"URL": {"pos": "SYM"},
"FRAG": {"pos": "X"},
"X": {"pos": "X"},
"ON": {"pos": "X"},
"FW": {"pos": "X"},
"VC": {"pos": "VERB"},
"VV": {"pos": "VERB"},
"VA": {"pos": "VERB"},
"VE": {"pos": "VERB"},
"PU": {"pos": "PUNCT"},
"SP": {"pos": "SPACE"},
"NP": {"pos": "X"},
"_": {"pos": "X"},
"VP": {"pos": "X"},
"CHAR": {"pos": "X"}
}

View File

@ -28,6 +28,9 @@ PACKAGES = [
'spacy.fr',
'spacy.it',
'spacy.pt',
'spacy.nl',
'spacy.sv',
'spacy.language_data',
'spacy.serialize',
'spacy.syntax',
'spacy.munge',
@ -77,6 +80,7 @@ MOD_NAMES = [
'spacy.syntax.ner',
'spacy.symbols',
'spacy.syntax.iterators']
# TODO: This is missing a lot of modules. Does it matter?
COMPILE_OPTIONS = {
@ -92,7 +96,7 @@ LINK_OPTIONS = {
'other' : []
}
# I don't understand this very well yet. See Issue #267
# Fingers crossed!
#if os.environ.get('USE_OPENMP') == '1':

View File

@ -11,6 +11,8 @@ from . import es
from . import it
from . import fr
from . import pt
from . import nl
from . import sv
try:
@ -27,23 +29,14 @@ set_lang_class(fr.French.lang, fr.French)
set_lang_class(it.Italian.lang, it.Italian)
set_lang_class(hu.Hungarian.lang, hu.Hungarian)
set_lang_class(zh.Chinese.lang, zh.Chinese)
set_lang_class(nl.Dutch.lang, nl.Dutch)
set_lang_class(sv.Swedish.lang, sv.Swedish)
def load(name, **overrides):
target_name, target_version = util.split_data_name(name)
data_path = overrides.get('path', util.get_data_path())
if target_name == 'en' and 'add_vectors' not in overrides:
if 'vectors' in overrides:
vec_path = util.match_best_version(overrides['vectors'], None, data_path)
if vec_path is None:
raise IOError(
'Could not load data pack %s from %s' % (overrides['vectors'], data_path))
else:
vec_path = util.match_best_version('en_glove_cc_300_1m_vectors', None, data_path)
if vec_path is not None:
vec_path = vec_path / 'vocab' / 'vec.bin'
overrides['add_vectors'] = lambda vocab: vocab.load_vectors_from_bin_loc(vec_path)
path = util.match_best_version(target_name, target_version, data_path)
cls = get_lang_class(target_name)
return cls(path=path, **overrides)
overrides['path'] = path
return cls(**overrides)

View File

@ -4,7 +4,7 @@
# https://github.com/pypa/warehouse/blob/master/warehouse/__about__.py
__title__ = 'spacy'
__version__ = '1.3.0'
__version__ = '1.4.0'
__summary__ = 'Industrial-strength NLP'
__uri__ = 'https://spacy.io'
__author__ = 'Matthew Honnibal'

View File

@ -87,5 +87,3 @@ cpdef enum attr_id_t:
PROB
LANG

View File

@ -120,8 +120,19 @@ def intify_attrs(stringy_attrs, strings_map=None, _do_deprecated=False):
stringy_attrs.pop('number')
if 'tenspect' in stringy_attrs:
stringy_attrs.pop('tenspect')
# for name, value in morphs.items():
# stringy_attrs[name] = value
morph_keys = [
'PunctType', 'PunctSide', 'Other', 'Degree', 'AdvType', 'Number',
'VerbForm', 'PronType', 'Aspect', 'Tense', 'PartType', 'Poss',
'Hyph', 'ConjType', 'NumType', 'Foreign', 'VerbType', 'NounType',
'Number', 'PronType', 'AdjType', 'Person', 'Variant', 'AdpType',
'Reflex', 'Negative', 'Mood', 'Aspect', 'Case']
for key in morph_keys:
if key in stringy_attrs:
stringy_attrs.pop(key)
elif key.lower() in stringy_attrs:
stringy_attrs.pop(key.lower())
elif key.upper() in stringy_attrs:
stringy_attrs.pop(key.upper())
for name, value in stringy_attrs.items():
if isinstance(name, int):
int_key = name

View File

@ -1,27 +1,22 @@
# encoding: utf8
from __future__ import unicode_literals, print_function
from os import path
from ..language import Language
from ..attrs import LANG
from . import language_data
from .language_data import *
class German(Language):
lang = 'de'
class Defaults(Language.Defaults):
tokenizer_exceptions = dict(language_data.TOKENIZER_EXCEPTIONS)
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: 'de'
prefixes = tuple(language_data.TOKENIZER_PREFIXES)
suffixes = tuple(language_data.TOKENIZER_SUFFIXES)
infixes = tuple(language_data.TOKENIZER_INFIXES)
tag_map = dict(language_data.TAG_MAP)
stop_words = set(language_data.STOP_WORDS)
tokenizer_exceptions = TOKENIZER_EXCEPTIONS
tag_map = TAG_MAP
stop_words = STOP_WORDS

View File

@ -1,3 +0,0 @@
\.\.\.
(?<=[a-z])\.(?=[A-Z])
(?<=[a-zA-Z])-(?=[a-zA-z])

View File

@ -1 +0,0 @@
{}

View File

@ -1,21 +0,0 @@
,
"
(
[
{
*
<
$
£
'
``
`
#
US$
C$
A$
a-
....
...

View File

@ -1 +0,0 @@
{}

View File

@ -1,27 +0,0 @@
,
\"
\)
\]
\}
\*
\!
\?
%
\$
>
:
;
'
''
's
'S
s
S
\.\.
\.\.\.
\.\.\.\.
^\d+\.$
(?<=[a-z0-9)\]"'%\)])\.
(?<=[0-9])km

View File

@ -1 +0,0 @@
{}

View File

@ -1,31 +0,0 @@
{
"noun": [
["s", ""],
["ses", "s"],
["ves", "f"],
["xes", "x"],
["zes", "z"],
["ches", "ch"],
["shes", "sh"],
["men", "man"],
["ies", "y"]
],
"verb": [
["s", ""],
["ies", "y"],
["es", "e"],
["es", ""],
["ed", "e"],
["ed", ""],
["ing", "e"],
["ing", ""]
],
"adj": [
["er", ""],
["est", ""],
["er", "e"],
["est", "e"]
]
}

Binary file not shown.

View File

@ -1 +0,0 @@
-20.000000

File diff suppressed because it is too large Load Diff

View File

@ -1,57 +0,0 @@
{
"$(": {"pos": "PUNCT", "PunctType": "Brck"},
"$,": {"pos": "PUNCT", "PunctType": "Comm"},
"$.": {"pos": "PUNCT", "PunctType": "Peri"},
"ADJA": {"pos": "ADJ"},
"ADJD": {"pos": "ADJ", "Variant": "Short"},
"ADV": {"pos": "ADV"},
"APPO": {"pos": "ADP", "AdpType": "Post"},
"APPR": {"pos": "ADP", "AdpType": "Prep"},
"APPRART": {"pos": "ADP", "AdpType": "Prep", "PronType": "Art"},
"APZR": {"pos": "ADP", "AdpType": "Circ"},
"ART": {"pos": "DET", "PronType": "Art"},
"CARD": {"pos": "NUM", "NumType": "Card"},
"FM": {"pos": "X", "Foreign": "Yes"},
"ITJ": {"pos": "INTJ"},
"KOKOM": {"pos": "CONJ", "ConjType": "Comp"},
"KON": {"pos": "CONJ"},
"KOUI": {"pos": "SCONJ"},
"KOUS": {"pos": "SCONJ"},
"NE": {"pos": "PROPN"},
"NN": {"pos": "NOUN"},
"PAV": {"pos": "ADV", "PronType": "Dem"},
"PDAT": {"pos": "DET", "PronType": "Dem"},
"PDS": {"pos": "PRON", "PronType": "Dem"},
"PIAT": {"pos": "DET", "PronType": "Ind,Neg,Tot"},
"PIDAT": {"pos": "DET", "AdjType": "Pdt", "PronType": "Ind,Neg,Tot"},
"PIS": {"pos": "PRON", "PronType": "Ind,Neg,Tot"},
"PPER": {"pos": "PRON", "PronType": "Prs"},
"PPOSAT": {"pos": "DET", "Poss": "Yes", "PronType": "Prs"},
"PPOSS": {"pos": "PRON", "Poss": "Yes", "PronType": "Prs"},
"PRELAT": {"pos": "DET", "PronType": "Rel"},
"PRELS": {"pos": "PRON", "PronType": "Rel"},
"PRF": {"pos": "PRON", "PronType": "Prs", "Reflex": "Yes"},
"PTKA": {"pos": "PART"},
"PTKANT": {"pos": "PART", "PartType": "Res"},
"PTKNEG": {"pos": "PART", "Negative": "Neg"},
"PTKVZ": {"pos": "PART", "PartType": "Vbp"},
"PTKZU": {"pos": "PART", "PartType": "Inf"},
"PWAT": {"pos": "DET", "PronType": "Int"},
"PWAV": {"pos": "ADV", "PronType": "Int"},
"PWS": {"pos": "PRON", "PronType": "Int"},
"TRUNC": {"pos": "X", "Hyph": "Yes"},
"VAFIN": {"pos": "AUX", "Mood": "Ind", "VerbForm": "Fin"},
"VAIMP": {"pos": "AUX", "Mood": "Imp", "VerbForm": "Fin"},
"VAINF": {"pos": "AUX", "VerbForm": "Inf"},
"VAPP": {"pos": "AUX", "Aspect": "Perf", "VerbForm": "Part"},
"VMFIN": {"pos": "VERB", "Mood": "Ind", "VerbForm": "Fin", "VerbType": "Mod"},
"VMINF": {"pos": "VERB", "VerbForm": "Inf", "VerbType": "Mod"},
"VMPP": {"pos": "VERB", "Aspect": "Perf", "VerbForm": "Part", "VerbType": "Mod"},
"VVFIN": {"pos": "VERB", "Mood": "Ind", "VerbForm": "Fin"},
"VVIMP": {"pos": "VERB", "Mood": "Imp", "VerbForm": "Fin"},
"VVINF": {"pos": "VERB", "VerbForm": "Inf"},
"VVIZU": {"pos": "VERB", "VerbForm": "Inf"},
"VVPP": {"pos": "VERB", "Aspect": "Perf", "VerbForm": "Part"},
"XY": {"pos": "X"},
"SP": {"pos": "SPACE"}
}

File diff suppressed because it is too large Load Diff

81
spacy/de/stop_words.py Normal file
View File

@ -0,0 +1,81 @@
# encoding: utf8
from __future__ import unicode_literals
STOP_WORDS = set("""
á a ab aber ach acht achte achten achter achtes ag alle allein allem allen
aller allerdings alles allgemeinen als also am an andere anderen andern anders
auch auf aus ausser außer ausserdem außerdem
bald bei beide beiden beim beispiel bekannt bereits besonders besser besten bin
bis bisher bist
da dabei dadurch dafür dagegen daher dahin dahinter damals damit danach daneben
dank dann daran darauf daraus darf darfst darin darüber darum darunter das
dasein daselbst dass daß dasselbe davon davor dazu dazwischen dein deine deinem
deiner dem dementsprechend demgegenüber demgemäss demgemäß demselben demzufolge
den denen denn denselben der deren derjenige derjenigen dermassen dermaßen
derselbe derselben des deshalb desselben dessen deswegen dich die diejenige
diejenigen dies diese dieselbe dieselben diesem diesen dieser dieses dir doch
dort drei drin dritte dritten dritter drittes du durch durchaus dürfen dürft
durfte durften
eben ebenso ehrlich eigen eigene eigenen eigener eigenes ein einander eine
einem einen einer eines einigeeinigen einiger einiges einmal einmaleins elf en
ende endlich entweder er erst erste ersten erster erstes es etwa etwas euch
früher fünf fünfte fünften fünfter fünftes für
gab ganz ganze ganzen ganzer ganzes gar gedurft gegen gegenüber gehabt gehen
geht gekannt gekonnt gemacht gemocht gemusst genug gerade gern gesagt geschweige
gewesen gewollt geworden gibt ging gleich gott gross groß grosse große grossen
großen grosser großer grosses großes gut gute guter gutes
habe haben habt hast hat hatte hätte hatten hätten heisst heißt her heute hier
hin hinter hoch
ich ihm ihn ihnen ihr ihre ihrem ihrer ihres im immer in indem infolgedessen
ins irgend ist
ja jahr jahre jahren je jede jedem jeden jeder jedermann jedermanns jedoch
jemand jemandem jemanden jene jenem jenen jener jenes jetzt
kam kann kannst kaum kein keine keinem keinen keiner kleine kleinen kleiner
kleines kommen kommt können könnt konnte könnte konnten kurz
lang lange leicht leider lieber los
machen macht machte mag magst man manche manchem manchen mancher manches mehr
mein meine meinem meinen meiner meines mensch menschen mich mir mit mittel
mochte möchte mochten mögen möglich mögt morgen muss muß müssen musst müsst
musste mussten
na nach nachdem nahm natürlich neben nein neue neuen neun neunte neunten neunter
neuntes nicht nichts nie niemand niemandem niemanden noch nun nur
ob oben oder offen oft ohne
recht rechte rechten rechter rechtes richtig rund
sagt sagte sah satt schlecht schon sechs sechste sechsten sechster sechstes
sehr sei seid seien sein seine seinem seinen seiner seines seit seitdem selbst
selbst sich sie sieben siebente siebenten siebenter siebentes siebte siebten
siebter siebtes sind so solang solche solchem solchen solcher solches soll
sollen sollte sollten sondern sonst sowie später statt
tag tage tagen tat teil tel trotzdem tun
über überhaupt übrigens uhr um und uns unser unsere unserer unter
vergangene vergangenen viel viele vielem vielen vielleicht vier vierte vierten
vierter viertes vom von vor
wahr während währenddem währenddessen wann war wäre waren wart warum was wegen
weil weit weiter weitere weiteren weiteres welche welchem welchen welcher
welches wem wen wenig wenige weniger weniges wenigstens wenn wer werde werden
werdet wessen wie wieder will willst wir wird wirklich wirst wo wohl wollen
wollt wollte wollten worden wurde würde wurden würden
zehn zehnte zehnten zehnter zehntes zeit zu zuerst zugleich zum zunächst zur
zurück zusammen zwanzig zwar zwei zweite zweiten zweiter zweites zwischen
""".split())

65
spacy/de/tag_map.py Normal file
View File

@ -0,0 +1,65 @@
# encoding: utf8
from __future__ import unicode_literals
from ..symbols import *
TAG_MAP = {
"$(": {POS: PUNCT, "PunctType": "brck"},
"$,": {POS: PUNCT, "PunctType": "comm"},
"$.": {POS: PUNCT, "PunctType": "peri"},
"ADJA": {POS: ADJ},
"ADJD": {POS: ADJ, "Variant": "short"},
"ADV": {POS: ADV},
"APPO": {POS: ADP, "AdpType": "post"},
"APPR": {POS: ADP, "AdpType": "prep"},
"APPRART": {POS: ADP, "AdpType": "prep", "PronType": "art"},
"APZR": {POS: ADP, "AdpType": "circ"},
"ART": {POS: DET, "PronType": "art"},
"CARD": {POS: NUM, "NumType": "card"},
"FM": {POS: X, "Foreign": "yes"},
"ITJ": {POS: INTJ},
"KOKOM": {POS: CONJ, "ConjType": "comp"},
"KON": {POS: CONJ},
"KOUI": {POS: SCONJ},
"KOUS": {POS: SCONJ},
"NE": {POS: PROPN},
"NNE": {POS: PROPN},
"NN": {POS: NOUN},
"PAV": {POS: ADV, "PronType": "dem"},
"PROAV": {POS: ADV, "PronType": "dem"},
"PDAT": {POS: DET, "PronType": "dem"},
"PDS": {POS: PRON, "PronType": "dem"},
"PIAT": {POS: DET, "PronType": "ind|neg|tot"},
"PIDAT": {POS: DET, "AdjType": "pdt", "PronType": "ind|neg|tot"},
"PIS": {POS: PRON, "PronType": "ind|neg|tot"},
"PPER": {POS: PRON, "PronType": "prs"},
"PPOSAT": {POS: DET, "Poss": "yes", "PronType": "prs"},
"PPOSS": {POS: PRON, "Poss": "yes", "PronType": "prs"},
"PRELAT": {POS: DET, "PronType": "rel"},
"PRELS": {POS: PRON, "PronType": "rel"},
"PRF": {POS: PRON, "PronType": "prs", "Reflex": "yes"},
"PTKA": {POS: PART},
"PTKANT": {POS: PART, "PartType": "res"},
"PTKNEG": {POS: PART, "Negative": "yes"},
"PTKVZ": {POS: PART, "PartType": "vbp"},
"PTKZU": {POS: PART, "PartType": "inf"},
"PWAT": {POS: DET, "PronType": "int"},
"PWAV": {POS: ADV, "PronType": "int"},
"PWS": {POS: PRON, "PronType": "int"},
"TRUNC": {POS: X, "Hyph": "yes"},
"VAFIN": {POS: AUX, "Mood": "ind", "VerbForm": "fin"},
"VAIMP": {POS: AUX, "Mood": "imp", "VerbForm": "fin"},
"VAINF": {POS: AUX, "VerbForm": "inf"},
"VAPP": {POS: AUX, "Aspect": "perf", "VerbForm": "part"},
"VMFIN": {POS: VERB, "Mood": "ind", "VerbForm": "fin", "VerbType": "mod"},
"VMINF": {POS: VERB, "VerbForm": "inf", "VerbType": "mod"},
"VMPP": {POS: VERB, "Aspect": "perf", "VerbForm": "part", "VerbType": "mod"},
"VVFIN": {POS: VERB, "Mood": "ind", "VerbForm": "fin"},
"VVIMP": {POS: VERB, "Mood": "imp", "VerbForm": "fin"},
"VVINF": {POS: VERB, "VerbForm": "inf"},
"VVIZU": {POS: VERB, "VerbForm": "inf"},
"VVPP": {POS: VERB, "Aspect": "perf", "VerbForm": "part"},
"XY": {POS: X},
"SP": {POS: SPACE}
}

View File

@ -0,0 +1,629 @@
# encoding: utf8
from __future__ import unicode_literals
from ..symbols import *
from ..language_data import PRON_LEMMA
TOKENIZER_EXCEPTIONS = {
"\\n": [
{ORTH: "\\n", LEMMA: "<nl>", TAG: "SP"}
],
"\\t": [
{ORTH: "\\t", LEMMA: "<tab>", TAG: "SP"}
],
"'S": [
{ORTH: "'S", LEMMA: PRON_LEMMA}
],
"'n": [
{ORTH: "'n", LEMMA: "ein"}
],
"'ne": [
{ORTH: "'ne", LEMMA: "eine"}
],
"'nen": [
{ORTH: "'nen", LEMMA: "einen"}
],
"'s": [
{ORTH: "'s", LEMMA: PRON_LEMMA}
],
"Abb.": [
{ORTH: "Abb.", LEMMA: "Abbildung"}
],
"Abk.": [
{ORTH: "Abk.", LEMMA: "Abkürzung"}
],
"Abt.": [
{ORTH: "Abt.", LEMMA: "Abteilung"}
],
"Apr.": [
{ORTH: "Apr.", LEMMA: "April"}
],
"Aug.": [
{ORTH: "Aug.", LEMMA: "August"}
],
"Bd.": [
{ORTH: "Bd.", LEMMA: "Band"}
],
"Betr.": [
{ORTH: "Betr.", LEMMA: "Betreff"}
],
"Bf.": [
{ORTH: "Bf.", LEMMA: "Bahnhof"}
],
"Bhf.": [
{ORTH: "Bhf.", LEMMA: "Bahnhof"}
],
"Bsp.": [
{ORTH: "Bsp.", LEMMA: "Beispiel"}
],
"Dez.": [
{ORTH: "Dez.", LEMMA: "Dezember"}
],
"Di.": [
{ORTH: "Di.", LEMMA: "Dienstag"}
],
"Do.": [
{ORTH: "Do.", LEMMA: "Donnerstag"}
],
"Fa.": [
{ORTH: "Fa.", LEMMA: "Firma"}
],
"Fam.": [
{ORTH: "Fam.", LEMMA: "Familie"}
],
"Feb.": [
{ORTH: "Feb.", LEMMA: "Februar"}
],
"Fr.": [
{ORTH: "Fr.", LEMMA: "Frau"}
],
"Frl.": [
{ORTH: "Frl.", LEMMA: "Fräulein"}
],
"Hbf.": [
{ORTH: "Hbf.", LEMMA: "Hauptbahnhof"}
],
"Hr.": [
{ORTH: "Hr.", LEMMA: "Herr"}
],
"Hrn.": [
{ORTH: "Hrn.", LEMMA: "Herr"}
],
"Jan.": [
{ORTH: "Jan.", LEMMA: "Januar"}
],
"Jh.": [
{ORTH: "Jh.", LEMMA: "Jahrhundert"}
],
"Jhd.": [
{ORTH: "Jhd.", LEMMA: "Jahrhundert"}
],
"Jul.": [
{ORTH: "Jul.", LEMMA: "Juli"}
],
"Jun.": [
{ORTH: "Jun.", LEMMA: "Juni"}
],
"Mi.": [
{ORTH: "Mi.", LEMMA: "Mittwoch"}
],
"Mio.": [
{ORTH: "Mio.", LEMMA: "Million"}
],
"Mo.": [
{ORTH: "Mo.", LEMMA: "Montag"}
],
"Mrd.": [
{ORTH: "Mrd.", LEMMA: "Milliarde"}
],
"Mrz.": [
{ORTH: "Mrz.", LEMMA: "März"}
],
"MwSt.": [
{ORTH: "MwSt.", LEMMA: "Mehrwertsteuer"}
],
"Mär.": [
{ORTH: "Mär.", LEMMA: "März"}
],
"Nov.": [
{ORTH: "Nov.", LEMMA: "November"}
],
"Nr.": [
{ORTH: "Nr.", LEMMA: "Nummer"}
],
"Okt.": [
{ORTH: "Okt.", LEMMA: "Oktober"}
],
"Orig.": [
{ORTH: "Orig.", LEMMA: "Original"}
],
"Pkt.": [
{ORTH: "Pkt.", LEMMA: "Punkt"}
],
"Prof.": [
{ORTH: "Prof.", LEMMA: "Professor"}
],
"Red.": [
{ORTH: "Red.", LEMMA: "Redaktion"}
],
"S'": [
{ORTH: "S'", LEMMA: PRON_LEMMA}
],
"Sa.": [
{ORTH: "Sa.", LEMMA: "Samstag"}
],
"Sep.": [
{ORTH: "Sep.", LEMMA: "September"}
],
"Sept.": [
{ORTH: "Sept.", LEMMA: "September"}
],
"So.": [
{ORTH: "So.", LEMMA: "Sonntag"}
],
"Std.": [
{ORTH: "Std.", LEMMA: "Stunde"}
],
"Str.": [
{ORTH: "Str.", LEMMA: "Straße"}
],
"Tel.": [
{ORTH: "Tel.", LEMMA: "Telefon"}
],
"Tsd.": [
{ORTH: "Tsd.", LEMMA: "Tausend"}
],
"Univ.": [
{ORTH: "Univ.", LEMMA: "Universität"}
],
"abzgl.": [
{ORTH: "abzgl.", LEMMA: "abzüglich"}
],
"allg.": [
{ORTH: "allg.", LEMMA: "allgemein"}
],
"auf'm": [
{ORTH: "auf", LEMMA: "auf"},
{ORTH: "'m", LEMMA: PRON_LEMMA}
],
"bspw.": [
{ORTH: "bspw.", LEMMA: "beispielsweise"}
],
"bzgl.": [
{ORTH: "bzgl.", LEMMA: "bezüglich"}
],
"bzw.": [
{ORTH: "bzw.", LEMMA: "beziehungsweise"}
],
"d.h.": [
{ORTH: "d.h.", LEMMA: "das heißt"}
],
"dgl.": [
{ORTH: "dgl.", LEMMA: "dergleichen"}
],
"du's": [
{ORTH: "du", LEMMA: PRON_LEMMA},
{ORTH: "'s", LEMMA: PRON_LEMMA}
],
"ebd.": [
{ORTH: "ebd.", LEMMA: "ebenda"}
],
"eigtl.": [
{ORTH: "eigtl.", LEMMA: "eigentlich"}
],
"engl.": [
{ORTH: "engl.", LEMMA: "englisch"}
],
"er's": [
{ORTH: "er", LEMMA: PRON_LEMMA},
{ORTH: "'s", LEMMA: PRON_LEMMA}
],
"evtl.": [
{ORTH: "evtl.", LEMMA: "eventuell"}
],
"frz.": [
{ORTH: "frz.", LEMMA: "französisch"}
],
"gegr.": [
{ORTH: "gegr.", LEMMA: "gegründet"}
],
"ggf.": [
{ORTH: "ggf.", LEMMA: "gegebenenfalls"}
],
"ggfs.": [
{ORTH: "ggfs.", LEMMA: "gegebenenfalls"}
],
"ggü.": [
{ORTH: "ggü.", LEMMA: "gegenüber"}
],
"hinter'm": [
{ORTH: "hinter", LEMMA: "hinter"},
{ORTH: "'m", LEMMA: PRON_LEMMA}
],
"i.O.": [
{ORTH: "i.O.", LEMMA: "in Ordnung"}
],
"i.d.R.": [
{ORTH: "i.d.R.", LEMMA: "in der Regel"}
],
"ich's": [
{ORTH: "ich", LEMMA: PRON_LEMMA},
{ORTH: "'s", LEMMA: PRON_LEMMA}
],
"ihr's": [
{ORTH: "ihr", LEMMA: PRON_LEMMA},
{ORTH: "'s", LEMMA: PRON_LEMMA}
],
"incl.": [
{ORTH: "incl.", LEMMA: "inklusive"}
],
"inkl.": [
{ORTH: "inkl.", LEMMA: "inklusive"}
],
"insb.": [
{ORTH: "insb.", LEMMA: "insbesondere"}
],
"kath.": [
{ORTH: "kath.", LEMMA: "katholisch"}
],
"lt.": [
{ORTH: "lt.", LEMMA: "laut"}
],
"max.": [
{ORTH: "max.", LEMMA: "maximal"}
],
"min.": [
{ORTH: "min.", LEMMA: "minimal"}
],
"mind.": [
{ORTH: "mind.", LEMMA: "mindestens"}
],
"mtl.": [
{ORTH: "mtl.", LEMMA: "monatlich"}
],
"n.Chr.": [
{ORTH: "n.Chr.", LEMMA: "nach Christus"}
],
"orig.": [
{ORTH: "orig.", LEMMA: "original"}
],
"röm.": [
{ORTH: "röm.", LEMMA: "römisch"}
],
"s'": [
{ORTH: "s'", LEMMA: PRON_LEMMA}
],
"s.o.": [
{ORTH: "s.o.", LEMMA: "siehe oben"}
],
"sie's": [
{ORTH: "sie", LEMMA: PRON_LEMMA},
{ORTH: "'s", LEMMA: PRON_LEMMA}
],
"sog.": [
{ORTH: "sog.", LEMMA: "so genannt"}
],
"stellv.": [
{ORTH: "stellv.", LEMMA: "stellvertretend"}
],
"tägl.": [
{ORTH: "tägl.", LEMMA: "täglich"}
],
"u.U.": [
{ORTH: "u.U.", LEMMA: "unter Umständen"}
],
"u.s.w.": [
{ORTH: "u.s.w.", LEMMA: "und so weiter"}
],
"u.v.m.": [
{ORTH: "u.v.m.", LEMMA: "und vieles mehr"}
],
"unter'm": [
{ORTH: "unter", LEMMA: "unter"},
{ORTH: "'m", LEMMA: PRON_LEMMA}
],
"usf.": [
{ORTH: "usf.", LEMMA: "und so fort"}
],
"usw.": [
{ORTH: "usw.", LEMMA: "und so weiter"}
],
"uvm.": [
{ORTH: "uvm.", LEMMA: "und vieles mehr"}
],
"v.Chr.": [
{ORTH: "v.Chr.", LEMMA: "vor Christus"}
],
"v.a.": [
{ORTH: "v.a.", LEMMA: "vor allem"}
],
"v.l.n.r.": [
{ORTH: "v.l.n.r.", LEMMA: "von links nach rechts"}
],
"vgl.": [
{ORTH: "vgl.", LEMMA: "vergleiche"}
],
"vllt.": [
{ORTH: "vllt.", LEMMA: "vielleicht"}
],
"vlt.": [
{ORTH: "vlt.", LEMMA: "vielleicht"}
],
"vor'm": [
{ORTH: "vor", LEMMA: "vor"},
{ORTH: "'m", LEMMA: PRON_LEMMA}
],
"wir's": [
{ORTH: "wir", LEMMA: PRON_LEMMA},
{ORTH: "'s", LEMMA: PRON_LEMMA}
],
"z.B.": [
{ORTH: "z.B.", LEMMA: "zum Beispiel"}
],
"z.Bsp.": [
{ORTH: "z.Bsp.", LEMMA: "zum Beispiel"}
],
"z.T.": [
{ORTH: "z.T.", LEMMA: "zum Teil"}
],
"z.Z.": [
{ORTH: "z.Z.", LEMMA: "zur Zeit"}
],
"z.Zt.": [
{ORTH: "z.Zt.", LEMMA: "zur Zeit"}
],
"z.b.": [
{ORTH: "z.b.", LEMMA: "zum Beispiel"}
],
"zzgl.": [
{ORTH: "zzgl.", LEMMA: "zuzüglich"}
],
"österr.": [
{ORTH: "österr.", LEMMA: "österreichisch"}
],
"über'm": [
{ORTH: "über", LEMMA: "über"},
{ORTH: "'m", LEMMA: PRON_LEMMA}
]
}
ORTH_ONLY = [
"'",
"\\\")",
"<space>",
"a.",
"ä.",
"A.C.",
"a.D.",
"A.D.",
"A.G.",
"a.M.",
"a.Z.",
"Abs.",
"adv.",
"al.",
"b.",
"B.A.",
"B.Sc.",
"betr.",
"biol.",
"Biol.",
"c.",
"ca.",
"Chr.",
"Cie.",
"co.",
"Co.",
"d.",
"D.C.",
"Dipl.-Ing.",
"Dipl.",
"Dr.",
"e.",
"e.g.",
"e.V.",
"ehem.",
"entspr.",
"erm.",
"etc.",
"ev.",
"f.",
"g.",
"G.m.b.H.",
"geb.",
"Gebr.",
"gem.",
"h.",
"h.c.",
"Hg.",
"hrsg.",
"Hrsg.",
"i.",
"i.A.",
"i.e.",
"i.G.",
"i.Tr.",
"i.V.",
"Ing.",
"j.",
"jr.",
"Jr.",
"jun.",
"jur.",
"k.",
"K.O.",
"l.",
"L.A.",
"lat.",
"m.",
"M.A.",
"m.E.",
"m.M.",
"M.Sc.",
"Mr.",
"n.",
"N.Y.",
"N.Y.C.",
"nat.",
"ö."
"o.",
"o.a.",
"o.ä.",
"o.g.",
"o.k.",
"O.K.",
"p.",
"p.a.",
"p.s.",
"P.S.",
"pers.",
"phil.",
"q.",
"q.e.d.",
"r.",
"R.I.P.",
"rer.",
"s.",
"sen.",
"St.",
"std.",
"t.",
"u.",
"ü.",
"u.a.",
"U.S.",
"U.S.A.",
"U.S.S.",
"v.",
"Vol.",
"vs.",
"w.",
"wiss.",
"x.",
"y.",
"z.",
]

View File

@ -1,15 +1,18 @@
# encoding: utf8
from __future__ import unicode_literals, print_function
from os import path
from ..util import match_best_version
from ..util import get_data_path
from ..language import Language
from . import language_data
from .. import util
from ..lemmatizer import Lemmatizer
from ..vocab import Vocab
from ..tokenizer import Tokenizer
from ..attrs import LANG
from .language_data import *
class English(Language):
lang = 'en'
@ -18,14 +21,40 @@ class English(Language):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: 'en'
tokenizer_exceptions = dict(language_data.TOKENIZER_EXCEPTIONS)
prefixes = tuple(language_data.TOKENIZER_PREFIXES)
tokenizer_exceptions = TOKENIZER_EXCEPTIONS
tag_map = TAG_MAP
stop_words = STOP_WORDS
lemma_rules = LEMMA_RULES
suffixes = tuple(language_data.TOKENIZER_SUFFIXES)
infixes = tuple(language_data.TOKENIZER_INFIXES)
def __init__(self, **overrides):
# Make a special-case hack for loading the GloVe vectors, to support
# deprecated <1.0 stuff. Phase this out once the data is fixed.
overrides = _fix_deprecated_glove_vectors_loading(overrides)
Language.__init__(self, **overrides)
tag_map = dict(language_data.TAG_MAP)
stop_words = set(language_data.STOP_WORDS)
def _fix_deprecated_glove_vectors_loading(overrides):
if 'data_dir' in overrides and 'path' not in overrides:
raise ValueError("The argument 'data_dir' has been renamed to 'path'")
if overrides.get('path') is False:
return overrides
if overrides.get('path') in (None, True):
data_path = get_data_path()
else:
path = overrides['path']
data_path = path.parent
vec_path = None
if 'add_vectors' not in overrides:
if 'vectors' in overrides:
vec_path = match_best_version(overrides['vectors'], None, data_path)
if vec_path is None:
raise IOError(
'Could not load data pack %s from %s' % (overrides['vectors'], data_path))
else:
vec_path = match_best_version('en_glove_cc_300_1m_vectors', None, data_path)
if vec_path is not None:
vec_path = vec_path / 'vocab' / 'vec.bin'
if vec_path is not None:
overrides['add_vectors'] = lambda vocab: vocab.load_vectors_from_bin_loc(vec_path)
return overrides

File diff suppressed because it is too large Load Diff

View File

@ -1,4 +1,8 @@
{
# encoding: utf8
from __future__ import unicode_literals
LEMMA_RULES = {
"noun": [
["s", ""],
["ses", "s"],

67
spacy/en/morph_rules.py Normal file
View File

@ -0,0 +1,67 @@
# encoding: utf8
from __future__ import unicode_literals
from ..symbols import *
from ..language_data import PRON_LEMMA
MORPH_RULES = {
"PRP": {
"I": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Nom"},
"me": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Acc"},
"you": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two"},
"he": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Case": "Nom"},
"him": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Case": "Acc"},
"she": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Fem", "Case": "Nom"},
"her": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Fem", "Case": "Acc"},
"it": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Neut"},
"we": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Case": "Nom"},
"us": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Case": "Acc"},
"they": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Nom"},
"them": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Acc"},
"mine": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Poss": "Yes", "Reflex": "Yes"},
"yours": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Poss": "Yes", "Reflex": "Yes"},
"his": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Poss": "Yes", "Reflex": "Yes"},
"hers": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Fem", "Poss": "Yes", "Reflex": "Yes"},
"its": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Neut", "Poss": "Yes", "Reflex": "Yes"},
"ours": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
"yours": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
"theirs": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
"myself": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Acc", "Reflex": "Yes"},
"yourself": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Case": "Acc", "Reflex": "Yes"},
"himself": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Case": "Acc", "Gender": "Masc", "Reflex": "Yes"},
"herself": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Case": "Acc", "Gender": "Fem", "Reflex": "Yes"},
"itself": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Case": "Acc", "Gender": "Neut", "Reflex": "Yes"},
"themself": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Case": "Acc", "Reflex": "Yes"},
"ourselves": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Case": "Acc", "Reflex": "Yes"},
"yourselves": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Case": "Acc", "Reflex": "Yes"},
"themselves": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Acc", "Reflex": "Yes"}
},
"PRP$": {
"my": {LEMMA: PRON_LEMMA, "Person": "One", "Number": "Sing", "PronType": "Prs", "Poss": "Yes"},
"your": {LEMMA: PRON_LEMMA, "Person": "Two", "PronType": "Prs", "Poss": "Yes"},
"his": {LEMMA: PRON_LEMMA, "Person": "Three", "Number": "Sing", "Gender": "Masc", "PronType": "Prs", "Poss": "Yes"},
"her": {LEMMA: PRON_LEMMA, "Person": "Three", "Number": "Sing", "Gender": "Fem", "PronType": "Prs", "Poss": "Yes"},
"its": {LEMMA: PRON_LEMMA, "Person": "Three", "Number": "Sing", "Gender": "Neut", "PronType": "Prs", "Poss": "Yes"},
"our": {LEMMA: PRON_LEMMA, "Person": "One", "Number": "Plur", "PronType": "Prs", "Poss": "Yes"},
"their": {LEMMA: PRON_LEMMA, "Person": "Three", "Number": "Plur", "PronType": "Prs", "Poss": "Yes"}
},
"VBZ": {
"am": {LEMMA: "be", "VerbForm": "Fin", "Person": "One", "Tense": "Pres", "Mood": "Ind"},
"are": {LEMMA: "be", "VerbForm": "Fin", "Person": "Two", "Tense": "Pres", "Mood": "Ind"},
"is": {LEMMA: "be", "VerbForm": "Fin", "Person": "Three", "Tense": "Pres", "Mood": "Ind"},
},
"VBP": {
"are": {LEMMA: "be", "VerbForm": "Fin", "Tense": "Pres", "Mood": "Ind"}
},
"VBD": {
"was": {LEMMA: "be", "VerbForm": "Fin", "Tense": "Past", "Number": "Sing"},
"were": {LEMMA: "be", "VerbForm": "Fin", "Tense": "Past", "Number": "Plur"}
}
}

View File

@ -1,47 +0,0 @@
import re
_mw_prepositions = [
'close to',
'down by',
'on the way to',
'on my way to',
'on my way',
'on his way to',
'on his way',
'on her way to',
'on her way',
'on your way to',
'on your way',
'on our way to',
'on our way',
'on their way to',
'on their way',
'along the route from'
]
MW_PREPOSITIONS_RE = re.compile('|'.join(_mw_prepositions), flags=re.IGNORECASE)
TIME_RE = re.compile(
'{colon_digits}|{colon_digits} ?{am_pm}?|{one_two_digits} ?({am_pm})'.format(
colon_digits=r'[0-2]?[0-9]:[0-5][0-9](?::[0-5][0-9])?',
one_two_digits=r'[0-2]?[0-9]',
am_pm=r'[ap]\.?m\.?'))
DATE_RE = re.compile(
'(?:this|last|next|the) (?:week|weekend|{days})'.format(
days='Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday'
))
MONEY_RE = re.compile('\$\d+(?:\.\d+)?|\d+ dollars(?: \d+ cents)?')
DAYS_RE = re.compile('Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday')
REGEXES = [('IN', 'O', MW_PREPOSITIONS_RE), ('CD', 'TIME', TIME_RE),
('NNP', 'DATE', DATE_RE),
('NNP', 'DATE', DAYS_RE), ('CD', 'MONEY', MONEY_RE)]

67
spacy/en/stop_words.py Normal file
View File

@ -0,0 +1,67 @@
# encoding: utf8
from __future__ import unicode_literals
STOP_WORDS = set("""
a about above across after afterwards again against all almost alone along
already also although always am among amongst amount an and another any anyhow
anyone anything anyway anywhere are around as at
back be became because become becomes becoming been before beforehand behind
being below beside besides between beyond both bottom but by
call can cannot ca could
did do does doing done down due during
each eight either eleven else elsewhere empty enough etc even ever every
everyone everything everywhere except
few fifteen fifty first five for former formerly forty four from front full
further
get give go
had has have he hence her here hereafter hereby herein hereupon hers herself
him himself his how however hundred
i if in inc indeed into is it its itself
keep
last latter latterly least less
just
made make many may me meanwhile might mine more moreover most mostly move much
must my myself
name namely neither never nevertheless next nine no nobody none noone nor not
nothing now nowhere
of off often on once one only onto or other others otherwise our ours ourselves
out over own
part per perhaps please put
quite
rather re really regarding
same say see seem seemed seeming seems serious several she should show side
since six sixty so some somehow someone something sometime sometimes somewhere
still such
take ten than that the their them themselves then thence there thereafter
thereby therefore therein thereupon these they third this those though three
through throughout thru thus to together too top toward towards twelve twenty
two
under until up unless upon us used using
various very very via was we well were what whatever when whence whenever where
whereafter whereas whereby wherein whereupon wherever whether which while
whither who whoever whole whom whose why will with within without would
yet you your yours yourself yourselves
""".split())

64
spacy/en/tag_map.py Normal file
View File

@ -0,0 +1,64 @@
# encoding: utf8
from __future__ import unicode_literals
from ..symbols import *
TAG_MAP = {
".": {POS: PUNCT, "PunctType": "peri"},
",": {POS: PUNCT, "PunctType": "comm"},
"-LRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "ini"},
"-RRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "fin"},
"``": {POS: PUNCT, "PunctType": "quot", "PunctSide": "ini"},
"\"\"": {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
"''": {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
":": {POS: PUNCT},
"$": {POS: SYM, "Other": {"SymType": "currency"}},
"#": {POS: SYM, "Other": {"SymType": "numbersign"}},
"AFX": {POS: ADJ, "Hyph": "yes"},
"CC": {POS: CONJ, "ConjType": "coor"},
"CD": {POS: NUM, "NumType": "card"},
"DT": {POS: DET},
"EX": {POS: ADV, "AdvType": "ex"},
"FW": {POS: X, "Foreign": "yes"},
"HYPH": {POS: PUNCT, "PunctType": "dash"},
"IN": {POS: ADP},
"JJ": {POS: ADJ, "Degree": "pos"},
"JJR": {POS: ADJ, "Degree": "comp"},
"JJS": {POS: ADJ, "Degree": "sup"},
"LS": {POS: PUNCT, "NumType": "ord"},
"MD": {POS: VERB, "VerbType": "mod"},
"NIL": {POS: ""},
"NN": {POS: NOUN, "Number": "sing"},
"NNP": {POS: PROPN, "NounType": "prop", "Number": "sing"},
"NNPS": {POS: PROPN, "NounType": "prop", "Number": "plur"},
"NNS": {POS: NOUN, "Number": "plur"},
"PDT": {POS: ADJ, "AdjType": "pdt", "PronType": "prn"},
"POS": {POS: PART, "Poss": "yes"},
"PRP": {POS: PRON, "PronType": "prs"},
"PRP$": {POS: ADJ, "PronType": "prs", "Poss": "yes"},
"RB": {POS: ADV, "Degree": "pos"},
"RBR": {POS: ADV, "Degree": "comp"},
"RBS": {POS: ADV, "Degree": "sup"},
"RP": {POS: PART},
"SYM": {POS: SYM},
"TO": {POS: PART, "PartType": "inf", "VerbForm": "inf"},
"UH": {POS: INTJ},
"VB": {POS: VERB, "VerbForm": "inf"},
"VBD": {POS: VERB, "VerbForm": "fin", "Tense": "past"},
"VBG": {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
"VBN": {POS: VERB, "VerbForm": "part", "Tense": "past", "Aspect": "perf"},
"VBP": {POS: VERB, "VerbForm": "fin", "Tense": "pres"},
"VBZ": {POS: VERB, "VerbForm": "fin", "Tense": "pres", "Number": "sing", "Person": 3},
"WDT": {POS: ADJ, "PronType": "int|rel"},
"WP": {POS: NOUN, "PronType": "int|rel"},
"WP$": {POS: ADJ, "Poss": "yes", "PronType": "int|rel"},
"WRB": {POS: ADV, "PronType": "int|rel"},
"SP": {POS: SPACE},
"ADD": {POS: X},
"NFP": {POS: PUNCT},
"GW": {POS: X},
"XX": {POS: X},
"BES": {POS: VERB},
"HVS": {POS: VERB}
}

File diff suppressed because it is too large Load Diff

View File

@ -1,246 +0,0 @@
import os
import time
import io
import math
import re
try:
from urllib.parse import urlparse
from urllib.request import urlopen, Request
from urllib.error import HTTPError
except ImportError:
from urllib2 import urlopen, urlparse, Request, HTTPError
class UnknownContentLengthException(Exception): pass
class InvalidChecksumException(Exception): pass
class UnsupportedHTTPCodeException(Exception): pass
class InvalidOffsetException(Exception): pass
class MissingChecksumHeader(Exception): pass
CHUNK_SIZE = 16 * 1024
class RateSampler(object):
def __init__(self, period=1):
self.rate = None
self.reset = True
self.period = period
def __enter__(self):
if self.reset:
self.reset = False
self.start = time.time()
self.counter = 0
def __exit__(self, type, value, traceback):
elapsed = time.time() - self.start
if elapsed >= self.period:
self.reset = True
self.rate = float(self.counter) / elapsed
def update(self, value):
self.counter += value
def format(self, unit="MB"):
if self.rate is None:
return None
divisor = {'MB': 1048576, 'kB': 1024}
return "%0.2f%s/s" % (self.rate / divisor[unit], unit)
class TimeEstimator(object):
def __init__(self, cooldown=1):
self.cooldown = cooldown
self.start = time.time()
self.time_left = None
def update(self, bytes_read, total_size):
elapsed = time.time() - self.start
if elapsed > self.cooldown:
self.time_left = math.ceil(elapsed * total_size /
bytes_read - elapsed)
def format(self):
if self.time_left is None:
return None
res = "eta "
if self.time_left / 60 >= 1:
res += "%dm " % (self.time_left / 60)
return res + "%ds" % (self.time_left % 60)
def format_bytes_read(bytes_read, unit="MB"):
divisor = {'MB': 1048576, 'kB': 1024}
return "%0.2f%s" % (float(bytes_read) / divisor[unit], unit)
def format_percent(bytes_read, total_size):
percent = round(bytes_read * 100.0 / total_size, 2)
return "%0.2f%%" % percent
def get_content_range(response):
content_range = response.headers.get('Content-Range', "").strip()
if content_range:
m = re.match(r"bytes (\d+)-(\d+)/(\d+)", content_range)
if m:
return [int(v) for v in m.groups()]
def get_content_length(response):
if 'Content-Length' not in response.headers:
raise UnknownContentLengthException
return int(response.headers.get('Content-Length').strip())
def get_url_meta(url, checksum_header=None):
class HeadRequest(Request):
def get_method(self):
return "HEAD"
r = urlopen(HeadRequest(url))
res = {'size': get_content_length(r)}
if checksum_header:
value = r.headers.get(checksum_header)
if value:
res['checksum'] = value
r.close()
return res
def progress(console, bytes_read, total_size, transfer_rate, eta):
fields = [
format_bytes_read(bytes_read),
format_percent(bytes_read, total_size),
transfer_rate.format(),
eta.format(),
" " * 10,
]
console.write("Downloaded %s\r" % " ".join(filter(None, fields)))
console.flush()
def read_request(request, offset=0, console=None,
progress_func=None, write_func=None):
# support partial downloads
if offset > 0:
request.add_header('Range', "bytes=%s-" % offset)
try:
response = urlopen(request)
except HTTPError as e:
if e.code == 416: # Requested Range Not Satisfiable
raise InvalidOffsetException
# TODO add http error handling here
raise UnsupportedHTTPCodeException(e.code)
total_size = get_content_length(response) + offset
bytes_read = offset
# sanity checks
if response.code == 200: # OK
assert offset == 0
elif response.code == 206: # Partial content
range_start, range_end, range_total = get_content_range(response)
assert range_start == offset
assert range_total == total_size
assert range_end + 1 - range_start == total_size - bytes_read
else:
raise UnsupportedHTTPCodeException(response.code)
eta = TimeEstimator()
transfer_rate = RateSampler()
if console:
if offset > 0:
console.write("Continue downloading...\n")
else:
console.write("Downloading...\n")
while True:
with transfer_rate:
chunk = response.read(CHUNK_SIZE)
if not chunk:
if progress_func and console:
console.write('\n')
break
bytes_read += len(chunk)
transfer_rate.update(len(chunk))
eta.update(bytes_read - offset, total_size - offset)
if progress_func and console:
progress_func(console, bytes_read, total_size, transfer_rate, eta)
if write_func:
write_func(chunk)
response.close()
assert bytes_read == total_size
return response
def download(url, path=".",
checksum=None, checksum_header=None,
headers=None, console=None):
if os.path.isdir(path):
path = os.path.join(path, url.rsplit('/', 1)[1])
path = os.path.abspath(path)
with io.open(path, "a+b") as f:
size = f.tell()
# update checksum of partially downloaded file
if checksum:
f.seek(0, os.SEEK_SET)
for chunk in iter(lambda: f.read(CHUNK_SIZE), b""):
checksum.update(chunk)
def write(chunk):
if checksum:
checksum.update(chunk)
f.write(chunk)
request = Request(url)
# request headers
if headers:
for key, value in headers.items():
request.add_header(key, value)
try:
response = read_request(request,
offset=size,
console=console,
progress_func=progress,
write_func=write)
except InvalidOffsetException:
response = None
if checksum:
if response:
origin_checksum = response.headers.get(checksum_header)
else:
# check whether file is already complete
meta = get_url_meta(url, checksum_header)
origin_checksum = meta.get('checksum')
if origin_checksum is None:
raise MissingChecksumHeader
if checksum.hexdigest() != origin_checksum:
raise InvalidChecksumException
if console:
console.write("checksum/sha256 OK\n")
return path

View File

@ -1,26 +1,20 @@
# encoding: utf8
from __future__ import unicode_literals, print_function
from os import path
from ..language import Language
from ..attrs import LANG
from . import language_data
from .language_data import *
class Spanish(Language):
lang = 'es'
class Defaults(Language.Defaults):
tokenizer_exceptions = dict(language_data.TOKENIZER_EXCEPTIONS)
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: 'es'
prefixes = tuple(language_data.TOKENIZER_PREFIXES)
suffixes = tuple(language_data.TOKENIZER_SUFFIXES)
infixes = tuple(language_data.TOKENIZER_INFIXES)
tag_map = dict(language_data.TAG_MAP)
stop_words = set(language_data.STOP_WORDS)
tokenizer_exceptions = TOKENIZER_EXCEPTIONS
stop_words = STOP_WORDS

View File

@ -1,356 +1,19 @@
# encoding: utf8
from __future__ import unicode_literals
import re
from .. import language_data as base
from ..language_data import update_exc, strings_to_exc
from .stop_words import STOP_WORDS
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS, ORTH_ONLY
STOP_WORDS = set()
TOKENIZER_EXCEPTIONS = dict(TOKENIZER_EXCEPTIONS)
STOP_WORDS = set(STOP_WORDS)
TOKENIZER_PREFIXES = map(re.escape, r'''
,
"
(
[
{
*
<
>
$
£
'
``
`
#
US$
C$
A$
a-
....
...
»
_
§
'''.strip().split('\n'))
update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(ORTH_ONLY))
update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(base.EMOTICONS))
TOKENIZER_SUFFIXES = r'''
,
\"
\)
\]
\}
\*
\!
\?
%
\$
>
:
;
'
«
_
''
's
'S
s
S
°
\.\.
\.\.\.
\.\.\.\.
(?<=[a-zäöüßÖÄÜ)\]"'´«‘’%\)²“”])\.
\-\-
´
(?<=[0-9])km²
(?<=[0-9])
(?<=[0-9])cm²
(?<=[0-9])mm²
(?<=[0-9])km³
(?<=[0-9])
(?<=[0-9])cm³
(?<=[0-9])mm³
(?<=[0-9])ha
(?<=[0-9])km
(?<=[0-9])m
(?<=[0-9])cm
(?<=[0-9])mm
(?<=[0-9])µm
(?<=[0-9])nm
(?<=[0-9])yd
(?<=[0-9])in
(?<=[0-9])ft
(?<=[0-9])kg
(?<=[0-9])g
(?<=[0-9])mg
(?<=[0-9])µg
(?<=[0-9])t
(?<=[0-9])lb
(?<=[0-9])oz
(?<=[0-9])m/s
(?<=[0-9])km/h
(?<=[0-9])mph
(?<=[0-9])°C
(?<=[0-9])°K
(?<=[0-9])°F
(?<=[0-9])hPa
(?<=[0-9])Pa
(?<=[0-9])mbar
(?<=[0-9])mb
(?<=[0-9])T
(?<=[0-9])G
(?<=[0-9])M
(?<=[0-9])K
(?<=[0-9])kb
'''.strip().split('\n')
TOKENIZER_INFIXES = (r'''\.\.\.+ (?<=[a-z])\.(?=[A-Z]) (?<=[a-zA-Z])-(?=[a-zA-z]) '''
r'''(?<=[a-zA-Z])--(?=[a-zA-z]) (?<=[0-9])-(?=[0-9]) '''
r'''(?<=[A-Za-z]),(?=[A-Za-z])''').split()
TOKENIZER_EXCEPTIONS = {
"vs.": [{"F": "vs."}],
"''": [{"F": "''"}],
"": [{"F": "", "L": "--", "pos": "$,"}],
"a.m.": [{"F": "a.m."}],
"p.m.": [{"F": "p.m."}],
"1a.m.": [{"F": "1"}, {"F": "a.m."}],
"2a.m.": [{"F": "2"}, {"F": "a.m."}],
"3a.m.": [{"F": "3"}, {"F": "a.m."}],
"4a.m.": [{"F": "4"}, {"F": "a.m."}],
"5a.m.": [{"F": "5"}, {"F": "a.m."}],
"6a.m.": [{"F": "6"}, {"F": "a.m."}],
"7a.m.": [{"F": "7"}, {"F": "a.m."}],
"8a.m.": [{"F": "8"}, {"F": "a.m."}],
"9a.m.": [{"F": "9"}, {"F": "a.m."}],
"10a.m.": [{"F": "10"}, {"F": "a.m."}],
"11a.m.": [{"F": "11"}, {"F": "a.m."}],
"12a.m.": [{"F": "12"}, {"F": "a.m."}],
"1am": [{"F": "1"}, {"F": "am", "L": "a.m."}],
"2am": [{"F": "2"}, {"F": "am", "L": "a.m."}],
"3am": [{"F": "3"}, {"F": "am", "L": "a.m."}],
"4am": [{"F": "4"}, {"F": "am", "L": "a.m."}],
"5am": [{"F": "5"}, {"F": "am", "L": "a.m."}],
"6am": [{"F": "6"}, {"F": "am", "L": "a.m."}],
"7am": [{"F": "7"}, {"F": "am", "L": "a.m."}],
"8am": [{"F": "8"}, {"F": "am", "L": "a.m."}],
"9am": [{"F": "9"}, {"F": "am", "L": "a.m."}],
"10am": [{"F": "10"}, {"F": "am", "L": "a.m."}],
"11am": [{"F": "11"}, {"F": "am", "L": "a.m."}],
"12am": [{"F": "12"}, {"F": "am", "L": "a.m."}],
"p.m.": [{"F": "p.m."}],
"1p.m.": [{"F": "1"}, {"F": "p.m."}],
"2p.m.": [{"F": "2"}, {"F": "p.m."}],
"3p.m.": [{"F": "3"}, {"F": "p.m."}],
"4p.m.": [{"F": "4"}, {"F": "p.m."}],
"5p.m.": [{"F": "5"}, {"F": "p.m."}],
"6p.m.": [{"F": "6"}, {"F": "p.m."}],
"7p.m.": [{"F": "7"}, {"F": "p.m."}],
"8p.m.": [{"F": "8"}, {"F": "p.m."}],
"9p.m.": [{"F": "9"}, {"F": "p.m."}],
"10p.m.": [{"F": "10"}, {"F": "p.m."}],
"11p.m.": [{"F": "11"}, {"F": "p.m."}],
"12p.m.": [{"F": "12"}, {"F": "p.m."}],
"1pm": [{"F": "1"}, {"F": "pm", "L": "p.m."}],
"2pm": [{"F": "2"}, {"F": "pm", "L": "p.m."}],
"3pm": [{"F": "3"}, {"F": "pm", "L": "p.m."}],
"4pm": [{"F": "4"}, {"F": "pm", "L": "p.m."}],
"5pm": [{"F": "5"}, {"F": "pm", "L": "p.m."}],
"6pm": [{"F": "6"}, {"F": "pm", "L": "p.m."}],
"7pm": [{"F": "7"}, {"F": "pm", "L": "p.m."}],
"8pm": [{"F": "8"}, {"F": "pm", "L": "p.m."}],
"9pm": [{"F": "9"}, {"F": "pm", "L": "p.m."}],
"10pm": [{"F": "10"}, {"F": "pm", "L": "p.m."}],
"11pm": [{"F": "11"}, {"F": "pm", "L": "p.m."}],
"12pm": [{"F": "12"}, {"F": "pm", "L": "p.m."}],
"Ala.": [{"F": "Ala."}],
"Ariz.": [{"F": "Ariz."}],
"Ark.": [{"F": "Ark."}],
"Calif.": [{"F": "Calif."}],
"Colo.": [{"F": "Colo."}],
"Conn.": [{"F": "Conn."}],
"Del.": [{"F": "Del."}],
"D.C.": [{"F": "D.C."}],
"Fla.": [{"F": "Fla."}],
"Ga.": [{"F": "Ga."}],
"Ill.": [{"F": "Ill."}],
"Ind.": [{"F": "Ind."}],
"Kans.": [{"F": "Kans."}],
"Kan.": [{"F": "Kan."}],
"Ky.": [{"F": "Ky."}],
"La.": [{"F": "La."}],
"Md.": [{"F": "Md."}],
"Mass.": [{"F": "Mass."}],
"Mich.": [{"F": "Mich."}],
"Minn.": [{"F": "Minn."}],
"Miss.": [{"F": "Miss."}],
"Mo.": [{"F": "Mo."}],
"Mont.": [{"F": "Mont."}],
"Nebr.": [{"F": "Nebr."}],
"Neb.": [{"F": "Neb."}],
"Nev.": [{"F": "Nev."}],
"N.H.": [{"F": "N.H."}],
"N.J.": [{"F": "N.J."}],
"N.M.": [{"F": "N.M."}],
"N.Y.": [{"F": "N.Y."}],
"N.C.": [{"F": "N.C."}],
"N.D.": [{"F": "N.D."}],
"Okla.": [{"F": "Okla."}],
"Ore.": [{"F": "Ore."}],
"Pa.": [{"F": "Pa."}],
"Tenn.": [{"F": "Tenn."}],
"Va.": [{"F": "Va."}],
"Wash.": [{"F": "Wash."}],
"Wis.": [{"F": "Wis."}],
":)": [{"F": ":)"}],
"<3": [{"F": "<3"}],
";)": [{"F": ";)"}],
"(:": [{"F": "(:"}],
":(": [{"F": ":("}],
"-_-": [{"F": "-_-"}],
"=)": [{"F": "=)"}],
":/": [{"F": ":/"}],
":>": [{"F": ":>"}],
";-)": [{"F": ";-)"}],
":Y": [{"F": ":Y"}],
":P": [{"F": ":P"}],
":-P": [{"F": ":-P"}],
":3": [{"F": ":3"}],
"=3": [{"F": "=3"}],
"xD": [{"F": "xD"}],
"^_^": [{"F": "^_^"}],
"=]": [{"F": "=]"}],
"=D": [{"F": "=D"}],
"<333": [{"F": "<333"}],
":))": [{"F": ":))"}],
":0": [{"F": ":0"}],
"-__-": [{"F": "-__-"}],
"xDD": [{"F": "xDD"}],
"o_o": [{"F": "o_o"}],
"o_O": [{"F": "o_O"}],
"V_V": [{"F": "V_V"}],
"=[[": [{"F": "=[["}],
"<33": [{"F": "<33"}],
";p": [{"F": ";p"}],
";D": [{"F": ";D"}],
";-p": [{"F": ";-p"}],
";(": [{"F": ";("}],
":p": [{"F": ":p"}],
":]": [{"F": ":]"}],
":O": [{"F": ":O"}],
":-/": [{"F": ":-/"}],
":-)": [{"F": ":-)"}],
":(((": [{"F": ":((("}],
":((": [{"F": ":(("}],
":')": [{"F": ":')"}],
"(^_^)": [{"F": "(^_^)"}],
"(=": [{"F": "(="}],
"o.O": [{"F": "o.O"}],
"\")": [{"F": "\")"}],
"a.": [{"F": "a."}],
"b.": [{"F": "b."}],
"c.": [{"F": "c."}],
"d.": [{"F": "d."}],
"e.": [{"F": "e."}],
"f.": [{"F": "f."}],
"g.": [{"F": "g."}],
"h.": [{"F": "h."}],
"i.": [{"F": "i."}],
"j.": [{"F": "j."}],
"k.": [{"F": "k."}],
"l.": [{"F": "l."}],
"m.": [{"F": "m."}],
"n.": [{"F": "n."}],
"o.": [{"F": "o."}],
"p.": [{"F": "p."}],
"q.": [{"F": "q."}],
"r.": [{"F": "r."}],
"s.": [{"F": "s."}],
"t.": [{"F": "t."}],
"u.": [{"F": "u."}],
"v.": [{"F": "v."}],
"w.": [{"F": "w."}],
"x.": [{"F": "x."}],
"y.": [{"F": "y."}],
"z.": [{"F": "z."}],
}
TAG_MAP = {
"$(": {"pos": "PUNCT", "PunctType": "Brck"},
"$,": {"pos": "PUNCT", "PunctType": "Comm"},
"$.": {"pos": "PUNCT", "PunctType": "Peri"},
"ADJA": {"pos": "ADJ"},
"ADJD": {"pos": "ADJ", "Variant": "Short"},
"ADV": {"pos": "ADV"},
"APPO": {"pos": "ADP", "AdpType": "Post"},
"APPR": {"pos": "ADP", "AdpType": "Prep"},
"APPRART": {"pos": "ADP", "AdpType": "Prep", "PronType": "Art"},
"APZR": {"pos": "ADP", "AdpType": "Circ"},
"ART": {"pos": "DET", "PronType": "Art"},
"CARD": {"pos": "NUM", "NumType": "Card"},
"FM": {"pos": "X", "Foreign": "Yes"},
"ITJ": {"pos": "INTJ"},
"KOKOM": {"pos": "CONJ", "ConjType": "Comp"},
"KON": {"pos": "CONJ"},
"KOUI": {"pos": "SCONJ"},
"KOUS": {"pos": "SCONJ"},
"NE": {"pos": "PROPN"},
"NNE": {"pos": "PROPN"},
"NN": {"pos": "NOUN"},
"PAV": {"pos": "ADV", "PronType": "Dem"},
"PROAV": {"pos": "ADV", "PronType": "Dem"},
"PDAT": {"pos": "DET", "PronType": "Dem"},
"PDS": {"pos": "PRON", "PronType": "Dem"},
"PIAT": {"pos": "DET", "PronType": "Ind,Neg,Tot"},
"PIDAT": {"pos": "DET", "AdjType": "Pdt", "PronType": "Ind,Neg,Tot"},
"PIS": {"pos": "PRON", "PronType": "Ind,Neg,Tot"},
"PPER": {"pos": "PRON", "PronType": "Prs"},
"PPOSAT": {"pos": "DET", "Poss": "Yes", "PronType": "Prs"},
"PPOSS": {"pos": "PRON", "Poss": "Yes", "PronType": "Prs"},
"PRELAT": {"pos": "DET", "PronType": "Rel"},
"PRELS": {"pos": "PRON", "PronType": "Rel"},
"PRF": {"pos": "PRON", "PronType": "Prs", "Reflex": "Yes"},
"PTKA": {"pos": "PART"},
"PTKANT": {"pos": "PART", "PartType": "Res"},
"PTKNEG": {"pos": "PART", "Negative": "Neg"},
"PTKVZ": {"pos": "PART", "PartType": "Vbp"},
"PTKZU": {"pos": "PART", "PartType": "Inf"},
"PWAT": {"pos": "DET", "PronType": "Int"},
"PWAV": {"pos": "ADV", "PronType": "Int"},
"PWS": {"pos": "PRON", "PronType": "Int"},
"TRUNC": {"pos": "X", "Hyph": "Yes"},
"VAFIN": {"pos": "AUX", "Mood": "Ind", "VerbForm": "Fin"},
"VAIMP": {"pos": "AUX", "Mood": "Imp", "VerbForm": "Fin"},
"VAINF": {"pos": "AUX", "VerbForm": "Inf"},
"VAPP": {"pos": "AUX", "Aspect": "Perf", "VerbForm": "Part"},
"VMFIN": {"pos": "VERB", "Mood": "Ind", "VerbForm": "Fin", "VerbType": "Mod"},
"VMINF": {"pos": "VERB", "VerbForm": "Inf", "VerbType": "Mod"},
"VMPP": {"pos": "VERB", "Aspect": "Perf", "VerbForm": "Part", "VerbType": "Mod"},
"VVFIN": {"pos": "VERB", "Mood": "Ind", "VerbForm": "Fin"},
"VVIMP": {"pos": "VERB", "Mood": "Imp", "VerbForm": "Fin"},
"VVINF": {"pos": "VERB", "VerbForm": "Inf"},
"VVIZU": {"pos": "VERB", "VerbForm": "Inf"},
"VVPP": {"pos": "VERB", "Aspect": "Perf", "VerbForm": "Part"},
"XY": {"pos": "X"},
"SP": {"pos": "SPACE"}
}
__all__ = ["TOKENIZER_EXCEPTIONS", "STOP_WORDS"]

84
spacy/es/stop_words.py Normal file
View File

@ -0,0 +1,84 @@
# encoding: utf8
from __future__ import unicode_literals
STOP_WORDS = set("""
actualmente acuerdo adelante ademas además adrede afirmó agregó ahi ahora ahí
al algo alguna algunas alguno algunos algún alli allí alrededor ambos ampleamos
antano antaño ante anterior antes apenas aproximadamente aquel aquella aquellas
aquello aquellos aqui aquél aquélla aquéllas aquéllos aquí arriba arribaabajo
aseguró asi así atras aun aunque ayer añadió aún
bajo bastante bien breve buen buena buenas bueno buenos
cada casi cerca cierta ciertas cierto ciertos cinco claro comentó como con
conmigo conocer conseguimos conseguir considera consideró consigo consigue
consiguen consigues contigo contra cosas creo cual cuales cualquier cuando
cuanta cuantas cuanto cuantos cuatro cuenta cuál cuáles cuándo cuánta cuántas
cuánto cuántos cómo
da dado dan dar de debajo debe deben debido decir dejó del delante demasiado
demás dentro deprisa desde despacio despues después detras detrás dia dias dice
dicen dicho dieron diferente diferentes dijeron dijo dio donde dos durante día
días dónde
ejemplo el ella ellas ello ellos embargo empleais emplean emplear empleas
empleo en encima encuentra enfrente enseguida entonces entre era eramos eran
eras eres es esa esas ese eso esos esta estaba estaban estado estados estais
estamos estan estar estará estas este esto estos estoy estuvo está están ex
excepto existe existen explicó expresó él ésa ésas ése ésos ésta éstas éste
éstos
fin final fue fuera fueron fui fuimos
general gran grandes gueno
ha haber habia habla hablan habrá había habían hace haceis hacemos hacen hacer
hacerlo haces hacia haciendo hago han hasta hay haya he hecho hemos hicieron
hizo horas hoy hubo
igual incluso indicó informo informó intenta intentais intentamos intentan
intentar intentas intento ir
junto
la lado largo las le lejos les llegó lleva llevar lo los luego lugar
mal manera manifestó mas mayor me mediante medio mejor mencionó menos menudo mi
mia mias mientras mio mios mis misma mismas mismo mismos modo momento mucha
muchas mucho muchos muy más mía mías mío míos
nada nadie ni ninguna ningunas ninguno ningunos ningún no nos nosotras nosotros
nuestra nuestras nuestro nuestros nueva nuevas nuevo nuevos nunca
ocho os otra otras otro otros
pais para parece parte partir pasada pasado paìs peor pero pesar poca pocas
poco pocos podeis podemos poder podria podriais podriamos podrian podrias podrá
podrán podría podrían poner por porque posible primer primera primero primeros
principalmente pronto propia propias propio propios proximo próximo próximos
pudo pueda puede pueden puedo pues
qeu que quedó queremos quien quienes quiere quiza quizas quizá quizás quién quiénes qué
raras realizado realizar realizó repente respecto
sabe sabeis sabemos saben saber sabes salvo se sea sean segun segunda segundo
según seis ser sera será serán sería señaló si sido siempre siendo siete sigue
siguiente sin sino sobre sois sola solamente solas solo solos somos son soy
soyos su supuesto sus suya suyas suyo sólo
tal tambien también tampoco tan tanto tarde te temprano tendrá tendrán teneis
tenemos tener tenga tengo tenido tenía tercera ti tiempo tiene tienen toda
todas todavia todavía todo todos total trabaja trabajais trabajamos trabajan
trabajar trabajas trabajo tras trata través tres tu tus tuvo tuya tuyas tuyo
tuyos
ultimo un una unas uno unos usa usais usamos usan usar usas uso usted ustedes
última últimas último últimos
va vais valor vamos van varias varios vaya veces ver verdad verdadera verdadero
vez vosotras vosotros voy vuestra vuestras vuestro vuestros
ya yo
""".split())

View File

@ -0,0 +1,318 @@
# encoding: utf8
from __future__ import unicode_literals
from ..symbols import *
from ..language_data import PRON_LEMMA
TOKENIZER_EXCEPTIONS = {
"accidentarse": [
{ORTH: "accidentar", LEMMA: "accidentar", POS: AUX},
{ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
],
"aceptarlo": [
{ORTH: "aceptar", LEMMA: "aceptar", POS: AUX},
{ORTH: "lo", LEMMA: PRON_LEMMA, POS: PRON}
],
"acompañarla": [
{ORTH: "acompañar", LEMMA: "acompañar", POS: AUX},
{ORTH: "la", LEMMA: PRON_LEMMA, POS: PRON}
],
"advertirle": [
{ORTH: "advertir", LEMMA: "advertir", POS: AUX},
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
],
"al": [
{ORTH: "a", LEMMA: "a", POS: ADP},
{ORTH: "el", LEMMA: "el", POS: DET}
],
"anunciarnos": [
{ORTH: "anunciar", LEMMA: "anunciar", POS: AUX},
{ORTH: "nos", LEMMA: PRON_LEMMA, POS: PRON}
],
"asegurándole": [
{ORTH: "asegurando", LEMMA: "asegurar", POS: AUX},
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
],
"considerarle": [
{ORTH: "considerar", LEMMA: "considerar", POS: AUX},
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
],
"decirle": [
{ORTH: "decir", LEMMA: "decir", POS: AUX},
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
],
"decirles": [
{ORTH: "decir", LEMMA: "decir", POS: AUX},
{ORTH: "les", LEMMA: PRON_LEMMA, POS: PRON}
],
"decirte": [
{ORTH: "Decir", LEMMA: "decir", POS: AUX},
{ORTH: "te", LEMMA: PRON_LEMMA, POS: PRON}
],
"dejarla": [
{ORTH: "dejar", LEMMA: "dejar", POS: AUX},
{ORTH: "la", LEMMA: PRON_LEMMA, POS: PRON}
],
"dejarnos": [
{ORTH: "dejar", LEMMA: "dejar", POS: AUX},
{ORTH: "nos", LEMMA: PRON_LEMMA, POS: PRON}
],
"dejándole": [
{ORTH: "dejando", LEMMA: "dejar", POS: AUX},
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
],
"del": [
{ORTH: "de", LEMMA: "de", POS: ADP},
{ORTH: "el", LEMMA: "el", POS: DET}
],
"demostrarles": [
{ORTH: "demostrar", LEMMA: "demostrar", POS: AUX},
{ORTH: "les", LEMMA: PRON_LEMMA, POS: PRON}
],
"diciéndole": [
{ORTH: "diciendo", LEMMA: "decir", POS: AUX},
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
],
"diciéndoles": [
{ORTH: "diciendo", LEMMA: "decir", POS: AUX},
{ORTH: "les", LEMMA: PRON_LEMMA, POS: PRON}
],
"diferenciarse": [
{ORTH: "diferenciar", LEMMA: "diferenciar", POS: AUX},
{ORTH: "se", LEMMA: "él", POS: PRON}
],
"divirtiéndome": [
{ORTH: "divirtiendo", LEMMA: "divertir", POS: AUX},
{ORTH: "me", LEMMA: PRON_LEMMA, POS: PRON}
],
"ensanchándose": [
{ORTH: "ensanchando", LEMMA: "ensanchar", POS: AUX},
{ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
],
"explicarles": [
{ORTH: "explicar", LEMMA: "explicar", POS: AUX},
{ORTH: "les", LEMMA: PRON_LEMMA, POS: PRON}
],
"haberla": [
{ORTH: "haber", LEMMA: "haber", POS: AUX},
{ORTH: "la", LEMMA: PRON_LEMMA, POS: PRON}
],
"haberlas": [
{ORTH: "haber", LEMMA: "haber", POS: AUX},
{ORTH: "las", LEMMA: PRON_LEMMA, POS: PRON}
],
"haberlo": [
{ORTH: "haber", LEMMA: "haber", POS: AUX},
{ORTH: "lo", LEMMA: PRON_LEMMA, POS: PRON}
],
"haberlos": [
{ORTH: "haber", LEMMA: "haber", POS: AUX},
{ORTH: "los", LEMMA: PRON_LEMMA, POS: PRON}
],
"haberme": [
{ORTH: "haber", LEMMA: "haber", POS: AUX},
{ORTH: "me", LEMMA: PRON_LEMMA, POS: PRON}
],
"haberse": [
{ORTH: "haber", LEMMA: "haber", POS: AUX},
{ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
],
"hacerle": [
{ORTH: "hacer", LEMMA: "hacer", POS: AUX},
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
],
"hacerles": [
{ORTH: "hacer", LEMMA: "hacer", POS: AUX},
{ORTH: "les", LEMMA: PRON_LEMMA, POS: PRON}
],
"hallarse": [
{ORTH: "hallar", LEMMA: "hallar", POS: AUX},
{ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
],
"imaginaros": [
{ORTH: "imaginar", LEMMA: "imaginar", POS: AUX},
{ORTH: "os", LEMMA: PRON_LEMMA, POS: PRON}
],
"insinuarle": [
{ORTH: "insinuar", LEMMA: "insinuar", POS: AUX},
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
],
"justificarla": [
{ORTH: "justificar", LEMMA: "justificar", POS: AUX},
{ORTH: "la", LEMMA: PRON_LEMMA, POS: PRON}
],
"mantenerlas": [
{ORTH: "mantener", LEMMA: "mantener", POS: AUX},
{ORTH: "las", LEMMA: PRON_LEMMA, POS: PRON}
],
"mantenerlos": [
{ORTH: "mantener", LEMMA: "mantener", POS: AUX},
{ORTH: "los", LEMMA: PRON_LEMMA, POS: PRON}
],
"mantenerme": [
{ORTH: "mantener", LEMMA: "mantener", POS: AUX},
{ORTH: "me", LEMMA: PRON_LEMMA, POS: PRON}
],
"pasarte": [
{ORTH: "pasar", LEMMA: "pasar", POS: AUX},
{ORTH: "te", LEMMA: PRON_LEMMA, POS: PRON}
],
"pedirle": [
{ORTH: "pedir", LEMMA: "pedir", POS: AUX},
{ORTH: "le", LEMMA: "él", POS: PRON}
],
"pel": [
{ORTH: "per", LEMMA: "per", POS: ADP},
{ORTH: "el", LEMMA: "el", POS: DET}
],
"pidiéndonos": [
{ORTH: "pidiendo", LEMMA: "pedir", POS: AUX},
{ORTH: "nos", LEMMA: PRON_LEMMA, POS: PRON}
],
"poderle": [
{ORTH: "poder", LEMMA: "poder", POS: AUX},
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
],
"preguntarse": [
{ORTH: "preguntar", LEMMA: "preguntar", POS: AUX},
{ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
],
"preguntándose": [
{ORTH: "preguntando", LEMMA: "preguntar", POS: AUX},
{ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
],
"presentarla": [
{ORTH: "presentar", LEMMA: "presentar", POS: AUX},
{ORTH: "la", LEMMA: PRON_LEMMA, POS: PRON}
],
"pudiéndolo": [
{ORTH: "pudiendo", LEMMA: "poder", POS: AUX},
{ORTH: "lo", LEMMA: PRON_LEMMA, POS: PRON}
],
"pudiéndose": [
{ORTH: "pudiendo", LEMMA: "poder", POS: AUX},
{ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
],
"quererle": [
{ORTH: "querer", LEMMA: "querer", POS: AUX},
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
],
"rasgarse": [
{ORTH: "Rasgar", LEMMA: "rasgar", POS: AUX},
{ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
],
"repetirlo": [
{ORTH: "repetir", LEMMA: "repetir", POS: AUX},
{ORTH: "lo", LEMMA: PRON_LEMMA, POS: PRON}
],
"robarle": [
{ORTH: "robar", LEMMA: "robar", POS: AUX},
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
],
"seguirlos": [
{ORTH: "seguir", LEMMA: "seguir", POS: AUX},
{ORTH: "los", LEMMA: PRON_LEMMA, POS: PRON}
],
"serle": [
{ORTH: "ser", LEMMA: "ser", POS: AUX},
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
],
"serlo": [
{ORTH: "ser", LEMMA: "ser", POS: AUX},
{ORTH: "lo", LEMMA: PRON_LEMMA, POS: PRON}
],
"señalándole": [
{ORTH: "señalando", LEMMA: "señalar", POS: AUX},
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
],
"suplicarle": [
{ORTH: "suplicar", LEMMA: "suplicar", POS: AUX},
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
],
"tenerlos": [
{ORTH: "tener", LEMMA: "tener", POS: AUX},
{ORTH: "los", LEMMA: PRON_LEMMA, POS: PRON}
],
"vengarse": [
{ORTH: "vengar", LEMMA: "vengar", POS: AUX},
{ORTH: "se", LEMMA: PRON_LEMMA, POS: PRON}
],
"verla": [
{ORTH: "ver", LEMMA: "ver", POS: AUX},
{ORTH: "la", LEMMA: PRON_LEMMA, POS: PRON}
],
"verle": [
{ORTH: "ver", LEMMA: "ver", POS: AUX},
{ORTH: "le", LEMMA: PRON_LEMMA, POS: PRON}
],
"volverlo": [
{ORTH: "volver", LEMMA: "volver", POS: AUX},
{ORTH: "lo", LEMMA: PRON_LEMMA, POS: PRON}
]
}
ORTH_ONLY = [
]

View File

@ -1,9 +0,0 @@
from __future__ import unicode_literals, print_function
from os import path
from ..language import Language
class Finnish(Language):
pass

View File

@ -1,27 +1,20 @@
# encoding: utf8
from __future__ import unicode_literals, print_function
from os import path
from ..language import Language
from ..attrs import LANG
from . import language_data
from .language_data import *
class French(Language):
lang = 'fr'
class Defaults(Language.Defaults):
tokenizer_exceptions = dict(language_data.TOKENIZER_EXCEPTIONS)
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: 'fr'
prefixes = tuple(language_data.TOKENIZER_PREFIXES)
suffixes = tuple(language_data.TOKENIZER_SUFFIXES)
infixes = tuple(language_data.TOKENIZER_INFIXES)
tag_map = dict(language_data.TAG_MAP)
stop_words = set(language_data.STOP_WORDS)
tokenizer_exceptions = TOKENIZER_EXCEPTIONS
stop_words = STOP_WORDS

View File

@ -1,356 +1,14 @@
# encoding: utf8
from __future__ import unicode_literals
import re
from .. import language_data as base
from ..language_data import strings_to_exc
from .stop_words import STOP_WORDS
STOP_WORDS = set()
TOKENIZER_EXCEPTIONS = strings_to_exc(base.EMOTICONS)
STOP_WORDS = set(STOP_WORDS)
TOKENIZER_PREFIXES = map(re.escape, r'''
,
"
(
[
{
*
<
>
$
£
'
``
`
#
US$
C$
A$
a-
....
...
»
_
§
'''.strip().split('\n'))
TOKENIZER_SUFFIXES = r'''
,
\"
\)
\]
\}
\*
\!
\?
%
\$
>
:
;
'
«
_
''
's
'S
s
S
°
\.\.
\.\.\.
\.\.\.\.
(?<=[a-zäöüßÖÄÜ)\]"'´«‘’%\)²“”])\.
\-\-
´
(?<=[0-9])km²
(?<=[0-9])
(?<=[0-9])cm²
(?<=[0-9])mm²
(?<=[0-9])km³
(?<=[0-9])
(?<=[0-9])cm³
(?<=[0-9])mm³
(?<=[0-9])ha
(?<=[0-9])km
(?<=[0-9])m
(?<=[0-9])cm
(?<=[0-9])mm
(?<=[0-9])µm
(?<=[0-9])nm
(?<=[0-9])yd
(?<=[0-9])in
(?<=[0-9])ft
(?<=[0-9])kg
(?<=[0-9])g
(?<=[0-9])mg
(?<=[0-9])µg
(?<=[0-9])t
(?<=[0-9])lb
(?<=[0-9])oz
(?<=[0-9])m/s
(?<=[0-9])km/h
(?<=[0-9])mph
(?<=[0-9])°C
(?<=[0-9])°K
(?<=[0-9])°F
(?<=[0-9])hPa
(?<=[0-9])Pa
(?<=[0-9])mbar
(?<=[0-9])mb
(?<=[0-9])T
(?<=[0-9])G
(?<=[0-9])M
(?<=[0-9])K
(?<=[0-9])kb
'''.strip().split('\n')
TOKENIZER_INFIXES = (r'''\.\.\.+ (?<=[a-z])\.(?=[A-Z]) (?<=[a-zA-Z])-(?=[a-zA-z]) '''
r'''(?<=[a-zA-Z])--(?=[a-zA-z]) (?<=[0-9])-(?=[0-9]) '''
r'''(?<=[A-Za-z]),(?=[A-Za-z])''').split()
TOKENIZER_EXCEPTIONS = {
"vs.": [{"F": "vs."}],
"''": [{"F": "''"}],
"": [{"F": "", "L": "--", "pos": "$,"}],
"a.m.": [{"F": "a.m."}],
"p.m.": [{"F": "p.m."}],
"1a.m.": [{"F": "1"}, {"F": "a.m."}],
"2a.m.": [{"F": "2"}, {"F": "a.m."}],
"3a.m.": [{"F": "3"}, {"F": "a.m."}],
"4a.m.": [{"F": "4"}, {"F": "a.m."}],
"5a.m.": [{"F": "5"}, {"F": "a.m."}],
"6a.m.": [{"F": "6"}, {"F": "a.m."}],
"7a.m.": [{"F": "7"}, {"F": "a.m."}],
"8a.m.": [{"F": "8"}, {"F": "a.m."}],
"9a.m.": [{"F": "9"}, {"F": "a.m."}],
"10a.m.": [{"F": "10"}, {"F": "a.m."}],
"11a.m.": [{"F": "11"}, {"F": "a.m."}],
"12a.m.": [{"F": "12"}, {"F": "a.m."}],
"1am": [{"F": "1"}, {"F": "am", "L": "a.m."}],
"2am": [{"F": "2"}, {"F": "am", "L": "a.m."}],
"3am": [{"F": "3"}, {"F": "am", "L": "a.m."}],
"4am": [{"F": "4"}, {"F": "am", "L": "a.m."}],
"5am": [{"F": "5"}, {"F": "am", "L": "a.m."}],
"6am": [{"F": "6"}, {"F": "am", "L": "a.m."}],
"7am": [{"F": "7"}, {"F": "am", "L": "a.m."}],
"8am": [{"F": "8"}, {"F": "am", "L": "a.m."}],
"9am": [{"F": "9"}, {"F": "am", "L": "a.m."}],
"10am": [{"F": "10"}, {"F": "am", "L": "a.m."}],
"11am": [{"F": "11"}, {"F": "am", "L": "a.m."}],
"12am": [{"F": "12"}, {"F": "am", "L": "a.m."}],
"p.m.": [{"F": "p.m."}],
"1p.m.": [{"F": "1"}, {"F": "p.m."}],
"2p.m.": [{"F": "2"}, {"F": "p.m."}],
"3p.m.": [{"F": "3"}, {"F": "p.m."}],
"4p.m.": [{"F": "4"}, {"F": "p.m."}],
"5p.m.": [{"F": "5"}, {"F": "p.m."}],
"6p.m.": [{"F": "6"}, {"F": "p.m."}],
"7p.m.": [{"F": "7"}, {"F": "p.m."}],
"8p.m.": [{"F": "8"}, {"F": "p.m."}],
"9p.m.": [{"F": "9"}, {"F": "p.m."}],
"10p.m.": [{"F": "10"}, {"F": "p.m."}],
"11p.m.": [{"F": "11"}, {"F": "p.m."}],
"12p.m.": [{"F": "12"}, {"F": "p.m."}],
"1pm": [{"F": "1"}, {"F": "pm", "L": "p.m."}],
"2pm": [{"F": "2"}, {"F": "pm", "L": "p.m."}],
"3pm": [{"F": "3"}, {"F": "pm", "L": "p.m."}],
"4pm": [{"F": "4"}, {"F": "pm", "L": "p.m."}],
"5pm": [{"F": "5"}, {"F": "pm", "L": "p.m."}],
"6pm": [{"F": "6"}, {"F": "pm", "L": "p.m."}],
"7pm": [{"F": "7"}, {"F": "pm", "L": "p.m."}],
"8pm": [{"F": "8"}, {"F": "pm", "L": "p.m."}],
"9pm": [{"F": "9"}, {"F": "pm", "L": "p.m."}],
"10pm": [{"F": "10"}, {"F": "pm", "L": "p.m."}],
"11pm": [{"F": "11"}, {"F": "pm", "L": "p.m."}],
"12pm": [{"F": "12"}, {"F": "pm", "L": "p.m."}],
"Ala.": [{"F": "Ala."}],
"Ariz.": [{"F": "Ariz."}],
"Ark.": [{"F": "Ark."}],
"Calif.": [{"F": "Calif."}],
"Colo.": [{"F": "Colo."}],
"Conn.": [{"F": "Conn."}],
"Del.": [{"F": "Del."}],
"D.C.": [{"F": "D.C."}],
"Fla.": [{"F": "Fla."}],
"Ga.": [{"F": "Ga."}],
"Ill.": [{"F": "Ill."}],
"Ind.": [{"F": "Ind."}],
"Kans.": [{"F": "Kans."}],
"Kan.": [{"F": "Kan."}],
"Ky.": [{"F": "Ky."}],
"La.": [{"F": "La."}],
"Md.": [{"F": "Md."}],
"Mass.": [{"F": "Mass."}],
"Mich.": [{"F": "Mich."}],
"Minn.": [{"F": "Minn."}],
"Miss.": [{"F": "Miss."}],
"Mo.": [{"F": "Mo."}],
"Mont.": [{"F": "Mont."}],
"Nebr.": [{"F": "Nebr."}],
"Neb.": [{"F": "Neb."}],
"Nev.": [{"F": "Nev."}],
"N.H.": [{"F": "N.H."}],
"N.J.": [{"F": "N.J."}],
"N.M.": [{"F": "N.M."}],
"N.Y.": [{"F": "N.Y."}],
"N.C.": [{"F": "N.C."}],
"N.D.": [{"F": "N.D."}],
"Okla.": [{"F": "Okla."}],
"Ore.": [{"F": "Ore."}],
"Pa.": [{"F": "Pa."}],
"Tenn.": [{"F": "Tenn."}],
"Va.": [{"F": "Va."}],
"Wash.": [{"F": "Wash."}],
"Wis.": [{"F": "Wis."}],
":)": [{"F": ":)"}],
"<3": [{"F": "<3"}],
";)": [{"F": ";)"}],
"(:": [{"F": "(:"}],
":(": [{"F": ":("}],
"-_-": [{"F": "-_-"}],
"=)": [{"F": "=)"}],
":/": [{"F": ":/"}],
":>": [{"F": ":>"}],
";-)": [{"F": ";-)"}],
":Y": [{"F": ":Y"}],
":P": [{"F": ":P"}],
":-P": [{"F": ":-P"}],
":3": [{"F": ":3"}],
"=3": [{"F": "=3"}],
"xD": [{"F": "xD"}],
"^_^": [{"F": "^_^"}],
"=]": [{"F": "=]"}],
"=D": [{"F": "=D"}],
"<333": [{"F": "<333"}],
":))": [{"F": ":))"}],
":0": [{"F": ":0"}],
"-__-": [{"F": "-__-"}],
"xDD": [{"F": "xDD"}],
"o_o": [{"F": "o_o"}],
"o_O": [{"F": "o_O"}],
"V_V": [{"F": "V_V"}],
"=[[": [{"F": "=[["}],
"<33": [{"F": "<33"}],
";p": [{"F": ";p"}],
";D": [{"F": ";D"}],
";-p": [{"F": ";-p"}],
";(": [{"F": ";("}],
":p": [{"F": ":p"}],
":]": [{"F": ":]"}],
":O": [{"F": ":O"}],
":-/": [{"F": ":-/"}],
":-)": [{"F": ":-)"}],
":(((": [{"F": ":((("}],
":((": [{"F": ":(("}],
":')": [{"F": ":')"}],
"(^_^)": [{"F": "(^_^)"}],
"(=": [{"F": "(="}],
"o.O": [{"F": "o.O"}],
"\")": [{"F": "\")"}],
"a.": [{"F": "a."}],
"b.": [{"F": "b."}],
"c.": [{"F": "c."}],
"d.": [{"F": "d."}],
"e.": [{"F": "e."}],
"f.": [{"F": "f."}],
"g.": [{"F": "g."}],
"h.": [{"F": "h."}],
"i.": [{"F": "i."}],
"j.": [{"F": "j."}],
"k.": [{"F": "k."}],
"l.": [{"F": "l."}],
"m.": [{"F": "m."}],
"n.": [{"F": "n."}],
"o.": [{"F": "o."}],
"p.": [{"F": "p."}],
"q.": [{"F": "q."}],
"r.": [{"F": "r."}],
"s.": [{"F": "s."}],
"t.": [{"F": "t."}],
"u.": [{"F": "u."}],
"v.": [{"F": "v."}],
"w.": [{"F": "w."}],
"x.": [{"F": "x."}],
"y.": [{"F": "y."}],
"z.": [{"F": "z."}],
}
TAG_MAP = {
"$(": {"pos": "PUNCT", "PunctType": "Brck"},
"$,": {"pos": "PUNCT", "PunctType": "Comm"},
"$.": {"pos": "PUNCT", "PunctType": "Peri"},
"ADJA": {"pos": "ADJ"},
"ADJD": {"pos": "ADJ", "Variant": "Short"},
"ADV": {"pos": "ADV"},
"APPO": {"pos": "ADP", "AdpType": "Post"},
"APPR": {"pos": "ADP", "AdpType": "Prep"},
"APPRART": {"pos": "ADP", "AdpType": "Prep", "PronType": "Art"},
"APZR": {"pos": "ADP", "AdpType": "Circ"},
"ART": {"pos": "DET", "PronType": "Art"},
"CARD": {"pos": "NUM", "NumType": "Card"},
"FM": {"pos": "X", "Foreign": "Yes"},
"ITJ": {"pos": "INTJ"},
"KOKOM": {"pos": "CONJ", "ConjType": "Comp"},
"KON": {"pos": "CONJ"},
"KOUI": {"pos": "SCONJ"},
"KOUS": {"pos": "SCONJ"},
"NE": {"pos": "PROPN"},
"NNE": {"pos": "PROPN"},
"NN": {"pos": "NOUN"},
"PAV": {"pos": "ADV", "PronType": "Dem"},
"PROAV": {"pos": "ADV", "PronType": "Dem"},
"PDAT": {"pos": "DET", "PronType": "Dem"},
"PDS": {"pos": "PRON", "PronType": "Dem"},
"PIAT": {"pos": "DET", "PronType": "Ind,Neg,Tot"},
"PIDAT": {"pos": "DET", "AdjType": "Pdt", "PronType": "Ind,Neg,Tot"},
"PIS": {"pos": "PRON", "PronType": "Ind,Neg,Tot"},
"PPER": {"pos": "PRON", "PronType": "Prs"},
"PPOSAT": {"pos": "DET", "Poss": "Yes", "PronType": "Prs"},
"PPOSS": {"pos": "PRON", "Poss": "Yes", "PronType": "Prs"},
"PRELAT": {"pos": "DET", "PronType": "Rel"},
"PRELS": {"pos": "PRON", "PronType": "Rel"},
"PRF": {"pos": "PRON", "PronType": "Prs", "Reflex": "Yes"},
"PTKA": {"pos": "PART"},
"PTKANT": {"pos": "PART", "PartType": "Res"},
"PTKNEG": {"pos": "PART", "Negative": "Neg"},
"PTKVZ": {"pos": "PART", "PartType": "Vbp"},
"PTKZU": {"pos": "PART", "PartType": "Inf"},
"PWAT": {"pos": "DET", "PronType": "Int"},
"PWAV": {"pos": "ADV", "PronType": "Int"},
"PWS": {"pos": "PRON", "PronType": "Int"},
"TRUNC": {"pos": "X", "Hyph": "Yes"},
"VAFIN": {"pos": "AUX", "Mood": "Ind", "VerbForm": "Fin"},
"VAIMP": {"pos": "AUX", "Mood": "Imp", "VerbForm": "Fin"},
"VAINF": {"pos": "AUX", "VerbForm": "Inf"},
"VAPP": {"pos": "AUX", "Aspect": "Perf", "VerbForm": "Part"},
"VMFIN": {"pos": "VERB", "Mood": "Ind", "VerbForm": "Fin", "VerbType": "Mod"},
"VMINF": {"pos": "VERB", "VerbForm": "Inf", "VerbType": "Mod"},
"VMPP": {"pos": "VERB", "Aspect": "Perf", "VerbForm": "Part", "VerbType": "Mod"},
"VVFIN": {"pos": "VERB", "Mood": "Ind", "VerbForm": "Fin"},
"VVIMP": {"pos": "VERB", "Mood": "Imp", "VerbForm": "Fin"},
"VVINF": {"pos": "VERB", "VerbForm": "Inf"},
"VVIZU": {"pos": "VERB", "VerbForm": "Inf"},
"VVPP": {"pos": "VERB", "Aspect": "Perf", "VerbForm": "Part"},
"XY": {"pos": "X"},
"SP": {"pos": "SPACE"}
}
__all__ = ["TOKENIZER_EXCEPTIONS", "STOP_WORDS"]

88
spacy/fr/stop_words.py Normal file
View File

@ -0,0 +1,88 @@
# encoding: utf8
from __future__ import unicode_literals
STOP_WORDS = set("""
a à â abord absolument afin ah ai aie ailleurs ainsi ait allaient allo allons
allô alors anterieur anterieure anterieures apres après as assez attendu au
aucun aucune aujourd aujourd'hui aupres auquel aura auraient aurait auront
aussi autre autrefois autrement autres autrui aux auxquelles auxquels avaient
avais avait avant avec avoir avons ayant
bah bas basee bat beau beaucoup bien bigre boum bravo brrr
ça car ce ceci cela celle celle-ci celle- celles celles-ci celles- celui
celui-ci celui- cent cependant certain certaine certaines certains certes ces
cet cette ceux ceux-ci ceux- chacun chacune chaque cher chers chez chiche
chut chère chères ci cinq cinquantaine cinquante cinquantième cinquième clac
clic combien comme comment comparable comparables compris concernant contre
couic crac
da dans de debout dedans dehors deja delà depuis dernier derniere derriere
derrière des desormais desquelles desquels dessous dessus deux deuxième
deuxièmement devant devers devra different differentes differents différent
différente différentes différents dire directe directement dit dite dits divers
diverse diverses dix dix-huit dix-neuf dix-sept dixième doit doivent donc dont
douze douzième dring du duquel durant dès désormais
effet egale egalement egales eh elle elle-même elles elles-mêmes en encore
enfin entre envers environ es ès est et etaient étaient etais étais etait était
etant étant etc été etre être eu euh eux eux-mêmes exactement excepté extenso
exterieur
fais faisaient faisant fait façon feront fi flac floc font
gens
ha hein hem hep hi ho holà hop hormis hors hou houp hue hui huit huitième hum
hurrah hélas i il ils importe
je jusqu jusque juste
la laisser laquelle las le lequel les lesquelles lesquels leur leurs longtemps
lors lorsque lui lui-meme lui-même lès
ma maint maintenant mais malgre malgré maximale me meme memes merci mes mien
mienne miennes miens mille mince minimale moi moi-meme moi-même moindres moins
mon moyennant multiple multiples même mêmes
na naturel naturelle naturelles ne neanmoins necessaire necessairement neuf
neuvième ni nombreuses nombreux non nos notamment notre nous nous-mêmes nouveau
nul néanmoins nôtre nôtres
o ô oh ohé ollé olé on ont onze onzième ore ou ouf ouias oust ouste outre
ouvert ouverte ouverts
paf pan par parce parfois parle parlent parler parmi parseme partant
particulier particulière particulièrement pas passé pendant pense permet
personne peu peut peuvent peux pff pfft pfut pif pire plein plouf plus
plusieurs plutôt possessif possessifs possible possibles pouah pour pourquoi
pourrais pourrait pouvait prealable precisement premier première premièrement
pres probable probante procedant proche près psitt pu puis puisque pur pure
qu quand quant quant-à-soi quanta quarante quatorze quatre quatre-vingt
quatrième quatrièmement que quel quelconque quelle quelles quelqu'un quelque
quelques quels qui quiconque quinze quoi quoique
rare rarement rares relative relativement remarquable rend rendre restant reste
restent restrictif retour revoici revoilà rien
sa sacrebleu sait sans sapristi sauf se sein seize selon semblable semblaient
semble semblent sent sept septième sera seraient serait seront ses seul seule
seulement si sien sienne siennes siens sinon six sixième soi soi-même soit
soixante son sont sous souvent specifique specifiques speculatif stop
strictement subtiles suffisant suffisante suffit suis suit suivant suivante
suivantes suivants suivre superpose sur surtout
ta tac tant tardive te tel telle tellement telles tels tenant tend tenir tente
tes tic tien tienne tiennes tiens toc toi toi-même ton touchant toujours tous
tout toute toutefois toutes treize trente tres trois troisième troisièmement
trop très tsoin tsouin tu
un une unes uniformement unique uniques uns
va vais vas vers via vif vifs vingt vivat vive vives vlan voici voilà vont vos
votre vous vous-mêmes vu vôtre vôtres
zut
""".split())

View File

@ -1,27 +1,20 @@
# encoding: utf8
from __future__ import unicode_literals, print_function
from os import path
from ..language import Language
from ..attrs import LANG
from . import language_data
from .language_data import *
class Italian(Language):
lang = 'it'
class Defaults(Language.Defaults):
tokenizer_exceptions = dict(language_data.TOKENIZER_EXCEPTIONS)
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters[LANG] = lambda text: 'it'
prefixes = tuple(language_data.TOKENIZER_PREFIXES)
suffixes = tuple(language_data.TOKENIZER_SUFFIXES)
infixes = tuple(language_data.TOKENIZER_INFIXES)
tag_map = dict(language_data.TAG_MAP)
stop_words = set(language_data.STOP_WORDS)
tokenizer_exceptions = TOKENIZER_EXCEPTIONS
stop_words = STOP_WORDS

View File

@ -1,3 +0,0 @@
\.\.\.
(?<=[a-z])\.(?=[A-Z])
(?<=[a-zA-Z])-(?=[a-zA-z])

View File

@ -1,55 +0,0 @@
{
"PRP": {
"I": {"L": "-PRON-", "person": 1, "number": 1, "gender": 0, "case": 1},
"me": {"L": "-PRON-", "person": 1, "number": 1, "gender": 0, "case": 3},
"mine": {"L": "-PRON-", "person": 1, "number": 1, "gender": 0, "case": 2},
"myself": {"L": "-PRON-", "person": 1, "number": 1, "gender": 0, "case": 4},
"you": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 0},
"yours": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 2},
"yourself": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 4},
"he": {"L": "-PRON-", "person": 3, "number": 1, "gender": 1, "case": 1},
"him": {"L": "-PRON-", "person": 3, "number": 1, "gender": 1, "case": 3},
"his": {"L": "-PRON-", "person": 3, "number": 1, "gender": 1, "case": 2},
"himself": {"L": "-PRON-", "person": 3, "number": 1, "gender": 1, "case": 4},
"she": {"L": "-PRON-", "person": 3, "number": 1, "gender": 2, "case": 1},
"her": {"L": "-PRON-", "person": 3, "number": 1, "gender": 2, "case": 3},
"hers": {"L": "-PRON-", "person": 3, "number": 1, "gender": 2, "case": 2},
"herself": {"L": "-PRON-", "person": 3, "number": 1, "gender": 2, "case": 4},
"it": {"L": "-PRON-", "person": 3, "number": 1, "gender": 3, "case": 0},
"its": {"L": "-PRON-", "person": 3, "number": 1, "gender": 3, "case": 2},
"itself": {"L": "-PRON-", "person": 3, "number": 1, "gender": 3, "case": 4},
"themself": {"L": "-PRON-", "person": 3, "number": 1, "gender": 0, "case": 4},
"we": {"L": "-PRON-", "person": 1, "number": 2, "gender": 0, "case": 1},
"us": {"L": "-PRON-", "person": 1, "number": 2, "gender": 0, "case": 3},
"ours": {"L": "-PRON-", "person": 1, "number": 2, "gender": 0, "case": 3},
"ourselves": {"L": "-PRON-", "person": 1, "number": 2, "gender": 0, "case": 4},
"yourselves": {"L": "-PRON-", "person": 2, "number": 2, "gender": 0, "case": 4},
"they": {"L": "-PRON-", "person": 3, "number": 2, "gender": 0, "case": 1},
"them": {"L": "-PRON-", "person": 3, "number": 2, "gender": 0, "case": 3},
"their": {"L": "-PRON-", "person": 3, "number": 2, "gender": 0, "case": 2},
"themselves": {"L": "-PRON-", "person": 3, "number": 2, "gender": 0, "case": 4}
},
"PRP$": {
"my": {"L": "-PRON-", "person": 1, "number": 1, "gender": 0, "case": 2},
"your": {"L": "-PRON-", "person": 2, "number": 0, "gender": 0, "case": 2},
"his": {"L": "-PRON-", "person": 3, "number": 1, "gender": 1, "case": 2},
"her": {"L": "-PRON-", "person": 3, "number": 1, "gender": 2, "case": 2},
"its": {"L": "-PRON-", "person": 3, "number": 1, "gender": 3, "case": 2},
"our": {"L": "-PRON-", "person": 1, "number": 1, "gender": 0, "case": 2},
"their": {"L": "-PRON-", "person": 3, "number": 2, "gender": 0, "case": 2}
},
"JJR": {
"better": {"L": "good", "misc": 1}
},
"JJS": {
"best": {"L": "good", "misc": 2}
},
"RBR": {
"better": {"L": "good", "misc": 1}
},
"RBS": {
"best": {"L": "good", "misc": 2}
}
}

View File

@ -1,21 +0,0 @@
,
"
(
[
{
*
<
$
£
'
``
`
#
US$
C$
A$
a-
....
...

View File

@ -1,647 +0,0 @@
{
"'s": [{"F": "'s", "L": "'s"}],
"'S": [{"F": "'S", "L": "'s"}],
"ain't": [{"F": "ai", "L": "be", "pos": "VBP", "number": 2},
{"F": "n't", "L": "not", "pos": "RB"}],
"aint": [{"F": "ai", "L": "be", "pos": "VBP", "number": 2},
{"F": "nt", "L": "not", "pos": "RB"}],
"Ain't": [{"F": "Ai", "L": "be", "pos": "VBP", "number": 2},
{"F": "n't", "L": "not", "pos": "RB"}],
"aren't": [{"F": "are", "L": "be", "pos": "VBP", "number": 2},
{"F": "n't", "L": "not"}],
"arent": [{"F": "are", "L": "be", "pos": "VBP", "number": 2},
{"F": "nt", "L": "not"}],
"Aren't": [{"F": "Are", "L": "be", "pos": "VBP", "number": 2},
{"F": "n't", "L": "not"}],
"can't": [{"F": "ca", "L": "can", "pos": "MD"},
{"F": "n't", "L": "not", "pos": "RB"}],
"cant": [{"F": "ca", "L": "can", "pos": "MD"},
{"F": "nt", "L": "not", "pos": "RB"}],
"Can't": [{"F": "Ca", "L": "can", "pos": "MD"},
{"F": "n't", "L": "not", "pos": "RB"}],
"cannot": [{"F": "can", "pos": "MD"},
{"F": "not", "L": "not", "pos": "RB"}],
"Cannot": [{"F": "Can", "pos": "MD"},
{"F": "not", "L": "not", "pos": "RB"}],
"could've": [{"F": "could", "pos": "MD"},
{"F": "'ve", "L": "have", "pos": "VB"}],
"couldve": [{"F": "could", "pos": "MD"},
{"F": "ve", "L": "have", "pos": "VB"}],
"Could've": [{"F": "Could", "pos": "MD"},
{"F": "'ve", "L": "have", "pos": "VB"}],
"couldn't": [{"F": "could", "pos": "MD"},
{"F": "n't", "L": "not", "pos": "RB"}],
"couldnt": [{"F": "could", "pos": "MD"},
{"F": "nt", "L": "not", "pos": "RB"}],
"Couldn't": [{"F": "Could", "pos": "MD"},
{"F": "n't", "L": "not", "pos": "RB"}],
"couldn't've": [{"F": "could", "pos": "MD"},
{"F": "n't", "L": "not", "pos": "RB"},
{"F": "'ve", "pos": "VB"}],
"couldntve": [{"F": "could", "pos": "MD"},
{"F": "nt", "L": "not", "pos": "RB"},
{"F": "ve", "pos": "VB"}],
"Couldn't've": [{"F": "Could", "pos": "MD"},
{"F": "n't", "L": "not", "pos": "RB"},
{"F": "'ve", "pos": "VB"}],
"didn't": [{"F": "did", "pos": "VBD", "L": "do"},
{"F": "n't", "L": "not", "pos": "RB"}],
"didnt": [{"F": "did", "pos": "VBD", "L": "do"},
{"F": "nt", "L": "not", "pos": "RB"}],
"Didn't": [{"F": "Did", "pos": "VBD", "L": "do"},
{"F": "n't", "L": "not", "pos": "RB"}],
"doesn't": [{"F": "does", "L": "do", "pos": "VBZ"},
{"F": "n't", "L": "not", "pos": "RB"}],
"doesnt": [{"F": "does", "L": "do", "pos": "VBZ"},
{"F": "nt", "L": "not", "pos": "RB"}],
"Doesn't": [{"F": "Does", "L": "do", "pos": "VBZ"},
{"F": "n't", "L": "not", "pos": "RB"}],
"don't": [{"F": "do", "L": "do"},
{"F": "n't", "L": "not", "pos": "RB"}],
"dont": [{"F": "do", "L": "do"},
{"F": "nt", "L": "not", "pos": "RB"}],
"Don't": [{"F": "Do", "L": "do"},
{"F": "n't", "L": "not", "pos": "RB"}],
"hadn't": [{"F": "had", "L": "have", "pos": "VBD"},
{"F": "n't", "L": "not", "pos": "RB"}],
"hadnt": [{"F": "had", "L": "have", "pos": "VBD"},
{"F": "nt", "L": "not", "pos": "RB"}],
"Hadn't": [{"F": "Had", "L": "have", "pos": "VBD"},
{"F": "n't", "L": "not", "pos": "RB"}],
"hadn't've": [{"F": "had", "L": "have", "pos": "VBD"},
{"F": "n't", "L": "not", "pos": "RB"},
{"F": "'ve", "L": "have", "pos": "VB"}],
"hasn't": [{"F": "has"},
{"F": "n't", "L": "not", "pos": "RB"}],
"hasnt": [{"F": "has"},
{"F": "nt", "L": "not", "pos": "RB"}],
"haven't": [{"F": "have", "pos": "VB"},
{"F": "n't", "L": "not", "pos": "RB"}],
"havent": [{"F": "have", "pos": "VB"},
{"F": "nt", "L": "not", "pos": "RB"}],
"he'd": [{"F": "he", "L": "-PRON-"},
{"F": "'d", "L": "would", "pos": "MD"}],
"hed": [{"F": "he", "L": "-PRON-"},
{"F": "d", "L": "would", "pos": "MD"}],
"he'd've": [{"F": "he", "L": "-PRON-"},
{"F": "'d", "L": "would", "pos": "MD"},
{"F": "'ve", "pos": "VB"}],
"hedve": [{"F": "he", "L": "-PRON-"},
{"F": "d", "L": "would", "pos": "MD"},
{"F": "ve", "pos": "VB"}],
"he'll": [{"F": "he", "L": "-PRON-"},
{"F": "'ll", "L": "will", "pos": "MD"}],
"he's": [{"F": "he", "L": "-PRON-"},
{"F": "'s"}],
"hes": [{"F": "he", "L": "-PRON-"},
{"F": "s"}],
"how'd": [{"F": "how"},
{"F": "'d", "L": "would", "pos": "MD"}],
"howd": [{"F": "how"},
{"F": "d", "L": "would", "pos": "MD"}],
"how'll": [{"F": "how"},
{"F": "'ll", "L": "will", "pos": "MD"}],
"howll": [{"F": "how"},
{"F": "ll", "L": "will", "pos": "MD"}],
"how's": [{"F": "how"},
{"F": "'s"}],
"hows": [{"F": "how"},
{"F": "s"}],
"I'd": [{"F": "I", "L": "-PRON-"},
{"F": "'d", "L": "would", "pos": "MD"}],
"I'd've": [{"F": "I", "L": "-PRON-"},
{"F": "'d", "L": "would", "pos": "MD"},
{"F": "'ve", "pos": "VB"}],
"I'll": [{"F": "I", "L": "-PRON-"},
{"F": "'ll", "L": "will", "pos": "MD"}],
"i'll": [{"F": "i", "L": "-PRON-"},
{"F": "'ll", "L": "will", "pos": "MD"}],
"I'm": [{"F": "I", "L": "-PRON-"},
{"F": "'m", "L": "be", "pos": "VBP", "number": 1, "tenspect": 1}],
"i'm": [{"F": "i", "L": "-PRON-"},
{"F": "'m", "L": "be", "pos": "VBP", "number": 1, "tenspect": 1}],
"Im": [{"F": "I", "L": "-PRON-"},
{"F": "m", "L": "be", "pos": "VBP", "number": 1, "tenspect": 1}],
"im": [{"F": "i", "L": "-PRON-"},
{"F": "m", "L": "be", "pos": "VBP", "number": 1, "tenspect": 1}],
"I'ma": [{"F": "I", "L": "-PRON-"},
{"F": "'ma"}],
"i'ma": [{"F": "i", "L": "-PRON-"},
{"F": "'ma"}],
"I've": [{"F": "I", "L": "-PRON-"},
{"F": "'ve", "pos": "VB", "L": "have", "pos": "MD"}],
"i've": [{"F": "i", "L": "-PRON-"},
{"F": "'ve", "pos": "VB", "L": "have", "pos": "MD"}],
"isn't": [{"F": "is", "L": "be", "pos": "VBZ"},
{"F": "n't", "L": "not", "pos": "RB"}],
"isnt": [{"F": "is", "L": "be", "pos": "VBZ"},
{"F": "nt", "L": "not", "pos": "RB"}],
"Isn't": [{"F": "Is", "L": "be", "pos": "VBZ"},
{"F": "n't", "L": "not", "pos": "RB"}],
"It'd": [{"F": "It", "L": "-PRON-"},
{"F": "'d", "L": "would", "pos": "MD"}],
"it'd": [{"F": "it", "L": "-PRON-"},
{"F": "'d", "L": "would", "pos": "MD"}],
"it'd've": [{"F": "it", "L": "-PRON-"},
{"F": "'d", "L": "would", "pos": "MD"},
{"F": "'ve"}],
"it'll": [{"F": "it", "L": "-PRON-"},
{"F": "'ll", "L": "will", "pos": "MD"}],
"itll": [{"F": "it", "L": "-PRON-"},
{"F": "ll", "L": "will", "pos": "MD"}],
"it's": [{"F": "it", "L": "-PRON-"},
{"F": "'s"}],
"let's": [{"F": "let"},
{"F": "'s"}],
"lets": [{"F": "let"},
{"F": "s", "L": "'s"}],
"mightn't": [{"F": "might"},
{"F": "n't", "L": "not", "pos": "RB"}],
"mightn't've": [{"F": "might"},
{"F": "n't", "L": "not", "pos": "RB"},
{"F": "'ve", "pos": "VB"}],
"might've": [{"F": "might"},
{"F": "'ve", "pos": "VB"}],
"mustn't": [{"F": "must"},
{"F": "n't", "L": "not", "pos": "RB"}],
"must've": [{"F": "must"},
{"F": "'ve", "pos": "VB"}],
"needn't": [{"F": "need"},
{"F": "n't", "L": "not", "pos": "RB"}],
"not've": [{"F": "not"},
{"F": "'ve", "pos": "VB"}],
"shan't": [{"F": "sha"},
{"F": "n't", "L": "not", "pos": "RB"}],
"she'd": [{"F": "she", "L": "-PRON-"},
{"F": "'d", "L": "would", "pos": "MD"}],
"she'd've": [{"F": "she", "L": "-PRON-"},
{"F": "'d", "L": "would", "pos": "MD"},
{"F": "'ve", "pos": "VB"}],
"she'll": [{"F": "she", "L": "-PRON-"},
{"F": "'ll", "L": "will"}],
"she's": [{"F": "she", "L": "-PRON-"},
{"F": "'s"}],
"should've": [{"F": "should"},
{"F": "'ve", "pos": "VB"}],
"shouldn't": [{"F": "should"},
{"F": "n't", "L": "not", "pos": "RB"}],
"shouldn't've": [{"F": "should"},
{"F": "n't", "L": "not", "pos": "RB"},
{"F": "'ve"}],
"that's": [{"F": "that"},
{"F": "'s"}],
"thats": [{"F": "that"},
{"F": "s", "L": "'s"}],
"there'd": [{"F": "there"},
{"F": "'d", "L": "would", "pos": "MD"}],
"there'd've": [{"F": "there"},
{"F": "'d", "L": "would", "pos": "MD"},
{"F": "'ve", "pos": "VB"}],
"there's": [{"F": "there"},
{"F": "'s"}],
"they'd": [{"F": "they", "L": "-PRON-"},
{"F": "'d", "L": "would", "pos": "MD", "pos": "VB"}],
"They'd": [{"F": "They", "L": "-PRON-"},
{"F": "'d", "L": "would", "pos": "MD", "pos": "VB"}],
"they'd've": [{"F": "they", "L": "-PRON-"},
{"F": "'d", "L": "would", "pos": "MD"},
{"F": "'ve", "pos": "VB"}],
"They'd've": [{"F": "They", "L": "-PRON-"},
{"F": "'d", "L": "would", "pos": "MD"},
{"F": "'ve", "pos": "VB"}],
"they'll": [{"F": "they", "L": "-PRON-"},
{"F": "'ll", "L": "will", "pos": "MD"}],
"They'll": [{"F": "They", "L": "-PRON-"},
{"F": "'ll", "L": "will", "pos": "MD"}],
"they're": [{"F": "they", "L": "-PRON-"},
{"F": "'re"}],
"They're": [{"F": "They", "L": "-PRON-"},
{"F": "'re"}],
"they've": [{"F": "they", "L": "-PRON-"},
{"F": "'ve", "pos": "VB"}],
"They've": [{"F": "They", "L": "-PRON-"},
{"F": "'ve", "pos": "VB"}],
"wasn't": [{"F": "was"},
{"F": "n't", "L": "not", "pos": "RB"}],
"we'd": [{"F": "we"},
{"F": "'d", "L": "would", "pos": "MD"}],
"We'd": [{"F": "We"},
{"F": "'d", "L": "would", "pos": "MD"}],
"we'd've": [{"F": "we"},
{"F": "'d", "L": "would", "pos": "MD"},
{"F": "'ve", "pos": "VB"}],
"we'll": [{"F": "we"},
{"F": "'ll", "L": "will", "pos": "MD"}],
"We'll": [{"F": "We", "L": "we"},
{"F": "'ll", "L": "will", "pos": "MD"}],
"we're": [{"F": "we"},
{"F": "'re"}],
"We're": [{"F": "We"},
{"F": "'re"}],
"we've": [{"F": "we"},
{"F": "'ve", "pos": "VB"}],
"We've": [{"F": "We"},
{"F": "'ve", "pos": "VB"}],
"weren't": [{"F": "were"},
{"F": "n't", "L": "not", "pos": "RB"}],
"what'll": [{"F": "what"},
{"F": "'ll", "L": "will", "pos": "MD"}],
"what're": [{"F": "what"},
{"F": "'re"}],
"what's": [{"F": "what"},
{"F": "'s"}],
"what've": [{"F": "what"},
{"F": "'ve", "pos": "VB"}],
"when's": [{"F": "when"},
{"F": "'s"}],
"where'd": [{"F": "where"},
{"F": "'d", "L": "would", "pos": "MD"}],
"where's": [{"F": "where"},
{"F": "'s"}],
"where've": [{"F": "where"},
{"F": "'ve", "pos": "VB"}],
"who'd": [{"F": "who"},
{"F": "'d", "L": "would", "pos": "MD"}],
"who'll": [{"F": "who"},
{"F": "'ll", "L": "will", "pos": "MD"}],
"who're": [{"F": "who"},
{"F": "'re"}],
"who's": [{"F": "who"},
{"F": "'s"}],
"who've": [{"F": "who"},
{"F": "'ve", "pos": "VB"}],
"why'll": [{"F": "why"},
{"F": "'ll", "L": "will", "pos": "MD"}],
"why're": [{"F": "why"},
{"F": "'re"}],
"why's": [{"F": "why"},
{"F": "'s"}],
"won't": [{"F": "wo"},
{"F": "n't", "L": "not", "pos": "RB"}],
"wont": [{"F": "wo"},
{"F": "nt", "L": "not", "pos": "RB"}],
"would've": [{"F": "would"},
{"F": "'ve", "pos": "VB"}],
"wouldn't": [{"F": "would"},
{"F": "n't", "L": "not", "pos": "RB"}],
"wouldn't've": [{"F": "would"},
{"F": "n't", "L": "not", "pos": "RB"},
{"F": "'ve", "L": "have", "pos": "VB"}],
"you'd": [{"F": "you", "L": "-PRON-"},
{"F": "'d", "L": "would", "pos": "MD"}],
"you'd've": [{"F": "you", "L": "-PRON-"},
{"F": "'d", "L": "would", "pos": "MD"},
{"F": "'ve", "L": "have", "pos": "VB"}],
"you'll": [{"F": "you", "L": "-PRON-"},
{"F": "'ll", "L": "will", "pos": "MD"}],
"You'll": [{"F": "You", "L": "-PRON-"},
{"F": "'ll", "L": "will", "pos": "MD"}],
"you're": [{"F": "you", "L": "-PRON-"},
{"F": "'re"}],
"You're": [{"F": "You", "L": "-PRON-"},
{"F": "'re"}],
"you've": [{"F": "you", "L": "-PRON-"},
{"F": "'ve", "L": "have", "pos": "VB"}],
"You've": [{"F": "You", "L": "-PRON-"},
{"F": "'ve", "L": "have", "pos": "VB"}],
"'em": [{"F": "'em"}],
"'ol": [{"F": "'ol"}],
"vs.": [{"F": "vs."}],
"Ms.": [{"F": "Ms."}],
"Mr.": [{"F": "Mr."}],
"Dr.": [{"F": "Dr."}],
"Mrs.": [{"F": "Mrs."}],
"Messrs.": [{"F": "Messrs."}],
"Gov.": [{"F": "Gov."}],
"Gen.": [{"F": "Gen."}],
"Mt.": [{"F": "Mt.", "L": "Mount"}],
"''": [{"F": "''"}],
"Corp.": [{"F": "Corp."}],
"Inc.": [{"F": "Inc."}],
"Co.": [{"F": "Co."}],
"co.": [{"F": "co."}],
"Ltd.": [{"F": "Ltd."}],
"Bros.": [{"F": "Bros."}],
"Rep.": [{"F": "Rep."}],
"Sen.": [{"F": "Sen."}],
"Jr.": [{"F": "Jr."}],
"Rev.": [{"F": "Rev."}],
"Adm.": [{"F": "Adm."}],
"St.": [{"F": "St."}],
"a.m.": [{"F": "a.m."}],
"p.m.": [{"F": "p.m."}],
"1a.m.": [{"F": "1"}, {"F": "a.m."}],
"2a.m.": [{"F": "2"}, {"F": "a.m."}],
"3a.m.": [{"F": "3"}, {"F": "a.m."}],
"4a.m.": [{"F": "4"}, {"F": "a.m."}],
"5a.m.": [{"F": "5"}, {"F": "a.m."}],
"6a.m.": [{"F": "6"}, {"F": "a.m."}],
"7a.m.": [{"F": "7"}, {"F": "a.m."}],
"8a.m.": [{"F": "8"}, {"F": "a.m."}],
"9a.m.": [{"F": "9"}, {"F": "a.m."}],
"10a.m.": [{"F": "10"}, {"F": "a.m."}],
"11a.m.": [{"F": "11"}, {"F": "a.m."}],
"12a.m.": [{"F": "12"}, {"F": "a.m."}],
"1am": [{"F": "1"}, {"F": "am", "L": "a.m."}],
"2am": [{"F": "2"}, {"F": "am", "L": "a.m."}],
"3am": [{"F": "3"}, {"F": "am", "L": "a.m."}],
"4am": [{"F": "4"}, {"F": "am", "L": "a.m."}],
"5am": [{"F": "5"}, {"F": "am", "L": "a.m."}],
"6am": [{"F": "6"}, {"F": "am", "L": "a.m."}],
"7am": [{"F": "7"}, {"F": "am", "L": "a.m."}],
"8am": [{"F": "8"}, {"F": "am", "L": "a.m."}],
"9am": [{"F": "9"}, {"F": "am", "L": "a.m."}],
"10am": [{"F": "10"}, {"F": "am", "L": "a.m."}],
"11am": [{"F": "11"}, {"F": "am", "L": "a.m."}],
"12am": [{"F": "12"}, {"F": "am", "L": "a.m."}],
"p.m.": [{"F": "p.m."}],
"1p.m.": [{"F": "1"}, {"F": "p.m."}],
"2p.m.": [{"F": "2"}, {"F": "p.m."}],
"3p.m.": [{"F": "3"}, {"F": "p.m."}],
"4p.m.": [{"F": "4"}, {"F": "p.m."}],
"5p.m.": [{"F": "5"}, {"F": "p.m."}],
"6p.m.": [{"F": "6"}, {"F": "p.m."}],
"7p.m.": [{"F": "7"}, {"F": "p.m."}],
"8p.m.": [{"F": "8"}, {"F": "p.m."}],
"9p.m.": [{"F": "9"}, {"F": "p.m."}],
"10p.m.": [{"F": "10"}, {"F": "p.m."}],
"11p.m.": [{"F": "11"}, {"F": "p.m."}],
"12p.m.": [{"F": "12"}, {"F": "p.m."}],
"1pm": [{"F": "1"}, {"F": "pm", "L": "p.m."}],
"2pm": [{"F": "2"}, {"F": "pm", "L": "p.m."}],
"3pm": [{"F": "3"}, {"F": "pm", "L": "p.m."}],
"4pm": [{"F": "4"}, {"F": "pm", "L": "p.m."}],
"5pm": [{"F": "5"}, {"F": "pm", "L": "p.m."}],
"6pm": [{"F": "6"}, {"F": "pm", "L": "p.m."}],
"7pm": [{"F": "7"}, {"F": "pm", "L": "p.m."}],
"8pm": [{"F": "8"}, {"F": "pm", "L": "p.m."}],
"9pm": [{"F": "9"}, {"F": "pm", "L": "p.m."}],
"10pm": [{"F": "10"}, {"F": "pm", "L": "p.m."}],
"11pm": [{"F": "11"}, {"F": "pm", "L": "p.m."}],
"12pm": [{"F": "12"}, {"F": "pm", "L": "p.m."}],
"Jan.": [{"F": "Jan."}],
"Feb.": [{"F": "Feb."}],
"Mar.": [{"F": "Mar."}],
"Apr.": [{"F": "Apr."}],
"May.": [{"F": "May."}],
"Jun.": [{"F": "Jun."}],
"Jul.": [{"F": "Jul."}],
"Aug.": [{"F": "Aug."}],
"Sep.": [{"F": "Sep."}],
"Sept.": [{"F": "Sept."}],
"Oct.": [{"F": "Oct."}],
"Nov.": [{"F": "Nov."}],
"Dec.": [{"F": "Dec."}],
"Ala.": [{"F": "Ala."}],
"Ariz.": [{"F": "Ariz."}],
"Ark.": [{"F": "Ark."}],
"Calif.": [{"F": "Calif."}],
"Colo.": [{"F": "Colo."}],
"Conn.": [{"F": "Conn."}],
"Del.": [{"F": "Del."}],
"D.C.": [{"F": "D.C."}],
"Fla.": [{"F": "Fla."}],
"Ga.": [{"F": "Ga."}],
"Ill.": [{"F": "Ill."}],
"Ind.": [{"F": "Ind."}],
"Kans.": [{"F": "Kans."}],
"Kan.": [{"F": "Kan."}],
"Ky.": [{"F": "Ky."}],
"La.": [{"F": "La."}],
"Md.": [{"F": "Md."}],
"Mass.": [{"F": "Mass."}],
"Mich.": [{"F": "Mich."}],
"Minn.": [{"F": "Minn."}],
"Miss.": [{"F": "Miss."}],
"Mo.": [{"F": "Mo."}],
"Mont.": [{"F": "Mont."}],
"Nebr.": [{"F": "Nebr."}],
"Neb.": [{"F": "Neb."}],
"Nev.": [{"F": "Nev."}],
"N.H.": [{"F": "N.H."}],
"N.J.": [{"F": "N.J."}],
"N.M.": [{"F": "N.M."}],
"N.Y.": [{"F": "N.Y."}],
"N.C.": [{"F": "N.C."}],
"N.D.": [{"F": "N.D."}],
"Okla.": [{"F": "Okla."}],
"Ore.": [{"F": "Ore."}],
"Pa.": [{"F": "Pa."}],
"Tenn.": [{"F": "Tenn."}],
"Va.": [{"F": "Va."}],
"Wash.": [{"F": "Wash."}],
"Wis.": [{"F": "Wis."}],
":)": [{"F": ":)"}],
"<3": [{"F": "<3"}],
";)": [{"F": ";)"}],
"(:": [{"F": "(:"}],
":(": [{"F": ":("}],
"-_-": [{"F": "-_-"}],
"=)": [{"F": "=)"}],
":/": [{"F": ":/"}],
":>": [{"F": ":>"}],
";-)": [{"F": ";-)"}],
":Y": [{"F": ":Y"}],
":P": [{"F": ":P"}],
":-P": [{"F": ":-P"}],
":3": [{"F": ":3"}],
"=3": [{"F": "=3"}],
"xD": [{"F": "xD"}],
"^_^": [{"F": "^_^"}],
"=]": [{"F": "=]"}],
"=D": [{"F": "=D"}],
"<333": [{"F": "<333"}],
":))": [{"F": ":))"}],
":0": [{"F": ":0"}],
"-__-": [{"F": "-__-"}],
"xDD": [{"F": "xDD"}],
"o_o": [{"F": "o_o"}],
"o_O": [{"F": "o_O"}],
"V_V": [{"F": "V_V"}],
"=[[": [{"F": "=[["}],
"<33": [{"F": "<33"}],
";p": [{"F": ";p"}],
";D": [{"F": ";D"}],
";-p": [{"F": ";-p"}],
";(": [{"F": ";("}],
":p": [{"F": ":p"}],
":]": [{"F": ":]"}],
":O": [{"F": ":O"}],
":-/": [{"F": ":-/"}],
":-)": [{"F": ":-)"}],
":(((": [{"F": ":((("}],
":((": [{"F": ":(("}],
":')": [{"F": ":')"}],
"(^_^)": [{"F": "(^_^)"}],
"(=": [{"F": "(="}],
"o.O": [{"F": "o.O"}],
"\")": [{"F": "\")"}],
"a.": [{"F": "a."}],
"b.": [{"F": "b."}],
"c.": [{"F": "c."}],
"d.": [{"F": "d."}],
"e.": [{"F": "e."}],
"f.": [{"F": "f."}],
"g.": [{"F": "g."}],
"h.": [{"F": "h."}],
"i.": [{"F": "i."}],
"j.": [{"F": "j."}],
"k.": [{"F": "k."}],
"l.": [{"F": "l."}],
"m.": [{"F": "m."}],
"n.": [{"F": "n."}],
"o.": [{"F": "o."}],
"p.": [{"F": "p."}],
"q.": [{"F": "q."}],
"s.": [{"F": "s."}],
"t.": [{"F": "t."}],
"u.": [{"F": "u."}],
"v.": [{"F": "v."}],
"w.": [{"F": "w."}],
"x.": [{"F": "x."}],
"y.": [{"F": "y."}],
"z.": [{"F": "z."}],
"i.e.": [{"F": "i.e."}],
"I.e.": [{"F": "I.e."}],
"I.E.": [{"F": "I.E."}],
"e.g.": [{"F": "e.g."}],
"E.g.": [{"F": "E.g."}],
"E.G.": [{"F": "E.G."}],
"\n": [{"F": "\n", "pos": "SP"}],
"\t": [{"F": "\t", "pos": "SP"}],
" ": [{"F": " ", "pos": "SP"}]
}

View File

@ -1,26 +0,0 @@
,
\"
\)
\]
\}
\*
\!
\?
%
\$
>
:
;
'
''
's
'S
s
S
\.\.
\.\.\.
\.\.\.\.
(?<=[a-z0-9)\]"'%\)])\.
(?<=[0-9])km

View File

@ -1,198 +0,0 @@
{
"Reddit": [
"PRODUCT",
{},
[
[{"lower": "reddit"}]
]
],
"SeptemberElevenAttacks": [
"EVENT",
{},
[
[
{"orth": "9/11"}
],
[
{"lower": "Septmber"},
{"lower": "Eleven"}
],
[
{"lower": "september"},
{"orth": "11"}
]
]
],
"Linux": [
"PRODUCT",
{},
[
[{"lower": "linux"}]
]
],
"Haskell": [
"PRODUCT",
{},
[
[{"lower": "haskell"}]
]
],
"HaskellCurry": [
"PERSON",
{},
[
[
{"lower": "haskell"},
{"lower": "curry"}
]
]
],
"Javascript": [
"PRODUCT",
{},
[
[{"lower": "javascript"}]
]
],
"CSS": [
"PRODUCT",
{},
[
[{"lower": "css"}],
[{"lower": "css3"}]
]
],
"displaCy": [
"PRODUCT",
{},
[
[{"lower": "displacy"}]
]
],
"spaCy": [
"PRODUCT",
{},
[
[{"orth": "spaCy"}]
]
],
"HTML": [
"PRODUCT",
{},
[
[{"lower": "html"}],
[{"lower": "html5"}]
]
],
"Python": [
"PRODUCT",
{},
[
[{"orth": "Python"}]
]
],
"Ruby": [
"PRODUCT",
{},
[
[{"orth": "Ruby"}]
]
],
"Digg": [
"PRODUCT",
{},
[
[{"lower": "digg"}]
]
],
"FoxNews": [
"ORG",
{},
[
[{"orth": "Fox"}],
[{"orth": "News"}]
]
],
"Google": [
"ORG",
{},
[
[{"lower": "google"}]
]
],
"Mac": [
"PRODUCT",
{},
[
[{"lower": "mac"}]
]
],
"Wikipedia": [
"PRODUCT",
{},
[
[{"lower": "wikipedia"}]
]
],
"Windows": [
"PRODUCT",
{},
[
[{"orth": "Windows"}]
]
],
"Dell": [
"ORG",
{},
[
[{"lower": "dell"}]
]
],
"Facebook": [
"ORG",
{},
[
[{"lower": "facebook"}]
]
],
"Blizzard": [
"ORG",
{},
[
[{"orth": "Facebook"}]
]
],
"Ubuntu": [
"ORG",
{},
[
[{"orth": "Ubuntu"}]
]
],
"Youtube": [
"PRODUCT",
{},
[
[{"lower": "youtube"}]
]
],
"false_positives": [
null,
{},
[
[{"orth": "Shit"}],
[{"orth": "Weed"}],
[{"orth": "Cool"}],
[{"orth": "Btw"}],
[{"orth": "Bah"}],
[{"orth": "Bullshit"}],
[{"orth": "Lol"}],
[{"orth": "Yo"}, {"lower": "dawg"}],
[{"orth": "Yay"}],
[{"orth": "Ahh"}],
[{"orth": "Yea"}],
[{"orth": "Bah"}]
]
]
}

View File

@ -1,31 +0,0 @@
{
"noun": [
["s", ""],
["ses", "s"],
["ves", "f"],
["xes", "x"],
["zes", "z"],
["ches", "ch"],
["shes", "sh"],
["men", "man"],
["ies", "y"]
],
"verb": [
["s", ""],
["ies", "y"],
["es", "e"],
["es", ""],
["ed", "e"],
["ed", ""],
["ing", "e"],
["ing", ""]
],
"adj": [
["er", ""],
["est", ""],
["er", "e"],
["est", "e"]
]
}

Binary file not shown.

View File

@ -1 +0,0 @@
-20.000000

Some files were not shown because too many files have changed in this diff Show More