Mirror of https://github.com/explosion/spaCy.git, synced 2025-07-06 21:03:07 +03:00

Commit daaa42dd25: Merge remote-tracking branch 'upstream/master'
@@ -87,7 +87,16 @@ Code should loosely follow [pep8](https://www.python.org/dev/peps/pep-0008/). Re

### Python conventions

All Python code must be written in an **intersection of Python 2 and Python 3**. This is easy in Cython, but somewhat ugly in Python. We could use some extra utilities for this. Please pay particular attention to code that serialises json objects.

All Python code must be written in an **intersection of Python 2 and Python 3**. This is easy in Cython, but somewhat ugly in Python. Logic that deals with Python or platform compatibility should only live in [`spacy.compat`](spacy/compat.py). To distinguish them from the builtin functions, replacement functions are suffixed with an underscore, for example `unicode_`. If you need to access the user's version or platform information, for example to show more specific error messages, you can use the `is_config()` helper function.
```python
from .compat import unicode_, json_dumps, is_config

compatible_unicode = unicode_('hello world')
compatible_json = json_dumps({'key': 'value'})

if is_config(windows=True, python2=True):
    print("You are using Python 2 on Windows.")
```

Code that interacts with the file-system should accept objects that follow the `pathlib.Path` API, without assuming that the object inherits from `pathlib.Path`. If the function is user-facing and takes a path as an argument, it should check whether the path is provided as a string. Strings should be converted to `pathlib.Path` objects.
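A minimal sketch of that convention, assuming the `basestring_` alias from `spacy/compat.py` (added later in this commit); the `read_strings` helper itself is hypothetical, not actual spaCy code:

```python
from pathlib import Path

from .compat import basestring_


def read_strings(location):
    # Accept any object that follows the pathlib.Path API; user-facing
    # functions additionally accept plain strings and convert them.
    if isinstance(location, basestring_):
        location = Path(location)
    with location.open('r', encoding='utf8') as file_:
        return [line.strip() for line in file_]
```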
@@ -95,6 +104,8 @@ At the time of writing (v1.7), spaCy's serialization and deserialization functio

Although spaCy uses a lot of classes, inheritance is viewed with some suspicion — it's seen as a mechanism of last resort. You should discuss plans to extend the class hierarchy before implementing.

We have a number of conventions around variable naming that are still being documented, and aren't 100% strict. A general policy is that instances of the class `Doc` should by default be called `doc`, `Token` `token`, `Lexeme` `lex`, `Vocab` `vocab` and `Language` `nlp`. You should avoid giving these names to variables of other types. For instance, don't name a text string `doc` — you should usually call this `text`. Two general code style preferences further help with naming. First, lean away from introducing temporary variables, as these clutter your namespace. This is one reason why comprehension expressions are often preferred. Second, keep your functions shortish, so that you can work in a smaller scope. Of course, this is a question of trade-offs.

### Cython conventions

spaCy's core data structures are implemented as [Cython](http://cython.org/) `cdef` classes. Memory is managed through the `cymem.cymem.Pool` class, which allows you to allocate memory which will be freed when the `Pool` object is garbage collected. This means you usually don't have to worry about freeing memory. You just have to decide which Python object owns the memory, and make it own the `Pool`. When that object goes out of scope, the memory will be freed. You do have to take care that no pointers outlive the object that owns them — but this is generally quite easy.
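A rough Cython sketch of that ownership pattern, assuming a hypothetical `IntBuffer` class (not actual spaCy code):

```cython
from cymem.cymem cimport Pool

cdef class IntBuffer:
    cdef Pool mem
    cdef int* data

    def __init__(self, int size):
        # The Python-level object owns the Pool, and the Pool owns the raw
        # allocation: when IntBuffer is garbage collected, the Pool frees data.
        self.mem = Pool()
        self.data = <int*>self.mem.alloc(size, sizeof(int))
```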
@@ -126,7 +137,7 @@ cdef int c_total(const int* int_array, int length) nogil:

    return total
```

If this is confusing, consider that the compiler couldn't deal with `for item in int_array:` — there's no length attached to a raw pointer, so how could we figure out where to stop? The length is provided in the slice notation as a solution to this. Note that we don't have to declare the type of `item` in the code above -- the compiler can easily infer it. This gives us tidy code that looks quite like Python, but is exactly as fast as C — because we've made sure the compilation to C is trivial.

If this is confusing, consider that the compiler couldn't deal with `for item in int_array:` — there's no length attached to a raw pointer, so how could we figure out where to stop? The length is provided in the slice notation as a solution to this. Note that we don't have to declare the type of `item` in the code above — the compiler can easily infer it. This gives us tidy code that looks quite like Python, but is exactly as fast as C — because we've made sure the compilation to C is trivial.

Your functions cannot be declared `nogil` if they need to create Python objects or call Python functions. This is perfectly okay — you shouldn't torture your code just to get `nogil` functions. However, if your function isn't `nogil`, you should compile your module with `cython -a --cplus my_module.pyx` and open the resulting `my_module.html` file in a browser. This will let you see how Cython is compiling your code. Calls into the Python run-time will be in bright yellow. This lets you easily see whether Cython is able to correctly type your code, or whether there are unexpected problems.

@@ -10,7 +10,7 @@ open-source software, released under the MIT license.

📊 **Help us improve the library!** `Take the spaCy user survey <https://survey.spacy.io>`_.

💫 **Version 1.7 out now!** `Read the release notes here. <https://github.com/explosion/spaCy/releases/>`_

💫 **Version 1.8 out now!** `Read the release notes here. <https://github.com/explosion/spaCy/releases/>`_

.. image:: https://img.shields.io/travis/explosion/spaCy/master.svg?style=flat-square
    :target: https://travis-ci.org/explosion/spaCy
@@ -320,6 +320,7 @@ and ``--model`` are optional and enable additional tests:

=========== ============== ===========
Version     Date           Description
=========== ============== ===========
`v1.8.0`_   ``2017-04-16`` Better NER training, saving and loading
`v1.7.5`_   ``2017-04-07`` Bug fixes and new CLI commands
`v1.7.3`_   ``2017-03-26`` Alpha support for Hebrew, new CLI commands and bug fixes
`v1.7.2`_   ``2017-03-20`` Small fixes to beam parser and model linking

@@ -350,6 +351,7 @@ Version Date Description

`v0.93`_    ``2015-09-22`` Bug fixes to word vectors
=========== ============== ===========

.. _v1.8.0: https://github.com/explosion/spaCy/releases/tag/v1.8.0
.. _v1.7.5: https://github.com/explosion/spaCy/releases/tag/v1.7.5
.. _v1.7.3: https://github.com/explosion/spaCy/releases/tag/v1.7.3
.. _v1.7.2: https://github.com/explosion/spaCy/releases/tag/v1.7.2

@@ -1,7 +1,8 @@

'''Print part-of-speech tagged, true-cased, (very roughly) sentence-separated

"""
Print part-of-speech tagged, true-cased, (very roughly) sentence-separated
text, with each "sentence" on a newline, and spaces between tokens. Supports
multi-processing.
'''
"""
from __future__ import print_function, unicode_literals, division
import io
import bz2

@@ -22,14 +23,14 @@ def parallelize(func, iterator, n_jobs, extra):

def iter_texts_from_json_bz2(loc):
    '''
    """
    Iterator of unicode strings, one per document (here, a comment).

    Expects a path to a BZ2 file, which should be new-line delimited JSON. The
    document text should be in a string field titled 'body'.

    This is the data format of the Reddit comments corpus.
    '''
    """
    with bz2.BZ2File(loc) as file_:
        for i, line in enumerate(file_):
            yield ujson.loads(line)['body']

@@ -80,7 +81,7 @@ def is_sent_begin(word):

def main(in_loc, out_dir, n_workers=4, batch_size=100000):
    if not path.exists(out_dir):
        path.join(out_dir)
    texts = partition(batch_size, iter_texts(in_loc))
    texts = partition(batch_size, iter_texts_from_json_bz2(in_loc))
    parallelize(transform_texts, enumerate(texts), n_workers, [out_dir])

@ -1,3 +1,4 @@
|
||||||
|
#!/usr/bin/env python
|
||||||
'''Example of training a named entity recognition system from scratch using spaCy
|
'''Example of training a named entity recognition system from scratch using spaCy
|
||||||
|
|
||||||
This example is written to be self-contained and reasonably transparent.
|
This example is written to be self-contained and reasonably transparent.
|
||||||
|
@ -81,7 +82,7 @@ def load_vocab(path):
|
||||||
def init_ner_model(vocab, features=None):
|
def init_ner_model(vocab, features=None):
|
||||||
if features is None:
|
if features is None:
|
||||||
features = tuple(EntityRecognizer.feature_templates)
|
features = tuple(EntityRecognizer.feature_templates)
|
||||||
return BeamEntityRecognizer(vocab, features=features)
|
return EntityRecognizer(vocab, features=features)
|
||||||
|
|
||||||
|
|
||||||
def save_ner_model(model, path):
|
def save_ner_model(model, path):
|
||||||
|
@ -99,7 +100,7 @@ def save_ner_model(model, path):
|
||||||
|
|
||||||
|
|
||||||
def load_ner_model(vocab, path):
|
def load_ner_model(vocab, path):
|
||||||
return BeamEntityRecognizer.load(path, vocab)
|
return EntityRecognizer.load(path, vocab)
|
||||||
|
|
||||||
|
|
||||||
class Pipeline(object):
|
class Pipeline(object):
|
||||||
|
@ -110,18 +111,21 @@ class Pipeline(object):
|
||||||
raise IOError("Cannot load pipeline from %s\nDoes not exist" % path)
|
raise IOError("Cannot load pipeline from %s\nDoes not exist" % path)
|
||||||
if not path.is_dir():
|
if not path.is_dir():
|
||||||
raise IOError("Cannot load pipeline from %s\nNot a directory" % path)
|
raise IOError("Cannot load pipeline from %s\nNot a directory" % path)
|
||||||
vocab = load_vocab(path / 'vocab')
|
vocab = load_vocab(path)
|
||||||
tokenizer = Tokenizer(vocab, {}, None, None, None)
|
tokenizer = Tokenizer(vocab, {}, None, None, None)
|
||||||
ner_model = load_ner_model(vocab, path / 'ner')
|
ner_model = load_ner_model(vocab, path / 'ner')
|
||||||
return cls(vocab, tokenizer, ner_model)
|
return cls(vocab, tokenizer, ner_model)
|
||||||
|
|
||||||
def __init__(self, vocab=None, tokenizer=None, ner_model=None):
|
def __init__(self, vocab=None, tokenizer=None, entity=None):
|
||||||
if vocab is None:
|
if vocab is None:
|
||||||
self.vocab = init_vocab()
|
vocab = init_vocab()
|
||||||
if tokenizer is None:
|
if tokenizer is None:
|
||||||
tokenizer = Tokenizer(vocab, {}, None, None, None)
|
tokenizer = Tokenizer(vocab, {}, None, None, None)
|
||||||
if ner_model is None:
|
if entity is None:
|
||||||
self.entity = init_ner_model(self.vocab)
|
entity = init_ner_model(self.vocab)
|
||||||
|
self.vocab = vocab
|
||||||
|
self.tokenizer = tokenizer
|
||||||
|
self.entity = entity
|
||||||
self.pipeline = [self.entity]
|
self.pipeline = [self.entity]
|
||||||
|
|
||||||
def __call__(self, input_):
|
def __call__(self, input_):
|
||||||
|
@ -173,7 +177,7 @@ class Pipeline(object):
|
||||||
save_ner_model(self.entity, path / 'ner')
|
save_ner_model(self.entity, path / 'ner')
|
||||||
|
|
||||||
|
|
||||||
def train(nlp, train_examples, dev_examples, nr_epoch=5):
|
def train(nlp, train_examples, dev_examples, ctx, nr_epoch=5):
|
||||||
next_epoch = train_examples
|
next_epoch = train_examples
|
||||||
print("Iter", "Loss", "P", "R", "F")
|
print("Iter", "Loss", "P", "R", "F")
|
||||||
for i in range(nr_epoch):
|
for i in range(nr_epoch):
|
||||||
|
@ -186,14 +190,17 @@ def train(nlp, train_examples, dev_examples, nr_epoch=5):
|
||||||
next_epoch.append((input_, annot))
|
next_epoch.append((input_, annot))
|
||||||
random.shuffle(next_epoch)
|
random.shuffle(next_epoch)
|
||||||
scores = nlp.evaluate(dev_examples)
|
scores = nlp.evaluate(dev_examples)
|
||||||
|
report_scores(i, loss, scores)
|
||||||
|
nlp.average_weights()
|
||||||
|
scores = nlp.evaluate(dev_examples)
|
||||||
|
report_scores(channels, i+1, loss, scores)
|
||||||
|
|
||||||
|
|
||||||
|
def report_scores(i, loss, scores):
|
||||||
precision = '%.2f' % scores['ents_p']
|
precision = '%.2f' % scores['ents_p']
|
||||||
recall = '%.2f' % scores['ents_r']
|
recall = '%.2f' % scores['ents_r']
|
||||||
f_measure = '%.2f' % scores['ents_f']
|
f_measure = '%.2f' % scores['ents_f']
|
||||||
print(i, int(loss), precision, recall, f_measure)
|
print('%d %s %s %s' % (int(loss), precision, recall, f_measure))
|
||||||
nlp.average_weights()
|
|
||||||
scores = nlp.evaluate(dev_examples)
|
|
||||||
print("After averaging")
|
|
||||||
print(scores['ents_p'], scores['ents_r'], scores['ents_f'])
|
|
||||||
|
|
||||||
|
|
||||||
def read_examples(path):
|
def read_examples(path):
|
||||||
|
@ -221,15 +228,17 @@ def read_examples(path):
|
||||||
train_loc=("Path to your training data", "positional", None, Path),
|
train_loc=("Path to your training data", "positional", None, Path),
|
||||||
dev_loc=("Path to your development data", "positional", None, Path),
|
dev_loc=("Path to your development data", "positional", None, Path),
|
||||||
)
|
)
|
||||||
def main(model_dir, train_loc, dev_loc, nr_epoch=10):
|
def main(model_dir=Path('/home/matt/repos/spaCy/spacy/data/de-1.0.0'),
|
||||||
|
train_loc=None, dev_loc=None, nr_epoch=30):
|
||||||
|
|
||||||
train_examples = read_examples(train_loc)
|
train_examples = read_examples(train_loc)
|
||||||
dev_examples = read_examples(dev_loc)
|
dev_examples = read_examples(dev_loc)
|
||||||
nlp = Pipeline()
|
nlp = Pipeline.load(model_dir)
|
||||||
|
|
||||||
train(nlp, train_examples, list(dev_examples), nr_epoch)
|
train(nlp, train_examples, list(dev_examples), ctx, nr_epoch)
|
||||||
|
|
||||||
nlp.save(model_dir)
|
nlp.save(model_dir)
|
||||||
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
if __name__ == '__main__':
|
||||||
plac.call(main)
|
main()
|
examples/training/train_new_entity_type.py (new file, 103 lines)

@@ -0,0 +1,103 @@
#!/usr/bin/env python
"""
Example of training an additional entity type

This script shows how to add a new entity type to an existing pre-trained NER
model. To keep the example short and simple, only four sentences are provided
as examples. In practice, you'll need many more — a few hundred would be a
good start. You will also likely need to mix in examples of other entity
types, which might be obtained by running the entity recognizer over unlabelled
sentences, and adding their annotations to the training set.

The actual training is performed by looping over the examples, and calling
`nlp.entity.update()`. The `update()` method steps through the words of the
input. At each word, it makes a prediction. It then consults the annotations
provided on the GoldParse instance, to see whether it was right. If it was
wrong, it adjusts its weights so that the correct action will score higher
next time.

After training your model, you can save it to a directory. We recommend
wrapping models as Python packages, for ease of deployment.

For more details, see the documentation:
* Training the Named Entity Recognizer: https://spacy.io/docs/usage/train-ner
* Saving and loading models: https://spacy.io/docs/usage/saving-loading

Developed for: spaCy 1.7.6
Last tested for: spaCy 1.7.6
"""
# coding: utf8
from __future__ import unicode_literals, print_function

import random
from pathlib import Path

import spacy
from spacy.gold import GoldParse
from spacy.tagger import Tagger


def train_ner(nlp, train_data, output_dir):
    # Add new words to vocab
    for raw_text, _ in train_data:
        doc = nlp.make_doc(raw_text)
        for word in doc:
            _ = nlp.vocab[word.orth]

    for itn in range(20):
        random.shuffle(train_data)
        for raw_text, entity_offsets in train_data:
            gold = GoldParse(doc, entities=entity_offsets)
            doc = nlp.make_doc(raw_text)
            nlp.tagger(doc)
            loss = nlp.entity.update(doc, gold)
    nlp.end_training()
    if output_dir:
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.save_to_directory(output_dir)


def main(model_name, output_directory=None):
    print("Loading initial model", model_name)
    nlp = spacy.load(model_name)
    if output_directory is not None:
        output_directory = Path(output_directory)

    train_data = [
        (
            "Horses are too tall and they pretend to care about your feelings",
            [(0, 6, 'ANIMAL')],
        ),
        (
            "horses are too tall and they pretend to care about your feelings",
            [(0, 6, 'ANIMAL')]
        ),
        (
            "horses pretend to care about your feelings",
            [(0, 6, 'ANIMAL')]
        ),
        (
            "they pretend to care about your feelings, those horses",
            [(48, 54, 'ANIMAL')]
        )
    ]
    nlp.entity.add_label('ANIMAL')
    train_ner(nlp, train_data, output_directory)

    # Test that the entity is recognized
    doc = nlp('Do you like horses?')
    for ent in doc.ents:
        print(ent.label_, ent.text)
    if output_directory:
        print("Loading from", output_directory)
        nlp2 = spacy.load('en', path=output_directory)
        nlp2.entity.add_label('ANIMAL')
        doc2 = nlp2('Do you like horses?')
        for ent in doc2.ents:
            print(ent.label_, ent.text)


if __name__ == '__main__':
    import plac
    plac.call(main)

fabfile.py (vendored, 2 changed lines)

@@ -14,7 +14,7 @@ VENV_DIR = path.join(PWD, ENV)

def env(lang='python2.7'):
    if path.exists(VENV_DIR):
        local('rm -rf {env}'.format(env=VENV_DIR))
    local('virtualenv -p {lang} {env}'.format(lang=lang, env=VENV_DIR))
    local('python -m virtualenv -p {lang} {env}'.format(lang=lang, env=VENV_DIR))


def install():

@@ -11,3 +11,4 @@ ujson>=1.35

dill>=0.2,<0.3
requests>=2.13.0,<3.0.0
regex==2017.4.5
pytest>=3.0.6,<4.0.0

@ -1,53 +1,40 @@
|
||||||
# coding: utf8
|
# coding: utf8
|
||||||
from __future__ import unicode_literals, print_function
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import json
|
from . import util
|
||||||
from pathlib import Path
|
|
||||||
from .util import set_lang_class, get_lang_class, parse_package_meta
|
|
||||||
from .deprecated import resolve_model_name
|
from .deprecated import resolve_model_name
|
||||||
from .cli import info
|
from .cli import info
|
||||||
|
|
||||||
from . import en
|
from . import en, de, zh, es, it, hu, fr, pt, nl, sv, fi, bn, he
|
||||||
from . import de
|
|
||||||
from . import zh
|
|
||||||
from . import es
|
|
||||||
from . import it
|
|
||||||
from . import hu
|
|
||||||
from . import fr
|
|
||||||
from . import pt
|
|
||||||
from . import nl
|
|
||||||
from . import sv
|
|
||||||
from . import fi
|
|
||||||
from . import bn
|
|
||||||
from . import he
|
|
||||||
|
|
||||||
from .about import *
|
|
||||||
|
|
||||||
|
|
||||||
set_lang_class(en.English.lang, en.English)
|
_languages = (en.English, de.German, es.Spanish, pt.Portuguese, fr.French,
|
||||||
set_lang_class(de.German.lang, de.German)
|
it.Italian, hu.Hungarian, zh.Chinese, nl.Dutch, sv.Swedish,
|
||||||
set_lang_class(es.Spanish.lang, es.Spanish)
|
fi.Finnish, bn.Bengali, he.Hebrew)
|
||||||
set_lang_class(pt.Portuguese.lang, pt.Portuguese)
|
|
||||||
set_lang_class(fr.French.lang, fr.French)
|
|
||||||
set_lang_class(it.Italian.lang, it.Italian)
|
for _lang in _languages:
|
||||||
set_lang_class(hu.Hungarian.lang, hu.Hungarian)
|
util.set_lang_class(_lang.lang, _lang)
|
||||||
set_lang_class(zh.Chinese.lang, zh.Chinese)
|
|
||||||
set_lang_class(nl.Dutch.lang, nl.Dutch)
|
|
||||||
set_lang_class(sv.Swedish.lang, sv.Swedish)
|
|
||||||
set_lang_class(fi.Finnish.lang, fi.Finnish)
|
|
||||||
set_lang_class(bn.Bengali.lang, bn.Bengali)
|
|
||||||
set_lang_class(he.Hebrew.lang, he.Hebrew)
|
|
||||||
|
|
||||||
|
|
||||||
def load(name, **overrides):
|
def load(name, **overrides):
|
||||||
data_path = overrides.get('path', util.get_data_path())
|
if overrides.get('path') in (None, False, True):
|
||||||
|
data_path = util.get_data_path()
|
||||||
model_name = resolve_model_name(name)
|
model_name = resolve_model_name(name)
|
||||||
meta = parse_package_meta(data_path, model_name, require=False)
|
model_path = data_path / model_name
|
||||||
|
if not model_path.exists():
|
||||||
|
lang_name = util.get_lang_class(name).lang
|
||||||
|
model_path = None
|
||||||
|
util.print_msg(
|
||||||
|
"Only loading the '{}' tokenizer.".format(lang_name),
|
||||||
|
title="Warning: no model found for '{}'".format(name))
|
||||||
|
else:
|
||||||
|
model_path = util.ensure_path(overrides['path'])
|
||||||
|
data_path = model_path.parent
|
||||||
|
model_name = ''
|
||||||
|
meta = util.parse_package_meta(data_path, model_name, require=False)
|
||||||
lang = meta['lang'] if meta and 'lang' in meta else name
|
lang = meta['lang'] if meta and 'lang' in meta else name
|
||||||
cls = get_lang_class(lang)
|
cls = util.get_lang_class(lang)
|
||||||
overrides['meta'] = meta
|
overrides['meta'] = meta
|
||||||
model_path = Path(data_path / model_name)
|
|
||||||
if model_path.exists():
|
|
||||||
overrides['path'] = model_path
|
overrides['path'] = model_path
|
||||||
|
|
||||||
return cls(**overrides)
|
return cls(**overrides)
|
||||||
|
|
|
@ -14,8 +14,9 @@ from spacy.cli import convert as cli_convert
|
||||||
|
|
||||||
|
|
||||||
class CLI(object):
|
class CLI(object):
|
||||||
"""Command-line interface for spaCy"""
|
"""
|
||||||
|
Command-line interface for spaCy
|
||||||
|
"""
|
||||||
commands = ('download', 'link', 'info', 'package', 'train', 'model', 'convert')
|
commands = ('download', 'link', 'info', 'package', 'train', 'model', 'convert')
|
||||||
|
|
||||||
@plac.annotations(
|
@plac.annotations(
|
||||||
|
@ -29,7 +30,6 @@ class CLI(object):
|
||||||
can be shortcut, model name or, if --direct flag is set, full model name
|
can be shortcut, model name or, if --direct flag is set, full model name
|
||||||
with version.
|
with version.
|
||||||
"""
|
"""
|
||||||
|
|
||||||
cli_download(model, direct)
|
cli_download(model, direct)
|
||||||
|
|
||||||
|
|
||||||
|
@ -44,7 +44,6 @@ class CLI(object):
|
||||||
either the name of a pip package, or the local path to the model data
|
either the name of a pip package, or the local path to the model data
|
||||||
directory. Linking models allows loading them via spacy.load(link_name).
|
directory. Linking models allows loading them via spacy.load(link_name).
|
||||||
"""
|
"""
|
||||||
|
|
||||||
cli_link(origin, link_name, force)
|
cli_link(origin, link_name, force)
|
||||||
|
|
||||||
|
|
||||||
|
@ -58,23 +57,22 @@ class CLI(object):
|
||||||
specified as an argument, print model information. Flag --markdown
|
specified as an argument, print model information. Flag --markdown
|
||||||
prints details in Markdown for easy copy-pasting to GitHub issues.
|
prints details in Markdown for easy copy-pasting to GitHub issues.
|
||||||
"""
|
"""
|
||||||
|
|
||||||
cli_info(model, markdown)
|
cli_info(model, markdown)
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
@plac.annotations(
|
||||||
input_dir=("directory with model data", "positional", None, str),
|
input_dir=("directory with model data", "positional", None, str),
|
||||||
output_dir=("output parent directory", "positional", None, str),
|
output_dir=("output parent directory", "positional", None, str),
|
||||||
|
meta=("path to meta.json", "option", "m", str),
|
||||||
force=("force overwriting of existing folder in output directory", "flag", "f", bool)
|
force=("force overwriting of existing folder in output directory", "flag", "f", bool)
|
||||||
)
|
)
|
||||||
def package(self, input_dir, output_dir, force=False):
|
def package(self, input_dir, output_dir, meta=None, force=False):
|
||||||
"""
|
"""
|
||||||
Generate Python package for model data, including meta and required
|
Generate Python package for model data, including meta and required
|
||||||
installation files. A new directory will be created in the specified
|
installation files. A new directory will be created in the specified
|
||||||
output directory, and model data will be copied over.
|
output directory, and model data will be copied over.
|
||||||
"""
|
"""
|
||||||
|
cli_package(input_dir, output_dir, meta, force)
|
||||||
cli_package(input_dir, output_dir, force)
|
|
||||||
|
|
||||||
|
|
||||||
@plac.annotations(
|
@plac.annotations(
|
||||||
|
@ -93,7 +91,6 @@ class CLI(object):
|
||||||
"""
|
"""
|
||||||
Train a model. Expects data in spaCy's JSON format.
|
Train a model. Expects data in spaCy's JSON format.
|
||||||
"""
|
"""
|
||||||
|
|
||||||
cli_train(lang, output_dir, train_data, dev_data, n_iter, not no_tagger,
|
cli_train(lang, output_dir, train_data, dev_data, n_iter, not no_tagger,
|
||||||
not no_parser, not no_ner, parser_L1)
|
not no_parser, not no_ner, parser_L1)
|
||||||
|
|
||||||
|
@ -108,7 +105,6 @@ class CLI(object):
|
||||||
"""
|
"""
|
||||||
Initialize a new model and its data directory.
|
Initialize a new model and its data directory.
|
||||||
"""
|
"""
|
||||||
|
|
||||||
cli_model(lang, model_dir, freqs_data, clusters_data, vectors_data)
|
cli_model(lang, model_dir, freqs_data, clusters_data, vectors_data)
|
||||||
|
|
||||||
@plac.annotations(
|
@plac.annotations(
|
||||||
|
@ -122,7 +118,6 @@ class CLI(object):
|
||||||
Convert files into JSON format for use with train command and other
|
Convert files into JSON format for use with train command and other
|
||||||
experiment management functions.
|
experiment management functions.
|
||||||
"""
|
"""
|
||||||
|
|
||||||
cli_convert(input_file, output_dir, n_sents, morphology)
|
cli_convert(input_file, output_dir, n_sents, morphology)
|
||||||
|
|
||||||
|
|
||||||
|
|
|
@@ -3,7 +3,7 @@
# https://github.com/pypa/warehouse/blob/master/warehouse/__about__.py

__title__ = 'spacy'
__version__ = '1.7.5'
__version__ = '1.8.0'
__summary__ = 'Industrial-strength Natural Language Processing (NLP) with Python and Cython'
__uri__ = 'https://spacy.io'
__author__ = 'Matthew Honnibal'

@ -1,3 +1,7 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
|
||||||
IDS = {
|
IDS = {
|
||||||
"": NULL_ATTR,
|
"": NULL_ATTR,
|
||||||
"IS_ALPHA": IS_ALPHA,
|
"IS_ALPHA": IS_ALPHA,
|
||||||
|
@ -92,7 +96,8 @@ NAMES = [key for key, value in sorted(IDS.items(), key=lambda item: item[1])]
|
||||||
|
|
||||||
|
|
||||||
def intify_attrs(stringy_attrs, strings_map=None, _do_deprecated=False):
|
def intify_attrs(stringy_attrs, strings_map=None, _do_deprecated=False):
|
||||||
'''Normalize a dictionary of attributes, converting them to ints.
|
"""
|
||||||
|
Normalize a dictionary of attributes, converting them to ints.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
stringy_attrs (dict):
|
stringy_attrs (dict):
|
||||||
|
@ -105,7 +110,7 @@ def intify_attrs(stringy_attrs, strings_map=None, _do_deprecated=False):
|
||||||
inty_attrs (dict):
|
inty_attrs (dict):
|
||||||
Attributes dictionary with keys and optionally values converted to
|
Attributes dictionary with keys and optionally values converted to
|
||||||
ints.
|
ints.
|
||||||
'''
|
"""
|
||||||
inty_attrs = {}
|
inty_attrs = {}
|
||||||
if _do_deprecated:
|
if _do_deprecated:
|
||||||
if 'F' in stringy_attrs:
|
if 'F' in stringy_attrs:
|
||||||
|
|
|
@ -1,3 +1,6 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
from libc.stdio cimport fopen, fclose, fread, fwrite
|
from libc.stdio cimport fopen, fclose, fread, fwrite
|
||||||
from libc.string cimport memcpy
|
from libc.string cimport memcpy
|
||||||
|
|
||||||
|
|
|
@ -1,8 +1,7 @@
|
||||||
# coding: utf8
|
# coding: utf8
|
||||||
from __future__ import unicode_literals, division, print_function
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import io
|
from pathlib import Path
|
||||||
from pathlib import Path, PurePosixPath
|
|
||||||
|
|
||||||
from .converters import conllu2json
|
from .converters import conllu2json
|
||||||
from .. import util
|
from .. import util
|
||||||
|
|
|
@ -1,13 +1,14 @@
|
||||||
# coding: utf8
|
# coding: utf8
|
||||||
from __future__ import unicode_literals, division, print_function
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import json
|
import json
|
||||||
from ...gold import read_json_file, merge_sents
|
from ...compat import json_dumps
|
||||||
from ... import util
|
from ... import util
|
||||||
|
|
||||||
|
|
||||||
def conllu2json(input_path, output_path, n_sents=10, use_morphology=False):
|
def conllu2json(input_path, output_path, n_sents=10, use_morphology=False):
|
||||||
"""Convert conllu files into JSON format for use with train cli.
|
"""
|
||||||
|
Convert conllu files into JSON format for use with train cli.
|
||||||
use_morphology parameter enables appending morphology to tags, which is
|
use_morphology parameter enables appending morphology to tags, which is
|
||||||
useful for languages such as Spanish, where UD tags are not so rich.
|
useful for languages such as Spanish, where UD tags are not so rich.
|
||||||
"""
|
"""
|
||||||
|
@ -29,7 +30,8 @@ def conllu2json(input_path, output_path, n_sents=10, use_morphology=False):
|
||||||
|
|
||||||
output_filename = input_path.parts[-1].replace(".conllu", ".json")
|
output_filename = input_path.parts[-1].replace(".conllu", ".json")
|
||||||
output_file = output_path / output_filename
|
output_file = output_path / output_filename
|
||||||
json.dump(docs, output_file.open('w', encoding='utf-8'), indent=2)
|
with output_file.open('w', encoding='utf-8') as f:
|
||||||
|
f.write(json_dumps(docs))
|
||||||
util.print_msg("Created {} documents".format(len(docs)),
|
util.print_msg("Created {} documents".format(len(docs)),
|
||||||
title="Generated output file {}".format(output_file))
|
title="Generated output file {}".format(output_file))
|
||||||
|
|
||||||
|
|
|
@ -1,7 +1,6 @@
|
||||||
# coding: utf8
|
# coding: utf8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import pip
|
|
||||||
import requests
|
import requests
|
||||||
import os
|
import os
|
||||||
import subprocess
|
import subprocess
|
||||||
|
|
|
@ -4,6 +4,7 @@ from __future__ import unicode_literals
|
||||||
import platform
|
import platform
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
|
|
||||||
|
from ..compat import unicode_
|
||||||
from .. import about
|
from .. import about
|
||||||
from .. import util
|
from .. import util
|
||||||
|
|
||||||
|
@ -13,12 +14,11 @@ def info(model=None, markdown=False):
|
||||||
data = util.parse_package_meta(util.get_data_path(), model, require=True)
|
data = util.parse_package_meta(util.get_data_path(), model, require=True)
|
||||||
model_path = Path(__file__).parent / util.get_data_path() / model
|
model_path = Path(__file__).parent / util.get_data_path() / model
|
||||||
if model_path.resolve() != model_path:
|
if model_path.resolve() != model_path:
|
||||||
data['link'] = str(model_path)
|
data['link'] = unicode_(model_path)
|
||||||
data['source'] = str(model_path.resolve())
|
data['source'] = unicode_(model_path.resolve())
|
||||||
else:
|
else:
|
||||||
data['source'] = str(model_path)
|
data['source'] = unicode_(model_path)
|
||||||
print_info(data, "model " + model, markdown)
|
print_info(data, "model " + model, markdown)
|
||||||
|
|
||||||
else:
|
else:
|
||||||
data = get_spacy_data()
|
data = get_spacy_data()
|
||||||
print_info(data, "spaCy", markdown)
|
print_info(data, "spaCy", markdown)
|
||||||
|
@ -26,10 +26,8 @@ def info(model=None, markdown=False):
|
||||||
|
|
||||||
def print_info(data, title, markdown):
|
def print_info(data, title, markdown):
|
||||||
title = "Info about {title}".format(title=title)
|
title = "Info about {title}".format(title=title)
|
||||||
|
|
||||||
if markdown:
|
if markdown:
|
||||||
util.print_markdown(data, title=title)
|
util.print_markdown(data, title=title)
|
||||||
|
|
||||||
else:
|
else:
|
||||||
util.print_table(data, title=title)
|
util.print_table(data, title=title)
|
||||||
|
|
||||||
|
@ -37,7 +35,7 @@ def print_info(data, title, markdown):
|
||||||
def get_spacy_data():
|
def get_spacy_data():
|
||||||
return {
|
return {
|
||||||
'spaCy version': about.__version__,
|
'spaCy version': about.__version__,
|
||||||
'Location': str(Path(__file__).parent.parent),
|
'Location': unicode_(Path(__file__).parent.parent),
|
||||||
'Platform': platform.platform(),
|
'Platform': platform.platform(),
|
||||||
'Python version': platform.python_version(),
|
'Python version': platform.python_version(),
|
||||||
'Installed models': ', '.join(list_models())
|
'Installed models': ', '.join(list_models())
|
||||||
|
@ -49,5 +47,6 @@ def list_models():
|
||||||
# won't show up in list, but it seems worth it
|
# won't show up in list, but it seems worth it
|
||||||
exclude = ['cache', 'pycache', '__pycache__']
|
exclude = ['cache', 'pycache', '__pycache__']
|
||||||
data_path = util.get_data_path()
|
data_path = util.get_data_path()
|
||||||
|
if data_path:
|
||||||
models = [f.parts[-1] for f in data_path.iterdir() if f.is_dir()]
|
models = [f.parts[-1] for f in data_path.iterdir() if f.is_dir()]
|
||||||
return [m for m in models if m not in exclude]
|
return [m for m in models if m not in exclude]
|
||||||
|
|
|
@ -4,6 +4,7 @@ from __future__ import unicode_literals
|
||||||
import pip
|
import pip
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
import importlib
|
import importlib
|
||||||
|
from ..compat import unicode_, symlink_to
|
||||||
from .. import util
|
from .. import util
|
||||||
|
|
||||||
|
|
||||||
|
@ -20,7 +21,6 @@ def link_package(package_name, link_name, force=False):
|
||||||
# Python's installation and import rules are very complicated.
|
# Python's installation and import rules are very complicated.
|
||||||
pkg = importlib.import_module(package_name)
|
pkg = importlib.import_module(package_name)
|
||||||
package_path = Path(pkg.__file__).parent.parent
|
package_path = Path(pkg.__file__).parent.parent
|
||||||
|
|
||||||
meta = get_meta(package_path, package_name)
|
meta = get_meta(package_path, package_name)
|
||||||
model_name = package_name + '-' + meta['version']
|
model_name = package_name + '-' + meta['version']
|
||||||
model_path = package_path / package_name / model_name
|
model_path = package_path / package_name / model_name
|
||||||
|
@ -29,7 +29,7 @@ def link_package(package_name, link_name, force=False):
|
||||||
|
|
||||||
def symlink(model_path, link_name, force):
|
def symlink(model_path, link_name, force):
|
||||||
model_path = Path(model_path)
|
model_path = Path(model_path)
|
||||||
if not Path(model_path).exists():
|
if not model_path.exists():
|
||||||
util.sys_exit(
|
util.sys_exit(
|
||||||
"The data should be located in {p}".format(p=model_path),
|
"The data should be located in {p}".format(p=model_path),
|
||||||
title="Can't locate model data")
|
title="Can't locate model data")
|
||||||
|
@ -43,13 +43,21 @@ def symlink(model_path, link_name, force):
|
||||||
elif link_path.exists():
|
elif link_path.exists():
|
||||||
link_path.unlink()
|
link_path.unlink()
|
||||||
|
|
||||||
# Add workaround for Python 2 on Windows (see issue #909)
|
try:
|
||||||
if util.is_python2() and util.is_windows():
|
symlink_to(link_path, model_path)
|
||||||
import subprocess
|
except:
|
||||||
command = ['mklink', '/d', link_path, model_path]
|
# This is quite dirty, but just making sure other errors are caught so
|
||||||
subprocess.call(command, shell=True)
|
# users at least see a proper message.
|
||||||
else:
|
util.print_msg(
|
||||||
link_path.symlink_to(model_path)
|
"Creating a symlink in spacy/data failed. Make sure you have the "
|
||||||
|
"required permissions and try re-running the command as admin, or "
|
||||||
|
"use a virtualenv to install spaCy in a user directory, instead of "
|
||||||
|
"doing a system installation.",
|
||||||
|
"You can still import the model as a Python package and call its "
|
||||||
|
"load() method, or create the symlink manually:",
|
||||||
|
"{a} --> {b}".format(a=unicode_(model_path), b=unicode_(link_path)),
|
||||||
|
title="Error: Couldn't link model to '{l}'".format(l=link_name))
|
||||||
|
raise
|
||||||
|
|
||||||
util.print_msg(
|
util.print_msg(
|
||||||
"{a} --> {b}".format(a=model_path.as_posix(), b=link_path.as_posix()),
|
"{a} --> {b}".format(a=model_path.as_posix(), b=link_path.as_posix()),
|
||||||
|
|
|
@ -95,7 +95,7 @@ def read_clusters(clusters_path):
|
||||||
return clusters
|
return clusters
|
||||||
|
|
||||||
|
|
||||||
def populate_vocab(vocab, clusters, probs, oov_probs):
|
def populate_vocab(vocab, clusters, probs, oov_prob):
|
||||||
# Ensure probs has entries for all words seen during clustering.
|
# Ensure probs has entries for all words seen during clustering.
|
||||||
for word in clusters:
|
for word in clusters:
|
||||||
if word not in probs:
|
if word not in probs:
|
||||||
|
|
|
@ -1,57 +1,67 @@
|
||||||
# coding: utf8
|
# coding: utf8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import json
|
|
||||||
import shutil
|
import shutil
|
||||||
import requests
|
import requests
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
|
|
||||||
from .. import about
|
from ..compat import unicode_, json_dumps
|
||||||
from .. import util
|
from .. import util
|
||||||
|
|
||||||
|
|
||||||
def package(input_dir, output_dir, force):
|
def package(input_dir, output_dir, meta_path, force):
|
||||||
input_path = Path(input_dir)
|
input_path = Path(input_dir)
|
||||||
output_path = Path(output_dir)
|
output_path = Path(output_dir)
|
||||||
check_dirs(input_path, output_path)
|
meta_path = util.ensure_path(meta_path)
|
||||||
|
check_dirs(input_path, output_path, meta_path)
|
||||||
|
|
||||||
template_setup = get_template('setup.py')
|
template_setup = get_template('setup.py')
|
||||||
template_manifest = get_template('MANIFEST.in')
|
template_manifest = get_template('MANIFEST.in')
|
||||||
template_init = get_template('en_model_name/__init__.py')
|
template_init = get_template('en_model_name/__init__.py')
|
||||||
|
|
||||||
|
meta_path = meta_path or input_path / 'meta.json'
|
||||||
|
if meta_path.is_file():
|
||||||
|
util.print_msg(unicode_(meta_path), title="Reading meta.json from file")
|
||||||
|
meta = util.read_json(meta_path)
|
||||||
|
else:
|
||||||
meta = generate_meta()
|
meta = generate_meta()
|
||||||
|
|
||||||
|
validate_meta(meta, ['lang', 'name', 'version'])
|
||||||
model_name = meta['lang'] + '_' + meta['name']
|
model_name = meta['lang'] + '_' + meta['name']
|
||||||
model_name_v = model_name + '-' + meta['version']
|
model_name_v = model_name + '-' + meta['version']
|
||||||
main_path = output_path / model_name_v
|
main_path = output_path / model_name_v
|
||||||
package_path = main_path / model_name
|
package_path = main_path / model_name
|
||||||
|
|
||||||
create_dirs(package_path, force)
|
create_dirs(package_path, force)
|
||||||
shutil.copytree(input_path.as_posix(), (package_path / model_name_v).as_posix())
|
shutil.copytree(unicode_(input_path), unicode_(package_path / model_name_v))
|
||||||
create_file(main_path / 'meta.json', json.dumps(meta, indent=2))
|
create_file(main_path / 'meta.json', json_dumps(meta))
|
||||||
create_file(main_path / 'setup.py', template_setup)
|
create_file(main_path / 'setup.py', template_setup)
|
||||||
create_file(main_path / 'MANIFEST.in', template_manifest)
|
create_file(main_path / 'MANIFEST.in', template_manifest)
|
||||||
create_file(package_path / '__init__.py', template_init)
|
create_file(package_path / '__init__.py', template_init)
|
||||||
|
|
||||||
util.print_msg(
|
util.print_msg(
|
||||||
main_path.as_posix(),
|
unicode_(main_path),
|
||||||
"To build the package, run `python setup.py sdist` in that directory.",
|
"To build the package, run `python setup.py sdist` in that directory.",
|
||||||
title="Successfully created package {p}".format(p=model_name_v))
|
title="Successfully created package {p}".format(p=model_name_v))
|
||||||
|
|
||||||
|
|
||||||
def check_dirs(input_path, output_path):
|
def check_dirs(input_path, output_path, meta_path):
|
||||||
if not input_path.exists():
|
if not input_path.exists():
|
||||||
util.sys_exit(input_path.as_poisx(), title="Model directory not found")
|
util.sys_exit(unicode_(input_path.as_poisx), title="Model directory not found")
|
||||||
if not output_path.exists():
|
if not output_path.exists():
|
||||||
util.sys_exit(output_path.as_posix(), title="Output directory not found")
|
util.sys_exit(unicode_(output_path), title="Output directory not found")
|
||||||
|
if meta_path and not meta_path.exists():
|
||||||
|
util.sys_exit(unicode_(meta_path), title="meta.json not found")
|
||||||
|
|
||||||
|
|
||||||
def create_dirs(package_path, force):
|
def create_dirs(package_path, force):
|
||||||
if package_path.exists():
|
if package_path.exists():
|
||||||
if force:
|
if force:
|
||||||
shutil.rmtree(package_path.as_posix())
|
shutil.rmtree(unicode_(package_path))
|
||||||
else:
|
else:
|
||||||
util.sys_exit(package_path.as_posix(),
|
util.sys_exit(unicode_(package_path),
|
||||||
"Please delete the directory and try again.",
|
"Please delete the directory and try again, or use the --force "
|
||||||
|
"flag to overwrite existing directories.",
|
||||||
title="Package directory already exists")
|
title="Package directory already exists")
|
||||||
Path.mkdir(package_path, parents=True)
|
Path.mkdir(package_path, parents=True)
|
||||||
|
|
||||||
|
@ -81,6 +91,14 @@ def generate_meta():
|
||||||
return meta
|
return meta
|
||||||
|
|
||||||
|
|
||||||
|
def validate_meta(meta, keys):
|
||||||
|
for key in keys:
|
||||||
|
if key not in meta or meta[key] == '':
|
||||||
|
util.sys_exit(
|
||||||
|
"This setting is required to build your package.",
|
||||||
|
title='No "{k}" setting found in meta.json'.format(k=key))
|
||||||
|
|
||||||
|
|
||||||
def get_template(filepath):
|
def get_template(filepath):
|
||||||
url = 'https://raw.githubusercontent.com/explosion/spacy-dev-resources/master/templates/model/'
|
url = 'https://raw.githubusercontent.com/explosion/spacy-dev-resources/master/templates/model/'
|
||||||
r = requests.get(url + filepath)
|
r = requests.get(url + filepath)
|
||||||
|
|
|
@ -2,11 +2,9 @@
|
||||||
from __future__ import unicode_literals, division, print_function
|
from __future__ import unicode_literals, division, print_function
|
||||||
|
|
||||||
import json
|
import json
|
||||||
from pathlib import Path
|
|
||||||
|
|
||||||
|
from ..util import ensure_path
|
||||||
from ..scorer import Scorer
|
from ..scorer import Scorer
|
||||||
from ..tagger import Tagger
|
|
||||||
from ..syntax.parser import Parser
|
|
||||||
from ..gold import GoldParse, merge_sents
|
from ..gold import GoldParse, merge_sents
|
||||||
from ..gold import read_json_file as read_gold_json
|
from ..gold import read_json_file as read_gold_json
|
||||||
from .. import util
|
from .. import util
|
||||||
|
@ -14,9 +12,9 @@ from .. import util
|
||||||
|
|
||||||
def train(language, output_dir, train_data, dev_data, n_iter, tagger, parser, ner,
|
def train(language, output_dir, train_data, dev_data, n_iter, tagger, parser, ner,
|
||||||
parser_L1):
|
parser_L1):
|
||||||
output_path = Path(output_dir)
|
output_path = ensure_path(output_dir)
|
||||||
train_path = Path(train_data)
|
train_path = ensure_path(train_data)
|
||||||
dev_path = Path(dev_data)
|
dev_path = ensure_path(dev_data)
|
||||||
check_dirs(output_path, train_path, dev_path)
|
check_dirs(output_path, train_path, dev_path)
|
||||||
|
|
||||||
lang = util.get_lang_class(language)
|
lang = util.get_lang_class(language)
|
||||||
|
@ -45,7 +43,7 @@ def train(language, output_dir, train_data, dev_data, n_iter, tagger, parser, ne
|
||||||
|
|
||||||
|
|
||||||
def train_config(config):
|
def train_config(config):
|
||||||
config_path = Path(config)
|
config_path = ensure_path(config)
|
||||||
if not config_path.is_file():
|
if not config_path.is_file():
|
||||||
util.sys_exit(config_path.as_posix(), title="Config file not found")
|
util.sys_exit(config_path.as_posix(), title="Config file not found")
|
||||||
config = json.load(config_path)
|
config = json.load(config_path)
|
||||||
|
@ -59,8 +57,8 @@ def train_model(Language, train_data, dev_data, output_path, tagger_cfg, parser_
|
||||||
entity_cfg, n_iter):
|
entity_cfg, n_iter):
|
||||||
print("Itn.\tN weight\tN feats\tUAS\tNER F.\tTag %\tToken %")
|
print("Itn.\tN weight\tN feats\tUAS\tNER F.\tTag %\tToken %")
|
||||||
|
|
||||||
with Language.train(output_path, train_data, tagger_cfg, parser_cfg, entity_cfg) as trainer:
|
with Language.train(output_path, train_data,
|
||||||
loss = 0
|
pos=tagger_cfg, deps=parser_cfg, ner=entity_cfg) as trainer:
|
||||||
for itn, epoch in enumerate(trainer.epochs(n_iter, augment_data=None)):
|
for itn, epoch in enumerate(trainer.epochs(n_iter, augment_data=None)):
|
||||||
for doc, gold in epoch:
|
for doc, gold in epoch:
|
||||||
trainer.update(doc, gold)
|
trainer.update(doc, gold)
|
||||||
|
|
spacy/compat.py (new file, 54 lines)

@@ -0,0 +1,54 @@
# coding: utf8
from __future__ import unicode_literals

import six
import sys
import ujson

try:
    import cPickle as pickle
except ImportError:
    import pickle

try:
    import copy_reg
except ImportError:
    import copyreg as copy_reg


is_python2 = six.PY2
is_python3 = six.PY3
is_windows = sys.platform.startswith('win')
is_linux = sys.platform.startswith('linux')
is_osx = sys.platform == 'darwin'


if is_python2:
    bytes_ = str
    unicode_ = unicode
    basestring_ = basestring
    input_ = raw_input
    json_dumps = lambda data: ujson.dumps(data, indent=2).decode('utf8')

elif is_python3:
    bytes_ = bytes
    unicode_ = str
    basestring_ = str
    input_ = input
    json_dumps = lambda data: ujson.dumps(data, indent=2)


def symlink_to(orig, dest):
    if is_python2 and is_windows:
        import subprocess
        subprocess.call(['mklink', '/d', unicode(orig), unicode(dest)], shell=True)
    else:
        orig.symlink_to(dest)


def is_config(python2=None, python3=None, windows=None, linux=None, osx=None):
    return ((python2 == None or python2 == is_python2) and
            (python3 == None or python3 == is_python3) and
            (windows == None or windows == is_windows) and
            (linux == None or linux == is_linux) and
            (osx == None or osx == is_osx))

@ -1,16 +1,14 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
|
|
||||||
from . import about
|
from . import about
|
||||||
from . import util
|
from . import util
|
||||||
from .cli import download
|
from .cli import download
|
||||||
from .cli import link
|
from .cli import link
|
||||||
|
|
||||||
|
|
||||||
try:
|
|
||||||
basestring
|
|
||||||
except NameError:
|
|
||||||
basestring = str
|
|
||||||
|
|
||||||
|
|
||||||
def read_lang_data(package):
|
def read_lang_data(package):
|
||||||
tokenization = package.load_json(('tokenizer', 'specials.json'))
|
tokenization = package.load_json(('tokenizer', 'specials.json'))
|
||||||
with package.open(('tokenizer', 'prefix.txt'), default=None) as file_:
|
with package.open(('tokenizer', 'prefix.txt'), default=None) as file_:
|
||||||
|
@ -36,7 +34,8 @@ def align_tokens(ref, indices): # Deprecated, surely?
|
||||||
|
|
||||||
|
|
||||||
def detokenize(token_rules, words): # Deprecated?
|
def detokenize(token_rules, words): # Deprecated?
|
||||||
"""To align with treebanks, return a list of "chunks", where a chunk is a
|
"""
|
||||||
|
To align with treebanks, return a list of "chunks", where a chunk is a
|
||||||
sequence of tokens that are separated by whitespace in actual strings. Each
|
sequence of tokens that are separated by whitespace in actual strings. Each
|
||||||
chunk should be a tuple of token indices, e.g.
|
chunk should be a tuple of token indices, e.g.
|
||||||
|
|
||||||
|
@ -57,10 +56,30 @@ def detokenize(token_rules, words): # Deprecated?
|
||||||
return positions
|
return positions
|
||||||
|
|
||||||
|
|
||||||
def fix_glove_vectors_loading(overrides):
|
def match_best_version(target_name, target_version, path):
|
||||||
"""Special-case hack for loading the GloVe vectors, to support deprecated
|
path = util.ensure_path(path)
|
||||||
<1.0 stuff. Phase this out once the data is fixed."""
|
if path is None or not path.exists():
|
||||||
|
return None
|
||||||
|
matches = []
|
||||||
|
for data_name in path.iterdir():
|
||||||
|
name, version = split_data_name(data_name.parts[-1])
|
||||||
|
if name == target_name:
|
||||||
|
matches.append((tuple(float(v) for v in version.split('.')), data_name))
|
||||||
|
if matches:
|
||||||
|
return Path(max(matches)[1])
|
||||||
|
else:
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
def split_data_name(name):
|
||||||
|
return name.split('-', 1) if '-' in name else (name, '')
|
||||||
|
|
||||||
|
|
||||||
|
def fix_glove_vectors_loading(overrides):
|
||||||
|
"""
|
||||||
|
Special-case hack for loading the GloVe vectors, to support deprecated
|
||||||
|
<1.0 stuff. Phase this out once the data is fixed.
|
||||||
|
"""
|
||||||
if 'data_dir' in overrides and 'path' not in overrides:
|
if 'data_dir' in overrides and 'path' not in overrides:
|
||||||
raise ValueError("The argument 'data_dir' has been renamed to 'path'")
|
raise ValueError("The argument 'data_dir' has been renamed to 'path'")
|
||||||
if overrides.get('path') is False:
|
if overrides.get('path') is False:
|
||||||
|
@ -68,18 +87,16 @@ def fix_glove_vectors_loading(overrides):
|
||||||
if overrides.get('path') in (None, True):
|
if overrides.get('path') in (None, True):
|
||||||
data_path = util.get_data_path()
|
data_path = util.get_data_path()
|
||||||
else:
|
else:
|
||||||
path = overrides['path']
|
path = util.ensure_path(overrides['path'])
|
||||||
if isinstance(path, basestring):
|
|
||||||
path = Path(path)
|
|
||||||
data_path = path.parent
|
data_path = path.parent
|
||||||
vec_path = None
|
vec_path = None
|
||||||
if 'add_vectors' not in overrides:
|
if 'add_vectors' not in overrides:
|
||||||
if 'vectors' in overrides:
|
if 'vectors' in overrides:
|
||||||
vec_path = util.match_best_version(overrides['vectors'], None, data_path)
|
vec_path = match_best_version(overrides['vectors'], None, data_path)
|
||||||
if vec_path is None:
|
if vec_path is None:
|
||||||
return overrides
|
return overrides
|
||||||
else:
|
else:
|
||||||
vec_path = util.match_best_version('en_glove_cc_300_1m_vectors', None, data_path)
|
vec_path = match_best_version('en_glove_cc_300_1m_vectors', None, data_path)
|
||||||
if vec_path is not None:
|
if vec_path is not None:
|
||||||
vec_path = vec_path / 'vocab' / 'vec.bin'
|
vec_path = vec_path / 'vocab' / 'vec.bin'
|
||||||
if vec_path is not None:
|
if vec_path is not None:
|
||||||
|
@@ -88,13 +105,13 @@ def fix_glove_vectors_loading(overrides):


 def resolve_model_name(name):
-    """If spaCy is loaded with 'de', check if symlink already exists. If
-    not, user have upgraded from older version and have old models installed.
+    """
+    If spaCy is loaded with 'de', check if symlink already exists. If
+    not, user may have upgraded from older version and have old models installed.
     Check if old model directory exists and if so, return that instead and create
     shortcut link. If English model is found and no shortcut exists, raise error
     and tell user to install new model.
     """
     if name == 'en' or name == 'de':
         versions = ['1.0.0', '1.1.0']
         data_path = Path(util.get_data_path())
@@ -117,9 +134,11 @@ def resolve_model_name(name):


 class ModelDownload():
-    """Replace download modules within en and de with deprecation warning and
+    """
+    Replace download modules within en and de with deprecation warning and
     download default language model (using shortcut). Use classmethods to allow
-    importing ModelDownload as download and calling download.en() etc."""
+    importing ModelDownload as download and calling download.en() etc.
+    """

     @classmethod
     def load(self, lang):
@@ -11,12 +11,6 @@ from ..deprecated import fix_glove_vectors_loading
 from .language_data import *


-try:
-    basestring
-except NameError:
-    basestring = str
-
-
 class English(Language):
     lang = 'en'

@@ -1,15 +1,13 @@
 # cython: profile=True
+# coding: utf8
 from __future__ import unicode_literals, print_function

 import io
-import json
 import re
-import os
-from os import path
-
-import ujson as json
+import ujson

 from .syntax import nonproj
+from .util import ensure_path


 def tags_to_entities(tags):
@@ -141,12 +139,13 @@ def _min_edit_path(cand_words, gold_words):


 def read_json_file(loc, docs_filter=None):
-    if path.isdir(loc):
-        for filename in os.listdir(loc):
-            yield from read_json_file(path.join(loc, filename))
+    loc = ensure_path(loc)
+    if loc.is_dir():
+        for filename in loc.iterdir():
+            yield from read_json_file(loc / filename)
     else:
-        with io.open(loc, 'r', encoding='utf8') as file_:
-            docs = json.load(file_)
+        with loc.open('r', encoding='utf8') as file_:
+            docs = ujson.load(file_)
         for doc in docs:
             if docs_filter is not None and not docs_filter(doc):
                 continue
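A standalone sketch of the same pathlib-based traversal, with the stdlib `json` module standing in for `ujson` so the snippet has no third-party dependency. The corpus paths in the commented usage are hypothetical, and the sketch also threads `docs_filter` through the recursive call.

```python
import json
from pathlib import Path


def read_json_file(loc, docs_filter=None):
    loc = Path(loc)
    if loc.is_dir():
        for filename in loc.iterdir():
            # iterdir() yields full Path objects, so we can recurse directly
            yield from read_json_file(filename, docs_filter)
    else:
        with loc.open('r', encoding='utf8') as file_:
            docs = json.load(file_)
        for doc in docs:
            if docs_filter is not None and not docs_filter(doc):
                continue
            yield doc


# Usage (hypothetical corpus layout):
# for doc in read_json_file('corpus/train', docs_filter=lambda d: d.get('id')):
#     ...
```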
@ -220,7 +219,8 @@ cdef class GoldParse:
|
||||||
|
|
||||||
def __init__(self, doc, annot_tuples=None, words=None, tags=None, heads=None,
|
def __init__(self, doc, annot_tuples=None, words=None, tags=None, heads=None,
|
||||||
deps=None, entities=None, make_projective=False):
|
deps=None, entities=None, make_projective=False):
|
||||||
"""Create a GoldParse.
|
"""
|
||||||
|
Create a GoldParse.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
doc (Doc):
|
doc (Doc):
|
||||||
|
@ -302,7 +302,8 @@ cdef class GoldParse:
|
||||||
self.heads = proj_heads
|
self.heads = proj_heads
|
||||||
|
|
||||||
def __len__(self):
|
def __len__(self):
|
||||||
"""Get the number of gold-standard tokens.
|
"""
|
||||||
|
Get the number of gold-standard tokens.
|
||||||
|
|
||||||
Returns (int): The number of gold-standard tokens.
|
Returns (int): The number of gold-standard tokens.
|
||||||
"""
|
"""
|
||||||
|
@ -310,13 +311,16 @@ cdef class GoldParse:
|
||||||
|
|
||||||
@property
|
@property
|
||||||
def is_projective(self):
|
def is_projective(self):
|
||||||
"""Whether the provided syntactic annotations form a projective dependency
|
"""
|
||||||
tree."""
|
Whether the provided syntactic annotations form a projective dependency
|
||||||
|
tree.
|
||||||
|
"""
|
||||||
return not nonproj.is_nonproj_tree(self.heads)
|
return not nonproj.is_nonproj_tree(self.heads)
|
||||||
|
|
||||||
|
|
||||||
def biluo_tags_from_offsets(doc, entities):
|
def biluo_tags_from_offsets(doc, entities):
|
||||||
'''Encode labelled spans into per-token tags, using the Begin/In/Last/Unit/Out
|
"""
|
||||||
|
Encode labelled spans into per-token tags, using the Begin/In/Last/Unit/Out
|
||||||
scheme (biluo).
|
scheme (biluo).
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
|
@ -347,7 +351,7 @@ def biluo_tags_from_offsets(doc, entities):
|
||||||
tags = biluo_tags_from_offsets(doc, entities)
|
tags = biluo_tags_from_offsets(doc, entities)
|
||||||
|
|
||||||
assert tags == ['O', 'O', 'U-LOC', 'O']
|
assert tags == ['O', 'O', 'U-LOC', 'O']
|
||||||
'''
|
"""
|
||||||
starts = {token.idx: token.i for token in doc}
|
starts = {token.idx: token.i for token in doc}
|
||||||
ends = {token.idx+len(token): token.i for token in doc}
|
ends = {token.idx+len(token): token.i for token in doc}
|
||||||
biluo = ['-' for _ in doc]
|
biluo = ['-' for _ in doc]
|
||||||
|
|
|
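The BILUO scheme described in the docstring above can be illustrated with a small spaCy-free sketch. Token boundaries are given as (start, end) character offsets, standing in for `token.idx` and `token.idx + len(token)`; the sentence and entity are made up, and the sketch is simplified in that it assumes every entity lines up exactly with token boundaries (the real function marks misaligned tokens with '-').

```python
def biluo_tags(token_offsets, entities):
    """Encode (start, end, label) character spans as per-token B/I/L/U/O tags."""
    starts = {start: i for i, (start, end) in enumerate(token_offsets)}
    ends = {end: i for i, (start, end) in enumerate(token_offsets)}
    tags = ['O'] * len(token_offsets)
    for start, end, label in entities:
        first, last = starts[start], ends[end]
        if first == last:
            tags[first] = 'U-' + label           # single-token entity
        else:
            tags[first] = 'B-' + label           # first token
            for i in range(first + 1, last):
                tags[i] = 'I-' + label           # inside tokens
            tags[last] = 'L-' + label            # last token
    return tags


# "I like London ." with London annotated as LOC
offsets = [(0, 1), (2, 6), (7, 13), (14, 15)]
print(biluo_tags(offsets, [(7, 13, 'LOC')]))     # ['O', 'O', 'U-LOC', 'O']
```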
@ -1,39 +1,25 @@
|
||||||
from __future__ import absolute_import
|
# coding: utf8
|
||||||
from __future__ import unicode_literals
|
from __future__ import absolute_import, unicode_literals
|
||||||
import pathlib
|
|
||||||
from contextlib import contextmanager
|
from contextlib import contextmanager
|
||||||
import shutil
|
import shutil
|
||||||
|
|
||||||
import ujson
|
|
||||||
|
|
||||||
|
|
||||||
try:
|
|
||||||
basestring
|
|
||||||
except NameError:
|
|
||||||
basestring = str
|
|
||||||
|
|
||||||
try:
|
|
||||||
unicode
|
|
||||||
except NameError:
|
|
||||||
unicode = str
|
|
||||||
|
|
||||||
from .tokenizer import Tokenizer
|
from .tokenizer import Tokenizer
|
||||||
from .vocab import Vocab
|
from .vocab import Vocab
|
||||||
from .tagger import Tagger
|
from .tagger import Tagger
|
||||||
from .matcher import Matcher
|
from .matcher import Matcher
|
||||||
from . import attrs
|
|
||||||
from . import orth
|
|
||||||
from . import util
|
|
||||||
from . import language_data
|
|
||||||
from .lemmatizer import Lemmatizer
|
from .lemmatizer import Lemmatizer
|
||||||
from .train import Trainer
|
from .train import Trainer
|
||||||
|
|
||||||
from .attrs import TAG, DEP, ENT_IOB, ENT_TYPE, HEAD, PROB, LANG, IS_STOP
|
|
||||||
from .syntax.parser import get_templates
|
from .syntax.parser import get_templates
|
||||||
from .syntax.nonproj import PseudoProjectivity
|
from .syntax.nonproj import PseudoProjectivity
|
||||||
from .pipeline import DependencyParser, EntityRecognizer
|
from .pipeline import DependencyParser, EntityRecognizer
|
||||||
from .syntax.arc_eager import ArcEager
|
from .syntax.arc_eager import ArcEager
|
||||||
from .syntax.ner import BiluoPushDown
|
from .syntax.ner import BiluoPushDown
|
||||||
|
from .compat import json_dumps
|
||||||
|
from .attrs import IS_STOP
|
||||||
|
from . import attrs
|
||||||
|
from . import orth
|
||||||
|
from . import util
|
||||||
|
from . import language_data
|
||||||
|
|
||||||
|
|
||||||
class BaseDefaults(object):
|
class BaseDefaults(object):
|
||||||
|
@ -150,25 +136,15 @@ class BaseDefaults(object):
|
||||||
return pipeline
|
return pipeline
|
||||||
|
|
||||||
token_match = language_data.TOKEN_MATCH
|
token_match = language_data.TOKEN_MATCH
|
||||||
|
|
||||||
prefixes = tuple(language_data.TOKENIZER_PREFIXES)
|
prefixes = tuple(language_data.TOKENIZER_PREFIXES)
|
||||||
|
|
||||||
suffixes = tuple(language_data.TOKENIZER_SUFFIXES)
|
suffixes = tuple(language_data.TOKENIZER_SUFFIXES)
|
||||||
|
|
||||||
infixes = tuple(language_data.TOKENIZER_INFIXES)
|
infixes = tuple(language_data.TOKENIZER_INFIXES)
|
||||||
|
|
||||||
tag_map = dict(language_data.TAG_MAP)
|
tag_map = dict(language_data.TAG_MAP)
|
||||||
|
|
||||||
tokenizer_exceptions = {}
|
tokenizer_exceptions = {}
|
||||||
|
|
||||||
parser_features = get_templates('parser')
|
parser_features = get_templates('parser')
|
||||||
|
|
||||||
entity_features = get_templates('ner')
|
entity_features = get_templates('ner')
|
||||||
|
|
||||||
tagger_features = Tagger.feature_templates # TODO -- fix this
|
tagger_features = Tagger.feature_templates # TODO -- fix this
|
||||||
|
|
||||||
stop_words = set()
|
stop_words = set()
|
||||||
|
|
||||||
lemma_rules = {}
|
lemma_rules = {}
|
||||||
lemma_exc = {}
|
lemma_exc = {}
|
||||||
lemma_index = {}
|
lemma_index = {}
|
||||||
|
@ -202,53 +178,46 @@ class BaseDefaults(object):
|
||||||
|
|
||||||
|
|
||||||
class Language(object):
|
class Language(object):
|
||||||
'''A text-processing pipeline. Usually you'll load this once per process, and
|
"""
|
||||||
|
A text-processing pipeline. Usually you'll load this once per process, and
|
||||||
pass the instance around your program.
|
pass the instance around your program.
|
||||||
'''
|
"""
|
||||||
Defaults = BaseDefaults
|
Defaults = BaseDefaults
|
||||||
lang = None
|
lang = None
|
||||||
|
|
||||||
@classmethod
|
@classmethod
|
||||||
@contextmanager
|
def setup_directory(cls, path, **configs):
|
||||||
def train(cls, path, gold_tuples, *configs):
|
"""
|
||||||
if isinstance(path, basestring):
|
Initialise a model directory.
|
||||||
path = pathlib.Path(path)
|
"""
|
||||||
tagger_cfg, parser_cfg, entity_cfg = configs
|
for name, config in configs.items():
|
||||||
dep_model_dir = path / 'deps'
|
directory = path / name
|
||||||
ner_model_dir = path / 'ner'
|
if directory.exists():
|
||||||
pos_model_dir = path / 'pos'
|
shutil.rmtree(str(directory))
|
||||||
if dep_model_dir.exists():
|
directory.mkdir()
|
||||||
shutil.rmtree(str(dep_model_dir))
|
with (directory / 'config.json').open('wb') as file_:
|
||||||
if ner_model_dir.exists():
|
data = json_dumps(config)
|
||||||
shutil.rmtree(str(ner_model_dir))
|
file_.write(data)
|
||||||
if pos_model_dir.exists():
|
if not (path / 'vocab').exists():
|
||||||
shutil.rmtree(str(pos_model_dir))
|
(path / 'vocab').mkdir()
|
||||||
dep_model_dir.mkdir()
|
|
||||||
ner_model_dir.mkdir()
|
|
||||||
pos_model_dir.mkdir()
|
|
||||||
|
|
||||||
if parser_cfg['pseudoprojective']:
|
@classmethod
|
||||||
|
@contextmanager
|
||||||
|
def train(cls, path, gold_tuples, **configs):
|
||||||
|
parser_cfg = configs.get('deps', {})
|
||||||
|
if parser_cfg.get('pseudoprojective'):
|
||||||
# preprocess training data here before ArcEager.get_labels() is called
|
# preprocess training data here before ArcEager.get_labels() is called
|
||||||
gold_tuples = PseudoProjectivity.preprocess_training_data(gold_tuples)
|
gold_tuples = PseudoProjectivity.preprocess_training_data(gold_tuples)
|
||||||
|
|
||||||
parser_cfg['actions'] = ArcEager.get_actions(gold_parses=gold_tuples)
|
for subdir in ('deps', 'ner', 'pos'):
|
||||||
entity_cfg['actions'] = BiluoPushDown.get_actions(gold_parses=gold_tuples)
|
if subdir not in configs:
|
||||||
|
configs[subdir] = {}
|
||||||
|
if parser_cfg:
|
||||||
|
configs['deps']['actions'] = ArcEager.get_actions(gold_parses=gold_tuples)
|
||||||
|
if 'ner' in configs:
|
||||||
|
configs['ner']['actions'] = BiluoPushDown.get_actions(gold_parses=gold_tuples)
|
||||||
|
|
||||||
with (dep_model_dir / 'config.json').open('wb') as file_:
|
cls.setup_directory(path, **configs)
|
||||||
data = ujson.dumps(parser_cfg)
|
|
||||||
if isinstance(data, unicode):
|
|
||||||
data = data.encode('utf8')
|
|
||||||
file_.write(data)
|
|
||||||
with (ner_model_dir / 'config.json').open('wb') as file_:
|
|
||||||
data = ujson.dumps(entity_cfg)
|
|
||||||
if isinstance(data, unicode):
|
|
||||||
data = data.encode('utf8')
|
|
||||||
file_.write(data)
|
|
||||||
with (pos_model_dir / 'config.json').open('wb') as file_:
|
|
||||||
data = ujson.dumps(tagger_cfg)
|
|
||||||
if isinstance(data, unicode):
|
|
||||||
data = data.encode('utf8')
|
|
||||||
file_.write(data)
|
|
||||||
|
|
||||||
self = cls(
|
self = cls(
|
||||||
path=path,
|
path=path,
|
||||||
|
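The `setup_directory` pattern above — one subdirectory per pipeline component, each holding a `config.json` — can be sketched without spaCy. Here `json.dumps(...).encode('utf8')` stands in for the `json_dumps` compat helper, and the 'deps', 'ner', 'pos' names mirror the keys used in the hunk; the model path in the commented usage is a placeholder.

```python
import json
import shutil
from pathlib import Path


def setup_directory(path, **configs):
    """Create one subdirectory per component and write its config.json."""
    path = Path(path)
    for name, config in configs.items():
        directory = path / name
        if directory.exists():
            shutil.rmtree(str(directory))
        directory.mkdir(parents=True)
        data = json.dumps(config).encode('utf8')    # write bytes, matching the 'wb' mode above
        with (directory / 'config.json').open('wb') as file_:
            file_.write(data)
    if not (path / 'vocab').exists():
        (path / 'vocab').mkdir()


# Usage with a hypothetical model directory:
# setup_directory('/tmp/model', deps={'pseudoprojective': True}, ner={}, pos={})
```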
@ -269,14 +238,22 @@ class Language(object):
|
||||||
self.entity = self.Defaults.create_entity(self)
|
self.entity = self.Defaults.create_entity(self)
|
||||||
self.pipeline = self.Defaults.create_pipeline(self)
|
self.pipeline = self.Defaults.create_pipeline(self)
|
||||||
yield Trainer(self, gold_tuples)
|
yield Trainer(self, gold_tuples)
|
||||||
self.end_training(path=path)
|
self.end_training()
|
||||||
|
self.save_to_directory(path)
|
||||||
|
|
||||||
def __init__(self, **overrides):
|
def __init__(self, **overrides):
|
||||||
|
"""
|
||||||
|
Create or load the pipeline.
|
||||||
|
|
||||||
|
Arguments:
|
||||||
|
**overrides: Keyword arguments indicating which defaults to override.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Language: The newly constructed object.
|
||||||
|
"""
|
||||||
if 'data_dir' in overrides and 'path' not in overrides:
|
if 'data_dir' in overrides and 'path' not in overrides:
|
||||||
raise ValueError("The argument 'data_dir' has been renamed to 'path'")
|
raise ValueError("The argument 'data_dir' has been renamed to 'path'")
|
||||||
path = overrides.get('path', True)
|
path = util.ensure_path(overrides.get('path', True))
|
||||||
if isinstance(path, basestring):
|
|
||||||
path = pathlib.Path(path)
|
|
||||||
if path is True:
|
if path is True:
|
||||||
path = util.get_data_path() / self.lang
|
path = util.get_data_path() / self.lang
|
||||||
if not path.exists() and 'path' not in overrides:
|
if not path.exists() and 'path' not in overrides:
|
||||||
|
@@ -322,11 +299,12 @@ class Language(object):
         self.pipeline = [self.tagger, self.parser, self.matcher, self.entity]

     def __call__(self, text, tag=True, parse=True, entity=True):
-        """Apply the pipeline to some text. The text can span multiple sentences,
+        """
+        Apply the pipeline to some text. The text can span multiple sentences,
         and can contain arbitrary whitespace. Alignment into the original string
         is preserved.

-        Args:
+        Arguments:
         text (unicode): The text to be processed.

         Returns:
@ -352,7 +330,8 @@ class Language(object):
|
||||||
return doc
|
return doc
|
||||||
|
|
||||||
def pipe(self, texts, tag=True, parse=True, entity=True, n_threads=2, batch_size=1000):
|
def pipe(self, texts, tag=True, parse=True, entity=True, n_threads=2, batch_size=1000):
|
||||||
'''Process texts as a stream, and yield Doc objects in order.
|
"""
|
||||||
|
Process texts as a stream, and yield Doc objects in order.
|
||||||
|
|
||||||
Supports GIL-free multi-threading.
|
Supports GIL-free multi-threading.
|
||||||
|
|
||||||
|
@ -361,7 +340,7 @@ class Language(object):
|
||||||
tag (bool)
|
tag (bool)
|
||||||
parse (bool)
|
parse (bool)
|
||||||
entity (bool)
|
entity (bool)
|
||||||
'''
|
"""
|
||||||
skip = {self.tagger: not tag, self.parser: not parse, self.entity: not entity}
|
skip = {self.tagger: not tag, self.parser: not parse, self.entity: not entity}
|
||||||
stream = (self.make_doc(text) for text in texts)
|
stream = (self.make_doc(text) for text in texts)
|
||||||
for proc in self.pipeline:
|
for proc in self.pipeline:
|
||||||
|
@ -373,51 +352,42 @@ class Language(object):
|
||||||
for doc in stream:
|
for doc in stream:
|
||||||
yield doc
|
yield doc
|
||||||
|
|
||||||
def end_training(self, path=None):
|
def save_to_directory(self, path):
|
||||||
if path is None:
|
"""
|
||||||
path = self.path
|
Save the Vocab, StringStore and pipeline to a directory.
|
||||||
elif isinstance(path, basestring):
|
|
||||||
path = pathlib.Path(path)
|
|
||||||
|
|
||||||
if self.tagger:
|
Arguments:
|
||||||
self.tagger.model.end_training()
|
path (string or pathlib path): Path to save the model.
|
||||||
self.tagger.model.dump(str(path / 'pos' / 'model'))
|
"""
|
||||||
if self.parser:
|
configs = {
|
||||||
self.parser.model.end_training()
|
'pos': self.tagger.cfg if self.tagger else {},
|
||||||
self.parser.model.dump(str(path / 'deps' / 'model'))
|
'deps': self.parser.cfg if self.parser else {},
|
||||||
if self.entity:
|
'ner': self.entity.cfg if self.entity else {},
|
||||||
self.entity.model.end_training()
|
}
|
||||||
self.entity.model.dump(str(path / 'ner' / 'model'))
|
|
||||||
|
path = util.ensure_path(path)
|
||||||
|
self.setup_directory(path, **configs)
|
||||||
|
|
||||||
strings_loc = path / 'vocab' / 'strings.json'
|
strings_loc = path / 'vocab' / 'strings.json'
|
||||||
with strings_loc.open('w', encoding='utf8') as file_:
|
with strings_loc.open('w', encoding='utf8') as file_:
|
||||||
self.vocab.strings.dump(file_)
|
self.vocab.strings.dump(file_)
|
||||||
self.vocab.dump(path / 'vocab' / 'lexemes.bin')
|
self.vocab.dump(path / 'vocab' / 'lexemes.bin')
|
||||||
|
# TODO: Word vectors?
|
||||||
if self.tagger:
|
if self.tagger:
|
||||||
tagger_freqs = list(self.tagger.freqs[TAG].items())
|
self.tagger.model.dump(str(path / 'pos' / 'model'))
|
||||||
else:
|
|
||||||
tagger_freqs = []
|
|
||||||
if self.parser:
|
if self.parser:
|
||||||
dep_freqs = list(self.parser.moves.freqs[DEP].items())
|
self.parser.model.dump(str(path / 'deps' / 'model'))
|
||||||
head_freqs = list(self.parser.moves.freqs[HEAD].items())
|
|
||||||
else:
|
|
||||||
dep_freqs = []
|
|
||||||
head_freqs = []
|
|
||||||
if self.entity:
|
if self.entity:
|
||||||
entity_iob_freqs = list(self.entity.moves.freqs[ENT_IOB].items())
|
self.entity.model.dump(str(path / 'ner' / 'model'))
|
||||||
entity_type_freqs = list(self.entity.moves.freqs[ENT_TYPE].items())
|
|
||||||
else:
|
def end_training(self, path=None):
|
||||||
entity_iob_freqs = []
|
if self.tagger:
|
||||||
entity_type_freqs = []
|
self.tagger.model.end_training()
|
||||||
with (path / 'vocab' / 'serializer.json').open('wb') as file_:
|
if self.parser:
|
||||||
data = ujson.dumps([
|
self.parser.model.end_training()
|
||||||
(TAG, tagger_freqs),
|
if self.entity:
|
||||||
(DEP, dep_freqs),
|
self.entity.model.end_training()
|
||||||
(ENT_IOB, entity_iob_freqs),
|
# NB: This is slightly different from before --- we no longer default
|
||||||
(ENT_TYPE, entity_type_freqs),
|
# to taking nlp.path
|
||||||
(HEAD, head_freqs)
|
if path is not None:
|
||||||
])
|
self.save_to_directory(path)
|
||||||
if isinstance(data, unicode):
|
|
||||||
data = data.encode('utf8')
|
|
||||||
file_.write(data)
|
|
||||||
|
|
|
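The change above splits what used to be a single `end_training(path)` call into finalising the models and an explicit `save_to_directory(path)` step. A minimal usage sketch of the new contract, wrapped in a function so it is self-contained (`nlp` is assumed to be a trained Language instance and `path` is a placeholder directory):

```python
def finish_and_save(nlp, path):
    """Finalise training and write the pipeline out, per the new split above."""
    nlp.end_training()            # finalise model weights only; no implicit save
    nlp.save_to_directory(path)   # explicit, separate serialization step
```

Passing a path to `end_training()` still saves for backwards compatibility, but the default no longer falls back to `nlp.path`.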
@ -1,13 +1,8 @@
|
||||||
from __future__ import unicode_literals, print_function
|
# coding: utf8
|
||||||
import codecs
|
from __future__ import unicode_literals
|
||||||
import pathlib
|
|
||||||
|
|
||||||
import ujson as json
|
|
||||||
|
|
||||||
from .symbols import POS, NOUN, VERB, ADJ, PUNCT
|
from .symbols import POS, NOUN, VERB, ADJ, PUNCT
|
||||||
from .symbols import VerbForm_inf, VerbForm_none
|
from .symbols import VerbForm_inf, VerbForm_none, Number_sing, Degree_pos
|
||||||
from .symbols import Number_sing
|
|
||||||
from .symbols import Degree_pos
|
|
||||||
|
|
||||||
|
|
||||||
class Lemmatizer(object):
|
class Lemmatizer(object):
|
||||||
|
@ -38,8 +33,10 @@ class Lemmatizer(object):
|
||||||
return lemmas
|
return lemmas
|
||||||
|
|
||||||
def is_base_form(self, univ_pos, morphology=None):
|
def is_base_form(self, univ_pos, morphology=None):
|
||||||
'''Check whether we're dealing with an uninflected paradigm, so we can
|
"""
|
||||||
avoid lemmatization entirely.'''
|
Check whether we're dealing with an uninflected paradigm, so we can
|
||||||
|
avoid lemmatization entirely.
|
||||||
|
"""
|
||||||
morphology = {} if morphology is None else morphology
|
morphology = {} if morphology is None else morphology
|
||||||
others = [key for key in morphology if key not in (POS, 'number', 'pos', 'verbform')]
|
others = [key for key in morphology if key not in (POS, 'number', 'pos', 'verbform')]
|
||||||
true_morph_key = morphology.get('morph', 0)
|
true_morph_key = morphology.get('morph', 0)
|
||||||
|
|
|
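A simplified illustration of the kind of short-circuit `is_base_form` performs: if the morphology already says the form is uninflected, lemmatization can be skipped. The feature names here are plain-string stand-ins for the symbol constants imported above (`VerbForm_inf`, `Number_sing`), and the rules are deliberately reduced to two cases.

```python
def is_base_form(univ_pos, morphology=None):
    """Return True when the token is already in its dictionary form."""
    morphology = {} if morphology is None else morphology
    if univ_pos == 'verb' and morphology.get('VerbForm') == 'inf':
        return True           # infinitive verbs are already base forms
    if univ_pos == 'noun' and morphology.get('Number') == 'sing':
        return True           # singular nouns need no lemmatization
    return False


print(is_base_form('verb', {'VerbForm': 'inf'}))   # True  -> skip the lemmatizer
print(is_base_form('noun', {'Number': 'plur'}))    # False -> look up lemma rules
```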
@ -1,4 +1,7 @@
|
||||||
# cython: embedsignature=True
|
# cython: embedsignature=True
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals, print_function
|
||||||
|
|
||||||
from libc.math cimport sqrt
|
from libc.math cimport sqrt
|
||||||
from cpython.ref cimport Py_INCREF
|
from cpython.ref cimport Py_INCREF
|
||||||
from cymem.cymem cimport Pool
|
from cymem.cymem cimport Pool
|
||||||
|
@ -9,14 +12,11 @@ from cython.view cimport array as cvarray
|
||||||
cimport numpy as np
|
cimport numpy as np
|
||||||
np.import_array()
|
np.import_array()
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
from libc.string cimport memset
|
from libc.string cimport memset
|
||||||
|
import numpy
|
||||||
|
|
||||||
from .orth cimport word_shape
|
from .orth cimport word_shape
|
||||||
from .typedefs cimport attr_t, flags_t
|
from .typedefs cimport attr_t, flags_t
|
||||||
import numpy
|
|
||||||
|
|
||||||
from .attrs cimport IS_ALPHA, IS_ASCII, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_SPACE
|
from .attrs cimport IS_ALPHA, IS_ASCII, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_SPACE
|
||||||
from .attrs cimport IS_TITLE, IS_UPPER, LIKE_URL, LIKE_NUM, LIKE_EMAIL, IS_STOP
|
from .attrs cimport IS_TITLE, IS_UPPER, LIKE_URL, LIKE_NUM, LIKE_EMAIL, IS_STOP
|
||||||
from .attrs cimport IS_BRACKET
|
from .attrs cimport IS_BRACKET
|
||||||
|
@ -30,13 +30,15 @@ memset(&EMPTY_LEXEME, 0, sizeof(LexemeC))
|
||||||
|
|
||||||
|
|
||||||
cdef class Lexeme:
|
cdef class Lexeme:
|
||||||
"""An entry in the vocabulary. A Lexeme has no string context --- it's a
|
"""
|
||||||
|
An entry in the vocabulary. A Lexeme has no string context --- it's a
|
||||||
word-type, as opposed to a word token. It therefore has no part-of-speech
|
word-type, as opposed to a word token. It therefore has no part-of-speech
|
||||||
tag, dependency parse, or lemma (lemmatization depends on the part-of-speech
|
tag, dependency parse, or lemma (lemmatization depends on the part-of-speech
|
||||||
tag).
|
tag).
|
||||||
"""
|
"""
|
||||||
def __init__(self, Vocab vocab, int orth):
|
def __init__(self, Vocab vocab, int orth):
|
||||||
"""Create a Lexeme object.
|
"""
|
||||||
|
Create a Lexeme object.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
vocab (Vocab): The parent vocabulary
|
vocab (Vocab): The parent vocabulary
|
||||||
|
@ -80,7 +82,8 @@ cdef class Lexeme:
|
||||||
return self.c.orth
|
return self.c.orth
|
||||||
|
|
||||||
def set_flag(self, attr_id_t flag_id, bint value):
|
def set_flag(self, attr_id_t flag_id, bint value):
|
||||||
"""Change the value of a boolean flag.
|
"""
|
||||||
|
Change the value of a boolean flag.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
flag_id (int): The attribute ID of the flag to set.
|
flag_id (int): The attribute ID of the flag to set.
|
||||||
|
@ -89,7 +92,8 @@ cdef class Lexeme:
|
||||||
Lexeme.c_set_flag(self.c, flag_id, value)
|
Lexeme.c_set_flag(self.c, flag_id, value)
|
||||||
|
|
||||||
def check_flag(self, attr_id_t flag_id):
|
def check_flag(self, attr_id_t flag_id):
|
||||||
"""Check the value of a boolean flag.
|
"""
|
||||||
|
Check the value of a boolean flag.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
flag_id (int): The attribute ID of the flag to query.
|
flag_id (int): The attribute ID of the flag to query.
|
||||||
|
@ -98,7 +102,8 @@ cdef class Lexeme:
|
||||||
return True if Lexeme.c_check_flag(self.c, flag_id) else False
|
return True if Lexeme.c_check_flag(self.c, flag_id) else False
|
||||||
|
|
||||||
def similarity(self, other):
|
def similarity(self, other):
|
||||||
'''Compute a semantic similarity estimate. Defaults to cosine over vectors.
|
"""
|
||||||
|
Compute a semantic similarity estimate. Defaults to cosine over vectors.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
other:
|
other:
|
||||||
|
@ -106,7 +111,7 @@ cdef class Lexeme:
|
||||||
Token and Lexeme objects.
|
Token and Lexeme objects.
|
||||||
Returns:
|
Returns:
|
||||||
score (float): A scalar similarity score. Higher is more similar.
|
score (float): A scalar similarity score. Higher is more similar.
|
||||||
'''
|
"""
|
||||||
if self.vector_norm == 0 or other.vector_norm == 0:
|
if self.vector_norm == 0 or other.vector_norm == 0:
|
||||||
return 0.0
|
return 0.0
|
||||||
return numpy.dot(self.vector, other.vector) / (self.vector_norm * other.vector_norm)
|
return numpy.dot(self.vector, other.vector) / (self.vector_norm * other.vector_norm)
|
||||||
|
|
|
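The similarity computation at the end of this hunk is plain cosine similarity. A numpy-only sketch of the same formula, including the zero-norm guard:

```python
import numpy


def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two vectors; higher means more similar."""
    norm1 = numpy.linalg.norm(vec1)
    norm2 = numpy.linalg.norm(vec2)
    if norm1 == 0 or norm2 == 0:
        return 0.0                       # same guard as Lexeme.similarity above
    return numpy.dot(vec1, vec2) / (norm1 * norm2)


print(cosine_similarity(numpy.array([1.0, 0.0]), numpy.array([1.0, 1.0])))  # ~0.707
```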
@ -1,7 +1,10 @@
|
||||||
# cython: profile=True
|
# cython: profile=True
|
||||||
# cython: infer_types=True
|
# cython: infer_types=True
|
||||||
|
# coding: utf8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import ujson
|
||||||
|
|
||||||
from .typedefs cimport attr_t
|
from .typedefs cimport attr_t
|
||||||
from .typedefs cimport hash_t
|
from .typedefs cimport hash_t
|
||||||
from .attrs cimport attr_id_t
|
from .attrs cimport attr_id_t
|
||||||
|
@ -52,12 +55,6 @@ from .attrs import FLAG36 as L9_ENT
|
||||||
from .attrs import FLAG35 as L10_ENT
|
from .attrs import FLAG35 as L10_ENT
|
||||||
|
|
||||||
|
|
||||||
try:
|
|
||||||
import ujson as json
|
|
||||||
except ImportError:
|
|
||||||
import json
|
|
||||||
|
|
||||||
|
|
||||||
cpdef enum quantifier_t:
|
cpdef enum quantifier_t:
|
||||||
_META
|
_META
|
||||||
ONE
|
ONE
|
||||||
|
@ -180,7 +177,8 @@ cdef class Matcher:
|
||||||
|
|
||||||
@classmethod
|
@classmethod
|
||||||
def load(cls, path, vocab):
|
def load(cls, path, vocab):
|
||||||
'''Load the matcher and patterns from a file path.
|
"""
|
||||||
|
Load the matcher and patterns from a file path.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
path (Path):
|
path (Path):
|
||||||
|
@ -189,16 +187,17 @@ cdef class Matcher:
|
||||||
The vocabulary that the documents to match over will refer to.
|
The vocabulary that the documents to match over will refer to.
|
||||||
Returns:
|
Returns:
|
||||||
Matcher: The newly constructed object.
|
Matcher: The newly constructed object.
|
||||||
'''
|
"""
|
||||||
if (path / 'gazetteer.json').exists():
|
if (path / 'gazetteer.json').exists():
|
||||||
with (path / 'gazetteer.json').open('r', encoding='utf8') as file_:
|
with (path / 'gazetteer.json').open('r', encoding='utf8') as file_:
|
||||||
patterns = json.load(file_)
|
patterns = ujson.load(file_)
|
||||||
else:
|
else:
|
||||||
patterns = {}
|
patterns = {}
|
||||||
return cls(vocab, patterns)
|
return cls(vocab, patterns)
|
||||||
|
|
||||||
def __init__(self, vocab, patterns={}):
|
def __init__(self, vocab, patterns={}):
|
||||||
"""Create the Matcher.
|
"""
|
||||||
|
Create the Matcher.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
vocab (Vocab):
|
vocab (Vocab):
|
||||||
|
@ -227,7 +226,8 @@ cdef class Matcher:
|
||||||
|
|
||||||
def add_entity(self, entity_key, attrs=None, if_exists='raise',
|
def add_entity(self, entity_key, attrs=None, if_exists='raise',
|
||||||
acceptor=None, on_match=None):
|
acceptor=None, on_match=None):
|
||||||
"""Add an entity to the matcher.
|
"""
|
||||||
|
Add an entity to the matcher.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
entity_key (unicode or int):
|
entity_key (unicode or int):
|
||||||
|
@ -264,7 +264,8 @@ cdef class Matcher:
|
||||||
self._callbacks[entity_key] = on_match
|
self._callbacks[entity_key] = on_match
|
||||||
|
|
||||||
def add_pattern(self, entity_key, token_specs, label=""):
|
def add_pattern(self, entity_key, token_specs, label=""):
|
||||||
"""Add a pattern to the matcher.
|
"""
|
||||||
|
Add a pattern to the matcher.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
entity_key (unicode or int):
|
entity_key (unicode or int):
|
||||||
|
@ -307,7 +308,8 @@ cdef class Matcher:
|
||||||
return entity_key
|
return entity_key
|
||||||
|
|
||||||
def has_entity(self, entity_key):
|
def has_entity(self, entity_key):
|
||||||
"""Check whether the matcher has an entity.
|
"""
|
||||||
|
Check whether the matcher has an entity.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
entity_key (string or int): The entity key to check.
|
entity_key (string or int): The entity key to check.
|
||||||
|
@ -318,7 +320,8 @@ cdef class Matcher:
|
||||||
return entity_key in self._entities
|
return entity_key in self._entities
|
||||||
|
|
||||||
def get_entity(self, entity_key):
|
def get_entity(self, entity_key):
|
||||||
"""Retrieve the attributes stored for an entity.
|
"""
|
||||||
|
Retrieve the attributes stored for an entity.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
entity_key (unicode or int): The entity to retrieve.
|
entity_key (unicode or int): The entity to retrieve.
|
||||||
|
@ -332,7 +335,8 @@ cdef class Matcher:
|
||||||
return None
|
return None
|
||||||
|
|
||||||
def __call__(self, Doc doc, acceptor=None):
|
def __call__(self, Doc doc, acceptor=None):
|
||||||
"""Find all token sequences matching the supplied patterns on the Doc.
|
"""
|
||||||
|
Find all token sequences matching the supplied patterns on the Doc.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
doc (Doc):
|
doc (Doc):
|
||||||
|
@ -445,7 +449,8 @@ cdef class Matcher:
|
||||||
return matches
|
return matches
|
||||||
|
|
||||||
def pipe(self, docs, batch_size=1000, n_threads=2):
|
def pipe(self, docs, batch_size=1000, n_threads=2):
|
||||||
"""Match a stream of documents, yielding them in turn.
|
"""
|
||||||
|
Match a stream of documents, yielding them in turn.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
docs: A stream of documents.
|
docs: A stream of documents.
|
||||||
|
|
|
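For context, the `Matcher.load` path above simply reads a `gazetteer.json` of patterns if one is shipped with the model. A dependency-free sketch of that loading step, with stdlib `json` standing in for `ujson` and a placeholder model path:

```python
import json
from pathlib import Path


def load_patterns(model_path):
    """Return the pattern dict for a model directory, or {} if none is shipped."""
    gazetteer = Path(model_path) / 'gazetteer.json'
    if gazetteer.exists():
        with gazetteer.open('r', encoding='utf8') as file_:
            return json.load(file_)
    return {}


# patterns = load_patterns('/path/to/model')   # hypothetical model directory
# matcher = Matcher(vocab, patterns)           # as in the constructor above
```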
@ -1,13 +1,9 @@
|
||||||
# cython: infer_types
|
# cython: infer_types
|
||||||
|
# coding: utf8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
from libc.string cimport memset
|
from libc.string cimport memset
|
||||||
|
|
||||||
try:
|
|
||||||
import ujson as json
|
|
||||||
except ImportError:
|
|
||||||
import json
|
|
||||||
|
|
||||||
from .parts_of_speech cimport ADJ, VERB, NOUN, PUNCT
|
from .parts_of_speech cimport ADJ, VERB, NOUN, PUNCT
|
||||||
from .attrs cimport POS, IS_SPACE
|
from .attrs cimport POS, IS_SPACE
|
||||||
from .parts_of_speech import IDS as POS_IDS
|
from .parts_of_speech import IDS as POS_IDS
|
||||||
|
@ -16,7 +12,9 @@ from .attrs import LEMMA, intify_attrs
|
||||||
|
|
||||||
|
|
||||||
def _normalize_props(props):
|
def _normalize_props(props):
|
||||||
'''Transform deprecated string keys to correct names.'''
|
"""
|
||||||
|
Transform deprecated string keys to correct names.
|
||||||
|
"""
|
||||||
out = {}
|
out = {}
|
||||||
for key, value in props.items():
|
for key, value in props.items():
|
||||||
if key == POS:
|
if key == POS:
|
||||||
|
@ -98,13 +96,14 @@ cdef class Morphology:
|
||||||
flags[0] &= ~(one << flag_id)
|
flags[0] &= ~(one << flag_id)
|
||||||
|
|
||||||
def add_special_case(self, unicode tag_str, unicode orth_str, attrs, force=False):
|
def add_special_case(self, unicode tag_str, unicode orth_str, attrs, force=False):
|
||||||
'''Add a special-case rule to the morphological analyser. Tokens whose
|
"""
|
||||||
|
Add a special-case rule to the morphological analyser. Tokens whose
|
||||||
tag and orth match the rule will receive the specified properties.
|
tag and orth match the rule will receive the specified properties.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
tag (unicode): The part-of-speech tag to key the exception.
|
tag (unicode): The part-of-speech tag to key the exception.
|
||||||
orth (unicode): The word-form to key the exception.
|
orth (unicode): The word-form to key the exception.
|
||||||
'''
|
"""
|
||||||
tag = self.strings[tag_str]
|
tag = self.strings[tag_str]
|
||||||
tag_id = self.reverse_index[tag]
|
tag_id = self.reverse_index[tag]
|
||||||
orth = self.strings[orth_str]
|
orth = self.strings[orth_str]
|
||||||
|
|
|
@ -1,8 +0,0 @@
|
||||||
class RegexMerger(object):
|
|
||||||
def __init__(self, regexes):
|
|
||||||
self.regexes = regexes
|
|
||||||
|
|
||||||
def __call__(self, tokens):
|
|
||||||
for tag, entity_type, regex in self.regexes:
|
|
||||||
for m in regex.finditer(tokens.string):
|
|
||||||
tokens.merge(m.start(), m.end(), tag, m.group(), entity_type)
|
|
|
@ -1,6 +1,7 @@
|
||||||
# coding: utf8
|
|
||||||
# cython: infer_types=True
|
# cython: infer_types=True
|
||||||
|
# coding: utf8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import unicodedata
|
import unicodedata
|
||||||
import re
|
import re
|
||||||
|
|
||||||
|
|
|
@ -1,3 +1,4 @@
|
||||||
|
# coding: utf8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
|
||||||
|
|
|
@ -1,3 +1,6 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
from .syntax.parser cimport Parser
|
from .syntax.parser cimport Parser
|
||||||
from .syntax.beam_parser cimport BeamParser
|
from .syntax.beam_parser cimport BeamParser
|
||||||
from .syntax.ner cimport BiluoPushDown
|
from .syntax.ner cimport BiluoPushDown
|
||||||
|
@ -11,44 +14,40 @@ from .attrs import DEP, ENT_TYPE
|
||||||
|
|
||||||
|
|
||||||
cdef class EntityRecognizer(Parser):
|
cdef class EntityRecognizer(Parser):
|
||||||
"""Annotate named entities on Doc objects."""
|
"""
|
||||||
|
Annotate named entities on Doc objects.
|
||||||
|
"""
|
||||||
TransitionSystem = BiluoPushDown
|
TransitionSystem = BiluoPushDown
|
||||||
|
|
||||||
feature_templates = get_feature_templates('ner')
|
feature_templates = get_feature_templates('ner')
|
||||||
|
|
||||||
def add_label(self, label):
|
def add_label(self, label):
|
||||||
for action in self.moves.action_types:
|
Parser.add_label(self, label)
|
||||||
self.moves.add_action(action, label)
|
|
||||||
if 'actions' in self.cfg:
|
|
||||||
self.cfg['actions'].setdefault(action,
|
|
||||||
{}).setdefault(label, True)
|
|
||||||
if isinstance(label, basestring):
|
if isinstance(label, basestring):
|
||||||
label = self.vocab.strings[label]
|
label = self.vocab.strings[label]
|
||||||
|
# Set label into serializer. Super hacky :(
|
||||||
for attr, freqs in self.vocab.serializer_freqs:
|
for attr, freqs in self.vocab.serializer_freqs:
|
||||||
if attr == ENT_TYPE and label not in freqs:
|
if attr == ENT_TYPE and label not in freqs:
|
||||||
freqs.append([label, 1])
|
freqs.append([label, 1])
|
||||||
# Super hacky :(
|
|
||||||
self.vocab._serializer = None
|
self.vocab._serializer = None
|
||||||
|
|
||||||
|
|
||||||
cdef class BeamEntityRecognizer(BeamParser):
|
cdef class BeamEntityRecognizer(BeamParser):
|
||||||
"""Annotate named entities on Doc objects."""
|
"""
|
||||||
|
Annotate named entities on Doc objects.
|
||||||
|
"""
|
||||||
TransitionSystem = BiluoPushDown
|
TransitionSystem = BiluoPushDown
|
||||||
|
|
||||||
feature_templates = get_feature_templates('ner')
|
feature_templates = get_feature_templates('ner')
|
||||||
|
|
||||||
def add_label(self, label):
|
def add_label(self, label):
|
||||||
for action in self.moves.action_types:
|
Parser.add_label(self, label)
|
||||||
self.moves.add_action(action, label)
|
|
||||||
if 'actions' in self.cfg:
|
|
||||||
self.cfg['actions'].setdefault(action,
|
|
||||||
{}).setdefault(label, True)
|
|
||||||
if isinstance(label, basestring):
|
if isinstance(label, basestring):
|
||||||
label = self.vocab.strings[label]
|
label = self.vocab.strings[label]
|
||||||
|
# Set label into serializer. Super hacky :(
|
||||||
for attr, freqs in self.vocab.serializer_freqs:
|
for attr, freqs in self.vocab.serializer_freqs:
|
||||||
if attr == ENT_TYPE and label not in freqs:
|
if attr == ENT_TYPE and label not in freqs:
|
||||||
freqs.append([label, 1])
|
freqs.append([label, 1])
|
||||||
# Super hacky :(
|
|
||||||
self.vocab._serializer = None
|
self.vocab._serializer = None
|
||||||
|
|
||||||
|
|
||||||
|
@ -58,11 +57,7 @@ cdef class DependencyParser(Parser):
|
||||||
feature_templates = get_feature_templates('basic')
|
feature_templates = get_feature_templates('basic')
|
||||||
|
|
||||||
def add_label(self, label):
|
def add_label(self, label):
|
||||||
for action in self.moves.action_types:
|
Parser.add_label(self, label)
|
||||||
self.moves.add_action(action, label)
|
|
||||||
if 'actions' in self.cfg:
|
|
||||||
self.cfg['actions'].setdefault(action,
|
|
||||||
{}).setdefault(label, True)
|
|
||||||
if isinstance(label, basestring):
|
if isinstance(label, basestring):
|
||||||
label = self.vocab.strings[label]
|
label = self.vocab.strings[label]
|
||||||
for attr, freqs in self.vocab.serializer_freqs:
|
for attr, freqs in self.vocab.serializer_freqs:
|
||||||
|
@ -78,11 +73,7 @@ cdef class BeamDependencyParser(BeamParser):
|
||||||
feature_templates = get_feature_templates('basic')
|
feature_templates = get_feature_templates('basic')
|
||||||
|
|
||||||
def add_label(self, label):
|
def add_label(self, label):
|
||||||
for action in self.moves.action_types:
|
Parser.add_label(self, label)
|
||||||
self.moves.add_action(action, label)
|
|
||||||
if 'actions' in self.cfg:
|
|
||||||
self.cfg['actions'].setdefault(action,
|
|
||||||
{}).setdefault(label, True)
|
|
||||||
if isinstance(label, basestring):
|
if isinstance(label, basestring):
|
||||||
label = self.vocab.strings[label]
|
label = self.vocab.strings[label]
|
||||||
for attr, freqs in self.vocab.serializer_freqs:
|
for attr, freqs in self.vocab.serializer_freqs:
|
||||||
|
|
|
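The pipeline change above removes four near-identical `add_label` bodies and delegates the shared part to `Parser.add_label`, leaving only component-specific work in the subclasses. The general shape of that refactor, in plain Python with stand-in classes (not the real Cython API):

```python
class Parser(object):
    """Minimal stand-in for the shared base class."""
    def __init__(self):
        self.labels = []

    def add_label(self, label):
        # shared bookkeeping now lives in one place
        if label not in self.labels:
            self.labels.append(label)


class EntityRecognizer(Parser):
    def add_label(self, label):
        Parser.add_label(self, label)    # reuse the base implementation
        # ...then do only the NER-specific work (serializer freqs, etc.)


ner = EntityRecognizer()
ner.add_label('PERSON')
print(ner.labels)   # ['PERSON']
```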
@@ -1,12 +1,13 @@
-from __future__ import division
-from __future__ import print_function
-from __future__ import unicode_literals
+# coding: utf8
+from __future__ import division, print_function, unicode_literals

 from .gold import tags_to_entities


 class PRFScore(object):
-    """A precision / recall / F score"""
+    """
+    A precision / recall / F score
+    """
     def __init__(self):
         self.tp = 0
         self.fp = 0
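As a reminder of what `PRFScore` tracks, here is the precision/recall/F1 arithmetic in a standalone sketch (the tiny smoothing constant avoids division by zero; the counts at the bottom are made up):

```python
class PRFScore(object):
    """Accumulate true/false positives and false negatives for one metric."""
    def __init__(self):
        self.tp = 0
        self.fp = 0
        self.fn = 0

    @property
    def precision(self):
        return self.tp / (self.tp + self.fp + 1e-100)

    @property
    def recall(self):
        return self.tp / (self.tp + self.fn + 1e-100)

    @property
    def fscore(self):
        p, r = self.precision, self.recall
        return 2 * (p * r) / (p + r + 1e-100)


score = PRFScore()
score.tp, score.fp, score.fn = 8, 2, 4          # hypothetical counts
print(round(score.precision, 2), round(score.recall, 2), round(score.fscore, 2))
# 0.8 0.67 0.73
```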
@ -1,12 +1,11 @@
|
||||||
# cython: infer_types=True
|
# cython: infer_types=True
|
||||||
|
# coding: utf8
|
||||||
from __future__ import unicode_literals, absolute_import
|
from __future__ import unicode_literals, absolute_import
|
||||||
|
|
||||||
cimport cython
|
cimport cython
|
||||||
from libc.string cimport memcpy
|
from libc.string cimport memcpy
|
||||||
from libc.stdint cimport uint64_t, uint32_t
|
from libc.stdint cimport uint64_t, uint32_t
|
||||||
|
|
||||||
from murmurhash.mrmr cimport hash64, hash32
|
from murmurhash.mrmr cimport hash64, hash32
|
||||||
|
|
||||||
from preshed.maps cimport map_iter, key_t
|
from preshed.maps cimport map_iter, key_t
|
||||||
|
|
||||||
from .typedefs cimport hash_t
|
from .typedefs cimport hash_t
|
||||||
|
@ -73,13 +72,16 @@ cdef Utf8Str _allocate(Pool mem, const unsigned char* chars, uint32_t length) ex
|
||||||
|
|
||||||
|
|
||||||
cdef class StringStore:
|
cdef class StringStore:
|
||||||
'''Map strings to and from integer IDs.'''
|
"""
|
||||||
|
Map strings to and from integer IDs.
|
||||||
|
"""
|
||||||
def __init__(self, strings=None, freeze=False):
|
def __init__(self, strings=None, freeze=False):
|
||||||
'''Create the StringStore.
|
"""
|
||||||
|
Create the StringStore.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
strings: A sequence of unicode strings to add to the store.
|
strings: A sequence of unicode strings to add to the store.
|
||||||
'''
|
"""
|
||||||
self.mem = Pool()
|
self.mem = Pool()
|
||||||
self._map = PreshMap()
|
self._map = PreshMap()
|
||||||
self._oov = PreshMap()
|
self._oov = PreshMap()
|
||||||
|
@ -104,7 +106,8 @@ cdef class StringStore:
|
||||||
return (StringStore, (list(self),))
|
return (StringStore, (list(self),))
|
||||||
|
|
||||||
def __len__(self):
|
def __len__(self):
|
||||||
"""The number of strings in the store.
|
"""
|
||||||
|
The number of strings in the store.
|
||||||
|
|
||||||
Returns:
|
Returns:
|
||||||
int The number of strings in the store.
|
int The number of strings in the store.
|
||||||
|
@ -112,7 +115,8 @@ cdef class StringStore:
|
||||||
return self.size-1
|
return self.size-1
|
||||||
|
|
||||||
def __getitem__(self, object string_or_id):
|
def __getitem__(self, object string_or_id):
|
||||||
"""Retrieve a string from a given integer ID, or vice versa.
|
"""
|
||||||
|
Retrieve a string from a given integer ID, or vice versa.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
string_or_id (bytes or unicode or int):
|
string_or_id (bytes or unicode or int):
|
||||||
|
@ -159,7 +163,8 @@ cdef class StringStore:
|
||||||
return utf8str - self.c
|
return utf8str - self.c
|
||||||
|
|
||||||
def __contains__(self, unicode string not None):
|
def __contains__(self, unicode string not None):
|
||||||
"""Check whether a string is in the store.
|
"""
|
||||||
|
Check whether a string is in the store.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
string (unicode): The string to check.
|
string (unicode): The string to check.
|
||||||
|
@ -172,7 +177,8 @@ cdef class StringStore:
|
||||||
return self._map.get(key) is not NULL
|
return self._map.get(key) is not NULL
|
||||||
|
|
||||||
def __iter__(self):
|
def __iter__(self):
|
||||||
"""Iterate over the strings in the store, in order.
|
"""
|
||||||
|
Iterate over the strings in the store, in order.
|
||||||
|
|
||||||
Yields: unicode A string in the store.
|
Yields: unicode A string in the store.
|
||||||
"""
|
"""
|
||||||
|
@ -230,7 +236,8 @@ cdef class StringStore:
|
||||||
return &self.c[self.size-1]
|
return &self.c[self.size-1]
|
||||||
|
|
||||||
def dump(self, file_):
|
def dump(self, file_):
|
||||||
"""Save the strings to a JSON file.
|
"""
|
||||||
|
Save the strings to a JSON file.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
file_ (buffer): The file to save the strings.
|
file_ (buffer): The file to save the strings.
|
||||||
|
@ -244,7 +251,8 @@ cdef class StringStore:
|
||||||
file_.write(string_data)
|
file_.write(string_data)
|
||||||
|
|
||||||
def load(self, file_):
|
def load(self, file_):
|
||||||
"""Load the strings from a JSON file.
|
"""
|
||||||
|
Load the strings from a JSON file.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
file_ (buffer): The file from which to load the strings.
|
file_ (buffer): The file from which to load the strings.
|
||||||
|
|
|
@ -1,3 +1,4 @@
|
||||||
|
# coding: utf8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
IDS = {
|
IDS = {
|
||||||
|
|
|
@ -7,17 +7,17 @@ out of "context") is in features/extractor.pyx
|
||||||
The atomic feature names are listed in a big enum, so that the feature tuples
|
The atomic feature names are listed in a big enum, so that the feature tuples
|
||||||
can refer to them.
|
can refer to them.
|
||||||
"""
|
"""
|
||||||
from libc.string cimport memset
|
# coding: utf-8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from libc.string cimport memset
|
||||||
from itertools import combinations
|
from itertools import combinations
|
||||||
|
from cymem.cymem cimport Pool
|
||||||
|
|
||||||
from ..structs cimport TokenC
|
from ..structs cimport TokenC
|
||||||
|
|
||||||
from .stateclass cimport StateClass
|
from .stateclass cimport StateClass
|
||||||
from ._state cimport StateC
|
from ._state cimport StateC
|
||||||
|
|
||||||
from cymem.cymem cimport Pool
|
|
||||||
|
|
||||||
|
|
||||||
cdef inline void fill_token(atom_t* context, const TokenC* token) nogil:
|
cdef inline void fill_token(atom_t* context, const TokenC* token) nogil:
|
||||||
if token is NULL:
|
if token is NULL:
|
||||||
|
|
|
@ -1,29 +1,26 @@
|
||||||
# cython: profile=True
|
# cython: profile=True
|
||||||
# cython: cdivision=True
|
# cython: cdivision=True
|
||||||
# cython: infer_types=True
|
# cython: infer_types=True
|
||||||
|
# coding: utf-8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
from cpython.ref cimport PyObject, Py_INCREF, Py_XDECREF
|
from cpython.ref cimport PyObject, Py_INCREF, Py_XDECREF
|
||||||
|
|
||||||
import ctypes
|
import ctypes
|
||||||
import os
|
from libc.stdint cimport uint32_t
|
||||||
|
from libc.string cimport memcpy
|
||||||
from ..structs cimport TokenC
|
from cymem.cymem cimport Pool
|
||||||
|
|
||||||
|
from .stateclass cimport StateClass
|
||||||
|
from ._state cimport StateC, is_space_token
|
||||||
|
from .nonproj import PseudoProjectivity
|
||||||
|
from .nonproj import is_nonproj_tree
|
||||||
from .transition_system cimport do_func_t, get_cost_func_t
|
from .transition_system cimport do_func_t, get_cost_func_t
|
||||||
from .transition_system cimport move_cost_func_t, label_cost_func_t
|
from .transition_system cimport move_cost_func_t, label_cost_func_t
|
||||||
from ..gold cimport GoldParse
|
from ..gold cimport GoldParse
|
||||||
from ..gold cimport GoldParseC
|
from ..gold cimport GoldParseC
|
||||||
from ..attrs cimport TAG, HEAD, DEP, ENT_IOB, ENT_TYPE, IS_SPACE
|
from ..attrs cimport TAG, HEAD, DEP, ENT_IOB, ENT_TYPE, IS_SPACE
|
||||||
from ..lexeme cimport Lexeme
|
from ..lexeme cimport Lexeme
|
||||||
|
from ..structs cimport TokenC
|
||||||
from libc.stdint cimport uint32_t
|
|
||||||
from libc.string cimport memcpy
|
|
||||||
|
|
||||||
from cymem.cymem cimport Pool
|
|
||||||
from .stateclass cimport StateClass
|
|
||||||
from ._state cimport StateC, is_space_token
|
|
||||||
from .nonproj import PseudoProjectivity
|
|
||||||
from .nonproj import is_nonproj_tree
|
|
||||||
|
|
||||||
|
|
||||||
DEF NON_MONOTONIC = True
|
DEF NON_MONOTONIC = True
|
||||||
|
@@ -317,17 +314,20 @@ cdef class ArcEager(TransitionSystem):
     def get_actions(cls, **kwargs):
         actions = kwargs.get('actions',
             {
-                SHIFT: {'': True},
-                REDUCE: {'': True},
-                RIGHT: {},
-                LEFT: {},
-                BREAK: {'ROOT': True}})
+                SHIFT: [''],
+                REDUCE: [''],
+                RIGHT: [],
+                LEFT: [],
+                BREAK: ['ROOT']})
+        seen_actions = set()
         for label in kwargs.get('left_labels', []):
             if label.upper() != 'ROOT':
-                actions[LEFT][label] = True
+                if (LEFT, label) not in seen_actions:
+                    actions[LEFT].append(label)
         for label in kwargs.get('right_labels', []):
             if label.upper() != 'ROOT':
-                actions[RIGHT][label] = True
+                if (RIGHT, label) not in seen_actions:
+                    actions[RIGHT].append(label)

         for raw_text, sents in kwargs.get('gold_parses', []):
             for (ids, words, tags, heads, labels, iob), ctnts in sents:
@@ -336,9 +336,11 @@ cdef class ArcEager(TransitionSystem):
                     label = 'ROOT'
                 if label != 'ROOT':
                     if head < child:
-                        actions[RIGHT][label] = True
+                        if (RIGHT, label) not in seen_actions:
+                            actions[RIGHT].append(label)
                     elif head > child:
-                        actions[LEFT][label] = True
+                        if (LEFT, label) not in seen_actions:
+                            actions[LEFT].append(label)
         return actions

     property action_types:
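The change above swaps the per-move label dicts for ordered lists, deduplicated through a `seen_actions` set. A tiny standalone sketch of that bookkeeping pattern follows; the move names are illustrative, and unlike the hunk as displayed the sketch explicitly adds each (move, label) pair to the set.

```python
LEFT, RIGHT = 'LEFT', 'RIGHT'


def collect_actions(left_labels, right_labels):
    actions = {LEFT: [], RIGHT: []}
    seen_actions = set()
    for move, labels in ((LEFT, left_labels), (RIGHT, right_labels)):
        for label in labels:
            if (move, label) not in seen_actions:
                seen_actions.add((move, label))    # remember the pair...
                actions[move].append(label)        # ...so each label is appended once
    return actions


print(collect_actions(['nsubj', 'dobj', 'nsubj'], ['prep']))
# {'LEFT': ['nsubj', 'dobj'], 'RIGHT': ['prep']}
```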
@ -1,50 +1,34 @@
|
||||||
|
"""
|
||||||
|
MALT-style dependency parser
|
||||||
|
"""
|
||||||
# cython: profile=True
|
# cython: profile=True
|
||||||
# cython: experimental_cpp_class_def=True
|
# cython: experimental_cpp_class_def=True
|
||||||
# cython: cdivision=True
|
# cython: cdivision=True
|
||||||
# cython: infer_types=True
|
# cython: infer_types=True
|
||||||
"""
|
# coding: utf-8
|
||||||
MALT-style dependency parser
|
|
||||||
"""
|
from __future__ import unicode_literals, print_function
|
||||||
from __future__ import unicode_literals
|
|
||||||
cimport cython
|
cimport cython
|
||||||
|
|
||||||
from cpython.ref cimport PyObject, Py_INCREF, Py_XDECREF
|
from cpython.ref cimport PyObject, Py_INCREF, Py_XDECREF
|
||||||
|
|
||||||
from libc.stdint cimport uint32_t, uint64_t
|
from libc.stdint cimport uint32_t, uint64_t
|
||||||
from libc.string cimport memset, memcpy
|
from libc.string cimport memset, memcpy
|
||||||
from libc.stdlib cimport rand
|
from libc.stdlib cimport rand
|
||||||
from libc.math cimport log, exp, isnan, isinf
|
from libc.math cimport log, exp, isnan, isinf
|
||||||
import random
|
|
||||||
import os.path
|
|
||||||
from os import path
|
|
||||||
import shutil
|
|
||||||
import json
|
|
||||||
import math
|
|
||||||
|
|
||||||
from cymem.cymem cimport Pool, Address
|
from cymem.cymem cimport Pool, Address
|
||||||
from murmurhash.mrmr cimport real_hash64 as hash64
|
from murmurhash.mrmr cimport real_hash64 as hash64
|
||||||
from thinc.typedefs cimport weight_t, class_t, feat_t, atom_t, hash_t
|
from thinc.typedefs cimport weight_t, class_t, feat_t, atom_t, hash_t
|
||||||
|
|
||||||
|
|
||||||
from util import Config
|
|
||||||
|
|
||||||
from thinc.linear.features cimport ConjunctionExtracter
|
from thinc.linear.features cimport ConjunctionExtracter
|
||||||
from thinc.structs cimport FeatureC, ExampleC
|
from thinc.structs cimport FeatureC, ExampleC
|
||||||
|
from thinc.extra.search cimport Beam, MaxViolation
|
||||||
from thinc.extra.search cimport Beam
|
|
||||||
from thinc.extra.search cimport MaxViolation
|
|
||||||
from thinc.extra.eg cimport Example
|
from thinc.extra.eg cimport Example
|
||||||
from thinc.extra.mb cimport Minibatch
|
from thinc.extra.mb cimport Minibatch
|
||||||
|
|
||||||
from ..structs cimport TokenC
|
from ..structs cimport TokenC
|
||||||
|
|
||||||
from ..tokens.doc cimport Doc
|
from ..tokens.doc cimport Doc
|
||||||
from ..strings cimport StringStore
|
from ..strings cimport StringStore
|
||||||
|
|
||||||
from .transition_system cimport TransitionSystem, Transition
|
from .transition_system cimport TransitionSystem, Transition
|
||||||
|
|
||||||
from ..gold cimport GoldParse
|
from ..gold cimport GoldParse
|
||||||
|
|
||||||
from . import _parse_features
|
from . import _parse_features
|
||||||
from ._parse_features cimport CONTEXT_SIZE
|
from ._parse_features cimport CONTEXT_SIZE
|
||||||
from ._parse_features cimport fill_context
|
from ._parse_features cimport fill_context
|
||||||
|
@ -266,4 +250,3 @@ def is_gold(StateClass state, GoldParse gold, StringStore strings):
|
||||||
id_, word, tag, head, dep, ner = gold.orig_annot[gold.cand_to_gold[i]]
|
id_, word, tag, head, dep, ner = gold.orig_annot[gold.cand_to_gold[i]]
|
||||||
truth.add((id_, head, dep))
|
truth.add((id_, head, dep))
|
||||||
return truth == predicted
|
return truth == predicted
|
||||||
|
|
||||||
|
|
|
@@ -1,9 +1,14 @@
-from spacy.parts_of_speech cimport NOUN, PROPN, PRON
+# coding: utf-8
+from __future__ import unicode_literals
+
+from ..parts_of_speech cimport NOUN, PROPN, PRON


 def english_noun_chunks(obj):
-    '''Detect base noun phrases from a dependency parse.
-    Works on both Doc and Span.'''
+    """
+    Detect base noun phrases from a dependency parse.
+    Works on both Doc and Span.
+    """
     labels = ['nsubj', 'dobj', 'nsubjpass', 'pcomp', 'pobj',
               'attr', 'ROOT', 'root']
     doc = obj.doc # Ensure works on both Doc and Span.
@ -1,17 +1,16 @@
|
||||||
|
# coding: utf-8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
from .transition_system cimport Transition
|
|
||||||
from .transition_system cimport do_func_t
|
|
||||||
|
|
||||||
from ..structs cimport TokenC, Entity
|
|
||||||
|
|
||||||
from thinc.typedefs cimport weight_t
|
from thinc.typedefs cimport weight_t
|
||||||
from ..gold cimport GoldParseC
|
|
||||||
from ..gold cimport GoldParse
|
|
||||||
from ..attrs cimport ENT_TYPE, ENT_IOB
|
|
||||||
|
|
||||||
from .stateclass cimport StateClass
|
from .stateclass cimport StateClass
|
||||||
from ._state cimport StateC
|
from ._state cimport StateC
|
||||||
|
from .transition_system cimport Transition
|
||||||
|
from .transition_system cimport do_func_t
|
||||||
|
from ..structs cimport TokenC, Entity
|
||||||
|
from ..gold cimport GoldParseC
|
||||||
|
from ..gold cimport GoldParse
|
||||||
|
from ..attrs cimport ENT_TYPE, ENT_IOB
|
||||||
|
|
||||||
|
|
||||||
cdef enum:
|
cdef enum:
|
||||||
|
@ -21,6 +20,7 @@ cdef enum:
|
||||||
LAST
|
LAST
|
||||||
UNIT
|
UNIT
|
||||||
OUT
|
OUT
|
||||||
|
ISNT
|
||||||
N_MOVES
|
N_MOVES
|
||||||
|
|
||||||
|
|
||||||
|
@ -31,6 +31,7 @@ MOVE_NAMES[IN] = 'I'
|
||||||
MOVE_NAMES[LAST] = 'L'
|
MOVE_NAMES[LAST] = 'L'
|
||||||
MOVE_NAMES[UNIT] = 'U'
|
MOVE_NAMES[UNIT] = 'U'
|
||||||
MOVE_NAMES[OUT] = 'O'
|
MOVE_NAMES[OUT] = 'O'
|
||||||
|
MOVE_NAMES[ISNT] = 'x'
|
||||||
|
|
||||||
|
|
||||||
cdef do_func_t[N_MOVES] do_funcs
|
cdef do_func_t[N_MOVES] do_funcs
|
||||||
|
@ -54,16 +55,20 @@ cdef class BiluoPushDown(TransitionSystem):
|
||||||
def get_actions(cls, **kwargs):
|
def get_actions(cls, **kwargs):
|
||||||
actions = kwargs.get('actions',
|
actions = kwargs.get('actions',
|
||||||
{
|
{
|
||||||
MISSING: {'': True},
|
MISSING: [''],
|
||||||
BEGIN: {},
|
BEGIN: [],
|
||||||
IN: {},
|
IN: [],
|
||||||
LAST: {},
|
LAST: [],
|
||||||
UNIT: {},
|
UNIT: [],
|
||||||
OUT: {'': True}
|
OUT: ['']
|
||||||
})
|
})
|
||||||
|
seen_entities = set()
|
||||||
for entity_type in kwargs.get('entity_types', []):
|
for entity_type in kwargs.get('entity_types', []):
|
||||||
|
if entity_type in seen_entities:
|
||||||
|
continue
|
||||||
|
seen_entities.add(entity_type)
|
||||||
for action in (BEGIN, IN, LAST, UNIT):
|
for action in (BEGIN, IN, LAST, UNIT):
|
||||||
actions[action][entity_type] = True
|
actions[action].append(entity_type)
|
||||||
moves = ('M', 'B', 'I', 'L', 'U')
|
moves = ('M', 'B', 'I', 'L', 'U')
|
||||||
for raw_text, sents in kwargs.get('gold_parses', []):
|
for raw_text, sents in kwargs.get('gold_parses', []):
|
||||||
for (ids, words, tags, heads, labels, biluo), _ in sents:
|
for (ids, words, tags, heads, labels, biluo), _ in sents:
|
||||||
|
@ -72,8 +77,10 @@ cdef class BiluoPushDown(TransitionSystem):
|
||||||
if ner_tag.count('-') != 1:
|
if ner_tag.count('-') != 1:
|
||||||
raise ValueError(ner_tag)
|
raise ValueError(ner_tag)
|
||||||
_, label = ner_tag.split('-')
|
_, label = ner_tag.split('-')
|
||||||
|
if label not in seen_entities:
|
||||||
|
seen_entities.add(label)
|
||||||
for move_str in ('B', 'I', 'L', 'U'):
|
for move_str in ('B', 'I', 'L', 'U'):
|
||||||
actions[moves.index(move_str)][label] = True
|
actions[moves.index(move_str)].append(label)
|
||||||
return actions
|
return actions
|
||||||
|
|
||||||
property action_types:
|
property action_types:
|
||||||
|
@ -111,11 +118,17 @@ cdef class BiluoPushDown(TransitionSystem):
|
||||||
label = 0
|
label = 0
|
||||||
elif '-' in name:
|
elif '-' in name:
|
||||||
move_str, label_str = name.split('-', 1)
|
move_str, label_str = name.split('-', 1)
|
||||||
|
# Hacky way to denote 'not this entity'
|
||||||
|
if label_str.startswith('!'):
|
||||||
|
label_str = label_str[1:]
|
||||||
|
move_str = 'x'
|
||||||
label = self.strings[label_str]
|
label = self.strings[label_str]
|
||||||
else:
|
else:
|
||||||
move_str = name
|
move_str = name
|
||||||
label = 0
|
label = 0
|
||||||
move = MOVE_NAMES.index(move_str)
|
move = MOVE_NAMES.index(move_str)
|
||||||
|
if move == ISNT:
|
||||||
|
return Transition(clas=0, move=ISNT, label=label, score=0)
|
||||||
for i in range(self.n_moves):
|
for i in range(self.n_moves):
|
||||||
if self.c[i].move == move and self.c[i].label == label:
|
if self.c[i].move == move and self.c[i].label == label:
|
||||||
return self.c[i]
|
return self.c[i]
|
||||||
|
@ -225,6 +238,9 @@ cdef class Begin:
|
||||||
elif g_act == BEGIN:
|
elif g_act == BEGIN:
|
||||||
# B, Gold B --> Label match
|
# B, Gold B --> Label match
|
||||||
return label != g_tag
|
return label != g_tag
|
||||||
|
# Support partial supervision in the form of "not this label"
|
||||||
|
elif g_act == ISNT:
|
||||||
|
return label == g_tag
|
||||||
else:
|
else:
|
||||||
# B, Gold I --> False (P)
|
# B, Gold I --> False (P)
|
||||||
# B, Gold L --> False (P)
|
# B, Gold L --> False (P)
|
||||||
|
@ -359,6 +375,9 @@ cdef class Unit:
|
||||||
elif g_act == UNIT:
|
elif g_act == UNIT:
|
||||||
# U, Gold U --> True iff tag match
|
# U, Gold U --> True iff tag match
|
||||||
return label != g_tag
|
return label != g_tag
|
||||||
|
# Support partial supervision in the form of "not this label"
|
||||||
|
elif g_act == ISNT:
|
||||||
|
return label == g_tag
|
||||||
else:
|
else:
|
||||||
# U, Gold B --> False
|
# U, Gold B --> False
|
||||||
# U, Gold I --> False
|
# U, Gold I --> False
|
||||||
|
@ -388,7 +407,7 @@ cdef class Out:
|
||||||
cdef int g_act = gold.ner[s.B(0)].move
|
cdef int g_act = gold.ner[s.B(0)].move
|
||||||
cdef int g_tag = gold.ner[s.B(0)].label
|
cdef int g_tag = gold.ner[s.B(0)].label
|
||||||
|
|
||||||
if g_act == MISSING:
|
if g_act == MISSING or g_act == ISNT:
|
||||||
return 0
|
return 0
|
||||||
elif g_act == BEGIN:
|
elif g_act == BEGIN:
|
||||||
# O, Gold B --> False
|
# O, Gold B --> False
|
||||||
|
|
|
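The hunks above add an ISNT move ('x' in MOVE_NAMES) so the NER oracle can use partial supervision of the form "this span is anything but label X": lookup_transition reroutes a '!'-prefixed label to ISNT, and the Begin/Unit/Out costs treat a gold ISNT tag as satisfied by any other label. A minimal plain-Python sketch of that naming convention follows (toy data only; the real logic lives in the Cython transition classes above):

```python
# Toy re-implementation of the '!' convention added to lookup_transition.
MOVE_NAMES = ['M', 'B', 'I', 'L', 'U', 'O', 'x']   # 'x' is the new ISNT move

def parse_action_name(name):
    """Split 'B-PERSON' or 'B-!PERSON' into (move, label), mapping '!' to ISNT."""
    if '-' not in name:
        return name, None
    move_str, label_str = name.split('-', 1)
    if label_str.startswith('!'):      # "not this entity"
        label_str = label_str[1:]
        move_str = 'x'
    assert move_str in MOVE_NAMES
    return move_str, label_str

assert parse_action_name('B-!PERSON') == ('x', 'PERSON')
assert parse_action_name('U-ORG') == ('U', 'ORG')
assert parse_action_name('O') == ('O', None)
```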
@ -1,8 +1,9 @@
|
||||||
|
# coding: utf-8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
from copy import copy
|
from copy import copy
|
||||||
|
|
||||||
from ..tokens.doc cimport Doc
|
from ..tokens.doc cimport Doc
|
||||||
from spacy.attrs import DEP, HEAD
|
from ..attrs import DEP, HEAD
|
||||||
|
|
||||||
|
|
||||||
def ancestors(tokenid, heads):
|
def ancestors(tokenid, heads):
|
||||||
|
@ -201,5 +202,3 @@ class PseudoProjectivity:
|
||||||
filtered_sents.append(((ids,words,tags,heads,filtered_labels,iob), ctnts))
|
filtered_sents.append(((ids,words,tags,heads,filtered_labels,iob), ctnts))
|
||||||
filtered.append((raw_text, filtered_sents))
|
filtered.append((raw_text, filtered_sents))
|
||||||
return filtered
|
return filtered
|
||||||
|
|
||||||
|
|
||||||
|
|
|
@ -1,56 +1,44 @@
|
||||||
# cython: infer_types=True
|
|
||||||
"""
|
"""
|
||||||
MALT-style dependency parser
|
MALT-style dependency parser
|
||||||
"""
|
"""
|
||||||
|
# coding: utf-8
|
||||||
|
# cython: infer_types=True
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from collections import Counter
|
||||||
|
import ujson
|
||||||
|
|
||||||
cimport cython
|
cimport cython
|
||||||
cimport cython.parallel
|
cimport cython.parallel
|
||||||
|
|
||||||
from cpython.ref cimport PyObject, Py_INCREF, Py_XDECREF
|
from cpython.ref cimport PyObject, Py_INCREF, Py_XDECREF
|
||||||
from cpython.exc cimport PyErr_CheckSignals
|
from cpython.exc cimport PyErr_CheckSignals
|
||||||
|
|
||||||
from libc.stdint cimport uint32_t, uint64_t
|
from libc.stdint cimport uint32_t, uint64_t
|
||||||
from libc.string cimport memset, memcpy
|
from libc.string cimport memset, memcpy
|
||||||
from libc.stdlib cimport malloc, calloc, free
|
from libc.stdlib cimport malloc, calloc, free
|
||||||
|
|
||||||
import os.path
|
|
||||||
from collections import Counter
|
|
||||||
from os import path
|
|
||||||
import shutil
|
|
||||||
import json
|
|
||||||
import sys
|
|
||||||
from .nonproj import PseudoProjectivity
|
|
||||||
|
|
||||||
from cymem.cymem cimport Pool, Address
|
|
||||||
from murmurhash.mrmr cimport hash64
|
|
||||||
from thinc.typedefs cimport weight_t, class_t, feat_t, atom_t, hash_t
|
from thinc.typedefs cimport weight_t, class_t, feat_t, atom_t, hash_t
|
||||||
from thinc.linear.avgtron cimport AveragedPerceptron
|
from thinc.linear.avgtron cimport AveragedPerceptron
|
||||||
from thinc.linalg cimport VecVec
|
from thinc.linalg cimport VecVec
|
||||||
from thinc.structs cimport SparseArrayC
|
from thinc.structs cimport SparseArrayC, FeatureC, ExampleC
|
||||||
|
from thinc.extra.eg cimport Example
|
||||||
|
from cymem.cymem cimport Pool, Address
|
||||||
|
from murmurhash.mrmr cimport hash64
|
||||||
from preshed.maps cimport MapStruct
|
from preshed.maps cimport MapStruct
|
||||||
from preshed.maps cimport map_get
|
from preshed.maps cimport map_get
|
||||||
|
|
||||||
from thinc.structs cimport FeatureC
|
|
||||||
from thinc.structs cimport ExampleC
|
|
||||||
from thinc.extra.eg cimport Example
|
|
||||||
|
|
||||||
from util import Config
|
|
||||||
|
|
||||||
from ..structs cimport TokenC
|
|
||||||
|
|
||||||
from ..tokens.doc cimport Doc
|
|
||||||
from ..strings cimport StringStore
|
|
||||||
|
|
||||||
from .transition_system import OracleError
|
|
||||||
from .transition_system cimport TransitionSystem, Transition
|
|
||||||
|
|
||||||
from ..gold cimport GoldParse
|
|
||||||
|
|
||||||
from . import _parse_features
|
from . import _parse_features
|
||||||
from ._parse_features cimport CONTEXT_SIZE
|
from ._parse_features cimport CONTEXT_SIZE
|
||||||
from ._parse_features cimport fill_context
|
from ._parse_features cimport fill_context
|
||||||
from .stateclass cimport StateClass
|
from .stateclass cimport StateClass
|
||||||
from ._state cimport StateC
|
from ._state cimport StateC
|
||||||
|
from .nonproj import PseudoProjectivity
|
||||||
|
from .transition_system import OracleError
|
||||||
|
from .transition_system cimport TransitionSystem, Transition
|
||||||
|
from ..structs cimport TokenC
|
||||||
|
from ..tokens.doc cimport Doc
|
||||||
|
from ..strings cimport StringStore
|
||||||
|
from ..gold cimport GoldParse
|
||||||
|
|
||||||
|
|
||||||
USE_FTRL = True
|
USE_FTRL = True
|
||||||
DEBUG = False
|
DEBUG = False
|
||||||
|
@ -80,7 +68,9 @@ cdef class ParserModel(AveragedPerceptron):
|
||||||
return nr_feat
|
return nr_feat
|
||||||
|
|
||||||
def update(self, Example eg, itn=0):
|
def update(self, Example eg, itn=0):
|
||||||
'''Does regression on negative cost. Sort of cute?'''
|
"""
|
||||||
|
Does regression on negative cost. Sort of cute?
|
||||||
|
"""
|
||||||
self.time += 1
|
self.time += 1
|
||||||
cdef int best = arg_max_if_gold(eg.c.scores, eg.c.costs, eg.c.nr_class)
|
cdef int best = arg_max_if_gold(eg.c.scores, eg.c.costs, eg.c.nr_class)
|
||||||
cdef int guess = eg.guess
|
cdef int guess = eg.guess
|
||||||
|
@ -132,10 +122,13 @@ cdef class ParserModel(AveragedPerceptron):
|
||||||
|
|
||||||
|
|
||||||
cdef class Parser:
|
cdef class Parser:
|
||||||
"""Base class of the DependencyParser and EntityRecognizer."""
|
"""
|
||||||
|
Base class of the DependencyParser and EntityRecognizer.
|
||||||
|
"""
|
||||||
@classmethod
|
@classmethod
|
||||||
def load(cls, path, Vocab vocab, TransitionSystem=None, require=False, **cfg):
|
def load(cls, path, Vocab vocab, TransitionSystem=None, require=False, **cfg):
|
||||||
"""Load the statistical model from the supplied path.
|
"""
|
||||||
|
Load the statistical model from the supplied path.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
path (Path):
|
path (Path):
|
||||||
|
@ -148,10 +141,16 @@ cdef class Parser:
|
||||||
The newly constructed object.
|
The newly constructed object.
|
||||||
"""
|
"""
|
||||||
with (path / 'config.json').open() as file_:
|
with (path / 'config.json').open() as file_:
|
||||||
cfg = json.load(file_)
|
cfg = ujson.load(file_)
|
||||||
# TODO: remove this shim when we don't have to support older data
|
# TODO: remove this shim when we don't have to support older data
|
||||||
if 'labels' in cfg and 'actions' not in cfg:
|
if 'labels' in cfg and 'actions' not in cfg:
|
||||||
cfg['actions'] = cfg.pop('labels')
|
cfg['actions'] = cfg.pop('labels')
|
||||||
|
# TODO: remove this shim when we don't have to support older data
|
||||||
|
for action_name, labels in dict(cfg['actions']).items():
|
||||||
|
# We need this to be sorted
|
||||||
|
if isinstance(labels, dict):
|
||||||
|
labels = list(sorted(labels.keys()))
|
||||||
|
cfg['actions'][action_name] = labels
|
||||||
self = cls(vocab, TransitionSystem=TransitionSystem, model=None, **cfg)
|
self = cls(vocab, TransitionSystem=TransitionSystem, model=None, **cfg)
|
||||||
if (path / 'model').exists():
|
if (path / 'model').exists():
|
||||||
self.model.load(str(path / 'model'))
|
self.model.load(str(path / 'model'))
|
||||||
|
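The shim in this hunk converts the older {label: True} storage of cfg['actions'] into the list format that get_actions now emits, sorting the old dict keys so that the class order is at least deterministic when reloading older models. A short sketch of that conversion on a hypothetical config entry:

```python
# Hypothetical old-style config entry; real data comes from config.json.
cfg = {'actions': {'1': {'PERSON': True, 'ORG': True}, '4': {'PERSON': True}}}

for action_name, labels in dict(cfg['actions']).items():
    if isinstance(labels, dict):
        # Old dicts carry no reliable order, so fall back to sorted keys.
        labels = list(sorted(labels.keys()))
    cfg['actions'][action_name] = labels

assert cfg['actions']['1'] == ['ORG', 'PERSON']
```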
@ -161,7 +160,8 @@ cdef class Parser:
|
||||||
return self
|
return self
|
||||||
|
|
||||||
def __init__(self, Vocab vocab, TransitionSystem=None, ParserModel model=None, **cfg):
|
def __init__(self, Vocab vocab, TransitionSystem=None, ParserModel model=None, **cfg):
|
||||||
"""Create a Parser.
|
"""
|
||||||
|
Create a Parser.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
vocab (Vocab):
|
vocab (Vocab):
|
||||||
|
@ -186,12 +186,18 @@ cdef class Parser:
|
||||||
self.model.learn_rate = cfg.get('learn_rate', 0.001)
|
self.model.learn_rate = cfg.get('learn_rate', 0.001)
|
||||||
|
|
||||||
self.cfg = cfg
|
self.cfg = cfg
|
||||||
|
# TODO: This is a pretty hacky fix to the problem of adding more
|
||||||
|
# labels. The issue is that they come in out of order if labels are
|
||||||
|
# added during training
|
||||||
|
for label in cfg.get('extra_labels', []):
|
||||||
|
self.add_label(label)
|
||||||
|
|
||||||
def __reduce__(self):
|
def __reduce__(self):
|
||||||
return (Parser, (self.vocab, self.moves, self.model), None, None)
|
return (Parser, (self.vocab, self.moves, self.model), None, None)
|
||||||
|
|
||||||
def __call__(self, Doc tokens):
|
def __call__(self, Doc tokens):
|
||||||
"""Apply the entity recognizer, setting the annotations onto the Doc object.
|
"""
|
||||||
|
Apply the entity recognizer, setting the annotations onto the Doc object.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
doc (Doc): The document to be processed.
|
doc (Doc): The document to be processed.
|
||||||
|
@ -208,7 +214,8 @@ cdef class Parser:
|
||||||
self.moves.finalize_doc(tokens)
|
self.moves.finalize_doc(tokens)
|
||||||
|
|
||||||
def pipe(self, stream, int batch_size=1000, int n_threads=2):
|
def pipe(self, stream, int batch_size=1000, int n_threads=2):
|
||||||
"""Process a stream of documents.
|
"""
|
||||||
|
Process a stream of documents.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
stream: The sequence of documents to process.
|
stream: The sequence of documents to process.
|
||||||
|
@ -296,7 +303,8 @@ cdef class Parser:
|
||||||
return 0
|
return 0
|
||||||
|
|
||||||
def update(self, Doc tokens, GoldParse gold, itn=0):
|
def update(self, Doc tokens, GoldParse gold, itn=0):
|
||||||
"""Update the statistical model.
|
"""
|
||||||
|
Update the statistical model.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
doc (Doc):
|
doc (Doc):
|
||||||
|
@ -334,15 +342,17 @@ cdef class Parser:
|
||||||
self.moves.finalize_state(stcls.c)
|
self.moves.finalize_state(stcls.c)
|
||||||
return loss
|
return loss
|
||||||
|
|
||||||
def step_through(self, Doc doc):
|
def step_through(self, Doc doc, GoldParse gold=None):
|
||||||
"""Set up a stepwise state, to introspect and control the transition sequence.
|
"""
|
||||||
|
Set up a stepwise state, to introspect and control the transition sequence.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
doc (Doc): The document to step through.
|
doc (Doc): The document to step through.
|
||||||
|
gold (GoldParse): Optional gold parse, used to compute action costs.
|
||||||
Returns (StepwiseState):
|
Returns (StepwiseState):
|
||||||
A state object, to step through the annotation process.
|
A state object, to step through the annotation process.
|
||||||
"""
|
"""
|
||||||
return StepwiseState(self, doc)
|
return StepwiseState(self, doc, gold=gold)
|
||||||
|
|
||||||
def from_transition_sequence(self, Doc doc, sequence):
|
def from_transition_sequence(self, Doc doc, sequence):
|
||||||
"""Control the annotations on a document by specifying a transition sequence
|
"""Control the annotations on a document by specifying a transition sequence
|
||||||
|
@ -360,18 +370,28 @@ cdef class Parser:
|
||||||
def add_label(self, label):
|
def add_label(self, label):
|
||||||
# Doesn't set label into serializer -- subclasses override it to do that.
|
# Doesn't set label into serializer -- subclasses override it to do that.
|
||||||
for action in self.moves.action_types:
|
for action in self.moves.action_types:
|
||||||
self.moves.add_action(action, label)
|
added = self.moves.add_action(action, label)
|
||||||
|
if added:
|
||||||
|
# Important that the labels be stored as a list! We need the
|
||||||
|
# order, or the model goes out of sync
|
||||||
|
self.cfg.setdefault('extra_labels', []).append(label)
|
||||||
|
|
||||||
|
|
||||||
cdef class StepwiseState:
|
cdef class StepwiseState:
|
||||||
cdef readonly StateClass stcls
|
cdef readonly StateClass stcls
|
||||||
cdef readonly Example eg
|
cdef readonly Example eg
|
||||||
cdef readonly Doc doc
|
cdef readonly Doc doc
|
||||||
|
cdef readonly GoldParse gold
|
||||||
cdef readonly Parser parser
|
cdef readonly Parser parser
|
||||||
|
|
||||||
def __init__(self, Parser parser, Doc doc):
|
def __init__(self, Parser parser, Doc doc, GoldParse gold=None):
|
||||||
self.parser = parser
|
self.parser = parser
|
||||||
self.doc = doc
|
self.doc = doc
|
||||||
|
if gold is not None:
|
||||||
|
self.gold = gold
|
||||||
|
self.parser.moves.preprocess_gold(self.gold)
|
||||||
|
else:
|
||||||
|
self.gold = GoldParse(doc)
|
||||||
self.stcls = StateClass.init(doc.c, doc.length)
|
self.stcls = StateClass.init(doc.c, doc.length)
|
||||||
self.parser.moves.initialize_state(self.stcls.c)
|
self.parser.moves.initialize_state(self.stcls.c)
|
||||||
self.eg = Example(
|
self.eg = Example(
|
||||||
|
@ -406,6 +426,24 @@ cdef class StepwiseState:
|
||||||
return [self.doc.vocab.strings[self.stcls.c._sent[i].dep]
|
return [self.doc.vocab.strings[self.stcls.c._sent[i].dep]
|
||||||
for i in range(self.stcls.c.length)]
|
for i in range(self.stcls.c.length)]
|
||||||
|
|
||||||
|
@property
|
||||||
|
def costs(self):
|
||||||
|
"""
|
||||||
|
Find the action-costs for the current state.
|
||||||
|
"""
|
||||||
|
if not self.gold:
|
||||||
|
raise ValueError("Can't set costs: No GoldParse provided")
|
||||||
|
self.parser.moves.set_costs(self.eg.c.is_valid, self.eg.c.costs,
|
||||||
|
self.stcls, self.gold)
|
||||||
|
costs = {}
|
||||||
|
for i in range(self.parser.moves.n_moves):
|
||||||
|
if not self.eg.c.is_valid[i]:
|
||||||
|
continue
|
||||||
|
transition = self.parser.moves.c[i]
|
||||||
|
name = self.parser.moves.move_name(transition.move, transition.label)
|
||||||
|
costs[name] = self.eg.c.costs[i]
|
||||||
|
return costs
|
||||||
|
|
||||||
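With gold threaded through step_through and StepwiseState, the new costs property exposes the oracle's view of the current parse state. A hypothetical usage sketch; parser, doc and gold stand in for a loaded Parser, a Doc and a GoldParse, and only attributes visible in this diff are used:

```python
# Zero-cost moves are the ones consistent with the gold annotation at this state.
state = parser.step_through(doc, gold=gold)   # 'parser', 'doc', 'gold' are stand-ins
costs = state.costs                           # {move name: cost}, valid moves only
zero_cost = [name for name, cost in costs.items() if cost == 0]
print(zero_cost)
```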
def predict(self):
|
def predict(self):
|
||||||
self.eg.reset()
|
self.eg.reset()
|
||||||
self.eg.c.nr_feat = self.parser.model.set_featuresC(self.eg.c.atoms, self.eg.c.features,
|
self.eg.c.nr_feat = self.parser.model.set_featuresC(self.eg.c.atoms, self.eg.c.features,
|
||||||
|
|
|
@ -1,5 +1,9 @@
|
||||||
|
# coding: utf-8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
from libc.string cimport memcpy, memset
|
from libc.string cimport memcpy, memset
|
||||||
from libc.stdint cimport uint32_t
|
from libc.stdint cimport uint32_t
|
||||||
|
|
||||||
from ..vocab cimport EMPTY_LEXEME
|
from ..vocab cimport EMPTY_LEXEME
|
||||||
from ..structs cimport Entity
|
from ..structs cimport Entity
|
||||||
from ..lexeme cimport Lexeme
|
from ..lexeme cimport Lexeme
|
||||||
|
|
|
@ -1,4 +1,8 @@
|
||||||
# cython: infer_types=True
|
# cython: infer_types=True
|
||||||
|
# coding: utf-8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
from cpython.ref cimport PyObject, Py_INCREF, Py_XDECREF
|
||||||
from cymem.cymem cimport Pool
|
from cymem.cymem cimport Pool
|
||||||
from thinc.typedefs cimport weight_t
|
from thinc.typedefs cimport weight_t
|
||||||
from collections import defaultdict
|
from collections import defaultdict
|
||||||
|
@ -6,7 +10,6 @@ from collections import defaultdict
|
||||||
from ..structs cimport TokenC
|
from ..structs cimport TokenC
|
||||||
from .stateclass cimport StateClass
|
from .stateclass cimport StateClass
|
||||||
from ..attrs cimport TAG, HEAD, DEP, ENT_TYPE, ENT_IOB
|
from ..attrs cimport TAG, HEAD, DEP, ENT_TYPE, ENT_IOB
|
||||||
from cpython.ref cimport PyObject, Py_INCREF, Py_XDECREF
|
|
||||||
|
|
||||||
|
|
||||||
cdef weight_t MIN_SCORE = -90000
|
cdef weight_t MIN_SCORE = -90000
|
||||||
|
@ -32,7 +35,7 @@ cdef class TransitionSystem:
|
||||||
self.c = <Transition*>self.mem.alloc(self._size, sizeof(Transition))
|
self.c = <Transition*>self.mem.alloc(self._size, sizeof(Transition))
|
||||||
|
|
||||||
for action, label_strs in sorted(labels_by_action.items()):
|
for action, label_strs in sorted(labels_by_action.items()):
|
||||||
for label_str in sorted(label_strs):
|
for label_str in label_strs:
|
||||||
self.add_action(int(action), label_str)
|
self.add_action(int(action), label_str)
|
||||||
self.root_label = self.strings['ROOT']
|
self.root_label = self.strings['ROOT']
|
||||||
self.freqs = {} if _freqs is None else _freqs
|
self.freqs = {} if _freqs is None else _freqs
|
||||||
|
|
|
@ -1,18 +0,0 @@
|
||||||
from os import path
|
|
||||||
import json
|
|
||||||
|
|
||||||
class Config(object):
|
|
||||||
def __init__(self, **kwargs):
|
|
||||||
for key, value in kwargs.items():
|
|
||||||
setattr(self, key, value)
|
|
||||||
|
|
||||||
def get(self, attr, default=None):
|
|
||||||
return self.__dict__.get(attr, default)
|
|
||||||
|
|
||||||
@classmethod
|
|
||||||
def write(cls, model_dir, name, **kwargs):
|
|
||||||
open(path.join(model_dir, '%s.json' % name), 'w').write(json.dumps(kwargs))
|
|
||||||
|
|
||||||
@classmethod
|
|
||||||
def read(cls, model_dir, name):
|
|
||||||
return cls(**json.load(open(path.join(model_dir, '%s.json' % name))))
|
|
|
@ -1,5 +1,7 @@
|
||||||
import json
|
# coding: utf8
|
||||||
import pathlib
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import ujson
|
||||||
from collections import defaultdict
|
from collections import defaultdict
|
||||||
|
|
||||||
from cymem.cymem cimport Pool
|
from cymem.cymem cimport Pool
|
||||||
|
@ -12,8 +14,8 @@ from thinc.linalg cimport VecVec
|
||||||
from .tokens.doc cimport Doc
|
from .tokens.doc cimport Doc
|
||||||
from .attrs cimport TAG
|
from .attrs cimport TAG
|
||||||
from .gold cimport GoldParse
|
from .gold cimport GoldParse
|
||||||
|
|
||||||
from .attrs cimport *
|
from .attrs cimport *
|
||||||
|
from . import util
|
||||||
|
|
||||||
|
|
||||||
cpdef enum:
|
cpdef enum:
|
||||||
|
@ -106,10 +108,13 @@ cdef inline void _fill_from_token(atom_t* context, const TokenC* t) nogil:
|
||||||
|
|
||||||
|
|
||||||
cdef class Tagger:
|
cdef class Tagger:
|
||||||
"""Annotate part-of-speech tags on Doc objects."""
|
"""
|
||||||
|
Annotate part-of-speech tags on Doc objects.
|
||||||
|
"""
|
||||||
@classmethod
|
@classmethod
|
||||||
def load(cls, path, vocab, require=False):
|
def load(cls, path, vocab, require=False):
|
||||||
"""Load the statistical model from the supplied path.
|
"""
|
||||||
|
Load the statistical model from the supplied path.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
path (Path):
|
path (Path):
|
||||||
|
@ -123,10 +128,10 @@ cdef class Tagger:
|
||||||
"""
|
"""
|
||||||
# TODO: Change this to expect config.json when we don't have to
|
# TODO: Change this to expect config.json when we don't have to
|
||||||
# support old data.
|
# support old data.
|
||||||
path = path if not isinstance(path, basestring) else pathlib.Path(path)
|
path = util.ensure_path(path)
|
||||||
if (path / 'templates.json').exists():
|
if (path / 'templates.json').exists():
|
||||||
with (path / 'templates.json').open('r', encoding='utf8') as file_:
|
with (path / 'templates.json').open('r', encoding='utf8') as file_:
|
||||||
templates = json.load(file_)
|
templates = ujson.load(file_)
|
||||||
elif require:
|
elif require:
|
||||||
raise IOError(
|
raise IOError(
|
||||||
"Required file %s/templates.json not found when loading Tagger" % str(path))
|
"Required file %s/templates.json not found when loading Tagger" % str(path))
|
||||||
|
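The util.ensure_path call that replaces the inline basestring/pathlib check here (and in the Tokenizer and tests below) is assumed to be a small normalising helper along these lines; this is a sketch only, the real function lives in spacy/util.py and may differ in detail:

```python
from pathlib import Path

def ensure_path(path):
    # Accept either a string or an existing Path-like object.
    if isinstance(path, str):   # under Python 2 this would be basestring
        return Path(path)
    return path
```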
@ -142,7 +147,8 @@ cdef class Tagger:
|
||||||
return self
|
return self
|
||||||
|
|
||||||
def __init__(self, Vocab vocab, TaggerModel model=None, **cfg):
|
def __init__(self, Vocab vocab, TaggerModel model=None, **cfg):
|
||||||
"""Create a Tagger.
|
"""
|
||||||
|
Create a Tagger.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
vocab (Vocab):
|
vocab (Vocab):
|
||||||
|
@ -180,7 +186,8 @@ cdef class Tagger:
|
||||||
tokens._py_tokens = [None] * tokens.length
|
tokens._py_tokens = [None] * tokens.length
|
||||||
|
|
||||||
def __call__(self, Doc tokens):
|
def __call__(self, Doc tokens):
|
||||||
"""Apply the tagger, setting the POS tags onto the Doc object.
|
"""
|
||||||
|
Apply the tagger, setting the POS tags onto the Doc object.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
doc (Doc): The tokens to be tagged.
|
doc (Doc): The tokens to be tagged.
|
||||||
|
@ -208,7 +215,8 @@ cdef class Tagger:
|
||||||
tokens._py_tokens = [None] * tokens.length
|
tokens._py_tokens = [None] * tokens.length
|
||||||
|
|
||||||
def pipe(self, stream, batch_size=1000, n_threads=2):
|
def pipe(self, stream, batch_size=1000, n_threads=2):
|
||||||
"""Tag a stream of documents.
|
"""
|
||||||
|
Tag a stream of documents.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
stream: The sequence of documents to tag.
|
stream: The sequence of documents to tag.
|
||||||
|
@ -225,7 +233,8 @@ cdef class Tagger:
|
||||||
yield doc
|
yield doc
|
||||||
|
|
||||||
def update(self, Doc tokens, GoldParse gold, itn=0):
|
def update(self, Doc tokens, GoldParse gold, itn=0):
|
||||||
"""Update the statistical model, with tags supplied for the given document.
|
"""
|
||||||
|
Update the statistical model, with tags supplied for the given document.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
doc (Doc):
|
doc (Doc):
|
||||||
|
|
|
@ -3,15 +3,21 @@ from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
ABBREVIATION_TESTS = [
|
|
||||||
('פייתון היא שפת תכנות דינמית', ['פייתון', 'היא', 'שפת', 'תכנות', 'דינמית'])
|
|
||||||
]
|
|
||||||
|
|
||||||
TESTCASES = ABBREVIATION_TESTS
|
@pytest.mark.parametrize('text,expected_tokens',
|
||||||
|
[('פייתון היא שפת תכנות דינמית', ['פייתון', 'היא', 'שפת', 'תכנות', 'דינמית'])])
|
||||||
|
def test_tokenizer_handles_abbreviation(he_tokenizer, text, expected_tokens):
|
||||||
@pytest.mark.parametrize('text,expected_tokens', TESTCASES)
|
|
||||||
def test_tokenizer_handles_testcases(he_tokenizer, text, expected_tokens):
|
|
||||||
tokens = he_tokenizer(text)
|
tokens = he_tokenizer(text)
|
||||||
token_list = [token.text for token in tokens if not token.is_space]
|
token_list = [token.text for token in tokens if not token.is_space]
|
||||||
assert expected_tokens == token_list
|
assert expected_tokens == token_list
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('text,expected_tokens', [
|
||||||
|
pytest.mark.xfail(('עקבת אחריו בכל רחבי המדינה.', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '.'])),
|
||||||
|
('עקבת אחריו בכל רחבי המדינה?', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '?']),
|
||||||
|
('עקבת אחריו בכל רחבי המדינה!', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '!']),
|
||||||
|
('עקבת אחריו בכל רחבי המדינה..', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '..']),
|
||||||
|
('עקבת אחריו בכל רחבי המדינה...', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '...'])])
|
||||||
|
def test_tokenizer_handles_punct(he_tokenizer, text, expected_tokens):
|
||||||
|
tokens = he_tokenizer(text)
|
||||||
|
assert expected_tokens == [token.text for token in tokens]
|
||||||
|
|
|
@ -16,6 +16,7 @@ def test_tagger_lemmatizer_noun_lemmas(lemmatizer, text, lemmas):
|
||||||
assert lemmatizer.noun(text) == set(lemmas)
|
assert lemmatizer.noun(text) == set(lemmas)
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.xfail
|
||||||
@pytest.mark.models
|
@pytest.mark.models
|
||||||
def test_tagger_lemmatizer_base_forms(lemmatizer):
|
def test_tagger_lemmatizer_base_forms(lemmatizer):
|
||||||
if lemmatizer is None:
|
if lemmatizer is None:
|
||||||
|
|
|
@ -3,9 +3,8 @@ from __future__ import unicode_literals
|
||||||
|
|
||||||
from ...vocab import Vocab
|
from ...vocab import Vocab
|
||||||
from ...tokenizer import Tokenizer
|
from ...tokenizer import Tokenizer
|
||||||
from ...util import utf8open
|
from ... import util
|
||||||
|
|
||||||
from os import path
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
|
@ -75,8 +74,8 @@ Phasellus tincidunt, augue quis porta finibus, massa sapien consectetur augue, n
|
||||||
|
|
||||||
@pytest.mark.parametrize('file_name', ["sun.txt"])
|
@pytest.mark.parametrize('file_name', ["sun.txt"])
|
||||||
def test_tokenizer_handle_text_from_file(tokenizer, file_name):
|
def test_tokenizer_handle_text_from_file(tokenizer, file_name):
|
||||||
loc = path.join(path.dirname(__file__), file_name)
|
loc = util.ensure_path(__file__).parent / file_name
|
||||||
text = utf8open(loc).read()
|
text = loc.open('r', encoding='utf8').read()
|
||||||
assert len(text) != 0
|
assert len(text) != 0
|
||||||
tokens = tokenizer(text)
|
tokens = tokenizer(text)
|
||||||
assert len(tokens) > 100
|
assert len(tokens) > 100
|
||||||
|
|
|
@ -1,17 +1,11 @@
|
||||||
# cython: embedsignature=True
|
# cython: embedsignature=True
|
||||||
|
# coding: utf8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import pathlib
|
import ujson
|
||||||
|
|
||||||
from cython.operator cimport dereference as deref
|
from cython.operator cimport dereference as deref
|
||||||
from cython.operator cimport preincrement as preinc
|
from cython.operator cimport preincrement as preinc
|
||||||
|
|
||||||
try:
|
|
||||||
import ujson as json
|
|
||||||
except ImportError:
|
|
||||||
import json
|
|
||||||
|
|
||||||
|
|
||||||
from cymem.cymem cimport Pool
|
from cymem.cymem cimport Pool
|
||||||
from preshed.maps cimport PreshMap
|
from preshed.maps cimport PreshMap
|
||||||
|
|
||||||
|
@ -23,11 +17,14 @@ from .tokens.doc cimport Doc
|
||||||
|
|
||||||
|
|
||||||
cdef class Tokenizer:
|
cdef class Tokenizer:
|
||||||
"""Segment text, and create Doc objects with the discovered segment boundaries."""
|
"""
|
||||||
|
Segment text, and create Doc objects with the discovered segment boundaries.
|
||||||
|
"""
|
||||||
@classmethod
|
@classmethod
|
||||||
def load(cls, path, Vocab vocab, rules=None, prefix_search=None, suffix_search=None,
|
def load(cls, path, Vocab vocab, rules=None, prefix_search=None, suffix_search=None,
|
||||||
infix_finditer=None, token_match=None):
|
infix_finditer=None, token_match=None):
|
||||||
'''Load a Tokenizer, reading unsupplied components from the path.
|
"""
|
||||||
|
Load a Tokenizer, reading unsupplied components from the path.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
path (Path):
|
path (Path):
|
||||||
|
@ -45,13 +42,11 @@ cdef class Tokenizer:
|
||||||
infix_finditer:
|
infix_finditer:
|
||||||
Signature of re.compile(string).finditer
|
Signature of re.compile(string).finditer
|
||||||
Returns Tokenizer
|
Returns Tokenizer
|
||||||
'''
|
"""
|
||||||
if isinstance(path, basestring):
|
path = util.ensure_path(path)
|
||||||
path = pathlib.Path(path)
|
|
||||||
|
|
||||||
if rules is None:
|
if rules is None:
|
||||||
with (path / 'tokenizer' / 'specials.json').open('r', encoding='utf8') as file_:
|
with (path / 'tokenizer' / 'specials.json').open('r', encoding='utf8') as file_:
|
||||||
rules = json.load(file_)
|
rules = ujson.load(file_)
|
||||||
if prefix_search in (None, True):
|
if prefix_search in (None, True):
|
||||||
with (path / 'tokenizer' / 'prefix.txt').open() as file_:
|
with (path / 'tokenizer' / 'prefix.txt').open() as file_:
|
||||||
entries = file_.read().split('\n')
|
entries = file_.read().split('\n')
|
||||||
|
@ -67,7 +62,8 @@ cdef class Tokenizer:
|
||||||
return cls(vocab, rules, prefix_search, suffix_search, infix_finditer, token_match)
|
return cls(vocab, rules, prefix_search, suffix_search, infix_finditer, token_match)
|
||||||
|
|
||||||
def __init__(self, Vocab vocab, rules, prefix_search, suffix_search, infix_finditer, token_match=None):
|
def __init__(self, Vocab vocab, rules, prefix_search, suffix_search, infix_finditer, token_match=None):
|
||||||
'''Create a Tokenizer, to create Doc objects given unicode text.
|
"""
|
||||||
|
Create a Tokenizer, to create Doc objects given unicode text.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
vocab (Vocab):
|
vocab (Vocab):
|
||||||
|
@ -85,7 +81,7 @@ cdef class Tokenizer:
|
||||||
to find infixes.
|
to find infixes.
|
||||||
token_match:
|
token_match:
|
||||||
A boolean function matching strings that become tokens.
|
A boolean function matching strings that become tokens.
|
||||||
'''
|
"""
|
||||||
self.mem = Pool()
|
self.mem = Pool()
|
||||||
self._cache = PreshMap()
|
self._cache = PreshMap()
|
||||||
self._specials = PreshMap()
|
self._specials = PreshMap()
|
||||||
|
@ -117,7 +113,8 @@ cdef class Tokenizer:
|
||||||
|
|
||||||
@cython.boundscheck(False)
|
@cython.boundscheck(False)
|
||||||
def __call__(self, unicode string):
|
def __call__(self, unicode string):
|
||||||
"""Tokenize a string.
|
"""
|
||||||
|
Tokenize a string.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
string (unicode): The string to tokenize.
|
string (unicode): The string to tokenize.
|
||||||
|
@ -170,7 +167,8 @@ cdef class Tokenizer:
|
||||||
return tokens
|
return tokens
|
||||||
|
|
||||||
def pipe(self, texts, batch_size=1000, n_threads=2):
|
def pipe(self, texts, batch_size=1000, n_threads=2):
|
||||||
"""Tokenize a stream of texts.
|
"""
|
||||||
|
Tokenize a stream of texts.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
texts: A sequence of unicode texts.
|
texts: A sequence of unicode texts.
|
||||||
|
@ -324,7 +322,8 @@ cdef class Tokenizer:
|
||||||
self._cache.set(key, cached)
|
self._cache.set(key, cached)
|
||||||
|
|
||||||
def find_infix(self, unicode string):
|
def find_infix(self, unicode string):
|
||||||
"""Find internal split points of the string, such as hyphens.
|
"""
|
||||||
|
Find internal split points of the string, such as hyphens.
|
||||||
|
|
||||||
string (unicode): The string to segment.
|
string (unicode): The string to segment.
|
||||||
|
|
||||||
|
@ -337,7 +336,8 @@ cdef class Tokenizer:
|
||||||
return list(self.infix_finditer(string))
|
return list(self.infix_finditer(string))
|
||||||
|
|
||||||
def find_prefix(self, unicode string):
|
def find_prefix(self, unicode string):
|
||||||
"""Find the length of a prefix that should be segmented from the string,
|
"""
|
||||||
|
Find the length of a prefix that should be segmented from the string,
|
||||||
or 0 if no prefix rules match.
|
or 0 if no prefix rules match.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
|
@ -350,7 +350,8 @@ cdef class Tokenizer:
|
||||||
return (match.end() - match.start()) if match is not None else 0
|
return (match.end() - match.start()) if match is not None else 0
|
||||||
|
|
||||||
def find_suffix(self, unicode string):
|
def find_suffix(self, unicode string):
|
||||||
"""Find the length of a suffix that should be segmented from the string,
|
"""
|
||||||
|
Find the length of a suffix that should be segmented from the string,
|
||||||
or 0 if no suffix rules match.
|
or 0 if no suffix rules match.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
|
@ -363,13 +364,15 @@ cdef class Tokenizer:
|
||||||
return (match.end() - match.start()) if match is not None else 0
|
return (match.end() - match.start()) if match is not None else 0
|
||||||
|
|
||||||
def _load_special_tokenization(self, special_cases):
|
def _load_special_tokenization(self, special_cases):
|
||||||
'''Add special-case tokenization rules.
|
"""
|
||||||
'''
|
Add special-case tokenization rules.
|
||||||
|
"""
|
||||||
for chunk, substrings in sorted(special_cases.items()):
|
for chunk, substrings in sorted(special_cases.items()):
|
||||||
self.add_special_case(chunk, substrings)
|
self.add_special_case(chunk, substrings)
|
||||||
|
|
||||||
def add_special_case(self, unicode string, substrings):
|
def add_special_case(self, unicode string, substrings):
|
||||||
'''Add a special-case tokenization rule.
|
"""
|
||||||
|
Add a special-case tokenization rule.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
string (unicode): The string to specially tokenize.
|
string (unicode): The string to specially tokenize.
|
||||||
|
@ -378,7 +381,7 @@ cdef class Tokenizer:
|
||||||
attributes. The ORTH fields of the attributes must exactly match
|
attributes. The ORTH fields of the attributes must exactly match
|
||||||
the string when they are concatenated.
|
the string when they are concatenated.
|
||||||
Returns None
|
Returns None
|
||||||
'''
|
"""
|
||||||
substrings = list(substrings)
|
substrings = list(substrings)
|
||||||
cached = <_Cached*>self.mem.alloc(1, sizeof(_Cached))
|
cached = <_Cached*>self.mem.alloc(1, sizeof(_Cached))
|
||||||
cached.length = len(substrings)
|
cached.length = len(substrings)
|
||||||
|
|
|
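The constraint documented for add_special_case above, that the ORTH fields of the attribute dicts must concatenate back to the original string, is easiest to see with a concrete rule. A hypothetical special case; tokenizer stands in for a Tokenizer constructed as in this file:

```python
# The ORTH pieces u'do' + u"n't" concatenate exactly to the string being added.
from spacy.attrs import ORTH, LEMMA

tokenizer.add_special_case(u"don't",             # 'tokenizer' is a stand-in
    [{ORTH: u"do"}, {ORTH: u"n't", LEMMA: u"not"}])
```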
@ -1,15 +1,18 @@
|
||||||
|
# coding: utf8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
cimport cython
|
cimport cython
|
||||||
|
cimport numpy as np
|
||||||
|
import numpy
|
||||||
|
import numpy.linalg
|
||||||
|
import struct
|
||||||
|
|
||||||
from libc.string cimport memcpy, memset
|
from libc.string cimport memcpy, memset
|
||||||
from libc.stdint cimport uint32_t
|
from libc.stdint cimport uint32_t
|
||||||
from libc.math cimport sqrt
|
from libc.math cimport sqrt
|
||||||
|
|
||||||
import numpy
|
from .span cimport Span
|
||||||
import numpy.linalg
|
from .token cimport Token
|
||||||
import struct
|
|
||||||
cimport numpy as np
|
|
||||||
import six
|
|
||||||
import warnings
|
|
||||||
|
|
||||||
from ..lexeme cimport Lexeme
|
from ..lexeme cimport Lexeme
|
||||||
from ..lexeme cimport EMPTY_LEXEME
|
from ..lexeme cimport EMPTY_LEXEME
|
||||||
from ..typedefs cimport attr_t, flags_t
|
from ..typedefs cimport attr_t, flags_t
|
||||||
|
@ -19,11 +22,10 @@ from ..attrs cimport POS, LEMMA, TAG, DEP, HEAD, SPACY, ENT_IOB, ENT_TYPE
|
||||||
from ..parts_of_speech cimport CCONJ, PUNCT, NOUN
|
from ..parts_of_speech cimport CCONJ, PUNCT, NOUN
|
||||||
from ..parts_of_speech cimport univ_pos_t
|
from ..parts_of_speech cimport univ_pos_t
|
||||||
from ..lexeme cimport Lexeme
|
from ..lexeme cimport Lexeme
|
||||||
from .span cimport Span
|
|
||||||
from .token cimport Token
|
|
||||||
from ..serialize.bits cimport BitArray
|
from ..serialize.bits cimport BitArray
|
||||||
from ..util import normalize_slice
|
from ..util import normalize_slice
|
||||||
from ..syntax.iterators import CHUNKERS
|
from ..syntax.iterators import CHUNKERS
|
||||||
|
from ..compat import is_config
|
||||||
|
|
||||||
|
|
||||||
DEF PADDING = 5
|
DEF PADDING = 5
|
||||||
|
@ -76,7 +78,7 @@ cdef class Doc:
|
||||||
|
|
||||||
"""
|
"""
|
||||||
def __init__(self, Vocab vocab, words=None, spaces=None, orths_and_spaces=None):
|
def __init__(self, Vocab vocab, words=None, spaces=None, orths_and_spaces=None):
|
||||||
'''
|
"""
|
||||||
Create a Doc object.
|
Create a Doc object.
|
||||||
|
|
||||||
Aside: Implementation
|
Aside: Implementation
|
||||||
|
@ -97,7 +99,7 @@ cdef class Doc:
|
||||||
A list of boolean values, of the same length as words. True
|
A list of boolean values, of the same length as words. True
|
||||||
means that the word is followed by a space, False means it is not.
|
means that the word is followed by a space, False means it is not.
|
||||||
If None, defaults to [True]*len(words)
|
If None, defaults to [True]*len(words)
|
||||||
'''
|
"""
|
||||||
self.vocab = vocab
|
self.vocab = vocab
|
||||||
size = 20
|
size = 20
|
||||||
self.mem = Pool()
|
self.mem = Pool()
|
||||||
|
@ -158,7 +160,7 @@ cdef class Doc:
|
||||||
self.is_parsed = True
|
self.is_parsed = True
|
||||||
|
|
||||||
def __getitem__(self, object i):
|
def __getitem__(self, object i):
|
||||||
'''
|
"""
|
||||||
doc[i]
|
doc[i]
|
||||||
Get the Token object at position i, where i is an integer.
|
Get the Token object at position i, where i is an integer.
|
||||||
Negative indexing is supported, and follows the usual Python
|
Negative indexing is supported, and follows the usual Python
|
||||||
|
@ -172,7 +174,7 @@ cdef class Doc:
|
||||||
are not supported, as `Span` objects must be contiguous (cannot have gaps).
|
are not supported, as `Span` objects must be contiguous (cannot have gaps).
|
||||||
You can use negative indices and open-ended ranges, which have their
|
You can use negative indices and open-ended ranges, which have their
|
||||||
normal Python semantics.
|
normal Python semantics.
|
||||||
'''
|
"""
|
||||||
if isinstance(i, slice):
|
if isinstance(i, slice):
|
||||||
start, stop = normalize_slice(len(self), i.start, i.stop, i.step)
|
start, stop = normalize_slice(len(self), i.start, i.stop, i.step)
|
||||||
return Span(self, start, stop, label=0)
|
return Span(self, start, stop, label=0)
|
||||||
|
@ -186,7 +188,7 @@ cdef class Doc:
|
||||||
return Token.cinit(self.vocab, &self.c[i], i, self)
|
return Token.cinit(self.vocab, &self.c[i], i, self)
|
||||||
|
|
||||||
def __iter__(self):
|
def __iter__(self):
|
||||||
'''
|
"""
|
||||||
for token in doc
|
for token in doc
|
||||||
Iterate over `Token` objects, from which the annotations can
|
Iterate over `Token` objects, from which the annotations can
|
||||||
be easily accessed. This is the main way of accessing Token
|
be easily accessed. This is the main way of accessing Token
|
||||||
|
@ -194,7 +196,7 @@ cdef class Doc:
|
||||||
Python. If faster-than-Python speeds are required, you can
|
Python. If faster-than-Python speeds are required, you can
|
||||||
instead access the annotations as a numpy array, or access the
|
instead access the annotations as a numpy array, or access the
|
||||||
underlying C data directly from Cython.
|
underlying C data directly from Cython.
|
||||||
'''
|
"""
|
||||||
cdef int i
|
cdef int i
|
||||||
for i in range(self.length):
|
for i in range(self.length):
|
||||||
if self._py_tokens[i] is not None:
|
if self._py_tokens[i] is not None:
|
||||||
|
@ -203,10 +205,10 @@ cdef class Doc:
|
||||||
yield Token.cinit(self.vocab, &self.c[i], i, self)
|
yield Token.cinit(self.vocab, &self.c[i], i, self)
|
||||||
|
|
||||||
def __len__(self):
|
def __len__(self):
|
||||||
'''
|
"""
|
||||||
len(doc)
|
len(doc)
|
||||||
The number of tokens in the document.
|
The number of tokens in the document.
|
||||||
'''
|
"""
|
||||||
return self.length
|
return self.length
|
||||||
|
|
||||||
def __unicode__(self):
|
def __unicode__(self):
|
||||||
|
@ -216,7 +218,7 @@ cdef class Doc:
|
||||||
return u''.join([t.text_with_ws for t in self]).encode('utf-8')
|
return u''.join([t.text_with_ws for t in self]).encode('utf-8')
|
||||||
|
|
||||||
def __str__(self):
|
def __str__(self):
|
||||||
if six.PY3:
|
if is_config(python3=True):
|
||||||
return self.__unicode__()
|
return self.__unicode__()
|
||||||
return self.__bytes__()
|
return self.__bytes__()
|
||||||
|
|
||||||
|
@ -228,7 +230,8 @@ cdef class Doc:
|
||||||
return self
|
return self
|
||||||
|
|
||||||
def similarity(self, other):
|
def similarity(self, other):
|
||||||
'''Make a semantic similarity estimate. The default estimate is cosine
|
"""
|
||||||
|
Make a semantic similarity estimate. The default estimate is cosine
|
||||||
similarity using an average of word vectors.
|
similarity using an average of word vectors.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
|
@ -237,7 +240,7 @@ cdef class Doc:
|
||||||
|
|
||||||
Return:
|
Return:
|
||||||
score (float): A scalar similarity score. Higher is more similar.
|
score (float): A scalar similarity score. Higher is more similar.
|
||||||
'''
|
"""
|
||||||
if 'similarity' in self.user_hooks:
|
if 'similarity' in self.user_hooks:
|
||||||
return self.user_hooks['similarity'](self, other)
|
return self.user_hooks['similarity'](self, other)
|
||||||
if self.vector_norm == 0 or other.vector_norm == 0:
|
if self.vector_norm == 0 or other.vector_norm == 0:
|
||||||
|
@ -245,9 +248,9 @@ cdef class Doc:
|
||||||
return numpy.dot(self.vector, other.vector) / (self.vector_norm * other.vector_norm)
|
return numpy.dot(self.vector, other.vector) / (self.vector_norm * other.vector_norm)
|
||||||
|
|
||||||
property has_vector:
|
property has_vector:
|
||||||
'''
|
"""
|
||||||
A boolean value indicating whether a word vector is associated with the object.
|
A boolean value indicating whether a word vector is associated with the object.
|
||||||
'''
|
"""
|
||||||
def __get__(self):
|
def __get__(self):
|
||||||
if 'has_vector' in self.user_hooks:
|
if 'has_vector' in self.user_hooks:
|
||||||
return self.user_hooks['has_vector'](self)
|
return self.user_hooks['has_vector'](self)
|
||||||
|
@ -255,11 +258,11 @@ cdef class Doc:
|
||||||
return any(token.has_vector for token in self)
|
return any(token.has_vector for token in self)
|
||||||
|
|
||||||
property vector:
|
property vector:
|
||||||
'''
|
"""
|
||||||
A real-valued meaning representation. Defaults to an average of the token vectors.
|
A real-valued meaning representation. Defaults to an average of the token vectors.
|
||||||
|
|
||||||
Type: numpy.ndarray[ndim=1, dtype='float32']
|
Type: numpy.ndarray[ndim=1, dtype='float32']
|
||||||
'''
|
"""
|
||||||
def __get__(self):
|
def __get__(self):
|
||||||
if 'vector' in self.user_hooks:
|
if 'vector' in self.user_hooks:
|
||||||
return self.user_hooks['vector'](self)
|
return self.user_hooks['vector'](self)
|
||||||
|
@ -294,17 +297,21 @@ cdef class Doc:
|
||||||
return self.text
|
return self.text
|
||||||
|
|
||||||
property text:
|
property text:
|
||||||
'''A unicode representation of the document text.'''
|
"""
|
||||||
|
A unicode representation of the document text.
|
||||||
|
"""
|
||||||
def __get__(self):
|
def __get__(self):
|
||||||
return u''.join(t.text_with_ws for t in self)
|
return u''.join(t.text_with_ws for t in self)
|
||||||
|
|
||||||
property text_with_ws:
|
property text_with_ws:
|
||||||
'''An alias of Doc.text, provided for duck-type compatibility with Span and Token.'''
|
"""
|
||||||
|
An alias of Doc.text, provided for duck-type compatibility with Span and Token.
|
||||||
|
"""
|
||||||
def __get__(self):
|
def __get__(self):
|
||||||
return self.text
|
return self.text
|
||||||
|
|
||||||
property ents:
|
property ents:
|
||||||
'''
|
"""
|
||||||
Yields named-entity `Span` objects, if the entity recognizer
|
Yields named-entity `Span` objects, if the entity recognizer
|
||||||
has been applied to the document. Iterate over the span to get
|
has been applied to the document. Iterate over the span to get
|
||||||
individual Token objects, or access the label:
|
individual Token objects, or access the label:
|
||||||
|
@ -318,7 +325,7 @@ cdef class Doc:
|
||||||
assert ents[0].label_ == 'PERSON'
|
assert ents[0].label_ == 'PERSON'
|
||||||
assert ents[0].orth_ == 'Best'
|
assert ents[0].orth_ == 'Best'
|
||||||
assert ents[0].text == 'Mr. Best'
|
assert ents[0].text == 'Mr. Best'
|
||||||
'''
|
"""
|
||||||
def __get__(self):
|
def __get__(self):
|
||||||
cdef int i
|
cdef int i
|
||||||
cdef const TokenC* token
|
cdef const TokenC* token
|
||||||
|
@ -382,13 +389,13 @@ cdef class Doc:
|
||||||
self.c[start].ent_iob = 3
|
self.c[start].ent_iob = 3
|
||||||
|
|
||||||
property noun_chunks:
|
property noun_chunks:
|
||||||
'''
|
"""
|
||||||
Yields base noun-phrase #[code Span] objects, if the document
|
Yields base noun-phrase #[code Span] objects, if the document
|
||||||
has been syntactically parsed. A base noun phrase, or
|
has been syntactically parsed. A base noun phrase, or
|
||||||
'NP chunk', is a noun phrase that does not permit other NPs to
|
'NP chunk', is a noun phrase that does not permit other NPs to
|
||||||
be nested within it – so no NP-level coordination, no prepositional
|
be nested within it – so no NP-level coordination, no prepositional
|
||||||
phrases, and no relative clauses. For example:
|
phrases, and no relative clauses.
|
||||||
'''
|
"""
|
||||||
def __get__(self):
|
def __get__(self):
|
||||||
if not self.is_parsed:
|
if not self.is_parsed:
|
||||||
raise ValueError(
|
raise ValueError(
|
||||||
|
@ -496,7 +503,8 @@ cdef class Doc:
|
||||||
return output
|
return output
|
||||||
|
|
||||||
def count_by(self, attr_id_t attr_id, exclude=None, PreshCounter counts=None):
|
def count_by(self, attr_id_t attr_id, exclude=None, PreshCounter counts=None):
|
||||||
"""Produce a dict of {attribute (int): count (ints)} frequencies, keyed
|
"""
|
||||||
|
Produce a dict of {attribute (int): count (ints)} frequencies, keyed
|
||||||
by the values of the given attribute ID.
|
by the values of the given attribute ID.
|
||||||
|
|
||||||
Example:
|
Example:
|
||||||
|
@ -563,8 +571,9 @@ cdef class Doc:
|
||||||
self.c[i] = parsed[i]
|
self.c[i] = parsed[i]
|
||||||
|
|
||||||
def from_array(self, attrs, array):
|
def from_array(self, attrs, array):
|
||||||
'''Write to a `Doc` object, from an `(M, N)` array of attributes.
|
"""
|
||||||
'''
|
Write to a `Doc` object, from an `(M, N)` array of attributes.
|
||||||
|
"""
|
||||||
cdef int i, col
|
cdef int i, col
|
||||||
cdef attr_id_t attr_id
|
cdef attr_id_t attr_id
|
||||||
cdef TokenC* tokens = self.c
|
cdef TokenC* tokens = self.c
|
||||||
|
@ -603,19 +612,23 @@ cdef class Doc:
|
||||||
return self
|
return self
|
||||||
|
|
||||||
def to_bytes(self):
|
def to_bytes(self):
|
||||||
'''Serialize, producing a byte string.'''
|
"""
|
||||||
|
Serialize, producing a byte string.
|
||||||
|
"""
|
||||||
byte_string = self.vocab.serializer.pack(self)
|
byte_string = self.vocab.serializer.pack(self)
|
||||||
cdef uint32_t length = len(byte_string)
|
cdef uint32_t length = len(byte_string)
|
||||||
return struct.pack('I', length) + byte_string
|
return struct.pack('I', length) + byte_string
|
||||||
|
|
||||||
def from_bytes(self, data):
|
def from_bytes(self, data):
|
||||||
'''Deserialize, loading from bytes.'''
|
"""
|
||||||
|
Deserialize, loading from bytes.
|
||||||
|
"""
|
||||||
self.vocab.serializer.unpack_into(data[4:], self)
|
self.vocab.serializer.unpack_into(data[4:], self)
|
||||||
return self
|
return self
|
||||||
|
|
||||||
@staticmethod
|
@staticmethod
|
||||||
def read_bytes(file_):
|
def read_bytes(file_):
|
||||||
'''
|
"""
|
||||||
A static method, used to read serialized #[code Doc] objects from
|
A static method, used to read serialized #[code Doc] objects from
|
||||||
a file. For example:
|
a file. For example:
|
||||||
|
|
||||||
|
@ -630,7 +643,7 @@ cdef class Doc:
|
||||||
for byte_string in Doc.read_bytes(file_):
|
for byte_string in Doc.read_bytes(file_):
|
||||||
docs.append(Doc(nlp.vocab).from_bytes(byte_string))
|
docs.append(Doc(nlp.vocab).from_bytes(byte_string))
|
||||||
assert len(docs) == 2
|
assert len(docs) == 2
|
||||||
'''
|
"""
|
||||||
keep_reading = True
|
keep_reading = True
|
||||||
while keep_reading:
|
while keep_reading:
|
||||||
try:
|
try:
|
||||||
|
@ -644,7 +657,8 @@ cdef class Doc:
|
||||||
yield n_bytes_str + data
|
yield n_bytes_str + data
|
||||||
|
|
||||||
def merge(self, int start_idx, int end_idx, *args, **attributes):
|
def merge(self, int start_idx, int end_idx, *args, **attributes):
|
||||||
"""Retokenize the document, such that the span at doc.text[start_idx : end_idx]
|
"""
|
||||||
|
Retokenize the document, such that the span at doc.text[start_idx : end_idx]
|
||||||
is merged into a single token. If start_idx and end_idx do not mark start
|
is merged into a single token. If start_idx and end_idx do not mark start
|
||||||
and end token boundaries, the document remains unchanged.
|
and end token boundaries, the document remains unchanged.
|
||||||
|
|
||||||
|
@ -658,7 +672,6 @@ cdef class Doc:
|
||||||
token (Token):
|
token (Token):
|
||||||
The newly merged token, or None if the start and end indices did
|
The newly merged token, or None if the start and end indices did
|
||||||
not fall at token boundaries.
|
not fall at token boundaries.
|
||||||
|
|
||||||
"""
|
"""
|
||||||
cdef unicode tag, lemma, ent_type
|
cdef unicode tag, lemma, ent_type
|
||||||
if len(args) == 3:
|
if len(args) == 3:
|
||||||
|
|
|
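Doc.merge, whose docstring is reflowed above, retokenizes by character offsets rather than token indices, and the `if len(args) == 3` branch suggests the three positional arguments are (tag, lemma, ent_type). A hedged usage sketch under that assumption; doc stands in for a processed Doc whose text contains the phrase:

```python
# Merge the characters covering 'New York' into one token; merge() returns
# the new token, or None if the offsets don't fall on token boundaries.
start = doc.text.index(u'New York')               # 'doc' is a stand-in
token = doc.merge(start, start + len(u'New York'),
                  u'NNP', u'new york', u'GPE')    # assumed order: tag, lemma, ent_type
```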
@ -1,26 +1,31 @@
|
||||||
|
# coding: utf8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
from collections import defaultdict
|
from collections import defaultdict
|
||||||
|
|
||||||
|
cimport numpy as np
|
||||||
import numpy
|
import numpy
|
||||||
import numpy.linalg
|
import numpy.linalg
|
||||||
cimport numpy as np
|
|
||||||
from libc.math cimport sqrt
|
from libc.math cimport sqrt
|
||||||
import six
|
|
||||||
|
|
||||||
|
from .doc cimport token_by_start, token_by_end
|
||||||
from ..structs cimport TokenC, LexemeC
|
from ..structs cimport TokenC, LexemeC
|
||||||
from ..typedefs cimport flags_t, attr_t, hash_t
|
from ..typedefs cimport flags_t, attr_t, hash_t
|
||||||
from ..attrs cimport attr_id_t
|
from ..attrs cimport attr_id_t
|
||||||
from ..parts_of_speech cimport univ_pos_t
|
from ..parts_of_speech cimport univ_pos_t
|
||||||
from ..util import normalize_slice
|
from ..util import normalize_slice
|
||||||
from .doc cimport token_by_start, token_by_end
|
|
||||||
from ..attrs cimport IS_PUNCT, IS_SPACE
|
from ..attrs cimport IS_PUNCT, IS_SPACE
|
||||||
from ..lexeme cimport Lexeme
|
from ..lexeme cimport Lexeme
|
||||||
|
from ..compat import is_config
|
||||||
|
|
||||||
|
|
||||||
cdef class Span:
|
cdef class Span:
|
||||||
"""A slice from a Doc object."""
|
"""
|
||||||
|
A slice from a Doc object.
|
||||||
|
"""
|
||||||
def __cinit__(self, Doc doc, int start, int end, int label=0, vector=None,
|
def __cinit__(self, Doc doc, int start, int end, int label=0, vector=None,
|
||||||
vector_norm=None):
|
vector_norm=None):
|
||||||
'''Create a Span object from the slice doc[start : end]
|
"""
|
||||||
|
Create a Span object from the slice doc[start : end]
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
doc (Doc): The parent document.
|
doc (Doc): The parent document.
|
||||||
|
@ -30,7 +35,7 @@ cdef class Span:
|
||||||
vector (ndarray[ndim=1, dtype='float32']): A meaning representation of the span.
|
vector (ndarray[ndim=1, dtype='float32']): A meaning representation of the span.
|
||||||
Returns:
|
Returns:
|
||||||
Span The newly constructed object.
|
Span The newly constructed object.
|
||||||
'''
|
"""
|
||||||
if not (0 <= start <= end <= len(doc)):
|
if not (0 <= start <= end <= len(doc)):
|
||||||
raise IndexError
|
raise IndexError
|
||||||
|
|
||||||
|
@ -68,7 +73,7 @@ cdef class Span:
|
||||||
return self.end - self.start
|
return self.end - self.start
|
||||||
|
|
||||||
def __repr__(self):
|
def __repr__(self):
|
||||||
if six.PY3:
|
if is_config(python3=True):
|
||||||
return self.text
|
return self.text
|
||||||
return self.text.encode('utf-8')
|
return self.text.encode('utf-8')
|
||||||
|
|
||||||
|
@ -89,7 +94,8 @@ cdef class Span:
|
||||||
yield self.doc[i]
|
yield self.doc[i]
|
||||||
|
|
||||||
def merge(self, *args, **attributes):
|
def merge(self, *args, **attributes):
|
||||||
"""Retokenize the document, such that the span is merged into a single token.
|
"""
|
||||||
|
Retokenize the document, such that the span is merged into a single token.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
**attributes:
|
**attributes:
|
||||||
|
@ -102,7 +108,8 @@ cdef class Span:
|
||||||
return self.doc.merge(self.start_char, self.end_char, *args, **attributes)
|
return self.doc.merge(self.start_char, self.end_char, *args, **attributes)
|
||||||
|
|
||||||
def similarity(self, other):
|
def similarity(self, other):
|
||||||
'''Make a semantic similarity estimate. The default estimate is cosine
|
"""
|
||||||
|
Make a semantic similarity estimate. The default estimate is cosine
|
||||||
similarity using an average of word vectors.
|
similarity using an average of word vectors.
|
||||||
|
|
||||||
Arguments:
|
Arguments:
|
||||||
|
@ -111,7 +118,7 @@ cdef class Span:
|
||||||
|
|
||||||
Return:
|
Return:
|
||||||
score (float): A scalar similarity score. Higher is more similar.
|
score (float): A scalar similarity score. Higher is more similar.
|
||||||
'''
|
"""
|
||||||
if 'similarity' in self.doc.user_span_hooks:
|
if 'similarity' in self.doc.user_span_hooks:
|
||||||
self.doc.user_span_hooks['similarity'](self, other)
|
self.doc.user_span_hooks['similarity'](self, other)
|
||||||
if self.vector_norm == 0.0 or other.vector_norm == 0.0:
|
if self.vector_norm == 0.0 or other.vector_norm == 0.0:
|
||||||
|
@ -133,11 +140,12 @@ cdef class Span:
|
||||||
self.end = end + 1
|
self.end = end + 1
|
||||||
|
|
||||||
property sent:
|
property sent:
|
||||||
'''The sentence span that this span is a part of.
|
"""
|
||||||
|
The sentence span that this span is a part of.
|
||||||
|
|
||||||
Returns:
|
Returns:
|
||||||
Span The sentence this is part of.
|
Span The sentence this is part of.
|
||||||
'''
|
"""
|
||||||
def __get__(self):
|
def __get__(self):
|
||||||
if 'sent' in self.doc.user_span_hooks:
|
if 'sent' in self.doc.user_span_hooks:
|
||||||
return self.doc.user_span_hooks['sent'](self)
|
return self.doc.user_span_hooks['sent'](self)
|
||||||
|
@ -198,13 +206,13 @@ cdef class Span:
|
||||||
return u''.join([t.text_with_ws for t in self])
|
return u''.join([t.text_with_ws for t in self])
|
||||||
|
|
||||||
property noun_chunks:
|
property noun_chunks:
|
||||||
'''
|
"""
|
||||||
Yields base noun-phrase #[code Span] objects, if the document
|
Yields base noun-phrase #[code Span] objects, if the document
|
||||||
has been syntactically parsed. A base noun phrase, or
|
has been syntactically parsed. A base noun phrase, or
|
||||||
'NP chunk', is a noun phrase that does not permit other NPs to
|
'NP chunk', is a noun phrase that does not permit other NPs to
|
||||||
be nested within it – so no NP-level coordination, no prepositional
|
be nested within it – so no NP-level coordination, no prepositional
|
||||||
phrases, and no relative clauses. For example:
|
phrases, and no relative clauses. For example:
|
||||||
'''
|
"""
|
||||||
def __get__(self):
|
def __get__(self):
|
||||||
if not self.doc.is_parsed:
|
if not self.doc.is_parsed:
|
||||||
raise ValueError(
|
raise ValueError(
|
||||||
|
@ -223,17 +231,16 @@ cdef class Span:
|
||||||
yield span
|
yield span
|
||||||
|
|
||||||
property root:
|
property root:
|
||||||
"""The token within the span that's highest in the parse tree. If there's a tie, the earlist is prefered.
|
"""
|
||||||
|
The token within the span that's highest in the parse tree. If there's a
|
||||||
|
tie, the earliest is preferred.
|
||||||
|
|
||||||
Returns:
|
Returns:
|
||||||
Token: The root token.
|
Token: The root token.
|
||||||
|
|
||||||
i.e. has the
|
i.e. has the shortest path to the root of the sentence (or is the root
|
||||||
shortest path to the root of the sentence (or is the root itself).
|
itself). If multiple words are equally high in the tree, the first word
|
||||||
|
is taken. For example:
|
||||||
If multiple words are equally high in the tree, the first word is taken.
|
|
||||||
|
|
||||||
For example:
|
|
||||||
|
|
||||||
>>> toks = nlp(u'I like New York in Autumn.')
|
>>> toks = nlp(u'I like New York in Autumn.')
|
||||||
|
|
||||||
|
@ -303,7 +310,8 @@ cdef class Span:
|
||||||
return self.doc[root]
|
return self.doc[root]
|
||||||
|
|
||||||
property lefts:
|
property lefts:
|
||||||
"""Tokens that are to the left of the span, whose head is within the Span.
|
"""
|
||||||
|
Tokens that are to the left of the span, whose head is within the Span.
|
||||||
|
|
||||||
Yields: Token A left-child of a token of the span.
|
Yields: Token A left-child of a token of the span.
|
||||||
"""
|
"""
|
||||||
|
@ -314,7 +322,8 @@ cdef class Span:
|
||||||
yield left
|
yield left
|
||||||
|
|
||||||
property rights:
|
property rights:
|
||||||
"""Tokens that are to the right of the Span, whose head is within the Span.
|
"""
|
||||||
|
Tokens that are to the right of the Span, whose head is within the Span.
|
||||||
|
|
||||||
Yields: Token A right-child of a token of the span.
|
Yields: Token A right-child of a token of the span.
|
||||||
"""
|
"""
|
||||||
|
@ -325,7 +334,8 @@ cdef class Span:
|
||||||
yield right
|
yield right
|
||||||
|
|
||||||
property subtree:
|
property subtree:
|
||||||
"""Tokens that descend from tokens in the span, but fall outside it.
|
"""
|
||||||
|
Tokens that descend from tokens in the span, but fall outside it.
|
||||||
|
|
||||||
Yields: Token A descendant of a token within the span.
|
Yields: Token A descendant of a token within the span.
|
||||||
"""
|
"""
|
||||||
|
@ -337,7 +347,9 @@ cdef class Span:
|
||||||
yield from word.subtree
|
yield from word.subtree
|
||||||
|
|
||||||
property ent_id:
|
property ent_id:
|
||||||
'''An (integer) entity ID. Usually assigned by patterns in the Matcher.'''
|
"""
|
||||||
|
An (integer) entity ID. Usually assigned by patterns in the Matcher.
|
||||||
|
"""
|
||||||
def __get__(self):
|
def __get__(self):
|
||||||
return self.root.ent_id
|
return self.root.ent_id
|
||||||
|
|
||||||
|
@ -345,9 +357,11 @@ cdef class Span:
|
||||||
# TODO
|
# TODO
|
||||||
raise NotImplementedError(
|
raise NotImplementedError(
|
||||||
"Can't yet set ent_id from Span. Vote for this feature on the issue "
|
"Can't yet set ent_id from Span. Vote for this feature on the issue "
|
||||||
"tracker: http://github.com/spacy-io/spaCy")
|
"tracker: http://github.com/explosion/spaCy/issues")
|
||||||
property ent_id_:
|
property ent_id_:
|
||||||
'''A (string) entity ID. Usually assigned by patterns in the Matcher.'''
|
"""
|
||||||
|
A (string) entity ID. Usually assigned by patterns in the Matcher.
|
||||||
|
"""
|
||||||
def __get__(self):
|
def __get__(self):
|
||||||
return self.root.ent_id_
|
return self.root.ent_id_
|
||||||
|
|
||||||
|
@ -355,7 +369,7 @@ cdef class Span:
|
||||||
# TODO
|
# TODO
|
||||||
raise NotImplementedError(
|
raise NotImplementedError(
|
||||||
"Can't yet set ent_id_ from Span. Vote for this feature on the issue "
|
"Can't yet set ent_id_ from Span. Vote for this feature on the issue "
|
||||||
"tracker: http://github.com/spacy-io/spaCy")
|
"tracker: http://github.com/explosion/spaCy/issues")
|
||||||
|
|
||||||
property orth_:
|
property orth_:
|
||||||
def __get__(self):
|
def __get__(self):
|
||||||
|
@ -397,5 +411,5 @@ cdef int _count_words_to_root(const TokenC* token, int sent_length) except -1:
|
||||||
raise RuntimeError(
|
raise RuntimeError(
|
||||||
"Array bounds exceeded while searching for root word. This likely "
|
"Array bounds exceeded while searching for root word. This likely "
|
||||||
"means the parse tree is in an invalid state. Please report this "
|
"means the parse tree is in an invalid state. Please report this "
|
||||||
"issue here: http://github.com/honnibal/spaCy/")
|
"issue here: http://github.com/explosion/spaCy/issues")
|
||||||
return n
|
return n
|
||||||
|
|
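The `Span` properties whose docstrings are reworked above are easiest to follow next to a usage sketch. The snippet below is illustrative only and not part of the commit; it assumes an English model is installed.

```python
# Illustrative only (not part of the diff): exercising the Span properties
# documented above. Assumes an English model is available for spacy.load().
import spacy

nlp = spacy.load('en')
doc = nlp(u'I like New York in Autumn.')
span = doc[2:4]            # the Span "New York"

print(span.text)           # u'New York'
print(span.root.text)      # the token highest in the parse tree
print(span.sent.text)      # the sentence this span is part of
```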
```diff
@@ -1,5 +1,5 @@
-# coding: utf8
 # cython: infer_types=True
+# coding: utf8
 from __future__ import unicode_literals

 from libc.string cimport memcpy
@@ -8,20 +8,15 @@ from cpython.mem cimport PyMem_Malloc, PyMem_Free
 from cython.view cimport array as cvarray
 cimport numpy as np
 np.import_array()

 import numpy
-import six


 from ..typedefs cimport hash_t
 from ..lexeme cimport Lexeme
 from .. import parts_of_speech

 from ..attrs cimport LEMMA
 from ..attrs cimport ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX, LENGTH, CLUSTER
 from ..attrs cimport POS, LEMMA, TAG, DEP
 from ..parts_of_speech cimport CCONJ, PUNCT

 from ..attrs cimport IS_ALPHA, IS_ASCII, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_SPACE
 from ..attrs cimport IS_BRACKET
 from ..attrs cimport IS_QUOTE
@@ -29,12 +24,13 @@ from ..attrs cimport IS_LEFT_PUNCT
 from ..attrs cimport IS_RIGHT_PUNCT
 from ..attrs cimport IS_TITLE, IS_UPPER, LIKE_URL, LIKE_NUM, LIKE_EMAIL, IS_STOP
 from ..attrs cimport IS_OOV

 from ..lexeme cimport Lexeme
+from ..compat import is_config


 cdef class Token:
-    """An individual token --- i.e. a word, punctuation symbol, whitespace, etc.
+    """
+    An individual token --- i.e. a word, punctuation symbol, whitespace, etc.
     """
     def __cinit__(self, Vocab vocab, Doc doc, int offset):
         self.vocab = vocab
@@ -46,7 +42,9 @@ cdef class Token:
         return hash((self.doc, self.i))

     def __len__(self):
-        '''Number of unicode characters in token.text'''
+        """
+        Number of unicode characters in token.text.
+        """
         return self.c.lex.length

     def __unicode__(self):
@@ -56,7 +54,7 @@ cdef class Token:
         return self.text.encode('utf8')

     def __str__(self):
-        if six.PY3:
+        if is_config(python3=True):
             return self.__unicode__()
         return self.__bytes__()

@@ -83,27 +81,30 @@ cdef class Token:
         raise ValueError(op)

     cpdef bint check_flag(self, attr_id_t flag_id) except -1:
-        '''Check the value of a boolean flag.
+        """
+        Check the value of a boolean flag.

         Arguments:
             flag_id (int): The ID of the flag attribute.
         Returns:
             is_set (bool): Whether the flag is set.
-        '''
+        """
         return Lexeme.c_check_flag(self.c.lex, flag_id)

     def nbor(self, int i=1):
-        '''Get a neighboring token.
+        """
+        Get a neighboring token.

         Arguments:
             i (int): The relative position of the token to get. Defaults to 1.
         Returns:
             neighbor (Token): The token at position self.doc[self.i+i]
-        '''
+        """
         return self.doc[self.i+i]

     def similarity(self, other):
-        '''Compute a semantic similarity estimate. Defaults to cosine over vectors.
+        """
+        Compute a semantic similarity estimate. Defaults to cosine over vectors.

         Arguments:
             other:
@@ -111,7 +112,7 @@ cdef class Token:
                 Token and Lexeme objects.
         Returns:
             score (float): A scalar similarity score. Higher is more similar.
-        '''
+        """
         if 'similarity' in self.doc.user_token_hooks:
             return self.doc.user_token_hooks['similarity'](self)
         if self.vector_norm == 0 or other.vector_norm == 0:
@@ -191,6 +192,8 @@ cdef class Token:
     property lemma:
         def __get__(self):
             return self.c.lemma
+        def __set__(self, int lemma):
+            self.c.lemma = lemma

     property pos:
         def __get__(self):
@@ -209,9 +212,9 @@ cdef class Token:
             self.c.dep = label

     property has_vector:
-        '''
+        """
         A boolean value indicating whether a word vector is associated with the object.
-        '''
+        """
         def __get__(self):
             if 'has_vector' in self.doc.user_token_hooks:
                 return self.doc.user_token_hooks['has_vector'](self)
@@ -223,11 +226,11 @@ cdef class Token:
             return False

     property vector:
-        '''
+        """
         A real-valued meaning representation.

         Type: numpy.ndarray[ndim=1, dtype='float32']
-        '''
+        """
         def __get__(self):
             if 'vector' in self.doc.user_token_hooks:
                 return self.doc.user_token_hooks['vector'](self)
@@ -245,6 +248,7 @@ cdef class Token:
     property repvec:
         def __get__(self):
             raise AttributeError("repvec was renamed to vector in v0.100")

     property has_repvec:
         def __get__(self):
             raise AttributeError("has_repvec was renamed to has_vector in v0.100")
@@ -265,7 +269,8 @@ cdef class Token:

     property lefts:
         def __get__(self):
-            """The leftward immediate children of the word, in the syntactic
+            """
+            The leftward immediate children of the word, in the syntactic
             dependency parse.
             """
             cdef int nr_iter = 0
@@ -282,8 +287,10 @@ cdef class Token:

     property rights:
         def __get__(self):
-            """The rightward immediate children of the word, in the syntactic
-            dependency parse."""
+            """
+            The rightward immediate children of the word, in the syntactic
+            dependency parse.
+            """
             cdef const TokenC* ptr = self.c + (self.c.r_edge - self.i)
             tokens = []
             cdef int nr_iter = 0
@@ -300,19 +307,21 @@ cdef class Token:
                 yield t

     property children:
-        '''A sequence of the token's immediate syntactic children.
+        """
+        A sequence of the token's immediate syntactic children.

         Yields: Token A child token such that child.head==self
-        '''
+        """
         def __get__(self):
             yield from self.lefts
             yield from self.rights

     property subtree:
-        '''A sequence of all the token's syntactic descendents.
+        """
+        A sequence of all the token's syntactic descendents.

         Yields: Token A descendent token such that self.is_ancestor(descendent)
-        '''
+        """
         def __get__(self):
             for word in self.lefts:
                 yield from word.subtree
@@ -321,26 +330,29 @@ cdef class Token:
                 yield from word.subtree

     property left_edge:
-        '''The leftmost token of this token's syntactic descendents.
+        """
+        The leftmost token of this token's syntactic descendents.

         Returns: Token The first token such that self.is_ancestor(token)
-        '''
+        """
         def __get__(self):
             return self.doc[self.c.l_edge]

     property right_edge:
-        '''The rightmost token of this token's syntactic descendents.
+        """
+        The rightmost token of this token's syntactic descendents.

         Returns: Token The last token such that self.is_ancestor(token)
-        '''
+        """
         def __get__(self):
             return self.doc[self.c.r_edge]

     property ancestors:
-        '''A sequence of this token's syntactic ancestors.
+        """
+        A sequence of this token's syntactic ancestors.

         Yields: Token A sequence of ancestor tokens such that ancestor.is_ancestor(self)
-        '''
+        """
         def __get__(self):
             cdef const TokenC* head_ptr = self.c
             # guard against infinite loop, no token can have
@@ -356,25 +368,29 @@ cdef class Token:
         return self.is_ancestor(descendant)

     def is_ancestor(self, descendant):
-        '''Check whether this token is a parent, grandparent, etc. of another
+        """
+        Check whether this token is a parent, grandparent, etc. of another
         in the dependency tree.

         Arguments:
             descendant (Token): Another token.
         Returns:
             is_ancestor (bool): Whether this token is the ancestor of the descendant.
-        '''
+        """
         if self.doc is not descendant.doc:
             return False
         return any( ancestor.i == self.i for ancestor in descendant.ancestors )

     property head:
-        '''The syntactic parent, or "governor", of this token.
+        """
+        The syntactic parent, or "governor", of this token.

         Returns: Token
-        '''
+        """
         def __get__(self):
-            """The token predicted by the parser to be the head of the current token."""
+            """
+            The token predicted by the parser to be the head of the current token.
+            """
             return self.doc[self.i + self.c.head]
         def __set__(self, Token new_head):
             # this function sets the head of self to new_head
@@ -467,10 +483,11 @@ cdef class Token:
             self.c.head = rel_newhead_i

     property conjuncts:
-        '''A sequence of coordinated tokens, including the token itself.
+        """
+        A sequence of coordinated tokens, including the token itself.

         Yields: Token A coordinated token
-        '''
+        """
         def __get__(self):
             """Get a list of conjoined words."""
             cdef Token word
@@ -501,7 +518,9 @@ cdef class Token:
             return iob_strings[self.c.ent_iob]

     property ent_id:
-        '''An (integer) entity ID. Usually assigned by patterns in the Matcher.'''
+        """
+        An (integer) entity ID. Usually assigned by patterns in the Matcher.
+        """
         def __get__(self):
             return self.c.ent_id

@@ -509,7 +528,9 @@ cdef class Token:
             self.c.ent_id = key

     property ent_id_:
-        '''A (string) entity ID. Usually assigned by patterns in the Matcher.'''
+        """
+        A (string) entity ID. Usually assigned by patterns in the Matcher.
+        """
         def __get__(self):
             return self.vocab.strings[self.c.ent_id]

@@ -551,6 +572,8 @@ cdef class Token:
     property lemma_:
         def __get__(self):
             return self.vocab.strings[self.c.lemma]
+        def __set__(self, unicode lemma_):
+            self.c.lemma = self.vocab.strings[lemma_]

     property pos_:
         def __get__(self):
```
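One user-visible effect of the `Token` changes above is that `lemma` and `lemma_` gain `__set__` support. A minimal illustrative sketch, not part of the commit, assuming an English model is installed:

```python
# Illustrative only (not part of the diff): the lemma / lemma_ setters added
# above let you overwrite a token's lemma in place.
import spacy

nlp = spacy.load('en')
doc = nlp(u'I like New York in Autumn.')
token = doc[5]               # u'Autumn'
token.lemma_ = u'autumn'     # resolved through vocab.strings, per the setter above
print(token.lemma_)          # u'autumn'
```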
```diff
@@ -1,15 +1,16 @@
-from __future__ import absolute_import
-from __future__ import unicode_literals
+# coding: utf8
+from __future__ import absolute_import, unicode_literals

 import random
 import tqdm
-from .gold import GoldParse
+from .gold import GoldParse, merge_sents
 from .scorer import Scorer
-from .gold import merge_sents


 class Trainer(object):
-    '''Manage training of an NLP pipeline.'''
+    """
+    Manage training of an NLP pipeline.
+    """
     def __init__(self, nlp, gold_tuples):
         self.nlp = nlp
         self.gold_tuples = gold_tuples
```
spacy/util.py (157 changed lines)

```diff
@@ -1,29 +1,17 @@
 # coding: utf8
 from __future__ import unicode_literals, print_function
-import os
-import io
-import json
+import ujson
 import re
-import os.path
-import pathlib
+from pathlib import Path
 import sys
 import textwrap

-try:
-    basestring
-except NameError:
-    basestring = str
-
-
-try:
-    raw_input
-except NameError: # Python 3
-    raw_input = input
+from .compat import basestring_, unicode_, input_


 LANGUAGES = {}
-_data_path = pathlib.Path(__file__).parent / 'data'
+_data_path = Path(__file__).parent / 'data'


 def set_lang_class(name, cls):
@@ -32,9 +20,11 @@ def set_lang_class(name, cls):


 def get_lang_class(name):
+    if name in LANGUAGES:
+        return LANGUAGES[name]
     lang = re.split('[^a-zA-Z0-9]', name, 1)[0]
     if lang not in LANGUAGES:
-        raise RuntimeError('Language not supported: %s' % lang)
+        raise RuntimeError('Language not supported: %s' % name)
     return LANGUAGES[lang]


@@ -47,55 +37,18 @@ def get_data_path(require_exists=True):

 def set_data_path(path):
     global _data_path
-    if isinstance(path, basestring):
-        path = pathlib.Path(path)
-    _data_path = path
+    _data_path = ensure_path(path)


-def or_(val1, val2):
-    if val1 is not None:
-        return val1
-    elif callable(val2):
-        return val2()
+def ensure_path(path):
+    if isinstance(path, basestring_):
+        return Path(path)
     else:
-        return val2
-
-
-def match_best_version(target_name, target_version, path):
-    path = path if not isinstance(path, basestring) else pathlib.Path(path)
-    if path is None or not path.exists():
-        return None
-    matches = []
-    for data_name in path.iterdir():
-        name, version = split_data_name(data_name.parts[-1])
-        if name == target_name and constraint_match(target_version, version):
-            matches.append((tuple(float(v) for v in version.split('.')), data_name))
-    if matches:
-        return pathlib.Path(max(matches)[1])
-    else:
-        return None
-
-
-def split_data_name(name):
-    return name.split('-', 1) if '-' in name else (name, '')
-
-
-def constraint_match(constraint_string, version):
-    # From http://github.com/spacy-io/sputnik
-    if not constraint_string:
-        return True
-
-    constraints = [c.strip() for c in constraint_string.split(',') if c.strip()]
-
-    for c in constraints:
-        if not re.match(r'[><=][=]?\d+(\.\d+)*', c):
-            raise ValueError('invalid constraint: %s' % c)
-
-    return all(semver.match(version, c) for c in constraints)
+        return path


 def read_regex(path):
-    path = path if not isinstance(path, basestring) else pathlib.Path(path)
+    path = ensure_path(path)
     with path.open() as file_:
         entries = file_.read().split('\n')
     expression = '|'.join(['^' + re.escape(piece) for piece in entries if piece.strip()])
@@ -142,32 +95,28 @@ def normalize_slice(length, start, stop, step=None):
     return start, stop


-def utf8open(loc, mode='r'):
-    return io.open(loc, mode, encoding='utf8')
-
-
 def check_renamed_kwargs(renamed, kwargs):
     for old, new in renamed.items():
         if old in kwargs:
             raise TypeError("Keyword argument %s now renamed to %s" % (old, new))


-def is_windows():
-    """Check if user is on Windows."""
-    return sys.platform.startswith('win')
-
-
-def is_python2():
-    """Check if Python 2 is used."""
-    return sys.version.startswith('2.')
+def read_json(location):
+    with location.open('r', encoding='utf8') as f:
+        return ujson.load(f)


 def parse_package_meta(package_path, package, require=True):
-    location = os.path.join(str(package_path), package, 'meta.json')
-    if os.path.isfile(location):
-        with io.open(location, encoding='utf8') as f:
-            meta = json.load(f)
-        return meta
+    """
+    Check if a meta.json exists in a package and return its contents as a
+    dictionary. If require is set to True, raise an error if no meta.json found.
+    """
+    # TODO: Allow passing in full model path and only require one argument
+    # instead of path and package name. This lets us avoid passing in an awkward
+    # empty string in spacy.load() if user supplies full model path.
+    location = package_path / package / 'meta.json'
+    if location.is_file():
+        return read_json(location)
     elif require:
         raise IOError("Could not read meta.json from %s" % location)
     else:
@@ -175,20 +124,22 @@ def parse_package_meta(package_path, package, require=True):


 def get_raw_input(description, default=False):
-    """Get user input via raw_input / input and return input value. Takes a
+    """
+    Get user input via raw_input / input and return input value. Takes a
     description for the prompt, and an optional default value that's displayed
-    with the prompt."""
+    with the prompt.
+    """
     additional = ' (default: {d})'.format(d=default) if default else ''
     prompt = ' {d}{a}: '.format(d=description, a=additional)
-    user_input = raw_input(prompt)
+    user_input = input_(prompt)
     return user_input


 def print_table(data, **kwargs):
-    """Print data in table format. Can either take a list of tuples or a
-    dictionary, which will be converted to a list of tuples."""
+    """
+    Print data in table format. Can either take a list of tuples or a
+    dictionary, which will be converted to a list of tuples.
+    """
     if type(data) == dict:
         data = list(data.items())

@@ -204,15 +155,15 @@ def print_table(data, **kwargs):


 def print_markdown(data, **kwargs):
-    """Print listed data in GitHub-flavoured Markdown format so it can be
+    """
+    Print listed data in GitHub-flavoured Markdown format so it can be
     copy-pasted into issues. Can either take a list of tuples or a dictionary,
-    which will be converted to a list of tuples."""
+    which will be converted to a list of tuples.
+    """
     def excl_value(value):
-        # don't print value if it contains absolute path of directory
-        # (i.e. personal info that shouldn't need to be shared)
-        # other conditions can be included here if necessary
-        if str(pathlib.Path(__file__).parent) in value:
+        # don't print value if it contains absolute path of directory (i.e.
+        # personal info). Other conditions can be included here if necessary.
+        if unicode_(Path(__file__).parent) in value:
             return True

     if type(data) == dict:
@@ -225,16 +176,16 @@ def print_markdown(data, **kwargs):

     if 'title' in kwargs and kwargs['title']:
         print(tpl_title.format(msg=kwargs['title']))

     print(tpl_msg.format(msg=markdown))


 def print_msg(*text, **kwargs):
-    """Print formatted message. Each positional argument is rendered as newline-
+    """
+    Print formatted message. Each positional argument is rendered as newline-
     separated paragraph. If kwarg 'title' exist, title is printed above the text
     and highlighted (using ANSI escape sequences manually to avoid unnecessary
-    dependency)."""
+    dependency).
+    """
     message = '\n\n'.join([_wrap_text(t) for t in text])
     tpl_msg = '\n{msg}\n'
     tpl_title = '\n\033[93m{msg}\033[0m'
@@ -246,9 +197,10 @@ def print_msg(*text, **kwargs):


 def _wrap_text(text):
-    """Wrap text at given width using textwrap module. Indent should consist of
-    spaces. Its length is deducted from wrap width to ensure exact wrapping."""
+    """
+    Wrap text at given width using textwrap module. Indent should consist of
+    spaces. Its length is deducted from wrap width to ensure exact wrapping.
+    """
     wrap_max = 80
     indent = '    '
     wrap_width = wrap_max - len(indent)
@@ -258,10 +210,11 @@ def _wrap_text(text):


 def sys_exit(*messages, **kwargs):
-    """Performs SystemExit. For modules used from the command line, like
+    """
+    Performs SystemExit. For modules used from the command line, like
     download and link. To print message, use the same arguments as for
-    print_msg()."""
+    print_msg().
+    """
     if messages:
         print_msg(*messages, **kwargs)
     sys.exit(0)
```
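The net effect of the `spacy/util.py` changes is that path handling funnels through two small helpers, `ensure_path` and `read_json`. A quick sketch of how they behave, illustrative only; the JSON file path is hypothetical:

```python
# Illustrative only (not part of the diff): ensure_path normalises strings to
# pathlib.Path and passes Path objects through; read_json is the ujson-backed
# loader added above. The file path is hypothetical.
from pathlib import Path
from spacy import util

path = util.ensure_path('/tmp/meta.json')
assert isinstance(path, Path)
assert util.ensure_path(path) is path    # already a Path: returned unchanged

if path.is_file():
    print(util.read_json(path))
```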
spacy/vocab.pyx (119 changed lines)

```diff
@@ -1,41 +1,29 @@
+# coding: utf8
 from __future__ import unicode_literals

+import bz2
+import ujson
+import re

 from libc.string cimport memset
 from libc.stdint cimport int32_t
 from libc.math cimport sqrt
+from cymem.cymem cimport Address

-from pathlib import Path
-import bz2
-import ujson as json
-import re
-
-try:
-    import cPickle as pickle
-except ImportError:
-    import pickle

 from .lexeme cimport EMPTY_LEXEME
 from .lexeme cimport Lexeme
 from .strings cimport hash_string
 from .typedefs cimport attr_t
 from .cfile cimport CFile, StringCFile
-from .lemmatizer import Lemmatizer
-from .attrs import intify_attrs
 from .tokens.token cimport Token

-from . import attrs
-from . import symbols
-
-from cymem.cymem cimport Address
 from .serialize.packer cimport Packer
 from .attrs cimport PROB, LANG

+from .compat import copy_reg, pickle
+from .lemmatizer import Lemmatizer
+from .attrs import intify_attrs
 from . import util
+from . import attrs
+from . import symbols
-try:
-    import copy_reg
-except ImportError:
-    import copyreg as copy_reg


 DEF MAX_VEC_SIZE = 100000
@@ -48,8 +36,9 @@ EMPTY_LEXEME.vector = EMPTY_VEC


 cdef class Vocab:
-    '''A map container for a language's LexemeC structs.
-    '''
+    """
+    A map container for a language's LexemeC structs.
+    """
     @classmethod
     def load(cls, path, lex_attr_getters=None, lemmatizer=True,
              tag_map=True, serializer_freqs=True, oov_prob=True, **deprecated_kwargs):
@@ -72,8 +61,7 @@ cdef class Vocab:
         Returns:
             Vocab: The newly constructed vocab object.
         """
-        if isinstance(path, basestring):
-            path = Path(path)
+        path = util.ensure_path(path)
         util.check_renamed_kwargs({'get_lex_attr': 'lex_attr_getters'}, deprecated_kwargs)
         if 'vectors' in deprecated_kwargs:
             raise AttributeError(
@@ -81,7 +69,7 @@ cdef class Vocab:
                 "Install vectors after loading.")
         if tag_map is True and (path / 'vocab' / 'tag_map.json').exists():
             with (path / 'vocab' / 'tag_map.json').open('r', encoding='utf8') as file_:
-                tag_map = json.load(file_)
+                tag_map = ujson.load(file_)
         elif tag_map is True:
             tag_map = None
         if lex_attr_getters is not None \
@@ -94,12 +82,12 @@ cdef class Vocab:
             lemmatizer = Lemmatizer.load(path)
         if serializer_freqs is True and (path / 'vocab' / 'serializer.json').exists():
             with (path / 'vocab' / 'serializer.json').open('r', encoding='utf8') as file_:
-                serializer_freqs = json.load(file_)
+                serializer_freqs = ujson.load(file_)
         else:
             serializer_freqs = None

         with (path / 'vocab' / 'strings.json').open('r', encoding='utf8') as file_:
-            strings_list = json.load(file_)
+            strings_list = ujson.load(file_)
         cdef Vocab self = cls(lex_attr_getters=lex_attr_getters, tag_map=tag_map,
             lemmatizer=lemmatizer, serializer_freqs=serializer_freqs,
             strings=strings_list)
@@ -108,7 +96,8 @@ cdef class Vocab:

     def __init__(self, lex_attr_getters=None, tag_map=None, lemmatizer=None,
             serializer_freqs=None, strings=tuple(), **deprecated_kwargs):
-        '''Create the vocabulary.
+        """
+        Create the vocabulary.

         lex_attr_getters (dict):
             A dictionary mapping attribute IDs to functions to compute them.
@@ -123,7 +112,7 @@ cdef class Vocab:

         Returns:
             Vocab: The newly constructed vocab object.
-        '''
+        """
         util.check_renamed_kwargs({'get_lex_attr': 'lex_attr_getters'}, deprecated_kwargs)

         lex_attr_getters = lex_attr_getters if lex_attr_getters is not None else {}
@@ -172,17 +161,19 @@ cdef class Vocab:
         return langfunc('_') if langfunc else ''

     def __len__(self):
-        """The current number of lexemes stored."""
+        """
+        The current number of lexemes stored.
+        """
         return self.length

     def resize_vectors(self, int new_size):
-        '''
+        """
         Set vectors_length to a new size, and allocate more memory for the Lexeme
         vectors if necessary. The memory will be zeroed.

         Arguments:
             new_size (int): The new size of the vectors.
-        '''
+        """
         cdef hash_t key
         cdef size_t addr
         if new_size > self.vectors_length:
@@ -193,7 +184,8 @@ cdef class Vocab:
             self.vectors_length = new_size

     def add_flag(self, flag_getter, int flag_id=-1):
-        '''Set a new boolean flag to words in the vocabulary.
+        """
+        Set a new boolean flag to words in the vocabulary.

         The flag_setter function will be called over the words currently in the
         vocab, and then applied to new words as they occur. You'll then be able
@@ -213,7 +205,7 @@ cdef class Vocab:

         Returns:
             flag_id (int): The integer ID by which the flag value can be checked.
-        '''
+        """
         if flag_id == -1:
             for bit in range(1, 64):
                 if bit not in self.lex_attr_getters:
@@ -234,9 +226,11 @@ cdef class Vocab:
         return flag_id

     cdef const LexemeC* get(self, Pool mem, unicode string) except NULL:
-        '''Get a pointer to a LexemeC from the lexicon, creating a new Lexeme
+        """
+        Get a pointer to a LexemeC from the lexicon, creating a new Lexeme
         if necessary, using memory acquired from the given pool. If the pool
-        is the lexicon's own memory, the lexeme is saved in the lexicon.'''
+        is the lexicon's own memory, the lexeme is saved in the lexicon.
+        """
         if string == u'':
             return &EMPTY_LEXEME
         cdef LexemeC* lex
@@ -252,9 +246,11 @@ cdef class Vocab:
             return self._new_lexeme(mem, string)

     cdef const LexemeC* get_by_orth(self, Pool mem, attr_t orth) except NULL:
-        '''Get a pointer to a LexemeC from the lexicon, creating a new Lexeme
+        """
+        Get a pointer to a LexemeC from the lexicon, creating a new Lexeme
         if necessary, using memory acquired from the given pool. If the pool
-        is the lexicon's own memory, the lexeme is saved in the lexicon.'''
+        is the lexicon's own memory, the lexeme is saved in the lexicon.
+        """
         if orth == 0:
             return &EMPTY_LEXEME
         cdef LexemeC* lex
@@ -297,30 +293,33 @@ cdef class Vocab:
         self.length += 1

     def __contains__(self, unicode string):
-        '''Check whether the string has an entry in the vocabulary.
+        """
+        Check whether the string has an entry in the vocabulary.

         Arguments:
             string (unicode): The ID string.

         Returns:
             bool Whether the string has an entry in the vocabulary.
-        '''
+        """
         key = hash_string(string)
         lex = self._by_hash.get(key)
         return lex is not NULL

     def __iter__(self):
-        '''Iterate over the lexemes in the vocabulary.
+        """
+        Iterate over the lexemes in the vocabulary.

         Yields: Lexeme An entry in the vocabulary.
-        '''
+        """
         cdef attr_t orth
         cdef size_t addr
         for orth, addr in self._by_orth.items():
             yield Lexeme(self, orth)

     def __getitem__(self, id_or_string):
-        '''Retrieve a lexeme, given an int ID or a unicode string. If a previously
+        """
+        Retrieve a lexeme, given an int ID or a unicode string. If a previously
         unseen unicode string is given, a new lexeme is created and stored.

         Arguments:
@@ -332,7 +331,7 @@ cdef class Vocab:

         Returns:
             lexeme (Lexeme): The lexeme indicated by the given ID.
-        '''
+        """
         cdef attr_t orth
         if type(id_or_string) == unicode:
             orth = self.strings[id_or_string]
@@ -355,7 +354,8 @@ cdef class Vocab:
         return tokens

     def dump(self, loc=None):
-        """Save the lexemes binary data to the given location, or
+        """
+        Save the lexemes binary data to the given location, or
         return a byte-string with the data if loc is None.

         Arguments:
@@ -392,14 +392,15 @@ cdef class Vocab:
         return fp.string_data()

     def load_lexemes(self, loc):
-        '''Load the binary vocabulary data from the given location.
+        """
+        Load the binary vocabulary data from the given location.

         Arguments:
             loc (Path): The path to load from.

         Returns:
             None
-        '''
+        """
         fp = CFile(loc, 'rb',
                 on_open_error=lambda: IOError('LexemeCs file not found at %s' % loc))
         cdef LexemeC* lexeme = NULL
@@ -440,8 +441,9 @@ cdef class Vocab:
         fp.close()

     def _deserialize_lexemes(self, CFile fp):
-        '''Load the binary vocabulary data from the given CFile.
-        '''
+        """
+        Load the binary vocabulary data from the given CFile.
+        """
         cdef LexemeC* lexeme = NULL
         cdef hash_t key
         cdef unicode py_str
@@ -494,13 +496,14 @@ cdef class Vocab:
         fp.close()

     def dump_vectors(self, out_loc):
-        '''Save the word vectors to a binary file.
+        """
+        Save the word vectors to a binary file.

         Arguments:
             loc (Path): The path to save to.
         Returns:
             None
-        '''
+        """
         cdef int32_t vec_len = self.vectors_length
         cdef int32_t word_len
         cdef bytes word_str
@@ -522,7 +525,8 @@ cdef class Vocab:
         out_file.close()

     def load_vectors(self, file_):
-        """Load vectors from a text-based file.
+        """
+        Load vectors from a text-based file.

         Arguments:
             file_ (buffer): The file to read from. Entries should be separated by newlines,
@@ -561,7 +565,8 @@ cdef class Vocab:
         return vec_len

     def load_vectors_from_bin_loc(self, loc):
-        """Load vectors from the location of a binary file.
+        """
+        Load vectors from the location of a binary file.

         Arguments:
             loc (unicode): The path of the binary file to load from.
```
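Since `Vocab.load` now routes its argument through `util.ensure_path`, a plain string is as acceptable as a `pathlib.Path`. An illustrative sketch, not part of the commit; the model directory below is hypothetical:

```python
# Illustrative only (not part of the diff): Vocab.load accepts a str or a
# pathlib.Path, normalised via util.ensure_path. The model path is hypothetical
# and should point at a directory containing a vocab/ subfolder.
from spacy.vocab import Vocab

vocab = Vocab.load('/path/to/model')
print(len(vocab))          # current number of lexemes stored
print(u'apple' in vocab)   # __contains__, as documented above
lexeme = vocab[u'apple']   # __getitem__ creates the lexeme if unseen
```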
```diff
@@ -12,7 +12,7 @@
     "COMPANY_URL": "https://explosion.ai",
     "DEMOS_URL": "https://demos.explosion.ai",

-    "SPACY_VERSION": "1.7",
+    "SPACY_VERSION": "1.8",
     "LATEST_NEWS": {
         "url": "https://survey.spacy.io/",
         "title": "Take the spaCy user survey and help us improve the library!"
```

```diff
@@ -2,9 +2,11 @@
     <defs>
         <symbol id="usersurvey" viewBox="0 0 200 111">
             <title>spaCy user survey 2017</title>
-            <path fill="#DDD" d="M183.3 89.2l-164.6-40-1-29.2 164.6 40M3.8 106.8l41.6-1.4-1-29.2-41.6 1.4L13.2 92"/>
-            <path fill="#DDD" d="M196.6 2L155 3.4l1 29.2 41.6-1.4L187.2 17"/>
-            <path fill="#FFF" d="M17.6 19.4l163-5.6 1 29.2-163 5.6zM19.2 65.6l163-5.6 1 29.2-163 5.6z"/>
+            <path fill="#ddd" d="M183.3 89.2l-164.6-40-1-29.2 164.6 40M3.8 106.8l41.6-1.4-1-29.2-41.6 1.4L13.2 92"/>
+            <path fill="#a3cad3" d="M45.4 105.4L19.6 94.6l25.4-1"/>
+            <path fill="#ddd" d="M196.6 2L155 3.4l1 29.2 41.6-1.4L187.2 17"/>
+            <path fill="#a3cad3" d="M155 3.4l25.8 10.8-25.4 1"/>
+            <path fill="#fff" d="M17.6 19.4l163-5.6 1 29.2-163 5.6zM19.2 65.6l163-5.6 1 29.2-163 5.6z"/>
             <path fill="#008EBC" d="M56.8 29h-3.6v-2.4l10-.4.2 2.5h-3.6l.4 10.8h-3L56.8 29zM71 36l-4 .2-.6 3.2h-3L67 26.2h3.6l4.6 13-3.2.2-1-3zm-.6-2.3l-.4-1.2-1.2-4.2-1 4.3-.3 1.2h3zM76 25.8h3l.3 5.3 4-5.4h3.2l-3.8 5.3 5 7.7h-3.3L81 33.4l-1.5 2V39h-3l-.4-13.2zM88.5 25.4l8.3-.3v2.6l-5.2.2v2.6l4.6-.2v2.5L92 33v3l5.6-.3v2.5l-8.4.3-.5-13zM106.4 27.3h-3.6V25l10-.5.2 2.5h-3.6l.4 10.8h-3l-.4-10.5zM115 24.5h3v5l4.7-.2-.2-5 3-.2.5 13.3h-3l-.2-5.4h-4.6l.2 5.6h-3l-.5-13zM128.5 24l8.3-.3v2.5l-5.2.2V29l4.6-.2v2.5l-4.4.2v3l5.6-.2v2.5l-8.4.3-.5-13z"/>
             <path fill="#1A1E23" d="M44.5 73h3l.3 7.4c0 2.6 1 3.4 2.4 3.4s2.3-1 2.2-3.6l-.3-7.4h3l.2 7c.2 4.4-1.6 6.4-5 6.5-3.4 0-5.3-1.7-5.5-6.2l-.2-7zM59 82c1 1 2.2 1.4 3.3 1.4 1.2 0 1.8-.5 1.8-1.3 0-.7-.7-1-2-1.4l-1.6-.7c-1.4-.6-2.7-1.7-2.8-3.6 0-2.2 1.8-4 4.6-4 1.5-.2 3 .4 4.3 1.5L65 75.7c-1-.6-1.7-1-2.8-1-1 0-1.7.5-1.6 1.2 0 .8 1 1 2 1.5l1.8.6c1.6.7 2.7 1.7 2.7 3.6.2 2.2-1.6 4-4.7 4.3-1.7 0-3.6-.6-5-1.8l1.7-2zM69 72.3l8.3-.3v2.5l-5.2.2.2 2.7 4.5-.2v2.5l-4.4.2v3l5.6-.3v2.5l-8.5.3-.5-13.2zM87.6 84.8L85 80l-1.7.2.2 4.8h-3L80 72l4.8-.3c2.8 0 5 .8 5.2 4 0 1.8-.8 3-2.2 3.8l3.2 5.2h-3.4zm-4.4-7h1.5c1.6 0 2.4-.7 2.3-2 0-1.3-1-1.7-2.5-1.7l-1.5.2.2 3.7zM98 80.8c1 .8 2 1.3 3.2 1.3 1.2 0 1.8-.4 1.8-1.2 0-.8-.8-1-2-1.5l-1.7-.7C98 78 96.6 77 96.5 75c0-2 1.8-4 4.6-4 1.6 0 3.2.5 4.4 1.6l-1.4 2c-1-.7-1.7-1-2.8-1-1 0-1.7.4-1.6 1 0 1 1 1.2 2 1.6l1.8.6c1.6.6 2.7 1.6 2.7 3.5.2 2.2-1.6 4-4.7 4.3-1.7 0-3.6-.5-5-1.7l1.6-2.2zM107.8 71l3-.2.3 7.4c.2 2.6 1 3.4 2.5 3.4s2.3-1 2.2-3.6l-.3-7.4h3v7c.3 4.4-1.5 6.4-5 6.5-3.3.2-5.2-1.6-5.4-6l-.2-7zM129 83.4l-2.8-4.7h-1.6l.2 5h-3l-.5-13.2 4.8-.2c3 0 5.2.8 5.3 4 0 1.8-.8 3-2.2 3.8l3.3 5.3H129zm-4.5-7h1.5c1.6-.2 2.4-1 2.3-2 0-1.4-1-1.8-2.5-1.8h-1.5l.2 3.8zM131.6 70h3.2l1.8 6 1.2 4.3c.5-1.5.7-2.8 1-4.3l1.3-6.2h3.2L139.7 83H136l-4.4-13zM144.6 69.7l8.3-.3V72h-5.3V75l4.6-.2V77l-4.4.3v3l5.5-.2v2.6l-8.4.3-.4-13.3zM158.2 77.7l-4.3-8.4h3l1.4 3 1.2 2.8c.4-1 .8-1.8 1-3l1.2-3h3l-3.6 8.5.2 4.7h-3l-.2-4.5z"/>
         </symbol>
```
```diff
@@ -55,14 +55,14 @@ p Create or load the pipeline.

 +table(["Name", "Type", "Description"])
     +row
-        +cell #[code **kwrags]
+        +cell #[code **overrides]
         +cell -
         +cell Keyword arguments indicating which defaults to override.

     +footrow
         +cell return
         +cell #[code Language]
-        +cell #[code self]
+        +cell The newly constructed object.

 +h(2, "call") Language.__call__
 +tag method
@@ -136,3 +136,19 @@ p
         +cell yield
         +cell #[code Doc]
         +cell Containers for accessing the linguistic annotations.
+
++h(2, "save_to_directory") Language.save_to_directory
++tag method
+
+p Save the #[code Vocab], #[code StringStore] and pipeline to a directory.
+
++table(["Name", "Type", "Description"])
+    +row
+        +cell #[code path]
+        +cell string or pathlib path
+        +cell Path to save the model.
+
+    +footrow
+        +cell return
+        +cell #[code None]
+        +cell -
```
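The docs addition above introduces `Language.save_to_directory`. A minimal illustrative sketch, not part of the commit; the target directory is hypothetical:

```python
# Illustrative only (not part of the diff): Language.save_to_directory takes a
# string or pathlib path and returns None. The target directory is hypothetical.
import spacy

nlp = spacy.load('en')
# ... customise or train the pipeline here ...
nlp.save_to_directory('/tmp/my_spacy_model')
```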
```diff
@@ -20,8 +20,10 @@
         "Word vectors": "word-vectors-similarities",
         "Deep learning": "deep-learning",
         "Custom tokenization": "customizing-tokenizer",
+        "Adding languages": "adding-languages",
         "Training": "training",
-        "Adding languages": "adding-languages"
+        "Training NER": "training-ner",
+        "Saving & loading": "saving-loading"
     },
     "Examples": {
         "Tutorials": "tutorials",
@@ -101,11 +103,21 @@

     "customizing-tokenizer": {
         "title": "Customizing the tokenizer",
-        "next": "training"
+        "next": "adding-languages"
     },

     "training": {
-        "title": "Training the tagger, parser and entity recognizer"
+        "title": "Training spaCy's statistical models",
+        "next": "saving-loading"
+    },
+
+    "training-ner": {
+        "title": "Training the Named Entity Recognizer",
+        "next": "saving-loading"
+    },
+
+    "saving-loading": {
+        "title": "Saving and loading models"
     },

     "pos-tagging": {
@@ -356,6 +368,18 @@
     },

     "code": {
+        "Training a new entity type": {
+            "url": "https://github.com/explosion/spaCy/blob/master/examples/training/train_new_entity_type.py",
+            "author": "Matthew Honnibal",
+            "tags": ["ner", "training"]
+        },
+
+        "Training an NER system from scratch": {
+            "url": "https://github.com/explosion/spaCy/blob/master/examples/training/train_ner_standalone.py",
+            "author": "Matthew Honnibal",
+            "tags": ["ner", "training"]
+        },
+
         "Information extraction": {
             "url": "https://github.com/explosion/spaCy/blob/master/examples/information_extraction.py",
             "author": "Matthew Honnibal",
```
|
@ -63,14 +63,16 @@ p
|
||||||
tag_map = TAG_MAP
|
tag_map = TAG_MAP
|
||||||
stop_words = STOP_WORDS
|
stop_words = STOP_WORDS
|
||||||
|
|
||||||
p Additionally, the new #[code Language] class needs to be registered in #[+src(gh("spaCy", "spacy/__init__.py")) spacy/__init__.py] using the #[code set_lang_class()] function, so that you can use #[code spacy.load()].
|
p
|
||||||
|
| Additionally, the new #[code Language] class needs to be added to the
|
||||||
|
| list of available languages in #[+src(gh("spaCy", "spacy/__init__.py")) __init__.py].
|
||||||
|
| The languages are then registered using the #[code set_lang_class()] function.
|
||||||
|
|
||||||
+code("spacy/__init__.py").
|
+code("spacy/__init__.py").
|
||||||
from . import en
|
from . import en
|
||||||
from . import xx
|
from . import xx
|
||||||
|
|
||||||
set_lang_class(en.English.lang, en.English)
|
_languages = (en.English, ..., xx.Xxxxx)
|
||||||
set_lang_class(xx.Xxxxx.lang, xx.Xxxxx)
|
|
||||||
|
|
||||||
p You'll also need to list the new package in #[+src(gh("spaCy", "spacy/setup.py")) setup.py]:
|
p You'll also need to list the new package in #[+src(gh("spaCy", "spacy/setup.py")) setup.py]:
|
||||||
|
|
||||||
|
@@ -398,11 +400,12 @@ p
|
||||||
| vectors files, you can use the
|
| vectors files, you can use the
|
||||||
| #[+src(gh("spacy-dev-resources", "training/init.py")) init.py]
|
| #[+src(gh("spacy-dev-resources", "training/init.py")) init.py]
|
||||||
| script from our
|
| script from our
|
||||||
| #[+a(gh("spacy-dev-resources")) developer resources] to create a
|
| #[+a(gh("spacy-dev-resources")) developer resources], or use the new
|
||||||
| spaCy data directory:
|
| #[+a("/docs/usage/cli#model") #[code model] command] to create a data
|
||||||
|
| directory:
|
||||||
|
|
||||||
+code(false, "bash").
|
+code(false, "bash").
|
||||||
python training/init.py xx your_data_directory/ my_data/word_freqs.txt my_data/clusters.txt my_data/word_vectors.bz2
|
python -m spacy model [lang] [model_dir] [freqs_data] [clusters_data] [vectors_data]
|
||||||
|
|
||||||
+aside-code("your_data_directory", "yaml").
|
+aside-code("your_data_directory", "yaml").
|
||||||
├── vocab/
|
├── vocab/
|
||||||
|
@@ -421,17 +424,14 @@ p
|
||||||
|
|
||||||
p
|
p
|
||||||
| This creates a spaCy data directory with a vocabulary model, ready to be
|
| This creates a spaCy data directory with a vocabulary model, ready to be
|
||||||
| loaded. By default, the
|
| loaded. By default, the command expects to be able to find your language
|
||||||
| #[+src(gh("spacy-dev-resources", "training/init.py")) init.py]
|
| class using #[code spacy.util.get_lang_class(lang_id)].
|
||||||
| script expects to be able to find your language class using
|
|
||||||
| #[code spacy.util.get_lang_class(lang_id)]. You can edit the script to
|
|
||||||
| help it find your language class if necessary.
|
|
||||||
|
|
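As a quick sanity check, the lookup the command performs can be reproduced directly. This is only a sketch; `'xx'` stands in for the placeholder language ID used throughout this page:

```python
from spacy.util import get_lang_class

# Should return the Language subclass registered for the ID, e.g. xx.Xxxxx
lang_cls = get_lang_class('xx')
assert lang_cls.lang == 'xx'
```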
||||||
+h(3, "word-frequencies") Word frequencies
|
+h(3, "word-frequencies") Word frequencies
|
||||||
|
|
||||||
p
|
p
|
||||||
| The #[+src(gh("spacy-dev-resources", "training/init.py")) init.py]
|
| The #[+a("/docs/usage/cli#model") #[code model] command] expects a
|
||||||
| script expects a tab-separated word frequencies file with three columns:
|
| tab-separated word frequencies file with three columns:
|
||||||
|
|
||||||
+list("numbers")
|
+list("numbers")
|
||||||
+item The number of times the word occurred in your language sample.
|
+item The number of times the word occurred in your language sample.
|
||||||
|
|
|
@@ -145,7 +145,9 @@ p
|
||||||
+h(2, "model") Model
|
+h(2, "model") Model
|
||||||
+tag experimental
|
+tag experimental
|
||||||
|
|
||||||
p Initialise a new model and its data directory.
|
p
|
||||||
|
| Initialise a new model and its data directory. For more info on this, see
|
||||||
|
| the documentation on #[+a("/docs/usage/adding-languages") adding languages].
|
||||||
|
|
||||||
+code(false, "bash").
|
+code(false, "bash").
|
||||||
python -m spacy model [lang] [model_dir] [freqs_data] [clusters_data] [vectors_data]
|
python -m spacy model [lang] [model_dir] [freqs_data] [clusters_data] [vectors_data]
|
||||||
|
@@ -246,15 +248,17 @@ p
|
||||||
+tag experimental
|
+tag experimental
|
||||||
|
|
||||||
p
|
p
|
||||||
| Generate a #[+a("/docs/usage/models#own-models") model Python package]
|
| Generate a #[+a("/docs/usage/saving-loading#generating") model Python package]
|
||||||
| from an existing model data directory. All data files are copied over,
|
| from an existing model data directory. All data files are copied over.
|
||||||
| and the meta data can be entered directly from the command line. While
|
| If the path to a meta.json is supplied, or a meta.json is found in the
|
||||||
| this feature is still experimental, the required file templates are
|
| input directory, this file is used. Otherwise, the data can be entered
|
||||||
| downloaded from #[+src(gh("spacy-dev-resources", "templates/model")) GitHub].
|
| directly from the command line. While this feature is still experimental,
|
||||||
| This means you need to be connected to the internet to use this command.
|
| the required file templates are downloaded from
|
||||||
|
| #[+src(gh("spacy-dev-resources", "templates/model")) GitHub]. This means
|
||||||
|
| you need to be connected to the internet to use this command.
|
||||||
|
|
||||||
+code(false, "bash").
|
+code(false, "bash").
|
||||||
python -m spacy package [input_dir] [output_dir] [--force]
|
python -m spacy package [input_dir] [output_dir] [--meta] [--force]
|
||||||
|
|
||||||
+table(["Argument", "Type", "Description"])
|
+table(["Argument", "Type", "Description"])
|
||||||
+row
|
+row
|
||||||
|
@@ -267,6 +271,11 @@ p
|
||||||
+cell positional
|
+cell positional
|
||||||
+cell Directory to create package folder in.
|
+cell Directory to create package folder in.
|
||||||
|
|
||||||
|
+row
|
||||||
|
+cell #[code meta]
|
||||||
|
+cell option
|
||||||
|
+cell Path to meta.json file (optional).
|
||||||
|
|
||||||
+row
|
+row
|
||||||
+cell #[code --force], #[code -f]
|
+cell #[code --force], #[code -f]
|
||||||
+cell flag
|
+cell flag
|
||||||
|
|
|
@@ -57,7 +57,7 @@ p
|
||||||
doc.ents = [Span(doc, 0, 1, label=doc.vocab.strings['GPE'])]
|
doc.ents = [Span(doc, 0, 1, label=doc.vocab.strings['GPE'])]
|
||||||
assert doc[0].ent_type_ == 'GPE'
|
assert doc[0].ent_type_ == 'GPE'
|
||||||
doc.ents = []
|
doc.ents = []
|
||||||
doc.ents = [(u'LondonCity', doc.vocab.strings['GPE']), 0, 1)]
|
doc.ents = [(u'LondonCity', doc.vocab.strings['GPE'], 0, 1)]
|
||||||
|
|
||||||
p
|
p
|
||||||
| The value you assign should be a sequence, the values of which
|
| The value you assign should be a sequence, the values of which
|
||||||
|
|
|
@@ -137,7 +137,7 @@ p
|
||||||
return word.ent_type != 0
|
return word.ent_type != 0
|
||||||
|
|
||||||
def count_parent_verb_by_person(docs):
|
def count_parent_verb_by_person(docs):
|
||||||
counts = defaultdict(defaultdict(int))
|
counts = defaultdict(lambda: defaultdict(int))
|
||||||
for doc in docs:
|
for doc in docs:
|
||||||
for ent in doc.ents:
|
for ent in doc.ents:
|
||||||
if ent.label_ == 'PERSON' and ent.root.head.pos == VERB:
|
if ent.label_ == 'PERSON' and ent.root.head.pos == VERB:
|
||||||
|
|
|
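The fix in this hunk is needed because `defaultdict` requires a zero-argument callable as its default factory; a `defaultdict(int)` instance is not callable, so wrapping it in a `lambda` is the standard idiom. A self-contained illustration, independent of spaCy:

```python
from collections import defaultdict

# defaultdict(defaultdict(int)) raises TypeError: first argument must be
# callable or None. The lambda defers creation of each inner defaultdict.
counts = defaultdict(lambda: defaultdict(int))
counts['she']['VERB'] += 1
print(counts['she']['VERB'])  # 1
```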
@@ -235,62 +235,13 @@ p
|
||||||
|
|
||||||
p
|
p
|
||||||
| If you've trained your own model, for example for
|
| If you've trained your own model, for example for
|
||||||
| #[+a("/docs/usage/adding-languages") additional languages], you can
|
| #[+a("/docs/usage/adding-languages") additional languages] or
|
||||||
| create a shortuct link for it by pointing #[code spacy.link] to the
|
| #[+a("/docs/usage/train-ner") custom named entities], you can save its
|
||||||
| model's data directory. To allow your model to be downloaded and
|
| state using the #[code Language.save_to_directory()] method. To make the
|
||||||
| installed via pip, you'll also need to generate a package for it. You can
|
| model more convenient to deploy, we recommend wrapping it as a Python
|
||||||
| do this manually, or via the new
|
| package.
|
||||||
| #[+a("/docs/usage/cli#package") #[code spacy package] command] that will
|
|
||||||
| create all required files, and walk you through generating the meta data.
|
|
||||||
|
|
||||||
|
+infobox("Saving and loading models")
|
||||||
+infobox("Important note")
|
| For more information and a detailed guide on how to package your model,
|
||||||
| The model packages are #[strong not suitable] for the public
|
| see the documentation on
|
||||||
| #[+a("https://pypi.python.org") pypi.python.org] directory, which is not
|
| #[+a("/docs/usage/saving-loading") saving and loading models].
|
||||||
| designed for binary data and files over 50 MB. However, if your company
|
|
||||||
| is running an internal installation of pypi, publishing your models on
|
|
||||||
| there can be a convenient solution to share them with your team.
|
|
||||||
|
|
||||||
p The model directory should look like this:
|
|
||||||
|
|
||||||
+code("Directory structure", "yaml").
|
|
||||||
└── /
|
|
||||||
├── MANIFEST.in # to include meta.json
|
|
||||||
├── meta.json # model meta data
|
|
||||||
├── setup.py # setup file for pip installation
|
|
||||||
└── en_core_web_md # model directory
|
|
||||||
├── __init__.py # init for pip installation
|
|
||||||
└── en_core_web_md-1.2.0 # model data
|
|
||||||
|
|
||||||
p
|
|
||||||
| You can find templates for all files in our
|
|
||||||
| #[+a(gh("spacy-dev-resouces", "templates/model")) spaCy dev resources].
|
|
||||||
| Unless you want to customise installation and loading, the only file
|
|
||||||
| you'll need to modify is #[code meta.json], which includes the model's
|
|
||||||
| meta data. It will later be copied into the package and data directory.
|
|
||||||
|
|
||||||
+code("meta.json", "json").
|
|
||||||
{
|
|
||||||
"name": "core_web_md",
|
|
||||||
"lang": "en",
|
|
||||||
"version": "1.2.0",
|
|
||||||
"spacy_version": "1.7.0",
|
|
||||||
"description": "English model for spaCy",
|
|
||||||
"author": "Explosion AI",
|
|
||||||
"email": "contact@explosion.ai",
|
|
||||||
"license": "MIT"
|
|
||||||
}
|
|
||||||
|
|
||||||
p
|
|
||||||
| Keep in mind that the directories need to be named according to the
|
|
||||||
| naming conventions. The #[code lang] setting is also used to create the
|
|
||||||
| respective #[code Language] class in spaCy, which will later be returned
|
|
||||||
| by the model's #[code load()] method.
|
|
||||||
|
|
||||||
p
|
|
||||||
| To generate the package, run the following command from within the
|
|
||||||
| directory. This will create a #[code .tar.gz] archive in a directory
|
|
||||||
| #[code /dist].
|
|
||||||
|
|
||||||
+code(false, "bash").
|
|
||||||
python setup.py sdist
|
|
||||||
|
|
109
website/docs/usage/saving-loading.jade
Normal file
109
website/docs/usage/saving-loading.jade
Normal file
|
@@ -0,0 +1,109 @@
|
||||||
|
include ../../_includes/_mixins
|
||||||
|
|
||||||
|
p
|
||||||
|
| After training your model, you'll usually want to save its state, and load
|
||||||
|
| it back later. You can do this with the
|
||||||
|
| #[+api("language#save_to_directory") #[code Language.save_to_directory()]]
|
||||||
|
| method:
|
||||||
|
|
||||||
|
+code.
|
||||||
|
nlp.save_to_directory('/home/me/data/en_example_model')
|
||||||
|
|
||||||
|
p
|
||||||
|
| The directory will be created if it doesn't exist, and the whole pipeline
|
||||||
|
| will be written out. To make the model more convenient to deploy, we
|
||||||
|
| recommend wrapping it as a Python package.
|
||||||
|
|
||||||
|
+h(2, "generating") Generating a model package
|
||||||
|
|
||||||
|
+infobox("Important note")
|
||||||
|
| The model packages are #[strong not suitable] for the public
|
||||||
|
| #[+a("https://pypi.python.org") pypi.python.org] directory, which is not
|
||||||
|
| designed for binary data and files over 50 MB. However, if your company
|
||||||
|
| is running an internal installation of pypi, publishing your models on
|
||||||
|
| there can be a convenient solution to share them with your team.
|
||||||
|
|
||||||
|
p
|
||||||
|
| spaCy comes with a handy CLI command that will create all required files,
|
||||||
|
| and walk you through generating the meta data. You can also create the
|
||||||
|
| meta.json manually and place it in the model data directory, or supply a
|
||||||
|
| path to it using the #[code --meta] flag. For more info on this, see the
|
||||||
|
| #[+a("/docs/usage/cli/#package") #[code package] command] documentation.
|
||||||
|
|
||||||
|
+aside-code("meta.json", "json").
|
||||||
|
{
|
||||||
|
"name": "example_model",
|
||||||
|
"lang": "en",
|
||||||
|
"version": "1.0.0",
|
||||||
|
"spacy_version": ">=1.7.0,<2.0.0",
|
||||||
|
"description": "Example model for spaCy",
|
||||||
|
"author": "You",
|
||||||
|
"email": "you@example.com",
|
||||||
|
"license": "CC BY-SA 3.0"
|
||||||
|
}
|
||||||
|
|
||||||
|
+code(false, "bash").
|
||||||
|
python -m spacy package /home/me/data/en_example_model /home/me/my_models
|
||||||
|
|
||||||
|
p This command will create a model package directory that should look like this:
|
||||||
|
|
||||||
|
+code("Directory structure", "yaml").
|
||||||
|
└── /
|
||||||
|
├── MANIFEST.in # to include meta.json
|
||||||
|
├── meta.json # model meta data
|
||||||
|
├── setup.py # setup file for pip installation
|
||||||
|
└── en_example_model # model directory
|
||||||
|
├── __init__.py # init for pip installation
|
||||||
|
└── en_example_model-1.0.0 # model data
|
||||||
|
|
||||||
|
p
|
||||||
|
| You can also find templates for all files in our
|
||||||
|
| #[+a(gh("spacy-dev-resouces", "templates/model")) spaCy dev resources].
|
||||||
|
| If you're creating the package manually, keep in mind that the directories
|
||||||
|
| need to be named according to the naming conventions of
|
||||||
|
| #[code [language]_[type]] and #[code [language]_[type]-[version]]. The
|
||||||
|
| #[code lang] setting in the meta.json is also used to create the
|
||||||
|
| respective #[code Language] class in spaCy, which will later be returned
|
||||||
|
| by the model's #[code load()] method.
|
||||||
|
|
||||||
|
+h(2, "building") Building a model package
|
||||||
|
|
||||||
|
p
|
||||||
|
| To build the package, run the following command from within the
|
||||||
|
| directory. This will create a #[code .tar.gz] archive in a directory
|
||||||
|
| #[code /dist].
|
||||||
|
|
||||||
|
+code(false, "bash").
|
||||||
|
python setup.py sdist
|
||||||
|
|
||||||
|
p
|
||||||
|
| For more information on building Python packages, see the
|
||||||
|
| #[+a("https://setuptools.readthedocs.io/en/latest/") Python Setuptools documentation].
|
||||||
|
|
||||||
|
|
||||||
|
+h(2, "loading") Loading a model package
|
||||||
|
|
||||||
|
p
|
||||||
|
| Model packages can be installed by pointing pip to the model's
|
||||||
|
| #[code .tar.gz] archive:
|
||||||
|
|
||||||
|
+code(false, "bash").
|
||||||
|
pip install /path/to/en_example_model-1.0.0.tar.gz
|
||||||
|
|
||||||
|
p You'll then be able to load the model as follows:
|
||||||
|
|
||||||
|
+code.
|
||||||
|
import en_example_model
|
||||||
|
nlp = en_example_model.load()
|
||||||
|
|
||||||
|
p
|
||||||
|
| To load the model via #[code spacy.load()], you can also
|
||||||
|
| create a #[+a("/docs/usage/models#usage") shortcut link] that maps the
|
||||||
|
| package name to a custom model name of your choice:
|
||||||
|
|
||||||
|
+code(false, "bash").
|
||||||
|
python -m spacy link en_example_model example
|
||||||
|
|
||||||
|
+code.
|
||||||
|
import spacy
|
||||||
|
nlp = spacy.load('example')
|
174
website/docs/usage/training-ner.jade
Normal file
174
website/docs/usage/training-ner.jade
Normal file
|
@@ -0,0 +1,174 @@
|
||||||
|
include ../../_includes/_mixins
|
||||||
|
|
||||||
|
p
|
||||||
|
| All #[+a("/docs/usage/models") spaCy models] support online learning, so
|
||||||
|
| you can update a pre-trained model with new examples. You can even add
|
||||||
|
| new classes to an existing model, to recognise a new entity type,
|
||||||
|
| part-of-speech, or syntactic relation. Updating an existing model is
|
||||||
|
| particularly useful as a "quick and dirty solution", if you have only a
|
||||||
|
| few corrections or annotations.
|
||||||
|
|
||||||
|
+h(2, "improving-accuracy") Improving accuracy on existing entity types
|
||||||
|
|
||||||
|
p
|
||||||
|
| To update the model, you first need to create an instance of
|
||||||
|
| #[+api("goldparse") #[code spacy.gold.GoldParse]], with the entity labels
|
||||||
|
| you want to learn. You will then pass this instance to the
|
||||||
|
| #[+api("entityrecognizer#update") #[code EntityRecognizer.update()]]
|
||||||
|
| method. For example:
|
||||||
|
|
||||||
|
+code.
|
||||||
|
import spacy
|
||||||
|
from spacy.gold import GoldParse
|
||||||
|
|
||||||
|
nlp = spacy.load('en')
|
||||||
|
doc = nlp.make_doc(u'Facebook released React in 2014')
|
||||||
|
gold = GoldParse(doc, entities=['U-ORG', 'O', 'U-TECHNOLOGY', 'O', 'U-DATE'])
|
||||||
|
nlp.entity.update(doc, gold)
|
||||||
|
|
||||||
|
p
|
||||||
|
| You'll usually need to provide many examples to meaningfully improve the
|
||||||
|
| system — a few hundred is a good start, although more is better. You
|
||||||
|
| should avoid iterating over the same few examples multiple times, or the
|
||||||
|
| model is likely to "forget" how to annotate other examples. If you
|
||||||
|
| iterate over the same few examples, you're effectively changing the loss
|
||||||
|
| function. The optimizer will find a way to minimize the loss on your
|
||||||
|
| examples, without regard for the consequences on the examples it's no
|
||||||
|
| longer paying attention to.
|
||||||
|
|
||||||
|
p
|
||||||
|
| One way to avoid this "catastrophic forgetting" problem is to "remind"
|
||||||
|
| the model of other examples by augmenting your annotations with sentences
|
||||||
|
| annotated with entities automatically recognised by the original model.
|
||||||
|
| Ultimately, this is an empirical process: you'll need to
|
||||||
|
| #[strong experiment on your own data] to find a solution that works best
|
||||||
|
| for you.
|
||||||
|
|
||||||
|
+h(2, "adding") Adding a new entity type
|
||||||
|
|
||||||
|
p
|
||||||
|
| You can add new entity types to an existing model. Let's say we want to
|
||||||
|
| recognise the category #[code TECHNOLOGY]. The new category will include
|
||||||
|
| programming languages, frameworks and platforms. First, we need to
|
||||||
|
| register the new entity type:
|
||||||
|
|
||||||
|
+code.
|
||||||
|
nlp.entity.add_label('TECHNOLOGY')
|
||||||
|
|
||||||
|
p
|
||||||
|
| Next, iterate over your examples, calling #[code entity.update()]. As
|
||||||
|
| above, we want to avoid iterating over only a small number of sentences.
|
||||||
|
| A useful compromise is to run the model over a number of plain-text
|
||||||
|
| sentences, and pass the entities to #[code GoldParse], as "true"
|
||||||
|
| annotations. This encourages the optimizer to find a solution that
|
||||||
|
| predicts the new category with minimal difference from the previous
|
||||||
|
| output.
|
||||||
|
|
||||||
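A rough sketch of the compromise described above (variable names and the example sentence are illustrative only): let the current model annotate plain text, keep its predictions as "true" annotations, and mix them into the update loop alongside your new examples.

```python
from spacy.gold import GoldParse

# `nlp` is the model loaded and extended with the new label above.
extra_texts = [u'Google released TensorFlow in 2015.']   # placeholder sentences
revision_data = []
for text in extra_texts:
    doc = nlp(text)   # the current model's predictions become gold annotations
    entities = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
    revision_data.append((text, entities))

# Mix revision_data with your new TECHNOLOGY examples, shuffle, then update:
for text, entity_offsets in revision_data:
    doc = nlp.make_doc(text)
    gold = GoldParse(doc, entities=entity_offsets)
    nlp.tagger(doc)
    nlp.entity.update(doc, gold)
```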
|
+h(2, "saving-loading") Saving and loading
|
||||||
|
|
||||||
|
p
|
||||||
|
| After training your model, you'll usually want to save its state, and load
|
||||||
|
| it back later. You can do this with the #[code Language.save_to_directory()]
|
||||||
|
| method:
|
||||||
|
|
||||||
|
+code.
|
||||||
|
nlp.save_to_directory('/home/me/data/en_technology')
|
||||||
|
|
||||||
|
p
|
||||||
|
| To make the model more convenient to deploy, we recommend wrapping it as
|
||||||
|
| a Python package, so that you can install it via pip and load it as a
|
||||||
|
| module. spaCy comes with a handy #[+a("/docs/usage/cli#package") CLI command]
|
||||||
|
| to create all required files and directories.
|
||||||
|
|
||||||
|
+code(false, "bash").
|
||||||
|
python -m spacy package /home/me/data/en_technology /home/me/my_models
|
||||||
|
|
||||||
|
p
|
||||||
|
| To build the package and create a #[code .tar.gz] archive, run
|
||||||
|
| #[code python setup.py sdist] from within its directory.
|
||||||
|
|
||||||
|
+infobox("Saving and loading models")
|
||||||
|
| For more information and a detailed guide on how to package your model,
|
||||||
|
| see the documentation on
|
||||||
|
| #[+a("/docs/usage/saving-loading") saving and loading models].
|
||||||
|
|
||||||
|
p
|
||||||
|
| After you've generated and installed the package, you'll be able to
|
||||||
|
| load the model as follows:
|
||||||
|
|
||||||
|
+code.
|
||||||
|
import en_technology
|
||||||
|
nlp = en_technology.load()
|
||||||
|
|
||||||
|
+h(2, "example") Example: Adding and training an #[code ANIMAL] entity
|
||||||
|
|
||||||
|
p
|
||||||
|
| This script shows how to add a new entity type to an existing pre-trained
|
||||||
|
| NER model. To keep the example short and simple, only four sentences are
|
||||||
|
| provided as examples. In practice, you'll need many more —
|
||||||
|
| #[strong a few hundred] would be a good start. You will also likely need
|
||||||
|
| to mix in #[strong examples of other entity types], which might be
|
||||||
|
| obtained by running the entity recognizer over unlabelled sentences, and
|
||||||
|
| adding their annotations to the training set.
|
||||||
|
|
||||||
|
p
|
||||||
|
| For the full, runnable script of this example, see
|
||||||
|
| #[+src(gh("spacy", "examples/training/train_new_entity_type.py")) train_new_entity_type.py].
|
||||||
|
|
||||||
|
+code("Training the entity recognizer").
|
||||||
|
import spacy
|
||||||
|
from spacy.pipeline import EntityRecognizer
|
||||||
|
from spacy.gold import GoldParse
|
||||||
|
from spacy.tagger import Tagger
|
||||||
|
import random
|
||||||
|
|
||||||
|
model_name = 'en'
|
||||||
|
entity_label = 'ANIMAL'
|
||||||
|
output_directory = '/path/to/model'
|
||||||
|
train_data = [
|
||||||
|
("Horses are too tall and they pretend to care about your feelings",
|
||||||
|
[(0, 6, 'ANIMAL')]),
|
||||||
|
("horses are too tall and they pretend to care about your feelings",
|
||||||
|
[(0, 6, 'ANIMAL')]),
|
||||||
|
("horses pretend to care about your feelings",
|
||||||
|
[(0, 6, 'ANIMAL')]),
|
||||||
|
("they pretend to care about your feelings, those horses",
|
||||||
|
[(48, 54, 'ANIMAL')])
|
||||||
|
]
|
||||||
|
|
||||||
|
nlp = spacy.load(model_name)
|
||||||
|
nlp.entity.add_label(entity_label)
|
||||||
|
ner = train_ner(nlp, train_data, output_directory)
|
||||||
|
|
||||||
|
def train_ner(nlp, train_data, output_dir):
|
||||||
|
# Add new words to vocab
|
||||||
|
for raw_text, _ in train_data:
|
||||||
|
doc = nlp.make_doc(raw_text)
|
||||||
|
for word in doc:
|
||||||
|
_ = nlp.vocab[word.orth]
|
||||||
|
|
||||||
|
for itn in range(20):
|
||||||
|
random.shuffle(train_data)
|
||||||
|
for raw_text, entity_offsets in train_data:
|
||||||
|
doc = nlp.make_doc(raw_text)
|
||||||
|
gold = GoldParse(doc, entities=entity_offsets)
|
||||||
|
nlp.tagger(doc)
|
||||||
|
loss = nlp.entity.update(doc, gold)
|
||||||
|
nlp.end_training()
|
||||||
|
nlp.save_to_directory(output_dir)
|
||||||
|
|
||||||
|
p
|
||||||
|
+button(gh("spaCy", "examples/training/train_new_entity_type.py"), false, "secondary") Full example
|
||||||
|
|
||||||
|
p
|
||||||
|
| The actual training is performed by looping over the examples, and
|
||||||
|
| calling #[code nlp.entity.update()]. The #[code update()] method steps
|
||||||
|
| through the words of the input. At each word, it makes a prediction. It
|
||||||
|
| then consults the annotations provided on the #[code GoldParse] instance,
|
||||||
|
| to see whether it was right. If it was wrong, it adjusts its weights so
|
||||||
|
| that the correct action will score higher next time.
|
||||||
|
|
||||||
|
p
|
||||||
|
| After training your model, you can
|
||||||
|
| #[+a("/docs/usage/saving-loading") save it to a directory]. We recommend wrapping
|
||||||
|
| models as Python packages, for ease of deployment.
|
|
@@ -1,13 +1,10 @@
|
||||||
include ../../_includes/_mixins
|
include ../../_includes/_mixins
|
||||||
|
|
||||||
p
|
p
|
||||||
| This tutorial describes how to train new statistical models for spaCy's
|
| This workflow describes how to train new statistical models for spaCy's
|
||||||
| part-of-speech tagger, named entity recognizer and dependency parser.
|
| part-of-speech tagger, named entity recognizer and dependency parser.
|
||||||
|
| Once the model is trained, you can then
|
||||||
p
|
| #[+a("/docs/usage/saving-loading") save and load] it.
|
||||||
| I'll start with some quick code examples, that describe how to train
|
|
||||||
| each model. I'll then provide a bit of background about the algorithms,
|
|
||||||
| and explain how the data and feature templates work.
|
|
||||||
|
|
||||||
+h(2, "train-pos-tagger") Training the part-of-speech tagger
|
+h(2, "train-pos-tagger") Training the part-of-speech tagger
|
||||||
|
|
||||||
|
@@ -48,7 +45,21 @@ p
|
||||||
p
|
p
|
||||||
+button(gh("spaCy", "examples/training/train_ner.py"), false, "secondary") Full example
|
+button(gh("spaCy", "examples/training/train_ner.py"), false, "secondary") Full example
|
||||||
|
|
||||||
+h(2, "train-entity") Training the dependency parser
|
+h(2, "extend-entity") Extending the named entity recognizer
|
||||||
|
|
||||||
|
p
|
||||||
|
| All #[+a("/docs/usage/models") spaCy models] support online learning, so
|
||||||
|
| you can update a pre-trained model with new examples. You can even add
|
||||||
|
| new classes to an existing model, to recognise a new entity type,
|
||||||
|
| part-of-speech, or syntactic relation. Updating an existing model is
|
||||||
|
| particularly useful as a "quick and dirty solution", if you have only a
|
||||||
|
| few corrections or annotations.
|
||||||
|
|
||||||
|
p.o-inline-list
|
||||||
|
+button(gh("spaCy", "examples/training/train_new_entity_type.py"), true, "secondary") Full example
|
||||||
|
+button("/docs/usage/training-ner", false, "secondary") Usage Workflow
|
||||||
|
|
||||||
|
+h(2, "train-dependency") Training the dependency parser
|
||||||
|
|
||||||
+code.
|
+code.
|
||||||
from spacy.vocab import Vocab
|
from spacy.vocab import Vocab
|
||||||
|
@@ -67,7 +78,7 @@ p
|
||||||
p
|
p
|
||||||
+button(gh("spaCy", "examples/training/train_parser.py"), false, "secondary") Full example
|
+button(gh("spaCy", "examples/training/train_parser.py"), false, "secondary") Full example
|
||||||
|
|
||||||
+h(2, 'feature-templates') Customizing the feature extraction
|
+h(2, "feature-templates") Customizing the feature extraction
|
||||||
|
|
||||||
p
|
p
|
||||||
| spaCy currently uses linear models for the tagger, parser and entity
|
| spaCy currently uses linear models for the tagger, parser and entity
|
||||||
|
|