Matthew Honnibal 2017-04-20 17:03:11 +02:00
commit 1b12f342e4
13 changed files with 58 additions and 35 deletions


@ -87,7 +87,16 @@ Code should loosely follow [pep8](https://www.python.org/dev/peps/pep-0008/). Re
### Python conventions
All Python code must be written in an **intersection of Python 2 and Python 3**. This is easy in Cython, but somewhat ugly in Python. We could use some extra utilities for this. Please pay particular attention to code that serialises json objects.
All Python code must be written in an **intersection of Python 2 and Python 3**. This is easy in Cython, but somewhat ugly in Python. Logic that deals with Python or platform compatibility should only live in [`spacy.compat`](spacy/compat.py). To distinguish them from the builtin functions, replacement functions are suffixed with an underscore, for example `unicode_`. If you need to access the user's version or platform information, for example to show more specific error messages, you can use the `is_config()` helper function.
```python
from .compat import unicode_, json_dumps, is_config
compatible_unicode = unicode_('hello world')
compatible_json = json_dumps({'key': 'value'})
if is_config(windows=True, python2=True):
    print("You are using Python 2 on Windows.")
```
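For readers who want to see what such a helper could look like, here is a minimal sketch of an `is_config()`-style check built only on the standard library. The flag names and the keyword arguments are assumptions made for illustration; the actual implementation in `spacy.compat` may differ.

```python
import sys

# Assumed module-level flags; the names are illustrative, not confirmed API.
is_python2 = sys.version_info[0] == 2
is_python3 = sys.version_info[0] == 3
is_windows = sys.platform.startswith('win')
is_linux = sys.platform.startswith('linux')
is_osx = sys.platform == 'darwin'


def is_config(python2=None, python3=None, windows=None, linux=None, osx=None):
    """Return True if every requirement that was passed matches this system.

    Arguments left as None are ignored, so is_config() with no arguments
    is always True.
    """
    return ((python2 is None or python2 == is_python2) and
            (python3 is None or python3 == is_python3) and
            (windows is None or windows == is_windows) and
            (linux is None or linux == is_linux) and
            (osx is None or osx == is_osx))
```

Keeping all of these checks in one module means the rest of the code base never has to inspect `sys` directly.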
Code that interacts with the file-system should accept objects that follow the `pathlib.Path` API, without assuming that the object inherits from `pathlib.Path`. If the function is user-facing and takes a path as an argument, it should check whether the path is provided as a string. Strings should be converted to `pathlib.Path` objects.
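The path rule above can be sketched as follows. Both `ensure_path` and `read_text` are hypothetical names used only for illustration, and the string check shown here is the Python 3 form; code that must also run on Python 2 would presumably check against a string type taken from `spacy.compat`.

```python
from pathlib import Path


def ensure_path(path):
    """Convert strings to Path objects; return anything else unchanged."""
    # Only plain strings are converted. Other objects are assumed to follow
    # the pathlib.Path API already, so there is no isinstance(path, Path) check.
    if isinstance(path, str):
        return Path(path)
    return path


def read_text(path):
    """Hypothetical user-facing function that takes a string or a path-like object."""
    path = ensure_path(path)
    with path.open('r', encoding='utf8') as f:
        return f.read()
```

Because `ensure_path` never tests for `pathlib.Path` itself, a caller can pass any object that provides the same methods, for example a wrapper around a remote file store.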
@ -95,6 +104,8 @@ At the time of writing (v1.7), spaCy's serialization and deserialization functio
Although spaCy uses a lot of classes, inheritance is viewed with some suspicion — it's seen as a mechanism of last resort. You should discuss plans to extend the class hierarchy before implementing.
We have a number of conventions around variable naming that are still being documented, and aren't 100% strict. A general policy is that instances of the class `Doc` should by default be called `doc`, `Token` `token`, `Lexeme` `lex`, `Vocab` `vocab` and `Language` `nlp`. You should avoid giving these names to variables of other types. For instance, don't name a text string `doc`; you should usually call this `text`. Two general code style preferences further help with naming. First, lean away from introducing temporary variables, as these clutter your namespace. This is one reason why comprehension expressions are often preferred. Second, keep your functions shortish, so that you can work in a smaller scope. Of course, this is a question of trade-offs.
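As a small illustration of the two style preferences, the sketch below writes the same function twice. The function names are made up for this example, and plain strings stand in for spaCy objects so that the snippet runs without spaCy installed.

```python
def word_lengths_verbose(text):
    """Version with temporary variables that clutter the namespace."""
    lengths = []
    for word in text.split():
        length = len(word)  # temporary variable used only once
        lengths.append(length)
    return lengths


def word_lengths(text):
    """Preferred version: a comprehension, with no temporaries."""
    # The argument is a plain string, so it is called `text`, not `doc`.
    return [len(word) for word in text.split()]
```

Both functions return the same list; the second one simply introduces fewer names into a smaller scope.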
### Cython conventions
spaCy's core data structures are implemented as [Cython](http://cython.org/) `cdef` classes. Memory is managed through the `cymem.cymem.Pool` class, which allows you to allocate memory which will be freed when the `Pool` object is garbage collected. This means you usually don't have to worry about freeing memory. You just have to decide which Python object owns the memory, and make it own the `Pool`. When that object goes out of scope, the memory will be freed. You do have to take care that no pointers outlive the object that owns them — but this is generally quite easy.
@ -126,7 +137,7 @@ cdef int c_total(const int* int_array, int length) nogil:
return total
```
If this is confusing, consider that the compiler couldn't deal with `for item in int_array:` — there's no length attached to a raw pointer, so how could we figure out where to stop? The length is provided in the slice notation as a solution to this. Note that we don't have to declare the type of `item` in the code above -- the compiler can easily infer it. This gives us tidy code that looks quite like Python, but is exactly as fast as C — because we've made sure the compilation to C is trivial.
If this is confusing, consider that the compiler couldn't deal with `for item in int_array:`. There's no length attached to a raw pointer, so how could we figure out where to stop? The length is provided in the slice notation as a solution to this. Note that we don't have to declare the type of `item` in the code above, because the compiler can easily infer it. This gives us tidy code that looks quite like Python, but is exactly as fast as C, because we've made sure the compilation to C is trivial.
Your functions cannot be declared `nogil` if they need to create Python objects or call Python functions. This is perfectly okay — you shouldn't torture your code just to get `nogil` functions. However, if your function isn't `nogil`, you should compile your module with `cython -a --cplus my_module.pyx` and open the resulting `my_module.html` file in a browser. This will let you see how Cython is compiling your code. Calls into the Python run-time will be in bright yellow. This lets you easily see whether Cython is able to correctly type your code, or whether there are unexpected problems.


@ -6,6 +6,7 @@ This is a list of everyone who has made significant contributions to spaCy, in a
* Andreas Grivas, [@andreasgrv](https://github.com/andreasgrv)
* Andrew Poliakov, [@pavlin99th](https://github.com/pavlin99th)
* Aniruddha Adhikary [@aniruddha-adhikary](https://github.com/aniruddha-adhikary)
* Ben Eyal, [@beneyal](https://github.com/beneyal)
* Bhargav Srinivasa, [@bhargavvader](https://github.com/bhargavvader)
* Bruno P. Kinoshita, [@kinow](https://github.com/kinow)
* Chris DuBois, [@chrisdubois](https://github.com/chrisdubois)


@ -11,4 +11,5 @@ ujson>=1.35
dill>=0.2,<0.3
requests>=2.13.0,<3.0.0
regex==2017.4.5
ftfy>=4.4.2,<5.0.0
pytest>=3.0.6,<4.0.0


@ -248,7 +248,8 @@ def setup_package():
'ujson>=1.35',
'dill>=0.2,<0.3',
'requests>=2.13.0,<3.0.0',
'regex==2017.4.5'],
'regex==2017.4.5',
'ftfy>=4.4.2,<5.0.0'],
classifiers=[
'Development Status :: 5 - Production/Stable',
'Environment :: Console',


@ -2,6 +2,7 @@
from __future__ import unicode_literals
import json
from ...compat import json_dumps
from ... import util
@ -29,7 +30,8 @@ def conllu2json(input_path, output_path, n_sents=10, use_morphology=False):
output_filename = input_path.parts[-1].replace(".conllu", ".json")
output_file = output_path / output_filename
json.dump(docs, output_file.open('w', encoding='utf-8'), indent=2)
with output_file.open('w', encoding='utf-8') as f:
f.write(json_dumps(docs))
util.print_msg("Created {} documents".format(len(docs)),
title="Generated output file {}".format(output_file))


@ -8,6 +8,7 @@ from pathlib import Path
from preshed.counter import PreshCounter
from ..vocab import write_binary_vectors
from ..compat import fix_text
from .. import util
@ -41,7 +42,7 @@ def create_model(model_path, vectors_path, vocab, oov_prob):
with oov_path.open('w') as f:
f.write('%f' % oov_prob)
if vectors_path:
vectors_dest = model_path / 'vec.bin'
vectors_dest = vocab_path / 'vec.bin'
write_binary_vectors(vectors_path.as_posix(), vectors_dest.as_posix())
@ -76,6 +77,7 @@ def read_clusters(clusters_path):
for line in f:
try:
cluster, word, freq = line.split()
word = fix_text(word)
except ValueError:
continue
# If the clusterer has only seen the word a few times, its


@ -2,6 +2,7 @@
from __future__ import unicode_literals
import six
import ftfy
import sys
import ujson
@ -38,6 +39,9 @@ elif is_python3:
json_dumps = lambda data: ujson.dumps(data, indent=2)
fix_text = lambda text: ftfy.fix_text(text)
def symlink_to(orig, dest):
if is_python2 and is_windows:
import subprocess


@ -304,4 +304,5 @@ TAG_MAP = {
"VERB__VerbForm=Ger": {"morph": "VerbForm=Ger", "pos": "VERB"},
"VERB__VerbForm=Inf": {"morph": "VerbForm=Inf", "pos": "VERB"},
"X___": {"morph": "_", "pos": "X"},
"SP": {"morph": "_", "pos": "SPACE"}
}


@ -13,7 +13,7 @@ from ..symbols import *
import os
import io
import re
import regex as re
def get_exceptions():


@ -1,7 +1,7 @@
# coding: utf8
from __future__ import unicode_literals
import re
import regex as re
from spacy.language_data.punctuation import ALPHA_LOWER, CURRENCY
from ..language_data.tokenizer_exceptions import _URL_PATTERN


@ -1,21 +1,8 @@
# coding: utf8
from __future__ import unicode_literals
import re
_ALPHA_LOWER = """
a ä à á â ǎ æ ã å ā ă ą b c ç ć č ĉ ċ c̄ d ð ď e é è ê ë ė ȅ ȩ ę f g ĝ ğ h i ı
î ï í ī ì ȉ ǐ į ĩ j k ķ l ł ļ m n ñ ń ň ņ o ö ó ò ő ô õ œ ø ō ő ǒ ơ p q r ř ŗ s
ß ś š ş ŝ t ť u ú û ù ú ū ű ǔ ů ų ư v w ŵ x y ÿ ý ŷ z ź ž ż þ
"""
_ALPHA_UPPER = """
A Ä À Á Â Ǎ Æ Ã Å Ā Ă Ą B C Ç Ć Č Ĉ Ċ C̄ D Ð Ď E É È Ê Ë Ė Ȅ Ȩ Ę F G Ĝ Ğ H I İ
Î Ï Í Ī Ì Ȉ Ǐ Į Ĩ J K Ķ L Ł Ļ M N Ñ Ń Ň Ņ O Ö Ó Ò Ő Ô Õ Œ Ø Ō Ő Ǒ Ơ P Q R Ř Ŗ S
Ś Š Ş Ŝ T Ť U Ú Û Ù Ú Ū Ű Ǔ Ů Ų Ư V W Ŵ X Y Ÿ Ý Ŷ Z Ź Ž Ż Þ
"""
import regex as re
re.DEFAULT_VERSION = re.VERSION1
_UNITS = """
@ -57,9 +44,16 @@ LIST_PUNCT = list(_PUNCT.strip().split())
LIST_HYPHENS = list(_HYPHENS.strip().split())
ALPHA_LOWER = _ALPHA_LOWER.strip().replace(' ', '').replace('\n', '')
ALPHA_UPPER = _ALPHA_UPPER.strip().replace(' ', '').replace('\n', '')
ALPHA = ALPHA_LOWER + ALPHA_UPPER
BENGALI = r'[\p{L}&&\p{Bengali}]'
HEBREW = r'[\p{L}&&\p{Hebrew}]'
LATIN_LOWER = r'[\p{Ll}&&\p{Latin}]'
LATIN_UPPER = r'[\p{Lu}&&\p{Latin}]'
LATIN = r'[[\p{Ll}||\p{Lu}]&&\p{Latin}]'
ALPHA_LOWER = '[{}]'.format('||'.join([BENGALI, HEBREW, LATIN_LOWER]))
ALPHA_UPPER = '[{}]'.format('||'.join([BENGALI, HEBREW, LATIN_UPPER]))
ALPHA = '[{}]'.format('||'.join([BENGALI, HEBREW, LATIN]))
QUOTES = _QUOTES.strip().replace(' ', '|')


@ -3,15 +3,21 @@ from __future__ import unicode_literals
import pytest
ABBREVIATION_TESTS = [
('פייתון היא שפת תכנות דינמית', ['פייתון', 'היא', 'שפת', 'תכנות', 'דינמית'])
]
TESTCASES = ABBREVIATION_TESTS
@pytest.mark.parametrize('text,expected_tokens', TESTCASES)
def test_tokenizer_handles_testcases(he_tokenizer, text, expected_tokens):
@pytest.mark.parametrize('text,expected_tokens',
[('פייתון היא שפת תכנות דינמית', ['פייתון', 'היא', 'שפת', 'תכנות', 'דינמית'])])
def test_tokenizer_handles_abbreviation(he_tokenizer, text, expected_tokens):
tokens = he_tokenizer(text)
token_list = [token.text for token in tokens if not token.is_space]
assert expected_tokens == token_list
@pytest.mark.parametrize('text,expected_tokens', [
('עקבת אחריו בכל רחבי המדינה.', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '.']),
('עקבת אחריו בכל רחבי המדינה?', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '?']),
('עקבת אחריו בכל רחבי המדינה!', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '!']),
('עקבת אחריו בכל רחבי המדינה..', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '..']),
('עקבת אחריו בכל רחבי המדינה...', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '...'])])
def test_tokenizer_handles_punct(he_tokenizer, text, expected_tokens):
tokens = he_tokenizer(text)
assert expected_tokens == [token.text for token in tokens]


@ -2,7 +2,7 @@
from __future__ import unicode_literals, print_function
import ujson
import re
import regex as re
from pathlib import Path
import sys
import textwrap