This commit is contained in:
Matthew Honnibal 2017-04-20 17:03:11 +02:00
commit 1b12f342e4
13 changed files with 58 additions and 35 deletions

View File

@@ -87,7 +87,16 @@ Code should loosely follow [pep8](https://www.python.org/dev/peps/pep-0008/). Re
### Python conventions
All Python code must be written in an **intersection of Python 2 and Python 3**. This is easy in Cython, but somewhat ugly in Python. We could use some extra utilities for this. Please pay particular attention to code that serialises json objects.
All Python code must be written in an **intersection of Python 2 and Python 3**. This is easy in Cython, but somewhat ugly in Python. Logic that deals with Python or platform compatibility should only live in [`spacy.compat`](spacy/compat.py). To distinguish them from the builtin functions, replacement functions are suffixed with an underscore, for example `unicode_`. If you need to access the user's version or platform information, for example to show more specific error messages, you can use the `is_config()` helper function.
```python
from .compat import unicode_, json_dumps, is_config
compatible_unicode = unicode_('hello world')
compatible_json = json_dumps({'key': 'value'})
if is_config(windows=True, python2=True):
print("You are using Python 2 on Windows.")
```
Code that interacts with the file-system should accept objects that follow the `pathlib.Path` API, without assuming that the object inherits from `pathlib.Path`. If the function is user-facing and takes a path as an argument, it should check whether the path is provided as a string. Strings should be converted to `pathlib.Path` objects.
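As a rough illustration of this convention, here is a hypothetical helper and a user-facing function that uses it (neither is part of this commit, and the Python 2 string types are ignored for brevity):
```python
from pathlib import Path

def ensure_path(path):
    # Coerce plain strings to Path; leave anything Path-like untouched
    if isinstance(path, str):
        return Path(path)
    return path

def read_lines(path):
    # User-facing: accepts a str or any object that follows the pathlib.Path API
    path = ensure_path(path)
    with path.open('r', encoding='utf-8') as f:
        return f.read().splitlines()
```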
@@ -95,6 +104,8 @@ At the time of writing (v1.7), spaCy's serialization and deserialization functio
Although spaCy uses a lot of classes, inheritance is viewed with some suspicion — it's seen as a mechanism of last resort. You should discuss plans to extend the class hierarchy before implementing.
We have a number of conventions around variable naming that are still being documented, and aren't 100% strict. A general policy is that instances of the class `Doc` should by default be called `doc`, `Token` `token`, `Lexeme` `lex`, `Vocab` `vocab` and `Language` `nlp`. Avoid giving these names to variables of other types. For instance, don't name a text string `doc` — you should usually call this `text`. Two general code style preferences further help with naming. First, lean away from introducing temporary variables, as these clutter your namespace. This is one reason why comprehension expressions are often preferred. Second, keep your functions shortish, so that they can work in a smaller scope. Of course, this is a question of trade-offs.
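A small sketch of these naming and style preferences (the snippet is illustrative and assumes an installed `'en'` model):
```python
import spacy

nlp = spacy.load('en')            # Language instance -> nlp
text = u'This is a sentence.'     # a plain string is `text`, never `doc`
doc = nlp(text)                   # Doc instance -> doc
# Prefer a comprehension over building up a temporary list in a loop
token_texts = [token.text for token in doc]
```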
### Cython conventions
spaCy's core data structures are implemented as [Cython](http://cython.org/) `cdef` classes. Memory is managed through the `cymem.cymem.Pool` class, which allows you to allocate memory which will be freed when the `Pool` object is garbage collected. This means you usually don't have to worry about freeing memory. You just have to decide which Python object owns the memory, and make it own the `Pool`. When that object goes out of scope, the memory will be freed. You do have to take care that no pointers outlive the object that owns them — but this is generally quite easy.
@@ -126,7 +137,7 @@ cdef int c_total(const int* int_array, int length) nogil:
    return total
```
If this is confusing, consider that the compiler couldn't deal with `for item in int_array:` — there's no length attached to a raw pointer, so how could we figure out where to stop? The length is provided in the slice notation as a solution to this. Note that we don't have to declare the type of `item` in the code above -- the compiler can easily infer it. This gives us tidy code that looks quite like Python, but is exactly as fast as C — because we've made sure the compilation to C is trivial.
If this is confusing, consider that the compiler couldn't deal with `for item in int_array:` — there's no length attached to a raw pointer, so how could we figure out where to stop? The length is provided in the slice notation as a solution to this. Note that we don't have to declare the type of `item` in the code above — the compiler can easily infer it. This gives us tidy code that looks quite like Python, but is exactly as fast as C — because we've made sure the compilation to C is trivial.
Your functions cannot be declared `nogil` if they need to create Python objects or call Python functions. This is perfectly okay — you shouldn't torture your code just to get `nogil` functions. However, if your function isn't `nogil`, you should compile your module with `cython -a --cplus my_module.pyx` and open the resulting `my_module.html` file in a browser. This will let you see how Cython is compiling your code. Calls into the Python run-time will be in bright yellow. This lets you easily see whether Cython is able to correctly type your code, or whether there are unexpected problems.

View File

@@ -6,6 +6,7 @@ This is a list of everyone who has made significant contributions to spaCy, in a
* Andreas Grivas, [@andreasgrv](https://github.com/andreasgrv)
* Andrew Poliakov, [@pavlin99th](https://github.com/pavlin99th)
* Aniruddha Adhikary, [@aniruddha-adhikary](https://github.com/aniruddha-adhikary)
* Ben Eyal, [@beneyal](https://github.com/beneyal)
* Bhargav Srinivasa, [@bhargavvader](https://github.com/bhargavvader)
* Bruno P. Kinoshita, [@kinow](https://github.com/kinow)
* Chris DuBois, [@chrisdubois](https://github.com/chrisdubois)

View File

@@ -11,4 +11,5 @@ ujson>=1.35
dill>=0.2,<0.3
requests>=2.13.0,<3.0.0
regex==2017.4.5
ftfy>=4.4.2,<5.0.0
pytest>=3.0.6,<4.0.0

View File

@@ -248,7 +248,8 @@ def setup_package():
'ujson>=1.35',
'dill>=0.2,<0.3',
'requests>=2.13.0,<3.0.0',
'regex==2017.4.5'],
'regex==2017.4.5',
'ftfy>=4.4.2,<5.0.0'],
classifiers=[
'Development Status :: 5 - Production/Stable',
'Environment :: Console',

View File

@@ -2,6 +2,7 @@
from __future__ import unicode_literals
import json
from ...compat import json_dumps
from ... import util
@@ -29,7 +30,8 @@ def conllu2json(input_path, output_path, n_sents=10, use_morphology=False):
output_filename = input_path.parts[-1].replace(".conllu", ".json")
output_file = output_path / output_filename
json.dump(docs, output_file.open('w', encoding='utf-8'), indent=2)
with output_file.open('w', encoding='utf-8') as f:
    f.write(json_dumps(docs))
util.print_msg("Created {} documents".format(len(docs)),
title="Generated output file {}".format(output_file))

View File

@@ -8,6 +8,7 @@ from pathlib import Path
from preshed.counter import PreshCounter
from ..vocab import write_binary_vectors
from ..compat import fix_text
from .. import util
@@ -41,7 +42,7 @@ def create_model(model_path, vectors_path, vocab, oov_prob):
with oov_path.open('w') as f:
    f.write('%f' % oov_prob)
if vectors_path:
    vectors_dest = model_path / 'vec.bin'
    vectors_dest = vocab_path / 'vec.bin'
    write_binary_vectors(vectors_path.as_posix(), vectors_dest.as_posix())
@@ -76,6 +77,7 @@ def read_clusters(clusters_path):
for line in f:
    try:
        cluster, word, freq = line.split()
        word = fix_text(word)
    except ValueError:
        continue
    # If the clusterer has only seen the word a few times, its
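The `fix_text()` call added above comes from `ftfy` (wrapped by `spacy.compat`); it repairs mojibake and other Unicode damage in the cluster words. A quick illustration (the example string is mine):
```python
import ftfy

# 'schön' encoded as UTF-8 but decoded as Latin-1 yields the mojibake below;
# ftfy detects and reverses this kind of damage
print(ftfy.fix_text(u'schÃ¶n'))   # -> schön
```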

View File

@@ -2,6 +2,7 @@
from __future__ import unicode_literals
import six
import ftfy
import sys
import ujson
@@ -38,6 +39,9 @@ elif is_python3:
    json_dumps = lambda data: ujson.dumps(data, indent=2)
    fix_text = lambda text: ftfy.fix_text(text)
def symlink_to(orig, dest):
    if is_python2 and is_windows:
        import subprocess

View File

@@ -304,4 +304,5 @@ TAG_MAP = {
"VERB__VerbForm=Ger": {"morph": "VerbForm=Ger", "pos": "VERB"},
"VERB__VerbForm=Inf": {"morph": "VerbForm=Inf", "pos": "VERB"},
"X___": {"morph": "_", "pos": "X"},
"SP": {"morph": "_", "pos": "SPACE"}
}

View File

@@ -13,7 +13,7 @@ from ..symbols import *
import os
import io
import re
import regex as re
def get_exceptions():

View File

@@ -1,7 +1,7 @@
# coding: utf8
from __future__ import unicode_literals
import re
import regex as re
from spacy.language_data.punctuation import ALPHA_LOWER, CURRENCY
from ..language_data.tokenizer_exceptions import _URL_PATTERN

View File

@@ -1,21 +1,8 @@
# coding: utf8
from __future__ import unicode_literals
import re
_ALPHA_LOWER = """
a ä à á â ǎ æ ã å ā ă ą b c ç ć č ĉ ċ c̄ d ð ď e é è ê ë ė ȅ ȩ ę f g ĝ ğ h i ı
î ï í ī ì ȉ ǐ į ĩ j k ķ l ł ļ m n ñ ń ň ņ o ö ó ò ő ô õ œ ø ō ő ǒ ơ p q r ř ŗ s
ß ś š ş ŝ t ť u ú û ù ú ū ű ǔ ů ų ư v w ŵ x y ÿ ý ŷ z ź ž ż þ
"""
_ALPHA_UPPER = """
A Ä À Á Â Ǎ Æ Ã Å Ā Ă Ą B C Ç Ć Č Ĉ Ċ C̄ D Ð Ď E É È Ê Ë Ė Ȅ Ȩ Ę F G Ĝ Ğ H I İ
Î Ï Í Ī Ì Ȉ Ǐ Į Ĩ J K Ķ L Ł Ļ M N Ñ Ń Ň Ņ O Ö Ó Ò Ő Ô Õ Œ Ø Ō Ő Ǒ Ơ P Q R Ř Ŗ S
Ś Š Ş Ŝ T Ť U Ú Û Ù Ú Ū Ű Ǔ Ů Ų Ư V W Ŵ X Y Ÿ Ý Ŷ Z Ź Ž Ż Þ
"""
import regex as re
re.DEFAULT_VERSION = re.VERSION1
_UNITS = """
@@ -57,9 +44,16 @@ LIST_PUNCT = list(_PUNCT.strip().split())
LIST_HYPHENS = list(_HYPHENS.strip().split())
ALPHA_LOWER = _ALPHA_LOWER.strip().replace(' ', '').replace('\n', '')
ALPHA_UPPER = _ALPHA_UPPER.strip().replace(' ', '').replace('\n', '')
ALPHA = ALPHA_LOWER + ALPHA_UPPER
BENGALI = r'[\p{L}&&\p{Bengali}]'
HEBREW = r'[\p{L}&&\p{Hebrew}]'
LATIN_LOWER = r'[\p{Ll}&&\p{Latin}]'
LATIN_UPPER = r'[\p{Lu}&&\p{Latin}]'
LATIN = r'[[\p{Ll}||\p{Lu}]&&\p{Latin}]'
ALPHA_LOWER = '[{}]'.format('||'.join([BENGALI, HEBREW, LATIN_LOWER]))
ALPHA_UPPER = '[{}]'.format('||'.join([BENGALI, HEBREW, LATIN_UPPER]))
ALPHA = '[{}]'.format('||'.join([BENGALI, HEBREW, LATIN]))
QUOTES = _QUOTES.strip().replace(' ', '|')
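The character classes above rely on the `regex` package's VERSION1 set operations: `&&` intersects classes and `||` unions them, which the standard-library `re` module cannot do. A small sketch of how such patterns behave, with the Hebrew class omitted for brevity (the test strings are mine):
```python
import regex as re

re.DEFAULT_VERSION = re.VERSION1

BENGALI = r'[\p{L}&&\p{Bengali}]'
LATIN_LOWER = r'[\p{Ll}&&\p{Latin}]'
ALPHA_LOWER = '[{}]'.format('||'.join([BENGALI, LATIN_LOWER]))

assert re.match(ALPHA_LOWER, 'a')         # Latin lowercase letter matches
assert re.match(ALPHA_LOWER, u'\u0985')   # Bengali letter A matches
assert not re.match(ALPHA_LOWER, 'A')     # Latin uppercase is excluded
assert not re.match(ALPHA_LOWER, '3')     # digits are excluded
```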

View File

@@ -3,15 +3,21 @@ from __future__ import unicode_literals
import pytest
ABBREVIATION_TESTS = [
('פייתון היא שפת תכנות דינמית', ['פייתון', 'היא', 'שפת', 'תכנות', 'דינמית'])
]
TESTCASES = ABBREVIATION_TESTS
@pytest.mark.parametrize('text,expected_tokens', TESTCASES)
def test_tokenizer_handles_testcases(he_tokenizer, text, expected_tokens):
@pytest.mark.parametrize('text,expected_tokens',
[('פייתון היא שפת תכנות דינמית', ['פייתון', 'היא', 'שפת', 'תכנות', 'דינמית'])])
def test_tokenizer_handles_abbreviation(he_tokenizer, text, expected_tokens):
    tokens = he_tokenizer(text)
    token_list = [token.text for token in tokens if not token.is_space]
    assert expected_tokens == token_list
    assert expected_tokens == token_list
@pytest.mark.parametrize('text,expected_tokens', [
('עקבת אחריו בכל רחבי המדינה.', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '.']),
('עקבת אחריו בכל רחבי המדינה?', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '?']),
('עקבת אחריו בכל רחבי המדינה!', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '!']),
('עקבת אחריו בכל רחבי המדינה..', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '..']),
('עקבת אחריו בכל רחבי המדינה...', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '...'])])
def test_tokenizer_handles_punct(he_tokenizer, text, expected_tokens):
    tokens = he_tokenizer(text)
    assert expected_tokens == [token.text for token in tokens]

View File

@@ -2,7 +2,7 @@
from __future__ import unicode_literals, print_function
import ujson
import re
import regex as re
from pathlib import Path
import sys
import textwrap