commit 3bc498e9c9: Merge branch 'master' of https://github.com/honnibal/spaCy
@@ -45,28 +45,27 @@ install:
   # Upgrade to the latest version of pip to avoid it displaying warnings
   # about it being out of date.
-  - "pip install --disable-pip-version-check --user --upgrade pip"
+  - "pip install --disable-pip-version-check --user -U pip"

   # Install the build dependencies of the project. If some dependencies contain
   # compiled extensions and are not provided as pre-built wheel packages,
   # pip will build them from source using the MSVC compiler matching the
   # target Python version and architecture
-  - "pip install --upgrade setuptools"
-  - "%CMD_IN_ENV% pip install cython fabric fabtools"
-  - "%CMD_IN_ENV% pip install -r requirements.txt"
+  - "%CMD_IN_ENV% python build.py prepare"

 build_script:
   # Build the compiled extension
   - "%CMD_IN_ENV% python setup.py build_ext --inplace"
   - ps: appveyor\download.ps1
   - "tar -xzf corpora/en/wordnet.tar.gz"
-  - "%CMD_IN_ENV% python bin/init_model.py en lang_data/ corpora/ spacy/en/data"
+  - "%CMD_IN_ENV% python bin/init_model.py en lang_data/ corpora/ data"
+  - "cp package.json data"
+  - "%CMD_IN_ENV% sputnik build data en_default.sputnik"
+  - "%CMD_IN_ENV% sputnik install en_default.sputnik"

 test_script:
   # Run the project tests
-  - "pip install pytest"
-  - "%CMD_IN_ENV% py.test spacy/ -x"
+  - "%CMD_IN_ENV% python build.py test"

 after_test:
   # If tests are successful, create binary packages for the project.
26  .travis.yml

@@ -1,14 +1,21 @@
 language: python

+sudo: required
+dist: precise
+group: edge
+
+python:
+  - "2.7"
+  - "3.4"
+
 os:
   - linux

-python:
-  - "2.7"
-  - "3.4"
-  - "3.5"
+env:
+  - PIP_DATE=2015-10-01 MODE=pip
+  - PIP_DATE=2015-10-01 MODE=setup-install
+  - PIP_DATE=2015-10-01 MODE=setup-develop

-# install dependencies
 install:
   - "pip install --upgrade setuptools"
   - "pip install cython fabric fabtools"

@@ -21,8 +28,11 @@ install:
   - "mv WordNet-3.0 wordnet"
   - "cd ../../"
   - "export PYTHONPATH=`pwd`"
-  - "python bin/init_model.py en lang_data/ corpora/ spacy/en/data"
+  - "python bin/init_model.py en lang_data/ corpora/ data"
+  - "cp package.json data"
+  - "sputnik build data en_default.sputnik"
+  - "sputnik install en_default.sputnik"

-# run tests
 script:
-  - "py.test spacy/ -x"
+  - python build.py $MODE;
+  - python build.py test
@@ -0,0 +1 @@
+recursive-include include *.h
@@ -1,14 +1,14 @@
 Python 2.7 Windows build has been tested with the following toolchain:
 - Python 2.7.10 :)
 - Microsoft Visual C++ Compiler Package for Python 2.7 http://www.microsoft.com/en-us/download/details.aspx?id=44266
-- C99 compliant stdint.h for MSVC http://msinttypes.googlecode.com/svn/trunk/stdint.h
+- C99 compliant stdint.h for MSVC https://msinttypes.googlecode.com/svn/trunk/stdint.h
   (C99 compliant stdint.h header which is not supplied with the Microsoft Visual C++ compiler prior to MSVC 2010)

 Build steps:
 - pip install --upgrade setuptools
 - pip install cython fabric fabtools
 - pip install -r requirements.txt
-- python setup.py build_ext --inplace
+- python pip install -e .

 If you are using a traditional Microsoft SDK (v7.0 for Python 2.x or v7.1 for Python 3.x), consider run_with_env.cmd from the appveyor folder (submodule) as a guideline for environment setup.
 It can also be used as a shell configuration script for your build, install and run commands, i.e.: cmd /E:ON /V:ON /C run_with_env.cmd <your command>
@@ -45,9 +45,6 @@ Supports
 * OSX
 * Linux
 * Cygwin

-Want to support:
-
 * Visual Studio

 Difficult to support:
199  bin/cythonize.py  Executable file

@@ -0,0 +1,199 @@
#!/usr/bin/env python
""" cythonize

Cythonize pyx files into C files as needed.

Usage: cythonize [root_dir]

Default [root_dir] is 'spacy'.

Checks pyx files to see if they have been changed relative to their
corresponding C files. If they have, then runs cython on these files to
recreate the C files.

The script thinks that the pyx files have changed relative to the C files
by comparing hashes stored in a database file.

Simple script to invoke Cython (and Tempita) on all .pyx (.pyx.in)
files; while waiting for a proper build system. Uses file hashes to
figure out if rebuild is needed.

For now, this script should be run by developers when changing Cython files
only, and the resulting C files checked in, so that end-users (and Python-only
developers) do not get the Cython/Tempita dependencies.

Originally written by Dag Sverre Seljebotn, and copied here from:

https://raw.github.com/dagss/private-scipy-refactor/cythonize/cythonize.py

Note: this script does not check any of the dependent C libraries; it only
operates on the Cython .pyx files.
"""

from __future__ import division, print_function, absolute_import

import os
import re
import sys
import hashlib
import subprocess

HASH_FILE = 'cythonize.dat'
DEFAULT_ROOT = 'spacy'
VENDOR = 'spaCy'

# WindowsError is not defined on unix systems
try:
    WindowsError
except NameError:
    WindowsError = None

#
# Rules
#
def process_pyx(fromfile, tofile):
    try:
        from Cython.Compiler.Version import version as cython_version
        from distutils.version import LooseVersion
        if LooseVersion(cython_version) < LooseVersion('0.19'):
            raise Exception('Building %s requires Cython >= 0.19' % VENDOR)

    except ImportError:
        pass

    flags = ['--fast-fail']
    if tofile.endswith('.cpp'):
        flags += ['--cplus']

    try:
        try:
            r = subprocess.call(['cython'] + flags + ["-o", tofile, fromfile])
            if r != 0:
                raise Exception('Cython failed')
        except OSError:
            # There are ways of installing Cython that don't result in a cython
            # executable on the path, see gh-2397.
            r = subprocess.call([sys.executable, '-c',
                                 'import sys; from Cython.Compiler.Main import '
                                 'setuptools_main as main; sys.exit(main())'] + flags +
                                 ["-o", tofile, fromfile])
            if r != 0:
                raise Exception('Cython failed')
    except OSError:
        raise OSError('Cython needs to be installed')


def process_tempita_pyx(fromfile, tofile):
    try:
        try:
            from Cython import Tempita as tempita
        except ImportError:
            import tempita
    except ImportError:
        raise Exception('Building %s requires Tempita: '
                        'pip install --user Tempita' % VENDOR)
    with open(fromfile, "r") as f:
        tmpl = f.read()
    pyxcontent = tempita.sub(tmpl)
    assert fromfile.endswith('.pyx.in')
    pyxfile = fromfile[:-len('.pyx.in')] + '.pyx'
    with open(pyxfile, "w") as f:
        f.write(pyxcontent)
    process_pyx(pyxfile, tofile)


rules = {
    # fromext : function
    '.pyx' : process_pyx,
    '.pyx.in' : process_tempita_pyx
    }

#
# Hash db
#
def load_hashes(filename):
    # Return { filename : (sha1 of input, sha1 of output) }
    if os.path.isfile(filename):
        hashes = {}
        with open(filename, 'r') as f:
            for line in f:
                filename, inhash, outhash = line.split()
                hashes[filename] = (inhash, outhash)
    else:
        hashes = {}
    return hashes


def save_hashes(hash_db, filename):
    with open(filename, 'w') as f:
        for key, value in sorted(hash_db.items()):
            f.write("%s %s %s\n" % (key, value[0], value[1]))


def sha1_of_file(filename):
    h = hashlib.sha1()
    with open(filename, "rb") as f:
        h.update(f.read())
    return h.hexdigest()

#
# Main program
#

def normpath(path):
    path = path.replace(os.sep, '/')
    if path.startswith('./'):
        path = path[2:]
    return path


def get_hash(frompath, topath):
    from_hash = sha1_of_file(frompath)
    to_hash = sha1_of_file(topath) if os.path.exists(topath) else None
    return (from_hash, to_hash)


def process(path, fromfile, tofile, processor_function, hash_db):
    fullfrompath = os.path.join(path, fromfile)
    fulltopath = os.path.join(path, tofile)
    current_hash = get_hash(fullfrompath, fulltopath)
    if current_hash == hash_db.get(normpath(fullfrompath), None):
        print('%s has not changed' % fullfrompath)
        return

    orig_cwd = os.getcwd()
    try:
        os.chdir(path)
        print('Processing %s' % fullfrompath)
        processor_function(fromfile, tofile)
    finally:
        os.chdir(orig_cwd)
    # changed target file, recompute hash
    current_hash = get_hash(fullfrompath, fulltopath)
    # store hash in db
    hash_db[normpath(fullfrompath)] = current_hash


def find_process_files(root_dir):
    hash_db = load_hashes(HASH_FILE)
    for cur_dir, dirs, files in os.walk(root_dir):
        for filename in files:
            in_file = os.path.join(cur_dir, filename + ".in")
            if filename.endswith('.pyx') and os.path.isfile(in_file):
                continue
            for fromext, function in rules.items():
                if filename.endswith(fromext):
                    toext = ".cpp"
                    # with open(os.path.join(cur_dir, filename), 'rb') as f:
                    #     data = f.read()
                    #     m = re.search(br"^\s*#\s*distutils:\s*language\s*=\s*c\+\+\s*$", data, re.I|re.M)
                    #     if m:
                    #         toext = ".cxx"
                    fromfile = filename
                    tofile = filename[:-len(fromext)] + toext
                    process(cur_dir, fromfile, tofile, function, hash_db)
                    save_hashes(hash_db, HASH_FILE)


def main():
    try:
        root_dir = sys.argv[1]
    except IndexError:
        root_dir = DEFAULT_ROOT
    find_process_files(root_dir)


if __name__ == '__main__':
    main()
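The helper above is driven from the rewritten setup.py later in this commit; a minimal, assumption-labelled sketch of invoking it by hand (working directory is a checkout of this repository):

# Sketch only: regenerate the C++ sources after editing .pyx files, the same
# call setup.py's generate_cython() makes. Unchanged files are skipped via the
# hash database the script writes to ./cythonize.dat.
import subprocess
import sys

subprocess.check_call([sys.executable, 'bin/cythonize.py', 'spacy'])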
71  build.py  Normal file

@@ -0,0 +1,71 @@
#!/usr/bin/env python
from __future__ import print_function
import os
import sys
import shutil
from subprocess import call


def x(cmd):
    print('$ '+cmd)
    res = call(cmd, shell=True)
    if res != 0:
        sys.exit(res)


if len(sys.argv) < 2:
    print('usage: %s <install-mode> [<pip-date>]')
    sys.exit(1)


install_mode = sys.argv[1]


if install_mode == 'prepare':
    x('python pip-clear.py')

    pip_date = len(sys.argv) > 2 and sys.argv[2]
    if pip_date:
        x('python pip-date.py %s pip setuptools wheel six' % pip_date)

    x('pip install -r requirements.txt')
    x('pip list')


elif install_mode == 'pip':
    if os.path.exists('dist'):
        shutil.rmtree('dist')
    x('python setup.py sdist')
    x('python pip-clear.py')

    filenames = os.listdir('dist')
    assert len(filenames) == 1
    x('pip list')
    x('pip install dist/%s' % filenames[0])


elif install_mode == 'setup-install':
    x('python setup.py install')


elif install_mode == 'setup-develop':
    x('pip install -e .')


elif install_mode == 'test':
    x('pip install pytest')
    x('pip list')

    if os.path.exists('tmp'):
        shutil.rmtree('tmp')
    os.mkdir('tmp')

    try:
        old = os.getcwd()
        os.chdir('tmp')

        x('python -m spacy.en.download')
        x('python -m pytest --tb="native" ../spacy/ -x --models --vectors --slow')

    finally:
        os.chdir(old)
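The MODE values exercised by the new .travis.yml map directly onto this dispatcher; a hedged sketch of replaying the CI steps locally (the 2015-10-01 date comes from the PIP_DATE variable above, and the choice of steps is illustrative rather than prescribed by the commit):

# Assumption: run from the repository root inside a disposable virtualenv,
# since 'prepare' and 'pip' wipe installed packages via pip-clear.py.
import subprocess

for step in ('python build.py prepare 2015-10-01',  # pin pip/setuptools/wheel/six, install requirements.txt
             'python build.py pip',                 # build an sdist, clear site-packages, install the sdist
             'python build.py test'):               # download the en model and run pytest from ./tmp
    subprocess.check_call(step, shell=True)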
89  examples/pos_tag.py  Normal file

@@ -0,0 +1,89 @@
'''Print part-of-speech tagged, true-cased, (very roughly) sentence-separated
text, with each "sentence" on a newline, and spaces between tokens. Supports
multi-processing.
'''
from __future__ import print_function, unicode_literals, division
import io
import bz2
import logging
from toolz import partition
from os import path

import spacy.en

from joblib import Parallel, delayed
import plac
import ujson


def parallelize(func, iterator, n_jobs, extra):
    extra = tuple(extra)
    return Parallel(n_jobs=n_jobs)(delayed(func)(*(item + extra)) for item in iterator)


def iter_texts_from_json_bz2(loc):
    '''
    Iterator of unicode strings, one per document (here, a comment).

    Expects a path to a BZ2 file, which should be new-line delimited JSON. The
    document text should be in a string field titled 'body'.

    This is the data format of the Reddit comments corpus.
    '''
    with bz2.BZ2File(loc) as file_:
        for i, line in enumerate(file_):
            yield ujson.loads(line)['body']


def transform_texts(batch_id, input_, out_dir):
    out_loc = path.join(out_dir, '%d.txt' % batch_id)
    if path.exists(out_loc):
        return None
    print('Batch', batch_id)
    nlp = spacy.en.English(parser=False, entity=False)
    with io.open(out_loc, 'w', encoding='utf8') as file_:
        for text in input_:
            doc = nlp(text)
            file_.write(' '.join(represent_word(w) for w in doc if not w.is_space))
            file_.write('\n')


def represent_word(word):
    text = word.text
    # True-case, i.e. try to normalize sentence-initial capitals.
    # Only do this if the lower-cased form is more probable.
    if text.istitle() \
    and is_sent_begin(word) \
    and word.prob < word.vocab[text.lower()].prob:
        text = text.lower()
    return text + '|' + word.tag_


def is_sent_begin(word):
    # It'd be nice to have some heuristics like these in the library, for these
    # times where we don't care so much about accuracy of SBD, and we don't want
    # to parse
    if word.i == 0:
        return True
    elif word.i >= 2 and word.nbor(-1).text in ('.', '!', '?', '...'):
        return True
    else:
        return False


@plac.annotations(
    in_loc=("Location of input file"),
    out_dir=("Location of input file"),
    n_workers=("Number of workers", "option", "n", int),
    batch_size=("Batch-size for each process", "option", "b", int)
)
def main(in_loc, out_dir, n_workers=4, batch_size=100000):
    if not path.exists(out_dir):
        path.join(out_dir)
    texts = partition(batch_size, iter_texts(in_loc))
    parallelize(transform_texts, enumerate(texts), n_workers, [out_dir])


if __name__ == '__main__':
    plac.call(main)
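Two details in main() as committed look like slips rather than intent: path.join(out_dir) discards its result where a directory presumably needs to be created, and iter_texts() is not defined anywhere in the file (iter_texts_from_json_bz2() is). A hedged corrected version, not part of the commit:

# Assumes an extra 'import os' at the top of the script; everything else is
# the same names the example above already defines.
def main(in_loc, out_dir, n_workers=4, batch_size=100000):
    if not path.exists(out_dir):
        os.makedirs(out_dir)  # assumption: the intent was to create the output directory
    texts = partition(batch_size, iter_texts_from_json_bz2(in_loc))
    parallelize(transform_texts, enumerate(texts), n_workers, [out_dir])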
46  fabfile.py  vendored

@@ -74,7 +74,6 @@ def web():
     jade('home/index.jade', '')
     jade('docs/index.jade', 'docs/')
     jade('blog/index.jade', 'blog/')
-    jade('tutorials/index.jade', 'tutorials/')

     for collection in ('blog', 'tutorials'):
         for post_dir in (Path(__file__).parent / 'website' / 'src' / 'jade' / collection).iterdir():

@@ -85,7 +84,50 @@ def web():


 def web_publish(assets_path):
-    local('aws s3 sync --delete --exclude "resources/*" website/site/ s3://spacy.io')
+    from boto.s3.connection import S3Connection, OrdinaryCallingFormat
+
+    site_path = 'website/site'
+
+    os.environ['S3_USE_SIGV4'] = 'True'
+    conn = S3Connection(host='s3.eu-central-1.amazonaws.com',
+                        calling_format=OrdinaryCallingFormat())
+    bucket = conn.get_bucket('spacy.io', validate=False)
+
+    keys_left = set([k.name for k in bucket.list()
+                     if not k.name.startswith('resources')])
+
+    for root, dirnames, filenames in os.walk(site_path):
+        for dirname in dirnames:
+            target = os.path.relpath(os.path.join(root, dirname), site_path)
+            source = os.path.join(target, 'index.html')
+
+            if os.path.exists(os.path.join(root, dirname, 'index.html')):
+                key = bucket.new_key(source)
+                key.set_redirect('//%s/%s' % (bucket.name, target))
+                print('adding redirect for %s' % target)
+
+                keys_left.remove(source)
+
+        for filename in filenames:
+            source = os.path.join(root, filename)
+
+            target = os.path.relpath(root, site_path)
+            if target == '.':
+                target = filename
+            elif filename != 'index.html':
+                target = os.path.join(target, filename)
+
+            key = bucket.new_key(target)
+            key.set_metadata('Content-Type', 'text/html')
+            key.set_contents_from_filename(source)
+            print('uploading %s' % target)
+
+            keys_left.remove(target)
+
+    for key_name in keys_left:
+        print('deleting %s' % key_name)
+        bucket.delete_key(key_name)
+
     local('aws s3 sync --delete %s s3://spacy.io/resources' % assets_path)
17  package.json  Normal file

@@ -0,0 +1,17 @@
{
    "name": "en_default",
    "version": "0.100.0",
    "description": "english default model",
    "license": "public domain",
    "include": [
        "deps/*",
        "ner/*",
        "pos/*",
        "tokenizer/*",
        "vocab/*",
        "wordnet/*"
    ],
    "compatibility": {
        "spacy": "==0.100.0"
    }
}
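This manifest travels with the model data: the CI steps added earlier in this commit copy it into the freshly built data directory and then hand that directory to sputnik. A sketch of the same flow as a standalone script (it assumes lang_data/ and corpora/ have already been fetched, as the CI files arrange, and that sputnik picks the manifest up from data/package.json):

# Mirrors the appveyor.yml / .travis.yml steps added in this commit.
import subprocess

for cmd in ('python bin/init_model.py en lang_data/ corpora/ data',
            'cp package.json data',
            'sputnik build data en_default.sputnik',
            'sputnik install en_default.sputnik'):
    subprocess.check_call(cmd, shell=True)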
28  pip-clear.py  Executable file

@@ -0,0 +1,28 @@
#!/usr/bin/env python
from __future__ import print_function

from pip.commands.uninstall import UninstallCommand
from pip import get_installed_distributions


packages = []
for package in get_installed_distributions():
    if package.location.endswith('dist-packages'):
        continue
    elif package.project_name in ('pip', 'setuptools'):
        continue
    packages.append(package.project_name)


if packages:
    pip = UninstallCommand()
    options, args = pip.parse_args(packages)
    options.yes = True

    try:
        pip.run(options, args)
    except OSError as e:
        if e.errno != 13:
            raise e
        print("You lack permissions to uninstall this package. Perhaps run with sudo? Exiting.")
        exit(13)
96  pip-date.py  Normal file

@@ -0,0 +1,96 @@
#!/usr/bin/env python
from __future__ import print_function
import json
import re
import sys
from bisect import bisect
from datetime import datetime
from datetime import timedelta
import ssl

try:
    from urllib.request import Request, build_opener, HTTPSHandler, URLError
except ImportError:
    from urllib2 import Request, build_opener, HTTPSHandler, URLError

from pip.commands.uninstall import UninstallCommand
from pip.commands.install import InstallCommand
from pip import get_installed_distributions


def get_releases(package_name):
    url = 'https://pypi.python.org/pypi/%s/json' % package_name

    ssl_context = HTTPSHandler(
        context=ssl.SSLContext(ssl.PROTOCOL_TLSv1))
    opener = build_opener(ssl_context)

    retries = 10
    while retries > 0:
        try:
            r = opener.open(Request(url))
            break
        except URLError:
            retries -= 1

    return json.loads(r.read().decode('utf8'))['releases']


def parse_iso8601(s):
    return datetime(*map(int, re.split('[^\d]', s)))


def select_version(select_date, package_name):
    versions = []
    for version, dists in get_releases(package_name).items():
        date = [parse_iso8601(d['upload_time']) for d in dists]
        if date:
            versions.append((sorted(date)[0], version))

    versions = sorted(versions)
    min_date = versions[0][0]
    if select_date < min_date:
        raise Exception('invalid select_date: %s, must be '
                        '%s or newer.' % (select_date, min_date))

    return versions[bisect([x[0] for x in versions], select_date) - 1][1]


installed_packages = [
    package.project_name
    for package in
    get_installed_distributions()
    if (not package.location.endswith('dist-packages') and
        package.project_name not in ('pip', 'setuptools'))
]

if installed_packages:
    pip = UninstallCommand()
    options, args = pip.parse_args(installed_packages)
    options.yes = True

    try:
        pip.run(options, args)
    except OSError as e:
        if e.errno != 13:
            raise e
        print("You lack permissions to uninstall this package. Perhaps run with sudo? Exiting.")
        exit(13)


date = parse_iso8601(sys.argv[1])
packages = {p: select_version(date, p) for p in sys.argv[2:]}
args = ['=='.join(a) for a in packages.items()]

cmd = InstallCommand()
options, args = cmd.parse_args(args)
options.ignore_installed = True
options.force_reinstall = True

try:
    print(cmd.run(options, args))
except OSError as e:
    if e.errno != 13:
        raise e
    print("You lack permissions to uninstall this package. Perhaps run with sudo? Exiting.")
    exit(13)
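select_version() above resolves "newest release uploaded on or before a date" with a bisect over upload times; a self-contained toy illustration of that rule (the package's release dates and version numbers here are invented for the example):

from bisect import bisect
from datetime import datetime

# Hypothetical release history for some package.
releases = [(datetime(2015, 6, 1), '1.9.0'),
            (datetime(2015, 9, 15), '1.10.0'),
            (datetime(2015, 11, 2), '1.10.1')]

select_date = datetime(2015, 10, 1)   # the PIP_DATE used in .travis.yml
chosen = releases[bisect([d for d, _ in releases], select_date) - 1][1]
print(chosen)  # -> '1.10.0': the newest release uploaded on or before the date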
@@ -1,13 +1,13 @@
 cython
 cymem == 1.30
 pathlib
-preshed == 0.44
+preshed == 0.46.1
-thinc == 4.0
+thinc == 4.1.0
-murmurhash == 0.24
+murmurhash == 0.26
 text-unidecode
 numpy
 plac
 six
 ujson
 cloudpickle
-sputnik == 0.5.2
+sputnik == 0.6.4
426  setup.py

@@ -1,231 +1,281 @@
--- old setup.py (removed)

#!/usr/bin/env python
from setuptools import setup
import shutil

import sys
import os
from os import path

from setuptools import Extension
from distutils import sysconfig
from distutils.core import setup, Extension
from distutils.command.build_ext import build_ext

import platform

PACKAGE_DATA = {
    "spacy": ["*.pxd"],
    "spacy.tokens": ["*.pxd"],
    "spacy.serialize": ["*.pxd"],
    "spacy.syntax": ["*.pxd"],
    "spacy.en": [
        "*.pxd",
        "data/wordnet/*.exc",
        "data/wordnet/index.*",
        "data/tokenizer/*",
        "data/vocab/serializer.json"
    ]
}

# By subclassing build_extensions we have the actual compiler that will be used which is really known only after finalize_options
# http://stackoverflow.com/questions/724664/python-distutils-how-to-get-a-compiler-that-is-going-to-be-used
compile_options = {'msvc' : ['/Ox', '/EHsc'] ,
                   'other' : ['-O3', '-Wno-strict-prototypes', '-Wno-unused-function'] }
link_options = {'msvc' : [] ,
                'other' : [] }

class build_ext_options:
    def build_options(self):
        c_type = None
        if self.compiler.compiler_type in compile_options:
            c_type = self.compiler.compiler_type
        elif 'other' in compile_options:
            c_type = 'other'
        if c_type is not None:
            for e in self.extensions:
                e.extra_compile_args = compile_options[c_type]

        l_type = None
        if self.compiler.compiler_type in link_options:
            l_type = self.compiler.compiler_type
        elif 'other' in link_options:
            l_type = 'other'
        if l_type is not None:
            for e in self.extensions:
                e.extra_link_args = link_options[l_type]

class build_ext_subclass( build_ext, build_ext_options ):
    def build_extensions(self):
        build_ext_options.build_options(self)
        build_ext.build_extensions(self)


# PyPy --- NB! PyPy doesn't really work, it segfaults all over the place. But,
# this is necessary to get it compile.
# We have to resort to monkey-patching to set the compiler, because pypy broke
# all the everything.

pre_patch_customize_compiler = sysconfig.customize_compiler
def my_customize_compiler(compiler):
    pre_patch_customize_compiler(compiler)
    compiler.compiler_cxx = ['c++']


if platform.python_implementation() == 'PyPy':
    sysconfig.customize_compiler = my_customize_compiler


#def install_headers():
#    dest_dir = path.join(sys.prefix, 'include', 'murmurhash')
#    if not path.exists(dest_dir):
#        shutil.copytree('murmurhash/headers/murmurhash', dest_dir)
#
#    dest_dir = path.join(sys.prefix, 'include', 'numpy')


includes = ['.', path.join(sys.prefix, 'include')]


try:
    import numpy
    numpy_headers = path.join(numpy.get_include(), 'numpy')
    shutil.copytree(numpy_headers, path.join(sys.prefix, 'include', 'numpy'))
except ImportError:
    pass
except OSError:
    pass


def clean(mod_names):
    for name in mod_names:
        name = name.replace('.', '/')
        so = name + '.so'
        html = name + '.html'
        cpp = name + '.cpp'
        c = name + '.c'
        for file_path in [so, html, cpp, c]:
            if os.path.exists(file_path):
                os.unlink(file_path)


def name_to_path(mod_name, ext):
    return '%s.%s' % (mod_name.replace('.', '/'), ext)


def c_ext(mod_name, language, includes):
    mod_path = name_to_path(mod_name, language)
    return Extension(mod_name, [mod_path], include_dirs=includes)


def cython_setup(mod_names, language, includes):
    import Cython.Distutils
    import Cython.Build
    import distutils.core

    class build_ext_cython_subclass( Cython.Distutils.build_ext, build_ext_options ):
        def build_extensions(self):
            build_ext_options.build_options(self)
            Cython.Distutils.build_ext.build_extensions(self)

    if language == 'cpp':
        language = 'c++'
    exts = []
    for mod_name in mod_names:
        mod_path = mod_name.replace('.', '/') + '.pyx'
        e = Extension(mod_name, [mod_path], language=language, include_dirs=includes)
        exts.append(e)
    distutils.core.setup(
        name='spacy',
        packages=['spacy', 'spacy.tokens', 'spacy.en', 'spacy.serialize',
                  'spacy.syntax', 'spacy.munge'],
        description="Industrial-strength NLP",
        author='Matthew Honnibal',
        author_email='honnibal@gmail.com',
        version=VERSION,
        url="http://spacy.io",
        package_data=PACKAGE_DATA,
        ext_modules=exts,
        cmdclass={'build_ext': build_ext_cython_subclass},
        license="MIT",
    )


def run_setup(exts):
    setup(
        name='spacy',
        packages=['spacy', 'spacy.tokens', 'spacy.en', 'spacy.serialize',
                  'spacy.syntax', 'spacy.munge',
                  'spacy.tests',
                  'spacy.tests.matcher',
                  'spacy.tests.morphology',
                  'spacy.tests.munge',
                  'spacy.tests.parser',
                  'spacy.tests.serialize',
                  'spacy.tests.spans',
                  'spacy.tests.tagger',
                  'spacy.tests.tokenizer',
                  'spacy.tests.tokens',
                  'spacy.tests.vectors',
                  'spacy.tests.vocab'],
        description="Industrial-strength NLP",
        author='Matthew Honnibal',
        author_email='honnibal@gmail.com',
        version=VERSION,
        url="http://honnibal.github.io/spaCy/",
        package_data=PACKAGE_DATA,
        ext_modules=exts,
        license="MIT",
        install_requires=['numpy', 'murmurhash == 0.24', 'cymem == 1.30', 'preshed == 0.44',
                          'thinc == 4.0.0', "text_unidecode", 'plac', 'six',
                          'ujson', 'cloudpickle', 'sputnik == 0.5.2'],
        setup_requires=["headers_workaround"],
        cmdclass = {'build_ext': build_ext_subclass },
    )

    import headers_workaround

    headers_workaround.fix_venv_pypy_include()
    headers_workaround.install_headers('murmurhash')
    headers_workaround.install_headers('numpy')


VERSION = '0.100'
def main(modules, is_pypy):
    language = "cpp"
    includes = ['.', path.join(sys.prefix, 'include')]
    if sys.platform.startswith('darwin'):
        compile_options['other'].append('-mmacosx-version-min=10.8')
        compile_options['other'].append('-stdlib=libc++')
        link_options['other'].append('-lc++')
    if use_cython:
        cython_setup(modules, language, includes)
    else:
        exts = [c_ext(mn, language, includes)
                for mn in modules]
        run_setup(exts)


MOD_NAMES = ['spacy.parts_of_speech', 'spacy.strings',
             'spacy.lexeme', 'spacy.vocab', 'spacy.attrs',
             'spacy.morphology', 'spacy.tagger',
             'spacy.syntax.stateclass',
             'spacy.tokenizer',
             'spacy.syntax.parser',
             'spacy.syntax.transition_system',
             'spacy.syntax.arc_eager',
             'spacy.syntax._parse_features',
             'spacy.gold', 'spacy.orth',
             'spacy.tokens.doc', 'spacy.tokens.span', 'spacy.tokens.token',
             'spacy.serialize.packer', 'spacy.serialize.huffman', 'spacy.serialize.bits',
             'spacy.cfile', 'spacy.matcher',
             'spacy.syntax.ner',
             'spacy.symbols']


if __name__ == '__main__':
    if sys.argv[1] == 'clean':
        clean(MOD_NAMES)
    else:
        use_cython = sys.argv[1] == 'build_ext'
        main(MOD_NAMES, use_cython)

+++ new setup.py (added)

#!/usr/bin/env python
from __future__ import division, print_function
import os
import shutil
import subprocess
import sys
import contextlib
from distutils.command.build_ext import build_ext
from distutils.sysconfig import get_python_inc

try:
    from setuptools import Extension, setup
except ImportError:
    from distutils.core import Extension, setup


MAJOR = 0
MINOR = 100
MICRO = 0
ISRELEASED = False
VERSION = '%d.%d.%d' % (MAJOR, MINOR, MICRO)


PACKAGES = [
    'spacy',
    'spacy.tokens',
    'spacy.en',
    'spacy.serialize',
    'spacy.syntax',
    'spacy.munge',
    'spacy.tests',
    'spacy.tests.matcher',
    'spacy.tests.morphology',
    'spacy.tests.munge',
    'spacy.tests.parser',
    'spacy.tests.serialize',
    'spacy.tests.spans',
    'spacy.tests.tagger',
    'spacy.tests.tokenizer',
    'spacy.tests.tokens',
    'spacy.tests.vectors',
    'spacy.tests.vocab']


MOD_NAMES = [
    'spacy.parts_of_speech',
    'spacy.strings',
    'spacy.lexeme',
    'spacy.vocab',
    'spacy.attrs',
    'spacy.morphology',
    'spacy.tagger',
    'spacy.syntax.stateclass',
    'spacy.tokenizer',
    'spacy.syntax.parser',
    'spacy.syntax.transition_system',
    'spacy.syntax.arc_eager',
    'spacy.syntax._parse_features',
    'spacy.gold',
    'spacy.orth',
    'spacy.tokens.doc',
    'spacy.tokens.span',
    'spacy.tokens.token',
    'spacy.serialize.packer',
    'spacy.serialize.huffman',
    'spacy.serialize.bits',
    'spacy.cfile',
    'spacy.matcher',
    'spacy.syntax.ner',
    'spacy.symbols']


if sys.version_info[:2] < (2, 7) or (3, 0) <= sys.version_info[0:2] < (3, 4):
    raise RuntimeError('Python version 2.7 or >= 3.4 required.')


# By subclassing build_extensions we have the actual compiler that will be used which is really known only after finalize_options
# http://stackoverflow.com/questions/724664/python-distutils-how-to-get-a-compiler-that-is-going-to-be-used
compile_options = {'msvc' : ['/Ox', '/EHsc'],
                   'other' : ['-O3', '-Wno-strict-prototypes', '-Wno-unused-function']}
link_options = {'msvc' : [],
                'other' : []}

if sys.platform.startswith('darwin'):
    compile_options['other'].append('-mmacosx-version-min=10.8')
    compile_options['other'].append('-stdlib=libc++')
    link_options['other'].append('-lc++')


class build_ext_options:
    def build_options(self):
        for e in self.extensions:
            e.extra_compile_args = compile_options.get(
                self.compiler.compiler_type, compile_options['other'])
        for e in self.extensions:
            e.extra_link_args = link_options.get(
                self.compiler.compiler_type, link_options['other'])


class build_ext_subclass(build_ext, build_ext_options):
    def build_extensions(self):
        build_ext_options.build_options(self)
        build_ext.build_extensions(self)


# Return the git revision as a string
def git_version():
    def _minimal_ext_cmd(cmd):
        # construct minimal environment
        env = {}
        for k in ['SYSTEMROOT', 'PATH']:
            v = os.environ.get(k)
            if v is not None:
                env[k] = v
        # LANGUAGE is used on win32
        env['LANGUAGE'] = 'C'
        env['LANG'] = 'C'
        env['LC_ALL'] = 'C'
        out = subprocess.Popen(cmd, stdout = subprocess.PIPE, env=env).communicate()[0]
        return out

    try:
        out = _minimal_ext_cmd(['git', 'rev-parse', 'HEAD'])
        GIT_REVISION = out.strip().decode('ascii')
    except OSError:
        GIT_REVISION = 'Unknown'

    return GIT_REVISION


def get_version_info():
    # Adding the git rev number needs to be done inside write_version_py(),
    # otherwise the import of spacy.about messes up the build under Python 3.
    FULLVERSION = VERSION
    if os.path.exists('.git'):
        GIT_REVISION = git_version()
    elif os.path.exists(os.path.join('spacy', 'about.py')):
        # must be a source distribution, use existing version file
        try:
            from spacy.about import git_revision as GIT_REVISION
        except ImportError:
            raise ImportError('Unable to import git_revision. Try removing '
                              'spacy/about.py and the build directory '
                              'before building.')
    else:
        GIT_REVISION = 'Unknown'

    if not ISRELEASED:
        FULLVERSION += '.dev0+' + GIT_REVISION[:7]

    return FULLVERSION, GIT_REVISION


def write_version(path):
    cnt = """# THIS FILE IS GENERATED FROM SPACY SETUP.PY
short_version = '%(version)s'
version = '%(version)s'
full_version = '%(full_version)s'
git_revision = '%(git_revision)s'
release = %(isrelease)s
if not release:
    version = full_version
"""
    FULLVERSION, GIT_REVISION = get_version_info()

    with open(path, 'w') as f:
        f.write(cnt % {'version': VERSION,
                       'full_version' : FULLVERSION,
                       'git_revision' : GIT_REVISION,
                       'isrelease': str(ISRELEASED)})


def generate_cython(root, source):
    print('Cythonizing sources')
    p = subprocess.call([sys.executable,
                         os.path.join(root, 'bin', 'cythonize.py'),
                         source])
    if p != 0:
        raise RuntimeError('Running cythonize failed')


def import_include(module_name):
    try:
        return __import__(module_name, globals(), locals(), [], 0)
    except ImportError:
        raise ImportError('Unable to import %s. Create a virtual environment '
                          'and install all dependencies from requirements.txt, '
                          'e.g., run "pip install -r requirements.txt".' % module_name)


def copy_include(src, dst, path):
    assert os.path.isdir(src)
    assert os.path.isdir(dst)
    shutil.copytree(
        os.path.join(src, path),
        os.path.join(dst, path))


def prepare_includes(path):
    include_dir = os.path.join(path, 'include')
    if os.path.exists(include_dir):
        shutil.rmtree(include_dir)
    os.mkdir(include_dir)

    numpy = import_include('numpy')
    copy_include(numpy.get_include(), include_dir, 'numpy')

    murmurhash = import_include('murmurhash')
    copy_include(murmurhash.get_include(), include_dir, 'murmurhash')


def is_source_release(path):
    return os.path.exists(os.path.join(path, 'PKG-INFO'))


def clean(path):
    for name in MOD_NAMES:
        name = name.replace('.', '/')
        for ext in ['.so', '.html', '.cpp', '.c']:
            file_path = os.path.join(path, name + ext)
            if os.path.exists(file_path):
                os.unlink(file_path)


@contextlib.contextmanager
def chdir(new_dir):
    old_dir = os.getcwd()
    try:
        os.chdir(new_dir)
        sys.path.insert(0, new_dir)
        yield
    finally:
        del sys.path[0]
        os.chdir(old_dir)


def setup_package():
    root = os.path.abspath(os.path.dirname(__file__))

    if len(sys.argv) > 1 and sys.argv[1] == 'clean':
        return clean(root)

    with chdir(root):
        write_version(os.path.join(root, 'spacy', 'about.py'))

        include_dirs = [
            get_python_inc(plat_specific=True),
            os.path.join(root, 'include')]

        ext_modules = []
        for mod_name in MOD_NAMES:
            mod_path = mod_name.replace('.', '/') + '.cpp'
            ext_modules.append(
                Extension(mod_name, [mod_path],
                    language='c++', include_dirs=include_dirs))

        if not is_source_release(root):
            generate_cython(root, 'spacy')
            prepare_includes(root)

        setup(
            name='spacy',
            packages=PACKAGES,
            package_data={'': ['*.pyx', '*.pxd']},
            description='Industrial-strength NLP',
            author='Matthew Honnibal',
            author_email='matt@spacy.io',
            version=VERSION,
            url='https://spacy.io',
            license='MIT',
            ext_modules=ext_modules,
            install_requires=['numpy', 'murmurhash == 0.26', 'cymem == 1.30', 'preshed == 0.46.1',
                              'thinc == 4.1.0', 'text_unidecode', 'plac', 'six',
                              'ujson', 'cloudpickle', 'sputnik == 0.6.4'],
            cmdclass = {
                'build_ext': build_ext_subclass},
        )


if __name__ == '__main__':
    setup_package()
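setup_package() now drives the whole build: write spacy/about.py, cythonize the .pyx sources (skipped for source releases, which ship PKG-INFO), stage the numpy and murmurhash headers under ./include, and compile the C++ extensions. A hedged sketch of a from-source build under the new layout; the exact commands are illustrative, not mandated by the commit:

# Assumption: requirements.txt is installed first, so Cython, numpy and
# murmurhash are importable when generate_cython()/prepare_includes() run.
import subprocess
import sys

subprocess.check_call('pip install -r requirements.txt', shell=True)
subprocess.check_call([sys.executable, 'setup.py', 'build_ext', '--inplace'])
subprocess.check_call([sys.executable, 'setup.py', 'clean'])  # removes generated .so/.html/.cpp/.c files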
@@ -1,3 +0,0 @@
-"""Feed-forward neural network, using Thenao."""
-
-
146  spacy/_nn.pyx

@@ -1,146 +0,0 @@
"""Feed-forward neural network, using Thenao."""

import os
import sys
import time

import numpy

import theano
import theano.tensor as T
import plac

from spacy.gold import read_json_file
from spacy.gold import GoldParse
from spacy.en.pos import POS_TEMPLATES, POS_TAGS, setup_model_dir


def build_model(n_classes, n_vocab, n_hidden, n_word_embed, n_tag_embed):
    # allocate symbolic variables for the data
    words = T.vector('words')
    tags = T.vector('tags')

    word_e = _init_embedding(n_words, n_word_embed)
    tag_e = _init_embedding(n_tags, n_tag_embed)
    label_e = _init_embedding(n_labels, n_label_embed)
    maxent_W, maxent_b = _init_maxent_weights(n_hidden, n_classes)
    hidden_W, hidden_b = _init_hidden_weights(28*28, n_hidden, T.tanh)
    params = [hidden_W, hidden_b, maxent_W, maxent_b, word_e, tag_e, label_e]

    x = T.concatenate([
        T.flatten(word_e[word_indices], outdim=1),
        T.flatten(tag_e[tag_indices], outdim=1)])

    p_y_given_x = feed_layer(
        T.nnet.softmax,
        maxent_W,
        maxent_b,
        feed_layer(
            T.tanh,
            hidden_W,
            hidden_b,
            x))[0]

    guess = T.argmax(p_y_given_x)

    cost = (
        -T.log(p_y_given_x[y])
        + L1(L1_reg, maxent_W, hidden_W, word_e, tag_e)
        + L2(L2_reg, maxent_W, hidden_W, wod_e, tag_e)
    )

    train_model = theano.function(
        inputs=[words, tags, y],
        outputs=guess,
        updates=[update(learning_rate, param, cost) for param in params]
    )

    evaluate_model = theano.function(
        inputs=[x, y],
        outputs=T.neq(y, T.argmax(p_y_given_x[0])),
    )
    return train_model, evaluate_model


def _init_embedding(vocab_size, n_dim):
    embedding = 0.2 * numpy.random.uniform(-1.0, 1.0, (vocab_size+1, n_dim))
    return theano.shared(embedding).astype(theano.config.floatX)


def _init_maxent_weights(n_hidden, n_out):
    weights = numpy.zeros((n_hidden, 10), dtype=theano.config.floatX)
    bias = numpy.zeros((10,), dtype=theano.config.floatX)
    return (
        theano.shared(name='W', borrow=True, value=weights),
        theano.shared(name='b', borrow=True, value=bias)
    )


def _init_hidden_weights(n_in, n_out, activation=T.tanh):
    rng = numpy.random.RandomState(1234)
    weights = numpy.asarray(
        rng.uniform(
            low=-numpy.sqrt(6. / (n_in + n_out)),
            high=numpy.sqrt(6. / (n_in + n_out)),
            size=(n_in, n_out)
        ),
        dtype=theano.config.floatX
    )

    bias = numpy.zeros((n_out,), dtype=theano.config.floatX)
    return (
        theano.shared(value=weights, name='W', borrow=True),
        theano.shared(value=bias, name='b', borrow=True)
    )


def feed_layer(activation, weights, bias, input):
    return activation(T.dot(input, weights) + bias)


def L1(L1_reg, w1, w2):
    return L1_reg * (abs(w1).sum() + abs(w2).sum())


def L2(L2_reg, w1, w2):
    return L2_reg * ((w1 ** 2).sum() + (w2 ** 2).sum())


def update(eta, param, cost):
    return (param, param - (eta * T.grad(cost, param)))


def main(train_loc, eval_loc, model_dir):
    learning_rate = 0.01
    L1_reg = 0.00
    L2_reg = 0.0001

    print "... reading the data"
    gold_train = list(read_json_file(train_loc))
    print '... building the model'
    pos_model_dir = path.join(model_dir, 'pos')
    if path.exists(pos_model_dir):
        shutil.rmtree(pos_model_dir)
    os.mkdir(pos_model_dir)

    setup_model_dir(sorted(POS_TAGS.keys()), POS_TAGS, POS_TEMPLATES, pos_model_dir)

    train_model, evaluate_model = build_model(n_hidden, len(POS_TAGS), learning_rate,
                                              L1_reg, L2_reg)

    print '... training'
    for epoch in range(1, n_epochs+1):
        for raw_text, sents in gold_tuples:
            for (ids, words, tags, ner, heads, deps), _ in sents:
                tokens = nlp.tokenizer.tokens_from_list(words)
                for t in tokens:
                    guess = train_model([t.orth], [t.tag])
                    loss += guess != t.tag
        print loss
        # compute zero-one loss on validation set
        #error = numpy.mean([evaluate_model(x, y) for x, y in dev_examples])
        #print('epoch %i, validation error %f %%' % (epoch, error * 100))


if __name__ == '__main__':
    plac.call(main)
@@ -1,13 +0,0 @@
from ._ml cimport Model
from thinc.nn cimport InputLayer


cdef class TheanoModel(Model):
    cdef InputLayer input_layer
    cdef object train_func
    cdef object predict_func
    cdef object debug

    cdef public float eta
    cdef public float mu
    cdef public float t
@@ -1,52 +0,0 @@
from thinc.api cimport Example, ExampleC
from thinc.typedefs cimport weight_t

from ._ml cimport arg_max_if_true
from ._ml cimport arg_max_if_zero

import numpy
from os import path


cdef class TheanoModel(Model):
    def __init__(self, n_classes, input_spec, train_func, predict_func, model_loc=None,
                 eta=0.001, mu=0.9, debug=None):
        if model_loc is not None and path.isdir(model_loc):
            model_loc = path.join(model_loc, 'model')

        self.eta = eta
        self.mu = mu
        self.t = 1
        initializer = lambda: 0.2 * numpy.random.uniform(-1.0, 1.0)
        self.input_layer = InputLayer(input_spec, initializer)
        self.train_func = train_func
        self.predict_func = predict_func
        self.debug = debug

        self.n_classes = n_classes
        self.n_feats = len(self.input_layer)
        self.model_loc = model_loc

    def predict(self, Example eg):
        self.input_layer.fill(eg.embeddings, eg.atoms, use_avg=True)
        theano_scores = self.predict_func(eg.embeddings)[0]
        cdef int i
        for i in range(self.n_classes):
            eg.c.scores[i] = theano_scores[i]
        eg.c.guess = arg_max_if_true(eg.c.scores, eg.c.is_valid, self.n_classes)

    def train(self, Example eg):
        self.input_layer.fill(eg.embeddings, eg.atoms, use_avg=False)
        theano_scores, update, y, loss = self.train_func(eg.embeddings, eg.costs,
                                                         self.eta, self.mu)
        self.input_layer.update(update, eg.atoms, self.t, self.eta, self.mu)
        for i in range(self.n_classes):
            eg.c.scores[i] = theano_scores[i]
        eg.c.guess = arg_max_if_true(eg.c.scores, eg.c.is_valid, self.n_classes)
        eg.c.best = arg_max_if_zero(eg.c.scores, eg.c.costs, self.n_classes)
        eg.c.cost = eg.c.costs[eg.c.guess]
        eg.c.loss = loss
        self.t += 1

    def end_training(self):
        pass
@@ -6,6 +6,4 @@ from ..language import Language


 class German(Language):
-    @classmethod
-    def default_data_dir(cls):
-        return path.join(path.dirname(__file__), 'data')
+    pass
@@ -4,8 +4,6 @@ from os import path

 from ..language import Language

-LOCAL_DATA_DIR = path.join(path.dirname(__file__), 'data')
-
 # improved list from Stone, Denis, Kwantes (2010)
 STOPWORDS = """

@@ -35,9 +33,7 @@ your yours yourself yourselves
 STOPWORDS = set(w for w in STOPWORDS.split() if w)

 class English(Language):
-    @classmethod
-    def default_data_dir(cls):
-        return LOCAL_DATA_DIR
+    lang = 'en'

     @staticmethod
     def is_stop(string):
@@ -8,8 +8,11 @@ from sputnik import Sputnik

 def migrate(path):
     data_path = os.path.join(path, 'data')
-    if os.path.isdir(data_path) and not os.path.islink(data_path):
-        shutil.rmtree(data_path)
+    if os.path.isdir(data_path):
+        if os.path.islink(data_path):
+            os.unlink(data_path)
+        else:
+            shutil.rmtree(data_path)
     for filename in os.listdir(path):
         if filename.endswith('.tgz'):
             os.unlink(os.path.join(path, filename))

@@ -17,8 +20,15 @@ def migrate(path):

 def link(package, path):
     if os.path.exists(path):
-        os.unlink(path)
-    os.symlink(package.dir_path('data'), path)
+        if os.path.isdir(path):
+            shutil.rmtree(path)
+        else:
+            os.unlink(path)
+
+    if not hasattr(os, 'symlink'): # not supported by win+py27
+        shutil.copytree(package.dir_path('data'), path)
+    else:
+        os.symlink(package.dir_path('data'), path)


 @plac.annotations(

@@ -26,12 +36,16 @@ def link(package, path):
 )
 def main(data_size='all', force=False):
     # TODO read version from the same source as the setup
-    sputnik = Sputnik('spacy', '0.99.0', console=sys.stdout)
+    sputnik = Sputnik('spacy', '0.100.0', console=sys.stdout)

     path = os.path.dirname(os.path.abspath(__file__))

-    command = sputnik.make_command(
-        data_path=os.path.abspath(os.path.join(path, '..', 'data')),
+    data_path = os.path.abspath(os.path.join(path, '..', 'data'))
+    if not os.path.isdir(data_path):
+        os.mkdir(data_path)
+
+    command = sputnik.command(
+        data_path=data_path,
         repository_url='https://index.spacy.io')

     if force:

@@ -42,9 +56,6 @@ def main(data_size='all', force=False):
     # FIXME clean up old-style packages
     migrate(path)

-    # FIXME supply spacy with an old-style data dir
-    link(package, os.path.join(path, 'data'))
-
 if __name__ == '__main__':
     plac.call(main)
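link() now falls back to a full copy when os.symlink is unavailable (the win+py27 case noted in the comment above), so after `python -m spacy.en.download` the package's data directory is either a symlink to the sputnik-installed package or a real copy of it. A quick hedged check, assuming the download has completed successfully:

# Assumption: run in an environment where the en model download has finished.
import os
import spacy.en

data = os.path.join(os.path.dirname(spacy.en.__file__), 'data')
print(os.path.islink(data) or os.path.isdir(data))  # True: symlink on most platforms, a copy on win+py27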
@@ -6,6 +6,4 @@ from ..language import Language


 class Finnish(Language):
-    @classmethod
-    def default_data_dir(cls):
-        return path.join(path.dirname(__file__), 'data')
+    pass
@@ -6,6 +6,4 @@ from ..language import Language


 class Italian(Language):
-    @classmethod
-    def default_data_dir(cls):
-        return path.join(path.dirname(__file__), 'data')
+    pass
@ -20,9 +20,12 @@ from .syntax.ner import BiluoPushDown
|
||||||
from .syntax.arc_eager import ArcEager
|
from .syntax.arc_eager import ArcEager
|
||||||
|
|
||||||
from .attrs import TAG, DEP, ENT_IOB, ENT_TYPE, HEAD
|
from .attrs import TAG, DEP, ENT_IOB, ENT_TYPE, HEAD
|
||||||
|
from .util import get_package
|
||||||
|
|
||||||
|
|
||||||
class Language(object):
|
class Language(object):
|
||||||
|
lang = None
|
||||||
|
|
||||||
@staticmethod
|
@staticmethod
|
||||||
def lower(string):
|
def lower(string):
|
||||||
return string.lower()
|
return string.lower()
|
||||||
|
@ -100,7 +103,7 @@ class Language(object):
|
||||||
return 0
|
return 0
|
||||||
|
|
||||||
@classmethod
|
@classmethod
|
||||||
def default_lex_attrs(cls, data_dir=None):
|
def default_lex_attrs(cls):
|
||||||
return {
|
return {
|
||||||
attrs.LOWER: cls.lower,
|
attrs.LOWER: cls.lower,
|
||||||
attrs.NORM: cls.norm,
|
attrs.NORM: cls.norm,
|
||||||
|
@ -134,85 +137,108 @@ class Language(object):
|
||||||
return {0: {'PER': True, 'LOC': True, 'ORG': True, 'MISC': True}}
|
return {0: {'PER': True, 'LOC': True, 'ORG': True, 'MISC': True}}
|
||||||
|
|
||||||
@classmethod
|
@classmethod
|
||||||
def default_data_dir(cls):
|
def default_vocab(cls, package=None, get_lex_attr=None):
|
||||||
return path.join(path.dirname(__file__), 'data')
|
if package is None:
|
||||||
|
package = get_package()
|
||||||
@classmethod
|
|
||||||
def default_vocab(cls, data_dir=None, get_lex_attr=None):
|
|
||||||
if data_dir is None:
|
|
||||||
data_dir = cls.default_data_dir()
|
|
||||||
if get_lex_attr is None:
|
if get_lex_attr is None:
|
||||||
get_lex_attr = cls.default_lex_attrs(data_dir)
|
get_lex_attr = cls.default_lex_attrs()
|
||||||
return Vocab.from_dir(
|
return Vocab.from_package(package, get_lex_attr=get_lex_attr)
|
||||||
path.join(data_dir, 'vocab'),
|
|
||||||
get_lex_attr=get_lex_attr)
|
|
||||||
|
|
||||||
@classmethod
|
@classmethod
|
||||||
def default_tokenizer(cls, vocab, data_dir):
|
def default_parser(cls, package, vocab):
|
||||||
if path.exists(data_dir):
|
data_dir = package.dir_path('deps', require=False)
|
||||||
return Tokenizer.from_dir(vocab, data_dir)
|
if data_dir and path.exists(data_dir):
|
||||||
else:
|
|
||||||
return Tokenizer(vocab, {}, None, None, None)
|
|
||||||
|
|
||||||
@classmethod
|
|
||||||
def default_tagger(cls, vocab, data_dir):
|
|
||||||
if path.exists(data_dir):
|
|
||||||
return Tagger.from_dir(data_dir, vocab)
|
|
||||||
else:
|
|
||||||
return None
|
|
||||||
|
|
||||||
@classmethod
|
|
||||||
def default_parser(cls, vocab, data_dir):
|
|
||||||
if path.exists(data_dir):
|
|
||||||
return Parser.from_dir(data_dir, vocab.strings, ArcEager)
|
return Parser.from_dir(data_dir, vocab.strings, ArcEager)
|
||||||
else:
|
|
||||||
return None
|
|
||||||
|
|
||||||
@classmethod
|
@classmethod
|
||||||
def default_entity(cls, vocab, data_dir):
|
def default_entity(cls, package, vocab):
|
||||||
if path.exists(data_dir):
|
data_dir = package.dir_path('ner', require=False)
|
||||||
|
if data_dir and path.exists(data_dir):
|
||||||
return Parser.from_dir(data_dir, vocab.strings, BiluoPushDown)
|
return Parser.from_dir(data_dir, vocab.strings, BiluoPushDown)
|
||||||
else:
|
|
||||||
return None
|
|
||||||
|
|
||||||
@classmethod
|
def __init__(self,
|
||||||
def default_matcher(cls, vocab, data_dir):
|
data_dir=None,
|
||||||
if path.exists(data_dir):
|
model=None,
|
||||||
return Matcher.from_dir(data_dir, vocab)
|
vocab=None,
|
||||||
else:
|
tokenizer=None,
|
||||||
return None
|
tagger=None,
|
||||||
|
parser=None,
|
||||||
|
entity=None,
|
||||||
|
matcher=None,
|
||||||
|
serializer=None,
|
||||||
|
load_vectors=True):
|
||||||
|
"""
|
||||||
|
a model can be specified:
|
||||||
|
|
||||||
def __init__(self, data_dir=None, vocab=None, tokenizer=None, tagger=None,
|
1) by a path to the model directory (DEPRECATED)
|
||||||
parser=None, entity=None, matcher=None, serializer=None,
|
- Language(data_dir='path/to/data')
|
||||||
load_vectors=True):
|
|
||||||
|
2) by a language identifier (and optionally a package root dir)
|
||||||
|
- Language(lang='en')
|
||||||
|
- Language(lang='en', data_dir='spacy/data')
|
||||||
|
|
||||||
|
3) by a model name/version (and optionally a package root dir)
|
||||||
|
- Language(model='en_default')
|
||||||
|
- Language(model='en_default ==1.0.0')
|
||||||
|
- Language(model='en_default <1.1.0, data_dir='spacy/data')
|
||||||
|
"""
|
||||||
|
# support non-package data dirs
|
||||||
|
if data_dir and path.exists(path.join(data_dir, 'vocab')):
|
||||||
|
class Package(object):
|
||||||
|
def __init__(self, root):
|
||||||
|
self.root = root
|
||||||
|
|
||||||
|
def has_file(self, *path_parts):
|
||||||
|
return path.exists(path.join(self.root, *path_parts))
|
||||||
|
|
||||||
|
def file_path(self, *path_parts, **kwargs):
|
||||||
|
return path.join(self.root, *path_parts)
|
||||||
|
|
||||||
|
def dir_path(self, *path_parts, **kwargs):
|
||||||
|
return path.join(self.root, *path_parts)
|
||||||
|
|
||||||
|
def load_utf8(self, func, *path_parts, **kwargs):
|
||||||
|
with io.open(self.file_path(path.join(*path_parts)),
|
||||||
|
mode='r', encoding='utf8') as f:
|
||||||
|
return func(f)
|
||||||
|
|
||||||
|
warn("using non-package data_dir", DeprecationWarning)
|
||||||
|
package = Package(data_dir)
|
||||||
|
else:
|
||||||
|
package = get_package(name=model, data_path=data_dir)
|
||||||
if load_vectors is not True:
|
if load_vectors is not True:
|
||||||
warn("load_vectors is deprecated", DeprecationWarning)
|
warn("load_vectors is deprecated", DeprecationWarning)
|
||||||
if data_dir in (None, True):
|
|
||||||
data_dir = self.default_data_dir()
|
|
||||||
if vocab in (None, True):
|
if vocab in (None, True):
|
||||||
vocab = self.default_vocab(data_dir)
|
self.vocab = self.default_vocab(package)
|
||||||
if tokenizer in (None, True):
|
|
||||||
tokenizer = self.default_tokenizer(vocab, data_dir=path.join(data_dir, 'tokenizer'))
|
|
||||||
if tagger in (None, True):
|
|
||||||
tagger = self.default_tagger(vocab, data_dir=path.join(data_dir, 'pos'))
|
|
||||||
if entity in (None, True):
|
|
||||||
entity = self.default_entity(vocab, data_dir=path.join(data_dir, 'ner'))
|
|
||||||
if parser in (None, True):
|
|
||||||
parser = self.default_parser(vocab, data_dir=path.join(data_dir, 'deps'))
|
|
||||||
if matcher in (None, True):
|
|
||||||
matcher = self.default_matcher(vocab, data_dir=data_dir)
|
|
||||||
self.vocab = vocab
|
self.vocab = vocab
|
||||||
|
if tokenizer in (None, True):
|
||||||
|
self.tokenizer = Tokenizer.from_package(package, self.vocab)
|
||||||
self.tokenizer = tokenizer
|
self.tokenizer = tokenizer
|
||||||
|
if tagger in (None, True):
|
||||||
|
tagger = Tagger.from_package(package, self.vocab)
|
||||||
self.tagger = tagger
|
self.tagger = tagger
|
||||||
self.parser = parser
|
if entity in (None, True):
|
||||||
|
entity = self.default_entity(package, self.vocab)
|
||||||
self.entity = entity
|
self.entity = entity
|
||||||
|
if parser in (None, True):
|
||||||
|
parser = self.default_parser(package, self.vocab)
|
||||||
|
self.parser = parser
|
||||||
|
if matcher in (None, True):
|
||||||
|
matcher = Matcher.from_package(package, self.vocab)
|
||||||
self.matcher = matcher
|
self.matcher = matcher
|
||||||
|
|
||||||
def __reduce__(self):
|
def __reduce__(self):
|
||||||
return (self.__class__,
|
args = (
|
||||||
(None, self.vocab, self.tokenizer, self.tagger, self.parser,
|
None, # data_dir
|
||||||
self.entity, self.matcher, None),
|
None, # model
|
||||||
None, None)
|
self.vocab,
|
||||||
|
self.tokenizer,
|
||||||
|
self.tagger,
|
||||||
|
self.parser,
|
||||||
|
self.entity,
|
||||||
|
self.matcher
|
||||||
|
)
|
||||||
|
return (self.__class__, args, None, None)
|
||||||
|
|
||||||
def __call__(self, text, tag=True, parse=True, entity=True):
|
def __call__(self, text, tag=True, parse=True, entity=True):
|
||||||
"""Apply the pipeline to some text. The text can span multiple sentences,
|
"""Apply the pipeline to some text. The text can span multiple sentences,
|
||||||
|
|
|
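Editor's note — a hedged usage sketch, not part of the commit: the new __init__ docstring above lists the supported loading modes; in user code they look roughly as follows (paths and version pins are illustrative only).

    # The three loading modes described in the docstring above.
    from spacy.en import English

    nlp = English()                            # resolve the installed en_default package
    nlp = English(model='en_default ==1.0.0')  # pin a sputnik model name/version
    nlp = English(data_dir='path/to/data')     # deprecated: point at a raw data directory
    doc = nlp(u'Hello, world. Here are two sentences.')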
@@ -12,16 +12,21 @@ from .parts_of_speech import NOUN, VERB, ADJ, PUNCT

 class Lemmatizer(object):
     @classmethod
-    def from_dir(cls, data_dir):
+    def from_package(cls, package):
         index = {}
         exc = {}
         for pos in ['adj', 'noun', 'verb']:
-            index[pos] = read_index(path.join(data_dir, 'wordnet', 'index.%s' % pos))
-            exc[pos] = read_exc(path.join(data_dir, 'wordnet', '%s.exc' % pos))
-        if path.exists(path.join(data_dir, 'vocab', 'lemma_rules.json')):
-            rules = json.load(codecs.open(path.join(data_dir, 'vocab', 'lemma_rules.json'), encoding='utf_8'))
-        else:
-            rules = {}
+            index[pos] = package.load_utf8(read_index,
+                'wordnet', 'index.%s' % pos,
+                default=set())  # TODO: really optional?
+            exc[pos] = package.load_utf8(read_exc,
+                'wordnet', '%s.exc' % pos,
+                default={})  # TODO: really optional?
+
+        rules = package.load_utf8(json.load,
+            'vocab', 'lemma_rules.json',
+            default={})  # TODO: really optional?
+
         return cls(index, exc, rules)

     def __init__(self, index, exceptions, rules):

@@ -70,11 +75,9 @@ def lemmatize(string, index, exceptions, rules):
     return set(forms)


-def read_index(loc):
+def read_index(fileobj):
     index = set()
-    if not path.exists(loc):
-        return index
-    for line in codecs.open(loc, 'r', 'utf8'):
+    for line in fileobj:
         if line.startswith(' '):
             continue
         pieces = line.split()

@@ -84,11 +87,9 @@ def read_index(loc):
     return index


-def read_exc(loc):
+def read_exc(fileobj):
     exceptions = {}
-    if not path.exists(loc):
-        return exceptions
-    for line in codecs.open(loc, 'r', 'utf8'):
+    for line in fileobj:
         if line.startswith(' '):
             continue
         pieces = line.split()
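Editor's note — a hedged sketch, not part of the commit: read_index and read_exc now accept open file objects rather than paths, so they can still be used outside a package; the local WordNet paths below are hypothetical.

    import io
    from spacy.lemmatizer import read_index, read_exc

    with io.open('corpora/wordnet/index.noun', encoding='utf8') as f:
        index = read_index(f)   # set of noun lemmas, e.g. 'man' in index
    with io.open('corpora/wordnet/verb.exc', encoding='utf8') as f:
        exc = read_exc(f)       # dict mapping exceptional forms to lemma tuples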
@@ -169,14 +169,11 @@ cdef class Matcher:
     cdef object _patterns

     @classmethod
-    def from_dir(cls, data_dir, Vocab vocab):
-        patterns_loc = path.join(data_dir, 'vocab', 'gazetteer.json')
-        if path.exists(patterns_loc):
-            patterns_data = open(patterns_loc).read()
-            patterns = json.loads(patterns_data)
-            return cls(vocab, patterns)
-        else:
-            return cls(vocab, {})
+    def from_package(cls, package, Vocab vocab):
+        patterns = package.load_utf8(json.load,
+            'vocab', 'gazetteer.json',
+            default={})  # TODO: really optional?
+        return cls(vocab, patterns)

     def __init__(self, vocab, patterns):
         self.vocab = vocab
@@ -1,62 +0,0 @@
-# Enum of Wordnet supersenses
-cimport parts_of_speech
-from .typedefs cimport flags_t
-
-cpdef enum:
-    A_behavior
-    A_body
-    A_feeling
-    A_mind
-    A_motion
-    A_perception
-    A_quantity
-    A_relation
-    A_social
-    A_spatial
-    A_substance
-    A_time
-    A_weather
-    N_act
-    N_animal
-    N_artifact
-    N_attribute
-    N_body
-    N_cognition
-    N_communication
-    N_event
-    N_feeling
-    N_food
-    N_group
-    N_location
-    N_motive
-    N_object
-    N_person
-    N_phenomenon
-    N_plant
-    N_possession
-    N_process
-    N_quantity
-    N_relation
-    N_shape
-    N_state
-    N_substance
-    N_time
-    V_body
-    V_change
-    V_cognition
-    V_communication
-    V_competition
-    V_consumption
-    V_contact
-    V_creation
-    V_emotion
-    V_motion
-    V_perception
-    V_possession
-    V_social
-    V_stative
-    V_weather
-
-
-cdef flags_t[<int>parts_of_speech.N_UNIV_TAGS] POS_SENSES

@@ -1,88 +0,0 @@
-from __future__ import unicode_literals
-cimport parts_of_speech
-
-
-POS_SENSES[<int>parts_of_speech.NO_TAG] = 0
-POS_SENSES[<int>parts_of_speech.ADJ] = 0
-POS_SENSES[<int>parts_of_speech.ADV] = 0
-POS_SENSES[<int>parts_of_speech.ADP] = 0
-POS_SENSES[<int>parts_of_speech.CONJ] = 0
-POS_SENSES[<int>parts_of_speech.DET] = 0
-POS_SENSES[<int>parts_of_speech.NOUN] = 0
-POS_SENSES[<int>parts_of_speech.NUM] = 0
-POS_SENSES[<int>parts_of_speech.PRON] = 0
-POS_SENSES[<int>parts_of_speech.PRT] = 0
-POS_SENSES[<int>parts_of_speech.VERB] = 0
-POS_SENSES[<int>parts_of_speech.X] = 0
-POS_SENSES[<int>parts_of_speech.PUNCT] = 0
-POS_SENSES[<int>parts_of_speech.EOL] = 0
-
-
-cdef int _sense = 0
-
-for _sense in range(A_behavior, N_act):
-    POS_SENSES[<int>parts_of_speech.ADJ] |= 1 << _sense
-
-for _sense in range(N_act, V_body):
-    POS_SENSES[<int>parts_of_speech.NOUN] |= 1 << _sense
-
-for _sense in range(V_body, V_weather+1):
-    POS_SENSES[<int>parts_of_speech.VERB] |= 1 << _sense
-
-
-
-STRINGS = (
-    'A_behavior',
-    'A_body',
-    'A_feeling',
-    'A_mind',
-    'A_motion',
-    'A_perception',
-    'A_quantity',
-    'A_relation',
-    'A_social',
-    'A_spatial',
-    'A_substance',
-    'A_time',
-    'A_weather',
-    'N_act',
-    'N_animal',
-    'N_artifact',
-    'N_attribute',
-    'N_body',
-    'N_cognition',
-    'N_communication',
-    'N_event',
-    'N_feeling',
-    'N_food',
-    'N_group',
-    'N_location',
-    'N_motive',
-    'N_object',
-    'N_person',
-    'N_phenomenon',
-    'N_plant',
-    'N_possession',
-    'N_process',
-    'N_quantity',
-    'N_relation',
-    'N_shape',
-    'N_state',
-    'N_substance',
-    'N_time',
-    'V_body',
-    'V_change',
-    'V_cognition',
-    'V_communication',
-    'V_competition',
-    'V_consumption',
-    'V_contact',
-    'V_creation',
-    'V_emotion',
-    'V_motion',
-    'V_perception',
-    'V_possession',
-    'V_social',
-    'V_stative',
-    'V_weather'
-)
@@ -146,15 +146,19 @@ cdef class Tagger:
         return cls(vocab, model)

     @classmethod
-    def from_dir(cls, data_dir, vocab):
-        if path.exists(path.join(data_dir, 'templates.json')):
-            templates = json.loads(open(path.join(data_dir, 'templates.json')))
-        else:
-            templates = cls.default_templates()
+    def from_package(cls, package, vocab):
+        # TODO: templates.json deprecated? not present in latest package
+        templates = cls.default_templates()
+        # templates = package.load_utf8(json.load,
+        #     'pos', 'templates.json',
+        #     default=cls.default_templates())

         model = TaggerModel(vocab.morphology.n_tags,
                             ConjunctionExtracter(N_CONTEXT_FIELDS, templates))
-        if path.exists(path.join(data_dir, 'model')):
-            model.load(path.join(data_dir, 'model'))
+
+        if package.has_file('pos', 'model'):  # TODO: really optional?
+            model.load(package.file_path('pos', 'model'))

         return cls(vocab, model)

     def __init__(self, Vocab vocab, TaggerModel model):
@@ -1,12 +1,11 @@
+from spacy.en import English
+
 import pytest
-from spacy.en import English, LOCAL_DATA_DIR
-import os


 @pytest.fixture(scope="session")
 def EN():
-    data_dir = os.environ.get('SPACY_DATA', LOCAL_DATA_DIR)
-    return English(data_dir=data_dir)
+    return English()


 def pytest_addoption(parser):

@@ -1,12 +0,0 @@
-# encoding: utf8
-from __future__ import unicode_literals
-
-import spacy.de
-
-
-#def test_tokenizer():
-#    lang = spacy.de.German()
-#
-#    doc = lang(u'Biografie: Ein Spiel ist ein Theaterstück des Schweizer Schriftstellers Max Frisch, das 1967 entstand und am 1. Februar 1968 im Schauspielhaus Zürich uraufgeführt wurde. 1984 legte Frisch eine überarbeitete Neufassung vor. Das von Frisch als Komödie bezeichnete Stück greift eines seiner zentralen Themen auf: die Möglichkeit oder Unmöglichkeit des Menschen, seine Identität zu verändern.')
-#    for token in doc:
-#        print(repr(token.string))

@@ -4,6 +4,9 @@ from spacy.serialize.packer import Packer
 from spacy.attrs import ORTH, SPACY
 from spacy.tokens import Doc
 import math
+import tempfile
+import shutil
+import os


 @pytest.mark.models

@@ -11,17 +14,21 @@ def test_read_write(EN):
     doc1 = EN(u'This is a simple test. With a couple of sentences.')
     doc2 = EN(u'This is another test document.')

-    with open('/tmp/spacy_docs.bin', 'wb') as file_:
-        file_.write(doc1.to_bytes())
-        file_.write(doc2.to_bytes())
+    try:
+        tmp_dir = tempfile.mkdtemp()
+        with open(os.path.join(tmp_dir, 'spacy_docs.bin'), 'wb') as file_:
+            file_.write(doc1.to_bytes())
+            file_.write(doc2.to_bytes())

-    with open('/tmp/spacy_docs.bin', 'rb') as file_:
-        bytes1, bytes2 = Doc.read_bytes(file_)
-        r1 = Doc(EN.vocab).from_bytes(bytes1)
-        r2 = Doc(EN.vocab).from_bytes(bytes2)
+        with open(os.path.join(tmp_dir, 'spacy_docs.bin'), 'rb') as file_:
+            bytes1, bytes2 = Doc.read_bytes(file_)
+            r1 = Doc(EN.vocab).from_bytes(bytes1)
+            r2 = Doc(EN.vocab).from_bytes(bytes2)

-    assert r1.string == doc1.string
-    assert r2.string == doc2.string
+        assert r1.string == doc1.string
+        assert r2.string == doc2.string
+    finally:
+        shutil.rmtree(tmp_dir)


 @pytest.mark.models
@@ -10,7 +10,6 @@ from spacy.en import English
 from spacy.vocab import Vocab
 from spacy.tokens.doc import Doc
 from spacy.tokenizer import Tokenizer
-from spacy.en import LOCAL_DATA_DIR
 from os import path

 from spacy.attrs import ORTH, SPACY, TAG, DEP, HEAD

@@ -1,9 +1,8 @@
 import pytest
-from spacy.en import English, LOCAL_DATA_DIR
+from spacy.en import English
 import os


 @pytest.fixture(scope="session")
 def en_nlp():
-    data_dir = os.environ.get('SPACY_DATA', LOCAL_DATA_DIR)
-    return English(data_dir=data_dir)
+    return English()

@@ -4,31 +4,33 @@ import io
 import pickle

 from spacy.lemmatizer import Lemmatizer, read_index, read_exc
-from spacy.en import LOCAL_DATA_DIR
-from os import path
+from spacy.util import get_package

 import pytest


-def test_read_index():
-    wn = path.join(LOCAL_DATA_DIR, 'wordnet')
-    index = read_index(path.join(wn, 'index.noun'))
+@pytest.fixture
+def package():
+    return get_package()
+
+
+@pytest.fixture
+def lemmatizer(package):
+    return Lemmatizer.from_package(package)
+
+
+def test_read_index(package):
+    index = package.load_utf8(read_index, 'wordnet', 'index.noun')
     assert 'man' in index
     assert 'plantes' not in index
     assert 'plant' in index


-def test_read_exc():
-    wn = path.join(LOCAL_DATA_DIR, 'wordnet')
-    exc = read_exc(path.join(wn, 'verb.exc'))
+def test_read_exc(package):
+    exc = package.load_utf8(read_exc, 'wordnet', 'verb.exc')
     assert exc['was'] == ('be',)


-@pytest.fixture
-def lemmatizer():
-    return Lemmatizer.from_dir(path.join(LOCAL_DATA_DIR))
-
-
 def test_noun_lemmas(lemmatizer):
     do = lemmatizer.noun
@@ -2,16 +2,15 @@ from __future__ import unicode_literals
 import pytest
 import gc

-from spacy.en import English, LOCAL_DATA_DIR
+from spacy.en import English
 import os

-data_dir = os.environ.get('SPACY_DATA', LOCAL_DATA_DIR)
 # Let this have its own instances, as we have to be careful about memory here
 # that's the point, after all

 @pytest.mark.models
 def get_orphan_token(text, i):
-    nlp = English(data_dir=data_dir)
+    nlp = English()
     tokens = nlp(text)
     gc.collect()
     token = tokens[i]

@@ -41,7 +40,7 @@ def _orphan_from_list(toks):
 @pytest.mark.models
 def test_list_orphans():
     # Test case from NSchrading
-    nlp = English(data_dir=data_dir)
+    nlp = English()
     samples = ["a", "test blah wat okay"]
     lst = []
     for sample in samples:

@@ -5,9 +5,8 @@ import os

 @pytest.fixture(scope='session')
 def nlp():
-    from spacy.en import English, LOCAL_DATA_DIR
-    data_dir = os.environ.get('SPACY_DATA', LOCAL_DATA_DIR)
-    return English(data_dir=data_dir)
+    from spacy.en import English
+    return English()


 @pytest.fixture()

@@ -75,7 +75,7 @@ def test_count_by(nlp):
 @pytest.mark.models
 def test_read_bytes(nlp):
     from spacy.tokens.doc import Doc
-    loc = '/tmp/test_serialize.bin'
+    loc = 'test_serialize.bin'
     with open(loc, 'wb') as file_:
         file_.write(nlp(u'This is a document.').to_bytes())
         file_.write(nlp(u'This is another.').to_bytes())

@@ -10,9 +10,8 @@ def token(doc):


 def test_load_resources_and_process_text():
-    from spacy.en import English, LOCAL_DATA_DIR
-    data_dir = os.environ.get('SPACY_DATA', LOCAL_DATA_DIR)
-    nlp = English(data_dir=data_dir)
+    from spacy.en import English
+    nlp = English()
     doc = nlp('Hello, world. Here are two sentences.')


@@ -154,9 +153,9 @@ def test_efficient_binary_serialization(doc):
     from spacy.tokens.doc import Doc

     byte_string = doc.to_bytes()
-    open('/tmp/moby_dick.bin', 'wb').write(byte_string)
+    open('moby_dick.bin', 'wb').write(byte_string)

     nlp = spacy.en.English()
-    for byte_string in Doc.read_bytes(open('/tmp/moby_dick.bin', 'rb')):
+    for byte_string in Doc.read_bytes(open('moby_dick.bin', 'rb')):
         doc = Doc(nlp.vocab)
         doc.from_bytes(byte_string)
@@ -41,8 +41,8 @@ cdef class Tokenizer:
         return (self.__class__, args, None, None)

     @classmethod
-    def from_dir(cls, Vocab vocab, data_dir):
-        rules, prefix_re, suffix_re, infix_re = read_lang_data(data_dir)
+    def from_package(cls, package, Vocab vocab):
+        rules, prefix_re, suffix_re, infix_re = read_lang_data(package)
         prefix_re = re.compile(prefix_re)
         suffix_re = re.compile(suffix_re)
         infix_re = re.compile(infix_re)

108  spacy/util.py
@@ -1,10 +1,24 @@
-from os import path
+import os
 import io
 import json
 import re

+from sputnik import Sputnik
+
 from .attrs import TAG, HEAD, DEP, ENT_IOB, ENT_TYPE

-DATA_DIR = path.join(path.dirname(__file__), '..', 'data')
+
+def get_package(name=None, data_path=None):
+    if data_path is None:
+        if os.environ.get('SPACY_DATA'):
+            data_path = os.environ.get('SPACY_DATA')
+        else:
+            data_path = os.path.abspath(
+                os.path.join(os.path.dirname(__file__), 'data'))
+
+    sputnik = Sputnik('spacy', '0.100.0')  # TODO: retrieve version
+    pool = sputnik.pool(data_path)
+    return pool.get(name or 'en_default')


 def normalize_slice(length, start, stop, step=None):
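Editor's note — a hedged sketch, not part of the commit: get_package resolves a sputnik pool under SPACY_DATA (or the bundled spacy/data directory) and returns a package object; the path below is illustrative, and the tag_map lookup just mirrors the access pattern used elsewhere in this commit.

    import json
    import os

    from spacy.util import get_package

    os.environ.setdefault('SPACY_DATA', '/path/to/spacy/data')  # optional override; path is illustrative
    package = get_package()  # defaults to the 'en_default' package
    tag_map = package.load_utf8(json.load, 'vocab', 'tag_map.json')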
@@ -31,67 +45,63 @@ def utf8open(loc, mode='r'):
     return io.open(loc, mode, encoding='utf8')


-def read_lang_data(data_dir):
-    with open(path.join(data_dir, 'specials.json')) as file_:
-        tokenization = json.load(file_)
-    prefix = read_prefix(data_dir)
-    suffix = read_suffix(data_dir)
-    infix = read_infix(data_dir)
+def read_lang_data(package):
+    tokenization = package.load_utf8(json.load, 'tokenizer', 'specials.json')
+    prefix = package.load_utf8(read_prefix, 'tokenizer', 'prefix.txt')
+    suffix = package.load_utf8(read_suffix, 'tokenizer', 'suffix.txt')
+    infix = package.load_utf8(read_infix, 'tokenizer', 'infix.txt')
     return tokenization, prefix, suffix, infix


-def read_prefix(data_dir):
-    with utf8open(path.join(data_dir, 'prefix.txt')) as file_:
-        entries = file_.read().split('\n')
-        expression = '|'.join(['^' + re.escape(piece) for piece in entries if piece.strip()])
+def read_prefix(fileobj):
+    entries = fileobj.read().split('\n')
+    expression = '|'.join(['^' + re.escape(piece) for piece in entries if piece.strip()])
     return expression


-def read_suffix(data_dir):
-    with utf8open(path.join(data_dir, 'suffix.txt')) as file_:
-        entries = file_.read().split('\n')
-        expression = '|'.join([piece + '$' for piece in entries if piece.strip()])
+def read_suffix(fileobj):
+    entries = fileobj.read().split('\n')
+    expression = '|'.join([piece + '$' for piece in entries if piece.strip()])
     return expression


-def read_infix(data_dir):
-    with utf8open(path.join(data_dir, 'infix.txt')) as file_:
-        entries = file_.read().split('\n')
-        expression = '|'.join([piece for piece in entries if piece.strip()])
+def read_infix(fileobj):
+    entries = fileobj.read().split('\n')
+    expression = '|'.join([piece for piece in entries if piece.strip()])
     return expression


-def read_tokenization(lang):
-    loc = path.join(DATA_DIR, lang, 'tokenization')
-    entries = []
-    seen = set()
-    with utf8open(loc) as file_:
-        for line in file_:
-            line = line.strip()
-            if line.startswith('#'):
-                continue
-            if not line:
-                continue
-            pieces = line.split()
-            chunk = pieces.pop(0)
-            assert chunk not in seen, chunk
-            seen.add(chunk)
-            entries.append((chunk, list(pieces)))
-            if chunk[0].isalpha() and chunk[0].islower():
-                chunk = chunk[0].title() + chunk[1:]
-                pieces[0] = pieces[0][0].title() + pieces[0][1:]
-                seen.add(chunk)
-                entries.append((chunk, pieces))
-    return entries
+# def read_tokenization(lang):
+#     loc = path.join(DATA_DIR, lang, 'tokenization')
+#     entries = []
+#     seen = set()
+#     with utf8open(loc) as file_:
+#         for line in file_:
+#             line = line.strip()
+#             if line.startswith('#'):
+#                 continue
+#             if not line:
+#                 continue
+#             pieces = line.split()
+#             chunk = pieces.pop(0)
+#             assert chunk not in seen, chunk
+#             seen.add(chunk)
+#             entries.append((chunk, list(pieces)))
+#             if chunk[0].isalpha() and chunk[0].islower():
+#                 chunk = chunk[0].title() + chunk[1:]
+#                 pieces[0] = pieces[0][0].title() + pieces[0][1:]
+#                 seen.add(chunk)
+#                 entries.append((chunk, pieces))
+#     return entries


-def read_detoken_rules(lang): # Deprecated?
-    loc = path.join(DATA_DIR, lang, 'detokenize')
-    entries = []
-    with utf8open(loc) as file_:
-        for line in file_:
-            entries.append(line.strip())
-    return entries
+# def read_detoken_rules(lang): # Deprecated?
+#     loc = path.join(DATA_DIR, lang, 'detokenize')
+#     entries = []
+#     with utf8open(loc) as file_:
+#         for line in file_:
+#             entries.append(line.strip())
+#     return entries


 def align_tokens(ref, indices): # Deprecated, surely?
@@ -47,28 +47,27 @@ cdef class Vocab:
     '''A map container for a language's LexemeC structs.
     '''
     @classmethod
-    def from_dir(cls, data_dir, get_lex_attr=None):
-        if not path.exists(data_dir):
-            raise IOError("Directory %s not found -- cannot load Vocab." % data_dir)
-        if not path.isdir(data_dir):
-            raise IOError("Path %s is a file, not a dir -- cannot load Vocab." % data_dir)
+    def from_package(cls, package, get_lex_attr=None):
+        tag_map = package.load_utf8(json.load,
+            'vocab', 'tag_map.json')

-        tag_map = json.load(open(path.join(data_dir, 'tag_map.json')))
-        lemmatizer = Lemmatizer.from_dir(path.join(data_dir, '..'))
-        if path.exists(path.join(data_dir, 'serializer.json')):
-            serializer_freqs = json.load(open(path.join(data_dir, 'serializer.json')))
-        else:
-            serializer_freqs = None
+        lemmatizer = Lemmatizer.from_package(package)
+
+        serializer_freqs = package.load_utf8(json.load,
+            'vocab', 'serializer.json',
+            require=False)  # TODO: really optional?
+
         cdef Vocab self = cls(get_lex_attr=get_lex_attr, tag_map=tag_map,
                               lemmatizer=lemmatizer, serializer_freqs=serializer_freqs)

-        if path.exists(path.join(data_dir, 'strings.json')):
-            with io.open(path.join(data_dir, 'strings.json'), 'r', encoding='utf8') as file_:
-                self.strings.load(file_)
-        self.load_lexemes(path.join(data_dir, 'lexemes.bin'))
+        if package.has_file('vocab', 'strings.json'):  # TODO: really optional?
+            package.load_utf8(self.strings.load, 'vocab', 'strings.json')
+        self.load_lexemes(package.file_path('vocab', 'lexemes.bin'))

-        if path.exists(path.join(data_dir, 'vec.bin')):
-            self.vectors_length = self.load_vectors_from_bin_loc(path.join(data_dir, 'vec.bin'))
+        if package.has_file('vocab', 'vec.bin'):  # TODO: really optional?
+            self.vectors_length = self.load_vectors_from_bin_loc(
+                package.file_path('vocab', 'vec.bin'))
+
         return self

     def __init__(self, get_lex_attr=None, tag_map=None, lemmatizer=None, serializer_freqs=None):
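Editor's note — a hedged sketch, not part of the commit: the same package object is threaded through each component's from_package constructor, mirroring Language.__init__ earlier in this diff; note that calling Vocab.from_package directly leaves get_lex_attr unset, which Language normally supplies.

    from spacy.util import get_package
    from spacy.vocab import Vocab
    from spacy.tokenizer import Tokenizer

    package = get_package('en_default')
    vocab = Vocab.from_package(package)  # Language passes get_lex_attr=cls.default_lex_attrs()
    tokenizer = Tokenizer.from_package(package, vocab)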
13
tox.ini
Normal file
13
tox.ini
Normal file
|
@ -0,0 +1,13 @@
|
||||||
|
[tox]
|
||||||
|
envlist =
|
||||||
|
py27
|
||||||
|
py34
|
||||||
|
recreate = True
|
||||||
|
|
||||||
|
[testenv]
|
||||||
|
changedir = {envtmpdir}
|
||||||
|
deps =
|
||||||
|
pytest
|
||||||
|
commands =
|
||||||
|
python -m spacy.en.download
|
||||||
|
python -m pytest {toxinidir}/spacy/ -x --models --vectors --slow
|
32
venv.ps1
Normal file
32
venv.ps1
Normal file
|
@ -0,0 +1,32 @@
|
||||||
|
param (
|
||||||
|
[string]$python = $(throw "-python is required."),
|
||||||
|
[string]$install_mode = $(throw "-install_mode is required."),
|
||||||
|
[string]$pip_date,
|
||||||
|
[string]$compiler
|
||||||
|
)
|
||||||
|
|
||||||
|
$ErrorActionPreference = "Stop"
|
||||||
|
|
||||||
|
if(!(Test-Path -Path ".build"))
|
||||||
|
{
|
||||||
|
if($compiler -eq "mingw32")
|
||||||
|
{
|
||||||
|
virtualenv .build --system-site-packages --python $python
|
||||||
|
}
|
||||||
|
else
|
||||||
|
{
|
||||||
|
virtualenv .build --python $python
|
||||||
|
}
|
||||||
|
|
||||||
|
if($compiler)
|
||||||
|
{
|
||||||
|
"[build]`r`ncompiler=$compiler" | Out-File -Encoding ascii .\.build\Lib\distutils\distutils.cfg
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
.build\Scripts\activate.ps1
|
||||||
|
|
||||||
|
python build.py prepare $pip_date
|
||||||
|
python build.py $install_mode
|
||||||
|
python build.py test
|
||||||
|
exit $LASTEXITCODE
|
16
venv.sh
Executable file
16
venv.sh
Executable file
|
@ -0,0 +1,16 @@
|
||||||
|
#!/bin/bash
|
||||||
|
set -e
|
||||||
|
|
||||||
|
if [ ! -d ".build" ]; then
|
||||||
|
virtualenv .build --python $1
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ -d ".build/bin" ]; then
|
||||||
|
source .build/bin/activate
|
||||||
|
elif [ -d ".build/Scripts" ]; then
|
||||||
|
source .build/Scripts/activate
|
||||||
|
fi
|
||||||
|
|
||||||
|
python build.py prepare $3
|
||||||
|
python build.py $2
|
||||||
|
python build.py test
|
|
@@ -2,6 +2,7 @@ include ./header
 include ./mixins.jade

 - var Page = InitPage(Site, Authors.spacy, "home", '404')
+- Page.canonical_url = null
 - Page.is_error = true
 - Site.slogan = "404"
 - Page.active = {}

@@ -4,7 +4,7 @@ include ./meta.jade

 +WritePost(Meta)
     section.intro
-        p Natural Language Processing moves fast, so maintaining a good library means constantly throwing things away. Most libraries are failing badly at this, as academics hate to editorialize. This post explains the problem, why it's so damaging, and why I wrote #[a(href="http://spacy.io") spaCy] to do things differently.
+        p Natural Language Processing moves fast, so maintaining a good library means constantly throwing things away. Most libraries are failing badly at this, as academics hate to editorialize. This post explains the problem, why it's so damaging, and why I wrote #[a(href="https://spacy.io") spaCy] to do things differently.

     p Imagine: you try to use Google Translate, but it asks you to first select which model you want. The new, awesome deep-learning model is there, but so are lots of others. You pick one that sounds fancy, but it turns out it's a 20-year old experimental model trained on a corpus of oven manuals. When it performs little better than chance, you can't even tell from its output. Of course, Google Translate would not do this to you. But most Natural Language Processing libraries do, and it's terrible.

@@ -12,7 +12,7 @@ include ./meta.jade

     p Have a look through the #[a(href="http://gate.ac.uk/sale/tao/split.html") GATE software]. There's a lot there, developed over 12 years and many person-hours. But there's approximately zero curation. The philosophy is just to provide things. It's up to you to decide what to use.

-    p This is bad. It's bad to provide an implementation of #[a(href="https://gate.ac.uk/sale/tao/splitch18.html") MiniPar], and have it just...sit there, with no hint that it's 20 years old and should not be used. The RASP parser, too. Why are these provided? Worse, why is there no warning? The #[a(href="http://webdocs.cs.ualberta.ca/~lindek/minipar.htm") Minipar homepage] puts the software in the right context:
+    p This is bad. It's bad to provide an implementation of #[a(href="https://gate.ac.uk/sale/tao/splitch18.html") MiniPar], and have it just...sit there, with no hint that it's 20 years old and should not be used. The RASP parser, too. Why are these provided? Worse, why is there no warning? The #[a(href="https://web.archive.org/web/20150907234221/http://webdocs.cs.ualberta.ca/~lindek/minipar.htm") Minipar homepage] puts the software in the right context:

     blockquote
         p MINIPAR is a broad-coverage parser for the English language. An evaluation with the SUSANNE corpus shows that MINIPAR achieves about 88% precision and 80% recall with respect to dependency relationships. MINIPAR is very efficient, #[strong on a Pentium II 300 with 128MB memory], it parses about 300 words per second.

@@ -23,13 +23,13 @@ include ./meta.jade

     h3 Why I didn't contribute to NLTK

-    p Various people have asked me why I decided to make a new Python NLP library, #[a(href="http://spacy.io") spaCy], instead of supporting the #[a(href="http://nltk.org") NLTK] project. This is the main reason. You can't contribute to a project if you believe that the first thing that they should do is throw almost all of it away. You should just make your own project, which is what I did.
+    p Various people have asked me why I decided to make a new Python NLP library, #[a(href="https://spacy.io") spaCy], instead of supporting the #[a(href="http://nltk.org") NLTK] project. This is the main reason. You can't contribute to a project if you believe that the first thing that they should do is throw almost all of it away. You should just make your own project, which is what I did.

     p Have a look through #[a(href="http://www.nltk.org/py-modindex.html") the module list of NLTK]. It looks like there's a lot there, but there's not. What NLTK has is a decent tokenizer, some passable stemmers, a good implementation of the Punkt sentence boundary detector (after #[a(href="http://joelnothman.com/") Joel Nothman] rewrote it), some visualization tools, and some wrappers for other libraries. Nothing else is of any use.

     p For instance, consider #[code nltk.parse]. You might think that amongst all this code there was something that could actually predict the syntactic structure of a sentence for you, but you would be wrong. There are wrappers for the BLLIP and Stanford parsers, and since March there's been an implementation of Nivre's 2003 transition-based dependency parser. Unfortunately no model is provided for it, as they rely on an external wrapper of an external learner, which is unsuitable for the structure of their problem. So the implementation is too slow to be actually useable.

-    p This problem is totally avoidable, if you just sit down and write good code, instead of stitching together external dependencies. I pointed NLTK to my tutorial describing #[a(href="http://spacy.io/blog/parsing-english-in-python/") how to implement a modern dependency parser], which includes a BSD-licensed implementation in 500 lines of Python. I was told "thanks but no thanks", and #[a(href="https://github.com/nltk/nltk/issues/694") the issue was abruptly closed]. Another researcher's offer from 2012 to implement this type of model also went #[a(href="http://arxiv.org/pdf/1409.7386v1.pdf") unanswered].
+    p This problem is totally avoidable, if you just sit down and write good code, instead of stitching together external dependencies. I pointed NLTK to my tutorial describing #[a(href="https://spacy.io/blog/parsing-english-in-python") how to implement a modern dependency parser], which includes a BSD-licensed implementation in 500 lines of Python. I was told "thanks but no thanks", and #[a(href="https://github.com/nltk/nltk/issues/694") the issue was abruptly closed]. Another researcher's offer from 2012 to implement this type of model also went #[a(href="http://arxiv.org/pdf/1409.7386v1.pdf") unanswered].

-    p The story in #[code nltk.tag] is similar. There are plenty of wrappers, for the external libraries that have actual taggers. The only actual tagger model they distribute is #[a(href="http://spacy.io/blog/part-of-speech-POS-tagger-in-python/") terrible]. Now it seems that #[a(href="https://github.com/nltk/nltk/issues/1063") NLTK does not even know how its POS tagger was trained]. The model is just this .pickle file that's been passed around for 5 years, its origins lost to time. It's not okay to offer this to people, to recommend they use it.
+    p The story in #[code nltk.tag] is similar. There are plenty of wrappers, for the external libraries that have actual taggers. The only actual tagger model they distribute is #[a(href="https://spacy.io/blog/part-of-speech-POS-tagger-in-python") terrible]. Now it seems that #[a(href="https://github.com/nltk/nltk/issues/1063") NLTK does not even know how its POS tagger was trained]. The model is just this .pickle file that's been passed around for 5 years, its origins lost to time. It's not okay to offer this to people, to recommend they use it.

     p I think open source software should be very careful to make its limitations clear. It's a disservice to provide something that's much less useful than you imply. It's like offering your friend a lift and then not showing up. It's totally fine to not do something – so long as you never suggested you were going to do it. There are ways to do worse than nothing.
@@ -2,7 +2,7 @@ include ../../header.jade
 include ./meta.jade

 mixin Displacy(sentence, caption_text, height)
-    - var url = "http://api.spacy.io/displacy/?full=" + sentence.replace(" ", "%20")
+    - var url = "https://api.spacy.io/displacy/?full=" + sentence.replace(/\s+/g, "%20")

     .displacy
         iframe.displacy(src="/resources/displacy/robots.html" height=height)

@@ -20,7 +20,7 @@ mixin Displacy(sentence, caption_text, height)

     p A syntactic dependency parse is a kind of shallow meaning representation. It's an important piece of many language understanding and text processing technologies. Now that these representations can be computed quickly, and with increasingly high accuracy, they're being used in lots of applications – translation, sentiment analysis, and summarization are major application areas.

-    p I've been living and breathing similar representations for most of my career. But there's always been a problem: talking about these things is tough. Most people haven't thought much about grammatical structure, and the idea of them is inherently abstract. When I left academia to write #[a(href="http://spaCy.io") spaCy], I knew I wanted a good visualizer. Unfortunately, I also knew I'd never be the one to write it. I'm deeply graphically challenged. Fortunately, when working with #[a(href="http://ines.io") Ines] to build this site, she really nailed the problem, with a solution I'd never have thought of. I really love the result, which we're calling #[a(href="http://api.spacy.io/displacy") displaCy]:
+    p I've been living and breathing similar representations for most of my career. But there's always been a problem: talking about these things is tough. Most people haven't thought much about grammatical structure, and the idea of them is inherently abstract. When I left academia to write #[a(href="https://spacy.io") spaCy], I knew I wanted a good visualizer. Unfortunately, I also knew I'd never be the one to write it. I'm deeply graphically challenged. Fortunately, when working with #[a(href="http://ines.io") Ines] to build this site, she really nailed the problem, with a solution I'd never have thought of. I really love the result, which we're calling #[a(href="https://api.spacy.io/displacy") displaCy]:

     +Displacy("Robots in popular culture are there to remind us of the awesomeness of unbounded human agency", "Click the button to full-screen and interact, or scroll to see the full parse.", 325)

@@ -40,7 +40,7 @@ mixin Displacy(sentence, caption_text, height)

     p To me, this seemed like witchcraft, or a hack at best. But I was quickly won over: if all we do is declare the data and the relationships, in standards-compliant HTML and CSS, then we can simply step back and let the browser do its job. We know the code will be small, the layout will work on a variety of display, and we'll have a ready separation of style and content. For long output, we simply let the graphic overflow, and let users scroll.

-    p What I'm particularly excited about is the potential for displaCy as an #[a(href="http://api.spacy.io/displacy/?manual=Robots%20in%20popular%20culture%20are%20there%20to%20remind%20us%20of%20the%20awesomeness%20of%20unbounded%20human%20agency" target="_blank") annotation tool]. It may seem unintuitive at first, but I think it will be much better to annotate texts the way the parser operates, with a small set of actions and a stack, than by selecting arcs directly. Why? A few reasons:
+    p What I'm particularly excited about is the potential for displaCy as an #[a(href="https://api.spacy.io/displacy/?manual=Robots%20in%20popular%20culture%20are%20there%20to%20remind%20us%20of%20the%20awesomeness%20of%20unbounded%20human%20agency" target="_blank") annotation tool]. It may seem unintuitive at first, but I think it will be much better to annotate texts the way the parser operates, with a small set of actions and a stack, than by selecting arcs directly. Why? A few reasons:

     ul
         li You're always asked a question. You don't have to decide-what-to-decide.

@@ -9,4 +9,4 @@
 - Meta.links[0].name = 'Reddit'
 - Meta.links[0].title = 'Discuss on Reddit'
 - Meta.links[0].url = "https://www.reddit.com/r/programming/comments/3hoj0b/displaying_linguistic_structure_with_css/"
-- Meta.image = "http://spacy.io/resources/img/displacy_screenshot.jpg"
+- Meta.image = "https://spacy.io/resources/img/displacy_screenshot.jpg"
@@ -10,7 +10,7 @@ include ./meta.jade

     p It turns out that almost anything we say could mean many many different things, but we don't notice because almost all of those meanings would be weird or stupid or just not possible. If I say:

-    p.example #[a(href="http://api.spacy.io/displacy/?full=I%20saw%20a%20movie%20in%20a%20dress" target="_blank") I saw a movie in a dress]
+    p.example #[a(href="https://api.spacy.io/displacy/?full=I%20saw%20a%20movie%20in%20a%20dress" target="_blank") I saw a movie in a dress]

     p Would you ever ask me,

@@ -18,7 +18,7 @@ include ./meta.jade

     p It's weird to even think of that. But a computer just might, because there are other cases like:

-    p.example #[a(href="http://api.spacy.io/displacy/?full=The%20TV%20showed%20a%20girl%20in%20a%20dress" target="_blank") The TV showed a girl in a dress]
+    p.example #[a(href="https://api.spacy.io/displacy/?full=The%20TV%20showed%20a%20girl%20in%20a%20dress" target="_blank") The TV showed a girl in a dress]

     p Where the words hang together in the other way. People used to think that the answer was to tell the computer lots and lots of facts. But then you wake up one day and you're writing facts like #[em movies do not wear dresses], and you wonder where it all went wrong. Actually it's even worse than that. Not only are there too many facts, most of them are not even really facts! #[a(href="https://en.wikipedia.org/wiki/Cyc") People really tried this]. We've found that the world is made up of #[em if]s and #[em but]s.

@@ -3,7 +3,7 @@
 - Meta.headline = "Statistical NLP in Basic English"
 - Meta.description = "When I was little, my favorite TV shows all had talking computers. Now I’m big and there are still no talking computers, so I’m trying to make some myself. Well, we can make computers say things. But when we say things back, they don’t really understand. Why not?"
 - Meta.date = "2015-08-24"
-- Meta.url = "/blog/eli5-computers-learn-reading/"
+- Meta.url = "/blog/eli5-computers-learn-reading"
 - Meta.links = []
 //- Meta.links[0].id = 'reddit'
 //- Meta.links[0].name = "Reddit"
@ -92,13 +92,13 @@ include ./meta.jade
|
||||||
|
|
||||||
h3 Part-of-speech Tagger
|
h3 Part-of-speech Tagger
|
||||||
|
|
||||||
p In 2013, I wrote a blog post describing #[a(href="/blog/part-of-speech-POS-tagger-in-python/") how to write a good part of speech tagger]. My recommendation then was to use greedy decoding with the averaged perceptron. I think this is still the best approach, so it's what I implemented in spaCy.
|
p In 2013, I wrote a blog post describing #[a(href="/blog/part-of-speech-POS-tagger-in-python") how to write a good part of speech tagger]. My recommendation then was to use greedy decoding with the averaged perceptron. I think this is still the best approach, so it's what I implemented in spaCy.
|
||||||
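For readers who want a concrete picture of "greedy decoding with the averaged perceptron", here is a minimal sketch of the idea. It is illustrative only: the feature templates and training loop are not spaCy's, and weight averaging is omitted for brevity.

    # Minimal sketch of a greedy perceptron tagger (illustrative only; not
    # spaCy's feature set or code, and weight averaging is left out).
    from collections import defaultdict


    class GreedyPerceptronTagger(object):
        def __init__(self, tags):
            self.tags = list(tags)
            self.weights = defaultdict(float)   # (feature, tag) -> weight

        def features(self, words, i, prev_tag):
            word = words[i]
            return [
                "bias",
                "word=" + word.lower(),
                "suffix=" + word[-3:],
                "prev_tag=" + prev_tag,
                "title_case=" + str(word.istitle()),
            ]

        def predict(self, feats):
            scores = {t: sum(self.weights[(f, t)] for f in feats) for t in self.tags}
            return max(self.tags, key=scores.get)

        def update(self, truth, guess, feats):
            if truth == guess:
                return
            for f in feats:
                self.weights[(f, truth)] += 1.0
                self.weights[(f, guess)] -= 1.0

        def tag(self, words):
            tags, prev = [], "-START-"
            for i in range(len(words)):
                guess = self.predict(self.features(words, i, prev))
                tags.append(guess)
                prev = guess            # greedy: commit to the guess and move on
            return tags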
|
|
||||||
p The tutorial also recommends the use of Brown cluster features, and case normalization features, as these make the model more robust and domain independent. spaCy's tagger makes heavy use of these features.
|
p The tutorial also recommends the use of Brown cluster features, and case normalization features, as these make the model more robust and domain independent. spaCy's tagger makes heavy use of these features.
|
||||||
|
|
||||||
h3 Dependency Parser
|
h3 Dependency Parser
|
||||||
|
|
||||||
p The parser uses the algorithm described in my #[a(href="/blog/parsing-english-in-python/") 2014 blog post]. This algorithm, shift-reduce dependency parsing, is becoming widely adopted due to its compelling speed/accuracy trade-off.
|
p The parser uses the algorithm described in my #[a(href="/blog/parsing-english-in-python") 2014 blog post]. This algorithm, shift-reduce dependency parsing, is becoming widely adopted due to its compelling speed/accuracy trade-off.
|
||||||
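As a rough illustration of what "shift-reduce dependency parsing" means in code, here is a toy arc-standard transition loop. The score_moves callable stands in for the statistical model; spaCy's actual transition system, features and scoring are much richer.

    # Toy arc-standard shift-reduce loop, showing only the control structure.
    # `score_moves` is a placeholder for the model that picks the best move.
    def shift_reduce_parse(words, score_moves):
        stack, buffer_, arcs = [], list(range(len(words))), []
        while buffer_ or len(stack) > 1:
            valid = []
            if buffer_:
                valid.append("SHIFT")
            if len(stack) >= 2:
                valid.extend(["LEFT-ARC", "RIGHT-ARC"])
            move = score_moves(stack, buffer_, valid)   # choose a valid move
            if move == "SHIFT":
                stack.append(buffer_.pop(0))
            elif move == "LEFT-ARC":
                dep = stack.pop(-2)                     # second item becomes dependent
                arcs.append((stack[-1], dep))           # (head, dependent)
            else:                                       # RIGHT-ARC
                dep = stack.pop()                       # top item becomes dependent
                arcs.append((stack[-1], dep))
        return arcs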
|
|
||||||
p Some quick details about spaCy's take on this, for those who happen to know these models well. I'll write up a better description shortly.
|
p Some quick details about spaCy's take on this, for those who happen to know these models well. I'll write up a better description shortly.
|
||||||
|
|
||||||
|
|
|
@ -33,7 +33,7 @@ include ../header.jade
|
||||||
|
|
||||||
+WritePage(Site, Authors.spacy, Page)
|
+WritePage(Site, Authors.spacy, Page)
|
||||||
section.intro.profile
|
section.intro.profile
|
||||||
p A lot of work has gone into #[strong spaCy], but no magic. We plan to keep no secrets. We want you to be able to #[a(href="/blog/spacy-now-mit") build your business] on #[strong spaCy] – so we want you to understand it. Tell us whether you do. #[span.social #[a(href="//twitter.com/" + Site.twitter, target="_blank") Twitter] #[a(href="mailto:contact@spacy.io") Contact us]]
|
p A lot of work has gone into #[strong spaCy], but no magic. We plan to keep no secrets. We want you to be able to #[a(href="/blog/spacy-now-mit") build your business] on #[strong spaCy] – so we want you to understand it. Tell us whether you do. #[span.social #[a(href="https://twitter.com/" + Site.twitter, target="_blank") Twitter] #[a(href="mailto:contact@spacy.io") Contact us]]
|
||||||
nav(role='navigation')
|
nav(role='navigation')
|
||||||
ul
|
ul
|
||||||
li #[a.button(href='#blogs') Blog]
|
li #[a.button(href='#blogs') Blog]
|
||||||
|
|
|
@ -19,4 +19,4 @@ include ./meta.jade
|
||||||
+TweetThis("Computers don't understand text. This is unfortunate, because that's what the web is mostly made of.", Meta.url)
|
+TweetThis("Computers don't understand text. This is unfortunate, because that's what the web is mostly made of.", Meta.url)
|
||||||
|
|
||||||
p If none of that made any sense to you, here's the gist of it. Computers don't understand text. This is unfortunate, because that's what the web almost entirely consists of. We want to recommend people text based on other text they liked. We want to shorten text to display it on a mobile screen. We want to aggregate it, link it, filter it, categorise it, generate it and correct it.
|
p If none of that made any sense to you, here's the gist of it. Computers don't understand text. This is unfortunate, because that's what the web almost entirely consists of. We want to recommend people text based on other text they liked. We want to shorten text to display it on a mobile screen. We want to aggregate it, link it, filter it, categorise it, generate it and correct it.
|
||||||
p spaCy provides a library of utility functions that help programmers build such products. It's commercial open source software: you can either use it under the AGPL, or you can buy a commercial license under generous terms (Note: #[a(href="/blog/spacy-now-mit/") spaCy is now licensed under MIT]).
|
p spaCy provides a library of utility functions that help programmers build such products. It's commercial open source software: you can either use it under the AGPL, or you can buy a commercial license under generous terms (Note: #[a(href="/blog/spacy-now-mit") spaCy is now licensed under MIT]).
|
||||||
|
|
|
@ -5,7 +5,7 @@ include ../../header.jade
|
||||||
+WritePost(Meta)
|
+WritePost(Meta)
|
||||||
//# AGPL not free enough: spaCy now under MIT, offering adaptation as a service
|
//# AGPL not free enough: spaCy now under MIT, offering adaptation as a service
|
||||||
|
|
||||||
p Three big announcements for #[a(href="http://spacy.io") spaCy], a Python library for industrial-strength natural language processing (NLP).
|
p Three big announcements for #[a(href="https://spacy.io") spaCy], a Python library for industrial-strength natural language processing (NLP).
|
||||||
|
|
||||||
ol
|
ol
|
||||||
li The founding team is doubling in size: I'd like to welcome my new co-founder, #[a(href="https://www.linkedin.com/profile/view?id=ADEAAADkZcYBnipeHOAS6HqrDBPK1IzAAVI64ds&authType=NAME_SEARCH&authToken=YYZ1&locale=en_US&srchid=3310922891443387747239&srchindex=1&srchtotal=16&trk=vsrp_people_res_name&trkInfo=VSRPsearchId%3A3310922891443387747239%2CVSRPtargetId%3A14968262%2CVSRPcmpt%3Aprimary%2CVSRPnm%3Atrue%2CauthType%3ANAME_SEARCH") Henning Peters].
|
li The founding team is doubling in size: I'd like to welcome my new co-founder, #[a(href="https://www.linkedin.com/profile/view?id=ADEAAADkZcYBnipeHOAS6HqrDBPK1IzAAVI64ds&authType=NAME_SEARCH&authToken=YYZ1&locale=en_US&srchid=3310922891443387747239&srchindex=1&srchtotal=16&trk=vsrp_people_res_name&trkInfo=VSRPsearchId%3A3310922891443387747239%2CVSRPtargetId%3A14968262%2CVSRPcmpt%3Aprimary%2CVSRPnm%3Atrue%2CauthType%3ANAME_SEARCH") Henning Peters].
|
||||||
|
|
|
@ -52,7 +52,7 @@ details
|
||||||
|
|
||||||
p The part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank tag set. We also map the tags to the simpler Google Universal POS Tag set.
|
p The part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank tag set. We also map the tags to the simpler Google Universal POS Tag set.
|
||||||
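To make the mapping concrete, here is a small hand-picked sample of Penn Treebank to Universal POS mappings, assuming the 12-tag Google Universal set; it is not spaCy's full table.

    # A few example Penn Treebank -> Universal POS mappings (illustrative subset).
    PTB_TO_UNIVERSAL = {
        "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
        "VB": "VERB", "VBD": "VERB", "VBG": "VERB", "VBN": "VERB",
        "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
        "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
        "IN": "ADP", "DT": "DET", "CC": "CONJ", "CD": "NUM",
    }

    def to_universal(ptb_tag):
        # Fall back to the catch-all "X" tag for anything unmapped.
        return PTB_TO_UNIVERSAL.get(ptb_tag, "X")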
|
|
||||||
p Details #[a(href="https://github.com/honnibal/spaCy/blob/master/spacy/en/pos.pyx#L124") here].
|
p Details #[a(href="https://github.com/honnibal/spaCy/blob/master/spacy/tagger.pyx") here].
|
||||||
|
|
||||||
details
|
details
|
||||||
summary: h4 Lemmatization
|
summary: h4 Lemmatization
|
||||||
|
|
|
@ -2,20 +2,20 @@
|
||||||
- Site.name = "spaCy.io"
|
- Site.name = "spaCy.io"
|
||||||
- Site.slogan = "Build Tomorrow's Language Technologies"
|
- Site.slogan = "Build Tomorrow's Language Technologies"
|
||||||
- Site.description = "spaCy is a library for industrial-strength text processing in Python. If you're a small company doing NLP, we want spaCy to seem like a minor miracle."
|
- Site.description = "spaCy is a library for industrial-strength text processing in Python. If you're a small company doing NLP, we want spaCy to seem like a minor miracle."
|
||||||
- Site.image = "http://spacy.io/resources/img/social.png"
|
- Site.image = "https://spacy.io/resources/img/social.png"
|
||||||
- Site.image_small = "http://spacy.io/resources/img/social_small.png"
|
- Site.image_small = "https://spacy.io/resources/img/social_small.png"
|
||||||
- Site.twitter = "spacy_io"
|
- Site.twitter = "spacy_io"
|
||||||
- Site.url = "http://spacy.io"
|
- Site.url = "https://spacy.io"
|
||||||
-
|
-
|
||||||
- Authors = {"matt": {}, "spacy": {}};
|
- Authors = {"matt": {}, "spacy": {}};
|
||||||
- Authors.matt.name = "Matthew Honnibal"
|
- Authors.matt.name = "Matthew Honnibal"
|
||||||
- Authors.matt.bio = "Matthew Honnibal is the author of the <a href=\"http://spacy.io\">spaCy</a> software and the sole founder of its parent company. He studied linguistics as an undergrad, and never thought he'd be a programmer. By 2009 he had a PhD in computer science, and in 2014 he left academia to found Syllogism Co. He's from Sydney and lives in Berlin."
|
- Authors.matt.bio = "Matthew Honnibal is the author of the <a href=\"https://spacy.io\">spaCy</a> software and the sole founder of its parent company. He studied linguistics as an undergrad, and never thought he'd be a programmer. By 2009 he had a PhD in computer science, and in 2014 he left academia to found Syllogism Co. He's from Sydney and lives in Berlin."
|
||||||
|
|
||||||
- Authors.matt.image = "/resources/img/matt.png"
|
- Authors.matt.image = "/resources/img/matt.png"
|
||||||
- Authors.matt.twitter = "honnibal"
|
- Authors.matt.twitter = "honnibal"
|
||||||
-
|
-
|
||||||
- Authors.spacy.name = "SpaCy.io"
|
- Authors.spacy.name = "SpaCy.io"
|
||||||
- Authors.spacy.bio = "<a href=\"http://spacy.io\">spaCy</a> is a library for industrial-strength natural language processing in Python and Cython. It features state-of-the-art speed and accuracy, a concise API, and great documentation. If you're a small company doing NLP, we want spaCy to seem like a minor miracle."
|
- Authors.spacy.bio = "<a href=\"https://spacy.io\">spaCy</a> is a library for industrial-strength natural language processing in Python and Cython. It features state-of-the-art speed and accuracy, a concise API, and great documentation. If you're a small company doing NLP, we want spaCy to seem like a minor miracle."
|
||||||
- Authors.spacy.image = "/resources/img/social_small.png"
|
- Authors.spacy.image = "/resources/img/social_small.png"
|
||||||
- Authors.spacy.twitter = "spacy_io"
|
- Authors.spacy.twitter = "spacy_io"
|
||||||
|
|
||||||
|
@ -27,10 +27,11 @@
|
||||||
- Page.active[type] = true;
|
- Page.active[type] = true;
|
||||||
- Page.links = [];
|
- Page.links = [];
|
||||||
- if (type == "home") {
|
- if (type == "home") {
|
||||||
- Page.url = "";
|
- Page.url = "/";
|
||||||
- } else {
|
- } else {
|
||||||
- Page.url = "/" + type;
|
- Page.url = "/" + type;
|
||||||
- }
|
- }
|
||||||
|
- Page.canonical_url = Site.url + Page.url;
|
||||||
-
|
-
|
||||||
- // Set defaults
|
- // Set defaults
|
||||||
- Page.description = Site.description;
|
- Page.description = Site.description;
|
||||||
|
@ -57,6 +58,7 @@
|
||||||
- Page.description = Meta.description
|
- Page.description = Meta.description
|
||||||
- Page.date = Meta.date
|
- Page.date = Meta.date
|
||||||
- Page.url = Meta.url
|
- Page.url = Meta.url
|
||||||
|
- Page.canonical_url = Site.url + Page.url;
|
||||||
- Page.active["blog"] = true
|
- Page.active["blog"] = true
|
||||||
- Page.links = Meta.links
|
- Page.links = Meta.links
|
||||||
- if (Meta.image != null) {
|
- if (Meta.image != null) {
|
||||||
|
@ -98,6 +100,8 @@ mixin WritePage(Site, Author, Page)
|
||||||
meta(property="og:site_name" content=Site.name)
|
meta(property="og:site_name" content=Site.name)
|
||||||
meta(property="article:published_time" content=getDate(Page.date).timestamp)
|
meta(property="article:published_time" content=getDate(Page.date).timestamp)
|
||||||
link(rel="stylesheet" href="/resources/css/style.css")
|
link(rel="stylesheet" href="/resources/css/style.css")
|
||||||
|
if Page.canonical_url
|
||||||
|
link(rel="canonical" href=Page.canonical_url)
|
||||||
|
|
||||||
//[if lt IE 9]><script src="//html5shiv.googlecode.com/svn/trunk/html5.js"></script><![endif]
|
//[if lt IE 9]><script src="//html5shiv.googlecode.com/svn/trunk/html5.js"></script><![endif]
|
||||||
|
|
||||||
|
@ -114,7 +118,7 @@ mixin WritePage(Site, Author, Page)
|
||||||
nav(role="navigation")
|
nav(role="navigation")
|
||||||
li(class={active: Page.active.home}): a(href="/") Home
|
li(class={active: Page.active.home}): a(href="/") Home
|
||||||
li(class={active: Page.active.docs}): a(href="/docs") Docs
|
li(class={active: Page.active.docs}): a(href="/docs") Docs
|
||||||
li: a(href="http://api.spacy.io/displacy", target="_blank") Demo
|
li: a(href="https://api.spacy.io/displacy", target="_blank") Demo
|
||||||
li(class={active: Page.active.blog}): a(href="/blog") Blog
|
li(class={active: Page.active.blog}): a(href="/blog") Blog
|
||||||
main#content
|
main#content
|
||||||
block
|
block
|
||||||
|
@ -158,10 +162,10 @@ mixin WritePost(Meta)
|
||||||
+WriteAuthorBio(Author)
|
+WriteAuthorBio(Author)
|
||||||
|
|
||||||
mixin WriteByline(Author, Meta)
|
mixin WriteByline(Author, Meta)
|
||||||
.subhead by #[a(href="//twitter.com/" + Author.twitter, rel="author" target="_blank") #{Author.name}] on #[time #{getDate(Meta.date).fulldate}]
|
.subhead by #[a(href="https://twitter.com/" + Author.twitter, rel="author" target="_blank") #{Author.name}] on #[time #{getDate(Meta.date).fulldate}]
|
||||||
|
|
||||||
mixin WriteShareLinks(headline, url, twitter, links)
|
mixin WriteShareLinks(headline, url, twitter, links)
|
||||||
a.button.button-twitter(href="http://twitter.com/share?text=" + headline + "&url=" + Site.url + url + "&via=" + twitter title="Share on Twitter" target="_blank")
|
a.button.button-twitter(href="https://twitter.com/share?text=" + headline.replace(/\s+/g, "%20") + "&url=" + Site.url + url + "&via=" + twitter title="Share on Twitter" target="_blank")
|
||||||
| Share on Twitter
|
| Share on Twitter
|
||||||
if links
|
if links
|
||||||
.discuss
|
.discuss
|
||||||
|
@ -174,11 +178,11 @@ mixin WriteShareLinks(headline, url, twitter, links)
|
||||||
| Discuss on #{link.name}
|
| Discuss on #{link.name}
|
||||||
|
|
||||||
mixin TweetThis(text, url)
|
mixin TweetThis(text, url)
|
||||||
p #[span #{text} #[a.share(href='http://twitter.com/share?text="' + text + '"&url=' + Site.url + url + '&via=' + Site.twitter title='Share on Twitter' target='_blank') Tweet]]
|
p #[span #{text} #[a.share(href='https://twitter.com/share?text="' + text.replace(/\s+/g, "%20") + '"&url=' + Site.url + url + '&via=' + Site.twitter title='Share on Twitter' target='_blank') Tweet]]
|
||||||
|
|
||||||
mixin WriteAuthorBio(Author)
|
mixin WriteAuthorBio(Author)
|
||||||
section.intro.profile
|
section.intro.profile
|
||||||
p #[img(src=Author.image)] !{Author.bio} #[span.social #[a(href="//twitter.com/" + Author.twitter target="_blank") Twitter]]
|
p #[img(src=Author.image alt=Author.name)] !{Author.bio} #[span.social #[a(href="https://twitter.com/" + Author.twitter target="_blank") Twitter]]
|
||||||
|
|
||||||
|
|
||||||
- var getDate = function(input) {
|
- var getDate = function(input) {
|
||||||
|
|
|
@ -1,5 +1,5 @@
|
||||||
mixin Displacy(sentence, caption_text, height)
|
mixin Displacy(sentence, caption_text, height)
|
||||||
- var url = "http://api.spacy.io/displacy/?full=" + sentence.replace(" ", "%20")
|
- var url = "https://api.spacy.io/displacy/?full=" + sentence.replace(/\s+/g, "%20")
|
||||||
|
|
||||||
.displacy
|
.displacy
|
||||||
iframe.displacy(src="/resources/displacy/displacy_demo.html" height=height)
|
iframe.displacy(src="/resources/displacy/displacy_demo.html" height=height)
|
||||||
|
@ -17,6 +17,6 @@ mixin Displacy(sentence, caption_text, height)
|
||||||
275
|
275
|
||||||
)
|
)
|
||||||
|
|
||||||
p #[a(href="http://api.spacy.io/displacy") displaCy] lets you peek inside spaCy's syntactic parser, as it reads a sentence word-by-word. By repeatedly choosing from a small set of actions, it links the words together according to their syntactic structure. This type of representation powers a wide range of technologies, from translation and summarization, to sentiment analysis and algorithmic trading. #[a(href="/blog/displacy") Read more.]
|
p #[a(href="https://api.spacy.io/displacy") displaCy] lets you peek inside spaCy's syntactic parser, as it reads a sentence word-by-word. By repeatedly choosing from a small set of actions, it links the words together according to their syntactic structure. This type of representation powers a wide range of technologies, from translation and summarization, to sentiment analysis and algorithmic trading. #[a(href="/blog/displacy") Read more.]
|
||||||
|
|
||||||
|
|
||||||
|
|
|
@ -58,4 +58,4 @@ mixin example(name)
|
||||||
ul
|
ul
|
||||||
li: a(href="/docs#api") API documentation
|
li: a(href="/docs#api") API documentation
|
||||||
li: a(href="/docs#tutorials") Tutorials
|
li: a(href="/docs#tutorials") Tutorials
|
||||||
li: a(href="/docs/#spec") Annotation specs
|
li: a(href="/docs#spec") Annotation specs
|
||||||
|
|
|
@ -1,6 +1,6 @@
|
||||||
- var Meta = {}
|
- var Meta = {}
|
||||||
- Meta.author_id = 'spacy'
|
- Meta.author_id = 'spacy'
|
||||||
- Meta.headline = "Tutorial: Adding a language to spaCy"
|
- Meta.headline = "Adding a language to spaCy"
|
||||||
- Meta.description = "Long awaited documentation for adding a language to spaCy"
|
- Meta.description = "Long awaited documentation for adding a language to spaCy"
|
||||||
- Meta.date = "2015-08-18"
|
- Meta.date = "2015-08-18"
|
||||||
- Meta.url = "/tutorials/add-a-language"
|
- Meta.url = "/tutorials/add-a-language"
|
||||||
|
|
|
@ -0,0 +1,8 @@
|
||||||
|
- var Meta = {}
|
||||||
|
- Meta.headline = "Load new word vectors"
|
||||||
|
- Meta.description = "Word vectors allow simple similarity queries, and drive many NLP applications. This tutorial explains how to load custom word vectors into spaCy, to make use of task or data-specific representations."
|
||||||
|
- Meta.author_id = "matt"
|
||||||
|
- Meta.date = "2015-09-24"
|
||||||
|
- Meta.url = "/tutorials/load-new-word-vectors"
|
||||||
|
- Meta.active = { "blog": true }
|
||||||
|
- Meta.links = []
|
|
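As a rough sketch of what this tutorial covers, the snippet below loads vectors from an assumed plain-text file (one "word v1 v2 ... vN" entry per line) using the current spaCy API, nlp.vocab.set_vector. The 2015 tutorial used an older API, so treat the names and the file format here as illustrative.

    # Hypothetical sketch: load custom word vectors into a blank pipeline.
    # "custom_vectors.txt" is an assumed filename, not one from the tutorial.
    import numpy
    import spacy

    nlp = spacy.blank("en")
    with open("custom_vectors.txt") as file_:
        for line in file_:
            pieces = line.rstrip().split(" ")
            word = pieces[0]
            vector = numpy.asarray(pieces[1:], dtype="float32")
            nlp.vocab.set_vector(word, vector)

    doc = nlp("queen king")
    print(doc[0].similarity(doc[1]))   # assumes both words appear in the file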
@ -1,5 +1,5 @@
|
||||||
- var Meta = {}
|
- var Meta = {}
|
||||||
- Meta.headline = "Tutorial: Mark all adverbs, particularly for verbs of speech"
|
- Meta.headline = "Mark all adverbs, particularly for verbs of speech"
|
||||||
- Meta.author_id = 'matt'
|
- Meta.author_id = 'matt'
|
||||||
- Meta.description = "Let's say you're developing a proofreading tool, or possibly an IDE for writers. You're convinced by Stephen King's advice that adverbs are not your friend so you want to highlight all adverbs."
|
- Meta.description = "Let's say you're developing a proofreading tool, or possibly an IDE for writers. You're convinced by Stephen King's advice that adverbs are not your friend so you want to highlight all adverbs."
|
||||||
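A minimal sketch of the idea the description outlines, assuming a current spaCy install and the en_core_web_sm model (the tutorial's own code differed):

    # Highlight adverbs by upper-casing them; model name is an assumption.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("He quickly and rather quietly insisted that it was really his.")
    print(" ".join(t.text.upper() if t.pos_ == "ADV" else t.text for t in doc))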
- Meta.date = "2015-08-18"
|
- Meta.date = "2015-08-18"
|
||||||
|
|
|
@ -1,5 +1,5 @@
|
||||||
- var Meta = {}
|
- var Meta = {}
|
||||||
- Meta.headline = "Tutorial: Search Reddit for comments about Google doing something"
|
- Meta.headline = "Search Reddit for comments about Google doing something"
|
||||||
- Meta.description = "Example use of the spaCy NLP tools for data exploration. Here we will look for Reddit comments that describe Google doing something, i.e. discuss the company's actions. This is difficult, because other senses of \"Google\" now dominate usage of the word in conversation, particularly references to using Google products."
|
- Meta.description = "Example use of the spaCy NLP tools for data exploration. Here we will look for Reddit comments that describe Google doing something, i.e. discuss the company's actions. This is difficult, because other senses of \"Google\" now dominate usage of the word in conversation, particularly references to using Google products."
|
||||||
- Meta.author_id = "matt"
|
- Meta.author_id = "matt"
|
||||||
- Meta.date = "2015-08-18"
|
- Meta.date = "2015-08-18"
|
||||||
|
|
|
@ -4,7 +4,7 @@ include ./meta.jade
|
||||||
|
|
||||||
+WritePost(Meta)
|
+WritePost(Meta)
|
||||||
section.intro
|
section.intro
|
||||||
p #[a(href="http://spaCy.io") spaCy] is great for data exploration. Poking, prodding and sifting is fundamental to good data science. In this tutorial, we'll do a broad keword search of Twitter, and then sift through the live stream of tweets, zooming in on some topics and excluding others.
|
p #[a(href="https://spacy.io") spaCy] is great for data exploration. Poking, prodding and sifting is fundamental to good data science. In this tutorial, we'll do a broad keword search of Twitter, and then sift through the live stream of tweets, zooming in on some topics and excluding others.
|
||||||
|
|
||||||
p An example filter-function:
|
p An example filter-function:
|
||||||
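The filter-function itself is not part of this hunk; the snippet below is a hypothetical stand-in for the idea of sifting a keyword-matched tweet stream, with a made-up model name and made-up word lists.

    # Hypothetical filter-function: keep a tweet if its lemmas hit an include
    # list and miss an exclude list. Not the tutorial's actual code.
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def is_relevant(tweet_text, include=("parse", "tag"), exclude=("giveaway",)):
        lemmas = {tok.lemma_.lower() for tok in nlp(tweet_text)}
        return bool(lemmas & set(include)) and not lemmas & set(exclude)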
|
|
||||||
|
|
|
@ -1,5 +1,5 @@
|
||||||
- var Meta = {}
|
- var Meta = {}
|
||||||
- Meta.headline = "Tutorial: Finding Relevant Tweets"
|
- Meta.headline = "Finding Relevant Tweets"
|
||||||
- Meta.author_id = 'matt'
|
- Meta.author_id = 'matt'
|
||||||
- Meta.description = "In this tutorial, we will use word vectors to search for tweets about Jeb Bush. We'll do this by building up two word lists: one that represents the type of meanings in the Jeb Bush tweets, and another to help screen out irrelevant tweets that mention the common, ambiguous word 'bush'."
|
- Meta.description = "In this tutorial, we will use word vectors to search for tweets about Jeb Bush. We'll do this by building up two word lists: one that represents the type of meanings in the Jeb Bush tweets, and another to help screen out irrelevant tweets that mention the common, ambiguous word 'bush'."
|
||||||
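A sketch of the two-word-list approach the description mentions: score each tweet by word-vector similarity against an accept list and a reject list, and keep it only when the accept score wins. The model name and the word lists are placeholders, not the tutorial's.

    # Two-word-list relevance check via word-vector similarity (illustrative).
    import spacy

    nlp = spacy.load("en_core_web_md")          # a model that ships word vectors
    accept = [nlp(w)[0] for w in ("candidate", "election", "governor")]
    reject = [nlp(w)[0] for w in ("shrub", "hedge", "garden")]

    def looks_relevant(text):
        doc = [tok for tok in nlp(text) if tok.has_vector]
        if not doc:
            return False
        def best(words):
            return max(tok.similarity(w) for tok in doc for w in words)
        return best(accept) > best(reject)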
- Meta.date = "2015-08-18"
|
- Meta.date = "2015-08-18"
|
||||||
|
|