Merge branch 'henningpeters-headers'

Matthew Honnibal 2015-12-27 17:38:06 +01:00
commit 45afda5af3
46 changed files with 836 additions and 654 deletions

@@ -45,28 +45,21 @@ install:
# Upgrade to the latest version of pip to avoid it displaying warnings
# about it being out of date.
- "pip install --disable-pip-version-check --user --upgrade pip"
- "pip install --disable-pip-version-check --user -U pip"
# Install the build dependencies of the project. If some dependencies contain
# compiled extensions and are not provided as pre-built wheel packages,
# pip will build them from source using the MSVC compiler matching the
# target Python version and architecture
- "pip install --upgrade setuptools"
- "%CMD_IN_ENV% pip install cython fabric fabtools"
- "%CMD_IN_ENV% pip install -r requirements.txt"
- "%CMD_IN_ENV% python build.py prepare"
build_script:
# Build the compiled extension
- "%CMD_IN_ENV% python setup.py build_ext --inplace"
- ps: appveyor\download.ps1
- "tar -xzf corpora/en/wordnet.tar.gz"
- "%CMD_IN_ENV% python bin/init_model.py en lang_data/ corpora/ spacy/en/data"
- "%CMD_IN_ENV% python build.py pip"
test_script:
# Run the project tests
- "pip install pytest"
- "%CMD_IN_ENV% py.test spacy/ -x"
- "%CMD_IN_ENV% python build.py test"
after_test:
# If tests are successful, create binary packages for the project.

@@ -1,28 +1,25 @@
language: python
sudo: required
dist: precise
group: edge
python:
- "2.7"
- "3.4"
os:
- linux
python:
- "2.7"
- "3.4"
- "3.5"
env:
- PIP_DATE=2015-10-01 MODE=pip
- PIP_DATE=2015-10-01 MODE=setup-install
- PIP_DATE=2015-10-01 MODE=setup-develop
# install dependencies
install:
- "pip install --upgrade setuptools"
- "pip install cython fabric fabtools"
- "pip install -r requirements.txt"
- "python setup.py build_ext --inplace"
- "mkdir -p corpora/en"
- "cd corpora/en"
- "wget --no-check-certificate http://wordnetcode.princeton.edu/3.0/WordNet-3.0.tar.gz"
- "tar -xzf WordNet-3.0.tar.gz"
- "mv WordNet-3.0 wordnet"
- "cd ../../"
- "export PYTHONPATH=`pwd`"
- "python bin/init_model.py en lang_data/ corpora/ spacy/en/data"
- pip install --disable-pip-version-check -U pip
- python build.py prepare $PIP_DATE
# run tests
script:
- "py.test spacy/ -x"
- python build.py $MODE;
- python build.py test

@@ -0,0 +1 @@
recursive-include include *.h

@@ -1,14 +1,14 @@
Python 2.7 Windows build has been tested with the following toolchain:
- Python 2.7.10 :)
- Microsoft Visual C++ Compiler Package for Python 2.7 http://www.microsoft.com/en-us/download/details.aspx?id=44266
- C99 compliant stdint.h for MSVC http://msinttypes.googlecode.com/svn/trunk/stdint.h
- C99 compliant stdint.h for MSVC https://msinttypes.googlecode.com/svn/trunk/stdint.h
(a C99-compliant stdint.h header, which is not supplied with the Microsoft Visual C++ compiler prior to MSVC 2010)
Build steps:
- pip install --upgrade setuptools
- pip install cython fabric fabtools
- pip install -r requirements.txt
- python setup.py build_ext --inplace
- pip install -e .
If you are using the traditional Microsoft SDK (v7.0 for Python 2.x or v7.1 for Python 3.x), consider run_with_env.cmd from the appveyor folder (submodule) as a guideline for environment setup.
It can also be used as a shell configuration script for your build, install and run commands, e.g.: cmd /E:ON /V:ON /C run_with_env.cmd <your command>

@@ -45,9 +45,6 @@ Supports
* OSX
* Linux
* Cygwin
Want to support:
* Visual Studio
Difficult to support:

bin/cythonize.py (new executable file, 199 lines)
@@ -0,0 +1,199 @@
#!/usr/bin/env python
""" cythonize
Cythonize pyx files into C files as needed.
Usage: cythonize [root_dir]
Default [root_dir] is 'spacy'.
Checks pyx files to see if they have been changed relative to their
corresponding C files. If they have, then runs cython on these files to
recreate the C files.
The script thinks that the pyx files have changed relative to the C files
by comparing hashes stored in a database file.
Simple script to invoke Cython (and Tempita) on all .pyx (.pyx.in)
files; while waiting for a proper build system. Uses file hashes to
figure out if rebuild is needed.
For now, this script should be run by developers when changing Cython files
only, and the resulting C files checked in, so that end-users (and Python-only
developers) do not get the Cython/Tempita dependencies.
Originally written by Dag Sverre Seljebotn, and copied here from:
https://raw.github.com/dagss/private-scipy-refactor/cythonize/cythonize.py
Note: this script does not check any of the dependent C libraries; it only
operates on the Cython .pyx files.
"""
from __future__ import division, print_function, absolute_import
import os
import re
import sys
import hashlib
import subprocess
HASH_FILE = 'cythonize.dat'
DEFAULT_ROOT = 'spacy'
VENDOR = 'spaCy'
# WindowsError is not defined on unix systems
try:
WindowsError
except NameError:
WindowsError = None
#
# Rules
#
def process_pyx(fromfile, tofile):
try:
from Cython.Compiler.Version import version as cython_version
from distutils.version import LooseVersion
if LooseVersion(cython_version) < LooseVersion('0.19'):
raise Exception('Building %s requires Cython >= 0.19' % VENDOR)
except ImportError:
pass
flags = ['--fast-fail']
if tofile.endswith('.cpp'):
flags += ['--cplus']
try:
try:
r = subprocess.call(['cython'] + flags + ["-o", tofile, fromfile])
if r != 0:
raise Exception('Cython failed')
except OSError:
# There are ways of installing Cython that don't result in a cython
# executable on the path, see gh-2397.
r = subprocess.call([sys.executable, '-c',
'import sys; from Cython.Compiler.Main import '
'setuptools_main as main; sys.exit(main())'] + flags +
["-o", tofile, fromfile])
if r != 0:
raise Exception('Cython failed')
except OSError:
raise OSError('Cython needs to be installed')
def process_tempita_pyx(fromfile, tofile):
try:
try:
from Cython import Tempita as tempita
except ImportError:
import tempita
except ImportError:
raise Exception('Building %s requires Tempita: '
'pip install --user Tempita' % VENDOR)
with open(fromfile, "r") as f:
tmpl = f.read()
pyxcontent = tempita.sub(tmpl)
assert fromfile.endswith('.pyx.in')
pyxfile = fromfile[:-len('.pyx.in')] + '.pyx'
with open(pyxfile, "w") as f:
f.write(pyxcontent)
process_pyx(pyxfile, tofile)
rules = {
# fromext : function
'.pyx' : process_pyx,
'.pyx.in' : process_tempita_pyx
}
#
# Hash db
#
def load_hashes(filename):
# Return { filename : (sha1 of input, sha1 of output) }
if os.path.isfile(filename):
hashes = {}
with open(filename, 'r') as f:
for line in f:
filename, inhash, outhash = line.split()
hashes[filename] = (inhash, outhash)
else:
hashes = {}
return hashes
def save_hashes(hash_db, filename):
with open(filename, 'w') as f:
for key, value in sorted(hash_db.items()):
f.write("%s %s %s\n" % (key, value[0], value[1]))
def sha1_of_file(filename):
h = hashlib.sha1()
with open(filename, "rb") as f:
h.update(f.read())
return h.hexdigest()
#
# Main program
#
def normpath(path):
path = path.replace(os.sep, '/')
if path.startswith('./'):
path = path[2:]
return path
def get_hash(frompath, topath):
from_hash = sha1_of_file(frompath)
to_hash = sha1_of_file(topath) if os.path.exists(topath) else None
return (from_hash, to_hash)
def process(path, fromfile, tofile, processor_function, hash_db):
fullfrompath = os.path.join(path, fromfile)
fulltopath = os.path.join(path, tofile)
current_hash = get_hash(fullfrompath, fulltopath)
if current_hash == hash_db.get(normpath(fullfrompath), None):
print('%s has not changed' % fullfrompath)
return
orig_cwd = os.getcwd()
try:
os.chdir(path)
print('Processing %s' % fullfrompath)
processor_function(fromfile, tofile)
finally:
os.chdir(orig_cwd)
# changed target file, recompute hash
current_hash = get_hash(fullfrompath, fulltopath)
# store hash in db
hash_db[normpath(fullfrompath)] = current_hash
def find_process_files(root_dir):
hash_db = load_hashes(HASH_FILE)
for cur_dir, dirs, files in os.walk(root_dir):
for filename in files:
in_file = os.path.join(cur_dir, filename + ".in")
if filename.endswith('.pyx') and os.path.isfile(in_file):
continue
for fromext, function in rules.items():
if filename.endswith(fromext):
toext = ".cpp"
# with open(os.path.join(cur_dir, filename), 'rb') as f:
# data = f.read()
# m = re.search(br"^\s*#\s*distutils:\s*language\s*=\s*c\+\+\s*$", data, re.I|re.M)
# if m:
# toext = ".cxx"
fromfile = filename
tofile = filename[:-len(fromext)] + toext
process(cur_dir, fromfile, tofile, function, hash_db)
save_hashes(hash_db, HASH_FILE)
def main():
try:
root_dir = sys.argv[1]
except IndexError:
root_dir = DEFAULT_ROOT
find_process_files(root_dir)
if __name__ == '__main__':
main()
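
For reference, a minimal sketch of the staleness check this script performs. It uses the same one-record-per-line cythonize.dat format as load_hashes/save_hashes above; the sample record at the end is illustrative only.

    import hashlib
    import os

    def sha1_of(path):
        with open(path, 'rb') as f:
            return hashlib.sha1(f.read()).hexdigest()

    def is_stale(pyx_path, c_path, hash_db):
        # hash_db maps 'spacy/foo.pyx' -> (sha1 of the .pyx, sha1 of the generated C/C++ file)
        current = (sha1_of(pyx_path),
                   sha1_of(c_path) if os.path.exists(c_path) else None)
        return hash_db.get(pyx_path) != current

    # A cythonize.dat record looks like:
    # spacy/tokenizer.pyx <sha1-of-pyx> <sha1-of-cpp>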

build.py (new file, 71 lines)
@@ -0,0 +1,71 @@
#!/usr/bin/env python
from __future__ import print_function
import os
import sys
import shutil
from subprocess import call
def x(cmd):
print('$ '+cmd)
res = call(cmd, shell=True)
if res != 0:
sys.exit(res)
if len(sys.argv) < 2:
print('usage: %s <install-mode> [<pip-date>]' % sys.argv[0])
sys.exit(1)
install_mode = sys.argv[1]
if install_mode == 'prepare':
x('python pip-clear.py')
pip_date = len(sys.argv) > 2 and sys.argv[2]
if pip_date:
x('python pip-date.py %s pip setuptools wheel six' % pip_date)
x('pip install -r requirements.txt')
x('pip list')
elif install_mode == 'pip':
if os.path.exists('dist'):
shutil.rmtree('dist')
x('python setup.py sdist')
x('python pip-clear.py')
filenames = os.listdir('dist')
assert len(filenames) == 1
x('pip list')
x('pip install dist/%s' % filenames[0])
elif install_mode == 'setup-install':
x('python setup.py install')
elif install_mode == 'setup-develop':
x('pip install -e .')
elif install_mode == 'test':
x('pip install pytest')
x('pip list')
if os.path.exists('tmp'):
shutil.rmtree('tmp')
os.mkdir('tmp')
try:
old = os.getcwd()
os.chdir('tmp')
x('python -m spacy.en.download')
x('python -m pytest ../spacy/ -x --models --vectors --slow')
finally:
os.chdir(old)
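
The CI entry points in this commit (.travis.yml above, venv.sh and venv.ps1 below) drive this script in three phases: prepare the environment, install spaCy in one of the supported modes, then run the test suite. A minimal sketch of that sequence, assuming the same commands those scripts use (the subprocess wrapper itself is illustrative):

    import subprocess

    for cmd in ('python build.py prepare 2015-10-01',  # optional pip snapshot date
                'python build.py pip',                 # or setup-install / setup-develop
                'python build.py test'):
        subprocess.check_call(cmd, shell=True)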

fabfile.py (vendored, 46 changed lines)
@@ -74,7 +74,6 @@ def web():
jade('home/index.jade', '')
jade('docs/index.jade', 'docs/')
jade('blog/index.jade', 'blog/')
jade('tutorials/index.jade', 'tutorials/')
for collection in ('blog', 'tutorials'):
for post_dir in (Path(__file__).parent / 'website' / 'src' / 'jade' / collection).iterdir():
@@ -85,7 +84,50 @@ def web():
def web_publish(assets_path):
local('aws s3 sync --delete --exclude "resources/*" website/site/ s3://spacy.io')
from boto.s3.connection import S3Connection, OrdinaryCallingFormat
site_path = 'website/site'
os.environ['S3_USE_SIGV4'] = 'True'
conn = S3Connection(host='s3.eu-central-1.amazonaws.com',
calling_format=OrdinaryCallingFormat())
bucket = conn.get_bucket('spacy.io', validate=False)
keys_left = set([k.name for k in bucket.list()
if not k.name.startswith('resources')])
for root, dirnames, filenames in os.walk(site_path):
for dirname in dirnames:
target = os.path.relpath(os.path.join(root, dirname), site_path)
source = os.path.join(target, 'index.html')
if os.path.exists(os.path.join(root, dirname, 'index.html')):
key = bucket.new_key(source)
key.set_redirect('//%s/%s' % (bucket.name, target))
print('adding redirect for %s' % target)
keys_left.remove(source)
for filename in filenames:
source = os.path.join(root, filename)
target = os.path.relpath(root, site_path)
if target == '.':
target = filename
elif filename != 'index.html':
target = os.path.join(target, filename)
key = bucket.new_key(target)
key.set_metadata('Content-Type', 'text/html')
key.set_contents_from_filename(source)
print('uploading %s' % target)
keys_left.remove(target)
for key_name in keys_left:
print('deleting %s' % key_name)
bucket.delete_key(key_name)
local('aws s3 sync --delete %s s3://spacy.io/resources' % assets_path)

pip-clear.py (new executable file, 28 lines)
@@ -0,0 +1,28 @@
#!/usr/bin/env python
from __future__ import print_function
from pip.commands.uninstall import UninstallCommand
from pip import get_installed_distributions
packages = []
for package in get_installed_distributions():
if package.location.endswith('dist-packages'):
continue
elif package.project_name in ('pip', 'setuptools'):
continue
packages.append(package.project_name)
if packages:
pip = UninstallCommand()
options, args = pip.parse_args(packages)
options.yes = True
try:
pip.run(options, args)
except OSError as e:
if e.errno != 13:
raise e
print("You lack permissions to uninstall this package. Perhaps run with sudo? Exiting.")
exit(13)

pip-date.py (new file, 96 lines)
@@ -0,0 +1,96 @@
#!/usr/bin/env python
from __future__ import print_function
import json
import re
import sys
from bisect import bisect
from datetime import datetime
from datetime import timedelta
import ssl
try:
from urllib.request import Request, build_opener, HTTPSHandler, URLError
except ImportError:
from urllib2 import Request, build_opener, HTTPSHandler, URLError
from pip.commands.uninstall import UninstallCommand
from pip.commands.install import InstallCommand
from pip import get_installed_distributions
def get_releases(package_name):
url = 'https://pypi.python.org/pypi/%s/json' % package_name
ssl_context = HTTPSHandler(
context=ssl.SSLContext(ssl.PROTOCOL_TLSv1))
opener = build_opener(ssl_context)
retries = 10
while retries > 0:
try:
r = opener.open(Request(url))
break
except URLError:
retries -= 1
return json.loads(r.read().decode('utf8'))['releases']
def parse_iso8601(s):
return datetime(*map(int, re.split('[^\d]', s)))
def select_version(select_date, package_name):
versions = []
for version, dists in get_releases(package_name).items():
date = [parse_iso8601(d['upload_time']) for d in dists]
if date:
versions.append((sorted(date)[0], version))
versions = sorted(versions)
min_date = versions[0][0]
if select_date < min_date:
raise Exception('invalid select_date: %s, must be '
'%s or newer.' % (select_date, min_date))
return versions[bisect([x[0] for x in versions], select_date) - 1][1]
installed_packages = [
package.project_name
for package in
get_installed_distributions()
if (not package.location.endswith('dist-packages') and
package.project_name not in ('pip', 'setuptools'))
]
if installed_packages:
pip = UninstallCommand()
options, args = pip.parse_args(installed_packages)
options.yes = True
try:
pip.run(options, args)
except OSError as e:
if e.errno != 13:
raise e
print("You lack permissions to uninstall this package. Perhaps run with sudo? Exiting.")
exit(13)
date = parse_iso8601(sys.argv[1])
packages = {p: select_version(date, p) for p in sys.argv[2:]}
args = ['=='.join(a) for a in packages.items()]
cmd = InstallCommand()
options, args = cmd.parse_args(args)
options.ignore_installed = True
options.force_reinstall = True
try:
print(cmd.run(options, args))
except OSError as e:
if e.errno != 13:
raise e
print("You lack permissions to uninstall this package. Perhaps run with sudo? Exiting.")
exit(13)
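
To make the selection rule in select_version above concrete: the (upload date, version) pairs are sorted by date, bisect finds where the target date would be inserted, and the entry just before that point is the newest release uploaded on or before that date. A toy example with made-up dates and version strings (not real PyPI data):

    from bisect import bisect
    from datetime import datetime

    # (first upload date, version) pairs, sorted by date -- illustrative values only
    versions = [(datetime(2015, 1, 10), '1.0.0'),
                (datetime(2015, 3, 2), '1.1.0'),
                (datetime(2015, 9, 30), '1.2.0')]

    select_date = datetime(2015, 10, 1)
    dates = [date for date, _ in versions]
    print(versions[bisect(dates, select_date) - 1][1])  # -> '1.2.0'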

@@ -1,13 +1,13 @@
cython
cymem == 1.30
pathlib
preshed == 0.44
thinc == 4.0
murmurhash == 0.24
preshed == 0.46.1
thinc == 4.1.0
murmurhash == 0.26
text-unidecode
numpy
plac
six
ujson
cloudpickle
sputnik == 0.5.2
sputnik == 0.6.2

setup.py (426 changed lines)
@@ -1,231 +1,281 @@
#!/usr/bin/env python
from setuptools import setup
import shutil
import sys
from __future__ import division, print_function
import os
from os import path
from setuptools import Extension
from distutils import sysconfig
from distutils.core import setup, Extension
import shutil
import subprocess
import sys
import contextlib
from distutils.command.build_ext import build_ext
from distutils.sysconfig import get_python_inc
import platform
try:
from setuptools import Extension, setup
except ImportError:
from distutils.core import Extension, setup
PACKAGE_DATA = {
"spacy": ["*.pxd"],
"spacy.tokens": ["*.pxd"],
"spacy.serialize": ["*.pxd"],
"spacy.syntax": ["*.pxd"],
"spacy.en": [
"*.pxd",
"data/wordnet/*.exc",
"data/wordnet/index.*",
"data/tokenizer/*",
"data/vocab/serializer.json"
]
}
MAJOR = 0
MINOR = 100
MICRO = 0
ISRELEASED = False
VERSION = '%d.%d.%d' % (MAJOR, MINOR, MICRO)
PACKAGES = [
'spacy',
'spacy.tokens',
'spacy.en',
'spacy.serialize',
'spacy.syntax',
'spacy.munge',
'spacy.tests',
'spacy.tests.matcher',
'spacy.tests.morphology',
'spacy.tests.munge',
'spacy.tests.parser',
'spacy.tests.serialize',
'spacy.tests.spans',
'spacy.tests.tagger',
'spacy.tests.tokenizer',
'spacy.tests.tokens',
'spacy.tests.vectors',
'spacy.tests.vocab']
MOD_NAMES = [
'spacy.parts_of_speech',
'spacy.strings',
'spacy.lexeme',
'spacy.vocab',
'spacy.attrs',
'spacy.morphology',
'spacy.tagger',
'spacy.syntax.stateclass',
'spacy.tokenizer',
'spacy.syntax.parser',
'spacy.syntax.transition_system',
'spacy.syntax.arc_eager',
'spacy.syntax._parse_features',
'spacy.gold',
'spacy.orth',
'spacy.tokens.doc',
'spacy.tokens.span',
'spacy.tokens.token',
'spacy.serialize.packer',
'spacy.serialize.huffman',
'spacy.serialize.bits',
'spacy.cfile',
'spacy.matcher',
'spacy.syntax.ner',
'spacy.symbols']
if sys.version_info[:2] < (2, 7) or (3, 0) <= sys.version_info[0:2] < (3, 4):
raise RuntimeError('Python version 2.7 or >= 3.4 required.')
# By overriding build_extensions we get access to the actual compiler that will be used, which is only known after finalize_options has run
# http://stackoverflow.com/questions/724664/python-distutils-how-to-get-a-compiler-that-is-going-to-be-used
compile_options = {'msvc' : ['/Ox', '/EHsc'] ,
'other' : ['-O3', '-Wno-strict-prototypes', '-Wno-unused-function'] }
link_options = {'msvc' : [] ,
'other' : [] }
compile_options = {'msvc' : ['/Ox', '/EHsc'],
'other' : ['-O3', '-Wno-strict-prototypes', '-Wno-unused-function']}
link_options = {'msvc' : [],
'other' : []}
if sys.platform.startswith('darwin'):
compile_options['other'].append('-mmacosx-version-min=10.8')
compile_options['other'].append('-stdlib=libc++')
link_options['other'].append('-lc++')
class build_ext_options:
def build_options(self):
c_type = None
if self.compiler.compiler_type in compile_options:
c_type = self.compiler.compiler_type
elif 'other' in compile_options:
c_type = 'other'
if c_type is not None:
for e in self.extensions:
e.extra_compile_args = compile_options[c_type]
for e in self.extensions:
e.extra_compile_args = compile_options.get(
self.compiler.compiler_type, compile_options['other'])
for e in self.extensions:
e.extra_link_args = link_options.get(
self.compiler.compiler_type, link_options['other'])
l_type = None
if self.compiler.compiler_type in link_options:
l_type = self.compiler.compiler_type
elif 'other' in link_options:
l_type = 'other'
if l_type is not None:
for e in self.extensions:
e.extra_link_args = link_options[l_type]
class build_ext_subclass( build_ext, build_ext_options ):
class build_ext_subclass(build_ext, build_ext_options):
def build_extensions(self):
build_ext_options.build_options(self)
build_ext.build_extensions(self)
# PyPy --- NB! PyPy doesn't really work, it segfaults all over the place. But,
# this is necessary to get it to compile.
# We have to resort to monkey-patching to set the compiler, because pypy broke
# all the everything.
pre_patch_customize_compiler = sysconfig.customize_compiler
def my_customize_compiler(compiler):
pre_patch_customize_compiler(compiler)
compiler.compiler_cxx = ['c++']
if platform.python_implementation() == 'PyPy':
sysconfig.customize_compiler = my_customize_compiler
# Return the git revision as a string
def git_version():
def _minimal_ext_cmd(cmd):
# construct minimal environment
env = {}
for k in ['SYSTEMROOT', 'PATH']:
v = os.environ.get(k)
if v is not None:
env[k] = v
# LANGUAGE is used on win32
env['LANGUAGE'] = 'C'
env['LANG'] = 'C'
env['LC_ALL'] = 'C'
out = subprocess.Popen(cmd, stdout = subprocess.PIPE, env=env).communicate()[0]
return out
#def install_headers():
# dest_dir = path.join(sys.prefix, 'include', 'murmurhash')
# if not path.exists(dest_dir):
# shutil.copytree('murmurhash/headers/murmurhash', dest_dir)
#
# dest_dir = path.join(sys.prefix, 'include', 'numpy')
try:
out = _minimal_ext_cmd(['git', 'rev-parse', 'HEAD'])
GIT_REVISION = out.strip().decode('ascii')
except OSError:
GIT_REVISION = 'Unknown'
return GIT_REVISION
includes = ['.', path.join(sys.prefix, 'include')]
def get_version_info():
# Adding the git rev number needs to be done inside write_version_py(),
# otherwise the import of spacy.about messes up the build under Python 3.
FULLVERSION = VERSION
if os.path.exists('.git'):
GIT_REVISION = git_version()
elif os.path.exists(os.path.join('spacy', 'about.py')):
# must be a source distribution, use existing version file
try:
from spacy.about import git_revision as GIT_REVISION
except ImportError:
raise ImportError('Unable to import git_revision. Try removing '
'spacy/about.py and the build directory '
'before building.')
else:
GIT_REVISION = 'Unknown'
if not ISRELEASED:
FULLVERSION += '.dev0+' + GIT_REVISION[:7]
return FULLVERSION, GIT_REVISION
try:
import numpy
numpy_headers = path.join(numpy.get_include(), 'numpy')
shutil.copytree(numpy_headers, path.join(sys.prefix, 'include', 'numpy'))
except ImportError:
pass
except OSError:
pass
def write_version(path):
cnt = """# THIS FILE IS GENERATED FROM SPACY SETUP.PY
short_version = '%(version)s'
version = '%(version)s'
full_version = '%(full_version)s'
git_revision = '%(git_revision)s'
release = %(isrelease)s
if not release:
version = full_version
"""
FULLVERSION, GIT_REVISION = get_version_info()
with open(path, 'w') as f:
f.write(cnt % {'version': VERSION,
'full_version' : FULLVERSION,
'git_revision' : GIT_REVISION,
'isrelease': str(ISRELEASED)})
def generate_cython(root, source):
print('Cythonizing sources')
p = subprocess.call([sys.executable,
os.path.join(root, 'bin', 'cythonize.py'),
source])
if p != 0:
raise RuntimeError('Running cythonize failed')
def clean(mod_names):
for name in mod_names:
def import_include(module_name):
try:
return __import__(module_name, globals(), locals(), [], 0)
except ImportError:
raise ImportError('Unable to import %s. Create a virtual environment '
'and install all dependencies from requirements.txt, '
'e.g., run "pip install -r requirements.txt".' % module_name)
def copy_include(src, dst, path):
assert os.path.isdir(src)
assert os.path.isdir(dst)
shutil.copytree(
os.path.join(src, path),
os.path.join(dst, path))
def prepare_includes(path):
include_dir = os.path.join(path, 'include')
if os.path.exists(include_dir):
shutil.rmtree(include_dir)
os.mkdir(include_dir)
numpy = import_include('numpy')
copy_include(numpy.get_include(), include_dir, 'numpy')
murmurhash = import_include('murmurhash')
copy_include(murmurhash.get_include(), include_dir, 'murmurhash')
def is_source_release(path):
return os.path.exists(os.path.join(path, 'PKG-INFO'))
def clean(path):
for name in MOD_NAMES:
name = name.replace('.', '/')
so = name + '.so'
html = name + '.html'
cpp = name + '.cpp'
c = name + '.c'
for file_path in [so, html, cpp, c]:
for ext in ['.so', '.html', '.cpp', '.c']:
file_path = os.path.join(path, name + ext)
if os.path.exists(file_path):
os.unlink(file_path)
def name_to_path(mod_name, ext):
return '%s.%s' % (mod_name.replace('.', '/'), ext)
@contextlib.contextmanager
def chdir(new_dir):
old_dir = os.getcwd()
try:
os.chdir(new_dir)
sys.path.insert(0, new_dir)
yield
finally:
del sys.path[0]
os.chdir(old_dir)
def c_ext(mod_name, language, includes):
mod_path = name_to_path(mod_name, language)
return Extension(mod_name, [mod_path], include_dirs=includes)
def setup_package():
root = os.path.abspath(os.path.dirname(__file__))
if len(sys.argv) > 1 and sys.argv[1] == 'clean':
return clean(root)
def cython_setup(mod_names, language, includes):
import Cython.Distutils
import Cython.Build
import distutils.core
with chdir(root):
write_version(os.path.join(root, 'spacy', 'about.py'))
class build_ext_cython_subclass( Cython.Distutils.build_ext, build_ext_options ):
def build_extensions(self):
build_ext_options.build_options(self)
Cython.Distutils.build_ext.build_extensions(self)
include_dirs = [
get_python_inc(plat_specific=True),
os.path.join(root, 'include')]
if language == 'cpp':
language = 'c++'
exts = []
for mod_name in mod_names:
mod_path = mod_name.replace('.', '/') + '.pyx'
e = Extension(mod_name, [mod_path], language=language, include_dirs=includes)
exts.append(e)
distutils.core.setup(
name='spacy',
packages=['spacy', 'spacy.tokens', 'spacy.en', 'spacy.serialize',
'spacy.syntax', 'spacy.munge'],
description="Industrial-strength NLP",
author='Matthew Honnibal',
author_email='honnibal@gmail.com',
version=VERSION,
url="http://spacy.io",
package_data=PACKAGE_DATA,
ext_modules=exts,
cmdclass={'build_ext': build_ext_cython_subclass},
license="MIT",
)
ext_modules = []
for mod_name in MOD_NAMES:
mod_path = mod_name.replace('.', '/') + '.cpp'
ext_modules.append(
Extension(mod_name, [mod_path],
language='c++', include_dirs=include_dirs))
if not is_source_release(root):
generate_cython(root, 'spacy')
prepare_includes(root)
def run_setup(exts):
setup(
name='spacy',
packages=['spacy', 'spacy.tokens', 'spacy.en', 'spacy.serialize',
'spacy.syntax', 'spacy.munge',
'spacy.tests',
'spacy.tests.matcher',
'spacy.tests.morphology',
'spacy.tests.munge',
'spacy.tests.parser',
'spacy.tests.serialize',
'spacy.tests.spans',
'spacy.tests.tagger',
'spacy.tests.tokenizer',
'spacy.tests.tokens',
'spacy.tests.vectors',
'spacy.tests.vocab'],
description="Industrial-strength NLP",
author='Matthew Honnibal',
author_email='honnibal@gmail.com',
version=VERSION,
url="http://honnibal.github.io/spaCy/",
package_data=PACKAGE_DATA,
ext_modules=exts,
license="MIT",
install_requires=['numpy', 'murmurhash == 0.24', 'cymem == 1.30', 'preshed == 0.44',
'thinc == 4.0.0', "text_unidecode", 'plac', 'six',
'ujson', 'cloudpickle', 'sputnik == 0.5.2'],
setup_requires=["headers_workaround"],
cmdclass = {'build_ext': build_ext_subclass },
)
import headers_workaround
headers_workaround.fix_venv_pypy_include()
headers_workaround.install_headers('murmurhash')
headers_workaround.install_headers('numpy')
VERSION = '0.100'
def main(modules, is_pypy):
language = "cpp"
includes = ['.', path.join(sys.prefix, 'include')]
if sys.platform.startswith('darwin'):
compile_options['other'].append('-mmacosx-version-min=10.8')
compile_options['other'].append('-stdlib=libc++')
link_options['other'].append('-lc++')
if use_cython:
cython_setup(modules, language, includes)
else:
exts = [c_ext(mn, language, includes)
for mn in modules]
run_setup(exts)
MOD_NAMES = ['spacy.parts_of_speech', 'spacy.strings',
'spacy.lexeme', 'spacy.vocab', 'spacy.attrs',
'spacy.morphology', 'spacy.tagger',
'spacy.syntax.stateclass',
'spacy.tokenizer',
'spacy.syntax.parser',
'spacy.syntax.transition_system',
'spacy.syntax.arc_eager',
'spacy.syntax._parse_features',
'spacy.gold', 'spacy.orth',
'spacy.tokens.doc', 'spacy.tokens.span', 'spacy.tokens.token',
'spacy.serialize.packer', 'spacy.serialize.huffman', 'spacy.serialize.bits',
'spacy.cfile', 'spacy.matcher',
'spacy.syntax.ner',
'spacy.symbols']
setup(
name='spacy',
packages=PACKAGES,
package_data={'': ['*.pyx', '*.pxd']},
description='Industrial-strength NLP',
author='Matthew Honnibal',
author_email='matt@spacy.io',
version=VERSION,
url='https://spacy.io',
license='MIT',
ext_modules=ext_modules,
install_requires=['numpy', 'murmurhash == 0.26', 'cymem == 1.30', 'preshed == 0.46.1',
'thinc == 4.1.0', 'text_unidecode', 'plac', 'six',
'ujson', 'cloudpickle', 'sputnik == 0.6.2'],
cmdclass = {
'build_ext': build_ext_subclass},
)
if __name__ == '__main__':
if sys.argv[1] == 'clean':
clean(MOD_NAMES)
else:
use_cython = sys.argv[1] == 'build_ext'
main(MOD_NAMES, use_cython)
setup_package()

@@ -1,3 +0,0 @@
"""Feed-forward neural network, using Thenao."""

@@ -1,146 +0,0 @@
"""Feed-forward neural network, using Thenao."""
import os
import sys
import time
import numpy
import theano
import theano.tensor as T
import plac
from spacy.gold import read_json_file
from spacy.gold import GoldParse
from spacy.en.pos import POS_TEMPLATES, POS_TAGS, setup_model_dir
def build_model(n_classes, n_vocab, n_hidden, n_word_embed, n_tag_embed):
# allocate symbolic variables for the data
words = T.vector('words')
tags = T.vector('tags')
word_e = _init_embedding(n_words, n_word_embed)
tag_e = _init_embedding(n_tags, n_tag_embed)
label_e = _init_embedding(n_labels, n_label_embed)
maxent_W, maxent_b = _init_maxent_weights(n_hidden, n_classes)
hidden_W, hidden_b = _init_hidden_weights(28*28, n_hidden, T.tanh)
params = [hidden_W, hidden_b, maxent_W, maxent_b, word_e, tag_e, label_e]
x = T.concatenate([
T.flatten(word_e[word_indices], outdim=1),
T.flatten(tag_e[tag_indices], outdim=1)])
p_y_given_x = feed_layer(
T.nnet.softmax,
maxent_W,
maxent_b,
feed_layer(
T.tanh,
hidden_W,
hidden_b,
x))[0]
guess = T.argmax(p_y_given_x)
cost = (
-T.log(p_y_given_x[y])
+ L1(L1_reg, maxent_W, hidden_W, word_e, tag_e)
+ L2(L2_reg, maxent_W, hidden_W, word_e, tag_e)
)
train_model = theano.function(
inputs=[words, tags, y],
outputs=guess,
updates=[update(learning_rate, param, cost) for param in params]
)
evaluate_model = theano.function(
inputs=[x, y],
outputs=T.neq(y, T.argmax(p_y_given_x[0])),
)
return train_model, evaluate_model
def _init_embedding(vocab_size, n_dim):
embedding = 0.2 * numpy.random.uniform(-1.0, 1.0, (vocab_size+1, n_dim))
return theano.shared(embedding).astype(theano.config.floatX)
def _init_maxent_weights(n_hidden, n_out):
weights = numpy.zeros((n_hidden, 10), dtype=theano.config.floatX)
bias = numpy.zeros((10,), dtype=theano.config.floatX)
return (
theano.shared(name='W', borrow=True, value=weights),
theano.shared(name='b', borrow=True, value=bias)
)
def _init_hidden_weights(n_in, n_out, activation=T.tanh):
rng = numpy.random.RandomState(1234)
weights = numpy.asarray(
rng.uniform(
low=-numpy.sqrt(6. / (n_in + n_out)),
high=numpy.sqrt(6. / (n_in + n_out)),
size=(n_in, n_out)
),
dtype=theano.config.floatX
)
bias = numpy.zeros((n_out,), dtype=theano.config.floatX)
return (
theano.shared(value=weights, name='W', borrow=True),
theano.shared(value=bias, name='b', borrow=True)
)
def feed_layer(activation, weights, bias, input):
return activation(T.dot(input, weights) + bias)
def L1(L1_reg, w1, w2):
return L1_reg * (abs(w1).sum() + abs(w2).sum())
def L2(L2_reg, w1, w2):
return L2_reg * ((w1 ** 2).sum() + (w2 ** 2).sum())
def update(eta, param, cost):
return (param, param - (eta * T.grad(cost, param)))
def main(train_loc, eval_loc, model_dir):
learning_rate = 0.01
L1_reg = 0.00
L2_reg = 0.0001
print "... reading the data"
gold_train = list(read_json_file(train_loc))
print '... building the model'
pos_model_dir = path.join(model_dir, 'pos')
if path.exists(pos_model_dir):
shutil.rmtree(pos_model_dir)
os.mkdir(pos_model_dir)
setup_model_dir(sorted(POS_TAGS.keys()), POS_TAGS, POS_TEMPLATES, pos_model_dir)
train_model, evaluate_model = build_model(n_hidden, len(POS_TAGS), learning_rate,
L1_reg, L2_reg)
print '... training'
for epoch in range(1, n_epochs+1):
for raw_text, sents in gold_tuples:
for (ids, words, tags, ner, heads, deps), _ in sents:
tokens = nlp.tokenizer.tokens_from_list(words)
for t in tokens:
guess = train_model([t.orth], [t.tag])
loss += guess != t.tag
print loss
# compute zero-one loss on validation set
#error = numpy.mean([evaluate_model(x, y) for x, y in dev_examples])
#print('epoch %i, validation error %f %%' % (epoch, error * 100))
if __name__ == '__main__':
plac.call(main)

@@ -1,13 +0,0 @@
from ._ml cimport Model
from thinc.nn cimport InputLayer
cdef class TheanoModel(Model):
cdef InputLayer input_layer
cdef object train_func
cdef object predict_func
cdef object debug
cdef public float eta
cdef public float mu
cdef public float t

@@ -1,52 +0,0 @@
from thinc.api cimport Example, ExampleC
from thinc.typedefs cimport weight_t
from ._ml cimport arg_max_if_true
from ._ml cimport arg_max_if_zero
import numpy
from os import path
cdef class TheanoModel(Model):
def __init__(self, n_classes, input_spec, train_func, predict_func, model_loc=None,
eta=0.001, mu=0.9, debug=None):
if model_loc is not None and path.isdir(model_loc):
model_loc = path.join(model_loc, 'model')
self.eta = eta
self.mu = mu
self.t = 1
initializer = lambda: 0.2 * numpy.random.uniform(-1.0, 1.0)
self.input_layer = InputLayer(input_spec, initializer)
self.train_func = train_func
self.predict_func = predict_func
self.debug = debug
self.n_classes = n_classes
self.n_feats = len(self.input_layer)
self.model_loc = model_loc
def predict(self, Example eg):
self.input_layer.fill(eg.embeddings, eg.atoms, use_avg=True)
theano_scores = self.predict_func(eg.embeddings)[0]
cdef int i
for i in range(self.n_classes):
eg.c.scores[i] = theano_scores[i]
eg.c.guess = arg_max_if_true(eg.c.scores, eg.c.is_valid, self.n_classes)
def train(self, Example eg):
self.input_layer.fill(eg.embeddings, eg.atoms, use_avg=False)
theano_scores, update, y, loss = self.train_func(eg.embeddings, eg.costs,
self.eta, self.mu)
self.input_layer.update(update, eg.atoms, self.t, self.eta, self.mu)
for i in range(self.n_classes):
eg.c.scores[i] = theano_scores[i]
eg.c.guess = arg_max_if_true(eg.c.scores, eg.c.is_valid, self.n_classes)
eg.c.best = arg_max_if_zero(eg.c.scores, eg.c.costs, self.n_classes)
eg.c.cost = eg.c.costs[eg.c.guess]
eg.c.loss = loss
self.t += 1
def end_training(self):
pass

@@ -17,8 +17,15 @@ def migrate(path):
def link(package, path):
if os.path.exists(path):
os.unlink(path)
os.symlink(package.dir_path('data'), path)
if os.path.isdir(path):
shutil.rmtree(path)
else:
os.unlink(path)
if not hasattr(os, 'symlink'): # not supported by win+py27
shutil.copytree(package.dir_path('data'), path)
else:
os.symlink(package.dir_path('data'), path)
@plac.annotations(
@@ -30,8 +37,12 @@ def main(data_size='all', force=False):
path = os.path.dirname(os.path.abspath(__file__))
command = sputnik.make_command(
data_path=os.path.abspath(os.path.join(path, '..', 'data')),
data_path = os.path.abspath(os.path.join(path, '..', 'data'))
if not os.path.isdir(data_path):
os.mkdir(data_path)
command = sputnik.command(
data_path=data_path,
repository_url='https://index.spacy.io')
if force:

@@ -1,62 +0,0 @@
# Enum of Wordnet supersenses
cimport parts_of_speech
from .typedefs cimport flags_t
cpdef enum:
A_behavior
A_body
A_feeling
A_mind
A_motion
A_perception
A_quantity
A_relation
A_social
A_spatial
A_substance
A_time
A_weather
N_act
N_animal
N_artifact
N_attribute
N_body
N_cognition
N_communication
N_event
N_feeling
N_food
N_group
N_location
N_motive
N_object
N_person
N_phenomenon
N_plant
N_possession
N_process
N_quantity
N_relation
N_shape
N_state
N_substance
N_time
V_body
V_change
V_cognition
V_communication
V_competition
V_consumption
V_contact
V_creation
V_emotion
V_motion
V_perception
V_possession
V_social
V_stative
V_weather
cdef flags_t[<int>parts_of_speech.N_UNIV_TAGS] POS_SENSES

@@ -1,88 +0,0 @@
from __future__ import unicode_literals
cimport parts_of_speech
POS_SENSES[<int>parts_of_speech.NO_TAG] = 0
POS_SENSES[<int>parts_of_speech.ADJ] = 0
POS_SENSES[<int>parts_of_speech.ADV] = 0
POS_SENSES[<int>parts_of_speech.ADP] = 0
POS_SENSES[<int>parts_of_speech.CONJ] = 0
POS_SENSES[<int>parts_of_speech.DET] = 0
POS_SENSES[<int>parts_of_speech.NOUN] = 0
POS_SENSES[<int>parts_of_speech.NUM] = 0
POS_SENSES[<int>parts_of_speech.PRON] = 0
POS_SENSES[<int>parts_of_speech.PRT] = 0
POS_SENSES[<int>parts_of_speech.VERB] = 0
POS_SENSES[<int>parts_of_speech.X] = 0
POS_SENSES[<int>parts_of_speech.PUNCT] = 0
POS_SENSES[<int>parts_of_speech.EOL] = 0
cdef int _sense = 0
for _sense in range(A_behavior, N_act):
POS_SENSES[<int>parts_of_speech.ADJ] |= 1 << _sense
for _sense in range(N_act, V_body):
POS_SENSES[<int>parts_of_speech.NOUN] |= 1 << _sense
for _sense in range(V_body, V_weather+1):
POS_SENSES[<int>parts_of_speech.VERB] |= 1 << _sense
STRINGS = (
'A_behavior',
'A_body',
'A_feeling',
'A_mind',
'A_motion',
'A_perception',
'A_quantity',
'A_relation',
'A_social',
'A_spatial',
'A_substance',
'A_time',
'A_weather',
'N_act',
'N_animal',
'N_artifact',
'N_attribute',
'N_body',
'N_cognition',
'N_communication',
'N_event',
'N_feeling',
'N_food',
'N_group',
'N_location',
'N_motive',
'N_object',
'N_person',
'N_phenomenon',
'N_plant',
'N_possession',
'N_process',
'N_quantity',
'N_relation',
'N_shape',
'N_state',
'N_substance',
'N_time',
'V_body',
'V_change',
'V_cognition',
'V_communication',
'V_competition',
'V_consumption',
'V_contact',
'V_creation',
'V_emotion',
'V_motion',
'V_perception',
'V_possession',
'V_social',
'V_stative',
'V_weather'
)

@@ -1,12 +0,0 @@
# encoding: utf8
from __future__ import unicode_literals
import spacy.de
#def test_tokenizer():
# lang = spacy.de.German()
#
# doc = lang(u'Biografie: Ein Spiel ist ein Theaterstück des Schweizer Schriftstellers Max Frisch, das 1967 entstand und am 1. Februar 1968 im Schauspielhaus Zürich uraufgeführt wurde. 1984 legte Frisch eine überarbeitete Neufassung vor. Das von Frisch als Komödie bezeichnete Stück greift eines seiner zentralen Themen auf: die Möglichkeit oder Unmöglichkeit des Menschen, seine Identität zu verändern.')
# for token in doc:
# print(repr(token.string))

@@ -4,6 +4,9 @@ from spacy.serialize.packer import Packer
from spacy.attrs import ORTH, SPACY
from spacy.tokens import Doc
import math
import tempfile
import shutil
import os
@pytest.mark.models
@@ -11,17 +14,21 @@ def test_read_write(EN):
doc1 = EN(u'This is a simple test. With a couple of sentences.')
doc2 = EN(u'This is another test document.')
with open('/tmp/spacy_docs.bin', 'wb') as file_:
file_.write(doc1.to_bytes())
file_.write(doc2.to_bytes())
try:
tmp_dir = tempfile.mkdtemp()
with open(os.path.join(tmp_dir, 'spacy_docs.bin'), 'wb') as file_:
file_.write(doc1.to_bytes())
file_.write(doc2.to_bytes())
with open('/tmp/spacy_docs.bin', 'rb') as file_:
bytes1, bytes2 = Doc.read_bytes(file_)
r1 = Doc(EN.vocab).from_bytes(bytes1)
r2 = Doc(EN.vocab).from_bytes(bytes2)
with open(os.path.join(tmp_dir, 'spacy_docs.bin'), 'rb') as file_:
bytes1, bytes2 = Doc.read_bytes(file_)
r1 = Doc(EN.vocab).from_bytes(bytes1)
r2 = Doc(EN.vocab).from_bytes(bytes2)
assert r1.string == doc1.string
assert r2.string == doc2.string
assert r1.string == doc1.string
assert r2.string == doc2.string
finally:
shutil.rmtree(tmp_dir)
@pytest.mark.models

@@ -75,7 +75,7 @@ def test_count_by(nlp):
@pytest.mark.models
def test_read_bytes(nlp):
from spacy.tokens.doc import Doc
loc = '/tmp/test_serialize.bin'
loc = 'test_serialize.bin'
with open(loc, 'wb') as file_:
file_.write(nlp(u'This is a document.').to_bytes())
file_.write(nlp(u'This is another.').to_bytes())

@@ -154,9 +154,9 @@ def test_efficient_binary_serialization(doc):
from spacy.tokens.doc import Doc
byte_string = doc.to_bytes()
open('/tmp/moby_dick.bin', 'wb').write(byte_string)
open('moby_dick.bin', 'wb').write(byte_string)
nlp = spacy.en.English()
for byte_string in Doc.read_bytes(open('/tmp/moby_dick.bin', 'rb')):
for byte_string in Doc.read_bytes(open('moby_dick.bin', 'rb')):
doc = Doc(nlp.vocab)
doc.from_bytes(byte_string)

tox.ini (new file, 13 lines)
@@ -0,0 +1,13 @@
[tox]
envlist =
py27
py34
recreate = True
[testenv]
changedir = {envtmpdir}
deps =
pytest
commands =
python -m spacy.en.download
python -m pytest {toxinidir}/spacy/ -x --models --vectors --slow

venv.ps1 (new file, 32 lines)
@@ -0,0 +1,32 @@
param (
[string]$python = $(throw "-python is required."),
[string]$install_mode = $(throw "-install_mode is required."),
[string]$pip_date,
[string]$compiler
)
$ErrorActionPreference = "Stop"
if(!(Test-Path -Path ".build"))
{
if($compiler -eq "mingw32")
{
virtualenv .build --system-site-packages --python $python
}
else
{
virtualenv .build --python $python
}
if($compiler)
{
"[build]`r`ncompiler=$compiler" | Out-File -Encoding ascii .\.build\Lib\distutils\distutils.cfg
}
}
.build\Scripts\activate.ps1
python build.py prepare $pip_date
python build.py $install_mode
python build.py test
exit $LASTEXITCODE

venv.sh (new executable file, 16 lines)
@@ -0,0 +1,16 @@
#!/bin/bash
set -e
if [ ! -d ".build" ]; then
virtualenv .build --python $1
fi
if [ -d ".build/bin" ]; then
source .build/bin/activate
elif [ -d ".build/Scripts" ]; then
source .build/Scripts/activate
fi
python build.py prepare $3
python build.py $2
python build.py test

@@ -2,6 +2,7 @@ include ./header
include ./mixins.jade
- var Page = InitPage(Site, Authors.spacy, "home", '404')
- Page.canonical_url = null
- Page.is_error = true
- Site.slogan = "404"
- Page.active = {}

@@ -4,7 +4,7 @@ include ./meta.jade
+WritePost(Meta)
section.intro
p Natural Language Processing moves fast, so maintaining a good library means constantly throwing things away. Most libraries are failing badly at this, as academics hate to editorialize. This post explains the problem, why it's so damaging, and why I wrote #[a(href="http://spacy.io") spaCy] to do things differently.
p Natural Language Processing moves fast, so maintaining a good library means constantly throwing things away. Most libraries are failing badly at this, as academics hate to editorialize. This post explains the problem, why it's so damaging, and why I wrote #[a(href="https://spacy.io") spaCy] to do things differently.
p Imagine: you try to use Google Translate, but it asks you to first select which model you want. The new, awesome deep-learning model is there, but so are lots of others. You pick one that sounds fancy, but it turns out it's a 20-year old experimental model trained on a corpus of oven manuals. When it performs little better than chance, you can't even tell from its output. Of course, Google Translate would not do this to you. But most Natural Language Processing libraries do, and it's terrible.
@@ -12,7 +12,7 @@ include ./meta.jade
p Have a look through the #[a(href="http://gate.ac.uk/sale/tao/split.html") GATE software]. There's a lot there, developed over 12 years and many person-hours. But there's approximately zero curation. The philosophy is just to provide things. It's up to you to decide what to use.
p This is bad. It's bad to provide an implementation of #[a(href="https://gate.ac.uk/sale/tao/splitch18.html") MiniPar], and have it just...sit there, with no hint that it's 20 years old and should not be used. The RASP parser, too. Why are these provided? Worse, why is there no warning? The #[a(href="http://webdocs.cs.ualberta.ca/~lindek/minipar.htm") Minipar homepage] puts the software in the right context:
p This is bad. It's bad to provide an implementation of #[a(href="https://gate.ac.uk/sale/tao/splitch18.html") MiniPar], and have it just...sit there, with no hint that it's 20 years old and should not be used. The RASP parser, too. Why are these provided? Worse, why is there no warning? The #[a(href="https://web.archive.org/web/20150907234221/http://webdocs.cs.ualberta.ca/~lindek/minipar.htm") Minipar homepage] puts the software in the right context:
blockquote
p MINIPAR is a broad-coverage parser for the English language. An evaluation with the SUSANNE corpus shows that MINIPAR achieves about 88% precision and 80% recall with respect to dependency relationships. MINIPAR is very efficient, #[strong on a Pentium II 300 with 128MB memory], it parses about 300 words per second.
@@ -23,13 +23,13 @@ include ./meta.jade
h3 Why I didn't contribute to NLTK
p Various people have asked me why I decided to make a new Python NLP library, #[a(href="http://spacy.io") spaCy], instead of supporting the #[a(href="http://nltk.org") NLTK] project. This is the main reason. You can't contribute to a project if you believe that the first thing that they should do is throw almost all of it away. You should just make your own project, which is what I did.
p Various people have asked me why I decided to make a new Python NLP library, #[a(href="https://spacy.io") spaCy], instead of supporting the #[a(href="http://nltk.org") NLTK] project. This is the main reason. You can't contribute to a project if you believe that the first thing that they should do is throw almost all of it away. You should just make your own project, which is what I did.
p Have a look through #[a(href="http://www.nltk.org/py-modindex.html") the module list of NLTK]. It looks like there's a lot there, but there's not. What NLTK has is a decent tokenizer, some passable stemmers, a good implementation of the Punkt sentence boundary detector (after #[a(href="http://joelnothman.com/") Joel Nothman] rewrote it), some visualization tools, and some wrappers for other libraries. Nothing else is of any use.
p For instance, consider #[code nltk.parse]. You might think that amongst all this code there was something that could actually predict the syntactic structure of a sentence for you, but you would be wrong. There are wrappers for the BLLIP and Stanford parsers, and since March there's been an implementation of Nivre's 2003 transition-based dependency parser. Unfortunately no model is provided for it, as they rely on an external wrapper of an external learner, which is unsuitable for the structure of their problem. So the implementation is too slow to be actually useable.
p This problem is totally avoidable, if you just sit down and write good code, instead of stitching together external dependencies. I pointed NLTK to my tutorial describing #[a(href="http://spacy.io/blog/parsing-english-in-python/") how to implement a modern dependency parser], which includes a BSD-licensed implementation in 500 lines of Python. I was told "thanks but no thanks", and #[a(href="https://github.com/nltk/nltk/issues/694") the issue was abruptly closed]. Another researcher's offer from 2012 to implement this type of model also went #[a(href="http://arxiv.org/pdf/1409.7386v1.pdf") unanswered].
p This problem is totally avoidable, if you just sit down and write good code, instead of stitching together external dependencies. I pointed NLTK to my tutorial describing #[a(href="https://spacy.io/blog/parsing-english-in-python") how to implement a modern dependency parser], which includes a BSD-licensed implementation in 500 lines of Python. I was told "thanks but no thanks", and #[a(href="https://github.com/nltk/nltk/issues/694") the issue was abruptly closed]. Another researcher's offer from 2012 to implement this type of model also went #[a(href="http://arxiv.org/pdf/1409.7386v1.pdf") unanswered].
p The story in #[code nltk.tag] is similar. There are plenty of wrappers, for the external libraries that have actual taggers. The only actual tagger model they distribute is #[a(href="http://spacy.io/blog/part-of-speech-POS-tagger-in-python/") terrible]. Now it seems that #[a(href="https://github.com/nltk/nltk/issues/1063") NLTK does not even know how its POS tagger was trained]. The model is just this .pickle file that's been passed around for 5 years, its origins lost to time. It's not okay to offer this to people, to recommend they use it.
p The story in #[code nltk.tag] is similar. There are plenty of wrappers, for the external libraries that have actual taggers. The only actual tagger model they distribute is #[a(href="https://spacy.io/blog/part-of-speech-POS-tagger-in-python") terrible]. Now it seems that #[a(href="https://github.com/nltk/nltk/issues/1063") NLTK does not even know how its POS tagger was trained]. The model is just this .pickle file that's been passed around for 5 years, its origins lost to time. It's not okay to offer this to people, to recommend they use it.
p I think open source software should be very careful to make its limitations clear. It's a disservice to provide something that's much less useful than you imply. It's like offering your friend a lift and then not showing up. It's totally fine to not do something &ndash; so long as you never suggested you were going to do it. There are ways to do worse than nothing.

@@ -2,7 +2,7 @@ include ../../header.jade
include ./meta.jade
mixin Displacy(sentence, caption_text, height)
- var url = "https://api.spacy.io/displacy/?full=" + sentence.replace(" ", "%20")
- var url = "https://api.spacy.io/displacy/?full=" + sentence.replace(/\s+/g, "%20")
.displacy
iframe.displacy(src="/resources/displacy/robots.html" height=height)
@@ -20,7 +20,7 @@ mixin Displacy(sentence, caption_text, height)
p A syntactic dependency parse is a kind of shallow meaning representation. It's an important piece of many language understanding and text processing technologies. Now that these representations can be computed quickly, and with increasingly high accuracy, they're being used in lots of applications &ndash; translation, sentiment analysis, and summarization are major application areas.
p I've been living and breathing similar representations for most of my career. But there's always been a problem: talking about these things is tough. Most people haven't thought much about grammatical structure, and the idea of them is inherently abstract. When I left academia to write #[a(href="http://spaCy.io") spaCy], I knew I wanted a good visualizer. Unfortunately, I also knew I'd never be the one to write it. I'm deeply graphically challenged. Fortunately, when working with #[a(href="http://ines.io") Ines] to build this site, she really nailed the problem, with a solution I'd never have thought of. I really love the result, which we're calling #[a(href="https://api.spacy.io/displacy") displaCy]:
p I've been living and breathing similar representations for most of my career. But there's always been a problem: talking about these things is tough. Most people haven't thought much about grammatical structure, and the idea of them is inherently abstract. When I left academia to write #[a(href="https://spacy.io") spaCy], I knew I wanted a good visualizer. Unfortunately, I also knew I'd never be the one to write it. I'm deeply graphically challenged. Fortunately, when working with #[a(href="http://ines.io") Ines] to build this site, she really nailed the problem, with a solution I'd never have thought of. I really love the result, which we're calling #[a(href="https://api.spacy.io/displacy") displaCy]:
+Displacy("Robots in popular culture are there to remind us of the awesomeness of unbounded human agency", "Click the button to full-screen and interact, or scroll to see the full parse.", 325)

@@ -9,4 +9,4 @@
- Meta.links[0].name = 'Reddit'
- Meta.links[0].title = 'Discuss on Reddit'
- Meta.links[0].url = "https://www.reddit.com/r/programming/comments/3hoj0b/displaying_linguistic_structure_with_css/"
- Meta.image = "http://spacy.io/resources/img/displacy_screenshot.jpg"
- Meta.image = "https://spacy.io/resources/img/displacy_screenshot.jpg"

@@ -3,7 +3,7 @@
- Meta.headline = "Statistical NLP in Basic English"
- Meta.description = "When I was little, my favorite TV shows all had talking computers. Now Im big and there are still no talking computers, so Im trying to make some myself. Well, we can make computers say things. But when we say things back, they dont really understand. Why not?"
- Meta.date = "2015-08-24"
- Meta.url = "/blog/eli5-computers-learn-reading/"
- Meta.url = "/blog/eli5-computers-learn-reading"
- Meta.links = []
//- Meta.links[0].id = 'reddit'
//- Meta.links[0].name = "Reddit"

@@ -92,13 +92,13 @@ include ./meta.jade
h3 Part-of-speech Tagger
p In 2013, I wrote a blog post describing #[a(href="/blog/part-of-speech-POS-tagger-in-python/") how to write a good part of speech tagger]. My recommendation then was to use greedy decoding with the averaged perceptron. I think this is still the best approach, so it's what I implemented in spaCy.
p In 2013, I wrote a blog post describing #[a(href="/blog/part-of-speech-POS-tagger-in-python") how to write a good part of speech tagger]. My recommendation then was to use greedy decoding with the averaged perceptron. I think this is still the best approach, so it's what I implemented in spaCy.
p The tutorial also recommends the use of Brown cluster features, and case normalization features, as these make the model more robust and domain independent. spaCy's tagger makes heavy use of these features.
h3 Dependency Parser
p The parser uses the algorithm described in my #[a(href="/blog/parsing-english-in-python/") 2014 blog post]. This algorithm, shift-reduce dependency parsing, is becoming widely adopted due to its compelling speed/accuracy trade-off.
p The parser uses the algorithm described in my #[a(href="/blog/parsing-english-in-python") 2014 blog post]. This algorithm, shift-reduce dependency parsing, is becoming widely adopted due to its compelling speed/accuracy trade-off.
p Some quick details about spaCy's take on this, for those who happen to know these models well. I'll write up a better description shortly.

@@ -33,7 +33,7 @@ include ../header.jade
+WritePage(Site, Authors.spacy, Page)
section.intro.profile
p A lot of work has gone into #[strong spaCy], but no magic. We plan to keep no secrets. We want you to be able to #[a(href="/blog/spacy-now-mit") build your business] on #[strong spaCy] &ndash; so we want you to understand it. Tell us whether you do. #[span.social #[a(href="//twitter.com/" + Site.twitter, target="_blank") Twitter] #[a(href="mailto:contact@spacy.io") Contact us]]
p A lot of work has gone into #[strong spaCy], but no magic. We plan to keep no secrets. We want you to be able to #[a(href="/blog/spacy-now-mit") build your business] on #[strong spaCy] &ndash; so we want you to understand it. Tell us whether you do. #[span.social #[a(href="https://twitter.com/" + Site.twitter, target="_blank") Twitter] #[a(href="mailto:contact@spacy.io") Contact us]]
nav(role='navigation')
ul
li #[a.button(href='#blogs') Blog]

@@ -19,4 +19,4 @@ include ./meta.jade
+TweetThis("Computers don't understand text. This is unfortunate, because that's what the web is mostly made of.", Meta.url)
p If none of that made any sense to you, here's the gist of it. Computers don't understand text. This is unfortunate, because that's what the web almost entirely consists of. We want to recommend people text based on other text they liked. We want to shorten text to display it on a mobile screen. We want to aggregate it, link it, filter it, categorise it, generate it and correct it.
p spaCy provides a library of utility functions that help programmers build such products. It's commercial open source software: you can either use it under the AGPL, or you can buy a commercial license under generous terms (Note: #[a(href="/blog/spacy-now-mit/") spaCy is now licensed under MIT]).
p spaCy provides a library of utility functions that help programmers build such products. It's commercial open source software: you can either use it under the AGPL, or you can buy a commercial license under generous terms (Note: #[a(href="/blog/spacy-now-mit") spaCy is now licensed under MIT]).

@@ -5,7 +5,7 @@ include ../../header.jade
+WritePost(Meta)
//# AGPL not free enough: spaCy now under MIT, offering adaptation as a service
p Three big announcements for #[a(href="http://spacy.io") spaCy], a Python library for industrial-strength natural language processing (NLP).
p Three big announcements for #[a(href="https://spacy.io") spaCy], a Python library for industrial-strength natural language processing (NLP).
ol
li The founding team is doubling in size: I'd like to welcome my new co-founder, #[a(href="https://www.linkedin.com/profile/view?id=ADEAAADkZcYBnipeHOAS6HqrDBPK1IzAAVI64ds&authType=NAME_SEARCH&authToken=YYZ1&locale=en_US&srchid=3310922891443387747239&srchindex=1&srchtotal=16&trk=vsrp_people_res_name&trkInfo=VSRPsearchId%3A3310922891443387747239%2CVSRPtargetId%3A14968262%2CVSRPcmpt%3Aprimary%2CVSRPnm%3Atrue%2CauthType%3ANAME_SEARCH") Henning Peters].

@@ -52,7 +52,7 @@ details
p The part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank tag set. We also map the tags to the simpler Google Universal POS Tag set.
p Details #[a(href="https://github.com/honnibal/spaCy/blob/master/spacy/en/pos.pyx#L124") here].
p Details #[a(href="https://github.com/honnibal/spaCy/blob/master/spacy/tagger.pyx") here].
details
summary: h4 Lemmatization

@@ -2,20 +2,20 @@
- Site.name = "spaCy.io"
- Site.slogan = "Build Tomorrow's Language Technologies"
- Site.description = "spaCy is a library for industrial-strength text processing in Python. If you're a small company doing NLP, we want spaCy to seem like a minor miracle."
- Site.image = "http://spacy.io/resources/img/social.png"
- Site.image_small = "http://spacy.io/resources/img/social_small.png"
- Site.image = "https://spacy.io/resources/img/social.png"
- Site.image_small = "https://spacy.io/resources/img/social_small.png"
- Site.twitter = "spacy_io"
- Site.url = "http://spacy.io"
- Site.url = "https://spacy.io"
-
- Authors = {"matt": {}, "spacy": {}};
- Authors.matt.name = "Matthew Honnibal"
- Authors.matt.bio = "Matthew Honnibal is the author of the <a href=\"http://spacy.io\">spaCy</a> software and the sole founder of its parent company. He studied linguistics as an undergrad, and never thought he'd be a programmer. By 2009 he had a PhD in computer science, and in 2014 he left academia to found Syllogism Co. He's from Sydney and lives in Berlin."
- Authors.matt.bio = "Matthew Honnibal is the author of the <a href=\"https://spacy.io\">spaCy</a> software and the sole founder of its parent company. He studied linguistics as an undergrad, and never thought he'd be a programmer. By 2009 he had a PhD in computer science, and in 2014 he left academia to found Syllogism Co. He's from Sydney and lives in Berlin."
- Authors.matt.image = "/resources/img/matt.png"
- Authors.matt.twitter = "honnibal"
-
- Authors.spacy.name = "SpaCy.io"
- Authors.spacy.bio = "<a href=\"http://spacy.io\">spaCy</a> is a library for industrial-strength natural language processing in Python and Cython. It features state-of-the-art speed and accuracy, a concise API, and great documentation. If you're a small company doing NLP, we want spaCy to seem like a minor miracle."
- Authors.spacy.bio = "<a href=\"https://spacy.io\">spaCy</a> is a library for industrial-strength natural language processing in Python and Cython. It features state-of-the-art speed and accuracy, a concise API, and great documentation. If you're a small company doing NLP, we want spaCy to seem like a minor miracle."
- Authors.spacy.image = "/resources/img/social_small.png"
- Authors.spacy.twitter = "spacy_io"
@@ -27,10 +27,11 @@
- Page.active[type] = true;
- Page.links = [];
- if (type == "home") {
- Page.url = "";
- Page.url = "/";
- } else {
- Page.url = "/" + type;
- }
- Page.canonical_url = Site.url + Page.url;
-
- // Set defaults
- Page.description = Site.description;
@@ -57,6 +58,7 @@
- Page.description = Meta.description
- Page.date = Meta.date
- Page.url = Meta.url
- Page.canonical_url = Site.url + Page.url;
- Page.active["blog"] = true
- Page.links = Meta.links
- if (Meta.image != null) {
@@ -98,6 +100,8 @@ mixin WritePage(Site, Author, Page)
meta(property="og:site_name" content=Site.name)
meta(property="article:published_time" content=getDate(Page.date).timestamp)
link(rel="stylesheet" href="/resources/css/style.css")
if Page.canonical_url
link(rel="canonical" href=Page.canonical_url)
//[if lt IE 9]><script src="//html5shiv.googlecode.com/svn/trunk/html5.js"></script><![endif]
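
The hunks above thread a canonical URL through the page locals and into the document head. A minimal sketch of how the pieces fit together, in plain JavaScript mirroring the Jade locals, with all names and values taken from this diff:

    var Site = { url: "https://spacy.io" };
    var Page = { url: "/" };                      // the home page now uses "/" rather than ""
    Page.canonical_url = Site.url + Page.url;     // "https://spacy.io/"
    // the template then emits: link(rel="canonical" href=Page.canonical_url)

Blog posts get the same treatment via Meta.url, so each page advertises exactly one preferred address to crawlers.
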
@@ -158,10 +162,10 @@ mixin WritePost(Meta)
+WriteAuthorBio(Author)
mixin WriteByline(Author, Meta)
.subhead by #[a(href="//twitter.com/" + Author.twitter, rel="author" target="_blank") #{Author.name}] on #[time #{getDate(Meta.date).fulldate}]
.subhead by #[a(href="https://twitter.com/" + Author.twitter, rel="author" target="_blank") #{Author.name}] on #[time #{getDate(Meta.date).fulldate}]
mixin WriteShareLinks(headline, url, twitter, links)
a.button.button-twitter(href="http://twitter.com/share?text=" + headline + "&url=" + Site.url + url + "&via=" + twitter title="Share on Twitter" target="_blank")
a.button.button-twitter(href="https://twitter.com/share?text=" + headline.replace(/\s+/g, "%20") + "&url=" + Site.url + url + "&via=" + twitter title="Share on Twitter" target="_blank")
| Share on Twitter
if links
.discuss
@@ -174,11 +178,11 @@ mixin WriteShareLinks(headline, url, twitter, links)
| Discuss on #{link.name}
mixin TweetThis(text, url)
p #[span #{text} #[a.share(href='http://twitter.com/share?text="' + text + '"&url=' + Site.url + url + '&via=' + Site.twitter title='Share on Twitter' target='_blank') Tweet]]
p #[span #{text} #[a.share(href='https://twitter.com/share?text="' + text.replace(/\s+/g, "%20") + '"&url=' + Site.url + url + '&via=' + Site.twitter title='Share on Twitter' target='_blank') Tweet]]
mixin WriteAuthorBio(Author)
section.intro.profile
p #[img(src=Author.image)] !{Author.bio} #[span.social #[a(href="//twitter.com/" + Author.twitter target="_blank") Twitter]]
p #[img(src=Author.image alt=Author.name)] !{Author.bio} #[span.social #[a(href="https://twitter.com/" + Author.twitter target="_blank") Twitter]]
- var getDate = function(input) {

View File

@@ -1,5 +1,5 @@
mixin Displacy(sentence, caption_text, height)
- var url = "https://api.spacy.io/displacy/?full=" + sentence.replace(" ", "%20")
- var url = "https://api.spacy.io/displacy/?full=" + sentence.replace(/\s+/g, "%20")
.displacy
iframe.displacy(src="/resources/displacy/displacy_demo.html" height=height)
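
This mixin and the WriteShareLinks and TweetThis mixins above make the same switch from a plain-string replace to a global regex when escaping spaces for URLs. A short sketch of the difference, in plain browser JavaScript (encodeURIComponent is shown only as a general-purpose alternative, not something these templates use):

    // replace() with a string pattern swaps only the first match;
    // the /\s+/g regex collapses every run of whitespace instead.
    "Finding Relevant Tweets".replace(" ", "%20");    // "Finding%20Relevant Tweets"
    "Finding Relevant Tweets".replace(/\s+/g, "%20"); // "Finding%20Relevant%20Tweets"
    encodeURIComponent("Finding Relevant Tweets");    // "Finding%20Relevant%20Tweets"

For headlines made of letters, digits and spaces the regex form is enough; anything containing &, ? or # would need the full encodeURIComponent treatment.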

View File

@@ -58,4 +58,4 @@ mixin example(name)
ul
li: a(href="/docs#api") API documentation
li: a(href="/docs#tutorials") Tutorials
li: a(href="/docs/#spec") Annotation specs
li: a(href="/docs#spec") Annotation specs

View File

@@ -1,6 +1,6 @@
- var Meta = {}
- Meta.author_id = 'spacy'
- Meta.headline = "Tutorial: Adding a language to spaCy"
- Meta.headline = "Adding a language to spaCy"
- Meta.description = "Long awaited documentation for adding a language to spaCy"
- Meta.date = "2015-08-18"
- Meta.url = "/tutorials/add-a-language"

View File

@@ -1,5 +1,5 @@
- var Meta = {}
- Meta.headline = "Tutorial: Load new word vectors"
- Meta.headline = "Load new word vectors"
- Meta.description = "Word vectors allow simple similarity queries, and drive many NLP applications. This tutorial explains how to load custom word vectors into spaCy, to make use of task or data-specific representations."
- Meta.author_id = "matt"
- Meta.date = "2015-09-24"

View File

@@ -1,5 +1,5 @@
- var Meta = {}
- Meta.headline = "Tutorial: Mark all adverbs, particularly for verbs of speech"
- Meta.headline = "Mark all adverbs, particularly for verbs of speech"
- Meta.author_id = 'matt'
- Meta.description = "Let's say you're developing a proofreading tool, or possibly an IDE for writers. You're convinced by Stephen King's advice that adverbs are not your friend so you want to highlight all adverbs."
- Meta.date = "2015-08-18"

View File

@@ -1,5 +1,5 @@
- var Meta = {}
- Meta.headline = "Tutorial: Search Reddit for comments about Google doing something"
- Meta.headline = "Search Reddit for comments about Google doing something"
- Meta.description = "Example use of the spaCy NLP tools for data exploration. Here we will look for Reddit comments that describe Google doing something, i.e. discuss the company's actions. This is difficult, because other senses of \"Google\" now dominate usage of the word in conversation, particularly references to using Google products."
- Meta.author_id = "matt"
- Meta.date = "2015-08-18"

View File

@@ -4,7 +4,7 @@ include ./meta.jade
+WritePost(Meta)
section.intro
p #[a(href="http://spaCy.io") spaCy] is great for data exploration. Poking, prodding and sifting is fundamental to good data science. In this tutorial, we'll do a broad keword search of Twitter, and then sift through the live stream of tweets, zooming in on some topics and excluding others.
p #[a(href="https://spacy.io") spaCy] is great for data exploration. Poking, prodding and sifting is fundamental to good data science. In this tutorial, we'll do a broad keword search of Twitter, and then sift through the live stream of tweets, zooming in on some topics and excluding others.
p An example filter-function:

View File

@@ -1,5 +1,5 @@
- var Meta = {}
- Meta.headline = "Tutorial: Finding Relevant Tweets"
- Meta.headline = "Finding Relevant Tweets"
- Meta.author_id = 'matt'
- Meta.description = "In this tutorial, we will use word vectors to search for tweets about Jeb Bush. We'll do this by building up two word lists: one that represents the type of meanings in the Jeb Bush tweets, and another to help screen out irrelevant tweets that mention the common, ambiguous word 'bush'."
- Meta.date = "2015-08-18"