spaCy/website/docs/tutorials/syntax-search.jade
2016-10-03 20:19:13 +02:00

77 lines
3.1 KiB
Plaintext

include ../../_includes/_mixins
p.u-text-large Example use of the #[a(href="/docs") spaCy NLP tools] for data exploration. Here we will look for Reddit comments that describe Google doing something, i.e. discuss the company's actions. This is difficult, because other senses of "Google" now dominate usage of the word in conversation, particularly references to using Google products.
p The heuristics used are quick and dirty – about 5 minutes work.
+h(2, "imports") Imports
+code.
from __future__ import unicode_literals
from __future__ import print_function
import sys
import plac
import bz2
import ujson
import spacy.en
+h(2, "load-model-and-iterate") Load the Model and Iterate Over the Data
+code.
def main(input_loc):
nlp = spacy.en.English() # Load the model takes 10-20 seconds.
for line in bz2.BZ2File(input_loc): # Iterate over the Reddit comments from the dump.
comment_str = ujson.loads(line)['body'] # Parse the json object, and extract the 'body' attribute.
+h(2, "apply-spacy-pipeline") Apply the spaCy NLP Pipeline, and Look For the Cases We Want
+code.
comment_parse = nlp(comment_str)
for word in comment_parse:
if google_doing_something(word):
# Print the clause
print(''.join(w.string for w in word.head.subtree).strip())
+h(2, "define-filter-function") Define the Filter Function
+code.
def google_doing_something(w):
if w.lower_ != 'google':
return False
# Is it the subject of a verb?
elif w.dep_ != 'nsubj':
return False
# And not 'is'
elif w.head.lemma_ == 'be' and w.head.dep_ != 'aux':
return False
# Exclude e.g. "Google says..."
elif w.head.lemma_ in ('say', 'show'):
return False
else:
return True
+h(2, "call-main") Call main
+code.
if __name__ == '__main__':
plac.call(main)
+h(2, "example-output") Example output
p Many false positives remain. Some are from incorrect interpretations of the sentence by spaCy, some are flaws in our filtering logic. But the results are vastly better than a string-based search, which returns almost no examples of the pattern we're looking for.
+code("markup").
Google dropped support for Android < 4.0 already
google drive
Google to enforce a little more uniformity in its hardware so that we can see a better 3rd party market for things like mounts, cases, etc
When Google responds
Google translate cyka pasterino.
A quick google looks like Synology does have a sync'ing feature which does support block level so that should work
(google came up with some weird One Piece/FairyTail crossover stuff), and is their knowledge universally infallible?
Until you have the gear, google some videos on best farming runs on each planet, you can get a lot REAL fast with the right loop.
Google offers something like this already, but it is truly terrible.
google isn't helping me
Google tells me: 0 results, 250 pages removed from google.
how did Google swoop in and eat our lunch