
Extract content from blog posts

For me, one of the best Python packages for crawling article-based content is Goose (the `goose3` package):

    from goose3 import Goose
    from pprint import pprint

    # non-strict parsing, give up on the HTTP request after 2 seconds
    g = Goose({'strict': False, 'http_timeout': 2.0})
    article = g.extract(url='https://en.wikipedia.org/wiki/Web_crawler')
    data = article.infos  # all extracted infos as a JSON-like dict
    pprint(data)
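To store the result, the dict can be written out as JSON. The following is only a minimal sketch: `infos_to_json` is a helper name I made up, the key names (`title`, `domain`, `cleaned_text`) are what I assume `article.infos` contains, and `sample_infos` is a made-up stand-in so the snippet runs without network access.

```python
import json


def infos_to_json(infos, fields=("title", "domain", "cleaned_text")):
    """Serialize selected entries of a Goose-style infos dict to a JSON string."""
    # .get() avoids a KeyError if a key is missing from the dict
    selected = {name: infos.get(name) for name in fields}
    # default=str converts values json cannot encode (e.g. datetime) to strings
    return json.dumps(selected, indent=2, default=str, ensure_ascii=False)


# Made-up stand-in for article.infos (assumption, not real crawl output)
sample_infos = {
    "title": "Web crawler",
    "domain": "en.wikipedia.org",
    "cleaned_text": "A Web crawler is an Internet bot ...",
    "links": [],
}
print(infos_to_json(sample_infos))
```

With a real article, `infos_to_json(article.infos)` should work the same way, provided the keys match.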

Extract content from files

  1. Launch tika server

[Tika](http://tika.apache.org) is a toolkit that detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).

The simplest way to launch a local Tika server is via a prebuilt Docker image. Docker is a full development platform for creating containerized apps.

    docker pull logicalspark/docker-tikaserver # only on initial download/update
    docker run --rm -p 9998:9998 logicalspark/docker-tikaserver

The Tika server is now up on port 9998.
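If you want to check this from Python before sending files, a plain TCP connection test is enough for a rough check. This is a sketch with a helper name of my own (`port_is_open`); it only confirms that something is listening on the port, not that it is actually Tika.

```python
import socket


def port_is_open(host="localhost", port=9998, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError on refusal, timeout or DNS failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `port_is_open()` with the defaults checks the port published by the `docker run` command above.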

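  2. Send content to local server

By using the Python tika [package](https://pypi.python.org/pypi/tika), we can send a request to the local server in just a few lines:

    import tika
    tika.TikaClientOnly = True  # client only: use the already running server

    from tika import parser

    filename = 'document.pdf'  # placeholder: path to the file you want to parse
    server_path = 'http://localhost:9998'  # the Docker container started above
    parsed = parser.from_file(filename, server_path)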


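The return value is a dictionary (built from the server's JSON response) with the extracted text under `content` and the file's metadata under `metadata`.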
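The extracted text usually needs a little cleanup before further use. Here is a minimal sketch, assuming the result is a dict with a `content` entry that may be missing or `None`; `extract_text` is a hypothetical helper and `sample_parsed` is made-up data standing in for a real result.

```python
def extract_text(parsed):
    """Return the extracted text without surrounding whitespace and empty lines."""
    # Fall back to an empty string if there is no result or no content
    content = (parsed or {}).get("content") or ""
    lines = (line.strip() for line in content.splitlines())
    return "\n".join(line for line in lines if line)


# Made-up stand-in for the dict returned by parser.from_file (assumption)
sample_parsed = {
    "metadata": {"Content-Type": "application/pdf"},
    "content": "\n\n\n  First line \n\n Second line\n\n",
}
print(extract_text(sample_parsed))
```

This keeps one non-empty line per row of text, which is often enough for indexing or simple text analysis.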