scrapy

In this post: use Python Scrapy to crawl news articles from a news site and store them in Elasticsearch for full-text search.

Install Scrapy

pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple scrapy
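
To confirm the installation, you can print the installed version (a quick check, not part of the original walkthrough):

scrapy version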

Create the project

Create a new project in the current directory from the command line:

scrapy startproject tutorial
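
The command generates a standard Scrapy project skeleton, roughly like this (exact files may vary by Scrapy version):

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions
        pipelines.py
        settings.py
        spiders/          # spiders go here
            __init__.py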

Write the Item class (the entity class)

Modify items.py as follows:

import scrapy

# The class name is arbitrary
class DmozItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = scrapy.Field()
    title = scrapy.Field()
    publish = scrapy.Field()
    content = scrapy.Field()
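
A scrapy.Item behaves like a dict and can be converted to a plain dict when another library expects one (this matters later, when documents are handed to Elasticsearch). A minimal sketch with hypothetical field values:

from tutorial.items import DmozItem

item = DmozItem(url=['https://www.defensenews.com/example'], title='Example headline')  # hypothetical values
doc = dict(item)  # plain dict, usable as an Elasticsearch document body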

Write the spider class

Create a new file named dmoz_spider.py under the spiders folder:

import scrapy
import re
import urllib.request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from tutorial.items import DmozItem
from elasticsearch import Elasticsearch

es = Elasticsearch()
index_mappings = {
    "mappings": {
        "news": {
            "properties": {
                "url": {"type": "text"},
                "title": {"type": "text"},
                "publish": {"type": "text"},
                "content": {"type": "text"},
            }
        }
    }
}

if not es.indices.exists('defensenews'):
    es.indices.create('defensenews', body=index_mappings)
# keywords = ['E-2D', 'E-2C', 'DARPA', 'THAAD', 'SMART-L', 'F-22', 'RADAR']

class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["defensenews.com"]
    start_urls = [
        "https://www.defensenews.com"
    ]
    rules = [Rule(LinkExtractor(allow=r"/.*/2017/\d+/\d+/.*"), follow=True, callback='parse_item')]

    def parse_item(self, response):
        url = response.url
        # If item used the DmozItem entity class below, the final item would not be
        # in dict format and could not be stored into es.
        # item = DmozItem()
        item = {}
        item['url'] = [response.url]
        try:
            response = urllib.request.urlopen(url)
            text = response.read().decode('utf-8')
            title = re.findall('head col-sm-12.*?<h1>(.*?)</h1>', text, re.S)[0]
            publish = re.findall('publish addthis.*?</i>(.*?)</span>', text, re.S)[0].strip()
            pattern = """<div class="row"> <div class="col-md-12 col-xs-12 col-print-12"> <p class="element element-paragraph">(.*?)</p> </div> </div>"""
            content = '<br />'.join(re.findall(pattern, text, re.S))
            item['title'] = title
            item['publish'] = publish
            item['content'] = content
            es.index('defensenews', doc_type='news', body=item)
        except Exception:
            # skip pages that do not match the article layout
            pass
        return item

allowed_domains restricts the crawl to the listed domains, and start_urls sets the starting URL(s); more than one of each may be given.

Note in particular that the class inherits from CrawlSpider. If it only inherits from scrapy.Spider, rules cannot be used and the spider will raise an error!
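
Once articles are indexed, Elasticsearch can be queried with a full-text match query, which is the point of storing the news there. A minimal sketch, assuming the index and field names used above; 'radar' is just an example keyword:

from elasticsearch import Elasticsearch

es = Elasticsearch()
# match query against the article body; returns scored full-text hits
result = es.search(index='defensenews', body={'query': {'match': {'content': 'radar'}}})
for hit in result['hits']['hits']:
    print(hit['_source']['title'], hit['_source']['url'])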

Persisting the crawl

From the project root, start the spider with the following command, where dmoz is the name value specified in the spider class.

scrapy crawl dmoz -s JOBDIR=crawls/dmoz-1

Press Ctrl-C to pause the crawl. Running the same command again resumes it.
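
The job directory can also be configured in settings.py instead of on the command line, since JOBDIR is an ordinary Scrapy setting; a minimal sketch:

# tutorial/settings.py
# Persist crawl state between runs so the spider can be paused and resumed.
JOBDIR = 'crawls/dmoz-1'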
