Python crawler: Scrapy (a first look)
Installing the Scrapy environment
Because I have both Anaconda and Python 3.7 installed, pip kept insisting the package was already installed inside Anaconda (a well-known annoyance), which got tiresome. Out of frustration I switched on a VPN and ran conda install scrapy from CMD, and that installed it.
PS: plain conda install scrapy on its own may well be enough (I haven't tried).
Once the screen shown below appears, the installation is complete.
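To double-check the install, you can also ask Scrapy for its version in the same terminal; any version number printed means the package is usable:

# run in terminal
scrapy version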
Basic Scrapy usage
Create a Scrapy project from the command line: scrapy startproject qiushi
Scrapy will report that the project was created successfully
and then prompt you to cd into the new directory and create your first spider
with the command scrapy genspider example example.com (the general form is scrapy genspider <name> <domain>).
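Putting the whole sequence together, the commands I ran look roughly like this (the spider name first matches the one used in the rest of this post; the domain is just the target site and you can substitute your own):

# run in terminal
scrapy startproject qiushi
cd qiushi
scrapy genspider first www.qiushibaike.com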
Modifying the configuration file
Don't forget the User-Agent; a sketch of the relevant settings follows.
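Concretely, the settings.py tweaks this kind of project usually needs are a browser-like User-Agent and turning off robots.txt compliance. The snippet below is only a sketch, and the UA string is just an example browser string:

# settings.py (sketch)
# pretend to be a regular browser; any current browser UA string will do
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
# do not let robots.txt rules block the crawl
ROBOTSTXT_OBEY = False
# optional: keep the console output quiet except for errors
LOG_LEVEL = 'ERROR'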
Running the project
# run in terminal
scrapy crawl first
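If the default log output buries your own prints, Scrapy's --nolog option suppresses it (the LOG_LEVEL = 'ERROR' setting shown earlier achieves much the same):

# run in terminal
scrapy crawl first --nolog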
Parsing Qiushibaike data
import scrapy


class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath('//*[@id="content"]/div/div[2]/div')
        for div in div_list:
            # xpath returns a list, and its elements are always Selector objects
            # extract() pulls out the string stored in the Selector's data attribute
            # author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
            # extract_first() assumes the list holds exactly one element
            author = div.xpath('./div[1]/a[2]/h2/text()').extract_first()
            content = div.xpath('./a[1]/div/span/text()').extract()
            content = ''.join(content)
            print(author, content)
            break
Result screenshot
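To make the extract() / extract_first() difference more concrete, here is a small standalone sketch using Scrapy's Selector directly on a made-up HTML fragment (the text values are placeholders, not real site data):

from scrapy.selector import Selector

# a tiny made-up fragment, just to show what the two calls return
sel = Selector(text='<div><h2>author</h2><span>line one</span><span>line two</span></div>')

print(sel.xpath('//h2/text()').extract_first())        # 'author' (a single string, or None if nothing matched)
print(sel.xpath('//span/text()').extract())            # ['line one', 'line two'] (always a list of strings)
print(''.join(sel.xpath('//span/text()').extract()))   # 'line oneline two', the same join used in the spider above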
Persistent storage
Persistence based on terminal commands
import scrapy


class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath('//*[@id="content"]/div/div[2]/div')
        all_data = []
        for div in div_list:
            # xpath returns a list, and its elements are always Selector objects
            # extract() pulls out the string stored in the Selector's data attribute
            # author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
            author = div.xpath('./div[1]/a[2]/h2/text()').extract_first()
            content = div.xpath('./a[1]/div/span/text()').extract()
            content = ''.join(content)
            dic = {
                'author': author,
                'content': content
            }
            all_data.append(dic)
        return all_data
Terminal command: scrapy crawl qiubai -o qiushi.csv
# terminal command
(acoda) D:\桌面\acoda\06scrapy模块\qiushi>scrapy crawl qiubai -o qiushi.csv
Spider started...
Spider finished!!!
Note: terminal-command-based storage can only write to files whose extension is one of 'json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle'.
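One extra note from me: when the -o target is a json file, Chinese text is written as \u escape sequences by default; the standard FEED_EXPORT_ENCODING setting fixes that (this is a general Scrapy setting, not something specific to this project):

# settings.py
FEED_EXPORT_ENCODING = 'utf-8'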
Pipeline-based persistent storage
1. Parse the data
2. Define the corresponding fields in the Item class
3. Wrap the parsed data into an object of the Item type
4. Submit the Item object to the pipeline for persistent storage
5. In the pipeline class's process_item method, persist the data carried by the item object it receives
6. Enable the pipeline in the configuration file
Steps 1, 3 and 4: the spider file
import scrapy
from qiushi.items import QiushiItem  # import the QiushiItem class


class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath('//*[@id="content"]/div/div[2]/div')
        for div in div_list:
            # xpath returns a list, and its elements are always Selector objects
            # extract() pulls out the string stored in the Selector's data attribute
            # author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
            author = div.xpath('./div[1]/a[2]/h2/text()').extract_first()
            content = div.xpath('./a[1]/div/span/text()').extract()
            content = ''.join(content)
            item = QiushiItem(author=author, content=content)
            # item['author'] = author
            # item['content'] = content
            # submit the item object to the pipeline for persistent storage
            yield item
Step 2: items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class QiushiItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()
    # pass
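An Item behaves like a dict restricted to the declared fields, which is why both construction styles commented in the spider above work. A quick sketch with made-up values:

from qiushi.items import QiushiItem

item = QiushiItem(author='someone', content='some joke text')   # keyword construction
print(item['author'])          # dict-style read
item['content'] = 'edited'     # dict-style assignment to a declared field
# item['foo'] = 'x'            # would raise KeyError: 'foo' is not a declared field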
Step 5: pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class QiushiPipeline:
    fp = None

    # hook method Scrapy calls exactly once, when the spider starts
    def open_spider(self, spider):
        print('Spider started...')
        self.fp = open('./qiushi.txt', 'w', encoding='utf-8')

    # hook method Scrapy calls exactly once, when the spider closes
    def close_spider(self, spider):
        print('Spider finished!!!')
        self.fp.close()

    def process_item(self, item, spider):
        author = item['author']
        content = item['content']
        self.fp.write(author + content)
        return item
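The ItemAdapter import that the template leaves at the top is unused in this pipeline; it simply wraps an item (or a plain dict) in a uniform mapping interface. A short sketch of the equivalent access, reusing the made-up values from before:

from itemadapter import ItemAdapter
from qiushi.items import QiushiItem

item = QiushiItem(author='someone', content='text')
adapter = ItemAdapter(item)
print(adapter['author'])   # same dict-style access as the item itself
print(adapter.asdict())    # {'author': 'someone', 'content': 'text'}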
Step 6: settings.py
ITEM_PIPELINES = {
    'qiushi.pipelines.QiushiPipeline': 300,
}
# 300 is the priority: the lower the number, the earlier the pipeline runs (i.e. the higher its priority)
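To see what the number actually does, here is a sketch with a second, made-up pipeline class (BackupPipeline is hypothetical, purely for illustration): an item flows through the lower-numbered pipeline first, and only reaches the next one because process_item returns it.

# settings.py (sketch)
ITEM_PIPELINES = {
    'qiushi.pipelines.QiushiPipeline': 300,   # runs first (lower number)
    'qiushi.pipelines.BackupPipeline': 400,   # hypothetical second pipeline, runs after
}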
Note: open_spider(self, spider) and close_spider(self, spider) are not names we invent ourselves; they are hook methods of the item-pipeline interface that Scrapy calls automatically on QiushiPipeline when the spider opens and closes.