Python crawler: Scrapy (a first look)
Installing the Scrapy environment
Because I have both Anaconda and Python 3.7 installed, pip kept insisting the package was already installed inside Anaconda (a well-known annoyance), which got tiresome. Out of frustration I switched on a VPN and ran conda install scrapy from CMD, and that installed it.
PS: plain conda install scrapy on its own may well be enough (I haven't tried).
Once the screen shown below appears, the installation is complete.
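To double-check the install, you can also ask Scrapy for its version in the same terminal; any version number printed means the package is usable:

# run in terminal
scrapy version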
Basic Scrapy usage
Create a Scrapy project from the command line: scrapy startproject qiushi
Scrapy will report that the project was created successfully
and then prompt you to cd into the new directory and create your first spider
with the command scrapy genspider example example.com (the general form is scrapy genspider <name> <domain>).
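Putting the whole sequence together, the commands I ran look roughly like this (the spider name first matches the one used in the rest of this post; the domain is just the target site and you can substitute your own):

# run in terminal
scrapy startproject qiushi
cd qiushi
scrapy genspider first www.qiushibaike.com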
Modifying the configuration file
Don't forget the User-Agent; a sketch of the relevant settings follows.
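Concretely, the settings.py tweaks this kind of project usually needs are a browser-like User-Agent and turning off robots.txt compliance. The snippet below is only a sketch, and the UA string is just an example browser string:

# settings.py (sketch)
# pretend to be a regular browser; any current browser UA string will do
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
# do not let robots.txt rules block the crawl
ROBOTSTXT_OBEY = False
# optional: keep the console output quiet except for errors
LOG_LEVEL = 'ERROR'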
Running the project
# run in terminal
scrapy crawl first
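If the default log output buries your own prints, Scrapy's --nolog option suppresses it (the LOG_LEVEL = 'ERROR' setting shown earlier achieves much the same):

# run in terminal
scrapy crawl first --nolog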
Parsing Qiushibaike data
import scrapy


class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath('//*[@id="content"]/div/div[2]/div')
        for div in div_list:
            # xpath returns a list, and its elements are always Selector objects
            # extract() pulls out the string stored in the Selector's data attribute
            # author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
            # extract_first() assumes the list holds exactly one element
            author = div.xpath('./div[1]/a[2]/h2/text()').extract_first()
            content = div.xpath('./a[1]/div/span/text()').extract()
            content = ''.join(content)
            print(author, content)
            break
Result screenshot
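To make the extract() / extract_first() difference more concrete, here is a small standalone sketch using Scrapy's Selector directly on a made-up HTML fragment (the text values are placeholders, not real site data):

from scrapy.selector import Selector

# a tiny made-up fragment, just to show what the two calls return
sel = Selector(text='<div><h2>author</h2><span>line one</span><span>line two</span></div>')

print(sel.xpath('//h2/text()').extract_first())        # 'author' (a single string, or None if nothing matched)
print(sel.xpath('//span/text()').extract())            # ['line one', 'line two'] (always a list of strings)
print(''.join(sel.xpath('//span/text()').extract()))   # 'line oneline two', the same join used in the spider above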
Persistent storage
Persistence based on terminal commands
import scrapy


class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath('//*[@id="content"]/div/div[2]/div')
        all_data = []
        for div in div_list:
            # xpath returns a list, and its elements are always Selector objects
            # extract() pulls out the string stored in the Selector's data attribute
            # author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
            author = div.xpath('./div[1]/a[2]/h2/text()').extract_first()
            content = div.xpath('./a[1]/div/span/text()').extract()
            content = ''.join(content)
            dic = {
                'author': author,
                'content': content
            }
            all_data.append(dic)
        return all_data
Terminal command: scrapy crawl qiubai -o qiushi.csv
# terminal command
(acoda) D:\桌面\acoda\06scrapy模块\qiushi>scrapy crawl qiubai -o qiushi.csv
Spider started...
Spider finished!!!
Note: terminal-command-based storage can only write to files whose extension is one of 'json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle'.
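One extra note from me: when the -o target is a json file, Chinese text is written as \u escape sequences by default; the standard FEED_EXPORT_ENCODING setting fixes that (this is a general Scrapy setting, not something specific to this project):

# settings.py
FEED_EXPORT_ENCODING = 'utf-8'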
Pipeline-based persistent storage
1. Parse the data
2. Define the corresponding fields in the Item class
3. Wrap the parsed data into an object of the Item type
4. Submit the Item object to the pipeline for persistent storage
5. In the pipeline class's process_item method, persist the data carried by the item object it receives
6. Enable the pipeline in the configuration file
Steps 1, 3 and 4: the spider file
import scrapy
from qiushi.items import QiushiItem  # import the QiushiItem class


class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath('//*[@id="content"]/div/div[2]/div')
        for div in div_list:
            # xpath returns a list, and its elements are always Selector objects
            # extract() pulls out the string stored in the Selector's data attribute
            # author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
            author = div.xpath('./div[1]/a[2]/h2/text()').extract_first()
            content = div.xpath('./a[1]/div/span/text()').extract()
            content = ''.join(content)
            item = QiushiItem(author=author, content=content)
            # item['author'] = author
            # item['content'] = content
            # submit the item object to the pipeline for persistent storage
            yield item
Step 2: items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class QiushiItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()
    # pass
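An Item behaves like a dict restricted to the declared fields, which is why both construction styles commented in the spider above work. A quick sketch with made-up values:

from qiushi.items import QiushiItem

item = QiushiItem(author='someone', content='some joke text')   # keyword construction
print(item['author'])          # dict-style read
item['content'] = 'edited'     # dict-style assignment to a declared field
# item['foo'] = 'x'            # would raise KeyError: 'foo' is not a declared field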
Step 5: pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class QiushiPipeline:
    fp = None

    # hook method Scrapy calls exactly once, when the spider starts
    def open_spider(self, spider):
        print('Spider started...')
        self.fp = open('./qiushi.txt', 'w', encoding='utf-8')

    # hook method Scrapy calls exactly once, when the spider closes
    def close_spider(self, spider):
        print('Spider finished!!!')
        self.fp.close()

    def process_item(self, item, spider):
        author = item['author']
        content = item['content']
        self.fp.write(author + content)
        return item
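The ItemAdapter import that the template leaves at the top is unused in this pipeline; it simply wraps an item (or a plain dict) in a uniform mapping interface. A short sketch of the equivalent access, reusing the made-up values from before:

from itemadapter import ItemAdapter
from qiushi.items import QiushiItem

item = QiushiItem(author='someone', content='text')
adapter = ItemAdapter(item)
print(adapter['author'])   # same dict-style access as the item itself
print(adapter.asdict())    # {'author': 'someone', 'content': 'text'}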
Step 6: settings.py
ITEM_PIPELINES = {
    'qiushi.pipelines.QiushiPipeline': 300,
}
# 300 is the priority: the lower the number, the earlier the pipeline runs (i.e. the higher its priority)
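To see what the number actually does, here is a sketch with a second, made-up pipeline class (BackupPipeline is hypothetical, purely for illustration): an item flows through the lower-numbered pipeline first, and only reaches the next one because process_item returns it.

# settings.py (sketch)
ITEM_PIPELINES = {
    'qiushi.pipelines.QiushiPipeline': 300,   # runs first (lower number)
    'qiushi.pipelines.BackupPipeline': 400,   # hypothetical second pipeline, runs after
}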
Note: open_spider(self, spider) and close_spider(self, spider) are not names we invent ourselves; they are hook methods of the item-pipeline interface that Scrapy calls automatically on QiushiPipeline when the spider opens and closes.