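Want to Learn Web Scraping? This One Article Is All You Need! (Python)

Overview: How do you build a crawler with the Scrapy framework? How do you extract the scraped data, and how do you process the content? Read on to find out.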

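Installing Scrapy (macOS)

# Install
pip install scrapy

Note: do not install with the Python that ships with the Command Line Tools, or you may hit an architecture error. Install with a Python obtained through brew instead.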

Creating a project

# Create a new project
scrapy startproject demo

Here demo is the project name.

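Defining the target

Edit items.py and add the fields you want to scrape, for example:

# These lines go inside the scrapy.Item subclass in items.py
name = scrapy.Field()
age = scrapy.Field()
hobby = scrapy.Field()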

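Writing the spider

cd demo

# Generate the spider file demo.py
# (if Scrapy rejects a spider name that matches the project name, pick another name and use it in the later crawl commands)
scrapy genspider demo "baidu.com"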

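In demo.py, change start_urls to the address you actually want to crawl, for example https://www.cnblogs.com/teark/, and rewrite the parse() method. As a first step, parse() can simply save the downloaded page:

# Without encoding you may run into encoding problems
with open("goal.html", "w", encoding='utf-8') as f:
    f.write(response.text)

Once the HTML has been saved, this snippet has served its purpose and can be commented out.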
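
To see why the encoding argument matters, here is a small sketch that runs without Scrapy. The HTML string and the temporary file location are made up for illustration. In a real spider the text would come from response.text.

```python
import os
import tempfile

# Hypothetical page content; response.text in a real spider is a str like this.
page_text = "<html><body><h3>爬虫入门</h3></body></html>"


def save_page(text, path):
    """Write text as UTF-8 so the result does not depend on the platform default."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def load_page(path):
    """Read the file back with the same encoding it was written in."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


path = os.path.join(tempfile.mkdtemp(), "goal.html")
save_page(page_text, path)
restored = load_page(path)

# A codec such as ASCII cannot represent these characters at all, which is
# the kind of failure an unsuitable default encoding can produce.
try:
    page_text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

Passing the encoding explicitly on both the write and the read keeps the saved page identical to what was downloaded, whatever the operating system's default happens to be.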

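Crawling

# Run the spider
scrapy crawl demo

When the run finishes, goal.html appears in the current directory. Open it and use it to work out how to locate the data.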

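Debugging the page

The goal of this step is to locate elements and settle on the XPath expressions. Use scrapy shell for this (installing ipython first is recommended, since it gives you completion hints), and move the expressions into the spider once they work, for example:

scrapy shell "https://www.cnblogs.com/teark/"
# then, at the shell prompt:
site = response.xpath('//*[@class="even"]')
print(site[0].xpath('./td[2]/text()').extract()[0])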

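Extraction

Extraction means implementing the parse method: pull the target data out of the page you saved earlier, usually with XPath expressions, for example: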

def parse(self, response):
    # ItcastItem must be imported from the project's items.py
    for each in response.xpath("//div[@class='teark_article']"):
        item = ItcastItem()
        title = each.xpath("h3/text()").extract()
        content = each.xpath("p/text()").extract()
        item['title'] = title[0]
        item['content'] = content[0]
        yield item

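At this point the saved HTML file and the code that wrote it have done their job and can be deleted. The spider now extracts the target values directly into the fields we defined.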
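
The pattern in parse() is to select each container node first and then query relative to it. The following sketch shows that pattern with the standard library only. It uses xml.etree.ElementTree as a stand-in for Scrapy selectors, which is an assumption made for the sake of a runnable example: ElementTree supports only a subset of XPath and needs well-formed markup. The sample markup and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup shaped like the page the spider targets.
SAMPLE = """
<html><body>
  <div class="teark_article"><h3>First post</h3><p>Hello</p></div>
  <div class="teark_article"><h3>Second post</h3><p>World</p></div>
  <div class="sidebar"><h3>Ignored</h3><p>Not an article</p></div>
</body></html>
"""


def extract_articles(markup):
    """Yield one dict per article block, mirroring the loop in parse()."""
    root = ET.fromstring(markup)
    # Select every matching container first ...
    for node in root.findall(".//div[@class='teark_article']"):
        # ... then query relative to that container only.
        yield {
            "title": node.findtext("h3"),
            "content": node.findtext("p"),
        }


articles = list(extract_articles(SAMPLE))
```

Because the inner lookups are relative to each container, the heading inside the sidebar block is never picked up, even though it is also an h3.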

# Save the data; any of these formats works: json, jsonl, csv, xml
scrapy crawl demo -o demo.json

This produces demo.json, which contains the data we wanted.
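
The json and jsonl formats hold the same data in different layouts. The sketch below illustrates the difference with the standard json module. It is not Scrapy's exporter code, and the sample items are hypothetical.

```python
import json

# Hypothetical items, shaped like what parse() yields.
items = [
    {"title": "First post", "content": "Hello"},
    {"title": "Second post", "content": "World"},
]

# JSON: the whole result is one array, so it has to be parsed as one document.
as_json = json.dumps(items, ensure_ascii=False)

# JSON Lines: one independent JSON object per line, easy to append to and stream.
as_jsonl = "\n".join(json.dumps(item, ensure_ascii=False) for item in items)

# Both layouts decode back to the same list of items.
from_json = json.loads(as_json)
from_jsonl = [json.loads(line) for line in as_jsonl.splitlines()]
```

JSON Lines tends to be the more convenient choice for large crawls, since each line can be processed on its own without loading the entire file.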

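Processing content: pipelines

Typing such a long command for every crawl is tedious. Can it be shorter? Yes: an item pipeline receives every item the spider yields, and here we use one to write the items to a file: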

import json

from itemadapter import ItemAdapter


class TeacherPipeline:
    def __init__(self):
        # Binary mode, so everything written must be bytes
        self.file = open('teacher.json', 'wb')

    def process_item(self, item, spider):
        content = json.dumps(dict(item), ensure_ascii=False) + "\n"
        # Write the UTF-8 encoded bytes so non-ASCII text is stored correctly
        self.file.write(content.encode())
        return item

    def close_spider(self, spider):
        self.file.close()
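
A pipeline is an ordinary Python class, so its lifecycle can be exercised by hand. The sketch below is a hypothetical variant of the pipeline above: the class name, the path parameter, and the sample items are illustrative, and the file is opened in the open_spider hook rather than in __init__. Passing spider=None is only a shortcut for this demonstration. Inside a crawl, Scrapy supplies the spider object and calls these methods itself.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical variant of the pipeline above; name and path are illustrative."""

    def __init__(self, path="teacher.json"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts: acquire the file here.
        self.file = open(self.path, "wb")

    def process_item(self, item, spider):
        # Called once per item; return the item so later pipelines still see it.
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line.encode("utf-8"))
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: release the file.
        self.file.close()


# Drive the hooks by hand, in the order they are called during a crawl.
out_path = os.path.join(tempfile.mkdtemp(), "teacher.json")
pipeline = JsonLinesPipeline(out_path)
pipeline.open_spider(spider=None)
for item in [{"title": "老师甲"}, {"title": "Teacher B"}]:
    pipeline.process_item(item, spider=None)
pipeline.close_spider(spider=None)

with open(out_path, encoding="utf-8") as f:
    written = f.read().splitlines()
```

Opening the file in open_spider keeps acquiring and releasing the resource symmetrical, and ensure_ascii=False together with the explicit UTF-8 encode keeps non-ASCII text readable in the output file.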

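You also need to register the pipeline under ITEM_PIPELINES in settings.py. Make sure the path matches your own project, for example:

ITEM_PIPELINES = {
    "teacher.pipelines.TeacherPipeline": 300,
}

Now the crawl is started with just scrapy crawl demo. Check that teacher.json has been generated in the current directory. The command we have to type is now a little shorter.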
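
The number attached to each pipeline is an ordering value: items pass through the enabled pipelines from the lowest number to the highest, and values are conventionally chosen between 0 and 1000. The sketch below only sorts a dictionary to show the resulting order. It is not how Scrapy is implemented, and the second pipeline name is made up for illustration.

```python
# Hypothetical settings: CleanupPipeline does not exist in the original project.
ITEM_PIPELINES = {
    "teacher.pipelines.TeacherPipeline": 300,
    "teacher.pipelines.CleanupPipeline": 100,
}

# Items pass through pipelines from the lowest number to the highest.
run_order = [
    path
    for path, priority in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])
]
```

With these values a cleanup step would see each item before the pipeline that writes it to disk, which is usually the order you want.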

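Quick reference

Create (outer directory): scrapy startproject mySpider

Generate (inner directory): scrapy genspider name "url"

Run (outer directory): scrapy crawl <name of the spider generated in the inner directory>

Save (outer directory): scrapy crawl itcast -o teachers.json (also jsonl, csv, xml)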