Systematic Python Crawler Learning (5)
Multi-threaded crawlers: in my earlier network-programming study I used multi-threaded sockets to let a single server serve multiple clients at once. Applying multi-threaded programming to a crawler can greatly improve its efficiency.
A multi-threaded Python crawler consists of three parts: creating the threads, defining the thread class, and calling the crawl function inside the thread.
Thread creation: threads are usually created in a for loop. thread.start() starts a thread, and thread.join() blocks the calling thread until that thread has finished.
Sample code:
thread_list = []
for i in range(1, 6):
    thread = MyThread("thread" + str(i), list[i-1])   # list[i-1] is the group of links assigned to this thread
    thread.start()
    thread_list.append(thread)
for thread in thread_list:
    thread.join()
Thread definition: the thread is defined by subclassing threading.Thread. The class usually contains two methods: __init__, which is called automatically when the object is created, and run, which is called automatically when thread.start() is executed. Sample code:
class MyThread(threading.Thread):
    def __init__(self, name, link_s):
        threading.Thread.__init__(self)
        self.name = name
        self.links = link_s                 # keep the links assigned to this thread
    def run(self):
        print('%s is in Process:' % self.name)
        # spider is the crawl function that does the actual work
        spider(self.name, self.links)
        print('%s is out Process' % self.name)
The crawl function is called inside run. The key point of a multi-threaded crawler is tying the threads closely to the crawl function, which means distributing the work among the crawlers, i.e. deciding what each thread should fetch.
First I wrote a small script that writes the Nanjing rental listing URLs of Beike Zhaofang (pages 1-300) into a.txt:
zurl="https://nj.zu.ke.com/zufang/pg" for i in range(101,300): turl=url+str(i)+'\n' print(turl) with open ('a.txt','a+') as f: f.write(turl)
Next, in the main program, read these links into a list:
link_list = []
with open('a.txt', "r") as f:
    file_list = f.readlines()
    for i in file_list:
        i = re.sub('\n', '', i)
        link_list.append(i)
After that, indexing link_list[i] lets us hand a different task to each crawler thread:
max = len(link_list)    # max is the total number of pages
page = 0                # page is the index of the next page to crawl
def spider(threadName):
    global page
    global max
    global num
    while page < max:
        i = page
        page += 1
        try:
            r = requests.get(link_list[i], timeout=20)
            soup = BeautifulSoup(r.content, "lxml")
            house_list = soup.find_all("div", class_="content__list--item")
            for house in house_list:
                num += 1
                house_name = house.find('a', class_="twoline").text.strip()
                house_price = house.find('span', class_="content__list--item-price").text.strip()
                info = "page:" + str(i) + "num:" + str(num) + threadName + house_name + house_price
                print(info)
        except Exception as e:
            print(threadName, "Error", e)
With this, the threads can fetch the listings concurrently. The full code is as follows:
# coding=utf-8
import re
import requests
import threading
import time
from bs4 import BeautifulSoup

num = 0                     # number of listings crawled so far
link_list = []
with open('a.txt', "r") as f:
    file_list = f.readlines()
    for i in file_list:
        i = re.sub('\n', '', i)
        link_list.append(i)

max = len(link_list)        # max is the total number of pages
page = 0                    # page is the index of the next page to crawl
print(max)

class MyThread(threading.Thread):
    def __init__(self, name):
        threading.Thread.__init__(self)
        self.name = name
    def run(self):
        print('%s is in Process:' % self.name)
        spider(self.name)
        print('%s is out Process' % self.name)

def spider(threadName):
    global page
    global max
    global num
    while page < max:
        i = page
        page += 1
        try:
            r = requests.get(link_list[i], timeout=20)
            soup = BeautifulSoup(r.content, "lxml")
            house_list = soup.find_all("div", class_="content__list--item")
            for house in house_list:
                num += 1
                house_name = house.find('a', class_="twoline").text.strip()
                house_price = house.find('span', class_="content__list--item-price").text.strip()
                info = "page:" + str(i) + "num:" + str(num) + threadName + house_name + house_price
                print(info)
        except Exception as e:
            print(threadName, "Error", e)

start = time.time()
thread_list = []
for i in range(1, 6):
    thread = MyThread("thread" + str(i))
    thread.start()
    thread_list.append(thread)
for thread in thread_list:
    thread.join()
end = time.time()
print("All using time:", end - start)
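One caveat about this code: page and num are plain global variables shared by all five threads, and the read-then-increment on page is not atomic, so two threads could in principle grab the same index. A minimal sketch of protecting the counter with threading.Lock (the lock and the next_page helper are my additions, not part of the original code):

lock = threading.Lock()

def next_page():
    # hand out page indices one at a time; the lock makes the
    # read-and-increment step atomic across threads
    global page
    with lock:
        i = page
        page += 1
    return i

Inside spider, the pair of statements i = page / page += 1 would then become i = next_page().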
In addition, a multi-threaded crawler can be combined with a queue to produce a full-speed crawler, which is a bit faster still. The complete code is as follows:
# coding:utf-8
import threading
import time
import re
import requests
import queue as Queue

link_list = []
with open('a.txt', 'r') as f:
    file_list = f.readlines()
    for each in file_list:
        each = re.sub('\n', '', each)
        link_list.append(each)

class MyThread(threading.Thread):
    def __init__(self, name, q):
        threading.Thread.__init__(self)
        self.name = name
        self.q = q
    def run(self):
        print("%s is start " % self.name)
        crawel(self.name, self.q)
        print("%s is end " % self.name)

def crawel(threadname, q):
    while not q.empty():
        temp_url = q.get(timeout=1)
        try:
            r = requests.get(temp_url, timeout=20)
            print(threadname, r.status_code, temp_url)
        except Exception as e:
            print("Error", e)
            pass

if __name__ == '__main__':
    start = time.time()
    thread_list = []
    thread_Name = ['Thread-1', 'Thread-2', 'Thread-3', 'Thread-4', 'Thread-5']
    workQueue = Queue.Queue(1000)
    # fill the queue
    for url in link_list:
        workQueue.put(url)
    # create the threads
    for tname in thread_Name:
        thread = MyThread(tname, workQueue)
        thread.start()
        thread_list.append(thread)
    for t in thread_list:
        t.join()
    end = time.time()
    print("All using time:", end - start)
    print("Exiting Main Thread")
Crawling with a queue requires the queue module. Besides the threading knowledge, we need some queue knowledge to go with it. The three key queue-related steps in the code above are: creating and filling the queue, passing the queue to the threads, and consuming the queue until it is empty. They are shown below.
1⃣️ Creating and filling the queue:
workQueue = Queue.Queue(1000)
# fill the queue
for url in link_list:
    workQueue.put(url)
2⃣️ Passing the queue to the threads:
thread=MyThread(tname,workQueue)
3⃣️ Consuming the queue until it is empty:
def crawel(threadname, q):
    while not q.empty():
        pass
The idea behind the queue is first in, first out: once the queue has been drained, the work is done.
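As a tiny stand-alone illustration of that first-in, first-out behaviour (a demonstration snippet, not part of the crawler itself):

import queue

q = queue.Queue()
for url in ["url-1", "url-2", "url-3"]:
    q.put(url)            # items enter at the back of the queue

while not q.empty():
    print(q.get())        # items come out in the same order: url-1, url-2, url-3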
Multi-process crawlers: in general a multi-process crawler can be built in two ways: with multiprocessing (Process), or with Pool + Queue.
multiprocessing is used in much the same way as threading; only a few pieces of code need to be replaced, namely the definition and initialization of the process and the way the process is ended. The two fragments are shown below, followed by a complete sketch.
1⃣️ Defining and initializing the process:
class Myprocess(Process):
    def __init__(self):
        Process.__init__(self)
2⃣️ Ending child processes with the parent: once this flag is set, a child process is terminated automatically when the parent process exits:
p.daemon=True
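Putting the two fragments together, a minimal sketch of the multiprocessing.Process version could look like this. It mirrors the queue-based threaded crawler above, swapping threading.Thread for Process and queue.Queue for multiprocessing.Queue; this is my own sketch under those assumptions, not code from the original post:

# coding:utf-8
import re
import time
import requests
from multiprocessing import Process, Queue
import queue                          # only needed for the Empty exception

class Myprocess(Process):
    def __init__(self, name, q):
        Process.__init__(self)
        self.name = name
        self.q = q
    def run(self):
        print("%s is start " % self.name)
        crawel(self.name, self.q)
        print("%s is end " % self.name)

def crawel(name, q):
    while True:
        try:
            temp_url = q.get(timeout=1)   # stop once the queue stays empty
        except queue.Empty:
            break
        try:
            r = requests.get(temp_url, timeout=20)
            print(name, r.status_code, temp_url)
        except Exception as e:
            print(name, "Error", e)

if __name__ == '__main__':
    link_list = []
    with open('a.txt', 'r') as f:
        for each in f.readlines():
            link_list.append(re.sub('\n', '', each))

    start = time.time()
    workQueue = Queue(1000)
    for url in link_list:
        workQueue.put(url)

    process_list = []
    for i in range(1, 4):
        p = Myprocess("Process-" + str(i), workQueue)
        p.daemon = True               # children are killed when the parent exits
        p.start()
        process_list.append(p)
    for p in process_list:
        p.join()
    end = time.time()
    print("All using time:", end - start)

The worker uses get(timeout=1) with the queue.Empty exception instead of q.empty(), because empty() is not reliable across processes with multiprocessing.Queue.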
The other approach is to combine Manager with Pool:
manager = Manager()
workQueue = manager.Queue(1000)
for url in link_list:
    workQueue.put(url)
pool = Pool(processes=3)
for i in range(1, 5):
    pool.apply_async(crawler, args=(workQueue, i))
pool.close()
pool.join()
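The snippet above calls a crawler function that is not shown. A possible implementation is sketched below; the function name and its two parameters (the shared queue and a worker index) are inferred from the apply_async call, so they are assumptions rather than the original code:

import queue                          # only for the Empty exception
import requests

def crawler(q, index):
    # q is the managed queue shared by the pool workers; index just labels the worker
    name = "Process-" + str(index)
    while True:
        try:
            temp_url = q.get(timeout=1)   # exit once the queue stays empty
        except queue.Empty:
            break
        try:
            r = requests.get(temp_url, timeout=20)
            print(name, r.status_code, temp_url)
        except Exception as e:
            print(name, "Error", e)

A Manager queue is used here because a plain multiprocessing.Queue cannot be passed to pool workers through apply_async arguments, while the managed proxy can.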