
Using selenium + webdriver in Scrapy to get the rendered page source and crawl jianshu.com

Some of the data on Jianshu (jianshu.com) is rendered by JavaScript, so the response body returned by a normal Scrapy request does not contain it. To work around this, we use selenium + webdriver to drive a real browser and take the fully rendered page source, as sketched below.
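As a quick illustration of the difference, here is a minimal sketch of my own (not from the original post): the article URL below is hypothetical, and _1RuRku is the title class the spider relies on later.

# Sketch only: compare a plain HTTP fetch with the selenium-rendered source.
# The URL is a made-up example of a Jianshu article address.
import requests
from selenium import webdriver

url = "https://www.jianshu.com/p/0123456789ab"

# Plain request: the JavaScript-rendered nodes are expected to be missing
plain_html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text
print("_1RuRku" in plain_html)          # expected: False

# Selenium drives a real browser, so page_source holds the rendered DOM
driver = webdriver.Chrome()
driver.get(url)
print("_1RuRku" in driver.page_source)  # expected: True
driver.quit()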

1. Define the fields to scrape

import scrapy


class JianshuItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
    author_img = scrapy.Field()
    time = scrapy.Field()
    read_count = scrapy.Field()
    subjects = scrapy.Field()

2. Use selenium + webdriver in a downloader middleware

from scrapy import signals
from scrapy.http.response.html import HtmlResponse
from selenium import webdriver
# explicit wait
from selenium.webdriver.support.ui import WebDriverWait


class SeleniumDownloaderMiddleware:
    def __init__(self):
        # Load the Chrome driver. If chromedriver.exe is in the same directory as python.exe,
        # executable_path can be omitted, i.e. self.driver = webdriver.Chrome() is enough.
        self.driver = webdriver.Chrome(executable_path=r"D:\python\chromedriver.exe")

    def process_request(self, request, spider):
        print("-" * 40)
        print(id(self))
        print("-" * 40)
        self.driver.get(request.url)
        try:
            while True:
                # Wait for the "show more" button, then click it to load more content
                WebDriverWait(self.driver, 3).until(
                    lambda x: x.find_element_by_class_name("H7E3vT"))
                # show_more = self.driver.find_element_by_xpath("//div[@class='H7E3vT']")
                show_more = self.driver.find_element_by_class_name("H7E3vT")
                show_more.click()
        except Exception:
            print("No more 'show more' button found")
        # Grab the fully rendered page source
        html = self.driver.page_source
        # Use url=self.driver.current_url rather than url=request.url, because a redirect
        # may have changed the URL.
        response = HtmlResponse(url=self.driver.current_url, body=html,
                                request=request, encoding="utf-8")
        # Returning a response here hands it straight back to the engine,
        # so the request is never sent on to the downloader.
        return response
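One thing the middleware above never does is quit Chrome when the crawl ends. A minimal sketch of how that could be added (my addition, not part of the original post), using Scrapy's standard signal hookup:

# Sketch only: extra methods for SeleniumDownloaderMiddleware so the browser
# is shut down when the spider closes. __init__ and process_request stay as above.
from scrapy import signals


class SeleniumDownloaderMiddleware:

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        # Run spider_closed() when Scrapy fires the spider_closed signal
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def spider_closed(self, spider):
        # Quit the Chrome instance started in __init__
        self.driver.quit()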

3. Write the spider that parses the data

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from scrapylearn.jianshu.jianshu.items import JianshuItem


class JianshuspiderSpider(CrawlSpider):
    name = 'jianshuspider'
    allowed_domains = ['jianshu.com']
    start_urls = ['http://jianshu.com/']

    # Follow every link whose path looks like a Jianshu article (/p/ + 12-character slug)
    rules = (
        Rule(LinkExtractor(allow=r'.*/p/[0-9a-z]{12}.*'), callback='parse_detail', follow=True),
    )

    def parse_detail(self, response):
        title = response.xpath("//h1[@class='_1RuRku']/text()").get()
        author = response.xpath("//span[@class='FxYr8x']/a/text()").get()
        author_img = response.xpath("//img[@class='_13D2Eh']/@src").get()
        time = response.xpath("//div[@class='s-dsoj']/time/text()").get()
        read_count = response.xpath("//div[@class='s-dsoj']/span[2]/text()").get().split()[1].replace(",", "")
        subjects = ",".join(response.xpath("//div[@class='_2Nttfz']/a/span/text()").getall())
        yield JianshuItem(title=title, author=author, author_img=author_img,
                          time=time, read_count=read_count, subjects=subjects)

    # Unused stub left over from the `scrapy genspider -t crawl` template
    def parse_item(self, response):
        item = {}
        # item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        # item['name'] = response.xpath('//div[@id="name"]').get()
        # item['description'] = response.xpath('//div[@id="description"]').get()
        return item
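The LinkExtractor rule only follows URLs that look like Jianshu articles: a path containing /p/ followed by a 12-character slug. A quick check of that pattern against two made-up URLs (my illustration, not from the post):

import re

pattern = r'.*/p/[0-9a-z]{12}.*'
# Hypothetical URLs, used only to show what the rule does and does not match
print(bool(re.match(pattern, "https://www.jianshu.com/p/1a2b3c4d5e6f")))  # True: article page
print(bool(re.match(pattern, "https://www.jianshu.com/u/some_author")))   # False: not an article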

4. Save the data to MySQL

import pymysql


class JianshuPipeline:
    def __init__(self):
        # Open the database connection once, when the pipeline is created
        self.conn = pymysql.connect(
            host='localhost',
            port=3307,
            user='root',
            password='1612480331',
            database='houses',
            charset='utf8'
        )

    def process_item(self, item, spider):
        print("=" * 40)
        print(id(self))
        print("=" * 40)
        # Create a cursor and insert one row per item
        cursor = self.conn.cursor()
        sql = "insert into jianshu values (%s,%s,%s,%s,%s,%s)"
        cursor.execute(sql, (item["title"], item["author"], item["author_img"],
                             item["time"], item["read_count"], item["subjects"]))
        self.conn.commit()
        cursor.close()
        return item

    # Called when the spider is closed
    def close_spider(self, spider):
        self.conn.close()
        print("Spider finished")
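The INSERT statement assumes a six-column jianshu table already exists in the houses database. The original post never shows the schema, so the following is only a plausible guess (column names and types are assumptions), created here with pymysql:

import pymysql

# Assumed schema -- the real table definition is not shown in the post.
conn = pymysql.connect(host='localhost', port=3307, user='root',
                       password='1612480331', database='houses', charset='utf8')
with conn.cursor() as cursor:
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS jianshu (
            title      VARCHAR(255),
            author     VARCHAR(100),
            author_img VARCHAR(500),
            time       VARCHAR(50),
            read_count VARCHAR(20),
            subjects   VARCHAR(500)
        ) CHARACTER SET utf8
    """)
conn.commit()
conn.close()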

5. Configure settings.py

DOWNLOADER_MIDDLEWARES = {
    # 'jianshu.middlewares.JianshuDownloaderMiddleware': 543,
    'jianshu.middlewares.SeleniumDownloaderMiddleware': 1,
}

ITEM_PIPELINES = {
    'jianshu.pipelines.JianshuPipeline': 300,
}

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
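With the middleware and pipeline enabled, the crawl is normally started with scrapy crawl jianshuspider from the project directory. An equivalent programmatic entry point (a sketch, assuming the script lives inside the Scrapy project so settings.py can be found) looks like this:

# run.py -- sketch of a programmatic way to start the crawl
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # loads settings.py
process.crawl('jianshuspider')                    # the spider's `name` attribute
process.start()                                   # blocks until the crawl finishes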
