Scrapy crawler: collecting job51 listings (second approach)

Updated: 2025-05-19 12:30:24

How this differs from the first approach

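The first approach starts from "search.51job/list/190200,000000,0000,00,9,99,Java,2,1.html". That page is the result of a keyword search, so the range of listings it returns is fairly limited.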

The second approach has few keyword restrictions and can collect many more kinds of job listings in a single run. It starts from jobs.51job/beijing/p1/
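The start path above can be read as a city segment followed by a page segment. A minimal sketch of building such paths, assuming that `p1` is a page number and that the city segment can be swapped freely (the spider comments only confirm that `beijing` can be replaced with `all`). The helper name `listing_path` is hypothetical, and the host is left out because it appears truncated in this article:

```python
def listing_path(city='all', page=1):
    """Build the path part of a listing URL, e.g. '/beijing/p1/'.

    Assumption: the trailing 'pN' segment is a 1-based page number.
    """
    if page < 1:
        raise ValueError('page must be 1 or greater')
    return '/{}/p{}/'.format(city, page)


# Paths for the first three pages of the Beijing listing
first_pages = [listing_path('beijing', n) for n in range(1, 4)]
```

In the spider itself this is not needed, because each page supplies the link to the next one. It is only useful for checking by hand which pages a given city segment covers.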

part 1 (spider)

# -*- coding: utf-8 -*-
import scrapy
# from scrapy import optional_features
from job51.items import Job51Item
# optional_features.remove('boto')


class Job51spiderSpider(scrapy.Spider):
    # Spider name
    name = 'job51Spider'
    # Domains the spider is allowed to crawl
    # (the host appears truncated in this article; use the site's full domain)
    allowed_domains = ['51job']
    # Initial URL list; changing "beijing" to "all" removes the city restriction.
    # Scrapy needs a complete URL including the scheme; the value is shown as published.
    start_urls = ['jobs.51job/beijing/p1/']

    # Callback that parses each response
    def parse(self, response):
        # Locate the nodes that hold the job listings on the page
        node_list = response.xpath('//div[@class="left"]/div[@class="detlist gbox"]/div[@class="e "]')
        if node_list:
            # Iterate over the nodes
            for node in node_list:
                # Create an Item and store the parsed data in it
                item = Job51Item()
                item['job_title'] = node.xpath('./p[@class="info"]/span[@class="title"]/a/@title').extract_first()
                item['job_company'] = node.xpath('./p[@class="info"]/a/@title').extract_first()
                item['job_address'] = node.xpath('./p[@class="info"]/span[2]/text()').extract_first()
                item['job_salary'] = node.xpath('./p[@class="info"]/span[3]/text()').extract_first()
                yield item
            # Find the "next page" button (the second li with class "bk")
            bk_items = response.xpath('//div[@id="cppageno"]/ul/li[@class="bk"]')
            # Guard against pages that have fewer than two such buttons
            next_page = bk_items[1].xpath('./a/@href').extract_first() if len(bk_items) > 1 else None
            if next_page:
                # print(next_page)
                # Request the next page
                yield scrapy.Request(url=next_page, callback=self.parse)


from scrapy import cmdline


def main():
    # execute() expects a list, not a string, so split() the command into a list
    cmdline.execute("scrapy crawl job51Spider".split())


# Only start the crawl when this file is run directly, not when Scrapy imports it
if __name__ == '__main__':
    main()
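The XPath expressions in the spider can be tried offline before running a crawl. The sketch below is not Scrapy code: it swaps in the standard library's `xml.etree.ElementTree`, which supports a subset of XPath, so that it runs without a network connection or a Scrapy install. The HTML snippet, the `parse_listing` helper and the sample values are all made up for illustration, and the real page layout may differ:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet that mimics the structure the spider expects
SAMPLE_HTML = """
<div class="detlist gbox">
  <div class="e ">
    <p class="info">
      <span class="title"><a title="Java Developer" href="#">Java Developer</a></span>
      <a title="Example Co." href="#">Example Co.</a>
      <span>Beijing-Chaoyang</span>
      <span>10-15k/month</span>
    </p>
  </div>
</div>
"""


def parse_listing(html):
    """Extract one dict per listing node, mirroring the spider's four fields."""
    root = ET.fromstring(html)
    jobs = []
    for node in root.findall(".//div[@class='e ']"):
        title_link = node.find("./p[@class='info']/span[@class='title']/a")
        company_link = node.find("./p[@class='info']/a")
        # span[2] and span[3] are the second and third <span> children of <p>
        address = node.find("./p[@class='info']/span[2]")
        salary = node.find("./p[@class='info']/span[3]")
        jobs.append({
            'job_title': title_link.get('title') if title_link is not None else None,
            'job_company': company_link.get('title') if company_link is not None else None,
            'job_address': address.text.strip() if address is not None and address.text else None,
            'job_salary': salary.text.strip() if salary is not None and salary.text else None,
        })
    return jobs


jobs = parse_listing(SAMPLE_HTML)
```

Note that the class value `"e "` ends with a space, exactly as in the spider, and that `span[2]` counts only `<span>` siblings, so the title span is position 1.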

part 2 (items)

import scrapy


class Job51Item(scrapy.Item):
    # Job ID
    job_id = scrapy.Field()
    # Job title
    job_title = scrapy.Field()
    # Company name
    job_company = scrapy.Field()
    # Company address
    job_address = scrapy.Field()
    # Salary
    job_salary = scrapy.Field()

part 3 (pipeline)

# -*- coding: utf-8 -*-
import json
import io
from job51.items import Job51Item

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: doc.scrapy/en/latest/topics/item-pipeline.html


class Job51Pipeline(object):
    def open_spider(self, spider):
        # Called when the spider starts: open the output file
        self.file = io.open('job51.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Convert the item to a JSON string, keeping non-ASCII characters readable
        content = json.dumps(dict(item), ensure_ascii=False)
        # Write one JSON object per line
        self.file.write(content + '\n')
        return item

    def close_spider(self, spider):
        # Called when the spider finishes: close the output file
        self.file.close()
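Scrapy calls the three pipeline hooks in a fixed order: `open_spider` once, `process_item` for every yielded item, and `close_spider` once at the end. The sketch below drives a stand-in class by hand to show that order without starting a crawl. `JsonLinesPipeline`, `run_demo` and the sample items are hypothetical names and data, and the output path is a temporary file rather than `job51.json`:

```python
import io
import json
import os
import tempfile


class JsonLinesPipeline(object):
    """Stand-in with the same three hooks as Job51Pipeline, but a configurable path."""

    def __init__(self, path):
        self.path = path

    def open_spider(self, spider):
        self.file = io.open(self.path, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable in the output file
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()


def run_demo():
    """Call the hooks in the order Scrapy would, then return the written lines."""
    path = os.path.join(tempfile.mkdtemp(), 'demo.json')
    pipeline = JsonLinesPipeline(path)
    pipeline.open_spider(spider=None)
    for item in [{'job_title': 'Java Developer'}, {'job_title': '前台'}]:
        pipeline.process_item(item, spider=None)
    pipeline.close_spider(spider=None)
    with io.open(path, encoding='utf-8') as f:
        return f.read().splitlines()


lines = run_demo()
```

Each item ends up on its own line, which is why the output file is a sequence of JSON objects rather than a single JSON array.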

part 4 (edit the settings file)

# 1. Set the request header to mimic a browser
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
# 2. Uncomment ITEM_PIPELINES so that our own pipeline is called when the spider runs
ITEM_PIPELINES = {
    'job51.pipelines.Job51Pipeline': 300,
}
# 3. Set a download delay to reduce the load on the target server
DOWNLOAD_DELAY = 3

Approximate output format

{"job_title": "万达集团行政前台（助理）", "job_company": "万达集团股份有限公司", "job_address": "北京-朝阳区", "job_salary": "5-7千/月"}
{"job_title": "新媒体运营经理", "job_company": "字节跳动", "job_address": "北京", "job_salary": "1.5-2.5万/月"}
{"job_title": "品牌高级专员", "job_company": "新东方教育科技集团有限公司", "job_address": "北京-海淀区", "job_salary": "0.8-1万/月"}
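Because the output holds one JSON object per line, it cannot be loaded with a single `json.load` call on the whole file. A minimal sketch of reading it back line by line follows. The helper name `parse_json_lines` is hypothetical, and the sample text reuses two of the records shown above:

```python
import json


def parse_json_lines(text):
    """Parse text that holds one JSON object per line, skipping blank lines."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


sample = (
    '{"job_title": "新媒体运营经理", "job_company": "字节跳动", '
    '"job_address": "北京", "job_salary": "1.5-2.5万/月"}\n'
    '{"job_title": "品牌高级专员", "job_company": "新东方教育科技集团有限公司", '
    '"job_address": "北京-海淀区", "job_salary": "0.8-1万/月"}\n'
)

records = parse_json_lines(sample)
# Example query: keep only the listings located in Haidian district
haidian = [r for r in records if '海淀' in r['job_address']]
```

To read the real file, open it with `encoding='utf-8'` and pass the result of `read()` to the helper.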

Published: 2023-06-05 12:54:42

Article link: http://www.ranqi119.com/ge/85/227435.html
