
Scraping Detailed Hainan Property Listings from SouFang (搜房网) with Python


Please credit this article if you repost it. If you have questions, feel free to leave a comment and discuss.

The code in this article was run on Windows with Python 2.7.14.

Part 1: Background and Requirements

A friend in Hainan who sells home lighting wanted to collect information on every new residential development across the whole island so he could plan his market expansion. Clicking into each development by hand and copy-pasting the fields he needed would have been slow and clumsy, so he asked me to scrape the data for him. After some back-and-forth we agreed to use SouFang (搜房网) as the data source, collect the fields below for every development, and compile them into an Excel sheet.

(Figure: the specific data fields required for each development)

Part 2: Analysis

Goal: find the source of the data we need.

1. Open SouFang and choose Hainan ---> New Homes. The entry URL for the whole of Hainan is the page linked as 海南新房_海南楼盘_海南房价_海南房产信息网-海南房天下.

2. Find each city's URL: press F12 in the browser to view the page source and locate the link for each city/district (press CTRL+F and search the source for "quyu_name dingwei" to jump to each city's URL quickly). For example, Haikou's link reads /house/s/haikoushi/, and its full URL is the page linked as 【海口市楼盘】_海口市新楼盘_海口市房价-海南中金浩房天下. (A short code sketch of steps 2 and 3 appears at the end of this section.)

3. Find each development's URL and its detail-page URL (taking Haikou ---> 滨江名苑 ---> 楼盘详情 (property details) as the example): press F12 to view the page source and locate each development's link (press CTRL+F and search the source for "nhouse_list" to see each development's URL quickly). For example, the 滨江名苑 (Binjiang Mingyuan) link reads //binjiangmingyuanhn.fang/.

(Figure: locating the 滨江名苑 development link in the page source)

How to reach a development's detailed information: on the page you land on after clicking a development, find the 楼盘详情 (property details) tab (search the source for navleft tf to locate it quickly). The content shown after clicking that tab contains the data we want to collect.

(Figure: the detail-page URL is the source of the average price and the other required fields; search keyword: main-item)

4. Get the next-page URL of each listing page; the keyword to search for is class="fr". The detailed parsing logic is covered with the source code in Part 3.

(Figure: the links to every listing page under each city)
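To make the selectors in steps 2 and 3 concrete, here is a minimal sketch (Python 2, matching the environment stated in Part 1). It assumes only what the steps above describe: city links sit inside an <li class="quyu_name dingwei"> element, and the full code in Part 3 extracts development links from the <div class="nlc_img"> blocks of a listing page. The function names are illustrative only, and the live page structure may have changed.

# A minimal sketch of steps 2 and 3 (Python 2). The selectors come from this article;
# the function names are made up for illustration.
import requests
from bs4 import BeautifulSoup

def list_city_links(root_url):
    # Step 2: every city/district link sits inside <li class="quyu_name dingwei">.
    soup = BeautifulSoup(requests.get(root_url).text, 'html.parser')
    city_li = soup.find('li', class_="quyu_name dingwei")
    if city_li is None:
        return []
    return [root_url + a["href"] for a in city_li.find_all('a', href=True)]

def list_development_links(city_page_url):
    # Step 3: each development thumbnail on a listing page sits in <div class="nlc_img">,
    # and its <a href> leads to the development's own page (where 楼盘详情 is found).
    soup = BeautifulSoup(requests.get(city_page_url).text, 'html.parser')
    links = set()
    for div in soup.find_all('div', class_="nlc_img"):
        for a in div.find_all('a', href=True):
            links.add(a['href'])
    return links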

Part 3: Source Code

1. The code uses only simple interfaces from the requests, BeautifulSoup, urlparse, xlrd, and xlutils modules. If any of them are unfamiliar, look them up; they are all commonly used APIs.

2. The overall flow of the code simply follows the analysis in Part 2.

3. Results

(Figure: a sample of the collected data)

4. Here comes the source code:

# -*- coding: utf-8 -*-
# Environment: Python 2.7.14 on Windows (see Part 1).
from xlutils.copy import copy
from bs4 import BeautifulSoup
import requests
import xlrd
import urlparse
import re


class HtmlParser(object):
    def city_quyu_area_parse(self, root_url, html_cont):
        """Parse the Hainan entry page and return the URL and name of every city/district."""
        soup = BeautifulSoup(html_cont, 'html.parser', from_encoding='utf-8')
        city_urls = set()
        city_names = {}
        # Step 2 of the analysis: city links live inside <li class="quyu_name dingwei">.
        linklist = soup.find('li', class_="quyu_name dingwei").find_all('a', href=True)
        for city_url in linklist:
            city_urls.add(root_url + city_url["href"])
            city_names[root_url + city_url["href"]] = city_url.get_text()
        print "city_urls:%s,city_name:%s" % (city_urls, city_names)
        return city_urls, city_names

    def parsehousedetail(self, page_url, html_cont):
        """Parse one listing page and return the URL of every development on it."""
        if page_url is None or html_cont is None:
            return None, None
        soup = BeautifulSoup(html_cont, 'html.parser', from_encoding='utf-8')
        new_urls = self._get_new_urls(page_url, soup)
        print "new_urls:", new_urls
        return new_urls, "123"   # the second value is an unused placeholder

    def parseOtherPagesUrlAll(self, page_url, html_cont):
        """Build the URL of every other listing page of a city from the 'last page' link."""
        print "parseOtherPagesUrlAll header:", page_url
        if page_url is None or html_cont is None:
            return set()
        soup = BeautifulSoup(html_cont, 'html.parser', from_encoding='utf-8')
        new_urls = set()
        links_li = soup.find('li', class_="fr")           # the paging block
        if links_li:
            lastpage = links_li.find('a', class_="last")  # the unique "last page" link
            if lastpage is None:
                return new_urls
            # The href looks like /house/s/<city>/b9<N>/ where <N> is the last page number.
            splitarry = lastpage['href'].split('/')
            maxnum = int(splitarry[4][2:])
            a = max(1, maxnum)
            for i in xrange(2, a + 1):
                new_url = '/house/s/' + splitarry[3] + "/b9%s/" % (i)
                new_full_url = urlparse.urljoin(page_url, new_url)
                new_urls.add(new_full_url)
        print "parseOtherPagesUrlAll page new_urls", new_urls
        return new_urls

    # Collect the link of each development on a listing page.
    def _get_new_urls(self, page_url, soup):
        new_urls = set()
        linkss = soup.find_all('div', class_="nlc_img")
        for aa in linkss:
            links = aa.find_all('a', href=True)
            for link in links:
                new_url = link['href']
                new_full_url = urlparse.urljoin(page_url, new_url)
                new_urls.add(new_full_url)
        print new_urls
        return new_urls

    # Collect the 楼盘详情 (property detail) link on a development's page.
    def get_last_urls(self, page_url, soup):
        new_urls = set()
        links = soup.find('div', class_="navleft tf").find_all('a', href=re.compile(r".housedetail"))
        print "get_last_urls:", links
        for link in links:
            new_url = link['href']
            new_full_url = urlparse.urljoin(page_url, new_url)
            new_urls.add(new_full_url)
        print "_get_last_urls", new_urls
        return new_urls

    def get_new_data_detail(self, souplast):
        """Extract the required fields from a development's detail page."""
        house_data = {}
        house_data['house_name'] = souplast.find('a', class_='ts_linear').get_text()
        house_data['house_price'] = souplast.find('div', class_='main-info-price').find('em').get_text().strip()
        all_item_data = souplast.find_all('div', class_='main-item')
        for item_data in all_item_data:
            for li_data in item_data.find_all('li'):
                textContent = li_data.find('div', class_='list-left')
                if textContent is None:
                    continue
                text = textContent.get_text().replace(' ', '')
                if text == u"物业类别:":
                    house_data['house_type'] = li_data.find('div', class_='list-right').get_text().replace(' ', '')
                elif text == u"装修状况:":
                    house_data['house_de'] = li_data.find('div', class_='list-right').get_text().replace(' ', '')
                elif text == u"楼盘地址:":
                    house_address_find = li_data.find('div', class_='list-right-text')
                    house_address = u'尚未提供'
                    if house_address_find is None:
                        house_address_find2 = li_data.find('div', class_='list-right')
                        if house_address_find2:
                            house_address = house_address_find2.get_text().replace(' ', '')
                    else:
                        house_address = house_address_find.get_text().replace(' ', '')
                    house_data['house_address'] = house_address
                elif text == u"销售状态:":
                    house_data['house_sell'] = li_data.find('div', class_='list-right').get_text().replace(' ', '')
                elif text == u"开盘时间:":
                    house_data['house_opentime'] = li_data.find('div', class_='list-right').get_text().replace(' ', '')
                elif text == u"总户数:":
                    house_data['house_total'] = li_data.find('div', class_='list-right').get_text().replace(' ', '')
                elif text == u"物业公司:":
                    house_data['house_wuye'] = li_data.find('div', class_='list-right').get_text().replace(' ', '')
        return house_data


class HtmlOutputer(object):
    def output_excel_dic(self, new_data, house_city):
        """Append one development's data as a new row of soufang.xls."""
        workbook = xlrd.open_workbook("soufang.xls")
        nrow = workbook.sheets()[0].nrows     # index of the next free row
        copy_file = copy(workbook)            # writable copy of the existing workbook
        sheet = copy_file.get_sheet(0)
        # Fixed column layout: 0 city, 1 name, 2 address, 3 type, 4 sales status,
        # 5 decoration, 6 opening date, 7 total units, 8 price, 9 property company, 10 other.
        col_map = {
            'house_name': 1, 'house_address': 2, 'house_type': 3,
            'house_sell': 4, 'house_de': 5, 'house_opentime': 6,
            'house_total': 7, 'house_price': 8, 'house_wuye': 9,
        }
        sheet.write(nrow, 0, house_city)
        for key in new_data.keys():
            sheet.write(nrow, col_map.get(key, 10), new_data[key])
        copy_file.save('soufang.xls')


class HtmlDownloader(object):
    def download(self, url):
        """Download a page and force the content to UTF-8 (the site serves GB-encoded pages)."""
        if url is None:
            return None
        resp = requests.get(url)
        # Re-encode: bytes in the declared encoding -> unicode via gb18030 -> utf-8 bytes.
        return resp.text.encode(resp.encoding).decode("gb18030", 'ignore').encode("utf-8")


class SpiderMain(object):
    def __init__(self):
        self.downloader = HtmlDownloader()   # downloads page content
        self.parser = HtmlParser()           # parses page content
        self.outputer = HtmlOutputer()       # writes the results to Excel

    def CityUrl_Crwa(self, root_url):
        """Fetch the entry page and return the listing URL and name of every city."""
        html_cont = self.downloader.download(root_url)
        cityurls_list, city_names = self.parser.city_quyu_area_parse(root_url, html_cont)
        return cityurls_list, city_names

    def crwa_detail(self, city_url, city_name):
        """Crawl every development of one city, page by page, and write it to Excel."""
        # First listing page of the city.
        html_cont = self.downloader.download(city_url)
        new_urls, house_city = self.parser.parsehousedetail(city_url, html_cont)
        for oneurl in new_urls:
            html_detail = self.downloader.download(oneurl)
            soupdetail = BeautifulSoup(html_detail, 'html.parser', from_encoding='utf-8')
            lasturls = self.parser.get_last_urls(oneurl, soupdetail)
            for onelast in lasturls:
                house_detail = self.downloader.download(onelast)
                soupLast = BeautifulSoup(house_detail, 'html.parser', from_encoding='utf-8')
                new_data_detail = self.parser.get_new_data_detail(soupLast)
                self.outputer.output_excel_dic(new_data_detail, city_name)
        # Remaining listing pages of the same city.
        count = 0
        page_urls = self.parser.parseOtherPagesUrlAll(city_url, html_cont)
        for pageurl in page_urls:
            count += 1
            html_cont = self.downloader.download(pageurl)
            new_urls, house_city = self.parser.parsehousedetail(pageurl, html_cont)
            for oneurl in new_urls:
                html_detail = self.downloader.download(oneurl)
                soupdetail = BeautifulSoup(html_detail, 'html.parser', from_encoding='utf-8')
                lasturls = self.parser.get_last_urls(oneurl, soupdetail)
                for onelast in lasturls:
                    house_detail = self.downloader.download(onelast)
                    soupLast = BeautifulSoup(house_detail, 'html.parser', from_encoding='utf-8')
                    new_data_detail = self.parser.get_new_data_detail(soupLast)
                    self.outputer.output_excel_dic(new_data_detail, city_name)
            print "-------------------------------------%d------------------------------" % (count)
        print "城市【", city_name, "】输出成功--------------"


if __name__ == "__main__":
    obj_spider = SpiderMain()
    root_url = "hn.newhouse.fang"   # Hainan new-home entry; supply the full URL (scheme + domain) when running
    city_urls, city_names = obj_spider.CityUrl_Crwa(root_url)
    count = 0
    for city_url in city_urls:
        if city_url[-3:] == "#no":  # skip placeholder links
            continue
        name = city_names.get(city_url, u"其他")
        count += 1
        obj_spider.crwa_detail(city_url, name)
        print "*********name:%s,count:%s*********" % (name, count)
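One practical note on running it: output_excel_dic opens an existing soufang.xls and appends to it, so the workbook has to exist before the first run. Below is a minimal sketch to create it with a header row matching the column layout above; the header labels are my own and xlwt (the writer library that xlutils builds on) is assumed to be installed.

# -*- coding: utf-8 -*-
# Create an empty soufang.xls with one header row matching the column layout used by
# output_excel_dic (column 0 = city, 1 = name, ..., 10 = other). The header labels are
# illustrative; the spider only appends rows after the last existing one.
import xlwt

def create_output_workbook(path='soufang.xls'):
    headers = [u'城市', u'楼盘名称', u'楼盘地址', u'物业类别', u'销售状态',
               u'装修状况', u'开盘时间', u'总户数', u'均价', u'物业公司', u'其他']
    workbook = xlwt.Workbook(encoding='utf-8')
    sheet = workbook.add_sheet('sheet1')
    for col, title in enumerate(headers):
        sheet.write(0, col, title)
    workbook.save(path)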

Part 4: Pitfalls

1. Writing the Excel file does not work with the .xlsw extension: the code runs without reporting any error, but the resulting file cannot be opened afterwards. The file extension has to be .xls.
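For illustration, here is the append pattern the spider relies on, in isolation (the same xlrd + xlutils.copy calls as output_excel_dic above), assuming the workbook already exists; xlutils/xlwt produce the legacy .xls (BIFF) format, which is why the extension matters.

import xlrd
from xlutils.copy import copy

def append_row(path, values):
    # Keep the .xls extension: saving under another extension does not raise an error,
    # but Excel will refuse to open the resulting file.
    readable = xlrd.open_workbook(path)     # existing workbook (read-only)
    next_row = readable.sheets()[0].nrows   # index of the first free row
    writable = copy(readable)               # writable xlwt copy of the same workbook
    sheet = writable.get_sheet(0)
    for col, value in enumerate(values):
        sheet.write(next_row, col, value)
    writable.save(path)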

2. The character-encoding pitfall:

text = textContent.get_text().replace(' ', '')
if text == u"物业类别:":

The text obtained here exists as a unicode object in Python 2, while a bare "物业类别:" literal is a byte string; the two never compare equal, so the branch you expect is never entered. Prefix the literal with u (u"物业类别:") to make both sides unicode. There are several similar comparisons elsewhere in the code; watch out for them.
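A tiny Python 2 illustration of the point (the label string is the same one the spider compares against):

# -*- coding: utf-8 -*-
text = u"物业类别:"            # what get_text() effectively returns: a unicode object
print text == "物业类别:"      # False (with a UnicodeWarning): unicode vs. byte string
print text == u"物业类别:"     # True: the u prefix makes both sides unicode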

3. The next-page parsing pitfall:

On SouFang, parsing the next-page link has to handle the following cases: (1) when a city has only one listing page, neither a next-page nor a previous-page button appears; (2) within the first 9 pages, a next-page link appears but no previous-page link; (3) beyond 9 pages, both previous and next appear, yet on the last few pages the next-page link disappears while the previous-page link remains; (4) the previous-page and next-page links both carry class="next" and cannot be told apart, which at one point sent my parser into an infinite loop.

Solution: in the parseOtherPagesUrlAll function I parse the last-page link (class='last') instead and generate the URL of every page from its page number, which guarantees the page URLs are unique and complete.
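For reference, here is the core of that approach in isolation, as a minimal sketch: it assumes, as described above, that the paging block is <li class="fr"> and the last-page link (class="last") has an href of the form /house/s/<city>/b9<N>/.

import urlparse
from bs4 import BeautifulSoup

def all_page_urls(page_url, html):
    # Derive every listing-page URL from the unique "last page" link instead of the
    # ambiguous previous/next links (which share class="next" and caused the dead loop).
    soup = BeautifulSoup(html, 'html.parser')
    urls = set()
    paging = soup.find('li', class_="fr")
    if paging is None:                       # case (1): a single page, no paging block
        return urls
    last = paging.find('a', class_="last")
    if last is None:                         # no "last" link: nothing beyond page 1
        return urls
    parts = last['href'].split('/')          # ['', 'house', 's', '<city>', 'b9<N>', '']
    last_page = int(parts[4][2:])            # strip the leading "b9" to get N
    for i in xrange(2, last_page + 1):
        urls.add(urlparse.urljoin(page_url, '/house/s/%s/b9%d/' % (parts[3], i)))
    return urls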
