

news 2025/11/14 18:28:33

A web crawler (also called a web spider or web robot, and in the FOAF community more often a "web scutter") is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Less common names include ant, automatic indexer, emulator, and worm.

Requests

The Python standard library provides modules such as urllib, urllib2, and httplib for making HTTP requests, but their APIs are poor. They were built for a different era and a different Internet, and they require a great deal of work, even method overriding, to accomplish the simplest tasks.

Listing: wrapping urllib2 requests

    # Python 2 code: urllib2 and cookielib do not exist under these names in Python 3.
    import urllib2
    import json
    import cookielib


    def urllib2_request(url, method="GET", cookie="", headers={}, data=None):
        """
        :param url: the URL to request
        :param method: request method: GET, POST, DELETE, PUT ...
        :param cookie: cookie string to send, e.g. cookie='k1=v1;k1=v2'
        :param headers: request headers to send, e.g.
                        headers={'ContentType': 'application/json; charset=UTF-8'}
        :param data: data to send, e.g. data={'d1': 'v1'}
        :return: a tuple of the response body (string) and the CookieJar object.
                 The CookieJar can be iterated with a for loop:
                     for item in cookiejar:
                         print item.name, item.value
        """
        if data:
            data = json.dumps(data)

        cookie_jar = cookielib.CookieJar()
        handler = urllib2.HTTPCookieProcessor(cookie_jar)
        opener = urllib2.build_opener(handler)
        # The original hard-coded 'k1=v1;k1=v2' here; use the cookie argument instead.
        if cookie:
            opener.addheaders.append(('Cookie', cookie))
        request = urllib2.Request(url=url, data=data, headers=headers)
        request.get_method = lambda: method

        response = opener.open(request)
        origin = response.read()

        return origin, cookie_jar


    # GET
    result = urllib2_request('http://127.0.0.1:8001/index/', method="GET")

    # POST
    result = urllib2_request('http://127.0.0.1:8001/index/', method="POST", data={'k1': 'v1'})

    # PUT
    result = urllib2_request('http://127.0.0.1:8001/index/', method="PUT", data={'k1': 'v1'})

Requests is an HTTP library written in Python and released under the Apache2 License. It is a high-level wrapper around Python's built-in modules that makes network requests much more pleasant for Python developers. With Requests you can easily perform essentially any operation a browser can.

1. GET requests

    # 1. Without parameters
    import requests

    ret = requests.get('https://github.com/timeline.json')

    print ret.url
    print ret.text


    # 2. With parameters
    import requests

    payload = {'key1': 'value1', 'key2': 'value2'}
    ret = requests.get("http://httpbin.org/get", params=payload)

    print ret.url
    print ret.text

The first example sends a GET request to https://github.com/timeline.json; everything related to the request and the response is wrapped in the ret object.

2. POST requests

    # 1. Basic POST
    import requests

    payload = {'key1': 'value1', 'key2': 'value2'}
    ret = requests.post("http://httpbin.org/post", data=payload)

    print ret.text


    # 2. Sending request headers and data
    import requests
    import json

    url = 'https://api.github.com/some/endpoint'
    payload = {'some': 'data'}
    headers = {'content-type': 'application/json'}

    ret = requests.post(url, data=json.dumps(payload), headers=headers)

    print ret.text
    print ret.cookies

The second example sends a POST request to https://api.github.com/some/endpoint; everything related to the request and the response is wrapped in the ret object.

3. Other requests

    requests.get(url, params=None, **kwargs)
    requests.post(url, data=None, json=None, **kwargs)
    requests.put(url, data=None, **kwargs)
    requests.head(url, **kwargs)
    requests.delete(url, **kwargs)
    requests.patch(url, data=None, **kwargs)
    requests.options(url, **kwargs)

    # All of the methods above are built on top of this one
    requests.request(method, url, **kwargs)

The requests module already wraps the common HTTP methods, so users only need to call the corresponding function. The full set of parameters is documented on requests.request:

Listing: more parameters

    def request(method, url, **kwargs):
        """Constructs and sends a :class:`Request <Request>`.

        :param method: method for the new :class:`Request` object.
        :param url: URL for the new :class:`Request` object.
        :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
        :param data: (optional) Dictionary, bytes, or file-like object to send in the body of the :class:`Request`.
        :param json: (optional) json data to send in the body of the :class:`Request`.
        :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
        :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
        :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': ('filename', fileobj)}``) for multipart encoding upload.
        :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
        :param timeout: (optional) How long to wait for the server to send data
            before giving up, as a float, or a :ref:`(connect timeout, read
            timeout) <timeouts>` tuple.
        :type timeout: float or tuple
        :param allow_redirects: (optional) Boolean. Set to True if POST/PUT/DELETE redirect following is allowed.
        :type allow_redirects: bool
        :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
        :param verify: (optional) whether the SSL cert will be verified. A CA_BUNDLE path can also be provided. Defaults to ``True``.
        :param stream: (optional) if ``False``, the response content will be immediately downloaded.
        :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
        :return: :class:`Response <Response>` object
        :rtype: requests.Response

        Usage::

          >>> import requests
          >>> req = requests.request('GET', 'http://httpbin.org/get')
          <Response [200]>
        """

        # By using the 'with' statement we are sure the session is closed, thus we
        # avoid leaving sockets open which can trigger a ResourceWarning in some
        # cases, and look like a memory leak in others.
        with sessions.Session() as session:
            return session.request(method=method, url=url, **kwargs)

Listing: automatic login to Chouti

    import requests

    ### 1. Visit any page first to obtain a cookie
    i1 = requests.get(url="http://dig.chouti.com/help/service")

    ### 2. Log in, sending the cookie from step 1; the backend authorizes the gpsd value in that cookie
    i2 = requests.post(
        url="http://dig.chouti.com/login",
        data={
            'phone': "86<phone number>",
            'password': "<password>",
            'oneMonth': ""
        },
        cookies=i1.cookies.get_dict()
    )

    ### 3. Upvote (only the already-authorized gpsd needs to be sent)
    gpsd = i1.cookies.get_dict()['gpsd']
    i3 = requests.post(
        url="http://dig.chouti.com/link/vote?linksId=8589523",
        cookies={'gpsd': gpsd}
    )
    print(i3.text)

"Cracking" a WeChat Official Account

"Cracking" a WeChat Official Account simply means using Python code to automate three steps: [log in to the Official Account] -> [get the followers] -> [send messages to followers]. Note: messages can only be pushed to followers who have interacted with the account within the last 48 hours.

1. Automatic login

Analysis of the web login page shows that login verification involves only the following:

- Login URL: https://mp.weixin.qq.com/cgi-bin/login?lang=zh_CN
- POST data:
      {
          'username': <user name>,
          'pwd': <MD5 of the password>,
          'imgcode': "",
          'f': 'json'
      }
  Note: imgcode is the captcha. By default none is required; the user must supply one only after several failed login attempts.
- The Referer request header of the POST: the WeChat backend uses it to check where the request came from.
- After a successful login, read the cookies from the response; they must be sent with every later request to other pages.
- After a successful login, read the token from the response body.

Listing: login code

    import requests
    import time
    import hashlib
    import re   # missing in the original listing, but needed by re.findall below


    def _password(pwd):
        ha = hashlib.md5()
        ha.update(pwd)
        return ha.hexdigest()


    def login():
        login_dict = {
            'username': "<user name>",
            'pwd': _password("<password>"),
            'imgcode': "",
            'f': "json"
        }

        login_res = requests.post(
            url="https://mp.weixin.qq.com/cgi-bin/login?lang=zh_CN",
            data=login_dict,
            headers={'Referer': 'https://mp.weixin.qq.com/cgi-bin/login?lang=zh_CN'})

        # After a successful login, read the cookies returned by the server
        resp_cookies_dict = login_res.cookies.get_dict()
        # After a successful login, read the response body
        resp_text = login_res.text
        # After a successful login, extract the token
        token = re.findall(r".*token=(\d+)", resp_text)[0]

        print resp_text
        print token
        print resp_cookies_dict

    login()

A successful login returns content like the following.

Response body:

    {"base_resp":{"ret":0,"err_msg":"ok"},"redirect_url":"\/cgi-bin\/home?t=home\/index&lang=zh_CN&token=537908795"}

Response cookies:

    {'data_bizuin': '3016804678', 'bizuin': '3016804678', 'data_ticket': 'CaoXQA0ZA9LRZ4YM3zZkvedyCY8mZi0XlLonPwvBGkX0/jY/FZgmGTq6xGuQk4H', 'slave_user': 'gh_5abeaed48d10', 'slave_sid': 'elNLbU1TZHRPWDNXSWdNc2FjckUxalM0Y000amtTamlJOUliSnRnWGRCdjFseV9uQkl5cUpHYkxqaGJNcERtYnM2WjdFT1pQckNwMFNfUW5fUzVZZnFlWGpSRFlVRF9obThtZlBwYnRIVGt6cnNGbUJsNTNIdTlIc2JJU29QM2FPaHZjcTcya0F6UWRhQkhO'}

2. Visiting other pages to get user information

Analyze the user management page: request it with GET from Python code, then parse the returned HTML and extract the user information from it.

- URL for fetching users: https://mp.weixin.qq.com/cgi-bin/user_tag?action=get_all_data&lang=zh_CN&token=<token obtained at login>
- The GET request must carry the cookies obtained after login: {'data_bizuin': '3016804678', 'bizuin': '3016804678', 'data_ticket': 'C4YM3zZ...
- Read the HTML of the response.
- Extract the required content from the HTML with regular expressions (or the Python module Beautiful Soup).
- Read each user's data-fakeid attribute from the HTML. This value is the user's unique identifier, and messages are pushed to the user through it.

Listing: implementation

    import requests
    import time
    import hashlib
    import json
    import re

    LOGIN_COOKIES_DICT = {}


    def _password(pwd):
        ha = hashlib.md5()
        ha.update(pwd)
        return ha.hexdigest()


    def login():
        login_dict = {
            'username': "<user name>",
            'pwd': _password("<password>"),
            'imgcode': "",
            'f': "json"
        }

        login_res = requests.post(
            url="https://mp.weixin.qq.com/cgi-bin/login?lang=zh_CN",
            data=login_dict,
            headers={'Referer': 'https://mp.weixin.qq.com/cgi-bin/login?lang=zh_CN'})

        # After a successful login, read the cookies returned by the server
        resp_cookies_dict = login_res.cookies.get_dict()
        # After a successful login, read the response body
        resp_text = login_res.text
        # After a successful login, extract the token
        token = re.findall(r".*token=(\d+)", resp_text)[0]

        return {'token': token, 'cookies': resp_cookies_dict}


    def standard_user_list(content):
        # Strip whitespace, then cut the JavaScript object assigned to cgiData out of the page
        content = re.sub(r'\s*', '', content)
        content = re.sub(r'\n*', '', content)
        data = re.findall("""cgiData=(.*);seajs""", content)[0]
        data = data.strip()
        # Quote bare keys that follow "{" so the text becomes valid JSON
        while True:
            temp = re.split(r'({)(\w+)(:)', data, 1)
            if len(temp) == 5:
                temp[2] = '"' + temp[2] + '"'
                data = ''.join(temp)
            else:
                break

        # Quote bare keys that follow ","
        while True:
            temp = re.split(r'(,)(\w+)(:)', data, 1)
            if len(temp) == 5:
                temp[2] = '"' + temp[2] + '"'
                data = ''.join(temp)
            else:
                break

        data = re.sub(r'\*\d+', "", data)
        ret = json.loads(data)
        return ret


    def get_user_list():
        login_dict = login()
        LOGIN_COOKIES_DICT.update(login_dict)

        login_cookie_dict = login_dict['cookies']
        res_user_list = requests.get(
            url="https://mp.weixin.qq.com/cgi-bin/user_tag",
            params={"action": "get_all_data", "lang": "zh_CN", "token": login_dict['token']},
            cookies=login_cookie_dict,
            headers={'Referer': 'https://mp.weixin.qq.com/cgi-bin/login?lang=zh_CN'}
        )
        user_info = standard_user_list(res_user_list.text)
        for item in user_info['user_list']:
            print "%s %s " % (item['nick_name'], item['id'],)

    get_user_list()

3. Sending messages

Analyze the page used to send a message to a user, and work out the message-sending URL from its network requests so that Python code can send messages.

- Message-sending URL: https://mp.weixin.qq.com/cgi-bin/singlesend?t=ajax-response&f=json&token=<token obtained at login>&lang=zh_CN
- Take the token and cookies from the login response.
- Take a user's unique identifier (fake_id) from the user list.
- Build the message and send a POST request:

    send_dict = {
        'token': <token obtained at login>,
        'lang': "zh_CN",
        'f': 'json',
        'ajax': 1,
        'random': "0.5322618900912392",
        'type': 1,
        'content': <content to send>,
        'tofakeid': <user ID taken from the user list>,
        'imgcode': ''
    }

Listing: message-sending code

    # The imports, LOGIN_COOKIES_DICT, _password(), login(), standard_user_list()
    # and get_user_list() are identical to the previous listing and are not repeated here.

    def send_msg(user_fake_id, content='Nothing was sent'):
        login_dict = LOGIN_COOKIES_DICT

        token = login_dict['token']
        login_cookie_dict = login_dict['cookies']

        send_dict = {
            'token': token,
            'lang': "zh_CN",
            'f': 'json',
            'ajax': 1,
            'random': "0.5322618900912392",
            'type': 1,
            'content': content,
            'tofakeid': user_fake_id,
            'imgcode': ''
        }

        send_url = "https://mp.weixin.qq.com/cgi-bin/singlesend?t=ajax-response&f=json&token=%s&lang=zh_CN" % (token,)
        message_list = requests.post(
            url=send_url,
            data=send_dict,
            cookies=login_cookie_dict,
            headers={'Referer': 'https://mp.weixin.qq.com/cgi-bin/login?lang=zh_CN'}
        )


    get_user_list()
    fake_id = raw_input('Enter the user ID: ')
    content = raw_input('Enter the message content: ')
    send_msg(fake_id, content)

That is the whole process of "cracking" a WeChat Official Account: Python code automates [logging in to the Official Account platform], [getting the user list], and [sending a message to a chosen user].

Scrapy

Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a range of programs for data mining, information processing, or archiving historical data. It was originally designed for page scraping (more precisely, web scraping), but it can also be used to fetch data returned by APIs (for example Amazon Associates Web Services) or as a general-purpose web crawler. Scrapy is versatile and can be used for data mining, monitoring, and automated testing.

Scrapy uses the Twisted asynchronous networking library to handle network communication. It consists mainly of the following components:

- Engine (Scrapy): handles the data flow of the whole system and triggers events (the core of the framework).
- Scheduler: accepts requests sent by the engine, pushes them onto a queue, and returns them when the engine asks again. Think of it as a priority queue of URLs (the addresses of the pages to crawl): it decides which URL to crawl next and removes duplicate URLs.
- Downloader: downloads page content and returns it to the spiders. (The Scrapy downloader is built on Twisted's efficient asynchronous model.)
- Spiders: the spiders do the main work, extracting the information you need, the so-called items, from specific pages. They can also extract links so that Scrapy goes on to crawl the next page.
- Item Pipeline: processes the items the spiders extract from pages. Its main jobs are persisting items, validating them, and removing unneeded information. After a page is parsed by a spider, the result is sent to the item pipeline and processed in several specific stages in order.
- Downloader Middlewares: a layer between the Scrapy engine and the downloader that mainly processes the requests and responses passing between them.
- Spider Middlewares: a layer between the Scrapy engine and the spiders that mainly processes the spiders' response input and request output.
- Scheduler Middlewares: middleware between the Scrapy engine and the scheduler that processes the requests and responses sent from the engine to the scheduler.

A Scrapy run proceeds roughly as follows:

1. The engine takes a link (URL) from the scheduler for the next crawl.
2. The engine wraps the URL in a Request and passes it to the downloader.
3. The downloader downloads the resource and wraps it in a Response.
4. The spider parses the Response.
5. If items are parsed out, they are handed to the item pipeline for further processing.
6. If links (URLs) are parsed out, they are handed to the scheduler to await crawling.

I. Installation

    pip install Scrapy

Note: on Windows, Scrapy depends on pywin32. Download and install the build that matches your system (32-bit or 64-bit): https://sourceforge.net/projects/pywin32/

II. Basic usage

1. Create a project

Run the command:

    scrapy startproject your_project_name

This automatically creates the following directory tree:

    project_name/
        scrapy.cfg
        project_name/
            __init__.py
            items.py
            pipelines.py
            settings.py
            spiders/
                __init__.py

File descriptions:

- scrapy.cfg: project configuration, mainly basic settings for the Scrapy command-line tool. (The settings that actually govern crawling live in settings.py.)
- items.py: data storage templates for structured data, similar to a Django Model.
- pipelines.py: data processing behavior, typically persisting the structured data.
- settings.py: configuration file, e.g. recursion depth, concurrency, download delay.
- spiders: the spider directory, where you create files and write crawling rules.

Note: spider files are usually named after the website's domain.

2. Write a spider

Create the file xiaohuar_spider.py in the spiders directory:

    import scrapy


    class XiaoHuarSpider(scrapy.spiders.Spider):
        name = "xiaohuar"
        allowed_domains = ["xiaohuar.com"]
        start_urls = [
            "http://www.xiaohuar.com/hua/",
        ]

        def parse(self, response):
            # print(response, type(response))
            # from scrapy.http.response.html import HtmlResponse
            # print(response.body_as_unicode())

            current_url = response.url
            body = response.body
            unicode_body = response.body_as_unicode()

3. Run

Change into the project_name directory and run:

    scrapy crawl spider_name --nolog

4. Recursive crawling

The spider above only crawls the start page, whereas a crawler needs to keep going until every page has been processed.

    import scrapy
    from scrapy.http import Request
    from scrapy.selector import HtmlXPathSelector
    import re
    import urllib
    import os


    class XiaoHuarSpider(scrapy.spiders.Spider):
        name = "xiaohuar"
        allowed_domains = ["xiaohuar.com"]
        start_urls = [
            "http://www.xiaohuar.com/list-1-1.html",
        ]

        def parse(self, response):
            # Analyze the page:
            # find content that matches the rules (the photos) and save it;
            # find all <a> tags and follow them, level by level.

            hxs = HtmlXPathSelector(response)

            # If the URL looks like http://www.xiaohuar.com/list-1-\d+.html
            if re.match(r'http://www.xiaohuar.com/list-1-\d+.html', response.url):
                items = hxs.select('//div[@class="item_list infinite_scroll"]/div')
                # Note: XPath positions are 1-based, so index 0 selects nothing.
                for i in range(len(items)):
                    src = hxs.select('//div[@class="item_list infinite_scroll"]/div[%d]//div[@class="img"]/a/img/@src' % i).extract()
                    name = hxs.select('//div[@class="item_list infinite_scroll"]/div[%d]//div[@class="img"]/span/text()' % i).extract()
                    school = hxs.select('//div[@class="item_list infinite_scroll"]/div[%d]//div[@class="img"]/div[@class="btns"]/a/text()' % i).extract()
                    if src:
                        ab_src = "http://www.xiaohuar.com" + src[0]
                        file_name = "%s_%s.jpg" % (school[0].encode('utf-8'), name[0].encode('utf-8'))
                        file_path = os.path.join("/Users/wupeiqi/PycharmProjects/beauty/pic", file_name)
                        urllib.urlretrieve(ab_src, file_path)

            # Collect every URL on the page and keep crawling those that match
            all_urls = hxs.select('//a/@href').extract()
            for url in all_urls:
                if url.startswith('http://www.xiaohuar.com/list-1-'):
                    yield Request(url, callback=self.parse)

This code saves the images from matching pages to the given directory, and finds the href attribute of every other <a> tag in the HTML source, so that crawling continues "recursively" until every page has been visited. The "recursion" works because the parse method yields Request objects.

Note: the "recursion" depth can be limited in settings.py, for example DEPTH_LIMIT = 1.

Listing: regex selector

    from scrapy.selector import Selector
    from scrapy.http import HtmlResponse

    html = """<!DOCTYPE html>
    <html>
    <head lang="en">
        <meta charset="UTF-8">
        <title></title>
    </head>
    <body>
        <li class="item-"><a href="link.html">first item</a></li>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
    </body>
    </html>
    """
    response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')
    ret = Selector(response=response).xpath(r'//li[re:test(@class, "item-\d*")]//@href').extract()
    print(ret)

Listing: selector rules

    import scrapy
    import hashlib
    from tutorial.items import JinLuoSiItem
    from scrapy.http import Request
    from scrapy.selector import HtmlXPathSelector


    class JinLuoSiSpider(scrapy.spiders.Spider):
        count = 0
        url_set = set()

        name = "jluosi"
        domain = 'http://www.jluosi.com'
        allowed_domains = ["jluosi.com"]

        start_urls = [
            "http://www.jluosi.com:80/ec/goodsDetail.action?jls=QjRDNEIzMzAzOEZFNEE3NQ==",
        ]

        def parse(self, response):
            # Deduplicate pages by the MD5 of their URL
            md5_obj = hashlib.md5()
            md5_obj.update(response.url)
            md5_url = md5_obj.hexdigest()
            if md5_url in JinLuoSiSpider.url_set:
                pass
            else:
                JinLuoSiSpider.url_set.add(md5_url)
                hxs = HtmlXPathSelector(response)
                if response.url.startswith('http://www.jluosi.com:80/ec/goodsDetail.action'):
                    item = JinLuoSiItem()
                    item['company'] = hxs.select('//div[@class="ShopAddress"]/ul/li[1]/text()').extract()
                    item['link'] = hxs.select('//div[@class="ShopAddress"]/ul/li[2]/text()').extract()
                    item['qq'] = hxs.select('//div[@class="ShopAddress"]//a/@href').re(r'.*uin=(?P<qq>\d*)&')
                    item['address'] = hxs.select('//div[@class="ShopAddress"]/ul/li[4]/text()').extract()

                    item['title'] = hxs.select('//h1[@class="goodsDetail_goodsName"]/text()').extract()

                    item['unit'] = hxs.select('//table[@class="R_WebDetail_content_tab"]//tr[1]//td[3]/text()').extract()
                    product_list = []
                    product_tr = hxs.select('//table[@class="R_WebDetail_content_tab"]//tr')
                    for i in range(2, len(product_tr)):
                        temp = {
                            'standard': hxs.select('//table[@class="R_WebDetail_content_tab"]//tr[%d]//td[2]/text()' % i).extract()[0].strip(),
                            'price': hxs.select('//table[@class="R_WebDetail_content_tab"]//tr[%d]//td[3]/text()' % i).extract()[0].strip(),
                        }
                        product_list.append(temp)

                    item['product_list'] = product_list
                    yield item

                current_page_urls = hxs.select('//a/@href').extract()
                for i in range(len(current_page_urls)):
                    url = current_page_urls[i]
                    if url.startswith('http://www.jluosi.com'):
                        url_ab = url
                        yield Request(url_ab, callback=self.parse)

Listing: getting the response cookies

    def parse(self, response):
        from scrapy.http.cookies import CookieJar
        cookieJar = CookieJar()
        cookieJar.extract_cookies(response, response.request)
        print(cookieJar._cookies)

5. Structured processing

The examples above only handle images, so everything is done directly in the parse method. To collect more data (price, product name, QQ number, and so on), use Scrapy's items to give the data a structure and then hand it to the pipelines for processing.

Create the class in items.py:

    # Define here the models for your scraped items
    #
    # See documentation in:
    # http://doc.scrapy.org/en/latest/topics/items.html

    import scrapy


    class JieYiCaiItem(scrapy.Item):

        company = scrapy.Field()
        title = scrapy.Field()
        qq = scrapy.Field()
        info = scrapy.Field()
        more = scrapy.Field()

With this template defined, all data taken from the fetched page source is collected in this structure, so the spider needs to do the following:

Listing: spider

    import scrapy
    import hashlib
    from beauty.items import JieYiCaiItem
    from scrapy.http import Request
    from scrapy.selector import HtmlXPathSelector
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor


    class JieYiCaiSpider(scrapy.spiders.Spider):
        count = 0
        url_set = set()

        name = "jieyicai"
        domain = 'http://www.jieyicai.com'
        allowed_domains = ["jieyicai.com"]

        start_urls = [
            "http://www.jieyicai.com",
        ]

        rules = [
            # URLs matching this rule are not scraped; only their links are extracted
            # (the URL is fictitious, replace it in real use)
            # Rule(SgmlLinkExtractor(allow=(r'http://test_url/test?page_index=\d+'))),
            # URLs matching this rule are scraped for content
            # (the URL is fictitious, replace it in real use)
            # Rule(LinkExtractor(allow=(r"http://www.jieyicai.com/Product/Detail.aspx?pid=\d+")), callback="parse"),
        ]

        def parse(self, response):
            md5_obj = hashlib.md5()
            md5_obj.update(response.url)
            md5_url = md5_obj.hexdigest()
            if md5_url in JieYiCaiSpider.url_set:
                pass
            else:
                JieYiCaiSpider.url_set.add(md5_url)

                hxs = HtmlXPathSelector(response)
                if response.url.startswith('http://www.jieyicai.com/Product/Detail.aspx'):
                    item = JieYiCaiItem()
                    item['company'] = hxs.select('//span[@class="username g-fs-14"]/text()').extract()
                    item['qq'] = hxs.select('//span[@class="g-left bor1qq"]/a/@href').re(r'.*uin=(?P<qq>\d*)&')
                    item['info'] = hxs.select('//div[@class="padd20 bor1 comard"]/text()').extract()
                    item['more'] = hxs.select('//li[@class="style4"]/a/@href').extract()
                    item['title'] = hxs.select('//div[@class="g-left prodetail-text"]/h2/text()').extract()
                    yield item

                current_page_urls = hxs.select('//a/@href').extract()
                for i in range(len(current_page_urls)):
                    url = current_page_urls[i]
                    if url.startswith('/'):
                        url_ab = JieYiCaiSpider.domain + url
                        yield Request(url_ab, callback=self.parse)

The key point here is that the scraped data is wrapped in an Item object, which is then yielded. As soon as parse yields an Item object, Scrapy automatically hands it to the classes in pipelines for processing.

Listing: pipelines

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

    import json
    from twisted.enterprise import adbapi
    import MySQLdb.cursors
    import re

    mobile_re = re.compile(r'(13[0-9]|15[012356789]|17[678]|18[0-9]|14[57])[0-9]{8}')
    phone_re = re.compile(r'(\d+-\d+|\d+)')


    class JsonPipeline(object):

        def __init__(self):
            self.file = open('/Users/wupeiqi/PycharmProjects/beauty/beauty/jieyicai.json', 'wb')

        def process_item(self, item, spider):
            line = "%s  %s\n" % (item['company'][0].encode('utf-8'), item['title'][0].encode('utf-8'))
            self.file.write(line)
            return item


    class DBPipeline(object):

        def __init__(self):
            self.db_pool = adbapi.ConnectionPool('MySQLdb',
                                                 db='DbCenter',
                                                 user='root',
                                                 passwd='123',
                                                 cursorclass=MySQLdb.cursors.DictCursor,
                                                 use_unicode=True)

        def process_item(self, item, spider):
            query = self.db_pool.runInteraction(self._conditional_insert, item)
            query.addErrback(self.handle_error)
            return item

        def _conditional_insert(self, tx, item):
            # Insert the company only if it is not already in the table
            tx.execute("select nid from company where company = %s", (item['company'][0], ))
            result = tx.fetchone()
            if result:
                pass
            else:
                phone_obj = phone_re.search(item['info'][0].strip())
                phone = phone_obj.group() if phone_obj else ' '

                mobile_obj = mobile_re.search(item['info'][1].strip())
                mobile = mobile_obj.group() if mobile_obj else ' '

                values = (
                    item['company'][0],
                    item['qq'][0],
                    phone,
                    mobile,
                    item['info'][2].strip(),
                    item['more'][0])
                tx.execute("insert into company(company,qq,phone,mobile,address,more) values(%s,%s,%s,%s,%s,%s)", values)

        def handle_error(self, e):
            print 'error', e

The pipelines file above contains more than one class, so which one does Scrapy run? That has to be configured first; otherwise Scrapy cannot know. Add the following to settings.py:

    ITEM_PIPELINES = {
        'beauty.pipelines.DBPipeline': 300,
        'beauty.pipelines.JsonPipeline': 100,
    }
    # The integer after each entry determines the order in which the pipelines run:
    # an item passes through the pipelines from the lowest number to the highest.
    # These numbers are usually chosen in the range 0-1000.

Reposted from: https://www.cnblogs.com/lst1010/p/6582065.html
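To make the ITEM_PIPELINES ordering rule above concrete, here is a minimal Python 3 sketch that sorts the configured pipelines by their integer value, lowest first. It only illustrates the rule as stated in this article and is not Scrapy's own implementation; pipeline_order is a hypothetical helper name introduced for the example.

```python
# Stand-in for the ITEM_PIPELINES setting from settings.py shown above.
ITEM_PIPELINES = {
    'beauty.pipelines.DBPipeline': 300,
    'beauty.pipelines.JsonPipeline': 100,
}


def pipeline_order(setting):
    """Return the pipeline class paths sorted by ascending priority value.

    Hypothetical helper: it mirrors the "lowest number runs first" rule,
    it does not call into Scrapy.
    """
    ranked = sorted(setting.items(), key=lambda pair: pair[1])
    return [path for path, _priority in ranked]


if __name__ == '__main__':
    for position, path in enumerate(pipeline_order(ITEM_PIPELINES), start=1):
        print(position, path)
```

With the values from this article, JsonPipeline (100) comes before DBPipeline (300), so each item would be written to the JSON file before the database insert is attempted.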

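The WeChat login step earlier in this article pulls the token out of the response body with a regular expression. As an alternative sketch in Python 3, assuming the body is JSON with a redirect_url field exactly like the sample response shown in the article, the token can also be read by parsing the JSON and then the query string. extract_token is a hypothetical helper name, and the sample string is copied from the article rather than from a live request.

```python
import json
from urllib.parse import urlparse, parse_qs

# Sample login response body copied from the article (not a live response).
SAMPLE_RESPONSE = (
    r'{"base_resp":{"ret":0,"err_msg":"ok"},'
    r'"redirect_url":"\/cgi-bin\/home?t=home\/index&lang=zh_CN&token=537908795"}'
)


def extract_token(resp_text):
    """Return the token from a login response body, or None if it is absent.

    Assumes the body is JSON and that the token is a query parameter of
    redirect_url, as in the sample above.
    """
    payload = json.loads(resp_text)
    redirect_url = payload.get('redirect_url', '')
    query = urlparse(redirect_url).query
    values = parse_qs(query).get('token')
    return values[0] if values else None


if __name__ == '__main__':
    print(extract_token(SAMPLE_RESPONSE))
```

Parsing the structure instead of matching text means a missing token yields None rather than an IndexError from indexing an empty re.findall result.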