Let's build a crawler step by step, scraping jokes from Qiushibaike (糗事百科). To begin with, we won't use the BeautifulSoup package for parsing.

Step 1: request the URL and fetch the page source.

# -*- coding: utf-8 -*-
# Author: HaonanWu
# Date: 2016-12-22 16:16:08
# Last Modified by: HaonanWu
# Last Modified time: 2016-12-22 20:17:13
import urllib
import urllib2
import re
import os

if __name__ == '__main__':
    # Request the URL and fetch the page source
    url = 'http://www.qiushibaike.com/textnew/page/1/?s=4941357'
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
    headers = {'User-Agent': user_agent}
    try:
        request = urllib2.Request(url=url, headers=headers)
        response = urllib2.urlopen(request)
        content = response.read()
    except urllib2.HTTPError as e:
        print e
        exit()
    except urllib2.URLError as e:
        print e
        exit()
    print content.decode('utf-8')

Step 2: extract the information with a regular expression.

First inspect the page source, find where the content you need sits and what identifies it, then read it out with a regular expression. Note that the dot (.) in a regular expression cannot match \n by default, so you need to set the matching mode with re.S (a short demo follows Step 4).

# -*- coding: utf-8 -*-
# Author: HaonanWu
# Date: 2016-12-22 16:16:08
# Last Modified by: HaonanWu
# Last Modified time: 2016-12-22 20:17:13
import urllib
import urllib2
import re
import os

if __name__ == '__main__':
    # Request the URL and fetch the page source
    url = 'http://www.qiushibaike.com/textnew/page/1/?s=4941357'
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
    headers = {'User-Agent': user_agent}
    try:
        request = urllib2.Request(url=url, headers=headers)
        response = urllib2.urlopen(request)
        content = response.read()
    except urllib2.HTTPError as e:
        print e
        exit()
    except urllib2.URLError as e:
        print e
        exit()

    # Extract the data.
    # Mind the newlines: re.S lets . match \n as well.
    regex = re.compile('<div class="content">.*?<span>(.*?)</span>', re.S)
    items = re.findall(regex, content)
    for item in items:
        print item

Step 3: clean up the data (drop the layout newlines, turn <br/> tags into real line breaks; see the second demo after Step 4) and save it to files.

# -*- coding: utf-8 -*-
# Author: HaonanWu
# Date: 2016-12-22 16:16:08
# Last Modified by: HaonanWu
# Last Modified time: 2016-12-22 21:41:32
import urllib
import urllib2
import re
import os

if __name__ == '__main__':
    # Request the URL and fetch the page source
    url = 'http://www.qiushibaike.com/textnew/page/1/?s=4941357'
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
    headers = {'User-Agent': user_agent}
    try:
        request = urllib2.Request(url=url, headers=headers)
        response = urllib2.urlopen(request)
        content = response.read()
    except urllib2.HTTPError as e:
        print e
        exit()
    except urllib2.URLError as e:
        print e
        exit()

    # Extract the data.
    # Mind the newlines: re.S lets . match \n as well.
    regex = re.compile('<div class="content">.*?<span>(.*?)</span>', re.S)
    items = re.findall(regex, content)

    path = './qiubai'
    if not os.path.exists(path):
        os.makedirs(path)

    count = 1
    for item in items:
        # Clean up the data: drop the \n characters, turn <br/> into \n
        item = item.replace('\n', '').replace('<br/>', '\n')
        filepath = path + '/' + str(count) + '.txt'
        f = open(filepath, 'w')
        f.write(item)
        f.close()
        count += 1

Step 4: scrape the content of multiple pages.

# -*- coding: utf-8 -*-
# Author: HaonanWu
# Date: 2016-12-22 16:16:08
# Last Modified by: HaonanWu
# Last Modified time: 2016-12-22 20:17:13
import urllib
import urllib2
import re
import os

if __name__ == '__main__':
    # Request the URLs and fetch the page sources
    path = './qiubai'
    if not os.path.exists(path):
        os.makedirs(path)

    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
    headers = {'User-Agent': user_agent}
    regex = re.compile('<div class="content">.*?<span>(.*?)</span>', re.S)

    count = 1
    for cnt in range(1, 35):
        print 'Round ' + str(cnt)
        url = 'http://www.qiushibaike.com/textnew/page/' + str(cnt) + '/?s=4941357'
        try:
            request = urllib2.Request(url=url, headers=headers)
            response = urllib2.urlopen(request)
            content = response.read()
        except urllib2.HTTPError as e:
            print e
            exit()
        except urllib2.URLError as e:
            print e
            exit()
        # print content

        # Extract the data.
        # Mind the newlines: re.S lets . match \n as well.
        items = re.findall(regex, content)

        # Save the results
        for item in items:
            # print item
            # Clean up the data: drop the \n characters, turn <br/> into \n
            item = item.replace('\n', '').replace('<br/>', '\n')
            filepath = path + '/' + str(count) + '.txt'
            f = open(filepath, 'w')
            f.write(item)
            f.close()
            count += 1

    print 'Done'
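As promised in Step 2, here is a minimal self-contained demo of what re.S changes. The HTML string is made up for illustration, but it has the same shape as the joke markup the regular expression above targets:

# -*- coding: utf-8 -*-
import re

# A made-up snippet shaped like the joke markup above.
html = '<div class="content">\n<span>line one\nline two</span>\n</div>'

# Without re.S, . does not match \n, so the pattern cannot span lines.
print re.findall('<span>(.*?)</span>', html)        # []

# With re.S, . matches \n as well and the full text is captured.
print re.findall('<span>(.*?)</span>', html, re.S)  # ['line one\nline two']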
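Likewise, here is the Step 3 cleanup rule applied to an invented sample capture; it shows why both replacements are needed and why they run in this order:

# -*- coding: utf-8 -*-

# An invented sample of what one regex capture might look like: the page
# source wraps text in layout \n characters, while the real line breaks
# inside a joke are <br/> tags.
item = '\nfirst line<br/>second line\n'

# Drop the layout newlines first, then turn <br/> into real newlines.
item = item.replace('\n', '').replace('<br/>', '\n')

print item
# Output:
# first line
# second line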
Parsing the source with BeautifulSoup

# -*- coding: utf-8 -*-
# Author: HaonanWu
# Date: 2016-12-22 16:16:08
# Last Modified by: HaonanWu
# Last Modified time: 2016-12-22 21:34:02
import urllib
import urllib2
import re
import os
from bs4 import BeautifulSoup

if __name__ == '__main__':
    url = 'http://www.qiushibaike.com/textnew/page/1/?s=4941357'
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
    headers = {'User-Agent': user_agent}
    request = urllib2.Request(url=url, headers=headers)
    response = urllib2.urlopen(request)
    # print response.read()

    soup_packetpage = BeautifulSoup(response, 'lxml')
    items = soup_packetpage.find_all('div', class_='content')

    for item in items:
        try:
            content = item.span.string
        except AttributeError as e:
            print e
            exit()
        if content:
            print content + '\n'

Below is code that uses BeautifulSoup to scrape books and their prices. By comparing it with the script above, you can work out how bs4 reads tags and tag contents. (I haven't studied this part myself yet, so for now I can only write it by following the same pattern.)

# -*- coding: utf-8 -*-
# Author: HaonanWu
# Date: 2016-12-22 20:37:38
# Last Modified by: HaonanWu
# Last Modified time: 2016-12-22 21:27:30
import urllib2
import urllib
import re
from bs4 import BeautifulSoup

url = 'https://www.packtpub.com/all'
try:
    html = urllib2.urlopen(url)
except urllib2.HTTPError as e:
    print e
    exit()

soup_packtpage = BeautifulSoup(html, 'lxml')
all_book_title = soup_packtpage.find_all('div', class_='book-block-title')
price_regexp = re.compile(u'\s+\$\s\d+\.\d+')

for book_title in all_book_title:
    try:
        print "Book's name is " + book_title.string.strip()
    except AttributeError as e:
        print e
        exit()
    book_price = book_title.find_next(text=price_regexp)
    try:
        print "Book's price is " + book_price.strip()
    except AttributeError as e:
        print e
        exit()
    print

That is all for this article. I hope it helps with your studies, and I hope you will continue to support 脚本之家.
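To see the tag-and-content reading from these two scripts without touching the network, here is a small self-contained sketch. The markup is invented for the demo, but the calls (find_all with class_, .string, and find_next with a text regular expression) mirror the ones used above:

# -*- coding: utf-8 -*-
import re
from bs4 import BeautifulSoup

# Invented markup shaped roughly like the Packtpub page used above.
html = '''
<div class="book-block-title">Learning Python</div>
<div class="book-price"> $ 39.99 </div>
<div class="book-block-title">Web Scraping Basics</div>
<div class="book-price"> $ 24.50 </div>
'''

soup = BeautifulSoup(html, 'lxml')
price_regexp = re.compile(u'\s+\$\s\d+\.\d+')

for title in soup.find_all('div', class_='book-block-title'):
    # .string gives the tag's text when it has exactly one text child.
    print "Book's name is " + title.string.strip()
    # find_next(text=...) walks forward through the document and returns
    # the first text node whose contents match the regular expression.
    price = title.find_next(text=price_regexp)
    print "Book's price is " + price.strip()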