

news 2025/11/15 4:04:04

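Contents

I. Objectives
II. Preparation
III. Task
IV. Requirements
V. Experiment Procedure
    Basic requirement
    Improvement A
    Improvement B
VI. Resources
    1. Framework code
    2. OpenSSL: Win32/Win64 OpenSSL Installer for Windows - Shining Light Productions (slproweb.com)
    3. JSON storage
    4. Hints
VII. Source Code

I. Objectives

Become familiar with tools such as Selenium and Pyppeteer for scraping basic website content. By analyzing a website that uses text-based anti-scraping techniques, design a scraping strategy that recovers the text in its correct form.

II. Preparation

Install the Python environment: Python 3.x, PyCharm, and Anaconda.

Install the Selenium and PyQuery libraries. Open PyCharm, create a new project, select the Python environment created by Anaconda, and run the following in the Console window:

    pip install selenium
    pip install pyquery

Install Chrome and the matching ChromeDriver. After installing Chrome, check its version through the menu Help - About Chrome (105.0.5195.127 in this write-up). Then, from the official ChromeDriver site, ChromeDriver - WebDriver for Chrome - Downloads (chromium.org), download the driver that matches the Chrome version (105.0.5195.x; any build with the same major version 105 works). Follow the underlined link and pick the file for your operating system. On Windows this is chromedriver_win32.zip; on other systems, download the corresponding build.

The environment-variable step that follows can be skipped; see the newer post "【爬虫】5.2 Selenium编写爬虫程序" on the CSDN blog 即使再小的船也能远航. Otherwise, before running the code, add the ChromeDriver location to the system Path environment variable (the directory where ChromeDriver was unpacked), or specify it in the PyCharm run configuration.

III. Task

Scrape the website Scrape | Book. Use the browser developer tools (F12) to analyze the site structure and its text anti-scraping mechanism, then write code that retrieves the cover image URL, title, and authors of every book on the site. The framework code for the experiment is listed under Resources at the end of this document.

IV. Requirements

Basic requirement: save the information for every book on one page of the site to a JSON file. Each file is named after the book (title.json) and holds that book's information:

    {
      "title": "Wonder",
      "cover_url": "https://img1.doubanio.com/view/subject/l/public/s27252687.jpg",
      "authors": "R. J. Palacio"
    }

Selenium and Pyppeteer are not mandatory, but the solution must be written in Python and must satisfy the requirements. Attach the code and its output.

Improvement A: building on the basic requirement, choose one of two options. Option 1: traverse every page of the site to scrape book information, or accept a target number of entries and stop once that many have been scraped. Option 2: analyze the text anti-scraping techniques of at least three other websites and propose solutions (no implementation required).

Improvement B: building on option 1 of Improvement A, scrape additional information for each book, such as rating, publication date, publisher, ISBN, and price.

V. Experiment Procedure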
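Basic requirement

To scrape the page, first analyze its structure by viewing the source.

[Figure: page source in the browser developer tools]

Clicking a cover opens that book's detail (second-level) page. The cover link holds the trailing part of the detail-page address, which is used in Improvement B.

The cover URL of each book can be selected by class with img.cover:

    # Get the cover image URL of each book
    for tag in soup.select("img.cover"):
        pics.append(tag.attrs["src"])

Every title sits in an h3 heading. An English title can be read directly from h3.name, but a Chinese title is split across several SPAN elements with class "char". This is the text anti-scraping mechanism: CSS offsets are used to shuffle the order of the characters in the HTML. The displayed order is determined by each character's left property, so sorting the characters by their left value in ascending order restores the correct text.

    # Get the title of each book
    for tag in soup.select("h3.name"):
        if "whole" in tag.attrs["class"]:
            # The whole title is stored as plain text
            names.append(tag.text)
        else:
            chars = tag.select("span.char")
            # style looks like "left: 32px;", so [6:-3] is the number;
            # int() replaces the original eval(), which is unsafe on page data
            chars = sorted(chars, key=lambda a: int(a.attrs["style"][6:-3]))
            name = ""
            for char in chars:
                name += char.text.strip()
            names.append(name)

The authors can be selected directly by class with p.authors:

    # Get the authors of each book
    for tag in soup.select("p.authors"):
        authors.append(tag.text.strip().replace(" ", "").replace("\n", ""))

Improvement A

Option 1 is implemented here: traversing the pages of the site to scrape book information. The browser URL https://antispider3.scrape.center/page/2 shows that each page is addressed by appending /page/ and the page number, so it is enough to append the number as text to the base URL:

    url = "https://antispider3.scrape.center/page/"
    page_start = int(input("First page to scrape (inclusive): "))
    page_end = int(input("Last page to scrape (exclusive): "))
    for i in range(page_start, page_end):
        names, pics, authors, links = get_cover(url + str(i))

To stop after a given number of entries, keep a counter inside the scraping loop and break once it reaches the target. Each page holds 18 books, so when the target exceeds 18 the loop has to continue on the next page. This can be added directly to the code above: give the end page a very large value and let the counter's break end the loop, so it cannot run indefinitely.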
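The counter-based stopping rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this experiment: `collect_books`, `fetch_page`, and `fake_fetch` are hypothetical names, and `fetch_page` stands in for any function that returns the books found on one list page.

```python
from typing import Callable, List


def collect_books(fetch_page: Callable[[int], List[dict]],
                  limit: int,
                  start_page: int = 1,
                  max_pages: int = 1000) -> List[dict]:
    """Collect up to `limit` books, moving on to the next page when needed.

    fetch_page is a hypothetical callable that returns the books on one
    page, or an empty list when the page has no books.
    """
    books: List[dict] = []
    if limit <= 0:
        return books
    for page in range(start_page, start_page + max_pages):
        page_books = fetch_page(page)
        if not page_books:
            break  # ran out of pages before reaching the limit
        for book in page_books:
            books.append(book)
            if len(books) >= limit:
                return books  # counter reached the target: stop at once
    return books


def fake_fetch(page: int) -> List[dict]:
    """Stand-in for a real scraper: 18 books per page, 3 pages in total."""
    if page > 3:
        return []
    return [{"title": f"book-{page}-{n}"} for n in range(18)]
```

Calling `collect_books(fake_fetch, limit=20)` reads the 18 books of the first page and two more from the second. The `max_pages` bound plays the role of the "very large end page" mentioned above, so the loop always terminates even if the limit is never reached.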
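Improvement B

The page analysis shows that each book's detail page is addressed by appending /detail/ and a number to https://antispider3.scrape.center. That part of the address is stored in the href attribute of an a tag. Because the page contains many hyperlinks, the code first uses find_all to collect the div elements with class "el-col el-col-24" and then reads the link from every other one. The keyword argument is spelled class_ because class is a reserved word in Python. Each scraped path is then joined to the base URL.

    # Get the detail-page URL of each book
    tags = soup.find_all("div", class_="el-col el-col-24")
    print(len(tags))
    for tag in tags[::2]:  # every other matching div
        link = tag.find("a").get("href")
        links.append(url1 + link)
    print(links)

With the detail-page URL of every book available, the structure of the detail page can be analyzed to scrape the remaining information.

[Figure: structure of the detail page, with the scraped fields marked]

The detail page is clearly structured: the rating is in a span tag and everything else is in p tags. Only the fields marked in the figure are scraped here. Any other field works the same way with a different class, so it is not covered further.

Scraping many pages revealed a problem: some books have no publisher, page count, or other fields. Each lookup is therefore compared against None, and a missing field is stored as an empty value. Handling each missing field specially would mean repeating the same code for every attribute, which is too redundant. I could not think of a better approach, so no special handling was added.

The scraped text contains many spaces and \n characters.

[Figure: raw scraped text with extra whitespace]

The chain of replace calls in get_details removes them to produce the desired format. The other fields are handled the same way.

The main part passes each book's detail-page URL to get_details:

    for link in links:
        print(link)
        score, price, publishtime, publisher, page, isbm = get_details(link)

It then iterates over the books and saves each one to a JSON file named after its title:

    for i in range(len(names)):
        book = {"title": names[i],
                "cover_url": pics[i],
                "authors": authors[i],
                "link": links[i],
                "score": scores[i],
                "price": prices[i],
                "publish_time": publishtimes[i],
                "publishers": publishers[i],
                "pages": pages[i],
                "ISBM": isbms[i]}
        data_path = f"{book['title']}.json"
        with open(data_path, "w", encoding="utf-8") as f:
            json.dump(book, f, ensure_ascii=False, indent=2)

[Figure: scraping results]

Summary

A computer science course that is all theory and no practice is armchair strategy. This exercise was not especially hard, but a few points did touch blind spots in my knowledge, such as the titles built from span.char elements. Practice is the sole criterion for testing truth. My scraping skills are still limited: every detail page is loaded in a newly opened browser, which wastes a lot of time. I will come back to this problem once I have learned more.

VI. Resources

1. Framework code

    from selenium import webdriver
    from pyquery import PyQuery as pq
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.wait import WebDriverWait

    browser = webdriver.Chrome()
    browser.get("https://antispider3.scrape.center/")
    WebDriverWait(browser, 10).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".item")))
    html = browser.page_source
    doc = pq(html)
    names = doc(".item .name")
    for name in names.items():
        print(name.text())

2. OpenSSL: Win32/Win64 OpenSSL Installer for Windows - Shining Light Productions (slproweb.com)

3. JSON storage (the json module is part of the Python standard library, so nothing has to be installed)

    import json

    book = {
        "title": "Wonder",
        "cover_url": "https://img1.doubanio.com/view/subject/l/public/s27252687.jpg",
        "authors": "R. J. Palacio",
    }
    data_path = f"{book['title']}.json"
    with open(data_path, "w", encoding="utf-8") as f:
        json.dump(book, f, ensure_ascii=False, indent=2)

4. Hints

The HTML structure shows that the information for each book is held in one element (its name is missing from the source text). Some titles are stored in an H3 element with class "name whole"; others are split across several SPAN elements with class "char". A title stored in the H3 element can be read directly from the element's content. A title split across SPAN elements is protected by the text anti-scraping mechanism, which uses CSS offsets to shuffle the character order. The displayed order is determined by the left property, so sorting the characters by their left value in ascending order restores the correct text.

VII. Source Code

    import json
    import warnings

    from bs4 import BeautifulSoup
    from pyquery import PyQuery as pq
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.wait import WebDriverWait

    # Containers for the scraped book information
    names = []         # titles
    authors = []       # authors
    pics = []          # cover image URLs
    links = []         # detail-page links
    scores = []        # ratings
    prices = []        # list prices
    publishtimes = []  # publication dates
    publishers = []    # publishers
    pages = []         # page counts
    isbms = []         # ISBNs


    # Get the cover information and detail-page URL of each book on a list page
    def get_cover(url):
        warnings.filterwarnings("ignore")
        browser = webdriver.Chrome()
        browser.get(url)
        WebDriverWait(browser, 10).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".item")))
        html = browser.page_source
        doc = pq(html)
        # Parse the page with BeautifulSoup
        soup = BeautifulSoup(doc.html(), "html.parser")
        browser.close()
        # Get the title of each book
        for tag in soup.select("h3.name"):
            if "whole" in tag.attrs["class"]:
                names.append(tag.text)
            else:
                chars = tag.select("span.char")
                # style looks like "left: 32px;"; int() replaces the unsafe eval()
                chars = sorted(chars, key=lambda a: int(a.attrs["style"][6:-3]))
                name = ""
                for char in chars:
                    name += char.text.strip()
                names.append(name)
        # Get the authors of each book
        for tag in soup.select("p.authors"):
            authors.append(tag.text.strip().replace(" ", "").replace("\n", ""))
        # Get the cover image URL of each book
        for tag in soup.select("img.cover"):
            pics.append(tag.attrs["src"])
        # Get the detail-page URL of each book
        tags = soup.find_all("div", class_="el-col el-col-24")
        print(len(tags))
        for tag in tags[::2]:  # every other matching div
            link = tag.find("a").get("href")
            links.append(url1 + link)
        print(links)
        return names, pics, authors, links


    # Get the detailed information of one book from its detail page
    def get_details(url):
        warnings.filterwarnings("ignore")
        browser = webdriver.Chrome()
        browser.get(url)
        WebDriverWait(browser, 300).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".item")))
        html = browser.page_source
        doc = pq(html)
        # Parse the page with BeautifulSoup
        soup = BeautifulSoup(doc.html(), "html.parser")
        # NOTE: the separator passed to split() was lost in the source text;
        # the full-width colon used below is an assumption.
        # Rating
        score = soup.find("span", class_="score m-r m-b-sm")
        if score is not None:
            score = score.text
            score = str(score).replace(" ", "").replace("\t", "").replace("\n", "")
        else:
            score = ""
        scores.append(score)
        # List price
        price = soup.find("p", class_="price")
        if price is not None:
            price = price.text
            price = str(price).replace(" ", "").replace("\t", "").replace("\n", "").split("：")[1]
        else:
            price = ""
        prices.append(price)
        # Publication date
        publishtime = soup.find("p", class_="published-at")
        if publishtime is not None:
            publishtime = publishtime.text
            publishtime = str(publishtime).replace(" ", "").replace("\t", "").replace("\n", "").split("：")[1]
        else:
            publishtime = ""
        publishtimes.append(publishtime)
        # Publisher
        publisher = soup.find("p", class_="publisher")
        if publisher is not None:
            publisher = publisher.text
            publisher = str(publisher).replace(" ", "").replace("\t", "").replace("\n", "").split("：")[1]
        else:
            publisher = ""
        publishers.append(publisher)
        # Page count
        page = soup.find("p", class_="page-number")
        if page is not None:
            page = page.text
            page = str(page).replace(" ", "").replace("\t", "").replace("\n", "").split("：")[1]
        else:
            page = ""
        pages.append(page)
        # ISBN
        isbm = soup.find("p", class_="isbn")
        if isbm is not None:
            isbm = isbm.text
            isbm = str(isbm).replace(" ", "").replace("\t", "").replace("\n", "").split("：")[1]
        else:
            isbm = ""
        isbms.append(isbm)
        browser.close()
        return score, price, publishtime, publisher, page, isbm


    if __name__ == "__main__":
        url1 = "https://antispider3.scrape.center"
        url = "https://antispider3.scrape.center/page/"
        page_start = int(input("First page to scrape (inclusive): "))
        page_end = int(input("Last page to scrape (exclusive): "))
        for i in range(page_start, page_end):
            names, pics, authors, links = get_cover(url + str(i))
        # Indentation was lost in the source text. The detail loop is placed
        # after the page loop because links accumulates across pages.
        for link in links:
            print(link)
            score, price, publishtime, publisher, page, isbm = get_details(link)
        for i in range(len(names)):
            book = {"title": names[i],
                    "cover_url": pics[i],
                    "authors": authors[i],
                    "link": links[i],
                    "score": scores[i],
                    "price": prices[i],
                    "publish_time": publishtimes[i],
                    "publishers": publishers[i],
                    "pages": pages[i],
                    "ISBM": isbms[i]}
            data_path = f"{book['title']}.json"
            with open(data_path, "w", encoding="utf-8") as f:
                json.dump(book, f, ensure_ascii=False, indent=2)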
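The title-reordering step in get_cover reads the offset with a fixed slice, which assumes every style attribute is exactly of the form "left: Npx;". A slightly more tolerant sketch, using only the standard library, might look like the following. The names `left_offset` and `restore_title`, and the input format of (style, character) pairs, are hypothetical and chosen so that the idea can be tried without a browser.

```python
import re
from typing import List, Tuple

# Matches the numeric offset in a style such as "left: 32px;".
# The lookbehind keeps properties like "padding-left" from matching.
_LEFT_RE = re.compile(r"(?<![-\w])left\s*:\s*(-?\d+(?:\.\d+)?)px")


def left_offset(style: str) -> float:
    """Return the CSS left offset in pixels, or raise ValueError."""
    match = _LEFT_RE.search(style)
    if match is None:
        raise ValueError(f"no left offset in style: {style!r}")
    return float(match.group(1))


def restore_title(chars: List[Tuple[str, str]]) -> str:
    """Rebuild a title from (style, character) pairs.

    The characters are scrambled in the HTML; sorting them by their left
    offset in ascending order restores the order in which they are shown.
    """
    ordered = sorted(chars, key=lambda pair: left_offset(pair[0]))
    return "".join(char.strip() for _, char in ordered)
```

With BeautifulSoup, the pairs could be built from one h3 tag as `[(s.attrs["style"], s.text) for s in tag.select("span.char")]`. Raising an error on an unexpected style is a design choice for this sketch: it makes a change in the page's markup visible instead of silently producing a misordered title.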
