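Fetching IEEE Conference Paper Titles and Abstracts - Pan Deng's Crawler Notes
A few days ago my advisor asked me to find papers related to finance and enterprises in the journal IEEE Transactions on Knowledge and Data Engineering. At first I was lost on the IEEE website. I tried CNKI and many other paper sites, but it was hard to restrict the search to this one journal.
I then used Google Scholar's advanced search and found about 30 papers, but my advisor said Google's coverage was incomplete and told me to search the official site, so I went back to it. The site turned out to be painfully slow, and it does not support Ctrl + left click to open a paper in a new tab: every time I opened a paper, glanced at it, and went back to the search results, I had to wait again. So, after studying the IEEE search result page and the paper detail page, I wrote a fully automatic script that fetches paper titles and abstracts in bulk and translates them into Chinese.
Opening the IEEE advanced search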
However powerful a crawler is, it cannot pull down the whole paper database, so a first round of filtering still has to be done with the search engine on the official site. Click Advanced search. Wrap each keyword in quotation marks and separate the keywords with OR (remember the spaces around it). In the publication title field, enter IEEE Transactions on Knowledge and Data Engineering. Click search. This opens a results page, which gives us a URL:
https://ieeexplore.ieee.org/search/searchresult.jsp?action=search&newsearch=true&matchBoolean=true&queryText=(%22All%20Metadata%22:%22financial%22%20OR%20%22All%20Metadata%22:%22finance%22%20OR%20%22All%20Metadata%22:%22trade%22%EF%BC%8C%22trading%22%20OR%20%22All%20Metadata%22:%22bank%22%20OR%20%22All%20Metadata%22:%22company%22%20OR%20%22All%20Metadata%22:%22enterprise%22%20OR%20%22All%20Metadata%22:%22management%22%20OR%20%22All%20Metadata%22:%22credit%22%20OR%20%22All%20Metadata%22:%22default%22%20OR%20%22All%20Metadata%22:%22risk%22%20OR%20%22All%20Metadata%22:%22asset%22%20OR%20%22All%20Metadata%22:%22bond%22%20OR%20%22All%20Metadata%22:%22stock%22%20OR%20%22All%20Metadata%22:%22equity%22%20OR%20%22All%20Metadata%22:%22volalitity%22%20OR%20%22All%20Metadata%22:%22futures%22%20OR%20%22All%20Metadata%22:%22share%22%20%22option%22%20OR%20%22All%20Metadata%22:%22return%22%20OR%20%22All%20Metadata%22:%22price%22%20OR%20%22All%20Metadata%22:%22pricing%22%20OR%20%22All%20Metadata%22:%22earning%22%20OR%20%22All%20Metadata%22:%22interest%22%20OR%20%22All%20Metadata%22:%22investment%22%20OR%20%22All%20Metadata%22:%22loan%22%20OR%20%22All%20Metadata%22:%22bankruptcy%22%20OR%20%22All%20Metadata%22:%22arbitrary%22)%20AND%20(%22Publication%20Title%22:IEEE%20Transactions%C2%A0on%C2%A0Knowledge%C2%A0and%20Data%20Engineering)&ranges=2020_2024_Year
Three years of coding experience told me this URL is incomplete: the search results are paginated, yet the URL carries no page number. So I clicked through to the next page and got another URL:
https://ieeexplore.ieee.org/search/searchresult.jsp?action=search&newsearch=true&matchBoolean=true&queryText=(%22All%20Metadata%22:%22financial%22%20OR%20%22All%20Metadata%22:%22finance%22%20OR%20%22All%20Metadata%22:%22trade%22%EF%BC%8C%22trading%22%20OR%20%22All%20Metadata%22:%22bank%22%20OR%20%22All%20Metadata%22:%22company%22%20OR%20%22All%20Metadata%22:%22enterprise%22%20OR%20%22All%20Metadata%22:%22management%22%20OR%20%22All%20Metadata%22:%22credit%22%20OR%20%22All%20Metadata%22:%22default%22%20OR%20%22All%20Metadata%22:%22risk%22%20OR%20%22All%20Metadata%22:%22asset%22%20OR%20%22All%20Metadata%22:%22bond%22%20OR%20%22All%20Metadata%22:%22stock%22%20OR%20%22All%20Metadata%22:%22equity%22%20OR%20%22All%20Metadata%22:%22volalitity%22%20OR%20%22All%20Metadata%22:%22futures%22%20OR%20%22All%20Metadata%22:%22share%22%20%22option%22%20OR%20%22All%20Metadata%22:%22return%22%20OR%20%22All%20Metadata%22:%22price%22%20OR%20%22All%20Metadata%22:%22pricing%22%20OR%20%22All%20Metadata%22:%22earning%22%20OR%20%22All%20Metadata%22:%22interest%22%20OR%20%22All%20Metadata%22:%22investment%22%20OR%20%22All%20Metadata%22:%22loan%22%20OR%20%22All%20Metadata%22:%22bankruptcy%22%20OR%20%22All%20Metadata%22:%22arbitrary%22)%20AND%20(%22Publication%20Title%22:IEEE%20Transactions%C2%A0on%C2%A0Knowledge%C2%A0and%20Data%20Engineering)&ranges=2020_2024_Year&highlight=true&returnFacets=ALL&returnType=SEARCH&matchPubs=true&pageNumber=2
Now several parameters appear after Year. They are presumably defaults and can be left alone, but the last one is pageNumber=2: the page number is the part of this URL that matters for crawling.
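The paging rule can be sketched as a small URL builder. This is only an illustration: the parameter names are copied from the URLs above, while the helper name build_search_url is made up here, and the percent-encoding produced by urllib differs slightly from what the browser shows (for example %28 instead of a literal parenthesis), so whether the site treats both forms identically is an assumption.

```python
from urllib.parse import urlencode, quote

BASE = "https://ieeexplore.ieee.org/search/searchresult.jsp"


def build_search_url(query_text, page, year_range="2020_2024_Year"):
    """Build a paginated search URL (hypothetical helper, not part of the original script)."""
    params = {
        "action": "search",
        "newsearch": "true",
        "matchBoolean": "true",
        "queryText": query_text,
        "ranges": year_range,
        "highlight": "true",
        "returnFacets": "ALL",
        "returnType": "SEARCH",
        "matchPubs": "true",
        "pageNumber": str(page),  # the only value that changes between pages
    }
    # quote_via=quote encodes spaces as %20 rather than '+'
    return BASE + "?" + urlencode(params, quote_via=quote)


if __name__ == "__main__":
    for page in (1, 2):
        print(build_search_url('("All Metadata":"finance")', page))
```

Building the URL from a dict keeps the long query text in one place and makes the page number an ordinary function argument.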
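Environment setup
We create a virtual environment dedicated to writing the crawler. Crawlers need frequent debugging, and the VS Code debugger supports Python 3.7 and later, so I chose 3.8, which I know well.
conda create -n beautifulsoup_py38 python=3.8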
conda activate beautifulsoup_py38
conda install requests
conda install pandas
conda install bs4
conda install lxml
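conda install selenium
Unlike static pages, many pages today load their content dynamically through AJAX or similar techniques. The most common way of using bs4 then returns only a blob of JS code, which was a major obstacle. Later I found the selenium library: Selenium drives a browser the way a person would, and it is also used for automated testing. Put bluntly, the method is crude: open the page, wait until it finishes loading, then scrape. But at least it can run fully automatically.
For comparison, the usual requests + bs4 approach, which only returns the JS code on this site, looks like this:
import requests
from bs4 import BeautifulSoup
# url and myHttpheader (a dict of request headers) are defined elsewhere
web = requests.get(url, headers=myHttpheader)
web.encoding = 'utf-8'  # important
soup = BeautifulSoup(web.text, 'lxml')
Besides installing the library, selenium also needs a browser driver. I use Chrome; the installed version is shown in the browser settings. Mine is version 118.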
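Driver downloads by Chrome version:
Versions before 114: http://chromedriver.storage.googleapis.com/index.html
Version 116: https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/116.0.5845.96/win64/chromedriver-win64.zip
Versions 117-119: https://googlechromelabs.github.io/chrome-for-testing/
After downloading, unzip the archive and locate the exe file.
Copy the exe to C:\Program Files\Google\Chrome\Application (the folder that contains Chrome.exe).
Copy the exe to C:\Users\pdnbplus\anaconda3\envs\beautifulsoup_py38 (the folder that contains python.exe).
Configure the environment variable: This PC → right-click Properties → Advanced system settings → Environment Variables → User variables → Path → Edit → New, then add C:\Program Files\Google\Chrome\Application\ and remember to confirm.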
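Before launching selenium, it can help to confirm that the driver really is reachable through Path. The following is a minimal sketch using only the standard library; the helper name find_driver is hypothetical, and it assumes the driver executable is named chromedriver (shutil.which also tries chromedriver.exe on Windows through PATHEXT).

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver if it can be found on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    location = find_driver()
    if location:
        print("driver found at", location)
    else:
        print("chromedriver is not on PATH; re-check the environment variable")
```

A new terminal (or a restart of VS Code) may be needed before a changed Path becomes visible to Python.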
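Applying for a Baidu API key
Baidu Translate Open Platform: http://api.fanyi.baidu.com/api/trans/product/index
Baidu Translate verifies a key on every request, so we first have to register an account and obtain one. After registering, open the developer information page to find the appid and key.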
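The complete crawler
Getting article URLs
Click any article on the results page and you can see that its URL is
https://ieeexplore.ieee.org/document/9942340
Then open the browser's developer tools and look at the href of the link: it is /document/9942340/. So we only need to collect every href and prepend the site address to reach each article.
The a tag of an article title has the class fw-bold, so I set up a wait: parsing starts only once an element with the class fw-bold has appeared. Finally, all hrefs go into a list and the browser is closed.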
import pandas as pd
import os
import time
import numpy as np
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait  # wait for the page to load
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

# Browser options
options = webdriver.ChromeOptions()
# options.add_argument('--headless')  # headless mode: do not open a browser window

def get_urls(url):
    # The article list is loaded by JS, so let the browser render it first
    urls = []
    driver = None
    try:
        # Create the browser object
        driver = webdriver.Chrome(options=options)
        driver.implicitly_wait(20)
        # Open the page
        driver.get(url)
        # Waits (up to the implicit wait) for a title link to appear.
        # Selenium 4 syntax; older releases used find_element_by_class_name
        driver.find_element(By.CLASS_NAME, 'fw-bold')
        # Get the dynamically loaded page content
        dynamic_content = driver.page_source
        # Parse the dynamic content with BeautifulSoup
        soup = BeautifulSoup(dynamic_content, 'html.parser')
        links = soup.findAll('a', attrs={'class': 'fw-bold'})
        for x in links:
            urls.append(x['href'][:-1])  # drop the trailing slash
        urls = list(set(urls))
        print(len(urls))
    except Exception as e:
        print(e)
    finally:
        # Close the browser
        if driver:
            driver.quit()
    return urls

# Stage test
if __name__ == '__main__':
    get_urls('https://ieeexplore.ieee.org/search/searchresult.jsp?action=search&newsearch=true&matchBoolean=true&queryText=(%22All%20Metadata%22:%22financial%22%20OR%20%22All%20Metadata%22:%22finance%22%20OR%20%22All%20Metadata%22:%22trade%22%EF%BC%8C%22trading%22%20OR%20%22All%20Metadata%22:%22bank%22%20OR%20%22All%20Metadata%22:%22company%22%20OR%20%22All%20Metadata%22:%22enterprise%22%20OR%20%22All%20Metadata%22:%22management%22%20OR%20%22All%20Metadata%22:%22credit%22%20OR%20%22All%20Metadata%22:%22default%22%20OR%20%22All%20Metadata%22:%22risk%22%20OR%20%22All%20Metadata%22:%22asset%22%20OR%20%22All%20Metadata%22:%22bond%22%20OR%20%22All%20Metadata%22:%22stock%22%20OR%20%22All%20Metadata%22:%22equity%22%20OR%20%22All%20Metadata%22:%22volalitity%22%20OR%20%22All%20Metadata%22:%22futures%22%20OR%20%22All%20Metadata%22:%22share%22%20%22option%22%20OR%20%22All%20Metadata%22:%22return%22%20OR%20%22All%20Metadata%22:%22price%22%20OR%20%22All%20Metadata%22:%22pricing%22%20OR%20%22All%20Metadata%22:%22earning%22%20OR%20%22All%20Metadata%22:%22interest%22%20OR%20%22All%20Metadata%22:%22investment%22%20OR%20%22All%20Metadata%22:%22loan%22%20OR%20%22All%20Metadata%22:%22bankruptcy%22%20OR%20%22All%20Metadata%22:%22arbitrary%22)%20AND%20(%22Publication%20Title%22:IEEE%20Transactions%C2%A0on%C2%A0Knowledge%C2%A0and%20Data%20Engineering)&ranges=2020_2024_Year&highlight=true&returnFacets=ALL&returnType=SEARCH&matchPubs=true&pageNumber=1')
If the output is 25, everything works, because a results page shows 25 articles by default.
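Because every browser run against the real site is slow, the href-collection rule can also be checked offline on a saved fragment of HTML. The sketch below uses only the standard library instead of bs4; the sample markup is invented for illustration and merely imitates the a tags with class fw-bold described above, and the names TitleLinkParser and extract_links are hypothetical.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect /document/... hrefs from <a> tags whose class list contains fw-bold."""

    def __init__(self):
        super().__init__()
        self.links = set()  # a set removes duplicate links automatically

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href") or ""
        if "fw-bold" in classes and href.startswith("/document/"):
            self.links.add(href.rstrip("/"))  # drop the trailing slash


def extract_links(html):
    parser = TitleLinkParser()
    parser.feed(html)
    return sorted(parser.links)


# Invented sample markup, not copied from the real site
sample = """
<a class="fw-bold" href="/document/9942340/">Paper A</a>
<a class="fw-bold" href="/document/9942340/">Paper A again</a>
<a class="other" href="/document/1/">not a title link</a>
<a class="fw-bold" href="/document/10154753/">Paper B</a>
"""

if __name__ == "__main__":
    print(extract_links(sample))
```

Using rstrip("/") instead of slicing off the last character keeps the result correct even if a link happens to have no trailing slash.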
Translation function
The English titles and abstracts have to be translated into Chinese, so we write a translation function first.
import http.client
import hashlib
import urllib.parse
import random
import json

appid = 'xxx'  # fill in your appid
secretKey = 'xxx'  # fill in your secret key
url_baidu = '/api/trans/vip/translate'  # HTTP path of the general translation API

def translateBaidu(text, f='en', t='zh'):
    httpClient = None
    salt = random.randint(32768, 65536)
    # The signature is the MD5 of appid + text + salt + secret key
    sign = appid + text + str(salt) + secretKey
    sign = hashlib.md5(sign.encode()).hexdigest()
    url = url_baidu + '?appid=' + appid + '&q=' + urllib.parse.quote(text) + '&from=' + f + '&to=' + t + \
        '&salt=' + str(salt) + '&sign=' + sign
    try:
        httpClient = http.client.HTTPConnection('api.fanyi.baidu.com')
        httpClient.request('GET', url)
        # response is an HTTPResponse object
        response = httpClient.getresponse()
        result_all = response.read().decode('utf-8')
        data = json.loads(result_all)
        result = str(data['trans_result'][0]['dst'])
        return result
    except Exception as e:
        print(e)
    finally:
        if httpClient:
            httpClient.close()

# Stage test
if __name__ == '__main__':
    print(translateBaidu('i am happy'))
If the output is 我很高兴, everything works.
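The request signing used by the translation function can be examined on its own, without touching the network. The sketch below rebuilds the same query string (MD5 of appid + text + salt + secret key) with urlencode; the helper name build_query and the credentials demo_id and demo_key are placeholders invented for illustration, not real values.

```python
import hashlib
from urllib.parse import urlencode


def build_query(text, appid, secret_key, salt, src="en", dst="zh"):
    """Return the query string for one translation request (hypothetical helper)."""
    raw = appid + text + str(salt) + secret_key
    sign = hashlib.md5(raw.encode("utf-8")).hexdigest()
    return urlencode({
        "appid": appid,
        "q": text,
        "from": src,
        "to": dst,
        "salt": salt,
        "sign": sign,
    })


if __name__ == "__main__":
    # Placeholder credentials; a fixed salt makes the output reproducible
    print(build_query("i am happy", "demo_id", "demo_key", 12345))
```

Note that the signature is computed over the raw text, while only the copy placed in the q parameter is URL-encoded.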
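Getting article titles and abstracts
Following the URL rule found earlier, we parse the tags that hold the title and the abstract to get their text, then translate both. Some entries have no DOI, mostly book-related items, so this part needs some exception handling.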
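def get_info(urls):
    url1 = 'https://ieeexplore.ieee.org' + urls
    # Create the browser object
    driver = webdriver.Chrome(options=options)
    try:
        # Open the page
        driver.get(url1)
        # Explicit wait: block until the given element has loaded
        wait = WebDriverWait(driver, 5)
        wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'document-main')))
        # Read the page source after the wait so the parsed HTML is the loaded page
        content = driver.page_source
        # Parse the content with BeautifulSoup
        soup = BeautifulSoup(content, 'html.parser')
        tle = soup.find('h1', attrs={'class': 'document-title text-2xl-md-lh'}).find('span').text
        title = tle.strip()
        abstract = soup.findAll('div', attrs={'class': 'u-mb-1'})[1].text
        zh_title = translateBaidu(title)
        zh_abstract = translateBaidu(abstract)
        # The keys 标题 and 摘要 hold the Chinese title and abstract
        info = {'title': title, '标题': zh_title, 'abstract': abstract, '摘要': zh_abstract}
        try:
            info['doi'] = soup.find('div', attrs={'class': 'u-pb-1 stats-document-abstract-doi'}).find('a')['href']
        except (AttributeError, TypeError, KeyError):
            pass  # some entries (mostly book-related) have no DOI
        return info
    except Exception as e:
        print(e)
    finally:
        driver.quit()

# Stage test
if __name__ == '__main__':
    url = '/document/10154753'
    get_info(urls=url)
Finally, everything is combined into one script. To limit the damage that unavoidable failures do to the collection, I save the results once per page. The official site really is slow, though: one page takes about 10 minutes on average.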
if __name__ == '__main__':
    # set the work file directory
    path = r'C:/Users/pdnbplus/Documents/python全系列/网络爬虫/爬取IEEE文章/result'
    if not os.path.exists(path):
        print(path)
        os.mkdir(path)
    os.chdir(path)
    # Get the start urls.
    start_urls = 'https://ieeexplore.ieee.org/search/searchresult.jsp?action=search&newsearch=true&matchBoolean=true&queryText=(%22All%20Metadata%22:%22financial%22%20OR%20%22All%20Metadata%22:%22finance%22%20OR%20%22All%20Metadata%22:%22trade%22%EF%BC%8C%22trading%22%20OR%20%22All%20Metadata%22:%22bank%22%20OR%20%22All%20Metadata%22:%22company%22%20OR%20%22All%20Metadata%22:%22enterprise%22%20OR%20%22All%20Metadata%22:%22management%22%20OR%20%22All%20Metadata%22:%22credit%22%20OR%20%22All%20Metadata%22:%22default%22%20OR%20%22All%20Metadata%22:%22risk%22%20OR%20%22All%20Metadata%22:%22asset%22%20OR%20%22All%20Metadata%22:%22bond%22%20OR%20%22All%20Metadata%22:%22stock%22%20OR%20%22All%20Metadata%22:%22equity%22%20OR%20%22All%20Metadata%22:%22volalitity%22%20OR%20%22All%20Metadata%22:%22futures%22%20OR%20%22All%20Metadata%22:%22share%22%20%22option%22%20OR%20%22All%20Metadata%22:%22return%22%20OR%20%22All%20Metadata%22:%22price%22%20OR%20%22All%20Metadata%22:%22pricing%22%20OR%20%22All%20Metadata%22:%22earning%22%20OR%20%22All%20Metadata%22:%22interest%22%20OR%20%22All%20Metadata%22:%22investment%22%20OR%20%22All%20Metadata%22:%22loan%22%20OR%20%22All%20Metadata%22:%22bankruptcy%22%20OR%20%22All%20Metadata%22:%22arbitrary%22)%20AND%20(%22Publication%20Title%22:IEEE%20Transactions%C2%A0on%C2%A0Knowledge%C2%A0and%20Data%20Engineering)&ranges=2020_2024_Year&highlight=true&returnFacets=ALL&returnType=SEARCH&matchPubs=true&pageNumber='
    for i in range(10, 21):  # result pages 10 to 20
        info_all = []
        url_i = start_urls + str(i)
        top_urls = []
        try:
            top_urls = top_urls + get_urls(url_i)
        except Exception as e:
            print(i, e)
        for url in top_urls:
            time.sleep(random.randint(1, 5) / 10)  # short random pause between articles
            info = get_info(url)
            if info and len(info) != 0:
                info_all = info_all + [info]
        # Save once per page
        info_all = pd.DataFrame(info_all)
        info_all.to_csv(path + f'/article{i}.csv')
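Since the script writes one CSV per results page, a last step is usually to merge the article{i}.csv files into a single table. The sketch below does this with the standard csv module; the function name merge_csv and the output name all_articles.csv are made up for illustration. It assumes the per-page files were written by pandas with its default settings, which is why it drops the unnamed index column, and it leaves the doi cell empty for entries that have none.

```python
import csv
import glob
import os


def merge_csv(folder, out_name="all_articles.csv"):
    """Merge every article*.csv in folder into one file; return the row count."""
    rows, fields = [], []
    for name in sorted(glob.glob(os.path.join(folder, "article*.csv"))):
        with open(name, newline="", encoding="utf-8") as fh:
            for row in csv.DictReader(fh):
                row.pop("", None)  # unnamed index column written by pandas
                rows.append(row)
                for key in row:
                    if key not in fields:
                        fields.append(key)  # keep first-seen column order
    out_path = os.path.join(folder, out_name)
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


if __name__ == "__main__":
    print(merge_csv("."), "rows merged")
```

The output name deliberately does not start with article, so a second run does not read the merged file back in as input.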