1. Requests Example [JD API]
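To scrape a web page, a crawler first has to fetch the page content. This is done with the requests library.

Installation:

```
pip install requests
```

Example code:

```python
import requests

url = "http://store.weigou365.cn"
res = requests.get(url)
# res.text holds the response body decoded as text
print(res.text)
```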
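The bare `requests.get(url)` call above has no timeout and does not check the HTTP status. The following is a minimal sketch of a slightly more defensive fetch, not the author's code: the function name `fetch_html` and its `get` and `timeout` parameters are assumptions made for illustration. It uses `raise_for_status()` to fail on 4xx/5xx responses and falls back to `apparent_encoding` when the server does not declare a charset, which is a common cause of garbled Chinese text.

```python
def fetch_html(url, get=None, timeout=10):
    """Fetch a page and return its HTML as text (illustrative helper)."""
    if get is None:
        # Imported lazily so the helper can also be tried offline with a stub.
        import requests
        get = requests.get
    res = get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page
    res.raise_for_status()
    # requests falls back to ISO-8859-1 for text responses without a charset;
    # in that case use the encoding guessed from the response body instead
    if res.encoding is None or res.encoding.lower() == "iso-8859-1":
        res.encoding = res.apparent_encoding
    return res.text
```

Calling `fetch_html("http://store.weigou365.cn")` would return the same kind of text as `res.text` in the example, but it raises an exception on an error status or when the server does not answer within the timeout.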
Running the `requests.get` example gives the following result: [screenshot of the output]

2. The Selenium Library
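Sometimes a crawler has to simulate browser behavior. On JD and Taobao product detail pages, for example, images are loaded lazily as the user scrolls. In such cases we have to drive a real browser to obtain the data we want to scrape.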
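Selenium is an open-source framework for automating browser operations. It is used mainly for web testing and supports multiple browsers, including Chrome, Firefox, and Safari. It provides a set of APIs that let you simulate what a user does in the browser, such as clicking buttons, filling in forms, and navigating between pages.

ChromeDriver downloads:
Official site: https://sites.google.com/a/chromium.org/chromedriver
Versions before 114: http://chromedriver.storage.googleapis.com/index.html
Version 116: https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/116.0.5845.96/win64/chromedriver-win64.zip
Version 117 and later: https://googlechromelabs.github.io/chrome-for-testing/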
Installation:

```
pip install selenium
```
Example code:

```python
from selenium import webdriver

browser = webdriver.Chrome()
browser.get("https://baidu.com/")
print(browser.title)
browser.quit()
```

3. Code for Scraping a JD Product Detail Page

```python
from selenium import webdriver
from lxml import etree
import time
import openpyxl
import re
import os
import requests

headers = {
    "content-type": "application/json",
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0",
}


def exchange_url(small, big, flag=0):
    # Take the host from the thumbnail URL, then append the size directory
    # (n0 or n1) and the image path from data-url
    lists = small[0].strip("/").split("/")
    return lists[0] + "/n" + str(flag) + "/" + big[0]


def get_image_path(model=""):
    # Images go into ./imgs/<YYYYmmddHHMM>/<model>
    path = "./imgs/" + time.strftime("%Y%m%d%H%M", time.localtime()) + "/"
    if model != "":
        path += model
    os.makedirs(path, exist_ok=True)
    return path


def download_img(title, url, headers, model=""):
    img_data = requests.get(url, headers=headers).content
    filename = url.strip("/").split("/").pop()
    if model != "":
        filename = model + "_" + filename
    img_path = os.path.join(get_image_path(model), filename)
    with open(img_path, "wb") as f:
        f.write(img_data)


def get_source(driver, url):
    # Send the request
    driver.get(url)
    time.sleep(1)
    # After a one-second pause, scroll to the bottom so that the lazily
    # loaded content is present in the returned page source
    driver.execute_script("window.scrollTo(0,document.body.scrollHeight);")
    time.sleep(2)
    # Return the page source for parsing
    source = driver.page_source
    return source
```
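`get_source` scrolls to the bottom once and then waits two seconds. On long pages whose images load in stages, a single scroll may not be enough. The following is a minimal sketch of a variant that keeps scrolling until the page height stops growing. The name `scroll_until_stable` and the `pause` and `max_rounds` parameters are assumptions made for illustration and are not part of the original script. It relies only on `driver.execute_script`, which the original already uses.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=20):
    """Scroll to the bottom repeatedly until the page height stops changing.

    Returns the final document height in pixels.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        # Jump to the current bottom of the page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Give lazily loaded content time to arrive
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Nothing new was appended, so the page is assumed fully loaded
            break
        last_height = new_height
    return last_height
```

Inside `get_source`, a call such as `scroll_until_stable(driver)` could take the place of the single `execute_script` call and the fixed two-second sleep. The `max_rounds` cap keeps pages with endless scrolling from looping forever.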
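The script continues with the helpers that record the title and parse the page:

```python
def writeExcel(title):
    # Open records.xlsx, or start a new workbook on the first run
    if os.path.exists("records.xlsx"):
        wb = openpyxl.load_workbook("records.xlsx")
    else:
        wb = openpyxl.Workbook()
    ws = wb.active
    path = get_image_path()
    path = os.path.abspath(".") + path.strip(".")
    ws.append([title, path])
    wb.save("records.xlsx")


def get_page_title(html):
    db_title = html.xpath('//*[@class="itemInfo-wrap"]/div[@class="sku-name"]/text()')
    # Use the only text node if there is exactly one, otherwise the second
    text = db_title[0] if len(db_title) == 1 else db_title[1]
    # Assumption: the character removed by the second replace is a double quote
    return text.replace("\n", "").replace("\"", "").replace(" ", "")


def get_page_logos(html):
    db_logo_items = html.xpath('//*[@id="spec-list"]/ul[@class="lh"]/li')
    # Two separate lists, so that medium and large URLs are not mixed
    bigs = []
    mids = []
    for db_logo_item in db_logo_items:
        db_logo_small = db_logo_item.xpath("img/@src")
        db_logo_big = db_logo_item.xpath("img/@data-url")
        bigs.append(exchange_url(db_logo_small, db_logo_big))
        mids.append(exchange_url(db_logo_small, db_logo_big, 1))
    return [mids, bigs]
```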
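`exchange_url` indexes `small[0]` and `big[0]` directly, so a list item without an `img` child or without a `data-url` attribute raises an `IndexError`. The following is a small worked sketch of the same URL construction with a guard for missing attributes. The name `safe_exchange_url` and the host and path in the usage lines are hypothetical values chosen for illustration.

```python
def safe_exchange_url(small, big, flag=0):
    """Build host + /n<flag>/ + image path, or return None if data is missing.

    `small` and `big` are lists as returned by xpath("img/@src") and
    xpath("img/@data-url"); flag 0 and 1 select the n0 and n1 directories.
    """
    if not small or not big:
        return None  # this <li> carries no usable image attributes
    # "//host/n5/path.jpg" -> "host/n5/path.jpg" -> "host"
    host = small[0].strip("/").split("/")[0]
    return host + "/n" + str(flag) + "/" + big[0]


# Hypothetical attribute values, shaped like protocol-relative image URLs
small = ["//img.example.com/n5/jfs/t1/demo.jpg"]
big = ["jfs/t1/demo.jpg"]
print(safe_exchange_url(small, big))      # img.example.com/n0/jfs/t1/demo.jpg
print(safe_exchange_url(small, big, 1))   # img.example.com/n1/jfs/t1/demo.jpg
print(safe_exchange_url([], big))         # None
```

If this guard were used in `get_page_logos`, the `None` results would have to be filtered out before the URLs are passed to `download_img`.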
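Finally, the detail-image lookup, the main routine, and the entry point:

```python
def get_page_content(html):
    # Attribute name kept as in the original; img tags usually carry src
    images = html.xpath('//div[@id="J-detail-content"]/p/img/@href')
    # pattern = re.compile(r'background-image:url\(([^)]*)', re.S)
    return images


def process(url):
    # Create the driver before try, so that finally never sees an unbound name
    driver = webdriver.Chrome()
    try:
        driver.implicitly_wait(10)
        content = get_source(driver, url)
        html = etree.HTML(content)
        title = get_page_title(html)
        logos = get_page_logos(html)
        images = get_page_content(html)
        print(title, logos, images)
        # Record the title and the image directory
        writeExcel(title)
        print("write title done!")
        # Download the medium images
        for mid_url in logos[0]:
            img_url = "http://" + mid_url.replace("http", "").replace(":", "").replace("//", "")
            download_img(title, img_url, headers, model="mid")
        print("download mid logos done!")
        # Download the large images
        for big_url in logos[1]:
            img_url = "http://" + big_url.replace("http", "").replace(":", "").replace("//", "")
            download_img(title, img_url, headers, model="big")
        print("download big logos done!")
        # Download the images from the detail section
        for img_url in images:
            img_url = "http://" + img_url.replace("http", "").replace(":", "").replace("//", "")
            download_img(title, img_url, headers, model="imgs")
        print("download content images done!")
    finally:
        # quit() closes the window and also ends the driver process
        driver.quit()


if __name__ == "__main__":
    while True:
        url = input("JD product detail page URL (enter quit to exit): ")
        if url == "quit":
            break
        process(url)
```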
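`process` turns each image address into a full URL with three chained `replace` calls. That works for scheme-less and protocol-relative addresses, but an address that already starts with `https://` comes out with a stray `s` in front of the host. The following is a minimal sketch of an alternative that checks prefixes instead. The names `normalize_image_url` and `legacy_chain` and the example host are assumptions made for illustration.

```python
def legacy_chain(url):
    """The replace chain as used in process(), kept here for comparison."""
    return "http://" + url.replace("http", "").replace(":", "").replace("//", "")


def normalize_image_url(url, scheme="http"):
    """Return an absolute URL for a scheme-less or protocol-relative address."""
    url = url.strip()
    if url.startswith(("http://", "https://")):
        return url                      # already absolute, leave untouched
    if url.startswith("//"):
        return scheme + ":" + url       # protocol-relative: //host/path
    return scheme + "://" + url.lstrip("/")  # bare host/path


print(legacy_chain("https://img.example.com/a.jpg"))         # http://simg.example.com/a.jpg
print(normalize_image_url("https://img.example.com/a.jpg"))  # https://img.example.com/a.jpg
print(normalize_image_url("//img.example.com/a.jpg"))        # http://img.example.com/a.jpg
print(normalize_image_url("img.example.com/n1/a.jpg"))       # http://img.example.com/n1/a.jpg
```

For the addresses produced by `exchange_url`, which have no scheme, both versions give the same result. The difference only shows up when a page supplies absolute `https` addresses.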
Save the code above as a .py file and run it with the following command:

```
python scrawler.py
```
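Execution result: [screenshot of the console output]
Downloaded images: [screenshot of the downloaded image files]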