
news 2025/11/14 23:17:01

Implemented: 2021-05-30
Difficulty: ★★★☆☆☆
Goal: collect all comments from the Facebook Comments plugin (the embedded comment widget).
Complete code: https://github.com/TRHX/Python3-Spider-Practice/tree/master/CommentPlugin/facebook-comments
Other crawler practice code (continuously updated): https://github.com/TRHX/Python3-Spider-Practice
Crawler practice column (continuously updated): https://itrhx.blog.csdn.net/article/category/9351278

Contents
【1x00】Preface
【2x00】Logic analysis
【2x01】First page
【2x02】Next page
【2x03】Replies to other comments
【3x00】Complete code
【4x00】Data screenshots

【1x00】Preface

The collection code in this article works for comments posted through the Facebook Comments plugin. It is intended only for exchanging Python programming techniques.

Facebook Comments plugin official site: https://developers.facebook.com/products/social-plugins/comments

This article uses https://www.chinatimes.com/realtimenews/20210529003827-260407 as the example page.

【2x00】Logic analysis

【2x01】First page

Right-click inside the Facebook Comments plugin area of the page and choose "View frame source". This shows the source of the first page of comments, and visiting that URL directly displays the comment data. The URL of this frame is:

https://www.facebook.com/plugins/feedback.php?app_id=1379575469016080&channel=https%3A%2F%2Fstaticxx.facebook.com%2Fx%2Fconnect%2Fxd_arbiter%2F%3Fversion%3D46%23cb%3Df22d8c81d4ce144%26domain%3Dwww.chinatimes.com%26origin%3Dhttps%253A%252F%252Fwww.chinatimes.com%252Ff5f738a4fa595%26relation%3Dparent.parent&container_width=924&height=100&href=https%3A%2F%2Fwww.chinatimes.com%2Frealtimenews%2F20210529003827-260407&locale=zh_TW&numposts=5&order_by=reverse_time&sdk=joey&version=v3.2&width

Formatted, it breaks down into the following parameters:

https://www.facebook.com/plugins/feedback.php?
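app_id: 1379575469016080
channel: https://staticxx.facebook.com/x/connect/xd_arbiter/?version=46#cb=f22d8c81d4ce144
domain: www.chinatimes.com
origin: https%3A%2F%2Fwww.chinatimes.com%2Ff5f738a4fa595
relation: parent.parent
container_width: 924
height: 100
href: https://www.chinatimes.com/realtimenews/20210529003827-260407
locale: zh_TW
numposts: 5
order_by: reverse_time
sdk: joey
version: v3.2
width

Of these parameters, app_id has to be fetched from the page, domain is the site's domain name, and href is the URL of the page. Testing shows that the remaining parameters have no effect on the result and can be copied as they are. Searching the original page for the app_id value shows that it sits in a meta tag, so an XPath match is enough. Note: in testing, some pages that use this plugin have no app_id at all, and the comments can still be retrieved without it, so the lookup needs error handling.

    try:
        app_id = content.xpath('//meta[@property="fb:app_id"]/@content')[0]
    except IndexError:
        pass

For the comments on the first page, search the response for the Unicode escape sequence of a comment's text to locate the matching content, then extract the segment that contains the comment data.

【2x02】Next page

Clicking "Load more comments" (載入其他留言) triggers a new request similar to https://www.facebook.com/plugins/comments/async/4045370158886862/pager/reverse_time/, sent with the POST method. The number after async in the URL is the targetID, which can be taken from the data returned by the first-page request.

The form data looks like this:

app_id: 1379575469016080
after_cursor: AQHReYdcksX9wFZEKA3MgNmN8PCRr7N3tFfZZuIKpCKnIuv-SxCycw4uZ1LqhtMr7RVkGyqACNdpkd9uJJ1jk6ne9g
limit: 10
__user: 0
__a: 1
__dyn: 7xe6EgU4e1QyUbFp62-m1FwAxu13wKxK7Emy8W3q322aewTwl8eU4K3a3W1DwUx60Vo1upE4W0LEK1pwo8swaq1xwEwhU1382gKi8wnU1e42C0BE1co3rw9O0RE5a1qw8W0b1w
__csr:
__req: 1
__hs: 18777.PHASED:plugin_feedback_pkg.2.0.0.0
dpr: 1
__ccg: EXCELLENT
__rev: 1003879025
__s: ::lw3b8e
__hsi: 6968076253228168178
__comet_req: 0
locale: zh_TW
lsd: AVp5kXcGShk
jazoest: 2975
__sp: 1

app_id is the same as before. The after_cursor value can be found by searching the previous page's comment data. In other words, each page of data contains an after_cursor value, and that value is the form-data parameter for the next page's request. Testing shows that the values of the other parameters do not affect the final result.

【2x03】Replies to other comments

Replies come in two kinds: those that are visible right away, and those that only appear after clicking "More replies". The first kind can be read directly from the data. The second kind requires a new request, whose URL is similar to https://www.facebook.com/plugins/comments/async/comment/4045370158886862_4046939882063223/pager/

The request method is the same as for the next-page request. The number after comment in the URL is again a targetID (the complete code fills it with the ID of the comment being replied to), and the after_cursor parameter in the form data can be found in the data of the parent comment.

【3x00】Complete code

Complete code on GitHub (a star brings good luck): https://github.com/TRHX/Python3-Spider-Practice/tree/master/CommentPlugin/facebook-comments

    # ====================================
    # --*-- coding: utf-8 --*--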
    # @Time    : 2021-05-30
    # @Author  : TRHX • 鲍勃
    # @Blog    : www.itrhx.com
    # @CSDN    : itrhx.blog.csdn.net
    # @FileName: facebook.py
    # @Software: PyCharm
    # ====================================

    import requests
    import json
    import time
    from lxml import etree


    # ==============================  Test links  ============================== #
    # https://www.chinatimes.com/realtimenews/20210529003827-260407
    # https://tw.appledaily.com/life/20210530/IETG7L3VMBA57OD45OC5KFTCPQ/
    # https://www.nownews.com/news/5281470
    # https://www.thejakartapost.com/life/2019/06/03/how-to-lose-belly-fat-in-seven-days.html
    # https://mcnews.cc/p/25224
    # https://news.ltn.com.tw/news/world/breakingnews/3550262
    # https://www.npf.org.tw/1/15857
    # https://news.pts.org.tw/article/528425
    # https://news.tvbs.com.tw/life/1518745
    # ==============================  Test links  ============================== #

    PAGE_URL = 'https://www.chinatimes.com/realtimenews/20210529003827-260407'
    PROXIES = {'http': 'http://127.0.0.1:10809', 'https': 'http://127.0.0.1:10809'}
    # PROXIES = None  # set to None if no proxy is needed


    class FacebookComment:
        def __init__(self):
            self.json_name = 'facebook_comments.json'
            self.domain = PAGE_URL.split('/')[2]
            self.iframe_referer = 'https://{}/'.format(self.domain)
            self.user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'
            self.channel_base_url = 'https%3A%2F%2Fstaticxx.facebook.com%2Fx%2Fconnect%2Fxd_arbiter%2F%3Fversion%3D46%23cb%3Df17861604189654%26domain%3D{}%26origin%3Dhttps%253A%252F%252F{}%252Ff9bd3e89788d7%26relation%3Dparent.parent'
            self.referer_base_url = 'https://www.facebook.com/plugins/feedback.php?app_id={}&channel={}&container_width=924&height=100&href={}&locale=zh_TW&numposts=5&order_by=reverse_time&sdk=joey&version=v3.2&width'
            self.comment_base_url = 'https://www.facebook.com/plugins/comments/async/{}/pager/reverse_time/'
            self.reply_base_url = 'https://www.facebook.com/plugins/comments/async/comment/{}/pager/'
            self.target_id = ''
            self.referer = ''
            self.app_id = ''

        @staticmethod
        def find_value(html: str, key: str, num_chars: int, separator: str) -> str:
            # Skip the key plus num_chars characters (e.g. the '":"' after a JSON key),
            # then read up to the next separator.
            pos_begin = html.find(key) + len(key) + num_chars
            pos_end = html.find(separator, pos_begin)
            return html[pos_begin: pos_end]

        @staticmethod
        def save_comment(filename: str, information: str) -> None:
            # Append one JSON string per line.
            with open(filename, 'a', encoding='utf-8') as f:
                f.write(information + '\n')

        def get_app_id(self) -> None:
            headers = {'user-agent': self.user_agent}
            response = requests.get(url=PAGE_URL, headers=headers, proxies=PROXIES)
            html = response.text
            content = etree.HTML(html)
            try:
                app_id = content.xpath('//meta[@property="fb:app_id"]/@content')[0]
                self.app_id = app_id
            except IndexError:
                # Some pages have no app_id; the comments can still be fetched.
                pass

        def get_first_parameter(self) -> str:
            channel_url = self.channel_base_url.format(self.domain, self.domain)
            referer_url = self.referer_base_url.format(self.app_id, channel_url, PAGE_URL)
            headers = {
                'authority': 'www.facebook.com',
                'upgrade-insecure-requests': '1',
                'user-agent': self.user_agent,
                'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
                'sec-fetch-site': 'cross-site',
                'sec-fetch-mode': 'navigate',
                'sec-fetch-dest': 'iframe',
                'referer': self.iframe_referer,
                'accept-language': 'zh-CN,zh;q=0.9'
            }
            response = requests.get(url=referer_url, headers=headers, proxies=PROXIES)
            data = response.text
            after_cursor = self.find_value(data, 'afterCursor', 3, separator='"')
            target_id = self.find_value(data, 'targetID', 3, separator='"')
            # rev = self.find_value(data, 'consistency', 9, separator='}')

            # Extract and save the initial (first-page) comments
            tree = etree.HTML(data)
            script = tree.xpath('//body/script[last()]/text()')[0]
            html_begin = script.find('"comments":') + len('"comments":')
            html_end = script.find('"meta"')
            result = script[html_begin:html_end].strip()
            result_dict = json.loads(result[:-1])  # drop the trailing comma
            comment_type = 'first'
            self.processing_comment(result_dict, comment_type)

            self.target_id = target_id
            self.referer = referer_url
            return after_cursor

        def get_comment(self, after_cursor: str, comment_url: str) -> None:
            """
            :param after_cursor: str, cursor of the next page
            :param comment_url: str, URL of the comment page
            :return: None
            """
            num = 1
            while after_cursor:
                post_data = {
                    'app_id': self.app_id,
                    'after_cursor': after_cursor,
                    'limit': 10,
                    'iframe_referer': self.iframe_referer,
                    '__user': 0,
                    '__a': 1,
                    '__dyn': '7xe6EgU4e3W3mbG2KmhwRwqo98nwgUbErxW5EyewSwMwyzEdU5i3K1bwOw-wpUe8hwem0nCq1ewbWbwmo62782CwOwKwEwhU1382gKi8wl8G0jx0Fw9q0B82swdK0D83mwkE5G0zE16o',
                    '__csr': '',
                    '__req': num,
                    '__beoa': 0,
                    '__pc': 'PHASED:plugin_feedback_pkg',
                    'dpr': 1,
                    '__ccg': 'GOOD',
                    # '__rev': rev,
                    # '__s': ':mfgzaz:f4if6y',
                    # '__hsi': '6899699958141806572',
                    '__comet_req': 0,
                    'locale': 'zh_TW',
                    # 'jazoest': '22012',
                    '__sp': 1
                }
                headers = {
                    'user-agent': self.user_agent,
                    'content-type': 'application/x-www-form-urlencoded',
                    'accept': '*/*',
                    'origin': 'https://www.facebook.com',
                    'sec-fetch-site': 'same-origin',
                    'sec-fetch-mode': 'cors',
                    'sec-fetch-dest': 'empty',
                    'referer': self.referer,
                    'accept-language': 'zh-CN,zh;q=0.9'
                }
                response = requests.post(url=comment_url, headers=headers, proxies=PROXIES, data=post_data)
                data = response.text
                # Drop a leading XML declaration line if present
                if 'xml version' in data:
                    html_data = data.split('\n', 1)[1]
                else:
                    html_data = data
                if 'for (;;);' in html_data:
                    json_text = html_data.split('for (;;);')[1]
                    json_dict = json.loads(json_text)
                    # print(json_dict)
                    comment_type = 'second'
                    self.processing_comment(json_dict, comment_type)
                    try:
                        after_cursor = json_dict['payload']['afterCursor']
                    except KeyError:
                        after_cursor = False
                    # try:
                    #     rev = json_dict['hsrp']['hblp']['consistency']['rev']
                    # except KeyError:
                    #     rev = ''
                else:
                    after_cursor = False
                num += 1

        def processing_comment(self, comment_dict: dict, comment_type: str) -> None:
            """
            :param comment_dict: dict, all comment data; different pages may return different structures
            :param comment_type: str, marks first-page vs. later-page comments
            :return: None
            """
            # Later pages wrap the data in 'payload'; the first page does not.
            comment_dict = comment_dict.get('payload', comment_dict)

            # 'first' means first-page comments: store them all.
            # Otherwise drop the first ID, which repeats the previous page.
            if comment_type == 'first':
                comment_ids = comment_dict['commentIDs']
            else:
                comment_ids = comment_dict['commentIDs'][1:]

            # First pass: store all top-level comments
            self.extract_comment(comment_dict, comment_ids)

        def extract_comment(self, comment_dict: dict, comment_ids: list) -> None:
            """
            :param comment_dict: dict, all comment data
            :param comment_ids: list, IDs of all comments
            :return: None
            """
            for comment_id in comment_ids:
                # ============== info ============== #
                crawl_timestamp = int(time.time())
                crawl_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())

                # ============== comment ============== #
                comment = comment_dict['idMap'][comment_id]
                target_id = comment['targetID']
                created_timestamp = comment['timestamp']['time']
                created_time_text = comment['timestamp']['text']
                created_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(float(created_timestamp)))
                comment_type = comment['type']
                ranges = comment['ranges']
                like_count = comment['likeCount']
                has_liked = comment['hasLiked']
                can_like = comment['canLike']
                can_edit = comment['canEdit']
                hidden = comment['hidden']
                high_lighted_words = comment['highlightedWords']
                spam_count = comment['spamCount']
                can_embed = comment['canEmbed']
                try:
                    reply_count = comment['public_replies']['totalCount']
                except KeyError:
                    reply_count = 0
                report_uri = 'https://www.facebook.com' + comment['reportURI']
                content = comment['body']['text']

                # ============== author ============== #
                author_id = comment['authorID']
                author = comment_dict['idMap'][author_id]
                author_name = author['name']
                thumb_src = author['thumbSrc']
                uri = author['uri']
                is_verified = author['isVerified']
                author_type = author['type']

                comment_result_dict = {
                    'info': {
                        'pageURL': PAGE_URL,                    # original page URL
                        'crawlTimestamp': crawl_timestamp,      # crawl timestamp
                        'crawlTime': crawl_time                 # crawl time
                    },
                    'comment': {
                        'type': comment_type,                   # type
                        'commentID': comment_id,                # comment ID
                        'targetID': target_id,                  # target ID; for a reply to A, this is A's comment ID
                        'createdTimestamp': created_timestamp,  # comment timestamp
                        'createdTime': created_time,            # comment time
                        'createdTimeText': created_time_text,   # comment time as text (year, month, day)
                        'likeCount': like_count,                # number of likes on this comment
                        'replyCount': reply_count,              # number of replies under this comment
                        'spamCount': spam_count,                # times this comment was marked as spam
                        'hasLiked': has_liked,                  # whether you have liked this comment
                        'canLike': can_like,                    # whether this comment can be liked
                        'canEdit': can_edit,                    # whether this comment can be edited
                        'hidden': hidden,                       # whether this comment is hidden
                        'canEmbed': can_embed,                  # whether this comment can be embedded elsewhere
                        'ranges': ranges,                       # meaning unknown
                        'highLightedWords': high_lighted_words, # highlighted words in this comment
                        'reportURI': report_uri,                # link for reporting this comment
                        'content': content,                     # text of this comment
                    },
                    'author': {
                        'type': author_type,                    # type
                        'authorID': author_id,                  # ID of the comment's author
                        'authorName': author_name,              # user name of the comment's author
                        'isVerified': is_verified,              # whether the author is verified
                        'uri': uri,                             # the author's Facebook profile
                        'thumbSrc': thumb_src                   # link to the author's avatar
                    }
                }

                print(comment_result_dict)
                self.save_comment(self.json_name, json.dumps(comment_result_dict, ensure_ascii=False))

                # Second pass: store replies that are visible without clicking "More replies".
                # Criterion: 'commentIDs' exists under 'public_replies'.
                try:
                    reply_ids = comment['public_replies']['commentIDs']
                    self.extract_comment(comment_dict, reply_ids)
                except KeyError:
                    pass

                # Third pass: store replies that only appear after clicking "More replies".
                # Criterion: 'afterCursor' exists under 'public_replies'.
                try:
                    reply_after_cursor = comment['public_replies']['afterCursor']
                    reply_url = self.reply_base_url.format(comment_id)
                    self.get_comment(reply_after_cursor, reply_url)
                except KeyError:
                    pass

        def run(self) -> None:
            self.get_app_id()
            after_cursor = self.get_first_parameter()
            # A short value means no usable cursor was found, so there is no next page.
            if len(after_cursor) < 20:
                print('\n{} comments collected.'.format(PAGE_URL))
            else:
                comment_url = self.comment_base_url.format(self.target_id)
                self.get_comment(after_cursor, comment_url)
                print('\n{} comments collected.'.format(PAGE_URL))


    if __name__ == '__main__':
        FC = FacebookComment()
        FC.run()

【4x00】Data screenshots
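
The script's save_comment method appends one JSON object per line to facebook_comments.json, each with info, comment, and author keys, so the collected data can be read back line by line. The following is only a minimal sketch of that; load_comments is a hypothetical helper name and is not part of the original script.

```python
import json


def load_comments(path: str) -> list:
    """Read a file with one JSON object per line into a list of dicts.

    Hypothetical helper for inspecting the crawler's output; not part of
    the original script.
    """
    records = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

Calling load_comments('facebook_comments.json') after a run should then give a list in which, for example, record['comment']['content'] holds the comment text and record['author']['authorName'] the author's name.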

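Returning to the first-page request in 【2x01】: the frame URL is feedback.php followed by a query string, so it can also be assembled with the standard library instead of a hand-written template. This is a sketch under assumptions, not the author's method: build_feedback_url is a hypothetical helper, the fixed values are copied from the example URL, and the channel argument is expected in its decoded form because urlencode applies the percent-encoding itself.

```python
from urllib.parse import urlencode

FEEDBACK_URL = 'https://www.facebook.com/plugins/feedback.php'


def build_feedback_url(app_id: str, page_url: str, channel: str) -> str:
    """Assemble the first-page comments URL (hypothetical helper).

    app_id may be an empty string for pages that do not expose one.
    """
    params = {
        'app_id': app_id,
        'channel': channel,         # decoded form; encoded below
        'container_width': 924,     # fixed values copied from the example URL
        'height': 100,
        'href': page_url,
        'locale': 'zh_TW',
        'numposts': 5,
        'order_by': 'reverse_time',
        'sdk': 'joey',
        'version': 'v3.2',
    }
    # urlencode percent-encodes every value and joins the pairs with '&'
    return FEEDBACK_URL + '?' + urlencode(params)
```

Only app_id and href change from page to page here; the example URL also ends with an empty width parameter, which this sketch leaves out on the untested assumption that it is not required.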

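The pagination step from 【2x02】 can also be tried offline. In the script, get_comment drops an optional XML declaration line, strips the `for (;;);` prefix that precedes the JSON body, and then reads payload.afterCursor for the next request. The sketch below isolates that clean-up so it can be run against a saved response; parse_async_response and next_cursor are hypothetical names, not functions from the original code.

```python
import json
from typing import Optional

PREFIX = 'for (;;);'


def parse_async_response(text: str) -> Optional[dict]:
    """Decode a pager response, or return None if the prefix is missing.

    Hypothetical helper mirroring the clean-up done in get_comment.
    """
    # Some responses begin with an XML declaration line; drop it first.
    if 'xml version' in text:
        text = text.partition('\n')[2]
    if PREFIX not in text:
        return None
    return json.loads(text.split(PREFIX, 1)[1])


def next_cursor(response: dict) -> Optional[str]:
    """Return the cursor for the next page, or None on the last page."""
    return response.get('payload', {}).get('afterCursor')
```

A loop would keep posting with the returned cursor until next_cursor gives None, which corresponds to the point where the original code sets after_cursor to False.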