Posts tagged 爬虫 (web crawler) - MomokoDo

Momoko

2021-05-20
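NGA Forum IP Data Crawler and Analysis

Preface

The NGA forum has just enabled display of each user's IP location. I had long wanted to see what the "elite" of the forum (nicknamed 泥潭) are really made of, so I spent several hours overnight writing an IP crawler to find out who is active on the 大漩涡 board.

Crawler

Packages and headers

First, configure the headers:

    import requests as req
    from lxml import etree
    import numpy as np
    import time
    import re

    headers = {
        # Copy these values from the Network tab of the browser's developer tools
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.62',
        'Cookie': '',
        'Connection': 'close',
    }
    # API documentation: https://github.com/wolfcon/NGA-API-Documents

Board pages

Next, crawl the first few pages of the 网事杂谈 board to collect the link to each thread (the API parameters are described in the documentation above). Use F12 to locate the element that holds the links. The XPath below may be inaccurate and may need adjusting.

    urls = []   # thread ids (tid) collected from the board pages
    limit = 5   # number of board pages to crawl; do not set this too high
    for i in range(1, limit + 1):
        # Threads on the first `limit` pages of 网事杂谈, ordered by latest reply
        time.sleep(1)
        mainPage = req.get('https://bbs.nga.cn/thread.php?fid=-7&order_by=lastpostdesc&page=' + str(i), headers=headers, verify=False)
        doc = etree.HTML(mainPage.text)
        pages_url = doc.xpath('//td[@class="c1"]/a')  # locate the thread links
        for pg in pages_url:
            r = re.search(r'[0-9]+', pg.attrib['href']).group()  # thread id
            urls.append(r)
            print('no.' + str(i) + ' : ' + str(r))

Then deduplicate the collected threads:

    urls = set(urls)   # deduplicate threads; note that the original order is lost
    urls = list(urls)
    print(len(urls))

Thread pages

Then fetch the first page (the default) of each thread, work out the thread's page count from the response, and collect the user uids on every page. The uids may be deduplicated or left as they are.

    uid = []
    for item in urls:  # collect the uids of users posting in each thread
        time.sleep(1)
        page_url = 'https://bbs.nga.cn/read.php?tid=' + str(item) + '&lite=js'
        # Fetch the first page to read the thread's total row count
        mainPage = req.get(page_url, headers=headers, verify=False)
        txt = str(mainPage.text).replace('window.script_muti_get_var_store=', '')
        Rows = re.findall(r'"__ROWS"\:[0-9]+', txt)
        if not Rows:  # for a small share of threads NGA returns only half of the JS
            continue
        rows = int(Rows[0].replace('"__ROWS":', ''))
        # 20 posts per page; ceiling division avoids an extra page when rows is a multiple of 20
        pageNum = max(1, -(-rows // 20))
        print(str(item) + " pages: " + str(pageNum))
        if pageNum > 100:  # skip threads longer than 100 pages
            continue
        for i in range(1, pageNum + 1):  # collect uids page by page
            u = page_url + '&page=' + str(i)
            mainPage = req.get(u, headers=headers, verify=False)
            txt = str(mainPage.text).replace('window.script_muti_get_var_store=', '')
            tmp = re.findall(r'"uid"\:[0-9]+', txt)
            for t in tmp[1:]:  # the first match is skipped, as in the original flag logic
                uid.append(t.replace('"uid":', ''))

Fetching user IP locations

Look up each user's profile by uid and extract the ipLoc field:

    url = 'https://bbs.nga.cn/nuke.php?lite=js&__lib=ucp&__act=get&uid='  # user profile lookup
    ips = []   # collected IP locations
    nums = 0
    for person in uid:
        time.sleep(0.1)
        person_page_url = url + person
        mainPage = req.get(person_page_url, headers=headers, verify=False)
        txt = str(mainPage.text).replace('window.script_muti_get_var_store=', '')
        tmp = re.findall(r'"ipLoc"\:"[\u4e00-\u9fa5]+', txt)  # regex match on the Chinese location name
        if not tmp:  # NGA sometimes returns only half of the JS
            continue
        tmp = tmp[0].replace('"ipLoc":"', '')
        ips.append(tmp)
        print(nums, tmp)  # print the current position so the run can be resumed after a network failure
        nums = nums + 1

    with open('.\\area.txt', mode='a', encoding='utf-8') as f:  # append the results to a file
        for i in ips:
            f.write(i + '\n')

Result processing

Process the results as needed and plot them.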
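The post does not show the processing step itself. As a minimal sketch, assuming the output file holds one location name per line (as the crawler writes it), the tally could be done with `collections.Counter`. The function name `count_regions` and the sample data are hypothetical and not from the original post.

```python
from collections import Counter


def count_regions(lines):
    """Tally location names, ignoring blank lines and surrounding whitespace."""
    # `lines` can be any iterable of strings, e.g. an open file object.
    return Counter(line.strip() for line in lines if line.strip())


if __name__ == "__main__":
    # Hypothetical sample standing in for the contents of area.txt
    sample = ["上海\n", "北京\n", "上海\n", "\n", "广东\n"]
    for region, n in count_regions(sample).most_common():
        print(region, n)
```

To run it on real data, pass the opened file directly, for example `count_regions(open('area.txt', encoding='utf-8'))`. The resulting counts can then be fed to whatever plotting library is preferred.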
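One detail of the thread-page step is worth isolating: turning the `__ROWS` value into a page count. The sketch below assumes 20 posts per page, as the crawler does, and uses ceiling division. An expression of the form `rows / 20 + 1` would give one page too many whenever the row count is an exact multiple of 20. The helper names `rows_from_response` and `page_count` are hypothetical.

```python
import re

# Matches the total row count in a lite=js response body
ROWS_RE = re.compile(r'"__ROWS":(\d+)')


def rows_from_response(text):
    """Return the __ROWS value as an int, or None if the response is truncated."""
    match = ROWS_RE.search(text)
    return int(match.group(1)) if match else None


def page_count(rows, per_page=20):
    """Number of pages needed for `rows` posts (ceiling division, at least 1)."""
    return max(1, -(-rows // per_page))


if __name__ == "__main__":
    # Hypothetical response fragment, not real NGA output
    body = '{"data":{"__ROWS":45,"__T":{}}}'
    rows = rows_from_response(body)
    if rows is not None:
        print(rows, page_count(rows))
```

Returning `None` for a truncated response keeps the "skip this thread" decision with the caller, mirroring the `continue` in the crawler.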