Preface

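The NGA forum has just started displaying users' IP locations. I had long wanted to find out what NGA's self-styled "elite" is actually made up of, so I spent a few hours overnight writing an IP crawler to see who is active on the 大漩涡 (Maelstrom) board.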

Crawler

Packages and headers

First, import the packages and configure the headers:

import requests as req
from lxml import etree
import numpy as np
import time
import re


headers = {
    # Copy these values from the Network tab of your browser's developer tools
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.62',
    'Cookie': '',  # left blank here; copy your own from the browser
    'Connection': 'close',
}


# API reference: https://github.com/wolfcon/NGA-API-Documents

Board pages

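Next, crawl the first few pages of the 网事杂谈 board (`fid=-7` in the code below) to collect the link to each thread. See the API documentation for the request parameters.

Press F12 to open the browser's developer tools and locate the elements that hold the thread links. The XPath used below may not be accurate, so adjust it yourself as needed.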

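urls = []  # thread IDs (tid) collected from the board pages
limit = 5  # number of board pages to crawl; keep this small

for i in range(1, limit + 1):  # first `limit` pages of the board, ordered by latest reply

    time.sleep(1)

    mainPage = req.get('https://bbs.nga.cn/thread.php?fid=-7&order_by=lastpostdesc&page=' + str(i), headers=headers, verify=False)
    doc = etree.HTML(mainPage.text)
    pages_url = doc.xpath('//td[@class="c1"]/a')  # locate the thread links; adjust if the markup differs

    for pg in pages_url:
        m = re.search(r'[0-9]+', pg.attrib.get('href', ''))
        if m is None:  # skip links that carry no numeric ID
            continue
        r = m.group()  # thread ID (tid)
        urls.append(r)
        print('no.' + str(i) + ' : ' + str(r))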
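The link-extraction step can also be isolated into a small helper that is easy to check offline. The following is a minimal sketch rather than the code used for the crawl: `extract_tid` is a hypothetical name, the sample hrefs are made up, and it assumes thread links carry the ID in a `tid` query parameter, as the `read.php?tid=` URLs in this post do.

```python
from urllib.parse import urlparse, parse_qs


def extract_tid(href):
    """Return the tid query parameter as a string, or None if it is absent."""
    query = parse_qs(urlparse(href).query)
    values = query.get("tid")
    if not values or not values[0].isdigit():
        return None
    return values[0]


# Hypothetical hrefs, for illustration only.
sample_hrefs = [
    "/read.php?tid=12345678",
    "/read.php?tid=87654321&page=2",
    "/thread.php?fid=-7",  # not a thread link, so it is dropped
]
tids = [t for t in map(extract_tid, sample_hrefs) if t is not None]
print(tids)
```

Reading the named parameter instead of taking the first run of digits avoids picking up an unrelated number when a link has other numeric parts.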

Then deduplicate the collected threads:

urls = list(set(urls))  # deduplicate the threads; note that the original order is lost
print(len(urls))
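If the crawl order matters, for example to process the most recently replied threads first, deduplication can keep the first-seen order instead of going through `set`. A small sketch using only the standard library; the IDs are placeholders:

```python
thread_ids = ["101", "102", "101", "103", "102"]  # placeholder IDs

# dict keys are unique and keep insertion order (Python 3.7+),
# so this drops repeats while keeping the first occurrence of each ID.
unique_ids = list(dict.fromkeys(thread_ids))
print(unique_ids)  # ['101', '102', '103']
```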

Thread pages

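Next, fetch the first page of each thread (the default), read the total post count from the response to work out how many pages the thread has, and collect the user uids from every page. Deduplicating the uids is optional.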

uid = []  # uids of the users who posted in the collected threads
for item in urls:

    time.sleep(1)

    page_url = 'https://bbs.nga.cn/read.php?tid=' + str(item) + '&lite=js'
    mainPage = req.get(page_url, headers=headers, verify=False)  # first page, used to get the page count
    txt = str(mainPage.text).replace('window.script_muti_get_var_store=', '')
    Rows = re.findall(r'"__ROWS":([0-9]+)', txt)

    if not Rows:  # NGA returns truncated JS for a small share of threads
        continue

    pageNum = max(1, (int(Rows[0]) + 19) // 20)  # pages in this thread: 20 posts per page, rounded up
    print(str(item) + " pages: " + str(pageNum))

    if pageNum > 100:  # skip threads longer than 100 pages
        continue

    for i in range(1, pageNum + 1):  # collect the user uids from every page
        u = page_url + '&page=' + str(i)
        mainPage = req.get(u, headers=headers, verify=False)
        txt = str(mainPage.text).replace('window.script_muti_get_var_store=', '')
        tmp = re.findall(r'"uid":([0-9]+)', txt)
        uid.extend(tmp[1:])  # the first match on each page is skipped
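The page-count arithmetic and the regex parsing can be checked without touching the network. This is a rough sketch under stated assumptions: the helper names (`page_count`, `parse_rows`, `parse_uids`) are hypothetical, the page size of 20 is taken from the code above, and the sample string is a made-up fragment that only mimics the fields the regexes look for, not a real NGA response.

```python
import re

POSTS_PER_PAGE = 20  # page size assumed by the crawler code


def page_count(rows, per_page=POSTS_PER_PAGE):
    """Ceiling division: number of pages needed to hold `rows` posts."""
    return max(1, (rows + per_page - 1) // per_page)


def parse_rows(text):
    """Return the __ROWS value from a response body, or None if it is missing."""
    m = re.search(r'"__ROWS":(\d+)', text)
    return int(m.group(1)) if m else None


def parse_uids(text):
    """Return every "uid" value in the body, in order of appearance."""
    return re.findall(r'"uid":(\d+)', text)


# Made-up fragment, for illustration only.
sample = ('window.script_muti_get_var_store='
          '{"data":{"__U":{"1":{"uid":1},"2":{"uid":2}},"__ROWS":41}}')
rows = parse_rows(sample)
print(rows, page_count(rows), parse_uids(sample))
```

Plain truncation plus one, `int(rows / 20) + 1`, would report 2 pages for a thread with exactly 20 posts, while ceiling division reports 1, so no request is spent on a page that does not exist.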

Fetching user IP locations

Look up each user's profile by uid and extract the ipLoc field:

url = 'https://bbs.nga.cn/nuke.php?lite=js&__lib=ucp&__act=get&uid='  # user info lookup
ips = []  # collected IP locations
nums = 0
for person in uid:
    time.sleep(0.1)
    person_page_url = url + str(person)
    mainPage = req.get(person_page_url, headers=headers, verify=False)
    txt = str(mainPage.text).replace('window.script_muti_get_var_store=', '')
    tmp = re.findall(r'"ipLoc":"([\u4e00-\u9fa5]+)', txt)  # regex match on the ipLoc field
    if not tmp:  # NGA sometimes returns truncated JS
        continue
    ips.append(tmp[0])
    print(nums, tmp[0])  # print progress so the run can be resumed after a network interruption
    nums = nums + 1

with open('area.txt', mode='a', encoding='utf-8') as f:  # append the results to a file
    for i in ips:
        f.write(i + '\n')
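Since the progress output exists so that a run can be resumed after a network interruption, one option is to write each result to disk as soon as it arrives and skip uids that are already recorded. The sketch below is an assumption-laden illustration, not the code used for this post: `load_done`, `record`, the tab-separated file layout, and the `fake_lookup` data are all hypothetical, and a dictionary stands in for the real HTTP requests.

```python
import os
import tempfile


def load_done(path):
    """Return the set of uids already recorded in the checkpoint file."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return {line.split("\t", 1)[0] for line in f if line.strip()}


def record(path, uid, location):
    """Append one result immediately so an interruption loses nothing."""
    with open(path, mode="a", encoding="utf-8") as f:
        f.write(f"{uid}\t{location}\n")


# Demonstration with a temporary file and stand-in data instead of real requests.
path = os.path.join(tempfile.mkdtemp(), "area.tsv")
fake_lookup = {"1": "北京", "2": "上海", "3": "广东"}  # made-up data

for run in range(2):  # the second pass simulates restarting the script
    done = load_done(path)
    for uid, loc in fake_lookup.items():
        if uid in done:
            continue  # already fetched in an earlier run
        record(path, uid, loc)

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

Skipping recorded uids means each user is counted once, which amounts to the optional uid deduplication mentioned earlier. To weight users by how often they post, keep the uid list as is and use the file only as a cache.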

Processing the results

Process the results as needed and plot them.
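One possible way to do the processing is to count how often each location appears. The sketch below uses only the standard library and prints a text bar chart instead of a real plot. The function names and the sample lines are made up for illustration; in practice the input would be the lines of the saved `area.txt`.

```python
from collections import Counter


def count_regions(lines):
    """Count how often each non-empty location string occurs."""
    return Counter(line.strip() for line in lines if line.strip())


def text_bar_chart(counts, width=20):
    """Render counts as text bars, most frequent first, scaled to `width`."""
    if not counts:
        return []
    top = max(counts.values())
    rows = []
    for region, n in counts.most_common():
        bar = "#" * max(1, round(n / top * width))
        rows.append(f"{region}\t{n}\t{bar}")
    return rows


# Made-up sample standing in for the contents of area.txt.
sample_lines = ["广东\n", "北京\n", "广东\n", "上海\n", "广东\n", "\n"]
counts = count_regions(sample_lines)
print("\n".join(text_bar_chart(counts)))
```

The resulting `Counter` can be passed to any plotting library to draw a bar chart or a map.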