A Python 3 Web Scraping Knowledge Roundup


 2019/12/15 

0X01 Commonly used Python libraries

1.urllib

import urllib.request

# Open the URL, then read and decode the response body
response = urllib.request.urlopen("http://www.baidu.com")
print(response.read().decode("utf-8"))
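As a slightly fuller sketch of the same call, the snippet below builds a request with a query string and a custom User-Agent header, and wraps the read-and-decode step in a helper. The helper names `build_request` and `fetch`, the `/s?wd=...` search path, and the User-Agent value are illustrative assumptions, not something taken from the original post.

```python
import urllib.parse
import urllib.request


def build_request(base_url, params, user_agent="Mozilla/5.0"):
    """Return a Request for base_url with params encoded into the query string."""
    query = urllib.parse.urlencode(params)
    url = f"{base_url}?{query}"
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(request, timeout=10):
    """Send the request and return the response body decoded as UTF-8 text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Hypothetical search URL, used only for illustration.
req = build_request("http://www.baidu.com/s", {"wd": "python"})
# html = fetch(req)  # uncomment to perform the actual network request
```

Keeping request construction separate from the network call makes it easy to inspect the final URL and headers before anything is sent.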

2.re
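
The post lists `re` without an example. Here is a minimal sketch, assuming we only need to pull link targets out of a small HTML fragment. The sample HTML and the pattern are made up for illustration, and a real parser such as lxml is usually more robust for whole pages.

```python
import re

# Sample HTML invented for this example.
html = '<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>'

# Non-greedy capture of whatever sits between href=" and the next double quote.
link_pattern = re.compile(r'href="(.*?)"')

# findall returns the captured group for every match, in document order.
links = link_pattern.findall(html)
print(links)
```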

3.requests

4.selenium

This library works together with a browser driver to scrape dynamically rendered pages.

(1)chromedriver

Before using it, download chromedriver.exe, place it in the same directory as chrome.exe (the default installation path), and then add that directory to PATH.

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.baidu.com")
# page_source holds the HTML after the browser has rendered the page
print(driver.page_source)
driver.quit()

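The only drawback of this approach is that a browser window pops up, which we may not want. We can hide it by running in headless mode: call the add_argument method of an Options object to pass the --headless argument.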

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time

chrome_options = Options()
chrome_options.add_argument("--headless")

base_url = "http://www.baidu.com/"
# Directory where the matching chromedriver is placed
# (Selenium 4 style: the driver path goes through Service, the options through options=)
service = Service(r'C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe')
driver = webdriver.Chrome(service=service, options=chrome_options)

driver.get(base_url)

start_time = time.time()
print('this is start_time ', start_time)

# Type a query into the search box (id "kw") and click the search button (id "su")
driver.find_element(By.ID, "kw").send_keys("selenium webdriver")
driver.find_element(By.ID, "su").click()
driver.save_screenshot('screen.png')

# quit() closes every window and also stops the chromedriver process
driver.quit()

end_time = time.time()
print('this is end_time ', end_time)

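(2)PhantomJS

This is another way to run without a browser window. It is no longer maintained and tends to misbehave in unpredictable ways, but it is still worth introducing. As with chromedriver, first download PhantomJS and then put it on PATH so that it can be invoked later.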

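from selenium import webdriver

# Note the capitalization: the class is PhantomJS, not phantomJS.
# It is only available in Selenium 3.x and earlier; Selenium 4 removed it.
driver = webdriver.PhantomJS()
driver.get("http://www.baidu.com")
print(driver.page_source)
driver.quit()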

5.lxml

This library is installed so that we can use XPath.
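
A minimal sketch of what XPath queries look like with lxml, assuming an in-memory HTML fragment. The sample markup and the `item` class name are invented for illustration.

```python
from lxml import etree

# Sample HTML invented for this example.
html = """
<ul>
  <li class="item"><a href="/a">first</a></li>
  <li class="item"><a href="/b">second</a></li>
</ul>
"""

# etree.HTML parses (possibly malformed) HTML and returns the root element.
root = etree.HTML(html)

# XPath: the text of every <a> under an <li class="item">, and each href attribute.
titles = root.xpath('//li[@class="item"]/a/text()')
hrefs = root.xpath('//li[@class="item"]/a/@href')
print(titles, hrefs)
```

In a real scraper the `html` string would typically be the body returned by urllib or requests, or the `page_source` from selenium.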

6.beautifulsoup

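When installing with pip, note that the package name is beautifulsoup4 (the 4 stands for the fourth major version). The example below uses lxml as its parser, so install lxml first.

from bs4 import BeautifulSoup
soup = BeautifulSoup('<html></html>', 'lxml')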

7.pyquery

Like BeautifulSoup, this is an HTML parsing library, but its syntax is comparatively simpler (it is modeled on jQuery).

from pyquery import PyQuery as pq

page = pq('<html>hello world</html>')
result = page('html').text()
print(result)

8.pymysql

9.pymongo

10.redis

11.flask

12.django

13.jupyter

Author：V1ZkRA

Link：https://yinwc.github.io/2019/12/15/Python3%E7%88%AC%E8%99%AB%E7%9F%A5%E8%AF%86%E6%A2%B3%E7%90%86/

Publish date：December 15th 2019, 4:35:56 pm

Update date：January 28th 2023, 4:17:06 pm

License：This article is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License
