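Goal: use the Google search engine to search for a string and grab the title of each result.
I have been working on string matching lately. Besides trying a few algorithms (which did not work well), I thought of searching Google first and then comparing against the results, and it worked surprisingly well. Here is how I implemented it.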
Approach 1
First install the google search package:
pip install google_search
from googlesearch.googlesearch import GoogleSearch

response = GoogleSearch().search("LOL LMS")
for result in response.results:
    print("Title: " + result.title)
    print("Content: " + result.getText())
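Output: each page's title and body text (the body text is too long to list here).
Pros: simple and convenient
Cons: cannot be customized, and searching is subject to limits
Approach 2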
# -*- coding: utf-8 -*-
import random
import time

import requests
from bs4 import BeautifulSoup


def google_scrape(Search_list):
    title_list = []
    # url='http://www.baidu.com/s?rsv_idx=1&wd='LPL&usm=2&ie=utf-8&sl_lang=en&rsv_srlang=en&rsv_rq=en&rqlang=cn
    url = 'https://www.google.com.tw/search'
    # Pool of User-Agent strings used to disguise the request as a browser.
    user_agents = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20130406 Firefox/23.0',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533+ '
        '(KHTML, like Gecko) Element Browser 5.0',
        'IBM WebExplorer /v0.94',
        'Galaxy/1.0 [en] (Mac OS X 10.5.6; U; en)',
        'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
        'Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14',
        'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) '
        'Version/6.0 Mobile/10A5355d Safari/8536.25',
        'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) '
        'Chrome/28.0.1468.0 Safari/537.36',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; TheWorld)',
    ]
    # requests picks the proxy by URL scheme, so this "http" entry alone does
    # not apply to the https URL above; add an "https" key to proxy it as well.
    proxies = {"http": "http://113.200.214.164:9999"}
    # Pick one User-Agent at random for this run.
    headers = {'User-Agent': random.choice(user_agents)}
    for x in Search_list:
        time.sleep(1)  # pause between requests
        # Passing the query through params lets requests URL-encode it.
        res = requests.get(url, params={'q': x}, headers=headers, proxies=proxies)
        soup = BeautifulSoup(res.text, "html.parser")
        search_text = soup.find_all("div", class_="g")
        # Collect titles across all queries; skip result blocks without a link.
        for result in search_text:
            link = result.find("a")
            if link is not None:
                title_list.append(link.text)
    print(title_list)
    return title_list


google_scrape(['LOL LMS'])
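Output: the page titles
The code is clearly much longer than Approach 1, but it is comparatively safer and less likely to get your IP blocked by Google:
1. user_agents disguises the request as a browser
2. proxies lets you switch the outgoing proxy
3. time.sleep(1) spaces out the requests
All three points reduce the risk of being blocked.
Pros: less likely to be blocked, and you can extract specific parts of the page
Cons: slower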
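The throttling side of the three measures above can be tried out on its own, without touching the network. The sketch below is only an illustration under a few assumptions: the helper names (pick_headers, jittered_delay, throttled) are hypothetical, the User-Agent pool is shortened, and it picks a new User-Agent per request and adds random jitter to the delay, which goes slightly beyond the fixed one-second pause in the code above.

```python
import random
import time

# Hypothetical, shortened pool; the full user_agents list from Approach 2 would go here.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20130406 Firefox/23.0",
    "Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14",
]


def pick_headers(user_agents=USER_AGENTS):
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(user_agents)}


def jittered_delay(base=1.0, jitter=0.5):
    """Return a delay in seconds drawn uniformly from [base, base + jitter]."""
    return base + random.uniform(0, jitter)


def throttled(items, base=1.0, jitter=0.5, sleep=time.sleep):
    """Yield (item, headers) pairs, pausing before each one.

    The sleep function is injectable (an assumption of this sketch) so the
    pacing can be checked without actually waiting.
    """
    for item in items:
        sleep(jittered_delay(base, jitter))
        yield item, pick_headers()
```

Inside google_scrape, the loop could then read "for x, headers in throttled(Search_list):" and pass headers to requests.get, so every query goes out with its own User-Agent and an uneven delay.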
I hope this is useful as a reference. I recommend Approach 2; if you only need a small amount of data, Approach 1 may be enough.
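For the string matching step mentioned at the start, the collected titles can be compared against the target string. This post does not show how that comparison was done, so the following is only a minimal sketch under my own assumptions: it uses difflib.SequenceMatcher from the standard library as the similarity measure, and best_title_match is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def best_title_match(query, titles):
    """Return (title, score) for the title most similar to query.

    Scores are SequenceMatcher ratios in [0, 1], compared case-insensitively.
    Returns (None, 0.0) when no title has any similarity.
    """
    best, best_score = None, 0.0
    for title in titles:
        score = SequenceMatcher(None, query.lower(), title.lower()).ratio()
        if score > best_score:
            best, best_score = title, score
    return best, best_score
```

The titles argument would be the list returned by google_scrape, for example best_title_match("LOL LMS", google_scrape(["LOL LMS"])). Any other similarity measure could be swapped in the same place.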