Python Web Scraping: Saving Images and Text to an Excel File in One Go

2018-12-30


Using BeautifulSoup, pandas, StyleFrame, and openpyxl to save the results in .xlsx format

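This is finally the first IT note of my life. Telling a story is hard, and I really admire the people before me whose technical articles are so clear and easy to follow. I also hope that keeping notes like this will help me understand better what I have actually learned.
So, without further ado, let's get to the point.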

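This post shows how to scrape with Python and, in the same run, save the scraped images and text together in an Excel file in .xlsx format. It uses a few packages for scraping text and for drawing images into the workbook: BeautifulSoup, pandas, StyleFrame, and openpyxl. I am still a Python beginner and everything here is something I worked out on my own, so if there is a faster or more convenient way, I would be grateful for any pointers. The post also touches on some basic HTML and CSS, which I will not explain in detail here.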

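Today's practice site is the Japanese restaurant review site "食べログ" (Tabelog), which many readers will already know. From the home page, choose the Osaka area (I would love to go back again) and click "Osaka restaurant ranking". That brings up the ranked list of restaurants. Right-click an empty part of the page and choose "Inspect" to open the developer tools (in Chrome, pressing F12 opens them directly). Click the cursor icon in the top-left corner of the developer tools panel, then click any of the recommended restaurants on the left side of the page (try clicking the restaurant name and the scores) to see which HTML elements they correspond to.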
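
As a rough illustration of what the developer tools reveal, the sketch below runs BeautifulSoup over a small hand-written HTML fragment. The class names are the ones used by the code in Part 1; the fragment itself, including the restaurant name, the scores, and the URLs, is made up for this example and is much simpler than the real Tabelog markup.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one entry of the ranking list (not real Tabelog data).
SAMPLE_HTML = """
<ul>
  <li class="list-rst">
    <a class="list-rst__name-main" href="https://example.com/restaurant/1">Example Sushi</a>
    <small class="list-rst__name-ja">例の寿司</small>
    <b class="c-rating__val">4.50</b>
    <b class="c-rating__val">4.40</b>
    <b class="c-rating__val">4.10</b>
    <img class="c-img" src="https://example.com/img/150x150_photo1.jpg">
  </li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
entry = soup.find("li", class_="list-rst")  # find returns the first match only

name_en = entry.find("a", class_="list-rst__name-main")
name_ja = entry.find("small", class_="list-rst__name-ja")
# find_all returns every match, so all three scores are collected here
scores = [b.text for b in entry.find_all("b", class_="c-rating__val")]
image_url = entry.find("img", class_="c-img")["src"]  # attributes are read like dict keys

print(name_en.text, name_en["href"])
print(name_ja.text)
print(scores)
print(image_url)
```

Each `find` call mirrors one click with the inspector: the tag name and class shown in the developer tools become the arguments.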

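Let's go straight to the code.
Part 1:
- Use the urllib package to send the URL request, then use BeautifulSoup to parse the HTML response returned by the Tabelog site.
- Extract all the data we need and print it.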

# Use urlopen from the built-in urllib.request module to request the URL
import os
import ssl
from urllib.request import urlopen, urlretrieve
from urllib.error import HTTPError
from bs4 import BeautifulSoup

# On a Mac, HTTPS requests may fail certificate verification (commonly because
# the python.org installer does not use the system certificates until its
# "Install Certificates" step has been run). The line below works around that
# by turning certificate verification off.
ssl._create_default_https_context = ssl._create_unverified_context

os.makedirs("tablelog", exist_ok=True)  # urlretrieve does not create the folder itself

page = 59  # starting page for this demo; the complete source at the end starts from 1
while True:
    url = "https://tabelog.com/tw/osaka/rstLst/" + str(page) + "/?SrtT=rt"
    print("Processing page:", url)
    # If the page number is past the last page (not found), print "Completed" and break out of the loop
    try:
        response = urlopen(url)
    except HTTPError:
        print("Completed")
        break
    # Parse the HTML response returned by Tabelog with BeautifulSoup
    html = BeautifulSoup(response, "html.parser")
    for item in html.find_all("li", class_="list-rst"):  # find_all always returns a list
        jap = item.find("small", class_="list-rst__name-ja")  # Japanese name
        eng = item.find("a", class_="list-rst__name-main")  # English name
        scores = item.find_all("b", class_="c-rating__val")  # there are three scores here, so use find_all
        img = item.find("img", class_="c-img")  # the image URL is img["src"]
        fname = "tablelog/" + img["src"].split("/")[-1]  # local path and file name for the image
        urlretrieve(img["src"], fname)  # download the image and save it

        # Print everything we extracted to see what it looks like
        for value in (item, jap, eng, scores, img, img["src"]):
            print(value)
            print("======\n")

    page = page + 1
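
The stopping rule in the loop above can also be isolated so that it can be tried without touching the network. In this sketch, `page_url`, `crawl`, and the `fetch` parameter are names made up for the illustration; `fetch` stands in for `urlopen`, and `fake_fetch` pretends that only pages 1 to 3 exist.

```python
from urllib.error import HTTPError

BASE_URL = "https://tabelog.com/tw/osaka/rstLst/"


def page_url(page):
    """Build the ranking URL for one page, as in the loop above."""
    return BASE_URL + str(page) + "/?SrtT=rt"


def crawl(fetch, first_page=1):
    """Call fetch(url) page by page until it raises HTTPError.

    Returns the list of page numbers that were fetched successfully.
    """
    fetched = []
    page = first_page
    while True:
        url = page_url(page)
        try:
            fetch(url)
        except HTTPError:
            break  # past the last page: stop
        fetched.append(page)
        page += 1
    return fetched


def fake_fetch(url):
    """Stand-in for urlopen: pretend that only pages 1 to 3 exist."""
    page = int(url[len(BASE_URL):].split("/")[0])
    if page > 3:
        raise HTTPError(url, 404, "Not Found", None, None)
    return "<html></html>"


print(crawl(fake_fetch))
```

Passing the fetch function in as an argument is only a convenience for experimenting; with `urlopen` in its place the loop behaves like the one in Part 1.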

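Part 2:
- Import the pandas package and turn all the extracted data into a DataFrame (the data is appended to the DataFrame one row at a time).
- At this point pandas can already save the data as an .xlsx file.
- Because we will also insert the images later and need to adjust the cell sizes, we then convert the DataFrame into a StyleFrame, which makes those settings easier. (Try inserting the images without setting the cell sizes to see what happens.)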

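import warnings
import pandas as pd

# Column labels: restaurant photo, overall score, evening score, yearly score,
# Japanese name, English name, page URL
df = pd.DataFrame(columns=["餐廳美圖", "綜合評分", "晚間評分", "年間評分", "日文店名", "英文店名", "介紹網址"])
# Suppress all warnings
warnings.filterwarnings('ignore')
          
         ...
         ...

    for item in html.find_all("li", class_="list-rst"):  

         ...
         ...
      

        # Build a Series and append it to the DataFrame; each value goes into the matching column
        s = pd.Series([scores[0].text, scores[1].text, scores[2].text, jap.text, eng.text, eng["href"]],
                      index=["綜合評分", "晚間評分", "年間評分", "日文店名", "英文店名", "介紹網址"])
        # The Series has no row label, so ignore_index=True is required when adding it.
        # DataFrame.append was removed in pandas 2.0, so pd.concat is used here instead.
        df = pd.concat([df, s.to_frame().T], ignore_index=True)

    page = page + 1

# Save as .xlsx; the row numbers are not needed, so index=False
# (to_excel no longer accepts an encoding argument in pandas 2.0 and later)
df.to_excel("tablelog.xlsx", index=False)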
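
An alternative to adding one Series at a time is to collect each row as a dict in a plain list and build the DataFrame once at the end, which avoids copying the table on every iteration. The sketch below uses two made-up rows and shortened English column names chosen only for the example; it is not the author's code.

```python
import pandas as pd

COLUMNS = ["photo", "overall", "name_ja", "name_en", "url"]

# Made-up values standing in for scraped results.
scraped = [
    ("4.50", "例の寿司", "Example Sushi", "https://example.com/restaurant/1"),
    ("4.40", "例の焼肉", "Example Yakiniku", "https://example.com/restaurant/2"),
]

rows = []
for overall, name_ja, name_en, url in scraped:
    # One dict per restaurant; the keys are the column names.
    rows.append({"overall": overall, "name_ja": name_ja, "name_en": name_en, "url": url})

# Columns that no row provides (here "photo") are filled with NaN,
# which leaves the cell empty for the image added later.
df = pd.DataFrame(rows, columns=COLUMNS)

print(df.shape)
print(df["name_en"].tolist())
```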

Part 3:
- Convert the DataFrame to a StyleFrame and set the styles

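# Newer StyleFrame releases use the lowercase module name;
# older ones use: from StyleFrame import StyleFrame
from styleframe import StyleFrame

    ...
    ...

sf = StyleFrame(df)  # convert to a StyleFrame

# Set the column widths (a tuple key applies the same width to several columns)
sf.set_column_width_dict(col_width_dict={
    ("餐廳美圖"): 25.5,
    ("綜合評分", "晚間評分", "年間評分", "日文店名", "英文店名") : 20,
    ("介紹網址"): 65.5
 })

# Set the row heights for every row except the header
all_rows = sf.row_indexes
sf.set_row_height_dict(row_height_dict={
    all_rows[1:]: 120
})

# Save as an Excel file
writer = sf.to_excel('tablelog.xlsx',
                     sheet_name='Sheet1',  # name of the sheet to create
                     right_to_left=False,  # False, so the sheet is laid out left-to-right
                     columns_and_rows_to_freeze='A1',  # freeze-pane cell; 'A1' leaves nothing frozen
                     row_to_add_filters=0)  # add filter buttons to the header row
# Don't forget to write the file; ExcelWriter.save() was removed in pandas 2.0, so close() is used
writer.close()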
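
Excel measures row heights in points, while image sizes are given in pixels, so it can help to convert between the two when choosing a height. The helper names below are hypothetical, and the conversion assumes the usual 96 pixels per inch of a display at 100% scaling; at other scalings the result will differ.

```python
def pixels_to_points(pixels, dpi=96):
    """Convert a pixel length to points (1 point = 1/72 inch)."""
    return pixels * 72 / dpi


def points_to_pixels(points, dpi=96):
    """Convert a length in points back to pixels at the given DPI."""
    return points * dpi / 72


# A row height of 120 points, as set above, leaves room for an image
# up to this many pixels tall:
print(points_to_pixels(120))  # 160.0

# An image 150 pixels tall (a made-up size) needs at least this row height:
print(pixels_to_points(150))  # 112.5
```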

Part 4:
- Use openpyxl to write the images into the Excel file

import glob
import os
from openpyxl import load_workbook
from openpyxl.drawing.image import Image  # loading images requires Pillow

    ...
    ...

row = 2  # the images start at A2, just below the header row
wb = load_workbook('tablelog.xlsx')  # read the saved file back in
ws = wb.worksheets[0]  # the images go into the first sheet

# glob reads every matching file from the folder.
# Sorting by os.path.getmtime reads the earliest-saved file first,
# so the images stay in the same order as the rows.
searchedfiles = sorted(glob.glob("tablelog/*.jpg"), key=os.path.getmtime)
for fn in searchedfiles:
    img = Image(fn)  # create an image instance
    ws.add_image(img, 'A' + str(row))  # the anchor cell must be a string such as 'A2'
    row = row + 1
wb.save('tablelog.xlsx')  # don't forget to save
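
To see why sorting by modification time keeps each image next to its row, the following sketch creates three empty files in a temporary folder, gives them explicit timestamps with `os.utime`, and pairs the sorted files with anchor cells starting at A2. `anchor_cells` is a hypothetical helper and the file names are invented; no workbook is touched.

```python
import glob
import os
import tempfile


def anchor_cells(paths, column="A", first_row=2):
    """Pair each path with the cell where its image would be anchored."""
    return [(column + str(first_row + i), path) for i, path in enumerate(paths)]


with tempfile.TemporaryDirectory() as folder:
    # Create the files in "download order" c, a, b with increasing timestamps.
    for offset, name in enumerate(["c.jpg", "a.jpg", "b.jpg"]):
        path = os.path.join(folder, name)
        open(path, "wb").close()
        os.utime(path, (1_000_000 + offset, 1_000_000 + offset))

    # Same sort key as in Part 4: oldest file first.
    files = sorted(glob.glob(os.path.join(folder, "*.jpg")), key=os.path.getmtime)
    placement = [(cell, os.path.basename(p)) for cell, p in anchor_cells(files)]

print(placement)
```

Sorting by name instead would give a, b, c, which no longer matches the order in which the rows were scraped. The modification-time approach also picks up any .jpg files left in the folder from earlier runs, so starting from an empty folder is the safer habit.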

Now let's take a look at the result!

Finally, here is the complete source code:

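# Use urlopen from the built-in urllib.request module to request the URL
import glob
import os
import ssl
import warnings
from urllib.request import urlopen, urlretrieve
from urllib.error import HTTPError

import pandas as pd
from bs4 import BeautifulSoup
from openpyxl import load_workbook
from openpyxl.drawing.image import Image  # loading images requires Pillow
# Newer StyleFrame releases use the lowercase module name;
# older ones use: from StyleFrame import StyleFrame
from styleframe import StyleFrame

# On a Mac, HTTPS requests may fail certificate verification (commonly because
# the python.org installer does not use the system certificates until its
# "Install Certificates" step has been run). The line below works around that
# by turning certificate verification off.
ssl._create_default_https_context = ssl._create_unverified_context

# Suppress all warnings
warnings.filterwarnings('ignore')

os.makedirs("tablelog", exist_ok=True)  # urlretrieve does not create the folder itself

# Column labels: restaurant photo, overall score, evening score, yearly score,
# Japanese name, English name, page URL
df = pd.DataFrame(columns=["餐廳美圖", "綜合評分", "晚間評分", "年間評分", "日文店名", "英文店名", "介紹網址"])

page = 1
while True:
    url = "https://tabelog.com/tw/osaka/rstLst/" + str(page) + "/?SrtT=rt"
    print("Processing page:", url)

    # Past the last page the request fails with HTTPError, which ends the loop
    try:
        response = urlopen(url)
    except HTTPError:
        print("Completed")
        break

    html = BeautifulSoup(response, "html.parser")
    for item in html.find_all("li", class_="list-rst"):  # find_all always returns a list
        jap = item.find("small", class_="list-rst__name-ja")
        eng = item.find("a", class_="list-rst__name-main")
        scores = item.find_all("b", class_="c-rating__val")
        img = item.find("img", class_="c-img")
        fname = "tablelog/" + img["src"].split("/")[-1]
        urlretrieve(img["src"], fname)

        # Build a Series and append it to the DataFrame; each value goes into the matching column
        s = pd.Series([scores[0].text, scores[1].text, scores[2].text, jap.text, eng.text, eng["href"]],
                      index=["綜合評分", "晚間評分", "年間評分", "日文店名", "英文店名", "介紹網址"])
        # The Series has no row label, so ignore_index=True is required when adding it.
        # DataFrame.append was removed in pandas 2.0, so pd.concat is used here instead.
        df = pd.concat([df, s.to_frame().T], ignore_index=True)

    page = page + 1

sf = StyleFrame(df)

# Column widths (a tuple key applies the same width to several columns)
sf.set_column_width_dict(col_width_dict={
    ("餐廳美圖"): 25.5,
    ("綜合評分", "晚間評分", "年間評分", "日文店名", "英文店名") : 20,
    ("介紹網址"): 65.5
 })

# Row heights for every row except the header
all_rows = sf.row_indexes
sf.set_row_height_dict(row_height_dict={
    all_rows[1:]: 120
})

writer = sf.to_excel('tablelog.xlsx',
                     sheet_name='Sheet1',  # name of the sheet to create
                     right_to_left=False,
                     columns_and_rows_to_freeze='A1',
                     row_to_add_filters=0)
writer.close()  # ExcelWriter.save() was removed in pandas 2.0


row = 2  # the images start at A2, just below the header row
wb = load_workbook('tablelog.xlsx')
ws = wb.worksheets[0]

# Oldest file first, so the images stay in the same order as the rows
searchedfiles = sorted(glob.glob("tablelog/*.jpg"), key=os.path.getmtime)
for fn in searchedfiles:
    img = Image(fn)  # create an image instance
    ws.add_image(img, 'A' + str(row))
    row = row + 1
wb.save('tablelog.xlsx')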
