2023-08-09 Development Log

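# 2023-08-09
## LINEbot:
* Song: 夜に駆ける

### What happened
1. Fetching images

This took a really long time and was a real slog. I went through a lot of versions and finally settled on the fourth. The earlier versions mostly looked like this:

```python=
from requests_html import HTMLSession
import webbrowser
from bs4 import BeautifulSoup

# Create an HTML session object
session = HTMLSession()

# Define the URL (replace with the page you want to fetch)
url = 'https://www.google.com/search?q=yee%20%E6%A2%97%E5%9C%96&tbm=isch'

# Send the GET request
response = session.get(url)

# Render JavaScript; timeout is how long to wait for rendering to finish
response.html.render(timeout=1000)

# Get the rendered HTML content
html_content = response.html.html
#print(html_content)

# Write the response text to a temporary file
with open('temp.html', 'w', encoding='utf-8') as f:
    f.write(response.text)

# Open the temporary file in the default browser
webbrowser.open('temp.html')

# Parse the saved HTML with Beautiful Soup
with open('temp.html', encoding='utf8') as f:
    soup = BeautifulSoup(f, 'lxml')

# Find div tags that have a data-id attribute
div_tags_with_data_id = soup.find_all('div', attrs={'data-id': True})

# Print every div tag that has a data-id attribute
for div in div_tags_with_data_id:
    print(div)
```

They were all much the same. The main path went:
* Crawled with requests myself and the request succeeded => but everything was a thumbnail
* Started taking the page structure apart => for some reason nothing could be scraped, possibly a JS rendering issue => switched to requests_html
* To debug, opened the fetched HTML in the browser => ended up with an HTML file on disk and confirmed it contains data-id
* BS4 could not find data-id in the variable returned by the request => tried workarounds => failed => just parsed temp.html instead => success
* It worked, but I still could not get the full images, so I declared the second-pass crawl unsolvable
* So I found icrawler, and using [https://github.com/hellock/icrawler/issues/73](https://github.com/hellock/icrawler/issues/73) managed to adapt it to collect only the URL without downloading

P.S.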
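One way to get ChatGPT to write code for a library it does not know well is to paste in a tutorial or the documentation, but keep the length in mind (too much text and it will definitely throw an error).

After one successful run, I found that the second run returned the same image even with a different keyword.

![](https://hackmd.io/_uploads/rJF6AQZhh.png)

So I started suspecting a caching problem. It was a silly one and took ages; in the end the problem was a single missing line:

```python
CustomLinkPrinter.file_urls = []  # clear the image URL list
```

When I run it on my own computer the program restarts every time, but on a PaaS the process stays alive, so the class-level file_urls list keeps its old entries and file_urls[0] is still the first image ever fetched. As for caching, I will probably run into it later, and I did find a way to handle it:

```python=
def download(self, task, default_ext, timeout=5, max_retry=3, overwrite=False, **kwargs):
    self.session.headers['Cache-Control'] = 'no-store'
```

That is the part that turns caching off.

```python=
#app.py
from flask import Flask, request, abort
from linebot import LineBotApi, WebhookHandler
from linebot.exceptions import InvalidSignatureError
from linebot.models import (
    MessageEvent, TextMessage, TextSendMessage, ImageSendMessage
)
import os
import request_4

app = Flask(__name__)

line_bot_api = LineBotApi(os.environ['CHANNEL_ACCESS_TOKEN'])
handler = WebhookHandler(os.environ['CHANNEL_SECRET'])

# Trigger prefix: "梗圖支援 " ("meme support") followed by the keyword
PREFIX = "梗圖支援 "


@app.route("/callback", methods=['POST'])
def callback():
    # Used to verify that the request is valid
    signature = request.headers['X-Line-Signature']
    # Read the HTTP request body as text so it is easier to process
    body = request.get_data(as_text=True)
    # Write the request body to the Flask log for later inspection
    app.logger.info("Request body: " + body)
    try:
        # Pass body and signature to the WebhookHandler, which
        # dispatches the events the LINE Bot received
        handler.handle(body, signature)
    except InvalidSignatureError:
        # Signature verification failed
        abort(400)
    return 'OK'


@handler.add(MessageEvent, message=TextMessage)
def handle_message(event):
    try:
        if event.message.text.startswith(PREFIX):
            # Slice by the prefix length so the keyword has no leading space
            keyword = event.message.text[len(PREFIX):].strip()
            reply = request_4.get_img_url(keyword)
            app.logger.info("reply is :" + str(reply))
            message = ImageSendMessage(original_content_url=reply,
                                       preview_image_url=reply)
            line_bot_api.reply_message(event.reply_token, message)
    except Exception as e:
        app.logger.error("An error occurred: " + str(e))
        # "Something went wrong, please try again later"
        reply = "出了一些問題，請稍後再試"
        message = TextSendMessage(text=reply)
        line_bot_api.reply_message(event.reply_token, message)


if __name__ == "__main__":
    port = int(os.environ.get('PORT', 5000))
    app.run(host='0.0.0.0', port=port)
```

```python=
#request_4.py
from icrawler.downloader import ImageDownloader
from icrawler.builtin import GoogleImageCrawler


class CustomLinkPrinter(ImageDownloader):
    # Class-level list: shared by every instance for the life of the process
    file_urls = []

    def get_filename(self, task, default_ext):
        file_idx = self.fetched_num + self.file_idx_offset
        return '{:04d}.{}'.format(file_idx, default_ext)

    def download(self, task, default_ext, timeout=5, max_retry=3,
                 overwrite=False, **kwargs):
        # Record the URL instead of downloading the file
        self.session.headers['Cache-Control'] = 'no-store'
        file_url = task['file_url']
        filename = self.get_filename(task, default_ext)

        task['success'] = True
        task['filename'] = filename

        if not self.signal.get('reach_max_num'):
            self.file_urls.append(file_url)

        self.fetched_num += 1

        if self.reach_max_num():
            self.signal.set(reach_max_num=True)

        return


init_params = {
    'downloader_cls': CustomLinkPrinter,
    # add other parameters here
}


def get_img_url(keyword="yee"):
    CustomLinkPrinter.file_urls = []  # clear the image URL list
    google_crawler = GoogleImageCrawler(**init_params)
    # " 梗圖" ("meme") is appended to the search; adjust parameters as needed
    google_crawler.crawl(keyword=keyword + " 梗圖", max_num=1)
    file_urls = google_crawler.downloader.file_urls
    rtn = ""
    if file_urls:
        # If the URL has another URL embedded in it, keep only the last one
        if file_urls[0].count('http') > 1:
            print("original:", file_urls[0])
            file_urls[0] = "http" + file_urls[0].rsplit('http', 1)[-1]
        rtn = file_urls[0]
    else:
        # Fallback image when nothing was found
        rtn = "https://memeprod.ap-south-1.linodeobjects.com/user-template/0f3ce1930440d817e8a477a175f871ed.png"
    print("result", rtn)
    return rtn

# get_img_url("YEE")  # for testing
```

### Summary
There are still plenty of problems:
* Special URLs (ones with redirects, etc.) do not work
    * There is already some handling at the end, but it still does not succeed 100% of the time
* ~~Offer several images to choose from~~
    * A single image turned out to work well enough, so I may not do this
* No hint when no image is found
* Sometimes an image inexplicably fails to display, and tapping it says it has expired
    * This does not seem to happen when I ask other people to test it; I am not sure
* Cannot mention or mark as read
    * I would like it to explain how to use it when mentioned, and not being able to mark messages as read makes it feel weird
* New features
    * Song support
    * Quote support
    * Once a month it ~~nags~~ tells you what new features there are, so you do not forget about it
    * social credit
    * English mahjong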
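
For the first and third problems in the list above (redirect-style links and the missing "not found" hint), here is a rough sketch of one possible direction, not something that is in request_4.py today. The names `clean_image_url`, `pick_reply` and `NOT_FOUND_TEXT` are hypothetical, the example.com URLs are made up, and it only covers two cases: a second URL pasted literally inside the first, and a percent-encoded URL carried in a query parameter. Real redirect links may need more than this.

```python
from typing import Optional, Tuple
from urllib.parse import parse_qs, urlsplit

# Hypothetical wording for the "nothing found" hint
NOT_FOUND_TEXT = "No image found for that keyword"


def clean_image_url(raw: Optional[str]) -> Optional[str]:
    """Return the innermost http(s) URL found in raw, or None if there is none."""
    if not raw:
        return None
    # Case 1: another URL is embedded literally; keep the last one.
    idx = max(raw.rfind("http://"), raw.rfind("https://"))
    if idx < 0:
        return None
    if idx > 0:
        return raw[idx:]
    # Case 2: a percent-encoded URL inside a query parameter.
    # parse_qs decodes the values, so they can be checked directly.
    for values in parse_qs(urlsplit(raw).query).values():
        for value in values:
            if value.startswith(("http://", "https://")):
                return value
    # Plain URL with nothing embedded in it.
    return raw


def pick_reply(raw: Optional[str]) -> Tuple[str, str]:
    """Decide between an image reply and a text hint."""
    url = clean_image_url(raw)
    if url is None:
        return ("text", NOT_FOUND_TEXT)
    return ("image", url)
```

With something like this, `handle_message` could send a `TextSendMessage` when `pick_reply` returns `"text"` instead of silently falling back to the default picture. Whether that is nicer than the fallback image is a design choice I have not tested.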