Web Scraping: Text Parsing Test

## Text parsing test

```
title = "E34xxxx88_xxx__化工115_專題提案發想"

# Normalize the separators: turn full-width underscores and doubled
# underscores into a single half-width underscore
normalized_title = title.replace('＿', '_').replace('__', '_')

# Use split() to turn the underscore-separated fields into a list
parts = normalized_title.split("_")
print(parts)

if len(parts) >= 3:
    student_id = parts[0]
    name = parts[1]
    department = parts[2]

    print("學號:", student_id)    # student ID
    print("姓名:", name)          # name
    print("系級:", department)    # department and class year
else:
    # "The title format is incorrect; cannot extract the fields"
    print("Title 格式不正確，無法提取訊息")
```

Output:

```
['E34xxxx88', 'xxx', '化工115', '專題提案發想']
學號: E34xxxx88
姓名: xxx
系級: 化工115
```

### Suppose the input is one long string and I need to extract specific segments from it:

```
text = '''主題 Python網路爬蟲應用摘要希望能用網路爬蟲做出一些應用，這只是一個大方向，目前一些，想法如下:
1. 課表選課應用:能跳出介面通知提醒選課時間，還有可以選定一些課程，跳出即時的名額動態，一有缺額就提醒我，最好的話能第一時間幫我自動選課更好(雖然不知道成大選課有沒有擋爬蟲就是了…)。
2. 自動收發訊息:設定一個時間，幫我自動用社群媒體發出貼文或訊息。
3. Discord bot:可以統計出每位使用者的歷史資料(暫定)'''

# Extract the topic
# find() returns the index where a substring starts, text[start:end] slices
# the string, and strip() trims the surrounding whitespace.
# The slice starts right after the marker '主題' (topic), i.e. at its index
# plus len('主題'), and ends just before the marker '摘要' (abstract).
topic_start = text.find('主題') + len('主題')
topic_end = text.find('摘要')
topic_content = text[topic_start:topic_end].strip()

# Extract the abstract: everything after the marker '摘要'
abstract_start = text.find('摘要') + len('摘要')
abstract_content = text[abstract_start:].strip()

print("主題:", topic_content)      # topic
print("摘要:", abstract_content)   # abstract
```

### Next, simulate scraping text-bearing elements from a web page, starting with an HTML fragment copied from the page:

```
from bs4 import BeautifulSoup

# HTML fragment copied from the page
html_content = '''<div class="no-overflow w-100 content-alignment-container" id="yui_3_17_2_1_1699203908396_25">
<div id="post-content-390615" class="post-content-container">
主題
　　Python網路爬蟲應用摘要
　　希望能用網路爬蟲做出一些應用，這只是一個大方向，目前一些，想法如下:
1.    課表選課應用:能跳出介面通知提醒選課時間，還有可以選定一些課程，跳出即時的名額動態，一有缺額就提醒我，最好的話能第一時間幫我自動選課更好(雖然不知道成大選課有沒有擋爬蟲就是了…)。
2.    自動收發訊息:設定一個時間，幫我自動用社群媒體發出貼文或訊息。
3.    Discord bot:可以統計出每位使用者的歷史資料(暫定)
</div>
<div>
<a href="https://moodle.ncku.edu.tw/pluginfile.php/1398924/mod_forum/attachment/390615/%E4%B8%BB%E9%A1%8C.docx?forcedownload=1" aria-label="Attachment 主題.docx" title="Attachment 主題.docx">
<img class="icon " alt="" aria-hidden="true" src="https://moodle.ncku.edu.tw/theme/image.php/fordson/core/1692324800/f/document">
主題.docx
</a>
<a href="https://moodle.ncku.edu.tw/portfolio/add.php?ca_postid=390615&ca_attachment=15360719&sesskey=Y4c7w0lWJc&callbackcomponent=mod_forum&callbackclass=forum_portfolio_caller&course=33560&callerformats=document%2Crichhtml%2Cplainhtml%2Cleap2a&instance=1" title="Export attachment 主題.docx to portfolio">
</a>
</div>
<div class="d-flex flex-wrap">
<div class="mt-2">評比平均分數：6.3 (11)
</div>
<div class="post-actions d-flex align-self-end justify-content-end flex-wrap ml-auto" data-region="post-actions-container" role="menubar" aria-label="E34111188_巫宗翰_化工115_專題發想提案是由巫宗翰張貼" aria-controls="p390615">
<a data-region="post-action" href="https://moodle.ncku.edu.tw/mod/forum/discuss.php?d=293062#p390615" class="btn btn-link" title="Permanent link to this post" aria-label="Permanent link to this post" tabindex="0">
永久鏈接
</a>
<a data-region="post-action" href="https://moodle.ncku.edu.tw/portfolio/add.php?ca_postid=390615&sesskey=Y4c7w0lWJc&callbackcomponent=mod_forum&callbackclass=forum_portfolio_caller&course=33560&callerformats=richhtml%2Cleap2a&instance=1" class="btn btn-link" tabindex="0">
匯出到學習歷程檔案
</a>
</div>
</div>
</div>'''

soup = BeautifulSoup(html_content, 'html.parser')

# Locate the div container that holds the <p> tags
post_content_div = soup.find('div', {'id': 'post-content-390615'})

# Collect the text of every <p> tag inside the container
paragraphs_text = [p.get_text(strip=True) for p in post_content_div.find_all('p')]

# The fragment as shown here contains no <p> tags (they appear to have been
# lost when it was copied), so fall back to the container's own text
if not paragraphs_text:
    paragraphs_text = [post_content_div.get_text(strip=True)]

# Join the extracted text into a single string
full_text = '\n'.join(paragraphs_text)

print(full_text)
```

Output:

```
主題
Python網路爬蟲應用摘要希望能用網路爬蟲做出一些應用，這只是一個大方向，目前一些，想法如下:
1.課表選課應用:能跳出介面通知提醒選課時間，還有可以選定一些課程，跳出即時的名額動態，一有缺額就提醒我，最好的話能第一時間幫我自動選課更好(雖然不知道成大選課有沒有擋爬蟲就是了…)。
2.自動收發訊息:設定一個時間，幫我自動用社群媒體發出貼文或訊息。
3.Discord bot:可以統計出每位使用者的歷史資料(暫定)
```
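The title splitting and marker-based slicing used earlier in this note could be made a little more defensive. The following is only a minimal sketch, not part of the original experiments: `parse_title` and `extract_between` are hypothetical helper names, and swapping the chained `replace()` calls for `re.split` is my own substitution. It treats any run of half-width or full-width underscores as one separator (chained `replace('__', '_')` leaves a stray empty field when three underscores appear in a row), and it returns `None` when a marker is missing instead of silently slicing from the `-1` that `find()` returns.

```python
import re


def parse_title(title):
    """Split a post title into (student_id, name, department), or return None.

    Hypothetical helper: assumes the fields are separated by one or more
    half-width (_) or full-width (＿) underscores.
    """
    # Split on any run of underscores and drop empty fields
    parts = [p for p in re.split(r"[_＿]+", title) if p]
    if len(parts) < 3:
        return None
    return parts[0], parts[1], parts[2]


def extract_between(text, start_marker, end_marker=None):
    """Return the stripped text after start_marker, up to end_marker if given.

    Hypothetical helper: returns None when a required marker is not found.
    """
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    if end_marker is None:
        return text[start:].strip()
    # Search for the end marker only after the start marker
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end].strip()


# Three underscores in a row plus a full-width underscore
print(parse_title("E34xxxx88_xxx___化工115＿專題提案發想"))
# → ('E34xxxx88', 'xxx', '化工115')

sample = "主題 Python網路爬蟲應用摘要希望能用網路爬蟲做出一些應用"
print(extract_between(sample, "主題", "摘要"))  # → Python網路爬蟲應用
print(extract_between(sample, "摘要"))          # → 希望能用網路爬蟲做出一些應用
```

Returning `None` keeps the "format is incorrect" branch from the first example explicit at the call site: the caller can test the result before unpacking it.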