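# Week 6: Big Data Analytics Practice & Business Intelligence and Big Data Analytics
> ==**Reusing the previously crawled multi-category news (news_preprocessed.csv); Advanced Query: Custom Keyword Popularity Analysis; Who Was Biggest Yesterday (or Today)**==[color=#EA0000]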
###### tags: `大數據分析實務` `商業智慧與巨量資料分析` `碩士` `複習用` `高科大`
>:::spoiler Table of contents
>[TOC]
>:::
:::success
:::spoiler Sources (click to expand) :page_facing_up:
{%hackmd @chiaoshin369/bigdata_url_temple %}
:::
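## Week 6 Content
> **Dates: 2025/3/25 (Tue), 2025/3/27 (Thu)** [color=#ffe260]
### Overview
1. Build the **"Advanced Query: Custom Keyword Popularity Analysis"** (`進階查詢-自訂關鍵詞熱門度分析`) and **"Who Was Biggest Yesterday (or Today)"** (`昨日(或今日)誰最大`) pages.
2. Write the homework in a Word file, explain each step with screenshots, and export it as a PDF.
3. Build the `自訂關鍵詞熱門度分析` (custom keyword popularity analysis) page directly and import `news_preprocessed` into `dataset`.
4. Study `昨日(或今日)誰最大`: starting from `熱門人物關鍵字分析` (hot person keyword analysis), add a ==time filter== that keeps **only the last day** of news.
5. Count the **top 3 to 5 person names** in each news category and save the result as `"hot persons of yesterday.csv"`.
6. In the Django site, ==load the data from the back end via Ajax== and render it on the front-end page; revise the page so the presentation looks better.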
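## Download This Week's Course Files
>[!Note]
>Select all files under w06 and download them.

If the files are **unavailable** or **the teacher has changed the link**, download them from the `GitHub repository` instead.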
>[!Important]
>[bigdata/2025/class6/drive-download-20250325T013211Z-001.zip at main · chiaoshin/bigdata](https://github.com/chiaoshin/bigdata/blob/main/2025/class6/drive-download-20250325T013211Z-001.zip)
>
## Build the ==Advanced Query: Custom Keyword Popularity Analysis== Page Directly
> keyword query using your dataset
#### (HW)
Add `'app_user_keyword',` to `INSTALLED_APPS` in `settings.py`.
> **==settings.py==**[color=#42cbed]
```python=1
INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    # 'corsheaders',  # cross-origin requests
    'app_top_keyword',
    'app_top_person',
    'app_user_keyword',
]
```
Then include the route `path('userkeyword/', include('app_user_keyword.urls')),` in the project URL configuration.
> **==`website_configs` urls.py==**[color=#42cbed]
```python=1
from django.contrib import admin
from django.urls import path, include

urlpatterns = [
    # path('admin/', admin.site.urls),
    # top keywords
    path('topword/', include('app_top_keyword.urls')),
    # top persons
    path('topperson/', include('app_top_person.urls')),
    # user keyword analysis
    path('userkeyword/', include('app_user_keyword.urls')),
]
```
In the front-end template, link to the new page with `{% url 'app_user_keyword:home' %}`.
> **==`templates` base.html==**[color=#42cbed]
```html=1
<!-- Advanced custom analysis -->
<div class="btn-group">
<button type="button" class="btn dropdown-toggle" data-bs-toggle="dropdown" aria-expanded="false">進階查詢</button>
<div class="dropdown-menu">
<a class="dropdown-item" href="{% url 'app_user_keyword:home' %}">自訂關鍵詞熱門度分析</a>
<a class="dropdown-item" href="#">自訂全文檢索與關聯分析</a>
<a class="dropdown-item" href="#">自訂關鍵詞之情緒分析</a>
</div>
</div>
```
Next, register the routes for the **page (home.html)** and the **dataset API**.
> **==`app_user_keyword` urls.py==**[color=#42cbed]
```python=1
from django.urls import path
from app_user_keyword import views

app_name = "app_user_keyword"

urlpatterns = [
    # the first way:
    path('', views.home, name='home'),
    path('api_get_top_userkey/', views.api_get_top_userkey),
    # the second way:
    # path('top_userkey/', views.home, name='home'),
    # path('top_userkey/api_get_top_userkey/', views.api_get_top_userkey),
]
'''
# the first way:
The url path on the browser will be
http://localhost:8000/userkeyword/
# the second way:
The url path on the browser will be
http://localhost:8000/userkeyword/top_userkey/
The ajax url is as the following:
$.ajax({
type: "POST",
url: "api_get_top_userkey/",
'''
```
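The relative Ajax URL `api_get_top_userkey/` works in both routing options because the browser resolves it against the current page URL. The following is a minimal sketch of that resolution using Python's standard `urljoin`; the helper name `resolve_ajax_url` is hypothetical and only serves this illustration.

```python
from urllib.parse import urljoin

# Page URLs for the two routing options described in urls.py
first_way_page = "http://localhost:8000/userkeyword/"
second_way_page = "http://localhost:8000/userkeyword/top_userkey/"

# The relative URL used in the $.ajax() call
ajax_url = "api_get_top_userkey/"


def resolve_ajax_url(page_url: str, relative_url: str) -> str:
    """Resolve a relative URL against the page URL, as a browser would."""
    return urljoin(page_url, relative_url)


first_way_api = resolve_ajax_url(first_way_page, ajax_url)
second_way_api = resolve_ajax_url(second_way_page, ajax_url)

# Without the trailing slash, the last path segment is replaced instead of extended
no_slash_api = resolve_ajax_url("http://localhost:8000/userkeyword", ajax_url)

print(first_way_api)
print(second_way_api)
print(no_slash_api)
```

The last case suggests why the page URL should end with a slash: otherwise the request goes to a path that neither routing option defines.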
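Consolidate all the `keyword filtering` functions into one file.
For a detailed explanation of each function, study these three notebooks from the teacher:
1. 10-Filtering news with time period.ipynb
2. 15-User keywords query and frequency calculation.ipynb
3. 30-Time-based keyword frequency for line trend chart in Django.ipynb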
> **==views.py==**[color=#42cbed]
```python=1
# Create your views here.
import ast
from datetime import datetime, timedelta

import pandas as pd
from django.http import JsonResponse
from django.shortcuts import render
from django.views.decorators.csrf import csrf_exempt

# (1) We can load the data once with read_csv() into a global variable:
# df = pd.read_csv('dataset/news_dataset_preprocessed_for_django.csv', sep='|')

# (2) Or we can load it through load_df_data(), which can be called again to reload
def load_df_data():
    # df is a global variable
    global df
    df = pd.read_csv('app_user_keyword/dataset/blow_news_preprocessed.csv', sep=',')

# We should reload df when necessary
load_df_data()

# home page
def home(request):
    return render(request, 'app_user_keyword/home.html')

# When POST is used, exempt this function from the CSRF check
@csrf_exempt
def api_get_top_userkey(request):
    # (1) get keywords, category, condition, and weeks passed from frontend
    userkey = request.POST.get('userkey')
    cate = request.POST.get('cate')
    cond = request.POST.get('cond')
    weeks = int(request.POST.get('weeks'))
    key = userkey.split()

    # (2) make df_query global, so it can be used by other functions
    global df_query

    # (3) filter dataframe
    df_query = filter_dataFrame(key, cond, cate, weeks)
    # print(len(df_query))

    # (4) get frequency data
    key_freq_cat, key_occurrence_cat = count_keyword(df_query, key)
    print(key_occurrence_cat)

    # (5) get line chart data, e.g.
    # key_time_freq = [
    #     {"x": "2019-03-07", "y": 2},
    #     {"x": "2019-03-08", "y": 2},
    #     {"x": "2019-03-09", "y": 13}]
    key_time_freq = get_keyword_time_based_freq(df_query)

    # (6) response all data to frontend home page
    response = {
        'key_occurrence_cat': key_occurrence_cat,
        'key_freq_cat': key_freq_cat,
        'key_time_freq': key_time_freq,
    }
    return JsonResponse(response)

def filter_dataFrame(user_keywords, cond, cate, weeks):
    # end date: the date of the latest record of news
    end_date = df.date.max()
    # start date: `weeks` weeks before the end date
    start_date = (datetime.strptime(end_date, '%Y-%m-%d').date() - timedelta(weeks=weeks)).strftime('%Y-%m-%d')

    # 'and' requires every keyword in the content; otherwise ('or') one match is enough
    match = all if cond == 'and' else any
    mask = ((df.date >= start_date) & (df.date <= end_date)
            & df.content.apply(lambda text: match((qk in text) for qk in user_keywords)))

    # restrict to one category unless "全部" (all) is selected
    if cate != "全部":
        mask &= (df.category == cate)
    return df[mask]

# ** How many pieces of news were the keyword(s) mentioned in?
# ** How many times were the keyword(s) mentioned?
# For the df_query, count the occurrence and frequency for every category:
# (1) cate_occurence={}  number of pieces containing the keywords
# (2) cate_freq={}       number of times the keywords were mentioned
news_categories = ['全部', '人物', '議題', '新聞', '雜吹']

def count_keyword(query_df, user_keywords):
    cate_occurence = {}
    cate_freq = {}
    for cate in news_categories:
        cate_occurence[cate] = 0
        cate_freq[cate] = 0

    for idx, row in query_df.iterrows():
        # count number of news
        cate_occurence[row.category] += 1
        cate_occurence['全部'] += 1

        # count user keyword frequency by checking every word in tokens_v2
        # (literal_eval parses the stored list without executing arbitrary code)
        tokens = ast.literal_eval(row.tokens_v2)
        freq = len([word for word in tokens if (word in user_keywords)])
        cate_freq[row.category] += freq
        cate_freq['全部'] += freq
    return cate_freq, cate_occurence

def get_keyword_time_based_freq(df_query):
    date_samples = df_query.date
    query_freq = pd.DataFrame({'date_index': pd.to_datetime(date_samples), 'freq': [1 for _ in range(len(df_query))]})
    # one row per day; each matching article contributes 1
    data = query_freq.groupby(pd.Grouper(key='date_index', freq='D')).sum()
    time_data = []
    for idx, freq in data['freq'].items():
        time_data.append({'x': idx.strftime('%Y-%m-%d'), 'y': int(freq)})
    return time_data

print("app_user_keyword was loaded!")
```
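To see how the time window, the `and`/`or` condition, and the token count interact, here is a minimal standalone sketch on a tiny in-memory DataFrame. The sample rows and the helper names `filter_news` and `count_tokens` are assumptions made for this illustration; the column names (`date`, `category`, `content`, `tokens_v2`) follow the views.py listing above.

```python
import ast
from datetime import datetime, timedelta

import pandas as pd

# Hypothetical sample rows; the real data comes from the preprocessed CSV
sample = pd.DataFrame({
    "date": ["2025-03-01", "2025-03-20", "2025-03-24", "2025-03-25"],
    "category": ["人物", "議題", "新聞", "人物"],
    "content": ["臺灣 舊聞", "臺灣與台東的議題", "台東新聞", "與關鍵詞無關"],
    "tokens_v2": ["['臺灣', '舊聞']", "['臺灣', '台東', '議題']",
                  "['台東', '新聞']", "['無關']"],
})


def filter_news(df, keywords, cond, cate, weeks):
    """Keep rows inside the last `weeks` weeks that match the keywords."""
    end_date = df["date"].max()
    start_date = (datetime.strptime(end_date, "%Y-%m-%d")
                  - timedelta(weeks=weeks)).strftime("%Y-%m-%d")
    # 'and' needs every keyword in the content, 'or' needs at least one
    match = all if cond == "and" else any
    mask = df["date"].between(start_date, end_date) & df["content"].apply(
        lambda text: match(k in text for k in keywords))
    if cate != "全部":
        mask &= (df["category"] == cate)
    return df[mask]


def count_tokens(df, keywords):
    """Count how many tokens across all rows are one of the keywords."""
    return sum(
        sum(1 for tok in ast.literal_eval(tokens) if tok in keywords)
        for tokens in df["tokens_v2"])


keywords = ["臺灣", "台東"]
or_rows = filter_news(sample, keywords, "or", "全部", weeks=2)
and_rows = filter_news(sample, keywords, "and", "全部", weeks=2)
news_rows = filter_news(sample, keywords, "or", "新聞", weeks=2)
print(len(or_rows), len(and_rows), len(news_rows), count_tokens(or_rows, keywords))
```

The 2025-03-01 row contains a keyword but falls outside the two-week window, so it is excluded in every case. Note that filtering is a substring test on `content`, while the frequency is counted on the segmented tokens in `tokens_v2`.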
Add the front-end page, adjusting the options to match the dataset (==news categories==) and the default filter keywords `value="臺灣 台東"`.
> **==home.html==**[color=#42cbed]
```html=1
{% extends 'base.html' %} {% block title %}
使用者關鍵詞查詢
{% endblock %} {% block content %}
<div class="col-lg-12">
<h1>分析你關心的關鍵詞</h1>
<p>可以針對你輸入的個別關鍵詞進行熱門程度分析</p>
</div>
<div class="col-lg-6 mb-2">
<!-- 輸入條件區塊開始 -->
<div class="card">
<div class="card-header">
<h3 class="h6 text-uppercase mb-0">輸入條件</h3>
</div>
<div class="card-body">
<div class="mb-3 row">
<label class="col-md-3 col-form-label">關心哪個關鍵詞?</label>
<div class="col-md-9">
<input id="input_keyword" name="userkey" value="臺灣 台東" class="form-control" />
<div class="form-text text-muted">查找關鍵字,可輸入多個,空白隔開。主要以人名,產品,地理區域為主(搜尋斷詞後的詞語,並非全文搜尋)。</div>
</div>
</div>
<div class="mb-3 row">
<label class="col-sm-3 col-form-label">條件</label>
<div class="col-md-9 mb-3">
<div class="form-check form-check-inline">
<input class="form-check-input" type="radio" id="cond_and" value="and" name="condradio" />
<label class="form-check-label" for="cond_and">and</label>
</div>
<div class="form-check form-check-inline">
<input class="form-check-input" type="radio" id="cond_or" value="or" name="condradio" checked />
<label class="form-check-label" for="cond_or">or</label>
</div>
</div>
</div>
<div class="mb-3 row">
<label class="col-sm-3 col-form-label">新聞類別</label>
<div class="col-md-9">
<div class="form-check form-check-inline">
<input class="form-check-input" type="radio" id="cate_all" value="全部" name="cateradio" checked />
<label class="form-check-label" for="cate_all">全部</label>
</div>
<div class="form-check form-check-inline">
<input class="form-check-input" type="radio" id="cate_politics" value="人物" name="cateradio" />
<label class="form-check-label" for="cate_politics">人物</label>
</div>
<div class="form-check form-check-inline">
<input class="form-check-input" type="radio" id="cate_tech" value="議題" name="cateradio" />
<label class="form-check-label" for="cate_tech">議題</label>
</div>
<div class="form-check form-check-inline">
<input class="form-check-input" type="radio" id="cate_sports" value="新聞" name="cateradio" />
<label class="form-check-label" for="cate_sports">新聞</label>
</div>
<div class="form-check form-check-inline">
<input class="form-check-input" type="radio" id="cate_stocks" value="雜吹" name="cateradio" />
<label class="form-check-label" for="cate_stocks">雜吹</label>
</div>
</div>
</div>
<div class="mb-3 row">
<label class="col-md-3 col-form-label">最近多少周?</label>
<div class="col-md-9">
<div class="form-check form-check-inline">
<input class="form-check-input" type="radio" id="wk1" value="1" name="wkradio" />
<label class="form-check-label" for="wk1">1</label>
</div>
<div class="form-check form-check-inline">
<input class="form-check-input" type="radio" id="wk2" value="2" name="wkradio" checked />
<label class="form-check-label" for="wk2">2</label>
</div>
<div class="form-check form-check-inline">
<input class="form-check-input" type="radio" id="wk3" value="3" name="wkradio" />
<label class="form-check-label" for="wk3">3</label>
</div>
<div class="form-check form-check-inline">
<input class="form-check-input" type="radio" id="wk4" value="4" name="wkradio" />
<label class="form-check-label" for="wk4">4</label>
</div>
<div class="form-check form-check-inline">
<input class="form-check-input" type="radio" id="wk6" value="6" name="wkradio" />
<label class="form-check-label" for="wk6">6</label>
</div>
<div class="form-check form-check-inline">
<input class="form-check-input" type="radio" id="wk8" value="8" name="wkradio" />
<label class="form-check-label" for="wk8">8</label>
</div>
<div class="form-check form-check-inline">
<input class="form-check-input" type="radio" id="wk12" value="12" name="wkradio" />
<label class="form-check-label" for="wk12">12</label>
</div>
<div class="form-text text-muted">以最新資料時間為準,往前推多少周?</div>
</div>
</div>
<div class="mb-3 row">
<div class="col-md-9 ms-auto">
<button type="button" id="btn_ok" class="btn btn-primary">查詢</button>
</div>
</div>
</div>
</div>
</div>
<!-- 輸入區塊結束 -->
<!-- 顯示區塊 -->
<div class="col-lg-6 mb-2">
<div class="card">
<div class="card-header">
<h3 class="h6 text-uppercase mb-0">出現頻率以時間呈現</h3>
</div>
<div class="card-body">
<small>觀察每個時間點的有多少篇報導(聲量大小)</small>
<div class="row">
<canvas id="keyword_time_line_chart"></canvas>
</div>
</div>
</div>
</div>
<!-- 區塊結束 -->
<!-- 同時出現的關鍵字區塊 -->
<div class="col-lg-6 mb-2">
<div class="card">
<div class="card-header">
<h3 class="h6 text-uppercase mb-0">熱門程度:有幾篇新聞報導提到它?</h3>
</div>
<div class="card-body">
<ul id="keyword_article_count"></ul>
</div>
</div>
</div>
<!-- 區塊結束 -->
<!-- 熱門程度區塊 -->
<div class="col-lg-6 mb-2">
<div class="card">
<div class="card-header">
<h3 class="h6 text-uppercase mb-0">熱門程度:提到它的次數?</h3>
</div>
<div class="card-body">
<ul id="keyword_frequency"></ul>
</div>
</div>
</div>
<!-- 區塊結束 -->
{% endblock %} {% block extra_js %}
<!-- The JavaScript here is loaded and executed after the page has been initialized -->
<!-- Chart.js scripts -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/moment.js/2.13.0/moment.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/Chart.js/2.7.3/Chart.min.js"></script>
<!-- Code section -->
<script>
call_ajax()
// ** button event
$('#btn_ok').on('click', function () {
call_ajax()
}) //event function
$("input[name='cateradio']").on('change', function () {
call_ajax()
}) //event function
$("input[name='wkradio']").on('change', function () {
call_ajax()
}) //event function
$("input[name='condradio']").on('change', function () {
call_ajax()
}) //event function
function call_ajax() {
const userkey = $('#input_keyword').val()
const weeks = $("input[name='wkradio']:checked").val()
const cate = $("input[name='cateradio']:checked").val()
const cond = $("input[name='condradio']:checked").val()
if (userkey.length < 2) {
alert('輸入關鍵字不可空白或小於兩個中文字!')
return 0
}
$.ajax({
type: 'POST',
url: 'api_get_top_userkey/',
//url: 'http://163.18.23.20:8000/userkeyword/api_get_top_userkey/',
data: {
userkey: userkey,
cate: cate,
weeks: weeks,
cond: cond
}, // pass to server
success: function (received) {
const article_count = received['key_occurrence_cat']
console.log(article_count)
$('#keyword_article_count').empty()
// wrap each entry in an <li> tag and append it for display
for (let key in article_count) {
let paste = '<li>' + key + ':' + article_count[key] + '</li>'
$('#keyword_article_count').append(paste)
}
const kwfreq = received['key_freq_cat']
console.log(kwfreq)
$('#keyword_frequency').empty()
for (let key in kwfreq) {
let paste = '<li>' + key + ':' + kwfreq[key] + '</li>'
$('#keyword_frequency').append(paste)
}
const data_key_time_freq = received['key_time_freq']
console.log(data_key_time_freq)
showtimechart(data_key_time_freq)
} //function
}) //ajax
} //call_ajax()
// global variable that holds the chart instance
let line_chart = null
function showtimechart(data_key_time_freq) {
// get the canvas context for drawing
const ctx_key_time = document.getElementById('keyword_time_line_chart').getContext('2d')
const myoptions = {
type: 'line',
data: {
datasets: [
{
label: 's2',
borderColor: 'red',
data: data_key_time_freq // your data here!
}
]
},
options: {
legend: {
display: false
},
scales: {
xAxes: [
{
type: 'time',
time: {
unit: 'day',
displayFormats: {
//day: 'DD-MM-YYYY'
day: 'MM/DD'
}
}
}
],
yAxes: [
{
ticks: {
beginAtZero: true
},
display: true,
scaleLabel: {
display: true,
labelString: '出現次數'
}
}
]
}
}
}
// destroy the previous chart, if any
if (line_chart) {
line_chart.destroy()
}
// draw the new chart
line_chart = new Chart(ctx_key_time, myoptions)
}
</script>
{% endblock %}
```
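The line chart consumes a list of `{x, y}` points, one per day. As a rough sketch of how the back end can produce that shape, the following standalone example groups article dates by day with `pd.Grouper`; the function name `daily_counts` and the sample dates are assumptions for illustration only.

```python
import pandas as pd


def daily_counts(dates):
    """Turn a list of 'YYYY-MM-DD' strings into Chart.js-style {x, y} points."""
    # every article contributes 1 to the day it was published
    frame = pd.DataFrame({"date_index": pd.to_datetime(dates), "freq": 1})
    per_day = frame.groupby(pd.Grouper(key="date_index", freq="D"))["freq"].sum()
    return [{"x": day.strftime("%Y-%m-%d"), "y": int(n)}
            for day, n in per_day.items()]


points = daily_counts(["2025-03-20", "2025-03-20", "2025-03-22"])
print(points)
```

Grouping by calendar day also emits the days in between that have no articles, with a count of 0, so the line drops to zero instead of skipping the gap.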
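## Study the ==Who Was Biggest Yesterday (or Today)== Page
1. Starting from `熱門人物關鍵字分析` (hot person keyword analysis), add a ==time filter== that keeps **only the last day** of news.
2. Count the **top 3 to 5 person names** in each news category and save the result as `"hot persons of yesterday.csv"`.
3. In the Django site, ==load the data from the back end via Ajax== and render it on the front-end page; revise the page so the presentation looks better.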
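
The first two steps can be sketched as follows. This is only one possible approach under stated assumptions: the column `persons` (a stringified list of person names per article), the sample rows, and the function name `hot_persons_of_last_day` are hypothetical, and the real dataset may store person entities in a different column or format.

```python
import ast
from collections import Counter

import pandas as pd

# Hypothetical sample; the `persons` column name and format are assumptions
news = pd.DataFrame({
    "date": ["2025-03-26", "2025-03-27", "2025-03-27", "2025-03-27"],
    "category": ["人物", "人物", "人物", "議題"],
    "persons": ["['甲']", "['甲', '乙']", "['甲']", "['丙']"],
})


def hot_persons_of_last_day(df, top_n=3):
    """Return the top_n most-mentioned persons per category on the last day."""
    last_day = df["date"].max()            # time filter: keep only the last day
    recent = df[df["date"] == last_day]
    rows = []
    for cate, group in recent.groupby("category"):
        counter = Counter()
        for names in group["persons"]:
            counter.update(ast.literal_eval(names))
        for name, count in counter.most_common(top_n):
            rows.append({"category": cate, "person": name, "count": count})
    return pd.DataFrame(rows, columns=["category", "person", "count"])


hot = hot_persons_of_last_day(news, top_n=3)
print(hot)
```

Calling `hot.to_csv("hot persons of yesterday.csv", index=False)` would then save the result under the file name from step 2. For step 3, a Django view could read that file and return something like `{'data': hot.to_dict(orient='records')}` through `JsonResponse` for the Ajax call to render.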
---
:::spoiler Last updated
>==First version==[time=2025 3 27 , 11:45 PM][color=#786ff7]
<!-- >Second version[time=2025 3 24 , 3:20 PM][color=#ce770c] -->
<!-- >Third version[time=2025 3 24 , 3:20 PM][color=#ce770c] -->
>**Final version[time=2025 3 27 , 11:45 PM]**[color=#EA0000]
:::