
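Week 6: Big Data Analytics Practice & Business Intelligence and Big Data Analysis

Reusing the previously scraped multi-category news (news_preprocessed.csv); Advanced Query: Custom Keyword Popularity Analysis; Who Was Biggest Yesterday (or Today)

tags: Big Data Analytics Practice, Business Intelligence and Big Data Analysis, Master's, Review, NKUST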

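Week 6 Content

Dates: 2025/3/25 (Tue), 2025/3/27 (Thu)

Overview

  1. Build the "Advanced Query: Custom Keyword Popularity Analysis" and "Who Was Biggest Yesterday (or Today)" pages.
  2. Write the homework in a Word file with annotated screenshots, then export it as a PDF.
  3. Build the Custom Keyword Popularity Analysis page directly and import news_preprocessed as the dataset.
  4. Study Who Was Biggest Yesterday (or Today): take the Hot Person Keyword Analysis and add a time filter that keeps only the news from the last day.
  5. Count the top 3 to 5 person names in each news category and save them as "hot persons of yesterday.csv".
  6. In the Django site, load the result from the back end with Ajax and display it on the front-end page; adjust the page so that it looks better.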

Download this week's course files

Note

Select all files in w06 and download them.


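If the file link no longer works because the instructor changed it, download the files from the GitHub repository.

Build the Advanced Query: Custom Keyword Popularity Analysis page directly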

keyword query using your dataset

(HW)

Register 'app_user_keyword' in INSTALLED_APPS in settings.py.

settings.py

INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    # 'corsheaders',  # cross-site requests
    'app_top_keyword',
    'app_top_person',
    'app_user_keyword',
]

Then add the route path('userkeyword/', include('app_user_keyword.urls')), to the project URL configuration.

website_configs urls.py

from django.contrib import admin
from django.urls import path, include

urlpatterns = [
    # path('admin/', admin.site.urls),
    # top keywords
    path('topword/', include('app_top_keyword.urls')),
    # top persons
    path('topperson/', include('app_top_person.urls')),
    # user keyword analysis
    path('userkeyword/', include('app_user_keyword.urls')),
]

In the front-end template, link to the page with {% url 'app_user_keyword:home' %}.

templates base.html

<!-- advanced custom analysis -->
<div class="btn-group">
  <button type="button" class="btn dropdown-toggle" data-bs-toggle="dropdown" aria-expanded="false">Advanced Query</button>
  <div class="dropdown-menu">
    <a class="dropdown-item" href="{% url 'app_user_keyword:home' %}">Custom Keyword Popularity Analysis</a>
    <a class="dropdown-item" href="#">Custom Full-Text Search and Association Analysis</a>
    <a class="dropdown-item" href="#">Custom Keyword Sentiment Analysis</a>
  </div>
</div>

Next, add the routes for the page (home.html) and for the dataset API.

app_user_keyword urls.py

from django.urls import path
from app_user_keyword import views

app_name = "app_user_keyword"

urlpatterns = [
    # the first way:
    path('', views.home, name='home'),
    path('api_get_top_userkey/', views.api_get_top_userkey),
    # the second way:
    # path('top_userkey/', views.home, name='home'),
    # path('top_userkey/api_get_top_userkey/', views.api_get_top_userkey),
]

'''
# the first way:
The url path on the browser will be
http://localhost:8000/userkeyword/

# the second way:
The url path on the browser will be
http://localhost:8000/userkeyword/top_userkey/

The ajax url is as the following:
$.ajax({
    type: "POST",
    url: "api_get_top_userkey/",
'''

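Put together the functions that filter the news by keyword.

For detailed explanations of these functions, study these three notebooks from the instructor: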

  1. 10-Filtering news with time period.ipynb
  2. 15-User keywords query and frequency calculation.ipynb
  3. 30-Time-based keyword frequency for line trend chart in Django.ipynb

views.py

# Create your views here.
import ast
from datetime import datetime, timedelta

import pandas as pd
from django.http import JsonResponse
from django.shortcuts import render
from django.views.decorators.csrf import csrf_exempt

# (1) we can load data using read_csv()
# global variable
# df = pd.read_csv('dataset/news_dataset_preprocessed_for_django.csv', sep='|')

# (2) we can load data using load_df_data() function
def load_df_data():
    # df is a global variable
    global df
    df = pd.read_csv('app_user_keyword/dataset/blow_news_preprocessed.csv', sep=',')

# We should reload df when necessary
load_df_data()

# home page
def home(request):
    return render(request, 'app_user_keyword/home.html')

# When POST is used, make this function be exempted from the csrf
@csrf_exempt
def api_get_top_userkey(request):
    # (1) get keywords, category, condition, and weeks passed from frontend
    userkey = request.POST.get('userkey')
    cate = request.POST.get('cate')
    cond = request.POST.get('cond')
    weeks = int(request.POST.get('weeks'))
    key = userkey.split()

    # (2) make df_query global, so it can be used by other functions
    global df_query

    # (3) filter dataframe
    df_query = filter_dataFrame(key, cond, cate, weeks)
    # print(len(df_query))

    # (4) get frequency data
    key_freq_cat, key_occurrence_cat = count_keyword(df_query, key)
    print(key_occurrence_cat)

    # (5) get line chart data
    # key_time_freq = [
    #     '{"x": "2019-03-07", "y": 2}',
    #     '{"x": "2019-03-08", "y": 2}',
    #     '{"x": "2019-03-09", "y": 13}']
    key_time_freq = get_keyword_time_based_freq(df_query)

    # (6) response all data to frontend home page
    response = {
        'key_occurrence_cat': key_occurrence_cat,
        'key_freq_cat': key_freq_cat,
        'key_time_freq': key_time_freq,
    }
    return JsonResponse(response)

def filter_dataFrame(user_keywords, cond, cate, weeks):
    # end date: the date of the latest record of news
    end_date = df.date.max()
    # start date: `weeks` weeks before the end date
    start_date = (datetime.strptime(end_date, '%Y-%m-%d').date() - timedelta(weeks=weeks)).strftime('%Y-%m-%d')

    # proceed filtering ("全部" means all categories)
    if cate == "全部" and cond == 'and':
        df_query = df[(df.date >= start_date) & (df.date <= end_date)
                      & df.content.apply(lambda text: all((qk in text) for qk in user_keywords))]
    elif cate == "全部" and cond == 'or':
        df_query = df[(df['date'] >= start_date) & (df['date'] <= end_date)
                      & df.content.apply(lambda text: any((qk in text) for qk in user_keywords))]
    elif cond == 'and':
        df_query = df[(df.category == cate) & (df.date >= start_date) & (df.date <= end_date)
                      & df.content.apply(lambda text: all((qk in text) for qk in user_keywords))]
    elif cond == 'or':
        df_query = df[(df.category == cate) & (df['date'] >= start_date) & (df['date'] <= end_date)
                      & df.content.apply(lambda text: any((qk in text) for qk in user_keywords))]
    return df_query

# ** How many pieces of news were the keyword(s) mentioned in?
# ** How many times were the keyword(s) mentioned?
# For the df_query, count the occurrence and frequency for every category:
# (1) cate_occurence={}  number of pieces containing the keywords
# (2) cate_freq={}       number of times the keywords were mentioned
news_categories = ['全部', '人物', '議題', '新聞', '雜吹']

def count_keyword(query_df, user_keywords):
    cate_occurence = {}
    cate_freq = {}
    for cate in news_categories:
        cate_occurence[cate] = 0
        cate_freq[cate] = 0

    for idx, row in query_df.iterrows():
        # count number of news
        cate_occurence[row.category] += 1
        cate_occurence['全部'] += 1

        # count user keyword frequency by checking every word in tokens_v2
        # (ast.literal_eval parses the stored list without executing arbitrary code)
        tokens = ast.literal_eval(row.tokens_v2)
        freq = len([word for word in tokens if (word in user_keywords)])
        cate_freq[row.category] += freq
        cate_freq['全部'] += freq

    return cate_freq, cate_occurence

def get_keyword_time_based_freq(df_query):
    date_samples = df_query.date
    query_freq = pd.DataFrame({'date_index': pd.to_datetime(date_samples),
                               'freq': [1 for _ in range(len(df_query))]})
    data = query_freq.groupby(pd.Grouper(key='date_index', freq='D')).sum()
    time_data = []
    for i, idx in enumerate(data.index):
        row = {'x': idx.strftime('%Y-%m-%d'), 'y': int(data.iloc[i].freq)}
        time_data.append(row)
    return time_data

print("app_user_keyword was loaded!")
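To make the and/or condition in filter_dataFrame easier to follow, here is a minimal sketch on toy data. The function name filter_news and the rows of toy are hypothetical and only stand in for the real dataset; the date, category and content column names follow views.py.

```python
import pandas as pd

# Toy rows standing in for the real dataset (hypothetical content).
toy = pd.DataFrame({
    "date": ["2025-03-01", "2025-03-20", "2025-03-25", "2025-03-26"],
    "category": ["新聞", "人物", "新聞", "議題"],
    "content": ["臺灣 台東 旅遊", "臺灣 選舉", "台東 颱風", "股市 行情"],
})

def filter_news(frame, keywords, cond, start_date, end_date):
    """Keep rows dated within [start_date, end_date] whose content matches the keywords."""
    # "and" requires every keyword to appear; anything else is treated as "or".
    match = all if cond == "and" else any
    # ISO dates (YYYY-MM-DD) compare correctly as plain strings.
    in_period = (frame["date"] >= start_date) & (frame["date"] <= end_date)
    has_keywords = frame["content"].apply(
        lambda text: match(kw in text for kw in keywords)
    )
    return frame[in_period & has_keywords]

both = filter_news(toy, ["臺灣", "台東"], "and", "2025-03-01", "2025-03-26")
either = filter_news(toy, ["臺灣", "台東"], "or", "2025-03-12", "2025-03-26")
print(len(both), len(either))  # 1 2
```

With "and", only the article containing both keywords survives; with "or" and a two-week window, the two recent articles that contain either keyword survive.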

Add the front-end page, and adjust the options to match the dataset (news categories) as well as the default filter keywords, value="臺灣 台東".
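Because count_keyword in views.py indexes its dictionaries by row.category, a category that exists in the dataset but not in news_categories raises a KeyError, and the radio button values must use the same names. The sketch below is one way to check this before editing the page; missing_categories and the toy frame are hypothetical, and the real check would run against the loaded df.

```python
import pandas as pd

# The category list used in views.py ("全部" means all categories).
news_categories = ['全部', '人物', '議題', '新聞', '雜吹']

def missing_categories(frame, known):
    """Return the dataset categories that are absent from the known list."""
    return sorted(set(frame["category"].dropna().unique()) - set(known))

# Toy frame with one category that the list above does not cover.
toy = pd.DataFrame({"category": ["人物", "新聞", "雜吹", "體育"]})
print(missing_categories(toy, news_categories))  # ['體育']
```

An empty result means every category in the dataset has a matching entry, so the radio options can mirror news_categories one to one.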

home.html

{% extends 'base.html' %}
{% block title %} User keyword query {% endblock %}
{% block content %}
<div class="col-lg-12">
  <h1>Analyze the keywords you care about</h1>
  <p>Run a popularity analysis on each keyword you enter</p>
</div>

<div class="col-lg-6 mb-2">
  <!-- input conditions block start -->
  <div class="card">
    <div class="card-header">
      <h3 class="h6 text-uppercase mb-0">Input conditions</h3>
    </div>
    <div class="card-body">
      <div class="mb-3 row">
        <label class="col-md-3 col-form-label">Which keywords?</label>
        <div class="col-md-9">
          <input id="input_keyword" name="userkey" value="臺灣 台東" class="form-control" />
          <div class="form-text text-muted">Keywords to look up; enter several separated by spaces. Mainly person names, products, and geographic regions (this searches the segmented tokens, not the full text).</div>
        </div>
      </div>

      <div class="mb-3 row">
        <label class="col-sm-3 col-form-label">Condition</label>
        <div class="col-md-9 mb-3">
          <div class="form-check form-check-inline">
            <input class="form-check-input" type="radio" id="cond_and" value="and" name="condradio" />
            <label class="form-check-label" for="cond_and">and</label>
          </div>
          <div class="form-check form-check-inline">
            <input class="form-check-input" type="radio" id="cond_or" value="or" name="condradio" checked />
            <label class="form-check-label" for="cond_or">or</label>
          </div>
        </div>
      </div>

      <div class="mb-3 row">
        <label class="col-sm-3 col-form-label">News category</label>
        <div class="col-md-9">
          <div class="form-check form-check-inline">
            <input class="form-check-input" type="radio" id="cate_all" value="全部" name="cateradio" checked />
            <label class="form-check-label" for="cate_all">全部</label>
          </div>
          <div class="form-check form-check-inline">
            <input class="form-check-input" type="radio" id="cate_politics" value="人物" name="cateradio" />
            <label class="form-check-label" for="cate_politics">人物</label>
          </div>
          <div class="form-check form-check-inline">
            <input class="form-check-input" type="radio" id="cate_tech" value="議題" name="cateradio" />
            <label class="form-check-label" for="cate_tech">議題</label>
          </div>
          <div class="form-check form-check-inline">
            <input class="form-check-input" type="radio" id="cate_sports" value="新聞" name="cateradio" />
            <label class="form-check-label" for="cate_sports">新聞</label>
          </div>
          <div class="form-check form-check-inline">
            <input class="form-check-input" type="radio" id="cate_stocks" value="雜吹" name="cateradio" />
            <label class="form-check-label" for="cate_stocks">雜吹</label>
          </div>
        </div>
      </div>

      <div class="mb-3 row">
        <label class="col-md-3 col-form-label">How many recent weeks?</label>
        <div class="col-md-9">
          <div class="form-check form-check-inline">
            <input class="form-check-input" type="radio" id="wk1" value="1" name="wkradio" />
            <label class="form-check-label" for="wk1">1</label>
          </div>
          <div class="form-check form-check-inline">
            <input class="form-check-input" type="radio" id="wk2" value="2" name="wkradio" checked />
            <label class="form-check-label" for="wk2">2</label>
          </div>
          <div class="form-check form-check-inline">
            <input class="form-check-input" type="radio" id="wk3" value="3" name="wkradio" />
            <label class="form-check-label" for="wk3">3</label>
          </div>
          <div class="form-check form-check-inline">
            <input class="form-check-input" type="radio" id="wk4" value="4" name="wkradio" />
            <label class="form-check-label" for="wk4">4</label>
          </div>
          <div class="form-check form-check-inline">
            <input class="form-check-input" type="radio" id="wk6" value="6" name="wkradio" />
            <label class="form-check-label" for="wk6">6</label>
          </div>
          <div class="form-check form-check-inline">
            <input class="form-check-input" type="radio" id="wk8" value="8" name="wkradio" />
            <label class="form-check-label" for="wk8">8</label>
          </div>
          <div class="form-check form-check-inline">
            <input class="form-check-input" type="radio" id="wk12" value="12" name="wkradio" />
            <label class="form-check-label" for="wk12">12</label>
          </div>
          <div class="form-text text-muted">Counting back from the date of the latest data, how many weeks?</div>
        </div>
      </div>

      <div class="mb-3 row">
        <div class="col-md-9 ms-auto">
          <button type="button" id="btn_ok" class="btn btn-primary">Search</button>
        </div>
      </div>
    </div>
  </div>
</div>
<!-- input block end -->

<!-- display block -->
<div class="col-lg-6 mb-2">
  <div class="card">
    <div class="card-header">
      <h3 class="h6 text-uppercase mb-0">Frequency over time</h3>
    </div>
    <div class="card-body">
      <small>See how many articles were published at each point in time (volume of coverage)</small>
      <div class="row">
        <canvas id="keyword_time_line_chart"></canvas>
      </div>
    </div>
  </div>
</div>
<!-- block end -->

<!-- popularity block: number of articles -->
<div class="col-lg-6 mb-2">
  <div class="card">
    <div class="card-header">
      <h3 class="h6 text-uppercase mb-0">Popularity: how many news articles mention it?</h3>
    </div>
    <div class="card-body">
      <ul id="keyword_article_count"></ul>
    </div>
  </div>
</div>
<!-- block end -->

<!-- popularity block: number of mentions -->
<div class="col-lg-6 mb-2">
  <div class="card">
    <div class="card-header">
      <h3 class="h6 text-uppercase mb-0">Popularity: how many times is it mentioned?</h3>
    </div>
    <div class="card-body">
      <ul id="keyword_frequency"></ul>
    </div>
  </div>
</div>
<!-- block end -->
{% endblock %}

{% block extra_js %}
<!-- The JavaScript here is loaded and executed after the page has been initialized -->
<!-- Chart.js scripts -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/moment.js/2.13.0/moment.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/Chart.js/2.7.3/Chart.min.js"></script>

<!-- code section -->
<script>
  call_ajax()

  // button event
  $('#btn_ok').on('click', function () {
    call_ajax()
  }) //event function

  $("input[name='cateradio']").on('change', function () {
    call_ajax()
  }) //event function

  $("input[name='wkradio']").on('change', function () {
    call_ajax()
  }) //event function

  $("input[name='condradio']").on('change', function () {
    call_ajax()
  }) //event function

  function call_ajax() {
    const userkey = $('#input_keyword').val()
    const weeks = $("input[name='wkradio']:checked").val()
    const cate = $("input[name='cateradio']:checked").val()
    const cond = $("input[name='condradio']:checked").val()

    if (userkey.length < 2) {
      alert('Keywords must not be blank or shorter than two characters!')
      return 0
    }

    $.ajax({
      type: 'POST',
      url: 'api_get_top_userkey/',
      //url: 'http://163.18.23.20:8000/userkeyword/api_get_top_userkey/',
      data: { userkey: userkey, cate: cate, weeks: weeks, cond: cond }, // pass to server
      success: function (received) {
        const article_count = received['key_occurrence_cat']
        console.log(article_count)
        $('#keyword_article_count').empty()
        // wrap each entry in an li tag and append it for display
        for (let key in article_count) {
          let paste = '<li>' + key + ':' + article_count[key] + '</li>'
          $('#keyword_article_count').append(paste)
        }

        const kwfreq = received['key_freq_cat']
        console.log(kwfreq)
        $('#keyword_frequency').empty()
        for (let key in kwfreq) {
          let paste = '<li>' + key + ':' + kwfreq[key] + '</li>'
          $('#keyword_frequency').append(paste)
        }

        const data_key_time_freq = received['key_time_freq']
        console.log(data_key_time_freq)
        showtimechart(data_key_time_freq)
      } //function
    }) //ajax
  } //call_ajax()

  // global variable that holds the chart instance
  let line_chart = null

  function showtimechart(data_key_time_freq) {
    // get the drawing context
    const ctx_key_time = document.getElementById('keyword_time_line_chart').getContext('2d')
    const myoptions = {
      type: 'line',
      data: {
        datasets: [
          {
            label: 's2',
            borderColor: 'red',
            data: data_key_time_freq // your data here!
          }
        ]
      },
      options: {
        legend: { display: false },
        scales: {
          xAxes: [
            {
              type: 'time',
              time: {
                unit: 'day',
                displayFormats: {
                  //day: 'DD-MM-YYYY'
                  day: 'MM/DD'
                }
              }
            }
          ],
          yAxes: [
            {
              ticks: { beginAtZero: true },
              display: true,
              scaleLabel: { display: true, labelString: 'Occurrences' }
            }
          ]
        }
      }
    }

    // check for and destroy the old chart
    if (line_chart) {
      line_chart.destroy()
    }
    // draw the new chart
    line_chart = new Chart(ctx_key_time, myoptions)
  }
</script>
{% endblock %}
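The line chart reads key_time_freq as a list of {x, y} points, one per day. As a rough sketch of how the back end can produce that shape, the snippet below counts articles per calendar day with the same pd.Grouper approach used in get_keyword_time_based_freq. The helper name daily_counts and the sample dates are hypothetical.

```python
import pandas as pd

def daily_counts(dates):
    """Count articles per calendar day; days without articles get a count of 0."""
    frame = pd.DataFrame({"date_index": pd.to_datetime(dates), "freq": 1})
    # Grouping by day creates one bin per calendar day between the first and last date.
    per_day = frame.groupby(pd.Grouper(key="date_index", freq="D"))["freq"].sum()
    return [{"x": day.strftime("%Y-%m-%d"), "y": int(n)} for day, n in per_day.items()]

points = daily_counts(["2025-03-24", "2025-03-24", "2025-03-26"])
print(points)
# [{'x': '2025-03-24', 'y': 2}, {'x': '2025-03-25', 'y': 0}, {'x': '2025-03-26', 'y': 1}]
```

Days with no matching article still appear with y = 0, so the line drops to zero instead of skipping the gap.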

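Study the Who Was Biggest Yesterday (or Today) page

  1. Take the Hot Person Keyword Analysis and add a time filter that keeps only the news from the last day.
  2. Count the top 3 to 5 person names in each news category and save them as "hot persons of yesterday.csv".
  3. In the Django site, load the result from the back end with Ajax and display it on the front-end page; adjust the page so that it looks better.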
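The first two steps above can be sketched roughly as follows. This is only one possible approach and it rests on assumptions: it supposes the dataset has an entities column holding a stringified list of (word, tag) pairs in which person names carry the tag 'PERSON'. The function name hot_persons_of_last_day and the toy rows are hypothetical.

```python
import ast
from collections import Counter

import pandas as pd

def hot_persons_of_last_day(frame, top_n=3):
    """Return the top_n person names per category for the latest date in frame."""
    last_day = frame["date"].max()  # ISO date strings sort chronologically
    latest = frame[frame["date"] == last_day]
    rows = []
    for cate, group in latest.groupby("category"):
        counter = Counter()
        for raw in group["entities"]:
            # Assumed format: "[('name', 'PERSON'), ('place', 'GPE'), ...]"
            pairs = ast.literal_eval(raw)
            counter.update(word for word, tag in pairs if tag == "PERSON")
        for name, count in counter.most_common(top_n):
            rows.append({"category": cate, "person": name, "count": count})
    return pd.DataFrame(rows, columns=["category", "person", "count"])

# Toy data; only the rows dated 2025-03-27 should be counted.
toy = pd.DataFrame({
    "date": ["2025-03-26", "2025-03-27", "2025-03-27", "2025-03-27"],
    "category": ["新聞", "新聞", "新聞", "人物"],
    "entities": [
        "[('Alice', 'PERSON')]",
        "[('Bob', 'PERSON'), ('Taipei', 'GPE'), ('Bob', 'PERSON')]",
        "[('Carol', 'PERSON'), ('Bob', 'PERSON')]",
        "[('Dave', 'PERSON')]",
    ],
})

result = hot_persons_of_last_day(toy)
# Without a path, to_csv returns the CSV text; passing
# "hot persons of yesterday.csv" instead would write the file to disk.
csv_text = result.to_csv(index=False)
print(csv_text)
```

The Django view can then read the saved CSV and return it as JSON for the Ajax call, in the same way api_get_top_userkey returns its dictionaries.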

Last updated

First version: 2025/3/27, 11:45 PM

Final version: 2025/3/27, 11:45 PM