# Data Processing Workflow_share
This is a workflow to process raw time-series data and detect seismic/micro-seismic events in parallel on a cluster. :tada:
### Table of Contents
[toc]
## **Example of Single-Thread Data Processing**
### **1. module load**
We employ **[ObsPy](https://docs.obspy.org/)** in Python to process the data.
``` python=1
import os
import obspy
from obspy.clients.fdsn import Client
from obspy import read, read_inventory
from obspy import UTCDateTime
from datetime import datetime
from obspy.signal.trigger import classic_sta_lta
from obspy.signal.trigger import plot_trigger
from obspy.signal.trigger import trigger_onset
```
### **2. Data Import**
The series data stored on the NAS need to be transferred to the cluster in advance, because the compute nodes of our **Sugon** cluster cannot access the Internet.
``` python=1
st = read('G610-2018-01-08.mseed')
```
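Day-long continuous records are sometimes split into several traces with gaps. A minimal sketch for inspecting the stream and, if needed, merging gapped traces before further processing (the merge parameters below are assumptions, not part of the original workflow):
```python
# Inspect the loaded stream: number of traces, channels and time spans
print(st)

# Optional: make each channel continuous; method=1 with fill_value=0 zero-fills gaps.
# Pick parameters that suit your data quality requirements.
st.merge(method=1, fill_value=0)
```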

### **3. Remove instrument response**
The instrument response needs to be removed from the raw data. If you download seismic data via `obspy.clients.fdsn`, the station metadata can be attached to the stream directly.
Our raw data on the NAS were downloaded with `wget`, so we use a StationXML file to remove the response. First, fetch the StationXML file:
``` python=1
client = Client("KNMI")
tstart = UTCDateTime(2014,1,5)
tend = UTCDateTime(2015,3,18)
minlon = 6.3
maxlon = 7.1
minlat = 53.0
maxlat = 53.5
inventory = client.get_stations(network="NL", station="G*4",
                                minlatitude=minlat, maxlatitude=maxlat,
                                minlongitude=minlon, maxlongitude=maxlon,
                                level="response")
inventory.write("Groningen_station_resp.xml",format="STATIONXML")
```
Then remove the response using the StationXML file.
The routine automatically picks the correct response for each trace.
:warning: A filter band `pre_filt` needs to be defined to prevent amplifying noise during the deconvolution.
``` python=1
inv = read_inventory('Groningen_station_resp.xml')
pre_filt = (0.005, 0.006, 30.0, 35.0)
st.remove_response(inventory=inv, output='DISP', pre_filt=pre_filt)
```
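If the waveforms were instead requested through `obspy.clients.fdsn`, the response metadata can be attached at download time, as noted above, and no separate StationXML file is needed. A minimal sketch of that alternative route; the station, channel and time window are placeholders:
```python
from obspy.clients.fdsn import Client
from obspy import UTCDateTime

client = Client("KNMI")
t0 = UTCDateTime(2018, 1, 8)
# attach_response=True stores the matching response with each downloaded trace
st_dl = client.get_waveforms(network="NL", station="G610", location="*",
                             channel="*", starttime=t0, endtime=t0 + 3600,
                             attach_response=True)
st_dl.remove_response(output='DISP', pre_filt=pre_filt)
```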

### **4. Low-pass Filter**
```python=1
st_LF = st.copy()                      # keep the response-corrected stream untouched
st_LF.filter("lowpass", freq=5, corners=10)
st_LF.plot()
st_LF[0].plot(type='dayplot')          # day plot of the first trace
```
### **5. Event Detection**
This workflow employs the classical **STA/LTA** method as an example of the detection step.
However, we strongly recommend selecting a detection method suited to your own research.
```python
st_D = st_LF.select(component="Z")
tr = st_D[0]
tr.plot(type="relative")
df = tr.stats.sampling_rate
cft = classic_sta_lta(tr.data, int(10 * df), int(30 * df))  # 10 s STA / 30 s LTA windows
plot_trigger(tr, cft, 2.6, 2)  # trigger on at 2.6, off at 2
```
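Other ObsPy trigger routines can be swapped in with the same characteristic-function workflow; a minimal sketch using the recursive STA/LTA instead (window lengths and thresholds here are illustrative assumptions, not tuned values):
```python
from obspy.signal.trigger import recursive_sta_lta

# Recursive STA/LTA on the same trace; window lengths are given in samples
cft_rec = recursive_sta_lta(tr.data, int(5 * df), int(20 * df))
plot_trigger(tr, cft_rec, 3.0, 1.5)
```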


### **6. Event Extraction**
Save the data of each detected event into a separate file for further analysis.
Each event window is trimmed with a 15 s buffer before the trigger onset and after the trigger end; the resulting files can then be assembled for later analysis.
```python=1
path = os.path.expanduser('~/EventDetaction/')  # expand '~' so the folder is created under the home directory
if not os.path.exists(path):
    os.makedirs(path)
events_time = trigger_onset(cft, 2.6, 2, max_len=9e99, max_len_delete=False)
t = tr.stats.starttime
for i in range(len(events_time)):
    t1 = t + events_time[i, 0] / df - 15  # start time of single event
    t2 = t + events_time[i, 1] / df + 15  # end time of single event
    st_tr = st.copy()
    st_events = st_tr.trim(t1, t2)
    event_path = path + str(t)[0:10] + '-Event' + str(i + 1) + '.mseed'
    st_events.write(event_path, format='MSEED')
    st_events.plot()
```
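As a quick check, the saved event files can be read back and assembled for later analysis; a minimal sketch using `glob` on the naming scheme used above:
```python
import glob
from obspy import read

event_files = sorted(glob.glob(os.path.join(path, '*-Event*.mseed')))
for f in event_files:
    st_ev = read(f)
    print(f, len(st_ev))  # one Stream per detected event
```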
 
 

## **Process Data in Parallel on the Cluster**
Based on the single-threaded workflow above, we can process data on the cluster in parallel. In this workflow, we distribute the series data of different **stations** across hundreds of **cores** on the cluster and process them in parallel.
### **1. Create folder structure**
This process requires a folder with three sub-folders under `~/` on the cluster. Please use the `mkdir` command to create the folder structure shown below:
```graphviz
digraph hierarchy {
nodesep=1.0 // increases the separation between nodes
node [color=Red,fontname=Courier,shape=box] // All nodes will have this shape and colour
edge [color=Blue, style=dashed] //All the lines look like this
ParaProcess->{Dataset Program EventDetaction}
Dataset->{SeriseData_mseed stationXML}
Program->{Para_sh process_py}
EventDetaction
{rank=same; Dataset Program EventDetaction} // Put them on the same level
}
```
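Equivalently, the same structure can be created with a few lines of Python instead of manual `mkdir` commands; a minimal sketch, assuming the `ParaProcess` root sits directly under the home directory:
```python
import os

root = os.path.expanduser('~/ParaProcess')
for sub in ('Dataset', 'Program', 'EventDetaction'):
    os.makedirs(os.path.join(root, sub), exist_ok=True)  # create each sub-folder
```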
### **2. Upload Data**
:::success
Before processing data, we should upload the **series dataset** and the **StationXML** file to the cluster under `/public/home/user/ParaProcess/Dataset`. Of course, you can change the path as you wish by editing the code below.
:warning: If you change the path, remember that all file paths must be absolute.
:::
### **3. Create the shell script in the Program folder**
`cd` into the `Program` folder you have created. Then run `touch Para.sh` and `vim Para.sh` to edit the shell script with the following code:
```shell=
#!/bin/bash
cd /public/home/aczpj7g1ph/ParaProcess/Dataset # This is the datasets dir.
file=`ls *.mseed`; # We process series data saved as .mseed files.
name=${file%.*}
# Make the dir of .py files. We process those data via python code.
if [ -d /public/home/aczpj7g1ph/ParaProcess/Program ];then
cd /public/home/aczpj7g1ph/ParaProcess/Program
else
mkdir /public/home/aczpj7g1ph/ParaProcess/Program
cd /public/home/aczpj7g1ph/ParaProcess/Program
fi
# Generate python and shell code for each data file.
for i in $name
do
file_name=${i%.*} # We named the python file following data file name.
touch ${file_name}_workflow.py
# Import modules
echo "#!/usr/bin/env python " >> ${file_name}_workflow.py
echo "import sys" >> ${file_name}_workflow.py
echo "import os" >> ${file_name}_workflow.py
echo "import obspy" >> ${file_name}_workflow.py
echo "from obspy import read, read_inventory, UTCDateTime" >> ${file_name}_workflow.py
echo "from datetime import datetime" >> ${file_name}_workflow.py
echo "from obspy.signal.trigger import classic_sta_lta" >> ${file_name}_workflow.py
echo "from obspy.signal.trigger import trigger_onset" >> ${file_name}_workflow.py
# Import Data file into Data_processing program.
echo "filename = '${file_name}.mseed'" >> ${file_name}_workflow.py
echo "filepath = '/public/home/aczpj7g1ph/ParaProcess/Dataset/${file_name}.mseed'" >> ${file_name}_workflow.py
echo "st = read(filepath)" >> ${file_name}_workflow.py
# The StationXML file needs to be prepared in advance; it can be obtained via obspy.clients.fdsn.
echo "stationxml = '/public/home/aczpj7g1ph/ParaProcess/Dataset/Groningen_station_resp.xml'" >> ${file_name}_workflow.py
# Remove the instrument response using the StationXML file.
echo "inv = read_inventory(stationxml)" >> ${file_name}_workflow.py
echo "pre_filt = (0.005, 0.006, 30.0, 35.0)" >> ${file_name}_workflow.py
echo "st.remove_response(inventory=inv, output='DISP', pre_filt=pre_filt)" >> ${file_name}_workflow.py
# Low-pass Filter.
echo "st_LF=st.copy()" >> ${file_name}_workflow.py
echo "st_LF.filter('lowpass', freq=5, corners=10)" >> ${file_name}_workflow.py
# Event detection.
# This workflow employs the classical STA/LTA method as an example of the detection step.
# However, we strongly recommend selecting a detection method suited to your own research.
echo "st_D = st_LF.select(component='Z')" >> ${file_name}_workflow.py
echo "tr = st_D[0]" >> ${file_name}_workflow.py
echo "df = tr.stats.sampling_rate" >> ${file_name}_workflow.py
echo "cft = classic_sta_lta(tr.data, int(10 * df), int(30 * df))" >> ${file_name}_workflow.py
# Save each detected event into its own folder for further analysis.
echo "events_time = trigger_onset(cft, 2.6, 2, max_len=9e99, max_len_delete=False)" >> ${file_name}_workflow.py
echo "t = tr.stats.starttime" >> ${file_name}_workflow.py
echo "for i in range(0,len(events_time)):" >> ${file_name}_workflow.py # Each event is trimmed with a 30 s buffer before and after the trigger.
echo "    path = '/public/home/aczpj7g1ph/ParaProcess/EventDetaction/'+str(t)[0:10]+'/Event'+str(i+1)" >> ${file_name}_workflow.py
echo "    if not os.path.exists(path):" >> ${file_name}_workflow.py
echo "        os.makedirs(path)" >> ${file_name}_workflow.py
echo "    t1 = t+events_time[i,0]/df-30 # start time of single event" >> ${file_name}_workflow.py
echo "    t2 = t+events_time[i,1]/df+30 # end time of single event" >> ${file_name}_workflow.py
echo "    st_tr = st.copy()" >> ${file_name}_workflow.py
echo "    st_events = st_tr.trim(t1, t2)" >> ${file_name}_workflow.py
echo "    event_path = path+'/'+filename[0:4]+'.mseed'" >> ${file_name}_workflow.py
echo "    st_events.write(event_path, format='MSEED')" >> ${file_name}_workflow.py
# generate shell script
touch ${file_name}_process.sh
echo "#!/bin/bash" >> ${file_name}_process.sh
echo "#SBATCH --job-name=${file_name}_p" >> ${file_name}_process.sh
echo "#SBATCH --mem-per-cpu=2gb" >> ${file_name}_process.sh
echo "#SBATCH --ntasks=1" >> ${file_name}_process.sh
echo "#SBATCH --nodes=1" >> ${file_name}_process.sh
echo "#SBATCH --ntasks-per-node=1" >> ${file_name}_process.sh
echo "#SBATCH --cpus-per-task=1" >> ${file_name}_process.sh
echo "#SBATCH --output=${file_name}.log" >> ${file_name}_process.sh
echo "#SBATCH --partition=kshcnormal" >> ${file_name}_process.sh
echo "# load the environment" >> ${file_name}_process.sh
echo "module purge" >> ${file_name}_process.sh
echo "source activate obspy" >> ${file_name}_process.sh
echo "python ${file_name}_workflow.py" >> ${file_name}_process.sh
sbatch ${file_name}_process.sh
done
```
:::info
This main shell script generates the data-processing code and the corresponding job-submission scripts, and submits the jobs to the cluster automatically.
:::
### **4. Submit Job**
Submit the job to the cluster via the Slurm system:
```Shell=1
bash Para.sh
```
This script automatically reads the series data in the `Dataset` folder and generates a corresponding `.py` file for each `.mseed` file under the `Program` folder. Those `.py` files are then run automatically to process the data.
### **5. Result**
Through this data processing workflow, we can detect seismic and micro-seismic events; each event is trimmed and saved separately in the `EventDetaction` folder with the following folder structure:
```mermaid
graph LR
EventDetaction --> Date --> Event_number --> StationNumber_mseed
```
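A quick way to summarise the output is to walk this folder tree and count the event folders per date; a minimal sketch, assuming the `EventDetaction` root used in the scripts above:
```python
import os

root = '/public/home/aczpj7g1ph/ParaProcess/EventDetaction'
for date in sorted(os.listdir(root)):
    events = os.listdir(os.path.join(root, date))
    print(date, len(events), 'events')
```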
Some examples from a test run of the workflow:




## **Data Download**
### Station information
http://rdsa.knmi.nl/fdsnws/station/1/#urls
### DataSelect
http://rdsa.knmi.nl/fdsnws/dataselect/1/
http://rdsa.knmi.nl/fdsnws/dataselect/1/query?starttime=2017-01-01T00%3A00%3A00&endtime=2017-01-01T00%3A05%3A00&network=NL&station=VKB&channel=BHZ&nodata=404
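The same station and dataselect services can also be queried through the ObsPy FDSN client instead of raw URLs; a minimal sketch mirroring the example query above (NL.VKB, channel BHZ, five minutes of data):
```python
from obspy.clients.fdsn import Client
from obspy import UTCDateTime

client = Client("KNMI")
t0 = UTCDateTime("2017-01-01T00:00:00")
st = client.get_waveforms(network="NL", station="VKB", location="*",
                          channel="BHZ", starttime=t0, endtime=t0 + 300)
st.plot()
```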
# Design the Template
Employ the AR picker (`obspy.signal.trigger.ar_pick`) to pick the P-wave and S-wave arrival times.
Reference:
http://www.iitk.ac.in/nicee/wcee/article/13_786.pdf
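A minimal sketch of the AR-AIC picker applied to a trimmed three-component event file; the file name follows the naming scheme from the extraction step, and the picker parameters are the illustrative values from the ObsPy documentation, not tuned for this dataset:
```python
from obspy import read
from obspy.signal.trigger import ar_pick

st = read('2018-01-08-Event1.mseed')          # an event file written by the extraction step
tr_z = st.select(component='Z')[0]
tr_n = st.select(component='N')[0]
tr_e = st.select(component='E')[0]
df = tr_z.stats.sampling_rate

# Band 1-20 Hz; P/S window lengths, AR orders and variance windows as in the ObsPy example
p_pick, s_pick = ar_pick(tr_z.data, tr_n.data, tr_e.data, df,
                         1.0, 20.0, 1.0, 0.1, 4.0, 1.0, 2, 8, 0.1, 0.2)
print('P arrival:', tr_z.stats.starttime + p_pick)
print('S arrival:', tr_z.stats.starttime + s_pick)
```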