# Data Processing Workflow_share
This is a workflow to process raw time-series data and detect seismic/micro-seismic events in parallel on a cluster. :tada:
### Table of Contents
[toc]
## **Example of Single-Thread Data Processing**
### **1. module load**
We employ **[ObsPy](https://docs.obspy.org/)** in Python to process the data.
``` python=1
import os
import obspy
from obspy.clients.fdsn import Client
from obspy import read, read_inventory
from obspy import UTCDateTime
from datetime import datetime
from obspy.signal.trigger import classic_sta_lta
from obspy.signal.trigger import plot_trigger
from obspy.signal.trigger import trigger_onset
```
### **2. Data Import**
The series data stored on the NAS need to be transferred to the cluster in advance, because the compute nodes of our **Sugon** cluster cannot access the Internet.
``` python=1
st = read('G610-2018-01-08.mseed')
```
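Day-long continuous records are sometimes split into several traces with gaps. A minimal sketch for inspecting the stream and, if needed, merging gapped traces before further processing (the merge parameters below are assumptions, not part of the original workflow):
```python
# Inspect the loaded stream: number of traces, channels and time spans
print(st)

# Optional: make each channel continuous; method=1 with fill_value=0 zero-fills gaps.
# Pick parameters that suit your data quality requirements.
st.merge(method=1, fill_value=0)
```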

### **3. Remove instrument response**
The instrument response needs to be removed from the raw data. If you download seismic data via `obspy.clients.fdsn`, the station metadata can be attached to the stream directly.
Our raw data on the NAS were downloaded with `wget`, so we use a StationXML file to remove the response. First, fetch the StationXML file:
``` python=1
client = Client("KNMI")
tstart = UTCDateTime(2014,1,5)
tend = UTCDateTime(2015,3,18)
minlon = 6.3
maxlon = 7.1
minlat = 53.0
maxlat = 53.5
inventory = client.get_stations(network="NL", station="G*4",
                                minlatitude=minlat, maxlatitude=maxlat,
                                minlongitude=minlon, maxlongitude=maxlon,
                                level="response")
inventory.write("Groningen_station_resp.xml",format="STATIONXML")
```
Then remove the response using the StationXML file.
The routine automatically picks the correct response for each trace.
:warning: A filter band `pre_filt` needs to be defined to prevent amplifying noise during the deconvolution.
``` python=1
inv = read_inventory('Groningen_station_resp.xml')
pre_filt = (0.005, 0.006, 30.0, 35.0)
st.remove_response(inventory=inv, output='DISP', pre_filt=pre_filt)
```
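If the waveforms were instead requested through `obspy.clients.fdsn`, the response metadata can be attached at download time, as noted above, and no separate StationXML file is needed. A minimal sketch of that alternative route; the station, channel and time window are placeholders:
```python
from obspy.clients.fdsn import Client
from obspy import UTCDateTime

client = Client("KNMI")
t0 = UTCDateTime(2018, 1, 8)
# attach_response=True stores the matching response with each downloaded trace
st_dl = client.get_waveforms(network="NL", station="G610", location="*",
                             channel="*", starttime=t0, endtime=t0 + 3600,
                             attach_response=True)
st_dl.remove_response(output='DISP', pre_filt=pre_filt)
```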

### **4. Low-pass Filter**
```python=1
st_LF = st.copy()                      # keep the response-corrected stream untouched
st_LF.filter("lowpass", freq=5, corners=10)
st_LF.plot()
st_LF[0].plot(type='dayplot')          # day plot of the first trace
```
### **5. Event Detection**
This workflow employs the classical **STA/LTA** method as an example of the detection step.
However, we strongly recommend selecting a detection method suited to your own research.
```python
st_D = st_LF.select(component="Z")
tr = st_D[0]
tr.plot(type="relative")
df = tr.stats.sampling_rate
cft = classic_sta_lta(tr.data, int(10 * df), int(30 * df))  # 10 s STA / 30 s LTA windows
plot_trigger(tr, cft, 2.6, 2)  # trigger on at 2.6, off at 2
```
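Other ObsPy trigger routines can be swapped in with the same characteristic-function workflow; a minimal sketch using the recursive STA/LTA instead (window lengths and thresholds here are illustrative assumptions, not tuned values):
```python
from obspy.signal.trigger import recursive_sta_lta

# Recursive STA/LTA on the same trace; window lengths are given in samples
cft_rec = recursive_sta_lta(tr.data, int(5 * df), int(20 * df))
plot_trigger(tr, cft_rec, 3.0, 1.5)
```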


### **6. Event Extraction**
Save the data of each detected event into a separate file for further analysis.
Each event window is trimmed with a 15 s buffer before the trigger onset and after the trigger end; the resulting files can then be assembled for later analysis.
```python=1
path = os.path.expanduser('~/EventDetaction/')  # expand '~' so the folder is created under the home directory
if not os.path.exists(path):
    os.makedirs(path)
events_time = trigger_onset(cft, 2.6, 2, max_len=9e99, max_len_delete=False)
t = tr.stats.starttime
for i in range(len(events_time)):
    t1 = t + events_time[i, 0] / df - 15  # start time of single event
    t2 = t + events_time[i, 1] / df + 15  # end time of single event
    st_tr = st.copy()
    st_events = st_tr.trim(t1, t2)
    event_path = path + str(t)[0:10] + '-Event' + str(i + 1) + '.mseed'
    st_events.write(event_path, format='MSEED')
    st_events.plot()
```
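As a quick check, the saved event files can be read back and assembled for later analysis; a minimal sketch using `glob` on the naming scheme used above:
```python
import glob
from obspy import read

event_files = sorted(glob.glob(os.path.join(path, '*-Event*.mseed')))
for f in event_files:
    st_ev = read(f)
    print(f, len(st_ev))  # one Stream per detected event
```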
 
 

## **Process Data in Parallel on the Cluster**
Based on the single-threaded workflow above, we can process data on the cluster in parallel. In this workflow, we distribute the series data of different **stations** across hundreds of **cores** on the cluster and process them in parallel.
### **1. Create folder structure**
This process requires a folder with three sub-folders under `~/` on the cluster. Please use the `mkdir` command to create the folder structure shown below:
```graphviz
digraph hierarchy {
nodesep=1.0 // increases the separation between nodes
node [color=Red,fontname=Courier,shape=box] // All nodes will have this shape and colour
edge [color=Blue, style=dashed] //All the lines look like this
ParaProcess->{Dataset Program EventDetaction}
Dataset->{SeriseData_mseed stationXML}
Program->{Para_sh process_py}
EventDetaction
{rank=same; Dataset Program EventDetaction} // Put them on the same level
}
```
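Equivalently, the same structure can be created with a few lines of Python instead of manual `mkdir` commands; a minimal sketch, assuming the `ParaProcess` root sits directly under the home directory:
```python
import os

root = os.path.expanduser('~/ParaProcess')
for sub in ('Dataset', 'Program', 'EventDetaction'):
    os.makedirs(os.path.join(root, sub), exist_ok=True)  # create each sub-folder
```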
### **2. Upload Data**
:::success
Before processing data, we should upload the **series dataset** and the **StationXML** file to the cluster under `/public/home/user/ParaProcess/Dataset`. Of course, you can change the path as you wish by editing the code below.
:warning: If you change the path, remember that all file paths must be absolute.
:::
### **3. Create the shell script in the Program folder**
`cd` into the `Program` folder you have created. Then run `touch Para.sh` and `vim Para.sh` to edit the shell script with the following code:
```shell=
#!/bin/bash
cd /public/home/aczpj7g1ph/ParaProcess/Dataset # This is the datasets dir.
file=`ls *.mseed`; # We process series data saved as .mseed files.
name=${file%.*}
# Make the dir of .py files. We process those data via python code.
if [ -d /public/home/aczpj7g1ph/ParaProcess/Program ];then
cd /public/home/aczpj7g1ph/ParaProcess/Program
else
mkdir /public/home/aczpj7g1ph/ParaProcess/Program
cd /public/home/aczpj7g1ph/ParaProcess/Program
fi
# Generate python and shell code for each data file.
for i in $name
do
file_name=${i%.*} # We named the python file following data file name.
touch ${file_name}_workflow.py
# Import modules
echo "#!/usr/bin/env python " >> ${file_name}_workflow.py
echo "import sys" >> ${file_name}_workflow.py
echo "import os" >> ${file_name}_workflow.py
echo "import obspy" >> ${file_name}_workflow.py
echo "from obspy import read, read_inventory, UTCDateTime" >> ${file_name}_workflow.py
echo "from datetime import datetime" >> ${file_name}_workflow.py
echo "from obspy.signal.trigger import classic_sta_lta" >> ${file_name}_workflow.py
echo "from obspy.signal.trigger import trigger_onset" >> ${file_name}_workflow.py
# Import Data file into Data_processing program.
echo "filename = '${file_name}.mseed'" >> ${file_name}_workflow.py
echo "filepath = '/public/home/aczpj7g1ph/ParaProcess/Dataset/${file_name}.mseed'" >> ${file_name}_workflow.py
echo "st = read(filepath)" >> ${file_name}_workflow.py
# The StationXML file needs to be prepared in advance; it can be obtained via obspy.clients.fdsn.
echo "stationxml = '/public/home/aczpj7g1ph/ParaProcess/Dataset/Groningen_station_resp.xml'" >> ${file_name}_workflow.py
# Remove the instrument response using the StationXML file.
echo "inv = read_inventory(stationxml)" >> ${file_name}_workflow.py
echo "pre_filt = (0.005, 0.006, 30.0, 35.0)" >> ${file_name}_workflow.py
echo "st.remove_response(inventory=inv, output='DISP', pre_filt=pre_filt)" >> ${file_name}_workflow.py
# Low-pass Filter.
echo "st_LF=st.copy()" >> ${file_name}_workflow.py
echo "st_LF.filter('lowpass', freq=5, corners=10)" >> ${file_name}_workflow.py
# Event detection.
# This workflow employs the classical STA/LTA method as an example of the detection step.
# However, we strongly recommend selecting a detection method suited to your own research.
echo "st_D = st_LF.select(component='Z')" >> ${file_name}_workflow.py
echo "tr = st_D[0]" >> ${file_name}_workflow.py
echo "df = tr.stats.sampling_rate" >> ${file_name}_workflow.py
echo "cft = classic_sta_lta(tr.data, int(10 * df), int(30 * df))" >> ${file_name}_workflow.py
# Save each detected event into its own folder for further analysis.
echo "events_time = trigger_onset(cft, 2.6, 2, max_len=9e99, max_len_delete=False)" >> ${file_name}_workflow.py
echo "t = tr.stats.starttime" >> ${file_name}_workflow.py
echo "for i in range(0,len(events_time)):" >> ${file_name}_workflow.py # Each event is trimmed with a 30 s buffer before and after the trigger.
echo "    path = '/public/home/aczpj7g1ph/ParaProcess/EventDetaction/'+str(t)[0:10]+'/Event'+str(i+1)" >> ${file_name}_workflow.py
echo "    if not os.path.exists(path):" >> ${file_name}_workflow.py
echo "        os.makedirs(path)" >> ${file_name}_workflow.py
echo "    t1 = t+events_time[i,0]/df-30 # start time of single event" >> ${file_name}_workflow.py
echo "    t2 = t+events_time[i,1]/df+30 # end time of single event" >> ${file_name}_workflow.py
echo "    st_tr = st.copy()" >> ${file_name}_workflow.py
echo "    st_events = st_tr.trim(t1, t2)" >> ${file_name}_workflow.py
echo "    event_path = path+'/'+filename[0:4]+'.mseed'" >> ${file_name}_workflow.py
echo "    st_events.write(event_path, format='MSEED')" >> ${file_name}_workflow.py
# generate shell script
touch ${file_name}_process.sh
echo "#!/bin/bash" >> ${file_name}_process.sh
echo "#SBATCH --job-name=${file_name}_p" >> ${file_name}_process.sh
echo "#SBATCH --mem-per-cpu=2gb" >> ${file_name}_process.sh
echo "#SBATCH --ntasks=1" >> ${file_name}_process.sh
echo "#SBATCH --nodes=1" >> ${file_name}_process.sh
echo "#SBATCH --ntasks-per-node=1" >> ${file_name}_process.sh
echo "#SBATCH --cpus-per-task=1" >> ${file_name}_process.sh
echo "#SBATCH --output=${file_name}.log" >> ${file_name}_process.sh
echo "#SBATCH --partition=kshcnormal" >> ${file_name}_process.sh
echo "# load the environment" >> ${file_name}_process.sh
echo "module purge" >> ${file_name}_process.sh
echo "source activate obspy" >> ${file_name}_process.sh
echo "python ${file_name}_workflow.py" >> ${file_name}_process.sh
sbatch ${file_name}_process.sh
done
```
:::info
This main shell script generates the data-processing code and the corresponding job-submission scripts, and submits the jobs to the cluster automatically.
:::
### **4. Submit Job**
Submit the job to the cluster via the Slurm system:
```Shell=1
bash Para.sh
```
This script automatically reads the series data in the `Dataset` folder and generates a corresponding `.py` file for each `.mseed` file under the `Program` folder. Those `.py` files are then run automatically to process the data.
### **5. Result**
Through this data processing workflow, we can detect seismic and micro-seismic events; each event is trimmed and saved separately in the `EventDetaction` folder with the following folder structure:
```mermaid
graph LR
EventDetaction --> Date --> Event_number --> StationNumber_mseed
```
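A quick way to summarise the output is to walk this folder tree and count the event folders per date; a minimal sketch, assuming the `EventDetaction` root used in the scripts above:
```python
import os

root = '/public/home/aczpj7g1ph/ParaProcess/EventDetaction'
for date in sorted(os.listdir(root)):
    events = os.listdir(os.path.join(root, date))
    print(date, len(events), 'events')
```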
Some examples from a test run of the workflow:




## **Data Download**
### Station information
http://rdsa.knmi.nl/fdsnws/station/1/#urls
### DataSelect
http://rdsa.knmi.nl/fdsnws/dataselect/1/
http://rdsa.knmi.nl/fdsnws/dataselect/1/query?starttime=2017-01-01T00%3A00%3A00&endtime=2017-01-01T00%3A05%3A00&network=NL&station=VKB&channel=BHZ&nodata=404
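The same station and dataselect services can also be queried through the ObsPy FDSN client instead of raw URLs; a minimal sketch mirroring the example query above (NL.VKB, channel BHZ, five minutes of data):
```python
from obspy.clients.fdsn import Client
from obspy import UTCDateTime

client = Client("KNMI")
t0 = UTCDateTime("2017-01-01T00:00:00")
st = client.get_waveforms(network="NL", station="VKB", location="*",
                          channel="BHZ", starttime=t0, endtime=t0 + 300)
st.plot()
```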
# Design the Template
Employ the AR picker (`obspy.signal.trigger.ar_pick`) to pick the P-wave and S-wave arrival times.
Reference:
http://www.iitk.ac.in/nicee/wcee/article/13_786.pdf
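A minimal sketch of the AR-AIC picker applied to a trimmed three-component event file; the file name follows the naming scheme from the extraction step, and the picker parameters are the illustrative values from the ObsPy documentation, not tuned for this dataset:
```python
from obspy import read
from obspy.signal.trigger import ar_pick

st = read('2018-01-08-Event1.mseed')          # an event file written by the extraction step
tr_z = st.select(component='Z')[0]
tr_n = st.select(component='N')[0]
tr_e = st.select(component='E')[0]
df = tr_z.stats.sampling_rate

# Band 1-20 Hz; P/S window lengths, AR orders and variance windows as in the ObsPy example
p_pick, s_pick = ar_pick(tr_z.data, tr_n.data, tr_e.data, df,
                         1.0, 20.0, 1.0, 0.1, 4.0, 1.0, 2, 8, 0.1, 0.2)
print('P arrival:', tr_z.stats.starttime + p_pick)
print('S arrival:', tr_z.stats.starttime + s_pick)
```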