# UROP pt. 2
## Try todo:
- [ ] install ncview
- [ ] clean up backtracking code...
- [ ] fix the interp function to not use the entire dataset
I started a new markdown file just because I felt like it.
I'm still downloading data. I use
du -h 2020*.nc
(or similar) to check the file sizes, and I redownload any files that are too small (those didn't finish downloading before the Slurm script ended).
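For example, a quick sketch of the size check in Python (the 1 GB threshold is my guess for "too small"):
```python
from pathlib import Path

# Flag files that look too small to be complete downloads
# (complete months are several GB, so < 1 GB is suspicious)
for f in sorted(Path('.').glob('2020*.nc')):
    size_gb = f.stat().st_size / 1e9
    if size_gb < 1.0:
        print(f"{f.name}: {size_gb:.2f} GB -- probably incomplete, redownload")
```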

I made the variables, year, and month range arguments to the Python script so it's easier: in the Slurm script I can just say python3 api_request.py 2020 1 12 v to download all of 2020's worth of v, month by month.
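A sketch of the argument handling in api_request.py (an assumption -- the real script may differ; download_month is a hypothetical helper wrapping the CDS API request):
```python
import sys

# api_request.py <year> <first_month> <last_month> <variable>
year = int(sys.argv[1])
first_month = int(sys.argv[2])
last_month = int(sys.argv[3])
variable = sys.argv[4]
for month in range(first_month, last_month + 1):
    download_month(year, month, variable)  # hypothetical helper around the CDS API
```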
(For scale, one complete month of one variable: 7.2G 2020_06_u.nc.)
## Meeting November 4
### Using xarray to rewrite backtracking
I am following this guide: https://towardsdatascience.com/handling-netcdf-files-using-xarray-for-absolute-beginners-111a8ab4463f
$ ncdump -h ../allpressurelevels.nc
This gives me a quick glance at what the file contains.
I installed cartopy from https://scitools.org.uk/cartopy/docs/latest/installing.html.
It turns out that xarray already provides a lot of what I had implemented by hand:
- finding the nearest neighbor and its values: https://docs.xarray.dev/en/stable/user-guide/indexing.html
- interpolation: https://docs.xarray.dev/en/stable/user-guide/interpolation.html
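For example, a minimal sketch of both (assuming the pressure-level file from the ncdump above, with u as the eastward-wind variable):
```python
import xarray as xr

ds = xr.open_dataset('../allpressurelevels.nc')
# nearest-neighbor lookup at a grid point
u_nearest = ds.u.sel(longitude = 140, latitude = -5, level = 900, method = 'nearest')
# linear interpolation between grid points
u_interp = ds.u.interp(longitude = 140.2, latitude = -5.1, level = 905.0)
```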
Then I rewrote the backtracking algorithm with xarray, using initial_lon = 140, initial_lat = -5, initial_pressure = 900, time_increment = 0.3.
Times are stored as datetime64 instead of hours since 1900-01-01. At iteration index 0, u = 1.2855311470822812, which matches what my old code produced (up to small floating-point differences).
To subtract times, I read through the numpy datetime64 documentation (https://numpy.org/doc/stable/reference/arrays.datetime.html). Since timedelta64 requires an integer count, I converted to milliseconds:
time_next = time_current - np.timedelta64(math.floor(time_increment * MS_PER_HOUR), 'ms')
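Concretely, a minimal sketch (MS_PER_HOUR is my constant, 3,600,000):
```python
import math
import numpy as np

MS_PER_HOUR = 3600000

time_current = np.datetime64('2020-06-01T12:00:00')
time_increment = 0.3  # hours
# timedelta64 needs an integer count, so convert the fractional hours to ms
time_next = time_current - np.timedelta64(math.floor(time_increment * MS_PER_HOUR), 'ms')
print(time_next)  # 2020-06-01T11:42:00.000
```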
Here is what I got for u:
Iteration index 0: 1.2855300903320312
Iteration index -1: 1.4171670736750355
Iteration index -2: 0.7613251298932129
Iteration index -3: 1.2057111865143195
Iteration index -4: 1.6176948182586335
Iteration index -5: 2.0667538935387775
Iteration index -6: 2.1826777907319044
Iteration index -7: 1.8332458705146666
Iteration index -8: 1.296544412548214
Iteration index -9: 0.9030607094291474
Here's the plot again (it matches):

The code is really messy and I very much would like to clean it up and add more documentation; I don't feel like it right now but it's on my todo list.
### Get properties while backtracking
The variables I have (by pressure levels) are:
- geopotential (z)
- specific humidity (q)
- u
- v
- w
- temperature (t)
I just record them along the trajectory as I backtrack. The variables I have (single levels) are:
- surface pressure (sp)
- total precipitation (tp)
I download this data for the India box for June 1-3, 2020 onto my computer.

I run the backtracking algorithm for lon_0 = 76.8E, lat_0 = 17.2N. I set the initial level to surface pressure minus 25 hPa; since the surface pressure field is stored in pascals, I subtract 2500.
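In code this looks roughly like the following (a sketch; the constant names match what I use later, and ds and t0 stand for the single-levels dataset and the starting time):
```python
# How the initial level is set: sp comes in Pa, pressure levels are in hPa
PA_PER_HPA = 100
HPA_ABOVE_SURFACE = 25
sp_hpa = ds.sp.sel(time = t0, longitude = 76.8, latitude = 17.2,
                   method = 'nearest').values / PA_PER_HPA
initial_level = sp_hpa - HPA_ABOVE_SURFACE  # i.e. subtract 2500 Pa, expressed in hPa
```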
I also change the time increment to 0.25 hours.
### Running on Cluster
Then I add the code to Svante. It takes too long (I have to fix the interp function).
## notes
I can look in the region of interest; for each year, month, look at precipitation -- zero out everything that's not in the region of interest
zero out everything that is NOT a local maximum
find the local maximum and check that it's local maximum with respect to latitude/longitude/time
find a maximum, set to zero, repeat.
- just divide the year into segments and look at the maxima of each of those
- can do "keep track of the top ten" for january, februrary, etc. (probably might not let me take the max of the whole array
just try not to do it on the login node -- use a slurm script
- on the tracking map, add the plots for temp, humidity, etc.
- I can find a precipitation maximum, draw a box, and do every parcel within that
first:
- do 15 minutes, make the figure (can send PJ a .csv file) and make it look slightly better
- and then do a bunch of parcels from the same event
the way I store the data should be
- each parcel (lots of starting points) are in the same event/.csv file
- time should be innermost dimension
- can try to use a DataArray with another dimension (parcel # within box)
Tracking for two days:
## Meeting November 9
### Download India June 2020 Data
I decided to download all the data I needed for India, June 2020, onto my computer. This will make it easier for me.
For pressure levels:
- geopotential
- specific humidity
- temperature
- u
- v
- w
For single levels:
- surface pressure
- total precipitation
### Find places to backtrack from
For this meeting, I will just try to produce several backtracking runs on high- vs. low-precipitation days at the relevant locations.
Here is the plot of the average (by location) total hourly precipitation in India for June 2020. (The plot I had before was for June 2000.)
> This parameter is the accumulated liquid and frozen water, comprising rain and snow, that falls to the Earth's surface. It is the sum of large-scale precipitation and convective precipitation. Large-scale precipitation is generated by the cloud scheme in the ECMWF Integrated Forecasting System (IFS). The cloud scheme represents the formation and dissipation of clouds and large-scale precipitation due to changes in atmospheric quantities (such as pressure, temperature and moisture) predicted directly by the IFS at spatial scales of the grid box or larger. Convective precipitation is generated by the convection scheme in the IFS, which represents convection at spatial scales smaller than the grid box. This parameter does not include fog, dew or the precipitation that evaporates in the atmosphere before it lands at the surface of the Earth. This parameter is accumulated over a particular time period which depends on the data extracted. For the reanalysis, the accumulation period is over the 1 hour ending at the validity date and time. For the ensemble members, ensemble mean and ensemble spread, the accumulation period is over the 3 hours ending at the validity date and time. The units of this parameter are depth in metres of water equivalent. It is the depth the water would have if it were spread evenly over the grid box. Care should be taken when comparing model parameters with observations, because observations are often local to a particular point in space and time, rather than representing averages over a model grid box.

Here is the plot of the variability (standard deviation with respect to time) of total hourly precipitation in India for June 2020.

The 2000 vs 2020 plots look pretty different:


I selected three locations that looked like they had high standard deviations (later I will automate this by finding local maxima).

a) (78.5E, 21.3N)
b) (84.8E, 22.3N)
c) (82.3E, 27.8N)
plot_precip_variability(ds, "in India, June 2020", zoom = 20, lon_0 = 82.3, lat_0 = 27.8)

I plotted precipitation vs. time over the month for each of these points (choosing the nearest grid point in the dataset). Then I selected two low-precipitation times and two high-precipitation times.
a) I took
ds.tp.sel(longitude = 78.5, latitude = 21.3, method = "nearest")
This is the DataArray I plotted:
<xarray.DataArray 'tp' (time: 720)>
array([0. , 0. , 0. , ..., 0.000509, 0.000511, 0.000317],
dtype=float32)
Coordinates:
longitude float32 78.5
latitude float32 21.25
* time (time) datetime64[ns] 2020-06-01 ... 2020-06-30T23:00:00
Attributes:
units: m
long_name: Total precipitation
Here is the plot:

Here are the top 10 hours for precipitation:
2020-06-12T22:00, 2020-06-12T23:00, 2020-06-13T00:00, 2020-06-12T21:00, 2020-06-13T01:00, 2020-06-12T20:00, 2020-06-13T02:00, 2020-06-15T10:00, 2020-06-12T19:00, 2020-06-13T03:00
I picked two high-precipitation times from distinct events.
- High precipitation:
  - 2020-06-12T22:00:00 (time index 286; integer time 1055830)
  - 2020-06-15T10:00:00 (integer time 1055890)
(The integer times are the raw time values in the .nc files: hours since 1900-01-01.)
I was going to do the same for low precipitation, but most times are low precipitation so I just picked them by eye, verifying that they were indeed low precipitation.
- Low precipitation:
  - 2020-06-24T00:00:00 (1056096)
  - 2020-06-20T00:00:00 (1056000)
>>> print(precipitation_at_location.sel(time = np.datetime64('2020-06-20T00:00:00.000000000')).values)
3.413154e-05
b) (longitude, latitude) = (84.8E, 22.3N)

Maximum precipitation times:
2020-06-17T03:00, 2020-06-17T04:00, 2020-06-17T02:00, 2020-06-17T01:00, 2020-06-17T00:00, 2020-06-17T05:00, 2020-06-16T23:00, 2020-06-17T06:00, 2020-06-16T22:00, 2020-06-16T08:00, 2020-06-04T12:00, 2020-06-19T16:00, 2020-06-04T15:00, 2020-06-27T14:00, 2020-06-19T17:00, 2020-06-19T19:00, 2020-06-16T06:00, 2020-06-29T13:00, 2020-06-20T18:00, 2020-06-14T20:00
- High precipitation:
  - 2020-06-17T03:00:00 (1055931)
  - 2020-06-04T12:00:00 (1055628)
- Low precipitation:
  - 2020-06-04T00:00:00 (1055616)
  - 2020-06-25T00:00:00 (1056120)
c) (longitude, latitude) = (82.3E, 27.8N)

Maximum precipitation times: 2020-06-20T01:00, 2020-06-20T02:00, 2020-06-20T03:00, 2020-06-20T04:00, 2020-06-20T00:00, 2020-06-19T23:00, 2020-06-19T22:00, 2020-06-20T05:00, 2020-06-19T21:00, 2020-06-19T20:00, 2020-06-17T00:00, 2020-06-20T06:00, 2020-06-16T23:00, 2020-06-17T01:00, 2020-06-19T19:00, 2020-06-16T22:00, 2020-06-17T02:00, 2020-06-05T04:00, 2020-06-16T21:00, 2020-06-05T03:00
- High precipitation:
  - 2020-06-20T01:00:00 (1056001)
  - 2020-06-17T00:00:00 (1055928)
- Low precipitation:
  - 2020-06-03T00:00:00 (1055592)
  - 2020-06-13T00:00:00 (1055832)
Now, I run the backtracking algorithm from the places I have selected as well as the high/low precipitation times.
I had issues opening the dataset:
ds = xr.open_mfdataset(['../india_june_2020/india_june_2020_u_component_of_wind.nc',
                        '../india_june_2020/india_june_2020_v_component_of_wind.nc',
                        '../india_june_2020/india_june_2020_vertical_velocity.nc'])
air_locations = iterate_backtracking(ds = ds, initial_lon = 78.5, initial_lat = 21.3, initial_time = np.datetime64('2020-06-12T22:00:00.000000000'), time_increment = 0.25)
air_locations.to_csv('2020_06_12_22_78-5_21-3')
It turns out that I downloaded the data wrong: each file contains every variable instead of just one. It's not a big deal, though, so I just work with it (noting that I will do it differently later).
I also messed up my git history; hopefully I didn't do anything too bad.
I backtracked for (up to) 2 days each.
I was having issues with time but ended up passing decode_times = False when initially opening the dataset. I wrote a very stupid script to convert the original datetime64s to integer times:
ds_integer_times = xr.open_mfdataset(["../india_june_2020/india_june_2020_geopotential.nc", "../india_june_2020/india_june_2020_surface_pressure.nc", "../india_june_2020/india_june_2020_total_precipitation.nc"], decode_times = False)
ds_normal_times = xr.open_mfdataset(["../india_june_2020/india_june_2020_geopotential.nc", "../india_june_2020/india_june_2020_surface_pressure.nc", "../india_june_2020/india_june_2020_total_precipitation.nc"])
for i in range(720):
    if ds_normal_times.time.isel(time = i).values == np.datetime64('2020-06-24T00:00:00.000000000'):
        print(ds_normal_times.time.isel(time = i))
        print(ds_integer_times.time[i])
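A cleaner way to do the same lookup -- and roughly what the find_integer_time helper I use later does (a sketch, not my exact implementation):
```python
import numpy as np

def find_integer_time(ds_normal_times, ds_integer_times, datetimestr):
    # Find the position of the decoded datetime, then read off the raw
    # integer time at the same position.
    i = int(np.where(ds_normal_times.time.values == np.datetime64(datetimestr))[0][0])
    integer_time = int(ds_integer_times.time.values[i])
    print(datetimestr, "->", integer_time)
    return integer_time
```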
Editing the backtracking file was annoying; I basically did the same thing as my previous backtracking, converting the xarray DataArrays into numpy arrays and then using scipy interpolate. I had to be careful with precipitation, since it does not have pressure as a coordinate. I hit a bunch of annoying issues (due to the time being stored differently) but fixed them.
I ran PJ's MATLAB code to plot (I had to install the Mapping Toolbox in MATLAB):
clear
close all;
A=readmatrix('2020_06_12_22_78-5_21-3.csv');
Time=A(2:end,1);
Longitude=A(2:end,3);
Latitude=A(2:end,4);
Pressure=A(2:end,5);
Temperature=A(2:end,6);
% Geopotential=A(2:end,7);
Precipitation=A(2:end,7);
SpecificHumidity=A(2:end,8);
load coastlines.mat;
coastlon=[coastlon; coastlon+360];
coastlat=[coastlat; coastlat];
figure(1)
clf
set(gcf,'Position',[172 76 1079 902]);
axMap=axes('Position',[0.13 0.4 0.775 0.55]);
plot(coastlon,coastlat,'k','LineWidth',2); hold on;
for nPoint=1:length(Time)
    scatter(Longitude(nPoint),Latitude(nPoint),12,[(Pressure(nPoint)-900)/150 1-(Pressure(nPoint)-900)/150 0.5],'filled');
    if mod(nPoint,40)==1
        if nPoint==1
            text(Longitude(nPoint)+1.5,Latitude(nPoint)+1,[num2str(Time(nPoint)) ' hr'],'FontSize',18,'HorizontalAlignment','right');
            text(Longitude(nPoint)+1.5,Latitude(nPoint)-1.5,[num2str(round(Pressure(nPoint),0)) ' hPa'],'FontSize',18,'HorizontalAlignment','center','VerticalAlignment','middle');
        else
            text(Longitude(nPoint),Latitude(nPoint)+1,num2str(Time(nPoint)),'FontSize',18,'HorizontalAlignment','right');
            text(Longitude(nPoint),Latitude(nPoint)-1.5,num2str(round(Pressure(nPoint),0)),'FontSize',18,'HorizontalAlignment','center','VerticalAlignment','middle');
        end
        scatter(Longitude(nPoint),Latitude(nPoint),36,[(Pressure(nPoint)-900)/150 1-(Pressure(nPoint)-900)/150 0.5],'filled');
    end
end
xlim([56 85]);
xlabel('Longitude');
ylabel('Latitude');
ylim([5 25]);
set(gca,'FontSize',24);
title('June 1, 2020');
axQuantities=axes('Position',[0.13 0.125 0.775 0.15]);
yyaxis left
plot(Time,Temperature,'LineWidth',2); hold on;
xlabel('Time (hours)');
ylabel('T (K)');
ylim([292 303]);
yticks([295 300]);
yyaxis right
plot(Time,1000*SpecificHumidity,'LineWidth',2);
ylabel('q (g/kg)');
ylim([13 22]);
xlim([-47 0]);
set(gca,'FontSize',24);
saveas(gcf,'../Results/Example1.png');
(why does MATLAB look like that....lol....)
This is what I asked the code to do:
air_locations = iterate_backtracking(ds = ds_integer_times, initial_lon = 78.5, initial_lat = 21.3, initial_time = 1055830, time_increment = 0.25, tracking_hours = 48)
air_locations.to_csv('../2020_06_12_22_78-5_21-3.csv')
plot_backtracking(air_locations, pressure_units = "millibars")
print(air_locations)
# for thing in ["temperature", "geopotential", "precipitation", "specific_humidity"]:
for thing in ["precipitation"]:
    air_locations[thing] = air_locations[thing].astype(float)
    # plt.figure()
    air_locations.plot.line(y = thing)
    plt.show()
print(ds)
a) (78.5E, 21.3N)
- High precipitation:
  - 2020-06-12T22:00:00 (1055830)


  - 2020-06-15T10:00:00 (1055890)


- Low precipitation:
  - 2020-06-24T00:00:00 (1056096)



  - 2020-06-20T00:00:00 (1056000)



b) (84.8E, 22.3N)
- High precipitation:
  - 2020-06-17T03:00:00 (1055931)




  - 2020-06-04T12:00:00 (1055628)




- Low precipitation:
  - 2020-06-04T00:00:00 (1055616)


  - 2020-06-25T00:00:00 (1056120)



c) (82.3E, 27.8N)
- High precipitation:
  - 2020-06-20T01:00:00 (1056001)




  - 2020-06-17T00:00:00 (1055928)




- Low precipitation:
  - 2020-06-03T00:00:00 (1055592)



  - 2020-06-13T00:00:00 (1055832)


### Notes
- Put precipitation on the same map as T and q and label it high/low precip
The lowest pressure along the paths is 800 hPa and the highest is 980 hPa.
## November 17
I went to Svante and moved all the files into one folder (for now): ../raw_data/2020_nc_files/2020_velocities (I recognize that this is not a very good naming convention). The script I am using is in scripts.
The initial pressure is 949.756171875 hPa for
initial_lon = 82.3, initial_lat = 27.8, initial_time = 1055592, time_increment = 0.25, tracking_hours = 48
I try
time = 1051896
longitude = 0.0
latitude = 90.0
level = 100
## Notes
- see how paths diverge within a few hours (2 hours on either side) and lat/lon on either side (27 points)
- track for longer -- look at the Somali jet
- do error bars excluding outliers on each side
- look where we can sneak particles across the equator (the Somali jet)
## December 1
I am still having a lot of trouble with xarray, since the dataset is very large.
I'm not sure if this is right but I added these two lines:
import dask
dask.config.set(**{'array.slicing.split_large_chunks': True})
I tested chunking, loading into a numpy array, and a lot more options. In the end, I decided it would be easiest to load each parameter (temperature, u, v, w, etc.) as a separate dataset, since xarray seemed to handle that well.
ds_integer_times = xr.open_mfdataset("../raw_data/2020_nc_files/*_u.nc", decode_times = False)
print(ds_integer_times)
value = ds_integer_times.u.sel(time = 1051896, longitude = 0.0, latitude = 90.0, level = 100)
print(ds_integer_times.u)
# value = ds_integer_times.t.isel(time = slice(3, 10), longitude = slice(4, 10), latitude = slice(5, 10), level = slice(6, 10))
print(value)
value = value.load()
print(value)
These are the filenames:
- on my computer:
  - india_june_2020_by_pressure_levels.nc
  - india_june_2020_geopotential.nc
  - india_june_2020_specific_humidity.nc
  - india_june_2020_temperature.nc
  - india_june_2020_u_component_of_wind.nc
  - india_june_2020_v_component_of_wind.nc
  - india_june_2020_vertical_velocity.nc
  - india_june_2020_surface_pressure.nc
  - india_june_2020_total_precipitation.nc
- on Svante:
  - 2020_01_geopotential.nc
  - 2020_01_specific_humidity.nc
  - 2020_01_temperature.nc
  - 2020_01_u.nc
  - 2020_01_v.nc
  - 2020_01_w.nc
  - surfacepressure2020.nc
  - totalprecipitation2020.nc
### On my own computer
I realize that in the india_june_2020 folder, every file contained all the variables, so I wrote a new script to download the data one variable at a time, in by_pressure_level_whatever.py.
### On Svante
I made a dictionary containing each of the xarray datasets.
ds = {var_name: xr.open_mfdataset("../raw_data/2020_nc_files/*" + var_name_dict[var_name] + "*.nc") for var_name in var_name_dict.keys()}
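where var_name_dict maps each short variable key to the matching filename substring. A sketch of it (the exact strings are assumptions, chosen to match the Svante filenames listed above):
```python
# Assumed mapping from variable key to filename substring; the real dict
# in my script may differ.
var_name_dict = {
    'u': '_u', 'v': '_v', 'w': '_w',
    't': 'temperature', 'z': 'geopotential', 'q': 'specific_humidity',
    'sp': 'surfacepressure', 'tp': 'totalprecipitation',
}
```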
Then I just modified the backtracking code very jankily. I realized I didn't have a single main Dataset to read time/latitude/longitude from, so I used surface pressure ('sp'); that was a bad choice for the level coordinate, since surface pressure doesn't have one, so for level I used 'q' instead lol.
"""
Iterate backtracking
Parameters:
* ds : the xarray Dataset
* initial_lon :
"""
def iterate_backtracking(ds, initial_lon, initial_lat, initial_level = None, initial_time = None, earliest_time = None, tracking_hours = None, time_increment = 1, interp_method = 'linear'):
# If the initial time is not specified, set the initial time to be the last time in the dataset
if (initial_time is None):
initial_time = ds['sp'].time.values[-1]
if (initial_level is None):
surface_pressure = ds['sp'].sp.sel(time = initial_time, latitude = initial_lat, longitude = initial_lon, method = "nearest").values / PA_PER_HPA
initial_level = surface_pressure - HPA_ABOVE_SURFACE
print("Initial pressure level:", initial_level)
print("temperature", ds['t'].t.sel(time = initial_time, latitude = initial_lat, longitude = initial_lon, level = initial_level, method = "nearest"))
initial_location_data = {
"time": [initial_time],
"longitude": [initial_lon],
"latitude": [initial_lat],
"level": [initial_level],
"temperature": [ds['t'].t.sel(time = initial_time, latitude = initial_lat, longitude = initial_lon, level = initial_level, method = "nearest").values.astype(np.float)],
"geopotential": [ds['z'].z.sel(time = initial_time, latitude = initial_lat, longitude = initial_lon, level = initial_level, method = "nearest").values.astype(np.float)],
"precipitation": [ds['tp'].tp.sel(time = initial_time, latitude = initial_lat, longitude = initial_lon, method = "nearest").values.astype(np.float)],
"specific_humidity": [ds['q'].q.sel(time = initial_time, latitude = initial_lat, longitude = initial_lon, level = initial_level, method = "nearest").values.astype(np.float)],
}
print("Initial Location Data:", initial_location_data)
air_locations = pd.DataFrame(initial_location_data, index = [0.0])
tracking_point_in_range = True
n = 0
while tracking_point_in_range:
# Iterate the calculate of wind once
current_index = -1 * n * time_increment
next_index = -1 * (n + 1) * time_increment
print("Iteration index", current_index)
lon_current = air_locations.loc[current_index, "longitude"]
lat_current = air_locations.loc[current_index, "latitude"]
level_current = air_locations.loc[current_index, "level"]
time_current = air_locations.loc[current_index, "time"]
# If the earliest time is not provided, set it to be tracking_hours hours before the initial time, or to the earliest time
if earliest_time is None:
if tracking_hours is None:
earliest_time = ds['sp'].time.min().values
else:
# earliest_time = initial_time - np.timedelta64(math.floor((tracking_hours - 1) * MS_PER_HOUR), 'ms')
earliest_time = initial_time - (tracking_hours)
# Check if the time is out of range
if (time_current < earliest_time):
print("The time is out of range.")
tracking_point_in_range = False
break
# Check if the longitude/latitude/pressure levels are within the range
if (lon_current > ds['sp'].longitude.max().values or lon_current < ds['sp'].longitude.min().values or
lat_current > ds['sp'].latitude.max().values or lat_current < ds['sp'].latitude.min().values or
level_current > ds['q'].level.max().values or level_current < ds['q'].level.min().values):
print("The longitude, latitude, or level is out of range.")
tracking_point_in_range = False
break
time_length = ds['sp'].time.size
lat_length = ds['sp'].latitude.size
lon_length = ds['sp'].longitude.size
level_length = ds['q'].level.size
time_index = get_left_index(ds['sp'], 'time', time_current)
lat_index = get_left_index(ds['sp'], 'latitude', lat_current)
lon_index = get_left_index(ds['sp'], 'longitude', lon_current)
level_index = get_left_index(ds['q'], 'level', level_current)
# get the
time_points = ds['sp'].time.values[max(0, time_index - 1) : min(time_length, time_index + 1)]
level_points = ds['q'].level.values[max(0, level_index - 1) : min(level_length, level_index + 1)]
latitude_points = ds['sp'].latitude.values[max(0, lat_index - 1) : min(lat_length, lat_index + 1)][::-1]
longitude_points = ds['sp'].longitude.values[max(0, lon_index - 1) : min(lon_length, lon_index + 1)]
current_point = np.array([time_current, level_current, lat_current, lon_current])
current_point_3d = np.array([time_current, lat_current, lon_current])
wind_points = (time_points, level_points, latitude_points, longitude_points)
wind_points_3d = (time_points, latitude_points, longitude_points)
# print("Time index:", time_index, "Lat index", lat_index, "Lon index", lon_index, "level index", level_index)
# Gets a 2x2x2x2 array of the u values around the point that I am backtracking from
u = ds['u'].u.isel(time = slice(max(0, time_index - 1), min(time_length, time_index + 1)),
longitude = slice(max(0, lon_index - 1), min(lat_length, lon_index + 1)),
latitude = slice(max(0, lat_index - 1), min(lon_length, lat_index + 1)),
level = slice(max(0, level_index - 1), min(level_length, level_index + 1))).values
# print("u")
# u_t = u.interp(time = time_current, longitude = lon_current, latitude = lat_current, level = level_current).values
u_t = interpn(wind_points, u, current_point, method = interp_method)[0]
print("u", u_t)
# find the northward wind at this location
v = ds['v'].v.isel(time = slice(max(0, time_index - 1), min(time_length, time_index + 1)),
longitude = slice(max(0, lon_index - 1), min(lat_length, lon_index + 1)),
latitude = slice(max(0, lat_index - 1), min(lon_length, lat_index + 1)),
level = slice(max(0, level_index - 1), min(level_length, level_index + 1))).values
# print("v")
# v_t = v.interp(time = time_current, longitude = lon_current, latitude = lat_current, level = level_current).values
v_t = interpn(wind_points, v, current_point, method = interp_method)[0]
print("v", v_t)
# find the upward wind at this location
w = ds['w'].w.isel(time = slice(max(0, time_index - 1), min(time_length, time_index + 1)),
longitude = slice(max(0, lon_index - 1), min(lat_length, lon_index + 1)),
latitude = slice(max(0, lat_index - 1), min(lon_length, lat_index + 1)),
level = slice(max(0, level_index - 1), min(level_length, level_index + 1))).values
# print("w")
# w_t = w.interp(time = time_current, longitude = lon_current, latitude = lat_current, level = level_current).values
w_t = interpn(wind_points, w, current_point, method = interp_method)[0]
print("w", w_t)
lon_next = lon_current - lon_distance_travelled(time_increment, u_t, lat_current)
lat_next = lat_current - lat_distance_travelled(time_increment, v_t)
level_next = level_current - level_distance_travelled(time_increment, w_t)
time_next = time_current - time_increment
# time_next = time_current - np.timedelta64(math.floor(time_increment * MS_PER_HOUR), 'ms')
t = ds['t'].t.isel(time = slice(max(0, time_index - 1), min(time_length, time_index + 1)),
longitude = slice(max(0, lon_index - 1), min(lon_length, lon_index + 1)),
latitude = slice(max(0, lat_index - 1), min(lat_length, lat_index + 1)),
level = slice(max(0, level_index - 1), min(level_length, level_index + 1))).values
# temperature = t.interp(time = time_current, longitude = lon_current, latitude = lat_current, level = level_current).values
print("temperature:", t)
temperature = interpn(wind_points, t, current_point, method = interp_method)[0]
print("interpolated temperature:", temperature)
z = ds['z'].z.isel(time = slice(max(0, time_index - 1), min(time_length, time_index + 1)),
longitude = slice(max(0, lon_index - 1), min(lon_length, lon_index + 1)),
latitude = slice(max(0, lat_index - 1), min(lat_length, lat_index + 1)),
level = slice(max(0, level_index - 1), min(level_length, level_index + 1)))
geopotential = interpn(wind_points, z, current_point, method = interp_method)[0]
tp = ds['tp'].tp.isel(time = slice(max(0, time_index - 1), min(time_length, time_index + 1)),
longitude = slice(max(0, lon_index - 1), min(time_length, lon_index + 1)),
latitude = slice(max(0, lat_index - 1), min(lat_length, lat_index + 1))).values
# precipitation = tp.interp(time = time_current, longitude = lon_current, latitude = lat_current).values
precipitation = interpn(wind_points_3d, tp, current_point_3d, method = interp_method)[0]
q = ds['q'].q.isel(time = slice(max(0, time_index - 1), min(time_length, time_index + 1)),
longitude = slice(max(0, lon_index - 1), min(lon_length, lon_index + 1)),
latitude = slice(max(0, lat_index - 1), min(lat_length, lat_index + 1)),
level = slice(max(0, level_index - 1), min(ds.level.size, level_index + 1))).values
# specific_humidity = q.interp(time = time_current, longitude = lon_current, latitude = lat_current, level = level_current).values
specific_humidity = interpn(wind_points, q, current_point, method = interp_method)[0]
# Update the air_locations DataFrame with the next location
air_locations.loc[next_index] = [time_next, lon_next, lat_next, level_next, temperature, geopotential, precipitation, specific_humidity]
# air_locations.loc[next_index] = [time_next, lon_next, lat_next, level_next, temperature, precipitation, specific_humidity]
n += 1
return air_locations
After basically just changing all the references, it works! I saved the data to a .csv file.
Then I used scp to send it to my local computer.
scp jakinng@svante-login.mit.edu:/net/fs07/d1/jakinng/1055592_82-3_27-8.csv Downloads
Then I read in the csv file as a pandas dataframe and plotted!


(Compared to the same function run on my local computer:)


I realized that I have to read it with
air_locations = pd.read_csv('../1055592_82-3_27-8.csv', index_col = 0)
so that the first column is treated as the index, and that fixed the discrepancy. yay
## Box
I ran the backtracking for a box of starting points. I'm not really sure why the result doesn't look boxier.

### TODO:
- [ ] rename the .nc files to be more consistent in the dataset
## Next
Smooth precipitation with respect to a small area, and maybe a day
- 3 degrees lon/lat diameter box
- moving average (at each point, take the average precipitation of the 3 deg box around that point instead of the actual precipitation at that point)
- look at the precipitation variability within the box (strong precipitation days is fine, doesn't need to be hours)
- backtrack from lots of points within the entire box (3x3 box, maybe 25 points)
- (look at some time series)
- then, for each point, take the moving average where each time point is the average within the 24 hours around it
- then look at the 10 highest peaks in India total (10 local maxima)
- don't look at the coasts or the mountains
- stick to central india (only over land??)
- make plots (group by property)
## Friday, December 10
### Modules
First, I am going to take the moving average of the precipitation.
I did this using this function: https://docs.xarray.dev/en/stable/generated/xarray.DataArray.rolling.html
I (again) (as usual) had issues with scipy and the installed packages. I installed conda: https://conda.io/projects/conda/en/latest/user-guide/install/macos.html
https://www.sianbrooke.co.uk/dr-brookes-blog/coding-in-python-managing-packages-with-conda-and-pip
Now, every single one of my imports is messed up.
I gave up and decided to make a virtual environment, following this: https://www.geeksforgeeks.org/python-virtual-environment/. This is a Python 3.9 venv.
I activated the virtual environment:
source fall22urop/bin/activate
I changed my mind since it wasn't working and followed this instead: https://uoa-eresearch.github.io/eresearch-cookbook/recipe/2014/11/20/conda/
Then I uninstalled conda and reinstalled it. I also ran this: https://stackoverflow.com/questions/70320522/unable-to-create-environments-file-path-not-writable-mac-m1-chip-python-module
After doing this for around 3 hours, I ended up uninstalling everything, creating a virtual environment, installing everything, and then running the file on terminal instead of on vscode.
I went to VSCode preferences, workspace settings, files, and added the key-value pair "python.pythonPath": "/Users/jakin/miniconda3/envs/uropvenv/bin/python". Then I deleted it again and ran
source activate uropvenv
in the integrated terminal.
So that I could actually read what was going on, I followed this: https://apple.stackexchange.com/questions/282185/how-do-i-get-different-colors-for-directories-etc-in-iterm2.
### Precipitation
I smoothed the precipitation using .rolling().
display_map(np.mean(ds.tp.values, axis = 0), ds.longitude.values, ds.latitude.values, "precipitation units", "Total Precipitation")
# print(ds.tp.values)
display_map(np.mean(ds.tp.rolling(time = 3, center = True).mean().values, axis = 0), ds.longitude.values, ds.latitude.values, "precipitation units", "Total Precipitation (Smoothed)")
Then I tried to drop some NaN values using https://docs.xarray.dev/en/stable/generated/xarray.DataArray.dropna.html, but I realized I probably don't want the default behavior, since it removes an entire slice if there is any NaN in it.
This is what I want:
ds.tp.rolling(time = 1, longitude = 3, latitude = 3, center = True).mean().dropna("longitude", how = "all").dropna("latitude", how = "all")
These are the results from before and after dropping the nans:
<xarray.DataArray 'tp' (time: 720, latitude: 161, longitude: 161)>
dask.array<open_dataset-91478705923de2470f45f1bf30b142aatp, shape=(720, 161, 161), dtype=float32, chunksize=(720, 161, 161), chunktype=numpy.ndarray>
Coordinates:
* longitude (longitude) float32 60.0 60.25 60.5 60.75 ... 99.5 99.75 100.0
* latitude (latitude) float32 40.0 39.75 39.5 39.25 ... 0.75 0.5 0.25 0.0
* time (time) datetime64[ns] 2020-06-01 ... 2020-06-30T23:00:00
Attributes:
units: m
long_name: Total precipitation
ROLLING MEAN <xarray.DataArray 'tp' (time: 720, latitude: 159, longitude: 159)>
dask.array<getitem, shape=(720, 159, 159), dtype=float32, chunksize=(720, 159, 159), chunktype=numpy.ndarray>
Coordinates:
* longitude (longitude) float32 60.25 60.5 60.75 61.0 ... 99.25 99.5 99.75
* latitude (latitude) float32 39.75 39.5 39.25 39.0 ... 1.0 0.75 0.5 0.25
* time (time) datetime64[ns] 2020-06-01 ... 2020-06-30T23:00:00
<!-- Before smoothing:

After smoothing:
 -->
After setting the vertical limits to be between 0 and 0.003:


The latitude and longitude both increment by 0.25 deg, so the 3 deg by 3 deg box has lat_n = 13 (offsets -1.5, -1.25, -1, -0.75, -0.5, -0.25, 0, 0.25, 0.5, 0.75, 1, 1.25, 1.5) and lon_n = 13.


I want to make a movie of precipitation:
https://pypi.org/project/ffmpeg-python/, https://stackoverflow.com/questions/33458488/matplotlib-subplot-animation-with-basemap
I also tried taking the average of precipitation for every day (24 hrs) and making a movie. The results are in the UROP folder on my computer.
The descriptions of the movies are in mp4descriptions.txt.
### Plotting the Box and its Properties
Next, I want to find the top 10 local maxima with respect to space and time, within India.
https://stackoverflow.com/questions/72855208/how-to-find-peaks-in-python-for-a-3d-plot
It seems nontrivial, so I will start with the points that I identified earlier:
a) (78.5E, 21.3N)
b) (84.8E, 22.3N)
c) (82.3E, 27.8N)

I decide to look at point A and point B since point C is in the mountains (which probably has weird effects).
For point A, here is the time series of precipitation (smoothed with respect to space):
ds = xr.open_mfdataset("../india_june_2020/india_june_2020_total_precipitation.nc")
ds_rolling_mean = precip_rolling_mean(ds, lon_n = 13, lat_n = 13).sel(longitude = 78.5, latitude = 21.3, method = "nearest")
print(ds_rolling_mean)
print(ds_rolling_mean.values)
ds_rolling_mean.plot()
This snaps to the grid point (78.5, 21.25).
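For reference, precip_rolling_mean is just a small helper around the rolling-mean line from earlier; a sketch of it (an assumption -- my actual helper may differ slightly):
```python
def precip_rolling_mean(ds, lon_n, lat_n):
    # Smooth tp over an n-point lon/lat box, then drop the all-NaN edge
    # rows/columns left behind by the centered window.
    return (ds.tp.rolling(longitude = lon_n, latitude = lat_n, center = True)
            .mean()
            .dropna("longitude", how = "all")
            .dropna("latitude", how = "all"))
```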

And here is the time series smoothed with respect to time (a 24-hour rolling window):
ds_rolling_mean.rolling(time = 24, center = True).mean().dropna("time", how = "all").plot.line()

I found all of the local maxima using scipy.signal.find_peaks:
local_maxima = find_peaks(ds_rolling_mean_time.values)[0]
peaks = ds_rolling_mean_time.isel(time = np.array(local_maxima))
peaks.plot.line()

Then I found the top-valued local maxima, by repeatedly taking the max and zeroing it out:
top_peaks = []
for i in range(10):
    argmax_peak = peaks.argmax(dim = "time").values
    print("TOP PEAK", argmax_peak, peaks.time[argmax_peak].values)
    top_peaks.append(peaks.time[argmax_peak].values)
    peaks[argmax_peak] = 0
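An equivalent, non-destructive version (a sketch) that just argsorts once instead of zeroing entries:
```python
import numpy as np

# take the ten largest peaks directly, without modifying `peaks`
order = np.argsort(peaks.values)[::-1][:10]
top_peaks = [peaks.time.values[i] for i in order]
```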
The top local maxima are at 2020-06-12T19:00, 2020-06-15T18:00, 2020-06-13T07:00 (part of the same event as the first), and 2020-06-03T19:00, which correspond to integer times:
- 1055827
- 1055898
- 1055839 (same event as #1)
- 1055611
where the code is:
for top_time in top_peaks:
    find_integer_time(ds_normal_times, ds_integer_times, datetimestr = top_time)
For (a_x, a_y), I considered the 3x3 box of starting points, one degree less/more in each direction, and tracked each of them for the top 3 peaks. (I can also do the 5x5 box.) The setup is sketched below.
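A sketch of the box setup (an assumption, reconstructed to match the 9-path dataset further down; 1055827 is the first of the top peaks found above):
```python
a_x, a_y = 78.5, 21.3          # point (a)
offsets = [-1.0, 0.0, 1.0]     # one degree less / same / more
air_locations_array = []
for dlat in offsets:
    for dlon in offsets:
        air_locations_array.append(iterate_backtracking(
            ds = ds_integer_times, initial_lon = a_x + dlon, initial_lat = a_y + dlat,
            initial_time = 1055827, time_increment = 0.25, tracking_hours = 48))
```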
Here is the backtracking plot:

I combined all the data into a DataArray:
datasets = [air_locations.to_xarray() for air_locations in air_locations_array]
dataset = xr.concat(datasets, pd.Index(np.arange(len(air_locations_array)), name = "path_num"))
print(dataset)
The result is
<xarray.Dataset>
Dimensions: (index: 194, path_num: 9)
Coordinates:
* index (index) float64 0.0 -0.25 -0.5 ... -47.75 -48.0 -48.25
* path_num (path_num) int64 0 1 2 3 4 5 6 7 8
Data variables:
time (path_num, index) float64 1.056e+06 ... 1.056e+06
longitude (path_num, index) float64 77.5 77.43 77.36 ... 75.5 75.46
latitude (path_num, index) float64 20.25 20.26 ... 25.25 25.23
level (path_num, index) float64 926.8 926.6 ... 957.3 956.3
temperature (path_num, index) object 297.4 ... 299.93845049977824
precipitation (path_num, index) object 0.0005539 ... 0.0
specific_humidity (path_num, index) object 0.01845 ... 0.01996067240336437
Then I averaged over the paths:






Here are the same plots arranged in a grid:

<!-- 




 -->
Then I converted it to subplots:
avg_dataset = dataset.mean(dim = "path_num")
fig = plt.figure()
ax1 = fig.add_subplot(321)
ax3 = fig.add_subplot(323)
ax5 = fig.add_subplot(325)
avg_dataset.temperature.plot(ax = ax1)
avg_dataset.precipitation.plot(ax = ax3)
avg_dataset.specific_humidity.plot(ax = ax5)
# plt.show()
# fig = plt.figure()
ax2 = fig.add_subplot(322)
ax4 = fig.add_subplot(324)
ax6 = fig.add_subplot(326)
dataset.temperature.plot.line(x = "index", ax = ax2)
dataset.precipitation.plot.line(x = "index", ax = ax4)
dataset.specific_humidity.plot.line(x = "index", ax = ax6)
plt.show()

TODO: figure out how to keep the path_num labels from overlapping. lol.
Here are the plots for


### Slurm script on Svante
I had to do pip3 install --user "dask[complete]" and delete all the plotting stuff, and then it worked.
#!/bin/bash
#
#SBATCH -J data_request
#SBATCH -p edr,fdr
#SBATCH -t 23:59:59
#SBATCH -e output_error_files/%Jrequest_error_file.err
#SBATCH -o output_error_files/%Jrequest_output_file.out
source /etc/profile.d/modules.sh
module load python/3.9.1
pip3 install --user cdsapi
pip3 install --user netCDF4
pip3 install --user numpy
# import matplotlib.pyplot as plt
# import matplotlib as mpl
# from mpl_toolkits.basemap import Basemap
pip3 install --user pandas
pip3 install --user scipy
pip3 install --user xarray
pip3 install --user cftime
pip3 install --user "dask[complete]"
python3 backtrackingxarray2.py
exit
Use this to save a list of dataframes to .csv files: https://stackoverflow.com/questions/61209641/saving-to-csv-files-multiple-dataframes-using-a-list-of-names-of-dataframes.
## Notes
- For the movie, try a log scale (since we care about the stuff happening in the yellow) or just expand the scale
- make sure the scale matches the map
- make the purple white on the colorscale
- average in time only a couple hours not the whole day
- check if it's local max with respect to time, then lat, then lon, then multiply the truth "matrices" together
- https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.argrelmax.html
- pressure colorbar should be flipped (upside down); also, switch to distance from surface rather than absolute pressure
- I can add a diff color/symbol inside the large circles for the paths as well as numbers
- plot specific humidity an hour before vs precipitation
- display local time under the index (India is 5:30 ahead of UTC; quick check after this list)
- https://en.wikipedia.org/wiki/Solar_time
- (T + 24/360 * longitude) mod 24
- legend.loc = left or something
- try plot = step
- if i can (at some point), draw a vertical line indicating it is over land and not the ocean anymore
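A quick check of the solar-time formula in the notes above (for central India it comes out about 5:30 ahead of UTC, matching the note):
```python
def local_solar_time(utc_hour, longitude):
    # (T + 24/360 * longitude) mod 24
    return (utc_hour + 24 / 360 * longitude) % 24

print(local_solar_time(12, 82.3))  # ~17.5, i.e. about 5:30 pm local time
```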
TODO
- find where to plot from (say it has to be a local maximum for some width, say 3 points to the left & to the right)