
# Crash Course Q&A
Suggestion box:

Please help us prepare our next Crash Course on Data Assimilation by writing suggestions below.
What topics were missing according to you?
 A little more of a general approach for beginners before going into the deeper levels of DA.
 A bit more emphasis on variational approaches would be beneficial, especially given their relation to machine learning.
 Perhaps an introduction to particle filtering will be useful too.

What topics were repeated too much in your opinion?
 Too many of ...
Did you try the tutorials? Were they useful?
 Only the ones from Patrick so far, but they are fantastic to get started.
 Yes: both the tutorials from Patrick and Julien. They were useful to understand how to implement the different notions in code. Personally I understood more by practising Patrick's tutorial than by listening to the lectures, which were a bit too quick for me.
 Yes, they were nice to assimilate what we learned in class.

Any other suggestions for improvements?
 The lectures went too quickly. No time to assimilate (haha).
 I agree with this, if it were only the live lecture. But as you make the slides and the video available, I think it is fine to go at a decent speed, and we can catch up later at our own pace.
 I also agree about the speed of the lecture and the usefulness of the tutorials. Maybe instead of the lecture on the tutorials, dedicated time for the tutorial in the schedule would have been nice, with Patrick available for questions live.
 I also agree, especially for the first lectures that presented Data Assimilation theory. Maybe more time should be spent on how DA works and less on applications, as I guess some participants don't have any background knowledge in DA.
 The tutorials were very useful, though some of them may have been a bit advanced, especially for people with no previous background in data assimilation. The time constraint probably forced the applications to go fast, but it is nice that one can later catch up with those at one's own pace.
 An introduction of DA and more time to do the exercises. I have zero background in DA and I felt very lost in the first class. The course started with many equations that I didn't understand where they were used. I began to understand with some links on the jupyter notebooks (e.g., the introduction by pictures). Then, I could reread the slides and understand the equations.
Day 1:

**Geir Evensen: basic principles of inference and DA theory**
Link to lecture slides: https://github.com/geirev/Presentations/
 In the example, if there were more conditions on x, like `x(2)=3;x(4)=5`, would you have to add other variables (`a,b,c,d`) and extend the cost function, or would they be somehow "lumped up" in just `b`? (MD)
 You will add terms in the cost function for the additional conditions (transcribed by YY)
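 To make the "one term per condition" idea concrete, here is a minimal sketch (not the lecture's actual example; the prior and the error variances are hypothetical) of a least-squares cost function where each condition `x(i)=v` simply contributes an additional quadratic term:

```python
import numpy as np

def cost(x, conditions, sigma_prior=1.0, sigma_obs=0.1, x_prior=0.0):
    """Least-squares cost: a background term plus one term per condition.

    `conditions` is a list of (index, value) pairs, e.g. [(2, 3.0), (4, 5.0)].
    The sigma_* values are assumed error standard deviations (hypothetical).
    """
    J = np.sum((x - x_prior) ** 2) / (2 * sigma_prior ** 2)  # background term
    for idx, val in conditions:
        J += (x[idx] - val) ** 2 / (2 * sigma_obs ** 2)      # one extra term per condition
    return J

x = np.zeros(6)
print(cost(x, [(2, 3.0), (4, 5.0)]))   # two conditions -> two extra terms
```

 Adding a third condition `x(0)=1` would just append one more quadratic term; nothing gets "lumped up".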
 Why did we consider only the first two moments when forming the least-squares cost function?
 For now we consider Gaussian distributions, which are fully described by their first two moments. Higher moments can be added for non-Gaussian distributions. (transcribed by YY)
 **HD** In the case where we use a surrogate model or a measurement operator that has some errors or is approximate, should we increase the observation error? Or how should we modify our DA problem?
 My opinion is that you should include the error in the right place: if the error is caused by the surrogate model then it is model error; if the error is in the observation operator it is obs error. However, in practice it is often more convenient to lump them into the obs error variances. (YY)
 I will mention representation errors briefly in my section on Wednesday. (YY)
 What's the fundamental difference between filter and smoother updates?
 In filtering, you use obs to update the state at the current time step, while in a smoother you also update the states within a time window. (transcribed by YY)
 How does nonstationarity of model processes affect the error propagation?
 In a nonstationary model the error covariance would change with time (flow dependent). This flow dependent error covariance needs to be provided at each time step to correctly propagate information. (YY)
 What would be the interpretation or the implications if, after 60 updates, we did not see an improvement like in slide 61, but kept the same error as in slide 60 (30 updates)? Would something be wrong with the filter?
 Several things can cause this failure. Something wrong with filter, or the model, or even the obs itself. A typical situation is that you specified the wrong error variances when assimilating the obs, causing its impact to be zero. (YY)
 How does the EnKF deal with outliers or extreme members at a given assimilation step? Are there some techniques to diminish the impact of these extreme members in the covariances and the mean?
 Typically not an issue within the ensemble distribution. But if the observation and model state don't agree (very large innovation), this would be problematic, and you need to either fix the model state or the obs. (transcribed by YY)
 To add to this, when you truly have an issue with outlier members (maybe due to nonlinear dynamics that grow errors exponentially), the EnKF update will surely drag the outlier back too slowly. You can use nonlinear filters (e.g. the rank histogram filter) to better deal with outliers. (YY)
 Does the EnKF converge when the background is sparse, for example if the background is simply an observation?
 (LB) The model cannot be initialised with sparse data, so you have to give your best shot at initialising the model (say, by interpolating the sparse observations). The assimilation is intended to correct unobserved variables as observations are assimilated.
 Which metric do you use to calculate the errors in slide 83? L1 norms or L2 norms? Can either of them be used to evaluate the performance of the EnKF?
 I believe it's L2 norms (YY, correct me if wrong). I think you can use any metric to measure performance, as long as it reflects the error magnitude.
 Is there any restriction on the minimum ensemble size and the frequency of observations for a good EnKF prediction, regarding the positive Lyapunov exponents and error doubling time of chaotic systems?
 (LB) Yes, indeed you will see with Alberto Carrassi's presentations that you need a number of members larger than the number of positive (and neutral?) Lyapunov exponents, although getting one magical number is not easy. It also depends on which EnKF scheme you use (square root or perturbed obs), and how many observations. You should get good elements of answer throughout the week.
 How "few" is few for number of realizations in EnKF ensemble?
 In a statistical sense, it takes ~1000 realizations to obtain good correlations, but in ensemble forecasts we can often only afford ~10 or ~100. Sometimes people only use ~10, and that's "few". Localization does alleviate the problem, so you can still use a small ensemble size and the filter works.
 How do you estimate parameters with the EnKS? Do you take the last estimate at the end of the temporal window?
 (LB) The parameters in the example are not changing in time. So the estimation looks for the best value in the whole assimilation window.
 I mean where does the parameter go. Kind of in the same way you have $x^a$ and $x^f$ on slide 66.
 Maybe the question is how to define the spatial and temporal 'location' of these parameters? (YY)
 Yes. But here we have a linear parameter, so we just extend the state vector. (MD)
 Why does the EnKS result in different analyses at analysis times (slide 118) compared with the EnKF?
 Because the EnKS takes updates from more observations (in all time steps) than the EnKF (only the current time step) (YY)
 Could someone please explain "localization"? What does it do and why is it needed?
 Sampling noise due to the small ensemble size is typically large when estimating the correlation between variables at large distances. Localization restricts the update from each observation to only the state variables in its vicinity (local analysis), so the sampling noise is suppressed. (YY)
 I will have a demo on localization on Wednesday (YY)
 (LB) I will also come to localisation tomorrow.
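 As a preview of the idea, a minimal sketch (toy sizes, and a Gaussian taper used as a simple stand-in for the usual Gaspari-Cohn function): localization takes the Schur (elementwise) product of the noisy sample covariance with a correlation taper that decays with distance:

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_ens, L = 40, 10, 5.0                 # state size, ensemble size, localization radius (assumed)

# Sample covariance from a small ensemble: noisy at long distances
X = rng.standard_normal((n_ens, n))
A = X - X.mean(axis=0)                    # ensemble anomalies
C = A.T @ A / (n_ens - 1)

# Distance-based Gaussian taper (stand-in for Gaspari-Cohn)
d = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
rho = np.exp(-0.5 * (d / L) ** 2)

C_loc = rho * C                           # Schur (elementwise) product: long-range noise damped
```

 The diagonal (the variances) is untouched, while spurious long-range covariances are driven toward zero.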
 How do you keep the variables in the state vector physically consistent after the analysis step? Or make them agree with the dynamics and not create nonphysical solutions?
 (LB) I will talk about that tomorrow. In short, the good practice is to take all the necessary variables into the state vector. That will conserve the linear balances/relationships. Then you can use transformations to try to respect nonlinear relationships, and as a last resort you have to do some postprocessing (removing unphysical values if you really cannot avoid them otherwise).
 Can including some uncertainty in the parameters (parameter estimation) be an alternative to increasing the number of observations per cycle to address the bimodal distributions that occur in the Lorenz problem?
 Are you asking whether you can increase the observation information content at each cycle so that we can assimilate them?
 I am wondering if, for example, the bimodality occurs because there are only two possible attractors, and giving the uncertainty to the model gives it more freedom to fit either one of them.
 Hmm. I think when you have bimodality, it is important to have both modes represented in your ensemble when the observation is uncertain, so that when more accurate observation comes, the ensemble dist can then be refined to the correct mode. Think about the situation when the ensemble dist is centered on the wrong mode, it becomes difficult to correct it any more. Yes, frequent observation can ensure the dist to be correct.
 So, if I understood well, it is not a real problem to have two attractors, and that the ensemble sometimes splits into two sub-ensembles? (it can happen if we have 2 very different observations in the same cell of the model, for example) => we are very far from Gaussianity then, aren't we?
 Yes, the ensemble can represent any distribution, the Gaussian approximation is only made during the filter update. If this bimodal distribution deviates severely from Gaussianity, the EnKF will make a suboptimal update (because it assumes Gaussianity). An important point in the Lorenz example is that, if you have a lot of observations, the distribution will eventually become normal after many updates (the ensemble will be focusing on a small area in probability space). (YY)
 Can the EnKF be used in cases where you have multiplicative noise instead of additive noise?
 Yes, the EnKF formulation allows you to formulate the model noise in that way.
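 As a toy illustration of the difference (not from the lecture; the noise amplitude is an arbitrary choice): additive noise perturbs each ensemble member by a random increment, while multiplicative noise scales each member by a random factor, and both fit naturally into a per-member forecast step:

```python
import numpy as np

rng = np.random.default_rng(6)
n_ens = 100
x = np.ones(n_ens)                                   # toy ensemble, all members at 1.0

# Additive model noise: x + q, with q ~ N(0, sigma^2) drawn per member
x_add = x + 0.1 * rng.standard_normal(n_ens)

# Multiplicative model noise: x * (1 + q), each member gets its own random factor
x_mul = x * (1.0 + 0.1 * rng.standard_normal(n_ens))
```

 Since the EnKF propagates each member individually, the noise model is just part of the forecast step; the update equations are unchanged.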
 What would be the advice for models where the predictability time or "stability" does not occur in the first period of time but only after a long spin-up period?
 (LB) What we practice is to first spin up the model until it stabilises. But others prefer to assimilate before the model has spun up, to avoid model drift.
 (YY) The predictability time scale is related to how fast errors grow, which is determined by the model dynamics. The EnKF on each cycle reduces error and counters this error growth. If the rate of reduction is not much larger than the error growth rate, it will take many spin-up cycles before you reach stability. I would suggest starting the DA cycles earlier than the period of interest in this case.
 The EnKF deviation in light blue is not symmetrical compared to the light red one. Does it mean something in particular? Would it be helpful at all? (answered live)
 Very interesting how these methods relate to some used in the geophysics inversion community. Which of the smoothers relates to the Hamiltonian Monte Carlo approach?
 With the EnKF, you mentioned that when the data are not synchronous (i.e. measurements are made at different times within the DA time window) you have to take into account temporal covariances. How is this done? Do all EnKF implementations do this?
 (LB) It is quite easy to do once you have an EnKF implemented. The explanation is in this paper: https://doi.org/10.1111/j.1600-0870.2009.00417.x. But you have to make sure that the model can print out HX^f^ at the time and location of the observations.
 My understanding of that paper is that it works perfectly for a linear model but is "good enough" for nonlinear models because the ensemble spread is greater for observations further from the last analysis time... and as long as DA windows are not too large. But are temporal covariances at all calculated or even implicit in this approach?
 Yes, like the EnKF itself, the further you are from Gaussian-linear conditions, the more trouble you should expect, in particular if your assimilation window is too long. In practice you never calculate the temporal covariances explicitly. You are multiplying ensemble matrices HA'A'^T^H^T^ as in Geir's slides, except that one term of the multiplication is at observation time and the other at analysis time.

**COVID model study**
 After I downloaded EnKF_seir, I tried to run 'covid.ipynb'; However, there is no sample data such as hosp_0.dat and dead_0.data. Could you share sample data to implement the code?
 Are you not sharing with WHO your results?
 Are there any plans to further advance this project for studying the effects of the new variants? What would change in the estimation problem?
 Once countries start opening, can you start accounting for interactions between countries?
 Can you say something on the number of forward model runs for IES and ESMDA? not for the covidmodel, but the prior portion of the lecture. (MD)
 Depends on the nonlinearity (as far as I understood, MD)
 What do you do if the forward runs are quite expensive (e.g. 1 h per run)?
 The interaction transfer matrix between age groups was constant you said? Would this be fitted to specific countries to account for cultural differences?
 Can the model predict if herd immunity has been achieved?

**Patrick Raanes: DA Tutorial**
https://github.com/nansencenter/DA-tutorials
 I created a new environment with the requirements.txt, but the first notebook crashed, although it worked with another environment.
 Use `python -V` to check which Python version the environment uses.
 It may be useful if someone has the same problem when creating a new local environment: uninstall the updated matplotlib, which does not work, and install an older version; 3.1.3 is working.
 A glossary of data assimilation?
 There was a paper from [Kayo Ide on unified notation (Ide et al. 1997)](https://www.jstage.jst.go.jp/article/jmsj1965/75/1B/75_1B_181/article). It is not followed by everybody, but at least it has made a mark on the community. (LB)
 There seem to be established glossaries within the different communities, but they are not unified yet. In machine learning/weather forecasting/engineering the names for model output (feature/state/signal) can be very different. So it's a good idea to make a cheat sheet before you talk to domain experts. (YY)
 Hi Patrick, in the diagnostics there is the product of the H and K operators (HK, black line); can you explain how to interpret it? Also, what is the blue 10(infl-1) line?
 (LB) HK relates to the reduction of the estimated errors. When you apply the Kalman update to the covariance matrix, the analyzed covariance matrix becomes C^a^=(I-KH)C^f^.
 The blue line is probably related to inflation (I guess he will get to this later)
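 As a scalar illustration of that diagnostic (hypothetical variance values, with H = 1): the Kalman gain determines HK, the fraction of forecast error variance removed by the update, and the analysis variance follows from C^a^=(I-KH)C^f^:

```python
import numpy as np

# Scalar case: one state variable, one observation, assumed (hypothetical) variances
Cf, R = 2.0, 0.5                  # forecast error variance and observation error variance
H = 1.0                           # identity observation operator

K = Cf * H / (H * Cf * H + R)     # Kalman gain
HK = H * K                        # fraction of estimated error removed (0 = no update, 1 = full trust in obs)
Ca = (1.0 - K * H) * Cf           # analysis error variance: C^a = (I - KH) C^f
print(HK, Ca)
```

 So an HK curve near 1 means the observations dominate the analysis, while HK near 0 means they have almost no impact.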
 All the tutorials aim to give a survey of the possibilities with DA. Can we ask questions later, when we dig into the tutorials?
 (LB) Yes we want to keep this page open for a while and collect all the answers there even after the debrief session. Although we may not check the page regularly after Friday, so please send an email when you have further questions.
Day 2:

**Laurent Bertino: background statistics, spatial covariances, anamorphosis for non-Gaussian variables**
 What are examples of variables that are translationally invariant in space? For example, the average temperature in Bergen is not the same as in Dubai. (answered live)
 It is really hard to find a good example. If you take a subset of the data in a small area and during a short time you may assume that the assumption is fulfilled.
 Isn't the spherical model almost the same as inverse distance weighting? (MD)
 No, because there is no screening effect. In addition, IDW overvalues clustered data. (transcribed by MD). Inverse distance weighting is actually not a Kriging method. It is a very cheap interpolation method but it will only work well if the data are regularly spaced.
 (YY) The kriging method uses isotropic covariances, so it is more comparable to optimal interpolation, which also uses isotropic covariances, whereas the ensemble methods provide flow-dependent covariances.
 That is right. In data assimilation the only isotropic covariances are the observation error covariances and maybe the model errors.
 What is the difference with conditional Gaussian processes?
 Maybe kriging assumes the random variables are similar to Gaussian random fields (Gaussian processes with spatial covariances)? (I don't know much about that, correct me if wrong, YY)
 In GP regression, each simulation is considered a realization of a Gaussian process: Z(x) = m(x) + W(x), with m the mean function and W(x) Gaussian (sigma2, R). We condition the GP to respect the observed data (which are, in fact, the outputs of a model), and the conditioned GP is still a GP. So I think this is equivalent to the Gaussian conditional simulations you showed. Maybe the term is different since we interpolate model outputs instead of spatial data (metamodeling), but the equations are the same. (CL)
 (LB) I also have the impression that it is the same method with a different name.
 ok thank you, you deserve your coffee :)
 Can you use conditional simulation to create ensemble members (initial conditions) from data, to then use later for the EnKF, for example? (MD)
 The way a conditional simulation is created is very similar to the way an initial ensemble is created (drawing from a random distribution), so yes! (YY)
 This is what is practiced with reservoir simulations for example.
 Thanks!
 Please explain "kriged unconditional simulation" on page 91 again. What is the difference between an unconditional simulation and a kriged unconditional simulation?
 The kriged unconditional simulation is the result of kriging from the unconditional simulation, pretending that we only know the observed sample points; therefore the unobserved areas are very 'smooth'. The unconditional simulation itself is noisy and has the correct variances. (transcribed by YY)
 How does the change of support in geostatistics connect to the concept of representation errors in data assimilation?
 It is exactly the same concept. In data assimilation there is an extra complication to know what exactly is the support of a model grid cell (with most numerical schemes the effective support is 4 to 8 times the size of an ocean model grid cell) (LB)
 Could you briefly re-explain (in terms of the equations) the similarity with the Kalman gain, please?
 You can identify the denominator

**Some practical DA examples**
 Hi Laurent, does the Gaussian anamorphosis guarantee that once you transform back to the original variables your result will not be biased, or can this be an issue?
 Yes, the whole philosophy hinges on this inequality: $E(\phi({\bf Y})) \neq \phi(E({\bf Y}))$. If you transform the ensemble mean you are computing $\phi(E({\bf Y}))$, which is biased, but if you transform each ensemble member you can compute $E(\phi({\bf Y}))$ without bias.
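 This inequality is easy to check numerically. A minimal sketch using `exp` as a stand-in back-transform (the actual anamorphosis function would be estimated from data):

```python
import numpy as np

rng = np.random.default_rng(2)
Y = rng.standard_normal(100_000)      # large "ensemble" in transformed (Gaussian) space
phi = np.exp                          # hypothetical back-transform to the original variable

biased = phi(Y.mean())                # phi(E(Y)): transform the mean -> ~ exp(0) = 1.0
unbiased = phi(Y).mean()              # E(phi(Y)): transform members, then average -> ~ exp(0.5)
print(biased, unbiased)
```

 For a standard normal Y, the lognormal mean is exp(1/2) ≈ 1.65, so transforming the mean underestimates it by about 40%: always back-transform member by member.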
 Could you please advise research papers comparing the assimilation of SSH vs U,V by the EnKF, EnOI, etc.?
 Well since the two are in principle equivalent, there has not been any attempt to compare them to my knowledge. For a comparison of different assimilation methods, I can recommend this paper: https://doi.org/10.1016/j.ocemod.2011.01.006
 What do you do when the anamorphosis is still not able to give a Gaussian variable? (when there is a Dirac at zero, for example)
 For discontinuous variables you can look at truncated Gaussian methods:
 A truncated Gaussian filter for data assimilation with inequality constraints: Application to the hydrostatic stability condition in ocean models. Claire Lauvernet, Jean-Michel Brankart, Frédéric Castruccio, Grégoire Broquet, Pierre Brasseur, Jacques Verron. Ocean Modelling, Volume 27, Issues 1–2, 2009, pp. 1–17, ISSN 1463-5003, https://doi.org/10.1016/j.ocemod.2008.10.007
 For categorical data mixed with quantitative data:
 Metamodeling methods that incorporate qualitative variables for improved design of vegetative filter strips. Claire Lauvernet and Céline Helbert. Reliability Engineering & System Safety, 2020, Volume 204, 107083, ISSN 0951-8320, https://doi.org/10.1016/j.ress.2020.107083
 The cluster EnKF for multimodal distributions: https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1600-0870.2007.00246.x
 If the observations are truncated but not the variable itself, you can look at the semi-qualitative EnKF by Shah et al. 2018: https://doi.org/10.1002/qj.3381
 To impose linear inequality constraints on the state vector, this method has been proposed by Janjic et al. 2014: https://doi.org/10.1175/MWR-D-13-00056.1
 How to choose the ensemble size?
 This is a difficult question. There are aspects of dynamical systems theory (see Alberto Carrassi's presentation on Friday; pay attention to the Lyapunov exponents part), as well as matrix algebra considerations: you are dealing with matrices of rank less than the state dimension and the number of observations, so the number of members should be in the same ballpark. Finally, it is recommended to plot examples of ensemble covariances between different variables and try to make sense of them. If these plots do not show the expected spatial features, you have probably not used enough members.
 Is bias linked to a violation of ergodicity?
 (LB) I don't know. Bias can stem from any wrong feature in the assimilation system: an error covariance with the wrong value, ... I recommend Fitzgerald (1971) in the context of the KF: https://ieeexplore.ieee.org/document/1099836
 Could you please explain the differences between the conditional simulation approach using geostatistics and the EnKF used for data assimilation? It seems to me that the conditional simulations rely more on the data, because they produce the random realizations using the variogram information, whilst the EnKF relies more on the dynamical model and just corrects using some observations. Would that make the simulations more data-driven and the EnKF more physics-driven?
 Yes, this is correct to say: you can call geostatistical conditional simulations "an EnKF without a numerical model". Conversely, the EnKF can be interpreted as a conditional simulation algorithm (using a numerical model).
Day 3:

**Yue (Michael) Ying: Data assimilation with multiscale alignment**
Link to slides and notebook: https://github.com/nansencenter/msda_crashcourse
 Non physicist question: can anyone clarify to me what is meant by "I use a Gaussian state for the 1d variable"?
 The values along the xdimension are simply the values of a Gaussian function. (LB)
 Hi Michael, are the modes or scales you are trying to best estimate with the EnKF, for example, the terms of a Fourier series/transform?
 Yes.
 Yes, you can get a scale component by running spectral bandpass filters (forward Fourier transform, select wavenumbers, then transform back); it is also possible to run a spatial smoothing operator to get the large scales (like Lanczos filtering).
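 A minimal sketch of such a spectral decomposition on a toy 1-D field (the cutoff wavenumber of 5 is an arbitrary choice here): keep the low wavenumbers for the large scale, and take the residual as the small scale.

```python
import numpy as np

n = 64
x = np.linspace(0, 2 * np.pi, n, endpoint=False)
field = np.sin(2 * x) + 0.3 * np.sin(15 * x)   # large-scale + small-scale signal

# Low-pass: keep wavenumbers |k| <= k_cut, zero the rest, transform back
k = np.fft.fftfreq(n, d=1.0 / n)               # integer wavenumbers for this grid
spec = np.fft.fft(field)
spec[np.abs(k) > 5] = 0.0                      # k_cut = 5 (assumed cutoff)
large = np.real(np.fft.ifft(spec))

small = field - large                          # small scale = full field minus large scale
```

 By construction the two components sum back to the full field, which is why assimilating them separately does not "double count" the data.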
 Are you using/assimilating the same data several times with this multiscale approach? If yes, isn't it an issue? Do you lose the independence? Do you have to describe it in the obs error?
 In terms of using the observation several times, there is no problem, because the scale components being updated are not the same (the small-scale component is the full resolution minus the large-scale component, so it is independent). In a linear error regime, the displacement vectors will reduce the ensemble spread for the small scale, and the EnKF will reduce it again; that is a case of using the obs information twice (which is why we should use the EnKF directly in the linear regime!).
 Low-pass filtering, or spatial/running averaging, are linear transformations. How to reconcile the fact that the KF (or EnKF) already works optimally with linearity with the observed improved performance of the MSA scheme? (PNR)
 The multiscale decomposition sums the impact of each scale component, which is still a linear update. But the alignment step gives a nonlinear solution (a displacement adjustment to the prior ensemble), which is outside the linear regime.
 Can we apply the MSA to other filters, like the particle filter for example?
 (transcribed by LB) Yes, since the MSA applies to the state vector only. That should be interesting with the PF.
 Can MSDA be used for parameter estimation? How can you represent model parameters in spectral space?
 The challenge is to define the location of parameters so that they can be related to the observed variables in a multiscale sense. The correct characterization of the relation between the parameter and observation at a given scale is something we need to explore here.
 I can imagine how it can be done for global model parameters, but what if they are distributed in physical space?
 Yes, if the parameter is closely related to some variables that can be decomposed into scale components, maybe the parameters can take a similar transform as the variable. For example, the land surface drag coefficient is closely related to the surface wind; you can define the drag coefficient at each surface model grid point and update it together with the surface wind.
 Check out Marc Bocquet's recent work on learning global parameters.
 Could you give some references of these methods ?
 The multiscale alignment method is introduced in Ying 2019: https://doi.org/10.1175/MWR-D-19-0170.1. Feature alignment techniques are also in Ravela et al. 2007, Nehrkorn et al. 2013, Stratmann et al. 2018; references can be found in my paper.
 The scale-dependent observation error inflation is presented in Ying 2020: https://doi.org/10.1175/MWR-D-19-0387.1
 I also have an earlier paper on multiscale localization: Ying et al. 2018: https://doi.org/10.1175/MWR-D-17-0336.1. Also see references within, e.g. Li et al. 2009, Buehner and Shlyaeva 2015.
 How do you interpret the notion of spread in spectral space? Do you filter every member then calculate the spread for a given frequency band?
 You can interpret the height of the spectrum line at a given wavenumber as the ensemble variance you would get if you filtered out that wavenumber and computed the domain-averaged variance. The actual way to compute the spread spectrum is to compute the power spectrum of each individual member's deviation from the mean, sum them, and divide by (Nens-1). See the function sprd_spec in the notebook.
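 A minimal sketch of that computation on a toy ensemble (this mirrors the description above, not necessarily the exact `sprd_spec` code in the notebook; the ensemble here is just random noise):

```python
import numpy as np

rng = np.random.default_rng(3)
n_ens, n = 20, 64
ens = rng.standard_normal((n_ens, n))        # toy ensemble of 1-D states

dev = ens - ens.mean(axis=0)                 # each member's deviation from the ensemble mean
# Power spectrum of each deviation, summed over members, divided by (Nens - 1)
sprd_spec = np.sum(np.abs(np.fft.fft(dev, axis=1)) ** 2, axis=0) / (n_ens - 1)

# Parseval check: total spectral power / n equals the domain-summed ensemble variance
var_total = np.sum(np.var(dev, axis=0, ddof=1))
print(np.isclose(sprd_spec.sum() / n, var_total))
```

 The Parseval identity is what justifies reading the spectrum heights as (scale-resolved) ensemble variance.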

**Julien Brajard: Machine Learning**
Link to slides and notebook: https://github.com/nansencenter/mlcrashcourse
 Can we talk about supervised/unsupervised learning in regression problems, where we do not have categories but numerical outputs? Or is supervised/unsupervised only for classification problems?
 Yes, you can consider timeseries forecasting (https://www.tensorflow.org/tutorials/structured_data/time_series) and autoencoder as unsupervised regression (https://www.jeremyjordan.me/autoencoders/).
 When you explained the linear model it looks like a Kalman Filter. What is the difference between the data assimilation problem and the machine learning problem there?
 There is no fundamental difference between machine learning and data assimilation (DA can be seen as a type of machine learning algorithm). If you are interested in the subject, you can have a look at the references given in slide 14 of the pdf document https://github.com/nansencenter/mlcrashcourse/blob/main/crash_course.pdf
 How many data points are required to develop the model? Is there any minimum limit?
 There is no general answer to that. For simple models it is generally accepted that you need more data samples than parameters to estimate, but it depends on the complexity of the underlying model and also on the noise in the data. Deep learning models very often have more parameters than data samples.
 In slide 23 the plot looks very similar to the one shown by Michael of the observation space, which was nonlinear. Does this indicate that these ML models explore the observation space better than the EnKF, which does it linearly?
 ML models can be nonlinear, so expressing nonlinear observation operators in a DA process is one application of ML.
 Is it common to account for correlated measurement errors in the learning process?
 Correlated measurements, and more generally biases in the dataset, are an important problem for machine learning. But if the correlation is known, it can be introduced in the loss function to be minimized, as is done in 4DVar for example.
 Can I fight overfitting by training 2 models on independent data, instead of training once and using the second dataset to check skill?
 Training an ensemble of models is a good idea if you have a big enough dataset. It is done, for example, in a very popular machine learning model, the random forest (https://jakevdp.github.io/PythonDataScienceHandbook/05.08-random-forests.html), but I think it does not remove the need for a validation set, because you still need an assessment of the quality of the model.
 This topic is very interesting. Can you recommend some basic papers/tutorials that give insight into the machine learning algorithms?
 You can have a look at the two references given in slide 2 of the pdf https://github.com/nansencenter/mlcrashcourse/blob/main/crash_course.pdf
 How do you gauge the representativity of the training set for your dataset? Are there any commonly used measurable metrics? I assume that this would strongly decrease the subjectivity of the process, right?
 This is an important problem of machine learning, linked to the problem of biases in machine learning training. Mostly the answer is that this is problem-dependent, but here are two very nice papers that address the problem. They are very specific and do not propose any common metric, though:
 https://aip.scitation.org/doi/abs/10.1063/5.0042598
 https://arxiv.org/abs/2004.07780
 Very good presentation! Degree 3 is better than degrees 1 and 30, but I want to know whether degree 3 is the best. How do you find the best parameter?
 How do you choose the number of layers/neurons/etc.? Can I use GridSearch or there is more about it?
 One way to do it, if you have few hyperparameters (the degree of the polynomial is not a parameter optimized during the training process, so it is called a "hyperparameter"), is to explore all the possibilities (as is done in slide 24, lower figure). This exhaustive approach is called "grid search" (https://jakevdp.github.io/PythonDataScienceHandbook/05.03-hyperparameters-and-model-validation.html#Validation-in-Practice:-Grid-Search). If there are too many hyperparameters, you can do either a "random search" over randomly picked hyperparameters (https://machinelearningmastery.com/hyperparameter-optimization-with-random-search-and-grid-search/), or even better: Bayesian optimisation (http://hyperopt.github.io/hyperopt/).
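 A minimal sketch of such a grid search over the polynomial degree (toy data; the sample sizes, noise level, and degree range are arbitrary choices): fit on a training split, pick the degree with the lowest validation error.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(60)   # toy noisy data

x_tr, y_tr = x[:40], y[:40]                         # training split
x_va, y_va = x[40:], y[40:]                         # validation split (never used for fitting)

def val_error(degree):
    coeffs = np.polyfit(x_tr, y_tr, degree)         # fit on training data only
    return np.mean((np.polyval(coeffs, x_va) - y_va) ** 2)

degrees = range(1, 10)                              # the "grid" of hyperparameter values
errors = [val_error(d) for d in degrees]
best = degrees[int(np.argmin(errors))]
print(best)
```

 The validation error, not the training error, is what identifies the best hyperparameter: training error only decreases with degree, while validation error rises again once the model overfits.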
 Is there any connection between the perceptron formulation and the bootstrapping behind the particle filter? The idea of having weights multiplying different realizations of state vectors.
 importance sampling? I guess the key difference is the network architecture used in ML. (YY)
 Is there an analogue concept of the activation function for the data assimilation methods?
 The EnKF update equation is a linear regression, which is like one neuron with a linear activation function? (YY)
 With linear activation functions, a network with multiple hidden layers is exactly the same as a simple neural network without hidden layers. (KL)
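 This collapse is easy to verify numerically: composing two linear layers yields the weights and bias of a single equivalent linear layer (a toy sketch with random weights):

```python
import numpy as np

rng = np.random.default_rng(5)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)   # hidden layer (linear activation)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)   # output layer

x = rng.standard_normal(3)

# Two stacked layers with identity (linear) activation...
deep = W2 @ (W1 @ x + b1) + b2
# ...equal one single linear layer with composed weights and bias
W, b = W2 @ W1, W2 @ b1 + b2
shallow = W @ x + b

print(np.allclose(deep, shallow))   # no extra expressive power without nonlinearity
```

 This is why a nonlinear activation between layers is essential: without it, depth adds nothing.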
 Hi Michael (YY), when you use your multiscale data assimilation, that introduce nonlinearities for the filter, that would be similar to what an activation function does?
 (YY) I guess the underlying idea is similar, but if you look at the math formulation there are still big differences in architecture. I'd say the nonlinearity those activation functions handle is more general, while "alignment" only handles position-error-related nonlinearities.
 What is the motivation behind the activation function? Map the weighted linear combination of the inputs to a more convenient set?
 to introduce nonlinearity in the "model", i.e. the relation between input and output (transcribed by YY)
 What is the purpose of the learning rate parameter?
 The learning rate is the step size at which the weights are updated along the error gradients; it is also a hyperparameter of the ML model that one can optimize (transcribed by YY)
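A toy illustration (not from the lecture) of what the learning rate controls: the same gradient descent converges or diverges depending on its value.

```python
def train(lr, steps=50):
    """Minimize f(w) = (w - 3)^2 by gradient descent with learning rate lr."""
    w = 0.0
    for _ in range(steps):
        grad = 2 * (w - 3.0)   # df/dw
        w -= lr * grad         # weight update, scaled by the learning rate
    return w

print(train(0.1))   # small enough: converges close to the minimum w = 3
print(train(1.1))   # too large: the iterates overshoot and diverge
```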
 Can the minibatches approach be seen as something similar to the multiple data assimilation approaches? And the epochs from ML like iterative filters?
 (LB) These are different things: the minibatches use different observations with the same characteristics; MDA re-assimilates the same observations with inflated observation errors.
 Maybe epochs are similar to the outer loops in 4DVar?
 Yes, that is similar indeed. But in the case of 4DVar, outer loops are fundamentally different from inner loops. In the case of epochs, all iterations are the same; an epoch is just a convenient way to group several iterations so as to give a measure of the progress of the training.
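A sketch of the bookkeeping, with made-up sizes: one epoch is one full pass over the (shuffled) training set, made of many minibatch iterations.

```python
import numpy as np

n_samples, batch_size, n_epochs = 1000, 32, 3
rng = np.random.default_rng(0)

iterations = 0
for epoch in range(n_epochs):
    # One epoch = one full pass over the shuffled training set
    order = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        batch = order[start:start + batch_size]  # indices of one minibatch
        # ... one gradient step on `batch` would go here ...
        iterations += 1

# ceil(1000 / 32) = 32 iterations per epoch, times 3 epochs
print(iterations)  # 96
```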
 Hi Julien, when you were explaining the loss functions for the regression task and for the classification task, you mentioned that a loss function like least squares is good for regression and cross-entropy is good for classification. Is there a reason behind this? Could cross-entropy also be used for regression?
 There are several reasons linked to theory (maximum likelihood estimation, convexity of the loss function, ...), but mostly the reason is that cross-entropy deals with probabilities, which is convenient for classification. The output of the model in the case of classification (which is called a 'classifier') is the probability for a sample to belong to each class. A good measurement of misfit between probabilities is the cross-entropy. For regression, you cannot easily interpret the output of the model as a probability, so it is not straightforward to define a cross-entropy on a quantitative output (except in the particular case where your output is a probability).
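A small numerical illustration (with made-up probabilities) of why cross-entropy is a natural misfit measure for classifier outputs:

```python
import numpy as np

def cross_entropy(p_true, p_pred, eps=1e-12):
    """Mean cross-entropy between one-hot labels and predicted class probabilities."""
    return -np.mean(np.sum(p_true * np.log(p_pred + eps), axis=1))

# Two samples, three classes; each row of the predictions sums to 1
labels = np.array([[1, 0, 0],
                   [0, 1, 0]])
good = np.array([[0.9, 0.05, 0.05],
                 [0.1, 0.8, 0.1]])
bad = np.array([[0.1, 0.8, 0.1],
                [0.7, 0.2, 0.1]])

print(cross_entropy(labels, good))  # small: confident, correct probabilities
print(cross_entropy(labels, bad))   # large: wrong probabilities are penalized heavily
```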
 When you have the same data and the same architecture (network) you can end up with different trained networks due to the random initialization, right? How do you better assess the performance of the networks in those cases? Especially when methods like backpropagation (gradient descent) can reach different solutions depending on the starting point?
 Yes, this is an important point indeed. First, let's say we don't want the training to be too sensitive to the initialization of the weights, but sometimes this is unavoidable. There has been some research on how to initialize the weights (http://proceedings.mlr.press/v9/glorot10a.html), and the initialization procedure can even be one hyperparameter to tune. Some techniques can make the training less sensitive to the weight initialization (normalization of the input/output, batch normalization, adding small noise to the data, L1/L2 regularization of the weights). If you can afford it, it is also good practice to run several trainings and keep only the best one (measured using a metric on the validation dataset). Note also that weight initialization is not the only randomness in the training: the way the minibatches are drawn at each iteration can also give different optimizations. Finally, you can also build an ensemble of models using various initializations; it gives an idea of the uncertainty of your model.
 is Keras a python package to install in anaconda ?
 Keras is now a submodule of tensorflow. Even if a standalone version of keras still exists, I would advise using tensorflow>=2.0, which you can install using anaconda or pip.
 A linear activation function for multiple hidden layers is exactly the same as a simple neural network without hidden layers.
 True. This is why it is not advised to use a linear activation function in hidden layers. Linear activation is generally used only on the last layer (for regression).
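This collapse is easy to verify numerically; a sketch with random weights:

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # "hidden layer" weights
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)   # "output layer" weights

x = rng.normal(size=3)

# Two stacked layers with linear (identity) activation...
deep = W2 @ (W1 @ x + b1) + b2

# ...collapse to a single linear layer with W = W2 W1 and b = W2 b1 + b2
shallow = (W2 @ W1) @ x + (W2 @ b1 + b2)

print(np.allclose(deep, shallow))  # True
```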
 What is the definition of an epoch?
 I found this blog, which gives a complete and nice explanation of an epoch: https://machinelearningmastery.com/differencebetweenabatchandanepoch/
**Julien Brajard: Machine Learning Notebook**
You can play with all the parameters of the notebook but I would advise focusing on the following:
 Normalization : on/off
 Number of neurons on each layer
 Noise on the training set: try 0 (default), 1e4, 1e3, 1e2
 Size of the training set
If you have feedback/questions regarding the notebook, please ask here.
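For reference, a minimal sketch of what the "normalization" and "noise" switches typically do (this is not the notebook's actual code; the array shapes and the sigma value are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, (200, 1))
y_train = np.sin(2 * np.pi * x_train)

# Normalization: rescale inputs to zero mean and unit variance
mean, std = x_train.mean(axis=0), x_train.std(axis=0)
x_norm = (x_train - mean) / std

# Noise on the training set: add Gaussian noise of amplitude sigma
# (the notebook suggests trying 0, 1e-4, 1e-3, 1e-2)
sigma = 1e-3
y_noisy = y_train + sigma * rng.normal(size=y_train.shape)
```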
Day 4:

**Pico Session: questions to Kyungbook Lee**

**Francois Counillon: DA for climate prediction**
 Does the choice of error metric change the forecast skill? I'd imagine some error measures have more small-scale influence than others.
 Historically, correlation has been used, but now RMSE is also often presented, together with a reliability budget analysis.
 What is the definition of "persistence" forecast? Why is this chosen as a benchmark?
 Persistence is a way to quantify the skill attributable to the simple autocorrelation of the quantity you are trying to predict. If persistence works better than your prediction, you don't need to run an expensive system and can just use persistence.
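A toy illustration (synthetic AR(1) data, not a real forecast system) of persistence as a benchmark:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "truth": an AR(1) process, i.e. a series with strong autocorrelation
n = 500
truth = np.zeros(n)
for t in range(1, n):
    truth[t] = 0.9 * truth[t - 1] + rng.normal()

# Persistence forecast at lead 1: tomorrow equals today
persistence = truth[:-1]
rmse_persistence = np.sqrt(np.mean((truth[1:] - persistence) ** 2))

# Compare with a climatology forecast (the long-term mean)
climatology = truth.mean()
rmse_climatology = np.sqrt(np.mean((truth[1:] - climatology) ** 2))

print(rmse_persistence < rmse_climatology)  # True: persistence is a strong baseline here
```

A forecast system only demonstrates skill if it beats such simple baselines.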
 Will different ensemble members have different level configurations? i.e. the same isopycnal level will be at different physical height among the members?
 Yes, but all members will have the same density.
 What are the requirements to define a coupled data assimilation problem? How do you differentiate it from a problem where you have one area with extreme changes in dynamics and another that is more constant, for example? Is the difference the physical laws that apply to each medium involved?
 In the climate community it relates to whether different components of the Earth system are updated jointly or separately, but you are right that it is closely related to whether two processes are weakly or strongly coupled (in the physical sense). For the ocean and atmosphere it relates to the problem of multiscale data assimilation. You can have a look at this good summary paper https://www.jstor.org/stable/26243775?seq=1

**Annette Samuelsen: DA for marine ecosystem**
Link to presentation, part 1: https://www.dropbox.com/s/bn6j1s5ntxaxll7/DAMarine_ecosystems_part1.pdf?dl=0
Part 2: https://www.dropbox.com/s/koi3twm4r2y1q76/DAMarine_ecosystems_part2.pdf?dl=0
The list of references is on the last page of part 2.
 how do you account for uncertainties of model formulation in ensemble DA? I know if you know the parameters are inaccurate you can perturb them in ensemble, but what if the process itself is not formulated correctly?
 (LB) We always use perturbations of physical conditions (winds, heat fluxes) and ecosystem model parameters. You need to do some sensitivity tests to find out which parameters are actually important.
 AS: in addition to Laurent's answer: if you are uncertain about the model formulation, say for example the functional form of grazing by zooplankton on phytoplankton, it would be possible to make a study with a setup similar to that of Friedrichs et al. (2007), but instead of comparing the performance between different models, one could compare the performance between models using different grazing formulations, to help select the formulation that works best.
 How do you assimilate RS chla? (daily or 8-day)
 (LB) 8-day, because of the heavy cloud cover in our regions.
 are the delay in blooming in the model caused by some biased parameters?
 (LB) Yes, partially because of delays in the physics, but also the biological parameters.
 How nonlinear are the bio models you use? Assuming the ocean and atmosphere forcings are stationary.
 (LB) Very nonlinear. There have been simulations of predator-prey interactions in a completely static environment, and the results are very nonlinear. This is partly because of all the exponentials in the Michaelis-Menten equations.
 (AS) We see limit cycles in very simple systems, for example models that contain 2 or 3 components, but these are not observed in nature.
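To see this nonlinearity in a minimal setting, here is a sketch of a two-component predator-prey (Lotka-Volterra) model in a constant environment; the parameter values and the Euler time-stepping are illustrative, not from any of the models discussed above:

```python
import numpy as np

# Lotka-Volterra: dP/dt = a*P - b*P*Z (prey P), dZ/dt = c*P*Z - d*Z (predator Z)
a, b, c, d = 1.0, 0.5, 0.5, 1.0
dt, n_steps = 0.001, 20000

P, Z = 2.0, 1.0
history = []
for _ in range(n_steps):
    dP = (a * P - b * P * Z) * dt
    dZ = (c * P * Z - d * Z) * dt
    P, Z = P + dP, Z + dZ
    history.append((P, Z))

history = np.array(history)
# Even with constant forcing, the populations keep oscillating instead of settling
print(history[:, 0].min(), history[:, 0].max())
```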
 could you also allow parameters to vary in space?
 (LB) Yes, this was practiced by Simon et al. (2015) and a few others (Doron et al.?).
 You can make the parameters as complex as you want, but you have to be prepared to receiving a very complex answer.
 Are the slides available somewhere?
 (LB) We will put the link on the events.nersc.no page ...
 The links have been added at the top now.
 Did you use Gaussian Anamorphosis also for the nutrient variables? And if yes, did you use the same transformation function?
 Yes, it was applied also to nutrients and the same function is used.
 Do you also use the logarithmic transformation of CHL for the Argo data?
 I assume you refer to the paper by Teruzzi et al. (2021). I consulted this paper, but I could not find this information. From experience I think that would be advantageous, as subsurface chlorophyll has a probability distribution similar to that of surface chlorophyll.
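A minimal sketch of the idea behind the transformation for a chlorophyll-like variable (synthetic lognormal data; the log is one common choice of anamorphosis function, not necessarily the one used in the reanalysis):

```python
import numpy as np

rng = np.random.default_rng(0)

# Chlorophyll-like variable: positive and strongly right-skewed (lognormal)
chl = rng.lognormal(mean=0.0, sigma=1.0, size=10000)

def skewness(x):
    """Sample skewness: third standardized moment."""
    return float(np.mean(((x - x.mean()) / x.std()) ** 3))

# Anamorphosis: transform to a near-Gaussian space, where the (Gaussian)
# analysis update is performed, then map back with the inverse (exp)
chl_gauss = np.log(chl)

print(skewness(chl))        # strongly skewed before the transform
print(skewness(chl_gauss))  # near zero after the transform
```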
 How many ensemble members do you use for BGC variables?
 We use 80 in our current setup, which has the additional smoothing step and is therefore more computationally demanding. In the previous pilot reanalysis (Simon et al. 2015), we used 100.

**Moha El Gharamti: DA in hydrology and streamflow forecast**
 In Streamflow, is the observed variable the same as the modeled?
 Yes, one of WRF-Hydro's state variables is streamflow. This makes things easier when comparing model estimates to the observations. H is essentially an identity here (Moha).
 Hi Moha, can you elaborate on what you mentioned, that localization breaks Bayes? How does it break it?
 $p(x|y) \propto p(x) \cdot p(y|x)$: Bayes' rule indicates that when an observation becomes available, we find the likelihood of this obs given the state. After multiplying the likelihood with the prior, one can obtain the updated value of the state after using the obs. Localization distorts this a little because it limits the number of obs used at the analysis time to only the closest ones (Moha).
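In practice, localization is often implemented by tapering ensemble covariances with a compactly supported correlation function such as Gaspari-Cohn (1999); a sketch (not Moha's or DART's actual code):

```python
import numpy as np

def gaspari_cohn(r):
    """Gaspari-Cohn (1999) compactly supported correlation function.
    r is the distance normalized by the localization half-width."""
    r = np.abs(np.asarray(r, dtype=float))
    taper = np.zeros_like(r)
    inner = r <= 1.0
    outer = (r > 1.0) & (r < 2.0)
    x = r[inner]
    taper[inner] = -0.25 * x**5 + 0.5 * x**4 + 0.625 * x**3 - 5.0 / 3.0 * x**2 + 1.0
    x = r[outer]
    taper[outer] = (x**5 / 12.0 - 0.5 * x**4 + 0.625 * x**3
                    + 5.0 / 3.0 * x**2 - 5.0 * x + 4.0 - 2.0 / (3.0 * x))
    return taper

# Covariances are multiplied by this taper: full weight at distance 0,
# decaying to zero weight beyond twice the half-width
print(gaspari_cohn([0.0, 0.5, 1.0, 2.0, 3.0]))
```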
 Streamflow is sensitive to the location of precipitation. Can the atmospheric model predict the squall lines accurately, so that the hydro component has prediction skill?
 Having a good precipitation estimate from the weather forcing engine is crucial. If the precipitation is wrong, WRF-Hydro's prediction will be poor. We've seen that at a number of gauges in the study. The forcing needs to be right in order to have a skillful streamflow prediction (Moha).
 How do the channel and bucket models differ in their functionality?
 The bucket is a very simple conceptual way to describe the groundwater processes. The channel follows a more sophisticated physical law described by the MC formulation (Moha).
 If you only limit observation impact to individual catchments, they don't overlap, are you sacrificing the observation information content?
 Perhaps, but I'd also be happy I'm not correlating 2 potentially unrelated quantities. The issue of ungauged basins is something that is always discussed in the hydrology community. I personally think that leaving ungauged basins alone is more sound than updating them with gauges that follow a different hydrological regime (Moha).
 For the adaptive covariance inflation, what was the motivation of choosing an inverse Gamma prior for lambda? Is the posterior also an inverse Gamma under these assumptions?
 An inverse-gamma function is just a convenient choice for the inflation factor. It has near-zero probability around zero, grows steeply, and then decays steeply for large inflation values. Previously, the adaptive inflation algorithm used a Gaussian pdf for $\lambda$, and that was not a very smart choice. The posterior inflation is also assumed to follow an inverse-gamma distribution, but the likelihood function is slightly different. More on this can be found here: "https://journals.ametsoc.org/view/journals/mwre/147/7/mwrd180389.1.xml" (Moha).
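For intuition, a small sketch of the inverse-gamma density with arbitrary shape/scale parameters (not the values used in the paper), showing the behaviour described above: near-zero mass at zero, a peak, then a slowly decaying tail.

```python
import math

def inv_gamma_pdf(x, alpha, beta):
    """Density of the inverse-gamma distribution with shape alpha, scale beta."""
    return beta**alpha / math.gamma(alpha) * x**(-alpha - 1) * math.exp(-beta / x)

alpha, beta = 3.0, 4.0
mode = beta / (alpha + 1)  # the density peaks at beta/(alpha+1)

print(inv_gamma_pdf(0.05, alpha, beta))  # essentially zero near the origin
print(inv_gamma_pdf(mode, alpha, beta))  # the maximum
print(inv_gamma_pdf(10.0, alpha, beta))  # heavy right tail
```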
Day 5:

**Rosella Arcucci: Data Assimilation with Machine Learning**
 Do you have any idea what types of physical processes are better handled by machine learning than data assimilation? Like based on their degree of nonlinearity?
 The universal approximation theorem shows that any continuous nonlinear function can be approximated by a neural network, so ML can always bring some speedup. (Transcr. LB)
 (LB) It is difficult to make multivariate transformations that preserve all relationships, linear and nonlinear. Does the encoder respect all nonlinearities or only a selected subset of the nonlinearities? What happens to the linear relationships?
 There are many hyperparameters in an autoencoder, so depending on our prior knowledge of linear/nonlinear relationships it is possible to linearize more or less (transcr. LB)
 In your case, does PCA achieve some kind of hidden localization? Did you compare with usual localization techniques?
 (LB transcripts) Yes localisation is applied with domain decomposition to reduce the dimensionality. There is an overlap between the subdomains and information passed between them.
 (LB) How far can we extrapolate Neural Nets: Would you trust a future climate simulation from a NN?
 Definitely yes if it comes from qualified people.
 (Julien B.) It is related to the discussion about the representativeness of training set in machine learning. You can have a look at this recent paper:  https://aip.scitation.org/doi/abs/10.1063/5.0042598
 How do the combination of data assimilation with neural networks impact hyperparameters like the number of layers in for example deep learning applications?
 (LB transcripts) It makes the backpropagation step much faster.
 How can we account for the errors made by the neural networks in the data assimilation, is it a type of model error?
 (LB transcripts) Model errors are accounted just as in standard DA, by a Gaussian noise.
 (YY) Say we have a really erroneous model, can machine learning help detecting what component in the model is wrong given the data?
 The parameter can be estimated by the algorithm if there is an error in it. (transcr. YY)
 Is the code available somewhere?
 https://github.com/DLWG is the project page
 I don't understand the difference between a surrogate of a physical model and the NN learning from the CFD runs. Both learn from the equations, don't they?
 (LB transc.) Yes, they are two sides of the same coin.

**Alberto Carrassi: DA in chaotic systems**
 In a large-dimensional system, could some places in the state be stable while other places are unstable? If so, how do we evaluate the overall stability of the system?
 This is possible of course. One way is to look at the amplitude of the unstable modes along the state vector. In fact this is also a way to identify regions (or model subcompartments) where instabilities originate or dissipate (see, e.g., Fig. 5 in this work https://link.springer.com/content/pdf/10.1007/s1095502002525z.pdf)
 Would the manifolds simplify to linear subspaces if we assume linear autonomous dynamics?
 Yes, absolutely. The linear subspaces are much simpler though because they cannot be unstable. (YY transc.) Oops: let me rectify this. It seems that in the transcription the idea was wrongly reported. A linear system can of course be unstable, and in fact many of the analytic results about the performance of the Kalman filter in relation to model dynamics are obtained for unstable linear systems (see e.g. https://epubs.siam.org/doi/10.1137/16M1068712). In relation to the specific question: yes, for a linear system the unstable manifold will "just" be an unstable subspace (whereby it is easier to construct a basis).
 (YY) Can we tell if a system is behaving like a exponential growth or periodic change, if the period is much longer than the simulation time scale we have? Say we cannot afford to run simulation long enough, or we don't know the truth yet (climate prediction).
 I think the problem mentioned here is the general problem of time sampling. We only know the system via observations, so yes, we can very well (in theory) be incapable of detecting the existence of an underlying long period if we have not yet collected observations over a time lapse longer than the period.
 How to relate ergodicity and gaussianity in relation with DA?
 This is a difficult question and, as far as I know, one of those research questions that has been in the air for some time already. Let me just say that for ergodic systems we know the existence of an invariant distribution describing the points (system realizations) on the attractor. Thus, when doing DA, we should hope and aim to produce analysis states that belong to the invariant attractor distribution.

**Final Debrief, suggestion box**, any questions that you didn't have a chance to ask? Or did you try out the tutorial notebooks and have encountered issues? Put them here.
Also put general suggestions about the course and what we should change for the next edition (a priori in two years from now):
 version of python libraries in notebook? so it will always run correctly
 Is this sufficient? https://github.com/nansencenter/mlcrashcourse/blob/main/requirements.txt
 (MG) If it helps I've been using python 3.8.5, the default on Ubuntu 20.04, and whatever libraries are installed by default with pip. So far everything seems to have worked.
 (LB) Thanks, Malek
 matplotlib version (newest version has problem?)
 Hey @Patrick, works fine with Python 3.8... No idea what would cause the failure with Python 3.9 though, might also be a different version number of a dependency or so...
 Yes, you are right, probably an issue in a dependency. Thanks for pinning this down. I will restrict the python version in the readme.
 ISDA courses, what are the differences between this course and the other available ones
 ISDA online https://isdaonline.univie.ac.at/ some good talks. The next event will be in September, after the summer break.
 (LB) That is right, I have received the announcement as well. A very good lineup!
 (LB) To answer your question: there are probably several other courses in DA; we have a special emphasis on ensemble techniques because of the history of our work