# Challenge 2
## Understanding the fundamental limits of information fusion
#### Participants
Richard Dakin DSTL
Rich Green DSTL
Stuart Scourfield DSTL
Paul Thomas DSTL
Olly Johnson O.Johnson@bristol.ac.uk
Carola Schönlieb cbs31@cam.ac.uk
Alan Champneys a.r.champneys@bristol.ac.uk
Martha Lewis martha.lewis@bristol.ac.uk
Zahraa Abdallah zahraa.abdallah@bristol.ac.uk
Ioannis Kontoyiannis yiannis@maths.cam.ac.uk
Ramji Venkataramanan rv285@cam.ac.uk
Wenwu Wang w.wang@surrey.ac.uk
Cathryn Mitchell c.n.mitchell@bath.ac.uk
## Location of various bits and pieces
+ Overleaf
https://www.overleaf.com/9557732713qqwtgwvgfrjb
+ Whiteboardfox
https://r3.whiteboardfox.com/3711884-9561-1120
+ Mural
https://app.mural.co/t/newtoninstitute6161/m/newtoninstitute6161/1599474290519/cc211c06eba958e131258baba43700a1ed5f0d96
## Initial Ideas
What is a normal pattern? What is a change?
Simplifying idea: a static problem, a Bayesian problem where we are trying to update the landscape.
Whole area of anomaly detection in time series: when have the statistics of the underlying data changed? Also change-point detection.
Fusion might have lots of applications e.g. to anomaly detection.
Edge analytics: is it done on the edge, in a centralised place, in the cloud, or in the fog?
Can we access the conditional distribution of all the kinds of observations we might observe? Then we can use mutual information.
Finding a child in 2D. Does a second sensor give you orthogonal information or correlated information? Using information theory: what information is the second sensor bringing you that the first sensor doesn't?
How should we represent the data? Is it probabilistic, or information theoretic?
Is it crisp data, or is it uncertain?
Streams of numbers? Higher level?
Classifying data types. Information? Numbers?
Think about different sources and their reliability.
How to model the data sources?
It's easier to do *something* about the problem than to describe reality, e.g. image compression: there is no statistical model for an image, yet we can do a lot.
So we don't need a perfect description of reality as we observe it.
Non-numerical data is important. Categorical variables in statistics, once you know what you are interested in; rigorous tools exist. Think more broadly than linear Gaussian models.
Information theory problems. Fundamental limits can give precise answers in the case of simple sources of uncertainty (e.g. the channel); the complexity comes from the way we do the protocol. But in data fusion, we don't have such a perfect description.
Recent work on mapping Boolean, categorical, etc. data to Euclidean space with a loss function.
Problem 1. Find the child - this is like a navigation problem, inconsistent satellite measurements.
Problem 2. Predict the child's properties or behaviour. How do we categorise?
Ramji example: survey with missing data. There is a loss function. E.g. PCA using a matrix with real numbers. What is a sensible mapping?
Loss function is measure of difference from what we expect. Use algorithm to predict and then find error.
https://people.orie.cornell.edu/mru8/doc/udell15_pca_dataframe.pdf
In some narrow settings there are some optimality conditions. Can we identify a class of problems we would like to solve?
A word of caution from image compression: we have known the fundamental limits since the 1950s, but algorithms have essentially no connection with that theory. Practical algorithms are nowhere near this bound.
What do we want of our fused data? What would close to reality mean? What is our measure of "close"?
In the past few years neural networks have given us good priors for images.
Use a trained neural network as a prior then use rigorous statistics on top of it.
Notion in language of putting things next to each other. Process, then decide how to combine.
First step: what does putting next to each other mean? At a very basic level, treat them as independent. Can't do this if the types of information are different, e.g. image and position?
This goes back to image process/navigation/tracking problem.
Suppose we have identified that there is a red object. So that gives categorical (in the statistics sense) inference over colour, rather than mapping into 3D space.
Two different meanings of "categorical". Wiki page for categorical data analysis:
https://en.wikipedia.org/wiki/List_of_analyses_of_categorical_data
Can everything be turned into probability distributions if you have a sense of independence?
The kind of inputs we have is something we don't know for sure. That is when we think about feature extraction being a function of the input. Can we learn which features that we want to extract, in an adaptable way. Can this be thought of as a sort of reinforcement learning?
How can we assume that these features are adaptable?
From machine learning there are some common features but the model might come up with something that we didn't expect as the kind of features.
Explainable AI, what made the algorithm make these decisions, we might find other features that are more important.
Ioannis: there are situations where two pieces of information come from the same source rather than "orthogonal" sources.
E.g. a very rare disease. More effective to test one person twice with a noisy test.
We have few examples, and nothing like a general theory. Can we make a general theory of this?
## Sub Group 1 - information theoretic ideas
### Toy Bayesian probability model
Toy model: consider $X$ the position of the child we want to find, and suppose we have a prior density $f(x)$.
Suppose we have two sensors that produce outputs $Z_1$ and $Z_2$. Imagine that, when the child is at position $x$, the conditional distribution of $(z_1,z_2)$ is $h( z_1, z_2 | x)$.
We can express the density of the child's position given the output of the
sensors as
$$ g(x | z_1, z_2) = \frac{ f(x) h(z_1, z_2 | x)}
{\int f(u) h( z_1, z_2 |u) du}.$$
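A minimal numerical sketch of this posterior update (our own illustration: it assumes a standard normal prior and two conditionally independent Gaussian sensors, neither of which is fixed by the notes above):
```python
import numpy as np
from scipy.stats import norm

# Posterior g(x | z1, z2) on a grid, for an assumed prior f and sensor model h.
x = np.linspace(-5, 5, 2001)
f = norm.pdf(x)                          # assumed prior f(x): standard normal
z1, z2 = 0.8, 1.1                        # example sensor outputs
# assumed h: conditionally independent Gaussian sensors with sd 1.0 and 0.5
h = norm.pdf(z1, loc=x, scale=1.0) * norm.pdf(z2, loc=x, scale=0.5)
g = f * h
g /= np.trapz(g, x)                      # normalise by \int f(u) h(z1, z2 | u) du
print("posterior mean:", np.trapz(x * g, x))
```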
One fundamental limit is the Shannon capacity (mutual information) of the channel from $X$ to $(Z_1,Z_2)$. What can we say about how the capacity of this channel depends on the form of $h$? e.g. is it the case that if the sensors are conditionally independent - i.e.
$$ h(z_1, z_2 | x) = h_1(z_1 | x) h_2( z_2 | x)$$
then that's the most informative case?
Very toy example: consider $X$ standard Gaussian on the real line, and for $X=x$ suppose
$(Z_1, Z_2) \sim N( (x,x), \Sigma)$ [so $Z_1 = X + N_1$ and $Z_2 = X + N_2$, where $(N_1, N_2)$ is bivariate
normal with mean 0 and covariance $\Sigma$].
*Estimation* point of view: The minimum mean-squared error (MMSE) estimator of $X$ given $Z= [Z_1, Z_2]^{T}$ is
$$
\hat{X}^{\rm MMSE}(Z_1, Z_2) = \mathbf{1}^T (\mathbf{1}\mathbf{1}^T + \Sigma)^{-1}Z,
$$
where $\mathbf{1} = (1,1)^T$, so that $\mathbf{1}\mathbf{1}^T$ is the $2 \times 2$ all-ones matrix.
Special case: Suppose $N_2 = c N_1$. So long as $c \neq 1$, we can find $X$ directly, but in the case
$c=1$ we can't. Trick:
$$ c Z_1 - Z_2 = c(X+N_1) - (X+c N_1) = (c-1)X,$$
so if $c \neq 1$ we can perfectly estimate $X$ using a linear combination.
This is exactly what you get using the MMSE formula above, since we know
$$ \Sigma = \left( \begin{array}{cc} 1 & c \\ c & c^2 \end{array} \right) $$
Therefore, independent looks at the data are not necessarily optimal.
This is also consistent with the information-theoretic point of view.
*Information-theoretic* point of view: mutual information is $H(Z) - H(Z |X)$, and we know that the entropy of a bivariate normal with covariance matrix $C$ is $\log(2 \pi e) + \frac{1}{2}\log \det C$. Here, $Z|X = x$ has covariance matrix $\Sigma$, and $Z$ has covariance matrix $\mathbf{1}\mathbf{1}^T + \Sigma$.
So, the mutual information in general is
$$ I(X; Z) = \frac{1}{2}\log \left( \det( \mathbf{1}\mathbf{1}^T + \Sigma)/\det \Sigma \right) =
\frac{1}{2}\log\left( 1 - \frac{\Sigma_{11} - 2 \Sigma_{12} + \Sigma_{22}}{\Sigma_{12}^2 - \Sigma_{11} \Sigma_{22}} \right).$$
We can see that the 'shape' of $\Sigma$ is telling us how easy it is to recover $X$. In particular, in the toy example above where $N_2 = c N_1$, we can see that
$$ I(X;Z) = \frac{1}{2}\log \left( 1- \frac{(1-c)^2}{c^2 - 1 \times c^2} \right),$$
where the denominator vanishes because $\Sigma$ is singular.
So unless $c=1$, the mutual information is *infinite*, meaning that we can recover $X$ perfectly.
But in general, this mutual information is quantifying how much information we've gained about $X$ by learning the outputs of the sensors, and hence gives fundamental bounds on our ability to recover $X$.
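A numerical sketch of both points of view, for an example non-singular $\Sigma$ of our choosing; the Monte Carlo run just sanity-checks the closed-form estimator:
```python
import numpy as np

Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])            # example sensor-noise covariance
J = np.ones((2, 2))                       # the all-ones matrix 1 1^T

# MMSE estimator: E[X | Z] = 1^T (J + Sigma)^{-1} Z
coeffs = np.ones(2) @ np.linalg.inv(J + Sigma)
print("MMSE coefficients:", coeffs)

# Mutual information I(X; Z) = (1/2) log( det(J + Sigma) / det(Sigma) ), in nats
I = 0.5 * np.log(np.linalg.det(J + Sigma) / np.linalg.det(Sigma))
print("I(X; Z):", I)

# Monte Carlo check of the estimator's mean-squared error
rng = np.random.default_rng(0)
n = 100_000
X = rng.standard_normal(n)
Z = X[:, None] + rng.multivariate_normal([0, 0], Sigma, size=n)
print("empirical MSE:", np.mean((Z @ coeffs - X) ** 2))
```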
Information theory can also be used to answer the following question:
-- If the true distribution $P(X, Z_1, Z_2)$ of the data is unknown, but is known to lie in some set of distributions $\mathcal{P}$, then we can design an estimator based on a *postulated* distribution. What is the postulated distribution that leads to the best estimation performance within $\mathcal{P}$?
-----------
### Yiannis's problem (see figure below):
There are situations where two pieces of information come from the same source rather than "orthogonal" sources.
E.g. a very rare disease. Under certain circumstances, it may be more effective to test one person twice with a noisy test.
We have few examples, and nothing like a general theory. Can we make a general theory of this?

Consider $X \in \{ 1, -1\}$ with $P(X=1)=p$, where the incidence probability $p$ is a small value.
In the first model, we have:
$$
Z_1 = X + N_1 + \tilde{N}_1, \qquad Z_2 = X + N_2 + \tilde{N}_2.
$$
In the second model, we have:
$$
Z_1 = X + N_1 + \tilde{N}_1, \qquad Z_2 = X + N_1 + \tilde{N_2}.
$$
We can compute the mutual information $I(X; Z_1, Z_2)$ to understand which model allows us to estimate $X$ more accurately -- this will depend on the incidence probability $p$.
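The notes do not fix the noise law; assuming for illustration that all of $N_1, N_2, \tilde{N}_1, \tilde{N}_2$ are independent $N(0,\sigma^2)$, the two models differ only in the covariance of the total noise on $(Z_1, Z_2)$, and the comparison can be sketched by Monte Carlo:
```python
import numpy as np
from scipy.stats import multivariate_normal

def mi(p, cov, n=200_000, seed=1):
    """Monte Carlo estimate of I(X; Z1, Z2) for X in {+1,-1}, P(X=1)=p."""
    rng = np.random.default_rng(seed)
    X = np.where(rng.random(n) < p, 1.0, -1.0)
    Z = X[:, None] + rng.multivariate_normal([0, 0], cov, size=n)
    # posterior P(X=1 | Z) via Bayes with the two Gaussian likelihoods
    like_p = multivariate_normal([1, 1], cov).pdf(Z)
    like_m = multivariate_normal([-1, -1], cov).pdf(Z)
    post = p * like_p / (p * like_p + (1 - p) * like_m)
    def h(q):  # binary entropy in nats, clipped away from 0 and 1
        q = np.clip(q, 1e-12, 1 - 1e-12)
        return -q * np.log(q) - (1 - q) * np.log(1 - q)
    return h(p) - np.mean(h(post))        # I = H(X) - H(X | Z)

sigma2 = 1.0
cov1 = sigma2 * np.array([[2.0, 0.0], [0.0, 2.0]])  # model 1: independent noise
cov2 = sigma2 * np.array([[2.0, 1.0], [1.0, 2.0]])  # model 2: N1 shared by both tests
for p in (0.5, 0.01):
    print(f"p={p}: I_model1={mi(p, cov1):.4f}, I_model2={mi(p, cov2):.4f}")
```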
----------------------
Toy problem V2:
Here there are two different kinds of observations to be fused. Let $X$ be the variable, say location. Observation $Z_1$ is Boolean, and $Z_2$ is real-valued. The model is defined in terms of a hidden Boolean random variable $U$ which is +1 or -1 with equal probability. Then the location is
$$X =U + W, \qquad W \sim N(0,1). $$
For example, $U$ models whether the child is hungry or not, which may influence the location they are likely to be around.
(Thus the location distribution is a mixture model, with $U$ being unknown.)
The two observations $Z_1, Z_2$ are generated as:
$$
Z_1 = \begin{cases}U \quad \text{ with probability} \ (1-p), \\
-U \quad \text{ with probability} \ p.
\end{cases}
$$
$$
Z_2 = X + V, \qquad V \sim N(0, \tau^2).
$$
A key question is how much better it is to estimate $X$ using *both* $Z_1$ and $Z_2$ compared to just $Z_2$?
We can analyze this through both estimation theory and information theory.
*Estimation Theory*:
The minimum mean-square error (MMSE) estimate of $X$ based on both observations is $E\{ X | Z_1, Z_2 \}$ and based on just $Z_2$ is $E\{ X|Z_2 \}$. The corresponding errors are:
$$
MMSE_{Z_1, Z_2} = E\{X^2\} - E\{ (E\{ X | Z_1, Z_2 \})^2\}, \qquad
MMSE_{Z_2} = E\{X^2\} - E\{(E\{ X | Z_2 \})^2\},
$$
where $E\{X^2\} = {\rm Var}(U) + {\rm Var}(W) = 2$ for this model.
The improvement in estimation performance by using both $(Z_1, Z_2)$ compared to just $Z_2$ is:
$$
MMSE_{Z_2} - MMSE_{Z_1, Z_2} =
E\{ (E\{ X | Z_1, Z_2 \})^2\} - E\{(E\{ X | Z_2 \})^2\}.
$$
For this toy model we can write down the MMSE estimators and the performance gain. We have
$$
E\{ X | Z_1=z_1, Z_2=z_2 \} = \frac{\int_\mathbb{R} x \, p(x, z_1, z_2) dx}{p(z_1, z_2)}, \quad \text{ and } \quad
E\{ X | Z_2=z_2 \} = \frac{\int_\mathbb{R} x \, p(x, z_2) dx}{p( z_2)}.
$$
The joint probability distribution $p(x, z_1, z_2)$ can be computed as follows.
$$
p(x, z_1, z_2) = \frac{1}{2} p(x, z_1, z_2| U=1) + \frac{1}{2} p(x, z_1, z_2| U=-1).
$$
Writing $\phi(\cdot)$ for the standard normal density and $\delta(\cdot)$ for the indicator (Kronecker delta) function, we have for $u \in \{ 1, -1\}$:
$$
p(x, z_1, z_2 | U=u) = \phi(x-u)\,\tfrac{1}{\tau}\phi((z_2 -x)/\tau)\cdot [\delta(z_1=u) (1-p) + \delta(z_1=-u)p ].
$$
An information-theory approach would calculate the mutual information $I(X; Z_1, Z_2)$ and the improvement by using both observations compared to just $Z_2$ is quantified by
$I(X; Z_1, Z_2) - I(X; Z_2)$. This can be computed using the probability distribution derived above.
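A Monte Carlo sketch of the estimation-theory comparison for this toy model; the values of $p$, $\tau$ and the sample size are arbitrary choices. It uses the closed-form conditional means given $U$ (since $X \mid U=u \sim N(u,1)$ and $Z_2 = X + V$):
```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
p, tau, n = 0.1, 1.0, 200_000

U = rng.choice([1.0, -1.0], size=n)
X = U + rng.standard_normal(n)                   # X = U + W
Z1 = np.where(rng.random(n) < p, -U, U)          # Z1: U flipped with probability p
Z2 = X + tau * rng.standard_normal(n)            # Z2 = X + V

s = np.sqrt(1 + tau**2)                          # sd of Z2 given U
lik = {u: norm.pdf(Z2, loc=u, scale=s) for u in (1.0, -1.0)}
flip = {u: np.where(Z1 == u, 1 - p, p) for u in (1.0, -1.0)}
cmean = {u: u + (Z2 - u) / (1 + tau**2) for u in (1.0, -1.0)}  # E[X | U=u, Z2]

def post_mean(w):
    """E[X | observations]: average E[X | U, Z2] over the posterior on U."""
    return (w[1.0] * cmean[1.0] + w[-1.0] * cmean[-1.0]) / (w[1.0] + w[-1.0])

mmse_both = np.mean((X - post_mean({u: flip[u] * lik[u] for u in (1.0, -1.0)})) ** 2)
mmse_z2 = np.mean((X - post_mean(lik)) ** 2)
print(f"MMSE(Z1,Z2)={mmse_both:.4f}, MMSE(Z2)={mmse_z2:.4f}, gain={mmse_z2 - mmse_both:.4f}")
```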
-----------------------
------------------
## Sub Group 2 - Heterogeneous data fusion
- Can we use category theory to describe non-numerical data types (categorical, Boolean, etc.)?
- What are the advantages and disadvantages of mapping everything to a Euclidean space?
- Can we use reinforcement learning to learn categories?
- Should we use machine learning to learn a Bayesian prior (as in modern ideas in image analysis)?
- At what stage in a multi-step data fusion process is it best to use AI/machine learning? At the bottom level or at a higher system level?
- Can we learn from the social sciences, especially in real-time dynamic modelling?
- It looks like physics-based modelling.
- How does the Met Office do it? Physics and observations.
Data assimilation provides a framework to combine the different information and to predict the (dynamic) movement of, for example, a person.
Still a big challenge to create equations for 'people' moving around, and to represent concepts in a computer model.
Inspiration from neuroscience: humans do this all the time.
Wenwu: experience in audio and video data fusion.
Using features, there are many techniques available:
Can fuse features by concatenation, then
build a model using these hybrid features.
This is lower-level fusion. There are also mid-level and higher-level features using different transforms, e.g. deep learning.
There is also an output layer. There is hierarchical information. There are issues in audio with representing the voice: is this the best way to store the information?
The dimension of the information and the sampling rates can be very different. How to represent?
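A toy sketch of the two fusion levels just described, on made-up feature arrays; the dimensions and the score-averaging rule are assumptions for illustration, not a prescribed method:
```python
import numpy as np

rng = np.random.default_rng(3)
audio = rng.standard_normal((100, 20))    # e.g. 20-dim audio features per frame
video = rng.standard_normal((100, 512))   # e.g. 512-dim video features per frame

# Low-level (early) fusion: concatenate, then build one model on hybrid features
fused_features = np.concatenate([audio, video], axis=1)   # shape (100, 532)

# Higher-level (late) fusion: separate per-modality models, then combine scores
score_audio = audio.mean(axis=1)          # stand-in for an audio model's output
score_video = video.mean(axis=1)          # stand-in for a video model's output
fused_score = 0.5 * score_audio + 0.5 * score_video
```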
Deep learning runs as a black-box system with no understanding/analysability.
What does a perturbation of the input layer lead to in the output? Interpretability is a hot topic.
A Bayesian feature model (for tracking) collects information from sensors and estimates the 3D positions of the speakers using a Bayesian approach with random finite set theory, where the number of targets and the states of those targets are unknown.
The difficulty is getting to a posterior distribution, using particle filtering for example: use information from the previous time frame to create multiple particles around the state, each a noisy version of the state.
From this, we can get measurement information in the images. In audio this is the direction of the speaker from the microphone array; in video it's the location. Then use this to generate a posterior step by step.
For multi-target tracking, popular frameworks include the random finite set, and its first order approximation such as probability hypothesis density (PHD) filter.
The issue is the particle distribution, so that some particles may have very small weight. We need to use resampling techniques.
Should we describe some extreme cases? Fully determined, e.g. GPS position? Conflicting information, e.g. camera and GPS disagree at the same time? Overdetermined and conflicting or in agreement? Then sparse, sporadic sightings, time dependent. Then also how to identify strange behaviours within the data.
Another extreme, ill defined task. e.g. Is this person a threat?
Categorical data problems
Grant held by Kristian M. at Imperial College on identifying and tracking a target using news etc.
https://gow.epsrc.ukri.org/NGBOViewGrant.aspx?GrantRef=EP/S032398/1
Carola: In healthcare there is also data that is categorical.
Previously suggested approach for representing different data types for a simple PCA
https://people.orie.cornell.edu/mru8/doc/udell15_pca_dataframe.pdf
Different lenses; what generality are we asking for?
Fusion at which level: bottom, mid or top?
How can we formulate the environment? This changes over time, what are the limitations of the fusion, to understand the environment in a better way. Then we can use reinforcement learning. Which features at which time, are more important/reliable?
E.g. camera is poor in the evening, or there are lots of other GPS signals around.
Seems we have a number of aspects:
-what mathematical framework? time dependent or static.
-What problems can be posed?
-What are the data types? E.g. expectations of behaviour.
-Changes in environment?
Spaces to our discussions are (i) set of problems (ii) set of solutions and (iii) set of data input types
Need to consider error space too - data assimilation does this already so that is a plus - confidence in the data is critical to getting the right solution
Anomalous behaviour can be an interesting paradigm. When does a measurement differ from what is expected? The current state of the art is not quite right. Josef Kittler at Surrey.
Snap-to-grid GPS problems. Errors can be at the fundamental model level in the data source.
Problem A
Lost child with different sources of information that may be incomplete at times and may be conflicting - can we find the full path of the child?
Problem B
Data sources are different, e.g. visual, audio and conceptual, and we need to combine them - behavioural, categorical and audio/visual - but the question is still where is the child.
Problem C.
The question is itself ill defined, e.g. is this child safe? Are they lost? Are they a threat? A model is needed to obtain the input AND to interpret the output too.
A big challenge is how to make a decision about how to fuse the data to get the very best from it all.
##### Problem A
Kalman filter or other time-dependent adaptive filters. This will come up with a solution, but we don't know what the optimal solution is. There can be some non-numerical data that comes, for example, from expert judgement.
This is tackled a lot in, for example, autonomous vehicle algorithms that combine visual and GPS.
When is the data so bad that the solution is no better than the null model?
Can we look at the sensitivity of the output to different sources? Can we learn about the trustworthiness of the source? Perhaps ensemble learning.
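A minimal sketch of such a filter for Problem A, assuming a 1-D random-walk motion model and two sensors with known noise variances (all numbers invented, not DSTL's setting):
```python
import numpy as np

def update(x, P, z, R):
    """Scalar Kalman measurement update: observation z with noise variance R."""
    K = P / (P + R)                              # Kalman gain
    return x + K * (z - x), (1 - K) * P

rng = np.random.default_rng(4)
truth, x, P, q = 0.0, 0.0, 1.0, 0.09             # state, estimate, variance, process noise
for t in range(50):
    truth += rng.normal(0, 0.3)                  # the child wanders (random walk)
    x, P = x, P + q                              # predict step
    x, P = update(x, P, truth + rng.normal(0, 1.0), R=1.0)   # noisier GPS-like sensor
    x, P = update(x, P, truth + rng.normal(0, 0.5), R=0.25)  # camera-like sensor
print(f"estimate {x:.2f}, truth {truth:.2f}, posterior variance {P:.3f}")
```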
The ultimate limit is a factor of
(i) the estimates of the errors being wrong,
(ii) but avoiding making them too large.
Dog and husband model - may be together and may not be - therefore need the right error model
Also remember what was said yesterday about whether it is better to have two different data sources or the same data twice?
Every sensor output is really not raw data; it is the result of a model with some implicit assumptions about what the sensor is telling us.
What if one of the sensors has non-Gaussian systematic errors, or one of the sensors has "gone rogue"? E.g. the dog has rolled in "it" and the sensor is detached, or the husband is chasing sheep not the dog.
High importance on anomalous behaviour.
Novelty detection is different from anomaly detection. What is the difference?
What is normal behaviour?
E.g. for smart home environments for social care. What is the difference between a fall and cooking?
Ensemble approach - helpful to use all algorithms and models to increase the chance of detection. Makes a lot of sense if you are interested in the detection of any possible outlier that might be missed by some approaches but you don't know which approach is better for a given situation.
This is a paper for ensemble classification
https://dspace.cvut.cz/bitstream/handle/10467/9443/1998-On-combining-classifiers.pdf?sequence=1
Here is a paper about what happens if different experts disagree
http://personal.ee.surrey.ac.uk/Personal/W.Wang/papers/KittlerZKHW_PatternRecognition_2018.pdf
Active learning and human in the loop.
Limits: sensors can go wrong. Ideally need independent confirmation of an observation. Even GPS is supported with a 'model' within an algorithm to find the position from the raw observations.
Limits: assumptions can be wrong.
Limits: weightings can be wrong within algorithms
Adversarial sensing: adversarial examples that can make a system/model fail or break down.
Backwards error analysis. 'My solution is the precise answer to a different problem'
'Where is the problem I am solving easiest to crack?' - this seems to be problem specific.
GPS spoofing can fool algorithms even with an ensemble approach.
-we might be collecting the "wrong" data from the beginning
-what are we fusing? What is not worth fusing?
-we might be collecting the right data but not solving the right problem. The target may not be connected to the data that we are collecting.
- what goes wrong if the error is too large?
You can produce artefacts.
e.g. X-ray: a bit of the image with only one measurement.
Underdetermined problems (sparse data problems),
e.g. conflicting information that we don't realise is conflicting;
e.g. discrete-time binning can lead to false answers, when in fact there is an error.
Then you revert to a model.
Double peak of the solution could be interpreted dynamically or as a probability measure.
Problem formulation and confidence in data is important.
Can we predict which model is more reliable based on the data we have?
Can we learn from video/sound tracking? Video tracking is usually more precise but can suffer from catastrophic failure when occlusion happens.
There is a cost in the representation and analysis of high-dimensional data. Is it worth it?
Can we think of a decentralised algorithm at each device that reduces the dimension?
Audio-visual tracking (a PhD thesis):
https://core.ac.uk/download/pdf/30342844.pdf
A survey chapter on audio-visual speaker tracking is here:
https://www.intechopen.com/books/motion-tracking-and-gesture-recognition/audio-visual-speaker-tracking
Time dependent probability distribution of the solution
Data assimilation for the atmosphere assumes the current model is the best guess so far.
4D Kalman filter - model
- data point updates
- adaptive model based on the measurements
- what if your model is dynamic, meaning the model is influenced by data
- Gauss-Markov approach: every time you restart, you restart everything including the model.
Learning from data streams with adaptive sliding window https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.2279&rep=rep1&type=pdf
Where can this go wrong? With misplaced confidence in data and in model.
4D Var is the state of the art in numerical weather forecasting. Solve for a window of time.
Kalman example: https://www.annalsofgeophysics.eu/index.php/annals/article/view/3074/3117
When does the algorithm give me a worse answer than not using it at all?
A. When my assumption of the model or my model of error was wrong.
BUT
we might have multiple states of possible location at the same time, and we therefore need to allow the probability of position to jump across different possible 'tracks' - sometimes multiple solutions are acceptable.
Italian flag method of communicating uncertainty
Can also direct which information you might want to try to get next - what is the value of the information you need, and is it worth collecting? E.g. I need an image from this region. Also a 'null' result can tell you as much as a sighting in some types of scenario. Ruling out or ruling in false assumptions is sometimes all that is needed to collapse a solution (i.e. to solve it).
-------------------------------
### day 2 afternoon
Feature extraction from information: how can we move from concepts to mathematical models. At what level do we 'fuse'? Raw data? ML approaches. Features are regions of space. Or movements in space.
Lucky case - features are convex. May be more tricky to deal with. Some cases might be relatively simple e.g. the right colour can be given a 'pixel' of probability for our space.
The toy problem can be extended of course: we may know the set of probability distributions and minimise the 'worst case'.
Approach to data merging:
Can run the data streams in parallel and fuse at the end, keeping them separate so that it is clear what the data mean and where they come from.
Merging too soon is sometimes a pitfall, because you might lose the important point in the overall noise of the mass of data coming in.
Concept of random finite sets from
https://www.intechopen.com/books/motion-tracking-and-gesture-recognition/audio-visual-speaker-tracking

Applies to nonlinear, non-Gaussian models and does not need the number of tracked targets to be known a priori.
Particle resampling.
Also look at RAIM algorithms in GPS - simple algorithms to detect when there is a failure of a satellite, because you have more satellites than you need to give a solution. These could be directly adapted to the current problem.
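A hedged sketch of the RAIM-style idea: with more measurements than unknowns, drop each sensor in turn and see which exclusion makes the remaining data self-consistent. The linear measurement model and noise levels are invented for illustration:
```python
import numpy as np

rng = np.random.default_rng(5)
true_xy = np.array([3.0, -1.0])
H = rng.standard_normal((6, 2))                  # 6 sensors measuring 2 unknowns
z = H @ true_xy + rng.normal(0, 0.05, size=6)
z[2] += 5.0                                      # sensor 2 has "gone rogue"

norms = []
for i in range(6):                               # leave-one-out residual test
    keep = np.arange(6) != i
    est, *_ = np.linalg.lstsq(H[keep], z[keep], rcond=None)
    norms.append(np.linalg.norm(z[keep] - H[keep] @ est))
print("suspect sensor:", int(np.argmin(norms)))  # expect 2
```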
Limits: if we have copious measurements we might need to reduce the observations to cut down model complexity to get information in real time. So we need to define the acceptable error in space and the time by which the solution must be given.
Limits: multiple target tracking might run up against computational limits
Can we define performance metrics that are then used to practically evaluate the limits? And can we analyse this mathematically? Think of 'dilution of precision' in GPS - one of the fundamental limits is the sensor geometry and sampling. Another is the model accuracy and applicability.
See also
http://personal.ee.surrey.ac.uk/Personal/W.Wang/papers/KilicBWK_TMM_2015.pdf
on page 10, where a large number of performance metrics are defined for multi-object particle tracking.
The steer is to focus mostly on Problem B.
Social - prediction of what happens next given context
Might have a prior idea from experience
or maybe it's completely unexpected?
More general problem is that we do have a prior (compare to problem with network on nodes)
Need to be rigorous on tracking: what is it about the information that tells us that this is the child and not someone else? E.g. adults don't usually skip down the road, therefore the motion tracking indicates that this is likely to be a child. Unique behaviour identification - how can we be sure of the classification of the characteristics? Categorical variables.
Question: presumably it's helpful to identify anomalous behaviour even if it's not understood in one of the possible models? And also useful if it is understood.
One of the fundamental limits is around making sure you gathered all of the possible information and realising how to use it.
Is it better to use the calculate-and-predict approach? When to merge the data?
How to 'reject' the irrelevant datasets.
There is a detection challenge in visual object tracking:
https://www.votchallenge.net/
two communities - in image processing there is detection and prediction of multiple targets
- in computer vision there is a single target.
Direction to go:
"if you can flag up any and all the fundamental limits you could investigate about fusion in the scenario, e.g. fusing different sources of info, compressing information, detection, tracking, recogntioon, identification, etc. if you can do anything on predcition and behaviour this would be amazing but for me it is all about understanding the limits on fusing 'simialr data and orthogonal data using features or information content....
fusing differnt sources - will they make the final decision worse?
fundamental limits of compression
fundamental limits of detection
fundamental limits of tracking, recognition, identification etc.
RAIM algorithm in GPS receivers to determine which data source is the outlier
When and how should you fuse the data?
One of the fundamental limits is the inability to determine the reliability of our data.
Basic example: noisy data from two sources, versus categorising the noisy data into features first.
Say we have two photos of a person in a street. In the first we don't know if the coat is red or green.
In the second we don't know if it's red or yellow.
It's the same object, so it must be red.
How can we demonstrate mathematically that the limit is 'better' with categorisation first? Length of vector vs. knowledge?
We could get a worse result if the granularity of each image is too poor; in that case fusing the raw data might determine things better.
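A tiny sketch of categorise-first fusion for this example: each photo yields a likelihood over colour categories, and multiplying them (assuming conditionally independent observations and a uniform prior) concentrates on the consistent answer. The numbers are invented:
```python
colours = ["red", "green", "yellow"]
obs1 = {"red": 0.5, "green": 0.5, "yellow": 0.0}   # photo 1: red or green
obs2 = {"red": 0.5, "green": 0.0, "yellow": 0.5}   # photo 2: red or yellow

# Fuse by multiplying per-category likelihoods, then normalise
posterior = {c: obs1[c] * obs2[c] for c in colours}
total = sum(posterior.values())
posterior = {c: v / total for c, v in posterior.items()}
print(posterior)    # {'red': 1.0, 'green': 0.0, 'yellow': 0.0}
```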
How do you know when you should fuse? Need to set a metric. Confidence in feature shape match? Does it look like a child? Bayesian prior of previous images?
If you are 'computationally rich' you can do this trial and error approach.
Example with two orthogonal sources of data, e.g. photo and audio sample (in frequency space)? Then the conceptual 'red coat'. Process in parallel or in sequence? If so, which sequence?
Data are vectors of different lengths - one long vector, or translate into a common space? Note that it is VERY dangerous to translate into a common space if this makes us forget the bias/problem with an individual dataset, e.g. a broken sensor.
Need to set a toy problem with multiple vectors with different uncertainties in the data. And the overall classification error depends differently on the different kinds of data (lengths of vector)?
How do we parameterise individual metrics?
Under what circumstances is it best to classify first? What are the rules?
Question: Are we better to translate into one long common vector or to keep the vectors separate and process then combine?
Additional approaches:
1. Graph-based learning for clustering and semi-supervised classification. These methods work for all kinds of data (also non-Euclidean); they exploit the geometry of the data through the graph Laplacian; they are usually appealing to mathematicians as they involve some pretty cool analysis and computational analysis; they can work with heterogeneous data that can be noisy and incomplete.
Coincidentally, there is a workshop on this topic this week https://www.ima.umn.edu/2020-2021.1/W9.14-18.20#schedule
2. Optimal transport: an interesting approach when looking at feature maps of the data; the best resource is Gabriel Peyré's blog http://www.gpeyre.com/
## Thursday (am) update:
Paul Thomas from dstl joined us for a discussion and we had a far reaching and rapid gallop through the ideas to bring him up to date. A couple of relevant outcomes:
1. Another relevant problem to consider is the MH370 airline search and all of the aspects associated with that, the complexity of the data.
2. Regarding data fusion, there is a general feeling that later is better, so that you don't drop information that might become relevant at a later date. But it is not always practical to retain all data to the end: so there is a need to decide what can be saved without a significant loss. For example GPS position is OK; no need to save raw data from each satellite. But a conflicting bit of information might be really useful if it is substantiated at a later date when more data comes in.
Steer: assume the data is sparse (not fully determined).
A new phrasing of the QUESTION:
Given mathematical approach X and dataset Y: what are the limits to the output/result?
'e.g. The best you can do with one data point on a 2-D map and an x and y error bar on the point is an ellipse of probability of location.'
Further thoughts with an example. Suppose $R$ is a vector of the set of all locations of a road in a 2D space and $s$ is arclength along that road. What is the best you can do with $\int R \, ds = p$? Does this now depend on the model?
Assume that $R$ is a vector of locations and that $p$ is an observation, e.g. we saw (observation $p$) the child somewhere along road $R$. With no further information we cannot narrow down the location of the child other than to say they may be on that road. But if we know the topology and the weather, we might put in an expectation that the child is not on the hilltop because it is raining, and they might choose to be in a sheltered place in a valley. So the best we can do is dependent in this case on our ability to interpret all the information and to put errors on it. We would need a prior model of the valley/hill/weather influence on the location of the child on that road.
This is a prior which gives us a probability density of positions on the path $s$. This density function is bi-modal because the hill is in the middle.
Now, assume that the mathematical model is forcing us to one location. If there are two valleys with a mountain in the middle, the mathematical solution could be that the child is on the mountain, since the valleys are equally likely but are separated by the mountain. This is not the logical solution we wanted! But if we allow two solutions, the answer is that the child is most likely in one of the two valleys. So the choice of the maths matters, and the interpretation and use of the behavioural data matters too.
This is like the tomography inversion problem.
------
### Thoughts on integrating sources of data
Consider the problem of integrating data at one geographical location from two different sensors at one timepoint
Visual data represented by a vector ${v} \in \mathbb{R}^n$
Audio data represented by a vector ${w} \in \mathbb{R}^m$
Suppose we are looking for a child in a black coat. We use data from the sensors as input to classifiers to detect the presence of the child. Classifier
$$f: \mathbb{R}^n \to [0,1]::{v} \mapsto P(b|{v})$$ maps vectors ${v}$ from the visual sensor to the probability $P(b|{v})$ of a black coat $b$ given $v$.
Classifier
$$g: \mathbb{R}^m \to [0,1]::{w} \mapsto P(c|{w})$$ maps vectors ${w}$ from the audio sensor to the probability $P(c|{w})$ of a child's voice $c$ given $w$
Classifier
$$h: \mathbb{R}^{n+m} \to [0,1]^2::[v, w] \mapsto [P(b|v, w), P(c|v, w)]$$ maps the concatenation of vectors from $\mathbb{R}^{n+m}$ to two probabilities: the probability $P(b|v, w)$ of detecting a black coat given the vectors $v$ and $w$ and the probability $P(c|v, w)$ of detecting a child's voice given $v$ and $w$.
Suppose that $f$ is trained with labelled data $\{v_i\}_{i = 1}^k$, $g$ is trained with labelled data $\{w_i\}_{i = 1}^k$, and $h$ is trained with paired data $\{[v_i, w_i]\}_{i = 1}^k$.
$h$ could simply act separately on $\mathbb{R}^n$ and $\mathbb{R}^m$, i.e.
$$h = [f, g]: \mathbb{R}^{n+m} \to [0,1]^2::[v, w] \mapsto [f(v), g(w)]$$
But more generally, $P(b|v, w)$ will indeed depend on $w$ and similarly $P(c|v, w)$ will depend on $v$
The question is then: what are the circumstances in which it is better to use the classifier $h$ that takes both $v$ and $w$ into account, and when is it better to use the separate classifiers $f$ and $g$, to find out whether there is a black coat and a child's voice present.
If children do not generally wear black coats, then there will be correlations in the data.
Is it better to use $P(b|v, w)$ or just $P(b|v)$? When does the vector $w$ act as a distractor?
(Sorry, thought I had an answer but realised I had made a mistake!)
Suppose we had a truth table as follows for black coat $b$ and the child's voice being high, $c$:
| | b | not b |
| - | --- | ---- |
| c | 0.01 | 0.49 |
| not c | 0.09 | 0.41 |
$P(b|c) = P(b \cap c)/P(c) =0.01/0.5 = 0.02$
$P(c|b) = P(c\cap b)/P(b) = 0.01/0.1 = 0.1$
$v$ is visual
Suppose $f$ gives $P(b|v)$
Then $g$ gives $P(c|w)$
Then $h$ gives the probabilities $P(b|v,w)$ and $P(c|v,w)$, written $(h_1, h_2)$
Suppose the decision is based on $\omega h_1 + (1-\omega ) h_2$, e.g. $\omega=1/2$.
Whereas using $f$ and $g$ we might combine in the same way.
We want to find something about
$P(b|v,w) = P(b\cap v \cap w) /P(v \cap w )$
- In the case of independence this is
$P(b\cap v \cap w)/P(v)P(w)$
- In the case that $v$ and $w$ are mutually exclusive then
$P(b|v,w)$ is not defined.
Can we express $P(b|v)$ and $P(b|w)$
in terms of $P(v|b)$, $P(v|\neg b)$, $P(w|b)$, and $P(w|\neg b)$?
The good Dr Bayes tells us that
$$P(b|v) = [P(b)/P(v)] P(v|b)$$
$$P(b|w) = [P(b)/P(w)] P(w|b)$$
Let's start with the case of independence.
Now suppose we have a truth table of the joint probabilities $P(v, w \mid b)$ as follows:
| | v | not v |
| - | --- | ---- |
| w | 0.1 | 0.1 |
| not w | 0.4 | 0.4 |
This says that the probability of observation $v$ given a black coat is $P(v|b) = 0.1 + 0.4 = 0.5$, and the probability of observation $w$ given a black coat is $P(w|b) = 0.1 + 0.1 = 0.2$.
Let us suppose we have some prior information that b is a rare event.
ML comment: we can make some inference about this from the labelled training data (or at least, we have information about how rare the classifier "thinks" b to be).
Say $P(b)=0.001$
and that $P(v)=0.01$, $P(w)=0.01$,
so that $P(v\cap w )=0.0001$
Now we can start to calculate.
$P(b|v) =0.05$
$P(b|w) =0.02$
Now, our assumption on independence means
$P(v,w|b) = P(v|b) \times P(w|b)$
Hence Dr Bayes tells us that
$$P(b|v,w) = \frac{P(b)P(v|b)P(w|b)}{P(v\cap w) }$$
So $P(b|v,w)= [0.0001/P(v\cap w)]$ (using $0.001 \times 0.5 \times 0.2 = 0.0001$).
This number can be estimated if we know something about
$P(v\cap w) = P(v\cap w\cap b) + P(v\cap w \cap \neg b)$ .... (1)
ML comment: we have assumed a value for $P(v \cap w)$ above
ARC: Good point. My logic was all twisted. I guess what I was thinking of was: perhaps we don't know that $v$ and $w$ are independent, just $P(v|b)$ and $P(w|b)$; then we know a value for
$P(v \cap w \mid b)$ but not $P(v\cap w)$
[despite what I wrote above].
We know something about the first term. But the second term we don't know. It seems to be something about how the false detection of black coats is more likely if we have both measurements v and w being true.
The question is, is this interesting or trivial? It seems to say something about the important thing being how the errors combine.
In other words, if I detect a high-pitched black object, is there a high chance that I have found a cat(!)? Whereas detecting a black object is not very likely to meet this error, nor is the detection of a high-pitched object on its own.
But finding both gives me much bigger error.
Does this mean that we need a good Bayesian prior for the joint distribution of the error, so that I can correct for this?
ML notes
$$P(b|v,w) = \frac{P(b)P(v|b)P(w|b)}{P(v\cap w) }\\
= \frac{P(b)P(v|b)P(w|b)}{P(v) P(w) } \text{ assuming independence of $v$, $w$}\\
= \frac{P(b)P(b|v)P(v)P(w|b)}{P(b)P(v) P(w) } \text{ expanding $P(v|b)$ via Bayes}\\
= \frac{P(b|v)P(w|b)}{P(w) } \text{ cancelling} $$
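A quick numeric check of the worked example, with the values assumed above:
```python
P_b, P_v, P_w = 0.001, 0.01, 0.01                # assumed marginals
P_v_given_b, P_w_given_b = 0.5, 0.2              # from the truth table

print(P_b * P_v_given_b / P_v)                   # P(b|v)  = 0.05
print(P_b * P_w_given_b / P_w)                   # P(b|w)  = 0.02
# assuming v, w conditionally independent given b, and P(v ∩ w) = P(v) P(w):
print(P_b * P_v_given_b * P_w_given_b / (P_v * P_w))   # P(b|v,w) = 1.0
```
With these assumed numbers the fused posterior is driven all the way to 1, which illustrates the point above: the answer is very sensitive to the assumed joint $P(v \cap w)$.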