---
title: Allergens
tags:
description:
---
# Allergens
## Aim and Outline of the Project
This work presents a comprehensive study of allergens based on an analysis of all allergenic proteins currently available from the major databases (WHO/IUIS and FARRP). It also introduces two new dimensions for the analysis of allergens, which enable the large-scale investigation of the relationship between the molecular properties of allergens and their symptomatology.
The computational methods presented here are not novel in themselves; they build on existing analytical approaches and apply them at a larger scope. This provides a holistic overview of all known allergens and allows us to expand on previous findings. Besides the expansion of scope, the novelty of this work derives from the two new dimensions.
The first new dimension is an overview of all sequences associated with a given allergen in the existing literature. While the tightly curated WHO/IUIS database has strict requirements for standardised experimental proof of allergenicity, other databases such as FARRP only require that a publication has passed the review process of a scientific journal before its findings are integrated into the database. Without standardised methods to ensure high data quality, some of the sequences linked in these databases may differ significantly, which complicates further research into allergenic proteins. One example is Ves v 1, for which there are annotations pointing to both C0HLL3 from Vespa velutina (Asian hornet) and P49369 from Vespula vulgaris (common wasp).
The second dimension is made possible by the results produced with CoMent and contextualises allergens by providing a complete list of significant co-mentions between allergens and symptoms in scientific literature. This permits the comparative study of both clinical and molecular properties of allergens.
## Materials and Method
### Allergen Sequences
[Results produced by CoMent](https://csbg.cnb.csic.es/ahirt/allergen_data/all_allergens_combined_w_org_pubmedhits) were parsed to retrieve the protein IDs associated with allergens. There are 97 allergens for which no protein ID is available in the data ([list available here](https://csbg.cnb.csic.es/ahirt/allergen_data/allergens_without_protein_id.tsv)). The data used for analysis therefore includes the 2084 allergens for which a sequence was found.
This is shy of the 2233 allergens listed in the 2021 version of the FARRP database, but exceeds the scope of the WHO/IUIS database (1089) by almost a thousand allergens.
#### FASTA
The FASTA files used for all analyses are available here:
- [individual files](https://csbg.cnb.csic.es/ahirt/allergen_data/fasta_files/allergens_fasta/)
- [unique file combining all sequences](https://csbg.cnb.csic.es/ahirt/allergen_data/fasta_files/allergens.fasta)
#### JSON Dictionary
[This JSON file](https://csbg.cnb.csic.es/ahirt/allergen_data/allergens.json) contains all protein IDs from the input file.
The input file with the combined PubMed hits includes only the WHO annotations. In some cases, these did not contain valid protein IDs. For all combined allergens, I therefore parsed the noncombined file to retrieve the AO annotations as well, and stored them in a JSON file.
Hence, each combined allergen (all WHO allergens) has two dictionaries associated with it: one called 'IDs', which contains the primary, WHO-annotated proteins, and a second called 'AO_IDs'. Both dictionaries contain two keys, 'NCBI' and 'UniProt', each of which has a list of the respective protein IDs as its value.
When available, the first NCBI ID was used to retrieve the sequence for BLAST analysis, otherwise the first UniProt ID was used. For all combined allergens the 'IDs' dictionary was used.
This file provides an easy-to-parse, structured and complete database of all proteins associated with a given allergen in literature. This is the foundation for the first new dimension mentioned in the introduction.
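
The ID selection rule described above can be sketched as follows. This is a minimal illustration rather than the script used for the analysis: the top-level layout (allergen name mapped to its dictionaries), the allergen name `Xxx x 1`, and the placeholder IDs are assumptions made for the example; only the 'IDs'/'AO_IDs' dictionaries and their 'NCBI'/'UniProt' keys are taken from the description.

```python
import json


def pick_sequence_id(entry, id_group="IDs"):
    """Return the first NCBI ID if one exists, otherwise the first UniProt ID.

    `entry` is the record of a single allergen; `id_group` selects either
    the 'IDs' or the 'AO_IDs' dictionary. Returns None if both lists are empty.
    """
    ids = entry.get(id_group, {})
    for source in ("NCBI", "UniProt"):  # NCBI takes precedence
        candidates = ids.get(source) or []
        if candidates:
            return candidates[0]
    return None


# Hypothetical record; the name and IDs are placeholders, not real data.
example = json.loads("""
{
  "Xxx x 1": {
    "IDs": {"NCBI": [], "UniProt": ["UNIPROT_ID_1", "UNIPROT_ID_2"]},
    "AO_IDs": {"NCBI": ["NCBI_ID_1"], "UniProt": ["UNIPROT_ID_1"]}
  }
}
""")

for allergen, entry in example.items():
    print(allergen, pick_sequence_id(entry), pick_sequence_id(entry, "AO_IDs"))
```

Keeping the precedence in a single tuple makes it easy to change the order of the ID sources if a different database should be preferred.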
### Non-Allergen Data: Krutz et al. (2019)
The dataset available from Krutz identifies 178 "no evidence" proteins from plants. The supplementary data provided by Krutz contains many thousands of proteins, most of which were identified as not relevant due to insignificant prevalence or evidence for allergenicity. To retrieve only the definite non-allergenic proteins, it was necessary to filter for multiple variables. The [fasta file of non-allergens used is available here](https://csbg.cnb.csic.es/ahirt/allergen_data/fasta_files/krutz_non_allergens.fasta).
The absence of evidence for allergenicity in Krutz's dataset is based on a prediction of non-allergenicity by AllerCatPro, combined with evidence of prevalence in plant foods that are commonly consumed around the globe.
### HPO Data
In the HPO data, 828 unique allergens are represented in 3920 entries of 349 unique annotations. This is equal to an average of 4.73 annotations per allergen.
From this, Jaccard scores could be calculated for 326028 allergen pairs. The Jaccard score describes the similarity between two sets. In addition, the information content (IC) of each HPO term in the data was calculated. The IC represents the specificity of the respective term within an ontology.
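
As an illustration of the Jaccard score, the sketch below computes it for every pair of allergens from their sets of HPO annotations. The allergen names and term labels are placeholders chosen for the example, and the handling of two empty sets (score 0.0) is an assumption.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard score: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty annotation sets score 0
    return len(a & b) / len(union)


# Placeholder annotation sets (not real HPO data).
annotations = {
    "allergen_A": {"term_1", "term_2", "term_3"},
    "allergen_B": {"term_2", "term_3", "term_4"},
    "allergen_C": {"term_5"},
}

# One score per unordered allergen pair.
scores = {
    (x, y): jaccard(annotations[x], annotations[y])
    for x, y in combinations(sorted(annotations), 2)
}

for pair, score in scores.items():
    print(pair, round(score, 2))
```

Because the annotation sets are small, the score can only take a limited number of distinct values, which is worth keeping in mind when the scores are plotted.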
## Results
### Similarity to the Human Proteome
In the following plots, the frequency of percent identity of BLAST hits for allergen sequences against their closest human counterpart is displayed. The data for each subsection is presented in two versions, first a plot including all of the BLAST hits that made it into the data after filtering for length > 30 and evalue < 0.001, and then a second plot, in which only the BLAST hit with the highest pident is shown per allergen. Simply put, the first plot in each subsection shows the density of all BLAST hits against the human proteome for a given category of allergens, while the second focuses on the most similar BLAST result for each allergen.
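
The two views of the data can be sketched as follows. This is a minimal sketch that assumes the BLAST results are stored in the default 12-column tabular format (`-outfmt 6`); the thresholds come from the text, while the file layout and the sample rows are assumptions made for illustration.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def read_hits(handle):
    """Yield one dict per BLAST hit with the fields needed for filtering."""
    for row in csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t"):
        yield {
            "qseqid": row["qseqid"],
            "sseqid": row["sseqid"],
            "pident": float(row["pident"]),
            "length": int(row["length"]),
            "evalue": float(row["evalue"]),
        }


def filter_hits(hits, min_length=30, max_evalue=0.001):
    """Keep hits with alignment length > 30 and e-value < 0.001."""
    return [h for h in hits
            if h["length"] > min_length and h["evalue"] < max_evalue]


def best_hit_per_query(hits):
    """Return, for each query, the hit with the highest percent identity."""
    best = {}
    for hit in hits:
        current = best.get(hit["qseqid"])
        if current is None or hit["pident"] > current["pident"]:
            best[hit["qseqid"]] = hit
    return best


# Invented sample rows, for illustration only.
rows = [
    ["allergen_1", "human_A", "45.0", "120", "0", "0", "1", "120", "1", "120", "1e-30", "200"],
    ["allergen_1", "human_B", "70.5", "80", "0", "0", "1", "80", "1", "80", "1e-10", "150"],
    ["allergen_1", "human_C", "90.0", "25", "0", "0", "1", "25", "1", "25", "1e-5", "50"],
    ["allergen_2", "human_D", "38.0", "200", "0", "0", "1", "200", "1", "200", "0.5", "30"],
]
sample = "\n".join("\t".join(r) for r in rows)

hits = filter_hits(read_hits(io.StringIO(sample)))  # data behind the first plot
best = best_hit_per_query(hits)                     # data behind the second plot
print(len(hits), {q: h["sseqid"] for q, h in best.items()})
```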
Against the human proteome, allergens produced a total of 50237 hits.
[Data available here](https://github.com/admarhi/TFM_Allergens/blob/main/blast_output/allergens_human_proteome_best_blast_scores)
Against the human proteome, the non-allergens identified by Krutz et al. (2019) produced a total of 2325 hits.
[Data available here](https://github.com/admarhi/TFM_Allergens/blob/main/blast_output/krutz_human_proteome_best_blast_scores)
#### Allergens and Plant Non-Allergens

Both allergens and non-allergens have a right-skewed distribution (the mean is greater than the median), with very similar means (35.4 and 34.8, respectively). The allergens' median is closer to the mean than that of the non-allergens (33.6 and 29.7, respectively), giving the non-allergens a much more skewed shape. The standard deviation of the non-allergens is also higher than that of the allergens (14.0 versus 10.1).
It is only logical that the allergenic proteins are more similar to the human proteome than the proteins in the reference data, as the reference data only includes proteins from plant foods, while the allergens also include a large proportion of mammalian proteins.
Of all the allergens which were included in the data of this plot, 12% (117/962) had at least one hit above 62% pident.
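
The share reported above can be reproduced from the highest pident per allergen, since an allergen has at least one hit above the threshold exactly when its best hit exceeds it. The sketch below assumes a mapping from allergen to its highest pident; the function name and the toy values are hypothetical.

```python
def share_above(best_pident_by_allergen, threshold=62.0):
    """Return (count above threshold, total, percentage) for a pident mapping."""
    total = len(best_pident_by_allergen)
    above = sum(1 for p in best_pident_by_allergen.values() if p > threshold)
    percentage = 100.0 * above / total if total else 0.0
    return above, total, percentage


# Toy values for illustration only.
toy = {"allergen_1": 70.5, "allergen_2": 38.0, "allergen_3": 62.0, "allergen_4": 95.9}
print(share_above(toy))

# Check of the figure quoted in the text: 117 of 962 allergens.
print(round(100 * 117 / 962, 1))
```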
Summary Statistics
| result | mean_pident | sd_pident | median_pident | mean_length | sd_length | median_length | max_length | min_length |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| allergen | 35.4 | 10.1 | 33.6 | 245 | 188 | 217 | 1535 | 34 |
| no_evidence | 34.8 | 14.0 | 29.7 | 239 | 121 | 205 | 946 | 37 |
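
Summary statistics of this kind can be computed per group with the standard library, as sketched below. The field names mirror the table columns; the use of the sample standard deviation and the toy input are assumptions, since the text does not state how the values were calculated.

```python
from collections import defaultdict
from statistics import mean, median, stdev


def summarise(hits):
    """Group hits by their 'result' label and summarise pident and length."""
    groups = defaultdict(list)
    for hit in hits:
        groups[hit["result"]].append(hit)

    summary = {}
    for name, rows in groups.items():
        pidents = [r["pident"] for r in rows]
        lengths = [r["length"] for r in rows]
        summary[name] = {
            "mean_pident": mean(pidents),
            # Sample standard deviation; needs at least two values.
            "sd_pident": stdev(pidents) if len(pidents) > 1 else 0.0,
            "median_pident": median(pidents),
            "mean_length": mean(lengths),
            "sd_length": stdev(lengths) if len(lengths) > 1 else 0.0,
            "median_length": median(lengths),
            "max_length": max(lengths),
            "min_length": min(lengths),
        }
    return summary


# Toy hits for illustration only.
toy_hits = [
    {"result": "allergen", "pident": 30.0, "length": 100},
    {"result": "allergen", "pident": 40.0, "length": 200},
    {"result": "no_evidence", "pident": 50.0, "length": 150},
]
summary = summarise(toy_hits)
for group, stats in summary.items():
    print(group, stats)
```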

As outlined above, the second plot shows only the highest pident per allergen, which reveals some interesting characteristics of the data. The density line indicates that non-allergens are slightly more similar to the human proteome than allergens. While there is a marked drop in allergens around the 62% mark, these results do not support the conclusion that proteins above this level of identity are rarely allergenic, as previously found by Jenkins et al. (2007): 12% (117/962) of allergenic proteins have their highest pident hit above the 62% identity threshold.
If a threshold for maximum sequence similarity to the human proteome can be defined at all, it would have to lie above a sequence similarity of ca. 95.9%. The BLAST hits above this value are produced by autoimmune allergens (Hom s 2, 3, 4). However, it is very unlikely that a protein from a different species would produce such hits, which makes the definition of such a threshold virtually useless.
#### Food Allergens and Plant Non-Allergens
Interestingly, the difference between food allergens and plant non-allergens is greater than in the previous plot with all allergens. The following plots visualise the reason behind this.


While the non-allergens still have a right-skewed distribution, the allergen BLAST hits now more closely resemble a normal distribution. Both the median and the mean of the non-allergens (29.7 and 34.8, respectively) are lower than those of the allergens (37.4 and 37.2, respectively).
This may be interpreted to indicate that plant non-allergens are generally less closely related to their nearest human counterpart than food allergens. Considering that food allergens also include animal allergens, this is an expected result.
Of all the allergens included in the data of this plot, 16% (56/347) had at least one hit above 62% pident. This is a considerably higher percentage than in the previous plot and is likely due to the greater fraction of animal proteins among the food allergens than among allergens overall.
Summary Statistics
| result | mean_pident | sd_pident | median_pident |mean_length | sd_length | median_length | max_length | min_length |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| allergen | 37.2 | 10.3 | 37.4 |278 | 239 | 204 | 1535 | 34 |
| no_evidence | 34.8 | 14.0 | 29.7 | 239 | 121 | 205 | 946 | 37 |

For allergens, the highest similarity BLAST hits accumulate to the left of the 62% pident threshold. Allergens feature a slightly more pronounced bimodal distribution than non-allergens.
#### Plant Food Allergens and Plant Non-Allergens

The shape of the plant food allergens more closely resembles the shape of the non-allergens than any of the other plots. This was expected as both non-allergens and allergens in this plot are proteins from food plants.
This group contains fewer allergens than the previous groups, and only 1% of them (1/99) produce a BLAST hit over 62% pident.
|result|mean_pident|sd_pident|median_pident|mean_length|sd_length|median_length|max_length|min_length|
|---|---|---|---|---|---|---|---|---|
|allergen|33.2|8.74|31.8|183|111|164|592|41|
|no_evidence|34.8|14.0|29.7|239|121|205|946|37|

Here, the difference between allergens and non-allergens from the same category (plant food proteins) becomes apparent. The plot shows that while non-allergens are distributed relatively evenly, allergens show a distinct peak. Furthermore, it indicates that for plant food allergens it might actually be feasible to define a no-allergenicity threshold at an even lower level than 62% sequence similarity: among significant BLAST results (length > 30 and evalue < 0.001), only one allergen (1/99) produces a hit above 62% (AEY79726.1-NA_0360_AO against sp|P62937|PPIA_HUMAN).
#### Animal Food Allergens

The slight left skew of the distribution explains why the distribution including all allergens resembled a normal distribution, as the allergens and non-allergens cancelled each other out.
Interestingly, there is a considerable number of hits above 62% pident. This conflicts with the existing literature, which states that animal food proteins are rarely allergenic if they have more than 62% sequence similarity to the human proteome. Of the animal food allergens, 22% (45/205) produce at least one BLAST hit above 62%. The plot below visualises this more appropriately.
|result|mean_pident|sd_pident|median_pident|mean_length|sd_length|median_length|max_length|min_length|
|---|---|---|---|---|---|---|---|---|
|allergen|37.7|9.97|38.3|294|249|228|1535|34|

While the majority of allergens produce their highest-similarity BLAST hit against the human proteome below 62%, as outlined by Jenkins et al., a substantial number of allergens in our data produce hits with more than 62% sequence similarity to the human proteome. This has interesting consequences, as the cut-off value for allergens proposed by Jenkins et al. in 2007 cannot be accepted.
#### Injection Allergens

Only one injection allergen, Aed a 8, produces hits over 62%.

The plot does not change much when only the highest hit for each allergen is displayed. There is only one hit above 62% (produced by Aed_a_8 ABF18258.1 against sp|P11021|BIP_HUMAN).
#### Dermal-skin Allergens

Only four dermal-skin allergens produce hits above 62% (Hev b 9, Mala s 6, NA 0289, NA 0591).

#### Airway Allergens


### Allergen Hits over 62% pident
The following plots are more informative in their interactive versions, linked below each plot. By deselecting groups in the legend it is possible to filter for individual groups. Hovering on any of the datapoints will produce a context menu with the query protein ID as well as the human protein ID against which the alignment was produced.
#### Entry Way

[Interactive version](https://csbg.cnb.csic.es/ahirt/plotly/high_similarity_allergen_bbs.html)
The proportions of the entry ways of the allergens represented in the plot are listed in the table below.
| Entry | n (%) |
| ----- | -----: |
| Airway | 207 (49.9) |
| Food | 152 (36.6) |
| NA | 27 (6.51) |
| Dermal-skin | 21 (5.06) |
| Injection | 8 (1.93) |
[The same interactive plot for highest hits](https://csbg.cnb.csic.es/ahirt/allergen_data/plots/high_similarity_allergen_bbs_highest_hits.html)
#### Source

[Interactive version](https://csbg.cnb.csic.es/ahirt/plotly/animal_plant_high_similarity_allergen_bbs.html)
[The same interactive plot for highest hits](https://csbg.cnb.csic.es/ahirt/allergen_data/plots/animal_plant_high_similarity_allergen_bbs_highest_hits.html)
#### Entry way of allergen hits over 62% pident

#### Non-allergen hits over 62% pident
There are 33 non-allergens which produce a total of 80 BLAST hits of more than 62% pident to the human proteome.
P59259-Spinach and A0A096USG9-Wheat are the only two non-allergens that produce close to identical hits (98.1%), both with sp|P62805|H4_HUMAN.
#### Protein Families
Existing literature repeatedly points out that only a few protein families account for the large majority of known allergens.
To introduce protein families into the data, I ran InterPro against both the FASTA file containing all allergens and the non-allergens from the Krutz dataset. The analysis showed that the allergens belong to 192 different protein families.
The 10 most common protein families include 38% (824/2181) of all allergens in our dataset.
| Family | n |
| --- | --: |
| Protease inhibitor/seed storage/LTP family | 144 |
| Profilin | 116 |
| Pathogenesis-related protein Bet v 1 family | 103 |
| EF-hand domain pair | 96 |
| Cupin | 94 |
| Tropomyosin | 81 |
| ML domain | 52 |
| Cysteine-rich secretory protein family | 49 |
| Papain family cysteine protease | 47 |
| Ribonuclease (pollen allergen) | 42 |
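
The 38% figure can be checked directly from the counts in the table. The sketch below also shows how such a ranking might be obtained from a list of family assignments; the helper `top_families` and its input format (one family label per allergen) are assumptions made for illustration.

```python
from collections import Counter


def top_families(assignments, n=10):
    """Return the n most common family labels with their counts."""
    return Counter(assignments).most_common(n)


# Counts copied from the table above.
top_ten = {
    "Protease inhibitor/seed storage/LTP family": 144,
    "Profilin": 116,
    "Pathogenesis-related protein Bet v 1 family": 103,
    "EF-hand domain pair": 96,
    "Cupin": 94,
    "Tropomyosin": 81,
    "ML domain": 52,
    "Cysteine-rich secretory protein family": 49,
    "Papain family cysteine protease": 47,
    "Ribonuclease (pollen allergen)": 42,
}

covered = sum(top_ten.values())
total_allergens = 2181  # denominator stated in the text
share = 100 * covered / total_allergens
print(covered, round(share))
```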

Inspecting the pident plot grouped by protein family, the distributions appear roughly similar, with only the ribonucleases (pollen allergens) showing more right skew than the other families.
The ten most common protein families listed above produce only a few BLAST hits above 62%, as shown below:

It is not surprising to see tropomyosin among the families with the highest similarities, as tropomyosins are found in all mammals, including humans.
The table and graph below show the ten most common protein families among BLAST hits above 62%.
Interestingly, the fourth-largest group among hits above 62% identity is NA, i.e., hits for which no protein family is available.
| family | n |
| --- | --: |
| Cyclophilin type peptidyl-prolyl cis-trans isomerase/CLD | 119 |
| Enolase, C-terminal TIM barrel domain | 52 |
| Hsp70 protein | 52 |
| NA | 49 |
| Tubulin/FtsZ family, GTPase domain | 30 |
| C-terminus of histone H2A | 29 |
| Fibrillar collagen C-terminal domain | 21 |
| Fructose-bisphosphate aldolase class-I | 20 |
| EF-hand domain pair | 17 |
| Tropomyosin | 17 |

Animal food allergens can be assigned to 21 protein families, the top ten of which account for 93% (191/206) of them. These show significant differences compared to the results presented by Jenkins.
| family | n |
| --- | --: |
| NA | 90 |
| EF-hand domain pair | 37 |
| Tropomyosin | 35 |
| ATP:guanido phosphotransferase, C-terminal catalytic domain | 8 |
| Casein | 6 |
| Fibrillar collagen C-terminal domain | 4 |
| Enolase, C-terminal TIM barrel domain | 3 |
| Fructose-bisphosphate aldolase class-I | 3 |
| Serpin (serine protease inhibitor) | 3 |
| C-type lysozyme/alpha-lactalbumin family | 2 |
> ToDo:
> - Which allergens could not be introduced to a protein family?
---
### HPO and Sequence Similarity
This section evaluates the pairwise sequence similarity between all allergens in the context of their symptomatology. The results presented here indicate that there is no clear relationship between allergen sequence similarity and symptomatology.
#### Information Content
For the purposes of this study, the lowest (most specific) IC was selected from the set of overlapping HPO terms for each allergen pair. This provided another variable to filter by.
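
The selection step described above can be sketched as follows. The IC values are taken as given and the lowest value among the shared terms is returned, following the convention used in this section; the function name, term labels, and IC values are hypothetical.

```python
def lowest_shared_ic(terms_a, terms_b, ic):
    """Return (term, IC) with the lowest IC among the terms shared by a pair.

    Returns None when the two allergens share no HPO term.
    """
    shared = set(terms_a) & set(terms_b)
    if not shared:
        return None
    # Ties are broken by term label so the result is deterministic.
    term = min(shared, key=lambda t: (ic[t], t))
    return term, ic[term]


# Placeholder IC values and annotation sets (not real HPO data).
ic = {"term_1": 0.8, "term_2": 2.5, "term_3": 1.4}
allergen_a = {"term_1", "term_2", "term_3"}
allergen_b = {"term_2", "term_3"}

print(lowest_shared_ic(allergen_a, allergen_b, ic))
```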
The plot below shows the information content of the HPO terms assigned to the allergen pairs.

The plot below shows the percent of identity between all allergen pairs which share a given HPO term as their most specific HPO annotation.

Combining the two previous plots by scaling IC by 32 reveals a relationship between the two variables. Although it is not statistically significant, sequence similarity within a pair of allergens tends to be greater at lower information content (more specific HPO terms). This suggests that the symptoms described by the most specific HPO terms may to some extent be caused by unique regions in a protein's sequence.

> ToDo:
> - Investigate the outliers at the lower end of the information content, could they simply stem from small sample size or
#### Scatter Plot
The horizontal lines can be explained by small sets of HPO annotations and the corresponding discretisation of the Jaccard score.
I believe the density plot below still visualises the data in a more interpretable manner.

##### Density Plot
The density plot below shows a more interpretable version of the previous scatter plot. Red indicates a high density of allergen pairs in the region, blue indicates that no pairs are present.
Dividing the plot into four distinct quadrants highlights some interesting characteristics.

The cluster in the bottom left quadrant represents allergen pairs that have both low sequence similarity and low symptomatic similarity. The faint cluster in the top left quadrant is more interesting: it represents allergen pairs that share low sequence similarity but high similarity in symptomatology.
Filtering for IC < 2.2 yields the following plot.

Interestingly, when filtering for only those allergen pairs that share a certain degree of specificity, the bulk of allergen pairs previously observed in the bottom left quadrant, as well as the small cluster in the top left quadrant, disappear. Both of these clusters were only caused by low shared sequence similarity and, as becomes apparent from the filtering here, the sequence similarity was caused by HPO annotations with low information content.
Additionally, the cluster in the right half of the plot shifts upwards, towards the now more pronounced (i.e., representing a larger percentage of the overall datapoints) cluster in the top right corner. This indicates that with increasing specificity of the annotated HPO terms, the correlation between Jaccard score and sequence identity becomes stronger.
Although not statistically significant, the observable trend in this plot shows a slight correlation between the sequence similarity and the symptomatic similarity of two allergens. However, based on these results, it does not seem realistic to make accurate predictions about the symptomatology of an allergen from its sequence without further information.