# BioHackathon 2024 $\text{p}K_\text{a}$ benchmarking

> https://biohackathon-europe.org/
> https://biohackathon-europe.org/projects/
> https://elixir-europe.org/about-us/what-we-do/elixir-programme

# Proposal

## Title

Benchmarking $\text{p}K_\text{a}$ prediction for molecules of biological relevance

## Abstract (max 250 words)

**NOTE**: currently 244 words

Most molecules contain some specific functional groups likely to gain or lose protons under specific circumstances (protonation/deprotonation). In aqueous environments, protonations/deprotonations occur at a rate much higher than other biochemical processes, and are therefore considered to be at equilibrium. Each ionization equilibrium between the protonated and deprotonated forms of the molecule can be described by the acid dissociation constant, reported as its negative logarithm, pKa. The pKa value is a physical constant, and of major importance in biological thermodynamics. pKa values find further application in many different areas of chemistry, biology, medicine, and geology.

pKa values can be determined experimentally, but this is often time-consuming, expensive and complicated, especially for intricate, biologically relevant molecules. Therefore, computational tools that predict these values are of very high interest. Existing tools, however, were mostly built on databases focused on the needs of (medicinal) chemists, which do not include many molecules relevant for biochemistry. Very few of them are open-source and freely accessible.

Another open problem is designing a high-quality benchmark for the results of these tools against experimental data, or against each other. Besides the need to add many biologically relevant molecules, each molecule can have multiple possible (de-)protonation sites. Since databases often list only a subset of the pKa values, without indicating their positions in the molecule, it can be difficult to establish a ground truth.
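As an illustration of the ground-truth problem described above, here is a minimal sketch of one possible evaluation criterion. It is not an agreed standard: the function name `best_match_mae`, the choice of mean absolute error, and all numbers are assumptions made for this example. Each experimental pKa with an unknown site is paired with a distinct predicted pKa so that the total error is as small as possible.

```python
from itertools import permutations


def best_match_mae(experimental, predicted):
    """Mean absolute error under the best one-to-one matching.

    Each experimental pKa (site unknown) is paired with a distinct
    predicted pKa so that the total absolute error is minimal.
    Brute force: only suitable for molecules with a handful of sites.
    """
    if not experimental:
        raise ValueError("need at least one experimental value")
    if len(predicted) < len(experimental):
        raise ValueError("fewer predicted than experimental values")
    best_total = min(
        sum(abs(e - p) for e, p in zip(experimental, candidate))
        for candidate in permutations(predicted, len(experimental))
    )
    return best_total / len(experimental)


# Hypothetical numbers, for illustration only (not real measurements).
experimental = [4.0, 9.5]      # database lists two values, no sites
predicted = [3.6, 9.9, 12.1]   # a tool predicts three sites
print(round(best_match_mae(experimental, predicted), 2))
```

This matching is optimistic by construction: it gives a tool the benefit of the doubt whenever the site assignment is unknown, which is one of the trade-offs an evaluation criterion would have to settle.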
This project aims to agree on evaluation criteria for existing methods that predict pKa values of biomolecules, and to implement the evaluation.

## Scope and vision (max 300 words)

> What is/are the project's goal(s)?
> Project usefulness.
> What need does the project fulfil?
> Why is the project important?
> Project novelty
> To your knowledge, has something similar been done before? Is the project based on another successful paradigm? Does the project have novel elements?
>
> *Novelty could be both scientific or technical. E.g., it could be new implementations, a new analysis, add-ons to existing services, new quality of life improvements. Any novel elements that you identify in your project. It could be the case that your project does not focus on novelty as much but rather on optimisation, or improving sustainability and accessibility. All these elements are relevant here.*

**NOTE**: currently 290 words

pKa values find application in many different areas. For example, many compounds used as medication are weak acids or bases, and knowledge of their pKa values can be used to estimate the extent to which a compound enters the bloodstream. Acid dissociation constants are also essential in aquatic chemistry and chemical oceanography, where the acidity of water plays a fundamental role. In living organisms, acid–base homeostasis and enzyme kinetics depend on the pKa values of the many acids and bases present in the cell and in the body. In chemistry, knowledge of pKa values is necessary for the preparation of buffer solutions and is also a prerequisite for a quantitative understanding of the interaction between acids or bases and metal ions to form complexes.

pKa can be determined experimentally in chemistry, but this determination is more complicated in biochemistry. Each organ and cell environment adds parameters that must be controlled when determining pKa.
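To make the link between a pKa value and the ionization state of a compound concrete, the sketch below applies the Henderson-Hasselbalch relation to a single acidic site. It assumes a monoprotic acid in ideal dilute solution. The function name `fraction_deprotonated` and the pKa of 4.4 are hypothetical and chosen only for illustration; 7.4 is used as an approximate blood pH.

```python
def fraction_deprotonated(pka, ph):
    """Fraction of a monoprotic acid HA present as A- at a given pH.

    Henderson-Hasselbalch: [A-]/[HA] = 10 ** (pH - pKa).
    """
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


# Hypothetical weak acid (pKa chosen for illustration) at blood pH ~7.4.
print(f"{fraction_deprotonated(4.4, 7.4):.3f}")

# At pH == pKa, the two forms are equally populated.
print(fraction_deprotonated(7.4, 7.4))
```

Because the pKa enters as an exponent, an error of one pKa unit changes the ratio of deprotonated to protonated form by a factor of ten, which is why prediction accuracy matters for biological applications.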
Some tools exist to predict pKa values, but most of them are tailored for chemistry, not for biochemistry. One of the most commonly used pKa predictors is a proprietary tool called Marvin, offered by ChemAxon. However, its validation process is opaque and it is unclear whether its benchmarking included a large set of biochemically relevant values. Moreover, ChemAxon has recently changed its free academic license, making it much more restrictive. Overall, few alternatives exist, and there is a dire need for open-source tools. As a result, no current tool fulfills both biochemistry and FAIR needs.

This project aims to evaluate existing tools and databases for predicting pKa values of biomolecules, and to propose a standard accepted by the community. To our knowledge, some tools are emerging, such as `Dimorphite-DL` or `pKasolver`. They will be included in our evaluation, together with biochemistry cases and larger molecules found in cells, such as lipids.

## Alignment with ELIXIR 2024-26 Programme and beyond (max 150 words)

> Please indicate how your project aligns to the scientific, technical and people focussed themes of the new ELIXIR 2024-26 Programme (e.g., Technical Platforms, Scientific Themes including Communities, People and Nodes theme, Focus Groups and Services). We are also interested to hear how your project aligns with other Institutional, European or International efforts (max 150 words).

**NOTE** currently 104 words

The benchmarking computation effort will rely on the ELIXIR Compute and Interoperability Platforms, and will follow open standards for full FAIRness. The BioHackathon may also help to improve interoperable computations in the cheminformatics field by networking with the Galaxy and bio.tools communities, and with the EOSC Focus Group. The prediction of pKa values is of major concern for several scientific fields, i.e., biology, biochemistry and chemistry.
Those fields are linked to several ELIXIR communities we are part of: the metabolomics, systems biology, toxicology and microbial biotechnology communities. The project may also have links to the ELIXIR microbiome, and food and nutrition communities.

## Feasibility (max 150 words)

> What is the timeline for your project regarding both long-term goals and short-term goals?
> What would be your focus and project plan in BioHackathon 2024 in case your project is selected?
> What is the minimum number of people required for the project to succeed?
> What is the required level of expertise for people to participate?
> What is the current methodology to be used?

**NOTE** currently 130 words

In the short term, we will survey tools and databases related to pKa values in biochemistry. These tools and databases will be the basis of our benchmark. In the long term, we will provide pKa values for most biomolecules, or at least a simple way to compute them. If possible, we will create a simple-to-install open-source tool for predicting pKa values (similar to Marvin).

The focus during BH 2024 will be to catalogue tools and databases related to pKa values, and benchmark them in a FAIR way.

The minimum number of people for the project to succeed is 2-3. A basic level of chemistry knowledge is required. Experience in tool and database curation, as well as in benchmarking, is nice to have. Participants are, however, likely to have significant overlap, creating opportunities for teaming and ensuring that the project can scale to a larger number of participants.

## On-site aspects - Engagement Strategy (max 100 words)

> During the BioHackathon everyone should be treated equally and be given the same opportunities to contribute.
>
> What is your plan to engage with your BioHackathon group? How would you deal with new project joiners? Have you considered participants that will be working with you remotely?

**NOTE** currently 56 words

The survey of tools and databases, as well as the benchmarking, is meant to be "parallelized" among project joiners, whether on-site or remote. We will have both on-site and remote instructors/assistants during BH. Contributions will be made via pull requests to GitHub repositories, and no distinction will be drawn between on-site and remote participants in this process.

## Past Participation

> For any of the project leads:
>
> If you participated in the BioHackathon in the past, were you able to publish the project results? Please provide the DOI(s) of your publication(s)

**TODO for each of us (leaders only), if relevant**

Sébastien Moretti:

- doi:10.37044/osf.io/yxunp
- doi:10.37044/osf.io/7f94w
- doi:10.37044/osf.io/vn4dx
- doi:10.37044/osf.io/y6gbq

## Collaborations and New Project Leads (max 50 words)

> Is the project collaborative (with evidence)?
>
> E.g., Multi-Institutional, Multi-disciplinary, Multi-Node, Cross-ELIXIR entities.
>
> Are any of the 2 Project Leads new to BioHackathon Europe? Are they planning to attend in person?

**TODO** to complete AND shorten, currently 49 words

We are from different institutions and different ELIXIR nodes, with different, complementary backgrounds:

- Sébastien Moretti, SIB Swiss Institute of Bioinformatics, ELIXIR-CH, bioinformatician and software developer; plans to attend in person.
- ~~Elad Noor, Weizmann Institute of Science, Rehovot, Israel, theoretical biologist; plans to attend in person.~~
- Robert Giessmann, IGDORE & TU Berlin, Germany, cheminformatician and experimentalist; new to BH; plans to attend in person.

## Engagement with Industry (max 100 words)

> Please give examples here of engagement with industry

**NOTE** currently 50 words

This project is of strong interest to commercial providers of prediction tools, who can compare their tool with others and show the superiority of their predictions.
We would like to invite providers of commercial and academic / open-source software to contribute to the definition of a jointly-agreed "gold standard" for the field.

## Permission to publish (on electronic media)

> yes/no

Yes, of course

## Author approval (this submission has been approved by all authors)

> yes/no

Yes

## Project Lead attendance

> I confirm that at least one Project Lead will register for face to face (F2F) participation in BioHackathon Europe 2024

We confirm...