User Guide#

Introduction#

The ALPINAC algorithm is a method for identifying chemical formulae from a mass spectrum. It involves generating fragment formulae for each detected mass by using a variant of the knapsack algorithm, which is used to enumerate all possible combinations of atoms that can fit into the measured fragment mass within the uncertainty margin. The algorithm produces correct chemical formulae, usually along with many other incorrect formulae, which are eliminated in subsequent steps using additional constraints. The chemical formulae are then organized according to a specific order using a pseudo-fragmentation graph, which helps identify and delete unlikely chemical formulae. A set of all possible isotopologues, including their relative intensities based on their natural isotopic abundances, is generated for each candidate chemical formula, and their intensities are computed. The optimum contribution for each isotopologue set is computed, and a likelihood estimator is used to rank candidate formulas based on their ability to explain the measured signal. The process is iterated over several steps to optimize the contributions of multiple sets of candidate fragments to match the measured signal profile of a mass spectrum.

Algorithm#

The ALPINAC algorithm consist of 10 main steps (see picture) which are described in more detail below:

Alpinac Algorithm

Step 1#

Step 1 of the algorithm is the generation of fragment formulae by using combinatorics to generate chemical formulae of candidate fragments for each detected mass in a spectrum. This is done using a variant of the knapsack algorithm, which is a well-known problem in combinatorics. The aim of the knapsack step is to recover all chemical formulae that could correspond to each detected fragment, given its mass and uncertainty, and excluding all other chemical formulae that would not fulfil the criteria of measured mass and uncertainty. The inputs of the dedicated knapsack algorithm are the measured masses of the fragments with uncertainties and a list of masses of elements that are expected to form the fragments. The knapsack algorithm is used to enumerate all possible combinations of atoms so that the sum of their masses fits the measured fragment mass within the uncertainty margin. An unbounded number of each atom is available, i.e. each atom type can be used multiple times, this is also known as the unbounded knapsack problem. For CI data step 1 is modified as explained in modifications.

Step 2#

Step 2 involves organizing all chemical formulae generated in Step 1 according to a specific order, to help identify and delete unlikely chemical formulae. This is done by creating a pseudo-fragmentation graph using the class DiGraph provided in the Python package Networkx. Each node on the graph represents a candidate fragment, and an edge is set from one node to another if the chemical formula of the first fragment admits the chemical formula of the second one as subformula. The maximal fragments have no ancestor but may have children, and they are the maximal elements of the ensemble of fragments. The aim is to eliminate isolated nodes, or nodes not being connected to any other node (singletons), which may correspond to impurities produced by GC bleed or unlikely solutions from the knapsack. However, very small molecules that produce a very limited number of different fragments, without the molecular ion, are not eliminated.

Step 3#

Step 3 involves generating a set of all possible isotopologues, including their relative intensities based on their natural isotopic abundances, for each candidate chemical formula generated in the previous step. This is done using combinatorics of the involved isotopic atoms, and for each minor-isotope formula, its exact mass and expected intensity relative to the knapsack formula are computed. Isotopologues with an intensity below the instrumental limit of detection are not enumerated. Knapsack-generated chemical formulae containing only the most abundant isotopologues are organized as nodes in a pseudo-fragmentation graph, each complemented by their minor-isotope formula(e) as a node property if above the instrumental limit of detection.

Step 4#

Step 4: In the context of isotope pattern analysis, this step computes the optimum contribution to the measured data for each isotopologue pattern individually and involves generating theoretical continuous mass spectra for each potential chemical formula, and then optimizes a certain contribution for each set of isotopologues to match the measured mass profile. This optimization problem involves finding a non-negative scaling factor k_i for every possible isotopic pattern such that the measured signal best fits the theoretical profile of the set. The optimization equation is minimized to find the value of k_i, which represents the maximum contribution of a given isotopologue set i, assuming no other sets are present to contribute to the signal. If the entire profile falls below the limit of detection (LOD), the candidate node is removed from the graph of solutions. The computed maximum contribution k_max_i is then used as the starting value for the first computation of likelihood in Step 5.

Step 5-8#

In Step 5 of the algorithm, a likelihood estimator is used to rank candidate formulas based on their ability to explain the measured signal. The estimator considers the knapsack fragment and its set of minor-isotope fragments, as well as sub-fragments and their isotopologues. The first estimator formula is modified to correct for the effect of more complex formulas being favored (which have potentially more sub-formulae), by multiplying it by the number of found sub-formulae divided by the expected maximum number of sub-formulae. The likelihood estimator has no chemical significance and is only a practical tool to speed up the identification process. All knapsack fragments are ranked in order by decreasing likelihood value.

Iteratively, in the following optimization including steps 5-8, a process for optimizing the contributions of multiple sets of candidate fragments to match the measured signal profile of a mass spectrum is performed. The goal is to identify the correct combination of candidate fragments that best explains the observed peaks in the spectrum. The steps involve iterating over the following process:

Select the top-ranked knapsack formulae and associated sub-fragments and isotopologues ranked based on the likelihood estimator described in step 1. (A) Use a machine learning technique (Levenberg-Marquardt algorithm implemented in the Python package lmfit) to optimize the linear factors k_i that correspond to the contribution of each selected candidate fragment to the measured peak signal. (B) Eliminate any nodes below the limit of detection (LOD) and any nodes that become singletons (no longer part of a larger set of connected candidate fragments). If less than 95% of the signal is reconstructed, add the next ranked most likely node and repeat steps (A) and (B) until the threshold is met. Delete any nodes of the ranked list that have not been selected to be optimized and any new singletons. Compute the final contribution of each remaining node to each measured mass peak using its optimized factor k_i. To reduce computation time, the mass domain is split into smaller, non-overlapping domains for optimization, and the optimization is stopped once 95% of the signal is reconstructed. Additionally, a greedy-like strategy is used to prioritize the most likely candidate fragments for optimization, further reducing the number of fitted isotopic profiles.

Step 9#

After the last iteration sets not used during the iteration are deleted, they would belong to substances contributing less than 5% to the measured signal.

Step 10#

In Step 10, a simple algorithm is used to identify or reconstruct likely molecular ions based on the chemical formulae of the detected fragments. The algorithm follows the three conditions of the SENIOR theorem (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-105: The SENIOR rule is a set of three conditions that must be met for a molecular graph to exist. These conditions include the sum of valences being even, the sum of valences being greater than or equal to twice the maximum valence, and the sum of valences being greater than or equal to twice the number of atoms minus 1.) to ensure the molecular ion is valid. Maximal fragments with even sum valence are tested for the first two conditions, and if fulfilled, are added to the list of potential molecular ions. Maximal fragments with odd sum valence are tested for the second condition by adding one monovalent atom to each of them and checking if the resulting fragment fulfills the condition. All potential molecular ions are added to the graph with the other fragments, and their likelihood values are computed. It is assumed that all multivalent atoms present in the true molecule have been detected and correctly identified in at least one fragment.

Additional Information on processing#

We describe where each box of Figure 1 of the paper is coded. First the input data and chemical data.

  • Input: experimental data. A file with uncertainties is required either as command-line argument when running python -m alpinac.mode_identification <FILE.txt>, or with a wrapper script (see .sh files, or [mode_identification_script.py](./scripts/mode_identification_script.py).

  • Exact mass and valence of atoms: this is provided in the file [periodic_table.py](./alpinac/periodic_table.py).

  • Environmental abundance of isotopes: this is also provided in the file [periodic_table.py](./alpinac/periodic_table.py).

  • Output: a graph of mass spectrum with labeled peaks (for the identified ones).

Measured data are read and saved in the class structure Compound defined in the file compound.py.

How to interpret the result#

Some measured masses have no solutions. What can I do?#

  1. Make sure the mass in question is a real signal, not e.g. electronic noise. If the peak is ionised twice, it will not be identified. Doulbe ionisation is not supported by ALPINAC.

  2. Using default constant values in const_identification.py, run ALPINAC once. Based on the results, make your own best estimation of which chemical elements are present.

  3. Run ALPINAC a second time using, as additional constrain, your chosen list of present atoms. You can also either increase the value of max_opt_signal to e.g. 99%, or decrease the value of max_ratio_not_opt_masses to e.g. 0.01.

  4. If this does not work, do as in 3. but at the same time, increase the mass uncertainty m_u_k to e.g. 3.0.

How can I be sure that the results are correct?#

  • We have observed that in case only the chemical elements C, H, N, and O are present, with a mass resolution of approx. 3500, results are sometimes inaccurate. We strongly advice to run ALPINAC several times while giving different possible combinations of chemical elements, e.g., “CH”, “CHO”, “CHN”, “CHNO”.

  • If ALPINAC suggests one chemical element is present in some formulae but you think this is wrong: this can happen, especially if no rare-isotope fragment was detected. Run ALPINAC one more time with a given list of chemical elements that do not contain the one you think is wrong, and check if ALPINAC can still reconstruct the same percentage of measured signal.

  • With a mass resolution of approx. 3500, it can also be difficult to correctly reconstruct the number of hydrogen atoms. This number can be overestimated. We advice to run ALPINAC with a given chemical formula as constrain (in mode_identification_script.py or through the command line). Try multiple times with formulae that contain a different number of hydrogen atoms.

ALPINAC is still running after 3 minutes. What can I do?#

This can happen if the mass uncertainty is high, or if there are many measured masses, causing a number of knapsack formulae above 1000.

  1. Use a lower value for max_opt_signal in const_identification.py, e.g. 80%.

  2. Run ALPINAC using only a limited list of present chemical elements.

Handling of CI ionization#

The algorithm for electron-ionization spectra has been adapted to handle combined EI-CI data in order to improve prediction performance. In EI spectra, the molecular ion is often absent, but in CI spectra, it is present, reducing the chance of predicting the largest fragment as the molecular ion. However, the addition of a chemical reagent to the measured sample results in adducts to the fragments, increasing the complexity of the combinatorics in CI spectra. To distinguish between atoms from the measured molecule and atoms from the adduct, virtual pseudo-atoms with the same mass but valence = 2 are introduced and tagged as originating from the adduct. The modified algorithm matches mass peaks using both physical atoms and pseudo-atoms in the first (pseudo-knapsack) step, but removes the pseudo-atoms afterwards to leave only real atoms of the fragment. However, this may result in fragments with negative DBE, and the set of possible formulas is larger compared to the pure EI case. Along with the formulas extracted from the EI data, these formulas are passed to the next steps of ALPINAC. In the following, the algorithm performs as in the pure EI case, but with a larger set of possible formulas which results in a more complex directed graph and higher computational expenses.