Subset substances - EGFR example

The examples below use the following namespace prefixes: bp for BioPAX Level 2, sbpax for SBPAX, and rdf for the RDF namespace. Objects without a namespace prefix are specific to this example.

Consider :EGFR, a protein, as it is typically defined in a BioPAX file:

    :EGFR rdf:type bp:protein

:EGFR is assumed to refer to the unphosphorylated form, since every reference to a phosphorylated form needs to include a reference to a sequence feature, such as :pY1148-feature:

    :pY1148-feature rdf:type bp:sequenceFeature

The complex EGF-EGFR-pY1148, whose component EGFR-pY1148 is phosphorylated at site Y1148, reads in BioPAX:

    :EGF-EGFR-pY1148 rdf:type bp:complex
    :EGF-EGFR-pY1148 bp:COMPONENTS :component1
    :component1 rdf:type bp:sequenceParticipant
    :component1 bp:SEQUENCE-FEATURE-LIST :pY1148-feature
    :component1 bp:PHYSICAL-ENTITY :EGFR

EGFR can carry up to five phosphorylations, which would require five sequence features. Listing them with every reference is impractical. SBPAX therefore allows different phosphoforms to be expressed as substances. For example, the singly phosphorylated EGFR-pY1148 would be a protein and have the sequence feature :pY1148-feature:

    :EGFR-pY1148 rdf:type bp:protein
    :EGFR-pY1148 sbpax:hasSequenceFeature :pY1148-feature

This looks similar to BioPAX Level 3, but there is a difference. In BioPAX Level 2 and 3, if a feature is not mentioned, it is unclear whether the feature is absent or whether its presence is unspecified. In SBPAX, by contrast, a feature that is not mentioned is always a feature whose presence is unspecified, which is the only way to avoid wrong assertions about features not yet discovered. To specify that a feature is not present, we would say:

    :EGFR sbpax:lacksSequenceFeature :pY1148-feature

Further, we can specify the union of :EGFR and :EGFR-pY1148, in which the presence of the feature is unspecified. This is important because in some interactions both forms can participate.
Calling this substance :EGFR-aY1148 (a for "all"), we can assert the other two to be subset substances:

    :EGFR sbpax:subSetOf :EGFR-aY1148
    :EGFR-pY1148 sbpax:subSetOf :EGFR-aY1148

The substance :EGFR-aY1148 looks similar to an entity reference in BioPAX Level 3, but the difference is that it can be used just like any other substance. Most importantly, this gives a natural way to integrate partially specified phosphoforms when there are multiple sequence features, of which EGFR can have up to five.

Consider two sites of EGFR, Y1148 and Y1173, and the four possible states distinguished by these sites: EGFR (not phosphorylated), EGFR-pY1148 (phosphorylated only at site Y1148), EGFR-pY1173 (phosphorylated only at site Y1173) and EGFR-pY1148-pY1173 (phosphorylated at both sites). Each of these four forms is a substance in SBPAX. EGFR and EGFR-pY1148 are subset substances of their union EGFR-aY1148, which is in turn a subset substance of the union of all four, EGFR-aY1148-aY1173. So there is a hierarchy of subsets with three levels. With more than two sites, the subset hierarchy will have even more levels. Any of these substances may be needed as part of a pathway, and SBPAX provides a natural way to organize them all.

Not only proteins have phosphoforms, but also complexes:

    :EGF-EGFR rdf:type bp:complex
    :EGF-EGFR-pY1148 rdf:type bp:complex
    :EGF-EGFR-aY1148 rdf:type bp:complex

With simple references to component substances, specifying components becomes a lot easier than in BioPAX:

    :EGF-EGFR sbpax:hasComponent :EGFR
    :EGF-EGFR-pY1148 sbpax:hasComponent :EGFR-pY1148
    :EGF-EGFR-aY1148 sbpax:hasComponent :EGFR-aY1148

Most importantly, subset relationships between components neatly propagate to complexes:

    :EGF-EGFR sbpax:subSetOf :EGF-EGFR-aY1148
    :EGF-EGFR-pY1148 sbpax:subSetOf :EGF-EGFR-aY1148
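
The subset hierarchy for the two sites Y1148 and Y1173 can be enumerated mechanically. The following Python sketch is only an illustration and not part of SBPAX: the per-site state codes ("u", "p", "a"), the helper names and the one-complex-per-form mapping are assumptions made for this example. It generates the direct sbpax:subSetOf statements between EGFR forms and then propagates them to the corresponding EGF complexes.

```python
from itertools import product

SITES = ("Y1148", "Y1173")  # the two EGFR sites discussed in the text

# Per-site state codes (an encoding assumed for this sketch only):
#   "u" = feature absent, "p" = phosphorylated, "a" = presence unspecified
STATES = ("u", "p", "a")
ALL_FORMS = list(product(STATES, repeat=len(SITES)))


def substance_name(states, prefix="EGFR"):
    """Build a name such as EGFR-pY1148-aY1173; absent sites are left out."""
    parts = [code + site for site, code in zip(SITES, states) if code != "u"]
    return "-".join([prefix] + parts)


def direct_subsets():
    """Yield (subset, superset) names that differ by relaxing one site to 'a'."""
    for states in ALL_FORMS:
        for i, code in enumerate(states):
            if code != "a":
                parent = states[:i] + ("a",) + states[i + 1:]
                yield substance_name(states), substance_name(parent)


# Subset statements between the EGFR forms.
protein_triples = sorted(
    (":" + sub, "sbpax:subSetOf", ":" + sup) for sub, sup in direct_subsets()
)

# Hypothetical mapping: each EGFR form is the component of one EGF complex.
complex_of = {substance_name(s): "EGF-" + substance_name(s) for s in ALL_FORMS}

# Propagation: a subset relation between components carries over to complexes.
complex_triples = sorted(
    (":" + complex_of[sub], "sbpax:subSetOf", ":" + complex_of[sup])
    for sub, sup in direct_subsets()
)

# Level in the hierarchy = number of sites whose presence is unspecified.
levels = sorted({states.count("a") for states in ALL_FORMS})

for triple in protein_triples + complex_triples:
    print(*triple)
```

Under these assumptions, the generated statements include the ones written out by hand above, such as :EGFR sbpax:subSetOf :EGFR-aY1148 and :EGF-EGFR-pY1148 sbpax:subSetOf :EGF-EGFR-aY1148, and the number of levels grows by one with every additional site.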