Representing Discrete Probabilities in RDF
Many RDF datasets are meant to be interpreted as simply "statements in the graph are true". In order to record more nuanced interpretations, we'll make meta-statements: statements about statements and graphs.
We'll use a predicate that assigns a probability to a statement or graph. When its subject is a statement, it means that statement is true with that probability. When its subject is a graph, it means all statements in that graph are true with at least that probability.
Assume we have a dataset with a variety of statements, some of which we think are true, some are false, and some are true only with some probability. We can't just start inferring new statements from this dataset, since we have no idea what they would mean.
First create a graph of "assumed true" statements. Let's call it G1. Add to it a statement "G1 probability 1". We can also add to this "true" graph probability statements about other statements that are not themselves present in it. Within this graph we can safely let loose an inference process to find more true statements.
Next comes the fun part. We want to allow inferences about statements with a known, non-1 probability. To do that, let's say we have a statement outside G1, Sa, and there's a statement in G1 "Sa probability .9". We'll create a new supergraph of G1 consisting of G1 plus Sa. Let's call that Ga. We can add to G1 a statement "Ga probability .9". Then, we can safely let an inference process run in that new supergraph.
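Here is a minimal sketch of that supergraph step in plain Python, with no RDF library involved. Statements are modelled as (subject, predicate, object) tuples and graphs as frozensets of them. The example triples, the `make_supergraph` helper, and the `graph_probability` side table (which stands in for the "Ga probability .9" meta-statement recorded in G1) are all made up for illustration.

```python
# Hypothetical example data: Sa is a statement that lives outside G1.
SA = ("ex:alice", "ex:knows", "ex:bob")

# G1 holds the statements we assume are true, including a
# meta-statement giving the probability of Sa.
G1 = frozenset({
    ("ex:alice", "rdf:type", "ex:Person"),
    ("Sa", "probability", 0.9),
})

# Side table standing in for "<graph> probability <p>" statements in G1.
graph_probability = {"G1": 1.0}


def make_supergraph(base, statement):
    """Return a new graph holding everything in `base` plus `statement`."""
    return base | {statement}


# Ga = G1 plus Sa; it inherits the probability assigned to Sa.
GA = make_supergraph(G1, SA)
graph_probability["Ga"] = 0.9
```

An inference process could then be run over `GA` exactly as it would over `G1`, with the understanding that whatever it derives is only as likely as Ga itself.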
Let's say we have another statement, called S~a, known to be the inverse of Sa, either because a human manually added a statement to G1 creating that relationship between them, or because an automated process was able to infer it. We can infer that its probability is .1, and carry out the same supergraph process, this time producing a graph with probability .1.
Let's say we have another statement, Sb, also with probability .9. We could perform the supergraph process with it on G1 to produce yet another graph with probability .9. We could also perform the supergraph process on Ga, producing a graph that represents everything in G1 being true, plus Sa, plus Sb. We'll call that graph Gab. We can then add to G1 a statement "Gab probability .81" (that is, .9 × .9). This assumes the two statements are independent.
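The arithmetic behind Gab can be sketched as a product over the statements added on top of G1. The `joint_probability` helper is a hypothetical name, and the calculation is only valid under the independence assumption.

```python
import math


def joint_probability(probabilities):
    """Probability that every statement holds, assuming independence."""
    return math.prod(probabilities)


# Sa and Sb each have probability 0.9, so Gab comes out at about 0.81.
p_gab = joint_probability([0.9, 0.9])
```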
To represent dependent probabilities, instead of making statements about probability in G1, we can make them in the supergraphs that suppose those statements are true. E.g. if Sa being true implies Sb has a probability of .99, then we add a statement to Ga saying as much. The supergraph construction then proceeds the same way, but the resulting graph has a probability of 0.891 (.9 × .99).
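One way to picture the dependent case is to key each probability statement by the graph in which it is asserted, so that a statement made in Ga is read as conditional on Sa. The `probability_in` table and the `extend` helper below are illustrative assumptions, not part of any RDF vocabulary.

```python
# Probability statements, keyed by the graph in which they are asserted.
probability_in = {
    "G1": {"Sa": 0.9},   # asserted in G1: unconditional
    "Ga": {"Sb": 0.99},  # asserted in Ga: holds given that Sa is true
}


def extend(graph_name, graph_probability, statement_name):
    """Probability of the supergraph made by adding a statement to a graph.

    The statement's probability is looked up in the graph being extended,
    so it is conditional on everything that graph already supposes.
    """
    return graph_probability * probability_in[graph_name][statement_name]


p_ga = extend("G1", 1.0, "Sa")    # 0.9
p_gab = extend("Ga", p_ga, "Sb")  # 0.9 * 0.99, about 0.891
```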
Let's try a more formal approach.
We'll define three predicates:
- n hasProbability p: n is a statement or graph, p is a number between 0 and 1 (inclusive).
- n hasProbabilityAtLeast p: n is a graph or statement, p is a number between 0 and 1 (inclusive).
- s isOppositeOf t: s and t are statements.
You can make several inferences that are obvious. In addition:
if n isOppositeOf m and n hasProbability p then m hasProbability 1 − p
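As a rough sketch, that rule can be applied to a set of (subject, predicate, object) tuples like this. The `infer_opposites` function is a hypothetical helper, and it applies the rule only in the direction written above; it does not also treat isOppositeOf as symmetric.

```python
def infer_opposites(triples):
    """Apply: n isOppositeOf m, n hasProbability p  =>  m hasProbability 1 - p."""
    # Index the known probabilities by subject.
    probability = {s: o for s, p, o in triples if p == "hasProbability"}
    inferred = set()
    for s, p, o in triples:
        if p == "isOppositeOf" and s in probability:
            inferred.add((o, "hasProbability", 1 - probability[s]))
    return inferred


triples = {
    ("Sa", "hasProbability", 0.9),
    ("Sa", "isOppositeOf", "S~a"),
}
new_triples = infer_opposites(triples)  # one triple giving S~a roughly 0.1
```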
Let's think about the largest possible graph, the graph that contains all possible statements. Let's call it G🌌. Assigning meaning to G🌌 is pretty much impossible; it contains every possible contradiction! You could apply a certain kind of inference engine to it that just constructs subgraphs that contain no contradictions, but it would have an infinite amount of work to do. So instead, we'll carve out subgraphs that we can assign some meaning to. We can also nicely represent subjectivity while we're at it.
Let's say I want to mark some statement (Sa) as true. I create a new graph of statements I think are true, and add the statement to it. Let's call it Gmorgan:1. I can also add to Gmorgan:1 the statement "Gmorgan:1 hasProbability 1". I can let an inference engine loose in this graph and have it add everything it can derive from statements already in the graph to it.
Now suppose I want to say that another statement (Sb), which is in G🌌 but not in Gmorgan:1, has probability 0.8. I add a statement to Gmorgan:1: "Sb hasProbability 0.8". This can trigger the automatic creation of a new graph Gmorgan:b, which contains Sb and is also a supergraph of Gmorgan:1. We can then infer in Gmorgan:1 "Gmorgan:b hasProbabilityAtLeast 0.8".
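The automatic step might look something like the sketch below. Everything here is an assumption made for illustration: the `derive_supergraphs` helper, the `universe` dictionary (a tiny stand-in for looking statements up in G🌌), the example triple for Sb, and the generated graph name "Gmorgan:1+Sb", which plays the role of Gmorgan:b.

```python
def derive_supergraphs(base_name, base, universe):
    """For each 'S hasProbability p' in `base` with p < 1, build base + S.

    `universe` maps statement names to the statements themselves.
    Returns the new graphs (by name) and the meta-statements that could
    then be added to the base graph.
    """
    new_graphs = {}
    meta = set()
    for subject, predicate, obj in base:
        if predicate == "hasProbability" and subject in universe and obj < 1:
            name = f"{base_name}+{subject}"  # hypothetical naming scheme
            new_graphs[name] = base | {universe[subject]}
            meta.add((name, "hasProbabilityAtLeast", obj))
    return new_graphs, meta


# Hypothetical content for Sb, which lives outside Gmorgan:1.
SB = ("ex:sky", "ex:hasColour", "ex:blue")
universe = {"Sb": SB}

gmorgan_1 = frozenset({
    ("Gmorgan:1", "hasProbability", 1),
    ("Sb", "hasProbability", 0.8),
})

graphs, meta = derive_supergraphs("Gmorgan:1", gmorgan_1, universe)
```

Here `graphs["Gmorgan:1+Sb"]` contains everything in Gmorgan:1 plus Sb, and `meta` holds the hasProbabilityAtLeast statement that would be added back to Gmorgan:1.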