Representing Discrete Probabilities in RDF

Morgan Wahl

2019-11-13 23:33

Many RDF datasets are meant to be interpreted as simply "statements in the graph are true". In order to record more nuanced interpretations, we'll make meta-statements: statements about statements and graphs.

We'll use a predicate that assigns a probability to a statement or graph. When its subject is a statement, it means that statement is true with that probability. When its subject is a graph, it means all statements in that graph are true with at least that probability.

Assume we have a dataset with a variety of statements, some of which we think are true, some are false, and some are true only with some probability. We can't just start inferring new statements from this dataset, since we have no idea what they would mean.

First create a graph of "assumed true" statements. Let's call it G₁. Add to it a statement "G₁ probability 1". We can also add to this "true" graph probabilities for other statements that are not themselves present in it. Within this graph we can safely let loose an inference process to find more true statements.

Next comes the fun part. We want to allow inferences about statements with a known, non-1 probability. To do that, let's say we have a statement outside G₁, S_a, and there's a statement in G₁ "S_a probability .9". We'll create a new supergraph of G₁ consisting of G₁ plus S_a. Let's call that G_a. We can add to G₁ a statement "G_a probability .9". Then, we can safely let an inference process run in that new supergraph.
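The supergraph step can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: a statement is modelled as a plain (subject, predicate, object) tuple, a graph as a frozenset of statements, and the names supergraph, stated_probability and the ex: terms are hypothetical, not part of any RDF library.

```python
from fractions import Fraction

# Hypothetical statement that lives outside G1.
S_a = ("ex:sky", "ex:colour", "ex:blue")

# G1 holds the statements assumed true, including a probability for S_a.
# S_a itself is NOT a member of G1; only a statement about it is.
G1 = frozenset({
    ("ex:grass", "ex:colour", "ex:green"),
    (S_a, "probability", Fraction("0.9")),
})

def supergraph(base, statement):
    """Return a new graph containing everything in base plus statement."""
    return base | {statement}

def stated_probability(graph, statement):
    """Return the probability that graph assigns to statement, or None."""
    for s, p, o in graph:
        if s == statement and p == "probability":
            return o
    return None

G_a = supergraph(G1, S_a)             # G1 plus S_a
p_G_a = stated_probability(G1, S_a)   # the probability to record for G_a
```

Here G_a is a strict superset of G₁, and p_G_a is the value that would go into the statement "G_a probability .9" added to G₁.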

Let's say we have another statement, called S_~a, known to be the inverse of S_a, either because a human manually added a statement to G₁ creating that relationship between them, or because an automated process was able to infer it. We can infer its probability is .1, and carry out the same supergraph process, this time producing a graph with probability .1.

Let's say we have another statement, S_b, also with probability .9. We could perform the supergraph process with it on G₁ to produce yet another graph with probability .9. We could also perform the supergraph process on G_a, producing a graph that represents everything in G₁ being true, plus S_a, plus S_b. We'll call that graph G_ab. We can then add to G₁ a statement "G_ab probability .81". This assumes the probabilities of the two statements are independent.
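As a rough sketch of that arithmetic, the probability of a supergraph built from independent statements is just the product of their probabilities. The function name graph_probability is my own, and exact fractions are used only to avoid floating-point noise.

```python
from fractions import Fraction
from math import prod

def graph_probability(added):
    """Probability of a supergraph of G1, given the probabilities of the
    statements added to it, assuming those statements are independent."""
    return prod(added, start=Fraction(1))

p_a = Fraction("0.9")  # probability of S_a
p_b = Fraction("0.9")  # probability of S_b
p_G_ab = graph_probability([p_a, p_b])
print(float(p_G_ab))   # 0.81
```

With nothing added, the product is 1, which matches G₁ itself having probability 1.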

To represent dependent probabilities, instead of making statements about probability in G₁, we can make them in the supergraphs that suppose those statements are true. E.g. if S_a being true implies S_b has a probability of .99, then we add a statement to G_a saying as much. The supergraph construction then proceeds the same, but the resulting graph has a probability of .9 × .99 = 0.891.
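Written out, this is the usual chain rule for a joint probability. The P(·) notation is mine, and everything is taken relative to G₁, which has probability 1:

```latex
\begin{align*}
P(G_{ab}) &= P(S_a)\, P(S_b \mid S_a) \\
          &= 0.9 \times 0.99 = 0.891
\end{align*}
```

When the two statements are independent, P(S_b | S_a) = P(S_b) = 0.9, and the same formula gives 0.9 × 0.9 = 0.81.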

Let's try a more formal approach.

We'll define three predicates:

n hasProbability p: n is a statement or graph, p is a number between 0 and 1 (inclusive).
n hasProbabilityAtLeast p: n is a graph or statement, p is a number between 0 and 1 (inclusive).
s isOppositeOf t: s and t are statements.

You can make several inferences that are obvious. In addition:

if n isOppositeOf m and n hasProbability p then m hasProbability 1 − p
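A minimal sketch of how that rule might be applied to a set of statements follows. It assumes statements are (subject, predicate, object) tuples, applies only the rule exactly as written (it does not, for example, treat isOppositeOf as symmetric), and makes a single pass; infer_opposites is a hypothetical name.

```python
from fractions import Fraction

def infer_opposites(graph):
    """Apply: if n isOppositeOf m and n hasProbability p
    then m hasProbability 1 - p.
    Returns a new set holding the original and the derived statements."""
    derived = set(graph)
    opposites = {(s, o) for s, p, o in graph if p == "isOppositeOf"}
    probabilities = {(s, o) for s, p, o in graph if p == "hasProbability"}
    for n, m in opposites:
        for subject, prob in probabilities:
            if subject == n:
                derived.add((m, "hasProbability", 1 - prob))
    return derived

graph = {
    ("S_a", "isOppositeOf", "S_not_a"),
    ("S_a", "hasProbability", Fraction("0.9")),
}
result = infer_opposites(graph)
```

After this pass, result also contains "S_not_a hasProbability 1/10".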

Let's think about the largest possible graph, the graph that contains all possible statements. Let's call it G_🌌. Assigning meaning to G_🌌 is pretty much impossible; it contains every possible contradiction! You could apply a certain kind of inference engine to it that just constructs subgraphs that contain no contradictions, but it would have an infinite amount of work to do. So instead, we'll carve out subgraphs that we can assign some meaning to. We can also nicely represent subjectivity while we're at it.

Let's say I want to mark some statement (S_a) as true. I create a new graph of statements I think are true, and add the statement to it. Let's call it G_morgan:1. I can also add to G_morgan:1 the statement "G_morgan:1 hasProbability 1". I can let an inference engine loose in this graph and have it add to the graph everything it can derive from the statements already there.

Now let's say I want to say another statement (S_b) in G_🌌 (but not G_morgan:1) has probability 0.8. I add a statement to G_morgan:1: "S_b hasProbability 0.8". This can trigger the automatic creation of a new graph G_morgan:b, which contains S_b, and is also a supergraph of G_morgan:1. We can then infer in G_morgan:1 "G_morgan:b hasProbabilityAtLeast 0.8".
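The automatic step could look roughly like the sketch below. It rests on assumptions of my own: statements are tuples, each statement outside the base graph is known by a name held in a lookup table, and the new graph's name is built from a prefix plus that name. The function expand and the naming scheme are hypothetical.

```python
from fractions import Fraction

def expand(base, named_statements, prefix):
    """For every 'name hasProbability p' in base whose named statement is
    not itself in base, create a supergraph of base containing that
    statement and record 'graph hasProbabilityAtLeast p' in base.

    base: set of (subject, predicate, object) tuples
    named_statements: dict mapping a statement's name to its tuple
    Returns (extended base, dict of new graph name -> graph)."""
    meta = set()
    pending = {}
    for s, p, o in base:
        if (p == "hasProbability" and s in named_statements
                and named_statements[s] not in base):
            graph_name = f"{prefix}:{s}"
            meta.add((graph_name, "hasProbabilityAtLeast", o))
            pending[graph_name] = named_statements[s]
    extended = base | meta
    graphs = {name: extended | {stmt} for name, stmt in pending.items()}
    return extended, graphs

named = {"S_b": ("ex:b", "ex:p", "ex:o")}
G_morgan_1 = {
    ("ex:a", "ex:p", "ex:o"),
    ("S_b", "hasProbability", Fraction("0.8")),
}
G_morgan_1, new_graphs = expand(G_morgan_1, named, "G_morgan")
```

In this sketch the new graph is named "G_morgan:S_b" rather than G_morgan:b, and it is built from the extended base so that it remains a supergraph of G_morgan:1 after the hasProbabilityAtLeast statement has been added.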