Features

Below are descriptions of the important dedalov2 features. All of them are controlled by passing parameters to dedalov2’s explain method. All of them, except prefixes, change the search behavior of the system.

Changing the Search Heuristic

You can change the search behavior of dedalov2 by choosing a search heuristic. Currently, dedalov2 offers three search heuristics: entropy, longest path first, and shortest path first. Each of these heuristics is explained below. If you haven’t done so already, it is recommended to read about Paths first.

Entropy

In information theory, entropy is defined as

\[S = - \sum_{i}P_{i}\log {P_{i}}\]

Dedalov2 adapts this function to calculate the entropy of a path, \(p\). It does this by considering as its outcomes all explanations that use the path [e.g., \(\{e : e = (p,o_i) \}\)]. For each of these outcomes, it calculates its probability as the the number of roots it has divided by the total number of examples [e.g., \(\frac{|R(p,o)|}{|E|}\)]. Entropy can be calculated using different bases for its logaritm. To stay consistent with previous work, dedalov2 uses the base 10 logarithm. This results in the path evaluation function below.

\[S(p) = - \sum_o \frac{|R(p,o)|}{|E|} \log_{10} \frac{|R(p,o)|}{|E|}\]

In code, the result looks like:

def entropy(p: Path, examples: Examples) -> float:
   res: float = 0
   for obj in p.get_end_points():
      num_roots = len(p.get_starting_points_connected_to_endpoint(obj))
      frac = num_roots/len(examples)
      res -= frac * math.log10(frac)
   return res

Longest Path First

This search heuristic assigns higher scores to longer paths. This causes a depth-first-search-like behavior, extending a single path as long as possible before backtracking to previous ones.

Shortest Path First

This search heuristic assigns higher scores to shorter paths. This causes a breadth-first-search-like behavior, first extending all paths of the shortest length before selecting a longer path. This approach causes the search frontier to quickly increase in size. When using a large HDT file, consider using explicit memory usage limits to prevent MemoryErrors.

Pruning Search Paths

By default, Dedalov2 applies branch pruning to reduce the amount of work it needs to do. You can turn this off, or reduce its effect, by selecting a different branch pruning method. There are currently five such methods available.

Policy Short Name Description
global-less-equal (default) gle Prune path if it can only yield explanations with scores less or equal than the current global maximum.
global-less gl Prune path if it can only yield explanations with scores less than the current global maximum.
path-less-equal ple Prune path if it can only yield explanations with scores less or equal than the current path maximum.
path-less pl Prune path if it can only yield explanations with scores less than the current path maximum.
off off Don’t prune branches. Warning: this can result in large numbers of results and long computation times.

You can change the path pruning policy by passing it as a parameter.

ddl.explain("the-internet.hdt", "abba.txt", prune="ple")

Blacklisting URIs

Some predicates are too common or too unreliable to yield good explanations. In such cases, you can blacklist these predicates by passing to dedalov2 a file with blacklisted URIs.

blacklist.txt:

http://www.w3.org/2000/01/rdf-schema#label
http://www.w3.org/2002/07/owl#sameAs
http://www.w3.org/2004/02/skos/core#altLabel

Pass blacklist via parameter of the same name:

ddl.explain("the-internet.hdt", "abba.txt", blacklist="blacklist.txt")

Using URI Prefixes

The long URIs in the results make them hard to read. Dedalov2 supports URI prefixes. Simply pass your file with prefixes as a parameter.

import dedalov2 as ddl

for explanation in ddl.explain("the-internet.hdt",
                               "abba.txt",
                               minimum_score=1,
                               groupid=1,
                               prefix="prefix.txt"):
    print(explanation)

Results in

dbpedia2:label -| dbpedia:Polar_Music
dbpedia2:associatedActs -| dbpedia:ABBA
dbpedia2:wikilink -| dbpedia:Anni-Frid_Lyngstad
skos:subject -| dbpedia:Category:ABBA_members
dbpedia2:wikilink -| dbpedia:Cher
dbpedia2:wikilink -| dbpedia:Polar_Music
dbpo:associatedBand -| dbpedia:ABBA
dbpo:associatedMusicalArtist -| dbpedia:ABBA
dbpedia2:wikilink -| dbpedia:Svensktoppen
dbpedia2:wikilink -| dbpedia:Eurovision_Song_Contest
dbpedia2:wikilink -| dbpedia:ABBA
dbpedia2:wikiPageUsesTemplate -| dbpedia:Template:infobox_musical_artist
dbpedia2:wikilink -| dbpedia:Rik_Mayall
dbpo:label -| dbpedia:Polar_Music

Where prefix.txt is a tab-separated file that looks like:

madsrdf http://www.loc.gov/mads/rdf/v1#
bflc    http://id.loc.gov/ontologies/bflc/
rdf http://www.w3.org/1999/02/22-rdf-syntax-ns#
foaf    http://xmlns.com/foaf/0.1/
yago    http://yago-knowledge.org/resource/
rdfs    http://www.w3.org/2000/01/rdf-schema#
dbo http://dbpedia.org/ontology/
dbp http://dbpedia.org/property/
dc  http://purl.org/dc/elements/1.1/
gr  http://purl.org/goodrelations/v1#
...

Don’t write this file yourself. Download one from http://prefix.cc/popular.

Handling Large Input Files

Using a large example files as input has two significant drawbacks. Below are descriptions of these drawbacks, and how you can deal with them.

Firstly, the time required to find good explanations will go up. For every explanation, dedalov2 has to check to which examples it connects. Having more examples means more checks. Dedalov2 allows you to truncate the input file to reduce the number of input examples without having to modify the input file. This makes finding the right number of examples to include by trial-and-error much easier.

ddl.explain("the-internet.hdt", "abba.txt", truncate=10)

Using truncate with a value of 10 means that the algorithm will use 10 positive examples and 10 negative examples. If the number of positive or negative examples in the input file is lower than 10, it will use all of them.

Secondly, the ratio of positive/negative examples may affect the search heuristic. For example, the entropy path heuristic prioritizes paths that are, more or less, equally connected to positive and negative examples. If the number of positive, or negative, examples in the input file is much larger than the others, the entropy search heuristic may select different paths.

To prevent this, dedalov2 by default limits the number of positive and negative examples to the size of the smallest group. You can disable this by using the balance parameter.

ddl.explain("the-internet.hdt", "abba.txt", balance=False)

This allows the number of positive examples to differ from the number of negative examples.