Features¶
Below are descriptions of the important dedalov2 features. All of them are controlled by passing parameters to dedalov2’s explain method. All of them, except prefixes, change the search behavior of the system.
Changing the Search Heuristic¶
You can change the search behavior of dedalov2 by choosing a search heuristic. Currently, dedalov2 offers three search heuristics: entropy, longest path first, and shortest path first. Each of these heuristics is explained below. If you haven’t done so already, it is recommended to read about Paths first.
Entropy¶
In information theory, entropy is defined as
Dedalov2 adapts this function to calculate the entropy of a path, \(p\). It does this by considering as its outcomes all explanations that use the path [e.g., \(\{e : e = (p,o_i) \}\)]. For each of these outcomes, it calculates its probability as the the number of roots it has divided by the total number of examples [e.g., \(\frac{|R(p,o)|}{|E|}\)]. Entropy can be calculated using different bases for its logaritm. To stay consistent with previous work, dedalov2 uses the base 10 logarithm. This results in the path evaluation function below.
In code, the result looks like:
def entropy(p: Path, examples: Examples) -> float:
res: float = 0
for obj in p.get_end_points():
num_roots = len(p.get_starting_points_connected_to_endpoint(obj))
frac = num_roots/len(examples)
res -= frac * math.log10(frac)
return res
Longest Path First¶
This search heuristic assigns higher scores to longer paths. This causes a depth-first-search-like behavior, extending a single path as long as possible before backtracking to previous ones.
Shortest Path First¶
This search heuristic assigns higher scores to shorter paths. This causes a breadth-first-search-like behavior, first extending all paths of the shortest length before selecting a longer path. This approach causes the search frontier to quickly increase in size. When using a large HDT file, consider using explicit memory usage limits to prevent MemoryErrors.
Pruning Search Paths¶
By default, Dedalov2 applies branch pruning to reduce the amount of work it needs to do. You can turn this off, or reduce its effect, by selecting a different branch pruning method. There are currently five such methods available.
Policy | Short Name | Description |
---|---|---|
global-less-equal (default) | gle | Prune path if it can only yield explanations with scores less or equal than the current global maximum. |
global-less | gl | Prune path if it can only yield explanations with scores less than the current global maximum. |
path-less-equal | ple | Prune path if it can only yield explanations with scores less or equal than the current path maximum. |
path-less | pl | Prune path if it can only yield explanations with scores less than the current path maximum. |
off | off | Don’t prune branches. Warning: this can result in large numbers of results and long computation times. |
You can change the path pruning policy by passing it as a parameter.
ddl.explain("the-internet.hdt", "abba.txt", prune="ple")
Blacklisting URIs¶
Some predicates are too common or too unreliable to yield good explanations. In such cases, you can blacklist these predicates by passing to dedalov2 a file with blacklisted URIs.
blacklist.txt:
http://www.w3.org/2000/01/rdf-schema#label
http://www.w3.org/2002/07/owl#sameAs
http://www.w3.org/2004/02/skos/core#altLabel
Pass blacklist via parameter of the same name:
ddl.explain("the-internet.hdt", "abba.txt", blacklist="blacklist.txt")
Using URI Prefixes¶
The long URIs in the results make them hard to read. Dedalov2 supports URI prefixes. Simply pass your file with prefixes as a parameter.
import dedalov2 as ddl
for explanation in ddl.explain("the-internet.hdt",
"abba.txt",
minimum_score=1,
groupid=1,
prefix="prefix.txt"):
print(explanation)
Results in
dbpedia2:label -| dbpedia:Polar_Music
dbpedia2:associatedActs -| dbpedia:ABBA
dbpedia2:wikilink -| dbpedia:Anni-Frid_Lyngstad
skos:subject -| dbpedia:Category:ABBA_members
dbpedia2:wikilink -| dbpedia:Cher
dbpedia2:wikilink -| dbpedia:Polar_Music
dbpo:associatedBand -| dbpedia:ABBA
dbpo:associatedMusicalArtist -| dbpedia:ABBA
dbpedia2:wikilink -| dbpedia:Svensktoppen
dbpedia2:wikilink -| dbpedia:Eurovision_Song_Contest
dbpedia2:wikilink -| dbpedia:ABBA
dbpedia2:wikiPageUsesTemplate -| dbpedia:Template:infobox_musical_artist
dbpedia2:wikilink -| dbpedia:Rik_Mayall
dbpo:label -| dbpedia:Polar_Music
Where prefix.txt
is a tab-separated file that looks like:
madsrdf http://www.loc.gov/mads/rdf/v1#
bflc http://id.loc.gov/ontologies/bflc/
rdf http://www.w3.org/1999/02/22-rdf-syntax-ns#
foaf http://xmlns.com/foaf/0.1/
yago http://yago-knowledge.org/resource/
rdfs http://www.w3.org/2000/01/rdf-schema#
dbo http://dbpedia.org/ontology/
dbp http://dbpedia.org/property/
dc http://purl.org/dc/elements/1.1/
gr http://purl.org/goodrelations/v1#
...
Don’t write this file yourself. Download one from http://prefix.cc/popular.
Handling Large Input Files¶
Using a large example files as input has two significant drawbacks. Below are descriptions of these drawbacks, and how you can deal with them.
Firstly, the time required to find good explanations will go up. For every explanation, dedalov2 has to check to which examples it connects. Having more examples means more checks. Dedalov2 allows you to truncate the input file to reduce the number of input examples without having to modify the input file. This makes finding the right number of examples to include by trial-and-error much easier.
ddl.explain("the-internet.hdt", "abba.txt", truncate=10)
Using truncate with a value of 10 means that the algorithm will use 10 positive examples and 10 negative examples. If the number of positive or negative examples in the input file is lower than 10, it will use all of them.
Secondly, the ratio of positive/negative examples may affect the search heuristic. For example, the entropy path heuristic prioritizes paths that are, more or less, equally connected to positive and negative examples. If the number of positive, or negative, examples in the input file is much larger than the others, the entropy search heuristic may select different paths.
To prevent this, dedalov2 by default limits the number of positive and negative examples to the size of the smallest group. You can disable this by using the balance parameter.
ddl.explain("the-internet.hdt", "abba.txt", balance=False)
This allows the number of positive examples to differ from the number of negative examples.