.. toctree::
   :maxdepth: 1
   :caption: Contents:

Features
========

Below are descriptions of the important dedalov2 features.
All of them are controlled by passing parameters to dedalov2's *explain* method.
All of them, except prefixes_, change the search behavior of the system.

Changing the Search Heuristic
-----------------------------

You can change the search behavior of dedalov2 by choosing a search heuristic.
Currently, dedalov2 offers three search heuristics: *entropy*, *longest path first*, and *shortest path first*.
Each of these heuristics is explained below.
If you haven't done so already, it is recommended to read about Paths_ first.

.. _Paths: background.rst

Entropy
~~~~~~~

In information theory, entropy is defined as

.. math::

   S = - \sum_{i}P_{i}\log {P_{i}}

Dedalov2 adapts this function to calculate the entropy of a path, :math:`p`.
It treats every explanation that uses the path as an outcome, i.e., :math:`\{e : e = (p,o_i) \}`.
For each outcome, the probability is the number of roots of that explanation divided by the total number of examples, i.e., :math:`\frac{|R(p,o)|}{|E|}`.
Entropy can be calculated using different bases for its logarithm.
To stay consistent with previous work, dedalov2 uses the base 10 logarithm.
This results in the path evaluation function below.

.. math::

   S(p) = - \sum_o \frac{|R(p,o)|}{|E|} \log_{10} \frac{|R(p,o)|}{|E|}

In code, the result looks like:

.. code:: python

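   import math

   def entropy(p: Path, examples: Examples) -> float:
       # Sum -P * log10(P) over every end point (object) of the path.
       res: float = 0
       for obj in p.get_end_points():
           num_roots = len(p.get_starting_points_connected_to_endpoint(obj))
           # Fraction of all examples that are roots of the explanation (p, obj).
           frac = num_roots / len(examples)
           res -= frac * math.log10(frac)
       return res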
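
As a quick sanity check of the formula, the sketch below computes the same quantity from plain root counts instead of a ``Path`` object.
The function name ``entropy_from_counts`` and its arguments are illustrative and not part of dedalov2.

.. code:: python

   import math
   from typing import Iterable


   def entropy_from_counts(root_counts: Iterable[int], num_examples: int) -> float:
       """Base-10 entropy of a path, given |R(p, o)| for each end point o."""
       res = 0.0
       for num_roots in root_counts:
           if num_roots == 0:
               # Assumption: an end point without roots contributes nothing.
               continue
           frac = num_roots / num_examples
           res -= frac * math.log10(frac)
       return res


   # A path whose two end points each connect to half of 10 examples.
   balanced = entropy_from_counts([5, 5], 10)
   # A path with a single end point that connects to every example.
   uniform = entropy_from_counts([10], 10)

Here ``balanced`` equals :math:`\log_{10} 2` (about 0.301) and ``uniform`` is 0.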

Longest Path First
~~~~~~~~~~~~~~~~~~

This search heuristic assigns higher scores to longer paths.
This causes a depth-first-search-like behavior,
extending a single path as long as possible before backtracking to previous ones.

Shortest Path First
~~~~~~~~~~~~~~~~~~~

This search heuristic assigns higher scores to shorter paths.
This causes a breadth-first-search-like behavior,
first extending all paths of the shortest length before selecting a longer path.
This approach causes the search frontier to quickly increase in size.
When using a large HDT file, consider using
explicit memory usage limits to prevent MemoryErrors.
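
The two length-based heuristics above can be sketched with a toy search frontier.
This is a minimal illustration, not dedalov2's implementation: ``ToyPath``, ``longest_first``, ``shortest_first``, and ``expansion_order`` are hypothetical names, and the real scoring may differ in detail.

.. code:: python

   import heapq
   from typing import Callable, List, Tuple

   # Assumption: a path is modeled as a tuple of predicate names.
   ToyPath = Tuple[str, ...]


   def longest_first(path: ToyPath) -> float:
       return float(len(path))  # longer paths score higher


   def shortest_first(path: ToyPath) -> float:
       return -float(len(path))  # shorter paths score higher


   def expansion_order(frontier: List[ToyPath],
                       score: Callable[[ToyPath], float]) -> List[ToyPath]:
       # heapq is a min-heap, so store the negated score to pop the best path first.
       heap = [(-score(p), p) for p in frontier]
       heapq.heapify(heap)
       return [heapq.heappop(heap)[1] for _ in range(len(heap))]


   frontier = [("a",), ("a", "b"), ("a", "b", "c")]
   depth_like = expansion_order(frontier, longest_first)
   breadth_like = expansion_order(frontier, shortest_first)

With *longest path first* the three-step path is expanded first; with *shortest path first* the one-step path is.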

Pruning Search Paths
--------------------

By default, Dedalov2 applies branch pruning to reduce the amount of work
it needs to do. You can turn this off, or reduce its effect, by
selecting a different branch pruning method. There are currently five
such methods available.

+-----------------------+-----------------------+-----------------------+
| Policy                | Short Name            | Description           |
+=======================+=======================+=======================+
| global-less-equal     | gle                   | Prune path if it can  |
| **(default)**         |                       | only yield            |
|                       |                       | explanations with     |
|                       |                       | scores less than *or  |
|                       |                       | equal to* the current |
|                       |                       | global maximum.       |
+-----------------------+-----------------------+-----------------------+
| global-less           | gl                    | Prune path if it can  |
|                       |                       | only yield            |
|                       |                       | explanations with     |
|                       |                       | scores less than the  |
|                       |                       | current global        |
|                       |                       | maximum.              |
+-----------------------+-----------------------+-----------------------+
| path-less-equal       | ple                   | Prune path if it can  |
|                       |                       | only yield            |
|                       |                       | explanations with     |
|                       |                       | scores less than or   |
|                       |                       | equal to the current  |
|                       |                       | *path* maximum.       |
+-----------------------+-----------------------+-----------------------+
| path-less             | pl                    | Prune path if it can  |
|                       |                       | only yield            |
|                       |                       | explanations with     |
|                       |                       | scores less than the  |
|                       |                       | current *path*        |
|                       |                       | maximum.              |
+-----------------------+-----------------------+-----------------------+
| off                   | off                   | Don’t prune branches. |
|                       |                       | *Warning:* this can   |
|                       |                       | result in large       |
|                       |                       | numbers of results    |
|                       |                       | and long computation  |
|                       |                       | times.                |
+-----------------------+-----------------------+-----------------------+

You can change the path pruning policy by passing it as a parameter.

.. code:: python

   ddl.explain("the-internet.hdt", "abba.txt", prune="ple")
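
The difference between the policies comes down to which maximum is compared and whether ties are pruned.
The sketch below makes that explicit. It is an illustration only: ``should_prune`` is a hypothetical helper, and how dedalov2 derives the best possible score of a path or tracks the path maximum is not described here, so both are treated as plain inputs.

.. code:: python

   def should_prune(policy: str, best_possible: float,
                    global_max: float, path_max: float) -> bool:
       """Return True if a path that can score at most best_possible is pruned."""
       if policy in ("gle", "global-less-equal"):
           return best_possible <= global_max
       if policy in ("gl", "global-less"):
           return best_possible < global_max
       if policy in ("ple", "path-less-equal"):
           return best_possible <= path_max
       if policy in ("pl", "path-less"):
           return best_possible < path_max
       if policy == "off":
           return False
       raise ValueError(f"unknown pruning policy: {policy}")


   # A path that can at best tie the global maximum.
   tie_gle = should_prune("gle", 0.8, global_max=0.8, path_max=0.5)
   tie_gl = should_prune("gl", 0.8, global_max=0.8, path_max=0.5)

A path that can only tie the current maximum is pruned under ``gle`` but kept under ``gl``; the same holds for ``ple`` and ``pl`` with respect to the path maximum.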

Blacklisting URIs
-----------------

Some predicates are too common or too unreliable to yield good explanations.
In such cases, you can blacklist these predicates by passing to dedalov2 a file with blacklisted URIs.

blacklist.txt:

::

   http://www.w3.org/2000/01/rdf-schema#label
   http://www.w3.org/2002/07/owl#sameAs
   http://www.w3.org/2004/02/skos/core#altLabel

Pass the blacklist via the parameter of the same name:

.. code:: python

   ddl.explain("the-internet.hdt", "abba.txt", blacklist="blacklist.txt")
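
Conceptually, blacklisting removes every triple whose predicate appears in the file.
The sketch below shows that idea on plain tuples. The helpers ``parse_blacklist`` and ``drop_blacklisted`` are hypothetical and only assume the one-URI-per-line format shown above; dedalov2's internal handling may differ.

.. code:: python

   from typing import Iterable, List, Set, Tuple

   Triple = Tuple[str, str, str]  # (subject, predicate, object)


   def parse_blacklist(lines: Iterable[str]) -> Set[str]:
       # One URI per line; blank lines are ignored.
       return {line.strip() for line in lines if line.strip()}


   def drop_blacklisted(triples: Iterable[Triple], blacklist: Set[str]) -> List[Triple]:
       # Keep only triples whose predicate is not blacklisted.
       return [t for t in triples if t[1] not in blacklist]


   blacklist = parse_blacklist([
       "http://www.w3.org/2000/01/rdf-schema#label\n",
       "http://www.w3.org/2002/07/owl#sameAs\n",
   ])
   triples = [
       ("ex:s", "http://www.w3.org/2000/01/rdf-schema#label", "ex:o1"),
       ("ex:s", "http://xmlns.com/foaf/0.1/knows", "ex:o2"),
   ]
   kept = drop_blacklisted(triples, blacklist)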

.. _prefixes:

Using URI Prefixes
------------------

The long URIs in the results make them hard to read. Dedalov2 can
abbreviate them with URI prefixes: pass a file with prefixes via the
``prefix`` parameter.

.. code:: python

   import dedalov2 as ddl

   for explanation in ddl.explain("the-internet.hdt",
                                  "abba.txt",
                                  minimum_score=1,
                                  groupid=1,
                                  prefix="prefix.txt"):
       print(explanation)

Results in

::

   dbpedia2:label -| dbpedia:Polar_Music
   dbpedia2:associatedActs -| dbpedia:ABBA
   dbpedia2:wikilink -| dbpedia:Anni-Frid_Lyngstad
   skos:subject -| dbpedia:Category:ABBA_members
   dbpedia2:wikilink -| dbpedia:Cher
   dbpedia2:wikilink -| dbpedia:Polar_Music
   dbpo:associatedBand -| dbpedia:ABBA
   dbpo:associatedMusicalArtist -| dbpedia:ABBA
   dbpedia2:wikilink -| dbpedia:Svensktoppen
   dbpedia2:wikilink -| dbpedia:Eurovision_Song_Contest
   dbpedia2:wikilink -| dbpedia:ABBA
   dbpedia2:wikiPageUsesTemplate -| dbpedia:Template:infobox_musical_artist
   dbpedia2:wikilink -| dbpedia:Rik_Mayall
   dbpo:label -| dbpedia:Polar_Music

Where ``prefix.txt`` is a tab-separated file that looks like:

::

   madsrdf http://www.loc.gov/mads/rdf/v1#
   bflc    http://id.loc.gov/ontologies/bflc/
   rdf http://www.w3.org/1999/02/22-rdf-syntax-ns#
   foaf    http://xmlns.com/foaf/0.1/
   yago    http://yago-knowledge.org/resource/
   rdfs    http://www.w3.org/2000/01/rdf-schema#
   dbo http://dbpedia.org/ontology/
   dbp http://dbpedia.org/property/
   dc  http://purl.org/dc/elements/1.1/
   gr  http://purl.org/goodrelations/v1#
   ...

Don’t write this file yourself. Download one from
http://prefix.cc/popular.
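
To make the file format concrete, the sketch below parses tab-separated prefix lines and abbreviates a URI with them.
The helpers ``parse_prefixes`` and ``abbreviate`` are hypothetical, and choosing the longest matching namespace is an assumption, not a documented dedalov2 behavior.

.. code:: python

   from typing import Dict, Iterable


   def parse_prefixes(lines: Iterable[str]) -> Dict[str, str]:
       """Map namespace URI to prefix from 'prefix<TAB>namespace' lines."""
       namespaces: Dict[str, str] = {}
       for line in lines:
           parts = line.rstrip("\n").split("\t")
           if len(parts) != 2:
               continue  # skip malformed or empty lines
           prefix, namespace = parts
           namespaces[namespace.strip()] = prefix.strip()
       return namespaces


   def abbreviate(uri: str, namespaces: Dict[str, str]) -> str:
       # Assumption: prefer the longest namespace that matches the URI.
       matches = [ns for ns in namespaces if uri.startswith(ns)]
       if not matches:
           return uri
       best = max(matches, key=len)
       return f"{namespaces[best]}:{uri[len(best):]}"


   namespaces = parse_prefixes([
       "rdfs\thttp://www.w3.org/2000/01/rdf-schema#\n",
       "dbo\thttp://dbpedia.org/ontology/\n",
   ])
   short = abbreviate("http://www.w3.org/2000/01/rdf-schema#label", namespaces)

Here ``short`` is ``rdfs:label``; a URI with no matching namespace is returned unchanged.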

Handling Large Input Files
--------------------------

Using a large example file as input has two significant drawbacks.
Below are descriptions of these drawbacks,
and how you can deal with them.

Firstly, the time required to find good explanations will go up.
For every explanation, dedalov2 has to check to which examples it connects.
Having more examples means more checks.
Dedalov2 allows you to truncate the input file to reduce the number of input examples without having to modify the input file.
This makes finding the right number of examples to include by trial-and-error much easier.

.. code:: python

   ddl.explain("the-internet.hdt", "abba.txt", truncate=10)

Using truncate with a value of 10 means that the algorithm will use 10 positive examples and 10 negative examples.
If the number of positive or negative examples in the input file is lower than 10, it will use all of them.
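
The effect of ``truncate`` on the two groups can be sketched as follows.
The helper ``truncate_examples`` is hypothetical, and keeping the *first* ``n`` examples of each group is an assumption; this section does not say which examples dedalov2 keeps.

.. code:: python

   from typing import List, Tuple


   def truncate_examples(positives: List[str], negatives: List[str],
                         n: int) -> Tuple[List[str], List[str]]:
       # Slicing past the end of a list simply returns the whole list,
       # so a group with fewer than n examples is kept in full.
       return positives[:n], negatives[:n]


   positives = [f"pos{i}" for i in range(12)]
   negatives = [f"neg{i}" for i in range(3)]
   kept_pos, kept_neg = truncate_examples(positives, negatives, 10)

With 12 positive and 3 negative examples and ``n = 10``, this keeps 10 positives and all 3 negatives.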

Secondly, the ratio of positive to negative examples may affect the search heuristic.
For example, the entropy heuristic prioritizes paths that are connected to roughly equal numbers of positive and negative examples.
If one group in the input file is much larger than the other,
the entropy heuristic may select different paths.

To prevent this, dedalov2 by default limits the number of positive and negative examples to the size of the smallest group.
You can disable this by setting the ``balance`` parameter to ``False``.

.. code:: python

   ddl.explain("the-internet.hdt", "abba.txt", balance=False)

This allows the number of positive examples to differ from the number of negative examples.
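
The default balancing behavior can be sketched as below.
The helper ``balance_examples`` is hypothetical, and keeping the first examples of the larger group is an assumption; this section only states that both groups are limited to the size of the smallest one.

.. code:: python

   from typing import List, Tuple


   def balance_examples(positives: List[str],
                        negatives: List[str]) -> Tuple[List[str], List[str]]:
       # Limit both groups to the size of the smaller one.
       size = min(len(positives), len(negatives))
       return positives[:size], negatives[:size]


   pos = ["p1", "p2", "p3", "p4", "p5"]
   neg = ["n1", "n2"]
   balanced_pos, balanced_neg = balance_examples(pos, neg)

With 5 positive and 2 negative examples, both groups end up with 2 examples; with ``balance=False`` the groups would stay at 5 and 2.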