TeX Live Database as Graph Database

For a presentation at the Neo4j User Meeting in Tokyo I have converted the TeX Live Database into a Graph Database and represented dependencies between all kind of packages as well as files and their respective packages as nodes and relations.

Update 20181010: I have worked out the first step mentioned in further work and got rid of uuids completely and use package names/revisions and file names as identifier. I also added a new node type for TLPDB and renamed the relation between packages and files from contains to includes. The former now refers to the relation between TLPDB and packages. The text and code has been updated to reflect this.

Before going into the details how I did represent the TeX Live Database tlpdb as graph, let us recall a few concepts of how packages are managed and arranged in TeX Live. Each package in TeX Live has a cateogry. The currently available categories are Package, ConTeXt, Collection, Scheme, TLCore. They can be categorized into four groups:

  • Basic macro packages These are the meat of TeX Live, the actual stuff our upstream authors are writing. Typically LaTeX or font packages, they are either of category Package or ConTeXt.
  • Packages for binaries These are packages that ship “binary” files – which means files that are installed into the bin directory and are executable. Some of these are actually scripts and not binaries, though.
  • Collections A Collection contains basic and binary packages, and might depend on other collections. We guarantee that the set of collections is a partition of the available files, which allows distributors like Debian etc to make sure that no file is included two times in different packages.
  • Schemata These are the top-level groups that are presented to the user during installation. They depend on collections and other packages, and try to provide a meaningful selection.

The TeX Live Database itself is modeled after the Debian package database, and contains stanzas for each package. A typical example for a package would be (slightly abbreviated):

category Package
revision 15878
catalogue one2many
shortdesc Generalising mathematical index sets
longdesc In the discrete branches of mathematics and the computer
...
longdesc one-line change.
docfiles size=98
 texmf-dist/doc/latex/12many/12many.pdf details="Package documentation"
 texmf-dist/doc/latex/12many/README details="Readme"
srcfiles size=6
 texmf-dist/source/latex/12many/12many.dtx
 texmf-dist/source/latex/12many/12many.ins
runfiles size=1
 texmf-dist/tex/latex/12many/12many.sty
catalogue-ctan /macros/latex/contrib/12many
catalogue-date 2016-06-24 19:18:15 +0200
catalogue-license lppl
catalogue-topics maths
catalogue-version 0.3
...

A typical example for a collection would be:

name collection-langjapanese
category Collection
revision 48752
shortdesc Japanese
longdesc Support for Japanese; additional packages in
longdesc collection-langcjk.
depend collection-langcjk
depend ascmac
depend babel-japanese
...

and a typical example of a schema would be:

name scheme-medium
category Scheme
revision 44177
shortdesc medium scheme (small + more packages and languages)
longdesc This is the medium TeX Live collection: it contains plain TeX,
longdesc LaTeX, many recommended packages, and support for most European
longdesc languages.
depend collection-basic
depend collection-binextra
depend collection-context
depend collection-fontsrecommended
...

In total, we are currently at the following values: 9 Schemata, 41 Collections, 6718 Packages (Package, TLCore, ConTeXt), and about 181839 files.

Representation in Neo4j

Representation as graph was relatively straight-forward: We decided for separate nodes for each package and each file, and relations of dependency (depend in the above examples), inclusion (files being included in a package), and containment (a package is contained in a certain tlpdb revision).

We used a simple Perl script tl-dump-neo4j which uses the TeX Live provided Perl modules to read and parse the TeX Live Database to generate CSV files for each node type and each relation type. These CSV files were then imported into a Neo4j database with neo4j-import. For each node type one csv file was generated with three fields, an UUID consisting of the name and the revision separated by a colon, the name of the package and the revision. Example of the file node-Package.csv containing the Packages:

uuid:ID,name,revision
cslatex:47536,cslatex,47536
mcf2graph:48046,mcf2graph,48046
substances:40989,substances,40989
layouts:42428,layouts,42428
guide-to-latex:45712,guide-to-latex,45712

For the files contained in the database I use the file name as identifier, thus the respective csv only contains one field, the file name (enclosed in quotes to make sure that spaces are not mistreated).

There is a node type TLPDB with only identifier revision that carries the current version of the tlpdb used.

The three relations (depends, contains, and includes) then used the assigned UUIDs to define the relation: For packages it is the “name:revision”, for files the filename. The start of edge-depends.csv file is:

:START_ID,:END_ID
cslatex:47536,csplain:48563
cslatex:47536,latex:47860
cslatex:47536,cm:45811
cslatex:47536,tex-ini-files:40533
cslatex:47536,latex-fonts:28888
cslatex:47536,hyphen-base:48303

Only for the includes relation we added an additional tag giving the type of file (run/bin/doc/src according to the group the file is in the tlpdb). The start of edge-includes.csv is given below:

:START_ID,type,:END_ID
libertinus-type1:48618,run,"texmf-dist/fonts/tfm/public/libertinus-type1/LibertinusSans-tosf-sc-ly1.tfm"
sourcesanspro:42852,run,"texmf-dist/fonts/tfm/adobe/sourcesanspro/SourceSansPro-ExtraLight-tlf-sc-ly1.tfm"
cbfonts:31624,run,"texmf-dist/fonts/tfm/public/cbfonts/grxc0700.tfm"
clearsans:34405,doc,"texmf-dist/doc/fonts/clearsans/clear-samples.pdf"
fonts-tlwg:47499,run,"texmf-dist/fonts/tfm/public/fonts-tlwg/rkinnari_o.tfm"

The last relation is contains which sets up connections between tlpdb revisions and the contained packages. The start of edge-contains.csv is given below:

:START_ID,:END_ID
48864,ltxmisc:21927
48864,tex4ebook:47672
48864,tamefloats:27345
48864,matc3mem:35773
48864,todo:17746

With this in place a simple call to neo4j-import produced a ready-to-go Neo4j Database:

$ ls
edge-contains.csv    node-ConTeXt.csv  node-TLCore.csv
edge-depends.csv     node-Files.csv    node-TLPDB.csv
edge-includes.csv    node-Package.csv
node-Collection.csv  node-Scheme.csv
$ neo4j-import --into ../graphdb \
   --nodes:TLPDB node-TLPDB.csv \
   --nodes:Collection node-Collection.csv \
   --nodes:ConTeXt node-ConTeXt.csv \
   --nodes:Files node-Files.csv \
   --nodes:Package node-Package.csv \
   --nodes:Scheme node-Scheme.csv \
   --nodes:TLCore node-TLCore.csv \
   --relationships:contains edge-contains.csv \
   --relationships:includes edge-includes.csv \
   --relationships:depends edge-depends.csv
...
IMPORT DONE in 2s 93ms. 
Imported:
  168129 nodes
  172280 relationships
  175107 properties
Peak memory usage: 1.03 GB

Sample queries

Return all schemata:

match (s:Scheme) return s;

Return all dependencies from a schema to something else then a collection:

match p = (s:Scheme) -[:depends]-> (q)
  where NOT 'Collection' IN LABELS(q)
  return p;

Here we use LABELS to find all the labels of a node.

Check whether the same package is contained in two different collections:

match (c1:Collection) -[:depends]-> (p)
  <-[:depends]- (c2:Collection) return c1, c2, p;

Fortunately, only collections are targets of multiple depends, which is fine 😉

Search for cycles in the dependencies:

match p = (n)-[:depends*]-> (n) return p;

Here we use the * operator to search for arbitrary long paths. Interestingly we got one result, namely that ConTeXt depends on itself, something that is not good anyway.

Search for files that are included in multiple packages:

match (p1) -[:includes]-> (f)
  <- [:includes]- (p2) return p1, p2, f;

Fortunately here we didn't get any result. Anyway, this is checked every day with a simple grep/awk program 😉

Show all the documentation files for one package:

match (p) -[:includes {type:'doc'}]-> (f)
  where p.name = "tlcockpit"
  return p,f;

Graph Algorithm with Neo4j

The Neo4j Team also provides a set of graph alogrithm readily available by installing and activating a plugin. This plugin can be downloaded from this Neo4j Github Page. In my case this resulted in the download of graph-algorithms-algo-3.4.7.0.jar, which I did put into the plugins folder of my Neo4j installation. On Debian this is /var/lib/neo4j/plugins/. To get it to actually run one needs to allow running it by adding the following line to the Neo4j config file (on Debian /etc/neo4j/neo4j.conf):

dbms.security.procedures.unrestricted=algo.*

After a restart of Neo4j one is ready to use all the algorithms provided in this jar.

First let us check the Google Page Rank (whatever it might mean for the current case):

CALL algo.pageRank.stream(null, 'depends', {iterations:20, dampingFactor:0.85})
  YIELD nodeId, score
  MATCH (node) WHERE id(node) = nodeId
  RETURN node.name AS page,score
  ORDER BY score DESC

which gives the following output (in table mode):

╒══════════════════════════════╤═══════════════════╕
│"page"                        │"score"            │
╞══════════════════════════════╪═══════════════════╡
│"context"                     │4.868265000000001  │
├──────────────────────────────┼───────────────────┤
│"hyphen-base"                 │4.667172000000001  │
├──────────────────────────────┼───────────────────┤
│"hyph-utf8"                   │4.0754105          │
├──────────────────────────────┼───────────────────┤
│"kpathsea"                    │1.8529665          │
├──────────────────────────────┼───────────────────┤
│"plain"                       │0.982524           │
├──────────────────────────────┼───────────────────┤

In the similar vein is the Betweenness Centrality:

CALL algo.betweenness.stream(null, 'depends', {direction:'out'})
  YIELD nodeId, centrality
  MATCH (pkg) WHERE id(pkg) = nodeId
  RETURN pkg.name AS pkg,centrality
  ORDER BY centrality DESC;

which gives the following output:

╒══════════════════════════╤═══════════════════╕
│"pkg"                     │"centrality"       │
╞══════════════════════════╪═══════════════════╡
│"collection-basic"        │1675.4717032967033 │
├──────────────────────────┼───────────────────┤
│"collection-latexextra"   │1212.0             │
├──────────────────────────┼───────────────────┤
│"context"                 │947.3333333333334  │
├──────────────────────────┼───────────────────┤
│"collection-latex"        │744.8166666666666  │
├──────────────────────────┼───────────────────┤
│"collection-pictures"     │586.0              │
├──────────────────────────┼───────────────────┤

Finally let us look at the triangle computation:

CALL algo.triangleCount.stream(null, 'depends', {concurrency:4})
  YIELD nodeId, triangles, coefficient
  MATCH (p) WHERE id(p) = nodeId
  RETURN p.name AS name, triangles, coefficient
  ORDER BY triangles DESC

which yields the following output:

╒═════════════════════════╤═══════════╤═══════════════════════╕
│"name"                   │"triangles"│"coefficient"          │
╞═════════════════════════╪═══════════╪═══════════════════════╡
│"collection-basic"       │109        │0.042644757433489826   │
├─────────────────────────┼───────────┼───────────────────────┤
│"scheme-full"            │46         │0.05897435897435897    │
├─────────────────────────┼───────────┼───────────────────────┤
│"collection-latex"       │43         │0.04154589371980676    │
├─────────────────────────┼───────────┼───────────────────────┤
│"scheme-tetex"           │42         │0.022950819672131147   │
├─────────────────────────┼───────────┼───────────────────────┤
│"collection-context"     │39         │0.04318936877076412    │
├─────────────────────────┼───────────┼───────────────────────┤

Future Work

[DONE 20181010 - see above] During the presentation we got the suggestion to use hash values of the node content instead of arbitrarily computed uuids to allow for better upgrades/additions in case the values of nodes did remain the same.

Furthermore, it would be interesting to parse the full information of packages (including revision numbers, catalogue information etc), save them into the nodes, and regularly update the database to see the development of packages. To make this actually work out we need the first step of using hashes, though.

Closing

Considering that all of the above plus the actual presentation slides were written in less than one day, one can see that developing a graph database based on Neo4j and playing around with it is a rather trivial procedure. The difficult part is normally to find the "right" set of node types and relation types, as well as their attributes. In the case of the TeX Live database this was quite trivial, which allowed for an easy and direct representation in Neo4j.

We made the graph (read-only) available for experimentation at http://texlive.info:7474/browser/ (with user/pass neo4j).

We hope that these simple examples of graphs help others to kick-start more interesting and deeper projects using the power of graphs.

7 Responses

  1. karl says:

    gosh, it’s been too long. i don’t remember our simple tlpdb file format being “based” on debian’s. especially since i’ve never seen debian’s. but maybe you proposed it initially? or it might be a case of independent convergence, i mean, how else would one rationally represent this stuff in a line-oriented file …

    • Yes it is, and you should know it as I said it many times 😉 Ok, based is maybe wrong, “inspired” would be better. A simple stanza (=paragraph) format with key/value pairs. Here is one stanza from a Debian Package file:

      Package: 0xffff
      Version: 0.8-1
      Installed-Size: 168
      Maintainer: Sebastian Reichel 
      Architecture: amd64
      Depends: libc6 (>= 2.14), libusb-0.1-4 (>= 2:0.1.12)
      Description: Open Free Fiasco Firmware Flasher
      Multi-Arch: foreign
      Homepage: https://github.com/pali/0xFFFF
      Description-md5: 183c49f6505eb3432d1b069800f1f5b6
      Tag: admin::hardware, hardware::usb, implemented-in::c,
       interface::commandline, role::program, scope::utility, works-with::file
      Section: misc
      Priority: optional
      Filename: pool/main/0/0xffff/0xffff_0.8-1_amd64.deb
      Size: 58908
      MD5sum: e549a30bd8c46fdbb0fe769f537fa535
      SHA256: ce7b8084c77bd56e974477acb83118033b40fa190dbb9a4407d4f4b237ccd60c
      
    • Forgot to say, I designed the tlpdb format with this background, that is why I said “based”, or “inspired”.

  2. wtsnjp says:

    Once I wrote a crazy one-liner script (golfed as much as I could) to get the exact list of “basic and binary packages” contained in a scheme/collection. I found that `tlmgr info` is useful, but the result can contain collections (i.e. not only basic and binary packages), so I have to call `tlmgr info` multiple times to resolve dependency recursively – it actually take a while. The graph database will solve this kind of demands more easily, effectively, and visually 🙂
    cf. https://blog.wtsnjp.com/2018/08/18/tlpkg-dependency/ (in Japanese)

    Are you going to maintain the demo (http://texlive.info:7474/browser/) for long term? I’m asking this because the URL doesn’t look like permanent one…

    • Indeed, getting only the basic/binary packages from a scheme or collection is not trivial and would need either several invocations of tlmgr, or a minimal perl program using the modules. Or you just get the dependencies and make greps etc, collections are always called collection-* and schemata scheme-*.

      Concerning the web demo – I guess it mostly depends on how much it is used and how much it hits on my server’s CPU/mem 😉 The URL is not strange, texlive.info is used also for tlcontrib, and the port is the standard port for the Neo4j browser.

  3. Luigi says:

    “””
    Here we use the * operator to search for arbitrary long paths. Interestingly we got one result, namely that ConTeXt depends on itself, something that is not good anyway.
    “””
    Why is it “not good” ?

    • Hi Luigi,
      it doesn’t hurt, but is strange. In fact it is an artifact because in our tlpsrc there is no such dependency, but it is automatically added in one of our generation scripts.

Leave a Reply

Your email address will not be published.

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>