I have been playing around with Neo4j as graph database, and searching for a big dataset I decided to look at Debian packages (source and binary) from stable, testing, sid, and experimental, and represent all of that in a big graph database.
While this is far from ready, the following entities and relations a represented:
- source packages, unversioned and versioned
- binary packages, unversioned and versioned
- all dependencies, including alternatives and versioned dependencies
- relations like maintains, builds, etc
- suites (stable, testing, sid, experimental)
The graph currently has 220618 nodes and 782323 edges, and my first trial to import this into the database was by generating a long cypher statement, and then throwing that at cypher-shell. Well, that was not the best idea. After 24h I stopped the process and rewrote the generation script to generate csv files. Using neo4j-import the same amount of data was imported in 5secs (!!!).
What I would like to get in the future is the whole package history as well, and maybe also include all the bugs into the database … if I only would have easily accessible and parseable information about these items (Debian Q&A maybe?). If you have any suggestions, please let me know.
More to come, stay tuned.