I need to be able to manipulate a large (10^7 nodes) graph in python. The data corresponding to each node/edge is minimal, say, a small number of strings. What is the most efficient, in terms of ** memory and speed**, way of doing this?

A dict of dicts is more flexible and simpler to implement, but I intuitively expect a list of lists to be faster. The list option would also require that I keep the data separate from the structure, while dicts would allow for something of the sort:

graph[I][J]["Property"]="value"

What would you suggest?

Yes, I should have been a bit clearer on what I mean by efficiency. In this particular case I mean it in terms of random access retrieval.

Loading the data in to memory isn’t a huge problem. That’s done once and for all. The time consuming part is visiting the nodes so I can extract the information and measure the metrics I’m interested in.

I hadn’t considered making each node a class (properties are the same for all nodes) but it seems like that would add an extra layer of overhead? I was hoping someone would have some direct experience with a similar case that they could share. After all, graphs are one of the most common abstractions in CS.

## Answer

I would strongly advocate you look at NetworkX. It’s a battle-tested war horse and the first tool most ‘research’ types reach for when they need to do analysis of network based data. I have manipulated graphs with 100s of thousands of edges without problem on a notebook. Its feature rich and very easy to use. You will find yourself focusing more on the problem at hand rather than the details in the underlying implementation.

**Example of Erdős-Rényi random graph generation and analysis**

""" Create an G{n,m} random graph with n nodes and m edges and report some properties. This graph is sometimes called the Erd##[m~Qs-Rényi graph but is different from G{n,p} or binomial_graph which is also sometimes called the Erd##[m~Qs-Rényi graph. """ __author__ = """Aric Hagberg (hagberg@lanl.gov)""" __credits__ = """""" # Copyright (C) 2004-2006 by # Aric Hagberg # Dan Schult # Pieter Swart # Distributed under the terms of the GNU Lesser General Public License # http://www.gnu.org/copyleft/lesser.html from networkx import * import sys n=10 # 10 nodes m=20 # 20 edges G=gnm_random_graph(n,m) # some properties print "node degree clustering" for v in nodes(G): print v,degree(G,v),clustering(G,v) # print the adjacency list to terminal write_adjlist(G,sys.stdout)

Visualizations are also straightforward:

More visualization: http://jonschull.blogspot.com/2008/08/graph-visualization.html