Network Analysis in Python

A comprehensive introduction and hands-on practice of graph theory and network analysis

Guangyuan(Frank) Li
7 min readJun 20, 2023
Photo by Alina Grubnyak on Unsplash

Graph, or network, essentially just consists of a set of points and the lines linking them in between. Have you ever wondered, what makes network analysis become such an indispensable component of data science?

This, in my opinion, might attribute to its powerful representation of the real world. Deriving an abstract model is usually the first step to comprehending a complicated real-life problem, however, not everything in our life can easily fit into a regular grid shape object (i.e. image or matrix), however, the graph structure seems to have the ability to transcend the constraints and can perfectly express the complex relationships across a variety of scenarios (social network, protein-protein interactions).

Since networks and graphs can provide a generally great framework allowing us to model the real world. The next step would naturally be how to derive useful insights from a network, which can further be mapped back to the real situation. That necessitates a deep understanding of graph theory and network analysis.

Mathematically, a graph can be simply denoted as G=(V,E) , where G is the graph, v signifies the set of vertices and E means the collection of edges. I will start from here, and show you how you can obtain insightful conclusions from this simple equation. We are going to use the popular Python package — networkx to demonstrate all the concepts in the real codes.

So brace yourself and let’s get started!

Create a graph object

In the networkx python package, 4 built-in graph objects have been implemented, they are nx.Graph() , nx.DiGraph() , nx.MultiGraph() and nx.MultiDiGraph() . nx.Graph() object denotes a normal undirected network, the edges between the vertices do not have directions. You can use nx.DiGraph() to represent directionality and nx.MultiGraph() to enable multiple edges (undirected, if directed, using nx.MultiDiGraph())between two vertices.

Let’s first learn how to build a graph from scratch, by doing so, I chose the most complicated instance — MultiDiGraph() object:

A MultiDiGraph object (image by author)

To summarize:

A graph is composed of nodes and edges, they are the two pillars of any graph. Each node and edge can have associated attributes. You can either add a node or edge one at a time or use an iterator function add_nodes/edges_from .It is important to get a sense of what data type the G.nodes() or G.edges() are, and when printing out each element, what it will look like (tuple, nested tuple, etc).

It is also common to convert networks from NumPy or pandas, usually, a graph can be represented by an adjacency matrix or edge list.

Edge list or adjacency matrix (image by author)

Networkx also has quite a lot of built-in graph generators, which can become very handy when you just want to have a quick graph instance.

For instance, complete_graph means the vertices in the graph are fully connected, you can refer to the documentation for other types of graphs and we will cover a few others in the following section.

G = nx.complete_graph(n=5) # n means the number of vertices

Network analysis

So once we have a graph or network, the next question is, what types of questions we can ask about this network? In my opinion, these can be categorized into multiple categories as follows:

Intrinsic properties associated with a node

  1. Degree
G = nx.Graph()
G.add_nodes_from([1,2,3],color='blue')
G.add_edges_from([(1,2),(2,3),(1,3)],weight=4.5)
G.degree()
# DegreeView({1: 2, 2: 2, 3: 2})
for i in G.degree():
print(i)
# (1, 2)
# (2, 2)
# (3, 2)

2. Cluster of Coefficient (CC)

The cluster of coefficient (CC) is an important feature for a node, it measures how the neighbors of a node connect with each other.

lcc = nx.clustering(G,1,weight='weight') # local cc
# 1.0

3. Centrality (with different measurements)

Intrinsic properties associated with an edge (or two nodes)

There can exist several paths between two reachable nodes in the graph, the shortest path specifically denotes the path that has the shortest distance, measured by the customized edge attributes.

p = nx.shortest_path(G,source=1,target=3,weight='weight',method='dijkstra')
# [1, 3]

Intrinsic properties associated with a graph

  1. small-world and scale-free

A small-world network is a very vivid way to describe the characteristics of this type of network. Formally, it means a network with both a small average shortest path and a high clustering coefficient. It sounds paradoxical at first glance because a high clustering coefficient means nodes in the graph are somehow disconnected and form several hubs on their own, which ought not to have a small average shortest path because a node in one hub need to walk a long way to reach another distant hub. The reason why both conditions can be met is that in our social network, although we all have our own closest friends and contacts, there are several people in the network, who are extremely social and they serve as the connection between different hubs, the existence of these social people explains the small average shortest path. In other words, Both you and another stranger very likely know certain “social” people.

To measure the extent to which a network is a small world, we have two metrics, sigma and omega to contrast the shortest path and clustering coefficient of the query network against the random network. sigma>1 can be considered as a small world graph, which corresponds toomega=0 . omega=1 means it is a random network and omega=-1 means it is a random lattice shape network.

G = nx.complete_graph(5)
nx.sigma(G). # 1
nx.omega(G). # 0

Another commonly encountered network type is a scale-free network, informally scale-free network is the opposite of a random network, which contains hubs. Formally, scale-free networks mean the degree distribution follows a power law where a large fraction of nodes only has a low value of a degree. A scale-free network doesn’t necessarily mean it is a small-world network, you probably need to rewire a few random links between the hubs to meet the small average shortest path requirement.

Intrinsic properties associated with each community (cluster)

Community is a relatively conceptual term. It denotes a set of nodes that are densely connected with each other and loosely connected to nodes in other communities.

There are several different algorithms that can help you to find out the communities in the graph. However, to measure how well a community detection algorithm work, we define three metrics:

G = nx.complete_graph(10)
partition = [(1,2,3),(4,5,6,7),(8,9,0)]
from networkx.algorithms.community.quality import modularity,coverage,performance
modularity(G,partition) # -0.073
coverage(G,partition) # 0.26, intra-edges/total-edges
performance(G,partition) # 0.26, (intra-edges + inter-non-edges)/total-edges
  1. assortativity

Imagine each node has an attribute indicating its identity, assortativity refers to the extent to which one type of node mixes with each other type of node (including itself), this can be done using the nx.attribute_mixing_matrix function.

2. Number of the connected components

3. Connectivity measures how many edges/nodes need to be removed to disconnect the community

nx.node_connectivity(G)  # 2
nx.edge_connectivity(G) # 2
from networkx.algorithms.connectivity import local_node_connectivity
local_node_connectivity(G,1,2) # 2

4. Volume is the sum of out-degrees

nx.volume(G,[1,2],weight='weight')  # 18

5. Node boundary refers to a set of nodes in the out-neighbors of the subgraph

nx.node_boundary(G,[1,2]) # {3}

6. degree distribution

7. global cluster of coefficient

gcc = nx.average_clustering(G,weight='weight') # global cc is an average of local cc
# 1.0

Properties of two subgraphs

  1. Cut size means how many edges you need to cut to disconnect two sets.
nx.cut_size(G,[1,2],[3],weight='weight')  # 9

2. Conductance can be interpreted as a normalized version of the cut size

nx.conductance(G,[1,2],[3],weight='weight') # 1

3. edge boundary is the edges between two sets (S and the node boundary of S).

4. isomorphism and similarity

Two graphs are isomorphic if and only if they contain the same nodes and the nodes are connected in the same way. More formally, if node i and node j are adjacent in the graph G , then they have to be adjacent in the graph H , if this condition is met, then G and H are isomorphic. Based on that, we can define the similarity of two graphs by saying, how many nodes/edges need to be changed to make two graphs isomorphic. This metric is called graph editing distance.

G1 = nx.complete_graph(10)
G2 = nx.cycle_graph(10)
nx.graph_edit_distance(G1,G2) # 35
nx.is_isomorphic(G1,G2) # False

Other definitions

  1. random network

There are two types of widely-used random networks, one is called gnp , another is called gnm . gnp means a network with n nodes, however, the edges between nodes will only be created with the probability p . p will follow a uniform distribution. gnm , on the other hand, denotes randomly choosing a graph with n nodes and m edges amongst all its possible collections.

G = nx.gnp_random_graph(5,p=0.2)
G = nx.gnm_random_graph(5,3)

2. clique

clique means a subgraph of G, such that every two vertices in the subgraph are connected.

Graph Neural Network (GNN)

In the context of GNN, each node and edge can be embedded into different features, and once pooled, each graph can also be represented by different features. Now we can conduct tasks that include (1) node classification, (2) link classification, (3) graph embedding/representations, and (4) graph-level classification. GNN is particularly flexible to model objects like molecular structures, single-cell genomics data, and many others. Once we transform the domain-specific questions into these four types of fundamental tasks, multiple off-the-shelf algorithms can be applied.

Network visualization

Asides from Networkx’s built-in visualization function, we can also export it to formats compatible with Cytospace for an optimal layout, including the flexibility to change the layout, and appearance of nodes/labels.

Reference

  1. Mathematics | Walks, Trails, Paths, Cycles, and Circuits in Graph (https://www.geeksforgeeks.org/mathematics-walks-trails-paths-cycles-and-circuits-in-graph/)
  2. Connected Components in a Graph (https://www.baeldung.com/cs/graph-connected-components)
  3. Getting Started with Community Detection in Graphs and Networks (https://www.analyticsvidhya.com/blog/2020/04/community-detection-graphs-networks/)

--

--