Exploring GPlus Dataset using Neo4j

What is Neo4j

Apr 28, 2023

As we have discussed so far, graphs are handy for solving complex problems and understanding highly interconnected systems. It therefore makes sense to use a database built to handle graph data efficiently, in contrast to traditional relational SQL databases or document databases like MongoDB.

This is where Neo4j comes into play. Neo4j is a graph database management system built specifically for graph data. Unlike relational databases, it stores information as nodes and relationships, which allows considerably more flexible and efficient modelling of complex systems and interactions.
One of Neo4j’s primary advantages is its query language, Cypher.

Cypher is a declarative query language for graph data that lets users explore and modify a graph with a simple, clear syntax built around node and relationship patterns. This makes it approachable even for non-experts and helps organisations extract insights from their data more quickly.
Neo4j is used in a wide range of industries, including finance, healthcare, and social media. Because it can handle massive datasets and complicated queries, it is also a popular choice for organisations that need to work with graph data at scale.
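
To make the description of Cypher a little more concrete, here is a minimal sketch of a pattern query issued from Python. It assumes the FRIEND relationship type and the name property that the loader script later in this article creates; the function name friends_of and the FRIENDS_QUERY constant are made up for illustration.

```python
# Cypher pattern: start at the node with the given name, follow FRIEND
# relationships in either direction, and return the neighbours' names.
# No label is used because the loader in this article labels nodes by gender.
FRIENDS_QUERY = (
    "MATCH (a {name: $name})-[:FRIEND]-(b) "
    "RETURN DISTINCT b.name AS friend "
    "ORDER BY friend"
)


def friends_of(session, name):
    """Return the names of all nodes directly linked to `name` by FRIEND."""
    # `session` is expected to be a neo4j driver session (or anything with
    # a compatible run() method); $name is passed as a query parameter.
    result = session.run(FRIENDS_QUERY, name=name)
    return [record["friend"] for record in result]
```

With a live database this would be called as friends_of(session, "Some Name") inside a `with driver.session() as session:` block.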

Running Neo4j

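version: '3'

services:
  neo4j:
    image: neo4j:4.4
    environment:
      # Initial password; matches the credentials used by the loader script below.
      - NEO4J_AUTH=neo4j/1234567890
      # Install the Graph Data Science plugin and allow its procedures to run.
      - NEO4JLABS_PLUGINS=["graph-data-science"]
      - NEO4J_dbms_security_procedures_unrestricted=gds.*
      - NEO4J_dbms_security_procedures_allowlist=gds.*
    ports:
      - "7474:7474"  # HTTP (Neo4j Browser)
      - "7687:7687"  # Bolt (drivers)
    volumes:
      - myneo4jdata:/data

volumes:
  myneo4jdata: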

We will run Neo4j using Docker, a tool for running containerized applications.

The docker-compose file above starts a single Neo4j instance with the Graph Data Science plugin installed.

The “ports” section maps the container’s ports 7474 and 7687 to the same ports on the host machine. Port 7474 serves the Neo4j Browser over HTTP, and port 7687 is the Bolt protocol port that drivers connect to.
The “volumes” section creates a named volume called “myneo4jdata” and mounts it at the “/data” directory in the container, so the database contents persist between container restarts.

Run the following command from the directory that contains the docker-compose file to start the container:

docker compose up
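
Neo4j takes a little while to start accepting connections after the container comes up. The following is a minimal sketch of a readiness check that uses only the Python standard library; the helper name wait_for_port and its default timings are assumptions made for illustration, not part of Neo4j or Docker.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling wait_for_port("localhost", 7474) and wait_for_port("localhost", 7687) checks the two ports published in the compose file. A successful TCP connection only shows that the port is open, not that the database has finished initialising.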

Now we use a Python script to load our dataset into the Neo4j running instance:

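from neo4j import GraphDatabase
from faker import Faker
import random

# Each edge is skipped with this probability, so roughly 60% of edges are loaded.
edgeDens = 0.4

faker = Faker()
gender = ["Male", "Female", "Other"]

# Maps each original user ID to a (fake name, gender) pair.
data = {}

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "1234567890"))

with driver.session() as session, open('gplus_combined.txt') as f:
    for line in f:  # stream the file instead of reading it all into memory
        if random.random() < edgeDens:
            continue
        us1, us2 = line.strip().split()
        for us in (us1, us2):
            if us not in data:
                data[us] = (faker.name(), random.choice(gender))
        # Labels cannot be passed as query parameters, so they are formatted in;
        # they only ever come from the fixed gender list above.
        # Note: nodes are merged on name, so two users who happen to draw the
        # same fake name and gender collapse into a single node.
        query = (
            f"MERGE (a:{data[us1][1]} {{name: $us1}}) "
            f"MERGE (b:{data[us2][1]} {{name: $us2}}) "
            "MERGE (a)-[r:FRIEND]-(b) "
            "ON CREATE SET r.strength = $strength"
        )
        session.run(query, us1=data[us1][0], us2=data[us2][0], strength=random.random())

driver.close()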

This script reads edges from the gplus_combined.txt file, where each line holds a pair of user IDs, and inserts a random sample of them into Neo4j. Because the file holds only user IDs, the script uses the faker library to give each user a fake name and picks a gender at random; the gender becomes the node label. Each pair of users is connected by a FRIEND relationship with a random strength value.
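
The sampling step is easy to misread, so here is a small standalone sketch of it, with no database involved. The function name sample_edges and its seed argument are hypothetical, and unlike the loader this version skips blank or malformed lines instead of raising an error.

```python
import random


def sample_edges(lines, skip_prob, seed=None):
    """Yield (us1, us2) pairs, dropping each line with probability skip_prob."""
    rng = random.Random(seed)
    for line in lines:
        if rng.random() < skip_prob:
            continue  # same test as `random.random() < edgeDens` in the loader
        parts = line.split()
        if len(parts) != 2:
            continue  # assumption: ignore blank or malformed lines
        yield parts[0], parts[1]
```

With skip_prob set to 0.4, as edgeDens is in the loader, about 60% of the lines survive, because the value is the probability of skipping an edge rather than of keeping it.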

Cypher Queries

The first Cypher query we run is

MATCH (n) RETURN n

which returns every node in the database. By default, however, Neo4j Browser displays only the first 300 nodes so that rendering a large result does not overwhelm the machine.
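
Outside the Browser there is no display cap, so on a graph of this size it can help to bound the result explicitly. The sketch below shows one way to do that from Python; the function name fetch_nodes and its default of 300 are assumptions chosen to mirror the Browser default, not part of the driver.

```python
def fetch_nodes(session, limit=300):
    """Return up to `limit` nodes as (sorted labels, properties) pairs."""
    # LIMIT accepts a query parameter, so the cap is not formatted into the text.
    result = session.run("MATCH (n) RETURN n LIMIT $limit", limit=limit)
    nodes = []
    for record in result:
        node = record["n"]
        # A driver Node exposes its labels and behaves like a mapping of properties.
        nodes.append((sorted(node.labels), dict(node.items())))
    return nodes
```

Each returned pair holds the node's labels, which in this article's data is a single gender label, and its properties, which here is just the fake name.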

Written by Alan Marc Louis

No responses yet