Explore the Twitter network with Neo4j

⚠️ Before running these Cypher queries, make sure Neo4j is running with the GDS plugin installed ⚠️
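
A quick way to confirm the setup is to ask the server for its versions. This is a minimal sketch, assuming a GDS release that exposes the gds.version() function; if the function is unknown, the plugin is most likely not installed.

```cypher
// Returns the installed GDS version; errors out if the plugin is missing
RETURN gds.version() AS gdsVersion;

// Shows the Neo4j server name, version(s) and edition
CALL dbms.components() YIELD name, versions, edition
RETURN name, versions, edition;
```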

First, let's write the Cypher statements that define two node uniqueness constraints:

CREATE CONSTRAINT IF NOT EXISTS FOR (u:User) REQUIRE u.id IS UNIQUE; 
CREATE CONSTRAINT IF NOT EXISTS FOR (t:Tweet) REQUIRE t.id IS UNIQUE;

Why this matters:

  • Constraints enforce data quality by preventing duplicate User and Tweet nodes for the same id.
  • Neo4j also creates indexes behind the scenes, speeding up MATCH/MERGE lookups on u.id and t.id.
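
To check that both constraints were actually created, you can list them. This sketch assumes Neo4j 4.4 or later, the same versions that accept the FOR ... REQUIRE syntax used above.

```cypher
// List existing constraints with the labels and properties they cover
SHOW CONSTRAINTS
YIELD name, type, labelsOrTypes, properties
RETURN name, type, labelsOrTypes, properties;
```

You would expect one uniqueness constraint on User.id and one on Tweet.id in the result.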

Next, preview the CSV file before importing it.

What this does:

  • LOAD CSV WITH HEADERS reads the remote CSV and exposes each row as a map via row.
  • LIMIT 5 previews the data so you can validate header names and formats before writing MERGE logic.
  • Use the preview to confirm fields like id, name, username, createdAt.

LOAD CSV WITH HEADERS FROM "https://bit.ly/39JYakC" AS row
WITH row
LIMIT 5
RETURN row

Import user information

Explanation:

  • MERGE (u:User {id:row.id}) guarantees one node per unique user id (idempotent).
  • ON CREATE SET runs only when a new node is created; add ON MATCH SET if you need to update existing nodes.
  • datetime(row.createdAt) converts strings into Neo4j temporal types for time-based queries later.
  • With the uniqueness constraint, rerunning this step will not produce duplicates.

LOAD CSV WITH HEADERS FROM "https://bit.ly/39JYakC" AS row
MERGE (u:User {id:row.id})
ON CREATE SET u.name = row.name,
              u.username = row.username,
              u.registeredAt = datetime(row.createdAt)
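
If you also want reruns to refresh existing users, the ON MATCH SET clause mentioned above could look like the sketch below. Which properties to refresh is an assumption made for illustration; here only name and username are updated, and registeredAt is left untouched.

```cypher
LOAD CSV WITH HEADERS FROM "https://bit.ly/39JYakC" AS row
MERGE (u:User {id: row.id})
// Runs only when the node did not exist yet
ON CREATE SET u.name = row.name,
              u.username = row.username,
              u.registeredAt = datetime(row.createdAt)
// Runs only when the node already existed (assumed refresh policy)
ON MATCH SET u.name = row.name,
             u.username = row.username
```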

Load the follower network

Notes:

  • :auto tells Neo4j Browser to run the query in an auto-commit (implicit) transaction, which CALL { ... } IN TRANSACTIONS requires; the IN TRANSACTIONS clause is what actually splits the work into batches.
  • We MATCH users (they must exist from the previous step) and MERGE the FOLLOWS relationship to avoid duplicates.
  • Direction: (source)-[:FOLLOWS]->(target) means "source follows target".

:auto LOAD CSV WITH HEADERS FROM "https://bit.ly/3n08lEL" AS row
CALL {
  WITH row
  MATCH (s:User {id:row.source})
  MATCH (t:User {id:row.target})
  MERGE (s)-[:FOLLOWS]->(t)
} IN TRANSACTIONS
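
If you want to control the batch size yourself, IN TRANSACTIONS accepts an OF ... ROWS suffix. The sketch below repeats the same import with an explicit batch size; the value 500 is an arbitrary assumption, not a recommendation from the original tutorial.

```cypher
:auto LOAD CSV WITH HEADERS FROM "https://bit.ly/3n08lEL" AS row
CALL {
  WITH row
  MATCH (s:User {id:row.source})
  MATCH (t:User {id:row.target})
  MERGE (s)-[:FOLLOWS]->(t)
} IN TRANSACTIONS OF 500 ROWS  // commit after every 500 CSV rows (assumed value)
```

Because the relationship is created with MERGE, rerunning this variant after the original import should not add duplicate FOLLOWS relationships.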

Import tweets in batches with IN TRANSACTIONS

What's happening here:

  • Each Tweet is linked to its author via (:User)-[:PUBLISH]->(:Tweet).
  • MERGE (p:Tweet {id:row.id}) respects the uniqueness constraint and prevents duplicates.
  • We store text and createdAt to enable content and temporal analysis.
  • Batching via CALL { ... } IN TRANSACTIONS keeps transactions manageable for large CSVs.

:auto LOAD CSV WITH HEADERS FROM "https://bit.ly/3y3ODyc" AS row
CALL {
  WITH row
  MATCH (a:User{id:row.author})
  MERGE (p:Tweet{id:row.id})
  ON CREATE SET p.text = row.text, p.createdAt = datetime(row.createdAt)
  MERGE (a)-[:PUBLISH]->(p)
} IN TRANSACTIONS

Import MENTIONS relationships

  • Mentions connect a Tweet to the User account referenced in the tweet.
  • Direction: (t:Tweet)-[:MENTIONS]->(u:User).
  • Enables "most mentioned users" and "tweets mentioning X" queries.

LOAD CSV WITH HEADERS FROM "https://bit.ly/3tINZ6D" AS row
MATCH (t:Tweet {id:row.post})
MATCH (u:User {id:row.user})
MERGE (t)-[:MENTIONS]->(u);

Import RETWEETS relationships

  • Retweet edges connect a retweet to the original tweet.
  • Direction: (retweet)-[:RETWEETS]->(original).
  • Useful for measuring virality and cascade patterns.

LOAD CSV WITH HEADERS FROM "https://bit.ly/3QyDrRl" AS row
MATCH (source:Tweet {id:row.source})
MATCH (target:Tweet {id:row.target})
MERGE (source)-[:RETWEETS]->(target);

Import IN_REPLY_TO relationships

  • Reply edges connect a reply tweet to the tweet it replies to.
  • Direction: (reply)-[:IN_REPLY_TO]->(original).
  • Enables conversation threading and depth analysis.

LOAD CSV WITH HEADERS FROM "https://bit.ly/3b9Wgdx" AS row
MATCH (source:Tweet {id:row.source})
MATCH (target:Tweet {id:row.target})
MERGE (source)-[:IN_REPLY_TO]->(target);

Now inspect your graph model with:

CALL db.schema.visualization()

This procedure renders a live schema view based on your actual data:

  • Node labels and relationship types present
  • Property keys and constraints/indexes (depending on your environment)

You should see something like this in Neo4j Browser

Explore the GDS plugin

First, to load our graph into the GDS plugin, we must create what is called a graph projection, like this:

// Drop existing projection if it exists
CALL gds.graph.drop('twitter', false) YIELD graphName;

// Create graph projection for GDS algorithms
CALL gds.graph.project(
    'twitter',              // Graph name
    'User',                 // Node label
    'FOLLOWS',              // Relationship type
    {
        relationshipProperties: {}
    }
)
YIELD graphName, nodeCount, relationshipCount, projectMillis
RETURN graphName, nodeCount, relationshipCount, projectMillis

Now verify the projection with:

// Check projection info
CALL gds.graph.list('twitter')
YIELD graphName, nodeCount, relationshipCount, memoryUsage
RETURN graphName, nodeCount, relationshipCount, memoryUsage

PageRank algorithm 🏄‍♂️

As we saw in the previous course, PageRank is widely used to measure the importance of users based on their followers AND the importance of those followers.

// Estimate memory required
CALL gds.pageRank.write.estimate('twitter', {
    writeProperty: 'pagerank'
})
YIELD nodeCount, relationshipCount, requiredMemory;

// Run PageRank and write the score to the pagerank node property
CALL gds.pageRank.write('twitter', {
    writeProperty: 'pagerank',
    maxIterations: 20,
    dampingFactor: 0.85
})
YIELD nodePropertiesWritten, ranIterations;

// Find top 20 influential users by PageRank
MATCH (u:User)
RETURN u.username, u.name, u.pagerank
ORDER BY u.pagerank DESC
LIMIT 20
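
If you only want to look at the scores without writing a property back to the database, the stream mode of the same algorithm is an option. This is a sketch that reuses the 'twitter' projection and the parameters from the write call above.

```cypher
// Stream PageRank scores instead of writing them to the nodes
CALL gds.pageRank.stream('twitter', {
    maxIterations: 20,
    dampingFactor: 0.85
})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).username AS username, score
ORDER BY score DESC
LIMIT 20
```

Note that the later comparison query relies on the u.pagerank property, so the write mode is still needed for that step.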

Betweenness centrality: find the bridge users

Identifies users who connect different parts of the network.

// Run Betweenness Centrality (can be slow on large networks)
CALL gds.betweenness.stream('twitter')
YIELD nodeId, score
WITH gds.util.asNode(nodeId) AS user, score
ORDER BY score DESC
LIMIT 20
RETURN user.username, user.name, score as betweenness

High betweenness = bridge between communities. Information flows through these users.
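
Since exact betweenness can be slow on large networks, GDS also offers an approximate version that samples source nodes. The sketch below assumes the samplingSize and samplingSeed options of GDS 2.x; the values 500 and 42 are arbitrary assumptions, and samplingSize should stay below the number of nodes in your projection.

```cypher
// Approximate betweenness by sampling source nodes
CALL gds.betweenness.stream('twitter', {
    samplingSize: 500,   // assumed value: number of source nodes to sample
    samplingSeed: 42     // fixed seed so repeated runs sample the same nodes
})
YIELD nodeId, score
WITH gds.util.asNode(nodeId) AS user, score
ORDER BY score DESC
LIMIT 20
RETURN user.username, user.name, score AS approxBetweenness
```

The scores are estimates, so use them to rank users rather than to compare exact values with the full run.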

Now let's compare these centrality measures

// Write all centrality metrics
CALL gds.degree.write('twitter', {
    writeProperty: 'inDegree',
    orientation: 'REVERSE'
});

CALL gds.degree.write('twitter', {
    writeProperty: 'outDegree',
    orientation: 'NATURAL'
});

// Compare top users across metrics
MATCH (u:User)
WHERE u.pagerank IS NOT NULL
RETURN 
    u.username,
    u.name,
    u.inDegree as followers,
    u.outDegree as following,
    round(u.pagerank * 1000) / 1000 as pagerank
ORDER BY u.pagerank DESC
LIMIT 20

Community detection

As we saw in the course, community detection is about finding groups of users who follow each other more densely than they follow the rest of the network.

// Run Louvain
CALL gds.louvain.write('twitter', {
    writeProperty: 'community',
    includeIntermediateCommunities: false
})
YIELD communityCount, modularity, ranLevels
RETURN communityCount, modularity, ranLevels

// See community sizes
MATCH (u:User)
WHERE u.community IS NOT NULL
RETURN u.community as communityId, 
       count(u) as size,
       collect(u.username)[0..5] as sampleUsers
ORDER BY size DESC
LIMIT 10

We use the Louvain algorithm for this; for more information, see the wiki page about it.
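
To get a feel for what each community is about, one option is to combine the community property with the pagerank property written earlier. This is a sketch that assumes both write steps have already been run; the choice of three names per community is arbitrary.

```cypher
// For the largest communities, list the members with the highest PageRank
MATCH (u:User)
WHERE u.community IS NOT NULL AND u.pagerank IS NOT NULL
WITH u
ORDER BY u.pagerank DESC
WITH u.community AS communityId, collect(u.username) AS members
RETURN communityId,
       size(members) AS size,
       members[0..3] AS topByPagerank
ORDER BY size DESC
LIMIT 10
```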

Path finding

So yeah, we are talking about graphs, so it's practically an obligation to play with paths 😅

// Find ALL shortest paths between two specific users
MATCH (start:User {username: 'GoogleAI'}), (end:User {username: 'elonmusk'})
MATCH paths = allShortestPaths((start)-[:FOLLOWS*]-(end))
RETURN [node IN nodes(paths) | node.username] AS path
LIMIT 5

allShortestPaths works directly on the transactional store, so no projection is needed. Note that the pattern above has no arrow, so it ignores the direction of FOLLOWS and answers "how are these two users connected?". Write -[:FOLLOWS*]-> instead when you want the literal follower → followee routes (e.g., "can information flow from GoogleAI to Elon via follower chains?"). For very large graphs, though, the GDS library gives you more control, performance, and additional metrics like path cost.
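
The pattern in the query above matches FOLLOWS without an arrow. A directed variant, which only follows follower → followee chains, could be sketched as below; it may return no rows if no such chain exists between the two users.

```cypher
// Directed variant: only traverse FOLLOWS from follower to followee
MATCH (start:User {username: 'GoogleAI'}), (end:User {username: 'elonmusk'})
MATCH paths = allShortestPaths((start)-[:FOLLOWS*]->(end))
RETURN [node IN nodes(paths) | node.username] AS path
LIMIT 5
```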

We can also use the GDS plugin, but we must create another projection, undirected this time, with:

// Create an undirected projection (if you haven't already)
CALL gds.graph.drop('twitter_undirected', false);

CALL gds.graph.project(
    'twitter_undirected',
    'User',
    {
        FOLLOWS: {
            orientation: 'UNDIRECTED'
        }
    }
);

We build this undirected projection because gds.shortestPath.* follows relationships in the orientation stored in the projection. With UNDIRECTED, every FOLLOWS relationship can be traversed both ways, which lets us answer "who is connected to whom?" regardless of who followed whom. If you leave the projection directed, Dijkstra only explores in the NATURAL orientation, so one-way follows could block otherwise valid paths... which is not what we want 😅

And then you can run:

// GDS returns one shortest path between the same two users
MATCH (start:User {username: 'GoogleAI'}), (end:User {username: 'elonmusk'})
CALL gds.shortestPath.dijkstra.stream('twitter_undirected', {
    sourceNode: start,
    targetNode: end
})
YIELD nodeIds, totalCost
RETURN [nodeId IN nodeIds | gds.util.asNode(nodeId).username] AS path,
       totalCost,
       size(nodeIds) - 1 as pathLength

In Twitter terms, we can say that:

  • Directed: A follows B, B follows C, C follows A (rare!)
  • Undirected: A and B are connected, B and C are connected, A and C are connected (common!)

You can list all your GDS projections with the command: CALL gds.graph.list()

Find triangles

Triangles are one of the most important patterns in social network analysis. As we saw in the course, triangles often indicate real friendships (not random follows), shared interests (they follow the same topics), and actual communities (not just loose connections) in the network.

Users with few triangles are more isolated or peripheral in the network - essentially, triangles measure how "clustered" and cohesive your social connections are.

Let's count the number of triangles per user:

// Use the undirected projection
CALL gds.triangleCount.stream('twitter_undirected')
YIELD nodeId, triangleCount
WITH gds.util.asNode(nodeId) as user, triangleCount
ORDER BY triangleCount DESC
LIMIT 20
RETURN user.username, user.name, triangleCount
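
Raw triangle counts favor users with many connections. To measure how "clustered" a user's neighborhood is relative to its size, GDS provides the local clustering coefficient, which also runs on the undirected projection. A minimal sketch:

```cypher
// Share of a user's neighbor pairs that are themselves connected (0 to 1)
CALL gds.localClusteringCoefficient.stream('twitter_undirected')
YIELD nodeId, localClusteringCoefficient
WITH gds.util.asNode(nodeId) AS user, localClusteringCoefficient
ORDER BY localClusteringCoefficient DESC
LIMIT 20
RETURN user.username, user.name, localClusteringCoefficient
```

Keep in mind that a user with only two connected neighbors already scores 1.0, so it helps to read this metric alongside the triangle count.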

Find hops

A "hop" is one step along a relationship in a graph - 1 hop means directly connected, 2 hops means connected through one intermediary node (friend-of-friend), 3 hops means two intermediaries, and so on.

Hops measure the distance between nodes in a network, helping you understand how closely connected things are and enabling queries like "find everyone within X degrees of separation" for recommendations (as in the LinkedIn or Facebook apps), influence analysis, or discovering communities 🧙

// Find users within 2 hops of a specific user
MATCH path = (start:User {username: 'GoogleAI'})-[:FOLLOWS*1..2]->(end:User)
WHERE start <> end
// Keep only the shortest distance, so a user reachable in 1 and 2 hops appears once
WITH end, min(length(path)) AS distance
RETURN end.username, end.name, distance, end.pagerank
ORDER BY distance, end.pagerank DESC
LIMIT 20
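
The two projections created in this walkthrough live in memory until they are dropped or the database restarts. Once you are done with the GDS algorithms, you can release them with the same gds.graph.drop call used earlier; the false argument means no error is raised if a projection does not exist.

```cypher
// Free the memory held by both projections
CALL gds.graph.drop('twitter', false) YIELD graphName;
CALL gds.graph.drop('twitter_undirected', false) YIELD graphName;
```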

That wraps up this article. As you might guess we clearly didn’t explore every plugin method, but you now have a good understanding of the essentials 🤓

Now let’s dive into the code and answer the questions below 🤗


Some questions to practice your ninjutsu 🥷

  1. Find five random user nodes
  2. Find five random FOLLOWS relationships
  3. Find the text property of three random Tweet nodes
  4. Generate a Cypher statement to visualize sample RETWEETS relationships
  5. Why use MERGE and not CREATE?
  6. Calculate the ratio of missing values for the createdAt node property of the Tweet nodes
  7. Count the number of relationships by their type. To count the number of relationships grouped by their type, you can start by describing a relationship pattern
  8. Compare the text of an original tweet and its retweet
  9. Calculate the distribution of tweets grouped by year created
  10. Use the MATCH clause in combination with the WHERE clause to select all the tweets that were created in 2021
  11. Return the top four days with the highest count of created tweets
  12. Count the number of users who were mentioned but haven’t published a single tweet
  13. Find the top five users with the most distinct tweets retweeted
  14. Find the top five most mentioned users
  15. Find the 10 most followed users
  16. Find the top 10 users who follow the most people