Dec 15, 2022
3 min read

Parallel Dijkstra Algorithm - MPI & CUDA Implementation

Comparative study of serial vs parallel implementations of Dijkstra's shortest path algorithm using MPI and CUDA in Python.

Timeline and Details

Start date: October 2022
End date: December 2022
Associated with: Faculty of Informatics and Digital Technologies, University of Rijeka
Resources: MPI Implementation · CUDA Implementation

Overview

Comparative performance analysis of Dijkstra’s shortest path algorithm implemented in serial, parallel (MPI), and GPU-accelerated (CUDA) versions using Python. The study demonstrates a significant speedup from parallelization on a test graph with 2000 nodes and 8000 edges.

Note: Developed in collaboration with a fellow student. Collaborator details available on the GitHub repository.

Key Results

Performance comparison on a graph with 2000 nodes and 8000 edges:

  • Serial execution: 0.284 seconds
  • Parallel MPI execution: 2.30e-05 seconds
  • Speedup: ~12,350x with the parallel implementation (the ratio of the two timings above)

Implementation Details

Serial Implementation (see the sketch after this list):

  • Classic Dijkstra algorithm with adjacency list representation
  • Iterative approach visiting nodes sequentially
  • Random graph generation for testing
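A minimal sketch of the serial version, assuming a heap-based Dijkstra over an adjacency list; the helper names (make_graph, dijkstra) are illustrative, not taken from the repository:

```python
# Minimal serial sketch (illustrative, not the repository's exact code):
# heap-based Dijkstra over a randomly generated adjacency list.
import heapq
import random

def make_graph(n, m, max_w=10):
    """Random directed graph as an adjacency list: node -> [(neighbor, weight)]."""
    adj = {v: [] for v in range(n)}
    for _ in range(m):
        u, v = random.randrange(n), random.randrange(n)
        adj[u].append((v, random.randint(1, max_w)))
    return adj

def dijkstra(adj, source):
    dist = {v: float("inf") for v in adj}
    dist[source] = 0
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry; u was already finalized
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist

if __name__ == "__main__":
    adj = make_graph(2000, 8000)
    dist = dijkstra(adj, 0)
    print(sorted(dist.items())[:5])
```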

MPI Parallel Implementation (see the sketch after this list):

  • Distributed-memory parallelization using mpi4py
  • Weight matrix partitioned across multiple processes
  • Process 0 generates the graph and distributes it to the workers
  • Each process computes local shortest distances
  • Results are gathered and combined at the root process
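A minimal mpi4py sketch of this scheme, assuming a dense weight matrix split into column blocks, one per process; the graph generation and variable names are illustrative, not the repository's exact code:

```python
# Column-partitioned parallel Dijkstra sketch (assumptions, not the repo's code).
import numpy as np
from mpi4py import MPI

N = 2000            # vertices (matches the benchmark graph)
INF = np.inf
SOURCE = 0

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
assert N % size == 0, "sketch assumes N divides evenly across processes"
chunk = N // size

if rank == 0:
    # Root generates a random dense weight matrix; INF marks missing edges.
    rng = np.random.default_rng(42)
    W = rng.integers(1, 10, size=(N, N)).astype(float)
    W[rng.random((N, N)) > 8000 / (N * N)] = INF   # keep roughly 8000 edges
    np.fill_diagonal(W, 0.0)
    blocks = np.hsplit(W, size)                    # one column block per process
else:
    blocks = None

local_W = comm.scatter(blocks, root=0)             # shape (N, chunk)
lo = rank * chunk                                  # first global vertex owned here

dist = np.full(chunk, INF)
visited = np.zeros(chunk, dtype=bool)
if lo <= SOURCE < lo + chunk:
    dist[SOURCE - lo] = 0.0

for _ in range(N):
    # Closest unvisited vertex owned by this process ...
    masked = np.where(visited, INF, dist)
    li = int(np.argmin(masked))
    # ... then the closest vertex globally, across all processes.
    best, u = min(comm.allgather((float(masked[li]), lo + li)))
    if best == INF:
        break                                      # remaining vertices unreachable
    if lo <= u < lo + chunk:
        visited[u - lo] = True
    # Relax edges u -> v for the vertices v in this process's column block.
    dist = np.minimum(dist, best + local_W[u])

result = comm.gather(dist, root=0)
if rank == 0:
    print(np.concatenate(result)[:10])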

CUDA GPU Implementation (see the sketch after this list):

  • Custom CUDA kernel for parallel distance computation
  • Thread-based parallelization (1024 threads per block)
  • GPU memory allocation for distance arrays and adjacency matrix
  • Iterative kernel execution for graph relaxation
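A minimal PyCUDA sketch of the relaxation step; the kernel and host loop below are assumptions based on the description above, not the repository's exact code. Each iteration picks the closest unvisited vertex on the host, then launches the kernel so that one thread relaxes each outgoing edge:

```python
# GPU relaxation sketch (illustrative assumptions, not the repo's code).
import numpy as np
import pycuda.autoinit                    # initializes the CUDA context
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

N = 2000
INF = np.float32(1e9)                     # sentinel for "no edge"

mod = SourceModule("""
__global__ void relax(const float *W, float *dist, int u, float du, int n)
{
    // One thread per vertex v: relax the edge u -> v.
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v < n) {
        float cand = du + W[u * n + v];
        if (cand < dist[v]) dist[v] = cand;
    }
}
""")
relax = mod.get_function("relax")

# Random dense weight matrix on the host.
rng = np.random.default_rng(42)
W = rng.integers(1, 10, size=(N, N)).astype(np.float32)
W[rng.random((N, N)) > 8000 / (N * N)] = INF
np.fill_diagonal(W, 0)

dist = np.full(N, INF, dtype=np.float32)
dist[0] = 0.0
visited = np.zeros(N, dtype=bool)

# Allocate GPU memory for the adjacency matrix and distance array.
W_gpu = cuda.mem_alloc(W.nbytes)
dist_gpu = cuda.mem_alloc(dist.nbytes)
cuda.memcpy_htod(W_gpu, W)
cuda.memcpy_htod(dist_gpu, dist)

threads = 1024                            # threads per block, as described above
blocks = (N + threads - 1) // threads

for _ in range(N):
    # Select the closest unvisited vertex on the host, relax on the GPU.
    cuda.memcpy_dtoh(dist, dist_gpu)
    masked = np.where(visited, INF, dist)
    u = int(np.argmin(masked))
    if masked[u] >= INF:
        break
    visited[u] = True
    relax(W_gpu, dist_gpu, np.int32(u), np.float32(dist[u]), np.int32(N),
          block=(threads, 1, 1), grid=(blocks, 1))

cuda.memcpy_dtoh(dist, dist_gpu)
print(dist[:10])
```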

Technical Stack

Language: Python 3
Parallelization: mpi4py (Message Passing Interface), PyCUDA
Libraries: NumPy, timeit, random
Execution: Multi-processor systems, NVIDIA GPU
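For reference, timings like the ones under Key Results could be collected with timeit along these lines, reusing the illustrative make_graph and dijkstra helpers from the serial sketch:

```python
# Timing sketch; depends on the make_graph/dijkstra helpers defined in the
# serial example above. Absolute numbers vary with hardware and random seed.
import timeit

adj = make_graph(2000, 8000)
runs = 10
elapsed = timeit.timeit(lambda: dijkstra(adj, 0), number=runs) / runs
print(f"Serial execution: {elapsed:.3f} seconds")
```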

Key Findings

Advantages of Parallelization:

  • Significant speed improvement for large datasets
  • Better resource utilization across multiple processors
  • Scalability for complex graph problems

Limitations:

  • Overhead from process/thread creation, communication, and synchronization
  • Not beneficial for small graphs, where the serial version is faster because of this overhead
  • Increased implementation complexity and harder debugging

Conclusion: Parallelization is most beneficial for processing large amounts of data where computational gains outweigh communication overhead.

Skills Demonstrated

Python programming, parallel algorithm design, MPI message passing, CUDA GPU programming, performance benchmarking, distributed systems, algorithm optimization, heterogeneous computing