Seamless Cloud Messaging: Integrating Apache Pulsar with Google Cloud Platform

Apache Pulsar is an all-in-one messaging and streaming platform. Messages can be consumed and acknowledged individually or consumed as streams. Its layered architecture allows rapid scaling across hundreds of nodes, without data reshuffling.

Its features include multi-tenancy with resource separation and access control, geo-replication across regions, tiered storage and support for six official client languages. Apache Pulsar supports up to one million unique topics and is designed to simplify your application architecture.

Pulsar is a Top 10 Apache Software Foundation project and has a vibrant and passionate community and user base spanning small companies and large enterprises.

Features:

  • Rapid Horizontal Scalability
  • Low-latency messaging and streaming
  • Automatic Load Balancing
  • Multi-tenancy as a first-class citizen
  • Serverless Functions
  • Official 3rd party integrations
  • Supports up to 1M topics
  • Seamless Geo-Replication

How Apache Pulsar Works


Producer & Consumer

A Pulsar client application creates producers and consumers. A producer writes messages to a topic. A consumer reads messages from a topic and acknowledges either individual messages or all messages up to a specific one (a cumulative acknowledgment).
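
The two acknowledgment styles can be sketched with the Python client (the pulsar-client package). This is a minimal sketch, not a reference implementation: it assumes a broker at pulsar://localhost:6650, and the topic `my-topic`, the subscription `demo-sub`, and the helper names `drain` and `demo` are placeholders chosen here for illustration. The `pulsar` import is deferred into `demo()` so the helper can be read and exercised without a running broker.

```python
def drain(consumer, max_messages, cumulative=False, timeout_millis=2000):
    """Receive up to max_messages and acknowledge them.

    cumulative=False acknowledges every message individually.
    cumulative=True acknowledges once, on the last message received,
    which covers every earlier message on the subscription.
    """
    payloads = []
    last = None
    for _ in range(max_messages):
        try:
            msg = consumer.receive(timeout_millis=timeout_millis)
        except Exception:
            # The client raises when nothing arrives before the timeout.
            break
        payloads.append(msg.data())
        last = msg
        if not cumulative:
            consumer.acknowledge(msg)
    if cumulative and last is not None:
        consumer.acknowledge_cumulative(last)
    return payloads


def demo(service_url="pulsar://localhost:6650", topic="my-topic"):
    """Send three messages and read them back (needs a running broker)."""
    import pulsar  # provided by the pulsar-client package

    client = pulsar.Client(service_url)
    try:
        # Subscribe first so the subscription exists before messages are sent.
        consumer = client.subscribe(topic, subscription_name="demo-sub")
        producer = client.create_producer(topic)
        for i in range(3):
            producer.send(f"message-{i}".encode("utf-8"))
        return drain(consumer, 3, cumulative=True)
    finally:
        client.close()
```

Cumulative acknowledgment suits a single consumer that processes messages in order; individual acknowledgment is the safer choice when messages may be processed out of order.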

Apache ZooKeeper

Apache ZooKeeper is used by Pulsar and BookKeeper to store important information that helps coordinate between different servers. This includes details like the list of ledgers for each topic, segments for each ledger, and which broker is responsible for different topic bundles. Essentially, ZooKeeper is a group of servers (usually three) that work together to ensure high availability and reliability.

Pulsar Brokers

In Pulsar, topics (or partitions) are distributed among different brokers. A broker gets messages for a topic and adds them to a virtual file called a ledger, which is stored in the BookKeeper cluster. Brokers mainly read messages from a cache or directly from BookKeeper and send them to consumers. They also handle message acknowledgments and save them to the BookKeeper cluster. Importantly, brokers are stateless, meaning they don’t store any data on a disk.

Objectives of the Integration:

Real-time Data Flow:
Objective: Demonstrate a seamless and efficient real-time data flow between Apache Pulsar and Google Pub/Sub.

Cross-Platform Messaging:
Objective: Enable robust cross-platform messaging for hybrid cloud use cases, facilitating smooth interaction between multiple environments.

Seamless Communication:
Objective: Showcase seamless communication capabilities between on-premise systems and GCP.

Architecture Description:

  • This Proof of Concept (POC) demonstrates a hybrid cloud architecture that utilizes Apache Pulsar running on a Google Cloud Platform (GCP) Virtual Machine (VM), alongside GCP Pub/Sub for efficient messaging.
  • The architecture is designed to facilitate seamless communication between on-premises systems and cloud services, ensuring high availability and scalability.

Components:

  • Pulsar Client
  • Pub/Sub Topics
  • Python Integration

1. Create a VM and Install Apache Pulsar on GCP

Create a VM Instance:

  • Go to the GCP Console.
  • Navigate to Compute Engine > VM instances.
  • Click on Create Instance.
  • Choose the machine type, region, and operating system (Ubuntu is recommended).
  • Set up firewall rules to allow TCP traffic on the default Pulsar port (6650).
  • Click Create.

Install Apache Pulsar:

  • SSH into your VM instance.
  • Update the package list and install required dependencies:
    sudo apt-get update
    sudo apt-get install -y openjdk-11-jdk wget
  • Download and extract Apache Pulsar:
    wget https://archive.apache.org/dist/pulsar/pulsar-2.10.2/apache-pulsar-2.10.2-bin.tar.gz
    tar -xzf apache-pulsar-2.10.2-bin.tar.gz
    cd apache-pulsar-2.10.2
  • Start the Pulsar standalone server:
    bin/pulsar standalone
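
Because the standalone server runs in the foreground, it can help to confirm from a second terminal that the broker is accepting connections before continuing. The following is a small sketch using only the Python standard library. It assumes the broker listens on the default port 6650 mentioned above; `is_port_open` and `wait_for_port` are helper names chosen here for illustration.

```python
import socket
import time


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_port(host, port, attempts=10, delay=3.0):
    """Poll host:port until it accepts connections or attempts run out."""
    for _ in range(attempts):
        if is_port_open(host, port):
            return True
        time.sleep(delay)
    return False
```

On the VM, `wait_for_port("localhost", 6650)` should return True once the broker has started. Running the same check from another machine against the VM's external IP also exercises the firewall rule created earlier.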

2. Create Pub/Sub Topics in GCP

Create Pub/Sub Topics

  • In the Google Cloud Console, navigate to Pub/Sub.
  • Click on Create Topic.
  • Name your topic (e.g., pulsar-test-topic).
  • Click Create.

Create a Service Account:

  • Navigate to IAM & Admin > Service Accounts.
  • Click Create Service Account.
  • Give the account a name and grant it the Pub/Sub Publisher role.
  • Download the JSON key file for the service account and copy it to the VM (e.g., /home/username/gcp-key.json), where the Python script will read it.

Install Dependencies

Install Python and Required Libraries on GCE

  • Install Python: Most Ubuntu instances come with Python pre-installed. Verify it:
    python3 --version
  • Install Python Dependencies: Install the necessary Python libraries for Pulsar and Pub/Sub:
    sudo apt install python3-pip -y
    pip3 install pulsar-client google-cloud-pubsub
  • Set Up Google Cloud Pub/Sub Credentials: Set the environment variable to use the service account key:
    export GOOGLE_APPLICATION_CREDENTIALS="/home/username/gcp-key.json"
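
A quick sanity check of the credentials setup can save debugging time later. The sketch below uses only the standard library: it reads the file named by GOOGLE_APPLICATION_CREDENTIALS and confirms it looks like a service account key. The function name `load_service_account_info` is an illustrative choice, and the check inspects only the `type`, `client_email`, and `project_id` fields of the key file; it does not contact Google Cloud or verify permissions.

```python
import json
import os


def load_service_account_info(env=None):
    """Read the key file named by GOOGLE_APPLICATION_CREDENTIALS.

    Returns (client_email, project_id) from the key file, or raises
    RuntimeError describing what is wrong.
    """
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    if not os.path.isfile(path):
        raise RuntimeError(f"key file not found: {path}")
    with open(path, encoding="utf-8") as fh:
        info = json.load(fh)
    if info.get("type") != "service_account":
        raise RuntimeError("key file is not a service account key")
    return info.get("client_email"), info.get("project_id")
```

Note that `export` applies only to the current shell session, so the variable has to be set again in every new terminal that runs the integration script.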

Create the Python Script for Pulsar to Pub/Sub

  • Create the Python Script: On the VM, create a Python script (e.g., pulsar_to_pubsub.py) that will read messages from Pulsar and publish them to Google Cloud Pub/Sub.
nano pulsar_to_pubsub.py
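
One possible shape for pulsar_to_pubsub.py is sketched below. It is an illustration under stated assumptions rather than the exact script used in this POC: the Pulsar topic and Pub/Sub topic names are taken from the steps in this article, while the subscription name `pubsub-bridge`, the `GCP_PROJECT_ID` environment variable, and the helper name `forward_one` are hypothetical choices made for this example. A message is acknowledged in Pulsar only after Pub/Sub has accepted it, so a failed publish leads to redelivery instead of data loss.

```python
"""pulsar_to_pubsub.py: forward messages from a Pulsar topic to Pub/Sub."""
import os

PULSAR_URL = "pulsar://localhost:6650"
PULSAR_TOPIC = "persistent://public/default/my-pulsar-topic"
SUBSCRIPTION = "pubsub-bridge"  # hypothetical subscription name
PUBSUB_TOPIC = "pulsar-test-topic"


def forward_one(consumer, publisher, topic_path, timeout_millis=1000):
    """Forward a single message; return its payload, or None on timeout."""
    try:
        msg = consumer.receive(timeout_millis=timeout_millis)
    except Exception:
        # The client raises when nothing arrives before the timeout.
        return None
    try:
        # publish() returns a future; result() blocks until Pub/Sub accepts it.
        publisher.publish(topic_path, msg.data()).result()
    except Exception:
        consumer.negative_acknowledge(msg)  # ask Pulsar to redeliver later
        raise
    consumer.acknowledge(msg)
    return msg.data()


def main(project_id):
    import pulsar  # pulsar-client
    from google.cloud import pubsub_v1  # google-cloud-pubsub

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, PUBSUB_TOPIC)
    client = pulsar.Client(PULSAR_URL)
    consumer = client.subscribe(PULSAR_TOPIC, subscription_name=SUBSCRIPTION)
    print(f"Forwarding {PULSAR_TOPIC} -> {topic_path} (Ctrl+C to stop)")
    try:
        while True:
            payload = forward_one(consumer, publisher, topic_path)
            if payload is not None:
                print(f"Forwarded: {payload!r}")
    except KeyboardInterrupt:
        pass
    finally:
        client.close()


if __name__ == "__main__":
    project = os.environ.get("GCP_PROJECT_ID")  # hypothetical variable name
    if project:
        main(project)
    else:
        print("Set GCP_PROJECT_ID to your project ID before running.")
```

With this sketch, the project ID has to be exported (GCP_PROJECT_ID in this example) in the same shell as GOOGLE_APPLICATION_CREDENTIALS before the script is started.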

Test the Integration

  • Run the Python Script: Start the Python script to begin consuming messages from Pulsar and publishing them to Google Cloud Pub/Sub:
python3 pulsar_to_pubsub.py
  • Publish a Message to Pulsar: Open another terminal window or tab and SSH into the VM again. Now publish a test message to the Pulsar topic:
    cd apache-pulsar-2.10.2
    bin/pulsar-client produce persistent://public/default/my-pulsar-topic --messages "Hello everyone"

  • Pulsar Topic (Producer-Consumer Validation): To confirm that messages reach the Pulsar topic independently of the Python script, attach a second consumer with its own subscription. A new subscription starts at the latest position by default, so it receives only messages published after it is created.
  1. Open another terminal or SSH session into your GCE instance where Pulsar is running.
  2. Run the Pulsar consumer to listen to the topic:
    cd apache-pulsar-2.10.2
    bin/pulsar-client consume "persistent://public/default/my-pulsar-topic" --subscription-name my-sub --num-messages 1
  3. From the earlier terminal, publish another test message. The consumer prints it and exits.
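
The validation above covers the Pulsar side only. To check the Pub/Sub side as well, messages can be pulled from a subscription on the Pub/Sub topic. The sketch below makes several assumptions that go beyond the setup in this article: a pull subscription (named `pulsar-test-sub` here, a hypothetical name) was attached to the topic before the test message was published, and the credentials in use also carry the Pub/Sub Subscriber role, since the Publisher role alone does not permit pulling. The helper names `pull_once` and `check_pubsub` are illustrative.

```python
def pull_once(subscriber, subscription_path, max_messages=10, timeout=10.0):
    """Pull up to max_messages, acknowledge them, and return their payloads."""
    response = subscriber.pull(
        request={"subscription": subscription_path, "max_messages": max_messages},
        timeout=timeout,
    )
    received = list(response.received_messages)
    if received:
        subscriber.acknowledge(
            request={
                "subscription": subscription_path,
                "ack_ids": [r.ack_id for r in received],
            }
        )
    return [r.message.data for r in received]


def check_pubsub(project_id, subscription_id="pulsar-test-sub"):
    """Pull once from the (assumed) subscription using real credentials."""
    from google.cloud import pubsub_v1  # google-cloud-pubsub

    subscriber = pubsub_v1.SubscriberClient()
    path = subscriber.subscription_path(project_id, subscription_id)
    with subscriber:
        return pull_once(subscriber, path)
```

If the bridge is working, the payload published to Pulsar should appear in the returned list. When no message is available, the pull may return an empty list or end with a deadline error, depending on timing.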

Pulsar Logs: For more detail, check the logs in the logs/ directory of your Pulsar installation:

    cd apache-pulsar-2.10.2/logs
    ls

Conclusion:

This integration demonstrates how Pulsar can enhance distributed messaging within cloud environments like GCP, ensuring a robust solution for real-time communication.