Kafka

How to integrate with Spark?

How to integrate with Spark?

 

In the Kafka Spark Streaming Integration, Kafka acts as a central hub to input data streams which are processed, and then Spark streaming publishes the results using Spark engine into another Kafka topic or store in the form of dashboards, HDFS, and databases.

 

IMG_256

 

Two approaches involved in this process are -

 

  • Receiver-based approach 

 

Using the Kafka consumer API, the Receiver is implemented to receive the data which is then stored in Spark executors. Spark stream then processes the data synchronously to ensure zero data loss. 



 

  • Direct approach  

The direct approach is receiver-less and works by querying Kafka for offsets in each topic within its designation partition rather than using receivers. This approach makes it easier to read data in parallel and offers zero data loss by eliminating the need for write-ahead logs. 

It was introduced in Spark 1.4 for the Python API and Spark 1.3 for the Java and Scala API. 

 

Here are Kafka-Spark APIs -

 

  • SparkConf API

It is used to set configuration for key-value pairs using the SparkConf class. 

 

  • StreamingContext API

It denotes the connection to a Spark cluster and can be used to create broadcast variables, accumulators, and RDDs on it. Its signature is -

 

public StreamingContext(String master, String appName, Duration batchDuration, 

   String sparkHome, scala.collection.Seq<String> jars, 

   scala.collection.Map<String,String> environment)

 

where app name indicates the name of your job, the master is the cluster URL to connect to and batchDuration is the time interval required to divide streams of data into batches. 

 

  • KafkaUtils API

This API connects Kafka cluster to Spark streaming using createStream signature mentioned below -

 

public static ReceiverInputDStream<scala.Tuple2<String,String>> createStream(

   StreamingContext ssc, String zkQuorum, String groupId,

   scala.collection.immutable.Map<String,Object> topics, StorageLevel storageLevel)

 

Where, ssc is a StreamingContext object, zkQuorum is Zookeeper quorum, groupId is the group id for the consumer, topics means a map of topics to consume and storageLevel indicates the level used for storing received objects.

 

Top course recommendations for you

    Python Fundamentals for Beginners
    4 hrs
    Beginner
    711.2K+ Learners
    4.55  (39811)
    Front End Development - HTML
    2 hrs
    Beginner
    497.5K+ Learners
    4.51  (37972)
    Front End Development - CSS
    2 hrs
    Beginner
    182.5K+ Learners
    4.51  (13448)
    Blockchain Basics
    3 hrs
    Beginner
    82.8K+ Learners
    4.55  (4380)
    Data Structures in C
    2 hrs
    Beginner
    177.2K+ Learners
    4.39  (11454)
    Excel for Beginners
    5 hrs
    Beginner
    1.2M+ Learners
    4.49  (62680)
    My SQL Basics
    5 hrs
    Beginner
    261.4K+ Learners
    4.46  (13362)
    Android Application Development
    2 hrs
    Beginner
    158.7K+ Learners
    4.42  (6461)
    OOPs in Java
    2 hrs
    Beginner
    113.5K+ Learners
    4.44  (6273)
    Building Games using JavaScript
    2 hrs
    Beginner
    33K+ Learners
    4.49  (554)