By Anshul Ghogre
Apache NiFi is designed to automate the flow of data between software systems. It is based on the “NiagaraFiles” software previously developed by the NSA, and it supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.
Apache Kafka is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.
Apache NiFi can work as both a producer and a consumer for Kafka. Either role is suitable; the choice depends on the requirements and the scenario.
Why NiFi and Kafka together?
Integrating Kafka with NiFi lets us avoid writing producer and consumer code by hand. The complete pipeline is easy to handle and understand on one screen, and it is easy to scale.
NiFi can sit on either side of Kafka, as a producer or as a consumer, and the role it plays depends on the requirements and the scenario.
NiFi as a Producer
The most efficient way to use NiFi here is as a Kafka producer that takes data from any source as input and forwards it to the Kafka broker. NiFi replaces the hand-written producer and delivers the data to the appropriate Kafka topic. The major perk is being able to bring data into Kafka without writing any producer code, simply by dragging and dropping a series of processors in NiFi (including PublishKafka), and being able to monitor and control the pipeline visually.
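For contrast, here is a minimal sketch of the kind of hand-written producer loop that a PublishKafka processor stands in for. It assumes a client object exposing send(topic, value) and flush(), as the kafka-python KafkaProducer does; the RecordingProducer stub, the publish_lines helper, and the topic name are hypothetical and exist only so the sketch runs without a broker.

```python
from typing import Iterable, List, Tuple


class RecordingProducer:
    """Hypothetical stand-in for a Kafka client; it only records calls."""

    def __init__(self) -> None:
        self.sent: List[Tuple[str, bytes]] = []
        self.flushed = False

    def send(self, topic: str, value: bytes) -> None:
        self.sent.append((topic, value))

    def flush(self) -> None:
        self.flushed = True


def publish_lines(producer, topic: str, lines: Iterable[str]) -> int:
    """Send each line as one message and return how many were sent."""
    count = 0
    for line in lines:
        producer.send(topic, line.encode("utf-8"))
        count += 1
    # Make sure buffered messages are pushed out before returning.
    producer.flush()
    return count


producer = RecordingProducer()
sent = publish_lines(producer, "events", ["order-1", "order-2"])
print(sent, producer.sent)
```

With NiFi, this loop, plus the error handling and retry logic a real producer needs, is covered by configuring the processor instead of writing code.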
NiFi as a Consumer
Some projects already have a pipeline that channels data into Kafka and introduce NiFi to their process later. In this case, NiFi can replace the Kafka consumer and handle all of the logic, for instance taking the data from Kafka and moving it onward. Here we avoid writing consumer code by dragging and dropping NiFi’s ConsumeKafka processor. For example, you could deliver data from Kafka to HDFS without writing any code by using the ConsumeKafka processor.
*Note: Bidirectional flow is also possible for more complex scenarios.
There are a few factors that can impact the performance of publishing and consuming in NiFi.
A couple of them are explained below. PublishKafka and ConsumeKafka both have a property called “Message Demarcator” (in other words, a separator or delimiter).
On the publishing end, the demarcator indicates that incoming flow files will have multiple messages in their content, with the given demarcator between them. In this case, PublishKafka will stream the content of the flow file, separate it into messages based on the demarcator, and publish each message individually. When the property is left blank, PublishKafka will send the content of the flow file as a single message.
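The publishing behaviour can be sketched in a few lines of Python. This is an illustration of the idea only, not NiFi's implementation: the function name split_into_messages is hypothetical, and dropping empty pieces (for example after a trailing demarcator) is an assumption of this sketch.

```python
from typing import List, Optional


def split_into_messages(content: bytes, demarcator: Optional[bytes]) -> List[bytes]:
    """Return the messages one flow file would be published as."""
    if not demarcator:
        # Property left blank: the whole content is a single message.
        return [content]
    # Assumption of this sketch: empty pieces are skipped.
    return [part for part in content.split(demarcator) if part]


flow_file = b"order-1\norder-2\norder-3\n"
print(len(split_into_messages(flow_file, b"\n")))  # messages with a newline demarcator
print(len(split_into_messages(flow_file, None)))   # messages with the property blank
```

One flow file with a newline demarcator becomes three Kafka messages, while the same flow file with the property blank becomes one.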
On the consuming end, the demarcator indicates that ConsumeKafka should produce a single flow file with the content containing all of the messages received from Kafka in a single poll, using the demarcator to separate them. When this property is left blank, ConsumeKafka will produce a flow file per message received.
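The consuming side is the mirror image and can be sketched the same way. Again, this is only an illustration under stated assumptions: build_flow_files is a hypothetical name, and the list passed in stands for the messages returned by a single poll.

```python
from typing import List, Optional


def build_flow_files(polled: List[bytes], demarcator: Optional[bytes]) -> List[bytes]:
    """Return the flow file contents produced from one poll's messages."""
    if not polled:
        return []
    if not demarcator:
        # Property left blank: one flow file per message.
        return list(polled)
    # Demarcator set: one flow file holding every message from the poll.
    return [demarcator.join(polled)]


poll = [b"order-1", b"order-2", b"order-3"]
print(build_flow_files(poll, b"\n"))  # a single flow file
print(build_flow_files(poll, None))   # one flow file per message
```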
Given that Kafka is tuned for smaller messages and NiFi is tuned for larger ones, these batching capabilities allow for the best of both worlds: Kafka works with small messages while NiFi works with large streams, which results in significantly improved performance. Publishing a single flow file with 1 million messages and streaming that to Kafka will be significantly faster than sending 1 million flow files to PublishKafka. The same holds on the consuming end, where writing a thousand consumed messages to a single flow file will produce higher throughput than writing a thousand flow files with one message each.