Saturday, 7 August 2021

Why Avro is the best choice for Kafka (Up to 91% compression - more SAVING!!)

I will explain this with an example:

Suppose you have 30 TB of data in JSON or any other text-based format.

If you store this data directly in a Kafka topic, you will need to pay according to the space you are consuming (if you are using the cloud).

My advice is to use the Avro format in the Kafka topic to compress the data.

In practice, you can save up to about 91% of the space by using Avro (with a compression codec enabled).

Example:

Data: 30 TB

After Avro conversion: ~2.7 TB (a saving of up to 91%)

For this, you will need to convert the JSON data to Avro, which is a bit complex but doable, as sketched below.
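The sketch below shows one way to do that conversion in Python with the fastavro library; the file names, record schema, and field names are assumptions for illustration only, not part of the original example.

import json
from fastavro import writer, parse_schema

# Hypothetical schema - replace it with the real structure of your JSON records.
schema = parse_schema({
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "event_id", "type": "long"},
        {"name": "payload", "type": "string"}
    ]
})

# Read newline-delimited JSON and rewrite it as compressed Avro.
with open("events.json") as src, open("events.avro", "wb") as dst:
    records = (json.loads(line) for line in src)
    # The 'deflate' codec compresses each Avro block; most of the space
    # saving over raw JSON comes from the binary encoding plus this codec.
    writer(dst, schema, records, codec="deflate")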





Tuesday, 27 July 2021

Avro Schema Evolution

When you work with data, schema evolution plays a very crucial role.


For instance, while working on Kafka with Avro data, you might have observed many exceptions on the producer/consumer clients with respect to schema compatibility. That is why it's important to understand the concept of schema evolution.

What is Avro?
~ In order to transfer data over the network, the data needs to be serialized (converted into a binary format), and Avro is one such system that helps you achieve this.
~ Avro depends on a schema, and you can think of it as a JSON-like format with the schema attached to it. That is the main reason Avro is preferred over JSON: you can't enforce a schema with plain JSON.
~ Avro is fast & compact.
~ Data is fully typed and named.

A simple Avro schema can look like this -
{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "emp_name", "type": "string"},
    {"name": "emp_id", "type": "long"},
    {"name": "department", "type": "string"}
  ]
}

Schema Evolution -
To evolve an Avro schema, you need to keep a few important things in mind so that the changes stay compatible.

Backward Compatibility -
Your producer application is producing messages/data using an old schema and your consumer application can read the data using a newly evolved schema.

Let's use the above schema to understand it.
Suppose the Producer app created a record using the following schema -

{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "emp_name", "type": "string"},
    {"name": "emp_id", "type": "long"},
    {"name": "department", "type": "string"}
  ]
}

(so the record has emp_name, emp_id, and department)

Now, the Consumer app on the other side reads this record using a newly evolved schema that doesn't contain the field department.

{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "emp_name", "type": "string"},
    {"name": "emp_id", "type": "long"}
  ]
}

But the consumer is still able to read the record and the data would just have emp_name and emp_id (the department is silently ignored).
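A small illustration of this flow in Python with the fastavro library, using an in-memory buffer in place of a real topic (a sketch, not the only way to do it):

import io
from fastavro import writer, reader, parse_schema

old_schema = parse_schema({
    "type": "record", "name": "Employee",
    "fields": [
        {"name": "emp_name", "type": "string"},
        {"name": "emp_id", "type": "long"},
        {"name": "department", "type": "string"}
    ]
})

new_schema = parse_schema({
    "type": "record", "name": "Employee",
    "fields": [
        {"name": "emp_name", "type": "string"},
        {"name": "emp_id", "type": "long"}
    ]
})

# Producer side: write a record with the old (3-field) schema.
buf = io.BytesIO()
writer(buf, old_schema, [{"emp_name": "Asha", "emp_id": 1, "department": "IT"}])

# Consumer side: read it back with the new (2-field) reader schema.
buf.seek(0)
for record in reader(buf, reader_schema=new_schema):
    print(record)   # {'emp_name': 'Asha', 'emp_id': 1} - department is dropped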

Forward Compatibility -
Your producer app uses a new schema to write messages and your consumer app can read the messages using an old schema.

~ A schema can also be both backward and forward compatible at the same time; this is known as a fully compatible schema.

Some of the rules that I personally found useful for creating compatible schemas -
- You can safely add a field with a default value in the new schema.
Now suppose the producer writes using the old schema and the consumer uses this new schema: because the newly added field has a default value, we don't need to worry about the field missing from the producer's data - it simply gets its default value on the consumer side (see the sketch after this list).

- You can safely remove a field that has a default value in the new schema.
- You can't rename a field, but you can use aliases.
- You can't change a field's data type.
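Here is a hedged fastavro sketch of that first rule: an old producer schema, a new consumer schema with a defaulted field, and purely illustrative field names and values.

import io
from fastavro import writer, reader, parse_schema

old_schema = parse_schema({
    "type": "record", "name": "Employee",
    "fields": [
        {"name": "emp_name", "type": "string"},
        {"name": "emp_id", "type": "long"}
    ]
})

# Evolved schema: a new field with a default value.
new_schema = parse_schema({
    "type": "record", "name": "Employee",
    "fields": [
        {"name": "emp_name", "type": "string"},
        {"name": "emp_id", "type": "long"},
        {"name": "location", "type": "string", "default": "UNKNOWN"}
    ]
})

# Old producer: writes records without the new field.
buf = io.BytesIO()
writer(buf, old_schema, [{"emp_name": "Asha", "emp_id": 1}])

# New consumer: reads with the evolved schema; the missing field
# is filled in with its default value.
buf.seek(0)
for record in reader(buf, reader_schema=new_schema):
    print(record)   # {'emp_name': 'Asha', 'emp_id': 1, 'location': 'UNKNOWN'}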

Credit goes to Mayank Ahuja

Friday, 16 April 2021

Spark issue : Spark Out of Memory Issue

Out of Memory (OOM) is one of the most common types of errors you might face in Spark.

1. You can get this error at the Driver level, or

2. you can get this error at the Executor level.


1. Driver 

Collect Operation

If we have a large dataset and we try to collect it into the Driver, you will see a Driver OOM (i.e. Out of Memory error).

Broadcast Join

If we try to broadcast a large table for a broadcast join, then again you will see an OOM. A sketch of both situations is shown below.
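The PySpark sketch below is illustrative only; the paths, the join column name, and the threshold value are placeholder assumptions, not part of the original post.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("driver-oom-demo").getOrCreate()

big_df = spark.read.parquet("/data/big_table")        # placeholder path
small_df = spark.read.parquet("/data/small_lookup")   # placeholder path

# Risky: collect() pulls every row back into the driver and can OOM it.
# rows = big_df.collect()

# Safer: keep the data distributed and only bring back small results.
big_df.limit(20).show()

# Risky: force-broadcasting a big DataFrame can OOM the driver, because
# the broadcast table is first built on the driver.
# joined = big_df.join(broadcast(some_other_big_df), "key")

# Safer: broadcast only genuinely small tables and keep the automatic
# broadcast threshold modest (the default is around 10 MB).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)
joined = big_df.join(broadcast(small_df), "key")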

2. Executor 

High Concurrency 

If we assign too many cores to a single executor, more tasks run in it concurrently, so each task/partition gets a smaller share of the executor's memory; if those partitions are large, we get an OOM.

As a benchmark, use 3 to 5 cores per executor, not more than this (a small config sketch follows).
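A minimal PySpark session configuration showing where executor cores and memory are set; the numbers are placeholders to tune for your own cluster.

from pyspark.sql import SparkSession

# 3-5 cores per executor is a common starting point, so each concurrently
# running task keeps a reasonable share of executor memory.
spark = (SparkSession.builder
         .appName("executor-sizing-demo")
         .config("spark.executor.cores", "5")
         .config("spark.executor.memory", "10g")
         .getOrCreate())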

Large Partitions cause OOM

If you have very large partitions, then when an executor tries to process that data, chances are it will fail with an OOM.

As a benchmark, use an appropriate number of partitions so that each partition stays at a reasonable size, as sketched below.
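A minimal sketch of repartitioning plus the related shuffle-partition setting; the path and the partition counts are placeholder assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-demo").getOrCreate()

df = spark.read.parquet("/data/big_table")   # placeholder path

# Repartition so each partition stays a manageable size (very roughly
# 100-200 MB each) instead of a few huge partitions.
df = df.repartition(400)

# The number of partitions produced by Spark SQL shuffles is controlled
# separately (the default is 200):
spark.conf.set("spark.sql.shuffle.partitions", "400")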

Yarn Memory Overhead OOM

An executor container consists of YARN memory overhead + executor memory.

Normally, the YARN memory overhead (off-heap memory) is around 10% of the executor memory.

This overhead memory is used to store Spark's internal and off-heap objects. Sometimes these need more space, and the container is killed with an OOM.

If you get this error, try increasing the memory overhead parameter, as shown below.
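A minimal sketch of raising the overhead; the sizes are placeholders, and the property name assumes Spark 2.3+ (older releases used spark.yarn.executor.memoryOverhead).

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("memory-overhead-demo")
         .config("spark.executor.memory", "8g")
         # The default overhead is roughly max(384 MB, 10% of executor
         # memory); raise it if YARN kills containers for exceeding limits.
         .config("spark.executor.memoryOverhead", "2g")
         .getOrCreate())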


I hope you like this blog. 

Please subscribe and follow this blog for more updates on the Big Data journey.


Monday, 8 March 2021

Conversion of Kafka topics from JSON to Avro with KSQL

 Here’s a dummy topic, in JSON:

$ kafka-console-consumer --bootstrap-server localhost:9092 --from-beginning --topic mysql_users
{"uid":1,"name":"Cliff","locale":"en_US","address_city":"St Louis","elite":"P"}
{"uid":2,"name":"Nick","locale":"en_US","address_city":"Palo Alto","elite":"G"}

In KSQL declare the source stream, specifying the schema (here a subset of the full schema, just for brevity):

ksql> CREATE STREAM source (uid INT, name VARCHAR) WITH (KAFKA_TOPIC='mysql_users', VALUE_FORMAT='JSON');

 Message
----------------
 Stream created
----------------

Now create a derived stream, specifying the target serialization (Avro) and the target topic (the topic name is optional; without it, the topic will just take the name of the stream):

ksql> SET 'auto.offset.reset' = 'earliest';
Successfully changed local property 'auto.offset.reset' from 'null' to 'earliest'
ksql> CREATE STREAM target_avro WITH (VALUE_FORMAT='AVRO', KAFKA_TOPIC='mysql_users_avro') AS SELECT * FROM source;

 Message
----------------------------
 Stream created and running
----------------------------
ksql>

Check out the resulting Avro topic:

$ kafka-avro-console-consumer \
                   --bootstrap-server localhost:9092 \
                   --property schema.registry.url=http://localhost:8081 \
                   --topic mysql_users_avro --from-beginning
{"UID":{"int":1},"NAME":{"string":"Cliff"}}
{"UID":{"int":2},"NAME":{"string":"Nick"}}

Because the CREATE STREAM ... AS SELECT statement is a continuous query, any new records arriving on the source JSON topic will be automatically converted to Avro on the derived topic.

Friday, 4 December 2020

Azure updates #6 : Microsoft Azure is Multiple Clouds

Microsoft Azure is Multiple Clouds
How?
It consists of four different Microsoft Azure clouds - Public, US Gov, China, and Germany.
To get a listing of all the different clouds under the Microsoft Azure umbrella, you can run the following Azure CLI command:
az cloud list --output table
The command will output the names Microsoft gives to all of the Microsoft Azure clouds.
IsActive    Name               Profile
----------  -----------------  ---------
True        AzureCloud         latest
False       AzureChinaCloud    latest
False       AzureUSGovernment  latest
False       AzureGermanCloud   latest
  1. Public Azure Cloud - the one we are all aware of - 60+ regions as of today.
  2. Azure US Government Cloud - consists of 8 Azure regions.
  3. Azure German Cloud - a sovereign cloud that consists of 2 Azure regions located in Germany.
  4. Azure China Cloud - a sovereign cloud that consists of 4 Azure regions located in China.

Wednesday, 21 October 2020

Azure updates #5 : The number of announced Azure regions rises to 65

Continuing Azure’s big expansion into Europe, they just announced a data center being built in Austria!

 

https://techcrunch.com/2020/10/20/microsoft-azure-announces-its-first-region-in-austria/

 

This brings the number of Azure regions (announced) to 65. Good to see Europe getting extensive local cloud coverage.

Wednesday, 14 October 2020

Azure updates #4 - Azure Data Explorer now supports Isolated Compute SKUs - Published on 13th October 2020

What does that mean?

Azure Data Explorer provides support for isolated compute using SKU Standard_E64i_v3. 

Isolated compute virtual machines (VMs) enable customers to run their workloads in a hardware-isolated environment dedicated to a single customer.

Who can use this? 

Clusters deployed with isolated compute VMs are best suited for workloads that require a high degree of isolation for compliance and regulatory requirements.

Isolated compute support is available in the following regions:

West US 2

East US

South Central US

Note - isolated compute SKUs offer isolation to secure data without sacrificing flexibility in configuration.

Tuesday, 13 October 2020

Azure updates #3 - Row Level Security now generally available for Azure Data Explorer - Published on 12th October 2020

What does that mean?

You can now prevent specific users from viewing certain rows in a table, and you can also mask the data they see.

Where this can be used?

Projects and companies use this to implement the GDPR "right to object" article and anonymization.

What is the right to object?

You have the right to object to an organization processing (using) your personal data at any time.

What is Anonymization?

Anonymization is the process of masking data.

In such cases, we can restrict a user or group from accessing the data by enabling Row Level Security. This works in a similar way to Apache Ranger, where we restrict a user or group from accessing data by applying row-level filtering and masking policies.

Saturday, 10 October 2020

Azure updates #2 - Azure Files premium tier is now available in more regions with LRS, ZRS, and NFS support - Published on 9th October 2020

With 60+ announced regions, more than any other cloud provider, Azure makes it easy to choose the datacenters and regions that are right for you and your customers.






The premium tier is now available in 32 Azure regions


This is a useful link if anyone would like to explore the products available by region:

https://azure.microsoft.com/en-us/global-infrastructure/services/?products=storage&regions=all

Friday, 9 October 2020

Azure updates #1 - Azure Blob Storage - Soft Delete for Containers preview region expansion - Published on 8th October 2020

For quick reference, when we create an Azure Storage account, we can see the following options - Containers (for blobs), Files, Tables, and Queues.


When container soft delete is enabled for a storage account, any deleted containers and their contents are retained in Azure Storage for the period that you specify. During the retention period, you can restore previously deleted containers and any blobs within them.

Follow my blog for more updates on the Cloud.