Call us on: +91 860-099-8107, Email: contactnull@technogeekscsnull.co.in, Demo Sessions :- Coming Saturday : We are Conducting Demo Sessions on : | Hadoop, BigData, Spark And Scala, Data Science And Machine Learning, : Cloud Computing AWS (Amazon Web Services), Data warehouse, ETL Testing and Informatica | , Please Call Us To Confirm Timings And Book Your Demo Seat On: : 8600998107 |
Apache Pig Tutorial

Apache Pig Tutorial

APACHE PIG

Start pig shell on your terminal 

For Mapreduce 

$pig –x  mapreduce

For Local Mode

$pig –x local

grunt>

These are the operator we will use in pig:

Load Operator:

This operator used to load data from local/hdfs to pig using load keyword.

Syntax:

Relation_name = LOAD'Input file path' USING function as schema;

 

Explain:

 

Example:

Suppose you have a file Student.txt and you want to load it into pig so command is:

grunt> stu_load = load '/user/cloudera/Student.txt’ using PigStorage(',') as (sid:int, sname:chararray, address:chararray);

 

Dump operator:

Using dump operator you can see your relation_name data in which you stored previous.

Syntax:

grunt> Dump Relation_Name;

Example:

grunt> dump stu_load

 

 

Limit

Used to limit the number of outputs to the desired number.

Syntax:

Alias = LIMIT alias n;

Where;

alias : name of the relation.

n : number of tuples to be displayed.

Example:

stu_limit = limit stu_load 2;

 

 

Store operator:

This is use for store data into HDFS from pig which is processed in pig.

Syntax:

STORE Relation_name INTO ' required_directory_path ' [USING function];

Explain:

 

 

Example:

Suppose we processed employee data into pig now we want to store this into another file.

grunt> store stu_load into '/user/cloudera/output';   

 

 

Describe operator:

This operator basically use to see schema of a relation.

Syntax:

grunt> Describe Relation_name

Example:

grunt>  describe stu_load;

 

 

 

 

Order operator

Sorts a relation based on single or multiple fields.

Syntax:

alias = ORDER alias BY {field_name [ASC | DESC]

Where;

alias : is the relation

ORDER : is the keyword.

BY : is the keyword.

field_name : column on which you want to sort the relation.

ASC : sort in ascending order

DESC : sort in descending order.

Example:

stu_order =  ORDER stu_load by sid  DESC;

 

 

 

 

 

Explain operator:

It is used to display MapReduce execution plans of a relation.

Syntax:

grunt> explain Relation_name;

Example:

grunt> explain stu_load;

 

Illustrate operator:

It gives step-by-step execution of a sequence of statements.

Syntax:

grunt> illustrate Relation_name;

Example:

grunt> illustrate stu_load;

 

 

 

 

 

 

 

 

Group operator:

It is use to group data in one or more relation.

Syntax:

grouped_data = group Relation_name by field;

Explain:

 

 

Example:

stu_grouped = group stu_load by stuid;

Group by multiple column:

stu_group_multi = group stu_load by (name, address);

  

 

stu_grouped_all = group stu_load all;

 

 

 

 

Join operator:

It is use to combine two or more relation

  1. Self-join0
  2. Inner join
  3. Outer joins.

customers.txt

1,Ramesh,32,Ahmedabad,2000.00

2,Khilan,25,Delhi,1500.00

3,kaushik,23,Kota,2000.00

4,Chaitali,25,Mumbai,6500.00 

5,Hardik,27,Bhopal,8500.00

6,Komal,22,MP,4500.00

7,Muffy,24,Indore,10000.00

Order.txt

102,2009-10-08 00:00:00,3,3000

100,2009-10-08 00:00:00,3,1500

101,2009-11-20 00:00:00,2,1560

103,2008-05-20 00:00:00,4,2060

 

 

NOTE: We are join two different  tables /files  here.

 Please note both the files are comma separated and so while loading the data, you will have to use the command PigStorage(',')

First load the file in HDFS

grunt>customers = load '/user/cloudera/customers.txt' using PigStorage(',')as (id:int, name:chararray, age:int, address:chararray, salary:int);

 

grunt>orders = load 'user/cloudera/orders.txt' using PigStorage(',')as (oid:int, date:chararray, customer_id:int, amount:int);

 

 

 

JOIN OPERATIONS

Self - join

 

Self-join is used to join a table with itself as if the table were two relations.

 

in self-join, we have to rename the same relation and then apply the join condition. 

Syntax:

Relation3_name = join Relation1_name BY key, Relation2_name BY key

Here we are join single table but we change relation name.

grunt> cust_realation1  = load '/user/cloudera/customers.txt' using PigStorage(',')as (id:int, name:chararray, age:int, address:chararray, salary:int);

grunt> cust_realation2  = load '/user/cloudera/customers.txt' using PigStorage(',')as (id:int, name:chararray, age:int, address:chararray, salary:

 

 

grunt> customers3 = JOIN cust_relation1 BY id, cust_relation2 BY id;

 

 

Verification

Verify the relation customers3 using the DUMP operator as shown below.

dump customers3

 

OUTPUT

 

 

Inner Join

Inner join returns rows when there is a match in both tables.

 

Syntax

grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;

 

Example

customer_orders = JOIN customers BY id, orders BY customer_id;

 

 

 

Verification

Verify the relation customer_orders using the DUMP operator as shown below.

dump customer_orders

 

 

 

 

Left Outer Join

 

The left outer Join operation returns all rows from the left table, even if there are no matches in the right relation

 

Syntax

 

grunt> Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY customer_id;

 

Example

 

left_outer= JOIN customers BY id LEFT OUTER, orders BY customer_id;

 

Verification

Verify the relation left_outer using the DUMP operator as shown below.

dump left_outer

 

 

 

Right Outer Join

 

The right outer join operation returns all rows from the right table, even if there are no matches in the left table.

Syntax

Given below is the syntax of performing right outer join operation using the JOIN operator.

 

right_outer = JOIN customers BY id RIGHT, orders BY customer_id;

 

Example

 

outer_right= JOIN customers BY id LEFT RIGHT, orders BY customer_id;

 

Verification

Verify the relation right_outer using the DUMP operator as shown below.

dump right_outer

OUTPUT

 

 

Full Outer Join

 

The full outer join operation returns rows when there is a match in one of the relations.

 

Syntax

full_outer = JOIN customers BY id FULL OUTER, orders BY customer_id;

 

Example

             full_order = JOIN customers BY id FULL OUTER, orders BY customer_id;

 

 

 

 

 

 

Verification

Verify the relation full_order using the DUMP operator as shown below.

dump full_order

 

OUTPUT

 

 

 

Using Multiple Keys Join

 

We can perform JOIN operation using multiple keys.

 

Syntax:

 

employee.txt

001,Rajiv,Reddy,21,programmer,003

002,siddarth,Battacharya,22,programmer,003

003,Rajesh,Khanna,22,programmer,003

004,Preethi,Agarwal,21,programmer,003

005,Trupthi,Mohanthy,23,programmer,003

006,Archana,Mishra,23,programmer,003

007,Komal,Nayak,24,teamlead,002

008,Bharathi,Nambiayar,24,manager,001

 

 

employee_contact.txt

001,9848022337,Rajiv@gmail.com,Hyderabad,003

002,9848022338,siddarth@gmail.com,Kolkata,003

003,9848022339,Rajesh@gmail.com,Delhi,003

004,9848022330,Preethi@gmail.com,Pune,003

005,9848022336,Trupthi@gmail.com,Bhuwaneshwar,003

006,9848022335,Archana@gmail.com,Chennai,003

007,9848022334,Komal@gmail.com,trivendram,002

008,9848022333,Bharathi@gmail.com,Chennai,001

 

 

LOAD THE FILES IN HDFS

employee = LOAD '/user/cloudera/employee.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, designation:chararray, jobid:int);

 

 

employee_contact = LOAD '/user/cloudera/employee_contact.txt' USING PigStorage(',') as (id:int, phone:chararray, email:chararray, city:chararray, jobid:int);

 

 Key_join = JOIN employee BY (id,jobid), employee_contact BY (id,jobid);

output

Verification

Verify the relation Key_join using the DUMP operator as shown below.

dump Key_join

 

 

Enquiry Now !