Pandas for SQL Users

Pandas is a great python library for data handling. It declares 2 main classes:

  1. Series – for one dimensional array
  2. DataFrame – for 2 dimensional tables (and with multi-indexing can display more dimensions)

Typical flow of using Pandas will be – load the data, manipulate and store again. This is very similar to SQL use with Select, Insert, Update and Delete statement

In this post i will cover the basic operations in pandas compared to SQL statements

I will use a simple CSV file, load it to a  dataframe and run all the commands on it:

Select

Select * from df where index between 2 and 6

Select * from df where (index between 3 and 9) and (index % 2 = 1)

Select Name from df

Select Name, Age from df

Select Name where index between 2 and 6

Select Name,Age where index between 2 and 6


Note for reference – Same operations with loc:

Same operations with iloc:


Select * from df where Age > 12

Select Name, Country  from df where Age > 12

Select * from df where Age>12 or Height > 130

Select * from df where Age>12 and Height > 130

 

Insert

Insert into df values (‘eli’,4,’DF’,100,20,20,2000,4,4)

Insert into df select * from df2

Insert new column with values (alter table df add Inc int; update df set Inc=income*2)

 

Update

Update df set Age=30

Update df set Age=20 where income>20000

Update df set income=income*2

Complex Update using iterator:

The above is a simple example but you can add as much logic as you need to the loop

 

Delete

Delete df where index=6

Delete df where Age=6

Delete column (alter table drop column Age)

 

Aggregates and Group By

Select count(*) from df

Select count(*),max(*) from df

Select count(Name), sum(income) , max(Age) from df

select Country,count(*) from df group by Country

select Country,[family persons],count(*) from df group by Country,[family persons]

 

 

Tagged ,

Leave a Reply

Your email address will not be published. Required fields are marked *