Exploring Data With Pandas¶
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option('max_columns', 50)
%matplotlib inline
Series¶
There are three ways to create a series.
In [3]:
# first method: Involves passing a list without index to pd.
a = pd.Series(['food', 8, 'july', 9.0, [2,3], 'data'])
print a
In [5]:
# second method: Involves passing a list of data and a list on index
data = pd.Series(['food', 8, 'july', 9.0, [2,3], 'data'],
index=['a','b','c', 'd', 'e', 'f'])
print data
In [8]:
# third method: Involves passing a dictionary
data = {'Lagos': 1000, 'Kaduna': 1300, 'Abuja': 900, 'Anambra': 1100,
'Plateau': 450, 'Ebonyi': None}
states = pd.Series(data)
print states
One can basically perform any operations on the series like a numpy array.¶
In [11]:
print states['Lagos'] # index with dictionary keys
In [12]:
print states[states > 1000] # boolean indexing
In [13]:
print states > 1000
In [14]:
print 'Abuja' in states # check for membership
In [15]:
print states.notnull() # check for availablility of values
In [18]:
print states[states.isnull()] # expose null states
In [25]:
# we can create a new dataframe
data = {'year': [1986, 1990, 1994, 1998, 2002, 2006, 2011, 2014],
'team': ['Argentina', 'Germany', 'Brazil', 'France', 'Brazil', 'Italy', 'Spain', 'Germany'],
'goal_for': [3, 1, 3, 3, 2, 5, 1, 1],
'goal_against': [2, 0, 2, 0, 0, 3, 0, 0]}
soccer = pd.DataFrame(data, columns=['year', 'team', 'goal_for', 'goal_against'])
soccer
Out[25]:
In [26]:
from_csv = pd.read_csv('Secondary.csv')
from_csv.head()
Out[26]:
So let's write to excel and also read from excel. This might require installing xlrd, so let's do pip install xlrd¶
In [27]:
soccer.head()
Out[27]:
In [28]:
soccer.to_excel('soccer.xlsx')
In [29]:
del soccer
In [30]:
soccer = pd.read_excel('soccer.xlsx')
In [31]:
soccer.head()
Out[31]:
Reading From SQL database¶
In [43]:
import os
basedir = os.path.realpath('.')
from pandas.io import sql
import sqlite3
#conn = sqlite3.connect('sqlite:////' + os.path.join(basedir, 'data.sqlite'))
conn = sqlite3.connect('data.sqlite')
In [45]:
soccer.to_sql(name='test',con=conn)
In [49]:
query = "SELECT * FROM test"
result = pd.read_sql(query, con=conn)
result.head()
Out[49]:
Read data from url¶
In [50]:
url = 'https://raw.githubusercontent.com/kantologist/abuja_schools/master/data/international-schools.csv'
from_url = pd.read_csv(url)
from_url.head()
Out[50]:
So let's explore a sample data. Lagos Schools¶
In [52]:
schools = from_csv
schools.head()
Out[52]:
In [53]:
schools.info() # data inspection
In [54]:
schools.dtypes
Out[54]:
In [56]:
schools.describe()
Out[56]:
In [57]:
schools.head()
Out[57]:
In [58]:
schools.tail(3)
Out[58]:
In [59]:
schools[200:205]
Out[59]:
oops! LGA and Ownership are mixed up. Let's fix that.¶
In [75]:
schools.columns
Out[75]:
In [83]:
schools.rename(columns= {'LGA':'Ownership', 'Ownership':'LGA'}, inplace=True )
schools.columns
Out[83]:
In [85]:
schools['Ownership'].head()
Out[85]:
In [86]:
schools[['LGA', 'Ownership']].head()
Out[86]:
In [88]:
schools[(schools.Ownership == 'State owned')].head(3)
Out[88]:
In [90]:
schools[(schools.Ownership == 'State owned') & (schools.LGA == 'Oshodi-Isolo')].head(3)
Out[90]:
In [91]:
schools.iloc[[0, 78, 300]]
Out[91]:
In [93]:
schools.groupby('LGA').count().head()
Out[93]:
In [94]:
schools.groupby('LGA').size().tail()
Out[94]: