The first step in any data project is to get the data ready for analysis. In a traditional curriculum, the dataset was usually taken for granted and ready for analysis from the start. However, this no longer reflects the reality in industry: the trend towards big data has brought with it an increased demand for professionals who have the computational skills to work with potentially large, complex and unstructured data. It is commonly stated that up to 80% of a data scientist's job can be data preparation, exploratory data analysis (EDA), and visualisation.
In data science terminology, the process of preparing data for analysis is called data wrangling or munging. Common tasks include loading the data, merging datasets, identifying and handling errors (data cleaning), and dealing with missing values.
In this lesson we explore the basics of how to use the pandas package to work efficiently with data in Python. As a complement to this guide, the 10 minutes to pandas tutorial in the official pandas documentation is also a useful starting reference.
We will use the Credit dataset from the Introduction to Statistical Learning textbook by James, Witten, Hastie and Tibshirani. It records the average credit card balance at the end of the month for customers of a financial services company, as well as other individual characteristics such as age, education, gender, marital status, number of cards, and credit rating.
We start by loading the data. First, we need to import the pandas package. Since the dataset is in the csv format, we use the read_csv function to import the file. Pandas will automatically read the column names and infer the variable types. We assign the result to a variable called data, which will store the dataset as an object called a DataFrame.
import pandas as pd
# We will always assume that the data file is in a subdirectory called "Data"
data = pd.read_csv('Data/Credit.csv')
Note: To get help on any function or object, append a question mark to it and run the cell.
# pd.read_csv?  # data?, data.head?, etc. (the hash starts a comment; everything after it is ignored when running the cell)
A DataFrame has some similarities to a spreadsheet. However, unlike in a spreadsheet, you cannot click through it or make modifications manually in the current setup: we do everything through code. This may initially feel restrictive, but it is ultimately more efficient and scalable to work in this way. Some Python environments such as Spyder have GUIs (graphical user interfaces) for viewing and modifying data frames.
Two basic methods for a first view of the data are the head and tail methods. The head method displays the first rows of the data (five by default), while tail displays the last rows. You can specify the number of rows within the parentheses.
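As a minimal sketch of these two methods, using a small stand-in DataFrame (the values below are hypothetical, not taken from the Credit data):

```python
import pandas as pd

# Small stand-in DataFrame; the Income values here are made up
data = pd.DataFrame({'Income': [14.9, 106.0, 104.6, 148.9, 55.9, 80.2]})

first = data.head()   # first five rows by default
last = data.tail(2)   # last two rows
```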
Running a cell with only the name of the DataFrame will display it, but pandas limits the number of rows that can be shown. See here if you want to change this setting.
Note that only the last line of a Jupyter cell generates an output on the screen. We can use the print function to display more than one output per cell.
The rows of our DataFrame have a numerical index (in bold above), which is the default behaviour. An important detail, if you are not used to Python or some other programming languages, is that the index starts from zero instead of one: numerical indexes in Python are zero-based. This does not have to be the case in a DataFrame, but pandas follows the Python convention by default.
Alternatively, we can specify the DataFrame index (that is, a label for each row), which does not need to be a number. For example, if you have time series data, it can be the date. In our case, the first column of the file is an observation index, so we can specify that this is the case when reading the data.
data = pd.read_csv('Data/Credit.csv', index_col='Obs')
data.head()
Pandas has specialised functions for reading other types of input, such as Excel files. You can see a list of available functions here. The pandas read_table function reads data stored as general delimited text files (for example, where the columns are separated by spaces rather than commas). In practical business situations, you may often need to obtain data from a relational database rather than loading a flat file stored on your computer. You can read database queries and tables into a DataFrame by using the read_sql function.
Our dataset here is simple to work with, but others may require customising the function call. Refer to the documentation to find the appropriate options for other data that you come across.
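For example, a file with a non-standard separator or explicit missing-value markers can be handled with the sep and na_values options of read_csv. The snippet below is a sketch using an in-memory file; the column names mirror the Credit data but the values are made up:

```python
import io
import pandas as pd

# Hypothetical semicolon-separated file with 'NA' marking missing values
raw = "Obs;Income;Rating\n1;14.891;283\n2;106.025;483\n3;NA;514\n"

data = pd.read_csv(io.StringIO(raw), sep=';', na_values=['NA'], index_col='Obs')
```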
There are two ways to select data in pandas: by providing the column and index labels or by providing a numerical index.
Selecting a column by label
The output will now look different because selecting only one column returns a Series (a specialised object for when there is only one column of data) rather than a DataFrame.
Obs
1     14.891
2    106.025
Name: Income, dtype: float64
Selecting multiple columns by label
Here, the inner brackets indicate that we are passing a list of column names. The example makes this clear, and it is a useful template for some of what we will do later.
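The pattern can be sketched as follows, again on a small stand-in DataFrame with made-up values:

```python
import pandas as pd

data = pd.DataFrame(
    {'Income': [14.891, 106.025], 'Limit': [3606, 6645], 'Rating': [283, 483]},
    index=pd.Index([1, 2], name='Obs'),
)

income = data['Income']              # single label -> Series
subset = data[['Income', 'Rating']]  # list of labels -> DataFrame
```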
Selecting a column by a numerical index
The iloc method allows us to select data by numerical indexes. We just have to be careful not to be confused by zero indexing: if we want the first column, then the index needs to be zero. The following is equivalent to what we did above.
Obs
1     14.891
2    106.025
Name: Income, dtype: float64
The : syntax indicates that we want to include all rows in the selection.
Selecting multiple columns by numerical indexes
Here, we pass a list of column numbers for indexing.
Another method is slicing. Suppose that we want to select the data from the 1st to the 6th column. When specifying a range of integer indexes, the last one is not included. This may be confusing at first, but it is standard Python syntax, and there is logic behind it. What the cell below does is request indexes 0, 1, 2, 3, 4, 5, which correspond to columns 1-6.
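A sketch of this slice, on a one-row stand-in with the same column names as the Credit data (values made up):

```python
import pandas as pd

data = pd.DataFrame(
    [[14.9, 3606, 283, 2, 34, 11, 333]],
    columns=['Income', 'Limit', 'Rating', 'Cards', 'Age', 'Education', 'Balance'],
)

# positions 0-5, i.e. the 1st to the 6th column (position 6 is excluded)
first_six = data.iloc[:, 0:6]
```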
Selecting rows by labels
The loc method allows one to select rows by the designated index labels. Unlike slicing with numerical indexes, slicing with label indexes includes the last item.
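The difference between the two slicing conventions can be sketched as follows (made-up values):

```python
import pandas as pd

data = pd.DataFrame({'Income': [14.9, 106.0, 104.6, 148.9]},
                    index=pd.Index([1, 2, 3, 4], name='Obs'))

by_label = data.loc[1:3]      # label slice: rows with labels 1, 2 and 3
by_position = data.iloc[1:3]  # positional slice: rows at positions 1 and 2 only
```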
Selecting rows by numerical index
This is useful when the index variable is a string or date. Here we are back to zero indexing.
Jointly selecting rows and columns
We can combine the previous examples to simultaneously select specific rows and columns.
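As a sketch (made-up values), the same joint selection can be written with labels or with positions:

```python
import pandas as pd

data = pd.DataFrame({'Income': [14.9, 106.0, 104.6],
                     'Limit': [3606, 6645, 7075],
                     'Rating': [283, 483, 514]},
                    index=pd.Index([1, 2, 3], name='Obs'))

# rows with labels 1-2 and the Income and Rating columns
by_label = data.loc[1:2, ['Income', 'Rating']]
# the equivalent selection with positional indexes
by_position = data.iloc[0:2, [0, 2]]
```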
Combining labels and numerical indexes
As a more advanced concept, a slice of a DataFrame is itself a pandas object (a DataFrame if the slice has multiple columns, or a Series if it has only one). That means that we can chain operations.
Suppose that we want to know the sample average credit card balance for females only. Below, we select the balance column and the rows for which the value of the gender column is female.
This is called Boolean indexing in Python, because it involves the creation of binary variables indicating whether the condition is true or false for each row. The next cell will help you to understand this.
Obs
1      Male
2    Female
Name: Gender, dtype: object

Obs
1    False
2     True
Name: Gender, dtype: bool
You can also specify multiple conditions. The following selects females with age equal to or lower than 30. You can look at this reference for a list of Python comparison operators.
data.loc[(data['Gender']=='Female') & (data['Age']<=30),'Balance'].mean()
Assigning new values to a data selection
You may have noted that the first character in "Male" above is a space. This is an unintentional error in the data. The `unique` method in pandas allows us to view all the unique values in a column. In this case, it confirms that all the male entries are like this.
array([' Male', 'Female'], dtype=object)
We can use our data selection knowledge to fix this. Below, we jointly select the rows in which the gender is " Male" and the Gender column. We then replace the values at those locations with the correct label.
data.loc[data['Gender']==' Male', 'Gender'] = 'Male'
data.head(2)
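An alternative, which I note here as a sketch, is the str.strip string method, which removes leading and trailing whitespace from every entry at once:

```python
import pandas as pd

data = pd.DataFrame({'Gender': [' Male', 'Female', ' Male']},
                    index=pd.Index([1, 2, 3], name='Obs'))

# strip leading/trailing whitespace from every entry in the column
data['Gender'] = data['Gender'].str.strip()
```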
Once you have made the necessary modifications to the dataset, you may want to save it to continue working on it later. More generally, you may wish to save the results of your analysis or export tables so that you can insert them into a report or webpage (after formatting). Pandas has methods to export data as csv and Excel files, and as LaTeX and HTML tables, among other options.
In the following example I export our DataFrame as an Excel file. You can run the cell and try to open the file in Excel to check that it worked.
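A minimal sketch of the export (the file name is illustrative; to_excel requires an Excel writer package such as openpyxl to be installed):

```python
import pandas as pd

data = pd.DataFrame({'Income': [14.891, 106.025]},
                    index=pd.Index([1, 2], name='Obs'))

data.to_csv('credit_clean.csv')       # csv export, no extra dependencies
# data.to_excel('credit_clean.xlsx')  # Excel export; needs openpyxl installed
```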
After loading and preparing the data, we can start exploring it by looking at basic descriptive statistics. The describe method provides a table with basic summary statistics for the data.
data.describe().round(1) # here I appended round to limit the number of decimal places in the display, try without it
There are also individual functions for a range of summary statistics. Refer to the pandas documentation for a full list of available functions. As examples, we calculate the means of the dataset and the correlation between income and credit card limit.
Income        45.22
Limit       4735.60
Rating       354.94
Cards          2.96
Age           55.67
Education     13.45
Balance      520.02
dtype: float64
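These calls can be sketched as follows (made-up values):

```python
import pandas as pd

data = pd.DataFrame({'Income': [14.9, 106.0, 104.6],
                     'Limit': [3606, 6645, 7075]})

means = data.mean()                        # mean of every numerical column
corr = data['Income'].corr(data['Limit'])  # correlation between two columns
```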
When preparing complex datasets for analysis, it is often useful to work with different types of variables (say, numerical or categorical) separately. You can use the DataFrame info method to view the list of columns and their variable types. The object dtype is the most general type: a column with this dtype will typically contain text data.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 400 entries, 1 to 400
Data columns (total 11 columns):
Income       400 non-null float64
Limit        400 non-null int64
Rating       400 non-null int64
Cards        400 non-null int64
Age          400 non-null int64
Education    400 non-null int64
Gender       400 non-null object
Student      400 non-null object
Married      400 non-null object
Ethnicity    400 non-null object
Balance      400 non-null int64
dtypes: float64(1), int64(6), object(4)
memory usage: 57.5+ KB
To select columns of a particular type, you can do as follows:
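One option, sketched below on a stand-in DataFrame, is the select_dtypes method:

```python
import pandas as pd

data = pd.DataFrame({'Income': [14.9, 106.0],
                     'Cards': [2, 3],
                     'Gender': ['Male', 'Female']})

numerical = data.select_dtypes(include='number')    # float64 and int64 columns
categorical = data.select_dtypes(include='object')  # text columns
```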