Hard Lessons, Real Tips

6 most common pandas query for cleaning data. Bonus at the end…

One-stop cheatsheet, for myself…

Photo by Angèle Kamp on Unsplash

Here are the commonly used queries which you would probably need every time you have a dataset, but they are nitty-gritty enough that you still need to google them to make sure of the syntax…So I summarized them here as a cheat sheet for myself.

  1. Rename column/columns
>> data.rename(columns={'x':'new_x'}, inplace=True)If you only have a few columns, you can also: >> last_rating.columns = ['column1/2', 'column2/2']

2. Remove “[]” “()” and etc…

>> print (df)
0 [1]
1 [2]
2 [3]
>> df['value'] = df['value'].str.strip('[]').astype(int)

3. Insert a column

>> df = pd.DataFrame({
'insert_col': ['insert_values', 'see_examples_below'],
'col_2': [4, 5, 6, 7]

4. Drop a column

>> df.drop(['B', 'C'], axis=1)   A   D
0 0 3
1 4 7
2 8 11

5. to_datetime

>> print(df['timestamp'])   timestamp
0 1381006850
1 1586466072
2 1579057827
>> df['date'] = df['timestamp'].apply(lambda row: pd.to_datetime(row, unit='s')) timestamp date
0 1381006850 2013-10-05 21:00:50
1 1586466072 2020-04-09 21:01:12
2 1579057827 2020-01-15 03:10:27

6. Merge

>> pd.merge(df1, df2, on='key')for more than 2 dfs>> df1.merge(df2, on='key').merge(df3, on='key)

Bonus: Dummy columns with multiple values.

Look how annoying the column “genre” is… The point here is that I cannot directly use “pd.get_dummies()”

Let’s say we want the result to look like this…

Here is how I approach this problem every time. There are 3 things we need to have:

  1. a list/set that contains all the unique values we want to turn into column names
  2. The column that we want to dummy is cleaned, meaning we can directly apply functions to it. There is nothing like brackets or other value and etc.
  3. A function that allows us to do 2 things: turn unique names to column names. Go through each value in the column that you want to dummy, return 1 if it matches the column name, and put 0 if it doesn’t.

Here is the example for column ‘genre’:

To have 1: I used .split() and the set method to get unique values.

>> genres =[]
>> for val in movies['genre']:
except: AttributeError
# I used try & except because when you hit Nan, you get attribute error. This allows you to keep going>> genres = set(genres) #set allows you to get all the unique values

To have 2: we already have it. The column ‘genre’ can directly apply a function to it. Let’s write the function now.

To have 3: I used.find() because each cell can have more than 1 value. Here is what the function does: go through each cell in column “genre”. If it finds a value that matches x genre, mark 1, else mark 0.

>> def split_genres(val):
if val.find(gene) > -1:
return 1
return 0
except AttributeError:
return 0

Lastly: Apply the function to the column

>> for gene in genres:        
movies[gene] = movies['genre'].apply(split_genres)

You have it!


Let’s look at another example: the year of the movie

I want to generalize it and turn it into

To have 1: It is very easy because there are only three values. I can directly create a list.

>> dummy_cols = ["1800's", "1900's", "2000's"]

To have 2: I used .split() and .strip() separate the year from the movie column and get ride of the ().

>> movies[['movie', 'year']] = movies.movie.str.split("(", expand=True)
>> movies['year']=movies['year'].str.strip(')')
>> movies.head()

Now, the column year is cleaned and ready to be applied function.

To have 3: here is the function I wrote

>> def add_movie_year(val):
if val[:2] == yr:
return 1
return 0

# Apply function
for yr in dummy_cols:
movies[yr] = movies['year'].apply(add_movie_year)

Conclusion: remember the 3 things and check what we are missing. Then go get it!

Share what I learned, and learn from what I shared. All about machines, humans, and the links between them. Take everything with a grain of salt.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store