The year of 2020 started as some kind of horror movie. Thanks to the COVID-19 situation, some countries closed their borders, airlines cancelled their flights and most of the people – where it is possible – work from home. As of today, we don’t really see the end of this story, but in any case, we need to try our best, not to spread the virus further.
My motive was also to support attendees and organizers of an event that is going to be held online, because of the safety regulations all around the world. This event takes place in the US (Los Angeles time zone), and the list of sessions are provided in a dataset. In this post, I’m going to play around with a dataframe, and specifically with some dates and times specified in it. I’m focusing on data cleansing solutions for specific problems that can occur when you work with a manually prepared dataset.
Note: Some solutions can be found at other sources too, you can find references to these throughout the post.
1. Time is specified in a 12 hour format, but you want it to be in 24 hour format
So let’s say, we have a Pandas dataframe with a Date (string) and Time (string) column with some data in it like the following:
|Monday, March 16, 2020||8:00AM – 8:50AM|
|Tuesday, March 17, 2020||5:00PM – 5:50PM|
|Thursday, March 19, 2020||7:00PM – 7:50PM|
Now we are only interested in the start time for this demo. The Time column we will split at the ‘-‘ character, and the first part we’ll use as hour. Then we concatenate the Date and Time columns, to get an easy to use datetime format.
df['Time'] = df['Time'].str.rsplit("-", n=0, expand=True) df['Datetime'] = df['Date'] + ' ' + df['Time']
The result dataset then will look like the following:
|Monday, March 16, 2020||8:00AM||Monday, March 16, 2020 8:00AM|
|Tuesday, March 17, 2020||5:00PM||Monday, March 17, 2020 5:00PM|
|Thursday, March 19, 2020||7:00PM||Monday, March 19, 2020 7:00PM|
1.1. Start and end time
If you want to get the start and end date as well from the Time column, you should do the following, and apply the further transformation on both new columns:
df1 = df['Time'].str.rsplit("-", n=1, expand=True) df['Start'] = df1 df['End'] = df1
Note: In this case, remember to put the result of the rsplit into an empty dataframe!
When you display, it will give you the following results:
1.2. Further transformation
import dateutil df['Schedule'] = df['Datetime'].apply(dateutil.parser.parse)
The Datetime and Schedule column should look like the following:
|Monday, March 16, 2020 8:00AM||2020-03-16 08:00:00|
|Monday, March 17, 2020 5:00PM||2020-03-17 17:00:00|
|Monday, March 19, 2020 7:00PM||2020-03-19 07:00:00|
Note: Don’t worry if the Datetime ends up having a value like ‘Tuesday, March 17, 2020 5PM‘. The dateutil package can convert that as well without an issue.
When the parser is finished, run the df.dtypes command in the IDE you are using, and note that the Datetime column is string (or object), but the Schedule is now a datetime64(ns) type of data. This datatype enables you to perform some important actions, such as time zone conversion, date format conversion and so on.
2. Define and change the time zone of a datetime64 column
Sometimes it is not very simple to figure out, how to use the built in python methods, it is especially a lot of hassle when you want to work with dates and time zones. As I mentioned before, the event is based in the US (Los Angeles time zone), and I really want to have a good overview about when are the more than 100 sessions exactly scheduled, in my time zone (Europe). One of the best solution is to use the timezones package.
Let’s use the Schedule column we calculated in the previous section. The first thing we need to do here is to define the location of the values. We know, that the schedule is defined in US (Los Angeles) time zone, but how do we define it for the entire column?
Thanks to the pytz package, you can easily figure out what time zones are available, and how do they called. So first we use the tz_localize function to define the time zone of the Schedule column, and from this point it is ready to convert to your desired time zone with tz_convert. The code to convert from US (Los Angeles) to Europe – eg. Denmark – looks like the following:
df['USTime'] = df['Schedule'].dt.tz_localize('America/Los_Angeles') df['CPHTime'] = df['USTime'].dt.tz_convert('Europe/Copenhagen')
You could also write these in one line, but I wanted to show the difference between the results. The one-liner:
df['CPHTime'] = df['Schedule'].dt.tz_localize('America/Los_Angeles').dt.tz_convert('Europe/Copenhagen')
The dataframe’s USTime and CPHTime column should look like the following:
|2020-03-16 08:00:00-07:00||2020-03-16 16:00:00+01:00|
|2020-03-17 17:00:00-07:00||2020-03-17 01:00:00+01:00|
|2020-03-19 07:00:00-07:00||2020-03-19 15:00:00+01:00|
Run the df.dtypes code, and the result should show the following:
You can verify the result of the conversion with the help of this online converter.
I really hope this collection of code and examples are helpful for you too when you want to play with dates using Python. I think it is always nice to see real examples next to the documentation, so you can apply the functions easier on the problem you try to solve.
Follow me 🙂