Pandas timeseries plot – setting x-axis major and minor ticks and labels

I’ve asked this question on Stack Overflow (http://stackoverflow.com/questions/12945971/pandas-timeseries-plot-setting-x-axis-major-and-minor-ticks-and-labels), but couldn’t include images because I haven’t posted on Stack Overflow before. So here it is, with the images.

I want to be able to set the major and minor xticks and their labels for a time series graph plotted from a Pandas time series object.

The Pandas 0.9 “what’s new” page says: “you can either use to_pydatetime or register a converter for the Timestamp type”, but I can’t work out how to do that so that I can use the matplotlib ax.xaxis.set_major_locator and ax.xaxis.set_major_formatter commands (and their minor equivalents).

If I use them without converting the pandas times, the x-axis ticks and labels end up wrong.

By using the ‘xticks’ parameter I can pass the major ticks to pandas.plot, and then set the major tick labels. I can’t work out how to set the minor ticks using this approach. (I can set the labels on the default minor ticks created by pandas.plot.)

Here is my test code:

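import datetime
import matplotlib
import matplotlib.dates
import matplotlib.pyplot as plt
import numpy as np
import pandas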
print 'pandas.__version__ is ', pandas.__version__
print 'matplotlib.__version__ is ', matplotlib.__version__

dStart = datetime.datetime(2011,5,1) # 1 May
dEnd = datetime.datetime(2011,7,1) # 1 July

dateIndex = pandas.date_range(start=dStart, end=dEnd, freq='D')
print "1 May to 1 July 2011", dateIndex  

testSeries = pandas.Series(data=np.random.randn(len(dateIndex)), index=dateIndex)

ax = plt.figure(figsize=(7,4), dpi=300).add_subplot(111)
testSeries.plot(ax=ax, style='v-', label='first line')

# using MatPlotLib date time locators and formatters doesn't work with new pandas datetime index
ax.xaxis.set_minor_locator(matplotlib.dates.WeekdayLocator(byweekday=(1),interval=1))
ax.xaxis.set_minor_formatter(matplotlib.dates.DateFormatter('%d\n%a'))
ax.xaxis.grid(True, which="minor")
ax.xaxis.grid(False, which="major")
ax.xaxis.set_major_formatter(matplotlib.dates.DateFormatter('\n\n\n%b%Y'))
plt.show()

# set the major xticks and labels through pandas
ax2 = plt.figure(figsize=(7,4), dpi=300).add_subplot(111)
xticks = pandas.date_range(start=dStart, end=dEnd, freq='W-Tue')
print "xticks: ", xticks
testSeries.plot(ax=ax2, style='-v', label='second line', xticks=xticks.to_pydatetime())
ax2.set_xticklabels([x.strftime('%a\n%d\n%h\n%Y') for x in xticks]);
# set the text of the first few minor ticks created by pandas.plot
#    ax2.set_xticklabels(['a','b','c','d','e'], minor=True)
# remove the minor xtick labels set by pandas.plot 
ax2.set_xticklabels([], minor=True)
# turn the minor ticks created by pandas.plot off 
# plt.minorticks_off()
plt.show()
print testSeries['6/1/2011':'6/7/2011']

and its output:

pandas.__version__ is  0.9.1.dev-3de54ae
matplotlib.__version__ is  1.1.1
1 May to 1 July 2011 <class 'pandas.tseries.index.DatetimeIndex'>
[2011-05-01 00:00:00, ..., 2011-07-01 00:00:00]
Length: 62, Freq: D, Timezone: None

xticks: <class 'pandas.tseries.index.DatetimeIndex'>
[2011-05-03 00:00:00, ..., 2011-06-28 00:00:00]
Length: 9, Freq: W-TUE, Timezone: None

2011-06-04   -0.199393
2011-06-05   -0.043118
2011-06-06    0.477771
2011-06-07   -0.033207
Freq: D
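
As a hedged sketch of the to_pydatetime route quoted from the “what’s new” page, one option is to bypass Series.plot and hand matplotlib plain Python datetimes, so that the matplotlib date locators and formatters see the values they expect. This assumes a reasonably recent pandas and matplotlib rather than the versions printed above; the Agg backend and the variable names are choices made for this illustration only.

```python
import datetime

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

date_index = pd.date_range(start=datetime.datetime(2011, 5, 1),
                           end=datetime.datetime(2011, 7, 1), freq="D")
test_series = pd.Series(np.random.randn(len(date_index)), index=date_index)

fig, ax = plt.subplots(figsize=(7, 4))
# Plot with matplotlib directly, using ordinary datetime objects on the x-axis
ax.plot(test_series.index.to_pydatetime(), test_series.values, "v-")

# Minor ticks: every Tuesday, labelled with day of month and weekday name
ax.xaxis.set_minor_locator(mdates.WeekdayLocator(byweekday=mdates.TU, interval=1))
ax.xaxis.set_minor_formatter(mdates.DateFormatter("%d\n%a"))
# Major ticks: start of each month, label pushed below the minor labels
ax.xaxis.set_major_locator(mdates.MonthLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter("\n\n\n%b %Y"))

ax.xaxis.grid(True, which="minor")
ax.xaxis.grid(False, which="major")
# Call plt.show() (interactive backend) or fig.savefig(...) to view the result
```

The pandas plotting documentation also describes an x_compat=True keyword for Series.plot that is meant to switch off the pandas tick adjustment; whether it is available depends on the pandas version installed.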

Learning Python – iPython, matplotlib and Pandas

As I said in my last post, I was inspired by Dr Edward Schofield’s talk at OSDC2011, Python for R&D, to try out Python and in particular iPython.

So I’ve been learning Python by using iPython to analyse my Twitter data. The iPython notebook provides a fantastic environment for this: it lets you write notes in between blocks of Python code and see the results of running the code on the same page.

I’m starting iPython by opening a terminal window in the directory where I keep my iPython notebooks and running:

ipython notebook --pylab inline

This makes the matplotlib graphics appear inline (on the webpage) instead of in a separate window.

I tried a few different approaches to getting everything working on Mac OS X Lion. The SciPy Superpack for OS X was the last I tried, and it seems to have fixed the last pieces that I hadn’t managed to get working via the other approaches: Pandas and scikits.statsmodels.

I’m using the dev version of iPython from GitHub. It is great that they have it set up so that each time I pull the updates they are available straight away, without any extra install, just by restarting the notebook.

I began by working out how to get data sets from MySQL and from Apache Solr and then draw graphs of them using matplotlib. I used paired lists for this, as that was what the matplotlib examples used. When I started trying to add time series of different lengths and with different gaps in the data, I started to find the limitations of paired lists. Looking around for Python time series libraries, I found scikits timeseries, which looked good, but then came across scikits.statsmodels and Pandas and decided to try them.

If you want to try running this code, I’ve linked the iPython notebook that these code snippets were taken from at the bottom of the post.

Convert pair of Python lists to Pandas series

Pandas makes it easy to convert the paired lists into a pandas.Series object:

def convertListPairToTimeSeries(dList, cList):
    # my dateList had date objects, so convert back to datetime objects
    dListDT = [datetime.datetime.combine(x, datetime.time()) for x in dList]
    # found that NaN didn't work if the cList contained int data
    cListL = [float(x) for x in cList]
    # create the index from the datetimes list
    indx = pandas.Index(dListDT)
    # create the timeseries
    ts = pandas.Series(cListL, index=indx)
    # fill in missing days
    ts = ts.asfreq(pandas.datetools.DateOffset())
    return ts
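
The function above targets the pandas release used in this post. As a minimal sketch for more recent pandas versions, where to my knowledge the pandas.datetools module is no longer available, the same idea can be written with pd.to_datetime and asfreq("D"). The function name list_pair_to_series and the sample lists are made up for illustration.

```python
import datetime

import pandas as pd


def list_pair_to_series(dates, counts):
    """Build a daily series from paired lists; days absent from `dates` become NaN."""
    index = pd.to_datetime(dates)          # date objects -> DatetimeIndex
    values = [float(c) for c in counts]    # floats, so NaN can be stored later
    return pd.Series(values, index=index).asfreq("D")


# Hypothetical sample data with a two-day gap (3 and 4 December)
dates = [datetime.date(2011, 12, 1), datetime.date(2011, 12, 2), datetime.date(2011, 12, 5)]
counts = [10, 12, 7]
ts = list_pair_to_series(dates, counts)
```

Here ts covers every day from 1 to 5 December, with NaN on the two days that were missing from the input lists.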

Adjusting the Pandas series

I then made the two data sets (tweets received per day and tweets limited per day) have the same start and end and filled in all the missing days with 0’s or, where I knew that I had a data collection outage, with NaN.

# my data had lots of gaps that were actually 0 values, not missing data
# So I used this to fix the NaN outside the known outage
startOutage = datetime.datetime(2011,12,7)
endOutage = datetime.datetime(2011,12,8)
# set all NaN values to 0
tsFilled = tSeries.fillna(0)
# set the known outage values back to NAN
tsFilled.ix[startOutage:endOutage] = numpy.NAN
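
The .ix indexer has since been removed from pandas, and newer NumPy releases drop the numpy.NAN alias. A rough equivalent for current versions, using label-based .loc slicing and np.nan, might look like the sketch below; the sample series is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical series: NaN on 6 and 9 December stands for "no tweets recorded"
index = pd.date_range("2011-12-05", "2011-12-10", freq="D")
t_series = pd.Series([5.0, np.nan, 3.0, 4.0, np.nan, 2.0], index=index)

start_outage = pd.Timestamp(2011, 12, 7)
end_outage = pd.Timestamp(2011, 12, 8)

# set all NaN values to 0
ts_filled = t_series.fillna(0)
# set the known outage back to NaN; .loc label slices include both end points
ts_filled.loc[start_outage:end_outage] = np.nan
```

Note that the values recorded on 7 and 8 December are overwritten too, which matches the original snippet: the whole outage window is treated as unknown.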

If the gaps in my data were all meant to be missing values instead of 0, I could have just left the series as they were and used pandas.join instead of + to add them together.
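
To illustrate why the treatment of gaps matters when adding series, here is a small sketch (with invented data) of how + behaves: pandas aligns the two series on their index and returns NaN wherever either side has no value. Series.add with fill_value=0 is one alternative that treats a value missing on only one side as 0.

```python
import pandas as pd

# Two hypothetical daily series that overlap on 2 and 3 December only
a = pd.Series([1.0, 2.0, 3.0], index=pd.date_range("2011-12-01", periods=3, freq="D"))
b = pd.Series([10.0, 20.0, 30.0], index=pd.date_range("2011-12-02", periods=3, freq="D"))

plain_sum = a + b                      # NaN on 1 Dec and 4 Dec (one side missing)
filled_sum = a.add(b, fill_value=0)    # a missing side counts as 0
```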

Truncating the data

I truncated the data, both to make the series the same length (although there are other ways to do this) and to remove partial days of data at the start and end of the periods.

# use slicing to change length of data
tSeriesSlice = tSeries.ix[startData:endData]
# use truncate instead of slicing to change length of data
tSeriesTruncate = tSeries.truncate(before=startData, after=endData)
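
On current pandas versions the slicing variant would use .loc rather than .ix. The sketch below, with made-up data and dates, suggests that both approaches keep the end points and give the same result on a sorted daily index.

```python
import pandas as pd

# Hypothetical ten-day series starting 1 December 2011
ts = pd.Series(range(10), index=pd.date_range("2011-12-01", periods=10, freq="D"))
start_data = pd.Timestamp(2011, 12, 3)
end_data = pd.Timestamp(2011, 12, 6)

# label-based slicing: both end points are included
sliced = ts.loc[start_data:end_data]
# truncate drops everything before `before` and after `after`
truncated = ts.truncate(before=start_data, after=end_data)
```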

Plotting the data

Basic graphing of a Pandas series is straightforward:
tsFilled.plot();

It is easy to have multiple lines on the same graph, and to add titles and axis labels.

tsFilled.plot(label="original")
tsNew = tsFilled+(rand(len(tsFilled)))
tsNew.plot(label="adding")
tsNew1 = tsFilled.fillna(1)
tsNew1 = tsNew1 +(rand(len(tsNew1)))
tsNew1.plot(label="fillna+adding")
plt.legend(loc=3)
plt.title("Testing Panda Plotting")
plt.ylabel("Counts");

For more control over the layout, it is possible to pass a matplotlib axes object to the Pandas plot call via the ax parameter.

# loc = MonthLocator()
loc = DayLocator(interval=2)
formatter = DateFormatter('%d %b')
fig = plt.figure(figsize=(8,4))
ax = fig.add_subplot(211) 
plt.title("Testing Panda Plotting")
tsFilled.plot(label="original", ax=ax)
plt.ylabel("Counts")
plt.legend()
ax2 = fig.add_subplot(212) 
tsFilled.fillna().plot(label="filled", ax=ax2, color='g')
plt.legend()
plt.ylabel("Counts")
plt.xlabel("2011")
ax2.xaxis.set_major_locator(loc)
ax2.xaxis.set_major_formatter(formatter)
labels = ax2.get_xticklabels()
setp(labels, rotation=80, fontsize=10);

For plotting multiple lines, it is probably better to add the series into a Pandas dataframe and then plot from that, but I’ll leave that for another day.
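
As a rough sketch of that DataFrame approach, assuming a recent pandas and matplotlib: each column becomes one line, and the column names are used for the legend. The column names and random data below are placeholders, not the real tweet counts.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import numpy as np
import pandas as pd

index = pd.date_range("2011-12-01", periods=14, freq="D")
received = pd.Series(np.random.rand(14) * 100, index=index)
limited = pd.Series(np.random.rand(14) * 10, index=index)

# Combine the series into one DataFrame; they are aligned on the date index
df = pd.DataFrame({"received": received, "limited": limited})

# DataFrame.plot draws one line per column and returns the matplotlib axes
ax = df.plot(title="Tweets per day")
ax.set_ylabel("Counts")
```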

I’m new to python, matplotlib and Pandas, so I’d be very happy for any feedback about better ways to do things.

iPython Notebook for these examples

Download pandasTimeSeriesNotes.ipynb_.zip (58k)

Links