Introduction
I’m adding a financial database creation module to simplify (and greatly speed up) some of the ongoing calculations I’m making to generate portfolio visualizations. In the creation of this financial database module, I constantly need to check to see if data already exists for some security, and then if not, add it. However, I’d like to be able to throw a whole bunch of security tickers at the function, and then have it “figure out” which ones it needs to add, based on what’s already inside of the database.
Initially, I was just going to use list comprehensions, as “iterability” is already a terrific strength of Python’s. But then I got to thinking: should I use list comprehensions, or numpy, Python’s powerful array library? And if I was going to use numpy, should I use lists, numpy.ndarrays, or pandas.Series as inputs to the numpy function? Because I’m attempting to construct (what I’m hoping is) production-level code, I figured I’d test the answer numerically… and bore you all with the findings.
Setup
First, I wrote a little function to create random tickers of length 3 for the “superset” (the tickers already in the database) and the “subset” (those that I wanted to put into the database without overwriting any existing data). Here’s my adorable little function for that:
import numpy
import pandas
from functools import reduce   #reduce is not a builtin in Python 3

def gen_tickers(num_already, num_new):
    s = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    ticks = []
    #create the list of tickers already "in the database"
    for i in numpy.arange(num_already):
        ticks.append(reduce(lambda x, y: x + y,
                            map(lambda x: s[x], numpy.random.randint(0, 26, 3))))
    new_ticks = []
    #create the list of tickers to append
    for i in numpy.arange(num_new):
        new_ticks.append(reduce(lambda x, y: x + y,
                                map(lambda x: s[x], numpy.random.randint(0, 26, 3))))
    return ticks, new_ticks
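For a sense of what it produces (my example; the actual output is random every time), a call looks like this:

#example call -- output will differ since the tickers are random
ticks, new_ticks = gen_tickers(5, 3)
print(ticks)       #e.g. ['QJZ', 'ABD', 'KLM', 'XPT', 'RRO']
print(new_ticks)   #e.g. ['ZZQ', 'MNB', 'TYU']
#note: nothing stops a "new" ticker from colliding with an existing one --
#that overlap is exactly what the matching functions below have to detect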
Next I created 4 functions, each performing the same task: check which tickers in the subset are not already part of the superset.
1. Using List Comprehensions:
def list_method(ticks, new_ticks):
    bool_map = map(lambda x: x in ticks, new_ticks)
    return [tick for (tick, bool_val) in zip(new_ticks, bool_map) if not bool_val]
2. Using numpy with lists as inputs:
def np_list_method(ticks, new_ticks):
    return numpy.setdiff1d(new_ticks, ticks)
3. Using numpy with numpy.ndarrays as inputs:
def np_ndarray_method(ticks, new_ticks):
    return numpy.setdiff1d(numpy.array(new_ticks), numpy.array(ticks))
4. Using numpy with pandas.Series as inputs:
def pandas_method(ticks, new_ticks):
    return numpy.setdiff1d(pandas.Series(new_ticks), pandas.Series(ticks))
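Before timing anything, a quick sanity check (mine, not from the original post) that the four approaches agree; the only behavioral difference worth noting is that the numpy-based versions return a sorted, de-duplicated ndarray rather than a plain list:

ticks = ['AAA', 'BBB', 'CCC']        #pretend these are already in the database
new_ticks = ['DDD', 'BBB']           #candidates to insert
print(list_method(ticks, new_ticks))        #['DDD']
print(np_list_method(ticks, new_ticks))     #['DDD'] as a numpy array
print(np_ndarray_method(ticks, new_ticks))  #['DDD'] as a numpy array
print(pandas_method(ticks, new_ticks))      #['DDD'] as a numpy array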
Next I used IPython’s convenient %timeit functionality to time each function at every combination of subset and superset size, taking the average of the 3 best runs (unfortunately, %timeit doesn’t hand back the run-times for the entire sample, so I had to make do with the best 3).
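For reference, the -o flag makes %timeit return a TimeitResult object, so a “best three” average can be computed from its all_runs attribute; the snippet below is a sketch of that idea rather than the exact code I ran:

#each entry of all_runs is the total time for one repeat of the -n loops
res = %timeit -o -n100 np_list_method(ticks, new_ticks)
best_three = sorted(res.all_runs)[:3]
avg_per_call = numpy.mean(best_three) / res.loops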
Here are the different subset and superset sizes that I used:
Tickers in Subset | Tickers in Superset
---|---
10 | 100
100 | 1,000
1,000 | 10,000
10,000 | 100,000
100,000 | 1,000,000
For each superset size, I ran all of the subset sizes. So, for instance, with a superset of 1,000 tickers (i.e. 1,000 tickers “already in the database”), I ran subsets of 10, 100, 1,000, 10,000, and 100,000 tickers to find the tickers that weren’t already in the superset of 1,000 (i.e. the ones I wanted to put into the database without overwriting existing data).
Results
When dealing with a small subset, don’t bother with any sort of module / library imports (i.e. pandas or numpy); for this specific type of functionality, list comprehensions get the job done much faster.
… See prior comment
Now we begin to see some interesting results: no matter the size of the superset (except in the 1,000,000 case), the pure list implementation is the slowest. As we’ll see, this trend continues as we move to larger subsets.
At this point, the pure list implementation begins to cause some serious drag on the speed of the simple matching algorithm. So much so that I actually removed it from the chart so we could see what was happening with the numpy implementations.
Interestingly enough, the numpy functionality operates efficiently, across all subset sizes, on lists rather than on numpy.ndarrays or pandas.Series. This is likely because “constructing a new object” doesn’t really provide any greater efficiency for this algorithm: numpy.ndarrays and pandas.Series are much more efficient than pure Python lists when reshaping and indexing are involved, and that isn’t the case here.
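That squares with how numpy.setdiff1d is documented: it accepts array-like inputs and converts them internally, so converting to an ndarray (or building a Series) first just adds an extra pass over the data. A small illustration of my own:

raw = ['IBM', 'AAPL', 'MSFT']
existing = ['AAPL', 'GOOG']
#plain lists work directly -- setdiff1d converts array-like inputs itself
print(numpy.setdiff1d(raw, existing))               #['IBM' 'MSFT']
#pre-converting gives the same answer; the explicit conversion is just extra work
print(numpy.setdiff1d(numpy.array(raw), existing))  #['IBM' 'MSFT']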
Now we see the Python list causing tremendous drag on the overall algorithm, in fact by several orders of magnitude. However, by taking out the list implementation we see …
By removing the (now anomalously slow) list implementation, there are a couple of interesting takeaways. pandas, albeit slightly, performs better than numpy.ndarrays in all instances. I found this incredibly interesting, and here’s why: if you’ve ever played around with pandas objects versus numpy.ndarrays, you’ll know they’re just a lot “nicer”: nicer to explore, nicer to print to screen… frankly (as one of my good friends would put it), they’re just a more expressive data object. My question was always, “but how much speed and computational drag am I adding by using this more expressive data object?” And the answer is none; in fact, in this specific case, it’s faster than ndarrays.
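My own gloss on why the Series costs so little here: a pandas.Series is array-like, so numpy functions can pull the underlying data out of it directly.

s = pandas.Series(['AAA', 'BBB', 'CCC'])
arr = numpy.asarray(s)                 #the Series' data as a plain ndarray
print(type(arr))                       #<class 'numpy.ndarray'>
print(numpy.setdiff1d(s, ['BBB']))     #['AAA' 'CCC']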
Conclusion
This clearly isn’t a comprehensive analysis of “which way to implement a searching algorithm in Python.” However, this post tries to address (and answer) a couple of key questions that I, personally, had always wondered about:
- Am I gaining any operational efficiency / speed by transforming my Python list into another data type before I run an algorithm on it? **Most likely not.** It appears that, unless I’m planning on doing other indexing / manipulation on the list, there’s really no point in putting it into another datatype.
- Should I figure out how to solve the problem with list comprehensions, zipping, and if statements, or should I use the numpy / scipy libraries to do my heavy lifting? **Use numpy / scipy.** Although list-on-list manipulation (now this post is R-rated) works fine for small set sizes, as we get into bigger list sizes it definitely makes sense to use some of the more computationally robust libraries that have been built.
- I have an unhealthy love for pandas (no, not the animal, the library); how much time / speed am I wasting by using that as my primary datatype instead of numpy.ndarrays? **None; love pandas with reckless abandon, it might even make you faster.** Interestingly, the numpy functions ran faster on pandas.Series than on numpy.ndarrays, so now I know that any time I need to construct data types for easy indexing, manipulation, and transformation, my love of pandas is going to help me, not hurt me (this entire paragraph sounds really inappropriate).
As always, comments, questions, criticisms, and critiques are welcome.
Appendix: Code I used to test the Implementation
superset_list = [100, 1000, 10000, 100000, 1000000]
subset_list = [10, 100, 1000, 10000, 100000]
ret_dict = {'list_method': [], 'np_list': [], 'np_ndarray': [],
            'np_pandas': [], 'subset_size': [], 'superset_size': []}
#time every implementation at each (superset, subset) size combination
for i, n in enumerate(superset_list):
    for j, m in enumerate(subset_list):
        ticks, new_ticks = scratch.gen_tickers(n, m)
        ta = %timeit -o -n100 list_method(ticks, new_ticks)
        tb = %timeit -o -n100 np_list_method(ticks, new_ticks)
        tc = %timeit -o -n100 np_ndarray_method(ticks, new_ticks)
        td = %timeit -o -n100 pandas_method(ticks, new_ticks)
        ret_dict['list_method'].append(numpy.mean(ta.all_runs))
        ret_dict['np_list'].append(numpy.mean(tb.all_runs))
        ret_dict['np_ndarray'].append(numpy.mean(tc.all_runs))
        ret_dict['np_pandas'].append(numpy.mean(td.all_runs))
        ret_dict['subset_size'].append(m)
        ret_dict['superset_size'].append(n)
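To actually look at (or plot) the collected timings, something like the following works; this is my own sketch, since the charting code wasn’t part of the appendix:

import pandas

#one row per (superset, subset) combination, one column per implementation
results = pandas.DataFrame(ret_dict)
results = results.set_index(['superset_size', 'subset_size']).sort_index()
print(results)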