Introduction
I’m adding a financial database creation module to simplify (and greatly speed up) some of the ongoing calculations I’m making to generate portfolio visualizations. In the creation of this financial database module, I constantly need to check to see if data already exists for some security, and then if not, add it. However, I’d like to be able to throw a whole bunch of security tickers at the function, and then have it “figure out” which ones it needs to add, based on what’s already inside of the database.
Initially, I was just going to use list comprehensions, as “iterability” is already a terrific strength of Python’s. But then I got to thinking: should I use list comprehensions, or numpy, Python’s powerful array library? And if I was going to use numpy, should I use lists, numpy.ndarrays, or pandas.Series as inputs to the numpy function? Because I’m attempting to construct (what I’m hoping is) production-level code, I figured I’d test the answer numerically… and bore you all with the findings.
Setup
First, I wrote a little function to create random tickers of length 3 for the “superset” (the tickers already in the database) and the “subset” (those that I wanted to put into the database without overwriting any existing data). Here’s my adorable little function for that:
import numpy
import pandas
from functools import reduce   #reduce is not a builtin in Python 3

def gen_tickers(num_already, num_new):
    s = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    ticks = []
    #create the list of tickers already "in the database"
    for i in numpy.arange(num_already):
        ticks.append(reduce(lambda x, y: x + y,
                            map(lambda x: s[x], numpy.random.randint(0, 26, 3))))
    new_ticks = []
    #create the list of tickers to append
    for i in numpy.arange(num_new):
        new_ticks.append(reduce(lambda x, y: x + y,
                                map(lambda x: s[x], numpy.random.randint(0, 26, 3))))
    return ticks, new_ticks
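For a sense of what it produces (my example; the actual output is random every time), a call looks like this:

#example call -- output will differ since the tickers are random
ticks, new_ticks = gen_tickers(5, 3)
print(ticks)       #e.g. ['QJZ', 'ABD', 'KLM', 'XPT', 'RRO']
print(new_ticks)   #e.g. ['ZZQ', 'MNB', 'TYU']
#note: nothing stops a "new" ticker from colliding with an existing one --
#that overlap is exactly what the matching functions below have to detect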
Next I created 4 functions, each performing the same task: check which tickers in the subset are not already part of the superset.
1. Using List Comprehensions:
def list_method(ticks, new_ticks):
    bool_map = map(lambda x: x in ticks, new_ticks)
    return [tick for (tick, bool_val) in zip(new_ticks, bool_map) if not bool_val]
2. Using numpy with lists as inputs:
def np_list_method(ticks, new_ticks):
    return numpy.setdiff1d(new_ticks, ticks)
3. Using numpy with numpy.ndarrays as inputs:
def np_ndarray_method(ticks, new_ticks):
    return numpy.setdiff1d(numpy.array(new_ticks), numpy.array(ticks))
4. Using numpy with pandas.Series as inputs:
def pandas_method(ticks, new_ticks):
    return numpy.setdiff1d(pandas.Series(new_ticks), pandas.Series(ticks))
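Before timing anything, a quick sanity check (mine, not from the original post) that the four approaches agree; the only behavioral difference worth noting is that the numpy-based versions return a sorted, de-duplicated ndarray rather than a plain list:

ticks = ['AAA', 'BBB', 'CCC']        #pretend these are already in the database
new_ticks = ['DDD', 'BBB']           #candidates to insert
print(list_method(ticks, new_ticks))        #['DDD']
print(np_list_method(ticks, new_ticks))     #['DDD'] as a numpy array
print(np_ndarray_method(ticks, new_ticks))  #['DDD'] as a numpy array
print(pandas_method(ticks, new_ticks))      #['DDD'] as a numpy array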
Next I used IPython’s convenient %timeit functionality to time each function at every combination of subset and superset size, taking the average of the 3 best runs (unfortunately, %timeit doesn’t hand back the run-times for the entire sample, so I had to make do with the best 3).
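For reference, the -o flag makes %timeit return a TimeitResult object, so a “best three” average can be computed from its all_runs attribute; the snippet below is a sketch of that idea rather than the exact code I ran:

#each entry of all_runs is the total time for one repeat of the -n loops
res = %timeit -o -n100 np_list_method(ticks, new_ticks)
best_three = sorted(res.all_runs)[:3]
avg_per_call = numpy.mean(best_three) / res.loops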
Here are the different subset and superset sizes that I used:
Tickers in Subset | Tickers in Superset
---|---
10 | 100
100 | 1,000
1,000 | 10,000
10,000 | 100,000
100,000 | 1,000,000
For each superset size, I ran all of the subset sizes. So, for instance, with a superset of 1,000 tickers (i.e. 1,000 tickers “already in the database”), I ran subsets of 10, 100, 1,000, 10,000, and 100,000 tickers to find the tickers that weren’t already in the superset of 1,000 (i.e. the ones I wanted to put into the database without overwriting existing data).
Results
When dealing with a small subset, don’t bother with any sort of module / library imports (i.e. pandas or numpy); for this specific type of functionality, list comprehensions get the job done much faster.
… See prior comment
Now we begin to see some interesting results: no matter the size of the superset (except in the 1,000,000 case), the pure list implementation is the slowest. As we’ll see, this trend continues as we move to larger subsets.
At this point, the pure list implementation begins to cause some serious drag on the speed of the simple matching algorithm. So much so that I actually removed it from the chart so we could see what was happening with the numpy implementations.
Interestingly enough, the numpy functionality operates efficiently, across all subset sizes, on lists rather than on numpy.ndarrays or pandas.Series. This is likely because “constructing a new object” doesn’t really provide any greater efficiency for this algorithm: numpy.ndarrays and pandas.Series are much more efficient than pure Python lists when reshaping and indexing are involved, and that isn’t the case here.
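That squares with how numpy.setdiff1d is documented: it accepts array-like inputs and converts them internally, so converting to an ndarray (or building a Series) first just adds an extra pass over the data. A small illustration of my own:

raw = ['IBM', 'AAPL', 'MSFT']
existing = ['AAPL', 'GOOG']
#plain lists work directly -- setdiff1d converts array-like inputs itself
print(numpy.setdiff1d(raw, existing))               #['IBM' 'MSFT']
#pre-converting gives the same answer; the explicit conversion is just extra work
print(numpy.setdiff1d(numpy.array(raw), existing))  #['IBM' 'MSFT']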
Now we see the Python list causing tremendous drag on the overall algorithm, in fact by several orders of magnitude. However, by taking out the list implementation we see …
By removing the (now anomalously slow) list implementation, there are a couple of interesting takeaways. pandas, albeit slightly, performs better than numpy.ndarrays in all instances. I found this incredibly interesting, and here’s why: if you’ve ever played around with pandas objects versus numpy.ndarrays, you’ll know they’re just a lot “nicer”: nicer to explore, nicer to print to screen… frankly (as one of my good friends would put it), they’re just a more expressive data object. My question was always, “but how much speed and computational drag am I adding by using this more expressive data object?” And the answer is none; in fact, in this specific case, it’s faster than ndarrays.
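My own gloss on why the Series costs so little here: a pandas.Series is array-like, so numpy functions can pull the underlying data out of it directly.

s = pandas.Series(['AAA', 'BBB', 'CCC'])
arr = numpy.asarray(s)                 #the Series' data as a plain ndarray
print(type(arr))                       #<class 'numpy.ndarray'>
print(numpy.setdiff1d(s, ['BBB']))     #['AAA' 'CCC']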
Conclusion
This clearly isn’t a comprehensive analysis of “which way to implement a searching algorithm in Python.” However, this post tries to address (and answer) a couple of key questions that I, personally, had always wondered about:
- Am I gaining any operational efficiency / speed by transforming my Python list into another data type before I run an algorithm on it? **Most likely not.** It appears that, unless I’m planning on doing other indexing / manipulation on the list, there’s really no point in putting it into another datatype.
- Should I figure out how to solve the problem with list comprehensions, zipping, and if statements, or should I use the numpy / scipy libraries to do my heavy lifting? **Use numpy / scipy.** Although list-on-list manipulation (now this post is R-rated) works fine for small set sizes, as we get into bigger list sizes it definitely makes sense to use some of the more computationally robust libraries that have been built.
- I have an unhealthy love for pandas (no, not the animal, the library); how much time / speed am I wasting by using that as my primary datatype instead of numpy.ndarrays? **None; love pandas with reckless abandon, it might even make you faster.** Interestingly, the numpy functions ran faster on pandas.Series than on numpy.ndarrays, so now I know that any time I need to construct data types for easy indexing, manipulation, and transformation, my love of pandas is going to help me, not hurt me (this entire paragraph sounds really inappropriate).
As always, comments, questions, criticisms, and critiques are welcome.
Appendix: Code I used to test the Implementation
superset_list = [100, 1000, 10000, 100000, 1000000]
subset_list = [10, 100, 1000, 10000, 100000]
ret_dict = {'list_method': [], 'np_list': [], 'np_ndarray': [],
            'np_pandas': [], 'subset_size': [], 'superset_size': []}
#time every implementation at each (superset, subset) size combination
for i, n in enumerate(superset_list):
    for j, m in enumerate(subset_list):
        ticks, new_ticks = scratch.gen_tickers(n, m)
        ta = %timeit -o -n100 list_method(ticks, new_ticks)
        tb = %timeit -o -n100 np_list_method(ticks, new_ticks)
        tc = %timeit -o -n100 np_ndarray_method(ticks, new_ticks)
        td = %timeit -o -n100 pandas_method(ticks, new_ticks)
        ret_dict['list_method'].append(numpy.mean(ta.all_runs))
        ret_dict['np_list'].append(numpy.mean(tb.all_runs))
        ret_dict['np_ndarray'].append(numpy.mean(tc.all_runs))
        ret_dict['np_pandas'].append(numpy.mean(td.all_runs))
        ret_dict['subset_size'].append(m)
        ret_dict['superset_size'].append(n)
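To actually look at (or plot) the collected timings, something like the following works; this is my own sketch, since the charting code wasn’t part of the appendix:

import pandas

#one row per (superset, subset) combination, one column per implementation
results = pandas.DataFrame(ret_dict)
results = results.set_index(['superset_size', 'subset_size']).sort_index()
print(results)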