Data structures is a way to store and organize data. Obviously, some data structures are good in one set of problems, but terrible in other ones. The right choice of data structure can make your code faster, more efficient with memory and even more readable for other human beings.
In this article we are going to discuss the most important data structures in python. How they work, when to use them and how to use them. We will even cover a little bit of Big-O notation which helps to describe effectiveness of algorithms and data structures.
Primitive and non-primitive data structures
Data structures can be divided into primitive and non-primitive.
Primitive:
- Integers
- Floats
- Strings
- Booleans
Non-primitive:
- Lists
- Arrays
- Tuples
- Dicts
- Sets
Primitive data structures
There is not so much interesting information about primitive data structures. So I’m not going to talk about it. I would like to mention only one thing that is not obvious. Booleans are integers in python:
>>> isinstance(True, int) True >>> isinstance(False, int) True >>> True == 1 == 1.0 and False == 0 == 0.0 True >>> True + 5 6
The fact that you can add True and 5 and get 6 might not be obvious because you know that python is a strongly typed language. We can’t perform operations on different types in strongly typed languages:
>>> '25' + 5 TypeError: Can't convert 'int' object to str implicitly
But we can add True and 5 because True is 1 in python.
List
Now let’s talk about non-primitive data structures. The first one is list. List is a sequence of elements. It can store anything: numbers, strings, other lists, functions and etc. The fact that it can store anything is great but it has a disadvantage. This kind of list will require more memory.
Let’s take a look at some basic operations with list:
# create list >>> l = ['Noname', 42, True] # slice list >>> l[0:2] ['Noname', 42] >>> l[-1] True >>> l[:-1] ['Noname', 42] >>> l[:] ['Noname', 42, True] # add element to the end >>> l.append(66.6) >>> l ['Noname', 42, True, 66.6] # check if element is in list >>> 42 in l True # remove the last element >>> l.pop() 66.6 >>> l ['Noname', 42, True] # remove the first element >>> l.pop(0) 'Noname' >>> l [42, True]
We have used two methods here for removing element from list: pop()
and pop(0)
. What do you think will be faster? To remove first or last element?
In order to answer this question we need to understand how python implements list. In python list is a dynamic array. When you remove an element from the front of the dynamic array, all other elements are shifted one position closer to the beginning. For that reason it’s faster to remove the last element because we wouldn’t need to shift anything and slower to remove the first element because we would need to shift every other element.
An example which proves that pop()
is faster than pop(0)
:
import time def timing(func): def wrapper(): start = time.time() func() finish = time.time() print('Elapsed time: {}'.format(finish - start)) return wrapper @timing def f1(): elements = [1] * 100000 for _ in range(len(elements)): elements.pop() @timing def f2(): elements = [1] * 100000 for _ in range(len(elements)): elements.pop(0) f1() # Elapsed time: 0.007998943328857422 f2() # Elapsed time: 0.9251341819763184
Both functions create a list of 100000 elements, then remove these elements one by one. The only difference is that f1
removes elements from the end, f2
removes from the front. As you can see the execution time is very different.
If you don’t know about decorators and if function timing
seems confusing, don’t worry. It just prints the execution time of a function and it’s not necessary to understand how it works.
In most cases, it’s enough to just know what operation is faster or slower. But you should also be at least a little bit familiar with Big-O notation. Big-O is more “
Name | Notation | Comment |
Constant | O(1) | AWESOME!! |
Logarithmic | O(log N) | GREAT! |
Linear | O(N) | OKAY. |
Linearithmic | O(N log N) | UGH… |
Polynomial | O(N ^ 2) | SHITTY |
Exponential | O(2 ^ N) | HORRIBLE |
Factorial | O(N!) | WTF |
This table I’ve taken from itsy-bitsy-data-structures. It’s a short guide on data structures.
But how does Big-O correspond to our previous example with removing elements from the list? When we remove element from the end of the list, the Big-O is constant O(1). It means that no matter how many elements in list, it won’t affect the execution time. But when we remove element from the front of the list, the Big-O is linear O(N). Our execution time will depend on how many elements in list. The more elements, the slower it will be.
Big-O’s for different operations with list:
Name | List |
Access | O(1) |
Insert (end) | O(1) |
Insert (front) | O(N) |
Contains (in) | O(N) |
Remove (end) | O(1) |
Remove (front) | O(N) |
Conclusion here is that if you need to add or remove many elements from the front, regular list won’t be a good choice.
Now let’s take a look at another way to store sequences of elements that is more suitable for working with the front of the sequence.
Deque
Deque is a double-ended queue that supports adding and removing elements from either end in O(1) time.
A simple example that uses deque:
>>> from collections import deque >>> de = deque([6, 6, 6]) >>> de.appendleft(42) >>> de deque([42, 6, 6, 6]) >>> de.popleft() 42 >>> de deque([6, 6, 6])
Big-O’s for deque and list:
Name | List | Deque |
Access | O(1) | O(N) |
Insert (end) | O(1) | O(1) |
Insert (front) | O(N) | O(1) |
Contains (in) | O(N) | O(N) |
Remove (end) | O(1) | O(1) |
Remove (front) | O(N) | O(1) |
As you can see from the table, the access for deque is O(N). The reason is that deque is not optimized for working with the middle of a sequence. It’s only optimized for working with either end of a sequence.
Tuple
Another way to store a sequence of elements is to use tuples. Tuple is basically the same thing as list but with one difference. You can’t add or remove elements from tuples after initialization. It’s immutable data structure.
Some operations with tuple:
>>> t = ('Noname', 42, True) # slice tuple >>> t[:2] ('Noname', 42) # merge tuples (creates new tuple) >>> t + (66.6,) ('Noname', 42, True, 66.6) >>> t ('Noname', 42, True)
When to use tuples instead of lists? Tuples are better to use when you need to store a fixed number of elements. For example coordinates:
>>> point = (0, 25) >>> 'x: ' + point[0] 'x: 0' >>> 'y: ' + point[1] 'y: 25'
But there is a problem with tuples that you might not see right now. In this example with point it’s obvious that zero element is a coordinate of x and the first element is a coordinate of y. But not all the time it’s obvious what element is stored under each index:
>>> def get_name(): ... return 'Richard', 'Xavier', 'Jones' >>> name = get_name() >>> name[0], name[1], name[2] ('Richard', 'Xavier', 'Jones')
Named tuples to the rescue!
Namedtuple
Named tuple is just a tuple with named fields. It helps to make code more readable in some circumstances:
>>> from collections import namedtuple >>> def get_name(): ... name = namedtuple('name', 'first middle last') ... return name('Richard', 'Xavier', 'Jones') >>> name = get_name() >>> name.first, name.middle, name.last ('Richard', 'Xavier', 'Jones')
Dict
Another important data structure is dictionary. The difference between dictionary and list is that you access elements in dictionary by key, not by index. The dictionary optimized for accessing, changing and removing elements associated with some key:
Name | Dictionary |
Access | O(1) |
Insert | O(1) |
Remove | O(1) |
Contains (in) | O(1) |
How to create and add elements to a dictionary:
>>> books_by_authors = { ... 'Orwell': ['1984', 'Animal Farm'], ... 'Twain': ['Adventures of Huckleberry Finn', 'Letters from the Earth'], ... 'Fitzgerald': ['The Great Gatsby', 'The Rich Boy'], ... } >>> books_by_authors['Twain'] ['Adventures of Huckleberry Finn', 'Letters from the Earth'] >>> books_by_authors['Austen'] = ['Pride and Prejudice', 'Emma'] >>> books_by_authors { 'Orwell': ['1984', 'Animal Farm'], 'Twain': ['Adventures of Huckleberry Finn', 'Letters from the Earth'], 'Fitzgerald': ['The Great Gatsby', 'The Rich Boy'], 'Austen': ['Pride and Prejudice', 'Emma'] }
All of our dictionary keys here were strings. But string is not the only choice. The key of a dictionary can be any hashable object.
A hashable object is any object that doesn’t change during its lifetime. For example string or number. Not hashable object is obviously any object that changes. For example list. You can add or remove elements from the list.
But if we go a little bit deeper, a hashable object is an object that meets these three requirements:
- It supports
hash()
function via__hash__
method which returns the same hash value over the lifetime of the object. - It supports equality via
__eq__
method. - If
a == b
is True thenhash(a) == hash(b)
should also be True.
This is important to know if you want to create your own hashable object. Also, you might be asked to explain what is hashable object on the job interview because I was.
P.S. In case you were wondering, tuple is also a hashable object but only if it stores other hashable objects. If one of the elements of a tuple will be a list, the tuple won’t be hashable.
OrderedDict
Besides regular dictionary, python also has modified versions of dictionaries. One of them is ordered dict. It saves the order of key inserts into dictionary:
>>> from collections import OrderedDict >>> ordered_dict = OrderedDict() >>> ordered_dict[1] = 'one' >>> ordered_dict[2] = 'two' >>> ordered_dict[3] = 'three' >>> ordered_dict[4] = 'four' >>> ordered_dict[5] = 'five' >>> ordered_dict OrderedDict([(1, 'one'), (2, 'two'), (3, 'three'), (4, 'four'), (5, 'five')])
The interesting part here is that a regular dictionary also maintains the order of key inserts in python 3.6 and higher. But it’s an implementation detail, not a specification. So you should not rely upon this because it can be changed in future, which is unlikely. But anyway, if you want to save the order of key inserts you should do this explicitly through defining OrderedDict instance.
Defaultdict
Another modified version of dictionary is defaultdict. In regular dictionary, if you try to get a value by key and the key doesn’t exist, it will raise KeyError
:
>>> d = {} >>> d['names'] Traceback (most recent call last): File "<stdin>", line 1, in <module> KeyError: 'names'
Defaultdict, on the contrary, will not raise an error. If the key hasn’t been initialized in the dictionary, defaultdict will call a function that you passed in and assign its return value as the value of the new key:
>>> from collections import defaultdict >>> d = defaultdict(list) >>> d['names'] [] >>> d['names'].append('John Doe') >>> d['names'] ['John Doe']
It’s neat and sometimes can make your code a little bit shorter.
Set
The last data structure I will talk about is set. Set stores only unique elements and elements in set are unordered. Sets are usually used for:
- Eliminating duplicates
- Checking for membership of an element (x in s)
- Calculating different operations from set theory (intersaction, union and etc)
Python has two types of sets: regular set and frozenset. The only difference between regular set and frozenset is that you can’t add new elements to frozenset after its initialization.
Let’s take a look at some basic operations with set:
>>> letters = {'a', 'b', 'c'} >>> 'c' in letters True >>> letters.add('d') >>> letters {'c', 'b', 'd', 'a'}
You probably have already guessed it, but we have the same situation with sets as with dictionaries. Sets can only store hashable objects:
>>> letters.add(['d', 'k']) TypeError: unhashable type: 'list'
Now let’s take a look at how to create a frozenset. Like I said we can’t add new elements after its initialization:
>>> letters = frozenset({'a', 'b', 'c'}) >>> letters.add('d') AttributeError: 'frozenset' object has no attribute 'add'
The last important thing that you should probably know about sets is operations from set theory. In some circumstances, they can reduce a good chunk of code and make it simpler to read:
>>> s1 = {'a', 'b', 'c'} >>> s2 = {'a', 'b', 'd'} # intersection >>> s1 & s2 {'a', 'b'} >>> s1.intersection(s2) {'a', 'b'} # union >>> s1 | s2 {'a', 'b', 'c', 'd'} >>> s1.union(s2) {'a', 'b', 'c', 'd'} # difference >>> s1 - s2 {'c'} >>> s1.difference(s2) {'c'}
Conclusion
We’ve looked at the most important data structures in python. There are some other data structures like array.array
or queue.Queue
, but they are not that popular. For that reason I haven’t included them in this article.
Thank you! Good to know.
However, you didn’t discuss the memory required to host all those structures. Obviously, the DICT varieties require more space than the LIST/TUPLE ones.
Also O(1) in DICT must be a lot slower that the O(1) of a LIST, right?
Yeah, Agree. Dict requires more space.
I didn’t write anything about memory because my intention was to write a brief introduction to data structures. Writing about memory would be overkill. Also, I don’t know so much about memory myself.
Not sure about O(1) of list vs dict. I think that the difference in time is not that important but I might be wrong. You can share your thoughts about it.
Technically, the statement that the Boolean “True” is the integer 1, is False.
bool(x) is a subclass of `int` and as such can be invoked with class methods of `int` and will even refer to the same hash (“True.__hash__() is 1” => True). Yet the actual object is different, stored at 2 separate places in memory.
Overall very nice overview, thanks for a good read!
Yes, I agree. They are different objects. When I said booleans are integers, I meant that bool is a subclass of int. Thanks for the comment though. I really should have explained it better.
good article bro. Thanks
Thank you. The article clear for mind.
I would add about time complexity of sets and give a overview of common mistakes when using data structures (for example slow check of contains element in list -> use set).