"Python's Missing Batteries: Essential Libraries You're Missing Out On" - <div>
<p> Python is known to come with <i>"batteries included"</i>, thanks to its very extensive standard library, which includes many modules and functions that you would not expect to be there. However, there are many more <i>"essential"</i> Python libraries out there that you should know about and be using in all of your Python projects, and here's the list. </p>
<h2>General Purpose Utilities</h2>
<p> We will begin with a couple of general-purpose libraries that you can put to use in any type of project. The first one is <a href="https://boltons.readthedocs.io/en/latest/"><code class="inline">boltons</code></a>, which its docs describe as: </p>
<br><i><q>Boltons is a set of pure-Python utilities in the same spirit as, and yet conspicuously missing from, the standard library.</q></i>
<p> We would need a whole article just to go over every function and feature of <code class="inline">boltons</code>, but here are a couple of examples of handy functions: </p>
<pre><code class="language-python">
# pip install boltons
from boltons import jsonutils, timeutils, iterutils
from datetime import date

# Contents of input.jsonl:
# {"name": "John", "id": 1, "active": true}
# {"name": "Ben", "id": 2, "active": false}
# {"name": "Mary", "id": 3, "active": true}
with open('input.jsonl') as f:
    for line in jsonutils.JSONLIterator(f):  # Each line is parsed into a dict
        print(f"User: {line['name']} with ID {line['id']} is {'active' if line['active'] else 'inactive'}")
# User: John with ID 1 is active
# ...

start_date = date(year=2023, month=4, day=9)
end_date = date(year=2023, month=4, day=30)
for day in timeutils.daterange(start_date, end_date, step=(0, 0, 2)):  # step is (year, month, day)
    print(repr(day))
# datetime.date(2023, 4, 9)
# datetime.date(2023, 4, 11)
# datetime.date(2023, 4, 13)
# ...

data = {"deeply": {"nested": {"python": {"dict": "value"}}}}
iterutils.get_path(data, ("deeply", "nested", "python"))
# {'dict': 'value'}

data = {"id": "234411",
        "node1": {"id": 1234, "value": "some data"},
        "node2": {"id": "2352345",
                  "node3": {"id": "123422", "value": "more data"}}}
iterutils.remap(data, lambda p, k, v: (k, int(v)) if k == 'id' else (k, v))
</code></pre>
<p> While Python's standard library has a <code class="inline">json</code> module, it has no dedicated support for the JSON Lines (<code class="inline">.jsonl</code>) format. The first example shows how you can process a <code class="inline">.jsonl</code> file using <code class="inline">boltons</code>. </p>
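<p> For comparison, here is a minimal sketch of reading the same kind of data with only the standard library. The helper name <code class="inline">iter_jsonl</code> is made up for this illustration and is not part of <code class="inline">boltons</code> or <code class="inline">json</code>; it simply parses one JSON document per non-empty line. </p>

```python
import json
from typing import Any, Iterable, Iterator


def iter_jsonl(lines: Iterable[str]) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


# A list of strings stands in for an open file here.
sample = [
    '{"name": "John", "id": 1, "active": true}\n',
    '\n',
    '{"name": "Ben", "id": 2, "active": false}\n',
]
users = list(iter_jsonl(sample))
for user in users:
    print(user["name"], "active" if user["active"] else "inactive")
```

<p> Because the function accepts any iterable of strings, an open file object can be passed to it directly. </p>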
<p> The second example showcases the <code class="inline">boltons.timeutils</code> module, which lets you create date ranges. You can iterate over them and also set the <code class="inline">step</code> argument to - for example - get every other day. Again, this is something that's missing from Python's <code class="inline">datetime</code> module. </p>
<p> Finally, after a quick nested lookup with <code class="inline">get_path</code>, the last example uses the <code class="inline">remap</code> function from the <code class="inline">boltons.iterutils</code> module to recursively convert all <code class="inline">id</code> fields in a dictionary to integers. The <code class="inline">boltons.iterutils</code> module serves as a nice extension to the standard library's <code class="inline">itertools</code>. </p>
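<p> To make it clearer what "recursively" means for the <code class="inline">id</code> conversion, below is a hand-written sketch that does the same job for plain dicts and lists. The function <code class="inline">convert_ids</code> is invented for this illustration; it is not how <code class="inline">boltons</code> implements <code class="inline">remap</code>, which is far more general. </p>

```python
from typing import Any


def convert_ids(value: Any) -> Any:
    """Return a copy of a nested dict/list structure with every 'id' value cast to int.

    Hypothetical helper for illustration only.
    """
    if isinstance(value, dict):
        return {
            key: int(item) if key == "id" else convert_ids(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [convert_ids(item) for item in value]
    return value  # leaf values are returned unchanged


data = {
    "id": "234411",
    "node1": {"id": 1234, "value": "some data"},
    "node2": {"id": "2352345", "node3": {"id": "123422", "value": "more data"}},
}
converted = convert_ids(data)
print(converted)
```

<p> Writing this by hand is manageable for one rule, but every new rule or container type needs more branches, which is the bookkeeping a generic helper takes off your hands. </p>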
<p> Speaking of <code class="inline">iterutils</code> and <code class="inline">itertools</code>, the next great library you need to check out is <code class="inline">more-itertools</code>, which provides, well, <i>more <code class="inline">itertools</code></i>. Again, a discussion of <code class="inline">more-itertools</code> would warrant a whole article and... I wrote one, you can check it out <a href="https://martinheinz.dev/blog/16">here</a>. </p>
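<p> As a taste of the kind of helper the library packages up, below is a stdlib-only sketch of splitting an iterable into fixed-size chunks. The function name <code class="inline">chunks</code> is made up here; <code class="inline">more-itertools</code> ships a ready-made equivalent called <code class="inline">chunked</code>, so in a real project you would likely import that instead. </p>

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunks(iterable: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items (illustrative helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:  # the iterator is exhausted
            return
        yield chunk


result = list(chunks(range(7), 3))
print(result)
```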
<p> The last one in this category is <code class="inline">sh</code>, a replacement for the <code class="inline">subprocess</code> module. It's great if you find yourself orchestrating lots of other processes in Python: </p>
<pre><code class="language-python">
# https://pypi.org/project/sh/
# pip install sh
import sh

# Run any command in $PATH...
print(sh.ls('-la'))
# total 36
# drwxrwxr-x  2 martin martin  4096 apr  8 14:18 .
# drwxrwxr-x 41 martin martin 20480 apr  7 15:23 ..
# -rw-rw-r--  1 martin martin    30 apr  8 14:18 examples.py

with sh.contrib.sudo:
    # Do stuff using 'sudo'...
    ...

# Write to a file (the command and path here are only an example):
sh.ls('-la', _out='/tmp/listing.txt')

# Piping:
print(sh.wc('-l', _in=sh.ls('.', '-1')))
# Same as 'ls -1 | wc -l'
</code></pre>
<p> When we invoke <code class="inline">sh.some_command</code>, the <code class="inline">sh</code> library looks for a program with that name in your <code class="inline">$PATH</code>. If it finds one, it simply executes it for you. </p>
<p> In case you need to use <code class="inline">sudo</code>, you can use the <code class="inline">sudo</code> context manager from the <code class="inline">contrib</code> module, as shown in the second part of the snippet. </p>
<p> To write the output of a command to a file, you only need to pass the <code class="inline">_out</code> argument to the function. And finally, you can emulate pipes (<code class="inline">|</code>) by passing the output of one command as the <code class="inline">_in</code> argument of another. </p>
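<p> For contrast, this is roughly what a two-stage pipe looks like with plain <code class="inline">subprocess</code>. To keep the sketch portable, it pipes one Python child process into another rather than calling <code class="inline">ls</code> and <code class="inline">wc</code>; the two inline scripts are placeholders invented for this example. </p>

```python
import subprocess
import sys

# Stage 1: a child process that prints three lines (stands in for 'ls -1').
producer = subprocess.Popen(
    [sys.executable, "-c", "print('a.py'); print('b.py'); print('c.py')"],
    stdout=subprocess.PIPE,
)

# Stage 2: a child process that counts lines on stdin (stands in for 'wc -l').
consumer = subprocess.Popen(
    [sys.executable, "-c", "import sys; print(len(sys.stdin.readlines()))"],
    stdin=producer.stdout,
    stdout=subprocess.PIPE,
    text=True,
)
producer.stdout.close()  # the consumer now holds the only read end of the pipe
output, _ = consumer.communicate()
producer.wait()

line_count = int(output.strip())
print(line_count)
```

<p> Wiring the pipe, closing the parent's copy of it and waiting on both processes is exactly the boilerplate that the single <code class="inline">_in</code> argument hides. </p>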
<h2>Data Validation</h2>
<p> Another <i>"missing battery"</i> in Python's standard library is the category of data validation tools. One small library that fills this gap is called <a href="https://github.com/python-validators/validators"><code class="inline">validators</code></a>. It lets you validate common patterns such as emails, IPs or credit cards: </p>
<pre><code class="language-python">
# https://python-validators.github.io/validators/
# pip install validators
import validators

validators.email('someone@example.com')  # True
validators.ip_address.ipv4('')  # ValidationFailure(func=ipv4, args={'value': ''})
</code></pre>
<p> Next up is fuzzy string comparison - Python includes <code class="inline">difflib</code> for this, but the module could use some improvements, some of which can be found in the <code class="inline">thefuzz</code> library (previously known as <code class="inline">fuzzywuzzy</code>): </p>
<pre><code class="language-python">
# pip install thefuzz
from thefuzz import fuzz
from thefuzz import process

print(fuzz.ratio("Some text for testing", "text for some testing"))  # 76
print(fuzz.token_sort_ratio("Some text for testing", "text for some testing"))  # 100
print(fuzz.token_sort_ratio("Some text for testing", "some testing text for some text testing"))  # 70
print(fuzz.token_set_ratio("Some text for testing", "some testing text for some text testing"))  # 100

songs = [
    '01 Radiohead - OK Computer - Airbag.mp3',
    '02 Radiohead - OK Computer - Paranoid Android.mp3',
    '04 Radiohead - OK Computer - Exit Music (For a Film).mp3',
    '06 Radiohead - OK Computer - Karma Police.mp3',
    '10 Radiohead - OK Computer - No Surprises.mp3',
    '11 Radiohead - OK Computer - Lucky.mp3',
    '01 Radiohead - Pablo Honey - You.mp3',
    '02 Radiohead - Pablo Honey - Creep.mp3',
    '04 Radiohead - Pablo Honey - Stop Whispering.mp3',
    '06 Radiohead - Pablo Honey - Anyone Can Play Guitar.mp3',
    "10 Radiohead - Pablo Honey - I Can't.mp3",
    '13 Radiohead - Pablo Honey - Creep (Radio Edit).mp3',
    # ...
]
print(process.extract("Radiohead - No Surprises", songs, limit=1, scorer=fuzz.token_sort_ratio))
# [('10 Radiohead - OK Computer - No Surprises.mp3', 70)]
</code></pre>
<p> The appeal of the <code class="inline">thefuzz</code> library is its <code class="inline">*ratio</code> functions, which will <i>likely</i> do a better job than the builtin <code class="inline">difflib.get_close_matches</code> or <code class="inline">difflib.SequenceMatcher.ratio</code>. The snippet above shows their different uses. First we use the basic <code class="inline">ratio</code>, which computes a simple similarity score of two strings. After that we use <code class="inline">token_sort_ratio</code>, which ignores the order of tokens (words) in the strings when calculating the similarity. Finally, we test the <code class="inline">token_set_ratio</code> function, which instead ignores duplicate tokens. </p>
<p> We also use the <code class="inline">extract</code> function from the <code class="inline">process</code> module, which is an alternative to <code class="inline">difflib.get_close_matches</code>. This function looks for the best match(es) in a list of strings. </p>
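<p> To see why token order matters, the sketch below scores the same pair of strings with the standard library's <code class="inline">difflib.SequenceMatcher</code>, first as-is and then after sorting the words. The helpers <code class="inline">token_sorted</code> and <code class="inline">similarity</code> are written for this example only; they approximate the idea behind <code class="inline">token_sort_ratio</code> and are not <code class="inline">thefuzz</code>'s actual implementation. </p>

```python
from difflib import SequenceMatcher


def token_sorted(text: str) -> str:
    """Lowercase the text and sort its whitespace-separated words (illustrative)."""
    return " ".join(sorted(text.lower().split()))


def similarity(a: str, b: str) -> float:
    """Similarity between 0.0 and 1.0 as computed by difflib."""
    return SequenceMatcher(None, a, b).ratio()


first = "Some text for testing"
second = "text for some testing"

plain = similarity(first, second)
sorted_score = similarity(token_sorted(first), token_sorted(second))
print(f"plain: {plain:.2f}, token-sorted: {sorted_score:.2f}")
```

<p> Once the words are sorted, both strings become identical, so the score jumps to the maximum even though the plain comparison sees two different strings. </p>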
<p> If you're already using <code class="inline">difflib</code> and are wondering if you should use <code class="inline">thefuzz</code> instead, then make sure to check out an <a href="https://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/">article</a> by the author of the library that nicely demonstrates why builtin <code class="inline">difflib</code> is not always sufficient and why the above functions might work better. </p>
<p> There are also quite a few debugging and troubleshooting libraries that offer a better experience than what the standard library has. One such library is <code class="inline">stackprinter</code>, which brings a more helpful version of Python's built-in exception messages: </p>
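<pre><code class="language-python">
# pip install stackprinter
import stackprinter

stackprinter.set_excepthook(style='darkbg2')  # Replace the default exception hook

def do_stuff():
    some_var = "data"
    raise ValueError("Some error message")

do_stuff()
</code></pre>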
<p> All you need to do to use it is import it and set the exception hook. Then, running code that throws an exception will result in: </p>
<img src="https://i.imgur.com/F30tGqB.webp" alt="stackprinter"><p> I think this is a big improvement because it shows local variables and context - that is - things that you would otherwise need an interactive debugger for. Check out the <a href="https://github.com/cknd/stackprinter#readme">docs</a> for additional options, such as integration with logging or different color themes. </p>
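<p> To give a rough idea of what setting an exception hook means, here is a stdlib-only sketch that installs a custom <code class="inline">sys.excepthook</code> and lists the local variables of each function frame in the traceback. The function names <code class="inline">describe_locals</code>, <code class="inline">verbose_hook</code> and <code class="inline">demo</code> are invented for this illustration, and <code class="inline">stackprinter</code>'s real output is considerably richer. </p>

```python
import sys
import traceback
from types import TracebackType
from typing import List, Optional, Type


def describe_locals(tb: Optional[TracebackType]) -> List[str]:
    """Collect 'function: name = value' lines for every function frame in a traceback."""
    lines = []
    while tb is not None:
        frame = tb.tb_frame
        if frame.f_code.co_name != "<module>":  # module globals would be too noisy
            for name, value in frame.f_locals.items():
                lines.append(f"{frame.f_code.co_name}: {name} = {value!r}")
        tb = tb.tb_next
    return lines


def verbose_hook(exc_type: Type[BaseException], exc: BaseException,
                 tb: Optional[TracebackType]) -> None:
    """Print the normal traceback followed by the locals of each frame."""
    traceback.print_exception(exc_type, exc, tb)
    for line in describe_locals(tb):
        print(line, file=sys.stderr)


sys.excepthook = verbose_hook  # called for uncaught exceptions from now on


def do_stuff():
    some_var = "data"
    raise ValueError("Some error message")


def demo() -> List[str]:
    """Exercise the helper without letting the exception end the script."""
    try:
        do_stuff()
    except ValueError as error:
        return describe_locals(error.__traceback__)
    return []


captured = demo()
print("\n".join(captured))
```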
<p><code class="inline">stackprinter</code> helps with debugging issues that result in exceptions, but that's only a small fraction of the issues we all debug. Most of the time, troubleshooting bugs involves just putting <code class="inline">print</code> or <code class="inline">log</code> statements all over the code to see the current state of variables or to see whether the code was run at all. And there's a library that can improve upon basic <code class="inline">print</code>-style debugging: </p>
<pre><code class="language-python">
# pip install icecream
from icecream import ic

def do_stuff():
    some_var = "data"
    some_list = [1, 2, 3, 4]
    ic()  # No arguments: reports file, line, function and time
    return some_var

ic(do_stuff())
# ic| examples.py:46 in do_stuff() at 11:27:44.604
# ic| do_stuff(): 'data'
</code></pre>
<p> It's called <code class="inline">icecream</code> and it provides the <code class="inline">ic</code> function, which serves as a <code class="inline">print</code> replacement. You can use plain <code class="inline">ic()</code> (without arguments) to test which parts of the code were executed. Alternatively, you can use <code class="inline">ic(some_func(...))</code>, which will print the function/expression along with the return value. </p>
<p> For additional options and configuration, check out the <a href="https://github.com/gruns/icecream">GitHub README</a>. </p>
<p> While on the topic of debugging, we should probably also mention testing. I'm not going to tell you to use a test framework other than the builtin <code class="inline">unittest</code> (even though <code class="inline">pytest</code> is just better); instead, I want to show you 3 helpful little tools: </p>
<p> The first one is the <code class="inline">freezegun</code> library, which allows you to mock datetime: </p>
<pre><code class="language-python">
# pip install pytest freezegun
from freezegun import freeze_time
import datetime

# Run 'pytest' in shell
@freeze_time("2022-04-09")
def test_datetime():
    assert datetime.datetime.now() == datetime.datetime(2022, 4, 9)  # Passes!

def test_with():
    with freeze_time("Apr 9th, 2022"):
        assert datetime.datetime.now() == datetime.datetime(2022, 4, 9)  # Passes!

@freeze_time("Apr 9th, 2022", tick=True)
def test_time_ticking():
    assert datetime.datetime.now() > datetime.datetime(2022, 4, 9)  # Passes!
</code></pre>
<p> All you need to do is add a decorator that sets the date (or datetime) to the test function. Alternatively, you can also use it as a context manager (<code class="inline">with</code> statement). </p>
<p> Above you can also see that it allows you to specify the date in a friendly format. And finally, you can also pass in <code class="inline">tick=True</code>, which makes the clock start ticking from the given value instead of staying frozen. </p>
<p> Optionally - if you're using <code class="inline">pytest</code> - you can also install <code class="inline">pytest-freezegun</code> for Pytest-style fixtures. </p>
<p> The second essential testing library/helper you need is <code class="inline">dirty-equals</code>. It provides helper equality functions for comparing things that are <i>kind-of</i> equal: </p>
<pre><code class="language-python">
# pip install dirty-equals
from dirty_equals import IsApprox, IsNow, IsJson, IsPositiveInt, IsPartialDict, IsList, AnyThing
from datetime import datetime

assert 1.0 == IsApprox(1)
assert 123 == IsApprox(120, delta=4)  # close enough...

now = datetime.now()
assert now == IsNow  # just about...

assert '{"a": 1, "b": 2}' == IsJson
assert '{"a": 1}' == IsJson(a=IsPositiveInt)

assert {'a': 1, 'b': 2, 'c': 3} == IsPartialDict(a=1, b=2)  # Validate only subset of keys/values
assert [1, 2, 3] == IsList(1, AnyThing, 3)
</code></pre>
<p> Above is a sample of helpers that test whether two numbers or datetimes are approximately the same; whether something is valid JSON, including testing individual keys in that JSON; or whether a value is a dictionary or a list with specific keys/values. </p>
<p> And finally, the third helpful library is called <code class="inline">pyperclip</code> - it provides functions for copying and pasting to and from the clipboard. I find this very useful for debugging, e.g. to copy values of variables or error messages to the clipboard, but it has a lot of other use cases: </p>
<pre><code class="language-python">
# pip install pyperclip
# sudo apt-get install xclip
import pyperclip

try:
    print("Do something that throws error...")
    raise SyntaxError("Something went wrong...")
except Exception as e:
    pyperclip.copy(str(e))  # Put the exception message on the clipboard
# CTRL+V -> Something went wrong...
</code></pre>
<p> In this snippet we use <code class="inline">pyperclip.copy</code> to automatically copy exception message into clipboard, so that we don't have to copy it manually from program output. </p>
<p> The last category that deserves a mention is CLI tooling. If you build CLI applications in Python, then you can probably put <code class="inline">tqdm</code> to good use. This little library adds a progress bar to your programs: </p>
<pre><code class="language-python">
# pip install tqdm
from tqdm import tqdm, trange
from random import randint
from time import sleep

for i in tqdm(range(100)):
    sleep(0.05)  # 50ms per iteration
#   0%|          | 0/100 [00:00&lt;?, ?it/s]
# 100%|██████████| 100/100 [00:05&lt;00:00, 19.95it/s]

with trange(100) as t:
    for i in t:
        t.set_description('Step %i' % i)
        t.set_postfix(throughput=f"{randint(100, 999)/100.00}Mb/s", task=i)
        sleep(0.05)
# Step 60:  60%|██████    | 60/100 [00:03&lt;00:02, 19.78it/s, task=60, throughput=4.06Mb/s]
</code></pre>
<p> To use it, we simply wrap the iterable of a loop with <code class="inline">tqdm</code> and we get a progress bar in the program output. For more advanced cases you can use <code class="inline">trange</code> as a context manager and set additional options, such as a description or custom progress bar fields like throughput. </p>
<p> The module can also be executed as a shell command (<code class="inline">python -m tqdm</code>), which could be useful e.g. when creating backup with <code class="inline">tar</code> or looking for files with <code class="inline">find</code>. </p>
<p> See <a href="https://github.com/tqdm/tqdm#examples-and-advanced-usage">docs</a> for further advanced examples, as well as things like integrations with Pandas or Jupyter Notebook. </p>
<h2>Closing Thoughts</h2>
<p> With Python, you should always search for existing libraries before implementing anything yourself from scratch. Unless you're creating a particularly unusual or bespoke solution, chances are someone has already built and shared it on PyPI. </p>
<p> In this article I listed only general-purpose libraries that anyone can benefit from, but there are many other specialized ones - e.g. for ML or web development. I would recommend that you check out <a href="https://github.com/vinta/awesome-python">https://github.com/vinta/awesome-python</a>, which has a very extensive list of interesting libraries. You can also simply <a href="https://pypi.org/search/">search PyPI by category</a>, and I'm sure you will find something useful there. </p>
