Fri, May 10, 2024

A comparable counter for Python

In this post, we’re going to investigate Python’s powerful collections.Counter class, and how we can use it to, well, count stuff. We’ll find that comparisons with other Counter instances are supported, but unintuitive, and that scalar comparisons are not supported. We’ll take a plunge into special methods, a somewhat low-level concept that enables, among other things, comparisons between objects. With this knowledge, we’ll build a ComparableCounter class that can we can use to apply a threshold to the actual value-counts and filter a Counter in a more intuitive approach!

Table of Contents

Recap
- Basic use of Counter
- Comparing Counter
Comparing a Counter and a scalar
- Using Python built-ins
The ComparableCounter
- Python’s special methods
- Subclassing and overloading
Performance
Wrapping up
Outlook

Python’s collections module contains some powerful data structures for efficiently handling those cases where the general containers list, dict, or tuple are, well, too general. One of these specialized containers is Counter, a “dict subclass for counting hashable objects”. We’re going to look at how this container can be further specialized by making it comparable, so that you can retrieve all elements relative to a threshold. Enter the ComparableCounter!

Recap

Let’s kick this off with a brief recap of what collections.Counter actually does.

Basic use of `Counter`

Say you have a logfile of requests to a web-server, and want to count how many distinct IPs have hit your server. For simplicity, we’ll assume this log to look like <IP>,<URI>\n:

93.180.71.3,/downloads/product_1
93.180.71.3,/downloads/product_1
80.91.33.133,/downloads/product_1
217.168.17.5,/downloads/product_1
217.168.17.5,/downloads/product_2
93.180.71.3,/downloads/product_1

To instantiate a Counter, you could collect a list of IPs, and pass that iterable to Counter:


from collections import Counter

logfile = """93.180.71.3,/downloads/product_1
93.180.71.3,/downloads/product_1
80.91.33.133,/downloads/product_1
217.168.17.5,/downloads/product_1
217.168.17.5,/downloads/product_2
93.180.71.3,/downloads/product_1
"""

ips = []

for line in logfile.splitlines():
    ips.append(line.split(',')[0])

c = Counter(ips)
print(c)

… which results in:

Counter({'93.180.71.3': 3, '217.168.17.5': 2, '80.91.33.133': 1})

So far, so good! We have counted all occurrences of items in an iterable.

Comparing `Counter`

Since Python 3.10, Counters already provide rich comparison operations, but only to other Counters. Thus, if you have another Counter for, say, a different month, and you want to see if more IPs showed up at your server, you could do:

from collections import Counter

jan = Counter({'93.180.71.3': 3, '217.168.17.5': 2, '80.91.33.133': 1})
feb = Counter({'93.180.71.3': 30, '217.168.17.5': 20, '80.91.33.133': 10})
print(feb > jan)

True

Great. Also, not great:

from collections import Counter

jan = Counter({'93.180.71.3': 3, '217.168.17.5': 2, '80.91.33.133': 1})
feb = Counter({'93.180.71.3': 30, '217.168.17.5': 20})
print(feb > jan)

False

When comparing Counters, missing items on either side are assumed to have 0-counts, and the result of the comparison must hold for all items.

Comparisons with scalar values are, however, not supported:

from collections import Counter

jan = Counter({'93.180.71.3': 3, '217.168.17.5': 2, '80.91.33.133': 1})
try:
    print(jan > 2)
except Exception as e:
    print(e)

'>' not supported between instances of 'Counter' and 'int'

💥 Whoops!

Comparing a `Counter` and a scalar

As demonstrated above, comparison with scalar values isn’t supported. So to apply a threshold to the Counter and only return items above or below a cut-off, you need to get a little more creative.

Using Python built-ins

As Counter is a subclass of dict, we can use methods from the dict-interface to apply the threshold. We could just loop over the values, and keep only those meeting our criterion:

from collections import Counter

jan = Counter({'93.180.71.3': 3, '217.168.17.5': 2, '80.91.33.133': 1})

result = any(value > 2 for value in jan.values())
print(result)

True

Or, if you want the actual values, i.e., a new Counter:

from collections import Counter

jan = Counter({'93.180.71.3': 3, '217.168.17.5': 2, '80.91.33.133': 1})

jan_cut = Counter({k: v for k, v in jan.items() if v > 2})
print(jan_cut)

Counter({'93.180.71.3': 3})

Nice! A dict-comprehension to filter the original Counter, which is used to instantiate a new Counter. But there’s a certain elegance about writing Counter(...) > 2. It doesn’t require mentally parsing the dict-comprehension-with-conditional construct, and you can read it aloud as “where the counter is larger than two”, which I find more intuitive.

So let’s do that!

The `ComparableCounter`

Let’s think about this a bit: we can compare two instances of Counter. Why is that? How are comparisons between two objects implemented?

Python’s special methods

In order to provide operators such as ==, !=, >, <, etc., a class must implement the corresponding special methods pertaining to these operators.

Somewhat unsurprisingly, this is exactly what Counter does. In particular, note that the other side of the comparison being an instance of Counter is enforced through:

if not isinstance(other, Counter):
    return NotImplemented

All right, so comparisons are just special methods on classes. Could we not just, you know, overload the comparison operators to work with scalar types?

Subclassing and overloading

What we’d need is a subclass of Counter that defines the comparison operators for the scalar types we’re interested in. Also, let’s retain the functionality of the base-class when it comes to comparing two instances of our new class. This does mean that our comparisons will return bool | ComparableCounter, but I don’t see a real problem there (yet!).

Let’s give it a shot!

from collections import Counter

class ComparableCounter(Counter):
    def __eq__(self, other):
        if isinstance(other, Counter):
            return super().__eq__(other)
        return type(self)({k: v for k, v in self.items() if v == other})

    def __lt__(self, other):
        if isinstance(other, Counter):
            return super().__lt__(other)
        return type(self)({k: v for k, v in self.items() if v < other})

    def __le__(self, other):
        if isinstance(other, Counter):
            return super().__le__(other)
        return type(self)({k: v for k, v in self.items() if v <= other})

    def __gt__(self, other):
        if isinstance(other, Counter):
            return super().__gt__(other)
        return type(self)({k: v for k, v in self.items() if v > other})

    def __ge__(self, other):
        if isinstance(other, Counter):
            return super().__ge__(other)
        return type(self)({k: v for k, v in self.items() if v >= other})

cc1 = ComparableCounter(['a', 'a', 'b', 'c'])
cc2 = ComparableCounter({'a': 1, 'b': 0, 'c': 0})

print(cc1)
print(cc2)

# Retain the comparisons of the base-class; note that all
# values in cc1 are larger than cc2
print(f"All values in cc1 larger than cc2: {cc1 > cc2}")

# This is our new comparison: apply a threshold to cc1 and
# return a new ComparableCounter
print(cc1 > 1)

ComparableCounter({'a': 2, 'b': 1, 'c': 1})
ComparableCounter({'a': 1, 'b': 0, 'c': 0})
All values in cc1 larger than cc2: True
ComparableCounter({'a': 2})

This gives us the best of both worlds: we can still compare two instances of ComparableCounter, just like we can compare Counter. And we can compare all counts to a scalar, yielding a new, filtered ComparableCounter.

How does it perform compared to using built-ins? I’d expect there to be some overhead due to the isinstance(...) checks, and perhaps some method-dispatch on the object. But otherwise, we’ve just encapsulated the built-in variant in a class.

Performance

I’ll use Python’s timeit module, and run 5 repetitions of 1 million iterations. Each container is initialized with 1000 random integers in [0,100). As I don’t want the initialization to affect the measurements, I moved it to the setup code.

$ python -m timeit \
    -n 1000000 \
    -s "from collections import Counter; from numpy.random import randint; cc = Counter(randint(0, 100, 1000))" \
    "Counter({k: v for k, v in cc.items() if v > 2})"
1000000 loops, best of 5: 7.87 usec per loop

$ python -m timeit \
    -n 1000000 \
    -s "from baumanno import ComparableCounter; from numpy.random import randint; cc = ComparableCounter(randint(0, 100, 1000))" \
    "cc > 2"
1000000 loops, best of 5: 7.98 usec per loop

Not too shabby! A difference of 0.1 microseconds for an interface that is faster to write and easier to comprehend (in my opinion, at least).

Wrapping up

We started this off with a recap of what collections.Counter has in store for us, and how it allows us to compare two instances to tell us if a predicate holds for all values on both sides. Then, we found that Counter can’t be compared to scalar values, which would be nice, as it allows us to filter the container and return all values relative to a threshold. A dict-comprehension in Python can accomplish this, but it’s a bit wordy to write, and incurs the mental overhead of having to un-parse the conditional comprehension.

We took a look at how comparison operators can be implemented on a class using Python’s special methods, and devised a subclass of Counter that is comparable to scalars: ComparableCounter. All we did was encapsulate the dict-comprehension in a class, which incurred a negligible overhead in terms of time-performance compared to the raw dict-comprehension.

I’d encourage you to simply copy this code to your codebase if you find a use-case for filtering counts against a threshold, although I might plop this into PyPI just to see how that whole process works.

Outlook

While I believe list- and dict-comprehensions to be highly optimized by the interpreter already, I do have a feeling that since this is essentially a loop, we might be able to shave off a mu off the clock by implementing this thing in C++ or Rust.

As I’m eager to finally figure out how to write a Python module in a lower-level language, I’m planning a Part 2 and a Part 3 of this, investigating C++ and Rust, respectively.

Stay tuned, and thanks for coming this far :)

Go to top