<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://shsiddhant.github.io//feed.xml" rel="self" type="application/atom+xml" /><link href="https://shsiddhant.github.io//" rel="alternate" type="text/html" /><updated>2026-05-10T06:00:06+00:00</updated><id>https://shsiddhant.github.io//feed.xml</id><title type="html">Data Systems &amp;amp; What I Track</title><subtitle>Data pipelines, systems, and patterns in music and cricket</subtitle><author><name>Siddhant Sharma</name></author><entry><title type="html">Streaks Detection: Loops, Pandas, and a Cython Detour</title><link href="https://shsiddhant.github.io//streaks-loops-pandas-and-cython/" rel="alternate" type="text/html" title="Streaks Detection: Loops, Pandas, and a Cython Detour" /><published>2026-03-21T00:00:00+00:00</published><updated>2026-03-21T00:00:00+00:00</updated><id>https://shsiddhant.github.io//streaks-loops-pandas-and-cython</id><content type="html" xml:base="https://shsiddhant.github.io//streaks-loops-pandas-and-cython/"><![CDATA[<p>Back in October 2025, I was working on my project <a href="https://www.github.com/shsiddhant/memory.fm"><strong>memory.fm</strong></a>,
and I needed to solve a simple problem: detect <strong><em>listening streaks</em></strong>, i.e. consecutive listens of the same artist, album, or track in a listening history.</p>

<p>It sounds like a natural fit for <strong>pandas</strong>. And to be fair, there is a clean, readable, and idiomatic solution.</p>

<h2 id="the-pandas-approach">The Pandas Approach</h2>

<p>A common solution would look like this:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">pandas</span> <span class="k">as</span> <span class="n">pd</span>


<span class="k">def</span> <span class="nf">streak_gen_pandas</span><span class="p">(</span><span class="n">ser</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">,</span> <span class="n">min_length</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">10</span><span class="p">):</span>
    <span class="n">data</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">ser</span><span class="p">)</span>
    <span class="n">data</span><span class="p">[</span><span class="s">'start_of_streak'</span><span class="p">]</span> <span class="o">=</span> <span class="n">ser</span><span class="p">.</span><span class="n">ne</span><span class="p">(</span><span class="n">ser</span><span class="p">.</span><span class="n">shift</span><span class="p">())</span>
    <span class="n">data</span><span class="p">[</span><span class="s">'streak_id'</span><span class="p">]</span> <span class="o">=</span> <span class="n">data</span><span class="p">.</span><span class="n">start_of_streak</span><span class="p">.</span><span class="n">cumsum</span><span class="p">()</span>
    <span class="n">data</span><span class="p">[</span><span class="s">'streak_counter'</span><span class="p">]</span> <span class="o">=</span> <span class="n">data</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="s">'streak_id'</span><span class="p">).</span><span class="n">cumcount</span><span class="p">()</span> <span class="o">+</span> <span class="mi">1</span>
    <span class="k">return</span> <span class="n">data</span><span class="p">[</span><span class="n">data</span><span class="p">[</span><span class="s">"streak_counter"</span><span class="p">]</span> <span class="o">&gt;=</span> <span class="n">min_length</span><span class="p">]</span>
</code></pre></div></div>

<p>This works by:</p>

<ul>
  <li>
    <p>Detecting value changes using shift</p>
  </li>
  <li>
    <p>Assigning a group to each streak using <code class="language-plaintext highlighter-rouge">cumsum</code></p>
  </li>
  <li>
    <p>Counting within each group using <code class="language-plaintext highlighter-rouge">groupby</code> + <code class="language-plaintext highlighter-rouge">cumcount</code></p>
  </li>
</ul>
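
<p>To make the three steps concrete, here is a small walkthrough on a toy series. The data and the <code class="language-plaintext highlighter-rouge">summary</code> table are my own illustration, not part of the original function. It also shows that filtering on the counter keeps individual rows, so a per-streak view needs one more aggregation.</p>

```python
import pandas as pd

# Toy listening history (hypothetical data, not real scrobbles).
ser = pd.Series(["a", "a", "b", "b", "b", "a"], name="artist")

data = pd.DataFrame(ser)
# True wherever the value differs from the previous row (the first row is always True).
data["start_of_streak"] = ser.ne(ser.shift())
# The running count of starts gives every run its own id: 1, 1, 2, 2, 2, 3.
data["streak_id"] = data["start_of_streak"].cumsum()
# Position inside the run, starting at 1: 1, 2, 1, 2, 3, 1.
data["streak_counter"] = data.groupby("streak_id").cumcount() + 1

# One row per run instead of one row per listen.
summary = data.groupby("streak_id").agg(
    artist=("artist", "first"),
    length=("streak_counter", "max"),
)
print(summary[summary["length"] >= 3])
```

<p>With a threshold of 3, only the run of three <code class="language-plaintext highlighter-rouge">"b"</code> values survives.</p>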

<p>It’s expressive and easy to read.</p>

<p>On my own listening history, which had around <strong>40k</strong> scrobbles at the time, the above code took about <strong>15 ms</strong> per run.</p>

<h2 id="a-more-direct-approach">A More Direct Approach</h2>

<p>Before looking up pandas solutions, I had written a more direct algorithm:</p>

<ul>
  <li>Scan the sequence</li>
  <li>Detect where streaks start and end</li>
  <li>Compute streak lengths</li>
  <li>Record streaks above a minimum length</li>
</ul>
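
<p>As a reference point, the same four steps can be sketched on a plain Python list with <code class="language-plaintext highlighter-rouge">itertools.groupby</code>. This is not the implementation used in the post; the name <code class="language-plaintext highlighter-rouge">find_streaks</code> and the sample data are made up for illustration.</p>

```python
from itertools import groupby


def find_streaks(values, min_length=2):
    """Return (start, stop, length) for each run of equal values.

    `start` and `stop` are inclusive positions in `values`.
    """
    streaks = []
    position = 0
    for _, run in groupby(values):
        length = sum(1 for _ in run)
        if length >= min_length:
            streaks.append((position, position + length - 1, length))
        position += length
    return streaks


history = ["a", "a", "a", "b", "c", "c"]
print(find_streaks(history, min_length=2))  # [(0, 2, 3), (4, 5, 2)]
```

<p>A tiny reference like this is handy for checking the faster versions against known answers.</p>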

<p><strong>Create a boolean series by comparing consecutive values.</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">gen_bool_ser</span><span class="p">(</span>
    <span class="n">original_ser</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">,</span>
    <span class="n">min_length</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">10</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">:</span>  <span class="c1"># boolean Series; pd.Series[bool] may not be subscriptable at runtime</span>
    <span class="s">"""
    Generate a boolean series of comparisons of consecutive values.
    """</span>
    <span class="k">if</span> <span class="nb">int</span><span class="p">(</span><span class="n">min_length</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mi">2</span><span class="p">:</span>
        <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="s">"Streak Length must be at least 2"</span><span class="p">)</span>
    <span class="n">original_ser</span> <span class="o">=</span> <span class="n">original_ser</span><span class="p">.</span><span class="n">dropna</span><span class="p">()</span>
    <span class="n">ser</span> <span class="o">=</span> <span class="n">original_ser</span><span class="p">.</span><span class="n">eq</span><span class="p">(</span><span class="n">original_ser</span><span class="p">.</span><span class="n">shift</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)).</span><span class="n">iloc</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">ser</span>
</code></pre></div></div>

<p><strong>Search for streaks</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">loop</span><span class="p">(</span>
    <span class="n">ser</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">,</span>  <span class="c1"># boolean Series from gen_bool_ser</span>
    <span class="n">min_length</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">10</span>
    <span class="p">):</span>
    <span class="s">"""
    Loop to find streaks using the boolean comparison series.
    """</span>
    <span class="n">n</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">ser</span><span class="p">)</span>
    <span class="n">streaks</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">start</span> <span class="o">=</span> <span class="mi">0</span>
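    <span class="n">stop</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span>  <span class="c1"># start at -1 so the first search begins at index 0</span>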
    <span class="k">while</span> <span class="n">start</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">:</span>
        <span class="c1"># Search for streak start.
</span>        <span class="c1"># Skip the search region to after the previous streak
</span>        <span class="n">g</span> <span class="o">=</span> <span class="p">(</span><span class="n">k</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">stop</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">n</span><span class="p">)</span> <span class="k">if</span> <span class="n">ser</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="n">k</span><span class="p">]</span> <span class="p">)</span>
        <span class="n">start</span> <span class="o">=</span> <span class="nb">next</span><span class="p">(</span><span class="n">g</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">start</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span> <span class="c1"># if no streak is found, stop the loop
</span>            <span class="k">break</span>
        <span class="c1"># Search for streak end.
</span>        <span class="c1"># Skip the search region to after the streak start
</span>        <span class="n">h</span> <span class="o">=</span> <span class="p">(</span><span class="n">j</span> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">start</span><span class="p">,</span> <span class="n">n</span><span class="p">)</span> <span class="k">if</span> <span class="ow">not</span> <span class="n">ser</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="n">j</span><span class="p">])</span>
        <span class="n">stop</span> <span class="o">=</span> <span class="nb">next</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span>
        <span class="c1"># If the streak continues to the end of the series, record it and stop the loop
</span>        <span class="k">if</span> <span class="n">stop</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
            <span class="k">if</span> <span class="n">n</span> <span class="o">-</span> <span class="n">start</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">&gt;=</span> <span class="n">min_length</span><span class="p">:</span> <span class="c1"># check length
</span>                <span class="n">streaks</span><span class="p">.</span><span class="n">append</span><span class="p">([</span><span class="n">start</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">n</span> <span class="o">-</span> <span class="n">start</span> <span class="o">+</span> <span class="mi">1</span><span class="p">])</span>
            <span class="k">break</span>
        <span class="k">elif</span> <span class="n">stop</span><span class="o">-</span><span class="n">start</span><span class="o">+</span><span class="mi">1</span> <span class="o">&gt;=</span> <span class="n">min_length</span><span class="p">:</span> <span class="c1"># check length
</span>            <span class="n">streaks</span><span class="p">.</span><span class="n">append</span><span class="p">([</span><span class="n">start</span><span class="p">,</span> <span class="n">stop</span><span class="p">,</span> <span class="n">stop</span> <span class="o">-</span> <span class="n">start</span> <span class="o">+</span> <span class="mi">1</span><span class="p">])</span>
    <span class="k">return</span> <span class="n">streaks</span>

<span class="k">def</span> <span class="nf">streak_gen</span><span class="p">(</span><span class="n">orig_ser</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">,</span> <span class="n">min_length</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">10</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">loop</span><span class="p">(</span><span class="n">gen_bool_ser</span><span class="p">(</span><span class="n">orig_ser</span><span class="p">,</span> <span class="n">min_length</span><span class="p">),</span> <span class="n">min_length</span><span class="p">)</span>
</code></pre></div></div>

<p>This version is conceptually simple and runs in linear time. But it takes about <strong>200 ms</strong> per run on the same dataset.</p>

<p>That is over <strong>10 times</strong> slower than the pandas solution!</p>

<h2 id="why-this-was-slow">Why This Was Slow</h2>

<p>That was the primary question. At its core, streak detection is simple: find consecutive runs of identical values.</p>

<p>I wanted to understand what pandas could be doing better that made it so much faster. It also hurt my ego a little, so that was another motivation.</p>

<p>The issue, as I realised, wasn’t the algorithm, but the execution method. When I looked into the <em>why</em> more closely,
I found that <strong>pandas</strong> and <strong>NumPy</strong> actually use <strong>C-compiled</strong> code behind the scenes.</p>

<p>So that’s what it meant when I kept hearing about the <em>vectorized</em> methods!</p>

<p><strong>Key problems</strong></p>

<ul>
  <li>
    <p>Python-level loops</p>
  </li>
  <li>
    <p>Repeated <code class="language-plaintext highlighter-rouge">pd.Series.iloc[...]</code> access</p>
  </li>
  <li>
    <p>Python generator expressions</p>
  </li>
</ul>

<p>Even though the algorithm is <em>O(n)</em>, each iteration carries significant overhead.</p>
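
<p>A rough way to see that per-iteration overhead is to count the <code class="language-plaintext highlighter-rouge">True</code> values of a boolean series in three ways. This is a sketch on synthetic data of about the same size, not my actual benchmark; the function names are made up, and the timings will vary by machine.</p>

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic boolean data (hypothetical, not real scrobbles).
rng = np.random.default_rng(0)
ser = pd.Series(rng.integers(0, 2, 40_000).astype(bool))


def count_with_iloc(s):
    # One pandas indexing call per element.
    return sum(1 for k in range(len(s)) if s.iloc[k])


def count_with_list(s):
    # Convert once, then loop over plain Python bools.
    return sum(1 for flag in s.tolist() if flag)


def count_vectorized(s):
    # The whole loop runs inside compiled code.
    return int(s.to_numpy().sum())


for fn in (count_with_iloc, count_with_list, count_vectorized):
    seconds = timeit.timeit(lambda: fn(ser), number=3) / 3
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms")
```

<p>All three functions do the same <em>O(n)</em> work and return the same count; only the place where the loop executes differs.</p>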

<p>That led me to <strong>Cython</strong> and <strong>C-extensions</strong>, and my current solution.</p>

<h2 id="my-current-solution">My Current Solution</h2>

<p>The idea is to move the loop function into a Cython module and use typed memoryviews, which avoids much of the Python overhead.</p>

<h3 id="reduce-pandas-overhead">Reduce Pandas Overhead</h3>

<p>Instead of working directly with pandas objects, I convert the series to a NumPy array of integer codes and compute an integer (0/1) array that marks where consecutive values are equal.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">gen_streaks_bool</span><span class="p">(</span><span class="n">series</span><span class="p">:</span> <span class="n">pd</span><span class="p">.</span><span class="n">Series</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">np</span><span class="p">.</span><span class="n">ndarray</span><span class="p">:</span>
    <span class="s">"""
    Returns the boolean integer array of consecutive value comparisons.
    """</span>
    <span class="n">series</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">series</span><span class="p">.</span><span class="n">factorize</span><span class="p">()[</span><span class="mi">0</span><span class="p">])</span>
    <span class="n">bool_ser</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">((</span><span class="n">series</span><span class="p">[</span><span class="mi">1</span><span class="p">:]</span> <span class="o">==</span> <span class="n">series</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">]),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">intc</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">bool_ser</span>
</code></pre></div></div>
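
<p>For a feel of what <code class="language-plaintext highlighter-rouge">factorize</code> and the shifted comparison produce, here is a tiny example. The artist names are placeholders, and <code class="language-plaintext highlighter-rouge">same_as_next</code> is just a descriptive name for the returned array.</p>

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical); the labels are strings, as artist names would be.
artists = pd.Series(["Low", "Low", "Low", "Can", "Low", "Low"])

# factorize maps each distinct label to a small integer code.
codes, uniques = artists.factorize()
print(codes)          # [0 0 0 1 0 0]
print(list(uniques))  # ['Low', 'Can']

# Element k is 1 when listens k and k + 1 match.
same_as_next = (codes[1:] == codes[:-1]).astype(np.intc)
print(same_as_next)   # [1 1 0 0 1]
```

<p>The result is one element shorter than the input, because each element describes a pair of neighbours.</p>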

<h3 id="move-the-loop-to-cython">Move the Loop to Cython</h3>

<p>The main speedup came from moving the loop to Cython.</p>

<p>Importantly, the algorithm didn’t change; it now simply runs as compiled C code.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">cython</span><span class="p">.</span><span class="n">boundscheck</span><span class="p">(</span><span class="bp">False</span><span class="p">)</span>
<span class="o">@</span><span class="n">cython</span><span class="p">.</span><span class="n">wraparound</span><span class="p">(</span><span class="bp">False</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">streak_gen</span><span class="p">(</span>
    <span class="n">streak_start</span><span class="p">:</span> <span class="n">cython</span><span class="p">.</span><span class="n">bint</span><span class="p">[:],</span> <span class="c1"># Boolean int array.
</span>    <span class="n">min_length</span><span class="p">:</span> <span class="n">cython</span><span class="p">.</span><span class="nb">int</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">cython</span><span class="p">.</span><span class="nb">int</span><span class="p">[:,</span> <span class="p">:]:</span>
    <span class="n">n</span><span class="p">:</span> <span class="n">cython</span><span class="p">.</span><span class="nb">int</span> <span class="o">=</span> <span class="n">streak_start</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
    <span class="n">start</span><span class="p">:</span> <span class="n">cython</span><span class="p">.</span><span class="nb">int</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="n">stop</span><span class="p">:</span> <span class="n">cython</span><span class="p">.</span><span class="nb">int</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span>  <span class="c1"># -1 so the first search starts at index 0</span>
    <span class="n">i</span><span class="p">:</span> <span class="n">cython</span><span class="p">.</span><span class="nb">int</span> <span class="o">=</span> <span class="mi">0</span>

    <span class="n">streaks</span><span class="p">:</span> <span class="n">cython</span><span class="p">.</span><span class="nb">int</span><span class="p">[:,</span> <span class="p">:]</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">n</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">intc</span><span class="p">)</span>

    <span class="k">while</span> <span class="n">start</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">:</span>
        <span class="n">g</span> <span class="o">=</span> <span class="p">(</span><span class="n">k</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">stop</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">n</span><span class="p">)</span> <span class="k">if</span> <span class="n">streak_start</span><span class="p">[</span><span class="n">k</span><span class="p">])</span>
        <span class="n">start</span> <span class="o">=</span> <span class="nb">next</span><span class="p">(</span><span class="n">g</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">start</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">:</span>
            <span class="k">break</span>

        <span class="n">h</span> <span class="o">=</span> <span class="p">(</span><span class="n">j</span> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">start</span><span class="p">,</span> <span class="n">n</span><span class="p">)</span> <span class="k">if</span> <span class="ow">not</span> <span class="n">streak_start</span><span class="p">[</span><span class="n">j</span><span class="p">])</span>
        <span class="n">stop</span> <span class="o">=</span> <span class="nb">next</span><span class="p">(</span><span class="n">h</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>

        <span class="k">if</span> <span class="n">stop</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">:</span>
            <span class="k">if</span> <span class="n">n</span> <span class="o">-</span> <span class="n">start</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">&gt;=</span> <span class="n">min_length</span><span class="p">:</span>
                <span class="n">streaks</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">start</span>
                <span class="n">streaks</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">n</span>
                <span class="n">streaks</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">n</span> <span class="o">-</span> <span class="n">start</span> <span class="o">+</span> <span class="mi">1</span>
                <span class="n">i</span> <span class="o">+=</span> <span class="mi">1</span>
            <span class="k">break</span>
        <span class="k">elif</span> <span class="n">stop</span> <span class="o">-</span> <span class="n">start</span> <span class="o">+</span> <span class="mi">1</span> <span class="o">&gt;=</span> <span class="n">min_length</span><span class="p">:</span>
            <span class="n">streaks</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">start</span>
            <span class="n">streaks</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">stop</span>
            <span class="n">streaks</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">stop</span> <span class="o">-</span> <span class="n">start</span> <span class="o">+</span> <span class="mi">1</span>
            <span class="n">i</span> <span class="o">+=</span> <span class="mi">1</span>

    <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">asarray</span><span class="p">(</span><span class="n">streaks</span><span class="p">)[:</span><span class="n">i</span><span class="p">,</span> <span class="p">:]</span>
</code></pre></div></div>

<p>This code returns a 2D array with one row per streak: the start index, the stop index, and the streak length.</p>

<p>That brings runtime down to about <strong>10 ms</strong> per run!</p>

<p>Ever so slightly faster than the pandas solution.</p>

<h2 id="key-insight">Key Insight</h2>

<p>Performance might depend more on where your code runs than on what your algorithm is.</p>

<ul>
  <li>
    <p>Python loop → slow</p>
  </li>
  <li>
    <p>Pandas and NumPy (vectorized) → fast</p>
  </li>
  <li>
    <p>Cython (compiled loop) → as fast as the vectorized method</p>
  </li>
</ul>

<p>The algorithm stayed essentially the same throughout.</p>

<h2 id="notes">Notes</h2>

<p>This Cython version still uses generator expressions, so it’s not fully optimized. We could further improve performance by:</p>

<ol>
  <li>Using Cython syntax instead of Pure Python syntax</li>
  <li>Rewriting the loop with pure C-level iteration instead of generators</li>
</ol>
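
<p>The second point could look roughly like the sketch below, written here as ordinary Python so it runs anywhere. It is a hypothetical rewrite, not the code in memory.fm. My assumption is that with typed integers and a typed memoryview, Cython should be able to compile index loops like these into plain C loops.</p>

```python
def streak_gen_indexed(same_as_next, min_length=10):
    """Find streaks with plain index loops instead of generator expressions.

    `same_as_next[k]` is truthy when items k and k + 1 of the original
    sequence are equal. Rows are (start, stop, length), stop inclusive.
    """
    n = len(same_as_next)
    streaks = []
    k = 0
    while k < n:
        # Advance to the next position where a run begins.
        if not same_as_next[k]:
            k += 1
            continue
        start = k
        # Advance to the first mismatch, or to n if the run reaches the end.
        while k < n and same_as_next[k]:
            k += 1
        length = k - start + 1
        if length >= min_length:
            streaks.append((start, k, length))
    return streaks


flags = [1, 1, 0, 0, 1]  # comparisons for a, a, a, b, c, c
print(streak_gen_indexed(flags, min_length=2))  # [(0, 2, 3), (4, 5, 2)]
```

<p>The output uses the same convention as the functions above: indices refer to the original sequence, and the stop index is inclusive.</p>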

<p>The pandas solution is still more concise and perfectly reasonable for most use cases.</p>

<h2 id="conclusion">Conclusion</h2>

<p>It started as a simple annoyance: <em>Why is the ‘idiomatic’ Pandas way faster than my direct loop?</em></p>

<p>Conceptually, they aren’t very different. But digging into the why led me down the rabbit hole of <strong>CPython</strong>.
By using <strong>Cython</strong> to run my original loop in a C-compiled environment, I was able to match the performance of <strong>Pandas</strong> without needing a more clever approach.</p>

<p>Thus, I realised that while my algorithm was mathematically sound, it was being held back by the overhead of the Python interpreter itself!</p>

<p>A tiny bit of ego and curiosity might lead you to some very interesting things which you’d otherwise likely never discover.</p>]]></content><author><name>Siddhant Sharma</name></author><category term="Python" /><category term="Pandas" /><category term="Cython" /><summary type="html"><![CDATA[Back in October 2025, I was working on my project memory.fm, and I needed to solve a simple problem: detect listening streaks, i.e. consecutive listens of same artist, album, or track in a listening history.]]></summary></entry></feed>