GitHub Repos#

I don’t make a lot of commits to public repos outside of work, but they are still a good reflection of the languages I work with most. GitHub gives a repo-by-repo summary of the code inside, which is a great way to peek at a project, but wouldn’t it be nice to have an aggregate view that shows a candidate’s strengths at a glance?

It turns out some folks have already done this:

lucasdurand's Top Langs

But let’s give it a shot ourselves, the Python way.

The Python Way#

We will need to hit the GitHub API to retrieve information on our repos, aggregate it, and then plot it. We should be able to do all of this with requests, pandas, and plotly.

import pandas as pd
import requests
import plotly.express as px

import config
token = config.token # Bearer token we generated in GitHub with minimal permissions to get around rate limits
s = requests.Session() # a shared session to make requests
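The `config` module here is just a local secrets file that isn’t checked in. If you don’t have one, a reasonable alternative (my sketch, not part of the original setup) is to read the token from an environment variable; `GITHUB_TOKEN` is my choice of name, not something GitHub mandates:

```python
import os
import requests

# Fall back to an environment variable when there is no config.py.
# GITHUB_TOKEN is an assumed variable name, not defined by the post.
token = os.environ.get("GITHUB_TOKEN")  # None just means unauthenticated calls
s = requests.Session()  # a shared session to reuse connections
```

Unauthenticated calls still work; they are just subject to much lower rate limits.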

We are going to make a number of HTTP requests to the API that all require an auth token. Let’s make a simple requestor function that stores the required auth info.

def GitHub(*, token: str = None, session=requests):
    """A closure for hitting the GitHub API with a bearer token
    
    Parameters
    ----------
    token:str
        a Github bearer token
    session: requests.Session
        a session to use for subsequent calls, defaults to using ``requests``, which uses a new session for each call
    """
    def github(url:str):
        r = session.get(
            url,
            headers={"Authorization": f"Bearer {token}"} if token else {}
        )
        r.raise_for_status()
        js = r.json()
        return js
    return github

Now we can initialize the github requestor with the token secret and a fresh session. We will use this for all our API calls.

github = GitHub(token=token, session=s)

Get all the repos! The response includes lots of metadata on each repository, including further API URLs to call for additional information.

repos = pd.DataFrame(github("https://api.github.com/users/lucasdurand/repos"))
repos = repos[~repos.fork]  # drop forks: we only want code we wrote ourselves
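One caveat: the `/users/:user/repos` endpoint is paginated and returns at most 30 repos by default, so prolific accounts will be truncated. A sketch of a helper (the `all_pages` name is mine) that walks pages using the API’s documented `per_page` and `page` query parameters:

```python
def all_pages(github, url: str) -> list:
    """Collect list results across pages (a sketch; the GitHub API
    caps per_page at 100, so large accounts need several requests)."""
    results, page = [], 1
    while True:
        batch = github(f"{url}?per_page=100&page={page}")
        if not batch:  # an empty page means we've run out of results
            break
        results.extend(batch)
        page += 1
    return results
```

This works with any requestor that takes a URL and returns parsed JSON, like the `github` closure above.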

For each repo, we need to hit its languages_url and retrieve the number of bytes tied to each language. We’re almost there now!

languages = repos.languages_url.transform(github).apply(pd.Series)
languages.head(3)
   Python  Jupyter Notebook  JavaScript  CSS  HTML  TeX
2  1175.0               NaN         NaN  NaN   NaN  NaN
3     NaN           28733.0         NaN  NaN   NaN  NaN
4     NaN               NaN         NaN  NaN   NaN  NaN

Note

Jupyter Notebooks are a much more bloated file format than something like a .py file. To compare the formats fairly we would need to read each notebook and extract just the code inputs, but we will leave that for another time.
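For the curious, a rough sketch of what that extraction could look like: a .ipynb file is just JSON, so we can sum the byte length of only the code cells’ source. This is my own sketch, not part of the pipeline above:

```python
import json

def notebook_code_bytes(path: str) -> int:
    """Rough size of just the code inputs in a .ipynb file,
    ignoring outputs, markdown, and metadata (a sketch)."""
    with open(path, encoding="utf-8") as f:
        nb = json.load(f)
    return sum(
        len("".join(cell["source"]).encode("utf-8"))
        for cell in nb.get("cells", [])
        if cell.get("cell_type") == "code"
    )
```

Comparing that number to GitHub’s byte count for the same notebook would show how much of the file is overhead.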

The Visualization#

And now we can make the plot. In this case a bar chart makes a lot of sense. We are throwing away some information here (the repo-level breakdown) to keep the plot clean, and using a log axis to soften the fact that bytes of data aren’t the same as lines of code.

summary = (
    languages.melt(var_name="language", value_name="bytes")
    .groupby("language", as_index=False)
    .sum()
    .sort_values("bytes", ascending=False)
)
px.bar(
    summary,
    x="language",
    y="bytes", 
    color="language",
    log_y=True, 
    title="Top Languages by Filesize", 
    template="plotly_white"
)
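If you want a textual companion to the chart, the byte totals can also be normalized into percentage shares. A sketch with made-up numbers standing in for the real values in `summary`:

```python
import pandas as pd

# Made-up byte counts for illustration; the real ones come from the
# languages aggregation above.
shares = pd.DataFrame({
    "language": ["Python", "Jupyter Notebook", "JavaScript"],
    "bytes": [1175, 28733, 512],
})
shares["share"] = (shares["bytes"] / shares["bytes"].sum() * 100).round(1)
```

A percentage column reads well in a README badge or table, where a log-scaled bar chart can’t go.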