GitHub Repos#
I don’t make a lot of commits to public repos outside of work, but they are still a good reflection of the languages I work with most. GitHub gives a repo-by-repo breakdown of the languages used, which is a great way to peek inside a single project, but wouldn’t it be nice to have an aggregate view showing a candidate’s strengths at a glance?
It turns out some folks have already done this. But let’s give it a shot ourselves, the Python way.
The Python Way#
We will need to hit the GitHub API to retrieve information on our repos, aggregate that, and then plot it. We should be able to do all this with requests, pandas, and plotly.
import pandas as pd
import requests
import plotly.express as px
import config
token = config.token # Bearer token we generated in GitHub with minimal permissions to get around rate limits
s = requests.Session() # a shared session to make requests
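If you are following along without a config module of your own, one alternative is to read the token from an environment variable instead. A minimal sketch, assuming you have exported the token under the (arbitrary) name GITHUB_TOKEN:

import os

token = os.environ.get("GITHUB_TOKEN")  # GITHUB_TOKEN is an assumed name, not anything GitHub requires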
We are going to make a number of HTTP requests to the API, all of which require an auth token. Let’s make a simple requestor function that stores the required auth info.
def GitHub(*, token: str = None, session=requests):
    """A closure for hitting the GitHub API with a bearer token

    Parameters
    ----------
    token: str
        a GitHub bearer token
    session: requests.Session
        a session to use for subsequent calls, defaults to using ``requests``, which uses a new session for each call
    """
    def github(url: str):
        r = session.get(
            url,
            headers={"Authorization": f"Bearer {token}"} if token else {}
        )
        r.raise_for_status()  # fail fast on rate limits or bad credentials
        js = r.json()
        return js
    return github
Now we can initialize the github requestor with the token secret and a fresh session. We will use this for all our API calls.
github = GitHub(token=token, session=s)
Get all the repos! This gives us lots of metadata on each repository, including further API URLs to call for additional information.
repos = pd.DataFrame(github("https://api.github.com/users/lucasdurand/repos"))
repos = repos[~repos.fork]  # drop forks so we only count original work
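One caveat: this endpoint is paginated and returns 30 repositories per page by default, so accounts with more repos than that will come back incomplete. A sketch of how you might page through everything instead, using the standard per_page and page query parameters (the call above does not do this):

# ask for 100 repos at a time and keep going until a page comes back empty
all_repos = []
page = 1
while True:
    batch = github(f"https://api.github.com/users/lucasdurand/repos?per_page=100&page={page}")
    if not batch:
        break
    all_repos.extend(batch)
    page += 1
repos = pd.DataFrame(all_repos)
repos = repos[~repos.fork]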
For each repo, we need to hit its languages_url and retrieve the filesizes tied to each language. We’re almost there now!
languages = repos.languages_url.transform(github).apply(pd.Series)  # one API call per repo, expanded into one column per language
languages.head(3)
| | Python | Jupyter Notebook | JavaScript | CSS | HTML | TeX |
|---|---|---|---|---|---|---|
| 2 | 1175.0 | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | 28733.0 | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN |
Note
Jupyter Notebooks are a much more bloated file format than something like a .py file. In order to properly compare these file formats we would need to read every notebook and extract just the cell inputs, but we will leave that for another time.
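If you did want to go down that road, here is a rough sketch of the idea, assuming the notebooks are cloned locally as .ipynb files (none of this is used in the analysis above):

import json
from pathlib import Path

def notebook_code_bytes(path: Path) -> int:
    """Count only the bytes of code-cell inputs in a notebook, ignoring outputs and metadata."""
    nb = json.loads(path.read_text(encoding="utf-8"))
    return sum(
        len("".join(cell["source"]).encode("utf-8"))
        for cell in nb.get("cells", [])
        if cell.get("cell_type") == "code"
    )

notebook_code_bytes(Path("some-notebook.ipynb"))  # hypothetical filename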
The Visualization#
And now we can make the plot. In this case a bar chart makes a lot of sense. We are throwing away a bit of information here (the repo-level breakdown) to keep the plot clean, and using a log axis to address the fact that “bytes” of data isn’t the same as lines of code.
summary = (
    languages
    .melt(var_name="language", value_name="bytes")
    .groupby("language", as_index=False)
    .sum()
)
summary = summary.sort_values("bytes", ascending=False)
px.bar(
summary,
x="language",
y="bytes",
color="language",
log_y=True,
title="Top Languages by Filesize",
template="plotly_white"
)
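To drop the chart into a static page rather than view it in a notebook, one option is plotly’s standalone HTML export; a small sketch (the filename is my own choice):

fig = px.bar(
    summary,
    x="language",
    y="bytes",
    color="language",
    log_y=True,
    title="Top Languages by Filesize",
    template="plotly_white"
)
fig.write_html("top-languages.html", include_plotlyjs="cdn")  # keeps the file small by loading plotly.js from a CDN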