
Conda GitHub Statistics Dashboard

Summary

To give better insight into the operations of the conda open-source projects, we want to develop a statistics dashboard. This dashboard will be organized along four categories: volume, responsiveness, engagement, and lifecycle. The data for the dashboard will be drawn from GitHub, namely the issues and pull requests for individual projects. We will also use this data to construct a narrative, to be published as a blog post, about the management of conda as an open-source project and the turnaround it has seen in the last three years.

Creating a dashboard

Why do this?

The primary reason driving this work is gaining better insight into the operation of the community that manages and maintains projects under the "conda" organization on GitHub. The intended audience for such a dashboard includes, but is not limited to, the conda community, companies supporting this community (e.g. Anaconda, Quansight), and other open-source communities. By providing these insights, we wish to give those inside and outside the conda community a better sense of the work involved in keeping this ecosystem afloat. Beyond providing these statistics as a continuously updating dashboard, the data gathered may also be used in yearly reports about the health and well-being of the conda community.

What statistics can be shown?

The statistics shown on the dashboard will be split into four categories: volume, responsiveness, engagement, and lifecycle. By measuring and reporting the volume of incoming issues, readers will get a better sense of how many issues appear week after week and will also be able to see them split by category (e.g. bug, feature request, or support). Responsiveness will show how quickly project maintainers answer incoming issues and review new pull requests. Engagement measurements will report the number of comments on issues and pull requests and how many individuals participate in these conversations. Lifecycle measurements intend to show how long it takes for an issue or pull request to be completed.

Additionally, these categories can be combined to provide better insights, because looking at a single category may not tell the entire story. An example of this is combining lifecycle and engagement measurements. An issue or pull request with a high level of engagement (many comments and discussions) may also hint at a longer delivery time because of changes requested during review. We may wish to stratify issues and pull requests into categories of high, medium, and low engagement and then compare the delivery times among those groups.
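As a sketch of how such a combined view might be computed, the snippet below buckets issues by comment count and compares the median time-to-close per bucket. The field names mirror the GitHub REST API (`created_at`, `closed_at`, `comments`), but the records, thresholds, and bucket names here are illustrative assumptions, not agreed definitions.

```python
from datetime import datetime
from statistics import median

ISO = "%Y-%m-%dT%H:%M:%SZ"

# Hypothetical issue records shaped like GitHub REST API responses.
issues = [
    {"created_at": "2023-01-01T00:00:00Z", "closed_at": "2023-01-03T00:00:00Z", "comments": 1},
    {"created_at": "2023-01-01T00:00:00Z", "closed_at": "2023-01-20T00:00:00Z", "comments": 12},
    {"created_at": "2023-02-01T00:00:00Z", "closed_at": "2023-02-06T00:00:00Z", "comments": 5},
]

def engagement_bucket(comment_count):
    """Arbitrary cut-offs; the real thresholds would be agreed by the community."""
    if comment_count >= 10:
        return "high"
    if comment_count >= 4:
        return "medium"
    return "low"

def days_open(issue):
    """Whole days between an issue being opened and closed."""
    opened = datetime.strptime(issue["created_at"], ISO)
    closed = datetime.strptime(issue["closed_at"], ISO)
    return (closed - opened).days

# Group delivery times by engagement level, then take the median per bucket.
buckets = {}
for issue in issues:
    buckets.setdefault(engagement_bucket(issue["comments"]), []).append(days_open(issue))

medians = {bucket: median(times) for bucket, times in buckets.items()}
print(medians)  # {'low': 2, 'high': 19, 'medium': 5}
```

With real data, comparing these medians would tell us whether highly discussed issues actually take longer to resolve, or whether discussion and delivery time are unrelated.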

Below is a non-exhaustive list of the statistics that will be shown on the dashboard, organized by the aforementioned categories:

Volume

  • Total incoming issues/pull requests
  • Community submitted vs. contributor submitted issues/pull requests
  • Bug vs. feature vs. support issues

Responsiveness

  • Time to first response for issues
  • Time to first review/comment for pull requests

Engagement

  • Number of comments per issue/pull request
  • Number of reactions per issue/pull request
  • Diversity of participants (e.g. contributors vs. community)

Lifecycle

  • Duration from issue/pull request open to close
    • Broken down by "resolved" (i.e. fix implemented) or "closed" (no fix implemented)
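To make the responsiveness metrics above concrete, here is a minimal sketch of computing time to first response. It assumes issue records that already carry the timestamp of the first maintainer response; in practice that field would have to be derived from each issue's comment history, and the records shown are purely illustrative.

```python
from datetime import datetime

ISO = "%Y-%m-%dT%H:%M:%SZ"

# Hypothetical records: each issue paired with the timestamp of its first
# maintainer response, or None if nobody has responded yet.
issues = [
    {"created_at": "2023-03-01T09:00:00Z", "first_response_at": "2023-03-01T15:00:00Z"},
    {"created_at": "2023-03-02T09:00:00Z", "first_response_at": None},
    {"created_at": "2023-03-03T09:00:00Z", "first_response_at": "2023-03-05T09:00:00Z"},
]

def hours_to_first_response(issue):
    """Return hours from open to first response, or None for unanswered issues."""
    if issue["first_response_at"] is None:
        return None
    opened = datetime.strptime(issue["created_at"], ISO)
    responded = datetime.strptime(issue["first_response_at"], ISO)
    return (responded - opened).total_seconds() / 3600

# Unanswered issues are excluded here; a real dashboard should also report
# their count, since dropping them silently would overstate responsiveness.
times = [t for t in (hours_to_first_response(i) for i in issues) if t is not None]
print(sorted(times))  # [6.0, 48.0]
```

The same open/close arithmetic, applied to `closed_at` instead of the first response, would yield the lifecycle duration metric.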

How will this be implemented?

The implementation for such a dashboard should be set up to run automatically once a day. This could be accomplished by creating a GitHub Action that runs in the "infrastructure" project. This action could then produce a JSON file that would be read by a service that visualizes the data and produces a public-facing, browser-accessible report.
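As one possible shape for that JSON file, the sketch below writes a daily snapshot keyed by the four statistic categories. The schema, file name, field names, and numbers are all assumptions for illustration, not a decided format.

```python
import json

# Hypothetical daily snapshot for one repository; every value here is a
# placeholder, and the real numbers would come from the GitHub API queries.
snapshot = {
    "generated_at": "2023-04-01",
    "repository": "conda/conda",
    "volume": {"issues_opened": 42, "pull_requests_opened": 17},
    "responsiveness": {"median_hours_to_first_response": 9.5},
    "engagement": {"mean_comments_per_issue": 3.2},
    "lifecycle": {"median_days_open_to_close": 11},
}

# The visualization service would read this file to render the dashboard.
with open("stats-snapshot.json", "w") as fh:
    json.dump(snapshot, fh, indent=2)
```

Keeping one snapshot file per day (rather than overwriting a single file) would also give the yearly reports a ready-made time series to draw from.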

If we prefer not to build our own solution to this problem, we may wish to use an existing service for such reporting. The solutions we have already investigated include the following:

  • TODO: put list of services we've already checked out

Things to watch out for

While we want to create a dashboard to provide statistics on our operations, we do not want to confuse this with a "Key Performance Indicator" (KPI) dashboard, which is often used to ensure software development teams are meeting their service level agreements (SLAs). The conda community has no SLAs and, as an open-source project, will most likely never have them. The maintainers may wish to achieve quicker response times for issues and pull requests, but we acknowledge that there are no ideal values for these, and any targets set are purely arbitrary conventions agreed to by the conda community or the individual project teams. Above all, we want these statistics to be a starting place for conversations about how the community and the projects it maintains can be run more efficiently and effectively, not a measuring stick for rating the performance of a particular project.

Constructing a narrative

Statistics alone often fail to tell the entire story behind a project's management and its successes. Additionally, readers connect better with stories than with dry piles of numbers. Therefore, on top of creating a dashboard that gives a high-level overview of the projects in the conda organization, we also wish to use these numbers and figures to tell the story of conda, and specifically how it has been revived in recent years by implementing a truly open-source governance model.

By doing so, we hope to give all stakeholders in this community a very clear idea of where it has been, in order to give them a better sense of where it is going. Such a narrative will also help maintainers better elaborate their positions on how they wish the ecosystem to be managed going into the future.

The narrative itself should come in the form of a blog post published to conda.org. The work on the dashboard mentioned above should precede the work on the blog post so that we are able to get a better sense of the data we could use in it. The blog post should be largely driven by the data itself to illustrate the evolution of our project management, but it should also include important milestones from the last three years (e.g. implementing a better issue sorting and labeling process, or the creation of a formal governance document).