# Question1
1.- What is the most popular language?
--> Most popular in terms of number of repositories
A good indicator of how much a language is 'used', but it does not indicate how actively it is used. I first believed that repositories and languages could be traversed as arrays, but I was leaving out much of the data.
++Languages associated with a greater number of repositories++
The idea is to identify in how many projects the language is used. However, we do not know how actively it is used.
~~~~sql
select language_name, count(distinct repo_name) n_repos from (
select repo_name, language.name as language_name
from `bigquery-public-data.github_repos.languages` LEFT JOIN
unnest(language) as language
where language.name IS NOT NULL)
group by 1
order by 2 desc
~~~~
| Row | language_name | n_repos |
| -------- | -------- | -------- |
| 1 | JavaScript | 1102811 |
| 2 | CSS | 816623 |
| 3 | HTML | 781537 |
| 4 | Shell | 639491 |
| 5 | Python | 549967 |
<br> ++Popularity of languages of the 1000 most active repositories (most commits).++
The logic behind this question is to identify which are the languages associated with the repositories that have the most activity. These might be the 'most interesting' languages at the industry level.
~~~~sql
select language_name, count(language_name) as n_language_active_repo from
(select repo, count(commit) as n_commits
from `bigquery-public-data.github_repos.commits` LEFT JOIN
unnest(repo_name) as repo
where repo_name IS NOT NULL
group by 1
order by 2 desc
limit 1000
) as commits
left join
(select language_name, repo_name from (
select repo_name, language.name as language_name
from `bigquery-public-data.github_repos.languages` LEFT JOIN
unnest(language) as language
where language.name IS NOT NULL
)) as languages
on commits.repo = languages.repo_name
group by 1
order by 2 desc
~~~~
| Row | language_name | n_repos |
| -------- | -------- | -------- |
| 1 | JavaScript | 985 |
| 2 | CSS | 984 |
| 3 | HTML | 984 |
| 4 | Shell | 984 |
| 5 | Python | 984 |
If we look at the following five languages: Objective C, Assembly, Awk, UnrealScript and Makefile, we observe that most of them are oriented to interact more directly with the operating system, probably large collaborative projects such as GNU, Chrome, etc.
For these numbers to make sense, it is necessary to compare them with the total distribution of repositories that use those languages.
<br> ++The languages with the most commits by unique users++
This would allow us to know how specialized the languages are. Again, this amount must be weighted by the total number of commits a language receives (making sure to exclude those with too little support, as this would artificially influence its ranking).
This is an example of how I was trying to solve it without de-structuring the language or the repository.
~~~~sql
select language_name, count(distinct author) as n_authors from (
select author.email as author, commit, repo_name[ORDINAL(1)] as repo_name
from `bigquery-public-data.github_repos.commits`
WHERE
ARRAY_LENGTH(repo_name) > 0
) as repos
LEFT JOIN
(select
language[ORDINAL(1)].name as language_name, repo_name
from `bigquery-public-data.github_repos.languages`
where
ARRAY_LENGTH(language) > 0) as languages
ON repos.repo_name = languages.repo_name
WHERE
language_name IS NOT NULL
group by 1
order by 2 desc
~~~~
Destructuring, we obtain the following query:
~~~~sql
select language_name, count(distinct author_name) as unique_authors_commits from
(select repo, author.name as author_name
from `bigquery-public-data.github_repos.commits` LEFT JOIN
unnest(repo_name) as repo
where repo_name IS NOT NULL
) as commits
left join
(select language_name, repo_name from (
select repo_name, language.name as language_name
from `bigquery-public-data.github_repos.languages` LEFT JOIN
unnest(language) as language
where language.name IS NOT NULL
)) as languages
on commits.repo = languages.repo_name
group by 1
order by 2 desc
~~~~
| Row | language_name | n_repos |
| -------- | -------- | -------- |
| 1 | JavaScript | 1507512 |
| 2 | CSS | 1435131 |
| 3 | HTML | 1384847 |
| 4 | Shell | 1287231 |
| 5 | Python | 1018898 |
<br> Interesting to investigate:
- Languages that receive commits more frequently proportional to usage.
- The languages used by the most active commiters
- Number of daily commits from the commiters working on the most active repos.
- Average number of lines delivered by each commit per language.
---
2.- What is the repository with the most languages?
~~~~sql
select repo_name, count(distinct language_name) n_languages from (
select repo_name, language.name as language_name
from `bigquery-public-data.github_repos.languages` LEFT JOIN
unnest(language) as language
where language.name IS NOT NULL
)
group by 1
order by 2 desc
~~~~
| Row | repo_name | n_languages |
| -------- | -------- | -------- |
| 1 | polyrabbit/polyglot | 216 |
| 2 | nzavagli/UnrealPy | 159 |
| 3 | mollstam/UnrealPy | 159 |
| 4 | aswinpj/Pygments | 154 |
| 5 | objcio/pygments | 153 |
---
3.- What is the repository with the most copies?
~~~~sql
select regexp_extract(repo_name, r'/(.*)') as repo_tag, count(1) as forks from (
select repo_name, language.name as language_name
from `bigquery-public-data.github_repos.languages` LEFT JOIN
unnest(language) as language
where language.name IS NOT NULL
) as languages
group by 1
order by 2 desc
~~~~
| Row | repo_tag | forks |
| -------- | -------- | -------- |
| 1 | dotfiles | 26612 |
| 2 | blog | 10425 |
| 3 | linux | 10029 |
| 4 | kubernetes | 9406 |
| 5 | react-native | 7776 |
---
4.- Who is the most active commit author? In which repository and in what language?
(No one to one relation between commit and language)
++Most active author by total number of commits++
~~~~sql
select committer.email as comitter_email, count(distinct commit) as total_commits
from `bigquery-public-data.github_repos.commits` LEFT JOIN
unnest(repo_name) as repo
where repo_name IS NOT NULL
group by 1
order by 2 desc
~~~~
| Row | comitter_email | total_commits |
| -------- | -------- | -------- |
| 1 | 1505422b2465e9a84f6fdfaa161078890c593f06@github.com | 18039186 |
| 2 | 5c2bf027dc5cc7ef186314391646e49f6866dc00@chromium.org | 656639 |
| 3 | 1a5ddd760534b32e2190eeee14c30bd038fb6adf@duanefking.com | 597952 |
| 4 | 99b48da825c239c6ecd0a54ebfc11552d7ffb56f@mycroft.ai | 524438 |
| 5 | react-d9075c74411dff69aa26d076c852f9974f5cf7cf@gmail.com | 495786 |
<br> ++Most active authors in a single repo++
~~~~sql
select author.name as author_name, repo, count(distinct commit) as total_commits
from `bigquery-public-data.github_repos.commits` LEFT JOIN
unnest(repo_name) as repo
where repo_name IS NOT NULL
group by 1, 2
order by 3 desc
~~~~
| Row | comitter_email | repo | total_commits |
| -------- | -------- | -------- | -------- |
| 1 | 1505422b2465e9a84f6fdfaa161078890c593f06@github.com | shenzhouzd/update | 1140574 |
| 2 | 1a5ddd760534b32e2190eeee14c30bd038fb6adf@duanefking.com | duaneking/rockstar_test | 597893 |
| 3 | 5c2bf027dc5cc7ef186314391646e49f6866dc00@chromium.org | chromium/chromium | 530302 |
| 4 | 99b48da825c239c6ecd0a54ebfc11552d7ffb56f@mycroft.ai | Dark5ide/mycroft-core | 524423 |
| 5 | d9075c74411dff69aa26d076c852f9974f5cf7cf@gmail.com | spiderworthy/linux | 495786 |
The analysis on operating system-oriented languages is aligned with the authors and repositories with the most commits.
<br> ++Most active authors and their languages++
~~~~sql
select committer_email, repo, language_name, total_commits from (
select committer.email as committer_email, repo, count(distinct commit) as total_commits
from `bigquery-public-data.github_repos.commits` LEFT JOIN
unnest(repo_name) as repo
where repo_name IS NOT NULL
group by 1, 2
) as commits
left join
(
select repo_name, language.name as language_name
from `bigquery-public-data.github_repos.languages` LEFT JOIN
unnest(language) as language
where language.name IS NOT NULL
) as languages
on commits.repo = languages.repo_name
order by 4 desc
~~~~
| Row | comitter_email | repo | language_name | total_commits |
| -------- | -------- | -------- | -------- | -------- |
| 1 | 1505422b2465e9a84f6fdfaa161078890c593f06@github.com | | shenzhouzd/update | HTML | 1140574 |
| 2 | 1505422b2465e9a84f6fdfaa161078890c593f06@github | shenzhouzd/update | JavaScript | 1140574 |
| 3 | 1a5ddd760534b32e2190eeee14c30bd038fb6adf@duanefking.com | duaneking/rockstar_test | C++ | 597893 |
| 4 | 1a5ddd760534b32e2190eeee14c30bd038fb6adf@duanefking.com | duaneking/rockstar_test | Boo | 597893 |
| 5 | 1a5ddd760534b32e2190eeee14c30bd038fb6adf@duanefking.com | duaneking/rockstar_test | Brainfuck | 597893 |
---
5.- Create a timeline with the number of commits the author has made per week
~~~~sql
select week, AVG(TIMESTAMP_DIFF( TIMESTAMP_SECONDS(commits.t),
TIMESTAMP_SECONDS(commits.t_prev_comit), second)) as mean_s_between_commits, count(commits.commit) as n_commits
from (
select
commit,
committer.email as committer_email,
EXTRACT(WEEK FROM TIMESTAMP_SECONDS(committer.time_sec)) as week,
committer.time_sec as t,
LAG(committer.time_sec) OVER (PARTITION BY committer.email order by committer.email, committer.time_sec) as t_prev_comit
from `bigquery-public-data.github_repos.commits`
where committer.email IS NOT NULL
AND committer.email = '1505422b2465e9a84f6fdfaa161078890c593f06@github.com'
order by 1, 2) as commits
group by 1
order by 1
~~~~
| Row | week | mean_s_between_commits | n_commits |
| -------- | -------- | -------- | -------- |
| 1 | 0 | 15.30265025647644 | 123988 |
| 2 | 1 | 10.539383948756592 | 318480 |
| 3 | 2 | 11.869079308616044 | 335096 |
| 4 | 3 | 9.128337177149612 | 331268 |
| 5 | 4 | 11.992890667122357 | 339413 |
---
6.- Top 10 repositories that most use "pandas"
Queries for content take more than 2 TiB
---
7.- Identify which was the first repository to commit with "panda" and which was the last
~~~~sql
select first_value(repo) over (ORDER BY date asc) as first_repo_commit,
first_value(date) over (ORDER BY date asc) as first_date_commit,
first_value(message) over (ORDER BY date asc) as first_message_commit,
first_value(repo) over (ORDER BY date desc) as last_repo_commit,
first_value(date) over (ORDER BY date desc) as last_date_commit,
first_value(message) over (ORDER BY date desc) as last_message_commit
from (
SELECT repo, TIMESTAMP_SECONDS(committer.time_sec) as date, message,
regexp_extract(message, r'pandas') as pandas
FROM `bigquery-public-data.github_repos.commits` LEFT JOIN
unnest(repo_name) as repo
WHERE
committer.time_sec IS NOT NULL ) as t
WHERE
pandas != ''
~~~~
| Row | first_repo_commit | first_date_commit | first_message_commit | last_repo_commit | last_date_commit | last_message_commit |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| 1 | BeehiveForum/BeehiveForum | 2002-09-09 23:11:22 UTC | Fixing all the bugs Quigabyte has listed on pandasurf.com | ueshin/apache-spark | 2021-07-14 21:01:10 UTC | [SPARK-36125][PYTHON] Implement non-equality comparison operators between two Categoricals <br><br> ### What changes were proposed in this pull request? <br> Implement non-equality comparison operators between two Categoricals. <br> Non-goal: supporting Scalar input will be a follow-up task. <br><br> ### Why are the changes needed? <br> pandas supports non-equality comparisons between two Categoricals. We should follow that. |