Question1 - HackMD

# Question1 1.- What is the most popular language? --> Most popular in terms of number of repositories A good indicator of how much a language is 'used', but it does not indicate how actively it is used. I first believed that repositories and languages could be traversed as arrays, but I was leaving out much of the data. ++Languages associated with a greater number of repositories++ The idea is to identify in how many projects the language is used. However, we do not know how actively it is used. ~~~~sql select language_name, count(distinct repo_name) n_repos from ( select repo_name, language.name as language_name from `bigquery-public-data.github_repos.languages` LEFT JOIN unnest(language) as language where language.name IS NOT NULL) group by 1 order by 2 desc ~~~~ | Row | language_name | n_repos | | -------- | -------- | -------- | | 1 | JavaScript | 1102811 | | 2 | CSS | 816623 | | 3 | HTML | 781537 | | 4 | Shell | 639491 | | 5 | Python | 549967 | ++Popularity of languages of the 1000 most active repositories (most commits).++ The logic behind this question is to identify which are the languages associated with the repositories that have the most activity. These might be the 'most interesting' languages at the industry level. ~~~~sql select language_name, count(language_name) as n_language_active_repo from (select repo, count(commit) as n_commits from `bigquery-public-data.github_repos.commits` LEFT JOIN unnest(repo_name) as repo where repo_name IS NOT NULL group by 1 order by 2 desc limit 1000 ) as commits left join (select language_name, repo_name from ( select repo_name, language.name as language_name from `bigquery-public-data.github_repos.languages` LEFT JOIN unnest(language) as language where language.name IS NOT NULL )) as languages on commits.repo = languages.repo_name group by 1 order by 2 desc ~~~~ | Row | language_name | n_repos | | -------- | -------- | -------- | | 1 | JavaScript | 985 | | 2 | CSS | 984 | | 3 | HTML | 984 | | 4 | Shell | 984 | | 5 | Python | 984 | If we look at the following five languages: Objective C, Assembly, Awk, UnrealScript and Makefile, we observe that most of them are oriented to interact more directly with the operating system, probably large collaborative projects such as GNU, Chrome, etc. For these numbers to make sense, it is necessary to compare them with the total distribution of repositories that use those languages. ++The languages with the most commits by unique users++ This would allow us to know how specialized the languages are. Again, this amount must be weighted by the total number of commits a language receives (making sure to exclude those with too little support, as this would artificially influence its ranking). This is an example of how I was trying to solve it without de-structuring the language or the repository. ~~~~sql select language_name, count(distinct author) as n_authors from ( select author.email as author, commit, repo_name[ORDINAL(1)] as repo_name from `bigquery-public-data.github_repos.commits` WHERE ARRAY_LENGTH(repo_name) > 0 ) as repos LEFT JOIN (select language[ORDINAL(1)].name as language_name, repo_name from `bigquery-public-data.github_repos.languages` where ARRAY_LENGTH(language) > 0) as languages ON repos.repo_name = languages.repo_name WHERE language_name IS NOT NULL group by 1 order by 2 desc ~~~~ Destructuring, we obtain the following query: ~~~~sql select language_name, count(distinct author_name) as unique_authors_commits from (select repo, author.name as author_name from `bigquery-public-data.github_repos.commits` LEFT JOIN unnest(repo_name) as repo where repo_name IS NOT NULL ) as commits left join (select language_name, repo_name from ( select repo_name, language.name as language_name from `bigquery-public-data.github_repos.languages` LEFT JOIN unnest(language) as language where language.name IS NOT NULL )) as languages on commits.repo = languages.repo_name group by 1 order by 2 desc ~~~~ | Row | language_name | n_repos | | -------- | -------- | -------- | | 1 | JavaScript | 1507512 | | 2 | CSS | 1435131 | | 3 | HTML | 1384847 | | 4 | Shell | 1287231 | | 5 | Python | 1018898 | Interesting to investigate: - Languages that receive commits more frequently proportional to usage. - The languages used by the most active commiters - Number of daily commits from the commiters working on the most active repos. - Average number of lines delivered by each commit per language. --- 2.- What is the repository with the most languages? ~~~~sql select repo_name, count(distinct language_name) n_languages from ( select repo_name, language.name as language_name from `bigquery-public-data.github_repos.languages` LEFT JOIN unnest(language) as language where language.name IS NOT NULL ) group by 1 order by 2 desc ~~~~ | Row | repo_name | n_languages | | -------- | -------- | -------- | | 1 | polyrabbit/polyglot | 216 | | 2 | nzavagli/UnrealPy | 159 | | 3 | mollstam/UnrealPy | 159 | | 4 | aswinpj/Pygments | 154 | | 5 | objcio/pygments | 153 | --- 3.- What is the repository with the most copies? ~~~~sql select regexp_extract(repo_name, r'/(.*)') as repo_tag, count(1) as forks from ( select repo_name, language.name as language_name from `bigquery-public-data.github_repos.languages` LEFT JOIN unnest(language) as language where language.name IS NOT NULL ) as languages group by 1 order by 2 desc ~~~~ | Row | repo_tag | forks | | -------- | -------- | -------- | | 1 | dotfiles | 26612 | | 2 | blog | 10425 | | 3 | linux | 10029 | | 4 | kubernetes | 9406 | | 5 | react-native | 7776 | --- 4.- Who is the most active commit author? In which repository and in what language? (No one to one relation between commit and language) ++Most active author by total number of commits++ ~~~~sql select committer.email as comitter_email, count(distinct commit) as total_commits from `bigquery-public-data.github_repos.commits` LEFT JOIN unnest(repo_name) as repo where repo_name IS NOT NULL group by 1 order by 2 desc ~~~~ | Row | comitter_email | total_commits | | -------- | -------- | -------- | | 1 | 1505422b2465e9a84f6fdfaa161078890c593f06@github.com | 18039186 | | 2 | 5c2bf027dc5cc7ef186314391646e49f6866dc00@chromium.org | 656639 | | 3 | 1a5ddd760534b32e2190eeee14c30bd038fb6adf@duanefking.com | 597952 | | 4 | 99b48da825c239c6ecd0a54ebfc11552d7ffb56f@mycroft.ai | 524438 | | 5 | react-d9075c74411dff69aa26d076c852f9974f5cf7cf@gmail.com | 495786 | ++Most active authors in a single repo++ ~~~~sql select author.name as author_name, repo, count(distinct commit) as total_commits from `bigquery-public-data.github_repos.commits` LEFT JOIN unnest(repo_name) as repo where repo_name IS NOT NULL group by 1, 2 order by 3 desc ~~~~ | Row | comitter_email | repo | total_commits | | -------- | -------- | -------- | -------- | | 1 | 1505422b2465e9a84f6fdfaa161078890c593f06@github.com | shenzhouzd/update | 1140574 | | 2 | 1a5ddd760534b32e2190eeee14c30bd038fb6adf@duanefking.com | duaneking/rockstar_test | 597893 | | 3 | 5c2bf027dc5cc7ef186314391646e49f6866dc00@chromium.org | chromium/chromium | 530302 | | 4 | 99b48da825c239c6ecd0a54ebfc11552d7ffb56f@mycroft.ai | Dark5ide/mycroft-core | 524423 | | 5 | d9075c74411dff69aa26d076c852f9974f5cf7cf@gmail.com | spiderworthy/linux | 495786 | The analysis on operating system-oriented languages is aligned with the authors and repositories with the most commits. ++Most active authors and their languages++ ~~~~sql select committer_email, repo, language_name, total_commits from ( select committer.email as committer_email, repo, count(distinct commit) as total_commits from `bigquery-public-data.github_repos.commits` LEFT JOIN unnest(repo_name) as repo where repo_name IS NOT NULL group by 1, 2 ) as commits left join ( select repo_name, language.name as language_name from `bigquery-public-data.github_repos.languages` LEFT JOIN unnest(language) as language where language.name IS NOT NULL ) as languages on commits.repo = languages.repo_name order by 4 desc ~~~~ | Row | comitter_email | repo | language_name | total_commits | | -------- | -------- | -------- | -------- | -------- | | 1 | 1505422b2465e9a84f6fdfaa161078890c593f06@github.com | | shenzhouzd/update | HTML | 1140574 | | 2 | 1505422b2465e9a84f6fdfaa161078890c593f06@github | shenzhouzd/update | JavaScript | 1140574 | | 3 | 1a5ddd760534b32e2190eeee14c30bd038fb6adf@duanefking.com | duaneking/rockstar_test | C++ | 597893 | | 4 | 1a5ddd760534b32e2190eeee14c30bd038fb6adf@duanefking.com | duaneking/rockstar_test | Boo | 597893 | | 5 | 1a5ddd760534b32e2190eeee14c30bd038fb6adf@duanefking.com | duaneking/rockstar_test | Brainfuck | 597893 | --- 5.- Create a timeline with the number of commits the author has made per week ~~~~sql select week, AVG(TIMESTAMP_DIFF( TIMESTAMP_SECONDS(commits.t), TIMESTAMP_SECONDS(commits.t_prev_comit), second)) as mean_s_between_commits, count(commits.commit) as n_commits from ( select commit, committer.email as committer_email, EXTRACT(WEEK FROM TIMESTAMP_SECONDS(committer.time_sec)) as week, committer.time_sec as t, LAG(committer.time_sec) OVER (PARTITION BY committer.email order by committer.email, committer.time_sec) as t_prev_comit from `bigquery-public-data.github_repos.commits` where committer.email IS NOT NULL AND committer.email = '1505422b2465e9a84f6fdfaa161078890c593f06@github.com' order by 1, 2) as commits group by 1 order by 1 ~~~~ | Row | week | mean_s_between_commits | n_commits | | -------- | -------- | -------- | -------- | | 1 | 0 | 15.30265025647644 | 123988 | | 2 | 1 | 10.539383948756592 | 318480 | | 3 | 2 | 11.869079308616044 | 335096 | | 4 | 3 | 9.128337177149612 | 331268 | | 5 | 4 | 11.992890667122357 | 339413 | --- 6.- Top 10 repositories that most use "pandas" Queries for content take more than 2 TiB --- 7.- Identify which was the first repository to commit with "panda" and which was the last ~~~~sql select first_value(repo) over (ORDER BY date asc) as first_repo_commit, first_value(date) over (ORDER BY date asc) as first_date_commit, first_value(message) over (ORDER BY date asc) as first_message_commit, first_value(repo) over (ORDER BY date desc) as last_repo_commit, first_value(date) over (ORDER BY date desc) as last_date_commit, first_value(message) over (ORDER BY date desc) as last_message_commit from ( SELECT repo, TIMESTAMP_SECONDS(committer.time_sec) as date, message, regexp_extract(message, r'pandas') as pandas FROM `bigquery-public-data.github_repos.commits` LEFT JOIN unnest(repo_name) as repo WHERE committer.time_sec IS NOT NULL ) as t WHERE pandas != '' ~~~~ | Row | first_repo_commit | first_date_commit | first_message_commit | last_repo_commit | last_date_commit | last_message_commit | | -------- | -------- | -------- | -------- | -------- | -------- | -------- | | 1 | BeehiveForum/BeehiveForum | 2002-09-09 23:11:22 UTC | Fixing all the bugs Quigabyte has listed on pandasurf.com | ueshin/apache-spark | 2021-07-14 21:01:10 UTC | [SPARK-36125][PYTHON] Implement non-equality comparison operators between two Categoricals ### What changes were proposed in this pull request? Implement non-equality comparison operators between two Categoricals. Non-goal: supporting Scalar input will be a follow-up task. ### Why are the changes needed? pandas supports non-equality comparisons between two Categoricals. We should follow that. |

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.