# TODO: Table of Contents
# Chosen Issues
Team Senders chose to investigate two enhancement issues in Pandas: #44424 and #46357.
Below are the analysis and description of each enhancement, how Team Senders approached the problem, and the criteria used to confirm the correctness of the implementation.
# ENH: pd.Series.shift and .diff to accept a collection of numbers [#44424](https://github.com/pandas-dev/pandas/issues/44424)
### Labels:
- Algos
- **Enhancement**
- Needs Discussion
### Assigned Team Members:
- Andy PhyLim
- Cheryl Chen
### Description:
When using `pd.DataFrame.shift` and/or `pd.Series.shift`, an integer can be passed to indicate how many periods to shift the row(s)/column(s) of data. As a result, a single call can only shift the data by one value. The enhancement also allows the user to pass an iterable collection of integers (a list, for example) to perform several shifts in one call. This is convenient for generating lagged versions of columns or rows, a common use case in time series problems.
### Potential Impacted Areas:
The implementation for this issue directly impacts the `DataFrame` and `Series` classes, specifically the `shift` method. The change adds the option of accepting an iterable collection of integers. It does not change the core structure of any class or affect other existing functionality, because it extends the current behaviour rather than modifying it.
### Previous Behaviour:
```
>>> df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 5]})
>>> df.shift([0,1,2])
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/internals/blocks.py", line 1298, in shift
.
.
packages/pandas/core/array_algos/transforms.py", line 30, in shift
if periods > 0:
TypeError: '>' not supported between instances of 'list' and 'int'
```
An error is raised because passing a list is not a supported feature.
The same error occurs for the Series data structure.
The existing workaround is to loop over the shift values, call `shift` once per value, and concatenate the results.
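The loop-and-concatenate approach can be sketched with existing pandas calls. This is a minimal illustration for a Series, not the team's implementation; the variable names are chosen for the example.

```python
import pandas as pd

ser = pd.Series([1, 2, 3])
periods = [1, 0, -1]

# Shift once per period, then place the shifted copies side by side.
# `keys` labels each resulting column with the period that produced it.
lagged = pd.concat([ser.shift(p) for p in periods], axis=1, keys=periods)
print(lagged)
```

Each call to `shift` still receives a single integer, so this works without any change to pandas; the enhancement folds the loop into `shift` itself.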
### Expected Behaviour:
```
>>> df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 5]})
>>> df.shift([0,1,2])
a_0 b_0 a_1 b_1 a_2 b_2
0 0 3 NaN NaN NaN NaN
1 1 4 0.0 3.0 NaN NaN
2 2 5 1.0 4.0 0.0 3.0
>>> ser = pd.Series([1, 2, 3])
>>> ser.shift([1, 0, -1])
1 0 -1
0 NaN 1 2.0
1 1.0 2 3.0
2 2.0 3 NaN
```
### Design and Implementation:
The implementation first checks whether an integer or an iterable collection of integers was passed to the function. The valid iterables are lists, tuples and sets. When an integer is passed, `shift` behaves as it originally did. When a collection of integers is passed, the function loops over the collection, checks that each value is valid (an integer), and produces the shifted row(s)/column(s) for that value using the original `shift` behaviour. These new row(s)/column(s) are then concatenated with the built-in `concat` function to build the DataFrame returned to the user. If an invalid value is found in the process, a `TypeError` is raised to notify the user of the invalid input.
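The flow described above can be sketched as a standalone function. `shift_multi` is a hypothetical helper written for illustration only; it is not the code added to pandas, and the `_<period>` column suffix is an assumption taken from the expected output shown earlier.

```python
import pandas as pd


def shift_multi(df: pd.DataFrame, periods) -> pd.DataFrame:
    """Hypothetical sketch of the described flow; not the pandas implementation."""
    # A single integer keeps the original behaviour.
    if isinstance(periods, int):
        return df.shift(periods)
    if not isinstance(periods, (list, tuple, set)):
        raise TypeError(
            "periods must be an int or a list/tuple/set of ints, "
            f"got {type(periods).__name__}"
        )

    shifted = []
    for period in periods:
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(period, bool) or not isinstance(period, int):
            raise TypeError(f"Invalid period {period!r}: expected an integer")
        # Reuse the single-shift behaviour and tag each column with its period.
        shifted.append(df.shift(period).add_suffix(f"_{period}"))

    # An empty collection yields an unshifted copy of the input.
    if not shifted:
        return df.copy()
    return pd.concat(shifted, axis=1)


df = pd.DataFrame({"a": [0, 1, 2], "b": [3, 4, 5]})
print(shift_multi(df, [0, 1, 2]))
```

Note that a set has no guaranteed iteration order, so the column order of the result is only predictable for lists and tuples.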
#### Interactions with the existing codebase
The enhancement does not impact the current structure of the codebase. Since it is user-facing functionality, it does not affect current internal functionality.
### Test Suite Coverage:
- When an empty iterable is passed, no error should be raised and the `DataFrame` or `Series` object should be returned with the same, unshifted data
- The object returned by the function should not be the same object that was passed into the function, even when the data is the same (cloned object)
- Passing in a list/tuple/set of integers should not cause any errors; new row(s)/column(s) should be added and the data should be shifted accordingly
- The returned object should retain the column labels set by the user, if there are any
- When a collection containing a non-integer is passed, a `TypeError` should be raised with a clear message indicating which value in the collection caused the error
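The identity, equality and label checks in this list can be written with pandas' public testing helper. The sketch below uses the existing integer call `shift(0)` as a stand-in for the empty-iterable case, since the iterable form is the proposed behaviour; the frame and its labels are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2], "b": [3, 4, 5]}, index=["x", "y", "z"])

unshifted = df.shift(0)  # integer form: no data movement

# The result must be a different object from the input ...
assert unshifted is not df
# ... while holding the same data, index and column labels.
pd.testing.assert_frame_equal(unshifted, df)
assert list(unshifted.columns) == ["a", "b"]
```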
### Acceptance Testing:
The use case that is being raised in the issue is related to time series problems where users would create multiple lagged versions of a row/column.
| Category | ID | Test Cases | Pass/Fail | Result |
| --------- |:---:| ---------------------------- |:---------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| DataFrame | 1.1 | Passes in a positive integer | Pass | Returns a DataFrame of identical structure with each column shifted downwards by the value of the integer |
| | 1.2 | Passes in a negative integer | Pass | Returns a DataFrame of identical structure with each column shifted upwards by the value of the integer |
| | 1.3 | Passes in a list, tuple or set of positive integers | Pass | Returns a DataFrame with multiple lagged versions of each column, where each input column produces one column shifted downwards for each integer in the collection (an N-column DataFrame and a collection of size M result in NM columns in the output) |
| | 1.4 | Passes in a list, tuple or set of negative integers | Pass | Returns a DataFrame with multiple lagged versions of each column, where each input column produces one column shifted upwards for each integer in the collection (an N-column DataFrame and a collection of size M result in NM columns in the output) |
| Series | 2.1 | Passes in a positive integer | Pass | Returns a Series of identical structure with its values shifted downwards by the value of the integer |
| | 2.2 | Passes in a negative integer | Pass | Returns a Series of identical structure with its values shifted upwards by the value of the integer |
| | 2.3 | Passes in a list, tuple or set of positive integers | Pass | Returns a DataFrame with multiple lagged versions of the Series, with one column shifted downwards for each integer in the collection (a collection of size M results in M columns in the output) |
| | 2.4 | Passes in a list, tuple or set of negative integers | Pass | Returns a DataFrame with multiple lagged versions of the Series, with one column shifted upwards for each integer in the collection (a collection of size M results in M columns in the output) |
### Files Modified:
- `core/frame.py`: The implementation for DataFrames described above was added here under the `shift` method
- `core/series.py`: The implementation for Series described above was added here under the `shift` method
- `core/generic.py`: The documentation/user guide was modified to state that an iterable collection of integers is accepted (along with integers) as an argument to the `shift` method. It also provides an example of what the `shift` method does when an iterable is provided
- `tests/frame/methods/test_shift.py`: Implementation of test cases that cover the implemented solution to the issue. Refer to Test Suite Coverage for more details
### Acceptance Criteria:
- [x] Acceptance tests passed; columns shift based on a collection's values
- [x] Valid iterables (lists, tuples, sets) can be passed
- [x] Empty iterable returns the same (unshifted) object
- [x] Invalid values within the passed iterable raise a `TypeError`
- [x] Write passing test cases for the implementation
- [x] Ensure that all previous and new tests pass
- [x] Update documentation/user guide
# ENH: Add 'observed' parameter to value_counts [#46357](https://github.com/pandas-dev/pandas/issues/46357)
### Labels:
- **Enhancement**
- Algos
### Assigned Team Members:
- Eugene Koo
- Anthony Ding
- Angus Lee
- Justin Wang
### Description:
This issue proposes adding a new parameter, `observed`, to the existing `value_counts` method. The `observed` parameter accepts a boolean value: if `True`, only observed values are shown; otherwise, all values are shown. The goal is for `observed` in `value_counts` to operate in a similar way to the `observed` parameter of the `groupby` method. This extends `value_counts` with a filtering mechanism that users may find useful.
- `observed=True`: ignores empty instances and shows only observed values
- `observed=False`: shows empty instances, including unobserved values
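For reference, this is how the existing `observed` parameter of `groupby` behaves on a categorical key, which is the behaviour the new `value_counts` parameter is modelled on. The small frame below is made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        # Category "c" is declared but never appears in the data.
        "key": pd.Categorical(["a", "b"], categories=["a", "b", "c"]),
        "val": [1, 2],
    }
)

all_groups = df.groupby("key", observed=False).size()  # keeps the empty group "c"
seen_groups = df.groupby("key", observed=True).size()  # only groups present in the data

print(all_groups)
print(seen_groups)
```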
### Potential Impacted Areas:
This implementation will affect Series, as well as files where `groupby` is called, since they will need to accept a new parameter. Interestingly, the `observed` parameter already exists in the `groupby` implementation, but is always set to `False`. (The change may impact any code in the codebase that calls `groupby(observed=True)`, as the returned behaviour will no longer be the same as before, when `observed` was set to `False`.) We will modify the `value_counts()` function for `DataFrame` to include the `observed` parameter. We will also override the current `value_counts()` implementation inherited by Series and redefine it in the Series class to match the implementation in `DataFrame`.
### Previous Behaviour:
The previous behaviour treats `observed` as `False` in all cases, so empty instances and unobserved values are always displayed.
Currently, passing the `observed` parameter to the `value_counts` method raises an error, since the method does not accept that parameter.
### Expected Behaviour:
```
>>> import pandas as pd
>>> s = pd.Series(["a", "b", "c"], dtype="category").iloc[0:2]
>>> s
0 a
1 b
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> df = pd.DataFrame(s)
>>> df
0
0 a
1 b
```
#### For Series
When `observed=True`, any value with a count of 0 will be hidden from the output.
```
>>> s.value_counts(observed=False)
a 1
b 1
c 0
Name: count, dtype: int64
>>> s.value_counts(observed=True)
a 1
b 1
Name: count, dtype: int64
```
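Without the new parameter, the same Series output can be approximated by filtering out zero counts after the fact. `value_counts_observed` below is a hypothetical wrapper written for illustration; it is not the implementation added to pandas.

```python
import pandas as pd


def value_counts_observed(values: pd.Series, observed: bool = False) -> pd.Series:
    """Hypothetical wrapper: drop zero-count entries when observed is True."""
    counts = values.value_counts()
    if observed:
        # Unobserved categories are exactly the entries with a count of 0.
        counts = counts[counts > 0]
    return counts


s = pd.Series(["a", "b", "c"], dtype="category").iloc[0:2]
print(value_counts_observed(s, observed=False))
print(value_counts_observed(s, observed=True))
```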
#### For DataFrame
When `observed=True`, any value with a count of 0 will be hidden from the output.
```
>>> df.value_counts(observed=False)
a 1
b 1
c 0
Name: count, dtype: int64
>>> df.value_counts(observed=True)
a 1
b 1
Name: count, dtype: int64
```
#### For SeriesGroupBy
When `groupby` is called with `observed=True`, the `SeriesGroupBy` object passed to `value_counts` has values with a count of 0 hidden. Thus when `value_counts(observed=True)` evaluates, it hides the value `c` in all groups, `b` in group `0`, and `a` in group `1`. When `value_counts(observed=False)` evaluates, all empty values are listed.
When `groupby` is called with `observed=False` and `value_counts(observed=True)` evaluates, only the value `c` is hidden in groups `0` and `1`: across the whole groupby, `c` has a count of 0, while `a` and `b` each have a total count above 0, even though their count within group `0` or `1` may be 0.
```
>>> s.groupby(level=0,observed=False).value_counts(observed=False)
0 a 1
b 0
c 0
1 b 1
a 0
c 0
>>> s.groupby(level=0,observed=False).value_counts(observed=True)
0 a 1
b 0
1 b 1
a 0
>>> s.groupby(level=0,observed=True).value_counts(observed=False)
0 a 1
b 0
c 0
1 b 1
a 0
c 0
>>> s.groupby(level=0,observed=True).value_counts(observed=True)
0 a 1
1 b 1
```
#### For DataFrameGroupBy
`DataFrameGroupBy` works similarly to `SeriesGroupBy` above, but with more possible combinations since there can be more than one row and more than one column.
```
>>> df.groupby(level=0,observed=False).value_counts(observed=False)
0 a 1
1 a 0
b 1
0 b 0
c 0
1 c 0
>>> df.groupby(level=0,observed=False).value_counts(observed=True)
0 a 1
1 a 0
b 1
0 b 0
>>> df.groupby(level=0,observed=True).value_counts(observed=False)
0 a 1
1 a 0
b 1
0 b 0
c 0
1 c 0
>>> df.groupby(level=0,observed=True).value_counts(observed=True)
0 a 1
1 b 1
```
Notice that the rows for `c` are removed when `observed` is set to `True`; that is, only observed values are shown, and `c` is not observed (it never occurs in the data).
When the `observed` parameter is added to the `value_counts()` method, the output for Series/DataFrame drops the rows for unobserved values, and the output for SeriesGroupBy/DataFrameGroupBy drops the unobserved rows as well. For SeriesGroupBy/DataFrameGroupBy, however, `observed` can be passed to the `groupby` method as well as to the `value_counts` method. As the examples above show, with `value_counts(observed=False)` the output is the same regardless of the `observed` value passed to `groupby`.
### Design and Implementation
Changes are made to the `value_counts` method of both `NDFrame` and `GroupBy`: a new parameter is added and its corresponding functionality is implemented in the existing method.
More specifically, we moved the `value_counts()` method into the Series class to add the functionality without modifying the other classes. This also gives a more consistent API design and keeps coupling loose. The parameter is also added to the `GroupBy` `value_counts()` method, along with its implementation for both `SeriesGroupBy` and `DataFrameGroupBy`.
#### Interactions with the existing codebase
[Figures: old version and new version]
The enhancement does not impact the current structure of the codebase beyond Series. All other functionality is kept intact, as the default value of `observed=False` retains identical behaviour for the other data types that use `value_counts`.
### Test Suite Coverage:
- Values for observed parameter should be of type bool (or int: where 0 is False, otherwise, True). If not, an error should be raised.
- Creating an object such that all values are OBSERVED should output the same result regardless of the value of observed
- Creating an object such that one or more values are UNOBSERVED should output different results depending on the value of observed.
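The type rule in the first bullet can be illustrated with a small pure-Python check. `normalize_observed` is a hypothetical helper for this example, not a pandas function.

```python
def normalize_observed(observed):
    """Hypothetical check: accept a bool or int and return a bool."""
    # bool is a subclass of int, so this branch covers True/False and plain integers.
    if isinstance(observed, int):
        return bool(observed)
    raise TypeError(
        f"observed must be a bool or int, got {type(observed).__name__}"
    )


print(normalize_observed(True))  # True
print(normalize_observed(0))     # False
print(normalize_observed(5))     # True
```

Anything else, such as the string `"yes"`, raises a `TypeError` instead of being silently treated as truthy.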
### Acceptance Testing:
The use case raised in the issue is counting values where some values are unobserved: users want `value_counts` to optionally hide values with a count of 0, in the same way the `observed` parameter does for `groupby`.
| Category | ID | Test Cases | Pass/Fail | Result |
| --------- |:---:| ---------------------------- |:---------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Series | 1.1 | Passes in a Series using `value_counts` with `observed` as `True` | Pass | Returns a Series containing counts of unique values for observed values only |
| | 1.2 | Passes in a Series using `value_counts` with `observed` as `False` | Pass | Returns a Series containing counts of unique values for observed and unobserved values |
| DataFrame | 2.1 | Passes in a DataFrame using `value_counts` with `observed` as `True` | Pass | Returns a Series containing counts of unique values for observed values only |
| | 2.2 | Passes in a DataFrame using `value_counts` with `observed` as `False` | Pass | Returns a Series containing counts of unique values for observed and unobserved values |
| Series Groupby | 3.1 | Passes in a Series that is grouped using `groupby` with params `level=0` and `observed=False`, followed by `value_counts` with `observed=False` | Pass | Returns a Series containing counts of unique values for the specified groupings, for observed and unobserved values |
| | 3.2 | Passes in a Series that is grouped using `groupby` with params `level=0` and `observed=False`, followed by `value_counts` with `observed=True` | Pass | Returns a Series containing counts of unique values for the specified groupings, for observed values |
| | 3.3 | Passes in a Series that is grouped using `groupby` with params `level=0` and `observed=True`, followed by `value_counts` with `observed=False` | Pass | Returns a Series containing counts of unique values for the specified groupings, for observed and unobserved values |
| | 3.4 | Passes in a Series that is grouped using `groupby` with params `level=0` and `observed=True`, followed by `value_counts` with `observed=True` | Pass | Returns a Series containing counts of unique values for the specified groupings, for observed values only |
| DataFrame Groupby | 4.1 | Passes in a DataFrame that is grouped using `groupby` with params `level=0` and `observed=False`, followed by `value_counts` with `observed=False`. Checks if `observed=False` works as default for both functions | Pass | Returns a Series containing counts of unique values for the specified groupings, for observed and unobserved values |
| | 4.2 | Passes in a DataFrame that is grouped using `groupby` with params `level=0` and `observed=True`, followed by `value_counts` with `observed=False`. Checks if `observed=False` works as default for `value_counts` | Pass | Returns a Series containing counts of unique values for the specified groupings, for observed and unobserved values |
| | 4.3 | Passes in a DataFrame that is grouped using `groupby` with params `level=0` and `observed=False`, followed by `value_counts` with `observed=True`. Checks if `observed=False` works as default for `groupby` | Pass | Returns a Series containing counts of unique values for the specified groupings, for observed values |
| | 4.4 | Passes in a DataFrame that is grouped using `groupby` with params `level=0` and `observed=True`, followed by `value_counts` with `observed=True` | Pass | Returns a Series containing counts of unique values for the specified groupings, for observed values only |
### Files Modified:
- `pandas/core/base.py`: The `observed` keyword was added to this base method.
- `pandas/core/frame.py`: The implementation mentioned above for the DataFrame class was added to the `value_counts` method.
- `pandas/core/groupby/generic.py`: The implementation mentioned above for the SeriesGroupBy class was added to the `value_counts` method.
- `pandas/core/groupby/groupby.py`: The implementation mentioned above for the SeriesGroupBy and DataFrameGroupBy classes was added to the shared `value_counts` method.
- `pandas/core/series.py`: The implementation of `value_counts` with the `observed` keyword parameter was added here.
- `pandas/tests/frame/methods/test_value_counts.py`: The tests for the DataFrame `value_counts` method with the `observed` parameter were added here.
- `pandas/tests/groupby/test_value_counts.py`: The tests for the SeriesGroupBy and DataFrameGroupBy `value_counts` methods with the `observed` parameter were added here.
- `pandas/tests/series/methods/test_value_counts.py`: The tests for the Series `value_counts` method with the `observed` parameter were added here.
### Acceptance Criteria:
- [x] Add a new parameter to `value_counts` method in NDFrame (Series/DataFrame)
- [x] Add a new parameter to `value_counts` method in GroupBy (SeriesGroupBy/DataFrameGroupBy)
- [x] Write tests for this new functionality
- [x] Make sure all existing and new tests pass
# Development Process
### Task Tracking
The team decided to continue using GitHub Projects to manage the new issues found in Pandas.
GitHub Projects is a ticket board for managing and tracking tasks. The board uses six columns: Icebox, Backlog, In progress, In review, Done and Closed.
The “Icebox” column holds issues that have not been explored but have some potential to be put in the backlog. The “Backlog” column is for issues that have been explored and that the team has confirmed would be beneficial to complete. The “In progress” column contains issues with an assigned member actively working towards completion. The “In review” column is for issues with a pull request awaiting review. Lastly, the “Done” column is for issues that have been successfully merged to main, and the “Closed” column is for issues that the team has explored and considered not beneficial.
Note that since the issues required significant changes, the team decided to break each problem down into sub-tasks. These sub-tasks were labelled with OBS: and SHIFT: in the ticket title to represent the observed and shift issues respectively. The team also estimated the difficulty and priority of each task, as explained further below.
Below is an example of the board at some state:
[TODO: screenshot of the board]
### Assignment/Updates of Tasks
Once the team had narrowed the candidates down to two issues, it could delegate who would work on which enhancement. Since one of the features, namely the observed ticket, was more involved with the codebase, the team decided to assign more developers to it.
For the shift feature, Andy was responsible for the implementation while Cheryl was responsible for the tests/documentation. They would then review each other's work to make sure everything had been implemented as expected.
For the observed feature, Eugene and Anthony were responsible for the implementation while Angus and Justin worked on the testing. As for documentation, everyone was required to contribute to the documentation, as well as proofread and review each other's work.
These features were broken down into smaller tasks where the members discussed how implementation could occur, the files that play a role in the behaviour and the difficulty of implementing the change.
A pull request was opened upon completion of each issue. Near the end of the project (when reviews were requested on the PRs), the other members collectively reviewed each PR in an organized call.
After the review, the assigned members went through the feedback and improved the code or addressed the comments. Consensus was reached once everyone approved the PR to be merged to main.
### Meeting Log
March 20, 2023:
In-person meeting occurred after class at 4pm. The meeting discussed what was expected of deliverable four and how to find potential issues.
March 23, 2023:
Discord meeting occurred at 10pm, continuing the process of finding potential issues; more specifically, narrowing down the options to what seemed feasible. This resulted in finding the two issues discussed in this document.
March 25, 2023:
Discord meeting at 6pm occurred which allowed for the breakdown of tasks (work distribution, investigate/research possible implementations, estimation, etc)
Made sure everyone was still set up for implementing in Pandas.
March 28, 2023:
Discord meeting occurred at 5pm where progress on the tickets was discussed. Difficulties in the observed ticket were addressed and further investigation in the form of peer programming occurred for both issues.
March 30, 2023:
In-person meeting occurred after class at 4pm where more difficulties in the observed ticket were addressed. A PR was opened for the shift ticket and test case implementation was discussed.
April 3, 2023:
Discord meeting occurred at 10pm where the Series implementation was found to be incorrect; further work was required. More peer programming occurred on the observed ticket, eventually leading to explanations of the currently implemented solutions.
April 7, 2023:
Discord meeting at 12pm occurred where Andy and Cheryl finished the implementation of the Series shifting. New test cases were created and a PR was further reviewed.
All team members were then present around 6pm to go over the PR as a team.
April 8, 2023:
Quick Discord meeting occurred at 3pm for PR reviews, regression tests, and a final check of the deliverable document.