diff --git a/doc/source/getting_started/comparison/comparison_with_sas.rst b/doc/source/getting_started/comparison/comparison_with_sas.rst index eb11b75027909..b97efe31b8b29 100644 --- a/doc/source/getting_started/comparison/comparison_with_sas.rst +++ b/doc/source/getting_started/comparison/comparison_with_sas.rst @@ -308,8 +308,8 @@ Sorting in SAS is accomplished via ``PROC SORT`` String processing ----------------- -Length -~~~~~~ +Finding length of string +~~~~~~~~~~~~~~~~~~~~~~~~ SAS determines the length of a character string with the `LENGTHN `__ @@ -327,8 +327,8 @@ functions. ``LENGTHN`` excludes trailing blanks and ``LENGTHC`` includes trailin .. include:: includes/length.rst -Find -~~~~ +Finding position of substring +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ SAS determines the position of a character in a string with the `FINDW `__ function. @@ -342,19 +342,11 @@ you supply as the second argument. put(FINDW(sex,'ale')); run; -Python determines the position of a character in a string with the -``find`` function. ``find`` searches for the first position of the -substring. If the substring is found, the function returns its -position. Keep in mind that Python indexes are zero-based and -the function will return -1 if it fails to find the substring. - -.. ipython:: python - - tips["sex"].str.find("ale").head() +.. include:: includes/find_substring.rst -Substring -~~~~~~~~~ +Extracting substring by position +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ SAS extracts a substring from a string based on its position with the `SUBSTR `__ function. @@ -366,17 +358,11 @@ SAS extracts a substring from a string based on its position with the put(substr(sex,1,1)); run; -With pandas you can use ``[]`` notation to extract a substring -from a string by position locations. Keep in mind that Python -indexes are zero-based. +.. include:: includes/extract_substring.rst -.. 
ipython:: python - tips["sex"].str[0:1].head() - - -Scan -~~~~ +Extracting nth word +~~~~~~~~~~~~~~~~~~~ The SAS `SCAN `__ function returns the nth word from a string. The first argument is the string you want to parse and the @@ -394,20 +380,11 @@ second argument specifies which word you want to extract. ;;; run; -Python extracts a substring from a string based on its text -by using regular expressions. There are much more powerful -approaches, but this just shows a simple approach. - -.. ipython:: python - - firstlast = pd.DataFrame({"String": ["John Smith", "Jane Cook"]}) - firstlast["First_Name"] = firstlast["String"].str.split(" ", expand=True)[0] - firstlast["Last_Name"] = firstlast["String"].str.rsplit(" ", expand=True)[0] - firstlast +.. include:: includes/nth_word.rst -Upcase, lowcase, and propcase -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Changing case +~~~~~~~~~~~~~ The SAS `UPCASE `__ `LOWCASE `__ and @@ -427,27 +404,13 @@ functions change the case of the argument. ;;; run; -The equivalent Python functions are ``upper``, ``lower``, and ``title``. +.. include:: includes/case.rst -.. ipython:: python - - firstlast = pd.DataFrame({"String": ["John Smith", "Jane Cook"]}) - firstlast["string_up"] = firstlast["String"].str.upper() - firstlast["string_low"] = firstlast["String"].str.lower() - firstlast["string_prop"] = firstlast["String"].str.title() - firstlast Merging ------- -The following tables will be used in the merge examples - -.. ipython:: python - - df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)}) - df1 - df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": np.random.randn(4)}) - df2 +.. include:: includes/merge_setup.rst In SAS, data must be explicitly sorted before merging. Different types of joins are accomplished using the ``in=`` dummy @@ -473,39 +436,13 @@ input frames. if a or b then output outer_join; run; -pandas DataFrames have a :meth:`~DataFrame.merge` method, which provides -similar functionality. 
Note that the data does not have -to be sorted ahead of time, and different join -types are accomplished via the ``how`` keyword. - -.. ipython:: python - - inner_join = df1.merge(df2, on=["key"], how="inner") - inner_join - - left_join = df1.merge(df2, on=["key"], how="left") - left_join - - right_join = df1.merge(df2, on=["key"], how="right") - right_join - - outer_join = df1.merge(df2, on=["key"], how="outer") - outer_join +.. include:: includes/merge.rst Missing data ------------ -Like SAS, pandas has a representation for missing data - which is the -special float value ``NaN`` (not a number). Many of the semantics -are the same, for example missing data propagates through numeric -operations, and is ignored by default for aggregations. - -.. ipython:: python - - outer_join - outer_join["value_x"] + outer_join["value_y"] - outer_join["value_x"].sum() +.. include:: includes/missing_intro.rst One difference is that missing data cannot be compared to its sentinel value. For example, in SAS you could do this to filter missing values. @@ -522,25 +459,7 @@ For example, in SAS you could do this to filter missing values. if value_x ^= .; run; -Which doesn't work in pandas. Instead, the ``pd.isna`` or ``pd.notna`` functions -should be used for comparisons. - -.. ipython:: python - - outer_join[pd.isna(outer_join["value_x"])] - outer_join[pd.notna(outer_join["value_x"])] - -pandas also provides a variety of methods to work with missing data - some of -which would be challenging to express in SAS. For example, there are methods to -drop all rows with any missing values, replacing missing values with a specified -value, like the mean, or forward filling from previous rows. See the -:ref:`missing data documentation` for more. - -.. ipython:: python - - outer_join.dropna() - outer_join.fillna(method="ffill") - outer_join["value_x"].fillna(outer_join["value_x"].mean()) +.. 
include:: includes/missing.rst GroupBy @@ -549,7 +468,7 @@ GroupBy Aggregation ~~~~~~~~~~~ -SAS's PROC SUMMARY can be used to group by one or +SAS's ``PROC SUMMARY`` can be used to group by one or more key variables and compute aggregations on numeric columns. @@ -561,14 +480,7 @@ numeric columns. output out=tips_summed sum=; run; -pandas provides a flexible ``groupby`` mechanism that -allows similar aggregations. See the :ref:`groupby documentation` -for more details and examples. - -.. ipython:: python - - tips_summed = tips.groupby(["sex", "smoker"])[["total_bill", "tip"]].sum() - tips_summed.head() +.. include:: includes/groupby.rst Transformation @@ -597,16 +509,7 @@ example, to subtract the mean for each observation by smoker group. if a and b; run; - -pandas ``groupby`` provides a ``transform`` mechanism that allows -these type of operations to be succinctly expressed in one -operation. - -.. ipython:: python - - gb = tips.groupby("smoker")["total_bill"] - tips["adj_total_bill"] = tips["total_bill"] - gb.transform("mean") - tips.head() +.. include:: includes/transform.rst By group processing diff --git a/doc/source/getting_started/comparison/comparison_with_stata.rst b/doc/source/getting_started/comparison/comparison_with_stata.rst index d1ad18bddb0a7..ca536e7273870 100644 --- a/doc/source/getting_started/comparison/comparison_with_stata.rst +++ b/doc/source/getting_started/comparison/comparison_with_stata.rst @@ -311,15 +311,7 @@ first position of the substring you supply as the second argument. generate str_position = strpos(sex, "ale") -Python determines the position of a character in a string with the -:func:`find` function. ``find`` searches for the first position of the -substring. If the substring is found, the function returns its -position. Keep in mind that Python indexes are zero-based and -the function will return -1 if it fails to find the substring. - -.. ipython:: python - - tips["sex"].str.find("ale").head() +.. 
include:: includes/find_substring.rst Extracting substring by position @@ -331,13 +323,7 @@ Stata extracts a substring from a string based on its position with the :func:`s generate short_sex = substr(sex, 1, 1) -With pandas you can use ``[]`` notation to extract a substring -from a string by position locations. Keep in mind that Python -indexes are zero-based. - -.. ipython:: python - - tips["sex"].str[0:1].head() +.. include:: includes/extract_substring.rst Extracting nth word @@ -358,16 +344,7 @@ second argument specifies which word you want to extract. generate first_name = word(name, 1) generate last_name = word(name, -1) -Python extracts a substring from a string based on its text -by using regular expressions. There are much more powerful -approaches, but this just shows a simple approach. - -.. ipython:: python - - firstlast = pd.DataFrame({"string": ["John Smith", "Jane Cook"]}) - firstlast["First_Name"] = firstlast["string"].str.split(" ", expand=True)[0] - firstlast["Last_Name"] = firstlast["string"].str.rsplit(" ", expand=True)[0] - firstlast +.. include:: includes/nth_word.rst Changing case @@ -390,27 +367,13 @@ change the case of ASCII and Unicode strings, respectively. generate title = strproper(string) list -The equivalent Python functions are ``upper``, ``lower``, and ``title``. - -.. ipython:: python +.. include:: includes/case.rst - firstlast = pd.DataFrame({"string": ["John Smith", "Jane Cook"]}) - firstlast["upper"] = firstlast["string"].str.upper() - firstlast["lower"] = firstlast["string"].str.lower() - firstlast["title"] = firstlast["string"].str.title() - firstlast Merging ------- -The following tables will be used in the merge examples - -.. ipython:: python - - df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)}) - df1 - df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": np.random.randn(4)}) - df2 +.. 
include:: includes/merge_setup.rst In Stata, to perform a merge, one data set must be in memory and the other must be referenced as a file name on disk. In @@ -465,38 +428,13 @@ or the intersection of the two by using the values created in the restore merge 1:n key using df2.dta -pandas DataFrames have a :meth:`DataFrame.merge` method, which provides -similar functionality. Note that different join -types are accomplished via the ``how`` keyword. - -.. ipython:: python - - inner_join = df1.merge(df2, on=["key"], how="inner") - inner_join - - left_join = df1.merge(df2, on=["key"], how="left") - left_join - - right_join = df1.merge(df2, on=["key"], how="right") - right_join - - outer_join = df1.merge(df2, on=["key"], how="outer") - outer_join +.. include:: includes/merge.rst Missing data ------------ -Like Stata, pandas has a representation for missing data -- the -special float value ``NaN`` (not a number). Many of the semantics -are the same; for example missing data propagates through numeric -operations, and is ignored by default for aggregations. - -.. ipython:: python - - outer_join - outer_join["value_x"] + outer_join["value_y"] - outer_join["value_x"].sum() +.. include:: includes/missing_intro.rst One difference is that missing data cannot be compared to its sentinel value. For example, in Stata you could do this to filter missing values. @@ -508,30 +446,7 @@ For example, in Stata you could do this to filter missing values. * Keep non-missing values list if value_x != . -This doesn't work in pandas. Instead, the :func:`pd.isna` or :func:`pd.notna` functions -should be used for comparisons. - -.. ipython:: python - - outer_join[pd.isna(outer_join["value_x"])] - outer_join[pd.notna(outer_join["value_x"])] - -pandas also provides a variety of methods to work with missing data -- some of -which would be challenging to express in Stata.
For example, there are methods to -drop all rows with any missing values, replacing missing values with a specified -value, like the mean, or forward filling from previous rows. See the -:ref:`missing data documentation` for more. - -.. ipython:: python - - # Drop rows with any missing value - outer_join.dropna() - - # Fill forwards - outer_join.fillna(method="ffill") - - # Impute missing values with the mean - outer_join["value_x"].fillna(outer_join["value_x"].mean()) +.. include:: includes/missing.rst GroupBy @@ -548,14 +463,7 @@ numeric columns. collapse (sum) total_bill tip, by(sex smoker) -pandas provides a flexible ``groupby`` mechanism that -allows similar aggregations. See the :ref:`groupby documentation` -for more details and examples. - -.. ipython:: python - - tips_summed = tips.groupby(["sex", "smoker"])[["total_bill", "tip"]].sum() - tips_summed.head() +.. include:: includes/groupby.rst Transformation @@ -570,16 +478,7 @@ For example, to subtract the mean for each observation by smoker group. bysort sex smoker: egen group_bill = mean(total_bill) generate adj_total_bill = total_bill - group_bill - -pandas ``groupby`` provides a ``transform`` mechanism that allows -these type of operations to be succinctly expressed in one -operation. - -.. ipython:: python - - gb = tips.groupby("smoker")["total_bill"] - tips["adj_total_bill"] = tips["total_bill"] - gb.transform("mean") - tips.head() +.. include:: includes/transform.rst By group processing diff --git a/doc/source/getting_started/comparison/includes/case.rst b/doc/source/getting_started/comparison/includes/case.rst new file mode 100644 index 0000000000000..c00a830bc8511 --- /dev/null +++ b/doc/source/getting_started/comparison/includes/case.rst @@ -0,0 +1,10 @@ +The equivalent pandas methods are :meth:`Series.str.upper`, :meth:`Series.str.lower`, and +:meth:`Series.str.title`. + +.. 
ipython:: python + + firstlast = pd.DataFrame({"string": ["John Smith", "Jane Cook"]}) + firstlast["upper"] = firstlast["string"].str.upper() + firstlast["lower"] = firstlast["string"].str.lower() + firstlast["title"] = firstlast["string"].str.title() + firstlast diff --git a/doc/source/getting_started/comparison/includes/extract_substring.rst b/doc/source/getting_started/comparison/includes/extract_substring.rst new file mode 100644 index 0000000000000..78eee286ad467 --- /dev/null +++ b/doc/source/getting_started/comparison/includes/extract_substring.rst @@ -0,0 +1,7 @@ +With pandas you can use ``[]`` notation to extract a substring +from a string by position locations. Keep in mind that Python +indexes are zero-based. + +.. ipython:: python + + tips["sex"].str[0:1].head() diff --git a/doc/source/getting_started/comparison/includes/find_substring.rst b/doc/source/getting_started/comparison/includes/find_substring.rst new file mode 100644 index 0000000000000..ee940b64f5cae --- /dev/null +++ b/doc/source/getting_started/comparison/includes/find_substring.rst @@ -0,0 +1,8 @@ +You can find the position of a character in a column of strings with the :meth:`Series.str.find` +method. ``find`` searches for the first position of the substring. If the substring is found, the +method returns its position. If not found, it returns ``-1``. Keep in mind that Python indexes are +zero-based. + +.. ipython:: python + + tips["sex"].str.find("ale").head() diff --git a/doc/source/getting_started/comparison/includes/groupby.rst b/doc/source/getting_started/comparison/includes/groupby.rst new file mode 100644 index 0000000000000..caa9f6ec9c9b8 --- /dev/null +++ b/doc/source/getting_started/comparison/includes/groupby.rst @@ -0,0 +1,7 @@ +pandas provides a flexible ``groupby`` mechanism that allows similar aggregations. See the +:ref:`groupby documentation` for more details and examples. + +.. 
ipython:: python + + tips_summed = tips.groupby(["sex", "smoker"])[["total_bill", "tip"]].sum() + tips_summed.head() diff --git a/doc/source/getting_started/comparison/includes/length.rst b/doc/source/getting_started/comparison/includes/length.rst index 9581c661c0170..5a0c803e9eff2 100644 --- a/doc/source/getting_started/comparison/includes/length.rst +++ b/doc/source/getting_started/comparison/includes/length.rst @@ -1,4 +1,4 @@ -Python determines the length of a character string with the ``len`` function. +You can find the length of a character string with :meth:`Series.str.len`. In Python 3, all strings are Unicode strings. ``len`` includes trailing blanks. Use ``len`` and ``rstrip`` to exclude trailing blanks. diff --git a/doc/source/getting_started/comparison/includes/merge.rst b/doc/source/getting_started/comparison/includes/merge.rst new file mode 100644 index 0000000000000..b8e3f54fd132b --- /dev/null +++ b/doc/source/getting_started/comparison/includes/merge.rst @@ -0,0 +1,17 @@ +pandas DataFrames have a :meth:`~DataFrame.merge` method, which provides similar functionality. The +data does not have to be sorted ahead of time, and different join types are accomplished via the +``how`` keyword. + +.. ipython:: python + + inner_join = df1.merge(df2, on=["key"], how="inner") + inner_join + + left_join = df1.merge(df2, on=["key"], how="left") + left_join + + right_join = df1.merge(df2, on=["key"], how="right") + right_join + + outer_join = df1.merge(df2, on=["key"], how="outer") + outer_join diff --git a/doc/source/getting_started/comparison/includes/merge_setup.rst b/doc/source/getting_started/comparison/includes/merge_setup.rst new file mode 100644 index 0000000000000..f115cd58f7a94 --- /dev/null +++ b/doc/source/getting_started/comparison/includes/merge_setup.rst @@ -0,0 +1,8 @@ +The following tables will be used in the merge examples: + +.. 
ipython:: python + + df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)}) + df1 + df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": np.random.randn(4)}) + df2 diff --git a/doc/source/getting_started/comparison/includes/missing.rst b/doc/source/getting_started/comparison/includes/missing.rst new file mode 100644 index 0000000000000..8e6ba95e98036 --- /dev/null +++ b/doc/source/getting_started/comparison/includes/missing.rst @@ -0,0 +1,24 @@ +This doesn't work in pandas. Instead, the :func:`pd.isna` or :func:`pd.notna` functions +should be used for comparisons. + +.. ipython:: python + + outer_join[pd.isna(outer_join["value_x"])] + outer_join[pd.notna(outer_join["value_x"])] + +pandas also provides a variety of methods to work with missing data -- some of +which would be challenging to express in SAS or Stata. For example, there are methods to +drop all rows with any missing values, replace missing values with a specified +value, like the mean, or forward fill from previous rows. See the +:ref:`missing data documentation` for more. + +.. ipython:: python + + # Drop rows with any missing value + outer_join.dropna() + + # Fill forwards + outer_join.fillna(method="ffill") + + # Impute missing values with the mean + outer_join["value_x"].fillna(outer_join["value_x"].mean()) diff --git a/doc/source/getting_started/comparison/includes/missing_intro.rst b/doc/source/getting_started/comparison/includes/missing_intro.rst new file mode 100644 index 0000000000000..ed97f639f3f3d --- /dev/null +++ b/doc/source/getting_started/comparison/includes/missing_intro.rst @@ -0,0 +1,9 @@ +Both have a representation for missing data -- pandas' is the special float value ``NaN`` (not a +number). Many of the semantics are the same; for example, missing data propagates through numeric +operations, and is ignored by default for aggregations. + +..
ipython:: python + + outer_join + outer_join["value_x"] + outer_join["value_y"] + outer_join["value_x"].sum() diff --git a/doc/source/getting_started/comparison/includes/nth_word.rst b/doc/source/getting_started/comparison/includes/nth_word.rst new file mode 100644 index 0000000000000..7af0285005d5b --- /dev/null +++ b/doc/source/getting_started/comparison/includes/nth_word.rst @@ -0,0 +1,9 @@ +The simplest way to extract words in pandas is to split the strings by spaces, then reference the +word by index. Note there are more powerful approaches should you need them. + +.. ipython:: python + + firstlast = pd.DataFrame({"String": ["John Smith", "Jane Cook"]}) + firstlast["First_Name"] = firstlast["String"].str.split(" ", expand=True)[0] + firstlast["Last_Name"] = firstlast["String"].str.rsplit(" ", expand=True)[1] + firstlast diff --git a/doc/source/getting_started/comparison/includes/sorting.rst b/doc/source/getting_started/comparison/includes/sorting.rst index 23f11ff485474..0840c9dd554b7 100644 --- a/doc/source/getting_started/comparison/includes/sorting.rst +++ b/doc/source/getting_started/comparison/includes/sorting.rst @@ -1,5 +1,4 @@ -pandas objects have a :meth:`DataFrame.sort_values` method, which -takes a list of columns to sort by. +pandas has a :meth:`DataFrame.sort_values` method, which takes a list of columns to sort by. .. ipython:: python diff --git a/doc/source/getting_started/comparison/includes/transform.rst b/doc/source/getting_started/comparison/includes/transform.rst new file mode 100644 index 0000000000000..0aa5b5b298cf7 --- /dev/null +++ b/doc/source/getting_started/comparison/includes/transform.rst @@ -0,0 +1,8 @@ +pandas provides a :ref:`groupby.transform` mechanism that allows these types of operations to be +succinctly expressed in one operation. + +.. ipython:: python + + gb = tips.groupby("smoker")["total_bill"] + tips["adj_total_bill"] = tips["total_bill"] - gb.transform("mean") + tips.head()