Working with missing data — pandas 2.2.2 documentation (2024)

Values considered “missing”#

pandas uses different sentinel values to represent a missing value (also referred to as NA) depending on the data type.

numpy.nan for NumPy data types. The disadvantage of using NumPy data types is that the original data type will be coerced to np.float64 or object.

In [1]: pd.Series([1, 2], dtype=np.int64).reindex([0, 1, 2])
Out[1]:
0    1.0
1    2.0
2    NaN
dtype: float64

In [2]: pd.Series([True, False], dtype=np.bool_).reindex([0, 1, 2])
Out[2]:
0     True
1    False
2      NaN
dtype: object

NaT for NumPy np.datetime64, np.timedelta64, and PeriodDtype. For typing applications, use api.types.NaTType.

In [3]: pd.Series([1, 2], dtype=np.dtype("timedelta64[ns]")).reindex([0, 1, 2])
Out[3]:
0   0 days 00:00:00.000000001
1   0 days 00:00:00.000000002
2                         NaT
dtype: timedelta64[ns]

In [4]: pd.Series([1, 2], dtype=np.dtype("datetime64[ns]")).reindex([0, 1, 2])
Out[4]:
0   1970-01-01 00:00:00.000000001
1   1970-01-01 00:00:00.000000002
2                             NaT
dtype: datetime64[ns]

In [5]: pd.Series(["2020", "2020"], dtype=pd.PeriodDtype("D")).reindex([0, 1, 2])
Out[5]:
0    2020-01-01
1    2020-01-01
2           NaT
dtype: period[D]

NA for StringDtype, Int64Dtype (and other bit widths), Float64Dtype (and other bit widths), BooleanDtype and ArrowDtype. These types will maintain the original data type of the data. For typing applications, use api.types.NAType.

In [6]: pd.Series([1, 2], dtype="Int64").reindex([0, 1, 2])
Out[6]:
0       1
1       2
2    <NA>
dtype: Int64

In [7]: pd.Series([True, False], dtype="boolean[pyarrow]").reindex([0, 1, 2])
Out[7]:
0     True
1    False
2     <NA>
dtype: bool[pyarrow]

To detect these missing values, use the isna() or notna() methods.

In [8]: ser = pd.Series([pd.Timestamp("2020-01-01"), pd.NaT])

In [9]: ser
Out[9]:
0   2020-01-01
1          NaT
dtype: datetime64[ns]

In [10]: pd.isna(ser)
Out[10]:
0    False
1     True
dtype: bool

Note

isna() or notna() will also consider None a missing value.

In [11]: ser = pd.Series([1, None], dtype=object)

In [12]: ser
Out[12]:
0       1
1    None
dtype: object

In [13]: pd.isna(ser)
Out[13]:
0    False
1     True
dtype: bool

Warning

Equality comparisons between np.nan, NaT, and NA do not act like None:

In [14]: None == None  # noqa: E711
Out[14]: True

In [15]: np.nan == np.nan
Out[15]: False

In [16]: pd.NaT == pd.NaT
Out[16]: False

In [17]: pd.NA == pd.NA
Out[17]: <NA>

Therefore, an equality comparison between a DataFrame or Series with one of these missing values does not provide the same information as isna() or notna().

In [18]: ser = pd.Series([True, None], dtype="boolean[pyarrow]")

In [19]: ser == pd.NA
Out[19]:
0    <NA>
1    <NA>
dtype: bool[pyarrow]

In [20]: pd.isna(ser)
Out[20]:
0    False
1     True
dtype: bool

NA semantics#

Warning

Experimental: the behaviour of NA can still change without warning.

Starting from pandas 1.0, an experimental NA value (singleton) is available to represent scalar missing values. The goal of NA is to provide a “missing” indicator that can be used consistently across data types (instead of np.nan, None or pd.NaT depending on the data type).

For example, when having missing values in a Series with the nullable integer dtype, it will use NA:

In [21]: s = pd.Series([1, 2, None], dtype="Int64")

In [22]: s
Out[22]:
0       1
1       2
2    <NA>
dtype: Int64

In [23]: s[2]
Out[23]: <NA>

In [24]: s[2] is pd.NA
Out[24]: True

Currently, pandas does not yet use those data types using NA by default in a DataFrame or Series, so you need to specify the dtype explicitly. An easy way to convert to those dtypes is explained in the conversion section.
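A minimal sketch (not taken from the original page) of opting in to the NA-backed dtypes, either with an explicit astype or by letting convert_dtypes() infer the best nullable dtype:

```python
import numpy as np
import pandas as pd

# A plain float64 Series where the missing value is represented by np.nan.
s_np = pd.Series([1.0, 2.0, np.nan])

# Converting explicitly to the nullable integer dtype swaps np.nan for pd.NA
# and restores the integer type.
s_na = s_np.astype("Int64")
print(s_na.dtype)        # Int64
print(s_na[2] is pd.NA)  # True

# convert_dtypes() infers an NA-backed dtype automatically.
s_auto = s_np.convert_dtypes()
print(s_auto.dtype)      # Int64
```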

Propagation in arithmetic and comparison operations#

In general, missing values propagate in operations involving NA. When one of the operands is unknown, the outcome of the operation is also unknown.

For example, NA propagates in arithmetic operations, similarly to np.nan:

In [25]: pd.NA + 1
Out[25]: <NA>

In [26]: "a" * pd.NA
Out[26]: <NA>

There are a few special cases when the result is known, even when one of the operands is NA.

In [27]: pd.NA ** 0
Out[27]: 1

In [28]: 1 ** pd.NA
Out[28]: 1

In equality and comparison operations, NA also propagates. This deviates from the behaviour of np.nan, where comparisons with np.nan always return False.

In [29]: pd.NA == 1
Out[29]: <NA>

In [30]: pd.NA == pd.NA
Out[30]: <NA>

In [31]: pd.NA < 2.5
Out[31]: <NA>

To check if a value is equal to NA, use isna():

In [32]: pd.isna(pd.NA)
Out[32]: True

Note

An exception to this basic propagation rule are reductions (such as the mean or the minimum), where pandas defaults to skipping missing values. See the calculation section for more.
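As a quick illustration of that default (a sketch using the standard reduction API):

```python
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")

# Reductions skip missing values by default ...
print(s.mean())              # 1.5
print(s.min())               # 1

# ... unless skipna=False is passed, in which case NA propagates.
print(s.mean(skipna=False))  # <NA>
```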

Logical operations#

For logical operations, NA follows the rules of three-valued logic (or Kleene logic, similarly to R, SQL and Julia). This logic means to only propagate missing values when it is logically required.

For example, for the logical “or” operation (|), if one of the operands is True, we already know the result will be True, regardless of the other value (so regardless of whether the missing value would be True or False). In this case, NA does not propagate:

In [33]: True | False
Out[33]: True

In [34]: True | pd.NA
Out[34]: True

In [35]: pd.NA | True
Out[35]: True

On the other hand, if one of the operands is False, the result depends on the value of the other operand. Therefore, in this case NA propagates:

In [36]: False | True
Out[36]: True

In [37]: False | False
Out[37]: False

In [38]: False | pd.NA
Out[38]: <NA>

The behaviour of the logical “and” operation (&) can be derived using similar logic (where now NA will not propagate if one of the operands is already False):

In [39]: False & True
Out[39]: False

In [40]: False & False
Out[40]: False

In [41]: False & pd.NA
Out[41]: False

In [42]: True & True
Out[42]: True

In [43]: True & False
Out[43]: False

In [44]: True & pd.NA
Out[44]: <NA>

NA in a boolean context#

Since the actual value of an NA is unknown, it is ambiguous to convert NA to a boolean value.

In [45]: bool(pd.NA)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[45], line 1
----> 1 bool(pd.NA)

File missing.pyx:392, in pandas._libs.missing.NAType.__bool__()

TypeError: boolean value of NA is ambiguous

This also means that NA cannot be used in a context where it is evaluated to a boolean, such as if condition: ... where condition can potentially be NA. In such cases, isna() can be used to check for NA, or condition being NA can be avoided, for example by filling missing values beforehand.

A similar situation occurs when using Series or DataFrame objects in if statements, see Using if/truth statements with pandas.
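A minimal sketch of both workarounds (an explicit isna() check, and filling before the truth test); the variable names here are illustrative only:

```python
import pandas as pd

value = pd.NA

# bool(value) would raise TypeError, so test for NA explicitly ...
if pd.isna(value):
    result = "missing"
else:
    result = "present"

# ... or remove the ambiguity up front by filling missing values
# before any truth-value evaluation.
filled = pd.Series([True, pd.NA], dtype="boolean").fillna(False)

print(result)        # missing
print(filled.all())  # False
```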

NumPy ufuncs#

pandas.NA implements NumPy’s __array_ufunc__ protocol. Most ufuncs work with NA, and generally return NA:

In [46]: np.log(pd.NA)
Out[46]: <NA>

In [47]: np.add(pd.NA, 1)
Out[47]: <NA>

Warning

Currently, ufuncs involving an ndarray and NA will return an object-dtype filled with NA values.

In [48]: a = np.array([1, 2, 3])

In [49]: np.greater(a, pd.NA)
Out[49]: array([<NA>, <NA>, <NA>], dtype=object)

The return type here may change to return a different array type in the future.

See DataFrame interoperability with NumPy functions for more on ufuncs.

Conversion#

If you have a DataFrame or Series using np.nan, Series.convert_dtypes() and DataFrame.convert_dtypes() can convert data to use the data types that use NA, such as Int64Dtype or ArrowDtype. This is especially helpful after reading in data sets from IO methods where data types were inferred.

In this example, the dtypes of both columns are changed.

In [50]: import io

In [51]: data = io.StringIO("a,b\n,True\n2,")

In [52]: df = pd.read_csv(data)

In [53]: df.dtypes
Out[53]:
a    float64
b     object
dtype: object

In [54]: df_conv = df.convert_dtypes()

In [55]: df_conv
Out[55]:
      a     b
0  <NA>  True
1     2  <NA>

In [56]: df_conv.dtypes
Out[56]:
a      Int64
b    boolean
dtype: object

Inserting missing data#

You can insert missing values by simply assigning to a Series or DataFrame. The missing value sentinel used will be chosen based on the dtype.

In [57]: ser = pd.Series([1., 2., 3.])

In [58]: ser.loc[0] = None

In [59]: ser
Out[59]:
0    NaN
1    2.0
2    3.0
dtype: float64

In [60]: ser = pd.Series([pd.Timestamp("2021"), pd.Timestamp("2021")])

In [61]: ser.iloc[0] = np.nan

In [62]: ser
Out[62]:
0          NaT
1   2021-01-01
dtype: datetime64[ns]

In [63]: ser = pd.Series([True, False], dtype="boolean[pyarrow]")

In [64]: ser.iloc[0] = None

In [65]: ser
Out[65]:
0     <NA>
1    False
dtype: bool[pyarrow]

For object types, pandas will use the value given:

In [66]: s = pd.Series(["a", "b", "c"], dtype=object)

In [67]: s.loc[0] = None

In [68]: s.loc[1] = np.nan

In [69]: s
Out[69]:
0    None
1     NaN
2       c
dtype: object

Calculations with missing data#

Missing values propagate through arithmetic operations between pandas objects.

In [70]: ser1 = pd.Series([np.nan, np.nan, 2, 3])

In [71]: ser2 = pd.Series([np.nan, 1, np.nan, 4])

In [72]: ser1
Out[72]:
0    NaN
1    NaN
2    2.0
3    3.0
dtype: float64

In [73]: ser2
Out[73]:
0    NaN
1    1.0
2    NaN
3    4.0
dtype: float64

In [74]: ser1 + ser2
Out[74]:
0    NaN
1    NaN
2    NaN
3    7.0
dtype: float64

The descriptive statistics and computational methods discussed in the data structure overview (and listed here and here) all account for missing data.
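For example, a small sketch showing that column reductions skip NaN while count() reports the number of valid entries per column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, np.nan, 6.0]})

# Reductions skip NaN, so the means are computed over valid entries only.
print(df.mean())   # A: 2.0, B: 6.0

# count() reports how many non-missing entries each column has.
print(df.count())  # A: 2, B: 1
```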

When summing data, NA values or empty data will be treated as zero.

In [75]: pd.Series([np.nan]).sum()
Out[75]: 0.0

In [76]: pd.Series([], dtype="float64").sum()
Out[76]: 0.0

When taking the product, NA values or empty data will be treated as 1.

In [77]: pd.Series([np.nan]).prod()
Out[77]: 1.0

In [78]: pd.Series([], dtype="float64").prod()
Out[78]: 1.0

Cumulative methods like cumsum() and cumprod() ignore NA values by default, but preserve them in the resulting array. To override this behaviour and include NA values in the calculation, use skipna=False.

In [79]: ser = pd.Series([1, np.nan, 3, np.nan])

In [80]: ser
Out[80]:
0    1.0
1    NaN
2    3.0
3    NaN
dtype: float64

In [81]: ser.cumsum()
Out[81]:
0    1.0
1    NaN
2    4.0
3    NaN
dtype: float64

In [82]: ser.cumsum(skipna=False)
Out[82]:
0    1.0
1    NaN
2    NaN
3    NaN
dtype: float64

Dropping missing data#

dropna() drops rows or columns with missing data.

In [83]: df = pd.DataFrame([[np.nan, 1, 2], [1, 2, np.nan], [1, 2, 3]])

In [84]: df
Out[84]:
     0  1    2
0  NaN  1  2.0
1  1.0  2  NaN
2  1.0  2  3.0

In [85]: df.dropna()
Out[85]:
     0  1    2
2  1.0  2  3.0

In [86]: df.dropna(axis=1)
Out[86]:
   1
0  1
1  2
2  2

In [87]: ser = pd.Series([1, pd.NA], dtype="int64[pyarrow]")

In [88]: ser.dropna()
Out[88]:
0    1
dtype: int64[pyarrow]
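dropna() also takes how, thresh and subset keyword arguments to control which rows are dropped; a short sketch (standard pandas API, not from the original page):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [np.nan, 1.0, 2.0], "B": [np.nan, np.nan, 3.0]})

# how="all" drops only rows where every value is missing.
print(df.dropna(how="all"))     # keeps rows 1 and 2

# thresh=2 keeps rows with at least two non-missing values.
print(df.dropna(thresh=2))      # keeps row 2 only

# subset restricts the missing-value check to particular columns.
print(df.dropna(subset=["A"]))  # keeps rows 1 and 2
```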

Filling missing data#

Filling by value#

fillna() replaces NA values with non-NA data.

Replace NA with a scalar value

In [89]: data = {"np": [1.0, np.nan, np.nan, 2], "arrow": pd.array([1.0, pd.NA, pd.NA, 2], dtype="float64[pyarrow]")}

In [90]: df = pd.DataFrame(data)

In [91]: df
Out[91]:
    np  arrow
0  1.0    1.0
1  NaN   <NA>
2  NaN   <NA>
3  2.0    2.0

In [92]: df.fillna(0)
Out[92]:
    np  arrow
0  1.0    1.0
1  0.0    0.0
2  0.0    0.0
3  2.0    2.0

Fill gaps forward or backward

In [93]: df.ffill()
Out[93]:
    np  arrow
0  1.0    1.0
1  1.0    1.0
2  1.0    1.0
3  2.0    2.0

In [94]: df.bfill()
Out[94]:
    np  arrow
0  1.0    1.0
1  2.0    2.0
2  2.0    2.0
3  2.0    2.0

Limit the number of NA values filled

In [95]: df.ffill(limit=1)
Out[95]:
    np  arrow
0  1.0    1.0
1  1.0    1.0
2  NaN   <NA>
3  2.0    2.0
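fillna() also accepts a dict mapping column names to fill values, which is handy when each column needs its own replacement; a minimal sketch (the frame here is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan], "B": [np.nan, 4.0]})

# A dict maps each column name to the fill value used in that column.
filled = df.fillna({"A": 0.0, "B": -1.0})
print(filled)
#      A    B
# 0  1.0 -1.0
# 1  0.0  4.0
```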

NA values can be replaced with a corresponding value from a Series or DataFrame where the index and column align between the original object and the filled object.

In [96]: dff = pd.DataFrame(np.arange(30, dtype=np.float64).reshape(10, 3), columns=list("ABC"))

In [97]: dff.iloc[3:5, 0] = np.nan

In [98]: dff.iloc[4:6, 1] = np.nan

In [99]: dff.iloc[5:8, 2] = np.nan

In [100]: dff
Out[100]:
      A     B     C
0   0.0   1.0   2.0
1   3.0   4.0   5.0
2   6.0   7.0   8.0
3   NaN  10.0  11.0
4   NaN   NaN  14.0
5  15.0   NaN   NaN
6  18.0  19.0   NaN
7  21.0  22.0   NaN
8  24.0  25.0  26.0
9  27.0  28.0  29.0

In [101]: dff.fillna(dff.mean())
Out[101]:
       A     B          C
0   0.00   1.0   2.000000
1   3.00   4.0   5.000000
2   6.00   7.0   8.000000
3  14.25  10.0  11.000000
4  14.25  14.5  14.000000
5  15.00  14.5  13.571429
6  18.00  19.0  13.571429
7  21.00  22.0  13.571429
8  24.00  25.0  26.000000
9  27.00  28.0  29.000000

Note

DataFrame.where() can also be used to fill NA values. Same result as above.

In [102]: dff.where(pd.notna(dff), dff.mean(), axis="columns")
Out[102]:
       A     B          C
0   0.00   1.0   2.000000
1   3.00   4.0   5.000000
2   6.00   7.0   8.000000
3  14.25  10.0  11.000000
4  14.25  14.5  14.000000
5  15.00  14.5  13.571429
6  18.00  19.0  13.571429
7  21.00  22.0  13.571429
8  24.00  25.0  26.000000
9  27.00  28.0  29.000000

Interpolation#

DataFrame.interpolate() and Series.interpolate() fill NA values using various interpolation methods.

In [103]: df = pd.DataFrame(
   .....:     {
   .....:         "A": [1, 2.1, np.nan, 4.7, 5.6, 6.8],
   .....:         "B": [0.25, np.nan, np.nan, 4, 12.2, 14.4],
   .....:     }
   .....: )

In [104]: df
Out[104]:
     A      B
0  1.0   0.25
1  2.1    NaN
2  NaN    NaN
3  4.7   4.00
4  5.6  12.20
5  6.8  14.40

In [105]: df.interpolate()
Out[105]:
     A      B
0  1.0   0.25
1  2.1   1.50
2  3.4   2.75
3  4.7   4.00
4  5.6  12.20
5  6.8  14.40

In [106]: idx = pd.date_range("2020-01-01", periods=10, freq="D")

In [107]: data = np.random.default_rng(2).integers(0, 10, 10).astype(np.float64)

In [108]: ts = pd.Series(data, index=idx)

In [109]: ts.iloc[[1, 2, 5, 6, 9]] = np.nan

In [110]: ts
Out[110]:
2020-01-01    8.0
2020-01-02    NaN
2020-01-03    NaN
2020-01-04    2.0
2020-01-05    4.0
2020-01-06    NaN
2020-01-07    NaN
2020-01-08    0.0
2020-01-09    3.0
2020-01-10    NaN
Freq: D, dtype: float64

In [111]: ts.plot()
Out[111]: <Axes: >
In [112]: ts.interpolate()
Out[112]:
2020-01-01    8.000000
2020-01-02    6.000000
2020-01-03    4.000000
2020-01-04    2.000000
2020-01-05    4.000000
2020-01-06    2.666667
2020-01-07    1.333333
2020-01-08    0.000000
2020-01-09    3.000000
2020-01-10    3.000000
Freq: D, dtype: float64

In [113]: ts.interpolate().plot()
Out[113]: <Axes: >

Interpolation relative to a Timestamp in the DatetimeIndex is available by setting method="time":

In [114]: ts2 = ts.iloc[[0, 1, 3, 7, 9]]

In [115]: ts2
Out[115]:
2020-01-01    8.0
2020-01-02    NaN
2020-01-04    2.0
2020-01-08    0.0
2020-01-10    NaN
dtype: float64

In [116]: ts2.interpolate()
Out[116]:
2020-01-01    8.0
2020-01-02    5.0
2020-01-04    2.0
2020-01-08    0.0
2020-01-10    0.0
dtype: float64

In [117]: ts2.interpolate(method="time")
Out[117]:
2020-01-01    8.0
2020-01-02    6.0
2020-01-04    2.0
2020-01-08    0.0
2020-01-10    0.0
dtype: float64

For a floating-point index, use method='values':

In [118]: idx = [0.0, 1.0, 10.0]

In [119]: ser = pd.Series([0.0, np.nan, 10.0], idx)

In [120]: ser
Out[120]:
0.0      0.0
1.0      NaN
10.0    10.0
dtype: float64

In [121]: ser.interpolate()
Out[121]:
0.0      0.0
1.0      5.0
10.0    10.0
dtype: float64

In [122]: ser.interpolate(method="values")
Out[122]:
0.0      0.0
1.0      1.0
10.0    10.0
dtype: float64

If you have scipy installed, you can pass the name of a 1-d interpolation routine to method, as specified in the scipy interpolation documentation and reference guide. The appropriate interpolation method will depend on the data type.

Tip

If you are dealing with a time series that is growing at an increasing rate, use method='barycentric'.

If you have values approximating a cumulative distribution function, use method='pchip'.

To fill missing values with the goal of smooth plotting, use method='akima'.

In [123]: df = pd.DataFrame(
   .....:     {
   .....:         "A": [1, 2.1, np.nan, 4.7, 5.6, 6.8],
   .....:         "B": [0.25, np.nan, np.nan, 4, 12.2, 14.4],
   .....:     }
   .....: )

In [124]: df
Out[124]:
     A      B
0  1.0   0.25
1  2.1    NaN
2  NaN    NaN
3  4.7   4.00
4  5.6  12.20
5  6.8  14.40

In [125]: df.interpolate(method="barycentric")
Out[125]:
      A       B
0  1.00   0.250
1  2.10  -7.660
2  3.53  -4.515
3  4.70   4.000
4  5.60  12.200
5  6.80  14.400

In [126]: df.interpolate(method="pchip")
Out[126]:
         A          B
0  1.00000   0.250000
1  2.10000   0.672808
2  3.43454   1.928950
3  4.70000   4.000000
4  5.60000  12.200000
5  6.80000  14.400000

In [127]: df.interpolate(method="akima")
Out[127]:
          A          B
0  1.000000   0.250000
1  2.100000  -0.873316
2  3.406667   0.320034
3  4.700000   4.000000
4  5.600000  12.200000
5  6.800000  14.400000

When interpolating via a polynomial or spline approximation, you must also specify the degree or order of the approximation:

In [128]: df.interpolate(method="spline", order=2)
Out[128]:
          A          B
0  1.000000   0.250000
1  2.100000  -0.428598
2  3.404545   1.206900
3  4.700000   4.000000
4  5.600000  12.200000
5  6.800000  14.400000

In [129]: df.interpolate(method="polynomial", order=2)
Out[129]:
          A          B
0  1.000000   0.250000
1  2.100000  -2.703846
2  3.451351  -1.453846
3  4.700000   4.000000
4  5.600000  12.200000
5  6.800000  14.400000

Comparing several methods.

In [130]: np.random.seed(2)

In [131]: ser = pd.Series(np.arange(1, 10.1, 0.25) ** 2 + np.random.randn(37))

In [132]: missing = np.array([4, 13, 14, 15, 16, 17, 18, 20, 29])

In [133]: ser.iloc[missing] = np.nan

In [134]: methods = ["linear", "quadratic", "cubic"]

In [135]: df = pd.DataFrame({m: ser.interpolate(method=m) for m in methods})

In [136]: df.plot()
Out[136]: <Axes: >

Interpolating new observations from expanding data with Series.reindex().

In [137]: ser = pd.Series(np.sort(np.random.uniform(size=100)))

# interpolate at new_index
In [138]: new_index = ser.index.union(pd.Index([49.25, 49.5, 49.75, 50.25, 50.5, 50.75]))

In [139]: interp_s = ser.reindex(new_index).interpolate(method="pchip")

In [140]: interp_s.loc[49:51]
Out[140]:
49.00    0.471410
49.25    0.476841
49.50    0.481780
49.75    0.485998
50.00    0.489266
50.25    0.491814
50.50    0.493995
50.75    0.495763
51.00    0.497074
dtype: float64

Interpolation limits#

interpolate() accepts a limit keyword argument to limit the number of consecutive NaN values filled since the last valid observation:

In [141]: ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13, np.nan, np.nan])

In [142]: ser
Out[142]:
0     NaN
1     NaN
2     5.0
3     NaN
4     NaN
5     NaN
6    13.0
7     NaN
8     NaN
dtype: float64

In [143]: ser.interpolate()
Out[143]:
0     NaN
1     NaN
2     5.0
3     7.0
4     9.0
5    11.0
6    13.0
7    13.0
8    13.0
dtype: float64

In [144]: ser.interpolate(limit=1)
Out[144]:
0     NaN
1     NaN
2     5.0
3     7.0
4     NaN
5     NaN
6    13.0
7    13.0
8     NaN
dtype: float64

By default, NaN values are filled in a forward direction. Use the limit_direction parameter to fill backward or from both directions.

In [145]: ser.interpolate(limit=1, limit_direction="backward")
Out[145]:
0     NaN
1     5.0
2     5.0
3     NaN
4     NaN
5    11.0
6    13.0
7     NaN
8     NaN
dtype: float64

In [146]: ser.interpolate(limit=1, limit_direction="both")
Out[146]:
0     NaN
1     5.0
2     5.0
3     7.0
4     NaN
5    11.0
6    13.0
7    13.0
8     NaN
dtype: float64

In [147]: ser.interpolate(limit_direction="both")
Out[147]:
0     5.0
1     5.0
2     5.0
3     7.0
4     9.0
5    11.0
6    13.0
7    13.0
8    13.0
dtype: float64

By default, NaN values are filled whether they are surrounded by existing valid values or outside existing valid values. The limit_area parameter restricts filling to either inside or outside values.

# fill one consecutive inside value in both directions
In [148]: ser.interpolate(limit_direction="both", limit_area="inside", limit=1)
Out[148]:
0     NaN
1     NaN
2     5.0
3     7.0
4     NaN
5    11.0
6    13.0
7     NaN
8     NaN
dtype: float64

# fill all consecutive outside values backward
In [149]: ser.interpolate(limit_direction="backward", limit_area="outside")
Out[149]:
0     5.0
1     5.0
2     5.0
3     NaN
4     NaN
5     NaN
6    13.0
7     NaN
8     NaN
dtype: float64

# fill all consecutive outside values in both directions
In [150]: ser.interpolate(limit_direction="both", limit_area="outside")
Out[150]:
0     5.0
1     5.0
2     5.0
3     NaN
4     NaN
5     NaN
6    13.0
7    13.0
8    13.0
dtype: float64

Replacing values#

Series.replace() and DataFrame.replace() can be used similarly to Series.fillna() and DataFrame.fillna() to replace or insert missing values.

In [151]: df = pd.DataFrame(np.eye(3))

In [152]: df
Out[152]:
     0    1    2
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  0.0  0.0  1.0

In [153]: df_missing = df.replace(0, np.nan)

In [154]: df_missing
Out[154]:
     0    1    2
0  1.0  NaN  NaN
1  NaN  1.0  NaN
2  NaN  NaN  1.0

In [155]: df_filled = df_missing.replace(np.nan, 2)

In [156]: df_filled
Out[156]:
     0    1    2
0  1.0  2.0  2.0
1  2.0  1.0  2.0
2  2.0  2.0  1.0

Replacing more than one value is possible by passing a list.

In [157]: df_filled.replace([1, 44], [2, 28])
Out[157]:
     0    1    2
0  2.0  2.0  2.0
1  2.0  2.0  2.0
2  2.0  2.0  2.0

Replacing using a mapping dict.

In [158]: df_filled.replace({1: 44, 2: 28})
Out[158]:
      0     1     2
0  44.0  28.0  28.0
1  28.0  44.0  28.0
2  28.0  28.0  44.0

Regular expression replacement#

Note

Python strings prefixed with the r character such as r'hello world' are “raw” strings. They have different semantics regarding backslashes than strings without this prefix. Backslashes in raw strings are kept as literal backslashes, e.g., r'\n' == '\\n'.
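A tiny demonstration of the difference:

```python
# In a raw string the backslash is kept literally, so r"\n" is two
# characters (a backslash and an "n"), not a newline.
assert len(r"\n") == 2
assert r"\n" == "\\n"

# In a regular string "\n" is a single newline character.
assert len("\n") == 1
```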

Replace the ‘.’ with NaN

In [159]: d = {"a": list(range(4)), "b": list("ab.."), "c": ["a", "b", np.nan, "d"]}

In [160]: df = pd.DataFrame(d)

In [161]: df.replace(".", np.nan)
Out[161]:
   a    b    c
0  0    a    a
1  1    b    b
2  2  NaN  NaN
3  3  NaN    d

Replace the ‘.’ with NaN with a regular expression that removes surrounding whitespace

In [162]: df.replace(r"\s*\.\s*", np.nan, regex=True)
Out[162]:
   a    b    c
0  0    a    a
1  1    b    b
2  2  NaN  NaN
3  3  NaN    d

Replace with a list of regexes.

In [163]: df.replace([r"\.", r"(a)"], ["dot", r"\1stuff"], regex=True)
Out[163]:
   a       b       c
0  0  astuff  astuff
1  1       b       b
2  2     dot     NaN
3  3     dot       d

Replace with a regex in a mapping dict.

In [164]: df.replace({"b": r"\s*\.\s*"}, {"b": np.nan}, regex=True)
Out[164]:
   a    b    c
0  0    a    a
1  1    b    b
2  2  NaN  NaN
3  3  NaN    d

Pass nested dictionaries of regular expressions that use the regex keyword.

In [165]: df.replace({"b": {"b": r""}}, regex=True)
Out[165]:
   a  b    c
0  0  a    a
1  1       b
2  2  .  NaN
3  3  .    d

In [166]: df.replace(regex={"b": {r"\s*\.\s*": np.nan}})
Out[166]:
   a    b    c
0  0    a    a
1  1    b    b
2  2  NaN  NaN
3  3  NaN    d

In [167]: df.replace({"b": r"\s*(\.)\s*"}, {"b": r"\1ty"}, regex=True)
Out[167]:
   a    b    c
0  0    a    a
1  1    b    b
2  2  .ty  NaN
3  3  .ty    d

Pass a list of regular expressions that will replace matches with a scalar.

In [168]: df.replace([r"\s*\.\s*", r"a|b"], "placeholder", regex=True)
Out[168]:
   a            b            c
0  0  placeholder  placeholder
1  1  placeholder  placeholder
2  2  placeholder          NaN
3  3  placeholder            d

All of the regular expression examples can also be passed with the to_replace argument as the regex argument. In this case the value argument must be passed explicitly by name or regex must be a nested dictionary.

In [169]: df.replace(regex=[r"\s*\.\s*", r"a|b"], value="placeholder")
Out[169]:
   a            b            c
0  0  placeholder  placeholder
1  1  placeholder  placeholder
2  2  placeholder          NaN
3  3  placeholder            d

Note

A regular expression object from re.compile is a valid input as well.
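A minimal sketch of that (passing the compiled pattern through the regex argument; the frame here is illustrative):

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({"b": ["a", "b", " .", "."]})

# A pattern object from re.compile can be passed via the regex argument;
# cells matching the pattern are replaced with the given value.
pattern = re.compile(r"\s*\.\s*")
cleaned = df.replace(regex=pattern, value=np.nan)
print(cleaned)
```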
