python - 仅插入（或外推）熊猫数据框中的小间隙

Question

我有一个以时间为索引（1 分钟频率）和几列数据的 pandas DataFrame。有时数据包含 NaN。如果是这样，我只想在间隙不超过 5 分钟时进行插值。在这种情况下，这将是最多 5 个连续的 NaN。数据可能看起来像这样（几个测试用例，显示了问题）：

import numpy as np
import pandas as pd
from datetime import datetime

start = datetime(2014,2,21,14,50)
data = pd.DataFrame(index=[start + timedelta(minutes=1*x) for x in range(0, 8)],
                         data={'a': [123.5, np.NaN, 136.3, 164.3, 213.0, 164.3, 213.0, 221.1],
                               'b': [433.5, 523.2, 536.3, 464.3, 413.0, 164.3, 213.0, 221.1],
                               'c': [123.5, 132.3, 136.3, 164.3] + [np.NaN]*4,
                               'd': [np.NaN]*8,
                               'e': [np.NaN]*7 + [2330.3],
                               'f': [np.NaN]*4 + [2763.0, 2142.3, 2127.3, 2330.3],
                               'g': [2330.3] + [np.NaN]*7,
                               'h': [2330.3] + [np.NaN]*6 + [2777.7]})

它的内容如下：

In [147]: data
Out[147]: 
                         a      b      c   d       e       f       g       h
2014-02-21 14:50:00  123.5  433.5  123.5 NaN     NaN     NaN  2330.3  2330.3
2014-02-21 14:51:00    NaN  523.2  132.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:52:00  136.3  536.3  136.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:53:00  164.3  464.3  164.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:54:00  213.0  413.0    NaN NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:55:00  164.3  164.3    NaN NaN     NaN  2142.3     NaN     NaN
2014-02-21 14:56:00  213.0  213.0    NaN NaN     NaN  2127.3     NaN     NaN
2014-02-21 14:57:00  221.1  221.1    NaN NaN  2330.3  2330.3     NaN  2777.7

我知道，data.interpolate()但它有几个缺陷，因为它产生了这个结果，这对 ae 列有好处，但对于 fh 列，由于不同的原因它失败了::

                         a      b      c   d       e       f       g  \
2014-02-21 14:50:00  123.5  433.5  123.5 NaN     NaN     NaN  2330.3   
2014-02-21 14:51:00  129.9  523.2  132.3 NaN     NaN     NaN  2330.3   
2014-02-21 14:52:00  136.3  536.3  136.3 NaN     NaN     NaN  2330.3   
2014-02-21 14:53:00  164.3  464.3  164.3 NaN     NaN     NaN  2330.3   
2014-02-21 14:54:00  213.0  413.0  164.3 NaN     NaN  2763.0  2330.3   
2014-02-21 14:55:00  164.3  164.3  164.3 NaN     NaN  2142.3  2330.3   
2014-02-21 14:56:00  213.0  213.0  164.3 NaN     NaN  2127.3  2330.3   
2014-02-21 14:57:00  221.1  221.1  164.3 NaN  2330.3  2330.3  2330.3   

                               h  
2014-02-21 14:50:00  2330.300000  
2014-02-21 14:51:00  2394.214286  
2014-02-21 14:52:00  2458.128571  
2014-02-21 14:53:00  2522.042857  
2014-02-21 14:54:00  2585.957143  
2014-02-21 14:55:00  2649.871429  
2014-02-21 14:56:00  2713.785714  
2014-02-21 14:57:00  2777.700000

f) 间隙由开始时 4 分钟的 NaN 组成，它们应替换为该值 2763.0（即及时向后推断）

g) 间隔超过 5 分钟，但仍然可以推断

h) 间隙长于 5 分钟，但间隙仍被内插。

我理解这些原因，当然我没有指定它不应插入超过 5 分钟的间隔。我知道这interpolate只会在时间上向前推断，但我希望它也能在时间上向后推断。是否有任何已知的方法可以用来解决我的问题，而无需重新发明轮子？

编辑：该方法data.interpolate接受输入参数limit，该参数定义要被插值替换的连续 NaN 的最大数量。但这仍然会插值到极限，但在这种情况下，我想继续使用所有 NaN。

score 12 · Accepted Answer

所以这里有一个应该可以解决问题的面具。只需interpolate然后应用掩码将适当的值重置为 NaN。老实说，这比我意识到的要多一些工作，因为我必须遍历每一列，但是如果没有我提供一些像“ones”这样的虚拟列，groupby 就无法正常工作。

无论如何，我可以解释是否有任何不清楚的地方，但实际上只有几行有点难以理解。请参阅此处以获取有关该行的技巧的更多解释，df['new']或者仅打印出单独的行以更好地了解发生了什么。

mask = data.copy()
for i in list('abcdefgh'):
    df = pd.DataFrame( data[i] )
    df['new'] = ((df.notnull() != df.shift().notnull()).cumsum())
    df['ones'] = 1
    mask[i] = (df.groupby('new')['ones'].transform('count') < 5) | data[i].notnull()

In [7]: data
Out[7]: 
                         a      b      c   d       e       f       g       h
2014-02-21 14:50:00  123.5  433.5  123.5 NaN     NaN     NaN  2330.3  2330.3
2014-02-21 14:51:00    NaN  523.2  132.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:52:00  136.3  536.3  136.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:53:00  164.3  464.3  164.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:54:00  213.0  413.0    NaN NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:55:00  164.3  164.3    NaN NaN     NaN  2142.3     NaN     NaN
2014-02-21 14:56:00  213.0  213.0    NaN NaN     NaN  2127.3     NaN     NaN
2014-02-21 14:57:00  221.1  221.1    NaN NaN  2330.3  2330.3     NaN  2777.7

In [8]: mask
Out[8]: 
                        a     b     c      d      e     f      g      h
2014-02-21 14:50:00  True  True  True  False  False  True   True   True
2014-02-21 14:51:00  True  True  True  False  False  True  False  False
2014-02-21 14:52:00  True  True  True  False  False  True  False  False
2014-02-21 14:53:00  True  True  True  False  False  True  False  False
2014-02-21 14:54:00  True  True  True  False  False  True  False  False
2014-02-21 14:55:00  True  True  True  False  False  True  False  False
2014-02-21 14:56:00  True  True  True  False  False  True  False  False
2014-02-21 14:57:00  True  True  True  False   True  True  False   True

如果您在外推方面不做任何花哨的事情，从那里很容易：

In [9]: data.interpolate().bfill()[mask]
Out[9]: 
                         a      b      c   d       e       f       g       h
2014-02-21 14:50:00  123.5  433.5  123.5 NaN     NaN  2763.0  2330.3  2330.3
2014-02-21 14:51:00  129.9  523.2  132.3 NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:52:00  136.3  536.3  136.3 NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:53:00  164.3  464.3  164.3 NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:54:00  213.0  413.0  164.3 NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:55:00  164.3  164.3  164.3 NaN     NaN  2142.3     NaN     NaN
2014-02-21 14:56:00  213.0  213.0  164.3 NaN     NaN  2127.3     NaN     NaN
2014-02-21 14:57:00  221.1  221.1  164.3 NaN  2330.3  2330.3     NaN  2777.7

编辑添加： 通过将一些东西移到循环之外，这是一种更快（在这个示例数据上大约是 2 倍）和稍微简单的方法：

mask = data.copy()
grp = ((mask.notnull() != mask.shift().notnull()).cumsum())
grp['ones'] = 1
for i in list('abcdefgh'):
    mask[i] = (grp.groupby(i)['ones'].transform('count') < 5) | data[i].notnull()

score 3 · Accepted Answer

numpy在我找到上面的答案之前，我必须解决一个类似的问题并提出一个基于解决方案。因为我的代码大约是。快十倍，我在这里提供它是为了将来对某人有用。它在系列末尾处理 NaN 的方式与上述 JohnE 的解决方案不同。如果一个系列以 NaN 结尾，它会将最后一个间隙标记为无效。

这是代码：


def bfill_nan(arr):
    """ Backward-fill NaNs """
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[0]), mask.shape[0]-1)
    idx = np.minimum.accumulate(idx[::-1], axis=0)[::-1]
    out = arr[idx]
    return out

def calc_mask(arr, maxgap):
    """ Mask NaN gaps longer than `maxgap` """
    isnan = np.isnan(arr)
    cumsum = np.cumsum(isnan).astype('float')
    diff = np.zeros_like(arr)
    diff[~isnan] = np.diff(cumsum[~isnan], prepend=0)
    diff[isnan] = np.nan
    diff = bfill_nan(diff)
    return (diff < maxgap) | ~isnan


mask = data.copy()

for column_name in data:
    x = data[column_name].values
    mask[column_name] = calc_mask(x, 5)

print('data:')
print(data)

print('\nmask:')
print mask

输出：

data:
                         a      b      c   d       e       f       g       h
2014-02-21 14:50:00  123.5  433.5  123.5 NaN     NaN     NaN  2330.3  2330.3
2014-02-21 14:51:00    NaN  523.2  132.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:52:00  136.3  536.3  136.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:53:00  164.3  464.3  164.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:54:00  213.0  413.0    NaN NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:55:00  164.3  164.3    NaN NaN     NaN  2142.3     NaN     NaN
2014-02-21 14:56:00  213.0  213.0    NaN NaN     NaN  2127.3     NaN     NaN
2014-02-21 14:57:00  221.1  221.1    NaN NaN  2330.3  2330.3     NaN  2777.7

mask:
                        a     b      c      d      e     f      g      h
2014-02-21 14:50:00  True  True   True  False  False  True   True   True
2014-02-21 14:51:00  True  True   True  False  False  True  False  False
2014-02-21 14:52:00  True  True   True  False  False  True  False  False
2014-02-21 14:53:00  True  True   True  False  False  True  False  False
2014-02-21 14:54:00  True  True  False  False  False  True  False  False
2014-02-21 14:55:00  True  True  False  False  False  True  False  False
2014-02-21 14:56:00  True  True  False  False  False  True  False  False
2014-02-21 14:57:00  True  True  False  False   True  True  False   True

score 3 · Accepted Answer

根据下面使用的interpolate 文档 limit_area是 0.23.0 版中的新内容。我不确定这是否是 e 和 g 列的所需输出，因为您没有详细指定所需的输出。

import numpy as np
import pandas as pd
from datetime import datetime
from datetime import timedelta

start = datetime(2014,2,21,14,50)
df = data = pd.DataFrame(index=[start + timedelta(minutes=1*x) for x in range(0, 8)],
                         data={'a': [123.5, np.NaN, 136.3, 164.3, 213.0, 164.3, 213.0, 221.1],
                               'b': [433.5, 523.2, 536.3, 464.3, 413.0, 164.3, 213.0, 221.1],
                               'c': [123.5, 132.3, 136.3, 164.3] + [np.NaN]*4,
                               'd': [np.NaN]*8,
                               'e': [np.NaN]*7 + [2330.3],
                               'f': [np.NaN]*4 + [2763.0, 2142.3, 2127.3, 2330.3],
                               'g': [2330.3] + [np.NaN]*7,
                               'h': [2330.3] + [np.NaN]*6 + [2777.7]})

df.interpolate(
    limit=5,
    inplace=True,
    limit_direction='both',
    limit_area='outside',
    )

print(df)

输出：

                         a      b      c   d       e       f       g       h
2014-02-21 14:50:00  123.5  433.5  123.5 NaN     NaN  2763.0  2330.3  2330.3
2014-02-21 14:51:00    NaN  523.2  132.3 NaN     NaN  2763.0  2330.3     NaN
2014-02-21 14:52:00  136.3  536.3  136.3 NaN  2330.3  2763.0  2330.3     NaN
2014-02-21 14:53:00  164.3  464.3  164.3 NaN  2330.3  2763.0  2330.3     NaN
2014-02-21 14:54:00  213.0  413.0  164.3 NaN  2330.3  2763.0  2330.3     NaN
2014-02-21 14:55:00  164.3  164.3  164.3 NaN  2330.3  2142.3  2330.3     NaN
2014-02-21 14:56:00  213.0  213.0  164.3 NaN  2330.3  2127.3     NaN     NaN
2014-02-21 14:57:00  221.1  221.1  164.3 NaN  2330.3  2330.3     NaN  2777.7

score 1 · Accepted Answer

我继续将@JohnE 的解决方案改编成一个函数（进行了一些调整/改进）。我正在使用 Python 3.8，并且我相信 3.9 的类型提示已更改，因此您可能必须适应。

from typing import Union

def fill_with_hard_limit(
        df_or_series: Union[pd.DataFrame, pd.Series], limit: int,
        fill_method='interpolate',
        **fill_method_kwargs) -> Union[pd.DataFrame, pd.Series]:
    """The fill methods from Pandas such as ``interpolate`` or ``bfill``
    will fill ``limit`` number of NaNs, even if the total number of
    consecutive NaNs is larger than ``limit``. This function instead
    does not fill any data when the number of consecutive NaNs
    is > ``limit``.

    Adapted from: https://stackoverflow.com/a/30538371/11052174

    :param df_or_series: DataFrame or Series to perform interpolation
        on.
    :param limit: Maximum number of consecutive NaNs to allow. Any
        occurrences of more consecutive NaNs than ``limit`` will have no
        filling performed.
    :param fill_method: Filling method to use, e.g. 'interpolate',
        'bfill', etc.
    :param fill_method_kwargs: Keyword arguments to pass to the
        fill_method, in addition to the given limit.

    :returns: A filled version of the given df_or_series according
        to the given inputs.
    """

    # Keep things simple, ensure we have a DataFrame.
    try:
        df = df_or_series.to_frame()
    except AttributeError:
        df = df_or_series

    # Initialize our mask.
    mask = pd.DataFrame(True, index=df.index, columns=df.columns)

    # Get cumulative sums of consecutive NaNs.
    grp = (df.notnull() != df.shift().notnull()).cumsum()

    # Add columns of ones.
    grp['ones'] = 1

    # Loop through columns and update the mask.
    for col in df.columns:

        mask.loc[:, col] = (
                (grp.groupby(col)['ones'].transform('count') <= limit)
                | df[col].notnull()
        )

    # Now, interpolate and use the mask to create NaNs for the larger
    # gaps.
    method = getattr(df, fill_method)
    out = method(limit=limit, **fill_method_kwargs)[mask]

    # Be nice to the caller and return a Series if that's what they
    # provided.
    if isinstance(df_or_series, pd.Series):
        # Return a Series.
        return out.loc[:, out.columns[0]]

    return out

用法：

>>> data_filled = fill_with_hard_limit(data, 5)
>>> data_filled
                         a      b      c   d       e       f       g       h
2014-02-21 14:50:00  123.5  433.5  123.5 NaN     NaN     NaN  2330.3  2330.3
2014-02-21 14:51:00  129.9  523.2  132.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:52:00  136.3  536.3  136.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:53:00  164.3  464.3  164.3 NaN     NaN     NaN     NaN     NaN
2014-02-21 14:54:00  213.0  413.0  164.3 NaN     NaN  2763.0     NaN     NaN
2014-02-21 14:55:00  164.3  164.3  164.3 NaN     NaN  2142.3     NaN     NaN
2014-02-21 14:56:00  213.0  213.0  164.3 NaN     NaN  2127.3     NaN     NaN
2014-02-21 14:57:00  221.1  221.1  164.3 NaN  2330.3  2330.3     NaN  2777.7

python - 仅插入（或外推）熊猫数据框中的小间隙

4 回答 4

Related

Reference