parallel_apply

parallel_apply(data, func, show_progress=True, threads=None, desc='Processing...', **kwargs)

The parallel_apply function parallelizes the application of a function on grouped dataframes using concurrent.futures.

Parameters

Name Type Description Default
data pd.core.groupby.generic.DataFrameGroupBy The data parameter is a Pandas DataFrameGroupBy object, which is the result of grouping a DataFrame by one or more columns. It represents the grouped data that you want to apply the function to. required
func Callable The func parameter is the function that you want to apply to each group in the grouped dataframe. This function should take a single argument, which is a dataframe representing a group, and return a result. The result can be a scalar value, a pandas Series, or a pandas DataFrame. required
show_progress bool A boolean parameter that determines whether to display progress using tqdm. If set to True, progress will be displayed. If set to False, progress will not be displayed. True
threads int The threads parameter specifies the number of threads to use for parallel processing. If threads is set to None, it will use all available processors. If threads is set to -1, it will use all available processors as well. None
**kwargs The **kwargs parameter is a dictionary of keyword arguments that are passed to the func function. {}

Returns

Type Description
pd.DataFrame The parallel_apply function returns a combined result after applying the specified function on all groups in the grouped dataframe. The result can be a pandas DataFrame or a pandas Series, depending on the function applied.

Examples:

# Example 1 - Single argument returns Series

import pytimetk as tk
import pandas as pd   

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
    'B': [1, 2, 3, 4, 5, 6]
})

grouped = df.groupby('A')

result = grouped.apply(lambda df: df['B'].sum())
result

result = tk.parallel_apply(grouped, lambda df: df['B'].sum(), show_progress=True, threads=2)
result
A
bar    12
foo     9
dtype: int64
# Example 2 - Multiple arguments returns MultiIndex DataFrame

import pytimetk as tk
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'foo', 'bar', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'one', 'two', 'two', 'two', 'one', 'two'],
    'C': [1, 3, 5, 7, 9, 2, 4, 6]
})

def calculate(group):
    return pd.DataFrame({
        'sum': [group['C'].sum()],
        'mean': [group['C'].mean()]
    })

grouped = df.groupby(['A', 'B'])

result = grouped.apply(calculate)
result

result = tk.parallel_apply(grouped, calculate, show_progress=True)
result
sum mean
A B
bar one 0 5 5.000000
two 0 9 4.500000
foo one 0 8 2.666667
two 0 15 7.500000
# Example 3 - Multiple arguments returns MultiIndex DataFrame

import pytimetk as tk
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'foo', 'bar', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'one', 'two', 'two', 'two', 'one', 'two'],
    'C': [1, 3, 5, 7, 9, 2, 4, 6]
})

def calculate(group):
    return group.head(2)

grouped = df.groupby(['A', 'B'])

result = grouped.apply(calculate)
result

result = tk.parallel_apply(grouped, calculate, show_progress=True)
result
A B C
A B
bar one 2 bar one 5
two 3 bar two 7
5 bar two 2
foo one 0 foo one 1
1 foo one 3
two 4 foo two 9
7 foo two 6
# Example 4 - Single Grouping Column Returns DataFrame

import pytimetk as tk
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'foo', 'bar', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': [1, 3, 5, 7, 9, 2, 4, 6]
})

def calculate(group):
    return pd.DataFrame({
        'sum': [group['B'].sum()],
        'mean': [group['B'].mean()]
    })

grouped = df.groupby(['A'])

result = grouped.apply(calculate)
result

result = tk.parallel_apply(grouped, calculate, show_progress=True)
result
sum mean
A
bar 0 14 4.666667
foo 0 23 4.600000