For machine learning algorithms to work well, it is usually useful to remove noise from the features. For time series this can be achieved in several ways, such as moving averages, a sign transform, or a low-pass filter. Another, simpler way is to quantilize the features, i.e. replace each feature value with the index of the quantile bucket it falls into.

There are several caveats to doing this in Python, however. When a feature has only a few distinct values, pandas' qcut function cannot split tied values across buckets, so the buckets end up unequal in size. If the ties make bin edges coincide, qcut raises an error unless duplicates='drop' is passed, in which case fewer buckets are returned. There are two ways to deal with this:
1) rank the feature values first with the rank() function, using method='first'
2) add a small amount of random noise to the feature
The second method is more universal, as it moves tied feature values into different quantile buckets at random and thus keeps the labels symmetric. In some applications, though, this may not matter.
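The effect of ties, and of the two workarounds listed above, can be seen on a tiny example. This is only a sketch: the series `x`, the seed, and the noise scale of 1e-10 are made-up values chosen for illustration.

```python
import numpy as np
import pandas as pd

x = pd.Series([1, 1, 1, 1, 2, 2, 3, 3, 3])  # made-up feature with heavy ties
nq = 3

# Plain qcut: duplicate bin edges are dropped, leaving fewer, unequal buckets.
plain = pd.qcut(x, nq, labels=False, duplicates='drop')

# Workaround 1: rank first; method='first' breaks ties by order of appearance.
ranked = pd.qcut(x.rank(method='first'), nq, labels=False)

# Workaround 2: add tiny Gaussian noise so that ties are broken at random.
rng = np.random.default_rng(0)  # assumed seed, only for reproducibility
noisy = pd.qcut(x + 1e-10 * rng.standard_normal(len(x)), nq, labels=False)

# Print the bucket sizes produced by each approach.
for name, labels in [('plain', plain), ('rank', ranked), ('noise', noisy)]:
    print(name, labels.value_counts().sort_index().tolist())
```

On this input the plain call yields only two buckets, holding 6 and 3 values, while both workarounds yield three buckets of 3 values each. The difference between the workarounds is which of the tied values land in which bucket: ranking assigns them by position in the series, noise assigns them at random.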
To experiment with different ways to quantilize the data we can use the following function:
    import numpy as np
    import pandas as pd

    def addqs(df, nq=5, lags=None, cols=None, inplace=False, method='randn', retfnames=False):
        """Add centred quantile-label columns; nq is a bucket count or a list of quantile edges."""
        # With method='qcut' and edges such as [0, 0.05, 0.95, 1], ties can produce
        # many -1 / 1 labels instead of 0; 'randn' should compensate on average.
        df = df if inplace else df.copy()
        fnames = []
        if cols is None:
            cols = list(df.columns)  # snapshot, since new columns are added below
        for col in cols:
            if lags is None:
                # Quantilize the whole column at once and keep every label.
                dftemp = df[[col]]
                kwargs = {}
                iloc = list(range(len(dftemp)))
            else:
                # Quantilize each rolling window and keep the label of its last value.
                dftemp = df[col].rolling(lags)
                kwargs = {'raw': False}
                iloc = -1
            # Offset that centres the labels around 0, e.g. 0, 1, 2 -> -1, 0, 1.
            if isinstance(nq, list):
                norm = (len(nq) - 1) // 2
            else:
                norm = (nq - 1) // 2
            if method == 'qcut':
                res = dftemp.apply(
                    lambda x: pd.qcut(x, nq, labels=False, duplicates='drop').iloc[iloc] - norm,
                    **kwargs)
            elif method == 'rank':
                res = dftemp.apply(
                    lambda x: pd.qcut(x.rank(method='first'), nq, labels=False,
                                      duplicates='drop').iloc[iloc] - norm,
                    **kwargs)
            elif method == 'randn':
                res = dftemp.apply(
                    lambda x: pd.qcut(x + 1e-10 * np.random.randn(len(x)), nq, labels=False,
                                      duplicates='drop').iloc[iloc] - norm,
                    **kwargs)
            # Column name encodes method, nq and lags, e.g. 'qrank3,5.x'.
            fname = ('q' + method
                     + str(nq).replace("[", "").replace("]", "").replace(" ", "")
                     + ',' + str(lags) + '.' + col)
            df[fname] = res
            fnames.append(fname)
        if retfnames:
            return df, fnames
        return df

    pd.DataFrame.addqs = addqs  # expose the function as a DataFrame method
    DF = pd.DataFrame
After defining this function and attaching it to DataFrame as a method, we can use it like this:
    xint = np.random.randint(0, 10, 20)
    DF({'x': xint}).addqs(nq=3, lags=5, method='rank')
In this example we used nq (number of quantiles) = 3 and lags = 5, i.e. a rolling window of 5 observations.
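To make the rolling case concrete, here is a minimal standalone sketch of what the 'rank' method computes for each window: the quantile label of the window's last value, shifted so that the labels are centred on 0. The helper name `last_label` and the sample series are hypothetical and only serve the illustration.

```python
import pandas as pd

def last_label(window: pd.Series, nq: int = 3) -> float:
    """Centred quantile label of the last value within one rolling window."""
    ranked = window.rank(method='first')        # break ties by order of appearance
    labels = pd.qcut(ranked, nq, labels=False)  # bucket indices 0 .. nq-1
    return labels.iloc[-1] - (nq - 1) // 2      # shift so the middle bucket is 0

x = pd.Series([4, 1, 7, 3, 9, 2, 8])  # made-up values
centered = x.rolling(5).apply(last_label, raw=False)
print(centered.tolist())
```

The first four entries are NaN because the window is not yet full. After that the labels are 1, -1 and 1: the value 9 is the largest in its window, 2 is the second smallest, and 8 is the second largest.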
Below we compare all 3 quantilization methods for lags=5:
We can see that the randn method generates more equal quantiles than the other methods.
And now the same 3 methods for lags=None, i.e. quantilization over the whole time series from its start.
The code to generate the previous DataFrame and histograms in Jupyter is the following:
    lagsN = None  # or 5
    df = DF({'x': xint}).addqs(nq=3, lags=lagsN, method='qcut')
    for lags in [lagsN]:  # [None, 5]
        for nq in [3]:  # or a list of quantile edges such as [0, 0.05, 0.95, 1]
            for method in ['rank', 'qcut', 'randn']:
                # Columns already present in df get the suffix 'w' on the right side.
                df = df.join(DF({'x': xint}).addqs(nq=nq, lags=lags, method=method), rsuffix='w')
    df = df.set_index('x')
    # Drop the duplicated columns that were marked with the 'w' suffix.
    df = df.drop(columns=df.columns[list(df.columns.str.contains('w'))])
    df.hist()