Generating random dates within a given range in pandas
PythonPandasDatetimeRandomPython Problem Overview
This is a self-answered post. A common problem is to randomly generate dates between a given start and end date.
There are two cases to consider:
- random dates with a time component, and
- random dates without time
For example, given some start date 2015-01-01
and an end date 2018-01-01
, how can I sample N random dates between this range using pandas?
Python Solutions
Solution 1 - Python
Is converting to the unix timestamp acceptable?
def random_dates(start, end, n=10):
start_u = start.value//10**9
end_u = end.value//10**9
return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
Sample run:
start = pd.to_datetime('2015-01-01')
end = pd.to_datetime('2018-01-01')
random_dates(start, end)
DatetimeIndex(['2016-10-08 07:34:13', '2015-11-15 06:12:48',
'2015-01-24 10:11:04', '2015-03-26 16:23:53',
'2017-04-01 00:38:21', '2015-05-15 03:47:54',
'2015-06-24 07:32:32', '2015-11-10 20:39:36',
'2016-07-25 05:48:09', '2015-03-19 16:05:19'],
dtype='datetime64[ns]', freq=None)
EDIT:
As per the comment by @smci, I wrote a function to accommodate both 1 and 2 with a little explanation inside the function itself.
def random_datetimes_or_dates(start, end, out_format='datetime', n=10):
'''
unix timestamp is in ns by default.
I divide the unix time value by 10**9 to make it seconds (or 24*60*60*10**9 to make it days).
The corresponding unit variable is passed to the pd.to_datetime function.
Values for the (divide_by, unit) pair to select is defined by the out_format parameter.
for 1 -> out_format='datetime'
for 2 -> out_format=anything else
'''
(divide_by, unit) = (10**9, 's') if out_format=='datetime' else (24*60*60*10**9, 'D')
start_u = start.value//divide_by
end_u = end.value//divide_by
return pd.to_datetime(np.random.randint(start_u, end_u, n), unit=unit)
Sample run:
random_datetimes_or_dates(start, end, out_format='datetime')
DatetimeIndex(['2017-01-30 05:14:27', '2016-10-18 21:17:16',
'2016-10-20 08:38:02', '2015-09-02 00:03:08',
'2015-06-04 02:38:12', '2016-02-19 05:22:01',
'2015-11-06 10:37:10', '2017-12-17 03:26:02',
'2017-11-20 06:51:32', '2016-01-02 02:48:03'],
dtype='datetime64[ns]', freq=None)
random_datetimes_or_dates(start, end, out_format='not datetime')
DatetimeIndex(['2017-05-10', '2017-12-31', '2017-11-10', '2015-05-02',
'2016-04-11', '2015-11-27', '2015-03-29', '2017-05-21',
'2015-05-11', '2017-02-08'],
dtype='datetime64[ns]', freq=None)
Solution 2 - Python
###np.random.randn
+ to_timedelta
This addresses Case (1). You can do this by generating a random array of timedelta
objects and adding them to your start
date.
def random_dates(start, end, n, unit='D', seed=None):
if not seed: # from piR's answer
np.random.seed(0)
ndays = (end - start).days + 1
return pd.to_timedelta(np.random.rand(n) * ndays, unit=unit) + start
>>> np.random.seed(0)
>>> start = pd.to_datetime('2015-01-01')
>>> end = pd.to_datetime('2018-01-01')
>>> random_dates(start, end, 10)
DatetimeIndex([ '2016-08-25 01:09:42.969600',
'2017-02-23 13:30:20.304000',
'2016-10-23 05:33:15.033600',
'2016-08-20 17:41:04.012799999',
'2016-04-09 17:59:00.815999999',
'2016-12-09 13:06:00.748800',
'2016-04-25 00:47:45.974400',
'2017-09-05 06:35:58.444800',
'2017-11-23 03:18:47.347200',
'2016-02-25 15:14:53.894400'],
dtype='datetime64[ns]', freq=None)
This will generate dates with a time component as well.
Sadly, rand
does not support a replace=False
, so if you want unique dates, you'll need a two-step process of 1) generate the non-unique days component, and 2) generate the unique seconds/milliseconds component, then add the two together.
np.random.randint
+ to_timedelta
This addresses Case (2). You can modify random_dates
above to generate random integers instead of random floats:
def random_dates2(start, end, n, unit='D', seed=None):
if not seed: # from piR's answer
np.random.seed(0)
ndays = (end - start).days + 1
return start + pd.to_timedelta(
np.random.randint(0, ndays, n), unit=unit
)
>>> random_dates2(start, end, 10)
DatetimeIndex(['2016-11-15', '2016-07-13', '2017-04-15', '2017-02-02',
'2017-10-30', '2015-10-05', '2016-08-22', '2017-12-30',
'2016-08-23', '2015-11-11'],
dtype='datetime64[ns]', freq=None)
To generate dates with other frequencies, the functions above can be called with a different value for unit
. Additionally, you can add a parameter freq
and tweak your function call as needed.
If you want unique random dates, you can use np.random.choice
with replace=False
:
def random_dates2_unique(start, end, n, unit='D', seed=None):
if not seed: # from piR's answer
np.random.seed(0)
ndays = (end - start).days + 1
return start + pd.to_timedelta(
np.random.choice(ndays, n, replace=False), unit=unit
)
Performance
Going to benchmark just the methods that address Case (1), since Case (2) is really a special case which any method can get to using dt.floor
.
def cs(start, end, n):
ndays = (end - start).days + 1
return pd.to_timedelta(np.random.rand(n) * ndays, unit='D') + start
def akilat90(start, end, n):
start_u = start.value//10**9
end_u = end.value//10**9
return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
def piR(start, end, n):
dr = pd.date_range(start, end, freq='H') # can't get better than this :-(
return pd.to_datetime(np.sort(np.random.choice(dr, n, replace=False)))
def piR2(start, end, n):
dr = pd.date_range(start, end, freq='H')
a = np.arange(len(dr))
b = np.sort(np.random.permutation(a)[:n])
return dr[b]
Benchmarking Code
from timeit import timeit
import pandas as pd
import matplotlib.pyplot as plt
res = pd.DataFrame(
index=['cs', 'akilat90', 'piR', 'piR2'],
columns=[10, 20, 50, 100, 200, 500, 1000, 2000, 5000],
dtype=float
)
for f in res.index:
for c in res.columns:
np.random.seed(0)
start = pd.to_datetime('2015-01-01')
end = pd.to_datetime('2018-01-01')
stmt = '{}(start, end, c)'.format(f)
setp = 'from __main__ import start, end, c, {}'.format(f)
res.at[f, c] = timeit(stmt, setp, number=30)
ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N");
ax.set_ylabel("time (relative)");
plt.show()
Solution 3 - Python
We can speed up @akilat90's approach about twofold (in @coldspeed's benchmark) by using the fact that datetime64
is just a rebranded int64
hence we can view-cast:
def pp(start, end, n):
start_u = start.value//10**9
end_u = end.value//10**9
return pd.DatetimeIndex((10**9*np.random.randint(start_u, end_u, n, dtype=np.int64)).view('M8[ns]'))
Solution 4 - Python
numpy.random.choice
You can leverage Numpy's random choice. choice
may be problematic over large data_ranges
. For example, too large will result in a MemoryError. It requires storing the entire thing in order to select random bits.
random_dates('2015-01-01', '2018-01-01', 10, 'ns', seed=[3, 1415])
MemoryError
Also, this requires a sort.
def random_dates(start, end, n, freq, seed=None):
if seed is not None:
np.random.seed(seed)
dr = pd.date_range(start, end, freq=freq)
return pd.to_datetime(np.sort(np.random.choice(dr, n, replace=False)))
random_dates('2015-01-01', '2018-01-01', 10, 'H', seed=[3, 1415])
DatetimeIndex(['2015-04-24 02:00:00', '2015-11-26 23:00:00',
'2016-01-18 00:00:00', '2016-06-27 22:00:00',
'2016-08-12 17:00:00', '2016-10-21 11:00:00',
'2016-11-07 11:00:00', '2016-12-09 23:00:00',
'2017-02-20 01:00:00', '2017-06-17 18:00:00'],
dtype='datetime64[ns]', freq=None)
numpy.random.permutation
Similar to other answer. However, I like this answer as it slices the datetimeindex
produced by date_range
and automatically returns another datetimeindex
.
def random_dates_2(start, end, n, freq, seed=None):
if seed is not None:
np.random.seed(seed)
dr = pd.date_range(start, end, freq=freq)
a = np.arange(len(dr))
b = np.sort(np.random.permutation(a)[:n])
return dr[b]
Solution 5 - Python
Just my two cents, using date_range and sample:
def random_dates(start, end, n, seed=1, replace=False):
dates = pd.date_range(start, end).to_series()
return dates.sample(n, replace=replace, random_state=seed)
random_dates("20170101","20171223", 10, seed=1)
Out[29]:
2017-10-01 2017-10-01
2017-08-23 2017-08-23
2017-11-30 2017-11-30
2017-06-15 2017-06-15
2017-11-18 2017-11-18
2017-10-31 2017-10-31
2017-07-31 2017-07-31
2017-03-07 2017-03-07
2017-09-09 2017-09-09
2017-10-15 2017-10-15
dtype: datetime64[ns]
Solution 6 - Python
I found A new base library generated the range of the date, seems like on my side a bit faster than pandas.data_range
, credit from this answer
from dateutil.rrule import rrule, DAILY
import datetime, random
def pick(start,end,n):
return (random.sample(list(rrule(DAILY, dtstart=start,until=end)),n))
pick(datetime.datetime(2010, 2, 1, 0, 0),datetime.datetime(2010, 2, 5, 0, 0),2)
[datetime.datetime(2010, 2, 3, 0, 0), datetime.datetime(2010, 2, 2, 0, 0)]
Solution 7 - Python
Thats some alternative way :D Maybe someone will need it.
from datetime import datetime
import random
import numpy as np
import pandas as pd
N = 10 #N-samples
dates = np.zeros([N,3])
for i in range(0,N):
year = random.randint(1970, 2010)
month = random.randint(1, 12)
day = random.randint(1, 28)
#if you need to change it use variables :3
birth_date = datetime(year, month, day)
dates[i] = [year,month,day]
df = pd.DataFrame(dates.astype(int))
df.columns = ['year', 'month', 'day']
pd.to_datetime(df)
Result:
0 1999-08-22
1 1989-04-27
2 1978-10-01
3 1998-12-09
4 1979-04-19
5 1988-03-22
6 1992-03-02
7 1993-04-28
8 1978-10-04
9 1972-01-13
dtype: datetime64[ns]
Solution 8 - Python
I think this is an easier solution for just creating a date field in a pandas DateFrame
list1 = []
for x in range(0,365):
list1.append(x)
date = pd.DataFrame(pd.to_datetime(list1, unit='D',origin=pd.Timestamp('2018-01-01')))