RandomRDDs

class pyspark.mllib.random.RandomRDDs

Generator methods for creating RDDs comprised of i.i.d. samples from some distribution.

New in version 1.1.0.

Methods

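`exponentialRDD`(sc, mean, size[, ...])	Generates an RDD comprised of i.i.d. samples from the Exponential distribution with the input mean.
`exponentialVectorRDD`(sc, mean, numRows, numCols)	Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Exponential distribution with the input mean.
`gammaRDD`(sc, shape, scale, size[, ...])	Generates an RDD comprised of i.i.d. samples from the Gamma distribution with the input shape and scale.
`gammaVectorRDD`(sc, shape, scale, numRows, ...)	Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Gamma distribution.
`logNormalRDD`(sc, mean, std, size[, ...])	Generates an RDD comprised of i.i.d. samples from the log normal distribution with the input mean and standard deviation.
`logNormalVectorRDD`(sc, mean, std, numRows, ...)	Generates an RDD comprised of vectors containing i.i.d. samples drawn from the log normal distribution.
`normalRDD`(sc, size[, numPartitions, seed])	Generates an RDD comprised of i.i.d. samples from the standard normal distribution.
`normalVectorRDD`(sc, numRows, numCols[, ...])	Generates an RDD comprised of vectors containing i.i.d. samples drawn from the standard normal distribution.
`poissonRDD`(sc, mean, size[, numPartitions, seed])	Generates an RDD comprised of i.i.d. samples from the Poisson distribution with the input mean.
`poissonVectorRDD`(sc, mean, numRows, numCols)	Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Poisson distribution with the input mean.
`uniformRDD`(sc, size[, numPartitions, seed])	Generates an RDD comprised of i.i.d. samples from the uniform distribution U(0.0, 1.0).
`uniformVectorRDD`(sc, numRows, numCols[, ...])	Generates an RDD comprised of vectors containing i.i.d. samples drawn from the uniform distribution U(0.0, 1.0).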

Methods Documentation

static exponentialRDD(sc, mean, size, numPartitions=None, seed=None)

Generates an RDD comprised of i.i.d. samples from the Exponential distribution with the input mean.

New in version 1.3.0.

Parameters

sc (pyspark.SparkContext): SparkContext used to create the RDD.
mean (float): Mean, or 1 / lambda, for the Exponential distribution.
size (int): Size of the RDD.
numPartitions (int, optional): Number of partitions in the RDD (default: sc.defaultParallelism).
seed (int, optional): Random seed (default: a random long integer).

Returns

pyspark.RDD: RDD of float comprised of i.i.d. samples ~ Exp(mean).

Examples

>>> mean = 2.0
>>> x = RandomRDDs.exponentialRDD(sc, mean, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - mean) < 0.5
True
>>> abs(stats.stdev() - mean) < 0.5  # the std of Exp(mean) equals its mean
True
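
The `mean` parameter is the reciprocal of the rate lambda, and the standard deviation of an Exponential distribution equals its mean. The sketch below checks both facts locally with Python's standard `random` module rather than Spark. `sample_exponential` is a hypothetical helper written for this illustration, not part of `RandomRDDs`.

```python
import random
import statistics


def sample_exponential(mean, size, seed=None):
    """Local stand-in (not Spark): draw `size` i.i.d. samples ~ Exp(mean)."""
    rng = random.Random(seed)
    rate = 1.0 / mean  # random.expovariate takes the rate lambda, not the mean
    return [rng.expovariate(rate) for _ in range(size)]


samples = sample_exponential(mean=2.0, size=10_000, seed=2)
sample_mean = statistics.fmean(samples)
sample_std = statistics.pstdev(samples)
# Both statistics should land near 2.0, because std == mean for Exp.
print(sample_mean, sample_std)
```

The exact numbers depend on the generator, so they will not match what `exponentialRDD` produces for the same seed.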

static exponentialVectorRDD(sc, mean, numRows, numCols, numPartitions=None, seed=None)

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Exponential distribution with the input mean.

New in version 1.3.0.

Parameters

sc (pyspark.SparkContext): SparkContext used to create the RDD.
mean (float): Mean, or 1 / lambda, for the Exponential distribution.
numRows (int): Number of Vectors in the RDD.
numCols (int): Number of elements in each Vector.
numPartitions (int, optional): Number of partitions in the RDD (default: sc.defaultParallelism).
seed (int, optional): Random seed (default: a random long integer).

Returns

pyspark.RDD: RDD of Vector with vectors containing i.i.d. samples ~ Exp(mean).

Examples

>>> import numpy as np
>>> mean = 0.5
>>> rdd = RandomRDDs.exponentialVectorRDD(sc, mean, 100, 100, seed=1)
>>> mat = np.matrix(rdd.collect())
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - mean) < 0.5
True
>>> abs(mat.std() - mean) < 0.5  # the std of Exp(mean) equals its mean
True

static gammaRDD(sc, shape, scale, size, numPartitions=None, seed=None)

Generates an RDD comprised of i.i.d. samples from the Gamma distribution with the input shape and scale.

New in version 1.3.0.

Parameters

sc (pyspark.SparkContext): SparkContext used to create the RDD.
shape (float): Shape (> 0) parameter for the Gamma distribution.
scale (float): Scale (> 0) parameter for the Gamma distribution.
size (int): Size of the RDD.
numPartitions (int, optional): Number of partitions in the RDD (default: sc.defaultParallelism).
seed (int, optional): Random seed (default: a random long integer).

Returns

pyspark.RDD: RDD of float comprised of i.i.d. samples ~ Gamma(shape, scale).

Examples

>>> from math import sqrt
>>> shape = 1.0
>>> scale = 2.0
>>> expMean = shape * scale
>>> expStd = sqrt(shape * scale * scale)
>>> x = RandomRDDs.gammaRDD(sc, shape, scale, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - expMean) < 0.5
True
>>> abs(stats.stdev() - expStd) < 0.5
True
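
The example above relies on two moments of Gamma(shape, scale): the mean is shape * scale and the standard deviation is sqrt(shape) * scale. The following sketch computes them and compares against a local sample drawn with Python's standard `random.gammavariate`, used here as a stand-in for Spark. `gamma_moments` is a hypothetical helper name.

```python
import math
import random
import statistics


def gamma_moments(shape, scale):
    """Return (mean, std) of Gamma(shape, scale)."""
    return shape * scale, math.sqrt(shape) * scale


shape, scale = 1.0, 2.0
exp_mean, exp_std = gamma_moments(shape, scale)

# random.gammavariate(alpha, beta) takes alpha as the shape and beta as the
# scale, which matches the (shape, scale) order used by gammaRDD.
rng = random.Random(2)
samples = [rng.gammavariate(shape, scale) for _ in range(10_000)]
print(exp_mean, statistics.fmean(samples))
print(exp_std, statistics.pstdev(samples))
```

With shape = 1.0 the Gamma distribution reduces to an Exponential distribution whose mean is the scale, which is why both expected moments equal 2.0 here.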

static gammaVectorRDD(sc, shape, scale, numRows, numCols, numPartitions=None, seed=None)

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Gamma distribution.

New in version 1.3.0.

Parameters

sc (pyspark.SparkContext): SparkContext used to create the RDD.
shape (float): Shape (> 0) of the Gamma distribution.
scale (float): Scale (> 0) of the Gamma distribution.
numRows (int): Number of Vectors in the RDD.
numCols (int): Number of elements in each Vector.
numPartitions (int, optional): Number of partitions in the RDD (default: sc.defaultParallelism).
seed (int, optional): Random seed (default: a random long integer).

Returns

pyspark.RDD: RDD of Vector with vectors containing i.i.d. samples ~ Gamma(shape, scale).

Examples

>>> import numpy as np
>>> from math import sqrt
>>> shape = 1.0
>>> scale = 2.0
>>> expMean = shape * scale
>>> expStd = sqrt(shape * scale * scale)
>>> mat = np.matrix(RandomRDDs.gammaVectorRDD(sc, shape, scale, 100, 100, seed=1).collect())
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - expMean) < 0.1
True
>>> abs(mat.std() - expStd) < 0.1
True

static logNormalRDD(sc, mean, std, size, numPartitions=None, seed=None)

Generates an RDD comprised of i.i.d. samples from the log normal distribution with the input mean and standard deviation.

New in version 1.3.0.

Parameters

sc (pyspark.SparkContext): SparkContext used to create the RDD.
mean (float): Mean of the underlying normal distribution, i.e. of the logarithm of each sample.
std (float): Standard deviation of the underlying normal distribution, i.e. of the logarithm of each sample.
size (int): Size of the RDD.
numPartitions (int, optional): Number of partitions in the RDD (default: sc.defaultParallelism).
seed (int, optional): Random seed (default: a random long integer).

Returns

pyspark.RDD: RDD of float comprised of i.i.d. samples ~ log N(mean, std).

Examples

>>> from math import sqrt, exp
>>> mean = 0.0
>>> std = 1.0
>>> expMean = exp(mean + 0.5 * std * std)
>>> expStd = sqrt((exp(std * std) - 1.0) * exp(2.0 * mean + std * std))
>>> x = RandomRDDs.logNormalRDD(sc, mean, std, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - expMean) < 0.5
True
>>> abs(stats.stdev() - expStd) < 0.5
True
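
The `expMean` and `expStd` expressions in the example are the moments of exp(Z) for Z ~ N(mean, std^2). The sketch below wraps them in a hypothetical helper, `lognormal_moments`, and uses Python's standard `random.lognormvariate` as a local stand-in for Spark to confirm that the logarithm of each sample follows N(mean, std^2).

```python
import math
import random
import statistics


def lognormal_moments(mean, std):
    """Mean and std of exp(Z), where Z ~ N(mean, std^2)."""
    variance = std * std
    exp_mean = math.exp(mean + 0.5 * variance)
    exp_std = math.sqrt((math.exp(variance) - 1.0) * math.exp(2.0 * mean + variance))
    return exp_mean, exp_std


exp_mean, exp_std = lognormal_moments(0.0, 1.0)

# Local stand-in for logNormalRDD: take logs and check the normal parameters.
rng = random.Random(2)
samples = [rng.lognormvariate(0.0, 1.0) for _ in range(10_000)]
logs = [math.log(x) for x in samples]
log_mean = statistics.fmean(logs)
log_std = statistics.pstdev(logs)
print(exp_mean, exp_std)
print(log_mean, log_std)  # should be near the inputs 0.0 and 1.0
```

Checking the logs is more stable than checking the raw sample standard deviation, because the log normal distribution has a heavy right tail.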

static logNormalVectorRDD(sc, mean, std, numRows, numCols, numPartitions=None, seed=None)

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the log normal distribution.

New in version 1.3.0.

Parameters

sc (pyspark.SparkContext): SparkContext used to create the RDD.
mean (float): Mean of the underlying normal distribution, i.e. of the logarithm of each sample.
std (float): Standard deviation of the underlying normal distribution, i.e. of the logarithm of each sample.
numRows (int): Number of Vectors in the RDD.
numCols (int): Number of elements in each Vector.
numPartitions (int, optional): Number of partitions in the RDD (default: sc.defaultParallelism).
seed (int, optional): Random seed (default: a random long integer).

Returns

pyspark.RDD: RDD of Vector with vectors containing i.i.d. samples ~ log N(mean, std).

Examples

>>> import numpy as np
>>> from math import sqrt, exp
>>> mean = 0.0
>>> std = 1.0
>>> expMean = exp(mean + 0.5 * std * std)
>>> expStd = sqrt((exp(std * std) - 1.0) * exp(2.0 * mean + std * std))
>>> m = RandomRDDs.logNormalVectorRDD(sc, mean, std, 100, 100, seed=1).collect()
>>> mat = np.matrix(m)
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - expMean) < 0.1
True
>>> abs(mat.std() - expStd) < 0.1
True

static normalRDD(sc, size, numPartitions=None, seed=None)

Generates an RDD comprised of i.i.d. samples from the standard normal distribution.

To transform the distribution in the generated RDD from standard normal to some other normal N(mean, sigma^2), use RandomRDDs.normalRDD(sc, n, p, seed).map(lambda v: mean + sigma * v).

New in version 1.1.0.

Parameters

sc (pyspark.SparkContext): SparkContext used to create the RDD.
size (int): Size of the RDD.
numPartitions (int, optional): Number of partitions in the RDD (default: sc.defaultParallelism).
seed (int, optional): Random seed (default: a random long integer).

Returns

pyspark.RDD: RDD of float comprised of i.i.d. samples ~ N(0.0, 1.0).

Examples

>>> x = RandomRDDs.normalRDD(sc, 1000, seed=1)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - 0.0) < 0.1
True
>>> abs(stats.stdev() - 1.0) < 0.1
True
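
The shift-and-scale transformation described for `normalRDD` can be wrapped in a small helper. The sketch below is one possible shape for it: `scaled_normal_rdd` is a hypothetical name, and the pyspark import sits inside the function so that the local check underneath, which uses `random.gauss` as a stand-in, runs without Spark.

```python
import random
import statistics


def scaled_normal_rdd(sc, size, mean, sigma, numPartitions=None, seed=None):
    """Hypothetical wrapper: samples ~ N(mean, sigma^2) built on normalRDD."""
    # Imported lazily so the rest of this sketch runs without Spark installed.
    from pyspark.mllib.random import RandomRDDs

    standard = RandomRDDs.normalRDD(sc, size, numPartitions, seed)
    return standard.map(lambda v: mean + sigma * v)


# Local check of the same arithmetic on standard normal draws.
rng = random.Random(1)
mean, sigma = 5.0, 2.0
draws = [mean + sigma * rng.gauss(0.0, 1.0) for _ in range(10_000)]
print(statistics.fmean(draws), statistics.pstdev(draws))  # roughly 5.0 and 2.0
```

With an active SparkContext `sc`, a call such as `scaled_normal_rdd(sc, 1000, 5.0, 2.0, seed=1)` would return an RDD whose `stats()` should report a mean near 5.0 and a standard deviation near 2.0.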

static normalVectorRDD(sc, numRows, numCols, numPartitions=None, seed=None)

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the standard normal distribution.

New in version 1.1.0.

Parameters

sc (pyspark.SparkContext): SparkContext used to create the RDD.
numRows (int): Number of Vectors in the RDD.
numCols (int): Number of elements in each Vector.
numPartitions (int, optional): Number of partitions in the RDD (default: sc.defaultParallelism).
seed (int, optional): Random seed (default: a random long integer).

Returns

pyspark.RDD: RDD of Vector with vectors containing i.i.d. samples ~ N(0.0, 1.0).

Examples

>>> import numpy as np
>>> mat = np.matrix(RandomRDDs.normalVectorRDD(sc, 100, 100, seed=1).collect())
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - 0.0) < 0.1
True
>>> abs(mat.std() - 1.0) < 0.1
True

static poissonRDD(sc, mean, size, numPartitions=None, seed=None)

Generates an RDD comprised of i.i.d. samples from the Poisson distribution with the input mean.

New in version 1.1.0.

Parameters

sc (pyspark.SparkContext): SparkContext used to create the RDD.
mean (float): Mean, or lambda, for the Poisson distribution.
size (int): Size of the RDD.
numPartitions (int, optional): Number of partitions in the RDD (default: sc.defaultParallelism).
seed (int, optional): Random seed (default: a random long integer).

Returns

pyspark.RDD: RDD of float comprised of i.i.d. samples ~ Pois(mean).

Examples

>>> mean = 100.0
>>> x = RandomRDDs.poissonRDD(sc, mean, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - mean) < 0.5
True
>>> from math import sqrt
>>> abs(stats.stdev() - sqrt(mean)) < 0.5
True
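
For a Poisson distribution the variance equals the mean lambda, which is why the example compares the sample standard deviation with sqrt(mean). The sketch below illustrates this locally using Knuth's multiplication method, chosen only because Python's standard library has no Poisson sampler. It is not how Spark generates its samples, and `sample_poisson` is a hypothetical helper.

```python
import math
import random
import statistics


def sample_poisson(mean, rng):
    """Knuth's method: count uniform draws until their product drops to exp(-mean)."""
    threshold = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


rng = random.Random(2)
mean = 100.0
samples = [sample_poisson(mean, rng) for _ in range(5_000)]
sample_mean = statistics.fmean(samples)
sample_std = statistics.pstdev(samples)
print(sample_mean, sample_std)  # roughly 100.0 and sqrt(100.0) = 10.0
```

This helper returns Python ints, whereas `poissonRDD` is documented above as returning an RDD of float.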

static poissonVectorRDD(sc, mean, numRows, numCols, numPartitions=None, seed=None)

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Poisson distribution with the input mean.

New in version 1.1.0.

Parameters

sc (pyspark.SparkContext): SparkContext used to create the RDD.
mean (float): Mean, or lambda, for the Poisson distribution.
numRows (int): Number of Vectors in the RDD.
numCols (int): Number of elements in each Vector.
numPartitions (int, optional): Number of partitions in the RDD (default: sc.defaultParallelism).
seed (int, optional): Random seed (default: a random long integer).

Returns

pyspark.RDD: RDD of Vector with vectors containing i.i.d. samples ~ Pois(mean).

Examples

>>> import numpy as np
>>> mean = 100.0
>>> rdd = RandomRDDs.poissonVectorRDD(sc, mean, 100, 100, seed=1)
>>> mat = np.matrix(rdd.collect())
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - mean) < 0.5
True
>>> from math import sqrt
>>> abs(mat.std() - sqrt(mean)) < 0.5
True

static uniformRDD(sc, size, numPartitions=None, seed=None)

Generates an RDD comprised of i.i.d. samples from the uniform distribution U(0.0, 1.0).

To transform the distribution in the generated RDD from U(0.0, 1.0) to U(a, b), use RandomRDDs.uniformRDD(sc, n, p, seed).map(lambda v: a + (b - a) * v)

New in version 1.1.0.

Parameters

sc (pyspark.SparkContext): SparkContext used to create the RDD.
size (int): Size of the RDD.
numPartitions (int, optional): Number of partitions in the RDD (default: sc.defaultParallelism).
seed (int, optional): Random seed (default: a random long integer).

Returns

pyspark.RDD: RDD of float comprised of i.i.d. samples ~ U(0.0, 1.0).

Examples

>>> x = RandomRDDs.uniformRDD(sc, 100).collect()
>>> len(x)
100
>>> max(x) <= 1.0 and min(x) >= 0.0
True
>>> RandomRDDs.uniformRDD(sc, 100, 4).getNumPartitions()
4
>>> parts = RandomRDDs.uniformRDD(sc, 100, seed=4).getNumPartitions()
>>> parts == sc.defaultParallelism
True
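
The U(0.0, 1.0) to U(a, b) mapping described for `uniformRDD` is a single affine step. The sketch below shows it twice: inside `uniform_rdd_between`, a hypothetical wrapper that imports pyspark lazily, and in a local check where `random.random` stands in for the Spark generator.

```python
import random


def rescale(v, a, b):
    """Map v drawn from U(0.0, 1.0) onto the interval between a and b."""
    return a + (b - a) * v


def uniform_rdd_between(sc, size, a, b, numPartitions=None, seed=None):
    """Hypothetical wrapper: samples ~ U(a, b) built on uniformRDD."""
    # Imported lazily so the rest of this sketch runs without Spark installed.
    from pyspark.mllib.random import RandomRDDs

    unit = RandomRDDs.uniformRDD(sc, size, numPartitions, seed)
    return unit.map(lambda v: a + (b - a) * v)


# Local check: every rescaled draw stays between a and b.
rng = random.Random(4)
a, b = -1.0, 3.0
draws = [rescale(rng.random(), a, b) for _ in range(1_000)]
print(min(draws), max(draws))
```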

static uniformVectorRDD(sc, numRows, numCols, numPartitions=None, seed=None)

Generates an RDD comprised of vectors containing i.i.d. samples drawn from the uniform distribution U(0.0, 1.0).

New in version 1.1.0.

Parameters

sc (pyspark.SparkContext): SparkContext used to create the RDD.
numRows (int): Number of Vectors in the RDD.
numCols (int): Number of elements in each Vector.
numPartitions (int, optional): Number of partitions in the RDD (default: sc.defaultParallelism).
seed (int, optional): Seed for the RNG that generates the seed for the generator in each partition.

Returns

pyspark.RDD: RDD of Vector with vectors containing i.i.d. samples ~ U(0.0, 1.0).

Examples

>>> import numpy as np
>>> mat = np.matrix(RandomRDDs.uniformVectorRDD(sc, 10, 10).collect())
>>> mat.shape
(10, 10)
>>> mat.max() <= 1.0 and mat.min() >= 0.0
True
>>> RandomRDDs.uniformVectorRDD(sc, 10, 10, 4).getNumPartitions()
4
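
The `seed` description for `uniformVectorRDD` mentions one RNG that produces a seed for the generator in each partition. The sketch below illustrates that idea in plain Python. It is an assumption-laden illustration, not Spark's actual implementation: `uniform_vector_partitions`, the 64-bit seed width, and the even row split are all choices made for this example.

```python
import random


def uniform_vector_partitions(num_rows, num_cols, num_partitions, seed):
    """Return a list of partitions, each a list of rows of U(0.0, 1.0) draws."""
    # One master RNG, seeded by the caller, hands out a seed per partition.
    master = random.Random(seed)
    partition_seeds = [master.getrandbits(64) for _ in range(num_partitions)]

    partitions = []
    for index, part_seed in enumerate(partition_seeds):
        # Spread the rows across partitions as evenly as possible.
        start = index * num_rows // num_partitions
        end = (index + 1) * num_rows // num_partitions
        rng = random.Random(part_seed)  # independent generator for this partition
        rows = [[rng.random() for _ in range(num_cols)] for _ in range(end - start)]
        partitions.append(rows)
    return partitions


parts = uniform_vector_partitions(10, 10, 4, seed=1)
print(len(parts), [len(rows) for rows in parts])
```

Because every partition's generator is derived from the single input seed, repeating the call with the same seed and partition count reproduces the same data, while each partition can still be generated independently.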