RandomRDDs
- class pyspark.mllib.random.RandomRDDs
Generator methods for creating RDDs comprised of i.i.d. samples from some distribution.
New in version 1.1.0.
Methods
exponentialRDD(sc, mean, size[, ...])
Generates an RDD comprised of i.i.d. samples from the Exponential distribution with the input mean.
exponentialVectorRDD(sc, mean, numRows, numCols)
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Exponential distribution with the input mean.
gammaRDD(sc, shape, scale, size[, ...])
Generates an RDD comprised of i.i.d. samples from the Gamma distribution with the input shape and scale.
gammaVectorRDD(sc, shape, scale, numRows, ...)
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Gamma distribution.
logNormalRDD(sc, mean, std, size[, ...])
Generates an RDD comprised of i.i.d. samples from the log normal distribution with the input mean and standard deviation.
logNormalVectorRDD(sc, mean, std, numRows, ...)
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the log normal distribution.
normalRDD(sc, size[, numPartitions, seed])
Generates an RDD comprised of i.i.d. samples from the standard normal distribution.
normalVectorRDD(sc, numRows, numCols[, ...])
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the standard normal distribution.
poissonRDD(sc, mean, size[, numPartitions, seed])
Generates an RDD comprised of i.i.d. samples from the Poisson distribution with the input mean.
poissonVectorRDD(sc, mean, numRows, numCols)
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Poisson distribution with the input mean.
uniformRDD(sc, size[, numPartitions, seed])
Generates an RDD comprised of i.i.d. samples from the uniform distribution U(0.0, 1.0).
uniformVectorRDD(sc, numRows, numCols[, ...])
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the uniform distribution U(0.0, 1.0).
Methods Documentation
- static exponentialRDD(sc, mean, size, numPartitions=None, seed=None)
Generates an RDD comprised of i.i.d. samples from the Exponential distribution with the input mean.
New in version 1.3.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- mean : float
Mean, or 1 / lambda, for the Exponential distribution.
- size : int
Size of the RDD.
- numPartitions : int, optional
Number of partitions in the RDD (default: sc.defaultParallelism).
- seed : int, optional
Random seed (default: a random long integer).
- Returns
pyspark.RDD
RDD of float comprised of i.i.d. samples ~ Exp(mean).
Examples
>>> mean = 2.0
>>> x = RandomRDDs.exponentialRDD(sc, mean, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - mean) < 0.5
True
>>> from math import sqrt
>>> abs(stats.stdev() - sqrt(mean)) < 0.5
True
- static exponentialVectorRDD(sc, mean, numRows, numCols, numPartitions=None, seed=None)
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Exponential distribution with the input mean.
New in version 1.3.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- mean : float
Mean, or 1 / lambda, for the Exponential distribution.
- numRows : int
Number of Vectors in the RDD.
- numCols : int
Number of elements in each Vector.
- numPartitions : int, optional
Number of partitions in the RDD (default: sc.defaultParallelism).
- seed : int, optional
Random seed (default: a random long integer).
- Returns
pyspark.RDD
RDD of Vector with vectors containing i.i.d. samples ~ Exp(mean).
Examples
>>> import numpy as np
>>> mean = 0.5
>>> rdd = RandomRDDs.exponentialVectorRDD(sc, mean, 100, 100, seed=1)
>>> mat = np.mat(rdd.collect())
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - mean) < 0.5
True
>>> from math import sqrt
>>> abs(mat.std() - sqrt(mean)) < 0.5
True
- static gammaRDD(sc, shape, scale, size, numPartitions=None, seed=None)
Generates an RDD comprised of i.i.d. samples from the Gamma distribution with the input shape and scale.
New in version 1.3.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- shape : float
Shape (> 0) parameter for the Gamma distribution.
- scale : float
Scale (> 0) parameter for the Gamma distribution.
- size : int
Size of the RDD.
- numPartitions : int, optional
Number of partitions in the RDD (default: sc.defaultParallelism).
- seed : int, optional
Random seed (default: a random long integer).
- Returns
pyspark.RDD
RDD of float comprised of i.i.d. samples ~ Gamma(shape, scale).
Examples
>>> from math import sqrt
>>> shape = 1.0
>>> scale = 2.0
>>> expMean = shape * scale
>>> expStd = sqrt(shape * scale * scale)
>>> x = RandomRDDs.gammaRDD(sc, shape, scale, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - expMean) < 0.5
True
>>> abs(stats.stdev() - expStd) < 0.5
True
- static gammaVectorRDD(sc, shape, scale, numRows, numCols, numPartitions=None, seed=None)
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Gamma distribution.
New in version 1.3.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- shape : float
Shape (> 0) of the Gamma distribution.
- scale : float
Scale (> 0) of the Gamma distribution.
- numRows : int
Number of Vectors in the RDD.
- numCols : int
Number of elements in each Vector.
- numPartitions : int, optional
Number of partitions in the RDD (default: sc.defaultParallelism).
- seed : int, optional
Random seed (default: a random long integer).
- Returns
pyspark.RDD
RDD of Vector with vectors containing i.i.d. samples ~ Gamma(shape, scale).
Examples
>>> import numpy as np
>>> from math import sqrt
>>> shape = 1.0
>>> scale = 2.0
>>> expMean = shape * scale
>>> expStd = sqrt(shape * scale * scale)
>>> mat = np.matrix(RandomRDDs.gammaVectorRDD(sc, shape, scale, 100, 100, seed=1).collect())
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - expMean) < 0.1
True
>>> abs(mat.std() - expStd) < 0.1
True
- static logNormalRDD(sc, mean, std, size, numPartitions=None, seed=None)
Generates an RDD comprised of i.i.d. samples from the log normal distribution with the input mean and standard deviation.
New in version 1.3.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- mean : float
Mean for the log Normal distribution.
- std : float
Standard deviation for the log Normal distribution.
- size : int
Size of the RDD.
- numPartitions : int, optional
Number of partitions in the RDD (default: sc.defaultParallelism).
- seed : int, optional
Random seed (default: a random long integer).
- Returns
pyspark.RDD
RDD of float comprised of i.i.d. samples ~ log N(mean, std).
Examples
>>> from math import sqrt, exp
>>> mean = 0.0
>>> std = 1.0
>>> expMean = exp(mean + 0.5 * std * std)
>>> expStd = sqrt((exp(std * std) - 1.0) * exp(2.0 * mean + std * std))
>>> x = RandomRDDs.logNormalRDD(sc, mean, std, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - expMean) < 0.5
True
>>> abs(stats.stdev() - expStd) < 0.5
True
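The `expMean` and `expStd` values in the example above come from the standard log-normal moment formulas. As a local sketch (plain NumPy, no SparkContext; the sample size and seed are illustrative, not part of the Spark API), the same identities can be checked by exponentiating normal draws:

```python
import numpy as np
from math import exp, sqrt

mean, std = 0.0, 1.0
# Log-normal moments: E[X] = exp(mu + sigma^2 / 2),
# Var[X] = (exp(sigma^2) - 1) * exp(2*mu + sigma^2)
expMean = exp(mean + 0.5 * std * std)
expStd = sqrt((exp(std * std) - 1.0) * exp(2.0 * mean + std * std))

# A log-normal sample is exp() applied to a normal sample.
rng = np.random.default_rng(seed=2)
samples = np.exp(rng.normal(mean, std, size=100_000))

ok_mean = abs(samples.mean() - expMean) < 0.1
ok_std = abs(samples.std() - expStd) < 0.2
```

With 100,000 draws both sample statistics land comfortably within these tolerances; the doctest uses the same formulas against the distributed sample.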
- static logNormalVectorRDD(sc, mean, std, numRows, numCols, numPartitions=None, seed=None)
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the log normal distribution.
New in version 1.3.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- mean : float
Mean of the log normal distribution.
- std : float
Standard deviation of the log normal distribution.
- numRows : int
Number of Vectors in the RDD.
- numCols : int
Number of elements in each Vector.
- numPartitions : int, optional
Number of partitions in the RDD (default: sc.defaultParallelism).
- seed : int, optional
Random seed (default: a random long integer).
- Returns
pyspark.RDD
RDD of Vector with vectors containing i.i.d. samples ~ log N(mean, std).
Examples
>>> import numpy as np
>>> from math import sqrt, exp
>>> mean = 0.0
>>> std = 1.0
>>> expMean = exp(mean + 0.5 * std * std)
>>> expStd = sqrt((exp(std * std) - 1.0) * exp(2.0 * mean + std * std))
>>> m = RandomRDDs.logNormalVectorRDD(sc, mean, std, 100, 100, seed=1).collect()
>>> mat = np.matrix(m)
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - expMean) < 0.1
True
>>> abs(mat.std() - expStd) < 0.1
True
- static normalRDD(sc, size, numPartitions=None, seed=None)
Generates an RDD comprised of i.i.d. samples from the standard normal distribution.
To transform the distribution in the generated RDD from standard normal to some other normal N(mean, sigma^2), use
RandomRDDs.normalRDD(sc, n, p, seed).map(lambda v: mean + sigma * v)
New in version 1.1.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- size : int
Size of the RDD.
- numPartitions : int, optional
Number of partitions in the RDD (default: sc.defaultParallelism).
- seed : int, optional
Random seed (default: a random long integer).
- Returns
pyspark.RDD
RDD of float comprised of i.i.d. samples ~ N(0.0, 1.0).
Examples
>>> x = RandomRDDs.normalRDD(sc, 1000, seed=1)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - 0.0) < 0.1
True
>>> abs(stats.stdev() - 1.0) < 0.1
True
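The affine map described above, mean + sigma * v, is what turns standard-normal draws into N(mean, sigma^2). A local NumPy sketch of the same per-element transform (the mean and sigma values are illustrative; no Spark is involved):

```python
import numpy as np

mean, sigma = 3.0, 2.0  # illustrative target parameters
rng = np.random.default_rng(seed=1)
z = rng.standard_normal(100_000)  # stand-in for the values normalRDD produces
x = mean + sigma * z              # the same transform the .map() applies per element

# x is now approximately N(mean, sigma^2)
ok_mean = abs(x.mean() - mean) < 0.1
ok_std = abs(x.std() - sigma) < 0.1
```

On an RDD, the identical arithmetic simply runs inside the `map` closure on each worker.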
- static normalVectorRDD(sc, numRows, numCols, numPartitions=None, seed=None)
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the standard normal distribution.
New in version 1.1.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- numRows : int
Number of Vectors in the RDD.
- numCols : int
Number of elements in each Vector.
- numPartitions : int, optional
Number of partitions in the RDD (default: sc.defaultParallelism).
- seed : int, optional
Random seed (default: a random long integer).
- Returns
pyspark.RDD
RDD of Vector with vectors containing i.i.d. samples ~ N(0.0, 1.0).
Examples
>>> import numpy as np
>>> mat = np.matrix(RandomRDDs.normalVectorRDD(sc, 100, 100, seed=1).collect())
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - 0.0) < 0.1
True
>>> abs(mat.std() - 1.0) < 0.1
True
- static poissonRDD(sc, mean, size, numPartitions=None, seed=None)
Generates an RDD comprised of i.i.d. samples from the Poisson distribution with the input mean.
New in version 1.1.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- mean : float
Mean, or lambda, for the Poisson distribution.
- size : int
Size of the RDD.
- numPartitions : int, optional
Number of partitions in the RDD (default: sc.defaultParallelism).
- seed : int, optional
Random seed (default: a random long integer).
- Returns
pyspark.RDD
RDD of float comprised of i.i.d. samples ~ Pois(mean).
Examples
>>> mean = 100.0
>>> x = RandomRDDs.poissonRDD(sc, mean, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - mean) < 0.5
True
>>> from math import sqrt
>>> abs(stats.stdev() - sqrt(mean)) < 0.5
True
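The doctest compares the sample stdev against sqrt(mean) because a Poisson distribution's variance equals its mean. A quick local check of that identity with NumPy (sample size and seed are illustrative; this does not use Spark):

```python
import numpy as np
from math import sqrt

mean = 100.0
rng = np.random.default_rng(seed=3)
samples = rng.poisson(lam=mean, size=100_000)

# For Poisson(lambda): E[X] = lambda and Var[X] = lambda,
# so the standard deviation is sqrt(lambda) = 10 here.
ok_mean = abs(samples.mean() - mean) < 0.5
ok_std = abs(samples.std() - sqrt(mean)) < 0.5
```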
- static poissonVectorRDD(sc, mean, numRows, numCols, numPartitions=None, seed=None)
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Poisson distribution with the input mean.
New in version 1.1.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- mean : float
Mean, or lambda, for the Poisson distribution.
- numRows : int
Number of Vectors in the RDD.
- numCols : int
Number of elements in each Vector.
- numPartitions : int, optional
Number of partitions in the RDD (default: sc.defaultParallelism).
- seed : int, optional
Random seed (default: a random long integer).
- Returns
pyspark.RDD
RDD of Vector with vectors containing i.i.d. samples ~ Pois(mean).
Examples
>>> import numpy as np
>>> mean = 100.0
>>> rdd = RandomRDDs.poissonVectorRDD(sc, mean, 100, 100, seed=1)
>>> mat = np.mat(rdd.collect())
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - mean) < 0.5
True
>>> from math import sqrt
>>> abs(mat.std() - sqrt(mean)) < 0.5
True
- static uniformRDD(sc, size, numPartitions=None, seed=None)
Generates an RDD comprised of i.i.d. samples from the uniform distribution U(0.0, 1.0).
To transform the distribution in the generated RDD from U(0.0, 1.0) to U(a, b), use
RandomRDDs.uniformRDD(sc, n, p, seed).map(lambda v: a + (b - a) * v)
New in version 1.1.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- size : int
Size of the RDD.
- numPartitions : int, optional
Number of partitions in the RDD (default: sc.defaultParallelism).
- seed : int, optional
Random seed (default: a random long integer).
- Returns
pyspark.RDD
RDD of float comprised of i.i.d. samples ~ U(0.0, 1.0).
Examples
>>> x = RandomRDDs.uniformRDD(sc, 100).collect()
>>> len(x)
100
>>> max(x) <= 1.0 and min(x) >= 0.0
True
>>> RandomRDDs.uniformRDD(sc, 100, 4).getNumPartitions()
4
>>> parts = RandomRDDs.uniformRDD(sc, 100, seed=4).getNumPartitions()
>>> parts == sc.defaultParallelism
True
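The U(a, b) rescaling shown above, a + (b - a) * v, can be sketched locally with NumPy; the bounds below are illustrative and no SparkContext is needed:

```python
import numpy as np

a, b = -1.0, 3.0  # illustrative bounds
rng = np.random.default_rng(seed=4)
v = rng.random(100_000)   # stand-in for uniformRDD output, i.i.d. U(0.0, 1.0)
x = a + (b - a) * v       # the same transform the .map() applies per element

# x now lies in [a, b) with mean (a + b) / 2
ok_range = (x.min() >= a) and (x.max() < b)
ok_mean = abs(x.mean() - (a + b) / 2) < 0.05
```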
- static uniformVectorRDD(sc, numRows, numCols, numPartitions=None, seed=None)
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the uniform distribution U(0.0, 1.0).
New in version 1.1.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- numRows : int
Number of Vectors in the RDD.
- numCols : int
Number of elements in each Vector.
- numPartitions : int, optional
Number of partitions in the RDD.
- seed : int, optional
Seed for the RNG that generates the seed for the generator in each partition.
- Returns
pyspark.RDD
RDD of Vector with vectors containing i.i.d samples ~ U(0.0, 1.0).
Examples
>>> import numpy as np
>>> mat = np.matrix(RandomRDDs.uniformVectorRDD(sc, 10, 10).collect())
>>> mat.shape
(10, 10)
>>> mat.max() <= 1.0 and mat.min() >= 0.0
True
>>> RandomRDDs.uniformVectorRDD(sc, 10, 10, 4).getNumPartitions()
4