RandomRDDs
- class pyspark.mllib.random.RandomRDDs
  Generator methods for creating RDDs comprised of i.i.d. samples from some distribution.
New in version 1.1.0.
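All examples in this reference assume a live SparkContext bound to the name sc, as in the PySpark shell. As a minimal sketch for a standalone script (the application name here is an arbitrary illustration choice):

from pyspark import SparkContext
from pyspark.mllib.random import RandomRDDs

# In the PySpark shell `sc` already exists; a standalone script creates it.
sc = SparkContext(appName="random-rdds-demo")

# Draw 1000 i.i.d. standard normal samples spread over 4 partitions.
samples = RandomRDDs.normalRDD(sc, 1000, numPartitions=4, seed=42)
print(samples.take(5))

sc.stop()  # release the context when the script owns it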
Methods
- exponentialRDD(sc, mean, size[, ...]): Generates an RDD comprised of i.i.d. samples from the Exponential distribution with the input mean.
- exponentialVectorRDD(sc, mean, numRows, numCols[, ...]): Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Exponential distribution with the input mean.
- gammaRDD(sc, shape, scale, size[, ...]): Generates an RDD comprised of i.i.d. samples from the Gamma distribution with the input shape and scale.
- gammaVectorRDD(sc, shape, scale, numRows, numCols[, ...]): Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Gamma distribution.
- logNormalRDD(sc, mean, std, size[, ...]): Generates an RDD comprised of i.i.d. samples from the log normal distribution with the input mean and standard deviation.
- logNormalVectorRDD(sc, mean, std, numRows, numCols[, ...]): Generates an RDD comprised of vectors containing i.i.d. samples drawn from the log normal distribution.
- normalRDD(sc, size[, numPartitions, seed]): Generates an RDD comprised of i.i.d. samples from the standard normal distribution.
- normalVectorRDD(sc, numRows, numCols[, ...]): Generates an RDD comprised of vectors containing i.i.d. samples drawn from the standard normal distribution.
- poissonRDD(sc, mean, size[, numPartitions, seed]): Generates an RDD comprised of i.i.d. samples from the Poisson distribution with the input mean.
- poissonVectorRDD(sc, mean, numRows, numCols[, ...]): Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Poisson distribution with the input mean.
- uniformRDD(sc, size[, numPartitions, seed]): Generates an RDD comprised of i.i.d. samples from the uniform distribution U(0.0, 1.0).
- uniformVectorRDD(sc, numRows, numCols[, ...]): Generates an RDD comprised of vectors containing i.i.d. samples drawn from the uniform distribution U(0.0, 1.0).
Methods Documentation
- static exponentialRDD(sc, mean, size, numPartitions=None, seed=None)
 Generates an RDD comprised of i.i.d. samples from the Exponential distribution with the input mean.
New in version 1.3.0.
- Parameters
  - sc : pyspark.SparkContext
    SparkContext used to create the RDD.
  - mean : float
    Mean, or 1 / lambda, for the Exponential distribution.
  - size : int
    Size of the RDD.
  - numPartitions : int, optional
    Number of partitions in the RDD (default: sc.defaultParallelism).
  - seed : int, optional
    Random seed (default: a random long integer).
- Returns
  - pyspark.RDD
    RDD of float comprised of i.i.d. samples ~ Exp(mean).
Examples
>>> mean = 2.0
>>> x = RandomRDDs.exponentialRDD(sc, mean, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - mean) < 0.5
True
>>> from math import sqrt
>>> abs(stats.stdev() - sqrt(mean)) < 0.5
True
- static exponentialVectorRDD(sc, mean, numRows, numCols, numPartitions=None, seed=None)
 Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Exponential distribution with the input mean.
New in version 1.3.0.
- Parameters
  - sc : pyspark.SparkContext
    SparkContext used to create the RDD.
  - mean : float
    Mean, or 1 / lambda, for the Exponential distribution.
  - numRows : int
    Number of Vectors in the RDD.
  - numCols : int
    Number of elements in each Vector.
  - numPartitions : int, optional
    Number of partitions in the RDD (default: sc.defaultParallelism).
  - seed : int, optional
    Random seed (default: a random long integer).
- Returns
  - pyspark.RDD
    RDD of Vector with vectors containing i.i.d. samples ~ Exp(mean).
Examples
>>> import numpy as np
>>> mean = 0.5
>>> rdd = RandomRDDs.exponentialVectorRDD(sc, mean, 100, 100, seed=1)
>>> mat = np.mat(rdd.collect())
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - mean) < 0.5
True
>>> from math import sqrt
>>> abs(mat.std() - sqrt(mean)) < 0.5
True
- static gammaRDD(sc, shape, scale, size, numPartitions=None, seed=None)
 Generates an RDD comprised of i.i.d. samples from the Gamma distribution with the input shape and scale.
New in version 1.3.0.
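For reference, a Gamma(shape, scale) variable has mean shape * scale and standard deviation sqrt(shape) * scale; the expMean and expStd values in the example below are exactly these moments.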
- Parameters
  - sc : pyspark.SparkContext
    SparkContext used to create the RDD.
  - shape : float
    Shape (> 0) parameter for the Gamma distribution.
  - scale : float
    Scale (> 0) parameter for the Gamma distribution.
  - size : int
    Size of the RDD.
  - numPartitions : int, optional
    Number of partitions in the RDD (default: sc.defaultParallelism).
  - seed : int, optional
    Random seed (default: a random long integer).
- Returns
  - pyspark.RDD
    RDD of float comprised of i.i.d. samples ~ Gamma(shape, scale).
Examples
>>> from math import sqrt
>>> shape = 1.0
>>> scale = 2.0
>>> expMean = shape * scale
>>> expStd = sqrt(shape * scale * scale)
>>> x = RandomRDDs.gammaRDD(sc, shape, scale, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - expMean) < 0.5
True
>>> abs(stats.stdev() - expStd) < 0.5
True
- static gammaVectorRDD(sc, shape, scale, numRows, numCols, numPartitions=None, seed=None)
 Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Gamma distribution.
New in version 1.3.0.
- Parameters
  - sc : pyspark.SparkContext
    SparkContext used to create the RDD.
  - shape : float
    Shape (> 0) of the Gamma distribution.
  - scale : float
    Scale (> 0) of the Gamma distribution.
  - numRows : int
    Number of Vectors in the RDD.
  - numCols : int
    Number of elements in each Vector.
  - numPartitions : int, optional
    Number of partitions in the RDD (default: sc.defaultParallelism).
  - seed : int, optional
    Random seed (default: a random long integer).
- Returns
  - pyspark.RDD
    RDD of Vector with vectors containing i.i.d. samples ~ Gamma(shape, scale).
Examples
>>> import numpy as np
>>> from math import sqrt
>>> shape = 1.0
>>> scale = 2.0
>>> expMean = shape * scale
>>> expStd = sqrt(shape * scale * scale)
>>> mat = np.matrix(RandomRDDs.gammaVectorRDD(sc, shape, scale, 100, 100, seed=1).collect())
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - expMean) < 0.1
True
>>> abs(mat.std() - expStd) < 0.1
True
- static logNormalRDD(sc, mean, std, size, numPartitions=None, seed=None)
 Generates an RDD comprised of i.i.d. samples from the log normal distribution with the input mean and standard deviation.
New in version 1.3.0.
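For reference, if X = exp(Z) with Z ~ N(mean, std^2), then E[X] = exp(mean + std^2 / 2) and Var[X] = (exp(std^2) - 1) * exp(2 * mean + std^2); the expMean and expStd values in the example below compute these moments.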
- Parameters
  - sc : pyspark.SparkContext
    SparkContext used to create the RDD.
  - mean : float
    Mean for the log normal distribution.
  - std : float
    Standard deviation for the log normal distribution.
  - size : int
    Size of the RDD.
  - numPartitions : int, optional
    Number of partitions in the RDD (default: sc.defaultParallelism).
  - seed : int, optional
    Random seed (default: a random long integer).
- Returns
  - pyspark.RDD
    RDD of float comprised of i.i.d. samples ~ log N(mean, std).
Examples
>>> from math import sqrt, exp
>>> mean = 0.0
>>> std = 1.0
>>> expMean = exp(mean + 0.5 * std * std)
>>> expStd = sqrt((exp(std * std) - 1.0) * exp(2.0 * mean + std * std))
>>> x = RandomRDDs.logNormalRDD(sc, mean, std, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - expMean) < 0.5
True
>>> abs(stats.stdev() - expStd) < 0.5
True
- static logNormalVectorRDD(sc, mean, std, numRows, numCols, numPartitions=None, seed=None)
 Generates an RDD comprised of vectors containing i.i.d. samples drawn from the log normal distribution.
New in version 1.3.0.
- Parameters
  - sc : pyspark.SparkContext
    SparkContext used to create the RDD.
  - mean : float
    Mean of the log normal distribution.
  - std : float
    Standard deviation of the log normal distribution.
  - numRows : int
    Number of Vectors in the RDD.
  - numCols : int
    Number of elements in each Vector.
  - numPartitions : int, optional
    Number of partitions in the RDD (default: sc.defaultParallelism).
  - seed : int, optional
    Random seed (default: a random long integer).
- Returns
  - pyspark.RDD
    RDD of Vector with vectors containing i.i.d. samples ~ log N(mean, std).
Examples
>>> import numpy as np
>>> from math import sqrt, exp
>>> mean = 0.0
>>> std = 1.0
>>> expMean = exp(mean + 0.5 * std * std)
>>> expStd = sqrt((exp(std * std) - 1.0) * exp(2.0 * mean + std * std))
>>> m = RandomRDDs.logNormalVectorRDD(sc, mean, std, 100, 100, seed=1).collect()
>>> mat = np.matrix(m)
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - expMean) < 0.1
True
>>> abs(mat.std() - expStd) < 0.1
True
- static normalRDD(sc, size, numPartitions=None, seed=None)
 Generates an RDD comprised of i.i.d. samples from the standard normal distribution.
To transform the distribution in the generated RDD from standard normal to some other normal N(mean, sigma^2), use:

RandomRDDs.normalRDD(sc, n, p, seed).map(lambda v: mean + sigma * v)

New in version 1.1.0.
- Parameters
  - sc : pyspark.SparkContext
    SparkContext used to create the RDD.
  - size : int
    Size of the RDD.
  - numPartitions : int, optional
    Number of partitions in the RDD (default: sc.defaultParallelism).
  - seed : int, optional
    Random seed (default: a random long integer).
- Returns
  - pyspark.RDD
    RDD of float comprised of i.i.d. samples ~ N(0.0, 1.0).
Examples
>>> x = RandomRDDs.normalRDD(sc, 1000, seed=1)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - 0.0) < 0.1
True
>>> abs(stats.stdev() - 1.0) < 0.1
True
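As a concrete version of the shift-and-scale note above, this sketch draws samples from N(7.0, 2.0^2); the mean and sigma values are arbitrary illustration choices, and the 0.5 tolerances are loose enough for a 1000-sample draw:

>>> mean, sigma = 7.0, 2.0
>>> y = RandomRDDs.normalRDD(sc, 1000, seed=1).map(lambda v: mean + sigma * v)
>>> stats = y.stats()
>>> abs(stats.mean() - mean) < 0.5
True
>>> abs(stats.stdev() - sigma) < 0.5
True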
- static normalVectorRDD(sc, numRows, numCols, numPartitions=None, seed=None)
 Generates an RDD comprised of vectors containing i.i.d. samples drawn from the standard normal distribution.
New in version 1.1.0.
- Parameters
  - sc : pyspark.SparkContext
    SparkContext used to create the RDD.
  - numRows : int
    Number of Vectors in the RDD.
  - numCols : int
    Number of elements in each Vector.
  - numPartitions : int, optional
    Number of partitions in the RDD (default: sc.defaultParallelism).
  - seed : int, optional
    Random seed (default: a random long integer).
- Returns
  - pyspark.RDD
    RDD of Vector with vectors containing i.i.d. samples ~ N(0.0, 1.0).
Examples
>>> import numpy as np
>>> mat = np.matrix(RandomRDDs.normalVectorRDD(sc, 100, 100, seed=1).collect())
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - 0.0) < 0.1
True
>>> abs(mat.std() - 1.0) < 0.1
True
- static poissonRDD(sc, mean, size, numPartitions=None, seed=None)
 Generates an RDD comprised of i.i.d. samples from the Poisson distribution with the input mean.
New in version 1.1.0.
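For reference, a Poisson variable with mean lambda has variance lambda as well, so the example below checks the sample standard deviation against sqrt(mean).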
- Parameters
  - sc : pyspark.SparkContext
    SparkContext used to create the RDD.
  - mean : float
    Mean, or lambda, for the Poisson distribution.
  - size : int
    Size of the RDD.
  - numPartitions : int, optional
    Number of partitions in the RDD (default: sc.defaultParallelism).
  - seed : int, optional
    Random seed (default: a random long integer).
- Returns
  - pyspark.RDD
    RDD of float comprised of i.i.d. samples ~ Pois(mean).
Examples
>>> mean = 100.0
>>> x = RandomRDDs.poissonRDD(sc, mean, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - mean) < 0.5
True
>>> from math import sqrt
>>> abs(stats.stdev() - sqrt(mean)) < 0.5
True
- static poissonVectorRDD(sc, mean, numRows, numCols, numPartitions=None, seed=None)
 Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Poisson distribution with the input mean.
New in version 1.1.0.
- Parameters
  - sc : pyspark.SparkContext
    SparkContext used to create the RDD.
  - mean : float
    Mean, or lambda, for the Poisson distribution.
  - numRows : int
    Number of Vectors in the RDD.
  - numCols : int
    Number of elements in each Vector.
  - numPartitions : int, optional
    Number of partitions in the RDD (default: sc.defaultParallelism).
  - seed : int, optional
    Random seed (default: a random long integer).
- Returns
  - pyspark.RDD
    RDD of Vector with vectors containing i.i.d. samples ~ Pois(mean).
Examples
>>> import numpy as np
>>> mean = 100.0
>>> rdd = RandomRDDs.poissonVectorRDD(sc, mean, 100, 100, seed=1)
>>> mat = np.mat(rdd.collect())
>>> mat.shape
(100, 100)
>>> abs(mat.mean() - mean) < 0.5
True
>>> from math import sqrt
>>> abs(mat.std() - sqrt(mean)) < 0.5
True
- static uniformRDD(sc, size, numPartitions=None, seed=None)
 Generates an RDD comprised of i.i.d. samples from the uniform distribution U(0.0, 1.0).
To transform the distribution in the generated RDD from U(0.0, 1.0) to U(a, b), use:

RandomRDDs.uniformRDD(sc, n, p, seed).map(lambda v: a + (b - a) * v)

New in version 1.1.0.
- Parameters
  - sc : pyspark.SparkContext
    SparkContext used to create the RDD.
  - size : int
    Size of the RDD.
  - numPartitions : int, optional
    Number of partitions in the RDD (default: sc.defaultParallelism).
  - seed : int, optional
    Random seed (default: a random long integer).
- Returns
  - pyspark.RDD
    RDD of float comprised of i.i.d. samples ~ U(0.0, 1.0).
Examples
>>> x = RandomRDDs.uniformRDD(sc, 100).collect()
>>> len(x)
100
>>> max(x) <= 1.0 and min(x) >= 0.0
True
>>> RandomRDDs.uniformRDD(sc, 100, 4).getNumPartitions()
4
>>> parts = RandomRDDs.uniformRDD(sc, 100, seed=4).getNumPartitions()
>>> parts == sc.defaultParallelism
True
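As a concrete version of the U(a, b) note above, this sketch rescales uniform samples to U(-1.0, 1.0); the endpoint values are arbitrary illustration choices:

>>> a, b = -1.0, 1.0
>>> ys = RandomRDDs.uniformRDD(sc, 1000, seed=3).map(lambda v: a + (b - a) * v).collect()
>>> min(ys) >= a and max(ys) <= b
True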
- static uniformVectorRDD(sc, numRows, numCols, numPartitions=None, seed=None)
 Generates an RDD comprised of vectors containing i.i.d. samples drawn from the uniform distribution U(0.0, 1.0).
New in version 1.1.0.
- Parameters
  - sc : pyspark.SparkContext
    SparkContext used to create the RDD.
  - numRows : int
    Number of Vectors in the RDD.
  - numCols : int
    Number of elements in each Vector.
  - numPartitions : int, optional
    Number of partitions in the RDD.
  - seed : int, optional
    Seed for the RNG that generates the seed for the generator in each partition.
- Returns
  - pyspark.RDD
    RDD of Vector with vectors containing i.i.d. samples ~ U(0.0, 1.0).
Examples
>>> import numpy as np
>>> mat = np.matrix(RandomRDDs.uniformVectorRDD(sc, 10, 10).collect())
>>> mat.shape
(10, 10)
>>> mat.max() <= 1.0 and mat.min() >= 0.0
True
>>> RandomRDDs.uniformVectorRDD(sc, 10, 10, 4).getNumPartitions()
4