KMeans
class pyspark.mllib.clustering.KMeans
K-means clustering.
New in version 0.9.0.
Methods
train(rdd, k[, maxIterations, …]) Train a k-means clustering model.
Methods Documentation
classmethod train(rdd: pyspark.rdd.RDD[VectorLike], k: int, maxIterations: int = 100, initializationMode: str = 'k-means||', seed: Optional[int] = None, initializationSteps: int = 2, epsilon: float = 0.0001, initialModel: Optional[pyspark.mllib.clustering.KMeansModel] = None, distanceMeasure: str = 'euclidean') → KMeansModel
Train a k-means clustering model.
New in version 0.9.0.
- Parameters
- rdd : pyspark.RDD
Training points as an RDD of pyspark.mllib.linalg.Vector or convertible sequence types.
- k : int
Number of clusters to create.
- maxIterations : int, optional
Maximum number of iterations allowed. (default: 100)
- initializationMode : str, optional
The initialization algorithm. This can be either “random” or “k-means||”. (default: “k-means||”)
- seed : int, optional
Random seed value for cluster initialization. Set to None to generate a seed based on system time. (default: None)
- initializationSteps : int, optional
Number of steps for the k-means|| initialization mode. This is an advanced setting – the default of 2 is almost always enough. (default: 2)
- epsilon : float, optional
Distance threshold within which a center will be considered to have converged. If all centers move less than this Euclidean distance, iterations are stopped. (default: 1e-4)
- initialModel : KMeansModel, optional
Initial cluster centers can be provided as a KMeansModel object rather than using the random or k-means|| initializationMode. (default: None)
- distanceMeasure : str, optional
The distance measure used by the k-means algorithm. (default: “euclidean”)
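The stopping rule described for `epsilon` can be sketched in plain Python. This only illustrates the documented criterion; it is not Spark's internal implementation, which may differ in detail, and the helper name `centers_converged` and the sample centers are hypothetical.

```python
import math

def centers_converged(old_centers, new_centers, epsilon=1e-4):
    """Return True when every center moved less than epsilon (Euclidean)."""
    return all(
        math.dist(old, new) < epsilon
        for old, new in zip(old_centers, new_centers)
    )

# Hypothetical centers from two consecutive iterations.
old = [(0.0, 0.0), (5.0, 5.0)]
barely_moved = [(0.00001, 0.0), (5.0, 5.00002)]  # both shifts are below 1e-4
still_moving = [(0.0, 0.0), (5.0, 5.5)]          # second center moved by 0.5

print(centers_converged(old, barely_moved))
print(centers_converged(old, still_moving))
```

Under this rule, a single center that keeps moving by at least `epsilon` is enough to continue iterating, until `maxIterations` is reached.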