pyspark.RDD.countApproxDistinct

RDD.countApproxDistinct(relativeSD=0.05)
Return the approximate number of distinct elements in the RDD.

Parameters
relativeSD : float, optional
    Relative accuracy (default 0.05). Smaller values create counters that require more space. Must be greater than 0.000017.
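To make the space/accuracy trade-off concrete, the sketch below uses the textbook HyperLogLog bound, where the relative standard error is roughly 1.04 / sqrt(m) for m = 2**p registers. This is an illustration only: the helper name `hll_precision_bits` is hypothetical, and the exact constant and rounding that Spark applies internally are not stated in this page and may differ.

```python
import math


def hll_precision_bits(relative_sd: float) -> int:
    """Smallest p such that 1.04 / sqrt(2**p) <= relative_sd.

    Uses the textbook HyperLogLog error bound (an assumption here,
    not necessarily the constant Spark uses internally).
    """
    if not 0 < relative_sd < 1:
        raise ValueError("relative_sd must be in the open interval (0, 1)")
    registers_needed = (1.04 / relative_sd) ** 2
    return math.ceil(math.log2(registers_needed))


# Halving the target error roughly quadruples the number of registers.
for sd in (0.1, 0.05, 0.01):
    p = hll_precision_bits(sd)
    print(f"relativeSD={sd}: p={p}, registers={2 ** p}")
```

Under this bound, a smaller `relativeSD` maps to a larger register count, which is why smaller values require more space.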
 
Notes
The algorithm used is based on streamlib’s implementation of “HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm”.

Examples
>>> n = sc.parallelize(range(1000)).map(str).countApproxDistinct()
>>> 900 < n < 1100
True
>>> n = sc.parallelize([i % 20 for i in range(1000)]).countApproxDistinct()
>>> 16 < n < 24
True
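For intuition about how a HyperLogLog-style counter works, here is a minimal pure-Python sketch of the classic algorithm with a linear-counting correction for small cardinalities. It is a simplified illustration, not streamlib's or Spark's implementation: the function name `hll_estimate`, the use of SHA-1 as the hash, and the default `p=12` are all assumptions made for this example, and the bias corrections from the paper cited in the Notes are omitted.

```python
import hashlib
import math


def hll_estimate(items, p=12):
    """Estimate the number of distinct items with a basic HyperLogLog.

    p is the number of index bits, so m = 2**p registers are used.
    Simplified textbook version (hypothetical helper, not Spark's code).
    """
    m = 1 << p
    registers = [0] * m
    for item in items:
        # Take 64 bits of a SHA-1 digest as the hash value (an assumption;
        # real implementations typically use a faster non-cryptographic hash).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - p)                  # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)     # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)

    alpha = 0.7213 / (1 + 1.079 / m)         # standard constant for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    # Small-range correction: fall back to linear counting.
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)
    return raw


# Mirrors the shape of the doctest examples above, without Spark.
print(900 < hll_estimate(map(str, range(1000))) < 1100)
print(16 < hll_estimate(i % 20 for i in range(1000)) < 24)
```

Each register keeps only the largest rank seen, so memory depends on `p` rather than on the number of elements. Registers built this way can also be combined by taking the element-wise maximum, which is what makes the approach suitable for a distributed collection such as an RDD.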