
scala - Apache Spark - Two-sample Kolmogorov-Smirnov test

Reposted. Author: 行者123. Updated: 2023-12-01 21:50:28

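I have two sets of data in Spark (call them d1 and d2). I want to run a two-sample Kolmogorov-Smirnov test to check whether their underlying population distribution functions differ. Can MLlib's Statistics.kolmogorovSmirnovTest do this?

The documentation gives this example: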

import org.apache.spark.mllib.stat.Statistics

val data: RDD[Double] = ... // an RDD of sample data

// perform a KS test using a cumulative distribution function of our making
val myCDF: Double => Double = ...
val testResult2 = Statistics.kolmogorovSmirnovTest(data, myCDF)

I tried computing the empirical cumulative distribution function of d2 (collected as a Map) and comparing it against d1:

Statistics.kolmogorovSmirnovTest(d1, ecdf_map)

The test runs, but the results are wrong.

Am I doing something wrong? Is this possible at all? Any ideas?

Thanks for your help!

Best answer

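In Spark MLlib, KolmogorovSmirnovTest is one-sample and two-sided. So if you specifically need the two-sample variant, it is not available in that library. You can still compare the datasets, either by computing the empirical cumulative distribution functions (I found a library that does this, so I will update this answer if the results are good) or by using each dataset's deviation from a normal distribution. In this example I go with the latter.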
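For reference, the first option (comparing the two empirical CDFs directly) can be sketched in plain Scala. This is a minimal sketch, not the library mentioned above: the object name TwoSampleKs and its statistic method are hypothetical, and it assumes both samples fit in driver memory (for example after d1.collect() and d2.collect()). It returns only the KS statistic, the largest gap between the two empirical CDFs, and does not compute a p-value.

```scala
object TwoSampleKs {

  /** Two-sample KS statistic: the largest absolute gap between the two
    * empirical CDFs. Both inputs must be non-empty.
    */
  def statistic( xs: Array[ Double ], ys: Array[ Double ] ): Double = {
    require( xs.nonEmpty && ys.nonEmpty, "both samples must be non-empty" )

    val a = xs.sorted
    val b = ys.sorted

    var i = 0 // number of observations of a that are <= current value
    var j = 0 // number of observations of b that are <= current value
    var maxGap = 0.0

    while ( i < a.length && j < b.length ) {
      val v = math.min( a( i ), b( j ) )

      // Step past every observation equal to v in both samples, so ties
      // are counted before the two ECDFs are compared.
      while ( i < a.length && a( i ) == v ) i += 1
      while ( j < b.length && b( j ) == v ) j += 1

      val gap = math.abs( i.toDouble / a.length - j.toDouble / b.length )
      if ( gap > maxGap ) maxGap = gap
    }

    // Once one sample is exhausted its ECDF stays at 1 while the other
    // only moves closer to 1, so the gap cannot grow any further.
    maxGap
  }
}
```

With the question's data this would be called as TwoSampleKs.statistic( d1.collect(), d2.collect() ). Identical samples give 0 and samples with no overlap give 1.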

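Comparing datasets by their KS statistic against a normal distribution

For this test I generated 3 distributions: 2 triangular ones that look similar, and an exponential one to show a large difference in the statistics.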

Note: I couldn't find any scientific papers describing this method as viable for distribution comparison so the idea is mostly empirical.

For every distribution you could almost certainly find a mirrored one with the same global maximum distance between its CDF and the normal distribution's CDF.
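One way to read this caveat is that a sample and its mirror image (every value negated) are clearly different datasets, yet each sits at the same KS distance from its own fitted normal. The sketch below illustrates that with Apache Commons Math 3 (assuming version 3.3 or later, where KolmogorovSmirnovTest was added). MirrorCheck and ksAgainstFittedNormal are hypothetical names, and the fit here uses the sample variance from StatUtils rather than Spark's stdev.

```scala
import org.apache.commons.math3.distribution.{ ExponentialDistribution, NormalDistribution }
import org.apache.commons.math3.stat.StatUtils
import org.apache.commons.math3.stat.inference.KolmogorovSmirnovTest

object MirrorCheck {

  /** One-sample KS statistic of `data` against a normal distribution
    * fitted to the sample's own mean and standard deviation.
    */
  def ksAgainstFittedNormal( data: Array[ Double ] ): Double = {
    val mean = StatUtils.mean( data )
    val sd = math.sqrt( StatUtils.variance( data ) )

    new KolmogorovSmirnovTest()
      .kolmogorovSmirnovStatistic( new NormalDistribution( mean, sd ), data )
  }

  def main( args: Array[ String ] ): Unit = {
    // The constructor argument is the mean of the exponential distribution
    val dist = new ExponentialDistribution( 0.6 )
    dist.reseedRandomGenerator( 42L )

    val sample = dist.sample( 5000 )
    val mirrored = sample.map( v => -v ) // reflect the sample around zero

    // Both lines should print the same statistic, up to rounding error
    println( ksAgainstFittedNormal( sample ) )
    println( ksAgainstFittedNormal( mirrored ) )
  }
}
```

Two datasets with matching statistics are therefore not guaranteed to share a distribution, which is why the method should be treated as a heuristic.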

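The next step is to get the KS results against a normal distribution with each sample's mean and standard deviation. I visualised them to get a better picture: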

[Figure: Breeze visualisation of the distributions and their KS test results]

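As you can see, the results (KS statistic and p-value) for the triangular distributions are close to each other, while the exponential one is far off. As stated in the note, you can easily fool this method by mirroring a dataset, but for real-world data it may be fine.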

import org.apache.spark._
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.stat.Statistics

import org.apache.commons.math3.distribution.{ ExponentialDistribution, TriangularDistribution }

import breeze.plot._
import breeze.linalg._
import breeze.numerics._

object Main {

  def main( args: Array[ String ] ): Unit = {

    val conf =
      new SparkConf()
        .setAppName( "SO Spark" )
        .setMaster( "local[*]" )
        .set( "spark.driver.host", "localhost" )

    val sc = new SparkContext( conf )

    // Create similar distributions (arguments: lower limit, mode, upper limit)
    val triDist1 = new TriangularDistribution( -3, 5, 7 )
    val triDist2 = new TriangularDistribution( -3, 7, 7 )

    // Exponential distribution to show big difference (the argument is the mean)
    val expDist1 = new ExponentialDistribution( 0.6 )

    // Sample data from the distributions and parallelize it
    val n = 100000
    val sampledTriDist1 = sc.parallelize( triDist1.sample( n ) )
    val sampledTriDist2 = sc.parallelize( triDist2.sample( n ) )
    val sampledExpDist1 = sc.parallelize( expDist1.sample( n ) )

    // KS tests: each sample against a normal with its own mean and stdev
    val resultTriDist1 = Statistics
      .kolmogorovSmirnovTest( sampledTriDist1,
                              "norm",
                              sampledTriDist1.mean,
                              sampledTriDist1.stdev )

    val resultTriDist2 = Statistics
      .kolmogorovSmirnovTest( sampledTriDist2,
                              "norm",
                              sampledTriDist2.mean,
                              sampledTriDist2.stdev )

    val resultExpDist1 = Statistics
      .kolmogorovSmirnovTest( sampledExpDist1,
                              "norm",
                              sampledExpDist1.mean,
                              sampledExpDist1.stdev )

    // Results, formatted as "name: ( statistic, p-value )"
    val statsTriDist1 =
      "Tri1: ( " +
        resultTriDist1.statistic +
        ", " +
        resultTriDist1.pValue +
        " )"

    val statsTriDist2 =
      "Tri2: ( " +
        resultTriDist2.statistic +
        ", " +
        resultTriDist2.pValue +
        " )"

    val statsExpDist1 =
      "Exp1: ( " +
        resultExpDist1.statistic +
        ", " +
        resultExpDist1.pValue +
        " )"

    println( statsTriDist1 )
    println( statsTriDist2 )
    println( statsExpDist1 )

    // Visualize: plot each sorted sample against its rank
    val graphCanvas = Figure()

    val mainPlot =
      graphCanvas
        .subplot( 0 )

    mainPlot.legend = true

    val x = linspace( 1, n, n )

    mainPlot += plot( x,
                      sampledTriDist1.sortBy( v => v ).take( n ),
                      name = statsTriDist1 )

    mainPlot += plot( x,
                      sampledTriDist2.sortBy( v => v ).take( n ),
                      name = statsTriDist2 )

    mainPlot += plot( x,
                      sampledExpDist1.sortBy( v => v ).take( n ),
                      name = statsExpDist1 )

    mainPlot.xlabel = "x"
    mainPlot.ylabel = "sorted sample"

    mainPlot.title = "KS results for 2 Triangular and 1 Exponential Distributions"

    graphCanvas.saveas( "ks-sample.png", 300 )

    sc.stop()
  }
}
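If both samples are small enough to collect to the driver, a true two-sample test is also available outside MLlib. The sketch below uses KolmogorovSmirnovTest from Apache Commons Math 3 (assuming version 3.3 or later), which accepts two arrays and returns both the statistic and a p-value. TwoSampleOnDriver and compare are hypothetical names, and the smaller sample size of 2000 is an arbitrary choice for illustration.

```scala
import org.apache.commons.math3.distribution.{ ExponentialDistribution, TriangularDistribution }
import org.apache.commons.math3.stat.inference.KolmogorovSmirnovTest

object TwoSampleOnDriver {

  /** Returns ( KS statistic, p-value ) for the two-sample test on local arrays. */
  def compare( x: Array[ Double ], y: Array[ Double ] ): ( Double, Double ) = {
    val ks = new KolmogorovSmirnovTest()
    ( ks.kolmogorovSmirnovStatistic( x, y ), ks.kolmogorovSmirnovTest( x, y ) )
  }

  def main( args: Array[ String ] ): Unit = {
    // Same distributions as in the example above
    val triDist1 = new TriangularDistribution( -3, 5, 7 )
    val triDist2 = new TriangularDistribution( -3, 7, 7 )
    val expDist1 = new ExponentialDistribution( 0.6 )

    // Fixed seeds so that repeated runs draw the same samples
    triDist1.reseedRandomGenerator( 1L )
    triDist2.reseedRandomGenerator( 2L )
    expDist1.reseedRandomGenerator( 3L )

    val n = 2000
    val tri1 = triDist1.sample( n )
    val tri2 = triDist2.sample( n )
    val exp1 = expDist1.sample( n )

    println( "Tri1 vs Tri2: " + compare( tri1, tri2 ) )
    println( "Tri1 vs Exp1: " + compare( tri1, exp1 ) )
  }
}
```

With RDDs this would be called as TwoSampleOnDriver.compare( d1.collect(), d2.collect() ). That only works while the data fits in driver memory, so for larger datasets it would have to run on a subsample.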

This content comes from a similar question on Stack Overflow about scala - Apache Spark - two-sample Kolmogorov-Smirnov test: https://stackoverflow.com/questions/46471399/
