FlexMix–Interface - Model–based Clustering with Copulas

Finally, FLXMCcopula implements the M–step driver for FlexMix. It takes a reference mvdc instance, which was created and initialized by the user as parameter and returns a FLXMC instance which is then used by FlexMix.

Listing 4: FLXMCcopula

1 F L X M C c o p u l a

<-2 f u n c t i o n ( f o r m u l a =. ~ . , mvd c = NUL L ) {

3 z < - new ( " F L X M C " ,

4 w e i g h t e d = TRUE ,

5 f o r m u l a = formula ,

6 di st = " c o p u l a " ,

7 na me = " model - b a s e d c o p u l a c l u s t e r i n g " )

9 if ( is . n ull ( mvdc ))

10 st op ( " m vdc o b j e c t m i s s i n g " )

12 z @ d e f i n e C o m p o n e n t <- e x p r e s s i o n ({

13 l o g L i k <- f u n c t i o n ( x , y ) {

14 log ( d m v d c ( mvdc , y ))

15 }

17 p r e d i c t < - f u n c t i o n ( x , . ..) {

18 st op ( " p r e d i c t () not i m p l e m e n t e d " )

19 }

21 new ( " F L X c o m p o n e n t " ,

22 p a r a m e t e r s = li st ( mv dc = mv dc ) ,

23 df =1 ,

24 l o g L i k = logLik ,

25 p r e d i c t = p r e d i c t )

26 })

27 28

29 z @ f i t <- f u n c t i o n ( x , y , w ) {

30 mv dc <- c o p M L E ( y , mvdc , w )

32 pa ra <- l ist ( m vdc = m vdc )

33 wi th ( para , e val ( z @ d e f i n e C o m p o n e n t ))

34 }

36 z

37 }

At first, anFLXMC object is instantiated and initialised with the formula parameter (line 5) and the notion that weighted estimation is being used (line 4) as well as some metadata (lines 6–7).

Next, some slots of the newly–created FLXMC instance have to be ini-tialised; the fit slot is assigned a wrapper function around copMLE that returns an instance of FLXcomponent (lines 29–34). The parameters x and y correspond to the respective values in the formula parameter earlier. de-fineComponent then stores a function that creates anFLXcomponent object and expects to find an mvdc instance in its environment that contains the parameters estimated in fit (lines 12–26). Also, the FLXcomponent object is supplied with functions logLik and predict. The former computes the log–likelihood per observation, while the latter was left unimplemented. pre-dict would usually predict the target variable given a dataset, in this case however only clustering was of interest and since there is no target variable, prediction would not have been very meaningful.

The end user would then usually call flexmix() with the FLXMC object and thus obtain an object of class flexmix which encodes the estimated model. parameters() returns a list of mvdc instances, one per estimated component, where the parameters of each component can be extracted with mvdcGetParam.

C Simulation Code

This section contains the code that was used to run the simulations from section 5. The underlying idea was to do the simulations on a number of computers, which would each repeatedly run one simulation and store the results locally in a file. In each simulation run, one kind of a simulation and a corresponding parameter (e.g. cluster size in simulations 1–3) was chosen randomly, so there would not have to be any coordination between the com-puters. Also, this way simulations could be terminated while only affecting the current model and one did not have to wait until a set of simulations was complete. The data files were then (incrementally) copied to a central location where the results were post–processed (such as calculating the pa-rameters’ estimation error) and stored in a more convenient format that can be read by read.table.

First off a couple of utility functions;cluster.distance implements the sum part of equation 4. origParamandestParamare each lists of parameter vectors (original and estimated respectively). perm is a vector denoting the permutation that should be applied to the cluster labels.

Listing 5: cluster.distance

1 c l u s t e r . d i s t a n c e

<-2 f u n c t i o n ( o r i g P a r a m , est Par am , per m ) {

3 sm < - 0

4 for ( i in 1: l e n g t h ( o r i g P a r a m )) {

5 dif <- o r i g P a r a m [[ i ]] - e s t P a r a m [[ p erm [ i ] ]]

6 sm < - sm + sqrt ( sum ( dif * dif ))

7 }

8 r e t u r n ( sm )

9 }

cluster.order tries to find the permutation that minimizes the sum in equation 4. It determines all possible permutations using the equally named function permutations from the gtools package, calculates the estimation difference for each permutation and then returns the permutation with the least difference.

Listing 6: cluster.order

1 c l u s t e r . o r d e r

<-2 f u n c t i o n ( o r i g P a r a m , e s t P a r a m ) {

3 s t o p i f n o t ( l e n g t h ( o r i g P a r a m ) == l e n g t h ( e s t P a r a m ))

5 l n g t h <- l e n g t h ( o r i g P a r a m )

6 m p e r m <- p e r m u t a t i o n s ( lngth , l n g t h )

7 pe rm <- l a p p l y ( seq . int ( n row ( m p e r m )) ,

8 f u n c t i o n ( x ) m p e r m [ x ,])

10 d i f l i s t < - d o u b l e ( l e n g t h ( p erm ))

12 for ( i in 1: l e n g t h ( per m )) {

13 d i f l i s t [ i ] <- c l u s t e r . d i s t a n c e ( o r i g P a r a m ,

14 es tPa ram ,

15 pe rm [[ i ]])

16 }

18 r e t u r n ( per m [[ w h i c h . min ( d i f l i s t )] ])

19 }

cluster.classification determines the amount of correctly classified ob-servations given the permutationperm. origand estare each vectors where the i–th entry contains the assigned cluster number of thei–th observation.

Listing 7: cluster.classification

1 c l u s t e r . c l a s s i f i c a t i o n <

-2 f u n c t i o n ( orig , est , per m ) {

3 est <- s a p p l y ( est , f u n c t i o n ( x ) pe rm [ x ])

4 N c o r r e c t < - l e n g t h ( w h i c h ( ori g == est ))

5 r e t u r n ( N c o r r e c t / l e n g t h ( o rig ))

6 }

The following five functions construct a number of objects needed for a cer-tain kind of simulation without actually running the simulation; that is, a reference mvdc instance is created, containing the starting values for the op-timizer. Then for each cluster an mvdc instance of the same structure is created, with the cluster’s parameters, as well as a vector containing the sample sizes of each cluster. Each of these functions represent one simulation type from section 5.

Listing 8: sim.exp2345

1 sim . e x p 2 3 4 5

<-2 f u n c t i o n ( n = NU LL ) {

3 if ( is . n ull ( n ))

4 n < - s a m p l e ( x = c ( 1 : 1 0 * 50) , si ze =1)

6 my . m vdc < - mvdc ( c o p u l a = g u m b e l C o p u l a ( p a r a m =1 , dim =2) ,

7 m a r g i n s = c ( " exp " , " exp " ) ,

8 p a r a m M a r g i n s = li st (

9 li st ( ra te =1) ,

10 li st ( ra te =1)

11 )

12 )

14 c l u s t e r L i s t <- c (

15 m v d c S e t P a r a m ( c (2 ,3 , 2) , my . mv dc ) ,

16 m v d c S e t P a r a m ( c (4 ,5 , 1) , my . mv dc )

17 )

19 c l u s t e r S i z e s < - c ( n , n )

21 li st (

22 na me = " e x p 2 3 4 5 " ,

23 c l u s t e r L i s t = c l u s t e r L i s t ,

24 c l u s t e r S i z e s = c l u s t e r S i z e s ,

25 s t a r t M v d c = my . mvd c

26 )

27 }

Listing 9: sim.exp3553

1 sim . e x p 3 5 5 3

<-2 f u n c t i o n ( n = NU LL ) {

3 if ( is . n ull ( n ))

4 n < - s a m p l e ( x = c ( 1 : 1 0 * 50) , si ze =1)

6 my . m vdc < - mvdc ( c o p u l a = g u m b e l C o p u l a ( p a r a m =1 , dim =2) ,

7 m a r g i n s = c ( " exp " , " exp " ) ,

8 p a r a m M a r g i n s = li st (

9 li st ( ra te =1) ,

10 li st ( ra te =1)

11 )

12 )

14 c l u s t e r L i s t <- c (

15 m v d c S e t P a r a m ( c (3 ,5 , 2) , my . mv dc ) ,

16 m v d c S e t P a r a m ( c (5 ,3 , 1) , my . mv dc )

17 )

19 c l u s t e r S i z e s < - c ( n , n )

21 li st (

22 na me = " e x p 3 5 5 3 " ,

23 c l u s t e r L i s t = c l u s t e r L i s t ,

24 c l u s t e r S i z e s = c l u s t e r S i z e s ,

25 s t a r t M v d c = my . mvd c

26 )

27 }

Listing 10: sim.gamma24455434

1 sim . g a m m a 2 4 4 5 5 4 3 4

<-2 f u n c t i o n ( n = NU LL ) {

3 if ( is . n ull ( n ))

4 n < - s a m p l e ( x = c ( 1 : 1 0 * 50) , si ze =1)

6 my . m vdc < - mvdc ( c o p u l a = g u m b e l C o p u l a ( p a r a m =1 , dim =2) ,

7 m a r g i n s = c ( " g a m m a " , " g a m m a " ) ,

8 p a r a m M a r g i n s = li st (

9 li st ( s h a p e =2 , s c a l e =2) ,

10 li st ( s h a p e =2 , s c a l e =2)

11 )

12 )

13 14

15 c l u s t e r L i s t <- c (

16 m v d c S e t P a r a m ( c (2 ,4 , 4 ,5 , 2) , my . mv dc ) ,

17 m v d c S e t P a r a m ( c (5 ,4 , 3 ,4 , 1) , my . mv dc )

18 )

20 c l u s t e r S i z e s < - c ( n , n )

22 li st (

23 na me = " g a m m a 2 4 4 5 5 4 3 4 " ,

24 c l u s t e r L i s t = c l u s t e r L i s t ,

25 c l u s t e r S i z e s = c l u s t e r S i z e s ,

26 s t a r t M v d c = my . mvd c

27 )

28 29 }

Listing 11: sim.theta12345

1 sim . t h e t a 1 2 3 4 5 <

-2 f u n c t i o n ( t h e t a = NUL L ) {

3 n < - 400

5 if ( is . n ull ( t h e t a ))

6 t h e t a <- s a m p l e ( seq . int (5) , si ze =1)

8 my . m vdc < - mvdc ( c o p u l a = g u m b e l C o p u l a ( p a r a m =1.5 , dim =2) ,

9 m a r g i n s = c ( " exp " , " exp " ) ,

10 p a r a m M a r g i n s = li st (

11 li st ( ra te =1) ,

12 li st ( ra te =1)

13 )

14 )

16 c l u s t e r L i s t <- c (

17 m v d c S e t P a r a m ( c (3 ,5 , t h e t a ) , my . mvd c ) ,

18 m v d c S e t P a r a m ( c (5 ,3 , t h e t a ) , my . mvd c )

19 )

21 c l u s t e r S i z e s < - c ( n , n )

23 li st (

24 na me = " t h e t a 1 2 3 4 5 " ,

25 c l u s t e r L i s t = c l u s t e r L i s t ,

26 c l u s t e r S i z e s = c l u s t e r S i z e s ,

27 s t a r t M v d c = my . mvd c

28 )

29 30 }

Listing 12: sim.Ncluster

1 sim . N c l u s t e r

<-2 f u n c t i o n ( n = NU LL ) {

3 if ( is . n ull ( n )) {

4 n < - s a m p l e (1+ seq . int (4) , s ize =1)

5 }

7 my . m vdc < - mvdc ( c o p u l a = g u m b e l C o p u l a ( p a r a m =1.5 , dim =2) ,

8 m a r g i n s = c ( " exp " , " exp " ) ,

9 p a r a m M a r g i n s = li st (

10 li st ( ra te =1.5) ,

11 li st ( ra te = 1 . 5 )

12 )

13 )

15 t h e t a <- 2

17 c l u s t e r L i s t <- c (

18 m v d c S e t P a r a m ( c (1 ,5 , t h e t a ) , my . mvd c ) ,

19 m v d c S e t P a r a m ( c (2 ,4 , t h e t a ) , my . mvd c ) ,

20 m v d c S e t P a r a m ( c (3 ,3 , t h e t a ) , my . mvd c ) ,

21 m v d c S e t P a r a m ( c (4 ,2 , t h e t a ) , my . mvd c ) ,

22 m v d c S e t P a r a m ( c (5 ,1 , t h e t a ) , my . mvd c )

23 )

25 c l u s t e r L i s t <- s a m p l e ( c l u s t e r L i s t , si ze = n , r e p l a c e = F )

27 c l u s t e r S i z e s < - rep (400 , n )

29 li st (

30 na me = " N c l u s t e r " ,

31 c l u s t e r L i s t = c l u s t e r L i s t ,

32 c l u s t e r S i z e s = c l u s t e r S i z e s ,

33 s t a r t M v d c = my . mvd c

34 )

35 }

sim.once runs just one simulation, calling one of the functions above. It calls one of the functions above, generates the dataset and original cluster assignments. Then flexmixis called and the result stored along with some metadata.

Listing 13: sim.once

1 sim . on ce

<-2 f u n c t i o n ( s i m n a m e = NULL , n = NUL L ) {

3 if ( is . n ull ( s i m n a m e )) {

4 s i m L i s t < - c ( " sim . e x p 3 5 5 3 " , " sim . e x p 2 3 4 5 " ,

5 " sim . g a m m a 2 4 4 5 5 4 3 4 " ,

6 " sim . t h e t a 1 2 3 4 5 " , " sim . N c l u s t e r " )

7 sim <- s a m p l e ( simList , siz e =1)

8 } e lse {

9 sim <- p a s t e ( " sim . " , simname , sep = " " )

10 }

12 ret <- ev al ( ca ll ( sim , n ))

14 s i m N a m e < - ret $ n ame

15 c l u s t e r L i s t <- ret $ c l u s t e r L i s t

16 c l u s t e r S i z e s < - ret $ c l u s t e r S i z e s

17 s t a r t M v d c <- ret $ s t a r t M v d c

18 N c l u s t e r <- l e n g t h ( c l u s t e r L i s t )

20 f n a m e <- s p r i n t f ( " dat a / sim % d _ % d " ,

21 as . i n t e g e r ( Sys . ti me ()) ,

22 as . i n t e g e r ( r u n i f ( 1 , 1 , 9 9 9 9 9 9 9 ) ) )

23 o u t f n a m e < - p a s t e ( fname , " . Rou t " , sep = " " )

24 f n a m e <- p a s t e ( fname , " . R D a t a " , sep = " " )

25 26

27 N < - sum ( c l u s t e r S i z e s )

29 cat ( " N ame : " , simName , " \ n " )

30 cat ( c l a s s ( s t a r t M v d c @ c o p u l a ) , " dim = " ,

31 s t a r t M v d c @ c o p u l a @ d i m e n s i o n , " p a r a m = " ,

32 s t a r t M v d c @ c o p u l a @ p a r a m e t e r s , " \ n " )

33 cat ( " M a r g i n s : " , s t a r t M v d c @ m a r g i n s , " \ n " )

34 cat ( N , " : " , c l u s t e r S i z e s , " \ n " )

35 cat ( " \ n " )

37 o r i g C l u s t e r <- c ()

38 da ta <- N ULL

40 for ( i in 1: N c l u s t e r ) {

41 o r i g C l u s t e r <- c ( o r i g C l u s t e r ,

42 rep . int ( i , c l u s t e r S i z e s [ i ]))

44 d a t a C l u s t e r <- r m v d c ( c l u s t e r L i s t [[ i ]] ,

45 c l u s t e r S i z e s [ i ])

46 if ( is . n ull ( data ))

47 da ta <- d a t a C l u s t e r

48 el se

49 da ta <- r b i n d ( data , d a t a C l u s t e r )

50 }

52 ti me . s t a r t < - Sys . t ime ()

53 mod <- NU LL

54 try ( mod

<-55 f l e x m i x ( d ata ~ 1 , k = N clu ste r ,

56 m o d e l = F L X M C c o p u l a ( mvd c = s t a r t M v d c )))

57 ti me . end <- Sys . time ()

58 59

60 out <- li st (

61 da ta = data ,

62 na me = simName ,

63 or ig = o r i g C l u s t e r ,

64 m v d c L i s t = c l u s t e r L i s t ,

65 s i z e s = c l u s t e r S i z e s ,

66 mv dc = s t a r t M v d c ,

67 m o d e l = mod ,

68 s t a r t T i m e = t ime . start ,

69 e n d T i m e = ti me . end ,

70 s y s I n f o = Sys . inf o ()

71 )

73 sa ve ( out , fi le = fname , c o m p r e s s = " b z i p 2 " )

74 }

makedata.R is script that iterates over all the data files created bysim.once and then stores the results of the simulation using write.table. Mostly the estimation error, the ratio of correctly classified observations and simulation covariates, such as the sample sizes of the clusters, the amount of clusters or the value of θ (depending on simulation type) is stored.

Listing 14: makedata.R

1 l i b r a r y ( c o p u l a )

2 l i b r a r y ( f l e x m i x )

3 s o u r c e ( " f l x m c c o p u l a . R " )

4 s o u r c e ( " sim . R " )

5 6

7 f i l e s <- Sys . glo b ( " d ata / * . R D a t a " )

8 N f i l e s <- l e n g t h ( f i l e s )

10 o l d d a t a < - NUL L

11 try ( o l d d a t a <- r ead . t a b l e ( " d ata / sim . txt " ,

12 h e a d e r = T , as . is = T ))

13 if ( ! is . nu ll ( o l d d a t a )) {

14 if ( n row ( o l d d a t a ) == 0)

15 o l d d a t a < - NUL L

16 }

18 cnt <- 0

19 da ta <- N ULL

21 s t a r t <- Sys . tim e ()

23 for ( fn in f i l e s ) {

24 cnt <- cnt +1

25 out <- NU LL

26 cat ( s p r i n t f ( " %5 d / %5 d %2. 1 f %% >> % s \ n " ,

27 cnt , Nfiles , cnt / N f i l e s * 100 , fn ))

28 n e w i d <- s u b s t r ( fn , 9 , n c h a r ( fn ) -6)

30 if ( ! is . nu ll ( o l d d a t a )) {

31 if ( n row ( s u b s e t ( olddata , id == n e w i d )) > 0) {

32 ne xt

33 }

34 }

36 lo ad ( fn )

38 if ( ! is . nu ll ( out ) & ! is . n ull ( out $ m o d e l )) {

39 if ( l e n g t h ( p a r a m e t e r s ( out $ m o d e l )) ==

40 l e n g t h ( out $ m v d c L i s t )) {

41 p a r a m . or ig <- l a p p l y ( out $ mvd cLi st , m v d c G e t P a r a m )

42 p a r a m . est <- l a p p l y ( p a r a m e t e r s ( out $ m o d e l ) , m v d c G e t P a r a m )

44 pe rm <- c l u s t e r . o r d e r ( p a r a m . orig , p a r a m . est )

45 st at <- c l u s t e r . d i s t a n c e ( p a r a m . orig , p a r a m . est , p erm )

46 r a t i o <- c l u s t e r . c l a s s i f i c a t i o n ( out $ orig ,

47 c l u s t e r s ( out $ m o d e l ) ,

48 pe rm )

49 } e lse {

50 pe rm <- NA

51 st at <- NA

52 r a t i o <- NA

53 }

55 d u r a t i o n < - as . d o u b l e ( out $ e n d T i m e - out $ s t a r t T i m e ,

56 u n i t s = " s ecs " )

58 t h e t a <- NA

59 N c l u s t e r < - 2

61 if ( out $ na me == " t h e t a 1 2 3 4 5 " ) {

62 t h e t a <- out $ m v d c L i s t [ [ 1 ] ] @ c o p u l a @ p a r a m e t e r s

63 } e lse if ( out $ n ame == " N c l u s t e r " ) {

64 N c l u s t e r < - l e n g t h ( out $ m v d c L i s t )

65 }

67 n e w d a t a < - c ( newid , out $ name ,

68 stat , ratio ,

69 sum ( out $ s i z e s ) , Ncl ust er ,

70 theta , du rat ion ,

71 out $ s y s I n f o [[ " n o d e n a m e " ]] )

73 if ( is . n ull ( data )) {

74 da ta <- m a t r i x ( nr ow =1 , d ata = n e w d a t a )

75 } e lse {

76 da ta <- r b i n d ( data , n e w d a t a )

77 }

78 }

79 }

81 end <- Sys . tim e ()

82 p r i n t ( as . d o u b l e ( end - start , u n i t s = " s ecs " ))

84 r o w n a m e s ( d ata ) <- NU LL

85 df < - as . dat a . f r a m e ( data , s t r i n g s A s F a c t o r s = F )

86 c o l n a m e s ( df ) <- c ( " id " , " nam e " , " st at " , " r a t i o " ,

87 " N " , " N c l u s t e r " ,

88 " t h e t a " , " d u r a t i o n " ,

89 " n o d e n a m e " )

91 if ( ! is . nu ll ( o l d d a t a )) {

92 if ( all ( c o l n a m e s ( df ) == c o l n a m e s ( o l d d a t a ))) {

93 df < - r b i n d ( df , o l d d a t a )

94 }

95 }

97 w r i t e . t a b l e ( df , fi le = " d ata / sim . txt " )

References

I. N. Bronstein, K. A. Semendjajew, G. Musiol, and H. M¨uhlig. Taschen-buch der Mathematik. Thun, Frankfurt am Main: Verlag Harri Deutsch, 6 edition, 2005.

Edward W. Frees and Emiliano A. Valdez. Understanding relationships using copulas. North American Actuarial Journal, 2:1–25, 1998.

Christian Genest and Anne C. Favre. Everything you always wanted to know about copula modeling but were afraid to ask. Journal of Hydrologic Engineering, 12(4):347–368, 2007.

Christian Genest, Bruno R´emillard, and David Beaudoin. Goodness-of-fit tests for copulas: A review and a power study. Insurance: Mathematics and Economics, 44(2):199–213, April 2009. URL http://ideas.repec.

org/a/eee/insuma/v44y2009i2p199-213.html.

Ivan Kojadinovic, Jun Yan, and Mark Holmes. Fast large–sample goodness-of-fit tests for copulas. Statistica Sinica, in press.

Ivan Kondofersky. Modellbasiertes Clustern mit der Beta-Binomialverteilung.

Bachelor’s thesis, Ludwig–Maximilians–Universit¨at M¨unchen, 2008.

Friedrich Leisch. Flexmix: A general framework for finite mixture models and latent class regression in R. Journal of Statistical Software, 11(8):1–18, 10 2004. ISSN 1548-7660. URL http://www.jstatsoft.org/v11/i08.

Uwe Ligges. Programmieren mit R. Springer-Verlag Berlin Heidelberg, 2007.

URL http://ebooks.ub.uni-muenchen.de/8279/.

Roger B. Nelsen. An Introduction to Copulas (Springer Series in Statis-tics). Springer, 2nd edition, 10 2007. ISBN 9780387286594. URL http://amazon.com/o/ASIN/0387286594/.

Jun Yan. Enjoy the joy of copulas: with a package copula. Journal of Statistical Software, 21(4):1–21, 2007.

Im Dokument Model–based Clustering with Copulas (Seite 31-44)