Social networks have emerged as a communication tool of unexpected impact. Frequent contact between people through these networks gives rise to virtual relationships developed according to their interests. This study’s approach follows  which poses emphasis not only in the individual behaviour but in social interaction within the network.
One of today’s challenges is to develop accurate tools to identify influential users by sectors and markets, and to understand information flow dynamics. In turn, it is difficult to fully observe a social network; therefore statistical problems and probabilistic modelling are important issues (see   and  for surveys on this subject). Besides this topic, social networks also provide examples of situations of unidentifiable or censored data models and this makes them particularly interesting.
The above concerns lead us to study virtual social networks phenomenon and we restricted our analysis to Facebook. For this we develop a model which, although does not describe it in full reality, it is useful as an approach to network’s dynamics study. Naturally this model leads to graph theory which relates to work in  and  .
We will use mathematical tools and statistics from stochastic processes field    , trying to answer questions such as the existence of transversal communication or how to find if a network segmentation proposed is optimal within a certain communication behaviour between segments.
This paper is structured as follows. In Section 2 we address Facebook modelling using tools of Markov chains, we introduce the concept of complete transversality in the communication, and in this context, we try to find the distribution of random functions involved in the model.
Section 3 proposes two statistical hypothesis tests, one to prove network’s CT and the other to prove CT between network segments using U-statistics theory and asymptotic convergence theorems presented in  . We end this section, defining useful performance index to measure segmentation’s quality.
Section 4 is devoted to conclusions and acknowledgment.
Section 5 is an appendix with proofs of the results obtained in the preceding sections.
Further and related works on this subject can be read in one of author’s PhD theses  .
2. Model Description
In this section we propose a model for describe Facebook dynamics.
Consider . Let denote the set of all internet users at instant t and the set of Facebook’s profiles at time t. Of course and, we’ll also suppose that once a profile is created it cannot be eliminated. (Actually a profile can be eliminated, but this is tedious and difficult to do, so we disregard this behaviour.) Then .
We also denote and to the sets and respectively.
For each instant t, we will model the network with a random graph where the nodes represent profiles and the edges represent friendship links.
Definition 1. Let . We define the random “friendship” function at time t as the function , such that if ,
Remark. The “friendship” relationship is symmetric, so function is also symmetric. Then, the random graph determined by is bi-directional and we can define the adjacency matrix as follows.
Definition 2. Let M be the cardinal number of and let be the set of binary symmetric matrices of order M. We define the random “friendship” matrix at time t as the matrix whose elements are the values taken by for each pair of profiles .
Due to the high level of interactivity on social networks, is natural to suppose that is a Markov chain with states space in . Then, given two states A and B of , we denote to the one step transition probability from A to B.
Because of symmetry’s, is determined by the one step transition probability of the lower subdiagonal’s elements of . Then, if we say that f precedes if and we write and, given , we denote to the set of the lower subdiagonal’s entries of A, we have that:
For (1) calculation’s, we describe through events the profile’s actions which have impact on the transition. These events will be linked to certain indices that we will construct to measure affinities between profiles.
We will use the following notation. Let A be any set, we will denote:
to the set of distinct pairs of A’s elements, and by
to the set of triples of A’s elements they are different by pairs.
Definition 3. Let , with . We define the p “image index” over as the function such that
Remark. We will suppose that the network lies at steady state, this implies that image between profiles has also evolved to a steady state, so the function X does not depends on time t. Besides, function X is not symmetric, non observable and monotonic.
Definition 4. Let , with . We define f “image index” over g as the function , given by
with and the users sets that administrate f and g profiles respectively, a monotonic regression function, and independent and identically distributed random variables with and , for all pair , .
To triples ordered of users and triples ordered of profiles we define the following indices.
Definition 5. Let . For the triple ordered of users we define an “image index” as the function
that assigns to each triple, a real number that represents acceptance level for the action p suggests to to be a friend of .
Remark. As with X, U doesn’t depend on t, is non symmetric and non observable.
Definition 6. Let distinct by pairs. For the triple ordered of profiles we define the “index image” as the function
with , and the users sets that administrate f, g and h profiles respectively, a monotonic regression function and independent and identically distributed random variables with and , for every triple , distinct by pairs.
Then, given , we will suppose that there are , and , such that:
i) “f breaks friendship with g” Û .
(The breakdown of friendship may be due to that f take decision to eliminate g or vice versa).
ii) “f successfully requests friendship to g” Û .
(The request of successful friendship arises when f asks to g for friendship and g accepts to f as his friend).
iii) “f successfully suggests to g, h’s friendship” Û .
(The suggestion of successful friendship is given when f suggests h to g, g requests h’s friendship and h accepts).
Then, if we denote with to the interventions of f regardless its effects in the transitions from one instant to the next from to , these can be described in a disjunct union of the following events:
that is, , and the transition probabilities in one step from to can be expressed in the following formula:
Proposition 1. The one step transition probability from to of the Markov chain is given by:
Complete Transversality (CT) in a social network is associated to a certain communication behavior. This behavior implies that relationship probability between two profiles is always the same. No matter who the profiles are.
CT arises from social scientific theories  that poses that massification of social networks would bring an horizontality in communication and transversality in connection between people, overpassing social, economic, ethnographic and other differences. Virtual social networks would bring balance and democracy to all people connection.
In terms of the model, this could be reflected in the “image index” X of one user p over another and in the “image index” U of the triple ordered of users distinct by pairs. So, averages of regression functions and takes the same value in (2) and (3). Let’s say they are equal to for all distinct user pairs and for all triples ordered of users distinct by pairs. Under this conditions, “image indices” given by (2) and (3) are reduced to and .
When network follows this behaviour, the following results can be established.
Theorem 1. (Homogeneity)
Under the CT context, is a time-homogeneous Markov chain.
Theorem 2. (Ergodicity)
Under the CT context, the Markov chain is ergodic, and when initial distribution is ergodic, the friendship indicator between a pair of profiles , with , denoted by , is a random variable with Bernoulli distribution and parameter p, , with the same distribution and independent of the friendship indicator of any pair of distinct profiles.
From the results exposed in former Theorem we can conclude that, for , ergodic distribution under CT is
3. Test and Estimation
We are aimed to discuss the CT’s validity in social network Facebook. For this, we will propose two statistics based on samples of N profiles and will study their asymptotic distribution under CT when N increases. Besides, we will present the CT tests.
Further, in this section we will introduce a lighter version of CT called Segment Transversality (ST) related to a given segmentation, that allow us to elaborate segmentation quality index.
3.1. Average Communication between Profiles
this statistic averages proportion of friends who have profiles on the sample and therefore measures “sample’s communication average”.
Focusing on long term dynamics, we have seen that under CT, the random variables , are independent Bernoulli with parameter p. Then,
we can verify that and .
We want to find EN’s asymptotic distribution, so we study asymptotic behaviour of
when sample size N grows towards infinity.
If we denote , we have that , where forms a triangular array.
Theorem 3. If the triangular array verifies the following conditions:
i) For each , are independents;
ii) , when ;
iv) Exists such that
, for all N and for all i,
and Lyapunov conditions is met, that is,
Proof 1. Hypothesis i) is trivial because for and N fixed, is independent of .
Besides, , and
Then, the hypothesis ii) and iii) are met.
Let . We will see that iv) holds. For this, we will calculate
with , , and .
and Lyapunov condition given by (7) can be verified for :
Consequently the hypothesis i)-iv) holds and we conclude that
Returning to the centered statistic expression and to the expression of given by (9) results the following:
As EN is a communication estimation between profiles, if we select different profile samples under CT, we shouldn’t detect differences among p’s estimations.
To study this, we propose the following hypothesis test to compare communication average.
Given and , independent profile samples of , that verifies , we build the statistics:
and we have that
with , this is, mean and variance are linked and is derivable. So, if we take,
Consequently, to test average communication, we can substract (11) and (12). This statistic, under CT assumption, results
and, given a signification level α, we obtain the following critic region
Real Data Testing
We perform such an experiment in a real profile network that gives permission to the authors for sampling. For confidential reasons we cannot release any of the data used to make calculations. We can state that we take two independent and disjunct samples of size . The statistic value was
. Therefore, with a significance level, we reject CT hypothesis.
This results indicates the social network Facebook is a platform in which communication between people or groups of people is it NOT TRANSVERSAL.
3.2. Mean Square Deviation of the Communication between Profiles
be an statistic that estimates the mean square deviation of the communication between profiles respect to its mean.
In order to find the asymptotic distribution of TN we use properties of U-statistics introduced in  .
Suppose that are independent and identically distributed random variables and that , is some symmetric function respect to permutations.
Definition 7. A U-statistic of order r with kernel h is defined as
We state the following theorem whose proof can be seen in  .
Theorem 4. Let UN be a U-statistic of order r with kernel h. Suppose that and that
i) are independent and identically distributed random variables,
Then, if ,
Then we have that TN is a U-statistic of order 2 with kernel
where and . Thus, under CT, and .
Remark. For the proposed model for Facebook, p is the friendship probability of any pair of profiles in long term and, because of the large size of the network, p is probably less than 0.5. Thus, , and , therefore .
As , by the Theorem 4 is
As and , we see that and is derivable. Taking
verifies that and the limit expression on (16) is
We can make a test to prove CT by comparing the mean square deviation of two independent populations of profiles. For this we take two independent samples of N profiles of , and such that , we construct the statistics
then, by (17), we obtain the asymptotic distribution of the statistic for the comparison of mean square deviations under the context of CT, that is,
being the critical region at the level of significance α
Similarly as for the test comparing the average proportion of communication between profiles, we use the same real profile and, on his network of friends, we take two independent and disjoint samples and calculate the statistic and the critical region, concluding that the hypothesis of CT is rejected again.
3.3. Segmented Transversality
A context of CT in a social network like Facebook is far from reality as evidenced by the findings of the two tests that we made. Is reasonable to think that profiles tend to cluster in different segments according to social criteria such as political ideologies, economic interests, musical tastes, ages, etc., and that these segments are also related to each other.
We introduce the concept of Segmented Transversality (ST), that is, CT between segments. Then, making a priori segmentation on the network, we will introduce a statistic representing the communication between pairs of segments and we will prove CT between the profiles of the segments.
Let be a partition in segments of . We notice with to the proportion of profiles of , , , , and we make a random stratified sampling by segments as follows: are chosen randomly inside of , are chosen randomly inside of , and so on until are chosen randomly inside of , being [x] the integer part of x, that is, , for .
Given the sets , , , and, given and two segments of
the partition on k segments of , for and , with and , we notice with to the probability of friendship between and , that is, .
Remark. Friendship’s random functions still are independent random variables with Bernoulli distribution, but now the parameter distribution depends on the intensity of the relationship between the pair of profiles considered.
Definition 8. Given two segments of profiles and , we say that there is CT between them if given and , with and , has Bernoulli distribution with parameter , for all and for all .
That is, CT between segments means a distinctive homogeneous behavior in the communication between them.
be the average proportion of friends of the profiles of in the segment . Then, under CT between and , we have that and
Of the same way as we obtain the asymptotic distribution of the centered estimator of the expression (6), we can obtain the asymptotic distribution of
If we want to test CT between a pair of segments and we make a stratified random sampling independent from the previous one in which are chosen randomly inside of , are chosen randomly inside of , and so on until are chosen randomly inside of , with
and we construct the statistic
Then, under CT between and , are , and
For and we have that the variance s is a function of the expected value m, that is, , where and is derivable. Then, taking
we have that and, similarly as in the previous section, we can conclude that
being the critical region at the level of significance α,
Therefore, if the test leads to the rejection of the hypothesis of CT between the segments and , with an error probability of α we say that such segments do not have a distinctive homogeneous behavior in the communication.
3.4. Quality on Segmentation
If we divide the network into k disjoint segments, we can take all possible pairs of those k segments, , and make a total of test, one of each pair and test whether the segment of this pair have a distinctive homogeneous behavior in the communication. We can represent these test by a binary symmetric matrix of order k, , in which each element is one if the segments and were not homogeneous in terms of communication, that is, if the corresponding test rejects the hypothesis of CT and, equals zero, otherwise.
Then, noticing the cardinal of the set of “ones” in the subdiagonal of as
we can define the following useful performance index to measure the quality on segmentation:
If we keep the segmentation and make a stratified random sampling segment m times, with m sufficiently large, and calculate m times the index defined in (23), , with , we can observe the histogram representing the distribution of quality on segmentation. If the most of the times this measure results, for example, greater than the mean of the observations, we continue segmenting according to the criteria which has been used, otherwise it is desirable to modify the segmentation criteria.
Let’s illustrate this with an example. Suppose we segment people by age (>15, <20, >20 and <30, >30) and gender (M, F).
Given this segmentation, suppose we conduct the 15 hypothesis test for segment transversality and obtain the following segment adjacency matrix
A one in this matrix means that segment with segment behaves distinctly in the sense of transversality communication. Zeros in the upper triangle matrix doesn’t mean anything. Each segment compare to each self is homegenous (zero in the diagonals) except in segment 6. This could mean that further segmentation within it might improve audience segmentation. Nevertheless, segment 6 distinguish from other segments anyway.
If we sum the ones under the diagonal and divided by the number of segment combinations (see (23)), we calculate the performance of this segmentation which is 40%. The more the zeros the lower the performance index.
Following this framework we can improvement this segmentation by:
1) fusion homogenous segments
2) explore intra-segmentation in the cases were there was one in the diagonal
For actions in the first group, we look that in this toy example S1 and S2 doesn’t differ in communication behaviour, so we could consider to group them as one segment. We could summaries this saying that gender doesn’t segment among young people (under 20 years). We calculate again the matrix with this augmented segment and calculate the performance. Then occam’s razor lead us to select the least segmented partition when we have two or more with same performance.
For segment 6, this is female over thirty years, we calculate that it is inhomogenous with itself, so we could try a sub-segmentation by education degree or by motherhood. Then repeat the five tests between the new segment with the previous ones and calculate performance of this new matrix with one or more rows.
Of course this iterative method implies significant work with estimation, data recompilation, amount of data, independent sampling, etc. that although might be of extreme relevance it’s beyond the scope of pure mathematics and poses a great source of scientific challenge and interdisciplinary work.
In this work we analyze Facebook social network, modelling it with a Markov Chain and several random variables representing profiles friendships. We further propose communication behaviour between all profiles called complete transversality that assumes no bias between profiles willingness to connect as friends. This CT behaviour leads to estimations that allow us a hypothesis test by means of mean square deviation to reject the CT assumption. This might be an obvious conclusion (because people behaves within Facebook as they behave in real context), but it has all the hypothesis testing machinery behind which gives it strong rigorosity.
Next step in our work was to weaken CT and, for this, we introduce ST (segment Transversality). This is given a determined network segmentation, and each segment profile connects to any other segment profile with the same probability. (Of course this probability can change with different pair of segments.)
In this ST scenario we were able to compare between two entire segments and determine whether they behave in the same way or differently. If we don’t find differences we can safely group the two segments in order to improve the original segmentation. For a given segmentation we test intra segment and every pair of different segments and define a performance index that reflects a percentage of the segments that behaves differently from all the possible comparisons. With this result, we join similar segment (don’t reject null hypothesis) and recalculate tests and performance index to the new segmentation.
This iterative process of grouping homogenous segments or leaving different leads to a hierarchy in segmentations according to segmentation’s quality. This information could be interpreted and used to determine if segmentation’s rules are adequate to distinguish segments; of course the comparison falls in this communication behavior sense.
1.1. Proof of Proposition 1
Let , and . We will see that is independent of . For this, it is enough prove that is independent from , when . In fact, if we look at , the first coordinate of both involved indices for and for , is fixed and is the first coordinate of the random variables and of the mentioned indices respectively. Then, is independent from and is independent from , hence, is independent from , when . Thus,
Applying inclusion-exclusion principle for the events , , we have
We have to check that the events , , are independent between them for a same profile f. is independent from , from and , because , which corresponds to image index involved in is independent from which corresponds to index that appears in , and .
is independent from and from , because if a profile appears in the intersections of , it doesn’t figure either in the intersections of nor in the unions of . The same argument allows us to affirm that is independent from .
Therefore, transition probability (24) is given by
We will see how , is finally written.
If , with , , , is independent from . Then, are independents: from , from , from and from . Thus,
and, applying set properties we have that
as if then and accordingly we have that and are independents.
and, for fixed , , , , , is independent from . So, are independents from and from . Besides, and are independents, as if then . Therefore,
1.2. Proof of Theorem 1
Suppose we are in a CT context. To prove time homogeneity in Markov chain we will see that probabilities involved in , , , don’t depend on time. To do this we introduce some useful notations.
Let F and G be the distribution function for random variables and related with image indices and respectively.
The CT hypothesis implies that F and G neither depend on time nor on profiles. This is, and , for all , for all and every ordered triple , with F and G continuous functions.
We also denote with and the following cardinals numbers of sets:
Consequently, under CT hypothesis we have,
Then, , , and therefore doesn’t depend on t. Consequently, is homogeneous in time.
1.3. Proof of Theorem 2
Let be the space states of and suppose that we are in a CT context.
We denote to the set of states with all ones on the diagonal.
Lemma 1. C is a closed, irreducible and aperiodic communication class.
Proof 1. Let such that . Then, for some , , that is, at time t, the friendship state indicates that for some , the profile doesn’t exists on Facebook. As we have supposed that once a profile is created it can’t be deleted, then if , , that is, an state outside of C is not accessible from a state inside of C. Then, C is closed.
Also, given , outside the diagonal, an entry equal to one can turn into in zero, or an entry with zero can turn into one in one step, that is, A and B are communicated and aperiodic, so C is irreducible and aperiodic.
As we are studying chain’s behaviour at steady state of the network, we suppose that all profiles had been created. Then, if at time t, in a finite amount of steps the state will be attracted by C. Therefore, on the long term, every states of S will be attracted in a finite amount of states by the closed, irreducible and aperiodic class C. As C is a finite set, all of its states are positive recurrent. Then, restricting the chain to C, it is closed, irreducible, aperiodic and every state is positive recurrent, hence it is ergodic.
The ergodic property, ensures limit distribution existence. In this case, if ,
Moreover, friendship functions between profiles f and g of , , were defined as dicotomic variables taking zero or one according to whether they were Facebook friends or not at time t. Then, under the ergodic distribution, as time tends to infinity, random variable , with , has Bernoulli distribution.
Under CT hypothesis, each profile of engages friendship with any other profile with the same probability, let’s say p, so that random variables , with , are identically distributed.
Also, this variables keeps direct relation with “image index” between profile pairs . Under CT this index is reduced to a constant plus a random variable. This variables are independent for all , with , so image index between distinct profile pairs are also independent. Consequently, , with , are independent.