In classical statistical analysis, the collected data usually take exact values. In the era of network technology, however, collected data are often symbolic. Diday introduced symbolic data, which are presented in the form of intervals, histograms, lists and so on. Unlike classical data, symbolic data can take more varied forms in a p-dimensional space. In this article we discuss symbolic interval data, in which each observation is an interval rather than an exact value. With this change, classical methods may no longer be applicable, so new methods are needed for the analysis of symbolic data. We study parameter estimation in linear regression with covariates for symbolic interval data.
Billard and Diday used the center point of each interval to fit a linear regression model (the centre method, CM). Carvalho et al. used the center point and the range of each interval to fit two linear regression models (CRM). Xu applied the symbolic covariance method (SCM), which was introduced by Billard, to symbolic interval data. In this article, we present two approaches to estimating regression parameters for symbolic interval data. The first is the endpoints least squares estimate, and the second is the least squares estimate with an interval weight function.
This paper is organized as follows. Section 2 introduces symbolic interval data, Model 1 and Model 2. In Section 3, we propose two methods for estimating the regression coefficients for symbolic interval data. In Section 4, we compare the proposed methods with some existing methods via simulations. In Section 5, we analyze two real datasets with the proposed approaches. Finally, we make some concluding remarks in Section 6.
2. Data and Models
2.1. Symbolic Interval Data
In this article, we study regression for symbolic interval-valued data. We first introduce symbolic interval data. In classical data, the exact value of each variable of interest can usually be observed. In the era of network technology, collected data are growing ever more complex and are no longer single points. Diday introduced a new data format called symbolic data. Symbolic data come in several types, such as intervals, histograms and lists, and these types call for new methods. For example, because of privacy concerns, exact values often cannot be collected from respondents, so questionnaires are designed to collect interval-valued responses instead. For notation, define .
, and . Thus, the observed data are .
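To make the data format concrete, the following minimal Python sketch stores each interval-valued observation as a (lower, upper) pair and derives the center and length used later in the paper. The array layout, the toy values and the helper name `centers_and_lengths` are assumptions made for illustration and are not part of the original notation.

```python
import numpy as np

# Hypothetical toy data: each row is one observation stored as (lower, upper).
y_int = np.array([[1.0, 3.0], [2.0, 5.0], [4.0, 4.5]])

def centers_and_lengths(intervals):
    """Return midpoints and lengths of an (n, 2) array of [lower, upper] intervals."""
    intervals = np.asarray(intervals, dtype=float)
    lower, upper = intervals[:, 0], intervals[:, 1]
    if np.any(lower > upper):
        raise ValueError("each interval needs lower <= upper")
    return (lower + upper) / 2.0, upper - lower

y_center, y_length = centers_and_lengths(y_int)
```

Covariates can be stored the same way, one (lower, upper) array per variable, or as two (n, p) arrays of lower and upper endpoints.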
2.2. Model 1
This model considers the linear regression model for symbolic interval data as
where , are the parameters of interest, and is the error term. Here, we assume , as also considered by Billard and Diday. Therefore, and . Because the β's are unknown, we cannot identify the order of and . Hence, is either or , and the remaining one is . This model implies that the length of depends on the lengths of . In practice, however, the length of may not depend on the lengths of . We therefore also consider a different model, Model 2.
2.3. Model 2
In typical statistical analysis, the linear regression model is
where and are single points, are the parameters of interest, and is the error term. In practice, may not be observed because of privacy concerns or for other reasons. Usually, the proxies , of can be collected instead. Note that and , , . Thus, the collected data are . In this model, the length of does not depend on the lengths of .
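The data-generating idea behind Model 2 can be sketched in Python as follows: latent single points follow an ordinary linear model, and only intervals that contain them are observed. The function name `simulate_model2`, the normal covariates and errors, and the uniform range bound `max_range` are all assumptions made for this sketch; they are not the settings used in the paper.

```python
import numpy as np

def simulate_model2(n, beta, rng, max_range=1.0, sigma=1.0):
    """Simulate interval proxies around latent points from a linear model.

    Assumed setup (illustrative only): normal covariates and errors, and
    independent uniform(0, max_range) lower and upper ranges.
    """
    beta = np.asarray(beta, dtype=float)
    p = beta.size - 1
    x = rng.normal(size=(n, p))                                     # latent covariates
    y = beta[0] + x @ beta[1:] + rng.normal(scale=sigma, size=n)   # latent response
    # The interval lengths of y are drawn independently of those of x.
    x_lo = x - rng.uniform(0.0, max_range, size=(n, p))
    x_hi = x + rng.uniform(0.0, max_range, size=(n, p))
    y_lo = y - rng.uniform(0.0, max_range, size=n)
    y_hi = y + rng.uniform(0.0, max_range, size=n)
    return x_lo, x_hi, y_lo, y_hi
```

Because the ranges are drawn independently, the length of the response interval carries no information about the lengths of the covariate intervals, which is the feature that distinguishes Model 2 from Model 1.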
3. The Proposed Estimations
3.1. Method 1: Endpoints Least Squares Estimate
Based on Model 1, we propose the endpoints least squares approach to estimate . We assume that , , , as also considered by Billard and Diday. Because is unknown, the order of cannot be identified directly. We use the following procedure to identify the order of . Model 1 can be written as follows,
where . To identify the order of and , we apply the centre method to obtain the estimates of as .
Then, compute and as
When , and . When , and . We then obtain the estimate of by endpoints least squares as
where , , , , and . Then, set
is the estimator of in Model 1.
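The procedure above can be illustrated with a short Python sketch. It is a reading of the method under stated assumptions, not the paper's exact estimator: the centre method is used only to decide the sign of each slope, the covariate endpoints are swapped for negative slopes, and one least squares fit is run on the stacked lower and upper endpoints. The function names and the stacking of the two endpoint equations into a single fit are assumptions.

```python
import numpy as np

def _design(x):
    """Prepend an intercept column to an (n, p) covariate matrix."""
    return np.column_stack([np.ones(x.shape[0]), x])

def centre_method(x_lo, x_hi, y_lo, y_hi):
    """Ordinary least squares on interval midpoints."""
    beta, *_ = np.linalg.lstsq(_design((x_lo + x_hi) / 2.0),
                               (y_lo + y_hi) / 2.0, rcond=None)
    return beta

def endpoints_lse(x_lo, x_hi, y_lo, y_hi):
    """Endpoint least squares sketch (assumed form, see lead-in)."""
    x_lo, x_hi = np.asarray(x_lo, float), np.asarray(x_hi, float)
    y_lo, y_hi = np.asarray(y_lo, float), np.asarray(y_hi, float)
    # Step 1: initial slope signs from the centre method.
    negative = centre_method(x_lo, x_hi, y_lo, y_hi)[1:] < 0
    # Step 2: for a negative slope the upper covariate endpoint drives the
    # lower response endpoint, so swap the endpoints column by column.
    x_for_lower = np.where(negative, x_hi, x_lo)
    x_for_upper = np.where(negative, x_lo, x_hi)
    # Step 3: one least squares fit over both endpoint equations.
    design = np.vstack([_design(x_for_lower), _design(x_for_upper)])
    response = np.concatenate([y_lo, y_hi])
    beta, *_ = np.linalg.lstsq(design, response, rcond=None)
    return beta
```

With noiseless data generated from Model 1, this sketch recovers the coefficients exactly, which is a quick sanity check on the endpoint ordering step.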
3.2. Method 2: Interval Weighted Least Squares Estimate
Method 2 is designed for Model 2, in which the length of does not depend on the lengths of . The centre method estimates the regression parameters by least squares using the center points of the interval data. Building on the centre method, we expect the lengths of the intervals to carry additional information for estimation. We therefore use the interval lengths to construct weight functions that give each observation a different influence in the least squares procedure. Denote the weight function by . We then propose the interval weighted least squares estimator as
where , , . From (8), the minimizer is the estimator of in Model 2. Based on our simulation experiments, we suggest the following three weight functions of the interval lengths. Denote the lengths of the intervals by , , and , , . The first weight function is
where and are positive constants. This weight function declines exponentially as the interval lengths increase. The second weight function is
where , are positive constants, and . This weight function declines linearly as the interval lengths increase. Define the standardized lengths of the interval data and as and . Let and be and . The third weight function is
where and is a positive constant. This weight function decreases when the standardized length is below the average standardized length and increases when it is above the average. We compare all methods via simulations in Section 4.
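As an illustration of the general idea, the following Python sketch fits a weighted least squares regression on the interval centers, with weights that shrink as the intervals get longer. The weight `exp_decay_weight` is a hypothetical exponentially declining function chosen only for this sketch; it is not the paper's W1, W2 or W3, and the constants `a` and `b` are placeholders.

```python
import numpy as np

def exp_decay_weight(y_len, x_len, a=1.0, b=1.0):
    """Hypothetical weight: declines exponentially in the total interval length."""
    return a * np.exp(-b * (y_len + x_len.sum(axis=1)))

def interval_weighted_lse(x_lo, x_hi, y_lo, y_hi, weight_fn=exp_decay_weight):
    """Weighted least squares on interval centers (assumed form of Method 2)."""
    x_lo, x_hi = np.asarray(x_lo, float), np.asarray(x_hi, float)
    y_lo, y_hi = np.asarray(y_lo, float), np.asarray(y_hi, float)
    x_c, y_c = (x_lo + x_hi) / 2.0, (y_lo + y_hi) / 2.0
    w = weight_fn(y_hi - y_lo, x_hi - x_lo)          # one weight per observation
    design = np.column_stack([np.ones(len(y_c)), x_c])
    # Scaling rows by sqrt(w) turns weighted least squares into ordinary
    # least squares: sum_i w_i * (y_i - x_i' beta)^2.
    root_w = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(design * root_w[:, None], y_c * root_w, rcond=None)
    return beta
```

Any other non-negative function of the interval lengths can be passed as `weight_fn`, which makes it easy to compare candidate weight functions on the same data.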
4. Simulation Studies
In this section, we compare our proposed methods, the endpoints least squares estimator (M1) and the interval weighted least squares estimator (M2), with the existing methods CM, CRM and SCM on simulated datasets. We consider two data-generating settings, one for Model 1 and one for Model 2. For each table, we present the bias, the empirical standard deviation (SD), the average of the jackknife standard deviations (JackSD), the mean squared error (MSE), and the 95% coverage probability (CP). Data are simulated with sample sizes n = 50 and 100, and R = 500 replications.
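The summary measures listed above can be computed as in the following Python sketch. It uses the standard leave-one-out jackknife and a normal-approximation interval for the coverage probability; whether the paper uses exactly these conventions is an assumption, and the helper names `ols`, `jackknife_sd` and `summarize` are introduced here for illustration.

```python
import numpy as np

def ols(design, response):
    """Ordinary least squares coefficients."""
    beta, *_ = np.linalg.lstsq(design, response, rcond=None)
    return beta

def jackknife_sd(design, response, estimator=ols):
    """Leave-one-out jackknife standard deviation of each coefficient."""
    n = len(response)
    loo = np.array([estimator(np.delete(design, i, axis=0), np.delete(response, i))
                    for i in range(n)])
    return np.sqrt((n - 1) / n * ((loo - loo.mean(axis=0)) ** 2).sum(axis=0))

def summarize(estimates, jack_sds, truth, z=1.96):
    """Bias, SD, JackSD, MSE and 95% CP over R replications (rows)."""
    estimates, jack_sds = np.asarray(estimates), np.asarray(jack_sds)
    covered = np.abs(estimates - truth) <= z * jack_sds
    return {
        "bias": estimates.mean(axis=0) - truth,
        "sd": estimates.std(axis=0, ddof=1),
        "jack_sd": jack_sds.mean(axis=0),
        "mse": ((estimates - truth) ** 2).mean(axis=0),
        "cp": covered.mean(axis=0),
    }
```

In a simulation loop, each replication contributes one row of estimates and one row of jackknife standard deviations, and `summarize` is called once at the end.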
For Model 1, we first generate two independent values from , and let be the larger one and be the smaller one, where , . The error term and , . Then, we generate and as
The are set as and . are set as . In Tables 1 and 2, we consider interval data with the same error terms of . That is, . In Tables 3 and 4, we consider different error terms of . That is, and . The results show that the endpoints least squares estimator (M1) has a smaller standard deviation than the other methods for and in Tables 1, 2 and 4. SCM performs better than the other methods in Table 3. Note that SCM performs poorly when in Tables 2 and 4.
For Model 2, we first generate the single points , , and set
where . To construct the interval data, the ranges are generated from uniform distributions; denote the upper range of by and the lower range of by , , . We then build the interval-valued data as and , , . Thus, we obtain the interval data , . For the settings, are set as and , and are set as , . , , and are generated from uniform distributions: and from , and from , and and from , . Note that M2 (1) is the interval weighted LSE with the first weight function, W1, and ; M2 (2) is the method with the second weight function, W2, and ; M2 (3) is the method with the third weight function, W3, and ; M2 (4) is the method with the third weight function, W3, and ; M2 (5) is the method with the third weight function, W3, and . The simulation results are shown in Tables 5-8. The results show that the interval weighted least squares estimator with W3 performs better than the other methods.
Table 1. Estimations of under model 1 with .
Table 2. Estimations of under model 1 with .
Table 3. Estimations of under model 1 with .
Table 4. Estimations of under model 1 with .
Table 5. Estimations of under model 2 with , and .
Table 6. Estimations of under model 2 with , and .
Table 7. Estimations of under model 2 with , and .
Table 8. Estimations of under model 2 with , and .
5. Real Data Analysis
In this section, we apply our proposed methods to two interval-valued datasets, the mushroom data and the medical data, which correspond to Model 1 and Model 2, respectively. The first dataset is the mushroom data from the Fungi of California Species Index. The complete data can be downloaded from http://www.mykoweb.com/CAF/species_index.html. Three features are represented by three variables: Y = the width of the pileus cap, X1 = the length of the stipe, and X2 = the thickness of the stipe. The measurements in the dataset are interval-valued (in cm), and there are 311 observations. Because the interval lengths of these variables should depend on each other, the dataset belongs to Model 1. We analyze the dataset with Method 1 and Method 2, using the same settings as in the simulations, and present the results in Table 9. Table 9 reports the estimates of ( ), the jackknife standard deviation (JackSD) and the 95% confidence interval (95% CI). According to Table 9, the M1 approach has a smaller standard deviation for the estimates of and , while the SCM approach has a smaller standard deviation for the estimate of .
The second dataset is the medical data from Billard and Diday, which contains 10,000 classical observations. Xu classified the entire data into 42 categories by Agegroup × diabetes × race ( ). For this dataset, we consider three variables: Y = cholesterol (chol), X1 = age, and X2 = income. The medical dataset should belong to Model 2, because the interval lengths of the variables do not depend on each other. We again apply Method 1 and Method 2 with the same settings as in the simulations and present the results in Table 10. Table 10 reports the estimates of ( ), the jackknife standard deviation (JackSD) and the 95% confidence interval (95% CI). The interval weighted LSE (M2) with W3 has a smaller standard deviation than the other methods, which agrees with the simulation results. The age variable is significant, while the income variable is not. Furthermore, average cholesterol increases by about 0.59 for each additional year of age.
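One common way to turn classical observations into interval data is to group them by category and take the minimum and maximum of each variable within a group. The following Python sketch shows that aggregation. Whether the medical data were aggregated in exactly this way is an assumption, and the function name `to_intervals` and the field names in the usage note are hypothetical.

```python
from collections import defaultdict

def to_intervals(records, key_fields, value_fields):
    """Group classical records by category and return (min, max) per variable.

    records: list of dicts, one per classical observation.
    key_fields: names of the categorical fields that define a group.
    value_fields: names of the numeric fields to turn into intervals.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in key_fields)].append(rec)
    return {
        key: {v: (min(r[v] for r in rows), max(r[v] for r in rows))
              for v in value_fields}
        for key, rows in groups.items()
    }
```

For example, calling `to_intervals(records, ["agegroup", "diabetes", "race"], ["chol", "age", "income"])` on a list of record dictionaries with those (hypothetical) field names would return one interval per variable for each observed category combination.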
Table 9. Estimations of for mushroom data.
Table 10. Estimations of for medical data (Y: chol, X1: age, X2: income).
6. Conclusions
In the era of network technology, collected data are growing more complex and larger than before, which makes them difficult to analyze with standard statistical tools. Diday introduced a new data format called symbolic data, which can take many forms. In this paper, we focus on parameter estimation in linear regression for symbolic interval data and propose two approaches. For Model 1 data, which are considered by Billard and Diday, Carvalho et al., and Xu, we develop the endpoints least squares estimator for the regression coefficients. Data of this kind, however, imply that the interval lengths of the dependent variable and the independent variables are correlated. In some applications, the interval lengths of the two may not depend on each other. For that situation, we consider Model 2 data and suggest the interval weighted least squares estimation method. In addition, we compare our proposed methods with CM proposed by Billard and Diday, CRM proposed by Carvalho et al. and SCM proposed by Xu via simulations. In the simulation studies, the performance of the endpoints LSE is similar to that of the other methods for Model 1 data, and the interval weighted LSE with W3 performs better for Model 2 data. Finally, we analyze two real datasets for illustration, and the results agree with those of the simulation studies.