## Master's Theses

2015

Thesis

#### Degree Name

Master of Science

#### College

College of Arts and Sciences

#### Program

Engineering & Computer Science, MS

Roy Villafane

#### Abstract

Problem

Choosing the best solution often requires comparing situations. This thesis concerns the analysis of data (variables), an activity commonly referred to as data mining. In some situations it is not enough to compare variables with one another at a single moment. It can also be necessary to compare how variables behave across different periods of time in order to select the best arrangement for a given situation.

Method

To find correlations among variables, traffic intersections were simulated. Because a correlation coefficient matrix is normalized, matrices computed for different time periods can be compared directly, so this type of matrix was used to compare intersections across time periods and find the most interesting information. By comparing each entry of the first matrix with the corresponding entry of the second matrix, one can find the intersections that are busier and that differ most from the others. In addition, two formulas were developed to help identify the most interesting correlations; for one of them I modified the harmonic mean formula to obtain a balance between two important values.
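The comparison described above can be sketched in Python as follows. This is only a minimal illustration and not the implementation used in the thesis: the function names, the sample vehicle counts, and the specific definitions of "average" (mean of the absolute correlations) and "difference" (absolute difference of the signed correlations) are all assumptions made for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: treat a constant series as uncorrelated
    return cov / (sx * sy)

def correlation_matrix(series):
    """Correlation matrix for a list of variables (one sequence per variable)."""
    k = len(series)
    return [[pearson(series[i], series[j]) for j in range(k)] for i in range(k)]

def compare(m1, m2):
    """Compare matching entries of two correlation matrices.

    Returns (i, j, average, difference) for every pair i < j.
    Both definitions below are assumptions for this sketch:
    average lies in [0, 1], difference lies in [0, 2].
    """
    k = len(m1)
    out = []
    for i in range(k):
        for j in range(i + 1, k):
            a, b = m1[i][j], m2[i][j]
            avg = (abs(a) + abs(b)) / 2
            diff = abs(a - b)
            out.append((i, j, avg, diff))
    return out

# Hypothetical vehicle counts at three intersections in two time periods.
morning = [[10, 20, 30, 40], [12, 22, 29, 41], [40, 30, 20, 10]]
evening = [[10, 20, 30, 40], [41, 29, 22, 12], [38, 31, 19, 11]]

pairs = compare(correlation_matrix(morning), correlation_matrix(evening))
```

Because every entry of a correlation matrix already lies between -1 and 1, entries from the two periods can be compared without any further rescaling, which is the property the method relies on.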

Results

Using these two new formulas, the most interesting information about variables can be found, such as which are the most or least popular (average value) and which are very different from or very similar to each other (difference value) at different times. “Rank 1” is the value of the balance between the average and the difference, and ranges from 0 to 0.6. A value of 0 means that the intersections have very low averages and differences, and 0.6 means the opposite. “Rank 2” is based on assigning weights to the average and the difference. Once the formula is applied, values lie between 0 and 1, where 0 means that the average or the difference is low, depending on which was assigned more weight, and 1 means the opposite. The weights depend on what a specific situation requires.
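The abstract does not state either formula, so the following Python sketch only illustrates the two ideas under stated assumptions. `harmonic_balance` uses the standard (unmodified) harmonic mean as a stand-in for the balance behind “Rank 1”; it is not the modified formula from the thesis, and its range is 0 to 1 rather than 0 to 0.6. `weighted_score` assumes that “Rank 2” behaves like a convex combination of the average and the difference, with both inputs already scaled to the range 0 to 1. All names and sample values are hypothetical.

```python
def harmonic_balance(avg, diff):
    """Standard harmonic mean of two non-negative values.

    Stand-in only: the thesis uses a modified harmonic mean whose
    exact form is not given in the abstract.
    """
    if avg + diff == 0:
        return 0.0
    return 2 * avg * diff / (avg + diff)

def weighted_score(avg, diff, w_avg=0.5):
    """Convex combination of average and difference (assumed form).

    w_avg is the weight on the average; the difference gets 1 - w_avg.
    With avg and diff in [0, 1], the result is also in [0, 1].
    """
    if not 0.0 <= w_avg <= 1.0:
        raise ValueError("w_avg must be between 0 and 1")
    return w_avg * avg + (1 - w_avg) * diff

# Hypothetical (average, difference) values for three intersection pairs.
candidates = {"A-B": (0.9, 0.1), "A-C": (0.5, 0.5), "B-C": (0.2, 0.8)}

# Rank pairs with most of the weight placed on the average.
ranked = sorted(
    candidates,
    key=lambda name: weighted_score(*candidates[name], w_avg=0.8),
    reverse=True,
)
```

With the weight shifted toward the average, pairs with high average correlation rise to the top; shifting the weight toward the difference would instead favor pairs whose correlation changed most between periods. The harmonic mean is low whenever either input is low, which is why it suits a balance between the two.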

Conclusions

Because a correlation coefficient matrix is already normalized, two such matrices computed from any type of data over different time periods can be compared to reveal how different and how popular the variables are. The most interesting information can then be identified by calculating the average or the difference between variables. As an example, these formulas were used to compare traffic intersections; the result was a ranking from the most popular intersections to the least important ones, which confirmed previously observed traffic patterns.

#### Subject Area

Data mining; Correlation (Statistics); Matrices; Variables (Mathematics)