Toward the development of a conventional univariate time series based web error forecasting framework

Madhup Mamodia and Kshitiz Sharma
Abstract— The importance of web reliability is growing with time since there is an exponential increase in the usage of various social networks", e-commerce systems and other online applications. There are various web errors along with workload characteristics", which influence the reliability of a web system. Therefore", web administrators must have the knowledge of these factors to enhance the reliability of any existing web system. In addition", the issues like the forecast and estimation of the combined occurrences of the source content failures and the relating time between failures of a web framework turn out to be less featured by the reliability research group. Subsequently", in this work", the authors have displayed an all-around characterized strategy (a determining structure) for the web administrators to investigate and upgrade the dependability of the sites under their watch. Initially", HTTP access and error logs are used to extract all the important data which is related to workloads", web errors and corresponding time between failures. After that", the next step is to select the number of independent variables by performing principal component analysis", correlation analysis", and the change point analysis. Next step is to develop different time series based forecasting models to predict the occurrences of the SCFs and the corresponding time between failures. In the present system", the multivariate models also include several uncorrelated workloads", the exogenous and the endogenous noises for forecasting the web errors and the corresponding time between failures. The suggested method is validated to the usage of collected statistics from websites belonging to two highly recognized Indian Educational Institutions.
Index Terms— Web Reliability, Web Server, HTTP Logs, Forecasting.
I. INTRODUCTION
With the expeditious advancement in web technology, a profound transformation has taken place in e-commerce and communication. As a consequence, many commercial, non-commercial and other users have become more dependent on the 'WWW', making it a preeminent part of business, scientific research, entertainment and other human activities. Today, any failure or malfunction in a web system not only disrupts business and other activities but can also halt the entire communication system. The web workload characteristics, i.e., the number of requests, bytes transferred, number of hits and number of users of a website, have increased exponentially in recent years, making web browsers, databases, etc., more error-prone. Thus, interest in reliable web software has increased among web administrators.
Web reliability is defined as the probability of failure-free web operations and activities, whereas a web failure is defined as the inability of the web browser to deliver the right information, document or other computational result. To estimate the reliability of modern and complex web systems, various metrics are available in the literature, such as Mean Time to Failure (MTTF), Mean Time to Repair (MTTR), Mean Time Between Failures (MTBF), failure intensity, and the number of faults present in the system [3,4]. The prediction and estimation of errors is carried out with respect to these metrics.
Mean Time to Failure (MTTF) is the length of time a device or other product is expected to last in operation [3,4]. Mean Time to Repair (MTTR) is a basic measure of the maintainability of repairable items; it represents the average time required to repair a failed component or device [3,4]. Lastly, Mean Time Between Failures (MTBF) is the predicted elapsed time between inherent failures of a system during normal operation, and can be calculated as the arithmetic mean of the times between failures.
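As a quick illustration of the last definition, the following minimal Python sketch computes the MTBF as the arithmetic mean of the gaps between consecutive failures; the timestamps are hypothetical, not data from this study:

from datetime import datetime

failure_times = [                                 # hypothetical failure log
    datetime(2018, 1, 3, 9, 15),
    datetime(2018, 1, 7, 22, 40),
    datetime(2018, 1, 12, 4, 5),
]

gaps = [(b - a).total_seconds() / 3600.0          # inter-failure gaps in hours
        for a, b in zip(failure_times, failure_times[1:])]

mtbf_hours = sum(gaps) / len(gaps)
print("MTBF = %.1f hours" % mtbf_hours)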
In this work", the authors have mainly used the static Nelson model to calculate the reliability. The calculate the reliability", the necessary information will be extracted from the server HTTP logs (access and error logs). Whenever a user generates a request to access a web page", a session is started. When the client web browser receives the specific response for the above request from the server that has hosted the web site", it simply parses the requested file for generating responses for all the embedded files. Therefore", several sessions may start when a client requests for a single web page which makes a hit entry in the HTTP access log. At the point when immense number of clients with various IP addresses (utilizing a few PCs associated with the intra/internet)", from various corners of the world are influencing an extensive number of requests for a specific web to page of a site", we may locate and large number of hits and also huge number of sessions at the same time. Among the workload characteristics", we could not establish any generalized straightforward correlation. Therefore", we cannot say that only a particular web workload characteristic is responsible for the errors in web systems. Hence", the historical occurrences of workloads along with different SCF errors are used as input variables in VAR(p) and VARMA(p",q) models to predict the future occurrences of SCF errors [8]. However", in the current study", we have restricted ourselves only in the univariate forecasting models.
Web errors are generally", classified as the Source Content Failures (SCF) and External Failures (EF). these two", SCFs influence most of the reliability of a web system. In an earlier study", it is found that majority of 404 (Does not exist) error are caused by internal bad links. While a small percentage are caused by external factors such as", mistyping the URL; bots from the search engines like Google", Yahoo", etc.; disabled external links; old bookmarks; etc.", which sometimes are almost out of reach of the web administrators and also have no influence on the reliability of the web system.
In the present work", we have adopted the error forecasting framework developed by Roy and Pham [1] and have conducted the experiments on a newly developed website. For this reason", initially", we use HTTP access and error logs as inputs (see Fig. 1). After that", all the important information related to workloads", several SCF errors and corresponding TBFs is extracted. The identification of the actual number of independent variables that have impacts on the SCF errors and the corresponding TBFs by conducting the correlation analysis", PCA and the change point analysis of the data sets using proposed frameworks. Lastly", the suitable univariate time series models for foretelling the cumulative occurrences of the SCF errors and the corresponding TBFs have been developed.
The current methods for predicting the cumulative SCF errors and the corresponding TBFs can be applied to other websites with similar structural characteristics, such as navigation, architecture and design.
II. GOAL OF THE PRESENT WORK
The goal of the present work is to forecast the occurrences of the various SCF errors and the corresponding TBFs using different time series models.
The proposed framework is divided into two modules. The first module handles data extraction and storage, for which an Apache server with MySQL database connectivity is used; it extracts the occurrences of the various errors and stores them in the database. The second module selects a suitable time series model for forecasting after various statistical tests, such as stationarity testing, correlation analysis, principal component analysis and change point analysis. Finally, the forecasting precision of the chosen models is judged using several measures such as RMSE, MAE, MAPE, MdAPE, GMRAE, etc. [1,9]. We have used the univariate models (for the simplicity of the experiment), although Roy and Pham [1] established that multivariate models work better than univariate models for data sets with multiple PCs.
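For reference, the following Python sketch computes the accuracy measures named above on illustrative data; the naive one-step (random-walk) forecast is assumed here as the benchmark for GMRAE:

import numpy as np

y = np.array([12.0, 15.0, 14.0, 18.0, 21.0])    # observed SCF counts (illustrative)
f = np.array([11.0, 14.5, 15.0, 17.0, 22.0])    # model forecasts (illustrative)

err = y - f
rmse = np.sqrt(np.mean(err ** 2))               # root mean squared error
mae = np.mean(np.abs(err))                      # mean absolute error
mape = np.mean(np.abs(err / y)) * 100.0         # mean absolute percentage error
mdape = np.median(np.abs(err / y)) * 100.0      # median absolute percentage error

# GMRAE: geometric mean of |e_t| / |e*_t|, where e* is the error of the
# naive forecast f*_t = y_{t-1} (assumed benchmark).
naive_err = y[1:] - y[:-1]
gmrae = np.exp(np.mean(np.log(np.abs(err[1:]) / np.abs(naive_err))))

print(rmse, mae, mape, mdape, gmrae)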
Table 1. Different web error codes with specific reasons, as described in the work of Roy and Pham [1].

Error codes        Reasons
1xx (100-199)      Informational (100, 101, etc.)
2xx (200-299)      Successful (200, 201, 203, 204, 205, 206, etc.)
3xx (300-399)      Redirectional (300, 301, 302, 303, 304, etc.)
4xx (400-499)      Client error (400, 401, 402, 403, 404, etc.)
5xx (500-599)      Server error (500, 501, 502, 503, 504, etc.)
III. OVERVIEW OF ERROR CODES
The most frequently occurring web error is 404, and the possible reasons behind an EF 404 include the client mistyping the URL, robots from the various web crawlers in use (like Googlebot for Google), external links (links from other sites, which are out of reach of the website admin), old bookmarks, and so on. No studies in the existing literature report a generalized rule for distinguishing SCF errors from EFs. Consequently, in the present work, the authors have analysed the majority of the error codes encountered by the sites under investigation to decide whether each error should be categorized as SCF or EF. For instance, the authors found that the error codes 401, 403, 404, 500 and 501 could occur due to SCF, host failure or EF, and can be considered potential threats to a site's quality and reliability.
Fig. 1. This diagram has been taken from the work of Roy and Pham, 2017 [1]. In the current work, we have implemented this detailed strategy on our developed website and present detailed experimental results.
A brief explanation of the error codes is given as follows:
1. 401 (Unauthorized) – The server could not accept the client's authentication credentials (which were either not provided or failed the authorization tests) [2,5,6,7]. The 401 response code indicates that authorization has been refused by the server because of improper/incorrect authentication credentials. However, only those 401 errors occurring due to SCF are included in the reliability analysis.
2. 403 (Forbidden) – This is mainly generated when the server understands the client request but refuses to respond (similar to 401). In a similar manner, only SCF 403 errors are included in the reliability analysis of the specific site [2,5,6,7].
3. 404 (Page Not Found) – This error code is generated when the user attempts to follow a broken hyperlink present in a web page. Currently, 404 can be considered the most dominant error code [2,5,6,7]. This error code can also be classified into two classes, viz., SCF and EF; of these, only the 404s that occurred due to source content failure are considered in the reliability analysis of the web software (a heuristic sketch for separating the two classes follows this list).
4. 500 (Internal error) – This error must be considered in the reliability analysis of web software, as it is mainly a server error [2,5,6,7].
5. 501 (Not implemented) – In this case, the request of the client is not understood by the server. It must be considered in the reliability analysis of web software, as it is mainly a server error [2,5,6,7].
6. 503 (Service unavailable) – In this case, the server is simply overloaded with a massive number of client requests and cannot process any more; this increases the response time beyond a pre-assigned limit and results in a 503 error. However, the 503 response code is not always generated by an overloaded server; sometimes it can also be spawned by the server refusing connections due to technical difficulties. Hence, this error code must be taken into consideration in the reliability analysis of the specific website [2,5,6,7].
7. 504 (Gateway timeout) – Here, the server mainly acts as a gateway or a proxy server, and this error is generated by irregular IP communication between the back-end computers, the web server and the clients' computers. Hence, this error is beyond the reach of the specific webmaster and must not be considered in the reliability analysis of the site [2,5].
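The following Python sketch shows one plausible heuristic, not a rule prescribed by [1], for separating SCF 404s from EF 404s: a 404 whose Referer points back to the same host is treated as an internal bad link (SCF), while crawler traffic, empty referers and foreign referers are treated as EF. The host name and bot markers below are assumptions for illustration:

SITE_HOST = "www.example-institute.edu"          # hypothetical host name
BOT_MARKERS = ("Googlebot", "bingbot", "Slurp")  # common crawler signatures

def classify_404(referer, user_agent):
    if any(marker in user_agent for marker in BOT_MARKERS):
        return "EF"                   # search engine robot
    if referer and SITE_HOST in referer:
        return "SCF"                  # broken internal link
    return "EF"                       # mistyped URL, old bookmark, etc.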
IV. FIRST MODULE
In the first module, we need to extract the necessary information from the server HTTP logs (access and error logs): the total number of hits {X1t}, bytes transferred {X2t}, number of sessions {X3t}, number of users {X4t}, and the cumulative manifestation of the several web errors {Yt}. This extracted information is then stored in the database.
The access log contains information about every single request made to the web server, while the error log contains a record of the critical errors that occurred during the server's operation.
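A minimal Python sketch, assuming the common Apache "combined" log format and a hypothetical file name, that extracts the fields the first module needs (client IP, timestamp, status code, bytes transferred):

import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)

with open("access.log") as log:                    # hypothetical file name
    for line in log:
        m = LOG_PATTERN.match(line)
        if m is None:
            continue                               # skip malformed entries
        status = int(m.group("status"))
        size = 0 if m.group("bytes") == "-" else int(m.group("bytes"))
        # hits {X1t}, bytes transferred {X2t} and error occurrences {Yt}
        # can be accumulated per time unit from these fields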
A session can be defined as the chain of requests from the same user during a single visit to a particular website. We have to choose a threshold value for a session, and within that threshold we count the number of times a user requested the website. Servers automatically terminate a connection after some time if a user is idle, so the threshold value should not be greater than the time after which the server terminates the session. We have used 30 minutes as the threshold value [1].
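The 30-minute rule can be implemented as in the following Python sketch: requests from the same IP address separated by more than the threshold open a new session. The input is assumed to be a list of (ip, timestamp) pairs already parsed from the access log:

from collections import defaultdict
from datetime import timedelta

THRESHOLD = timedelta(minutes=30)   # session threshold from [1]

def count_sessions(requests):
    by_ip = defaultdict(list)
    for ip, ts in requests:
        by_ip[ip].append(ts)

    sessions = 0
    for timestamps in by_ip.values():
        timestamps.sort()
        sessions += 1               # the first request always opens a session
        for prev, curr in zip(timestamps, timestamps[1:]):
            if curr - prev > THRESHOLD:
                sessions += 1       # idle gap above threshold: new session
    return sessions

Note that identifying users by IP address alone is itself an approximation, since several users may share one address behind a proxy.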
V. SECOND MODULE
In the first part, we deal with different statistical analyses, e.g., stationarity checking, correlation checking, principal component analysis, and change point analysis of the data.
After the statistical analysis", we are going to choose a time series model", univariate or multivariate", depending upon the results of statistical analysis", which will foretell the future occurrences of different web errors. For univariate time series models", we can use AR(p) Model", MA(q) Model", ARMA(p",q) Model. For multivariate time series models", we can use VAR(p) Model", VARMA(p",q) Model. But", in this implementation", we have restricted ourselves only to the univariate models [1",9]. The detailed theoretical discussions of various univariate forecasting models can be found in the work of Roy and Pham [1].
VI. RESULTS
As discussed earlier, we have tested the strategy described in [1] (only for the univariate forecasting time series models, i.e., AR(p) and ARMA(p,q)) on our newly developed testing website, inputting its HTTP access and error logs; the results are given in Table 2. The forecasting accuracies of the models are satisfactory, and hence we have partially validated the forecasting strategy developed by Roy and Pham [1].
Table 2. Detailed testing results using the Roy and Pham model [1].
VII. CONCLUSION
The current paper has successfully shown the applicability of the Roy and Pham model [1] and should encourage readers to apply it when predicting possible web errors. However, we would like to implement the full model [1] in the future for a better validation, and to see whether there is any significant flaw in the model.
ACKNOWLEDGEMENT
It is a great honor to express our profound gratitude to the Computer Science & Engineering Department, SRM Institute of Science and Technology, Kattankulathur, for its encouragement, support and the good environment in which to gain valuable experience.
VIII. REFERENCES
1. Roy", Arunava", and Hoang Pham. "Toward the development of a conventional time series based web error forecasting framework." Empirical Software Engineering (2017): 1-75.
2. Chatterjee", Subhashish", and Arunava Roy. "Web software fault prediction under fuzzy environment using MODULO-M multivariate overlapping fuzzy clustering algorithm and newly proposed revised prediction algorithm." Applied Soft Computing 22 (2014): 372-396.
3. Pham H (1995) Software Reliability and testing. Wiley-IEEE Computer Society Press", ISBN:978-0-8186-6852-4.
4. Pham H (2006) System Software Reliability. Springer-Verlag", London.
5. Popstojanova KS", Singh AD", Mazimdar S", Li F (2006) Empirical Characterization of Session-Based Workload and Reliability for Web Servers. Empire Software Eng. 11:71–117.
6. Chatterjee S", Roy A (2015) Novel algorithms for web software fault prediction. Quality and Reliability Engineering International 31(8):1517–1535.
7. Arlitt MF", Williamson CL (1997) Internet Web Servers: Workload Characterization and Performance
Implications. IEEE/ACM Trans. Networking 5:631–645.
8. Box GPE", Jenkins GM (1976) Time series analysis", forecasting", and control. Holden-Day", San Francisco.
9. Jolliffee IT (1986) Principal component analysis. Springer", New York.
