Log2: A Cost-Aware Logging Mechanism for Performance Diagnosis

Rui Ding (Microsoft Research), Hucheng Zhou (Microsoft Research), Jian-Guang Lou (Microsoft Research), Hongyu Zhang (Microsoft Research), Qingwei Lin (Microsoft Research), Qiang Fu (Microsoft), Dongmei Zhang (Microsoft Research), Tao Xie (University of Illinois at Urbana-Champaign)

Abstract

Logging has been a common practice for monitoring and diagnosing performance issues. However, logging comes at a cost, especially for large-scale online service systems. First, the overhead incurred by intensive logging is non-negligible. Second, it is costly to diagnose a performance issue if there is a tremendous amount of redundant logs. Therefore, we believe that it is important to limit the overhead incurred by logging, without sacrificing the logging effectiveness. In this paper we propose Log2, a cost-aware logging mechanism. Given a "budget" (defined as the maximum volume of logs allowed to be output in a time interval), Log2 makes the "whether to log" decision through a two-phase filtering mechanism. In the first phase, a large number of irrelevant logs are discarded efficiently. In the second phase, useful logs are cached and output while complying with the logging budget. In this way, Log2 keeps the useful logs and discards the less useful ones. We have implemented Log2 and evaluated it on an open source system as well as a real-world online service system from Microsoft. The experimental results show that Log2 can control logging overhead while preserving logging effectiveness.

1 Introduction

Logging has been commonly adopted for monitoring and diagnosing performance issues of online service systems, such as web search engines and online banking systems. Typically, performance logs record the end-to-end execution time of a service request as well as the execution time of a component of the service system. Logging is usually achieved by instrumenting source code with logging statements, and the resultant logs are stored on disk. In practice, performance logs constitute a large proportion of total logs. For example, our study of a Microsoft online service system (described in Section 6) shows that around 20%-40% of the total logs are performance logs.

Although logging is effective for performance diagnosis, it comes at a cost. Logging introduces overhead, such as disk I/O bandwidth as well as CPU and memory consumption. Intensive logging could further interfere with the service's normal execution. For example, web search engines are sensitive to performance interference from the logging system, which tends to generate a huge volume of logs. Empirical results show that if logging is fully conducted, the average execution time of requests in a search engine could increase by 16.3% and the average throughput could decrease by 1.48%. Therefore, it is critical to reduce the performance interference by reducing the logging overhead. Our survey of Microsoft engineers (see Section 2 for more details) confirms this finding: about 80% of the survey participants confirmed that they had experienced non-negligible performance overhead caused by logging. Furthermore, intensive logging could introduce a large amount of less "useful" logs (i.e., logs that are not useful for helping diagnose the performance issue under investigation). A study on one large-scale online service system in Microsoft indicates that a high proportion of logs are useless for diagnostic purposes. Our survey of Microsoft engineers also confirms this observation.
Existing techniques for reducing logging overhead include manually removing some logging statements, changing the logging level (e.g., from "Verbose" to "Medium"), and outputting logs in a sampling fashion. These techniques aim to reduce the number of logs to be output. However, they are insufficient for several reasons. First, they cannot guarantee to preserve logging effectiveness (i.e., preserving the useful logs for diagnosis purposes). For instance, the sampling technique could miss important events due to the randomness of sampling. Second, existing logging systems have no control mechanism on "whether to log" (whether or not an executed logging statement should be output). Therefore, once developers decide "where to log", the logging system must strictly output the logs after the execution of the placed logging statements. The resultant logs could still contain many useless ones. Finally, most of these existing techniques do not consider the dynamic properties of a running system. For a running system, changes in workload and throughput can influence the load of its logging system. Simply using a single logging level or sampling rate may not be able to control the logging overhead during workload spikes. Therefore, it is desirable to have a new, overhead-constrained logging system for performance diagnosis.

In this paper, we propose a cost-aware logging mechanism called Log2. Using Log2, developers predefine a resource budget allowed for logging. At runtime, the logging system decides "whether to log" such that the logging overhead is constrained under the budget while the logging effectiveness is maximized. The budget for logging overhead is defined as logging bandwidth, which is the maximum volume of logs allowed to be output in a time interval (such as 1KB per second). There are two reasons for choosing logging bandwidth as the budget. First, according to our survey, I/O bandwidth is the overhead of most concern in practice. Second, in general, most logging overhead, such as disk storage, network I/O, and CPU, is directly or indirectly affected by I/O bandwidth. The logging effectiveness is measured as the percentage of performance issues that can be captured by the resultant logs.

There are three challenges in realizing such a cost-aware logging mechanism:

• It should be able to control logging overhead while preserving logging effectiveness.
• It should incur low additional overhead such as CPU and memory consumption.
• It should provide flexibility for developers to configure it for different service scenarios, and should be able to adapt to environmental changes dynamically.

To address the above challenges, Log2 introduces a two-phase filtering mechanism. In the first phase, a large number of irrelevant logs are discarded efficiently. In the second phase, useful logs are cached and output while complying with the logging budget. The two-phase mechanism is updated dynamically to address all the challenges.

We evaluate Log2 on BlogEngine, which is a popular open source blogging platform. Furthermore, we perform an evaluation of Log2 using real logs of ServiceX, which is a large-scale online service system from Microsoft. The evaluation results confirm that Log2 is effective and practical in real-world scenarios.

This paper makes the following main contributions:

• We propose a novel cost-aware logging mechanism, Log2, which helps achieve a balance between logging overhead and effectiveness. Such a mechanism incurs low additional overhead and is flexible.
• We design and implement Log2. We also evaluate Log2 on both an open source system and a large-scale online service system from Microsoft.

The rest of the paper is organized as follows. Section 2 describes a survey of logging practice in Microsoft, which motivates the design goals of Log2 described in Section 3. Section 4 describes the design and implementation of Log2. Section 5 provides the detailed evaluation of Log2 on an open source system. Section 6 describes a case study on the Microsoft ServiceX system. We discuss the limitations and future work in Section 7. Section 8 introduces the related work, and Section 9 concludes the paper.

2 A Survey of Logging Practice in Microsoft

To better understand the current logging practice, we conducted a comprehensive survey among hundreds of engineers from five product teams in Microsoft. We received responses from 84 engineers. According to the survey, 81 out of 84 respondents are "expert" or "knowledgeable" about logging systems. The survey aims to understand the participants' experience with logging systems and logging overhead. The details of the survey questions are available online [4].

In general, the logging systems used by Microsoft engineers fall into three categories: (1) internally developed systems that directly output the executed logging statements via a language-intrinsic component or a wrapped API; (2) ETW logging [2], which writes the buffered logs in a batch fashion; and (3) sampling-based logging tools that are mainly designed for large-scale online services sensitive to logging overhead.

2.1 Logging Overhead

According to our survey, 80% of the participants agreed that logging overhead is a non-negligible issue. The three types of overhead of most concern are storage (60%), I/O bandwidth (58%), and CPU usage (56%). Among the participants, 59% have suffered from the consequences incurred by logging overhead. Table 1 shows some of the experiences reported by the surveyed engineers.

Table 1: Some of the experiences of the logging overhead

  Disk I/O bandwidth: Overuse of I/O caused a perception of interference with core functionality. The bandwidth requirement when enabling all logs is 8MB/s, which however should be ≤ 200KB/s.
  Storage: The OS slows down; other processes that need disk space may crash, and even the logging system itself could crash. Storage is a critical component that may cause system crashes, but it is often overlooked.
  CPU: The service is slowed down significantly once the CPU usage of logging increases to double digits. CPU usage of logging is very sensitive for our super-efficient system; 3%-5% is the upper bound for the CPU usage of logging.
  Memory: An unexpected increase in the memory usage of the logging system was the root cause of one service incident. A memory leak in the logging system caused days of debugging effort.

The top three most widely used approaches to control the logging overhead include adjusting the logging level (93%), manually removing unnecessary logs (64%), and archiving log files periodically (43%). However, about 65% of the participants replied that they are not satisfied with the existing approaches. For instance, removing logs by changing source code requires extra efforts on re-compiling, testing, and re-deployment. Archiving log files is often expensive because a large volume of data needs to be transferred via the network.
All these existing approaches are considered to be after-thoughts, applied only when logging overhead starts to compromise the system quality.

About 83% of the survey participants also agreed that many log messages are redundant for diagnosing performance issues, implying the feasibility of reducing logging overhead while preserving sufficient logging effectiveness. In addition, about 43% of all participants agreed that logging overhead needs to be controlled, and they have considered a resource budget for logging in their work.

2.2 Other Limitations of Existing Logging Systems

A number of participants also shared with us additional limitations of the existing logging systems and expressed the need for a cost-aware logging mechanism. These comments and suggestions strongly motivated the design of Log2:

Lack of cost-awareness during log instrumentation. One participant complained about the lack of cost-awareness during log instrumentation. He noticed that some developers often had little idea about the resulting logging overhead when they planned to instrument source code with new logging statements. A typical bad logging practice is to insert logging statements in tight loops (i.e., loops that iterate intensively), which could cause high overhead, especially in I/O throughput and storage. He suggested a logging system that controls the logging overhead transparently, so that developers can perform log instrumentation without worrying about the overhead incurred.

Burden in log analysis. One participant commented that too many logs make it challenging to analyze logs via manual inspection. It would be helpful if a logging system could collect all possible logs but not flush all of them. He also suggested a potential solution: the logging system should flush the logs only when some predefined rules are violated.

In summary, the survey results motivate a new overhead-constrained logging mechanism, as we propose in this paper.

3 The Design Goals of Log2

3.1 Cost-Aware Logging Mechanism

In this paper, we propose Log2, a cost-aware logging mechanism that constrains logging overhead. Using this mechanism, developers can perform logging by instrumenting their programs, and predefine a resource budget for logging. With the given budget, the logging mechanism decides "whether to log" for each logging request at runtime, makes sure that the logging overhead complies with the predefined budget, and maximizes the logging effectiveness at the same time. In addition, the logging mechanism supports on-the-fly budget setting. Therefore, the logging mechanism not only provides developers with the flexibility to strike a balance between logging overhead and effectiveness, but also the flexibility to configure different logging budgets for different service scenarios, or even to configure the logging budget dynamically. Furthermore, such a cost-aware logging mechanism enables better planning of maintenance resources [5], as the logging budget can be determined in advance.

3.2 Design Goals

Log2 is designed to realize such a cost-aware logging mechanism. The budget for logging overhead in Log2 is defined as logging bandwidth, which is the maximum volume of logs allowed to be output in a time interval. Logging bandwidth is the logging overhead of most concern according to engineers' feedback.
It is also the most representative logging overhead, because other types of logging overhead, such as disk storage, network I/O, and CPU, are often directly or indirectly affected by the logging bandwidth. We have identified four design goals for Log2, which are listed below:

Cost-effectiveness. Log2 should be able to achieve an optimal balance between logging overhead and effectiveness. The logging overhead, defined in terms of logging bandwidth, should be constrained under the budget. Although the logging budget is under constraint, logging effectiveness cannot be compromised; i.e., with respect to performance diagnosis, the number of performance issues detected by the reduced number of logs should be similar to the number of issues detected by the total number of logs. In Log2, a ranking score named utility score is defined to measure how much utility each logging request contributes to performance diagnosis. Log2 then selects the top-ranked logging requests and outputs them. Other logging requests are filtered away. More details are described in Section 4.3.

Low additional overhead. Log2 should incur low additional overhead. The additional overhead brought by the runtime decision on "whether to log" (i.e., CPU usage and memory consumption) should be negligible. The design choices of Log2 for minimizing CPU usage and memory consumption are described in detail in Section 4.4.

Scalable. Log2 should be scalable to the number of logging requests. It is very common that thousands of requests are processed per second, and considering that many logging statements are executed when serving one single request, the scale of logging requests per second is large. A traditional logging system, which makes centralized decisions, suffers, since such centralized decisions can delay the logging time as well as increase the corresponding memory buffer usage. In contrast, Log2 includes a two-phase filtering design to avoid the potential bottleneck. The details are described in Section 4.

Flexible. Log2 should provide developers with the flexibility to configure the system. First, Log2 provides several types of predefined utility scores, which are designed for the most common diagnostic scenarios (to be described in Section 4.3.1). It also allows developers to configure a user-defined function for computing utility scores. Such flexibility enables Log2 to tackle various types of performance issues. Second, the budget can be configured on-the-fly. Such on-the-fly configuration enables developers to select a proper logging bandwidth according to the different resource plans in different scenarios. Since there is no one-size-fits-all configuration for all kinds of services, such flexibility is crucially important for wide adoption in different scenarios. More details are described in Section 4.3.2 and Section 4.4.2.

4 Design and Implementation of Log2

This section illustrates the detailed design and implementation of Log2. We first discuss the high-level workflow of Log2, and then illustrate its two core components, namely the local filter and the global filter. These core components are essential for achieving the goals of Log2.

4.1 Logging Requests

For performance diagnosis, developers can specify an area of code that should be monitored and logged. We call such an area of code a Monitored Code Region (MCR). Examples of typical MCRs include:

• Expensive system-level APIs, such as operations on I/O, databases, networking, etc.
• Loop blocks. Previous work found that a significant portion of real-world performance issues are caused by inefficient loops.
• Function calls across application-level component boundaries, such as RPC or the connection between a GUI and backend services.

Performance logs should record two timestamps, at the beginning and the end of an MCR, which are sufficient to compute the execution time of the MCR. Log2 provides two logging APIs, Begin and End, to denote the beginning and end of an MCR, respectively. The APIs compute the execution time of an MCR and also record the unique ID of the MCR. Figure 1 depicts the logging API usage in Log2, where the execution time of DoSomething is recorded.

  Log2.Begin(string McrName, ...); // begin of the MCR
  DoSomething();
  Log2.End(string McrName, ...);   // end of the MCR

Figure 1: Logging API in Log2.

A pair of Begin and End logs forms a logging request, which is further processed by Log2 to decide whether it should be filtered or output.
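For concreteness, here is a hypothetical usage sketch that instruments a loop block, one of the typical MCR types listed above, with the Begin/End API of Figure 1. The class, method, and MCR names are illustrative and not part of Log2.

  using System.Collections.Generic;

  // Illustrative instrumentation of a loop block as a single MCR.
  // QueueWorker, ProcessItem, and the MCR name are hypothetical;
  // Log2.Begin/End are the APIs shown in Figure 1.
  public class QueueWorker
  {
      public void ProcessQueue(IEnumerable<string> items)
      {
          Log2.Begin("QueueWorker.ProcessQueue.Loop"); // records the start timestamp
          foreach (var item in items)
          {
              ProcessItem(item); // application work inside the monitored region
          }
          Log2.End("QueueWorker.ProcessQueue.Loop");   // records the end timestamp;
                                                       // the pair forms one logging request
      }

      private void ProcessItem(string item)
      {
          // application-specific processing (omitted)
      }
  }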
4.2 Overall Workflow

The workflow of Log2 is depicted in Figure 2. Two filtering phases, the local filter and the global filter, are adopted to decide whether or not the incoming logging requests should be logged ("whether to log"). Such a two-phase filtering mechanism is used to avoid the potential bottleneck of a single centralized filter when a huge number of logging requests come in simultaneously. The local filters are responsible for discarding the trivial logging requests, i.e., logging requests that have low utility scores. The global filter is responsible for flushing the top-ranked logging requests to disk while complying with the logging budget.

[Figure 2: The workflow of Log2. Logging requests pass through per-thread local filters to the global filter, which flushes selected logs to disk and feeds the adjusted threshold back to the local filters.]

Each thread of logging requests has a local filter. Only the logging requests with utility scores (which are calculated dynamically) higher than a global threshold can pass through the local filter to a memory buffer in the global filter. Other logging requests are discarded. The global threshold is adjusted dynamically to adapt to environment dynamics while optimizing the effectiveness and efficiency of Log2. Usually, a significantly high portion of logging requests is discarded in this first phase. In the global filter, the final decision on log output is made periodically to make sure that the budget constraint is met. The logging requests from all local filters during the last time window are cached in memory. When a periodic event is triggered, the cached logging requests are sorted according to their utility scores. Only the top-ranked requests, with a total volume equal to the logging budget, are flushed to disk. Meanwhile, the global threshold for utility scores is updated by the global filter by considering the volume of logging requests in recent time intervals. Lastly, the global filter feeds the new threshold back to each local filter.

Details about each component are described in the following subsections.
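As a structural sketch of this workflow (not the actual implementation), the two phases can be expressed as follows. It assumes a single shared list in place of the swap buffer of Section 4.4.1, a budget counted in number of logs rather than bytes, and illustrative type and member names.

  using System.Collections.Generic;
  using System.Linq;

  // Simplified logging request: a Begin/End pair reduced to its essentials.
  public class LoggingRequest
  {
      public string McrName;
      public double UtilityScore;
  }

  public class TwoPhaseFilter
  {
      private readonly int budget;                 // max # of logs flushed per interval
      private readonly List<LoggingRequest> buffer = new List<LoggingRequest>();
      private double globalThreshold;              // fed back after each interval

      public TwoPhaseFilter(int budgetPerInterval) { budget = budgetPerInterval; }

      // Phase 1 (local filter): cheap per-request decision on the service thread.
      public void OnLoggingRequest(LoggingRequest r)
      {
          if (r.UtilityScore >= globalThreshold)
              buffer.Add(r);                       // cache for the global filter
          // otherwise the request is discarded immediately
      }

      // Phase 2 (global filter): runs once per flush interval.
      public void OnFlushTimer()
      {
          var topRanked = buffer
              .OrderByDescending(r => r.UtilityScore)
              .Take(budget)                        // comply with the logging budget
              .ToList();
          FlushToDisk(topRanked);                  // one batched write
          globalThreshold = AdjustThreshold(buffer.Count); // see Section 4.4.2
          buffer.Clear();
      }

      private void FlushToDisk(List<LoggingRequest> logs) { /* batched disk write */ }

      // Placeholder: the actual update rule is Equation (8) in Section 4.4.2.
      private double AdjustThreshold(int lastIntervalVolume) { return globalThreshold; }
  }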
4.3 Local Filter

The major task of the local filter component is to compute the utility score for each logging request. The utility score measures the usefulness of a logging request for performance diagnosis. Note that a local filter is executed in the same service thread being monitored. The overhead of computing the utility score should be kept low to reduce the impact on the service.

4.3.1 Formula of the utility score

To compute the utility score for each logging request, we analyze the histogram of the execution time of the corresponding MCR. The intuition is that the utility score should be higher if the execution time of an MCR deviates further away from its past behavior. For each MCR, we can measure the degree of performance deviation based on the histogram of the execution time of the MCR. However, it is inefficient to maintain the complete history of execution times for each MCR and compute the histogram. In our work, we adopt the concept of the method of moments, which can be computed efficiently. According to statistical theory, moments can well approximate a histogram. The 1st-order moment is the mean, and the 2nd-order moment (σ²) is the square of the standard deviation (σ). Based on the mean (µ) and the standard deviation (σ) of the execution time of an MCR, we propose three forms of utility scores, given the current execution time t of the MCR:

  utility = (t − µ − τ) / σ    (1)
  utility = t    (2)
  utility = t − µ − τ    (3)

In Equation (1), the constant value τ is a tolerance factor, which is used to further reduce false positives for MCRs. For example, an execution time of 5ms is significantly abnormal compared to an average execution time of 1ms, but is ignorable for performance diagnosis. The default value of τ is 25ms.

Equation (2) simply uses the execution time as the utility score, which is suitable when users would like to identify performance hotspots (e.g., the components with the longest execution time). Equation (3) computes the utility score based on the mean execution time. Compared with Equations (1) and (2), it considers the abnormality (t − µ) while ignoring the fluctuation.

Besides the predefined utility formulas, we also allow users to specify their own utility functions to cater for their own scenarios.

4.3.2 Updating the utility scores dynamically

During performance monitoring, the execution time t of each MCR varies at runtime. Therefore, the mean and standard deviation of t should be updated dynamically over time. Moments can be updated incrementally, with a time complexity of O(1):

  µ_n = (1 − 1/n) µ_{n−1} + (1/n) t_n    (4)
  σ²_n = (1 − 1/n) [σ²_{n−1} + (1/n)(t_n − µ_{n−1})²]    (5)

where n denotes the nth update and t_n is the nth execution time. We also modify Equations (4) and (5) in a manner similar to exponential smoothing, which can better capture slow-varying system dynamics. The corresponding formulas are as follows:

  µ_n = (1 − α) µ_{n−1} + α t_n    (6)
  σ²_n = (1 − α) [σ²_{n−1} + α (t_n − µ_{n−1})²]    (7)

where α is a weighting factor, which is empirically set to 0.01.
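As a sketch, the per-MCR state needed for Equations (6) and (7), together with the utility score of Equation (1), can be maintained as follows. α and τ use the defaults above; the fallback for the initial σ = 0 case is our own assumption, since the paper does not specify one.

  using System;

  // Per-MCR statistics for one local filter, following Equations (6)-(7)
  // and the utility score of Equation (1).
  public class McrStats
  {
      private const double Alpha = 0.01; // smoothing weight (Equations (6)-(7))
      private const double Tau = 25.0;   // tolerance factor in ms (Equation (1))

      private double mu;                 // smoothed mean execution time
      private double sigmaSq;            // smoothed variance

      // O(1) incremental update with the latest execution time t (ms).
      public void Update(double t)
      {
          double diff = t - mu;                                    // t_n - mu_{n-1}
          sigmaSq = (1 - Alpha) * (sigmaSq + Alpha * diff * diff); // Equation (7)
          mu = (1 - Alpha) * mu + Alpha * t;                       // Equation (6)
      }

      // Utility score of Equation (1); falls back to the unnormalized
      // deviation (Equation (3)) while sigma is still zero -- an assumption,
      // as the paper does not define the initial case.
      public double Utility(double t)
      {
          double sigma = Math.Sqrt(sigmaSq);
          return sigma > 0 ? (t - mu - Tau) / sigma : t - mu - Tau;
      }
  }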
4.4 Global Filter

In Log2, the global filter component performs two major tasks: log flushing and utility-threshold adjustment.

4.4.1 Log flushing

Log flushing is triggered periodically; this period is called the flush interval. When the timer is triggered, Log2 first sorts the buffered logs according to their utility scores, and then flushes the top-ranked logs so that the total flushed log volume does not exceed the logging budget. All selected logs are packed together and flushed once, in a batched fashion.

Buffer design. A proper buffer design is important for reducing logging overhead, especially CPU usage. Note that the buffer is accessed by multiple local filters with fast inserting operations, as well as by the global filter thread with slow sorting and flushing operations. To make sure that the latter does not affect the inserting performance and thus does not block working threads, Log2 includes a data structure called a swap buffer, which consists of two buffers: one serves the inserting operation, and the other serves the sorting and flushing operations. These two buffers are swapped periodically, after each flush interval. A 0/1 flag is used to indicate which buffer is currently used for insertion and which one for flushing. This mechanism guarantees that the two threads work on different buffers without lock contention, except for swapping the global flag.

Flush-interval selection. A long flush interval results in a larger swap buffer, and thus more memory consumption, while a shorter interval benefits less from batched flushing and incurs frequent overhead in swapping buffers. Log2 currently sets the default flush interval to 30 seconds, which works well in our experiments and practice. Users are also allowed to configure the flush interval on-the-fly.

4.4.2 Utility-threshold adjustment

The utility threshold is used to control the volume of logs to be inserted into the swap buffer. Because only the logging requests with utility scores larger than the threshold are cached, setting a proper threshold is very important for Log2. Specifically, if the threshold is set too low, massive numbers of logs could be inserted into the swap buffer, with larger overhead as the consequence. On the other hand, if the threshold is set too high, only a small number of logs could be cached in the buffer; thus important logs could be missed, leading to unacceptable logging effectiveness.

The optimization objective is to cache just budget-volume logs by selecting a proper threshold. Choosing such an optimal threshold value in one shot is unrealistic, because both the environment dynamics and the frequency of different utility scores are unknown. To address this challenge, we design an iterative way of adjusting the threshold by "learning from history". The duration of each iteration is called the adjust interval. Intuitively, when the volume of logs in the previous adjust interval is higher than the budget, the threshold should be increased. The threshold should be decreased when the volume of logs in the previous adjust interval is lower than the budget. From both effectiveness and efficiency perspectives, it is desirable that the adjusting algorithm converges quickly, and that the volume of logs in the buffer is not too large (low overshoot) in any interval. We next illustrate the details of Log2's threshold-adjustment algorithm, which is agile and has low overshoot.

Adjustment mechanism. Let us denote the threshold and log volume as T_n and V_n, respectively, where n is the index of the adjust interval, and let B denote the logging budget. The threshold-adjustment mechanism used in Log2 is as follows (in the form of the Secant Method [18]):

  T_n = T_{n−1} + (V_{n−1} − B) × (T_{n−1} − T_{n−2}) / (V_{n−1} − V_{n−2})    (8)

Mathematically, the convergence of our algorithm is super-linear, with an order of 1.618. More details about the mathematical deduction of our method are available at our project website [4]. The interpretation is that the "gain" T_n − T_{n−1} on the threshold is proportional to the "error" V_{n−1} − B, and the coefficient (T_{n−1} − T_{n−2}) / (V_{n−1} − V_{n−2}) approximates the reciprocal of the derivative, if we treat V as a function of T. In our implementation, to avoid a divide-by-zero error, we add 1 if V_{n−1} − V_{n−2} is close to 0. When T_{n−1} − T_{n−2} is equal to zero, the threshold update can get stuck at a certain value and never change. To avoid this issue, we add a very small value (0.01) in that situation.
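A minimal sketch of this update rule, with the two guards just described, follows. The class name, the state layout, and the epsilon used to test "close to 0" are our own assumptions.

  using System;

  // Iterative utility-threshold adjustment following Equation (8).
  public class ThresholdAdjuster
  {
      private double tPrev;           // T_{n-1}
      private double tPrev2;          // T_{n-2}
      private double vPrev;           // V_{n-2} (volume of the interval before last)
      private readonly double budget; // B, log volume allowed per adjust interval

      public ThresholdAdjuster(double b) { budget = b; }

      // Called once per adjust interval with V_{n-1}, the volume just observed;
      // returns T_n, the threshold to use for the next interval.
      public double Next(double volume)
      {
          double dV = volume - vPrev;       // V_{n-1} - V_{n-2}
          double dT = tPrev - tPrev2;       // T_{n-1} - T_{n-2}
          if (Math.Abs(dV) < 1e-6) dV += 1; // guard: avoid divide-by-zero
          if (dT == 0) dT = 0.01;           // guard: avoid a threshold stuck forever
          double tNext = tPrev + (volume - budget) * dT / dV; // Equation (8)
          tPrev2 = tPrev;
          tPrev = tNext;
          vPrev = volume;
          return tNext;
      }
  }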
Adjustment interval. To make the threshold-adjustment mechanism more effective, a properly chosen adjust interval is needed. The adjust interval should mitigate the fluctuation of environment changes; i.e., the workload should vary slowly at the granularity of the chosen adjust interval. Therefore, the adjust interval cannot be too short; otherwise, the transient random variation of the workload becomes significant. On the other hand, a too-long interval means a longer time to converge, making Log2 less agile. In our implementation, Log2 sets the adjust interval to 30 seconds, the same as the flush interval.

4.5 Implementation Details

We have implemented Log2 in the C# language. Some details of the implementation are as follows.

Bounded memory usage. The maximum memory usage of Log2 is set to 50MB in the configuration, so that Log2 has negligible memory contention with normal service operations. In our implementation, when the maximum memory usage is reached, new logging requests are dropped within the same flush interval. In fact, the 50MB limit is rarely reached in most cases. Specifically, two components in Log2 consume most of the memory. One is the cache maintaining µ and σ for all the MCRs. For a large-scale online service, the number of MCRs is on the order of 100,000, so the corresponding memory usage is 100,000 × 2 × 8B = 1.6MB. The other component is the swap buffer. Its size depends on both the budget size and the flush interval. The I/O bandwidth of logging is 200KB/s (which is 20GB per day!) per machine for a typical large-scale online service. Because the budget size does not exceed the overall throughput, a loose upper bound on the memory usage of the swap buffer is 200KB/s × 60s × 2 = 24MB. In addition, the 50MB limit was never reached in any of our experiments.

Handling system idle time. System idling is a special circumstance that needs to be handled. Specifically, when logging requests are rare, the budget will not be reached no matter how the utility threshold is adjusted. The consequence is that the utility threshold could become extremely low, and thus the system will overshoot dramatically (i.e., there will be a burst of flushing) when the intensity of logging requests returns to normal. To avoid such circumstances, a lower bound on the utility threshold is set; in our implementation, we set the lower bound to 0. Such a mechanism is commonly used in the area of control engineering.

Nested instrumentation. To support nested instrumentation, each local filter maintains a timestamp stack to match the Begin-End pairs of logging requests. When Begin is invoked, the corresponding timestamp is pushed onto the stack; when End is invoked, the top element of the stack is popped and matched as the Begin corresponding to the current End invocation. As illustrated in Section 4.3.2, the historical information of each MCR is maintained separately; therefore, dropping the outer log request does not directly lead to dropping the inner log request.
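A sketch of this matching logic, assuming a thread-local stack and reusing the McrStats sketch from Section 4.3.2; the names and the tick-based clock are illustrative.

  using System.Collections.Concurrent;
  using System.Collections.Generic;
  using System.Diagnostics;

  // Begin/End matching for nested MCRs via a per-thread timestamp stack.
  public static class NestedTimer
  {
      [ThreadStatic] private static Stack<long> timestamps;

      // Per-MCR history, kept separately so that dropping an outer log
      // request does not drop the inner one.
      private static readonly ConcurrentDictionary<string, McrStats> Stats =
          new ConcurrentDictionary<string, McrStats>();

      public static void Begin(string mcrName)
      {
          if (timestamps == null) timestamps = new Stack<long>();
          timestamps.Push(Stopwatch.GetTimestamp());   // push the start timestamp
      }

      public static void End(string mcrName)
      {
          long start = timestamps.Pop();               // matches the innermost Begin
          double elapsedMs =
              (Stopwatch.GetTimestamp() - start) * 1000.0 / Stopwatch.Frequency;
          Stats.GetOrAdd(mcrName, _ => new McrStats()).Update(elapsedMs);
          // A full implementation would now form a logging request carrying
          // this execution time and hand it to the local filter.
      }
  }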
5 Evaluation

In our evaluation, we intend to evaluate Log2 from the following three aspects:

Logging throughput: How much I/O throughput (the volume of logs flushed to disk within a time interval) can be reduced by Log2, compared with an existing logging system?

Logging effectiveness: How effective is Log2 in diagnosing performance issues? The effectiveness is measured as the percentage of performance issues that can be captured by the flushed logs.

Additional overhead: How much additional CPU and memory overhead is incurred by Log2?

5.1 Experimental Subject and Setup

To evaluate Log2, we design experiments on BlogEngine [1], which is a popular open-source, ASP.NET-based blogging platform. BlogEngine had received more than 1,000,000 downloads as of January 30, 2015. It supports various blogging activities, such as writing blogs, adding comments, sharing, and following. We chose version 2.8, as it is a recent stable version.

To evaluate Log2 on BlogEngine, we run BlogEngine as a service and simulate concurrent access to the service via multiple synthetic users. We then analyze the logs generated by Log2 as well as the runtime performance. We set up the experiment on BlogEngine in four steps: instrumentation, deployment, performance issue injection, and overhead monitoring. Below are the detailed setup procedures.

Instrumentation. We perform program instrumentation guided by previous work. Specifically, three types of code regions in BlogEngine are marked as MCRs and logged, since they have relatively high potential to cause performance issues: expensive system-level APIs, loop blocks, and function calls. In total, about 1,000 MCRs are identified and instrumented.

Deployment. We use one physical machine to deploy the BlogEngine service, and two other physical machines are configured as client nodes. Each machine runs Windows Server 2012 R2, with Intel(R) Xeon(R) E5-2650 v2 @ 2.60GHz CPUs (2 processors) and 192GB of memory. We adopt a tool named WebTest [3] to simulate high workload from multiple synthetic users accessing the BlogEngine service. WebTest is a testing tool released with Visual Studio 2012. It can be configured to generate mixed types of requests with user-specified loads. In our experiment, we generate five typical types of requests in WebTest: read blogs, write comments, search, download files, and upload files. These requests cover the most common usage scenarios of BlogEngine.

Performance issue injection. In order to evaluate the logging effectiveness of Log2, we inject three types of performance issues: uploading an extremely large file, searching for a strange term, and exhausting the CPU with another process. Specifically, when uploading a file larger than 100MB, the GUI on the client side starts to hang (a possible fix is to put the uploading job in a background thread). The response to the search operation becomes significantly slow when entering a strange query term that is long and contains special characters (a possible fix is to pre-process the query term). Both of these performance issues can be directly pinpointed by the corresponding logs.

We write a program named ResourceEater to consume high CPU usage over a certain period, to mimic the third type of performance issue. When ResourceEater is launched, it occupies the CPU intensively, and the runtime performance of BlogEngine degrades significantly. Such performance issues can be reflected in the corresponding logs (e.g., the logs that mark the loop blocks).

Overhead monitoring. To measure the I/O throughput, we record the number of logs flushed to disk per time interval. To measure the additional CPU/memory overhead of Log2, we write a program named PerfMonitor to monitor the CPU and memory usage of BlogEngine every second.
The CPU overhead is measured as the percentage of total CPU cycles Log2 occupies, and the memory overhead is measured as the bytes of memory space Log2 consumes.

5.2 Experimental Design

We design an experiment to evaluate Log2. We use the WebTest tool [3] to simulate 101 synthetic users concurrently accessing BlogEngine. The experiment runs for two hours. Among the 101 users, 100 mimic normal user behaviors, which fall into the five aforementioned groups (read blogs, write comments, search, download files, and upload files). One user mimics abnormal usage to inject two types of performance issues (uploading an extremely large file and searching for a strange term), which are generated 78 times during the 2-hour experiment. To inject the issues caused by exhausting the CPU with another process, ResourceEater is triggered on the service machine one hour after the start, and lasts for 10 minutes.

We also evaluate the logging effectiveness of Log2 using three utility scores: t, (t − µ − τ), and (t − µ − τ)/σ, respectively.

In the experiment, we compare Log2 with the baseline approach, which directly outputs all executed logs without considering cost-effectiveness. As we instrument all the MCRs of interest, the baseline approach is able to detect all injected performance issues. We are interested in knowing whether Log2 can detect a similar number of issues using a far smaller amount of logs.

In addition, we compare Log2 with two sampling-based logging approaches, named Sampling-counter and Sampling-time, respectively. Sampling-counter is counter-based: it uses a global counter to record how many logging requests have been processed, and only the logs whose corresponding counter value is divisible by the reciprocal of the sampling rate are flushed to disk. Sampling-time is time-interval based: it uses a timer to control when logs are flushed to disk, and only the logs executed when the timer is triggered are flushed.
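For contrast, a sketch of how the Sampling-counter baseline decides whether to flush; the class name is illustrative, and only the divisibility rule comes from the description above.

  using System.Threading;

  // Counter-based sampling: flush every k-th executed logging request,
  // where k is the reciprocal of the sampling rate (e.g., k = 33 for ~3%).
  public class CounterSampler
  {
      private long counter;
      private readonly long k;

      public CounterSampler(long reciprocalOfSamplingRate)
      {
          k = reciprocalOfSamplingRate;
      }

      public bool ShouldFlush()
      {
          // The decision ignores utility entirely, which is why important
          // events can be missed at low sampling rates.
          return Interlocked.Increment(ref counter) % k == 0;
      }
  }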
5.3 Experimental Results

Logging throughput. Figure 3 shows the number of logs flushed per time interval (30s) using Log2 and the baseline logging approach, respectively. The budget is set to 120 logs/interval. The big drop in the number of logging requests (around intervals 118-136) is due to the launch of ResourceEater.

[Figure 3: Comparison of logging throughput (budget = 120 logs/interval). Number of flushed logs per 30s interval for the traditional logging system and Log2.]

Figure 3 shows that the logging throughput is significantly reduced using Log2. The average number of logs flushed per interval is 104 for Log2, while it is 3,800 for the baseline logging approach. The reduction in logging throughput is over 97%. In addition, the logging throughput of Log2 strictly complies with the budget constraint (< 120 logs/interval).

Logging effectiveness. The logging effectiveness is inherently associated with the budget size, i.e., the logging bandwidth. A higher logging bandwidth induces higher logging effectiveness. We evaluate the logging effectiveness by varying the budget size. In addition, we also evaluate the three alternative formulas for utility scores (t, t − µ − τ, and (t − µ − τ)/σ).

Figure 4 illustrates how the logging effectiveness increases as the budget size increases. All three proposed utility scores help achieve high effectiveness; i.e., the coverage of marked logs increases quickly to almost 100% as the budget size starts to increase. The results indicate that Log2 has a strong ability to preserve high logging effectiveness while reducing a significant amount of logs.

[Figure 4: Logging effectiveness vs. budget size.]

[Figure 5: Logging effectiveness of the two sampling-based approaches.]

The results of the two sampling-based logging systems, Sampling-counter and Sampling-time, are illustrated in Figure 5. The effectiveness of either Sampling-counter or Sampling-time is approximately proportional to the sampling rate, which is much lower than what Log2 achieves. It is worth noting that a budget size of 120 logs/interval is equivalent to a sampling rate of 3%. While Log2 achieves almost 100% coverage with this budget size, Sampling-counter and Sampling-time achieve only 2% and 6% coverage, respectively.

For the issues injected by exhausting the CPU with another process, there are in total 690,000 individual calls on 6 instrumented loop blocks during the experiment (note that each loop block is one MCR). Using Log2, only 22,000 calls (a 97% reduction) on loop blocks are recorded (budget size = 120 logs/interval), with an average execution time of 160ms. By inspecting the loop-related logs, we found that the average execution time is 423ms when ResourceEater is launched, which is significantly larger than the average value (160ms) without the impact of ResourceEater. Our inspection shows that the logs reflecting loops with long execution times are recorded, which demonstrates the capability of Log2 to detect the performance issue caused by exhausted CPU usage.

In summary, the experimental results show that Log2 is effective in detecting performance issues while keeping the volume of logs low.

Additional overhead. Log2 works in the same process as the BlogEngine service; hence its own CPU/memory usage cannot be measured directly. In order to evaluate the overhead of Log2, we measure the overall CPU/memory usage of the BlogEngine system integrated with Log2, and compare it with the overall usage of the BlogEngine system integrated with the baseline logging approach (outputting all logs). We run the experiment with each setting 7 times to overcome random variations.

Table 2: Comparison on overall resource usage

  Logging system    Memory (GB)    CPU (%)
  Log2              4.74 ± 0.21    63.4 ± 3.0
  Baseline          4.70 ± 0.25    70.6 ± 4.1

According to Table 2, the additional memory usage of Log2 over the baseline approach is not noticeable. When integrated with Log2, the average CPU usage of BlogEngine is slightly lower than that with the baseline logging system. This is because, using Log2, a large number of logging requests are discarded at an early stage; therefore a significant amount of processing (such as logging state extraction and string conversions), as well as lock contention, is avoided, leading to reduced CPU usage.

In order to evaluate the memory usage of Log2, we monitor the size of the swap buffer over time. Figure 6 shows the number of logs inserted into the swap buffer per flush interval.

[Figure 6: Dynamics of the swap buffer size. Number of logs inserted into the swap buffer per 30s interval.]

There is one peak at the beginning, when the threshold for the utility score has not yet converged. The peak is about 1.3 times higher than the average, which is far from the default maximum memory limit set in Log2. In addition, it takes only five iterations to converge, which shows that the small memory peak disappears quickly.
The variation of the curve is mostly caused by randomness in the workload.

6 An Application to Microsoft ServiceX

To further evaluate Log2, we have applied it to analyze the performance logs of Microsoft ServiceX (the service name is anonymized for confidentiality). ServiceX is a large-scale online service system serving millions of users globally.

Designed with a 3-tier architecture, ServiceX runs on a large number of machines, each of which continuously generates a huge amount of logs. A typical front-end machine usually generates logs at a speed of 30MB per minute. Log aggregation from all the machines is a heavy task, since each machine generates about 40GB of logs every day. ServiceX provides a logging API called MoS for performance diagnosis. The corresponding logs are called MoS logs (i.e., performance logs), and they take up 20%-40% of the total logs. Engineers of ServiceX would like to reduce the large volume of MoS logs, since most of them are not useful for performance diagnosis and simply incur overhead. We apply Log2 to evaluate its ability to reduce the volume of MoS logs.

Setup. Each MoS log entry contains the following information: log time, execution time of the MCR, code region ID, and thread ID. Such information is sufficient to reconstruct the execution flows of all the MoS logs. We randomly select 12 different datasets, each containing logs generated during one continuous hour.

We focus on evaluating logging bandwidth and effectiveness in our study. To do so, we identify performance hotspots, which are the code regions that take the most time to execute. We choose the MoS logs with the top 0.3% (i.e., 100% − 99.7%, following the 3-sigma rule of thumb) longest execution times as the performance hotspots. We then apply Log2 to see how many of these performance hotspots can be successfully identified. We choose t as the utility formula. We evaluate the logging effectiveness as the coverage of the performance hotspots while varying the budget size. Additionally, we also evaluate how the flush interval affects the effectiveness.

Results. Figure 7 shows the logging effectiveness of Log2 for varying budget sizes. Since we conduct experiments on 12 datasets, the effectiveness for each budget is represented as a range. As shown in Figure 7, the coverage of performance hotspots quickly reaches 100% as the budget size increases. In particular, when the budget is set to 100 logs/interval, which is equivalent to a sampling rate of 0.77%, the coverage is already 98%. Moreover, only 4.5MB of logs are recorded, while the size of the original MoS logs is 500MB for each dataset.

[Figure 7: Logging effectiveness (detected performance issues, %) vs. budget (# logs/interval).]

Figure 8 shows the effectiveness of Log2 under different flush-interval values. Here the budget is set to 120 logs/interval, which is equivalent to a sampling rate of 1.0%. When the flush interval is very small, the coverage rate is relatively low, mainly due to the significance of randomness in the workload. Setting the flush interval to 30 seconds is satisfactory, since the coverage rate there is almost 100%.

[Figure 8: Logging effectiveness (detected performance issues, %) vs. flush interval (s).]

In summary, our case study on ServiceX confirms the applicability of Log2 to real-world systems.

7 Discussion

Budget control for multiple services.
In our current design, Log2 is implemented as a runtime logging library and can be dynamically linked to a service system under monitoring. It controls the budget for only one single service. As the budget can be changed dynamically, it is possible to make Log2 a standalone process that manages a set of budgets for multiple services. Such a centralized budget-control system could further enable dynamic budget re-allocation across services.

Supporting more types of performance analysis. Log2 is very effective for capturing performance hotspots on-the-fly. In practice, there are other commonly required types of performance analysis: for example, to understand the overall latency status of the system under monitoring, the total number of times the latency hits the 3-sigma threshold, the average latency of a component, and so on. Log2 has the ability to provide such information. For example, Log2 maintains the mean and standard deviation values for each MCR. In addition, Log2 can record how many times each MCR is updated. Hence, many other measures of performance status (such as the 3-sigma measures) can be easily derived from these basic statistics, as sketched below. Additionally, all the data of Log2 can be dumped periodically and used by other performance-analysis tools. Analytical reports based on off-line processing of the data can produce comprehensive information for postmortem analysis.
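For instance, a 3-sigma check derives directly from the stored per-MCR statistics; the field and method names below are illustrative, not Log2's actual interface.

  // Deriving a 3-sigma latency measure from the statistics Log2 already
  // keeps per MCR.
  public class McrSummary
  {
      public double Mu;     // mean execution time (ms), maintained by Log2
      public double Sigma;  // standard deviation (ms), maintained by Log2
      public long Count;    // number of updates to this MCR

      // Latency above which an execution counts as a 3-sigma outlier.
      public double ThreeSigmaThreshold() => Mu + 3 * Sigma;

      public bool IsThreeSigmaHit(double executionTimeMs) =>
          executionTimeMs > ThreeSigmaThreshold();
  }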
Multiple objectives. Currently we only define the budget in terms of I/O bandwidth, as it is the overhead of most concern to our surveyed participants. It is possible to consider more objectives, such as CPU and memory usage, and to control logging overhead by performing multi-objective optimization. We will address this in our future work.

Where to log. As described in Section 4.1, we identify MCRs for performance diagnosis. In this paper, we focus on the problem of "whether to log". Another important topic is "where to log", i.e., the automatic identification of code regions that should be logged and monitored. The two problems, "whether to log" and "where to log", are closely related to each other. For example, the logging mechanism we propose enables "conservative logging"; i.e., developers can instrument a large number of logging statements without concern about the logging cost. This is an important topic for our future work.

Leveraging non-performance logs. Although performance logs are common in practice, there are also other types of logs, such as those for failure diagnostics. Two adjacent log entries indicate the time spent executing the code between the two log entries. It would be interesting to leverage those logs for performance diagnosis.

Extension to failure diagnosis. Our current work focuses on analyzing performance logs for effective and efficient monitoring and diagnosis of performance issues. Apart from performance logs, there are other types of logs, such as logs recording error and failure information. These logs are mainly for diagnosing software failures in production environments [24, 26, 27]. How to extend our work to support failure diagnosis is important future work.

8 Related Work

Performance monitoring and diagnosis has become increasingly important, especially in the era of Internet-based services and cloud computing. A large amount of research has been conducted to characterize [13, 28, 16] and improve [14, 21, 23, 12, 7] system performance. In production environments, logging is still the most commonly used technique for performance monitoring and diagnosis.

Dapper [20] is a large-scale distributed tracing infrastructure widely adopted by Google for ubiquitous and continuous monitoring. Dapper is designed to have low overhead, application-level transparency, and scalability. Log2 shares the same design goals as Dapper, and goes one step further with finer-grained and more accurate control of logging overhead to comply with the resource budget. Dapper flushes only a fraction of all traces using a sampling approach (with a manually configured sampling rate), so interesting traces could be missed. Log2 preserves useful logs with significantly higher effectiveness. At the same time, Log2 guarantees the resource budget constraints, which can be violated in Dapper.

ETW (Event Tracing for Windows) [2] is a framework that can log Windows kernel or application-specific events to a log file. It has a buffering mechanism that reduces the number of disk accesses for logging. However, ETW is not cost-aware: it cannot selectively record logs based on a given budget.

Paradyn [17] also controls its instrumentation overhead dynamically. However, it depends on users to explicitly configure where to log and to predict whether to log. Log2 instead is user-transparent in that the whether-to-log decisions are made dynamically by the logging mechanism. Excessive instrumentation is commonly adopted in the profiling domain. Arnold and Ryder [6] present sampling-based low-cost instrumentation to enable feedback-guided just-in-time optimization. Like Dapper, logging based on random sampling could miss interesting traces.

Yuan et al. [25, 26, 27] have pioneered the work on log-based failure diagnosis. LogEnhancer [27] aims to enhance the recorded contents of existing logging statements by automatically identifying and inserting critical variable values into them. ErrLog [26] utilizes a number of exception patterns that potentially cause system failures, and then adds proactive logging code to automatically log all of them. These works mainly address the problems of "what to log" and "where to log". Our work, instead, focuses on "whether to log".

9 Conclusion

In this paper, we have presented Log2, a cost-aware logging system for making optimal "whether to log" decisions. Log2 adopts a two-phase filtering mechanism to selectively record useful logs based on a given logging bandwidth. The experimental results on both BlogEngine and ServiceX demonstrate the capability of Log2 to control logging overhead while preserving effectiveness.

Currently, Log2 analyzes performance logs for performance monitoring and diagnosis. As discussed in Section 7, in the future we will extend Log2 to support more types of analysis, such as supporting other kinds of logs for failure diagnosis.

References

[1] BlogEngine, 2007. http://www.dotnetblogengine.net/.
[2] ETW tracing, 2007. https://msdn.microsoft.com/en-us/library/ms751538(v=vs.110).aspx.
[3] Record and run a web performance test, 2013. http://msdn.microsoft.com/en-us/library/ms182539.aspx.
[4] Log2, an overhead-constrained logging system, 2014. http://research.microsoft.com/en-us/projects/log2/default.aspx.
[5] Anderson, E., Hobbs, M., Keeton, K., Spence, S., Uysal, M., and Veitch, A. Hippodrome: Running circles around storage administration. In Proceedings of the 1st USENIX Conference on File and Storage Technologies (2002), FAST '02, USENIX Association.
[6] Arnold, M., and Ryder, B. G. A framework for reducing the cost of instrumented code. In SIGPLAN Conference on Programming Language Design and Implementation (2001), ACM Press, pp. 168-179.
[7] Arulraj, J., Chang, P., Jin, G., and Lu, S. Production-run software failure diagnosis via hardware performance counters. In Architectural Support for Programming Languages and Operating Systems, ASPLOS '13, Houston, USA (2013), pp. 101-112.
[8] Ang, K. H., Chong, G., and Li, Y. PID control system analysis, design, and technology. IEEE Transactions on Control Systems Technology (2005).
[9] Ding, R., Fu, Q., Lou, J., Lin, Q., Zhang, D., and Xie, T. Mining historical issue repositories to heal large-scale online service systems. In 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2014, Atlanta, GA, USA (2014), pp. 311-322.
[10] Andersen, E. B. Sufficiency and exponential families for discrete sample spaces. Journal of the American Statistical Association (1970).
[11] Brown, R. G. Smoothing, Forecasting and Prediction of Discrete Time Series. Englewood Cliffs, NJ: Prentice-Hall, 1963.
[12] Han, S., Dang, Y., Ge, S., Zhang, D., and Xie, T. Performance debugging in the large via mining millions of stack traces. In 34th International Conference on Software Engineering, ICSE 2012, Zurich, Switzerland (2012), pp. 145-155.
[13] Jin, G., Song, L., Shi, X., Scherpelz, J., and Lu, S. Understanding and detecting real-world performance bugs. In ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '12 (2012), pp. 77-88.
[14] Jovic, M., Adamoli, A., and Hauswirth, M. Catch me if you can: Performance bug detection in the wild. In Proceedings of the 2011 ACM International Conference on Object Oriented Programming Systems Languages and Applications (New York, NY, USA, 2011), OOPSLA '11, ACM, pp. 155-170.
[15] Wasserman, L. All of Statistics: A Concise Course in Statistical Inference. New York: Springer, 2004.
[16] Liu, Y., Xu, C., and Cheung, S. Characterizing and detecting performance bugs for smartphone applications. In 36th International Conference on Software Engineering, ICSE '14, Hyderabad, India (2014), pp. 1013-1024.
[17] Miller, B., Callaghan, M., Cargille, J., Hollingsworth, J., Irvin, R., Karavanic, K., Kunchithapadam, K., and Newhall, T. The Paradyn parallel performance measurement tool. IEEE Computer (1995).
[18] Allen, M. B., and Isaacson, E. L. Numerical Analysis for Applied Science. John Wiley & Sons, 1998.
[19] Ogata, K. Discrete-Time Control Systems. Prentice-Hall, 1987.
[20] Sigelman, B. H., Barroso, L. A., Burrows, M., Stephenson, P., Plakal, M., Beaver, D., Jaspan, S., and Shanbhag, C. Dapper, a large-scale distributed systems tracing infrastructure. Google technical report (2010).
[21] Song, L., and Lu, S. Statistical debugging for real-world performance problems. In Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications, OOPSLA 2014, part of SPLASH 2014, Portland, OR, USA (2014), pp. 561-578.
[22] Wheeler, D. J., and Chambers, D. S. Understanding Statistical Process Control. SPC Press, 1992.
[23] Xu, W., Huang, L., Fox, A., Patterson, D., and Jordan, M. I. Detecting large-scale system problems by mining console logs. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (New York, NY, USA, 2009), SOSP '09, ACM, pp. 117-132.
[24] Yuan, D., Mai, H., Xiong, W., Tan, L., Zhou, Y., and Pasupathy, S. SherLog: Error diagnosis by connecting clues from run-time logs. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (March 2010).
[25] Yuan, D., Mai, H., Xiong, W., Tan, L., Zhou, Y., and Pasupathy, S. SherLog: Error diagnosis by connecting clues from run-time logs. In ASPLOS (2010).
[26] Yuan, D., Park, S., Huang, P., Liu, Y., Lee, M. M., Zhou, Y., and Savage, S. Be conservative: Enhancing failure diagnosis with proactive logging. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (2012), OSDI '12, USENIX Association.
[27] Yuan, D., Zheng, J., Park, S., Zhou, Y., and Savage, S. Improving software diagnosability via log enhancement. In Proceedings of Architectural Support for Programming Languages and Operating Systems (ASPLOS) (Newport Beach, CA, March 2011).
[28] Zaman, S., Adams, B., and Hassan, A. E. A qualitative study on performance bugs. In 9th IEEE Working Conference on Mining Software Repositories, MSR 2012, Zurich, Switzerland (2012), pp. 199-208.