Developer Experience: Demand to support engineering teams has risen, and the focus is shifting from traditional DevOps practices toward broader workflow improvements.
The future of AI-driven development. Join the discussion on the roles low code and AI play in building mission-critical apps.
PostgreSQL 12 End of Life: What to Know and How to Prepare
A Comprehensive Guide to IAM in Object Storage
Developer Experience
With tech stacks becoming increasingly diverse and AI and automation continuing to take over everyday tasks and manual workflows, the tech industry at large is experiencing a heightened demand to support engineering teams. As a result, the developer experience is changing faster than organizations can consciously maintain.

We can no longer rely on DevOps practices or tooling alone — there is even greater power recognized in improving workflows, investing in infrastructure, and advocating for developers' needs. This nuanced approach brings developer experience to the forefront, where devs can begin to regain control over their software systems, teams, and processes.

We are happy to introduce DZone's first-ever Developer Experience Trend Report, which assesses where the developer experience stands today, including team productivity, process satisfaction, infrastructure, and platform engineering. Taking all perspectives, technologies, and methodologies into account, we share our research and industry experts' perspectives on what it means to effectively advocate for developers while simultaneously balancing quality and efficiency. Come along with us as we explore this exciting chapter in developer culture.
Identity and Access Management
Getting Started With Agentic AI
Hey, DZone Community! We have an exciting year ahead of research for our beloved Trend Reports. And once again, we are asking for your insights and expertise (anonymously if you choose) — readers just like you drive the content we cover in our Trend Reports. Check out the details for our research survey below.

Comic by Daniel Stori

Generative AI Research

Generative AI is revolutionizing industries, and software development is no exception. At DZone, we're diving deep into how GenAI models, algorithms, and implementation strategies are reshaping the way we write code and build software. Take our short research survey (~10 minutes) to contribute to our latest findings. We're exploring key topics, including:

- Embracing generative AI (or not)
- Multimodal AI
- The influence of LLMs
- Intelligent search
- Emerging tech

And don't forget to enter the raffle for a chance to win an e-gift card of your choice!

Join the GenAI Research

Over the coming month, we will compile and analyze data from hundreds of respondents; results and observations will be featured in the "Key Research Findings" of our Trend Reports. Your responses help inform the narrative of our Trend Reports, so we truly cannot do this without you. Stay tuned for each report's launch and see how your insights align with the larger DZone Community. We thank you in advance for your help!

—The DZone Content and Community team
Artificial intelligence is a transformative force across industries, from healthcare to finance and beyond. But AI systems perform only as well as the data they are trained on. AI success depends on high-quality data: inaccurate, incomplete, duplicated, or conflicting records lead to diminished performance, higher operational costs, biased decisions, and flawed insights. AI developers often underestimate the true cost of dirty data, even though it directly affects business performance, user trust, and project success.

The Financial Burden of Poor Data Quality

The most direct expense of dirty data in AI development is financial. Organizations that depend on AI systems for decision automation must budget sizable amounts for cleaning, preparing, and validating datasets. Studies show that poor data quality causes millions of dollars in losses every year through inefficiency, prediction mistakes, and wasted resources. Faulty training data can lead businesses to waste resources, target the wrong customers, or, in healthcare, misdiagnose patients.

Cleaning and fixing bad data also creates extra work that strains engineering and data science teams while adding cost. Data professionals dedicate major portions of their working hours to data cleaning tasks, which diverts attention from model optimization and innovation. Dealing with impaired data slows AI development timelines and raises operational expenses, making projects unprofitable and delaying the release of AI-derived products.

Bias and Ethical Risks

Dirty data leads AI models to develop and reinforce biases, which produces unethical and discriminatory results. The quality of an AI system depends entirely on its training data: biases in the input will surface as biased outputs. Facial recognition, hiring algorithms, and lending decisions all perform less fairly when trained on data prejudiced against specific population groups.

Biased AI also seriously damages an organization's reputation. AI solutions with built-in biases can trigger legal and compliance problems, anger customers, and draw regulatory scrutiny. Correcting bias after deployment is far more difficult and expensive than maintaining data quality during development. Companies should establish clean, diverse, and representative datasets from the start to minimize ethical risks and advance AI fairness and reliability.

Decreased Model Performance and Accuracy

High-quality data is the foundation of accurate predictions, and corrupt data undermines them. Dirty data introduces inconsistencies that make it difficult for machine learning algorithms to discover meaningful patterns.
A predictive maintenance system in manufacturing, for example, will deliver poor results if it is trained on corrupted sensor readings: it will miss impending equipment failures, causing unexpected breakdowns and costly operational stoppages. Likewise, AI-powered customer support chatbots trained on imprecise data deliver untrustworthy answers, eroding customer trust in the brand. The performance issues caused by dirty data force companies to constantly retrain and manually adjust their AI systems, an expense that diminishes overall operational effectiveness. Addressing data quality at the beginning of development produces more durable and dependable AI models.

Compliance and Regulatory Challenges

Dirty data also makes it harder for organizations to comply with privacy regulations such as GDPR and CCPA. Storing inaccurate or duplicated data can violate data protection laws, exposing companies to legal consequences and substantial financial penalties. Companies that handle sensitive financial or health-related information are required by regulation to keep their data accurate.

Regulators and stakeholders increasingly demand explainable AI and transparent decision-making. Flawed data sources and untraceable AI decisions threaten the trust of users and regulators because organizations cannot defend the decisions their systems make. Organizations that establish robust data governance protocols and validation systems achieve regulatory compliance while improving the transparency and accountability of their AI systems.

The Role of Data Governance in Mitigating Dirty Data

Effective data governance requires proactive measures to reduce the impact of dirty data during AI development. Organizations need comprehensive data management systems that combine data quality assessment, error-reduction methods, and continuous monitoring. Standardized data entry practices and automated data cleaning pipelines catch errors before they damage AI models in production.

Organizations also need a culture of data responsibility. Employees should be trained in correct data handling procedures, and data engineers and scientists should work alongside business stakeholders to improve data quality. With strong data governance structures in place, organizations cut down AI errors and operational risk and get the maximum benefit from AI innovation.

The Path Forward: Addressing Dirty Data Challenges

Successful AI requires clean data. Imprecise data carries extensive financial consequences, damages ethical principles, degrades model performance, and disrupts regulatory compliance. Organizations need strong data management practices, together with data cleaning tools and governance rules, to reduce the dangers that stem from poor data quality.
Addressing dirty data points at the beginning of the AI pipeline enables businesses to boost their AI reliability, establish user trust, and achieve maximum value from their AI-powered projects.
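To make the automated data quality checks discussed above concrete, here is a minimal sketch (not from the original article) of the kind of validation a pipeline might run before training; pandas is assumed, and the column names and the 50% missing-value threshold are purely illustrative:

Python

import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    # Exact duplicate rows and per-column missing-value ratios are two of the
    # simplest dirty-data signals to measure
    return {
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_ratio": df.isna().mean().round(2).to_dict(),
    }

def basic_clean(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    # Drop exact duplicates and columns that are mostly empty (threshold is an assumption)
    df = df.drop_duplicates()
    keep = [col for col in df.columns if df[col].isna().mean() <= max_missing]
    return df[keep]

# Illustrative toy dataset
df = pd.DataFrame({"age": [34, 34, None, 41], "income": [72000, 72000, 55000, None]})
print(quality_report(df))
print(basic_clean(df))

Checks like these catch only the most mechanical problems; representativeness and bias still require the governance practices described above.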
When I began my journey into the field of AI and large language models (LLMs), my initial aim was to experiment with various models and learn about their effectiveness. Like most developers, I also began using cloud-hosted services, enticed by the ease of quick setup and availability of ready-to-use LLMs at my fingertips. But pretty quickly, I ran into a snag: cost. It is convenient to use LLMs in the cloud, but the pay-per-token model can suddenly get really expensive, especially when working with lots of text or asking many questions. It made me realize I needed a better way to learn and experiment with AI without blowing my budget. This is where Ollama came in, and it offered a rather interesting solution. By using Ollama, you can:

- Load and experiment with multiple LLMs locally
- Avoid API rate limits and usage restrictions
- Customize and fine-tune LLMs

In this article, we will explore how to build a simple document summarization tool using Ollama, Streamlit, and LangChain. Ollama allows us to run LLMs locally, Streamlit provides a web interface so that users may interact with those models smoothly, and LangChain offers pre-built chains for simplified development.

Environment Setup

- Ensure Python 3.12 or higher is installed.
- Download and install Ollama.
- Fetch the llama3.2 model via ollama run llama3.2.

I prefer to use Conda for managing dependencies and creating isolated environments. Create a new Conda environment and then install the necessary packages mentioned below.

Shell

pip install streamlit langchain langchain-ollama langchain-community langchain-core pymupdf

Now, let's dive into building our document summarizer. We will start by creating a Streamlit app to handle uploading documents and displaying summaries in a user-friendly interface. Next, we will focus on pulling the text out of the uploaded documents (supports only PDF and text documents) and preparing everything for the summarization chain. Finally, we will bring in Ollama to actually perform the summarization, utilizing its local language model capabilities to generate concise and informative summaries. The code below contains the complete implementation, with detailed comments to guide you through each step.
Python

import os
import tempfile
import streamlit as stlit
from langchain_text_splitters import CharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain
from langchain_ollama import OllamaLLM
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_core.documents import Document

# Create the Streamlit app with page configuration, a title, and a file uploader
stlit.set_page_config(page_title="Local Document Summarizer", layout="wide")
stlit.title("Local Document Summarizer")

# File uploader that accepts pdf and txt files only
uploaded_file = stlit.file_uploader("Choose a PDF or Text file", type=["pdf", "txt"])

# Process the uploaded file and extract text from it
def process_file(uploaded_file):
    if uploaded_file.name.endswith(".pdf"):
        with tempfile.NamedTemporaryFile(delete=False) as temp_file:
            temp_file.write(uploaded_file.getvalue())
        loader = PyMuPDFLoader(temp_file.name)
        docs = loader.load()
        extracted_text = " ".join([doc.page_content for doc in docs])
        os.unlink(temp_file.name)
    else:
        # Read the content directly for text files, no need for tempfile
        extracted_text = uploaded_file.getvalue().decode("utf-8")
    return extracted_text

# Process the extracted text and return a summary
def summarize(text):
    # Split the text into chunks for processing and create Document objects
    chunks = CharacterTextSplitter(chunk_size=500, chunk_overlap=100).split_text(text)
    docs = [Document(page_content=chunk) for chunk in chunks]
    # Initialize the LLM with the llama3.2 model and load the summarization chain
    chain = load_summarize_chain(OllamaLLM(model="llama3.2"), chain_type="map_reduce")
    return chain.invoke(docs)

if uploaded_file:
    # Process and preview the uploaded file content
    extracted_text = process_file(uploaded_file)
    stlit.text_area("Document Preview", extracted_text[:1200], height=200)

    # Generate a summary of the extracted text
    if stlit.button("Generate Summary"):
        with stlit.spinner("Summarizing...may take a few seconds"):
            summary_text = summarize(extracted_text)
            stlit.text_area("Summary", summary_text['output_text'], height=400)

Running the App

Save the above code snippet into summarizer.py, then open your terminal, navigate to where you saved the file, and run:

Shell

streamlit run summarizer.py

That should start your Streamlit app and automatically open it in your web browser, pointing to a local URL like http://localhost:8501.

Conclusion

You've just completed the document summarization tool by combining Streamlit's simplicity and Ollama's local model hosting capabilities. This example utilizes the llama3.2 model, but you can experiment with other models to determine what is best for your needs (a small sketch of a model selector follows below), and you can also consider adding support for additional document formats, error handling, and customized summarization parameters. Happy AI experimenting!
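As a hedged sketch of the model experimentation mentioned in the conclusion (not part of the original implementation), you could let users pick the Ollama model from a sidebar control; the model names listed below are assumptions and must already be pulled locally via ollama:

Python

import streamlit as stlit
from langchain_ollama import OllamaLLM
from langchain.chains.summarize import load_summarize_chain

# Hypothetical sidebar selector; any model name pulled with `ollama run <model>` would work
model_name = stlit.sidebar.selectbox("Ollama model", ["llama3.2", "mistral", "gemma2"])

def build_chain(selected_model: str):
    # Same map_reduce summarization chain as above, parameterized by model name
    return load_summarize_chain(OllamaLLM(model=selected_model), chain_type="map_reduce")

The summarize function above could then call build_chain(model_name) instead of hard-coding llama3.2.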
Scanning file uploads for viruses, malware, and other threats is standard practice in any application that processes files from an external source. No matter which antimalware we use, the goal is always the same: to prevent malicious executables from reaching a downstream user (directly, via database storage, etc.) or automated workflow that might inadvertently execute the malicious content.

In this article, we'll discuss the value of quarantining malicious files after they're flagged by an antimalware solution instead of outright deleting them. We'll highlight several APIs Java developers can leverage to quarantine malicious content seamlessly in their application workflow.

Deleting vs. Quarantining Malicious Files

While there's zero debate around whether external files should be scanned for malicious content, there's a bit more room for debate around how malicious files should be handled once antimalware policies flag them.

The simplest (and overall safest) option is to programmatically delete malicious files as soon as they're flagged. The logic for deleting a threat is straightforward: it completely removes the possibility that downstream users or processes might unwittingly execute the malicious content. If our antimalware false positive rate is extremely low — which it ideally should be — we don't need to spend too much time debating whether the file in question was misdiagnosed. We can shoot first and ask questions later.

When we elect to programmatically quarantine malicious files, we take on risk in an already-risky situation — but that risk can yield significant rewards. If we can safely contain a malicious file within an isolated directory (e.g., a secure zip archive), we can preserve the opportunity to analyze the threat and gain valuable insights from it. This is a bit like sealing a potentially venomous snake in a glass container; with a closer look, we can find out if the snake is truly dangerous, misidentified, or an entirely unique specimen that demands further study to adequately understand.

In quarantining a malicious file, we might be preserving the latest update of some well-known and oft-employed black market malware library, or, in cases involving heuristic malware detection policies, we might be capturing an as-of-yet-unseen malware iteration. Giving threat researchers the opportunity to analyze malicious files in a sandbox can, for example, tell us how iterations of a known malware library have evolved, and in the event of a false-positive threat diagnosis, it can tell us that our antimalware solution may need an urgent update. Further, quarantining gives us the opportunity to collect useful data about the attack vectors (in this case, insecure file upload) threat actors are presently exploiting to harm our system.

Using ZIP Archives as Isolated Directories for Quarantine

The simplest and most effective way to quarantine a malicious file is to lock it within a compressed ZIP archive. ZIP archives are well-positioned as lightweight, secure, and easily transferrable isolated directories. After compressing a malicious file in a ZIP archive, we can encrypt the archive to restrict access and prevent accidental execution, and we can apply password-protection policies to ensure only folks with specific privileges can decrypt and "unzip" the archive.

Open-Source APIs for Handling ZIP Compression, Encryption, and Password-Protection in Java

In Java, we have several open-source tools at our disposal for archiving a file securely in any capacity.
We could, for example, use the Apache Commons Compress library to create the initial zip archive that we compress the malicious file in (this library adds some notable features to the standard java.util.zip package), and we could subsequently use a robust cryptography API like Tink (by Google) to securely encrypt the archive. After that, we could leverage another popular library like Zip4j to password protect the archive (it's worth noting we could handle all three steps via Zip4j if we preferred; this library features the ability to create archives, encrypt them with AES or other zip standard encryption methods, and create password protection policies).

Creating a ZIP Quarantine File With a Web API

If open-source technologies won't fit into the scope of our project, another option is to use a single, fully realized zip quarantine API in our Java workflow. This can help simplify the end-to-end quarantining process and mitigate some of the risks involved in handling malicious files by abstracting the entire process to an external server.

Below, we'll walk through how to implement one such solution into our Java project. This solution is free to use with a free API key, and it offers a simple set of parameters for creating a password, compressing a malicious file, and encrypting the archive.

We can install the Java SDK with Maven by first adding a reference to the repository in pom.xml:

XML

<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>

And after that, we can add a reference to the dependency in pom.xml:

XML

<dependencies>
    <dependency>
        <groupId>com.github.Cloudmersive</groupId>
        <artifactId>Cloudmersive.APIClient.Java</artifactId>
        <version>v4.25</version>
    </dependency>
</dependencies>

For a Gradle project, we could instead place the below snippet in our root build.gradle:

Groovy

allprojects {
    repositories {
        ...
        maven { url 'https://jitpack.io' }
    }
}

And we could then add the following dependency in our build.gradle:

Groovy

dependencies {
    implementation 'com.github.Cloudmersive:Cloudmersive.APIClient.Java:v4.25'
}

With installation out of the way, we can copy the import classes at the top of our file:

Java

// Import classes:
//import com.cloudmersive.client.invoker.ApiClient;
//import com.cloudmersive.client.invoker.ApiException;
//import com.cloudmersive.client.invoker.Configuration;
//import com.cloudmersive.client.invoker.auth.*;
//import com.cloudmersive.client.ZipArchiveApi;

Now, we can configure our API key to authorize the zip quarantine request:

Java

ApiClient defaultClient = Configuration.getDefaultApiClient();

// Configure API key authorization: Apikey
ApiKeyAuth Apikey = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
Apikey.setApiKey("YOUR API KEY");
// Uncomment the following line to set a prefix for the API key, e.g. "Token" (defaults to null)
//Apikey.setApiKeyPrefix("Token");

Finally, we can create an instance of the ZipArchiveApi and configure our password, file input, and encryption parameters. We can customize our encryption algorithm by selecting from one of three options: AES-256, AES-128, and PK-Zip (AES-256 is the default value if we leave this parameter empty; PK-Zip is technically a valid option but not recommended). We can then call the API and handle errors via the try/catch block.
Java

ZipArchiveApi apiInstance = new ZipArchiveApi();
String password = "password_example"; // String | Password to place on the Zip file; the longer the password, the more secure
File inputFile1 = new File("/path/to/inputfile"); // File | First input file to perform the operation on.
String encryptionAlgorithm = "encryptionAlgorithm_example"; // String | Encryption algorithm to use; possible values are AES-256 (recommended), AES-128, and PK-Zip (not recommended; legacy, weak encryption algorithm). Default is AES-256.
try {
    Object result = apiInstance.zipArchiveZipCreateQuarantine(password, inputFile1, encryptionAlgorithm);
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling ZipArchiveApi#zipArchiveZipCreateQuarantine");
    e.printStackTrace();
}

After the API returns our quarantined file, we can upload the archive to a cloud-based quarantine repository, transfer it to a virtual machine, or take any number of different actions.

Conclusion

In this article, we discussed the benefits of quarantining malicious files after our antimalware software flags them. We then highlighted several open-source Java libraries that can be collectively used to quarantine malicious files in an encrypted, password-protected zip archive. Finally, we highlighted one fully realized (not open source) web API solution for handling each stage of that process with minimal code.
As organizations embrace Kubernetes for cloud-native applications, managing infrastructure efficiently becomes challenging. Traditional Infrastructure as Code (IaC) tools like Terraform, Pulumi, and others provide declarative configurations but lack seamless integration into Kubernetes-native workflows. Crossplane effectively bridges the gap between Kubernetes and cloud infrastructure in this situation. In this blog, we'll explore how Crossplane enables IaC for Kubernetes and beyond.

What Is Crossplane?

Crossplane is an open-source Kubernetes add-on that enables you to provision and manage cloud infrastructure using Kubernetes Custom Resource Definitions (CRDs) and the Kubernetes API. Unlike traditional IaC tools that require external execution (like Terraform scripts being run externally), Crossplane embeds infrastructure management into Kubernetes. This makes it truly declarative and GitOps-friendly.

Use Cases: Terraform vs. Crossplane

When to Use Terraform?

- Best for managing infrastructure outside Kubernetes
- Ideal for traditional multi-cloud deployments and VMs
- Strong ecosystem with extensive modules and providers
- Works well with tools like Ansible, Packer, and Vault for automation

When to Use Crossplane?

- Best for Kubernetes-centric environments
- Ideal for GitOps workflows (ArgoCD, Flux)
- Enables self-service provisioning via Kubernetes CRDs
- Good for multi-cloud Kubernetes control (managing cloud services via the K8s API)

Getting Started With Crossplane

For this sample, we will use minikube, but the same steps can be applied to any Kubernetes cluster.

Step 1: Deploy MySQL in Kubernetes

1. Deploy MySQL as a Deployment with a Service so that it can be configured using Crossplane. You can also use MySQL deployed at another location.

2. Define a mysql-deployment.yaml, which creates the secret, deployment, and service required to run MySQL.

YAML

apiVersion: v1
kind: Secret
metadata:
  name: mysql-root-password
type: Opaque
data:
  password: cGFzc3dvcmQ= # Base64 encoded "password"
---
apiVersion: v1
kind: Service
metadata:
  name: mysql-service
spec:
  selector:
    app: mysql
  ports:
    - protocol: TCP
      port: 3306
      targetPort: 3306
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mysql
spec:
  selector:
    matchLabels:
      app: mysql
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
        - image: mysql:8.0
          name: mysql
          env:
            - name: MYSQL_ROOT_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: mysql-root-password
                  key: password
          ports:
            - containerPort: 3306
              name: mysql

3. Apply the YAML using the command kubectl apply -f mysql-deployment.yaml.

4. Verify the pods are up using the command kubectl get pods.

5. Verify the MySQL connection by starting a temporary SQL pod to check the MySQL deployment. Create the client by using the command kubectl run mysql-client --image=mysql:8.0 -it --rm -- bash.

6. Connect to MySQL inside the pod by using the command mysql -h mysql-service.default.svc.cluster.local -uroot -ppassword.

Step 2: Install Crossplane on Kubernetes

1. Install Crossplane using Helm:

Shell

kubectl create namespace crossplane-system
helm repo add crossplane-stable https://charts.crossplane.io/stable
helm repo update
helm install crossplane crossplane-stable/crossplane --namespace crossplane-system

Note: Crossplane takes a few minutes to come up.

2. Verify the Crossplane installation using the command kubectl get pods -n crossplane-system.

Step 3: Install the Crossplane Provider for SQL

1. Define a MySQL provider using the below YAML content.
YAML

apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-sql
spec:
  package: xpkg.upbound.io/crossplane-contrib/provider-sql:v0.9.0

2. Create the provider using the command kubectl apply -f provider.yaml.

3. Verify the provider using the following commands: kubectl get pods -n crossplane-system and kubectl get providers.

Note: SQL providers take a few minutes to come up.

Step 4: Configure the Crossplane MySQL Provider

The provider configuration tells Crossplane how to authenticate with MySQL. Define the secret to be created for provider usage, updating the stringData accordingly in the below YAML. Apply the YAML using kubectl apply -f mysql-secret.yaml.

YAML

apiVersion: v1
kind: Secret
metadata:
  name: mysql-conn-secret
  namespace: default
type: Opaque
stringData:
  credentials: "root:password@tcp(mysql-service.default.svc.cluster.local:3306)"
  username: "root"
  password: "password"
  endpoint: "mysql-service.default.svc.cluster.local"
  port: "3306"

Apply the below provider configuration for Crossplane, which uses the above secret. Apply it using the command kubectl apply -f providerconfig.yaml.

YAML

apiVersion: mysql.sql.crossplane.io/v1alpha1
kind: ProviderConfig
metadata:
  name: mysql-provider
spec:
  credentials:
    source: MySQLConnectionSecret
    connectionSecretRef:
      name: mysql-conn-secret
      namespace: default

Verify the provider config creation using the commands kubectl get providerconfigs.mysql.sql.crossplane.io and kubectl get crds | grep providerconfig.

Step 5: Create a MySQL Database Using Crossplane

Now, use Crossplane to provision a new database. Use the below YAML and apply it using kubectl apply -f mysqlinstance.yaml.

YAML

apiVersion: mysql.sql.crossplane.io/v1alpha1
kind: Database
metadata:
  name: my-database
spec:
  providerConfigRef:
    name: mysql-provider
  forProvider:
    binlog: true
  writeConnectionSecretToRef:
    name: db-conn
    namespace: default

Step 6: Verify the Database Creation

Verify the database creation using the command kubectl get database.mysql.sql.crossplane.io/my-database. Use the same verification steps mentioned in Step 1 to connect to MySQL and verify the creation of the database.

With the above steps, you have installed Crossplane, configured the MySQL provider, and used Crossplane to provision a database.

Can Terraform and Crossplane Work Together?

Terraform and Crossplane can be used together in many scenarios.

Scenario 1

In a complete IaC scenario, Terraform can be used to bootstrap Kubernetes clusters, and then Crossplane can be used to manage cloud resources from within Kubernetes. Terraform can also deploy Crossplane itself. A hybrid workflow example: Terraform provisions the Kubernetes cluster in any cloud provider, while Crossplane manages cloud services (databases, storage, and networking) using Kubernetes CRDs.

Scenario 2

Crossplane also supports a Terraform provider, which can be used to run Terraform scripts as part of Crossplane's IaC model. Running a Terraform provider for Crossplane can be useful in several scenarios where Crossplane's native providers do not yet support certain cloud resources or functionalities.
Following are the reasons to run a Terraform provider for Crossplane:

- Terraform has a vast ecosystem of providers, supporting many cloud services that Crossplane may not yet have native providers for.
- When an organization already uses Terraform for infrastructure management, there is no need to rewrite everything in Crossplane CRDs.
- Crossplane supports multi-cloud management, but its native providers may not cover every on-premise or SaaS integration.
- For organizations looking to gradually transition from Terraform to Crossplane, using Terraform providers within Crossplane can act as a hybrid solution before full migration.
- Running Terraform inside Crossplane brings Terraform under Kubernetes' declarative GitOps model.

Steps to Create an IBM Cloud Cloudant DB Using Crossplane

Step 1: Define the Terraform provider.

YAML

apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-terraform
spec:
  package: xpkg.upbound.io/upbound/provider-terraform:v0.19.0

Step 2: Configure the provider.

YAML

apiVersion: tf.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  name: terraform-provider-ibm
spec: {}

Step 3: Provision a Cloudant DB in IBM Cloud by using Terraform scripts as part of Crossplane.

YAML

apiVersion: tf.upbound.io/v1beta1
kind: Workspace
metadata:
  name: ibm-cloudant-db
spec:
  providerConfigRef:
    name: terraform-provider-ibm
  writeConnectionSecretToRef:
    name: ibmcloud-terraform-secret
    namespace: crossplane-system
  forProvider:
    source: Inline
    module: |
      terraform {
        required_providers {
          ibm = {
            source = "IBM-Cloud/ibm"
          }
        }
        backend "kubernetes" {
          secret_suffix = "ibmcloud-terraform-secret"
          namespace     = "crossplane-system"
        }
      }

      provider "ibm" {
        ibmcloud_api_key = var.ibmcloud_api_key
      }

      resource "ibm_cloudant" "cloudant_instance" {
        name     = "crossplanecloudant"
        location = "us-south"
        plan     = "lite"
      }

      variable "ibmcloud_api_key" {
        type = string
      }
    vars:
      - key: ibmcloud_api_key
        value: "<Your IBM Cloud API Key>"

This provisions a Cloudant DB named crossplanecloudant in IBM Cloud.

How Crossplane Fits Into Platform Engineering

Platform engineering focuses on building and maintaining internal developer platforms (IDPs) that simplify infrastructure management and application deployment. Crossplane plays a significant role in this by enabling a Kubernetes-native approach: it ensures declarative, self-service, and policy-driven provisioning of cloud resources. Crossplane features such as declarative infrastructure with K8s APIs, custom abstractions for infra and apps, security and compliance guardrails, version-controlled and automated deployments, and continuous drift correction all support platform engineering.

Conclusion

Crossplane transforms how we manage cloud infrastructure by bringing IaC into the Kubernetes ecosystem. Its use of Kubernetes APIs enables a truly declarative and GitOps-driven approach to provisioning and managing cloud resources. If you're already using Kubernetes and looking to modernize your IaC strategy, Crossplane is definitely worth exploring.
DZone events bring together industry leaders, innovators, and peers to explore the latest trends, share insights, and tackle industry challenges. From Virtual Roundtables to Fireside Chats, our events cover a wide range of topics, each tailored to provide you, our DZone audience, with practical knowledge, meaningful discussions, and support for your professional growth.

DZone Events Happening Soon

Below, you'll find upcoming events that you won't want to miss.

What to Consider When Building an IDP

Date: March 4, 2025
Time: 1:00 PM ET
Register for Free!

Is your development team bogged down by manual tasks and "TicketOps"? Internal Developer Portals (IDPs) streamline onboarding, automate workflows, and enhance productivity—but should you build or buy? Join Harness and DZone for a webinar to explore key IDP capabilities, compare Backstage vs. managed solutions, and learn how to drive adoption while balancing cost and flexibility.

DevOps for Oracle Applications with FlexDeploy: Automation and Compliance Made Easy

Date: March 11, 2025
Time: 1:00 PM ET
Register for Free!

Join Flexagon and DZone as Flexagon's CEO unveils how FlexDeploy is helping organizations future-proof their DevOps strategy for Oracle Applications and Infrastructure. Explore innovations for automation through compliance, along with real-world success stories from companies that have adopted FlexDeploy.

Make AI Your App Development Advantage: Learn Why and How

Date: March 12, 2025
Time: 10:00 AM ET
Register for Free!

The future of app development is here, and AI is leading the charge. Join OutSystems and DZone, on March 12th at 10 AM ET, for an exclusive webinar with Luis Blando, CPTO of OutSystems, and John Rymer, industry analyst at Analysis.Tech, as they discuss how AI and low-code are revolutionizing development. You will also hear from David Gilkey, Leader of Solution Architecture, Americas East at OutSystems, and Roy van de Kerkhof, Director at NovioQ. This session will give you the tools and knowledge you need to accelerate your development and stay ahead of the curve in the ever-evolving tech landscape.

Developer Experience: The Coalescence of Developer Productivity, Process Satisfaction, and Platform Engineering

Date: March 12, 2025
Time: 1:00 PM ET
Register for Free!

Explore the future of developer experience at DZone's Virtual Roundtable, where a panel will dive into key insights from the 2025 Developer Experience Trend Report. Discover how AI, automation, and developer-centric strategies are shaping workflows, productivity, and satisfaction. Don't miss this opportunity to connect with industry experts and peers shaping the next chapter of software development.

Unpacking the 2025 Developer Experience Trends Report: Insights, Gaps, and Putting it into Action

Date: March 19, 2025
Time: 1:00 PM ET
Register for Free!

We've just seen the 2025 Developer Experience Trends Report from DZone, and while it shines a light on important themes like platform engineering, developer advocacy, and productivity metrics, there are some key gaps that deserve attention. Join Cortex Co-founders Anish Dhar and Ganesh Datta for a special webinar, hosted in partnership with DZone, where they'll dive into what the report gets right—and challenge the assumptions shaping the DevEx conversation. Their take? Developer experience is grounded in clear ownership. Without ownership clarity, teams face accountability challenges, cognitive overload, and inconsistent standards, ultimately hampering productivity.
Don't miss this deep dive into the trends shaping your team's future.

Accelerating Software Delivery: Unifying Application and Database Changes in Modern CI/CD

Date: March 25, 2025
Time: 1:00 PM ET
Register for Free!

Want to speed up your software delivery? It's time to unify your application and database changes. Join us for Accelerating Software Delivery: Unifying Application and Database Changes in Modern CI/CD, where we'll teach you how to seamlessly integrate database updates into your CI/CD pipeline.

Petabyte Scale, Gigabyte Costs: Mezmo's ElasticSearch to Quickwit Evolution

Date: March 27, 2025
Time: 1:00 PM ET
Register for Free!

For Mezmo, scaling their infrastructure meant facing significant challenges with ElasticSearch. That's when they made the decision to transition to Quickwit, an open-source, cloud-native search engine designed to handle large-scale data efficiently. This is a must-attend session for anyone looking for insights on improving search platform scalability and managing data growth.

What's Next?

DZone has more in store! Stay tuned for announcements about upcoming Webinars, Virtual Roundtables, Fireside Chats, and other developer-focused events. Whether you're looking to sharpen your skills, explore new tools, or connect with industry leaders, there's always something exciting on the horizon. Don't miss out — save this article and check back often for updates!
Large language models (LLMs) have impacted natural language processing (NLP) by introducing advanced applications such as text generation, summarization, and conversational AI. Models like ChatGPT use a specific neural architecture called a transformer to predict the next word in a sequence, learning from enormous text datasets through self-attention mechanisms. This guide breaks down the step-by-step process for training generative AI models, including pre-training, fine-tuning, alignment, and practical considerations.

Overview of the Training Pipeline

Figure 1: Overview of LLM Training Pipeline

The training pipeline for LLMs is a structured, multi-phase process designed to enhance linguistic understanding, task-specific capabilities, and alignment with human preferences.

- Data collection and preprocessing. Vast text data from diverse sources is collected, cleaned, tokenized, and normalized to ensure quality. High-quality, domain-specific data improves factual accuracy and reduces hallucinations.
- Pre-training. This is the foundational stage where the model learns general language patterns through self-supervised learning, a technique for the model to teach itself patterns in text without needing labeled examples (take, for example, next token prediction). This phase relies on massive datasets and transformer architectures to build broad linguistic capabilities.
- Instruction fine-tuning. The model is trained on smaller, high-quality input-output datasets to specialize in specific tasks or domains. This instruction tuning step ensures more accurate and contextually appropriate outputs.
- Model alignment. Reinforcement learning from human feedback (RLHF) refines the model's behavior:
  - Reward model training. Human evaluators rank outputs to train a reward model.
  - Policy optimization. The LLM is iteratively optimized to align with human preferences, ethical considerations, and user expectations.
- Evaluation and iterative fine-tuning. The model is tested on unseen data to evaluate metrics like accuracy and coherence. Further fine-tuning may follow to adjust hyperparameters or incorporate new data.
- Downstream application adaptation. The trained LLM is adapted for real-world applications (e.g., chatbots, content generation) through additional fine-tuning or integration with task-specific frameworks.

This pipeline transforms LLMs from general-purpose models into specialized tools capable of addressing diverse tasks effectively.

1. Pre-Training

Pre-training is the foundational stage in the development of LLMs, where a model learns general language patterns and representations from vast amounts of text data. This phase teaches the model grammar rules, contextual word relationships, and basic logical patterns (e.g., cause-effect relationships in text), forming the basis for its ability to perform diverse downstream tasks.

How Pre-Training Works

Figure 2: High-level overview of the pre-training stage

Objective

The primary goal of pre-training is to enable the model to predict the next token in a sequence. This is achieved through causal language modeling (CLM), a way to teach the model to predict what comes next in a sentence. In this step, the model learns to generate coherent and contextually relevant text by looking only at the past tokens.

Datasets

Pre-training requires massive and diverse datasets sourced from books, articles, websites, and other publicly available content. Popular datasets include Common Crawl, Wikipedia, The Pile, and BookCorpus.
These datasets are often cleaned and normalized to ensure high-quality input, with techniques like deduplication and tokenization applied during preprocessing. Long-context data is curated to increase the context length of the model.

Pre-Training Process

The model learns to predict the next token in a sequence through causal language modeling. The model's predictions are compared to the actual next words using a cross-entropy loss function, which measures model performance during training. Model parameters are continuously adjusted to minimize prediction errors (loss) until the model reaches an acceptable accuracy level.

The pre-training phase requires significant computational resources, often utilizing thousands of GPU hours across distributed systems to process the massive datasets needed for effective training. This is a self-supervised learning approach where the model learns patterns directly from raw text without manual labels, eliminating costly human annotation by having the model predict next tokens.

In the following example, we use a GPT-2 model, which was pre-trained on a very large corpus of English data in a self-supervised fashion with no human labeling involved.

Python

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

input_text = "The capital of France is"

# Tokenize the input text
model_inputs = tokenizer([input_text], return_tensors="pt")

# Run inference on the pretrained model and decode the output
generated_ids = model.generate(**model_inputs, max_new_tokens=25, do_sample=True)
print(tokenizer.batch_decode(generated_ids)[0])

As expected, the model is able to complete the sentence "The capital of France is" by iteratively predicting the next token as per its pre-training.

Plain Text

The capital of France is the city of Paris which is more prosperous than the other capitals in ...

However, when phrased as a question, i.e., "What is the capital of France?" the model fails to produce the correct result because, at this stage of the training, it can't follow instructions yet.

Python

text2 = "What is the capital of France?"
model_inputs = tokenizer([text2], return_tensors="pt")
generated_ids = model.generate(**model_inputs, max_new_tokens=25, do_sample=True)
print(tokenizer.batch_decode(generated_ids)[0])

Output:

Plain Text

What is the capital of France? In our opinion we should be able to count the number of people in France today. The government has made this a big priority

Benefits of Pre-Training

- Broad language understanding. By training on diverse data, pre-trained models develop a comprehensive grasp of language structures and patterns, enabling them to generalize across various tasks.
- Efficiency. Pre-trained models can be fine-tuned for specific tasks with smaller labeled datasets, saving time and resources compared to training models from scratch for each task.
- Performance. Models that undergo pre-training followed by fine-tuning consistently outperform those trained solely on task-specific data due to their ability to leverage knowledge from large-scale datasets.

2. Instruction Fine-Tuning

Instruction fine-tuning is a specialized training technique that transforms general-purpose LLMs into responsive, instruction-following systems. Here, the model is trained on specific tasks like answering questions or summarizing text.
By training models on curated (instruction, output) pairs, this method aligns LLMs' text generation capabilities with human-defined tasks and conversational patterns. A training (instruction, output) sample looks like this:

Plain Text

Instruction: What is the capital of Germany?
Response: The capital of Germany is Berlin.

Figure 3: Instruction fine-tuning stage

In the following example, we load the Gemma 2 LLM model from Google, which is instruction-tuned on a variety of text generation tasks, including question answering, summarization, and reasoning.

Python

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load Gemma 2 2b instruct model
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b-it")

# Tokenize input
input_text = "What is the capital of France?"
input_ids = tokenizer(input_text, return_tensors="pt")

# Run model inference and decode output
outputs = model.generate(**input_ids, max_new_tokens=25, do_sample=True)
print(tokenizer.decode(outputs[0]))

This fine-tuned model is able to follow instructions:

Plain Text

What is the capital of France? The capital of France is Paris.

How Instruction Fine-Tuning Works

Objective

Instruction fine-tuning bridges the critical gap between an LLM's fundamental next-word prediction capability and practical task execution by teaching models to understand and follow natural language instructions. This process transforms general-purpose LLMs into responsive, instruction-following systems that consistently follow user commands like "Summarize this article" or "Write a Python function for X."

Supervised Learning

Unlike pre-training, which uses self-supervised learning on unlabeled data, instruction fine-tuning employs supervised learning with labeled instruction-output pairs. The process involves:

- Using explicit instruction-response pairs for training
- Updating model weights to optimize for instruction following
- Maintaining the model's base knowledge while adapting response patterns

Dataset

The instruction dataset consists of three key components:

- Instruction – natural language command or request
- Input – optional context or examples
- Output – desired response demonstrating correct task execution

Plain Text

Instruction: Find the solution to the quadratic equation.
Context: 3x² + 11x - 4 = 0
Response: The solution of the quadratic equation is x = -4 and x = 1/3.

These datasets can be created through manual curation by domain experts, synthetic generation using other LLMs, or conversion of existing labeled datasets into instruction format.

Fine-Tuning Techniques

Two primary approaches dominate instruction fine-tuning:

- Full model fine-tuning updates all model parameters, offering better performance for specific tasks at the cost of higher computational requirements.
- Lightweight adaptation methods (like LoRA) modify small parts of the model instead of retraining everything, significantly reducing memory requirements (see the sketch at the end of this section).

Benefits of Instruction Fine-Tuning

- Enhanced task generalization. Models develop meta-learning capabilities, improving performance on novel tasks without specific training.
- Reduced prompt engineering. Fine-tuned models require fewer examples in prompts, making deployment more efficient.
- Controlled output. Enables precise customization of response formats and styles.
- Better instruction following. Bridges the gap between model capabilities and user expectations.
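To illustrate the lightweight adaptation approach referenced above, here is a minimal, hedged sketch using the Hugging Face peft library to attach LoRA adapters to a small base model before instruction fine-tuning; the rank, alpha, and target module names are illustrative assumptions that depend on the model architecture:

Python

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Small base model used purely for illustration
base_model = AutoModelForCausalLM.from_pretrained("gpt2")

# LoRA trains small adapter matrices while the base weights stay frozen
lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update (assumption)
    lora_alpha=16,              # scaling factor (assumption)
    target_modules=["c_attn"],  # attention projection in GPT-2 (model-specific assumption)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically a small fraction of the full parameter count

The wrapped model can then be trained on (instruction, output) pairs with a standard supervised fine-tuning loop or trainer.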
3. Alignment Tuning

Alignment or preference tuning is a critical phase in training LLMs to ensure the model avoids harmful or biased responses. This step goes beyond improving performance on specific tasks: it focuses on making models safer, more helpful, and user-aligned by incorporating human feedback or predefined guidelines.

Why Alignment Is Necessary

Pre-trained LLMs are trained on massive datasets from the internet, which may contain biases, harmful content, or conflicting information. Without alignment, these models might give answers that are offensive or misleading. Alignment tuning filters harmful outputs (e.g., biased or dangerous content) using human feedback to ensure responses comply with safety guidelines. The following example from OpenAI's GPT-4 System Card shows the safety challenges that arise from the non-aligned "GPT-4 (early)" model.

Figure 4: Safety risks in the pre-alignment version of the "GPT-4 (early)" model

The GPT-4 system card highlights the importance of fine-tuning the model using RLHF to align the model's responses with human preferences for helpfulness and harmlessness. It mitigates unsafe behavior and prevents the model from generating harmful content and biases.

Key Methods for Alignment

The following diagram from the DPO paper illustrates the most commonly used methods:

Figure 5: (Left) RLHF workflow showing human feedback integration. (Right) DPO skips reward modeling to directly align responses

Reinforcement Learning from Human Feedback (RLHF)

RLHF is a machine learning technique designed to align LLMs with human values, preferences, and expectations. By incorporating human feedback into the training process, RLHF enhances the model's ability to produce outputs that are coherent, useful, ethical, and aligned with user intent. This method has been crucial for making generative models like ChatGPT and Google Gemini safer and more reliable. The RLHF process consists of three main steps:

Step | Description | Outcome
Human feedback | Annotators rank outputs for relevance/ethics | Preference dataset creation
Reward model | Trained to predict human preferences | Quality scoring system
Policy optimization | LLM fine-tuned via reinforcement learning (e.g., PPO) | Aligned response generation

- Collecting human feedback. Human annotators evaluate model-generated outputs by ranking or scoring them based on criteria such as relevance, coherence, and accuracy. Pairwise comparisons are commonly used, where annotators select the better response between two options. This feedback forms a "preference dataset" that reflects human judgment.
- Training a reward model. A reward model is trained using the preference dataset to predict how well a given response aligns with human preferences. The reward model assigns a scalar reward score (say, 0 to 10) to outputs, which is later used to train the LLM to prioritize high-scoring responses.
- Fine-tuning with reinforcement learning. The LLM is fine-tuned using reinforcement learning algorithms like Proximal Policy Optimization (PPO), which teaches the model to improve gradually rather than making dramatic changes all at once. The reward model guides this process by providing feedback on generated outputs, enabling the LLM to optimize its policy for producing high-reward responses.

Direct Preference Optimization (DPO)

Direct Preference Optimization (DPO) is an emerging training method designed to align LLMs with human preferences. It serves as a simpler and more efficient alternative to RLHF, bypassing the need for complex reinforcement learning algorithms like Proximal Policy Optimization (PPO). Instead, DPO skips reward modeling by directly training the LLM on human-ranked responses. The preference data generation process remains the same as highlighted in the RLHF method above. The DPO process consists of:

- Direct optimization. Unlike RLHF, which trains a reward model and uses reinforcement learning, DPO directly fine-tunes the LLM to produce outputs that maximize alignment with the ranked preferences. This is achieved by directly training the model to favor high-ranked responses and avoid low-ranked ones (a minimal sketch of this objective follows below).
- Model training. The optimization process adjusts the model's parameters to prioritize generating responses that align with human preferences, without requiring iterative policy updates as in RLHF.
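To make the direct optimization step concrete, here is a minimal, hedged sketch of the DPO objective in PyTorch; it assumes you already have summed log-probabilities of each chosen and rejected response under both the policy model and a frozen reference model, and beta is an illustrative hyperparameter:

Python

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # How much more (or less) the policy prefers each response than the reference model does
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the margin between chosen and rejected responses to be positive
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Illustrative tensors standing in for per-example sequence log-probabilities
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.5]))
print(loss)

In practice, libraries that implement DPO wrap this objective with batching, reference-model handling, and logging, but the core loss is this single comparison.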
Model alignment has been successfully applied across various domains:

- Conversational AI. Aligning chatbots with user expectations for tone, relevance, and ethical standards.
- Content generation. Optimizing models for tasks like summarization or creative writing based on user-defined quality metrics.
- Ethical AI development. Ensuring models adhere to guidelines for fairness, safety, and inclusivity without extensive computational overhead.

Conclusion

This guide shows you the nuts and bolts of LLM training. Are you ready to dive in? Many open-source models and datasets are waiting for you to experiment with and adapt them to solve your specific problems.
In the world of data, choosing the right SQL database can make or break your organization's success. With several options available, database selection is a crucial decision that can shape the performance, scalability, and efficiency of your data platform. Finding the perfect fit for your specific needs requires careful consideration of various factors and taking time to understand different database types.

This article guides you through the process of selecting a SQL database. We'll explore the main types of SQL databases, discuss key factors to consider when making your choice, and take a look at some popular options in the market. By the end, you'll have a clearer picture of how to pick a database that aligns with your project requirements and business goals — setting you up for better data management and analysis.

Understanding SQL Database Types

SQL databases have evolved over time to meet diverse data management needs. We'll explore three main types of SQL databases: relational databases, object-relational databases, and NewSQL databases.

Relational Databases

Relational databases are the foundation of SQL database systems. They organize data into tables consisting of rows and columns. Each table represents a specific entity, like customers or orders, and the columns define the attributes of that entity. This structured approach allows for efficient data storage and retrieval.

One of the key features of relational databases is the use of primary and foreign keys. A primary key uniquely identifies each record in a table, while foreign keys establish relationships between tables. This interconnected structure enables complex queries and data analysis across multiple tables.

Relational databases excel at maintaining data integrity through the implementation of ACID (atomicity, consistency, isolation, durability) properties. These properties ensure that transactions are processed reliably and data remains accurate and consistent.

Popular examples of relational databases include MySQL, Oracle Database, and Microsoft SQL Server. These systems have a long-standing reputation for reliability and are widely used in various industries.

Object-Relational Databases

Object-relational databases bridge the gap between traditional relational databases and object-oriented programming concepts. They combine the structured data storage of relational databases with the flexibility of object-oriented models.

These databases support complex data types and allow for the storage of objects directly within the database schema. This capability makes them particularly useful for applications that deal with complex data structures or require seamless integration with object-oriented programming languages.

PostgreSQL is a prime example of an object-relational database management system. It offers the benefits of a relational database while providing support for user-defined objects and table inheritance. This combination of features makes PostgreSQL a versatile choice for applications that need to handle diverse data types and complex relationships.

NewSQL Databases

NewSQL databases (like Apache Trafodion, Clustrix, Google Spanner, MySQL Cluster, etc.) represent the latest evolution in SQL database technology. They aim to provide the scalability and performance benefits of NoSQL databases while maintaining the ACID compliance and relational structure of traditional SQL databases. These databases are designed to handle large-scale, distributed environments and high-concurrency workloads.
They achieve this through various architectural advancements, including:

- Distributed architecture. NewSQL databases can scale horizontally across multiple servers, allowing them to handle massive datasets and concurrent transactions efficiently.
- In-memory storage. By utilizing main memory for data storage, NewSQL databases can significantly improve read and write operations, enhancing overall performance.
- ACID compliance. Despite their distributed nature, NewSQL databases maintain strict ACID properties, ensuring data integrity and consistency in complex transactional scenarios.

NewSQL databases are particularly well-suited for applications that require real-time analytics, high-volume transaction processing, and strong data consistency. While NewSQL databases offer impressive capabilities, it's important to note they may have a steeper learning curve compared to traditional relational databases — and since the ecosystem of tools and services supporting NewSQL is still developing, it could impact integration with your existing infrastructure.

Key Factors in SQL Database Selection

When choosing the right SQL database for your project, several key factors require careful consideration because of the impact they can have on the performance, scalability, and overall success of your data platform. Let's explore the critical aspects to evaluate during the database selection process.

Data Model and Schema

The data model and schema play a crucial role in database selection. It's essential to thoroughly understand the structure of your data and how it will be organized within the database. Here's what to consider:

- Analyze your data requirements and create a comprehensive data dictionary that defines every column of information you plan to store.
- Separate your data into logical tables and columns, aiming for a structure that makes sense and minimizes redundancy across tables.
- Plan the constraints for each table, including primary keys, foreign keys, and their formats (single-column or multi-column).
- Choose appropriate data types for your columns, keeping in mind that columns with foreign key relationships must share the same data type as the parent column.
- Consider the specific requirements of your chosen database system. For example, some databases may have recommendations for time-related data types or primary key constraints.

By carefully designing your data model and schema, you can ensure your chosen SQL database aligns with your project's needs and supports efficient data management.

Scalability Requirements

Scalability has a significant influence on database selection — after all, it determines how well your system can accommodate growth. Here's what you'll want to consider when evaluating scalability:

- Assess your project's expected growth and how well the database can handle expansion.
- Understand the differences between vertical and horizontal scaling. Vertical scaling involves increasing the capacity of a single server, while horizontal scaling adds more servers to the system.
- Evaluate the database's ability to scale horizontally, especially if you anticipate rapid growth or high-traffic workloads.
- Consider the trade-offs between different database types. For example, traditional relational databases may struggle with horizontal scaling, while NewSQL databases often excel in this area.
- Consider the trade-offs between different database types. For example, traditional relational databases may struggle with horizontal scaling, while NewSQL databases often excel in this area.
- Explore NewSQL databases, which aim to combine the scalability of NoSQL with the transactional consistency of relational databases.
- Assess the database's performance under increasing data volumes and traffic loads to ensure it can meet your scalability requirements.

Performance Needs

Performance has a direct impact on user experience and is a critical factor in database selection. Consider the following aspects when evaluating performance:

- Analyze your project's specific performance requirements, including query efficiency and the balance between read and write operations.
- Assess the database's ability to efficiently handle complex queries, joins, and aggregations.
- Consider the performance characteristics of different database types. For example, NoSQL databases may offer faster write speeds, while relational databases excel at complex queries.
- Evaluate the database's ability to handle high-volume write operations, especially for applications that generate constant data updates.
- Assess the database's support for indexing and query optimization techniques to enhance performance.
- Consider the impact of data volume on query performance and how well the database scales as data grows.
- Evaluate the database's ability to handle concurrent operations and maintain performance under heavy loads.

By carefully considering these key factors — data model and schema, scalability requirements, and performance needs — you can make an informed decision when selecting a SQL database. This ensures your chosen database aligns with your project's specific requirements and supports your data platform's (and your organization's) long-term success.

Popular SQL Database Options

When it comes to database selection, several SQL database options stand out in the market. Each has its unique features and strengths, making them suitable for different use cases. Let's explore some of the most popular SQL database options to help you make an informed decision for your data platform.

MySQL

MySQL has established itself as a leading open-source relational database management system. Its popularity stems from its reliability, ease of use, and scalability. MySQL has a significant impact on web applications, powering many of the world's largest websites and applications, including Twitter, Facebook, Netflix, and Spotify.

One of MySQL's key advantages is its user-friendly nature. Getting started with MySQL is relatively straightforward, thanks to its comprehensive documentation and large community of developers. The abundance of MySQL-related resources online further supports its ease of use.

MySQL was designed with a focus on speed and reliability. While it may not fully adhere to standard SQL, MySQL developers continuously work towards closer compliance. To bridge this gap, MySQL offers various SQL modes and extensions that bring it closer to standard SQL functionality.

Unlike some other database systems, MySQL operates through a separate daemon process. This architecture allows for greater control over database access, enhancing security and management capabilities.

PostgreSQL

PostgreSQL, often referred to as Postgres, bills itself as "the most advanced open-source relational database in the world." It was created with the goal of being highly extensible and standards-compliant.
PostgreSQL is an object-relational database, combining the structured data storage of relational databases with the flexibility of object-oriented models. One of PostgreSQL's standout features is its ability to handle complex data structures efficiently. It supports user-defined objects and table inheritance, making it particularly useful for applications that deal with diverse data types and complex relationships.

PostgreSQL excels in handling concurrent tasks (more commonly referred to as concurrency). It achieves this without read locks thanks to its implementation of Multiversion Concurrency Control (MVCC) — which also ensures ACID compliance.

In addition to supporting standard numeric, string, and date/time data types, PostgreSQL offers support for geometric shapes, network addresses, bit strings, text searches, and JSON entries. This versatility makes PostgreSQL a powerful choice for a wide range of database applications.

All Your SQL Needs in One Database

Choosing a SQL database has a significant impact on the success of your data platform. By considering factors including data model, scalability, and performance needs, organizations can select a database that aligns with their project requirements and business goals. This thoughtful approach to database selection sets the stage for efficient data management and analysis, enabling businesses to leverage their data effectively.

In the end, the right SQL database empowers organizations to handle their data needs efficiently and securely. Whether it's MySQL's user-friendly nature, PostgreSQL's advanced features, or SQL Server's integration capabilities, each option offers unique strengths. By understanding these options and matching them with specific project needs, businesses can build a strong foundation for their data-driven initiatives and stay competitive in today's data-centric world.
Vision AI models have a flaw. When shown a medical scan, they might correctly diagnose a condition while citing anatomically impossible reasons. Or they might solve a geometry problem with the right answer but skip essential theorems and rely on made-up ones instead. These models reach correct conclusions through reasoning that makes no sense.

The Gap in Visual Reasoning Models

This hints at a deeper problem. Current models don't really think through visual problems — they pattern-match their way to answers. The LlamaV-o1 team discovered this by doing something simple: they forced their model to show its work. The results revealed that most visual reasoning errors don't come from failing to see what's in an image. They come from skipping key logical steps between seeing and concluding.

This gap between seeing and reasoning matters. A model that gets the right answer through wrong reasoning is like a student who memorizes solutions without understanding the principles. It will fail unpredictably when faced with new problems.

The solution turns out to require rethinking how we train these models. Today's standard approach gives a model an image and a question, then trains it to predict the correct answer. This works well enough to pass many benchmarks. But it's like teaching a student to recognize answer patterns without understanding the underlying concepts: training to answer physics problems by memorizing flashcards with the problem on the front and a single number, the answer, on the back.

LlamaV-o1's Training Approach

LlamaV-o1, the top new vision question-answering paper on AIModels.fyi today, takes a different path. The training process is divided into two stages. In Stage 1, the model is trained simultaneously on summarization and caption generation using samples from the PixMo and Geo170K datasets. Stage 2 then builds upon this foundation to handle detailed reasoning and final answer generation using the Llava-CoT dataset. Each reasoning step must be explicit and verifiable, so the model can't take shortcuts. This mirrors the chain-of-thought approach common in many language models.

Figure 1

Figure 1 shows how LlamaV-o1 outperforms Gemini-1.5-Flash and Claude-3.5-Sonnet on a pattern recognition task from VRC-Bench. Claude-3.5-Sonnet picks "none of the options" but doesn't fully match the observed logic. Gemini-1.5-Flash also shows weaker coherence. LlamaV-o1 identifies the correct option (D) by following the pattern, proving its stronger reasoning ability.

There are three key technical advances in this paper that I think are worth pointing out. First is the introduction of VRC-Bench, a comprehensive benchmark specifically designed to evaluate multi-step reasoning tasks. Second is a novel metric that assesses visual reasoning quality at the granularity of individual steps. And third is the curriculum learning approach with beam search optimization.

Figure 2

Figure 2 highlights the variety of tasks in VRC-Bench, each requiring step-by-step reasoning. The examples span geometry (calculating angles with linear pairs), chemistry (identifying ethane from its molecular structure), chart analysis (pie charts for global energy reserves), art recognition (identifying historical paintings), sports classification, medical diagnosis (classifying tissue types), and advertisement analysis (extracting product names). Each task forces the model to explain its logical steps, from understanding the prompt to arriving at the answer.

Let's start by talking about curriculum learning with explicit reasoning supervision. The model trains in stages, mastering basic visual perception before attempting complex reasoning. At each stage, it must generate specific intermediate outputs — like describing what it sees, identifying relevant elements for the current question, and explaining each logical step. The training data contains over 4,000 manually verified reasoning steps to ensure the model learns valid reasoning patterns.

The second advance is an efficient implementation of beam search during inference. While generating each reasoning step, the model keeps track of multiple possible next steps rather than committing to the first one it thinks of. Most models avoid this due to computational costs. For example, Llava-CoT has linear scaling (O(n)) in the number of model calls. LlamaV-o1 improves upon this with a simplified beam search that achieves constant scaling (O(1)) while still exploring alternative reasoning paths effectively. These mechanisms work together — curriculum learning teaches the model how to break down problems, while beam search helps it find valid reasoning paths efficiently. The result is a model that thinks more systematically without becoming impractically slow.
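To make that idea concrete, here is a rough, purely illustrative Java sketch of step-level beam search over candidate reasoning chains. It is not the paper's code; the Chain record and StepProposer hook are hypothetical stand-ins for how a model might score possible next reasoning steps.

Java

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: keep the top-k partial reasoning chains at each step
// instead of greedily committing to the single highest-scoring next step.
public class StepBeamSearch {

    // A partial chain of reasoning steps plus its cumulative score (hypothetical).
    record Chain(List<String> steps, double score) {}

    // Hypothetical hook: given a partial chain, the model proposes scored extensions.
    interface StepProposer {
        List<Chain> expand(Chain partial);
    }

    static Chain search(StepProposer model, int beamWidth, int maxSteps) {
        List<Chain> beam = List.of(new Chain(List.of(), 0.0));
        for (int step = 0; step < maxSteps; step++) {
            List<Chain> candidates = new ArrayList<>();
            for (Chain partial : beam) {
                candidates.addAll(model.expand(partial));
            }
            if (candidates.isEmpty()) {
                break; // no further expansions proposed
            }
            // Keep only the beamWidth best-scoring chains for the next round.
            candidates.sort(Comparator.comparingDouble(Chain::score).reversed());
            beam = candidates.subList(0, Math.min(beamWidth, candidates.size()));
        }
        return beam.get(0);
    }
}

LlamaV-o1's contribution is a simplified variant that keeps the benefit of exploring alternatives without the cost a naive loop like this implies, which is what the O(1)-versus-O(n) comparison above refers to.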
And then innovation three comes last — to test this approach properly, the team had to build a new kind of benchmark. Their VRC-Bench presents problems across eight domains, from basic visual tasks to complex medical diagnoses. But unlike standard benchmarks that check only final answers, VRC-Bench verifies each reasoning step. A model can't pass by accident.

Figure 3

Figure 3 shows the new Visual Reasoning Chain benchmark (VRC-Bench), which spans math (MathVista, LogicVista), science (Science-QA), visual tasks (Blink-IQ-Test), medical imaging (MMMU-Medical), culture (ALM-Bench), documents (Doc-VQA), and charts (Chart-VQA). The bar chart compares final answer accuracy and reasoning quality across a bunch of models.

The results expose something fundamental about how AI systems learn to think. When forced to show their reasoning, most current models reveal alarming gaps where they skip necessary logical steps. LlamaV-o1 makes fewer such jumps, and when it fails, it usually fails at specific reasoning steps rather than producing mysteriously wrong conclusions.

Conclusion

I think this combination points to something important about the future of AI systems. Most current work focuses on making models faster or more accurate at producing answers. But I suspect that for truly complex tasks — the kind humans solve through careful reasoning — we'll need models that can think methodically through each step. LlamaV-o1's architecture suggests this might be possible without the huge computational costs many feared.

The approach will need testing beyond visual reasoning. Safety-critical domains like medical diagnosis or engineering seem like natural next steps to me — areas where we care more about reliable reasoning than speed. I wouldn't be surprised if the techniques pioneered here end up being more valuable for their careful reasoning capabilities than for their computer vision advances.

What do you think? Let me know on Discord or in the comments. I'd love to hear what you have to say.
Introduction to RAG and Quarkus

Retrieval-augmented generation (RAG) is a technique that enhances AI-generated responses by retrieving relevant information from a knowledge source. In this tutorial, we'll build a simple RAG-powered application using Java and Quarkus (a Kubernetes-native Java framework). Perfect for Java beginners!

Why Quarkus?

Quarkus provides multiple LangChain4j extensions to simplify AI application development, especially RAG implementation, by providing an Easy RAG module for building end-to-end RAG pipelines. Easy RAG acts as a bridge, connecting the retrieval components (like your document source) with the LLM interaction within the LangChain4j framework. Instead of manually orchestrating the retrieval, context injection, and LLM call, Easy RAG handles these steps behind the scenes, reducing the amount of code you need to write. This abstraction allows you to focus on defining your data sources and crafting effective prompts, while Easy RAG takes care of the more technical details of the RAG workflow.

Within a Quarkus application, this means you can quickly set up a RAG endpoint by simply configuring your document source and letting Easy RAG handle retrieval and querying. This tight integration with LangChain4j also means you still have access to the more advanced features of LangChain4j if you need to customize or extend your RAG pipeline beyond what Easy RAG provides out of the box. Essentially, Easy RAG significantly lowers the barrier to entry for building RAG applications in a Quarkus environment, allowing Java developers to rapidly prototype and deploy solutions without getting bogged down in lower-level implementation details. It provides a convenient and efficient way to leverage the power of RAG within the already productive Quarkus and LangChain4j ecosystem.

Step 1: Set Up Your Quarkus Project

Create a new Quarkus project using the Maven command:

Shell

mvn io.quarkus:quarkus-maven-plugin:3.18.4:create \
    -DprojectGroupId=com.devzone \
    -DprojectArtifactId=quarkus-rag-demo \
    -Dextensions='langchain4j-openai, langchain4j-easy-rag, websockets-next'

This generates a project with a simple AI bot with Easy RAG integration. Find the solution project here. The AI service refers to OpenAI by default. You can replace it with local Ollama using the quarkus-langchain4j-ollama extension rather than quarkus-langchain4j-openai.

Step 2: Explore the Generated AI Service

Open the Bot.java file in the src/main/java/com/devzone folder. The code should look like this:

Java

@RegisterAiService // no need to declare a retrieval augmentor here, it is automatically generated and discovered
public interface Bot {

    @SystemMessage("""
            You are an AI named Bob answering questions about financial products.
            Your response must be polite, use the same language as the question, and be relevant to the question.

            When you don't know, respond that you don't know the answer and the bank will contact the customer directly.
            """)
    String chat(@UserMessage String question);
}

- @RegisterAiService registers the AI service as an interface.
- @SystemMessage defines the initial instruction and scope that will be sent to the LLM as the first message.
- @UserMessage defines prompts (e.g., user input) and usually combines the request with the expected response format.

You can change these definitions to match your LLM and prompt engineering practices.
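The websockets-next extension included in Step 1 is commonly used to expose this Bot to clients. Here is a minimal sketch of what such an endpoint might look like; the class name, path, and greeting are illustrative and not necessarily the exact generated code.

Java

package com.devzone;

import io.quarkus.websockets.next.OnOpen;
import io.quarkus.websockets.next.OnTextMessage;
import io.quarkus.websockets.next.WebSocket;

// Illustrative sketch: expose the AI service over a WebSocket endpoint.
@WebSocket(path = "/chatbot")
public class ChatBotWebSocket {

    private final Bot bot;

    public ChatBotWebSocket(Bot bot) {
        this.bot = bot; // the AI service is injected like any other CDI bean
    }

    @OnOpen
    public String onOpen() {
        return "Hello, I'm Bob. How can I help you with our financial products?";
    }

    @OnTextMessage
    public String onTextMessage(String message) {
        // Easy RAG retrieval and prompt augmentation happen behind this single call.
        return bot.chat(message);
    }
}

With an endpoint like this, any WebSocket client connecting to /chatbot would get responses grounded in the documents ingested by Easy RAG.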
Step 3: Learn How to Integrate Easy RAG Into the AI Service

When the quarkus-langchain4j-easy-rag extension is added to the Quarkus project, the only steps required to ingest documents into an embedding store are to include a dependency for an embedding model and specify a single configuration property, quarkus.langchain4j.easy-rag.path, which points to a local directory containing your documents. During application startup, Quarkus automatically scans all files within the specified directory and ingests them into an in-memory embedding store, eliminating the need for manual setup or complex configuration.

Open the application.properties file in the src/main/resources folder. You should find the quarkus.langchain4j.easy-rag.path=easy-rag-catalog property.

Navigate to the easy-rag-catalog folder in the project root directory. You should find four documents in different file formats, such as txt, odt, and pdf files:

Shell

.
|____retirement-money-market.txt
|____elite-money-market-account.odt
|____smart-checking-account.pdf
|____standard-saving-account.txt

This approach significantly reduces the overhead typically associated with implementing RAG pipelines, allowing developers to focus on building their application logic rather than managing the intricacies of document ingestion and embedding storage. By leveraging the quarkus-langchain4j-easy-rag extension, developers can quickly enable their applications to retrieve and utilize relevant information from documents, enhancing the capabilities of AI-driven features such as chatbots, question-answering systems, or intelligent search functionalities. The extension's seamless integration with Quarkus ensures a smooth development experience, aligning with Quarkus's philosophy of making advanced technologies accessible and easy to use in cloud-native environments.

Step 4: Test Your Application Using Quarkus Dev Mode

Before testing the AI application, you need to set your OpenAI API key in the application.properties file:

quarkus.langchain4j.openai.api-key=YOUR_OPENAI_API_KEY

Start Quarkus dev mode to test the AI application using the following Maven command:

./mvnw quarkus:dev

The output should look like this:

Shell

Listening for transport dt_socket at address: 55962
__  ____  __  _____   ___  __ ____  ______
 --/ __ \/ / / / _ | / _ \/ //_/ / / / __/
 -/ /_/ / /_/ / __ |/ , _/ ,<  / /_/ /\ \
--\___\_\____/_/ |_/_/|_/_/|_|\____/___/
INFO [io.qua.lan.eas.run.EasyRagRecorder] (Quarkus Main Thread) Reading embeddings from /Users/danieloh/Downloads/quarkus-rag-demo/easy-rag-embeddings.json
INFO [io.quarkus] (Quarkus Main Thread) quarkus-rag-demo 1.0.0-SNAPSHOT on JVM (powered by Quarkus 3.18.4) started in 2.338s. Listening on: http://localhost:8080
INFO [io.quarkus] (Quarkus Main Thread) Profile dev activated. Live Coding activated.
INFO [io.quarkus] (Quarkus Main Thread) Installed features: [awt, cdi, langchain4j, langchain4j-easy-rag, langchain4j-openai, langchain4j-websockets-next, poi, qute, rest-client, rest-client-jackson, smallrye-context-propagation, smallrye-openapi, swagger-ui, vertx, websockets-next]
--
Tests paused
Press [e] to edit command line args (currently ''), [r] to resume testing, [o] Toggle test output, [:] for the terminal, [h] for more options>

To access the Quarkus Dev UI, press "D" on the terminal where Quarkus dev mode is running, or open http://localhost:8080/q/dev-ui/ directly in a web browser. Select "Chat" to access the experimental prompt page. This is useful for developers who want to verify a new AI service quickly without implementing REST APIs or front-end applications.

Enter the following prompt to verify the RAG functionality:

Tell me about the benefits of a "Standard savings account."

Send the prompt to OpenAI. The AI model is GPT-4o mini by default. The prompt is augmented with content retrieved from the relevant document (e.g., standard-saving-account.txt) before the user's message is sent to the LLM. A few seconds after your request is processed, you will receive a response with an answer drawn from that document.

Enhancements for Real-World Use

- Use a vector database. Replace the in-memory embedding store with Qdrant or Pinecone for scalable document retrieval.
- Add AI models. Integrate Hugging Face transformers for advanced text generation.
- Error handling. Improve robustness with retry logic and input validation (see the sketch below).
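For the error-handling item, one possible approach (a sketch only, assuming you also add the quarkus-smallrye-fault-tolerance extension, which is not part of the generated project) is to wrap the AI service in a small facade that validates input and retries transient failures:

Java

package com.devzone;

import jakarta.enterprise.context.ApplicationScoped;
import org.eclipse.microprofile.faulttolerance.Retry;
import org.eclipse.microprofile.faulttolerance.Timeout;

// Illustrative sketch: validate input and retry transient LLM/API failures.
@ApplicationScoped
public class ResilientBot {

    private final Bot bot;

    public ResilientBot(Bot bot) {
        this.bot = bot;
    }

    @Retry(maxRetries = 3, delay = 500) // retry transient failures with a short delay (ms)
    @Timeout(10000)                     // fail fast instead of letting the chat hang (ms)
    public String chat(String question) {
        if (question == null || question.isBlank()) {
            return "Please enter a question about our financial products.";
        }
        return bot.chat(question.strip());
    }
}

Callers (such as the WebSocket endpoint sketched earlier) would then talk to ResilientBot instead of the Bot interface directly.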
Conclusion

You've built a basic RAG application with Java and Quarkus! This example lays the groundwork for smarter apps that combine retrieval and generation. Experiment with larger datasets or AI models to level up!