Modern API Management
When assessing prominent topics across DZone — and the software engineering space more broadly — it simply felt incomplete to conduct research on the larger impacts of data and the cloud without talking about such a crucial component of modern software architectures: APIs. Communication is key in an era when applications and data capabilities are growing increasingly complex. Therefore, we set our sights on investigating the emerging ways in which data that would otherwise be isolated can better integrate with and work alongside other app components and across systems.

For DZone's 2024 Modern API Management Trend Report, we focused our research specifically on APIs' growing influence across domains, prevalent paradigms and implementation techniques, security strategies, AI, and automation. Alongside observations from our original research, practicing tech professionals from the DZone Community contributed articles addressing key topics in the API space, including automated API generation via no- and low-code tools; communication architecture design among systems, APIs, and microservices; GraphQL vs. REST; and the role of APIs in the modern cloud-native landscape.
In this article, we'll explore how to build intelligent AI agents using Azure OpenAI and Semantic Kernel (Microsoft's C# SDK). You can combine it with OpenAI, Azure OpenAI, Hugging Face, or any other model. We'll cover the fundamentals, dive into implementation details, and provide practical code examples in C#. Whether you're a beginner or an experienced developer, this guide will help you harness the power of AI for your applications.

What Is Semantic Kernel?

In Kevin Scott's talk on "The era of the AI copilot," he showcased how Microsoft's Copilot system uses a mix of AI models and plugins to enhance user experiences. At the core of this setup is an AI orchestration layer, which allows Microsoft to combine these AI components to create innovative features for users. For developers looking to create their own copilot-like experiences using AI plugins, Microsoft has introduced Semantic Kernel.

Semantic Kernel is an open-source framework that enables developers to build intelligent agents by providing a common interface for various AI models and algorithms. The Semantic Kernel SDK lets you integrate prompts to large language models (LLMs) and their results into your own applications, and potentially craft your own copilot-like experiences. It allows developers to focus on building intelligent applications without worrying about the underlying complexities of AI models. Semantic Kernel is built on top of the .NET ecosystem and provides a robust and scalable platform for building intelligent apps and agents.

Figure courtesy of Microsoft

Key Features of Semantic Kernel

Modular architecture: Semantic Kernel has a modular architecture that allows developers to easily integrate new AI models and algorithms.
Knowledge graph: Semantic Kernel provides a built-in knowledge graph that enables developers to store and query complex relationships between entities.
Machine learning: Semantic Kernel supports various machine learning algorithms, including classification, regression, and clustering.
Natural language processing: Semantic Kernel provides natural language processing capabilities, including text analysis and sentiment analysis.
Integration with external services: Semantic Kernel allows developers to integrate with external services, such as databases and web services.

Let's dive into writing some intelligent code using the Semantic Kernel C# SDK. I will walk through it in steps so it is easy to follow along.

Step 1: Setting Up the Environment

Let's set up our environment. You will need to install the following to follow along:
.NET 8 or later
Semantic Kernel SDK (available on NuGet)
Your preferred IDE (Visual Studio, Visual Studio Code, etc.)
Azure OpenAI access

Step 2: Creating a New Project in Visual Studio

Open Visual Studio and create a blank, empty .NET 8 console application.

Step 3: Install NuGet References

Right-click the project and choose Manage NuGet Packages to install the latest versions of the following two packages:
1) Microsoft.SemanticKernel
2) Microsoft.Extensions.Configuration.Json

Note: To avoid hardcoding the Azure OpenAI key and endpoint, I store them as key-value pairs in appsettings.json, and with the second package I can easily retrieve them by key.

Step 4: Create and Deploy an Azure OpenAI Model

Once you have obtained access to the Azure OpenAI service, log in to the Azure portal or Azure OpenAI Studio to create an Azure OpenAI resource.
The screenshots below are from the Azure portal. You can also create an Azure OpenAI service resource using the Azure CLI by running the following command:

PowerShell
az cognitiveservices account create -n <nameoftheresource> -g <Resourcegroupname> -l <location> --kind OpenAI --sku S0 --subscription subscriptionID

You can see your resource in Azure OpenAI Studio as well by navigating to the resources page and selecting the resource that was created.

Deploy a Model

Azure OpenAI includes several types of base models, as shown in the studio when you navigate to the Deployments tab. You can also create your own custom models from existing base models as per your requirements. Let's deploy the GPT-35-Turbo model and see how to consume it from Azure OpenAI Studio. Fill in the details and click Create. Once the model is deployed, grab the Azure OpenAI key and endpoint and paste them into the appsettings.json file (a sample file is shown at the end of this article).

Step 5: Create the Kernel in Code

Step 6: Create a Plugin to Call the Azure OpenAI Model

Step 7: Use the Kernel To Invoke the LLM Models

Once you run the program by pressing F5, you will see the response generated from the Azure OpenAI model.

Complete Code

C#
using Microsoft.Extensions.Configuration;
using Microsoft.SemanticKernel;

// Load the Azure OpenAI settings from appsettings.json.
var config = new ConfigurationBuilder()
    .AddJsonFile("appsettings.json", optional: true, reloadOnChange: true)
    .Build();

// Register the Azure OpenAI chat completion service with the kernel builder.
var builder = Kernel.CreateBuilder();
builder.Services.AddAzureOpenAIChatCompletion(
    deploymentName: config["AzureOpenAI:DeploymentModel"] ?? string.Empty,
    endpoint: config["AzureOpenAI:Endpoint"] ?? string.Empty,
    apiKey: config["AzureOpenAI:ApiKey"] ?? string.Empty);

var semanticKernel = builder.Build();

// Send a prompt to the deployed model and print the response.
Console.WriteLine(await semanticKernel.InvokePromptAsync("Give me shopping list for cooking Sushi"));

Conclusion

By combining LLMs with Semantic Kernel, you'll create intelligent applications that go beyond simple keyword matching. Experiment, iterate, and keep learning to build remarkable apps that truly understand and serve your needs.
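For reference, the appsettings.json file mentioned in Step 4 might look like the following. This is a minimal sketch: the section and key names follow the configuration lookups in the complete code above, and the values are placeholders to be replaced with your own deployment name, endpoint, and key.

JSON
{
  "AzureOpenAI": {
    "DeploymentModel": "<your-deployment-name>",
    "Endpoint": "https://<your-resource-name>.openai.azure.com/",
    "ApiKey": "<your-azure-openai-key>"
  }
}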
SQL Server serves as a robust solution for handling and examining extensive amounts of data. Nevertheless, as databases expand and evolve into intricate structures, slow queries may arise as a notable concern, impacting the effectiveness of your applications and user satisfaction. This piece delves into effective approaches for pinpointing and improving slow queries in SQL Server, helping guarantee optimal operational performance of your database.

Identifying Slow Queries

1. Utilize SQL Server Management Studio (SSMS)

Activity Monitor

Launch SSMS, establish a connection to your server, right-click on the server name, and choose Activity Monitor. Review the Recent Expensive Queries section to pinpoint queries that are utilizing a significant amount of resources.

Data Collection Reports

Configure data collection to gather system data that can help in identifying troublesome queries. Go to Management -> Data Collection, and configure the data collection sets. You can access reports later by right-clicking on Data Collection and selecting Reports.

Before proceeding, we will first create the sample database. Then follow the steps below to insert the sample data, explore the views and stored procedures, and optimize the queries.

MS SQL
CREATE DATABASE IFCData;
GO
USE IFCData;
GO
CREATE TABLE Flights (
    FlightID INT PRIMARY KEY,
    FlightNumber VARCHAR(10),
    DepartureAirportCode VARCHAR(3),
    ArrivalAirportCode VARCHAR(3),
    DepartureTime DATETIME,
    ArrivalTime DATETIME
);
GO
CREATE TABLE Passengers (
    PassengerID INT PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    Email VARCHAR(100)
);
GO
CREATE TABLE ServicesUsed (
    ServiceID INT PRIMARY KEY,
    PassengerID INT,
    FlightID INT,
    ServiceType VARCHAR(50),
    UsageTime DATETIME,
    DurationMinutes INT,
    FOREIGN KEY (PassengerID) REFERENCES Passengers(PassengerID),
    FOREIGN KEY (FlightID) REFERENCES Flights(FlightID)
);
GO

Next, insert the sample data that will be used in the examples below. Here is the code to copy and paste.
MS SQL
-- Inserting data into Flights
INSERT INTO Flights VALUES
(1, 'UA123', 'SFO', 'LAX', '2024-05-01 08:00:00', '2024-05-01 09:30:00'),
(2, 'AA456', 'NYC', 'MIA', '2024-05-01 09:00:00', '2024-05-01 12:00:00'),
(3, 'DL789', 'LAS', 'SEA', '2024-05-02 07:00:00', '2024-05-02 09:00:00'),
(4, 'UA123', 'LAX', 'SFO', '2024-05-02 10:00:00', '2024-05-02 11:30:00'),
(5, 'AA456', 'MIA', 'NYC', '2024-05-02 13:00:00', '2024-05-02 16:00:00'),
(6, 'DL789', 'SEA', 'LAS', '2024-05-03 08:00:00', '2024-05-03 10:00:00'),
(7, 'UA123', 'SFO', 'LAX', '2024-05-03 12:00:00', '2024-05-03 13:30:00'),
(8, 'AA456', 'NYC', 'MIA', '2024-05-03 17:00:00', '2024-05-03 20:00:00'),
(9, 'DL789', 'LAS', 'SEA', '2024-05-04 07:00:00', '2024-05-04 09:00:00'),
(10, 'UA123', 'LAX', 'SFO', '2024-05-04 10:00:00', '2024-05-04 11:30:00'),
(11, 'AA456', 'MIA', 'NYC', '2024-05-04 13:00:00', '2024-05-04 16:00:00'),
(12, 'DL789', 'SEA', 'LAS', '2024-05-05 08:00:00', '2024-05-05 10:00:00');

-- Inserting data into Passengers
INSERT INTO Passengers VALUES
(1, 'Vikay', 'Singh', 'johndoe@example.com'),
(2, 'Mario', 'Smith', 'janesmith@example.com'),
(3, 'Alice', 'Johnson', 'alicejohnson@example.com'),
(4, 'Bob', 'Brown', 'bobbrown@example.com'),
(5, 'Carol', 'Davis', 'caroldavis@example.com'),
(6, 'David', 'Martinez', 'davidmartinez@example.com'),
(7, 'Eve', 'Clark', 'eveclark@example.com'),
(8, 'Frank', 'Lopez', 'franklopez@example.com'),
(9, 'Grace', 'Harris', 'graceharris@example.com'),
(10, 'Harry', 'Lewis', 'harrylewis@example.com'),
(11, 'Ivy', 'Walker', 'ivywalker@example.com'),
(12, 'Jack', 'Hall', 'jackhall@example.com');

-- Inserting data into ServicesUsed
INSERT INTO ServicesUsed VALUES
(1, 1, 1, 'WiFi', '2024-05-01 08:30:00', 60),
(2, 2, 1, 'Streaming', '2024-05-01 08:45:00', 30),
(3, 3, 3, 'WiFi', '2024-05-02 07:30:00', 90),
(4, 4, 4, 'WiFi', '2024-05-02 10:30:00', 60),
(5, 5, 5, 'Streaming', '2024-05-02 13:30:00', 120),
(6, 6, 6, 'Streaming', '2024-05-03 08:30:00', 110),
(7, 7, 7, 'WiFi', '2024-05-03 12:30:00', 90),
(8, 8, 8, 'WiFi', '2024-05-03 17:30:00', 80),
(9, 9, 9, 'Streaming', '2024-05-04 07:30:00', 95),
(10, 10, 10, 'Streaming', '2024-05-04 10:30:00', 85),
(11, 11, 11, 'WiFi', '2024-05-04 13:30:00', 75),
(12, 12, 12, 'WiFi', '2024-05-05 08:30:00', 65);

2. Dynamic Management Views (DMVs)

DMVs provide a way to gain insights into the health of a SQL Server instance. To identify slow-running queries that could be affecting your IFCData database performance, you can use the sys.dm_exec_query_stats, sys.dm_exec_sql_text, and sys.dm_exec_query_plan DMVs:

MS SQL
SELECT TOP 10
    qs.total_elapsed_time / qs.execution_count AS avg_execution_time,
    qs.total_logical_reads / qs.execution_count AS avg_logical_reads,
    st.text AS query_text,
    qp.query_plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
ORDER BY avg_execution_time DESC;

This query provides a snapshot of the most resource-intensive queries by average execution time, helping you pinpoint areas where query optimization could improve performance.

Enhancing Performance

Advanced Query Optimization Techniques: Enhance Join Performance

Join operations play a crucial role in database tasks, particularly when dealing with extensive tables. By optimizing the join conditions and the sequence in which tables are joined, it is possible to greatly minimize the time taken for query execution.
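Ensuring that the join and filter columns are indexed is often the first step. As an illustration only (these indexes are not part of the original sample schema, and SQL Server does not create indexes on foreign key columns automatically), supporting indexes for the join example in the next section might look like this:

MS SQL
-- Hypothetical supporting indexes for the join/filter columns used in the example below.
-- Verify against the actual execution plan before adding them.
CREATE NONCLUSTERED INDEX IX_ServicesUsed_PassengerID ON ServicesUsed (PassengerID);
CREATE NONCLUSTERED INDEX IX_ServicesUsed_FlightID ON ServicesUsed (FlightID);
CREATE NONCLUSTERED INDEX IX_Flights_DepartureAirportCode ON Flights (DepartureAirportCode);
GO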
In order to derive valuable insights from various tables within the IFCData database, it is essential to use appropriate SQL joins. By linking passenger details with flights and services used, a comprehensive understanding can be obtained. Here is how to effectively join the Flights, Passengers, and ServicesUsed tables for in-depth analysis.

MS SQL
SELECT
    p.FirstName,
    p.LastName,
    p.Email,
    f.FlightNumber,
    f.DepartureAirportCode,
    f.ArrivalAirportCode,
    s.ServiceType,
    s.UsageTime,
    s.DurationMinutes
FROM Passengers p
JOIN ServicesUsed s ON p.PassengerID = s.PassengerID
JOIN Flights f ON s.FlightID = f.FlightID
WHERE f.DepartureAirportCode = 'SFO'; -- Example condition to filter by departure airport

This query efficiently merges data from the three tables, offering a comprehensive overview of the flight details and services used by each passenger, with a filter applied for a specific departure airport. Such a query proves valuable in analyzing passenger behavior, patterns of service usage, and operational efficiency.

Performance Tuning Tools

1. SQL Server Profiler

SQL Server Profiler captures and analyzes database events. This tool is essential for identifying slow-running queries and understanding how queries interact with the database.

Example: Set up a trace to capture query execution times:
Start SQL Server Profiler.
Create a new trace and select the events you want to capture, such as SQL:BatchCompleted.
Add a filter to capture only events where the duration is greater than a specific threshold, e.g., 1,000 milliseconds.
Run the trace during a period of typical usage to gather data on any queries that exceed your threshold.

2. Database Engine Tuning Advisor (DTA)

Database Engine Tuning Advisor analyzes workloads and recommends changes to indexes, indexed views, and partitioning.

Example: To use DTA, you first need to capture a workload in a file or table. Here's how to use it with a file:
Capture a workload using SQL Server Profiler.
Save the workload to a file.
Open DTA, connect to your server, and select the workload file.
Configure the analysis, specifying the databases to tune and the types of recommendations you're interested in.
Run the analysis. DTA will propose changes such as creating new indexes or modifying existing ones to optimize performance.

3. Query Store

Query Store collects detailed performance information about queries, making it easier to monitor performance variations and understand the impact of changes.

Example: Enable Query Store and force a plan for a query that intermittently performs poorly. Here is the code to enable and configure Query Store:

MS SQL
-- Enable Query Store for IFCData database
ALTER DATABASE IFCData SET QUERY_STORE = ON;

-- Configure Query Store settings
ALTER DATABASE IFCData SET QUERY_STORE (
    OPERATION_MODE = READ_WRITE,                        -- Allows Query Store to capture query information
    CLEANUP_POLICY = (STALE_QUERY_THRESHOLD_DAYS = 30), -- Data older than 30 days will be cleaned up
    DATA_FLUSH_INTERVAL_SECONDS = 900,                  -- Data is written to disk every 15 minutes
    INTERVAL_LENGTH_MINUTES = 60,                       -- Aggregated in 60-minute intervals
    MAX_STORAGE_SIZE_MB = 500,                          -- Limits the storage size of Query Store data to 500 MB
    QUERY_CAPTURE_MODE = AUTO);                         -- Captures all queries that are significant based on internal algorithms

Upon activation, Query Store begins collecting data about query execution, which can be examined through a range of reports accessible in SQL Server Management Studio (SSMS).
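The example above mentions forcing a plan for a query that performs inconsistently, but the forcing step itself is not shown. A minimal sketch, assuming you have already looked up the relevant IDs in the Query Store views, would be:

MS SQL
-- Force a known-good plan for a specific query.
-- The values 1 and 1 are placeholders taken from sys.query_store_query and sys.query_store_plan.
EXEC sp_query_store_force_plan @query_id = 1, @plan_id = 1;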
Below are a few essential queries that can be used to analyze data from the Query Store for the IFCData database.

1. Queries with high resource consumption: Detect queries that consume a significant amount of resources, aiding in the identification of areas that require performance enhancements.

MS SQL
SELECT TOP 10
    qs.query_id,
    qsp.query_sql_text,
    rs.avg_cpu_time,
    rs.avg_logical_io_reads,
    rs.avg_duration,
    rs.count_executions
FROM sys.query_store_plan AS qp
JOIN sys.query_store_query AS qs ON qp.query_id = qs.query_id
JOIN sys.query_store_query_text AS qsp ON qs.query_text_id = qsp.query_text_id
JOIN sys.query_store_runtime_stats AS rs ON qp.plan_id = rs.plan_id
ORDER BY rs.avg_cpu_time DESC;

2. Analyzing query performance decline: Assess the performance of a query across various periods to identify any declines in performance.

MS SQL
SELECT
    rs.start_time,
    rs.end_time,
    qp.query_plan,
    rs.avg_duration
FROM sys.query_store_runtime_stats AS rs
JOIN sys.query_store_plan AS qp ON rs.plan_id = qp.plan_id
WHERE qp.query_id = YOUR_QUERY_ID -- Specify the query ID you want to analyze
ORDER BY rs.start_time;

3. Monitoring changes in query plans: Track the alterations in query plans over time for a particular query, which helps in understanding performance fluctuations.

MS SQL
SELECT
    qp.plan_id,
    qsp.query_sql_text,
    qp.last_execution_time
FROM sys.query_store_plan AS qp
JOIN sys.query_store_query AS qs ON qp.query_id = qs.query_id
JOIN sys.query_store_query_text AS qsp ON qs.query_text_id = qsp.query_text_id
WHERE qs.query_id = 1 -- Specify the query ID you want to analyze
ORDER BY qp.last_execution_time DESC;

I am using query_id = 1 here; in your case, it can be any query ID.

Conclusion

By systematically identifying slow queries and applying targeted optimization techniques, you can significantly enhance the performance of your SQL Server databases. Regular monitoring and maintenance are key to sustaining these performance gains over time. With the right tools and techniques, you can transform your SQL Server into a high-performing, efficient database management system.

Further Reading

Learn DMVs
Best practices to monitor the query load
Performing DBCC CHECKDB
For a long time, AWS CloudTrail has been the foundational technology that enabled organizations to meet compliance requirements by capturing audit logs for all AWS API invocations. CloudTrail Lake extends CloudTrail's capabilities by adding support for a SQL-like query language to analyze audit events. The audit events are stored in a columnar format called ORC to enable high-performance SQL queries. An important capability of CloudTrail Lake is the ability to ingest audit logs from custom applications or partner SaaS applications. With this capability, an organization can get a single aggregated view of audit events across AWS API invocations and their enterprise applications. As each end-to-end business process can span multiple enterprise applications, an aggregated view of audit events across them becomes a critical need. This article discusses an architectural approach to leverage CloudTrail Lake for auditing enterprise applications and the corresponding design considerations.

Architecture

Let us start by taking a look at the architecture diagram. This architecture uses SQS queues and AWS Lambda functions to provide an asynchronous and highly concurrent model for disseminating audit events from the enterprise application. At important steps in business transactions, the application calls the relevant AWS SDK APIs to send the audit event details as a message to the audit event SQS queue. A Lambda function is associated with the SQS queue so that it is triggered whenever a message is added to the queue. It calls the putAuditEvents() API provided by CloudTrail Lake to ingest audit events into the event data store configured for this enterprise application. Note that the architecture shows two other event data stores to illustrate that events from the enterprise application can be correlated with events in the other data stores.

Required Configuration

Start by creating an event data store that accepts events of category ActivityAuditLog. Note down the ARN of the event data store created; it will be needed for creating an integration channel.

Shell
aws cloudtrail create-event-data-store \
    --name custom-events-datastore \
    --no-multi-region-enabled \
    --retention-period 90 \
    --advanced-event-selectors '[
        {
            "Name": "Select all external events",
            "FieldSelectors": [
                { "Field": "eventCategory", "Equals": ["ActivityAuditLog"] }
            ]
        }
    ]'

Create an integration with the source as "My Custom Integration" and choose the delivery location as the event data store created in the previous step. Note the ARN of the channel created; it will be needed for coding the Lambda function.

Shell
aws cloudtrail create-channel \
    --region us-east-1 \
    --destinations '[{"Type": "EVENT_DATA_STORE", "Location": "<event data store arn>"}]' \
    --name custom-events-channel \
    --source Custom

Create a Lambda function that contains the logic to receive messages from an SQS queue, transform each message into an audit event, and send it to the channel created in the previous step using the putAuditEvents() API. Refer to the next section for the main steps to be included in the Lambda function logic. Add permissions through an inline policy for the Lambda function, so that it is authorized to put audit events into the integration channel.

JSON
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Statement1",
            "Effect": "Allow",
            "Action": "cloudtrail-data:PutAuditEvents",
            "Resource": "<channel arn>"
        }
    ]
}

Create an SQS queue of type "Standard" with an associated dead-letter queue.
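For example, the queue and its dead-letter queue could be created with the AWS CLI as sketched below. This is an illustrative sketch only: the queue names and maxReceiveCount are placeholders, and the dead-letter queue must be created first so that its ARN can be referenced in the redrive policy.

Shell
# Create the dead-letter queue first, then the main audit event queue with a redrive policy.
aws sqs create-queue --queue-name audit-events-dlq
aws sqs create-queue --queue-name audit-events-queue \
    --attributes '{"RedrivePolicy": "{\"deadLetterTargetArn\":\"<dead letter queue arn>\",\"maxReceiveCount\":\"5\"}"}'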
Add permissions to the Lambda function using an inline policy to allow it to receive messages from the SQS queue.

JSON
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Statement1",
            "Effect": "Allow",
            "Action": "sqs:*",
            "Resource": "<SQS Queue arn>"
        }
    ]
}

In the Lambda function configuration, add a trigger by choosing the source as "SQS" and specifying the ARN of the SQS queue created in the previous step. Ensure that the "Report batch item failures" option is selected. Finally, ensure that permissions to send messages to this queue are added to the IAM role assigned to your enterprise application.

Lambda Function Code

The code sample focuses on the Lambda function, as it is at the crux of the solution.

Java
// Imports assume the AWS SDK for Java v1 (CloudTrail Data client) and the
// aws-lambda-java-events library; adjust the package names to your SDK version.
import java.util.ArrayList;
import java.util.List;

import com.amazonaws.services.cloudtraildata.AWSCloudTrailData;
import com.amazonaws.services.cloudtraildata.AWSCloudTrailDataClientBuilder;
import com.amazonaws.services.cloudtraildata.model.AuditEvent;
import com.amazonaws.services.cloudtraildata.model.PutAuditEventsRequest;
import com.amazonaws.services.cloudtraildata.model.PutAuditEventsResult;
import com.amazonaws.services.cloudtraildata.model.ResultErrorEntry;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.SQSBatchResponse;
import com.amazonaws.services.lambda.runtime.events.SQSBatchResponse.BatchItemFailure;
import com.amazonaws.services.lambda.runtime.events.SQSEvent;
import com.amazonaws.services.lambda.runtime.events.SQSEvent.SQSMessage;

public class CustomAuditEventHandler implements RequestHandler<SQSEvent, SQSBatchResponse> {

    // ARN of the integration channel created earlier (placeholder).
    private static final String channelARN = "<channel arn>";

    public SQSBatchResponse handleRequest(final SQSEvent event, final Context context) {
        List<SQSMessage> records = event.getRecords();
        AWSCloudTrailData client = AWSCloudTrailDataClientBuilder.defaultClient();
        PutAuditEventsRequest request = new PutAuditEventsRequest();
        List<AuditEvent> auditEvents = new ArrayList<AuditEvent>();
        request.setChannelArn(channelARN);

        for (SQSMessage record : records) {
            AuditEvent auditEvent = new AuditEvent();
            // Add logic in the transformToEventData() operation to transform contents of
            // the message to the event data format needed by CloudTrail Lake.
            String eventData = transformToEventData(record);
            context.getLogger().log("Event Data JSON: " + eventData);
            auditEvent.setEventData(eventData);
            // Set a source event ID. This could be useful to correlate the event
            // data stored in CloudTrail Lake to relevant information in the enterprise
            // application.
            auditEvent.setId(record.getMessageId());
            auditEvents.add(auditEvent);
        }

        request.setAuditEvents(auditEvents);
        PutAuditEventsResult putAuditEvents = client.putAuditEvents(request);
        context.getLogger().log("Put Audit Event Results: " + putAuditEvents.toString());

        SQSBatchResponse response = new SQSBatchResponse();
        List<BatchItemFailure> failures = new ArrayList<SQSBatchResponse.BatchItemFailure>();
        for (ResultErrorEntry result : putAuditEvents.getFailed()) {
            BatchItemFailure batchItemFailure = new BatchItemFailure(result.getId());
            failures.add(batchItemFailure);
            context.getLogger().log("Failed Event ID: " + result.getId());
        }
        response.setBatchItemFailures(failures);
        return response;
    }

    // Placeholder: convert the application's audit message body into the event data
    // JSON format expected by CloudTrail Lake. Replace with application-specific logic.
    private String transformToEventData(SQSMessage record) {
        return record.getBody();
    }
}

The first thing to note is that the type specification for the class uses SQSBatchResponse, as we want the audit event messages to be processed as batches. Each enterprise application would have its own format for representing audit messages. The logic to transform the messages to the format required by the CloudTrail Lake data schema should be part of the Lambda function. This would allow the same architecture to be used even if the audit events need to be ingested into a different (SIEM) tool instead of CloudTrail Lake. Apart from the event data itself, the putAuditEvents() API of CloudTrail Lake expects a source event ID to be provided for each event. This could be used to tie the audit event stored in CloudTrail Lake to relevant information in the enterprise application. The messages which failed to be ingested should be added to the list of failed records in the SQSBatchResponse object. This ensures that all the successfully processed records are deleted from the SQS queue and failed records are retried at a later time. Note that the code is using the source event ID (result.getId()) as the ID for failed records.
This is because the source event ID was set to the message ID earlier in the code. If a different identifier has to be used as the source event ID, it has to be mapped to the message ID. The mapping will help with finding the message IDs for records that were not successfully ingested while framing the Lambda function response.

Architectural Considerations

This section discusses the choices made for this architecture and the corresponding trade-offs. These need to be considered carefully while designing your solution.

FIFO vs. Standard Queues

Audit events are usually self-contained units of data. So, the order in which they are ingested into CloudTrail Lake should not affect the information conveyed by them in any manner. Hence, there is no need to use a FIFO queue to maintain the information integrity of audit events. Standard queues provide higher concurrency than FIFO queues with respect to fanning out messages to Lambda function instances. This is because, unlike FIFO queues, they do not have to maintain the order of messages at the queue or message group level. Achieving a similar level of concurrency with FIFO queues would require increasing the complexity of the source application, as it has to include logic to fan out messages across message groups. With standard queues, there is a small chance of multiple deliveries of the same message. This should not be a problem, as duplicates could be filtered out as part of the CloudTrail Lake queries.

SNS vs. SQS

This architecture uses SQS instead of SNS for the following reasons:
SNS FIFO topics do not support triggering Lambda functions.
SQS, through its retry logic, provides better reliability than SNS with respect to delivering messages to the recipient. This is a valuable capability, especially for data as important as audit events.
SQS can be configured to group audit events and send them to Lambda to be processed in batches. This helps with the performance and cost of the Lambda function and avoids overwhelming CloudTrail Lake with a high number of concurrent connection requests.

There are other factors to consider as well, such as the usage of private links, VPC integration, and message encryption in transit, to securely transmit audit events. The concurrency and message delivery settings provided by the SQS-Lambda integration should also be tuned based on the throughput and complexity of the audit events. The approach presented and the architectural considerations discussed provide a good starting point for using CloudTrail Lake with enterprise applications.
Introduction to Secrets Management In the world of DevSecOps, where speed, agility, and security are paramount, managing secrets effectively is crucial. Secrets, such as passwords, API keys, tokens, and certificates, are sensitive pieces of information that, if exposed, can lead to severe security breaches. To mitigate these risks, organizations are turning to secrets management solutions. These solutions help securely store, access, and manage secrets throughout the software development lifecycle, ensuring they are protected from unauthorized access and misuse. This article aims to provide an in-depth overview of secrets management in DevSecOps, covering key concepts, common challenges, best practices, and available tools. Security Risks in Secrets Management The lack of a secrets management implementation poses several challenges. Primarily, your organization might already have numerous secrets stored across the codebase. Apart from the ongoing risk of exposure, keeping secrets within your code promotes other insecure practices such as reusing secrets, employing weak passwords, and neglecting to rotate or revoke secrets due to the extensive code modifications that would be needed. Below are some of the potential risks of improper secrets management: Data Breaches If secrets are not properly managed, they can be exposed, leading to unauthorized access and potential data breaches. Example Scenario A Software-as-a-Service (SaaS) company uses a popular CI/CD platform to automate its software development and deployment processes. As part of their DevSecOps practices, they store sensitive credentials, such as API keys and database passwords, in a secrets management tool integrated with their pipelines. Issue Unfortunately, the CI/CD platform they use experiences a security vulnerability that allows attackers to gain unauthorized access to the secrets management tool's API. This vulnerability goes undetected by the company's security monitoring systems. Consequence Attackers exploit the vulnerability and gain access to the secrets stored in the management tool. With these credentials, they are able to access the company's production systems and databases. They exfiltrate sensitive customer data, including personally identifiable information (PII) and financial records. Impact The data breach leads to significant financial losses for the company due to regulatory fines, legal fees, and loss of customer trust. Additionally, the company's reputation is tarnished, leading to a decrease in customer retention and potential business partnerships. Preventive Measures To prevent such data breaches, the company could have implemented the following preventive measures: Regularly auditing and monitoring access to the secrets management tool to detect unauthorized access. Implementing multi-factor authentication (MFA) for accessing the secrets management tool. Ensuring that the secrets management tool is regularly patched and updated to address any security vulnerabilities. Limiting access to secrets based on the principle of least privilege, ensuring that only authorized users and systems have access to sensitive credentials. Implementing strong encryption for storing secrets to mitigate the impact of unauthorized access. Conducting regular security assessments and penetration testing to identify and address potential security vulnerabilities in the CI/CD platform and associated tools.
Credential Theft Attackers may steal secrets, such as API keys or passwords, to gain unauthorized access to systems or resources. Example Scenario A fintech startup uses a popular CI/CD platform to automate its software development and deployment processes. They store sensitive credentials, such as database passwords and API keys, in a secrets management tool integrated with their pipelines. Issue An attacker gains access to the company's internal network by exploiting a vulnerability in an outdated web server. Once inside the network, the attacker uses a variety of techniques, such as phishing and social engineering, to gain access to a developer's workstation. Consequence The attacker discovers that the developer has stored plaintext files containing sensitive credentials, including database passwords and API keys, on their desktop. The developer had mistakenly saved these files for convenience and had not securely stored them in the secrets management tool. Impact With access to the sensitive credentials, the attacker gains unauthorized access to the company's databases and other systems. They exfiltrate sensitive customer data, including financial records and personal information, leading to regulatory fines and damage to the company's reputation. Preventive Measures To prevent such credential theft incidents, the fintech startup could have implemented the following preventive measures: Educating developers and employees about the importance of securely storing credentials and the risks of leaving them in plaintext files. Implementing strict access controls and auditing mechanisms for accessing and managing secrets in the secrets management tool. Using encryption to store sensitive credentials in the secrets management tool, ensures that even if credentials are stolen, they cannot be easily used without decryption keys. Regularly rotating credentials and monitoring for unusual or unauthorized access patterns to detect potential credential theft incidents early. Misconfiguration Improperly configured secrets management systems can lead to accidental exposure of secrets. Example Scenario A healthcare organization uses a popular CI/CD platform to automate its software development and deployment processes. They store sensitive credentials, such as database passwords and API keys, in a secrets management tool integrated with their pipelines. Issue A developer inadvertently misconfigures the permissions on the secrets management tool, allowing unintended access to sensitive credentials. The misconfiguration occurs when the developer sets overly permissive access controls, granting access to a broader group of users than intended. Consequence An attacker discovers the misconfigured access controls and gains unauthorized access to the secrets management tool. With access to sensitive credentials, the attacker can now access the healthcare organization's databases and other systems, potentially leading to data breaches and privacy violations. Impact The healthcare organization suffers reputational damage and financial losses due to the data breach. They may also face regulatory fines for failing to protect sensitive information. Preventive Measures To prevent such misconfiguration incidents, the healthcare organization could have implemented the following preventive measures: Implementing least privilege access controls to ensure that only authorized users and systems have access to sensitive credentials. 
Regularly auditing and monitoring access to the secrets management tool to detect and remediate misconfigurations. Implementing automated checks and policies to enforce proper access controls and configurations for secrets management. Providing training and guidance to developers and administrators on best practices for securely configuring and managing access to secrets. Compliance Violations Failure to properly manage secrets can lead to violations of regulations such as GDPR, HIPAA, or PCI DSS. Example Scenario A financial services company uses a popular CI/CD platform to automate their software development and deployment processes. They store sensitive credentials, such as encryption keys and API tokens, in a secrets management tool integrated with their pipelines. Issue The financial services company fails to adhere to regulatory requirements for managing and protecting sensitive information. Specifically, they do not implement proper encryption for storing sensitive credentials and do not maintain proper access controls for managing secrets. Consequence Regulatory authorities conduct an audit of the company's security practices and discover compliance violations related to secrets management. The company is found to be non-compliant with regulations such as PCI DSS (Payment Card Industry Data Security Standard) and GDPR (General Data Protection Regulation). Impact The financial services company faces significant financial penalties for non-compliance with regulatory requirements. Additionally, the company's reputation is damaged, leading to a loss of customer trust and potential legal consequences. Preventive Measures To prevent such compliance violations, the financial services company could have implemented the following preventive measures: Implementing encryption for storing sensitive credentials in the secrets management tool to ensure compliance with data protection regulations. Implementing strict access controls and auditing mechanisms for managing and accessing secrets to prevent unauthorized access. Conducting regular compliance audits and assessments to identify and address any non-compliance issues related to secrets management. Lack of Accountability Without proper auditing and monitoring, it can be difficult to track who accessed or modified secrets, leading to a lack of accountability. Example Scenario A technology company uses a popular CI/CD platform to automate its software development and deployment processes. They store sensitive credentials, such as API keys and database passwords, in a secrets management tool integrated with their pipelines. Issue The company does not establish clear ownership and accountability for managing and protecting secrets. There is no designated individual or team responsible for ensuring that proper security practices are followed when storing and accessing secrets. Consequence Due to the lack of accountability, there is no oversight or monitoring of access to sensitive credentials. As a result, developers and administrators have unrestricted access to secrets, increasing the risk of unauthorized access and data breaches. Impact The lack of accountability leads to a data breach where sensitive credentials are exposed. The company faces financial losses due to regulatory fines, legal fees, and loss of customer trust. Additionally, the company's reputation is damaged, leading to a decrease in customer retention and potential business partnerships. 
Preventive Measures To prevent such lack of accountability incidents, the technology company could have implemented the following preventive measures: Designating a specific individual or team responsible for managing and protecting secrets, including implementing and enforcing security policies and procedures. Implementing access controls and auditing mechanisms to monitor and track access to secrets, ensuring that only authorized users have access. Providing regular training and awareness programs for employees on the importance of secrets management and security best practices. Conducting regular security audits and assessments to identify and address any gaps in secrets management practices. Operational Disruption If secrets are not available when needed, it can disrupt the operation of DevSecOps pipelines and applications. Example Scenario A financial institution uses a popular CI/CD platform to automate its software development and deployment processes. They store sensitive credentials, such as encryption keys and API tokens, in a secrets management tool integrated with their pipelines. Issue During a routine update to the secrets management tool, a misconfiguration occurs that causes the tool to become unresponsive. As a result, developers are unable to access the sensitive credentials needed to deploy new applications and services. Consequence The operational disruption leads to a delay in deploying critical updates and features, impacting the financial institution's ability to serve its customers effectively. The IT team is forced to troubleshoot the issue, leading to downtime and increased operational costs. Impact The operational disruption results in financial losses due to lost productivity and potential revenue. Additionally, the financial institution's reputation is damaged, leading to a loss of customer trust and potential business partnerships. Preventive Measures To prevent such operational disruptions, the financial institution could have implemented the following preventive measures: Implementing automated backups and disaster recovery procedures for the secrets management tool to quickly restore service in case of a failure. Conducting regular testing and monitoring of the secrets management tool to identify and address any performance issues or misconfigurations. Implementing a rollback plan to quickly revert to a previous version of the secrets management tool in case of a failed update or configuration change. Establishing clear communication channels and escalation procedures to quickly notify stakeholders and IT teams in case of operational disruption. Dependency on Third-Party Services Using third-party secrets management services can introduce dependencies and potential risks if the service becomes unavailable or compromised. Example Scenario A software development company uses a popular CI/CD platform to automate its software development and deployment processes. They rely on a third-party secrets management tool to store sensitive credentials, such as API keys and database passwords, used in their pipelines. Issue The third-party secrets management tool experiences a service outage due to a cyber attack on the service provider's infrastructure. As a result, the software development company is unable to access the sensitive credentials needed to deploy new applications and services. 
Consequence The dependency on the third-party secrets management tool leads to a delay in deploying critical updates and features, impacting the software development company's ability to deliver software on time. The IT team is forced to find alternative ways to manage and store sensitive credentials temporarily. Impact The dependency on the third-party secrets management tool results in financial losses due to lost productivity and potential revenue. Additionally, the software development company's reputation is damaged, leading to a loss of customer trust and potential business partnerships. Preventive Measures To prevent such dependencies on third-party services, the software development company could have implemented the following preventive measures: Implementing a backup plan for storing and managing sensitive credentials locally in case of a service outage or disruption. Diversifying the use of secrets management tools by using multiple tools or providers to reduce the impact of a single service outage. Conducting regular reviews and assessments of third-party service providers to ensure they meet security and reliability requirements. Implementing a contingency plan to quickly switch to an alternative secrets management tool or provider in case of a service outage or disruption. Insider Threats Malicious insiders may abuse their access to secrets for personal gain or to harm the organization. Example Scenario A technology company uses a popular CI/CD platform to automate their software development and deployment processes. They store sensitive credentials, such as API keys and database passwords, in a secrets management tool integrated with their pipelines. Issue An employee with privileged access to the secrets management tool decides to leave the company and maliciously steals sensitive credentials before leaving. The employee had legitimate access to the secrets management tool as part of their job responsibilities but chose to abuse that access for personal gain. Consequence The insider threat leads to the theft of sensitive credentials, which are then used by the former employee to gain unauthorized access to the company's systems and data. This unauthorized access can lead to data breaches, financial losses, and damage to the company's reputation. Impact The insider threat results in financial losses due to potential data breaches and the need to mitigate the impact of the stolen credentials. Additionally, the company's reputation is damaged, leading to a loss of customer trust and potential legal consequences. Preventive Measures To prevent insider threats involving secrets management, the technology company could have implemented the following preventive measures: Implementing strict access controls and least privilege principles to limit the access of employees to sensitive credentials based on their job responsibilities. Conducting regular audits and monitoring of access to the secrets management tool to detect and prevent unauthorized access. Providing regular training and awareness programs for employees on the importance of data security and the risks of insider threats. Implementing behavioral analytics and anomaly detection mechanisms to identify and respond to suspicious behavior or activities involving sensitive credentials. Best Practices for Secrets Management Here are some best practices for secrets management in DevSecOps pipelines: Use a dedicated secrets management tool: Utilize a specialized tool or service designed for securely storing and managing secrets. 
Encrypt secrets at rest and in transit: Ensure that secrets are encrypted both when stored and when transmitted over the network. Use strong access controls: Implement strict access controls to limit who can access secrets and what they can do with them. Regularly rotate secrets: Regularly rotate secrets (e.g., passwords, API keys) to minimize the impact of potential compromise. Avoid hardcoding secrets: Never hardcode secrets in your code or configuration files. Use environment variables or a secrets management tool instead. Use environment-specific secrets: Use different secrets for different environments (e.g., development, staging, production) to minimize the impact of a compromised secret. Monitor and audit access: Monitor and audit access to secrets to detect and respond to unauthorized access attempts. Automate secrets retrieval: Automate the retrieval of secrets in your CI/CD pipelines to reduce manual intervention and the risk of exposure. Regularly review and update policies: Regularly review and update your secrets management policies and procedures to ensure they are up-to-date and effective. Educate and train employees: Educate and train employees on the importance of secrets management and best practices for handling secrets securely. Use-Cases of Secrets Management For Different Tools Here are the common use cases for different tools of secrets management: IBM Cloud Secrets Manager Securely storing and managing API keys Managing database credentials Storing encryption keys Managing certificates Integrating with CI/CD pipelines Compliance and audit requirements by providing centralized management and auditing of secrets usage. Ability to dynamically generate and rotate secrets HashiCorp Vault Centralized secrets management for distributed systems Dynamic secrets generation and management Encryption and access controls for secrets Secrets rotation for various types of secrets AWS Secrets Manager Securely store and manage AWS credentials Securely store and manage other types of secrets used in AWS services Integration with AWS services for seamless access to secrets Automatic secrets rotation for supported AWS services Azure Key Vault Centralized secrets management for Azure applications Securely store and manage secrets, keys, and certificates Encryption and access policies for secrets Automated secrets rotation for keys, secrets, and certificates CyberArk Conjur Secrets management and privileged access management Secrets retrieval via REST API for integration with CI/CD pipelines Secrets versioning and access controls Automated secrets rotation using rotation policies and scheduled tasks Google Cloud Secret Manager Centralized secrets management for Google Cloud applications Securely store and manage secrets, API keys, and certificates Encryption at rest and in transit for secrets Automated and manual secrets rotation with integration with Google Cloud Functions These tools cater to different cloud environments and offer various features for securely managing and rotating secrets based on specific requirements and use cases. Implement Secrets Management in DevSecOps Pipelines Understanding CI/CD in DevSecOps CI/CD in DevSecOps involves automating the build, test, and deployment processes while integrating security practices throughout the pipeline to deliver secure and high-quality software rapidly. Continuous Integration (CI) CI is the practice of automatically building and testing code changes whenever a developer commits code to the version control system (e.g., Git). 
The goal is to quickly detect and fix integration errors.

Continuous Delivery (CD)

CD extends CI by automating the process of deploying code changes to testing, staging, and production environments. With CD, every code change that passes the automated tests can potentially be deployed to production.

Continuous Deployment (CD)

Continuous deployment goes one step further than continuous delivery by automatically deploying every code change that passes the automated tests to production. This requires a high level of automation and confidence in the automated tests.

Continuous Compliance (CC)

CC refers to the practice of integrating compliance checks and controls into the automated CI/CD pipeline. It ensures that software deployments comply with relevant regulations, standards, and internal policies throughout the development lifecycle.

DevSecOps

DevSecOps integrates security practices into the CI/CD pipeline, ensuring that security is built into the software development process from the beginning. This includes performing security testing (e.g., static code analysis, dynamic application security testing) as part of the pipeline and managing secrets securely. The following picture depicts the DevSecOps lifecycle (picture courtesy of the original source).

Implement Secrets Management Into DevSecOps Pipelines

Implementing secrets management into DevSecOps pipelines involves securely handling and storing sensitive information such as API keys, passwords, and certificates. Here's a step-by-step guide to implementing secrets management in DevSecOps pipelines:

Select a secrets management solution: Choose a secrets management tool that aligns with your organization's security requirements and integrates well with your existing DevSecOps tools and workflows.
Identify secrets: Identify the secrets that need to be managed, such as database credentials, API keys, encryption keys, and certificates.
Store secrets securely: Use the selected secrets management tool to securely store secrets. Ensure that secrets are encrypted at rest and in transit and that access controls are in place to restrict who can access them.
Integrate secrets management into CI/CD pipelines: Update your CI/CD pipeline scripts and configurations to integrate with the secrets management tool. Use the tool's APIs or SDKs to retrieve secrets securely during the pipeline execution.
Implement access controls: Implement strict access controls to ensure that only authorized users and systems can access secrets. Use role-based access control (RBAC) to manage permissions.
Rotate secrets regularly: Regularly rotate secrets to minimize the impact of potential compromise. Automate the rotation process as much as possible to ensure consistency and security.
Monitor and audit access: Monitor and audit access to secrets to detect and respond to unauthorized access attempts. Use logging and monitoring tools to track access and usage.

Best Practices for Secrets Management Into DevSecOps Pipelines

Implementing secrets management in DevSecOps pipelines requires careful consideration to ensure security and efficiency. Here are some best practices:

Use a secrets management tool: Utilize a dedicated secrets management tool to store and manage secrets securely.
Encrypt secrets: Encrypt secrets both at rest and in transit to protect them from unauthorized access.
Avoid hardcoding secrets: Never hardcode secrets in your code or configuration files. Use environment variables or secrets management tools to inject secrets into your CI/CD pipelines, as in the short sketch below.
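As a minimal sketch of that practice (the variable name and error handling are illustrative assumptions, not tied to any particular secrets manager), application code can read a secret that the pipeline or secrets manager injects at runtime instead of embedding it in the source:

Python
import os

# DB_PASSWORD is a hypothetical environment variable injected by the CI/CD
# pipeline or secrets manager at deploy time; the value never lives in the codebase.
db_password = os.environ.get("DB_PASSWORD")
if db_password is None:
    raise RuntimeError("DB_PASSWORD is not set; check the pipeline's secret injection step")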
Rotate secrets: Implement a secrets rotation policy to regularly rotate secrets, such as passwords and API keys. Automate the rotation process wherever possible to reduce the risk of human error.
Implement access controls: Use role-based access controls (RBAC) to restrict access to secrets based on the principle of least privilege.
Monitor and audit access: Enable logging and monitoring to track access to secrets and detect any unauthorized access attempts.
Automate secrets retrieval: Automate the retrieval of secrets in your CI/CD pipelines to reduce manual intervention and improve security.
Use secrets injection: Use tools or libraries that support secrets injection (e.g., Kubernetes secrets, Docker secrets) to securely inject secrets into your application during deployment.

Conclusion

Secrets management is a critical aspect of DevSecOps that cannot be overlooked. By implementing best practices such as using dedicated secrets management tools, encrypting secrets, and implementing access controls, organizations can significantly enhance the security of their software development and deployment pipelines. Effective secrets management not only protects sensitive information but also helps in maintaining compliance with regulatory requirements. As DevSecOps continues to evolve, it is essential for organizations to prioritize secrets management as a fundamental part of their security strategy.
Logging is essential for any software system. Using logs, you can troubleshoot a wide range of issues, including debugging an application bug, a security defect, system slowness, etc. In this article, we will discuss how to use Python logging effectively with custom attributes.

Python Logging

Before we delve in, I briefly want to explain the basic Python logging module with an example.

Python
#!/opt/bb/bin/python3.7
import logging
import sys

root = logging.getLogger()
root.setLevel(logging.DEBUG)

std_out_logger = logging.StreamHandler(sys.stdout)
std_out_logger.setLevel(logging.INFO)
std_out_formatter = logging.Formatter("%(levelname)s - %(asctime)s %(message)s")
std_out_logger.setFormatter(std_out_formatter)
root.addHandler(std_out_logger)

logging.info("I love Dzone!")

The above example prints the following when executed:

INFO - 2024-03-09 19:49:07,734 I love Dzone!

In the example above, we are creating the root logger and the logging format for log messages. The call to logging.getLogger() without a name returns the root logger (named loggers form a hierarchy and fall back to their parent loggers). We define our own StreamHandler to print the log message to the console. Whenever we log messages, it is essential to log the basic attributes of the LogRecord. The Formatter defines the basic format, which includes the level name, the time as a string, and the actual message itself. The handler thus created is added to the root logger. We could use any pre-defined log attribute name and format from the LogRecord documentation. However, let's say you want to print some additional attributes like a contextId; a custom logging adapter comes to the rescue.

Logging Adapter

Python
class MyLoggingAdapter(logging.LoggerAdapter):

    def __init__(self, logger):
        logging.LoggerAdapter.__init__(self, logger=logger, extra={})

    def process(self, msg, kwargs):
        return msg, kwargs

We create our own version of a LoggerAdapter and pass "extra" parameters as a dictionary for the formatter.

ContextId Filter

Python
import contextvars
import uuid

class ContextIdFilter(logging.Filter):
    context_id = contextvars.ContextVar('context_id', default='')

    def filter(self, record):
        # Add a new UUID to the context if one is not already set.
        req_id = str(uuid.uuid4())
        if not self.context_id.get():
            self.context_id.set(req_id)
        record.context_id = self.context_id.get()
        return True

We create our own filter that extends logging.Filter, whose filter() method returns True if the specified log record should be logged. We simply add our parameter to the log record and always return True, thus adding our unique ID to the record. In the example above, a unique ID is generated for every new context. For an existing context, we return the already stored contextId from the ContextVar.

Custom Logger

Python
import logging
import sys

root = logging.getLogger()
root.setLevel(logging.DEBUG)

std_out_logger = logging.StreamHandler(sys.stdout)
std_out_logger.setLevel(logging.INFO)
std_out_formatter = logging.Formatter("%(levelname)s - %(asctime)s ContextId:%(context_id)s %(message)s")
std_out_logger.setFormatter(std_out_formatter)
root.addHandler(std_out_logger)
root.addFilter(ContextIdFilter())

adapter = MyLoggingAdapter(root)
adapter.info("I love Dzone!")
adapter.info("this is my custom logger")
adapter.info("Exiting the application")

Now let's put it together in our logger file. Add the ContextIdFilter to the root logger. Please note that we are using our own adapter in place of the logging module wherever we need to log a message.
Running the code above prints the following messages:

INFO - 2024-04-20 23:54:59,839 ContextId:c10af4e9-6ea4-4cdf-9743-ea24d0febab6 I love Dzone!
INFO - 2024-04-20 23:54:59,842 ContextId:c10af4e9-6ea4-4cdf-9743-ea24d0febab6 this is my custom logger
INFO - 2024-04-20 23:54:59,843 ContextId:c10af4e9-6ea4-4cdf-9743-ea24d0febab6 Exiting the application

By setting a logger's propagate attribute to False, events logged to that logger will not be passed on to the handlers of its ancestor (parent) loggers. Conclusion Python does not provide a built-in option to add custom attributes to every log record. Instead, we create a wrapper around the Python root logger, using an adapter and a filter, and print our custom attributes. This is particularly helpful when debugging request-specific issues.
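As a closing illustration, it is worth seeing how the ContextVar gives each execution context its own id. The sketch below assumes the ContextIdFilter, root logger configuration, and adapter defined above are already in place; it runs two asyncio tasks, and each one gets a distinct ContextId in its log lines because every asyncio task receives its own copy of the current context.

Python
import asyncio

async def handle_request(request_name: str):
    # Clear any id inherited from the parent context so this task gets its own.
    ContextIdFilter.context_id.set('')
    adapter.info("start %s", request_name)
    await asyncio.sleep(0.1)
    adapter.info("finish %s", request_name)

async def main():
    # The two requests below are logged with two different ContextId values.
    await asyncio.gather(handle_request("request-1"), handle_request("request-2"))

asyncio.run(main())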
The monolithic architecture was historically used by developers for a long time — and for a long time, it worked. Unfortunately, these architectures use fewer parts that are larger, thus meaning they were more likely to fail in entirety if a single part failed. Often, these applications ran as a singular process, which only exacerbated the issue. Microservices solve these specific issues by having each microservice run as a separate process. If one cog goes down, it doesn’t necessarily mean the whole machine stops running. Plus, diagnosing and fixing defects in smaller, highly cohesive services is often easier than in larger monolithic ones. Microservices design patterns provide tried-and-true fundamental building blocks that can help write code for microservices. By utilizing patterns during the development process, you save time and ensure a higher level of accuracy versus writing code for your microservices app from scratch. In this article, we cover a comprehensive overview of microservices design patterns you need to know, as well as when to apply them. Key Benefits of Using Microservices Design Patterns Microservices design patterns offer several key benefits, including: Scalability: Microservices allow applications to be broken down into smaller, independent services, each responsible for a specific function or feature. This modular architecture enables individual services to be scaled independently based on demand, improving overall system scalability and resource utilization. Flexibility and agility: Microservices promote flexibility and agility by decoupling different parts of the application. Each service can be developed, deployed, and updated independently, allowing teams to work autonomously and release new features more frequently. This flexibility enables faster time-to-market and easier adaptation to changing business requirements. Resilience and fault isolation: Microservices improve system resilience and fault isolation by isolating failures to specific services. If one service experiences an issue or failure, it does not necessarily impact the entire application. This isolation minimizes downtime and improves system reliability, ensuring that the application remains available and responsive. Technology diversity: Microservices enable technology diversity by allowing each service to be built using the most suitable technology stack for its specific requirements. This flexibility enables teams to choose the right tools and technologies for each service, optimizing performance, development speed, and maintenance. Improved development and deployment processes: Microservices streamline development and deployment processes by breaking down complex applications into smaller, manageable components. This modular architecture simplifies testing, debugging, and maintenance tasks, making it easier for development teams to collaborate and iterate on software updates. Scalability and cost efficiency: Microservices enable organizations to scale their applications more efficiently by allocating resources only to the services that require them. This granular approach to resource allocation helps optimize costs and ensures that resources are used effectively, especially in cloud environments where resources are billed based on usage. Enhanced fault tolerance: Microservices architecture allows for better fault tolerance as services can be designed to gracefully degrade or fail independently without impacting the overall system. 
This ensures that critical functionalities remain available even in the event of failures or disruptions. Easier maintenance and updates: Microservices simplify maintenance and updates by allowing changes to be made to individual services without affecting the entire application. This reduces the risk of unintended side effects and makes it easier to roll back changes if necessary, improving overall system stability and reliability. Let's go ahead and look at the different microservices design patterns. Database per Service Pattern The database is one of the most important components of microservices architecture, but it isn’t uncommon for developers to overlook the database per service pattern when building their services. Database organization will affect the efficiency and complexity of the application. The most common options that a developer can use when determining the organizational architecture of an application are: Dedicated Database for Each Service A database dedicated to one service can’t be accessed by other services. This is one of the reasons such a setup is much easier to scale and understand from an end-to-end business perspective. Picture a scenario where your databases have different needs or access requirements. The data owned by one service may be largely relational, while a second service might be better served by a NoSQL solution and a third service may require a vector database. In this scenario, using a dedicated database for each service could help you manage them more easily. This structure also reduces coupling as one service can’t tie itself to the tables of another. Services are forced to communicate via published interfaces. The downside is that dedicated databases require a failure protection mechanism for events where communication fails. Single Database Shared by All Services A single shared database isn’t the standard for microservices architecture but bears mentioning as an alternative nonetheless. Here, the issue is that microservices using a single shared database lose many of the key benefits developers rely on, including scalability, robustness, and independence. Still, sharing a physical database may be appropriate in some situations. When a single database is shared by all services, though, it’s very important to enforce logical boundaries within it. For example, each service should own its own schema, and read/write access should be restricted to ensure that services can’t poke around where they don’t belong. Saga Pattern A saga is a series of local transactions. In microservices applications, a saga pattern can help maintain data consistency during distributed transactions. The saga pattern is an alternative solution to other design patterns that allow for multiple transactions by giving rollback opportunities. A common scenario is an e-commerce application that allows customers to purchase products using credit. Data may be stored in two different databases: One for orders and one for customers. The purchase amount can’t exceed the credit limit. To implement the Saga pattern, developers can choose between two common approaches. 1. Choreography Using the choreography approach, a service will perform a transaction and then publish an event. In some instances, other services will respond to those published events and perform tasks according to their coded instructions. These secondary tasks may or may not also publish events, according to presets.
In the example above, you could use a choreography approach so that each local e-commerce transaction publishes an event that triggers a local transaction in the credit service. 2. Orchestration An orchestration approach will perform transactions and publish events using an object to orchestrate the events, triggering other services to respond by completing their tasks. The orchestrator tells the participants what local transactions to execute. Saga is a complex design pattern that requires a high level of skill to successfully implement. However, the benefit of proper implementation is maintained data consistency across multiple services without tight coupling. API Gateway Pattern For large applications with multiple clients, implementing an API gateway pattern is a compelling option. One of the largest benefits is that it insulates the client from needing to know how services have been partitioned. However, different teams will value the API gateway pattern for different reasons. One of these possible reasons is that it grants a single entry point for a group of microservices by working as a reverse proxy between client apps and the services. Another is that clients don’t need to know how services are partitioned, and service boundaries can evolve independently since the client knows nothing about them. The client also doesn’t need to know how to find or communicate with a multitude of ever-changing services. You can also create a gateway for specific types of clients (for example, backends for frontends), which improves ergonomics and reduces the number of roundtrips needed to fetch data. Plus, an API gateway pattern can take care of crucial tasks like authentication, SSL termination, and caching, which makes your app more secure and user-friendly. Before moving on to the next pattern, there’s one more benefit to cover: Security. The primary way the pattern improves security is by reducing the attack surface area. By providing a single entry point, the API endpoints aren’t directly exposed to clients, and authorization and SSL can be efficiently implemented. Developers can use this design pattern to decouple internal microservices from client apps so a partially failed request can still be served. This ensures a whole request won’t fail because a single microservice is unresponsive. To do this, the API gateway utilizes the cache to provide an empty response or return a valid error code. Circuit Breaker Design Pattern This pattern is usually applied between services that are communicating synchronously. A developer might decide to utilize the circuit breaker when a service is exhibiting high latency or is completely unresponsive. The utility here is that failure across multiple systems is prevented when a single microservice is unresponsive. Therefore, calls won’t be piling up and using the system resources, which could cause significant delays within the app or even a string of service failures. Implementing this pattern requires an object that is called to monitor failure conditions. When a failure condition is detected, the circuit breaker will trip. Once this has been tripped, all calls to the circuit breaker will result in an error and be directed to a different service. Alternatively, calls can result in a default error message being retrieved.
There are three states of the circuit breaker pattern functions that developers should be aware of. These are: Open: A circuit breaker pattern is open when the number of failures has exceeded the threshold. When in this state, the microservice gives errors for the calls without executing the desired function. Closed: When a circuit breaker is closed, it’s in the default state and all calls are responded to normally. This is the ideal state developers want a circuit breaker microservice to remain in — in a perfect world, of course. Half-open: When a circuit breaker is checking for underlying problems, it remains in a half-open state. Some calls may be responded to normally, but some may not be. It depends on why the circuit breaker switched to this state initially. Command Query Responsibility Segregation (CQRS) A developer might use a command query responsibility segregation (CQRS) design pattern if they want a solution to traditional database issues like data contention risk. CQRS can also be used for situations when app performance and security are complex and objects are exposed to both reading and writing transactions. The way this works is that CQRS is responsible for either changing the state of the entity or returning the result in a transaction. Multiple views can be provided for query purposes, and the read side of the system can be optimized separately from the write side. This shift allows for a reduction in the complexity of all apps by separately querying models and commands so: The write side of the model handles persistence events and acts as a data source for the read side The read side of the model generates projections of the data, which are highly denormalized views Asynchronous Messaging If a service doesn’t need to wait for a response and can continue running its code post-failure, asynchronous messaging can be used. Using this design pattern, microservices can communicate in a way that’s fast and responsive. Sometimes this pattern is referred to as event-driven communication. To achieve the fastest, most responsive app, developers can use a message queue to maximize efficiency while minimizing response delays. This pattern can help connect multiple microservices without creating dependencies or tightly coupling them. While there are tradeoffs one makes with async communication (such as eventual consistency), it’s still a flexible, scalable approach to designing a microservices architecture. Event Sourcing The event-sourcing design pattern is used in microservices when a developer wants to capture all changes in an entity’s state. Using event stores like Kafka or alternatives will help keep track of event changes and can even function as a message broker. A message broker helps with the communication between different microservices, monitoring messages and ensuring communication is reliable and stable. To facilitate this function, the event sourcing pattern stores a series of state-changing events and can reconstruct the current state by replaying the occurrences of an entity. Using event sourcing is a viable option in microservices when transactions are critical to the application. This also works well when changes to the existing data layer codebase need to be avoided. Strangler-Fig Pattern Developers mostly use the strangler design pattern to incrementally transform a monolith application to microservices. This is accomplished by replacing old functionality with a new service — and, consequently, this is how the pattern receives its name. 
Once the new service is ready to be executed, the old service is “strangled” so the new one can take over. To accomplish this successful transfer from monolith to microservices, a facade interface is used by developers that allows them to expose individual services and functions. The targeted functions are broken free from the monolith so they can be “strangled” and replaced. Utilizing Design Patterns To Make Organization More Manageable Setting up the proper architecture and process tooling will help you create a successful microservice workflow. Use the design patterns described above and learn more about microservices in my blog to create a robust, functional app.
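As a small illustration of how one of these patterns translates into code, here is a hedged Python sketch of the circuit breaker described earlier. The thresholds, timeout, and fallback behavior are illustrative choices rather than a canonical implementation.

Python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, func, *args, fallback=None, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of letting calls pile up against a sick service.
                return fallback
            # Half-open: the timeout has elapsed, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback
        # A successful call closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result

# Usage: wrap a flaky downstream call and serve a default when it fails.
# breaker = CircuitBreaker()
# data = breaker.call(fetch_recommendations, user_id, fallback=[])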
A typical machine learning (ML) workflow involves processes such as data extraction, data preprocessing, feature engineering, model training and evaluation, and model deployment. As data changes over time, when you deploy models to production, you want your model to learn continually from the stream of data. This means supporting the model’s ability to autonomously learn and adapt in production as new data is added. In practice, data scientists often work with Jupyter Notebooks for development work and find it hard to translate from notebooks to automated pipelines. To achieve the two main functions of an ML service in production, namely retraining (retrain the model on newer labeled data) and inference (use the trained model to get predictions), you might primarily use the following: Amazon SageMaker: A fully managed service that provides developers and data scientists the ability to build, train, and deploy ML models quickly AWS Glue: A fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data In this post, we demonstrate how to orchestrate an ML training pipeline using AWS Glue workflows and train and deploy the models using Amazon SageMaker. For this use case, you use AWS Glue workflows to build an end-to-end ML training pipeline that covers data extraction, data processing, training, and deploying models to Amazon SageMaker endpoints. Use Case For this use case, we use the DBpedia Ontology classification dataset to build a model that performs multi-class classification. We trained the model using the BlazingText algorithm, which is a built-in Amazon SageMaker algorithm that can classify unstructured text data into multiple classes. This post doesn’t go into the details of the model but demonstrates a way to build an ML pipeline that builds and deploys any ML model. Solution Overview The following diagram summarizes the approach for the retraining pipeline. The workflow contains the following elements: AWS Glue crawler: You can use a crawler to populate the Data Catalog with tables. This is the primary method used by most AWS Glue users. A crawler can crawl multiple data stores in a single run. Upon completion, the crawler creates or updates one or more tables in your Data Catalog. ETL jobs that you define in AWS Glue use these Data Catalog tables as sources and targets. AWS Glue triggers: Triggers are Data Catalog objects that you can use to either manually or automatically start one or more crawlers or ETL jobs. You can design a chain of dependent jobs and crawlers by using triggers. AWS Glue job: An AWS Glue job encapsulates a script that connects source data, processes it, and writes it to a target location. AWS Glue workflow: An AWS Glue workflow can chain together AWS Glue jobs, data crawlers, and triggers, and build dependencies between the components. When the workflow is triggered, it follows the chain of operations as described in the preceding image. The workflow begins by downloading the training data from Amazon Simple Storage Service (Amazon S3), followed by running data preprocessing steps and dividing the data into train, test, and validate sets in AWS Glue jobs. The training job runs on a Python shell running in AWS Glue jobs, which starts a training job in Amazon SageMaker based on a set of hyperparameters. When the training job is complete, an endpoint is created, which is hosted on Amazon SageMaker. This job in AWS Glue takes a few minutes to complete because it makes sure that the endpoint is in InService status. 
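Before looking at the rest of the workflow, here is a rough idea of what the SageMaker call inside that AWS Glue Python shell job might look like. This is only a sketch: the job name, IAM role, S3 paths, and hyperparameters are illustrative placeholders rather than the values used in the post's actual scripts, and only the BlazingText image URI for us-west-2 comes from the deploy script shown later.

Python
import boto3

sagemaker = boto3.client("sagemaker")

sagemaker.create_training_job(
    TrainingJobName="dbpedia-blazingtext-2024-04-20",  # illustrative name
    AlgorithmSpecification={
        "TrainingImage": "433757028032.dkr.ecr.us-west-2.amazonaws.com/blazingtext:latest",
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::123456789012:role/GlueSageMakerRole",  # placeholder role
    InputDataConfig=[
        {
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://my-bucket/dbpedia/train/",
                "S3DataDistributionType": "FullyReplicated",
            }},
            "ContentType": "text/plain",
        },
    ],
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/dbpedia/output/"},
    ResourceConfig={"InstanceType": "ml.c5.xlarge", "InstanceCount": 1, "VolumeSizeInGB": 30},
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
    HyperParameters={"mode": "supervised", "epochs": "10"},  # BlazingText text classification
)

# The Glue job can then poll describe_training_job and, after creating the endpoint,
# describe_endpoint until the endpoint reaches the InService status.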
At the end of the workflow, a message is sent to an Amazon Simple Queue Service (Amazon SQS) queue, which you can use to integrate with the rest of the application. You can also use the queue to trigger an action to send emails to data scientists that signal the completion of training, add records to management or log tables, and more. Setting up the Environment To set up the environment, complete the following steps: Configure the AWS Command Line Interface (AWS CLI) and a profile to use to run the code. For instructions, see Configuring the AWS CLI. Make sure you have the Unix utility wget installed on your machine to download the DBpedia dataset from the internet. Download the following code into your local directory. Organization of Code The code to build the pipeline has the following directory structure:

--Glue workflow orchestration
  --glue_scripts
    --DataExtractionJob.py
    --DataProcessingJob.py
    --MessagingQueueJob.py
    --TrainingJob.py
  --base_resources.template
  --deploy.sh
  --glue_resources.template

The code directory is divided into three parts: AWS CloudFormation templates: The directory has two AWS CloudFormation templates: glue_resources.template and base_resources.template. The glue_resources.template template creates the AWS Glue workflow-related resources, and base_resources.template creates the Amazon S3, AWS Identity and Access Management (IAM), and SQS queue resources. The CloudFormation templates create the resources and write their names and ARNs to AWS Systems Manager Parameter Store, which allows easy and secure access to ARNs further in the workflow. AWS Glue scripts: The folder glue_scripts holds the scripts that correspond to each AWS Glue job. This includes the ETL as well as model training and deploying scripts. The scripts are copied to the correct S3 bucket when the bash script runs. Bash script: A wrapper script deploy.sh is the entry point to running the pipeline. It runs the CloudFormation templates and creates resources in the dev, test, and prod environments. You use the environment name, also referred to as stage in the script, as a prefix to the resource names. The bash script performs other tasks, such as downloading the training data and copying the scripts to their respective S3 buckets. However, in a real-world use case, you can extract the training data from databases as a part of the workflow using crawlers. Implementing the Solution Complete the following steps: Go to the deploy.sh file and replace algorithm_image name with <ecr_path> based on your Region. The following code example is a path for Region us-west-2:

Shell
algorithm_image="433757028032.dkr.ecr.us-west-2.amazonaws.com/blazingtext:latest"

For more information about BlazingText parameters, see Common parameters for built-in algorithms. Enter the following code in your terminal:

Shell
sh deploy.sh -s dev AWS_PROFILE=your_profile_name

This step sets up the infrastructure of the pipeline. On the AWS CloudFormation console, check that the templates have the status CREATE_COMPLETE. On the AWS Glue console, manually start the pipeline. In a production scenario, you can trigger this manually through a UI or automate it by scheduling the workflow to run at the prescribed time. The workflow provides a visual of the chain of operations and the dependencies between the jobs. To begin the workflow, in the Workflow section, select DevMLWorkflow. From the Actions drop-down menu, choose Run. View the progress of your workflow on the History tab and select the latest RUN ID.
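If you prefer to start the pipeline programmatically rather than through the console, a minimal boto3 call can launch the same workflow. This is a sketch; it assumes the dev-stage workflow name shown above and the AWS credentials configured earlier.

Python
import boto3

glue = boto3.client("glue")

# Start the AWS Glue workflow that the CloudFormation templates created for the dev stage.
run = glue.start_workflow_run(Name="DevMLWorkflow")
print("Started workflow run:", run["RunId"])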
The workflow takes approximately 30 minutes to complete. The following screenshot shows the view of the workflow post-completion. After the workflow is successful, open the Amazon SageMaker console. Under Inference, choose Endpoint. The following screenshot shows that the endpoint of the workflow deployed is ready. Amazon SageMaker also provides details about the model metrics calculated on the validation set in the training job window. You can further enhance model evaluation by invoking the endpoint using a test set and calculating the metrics as necessary for the application. Cleaning Up Make sure to delete the Amazon SageMaker hosting services—endpoints, endpoint configurations, and model artifacts. Delete both CloudFormation stacks to roll back all other resources. See the following code:

Python
def delete_resources(self):
    endpoint_name = self.endpoint
    try:
        sagemaker.delete_endpoint(EndpointName=endpoint_name)
        print("Deleted Test Endpoint ", endpoint_name)
    except Exception as e:
        print('Model endpoint deletion failed')
    try:
        sagemaker.delete_endpoint_config(EndpointConfigName=endpoint_name)
        print("Deleted Test Endpoint Configuration ", endpoint_name)
    except Exception as e:
        print(' Endpoint config deletion failed')
    try:
        sagemaker.delete_model(ModelName=endpoint_name)
        print("Deleted Test Endpoint Model ", endpoint_name)
    except Exception as e:
        print('Model deletion failed')

This post describes a way to build an automated ML pipeline that not only trains and deploys ML models using a managed service such as Amazon SageMaker, but also performs ETL within a managed service such as AWS Glue. A managed service unburdens you from allocating and managing resources, such as Spark clusters, and makes it easy to move from notebook setups to production pipelines.
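Once the endpoint is InService, you can sanity-check it before wiring it into the rest of the application. The following hedged sketch invokes a BlazingText text classification endpoint with boto3; the endpoint name is a placeholder, and the payload assumes BlazingText's JSON "instances" request format.

Python
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

payload = {"instances": ["The Eiffel Tower is a wrought-iron lattice tower in Paris."]}

response = runtime.invoke_endpoint(
    EndpointName="dbpedia-blazingtext-endpoint",  # placeholder endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)

# BlazingText returns the predicted label(s) and probabilities for each instance.
print(json.loads(response["Body"].read()))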
If your system is facing an imminent security threat—or worse, you’ve just suffered a breach—then logs are your go-to. If you’re a security engineer working closely with developers and the DevOps team, you already know that you depend on logs for threat investigation and incident response. Logs offer a detailed account of system activities. Analyzing those logs helps you fortify your digital defenses against emerging risks before they escalate into full-blown incidents. At the same time, your logs are your digital footprints, vital for compliance and auditing. Your logs contain a massive amount of data about your systems (and hence your security), and that leads to some serious questions: How do you handle the complexity of standardizing and analyzing such large volumes of data? How do you get the most out of your log data so that you can strengthen your security? How do you know what to log? How much is too much? Recently, I’ve been trying to use tools and services to get a handle on my logs. In this post, I’ll look at some best practices for using these tools—how they can help with security and identifying threats. And finally, I’ll look at how artificial intelligence may play a role in your log analysis. How To Identify Security Threats Through Logs Logs are essential for the early identification of security threats. Here’s how: Identifying and Mitigating Threats Logs are a gold mine of streaming, real-time analytics, and crucial information that your team can use to its advantage. With dashboards, visualizations, metrics, and alerts set up to monitor your logs you can effectively identify and mitigate threats. In practice, I’ve used both Sumo Logic and the ELK stack (a combination of Elasticsearch, Kibana, Beats, and Logstash). These tools can help your security practice by allowing you to: Establish a baseline of behavior and quickly identify anomalies in service or application behavior. Look for things like unusual access times, spikes in data access, or logins from unexpected areas of the world. Monitor access to your systems for unexpected connections. Watch for frequent and unusual access to critical resources. Watch for unusual outbound traffic that might signal data exfiltration. Watch for specific types of attacks, such as SQL injection or DDoS. For example, I monitor how rate-limiting deals with a burst of requests from the same device or IP using Sumo Logic’s Cloud Infrastructure Security. Watch for changes to highly critical files. Is someone tampering with config files? Create and monitor audit trails of user activity. This forensic information can help you to trace what happened with suspicious—or malicious—activities. Closely monitor authentication/authorization logs for frequent failed attempts. Cross-reference logs to watch for complex, cross-system attacks, such as supply chain attacks or man-in-the-middle (MiTM) attacks. Using a Sumo Logic dashboard of logs, metrics, and traces to track down security threats It’s also best practice to set up alerts to see issues early, giving you the lead time needed to deal with any threat. The best tools are also infrastructure agnostic and can be run on any number of hosting environments. Insights for Future Security Measures Logs help you with more than just looking into the past to figure out what happened. They also help you prepare for the future. Insights from log data can help your team craft its security strategies for the future. 
Benchmark your logs against your industry to help identify gaps that may cause issues in the future. Hunt through your logs for signs of subtle IOCs (indicators of compromise). Identify rules and behaviors that you can use against your logs to respond in real-time to any new threats. Use predictive modeling to anticipate future attack vectors based on current trends. Detect outliers in your datasets to surface suspicious activities What to Log. . . And How Much to Log So we know we need to use logs to identify threats both present and future. But to be the most effective, what should we log? The short answer is—everything! You want to capture everything you can, all the time. When you’re first getting started, it may be tempting to try to triage logs, guessing as to what is important to keep and what isn’t. But logging all events as they happen and putting them in the right repository for analysis later is often your best bet. In terms of log data, more is almost always better. But of course, this presents challenges. Who’s Going To Pay for All These Logs? When you retain all those logs, it can be very expensive. And it’s stressful to think about how much money it will cost to store all of this data when you just throw it in an S3 bucket for review later. For example, on AWS a daily log data ingest of 100GB/day with the ELK stack could create an annual cost of hundreds of thousands of dollars. This often leads to developers “self-selecting” what they think is — and isn’t — important to log. Your first option is to be smart and proactive in managing your logs. This can work for tools such as the ELK stack, as long as you follow some basic rules: Prioritize logs by classification: Figure out which logs are the most important, classify them as such, and then be more verbose with those logs. Rotate logs: Figure out how long you typically need logs and then rotate them off servers. You probably only need debug logs for a matter of weeks, but access logs for much longer. Log sampling: Only log a sampling of high-volume services. For example, log just a percentage of access requests but log all error messages. Filter logs: Pre-process all logs to remove unnecessary information, condensing their size before storing them. Alert-based logging: Configure alerts based on triggers or events that subsequently turn logging on or make your logging more verbose. Use tier-based storage: Store more recent logs on faster, more expensive storage. Move older logs to cheaper, slow storage. For example, you can archive old logs to Amazon S3. These are great steps, but unfortunately, they can involve a lot of work and a lot of guesswork. You often don’t know what you need from the logs until after the fact. A second option is to use a tool or service that offers flat-rate pricing; for example, Sumo Logic’s $0 ingest. With this type of service, you can stream all of your logs without worrying about overwhelming ingest costs. Instead of a per-GB-ingested type of billing, this plan bills based on the valuable analytics and insights you derive from that data. You can log everything and pay just for what you need to get out of your logs. In other words, you are free to log it all! Looking Forward: The Role of AI in Automating Log Analysis The right tool or service, of course, can help you make sense of all this data. And the best of these tools work pretty well. The obvious new tool to help you make sense of all this data is AI. 
With data that is formatted predictably, we can apply classification algorithms and other machine-learning techniques to find out exactly what we want to know about our application. AI can: Automate repetitive tasks like data cleaning and pre-processing Perform automated anomaly detection to alert on abnormal behaviors Automatically identify issues and anomalies faster and more consistently by learning from historical log data Identify complex patterns quickly Use large amounts of historical data to more accurately predict future security breaches Reduce alert fatigue by reducing false positives and false negatives Use natural language processing (NLP) to parse and understand logs Quickly integrate and parse logs from multiple, disparate systems for a more holistic view of potential attack vectors AI probably isn’t coming for your job, but it will probably make your job a whole lot easier. Conclusion Log data is one of the most valuable and available means to ensure your applications’ security and operations. It can help guard against both current and future attacks. And for log data to be of the most use, you should log as much information as you can. The last problem you want during a security crisis is to find out you didn’t log the information you need.
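As a toy illustration of the kind of automated analysis described above, the sketch below scans authentication log lines for sources with an unusually high number of failed logins. The log format, file name, and threshold are made up for the example; a real pipeline would learn its baseline from historical data rather than hardcoding one.

Python
import re
from collections import Counter

FAILED_LOGIN = re.compile(r"Failed password for .* from (?P<ip>\d+\.\d+\.\d+\.\d+)")

def suspicious_sources(log_lines, threshold=20):
    """Return IPs whose failed-login count exceeds the (illustrative) baseline threshold."""
    failures = Counter()
    for line in log_lines:
        match = FAILED_LOGIN.search(line)
        if match:
            failures[match.group("ip")] += 1
    return {ip: count for ip, count in failures.items() if count > threshold}

with open("auth.log") as f:  # hypothetical log file
    print(suspicious_sources(f))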
As the volume of data increases exponentially and queries become more complex, relationships become a critical component for data analysis. In turn, specialized solutions such as graph databases that explicitly optimize for relationships are needed. Other databases aren’t designed to be able to search and query data based on the intricate relationships found in complex data structures. Graph databases are optimized to handle connected data by modeling the information into a graph, which maps data through nodes and relationships. With this article, readers will traverse a beginner’s guide to graph databases, their terminologies, and comparisons with relational databases. They will also explore graph databases from cloud providers like AWS Neptune to open-source solutions. Additionally, this article can help develop a better understanding of how graph databases are useful for applications such as social network analysis, fraud detection, and many other areas. Readers will also learn how graph databases are used for applications like knowledge graph databases and social media analytics. What Is a Graph Database? A graph database is a purpose-built NoSQL database specializing in data structured in complex network relationships, where entities and their relationships have interconnections. Data is modeled using graph structures, and the essential elements of this structure are nodes, which represent entities, and edges, which represent the relationships between entities. The nodes and edges of a graph can all have attributes. Critical Components of Graph Databases Nodes These are the primary data elements representing entities such as people, businesses, accounts, or any other item you might find in a database. Each node can store a set of key-value pairs as properties. Edges Edges are the lines that connect nodes, defining their relationships. In addition to nodes, edges can also have properties – such as weight, type, or strength – that clarify their relationship. Properties Nodes and edges can each have properties that can be used to store metadata about those objects. These can include names, dates, or any other relevant descriptive attributes to a node or edge. How Graph Databases Store and Process Data In a graph database, nodes and relationships are considered first-class citizens — in contrast to relational databases, nodes are stored in tabular forms, and relationships are computed at query time. This lets graph databases treat the data relationships as having as much value as the data, which enables faster traversal of connected data. With their traversal algorithms, graph databases can explore the relationships between nodes and edges to answer complicated queries like the shortest path, fraud detection, or network analysis. Various graph-specific query languages – Neo4j’s Cypher and Tinkerpop’s Gremlin – enable these operations by focusing on pattern matching and deep-link analytics. Practical Applications and Benefits Graph databases shine in any application where the relationships between the data points are essential, such as web and social networks, recommendation engines, and a whole host of other apps where it’s necessary to know how deep and wide the relationships go. In areas such as fraud detection and network security, it’s essential to adjust and adapt dynamically; this is something graph databases do very well. In conclusion, graph databases offer a solid infrastructure for working with complex, highly connected data. 
They offer many advantages over relational databases regarding modeling relationships and the interactions between the data. Key Components and Terminology Nodes and Their Properties Nodes are the basic building blocks of a graph database. They typically represent some object or a specific instance, be it a person, place, or thing. For each node, we have a vertex in the graph structure. The node can also contain several properties. Each of these properties is a key-value pair, where the value expands or further clarifies the object, and its content depends on the application of the graph database. Edges: Defining Relationships Edges, on the other hand, are the links that tie the nodes together. They are directional, so they can have a start node and an end node (thus defining the flow between one node and another). These edges also define the nature of the relationship between the nodes they connect. Labels: Organizing Nodes The labels help group nodes that might have similarities (Person nodes, Company nodes, etc.) so that graph databases can retrieve sets of nodes more quickly. For example, in a social network analysis, Person and Company nodes might be grouped using labels. Relationships and Their Characteristics Relationships connect nodes, but they also have properties, such as strength, status, or duration, that can define how the relationship might differ between nodes. Graph Query Languages: Cypher and Gremlin Graph databases require specialized query languages to work with their often complicated structure, and these languages differ from one graph database to another. Cypher, used with Neo4j, is a declarative, pattern-based language. Gremlin, used with other graph databases, is more procedural and can traverse more complex graph structures. Both languages are expressive and powerful, capable of queries that would be veritable nightmares written in the languages used with traditional databases. Tools for Managing and Exploring Graph Data Neo4j offers a suite of tools designed to enhance the usability of graph databases: Neo4j Bloom: Explore graph data visually without using a graph query language. Neo4j Browser: A web-based application for executing Cypher queries and visualizing the results. Neo4j Data Importer and Neo4j Desktop: These tools are for importing data into a Neo4j database and managing Neo4j database instances, respectively. Neo4j Ops Manager: Useful for managing multiple Neo4j instances to ensure that large-scale deployments can be managed and optimized. Neo4j Graph Data Science: This library is an extension of Neo4j that adds capabilities more commonly associated with data science, enabling sophisticated analytical tasks to be performed directly on graph data. Equipped with these fundamental components and tools, users can wield the power of graph databases to handle complex data and make knowledgeable decisions based on networked knowledge systems. Comparing Graph Databases With Other Databases While graph and relational databases are designed to store and help us make sense of data, they fundamentally differ in how they accomplish this. Graph databases are built on the foundation of nodes and edges, making them uniquely fitted for dealing with complex relationships between data points. That foundation’s core is structure, representing connected entities through nodes and their relationships through edges.
Relational databases arrange data in ‘rows and columns’ – tables, whereas graph databases are ‘nodes and edges.’ This difference in structure makes such a direct comparison between the two kinds of databases compelling. Graph databases organize data in this way naturally, whereas it’s not as easy to represent relationships between certain types of data points in relational databases. After all, they were invented to deal with transactions (i.e., a series of swaps of ‘rows and columns’ between two sides, such as a payment or refund between a seller and a customer). Data Models and Scalability Graph databases store data in a graph with nodes, edges, and properties. They are instrumental in domains with complex relationships, such as social networks or recommendation engines. As an example of the opposite end of the spectrum, relational databases contain data in tables, which is well-suited for applications requiring high levels of data integrity (i.e., applications such as those involved in financial systems or managing customer relationships). Another benefit, for example, is their horizontal scalability: graph databases grow proportionally to their demands by adding more machines to a network instead of the vertical scalability (adding more oomph to an existing machine) typical for a relational database. Query Performance and Flexibility One reason is that graph databases are generally much faster at executing complex queries with deep relationships because they can traverse nodes and edges—unlike relational databases, which might have to perform lots of joins that could speed up or slow down depending on the size of the data set. In addition, graph databases excel in the ease with which the data model can be changed without severe consequences. As business requirements evolve and users learn more about how their data should interact, a graph database can be more readily adapted without costly redesigns. Though better suited for providing strong transactional guarantees or ACID compliance, relational databases are less adept at model adjustments. Use of Query Languages The different languages of query also reflect the distinct nature of these databases. Whereas graph databases tend to use a language tailored to the way a graph is traversed—such as Gremlin or Cypher—relational databases have long been managed and queried through SQL, a well-established language for structured data. Suitability for Different Data Types Relational databases are well suited for handling large datasets with a regular and relatively simple structure. In contrast, graph databases shine in environments where the structures are highly interconnected, and the relationships are as meaningful as the data. In conclusion, while graph and relational databases have pros and cons, which one to use depends on the application’s requirements. Graph databases are better for analyzing intricate and evolving relationships, which makes them ideal for modern applications that demand a detailed understanding of networked data. Advantages of Graph Databases Graph databases are renowned for their efficiency and flexibility, mainly when dealing with complex, interconnected data sets. Here are some of the key advantages they offer: High Performance and Real-Time Data Handling Performance is a huge advantage for graph databases. It comes from the ease, speed, and efficiency with which it can query linked data. Graph databases often beat relational databases at handling complex, connected data. 
They are well suited to continual, real-time updates and queries, unlike, e.g., Hadoop HDFS. Enhanced Data Integrity and Contextual Awareness Keeping these connections intact across channels and data formats, graph databases maintain rich data relationships and allow that data to be easily linked. This structure surfaces nuance in interactions humans could not otherwise discern, saving time and making the data more consumable. It gives users relevant insights to understand the data better and helps businesses make more informed decisions. Scalability and Flexibility Graph databases have been designed to scale well. They can accommodate the incessant expansion of the underlying data and the constant evolution of the data schema without downtime. They can also scale well in terms of the number of data sources they can link, and again, this linking can temporarily accommodate a continuous evolution of the schema without interrupting service. They are, therefore, particularly well-suited to environments in which rapid adaptation is essential. Advanced Query Capabilities These graphs-based systems can quickly run powerful recursive path queries to retrieve direct (‘one hop’) and indirect (‘two hops’ and ‘twenty hops’) connections, making running complex subgraph pattern-matching queries easy. Moreover, complex group-by-aggregate queries (such as Netflix’s tag aggregation) are also natively supported, allowing arbitrary degree flexibility in aggregating selective dimensions, such as in big-data setups with multiple dimensions, such as time series, demographics, or geographics. AI and Machine Learning Readiness The fact that graph databases naturally represent entities and inter-relations as a structured set of connections makes them especially well-suited for AI and machine-learning foundational infrastructures since they support fast real-time changes and rely on expressive, ergonomic declarative query languages that make deep-link traversal and scalability a simple matter – two features that are critical in the case of next-generation data analytics and inference. These advantages make graph databases a good fit for an organization that needs to manage and efficiently draw meaningful insights from dataset relationships. Everyday Use Cases for Graph Databases Graph databases are being used by more industries because they are particularly well-suited for handling complex connections between data and keeping the whole system fast. Let’s look at some of the most common uses for graph databases. Financial and Insurance Services The financial and insurance services sector increasingly uses graph databases to detect fraud and other risks; how these systems model business events and customer data as a graph allows them to detect fraud and suspicious links between various entities, and the technique of Entity Link Analysis takes this a step further, allowing the detection of potential fraud in the interactions between different kinds of entities. Infrastructure and Network Management Graph databases are well-suited for infrastructure mapping and keeping network inventories up to date. Serving up an interactive map of the network estate and performing network tracing algorithms to walk across the graph is straightforward. Likewise, it makes writing new algorithms to identify problematic dependencies, vulnerable bottlenecks, or higher-order latency issues much easier. 
Recommendation Systems Many companies – including major e-commerce giants like Amazon – use graph databases to power recommendation engines. These keep track of which products and services you’ve purchased and browsed in the past to suggest things you might like, improving the customer experience and engagement. Social Networking Platforms Social networks such as Facebook, Twitter, and LinkedIn all use graph databases to manage and query huge amounts of relational data concerning people, their relationships, and interactions. This makes them very good at quickly navigating across vast social networks, finding influential users, detecting communities, and identifying key players. Knowledge Graphs in Healthcare Healthcare organizations assemble critical knowledge about patient profiles, past ailments, and treatments in knowledge graphs, while graph queries implemented on graph databases identify patient patterns and trends. These can influence how treatments proceed positively and how patients fare. Complex Network Monitoring Graph databases are used to model and monitor complex network infrastructures, including telecommunications networks or end-to-end environments of clouds (data-center infrastructure including physical networking, storage, and virtualization). This application is undoubtedly crucial for the robustness and scalability of those systems and environments that form the essential backbone of the modern information infrastructure. Compliance and Governance Organizations also use graph databases to manage data related to compliance and governance, such as access controls, data retention policies, and audit trails, to ensure they can continue to meet high standards of data security and regulatory compliance. AI and Machine Learning Graph databases are also essential for developing artificial intelligence and machine learning applications. They allow developers to create standardized means of storing and querying data for applications such as natural language processing, computer vision, and advanced recommendation systems, which is essential for making AI applications more intelligent and responsive. Unraveling Financial Crimes Graphs provide a way to trace the structure of shell corporate entities that criminals use to launder money, studying whether the patterns of supplies to shell companies and cash flows from shell companies to other entities are suspicious. Such applications are helpful for law enforcement and regulatory agencies to unravel complex money laundering networks and fight against financial crime. Automotive Industry In the automotive industry, graph queries help analyze the relationships between tens of thousands of car parts, enabling real-time interactive analysis that has the potential to improve manufacturing and maintenance processes. Criminal Network Analysis In law enforcement, graph databases are used to identify criminal networks, address patterns, and identify critical links in criminal organizations to bring operations down efficiently from all sides. Data Lineage Tracking Graph technology can also track data lineage (the details of where an item of data, such as a fact or number, was created, how it was copied, and where it was used). This is important for auditing and verifying that data assets are not corrupted. This diverse array of applications underscores the versatility of graph databases and their utility in representing and managing complex, interconnected data across multiple diverse fields. 
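To ground the terminology from earlier (nodes, relationships, properties, and Cypher), here is a hedged sketch using the official Neo4j Python driver. The connection details and the tiny social-graph data are placeholders for illustration only.

Python
from neo4j import GraphDatabase

# Placeholder connection details for a local Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Create two Person nodes and a FOLLOWS relationship that carries a property.
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:FOLLOWS {since: 2023}]->(b)",
        a="Alice", b="Bob",
    )

    # Traverse the relationship: who does Alice follow?
    result = session.run(
        "MATCH (:Person {name: $name})-[:FOLLOWS]->(followed) RETURN followed.name AS name",
        name="Alice",
    )
    for record in result:
        print(record["name"])

driver.close()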
Challenges and Considerations Graph databases are built around modeling the structures of a specific domain, in a process that resembles knowledge or ontology engineering, which is a practical challenge that can require specialized "graph data engineers." All these requirements point to important scalability issues and can limit the technology's appeal beyond its most committed proponents. Inconsistency of data across the system remains a critical issue since developing homogeneous systems that can maintain data consistency while preserving flexibility and expressivity is challenging. While graph queries don’t require as much coding as SQL, paths for traversal across the data still have to be spelled out explicitly. This increases the effort needed to write queries and prevents graph queries from being as easily abstracted and reused as SQL code, impairing their generalization. Furthermore, because there isn’t a unified standard for capabilities or query languages, developers invent their own – a further step in API fragmentation. Another significant issue is knowing which machine is the best place to put a given piece of data: given all the subtle relationships between nodes, that decision is crucial to performance but hard to make on the fly. In addition, many existing graph database systems weren’t architected for today’s high volumes of data, so they can end up being performance bottlenecks. From a project management standpoint, failure to accurately capture and map business requirements to technical requirements often results in confusion and delay. Poor data quality, inadequate access to data sources, or verbose and time-consuming data modeling will magnify the pain of a graph data project. On the end-user side, asking people to learn new languages or skills in order to read some graphs could deter adoption, while the difficulty of sharing those graphs or collaborating on the analysis will eventually lower the range and impact of the insights. Simplicity was a decisive early advantage for interfaces such as Windows 95, and the same story applies to graph technologies today: adoption is also hindered when the analysis process is criticized as too time-consuming. From a technical perspective, managing large graphs by storing and querying complex structures presents more significant challenges. For example, the data must be distributed on a cluster of multiple machines, adding another level of complexity for developers. Data is typically sharded (split) into smaller parts and stored on various machines, coordinated by an "intelligent" virtual server managing access control and queries across multiple shards. Choosing the Right Graph Database When selecting a graph database, it’s crucial to consider the queries’ complexity and the data’s interconnectedness. A well-chosen graph database can significantly enhance the performance and scalability of data-driven applications. Key Factors to Consider Native graph storage and processing: Opt for databases designed from the ground up to handle graph data structures. Property graphs and graph query languages: Ensure the database supports robust graph query languages and can handle property graphs efficiently. Data ingestion and integration capabilities: The ability to seamlessly integrate and ingest data from various sources is vital for dynamic data environments.
Development tools and graph visualization: Tools that facilitate development and allow intuitive graph visualizations to improve usability and insights. Graph data science and analytics: Databases with advanced analytics and data science capabilities can provide deeper insights. Support for OLTP, OLAP, and HTAP: Depending on the application, support for transactional (OLTP), analytical (OLAP), and hybrid (HTAP) processing may be necessary. ACID compliance and system durability: Essential for ensuring data integrity and reliability in transaction-heavy environments Scalability and performance: The database should scale vertically and horizontally to handle growing data loads. Enterprise security and privacy features: Robust security features are crucial to protect sensitive data and ensure privacy. Deployment flexibility: The database should match the organization’s deployment strategy, whether on-premises or cloud. Open-source foundation and community support: A strong community and open-source foundation can provide extensive support and flexibility. Business and technology partnerships: Partnerships can offer additional support and integration options, enhancing the database’s capabilities. Comparing Popular Graph Databases Dgraph: This is the most performant and scalable option for enterprise systems that need to handle massive amounts of fast-flowing data. Memgraph: An open-source, in-memory storage database with a query language specially designed for real-time data and analytics Neo4j: Offers a comprehensive graph data science library and is well-suited for static data storage and Java-oriented developers Each of these databases has its advantages: Memgraph is the strongest contender in the Python ecosystem (you can choose Python, C++, or Rust for your custom stored procedures), and Neo4j’s managed solution offers the most control over your deployment into the cloud (its AuraDB service provides a lot of power and flexibility). Community and Free Resources Memgraph has a free community edition and a paid enterprise edition, and Neo4j has a community "Labs" edition, a free enterprise trial, and hosting services. These are all great ways for developers to get their feet wet without investing upfront. In conclusion, choosing the proper graph database to use is contingent upon understanding the realities of your project well enough and the potential of the database to which you are selecting. If you bear this notion in mind, your organization will be using graph databases to their full potential to enhance its data infrastructure and insights. Conclusion Having navigated through the expansive realm of graph databases, the hope is that you now know not only the basics of these beautiful databases, from nodes to edges, from vertex storage to indexing, but also those of their applications across industries, including finance, government, and healthcare. This master guide comprehensively introduces graph databases, catering to sophomores and seniors in the database field. Now, every reader of this broad stratum is fully prepared to take the following steps in understanding how graph databases work, how they compare against traditional and non-relational databases, and where they are utilized in the real world. We have seen that choosing a graph database requires careful consideration of the project’s requirements and features. 
The reflections and difficulties highlighted the importance of correct implementation and the advantage of the graph database in changing our way of processing and looking at data. The graph databases’ complexity and power allow us to provide new insights and be more efficient in computation. In this way, new data management and analysis methods may be developed. References Graph Databases for Beginners How to choose a graph database: we compare 6 favorites What is A Graph Database? A Beginner's Guide Video: What is a graph database? (in 10 minutes) AWS: What Is a Graph Database? Neo4j: What is a graph database? Wikipedia: Graph database Geeks for Geeks: What is Graph Database – Introduction Memgraph: What is a Graph Database? Geeks for Geeks: Introduction to Graph Database on NoSQL Graph Databases: A Comprehensive Overview and Guide. Part1 Graph database concepts AWS: What’s the Difference Between a Graph Database and a Relational Database? Comparison of Relational Databases and Graph Databases Nebula Graph: Graph Database vs Relational Database: What to Choose? Graph database vs. relational database: Key differences The ultimate guide to graph databases Neo4j: Why Graph Databases? What is a Graph Database and What are the Benefits of Graph Databases What Are the Major Advantages of Using a Graph Database? Graph Databases for Beginners: Why Graph Technology Is the Future Understanding Graph Databases: Unleashing the Power of Connected Data in Data Science Use cases for graph databases 7 Graph Database Use Cases That Will Change Your Mind When Connected Data Matters Most 17 Use Cases for Graph Databases and Graph Analytics The Challenges of Working with a Graph Database Where the Path Leads: State of the Art and Challenges of Graph Database Systems 5 Reasons Graph Data Projects Fail 16 Things to Consider When Selecting the Right Graph Database How to Select a Graph Database: Best Practices at RoyalFlush Neo4j vs Memgraph - How to Choose a Graph Database?
With recent achievements in and attention to LLMs, and the resulting Artificial Intelligence "Summer," there has been a renaissance in model training methods aimed at reaching the most optimal, performant model as quickly as possible. Much of this has been achieved through brute scale: more chips, more data, more training steps. However, many teams have focused on how to train these models more efficiently and intelligently to achieve the desired results. Training an LLM typically includes the following phases:

Pretraining: This initial phase lays the foundation, taking the model from a set of inert neurons to a basic language generator. While the model ingests vast amounts of data (e.g., the entire internet), the outputs at this stage are often nonsensical, though not entirely gibberish.
Supervised Fine-Tuning (SFT): This phase elevates the model from its unintelligible state, enabling it to generate more coherent and useful outputs. SFT involves providing the model with specific examples of desired behavior, teaching it what is considered "helpful, useful, and sensible." Models can be deployed and used in production after this stage.
Reinforcement Learning (RL): Taking the model from "working" to "good," RL goes beyond explicit instruction and allows the model to learn the implicit preferences and desires of users through labeled preference data. This enables developers to encourage desired behaviors without needing to explicitly define why those behaviors are preferred.
In-context learning: Also known as prompt engineering, this technique allows users to influence model behavior directly at inference time. By employing methods like constraints and N-shot learning, users can tune the model's output to suit specific needs and contexts.

Note that this is not an exhaustive list; many other methods and phases may be incorporated into individual training pipelines.

Introducing Reward and Reinforcement Learning

Humans excel at pattern recognition, often learning and adapting without conscious effort. Our intellectual development can be seen as a continuous process of increasingly complex pattern recognition. A child learns not to jump in puddles after experiencing negative consequences, much like an LLM undergoing SFT. Similarly, a teenager observing social interactions learns to adapt their behavior based on positive and negative feedback, which is the essence of Reinforcement Learning.

Reinforcement Learning in Practice: The Key Components

Preference data: Reinforcement Learning for LLMs typically requires a prompt/input and multiple (often two) example outputs in order to demonstrate a 'gradient,' showing that certain behaviors are preferred relative to others. For example, in RLHF, human users may be presented with a prompt and two candidate responses and asked to choose which they prefer; in other methods, they may be presented with an output and asked to improve on it in some way (where the improved version is captured as the 'preferred' option).
Reward model: A reward model is trained directly on the preference data. For a set of responses to a given input, each response can be assigned a scalar value representing its 'rank' within the set (for binary examples, this can be 0 and 1). The reward model is then trained to predict these scalar values for a novel input and output pair; that is, the RM learns to reproduce or predict a user's preference. A minimal training sketch follows this list.
Generator model: This is the final intended artifact, the model whose outputs the reinforcement learning process ultimately shapes.
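To make the reward model component more concrete, here is a minimal, illustrative PyTorch sketch of pairwise reward-model training. It assumes responses have already been encoded into fixed-size embeddings (the encoder, dimensions, and dummy data are assumptions for illustration; in practice the reward model is usually a full transformer with a scalar head). The loss is the standard pairwise objective of pushing the preferred response's score above the rejected one's.

```python
# Minimal sketch of pairwise reward-model training (illustrative only).
# Assumes each (chosen, rejected) response pair has already been encoded
# into fixed-size embeddings; real reward models are typically transformers
# with a scalar output head.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 768  # placeholder embedding size


class RewardModel(nn.Module):
    def __init__(self, emb_dim: int = EMB_DIM):
        super().__init__()
        # Small MLP mapping a response embedding to a scalar reward.
        self.scorer = nn.Sequential(
            nn.Linear(emb_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.scorer(emb).squeeze(-1)  # shape: (batch,)


def pairwise_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Encourage the preferred response to score higher than the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()


reward_model = RewardModel()
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-4)

# Dummy batch standing in for encoded preference data.
chosen_emb = torch.randn(32, EMB_DIM)
rejected_emb = torch.randn(32, EMB_DIM)

loss = pairwise_loss(reward_model(chosen_emb), reward_model(rejected_emb))
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"pairwise loss: {loss.item():.4f}")
```

Once trained, the model's scalar output can be used to score novel generations, standing in for a human annotator's preference.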
In simplified terms, during the reinforcement training process, the Generator model produces an output, which is then scored by the Reward Model; the resulting reward is fed back to the algorithm, which decides how to update the Generator model. For example, the algorithm will update the model to increase the odds of generating a given output when it receives a positive reward, and do the opposite in a negative-reward scenario.

In the LLM landscape, RLHF has been a dominant force. By gathering large volumes of human preference data, RLHF has enabled significant advancements in LLM performance. However, this approach is expensive, time-consuming, and susceptible to biases and vulnerabilities. This limitation has spurred the exploration of alternative methods for obtaining reward information at scale, paving the way for the emergence of RLAIF, an approach poised to redefine the future of AI development.

Understanding RLAIF: A Technical Overview of Scaling LLM Alignment With AI Feedback

The core idea behind RLAIF is both simple and profound: if LLMs can generate creative text formats like poems, scripts, and even code, why can't they teach themselves? This concept of self-improvement promises to unlock unprecedented levels of quality and efficiency, surpassing the limitations of RLHF. And this is precisely what researchers have achieved with RLAIF. As with any form of Reinforcement Learning, the key lies in assigning value to outputs and training a Reward Model to predict those values. RLAIF's innovation is the ability to generate these preference labels automatically, at scale, without relying on human input. While all LLMs ultimately stem from human-generated data in some form, RLAIF leverages existing LLMs as "teachers" to guide the training process, eliminating the need for continuous human labeling. Using this method, the authors have been able to achieve comparable or even better results from RLAIF than from RLHF. (The original research includes a graph of 'Harmless Response Rate' comparing the various approaches.)

To achieve this, the authors developed a number of methodological innovations:

In-context learning and prompt engineering: RLAIF leverages in-context learning and carefully designed prompts to elicit preference information from the teacher LLM. These prompts provide context, examples (for few-shot learning), and the samples to be evaluated. The teacher LLM's output then serves as the reward signal.
Chain-of-thought reasoning: To enhance the teacher LLM's reasoning capabilities, RLAIF employs Chain-of-Thought (CoT) prompting. While the reasoning process itself isn't directly used, it leads to more informed and nuanced preference judgments from the teacher LLM.
Addressing position bias: To mitigate the influence of response order on the teacher's preference, RLAIF averages preferences obtained from multiple prompts with varying response orders.

To understand this a little more directly, imagine the AI you are trying to train as a student, learning and improving through a continuous feedback loop. Then imagine an off-the-shelf AI that has already been through extensive training as the teacher. The teacher rewards the student for taking certain actions, coming up with certain responses, and so on, and punishes it otherwise. It does this by 'testing' the student: giving it quizzes in which the student must select the optimal response.
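The sketch below illustrates the preference-elicitation idea described above, including the position-bias mitigation of averaging judgments over both response orderings. The prompt template and the `query_teacher` function are hypothetical stand-ins (not from the original article) for whatever teacher-LLM API you use; `query_teacher` is assumed to return the probability that the first listed response is preferred.

```python
# Illustrative sketch: eliciting a preference label from a teacher LLM,
# averaging over both response orderings to reduce position bias.
# `query_teacher` is a hypothetical stand-in for your teacher-LLM API call;
# it is assumed to return P(Response A is preferred) as a float in [0, 1].

PREFERENCE_TEMPLATE = """You are evaluating two candidate responses to a prompt.
Prompt: {prompt}

Response A: {response_a}
Response B: {response_b}

Think step by step about which response is more helpful and harmless,
then answer with the probability that Response A is the better one."""


def query_teacher(meta_prompt: str) -> float:
    raise NotImplementedError("Replace with a call to your teacher LLM of choice.")


def preference_label(prompt: str, response_1: str, response_2: str) -> float:
    """Return the teacher's probability that response_1 is preferred over response_2."""
    # Ask with response_1 listed first ...
    p_forward = query_teacher(
        PREFERENCE_TEMPLATE.format(
            prompt=prompt, response_a=response_1, response_b=response_2
        )
    )
    # ... and again with the order swapped, so position bias averages out.
    p_reversed = query_teacher(
        PREFERENCE_TEMPLATE.format(
            prompt=prompt, response_a=response_2, response_b=response_1
        )
    )
    return (p_forward + (1.0 - p_reversed)) / 2.0
```

The resulting soft labels can be stored alongside the prompt and responses as the synthetic preference data that drives the rest of the pipeline.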
These tests are generated via 'contrastive' prompts, where the teacher produces slightly different responses by slightly varying the prompt. For example, in the context of code generation, one prompt might encourage the LLM to generate efficient code, potentially at the expense of readability, while the other emphasizes code clarity and documentation. The teacher then assigns its own preference as the 'ground truth' and asks the student to indicate what it thinks is the preferred output. By comparing the student's responses under these contrasting prompts, RLAIF assesses which response better aligns with the desired attribute.

The student, meanwhile, aims to maximize the accumulated reward. Every time it is punished, it changes something about itself so that it doesn't make the same mistake, and get punished, again. When it is rewarded, it reinforces that behavior so it is more likely to reproduce the same response in the future. In this way, over successive quizzes, the student gets better and better and is punished less and less. While punishments never go to zero, the student does converge to some minimum that represents the optimal performance it is able to achieve. From there, future inferences made by the student are likely to be of much higher quality than if RLAIF were not employed.

The evaluation of synthetic (LLM-generated) preference data is crucial for effective alignment. RLAIF utilizes a "self-rewarding" score, which compares the generation probabilities of two responses under contrastive prompts. This score reflects the relative alignment of each response with the desired attribute. Finally, Direct Preference Optimization (DPO), an efficient RL algorithm, leverages these self-rewarding scores to optimize the student model, encouraging it to generate responses that align with human values. DPO directly optimizes an LLM toward preferred responses without needing to explicitly train a separate reward model (a minimal sketch of the DPO objective appears at the end of this article).

RLAIF in Action: Applications and Benefits

RLAIF's versatility extends to various tasks, including summarization, dialogue generation, and code generation. Research has shown that RLAIF can achieve comparable or even superior performance to RLHF while significantly reducing reliance on human annotations. This translates to substantial cost savings and faster iteration cycles, making RLAIF particularly attractive for rapidly evolving LLM development. Moreover, RLAIF opens doors to a future of "closed-loop" LLM improvement. As the student model becomes better aligned through RLAIF, it can, in turn, be used as a more reliable teacher model for subsequent RLAIF iterations. This creates a positive feedback loop, potentially leading to continual improvement in LLM alignment without additional human intervention.

So how can you leverage RLAIF? It's actually quite simple if you already have an RL pipeline:

Prompt set: Start with a set of prompts designed to elicit the desired behaviors. Alternatively, you can use an off-the-shelf LLM to generate these prompts.
Contrastive prompts: For each prompt, create two slightly varied versions that emphasize different aspects of the target behavior (e.g., helpfulness vs. safety). LLMs can also automate this process.
Response generation: Capture the responses from the student LLM for each prompt variation.
Preference elicitation: Create meta-prompts to obtain preference information from the teacher LLM for each prompt-response pair.
RL pipeline integration: Utilize the resulting preference data within your existing RL pipeline to guide the student model's learning and optimization.

Challenges and Limitations

Despite its potential, RLAIF faces challenges that require further research. The accuracy of AI annotations remains a concern, as biases from the teacher LLM can propagate to the student model. Furthermore, biases incorporated into this preference data can eventually become 'crystallized' in the teacher LLM, which makes them difficult to remove afterward. Additionally, studies have shown that RLAIF-aligned models can sometimes generate responses with factual inconsistencies or decreased coherence. Addressing these issues necessitates exploring techniques to improve the factual grounding and overall quality of the generated text and to enhance the reliability, quality, and objectivity of AI feedback. Furthermore, the theoretical underpinnings of RLAIF require careful examination. While the effectiveness of self-rewarding scores has been demonstrated, further analysis is needed to understand their limitations and refine the underlying assumptions.

Emerging Trends and Future Research

RLAIF's emergence has sparked exciting research directions. Comparing it with other RL methods, such as Reinforcement Learning from Execution Feedback (RLEF), can provide valuable insights into their respective strengths and weaknesses. One direction involves investigating fine-grained feedback mechanisms that provide more granular rewards at the individual token level, potentially leading to more precise and nuanced alignment outcomes. Another promising avenue explores the integration of multimodal information, incorporating data from images and videos to enrich the alignment process and foster a more comprehensive understanding within LLMs. Drawing inspiration from human learning, researchers are also exploring the application of curriculum learning principles in RLAIF, gradually increasing the complexity of tasks to enhance the efficiency and effectiveness of the alignment process. Additionally, investigating the potential for a positive feedback loop in RLAIF, leading to continual LLM improvement without human intervention, represents a significant step toward a more autonomous and self-improving AI ecosystem. There may also be an opportunity to improve the quality of this approach by grounding feedback in the real world. For example, if the agent were able to execute code, perform real-world experiments, or integrate with a robotic system to 'instantiate' feedback in the real world, it could capture more accurate and reliable preference information without losing the advantages of scale.

However, ethical considerations remain paramount. As RLAIF empowers LLMs to shape their own alignment, it's crucial to ensure responsible development and deployment. Establishing robust safeguards against potential misuse and mitigating biases inherited from teacher models are essential for building trust and ensuring the ethical advancement of this technology. As mentioned previously, RLAIF has the potential to propagate and amplify biases present in the source data, which must be carefully examined before scaling this approach.

Conclusion: RLAIF as a Stepping Stone to Aligned AI

RLAIF presents a powerful and efficient approach to LLM alignment, offering significant advantages over traditional RLHF methods.
Its scalability, cost-effectiveness, and potential for self-improvement hold immense promise for the future of AI development. While current challenges and limitations must be acknowledged, ongoing research efforts are actively paving the way for a more reliable, objective, and ethically sound RLAIF framework. As we continue to explore this exciting frontier, RLAIF stands as a stepping stone toward a future where LLMs seamlessly integrate with human values and expectations, unlocking the full potential of AI for the benefit of society.
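As a closing illustration of the DPO objective mentioned earlier, here is a minimal, self-contained PyTorch sketch of the DPO loss computed from per-example sequence log-probabilities. The log-probability tensors below are dummy placeholders for sums of token log-probs from the trainable policy and a frozen reference model, and beta is the usual DPO temperature hyperparameter; this is an illustrative sketch under those assumptions, not the exact implementation used in the cited research.

```python
# Minimal sketch of the Direct Preference Optimization (DPO) loss.
# Inputs are per-example sequence log-probabilities (sums of token log-probs)
# for the preferred ("chosen") and dispreferred ("rejected") responses under
# the trainable policy and a frozen reference model. Values below are dummies.
import torch
import torch.nn.functional as F


def dpo_loss(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    ref_chosen_logps: torch.Tensor,
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    # Implicit rewards are the log-probability ratios against the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


# Dummy batch of four preference pairs standing in for real model outputs.
policy_chosen = torch.tensor([-12.3, -10.1, -15.7, -9.8], requires_grad=True)
policy_rejected = torch.tensor([-13.0, -11.5, -14.9, -12.2], requires_grad=True)
ref_chosen = torch.tensor([-12.8, -10.6, -15.2, -10.3])
ref_rejected = torch.tensor([-12.9, -11.0, -15.5, -11.8])

loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
loss.backward()
print(f"DPO loss: {loss.item():.4f}")
```

Because the reward is implicit in the probability ratios, no separate reward model needs to be trained, which is what makes DPO a natural fit for the self-rewarding scores described above.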