IBM TechXchange Korean CyberSecurity User Group (한국 사이버보안 사용자 그룹)

This online user group is intended for IBM Security product users in Korea to communicate with IBM experts, share advice and best practices with peers and stay up to date regarding product enhancements, regional user group meetings, webinars, how-to blogs and other helpful materials.

View Only

Back to Blog List

프롬프트 인젝션으로 AI를 해킹하는 방법: NIST 보고서

By Kyoyoung Choi posted Mon June 03, 2024 09:42 PM

프롬프트 인젝션으로 AI를 해킹하는 방법: NIST 보고서

How AI can be hacked with prompt injection: NIST report

미국 국립표준기술연구소(NIST)는 AI 라이프 사이클을 밀접하게 관찰하고 있는데, 그만한 이유가 있습니다. AI가 확산됨에 따라 AI 사이버 보안 취약점의 발견과 악용도 증가하고 있기 때문입니다. 프롬프트 인젝션은 특히 생성형 AI를 겨냥하는 취약점 중 하나입니다.

NIST의 ‘적대적 머신 러닝: 공격과 완화에 대한 분류 및 용어'에서는 프롬프트 인젝션과 같은 다양한 적대적 머신 러닝(AML) 전술과 사이버 공격을 정의하고 사용자에게 이를 방어하고 관리할 수 있는 방법에 대해 조언합니다. AML 기법은 머신 러닝(ML) 시스템이 어떻게 작동하는지에 대한 정보를 추출하여 시스템 조작 방법을 알아냅니다. 이 정보는 보안, 안전 장치를 우회하고, 악용할 수 있는 경로를 열기 위해 AI와 대규모 언어 모델(LLM)을 공격하는 데 사용됩니다.

The National Institute of Standards and Technology (NIST) closely observes the AI lifecycle, and for good reason. As AI proliferates, so does the discovery and exploitation of AI cybersecurity vulnerabilities. Prompt injection is one such vulnerability that specifically attacks generative AI.

In Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations, NIST defines various adversarial machine learning (AML) tactics and cyberattacks, like prompt injection, and advises users on how to mitigate and manage them. AML tactics extract information about how machine learning (ML) systems behave to discover how they can be manipulated. That information is used to attack AI and its large language models (LLMs) to circumvent security, bypass safeguards and open paths to exploit.

프롬프트 인젝션이란?

What is prompt injection?

NIST는 프롬프트 주입 공격을 직접적 및 간접적 두가지 유형으로 정의합니다. 직접 프롬프트 인젝션을 사용하면 사용자가 LLM이 의도하지 않거나 승인되지 않은 작업을 수행하도록 하는 텍스트 프롬프트를 입력합니다. 간접 프롬프트 인젝션은 공격자가 LLM이 가져오는 데이터를 오염시키거나 성능 저하를 일으키는 경우입니다.

가장 잘 알려진 직접 프롬프트 인젝션 방법 중 하나는 ChatGPT에 사용되는 프롬프트 인젝션인 DAN(Do Anything Now)입니다. DAN은 롤플레이를 사용하여 조정 필터를 우회합니다. 첫 번째 반복에서 프롬프트는 ChatGPT에 이제 DAN이라고 지시했습니다. DAN은 악의적인 사람이 폭발물을 만들고 폭발시키는 것을 도와주는 등의 무엇이든 할 수 있습니다. 이 기법은 역할극 시나리오를 따라 범죄나 유해한 정보를 제공하지 못하도록 하는 필터를 우회했습니다. ChatGPT의 개발사인 OpenAI는 이 수법을 추적하고 모델을 업데이트하여 활용되지 못하도록 있지만, 사용자들은 계속해서 필터를 우회하고 있으며 이 수법은 (최소한) DAN 12.0까지 진화했습니다.

NIST defines two prompt injection attack types: direct and indirect. With direct prompt injection, a user enters a text prompt that causes the LLM to perform unintended or unauthorized actions. An indirect prompt injection is when an attacker poisons or degrades the data that an LLM draws from.

One of the best-known direct prompt injection methods is DAN, Do Anything Now, a prompt injection used against ChatGPT. DAN uses roleplay to circumvent moderation filters. In its first iteration, prompts instructed ChatGPT that it was now DAN. DAN could do anything it wanted and should pretend, for example, to help a nefarious person create and detonate explosives. This tactic evaded the filters that prevented it from providing criminal or harmful information by following a roleplay scenario. OpenAI, the developers of ChatGPT, track this tactic and update the model to prevent its use, but users keep circumventing filters to the point that the method has evolved to (at least) DAN 12.0.

NIST에 따르면 간접 프롬프트 인젝션은 공격자가 PDF, 문서, 웹 페이지 또는 가짜 음성을 생성하는 데 사용되는 오디오 파일과 같이 생성형 AI 모델이 수집할 소스를 제공할 수 있어야 합니다. 간접 프롬프트 인젝션은 이러한 공격을 찾아서 수정할 수 있는 간단한 방법이 없는 생성형 AI의 가장 큰 보안 결함으로 널리 알려져 있습니다. 이 프롬프트 유형의 예는 광범위하고 다양합니다. 터무니 없는 것(챗봇이 '해적 대화'를 사용하여 응답하도록 유도하는 것)부터 피해를 주는 것(사회적으로 조작된 채팅을 사용하여 사용자가 신용카드 및 기타 개인 데이터를 공개하도록 유도하는 것) 광범위한 것(AI 어시스턴트를 하이재킹하여 전체 연락처 목록에 스캠 메일을 보내는 것)까지 다양합니다.

Indirect prompt injection, as NIST notes, depends on an attacker being able to provide sources that a generative AI model would ingest, like a PDF, document, web page or even audio files used to generate fake voices. Indirect prompt injection is widely believed to be generative AI’s greatest security flaw, without simple ways to find and fix these attacks. Examples of this prompt type are wide and varied. They range from absurd (getting a chatbot to respond using “pirate talk”) to damaging (using socially engineered chat to convince a user to reveal credit card and other personal data) to wide-ranging (hijacking AI assistants to send scam emails to your entire contact list).

Explore AI cybersecurity solutions

프롬프트 인젝션 공격을 막으려면?
How to stop prompt injection attacks

이러한 공격은 잘 알아차릴 수 없기 때문에 효과적으로 막기가 어렵습니다. 직접 프롬프트 인젝션으로부터 어떻게 보호할 수 있을까요? NIST에서 언급했듯이, 이를 완전히 막을 수는 없지만 방어 전략을 통해 어느 정도 예방할 수 있습니다. NIST는 모델 제작자에게 훈련 데이터 세트를 신중하게 큐레이션할 것을 제안합니다. 또한 어떤 유형의 입력이 프롬프트 인젝션 시도를 나타내는지에 대해 모델을 훈련하고 적대적인 프롬프트를 식별하는 방법에 대해 훈련할 것을 제안합니다.

간접 프롬프트 주입의 경우, NIST는 인간의 개입을 통해 모델을 미세 조정하는 것을 제안합니다.이것을 인간 피드백 데이터를 통한 학습(RLHF)이라고 합니다. RLHF는 모델이 원치 않는 행동을 방지하는 인간의 가치와 더 잘 일치하도록 돕습니다. 또 다른 제안은 검색된 입력에서 명령을 필터링하여 외부 소스에서 원치 않는 명령이 실행되는 것을 방지하는 것입니다. NIST는 또한 검색된 소스에 의존하지 않고 실행하는 공격을 탐지하는 데 도움이 되는 LLM 모더레이터를 사용할 것을 제안합니다. 마지막으로 NIST는 해석 가능성 기반 솔루션을 제안합니다. 즉, 비정상적인 입력을 인식하는 모델의 예측 궤적을 사용하여 비정상적인 입력을 탐지한 후 차단 할 수 있습니다.

생성형 AI와 그 취약점을 악용하려는 사람들은 계속해서 사이버 보안 환경을 변화시킬 것입니다. 그러나 하지만 그 동일한 변화의 힘이 솔루션을 제공할 수도 있습니다.. IBM Security가 보안 방어를 강화하는 AI 사이버 보안 솔루션을 제공하는 방법에 대해 자세히 알아보세요

These attacks tend to be well hidden, which makes them both effective and hard to stop. How do you protect against direct prompt injection? As NIST notes, you can’t stop them completely, but defensive strategies add some measure of protection. For model creators, NIST suggests ensuring training datasets are carefully curated. They also suggest training the model on what types of inputs signal a prompt injection attempt and training on how to identify adversarial prompts.

For indirect prompt injection, NIST suggests human involvement to fine-tune models, known as reinforcement learning from human feedback (RLHF). RLHF helps models align better with human values that prevent unwanted behaviors. Another suggestion is to filter out instructions from retrieved inputs, which can prevent executing unwanted instructions from outside sources. NIST further suggests using LLM moderators to help detect attacks that don’t rely on retrieved sources to execute. Finally, NIST proposes interpretability-based solutions. That means that the prediction trajectory of the model that recognizes anomalous inputs can be used to detect and then stop anomalous inputs.

Generative AI and those who wish to exploit its vulnerabilities will continue to alter the cybersecurity landscape. But that same transformative power can also deliver solutions. Learn more about how IBM Security delivers AI cybersecurity solutions that strengthen security defenses.

https://securityintelligence.com/articles/ai-prompt-injection-nist-report/?utm_medium=OSocial&utm_source=Linkedin&utm_content=RSRWW&utm_id=IBMSecurityLIPostInjectionAttacks20240502&sf188168829=1

0 comments

6 views

Permalink

https://community.ibm.com/community/user/blogs/kyoyoung-choi2/2024/06/03/ai-nist

IBM TechXchange Korean CyberSecurity User Group (한국 사이버보안 사용자 그룹)

IBM TechXchange Korean CyberSecurity User Group (한국 사이버보안 사용자 그룹)

프롬프트 인젝션으로 AI를 해킹하는 방법: NIST 보고서

By Kyoyoung Choi posted Mon June 03, 2024 09:42 PM

Permalink

Additional
Resources

Office

Quick Links

IBM TechXchange Korean CyberSecurity User Group (한국 사이버보안 사용자 그룹)

IBM TechXchange Korean CyberSecurity User Group (한국 사이버보안 사용자 그룹)

프롬프트 인젝션으로 AI를 해킹하는 방법: NIST 보고서

By Kyoyoung Choi posted Mon June 03, 2024 09:42 PM

Permalink

Additional Resources

Office

Quick Links

Additional
Resources