import cv2
import pytesseract
from PIL import Image

# Point pytesseract at the local Tesseract install (Windows path; adjust as needed).
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract.exe'

def process_image(img):
    # Convert to grayscale, then binarize so the text stands out for OCR.
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # thresh = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 49, 52)
    ret, thresh = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY)
    img_new = Image.fromarray(thresh)
    text = pytesseract.image_to_string(img_new, lang='eng')
    # Show the thresholded image; press any key to close the window.
    cv2.imshow('result', thresh)
    cv2.waitKey(0)
    print(text)

img = cv2.imread('C:\\Users\\BW\\Desktop\\PhishEmailWithImageandLink.png')
process_image(img)
This will give you some output. The part you need is the URL:
https://onedrive.live.com/downoad?
cid=46 b98fe6f0d79519&resid=46b98fe
60479519!
1759&authkey=ad8palo26hIn_dm
As you can see, there is a space between 46 and b98 (second line). Also, the third line is wrong.
Apart from that, the link should match the one in your screenshot. With some preprocessing (for example, isolating the rectangle that contains the link, which is not difficult), you should be able to do the job easily.
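To deal with the line breaks and the stray space, you could also post-process the OCR text. This is only a rough sketch: the function name rejoin_wrapped_url is made up for illustration, and it assumes the URL starts at "http" and runs until the first blank line (or the end of the text). It removes whitespace only, so it will not repair characters that Tesseract misread.

```python
import re

def rejoin_wrapped_url(ocr_text):
    """Return the first URL found in ocr_text with all whitespace removed.

    Assumptions (not guaranteed for every email): the URL starts at
    'http://' or 'https://' and ends at the first blank line or at the
    end of the text.
    """
    match = re.search(r'https?://', ocr_text)
    if match is None:
        return None
    tail = ocr_text[match.start():]
    # Keep only the text up to the first blank line.
    block = re.split(r'\n\s*\n', tail, maxsplit=1)[0]
    # A URL cannot contain whitespace, so drop spaces, tabs and newlines.
    return re.sub(r'\s+', '', block)
```

You would call it on the text returned by pytesseract.image_to_string, for example by returning text from process_image instead of only printing it.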
------------------------------
Bruce Wayne
------------------------------
Original Message:
Sent: Sun October 06, 2019 04:56 AM
From: UAEX Exchange
Subject: Extracting Links from Photo in a phishing email
------------------------------
UAEX Exchange
Original Message:
Sent: Mon September 16, 2019 05:49 AM
From: FUser User
Subject: Extracting Links from Photo in a phishing email
Hi
Can you post an example of the photo you are talking about?
Thanks
------------------------------
FUser User
Original Message:
Sent: Wed August 14, 2019 04:42 AM
From: ahmed abushanab
Subject: Extracting Links from Photo in a phishing email
Hello Members,
I need to extract the clickable links from a photo in a phishing email. When I parse an eml attachment of an email, it shows the pictures inside the body, but I need the link itself to be parsed and added as an artifact.