
User pain points

A common ask from Snipping Tool users has been the ability to extract text from images. System-initiated user feedback: "Add OCR to the snipping tool" (152 upvotes); Feedback Hub feature request: "Support for OCR (optical character recognition)" (20 upvotes).

The framing and research

In addition to users requesting this feature through UIFs, a user research survey (August 2022) explored users' open-ended answers about what types of content they would like to capture. 'Items with data or text' was the top use case for all six participants in the study. Competitive analysis also showed inline OCR becoming table stakes for modern screenshot apps. I collected data from the r/windows11 subreddit to further validate the user pain points we had discovered. See the image below.


Definition of Success

With the addition of text extraction in annotation mode, it will be important to track engagement with the new feature and monitor for an increase in users entering annotation mode after taking a snip. Discoverability: 8% of users who open annotation mode discover the new text extraction functionality. Engagement: increase captures (Screenshots/MAD) by 1%.

Design explorations

Competitive analysis mostly showed variations in UI treatment of the text after an image is OCR-ed and in users' micro-interactions with the extracted text. One thing I had to be cognizant of was laying down an OCR framework scalable to other valuable features such as text translation, text redaction, and text-to-speech. When it came to using a smoke overlay as the chosen UI treatment, we had accessibility concerns and sought guidance from the Accessibility team.

The delivery

Rounds of internal and LT reviews were conducted to gather feedback. Usability testing was also conducted by the UX Research team through UserTesting.com. After rounds of iteration, external leads (outside the key stakeholders) needed to sign off on the design. Accessibility, Motion Design, Content Design, and Icon Design partnered with us cross-functionally to make sure the specs fulfilled all of their respective requirements.

The accessibility documentation

These specs show how users with disabilities would use keyboarding and a screen reader to navigate through the experience. We designed against the Accessibility team's guidelines.

The motion piece

The key stakeholders aligned that this feature should be branded as an AI feature, meaning it should have a delighter. The delighter takes the form of a UI animation when users interact with the OCR tool and a polished loading state while images are being OCR-ed. This is the documentation on how to build the animation and loading state.

The full-spec document

The full-spec doc contains the summary, rationale, definition of success, scoping, feature requirements, feature details (where lo-fi/hi-fi mockups and prototypes are attached), the first-run experience scenario, telemetry work, open questions, resources, and the approval checklist. Copywriting by the PM; mockups/prototypes by the designer.

The implementation

The implementation ran smoothly for the most part. There was a memorable moment when the back-end developer asked for design guidance on how text order should work when the OCR-ed text sits in a complex layout (such as a homepage; read below for more details).

The output of the OCR engine

The back-end dev showed us the output of the OCR engine (see the picture to the right) and explained the technical feasibility of "crafting" that output. He asked for our guidance on what users expect to do with the crafted output. In other words, if we can compile letters into words, words into sentences, and sentences into paragraphs, and those paragraphs are scattered across different locations in a complex layout, how do we make sure that related paragraphs are grouped together?

In a complex layout, how do we know the direction of the text order?

In a complex layout (like the picture to the right), when we highlight a paragraph and want to continue to the next one, how do we decide which paragraph comes next based on location alone, given that we have no semantic understanding of the paragraphs?

The framework of the text order

We partnered with dev to figure out a framework that makes sense for average users, even if it is not perfect. We borrowed from the Bootstrap grid system (which is ubiquitous in webpage layouts) to shape the direction of columns and rows. The layout gets sliced into rows and columns (and those into rows and columns again, and so on). Rows read from top to bottom; columns read from left to right. The simplified diagram to the right illustrates the framework.
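The recursive row-then-column slicing described above resembles the classic XY-cut approach to document layout analysis. Below is a minimal, hypothetical sketch of that idea (not the shipped Snipping Tool implementation): each paragraph is a bounding box, and the order is produced by splitting on whitespace gaps, alternating between the row and column axes.

```python
# Hypothetical sketch of the row/column (XY-cut style) reading-order framework.
# Each paragraph is a bounding box tuple: (left, top, right, bottom).

def split_on_gaps(boxes, axis):
    """Group boxes into bands along an axis (0 = x/columns, 1 = y/rows).

    Boxes stay in the same band while their projections onto the axis
    overlap; a whitespace gap in the projection starts a new band.
    """
    lo, hi = (0, 2) if axis == 0 else (1, 3)
    ordered = sorted(boxes, key=lambda b: b[lo])
    bands, current, current_end = [], [], None
    for b in ordered:
        if current and b[lo] >= current_end:  # gap found: close the band
            bands.append(current)
            current, current_end = [], None
        current.append(b)
        current_end = b[hi] if current_end is None else max(current_end, b[hi])
    if current:
        bands.append(current)
    return bands

def reading_order(boxes, axis=1):
    """Recursively slice into rows (top-to-bottom), then columns (left-to-right)."""
    if len(boxes) <= 1:
        return list(boxes)
    bands = split_on_gaps(boxes, axis)
    if len(bands) == 1:  # no gap on this axis; try the other one
        sub = split_on_gaps(boxes, 1 - axis)
        if len(sub) == 1:  # indivisible block: fall back to position sort
            return sorted(boxes, key=lambda b: (b[1], b[0]))
        return [b for band in sub for b in reading_order(band, axis)]
    return [b for band in bands for b in reading_order(band, 1 - axis)]
```

For a 2x2 grid of paragraphs this yields top-left, top-right, bottom-left, bottom-right, which matches the "rows top to bottom, columns left to right" rule. It is deliberately a good-enough heuristic for average users rather than a perfect one, mirroring the trade-off we accepted.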

The customer insights

Users voiced positive reviews of this feature on various social media (see the picture to the right). Additionally, telemetry showed 2M usage sessions across Monthly Active Devices (as of Feb 2025), about 40% higher than our projected success target.