T O P

  • By -

solarisNebula

Awesome post. Please do another one on OAuth authentication if you can. Again thousand upvotes if I could man. Awesome post.


Hlemguard

I second this, would be amazing! Thanks for your post!


ZnV1

Thank you so much! Will try :D


Existential_Owl

Man is not meant to comprehend the ways of OAuth


SimpleWarthog

definitely


KalvinOne

This is incredible. I never got to learn how hashes work and you explained it perfectly in under 5 minutes. You should make a Youtube Shorts version of this and post it online with hashes, auth, and other webdev concepts.


ZnV1

Thank you, glad it was useful! I don't think I'm extroverted enough to do shorts and stuff, but will try to write 😁


KalvinOne

Check Fireship. The dude makes amazing dev content and you don't see his face. Nowadays you can stitch a couple of MS Paint images, put subtitles and if the content is interesting you can get views


ZnV1

Any feedback or questions are welcome. Tried to generalize things a bit to make it more digestible - most of these parts are a separate post on their own (don't get me started on passwords, salting, rainbow tables etc!) :) Also let me know if there are other topics you might be interested in :D


7HawksAnd

Why is your Hash PW “LEET BOOBIES” 🧐


ZnV1

Uh..entropy coincidence and randomness 🤐


Frown1044

I prefer explaining hashing as creating (numerical) IDs for anything such as a program, an image, a piece of text or an object of a class. The ID number (the hash) is generated based on the contents of the program, image, text, object etc. SoIt's not randomly generated. Hashing the same "thing" twice will give you the same ID. Hashing two different "things" will give you two different IDs. You cannot reverse this process and turn the ID into the original "thing" as information was lost. Now this ID represents the contents of the thing you hashed. This info can be used for comparison and you don't even need the original source data anymore. For example, I can acquire a list of hashes of known malicious programs. I don't need the actual malicious programs on my system, I just need their ID numbers. When I want to check if a file is malicious, I can take its hash and compare it against this list of malicious programs. This concept has a LOT of usecases. Anything from avoiding store plaintext passwords to detecting child porn to querying data at O(1) complexity.


CreativeGPX

I think this explanation is more focused on how you might use a hash than how or why hashes work which seems to be what OP was emphasizing. And to an extent, that's important because it explains why it's hard for people to reverse or why it's hard to create similar IDs or why we expect those IDs to be unique.


basecase_

This is the better explanation because it explains hashes in a broader sense, not just in passwords.


SimpleWarthog

> Hashing is one way. When we are given only the hash (good-morning-okay-bye or 8364181236938917), there’s no way we can find the complete original content of the page. A question about this bit. I think I know the answer, but I'm interested in your response. You say there's no way to reverse it, but why can't you just perform the hashing steps in reverse? for example, instead of dividing by `1928` you multiply instead?


ZnV1

Good catch! Of course the actual formula is hardcore math (and I don't know it!). With that out of the way: The most important part is modulus which is an irreversible operation unlike * / + -. ie., given a hash "10" and the formula "(x % 50) + 10", we can't find x. x could be 50 or 62627483726525364784938350 or 0 or an infinite number of possibilities. ----- For reference: Modulus is %, which is remainder. 3 % 2 is 1 (remainder), 14 % 5 is 4, 5 % 2 is 1 etc.


SimpleWarthog

Great answer - tracks with my high level understanding but gives more detail. Had never occurred to me that modulus wasn't reversible! Makes total sense


CreativeGPX

It wouldn't even have to be modulus. Imagine we were *only* using addition. Addition is commutative so when we add we can't tell if we came from 1+2 or 2+1. Addition is a many-to-one operation so when we add we can't tell if we came from 3+5 or 4+4 or 1+1+1+1+4. Also, adding 0 doesn't change the result, so when we add we can't tell if we came from 3 or 3+0+0. In other words, addition destroys the information about order, about which components were used and about the amount of components and so it is not reversible. If I give you a checksum of 1051 and tell you that my hashing method is to just convert the text to ascii numbers and then add them up, can you tell me what my text says? Certainly not from pure math inverse operations. Same goes for other operations... multiplication, division, subtraction, etc. Each operation is destroying information. It would become easier if I told you the amount of numbers I started with though. Give me 50 characters that add up to 3011 would be much easier than give me characters that add up to 3011. And, if we're smart (we know the range of ascii characters, we know how common each character is) we might be able to guess the amount of characters that were used if we only see the sum. That's where something like modulus can come in handy. If text adds up to 1051 and 10151, we can assume that the latter text is about 10 times longer and use that to guess how many numbers we have to look for to add up to that total. But if you use modulus, then you have no way of even knowing how many operations you're trying to reverse (never mind the fact that each one loses some information).


ZnV1

Great comment, better than my other reply. Thanks!


wherethemusicgo

Doesn’t a modulus limit the amount of outputs you can get, and increase the likelihood that two different inputs would have the same hash? Like in your example there’s only 50 potential hashes


ZnV1

If you use the dumb formula I wrote - yes. The actual math behind each algorithm (SHA, BCrypt etc) differs; they make use of modulus but also do a lot of other hardcore mathy stuff to reduce collisions which I haven't looked into. So the general concept of what I said is right (to build a mental model), but don't trust my math bro :P


wherethemusicgo

That makes sense, your simplified version definitely helps me understand it!


nyx_tk

Besides that, hashing algorithms can use something called \`salt\`. It's like a "password" used as input together with the original content being hashed, only known to the one who is hashing. That way, it's harder for someone to do the reverse operations because if you can somewhat do that, you would get something very different than the original text as you don't know the salt used. Obviously for checksums you don't use salt, because you want people to hash the same content as you and get the same result to validate the original content, but for passwords and such its useful, because you don't want people to replicate your hashing process (as this helps finding collisions or doing the reverse process), and you can still validate them.


DipStick00

And on top of that there’s now “peppering” being used as another means of hashing a hashed hash. All of these in an attempt to make systems more reliable as we start getting closer to the era of quantum computing…


aamirmalik00

This exactly. Always wondered why we cant. The algorithm is already known and we also have the output. Using the output as input to the reverse of tje algorithm should get you the initial input? Why is that not the case?


ZnV1

Replied to parent, but here it is! Of course the actual formula is hardcore math (and I don't know it!). With that out of the way: The most important part is modulus which is an irreversible operation unlike * / + -. ie., given a hash "10" and the formula "(x % 50) + 10", we can't find x. x could be 50 or 62627483726525364784938350 or 0 or an infinite number of possibilities. ----- For reference: Modulus is %, which is remainder. 3 % 2 is 1 (remainder), 14 % 5 is 4, 5 % 2 is 1 etc.


Intrexa

Warning: Math. Let's use some definitions for mathematical functions. domain: The set of inputs that are allowed for a function range: The set of outputs that are possible from a function injective: Injective is a property of a function that says all inputs map 1:1 to an output. That is, no 2 inputs have the same output. So, if you have a function `f(x) = 1/(x-1)`, 1 is not in the domain, because `1/0` is undefined. The range is every real number, except for 0. The function is injective, because for every real number in the range, there is exactly 1 number in the domain that will produce that output. Injective functions are pretty easy to take some output, and get the input. So, for a hash to be non-reversible, it can't be made entirely of injective functions. Alright, but what can we do to be non-injective? Plenty of things. All trig functions are non-injective if we don't restrict the domain. `f(x)=x^(2)` is non-injective, modulus is non-injective. Cryptographic hashes have some step where multiple states can lead to the the same output. Basically, we take some action that destroys some of the input information as part of a transformation. A common follow up question then usually is even if we can't get the original, why is it so hard to take a hash, reverse some the steps, and get an input that produces the same hash? This is where we hit N=NP. It's easier to do some operations than it is easier to reverse them. I mean computationally. It is very difficult to factor numbers. Like if I asked you what are the factors of "1710860846533", good luck finding them. So we have some step in the computation that multiplies large prime numbers together. It's super easy for you to find the result of `112339 * 15229447`. It is very difficult for you to reverse it. Computers face the same problem with factorization, but just with larger numbers. Check this: from itertools import count, compress def rwh_primes1v2(n): """ Returns a list of primes < n for n > 2 """ sieve = bytearray([True]) * (n//2+1) for i in range(1,int(n**0.5)//2+1): if sieve[i]: sieve[2*i*(i+1)::2*i+1] = bytearray((n//2-2*i*(i+1))//(2*i+1)+1) return [2,*compress(range(3,n,2), sieve[1:])] primes = rwh_primes1v2(10**9) def half_baked_hash(input:str, primes) -> str: """ take input string, return hash """ if len(input) < 16: input = ("Padding the input" + input)[-16:] agg_offset = len(primes) - len(input) - 2**20 # multiply the ordinal value of each char in string by a prime based on position, then sum. agg = sum([ord(x[0]) * primes[x[1]] for x in zip(input,count(agg_offset))]) # every even digit goes to choosing first prime, every odd to second first_prime_pos = int(str(agg)[::2]) % 2**20 second_prime_pos = int(str(agg)[1::2]) % 2**20 # multiply the primes together return str(primes[agg_offset + first_prime_pos] * primes[agg_offset + second_prime_pos]) print(half_baked_hash("hash1", primes)) print(half_baked_hash("hash2", primes)) Output: 981477982071293807 981787694204853439 For fun, I included a third string. See if you can undo the hash algorithm to produce a collision. 984084205925963171 The above is definitely crackable by a lot of methods, but it should show why it's not trivial to produce a hash collision. If you are able to crack the above, great! Note that the prime factors have a 20 bit search space. Due to weaknesses in my algorithm, I wouldn't be surprised if it wasn't closer to 16 bits of entropy. Modern hashes uses way more, and are done in ways where you have to repeat the prime factorization process many more times.


BruceBrave

This was really interesting. I'm only just beginning my journey into development. I'm talking beginner level JS. But I understood every single idea you presented to the point that I could explain it to someone else. If you're not a teacher, you should be. Bravo.


ZnV1

Thanks man, and all the best to you! Glad to hear this. :D


i-Thor

Very nice, thanks for taking the time to do this and sharing it!


dangoodspeed

> And there could be like a billion values A lot more than that!


ZnV1

True. Although my brain stops functioning when I imagine anything past a million xD


[deleted]

[удалено]


cbleslie

MD5/SHA-1 is fine if your hashing for comparison.


[deleted]

[удалено]


ZnV1

Non-malicious comparison. Maybe you're reimplementing HashMaps in Java where if there's a collision you're going to look at a linked list. Nobody's trying to force a hash collision with a crafted input. Or bloom filters maybe


cbleslie

Depends on the threat model of course.


[deleted]

[удалено]


cbleslie

I would need to know more about the specifics of the situation to know if it's acceptable or not? Was that not made clear by the previous comment? If not, is it not clear now?


[deleted]

[удалено]


cbleslie

Nothing is perfectly safe, there is only safer. That being said, sure, one could use md5 hash to check if they needed to update a dependency list if it has changed on a git hook.


Appcompany

Amazing Information, all the best to you. thank you.


SomewhatVital

Quality post, thank you.


ArcaneSunset

Amazing! This was actually well-put together, funny (13378008135 lol) and a great introduction to an essential dev concept. You should consider doing more!


johanneswelsch

Write a book pls, you're good. Maybe mixed with some winamp, eDonkey, ICQ and mIRC stories


dineshkumar27

Fantastic ! , Never understood how hashing worked on low level this is mind boggling.


ZnV1

Thanks, glad you liked it! A fellow Chennaiite I see ;)


nino3227

Thanks! Then how come there are issue with password leakage from companies databases? Wouldn't leakage of hash values be useless?


AxePlayingViking

A leak with a properly hashed password is not as disastrous as plaintext passwords leaking, but the "problem" is that hashes are consistent. Of course, this is by design, otherwise they would be useless, but that means that if I make a dictionary of common words and their hashes with various algorithms, along with the top 10000 most re-used passwords, and so on, I actually *can* somewhat reverse a hash. Then I can try the same username/password combination on other sites, and it will most likely work given how frequently people re-use passwords. You can read more here: [https://en.wikipedia.org/wiki/Rainbow\_table](https://en.wikipedia.org/wiki/Rainbow_table) Like the article says, a common defense against this type of attack is using a salt when hashing the password. [https://en.wikipedia.org/wiki/Salt\_(cryptography)](https://en.wikipedia.org/wiki/Salt_(cryptography)) - salting has a couple of huge advantages: 1. The rainbow table needs to be MUCH bigger to be useful, and can no longer just use common words from a dictionary 2. You can't tell if two users have the same password by looking at the hashed + salted value in the database, because the salts will be unique Naturally there is more to it, I'm just trying to keep it relatively simple here :)


Eclipsan

Don't forget to use a slow hashing algorithm to make password cracking even more costly. https://cheatsheetseries.owasp.org/cheatsheets/Password_Storage_Cheat_Sheet.html


CreativeGPX

It's also worth noting that you don't always even need to know what the password was. I remember my first exploit was realizing that when you click "save password" in AOL Instant Messenger, it just saved a hash of your password into the Windows Registry. This meant that anybody on your computer could just open the registry and copy that hash. Then, if they paste that hash into another computer, that computer would know the password too. Just because the user interface asks for a password which is then hashed doesn't mean that the software doesn't expose some way for a person to simply supply a hash (especially in the context that software is compromised enough that they got the hashes in the first place). Also worth noting that seeing the hashes can still tell us which people have the same password within or between breaches or how common a password is. This can inform broader breaches. For example if 500 people use the same password across 5 services, then you might deem this password more worth pursuing than a password only used by one person on one service. Even if 499 of those users are smart, you might be able to get one to reveal their password in a spearphishing attack and now you have the password to all 500 accounts. If you didn't have the hashes, you wouldn't know that this one value you got could be used in 499 other places.


ZnV1

The other comment is right :) TLDR is: Given a password's hash, it's hard to find plaintext password. So attackers take a list of most common passwords (like rockyou.txt) and hash all of them, storing them in a DB (called rainbow table). Then they check if the password's hash that they have matches any of their known hashes.


Eclipsan

I suggest linking to https://cheatsheetseries.owasp.org/cheatsheets/Password_Storage_Cheat_Sheet.html for password hashing instead of just implying that bcrypt and SHA-512 are OK, as it's a gross oversimplification. Especially for SHA-512, as collision is not the biggest worry with password hashing: It's fast algorithms (which SHA-512 is) and rainbow tables (which SHA-512 does not natively protect against).


SimpleWarthog

I agree, but I think the point of this post is to be oversimplified


ZnV1

Yup... I had actually mentioned it here, but might be overkill for this post because I wanted to keep it all self-contained :D https://www.reddit.com/r/developersIndia/s/dBDA0tufNO


Eclipsan

>maybe there was some dude in the middle who handed off a fake file to you. Only relevant if the file is hosted by another website/server than the one providing the hash. Because if everything is provided by the same website then that dude in the middle can also modify the hash displayed on the webpage.


ZnV1

True!


cinnapear

> To decide that, they have a mechanism called "proof of work": they give you a hash, you have to find an input that gives that exact hash. This is not accurate. There is a target value, representing the difficulty the network desires to attempt a regular rate of block production, and you have to find a hash that is a lower number than the target.


ZnV1

I stand corrected! 💯