I'll start with the second question. Yes, the UV coordinates of the 3 vertices that form a polygon on the model form a 2D triangle on the texture. Each pixel of the triangle that ends up rendered to the screen corresponds to some point within that triangle on the texture. Because neighboring polygons typically have neighboring triangles on the texture, the process of mapping polygons to texture triangles is often referred to as unwrapping: it really does look like the mesh of the model has been unwrapped and flattened onto the texture. On to the first question.
In principle, especially if you are writing your own shader code, you can transform coordinates any way you like. Canonically, however, there are 3 transforms that the coordinate from a model vertex passes through before it becomes a 2D coordinate on the screen.
World transform: Takes the vertex coordinate from the model's coordinate space to world space.
View transform: Takes the coordinate from world space to view space. (Makes it relative to the camera's coordinates, in other words.)
Projection transform: Projects the coordinate into screen coordinates.
All of these are actually performed as 4-dimensional transforms. The vertex starts out as a 4D vector: r = (x, y, z, 1). It is then multiplied by the 3 transform matrices:
r' = r * W * V * P.
The final vector r' has the form (x', y', depth*distance, distance). As a last step, the rendering hardware divides it through by the last component, to give you (x'/distance, y'/distance, depth, 1). The first two components are the actual screen coordinates. Depth is a value between 0 and 1 that will be used for the depth test. For each pixel within a triangle, the depth value is interpolated from the depths of the 3 vertices, and if the depth test is enabled, which it usually is, only pixels that are closer to the camera than those already rendered at the same coordinate will be drawn.
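To make that chain concrete, here is a rough pure-Python sketch of it (the helper names are mine, nothing standard): a row vector is multiplied through the three matrices in order, then divided by its last component.

```python
def vec_mat_mul(v, m):
    """Multiply a row vector v (length 4) by a 4x4 row-major matrix m."""
    return [sum(v[i] * m[i][j] for i in range(4)) for j in range(4)]

def project_vertex(x, y, z, W, V, P):
    """Take a model-space vertex all the way to screen space."""
    r = [x, y, z, 1.0]            # homogeneous model-space position
    for m in (W, V, P):           # r' = r * W * V * P
        r = vec_mat_mul(r, m)
    w = r[3]                      # the 'distance' component
    return [r[0] / w, r[1] / w, r[2] / w]   # divide by last component

IDENTITY = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
```

With all three matrices set to identity the vertex passes through unchanged, which is a handy sanity check when wiring this up.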
OK, so I probably should explain a bit more about the transform matrices. Let's start with the world transform. It is a 4x4 matrix with the following structure:
|Rxx Rxy Rxz 0|
|Ryx Ryy Ryz 0|
|Rzx Rzy Rzz 0|
|Tx  Ty  Tz  1|
The R components describe a rotation around the model's origin. You can think of them as a separate 3x3 matrix R. (Rxx, Rxy, Rxz) is a unit vector which describes the orientation of the model's X axis in world coordinates. Same for the next two rows for the Y and Z axes respectively. Going the other way, (Rxx, Ryx, Rzx) is also a unit vector, which you can use for the inverse transform. These are properties of an orthogonal matrix. That means that if you transpose R, which I'll mark as R', you invert the rotation: if r' = r*R, then r = r'*R'.
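Here is a quick sketch of that transpose property, using a rotation about the Z axis as the example (helper names are made up, pure Python):

```python
import math

def rot_z(angle):
    """3x3 rotation about the Z axis (row-vector convention: v' = v * R)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[  c,   s, 0.0],
            [ -s,   c, 0.0],
            [0.0, 0.0, 1.0]]

def transpose(m):
    return [[m[j][i] for j in range(3)] for i in range(3)]

def vec_mat3(v, m):
    return [sum(v[i] * m[i][j] for i in range(3)) for j in range(3)]

R = rot_z(math.radians(30.0))
v = [1.0, 2.0, 3.0]
v_world = vec_mat3(v, R)                   # rotate into world space
v_back  = vec_mat3(v_world, transpose(R))  # undo it with the transpose
```

Up to floating-point noise, `v_back` comes out equal to `v` again; no matrix inversion routine needed.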
This leaves the (Tx, Ty, Tz) component. If you are familiar with matrix multiplication, you'll notice that (x, y, z, 1) * W = (x, y, z) * R + (Tx, Ty, Tz), with the last component remaining 1. So (Tx, Ty, Tz) is the translation component. In fact, these are the coordinates of the model's origin in world coordinates.
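Here is how you might assemble such a world matrix and check the identity above (illustrative names, pure Python). The rotation goes in the upper 3x3 block, the model's position in the bottom row:

```python
import math

def make_world(R, T):
    """Assemble a 4x4 world matrix from a 3x3 rotation R and translation T."""
    return [[R[0][0], R[0][1], R[0][2], 0.0],
            [R[1][0], R[1][1], R[1][2], 0.0],
            [R[2][0], R[2][1], R[2][2], 0.0],
            [T[0],    T[1],    T[2],    1.0]]

def rot_z(angle):
    """Rotation about Z, row-vector convention."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]]

def vec_mat4(v, m):
    return [sum(v[i] * m[i][j] for i in range(4)) for j in range(4)]

R = rot_z(math.radians(90.0))
T = [10.0, 0.0, 0.0]            # model origin sits at (10, 0, 0) in the world
W = make_world(R, T)

# The model-space point (1, 0, 0) rotates onto +Y, then translates by T:
p = vec_mat4([1.0, 0.0, 0.0, 1.0], W)
```

The result is (10, 1, 0, 1): first rotated by R, then shifted by the translation row, exactly as the identity says.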
The view matrix has an identical structure. The difference is that the rotation describes the rotation of the world's axes relative to the camera, and the translation part is the world's origin in the camera's coordinates. The typical situation is that the camera is described as an entity in the world, so it will have the same kind of world transform matrix associated with it as any model would. In that case, you can easily get the view transform from the camera's world transform by taking its inverse. The view transform matrix will then have the following components:
|Rxx Ryx Rzx 0|
|Rxy Ryy Rzy 0|
|Rxz Ryz Rzz 0|
|Tx' Ty' Tz' 1|
Notice that the R component was transposed. The translation has to be in the camera's coordinates, so you get it from the original translation like so:
(Tx', Ty', Tz') = (-Tx, -Ty, -Tz) * R'
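If you want to see that inversion in code, here is one possible sketch (again, the names are mine): the rotation block is transposed and the translation becomes (-Tx, -Ty, -Tz) times the transposed rotation. Multiplying the camera's world matrix by the resulting view matrix should give back the identity.

```python
import math

def mat_mul4(a, b):
    """Multiply two 4x4 row-major matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def invert_rigid(W):
    """Invert a rigid 4x4 transform (rotation + translation, row-major)."""
    R_t = [[W[j][i] for j in range(3)] for i in range(3)]   # transposed rotation
    T = [-W[3][0], -W[3][1], -W[3][2]]
    # (Tx', Ty', Tz') = (-Tx, -Ty, -Tz) * R'
    T2 = [sum(T[i] * R_t[i][j] for i in range(3)) for j in range(3)]
    V = [row[:] + [0.0] for row in R_t]
    V.append(T2 + [1.0])
    return V

# Camera placed at (5, 2, 0) and rotated 45 degrees about Z:
c, s = math.cos(math.radians(45.0)), math.sin(math.radians(45.0))
camera_world = [[  c,   s, 0.0, 0.0],
                [ -s,   c, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0],
                [5.0, 2.0, 0.0, 1.0]]

view = invert_rigid(camera_world)
check = mat_mul4(camera_world, view)   # should be (close to) identity
```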
Finally, the projection matrix. I'm not going to go into too much detail, just mention what it does. The projection matrix has the following structure:
|2*Zn/Vx 0       0     0|
|0       2*Zn/Vy 0     0|
|0       0       Q     1|
|0       0       -Zn*Q 0|
Where Q = Zf/(Zf-Zn). Zn and Zf are the distances to the near and far planes respectively. Anything on the near plane will be rendered with depth 0; anything closer will not be rendered. Anything on the far plane will have a depth of 1; anything further will not be rendered. Vx and Vy are the width and height of your viewport at the near plane. Together, these describe the view frustum. It's like a pyramid with its top sliced off by the near plane: the far plane is the base of the pyramid, and the camera sits where the apex would have been. Only things inside the frustum will end up on the screen.
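Here is a sketch of building that matrix and checking the depth behavior at the two planes (illustrative, pure Python): a point at distance Zn comes out with depth 0 after the divide, and a point at distance Zf comes out with depth 1.

```python
def make_projection(Vx, Vy, Zn, Zf):
    """Build the perspective projection matrix described above (row-major)."""
    Q = Zf / (Zf - Zn)
    return [[2.0 * Zn / Vx, 0.0,           0.0,     0.0],
            [0.0,           2.0 * Zn / Vy, 0.0,     0.0],
            [0.0,           0.0,           Q,       1.0],
            [0.0,           0.0,           -Zn * Q, 0.0]]

def project(v, P):
    """Apply the projection and the divide by the last component."""
    r = [sum(v[i] * P[i][j] for i in range(4)) for j in range(4)]
    w = r[3]
    return [r[0] / w, r[1] / w, r[2] / w]

P = make_projection(2.0, 2.0, 1.0, 100.0)   # Zn = 1, Zf = 100

near_depth = project([0.0, 0.0, 1.0, 1.0], P)[2]    # point on the near plane
far_depth  = project([0.0, 0.0, 100.0, 1.0], P)[2]  # point on the far plane
```

Note also that the last component after multiplying by P is just the view-space z, which is exactly the "distance" that the hardware divides by.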
Both OpenGL and DirectX have built-in functions that can generate the projection matrix for you based on your chosen parameters. If you are rendering using the fixed pipeline, these 3 matrices are set as render states. If you are rendering with a shader, you have the opportunity to pass all 3 as parameters to the shader and perform these transformations within it. However, like I said earlier, if you are writing your own shader, you can do whatever you want. A common thing to do is multiply the view and projection matrices together and pass the product as a single argument to the shader.
Anyway, this is the full process of getting the 3D coordinates of a vertex from a model to its final location on the screen. The final coordinates run between -1 and 1 for both x and y, with (0, 0) in the center of the viewport.